Real-world Data is Dirty: Data Cleansing and The Merge/Purge Problem

…maximum duplicates per selected record. The total size of the database in bytes was approximately 1 MByte. Once read, the database stayed in core during all phases. We ran three independent single-pass runs using different keys and a multi-pass run using the results of the three single-pass runs. The parameters for this experiment were N = 13,751 records and r = 3. For this particular case where w = 10, we have $\approx 6$, $c \approx 1.2 \times 10^{-5}$, $T_{closure_{sp}} = 1.2$ s, and $T_{closure_{mp}} = 7$ s. (Time is specified in seconds (s).) Thus, the multi-pass approach dominates the single-sort approach for these datasets when W > 41.

Figure 4(a) shows the time required to run each independent single-pass run on one processor and the total time required for the multi-pass approach, while Figure 4(b) shows the accuracy of each independent run as well as the accuracy of the multi-pass approach (please note the logarithmic scale).

[Figure 4: Time and Accuracy for a Small Database — (a) time for each single-pass run and the multi-pass run; (b) ideal vs. real accuracy of each run. Series: Key 1 (last name), Key 2 (first name), Key 3 (street addr), and multi-pass with all keys; x-axis: Window Size (W); y-axes: Total Time (s) and Duplicates Found (%).]

For w = 10, Figure 4(a) shows that the multi-pass approach needed 56.3 s to produce an accuracy rate of 93.4% (Figure 4(b)). Looking now at the times for each single-pass run, their total time is close to 56 s for W = 52, slightly higher than estimated with the above model. But the accuracies of all single-pass runs in Figure 4(b) at W = 52 range from 73% to 80%, well below the 93.4% accuracy level of the multi-pass approach. Moreover, no single-pass run reaches an accuracy of more than 93% until W > 7000, at which point (not shown in Figure 4(a)) their execution times are over 4,800 seconds (80 minutes).

Let us now consider the issue when the process is I/O bound rather than a compute-bound main-memory process. Let B be the number of disk blocks used by the input data set and M the number of memory pages available. Each sorted-neighborhood method execution will access $2B \log_{M-1} B$ disk blocks (the 2 comes from the fact that we are counting both read and write operations), plus B disk blocks will be read by the window-scanning phase. The time for the sorted-neighborhood method can be expressed as:

$$T_{snm} = 2 c_{sort} B \log_{M-1} B + c_{wscan} B$$

where $c_{sort}$ represents the CPU cost of sorting the data in one block and $c_{wscan}$ represents the CPU cost of applying the window-scan method to the data in one block.

Instead of sorting, we could divide the data into C buckets (e.g., by hashing the records or using a multi-dimensional partitioning strategy [15]). We call this modification the clustering method. Assuming M = C + 1 (one page for each bucket plus one page for processing an input block), we need one pass over the entire data to partition the records into C buckets (B blocks are read). Writing the records into the buckets requires, approximately, B block writes. Assuming the partition algorithm is perfect, each bucket will use $\lceil B/C \rceil$ blocks. We must then sort each bucket ($2B \log_{M-1} \lceil B/C \rceil$ block accesses) and apply the window-scanning phase to each bucket independently (approximately B block accesses). In total, the clustering method requires approximately $3B + 2B \log_{M-1} \lceil B/C \rceil$ block accesses.
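As a quick sanity check on these block-access counts before turning to the time models, the following minimal Python sketch (not part of the original analysis) evaluates both expressions; the values of B and M mirror the experiment reported below, while C = M − 1 and the helper functions are assumptions introduced only for illustration.

```python
import math

def snm_block_accesses(B, M):
    # 2B log_{M-1} B for the external sort (reads and writes counted),
    # plus B blocks read by the window-scanning phase.
    return 2 * B * math.log(B, M - 1) + B

def clustering_block_accesses(B, M, C):
    # 2B to partition (B reads, ~B writes), 2B log_{M-1}(ceil(B/C)) to sort
    # the buckets, plus B blocks read by the window-scanning phase.
    return 3 * B + 2 * B * math.log(math.ceil(B / C), M - 1)

B, M = 31_250, 33   # data-set parameters reported below
C = M - 1           # assumption: one page per bucket, one for the input block
print(f"sorted-neighborhood: {snm_block_accesses(B, M):,.0f} block accesses")
print(f"clustering:          {clustering_block_accesses(B, M, C):,.0f} block accesses")
```

With C = M − 1 the two counts come out essentially equal, since $\log_{M-1} \lceil B/C \rceil \approx \log_{M-1} B - 1$ offsets the extra 2B partitioning accesses; any advantage of the clustering method must therefore come from the CPU constants in the time model that follows.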
The time for one pass of the clustering method can be expressed as:

$$T_{cluster} = 2 c_{cluster} B + 2 c_{sort} B \log_{M-1} \lceil B/C \rceil + c_{wscan} B$$

where $c_{cluster}$ is the CPU cost of partitioning one block of data. Finally, the I/O cost of the multi-pass approach will be a multiple of the I/O cost of the method we choose for each pass, plus the time needed to compute the transitive-closure step. For instance, if we use the clustering method for three passes, we should expect a time of about $3 T_{cluster} + T_{closure}$.

Figure 5 shows a time comparison between the clustering method and the sorted-neighborhood method. These results were gathered using a generated data set of 468,730 records (B = 31,250, block size = 1,024 bytes, M = 33). Notice that in all cases the clustering method does better than the sorted-neighborhood method. However, the difference in time is not large. This is mainly due to the fact that the equational theory used involved a large number of comparisons, making $c_{wscan}$ a lot larger than both $c_{sort}$ and $c_{cluster}$. Thus, even though there are some time savings in initially partitioning the data, the savings are small compared to the overall time cost.

In [16] we describe parallel variants of the basic techniques (including clustering) to show that, with a modest amount of "cheap" parallel hardware, we can speed up the multi-pass approach to a level comparable to the time needed for a single-pass approach, but with very high accuracy; i.e., a few small windows ultimately win.

[Figure 5: Time comparison of the clustering method and the sorted-neighborhood method — average single-pass time and total multi-pass time for the naive SNM and the clustering SNM; x-axis: Window size (records); y-axis: Time (s).]
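To make this comparison concrete, the sketch below (again, not from the original experiments) plugs hypothetical CPU constants into the two time models. Only B and M follow the reported data set; $c_{sort}$, $c_{cluster}$, $c_{wscan}$, and $T_{closure}$ are invented values chosen solely so that $c_{wscan} \gg c_{sort} > c_{cluster}$, as the text argues.

```python
import math

# Hypothetical constants in arbitrary time units per block; only B and M
# follow the reported experiment, everything else is assumed for illustration.
B, M = 31_250, 33
C = M - 1
c_sort, c_cluster, c_wscan = 1.0, 0.3, 25.0
T_closure = 50.0  # assumed cost of the transitive-closure step

# Time models from the text.
T_snm = 2 * c_sort * B * math.log(B, M - 1) + c_wscan * B
T_cluster = (2 * c_cluster * B
             + 2 * c_sort * B * math.log(math.ceil(B / C), M - 1)
             + c_wscan * B)

print(f"one sorted-neighborhood pass:   {T_snm:,.0f}")
print(f"one clustering pass:            {T_cluster:,.0f}")
print(f"3-pass multi-pass (clustering): {3 * T_cluster + T_closure:,.0f}")
```

Because the $c_{wscan} B$ term dominates both expressions, a clustering pass beats a sorted-neighborhood pass only by roughly $2B(c_{sort} - c_{cluster})$, which is consistent with the small gap between the two methods visible in Figure 5.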