Real-world Data is Dirty: Data Cleansing and The Merge/Purge Problem

…maximum duplicates per selected record. The total size of the database in bytes was approximately 1 MByte. Once read, the database stayed in core during all phases. We ran three independent single-pass runs using different keys and a multi-pass run using the results of the three single-pass runs. The parameters for this experiment were N = 13,751 records and r = 3. For this particular case where w = 10, we have ≈ 6, c ≈ 1.2 × 10^-5, Tclosure-sp = 1.2 s, and Tclosure-mp = 7 s. (Time is specified in seconds (s).) Thus, the multi-pass approach dominates the single-sort approach for these datasets when W > 41. Figure 4(a) shows the time required to run each independent run of the sorted-neighborhood method on one processor, and the total time required for the multi-pass approach, while figure 4(b) shows the accuracy of each independent run as well as the accuracy of the multi-pass approach (note the logarithmic scale).

[Figure 4: Time and Accuracy for a Small Database. (a) Time for each single-pass run and the multi-pass run: Total Time (s) vs. Window Size (W), with curves for Key 1 (last name), Key 2 (first name), Key 3 (street addr), and the multi-pass run with all keys. (b) Ideal vs. real accuracy of each run: Duplicates Found (%) vs. Window Size (W), same four curves.]

For w = 10, figure 4(a) shows that the multi-pass approach needed 56.3 s to produce an accuracy rate of 93.4% (figure 4(b)). Looking now at the times for each single-pass run, their total time is close to 56 s for W = 52, slightly higher than estimated with the above model. But the accuracies of all single-pass runs in figure 4(b) at W = 52 range from 73% to 80%, well below the 93.4% accuracy level of the multi-pass approach. Moreover, no single-pass run reaches an accuracy of more than 93% until W > 7000, at which point (not shown in figure 4(a)) their execution times are over 4,800 seconds (80 minutes).
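The single-pass runs and their multi-pass combination can be sketched as follows. This is a minimal illustration only: the toy records, the three key functions, and the `match` predicate (at least two fields equal) are assumptions for the example, not the paper's equational theory.

```python
def sorted_neighborhood_pass(records, key, window, match):
    """One single-pass run: sort on a key, then slide a fixed-size
    window over the sorted order and compare records inside it."""
    order = sorted(range(len(records)), key=lambda i: key(records[i]))
    pairs = set()
    for pos, i in enumerate(order):
        for j in order[max(0, pos - window + 1):pos]:
            if match(records[i], records[j]):
                pairs.add((min(i, j), max(i, j)))
    return pairs

def transitive_closure(n, pairs):
    """Union-find closure: merge the pairwise matches from all passes
    into equivalence classes of duplicates."""
    parent = list(range(n))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x
    for a, b in pairs:
        parent[find(a)] = find(b)
    clusters = {}
    for i in range(n):
        clusters.setdefault(find(i), []).append(i)
    return list(clusters.values())

# Multi-pass: three small-window runs on different keys, then closure.
records = [("smith", "john", "12 oak st"),
           ("smith", "jon",  "12 oak st"),
           ("smyth", "jon",  "12 oak st"),
           ("doe",   "jane", "5 elm ave")]
match = lambda r, s: sum(a == b for a, b in zip(r, s)) >= 2
keys = [lambda r: r[0], lambda r: r[1], lambda r: r[2]]
all_pairs = set()
for k in keys:
    all_pairs |= sorted_neighborhood_pass(records, k, window=2, match=match)
groups = transitive_closure(len(records), all_pairs)
```

Note that records 0 and 2 never match directly (only their addresses agree), yet the closure over the pairs (0, 1) and (1, 2) places all three in one equivalence class; this transitivity is what lets several small-window passes outperform one large-window pass.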
Let us now consider the issue when the process is I/O-bound rather than a compute-bound main-memory process. Let B be the number of disk blocks used by the input data set and M the number of memory pages available. Each sorted-neighborhood method execution will access 2B log_{M-1} B disk blocks (the factor 2 comes from counting both read and write operations), plus B disk blocks will be read by the window-scanning phase. The time for the sorted-neighborhood method can be expressed as:

    Tsnm = 2 csort B log_{M-1} B + cwscan B

where csort represents the CPU cost of sorting the data in one block and cwscan represents the CPU cost of applying the window-scan method to the data in one block.

Instead of sorting, we could divide the data into C buckets (e.g., hashing the records or using a multi-dimensional partitioning strategy [15]). We call this modification the clustering method. Assuming M = C + 1 (one page for each bucket plus one page for processing an input block), we need one pass over the entire data to partition the records into C buckets (B blocks are read). Writing the records into the buckets requires, approximately, B block writes. Assuming the partition algorithm is perfect, each bucket will use ⌈B/C⌉ blocks. We must then sort each bucket (2B log_{M-1} ⌈B/C⌉ block accesses) and apply the window-scanning phase to each bucket independently (approximately B block accesses). In total, the clustering method requires approximately 3B + 2B log_{M-1} ⌈B/C⌉ block accesses. The time for one pass of the clustering method can be expressed as:

    Tcluster = 2 ccluster B + 2 csort B log_{M-1} ⌈B/C⌉ + cwscan B

where ccluster is the CPU cost of partitioning one block of data. Finally, the I/O cost of the multi-pass approach will be a multiple of the I/O cost of the method chosen for each pass, plus the time needed to compute the transitive-closure step. For instance, if we use the clustering method for 3 passes, we should expect a time of about 3 Tcluster + Txclosure.
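The two cost formulas can be evaluated numerically for a configuration like the one used in the experiments. A small sketch, assuming illustrative (not measured) values for csort, cwscan, and ccluster:

```python
import math

def t_snm(B, M, c_sort, c_wscan):
    # Tsnm = 2 * c_sort * B * log_{M-1}(B) + c_wscan * B
    return 2 * c_sort * B * math.log(B, M - 1) + c_wscan * B

def t_cluster(B, M, C, c_cluster, c_sort, c_wscan):
    # Tcluster = 2 * c_cluster * B
    #          + 2 * c_sort * B * log_{M-1}(ceil(B/C)) + c_wscan * B
    bucket = math.ceil(B / C)
    return (2 * c_cluster * B
            + 2 * c_sort * B * math.log(bucket, M - 1)
            + c_wscan * B)

# Setting close to Figure 5: B = 31,250 blocks, M = 33 pages, C = M - 1.
# The c constants below are hypothetical; the paper's point is that
# c_wscan dwarfs c_sort and c_cluster when the equational theory does
# many comparisons per window.
B, M = 31_250, 33
snm = t_snm(B, M, c_sort=1.0, c_wscan=20.0)
clu = t_cluster(B, M, C=M - 1, c_cluster=0.5, c_sort=1.0, c_wscan=20.0)
```

With these constants the clustering method comes out cheaper, but by only a few percent, since the common cwscan·B term dominates both totals.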
Figure 5 shows a time comparison between the clustering method and the sorted-neighborhood method. These results were gathered using a generated data set of 468,730 records (B = 31,250 blocks, block size = 1,024 bytes, M = 33). Notice that in all cases the clustering method does better than the sorted-neighborhood method. However, the difference in time is not large. This is mainly due to the fact that the equational theory used involved a large number of comparisons, making cwscan a lot larger than both csort and ccluster. Thus, even though there are some time savings in initially partitioning the data, the savings are small compared to the overall time cost. In [16] we describe parallel variants of the basic techniques (including clustering) to show that with a modest amount of "cheap" parallel hardware, we can speed up the multi-pass approach to a level comparable to the time of a single-pass approach, but with very high accuracy; i.e., a few passes with small windows ultimately win.

[Figure 5: time comparison for window sizes 2-10; Time (s) vs. window size (records), with curves for the average single-pass time and the total multi-pass time of the Naive SNM and the Clustering SNM.]
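The clustering method described above can be sketched as follows; the hash-based partition function, the toy records, and the equality match predicate are assumptions for the illustration, not the paper's partitioning strategy.

```python
def cluster_method(records, key, C, window, match):
    """Clustering method: one pass partitions records into C buckets on
    the key; each (much smaller) bucket is then sorted and window-scanned
    independently."""
    buckets = [[] for _ in range(C)]
    for i, r in enumerate(records):
        # One partitioning pass; a real system would write each bucket
        # to its own page/run on disk.
        buckets[hash(key(r)) % C].append(i)
    pairs = set()
    for bucket in buckets:
        bucket.sort(key=lambda i: key(records[i]))  # sort the small bucket
        for pos, i in enumerate(bucket):
            for j in bucket[max(0, pos - window + 1):pos]:
                if match(records[i], records[j]):
                    pairs.add((min(i, j), max(i, j)))
    return pairs

records = ["apple", "apple", "banana", "banana", "cherry"]
pairs = cluster_method(records, key=lambda r: r, C=4, window=2,
                       match=lambda a, b: a == b)
```

One design caveat this sketch makes visible: two true duplicates with slightly different key values can hash to different buckets and never be compared, which is why the multi-pass approach with several independent keys remains important.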