Real-world Data is Dirty: Data Cleansing and The Merge/Purge Problem

…the sorted-neighborhood method. An AWK script combined with a C program was used to implement the prime-representative selection part of the algorithm. The only strategy tested was the N-latest strategy, with N = 1 (i.e., only the latest record in a cluster was used as the prime-representative).

[Figure 11: Incremental vs. Normal Multi-Pass Merge/Purge Times. Time in seconds (0 to 600) per delta (increments of 25,000 records each, Delta 0 through Delta 4), with five bars per delta: Creating Clusters/Select Records; Incremental Merge/Purge Time; Total Time (Clustering + Incremental M/P); Incremental M/P Accumulative Time; Normal Multi-Pass Merge/Purge.]

Figure 11 shows the time results for the five-part incremental Merge/Purge procedure in contrast to the normal (non-incremental) Merge/Purge. These results were obtained with the three-pass basic multi-pass approach, using the keys described in Section 4.2 and a window size of 10 records, on a Sun 5 Workstation running Solaris 2.3.

The results in Figure 11 are divided by deltas. Five bars, each representing the actual time for a particular measured phase, are present for each division. The first bar corresponds to the time taken to collect the prime-representatives of the previous run of the multi-pass approach (note that this bar is 0 for the first delta). The second bar represents the time for executing the multi-pass approach over the concatenation of the current delta with the prime-representative records. The total time for the incremental Merge/Purge process is the sum of these two times and is represented by the third bar. The fourth bar shows the accumulated total time after each incremental Merge/Purge procedure. Finally, the last bar shows the time for a normal Merge/Purge procedure running over a database composed of the concatenation of all deltas, up to and including the current one.

Notice that for every case after the first delta, the total time for the incremental Merge/Purge process is considerably less than the time for the normal process. For all cases tested in this experiment, however, the cumulative time for the incremental Merge/Purge process was larger than the total time for the normal Merge/Purge. This is due to the large time cost of clustering and selecting the prime-representative records for each cluster. In the current implementation, the entire dataset (the concatenation of all deltas, up to and including the current one) is sorted to find the clusters, and all records in a cluster are considered when selecting the prime-representatives. This is clearly not the optimal solution for the clustering of records and the selection of prime-representatives. A better implementation could incrementally select prime-representatives based on the previously computed ones. The current implementation, nonetheless, gives a "worst-case" execution time for this phase; any optimization will only decrease the total incremental Merge/Purge time.
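To make the prime-representative step concrete, here is a minimal C sketch of the N-latest selection with N = 1. The input format (a numeric cluster id at the start of each line) and the assumption that records arrive sorted by cluster id and then by time of entry are ours for illustration; the paper states only that an AWK script combined with a C program was used.

    #include <stdio.h>
    #include <string.h>
    #include <stdlib.h>

    /* Sketch of N-latest prime-representative selection with N = 1.
       Assumed (hypothetical) input format: one record per line, beginning
       with a numeric cluster id, sorted by cluster id and then by time of
       entry, so the last line of each cluster run is the latest record. */
    int main(void)
    {
        char line[4096], latest[4096];
        long cur = -1;   /* cluster id of the run currently being scanned */
        int  have = 0;   /* holding a candidate representative? */

        while (fgets(line, sizeof line, stdin) != NULL) {
            long id = strtol(line, NULL, 10);
            if (have && id != cur)
                fputs(latest, stdout);   /* emit previous cluster's latest */
            cur = id;
            strcpy(latest, line);        /* newest record for this cluster */
            have = 1;
        }
        if (have)
            fputs(latest, stdout);       /* emit the final cluster's pick */
        return 0;
    }

Records emitted this way would then be concatenated with the next delta and fed to the multi-pass Merge/Purge run, as described above.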
[Figure 12: Accuracy of the Incremental M/P procedure. (a) Number of individual clusters formed per delta, incremental vs. normal M/P. (b) Number of possible misses and false-positives per delta, incremental vs. normal M/P.]

Finally, Figure 12 compares the accuracy results of the incremental Merge/Purge procedure with those of the normal procedure. The total number of individuals (clusters) detected, the number of possible misses, and the number of possible false-positives all went up with the use of the incremental Merge/Purge procedure. Nonetheless, the increase in all measures is almost negligible and arguably acceptable given the remarkable reduction in time provided by the incremental procedure.

6 Conclusion

The sorted-neighborhood method is expensive due to the sorting phase, as well as the need to search in large windows for high accuracy. Alternative methods based on data clustering modestly improve the running time of the process, as reported elsewhere. However, neither approach achieves high accuracy without inspecting large neighborhoods of records. Of particular interest is that performing the data cleansing process multiple times over small windows, followed by the computation of the transitive closure, dominates in accuracy for either method. While multiple passes with small windows increase the number of successful matches, small windows also favor decreases in false positives, leading to high overall accuracy of the merge phase. Put another way, a single-pass approach would be far slower in achieving an accuracy comparable to that of a multi-pass approach.

The results we demonstrate for statistically generated databases provide a means of quantifying the accuracy of the alternative methods. For real-world data we have no comparable means of rigorously evaluating these results. Nevertheless, the application of our program to real-world data provided by the State of Washington Child Welfare Department has validated our claims of the improved accuracy of the multi-pass method, based upon "eye-balling" a significant sample of data. Thus, what the…
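The transitive-closure step discussed in the conclusion can be sketched with a union-find structure: every pair of records that any small-window pass declares a match is unioned, and records that end up with the same root are treated as the same individual. The choice of union-find, the pair-per-line input format, and the MAXREC bound are illustrative assumptions; the paper does not prescribe a particular closure algorithm.

    #include <stdio.h>

    #define MAXREC 100000   /* illustrative bound on record ids */

    static int parent[MAXREC];

    /* Root of a record's equivalence class, with path compression. */
    static int find(int x)
    {
        while (parent[x] != x) {
            parent[x] = parent[parent[x]];
            x = parent[x];
        }
        return x;
    }

    /* Merge the classes of two records that some pass matched. */
    static void unite(int a, int b)
    {
        a = find(a);
        b = find(b);
        if (a != b)
            parent[b] = a;
    }

    int main(void)
    {
        int a, b, i;
        for (i = 0; i < MAXREC; i++)
            parent[i] = i;

        /* Each input line "a b" is a pair of record ids matched by one
           of the small-window passes; unioning every pair from every
           pass yields the transitive closure of the match relation. */
        while (scanf("%d %d", &a, &b) == 2)
            if (a >= 0 && a < MAXREC && b >= 0 && b < MAXREC)
                unite(a, b);

        /* Records sharing a root are treated as the same individual. */
        for (i = 0; i < MAXREC; i++)
            printf("%d %d\n", i, find(i));
        return 0;
    }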