sorted-neighborhood method. An AWK script combined with a C program was used to implement the prime-representative selection part of the algorithm. The only strategy tested was the N-latest strategy, with N = 1 (i.e., only the latest record in a cluster was used as the prime-representative).

[Figure 11: Incremental vs. Normal Multi-Pass Merge/Purge Times]
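The N-latest selection just described can be sketched as follows (a minimal illustration with hypothetical record and cluster structures, not the actual AWK/C implementation):

```python
def select_prime_representatives(clusters, n=1):
    """For each cluster, keep the N most recently added records
    as its prime-representatives (the N-latest strategy)."""
    reps = []
    for records in clusters.values():
        # records are assumed stored in arrival order, so the
        # last n entries are the latest ones
        reps.extend(records[-n:])
    return reps

# With N = 1, only the latest record of each cluster survives.
clusters = {
    "c1": ["rec-a", "rec-b"],
    "c2": ["rec-c"],
}
reps = select_prime_representatives(clusters, n=1)  # ['rec-b', 'rec-c']
```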
Figure 11 shows the time results for the five-part incremental Merge/Purge procedure in contrast to the normal (non-incremental) Merge/Purge. These results were obtained with a three-pass basic multi-pass approach using the keys described in section 4.2, a window size of 10 records, on a Sun 5 Workstation running Solaris 2.3.
The results in Figure 11 are divided by deltas. Five bars, each representing the actual time for a particular measured phase, are present for each division in Figure 11. The first bar corresponds to the time taken to collect the prime-representatives of the previous run of the multi-pass approach (note this bar is 0 for the first delta). The second bar represents the time for executing the multi-pass approach over the concatenation of the current delta with the prime-representative records. The total time for the incremental Merge/Purge process is the sum of these two times and is represented by the third bar. The fourth bar shows the accumulated total time after each incremental Merge/Purge procedure. Finally, the last bar shows the time for a normal Merge/Purge procedure running over a database composed of the concatenation of all deltas, up to and including the current one.
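The per-delta procedure whose phases are timed in Figure 11 can be sketched as the following loop; `multi_pass_merge_purge` and `select_reps` are hypothetical stand-ins for the actual multi-pass implementation and the prime-representative selection:

```python
def incremental_merge_purge(deltas, multi_pass_merge_purge, select_reps):
    """Run the multi-pass approach per delta over (previous prime-reps
    + delta) instead of over the whole accumulated database."""
    reps = []      # no prime-representatives exist before the first delta
    clusters = {}
    for delta in deltas:
        # second measured phase: multi-pass over prime-reps ++ current delta
        clusters = multi_pass_merge_purge(reps + delta)
        # first measured phase of the *next* increment: collect prime-reps
        reps = select_reps(clusters)
    return clusters

# Toy stand-ins: cluster by first letter, keep the latest record per cluster.
def toy_multi_pass(records):
    clusters = {}
    for r in records:
        clusters.setdefault(r[0], []).append(r)
    return clusters

def latest_rep(clusters):
    return [records[-1] for records in clusters.values()]

result = incremental_merge_purge([["apple"], ["ant", "bee"]],
                                 toy_multi_pass, latest_rep)
```

Only the prime-representatives of the previous increment, rather than the whole accumulated database, are re-examined with each new delta, which is where the per-delta time savings come from.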
[Figure 12: Accuracy of the Incremental M/P procedure. (a) Clusters formed; (b) possible misses and false-positives.]

Notice that for every case after the first delta, the total time for the incremental Merge/Purge process is considerably less than the time for the normal process. For all cases tested in this experiment, however, the cumulative time for the incremental Merge/Purge process was larger than
the total time for the normal Merge/Purge. This is due to the large time cost of clustering
and selecting the prime-representative records for each cluster. In the current implementation, the entire dataset (the concatenation of all deltas, up to an including the current one)
is sorted to nd the clusters and all records in the cluster are considered when selecting the
prime-representatives. This is clearly not the optimal solution for the clustering of records
and the selection of prime-representatives. A better implementation could incrementally
select prime-representatives based on the previously computed one. The current implementation, nonetheless, gives a \worst-case" execution time for this phase. Any optimization
will only decrease the total incremental Merge/Purge time.
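One possible shape for such an incremental selection, assuming a hypothetical `cluster_key` function and a dictionary of representatives keyed by cluster (again the N-latest strategy with N = 1):

```python
def update_prime_representatives(reps_by_cluster, delta, cluster_key):
    """Maintain one prime-representative per cluster across increments.
    Only the records of the new delta are examined; the full dataset is
    never re-sorted (contrast with the worst-case implementation above)."""
    for record in delta:
        # the latest record seen for a cluster replaces its representative
        reps_by_cluster[cluster_key(record)] = record
    return reps_by_cluster

# Toy usage: cluster records by their first character.
reps = {}
update_prime_representatives(reps, ["apple", "ant"], lambda r: r[0])
update_prime_representatives(reps, ["bee"], lambda r: r[0])
```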
Finally, Figure 12 compares the accuracy results of the incremental Merge/Purge procedure with the normal procedure. The total number of individuals (clusters) detected, the number of possible misses, and the number of possible false-positives all went up with the use of the incremental Merge/Purge procedure. Nonetheless, the increase in all measures is almost negligible and arguably acceptable given the remarkable reduction in time provided by the incremental procedure.

6 Conclusion
The sorted-neighborhood method is expensive due to the sorting phase, as well as the need to search in large windows for high accuracy. Alternative methods based on data clustering modestly improve the process in time, as reported elsewhere. However, neither achieves high accuracy without inspecting large neighborhoods of records. Of particular interest is that performing the data cleansing process multiple times over small windows, followed by the computation of the transitive closure, dominates in accuracy for either method. While multiple passes with small windows increase the number of successful matches, small windows also favor decreases in false positives, leading to high overall accuracy of the merge phase. An alternative view is that a single-pass approach would be far slower to achieve accuracy comparable to a multi-pass approach.
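The transitive-closure step referred to above can be computed with a standard union-find structure over the pairs matched by the independent passes (a generic sketch, not the paper's implementation):

```python
class UnionFind:
    """Disjoint sets for taking the transitive closure of matched pairs."""
    def __init__(self):
        self.parent = {}

    def find(self, x):
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, a, b):
        self.parent[self.find(a)] = self.find(b)

# Matches from separate small-window passes: (1,2) from one pass,
# (2,3) from another, (4,5) from a third.
uf = UnionFind()
for a, b in [(1, 2), (2, 3), (4, 5)]:
    uf.union(a, b)
# Records 1, 2, and 3 now fall into one equivalence class; 4 and 5 another.
```

Each pass only needs to find "easy" matches within its small window; the closure merges them into the larger clusters a single wide-window pass would have had to find directly.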
The results we demonstrate for statistically generated databases provide the means of quantifying the accuracy of the alternative methods. For real-world data we have no comparable means of rigorously evaluating these results. Nevertheless, the application of our program over real-world data provided by the State of Washington Child Welfare Department has validated our claims of improved accuracy of the multi-pass method, based upon "eye-balling" a significant sample of data. Thus, what the...