orhood method. An AWK script combined with a C program was used
to implement the prime-representative selection part of the algorithm. The only strategy
tested was the N-latest strategy, where N = 1 (i.e., only the latest record in a cluster was
used as prime-representative).

[Figure 11: Incremental vs. Normal Multi-Pass Merge/Purge Times. Horizontal axis: increment
(25,000 records each), Delta 0 to Delta 4. Vertical axis: time (s), 0.0 to 600.0. Bars per
delta: Creating Clusters/Select Records; Incremental Merge/Purge; Total time (Clustering +
Incremental M/P); Incremental M/P Accumulative Time; Normal Multi-Pass Merge/Purge.]
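The N-latest strategy can be illustrated with a minimal Python sketch. This is not the
AWK/C implementation mentioned in the text; the function name select_n_latest and the
(cluster_id, timestamp, payload) record layout are assumptions made for illustration only.

```python
from collections import defaultdict


def select_n_latest(records, n=1):
    """Return up to n most recent payloads per cluster.

    records: iterable of (cluster_id, timestamp, payload) tuples.
    This tuple layout is a hypothetical one, not taken from the paper.
    """
    clusters = defaultdict(list)
    for cluster_id, timestamp, payload in records:
        clusters[cluster_id].append((timestamp, payload))

    representatives = {}
    for cluster_id, members in clusters.items():
        # Newest first; the first n entries become the prime-representatives.
        members.sort(key=lambda m: m[0], reverse=True)
        representatives[cluster_id] = [payload for _, payload in members[:n]]
    return representatives


records = [
    ("c1", 1, "J. Smith, 12 Main St"),
    ("c1", 3, "John Smith, 12 Main Street"),
    ("c2", 2, "A. Jones, 4 Elm Rd"),
]
reps = select_n_latest(records, n=1)
```

With n = 1, as in the experiment described above, each cluster is represented by its single
most recent record.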
Figure 11 shows the time results for the five-part incremental Merge/Purge procedure in
contrast to the normal (non-incremental) Merge/Purge. These results were obtained with a
three-pass basic multi-pass approach using the keys described in section 4.2, a window size
of 10 records, and a Sun 5 Workstation running Solaris 2.3.
The results in Figure 11 are divided by deltas. Five bars, each representing the actual
time for a particular measured phase, are present for each division in Figure 11. The first
bar corresponds to the time taken to collect the prime-representatives of the previous run of
the multi-pass approach (note this bar is 0 for the first delta). The second bar represents the
time for executing the multi-pass approach over the concatenation of the current delta with
the prime-representative records. The total time for the incremental Merge/Purge process
is the sum of these two times and is represented by the third bar. The fourth bar shows
the accumulated total time after each incremental Merge/Purge procedure. Finally, the last
bar shows the time for a normal Merge/Purge procedure running over a database composed
of the concatenation of all deltas, up to and including the current one.
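The measured phases can be sketched as a loop over deltas. The following Python sketch is
an illustration under stated assumptions, not the implementation used in the experiment:
merge_purge and select_representatives are caller-supplied placeholders, and the toy
stand-ins below (exact match on a lowercased name, latest record by a hypothetical seq
field) exist only to make the example runnable. The fifth bar (normal Merge/Purge over all
deltas) is a separate baseline and is not measured here.

```python
import time


def incremental_merge_purge(deltas, merge_purge, select_representatives):
    """Process deltas one at a time, carrying prime-representatives forward.

    merge_purge(records) -> list of clusters (each a list of records)
    select_representatives(clusters) -> list of records
    Both callables are placeholders supplied by the caller.
    """
    clusters = []
    accumulated = 0.0
    timings = []  # (select, merge, total, accumulated) per delta: bars 1 to 4
    for delta in deltas:
        start = time.perf_counter()
        representatives = select_representatives(clusters)  # empty for first delta
        select_time = time.perf_counter() - start

        start = time.perf_counter()
        clusters = merge_purge(representatives + list(delta))
        merge_time = time.perf_counter() - start

        total = select_time + merge_time
        accumulated += total
        timings.append((select_time, merge_time, total, accumulated))
    return clusters, timings


def toy_merge_purge(records):
    # Stand-in matcher: records with the same lowercased name form a cluster.
    groups = {}
    for rec in records:
        groups.setdefault(rec["name"].lower(), []).append(rec)
    return list(groups.values())


def latest_only(clusters):
    # N-latest with N = 1, using the hypothetical "seq" field as recency.
    return [max(cluster, key=lambda r: r["seq"]) for cluster in clusters]


deltas = [
    [{"name": "Ann Lee", "seq": 1}, {"name": "Bob Ray", "seq": 2}],
    [{"name": "ann lee", "seq": 3}, {"name": "Cy Dow", "seq": 4}],
]
clusters, timings = incremental_merge_purge(deltas, toy_merge_purge, latest_only)
```

In this toy run the second delta is matched against only two representatives rather than
against every earlier record, which is the source of the per-increment saving.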
Notice that for every case after the first delta, the total time for the incremental Merge/Purge
process is considerably less than the time for the normal process. For all cases tested in this
experiment, the cumulative time for the incremental Merge/Purge process was larger than
the total time for the normal Merge/Purge. This is due to the large time cost of clustering
and selecting the prime-representative records for each cluster. In the current implementation,
the entire dataset (the concatenation of all deltas, up to and including the current one)
is sorted to find the clusters, and all records in the cluster are considered when selecting the
prime-representatives. This is clearly not the optimal solution for the clustering of records
and the selection of prime-representatives. A better implementation could incrementally
select prime-representatives based on the previously computed ones. The current implementation,
nonetheless, gives a "worst-case" execution time for this phase. Any optimization
will only decrease the total incremental Merge/Purge time.

[Figure 12: Accuracy of the Incremental M/P procedure. (a) Clusters formed: number of
individual clusters (0 to 10000) per delta (0 to 4), for Incremental M/P and Normal M/P.
(b) Possible Misses and False-Positives: number of possible misses/false-positives (0 to 1000)
per delta (0 to 4), for Misses and False-Positives under Incremental M/P and Normal M/P.]
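One way the suggested incremental selection might look is sketched below in Python. It is
an assumption-laden illustration, not something the paper specifies: it assumes N = 1, that
each new record has already been assigned a cluster identifier, and that no two previously
separate clusters are merged by the new delta (a case a real implementation would have to
handle). The name update_representatives and the tuple layouts are hypothetical.

```python
def update_representatives(representatives, assigned_delta):
    """Update the latest-record representative of each cluster in place.

    representatives: dict mapping cluster_id -> (timestamp, record)
    assigned_delta: iterable of (cluster_id, timestamp, record) for the new
        records only; earlier records are never revisited or re-sorted.
    """
    for cluster_id, timestamp, record in assigned_delta:
        current = representatives.get(cluster_id)
        # A new cluster, or a record at least as recent, replaces the old entry.
        if current is None or timestamp >= current[0]:
            representatives[cluster_id] = (timestamp, record)
    return representatives


reps = {"c1": (3, "John Smith, 12 Main Street")}
new_records = [
    ("c1", 5, "John Smith, 14 Main Street"),
    ("c2", 4, "A. Jones, 4 Elm Rd"),
    ("c1", 2, "J. Smith, 12 Main St"),
]
update_representatives(reps, new_records)
```

The cost of this update is proportional to the size of the delta rather than to the size of
the whole dataset, which is the kind of saving the paragraph above anticipates.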
Finally, Figure 12 compares the accuracy results of the incremental Merge/Purge procedure
with the normal procedure. The total number of individuals (clusters) detected, the
number of possible misses and the number of possible false-positives went up with the use of
the incremental Merge/Purge procedure. Nonetheless, the increase in all measures is almost
negligible and arguably acceptable given the remarkable reduction of time provided by the
incremental procedure.

6 Conclusion
The sorted-neighborhood method is expensive due to the sorting phase, as well as the need
to search in large windows for high accuracy. Alternative methods based on data clustering
modestly improve the process in time, as reported elsewhere. However, neither achieves
high accuracy without inspecting large neighborhoods of records. Of particular interest is
that performing the data cleansing process multiple times over small windows, followed by
the computation of the transitive closure, dominates in accuracy for either method. While
multiple passes with small windows increase the number of successful matches, small windows
also favor decreases in false positives, leading to high overall accuracy of the merge
phase. An alternative view is that a single-pass approach would be far slower in achieving
accuracy comparable to that of a multi-pass approach.
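The multi-pass idea summarized above can be sketched in Python as follows. This is a
minimal illustration, not the authors' program: the matching rule (at least two of three
fields agree), the sort keys, and the sample records are all assumptions chosen to keep the
example small, and the transitive closure is computed here with a union-find structure.

```python
def sorted_neighborhood_pairs(records, key, window, is_match):
    """One pass: sort by key, compare each record with the window - 1 before it."""
    order = sorted(range(len(records)), key=lambda i: key(records[i]))
    pairs = set()
    for pos, i in enumerate(order):
        for j in order[max(0, pos - window + 1):pos]:
            if is_match(records[i], records[j]):
                pairs.add((min(i, j), max(i, j)))
    return pairs


def transitive_closure(n, pairs):
    """Group record indices 0..n-1 into clusters using union-find."""
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    clusters = {}
    for i in range(n):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())


def multi_pass(records, keys, window, is_match):
    """Run one small-window pass per key, then close the matches transitively."""
    pairs = set()
    for key in keys:
        pairs |= sorted_neighborhood_pairs(records, key, window, is_match)
    return transitive_closure(len(records), pairs)


def two_fields_agree(r1, r2):
    # Hypothetical matching rule: at least two of the three fields are equal.
    return sum(a == b for a, b in zip(r1, r2)) >= 2


# Records are (last name, first name, postal code); the data is made up.
records = [
    ("Smith", "John", "10027"),
    ("Smyth", "John", "10027"),
    ("Smith", "Jon", "10027"),
    ("Adams", "Mary", "94110"),
    ("Smithson", "Al", "73301"),
]
by_last = lambda r: r[0]
by_first = lambda r: r[1]

single = multi_pass(records, [by_last], 2, two_fields_agree)
multi = multi_pass(records, [by_last, by_first], 2, two_fields_agree)
# single -> [[0, 2], [1], [3], [4]]
# multi  -> [[0, 1, 2], [3], [4]]
```

With a window of 2, the single pass on last name never places "Smyth" next to either
"Smith" record, so record 1 is missed. The second pass on first name matches records 0 and
1, and the transitive closure joins all three without any larger window.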
The results we demonstrate for statistically generated databases provide the means of
quantifying the accuracy of the alternative methods. In real-world data we have no comparable
means of rigorously evaluating these results. Nevertheless, the application of our
program over real-world data provided by the State of Washington Child Welfare Department
has validated our claims of improved accuracy of the multi-pass method based upon
"eyeballing" a significant sample of data. Thus, what the...