Real-world Data is Dirty: Data Cleansing and The Merge/Purge Problem


...our program did not.) The number of possible false positives (records that were correctly merged by the multi-pass sorted-neighborhood method) should be larger than the real false positives (cases where our approach incorrectly merged records that OCAR did not). To study whether our results met the above conditions, we and the OCAR group manually inspected the possible misses and the possible false positives for the case where the window size was 10. The results of this manual inspection are as follows:

Possible Misses: The multi-pass sorted-neighborhood method failed to detect 96 individuals that OCAR detected. Of these 96 possible misses:

- 44 (45.8%) were correctly separated by our approach and are therefore not real misses. (OCAR's results on these are wrong.)
- 26 (27.1%) were incorrectly separated by our approach and are therefore real misses.
- 26 (27.1%) were special cases involving "ghost" records or records of payments to outside agencies. We agreed with OCAR to exclude these cases from further consideration.

Possible False Positives: There were 580 instances of the multi-pass sorted-neighborhood method joining records as individuals that OCAR did not. Of these 580 cases, we manually inspected 225 (38.7%), with the following results:

- 14.0% were incorrectly merged by our approach.
- 86.0% were correctly merged by our approach.

In summary, 45.8% of the possible misses are not real misses but correctly classified records, and an estimated 86.0% of the possible false positives are not real false positives. These results led OCAR to be confident that the multi-pass sorted-neighborhood method will improve their individual-detection procedure.

5 Incremental Merge/Purge

All versions of the sorted-neighborhood method discussed in Section 2 started the procedure by first concatenating the input lists of tuples.
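The accounting above can be made concrete with a short sketch that recomputes the reported percentages from the raw counts (the counts are taken directly from the inspection; all variable names are illustrative):

```python
# Adjusted accounting of the manual inspection (window size 10).
# Counts come from the text; the percentages are recomputed from them.

possible_misses = 96
not_real_misses = 44        # correctly separated by the multi-pass method
real_misses = 26            # incorrectly separated: true misses
excluded_special = 26       # "ghost" records / payments to outside agencies

# The three categories partition the possible misses.
assert not_real_misses + real_misses + excluded_special == possible_misses

possible_false_positives = 580
inspected = 225
correctly_merged_frac = 0.86   # of the inspected sample

print(f"not real misses:  {not_real_misses / possible_misses:.1%}")
print(f"inspected sample: {inspected / possible_false_positives:.1%}")
print(f"estimated non-false-positives: {correctly_merged_frac:.1%}")
```

The key point is that both headline numbers (45.8% and 86.0%) are fractions of the *disputed* cases, not of the whole dataset.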
This concatenation step is unavoidable and presumably acceptable the first time a set of databases is received for processing. However, once the data has been cleansed (via the merge/purge process) and stored for future use, concatenating this processed data with recently arrived data before re-applying a merge/purge process might not be the best strategy. In particular, when new increments of data arrive at short intervals, concatenating all data before merging and purging could prove prohibitively expensive in both the time and the space required. In this section we describe an incremental version of the sorted-neighborhood procedure and provide some initial time and accuracy results for statistically generated datasets.

Figure 10 summarizes an incremental Merge/Purge algorithm. The algorithm specifies a loop repeated for each increment of information received by the system. The increment is concatenated to a relation of prime representatives pre-computed from the previous run of the incremental Merge/Purge algorithm, and any multi-pass sorted-neighborhood method is applied to the resulting relation. Here prime representatives are a set of records extracted from each cluster of records and used to represent the information in that cluster. Borrowing from the pattern-recognition community, we can think of these prime representatives as analogous to the "cluster centroids" [10] generally used to represent clusters of information, or as the base element of an equivalence class. Initially, no previous set of prime representatives exists and the first increment is just the first input relation; the concatenation step therefore has no effect. After the execution of the merge/purge procedure, each record from the input relation can be separated into clusters of similar records. The first time the algorithm is used, all records will go into new clusters.
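The incremental loop described above can be sketched as follows. This is a simplified outline, not the algorithm of Figure 10 itself: `multipass_snm` and `select_prime_representatives` are assumed callables standing in for the multi-pass sorted-neighborhood method and the domain-specific selection step, and the bookkeeping that carries full cluster membership across runs is omitted.

```python
from collections import defaultdict

def incremental_merge_purge(increments, multipass_snm,
                            select_prime_representatives):
    """Sketch of an incremental Merge/Purge loop.

    increments: iterable of record lists arriving over time.
    multipass_snm: maps a list of records to clusters (lists of
        records judged to represent the same individual).
    select_prime_representatives: maps a cluster to the records kept
        to represent it; returning [] retires the cluster.
    """
    prime_reps = []          # representatives from the previous run
    clusters = []
    for increment in increments:
        # Concatenate the new data with the prime representatives only,
        # not with the full history of already-cleansed data.
        working_set = prime_reps + list(increment)
        clusters = multipass_snm(working_set)
        prime_reps = [rep for c in clusters
                      for rep in select_prime_representatives(c)]
    return clusters

# Toy demo: "cluster" records by first letter, keep one representative
# per cluster. New records join old clusters via the representatives.
def _toy_snm(records):
    groups = defaultdict(list)
    for r in records:
        groups[r[0]].append(r)
    return list(groups.values())

demo_clusters = incremental_merge_purge(
    [["alice", "bob"], ["alan", "bert"]], _toy_snm, lambda c: c[:1])
```

In the demo, the second increment's records are matched only against the two representatives carried over from the first run, which is the source of the time and space savings.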
Then, starting with the second execution of the algorithm, records will be added to previously existing clusters as well as to new clusters. Of particular importance to the success of this incremental procedure, in terms of the accuracy of the results, is the correct selection of the prime representatives of each formed cluster. As with many other phases of the merge/purge procedure, this selection is a knowledge-intensive operation where the domain of each application determines what constitutes a good set of prime representatives. Before describing some strategies for selecting these representatives, note that the description of step 4 of the algorithm in Figure 10 also implies that for some clusters the best prime representative is no representative at all. For a practical example where this strategy applies, consider the OCAR data described in Chapter 4. There, clusters containing records dated as more than 10 years old are very unlikely to receive a new record. Such clusters can be removed from further consideration by not selecting a prime representative for them. In the case where one or more prime representatives per cluster are necessary, here are some possible strategies for their selection:

Definitions: R0: ...
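The age-based retirement described above (selecting no representative for a cluster that is unlikely to receive new records) can be sketched as follows. The 10-year cutoff comes from the OCAR example; the record layout, the `date` field, and the choice of the most recent record as representative are illustrative assumptions, not the paper's strategies.

```python
from datetime import date

def select_prime_representatives(cluster, today, max_age_years=10):
    """Return representatives for one cluster, or [] to retire it.

    cluster: list of records, each a dict with a 'date' field.
    A cluster whose newest record is older than max_age_years is
    very unlikely to receive new records, so it gets no
    representative and drops out of future runs.
    """
    newest = max(r["date"] for r in cluster)
    if (today - newest).days > max_age_years * 365:
        return []                      # retire the cluster
    # Illustrative choice: keep the most recent record.
    return [max(cluster, key=lambda r: r["date"])]
```

Because retired clusters contribute nothing to the next run's working set, the relation passed to the multi-pass sorted-neighborhood method stays small even as history accumulates.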