that our program did not.)
The number of possible false positives (records merged by the multi-pass sorted-neighborhood method but not by OCAR) should be larger than the number of real false positives (cases where our approach incorrectly merged records that OCAR did not).
To study whether our results met the above conditions, we and the OCAR group manually inspected the possible misses and the possible false positives for the case where the window size was 10. The results of this manual inspection are as follows:
Possible Misses: The multi-pass sorted-neighborhood method failed to detect 96 individuals that OCAR detected. Of these 96 possible misses:

- 44 (45.8%) were correctly separated by our approach and are therefore not real misses. (OCAR's results on these are wrong.)
- 26 (27.1%) were incorrectly separated by our approach and are therefore real misses.
- 26 (27.1%) were special cases involving "ghost" records or records of payments to outside agencies. We agreed with OCAR to exclude these cases from further consideration.

Possible False Positives: There were 580 instances of the multi-pass sorted-neighborhood method joining records as individuals that OCAR did not. Of these 580 cases, we manually inspected 225 (38.7%) with the following results:

- 14.0% were incorrectly merged by our approach.
- 86.0% were correctly merged by our approach.
In summary, 45.8% of the possible misses are not real misses but correctly classified records, and an estimated 86.0% of the possible false positives are not real false positives. These results led OCAR to be confident that the multi-pass sorted-neighborhood method will improve their individual detection procedure.

5 Incremental Merge/Purge
All versions of the sorted-neighborhood method we discussed in section 2 started the procedure by first concatenating the input lists of tuples. This concatenation step is unavoidable, and presumably acceptable, the first time a set of databases is received for processing. However, once the data has been cleansed (via the merge/purge process) and stored for future use, concatenating this processed data with recently arrived data before reapplying a merge/purge process might not be the best strategy to follow. In particular, in situations where new increments of data become available at short intervals, concatenating all data before merging and purging could prove prohibitively expensive in both the time and the space required. In this section we describe an incremental version of the sorted-neighborhood procedure and provide some initial time and accuracy results for statistically generated datasets.
Figure 10 summarizes an incremental Merge/Purge algorithm. The algorithm specifies a loop repeated for each increment of information received by the system. The increment is concatenated to a relation of prime-representatives precomputed during the previous run of the incremental Merge/Purge algorithm, and any multi-pass sorted-neighborhood method is applied to the resulting relation. Here prime-representatives are a set of records extracted from each cluster of records and used to represent the information in that cluster. Borrowing from the pattern-recognition community, we can think of these prime-representatives as analogous to the "cluster centroids" [10] generally used to represent clusters of information, or as the base element of an equivalence class.
Initially, no previous set of prime-representatives exists and the first increment is just the first input relation. The concatenation step therefore has no effect. After the execution of the merge/purge procedure, each record from the input relation can be separated into clusters of similar records. The first time the algorithm is used, all records go into new clusters. From the second execution onward, records are added to previously existing clusters as well as to new clusters.
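The loop described above can be sketched in a few lines of Python. This is only an illustration, not the authors' implementation: the helper names (`multipass_snm`, `select_prime_representatives`) and the dictionary record format are hypothetical, and the stand-in matching step simply groups records by an assumed key field.

```python
def multipass_snm(records):
    """Stand-in for any multi-pass sorted-neighborhood method: here we
    simply cluster records sharing the same (illustrative) key field."""
    clusters = {}
    for rec in records:
        clusters.setdefault(rec["key"], []).append(rec)
    return list(clusters.values())

def select_prime_representatives(cluster):
    """Pick records that represent a cluster in future runs; an empty
    list would retire the cluster (the 'no representative' strategy)."""
    return [cluster[0]]  # simplest choice: one representative per cluster

def incremental_merge_purge(increments):
    primes = []  # prime-representatives from the previous run (none at first)
    clusters = []
    for increment in increments:
        # Concatenate the new increment with the prime-representatives,
        # then apply the multi-pass sorted-neighborhood method.
        clusters = multipass_snm(primes + increment)
        # Extract prime-representatives for the next increment.
        primes = [r for c in clusters for r in select_prime_representatives(c)]
    return clusters

inc1 = [{"key": "smith", "id": 1}, {"key": "jones", "id": 2}]
inc2 = [{"key": "smith", "id": 3}]  # joins the existing "smith" cluster
clusters = incremental_merge_purge([inc1, inc2])
```

In the toy run above, the second increment's record merges into the "smith" cluster via that cluster's prime-representative, without re-reading the full first increment.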
Of particular importance to the success of this incremental procedure, in terms of the accuracy of its results, is the correct selection of the prime-representatives of each formed cluster. As with many other phases of the merge/purge procedure, this selection is a knowledge-intensive operation in which the domain of each application determines what constitutes a good set of prime-representatives. Before describing some strategies for selecting these representatives, note that the description of step 4 of the algorithm in Figure 10 also implies that for some clusters the best prime-representative is no representative at all. For a practical example where this strategy applies, consider the OCAR data described in chapter 4. There, clusters containing records dated more than 10 years ago are very unlikely to receive a new record. Such clusters can be removed from further consideration by not selecting a prime-representative for them.
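The "no representative at all" strategy can be sketched as an age-based selection function. The function name, record fields, and the reference date below are assumptions for illustration only; a cluster whose newest record is older than 10 years gets no prime-representative and therefore drops out of all future increments.

```python
from datetime import date

def prime_representatives_with_pruning(cluster, today=date(2024, 1, 1)):
    """Return prime-representatives for a cluster, or an empty list to
    retire clusters whose newest record is more than ~10 years old."""
    newest = max(rec["date"] for rec in cluster)
    if (today - newest).days > 10 * 365:
        return []  # retire the cluster: it will not see new records
    return [cluster[0]]  # otherwise keep (e.g.) the first record

old_cluster = [{"id": 1, "date": date(1990, 5, 1)}]
recent_cluster = [{"id": 2, "date": date(2023, 7, 4)}]
```

Here `old_cluster` yields no representative, so later runs never re-scan it, while `recent_cluster` stays live.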
In the case where one or more prime-representatives per cluster are necessary, here are some possible strategies for their selection:
Definitions: R0: ...