... record will end
up close to a matching record after sorting. For instance, if an employee has two records in
the database, one with social security number 193456782 and another with social security
number 913456782 (the first two digits were transposed), and if the social security number
is used as the principal field of the key, then it is very unlikely both records will fall under the
same window, i.e. the two records with transposed social security numbers will be far apart
in the sorted list and hence they may not be merged. As we will show in the next section,
the number of matching records missed by one run of the sorted-neighborhood method can
be large unless the neighborhood grows very large.
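The transposition example can be illustrated with a minimal sketch of one sorted-neighborhood pass. The function name `sorted_neighborhood_pairs`, the record layout, the filler records, and the matching rule `same_name_and_address` are all hypothetical stand-ins chosen for illustration. In particular, the matching rule is only a placeholder for the equational theory, not the one used in the experiments.

```python
def sorted_neighborhood_pairs(records, key, is_match, w):
    """One sorted-neighborhood pass: sort by `key`, then compare each record
    with the w - 1 records that precede it in the sorted order."""
    order = sorted(range(len(records)), key=lambda i: key(records[i]))
    pairs = set()
    for pos, i in enumerate(order):
        for j in order[max(0, pos - (w - 1)):pos]:
            if is_match(records[i], records[j]):
                pairs.add((min(i, j), max(i, j)))
    return pairs


def same_name_and_address(a, b):
    # Hypothetical placeholder for the equational theory.
    return a["last"] == b["last"] and a["addr"] == b["addr"]


# Records 0 and 1 describe the same employee; the first two SSN digits are transposed.
records = [
    {"ssn": "193456782", "last": "Smith", "addr": "12 Oak St"},
    {"ssn": "913456782", "last": "Smith", "addr": "12 Oak St"},
]
# Unrelated filler records whose SSNs sort between the two duplicates.
records += [
    {"ssn": f"{d}00000000", "last": f"Filler{d}", "addr": f"{d} Elm St"}
    for d in range(2, 9)
]

by_ssn = lambda r: r["ssn"]
small_window = sorted_neighborhood_pairs(records, by_ssn, same_name_and_address, w=3)
full_window = sorted_neighborhood_pairs(records, by_ssn, same_name_and_address, w=len(records))

print(small_window)  # set(): the duplicates never share a window
print(full_window)   # {(0, 1)}: found only when the window spans the whole list
```

With the SSN as the principal key, the two duplicates land at opposite ends of the sorted list, so a small window never compares them.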
To increase the number of similar records merged, two options were explored. The first is
simply widening the scanning window size by increasing w. Clearly this increases the computational complexity and, as discussed in the next section, does not dramatically increase
the number of similar records merged in the test cases we ran (unless of course the window
spans the entire database, which we have presumed is infeasible under strict time and cost constraints).
The alternative strategy we implemented is to execute several independent runs of the
sorted-neighborhood method, each time using a different key and a relatively small window.
We call this strategy the multi-pass approach. For instance, in one run, we use the address
as the principal part of the key while in another run we use the last name of the employee
as the principal part of the key. Each independent run will produce a set of pairs of records
which can be merged. We then apply the transitive closure to those pairs of records. The
results will be a union of all pairs discovered by all independent runs, with no duplicates,
plus all those pairs that can be inferred by transitivity of equality.
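A minimal sketch of the multi-pass strategy follows, assuming a union-find structure for the transitive closure step (the text does not say how the closure is computed). The names `window_pairs`, `transitive_closure`, `multi_pass`, and `agree_on_two_fields`, the sample records, and the "agree on at least two fields" rule are hypothetical and serve only to make the example runnable.

```python
from itertools import combinations


def window_pairs(records, key, is_match, w):
    """One sorted-neighborhood pass with window size w."""
    order = sorted(range(len(records)), key=lambda i: key(records[i]))
    pairs = set()
    for pos, i in enumerate(order):
        for j in order[max(0, pos - (w - 1)):pos]:
            if is_match(records[i], records[j]):
                pairs.add((min(i, j), max(i, j)))
    return pairs


def transitive_closure(pairs):
    """Return every pair implied by `pairs` under transitivity (union-find)."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b

    groups = {}
    for x in list(parent):
        groups.setdefault(find(x), []).append(x)
    closed = set()
    for members in groups.values():
        closed.update(combinations(sorted(members), 2))
    return closed


def multi_pass(records, keys, is_match, w):
    """Run one small-window pass per key, union the pairs, then close them."""
    found = set()
    for key in keys:
        found |= window_pairs(records, key, is_match, w)
    return transitive_closure(found)


def agree_on_two_fields(a, b):
    # Hypothetical placeholder for the equational theory.
    return sum(a[f] == b[f] for f in ("ssn", "last", "addr")) >= 2


# Records 0, 1 and 2 describe one employee with different errors.
records = [
    {"ssn": "193456782", "last": "Smith", "addr": "12 Oak St"},
    {"ssn": "913456782", "last": "Smith", "addr": "12 Oak St"},  # transposed SSN
    {"ssn": "193456782", "last": "Smyth", "addr": "12 Oak St"},  # misspelled name
]
records += [
    {"ssn": f"{d}00000000", "last": f"Filler{d}", "addr": f"{d} Elm St"}
    for d in range(2, 9)
]

keys = [lambda r: r["ssn"], lambda r: r["last"]]
merged = multi_pass(records, keys, agree_on_two_fields, w=2)
print(sorted(merged))  # [(0, 1), (0, 2), (1, 2)]
```

In this toy data the SSN pass finds only the pair (0, 2) and the last-name pass finds only (0, 1). The pair (1, 2) is never matched directly and is obtained purely by transitivity.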
The reason this approach works for the test cases explored here has much to do with the
nature of the errors in the data. Transposing the first two digits of the social security number
leads to non-mergeable records as we noted. However, in such records, the variability or error
appearing in another field of the records may indeed not be so large. Therefore, although
the social security numbers in two records are grossly in error, the name fields may not be.
Hence, first sorting on the name fields as the primary key will bring these two records closer
together lessening the negative effects of a gross error in the social security field.
Notice that the use of a transitive closure step is not limited to the multi-pass approach.
We can improve the accuracy of a single pass by computing the transitive closure of the
results. If records a and b are found to be similar and, at the same time, records b and c
are also found to be similar, the transitive closure step can mark a and c to be similar if
this relation was not detected by the equational theory. Moreover, records a and b must be
within w records to be marked as similar by the equational theory. The same is true for
records b and c. But, if the transitive closure step is used, a and c need not be within w
records to be detected as similar. The use of a transitive closure at the end of any single-pass
run of the sorted-neighborhood method should allow us to reduce the size of the scanning
window w and still detect a comparable number of similar pairs as we would find without a
final closure phase and a larger w. All single run results reported in the next section include
a final closure phase.
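The effect of the closure on a single pass can be shown with a small sketch. Here the closure is computed naively by iterating to a fixpoint, which is an assumption made for brevity rather than the method used in the experiments. The function `close_pairs` and the three-record list are hypothetical.

```python
def close_pairs(pairs):
    """Naive transitive closure: add inferred pairs until nothing changes."""
    closed = {tuple(sorted(p)) for p in pairs}
    changed = True
    while changed:
        changed = False
        for a, b in list(closed):
            for c, d in list(closed):
                shared = {a, b} & {c, d}
                if len(shared) == 1:
                    # The two pairs share one record, so the other two are linked.
                    x, y = sorted(({a, b} | {c, d}) - shared)
                    if (x, y) not in closed:
                        closed.add((x, y))
                        changed = True
    return closed


# Sorted order a, b, c with w = 2: each record is compared only with its predecessor.
sorted_ids = ["a", "b", "c"]
w = 2
compared = {
    (sorted_ids[j], sorted_ids[i])
    for i in range(len(sorted_ids))
    for j in range(max(0, i - (w - 1)), i)
}
print(sorted(compared))  # [('a', 'b'), ('b', 'c')]

# Suppose the equational theory marks both compared pairs as similar.
closed = close_pairs(compared)
print(sorted(closed))  # [('a', 'b'), ('a', 'c'), ('b', 'c')]
```

Records a and c are two positions apart, so a window of size 2 never compares them, yet the closure still reports them as similar.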
The utility of this approach is therefore determined by the nature and occurrences of the
errors appearing in the data. The choice of keys for sorting, their order, and the extraction of
relevant information from a key field is a knowledge-intensive activity that must be explored
and carefully evaluated prior to running a data cleansing process.
In the next section we will show how the multi-pass approach can drastically improve
on the accuracy obtained from a single run of the sorted-neighborhood method, even when
that run uses large windows. Of particular interest is the observation that only a small search window
was needed for the multi-pass approach to obtain high accuracy while no individual run with
a single key for sorting produced comparable accuracy results with a large window (other
than window sizes approaching the size of the full database). These results were found
consistently over a variety of generated databases with variable errors introduced in all fields
in a systematic fashion.
3 Experimental Results
3.1 Generating the databases
All databases used to test these methods were generated automatically by a database generator that allows us to perform controlled studies and to establish the accuracy of the solution
method. This database generator provides a user with a large number of parameters that
they may set, including the size of the database, the percentage of duplicate records in the
database, and the amount of error to be introduced...
This document was uploaded on 02/15/2014.