Real-world Data is Dirty: Data Cleansing and The Merge/Purge Problem

...record will end up close to a matching record after sorting. For instance, if an employee has two records in the database, one with social security number 193456782 and another with social security number 913456782 (the first two numbers were transposed), and if the social security number is used as the principal field of the key, then it is very unlikely both records will fall under the same window, i.e., the two records with transposed social security numbers will be far apart in the sorted list and hence they may not be merged. As we will show in the next section, the number of matching records missed by one run of the sorted-neighborhood method can be large unless the neighborhood grows very large.

To increase the number of similar records merged, two options were explored. The first is simply widening the scanning window size by increasing w. Clearly this increases the computational complexity and, as discussed in the next section, does not dramatically increase the number of similar records merged in the test cases we ran (unless of course the window spans the entire database, which we have presumed is infeasible under strict time and cost constraints). The alternative strategy we implemented is to execute several independent runs of the sorted-neighborhood method, each time using a different key and a relatively small window. We call this strategy the multi-pass approach. For instance, in one run we use the address as the principal part of the key, while in another run we use the last name of the employee as the principal part of the key. Each independent run will produce a set of pairs of records which can be merged. We then apply the transitive closure to those pairs of records. The results will be a union of all pairs discovered by all independent runs, with no duplicates, plus all those pairs that can be inferred by transitivity of equality.
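A minimal sketch of the multi-pass sorted-neighborhood scan described above. The key functions, window size, and match predicate here are illustrative assumptions, not the paper's actual rule base; in particular, the equational theory is reduced to a single comparison function.

```python
def snm_pass(records, key, w, match):
    """One sorted-neighborhood pass: sort record indices by `key`, then
    compare each record only against the w-1 records that precede it in
    the sorted order, collecting matching index pairs."""
    order = sorted(range(len(records)), key=lambda i: key(records[i]))
    pairs = set()
    for pos, i in enumerate(order):
        for j in order[max(0, pos - w + 1):pos]:
            if match(records[i], records[j]):
                pairs.add((min(i, j), max(i, j)))
    return pairs

def multi_pass(records, keys, w, match):
    """Multi-pass approach: one independent pass per sort key, taking
    the union of all discovered pairs (transitive closure is applied
    to this union afterwards)."""
    pairs = set()
    for key in keys:
        pairs |= snm_pass(records, key, w, match)
    return pairs
```

With a small window, a pass sorted on the (transposed) social security number misses the duplicate pair, while an additional pass sorted on the name field recovers it, which is exactly the effect the multi-pass approach exploits.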
The reason this approach works for the test cases explored here has much to do with the nature of the errors in the data. Transposing the first two digits of the social security number leads to non-mergeable records, as we noted. However, in such records, the variability or error appearing in another field of the records may not be so large. Therefore, although the social security numbers in two records are grossly in error, the name fields may not be. Hence, first sorting on the name fields as the primary key will bring these two records closer together, lessening the negative effects of a gross error in the social security field.

Notice that the use of a transitive closure step is not limited to the multi-pass approach. We can improve the accuracy of a single pass by computing the transitive closure of the results. If records a and b are found to be similar and, at the same time, records b and c are also found to be similar, the transitive closure step can mark a and c as similar even if this relation was not detected by the equational theory. Moreover, records a and b must be within w records of each other to be marked as similar by the equational theory. The same is true for records b and c. But if the transitive closure step is used, a and c need not be within w records to be detected as similar. The use of a transitive closure at the end of any single-pass run of the sorted-neighborhood method should allow us to reduce the size of the scanning window w and still detect a number of similar pairs comparable to what we would find without a final closure phase and a larger w. All single-run results reported in the next section include a final closure phase.

The utility of this approach is therefore determined by the nature and occurrence of the errors appearing in the data. The choice of keys for sorting, their order, and the extraction of relevant information from a key field is a knowledge-intensive activity that must be explored and carefully evaluated prior to running a data cleansing process.
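The transitive closure step described above amounts to computing connected components over the discovered pairs: if a~b and b~c were detected, then a~c is inferred. A small union-find sketch (the pair set in the usage below is illustrative):

```python
def transitive_closure(pairs, n):
    """Close the 'similar' relation over record ids 0..n-1 using
    union-find: two records end up in the same group if and only if
    they are connected by a chain of discovered pairs."""
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in pairs:
        parent[find(a)] = find(b)

    # Group record ids by their root; groups of size > 1 are the
    # merged clusters, containing all pairs implied by transitivity.
    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return [g for g in groups.values() if len(g) > 1]
```

For example, from the pairs (0,1), (1,2), and (5,6) over seven records, the closure yields the clusters {0,1,2} and {5,6}; the inferred pair (0,2) need never have fallen inside any scanning window.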
In the next section we will show how the multi-pass approach can drastically improve the accuracy of the results over only one run of the sorted-neighborhood method with varying large windows. Of particular interest is the observation that only a small search window was needed for the multi-pass approach to obtain high accuracy, while no individual run with a single key for sorting produced comparable accuracy results with a large window (other than window sizes approaching the size of the full database). These results were found consistently over a variety of generated databases with variable errors introduced in all fields in a systematic fashion.

3 Experimental Results

3.1 Generating the databases

All databases used to test these methods were generated automatically by a database generator that allows us to perform controlled studies and to establish the accuracy of the solution method. This database generator provides the user with a large number of parameters that they may set, including the size of the database, the percentage of duplicate records in the database, and the amount of error to be introduced...
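A toy version of such a generator can be sketched as follows. The parameters (database size, fraction of duplicated records) follow the description above, but the record schema and the specific error operations, transposing the leading SSN digits or dropping a name character, are illustrative assumptions:

```python
import random
import string

def make_record(rng):
    """Generate one clean base record (illustrative two-field schema)."""
    return {
        "ssn": "".join(rng.choice("123456789") for _ in range(9)),
        "name": "".join(rng.choice(string.ascii_lowercase) for _ in range(8)),
    }

def perturb(rec, rng):
    """Return a noisy duplicate of `rec`: either transpose the first two
    SSN digits or drop one character from the name."""
    dup = dict(rec)
    if rng.random() < 0.5:
        s = dup["ssn"]
        dup["ssn"] = s[1] + s[0] + s[2:]
    else:
        i = rng.randrange(len(dup["name"]))
        dup["name"] = dup["name"][:i] + dup["name"][i + 1:]
    return dup

def generate(size, dup_fraction, seed=0):
    """Generate `size` base records plus noisy duplicates for a
    `dup_fraction` share of them. Returns (records, true_pairs), where
    true_pairs is the ground truth used to score a cleansing run."""
    rng = random.Random(seed)
    records = [make_record(rng) for _ in range(size)]
    true_pairs = []
    for i in range(int(size * dup_fraction)):
        records.append(perturb(records[i], rng))
        true_pairs.append((i, size + i))
    return records, true_pairs
```

Because the generator records exactly which pairs are true duplicates, accuracy of a merge/purge run can be measured directly as the fraction of `true_pairs` recovered.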