Real-world Data is Dirty: Data Cleansing and The Merge/Purge Problem


…while also reducing the time complexity. Thus, we consider alternative metrics for the purposes of merge/purge to include how accurately one can cleanse data for a fixed dollar amount and a given time constraint, rather than the specific cost- and time-based metrics proposed in [20].

2.2 Selection of Keys

The effectiveness of the sorted-neighborhood method depends heavily on the key selected to sort the records. Here a key is defined to be a sequence of a subset of attributes, or substrings within the attributes, chosen from the record. For example, consider the four records displayed in Table 1. For this particular application, suppose the "key designer" for the sorting phase has determined that, for a typical data set, the following keys should be extracted from the data, since they provide sufficient discriminating power in identifying likely candidates for matching.

    First   Last      Address             ID         Key
    Sal     Stolfo    123 First Street    45678987   STLSAL123FRST456
    Sal     Stolfo    123 First Street    45678987   STLSAL123FRST456
    Sal     Stolpho   123 First Street    45678987   STLSAL123FRST456
    Sal     Stiles    123 Forest Street   45654321   STLSAL123FRST456

    Table 1: Example Records and Keys

The key consists of the concatenation of several ordered fields (or attributes) in the data: the first three consonants of the last name are concatenated with the first three letters of the first-name field, followed by the address-number field, all of the consonants of the street name, and finally the first three digits of the social security field. These choices were made because the key designer determined that last names are typically misspelled (due to mistakes in vocalized sounds and vowels), whereas first names are more common, less prone to being misunderstood, and hence less likely to be recorded incorrectly.

The keys are then used to sort the entire dataset, with the intention that all equivalent or matching data will appear close to each other in the final sorted list. Notice that the first and second records are exact duplicates, while the third is likely the same person but with a misspelled last name. We would expect this "phonetically-based" mistake to be caught by a reasonable equational theory. The fourth record, however, although it has exactly the same key as the prior three records, appears unlikely to be the same person.
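As a concrete illustration of this key-extraction recipe, the following sketch computes the key of Table 1 in Python. The helper names (consonants, build_key) and the pre-split record fields are assumptions of the sketch, not part of the original system:

    def consonants(s):
        """Uppercase s and keep only its consonants, in order."""
        return "".join(c for c in s.upper() if c.isalpha() and c not in "AEIOU")

    def build_key(first, last, addr_no, street, ssn):
        """First three consonants of the last name + first three letters of
        the first name + address number + all consonants of the street
        name + first three digits of the social security number."""
        return (consonants(last)[:3] + first.upper()[:3] + addr_no
                + consonants(street) + ssn[:3])

    # All four records of Table 1 collapse to the same key:
    print(build_key("Sal", "Stolfo",  "123", "First",  "45678987"))   # STLSAL123FRST456
    print(build_key("Sal", "Stolpho", "123", "First",  "45678987"))   # STLSAL123FRST456
    print(build_key("Sal", "Stiles",  "123", "Forest", "45654321"))   # STLSAL123FRST456

Sorting the dataset on such a computed key is what brings likely matches, such as the first three records, into the same neighborhood; the collision with the fourth record shows why the key alone cannot decide equivalence.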
2.3 Equational theory

The comparison of records during the merge phase, to determine their equivalence, is a complex inferential process that considers much more information in the compared records than the keys used for sorting. For example, suppose two person names are spelled nearly (but not) identically and have the exact same address. We might infer they are the same person. On the other hand, suppose two records have exactly the same social security numbers, but the names and addresses are completely different. We could either assume the records represent the same person, who changed his name and moved, or that the records represent different persons and the social security number field is incorrect for one of them. Without any further information, we may perhaps assume the latter. The more information there is in the records, the better the inferences that can be made. For example, Michael Smith and Michele Smith could have the same address, and their names are "reasonably close". If gender and age information is available in some field of the data, we could perhaps infer that Michael and Michele are either married or siblings.

What we need to specify for these inferences is an equational theory that dictates the logic of domain equivalence, not simply value or string equivalence. Users of a general-purpose data cleansing facility benefit from higher-level formalisms and languages permitting ease of experimentation and modification. For these reasons, a natural approach to specifying an equational theory and making it practical is to use a declarative rule language. Rule languages have been used effectively in a wide range of applications requiring inference over large data sets, and much research has been conducted to provide efficient means for their compilation and evaluation; this technology can be exploited here to perform data cleansing efficiently. As an example, here is a simplified rule in English that exemplifies one axiom of our equational theory, relevant to our idealized employee database:

    Given two records, r1 and r2.
    IF the last name of r1 equals the last name of r2,
    AND the first names differ slightly,
    AND the address of r1 equals the address of r2
    THEN r1 is equivalent to r2.

The implementation of "differ slightly", specified here in English, is based upon the computation of a distance function applied to the first-name fields of two records and the comparison of its result to a threshold, in order to capture obvious typographical errors that may occur in the data. The selection of a distance function and a proper threshold is also a knowledge-intensive activity that demands experimental evaluation. An improperly chosen threshold will lead either to an increase in the number of falsely matched records or to a decrease in the number of matching records.
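One hedged rendering of this rule in Python follows; the choice of Levenshtein edit distance and the threshold value of 2 are assumptions made for the sketch, since the text leaves both the distance function and the threshold to experimental evaluation:

    def edit_distance(a, b):
        """Levenshtein distance computed by dynamic programming."""
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            cur = [i]
            for j, cb in enumerate(b, 1):
                cur.append(min(prev[j] + 1,                 # deletion
                               cur[j - 1] + 1,              # insertion
                               prev[j - 1] + (ca != cb)))   # substitution
            prev = cur
        return prev[-1]

    THRESHOLD = 2  # assumed value; a proper threshold demands experiments

    def differ_slightly(a, b):
        """First names 'differ slightly' if their distance is within the threshold."""
        return edit_distance(a.lower(), b.lower()) <= THRESHOLD

    def equivalent(r1, r2):
        """The example axiom: same last name, same address, and first
        names that differ slightly imply the records are equivalent."""
        return (r1["last"] == r2["last"]
                and r1["address"] == r2["address"]
                and differ_slightly(r1["first"], r2["first"]))

Under these assumptions, the Michael Smith / Michele Smith pair from the text would match (edit distance 2); raising the threshold merges more genuinely distinct names (false matches), while lowering it misses true ones, which is exactly the trade-off described above.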