e also reducing the
time complexity. Thus, we consider alternative metrics for the purposes of merge/purge to
include how accurately one can cleanse data for a fixed dollar amount and a given time constraint,
rather than the specific cost- and time-based metrics proposed in [20].

2.2 Selection of Keys
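As a rough illustration of the kind of key this section describes, the Python sketch below concatenates the first three consonants of the last name, the first three letters of the first name, the address number, the consonants of the street name, and the first three digits of the social security number. The function names, the treatment of Y as a consonant, the simple address parsing, and the sample name and ID values are assumptions made for illustration; only the two addresses and the resulting key string come from Table 1.

```python
VOWELS = set("AEIOU")  # assumption: Y is treated as a consonant


def consonants(text: str) -> str:
    """Return the non-vowel letters of `text`, uppercased, in order."""
    return "".join(c for c in text.upper() if c.isalpha() and c not in VOWELS)


def build_key(last_name: str, first_name: str, address: str, ssn: str) -> str:
    """Build a sort key following the recipe described in this section."""
    # Assumes addresses look like "<number> <street name> <suffix>".
    number, street_name = address.split()[:2]
    return (
        consonants(last_name)[:3]    # first three consonants of the last name
        + first_name.upper()[:3]     # first three letters of the first name
        + number                     # address number
        + consonants(street_name)    # all consonants of the street name
        + ssn[:3]                    # first three digits of the SSN
    )


# Hypothetical name and ID values; the addresses are those of Table 1.
print(build_key("Stoller", "Sal", "123 First Street", "456000000"))
# -> STLSAL123FRST456
print(build_key("Stoller", "Sal", "123 Forest Street", "456000000"))
# -> STLSAL123FRST456
```

Because only consonants of the street name are kept, "First" and "Forest" both reduce to FRST, which is how two different addresses can end up with the same key.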
The effectiveness of the sorted-neighborhood method depends heavily on the key selected
to sort the records. Here a key is defined to be a sequence of a subset of attributes, or
substrings within the attributes, chosen from the record. For example, consider the four
records displayed in Table 1. For this particular application, suppose the "key designer"
for the sorting phase has determined that for a typical data set the following keys should
be extracted from the data since they provide sufficient discriminating power in identifying

Address              ID     Key
123 First Street     ...    STLSAL123FRST456
123 First Street     ...    STLSAL123FRST456
123 First Street     ...    STLSAL123FRST456
123 Forest Street    ...    STLSAL123FRST456

Table 1: Example Records and Keys

likely candidates for matching. The key consists of the concatenation of several ordered
fields (or attributes) in the data: the first three consonants of the last name are concatenated
with the first three letters of the first name field, followed by the address number field
and all of the consonants of the street name. This is followed by the first three digits of
the social security field. These choices are made since the key designer determined that
last names are typically misspelled (due to mistakes in vocalized sounds, vowels), but first
names are typically more common and less prone to being misunderstood and hence less
likely to be recorded incorrectly. The keys are now used for sorting the entire dataset with
the intention that all equivalent or matching data will appear close to each other in the
final sorted list. Notice how the first and second records are exact duplicates, while the
third is likely the same person but with a misspelled last name. We would expect that this
"phonetically-based" mistake will be caught by a reasonable equational theory. However,
the fourth record, although having the exact same key as the prior three records, appears
unlikely to be the same person.

2.3 Equational theory
The comparison of records, during the merge phase, to determine their equivalence is a
complex inferential process that considers much more information in the compared records
than the keys used for sorting. For example, suppose two person names are spelled nearly
(but not) identically, and have the exact same address. We might infer they are the same
person. On the other hand, suppose two records have exactly the same social security
numbers, but the names and addresses are completely different. We could either assume
the records represent the same person who changed his name and moved, or the records
represent different persons, and the social security number field is incorrect for one of them.
Without any further information, we may perhaps assume the latter. The more information
there is in the records, the better inferences can be made. For example, Michael Smith
and Michele Smith could have the same address, and their names are "reasonably close".
If gender and age information is available in some field of the data, we could perhaps infer
that Michael and Michele are either married or siblings.
What we need to specify for these inferences is an equational theory that dictates the logic
of domain equivalence, not simply value or string equivalence. Users of a general purpose
data cleansing facility benefit from higher level formalisms and languages permitting ease of
experimentation and modification. For these reasons, a natural approach to specifying an
equational theory and making it practical would be the use of a declarative rule language.
Rule languages have been effectively used in a wide range of applications requiring inference
over large data sets. Much research has been conducted to provide efficient means for their
compilation and evaluation, and this technology can be exploited here for purposes of
efficient data cleansing.
As an example, here is a simplified rule in English that exemplifies one axiom of our
equational theory relevant to our idealized employee database:
Given two records, r1 and r2.
IF the last name of r1 equals the last name of r2,
AND the first names differ slightly,
AND the address of r1 equals the address of r2
THEN
r1 is equivalent to r2.

The implementation of "differ slightly" specified here in English is based upon the
computation of a distance function applied to the first name fields of two records, and
the comparison of its results to a threshold to capture obvious typographical errors that
may occur in the data. The selection of a distance function and a proper threshold is
also a knowledge intensive activity that demands experimental evaluation. An improperly
chosen threshold will lead to either an increase in the number of falsely matched records
or to a decrease in the num...
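
The rule above, and its sensitivity to the threshold, can be sketched in Python as follows. This is only an illustration under stated assumptions: the text does not name a particular distance function, so Levenshtein edit distance is swapped in here, and the `Record` type, the function names, the default threshold, and the John/Jon sample names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    first_name: str
    last_name: str
    address: str


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance (an assumed choice of distance function)."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute (or keep) the character
            ))
        previous = current
    return previous[-1]


def differ_slightly(x: str, y: str, threshold: int) -> bool:
    """True when the names differ, but by at most `threshold` edits."""
    distance = edit_distance(x.lower(), y.lower())
    return 0 < distance <= threshold


def equivalent(r1: Record, r2: Record, threshold: int = 1) -> bool:
    """One axiom of the equational theory, as stated in the English rule."""
    return (
        r1.last_name == r2.last_name
        and differ_slightly(r1.first_name, r2.first_name, threshold)
        and r1.address == r2.address
    )


john = Record("John", "Smith", "123 First Street")
jon = Record("Jon", "Smith", "123 First Street")
michael = Record("Michael", "Smith", "123 First Street")
michele = Record("Michele", "Smith", "123 First Street")

print(equivalent(john, jon, threshold=1))         # True: one dropped letter
print(equivalent(michael, michele, threshold=1))  # False: distance is 2
print(equivalent(michael, michele, threshold=2))  # True: a likely false match
```

With a threshold of 1 the rule catches the single-letter typo but leaves Michael and Michele apart; raising it to 2 merges them, which illustrates how a loosely chosen threshold can increase the number of falsely matched records. Records with identical first names are deliberately not matched by this rule and would need a separate axiom.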
This document was uploaded on 02/15/2014.
- Spring '14