ber of matching records that should be merged. A number of
alternative distance functions for typographical mistakes were implemented and tested in the
experiments reported below, including distances based upon edit distance, phonetic distance
and "typewriter" distance. The results displayed in section 3 are based upon edit distance
computation, since the outcome of the program did not vary much among the different
distance functions for the particular databases used in our study.
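
As an illustration of the edit-distance computation mentioned above, the following is a minimal sketch in C. It assumes the standard Levenshtein definition (insertions, deletions and substitutions, each of cost 1), which may differ from the exact function used in the experiments. The names edit_distance and similar_strings, and the threshold parameter, are hypothetical.

```c
#include <stdlib.h>
#include <string.h>

/* Smallest of three values. */
static size_t min3(size_t a, size_t b, size_t c)
{
    size_t m = a < b ? a : b;
    return m < c ? m : c;
}

/* Levenshtein distance (assumed definition): the minimum number of
 * single-character insertions, deletions and substitutions that turn
 * s into t. Returns (size_t)-1 if memory cannot be allocated. */
size_t edit_distance(const char *s, const char *t)
{
    size_t n = strlen(s), m = strlen(t);
    size_t *prev = malloc((m + 1) * sizeof *prev);
    size_t *curr = malloc((m + 1) * sizeof *curr);
    if (!prev || !curr) {
        free(prev);
        free(curr);
        return (size_t)-1;
    }

    /* Distance from the empty prefix of s to each prefix of t. */
    for (size_t j = 0; j <= m; j++)
        prev[j] = j;

    for (size_t i = 1; i <= n; i++) {
        curr[0] = i;
        for (size_t j = 1; j <= m; j++) {
            size_t cost = (s[i - 1] == t[j - 1]) ? 0 : 1;
            curr[j] = min3(prev[j] + 1,          /* delete s[i-1]  */
                           curr[j - 1] + 1,      /* insert t[j-1]  */
                           prev[j - 1] + cost);  /* substitute     */
        }
        size_t *tmp = prev;   /* the row just filled becomes "previous" */
        prev = curr;
        curr = tmp;
    }

    size_t d = prev[m];
    free(prev);
    free(curr);
    return d;
}

/* Hypothetical helper: two strings count as "similar" when their edit
 * distance does not exceed a caller-chosen threshold. */
int similar_strings(const char *s, const char *t, size_t threshold)
{
    return edit_distance(s, t) <= threshold;
}
```

For instance, edit_distance("144 Wars St.", "144 Ward St.") is 1, since a single substitution separates the two strings.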
Notice that rules do not necessarily need to compare values from the same attribute (or
same domain). For instance, to detect a transposition in a person's name we could write a
rule that compares the first name of one record with the last name of the second record and
the last name of the first record with the first name of the second record (see appendix A for
such an example rule). Modern object-relational databases allow users to add complex data
types (and functions to manipulate values in the domain of the data type) to the database
engine. Functions to compare these complex data types (e.g., sets, images, sound, etc.) could
also be used within rules to perform the matching of complex tuples.
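
The cross-attribute comparison described above can be sketched in C as follows. This is not the rule from appendix A; the struct layout and the function name names_transposed are hypothetical, and exact string equality stands in for whatever distance function and threshold a real rule would use.

```c
#include <string.h>

/* Hypothetical record layout: only the two name fields are modeled. */
struct person {
    const char *first;
    const char *last;
};

/* Returns 1 when the first name of each record equals the last name of
 * the other, i.e. the two name fields appear to have been swapped.
 * Exact equality is an assumption made to keep the sketch short. */
int names_transposed(const struct person *a, const struct person *b)
{
    return strcmp(a->first, b->last) == 0 &&
           strcmp(a->last, b->first) == 0;
}
```

Replacing strcmp with a thresholded distance function would let the same rule tolerate typographical errors in the swapped fields.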
For the purpose of experimental study, we wrote an OPS5 [13] rule program consisting
of 26 rules for this particular domain of employee records, which was tested repeatedly over
relatively small databases of records. Once we were satisfied with the performance of our
rules, distance functions, and thresholds, we recoded the rules directly in C to obtain speedup over the OPS5 implementation.
Appendix A shows the OPS5 version of the equational theory implemented for this work.
Only those rules encoding the knowledge of the equational theory are shown in the appendix.
The inference process encoded in the rules is divided into three stages. In the first stage,
all records within a window are compared to see if they have "similar" fields, namely, the
social security field, the name field, and the street address field. In the second stage, the
information gathered during the first stage is combined to see if pairs of records can be merged.
For example, if a pair of records have similar social security numbers and similar names
then the rule similar-ssn-and-names declares them merged. For those pairs of records that
could not be merged because not enough information was gathered in the first stage, the rule
program takes a closer look at other fields like the city name, state and zipcode to see if a
merge can be done. Otherwise, in the third stage, more precise "edit-distance" functions are
used over some fields as a last attempt for merging a pair of records. Table 2 shows
a number of actual records the rule program correctly deems equivalent.

SSN
    950982319

Name (First, Initial, Last)
    Diana D. Ambrosion
    Diana A. Dambrosion
    Ivette A Keegan
    Yvette A Kegan

Address
    144 Wars St.
    144 Ward St.
    38 Ward St.
    38 Ward St.
    40 Brik Church Av.
    40 Brick Church Av.
    600 113th St. apt. 5a5
    600 113th St. ap. 585
    23 Florida Av.
    23 Florida St.

Table 2: Example of matching records detected by our equational theory rule base.
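
The three-stage process described above can be sketched in C as follows. This is a simplified illustration under stated assumptions, not the rule program of the appendices: the record layout, the cheap loosely_equal test used in the first stage, the particular field combinations in the second stage, and the MAX_EDITS threshold are all hypothetical.

```c
#include <ctype.h>
#include <string.h>

struct record {                       /* hypothetical field layout */
    const char *ssn, *name, *street, *city, *state, *zip;
};

/* Cheap test (assumed): equal when case and non-alphanumeric
 * characters are ignored. */
static int loosely_equal(const char *a, const char *b)
{
    for (;;) {
        while (*a && !isalnum((unsigned char)*a)) a++;
        while (*b && !isalnum((unsigned char)*b)) b++;
        if (!*a || !*b) return *a == *b;
        if (tolower((unsigned char)*a) != tolower((unsigned char)*b)) return 0;
        a++; b++;
    }
}

/* Levenshtein distance using one row; fields longer than MAXLEN are
 * simply reported as far apart. */
#define MAXLEN 63
static size_t edit_distance(const char *s, const char *t)
{
    size_t n = strlen(s), m = strlen(t), row[MAXLEN + 1];
    if (n > MAXLEN || m > MAXLEN) return MAXLEN + 1;
    for (size_t j = 0; j <= m; j++) row[j] = j;
    for (size_t i = 1; i <= n; i++) {
        size_t diag = row[0];         /* cell (i-1, j-1) */
        row[0] = i;
        for (size_t j = 1; j <= m; j++) {
            size_t up = row[j];       /* cell (i-1, j) */
            size_t best = diag + (s[i - 1] != t[j - 1]);
            if (up + 1 < best) best = up + 1;
            if (row[j - 1] + 1 < best) best = row[j - 1] + 1;
            row[j] = best;
            diag = up;
        }
    }
    return row[m];
}

enum { SIM_SSN = 1, SIM_NAME = 2, SIM_STREET = 4 };

/* Stage 1: note which of the three main fields look similar. */
static unsigned stage1(const struct record *a, const struct record *b)
{
    unsigned f = 0;
    if (loosely_equal(a->ssn, b->ssn))       f |= SIM_SSN;
    if (loosely_equal(a->name, b->name))     f |= SIM_NAME;
    if (loosely_equal(a->street, b->street)) f |= SIM_STREET;
    return f;
}

/* Stage 2: combine the stage-1 evidence, consulting city, state and
 * zipcode when the main fields alone are not enough. */
static int stage2(unsigned f, const struct record *a, const struct record *b)
{
    if ((f & SIM_SSN) && (f & SIM_NAME))
        return 1;                     /* cf. similar-ssn-and-names */
    int same_place = loosely_equal(a->city, b->city) &&
                     loosely_equal(a->state, b->state) &&
                     loosely_equal(a->zip, b->zip);
    /* Assumed rule: similar street, one more similar field, same locality. */
    return same_place && (f & SIM_STREET) && (f & (SIM_SSN | SIM_NAME));
}

/* Stage 3: last attempt with edit distance on name and street.
 * MAX_EDITS is an assumed threshold, not a value from the paper. */
#define MAX_EDITS 2
static int stage3(const struct record *a, const struct record *b)
{
    return edit_distance(a->name, b->name) <= MAX_EDITS &&
           edit_distance(a->street, b->street) <= MAX_EDITS;
}

/* Returns 1 when the pair should be declared merged. */
int should_merge(const struct record *a, const struct record *b)
{
    unsigned f = stage1(a, b);
    return stage2(f, a, b) || stage3(a, b);
}
```

Ordering the stages from cheap to expensive means the edit-distance computation runs only for pairs that the earlier stages could not decide.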
Appendix B shows the C version of the equational theory. The appendix only shows the
subroutine rule_program(), which is the main code for the rule implementation in C. The
comments in the code show where each rule of the OPS5 version is implemented.
It is important to note that the essence of the approach proposed here permits a wide
range of equational theories on various data types. We chose to use string data in this study
(e.g., names, addresses) for pedagogical reasons (after all, everyone gets "faulty" junk mail).
We could equally well demonstrate the concepts using alternative databases of differently
typed objects and correspondingly different rule sets.
Table 2 displays records with errors of the kind commonly found in mailing lists.
(Indeed, poor implementations of the merge/purge task by commercial organizations
typically lead to several copies of the same mail being sent, at obviously greater
expense, to the same household, as nearly everyone has experienced.) These records are
identified by our rule base as equivalent.
The process of creating a good equational theory is similar to the process of creating
a good knowledge base for an expert system. In complex problems, an expert is needed to
describe the matching process. A knowledge engineer will then encode the expert's knowledge
as rules. The rules will then be tested and the results discussed with the expert. Several
sessions between the expert and the knowledge engineer might be needed before the rule set
is completed.

2.4 Computing the transitive closure over the results of independent runs

In general, no single key will be sufficient to catch all matching records. The attributes or
fields that appear first in the key have higher discriminating power than those appearing
after them. Hence, if the error in a record occurs in the particular field or portion of the
field that is the most important part of the key, there may be little chance a r...