Real-world Data is Dirty: Data Cleansing and the Merge/Purge Problem


...ber of matching records that should be merged. Eight alternative distance functions for typographical mistakes were implemented and tested in the experiments reported below, including distances based upon edit distance, phonetic distance, and "typewriter" distance. The results displayed in section 3 are based upon edit distance computation, since the outcome of the program did not vary much among the different distance functions for the particular databases used in our study.

Notice that rules do not necessarily need to compare values from the same attribute (or same domain). For instance, to detect a transposition in a person's name we could write a rule that compares the first name of one record with the last name of the second record, and the last name of the first record with the first name of the second record (see appendix A for such an example rule). Modern object-relational databases allow users to add complex data types (and functions to manipulate values in the domain of the data type) to the database engine. Functions to compare these complex data types (e.g., sets, images, sound, etc.) could also be used within rules to perform the matching of complex tuples.

For the purpose of experimental study, we wrote an OPS5 [13] rule program consisting of 26 rules for this particular domain of employee records, and tested it repeatedly over relatively small databases of records. Once we were satisfied with the performance of our rules, distance functions, and thresholds, we recoded the rules directly in C to obtain speedup over the OPS5 implementation. Appendix A shows the OPS5 version of the equational theory implemented for this work. Only those rules encoding the knowledge of the equational theory are shown in the appendix.

The inference process encoded in the rules is divided into three stages.
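As a concrete illustration of the cross-attribute rule mentioned above, the following is a minimal Python sketch, not the OPS5 rule from appendix A. It assumes records are dictionaries with hypothetical `first` and `last` keys, uses plain Levenshtein edit distance, and picks an arbitrary threshold of 2; none of these choices come from the paper.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed with the standard two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def similar(a: str, b: str, threshold: int = 2) -> bool:
    """Case-insensitive closeness test; the threshold is an assumed value."""
    return edit_distance(a.lower(), b.lower()) <= threshold


def names_transposed(rec1: dict, rec2: dict) -> bool:
    """Compare the first name of each record with the last name of the other."""
    return (similar(rec1["first"], rec2["last"])
            and similar(rec1["last"], rec2["first"]))
```

Under these assumptions, a pair such as "Colette Johnen" and "John Colette" would be flagged as a likely transposition, while a pair that merely shares a first name would not.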
In the first stage, all records within a window are compared to see if they have "similar" fields, namely, the social security field, the name field, and the street address field. In the second stage, the information gathered during the first stage is combined to see if pairs of records can be merged. For example, if a pair of records have similar social security numbers and similar names, then the rule similar-ssn-and-names declares them merged. For those pairs of records that could not be merged because not enough information was gathered in the first stage, the rule program takes a closer look at other fields like the city name, state, and zipcode to see if a merge can be done. Otherwise, in the third stage, more precise "edit-distance" functions are used over some fields as a last attempt for merging a pair of records. Table 2 demonstrates a number of actual records the rule-program correctly deems equivalent.

SSN         Name (First, Initial, Last)   Address
334600443   Lisa Boardman                 144 Wars St.
334600443   Lisa Brown                    144 Ward St.
525520001   Ramon Bonilla                 38 Ward St.
525250001   Raymond Bonilla               38 Ward St.
0           Diana D. Ambrosion            40 Brik Church Av.
0           Diana A. Dambrosion           40 Brick Church Av.
0           Colette Johnen                600 113th St. apt. 5a5
0           John Colette                  600 113th St. ap. 585
850982319   Ivette A Keegan               23 Florida Av.
950982319   Yvette A Kegan                23 Florida St.

Table 2: Example of matching records detected by our equational theory rule base.

Appendix B shows the C version of the equational theory. The appendix only shows the subroutine rule program(), which is the main code for the rule implementation in C. The comments in the code show where each rule of the OPS5 version is implemented.

It is important to note that the essence of the approach proposed here permits a wide range of equational theories on various data types. We chose to use string data in this study (e.g., names, addresses) for pedagogical reasons (after all, everyone gets "faulty" junk mail).
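The first two stages described above can be sketched roughly as follows. This is an illustrative Python sketch under stated assumptions, not the OPS5 or C rule program from the appendices: the `Record` class, the per-field thresholds, and the treatment of an SSN of "0" as missing are all hypothetical. Only the similar-ssn-and-names rule is named in the text; the fallback to city, state, and zipcode and the third stage are omitted because their details are not given here.

```python
from dataclasses import dataclass


@dataclass
class Record:
    ssn: str
    name: str
    address: str


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed with the standard two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def close(a: str, b: str, max_dist: int) -> bool:
    """Case-insensitive closeness test against an assumed threshold."""
    return edit_distance(a.lower(), b.lower()) <= max_dist


def stage_one(r1: Record, r2: Record) -> dict:
    """Stage 1: record which of the three fields look similar."""
    ssn_missing = "0" in (r1.ssn, r2.ssn)  # assumption: "0" means no SSN
    return {
        "ssn": not ssn_missing and close(r1.ssn, r2.ssn, 2),
        "name": close(r1.name, r2.name, 3),
        "address": close(r1.address, r2.address, 3),
    }


def similar_ssn_and_names(flags: dict) -> bool:
    """Stage 2 rule named in the text: similar SSNs and similar names merge."""
    return flags["ssn"] and flags["name"]


# Two record pairs taken from Table 2.
ivette = Record("850982319", "Ivette A Keegan", "23 Florida Av.")
yvette = Record("950982319", "Yvette A Kegan", "23 Florida St.")
diana_1 = Record("0", "Diana D. Ambrosion", "40 Brik Church Av.")
diana_2 = Record("0", "Diana A. Dambrosion", "40 Brick Church Av.")

print(similar_ssn_and_names(stage_one(ivette, yvette)))
print(similar_ssn_and_names(stage_one(diana_1, diana_2)))
```

With these assumed thresholds, the Keegan/Kegan pair is merged by the SSN-and-name rule alone, whereas the Ambrosion/Dambrosion pair has no usable SSN and would have to be resolved by the later rules that this sketch leaves out.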
We could equally well demonstrate the concepts using alternative databases of differently typed objects and correspondingly different rule sets. Table 2 displays records with such errors that may commonly be found in mailing lists, for example. (Indeed, poor implementations of the merge/purge task by commercial organizations typically lead to several pieces of the same mail being mailed at obviously greater expense to the same household, as nearly everyone has experienced.) These records are identified by our rule base as equivalent.

The process of creating a good equational theory is similar to the process of creating a good knowledge base for an expert system. In complex problems, an expert is needed to describe the matching process. A knowledge engineer will then encode the expert's knowledge as rules. The rules will then be tested and the results discussed with the expert. Several sessions between the expert and the knowledge engineer might be needed before the rule set is completed.

2.4 Computing the transitive closure over the results of independent runs

In general, no single key will be sufficient to catch all matching records. The attributes or fields that appear first in the key have higher discriminating power than those appearing after them. Hence, if the error in a record occurs in the particular field or portion of the field that is the most important part of the key, there may be little chance a r...

This document was uploaded on 02/15/2014.
