Real-world Data is Dirty: Data Cleansing and The Merge/Purge Problem

...difficult. The first issue, where databases have different schema, has been addressed extensively in the literature and is known as the schema integration problem [3]. We are primarily interested in the second problem: heterogeneous representations of data and its implication when merging or joining multiple datasets.

The fundamental problem in merge/purge is that the data supplied by various sources typically include identifiers or string data that are either different among different datasets or simply erroneous due to a variety of reasons (including typographical or transcription errors, or purposeful fraudulent activity (aliases) in the case of names). Hence, the equality of two values over the domain of the common join attribute is not specified as a simple arithmetic predicate, but rather by a set of equational axioms that define equivalence, i.e., by an equational theory. Determining that two records from two databases provide information about the same entity can be highly complex. We use a rule-based knowledge base to implement an equational theory.

The problem of identifying similar instances of the same real-world entity by means of an inexact match has been studied by the Fuzzy Database [5] community. Much of the work has concentrated on the problem of executing a query Q over a fuzzy relational database. The answer for Q is the set of all tuples satisfying Q in a non-fuzzy relational database and all tuples that satisfy Q within a threshold value. Fuzzy relational databases can explicitly store possibility distributions for each value in a tuple, or use possibility-based relations to determine how strongly records belong to the fuzzy set defined by a query [14]. The problem we study in this paper is closely related to the problem studied by the fuzzy database community.
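
To make the notion of an equational theory with inexact matching more concrete, the following Python sketch shows what a tiny rule set might look like. It is an illustration only, not the rule language or the rules used in this paper: the field names (`ssn`, `first`, `last`, `address`), the two rules, and the 0.8 similarity threshold are all assumptions chosen for the example, and string similarity is approximated with the standard library's `difflib.SequenceMatcher`.

```python
from difflib import SequenceMatcher


def similar(a: str, b: str, threshold: float = 0.8) -> bool:
    """Inexact string match: similarity ratio at or above a threshold (assumed 0.8)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold


def same_entity(r1: dict, r2: dict) -> bool:
    """Hypothetical rule set standing in for an equational theory."""
    # Rule 1 (assumed): identical non-empty SSN and similar last names.
    if r1["ssn"] and r1["ssn"] == r2["ssn"] and similar(r1["last"], r2["last"]):
        return True
    # Rule 2 (assumed): similar first and last names and the same address.
    if (similar(r1["first"], r2["first"])
            and similar(r1["last"], r2["last"])
            and r1["address"].lower() == r2["address"].lower()):
        return True
    return False


# Made-up records for illustration.
a = {"ssn": "123456789", "first": "Maria", "last": "Anderson", "address": "1 Main St"}
b = {"ssn": "123456789", "first": "M.", "last": "Andersen", "address": "1 Main Street"}
c = {"ssn": "", "first": "John", "last": "Smith", "address": "5 Oak Ave"}
d = {"ssn": "", "first": "Jon", "last": "Smith", "address": "5 oak ave"}

print(same_entity(a, b), same_entity(c, d), same_entity(a, c))
```

Because equivalence is decided by a set of rules rather than a single equality test, changing the matching criteria amounts to editing or adding rules.
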
However, while fuzzy querying systems are concerned with the accurate and efficient fuzzy retrieval of tuples given a query Q, we are concerned with the pre-processing of the entire data set before it is even ready for querying. The process we study is off-line and involves clustering all tuples into equivalence classes. This clustering is guided by the equational theory, which can include fuzzy matching techniques.

Since we are dealing with large databases, we seek to reduce the complexity of the problem by partitioning the database into partitions or clusters in such a way that the potentially matching records are assigned to the same cluster. (Here we use the term cluster in line with the common terminology of statistical pattern recognition.) In this paper we discuss solutions to merge/purge in which sorting of the entire data set is used to bring the matching records close together in a bounded neighborhood in a linear list, as well as an optimization of this basic technique that seeks to eliminate records with exact duplicate keys during sorting. Elsewhere we have treated the case of clustering in which sorting is replaced by a single-scan process [16]. This clustering resembles the hierarchical clustering strategy proposed in [6] to efficiently perform queries over large fuzzy relational databases. However, we demonstrate that, as one may expect, none of these basic approaches alone can guarantee high accuracy. Here, accuracy means how many of the actual duplicates appearing in the data have been matched and merged correctly.

This paper is organized as follows. In section 2 we detail a system we have implemented that performs a generic Merge/Purge process and includes a declarative rule language for specifying an equational theory, making it easier to experiment with and modify the criteria for equivalence.
(This is a very important desideratum of commercial organizations that work under strict time constraints and thus have precious little time to experiment with alternative matching criteria.) Then in section 3 we demonstrate that no single pass over the data using one particular scheme as a sorting key performs as well as computing the transitive closure over several independent runs, each using a different sorting key for ordering data. The moral is simply that several distinct "cheap" passes over the data produce more accurate results than one "expensive" pass over the data. This result was verified independently by Monge and Elkan [19], who recently studied the same problem using a domain-independent matching algorithm as an equational theory.

In section 4 we provide a detailed treatment of a real-world data set, provided by the Child Welfare Department of the State of Washington, which was used to establish the validity of these results. Our work using statistically generated databases allowed us to devise controlled studies whereby the optimal accuracy of the results was known a priori. In real-world datasets, obviously one cannot know the best attainable results with high precision without a time-consuming and expensive human inspection and validation process. In cases where the datasets are huge, this may not be feasible. Therefore, the results reported here are due to the human inspection of a small but substantial sample of data relative to the e...
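
The multi-pass idea outlined above (sort on a key, compare only records that fall within a bounded neighborhood of the sorted list, repeat with a different key, then take the transitive closure of all matched pairs) can be sketched in Python as follows. This is a minimal sketch under stated assumptions rather than the system described in this paper: the fixed window size `w`, the function names, the toy records, and the toy `match` predicate are all hypothetical, and the transitive closure is computed with a simple union-find structure.

```python
def window_pass(records, key, match, w):
    """Sort by key, then compare each record with the w-1 records before it."""
    order = sorted(range(len(records)), key=lambda i: key(records[i]))
    pairs = []
    for pos, i in enumerate(order):
        for j in order[max(0, pos - (w - 1)):pos]:
            if match(records[i], records[j]):
                pairs.append((i, j))
    return pairs


def transitive_closure(n, pairs):
    """Group record indices into equivalence classes using union-find."""
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in pairs:
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[rb] = ra
    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


def multi_pass(records, keys, match, w=2):
    """Run one window pass per sorting key, then close over all matched pairs."""
    pairs = []
    for key in keys:
        pairs.extend(window_pass(records, key, match, w))
    return transitive_closure(len(records), pairs)


# Toy data (made up): (last name, first name, id).
records = [
    ("Smith", "John", "111"),
    ("Zmith", "John", "111"),  # typo in last name, same id as record 0
    ("Smith", "Jon", "999"),   # same person as record 0, wrong id
    ("Brown", "Ann", "555"),
    ("Nolan", "Kim", "500"),
]
by_last = lambda r: r[0]
by_id = lambda r: r[2]
# Toy equivalence test: same id, or same last name and same first initial.
match = lambda a, b: a[2] == b[2] or (a[0] == b[0] and a[1][0] == b[1][0])

print(transitive_closure(len(records), window_pass(records, by_last, match, 2)))
print(multi_pass(records, [by_last, by_id], match, w=2))
```

On this toy input, the single pass sorted by last name places record 1 far from record 0 and so groups only records 0 and 2, while the two passes combined place records 0, 1, and 2 in one equivalence class even though records 1 and 2 are never compared successfully with each other.
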