difficult. The first
issue, where databases have different schema, has been addressed extensively in the literature and is known as the schema integration problem [3]. We are primarily interested in the
second problem: heterogeneous representations of data and its implication when merging or
joining multiple datasets.
The fundamental problem in merge/purge is that the data supplied by various sources
typically include identifiers or string data, that are either different among different datasets
or simply erroneous due to a variety of reasons (including typographical or transcription
errors, or purposeful fraudulent activity (aliases) in the case of names). Hence, the equality
of two values over the domain of the common join attribute is not specified as a simple
arithmetic predicate, but rather by a set of equational axioms that define equivalence, i.e., by
an equational theory. Determining that two records from two databases provide information
about the same entity can be highly complex. We use a rule-based knowledge base to
implement an equational theory.
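
As an illustration only, an equational theory of this kind can be sketched as a handful of rules over pairs of records. The field names (`ssn`, `first`, `last`, `address`), the two rules, and the 0.8 similarity threshold are all hypothetical; the sketch is written in Python with the standard `difflib` module rather than in the declarative rule language of the system described in this paper.

```python
from difflib import SequenceMatcher


def similar(a: str, b: str, threshold: float = 0.8) -> bool:
    """Approximate string equality based on difflib's similarity ratio."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold


def same_entity(r1: dict, r2: dict) -> bool:
    """Hypothetical two-rule equational theory over person records."""
    # Rule 1 (assumed): identical SSN and similar last names.
    if r1["ssn"] == r2["ssn"] and similar(r1["last"], r2["last"]):
        return True
    # Rule 2 (assumed): similar first and last names at the same address.
    if (similar(r1["first"], r2["first"])
            and similar(r1["last"], r2["last"])
            and r1["address"] == r2["address"]):
        return True
    return False
```

Under these assumed rules, two records with the same SSN and last names "Hernandez" and "Hernandes" would be declared equivalent even though no simple equality predicate on the name would match them.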
The problem of identifying similar instances of the same real-world entity by means of
an inexact match has been studied by the Fuzzy Database [5] community. Much of the work
has concentrated on the problem of executing a query Q over a fuzzy relational database.
The answer for Q is the set of all tuples satisfying Q in a non-fuzzy relational database and
all tuples that satisfy Q within a threshold value. Fuzzy relational databases can explicitly
store possibility distributions for each value in a tuple, or use possibility-based relations to
determine how strongly records belong to the fuzzy set defined by a query [14]. The problem
we study in this paper is closely related to the problem studied by the fuzzy database
community. However, while fuzzy querying systems are concerned with the accurate and
efficient fuzzy retrieval of tuples given a query Q, we are concerned with the pre-processing
of the entire data set before it is even ready for querying. The process we study is off-line
and involves clustering all tuples into equivalence classes. This clustering is guided by the
equational theory which can include fuzzy matching techniques.
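
To make the contrast concrete, the query-time view described above might be sketched as follows. This is a rough illustration, not an implementation of any fuzzy database system: the string similarity ratio from Python's `difflib` stands in for a possibility measure, and the function names and the 0.8 threshold are assumptions.

```python
from difflib import SequenceMatcher


def membership(value: str, target: str) -> float:
    """Stand-in for a degree of membership: a string similarity in [0, 1]."""
    return SequenceMatcher(None, value.lower(), target.lower()).ratio()


def fuzzy_select(rows, attr, target, threshold=0.8):
    """Return rows whose attribute matches the target exactly or within the threshold."""
    return [
        r for r in rows
        if r[attr] == target or membership(r[attr], target) >= threshold
    ]
```

A query of this kind is answered one request at a time, whereas the process studied here groups all tuples into equivalence classes once, before any query is issued.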
Since we are dealing with large databases, we seek to reduce the complexity of the problem
by partitioning the database into partitions or clusters in such a way that the potentially
matching records are assigned to the same cluster. (Here we use the term cluster in line
with the common terminology of statistical pattern recognition.) In this paper we discuss
solutions to merge/purge in which sorting of the entire data-set is used to bring the matching
records close together in a bounded neighborhood in a linear list, as well as an optimization of
this basic technique that seeks to eliminate records with exact duplicate keys during sorting.
Elsewhere we have treated the case of clustering in which sorting is replaced by a single-scan
process [16]. This clustering resembles the hierarchical clustering strategy proposed in [6] to
efficiently perform queries over large fuzzy relational databases. However, we demonstrate
that, as one may expect, none of these basic approaches alone can guarantee high accuracy.
Here, accuracy means how many of the actual duplicates appearing in the data have been
matched and merged correctly.
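
The sorting-based approach can be sketched roughly as below, assuming that the "bounded neighborhood" is a sliding window of `w` consecutive records in the sorted list. The window parameter `w`, the `key` function, and the `match` predicate are illustrative names, not taken from the paper.

```python
def sorted_neighborhood(records, key, match, w=3):
    """Return index pairs (i, j), i < j, judged equivalent inside a window of w records."""
    # Sort record indices by the chosen key so likely matches end up adjacent.
    order = sorted(range(len(records)), key=lambda i: key(records[i]))
    pairs = set()
    for pos, i in enumerate(order):
        # Compare each record only with the w - 1 records preceding it.
        for j in order[max(0, pos - (w - 1)):pos]:
            if match(records[i], records[j]):
                pairs.add((min(i, j), max(i, j)))
    return pairs
```

With n records, this performs on the order of n * w comparisons after the sort instead of comparing all pairs, at the price of missing matches whose keys sort far apart.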
This paper is organized as follows. In section 2 we detail a system we have implemented
that performs a generic Merge/Purge process that includes a declarative rule language for
specifying an equational theory, making it easier to experiment with and modify the criteria for
equivalence. (This is a very important desideratum of commercial organizations that work
under strict time constraints and thus have precious little time to experiment with alternative
matching criteria.) Then in section 3 we demonstrate that no single pass over the data using
one particular scheme as a sorting key performs as well as computing the transitive closure
over several independent runs each using a different sorting key for ordering data. The moral
is simply that several distinct "cheap" passes over the data produce more accurate results
than one "expensive" pass over the data. This result was verified independently by Monge
and Elkan [19] who recently studied the same problem using a domain-independent matching
algorithm as an equational theory.
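
One way to combine the results of several independent runs is sketched below, under the assumption that each run reports its matches as pairs of record indices. The union-find structure is a common choice for computing a transitive closure over such pairs and is used here for illustration; it is not necessarily how the system in this paper computes it.

```python
def transitive_closure(n, passes):
    """Merge matched pairs from several passes into equivalence classes of record indices."""
    parent = list(range(n))

    def find(x):
        # Follow parent links to the class representative, halving paths as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for pairs in passes:
        for a, b in pairs:
            ra, rb = find(a), find(b)
            if ra != rb:
                # Keep the smaller index as the representative of the merged class.
                parent[max(ra, rb)] = min(ra, rb)

    classes = {}
    for i in range(n):
        classes.setdefault(find(i), []).append(i)
    return sorted(classes.values())
```

If one pass matches records 0 and 1 and another pass, sorted on a different key, matches records 1 and 2, the closure places all three in one class even though no single pass compared records 0 and 2.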
In section 4 we provide a detailed treatment of a real-world data set, provided by the
Child Welfare Department of the State of Washington, which was used to establish the
validity of these results. Our work using statistically generated databases allowed us to
devise controlled studies whereby the optimal accuracy of the results was known a priori.
In real-world datasets, obviously one cannot know the best attainable results with high
precision without a time-consuming and expensive human inspection and validation process.
In cases where the datasets are huge, this may not be feasible. Therefore, the results reported
here are due to the human inspection of a small but substantial sample of data relative to