Duplicate Record Detection: A Survey
Ahmed K. Elmagarmid, Panagiotis G. Ipeirotis, IEEE Computer Society, and Vassilios S. Verykios, IEEE Computer Society
Abstract—Often, in the real world, entities have two or more representations in databases. Duplicate records do not share a common
key and/or they contain errors that make duplicate matching a difficult task. Errors are introduced as the result of transcription errors,
incomplete information, lack of standard formats, or any combination of these factors. In this paper, we present a thorough analysis of
the literature on duplicate record detection. We cover similarity metrics that are commonly used to detect similar field entries, and we
present an extensive set of duplicate detection algorithms that can detect approximately duplicate records in a database. We also
cover multiple techniques for improving the efficiency and scalability of approximate duplicate detection algorithms. We conclude with
coverage of existing tools and with a brief discussion of the big open problems in the area.
Index Terms—Duplicate detection, data cleaning, data integration, record linkage, data deduplication, instance identification,
database hardening, name matching, identity uncertainty, entity resolution, fuzzy duplicate detection, entity matching.
DATABASES play an important role in today's IT-based
economy. Many industries and systems depend on the
accuracy of databases to carry out operations. Therefore, the
quality of the information (or the lack thereof) stored in the
databases can have significant cost implications to a system
that relies on information to function and conduct business.
In an error-free system with perfectly clean data, the
construction of a comprehensive view of the data consists
of linking—in relational terms, joining—two or more tables
on their key fields. Unfortunately, data often lack a unique,
global identifier that would permit such an operation.
Furthermore, the data are neither carefully controlled for
quality nor defined in a consistent way across different data
sources. Thus, data quality is often compromised by many
factors, including data entry errors (e.g., Microsft instead of Microsoft), missing integrity constraints (e.g., allowing entries such as EmployeeAge = 567), and multiple conventions for recording information (e.g., 44 W. 4th St. versus 44 West Fourth Street). To make things worse, in independently managed databases, not only the values but also the structure, semantics, and underlying assumptions about the data may differ.
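The point can be made concrete with a toy sketch (not from the survey): when two sources record the same entity under different formatting conventions, a relational equi-join on the shared field finds no matches, while even a crude, hypothetical normalization step recovers the duplicate pair. The table names, field names, and abbreviation map below are all illustrative assumptions.

```python
# Two hypothetical source tables describing the same customer,
# with no shared key and differently formatted addresses.
crm = [{"name": "J. Smith", "addr": "44 W. 4th St."}]
billing = [{"name": "John Smith", "addr": "44 West Fourth Street"}]

def exact_join(left, right, field):
    """Relational-style equi-join on a single field."""
    return [(l, r) for l in left for r in right if l[field] == r[field]]

def normalize(addr):
    """Illustrative normalization: expand a few common abbreviations."""
    subs = {"W.": "West", "St.": "Street", "4th": "Fourth"}
    return " ".join(subs.get(tok, tok) for tok in addr.split())

# The exact join misses the duplicate entirely.
print(len(exact_join(crm, billing, "addr")))  # -> 0

# After normalizing both sides, the join recovers the pair.
crm_n = [dict(r, addr=normalize(r["addr"])) for r in crm]
billing_n = [dict(r, addr=normalize(r["addr"])) for r in billing]
print(len(exact_join(crm_n, billing_n, "addr")))  # -> 1
```

Real data, of course, cannot be repaired by a fixed substitution table; this is precisely why the similarity metrics and detection algorithms surveyed in later sections are needed.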
Often, while integrating data from different sources to
implement a data warehouse, organizations become aware
of potential systematic differences or conflicts. Such problems fall under the umbrella-term