Duplicate Record Detection: A Survey

Ahmed K. Elmagarmid, Senior Member, IEEE, Panagiotis G. Ipeirotis, Member, IEEE Computer Society, and Vassilios S. Verykios, Member, IEEE Computer Society

Abstract—Often, in the real world, entities have two or more representations in databases. Duplicate records do not share a common key and/or they contain errors that make duplicate matching a difficult task. Errors are introduced as the result of transcription errors, incomplete information, lack of standard formats, or any combination of these factors. In this paper, we present a thorough analysis of the literature on duplicate record detection. We cover similarity metrics that are commonly used to detect similar field entries, and we present an extensive set of duplicate detection algorithms that can detect approximately duplicate records in a database. We also cover multiple techniques for improving the efficiency and scalability of approximate duplicate detection algorithms. We conclude with coverage of existing tools and with a brief discussion of the big open problems in the area.

Index Terms—Duplicate detection, data cleaning, data integration, record linkage, data deduplication, instance identification, database hardening, name matching, identity uncertainty, entity resolution, fuzzy duplicate detection, entity matching.

1 INTRODUCTION

Databases play an important role in today's IT-based economy. Many industries and systems depend on the accuracy of databases to carry out operations. Therefore, the quality of the information (or the lack thereof) stored in the databases can have significant cost implications to a system that relies on information to function and conduct business. In an error-free system with perfectly clean data, the construction of a comprehensive view of the data consists of linking—in relational terms, joining—two or more tables on their key fields.
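The error-free case described above can be sketched in a few lines: when every source shares a reliable key, integration reduces to an equi-join on that key. The table and field names below (customer IDs, order items) are illustrative, not taken from the paper.

```python
# Minimal sketch of the error-free case: with a shared, trustworthy key,
# building a comprehensive view is a plain equi-join on that key field.
customers = {101: "Jane Doe", 102: "John Roe"}      # key -> customer name
orders = [(101, "book"), (102, "lamp"), (101, "pen")]  # (key, item) rows

# Join each order row to its customer record via the key.
joined = [(cid, customers[cid], item) for cid, item in orders]
```

The rest of the introduction explains why this simple picture breaks down: real data rarely carry such a clean, global key.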
Unfortunately, data often lack a unique, global identifier that would permit such an operation. Furthermore, the data are neither carefully controlled for quality nor defined in a consistent way across different data sources. Thus, data quality is often compromised by many factors, including data entry errors (e.g., Microsft instead of Microsoft), missing integrity constraints (e.g., allowing entries such as EmployeeAge = 567), and multiple conventions for recording information (e.g., 44 W. 4th St. versus 44 West Fourth Street). To make things worse, in independently managed databases, not only the values, but also the structure, semantics, and underlying assumptions about the data may differ as well. Often, while integrating data from different sources to implement a data warehouse, organizations become aware of potential systematic differences or conflicts. Such problems fall under the umbrella term data heterogeneity [1].
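The abstract notes that similarity metrics over field entries are the standard tool for matching records despite such errors. As a concrete illustration of one widely used field-similarity metric (not a method prescribed at this point in the paper), Levenshtein edit distance counts the insertions, deletions, and substitutions needed to turn one string into another, so a typo like "Microsft" sits a single edit away from "Microsoft":

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert/delete/substitute)."""
    prev = list(range(len(b) + 1))          # distances from "" to prefixes of b
    for i, ca in enumerate(a, 1):
        curr = [i]                          # distance from a[:i] to ""
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # delete ca
                            curr[j - 1] + 1,             # insert cb
                            prev[j - 1] + (ca != cb)))   # substitute ca -> cb
        prev = curr
    return prev[-1]

# One dropped letter -> distance 1, so a small threshold flags the pair
# as a likely duplicate even though an exact-key join would miss it.
assert levenshtein("Microsft", "Microsoft") == 1
```

A duplicate detection pipeline would apply such a metric (or a normalized variant of it) to candidate field pairs and treat pairs under a chosen distance threshold as potential matches.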
This note was uploaded on 11/12/2010 for the course CSCI 271 taught by Professor Wilczynski during the Spring '08 term at USC.
