fuzzy - Robust Identification of Fuzzy Duplicates Surajit...


Robust Identification of Fuzzy Duplicates

Surajit Chaudhuri, Venkatesh Ganti
Microsoft Research
{surajitc, vganti}@microsoft.com

Rajeev Motwani
Stanford University
rajeev@cs.stanford.edu

Abstract

Detecting and eliminating fuzzy duplicates is a critical data cleaning task that is required by many applications. Fuzzy duplicates are multiple seemingly distinct tuples which represent the same real-world entity. We propose two novel criteria that enable characterization of fuzzy duplicates more accurately than is possible with existing techniques. Using these criteria, we propose a novel framework for the fuzzy duplicate elimination problem. We show that solutions within the new framework result in better accuracy than earlier approaches. We present an efficient algorithm for solving instantiations within the framework. We evaluate it on real datasets to demonstrate the accuracy and scalability of our algorithm.

1. Introduction

Detecting and eliminating duplicated data is an important problem in the broader area of data cleaning and data quality. For example, when Lisa purchases products from SuperMart twice, she might be entered as two different customers, e.g., [Lisa Simpson, Seattle, WA, USA, 98025] and [Simson Lisa, Seattle, WA, United States, 98025]. Often, the same logical real-world entity has multiple representations in a relation, due to data entry errors, varying conventions, and a variety of other reasons. Such duplicated information can cause significant problems for the users of the data. For example, it can lead to increased direct mailing costs because several customers like Lisa may be sent multiple catalogs. Such duplicates could also cause incorrect results in analytic queries (say, the number of SuperMart customers in Seattle), and lead to erroneous data mining models. Hence, a significant amount of time and money is spent on the task of detecting and eliminating duplicates.
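To make the Lisa example concrete, the sketch below compares the two customer records in two ways: by plain equality, and by overlap of their word tokens. The token-based Jaccard measure and the helper names `tokens` and `jaccard` are assumptions chosen purely for illustration; they are not the criteria or the algorithm proposed in this paper. The third record, `other`, is made up to serve as a contrast.

```python
import re

def tokens(record):
    """Lowercase every field and collect its alphanumeric tokens into a set."""
    out = set()
    for field in record:
        out.update(re.findall(r"[a-z0-9]+", field.lower()))
    return out

def jaccard(a, b):
    """Token-set Jaccard similarity (illustrative choice, not the paper's measure)."""
    ta, tb = tokens(a), tokens(b)
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)

# The two records from the text, plus a hypothetical unrelated customer.
lisa_1 = ("Lisa Simpson", "Seattle", "WA", "USA", "98025")
lisa_2 = ("Simson Lisa", "Seattle", "WA", "United States", "98025")
other = ("Bart Jones", "Portland", "OR", "USA", "97201")

# An exact comparison treats the two Lisa records as different customers.
print(lisa_1 == lisa_2)

# Token overlap is much higher for the Lisa pair than for an unrelated pair.
print(round(jaccard(lisa_1, lisa_2), 3))
print(round(jaccard(lisa_1, other), 3))
```

Even this crude measure separates the Lisa pair from the unrelated pair, but it also shows the difficulty: the misspelling "Simson" and the variant "United States" contribute no overlap at all, so any fixed similarity cutoff on such a score would be fragile.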
We refer to this problem of detecting and eliminating multiple distinct records representing the same real-world entity or phenomenon as the fuzzy duplicate elimination problem. This problem is similar to the merge/purge, deduping, and record linkage problems [e.g., 13, 17, 15, 21, 26, 1, 25]. Note that our problem generalizes the standard duplicate elimination problem for answering select distinct queries in relational database systems, which consider two tuples to be duplicates if they match exactly on all attributes [3]. However, in this paper, we use duplicate elimination to mean fuzzy duplicate elimination.

Previous solutions (as discussed further in Section 6) to duplicate elimination can be classified into supervised and unsupervised approaches. Supervised approaches learn rules characterizing pairs of duplicates from training data consisting of known duplicates [11, 5, 26, 28]. These approaches are limited by their dependence on comprehensive training data which exhibit the variety and distribution of errors observed in practice, or on manual guidance. In many real data integration scenarios, it is not guidance....

This note was uploaded on 11/12/2010 for the course CSCI 271 taught by Professor Wilczynski during the Spring '08 term at USC.

