vldb-duplicate

vldb-duplicate - Eliminating Fuzzy Duplicates in Data...


Eliminating Fuzzy Duplicates in Data Warehouses

Rohit Ananthakrishna 1, Cornell University, rohit@cs.cornell.edu
Surajit Chaudhuri and Venkatesh Ganti, Microsoft Research, {surajitc, vganti}@microsoft.com

1 Work done while visiting Microsoft Research

Abstract

The duplicate elimination problem of detecting multiple tuples that describe the same real world entity is an important data cleaning problem. Previous domain-independent solutions to this problem relied on standard textual similarity functions (e.g., edit distance, cosine metric) between multi-attribute tuples. However, such approaches result in large numbers of false positives if we want to identify domain-specific abbreviations and conventions. In this paper, we develop an algorithm for eliminating duplicates in dimensional tables in a data warehouse, which are usually associated with hierarchies. We exploit hierarchies to develop a high quality, scalable duplicate elimination algorithm, and evaluate it on real datasets from an operational data warehouse.

1. Introduction

Decision support analysis on data warehouses influences important business decisions; therefore, the accuracy of such analysis is crucial. However, data received at the data warehouse from external sources usually contains errors: spelling mistakes, inconsistent conventions, etc. Hence, significant amounts of time and money are spent on data cleaning, the task of detecting and correcting errors in data.

The problem of detecting and eliminating duplicated data is one of the major problems in the broad area of data cleaning and data quality [e.g., HS95, ME97, RD00]. Often, the same logical real world entity has multiple representations in the data warehouse. For example, when Lisa purchases products from SuperMart twice, she might be entered as two different customers, [Lisa Simpson, Seattle, WA, USA, 98025] and [Lisa Simson, Seattle, WA, United States, 98025], due to data entry errors. Such duplicated information can significantly increase direct mailing costs because several customers like Lisa may be sent multiple catalogs. Moreover, such duplicates can cause incorrect results in analysis queries (say, the number of SuperMart customers in Seattle) and lead to erroneous data mining models.

We refer to this problem of detecting and eliminating multiple distinct records representing the same real world entity as the fuzzy duplicate elimination problem, which is sometimes also called the merge/purge, dedup, or record linkage problem [e.g., HS95, ME97, FS69]. This problem is different from the standard duplicate elimination problem in relational database systems (say, for answering select distinct queries), which considers two tuples to be duplicates only if they match exactly on all attributes. Data cleaning, however, deals with fuzzy duplicate elimination, which is our focus in this paper. Henceforth, we use duplicate elimination to mean fuzzy duplicate elimination....
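The contrast between exact and fuzzy duplicates can be made concrete with the two Lisa records from the example. The sketch below is only an illustration of the textual-similarity baseline the abstract mentions, not the algorithm developed in the paper. The helper names (`edit_distance`, `tuple_distance`) and the choice to average per-attribute normalized edit distances are assumptions made here for illustration.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed with the standard dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def tuple_distance(t1, t2) -> float:
    """Mean of per-attribute edit distances, each normalized by the longer
    value's length. This aggregation is an illustrative assumption."""
    total = 0.0
    for x, y in zip(t1, t2):
        longest = max(len(x), len(y)) or 1  # avoid dividing by zero
        total += edit_distance(x.lower(), y.lower()) / longest
    return total / len(t1)


r1 = ("Lisa Simpson", "Seattle", "WA", "USA", "98025")
r2 = ("Lisa Simson", "Seattle", "WA", "United States", "98025")

# Exact matching (as in SELECT DISTINCT) treats these as different tuples.
exact_duplicates = r1 == r2

# A fuzzy comparison yields a graded distance between 0 and 1 instead.
fuzzy_distance = tuple_distance(r1, r2)

print(exact_duplicates, round(fuzzy_distance, 3))
```

The name attribute differs by a single character, but "USA" and "United States" are ten edits apart even though they denote the same country. A threshold loose enough to accept that pair would also accept many unrelated strings, which is one way to read the abstract's remark about false positives with abbreviations and conventions.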

This note was uploaded on 11/12/2010 for the course CSCI 271 taught by Professor Wilczynski during the Spring '08 term at USC.

