Real-world Data is Dirty: Data Cleansing and The Merge/Purge Problem

... in the duplicated records in any of the attribute fields. Accuracy is measured as the percentage of duplicates correctly found by the process. False positives are measured as the percentage of records claimed to be equivalent but which are not actual duplicates. Here, each generated database is viewed as the concatenation of multiple databases. The merging of records in the resultant single database is the object of study in these experiments.

Each record generated consists of the following fields, some of which can be empty: social security number, first name, initial, last name, address, apartment, city, state, and zip code. The names were chosen randomly from a list of 63,000 real names (see ftp://ftp.denet.dk/pub/wordlists). The cities, states, and zip codes (all from the U.S.A.) come from publicly available lists (ftp to cdrom.com and cd /pub/FreeBSD/FreeBSD-current/src/share/misc). The data generated was intended to be a good model of what might actually be processed in real-world datasets. The errors introduced in the duplicate records range from small typographical mistakes to complete changes of last names and addresses. When setting the parameters for typographical errors, we used known frequencies from studies in spelling correction algorithms [21, 7, 17]. For this study, the generator selected from 10% to 50% of the generated records for duplication with errors, where the error in the spelling of words, names, and cities was controlled according to these published statistics found for common real-world datasets.

In this paper, the performance measurement of accuracy (percentage of duplicates captured) using this "standard error model" is plotted over varying window sizes so that we may better understand the relationship and tradeoffs between computational complexity and accuracy. We do not believe the results will be substantially different for different databases with the same sorts of errors in the duplicated records. Future work will help to better establish this conjecture over widely varying error models, afforded by our database generator. However, other statistically generated databases may bear no direct relationship to real data. We believe the present experiments are more realistic. Section 5 provides substantial evidence for this case.

3.2 Results on accuracy

The purpose of this first experiment was to determine the baseline accuracy of the sorted-neighborhood method. We ran three independent runs of the sorted-neighborhood method over each database, and used a different key during the sorting phase of each independent run. On the first run the last name was the principal field of the key (i.e., the last name was the first attribute in the key). On the second run, the first name was the principal field, while, in the last run, the street address was the principal field. Our selection of the attribute ordering of the keys was purely arbitrary. We could have used the social security number instead of, say, the street address. We assume all fields are noisy (and under the control of our data generator to be made so), and therefore it does not matter what field ordering we select for purposes of this study.

Figure 2(a) shows the effect of varying the window size from 2 to 60 records in a database with 1,000,000 records and an additional 423,644 duplicate records with varying errors. A record may be duplicated more than once. Notice that each independent run found from 50% to 70% of the duplicated pairs.
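To make the procedure behind these runs concrete, the following sketch shows one way a single sorted-neighborhood pass can be organized: build a key for each record with the chosen principal field first, sort on that key, and compare each record only against its neighbors inside a sliding window of w records. This is a minimal illustration under assumed field names and a toy matching rule; the actual experiments apply a declarative equational theory, not the naive_match function shown here.

    # Sketch of one sorted-neighborhood pass (illustrative only; the real
    # system applies a declarative equational theory, not this toy matcher).

    def make_key(record, principal_field, other_fields):
        # Concatenate the principal field first, then the remaining fields,
        # normalized to upper case with surrounding whitespace removed.
        parts = [record.get(principal_field, "")]
        parts += [record.get(f, "") for f in other_fields]
        return "".join(p.strip().upper() for p in parts)

    def naive_match(a, b):
        # Hypothetical stand-in for the equational theory: treat two records
        # as equivalent when last names agree and address prefixes agree.
        return a["last"] == b["last"] and a["address"][:5] == b["address"][:5]

    def sorted_neighborhood_pass(records, principal_field, other_fields,
                                 window, matches=naive_match):
        # Sort on the key, then compare each record only with the window-1
        # records that precede it in sorted order.
        keyed = sorted(records, key=lambda r: make_key(r, principal_field, other_fields))
        pairs = set()
        for i, rec in enumerate(keyed):
            for j in range(max(0, i - window + 1), i):
                if matches(keyed[j], rec):
                    pairs.add((keyed[j]["id"], rec["id"]))
        return pairs

Under these assumed field names, a pass with the last name as principal field would be invoked as sorted_neighborhood_pass(records, "last", ["first", "address"], window=10); the other two passes differ only in which field is placed first in the key.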
Notice also that increasing the window size does not help much and, taking into consideration that the time complexity of the procedure goes up as the window size increases, it is obviously fruitless at some point to use a large window.

The line marked as "Multi-pass over 3 keys" in Figure 2(a) shows our results when the program computes the transitive closure over the pairs found by the three independent runs. The percentage of duplicates found goes up to almost 90%. A manual inspection of those records not found as equivalent revealed that most of them are pairs that would be hard for a human to identify without further information.

[Figure 2: Accuracy results for a 1,000,000 record database (1M records + 423,644 duplicates). (a) Percent of correctly detected duplicated pairs and (b) percent of incorrectly detected duplicated pairs (false positives), plotted against window size for Key #1 (Last Name), Key #2 (First Name), Key #3 (St. Addr.), and the multi-pass run over 3 keys.]

The equational theory is not completely trustworthy. It can decide that two records are similar or equivalent even though they may not represent the same real-world entity; these incorrectly paired records are called "false positives". Figure 2(b) shows the percent of those records incorrectly marked as duplicates.
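As a rough illustration of the multi-pass step discussed above, the transitive closure over the pair sets produced by the independent runs can be computed with a standard union-find structure. The sketch below is an assumed implementation for exposition, not the code used in these experiments.

    # Sketch: merge the pairs found by several passes by computing their
    # transitive closure with union-find (illustrative implementation).

    class UnionFind:
        def __init__(self):
            self.parent = {}

        def find(self, x):
            self.parent.setdefault(x, x)
            while self.parent[x] != x:
                self.parent[x] = self.parent[self.parent[x]]  # path halving
                x = self.parent[x]
            return x

        def union(self, a, b):
            ra, rb = self.find(a), self.find(b)
            if ra != rb:
                self.parent[rb] = ra

    def transitive_closure(pass_results):
        # pass_results: an iterable of sets of (id_a, id_b) pairs, one set
        # per independent pass. Returns the equivalence classes of record
        # ids implied by taking all pairs together.
        uf = UnionFind()
        for pairs in pass_results:
            for a, b in pairs:
                uf.union(a, b)
        classes = {}
        for x in list(uf.parent):
            classes.setdefault(uf.find(x), set()).add(x)
        return classes

Any two record ids that land in the same equivalence class are reported as duplicates, which is why a pair missed under every individual key can still be recovered once the three result sets are combined.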