in the duplicated records in any of the attribute fields. Accuracy is measured as the percentage of the duplicates correctly found by the process. False positives are measured as the percentage of records claimed to be equivalent but which are not actual duplicates.
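These two metrics can be sketched over sets of candidate pairs. The pair representation (frozensets of record ids) and the choice of denominator for the false-positive rate are illustrative assumptions here, not taken from the paper's evaluation code:

```python
# A minimal sketch of the two metrics described above. Pairs are modeled
# as frozensets of record ids; all ids and pairs below are hypothetical.

def accuracy(found_pairs, true_pairs):
    """Percentage of true duplicate pairs that the process found."""
    return 100.0 * len(found_pairs & true_pairs) / len(true_pairs)

def false_positive_rate(found_pairs, true_pairs):
    """Percentage of found pairs that are not actual duplicates."""
    return 100.0 * len(found_pairs - true_pairs) / len(found_pairs)

true_pairs = {frozenset(p) for p in [(1, 2), (3, 4), (5, 6)]}
found_pairs = {frozenset(p) for p in [(1, 2), (3, 4), (7, 8)]}

print(accuracy(found_pairs, true_pairs))             # 2 of 3 true pairs found
print(false_positive_rate(found_pairs, true_pairs))  # 1 of 3 found pairs is wrong
```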
Here, each generated database is viewed as the concatenation of multiple databases. The merging of records in the resultant single database is the object of study in these experiments. Each record generated consists of the following fields, some of which can be empty: social security number, first name, initial, last name, address, apartment, city, state, and zip code. The names were chosen randomly from a list of 63,000 real names1. The cities, states, and zip codes (all from the U.S.A.) come from publicly available lists2.
The data generated was intended to be a good model of what might actually be processed
in real-world datasets. The errors introduced in the duplicate records range from small typographical mistakes to complete changes of last names and addresses. When setting the parameters for typographical errors, we used known frequencies from studies of spelling-correction algorithms [21, 7, 17]. For this study, the generator selected from 10% to 50% of the generated records for duplication with errors, where the error in the spelling of words, names, and cities was controlled according to these published statistics found for common real-world datasets.

2 Ftp into cdrom.com and cd /pub/FreeBSD/FreeBSD-current/src/share/misc.
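The error-injection step of such a generator can be sketched as below. The four single-character error types (substitute, delete, insert, transpose) are the ones commonly studied in the spelling-correction literature; the error rates, field names, and helper structure are illustrative assumptions, not the paper's actual generator:

```python
# A hypothetical sketch of injecting one typographical error into a field
# of a duplicated record. Rates and helpers here are assumptions.
import random

def corrupt(word, rng):
    """Apply one random single-character typo to `word`."""
    if len(word) < 2:
        return word
    i = rng.randrange(len(word) - 1)
    kind = rng.choice(["substitute", "delete", "insert", "transpose"])
    if kind == "substitute":
        return word[:i] + rng.choice("abcdefghijklmnopqrstuvwxyz") + word[i + 1:]
    if kind == "delete":
        return word[:i] + word[i + 1:]
    if kind == "insert":
        return word[:i] + rng.choice("abcdefghijklmnopqrstuvwxyz") + word[i:]
    return word[:i] + word[i + 1] + word[i] + word[i + 2:]  # transpose

rng = random.Random(42)
record = {"last_name": "Hernandez", "city": "Atlanta"}
dup = dict(record)  # duplicate the record, then corrupt one field
dup["last_name"] = corrupt(record["last_name"], rng)
print(record, dup)
```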
In this paper, the performance measurement of accuracy (percentage of duplicates captured) using this "standard error model" is plotted over varying window sizes so that we may better understand the relationship and tradeoffs between computational complexity and accuracy. We do not believe the results will be substantially different for different databases with the same sorts of errors in the duplicated records. Future work will help to better establish this conjecture over widely varying error models, afforded by our database generator. However, other statistically generated databases may bear no direct relationship to real data. We believe the present experiments are more realistic. Section 5 provides substantial evidence for this case.

3.2 Results on accuracy
The purpose of this first experiment was to determine the baseline accuracy of the sorted-neighborhood method. We ran three independent runs of the sorted-neighborhood method over each database, and used a different key during the sorting phase of each independent run. On the first run the last name was the principal field of the key (i.e., the last name was the first attribute in the key). On the second run, the first name was the principal field, while, in the last run, the street address was the principal field. Our selection of the attribute ordering of the keys was purely arbitrary. We could have used the social-security number instead of, say, the street address. We assume all fields are noisy (and under the control of our data generator to be made so) and therefore it does not matter what field ordering we select for purposes of this study.
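The key construction for each run can be sketched as a simple concatenation of attribute values, with the principal field placed first. The field names and the concatenation scheme are illustrative assumptions; the paper does not prescribe this exact encoding:

```python
# A minimal sketch of building a sort key whose principal field comes
# first, as in the three independent runs described above.

def make_key(record, field_order):
    """Concatenate attribute values in the given order to form a sort key."""
    return "".join(str(record.get(f, "")) for f in field_order)

records = [
    {"last": "Smith", "first": "Ann", "addr": "12 Oak St"},
    {"last": "Jones", "first": "Bob", "addr": "3 Elm Ave"},
]
# Run 1: last name is the principal field; run 3 would put the address first.
run1 = sorted(records, key=lambda r: make_key(r, ["last", "first", "addr"]))
print([r["last"] for r in run1])  # ['Jones', 'Smith']
```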
Figure 2(a) shows the effect of varying the window size from 2 to 60 records in a database with 1,000,000 records and an additional 423,644 duplicate records with varying errors. A record may be duplicated more than once. Notice that each independent run found from 50% to 70% of the duplicated pairs. Notice also that increasing the window size does not help much; taking into consideration that the time complexity of the procedure goes up as the window size increases, it is obviously fruitless at some point to use a large window.
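The window scan at the heart of the method can be sketched as follows: after sorting on the key, each record is compared only with the w - 1 records immediately preceding it in the sorted order. The `similar` predicate stands in for the paper's equational theory and is a placeholder assumption here:

```python
# A sketch of the sorted-neighborhood window scan. `similar` is a
# hypothetical stand-in for the rule-based equational theory.

def sorted_neighborhood(records, key, w, similar):
    """Return candidate duplicate pairs found with window size w."""
    ordered = sorted(records, key=key)
    pairs = set()
    for i, rec in enumerate(ordered):
        # Compare rec only with the w - 1 records preceding it in the window.
        for j in range(max(0, i - w + 1), i):
            if similar(ordered[j], rec):
                pairs.add(frozenset((ordered[j]["id"], rec["id"])))
    return pairs

records = [
    {"id": 1, "name": "smith"},
    {"id": 2, "name": "smyth"},
    {"id": 3, "name": "jones"},
]
pairs = sorted_neighborhood(records, key=lambda r: r["name"], w=2,
                            similar=lambda a, b: a["name"][0] == b["name"][0])
print(pairs)  # smith/smyth are adjacent after sorting and match
```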
[Figure 2: Accuracy results for a 1,000,000 record database. (a) Percent of correctly detected duplicated pairs; (b) percent of incorrectly detected duplicated pairs (false positives). Each plot shows the result versus window size for Key #1 (last name), Key #2 (first name), Key #3 (street address), and the multi-pass run over all 3 keys.]

The line marked "Multi-pass over 3 keys" in Figure 2(a) shows our results when the
program computes the transitive closure over the pairs found by the three independent runs.
The percent of duplicates found goes up to almost 90%. A manual inspection of those records
not found as equivalent revealed that most of them are pairs that would be hard for a human
to identify without further information.
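The multi-pass merge step can be sketched with a union-find structure: pairs found by the independent runs are merged so that a match A~B from one pass and B~C from another yield A~C in the closure. The implementation below is an illustrative sketch, not the paper's own code:

```python
# A sketch of computing the transitive closure over the pairs found by
# several independent runs, using union-find with path halving.

def transitive_closure(pair_sets):
    """Merge pairs from several runs into equivalence classes of ids."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for pairs in pair_sets:
        for a, b in pairs:
            parent[find(a)] = find(b)

    groups = {}
    for x in list(parent):
        groups.setdefault(find(x), set()).add(x)
    return sorted(sorted(g) for g in groups.values())

# Pairs from three hypothetical passes with different keys:
print(transitive_closure([[(1, 2)], [(2, 3)], [(4, 5)]]))
# [[1, 2, 3], [4, 5]]
```

Records 1 and 3 are never compared directly, yet the closure links them through record 2, which is why the multi-pass result rises well above any single run.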
The equational theory is not completely trustworthy. It can decide that two records are similar or equivalent even though they may not represent the same real-world entity; these incorrectly paired records are called "false positives". Figure 2(b) shows the percent of those records incorrectly marked as duplicates.