entire data set. The results on the real-world data validate our previous predictions as
being quite accurate. (One may view the formal results of this comparative evaluation by
browsing the site http://www.cs.columbia.edu/~sal.)
Finally, in section 5, we present initial results on an Incremental Merge/Purge algorithm.
The basic Merge/Purge procedure presented in section 2 assumes a single data set. If a new
data set arrives, it must be concatenated to the previously processed data set and the basic
Merge/Purge procedure executed over this entire data set. The Incremental algorithm removes this restriction by using information gathered from previous Merge/Purge executions.
Several strategies for determining what information to gather at the end of each execution
of the incremental algorithm are proposed. We present initial experimental results showing
that the incremental algorithm reduces the time needed to execute a Merge/Purge procedure
when compared with the basic algorithm.

2 Basic Data Cleansing Solutions
In our previous work we introduced the basic "sorted-neighborhood method" for solving
merge/purge as well as a variant "duplicate elimination" method. Here we describe in detail
this basic approach, followed by a description of an incremental variant that merges a new
(smaller) increment of data with an existing previously cleansed dataset.

2.1 The Basic Sorted-Neighborhood Method
Given a collection of two or more databases, we first concatenate them into one sequential
list of N records (after conditioning the records) and then apply the sorted-neighborhood
method. The sorted-neighborhood method for solving the merge/purge problem can be
summarized in three phases:
1. Create Keys: Compute a key for each record in the list by extracting relevant fields
or portions of fields. The choice of the key depends upon an "error model" that may
be viewed as knowledge intensive and domain-specific; the effectiveness of the
sorted-neighborhood method highly depends on a properly chosen key with the intent that
common but erroneous data will have closely matching keys. We discuss the effect of
the choice of the key in section 2.2.
2. Sort Data: Sort the records in the data list using the key of step 1.
3. Merge: Move a fixed size window through the sequential list of records limiting the
comparisons for matching records to those records in the window. If the size of the
window is w records, then every new record entering the window is compared with the
previous w - 1 records to find "matching" records. The first record in the window
slides out of the window (see Figure 1).

Figure 1: Window Scan during Data Cleansing (labels: current window of records, w; next window of records, w)
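
The three phases above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the record layout (dictionaries with "first" and "last" fields), the key function make_key, and the matching rule records_match are hypothetical stand-ins for the domain-specific key and the rule program described in the text.

```python
from typing import Callable, Dict, List, Set, Tuple

Record = Dict[str, str]


def make_key(record: Record) -> str:
    """Hypothetical key: first 3 letters of the last name + first 3 of the first name."""
    last = record.get("last", "").upper()[:3]
    first = record.get("first", "").upper()[:3]
    return last + first


def records_match(a: Record, b: Record) -> bool:
    """Hypothetical stand-in for the rule program: same last name, same first initial."""
    return (
        a.get("last", "").lower() == b.get("last", "").lower()
        and a.get("first", "")[:1].lower() == b.get("first", "")[:1].lower()
    )


def sorted_neighborhood(
    records: List[Record],
    w: int,
    key: Callable[[Record], str] = make_key,
    match: Callable[[Record, Record], bool] = records_match,
) -> Set[Tuple[int, int]]:
    """Return index pairs (i, j), i < j, of records judged to match."""
    if w < 2:
        raise ValueError("window size w must be at least 2")

    # Phase 1 (Create Keys): compute a key for every record.
    keyed = [(key(r), i) for i, r in enumerate(records)]

    # Phase 2 (Sort Data): sort the record indices by key.
    keyed.sort()
    order = [i for _, i in keyed]

    # Phase 3 (Merge): compare each record entering the window
    # with the previous w - 1 records in sorted order.
    pairs: Set[Tuple[int, int]] = set()
    for pos, idx in enumerate(order):
        for prev_idx in order[max(0, pos - (w - 1)):pos]:
            if match(records[prev_idx], records[idx]):
                pairs.add((min(prev_idx, idx), max(prev_idx, idx)))
    return pairs
```

The key and the matching rule are passed in as parameters because, as the text notes, both are domain-specific; only the windowed scan itself is generic.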
When this procedure is executed serially as a main-memory based process, the create
keys phase is an O(N) operation, the sorting phase is O(N log N), and the merging phase is
O(wN), where N is the number of records in the database. Thus, the total time complexity
of this method is O(N log N) if w < ⌈log N⌉, O(wN) otherwise. However, the constants in
the equations differ greatly. It could be relatively expensive to extract relevant key values
from a record during the create key phase. Sorting requires a few machine instructions to
compare the keys. The merge phase requires the application of a potentially large number
of rules to compare two records, and thus has the potential for the largest constant factor.
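
One way to make the role of the constants explicit is to write the serial running time as the sum of the three phase costs. The constants c_key, c_sort and c_merge below are hypothetical per-record and per-comparison costs introduced only for illustration; they do not appear in the original text.

```latex
T(N, w) \;\approx\; c_{\mathrm{key}}\, N \;+\; c_{\mathrm{sort}}\, N \log N \;+\; c_{\mathrm{merge}}\, w N
```

Asymptotically the third term dominates once w is at least ⌈log N⌉, which gives the O(wN) case. If c_merge is much larger than c_sort, the merge phase can dominate the measured running time even for smaller w.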
Notice that w is a parameter of the window-scanning procedure. The legitimate values of
w may range from 2 (whereby only two consecutive elements are compared) to N (whereby
each element is compared to all others). The latter case pertains to the full quadratic (O(N^2))
time process at the maximal potential accuracy (as defined by the equational theory to be the
percentage of all duplicates correctly found in the merging process). The former case (where
w may be viewed as a small constant relative to N) pertains to optimal time performance
(only O(N) time) but at minimal accuracy. The fundamental question is what are the optimal
settings for w to maximize accuracy while minimizing computational cost?
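
To see how the cost of the window scan grows between the two extremes, the number of record comparisons can be counted directly. The sketch below assumes that each record entering the window is compared once with each of the previous w - 1 records (fewer at the start of the list); the function names are illustrative and not taken from the original.

```python
def window_comparisons(n: int, w: int) -> int:
    """Count record comparisons made by a window of size w over n sorted records."""
    if not 2 <= w <= n:
        raise ValueError("w must satisfy 2 <= w <= n")
    # The record at position i (0-based) is compared with the previous
    # min(i, w - 1) records that are still inside the window.
    return sum(min(i, w - 1) for i in range(n))


def window_comparisons_closed_form(n: int, w: int) -> int:
    """Closed form of the same count: (w - 1) * (2n - w) / 2."""
    return (w - 1) * (2 * n - w) // 2
```

For N = 1000, w = 2 gives 999 comparisons (linear in N), while w = N gives 499,500 comparisons, which is N(N - 1)/2, the full quadratic case.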
Note, however, that for very large databases the dominant cost is likely disk I/O, and
hence the number of passes over the data set. In this case, at least three passes would be
needed: one pass for conditioning the data and preparing keys, at least a second pass, likely
more, for a high-speed sort such as AlphaSort [20], and a final pass for window
processing and application of the rule program for each record entering the sliding window.
Depending upon the complexity of the rule program and window size w, the last pass may
indeed be the dominant cost. We introduced elsewhere [16] the means of speeding up this
phase by processing "parallel windows" in the sorted list.
We note with interest that the sorts of optimizations detailed in the AlphaSort paper [20]
may of course be fruitfully applied here. We are more concerned with alternative process
architectures that lead to higher accuracies in the computed results whil...