Real-world Data is Dirty: Data Cleansing and The Merge/Purge Problem

… entire data set. The results on the real-world data validate our previous predictions as being quite accurate. (One may view the formal results of this comparative evaluation by browsing the site http://www.cs.columbia.edu/~sal.) Finally, in section 5, we present initial results on an Incremental Merge/Purge algorithm. The basic Merge/Purge procedure presented in section 2 assumes a single data set. If a new data set arrives, it must be concatenated to the previously processed data set and the basic Merge/Purge procedure executed over this entire data set. The Incremental algorithm removes this restriction by using information gathered from previous Merge/Purge executions. Several strategies for determining what information to gather at the end of each execution of the incremental algorithm are proposed. We present initial experimental results showing that the incremental algorithm reduces the time needed to execute a Merge/Purge procedure when compared with the basic algorithm.

2 Basic Data Cleansing Solutions

In our previous work we introduced the basic "sorted-neighborhood method" for solving merge/purge, as well as a variant "duplicate elimination" method. Here we describe this basic approach in detail, followed by a description of an incremental variant that merges a new (smaller) increment of data with an existing, previously cleansed dataset.

2.1 The Basic Sorted-Neighborhood Method

Given a collection of two or more databases, we first concatenate them into one sequential list of N records (after conditioning the records) and then apply the sorted-neighborhood method. The sorted-neighborhood method for solving the merge/purge problem can be summarized in three phases:

1. Create Keys: Compute a key for each record in the list by extracting relevant fields or portions of fields. The choice of the key depends upon an "error model" that may be viewed as knowledge intensive and domain-specific; the effectiveness of the sorted-neighborhood method depends highly on a properly chosen key, with the intent that common but erroneous data will have closely matching keys. We discuss the effect of the choice of the key in section 2.2.

2. Sort Data: Sort the records in the data list using the key of step 1.

3. Merge: Move a fixed-size window through the sequential list of records, limiting the comparisons for matching records to those records in the window. If the size of the window is w records, then every new record entering the window is compared with the previous w - 1 records to find "matching" records. The first record in the window slides out of the window (see Figure 1).

[Figure 1: Window Scan during Data Cleansing. The figure shows the current window of w records and the next window of records.]

When this procedure is executed serially as a main-memory based process, the create keys phase is an O(N) operation, the sorting phase is O(N log N), and the merging phase is O(wN), where N is the number of records in the database. Thus, the total time complexity of this method is O(N log N) if w < ⌈log N⌉, and O(wN) otherwise. However, the constants in these terms differ greatly. It could be relatively expensive to extract relevant key values from a record during the create keys phase. Sorting requires only a few machine instructions to compare the keys. The merge phase requires the application of a potentially large number of rules to compare two records, and thus has the potential for the largest constant factor.
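To make the three phases concrete, the following is a minimal sketch of the procedure in Python, assuming the records have already been conditioned. The helper functions make_key and records_match are hypothetical stand-ins for the domain-specific key extraction and for the rule program of the merge phase; they are illustrations, not the implementation used in our experiments.

```python
# Minimal sketch of the basic sorted-neighborhood method described above.
# make_key and records_match are hypothetical placeholders for the
# domain-specific key design and the rule-based record matcher.

def sorted_neighborhood(records, make_key, records_match, w):
    """Return pairs of records judged to be duplicates.

    records       -- list of already-conditioned records
    make_key      -- extracts the sort key from a record (phase 1)
    records_match -- rule program deciding whether two records match (phase 3)
    w             -- window size (w >= 2)
    """
    # Phase 1: create keys.
    keyed = [(make_key(r), r) for r in records]

    # Phase 2: sort the list on the key.
    keyed.sort(key=lambda kr: kr[0])

    # Phase 3: merge -- slide a window of w records over the sorted list;
    # each record entering the window is compared with the previous w - 1.
    matches = []
    for i in range(1, len(keyed)):
        for j in range(max(0, i - (w - 1)), i):
            if records_match(keyed[j][1], keyed[i][1]):
                matches.append((keyed[j][1], keyed[i][1]))
    return matches
```

In this sketch, records_match encapsulates the potentially expensive rule program, which is why the merge phase carries the largest constant factor.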
Notice that w is a parameter of the window-scanning procedure. The legitimate values of w may range from 2 (whereby only two consecutive elements are compared) to N (whereby each element is compared to all others). The latter case pertains to the full quadratic (O(N^2)) time process at the maximal potential accuracy (as defined by the equational theory to be the percentage of all duplicates correctly found in the merging process). The former case (where w may be viewed as a small constant relative to N) pertains to optimal time performance (only O(N) time) but at minimal accuracy. The fundamental question is: what are the optimal settings for w to maximize accuracy while minimizing computational cost?

Note, however, that for very large databases the dominant cost is likely disk I/O, and hence the number of passes over the data set. In this case, at least three passes would be needed: one pass for conditioning the data and preparing keys; at least a second pass, likely more, for a high-speed sort such as the AlphaSort [20]; and a final pass for window processing and application of the rule program to each record entering the sliding window. Depending upon the complexity of the rule program and the window size w, the last pass may indeed be the dominant cost. We introduced elsewhere [16] a means of speeding up this phase by processing "parallel windows" in the sorted list. We note with interest that the sorts of optimizations detailed in the AlphaSort paper [20] may of course be fruitfully applied here. We are more concerned with alternative process architectures that lead to higher accuracies in the computed results whil...
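As a rough, purely illustrative calculation of the time/accuracy trade-off governed by w discussed above, the following assumes a hypothetical data set of one million records (not a figure drawn from our experiments) and compares the number of record-pair comparisons performed by the merge phase at the two extremes of w.

```python
# Illustrative comparison counts for the merge phase; N is an assumed
# example size, not a data set used in the experiments reported here.
N = 1_000_000

# w = N: every record is compared with all others (the full quadratic process).
full_pairwise = N * (N - 1) // 2      # about 5.0e11 comparisons

# Small constant window, e.g. w = 10: each record entering the window is
# compared with at most the previous w - 1 records, so the count grows
# linearly with N (this slightly overcounts the first few records).
w = 10
windowed = (N - 1) * (w - 1)          # about 9.0e6 comparisons

print(full_pairwise // windowed)      # roughly 55,000 times fewer comparisons
```

Whether the smaller window is acceptable depends, as discussed above, on how much accuracy it sacrifices and on whether the final pass is dominated by I/O or by the rule program.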