Real-world Data is Dirty: Data Cleansing and the Merge/Purge Problem

R_0: The initial relation.
Δ_i: The i-th increment relation.
c_i: A relation of only "prime representatives" of the clusters identified by the Merge/Purge procedure.

Initially:
R_0 ← Δ_0
c_0 ← ∅
i ← 0

Incremental Algorithm:
For every Δ_i do:
begin
1. Let I_i ← CONCATENATE(c_i, Δ_i).
2. Apply any Merge/Purge procedure to I_i. The result is a cluster assignment for every record in I_i.
3. Separate each record in I_i into the clusters assigned by the previous step.
4. For every cluster of records, if necessary, select one or more records as prime representatives for the cluster. Call the relation formed of all selected prime representatives c_{i+1}.
end.

Figure 10: Incremental Merge/Purge Algorithm

Random Sample: Select a sample of records at random from each cluster.

N-Latest: Data is sometimes physically ordered by the time of entry into the relation. In many such cases, the most recent elements entered in the database can be assumed to better represent the cluster (the OCAR data is one such example). In this strategy, the N latest elements are selected as prime-representatives.

Generalization: Generate the prime-representatives by generalizing the data collected from several positive examples (records) of the concept represented by the cluster. Techniques for generalizing concepts are well known from machine learning [9, 18].

Syntactic: Choose the largest or most complete record.

Utility: Choose the record that matched others most frequently.

In this section we present initial results comparing the time and accuracy performance of incremental Merge/Purge with the basic Merge/Purge algorithm. We selected the N-Latest prime-representative strategy for our experiments for its implementation simplicity. Experiments are underway to test and compare all the above strategies. Results will be described in a future report.

Two important assumptions were made while describing the Incremental Merge/Purge algorithm.
First, it was assumed that no data previously used to select each cluster's prime-representative will be deleted (i.e., no negative deltas). Second, it was also assumed that no changes in the rule-set will occur after the first increment of data is processed. We now discuss, briefly, the implications of these two assumptions.

Removing records already clustered could split some clusters. If a removed record was responsible for merging two clusters, the original two clusters so merged will become separated. Two new prime-representatives must be computed before the next increment of data arrives for processing. The procedure to follow in case of deletions is the following:

1. Delay all deletions until after step 3 of the Incremental Algorithm in Figure 10.
2. Perform all deletions. Remember the cluster IDs of all clusters affected.
3. Re-compute the closure in all clusters affected, splitting existing clusters as necessary.

Then, the Incremental Algorithm resumes at step 4 by recomputing a new prime-representative for all clusters, including the new one formed after the deletions.

Changes to the data are a little more difficult. Changes could be treated as a deletion followed by an insertion. However, it is often the case (in particular, if it is a human making the change) that the new record should belong to the same cluster as the removed one. Here a user-set parameter should determine how and in what circumstances changes to data should be treated as a deletion followed by an insertion (to be evaluated in the next increment evaluation) or just a direct change into an existing cluster.

Changes of the rule-base defining the equational theory are even more difficult to correctly incorporate into the Incremental Algorithm. Minor changes to the rule-base (for example, small changes to some thresholds defining equality over two fields, deletion of rules that have rarely fired) are expected to have little impact on the contents of the formed clusters.
Nonetheless, depending on the data or if major changes are made to the rule-base, a large number of current clusters could be erroneous. Unfortunately, the only solution to this problem is to run a Merge/Purge procedure once again using all available data. On the other hand, depending on the application, a small number of inconsistencies might be acceptable, therefore avoiding the need to run the entire procedure. Here, once again, the decision is highly application dependent and requires human intervention to resolve.

5.1 Initial experimental results on the Incremental Algorithm

We conducted a number of experiments to test the incremental Merge/Purge algorithm. In these experiments we were interested in studying the time performance of the different stages of the algorithm and the effect on the accuracy of the results. To this end, we started with the OCAR sample described in section 4.1 (128,439 records) and divided it into five (5) parts, Δ_0, Δ_1, Δ_2, Δ_3, Δ_4, with 25,000, 25,000, 25,000, 25,000 and 28,439 records, respectively. The incremental Merge/Purge algorithm was implemented as a UNIX shell script which concatenated and fed the proper parts to the basic multipass sorted-neighb...
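
The four steps of the Incremental Algorithm in Figure 10, combined with the N-Latest prime-representative strategy, can be sketched in Python as follows. This is only an illustration under stated assumptions, not the shell-script implementation used in the experiments: the names incremental_merge_purge, n_latest and toy_merge_purge are hypothetical, records are assumed to arrive in entry order, and the toy clustering function (exact match on a normalized name) merely stands in for a real Merge/Purge procedure such as the multipass method.

```python
from typing import Callable, Dict, Hashable, Iterable, Iterator, List

Record = Dict[str, object]
# Assumption: a Merge/Purge procedure returns one cluster label per input record.
ClusterFn = Callable[[List[Record]], List[Hashable]]


def n_latest(members: List[Record], n: int) -> List[Record]:
    """N-Latest strategy: keep the n most recently entered records (n >= 1).

    Assumes list order reflects entry order."""
    return members[-n:]


def incremental_merge_purge(
    increments: Iterable[List[Record]],
    merge_purge: ClusterFn,
    n: int = 1,
) -> Iterator[Dict[Hashable, List[Record]]]:
    """Yield the cluster assignment computed for each increment."""
    representatives: List[Record] = []  # c_0 is empty
    for delta in increments:
        # Step 1: I_i = CONCATENATE(c_i, delta_i)
        batch = representatives + list(delta)
        # Step 2: apply any Merge/Purge procedure to I_i
        labels = merge_purge(batch)
        # Step 3: separate the records into their assigned clusters
        clusters: Dict[Hashable, List[Record]] = {}
        for record, label in zip(batch, labels):
            clusters.setdefault(label, []).append(record)
        # Step 4: select prime representatives, forming c_{i+1}
        representatives = [
            rep for members in clusters.values() for rep in n_latest(members, n)
        ]
        yield clusters


def toy_merge_purge(records: List[Record]) -> List[Hashable]:
    """Hypothetical stand-in: cluster on a case- and space-normalized name."""
    return [str(rec["name"]).strip().lower() for rec in records]


increments = [
    [{"id": 1, "name": "Ann Lee"}, {"id": 2, "name": "ann lee "},
     {"id": 3, "name": "Bob Roy"}],
    [{"id": 4, "name": "ANN LEE"}, {"id": 5, "name": "Cy Tan"}],
]
results = list(incremental_merge_purge(increments, toy_merge_purge, n=1))
```

Note that each yielded assignment covers only the records in I_i, that is, the current representatives plus the new increment. Earlier non-representative records are not re-examined, which is where the time savings of the incremental approach would be expected to come from.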

This document was uploaded on 02/15/2014.