R0 : The initial relation.
∆i : The i-th increment relation.
∆ci : A relation of only "prime representatives" of the clusters identified by the Merge/Purge procedure. Initially: ∆c0 = R0.

Incremental Algorithm:
For every ∆i do:
1. Let Ii = CONCATENATE(∆ci, ∆i).
2. Apply any Merge/Purge procedure to Ii. The result is a cluster
   assignment for every record in Ii.
3. Separate each record in Ii into the clusters assigned by the
   Merge/Purge procedure.
4. For every cluster of records, if necessary, select one or more
   records as prime representatives for the cluster. Call the
   relation formed of all selected prime representatives, ∆ci+1.
end.

Figure 10: Incremental Merge/Purge Algorithm

Several strategies may be used to select the prime representatives of a cluster:

Random Sample: Select a sample of records at random from each cluster.

N-Latest: Data is sometimes physically ordered by the time of entry into the relation. In many such cases, the most recent elements entered in the database can be assumed to better represent the cluster (e.g., the OCAR data is such an example). In this strategy, the N latest elements are selected as prime representatives.

Generalization: Generate the prime representatives by generalizing the data collected from several positive examples (records) of the concept represented by the cluster. Techniques for generalizing concepts are well known from machine learning [9, 18].

Syntactic: Choose the largest or most complete record.

Utility: Choose the record that matched others most frequently.
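The incremental loop of Figure 10, combined with the N-Latest strategy, can be sketched as follows. This is a minimal illustration, not the authors' implementation: the toy merge_purge() stands in for any real Merge/Purge procedure (e.g., the multi-pass sorted-neighborhood method), and the record fields "key" and "entered" are assumptions.

```python
def merge_purge(records):
    # Toy stand-in for a real Merge/Purge procedure: group records
    # by an assumed 'key' field and return the resulting clusters.
    clusters = {}
    for rec in records:
        clusters.setdefault(rec["key"], []).append(rec)
    return list(clusters.values())

def n_latest(cluster, n):
    # N-Latest strategy: keep the n most recently entered records,
    # assuming each record carries an 'entered' timestamp.
    return sorted(cluster, key=lambda r: r["entered"])[-n:]

def incremental_merge_purge(increments, n=1):
    primes = []                       # prime representatives, delta-c_i
    clusters = []
    for delta in increments:          # for every increment delta_i
        i_i = primes + delta          # step 1: CONCATENATE(delta-c_i, delta_i)
        clusters = merge_purge(i_i)   # steps 2-3: cluster assignment
        # step 4: select prime representatives for the next increment
        primes = [r for c in clusters for r in n_latest(c, n)]
    return clusters
```

Each pass clusters only the new increment together with the (much smaller) relation of prime representatives, which is the source of the time savings reported below.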
In this section we present initial results comparing the time and accuracy performance of incremental Merge/Purge with the basic Merge/Purge algorithm. We selected the N-Latest prime-representative strategy for our experiments for its implementation simplicity. Experiments are underway to test and compare all the above strategies. Results will be described in a future report.
Two important assumptions were made while describing the Incremental Merge/Purge algorithm. First, it was assumed that no data previously used to select each cluster's prime-representative will be deleted (i.e., no negative deltas). Second, it was also assumed that no changes in the rule-set will occur after the first increment of data is processed. We now discuss, briefly, the implications of these two assumptions.
Removing records already clustered could split some clusters. If a removed record was
responsible for merging two clusters, the original two clusters so merged will become separated. Two new prime-representatives must be computed before the next increment of data
arrives for processing. The procedure to follow in case of deletions is the following:
1. Delay all deletions until after step 3 of the Incremental Algorithm in Figure 10.
2. Perform all deletions. Remember the cluster IDs of all clusters affected.
3. Re-compute the closure in all affected clusters, splitting existing clusters as necessary.

Then, the Incremental Algorithm resumes at step 4 by recomputing a new prime-representative for all clusters, including the new ones formed after the deletions.
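Step 3 above, re-computing the closure inside an affected cluster, can be sketched with a union-find pass over the records that remain after deletion. This is an illustrative sketch under assumptions: matches() stands in for the pairwise predicate induced by the equational theory, and records are assumed to carry an "id" field.

```python
def split_after_deletions(cluster, deleted_ids, matches):
    # Remove the deleted records, then recompute the transitive closure
    # of the match relation over what remains; if a deleted record was
    # the only link between two groups, the cluster splits.
    remaining = [r for r in cluster if r["id"] not in deleted_ids]
    parent = {r["id"]: r["id"] for r in remaining}  # union-find forest

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a in remaining:
        for b in remaining:
            if a["id"] < b["id"] and matches(a, b):
                parent[find(a["id"])] = find(b["id"])  # union

    groups = {}
    for r in remaining:
        groups.setdefault(find(r["id"]), []).append(r)
    return list(groups.values())  # one or more clusters after the split
```

For example, if records 1-2 and 2-3 match pairwise but 1-3 do not, deleting record 2 splits the cluster into two singletons.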
Changes to the data are a little more difficult. Changes could be treated as a deletion followed by an insertion. However, it is often the case (in particular, if it is a human making the change) that the new record should belong to the same cluster as the removed one. Here a user-set parameter should determine how and under what circumstances a change to the data is treated as a deletion followed by an insertion (to be evaluated with the next increment) or as a direct change to an existing cluster.
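One simple policy of this kind can be sketched as follows. This is a hypothetical illustration, not the paper's mechanism: the policy here applies the change in place when the edited record still matches the cluster's prime representative, and otherwise defers it as a deletion plus a pending insertion; matches() and the record layout are assumptions.

```python
def apply_change(cluster, old_rec, new_rec, prime, matches, pending):
    # Remove the old version of the record from its cluster.
    cluster.remove(old_rec)
    if matches(new_rec, prime):
        cluster.append(new_rec)   # direct change within the same cluster
    else:
        pending.append(new_rec)   # deferred: re-clustered with the
                                  # next increment as a fresh insertion
    return cluster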
Changes of the rule-base defining the equational theory are even more difficult to correctly incorporate into the Incremental Algorithm. Minor changes to the rule-base (for example, small changes to some thresholds defining equality over two fields, or deletion of rules that have rarely fired) are expected to have little impact on the contents of the formed clusters.
Nonetheless, depending on the data, or if major changes are made to the rule-base, a large number of current clusters could be erroneous. Unfortunately, the only solution to this problem is to run a Merge/Purge procedure once again using all available data. On the other hand, depending on the application, a small number of inconsistencies might be acceptable, thereby avoiding the need to run the entire procedure. Here, once again, the decision is highly application dependent and requires human intervention to resolve.

5.1 Initial experimental results on the Incremental Algorithm
We conducted a number of experiments to test the incremental Merge/Purge algorithm. In these experiments we were interested in studying the time performance of the different stages of the algorithm and the effect on the accuracy of the results.
To this end, we started with the OCAR sample described in section 4.1 (128,439 records) and divided it into five (5) parts, ∆0, ∆1, ∆2, ∆3, ∆4, with 25,000, 25,000, 25,000, 25,000 and 28,439 records, respectively. The incremental Merge/Purge algorithm was implemented as a UNIX shell script which concatenated and fed the proper parts to the basic multi-pass sorted-neighborhood method.
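A driver script in the spirit described above could look like the following. This is a runnable sketch, not the authors' script: merge_purge and select_primes are stubs (sort with duplicate removal, and a crude last-record selection) standing in for the real procedures, and all file names are assumptions.

```shell
#!/bin/sh
merge_purge() { sort -u "$1"; }       # stub for the real Merge/Purge binary
select_primes() { tail -n 1 "$1"; }   # stub prime-representative selection

printf 'a\nb\n' > part0.dat           # toy increments standing in for
printf 'b\nc\n' > part1.dat           # the 25,000-record parts

cp part0.dat primes.dat               # delta-c_0 = R_0
for part in part1.dat; do
    cat primes.dat "$part" > input.dat        # step 1: CONCATENATE
    merge_purge input.dat > clusters.dat      # steps 2-3: cluster
    select_primes clusters.dat > primes.dat   # step 4: new prime reps
done
cat primes.dat
```

In the real experiments each iteration feeds a full 25,000-record part, plus the prime representatives from the previous pass, through the multi-pass sorted-neighborhood method.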