maximum duplicates per selected record. The
total size of the database in bytes was approximately 1 MByte. Once read, the database
stayed in core during all phases. We ran three independent singlepass runs using different
keys and a multipass run using the results of the three singlepass runs. The parameters
for this experiment were N = 13751 records and r = 3. For this particular case where
w = 10, we have $\alpha \simeq 6$, $c \simeq 1.2 \times 10^{-5}$, $T_{closure}^{sp} = 1.2\,s$, and $T_{closure}^{mp} = 7\,s$. (Time is
specified in seconds (s).) Thus, the multipass approach dominates the single sort approach
for these datasets when W > 41.
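The transitive-closure step that merges the three singlepass results can be sketched with a union-find structure (a minimal illustration, not the paper's implementation; the record ids and pair lists in the test usage are invented):

```python
def transitive_closure(n, pass_results):
    """Union-find over record ids 0..n-1: two records are duplicates if any
    chain of pairwise matches, found by any pass, connects them."""
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps trees shallow
            x = parent[x]
        return x

    # union every matched pair reported by every singlepass run
    for pairs in pass_results:
        for a, b in pairs:
            parent[find(a)] = find(b)

    # group record ids into duplicate clusters
    clusters = {}
    for i in range(n):
        clusters.setdefault(find(i), []).append(i)
    return [c for c in clusters.values() if len(c) > 1]
```

For example, if one pass matches records (0, 1) and another matches (1, 2), the closure reports the single cluster {0, 1, 2}, which is how the multipass approach recovers duplicates that no single key ordering brings within one window.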
Figure 4(a) shows the time required for each independent singlepass run on one processor,
and the total time required for the multipass approach, while figure 4(b) shows the accuracy
of each independent run as well as the accuracy of the multipass approach (please note
the logarithmic scale).

[Figure 4: Time and Accuracy for a Small Database. (a) Time (s) for each singlepass run and the multipass run vs. window size W; (b) ideal vs. real accuracy (duplicates found, %) of each run vs. window size W. Curves: Key 1 (last name), Key 2 (first name), Key 3 (street addr), and multipass with all keys.]

For w = 10, figure 4(a) shows that the multipass approach needed
56.3 s to produce an accuracy rate of 93.4% (figure 4(b)). Looking now at the times for each
singlepass run, their total time is close to 56 s for W = 52, slightly higher than estimated
with the above model. But the accuracies of all singlepass runs in figure 4(b) at W = 52 are
from 73% to 80%, well below the 93.4% accuracy level of the multipass approach. Moreover,
no singlepass run reaches an accuracy of more than 93% until W > 7000, at which point
(not shown in figure 4(a)) their execution times are over 4,800 seconds (80 minutes).
Let us now consider the issue when the process is I/O bound rather than a compute-bound
main-memory process. Let B be the number of disk blocks used by the input data set
and M the number of memory pages available. Each sorted-neighborhood method execution
will access $2B \log_{M-1} B$ disk blocks³, plus B disk blocks will be read by the window-scanning
phase. The time for the sorted-neighborhood method can be expressed as:

$$T_{snm} = 2 c_{sort} B \log_{M-1} B + c_{wscan} B$$

where $c_{sort}$ represents the CPU cost of sorting the data in one block and $c_{wscan}$ represents
the CPU cost of applying the window-scan method to the data in one block.
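For concreteness, the two phases that $c_{sort}$ and $c_{wscan}$ model can be sketched as a simplified in-memory singlepass run; the `key` extractor and `matches` predicate here stand in for the paper's key computation and equational theory and are illustrative assumptions:

```python
def sorted_neighborhood(records, key, matches, W):
    """One singlepass run: sort on the key, then slide a window of W
    records and compare each record against the W-1 records before it."""
    srt = sorted(records, key=key)  # sort phase (the c_sort term)
    pairs = []
    for i in range(len(srt)):       # window-scan phase (the c_wscan term)
        for j in range(max(0, i - W + 1), i):
            if matches(srt[j], srt[i]):
                pairs.append((srt[j], srt[i]))
    return pairs
```

Note that W bounds the number of comparisons per record; those comparisons are CPU cost, which is why the window-scanning phase contributes only B additional block reads to the I/O count.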
³ The 2 comes from the fact that we are counting both read and write operations.

Instead of sorting, we could divide the data into C buckets (e.g., hashing the records or
using a multidimensional partitioning strategy [15]). We call this modification the clustering
method. Assuming M = C + 1 (1 page for each bucket plus one page for processing an input
block), we need one pass over the entire data to partition the records into C buckets (B
blocks are read). Writing the records into the buckets requires, approximately, B block
writes. Assuming the partition algorithm is perfect, each bucket will use $\lceil B/C \rceil$ blocks. We
must then sort each bucket ($2B \log_{M-1} \lceil B/C \rceil$ block accesses) and apply the window-scanning phase to each
bucket independently (approximately B block accesses). In total, the clustering method
requires approximately $3B + 2B \log_{M-1} \lceil B/C \rceil$ block accesses. The time for one pass of the
clustering method can be expressed as:

$$T_{cluster} = 2 c_{cluster} B + 2 c_{sort} B \log_{M-1} \lceil B/C \rceil + c_{wscan} B$$

where $c_{cluster}$ is the CPU cost of partitioning one block of data.
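Plugging the parameters of the experiment reported below (B = 31,250, M = 33, hence C = M − 1 = 32 buckets) into the two cost models makes the comparison concrete. The per-block cost constants here are invented for illustration, chosen so that $c_{wscan}$ dominates both $c_{sort}$ and $c_{cluster}$, as the text reports for this equational theory:

```python
import math

def t_snm(B, M, c_sort, c_wscan):
    # T_snm = 2*c_sort*B*log_{M-1}(B) + c_wscan*B
    return 2 * c_sort * B * math.log(B, M - 1) + c_wscan * B

def t_cluster(B, M, C, c_cluster, c_sort, c_wscan):
    # T_cluster = 2*c_cluster*B + 2*c_sort*B*log_{M-1}(ceil(B/C)) + c_wscan*B
    sort_term = 2 * c_sort * B * math.log(math.ceil(B / C), M - 1)
    return 2 * c_cluster * B + sort_term + c_wscan * B

# hypothetical per-block costs in seconds; the window scan dominates
c_sort, c_cluster, c_wscan = 1e-4, 5e-5, 1e-2
snm = t_snm(31250, 33, c_sort, c_wscan)
cluster = t_cluster(31250, 33, 32, c_cluster, c_sort, c_wscan)
# clustering comes out cheaper, but only by a few percent:
# the identical c_wscan*B term swamps the sorting savings
```

With C = M − 1, partitioning saves roughly $2 c_{sort} B \log_{M-1} C$ of sorting at a price of $2 c_{cluster} B$ of hashing, so clustering wins whenever hashing a block is cheaper than the sort work it avoids; the window-scan term $c_{wscan} B$ appears in both models and dominates the difference.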
Finally, the I/O cost of the multipass approach will be a multiple of the I/O cost of the
method we chose for each pass, plus the time needed to compute the transitive closure step.
For instance, if we use the clustering method for 3 passes, we should expect a time of
about $3 T_{cluster} + T_{closure}$.
Figure 5 shows a time comparison between the clustering method and the sorted-neighborhood
method. These results were gathered using a generated data set of 468,730 records (B =
31,250, block size = 1,024 bytes, M = 33 blocks). Notice that in all cases the clustering
method does better than the sorted-neighborhood method. However, the difference in time
is not large. This is mainly due to the fact that the equational theory used involved a large
number of comparisons, making $c_{wscan}$ a lot larger than both $c_{sort}$ and $c_{cluster}$. Thus, even
though there are some time savings in initially partitioning the data, the savings are small
compared to the overall time cost.
In [16] we describe parallel variants of the basic techniques (including clustering) to show
that with a modest amount of "cheap" parallel hardware, we can speed up the multipass
approach to a level comparable to the time to do a singlepass approach, but with a very
high accuracy; i.e., a multipass approach over a few small windows ultimately wins.
[Figure 5: Cluste… (caption truncated). Time (s) vs. window size (records, 2–10) comparing: average singlepass time, Naive SNM; average singlepass time, Clustering SNM; total multipass time, Naive SNM; total multipass time, Clustering SNM.]