maximum duplicates per selected record. The total size of the database was approximately 1 MByte. Once read, the database stayed in core during all phases. We ran three independent single-pass runs using different keys, and a multi-pass run using the results of the three single-pass runs. The parameters for this experiment were N = 13,751 records and r = 3. For this particular case where w = 10, we have $\alpha \approx 6$, $c \approx 1.2 \times 10^{-5}$, $T_{closure}^{sp} = 1.2$s, and $T_{closure}^{mp} = 7$s. (Time is specified in seconds (s).) Thus, the multi-pass approach dominates the single sort approach for these datasets when W > 41.
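To make the procedure concrete, the following is a minimal sketch of the multi-pass approach: one small-window sorted-neighborhood pass per key, followed by a transitive closure over the union of all matched pairs. All identifiers here (records, key_funcs, pair_matches, w) are illustrative assumptions; in particular, pair_matches stands in for the equational theory, whose actual rule base is not shown.

```python
from collections import defaultdict

def sorted_neighborhood_pass(records, key_func, w, pair_matches):
    """One single-pass run: sort on one key, then slide a window of w records."""
    order = sorted(range(len(records)), key=lambda i: key_func(records[i]))
    pairs = set()
    for pos, i in enumerate(order):
        # compare the current record against the previous w - 1 records
        for j in order[max(0, pos - w + 1):pos]:
            if pair_matches(records[i], records[j]):
                pairs.add((min(i, j), max(i, j)))
    return pairs

def multi_pass(records, key_funcs, w, pair_matches):
    """Run one small-window pass per key, then take the transitive closure
    of all matched pairs with a union-find structure."""
    parent = list(range(len(records)))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x
    for key_func in key_funcs:
        for i, j in sorted_neighborhood_pass(records, key_func, w, pair_matches):
            parent[find(i)] = find(j)      # merge the two equivalence classes
    clusters = defaultdict(list)
    for i in range(len(records)):
        clusters[find(i)].append(i)
    return [ids for ids in clusters.values() if len(ids) > 1]
```

The transitive-closure step is what lets several cheap passes with a small w approach the accuracy of one expensive pass with a large W.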
Figure 4(a) shows the time required for each independent single-pass run on one processor and the total time required for the multi-pass approach, while Figure 4(b) shows the accuracy of each independent run as well as the accuracy of the multi-pass approach (please note the logarithmic scale).
[Figure 4: Time and Accuracy for a Small Database. Panel (a): time for each single-pass run and the multi-pass run; panel (b): ideal vs. real accuracy of each run. Both panels plot Window Size (W) on the x-axis, against Total Time (s) in (a) and Duplicates Found (%) in (b), for Key 1 (last name), Key 2 (first name), Key 3 (street addr), and the multi-pass run with all keys.]

For w = 10, Figure 4(a) shows that the multi-pass approach needed
56.3s to produce an accuracy rate of 93.4% (Figure 4(b)). Looking now at the times for each single-pass run, their total time is close to 56s for W = 52, slightly higher than estimated with the above model. But the accuracies of all single-pass runs in Figure 4(b) at W = 52 range from 73% to 80%, well below the 93.4% accuracy level of the multi-pass approach. Moreover, no single-pass run reaches an accuracy of more than 93% until W > 7000, at which point (not shown in Figure 4(a)) their execution times are over 4,800 seconds (80 minutes).
Let us now consider the case when the process is I/O bound rather than a compute-bound main-memory process. Let B be the number of disk blocks used by the input data set and M the number of memory pages available. Each sorted-neighborhood method execution will access $2B \log_{M-1} B$ disk blocks³, plus B disk blocks will be read by the window-scanning phase. The time for the sorted-neighborhood method can be expressed as:

$T_{snm} = 2 c_{sort} B \log_{M-1} B + c_{wscan} B$

where $c_{sort}$ represents the CPU cost of sorting the data in one block and $c_{wscan}$ represents the CPU cost of applying the window-scan method to the data in one block.

³ The 2 comes from the fact that we are counting both read and write operations.
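As a sanity check of this model, here is a small sketch that evaluates $T_{snm}$ for given parameters; the function name is an assumption, and the per-block constants $c_{sort}$ and $c_{wscan}$ must be measured on a particular system.

```python
import math

def t_snm(B, M, c_sort, c_wscan):
    """Evaluate T_snm = 2*c_sort*B*log_{M-1}(B) + c_wscan*B.

    B: input size in disk blocks, M: available memory pages,
    c_sort / c_wscan: measured per-block CPU costs for sorting
    and window scanning, respectively.
    """
    return 2 * c_sort * B * math.log(B, M - 1) + c_wscan * B
```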
Instead of sorting, we could divide the data into C buckets (e.g., hashing the records or using a multi-dimensional partitioning strategy [15]). We call this modification the clustering method. Assuming M = C + 1 (one page for each bucket plus one page for processing an input block), we need one pass over the entire data to partition the records into the C buckets (B blocks are read). Writing the records into the buckets requires, approximately, B block writes. Assuming the partition algorithm is perfect, each bucket will use $\lceil B/C \rceil$ blocks. We must then sort ($2B \log_{M-1} \lceil B/C \rceil$ block accesses) and apply the window-scanning phase to each bucket independently (approximately B block accesses). In total, the clustering method requires approximately $3B + 2B \log_{M-1} \lceil B/C \rceil$ block accesses. The time for one pass of the clustering method can be expressed as:

$T_{cluster} = 2 c_{cluster} B + 2 c_{sort} B \log_{M-1} \lceil B/C \rceil + c_{wscan} B$

where $c_{cluster}$ is the CPU cost of partitioning one block of data.
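Under the same assumptions, the corresponding sketch for one pass of the clustering method:

```python
import math

def t_cluster(B, M, c_cluster, c_sort, c_wscan):
    """Evaluate T_cluster = 2*c_cluster*B
                          + 2*c_sort*B*log_{M-1}(ceil(B/C)) + c_wscan*B,
    with C = M - 1 buckets (one memory page per bucket, one for input).
    """
    C = M - 1
    bucket_blocks = math.ceil(B / C)  # blocks per bucket, perfect partition
    return (2 * c_cluster * B
            + 2 * c_sort * B * math.log(bucket_blocks, M - 1)
            + c_wscan * B)
```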
Finally, the I/O cost of the multi-pass approach will be a multiple of the I/O cost of the method we choose for each pass, plus the time needed to compute the transitive closure step. For instance, if we use the clustering method for 3 passes, we should expect a total time of about $3 T_{cluster} + T_{closure}$.
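For illustration only, the snippet below plugs in the B and M of the experiment reported next (B = 31,250 blocks, M = 33 pages), together with made-up placeholder values for the CPU constants and the closure time, and reuses the t_snm and t_cluster sketches above; none of these constants come from the paper.

```python
B, M = 31_250, 33                                # from the experiment below
c_sort, c_cluster, c_wscan = 1e-4, 5e-5, 1e-2    # placeholder constants (s/block)
T_closure = 30.0                                 # placeholder closure time (s)

one_snm_pass = t_snm(B, M, c_sort, c_wscan)
one_cluster_pass = t_cluster(B, M, c_cluster, c_sort, c_wscan)
three_pass_total = 3 * one_cluster_pass + T_closure
print(f"SNM pass: {one_snm_pass:.0f}s, clustering pass: {one_cluster_pass:.0f}s, "
      f"3-pass clustering + closure: {three_pass_total:.0f}s")
```

With $c_{wscan}$ chosen much larger than $c_{sort}$ and $c_{cluster}$, the two methods come out nearly tied, which is consistent with the observation below.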
Figure 5 shows a time comparison between the clustering method and the sorted-neighborhood method. These results were gathered using a generated data set of 468,730 records (B = 31,250, block size = 1,024 bytes, M = 33 blocks). Notice that in all cases the clustering method does better than the sorted-neighborhood method. However, the difference in time is not large. This is mainly due to the fact that the equational theory used involved a large number of comparisons, making $c_{wscan}$ a lot larger than both $c_{sort}$ and $c_{cluster}$. Thus, even though there are some time savings in initially partitioning the data, the savings are small compared to the overall time cost.
In [16] we describe parallel variants of the basic techniques (including clustering) to show that with a modest amount of "cheap" parallel hardware, we can speed up the multi-pass approach to a level comparable to the time of a single-pass approach, but with a very high accuracy, i.e., a few passes with small windows ultimately win.
[Figure 5: Cluste... (caption truncated). X-axis: Window size (records); y-axis: Time (s). Series: average single-pass time (naive SNM), average single-pass time (clustering SNM), total multi-pass time (naive SNM), total multi-pass time (clustering SNM).]