icates as a function of the window size. The
percent of false positives is almost insignificant for each independent run and grows slowly
as the window size increases. The percent of false positives after the transitive closure is
also very small, but grows faster than each individual run alone. This suggests that the
transitive closure may not be as accurate if the window size of each constituent pass is very
large!
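The interplay between independent runs and the transitive closure described above can be illustrated with a minimal Python sketch. It is not the implementation used in the experiments: the record layout, the two sort keys, and the `is_match` rule are hypothetical stand-ins, and the closure is computed with a union-find structure, which is only one of several known ways to do it.

```python
def sorted_neighborhood_pass(records, key, window, is_match):
    """One pass: sort by `key`, then compare each record only with the
    previous `window - 1` records in sorted order. Returns matching id pairs."""
    order = sorted(records, key=key)
    pairs = set()
    for i, rec in enumerate(order):
        for other in order[max(0, i - window + 1):i]:
            if is_match(rec, other):
                a, b = rec["id"], other["id"]
                pairs.add((min(a, b), max(a, b)))
    return pairs


def transitive_closure(pairs):
    """Union-find over tuple ids; returns the resulting clusters of ids."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b

    clusters = {}
    for x in list(parent):
        clusters.setdefault(find(x), set()).add(x)
    return {frozenset(c) for c in clusters.values()}


def multipass(records, keys, window, is_match):
    """Run one pass per key, pool the matched pairs, then close transitively."""
    pairs = set()
    for key in keys:
        pairs |= sorted_neighborhood_pass(records, key, window, is_match)
    return transitive_closure(pairs)


# Hypothetical toy data and matching rule (at least two of three fields agree).
records = [
    {"id": 1, "last": "Smith", "first": "John", "zip": "10027"},
    {"id": 2, "last": "Smyth", "first": "John", "zip": "10027"},
    {"id": 3, "last": "Smith", "first": "Jon", "zip": "10027"},
    {"id": 4, "last": "Adams", "first": "Mary", "zip": "94305"},
]


def is_match(a, b):
    return sum(a[f] == b[f] for f in ("last", "first", "zip")) >= 2


keys = [lambda r: r["last"], lambda r: r["first"]]
clusters = multipass(records, keys, window=2, is_match=is_match)
print(clusters)
```

In this toy run each pass on its own finds a single pair (ids 1 and 3 when sorting by last name, ids 1 and 2 when sorting by first name), and the closure merges them into one cluster of three records. The same mechanism explains the caution above: a single false-positive pair from any pass would merge two unrelated clusters.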
The number of independent runs needed to obtain good results with the computation
of the transitive closure depends on how corrupt the data is and on the keys selected. The
more corrupted the data, the more runs might be needed to capture the matching records. The
transitive closure, however, is executed on pairs of tuple id's, each at most 30 bits, and fast
solutions to compute transitive closure exist [2]. From observing real-world scenarios, the
size of the data set over which the closure is computed is at least one order of magnitude
smaller than the corresponding database of records, and thus does not contribute a large
cost. But note we pay a heavy price due to the number of sorts or clusterings of the original
large data set. We presented some parallel implementation alternatives to reduce this cost
in [16].

Figure 3: Time performance of the sorted-neighborhood methods for different size databases.
(Plot of time in seconds against total number of records, x 1M, with one curve each for
10%, 30%, and 50% duplicates.)

3.2.1 Scaling Up
Finally, we demonstrate that the sorted-neighborhood method scales well as the size of the
database increases. Due to the limitations of our available disk space, we could only grow our
databases to about 3,000,000 records. We again ran three independent runs of the
sorted-neighborhood method, each with a different key, and then computed the transitive closure
of the results. We did this for the 12 databases in Table 3. We started with four (4)
"no-duplicate databases" and for each we created duplicates for 10%, 30%, and 50% of the
records, for a total of twelve (12) distinct databases. The results are shown in Figure 3. For
these relatively large size databases, the time seems to increase linearly as the size of the
databases increases, independent of the duplication factor.

Original number      Total records                      Total size (Mbytes)
of records           10%        30%        50%          10%      30%      50%
500000               584495     754354     924029       45.4     58.6     71.8
1000000              1169238    1508681    1847606      91.3     118.1    144.8
1500000              1753892    2262808    2770641      138.1    178.4    218.7
1750000              2046550    2639892    3232258      161.6    208.7    255.7

Table 3: Database sizes

3.3 Analysis
The natural question to pose is: when is the multi-pass approach superior to the single-pass
case? The answer to this question lies in the complexity of the two approaches for a fixed
accuracy rate (here we consider the percentage of correctly found matches).
Here we consider this question in the context of a main-memory based sequential process.
We do so because, as we shall see, clustering provides the opportunity to reduce the
problem of sorting the entire disk-resident database to a sequence of smaller, main-memory
based analysis tasks. The serial time complexity of the multi-pass approach (with r passes) is
given by the time to create the keys, the time to sort r times, the time to window scan r times
(with window size w), plus the time to compute the transitive closure. In our experiments, the
creation of the keys was integrated into the sorting phase. Therefore, we treat both phases
as one in this analysis. Under the simplifying assumption that all data is memory resident
(i.e., we are not I/O bound),

\[ T_{multipass} = c_{sort}\, r N \log N + c_{wscan}\, r w N + T_{closure_{mp}} \]

where $r$ is the number of passes and $T_{closure_{mp}}$ is the time for the transitive closure. The
constants depict the costs for comparison only and are related as $c_{wscan} = \alpha\, c_{sort}$, where
$\alpha > 1$. From analyzing our experimental program, the window scanning phase contributes
a constant, $c_{wscan}$, which is at least $\alpha = 6$ times as large as the comparisons performed
in sorting. We replace the constants in terms of the single constant $c$. The complexity
of the closure is directly related to the accuracy rate of each pass and depends upon the
duplication in the database. However, we assume the time to compute the transitive closure
on a database that is orders of magnitude smaller than the input database to be less than
the time to scan the input database once (i.e., it contributes a factor of $c_{closure} N < N$).
Therefore,

\[ T_{multipass} = c\, r N \log N + \alpha c\, r w N + T_{closure_{mp}} \]

for a window size of $w$. The complexity of the single-pass sorted-neighborhood method is
similarly given by:

\[ T_{singlepass} = c N \log N + \alpha c\, W N + T_{closure_{sp}} \]

for a window size of $W$.
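To get a feel for the two cost expressions above, they can be evaluated numerically. The sketch below is illustrative only: it assumes a base-2 logarithm, takes c = 1 as an arbitrary unit, sets the window-scan constant to 6 times the sorting constant (the lower bound reported above, called `alpha` here), and treats the closure times as given inputs. The function names and the sample values of N, r, and w are assumptions, not measurements from the experiments.

```python
import math


def t_multipass(n, r, w, c=1.0, alpha=6.0, t_closure=0.0):
    """Cost of r sorts plus r window scans of width w, plus closure time."""
    return c * r * n * math.log2(n) + alpha * c * r * w * n + t_closure


def t_singlepass(n, big_w, c=1.0, alpha=6.0, t_closure=0.0):
    """Cost of one sort plus one window scan of width W, plus closure time."""
    return c * n * math.log2(n) + alpha * c * big_w * n + t_closure


# Assumed sample setting: one million records, three passes of window 10.
n = 1_000_000
r, w = 3, 10
multi = t_multipass(n, r, w)

for big_w in (10, 30, 40, 100):
    single = t_singlepass(n, big_w)
    cheaper = "single-pass" if single < multi else "multi-pass"
    print(f"W={big_w}: {cheaper} is cheaper")
```

Under these assumed values the single pass remains cheaper up to W = 30 and becomes more expensive by W = 40; the crossover shifts with the ratio of the two constants, the number of passes, and the closure times.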
For a fixed accuracy rate, the question is then for what value of $W$ of the single-pass
sorted-neighborhood method does the multi-pass approach perform better in time, i.e.,

\[ c N \log N + \alpha c\, W N + T_{closure_{sp}} > c\, r N \log N + \alpha c\, r w N + T_{closure_{mp}} \]

or

\[ W > \frac{r - 1}{\alpha} \log N + r w + \frac{1}{\alpha c N} \left( T_{closure_{mp}} - T_{closure_{sp}} \right) \]

(with $\alpha = c_{wscan} / c_{sort}$ as before).

To validate this model, we generated a small database of 13,751 records (7,500 original
records, 50% selected for duplications, and 5 m...