Real-world Data is Dirty: Data Cleansing and The Merge/Purge Problem

The percent of false positives is almost insignificant for each independent run and grows slowly as the window size increases. The percent of false positives after the transitive closure is also very small, but grows faster than each individual run alone. This suggests that the transitive closure may not be as accurate if the window size of each constituent pass is very large! The number of independent runs needed to obtain good results with the computation of the transitive closure depends on how corrupt the data is and the keys selected. The more corrupted the data, the more runs might be needed to capture the matching records. The transitive closure, however, is executed on pairs of tuple id's, each at most 30 bits, and fast solutions to compute the transitive closure exist [2]. From observing real-world scenarios, the size of the data set over which the closure is computed is at least one order of magnitude smaller than the corresponding database of records, and thus does not contribute a large cost. But note that we pay a heavy price due to the number of sorts or clusterings of the original large data set. We presented some parallel implementation alternatives to reduce this cost in [16].

3.2.1 Scaling Up

Finally, we demonstrate that the sorted-neighborhood method scales well as the size of the database increases. Due to the limitations of our available disk space, we could only grow our databases to about 3,000,000 records. We again ran three independent runs of the sorted-neighborhood method, each with a different key, and then computed the transitive closure of the results. We did this for the 12 databases in Table 3.

[Figure 3: Time performance of the sorted-neighborhood methods for different size databases. Plot of time (s) against total number of records (x 1M, 0.4 to 1.8), with curves for 10%, 30%, and 50% duplicates.]
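Since the closure runs over pairs of small tuple id's, it can be computed cheaply with a standard union-find (disjoint-set) structure. The following is a minimal sketch of that idea, not the paper's implementation; the pair data is illustrative:

```python
# Union-find closure over matched tuple-id pairs, as produced by the
# independent sorted-neighborhood passes. Illustrative sketch only.

def find(parent, x):
    # Path compression keeps find() near-constant amortized time.
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x

def transitive_closure(pairs):
    parent = {}
    for a, b in pairs:
        parent.setdefault(a, a)
        parent.setdefault(b, b)
        ra, rb = find(parent, a), find(parent, b)
        if ra != rb:
            parent[ra] = rb  # union the two equivalence classes
    # Group tuple ids by their root: each group is one real-world entity.
    groups = {}
    for x in parent:
        groups.setdefault(find(parent, x), set()).add(x)
    return list(groups.values())

# Pass 1 matched (1, 2); pass 2 matched (2, 3): the closure merges all three.
print(transitive_closure([(1, 2), (2, 3), (7, 8)]))
```

Because only id pairs are touched, the structure stays far smaller than the record database, consistent with the cost argument above.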
We started with four (4) "no-duplicate databases" and for each we created duplicates for 10%, 30%, and 50% of the records, for a total of twelve (12) distinct databases. The results are shown in Figure 3. For these relatively large databases, the time seems to increase linearly as the size of the database increases, independent of the duplication factor.

Table 3: Database sizes

Original number    Total records                    Total size (Mbytes)
of records         10%       30%       50%          10%     30%     50%
  500000            584495    754354    924029      45.4    58.6    71.8
 1000000           1169238   1508681   1847606      91.3   118.1   144.8
 1500000           1753892   2262808   2770641     138.1   178.4   218.7
 1750000           2046550   2639892   3232258     161.6   208.7   255.7

3.3 Analysis

The natural question to pose is: when is the multi-pass approach superior to the single-pass case? The answer to this question lies in the complexity of the two approaches for a fixed accuracy rate (here we consider the percentage of correctly found matches). We consider this question in the context of a main-memory based sequential process. The reason is that, as we shall see, clustering provides the opportunity to reduce the problem of sorting the entire disk-resident database to a sequence of smaller, main-memory based analysis tasks.

The serial time complexity of the multi-pass approach (with r passes) is given by the time to create the keys, the time to sort r times, the time to window-scan r times (with window size w), plus the time to compute the transitive closure. In our experiments, the creation of the keys was integrated into the sorting phase; therefore, we treat both phases as one in this analysis. Under the simplifying assumption that all data is memory resident (i.e., we are not I/O bound),

    T_multipass = c_sort * r * N log N + c_wscan * r * w * N + T_closure_mp

where r is the number of passes and T_closure_mp is the time for the transitive closure. The constants depict the costs for comparison only and are related as c_wscan = a * c_sort, where a > 1.
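One pass of the method being costed can be sketched as follows: sort the records by a computed key, then compare each record only with the w - 1 records immediately preceding it in sorted order. This is an illustrative sketch, not the paper's rule base; key_func and the matches() predicate stand in for the paper's key extraction and equational-theory matching:

```python
# One sorted-neighborhood pass: sort by key (the c_sort * N log N term),
# then slide a window of size w over the sorted list (the c_wscan * w * N
# term). key_func and matches are illustrative stand-ins.

def sorted_neighborhood_pass(records, key_func, matches, w):
    ordered = sorted(records, key=key_func)
    pairs = []
    for i, rec in enumerate(ordered):
        # Compare rec only against the previous w - 1 records in the window.
        for j in range(max(0, i - w + 1), i):
            if matches(ordered[j], rec):
                pairs.append((ordered[j]["id"], rec["id"]))
    return pairs

records = [
    {"id": 1, "name": "smith john"},
    {"id": 2, "name": "smith jon"},   # near-duplicate of id 1
    {"id": 3, "name": "zhang wei"},
]
# Toy matcher: same first five key characters count as a match.
same_prefix = lambda a, b: a["name"][:5] == b["name"][:5]
print(sorted_neighborhood_pass(records, lambda r: r["name"], same_prefix, w=2))
```

Running r such passes with different keys and unioning the resulting pairs (via the closure) is exactly the multi-pass cost summed in the formula above.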
From analyzing our experimental program, the window-scanning phase contributes a constant, c_wscan, which is at least a = 6 times as large as the comparisons performed in sorting. We replace the constants in terms of the single constant c. The complexity of the closure is directly related to the accuracy rate of each pass and depends upon the duplication in the database. However, we assume the time to compute the transitive closure on a database that is orders of magnitude smaller than the input database to be less than the time to scan the input database once (i.e., it contributes a factor of c_closure * N < N). Therefore,

    T_multipass = c * r * N log N + c * r * w * N + T_closure_mp

for a window size of w. The complexity of the single-pass sorted-neighborhood method is similarly given by:

    T_singlepass = c * N log N + c * W * N + T_closure_sp

for a window size of W. For a fixed accuracy rate, the question is then: for what value of W of the single-pass sorted-neighborhood method does the multi-pass approach perform better in time, i.e.,

    c * N log N + c * W * N + T_closure_sp > c * r * N log N + c * r * w * N + T_closure_mp

or

    W > (r - 1) log N + r * w + (1 / (c * N)) * (T_closure_mp - T_closure_sp)

To validate this model, we generated a small database of 13,751 records (7,500 original records, 50% selected for duplications, and 5 m...
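The break-even inequality is easy to evaluate numerically. The sketch below plugs in illustrative numbers (the r, w, N values and the base-2 logarithm are assumptions for illustration; the constant c absorbs the log base, and the closure terms default to zero):

```python
# Break-even single-pass window size W from the cost model: the multi-pass
# approach (r passes, window w) wins once
#   W > (r - 1) * log N + r * w + (T_closure_mp - T_closure_sp) / (c * N)
# All numeric inputs below are illustrative assumptions, not measurements.
import math

def breakeven_window(r, w, N, t_closure_mp=0.0, t_closure_sp=0.0, c=1.0):
    return ((r - 1) * math.log2(N) + r * w
            + (t_closure_mp - t_closure_sp) / (c * N))

# e.g. 3 passes with window 10 over 1M records:
W = breakeven_window(r=3, w=10, N=1_000_000)
print(W)  # 2 * log2(1e6) + 30, roughly 69.9
```

In other words, under these assumed parameters a single pass would need a window of roughly 70 records to match the accuracy-for-time trade-off of three small-window passes, which illustrates why the multi-pass approach tends to win as N grows.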

