Spring 2013 CS480 Principles of Data Management


Sangmi Lee Pallickara, 3/4/13

What if ...
• The duplicates are not only within a partition?
• Multi-pass method
  – The blocking algorithm is run multiple times, each time with a different partitioning key.
  – Apply the transitive closure
• Duplicity is inherently a transitive relation
• Maine St. vs. Main St. vs. Moin St.
  – With an edit distance threshold of 1
  – Only two pairs are recognized as duplicates, and the third pair is discovered only through transitivity.

Sorted-neighborhood (Windowing method)
Step 1. Choose a set of sorting keys K and assign a sorting key k to each record.
  – Does not need to be unique
  – Can be generated by concatenating characters
  – First 3 consonants of last name | first 1 or 2 digits of zip code
Step 2. For each key k in K, sort all records according to that key
  – Duplicates have similar keys
  – After sorting, duplicates will be located close to each other
Step 3. Slide a window of fixed size across the sorted list
Step 4. Determine transitive closure
[Figure: numbered records in sorted order]

Transitive closure step
• Record 2 and record 5 are duplicates
• Record 5 and record 7 are duplicates
• Record 2 and record 7 cannot be in the same window: detected by the transitive closure step
• Problem?

Is your key perfect?
• There is a chance that the sorting characters contain errors.
• Perform the sorting and windowing multiple times
  – To avoid mis-sorts, use multi-pass variants of SNM
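The four SNM steps and the transitive closure step can be sketched in a few lines of Python. This is a minimal illustration, not the course's implementation: the helper names (`consonant_key`, `edit_distance`, `sorted_neighborhood`, `transitive_closure`), the "Oak Ave." record, and the window size of 2 are all assumptions made for the example. Only the three street names and the edit distance threshold of 1 come from the slides.

```python
def consonant_key(text, k=3):
    """Hypothetical sorting key: the first k consonants, upper-cased."""
    return "".join(c for c in text.upper() if c.isalpha() and c not in "AEIOU")[:k]


def edit_distance(a, b):
    """Levenshtein distance computed row by row with dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def sorted_neighborhood(records, key, is_duplicate, window):
    """Steps 2 and 3: sort by key, then compare each record only with the
    next (window - 1) records in sorted order. Returns matching index pairs."""
    order = sorted(range(len(records)), key=lambda i: key(records[i]))
    pairs = set()
    for pos, i in enumerate(order):
        for j in order[pos + 1 : pos + window]:
            if is_duplicate(records[i], records[j]):
                pairs.add((min(i, j), max(i, j)))
    return pairs


def transitive_closure(n, pairs):
    """Step 4: merge matched pairs into clusters using union-find."""
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in pairs:
        parent[find(a)] = find(b)

    clusters = {}
    for i in range(n):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())


# "Oak Ave." is an invented non-duplicate record added for contrast.
streets = ["Maine St.", "Oak Ave.", "Main St.", "Moin St."]
pairs = sorted_neighborhood(
    streets,
    key=consonant_key,
    is_duplicate=lambda a, b: edit_distance(a, b) <= 1,
    window=2,
)
clusters = transitive_closure(len(streets), pairs)
print(pairs)
print(clusters)
```

With a window of 2, "Maine St." and "Moin St." are never compared directly, and their edit distance of 2 would fail the threshold anyway. They still land in the same cluster because each one matches "Main St.", which is the situation the transitive closure step is meant to catch.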
Implementing SNM
• Using DBMS
  – Create a temporary table holding the new key and a surrogate key of the original table
  – Sort the table based on the key
  – Join the temp...

Complexity Analysis
• Number of comparisons is greatly reduced
  – Approximately w × n, where w is the window size and n is the number of records
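To see how much the window reduces the work, the w × n estimate can be checked against an exact count. The sketch below assumes that each record is compared with the next w - 1 records in sorted order, which is one common reading of the windowing step. The function names and the sample values n = 10,000 and w = 10 are chosen only for illustration.

```python
def window_comparisons(n, w):
    """Exact pair count when each record is compared with the next w - 1 records."""
    return sum(min(w - 1, n - 1 - i) for i in range(n))


def all_pairs_comparisons(n):
    """Pair count for the naive approach that compares every record with every other."""
    return n * (n - 1) // 2


n, w = 10_000, 10  # illustrative values, not from the slides
windowed = window_comparisons(n, w)
naive = all_pairs_comparisons(n)
print(windowed, w * n, naive)
```

Under this assumption the windowed pass needs 89,955 comparisons, slightly under the w × n estimate of 100,000, while comparing all pairs needs 49,995,000.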