lecture6-2

# Spring 2013 CS480 Principles of Data Management


Sangmi Lee Pallickara, 3/4/13

## What if ...

The duplicates are not only within a partition?

- Multi-pass method
  - The blocking algorithm is run multiple times with a different partitioning key.
  - Apply the transitive closure.
- Duplicity is inherently a transitive relation.
  - Maine St. vs. Main St. vs. Moin St.
  - With an edit distance threshold of 1, only two pairs are recognized as duplicates, and the third pair is discovered only through transitivity.

## Sorted-neighborhood (Windowing method)

Step 1. Choose a set of sorting keys K and assign a sorting key k to each record.

- The key does not need to be unique.
- The key can be generated by concatenating characters, e.g. the first 3 consonants of the last name concatenated with the first 2 digits of the zip code.

Step 2. For each key k in K, sort all records according to that key.

- Duplicates have similar keys.
- After sorting, duplicates will be located close to each other.

Step 3. Slide a window of fixed size across the sorted list and compare only the records that fall inside the same window.

Step 4. Determine the transitive closure.

## Transitive closure step

- Record 2 and record 5 are duplicates.
- Record 5 and record 7 are duplicates.
- Record 2 and record 7 cannot be in the same window: this pair is detected by the transitive closure step.

Problem?

## Is your key perfect?

- There is a chance that the characters used for sorting contain errors.
- Perform the sorting and windowing multiple times.
  - To avoid mis-sorts, use multi-pass variants of SNM.
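
The four steps above, including the multi-pass variant, can be sketched in a few lines of Python. This is a minimal illustration and not the lecture's implementation: the names `edit_distance`, `find`, and `sorted_neighborhood`, the choice of union-find for the transitive closure, and the "Oak Ave." record are all assumptions made for the example. The street names and the edit-distance threshold of 1 come from the slide.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the standard dynamic-programming recurrence."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def find(parent, x):
    """Union-find lookup with path halving (hypothetical helper)."""
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x


def sorted_neighborhood(records, key_funcs, window, is_duplicate):
    """Run one sort-and-window pass per key, then return duplicate clusters."""
    parent = list(range(len(records)))
    for key in key_funcs:  # Steps 1-2: one pass per sorting key
        order = sorted(range(len(records)), key=lambda i: key(records[i]))
        for pos, i in enumerate(order):  # Step 3: slide the window
            # A window of size w pairs each record with its next w - 1 neighbors.
            for j in order[pos + 1:pos + window]:
                if is_duplicate(records[i], records[j]):
                    parent[find(parent, i)] = find(parent, j)
    clusters = {}  # Step 4: transitive closure = union-find components
    for i, record in enumerate(records):
        clusters.setdefault(find(parent, i), []).append(record)
    return sorted(sorted(c) for c in clusters.values())


records = ["Maine St.", "Moin St.", "Oak Ave.", "Main St."]
clusters = sorted_neighborhood(
    records,
    key_funcs=[lambda r: r],  # a single pass that sorts on the whole string
    window=3,
    is_duplicate=lambda a, b: edit_distance(a, b) <= 1,
)
print(clusters)
# [['Main St.', 'Maine St.', 'Moin St.'], ['Oak Ave.']]
```

"Maine St." and "Moin St." are two edits apart, so they are never matched directly; they end up in the same cluster only because each is within one edit of "Main St.". Passing several functions in `key_funcs` gives the multi-pass variant, since all passes share the same union-find structure.
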
## Implementing SNM

- Using a DBMS
  - Create a temporary table for a new key and a surrogate key of the original table.
  - Sort the table based on the key.
  - Join the temp...

## Complexity Analysis

- The number of comparisons is greatly reduced.
  - Approximately w x n, where w is the size of the window and n is the number of records.
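
To make the w x n estimate concrete, the following sketch counts comparisons for a single pass and contrasts them with comparing every pair. The function names and the values n = 1000, w = 10 are assumptions chosen for illustration; the count assumes each record is compared with the w - 1 records that follow it in sorted order.

```python
def all_pairs_comparisons(n: int) -> int:
    """Comparisons needed when every record is compared with every other."""
    return n * (n - 1) // 2


def window_comparisons(n: int, w: int) -> int:
    """Comparisons made by one SNM pass with a window of size w."""
    # Record i is compared with at most w - 1 successors; fewer near the end.
    return sum(min(w - 1, n - 1 - i) for i in range(n))


n, w = 1000, 10
print(all_pairs_comparisons(n))   # 499500
print(window_comparisons(n, w))   # 8955, close to w * n = 10000
```

The sorting step itself costs O(n log n), so for a fixed window size the sort, rather than the comparisons, dominates as n grows.
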