Obtain partitions that represent duplicates; total runtime


...ates

- Detecting a graph of nodes that have been classified as duplicates.
  - Simple breadth-first search or depth-first search
  - O(n + d), where d is the number of detected duplicate pairs (d <= n^2)
  - Average case O(n), and the worst case O(n^2): this is (b)
- Total runtime complexity (by (a) and (b)) of iterative duplicate detection is O(n^2).

CS480 Principles of Data Management, Spring 2013, Sangmi Lee Pallickara (3/4/13)

- One million movie subscriber candidates
- 5 x 10^11 pairwise comparisons
- Assume that a comparison can be done in 0.1 ms. Then the total duplicate detection time is 5 x 10^11 x 0.1 ms = 5 x 10^7 seconds, which is about 579 days: roughly a year and a half!
- And currently Netflix has more than 23.6 million customers.

Duplicate Detection Algorithms: Pairwise comparison algorithms

- Pairwise comparison algorithms
- Algorithms for data with complex relationships
- Clustering algorithms

Blocking methods

- Strictly partition records into disjoint subsets
  - Using zip codes as partitioning key
- Overall number of comparisons is reduced.
- Given n records and b partitions, the average size of each partition is n/b.
- Total number of pairwise comparisons over b partitions:

\[
b \times \frac{\frac{n}{b}\left(\frac{n}{b} - 1\right)}{2}
= \frac{n\left(\frac{n}{b} - 1\right)}{2}
= \frac{1}{2}\left(\frac{n^2}{b} - n\right)
\]

[Figure: records numbered 1 to 20]

Choice of the partitioning predicate

- Determine number and size of the partitions
- Create partition keys
- Sort records based on the partition key
- Should consider potential duplicates that appear in the same partition
  - Customer record management:
    - Use zip code or area code
    - Duplicates have the same zip code
    - Use address, employer, last name
- Same sized partitions? Or variable sized partitions?
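The comparison counts and the blocking idea above can be sketched in a few lines of Python. This is a minimal illustration, not the course's implementation: the function names, the sample customer records, and their zip codes are all hypothetical. It counts the pairs for the naive all-pairs approach, evaluates the (n^2/b - n)/2 estimate for b equal-sized partitions, and generates candidate pairs only within blocks that share a partitioning key.

```python
from collections import defaultdict
from itertools import combinations


def naive_pair_count(n):
    """Number of pairwise comparisons among n records: n(n-1)/2."""
    return n * (n - 1) // 2


def blocked_pair_count(n, b):
    """Comparisons with b equal-sized partitions: (n^2/b - n)/2."""
    return (n * n / b - n) / 2


def candidate_pairs(records, key):
    """Group records by a partitioning key; pair records only within a block."""
    blocks = defaultdict(list)
    for rec in records:
        blocks[key(rec)].append(rec)
    for block in blocks.values():
        yield from combinations(block, 2)


# Hypothetical customer records: (name, zip code).
records = [
    ("A. Smith", "80521"),
    ("Alice Smith", "80521"),
    ("B. Jones", "80523"),
    ("Bob Jones", "80523"),
    ("C. Lee", "80521"),
]

# Zip code as the partitioning key, as in the blocking slide.
pairs = list(candidate_pairs(records, key=lambda r: r[1]))
print(len(pairs), "blocked pairs vs", naive_pair_count(len(records)), "naive")

# One million candidates at 0.1 ms per comparison, without blocking.
n = 1_000_000
seconds = naive_pair_count(n) * 0.1e-3
print(f"{seconds / 86400:.0f} days without blocking")
```

With strict blocking, two duplicate records that end up with different keys (for example, a mistyped zip code) are never compared, which is why the choice of partitioning predicate matters.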