ates

CS480 Principles of Data Management, Spring 2013, Sangmi Lee Pallickara (3/4/13)

- Detecting a graph of nodes that have been classified as duplicates (b)
  - Simple breadth-first search or depth-first search
  - O(n+d), where d is the number of detected duplicate pairs (d <= n^2)
  - Average case O(n), worst case O(n^2)
- Total runtime complexity (by (a) and (b)) of iterative duplicate detection is O(n^2)

- One million movie subscriber candidates
- 5 x 10^11 pairwise comparisons
- Assume that a comparison can be done in 0.1 ms. Then the total duplicate detection time is 5 x 10^11 x 0.1 ms = 5 x 10^7 s, about 579 days: roughly a year and a half!
- And currently Netflix has more than 23.6 million customers.

Duplicate Detection Algorithms: Pairwise comparison algorithms

- Pairwise comparison algorithms
- Algorithms for data with complex relationships
- Clustering algorithms

Blocking methods

- Strictly partition records into disjoint subsets
  - Using zip codes as partitioning key
- Overall number of comparisons is reduced.
- Given n records and b partitions, the average size of each partition is n/b.
- Total number of pairwise comparisons over b partitions:

$$
b \times \frac{\frac{n}{b}\left(\frac{n}{b} - 1\right)}{2}
= \frac{n\left(\frac{n}{b} - 1\right)}{2}
= \frac{1}{2}\left(\frac{n^2}{b} - n\right)
$$

[Figure: records numbered 1 to 20]

Choice of the partitioning predicate

- Determine number and size of the partitions
- Create partition keys
- Sort records based on the partition key
- Should consider potential duplicates that appear in the same partition
  - Customer record management:
    - Use zip code or area code
    - Duplicates have the same zip code
    - Use address, employer, last name
- Same sized partitions? Or variable sized partitions?

S...
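
The comparison counts above can be checked with a short Python sketch. It is only an illustration under stated assumptions: the function names (`pairwise_count`, `blocked_count`, `candidate_pairs`, `estimate_days`) and the sample records are hypothetical and do not come from the slides. It counts all pairwise comparisons for n records, converts that count to days at 0.1 ms per comparison, evaluates the blocking formula for b equal-sized partitions, and generates candidate pairs only within a partition (here keyed by zip code).

```python
from collections import defaultdict
from itertools import combinations


def pairwise_count(n: int) -> int:
    """Number of unordered record pairs among n records: n(n-1)/2."""
    return n * (n - 1) // 2


def blocked_count(n: int, b: int) -> float:
    """Comparisons over b equal-sized partitions: (n^2/b - n) / 2."""
    return (n * n / b - n) / 2


def estimate_days(n: int, ms_per_comparison: float = 0.1) -> float:
    """Time for all pairwise comparisons, in days (0.1 ms each by default)."""
    seconds = pairwise_count(n) * ms_per_comparison / 1000
    return seconds / 86_400


def candidate_pairs(records, key):
    """Yield only the pairs of records that share a partition key."""
    blocks = defaultdict(list)
    for rec in records:
        blocks[key(rec)].append(rec)
    for block in blocks.values():
        yield from combinations(block, 2)


if __name__ == "__main__":
    n = 1_000_000
    print(pairwise_count(n))        # about 5 x 10^11 comparisons
    print(estimate_days(n))         # days needed at 0.1 ms per comparison
    print(blocked_count(n, 100))    # comparisons with 100 partitions

    # Hypothetical (name, zip code) records, partitioned on the zip code.
    customers = [("Ann", "80521"), ("Anne", "80521"), ("Bob", "80523")]
    print(list(candidate_pairs(customers, key=lambda r: r[1])))
```

With equal-sized partitions the comparison count drops by roughly a factor of b. The price is that duplicates whose key values differ (for example, a mistyped zip code) land in different partitions and are never compared, which is why the choice of partitioning key matters.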
