orary table with original table
  – If w < log n, the comparisons are dominated by the sorting phase
  – The sorting phase requires approximately O(n log n) comparisons

Sangmi Lee Pallickara, CS480 Principles of Data Management, Spring 2013 (3/4/13)

Comparison
(n: number of records, b: number of blocks, w: window size)

                        Blocking             Windowing            Full enum.
Number of comparisons   (n^2/b - n)/2        (w-1)(n - w/2)       (n^2 - n)/2
Key generation          O(n)                 O(n)                 n/a
Sorting                 O(n log n)           O(n log n)           n/a
Detection               O(n^2/b)             O(wn)                O(n^2)
Overall                 O(n(n/b + log n))    O(n(w + log n))      O(n^2)

Duplicate Detection Algorithms: For Complex Relationships

Pairwise comparison algorithms
Algorithms for data with complex relationships
Clustering algorithms

Hierarchical relationships
[Figure: example hierarchy. North America; Canada, USA, Mexico; Colorado, Arizona, Nevada; Fort Collins, Denver, Boulder]
Two candidates on level l_(i+1) can only be duplicates if their parents on level l_i are duplicates (or they share the same parent)
  – e.g. the cities under different states do not need to be compared
We can prune comparisons based on duplicate classifications previously performed on ancestors
Traverse the tree in a top-down fashion
  – Candidates at the top-most level l_1 are compared before we proceed to level l_2, and so on
  – 1:N relationships between parent and child elements

SXNM algorithm
Does not assume a 1:N relationship between parent and child elements
Traverses the hierarchy from bottom to top
  – Detect duplicate authors that are nested under non-duplicate books
Uses the output of level l_i for comparing items of level l_(i-1)

Relationships forming a graph
In general, relationships between candidates can form a graph
Duplicate detection on relationship graphs as graph algorithms
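
To make the comparison counts on the Comparison slide concrete, the three formulas can be evaluated directly. This is only a sketch: the function names and the sample values of n, b and w are assumptions chosen for illustration, and the blocking formula assumes b blocks of equal size.

```python
def full_enumeration_comparisons(n: int) -> float:
    """Every record is compared with every other record once: (n^2 - n) / 2."""
    return (n * n - n) / 2


def blocking_comparisons(n: int, b: int) -> float:
    """Records are compared only within their block: (n^2 / b - n) / 2.

    Assumes b blocks of equal size n / b (an idealization).
    """
    return (n * n / b - n) / 2


def windowing_comparisons(n: int, w: int) -> float:
    """Records are compared only within a sliding window: (w - 1)(n - w / 2)."""
    return (w - 1) * (n - w / 2)


if __name__ == "__main__":
    # Hypothetical sizes, chosen only to show the relative magnitudes.
    n, b, w = 1000, 10, 20
    print("full enumeration:", full_enumeration_comparisons(n))
    print("blocking:        ", blocking_comparisons(n, b))
    print("windowing:       ", windowing_comparisons(n, w))
```

With these sample values both blocking and windowing need far fewer comparisons than full enumeration, at the price of the key generation and sorting steps listed in the table.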
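
The top-down traversal described under Hierarchical relationships can be sketched as follows. This is a minimal illustration, not the method from the slides: the Node class, the string-similarity test based on difflib, the 0.8 threshold, and the misspelled names "Colorada" and "Fort Colins" are all assumptions made for the example. Children are compared only if they share a parent or their parents were classified as duplicates.

```python
from dataclasses import dataclass, field
from difflib import SequenceMatcher
from itertools import combinations
from typing import List, Tuple


@dataclass
class Node:
    """Hypothetical tree node: a name plus 1:N child elements."""
    name: str
    children: List["Node"] = field(default_factory=list)


def similar(a: str, b: str, threshold: float) -> bool:
    # Assumed similarity measure; any pairwise classifier could be used here.
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold


def top_down_duplicates(roots: List[Node], threshold: float = 0.8) -> List[Tuple[str, str]]:
    duplicates = []
    # Each group holds candidates that are allowed to be compared with each other.
    level = [list(roots)]
    while level:
        next_level = []
        for group in level:
            # cluster[i] identifies the duplicate cluster of group[i].
            cluster = list(range(len(group)))
            for i, j in combinations(range(len(group)), 2):
                if similar(group[i].name, group[j].name, threshold):
                    duplicates.append((group[i].name, group[j].name))
                    old, new = cluster[j], cluster[i]
                    cluster = [new if c == old else c for c in cluster]
            # Children of the same parent or of duplicate parents form one group;
            # children of non-duplicate parents are never compared (pruning).
            merged = {}
            for node, c in zip(group, cluster):
                merged.setdefault(c, []).extend(node.children)
            next_level.extend(g for g in merged.values() if len(g) > 1)
        level = next_level
    return duplicates


if __name__ == "__main__":
    usa = Node("USA", [
        Node("Colorado", [Node("Fort Collins"), Node("Denver")]),
        Node("Colorada", [Node("Fort Colins"), Node("Boulder")]),  # hypothetical typo
        Node("Arizona", [Node("Denver")]),
    ])
    print(top_down_duplicates([usa]))
```

In this example the "Denver" under Arizona is never compared with the "Denver" under Colorado, because their parent states are not classified as duplicates.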
This note was uploaded on 02/11/2014 for the course CS 480 taught by Professor Staff during the Spring '08 term at Colorado State.