CS480 Principles of Data Management, Spring 2013
Sangmi Lee Pallickara
3/4/13

Relationship graph

Candidates are of type author, paper, and venue.
Relationship descriptions translate the following relationships:
– The set of papers appearing in a venue is descriptive of that venue
– The set of authors of a paper is descriptive of the paper
Data may be extracted from two or more services and integrated.

Sample data

Information extracted from publication P1
P1: Duplicate record detection: a survey
A1: A.K. Elmagarmid
A2: P.G. Ipeirotis
A3: V.S. Verykios
V1: Transactions on knowledge and data engineering

Information extracted from publication P2
P2: Duplicate Record Detection
A4: Elmagarmid
A3: P.G. Ipeirotis
A5: Verykios
V2: TKDE

[Figure: relationship graph over the candidates p1, p2, a1 to a5, v1, and v2]

If papers are detected to be duplicates, or if they are very similar:
– The likelihood increases that their respective venues are duplicates as well.

Iterative phase of graph algorithms

For all given candidates, we need to classify the following candidate pairs as duplicates or non-duplicates:
– (v1,v2), (p1,p2), (a1,a2), (a1,a3), (a1,a4), (a1,a5), (a2,a3), (a2,a4), (a2,a5), (a3,a4), (a3,a5), (a4,a5)
Pairs of candidates depend on pairs of related candidates.
– The comparison of (v1,v2) depends on the comparison of (p1,p2)

Choosing a “smart” comparison order

Case A
– (v1,v2), (p1,p2), (a1,a2), (a1,a3), (a1,a4), (a1,a5), (a2,a3), (a2,a4), (a2,a5), (a3,a4), (a3,a5), (a4,a5)
Case B
– (a1,a2), (a1,a3), (a1,a4), (a1,a5), (a2,a3), (a2,a4), (a2,a5), (a3,a4), (a3,a5), (a4,a5), (v1,v2), (p1,p2)
The order may need to be updated before the next pair is compared.
– This is done by maintaining a priority queue

Start with a priority queue PQ containing all candidate pairs we want to compare.
Step 1: Retrieval
– Retrieve the first pair in PQ, as determined by the heuristic used for ordering
Step 2: Classification
– Classify the retrieved pair using a similarity measure
– Decide whether to propagate the result to dependent candidate pairs
Step 3: Update
– If the result needs to be propagated, identify all dependent pairs in PQ
– Update all information used by the next retrieval and classification step

Computational complexity

On average, the complexity is in O(n^2), where n is the number of candidates of the same type.
– n candidates of one type already form n(n-1)/2 candidate pairs; the five authors above form 10
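
The retrieval, classification, and update steps can be sketched in Python on the sample data. This is only an illustration under stated assumptions, not the method from the lecture: the text similarity (difflib's SequenceMatcher), the THRESHOLD and BOOST values, the type-based ordering heuristic, and the simplified dependency rule (every author pair influences (p1,p2), which influences (v1,v2)) are all hypothetical choices.

```python
import heapq
from difflib import SequenceMatcher
from itertools import combinations

# Sample candidates from the slides, keyed by lowercase ids.
CANDIDATES = {
    "a1": "A.K. Elmagarmid", "a2": "P.G. Ipeirotis", "a3": "V.S. Verykios",
    "a4": "Elmagarmid", "a5": "Verykios",
    "p1": "Duplicate record detection: a survey",
    "p2": "Duplicate Record Detection",
    "v1": "Transactions on knowledge and data engineering",
    "v2": "TKDE",
}
THRESHOLD = 0.6  # assumed cut-off for classifying a pair as duplicates
BOOST = 0.55     # assumed score bonus per duplicate related pair
AUTHORS_FIRST = {"a": 0, "p": 1, "v": 2}  # assumed ordering heuristic


def candidate_pairs():
    """All pairs of candidates that share a type (first letter of the id)."""
    by_type = {}
    for cid in sorted(CANDIDATES):
        by_type.setdefault(cid[0], []).append(cid)
    return [pair for ids in by_type.values() for pair in combinations(ids, 2)]


def dependents(pair):
    """Pairs whose comparison depends on `pair` (simplified assumption)."""
    kind = pair[0][0]
    if kind == "a":
        return [("p1", "p2")]
    if kind == "p":
        return [("v1", "v2")]
    return []


def text_similarity(pair):
    left, right = (CANDIDATES[cid].lower() for cid in pair)
    return SequenceMatcher(None, left, right).ratio()


def detect_duplicates(type_rank=AUTHORS_FIRST):
    """Return (duplicate pairs, order in which pairs were compared)."""
    priority = {pair: type_rank[pair[0][0]] for pair in candidate_pairs()}
    pq = [(rank, pair) for pair, rank in priority.items()]
    heapq.heapify(pq)
    boost, duplicates, order = {}, set(), []
    while pq:
        # Step 1: retrieval. Entries whose priority is outdated are skipped.
        rank, pair = heapq.heappop(pq)
        if priority.get(pair) != rank:
            continue
        del priority[pair]
        order.append(pair)
        # Step 2: classification with text similarity plus propagated evidence.
        score = text_similarity(pair) + boost.get(pair, 0.0)
        if score < THRESHOLD:
            continue
        duplicates.add(pair)
        # Step 3: update. Dependent pairs gain evidence and move to the front,
        # even if they were already compared and rejected earlier.
        for dep in dependents(pair):
            if dep not in duplicates:
                boost[dep] = boost.get(dep, 0.0) + BOOST
                priority[dep] = -1
                heapq.heappush(pq, (-1, dep))
    return duplicates, order


if __name__ == "__main__":
    found, compared = detect_duplicates()
    print(sorted(found), len(compared))
```

Passing a venue-first ranking such as detect_duplicates({"v": 0, "p": 1, "a": 2}) mimics Case A: (v1,v2) is compared once before (p1,p2) and again after the paper result is propagated, so this sketch performs 13 comparisons instead of the 12 needed with the authors-first order.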