Relationship descriptions translate the following


CS480 Principles of Data Management, Spring 2013
Sangmi Lee Pallickara
3/4/13

Relationship graph
• Candidates of type author, paper and venue.
• Relationship descriptions translate the following relationships:
  – The set of papers appearing in a venue is descriptive of that venue
  – The set of authors of a paper is descriptive of the paper
• Data may be extracted from two or more services and integrated.

Sample data
Information extracted from publication P1
• P1: Duplicate record detection: a survey
• A1: A.K. Elmagarmid
• A2: P.G. Ipeirotis
• A3: V.S. Verykios
• V1: Transactions on knowledge and data engineering
Information extracted from publication P2
• P2: Duplicate Record Detection
• A4: Elmagarmid
• A3: P.G. Ipeirotis
• A5: Verykios
• V2: TKDE

[Figure: relationship graph with nodes v1, v2, P1, P2, a1, a2, a3, a4, a5]

• If papers are detected to be duplicates or if they are very similar
  – The likelihood increases that their respective venues are duplicates as well.

Iterative phase of graph algorithms
• For all given candidates, we need to classify the following candidate pairs as duplicates or non-duplicates:
  – (v1,v2),(p1,p2),(a1,a2),(a1,a3),(a1,a4),(a1,a5),(a2,a3),(a2,a4),(a2,a5),(a3,a4),(a3,a5),(a4,a5)
• Pairs of candidates depend on pairs of related candidates.
  – Comparison of (v1,v2) depends on the comparison of (p1,p2)

Choosing a “smart” comparison order
• Case A
  – (v1,v2),(p1,p2),(a1,a2),(a1,a3),(a1,a4),(a1,a5),(a2,a3),(a2,a4),(a2,a5),(a3,a4),(a3,a5),(a4,a5)
• Case B
  – (a1,a2),(a1,a3),(a1,a4),(a1,a5),(a2,a3),(a2,a4),(a2,a5),(a3,a4),(a3,a5),(a4,a5),(v1,v2),(p1,p2)
• Order may need to be updated before the next pair is compared.
  – Maintaining a priority queue

• Start with a priority queue PQ with all candidate pairs we want to compare
• Step 1: Retrieval
  – Retrieve the first pair in PQ as determined by the heuristic used for ordering
• Step 2: Classification
  – Classify the retrieved pair using a similarity measure
  – Decide whether to propagate these results to dependent candidate pairs
• Step 3: Update
  – If the output needs to be propagated, identify all dependent pairs from PQ
  – Update all necessary information used by the next retrieval and classification step

Computational Complexity
• On average, the complexity is in O(n^2)
• Where n is the number of candidates of the same type
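
The retrieval, classification, and update steps above can be sketched in Python as follows. This is only an illustration under stated assumptions, not the algorithm as given in the lecture: the similarity scores, the threshold, the boost applied to dependent pairs, and the names candidate_pairs and classify_all are all hypothetical. The only dependency encoded is the one the slides state, that (v1,v2) depends on (p1,p2).

```python
import heapq
from itertools import combinations

# Candidates from the sample data, grouped by type.
CANDIDATES = {
    "venue": ["v1", "v2"],
    "paper": ["p1", "p2"],
    "author": ["a1", "a2", "a3", "a4", "a5"],
}

def candidate_pairs(candidates):
    """Return all unordered pairs of candidates that share a type."""
    pairs = []
    for group in candidates.values():
        pairs.extend(combinations(group, 2))  # n*(n-1)/2 pairs per type
    return pairs

# Dependency stated in the slides: (v1,v2) depends on (p1,p2).
DEPENDENTS = {("p1", "p2"): [("v1", "v2")]}

# Hypothetical initial similarity scores (assumption); missing pairs get 0.0.
BASE_SIMILARITY = {("p1", "p2"): 0.9, ("v1", "v2"): 0.3}

def classify_all(pairs, base_sim, dependents, threshold=0.8, boost=0.6):
    """Classify pairs in priority order, propagating duplicate decisions."""
    sim = {p: base_sim.get(p, 0.0) for p in pairs}
    # heapq is a min-heap, so negate the score to pop the most similar pair first.
    heap = [(-sim[p], p) for p in pairs]
    heapq.heapify(heap)
    result, order = {}, []
    while heap:
        neg_score, pair = heapq.heappop(heap)        # Step 1: retrieval
        if pair in result or -neg_score != sim[pair]:
            continue                                  # skip stale queue entries
        is_duplicate = sim[pair] >= threshold         # Step 2: classification
        result[pair] = is_duplicate
        order.append(pair)
        if is_duplicate:                              # Step 3: update
            for dep in dependents.get(pair, []):
                if dep in sim and dep not in result:
                    sim[dep] = min(1.0, sim[dep] + boost)
                    heapq.heappush(heap, (-sim[dep], dep))  # re-queue with new priority
    return result, order

pairs = candidate_pairs(CANDIDATES)
result, order = classify_all(pairs, BASE_SIMILARITY, DEPENDENTS)
print(len(pairs), order[:2], result[("v1", "v2")])
```

With these assumed scores, (p1,p2) is compared first, and its duplicate decision raises the score of (v1,v2) before that pair is retrieved. Because the standard-library heap has no in-place priority update, the sketch pushes a fresh entry and skips the outdated one when it surfaces. The pair count per type is n(n-1)/2, which is consistent with the O(n^2) bound on the complexity slide.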