

Sangmi Lee Pallickara, CS480, Spring 2012
CS480 Principles of Data Management, Spring 2013 (2/19/13)

If the similarity between the nodes is above the threshold, the relationship "is-duplicate-of" is established.
–  Is this relation transitive?
–  Build a graph based on the relationship

Similarity Functions 1

•  Given a similarity threshold θ, we classify two candidates c and c’ using a similarity measure sim() by

   classify(c, c’) = c and c’ are duplicates      if sim(c, c’) > θ
                     c and c’ are non-duplicates  otherwise

Similarity Functions 2

•  Given a threshold θ, we classify two candidates c and c’ using a distance measure dist() by

   classify(c, c’) = c and c’ are duplicates      if dist(c, c’) ≤ θ
                     c and c’ are non-duplicates  otherwise

•  With normalized similarity measures between 0 and 1, dist(c, c’) = 1 - sim(c, c’)

Token-based similarity

•  Divide two...
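The threshold rules and the "is-duplicate-of" graph from the slides can be sketched in Python as follows. This is a minimal illustration, not the course's implementation: `jaccard_sim` is a hypothetical stand-in for a normalized similarity measure (the slides' own token-based definition is cut off), and the function names and sample records are assumptions made for the example.

```python
from itertools import combinations


def jaccard_sim(a: str, b: str) -> float:
    """Assumed stand-in measure: Jaccard overlap of whitespace tokens, in [0, 1]."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def classify_by_sim(c, c2, theta, sim=jaccard_sim):
    """Duplicates if sim(c, c') > theta, non-duplicates otherwise."""
    return sim(c, c2) > theta


def classify_by_dist(c, c2, theta, sim=jaccard_sim):
    """Duplicates if dist(c, c') <= theta, with dist = 1 - sim for normalized sim."""
    dist = 1.0 - sim(c, c2)
    return dist <= theta


def duplicate_clusters(candidates, theta, sim=jaccard_sim):
    """Build the is-duplicate-of graph and return its connected components.

    Taking connected components treats the relation as transitive, which is
    a modeling choice: two records can land in one cluster even though their
    direct similarity is below the threshold.
    """
    parent = list(range(len(candidates)))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(candidates)), 2):
        if classify_by_sim(candidates[i], candidates[j], theta, sim):
            parent[find(i)] = find(j)

    clusters = {}
    for i, c in enumerate(candidates):
        clusters.setdefault(find(i), []).append(c)
    return list(clusters.values())


records = ["john smith", "john a smith", "john a smith jr", "mary jones"]
clusters = duplicate_clusters(records, theta=0.6)
```

With these sample records and θ = 0.6, "john smith" and "john a smith jr" are not direct duplicates (their similarity is 0.5), yet both are linked to "john a smith", so the graph places all three in one cluster. That is the practical consequence of answering "yes" to the transitivity question.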