lecture5-1

# Jaccard coefficient (JC): string and candidate comparison

2/19/13 CS480 Principles of Data Management Spring 2013

Sangmi Lee Pallickara, CS480, Spring 2012

... the union of these two sets.

• What are P and Q?
  – Set of tokens from strings
  – Complete description about data (candidates)

## String comparison with JC

• Given a tokenization function tokenize(s) that tokenizes a string s into a set of string tokens {s1, s2, … , sn}
• We compute the Jaccard coefficient of two strings s1 and s2:

$$
\mathrm{StringJaccard}(s_1, s_2) = \frac{|\,\mathrm{tokenize}(s_1) \cap \mathrm{tokenize}(s_2)\,|}{|\,\mathrm{tokenize}(s_1) \cup \mathrm{tokenize}(s_2)\,|}
$$

## Candidate comparison with JC

• Given two candidates c1 and c2, their Jaccard coefficient is given by:

$$
\mathrm{DescriptionJaccard}(c_1, c_2) = \frac{|\,OD(c_1) \cap OD(c_2)\,|}{|\,OD(c_1) \cup OD(c_2)\,|}
$$

## Example

• A Person is a candidate type with a description attribute Name.

Two Name values are compared: "Tomas Sean Connery" and "Sir Sean Connery".

Tokenize(Tomas Sean Connery) = {Tomas, Sean, Connery}
Tokenize(Sir Sean Connery) = {Sir, Sean, Connery}

## Example continued

...
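The string comparison can be sketched in a few lines of Python. This is a minimal illustration, not the course's implementation: it assumes that `tokenize` splits on whitespace (the slides do not specify a tokenizer) and that the coefficient of two empty token sets is 0.0. The helper names `jaccard` and `string_jaccard` are chosen here for illustration.

```python
def tokenize(s):
    """Turn a string into a set of tokens.

    Assumption: tokens are separated by whitespace.
    """
    return set(s.split())


def jaccard(p, q):
    """Jaccard coefficient of two sets: |P intersect Q| / |P union Q|."""
    union = p | q
    if not union:
        # Assumption: two empty sets are given a coefficient of 0.0
        # to avoid dividing by zero.
        return 0.0
    return len(p & q) / len(union)


def string_jaccard(s1, s2):
    """Jaccard coefficient of the token sets of two strings."""
    return jaccard(tokenize(s1), tokenize(s2))


if __name__ == "__main__":
    # Shared tokens: {Sean, Connery} (2); all tokens: {Tomas, Sir, Sean, Connery} (4)
    print(string_jaccard("Tomas Sean Connery", "Sir Sean Connery"))  # 0.5
```

Because `jaccard` works on any two sets, the same function would apply to candidate comparison once each candidate's description has been turned into a set.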
