nce, ci = log10 (10/6)

[Table of company names, rows c1 to c10; surviving rows: c9 Safeway Insurance Group, c10 Westfield …]

CS480 Principles of Data Management, Spring 2013
Sangmi Lee Pallickara, 2/22/13

Example continued
•  Compute the similarity between the two strings s1 = Farmers Insurance, s2 = Liberty Insurance

$$\mathrm{CosineSimilarity}(V, W) = \cos(\alpha) = \frac{V \cdot W}{\|V\| \times \|W\|} = \frac{0.222 \times 0.222}{0.222^2 + 1^2} = 0.047$$

   –  Here 0.222 ≈ log10(10/6) is the weight of the shared token, and 1 is the weight of the token that differs in each string.
•  What is the Jaccard similarity for the same case?

Using q-GRAMS

Using q-GRAMS
•  A string is divided into smaller tokens of size q.
   –  q-grams or n-grams
   –  A q-gram is a substring of length q
•  Tokens overlap
   –  One character in a string appears in several tokens (at least q tokens)
•  Generating q-grams
   –  Slide a window of size q over the string

Generating q-grams
s1 = Henri Waternoose
s2 = Henry Waternose
•  Generate 3-grams
   –  q-grams of s1 = {##H, #He, Hen, enr, nri, ri_, i_W, _Wa, Wat, ate, ter, ern, rno, noo, oos, ose, se#, e##}
   –  q-grams of s2 = {##H, #He, Hen, enr, nry, ry_, y_W, _Wa, Wat, ate, ter, ern, rno, nos, ose, se#, e##}

q-gram based token similarity
•  13 overlaps among a total of 22 distinct q-grams.
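The sliding-window step can be sketched in a few lines of Python. This is a minimal illustration, not code from the course: the function name `qgrams` and its parameters are hypothetical, and it assumes the convention visible in the 3-gram sets for s1 and s2 (pad with q − 1 `#` characters on each side, write spaces as `_`).

```python
def qgrams(s, q=3, pad="#", space="_"):
    """Return the padded q-grams of s, in window order.

    Assumed convention (inferred from the slide's 3-gram sets):
    spaces become `space`, and q - 1 `pad` characters are added
    to both ends before the window slides over the string.
    """
    text = s.replace(" ", space)
    padded = pad * (q - 1) + text + pad * (q - 1)
    # Slide a window of size q over the padded string.
    return [padded[i:i + q] for i in range(len(padded) - q + 1)]


s1_grams = qgrams("Henri Waternoose")
s2_grams = qgrams("Henry Waternose")

shared = set(s1_grams) & set(s2_grams)     # q-grams present in both strings
distinct = set(s1_grams) | set(s2_grams)   # all distinct q-grams

print(len(s1_grams), len(s2_grams))  # 18 17
print(len(shared), len(distinct))    # 13 22
```

With this padding every character of the original string falls inside exactly q windows, which is why a single typo only disturbs a handful of tokens.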
•  Using the cosine similarity with tf-idf weights
   –  Overlaps: number of item pairs (one from each string) that overlap

CosSimilarity(V, W) = 0

•  Jaccard similarity

$$\mathrm{StringJaccard}(s_1, s_2) = \frac{13}{22} = 0.59$$

Using q-grams
•  Use the same similarity measures as in other token-based similarity computations
•  Less sensitive to typographical errors
•  What if we change the size of q?
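One way to explore the question about the size of q is to recompute the Jaccard similarity for several values. The sketch below is only an illustration under assumptions: `qgrams` and `string_jaccard` are hypothetical helper names, and the padding convention (q − 1 `#` characters on each side, spaces written as `_`) is inferred from the 3-gram sets of the Henri/Henry example.

```python
def qgrams(s, q=3, pad="#", space="_"):
    """Return the set of padded q-grams of s (assumed padding convention)."""
    text = s.replace(" ", space)
    padded = pad * (q - 1) + text + pad * (q - 1)
    return {padded[i:i + q] for i in range(len(padded) - q + 1)}


def string_jaccard(a, b, q=3):
    """Jaccard similarity of the q-gram sets: |A ∩ B| / |A ∪ B|."""
    ga, gb = qgrams(a, q), qgrams(b, q)
    return len(ga & gb) / len(ga | gb)


s1 = "Henri Waternoose"
s2 = "Henry Waternose"

# q = 3 reproduces the slide's value: 13 / 22.
print(round(string_jaccard(s1, s2, q=3), 2))  # 0.59

# Try other window sizes to see how the score moves.
for q in (1, 2, 3, 4, 5):
    print(q, string_jaccard(s1, s2, q))
```

Each typo touches up to q windows, so a larger q tends to penalize the same typo more heavily, while a very small q makes unrelated strings look similar.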
