lecture5-2

# Using q-grams


CS480 Principles of Data Management, Spring 2013. Sangmi Lee Pallickara. 2/22/13.

…nce, ci = log10(10/6)

Name: c1 … c9 Safeway Insurance Group, c10 Westfield …

## Example continued

- Compute the similarity between the two strings s1 = Farmers Insurance, s2 = Liberty Insurance

$$\mathrm{CosineSimilarity}(V, W) = \cos(\alpha) = \frac{V \cdot W}{\|V\| \times \|W\|} = \frac{0.23 \times 0.23}{\left(\sqrt{0.23^2 + 1^2}\right)^2} = 0.047$$

- What is the Jaccard similarity for the same case?

## Using q-grams

- A string is divided into smaller tokens of size q.
  - q-grams or n-grams
  - A q-gram is a string of length q
- Tokens overlap
  - One character in a string appears in several tokens (at least q tokens)
- Generating q-grams
  - Slide a window of size q over the string

## Generating q-grams

s1 = Henri Waternoose
s2 = Henry Waternose

- Generate 3-grams
  - q-grams of s1 = {##H, #He, Hen, enr, nri, ri_, i_W, _Wa, Wat, ate, ter, ern, rno, noo, oos, ose, se#, e##}
  - q-grams of s2 = {##H, #He, Hen, enr, nry, ry_, y_W, _Wa, Wat, ate, ter, ern, rno, nos, ose, se#, e##}
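The sliding-window procedure can be sketched in a few lines of Python. This is a minimal illustration, not the course's reference code: the function name `qgrams` and its parameters are hypothetical, and the padding convention (q - 1 `#` characters on each end, spaces shown as `_`) is inferred from the 3-gram sets listed for s1 and s2.

```python
def qgrams(s, q=3, pad="#", space="_"):
    """Return the q-grams of s, in order, using a sliding window of size q.

    Assumption: the string is padded with q - 1 `pad` characters on both
    ends and spaces are written as `space`, matching the slide's 3-grams.
    """
    if q < 1:
        raise ValueError("q must be at least 1")
    padded = pad * (q - 1) + s.replace(" ", space) + pad * (q - 1)
    # One window starts at every position that leaves room for q characters.
    return [padded[i:i + q] for i in range(len(padded) - q + 1)]


s1_grams = qgrams("Henri Waternoose")
s2_grams = qgrams("Henry Waternose")
print(len(s1_grams), s1_grams[:4], s1_grams[-2:])
# 18 ['##H', '#He', 'Hen', 'enr'] ['se#', 'e##']
print(len(s2_grams))
# 17
```

With this padding, every character of the original string falls inside exactly q windows, which is why a single typo disturbs only a handful of tokens.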
## q-gram based token similarity

- 13 overlaps among a total of 22 distinct q-grams.
  - Overlaps: the number of item pairs, one from each set, that match
- Jaccard similarity

$$\mathrm{StringJaccard}(s_1, s_2) = \frac{13}{22} = 0.59$$

- Using the cosine similarity with tf-idf weights

CosSimilarity(V, W) = 0

## Using q-grams

- Using the same similarity measures used in other token-based similarity computations
- Less sensitive to typographical errors
- What if we change the size of q?
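To experiment with the closing question, the following sketch computes the Jaccard similarity of the two q-gram sets for several values of q and compares it with whole-word tokens. The helper names `qgram_set` and `jaccard` are hypothetical, and the `#` padding and `_` for spaces are assumptions taken from the 3-gram example, not a prescribed implementation.

```python
def qgram_set(s, q=3, pad="#", space="_"):
    """Set of q-grams of s, padded with q - 1 `pad` characters per side."""
    padded = pad * (q - 1) + s.replace(" ", space) + pad * (q - 1)
    return {padded[i:i + q] for i in range(len(padded) - q + 1)}


def jaccard(a, b):
    """|a intersect b| / |a union b|; two empty sets count as identical."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


s1, s2 = "Henri Waternoose", "Henry Waternose"

# Whole-word tokens: the two typos leave no token in common.
print("words", jaccard(set(s1.split()), set(s2.split())))

# q-gram tokens for a few window sizes.
for q in (2, 3, 4):
    a, b = qgram_set(s1, q), qgram_set(s2, q)
    print(q, len(a & b), len(a | b), round(jaccard(a, b), 2))
```

For q = 3 this reproduces the 13 overlaps out of 22 distinct q-grams (0.59), while the word-level score is 0.0 because neither word matches exactly.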