# Sangmi Lee Pallickara, CS480 Principles of Data Management

Sangmi Lee Pallickara, CS480 Principles of Data Management, Spring 2013 (2/22/13)

...less descriptive than firstname and lastname!

## Cosine Similarity

- Very sensitive to typographical errors in single tokens
  - Shean Conery and Sean Connery have a similarity of zero.

## Cosine similarity

- Given two n-dimensional vectors V and W, the cosine similarity computes the cosine of the angle $\alpha$ between these two vectors as

$$\mathrm{CosineSimilarity}(V, W) = \cos(\alpha) = \frac{V \cdot W}{\|V\| \times \|W\|}$$

where $\|V\|$ is the length of the vector $V = [a, b, c, \dots]$, computed as

$$\|V\| = \sqrt{a^2 + b^2 + c^2 + \dots}$$

## Cosine similarity - Continued

- The vectors V and W
  - Tokens in a string
  - Descriptions of a candidate
- The d dimensions of these vectors correspond to all d distinct tokens in a set of strings.
  - Denoted as D
- For a large database, d may be large
  - V and W have high dimensionality d

## Weight of Token

- Vector contains a weight for each of the d distinct tokens
- How to measure the weight?
  - Measuring frequency
  - Term frequency–inverse document frequency (tf-idf)

## tf-idf weighting

- The tf-idf weight of a token is the product of its tf weight and its idf weight:

$$w_{t,d} = (1 + \log_{10} tf_{t,d}) \times \log_{10}(N / df_t)$$

  where $N$ is the total number of candidates.

- Best known weighting scheme in information retrieval
  - Note: the "-" in tf-idf is a hyphen, not a minus sign!
  - Alternative names: tf.idf, tf x idf
- Increases with the number of occurrences within a document
- Increases with the rarity of the term in the collection

## Inverse document frequency

Example: CID America c3 American National Insurance Company Automobile c4 Farmers Insurance Associa...
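To make the tf-idf weight and the cosine similarity formula concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the course's implementation: tokens are produced by lowercasing and splitting on whitespace, $tf_{t,d}$ is taken to be the number of times token t occurs in candidate d, and $df_t$ the number of candidates containing t. The function names (`tokenize`, `tfidf_vectors`, `cosine_similarity`) and the third candidate name, "Sean Penn", are hypothetical and were added only so that the collection has a shared token.

```python
import math
from collections import Counter


def tokenize(text):
    """Assumed tokenizer: lowercase and split on whitespace."""
    return text.lower().split()


def tfidf_vectors(candidates):
    """Return one sparse tf-idf vector (token -> weight) per candidate.

    Uses the weight from the slides:
        w = (1 + log10(tf)) * log10(N / df)
    """
    counts = [Counter(tokenize(c)) for c in candidates]
    n = len(counts)

    # df[t] = number of candidates that contain token t at least once.
    df = Counter()
    for tf in counts:
        df.update(tf.keys())

    return [
        {t: (1 + math.log10(f)) * math.log10(n / df[t]) for t, f in tf.items()}
        for tf in counts
    ]


def cosine_similarity(v, w):
    """cos(alpha) = (V . W) / (||V|| * ||W||) for sparse dict vectors."""
    dot = sum(weight * w.get(token, 0.0) for token, weight in v.items())
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    norm_w = math.sqrt(sum(x * x for x in w.values()))
    if norm_v == 0.0 or norm_w == 0.0:
        return 0.0  # a zero vector has no direction; treat as "no similarity"
    return dot / (norm_v * norm_w)


# "Sean Penn" is a made-up third candidate, not part of the slides.
names = ["Shean Conery", "Sean Connery", "Sean Penn"]
shean_conery, sean_connery, sean_penn = tfidf_vectors(names)

# No token is shared, so the dot product (and the similarity) is zero.
print(cosine_similarity(shean_conery, sean_connery))

# The shared token "sean" gives a small positive similarity.
print(cosine_similarity(sean_connery, sean_penn))
```

Storing each vector as a dictionary keeps only the non-zero weights, which matters because d, the number of distinct tokens, can be large for a big database. The first comparison reproduces the weakness noted in the slides: a single typo per token leaves the two names with no token in common, so their similarity is zero.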
