... less descriptive than firstname and lastname!

Cosine Similarity
• Very sensitive to typographical errors in single tokens
  – "Shean Conery" and "Sean Connery" have a similarity of zero, because the two strings share no identical token.

CS480 Principles of Data Management, Spring 2013 (Sangmi Lee Pallickara)

Cosine similarity
• Given two n-dimensional vectors V and W, the cosine similarity computes the cosine of the angle α between these two vectors as
  $$\mathrm{CosineSimilarity}(V, W) = \cos(\alpha) = \frac{V \cdot W}{\|V\| \times \|W\|}$$
  where $\|V\|$ is the length of the vector $V = [a, b, c, \ldots]$, computed as
  $$\|V\| = \sqrt{a^2 + b^2 + c^2 + \cdots}$$

Cosine similarity (Continued)
• The vectors V and W
  – Tokens in a string
  – Descriptions of a candidate
• The d dimensions of these vectors correspond to all d distinct tokens in a set of strings.
  – Denoted as D
• For a large database, d may be large
  – V and W have high dimensionality d

Weight of Token
tf-idf weighting
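As a baseline before any token weighting is applied, the cosine similarity formula above can be sketched with raw token counts as the vector entries. This is a minimal sketch, not the course's implementation: the helper names `token_vector` and `cosine_similarity`, the lowercased whitespace tokenization, and the choice to return 0.0 for an empty string are all assumptions made for illustration.

```python
import math
from collections import Counter


def token_vector(s):
    """Sparse vector of raw token counts (assumed: lowercase, split on whitespace)."""
    return Counter(s.lower().split())


def cosine_similarity(v, w):
    """cos(alpha) = (V . W) / (||V|| * ||W||) for sparse vectors stored as dicts."""
    # Only tokens present in both vectors contribute to the dot product.
    dot = sum(weight * w[token] for token, weight in v.items() if token in w)
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    norm_w = math.sqrt(sum(x * x for x in w.values()))
    if norm_v == 0 or norm_w == 0:
        return 0.0  # assumed convention for empty strings
    return dot / (norm_v * norm_w)


# The slide's example: no token is shared, so the dot product is zero.
print(cosine_similarity(token_vector("Shean Conery"), token_vector("Sean Connery")))

# Token order does not matter, so the score is 1 up to floating-point rounding.
print(cosine_similarity(token_vector("Sean Connery"), token_vector("Connery Sean")))
```

Storing each vector as a dictionary avoids materializing all d dimensions, which matters when d is large.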
• Vector contains a weight for each of the d distinct tokens
• How to measure the weight?
  – Measuring frequency
  – Term frequency-inverse document frequency (tf-idf)

• The tf-idf weight of a token is the product of its tf weight and its idf weight:
  $$W_{t,d} = (1 + \log_{10} \mathit{tf}_{t,d}) \times \log_{10}(N / \mathit{df}_t)$$
  where $\mathit{tf}_{t,d}$ is the number of occurrences of token t in candidate d, $\mathit{df}_t$ is the number of candidates that contain t, and N is the total number of candidates
• Best known weighting scheme in information retrieval
  – Note: the "-" in tf-idf is a hyphen, not a minus sign!
  – Alternative names: tf.idf, tf x idf
• Increases with the number of occurrences within a document
• Increases with the rarity of the term in the collection

2/22/13

Inverse document frequency
Example
CID America c3 American National Insurance Company Automobile c4 Farmers Insurance Associa...
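The tf-idf weight formula above, W = (1 + log10 tf) × log10(N / df), can be sketched as follows. This is a hedged illustration only: the function name `tfidf_vectors`, the whitespace tokenization, and the three candidate strings (ids `x1` to `x3`) are hypothetical and are not taken from the example table in the slides.

```python
import math
from collections import Counter


def tfidf_vectors(candidates):
    """Map each candidate id to a sparse {token: tf-idf weight} vector.

    candidates: dict of candidate id -> string (hypothetical input format).
    """
    tokenized = {cid: s.lower().split() for cid, s in candidates.items()}
    n = len(tokenized)  # N: total number of candidates

    # df[t]: number of candidates that contain token t at least once.
    df = Counter()
    for tokens in tokenized.values():
        df.update(set(tokens))

    vectors = {}
    for cid, tokens in tokenized.items():
        tf = Counter(tokens)  # tf[t]: occurrences of t in this candidate
        vectors[cid] = {
            t: (1 + math.log10(count)) * math.log10(n / df[t])
            for t, count in tf.items()
        }
    return vectors


# Hypothetical candidates, invented for this sketch.
candidates = {
    "x1": "acme insurance company",
    "x2": "acme insurance",
    "x3": "acme bank",
}
vectors = tfidf_vectors(candidates)
for token, weight in sorted(vectors["x1"].items()):
    print(token, round(weight, 4))
```

In this made-up collection, "acme" occurs in every candidate, so its idf is log10(3/3) = 0 and it contributes nothing to a similarity score, while the rarer "company" receives the largest weight in `x1`.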
This note was uploaded on 02/11/2014 for the course CS 480 taught by Professor Staff during the Spring '08 term at Colorado State.
