Sangmi Lee Pallickara, CS480 Principles of Data Management


CS480 Principles of Data Management, Spring 2013 (2/22/13)

... less descriptive than firstname and lastname!

Cosine Similarity
•  Very sensitive to typographical errors in single tokens
   –  Shean Conery and Sean Connery have a similarity of zero.

Cosine similarity
•  Given two n-dimensional vectors V and W, the cosine similarity computes the cosine of the angle α between these two vectors as

$$\mathrm{CosineSimilarity}(V, W) = \cos(\alpha) = \frac{V \cdot W}{\|V\| \times \|W\|}$$

   where ||V|| is the length of the vector V = [a, b, c, ...], computed as

$$\|V\| = \sqrt{a^2 + b^2 + c^2 + \dots}$$

Cosine similarity - Continued
•  The vectors V and W
   –  Tokens in a string
   –  Descriptions of a candidate
•  The d dimensions of these vectors correspond to all d distinct tokens in a set of strings.
   –  Denoted as D
•  For a large database, d may be large
   –  V and W have high dimensionality d

Weight of Token
•  Vector contains a weight for each of the d distinct tokens
•  How to measure the weight?
   –  Measuring frequency
   –  Term frequency – inverse document frequency (tf-idf)

tf-idf weighting
•  The product of its tf weight and its idf weight

$$W_{t,d} = (1 + \log_{10} \mathit{tf}_{t,d}) \times \log_{10}(N / \mathit{df}_t)$$

   For the total number of candidates, N
•  Best known weighting scheme in information retrieval
   –  Note: the "-" in tf-idf is a hyphen, not a minus sign!
   –  Alternative names: tf.idf, tf x idf
•  Increases with the number of occurrences within a document
•  Increases with the rarity of the term in the collection

Inverse document frequency

Example
CID America c3 American National Insurance Company Automobile c4 Farmers Insurance Associa&...
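The tf-idf weighting and the cosine similarity formula from the slides can be combined in a small sketch. The code below is only an illustration under stated assumptions: the candidate ids (x1, x2, x3), the sample strings, the whitespace tokenizer, and all function names are hypothetical and do not come from the course material. Vectors are stored sparsely as dictionaries, so tokens absent from a candidate implicitly have weight zero.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: lowercase and split on whitespace; the slides do not
    # prescribe a tokenizer.
    return text.lower().split()


def document_frequencies(token_lists):
    # df_t: number of candidates that contain token t at least once.
    df = Counter()
    for tokens in token_lists:
        df.update(set(tokens))
    return df


def tfidf_vector(tokens, df, n_candidates):
    # W_{t,d} = (1 + log10 tf_{t,d}) * log10(N / df_t)
    counts = Counter(tokens)
    return {
        t: (1 + math.log10(tf)) * math.log10(n_candidates / df[t])
        for t, tf in counts.items()
    }


def cosine_similarity(v, w):
    # cos(alpha) = V.W / (||V|| * ||W||), with ||V|| = sqrt(sum of squares).
    dot = sum(weight * w.get(t, 0.0) for t, weight in v.items())
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    norm_w = math.sqrt(sum(x * x for x in w.values()))
    if norm_v == 0 or norm_w == 0:
        # A vector whose tokens all occur in every candidate has only
        # zero weights; treat its similarity as zero.
        return 0.0
    return dot / (norm_v * norm_w)


# Hypothetical candidates, chosen to mirror the typo example in the slides.
candidates = {
    "x1": "Sean Connery",
    "x2": "Shean Conery",
    "x3": "Sean Penn",
}

tokens = {cid: tokenize(text) for cid, text in candidates.items()}
df = document_frequencies(tokens.values())
vectors = {
    cid: tfidf_vector(toks, df, len(candidates)) for cid, toks in tokens.items()
}

for a, b in [("x1", "x2"), ("x1", "x3"), ("x1", "x1")]:
    print(a, b, cosine_similarity(vectors[a], vectors[b]))
```

Because "Sean Connery" and "Shean Conery" share no token, their dot product is zero, which matches the slide's remark about sensitivity to typographical errors in single tokens. The shared token "sean" gives x1 and x3 a small positive similarity, weighted down because it is less rare than the other tokens.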

This note was uploaded on 02/11/2014 for the course CS 480 taught by Professor Staff during the Spring '08 term at Colorado State.
