lecture6-tfidf-handout-6-per

# Instead rank more relevant documents higher than less relevant documents

Introduction to Information Retrieval, Sec. 6.3

$$\mathrm{Score}(q, d) = \sum_{t \in q \cap d} \mathrm{tf.idf}_{t,d}$$

There are many variants:

- How tf is computed (with/without logs)
- Whether the terms in the query are also weighted
- …

## Binary → count → weight matrix

| | Antony and Cleopatra | Julius Caesar | The Tempest | Hamlet | Othello | Macbeth |
|---|---|---|---|---|---|---|
| Antony | 5.25 | 3.18 | 0 | 0 | 0 | 0.35 |
| Brutus | 1.21 | 6.1 | 0 | 1 | 0 | 0 |
| Caesar | 8.59 | 2.54 | 0 | 1.51 | 0.25 | 0 |
| Calpurnia | 0 | 1.54 | 0 | 0 | 0 | 0 |
| Cleopatra | 2.85 | 0 | 0 | 0 | 0 | 0 |
| mercy | 1.51 | 0 | 1.9 | 0.12 | 5.25 | 0.88 |
| worser | 1.37 | 0 | 0.11 | 4.15 | 0.25 | 1.95 |

Each document is now represented by a real-valued vector of tf-idf weights $\in \mathbb{R}^{|V|}$.

## Documents as vectors

- So we have a $|V|$-dimensional vector space
- Terms are axes of the space
- Documents are points or vectors in this space
- Very high-dimensional: tens of millions of dimensions when you apply this to a web search engine
- These are very sparse vectors: most entries are zero.

## Queries as vectors

- Key idea 1: Do the same for queries: represent them as vectors in the space
- Key idea 2: Rank documents according to their proximity to the query in this space
- proximity = similarity of vectors
- proximity ≈ inverse of distance
- Recall: We do this because we want to get away from the you're-either-in-or-out Boolean model.
- Instead: rank more relevant documents higher than less relevant documents

## Formalizing vector space proximity

- First cut: distance between two points
- ( = distance between the end points of the two vectors)
- Euclidean distance?
- Euclidean distance is a bad idea . . .
- . . . because Euclidean distance is large f...
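To make the vector view concrete, here is a minimal Python sketch that treats each play as a vector of the tf-idf weights from the handout's matrix, scores a query by summing the weights of the query terms found in each play, and also computes the "first cut" Euclidean distance between query and document vectors. The function names (`query_vector`, `overlap_score`, `euclidean_distance`, `rank_by_score`), the binary 0/1 query weighting, and the example query are assumptions made for illustration; they are not taken from the slides.

```python
import math

# Axes of the vector space: one dimension per term (order is fixed).
TERMS = ["Antony", "Brutus", "Caesar", "Calpurnia", "Cleopatra", "mercy", "worser"]

# tf-idf weights from the handout's matrix, one vector per play,
# with components in the same order as TERMS.
DOCS = {
    "Antony and Cleopatra": [5.25, 1.21, 8.59, 0.0, 2.85, 1.51, 1.37],
    "Julius Caesar":        [3.18, 6.1, 2.54, 1.54, 0.0, 0.0, 0.0],
    "The Tempest":          [0.0, 0.0, 0.0, 0.0, 0.0, 1.9, 0.11],
    "Hamlet":               [0.0, 1.0, 1.51, 0.0, 0.0, 0.12, 4.15],
    "Othello":              [0.0, 0.0, 0.25, 0.0, 0.0, 5.25, 0.25],
    "Macbeth":              [0.35, 0.0, 0.0, 0.0, 0.0, 0.88, 1.95],
}


def query_vector(query_terms):
    """Represent a query in the same space as the documents.

    Assumption: binary weights (1.0 if the term is in the query, else 0.0).
    Weighting query terms differently is one of the variants the slides mention.
    """
    wanted = set(query_terms)
    return [1.0 if term in wanted else 0.0 for term in TERMS]


def overlap_score(query_terms, doc_vec):
    """Sum the document's tf-idf weights over the terms that appear in the query."""
    wanted = set(query_terms)
    return sum(w for term, w in zip(TERMS, doc_vec) if term in wanted)


def euclidean_distance(u, v):
    """Distance between the end points of two vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def rank_by_score(query_terms):
    """Return play titles ordered from highest to lowest overlap score."""
    return sorted(DOCS, key=lambda name: overlap_score(query_terms, DOCS[name]),
                  reverse=True)


if __name__ == "__main__":
    query = ["Brutus", "Caesar"]  # hypothetical example query
    q_vec = query_vector(query)
    for name in rank_by_score(query):
        score = overlap_score(query, DOCS[name])
        dist = euclidean_distance(q_vec, DOCS[name])
        print(f"{name:22s} score={score:5.2f}  euclidean={dist:5.2f}")
```

With this query, the score ranking puts Antony and Cleopatra first, while the play nearest to the query by Euclidean distance is The Tempest, which contains neither query term. That is in line with the slides' warning that Euclidean distance is a poor first cut for proximity.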
