lecture6-tfidf-handout-6-per

Long and short documents now have comparable weights


Introduction to Information Retrieval, Sec. 6.3

… or vectors of different lengths.

Why distance is a bad idea
• The Euclidean distance between q and d2 is large even though the distribution of terms in the query q and the distribution of terms in the document d2 are very similar.

Use angle instead of distance
• Thought experiment: take a document d and append it to itself. Call this document d+.
• Semantically, d and d+ have the same content.
• The Euclidean distance between the two documents can be quite large.
• The angle between the two documents is 0, corresponding to maximal similarity.
• Key idea: rank documents according to their angle with the query.

From angles to cosines
• The following two notions are equivalent:
  • Rank documents in increasing order of the angle between query and document.
  • Rank documents in decreasing order of cosine(query, document).
• Cosine is a monotonically decreasing function on the interval [0°, 180°].

Length normalization
• A vector can be (length-)normalized by dividing each of its components by its length; for this we use the L2 norm: $\|\vec{x}\|_2 = \sqrt{\sum_i x_i^2}$
• Dividing a vector by its L2 norm makes it a unit (length) vector (on the surface of the unit hypersphere).
• Effect on the two documents d and d+ (d appended to i...
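
The thought experiment with d and d+ and the effect of L2 normalization can be checked numerically. The following is a minimal sketch, not taken from the lecture: the term counts in `d` are made-up values, and the helper names `l2_norm`, `normalize`, `euclidean`, and `cosine` are hypothetical. It assumes that appending a document to itself simply doubles every term count, and it uses only the Python standard library.

```python
import math

def l2_norm(x):
    """Length of x under the L2 norm: sqrt(sum_i x_i^2)."""
    return math.sqrt(sum(v * v for v in x))

def normalize(x):
    """Divide each component by the L2 norm, giving a unit-length vector."""
    n = l2_norm(x)
    return [v / n for v in x]

def euclidean(x, y):
    """Euclidean distance between two vectors of equal dimension."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def cosine(x, y):
    """Cosine of the angle between x and y."""
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (l2_norm(x) * l2_norm(y))

# Toy term-count vector for a document d (illustrative values only).
d = [3.0, 0.0, 4.0]
# Assumption: appending d to itself doubles every term count.
d_plus = [2 * v for v in d]

dist = euclidean(d, d_plus)   # as large as the length of d itself
cos = cosine(d, d_plus)       # 1 up to rounding, i.e. an angle of 0
# Clamp before acos to guard against rounding slightly above 1.
angle_deg = math.degrees(math.acos(min(1.0, cos)))

# After length normalization, d and d+ become the same unit vector.
d_unit = normalize(d)
d_plus_unit = normalize(d_plus)

print("Euclidean distance:", dist)
print("Cosine:", cos, "Angle (degrees):", angle_deg)
print("Normalized d: ", d_unit)
print("Normalized d+:", d_plus_unit)
```

In this sketch the Euclidean distance between `d` and `d_plus` is nonzero even though the two vectors point in the same direction, while the cosine and the normalized vectors treat them as identical, which is the behavior the slides argue for.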

This document was uploaded on 02/26/2014.
