lecture6-tfidf-handout-6-per

# Long and short documents now have comparable weights


…or vectors of different lengths.

Introduction to Information Retrieval, Sec. 6.3

## Why distance is a bad idea

The Euclidean distance between q and d2 is large even though the distribution of terms in the query q and the distribution of terms in the document d2 are very similar.

## Use angle instead of distance

- Thought experiment: take a document d and append it to itself. Call this document d+.
- Semantically, d and d+ have the same content.
- The Euclidean distance between the two documents can be quite large.
- The angle between the two documents is 0, corresponding to maximal similarity.
- Key idea: rank documents according to their angle with the query.

## From angles to cosines

- The following two notions are equivalent:
  - Rank documents in decreasing order of the angle between query and document.
  - Rank documents in increasing order of cosine(query, document).
- Cosine is a monotonically decreasing function on the interval [0°, 180°].

## Length normalization

- A vector can be (length-)normalized by dividing each of its components by its length; for this we use the L2 norm: ‖x‖₂ = √(Σᵢ xᵢ²).
- Dividing a vector by its L2 norm makes it a unit (length) vector (on the surface of the unit hypersphere).
- Effect on the two documents d and d+ (d appended to itself)…
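The d vs. d+ thought experiment can be checked numerically. A minimal sketch in Python, using a made-up term-weight vector for d (the weights and vocabulary are illustrative, not from the handout): appending d to itself doubles every term weight, so the Euclidean distance is large, but after L2 normalization the two vectors are identical and the cosine is maximal.

```python
import math

def l2_normalize(v):
    """Divide each component by the L2 norm, yielding a unit vector."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cosine(u, v):
    """Cosine similarity: dot product of the two unit vectors."""
    return sum(a * b for a, b in zip(l2_normalize(u), l2_normalize(v)))

d = [3.0, 1.0, 0.0]            # hypothetical term weights for document d
d_plus = [2 * x for x in d]    # d appended to itself: every weight doubles

# Euclidean distance between d and d+ is large...
dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(d, d_plus)))

# ...but the angle between them is 0, so cosine similarity is maximal.
sim = cosine(d, d_plus)

print(dist)  # sqrt(10) ≈ 3.16
print(sim)   # 1.0
```

After length normalization, `l2_normalize(d)` and `l2_normalize(d_plus)` are the same vector, which is exactly why long and short documents end up with comparable weights.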