[...] $\log(N/n_t)$, a length normalization factor is used to ensure that all
documents have equal chances of being retrieved independent of their lengths:

$$w(d, t) = \frac{\mathrm{tf}(d, t)\,\log(N/n_t)}{\sqrt{\sum_{j=1}^{m} \mathrm{tf}(d, t_j)^2\,(\log(N/n_{t_j}))^2}}, \tag{2}$$

where $N$ is the size of the document collection $D$ and $n_t$ is the number of
documents in $D$ that contain term $t$.
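
The weighting in Eq. (2) can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name `tfidf_weights`, the three-document toy collection, and the zero-norm guard are assumptions made for the example and do not come from the text.

```python
import math
from collections import Counter


def tfidf_weights(docs):
    """Length-normalized tf-idf weights in the spirit of Eq. (2).

    docs: list of token lists (a toy stand-in for the collection D).
    Returns the sorted vocabulary and one weight vector per document.
    """
    N = len(docs)
    vocab = sorted({t for doc in docs for t in doc})
    # n_t: number of documents that contain term t (each document counted once)
    n = Counter(t for doc in docs for t in set(doc))
    idf = {t: math.log(N / n[t]) for t in vocab}

    vectors = []
    for doc in docs:
        tf = Counter(doc)
        raw = [tf[t] * idf[t] for t in vocab]
        # Denominator of Eq. (2): Euclidean length of the raw tf-idf vector
        norm = math.sqrt(sum(x * x for x in raw))
        # Assumed guard: if every term of a document occurs in all documents,
        # all idf values are 0 and the norm vanishes.
        vectors.append([x / norm if norm > 0 else 0.0 for x in raw])
    return vocab, vectors


# Hypothetical toy collection
docs = [
    ["text", "mining", "text"],
    ["data", "mining"],
    ["text", "retrieval"],
]
vocab, W = tfidf_weights(docs)
```

Each vector in `W` has unit Euclidean length, so a long document does not receive larger weights merely because it repeats its terms more often.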
Based on a weighting scheme, a document $d$ is defined by a vector of term
weights $w(d) = (w(d, t_1), \ldots, w(d, t_m))$, and the similarity $S$ of two documents
$d_1$ and $d_2$ (or the similarity of a document and a query vector) can be computed
based on the inner product of the vectors (which, if we assume normalized
vectors, equals the cosine between the two document vectors), i.e.

$$S(d_1, d_2) = \sum_{k=1}^{m} w(d_1, t_k) \cdot w(d_2, t_k). \tag{3}$$

A frequently used distance measure is the Euclidean distance. We calculate the
distance between two text documents $d_1, d_2 \in D$ as follows:

$$\mathrm{dist}(d_1, d_2) = \sqrt{\sum_{k=1}^{m} |w(d_1, t_k) - w(d_2, t_k)|^2}. \tag{4}$$

However, the Euclidean distance should only be used for normalized vectors,
since otherwise the different lengths of documents can result in a smalle...
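
The similarity of Eq. (3) and the distance of Eq. (4) can be compared on a small example. The sketch below uses three hypothetical weight vectors chosen for illustration; the helper names `inner_product`, `euclidean`, and `normalize` are assumptions, not names from the text. The vector `w2` points in the same direction as `w1` but is three times as long, as if the same content were repeated.

```python
import math


def inner_product(u, v):
    """Similarity S of Eq. (3): sum of component-wise products."""
    return sum(a * b for a, b in zip(u, v))


def euclidean(u, v):
    """Distance of Eq. (4): square root of summed squared differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def normalize(v):
    """Scale a non-zero vector to unit Euclidean length."""
    length = math.sqrt(inner_product(v, v))
    return [x / length for x in v]


# Hypothetical raw (unnormalized) weight vectors
w1 = [1.0, 2.0, 0.0]
w2 = [3.0, 6.0, 0.0]  # same direction as w1, three times as long
w3 = [0.0, 1.0, 2.0]  # shares only one term with w1

u1, u2, u3 = normalize(w1), normalize(w2), normalize(w3)

# Raw vectors: w1 ends up closer to w3 than to w2,
# although w1 and w2 have identical term proportions.
raw_12, raw_13 = euclidean(w1, w2), euclidean(w1, w3)

# Normalized vectors: the order is reversed, and the inner product
# of u1 and u2 is the cosine of the angle between them.
norm_12, norm_13 = euclidean(u1, u2), euclidean(u1, u3)
cos_12, cos_13 = inner_product(u1, u2), inner_product(u1, u3)
```

For unit-length vectors the two measures carry the same information, since $\mathrm{dist}(d_1, d_2)^2 = 2 - 2\,S(d_1, d_2)$; ranking by increasing distance is then equivalent to ranking by decreasing cosine similarity.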
This note was uploaded on 06/19/2011 for the course IT 2258 taught by Professor Aymenali during the Summer '11 term at Abu Dhabi University.