# Based on a weighting scheme a document d is defined by

... $\log(N/n_t)$, a length normalization factor is used to ensure that all documents have equal chances of being retrieved independent of their lengths:

$$w(d, t) = \frac{\mathrm{tf}(d, t)\,\log(N/n_t)}{\sqrt{\sum_{j=1}^{m} \mathrm{tf}(d, t_j)^2\,(\log(N/n_{t_j}))^2}}, \tag{2}$$

where $N$ is the size of the document collection $D$ and $n_t$ is the number of documents in $D$ that contain term $t$.

Based on a weighting scheme, a document $d$ is defined by a vector of term weights $w(d) = (w(d, t_1), \ldots, w(d, t_m))$, and the similarity $S$ of two documents $d_1$ and $d_2$ (or the similarity of a document and a query vector) can be computed as the inner product of the vectors (which, if we assume normalized vectors, is the cosine of the angle between the two document vectors), i.e.

$$S(d_1, d_2) = \sum_{k=1}^{m} w(d_1, t_k) \cdot w(d_2, t_k). \tag{3}$$

A frequently used distance measure is the Euclidean distance. We calculate the distance between two text documents $d_1, d_2 \in D$ as follows:

$$\mathrm{dist}(d_1, d_2) = \sqrt{\sum_{k=1}^{m} |w(d_1, t_k) - w(d_2, t_k)|^2}. \tag{4}$$

However, the Euclidean distance should only be used for normalized vectors, since otherwise the different lengths of documents can result in a smalle...
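
The weighting, similarity, and distance in Equations (2) to (4) can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation. The function names (`tfidf_vectors`, `similarity`, `distance`), the whitespace tokenization, and the three toy documents are assumptions made for the example. The sketch also assumes that the length normalization divides each raw weight by the Euclidean norm of the document's raw tf-idf vector, so that every non-zero vector has unit length.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return (vocabulary, length-normalized tf-idf vectors) for tokenized docs."""
    N = len(docs)  # size of the document collection
    vocab = sorted({t for doc in docs for t in doc})
    # n[t]: number of documents that contain term t at least once
    n = Counter(t for doc in docs for t in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)  # tf[t]: frequency of term t in this document
        raw = [tf[t] * math.log(N / n[t]) for t in vocab]
        norm = math.sqrt(sum(x * x for x in raw))
        # If every term of a document occurs in all documents, the raw vector
        # is all zeros; it is left unnormalized to avoid dividing by zero.
        vectors.append([x / norm for x in raw] if norm > 0 else raw)
    return vocab, vectors


def similarity(w1, w2):
    """Inner product of two weight vectors, as in Eq. (3)."""
    return sum(a * b for a, b in zip(w1, w2))


def distance(w1, w2):
    """Euclidean distance between two weight vectors, as in Eq. (4)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(w1, w2)))


# Hypothetical toy collection, tokenized by whitespace.
docs = [
    "text mining finds patterns in text".split(),
    "data mining finds patterns in data".split(),
    "cooking recipes for pasta".split(),
]
vocab, w = tfidf_vectors(docs)

print("S(d1, d2)    =", similarity(w[0], w[1]))
print("S(d1, d3)    =", similarity(w[0], w[2]))
print("dist(d1, d2) =", distance(w[0], w[1]))
print("dist(d1, d3) =", distance(w[0], w[2]))
```

Because the vectors are divided by their norm, the base of the logarithm cancels out and does not affect the result. For unit-length vectors the two measures are linked by $\mathrm{dist}(d_1, d_2)^2 = 2 - 2\,S(d_1, d_2)$, so ranking documents by decreasing similarity or by increasing distance gives the same order. In the toy collection the first two documents share several terms and score a higher similarity with each other than either does with the third, which shares none.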
