Based on a weighting scheme a document d is defined by


... $\log(N/n_t)$, a length normalization factor is used to ensure that all documents have equal chances of being retrieved independent of their lengths:

$$w(d, t) = \frac{\mathrm{tf}(d, t)\,\log(N/n_t)}{\sqrt{\sum_{j=1}^{m} \mathrm{tf}(d, t_j)^2\,(\log(N/n_{t_j}))^2}}, \qquad (2)$$

where $N$ is the size of the document collection $D$ and $n_t$ is the number of documents in $D$ that contain term $t$.

Based on a weighting scheme, a document $d$ is defined by a vector of term weights $w(d) = (w(d, t_1), \ldots, w(d, t_m))$, and the similarity $S$ of two documents $d_1$ and $d_2$ (or the similarity of a document and a query vector) can be computed based on the inner product of the vectors (which, if we assume normalized vectors, is the cosine of the angle between the two document vectors), i.e.

$$S(d_1, d_2) = \sum_{k=1}^{m} w(d_1, t_k) \cdot w(d_2, t_k). \qquad (3)$$

A frequently used distance measure is the Euclidean distance. We calculate the distance between two text documents $d_1, d_2 \in D$ as follows:

$$\mathrm{dist}(d_1, d_2) = \sqrt{\sum_{k=1}^{m} \left| w(d_1, t_k) - w(d_2, t_k) \right|^2}. \qquad (4)$$

However, the Euclidean distance should only be used for normalized vectors, since otherwise the different lengths of documents can result in a smalle...
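As an illustration of equations (2) to (4), the following is a minimal Python sketch, not the authors' implementation. It assumes documents are already tokenized into lists of terms; the function names `tfidf_vectors`, `similarity`, and `euclidean_distance` are hypothetical, and returning an all-zero vector when the normalization factor is zero is an assumption made here to avoid division by zero.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return (vocab, vectors) with length-normalized weights as in eq. (2).

    docs is a list of token lists (tokenization is assumed to be done already).
    """
    N = len(docs)  # size of the document collection D
    vocab = sorted({t for doc in docs for t in doc})
    # n_t: number of documents in D that contain term t
    n = Counter(t for doc in docs for t in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)  # tf(d, t): raw term frequency
        raw = [tf[t] * math.log(N / n[t]) for t in vocab]
        norm = math.sqrt(sum(x * x for x in raw))  # denominator of eq. (2)
        # Assumption: keep an all-zero vector if the norm is zero.
        vectors.append([x / norm for x in raw] if norm > 0 else raw)
    return vocab, vectors


def similarity(w1, w2):
    """Inner product of two weight vectors, eq. (3)."""
    return sum(a * b for a, b in zip(w1, w2))


def euclidean_distance(w1, w2):
    """Euclidean distance between two weight vectors, eq. (4)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(w1, w2)))


docs = [
    ["text", "mining", "text"],
    ["data", "mining"],
    ["text", "retrieval"],
]
vocab, vectors = tfidf_vectors(docs)
print(vocab)
print(similarity(vectors[0], vectors[1]))
print(euclidean_distance(vectors[0], vectors[1]))
```

Because the vectors produced here have unit length, the two measures are tied together by dist(d1, d2)^2 = 2 - 2 S(d1, d2), so ranking by increasing distance and ranking by decreasing similarity give the same order.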

This note was uploaded on 06/19/2011 for the course IT 2258 taught by Professor Aymenali during the Summer '11 term at Abu Dhabi University.
