10.1.1.153.6679

# tm be the dictionary ie the set of all different

document collection. In order to allow a more formal description of the algorithms, we first define some terms and variables that will be used frequently in the following: Let $D$ be the set of documents and $T = \{t_1, \ldots, t_m\}$ be the dictionary, i.e. the set of all different terms occurring in $D$. The absolute frequency of term $t \in T$ in document $d \in D$ is then given by $\mathrm{tf}(d, t)$. We denote the term vectors $\vec{t}_d = (\mathrm{tf}(d, t_1), \ldots, \mathrm{tf}(d, t_m))$.

Later on, we will also need the notion of the centroid of a set $X$ of term vectors. It is defined as the mean value $\vec{t}_X := \frac{1}{|X|} \sum_{\vec{t}_d \in X} \vec{t}_d$ of its term vectors. In the sequel, we will also apply tf to subsets of terms: for $T' \subseteq T$, we let $\mathrm{tf}(d, T') := \sum_{t \in T'} \mathrm{tf}(d, t)$.

2.1.1 Filtering, Lemmatization and Stemming

In order to reduce the size of the dictionary, and thus the dimensionality of the description of documents within the collection, the set of words describing the documents can be reduced by filtering and lemmatization or stemming methods. Filtering methods remove...
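
The definitions of the dictionary, the term vectors, the centroid and tf over a subset of terms can be illustrated with a minimal Python sketch. The function names, the toy documents and the whitespace tokenization are assumptions made for illustration only; they do not come from the source text.

```python
from collections import Counter


def build_dictionary(docs):
    """Return the dictionary T: all distinct terms in the collection, sorted."""
    return sorted({term for tokens in docs.values() for term in tokens})


def term_vector(tokens, dictionary):
    """Return (tf(d, t_1), ..., tf(d, t_m)) for one tokenized document d."""
    counts = Counter(tokens)
    return [counts[t] for t in dictionary]


def centroid(vectors):
    """Return the component-wise mean of a non-empty list of term vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def tf_subset(tokens, subset):
    """Return tf(d, T'): the summed frequency of all terms in the subset T'."""
    counts = Counter(tokens)
    return sum(counts[t] for t in subset)


# Toy collection D (hypothetical data), tokenized by splitting on whitespace.
docs = {
    "d1": "text mining finds patterns in text".split(),
    "d2": "mining text collections".split(),
}

T = build_dictionary(docs)
vectors = {d: term_vector(tokens, T) for d, tokens in docs.items()}
center = centroid(list(vectors.values()))
subset_tf = tf_subset(docs["d1"], {"text", "mining"})
```

With this toy collection, `T` holds six terms, each document maps to a vector of length six, and `subset_tf` sums the counts of "text" and "mining" in `d1`. Sorting the dictionary is a design choice that only serves to fix the order of the vector components.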

## This note was uploaded on 06/19/2011 for the course IT 2258 taught by Professor Aymenali during the Summer '11 term at Abu Dhabi University.
