... document collection.

In order to allow a more formal description of the algorithms, we first define some terms and variables that will be used frequently in the following. Let $D$ be the set of documents and $T = \{t_1, \ldots, t_m\}$ be the dictionary, i.e. the set of all different terms occurring in $D$. The absolute frequency of term $t \in T$ in document $d \in D$ is then given by $\mathrm{tf}(d, t)$. We denote the term vectors by $\vec{t}_d = (\mathrm{tf}(d, t_1), \ldots, \mathrm{tf}(d, t_m))$. Later on, we will also need the notion of the centroid of a set $X$ of term vectors. It is defined as the mean value $\vec{t}_X := \frac{1}{|X|} \sum_{\vec{t}_d \in X} \vec{t}_d$ of its term vectors. In the sequel, we will also apply tf to subsets of terms: for $T' \subseteq T$, we let $\mathrm{tf}(d, T') := \sum_{t \in T'} \mathrm{tf}(d, t)$.

2.1.1 Filtering, Lemmatization and Stemming

In order to reduce the size of the dictionary and thus the dimensionality of the description of documents within the collection, the set of words describing the documents can be reduced by filtering and lemmatization or stemming methods. Filtering methods remove...
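As an illustration of the definitions introduced before Section 2.1.1 (the dictionary T, the term frequency tf(d, t), the term vectors, the centroid of a set of term vectors, and tf applied to a subset of terms), the following is a minimal Python sketch. It assumes that each document is already tokenized into a list of terms; the function names `build_dictionary`, `term_vector`, `centroid` and `tf_subset` are hypothetical and chosen only for this example, and the sorted ordering of the dictionary is an arbitrary convention.

```python
from collections import Counter


def build_dictionary(docs):
    """Return T: all distinct terms occurring in the documents D.

    Assumption: each document is a list of already-tokenized terms.
    The terms are sorted only to fix an order t_1, ..., t_m.
    """
    return sorted({term for doc in docs for term in doc})


def term_vector(doc, dictionary):
    """Return (tf(d, t_1), ..., tf(d, t_m)) for one document d."""
    counts = Counter(doc)  # absolute frequency of each term in d
    return [counts[t] for t in dictionary]


def centroid(vectors):
    """Return the component-wise mean of a non-empty set X of term vectors."""
    if not vectors:
        raise ValueError("the centroid of an empty set is undefined")
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def tf_subset(doc, terms):
    """Return tf(d, T'): the sum of tf(d, t) over all terms t in T'."""
    counts = Counter(doc)
    return sum(counts[t] for t in set(terms))


# Toy collection of two tokenized documents (made-up data).
docs = [["text", "mining", "text"], ["data", "mining"]]
dictionary = build_dictionary(docs)
vectors = [term_vector(d, dictionary) for d in docs]
```

With this toy collection, `centroid(vectors)` averages the two term vectors component by component, and `tf_subset(docs[0], {"text", "data"})` adds up the frequencies of "text" and "data" in the first document.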
This note was uploaded on 06/19/2011 for the course IT 2258 taught by Professor Aymenali during the Summer '11 term at Abu Dhabi University.
