Band 20 – 2005 25 Hotho, Nürnberger, and Paaß
words from the dictionary and thus from the documents. A standard filtering method is stop word filtering. The idea of stop
word filtering is to remove words that carry little or no content information,
such as articles, conjunctions, and prepositions. Furthermore, words that occur
extremely often contribute little information for distinguishing
between documents, and words that occur very seldom are unlikely to be
of any particular statistical relevance; both can be removed from the dictionary
(Frakes & Baeza-Yates 1992). To further reduce the number of words in
the dictionary, (index) term selection methods can also be used (see Sect. 2.1.2).
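
The filtering steps described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the procedure of the cited authors: the stop word list is a tiny hypothetical sample, the tokenizer is deliberately naive, and the thresholds `min_df` and `max_df_ratio` are illustrative names and values that do not come from the text.

```python
from collections import Counter

# Hypothetical, deliberately tiny stop word list; real lists are much longer.
STOP_WORDS = {"a", "an", "the", "and", "or", "of", "in", "on", "to", "is"}


def tokenize(text):
    """Lowercase, split on whitespace, and strip surrounding punctuation."""
    tokens = (t.strip(".,;:!?()\"'") for t in text.lower().split())
    return [t for t in tokens if t]


def build_dictionary(documents, stop_words=STOP_WORDS, min_df=2, max_df_ratio=0.9):
    """Return the set of words kept after stop word and frequency filtering.

    min_df and max_df_ratio are assumed, illustrative thresholds.
    """
    doc_freq = Counter()
    for doc in documents:
        # Count each word at most once per document (document frequency).
        doc_freq.update(set(tokenize(doc)))
    n_docs = len(documents)
    return {
        word
        for word, df in doc_freq.items()
        if word not in stop_words           # stop word filtering
        and df >= min_df                    # drop very rare words
        and df / n_docs <= max_df_ratio     # drop words found in almost every document
    }


docs = [
    "The cat sat on the mat.",
    "A cat and a dog play in the garden.",
    "The dog sat in the garden.",
    "Text mining is the analysis of text.",
]
dictionary = build_dictionary(docs)
print(sorted(dictionary))  # ['cat', 'dog', 'garden', 'sat']
```

In this toy corpus, "the" is removed both as a stop word and because it occurs in every document, while words such as "mat" or "mining" are dropped because they appear in only one document. Suitable thresholds depend on the size and nature of the collection.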
Lemmatization methods try to map verb forms to the infinitive and nouns
to the singular form. However, in order to achieve this, the word form has to
be known, i.e. the part of speech of every word in the text document has to be
assigned. Since this tagging process is usually quite time consuming and still
This note was uploaded on 06/19/2011 for the course IT 2258 taught by Professor Aymenali during the Summer '11 term at Abu Dhabi University.