

Hotho, Nürnberger, and Paaß, Band 20 – 2005

... words from the dictionary and thus from the documents. A standard filtering method is stop word filtering. The idea of stop word filtering is to remove words that bear little or no content information, such as articles, conjunctions, and prepositions. Furthermore, words that occur extremely often carry little information for distinguishing between documents, and words that occur very seldom are likely to be of no particular statistical relevance; both can be removed from the dictionary (Frakes & Baeza-Yates 1992). To reduce the number of words in the dictionary further, (index) term selection methods can also be used (see Sect. 2.1.2).

Lemmatization methods try to map verb forms to the infinitive and nouns to the singular form. However, in order to achieve this, the word form has to be known, i.e. the part of speech of every word in the text document has to be assigned. Since this tagging process is usually quite time-consuming and still er...
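The filtering steps described above (stop word removal, then dropping very frequent and very rare words) can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation: the function name `build_dictionary`, the tiny `STOP_WORDS` list, and the thresholds `min_df` and `max_df_ratio` are all assumptions chosen for the example, and tokenization is reduced to lowercasing and splitting on whitespace.

```python
from collections import Counter

# Tiny illustrative stop word list (an assumption; real lists are much longer).
STOP_WORDS = {"a", "an", "the", "and", "or", "of", "in", "on", "to", "is"}


def build_dictionary(documents, min_df=2, max_df_ratio=0.7):
    """Return the set of words kept after stop word and frequency filtering.

    min_df and max_df_ratio are hypothetical thresholds: a word is kept only
    if it appears in at least min_df documents and in at most max_df_ratio
    of all documents.
    """
    # Naive tokenization: lowercase and split on whitespace.
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)

    # Document frequency: the number of documents each word appears in.
    doc_freq = Counter(word for tokens in tokenized for word in set(tokens))

    return {
        word
        for word, count in doc_freq.items()
        if word not in STOP_WORDS           # stop word filtering
        and count >= min_df                 # drop very rare words
        and count / n_docs <= max_df_ratio  # drop very frequent words
    }


docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "a cat and a dog",
    "the bird sat alone",
]

dictionary = build_dictionary(docs, min_df=2, max_df_ratio=0.7)
print(sorted(dictionary))
```

In this toy corpus, "sat" appears in three of four documents and is dropped as too frequent, while words such as "mat" or "bird" appear only once and are dropped as too rare. Suitable thresholds depend on the corpus and would need to be tuned in practice.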

This note was uploaded on 06/19/2011 for the course IT 2258 taught by Professor Aymenali during the Summer '11 term at Abu Dhabi University.
