... linguistic preprocessing is of limited value compared to the simple bag-of-words approach with
basic preprocessing. The reason is that the co-occurrence of terms in the vector
representation serves as an automatic disambiguation, e.g. for classification
(Leopold & Kindermann 2002). Recently, some progress was made by enhancing
the bag-of-words representation with linguistic features for text clustering and classification (Hotho
et al. 2003; Bloehdorn & Hotho 2004).

Band 20 – 2005 29
Hotho, Nürnberger, and Paaß
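The disambiguation effect of co-occurring terms can be illustrated with a minimal sketch. It assumes a very basic preprocessing step (lowercasing and splitting on non-letters); the two example documents and the helper name bag_of_words are hypothetical and are not taken from the cited works.

```python
import re
from collections import Counter


def bag_of_words(text):
    """Basic preprocessing (assumed here): lowercase, keep letter runs, count terms."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(tokens)


# Two made-up documents that both contain the ambiguous term "bank".
docs = {
    "finance": "The bank raised the interest rate on every loan",
    "nature": "The river bank was covered with grass and water",
}

# One term-frequency vector per document.
vectors = {name: bag_of_words(text) for name, text in docs.items()}

# Terms that occur in both documents.
shared = set(vectors["finance"]) & set(vectors["nature"])

# Terms that occur in only one document; these give "bank" its context.
only_finance = set(vectors["finance"]) - shared
only_nature = set(vectors["nature"]) - shared

print(sorted(shared))
print(sorted(only_finance))
print(sorted(only_nature))
```

The term "bank" alone does not separate the two documents, but the terms that co-occur with it in each vector (such as "loan" or "river") do, which is the kind of implicit disambiguation a classifier can exploit without explicit linguistic analysis.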
3 Data Mining Methods for Text

One main reason for applying data mining methods to text document collections
is to structure them. A structure can significantly simplify access to a document collection for a user. Well-known access structures are library catalogues
or book indexes. However, the problem with manually designed indexes is the
time required to maintain them. Therefore, they are very often not up to date
and thus not usable for recent publications or frequently changing information
sources like the World Wide Web. The existing methods fo...