Text Mining
order to be able to define at least the importance of a word within a given document, usually a vector representation is used, where for each word a numerical
"importance" value is stored. The currently predominant approaches based on
this idea are the vector space model (Salton et al. 1975), the probabilistic model
(Robertson 1977) and the logical model (van Rijsbergen 1986).
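As a minimal sketch of the vector representation described above, the following Python snippet stores one numerical value per word of a single document. Using the raw word count as the "importance" value is an assumption made purely for illustration, since the text does not prescribe a particular weighting here; the function name `word_importance` and the sample document are hypothetical.

```python
from collections import Counter


def word_importance(words):
    """Map each word of a document to a numerical importance value.

    Assumption: importance is the raw number of occurrences of the word.
    """
    return dict(Counter(words))


# Hypothetical document, already split into words.
doc = ["text", "mining", "finds", "patterns", "in", "text"]
vector = word_importance(doc)
print(vector)
```

Other weighting schemes could be substituted for the raw count without changing the overall shape of the representation: one value per word.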
In the following we briefly describe how a bag-of-words representation can be
obtained. Furthermore, we describe the vector space model and corresponding
similarity measures in more detail, since this model will be used by several text
mining approaches discussed in this article.
2.1 Text Preprocessing
In order to obtain all words that are used in a given text, a tokenization process
is required, i.e. a text document is split into a stream of words by removing all
punctuation marks and by replacing tabs and other non-text characters by single
white spaces. This tokenized representation is then used for further processing.
The set of different words obtained by merging all text documents of a collection
is called the dictionary of a...
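The tokenization step and the merging of all distinct words into a dictionary, as described above, can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: punctuation is taken to mean ASCII punctuation only, "non-text characters" is interpreted as anything that is not a letter or digit, no case folding is applied because the text does not mention it, and the names `tokenize` and `build_dictionary` are hypothetical.

```python
import re
import string


def tokenize(text):
    """Split a text document into a stream of words."""
    # Remove all punctuation marks (assumption: ASCII punctuation only).
    no_punct = text.translate(str.maketrans("", "", string.punctuation))
    # Replace tabs and other non-text characters by single white spaces
    # (assumption: "non-text" means not a letter or digit).
    normalized = re.sub(r"[^\w]+", " ", no_punct)
    # Split on white space to obtain the individual words.
    return normalized.split()


def build_dictionary(documents):
    """Merge the distinct words of all documents into one set."""
    dictionary = set()
    for document in documents:
        dictionary.update(tokenize(document))
    return dictionary


# Hypothetical two-document collection.
collection = ["Text mining is fun.", "Mining text,\tquickly!"]
print(tokenize(collection[1]))
print(sorted(build_dictionary(collection)))
```

Note that because this sketch does not fold case, "Text" and "text" are treated as different words; whether that is desirable depends on later preprocessing choices.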