distance between documents that share fewer words than between documents
that have more words in common and should therefore be considered more
similar.
Note that for normalized vectors the scalar product is not much different in
behavior from the Euclidean distance, since for two vectors x and y we have

\[ \cos\varphi = \frac{x \cdot y}{|x| \cdot |y|} = 1 - \frac{1}{2}\, d^2\!\left(\frac{x}{|x|}, \frac{y}{|y|}\right). \]

For a more detailed discussion of the vector space model and weighting schemes
see, e.g. Baeza-Yates & Ribeiro-Neto (1999); Greiff (1998); Salton & Buckley
(1988); Salton et al. (1975).

28 LDV-FORUM A Brief Survey of Text Mining
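The relation between the scalar product and the Euclidean distance follows
from expanding the squared distance of two unit vectors u and v:
|u - v|^2 = |u|^2 + |v|^2 - 2 u·v = 2 - 2 cos ϕ. The following minimal sketch
checks this numerically and also illustrates why the Euclidean distance is
better applied to normalized vectors. The three term-count vectors and all
function names are hypothetical, chosen only for illustration; they are not
taken from the survey.

```python
import math


def dot(x, y):
    """Scalar product of two equal-length vectors."""
    return sum(a * b for a, b in zip(x, y))


def norm(x):
    """Euclidean length of a vector."""
    return math.sqrt(dot(x, x))


def normalize(x):
    """Scale a non-zero vector to unit length."""
    n = norm(x)
    return [a / n for a in x]


def euclidean(x, y):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def cosine(x, y):
    """Cosine of the angle between two non-zero vectors."""
    return dot(x, y) / (norm(x) * norm(y))


# Hypothetical term-count vectors over a four-term vocabulary.
long_doc = [10, 10, 0, 0]   # long document using terms 0 and 1
short_doc = [1, 1, 0, 0]    # short document with the same term profile
other_doc = [0, 0, 1, 1]    # short document sharing no terms with the others

# Without normalization, document length dominates the distance.
raw_same_topic = euclidean(short_doc, long_doc)
raw_other_topic = euclidean(short_doc, other_doc)

# After normalization, only the direction of the vectors matters.
norm_same_topic = euclidean(normalize(short_doc), normalize(long_doc))
norm_other_topic = euclidean(normalize(short_doc), normalize(other_doc))

# Gap between cos(phi) and 1 - d^2 / 2 for each pair of documents.
pairs = [(long_doc, short_doc), (short_doc, other_doc), (long_doc, other_doc)]
identity_gaps = [
    abs(cosine(x, y) - (1 - 0.5 * euclidean(normalize(x), normalize(y)) ** 2))
    for x, y in pairs
]

print("raw distances:", raw_same_topic, raw_other_topic)
print("normalized distances:", norm_same_topic, norm_other_topic)
print("largest identity gap:", max(identity_gaps))
```

In this toy setup the unnormalized distance places the short document closer
to the document it shares no terms with, while the normalized distance
reverses that ordering, which is the effect described above.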
2.3 Linguistic Preprocessing

Often text mining methods may be applied without further preprocessing.
Sometimes, however, additional linguistic preprocessing (cf. Manning & Schütze
(2001)) may be used to enhance the available information about terms. For this,
the following approaches are frequently applied:
Part-of-speech tagging (POS) determines the part-of-speech tag, e.g. noun, verb, adjective...