

A Brief Survey of Text Mining (LDV-FORUM)

[...] smaller distance between documents that share fewer words than between documents that have more words in common and should therefore be considered more similar. Note that for normalized vectors the scalar product behaves much like the Euclidean distance $d$, since for two vectors $x$ and $y$

$$\cos\varphi = \frac{x \cdot y}{|x| \cdot |y|} = 1 - \frac{1}{2}\, d^2\!\left(\frac{x}{|x|}, \frac{y}{|y|}\right).$$

For a more detailed discussion of the vector space model and weighting schemes see, e.g., Baeza-Yates & Ribeiro-Neto (1999); Greiff (1998); Salton & Buckley (1988); Salton et al. (1975).

2.3 Linguistic Preprocessing

Often text mining methods may be applied without further preprocessing. Sometimes, however, additional linguistic preprocessing (cf. Manning & Schütze (2001)) may be used to enhance the available information about terms. For this, the following approaches are frequently applied:

Part-of-speech tagging (POS) determines the part-of-speech tag, e.g. noun, verb, adjective...
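Returning to the vector space discussion: the stated relation between the cosine of two vectors and the squared Euclidean distance of their normalized versions can be checked numerically. The following is a minimal sketch using only the Python standard library. The helper names (`normalize`, `cosine`, `euclidean`) and the two term-weight vectors are hypothetical and chosen purely for illustration; they do not come from the survey.

```python
import math


def normalize(v):
    """Scale v to unit Euclidean length."""
    norm = math.sqrt(sum(c * c for c in v))
    if norm == 0:
        raise ValueError("cannot normalize the zero vector")
    return [c / norm for c in v]


def cosine(x, y):
    """Cosine of the angle between x and y: (x . y) / (|x| |y|)."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)


def euclidean(x, y):
    """Euclidean distance between x and y."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


# Hypothetical term-weight vectors over a shared four-term vocabulary.
x = [3.0, 0.0, 1.0, 2.0]
y = [1.0, 1.0, 0.0, 4.0]

cos_phi = cosine(x, y)

# Distance between the length-normalized vectors x/|x| and y/|y|.
d = euclidean(normalize(x), normalize(y))

# Right-hand side of the identity: 1 - d^2 / 2.
identity_rhs = 1.0 - 0.5 * d ** 2

print(cos_phi, identity_rhs)
```

The two printed values should agree up to floating-point rounding. For unit-length vectors, ranking documents by decreasing cosine similarity is therefore equivalent to ranking them by increasing Euclidean distance.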