This preview shows page 1. Sign up to view the full content.
Unformatted text preview: tead of building explicit models for the different classes we may select
documents from the training set which are “similar” to the target document.
The class of the target document subsequently may be inferred from the class
labels of these similar documents. If k similar documents are considered, the
approach is also known as k-nearest neighbor classiﬁcation.
There is a large number of similarity measures used in text mining. One
possibility is simply to count the number of common words in two documents.
Obviously this has to be normalized to account for documents of different
lengths. On the other hand words have greatly varying information content. A
standard way to measure the latter is the cosine similarity as deﬁned in (3). Note
that only a small fraction of all possible terms appear in this sums as w(d, t) = 0
if the term t is not present in the document d. Other similarity measures are
discussed in Baeza-Yates & Ribeiro-Neto (1999).
For deciding whether document di belongs to class Lm , the similarity S(di , d j )
to all document...
View Full Document
This note was uploaded on 06/19/2011 for the course IT 2258 taught by Professor Aymenali during the Summer '11 term at Abu Dhabi University.
- Summer '11