use index term selection methods as discussed above in Sect. 3.1.1.

32 LDV-FORUM A Brief Survey of Text Mining

Although this model is unrealistic due to its restrictive independence assumption,
it yields surprisingly good classifications (Dumais et al. 1998; Joachims 1998).
It may be extended in several directions (Sebastiani 2002).
Because manually labeling the documents of the training set is costly, some
authors also use unlabeled documents for training. Assume that a small training
set has established that word t_i is highly correlated with class L_c. If the
unlabeled documents show that word t_j is highly correlated with t_i, then t_j
is also a good predictor for class L_c. In this way unlabeled documents may
improve classification performance. Nigam et al. (2000) combined
Expectation-Maximization (EM; Dempster et al. 1977) with a naïve Bayes
classifier and were able to reduce the classification error by up to 30%.
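
The idea of combining EM with naïve Bayes can be illustrated with a minimal sketch. It is not the exact procedure of Nigam et al. (2000), only a simplified version under stated assumptions: a multinomial naïve Bayes model with Laplace smoothing, documents given as token lists, and a fixed number of EM iterations. All function names and the toy data are hypothetical and chosen for illustration only.

```python
import math
from collections import Counter


def train_nb(docs, weights, vocab, n_classes, alpha=1.0):
    """Fit a multinomial naive Bayes model from (possibly soft) class weights.

    docs    : list of token lists
    weights : per document, a list of class membership probabilities
    alpha   : Laplace smoothing constant (an assumption of this sketch)
    """
    prior = [alpha] * n_classes
    counts = [Counter() for _ in range(n_classes)]
    totals = [0.0] * n_classes
    for tokens, w in zip(docs, weights):
        for c in range(n_classes):
            prior[c] += w[c]
            totals[c] += w[c] * len(tokens)
            for t in tokens:
                counts[c][t] += w[c]
    z = sum(prior)
    log_prior = [math.log(p / z) for p in prior]
    log_lik = [
        {t: math.log((counts[c][t] + alpha) / (totals[c] + alpha * len(vocab)))
         for t in vocab}
        for c in range(n_classes)
    ]
    return log_prior, log_lik


def posterior(tokens, model, n_classes):
    """Return P(class | document) under the naive independence assumption."""
    log_prior, log_lik = model
    scores = [
        log_prior[c] + sum(log_lik[c][t] for t in tokens if t in log_lik[c])
        for c in range(n_classes)
    ]
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]


def em_naive_bayes(labeled, labels, unlabeled, n_classes, iterations=10):
    """Train on labeled documents, then refine with unlabeled ones via EM."""
    vocab = sorted({t for d in labeled + unlabeled for t in d})
    hard = [[1.0 if c == y else 0.0 for c in range(n_classes)] for y in labels]
    model = train_nb(labeled, hard, vocab, n_classes)
    for _ in range(iterations):
        # E-step: estimate soft class memberships for the unlabeled documents
        soft = [posterior(d, model, n_classes) for d in unlabeled]
        # M-step: re-estimate the parameters from labeled and unlabeled data
        model = train_nb(labeled + unlabeled, hard + soft, vocab, n_classes)
    return model


# Toy data: "goal" is labeled as class 0 and "vote" as class 1. The word
# "striker" never occurs in a labeled document but co-occurs with "goal".
labeled = [["goal"], ["vote"]]
labels = [0, 1]
unlabeled = [["goal", "striker"], ["goal", "striker"],
             ["vote", "senate"], ["vote", "senate"]]

supervised = em_naive_bayes(labeled, labels, [], 2, iterations=0)
semi = em_naive_bayes(labeled, labels, unlabeled, 2)
print(posterior(["striker"], supervised, 2))
print(posterior(["striker"], semi, 2))
```

In this toy setting "striker" plays the role of t_j and "goal" the role of t_i: the purely supervised model cannot tell the classes apart for a document containing only "striker", whereas the model refined on unlabeled documents assigns it to the class of "goal".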

3.1.3 Nearest Neighbor Classifier
Ins...
This note was uploaded on 06/19/2011 for the course IT 2258 taught by Professor Aymenali during the Summer '11 term at Abu Dhabi University.