Unformatted text preview: lassiﬁcation problem at
hand. One commonly used ranking score is the information gain which for a
term t j is deﬁned as
IG(t_j) = \sum_{c=1}^{2} p(L_c) \log_2 \frac{1}{p(L_c)} - \sum_{m=0}^{1} p(t_j{=}m) \sum_{c=1}^{2} p(L_c \mid t_j{=}m) \log_2 \frac{1}{p(L_c \mid t_j{=}m)}    (8)

Here p(L_c) is the fraction of training documents belonging to class L_c (for the two classes L_1 and L_2), p(t_j=1) and p(t_j=0) are the fractions of documents that do and do not contain term t_j, and p(L_c|t_j=m) is the conditional probability of class L_c given that term t_j is contained in the document (m=1) or missing (m=0). IG(t_j) thus measures how useful t_j is for predicting the class from an information-theoretic point of view. We may determine IG(t_j) for all terms and remove those with very low information gain from the dictionary.
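The selection step described above can be sketched in a few lines of Python. This is a minimal illustration of Eq. (8), not code from the survey: the function name, the representation of documents as sets of terms, and the two-class label list are all assumptions made here for clarity.

```python
import math


def information_gain(docs, labels, term):
    """Information gain IG(t_j) of a binary term indicator for a
    two-class labeling, per Eq. (8): the class entropy minus the
    expected conditional class entropy given term presence/absence.
    `docs` is a list of term sets, `labels` a parallel list of class
    names (hypothetical representation, assumed for this sketch)."""
    n = len(docs)
    classes = sorted(set(labels))

    def entropy(probs):
        # sum_c p_c * log2(1 / p_c), skipping zero probabilities
        return sum(p * math.log2(1.0 / p) for p in probs if p > 0)

    # p(L_c): fraction of training documents in each class
    p_class = [labels.count(c) / n for c in classes]
    ig = entropy(p_class)

    # subtract p(t_j = m) * H(L | t_j = m) for m in {0, 1}
    for m in (0, 1):
        subset = [lab for d, lab in zip(docs, labels)
                  if (term in d) == bool(m)]
        if not subset:
            continue
        p_m = len(subset) / n
        p_cond = [subset.count(c) / len(subset) for c in classes]
        ig -= p_m * entropy(p_cond)
    return ig
```

A term that appears in exactly the documents of one class attains the maximal gain (the full class entropy), while a term distributed independently of the classes scores near zero; filtering the dictionary then amounts to keeping only terms above some gain threshold.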
In the following sections we describe the most frequently used data mining
methods for text categorization.

Band 20 – 2005 31 Hotho, Nürnberger, and Paaß

3.1.2 Naïve Bayes Classifier

Probabilistic classifiers start with the assumption that the words of a document d_i have been generated by a probabilistic mechanism.