Using Maximum Entropy for Text Classification

Kamal Nigam (email@example.com)
John Lafferty (firstname.lastname@example.org)
Andrew McCallum (email@example.com)

School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213
Just Research, 4616 Henry Street, Pittsburgh, PA 15213

Abstract

This paper proposes the use of maximum entropy techniques for text classification. Maximum entropy is a probability distribution estimation technique widely used for a variety of natural language tasks, such as language modeling, part-of-speech tagging, and text segmentation. The underlying principle of maximum entropy is that without external knowledge, one should prefer distributions that are uniform. Constraints on the distribution, derived from labeled training data, inform the technique where to be minimally non-uniform. The maximum entropy formulation has a unique solution which can be found by the improved iterative scaling algorithm. In this paper, maximum entropy is used for text classification by estimating the conditional distribution of the class variable given the document. In experiments on several text datasets we compare accuracy to naive Bayes and show that maximum entropy is sometimes significantly better, but also sometimes worse. Much future work remains, but the results indicate that maximum entropy is a promising technique for text classification.

1 Introduction

A variety of supervised learning algorithms have demonstrated reasonable performance for text classification; a non-exhaustive list includes naive Bayes [Lewis, 1998; McCallum and Nigam, 1998; Sahami, 1996], k-nearest neighbor [Yang, 1999], support vector machines [Joachims, 1998; Dumais et al., 1998], boosting [Schapire and Singer, 1999] and rule learning algorithms [Cohen and Singer, 1996; Slattery and Craven, 1998].
Among these, however, no single technique has proven to consistently outperform the others across many domains.

This paper explores the use of maximum entropy for text classification as an alternative to previously used text classification algorithms. Maximum entropy has already been widely used for a variety of natural language tasks, including language modeling [Chen and Rosenfeld, 1999; Rosenfeld, 1994], text segmentation [Beeferman et al., 1999], part-of-speech tagging [Ratnaparkhi, 1996], and prepositional phrase attachment [Ratnaparkhi et al., 1994]. Maximum entropy has been shown to be a viable and competitive algorithm in these domains.

Maximum entropy is a general technique for estimating probability distributions from data. The overriding principle in maximum entropy is that when nothing is known, the distribution should be as uniform as possible, that is, have maximal entropy. Labeled training data is used to derive a set of constraints for the model that characterize the class-specific expectations for the distribution. Constraints are represented as expected values of features, where a feature is any real-valued function of an example.
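The conditional maxent model described above can be sketched in a few lines. The paper fits the feature weights with improved iterative scaling; as a simpler stand-in, the sketch below uses plain gradient ascent on the conditional log-likelihood, which optimizes the same model. The word-count features and the toy corpus are illustrative assumptions, not data from the paper.

```python
import math

def scores(words, weights, classes):
    """P(c|d) proportional to exp(sum of weights for (word, class) pairs in d)."""
    s = {c: sum(weights.get((w, c), 0.0) for w in words) for c in classes}
    z = sum(math.exp(v) for v in s.values())
    return {c: math.exp(v) / z for c, v in s.items()}

def train(docs, classes, epochs=200, lr=0.1):
    """Gradient ascent on conditional log-likelihood (stand-in for IIS).

    The gradient for each (word, class) weight is the observed feature
    count minus the model's expected count under P(c|d) -- the maxent
    constraint that expected feature values match the training data.
    """
    weights = {}
    for _ in range(epochs):
        for words, label in docs:
            p = scores(words, weights, classes)
            for w in words:
                for c in classes:
                    observed = 1.0 if c == label else 0.0
                    grad = observed - p[c]
                    weights[(w, c)] = weights.get((w, c), 0.0) + lr * grad
    return weights

# Illustrative toy corpus (bag-of-words documents with class labels).
docs = [
    (["ball", "game", "score"], "sports"),
    (["game", "team", "win"], "sports"),
    (["election", "vote", "party"], "politics"),
    (["party", "vote", "win"], "politics"),
]
classes = ["sports", "politics"]
weights = train(docs, classes)
print(scores(["game", "score"], weights, classes))
```

With no constraints (all weights zero) the model assigns each class probability 1/2, the uniform distribution; the training constraints pull it away from uniform only as far as the labeled data demands.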
This note was uploaded on 10/18/2011 for the course CS 479 taught by Professor Eric Ringger during the Fall '11 term at BYU.