NigamLaffertyMcCallum-maxent-ijcaiws99

Using Maximum Entropy for Text Classification

Kamal Nigam (knigam@cs.cmu.edu), John Lafferty (lafferty@cs.cmu.edu), Andrew McCallum (mccallum@justresearch.com)
School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213
Just Research, 4616 Henry Street, Pittsburgh, PA 15213

Abstract

This paper proposes the use of maximum entropy techniques for text classification. Maximum entropy is a probability distribution estimation technique widely used for a variety of natural language tasks, such as language modeling, part-of-speech tagging, and text segmentation. The underlying principle of maximum entropy is that without external knowledge, one should prefer distributions that are uniform. Constraints on the distribution, derived from labeled training data, inform the technique where to be minimally non-uniform. The maximum entropy formulation has a unique solution which can be found by the improved iterative scaling algorithm. In this paper, maximum entropy is used for text classification by estimating the conditional distribution of the class variable given the document. In experiments on several text datasets we compare accuracy to naive Bayes and show that maximum entropy is sometimes significantly better, but also sometimes worse. Much future work remains, but the results indicate that maximum entropy is a promising technique for text classification.

1 Introduction

A variety of techniques for supervised learning algorithms have demonstrated reasonable performance for text classification; a non-exhaustive list includes naive Bayes [Lewis, 1998; McCallum and Nigam, 1998; Sahami, 1996], k-nearest neighbor [Yang, 1999], support vector machines [Joachims, 1998; Dumais et al., 1998], boosting [Schapire and Singer, 1999] and rule learning algorithms [Cohen and Singer, 1996; Slattery and Craven, 1998]. Among these, however, no single technique has proven to consistently outperform the others across many domains.

This paper explores the use of maximum entropy for text classification as an alternative to previously used text classification algorithms. Maximum entropy has already been widely used for a variety of natural language tasks, including language modeling [Chen and Rosenfeld, 1999; Rosenfeld, 1994], text segmentation [Beeferman et al., 1999], part-of-speech tagging [Ratnaparkhi, 1996], and prepositional phrase attachment [Ratnaparkhi et al., 1994]. Maximum entropy has been shown to be a viable and competitive algorithm in these domains.

Maximum entropy is a general technique for estimating probability distributions from data. The overriding principle in maximum entropy is that when nothing is known, the distribution should be as uniform as possible, that is, have maximal entropy. Labeled training data is used to derive a set of constraints for the model that characterize the class-specific expectations for the distribution. Constraints are represented as expected values of features, any real-valued function of an example...
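To make the constraint idea concrete, here is a minimal Python sketch of a conditional maximum entropy classifier of the kind described above. It assumes simple word-count features of the form f(w,c)(d, c') = count of w in d if c' = c, and 0 otherwise, which is one common choice rather than necessarily the authors' exact feature set; all function and variable names are illustrative only. Training (for example by improved iterative scaling) would adjust the weights until the model expectations returned below match the empirical ones; the update rule itself is not shown.

import math
from collections import defaultdict

def p_class_given_doc(doc_counts, weights, classes):
    # Conditional maxent distribution P(c | d) for one document.
    # doc_counts: dict mapping word -> count in the document
    # weights: dict mapping (word, class) -> real-valued weight (lambda)
    scores = {}
    for c in classes:
        scores[c] = sum(weights.get((w, c), 0.0) * n for w, n in doc_counts.items())
    z = sum(math.exp(s) for s in scores.values())          # normalizer Z(d)
    return {c: math.exp(s) / z for c, s in scores.items()}

def feature_expectations(docs, labels, weights, classes):
    # Empirical vs. model expectations of each (word, class) feature.
    # Maximum entropy training chooses weights so that these two match
    # for every feature; that matching is exactly the "constraint" the
    # abstract refers to.
    empirical = defaultdict(float)
    model = defaultdict(float)
    for doc_counts, label in zip(docs, labels):
        posterior = p_class_given_doc(doc_counts, weights, classes)
        for w, n in doc_counts.items():
            empirical[(w, label)] += n
            for c in classes:
                model[(w, c)] += posterior[c] * n
    return empirical, model

# Illustrative usage with made-up toy data:
docs = [{"ball": 2, "goal": 1}, {"vote": 3, "party": 1}]
labels = ["sports", "politics"]
classes = ["sports", "politics"]
weights = defaultdict(float)            # all lambdas zero -> P(c | d) is uniform
print(p_class_given_doc(docs[0], weights, classes))
emp, mod = feature_expectations(docs, labels, weights, classes)

With all weights at zero the predicted distribution is uniform, reflecting the maximum entropy principle that nothing should be assumed beyond what the constraints require; training moves the weights only as far as needed to make the model expectations agree with the empirical ones.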