A Comparison of Event Models for Naive Bayes Text Classification

Andrew McCallum ‡† [email protected]
Kamal Nigam † [email protected]

‡ Just Research, 4616 Henry Street, Pittsburgh, PA 15213
† School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213

Abstract

Recent approaches to text classification have used two different first-order probabilistic models for classification, both of which make the naive Bayes assumption. Some use a multi-variate Bernoulli model, that is, a Bayesian Network with no dependencies between words and binary word features (e.g. Larkey and Croft 1996; Koller and Sahami 1997). Others use a multinomial model, that is, a uni-gram language model with integer word counts (e.g. Lewis and Gale 1994; Mitchell 1997). This paper aims to clarify the confusion by describing the differences and details of these two models, and by empirically comparing their classification performance on five text corpora. We find that the multi-variate Bernoulli performs well with small vocabulary sizes, but that the multinomial usually performs even better at larger vocabulary sizes, providing on average a 27% reduction in error over the multi-variate Bernoulli model at any vocabulary size.

Introduction

Simple Bayesian classifiers have been gaining popularity lately, and have been found to perform surprisingly well (Friedman 1997; Friedman et al. 1997; Sahami 1996; Langley et al. 1992). These probabilistic approaches make strong assumptions about how the data is generated, and posit a probabilistic model that embodies these assumptions; then they use a collection of labeled training examples to estimate the parameters of the generative model. Classification on new examples is performed with Bayes' rule by selecting the class that is most likely to have generated the example.
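To make the two event models concrete, the following is a minimal sketch in Python, not the authors' implementation. It assumes a tiny hand-made corpus, add-one (Laplace) smoothing of the parameter estimates, and hypothetical helper names (`train`, `log_posteriors`, `classify`); none of these come from the text above. The multi-variate Bernoulli scorer looks only at whether each vocabulary word is present or absent, while the multinomial scorer weights each word by its integer count. Both then apply Bayes' rule and pick the highest-scoring class.

```python
import math
from collections import Counter


def train(docs, labels, vocab):
    """Estimate class priors and per-class word parameters for both event models."""
    vocab = sorted(vocab)
    classes = sorted(set(labels))
    model = {"vocab": vocab, "classes": classes, "prior": {}, "bern": {}, "multi": {}}
    for c in classes:
        class_docs = [d for d, y in zip(docs, labels) if y == c]
        model["prior"][c] = len(class_docs) / len(docs)
        # Bernoulli: smoothed fraction of class documents that contain each word.
        doc_freq = Counter(w for d in class_docs for w in set(d) if w in vocab)
        model["bern"][c] = {
            w: (1 + doc_freq[w]) / (2 + len(class_docs)) for w in vocab
        }
        # Multinomial: smoothed share of the class's word occurrences per word.
        term_freq = Counter(w for d in class_docs for w in d if w in vocab)
        total = sum(term_freq.values())
        model["multi"][c] = {
            w: (1 + term_freq[w]) / (len(vocab) + total) for w in vocab
        }
    return model


def log_posteriors(model, doc, event_model):
    """Return log P(class) + log P(doc | class), up to a class-independent constant."""
    counts = Counter(w for w in doc if w in model["vocab"])
    scores = {}
    for c in model["classes"]:
        score = math.log(model["prior"][c])
        if event_model == "bernoulli":
            # Every vocabulary word contributes, whether present or absent.
            for w in model["vocab"]:
                p = model["bern"][c][w]
                score += math.log(p if counts[w] > 0 else 1.0 - p)
        else:
            # Only observed words contribute, weighted by how often they occur.
            # The multinomial coefficient is omitted: it is the same for all classes.
            for w, n in counts.items():
                score += n * math.log(model["multi"][c][w])
        scores[c] = score
    return scores


def classify(model, doc, event_model):
    """Bayes' rule decision: choose the class with the highest log posterior."""
    scores = log_posteriors(model, doc, event_model)
    return max(scores, key=scores.get)


# Toy corpus (purely illustrative).
docs = [
    ["goal", "match", "goal"],
    ["match", "team"],
    ["stock", "market", "stock"],
    ["market", "price"],
]
labels = ["sports", "sports", "finance", "finance"]
vocab = {"goal", "match", "team", "stock", "market", "price"}

model = train(docs, labels, vocab)
for name in ("bernoulli", "multinomial"):
    print(name, classify(model, ["goal", "goal", "team"], name))
```

Working in log space avoids numerical underflow when many small probabilities are multiplied, which matters once the vocabulary is large.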
The naive Bayes classifier is the simplest of these models, in that it assumes that all attributes of the examples are independent of each other given the context of the class. This is the so-called "naive Bayes assumption." While this assumption is clearly false in most real-world tasks, naive Bayes often performs classification very well. This paradox is explained by the fact that classification estimation is only a function of the sign (in binary cases) of the function estimation; the function approximation can still be poor while classification accuracy remains high (Friedman 1997; Domingos and Pazzani 1997). Because of the independence assumption, the parameters for each attribute can be learned separately, and this greatly simplifies learning, especially when the number of attributes is large.

Document classification is just such a domain with a large number of attributes. The attributes of the examples to be classified are words, and the number of different words can be quite large indeed. While some simple document classification tasks can be accurately performed with vocabulary sizes less than one hundred, many complex tasks on real-world data from...