boostexter - Machine Learning 39(2/3:135-168 2000...

Machine Learning, 39(2/3):135-168, 2000.

BoosTexter: A Boosting-based System for Text Categorization

[email protected]
AT&T Labs, Shannon Laboratory, 180 Park Avenue, Room A279, Florham Park, NJ 07932-0971

[email protected]
AT&T Labs, Shannon Laboratory, 180 Park Avenue, Room A277, Florham Park, NJ 07932-0971

Abstract. This work focuses on algorithms which learn from examples to perform multiclass text and speech categorization tasks. Our approach is based on a new and improved family of boosting algorithms. We describe in detail an implementation, called BoosTexter, of the new boosting algorithms for text categorization tasks. We present results comparing the performance of BoosTexter and a number of other text-categorization algorithms on a variety of tasks. We conclude by describing the application of our system to automatic call-type identification from unconstrained spoken customer responses.

1. Introduction

Text categorization is the problem of classifying text documents into categories or classes. For instance, a typical problem is that of classifying news articles by topic based on their textual content. Another problem is to automatically identify the type of call requested by a customer; for instance, if the customer says, “Yes, I would like to charge this call to my Visa,” we want the system to recognize that this is a calling-card call and to process the call accordingly. (Although this is actually a speech-categorization problem, we can nevertheless apply a text-based system by passing the spoken responses through a speech recognizer.)

In this paper, we introduce the use of a machine-learning technique called boosting to the problem of text categorization. The main idea of boosting is to combine many simple and moderately inaccurate categorization rules into a single, highly accurate categorization rule. The simple rules are trained sequentially; conceptually, each rule is trained on the examples which were most difficult to classify by the preceding rules.
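To make the idea of sequentially trained simple rules concrete, here is a minimal sketch of plain two-class AdaBoost in which each simple rule tests for the presence of a single word. It is an illustration only, not the BoosTexter implementation, which uses the multiclass boosting algorithms of the companion paper. The function names (tokenize, train_adaboost, predict), the toy utterances, and the choice of three rounds are all assumptions made for this example.

```python
import math

def tokenize(text):
    """Lowercase a string and return its set of whitespace-separated words."""
    return set(text.lower().split())

def train_adaboost(docs, labels, rounds):
    """Plain binary AdaBoost over word-presence rules (illustrative sketch).

    docs   : list of word sets
    labels : list of +1 / -1
    Returns a list of (alpha, word, sign) triples. A rule predicts `sign`
    when `word` occurs in the document and `-sign` otherwise.
    """
    n = len(docs)
    weights = [1.0 / n] * n          # start with uniform example weights
    vocab = sorted(set().union(*docs))
    ensemble = []
    for _ in range(rounds):
        # Pick the rule with the smallest weighted error on the current weights.
        best = None
        for word in vocab:
            for sign in (+1, -1):
                err = sum(w for w, d, y in zip(weights, docs, labels)
                          if (sign if word in d else -sign) != y)
                if best is None or err < best[0]:
                    best = (err, word, sign)
        err, word, sign = best
        err = min(max(err, 1e-10), 1 - 1e-10)   # keep the logarithm finite
        alpha = 0.5 * math.log((1 - err) / err)
        ensemble.append((alpha, word, sign))
        # Raise the weight of misclassified examples and lower the rest,
        # so the next rule concentrates on the hard examples.
        weights = [w * math.exp(-alpha * y * (sign if word in d else -sign))
                   for w, d, y in zip(weights, docs, labels)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble

def predict(ensemble, doc):
    """Weighted vote of all rules; returns +1 or -1."""
    score = sum(alpha * (sign if word in doc else -sign)
                for alpha, word, sign in ensemble)
    return 1 if score >= 0 else -1

# Hypothetical toy data: +1 = calling-card call, -1 = any other call type.
texts = [
    "charge this call to my visa",
    "bill it to my calling card",
    "put this on my card please",
    "i want to call collect",
    "collect call to my mother",
    "person to person call please",
]
labels = [1, 1, 1, -1, -1, -1]
docs = [tokenize(t) for t in texts]

ensemble = train_adaboost(docs, labels, rounds=3)
predictions = [predict(ensemble, d) for d in docs]
print(ensemble)
print(predictions)
```

On this toy data no single word separates the two classes, so every individual rule makes at least one mistake, while the weighted vote of the three rules classifies all six training utterances correctly.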
Our approach is based on a new and improved family of boosting algorithms which we have described and analyzed in detail in a companion paper (Schapire & Singer, 1998). This new family extends and generalizes Freund and Schapire’s AdaBoost algorithm (Freund & Schapire, 1997), which has been studied extensively and which has been shown to perform well on standard machine-learning tasks (Breiman, 1998; Drucker & Cortes, 1996; Freund & Schapire, 1996, 1997; Maclin & Opitz, 1997; Margineantu & Dietterich, 1997; Quinlan, 1996; Schapire, 1997; Schapire, Freund, Bartlett, & Lee, 1998). The purpose of the current work is to describe some ways in which boosting can be applied to the problem of text categorization, and to test its performance relative to a number of other text-categorization algorithms.

Text-categorization problems are usually multiclass in the sense that there are usually more than two possible categories. Although in some applications there may be a very
large number of categories, in this work, we focus on the case in which there are a small to moderate number of categories. It is also common for text-categorization tasks to be