Improving Classification Accuracy Using Automatically Extracted Training Data

Ariel Fuxman, Microsoft Research, Mountain View, CA, USA, firstname.lastname@example.org
Anitha Kannan, Microsoft Research, Mountain View, CA, USA, email@example.com
Andrew B. Goldberg*, Univ. of Wisconsin-Madison, Madison, WI, USA, firstname.lastname@example.org
Rakesh Agrawal, Microsoft Research, Mountain View, CA, USA, email@example.com
Panayiotis Tsaparas, Microsoft Research, Mountain View, CA, USA, firstname.lastname@example.org
John Shafer, Microsoft Research, Mountain View, CA, USA, email@example.com

ABSTRACT
Classification is a core task in knowledge discovery and data mining, and there has been substantial research effort in developing sophisticated classification models. In a parallel thread, recent work from the NLP community suggests that for tasks such as natural language disambiguation even a simple algorithm can outperform a sophisticated one, if it is provided with large quantities of high quality training data. In those applications, training data occurs naturally in text corpora, and high quality training data sets running into billions of words have been reportedly used. We explore how we can apply the lessons from the NLP community to KDD tasks. Specifically, we investigate how to identify data sources that can yield training data at low cost and study whether the quantity of the automatically extracted training data can compensate for its lower quality. We carry out this investigation for the specific task of inferring whether a search query has commercial intent. We mine toolbar and click logs to extract queries from sites that are predominantly commercial (e.g., Amazon) and non-commercial (e.g., Wikipedia). We compare the accuracy obtained using such training data against manually labeled training data. Our results show that we can have large accuracy gains using automatically extracted training data at much lower cost.
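The extraction step described in the abstract can be illustrated with a minimal sketch. It assumes each log record pairs a query with a URL on which the query was observed; the record format, the seed lists, and the function names are hypothetical and are not taken from the paper, which names only Amazon and Wikipedia as example sites.

```python
from urllib.parse import urlparse

# Hypothetical seed lists: the abstract gives only these two sites as examples.
COMMERCIAL_SITES = {"amazon.com"}
NONCOMMERCIAL_SITES = {"wikipedia.org"}


def site_of(url):
    """Return the URL's host, lowercased and without a leading 'www.'."""
    host = (urlparse(url).hostname or "").lower()
    return host[4:] if host.startswith("www.") else host


def matches(host, sites):
    """True if host is one of the seed sites or a subdomain of one."""
    return any(host == s or host.endswith("." + s) for s in sites)


def label_queries(log_records):
    """Yield (query, label) for records whose URL falls on a seed site.

    Records on any other site are skipped, so the labels are noisy but
    require no manual annotation.
    """
    for query, url in log_records:
        host = site_of(url)
        if matches(host, COMMERCIAL_SITES):
            yield query, "commercial"
        elif matches(host, NONCOMMERCIAL_SITES):
            yield query, "noncommercial"


# Made-up sample records, for illustration only.
sample_log = [
    ("digital camera", "https://www.amazon.com/s?k=digital+camera"),
    ("french revolution", "https://en.wikipedia.org/wiki/French_Revolution"),
    ("weather", "https://example.com/weather"),
]
labeled = list(label_queries(sample_log))
```

Labeling by site rather than by individual query is what keeps the cost low in this sketch: a handful of seed domains can label an arbitrarily large log, at the price of some mislabeled queries.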
Categories and Subject Descriptors
H.2.8 [Database management]: Database applications - data mining

General Terms
Algorithms, Experimentation

* Work done when the author interned at Microsoft Research.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
KDD’09, June 28–July 1, 2009, Paris, France.

1. INTRODUCTION
Classification lies at the core of many knowledge discovery and data mining (KDD) applications whose success depends critically on the quality of the classifier. There has been substantial research in developing sophisticated classification models and algorithms with the goal of improving classification accuracy, and currently there is a rich body of such classifiers....