Improving Classification Accuracy Using Automatically Extracted Training Data

Ariel Fuxman (Microsoft Research, Mountain View, CA, USA) arielf@microsoft.com
Anitha Kannan (Microsoft Research, Mountain View, CA, USA) ankannan@microsoft.com
Andrew B. Goldberg* (Univ. of Wisconsin-Madison, Madison, WI, USA) goldberg@cs.wisc.edu
Rakesh Agrawal (Microsoft Research, Mountain View, CA, USA) rakesha@microsoft.com
Panayiotis Tsaparas (Microsoft Research, Mountain View, CA, USA) panats@microsoft.com
John Shafer (Microsoft Research, Mountain View, CA, USA) jshafer@microsoft.com

ABSTRACT

Classification is a core task in knowledge discovery and data mining, and there has been substantial research effort in developing sophisticated classification models. In a parallel thread, recent work from the NLP community suggests that for tasks such as natural language disambiguation even a simple algorithm can outperform a sophisticated one, if it is provided with large quantities of high quality training data. In those applications, training data occurs naturally in text corpora, and high quality training data sets running into billions of words have been reportedly used.

We explore how we can apply the lessons from the NLP community to KDD tasks. Specifically, we investigate how to identify data sources that can yield training data at low cost and study whether the quantity of the automatically extracted training data can compensate for its lower quality. We carry out this investigation for the specific task of inferring whether a search query has commercial intent. We mine toolbar and click logs to extract queries from sites that are predominantly commercial (e.g., Amazon) and non-commercial (e.g., Wikipedia). We compare the accuracy obtained using such training data against manually labeled training data. Our results show that we can have large accuracy gains using automatically extracted training data at much lower cost.

Categories and Subject Descriptors

H.2.8 [Database management]: Database applications - data mining

General Terms

Algorithms, Experimentation

* Work done when the author interned at Microsoft Research.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. KDD'09, June 28–July 1, 2009, Paris, France.

1. INTRODUCTION

Classification lies at the core of many knowledge discovery and data mining (KDD) applications whose success depends critically on the quality of the classifier. There has been substantial research in developing sophisticated classification models and algorithms with the goal of improving classification accuracy, and currently there is a rich body of such classifiers...
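To make the idea of automatically extracted training data concrete, the sketch below labels queries by whether their clicked site is on a commercial or non-commercial list and then trains a simple classifier on those weak labels. This is a minimal illustration under stated assumptions: the tab-separated click-log format, the file name, the domain lists, and the bag-of-words Naive Bayes model are all illustrative choices, not the pipeline used in the paper.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Sites used as proxies for commercial / non-commercial intent (illustrative).
COMMERCIAL_DOMAINS = {"amazon.com", "ebay.com"}
NON_COMMERCIAL_DOMAINS = {"wikipedia.org"}


def extract_training_data(click_log_path):
    """Label each query by the kind of site its click landed on.

    Assumes the log is tab-separated lines of (query, clicked_domain).
    """
    queries, labels = [], []
    with open(click_log_path, encoding="utf-8") as log:
        for line in log:
            query, domain = line.rstrip("\n").split("\t")
            if domain in COMMERCIAL_DOMAINS:
                queries.append(query)
                labels.append(1)  # commercial intent
            elif domain in NON_COMMERCIAL_DOMAINS:
                queries.append(query)
                labels.append(0)  # non-commercial intent
    return queries, labels


def train_classifier(queries, labels):
    """Train a simple bag-of-words classifier on the weakly labeled queries."""
    vectorizer = CountVectorizer(ngram_range=(1, 2))
    features = vectorizer.fit_transform(queries)
    model = MultinomialNB()
    model.fit(features, labels)
    return vectorizer, model


if __name__ == "__main__":
    # "click_log.tsv" is a hypothetical input file.
    queries, labels = extract_training_data("click_log.tsv")
    vectorizer, model = train_classifier(queries, labels)
    print(model.predict(vectorizer.transform(["cheap laptop deals"])))

The point of the sketch is that no human labeling is involved: the site a user clicks on stands in for the label, so the training set can be grown to whatever size the logs support, trading per-example quality for quantity.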