124.11.lec4

124.11.lec4 - CS 124/LINGUIST 180

1/10/09
Dan Jurafsky
Lecture 4: Text Classification; The Naïve Bayes algorithm
IP notice: most slides for today from Chris Manning, plus some from William Cohen, Chien Chin Chen, Jason Eisner, David Yarowsky, P Nakov, Marti Hearst, Barbara Rosario
Outline
  Introduction to Text Classification
    Also called "text categorization"
  Naïve Bayes text classification
Is this spam?
More Applications of Text Classification
  Authorship identification
  Age/gender identification
  Language identification
  Assigning topics such as Yahoo categories
    e.g., "finance," "sports," "news>world>asia>business"
  Genre detection
    e.g., "editorials," "movie-reviews," "news"
  Opinion/sentiment analysis on a person/product
    e.g., "like," "hate," "neutral"
Text Classification: definition
  The classifier:
    Input: a document d
    Output: a predicted class c from some fixed set of labels c1, ..., cK
  The learner:
    Input: a set of m hand-labeled documents (d1, c1), ..., (dm, cm)
    Output: a learned classifier f: d → c
  Slide from William Cohen
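The classifier/learner split in the definition above can be sketched in Python. This is a minimal illustration, not the lecture's own code: the names `Document`, `Label`, `Classifier`, and `majority_learner` are hypothetical, the "spam"/"ham" labels are made up for the example, and the learner is a deliberately trivial baseline that always predicts the most frequent training label. It only shows the shape of the two roles: the learner takes labeled pairs (d, c) and returns a function f from documents to labels.

```python
from collections import Counter
from typing import Callable, List, Tuple

# Hypothetical type aliases for illustration only.
Document = str
Label = str
Classifier = Callable[[Document], Label]


def majority_learner(training: List[Tuple[Document, Label]]) -> Classifier:
    """Learner: takes hand-labeled pairs (d, c), returns a classifier f: d -> c.

    This baseline ignores the document text and always predicts the
    most frequent label seen in training.
    """
    label_counts = Counter(label for _, label in training)
    most_common_label = label_counts.most_common(1)[0][0]

    def classify(document: Document) -> Label:
        # Classifier: a document goes in, one label from the fixed set comes out.
        return most_common_label

    return classify


# Made-up training set with two labels.
training_data = [
    ("buy cheap pills now", "spam"),
    ("meeting at noon", "ham"),
    ("win money now", "spam"),
]
f = majority_learner(training_data)
predicted = f("lunch tomorrow?")
```

A real learner would of course inspect the document; the point here is only that learning produces the function that classification later applies.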
Document Classification (figure)
  Classes: (AI) ML, Planning; (Programming) Semantics, Garb.Coll.; (HCI) Multimedia, GUI; ...
  Training data:
    ML: learning intelligence algorithm reinforcement network ...
    Planning: planning temporal reasoning plan language ...
    Semantics: programming semantics language proof ...
    Garb.Coll.: garbage collection memory optimization region ...
  Test data: "planning language proof intelligence"
  Slide from Chris Manning
Classification Methods: Hand-coded rules
  Some spam/email filters, etc.
  E.g., assign category if document contains a given boolean combination of words
  Accuracy is often very high if a rule has been carefully refined over time by a subject expert
  Building and maintaining these rules is expensive
  Slide from Chris Manning
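A hand-coded rule of the kind described above, a boolean combination of words, might look like the following sketch. The rule itself, the category name "grain", and the function names `tokens`, `grain_rule`, and `classify` are all invented for illustration; a production rule would typically be much longer and tuned by a subject expert.

```python
import re


def tokens(document):
    """Lowercase the text and return its set of word tokens."""
    return set(re.findall(r"[a-z0-9']+", document.lower()))


def grain_rule(document):
    """Hypothetical rule: (wheat OR maize) AND tonnes AND NOT recipe."""
    words = tokens(document)
    return (
        ("wheat" in words or "maize" in words)
        and "tonnes" in words
        and "recipe" not in words
    )


def classify(document):
    """Assign the category 'grain' when the rule fires, otherwise 'other'."""
    return "grain" if grain_rule(document) else "other"


label_a = classify("Bread wheat registrations, in thousands of tonnes")
label_b = classify("A maize bread recipe using two tonnes of patience")
```

The second example shows why such rules are costly to maintain: each exception (here the word "recipe") has to be anticipated and added by hand.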
Classification Methods: Machine Learning
  Supervised machine learning
    To learn a function from documents (or sentences) to labels
    Naive Bayes (simple, common method)
    Others:
      k-Nearest Neighbors (simple, powerful)
      Support-vector machines (new, more powerful)
      ... plus many other methods
  No free lunch: requires hand-classified training data
  But data can be built up (and refined) by amateurs
  Slide from Chris Manning
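As a rough illustration of supervised learning with Naive Bayes, the sketch below trains on the four toy word lists from the document classification slide and scores a test document. The specifics are assumptions not stated in this part of the lecture: it uses the multinomial variant with add-one (Laplace) smoothing, sums log probabilities to avoid underflow, and skips words never seen in training. The function names `train_naive_bayes` and `classify` are hypothetical.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(labeled_docs):
    """Count documents per class and word occurrences per class."""
    class_doc_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for text, label in labeled_docs:
        class_doc_counts[label] += 1
        for word in text.split():
            word_counts[label][word] += 1
            vocabulary.add(word)
    return class_doc_counts, word_counts, vocabulary


def classify(model, text):
    """Return the class maximizing log P(c) + sum of log P(w | c)."""
    class_doc_counts, word_counts, vocabulary = model
    total_docs = sum(class_doc_counts.values())
    best_label, best_score = None, float("-inf")
    for label in class_doc_counts:
        score = math.log(class_doc_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for word in text.split():
            if word not in vocabulary:
                continue  # assumption: ignore words unseen in training
            # Add-one smoothing so an unseen (word, class) pair is not zero.
            smoothed = (word_counts[label][word] + 1) / (total_words + len(vocabulary))
            score += math.log(smoothed)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Toy training words taken from the document classification slide.
training_data = [
    ("learning intelligence algorithm reinforcement network", "ML"),
    ("planning temporal reasoning plan language", "Planning"),
    ("programming semantics language proof", "Semantics"),
    ("garbage collection memory optimization region", "Garb.Coll."),
]
model = train_naive_bayes(training_data)
prediction = classify(model, "planning language proof intelligence")
```

With one short document per class, the prediction is driven by a handful of word overlaps, which is the "no free lunch" point: the method is only as good as the hand-classified training data behind it.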
Naïve Bayes Intuition
Representing text for
  ARGENTINE 1986/87 GRAIN/OILSEED REGISTRATIONS
  BUENOS AIRES, Feb 26
  Argentine grain board figures show crop registrations of grains, oilseeds and their products to February 11, in thousands of tonnes, showing those for future shipments month, 1986/87 total and 1985/86 total to February 12, 1986, in brackets:
    Bread wheat prev 1,655.8, Feb 872.0, March 164.6, total 2,692.4 (4,161.0).
    Maize Mar 48.0, total 48.0 (nil).
  Slide from William Cohen