124.11.lec4 - CS 124/LINGUIST 180 From Languages to...

Dan Jurafsky. Lecture 4: Text Classification; The Naïve Bayes algorithm. CS 124/LINGUIST 180: From Languages to Information. IP notice: most slides for today are from Chris Manning, plus some from William Cohen, Chien Chin Chen, Jason Eisner, David Yarowsky, P Nakov, Marti Hearst, Barbara Rosario.
Outline
  Introduction to Text Classification
    Also called "text categorization".
  Naïve Bayes text classification
Is this spam?
More Applications of Text Classification
  Authorship identification
  Age/gender identification
  Language identification
  Assigning topics such as Yahoo-categories, e.g., "finance," "sports," "news>world>asia>business"
  Genre detection, e.g., "editorials," "movie-reviews," "news"
  Opinion/sentiment analysis on a person/product, e.g., "like," "hate," "neutral"
  Labels may be domain-specific, e.g., "contains adult language" : "doesn't"
Text Classification: definition
  The classifier:
    Input: a document d
    Output: a predicted class c from some fixed set of labels c_1, ..., c_K
  The learner:
    Input: a set of m hand-labeled documents (d_1, c_1), ..., (d_m, c_m)
    Output: a learned classifier f: d → c
  Slide from William Cohen
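The classifier/learner split above can be sketched in Python. This is only an illustration of the two interfaces: the type names, the function `learn_majority_baseline`, and the toy training pairs are all hypothetical, and the majority-class baseline stands in for a real learning algorithm.

```python
from collections import Counter
from typing import Callable, List, Tuple

Document = str
Label = str
# A classifier is any function f: d -> c.
Classifier = Callable[[Document], Label]


def learn_majority_baseline(training: List[Tuple[Document, Label]]) -> Classifier:
    """Learner: takes m hand-labeled pairs (d_i, c_i) and returns a classifier.

    Assumption: this baseline ignores the document text and always predicts
    the most frequent training label. It only illustrates the interface.
    """
    if not training:
        raise ValueError("need at least one labeled document")
    label_counts = Counter(label for _, label in training)
    majority_label = label_counts.most_common(1)[0][0]

    def classify(document: Document) -> Label:
        return majority_label

    return classify


# Hypothetical toy data, invented for this sketch.
training = [
    ("win cash now", "spam"),
    ("meeting at noon", "ham"),
    ("cheap pills", "spam"),
]
f = learn_majority_baseline(training)
print(f("lunch tomorrow?"))  # prints "spam" (2 of the 3 training labels)
```

Any real learner, including the Naïve Bayes method this lecture introduces, fits the same shape: labeled pairs in, a function from documents to labels out.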
Document Classification (figure; slide from Chris Manning)
  Classes: (AI) ML, Planning; (Programming) Semantics, Garb.Coll.; (HCI) Multimedia, GUI; ...
  Training data:
    ML: learning intelligence algorithm reinforcement network...
    Planning: planning temporal reasoning plan language...
    Semantics: programming semantics language proof...
    Garb.Coll.: garbage collection memory optimization region...
  Test data: "planning language proof intelligence"
Classification Methods: Hand-coded rules
  Some spam/email filters, etc.
  E.g., assign category if document contains a given boolean combination of words
  Accuracy is often very high if a rule has been carefully refined over time by a subject expert
  Building and maintaining these rules is expensive
  Slide from Chris Manning
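A hand-coded rule of this kind might look like the sketch below. The specific rule, the word choices, and the function names are invented for illustration; a production filter would carry many such rules, each refined over time.

```python
import re


def tokens(document: str) -> set:
    """Lowercase the document and return its set of word tokens."""
    return set(re.findall(r"[a-z0-9']+", document.lower()))


def classify_by_rule(document: str) -> str:
    """Assign "spam" if the document satisfies a boolean combination of words.

    Hypothetical rule: ("free" AND "money") OR ("winner" AND NOT "seminar").
    """
    words = tokens(document)
    matches = ("free" in words and "money" in words) or (
        "winner" in words and "seminar" not in words
    )
    return "spam" if matches else "not spam"


print(classify_by_rule("Claim your FREE money today!"))                # spam
print(classify_by_rule("Best paper winner announced at the seminar"))  # not spam
```

The `NOT "seminar"` clause is the sort of exception an expert adds after seeing false positives, which is where the maintenance cost comes from.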
Classification Methods: Machine Learning
  Supervised Machine Learning
    To learn a function from documents (or sentences) to labels
    Naïve Bayes (simple, common method)
    Others
      k-Nearest Neighbors (simple, powerful)
      Support-vector machines (new, more powerful)
      … plus many other methods
    No free lunch: requires hand-classified training data
    But data can be built up (and refined) by amateurs
  Slide from Chris Manning
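To make "learn a function from documents to labels" concrete, here is a minimal sketch of one such method. It assumes the standard multinomial Naïve Bayes formulation with add-one smoothing; the visible slides do not give the formulas, so this may differ in detail from what the lecture presents. The word lists echo the document classification figure earlier in the deck, and grouping them under "AI" and "Programming" is an assumption about that flattened figure. The function names are hypothetical.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(labeled_docs):
    """Count documents per class and word occurrences per class."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in labeled_docs:
        class_counts[label] += 1
        for word in text.lower().split():
            word_counts[label][word] += 1
            vocab.add(word)
    return class_counts, word_counts, vocab


def classify(model, text):
    """Return the class maximizing log P(c) + sum of log P(w | c)."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label in class_counts:
        score = math.log(class_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            if word not in vocab:
                continue  # assumption: words never seen in training are skipped
            # Add-one smoothing keeps unseen (word, class) pairs above zero.
            score += math.log(
                (word_counts[label][word] + 1) / (total_words + len(vocab))
            )
        if score > best_score:
            best_label, best_score = label, score
    return best_label


training = [
    ("learning intelligence algorithm reinforcement network", "AI"),
    ("planning temporal reasoning plan language", "AI"),
    ("programming semantics language proof", "Programming"),
    ("garbage collection memory optimization region", "Programming"),
]
model = train_naive_bayes(training)
print(classify(model, "planning language proof intelligence"))  # prints "AI"
```

Working in log space avoids numeric underflow when many small probabilities are multiplied together.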
Naïve Bayes Intuition
Representing text for classification
  ARGENTINE 1986/87 GRAIN/OILSEED REGISTRATIONS
  BUENOS AIRES, Feb 26
  Argentine grain board figures show crop registrations of grains, oilseeds and their products to February 11, in thousands of tonnes, showing those for future shipments month, 1986/87 total and 1985/86 total to February 12, 1986, in brackets: Bread wheat prev 1,655.8, Feb 872.0, March 164.6, total 2,692.4 (4,161.0).
  Slide from William Cohen
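The preview ends before the lecture says which representation it adopts. One common choice, assumed here, is a bag of words: drop word order and keep only per-word counts. The sketch below applies that idea to the opening of the article above; the tokenizer (lowercase alphabetic runs only, so numbers are discarded) and the name `bag_of_words` are illustrative choices.

```python
import re
from collections import Counter


def bag_of_words(text: str) -> Counter:
    """Map a document to word counts, ignoring order, case, and digits."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


snippet = (
    "ARGENTINE 1986/87 GRAIN/OILSEED REGISTRATIONS BUENOS AIRES, Feb 26 "
    "Argentine grain board figures show crop registrations of grains, oilseeds"
)
bag = bag_of_words(snippet)
for word in ("argentine", "grain", "grains", "oilseed"):
    print(word, bag[word])
```

Note that "grain" and "grains" are counted separately; whether to merge such forms by stemming is a separate design decision.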