class18-textcat-nbayes


Standing queries [Ch. 13]

- The path from IR to text classification:
  - You have an information need to monitor, say: unrest in the Niger delta region
  - You want to rerun an appropriate query periodically to find new news items on this topic
  - You will be sent new documents that are found
  - I.e., it's text classification, not ranking
- Such queries are called standing queries
  - Long used by "information professionals"
  - A modern mass instantiation is Google Alerts
- Standing queries are (hand-written) text classifiers

Spam filtering: another text classification task [Ch. 13]

  From: "" <takworlld@hotmail.com>
  Subject: real estate is the only way... gem oalvgkay
  Anyone can buy real estate with no money down
  Stop paying rent TODAY!
  There is no need to spend hundreds or even thousands for similar courses
  I am 22 years old and I have already purchased 6 properties using the
  methods outlined in this truly INCREDIBLE ebook.
  Change your life NOW!
  =================================================
  Click Below to order:
  http://www.wholesaledaily.com/sales/nmd.htm
  =================================================

Text classification [Ch. 13]

- Today:
  - Introduction to text classification (also widely known as "text categorization": same thing)
  - Naive Bayes text classification
  - Including a little on probabilistic language models

Categorization/Classification [Sec. 13.1]

- Given:
  - A description of an instance, d ∈ X
    - X is the instance language or instance space
    - Issue: how to represent text documents; usually some type of high-dimensional space
  - A fixed set of classes: C = {c1, c2, ..., cJ}
- Determine:
  - The category of d: γ(d) ∈ C, where γ(d) is a classification function whose domain is X and whose range is C
  - We want to know how to build classification functions ("classifiers")

Supervised Classification [Sec. 13.1]

- Given:
  - A description of an instance, d ∈ X (X is the instance language or instance space)
  - A fixed set of classes: C = {c1, c2, ..., cJ}
  - A training set D of labeled documents, with each labeled document ⟨d, c⟩ ∈ X × C
- Determine:
  - A learning method or algorithm which will enable us to learn a classifier γ: X → C
  - For a test document d, we assign it the class γ(d) ∈ C

Document Classification [Sec. 13.1]

[Figure: classes such as ML, AI, Planning, Semantics, Garb.Coll., Multimedia, GUI, (Programming), (HCI), each with training documents (e.g., "learning intelligence algorithm reinforcement network..." for ML; "garbage collection memory optimization region..." for Garb.Coll.); the test document "planning language proof intelligence" must be assigned to one of the classes.]

(Note: in real life there is often a hierarchy, not present in the above problem statement; and also, you get papers on ML approaches to Garb. Coll.)
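The slides describe standing queries as hand-written text classifiers and later formalize a classifier as a function γ: X → C. Below is a minimal illustrative sketch of that idea in Python; the class labels and keyword lists are invented for the example and are not part of the lecture.

```python
# A minimal hand-written "standing query" classifier: gamma(d) -> class.
# The keyword rules below are illustrative assumptions, not from the lecture.

RULES = {
    "niger-delta-unrest": {"niger", "delta", "unrest", "militants"},
    "real-estate-spam": {"real", "estate", "money", "rent", "ebook"},
}

def gamma(document: str) -> str:
    """Assign a class to a document by counting keyword hits per hand-written rule."""
    tokens = set(document.lower().split())
    # Score each class by how many of its keywords appear in the document.
    scores = {label: len(keywords & tokens) for label, keywords in RULES.items()}
    best_label, best_score = max(scores.items(), key=lambda kv: kv[1])
    return best_label if best_score > 0 else "other"

print(gamma("New unrest reported in the Niger delta region"))  # niger-delta-unrest
```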
More Text Classification Examples [Ch. 13]

Many search engine functionalities use classification.

- Assigning labels to documents or web pages:
  - Labels are most often topics such as Yahoo categories: "finance," "sports," "news>world>asia>business"
  - Labels may be genres: "editorials," "movie-reviews," "news"
  - Labels may be opinion on a person/product: "like", "hate", "neutral"
  - Labels may be domain-specific:
    - "interesting-to-me" : "not-interesting-to-me"
    - "contains adult language" : "doesn't"
    - language identification: English, French, Chinese, ...
    - search vertical: about Linux versus not
    - "link spam" : "not link spam"

Classification Methods (1) [Ch. 13]

- Manual classification
  - Used by the original Yahoo! Directory, Looksmart, about.com, ODP, PubMed
  - Very accurate when the job is done by experts
  - Consistent when the problem size and team is small
  - Difficult and expensive to scale
    - Means we need automatic classification methods for big problems

Classification Methods (2) [Ch. 13]

- Automatic document classification
- Hand-coded rule-based systems
  - One technique used by the CS dept's spam filter, Reuters, CIA, etc.
  - It's what Google Alerts is doing
    - Widely deployed in government and enterprise
  - Companies provide an "IDE" for writing such rules
  - E.g., assign category if document contains a given boolean combination of words
  - Standing queries: commercial systems have complex query languages (everything in IR query languages + score accumulators)
  - Accuracy is often very high if a rule has been carefully refined over time by a subject expert
  - Building and maintaining these rules is expensive

A Verity topic: a complex classification rule [Ch. 13]

[Figure: a Verity topic definition, a hand-built, hand-weighted rule; the rule itself is not preserved in this text.]

- Note: maintenance issues (author, etc.); hand-weighting of terms
- [Verity was bought by Autonomy.]

Classification Methods (3) [Ch. 13]

- Supervised learning of a document-label assignment function
  - Many systems partly rely on machine learning (Autonomy, Microsoft, Enkata, Yahoo!, ...)
    - k-Nearest Neighbors (simple, powerful)
    - Naive Bayes (simple, common method)
    - Support-vector machines (new, more powerful)
    - ... plus many other methods
  - No free lunch: requires hand-classified training data
  - But data can be built up (and refined) by amateurs
- Many commercial systems use a mixture of methods

Probabilistic relevance feedback [Sec. 9.1.2]

- Rather than reweighting in a vector space...
- If the user has told us some relevant and some irrelevant documents, then we can proceed to build a probabilistic classifier, such as the Naive Bayes model we will look at today:
  - P(tk|R) = |Drk| / |Dr|
  - P(tk|NR) = |Dnrk| / |Dnr|
  - tk is a term; Dr is the set of known relevant documents; Drk is the subset that contain tk; Dnr is the set of known irrelevant documents; Dnrk is the subset that contain tk.

Recall a few probability basics

- For events a and b:
  - P(a, b) = P(a ∩ b) = P(a | b) P(b) = P(b | a) P(a)
- Bayes' Rule: P(a | b) = P(b | a) P(a) / P(b), where P(a) is the prior and P(a | b) is the posterior
- Odds: O(a) = P(a) / P(not a) = P(a) / (1 − P(a))

Bayesian Methods [Sec. 13.2]

- Our focus this lecture
- Learning and classification methods based on probability theory
- Bayes' theorem plays a critical role in probabilistic learning and classification
- Builds a generative model that approximates how data is produced
- Uses prior probability of each category given no information about an item
- Categorization produces a posterior probability distribution over the possible categories given a description of an item
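As a quick illustration of the prior/posterior distinction (a minimal sketch; the class names and numbers are invented for this example, not taken from the lecture):

```python
# Tiny worked example of Bayes' rule for two classes.
prior = {"spam": 0.3, "ham": 0.7}            # P(c): class frequency before seeing the document
likelihood = {"spam": 0.004, "ham": 0.0005}  # P(d|c): how likely each class is to generate d

# Posterior P(c|d) = P(d|c) P(c) / P(d), where P(d) = sum over c of P(d|c) P(c).
evidence = sum(likelihood[c] * prior[c] for c in prior)
posterior = {c: likelihood[c] * prior[c] / evidence for c in prior}
print(posterior)  # spam becomes the more probable class despite its smaller prior
```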
Bayes' Rule for text classification [Sec. 13.2]

- For a document d and a class c:
  P(c | d) = P(d | c) P(c) / P(d)

Naive Bayes Classifiers [Sec. 13.2]

- Task: classify a new instance d, described by a tuple of attribute values d = ⟨x1, x2, ..., xn⟩, into one of the classes cj ∈ C:
  c_MAP = argmax_{c ∈ C} P(c | x1, ..., xn) = argmax_{c ∈ C} P(x1, ..., xn | c) P(c)
- MAP is "maximum a posteriori" = most likely class

Naive Bayes Classifier: Naive Bayes Assumption [Sec. 13.2]

- P(cj)
  - Can be estimated from the frequency of classes in the training examples
- P(x1, x2, ..., xn | cj)
  - O(|X|^n · |C|) parameters
  - Could only be estimated if a very, very large number of training examples was available
- Naive Bayes Conditional Independence Assumption:
  - Assume that the probability of observing the conjunction of attributes is equal to the product of the individual probabilities P(xi | cj):
    P(x1, x2, ..., xn | cj) = ∏_i P(xi | cj)

The Naive Bayes Classifier [Sec. 13.3]

- Example: class Flu with binary features X1 = runny nose, X2 = sinus, X3 = cough, X4 = fever, X5 = muscle-ache
- Conditional Independence Assumption: features detect term presence and are independent of each other given the class:
  P(X1, ..., X5 | C) = P(X1 | C) · P(X2 | C) · ... · P(X5 | C)
- This model is appropriate for binary variables
  - Multivariate Bernoulli model

Learning the Model [Sec. 13.3]

[Figure: graphical model with class C generating features X1 ... X6.]

- First attempt: maximum likelihood estimates, i.e., simply use the frequencies in the data:
  P̂(cj) = N(C = cj) / N
  P̂(xi | cj) = N(Xi = xi, C = cj) / N(C = cj)

Problem with Maximum Likelihood [Sec. 13.3]

- What if we have seen no training documents with the word muscle-ache and classified in the topic Flu? Then the estimate P̂(muscle-ache | Flu) is zero.
- Zero probabilities cannot be conditioned away, no matter the other evidence: a single zero factor forces the whole product for that class to zero.

Smoothing to Avoid Overfitting [Sec. 13.3]

- Add-one (Laplace) smoothing:
  P̂(xi,k | cj) = (N(Xi = xi,k, C = cj) + 1) / (N(C = cj) + k), where k is the number of values of Xi
- Somewhat more subtle version:
  P̂(xi,k | cj) = (N(Xi = xi,k, C = cj) + m · p_{i,k}) / (N(C = cj) + m), where p_{i,k} is the overall fraction in the data where Xi = xi,k and m controls the extent of "smoothing"

Stochastic Language Models [Sec. 13.2.1]

- Model the probability of generating strings (each word in turn) in a language (commonly all strings over an alphabet Σ). E.g., a unigram model:

  Model M: the 0.2, a 0.1, man 0.01, woman 0.01, said 0.03, likes 0.02, ...

  For s = "the man likes the woman":
  P(s | M) = 0.2 × 0.01 × 0.02 × 0.2 × 0.01 = 0.00000008 (multiply the per-word probabilities)

Stochastic Language Models (contd.) [Sec. 13.2.1]

- Model the probability of generating any string.

[Figure: two unigram models M1 and M2 with different probabilities for words such as "the", "class", "sayst", "pleaseth", "yon", "maiden", "woman", scoring the string "the class pleaseth yon maiden"; the conclusion is P(s | M2) > P(s | M1).]

Unigram and higher-order models [Sec. 13.2.1]

- Full decomposition: P(t1 t2 t3 t4) = P(t1) P(t2 | t1) P(t3 | t1 t2) P(t4 | t1 t2 t3)
- Unigram language models: P(t1 t2 t3 t4) = P(t1) P(t2) P(t3) P(t4). Easy. Effective!
- Bigram (generally, n-gram) language models: P(t1 t2 t3 t4) = P(t1) P(t2 | t1) P(t3 | t2) P(t4 | t3)
- Other language models:
  - Grammar-based models (PCFGs), etc.
  - Probably not the first thing to try in IR
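A minimal sketch of the unigram calculation above, using the toy probabilities from the slide (a real model would estimate these from a corpus):

```python
from functools import reduce

# Toy unigram model from the slide: P(word | M).
model_M = {"the": 0.2, "a": 0.1, "man": 0.01, "woman": 0.01, "said": 0.03, "likes": 0.02}

def unigram_prob(sentence: str, model: dict) -> float:
    """P(s | M) under a unigram model: multiply the per-word probabilities."""
    return reduce(lambda p, w: p * model[w], sentence.split(), 1.0)

print(unigram_prob("the man likes the woman", model_M))  # ~8e-08, as on the slide
```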
Naive Bayes via a class-conditional language model = multinomial NB [Sec. 13.2]

[Figure: class node C generating the word sequence w1 w2 w3 w4 w5 w6.]

- Effectively, the probability of each class is computed as a class-specific unigram language model

Using Multinomial Naive Bayes Classifiers to Classify Text: Basic method [Sec. 13.2]

- Attributes are text positions, values are words:
  c_NB = argmax_{cj ∈ C} P(cj) ∏_i P(xi = wordi | cj)
- Still too many possibilities
- Assume that classification is independent of the positions of the words
  - Use the same parameters for each position
  - Result is a bag-of-words model (over tokens, not types)

Naive Bayes: Learning [Sec. 13.2]

- From the training corpus, extract Vocabulary
- Calculate the required P(cj) and P(xk | cj) terms
  - For each cj in C do
    - docsj ← subset of documents for which the target class is cj
    - P(cj) ← |docsj| / (total number of documents)
    - Textj ← single document containing all docsj
    - n ← total number of word tokens in Textj
    - for each word xk in Vocabulary
      - nk ← number of occurrences of xk in Textj
      - P(xk | cj) ← (nk + 1) / (n + |Vocabulary|)

Naive Bayes: Classifying [Sec. 13.2]

- positions ← all word positions in the current document which contain tokens found in Vocabulary
- Return c_NB, where
  c_NB = argmax_{cj ∈ C} P(cj) ∏_{i ∈ positions} P(xi | cj)

Naive Bayes: Time Complexity [Sec. 13.2]

- Training time: O(|D| Lave + |C||V|), where Lave is the average length of a document in D
  - Assumes all counts are pre-computed in O(|D| Lave) time during one pass through all of the data
  - Generally just O(|D| Lave) since usually |C||V| < |D| Lave. Why?
- Test time: O(|C| Lt), where Lt is the average length of a test document
- Very efficient overall, linearly proportional to the time needed to just read in all the data

Underflow Prevention: using logs [Sec. 13.2]

- Multiplying lots of probabilities, which are between 0 and 1 by definition, can result in floating-point underflow.
- Since log(xy) = log(x) + log(y), it is better to perform all computations by summing logs of probabilities rather than multiplying probabilities:
  c_NB = argmax_{cj ∈ C} [ log P(cj) + Σ_{i ∈ positions} log P(xi | cj) ]
- The class with the highest final un-normalized log probability score is still the most probable.
- Note that the model is now just a max of a sum of weights...

Naive Bayes Classifier

- Simple interpretation: each conditional parameter log P(xi | cj) is a weight that indicates how good an indicator xi is for cj.
- The prior log P(cj) is a weight that indicates the relative frequency of cj.
- The sum is then a measure of how much evidence there is for the document being in the class.
- We select the class with the most evidence for it.
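A compact, runnable sketch of the learning and classification procedure above (add-one smoothing, log-space scoring). This is my own sketch, not the slides' code; the tiny training set is loosely modeled on the worked example in IIR §13.2.

```python
import math
from collections import Counter, defaultdict

def train_multinomial_nb(docs):
    """docs: list of (text, class) pairs. Returns priors, conditional probs, vocabulary."""
    vocab = {w for text, _ in docs for w in text.split()}
    class_docs = defaultdict(list)
    for text, c in docs:
        class_docs[c].append(text)
    prior, cond_prob = {}, {}
    for c, texts in class_docs.items():
        prior[c] = len(texts) / len(docs)                      # P(c)
        counts = Counter(w for t in texts for w in t.split())  # counts in the class "mega-document"
        n = sum(counts.values())
        # Add-one (Laplace) smoothing: P(w|c) = (count + 1) / (n + |V|)
        cond_prob[c] = {w: (counts[w] + 1) / (n + len(vocab)) for w in vocab}
    return prior, cond_prob, vocab

def classify_nb(text, prior, cond_prob, vocab):
    """argmax_c  log P(c) + sum_i log P(x_i | c), summing only over in-vocabulary tokens."""
    tokens = [w for w in text.split() if w in vocab]
    scores = {c: math.log(prior[c]) + sum(math.log(cond_prob[c][w]) for w in tokens)
              for c in prior}
    return max(scores, key=scores.get)

# Toy data for illustration:
docs = [("chinese beijing chinese", "china"),
        ("chinese chinese shanghai", "china"),
        ("chinese macao", "china"),
        ("tokyo japan chinese", "japan")]
prior, cond_prob, vocab = train_multinomial_nb(docs)
print(classify_nb("chinese chinese chinese tokyo japan", prior, cond_prob, vocab))  # china
```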
Two Naive Bayes Models

- Model 1: Multivariate Bernoulli
  - One feature Xw for each word in the dictionary
  - Xw = true in document d if w appears in d
  - Naive Bayes assumption:
    - Given the document's topic, appearance of one word in the document tells us nothing about chances that another word appears
  - This is the model used in the binary independence model in classic probabilistic relevance feedback on hand-classified data (Maron in IR was a very early user of NB)

Two Models (contd.)

- Model 2: Multinomial = class-conditional unigram
  - One feature Xi for each word position in the document
    - The feature's values are all words in the dictionary
  - Value of Xi is the word in position i
  - Naive Bayes assumption:
    - Given the document's topic, the word in one position in the document tells us nothing about words in other positions
  - Second assumption:
    - Word appearance does not depend on position: P(Xi = w | c) = P(Xj = w | c) for all positions i, j, word w, and class c
  - Just have one multinomial feature predicting all words

Parameter estimation

- Multivariate Bernoulli model:
  P̂(Xw = true | cj) = fraction of documents of topic cj in which word w appears
- Multinomial model:
  P̂(Xi = w | cj) = fraction of times word w appears among all words in documents of topic cj
  - Can create a mega-document for topic j by concatenating all documents in this topic
  - Use frequency of w in the mega-document

Classification

- Multinomial vs Multivariate Bernoulli?
- The multinomial model is almost always more effective in text applications!
  - See results figures later
- See IIR sections 13.2 and 13.3 for worked examples with each model

Feature Selection: Why? [Sec. 13.5]

- Text collections have a large number of features
  - 10,000 – 1,000,000 unique words ... and more
- May make using a particular classifier feasible
  - Some classifiers can't deal with hundreds of thousands of features
- Reduces training time
  - Training time for some methods is quadratic or worse in the number of features
- Can improve generalization (performance)
  - Eliminates noise features
  - Avoids overfitting

Feature selection: how? [Sec. 13.5]

- Two ideas:
  - Hypothesis-testing statistics:
    - Are we confident that the value of one categorical variable is associated with the value of another?
    - Chi-square test (χ²)
  - Information theory:
    - How much information does the value of one categorical variable give you about the value of another?
    - Mutual information
- They're similar, but χ² measures confidence in association (based on available statistics), while MI measures extent of association (assuming perfect knowledge of probabilities)

χ² statistic (CHI) [Sec. 13.5.2]

- χ² is interested in (fo − fe)²/fe summed over all table entries: is the observed number what you'd expect given the marginals?

                   Term = jaguar   Term ≠ jaguar
  Class = auto     2 (0.25)        500 (502)
  Class ≠ auto     3 (4.75)        9500 (9498)

  observed: fo; expected: fe (in parentheses)

- The null hypothesis is rejected with confidence .999, since 12.9 > 10.83 (the value for .999 confidence).

χ² statistic (CHI), 2×2 shortcut [Sec. 13.5.2]

- There is a simpler formula for a 2×2 table:
  χ²(t, c) = N (AD − CB)² / ((A + C)(B + D)(A + B)(C + D))
  where A = #(t, c), B = #(t, ¬c), C = #(¬t, c), D = #(¬t, ¬c), and N = A + B + C + D
- Value for complete independence of term and category?
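A small sketch of the 2×2 shortcut, checked against the jaguar/auto counts above (the function and variable names are mine):

```python
def chi_square_2x2(A, B, C, D):
    """chi^2 for a 2x2 term/class table: A=#(t,c), B=#(t,not c), C=#(not t,c), D=#(not t,not c)."""
    N = A + B + C + D
    return N * (A * D - C * B) ** 2 / ((A + C) * (B + D) * (A + B) * (C + D))

# jaguar/auto example from the slide: 2, 3, 500, 9500 -> about 12.9 > 10.83,
# so the null hypothesis of independence is rejected at the .999 confidence level.
print(chi_square_2x2(2, 3, 500, 9500))  # ~12.9
```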
Feature selection via Mutual Information [Sec. 13.5.1]

- In the training set, choose the k words which best discriminate (give most information on) the categories.
- The Mutual Information between a word and a class is:
  I(U; C) = Σ_{e_w ∈ {0,1}} Σ_{e_c ∈ {0,1}} P(e_w, e_c) log [ P(e_w, e_c) / (P(e_w) P(e_c)) ]
  where e_w indicates presence/absence of the word and e_c membership/non-membership in the class
- Compute this for each word w and each category c

Feature selection via MI (contd.) [Sec. 13.5.1]

- For each category we build a list of the k most discriminating terms.
- For example (on 20 Newsgroups):
  - sci.electronics: circuit, voltage, amp, ground, copy, battery, electronics, cooling, ...
  - rec.autos: car, cars, engine, ford, dealer, mustang, oil, collision, autos, tires, toyota, ...
- Greedy: does not account for correlations between terms
  - Why?

Feature Selection [Sec. 13.5]

- Mutual Information
  - Clear information-theoretic interpretation
  - May select rare uninformative terms
- Chi-square
  - Statistical foundation
  - May select very slightly informative frequent terms that are not very useful for classification
- Just use the commonest terms?
  - No particular foundation
  - In practice, this is often 90% as good

Feature selection for NB [Sec. 13.5]

- In general, feature selection is necessary for multivariate Bernoulli NB.
  - Otherwise you suffer from noise and multi-counting
- "Feature selection" really means something different for multinomial NB: it means dictionary truncation.
  - The multinomial NB model only has 1 feature
- This "feature selection" normally isn't needed for multinomial NB, but may help a fraction with quantities that are badly estimated
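A minimal sketch of the MI computation from 2×2 document counts, using the same A, B, C, D layout as the χ² sketch above (the counts reuse the jaguar/auto numbers purely for illustration):

```python
import math

def mutual_information(A, B, C, D):
    """I(U;C) from 2x2 counts: A=#(t,c), B=#(t,not c), C=#(not t,c), D=#(not t,not c)."""
    N = A + B + C + D
    total = 0.0
    # For each cell: joint count, term-marginal count, class-marginal count.
    for joint, term_marg, class_marg in [(A, A + B, A + C), (B, A + B, B + D),
                                         (C, C + D, A + C), (D, C + D, B + D)]:
        if joint:  # cells with zero count contribute zero
            total += (joint / N) * math.log2(joint * N / (term_marg * class_marg))
    return total

print(mutual_information(2, 3, 500, 9500))  # small positive value: rare, weakly associated term
```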
Evaluating Categorization [Sec. 13.6]

- Evaluation must be done on test data that are independent of the training data (usually a disjoint set of instances).
  - Sometimes use cross-validation (averaging results over multiple training and test splits of the overall data)
- It's easy to get good performance on a test set that was available to the learner during training (e.g., just memorize the test set).
- Measures: precision, recall, F1, classification accuracy
- Classification accuracy: c/n, where n is the total number of test instances and c is the number of test instances correctly classified by the system.
  - Adequate if there is one class per document
  - Otherwise use the F measure for each class

Naive Bayes vs. other methods [Sec. 13.6]

[Figure: results comparing Naive Bayes with other classification methods; not preserved in this text.]

WebKB Experiment (1998) [Sec. 13.6]

- Classify webpages from CS departments into: student, faculty, course, project
- Train on ~5,000 hand-labeled web pages
  - Cornell, Washington, U. Texas, Wisconsin
- Crawl and classify a new site (CMU)
- Results: [Figure: per-class results table; not preserved in this text.]

NB Model Comparison: WebKB [Sec. 13.6]

[Figure: comparison of the two NB models on WebKB; not preserved in this text.]

Naive Bayes on spam email [Sec. 13.6]

[Figure: spam-filtering results for Naive Bayes; not preserved in this text.]

SpamAssassin

- Naive Bayes has found a home in spam filtering
  - Paul Graham's A Plan for Spam
    - A mutant with more mutant offspring...
  - Naive Bayes-like classifier with weird parameter estimation
  - Widely used in spam filters
    - Classic Naive Bayes superior when appropriately used (according to David D. Lewis)
  - But also many other things: black hole lists, etc.
- Many email topic filters also use NB classifiers

Violation of NB Assumptions

- The independence assumptions do not really hold of documents written in natural language.
  - Conditional independence
  - Positional independence
- Examples?

Example: Sensors

- Reality (two rain sensors, joint outcomes):
  - P(+, +, r) = 3/8    P(−, −, r) = 1/8
  - P(+, +, s) = 1/8    P(−, −, s) = 3/8
- NB model factors:
  - P(s) = 1/2
  - P(+ | s) = 1/4
  - P(+ | r) = 3/4
- NB predictions:
  - P(r, +, +) = (1/2)(3/4)(3/4)
  - P(s, +, +) = (1/2)(1/4)(1/4)
  - P(r | +, +) = 9/10
  - P(s | +, +) = 1/10

Naive Bayes Posterior Probabilities

- Classification results of naive Bayes (the class with maximum posterior probability) are usually fairly accurate.
- However, due to the inadequacy of the conditional independence assumption, the actual posterior-probability numerical estimates are not.
  - Output probabilities are commonly very close to 0 or 1.
- Correct estimation ⇒ accurate prediction, but correct probability estimation is NOT necessary for accurate prediction (you just need the right ordering of probabilities).

Naive Bayes is Not So Naive

- Naive Bayes won 1st and 2nd place in the KDD-CUP 97 competition out of 16 systems
  - Goal: financial services industry direct mail response prediction model: predict if the recipient of mail will actually respond to the advertisement; 750,000 records.
- More robust to irrelevant features than many learning methods
  - Irrelevant features cancel each other without affecting results
  - Decision trees can suffer heavily from this.
- More robust to concept drift (changing class definition over time)
- Very good in domains with many equally important features
  - Decision trees suffer from fragmentation in such cases, especially if there is little data
- A good dependable baseline for text classification (but not the best)!
- Optimal if the independence assumptions hold: Bayes Optimal Classifier
  - Never true for text, but possible in some domains
- Very fast learning and testing (basically just count the data)
- Low storage requirements

Resources for today's lecture [Ch. 13]

- IIR 13
- Fabrizio Sebastiani. Machine Learning in Automated Text Categorization. ACM Computing Surveys, 34(1):1-47, 2002.
- Yiming Yang & Xin Liu. A re-examination of text categorization methods. Proceedings of SIGIR, 1999.
- Andrew McCallum and Kamal Nigam. A Comparison of Event Models for Naive Bayes Text Classification. In AAAI/ICML-98 Workshop on Learning for Text Categorization, pp. 41-48.
- Tom Mitchell. Machine Learning. McGraw-Hill, 1997.
- Open Calais: Automatic Semantic Tagging
  - Clear simple explanation of Naive Bayes
  - Free (but they can keep your data), provided by Thompson/Reuters (ex-ClearForest)
- Weka: a data mining software package that includes an implementation of Naive Bayes
- Reuters-21578: the most famous text classification evaluation set
  - Still widely used by lazy people (but now it's too small for realistic experiments; you should use Reuters RCV1)