Introduction to Information Retrieval

Standing queries (Ch. 13)
- The path from IR to text classification:
  - You have an information need to monitor, say: "Unrest in the Niger delta region"
  - You want to rerun an appropriate query periodically to find new news items on this topic
  - You will be sent new documents that are found; i.e., it's text classification, not ranking
- Such queries are called standing queries
  - Long used by "information professionals"
  - A modern mass instantiation is Google Alerts
- Standing queries are (hand-written) text classifiers

Spam filtering: another text classification task (Ch. 13)
    From: "" <takworlld@hotmail.com>
    Subject: real estate is the only way... gem oalvgkay
    Anyone can buy real estate with no money down
    Stop paying rent TODAY!
    There is no need to spend hundreds or even thousands for similar courses
    I am 22 years old and I have already purchased 6 properties using the
    methods outlined in this truly INCREDIBLE ebook.
    Change your life NOW!
    =================================================
    Click Below to order:
    http://www.wholesaledaily.com/sales/nmd.htm
    =================================================

Text classification (Ch. 13)
- Today: introduction to text classification
  - Also widely known as "text categorization"; same thing.
- Naive Bayes text classification
  - Including a little on probabilistic language models

Categorization/Classification (Sec. 13.1)
- Given:
  - A description of an instance, d ∈ X, where X is the instance language or instance space.
    - Issue: how to represent text documents; usually some type of high-dimensional space.
  - A fixed set of classes: C = {c1, c2, ..., cJ}
- Determine:
  - The category of d: γ(d) ∈ C, where γ(d) is a classification function whose domain is X and whose range is C.
  - We want to know how to build classification functions ("classifiers").

Supervised Classification (Sec. 13.1)
- Given:
  - A description of an instance, d ∈ X (X is the instance language or instance space)
  - A fixed set of classes: C = {c1, c2, ..., cJ}
  - A training set D of labeled documents, with each labeled document ⟨d, c⟩ ∈ X × C
- Determine:
  - A learning method or algorithm which will enable us to learn a classifier γ: X → C
  - For a test document d, we assign it the class γ(d) ∈ C

Document Classification (Sec. 13.1)
[Figure: a test document must be assigned to one of the classes ML, Planning, Semantics, Garb.Coll., Multimedia, GUI, grouped under the areas (AI), (Programming), (HCI), given labeled training documents for each class]
[Figure, continued: the test document reads "planning language proof intelligence"; training documents supply characteristic terms per class, e.g. ML: learning, intelligence, algorithm, reinforcement, network; Planning: planning, temporal, reasoning, plan, language; Semantics: programming, semantics, language, proof; Garb.Coll.: garbage, collection, memory, optimization, region]
(Note: in real life there is often a hierarchy, not present in the above problem statement; and also, you get papers on ML approaches to Garb. Coll.)

More Text Classification Examples (Ch. 13)
- Many search engine functionalities use classification.
- Assigning labels to documents or web pages:
  - Labels are most often topics, such as Yahoo categories: "finance," "sports," "news>world>asia>business"
  - Labels may be genres: "editorials", "movie reviews", "news"
  - Labels may be opinion on a person/product: "like", "hate", "neutral"
  - Labels may be domain-specific: "interesting-to-me" vs. "not-interesting-to-me"; "contains adult language" vs. "doesn't"; language identification (English, French, Chinese, ...); search vertical (about Linux vs. not); "link spam" vs. "not link spam"

Classification Methods (1) (Ch. 13)
- Manual classification
  - Used by the original Yahoo! Directory; also Looksmart, about.com, ODP, PubMed
  - Very accurate when the job is done by experts
  - Consistent when the problem size and team are small
  - Difficult and expensive to scale, which means we need automatic classification methods for big problems

Classification Methods (2) (Ch. 13)
- Automatic document classification with hand-coded rule-based systems
  - One technique used by CS departments' spam filters, Reuters, the CIA, etc.; it is what Google Alerts is doing
  - Widely deployed in government and enterprise
  - Companies provide an "IDE" for writing such rules
  - E.g., assign the category if the document contains a given boolean combination of words
  - Standing queries: commercial systems have complex query languages (everything in IR query languages, plus score accumulators)
  - Accuracy is often very high if a rule has been carefully refined over time by a subject expert
  - But building and maintaining these rules is expensive

A Verity topic: a complex classification rule (Ch. 13)
[Figure: an example Verity topic. Note the maintenance issues (author, etc.) and the hand-weighting of terms. Verity was bought by Autonomy.]

Classification Methods (3) (Ch. 13)
- Supervised learning of a document-label assignment function
  - Many systems partly rely on machine learning (Autonomy, Microsoft, Enkata, Yahoo!, ...)
    - k-Nearest Neighbors (simple, powerful)
    - Naive Bayes (simple, common method)
    - Support-vector machines (new, more powerful)
    - ... plus many other methods
  - No free lunch: requires hand-classified training data
  - But the data can be built up (and refined) by amateurs
- Many commercial systems use a mixture of methods

Probabilistic relevance feedback (Sec. 9.1.2)
- Rather than reweighting in a vector space...
- If the user has told us some relevant and some irrelevant documents, then we can proceed to build a probabilistic classifier, such as the Naive Bayes model we will look at today:
    P(tk | R) = |Drk| / |Dr|
    P(tk | NR) = |Dnrk| / |Dnr|
  where tk is a term; Dr is the set of known relevant documents; Drk is the subset that contain tk; Dnr is the set of known irrelevant documents; Dnrk is the subset that contain tk.

Recall a few probability basics
- For events a and b:
    P(a, b) = P(a ∩ b) = P(a|b) P(b) = P(b|a) P(a)
- Bayes' Rule:
    P(a|b) = P(b|a) P(a) / P(b)
  where P(a) is the prior and P(a|b) is the posterior.
- Odds:
    O(a) = P(a) / P(¬a) = P(a) / (1 − P(a))

Bayesian Methods (Sec. 13.2)
- Our focus this lecture: learning and classification methods based on probability theory.
- Bayes' theorem plays a critical role in probabilistic learning and classification.
- Builds a generative model that approximates how data is produced.
- Uses the prior probability of each category given no information about an item.
- Categorization produces a posterior probability distribution over the possible categories given a description of an item.

Bayes' Rule for text classification (Sec. 13.2)
- For a document d and a class c:
    P(c|d) = P(d|c) P(c) / P(d)

Naive Bayes Classifiers (Sec. 13.2)
- Task: classify a new instance d, described by a tuple of attribute values ⟨x1, x2, ..., xn⟩, into one of the classes cj ∈ C:
    cMAP = argmax_{cj ∈ C} P(cj | x1, ..., xn)
         = argmax_{cj ∈ C} P(x1, ..., xn | cj) P(cj) / P(x1, ..., xn)
         = argmax_{cj ∈ C} P(x1, ..., xn | cj) P(cj)
- MAP is "maximum a posteriori" = the most likely class

Naive Bayes Classifier: Naive Bayes Assumption (Sec. 13.2)
- P(cj): can be estimated from the frequency of classes in the training examples.
- P(x1, x2, ..., xn | cj)
  requires O(|X|^n · |C|) parameters and could only be estimated if a very, very large number of training examples was available.
- Naive Bayes Conditional Independence Assumption: assume that the probability of observing the conjunction of attributes is equal to the product of the individual probabilities P(xi | cj).

The Naive Bayes Classifier (Sec. 13.3)
[Figure: class node Flu with feature nodes X1 runny nose, X2 sinus, X3 cough, X4 fever, X5 muscle ache]
- Conditional Independence Assumption: features detect term presence and are independent of each other given the class:
    P(X1, ..., X5 | C) = P(X1 | C) · P(X2 | C) · ... · P(X5 | C)
- This model is appropriate for binary variables: the multivariate Bernoulli model.

Learning the Model (Sec. 13.3)
[Figure: class node C with feature nodes X1 ... X6]
- First attempt: maximum likelihood estimates, i.e., simply use the frequencies in the data:
    P̂(cj) = N(C = cj) / N
    P̂(xi | cj) = N(Xi = xi, C = cj) / N(C = cj)

Problem with Maximum Likelihood (Sec. 13.3)
- What if we have seen no training documents with the word "muscle ache" classified in the topic Flu? Then P̂(X5 = true | C = flu) = 0.
- Zero probabilities cannot be conditioned away, no matter the other evidence!

Smoothing to Avoid Overfitting (Sec. 13.3)
- Add-one (Laplace) smoothing:
    P̂(xi | cj) = (N(Xi = xi, C = cj) + 1) / (N(C = cj) + k)
  where k is the number of values of Xi.
- A somewhat more subtle version:
    P̂(xi,k | cj) = (N(Xi = xi,k, C = cj) + m · P(Xi = xi,k)) / (N(C = cj) + m)
  where P(Xi = xi,k) is the overall fraction in the data where Xi = xi,k, and m is the extent of "smoothing".

Stochastic Language Models (Sec. 13.2.1)
- Model the probability of generating strings (each word in turn) in a language (commonly all strings over an alphabet Σ). E.g., a unigram model M:
    the 0.2, a 0.1, man 0.01, woman 0.01, said 0.03, likes 0.02, ...
- Generating s = "the man likes the woman" under M:
    P(s | M) = 0.2 · 0.01 · 0.02 · 0.2 · 0.01 = 0.00000008

Stochastic Language Models, continued (Sec. 13.2.1)
- Model the probability of generating any string. Compare two unigram models:
    word       M1       M2
    the        0.2      0.2
    class      0.01     0.0001
    sayst      0.0001   0.03
    pleaseth   0.0001   0.02
    yon        0.0005   0.1
    maiden     0.01     0.01
    woman      0.2      0.0001
- For s = "the class pleaseth yon maiden":
    P(s | M1) = 0.2 · 0.01 · 0.0001 · 0.0005 · 0.01 = 10^-12
    P(s | M2) = 0.2 · 0.0001 · 0.02 · 0.1 · 0.01 = 4 · 10^-10
  so P(s | M2) > P(s | M1).

Unigram and higher-order models (Sec. 13.2.1)
- Unigram language models: P(w1 w2 w3 w4) = P(w1) P(w2) P(w3) P(w4). Easy. Effective!
- Bigram (generally, n-gram) language models: P(w1 w2 w3 w4) = P(w1) P(w2 | w1) P(w3 | w2) P(w4 | w3)
- Other language models: grammar-based models (PCFGs), etc. Probably not the first thing to try in IR.

Naive Bayes via a class-conditional language model = multinomial NB (Sec. 13.2)
[Figure: class node C generating words w1 ... w6]
- Effectively, the probability of each class is computed as a class-specific unigram language model:
    P(d, c) = P(c) · Π_i P(wi | c)

Using Multinomial Naive Bayes Classifiers to Classify Text: Basic method (Sec. 13.2)
- Attributes are text positions, values are words:
    cNB = argmax_{cj ∈ C} P(cj) · Π_i P(xi = wordi | cj)
- Still too many possibilities.
- Assume that classification is independent of the positions of the words: use the same parameters for each position. The result is a bag-of-words model (over tokens, not types).

Naive Bayes: Learning (Sec. 13.2)
- From the training corpus, extract the Vocabulary.
- Calculate the required P(cj) and P(xk | cj) terms:
    for each cj in C do
        docsj ← subset of documents for which the target class is cj
        P(cj) ← |docsj| / |total number of documents|
        Textj ← single document containing all of docsj
        for each word xk in Vocabulary
            nk ← number of occurrences of xk in Textj
            P(xk | cj) ← (nk + 1) / (n + |Vocabulary|), where n is the number of word occurrences in Textj
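A minimal Python sketch of the training procedure above, assuming add-one smoothing for the conditional estimates as on the smoothing slide; the function and variable names are mine, not from the lecture:

```python
from collections import Counter

def train_multinomial_nb(D):
    """Multinomial NB training. D is a list of (text, class) pairs.
    Returns the vocabulary, priors P(c), and add-one-smoothed P(w | c)."""
    vocab = {w for text, _ in D for w in text.split()}
    prior, cond = {}, {}
    for c in {label for _, label in D}:
        docs_c = [text for text, label in D if label == c]
        prior[c] = len(docs_c) / len(D)                        # P(c_j)
        mega = Counter(w for text in docs_c for w in text.split())  # Text_j counts
        n = sum(mega.values())                                 # tokens in Text_j
        for w in vocab:
            cond[w, c] = (mega[w] + 1) / (n + len(vocab))      # (n_k + 1) / (n + |V|)
    return vocab, prior, cond

# A tiny illustrative corpus of <document, class> pairs
docs = [("Chinese Beijing Chinese", "c"), ("Chinese Chinese Shanghai", "c"),
        ("Chinese Macao", "c"), ("Tokyo Japan Chinese", "j")]
vocab, prior, cond = train_multinomial_nb(docs)
```

With these four documents, prior["c"] = 3/4 and cond["Chinese", "c"] = (5 + 1)/(8 + 6) = 3/7, exactly the quantities the pseudocode computes.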
Naive Bayes: Classifying (Sec. 13.2)
- positions ← all word positions in the current document which contain tokens found in Vocabulary
- Return cNB, where
    cNB = argmax_{cj ∈ C} P(cj) · Π_{i ∈ positions} P(xi | cj)

Naive Bayes: Time Complexity (Sec. 13.2)
- Training time: O(|D| Lave + |C||V|), where Lave is the average length of a document in D.
  - Assumes all counts are pre-computed in O(|D| Lave) time during one pass through all of the data.
  - Generally just O(|D| Lave), since usually |C||V| < |D| Lave. Why?
- Test time: O(|C| Lt), where Lt is the average length of a test document.
- Very efficient overall: linearly proportional to the time needed to just read in all the data.

Underflow Prevention: using logs (Sec. 13.2)
- Multiplying lots of probabilities, which are between 0 and 1 by definition, can result in floating-point underflow.
- Since log(xy) = log(x) + log(y), it is better to perform all computations by summing logs of probabilities rather than multiplying probabilities.
- The class with the highest final unnormalized log probability score is still the most probable:
    cNB = argmax_{cj ∈ C} [ log P(cj) + Σ_{i ∈ positions} log P(xi | cj) ]
- Note that the model is now just a max of a sum of weights...

Naive Bayes Classifier
- Simple interpretation:
  - Each conditional parameter log P(xi | cj) is a weight that indicates how good an indicator xi is for cj.
  - The prior log P(cj) is a weight that indicates the relative frequency of cj.
  - The sum is then a measure of how much evidence there is for the document being in the class.
  - We select the class with the most evidence for it.

Two Naive Bayes Models
- Model 1: Multivariate Bernoulli
  - One feature Xw for each word in the dictionary; Xw = true in document d if w appears in d
  - Naive Bayes assumption: given the document's topic, the appearance of one word in the document tells us nothing about the chances that another word appears
  - This is the model used in the binary independence model in classic probabilistic relevance feedback on hand-classified data (Maron in IR was a very early user of NB)

Two Models, continued
- Model 2: Multinomial = class-conditional unigram
  - One feature Xi for each word position in the document; the feature's values are all words in the dictionary; the value of Xi is the word in position i
  - Naive Bayes assumption: given the document's topic, the word in one position in the document tells us nothing about words in other positions
  - Second assumption: word appearance does not depend on position:
      P(Xi = w | c) = P(Xj = w | c) for all positions i, j, word w, and class c
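The underflow problem and log-space fix discussed above can be seen directly in a few lines; the document length and per-term probability here are invented purely for illustration:

```python
import math

# 1000 term probabilities of 1e-4 each: the product underflows to 0.0 in
# double precision, while the sum of logs remains a perfectly usable score.
term_probs = [1e-4] * 1000

product = 1.0
for p in term_probs:
    product *= p          # collapses to exactly 0.0 long before the loop ends

log_score = sum(math.log(p) for p in term_probs)  # = 1000 * log(1e-4), finite
```

Comparing log scores across classes yields the same argmax that comparing the products would give in exact arithmetic, which is all the classifier needs.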
  - In sum, the multinomial model has just one multinomial feature predicting all words.

Parameter estimation
- Multivariate Bernoulli model:
    P̂(Xw = true | cj) = fraction of documents of topic cj in which word w appears
- Multinomial model:
    P̂(Xi = w | cj) = fraction of times word w appears among all words in documents of topic cj
  - Can create a mega-document for topic j by concatenating all documents on this topic, then use the frequency of w in the mega-document.

Classification
- Multinomial vs. multivariate Bernoulli?
- The multinomial model is almost always more effective in text applications!
  - See the results figures later.
  - See IIR sections 13.2 and 13.3 for worked examples with each model.

Feature Selection: Why? (Sec. 13.5)
- Text collections have a large number of features: 10,000 to 1,000,000 unique words, and more.
- Feature selection may make using a particular classifier feasible: some classifiers can't deal with hundreds of thousands of features.
- Reduces training time: training time for some methods is quadratic or worse in the number of features.
- Can improve generalization (performance): eliminates noise features and avoids overfitting.

Feature selection: how? (Sec. 13.5)
- Two ideas:
  - Hypothesis-testing statistics: are we confident that the value of one categorical variable is associated with the value of another? The chi-square test (χ²).
  - Information theory: how much information does the value of one categorical variable give you about the value of another? Mutual information (MI).
- They're similar, but χ² measures confidence in association (based on available statistics), while MI measures the extent of association (assuming perfect knowledge of probabilities).

χ² statistic (CHI) (Sec. 13.5.2)
- χ² looks at (fo − fe)²/fe summed over all table entries: is the observed number what you'd expect given the marginals?

    (observed counts fo, with expected counts fe in parentheses)
                     Class = auto    Class ≠ auto
    Term = jaguar    2 (0.25)        3 (4.75)
    Term ≠ jaguar    500 (502)       9500 (9498)

- The null hypothesis is rejected with confidence .999, since 12.9 > 10.83 (the χ² value for .999 confidence).
- There is a simpler formula for the 2×2 χ²:
    χ²(t, c) = N (AD − CB)² / ((A + C)(B + D)(A + B)(C + D))
  where A = #(t, c), B = #(t, ¬c), C = #(¬t, c), D = #(¬t, ¬c), and N = A + B + C + D.
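As a check, the shortcut formula applied to the jaguar/auto counts above reproduces the quoted statistic; the function name is mine:

```python
def chi2_2x2(A, B, C, D):
    """2x2 chi-square via the shortcut formula:
    A = #(t,c), B = #(t,not c), C = #(not t,c), D = #(not t,not c)."""
    N = A + B + C + D
    return N * (A * D - C * B) ** 2 / ((A + C) * (B + D) * (A + B) * (C + D))

# Observed counts from the jaguar/auto table
score = chi2_2x2(2, 3, 500, 9500)   # roughly 12.85, above the 10.83 critical value
```

Since the score exceeds 10.83, independence of term and class is rejected at the .999 confidence level, consistent with the 12.9 quoted above.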
- What is the χ² value for complete independence of term and category? (Zero.)

Feature selection via Mutual Information (Sec. 13.5.1)
- In the training set, choose the k words which best discriminate (give the most information on) the categories.
- For each word w and each category c, the mutual information between w and c is:
    I(w, c) = Σ_{ew ∈ {0,1}} Σ_{ec ∈ {0,1}} P(ew, ec) · log [ P(ew, ec) / (P(ew) P(ec)) ]

Feature selection via MI, contd. (Sec. 13.5.1)
- For each category we build a list of the k most discriminating terms.
- For example (on 20 Newsgroups):
  - sci.electronics: circuit, voltage, amp, ground, copy, battery, electronics, cooling, ...
  - rec.autos: car, cars, engine, ford, dealer, mustang, oil, collision, autos, tires, toyota, ...
- Greedy: does not account for correlations between terms. Why?

Feature Selection (Sec. 13.5)
- Mutual information: clear information-theoretic interpretation, but may select rare uninformative terms.
- Chi-square: statistical foundation, but may select very slightly informative frequent terms that are not very useful for classification.
- Just use the commonest terms? No particular foundation, but in practice this is often 90% as good.

Feature selection for NB (Sec. 13.5)
- In general, feature selection is necessary for multivariate Bernoulli NB; otherwise you suffer from noise and multi-counting.
- "Feature selection" really means something different for multinomial NB: it means dictionary truncation (the multinomial NB model only has 1 feature).
- This "feature selection" normally isn't needed for multinomial NB, but may help a fraction with quantities that are badly estimated.

Evaluating Categorization (Sec. 13.6)
- Evaluation must be done on test data that are independent of the training data (usually a disjoint set of instances).
  - Sometimes use cross-validation (averaging results over multiple training and test splits of the overall data).
- It's easy to get good performance on a test set that was available to the learner during training (e.g., just memorize the test set).
- Measures: precision, recall, F1, classification accuracy.
- Classification accuracy: c/n, where n is the total number of test instances and c is the number of test instances correctly classified by the system.
  - Adequate if there is one class per document; otherwise, use the F measure for each class.

Naive Bayes vs. other methods (Sec. 13.6)
[Figure: results comparing Naive Bayes with other classifiers]

WebKB Experiment (1998) (Sec. 13.6)
- Classify webpages from CS departments into: student, faculty, course, project.
- Train on ~5,000 hand-labeled web pages (Cornell, Washington, U. Texas, Wisconsin).
- Crawl and classify a new site (CMU).
[Figure: results]

NB Model Comparison: WebKB (Sec. 13.6)
[Figure: multinomial vs. multivariate Bernoulli accuracy on WebKB]

Naive Bayes on spam email (Sec. 13.6)
[Figure: spam filtering results]

SpamAssassin
- Naive Bayes has found a home in spam filtering.
  - Paul Graham's "A Plan for Spam": a Naive-Bayes-like classifier with weird parameter estimation; a mutant with more mutant offspring...
  - Widely used in spam filters.
  - Classic Naive Bayes is superior when appropriately used (according to David D. Lewis).
- But spam filters also use many other things: black-hole lists, etc.
- Many email topic filters also use NB classifiers.

Violation of NB Assumptions
- The independence assumptions do not really hold of documents written in natural language.
  - Conditional independence
  - Positional independence
- Examples?

Example: Sensors
- Reality: two weather sensors M1 and M2, which always agree:
    P(+, +, rain) = 3/8    P(−, −, rain) = 1/8
    P(+, +, sun) = 1/8     P(−, −, sun) = 3/8
- NB factors: P(sun) = 1/2, P(+ | sun) = 1/4, P(+ | rain) = 3/4
- NB predictions: P(rain, +, +) = (1/2)(3/4)(3/4) and P(sun, +, +) = (1/2)(1/4)(1/4), so
    P(rain | +, +) = 9/10 and P(sun | +, +) = 1/10
- But the true posterior is P(rain | +, +) = (3/8) / (3/8 + 1/8) = 3/4: NB double-counts the correlated sensors and is overconfident.

Naive Bayes Posterior Probabilities
- Classification results of Naive Bayes (the class with maximum posterior probability) are usually fairly accurate.
- However, due to the inadequacy of the conditional independence assumption, the actual posterior-probability numerical estimates are not: output probabilities are commonly very close to 0 or 1.
- Correct estimation implies accurate prediction, but correct probability estimation is NOT necessary for accurate prediction (you just need the right ordering of probabilities).

Naive Bayes is Not So Naive
- Naive Bayes won 1st and 2nd place in the KDD-CUP 97 competition, out of 16 systems.
  - Goal: a financial-services direct-mail response prediction model: predict whether the recipient of mail will actually respond to the advertisement (750,000 records).
- More robust to irrelevant features than many learning methods: irrelevant features cancel each other without affecting results (decision trees can suffer heavily from this).
- More robust to concept drift (changing class definition over time).
- Very good in domains with many equally important features (decision trees suffer from fragmentation in such cases, especially with little data).
- A good dependable baseline for text classification (but not the best)!
- Optimal if the independence assumptions hold: the Bayes-optimal classifier. Never true for text, but possible in some domains.
- Very fast learning and testing (basically just count the data); low storage requirements.

Resources for today's lecture (Ch. 13)
- IIR 13
- Fabrizio Sebastiani. Machine Learning in Automated Text Categorization. ACM Computing Surveys, 34(1):1-47, 2002.
- Yiming Yang & Xin Liu. A re-examination of text categorization methods. Proceedings of SIGIR, 1999.
- Andrew McCallum and Kamal Nigam. A Comparison of Event Models for Naive Bayes Text Classification. In AAAI/ICML-98 Workshop on Learning for Text Categorization, pp. 41-48.
- Tom Mitchell. Machine Learning. McGraw-Hill, 1997. (Clear simple explanation of Naive Bayes.)
- Open Calais: automatic semantic tagging. Free (but they can keep your data); provided by Thomson Reuters (ex-ClearForest).
- Weka: a data-mining software package that includes an implementation of Naive Bayes.
- Reuters-21578: the most famous text classification evaluation set. Still widely used by lazy people (but now it's too small for realistic experiments; you should use Reuters RCV1).
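Returning to the sensors example above: a quick numerical check shows how NB overstates its confidence when features are correlated (the variable names are mine):

```python
# Joint reality from the sensors example: two perfectly correlated sensors.
joint = {('+', '+', 'rain'): 3/8, ('-', '-', 'rain'): 1/8,
         ('+', '+', 'sun'): 1/8, ('-', '-', 'sun'): 3/8}

# True posterior P(rain | +, +) from the joint distribution
true_post = joint['+', '+', 'rain'] / (joint['+', '+', 'rain'] + joint['+', '+', 'sun'])

# NB factors read off the joint: P(rain) = 1/2, P(+ | rain) = 3/4, P(+ | sun) = 1/4
nb_rain = (1/2) * (3/4) * (3/4)
nb_sun = (1/2) * (1/4) * (1/4)
nb_post = nb_rain / (nb_rain + nb_sun)   # 9/10, versus the true 3/4
```

The NB posterior of 0.9 overshoots the true 0.75 because the second sensor adds no new evidence, yet NB counts it as if it did; the argmax class is still correct, which is the point of the posterior-probabilities slide.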