9.2 Bayesian Learning

Acknowledgement: Material derived from slides for the book Machine Learning, Tom M. Mitchell, McGraw-Hill, 1997 (http://www-2.cs.cmu.edu/~tom/mlbook.html) and from slides by Andrew W. Moore available at http://www.cs.cmu.edu/~awm/tutorials

11s1: COMP9417 Machine Learning and Data Mining
Bayesian Learning
May 10, 2011

Aims

This lecture will enable you to describe machine learning in the framework of Bayesian statistics and reproduce key algorithms derived from this approach. Following it you should be able to:

• describe the Naive Bayes model and the role of Bayesian inference in it
• reproduce basic definitions of useful probabilities
• outline the key elements of the Bayes Net formalism
• derive Bayes theorem and the formulae for MAP and ML hypotheses
• describe concept learning in Bayesian terms
• outline the derivation of the method of numerical prediction by minimising the sum of squared errors in terms of maximum likelihood
• define the Minimum Description Length principle in Bayesian terms and outline the steps in its application, e.g. to decision tree learning
• define the Bayes Optimal Classifier
• reproduce the Naive Bayes classifier, e.g. for text classification
• outline issues in learning Bayes Nets

[Recommended reading: Mitchell, Chapter 6]
[Recommended exercises: 6.1, 6.2, (6.4)]

Relevant WEKA programs: weka.classifiers.bayes package (BayesNet, NaiveBayes, etc.)
See also the R Project for Statistical Computing: http://www.r-project.org/

Uncertainty

As far as the laws of mathematics refer to reality, they are not certain; as far as they are certain, they do not refer to reality.
– Albert Einstein

Two Roles for Bayesian Methods

Provides practical learning algorithms:
• Naive Bayes learning
• Bayesian belief network learning
• Combine prior knowledge (prior probabilities) with observed data
• Requires prior probabilities

Provides useful conceptual framework:
• Provides a "gold standard" for evaluating other learning algorithms
• Additional insight into Occam's razor

Bayes Theorem

P(h|D) = P(D|h) P(h) / P(D)

• P(h) = prior probability of hypothesis h
• P(D) = prior probability of training data D
• P(h|D) = probability of h given D (the posterior)
• P(D|h) = probability of D given h (the likelihood)

Choosing Hypotheses

Generally we want the most probable hypothesis given the training data, the maximum a posteriori hypothesis h_MAP:

h_MAP = argmax_{h∈H} P(h|D)
      = argmax_{h∈H} P(D|h) P(h) / P(D)
      = argmax_{h∈H} P(D|h) P(h)

(P(D) can be dropped because it does not depend on h.)

If we assume P(hi) = P(hj) for all i, j, we can simplify further and choose the maximum likelihood (ML) hypothesis:

h_ML = argmax_{hi∈H} P(D|hi)
Applying Bayes Theorem

Does the patient have cancer or not? A patient takes a lab test and the result comes back positive. The test returns a correct positive result in only 98% of the cases in which the disease is actually present, and a correct negative result in only 97% of the cases in which the disease is not present. Furthermore, .008 of the entire population have this cancer.

P(cancer) = .008          P(¬cancer) = .992
P(⊕ | cancer) = .98       P(⊖ | cancer) = .02
P(⊕ | ¬cancer) = .03      P(⊖ | ¬cancer) = .97

We can find the maximum a posteriori (MAP) hypothesis:

P(⊕ | cancer) P(cancer) = 0.98 × 0.008 = 0.00784
P(⊕ | ¬cancer) P(¬cancer) = 0.03 × 0.992 = 0.02976

Thus h_MAP = ¬cancer. Also note that the posterior probability of the hypothesis cancer, 0.00784 / (0.00784 + 0.02976) ≈ 0.21, is much higher than its prior of 0.008.

Basic Formulas for Probabilities

• Product Rule: probability P(A ∧ B) of a conjunction of two events A and B:
  P(A ∧ B) = P(A|B) P(B) = P(B|A) P(A)
• Sum Rule: probability of a disjunction of two events A and B:
  P(A ∨ B) = P(A) + P(B) − P(A ∧ B)
• Theorem of total probability: if events A1, ..., An are mutually exclusive with Σ_{i=1}^n P(Ai) = 1, then
  P(B) = Σ_{i=1}^n P(B|Ai) P(Ai)

Also worth remembering:

• Conditional Probability: probability of A given B:
  P(A|B) = P(A ∧ B) / P(B)
• Rearrange the sum rule to get:
  P(A ∧ B) = P(A) + P(B) − P(A ∨ B)

Exercise: Derive Bayes Theorem.

Brute Force MAP Hypothesis Learner

1. For each hypothesis h in H, calculate the posterior probability
   P(h|D) = P(D|h) P(h) / P(D)
2. Output the hypothesis h_MAP with the highest posterior probability
   h_MAP = argmax_{h∈H} P(h|D)
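To make the brute-force MAP learner concrete, here is a minimal Python sketch that applies it to the two-hypothesis cancer example above. The priors and likelihoods are taken from the slide; the function and variable names are illustrative only.

```python
# Brute-force MAP "learner" for the two-hypothesis cancer example.
# Hypotheses: "cancer" and "not cancer"; observed data D: a positive test result.

priors = {"cancer": 0.008, "not cancer": 0.992}
likelihood_pos = {"cancer": 0.98, "not cancer": 0.03}   # P(+ | h)

# Unnormalised posteriors P(+ | h) * P(h)
scores = {h: likelihood_pos[h] * priors[h] for h in priors}

# Normalise to get the true posteriors P(h | +)
evidence = sum(scores.values())                          # P(+)
posteriors = {h: s / evidence for h, s in scores.items()}

h_map = max(posteriors, key=posteriors.get)
print(scores)        # {'cancer': 0.00784, 'not cancer': 0.02976}
print(posteriors)    # cancer ~ 0.21, not cancer ~ 0.79
print("h_MAP =", h_map)                                  # -> "not cancer"
```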
Relation to Concept Learning

Consider our usual concept learning task:
• instance space X, hypothesis space H, training examples D
• consider the FindS learning algorithm (outputs the most specific hypothesis from the version space VS_{H,D})

What would Bayes rule produce as the MAP hypothesis? Does FindS output a MAP hypothesis?

Assume a fixed set of instances ⟨x1, ..., xm⟩ and that D is the set of classifications D = ⟨c(x1), ..., c(xm)⟩.

Choose P(D|h):
• P(D|h) = 1 if h is consistent with D
• P(D|h) = 0 otherwise

Choose P(h) to be the uniform distribution:
• P(h) = 1/|H| for all h in H

Then:

P(h|D) = 1/|VS_{H,D}| if h is consistent with D, and 0 otherwise

Every hypothesis consistent with D is a MAP hypothesis, if
• uniform probability over H
• target function c ∈ H
• deterministic, noise-free data
• etc. (see above)

So FindS will output a MAP hypothesis, even though it does not explicitly use probabilities in learning.

Bayesian interpretation of inductive bias: use Bayes theorem, and define restrictions on P(h) and P(D|h).

Evolution of Posterior Probabilities

[Figure: posterior mass over the hypothesis space in three panels: (a) the prior P(h), (b) P(h|D1), (c) P(h|D1, D2); as data arrive, the posterior concentrates on the hypotheses consistent with the data.]

Characterizing Learning Algorithms by Equivalent MAP Learners

[Figure: an inductive system (training examples D and hypothesis space H fed to the Candidate Elimination Algorithm) produces the same output hypotheses as an equivalent Bayesian inference system: a brute-force MAP learner with P(h) uniform and P(D|h) = 1 if h is consistent with D, 0 otherwise. The prior assumptions are made explicit.]

Learning A Real Valued Function

[Figure: noisy training points scattered around a real-valued target f, with the maximum likelihood hypothesis h_ML fitted through them.]

Consider any real-valued target function f. Training examples are ⟨xi, di⟩, where di is a noisy training value:
• di = f(xi) + ei
• ei is a random variable (noise) drawn independently for each xi according to a Gaussian distribution with mean 0 (and variance σ²)

Then the maximum likelihood hypothesis h_ML is the one that minimizes the sum of squared errors:

h_ML = argmin_{h∈H} Σ_{i=1}^m (di − h(xi))²

To see this, recall that we treat each probability p(di|h) as if h were the true target f, so di is normally distributed with mean h(xi) and variance σ²:

h_ML = argmax_{h∈H} p(D|h)
     = argmax_{h∈H} ∏_{i=1}^m p(di|h)
     = argmax_{h∈H} ∏_{i=1}^m (1/√(2πσ²)) e^{−(1/2)((di − h(xi))/σ)²}

Maximizing the natural log gives a simpler expression:

h_ML = argmax_{h∈H} Σ_{i=1}^m [ ln(1/√(2πσ²)) − (1/2)((di − h(xi))/σ)² ]
     = argmax_{h∈H} Σ_{i=1}^m −(1/2)((di − h(xi))/σ)²
     = argmax_{h∈H} Σ_{i=1}^m −(di − h(xi))²
     = argmin_{h∈H} Σ_{i=1}^m (di − h(xi))²
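As a concrete illustration of this result, the sketch below fits a straight-line hypothesis by minimising the sum of squared errors, which by the derivation above is the maximum likelihood hypothesis under Gaussian noise. The data and the choice of a linear hypothesis class are invented for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Noisy observations of a linear target f(x) = 2x + 1 (unknown to the learner),
# with Gaussian noise e_i ~ N(0, sigma^2) added independently to each d_i.
x = np.linspace(0.0, 1.0, 50)
d = 2.0 * x + 1.0 + rng.normal(0.0, 0.3, size=x.shape)

def sum_squared_errors(w, b):
    """Sum over i of (d_i - h(x_i))^2 for the linear hypothesis h(x) = w*x + b."""
    return np.sum((d - (w * x + b)) ** 2)

# Minimising the sum of squared errors = maximum likelihood under Gaussian noise.
# For a linear hypothesis this is ordinary least squares, solved here in closed form.
A = np.column_stack([x, np.ones_like(x)])
(w_ml, b_ml), *_ = np.linalg.lstsq(A, d, rcond=None)

print(f"h_ML(x) = {w_ml:.2f} * x + {b_ml:.2f}, SSE = {sum_squared_errors(w_ml, b_ml):.3f}")
```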
Minimum Description Length Principle

Once again, the MAP hypothesis:

h_MAP = argmax_{h∈H} P(D|h) P(h)

which is equivalent to

h_MAP = argmax_{h∈H} [ log2 P(D|h) + log2 P(h) ]

or

h_MAP = argmin_{h∈H} [ −log2 P(D|h) − log2 P(h) ]    (1)

Interestingly, this is an expression about a quantity of bits. From information theory: the optimal (shortest expected coding length) code for an event with probability p is −log2 p bits. So interpret (1):
• −log2 P(h) is the length of h under the optimal code
• −log2 P(D|h) is the length of D given h under the optimal code

Note well: this assumes optimal encodings, which requires the priors and likelihoods to be known. In practice this is difficult, and it makes a difference.

Occam's razor: prefer the shortest hypothesis.

MDL: prefer the hypothesis h that minimizes the total description length,

h_MDL = argmin_{h∈H} [ L_{C1}(h) + L_{C2}(D|h) ]

where L_C(x) is the description length of x under optimal encoding C.

Example: H = decision trees, D = training data labels
• L_{C1}(h) is the number of bits needed to describe tree h
• L_{C2}(D|h) is the number of bits needed to describe D given h
  – Note L_{C2}(D|h) = 0 if the examples are classified perfectly by h; we need only describe the exceptions
• Hence h_MDL trades off tree size against training errors, i.e. it prefers the hypothesis that minimizes length(h) + length(misclassifications)

Most Probable Classification of New Instances

So far we have sought the most probable hypothesis given the data D (i.e., h_MAP). Given a new instance x, what is its most probable classification? h_MAP(x) is not necessarily the most probable classification!

Consider:
• Three possible hypotheses:
  P(h1|D) = .4, P(h2|D) = .3, P(h3|D) = .3
• Given a new instance x,
  h1(x) = +, h2(x) = −, h3(x) = −
• What is the most probable classification of x?

Bayes Optimal Classifier

Bayes optimal classification:

argmax_{vj∈V} Σ_{hi∈H} P(vj|hi) P(hi|D)

Example:

P(h1|D) = .4, P(−|h1) = 0, P(+|h1) = 1
P(h2|D) = .3, P(−|h2) = 1, P(+|h2) = 0
P(h3|D) = .3, P(−|h3) = 1, P(+|h3) = 0

therefore

Σ_{hi∈H} P(+|hi) P(hi|D) = .4
Σ_{hi∈H} P(−|hi) P(hi|D) = .6

and

argmax_{vj∈V} Σ_{hi∈H} P(vj|hi) P(hi|D) = −

No other classification method using the same hypothesis space and the same prior knowledge can outperform this method on average.
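The following minimal sketch reproduces the three-hypothesis example above: it computes the Bayes optimal classification by weighting each hypothesis's vote by its posterior. The names and data structures are illustrative only.

```python
# Bayes optimal classification for the three-hypothesis example.
# posterior[h] = P(h | D); predictions[h][v] = P(v | h) (deterministic 0/1 here).
posterior = {"h1": 0.4, "h2": 0.3, "h3": 0.3}
predictions = {
    "h1": {"+": 1.0, "-": 0.0},
    "h2": {"+": 0.0, "-": 1.0},
    "h3": {"+": 0.0, "-": 1.0},
}

def bayes_optimal(values=("+", "-")):
    # For each class value v, sum P(v | h) * P(h | D) over all hypotheses.
    scores = {v: sum(predictions[h][v] * posterior[h] for h in posterior)
              for v in values}
    return max(scores, key=scores.get), scores

label, scores = bayes_optimal()
print(scores)   # {'+': 0.4, '-': 0.6}
print(label)    # '-'  (even though the single MAP hypothesis h1 predicts '+')
```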
Naive Bayes Classifier

Along with decision trees, neural networks and nearest neighbour, one of the most practical learning methods.

When to use:
• Moderate or large training set available
• Attributes that describe instances are conditionally independent given the classification

Successful applications:
• Diagnosis
• Classifying text documents

Assume the target function is f : X → V, where each instance x is described by attributes ⟨a1, a2, ..., an⟩. The most probable value of f(x) is:

v_MAP = argmax_{vj∈V} P(vj | a1, a2, ..., an)
      = argmax_{vj∈V} P(a1, a2, ..., an | vj) P(vj) / P(a1, a2, ..., an)
      = argmax_{vj∈V} P(a1, a2, ..., an | vj) P(vj)

Naive Bayes assumption:

P(a1, a2, ..., an | vj) = ∏_i P(ai | vj)

• Attributes are statistically independent (given the class value), which means that knowledge about the value of a particular attribute tells us nothing about the value of another attribute (if the class is known)

which gives the Naive Bayes classifier:

v_NB = argmax_{vj∈V} P(vj) ∏_i P(ai | vj)

Naive Bayes Algorithm

Naive_Bayes_Learn(examples)
  for each target value vj
    P̂(vj) ← estimate P(vj)
    for each attribute value ai of each attribute a
      P̂(ai|vj) ← estimate P(ai|vj)

Classify_New_Instance(x)
  v_NB = argmax_{vj∈V} P̂(vj) ∏_{ai∈x} P̂(ai|vj)

Naive Bayes Example

Consider PlayTennis again, with counts and estimated conditional probabilities for each attribute value given each class:

Outlook          Yes        No
  Sunny          2 (2/9)    3 (3/5)
  Overcast       4 (4/9)    0 (0/5)
  Rainy          3 (3/9)    2 (2/5)

Temperature      Yes        No
  Hot            2 (2/9)    2 (2/5)
  Mild           4 (4/9)    2 (2/5)
  Cool           3 (3/9)    1 (1/5)

Humidity         Yes        No
  High           3 (3/9)    4 (4/5)
  Normal         6 (6/9)    1 (1/5)

Windy            Yes        No
  False          6 (6/9)    2 (2/5)
  True           3 (3/9)    3 (3/5)

Play             Yes: 9 (9/14)    No: 5 (5/14)

Say we have the new instance:

⟨Outlook = sunny, Temp = cool, Humid = high, Wind = true⟩

We want to compute:

v_NB = argmax_{vj ∈ {"yes","no"}} P(vj) ∏_i P(ai | vj)

So we first calculate the likelihood of the two classes, "yes" and "no":

for "yes" = P(y) × P(sun|y) × P(cool|y) × P(high|y) × P(true|y)
          = 9/14 × 2/9 × 3/9 × 3/9 × 3/9 = 0.0053

for "no"  = P(n) × P(sun|n) × P(cool|n) × P(high|n) × P(true|n)
          = 5/14 × 3/5 × 1/5 × 4/5 × 3/5 = 0.0206

Then convert these to probabilities by normalisation:

P("yes") = 0.0053 / (0.0053 + 0.0206) = 0.205
P("no")  = 0.0206 / (0.0053 + 0.0206) = 0.795

The Naive Bayes classification is "no".

Naive Bayes: Subtleties

1. The conditional independence assumption

P(a1, a2, ..., an | vj) = ∏_i P(ai | vj)

is often violated...
• ...but it works surprisingly well anyway. Note that we don't need the estimated posteriors P̂(vj|x) to be correct; we need only that

argmax_{vj∈V} P̂(vj) ∏_i P̂(ai|vj) = argmax_{vj∈V} P(vj) P(a1, ..., an|vj)

i.e. that maximum probability is assigned to the correct class
• see [Domingos & Pazzani, 1996] for analysis
• Naive Bayes posteriors are often unrealistically close to 1 or 0
• adding too many redundant attributes will cause problems (e.g. identical attributes)

Naive Bayes: the "zero-frequency" problem

2. What if none of the training instances with target value vj have attribute value ai? Then P̂(ai|vj) = 0, and...

P̂(vj) ∏_i P̂(ai|vj) = 0

Pseudo-counts: add 1 to each count (a version of the Laplace Estimator).

• In some cases adding a constant different from 1 might be more appropriate
• Example: attribute Outlook for class yes, with a weight µ split equally across the three values:

Sunny: (2 + µ/3)/(9 + µ)    Overcast: (4 + µ/3)/(9 + µ)    Rainy: (3 + µ/3)/(9 + µ)

• The weights don't need to be equal (as long as they sum to 1), giving a form of prior:

Sunny: (2 + µp1)/(9 + µ)    Overcast: (4 + µp2)/(9 + µ)    Rainy: (3 + µp3)/(9 + µ)

This generalisation is a Bayesian estimate for P̂(ai|vj):

P̂(ai|vj) ← (nc + m·p) / (n + m)

where
• n is the number of training examples for which v = vj
• nc is the number of examples for which v = vj and a = ai
• p is the prior estimate for P̂(ai|vj)
• m is the weight given to the prior (i.e. the number of "virtual" examples)

This is called the m-estimate of probability.
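A small sketch of the tabular Naive Bayes classifier above, applied to the PlayTennis counts, with an optional Laplace pseudo-count to illustrate the zero-frequency fix (note that Overcast never occurs with class "no"). The data structures and names are mine, not from the slides.

```python
# PlayTennis counts from the slides: counts[class][attribute][value]
counts = {
    "yes": {"Outlook": {"Sunny": 2, "Overcast": 4, "Rainy": 3},
            "Temp":    {"Hot": 2, "Mild": 4, "Cool": 3},
            "Humid":   {"High": 3, "Normal": 6},
            "Wind":    {"True": 3, "False": 6}},
    "no":  {"Outlook": {"Sunny": 3, "Overcast": 0, "Rainy": 2},
            "Temp":    {"Hot": 2, "Mild": 2, "Cool": 1},
            "Humid":   {"High": 4, "Normal": 1},
            "Wind":    {"True": 3, "False": 2}},
}
class_counts = {"yes": 9, "no": 5}
total = sum(class_counts.values())

def classify(instance, laplace=0):
    """Return unnormalised scores P(vj) * prod_i P(ai|vj), with optional pseudo-counts."""
    scores = {}
    for vj in counts:
        score = class_counts[vj] / total           # P(vj)
        for attr, value in instance.items():
            k = len(counts[vj][attr])               # number of values of this attribute
            n_c = counts[vj][attr][value]
            score *= (n_c + laplace) / (class_counts[vj] + laplace * k)
        scores[vj] = score
    return scores

x = {"Outlook": "Sunny", "Temp": "Cool", "Humid": "High", "Wind": "True"}
scores = classify(x)                 # ~{'yes': 0.0053, 'no': 0.0206}
z = sum(scores.values())
print({v: round(s / z, 3) for v, s in scores.items()})   # ~{'yes': 0.205, 'no': 0.795}

# With an Overcast instance, class "no" gets probability 0 unless we smooth:
print(classify({"Outlook": "Overcast", "Temp": "Hot", "Humid": "High", "Wind": "False"}, laplace=1))
```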
Naive Bayes: missing values

• Training: the instance is not included in the frequency count for that attribute value-class combination
• Classification: the attribute is simply omitted from the calculation

Naive Bayes: numeric attributes

• Usual assumption: numeric attributes have a normal (Gaussian) probability distribution given the class
• The probability density function for the normal distribution is defined by two parameters:

The sample mean µ:

µ = (1/n) Σ_{i=1}^n xi

The standard deviation σ:

σ = √( (1/(n−1)) Σ_{i=1}^n (xi − µ)² )

Then we have the density function f(x):

f(x) = (1/(√(2π) σ)) e^{−(x − µ)² / (2σ²)}

Example: continuous attribute temperature with mean = 73 and standard deviation = 6.2 for class "yes". The density value is

f(temperature = 66 | "yes") = (1/(√(2π) × 6.2)) e^{−(66 − 73)² / (2 × 6.2²)} = 0.0340

Missing values during training are not included in the calculation of the mean and standard deviation.

Note: the normal distribution is based on the simple exponential function

f(x) = e^{−|x|^m}

As the power m in the exponent increases, the function approaches a step function. With m = 2,

f(x) = e^{−|x|²}

and this is the basis of the normal distribution; the various constants are the result of scaling so that the integral (the area under the curve from −∞ to +∞) is equal to 1.

(From "Statistical Computing" by Michael J. Crawley (2002), Wiley.)
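A quick check of the density calculation above, as a minimal sketch; the mean and standard deviation are the slide's values for temperature given class "yes".

```python
import math

def gaussian_density(x, mu, sigma):
    """Normal probability density f(x) with mean mu and standard deviation sigma."""
    return (1.0 / (math.sqrt(2 * math.pi) * sigma)) * math.exp(-((x - mu) ** 2) / (2 * sigma ** 2))

# Temperature given class "yes": mean 73, standard deviation 6.2 (from the slides).
print(round(gaussian_density(66, 73, 6.2), 4))   # 0.034
```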
Learning to Classify Text

Why?
• Learn which news articles are of interest
• Learn to classify web pages by topic

Naive Bayes is among the most effective algorithms for this task. What attributes shall we use to represent text documents?

Target concept Interesting? : Document → {+, −}

1. Represent each document by a vector of words: one attribute per word position in the document
2. Learning: use the training examples to estimate
   • P(+)
   • P(−)
   • P(doc|+)
   • P(doc|−)

Naive Bayes conditional independence assumption:

P(doc|vj) = ∏_{i=1}^{length(doc)} P(ai = wk | vj)

where P(ai = wk|vj) is the probability that the word in position i is wk, given vj.

One more assumption: P(ai = wk|vj) = P(am = wk|vj) for all i, m, i.e. the "bag of words" model.

Learn_naive_Bayes_text(Examples, V)

// collect all words and other tokens that occur in Examples
Vocabulary ← all distinct words and other tokens in Examples

// calculate the required P(vj) and P(wk|vj) probability terms
for each target value vj in V do
  docs_j ← subset of Examples for which the target value is vj
  P(vj) ← |docs_j| / |Examples|
  Text_j ← a single document created by concatenating all members of docs_j
  n ← total number of words in Text_j (counting duplicate words multiple times)
  for each word wk in Vocabulary
    nk ← number of times word wk occurs in Text_j
    P(wk|vj) ← (nk + 1) / (n + |Vocabulary|)

Classify_naive_Bayes_text(Doc)

• positions ← all word positions in Doc that contain tokens found in Vocabulary
• Return v_NB, where

v_NB = argmax_{vj∈V} P(vj) ∏_{i∈positions} P(ai|vj)

(A Python sketch of this procedure appears below, after the 20 Newsgroups example.)

Application: 20 Newsgroups

Given: 1000 training documents from each group.
Learning task: classify each new document by the newsgroup it came from.

comp.graphics, comp.os.ms-windows.misc, comp.sys.ibm.pc.hardware, comp.sys.mac.hardware, comp.windows.x
misc.forsale, rec.autos, rec.motorcycles, rec.sport.baseball, rec.sport.hockey
alt.atheism, soc.religion.christian, talk.religion.misc, talk.politics.mideast, talk.politics.misc, talk.politics.guns
sci.space, sci.crypt, sci.electronics, sci.med

Naive Bayes: 89% classification accuracy.

Article from rec.sport.hockey:

Path: cantaloupe.srv.cs.cmu.edu!das-news.harvard.edu!ogicse!uwm.edu
From: [email protected] (John Doe)
Subject: Re: This year's biggest and worst (opinion)...
Date: 5 Apr 93 09:53:39 GMT

I can only comment on the Kings, but the most obvious candidate for pleasant surprise is Alex Zhitnik. He came highly touted as a defensive defenseman, but he's clearly much more than that. Great skater and hard shot (though wish he were more accurate). In fact, he pretty much allowed the Kings to trade away that huge defensive liability Paul Coffey. Kelly Hrudey is only the biggest disappointment if you thought he was any good to begin with. But, at best, he's only a mediocre goaltender. A better choice would be Tomas Sandstrom, though not through any fault of his own, but because some thugs in Toronto decided ...

Learning Curve for 20 Newsgroups

[Figure: accuracy vs. training set size (1/3 of the data withheld for testing) on the 20 Newsgroups task, comparing Bayes, TFIDF and PRTFIDF; accuracy rises with training set size over the range of roughly 100 to 10000 documents.]
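Here is a compact Python sketch of the Learn_naive_Bayes_text / Classify_naive_Bayes_text procedure above (multinomial Naive Bayes with add-one smoothing over a shared vocabulary). The toy documents and labels are made up purely to show the mechanics; log-probabilities are used to avoid underflow, which the slides do not discuss but is standard practice.

```python
import math
from collections import Counter

def learn_naive_bayes_text(examples):
    """examples: list of (document_string, label). Returns priors, word probs, vocabulary."""
    vocabulary = {w for doc, _ in examples for w in doc.split()}
    priors, word_probs = {}, {}
    for vj in {label for _, label in examples}:
        docs_j = [doc for doc, label in examples if label == vj]
        priors[vj] = len(docs_j) / len(examples)
        text_j = Counter(w for doc in docs_j for w in doc.split())
        n = sum(text_j.values())                        # total words in Text_j
        word_probs[vj] = {w: (text_j[w] + 1) / (n + len(vocabulary))   # add-one smoothing
                          for w in vocabulary}
    return priors, word_probs, vocabulary

def classify_naive_bayes_text(doc, priors, word_probs, vocabulary):
    positions = [w for w in doc.split() if w in vocabulary]
    scores = {vj: math.log(priors[vj]) + sum(math.log(word_probs[vj][w]) for w in positions)
              for vj in priors}
    return max(scores, key=scores.get)

# Toy example (labels and documents invented for illustration)
examples = [("the kings traded their defenseman", "hockey"),
            ("great skater and hard shot on goal", "hockey"),
            ("the spacecraft entered orbit around mars", "space"),
            ("nasa launched a new probe into orbit", "space")]
model = learn_naive_bayes_text(examples)
print(classify_naive_bayes_text("hard shot by the defenseman", *model))   # -> 'hockey'
```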
Bayesian Belief Networks

Interesting because:
• the Naive Bayes assumption of conditional independence is too restrictive
• but the problem is intractable without some such assumptions...
• Bayesian Belief Networks describe conditional independence among subsets of variables
→ this allows combining prior knowledge about (in)dependencies among variables with observed training data

(Also called Bayes Nets.)

Conditional Independence

Definition: X is conditionally independent of Y given Z if the probability distribution governing X is independent of the value of Y given the value of Z; that is, if

(∀ xi, yj, zk)  P(X = xi | Y = yj, Z = zk) = P(X = xi | Z = zk)

More compactly, we write P(X|Y, Z) = P(X|Z).

Example: Thunder is conditionally independent of Rain, given Lightning:

P(Thunder | Rain, Lightning) = P(Thunder | Lightning)

Naive Bayes uses conditional independence to justify:

P(X, Y | Z) = P(X | Y, Z) P(Y | Z) = P(X | Z) P(Y | Z)

Bayesian Belief Network

[Figure: a network over the variables Storm, BusTourGroup, Lightning, Campfire, Thunder and ForestFire, showing the conditional probability table for Campfire given its parents Storm (S) and BusTourGroup (B):]

        S,B    S,¬B   ¬S,B   ¬S,¬B
  C     0.4    0.1    0.8    0.2
 ¬C     0.6    0.9    0.2    0.8

A Bayesian Belief Network (Bayes Net) is:
• a directed acyclic graph, plus
• a set of associated conditional probabilities

A Bayes Net represents a set of conditional independence assertions:
• each node is conditionally independent of its nondescendants, given its immediate predecessors

A Bayes Net factors a joint probability distribution over all variables:

P(y1, ..., yn) = ∏_{i=1}^n P(yi | Parents(Yi))

where Parents(Yi) denotes the immediate predecessors of Yi in the graph. So the joint distribution, e.g. P(Storm, BusTourGroup, ..., ForestFire), is fully defined by the graph plus the conditional probabilities P(yi | Parents(Yi)).

Inference in Bayesian Networks

How can one infer the (probabilities of) values of one or more network variables, given observed values of others?
• the Bayes net contains all the information needed for this inference
• if only one variable has an unknown value, it is easy to infer it
• in the general case, the problem is NP-hard

In practice, inference can succeed in many cases:
• exact inference methods work well for some network structures
• Monte Carlo methods "simulate" the network randomly to calculate approximate solutions

Learning of Bayesian Networks

Several variants of this learning task:
• the network structure might be known or unknown
• the training examples might provide values of all network variables, or just some

If the structure is known and we observe all variables:
• then it is as easy as training a Naive Bayes classifier

Learning Bayes Nets

Suppose the structure is known and the variables are only partially observable, e.g. we observe ForestFire, Storm, BusTourGroup and Thunder, but not Lightning or Campfire...
• this is similar to training a neural network with hidden units
• we can learn the network conditional probability tables using gradient ascent
• this is a search through the space of all possible values for the variables in the CPTs
• it converges to a network h that (locally) maximizes P(D|h)
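Before the gradient-ascent training rule below, it may help to see how the CPTs define the likelihood that learning tries to maximise. This is a minimal sketch assuming, for illustration only, a three-node fragment Storm → Campfire ← BusTourGroup with the Campfire CPT from the figure; the priors for the two parent nodes are invented, since the slides do not give them.

```python
# Joint probability of a full assignment in a tiny Bayes net fragment:
# P(S, B, C) = P(S) * P(B) * P(C | S, B)
p_storm = {True: 0.2, False: 0.8}           # assumed, not from the slides
p_bus   = {True: 0.1, False: 0.9}           # assumed, not from the slides
p_campfire = {                               # P(Campfire=True | Storm, BusTourGroup), from the slide
    (True, True): 0.4, (True, False): 0.1,
    (False, True): 0.8, (False, False): 0.2,
}

def joint(storm, bus, campfire):
    """P(Storm=storm, BusTourGroup=bus, Campfire=campfire) via the factorisation."""
    p_c = p_campfire[(storm, bus)]
    if not campfire:
        p_c = 1.0 - p_c
    return p_storm[storm] * p_bus[bus] * p_c

print(joint(True, False, True))    # 0.2 * 0.9 * 0.1 = 0.018

# For fully observed data, the likelihood P(D|h) that learning maximises is the
# product of such joint probabilities over the training examples:
data = [(True, False, True), (False, True, True), (False, False, False)]
likelihood = 1.0
for example in data:
    likelihood *= joint(*example)
print(likelihood)
```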
Gradient Ascent for Bayes Nets

Search for the assignment of values to the CPTs that maximizes P(D|h). The derived training rule uses Bayes Net inference to calculate the required probabilities.

Let wijk denote one entry in the conditional probability table for variable Yi in the network:

wijk = P(Yi = yij | Parents(Yi) = the list uik of values)

e.g., if Yi = Campfire, then uik might be ⟨Storm = T, BusTourGroup = F⟩.

Perform gradient ascent by repeatedly:

1. updating all wijk using the training data D:

   wijk ← wijk + η Σ_{d∈D} Ph(yij, uik | d) / wijk

2. then renormalizing the wijk to ensure that
   • Σ_j wijk = 1
   • 0 ≤ wijk ≤ 1

More on Learning Bayes Nets

The EM algorithm can also be used. Repeatedly:
1. calculate the probabilities of the unobserved variables, assuming the current hypothesis h
2. calculate new wijk to maximize E[ln P(D|h)], where D now includes both the observed and the (calculated probabilities of) unobserved variables

When the structure is unknown...
• algorithms use greedy search to add/subtract edges and nodes
• active research topic

Summary: Bayesian Belief Networks

• Combine prior knowledge with observed data
• The impact of prior knowledge (when correct!) is to lower the sample complexity
• Active research area:
  – extend from Boolean to real-valued variables
  – parameterized distributions instead of tables
  – extend to first-order instead of propositional systems
  – more effective inference methods
  – ...

On Bayesian Learning

... if the prediction of a further observation is the sole objective, a Bayesian mixture of all tenable models is hard to beat. But is this really inferring anything about the source of the data? ... Also, would we be happy with a scientist who proposed a Bayesian mixture of a countably infinite set of incompatible models for electromagnetic fields?
– C. S. Wallace and P. R. Freeman (1987)

Expectation Maximization (EM)

When to use:
• data is only partially observable
• unsupervised learning, e.g. clustering (target value "unobservable")
• supervised learning (some instance attributes unobservable)

Some uses:
• train Bayesian Belief Networks
• unsupervised clustering (k-means, AUTOCLASS)
• learning Hidden Markov Models

Summary: Bayesian Learning

• Well-founded framework for learning where the model and the outputs are characterised using probabilities
• How to get the probabilities?
  – the problem of prior probabilities, and also
  – think carefully about likelihoods, choice of parameters, etc.
  – Empirical Bayes (estimate from data)?
• Machine Learning techniques: algorithmically clear, extensive empirical validation, powerful models
• Bayesian Learning: shows how to set up Machine Learning methods in more formal probabilistic and statistical frameworks; more work needed