Acknowledgement: Material derived from slides for the book
Machine Learning, Tom M. Mitchell, McGraw-Hill, 1997
http://www2.cs.cmu.edu/~tom/mlbook.html
and slides by Andrew W. Moore available at
http://www.cs.cmu.edu/~awm/tutorials

11s1: COMP9417 Machine Learning and Data Mining

Bayesian Learning
May 10, 2011

Aims

This lecture will enable you to describe machine learning in the framework of Bayesian statistics and reproduce key algorithms derived from this approach. Following it you should be able to:

• describe the Naive Bayes model and the role of Bayesian inference in it
• reproduce basic definitions of useful probabilities
• outline the key elements of the Bayes Net formalism
• derive Bayes theorem and the formulae for MAP and ML hypotheses
• describe concept learning in Bayesian terms
• outline the derivation of the method of numerical prediction by minimising the sum of squared errors in terms of maximum likelihood
• define the Minimum Description Length principle in Bayesian terms and outline the steps in its application, e.g., to decision tree learning
• define the Bayes Optimal Classifier
• reproduce the Naive Bayes classifier, e.g. for text classification
• outline issues in learning Bayes Nets

[Recommended reading: Mitchell, Chapter 6]
[Recommended exercises: 6.1, 6.2, (6.4)]
Relevant WEKA programs:
weka.classifiers.bayes package (BayesNet, NaiveBayes, etc.)

See also the R Project for Statistical Computing:
http://www.r-project.org/

Uncertainty

As far as the laws of mathematics refer to reality, they are not certain; as far as they are certain, they do not refer to reality.
    –Albert Einstein

Two Roles for Bayesian Methods

Provides practical learning algorithms:
• Naive Bayes learning
• Bayesian belief network learning
• Combine prior knowledge (prior probabilities) with observed data
• Requires prior probabilities

Provides useful conceptual framework:
• Provides “gold standard” for evaluating other learning algorithms
• Additional insight into Occam’s razor
Bayes Theorem

    P(h|D) = P(D|h) P(h) / P(D)

• P(h) = prior probability of hypothesis h
• P(D) = prior probability of training data D
• P(h|D) = probability of h given D
• P(D|h) = probability of D given h

Choosing Hypotheses

Generally we want the most probable hypothesis given the training data, the maximum a posteriori hypothesis h_MAP:

    h_MAP = argmax_{h ∈ H} P(h|D)
          = argmax_{h ∈ H} P(D|h) P(h) / P(D)
          = argmax_{h ∈ H} P(D|h) P(h)
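To make this concrete, here is a minimal Python sketch (not part of the original slides) of picking the MAP hypothesis over a small finite hypothesis space; the hypothesis names and probability values are invented purely for illustration.

```python
# Minimal sketch: choosing MAP and ML hypotheses over a finite hypothesis space.
# The priors and likelihoods below are made-up numbers, not from the lecture.

def map_hypothesis(prior, likelihood):
    """Return argmax_h P(D|h) P(h); P(D) is a constant so it can be ignored."""
    return max(prior, key=lambda h: likelihood[h] * prior[h])

def ml_hypothesis(likelihood):
    """Return argmax_h P(D|h) (equivalent to MAP under a uniform prior)."""
    return max(likelihood, key=likelihood.get)

prior = {"h1": 0.7, "h2": 0.3}          # P(h)
likelihood = {"h1": 0.1, "h2": 0.4}     # P(D|h)

print(map_hypothesis(prior, likelihood))  # h2, since 0.3*0.4 > 0.7*0.1
print(ml_hypothesis(likelihood))          # h2
```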
Choosing Hypotheses

If we assume P(h_i) = P(h_j) for all i and j, then we can further simplify, and choose the maximum likelihood (ML) hypothesis:

    h_ML = argmax_{h_i ∈ H} P(D|h_i)

Applying Bayes Theorem

Does the patient have cancer or not?

    A patient takes a lab test and the result comes back positive. The test returns a correct positive result in only 98% of the cases in which the disease is actually present, and a correct negative result in only 97% of the cases in which the disease is not present. Furthermore, .008 of the entire population have this cancer.

Filling in the probabilities:

    P(cancer) = .008            P(¬cancer) = .992
    P(⊕ | cancer) = .98         P(⊖ | cancer) = .02
    P(⊕ | ¬cancer) = .03        P(⊖ | ¬cancer) = .97

We can find the maximum a posteriori (MAP) hypothesis:

    P(⊕ | cancer) P(cancer) = 0.98 × 0.008 = 0.00784
    P(⊕ | ¬cancer) P(¬cancer) = 0.03 × 0.992 = 0.02976

Thus h_MAP = ¬cancer.

Also note: the posterior probability of the hypothesis cancer is higher than its prior.

Basic Formulas for Probabilities

• Product Rule: probability P(A ∧ B) of a conjunction of two events A and B:

    P(A ∧ B) = P(A|B) P(B) = P(B|A) P(A)

• Sum Rule: probability of a disjunction of two events A and B:

    P(A ∨ B) = P(A) + P(B) − P(A ∧ B)

• Theorem of total probability: if events A1, ..., An are mutually exclusive with Σ_{i=1..n} P(Ai) = 1, then

    P(B) = Σ_{i=1..n} P(B|Ai) P(Ai)
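A small sketch of the cancer-test calculation above in Python, using only the numbers already given; the variable names are mine.

```python
# Sketch: the cancer test example, done numerically.
p_cancer, p_not = 0.008, 0.992
p_pos_given_cancer, p_pos_given_not = 0.98, 0.03

# Unnormalised posteriors P(+|h) P(h)
joint_cancer = p_pos_given_cancer * p_cancer      # 0.00784
joint_not = p_pos_given_not * p_not               # 0.02976

# h_MAP is whichever product is larger
h_map = "cancer" if joint_cancer > joint_not else "not cancer"

# Normalising by P(+) (theorem of total probability) gives the posterior
p_pos = joint_cancer + joint_not
posterior_cancer = joint_cancer / p_pos           # about 0.21, well above the prior 0.008

print(h_map, round(posterior_cancer, 3))
```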
1. For each hypothesis h in H , calculate the posterior probability
• Conditional Probability: probability of A given B :
P ( A B ) = P ( h D ) = P (A ∧ B )
P (B ) P ( D  h) P ( h)
P (D ) 2. Output the hypothesis hM AP with the highest posterior probability
• Rearrange sum rule to get: hM AP = argmax P (hD)
h∈H P ( A ∧ B ) = P ( A) + P ( B ) − P ( A ∨ B )
Exercise: Derive Bayes Theorem. COMP9417: May 10, 2011 Bayesian Learning: Slide 13 COMP9417: May 10, 2011 Bayesian Learning: Slide 14 Relation to Concept Learning Relation to Concept Learning Assume ﬁxed set of instances x1, . . . , xm Consider our usual concept learning task Assume D is the set of classiﬁcations D = c(x1), . . . , c(xm) • instance space X , hypothesis space H , training examples D
• consider the FindS learning algorithm (outputs most speciﬁc
hypothesis from the version space V SH,D ) Choose P (Dh): What would Bayes rule produce as the MAP hypothesis?
Does F indS output a MAP hypothesis?? COMP9417: May 10, 2011 Bayesian Learning: Slide 15 COMP9417: May 10, 2011 Relation to Concept Learning Bayesian Learning: Slide 16 Relation to Concept Learning Assume ﬁxed set of instances x1, . . . , xm Then: Assume D is the set of classiﬁcations D = c(x1), . . . , c(xm) P ( h D ) = Choose P (Dh):
• P (Dh) = 1 if h consistent with D 1
V SH,D  0 if h is consistent with D
otherwise • P (Dh) = 0 otherwise
Choose P (h) to be uniform distribution:
• P ( h) = 1
H  for all h in H COMP9417: May 10, 2011 Bayesian Learning: Slide 17 COMP9417: May 10, 2011 Bayesian Learning: Slide 18 Evolution of Posterior Probabilities Relation to Concept Learning
Every hypothesis consistent with D is a MAP hypothesis, if P(h ) P(hD 1) • uniform probability over H P(hD 1, D 2) • target function c ∈ H
• deterministic, noisefree data
• etc. (see above) hypotheses
(a) hypotheses
( b) hypotheses
( c) COMP9417: May 10, 2011 Bayesian Learning: Slide 19 FindS will output a MAP hypothesis, even though it does not explicitly
use probabilities in learning.
Bayesian interpretation of inductive bias : use Bayes theorem, deﬁne
restrictions on P (h) and P (D  h) COMP9417: May 10, 2011 Characterizing Learning Algorithms by Equivalent MAP
Learners Hypothesis space H Candidate
Elimination
Algorithm Slide 20 Learning A Real Valued Function
y Inductive system
Training examples D Bayesian Learning: Output hypotheses f
hML
e Equivalent Bayesian inference system
Training examples D
Output hypotheses
Hypothesis space H Brute force
MAP learner P(h) uniform
P(Dh) = 0 if inconsistent,
= 1 if consistent Prior assumptions
made explicit COMP9417: May 10, 2011 x
Bayesian Learning: Slide 21 COMP9417: May 10, 2011 Bayesian Learning: Slide 22 Learning A Real Valued Function Learning A Real Valued Function Consider any realvalued target function f
Training examples xi, di, where di is noisy training value hM L = argmax p(Dh)
h∈H • di = f ( xi ) + ei = argmax • ei is random variable (noise) drawn independently for each xi according
to some Gaussian distribution with mean=0 h∈H = argmax
h∈H Then the maximum likelihood hypothesis hM L is the one that minimizes
the sum of squared errors: hM L = arg min
h∈H m
i=1 (di − h(xi)) COMP9417: May 10, 2011 Slide 23 h∈H = argmin
h∈H COMP9417: May 10, 2011 i=1 m
i=1 (di − h(xi)) 2πσ 2 1 d i − h (x i ) ) 2
σ e− 2 ( Bayesian Learning: Slide 24 hM AP = argmax P (Dh)P (h) 1 − (di − h(xi)) 1 Once again, the MAP hypothesis
2
1 d i − h( x i )
= argmax
ln √
−
σ
2πσ 2 2
h∈H
i=1
2
m
1 d i − h( x i )
= argmax
−
2
σ
h∈H
i=1
= argmax √ Minimum Description Length Principle Maximize natural log to give simpler expression . . . m
i=1 p( d i  h) COMP9417: May 10, 2011 Learning A Real Valued Function hM L i=1
m
Recall that we treat each probability p(D  h) as if h = f 2 Bayesian Learning: m
m
h∈H Which is equivalent to
hM AP = argmax log2 P (Dh) + log2 P (h)
h∈H 2 Or 2 hM AP = argmin − log2 P (Dh) − log2 P (h)
h∈H Bayesian Learning: Slide 25 COMP9417: May 10, 2011 Bayesian Learning: Slide 26 Minimum Description Length Principle Minimum Description Length Principle Interestingly, this is an expression about a quantity of bits. So interpret (1):
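The following sketch (my own illustration, with made-up data points and candidate slopes) checks the equivalence just derived numerically: under zero-mean Gaussian noise, the candidate hypothesis with the highest log-likelihood is exactly the one with the smallest sum of squared errors.

```python
import math

# Sketch: ML = least squares under zero-mean Gaussian noise.
# Candidate hypotheses are lines h(x) = w*x for a few slopes w (illustration only).
xs = [1.0, 2.0, 3.0, 4.0]
ds = [2.1, 3.9, 6.2, 7.8]      # noisy observations of roughly f(x) = 2x
sigma = 1.0

def log_likelihood(w):
    return sum(math.log(1.0 / math.sqrt(2 * math.pi * sigma**2))
               - 0.5 * ((d - w * x) / sigma) ** 2 for x, d in zip(xs, ds))

def sse(w):
    return sum((d - w * x) ** 2 for x, d in zip(xs, ds))

candidates = [1.5, 1.8, 2.0, 2.2]
h_ml = max(candidates, key=log_likelihood)
h_lsq = min(candidates, key=sse)
print(h_ml, h_lsq)   # the same hypothesis wins both ways
```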
Minimum Description Length Principle

Interestingly, (1) is an expression about a quantity of bits. So interpret (1):

    h_MAP = argmin_{h ∈ H} [ − log2 P(D|h) − log2 P(h) ]      (1)

• − log2 P(h) is the length of h under the optimal code
• − log2 P(D|h) is the length of D given h under the optimal code

From information theory:

    The optimal (shortest expected coding length) code for an event with probability p is − log2 p bits.

Note well: this assumes optimal encodings, where the priors and likelihoods are known. In practice this is difficult, and makes a difference.

Minimum Description Length Principle

Occam’s razor: prefer the shortest hypothesis.

MDL: prefer the hypothesis h that minimizes

    h_MDL = argmin_{h ∈ H} [ L_C1(h) + L_C2(D|h) ]

where L_C(x) is the description length of x under optimal encoding C.

Example: H = decision trees, D = training data labels
• L_C1(h) is # bits to describe tree h
• L_C2(D|h) is # bits to describe D given h
  – Note L_C2(D|h) = 0 if the examples are classified perfectly by h. Need only describe the exceptions.
• Hence h_MDL trades off tree size for training errors
  – i.e., prefer the hypothesis that minimizes length(h) + length(misclassifications)
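A toy sketch of this trade-off, assuming invented description lengths for three candidate trees and an assumed per-exception cost; only the argmin structure mirrors the slide.

```python
# Sketch: MDL trade-off between tree size and training errors (all numbers invented).
candidates = {
    "small_tree":  {"L1": 10, "errors": 12},
    "medium_tree": {"L1": 25, "errors": 3},
    "large_tree":  {"L1": 80, "errors": 0},
}
bits_per_exception = 6   # assumed cost of describing one misclassified example

def description_length(h):
    c = candidates[h]
    return c["L1"] + c["errors"] * bits_per_exception   # L1(h) + L2(D|h)

h_mdl = min(candidates, key=description_length)
print(h_mdl, {h: description_length(h) for h in candidates})
```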
Most Probable Classification of New Instances

So far we have sought the most probable hypothesis given the data D (i.e., h_MAP). Given a new instance x, what is its most probable classification? h_MAP(x) is not necessarily the most probable classification!

Consider:
• Three possible hypotheses:

    P(h1|D) = .4, P(h2|D) = .3, P(h3|D) = .3

• Given new instance x,

    h1(x) = +, h2(x) = −, h3(x) = −

• What is the most probable classification of x?

Bayes Optimal Classifier

Bayes optimal classification:

    argmax_{vj ∈ V} Σ_{hi ∈ H} P(vj|hi) P(hi|D)

Example:

    P(h1|D) = .4, P(−|h1) = 0, P(+|h1) = 1
    P(h2|D) = .3, P(−|h2) = 1, P(+|h2) = 0
    P(h3|D) = .3, P(−|h3) = 1, P(+|h3) = 0

therefore

    Σ_{hi ∈ H} P(+|hi) P(hi|D) = .4
    Σ_{hi ∈ H} P(−|hi) P(hi|D) = .6

and

    argmax_{vj ∈ V} Σ_{hi ∈ H} P(vj|hi) P(hi|D) = −

No other classification method using the same hypothesis space and same prior knowledge can outperform this method on average.
most practical learning methods. Assume target function f : X → V , where each instance x described by
attributes a1, a2 . . . an.
Most probable value of f (x) is: When to use vM AP • Moderate or large training set available
• Attributes that describe instances are conditionally independent given
classiﬁcation = argmax P (vj a1, a2 . . . an) vM AP = argmax vj ∈ V vj ∈ V P ( a 1 , a2 . . . a n  v j ) P ( v j )
P ( a 1 , a2 . . . a n ) = argmax P (a1, a2 . . . anvj )P (vj ) Successful applications: vj ∈ V • Diagnosis
• Classifying text documents
COMP9417: May 10, 2011 Bayesian Learning: Slide 35 COMP9417: May 10, 2011 Naive Bayes assumption:
i Slide 36 Bayesian Learning: Slide 38 Naive Bayes Algorithm Naive Bayes Classiﬁer P ( a 1 , a2 . . . a n  v j ) = Bayesian Learning: Naive Bayes Learn(examples)
P ( ai  vj ) For each target value vj • Attributes are statistically independent (given the class value) – which means knowledge about the value of a particular attribute
tells us nothing about the value of another attribute (if the class is
known) ˆ
P (vj ) ← estimate P (vj )
For each attribute value ai of each attribute a
ˆ
P (aivj ) ← estimate P (aivj ) which gives
Classify New Instance(x)
Naive Bayes classiﬁer: vN B = argmax P (vj )
vj ∈ V COMP9417: May 10, 2011
i ˆ
vN B = argmax P (vj ) P ( ai vj ) Bayesian Learning: vj ∈ V Slide 37 COMP9417: May 10, 2011 ai ∈ x ˆ
P ( ai vj ) Naive Bayes Example Naive Bayes Example Say we have the new instance: Consider PlayTennis again . . .
Naive Bayes Example

Consider PlayTennis again. The training data gives the following counts and relative frequencies:

    Outlook        Yes  No     P(·|yes)  P(·|no)
      Sunny         2    3       2/9       3/5
      Overcast      4    0       4/9       0/5
      Rainy         3    2       3/9       2/5

    Temperature    Yes  No     P(·|yes)  P(·|no)
      Hot           2    2       2/9       2/5
      Mild          4    2       4/9       2/5
      Cool          3    1       3/9       1/5

    Humidity       Yes  No     P(·|yes)  P(·|no)
      High          3    4       3/9       4/5
      Normal        6    1       6/9       1/5

    Windy          Yes  No     P(·|yes)  P(·|no)
      False         6    2       6/9       2/5
      True          3    3       3/9       3/5

    Play           Yes   No
                    9     5
                   9/14  5/14

Say we have the new instance:

    Outlook = sunny, Temperature = cool, Humidity = high, Windy = true

We want to compute:

    v_NB = argmax_{vj ∈ {“yes”,“no”}} P(vj) ∏_i P(ai | vj)
Naive Bayes Example

So we first calculate the likelihood of the two classes, “yes” and “no”:

    for “yes” = P(y) × P(sun|y) × P(cool|y) × P(high|y) × P(true|y)
    0.0053 = 9/14 × 2/9 × 3/9 × 3/9 × 3/9

    for “no” = P(n) × P(sun|n) × P(cool|n) × P(high|n) × P(true|n)
    0.0206 = 5/14 × 3/5 × 1/5 × 4/5 × 3/5

Then convert to a probability by normalisation:

    P(“yes”) = 0.0053 / (0.0053 + 0.0206) = 0.205
    P(“no”)  = 0.0206 / (0.0053 + 0.0206) = 0.795

The Naive Bayes classification is “no”.
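The same calculation as a runnable sketch; the probabilities are the relative frequencies from the table above.

```python
# Sketch: the PlayTennis Naive Bayes calculation written out in Python.
p_class = {"yes": 9/14, "no": 5/14}
p_attr = {
    "yes": {"outlook=sunny": 2/9, "temp=cool": 3/9, "humidity=high": 3/9, "windy=true": 3/9},
    "no":  {"outlook=sunny": 3/5, "temp=cool": 1/5, "humidity=high": 4/5, "windy=true": 3/5},
}
instance = ["outlook=sunny", "temp=cool", "humidity=high", "windy=true"]

likelihood = {}
for c in p_class:
    l = p_class[c]
    for a in instance:
        l *= p_attr[c][a]
    likelihood[c] = l                      # about 0.0053 for "yes", 0.0206 for "no"

total = sum(likelihood.values())
posterior = {c: l / total for c, l in likelihood.items()}   # about 0.205 / 0.795
print(max(posterior, key=posterior.get), posterior)          # "no"
```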
Naive Bayes: Subtleties

1. The conditional independence assumption is often violated:

    P(a1, a2 ... an | vj) = ∏_i P(ai | vj)

• ...but it works surprisingly well anyway. Note that we do not need the estimated posteriors P̂(vj|x) to be correct; we need only that

    argmax_{vj ∈ V} P̂(vj) ∏_i P̂(ai|vj) = argmax_{vj ∈ V} P(vj) P(a1, ..., an|vj)

  i.e. maximum probability is assigned to the correct class
• see [Domingos & Pazzani, 1996] for analysis
• Naive Bayes posteriors are often unrealistically close to 1 or 0
• adding too many redundant attributes will cause problems (e.g. identical attributes)

Naive Bayes: “zero-frequency” problem

2. What if none of the training instances with target value vj have attribute value ai? Then

    P̂(ai|vj) = 0, and ...  P̂(vj) ∏_i P̂(ai|vj) = 0

Pseudo-counts: add 1 to each count (a version of the Laplace Estimator).
This generalisation is a Bayesian estimate for P (aivj )
nc + mp
ˆ
P ( ai  vj ) ←
n+m • Example: attribute outlook for class yes
where Sunny Overcast Rainy
4+ µ
3
9+µ 3+ µ
3
9+µ • n is number of training examples for which v = vj , • Weights don’t need to be equal (if they sum to 1) – a form of prior Sunny Overcast Rainy
2+µp1
9+µ COMP9417: May 10, 2011 4+µp2
9+µ Slide 44 Naive Bayes: “zerofrequency” problem • In some cases adding a constant diﬀerent from 1 might be more
appropriate 2+ µ
2
9+µ Bayesian Learning: • nc number of examples for which v = vj and a = ai
ˆ
• p is prior estimate for P (aivj )
• m is weight given to prior (i.e. number of “virtual” examples) 3+µp3
9+µ This is called the mestimate of probability.
Bayesian Learning: Slide 45 COMP9417: May 10, 2011 Bayesian Learning: Slide 46 Naive Bayes: missing values Naive Bayes: numeric attributes • Training: instance is not included in frequency count for attribute
valueclass combination • Usual assumption: attributes have a normal or Gaussian probability
distribution (given the class) • Classiﬁcation: attribute will be omitted from calculation • The probability density function for the normal distribution is deﬁned
by two parameters:
The sample mean µ: n µ=
The standard deviation σ :
σ= COMP9417: May 10, 2011 Bayesian Learning: Slide 47 Slide 48 Note: the normal distribution is based on the simple exponential function (x − µ )2
1
−
e 2σ 2
2πσ f ( x ) = e − x  Example: continuous attribute temperature with mean = 73 and standard
deviation = 6.2. Density value (66−73)
1
−
e 2×6.22 = 0.0340
2π 6.2 Missing values during training are not included in calculation of mean and
standard deviation.
Bayesian Learning: m As the power m in the exponent increases, the function approaches a step
function.
Where m = 2 f ( x ) = e − x  2 and this is the basis of the normal distribution – the various constants are
the result of scaling so that the integral (the area under the curve from
−∞ to +∞) is equal to 1. 2 COMP9417: May 10, 2011 Bayesian Learning: Naive Bayes: numeric attributes Then we have the density function f (x): f (temperature = 66“yes”) = √ n
1
( xi − µ) 2
n − 1 i=1 COMP9417: May 10, 2011 Naive Bayes: numeric attributes f ( x) = √ 1
xi
n i=1 Slide 49 from “Statistical Computing” by Michael J. Crawley (2002) Wiley. COMP9417: May 10, 2011 Bayesian Learning: Slide 50 Learning to Classify Text Learning to Classify Text Why? Target concept Interesting ? : Document → {+, −} • Learn which news articles are of interest 1. Represent each document by vector of words
Learning to Classify Text

Why?
• Learn which news articles are of interest
• Learn to classify web pages by topic

Naive Bayes is among the most effective algorithms.

What attributes shall we use to represent text documents?

Learning to Classify Text

Target concept Interesting? : Document → {+, −}

1. Represent each document by a vector of words
   • one attribute per word position in the document
2. Learning: use the training examples to estimate
   • P(+)
   • P(−)
   • P(doc|+)
   • P(doc|−)

Naive Bayes conditional independence assumption:

    P(doc|vj) = ∏_{i=1..length(doc)} P(ai = wk | vj)

where P(ai = wk|vj) is the probability that the word in position i is wk, given vj.

One more assumption: P(ai = wk|vj) = P(am = wk|vj), ∀i, m  (“bag of words”)

Learn_naive_Bayes_text(Examples, V)
    // collect all words and other tokens that occur in Examples
    Vocabulary ← all distinct words and other tokens in Examples
    // calculate the required P(vj) and P(wk|vj) probability terms
    for each target value vj in V do
        docs_j ← subset of Examples for which the target value is vj
        P(vj) ← |docs_j| / |Examples|
        Text_j ← a single document created by concatenating all members of docs_j
        n ← total number of words in Text_j (counting duplicate words multiple times)
        for each word wk in Vocabulary
            nk ← number of times word wk occurs in Text_j
            P(wk|vj) ← (nk + 1) / (n + |Vocabulary|)
Learning task: classify each new document by newsgroup it came from • positions ← all word positions in Doc that contain tokens found in
V ocabulary
• Return vN B , where
vN B = argmax P (vj )
vj ∈ V i∈positions P ( ai  vj ) comp.graphics
comp.os.mswindows.misc
comp.sys.ibm.pc.hardware
comp.sys.mac.hardware
comp.windows.x
alt.atheism
soc.religion.christian
talk.religion.misc
talk.politics.mideast
talk.politics.misc
talk.politics.guns misc.forsale
rec.autos
rec.motorcycles
rec.sport.baseball
rec.sport.hockey
sci.space
sci.crypt
sci.electronics
sci.med Naive Bayes: 89% classiﬁcation accuracy
COMP9417: May 10, 2011 Bayesian Learning: Slide 55 COMP9417: May 10, 2011 Article from rec.sport.hockey Slide 56 Learning Curve for 20 Newsgroups Path: cantaloupe.srv.cs.cmu.edu!dasnews.harvard.edu!ogicse!uwm.edu
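A compact sketch of the two procedures above for a toy corpus, assuming a plain bag-of-words representation (word counts rather than word positions) and the same add-one smoothing as in the P(wk|vj) estimate; the documents and labels are invented.

```python
from collections import Counter
import math

# Toy corpus (invented) for the Learn/Classify procedures above.
train = [("the game went to overtime great goal", "hockey"),
         ("the patch fixes the kernel driver bug", "comp"),
         ("great save by the goalie in the game", "hockey"),
         ("install the driver and reboot the machine", "comp")]

vocabulary = {w for doc, _ in train for w in doc.split()}
classes = {c for _, c in train}

# Learn_naive_Bayes_text: estimate P(vj) and P(wk|vj) with add-one smoothing
prior, word_prob = {}, {}
for c in classes:
    docs = [doc for doc, label in train if label == c]
    prior[c] = len(docs) / len(train)
    counts = Counter(w for doc in docs for w in doc.split())
    n = sum(counts.values())
    word_prob[c] = {w: (counts[w] + 1) / (n + len(vocabulary)) for w in vocabulary}

# Classify_naive_Bayes_text: argmax over classes of log P(vj) + sum of log P(ai|vj)
def classify(doc):
    words = [w for w in doc.split() if w in vocabulary]
    return max(classes, key=lambda c: math.log(prior[c])
               + sum(math.log(word_prob[c][w]) for w in words))

print(classify("goal in overtime"))   # expected: hockey
```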
From: [email protected] (John Doe)
Subject: Re: This year’s biggest and worst (opinion)...
Date: 5 Apr 93 09:53:39 GMT Bayesian Learning: 20News 100
90
80
70 Bayes
TFIDF
PRTFIDF 60 I can only comment on the Kings, but the most obvious candidate
for pleasant surprise is Alex Zhitnik. He came highly touted as
a defensive defenseman, but he’s clearly much more than that.
Great skater and hard shot (though wish he were more accurate).
In fact, he pretty much allowed the Kings to trade away that
huge defensive liability Paul Coffey. Kelly Hrudey is only the
biggest disappointment if you thought he was any good to begin
with. But, at best, he’s only a mediocre goaltender. A better
choice would be Tomas Sandstrom, though not through any fault
of his own, but because some thugs in Toronto decided ... COMP9417: May 10, 2011 Bayesian Learning: 50
40
30
20
10
0
100 1000 10000 Accuracy vs. Training set size (1/3 withheld for test) Slide 57 COMP9417: May 10, 2011 Bayesian Learning: Slide 58 Bayesian Belief Networks Conditional Independence Interesting because:
• Naive Bayes assumption of conditional independence too restrictive
• But it’s intractable without some such assumptions...
• Bayesian Belief networks describe conditional independence among
subsets of variables Deﬁnition: X is conditionally independent of Y given Z if the
probability distribution governing X is independent of the value of
Y given the value of Z ; that is, if
(∀xi, yj , zk ) P (X = xiY = yj , Z = zk ) = P (X = xiZ = zk )
more compactly, we write
P (X Y, Z ) = P (X Z ) → allows combining prior knowledge about (in)dependencies among
variables with observed training data Example: T hunder is conditionally independent of Rain, given
Lightning (also called Bayes Nets) P (T hunderRain, Lightning ) = P (T hunderLightning )
COMP9417: May 10, 2011 Bayesian Learning: Slide 59 COMP9417: May 10, 2011 Bayesian Learning: Slide 60 Bayesian Belief Network Conditional Independence Naive Bayes uses cond. indep. to justify Storm BusTourGroup P (X, Y Z ) = P (X Y, Z )P (Y Z ) S,B S,¬B ¬S,B ¬S,¬B = P (X Z )P (Y Z ) Campfire C 0.4 0.1 0.8 0.2 ¬C Lightning 0.6 0.9 0.2 0.8 Campfire
Thunder COMP9417: May 10, 2011 Bayesian Learning: Slide 61 COMP9417: May 10, 2011 ForestFire Bayesian Learning: Slide 62 Bayesian Belief Network Bayesian Belief Network A Bayesian Belief Network or Bayes Net is: Storm BusTourGroup • a directed acyclic graph, plus S,B S,¬B ¬S,B ¬S,¬B • a set of associated conditional probabilities Campfire A Bayes Net represents a set of conditional independence assertions:
• Each node is conditionally independent of its nondescendants, given its
immediate predecessors COMP9417: May 10, 2011 Bayesian Learning: Slide 63 i=1 is fully 0.8 Thunder ForestFire COMP9417: May 10, 2011 Bayesian Learning: Slide 64 • If only one variable with unknown value, easy to infer it P (yiP arents(Yi)) deﬁned 0.2 0.2 • Bayes net contains all information needed for this inference
• In general case, problem is NP hard where P arents(Yi) denotes immediate predecessors of Yi in graph
• so, joint distribution
P (yiP arents(Yi)) 0.8 0.9 How can one infer the (probabilities of) values of one or more network
variables, given observed values of others? • e.g., P (Storm, BusT ourGroup, . . . , F orestF ire)
P ( y1 , . . . , y n ) = 0.1 0.6 Inference in Bayesian Networks A Bayes Net factors a joint probability distribution over all variables: n
0.4 Campfire Bayesian Belief Network • in general, C
¬C Lightning by graph, plus the In practice, can succeed in many cases
• Exact inference methods work well for some network structures
• Monte Carlo methods “simulate” the network randomly to calculate
approximate solutions COMP9417: May 10, 2011 Bayesian Learning: Slide 65 COMP9417: May 10, 2011 Bayesian Learning: Slide 66 Learning of Bayesian Networks Learning Bayes Nets Several variants of this learning task Suppose structure known, variables partially observable • Network structure might be known or unknown e.g., observe ForestFire, Storm, BusTourGroup, Thunder, but not
Learning of Bayesian Networks

Several variants of this learning task:
• Network structure might be known or unknown
• Training examples might provide values of all network variables, or just some

If the structure is known and we observe all variables:
• Then it is as easy as training a Naive Bayes classifier

Learning Bayes Nets

Suppose the structure is known and the variables are partially observable, e.g., we observe ForestFire, Storm, BusTourGroup, Thunder, but not Lightning, Campfire...
• A search through the space of all possible values for the variables in the CPTs
• Similar to training a neural network with hidden units
• Can learn the network conditional probability tables using gradient ascent
• Converge to the network h that (locally) maximizes P(D|h)

Gradient Ascent for Bayes Nets

Let wijk denote one entry in the conditional probability table for variable Yi in the network:

    wijk = P(Yi = yij | Parents(Yi) = the list uik of values)

e.g., if Yi = Campfire, then uik might be ⟨Storm = T, BusTourGroup = F⟩.

Search for an assignment of values for the CPTs that maximizes P(D|h). The derived training rule uses Bayes Net inference to calculate probabilities. Perform gradient ascent by repeatedly:

1. updating all wijk using the training data D:

    wijk ← wijk + η Σ_{d ∈ D} P_h(yij, uik | d) / wijk

2. then renormalizing the wijk to assure
   • Σ_j wijk = 1
   • 0 ≤ wijk ≤ 1

More on Learning Bayes Nets

The EM algorithm can also be used. Repeatedly:
1. Calculate probabilities of unobserved variables, assuming h
2. Calculate new wijk to maximize E[ln P(D|h)], where D now includes both the observed and the (calculated probabilities of) unobserved variables

When structure unknown...
• Algorithms use greedy search to add/subtract edges and nodes
• Active research topic

Summary: Bayesian Belief Networks

• Combine prior knowledge with observed data
• Impact of prior knowledge (when correct!) is to lower the sample complexity
• Active research area
  – Extend from Boolean to real-valued variables
  – Parameterized distributions instead of tables
  – Extend to first-order instead of propositional systems
  – More effective inference methods
  – ...
Expectation Maximization (EM)

When to use:
• Data is only partially observable
• Unsupervised learning, e.g., clustering (target value “unobservable”)
• Supervised learning (some instance attributes unobservable)

Some uses:
• Train Bayesian Belief Networks
• Unsupervised clustering (k-means, AUTOCLASS)
• Learning Hidden Markov Models

On Bayesian Learning

    . . . if the prediction of a further observation is the sole objective, a Bayesian mixture of all tenable models is hard to beat. But is this really inferring anything about the source of the data? . . . Also, would we be happy with a scientist who proposed a Bayesian mixture of a countably infinite set of incompatible models for electromagnetic fields?
        –C. S. Wallace and P. R. Freeman (1987)

Summary: Bayesian Learning
• Wellfounded framework for learning where the model and the outputs
are characterised using probabilities
• How to get the probabilities ?
– problem of prior probabilities, also
– think carefully about likelihoods, choice of parameters, etc.
– Empirical Bayes (estimate from data) ?
• Machine Learning techniques – algorithmically clear, extensive empirical
validation, powerful models
• Bayesian Learning – shows how to set up Machine Learning methods
in more formal probabilistic and statistical frameworks
– more work needed