CS440 Introduction to Artificial Intelligence
Lecture 20: Bayesian learning; conjugate priors
Julia Hockenmaier
[email protected]
3324 Siebel Center
Office Hours: Thu, 2:00-3:00pm
http://www.cs.uiuc.edu/class/sp11/cs440

The binomial distribution
If p is the probability of heads, the probability of getting
exactly k heads in n independent yes/no trials is given by
the binomial distribution Bin(n,p):

P(k heads) = (n choose k) p^k (1-p)^(n-k) = n!/(k!(n-k)!) · p^k (1-p)^(n-k)

Expectation: E(Bin(n,p)) = np
Variance: var(Bin(n,p)) = np(1-p)
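As a quick sanity check, here is a minimal Python sketch (standard library only; n = 10 and p = 0.6 are arbitrary illustration values) that evaluates the pmf and confirms both moments numerically:

from math import comb

def binom_pmf(k, n, p):
    # P(exactly k heads in n flips) under Bin(n, p)
    return comb(n, k) * p**k * (1 - p)**(n - k)

n, p = 10, 0.6
pmf = [binom_pmf(k, n, p) for k in range(n + 1)]
mean = sum(k * q for k, q in enumerate(pmf))             # should equal n*p = 6.0
var = sum((k - mean)**2 * q for k, q in enumerate(pmf))  # should equal n*p*(1-p) = 2.4
print(mean, var)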
Binomial likelihood

What distribution does p (probability of heads) have,
given that the data D consists of #H heads and #T tails?

[Figure: likelihood L(θ; D=(#Heads,#Tails)) of the binomial distribution as a function of θ, for D=(5,5), D=(3,7), and D=(2,8)]
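To reproduce the shape of these curves, evaluate the likelihood on a grid of θ values; a minimal Python sketch using the same (H,T) pairs as the plot:

from math import comb

def likelihood(theta, h, t):
    # L(theta; D=(h,t)) = C(h+t, h) * theta^h * (1-theta)^t
    return comb(h + t, h) * theta**h * (1 - theta)**t

thetas = [i / 100 for i in range(101)]
for h, t in [(5, 5), (3, 7), (2, 8)]:
    curve = [likelihood(th, h, t) for th in thetas]
    peak = thetas[curve.index(max(curve))]
    print(f"H={h}, T={t}: likelihood peaks at theta = {peak}")  # H/(H+T): 0.5, 0.3, 0.2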
Parameter estimation
Given data D=HTTHTT, what is the probability θ of heads?

- Maximum likelihood estimation (MLE):
  Use the θ which has the highest likelihood P(D|θ):
  θ_MLE = arg max_θ P(D|θ)

- Maximum a posteriori (MAP):
  Use the θ which has the highest posterior probability P(θ|D):
  θ_MAP = arg max_θ P(θ|D) = arg max_θ P(θ)P(D|θ)

- Bayesian estimation:
  Integrate over all θ, i.e. compute the expectation of θ given D:
  P(x=H|D) = ∫_0^1 P(x=H|θ) P(θ|D) dθ = E[θ|D]
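For concreteness, a worked comparison on D=HTTHTT (H=2, T=4), assuming a Beta(2,2) prior, i.e. P(θ) ∝ θ(1-θ). Priors of this form are developed later in the lecture; the particular choice Beta(2,2) is only illustrative:

θ_MLE = 2/6 ≈ 0.33
θ_MAP = (2+2-1) / (2+4+2+2-2) = 3/8 = 0.375   (mode of the posterior Beta(4,6))
E[θ|D] = (2+2) / (2+4+2+2) = 4/10 = 0.4       (mean of the posterior Beta(4,6))

The three estimates differ on small samples and converge as the data grows.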
Maximum likelihood estimation

- Maximum likelihood estimation (MLE):
  find the θ which maximizes the likelihood P(D|θ):

  θ* = arg max_θ P(D|θ)
     = arg max_θ θ^H (1-θ)^T
     = H / (H+T)
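A minimal Python sketch of the closed-form MLE on the running example D=HTTHTT:

D = "HTTHTT"
H, T = D.count("H"), D.count("T")
theta_mle = H / (H + T)   # closed form: H/(H+T)
print(theta_mle)          # 2/6 ≈ 0.333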
Bayesian statistics

- Data D provides evidence for or against our beliefs.
  We update our belief about θ based on the evidence we see:

  P(θ|D) = P(θ) P(D|θ) / ∫ P(θ) P(D|θ) dθ

  Here P(θ|D) is the posterior, P(θ) the prior, P(D|θ) the likelihood,
  and the denominator is the marginal likelihood (= P(D)).
Bayesian estimation

Given a prior P(θ) and a likelihood P(D|θ),
what is the posterior P(θ|D)?
How do we choose the prior P(θ)?

- The posterior is proportional to prior × likelihood:
  P(θ|D) ∝ P(θ) P(D|θ)

- The likelihood of a binomial is:
  P(D|θ) = θ^H (1-θ)^T

- If the prior P(θ) is proportional to powers of θ and (1-θ), the
  posterior will also be proportional to powers of θ and (1-θ):
  P(θ) ∝ θ^a (1-θ)^b
  ⇒ P(θ|D) ∝ θ^a (1-θ)^b · θ^H (1-θ)^T = θ^(a+H) (1-θ)^(b+T)
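A minimal numerical sketch of "posterior ∝ prior × likelihood", assuming a uniform prior and a grid of θ values (the grid size of 1000 is an arbitrary choice):

H, T = 2, 4                     # running example D=HTTHTT
N = 1000
thetas = [(i + 0.5) / N for i in range(N)]
prior = [1.0] * N               # uniform prior P(theta) = 1 on (0,1)
unnorm = [p * th**H * (1 - th)**T for p, th in zip(prior, thetas)]
Z = sum(unnorm) / N             # marginal likelihood P(D), via a Riemann sum
posterior = [u / Z for u in unnorm]
mean = sum(th * po for th, po in zip(thetas, posterior)) / N
print(mean)                     # ~ (H+1)/(H+T+2) = 3/8 = 0.375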
In search of a prior...

We would like something of the form: P(θ) ∝ θ^a (1-θ)^b

But this looks just like the binomial:

P(k heads) = (n choose k) p^k (1-p)^(n-k) = n!/(k!(n-k)!) · p^k (1-p)^(n-k)

... except that k is an integer, while θ is a real number with 0 < θ < 1.
The Gamma function

The Gamma function Γ(x) is the generalization of the
factorial x! (or rather (x-1)!) to the reals:

Γ(α) = ∫_0^∞ x^(α-1) e^(-x) dx   for α > 0

For x > 1, Γ(x) = (x-1) Γ(x-1).
For positive integers, Γ(x) = (x-1)!
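A quick check of the factorial property using Python's math.gamma:

import math

for n in range(1, 7):
    # Gamma(n) should equal (n-1)! for positive integers
    print(n, math.gamma(n), math.factorial(n - 1))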
The Gamma function

[Figure: plot of the Γ(x) function for 0 < x < 5]

The Beta distribution
A random variable X (0 < x < 1) has a Beta distribution with
(hyper)parameters α (α > 0) and β (β > 0) if X has a continuous
distribution with probability density function

P(x|α,β) = Γ(α+β) / (Γ(α)Γ(β)) · x^(α-1) (1-x)^(β-1)

The first term is a normalization factor (to obtain a distribution):

∫_0^1 x^(α-1) (1-x)^(β-1) dx = Γ(α)Γ(β) / Γ(α+β)

Expectation: E[X] = α / (α+β)
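A minimal Python sketch of this density, with a crude Riemann-sum check of the normalization and the mean (Beta(3,2) is an arbitrary illustration):

from math import gamma

def beta_pdf(x, a, b):
    # density of Beta(a, b) at x in (0, 1)
    return gamma(a + b) / (gamma(a) * gamma(b)) * x**(a - 1) * (1 - x)**(b - 1)

N = 100000
xs = [(i + 0.5) / N for i in range(N)]
total = sum(beta_pdf(x, 3, 2) for x in xs) / N      # ~ 1.0
mean = sum(x * beta_pdf(x, 3, 2) for x in xs) / N   # ~ 3/(3+2) = 0.6
print(total, mean)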
Beta(α,β) with α > 1, β > 1

Unimodal.

[Figure: Beta densities on (0,1) for Beta(1.5,1.5), Beta(3,1.5), Beta(3,3), Beta(20,20), and Beta(3,20)]

Beta(α,β) with α < 1, β < 1
U-shaped.

[Figure: Beta densities on (0,1) for Beta(0.1,0.1), Beta(0.1,0.5), and Beta(0.5,0.5)]

Beta(α,β) with α = β
Symmetric; α = β = 1 gives the uniform distribution.

[Figure: Beta densities on (0,1) for Beta(0.1,0.1), Beta(1,1), and Beta(2,2)]

Beta(α,β) with α < 1, β > 1
Strictly decreasing.

[Figure: Beta densities on (0,1) for Beta(0.1,1.5), Beta(0.5,1.5), and Beta(0.5,2)]

Beta(α,β) with α = 1, β > 1
α = 1, 1 < β < 2: strictly concave.
α = 1, β = 2: straight line.
α = 1, β > 2: strictly convex.

[Figure: Beta densities on (0,1) for Beta(1,1.5), Beta(1,2), and Beta(1,3)]

Beta as prior for binomial
Given a prior P(θ|α,β) = Beta(α,β) and data D = (H,T),
what is our posterior?

P(θ|α,β,H,T) ∝ P(H,T|θ) P(θ|α,β)
             ∝ θ^H (1-θ)^T · θ^(α-1) (1-θ)^(β-1)
             = θ^(H+α-1) (1-θ)^(T+β-1)

With normalization:

P(θ|α,β,H,T) = Γ(H+α+T+β) / (Γ(H+α) Γ(T+β)) · θ^(H+α-1) (1-θ)^(T+β-1)
             = Beta(α+H, β+T)
So, what do we predict?

Our Bayesian estimate for the next coin flip P(x=H|D):

P(x=H|D) = ∫_0^1 P(x=H|θ) P(θ|D) dθ
         = ∫_0^1 θ P(θ|D) dθ
         = E[θ|D] = E[Beta(H+α, T+β)]
         = (H+α) / (H+α+T+β)
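The whole pipeline (prior, conjugate update, prediction) fits in a few lines; a sketch assuming an arbitrary Beta(2,2) prior on the running example:

alpha, beta = 2.0, 2.0                 # prior Beta(alpha, beta); arbitrary pseudocounts
D = "HTTHTT"
H, T = D.count("H"), D.count("T")
post_a, post_b = alpha + H, beta + T   # posterior Beta(alpha+H, beta+T) = Beta(4, 6)
p_heads = post_a / (post_a + post_b)   # E[theta|D] = (H+alpha)/(H+alpha+T+beta)
print(p_heads)                         # 0.4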
Conjugate priors

The beta distribution is a conjugate prior to the binomial:
the resulting posterior is also a beta distribution.

We can interpret its parameters α, β as pseudocounts:
P(H|D) = (H+α) / (H+α+T+β)

All members of the exponential family of distributions have
conjugate priors. Examples:
- Multinomial: conjugate prior = Dirichlet
- Gaussian: conjugate prior = Gaussian
Multinomials: Dirichlet prior

Multinomial distribution:
the probability of observing each possible outcome c_i exactly x_i
times in a sequence of n trials:

P(X_1 = x_1, ..., X_K = x_K) = n! / (x_1! ··· x_K!) · θ_1^(x_1) ··· θ_K^(x_K)   if Σ_{i=1}^K x_i = n

Dirichlet prior:

Dir(θ|α_1, ..., α_K) = Γ(α_1 + ... + α_K) / (Γ(α_1) ··· Γ(α_K)) · Π_{k=1}^K θ_k^(α_k - 1)
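By direct analogy with the beta-binomial case, the Dirichlet posterior just adds pseudocounts to the observed counts; a sketch with hypothetical counts and a symmetric Dir(1,...,1) prior (equivalent to add-one smoothing):

def dirichlet_posterior_mean(counts, alphas):
    # E[theta_k | D] = (x_k + alpha_k) / (n + sum(alphas))
    total = sum(counts) + sum(alphas)
    return [(x + a) / total for x, a in zip(counts, alphas)]

counts = [5, 3, 2]        # hypothetical counts for a 3-outcome multinomial
alphas = [1.0, 1.0, 1.0]  # symmetric Dirichlet prior = add-one smoothing
print(dirichlet_posterior_mean(counts, alphas))   # [6/13, 4/13, 3/13]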
More about conjugate priors

- We can interpret the hyperparameters as "pseudocounts".
- Sequential estimation (updating counts after each observation)
  gives the same result as batch estimation (see the sketch below).
- Add-one smoothing (Laplace smoothing) = uniform prior.
- On average, more data leads to a sharper posterior
  (sharper = lower variance).
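A sketch of the sequential-vs-batch point for the beta-binomial case: updating the pseudocounts one flip at a time ends in the same posterior as counting the whole corpus at once.

alpha, beta = 1.0, 1.0        # uniform prior Beta(1,1)
D = "HTTHTT"

a_seq, b_seq = alpha, beta    # sequential: update after each observation
for flip in D:
    if flip == "H":
        a_seq += 1
    else:
        b_seq += 1

a_batch = alpha + D.count("H")   # batch: add all counts at once
b_batch = beta + D.count("T")
assert (a_seq, b_seq) == (a_batch, b_batch)
print(a_seq, b_seq)              # Beta(3, 5) either way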