CS440 Introduction to Artificial Intelligence
Lecture 20: Bayesian learning; conjugate priors

Julia Hockenmaier, [email protected]
3324 Siebel Center, Office Hours: Thu, 2:00-3:00pm
http://www.cs.uiuc.edu/class/sp11/cs440


The binomial distribution

If p is the probability of heads, the probability of getting exactly k heads in n independent yes/no trials is given by the binomial distribution Bin(n,p):

    P(k heads) = \binom{n}{k} p^k (1-p)^{n-k} = \frac{n!}{k!(n-k)!} p^k (1-p)^{n-k}

Expectation: E(Bin(n,p)) = np
Variance: var(Bin(n,p)) = np(1-p)


Binomial likelihood

What distribution does p (the probability of heads) have, given that the data D consists of #H heads and #T tails?

[Figure: likelihood L(θ; D=(#Heads,#Tails)) of the binomial as a function of θ, for D=(5,5), (3,7), and (2,8).]


Parameter estimation

Given data D=HTTHTT, what is the probability θ of heads?

- Maximum likelihood estimation (MLE): use the θ which has the highest likelihood P(D|θ):

    θ_MLE = arg max_θ P(D|θ)

- Maximum a posteriori (MAP): use the θ which has the highest posterior probability P(θ|D):

    θ_MAP = arg max_θ P(θ|D) = arg max_θ P(θ) P(D|θ)

- Bayesian estimation: integrate over all θ, i.e. compute the expectation of θ given D:

    P(x=H|D) = \int_0^1 P(x=H|θ) P(θ|D) dθ = E[θ|D]


Maximum likelihood estimation

Maximum likelihood estimation (MLE): find the θ which maximizes the likelihood P(D|θ):

    θ* = arg max_θ P(D|θ) = arg max_θ θ^H (1-θ)^T  ⇒  θ* = H / (H+T)


Bayesian statistics

Data D provides evidence for or against our beliefs. We update our belief θ based on the evidence we see:

    P(θ|D) = \frac{P(θ) P(D|θ)}{\int P(θ) P(D|θ) dθ}

i.e. Posterior = (Prior × Likelihood) / Marginal likelihood, where the marginal likelihood equals P(D).


Bayesian estimation

Given a prior P(θ) and a likelihood P(D|θ), what is the posterior P(θ|D)? How do we choose the prior P(θ)?

- The posterior is proportional to prior × likelihood: P(θ|D) ∝ P(θ) P(D|θ)
- The likelihood of the binomial is: P(D|θ) = θ^H (1-θ)^T
- If the prior P(θ) is proportional to powers of θ and (1-θ), the posterior will also be proportional to powers of θ and (1-θ):

    P(θ) ∝ θ^a (1-θ)^b  ⇒  P(θ|D) ∝ θ^a (1-θ)^b θ^H (1-θ)^T = θ^{a+H} (1-θ)^{b+T}


In search of a prior...

We would like something of the form:

    P(θ) ∝ θ^a (1-θ)^b

But this looks just like the binomial,

    P(k heads) = \binom{n}{k} p^k (1-p)^{n-k} = \frac{n!}{k!(n-k)!} p^k (1-p)^{n-k}

... except that k is an integer, while θ is a real number with 0 < θ < 1.


The Gamma function

The Gamma function Γ(x) is the generalization of the factorial x! (or rather (x-1)!) to the reals:

    Γ(α) = \int_0^∞ x^{α-1} e^{-x} dx   for α > 0

For x > 1, Γ(x) = (x-1) Γ(x-1). For positive integers, Γ(x) = (x-1)!.
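As a quick numerical sanity check of the binomial formula and the Gamma-factorial relationship above, here is a minimal Python sketch. It assumes SciPy is installed; the data D = HTTHTT is taken from the parameter-estimation slide, and the specific test values are illustrative.

```python
# Minimal sanity check of the formulas above (assumes SciPy is installed).
import math
from scipy.special import gamma
from scipy.stats import binom

# Gamma generalizes the factorial: Gamma(x) = (x-1)! for positive integers x.
for x in range(1, 7):
    assert math.isclose(gamma(x), math.factorial(x - 1))

# The recurrence Gamma(x) = (x-1) * Gamma(x-1) also holds for non-integer x.
x = 3.7
assert math.isclose(gamma(x), (x - 1) * gamma(x - 1))

# Binomial pmf from the formula vs. SciPy, e.g. P(k=2 heads | n=6, p=0.5).
n, k, p = 6, 2, 0.5
manual = math.factorial(n) / (math.factorial(k) * math.factorial(n - k)) * p**k * (1 - p)**(n - k)
assert math.isclose(manual, binom.pmf(k, n, p))

# MLE for D = HTTHTT (H=2, T=4): the likelihood theta^H * (1-theta)^T peaks at H/(H+T).
H, T = 2, 4
print("theta_MLE =", H / (H + T))  # 1/3
```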
The Gamma function

[Figure: plot of Γ(x) for 0 < x < 5.]


The Beta distribution

A random variable X (0 < x < 1) has a Beta distribution with (hyper)parameters α (α > 0) and β (β > 0) if X has a continuous distribution with probability density function

    P(x|α,β) = \frac{Γ(α+β)}{Γ(α)Γ(β)} x^{α-1} (1-x)^{β-1}

The first term is a normalization factor (to obtain a distribution):

    \int_0^1 x^{α-1} (1-x)^{β-1} dx = \frac{Γ(α)Γ(β)}{Γ(α+β)}

Expectation: α / (α+β)


Beta(α,β) with α > 1, β > 1

Unimodal.
[Figure: Beta(1.5,1.5), Beta(3,1.5), Beta(3,3), Beta(20,20), Beta(3,20).]


Beta(α,β) with α < 1, β < 1

U-shaped.
[Figure: Beta(0.1,0.1), Beta(0.1,0.5), Beta(0.5,0.5).]


Beta(α,β) with α = β

Symmetric; α = β = 1 gives the uniform distribution.
[Figure: Beta(0.1,0.1), Beta(1,1), Beta(2,2).]


Beta(α,β) with α < 1, β > 1

Strictly decreasing.
[Figure: Beta(0.1,1.5), Beta(0.5,1.5), Beta(0.5,2).]


Beta(α,β) with α = 1, β > 1

α = 1, 1 < β < 2: strictly concave. α = 1, β = 2: straight line. α = 1, β > 2: strictly convex.
[Figure: Beta(1,1.5), Beta(1,2), Beta(1,3).]


Beta as prior for the binomial

Given a prior P(θ|α,β) = Beta(α,β) and data D=(H,T), what is our posterior?

    P(θ|α,β,H,T) ∝ P(H,T|θ) P(θ|α,β)
                 ∝ θ^H (1-θ)^T θ^{α-1} (1-θ)^{β-1}
                 = θ^{H+α-1} (1-θ)^{T+β-1}

With normalization:

    P(θ|α,β,H,T) = \frac{Γ(H+α+T+β)}{Γ(H+α)Γ(T+β)} θ^{H+α-1} (1-θ)^{T+β-1} = Beta(α+H, β+T)


So, what do we predict?

Our Bayesian estimate for the next coin flip, P(x=H|D):

    P(x=H|D) = \int_0^1 P(x=H|θ) P(θ|D) dθ
             = \int_0^1 θ P(θ|D) dθ
             = E[θ|D] = E[Beta(H+α, T+β)]
             = \frac{H+α}{H+α+T+β}


Conjugate priors

The Beta distribution is a conjugate prior to the binomial: the resulting posterior is also a Beta distribution. We can interpret its parameters α, β as pseudocounts:

    P(H|D) = \frac{H+α}{H+α+T+β}

All members of the exponential family of distributions have conjugate priors. Examples:
- Multinomial: conjugate prior = Dirichlet
- Gaussian: conjugate prior = Gaussian


Multinomials: Dirichlet prior

Multinomial distribution: the probability of observing each possible outcome c_i exactly x_i times in a sequence of n independent trials:

    P(X_1 = x_1, ..., X_K = x_K) = \frac{n!}{x_1! \cdots x_K!} θ_1^{x_1} \cdots θ_K^{x_K}   if \sum_{i=1}^{K} x_i = n

Dirichlet prior:

    Dir(θ | α_1, ..., α_K) = \frac{Γ(α_1 + ... + α_K)}{Γ(α_1) \cdots Γ(α_K)} \prod_{k=1}^{K} θ_k^{α_k - 1}


More about conjugate priors

- We can interpret the hyperparameters as "pseudocounts".
- Sequential estimation (updating counts after each observation) gives the same results as batch estimation.
- Add-one smoothing (Laplace smoothing) = uniform prior (illustrated in the Dirichlet sketch below).
- On average, more data leads to a sharper posterior (sharper = lower variance).
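The Beta-binomial update and the predictive mean above are easy to verify numerically. Here is a minimal sketch, assuming SciPy is available; the prior Beta(2,2) and the data H=2, T=4 are illustrative choices, not values from the slides.

```python
# Conjugate Beta-binomial update: posterior and Bayesian prediction (assumes SciPy).
from scipy.stats import beta

a0, b0 = 2.0, 2.0   # prior hyperparameters alpha, beta (pseudocounts); illustrative values
H, T = 2, 4         # observed heads and tails, e.g. D = HTTHTT

# Conjugacy: the posterior is again a Beta distribution, Beta(alpha + H, beta + T).
posterior = beta(a0 + H, b0 + T)

# The Bayesian prediction for the next flip is the posterior mean E[theta | D].
p_heads = (H + a0) / (H + a0 + T + b0)
assert abs(posterior.mean() - p_heads) < 1e-12

print("P(x=H | D)  =", p_heads)        # 0.4, pulled toward the prior mean 0.5
print("MLE H/(H+T) =", H / (H + T))    # 0.333..., the unsmoothed estimate
```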

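For the multinomial/Dirichlet pair, the same pseudocount view applies. The sketch below, with made-up counts and NumPy assumed, checks the claim from the last slide that a uniform Dirichlet prior (all α_k = 1) reproduces add-one (Laplace) smoothing.

```python
# Dirichlet pseudocounts: a uniform prior (alpha_k = 1) equals add-one smoothing.
# (Assumes NumPy is installed; the counts below are made up for illustration.)
import numpy as np

counts = np.array([5, 0, 3])       # observed counts x_k for K = 3 outcomes
alphas = np.ones_like(counts)      # uniform Dirichlet prior, alpha_k = 1

# Posterior predictive mean under Dir(alpha + x): (x_k + alpha_k) / (n + sum_k alpha_k).
posterior_mean = (counts + alphas) / (counts.sum() + alphas.sum())

# Add-one (Laplace) smoothing gives exactly the same estimate.
laplace = (counts + 1) / (counts.sum() + len(counts))
assert np.allclose(posterior_mean, laplace)

print(posterior_mean)   # [0.5454... 0.0909... 0.3636...]
```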
