Lecture20OptHO - CS440 Introduction to Artificial Intelligence

CS440 Introduction to Artificial Intelligence
Lecture 20: Bayesian learning; conjugate priors

Julia Hockenmaier
[email protected]
3324 Siebel Center
Office Hours: Thu, 2:00-3:00pm
http://www.cs.uiuc.edu/class/sp11/cs440

The binomial distribution

If p is the probability of heads, the probability of getting exactly k heads in n independent yes/no trials is given by the binomial distribution Bin(n,p):

  P(k \text{ heads}) = \binom{n}{k} p^k (1-p)^{n-k} = \frac{n!}{k!(n-k)!} p^k (1-p)^{n-k}

Expectation: E[Bin(n,p)] = np
Variance: var[Bin(n,p)] = np(1-p)

Binomial likelihood

Given data D = HTTHTT, what is the probability θ of heads?

[Figure: the binomial likelihood L(θ; D = (#Heads, #Tails)) as a function of θ, plotted for D = (5,5), (3,7), and (2,8).]

Parameter estimation

What distribution does θ (the probability of heads) have, given that the data D consists of #H heads and #T tails?

- Maximum likelihood estimation (MLE): use the θ with the highest likelihood P(D|θ):

  \theta_{MLE} = \arg\max_\theta P(D|\theta)

- Maximum a posteriori (MAP) estimation: use the θ with the highest posterior probability P(θ|D):

  \theta_{MAP} = \arg\max_\theta P(\theta|D) = \arg\max_\theta P(\theta) P(D|\theta)

- Bayesian estimation: integrate over all θ, i.e. compute the expectation of θ given D:

  P(x = H \mid D) = \int_0^1 P(x = H \mid \theta) P(\theta \mid D) \, d\theta = E[\theta \mid D]

Maximum likelihood estimation

- MLE: find the θ which maximizes the likelihood P(D|θ):

  \theta^{*} = \arg\max_\theta P(D|\theta) = \arg\max_\theta \theta^{H} (1-\theta)^{T} = \frac{H}{H+T}

Bayesian statistics

- Data D provides evidence for or against our beliefs. We update our belief about θ based on the evidence we see:

  P(\theta \mid D) = \frac{P(\theta) P(D \mid \theta)}{\int P(\theta) P(D \mid \theta) \, d\theta}

  (posterior = prior × likelihood / marginal likelihood, where the marginal likelihood is P(D))

Bayesian estimation

Given a prior P(θ) and a likelihood P(D|θ), what is the posterior P(θ|D)?

- The posterior is proportional to prior × likelihood: P(θ|D) ∝ P(θ) P(D|θ)
- The likelihood of the binomial is P(D|θ) = θ^H (1-θ)^T
- If the prior P(θ) is proportional to powers of θ and (1-θ), the posterior will also be proportional to powers of θ and (1-θ):

  P(\theta) \propto \theta^{a} (1-\theta)^{b} \;\Rightarrow\; P(\theta \mid D) \propto \theta^{a} (1-\theta)^{b} \, \theta^{H} (1-\theta)^{T} = \theta^{a+H} (1-\theta)^{b+T}

In search of a prior...

How do we choose the prior P(θ)? We would like something of the form

  P(\theta) \propto \theta^{a} (1-\theta)^{b}

This looks just like the binomial,

  P(k \text{ heads}) = \binom{n}{k} p^k (1-p)^{n-k}

... except that k is an integer, while θ is a real number with 0 < θ < 1.

The Gamma function

The Gamma function Γ(x) is the generalization of the factorial x! (or rather (x-1)!) to the reals:

  \Gamma(\alpha) = \int_0^\infty x^{\alpha-1} e^{-x} \, dx \quad \text{for } \alpha > 0

For x > 1, Γ(x) = (x-1)Γ(x-1).
For positive integers, Γ(x) = (x-1)!

[Figure: the Gamma function Γ(x) on 0 < x ≤ 5.]

The Beta distribution

A random variable X (0 < x < 1) has a Beta distribution with (hyper)parameters α (α > 0) and β (β > 0) if X has a continuous distribution with probability density function

  P(x \mid \alpha, \beta) = \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)} x^{\alpha-1} (1-x)^{\beta-1}

The first term is a normalization factor (needed to obtain a proper distribution), since

  \int_0^1 x^{\alpha-1} (1-x)^{\beta-1} \, dx = \frac{\Gamma(\alpha)\Gamma(\beta)}{\Gamma(\alpha+\beta)}

Expectation: E[X] = \frac{\alpha}{\alpha+\beta}

Beta(α,β) with α > 1, β > 1: unimodal.

[Figure: the Beta(1.5,1.5), Beta(3,1.5), Beta(3,3), Beta(20,20), and Beta(3,20) densities.]
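The Beta density and its expectation are easy to check numerically. The following is a minimal sketch (not part of the original slides), assuming numpy and scipy are available; the function and variable names are illustrative. It implements the density exactly as written above, compares it against scipy's built-in Beta distribution, and confirms the expectation α/(α+β). The shapes of the density for various (α, β) are surveyed next.

```python
# Minimal sketch: the Beta density and its mean, checked against scipy.
# Assumes numpy and scipy are installed; names here are illustrative.
import numpy as np
from scipy.special import gamma as Gamma
from scipy.stats import beta as beta_dist

def beta_pdf(x, alpha, beta):
    """Beta density written directly from the formula above."""
    norm = Gamma(alpha + beta) / (Gamma(alpha) * Gamma(beta))
    return norm * x**(alpha - 1) * (1 - x)**(beta - 1)

alpha, beta = 3.0, 1.5
x = np.linspace(0.01, 0.99, 99)

# The hand-written density matches scipy's implementation.
assert np.allclose(beta_pdf(x, alpha, beta), beta_dist.pdf(x, alpha, beta))

# The expectation of Beta(alpha, beta) is alpha / (alpha + beta).
print(beta_dist.mean(alpha, beta), alpha / (alpha + beta))   # both ≈ 0.667
```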
CS598JHM: Advanced NLP CS598JHM: Advanced NLP 11 12 0.8 1 Beta(",#) with " <1, # <1 Beta(",#) with "=# Symmetric. #=$=1: uniform U-shaped Beta Distribution Beta( , ) 6 Beta Distribution Beta( , ) 3.5 Beta(0.1,0.1) Beta(0.1,0.5) Beta(0.5,0.5) 5 Beta(0.1,0.1) Beta(1,1) Beta(2,2) 3 2.5 4 2 3 1.5 2 1 1 0 0.5 0 0.2 0.4 0.6 0.8 1 0 0 0.2 0.4 0.6 ! 0.8 1 ! CS598JHM: Advanced NLP CS598JHM: Advanced NLP 13 14 Beta(",#) with "<1, # >1 Beta(",#) with " = 1, # >1 # = 1, 1< $ < 2: strictly concave. # = 1, $ = 2: straight line # = 1, $ > 2: strictly convex Strictly decreasing Beta Distribution Beta( , ) 8 Beta(0.1,1.5) Beta(0.5,1.5) Beta(0.5,2) 7 Beta Distribution Beta( , ) 3.5 6 Beta(1,1.5) Beta(1,2) Beta(1,3) 3 5 2.5 4 2 3 1.5 2 1 1 0 0 0.2 0.4 0.6 ! CS598JHM: Advanced NLP 15 0.8 1 0.5 0 0 0.2 0.4 0.6 CS598JHM: Advanced NLP ! 16 0.8 1 Beta as prior for binomial Given a prior P(! |#,$) = Beta(#,$), and data D=(H,T), what is our posterior? P (θ|α, β , H, T ) ∝ P (H, T |θ)P (θ|α, β ) So, what do we predict? Our Bayesian estimate for the next coin flip P(x=1 | D): P (x = H |D) = ￿ 1 0 ￿ 1 P (x = H |θ)P (θ|D)dθ ∝ θH (1 − θ)T θα−1 (1 − θ)β −1 = = θH +α−1 (1 − θ)T +β −1 = E [Beta(H + α, T + β )] H +α = H +α+T +β 0 θP (θ|D)dθ = E [θ|D] With normalization P (θ|α, β , H, T ) = Γ(H + α + T + β ) H +α−1 θ (1 − θ)T +β −1 Γ(H + α)Γ(T + β ) = Beta(α + H, β + T ) CS598JHM: Advanced NLP CS598JHM: Advanced NLP 17 18 Conjugate priors Multinomials: Dirichlet prior The beta distribution is a conjugate prior to the binomial: the resulting posterior is also a beta distribution. We can interpret its parameters #, $ as pseudocounts P(H | D) = (H + #)/(H + # + T + $) All members of the exponential family of distributions have conjugate priors. Multinomial distribution: Probability of observing each possible outcome ci exactly Xi times in a sequence of n yes/no trials: N ￿ n! x x P (X1 = xi , . . . , XK = xk ) = θ1 1 · · · θKK if xi = n x1 ! · · · xK ! i=1 Dirichlet prior: Examples: - Multinomial: conjugate prior = Dirichlet - Gaussian: conjugate prior = Gaussian Dir(θ|α1 , ...αk ) = Γ(α1 + ... + αk ) ￿ αk −1 θk Γ(α1 )...Γ(αk ) k=1 CS598JHM: Advanced NLP CS598JHM: Advanced NLP 19 20 More about conjugate priors - We can interpret the hyperparameters as “pseudocounts” - Sequential estimation (updating counts after each observation) gives same results as batch estimation - Add-one smoothing (Laplace smoothing) = uniform prior - On average, more data leads to a sharper posterior (sharper = lower variance) CS598JHM: Advanced NLP 21 ...