SPR_LectureHandouts_Chapter_03_Part2


Pattern Recognition, ECE-8443
Chapter 3, Part 2: Bayesian Parameter Estimation
Saurabh Prasad
Electrical and Computer Engineering Department, Mississippi State University

Outline
• Bayesian Estimation (BE)
• Bayesian Parameter Estimation: Gaussian Case
• Bayesian Parameter Estimation: General Estimation
• Problems of Dimensionality

Introduction to Bayesian Parameter Estimation
• In Chapter 2, we learned how to design an optimal classifier if we knew the prior probabilities, P(ωi), and the class-conditional densities, p(x|ωi).
• Bayes: treat the parameter(s) as random variables (or vectors) having some known prior distribution. Observation of samples converts this prior into a posterior.
• Bayesian learning: sharpen the a posteriori density, causing it to peak near the true value.
• Supervised vs. unsupervised: do we know the class assignments of the training data?
• Bayesian estimation and ML estimation produce very similar results in many cases.
• Reduces statistical inference (prior knowledge or beliefs about the world) to probabilities.

Bayesian Estimation
• Bayes formula allows us to compute the posteriors P(ωi|x) from the priors, P(ωi), and the likelihoods, p(x|ωi).
• Posterior probabilities, P(ωi|x), are central to Bayesian classification.
• But what if the priors and class-conditional densities are unknown?
• The answer is that we can still compute the posterior, P(ωi|x), using all of the information at our disposal (e.g., training data).

Bayesian Estimation
• Estimate the priors P(ωi) from the available training data D: P(ωi|D) = P(ωi).
• Assume functional independence: Di has no influence on p(x|ωj, Dj) if i ≠ j.
  - A reasonable assumption.
  - Allows the estimation problem to be partitioned into c separate problems, one per class.
• For a training set D partitioned by class, Bayes formula for the i'th class becomes (a minimal numerical sketch of this combination appears below, after the next slide):

  $$P(\omega_i \mid x, D) = \frac{p(x \mid \omega_i, D_i)\, P(\omega_i)}{\sum_{j=1}^{c} p(x \mid \omega_j, D_j)\, P(\omega_j)}$$

Probability Distribution of the Parameter(s)
• Assume the parametric form of the evidence, p(x), is known: p(x|θ).
• Any information we have about θ prior to observing the samples is contained in a known prior density p(θ).
• Recall that in the Bayesian estimation method, θ is a random variable (or vector).
• Observation of samples converts this prior into a posterior density, p(θ|D), which we hope is sharply peaked around the true value of θ.
• Our goal is to estimate p(x|D) so that it comes as close as possible to the true density p(x):

  $$p(x \mid D) = \int p(x, \theta \mid D)\, d\theta$$

  Integrating a joint pdf over one random variable yields the marginal pdf of the other.
• Although the following slides estimate p(x), what we are really interested in is p(x|ωi). The same estimation procedure holds for p(x|ωi): simply train on Di instead of D.
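As referenced above, here is a minimal numerical sketch of the per-class combination in Bayes formula. It assumes, purely for illustration, that each class-conditional density p(x|ωj, Dj) is estimated by fitting a univariate Gaussian to Dj (the handout does not prescribe this particular estimator here); the two-class toy data and all variable names are hypothetical.

```python
import numpy as np
from scipy.stats import norm

# Illustrative two-class toy problem (not from the handout).
D = {                                   # training data partitioned by class: D = D_1 U D_2
    1: np.array([0.9, 1.1, 1.3, 0.7]),  # samples drawn from class omega_1
    2: np.array([2.8, 3.2, 3.1, 2.9]),  # samples drawn from class omega_2
}
priors = {1: 0.5, 2: 0.5}               # P(omega_i), assumed known or estimated from D

# Functional independence: p(x|omega_j, D) depends only on D_j, so each class-conditional
# density is estimated from its own subset (here, an assumed Gaussian fit for illustration).
fits = {j: (Dj.mean(), Dj.std(ddof=1)) for j, Dj in D.items()}

def class_posteriors(x):
    """P(omega_i|x, D) = p(x|omega_i, D_i) P(omega_i) / sum_j p(x|omega_j, D_j) P(omega_j)."""
    weighted = {j: norm.pdf(x, loc=m, scale=s) * priors[j] for j, (m, s) in fits.items()}
    total = sum(weighted.values())
    return {j: v / total for j, v in weighted.items()}

print(class_posteriors(1.5))   # posterior mass concentrates on class 1
print(class_posteriors(2.5))   # posterior mass concentrates on class 2
```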
Probability Distribution of the Parameter(s)
• We can write the joint density as a product:

  $$p(x \mid D) = \int p(x, \theta \mid D)\, d\theta = \int p(x \mid \theta, D)\, p(\theta \mid D)\, d\theta = \int p(x \mid \theta)\, p(\theta \mid D)\, d\theta$$

  since the selection of x is made independently of the training samples D.
• This equation links the desired class-conditional density p(x|D) to the posterior density of the parameters, p(θ|D).
• It follows that if p(θ|D) peaks very sharply about some value θ̂, we obtain p(x|D) ≈ p(x|θ̂).
• Otherwise, in general, we can use numerical integration to estimate p(x|D).

Univariate Gaussian Case
• Case: only the mean is unknown, p(x|µ) ~ N(µ, σ²).
• Known prior density: p(µ) ~ N(µ0, σ0²).
• Using Bayes formula:

  $$p(\mu \mid D) = \frac{p(D \mid \mu)\, p(\mu)}{p(D)} = \frac{p(D \mid \mu)\, p(\mu)}{\int p(D \mid \mu)\, p(\mu)\, d\mu} = \alpha\, p(D \mid \mu)\, p(\mu) = \alpha \prod_{k=1}^{n} p(x_k \mid \mu)\, p(\mu)$$

• Rationale: once a value of µ is known, the density for x is completely determined. α is a normalization factor that depends only on the data D.

Univariate Gaussian Case
• Applying our Gaussian assumptions:

  $$p(\mu \mid D) = \alpha \prod_{k=1}^{n} \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left[-\frac{1}{2}\left(\frac{x_k - \mu}{\sigma}\right)^{2}\right] \cdot \frac{1}{\sqrt{2\pi}\,\sigma_0} \exp\!\left[-\frac{1}{2}\left(\frac{\mu - \mu_0}{\sigma_0}\right)^{2}\right]$$

  $$= \alpha' \exp\!\left\{-\frac{1}{2}\left[\left(\frac{\mu - \mu_0}{\sigma_0}\right)^{2} + \sum_{k=1}^{n}\left(\frac{x_k - \mu}{\sigma}\right)^{2}\right]\right\}$$

  $$= \alpha' \exp\!\left\{-\frac{1}{2}\left[\frac{\mu^{2} - 2\mu\mu_0 + \mu_0^{2}}{\sigma_0^{2}} + \sum_{k=1}^{n}\left(\frac{x_k^{2}}{\sigma^{2}} - \frac{2 x_k \mu}{\sigma^{2}} + \frac{\mu^{2}}{\sigma^{2}}\right)\right]\right\}$$

  $$= \alpha'' \exp\!\left\{-\frac{1}{2}\left[\frac{\mu^{2} - 2\mu\mu_0}{\sigma_0^{2}} + \sum_{k=1}^{n}\left(-\frac{2 x_k \mu}{\sigma^{2}} + \frac{\mu^{2}}{\sigma^{2}}\right)\right]\right\}$$

  where the factors 1/(√(2π) σ) and 1/(√(2π) σ0) and the terms µ0²/σ0² and Σk xk²/σ², all independent of µ, have been absorbed into the constants α′ and α″.

Univariate Gaussian Case (Cont.)
• Now we need to work this into a simpler form:

  $$p(\mu \mid D) = \alpha'' \exp\!\left\{-\frac{1}{2}\left[\frac{\mu^{2} - 2\mu\mu_0}{\sigma_0^{2}} + \sum_{k=1}^{n}\left(-\frac{2 x_k \mu}{\sigma^{2}} + \frac{\mu^{2}}{\sigma^{2}}\right)\right]\right\}$$

  $$= \alpha'' \exp\!\left\{-\frac{1}{2}\left[\left(\frac{n}{\sigma^{2}} + \frac{1}{\sigma_0^{2}}\right)\mu^{2} - 2\left(\frac{1}{\sigma^{2}}\sum_{k=1}^{n} x_k + \frac{\mu_0}{\sigma_0^{2}}\right)\mu\right]\right\}$$

  $$= \alpha'' \exp\!\left\{-\frac{1}{2}\left[\left(\frac{n}{\sigma^{2}} + \frac{1}{\sigma_0^{2}}\right)\mu^{2} - 2\left(\frac{n}{\sigma^{2}}\hat{\mu}_n + \frac{\mu_0}{\sigma_0^{2}}\right)\mu\right]\right\}, \qquad \hat{\mu}_n = \frac{1}{n}\sum_{k=1}^{n} x_k$$

Univariate Gaussian Case (Cont.)
• p(µ|D) is an exponential of a quadratic function of µ, which makes it a normal density. Because this remains true for any number of samples n, it is referred to as a reproducing density.
• Write p(µ|D) ~ N(µn, σn²):

  $$p(\mu \mid D) = \frac{1}{\sqrt{2\pi}\,\sigma_n} \exp\!\left[-\frac{1}{2}\left(\frac{\mu - \mu_n}{\sigma_n}\right)^{2}\right]$$

• Expand the quadratic term:

  $$p(\mu \mid D) = \frac{1}{\sqrt{2\pi}\,\sigma_n} \exp\!\left[-\frac{1}{2}\,\frac{\mu^{2} - 2\mu\mu_n + \mu_n^{2}}{\sigma_n^{2}}\right]$$

• Equate coefficients of our two expressions:

  $$\frac{1}{\sqrt{2\pi}\,\sigma_n} \exp\!\left[-\frac{1}{2}\,\frac{\mu^{2} - 2\mu\mu_n + \mu_n^{2}}{\sigma_n^{2}}\right] = \alpha'' \exp\!\left\{-\frac{1}{2}\left[\left(\frac{n}{\sigma^{2}} + \frac{1}{\sigma_0^{2}}\right)\mu^{2} - 2\left(\frac{n}{\sigma^{2}}\hat{\mu}_n + \frac{\mu_0}{\sigma_0^{2}}\right)\mu\right]\right\}$$
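To make the derivation above concrete, here is a minimal sketch that builds p(µ|D) ∝ Πk p(xk|µ) p(µ) numerically on a grid (the grid discretization, data values, and variable names are illustrative assumptions, not from the handout) and reads off the mean and variance of the resulting density; the next slides derive these same quantities in closed form as µn and σn².

```python
import numpy as np
from scipy.stats import norm

# Illustrative settings (not from the handout): known sigma, Gaussian prior N(mu0, sigma0^2).
sigma, mu0, sigma0 = 1.0, 0.0, 2.0
x = np.array([1.2, 0.8, 1.5, 0.9, 1.1])          # observed samples D = {x_1, ..., x_n}

# Grid over candidate values of the unknown mean mu.
mu = np.linspace(-4.0, 4.0, 4001)
dmu = mu[1] - mu[0]

# log p(mu|D) = const + sum_k log p(x_k|mu) + log p(mu); alpha is handled by normalizing.
log_post = norm.logpdf(x[:, None], loc=mu, scale=sigma).sum(axis=0) \
           + norm.logpdf(mu, loc=mu0, scale=sigma0)
post = np.exp(log_post - log_post.max())
post /= post.sum() * dmu                          # normalize so that the posterior integrates to 1

# Mean and variance of the grid posterior (numerical counterparts of mu_n and sigma_n^2).
mean_grid = (mu * post).sum() * dmu
var_grid = ((mu - mean_grid) ** 2 * post).sum() * dmu
print(mean_grid, var_grid)
```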
Univariate Gaussian Case (Cont.)
• Rearrange terms so that the dependencies on µ are clear:

  $$\frac{1}{\sqrt{2\pi}\,\sigma_n} \exp\!\left[-\frac{\mu_n^{2}}{2\sigma_n^{2}}\right] \exp\!\left[-\frac{1}{2}\left(\frac{1}{\sigma_n^{2}}\mu^{2} - \frac{2\mu_n}{\sigma_n^{2}}\mu\right)\right] = \alpha'' \exp\!\left[-\frac{1}{2}\left(\left(\frac{n}{\sigma^{2}} + \frac{1}{\sigma_0^{2}}\right)\mu^{2} - 2\left(\frac{n}{\sigma^{2}}\hat{\mu}_n + \frac{\mu_0}{\sigma_0^{2}}\right)\mu\right)\right]$$

• Associate the terms related to µ² and to µ:

  $$\mu^{2}: \qquad \frac{1}{\sigma_n^{2}} = \frac{n}{\sigma^{2}} + \frac{1}{\sigma_0^{2}}$$

  $$\mu: \qquad \frac{\mu_n}{\sigma_n^{2}} = \frac{n}{\sigma^{2}}\hat{\mu}_n + \frac{\mu_0}{\sigma_0^{2}}$$

• There is actually a third equation, involving the terms not related to µ,

  $$\frac{1}{\sqrt{2\pi}\,\sigma_n} \exp\!\left[-\frac{\mu_n^{2}}{2\sigma_n^{2}}\right] = \alpha''$$

  but we can ignore it, since it only fixes the normalization constant and is not a function of µ.

Univariate Gaussian Case (Cont.)
• Two equations and two unknowns. Solve for µn and σn². First, solve for σn²:

  $$\sigma_n^{2} = \frac{1}{\dfrac{n}{\sigma^{2}} + \dfrac{1}{\sigma_0^{2}}} = \frac{\sigma_0^{2}\sigma^{2}}{n\sigma_0^{2} + \sigma^{2}}$$

• Next, solve for µn:

  $$\mu_n = \sigma_n^{2}\left(\frac{n}{\sigma^{2}}\hat{\mu}_n + \frac{\mu_0}{\sigma_0^{2}}\right) = \frac{\sigma_0^{2}\sigma^{2}}{n\sigma_0^{2} + \sigma^{2}} \cdot \frac{n}{\sigma^{2}}\,\hat{\mu}_n + \frac{\sigma_0^{2}\sigma^{2}}{n\sigma_0^{2} + \sigma^{2}} \cdot \frac{\mu_0}{\sigma_0^{2}}$$

• Summarizing:

  $$\mu_n = \frac{n\sigma_0^{2}}{n\sigma_0^{2} + \sigma^{2}}\,\hat{\mu}_n + \frac{\sigma^{2}}{n\sigma_0^{2} + \sigma^{2}}\,\mu_0, \qquad \sigma_n^{2} = \frac{\sigma_0^{2}\sigma^{2}}{n\sigma_0^{2} + \sigma^{2}}$$

Bayesian Learning
• µn represents our best guess for the mean after observing n samples.
• σn² represents our uncertainty about this guess.
• σn² approaches σ²/n for large n: each additional observation decreases our uncertainty. (A small numerical illustration of this behavior follows the summary below.)
• The posterior, p(µ|D), becomes more and more sharply peaked as n grows large. This is known as Bayesian learning.

Bayesian Learning
[Figure slide; the figure itself is not reproduced in this text preview.]

Summary
• Introduction to Bayesian parameter estimation.
• The role of the class-conditional distribution in a Bayesian estimate.
• Estimation of the posterior and of the probability density function when the only unknown parameter is the mean and the conditional density of the features given the mean, p(x|µ), is modeled as a Gaussian distribution.
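As referenced on the Bayesian Learning slide above, the sketch below evaluates the closed-form µn and σn² for growing n and shows σn² shrinking toward σ²/n, which is the Bayesian-learning behavior described in the handout. The simulated data, random seed, and variable names are assumptions for illustration only.

```python
import numpy as np

# Illustrative prior and likelihood parameters (not from the handout).
sigma, mu0, sigma0 = 1.0, 0.0, 2.0
rng = np.random.default_rng(0)
true_mu = 1.5
x = rng.normal(true_mu, sigma, size=1000)          # simulated samples

def posterior_params(samples, sigma, mu0, sigma0):
    """Closed-form mu_n and sigma_n^2 for a Gaussian likelihood with known sigma
    and a Gaussian prior N(mu0, sigma0^2) on the unknown mean."""
    n = len(samples)
    mu_hat = samples.mean()                         # sample mean (the ML estimate)
    sigma_n2 = (sigma0**2 * sigma**2) / (n * sigma0**2 + sigma**2)
    mu_n = (n * sigma0**2 / (n * sigma0**2 + sigma**2)) * mu_hat \
           + (sigma**2 / (n * sigma0**2 + sigma**2)) * mu0
    return mu_n, sigma_n2

for n in (1, 10, 100, 1000):
    mu_n, sigma_n2 = posterior_params(x[:n], sigma, mu0, sigma0)
    # sigma_n^2 approaches sigma^2 / n: the posterior p(mu|D) sharpens as n grows.
    print(f"n={n:5d}  mu_n={mu_n:6.3f}  sigma_n^2={sigma_n2:8.5f}  sigma^2/n={sigma**2 / n:8.5f}")
```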