SPR_LectureHandouts_Chapter_03_Part1

# SPR_LectureHandouts_Chapter_03_Part1 - Pattern Recognition...

Pattern Recognition, ECE-8443

Chapter 3, Part 1: Maximum-Likelihood Parameter Estimation

Saurabh Prasad, Electrical and Computer Engineering Department, Mississippi State University

## Outline

- Introduction
- Maximum-Likelihood Estimation
  - Example of a Specific Case
  - The Gaussian case: unknown µ and σ
  - Bias
- ML Problem Statement

## Introduction

- Data availability in a Bayesian framework
  - We could design an optimal classifier if we knew:
    - $P(\omega_i)$ (priors)
    - $p(x \mid \omega_i)$ (class-conditional densities)
  - Unfortunately, we rarely have this complete information!
  - What we typically have is a set of design samples (training data) for each class
- Design a classifier using training samples (training feature vectors available per class)
  - Estimating prior probabilities is usually not an issue
  - The number of samples is often too small for estimating likelihoods (large dimension of feature space!)

## Parameter estimation

- A priori information about the problem
  - Prior information about the general shape of the likelihoods and any possible parametrization can help simplify the estimation problem
  - Normality of $p(x \mid \omega_i)$: $p(x \mid \omega_i) \sim N(\mu_i, \Sigma_i)$
    - Characterized by 2 sets of parameters
    - Only need to learn first- and second-order moments
- Estimation techniques
  - Maximum-Likelihood (ML) and Bayesian estimation
  - Results are nearly identical, but the approaches are different
- Parameters in ML estimation are fixed but unknown!
- Best parameters are obtained by maximizing the probability of obtaining the samples observed
- Bayesian methods view the parameters as random variables having some known distribution
- In either approach, we use the posterior probability, $P(\omega_i \mid x)$, for our classification rule!

## Maximum-likelihood estimation

- Has good convergence properties as the sample size increases
- Often simpler than alternative techniques
- General principle
  - Assume we have $c$ classes and $p(x \mid \omega_j) \sim N(\mu_j, \Sigma_j)$, so that $p(x \mid \omega_j) \equiv p(x \mid \omega_j, \theta_j)$, where:

$$\theta_j = (\mu_j, \Sigma_j) = (\mu_j^1, \mu_j^2, \ldots, \sigma_j^{11}, \sigma_j^{22}, \operatorname{cov}(x_j^m, x_j^n), \ldots)$$

- I.I.D.: $c$ data sets, $D_1, \ldots, D_c$, where the samples in $D_j$ are drawn independently according to $p(x \mid \omega_j)$
- Assume $p(x \mid \omega_j)$ has a known parametric form and is completely determined by the parameter vector $\theta_j$ (e.g., $p(x \mid \omega_j) \sim N(\mu_j, \Sigma_j)$, where $\theta_j = [\mu_1, \ldots, \mu_d, \Sigma_{11}, \Sigma_{12}, \ldots, \Sigma_{dd}]$ collects the components of the mean and covariance of class $j$)
- $p(x \mid \omega_j)$ has an explicit dependence on $\theta_j$: $p(x \mid \omega_j, \theta_j)$
- Use training samples to estimate $\theta_1, \theta_2, \ldots, \theta_c$
- Functional independence: assume $D_i$ gives no useful information about $\theta_j$ for $i \neq j$ (samples of one class provide no information about another class). This allows us to have $c$ separate problems of the following form:
  - Use the set $D$ of training samples $(x_1, \ldots, x_n)$ drawn independently from $p(x \mid \theta)$ to estimate the unknown parameter vector $\theta$ for each class separately.
- Because the samples were drawn independently:

$$p(D \mid \theta) = p(\{x_1, x_2, x_3, \ldots, x_n\} \mid \theta) = \prod_{k=1}^{n} p(x_k \mid \theta)$$

[Figure: design/training samples grouped into data sets $D_1, \ldots, D_k, \ldots, D_c$, with $|D| = n$; labels $p(x_j \mid \omega_1)$, $p(x_j \mid \omega_k)$, $N(\mu_j, \Sigma_j)$.]

- Use the information provided by the training samples to estimate $\theta = (\theta_1, \theta_2, \ldots, \theta_c)$, where each $\theta_i$ ($i = 1, 2, \ldots, c$) is associated with one category
- Suppose that $D$ contains $n$ samples, $x_1, x_2, \ldots, x_n$:

$$p(D \mid \theta) = \prod_{k=1}^{n} p(x_k \mid \theta) = F(\theta)$$

  ($p(D \mid \theta)$ is called the likelihood of $\theta$ with respect to the set of samples.)
- The ML estimate of $\theta$ is, by definition, the value $\hat{\theta}$ that maximizes $p(D \mid \theta)$: "It is the value of $\theta$ that best agrees with the actually observed training sample."
- Optimal estimation
  - Let $\theta = (\theta_1, \theta_2, \ldots, \theta_p)^t$ and let $\nabla_\theta$ be the gradient operator:

$$\nabla_\theta = \left[ \frac{\partial}{\partial \theta_1}, \frac{\partial}{\partial \theta_2}, \ldots, \frac{\partial}{\partial \theta_p} \right]^t$$

  - We define $l(\theta)$ as the log-likelihood function: $l(\theta) \equiv \ln p(D \mid \theta)$
  - New problem statement: determine the $\theta$ that maximizes the log-likelihood:

$$\hat{\theta} = \arg\max_{\theta} \, l(\theta)$$

- Why take ln()?
  - Computational/analytical convenience for Normal pdfs
  - Numerical accuracy (e.g., probabilities numerically tending to zero)
  - Since ln() is monotonically increasing, it does not affect the maximization
- Because the samples are independent, the log-likelihood is a sum over the samples:

$$l(\theta) = \ln\left( \prod_{k=1}^{n} p(x_k \mid \theta) \right) = \sum_{k=1}^{n} \ln p(x_k \mid \theta)$$

- The ML estimate is found by solving this equation:

$$\nabla_\theta l = \nabla_\theta \left[ \sum_{k=1}^{n} \ln p(x_k \mid \theta) \right] = \sum_{k=1}^{n} \nabla_\theta \ln p(x_k \mid \theta) = 0$$

- The solution to this equation can be a global maximum, a local maximum, or even an inflection point
- Under what conditions is it a global maximum?
- Precaution is needed if the extremum is close to the boundary of the parameter space

## Gaussian Case: Unknown Mean

- Consider the case where only the mean, $\theta = \mu$, is unknown. The ML condition is:

$$\sum_{k=1}^{n} \nabla_\theta \ln p(x_k \mid \theta) = 0$$

- The log-likelihood of a single sample is:

$$\begin{aligned}
\ln p(x_k \mid \theta) &= \ln\left[ \frac{1}{(2\pi)^{d/2} \lvert\Sigma\rvert^{1/2}} \exp\left[ -\frac{1}{2} (x_k - \theta)^t \Sigma^{-1} (x_k - \theta) \right] \right] \\
&= -\frac{1}{2} \ln\left[ (2\pi)^d \lvert\Sigma\rvert \right] - \frac{1}{2} (x_k - \theta)^t \Sigma^{-1} (x_k - \theta)
\end{aligned}$$

- Now, differentiating with respect to $\theta$:

$$\begin{aligned}
\frac{\partial}{\partial \theta} \left[ -\frac{1}{2} \ln\left[ (2\pi)^d \lvert\Sigma\rvert \right] - \frac{1}{2} (x_k - \theta)^t \Sigma^{-1} (x_k - \theta) \right]
&= \frac{\partial}{\partial \theta} \left[ -\frac{1}{2} \ln\left[ (2\pi)^d \lvert\Sigma\rvert \right] \right] - \frac{\partial}{\partial \theta} \left[ \frac{1}{2} (x_k - \theta)^t \Sigma^{-1} (x_k - \theta) \right] \\
&= \Sigma^{-1} (x_k - \theta)
\end{aligned}$$

  which implies:

$$\nabla_\theta \ln p(x_k \mid \theta) = \Sigma^{-1} (x_k - \theta)$$

- Substituting into the expression for the total likelihood:

$$\nabla_\theta l = \sum_{k=1}^{n} \nabla_\theta \ln p(x_k \mid \theta) = \sum_{k=1}^{n} \Sigma^{-1} (x_k - \theta) = 0$$

- Rearranging terms:

$$\begin{aligned}
\sum_{k=1}^{n} \Sigma^{-1} (x_k - \hat{\theta}) &= 0 \\
\sum_{k=1}^{n} (x_k - \hat{\theta}) &= 0 \\
\sum_{k=1}^{n} x_k - \sum_{k=1}^{n} \hat{\theta} &= 0 \\
\sum_{k=1}^{n} x_k - n \hat{\theta} &= 0 \\
\hat{\theta} &= \frac{1}{n} \sum_{k=1}^{n} x_k
\end{aligned}$$

- Significance?

## Gaussian Case: Unknown Mean and Variance

- Let $\theta = [\mu, \sigma^2]$, so that $\theta_1 = \mu$ and $\theta_2 = \sigma^2$.
The log-likelihood of a single point is:

$$\ln p(x_k \mid \theta) = -\frac{1}{2} \ln\left[ 2\pi \theta_2 \right] - \frac{1}{2\theta_2} (x_k - \theta_1)^2$$

$$\nabla_\theta l = \nabla_\theta \ln p(x_k \mid \theta) = \begin{bmatrix} \dfrac{1}{\theta_2} (x_k - \theta_1) \\[2ex] -\dfrac{1}{2\theta_2} + \dfrac{(x_k - \theta_1)^2}{2\theta_2^2} \end{bmatrix}$$

- Summing over all $n$ data points and setting the gradient of the full log-likelihood to zero leads to:

$$\sum_{k=1}^{n} \frac{1}{\hat{\theta}_2} (x_k - \hat{\theta}_1) = 0$$

  and

$$\sum_{k=1}^{n} \left[ -\frac{1}{2\hat{\theta}_2} + \frac{(x_k - \hat{\theta}_1)^2}{2\hat{\theta}_2^2} \right] = 0 \;\Rightarrow\; \sum_{k=1}^{n} (x_k - \hat{\theta}_1)^2 = \sum_{k=1}^{n} \hat{\theta}_2$$

- This leads to these equations, the sample mean and the sample variance:

$$\hat{\theta}_1 = \hat{\mu} = \frac{1}{n} \sum_{k=1}^{n} x_k \qquad \text{and} \qquad \hat{\theta}_2 = \hat{\sigma}^2 = \frac{1}{n} \sum_{k=1}^{n} (x_k - \hat{\mu})^2$$

- In the multivariate case, the sample mean vector and the sample covariance matrix:

$$\hat{\mu} = \frac{1}{n} \sum_{k=1}^{n} x_k \qquad \text{and} \qquad \hat{\Sigma} = \frac{1}{n} \sum_{k=1}^{n} (x_k - \hat{\mu})(x_k - \hat{\mu})^t$$

- The true covariance is the expected value of the matrix $(x - \mu)(x - \mu)^t$, taken about the true mean $\mu$, so an ML estimate that averages the matrices $(x_k - \hat{\mu})(x_k - \hat{\mu})^t$ is a familiar result.

## Maximum Likelihood Estimator: Bias in the estimates

- Does the maximum likelihood estimate of the variance converge to the true value of the variance? Let's start with a few simple results we will need later.
- Mean and variance of the ML estimate of the mean:

$$E[\hat{\mu}] = E\left[ \frac{1}{n} \sum_{i=1}^{n} x_i \right] = \frac{1}{n} \sum_{i=1}^{n} E[x_i] = \frac{1}{n} \sum_{i=1}^{n} \mu = \mu$$

$$\begin{aligned}
\operatorname{var}[\hat{\mu}] &= E[\hat{\mu}^2] - (E[\hat{\mu}])^2 \\
&= E[\hat{\mu}^2] - \mu^2 \\
&= E\left[ \frac{1}{n} \sum_{i=1}^{n} x_i \, \frac{1}{n} \sum_{j=1}^{n} x_j \right] - \mu^2 \\
&= \frac{1}{n^2} \sum_{i=1}^{n} \sum_{j=1}^{n} E[x_i x_j] - \mu^2
\end{aligned}$$

- Since the expected value of the ML estimate for the mean equals the true value, it is an unbiased estimate.
- The expected value of $x_i x_j$ will be $\mu^2$ for $j \neq i$, since the two random variables are independent.
- The expected value of $x_i^2$ will be $\mu^2 + \sigma^2$.
- Hence, in the summation above, we have $n^2 - n$ terms with expected value $\mu^2$ and $n$ terms with expected value $\mu^2 + \sigma^2$.
- Thus,

$$\operatorname{var}[\hat{\mu}] = \frac{1}{n^2} \left( (n^2 - n)\mu^2 + n(\mu^2 + \sigma^2) \right) - \mu^2 = \frac{\sigma^2}{n}$$

  which implies:

$$E[\hat{\mu}^2] = \operatorname{var}[\hat{\mu}] + (E[\hat{\mu}])^2 = \frac{\sigma^2}{n} + \mu^2$$

- Note: the variance of the mean estimate goes to zero as $n$ goes to infinity, and our estimate converges to the true value (the error goes to zero).
- We will need one more result:

$$\begin{aligned}
\sigma^2 = E[(x - \mu)^2] &= E[x^2] - 2E[x]\mu + E[\mu^2] \\
&= E[x^2] - 2\mu^2 + E[\mu^2] \\
&= E[x^2] - \mu^2
\end{aligned}$$

- Now we can combine these results.
Recall our expression for the ML estimate of the variance:

$$E[\hat{\sigma}^2] = E\left[ \frac{1}{n} \sum_{i=1}^{n} (x_i - \hat{\mu})^2 \right]$$

- Expand the variance expression and simplify:

$$\begin{aligned}
E[\hat{\sigma}^2] &= E\left[ \frac{1}{n} \sum_{i=1}^{n} (x_i - \hat{\mu})^2 \right] \\
&= \frac{1}{n} E\left[ \sum_{i=1}^{n} (x_i^2 - 2 x_i \hat{\mu} + \hat{\mu}^2) \right] \\
&= \frac{1}{n} \sum_{i=1}^{n} \left( E[x_i^2] - 2E[x_i \hat{\mu}] + E[\hat{\mu}^2] \right) \\
&= \frac{1}{n} \sum_{i=1}^{n} \left( (\sigma^2 + \mu^2) - 2E[x_i \hat{\mu}] + (\mu^2 + \sigma^2/n) \right)
\end{aligned}$$

- One more intermediate term to derive:

$$\begin{aligned}
E[x_i \hat{\mu}] &= E\left[ x_i \frac{1}{n} \sum_{j=1}^{n} x_j \right] = \frac{1}{n} \sum_{j=1}^{n} E[x_i x_j] = \frac{1}{n} \left( \sum_{j \neq i} E[x_i x_j] + E[x_i x_i] \right) \\
&= \frac{1}{n} \left( (n - 1)\mu^2 + \mu^2 + \sigma^2 \right) = \frac{1}{n} \left( n\mu^2 + \sigma^2 \right) = \mu^2 + \frac{\sigma^2}{n}
\end{aligned}$$

- Substitute our previously derived expression for the second term:

$$\begin{aligned}
E[\hat{\sigma}^2] &= \frac{1}{n} \sum_{i=1}^{n} \left( (\sigma^2 + \mu^2) - 2E[x_i \hat{\mu}] + (\mu^2 + \sigma^2/n) \right) \\
&= \frac{1}{n} \sum_{i=1}^{n} \left( (\sigma^2 + \mu^2) - 2(\mu^2 + \sigma^2/n) + (\mu^2 + \sigma^2/n) \right) \\
&= \frac{1}{n} \sum_{i=1}^{n} \left( \sigma^2 + \mu^2 - 2\mu^2 + \mu^2 - 2\sigma^2/n + \sigma^2/n \right) \\
&= \frac{1}{n} \sum_{i=1}^{n} \left( \sigma^2 - \sigma^2/n \right) \\
&= \frac{1}{n} \sum_{i=1}^{n} \sigma^2 (1 - 1/n) = \frac{1}{n} \sum_{i=1}^{n} \sigma^2 \frac{(n - 1)}{n} \\
&= \frac{(n - 1)}{n} \sigma^2
\end{aligned}$$

- Therefore, the ML estimate is biased:

$$E[\hat{\sigma}^2] = E\left[ \frac{1}{n} \sum_{i=1}^{n} (x_i - \hat{\mu})^2 \right] = \frac{n - 1}{n} \sigma^2 \neq \sigma^2$$

  Since the expected value of the ML estimate for the variance does not equal the true value, it is a biased estimate. However, the ML estimate converges to the actual value as $n$ becomes large.
- An unbiased estimator is:

$$C = \frac{1}{n - 1} \sum_{i=1}^{n} (x_i - \hat{\mu})(x_i - \hat{\mu})^t$$

- These are related by:

$$\hat{\Sigma} = \frac{(n - 1)}{n} C$$
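The bias result can be checked numerically. The following is a minimal sketch in Python for the scalar (one-dimensional) case, using only the standard library. The function names (`ml_estimates`, `unbiased_variance`, `average_ml_variance`) and the sample values are illustrative assumptions and are not part of the lecture material.

```python
import random


def ml_estimates(samples):
    """Return the ML estimates (mean, variance) for scalar Gaussian samples."""
    n = len(samples)
    mean = sum(samples) / n                              # (1/n) * sum of x_k
    var_ml = sum((x - mean) ** 2 for x in samples) / n   # divides by n, hence biased
    return mean, var_ml


def unbiased_variance(samples):
    """Return the unbiased variance estimate, which divides by n - 1."""
    n = len(samples)
    mean = sum(samples) / n
    return sum((x - mean) ** 2 for x in samples) / (n - 1)


def average_ml_variance(n, true_mean, true_var, trials, seed=0):
    """Average the ML variance estimate over many simulated data sets of size n."""
    rng = random.Random(seed)          # fixed seed so the run is repeatable
    sigma = true_var ** 0.5
    total = 0.0
    for _ in range(trials):
        data = [rng.gauss(true_mean, sigma) for _ in range(n)]
        total += ml_estimates(data)[1]
    return total / trials


if __name__ == "__main__":
    # Hypothetical data set, chosen only to make the arithmetic easy to follow.
    data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
    mean, var_ml = ml_estimates(data)
    print("ML mean:", mean)
    print("ML variance:", var_ml)
    print("Unbiased variance:", unbiased_variance(data))

    # Compare the simulated average of the ML variance with ((n - 1) / n) * sigma^2.
    n, true_var = 5, 4.0
    print("Simulated average:", average_ml_variance(n, 1.0, true_var, trials=20000))
    print("Predicted value:", (n - 1) / n * true_var)
```

With a small `n`, the simulated average of the ML variance should sit noticeably below the true variance and close to $((n - 1)/n)\sigma^2$, while the two estimators on any single data set differ exactly by the factor $(n - 1)/n$.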

## This note was uploaded on 02/20/2012 for the course ECE 8443 taught by Professor Staff during the Spring '10 term at University of Houston.
