Pattern Recognition
ECE 8443 Chapter 3, Part 1
Maximum-Likelihood Parameter Estimation
Saurabh Prasad
Electrical and Computer Engineering Department,
Mississippi State University

Outline
• Introduction
• Maximum-Likelihood Estimation
  – Example of a Specific Case
  – The Gaussian case: unknown µ and σ
  – Bias
• ML Problem Statement

Introduction
– Data availability in a Bayesian framework
  • We could design an optimal classifier if we knew:
    – P(ωi) (the priors)
    – p(x | ωi) (the class-conditional densities)
  • Unfortunately, we rarely have this complete information!
  • What we typically have is a set of design samples (training data) for each class
    – Design a classifier using training samples (training feature vectors available per class)
  • Estimating prior probabilities is usually not an issue
  • Sample sets are often too small for estimating likelihoods (large dimension of feature space!)

Parameter estimation
– A priori information about the problem
  – Prior knowledge about the general shape of the likelihoods, and any possible parametrization, can help simplify the estimation problem
  – Normality of p(x | ωi):

        p(x | ωi) ~ N(µi, Σi)

    • Characterized by 2 sets of parameters (mean vector and covariance matrix)
    • Only need to learn first- and second-order moments
– Estimation techniques
  • Maximum-Likelihood (ML) and Bayesian estimation
  • Results are nearly identical, but the approaches are conceptually different

Parameter estimation
• Parameters in ML estimation are fixed but unknown!
• The best parameters are obtained by maximizing the probability of obtaining the samples observed
• Bayesian methods view the parameters as random variables having some known distribution
• In either approach, we use the posterior probability P(ωi | x) for our classification rule!

Maximum-likelihood estimation
• Has good convergence properties as the sample size increases
• Simpler than alternative techniques
– General principle
  • Assume we have c classes and

        p(x | ωj) ~ N(µj, Σj)
        p(x | ωj) ≡ p(x | ωj, θj)

    where:

        θj = (µj, Σj) = (µ1^j, µ2^j, ..., σ11^j, σ22^j, cov(xm^j, xn^j), ...)

Maximum-likelihood estimation
• I.I.D.: c data sets, D1, ..., Dc, where the samples in Dj are drawn independently according to p(x | ωj)
• Assume p(x | ωj) has a known parametric form and is completely determined by the parameter vector θj (e.g., p(x | ωj) ~ N(µj, Σj), where θj = [µ1, ..., µd, Σ11, Σ12, ..., Σdd])
• p(x | ωj) then has an explicit dependence on θj: p(x | ωj, θj)
• Use the training samples to estimate θ1, θ2, ..., θc

Maximum-likelihood estimation
• Functional independence: assume Di gives no useful information about θj for i ≠ j (samples of one class provide no information about another class). This allows us to work with c separate problems of the following form:
• Use the set D of training samples (x1, ..., xn) drawn independently from p(x | θ) to estimate the unknown parameter vector θ for each class separately.
• Because the samples were drawn independently:

    p(D | θ) = p({x1, x2, x3, ..., xn} | θ) = ∏_{k=1}^{n} p(xk | θ)

Maximum-likelihood estimation
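The product above underflows numerically for large n, which is one reason the slides later switch to the logarithm; a minimal pure-Python sketch (the toy samples and parameter values are made up for illustration) evaluates the i.i.d. likelihood in the log domain:

```python
import math

def gauss_log_pdf(x, mu, sigma2):
    """ln of the univariate normal density N(x; mu, sigma2)."""
    return -0.5 * math.log(2 * math.pi * sigma2) - (x - mu) ** 2 / (2 * sigma2)

def log_likelihood(samples, mu, sigma2):
    """ln p(D | theta) = sum_k ln p(x_k | theta) for i.i.d. samples."""
    return sum(gauss_log_pdf(x, mu, sigma2) for x in samples)

D = [1.2, 0.8, 1.1, 0.9, 1.0]   # toy training set for one class
print(log_likelihood(D, mu=1.0, sigma2=0.1))
```

Parameter values closer to the data-generating ones yield a larger likelihood; this is the quantity ML maximizes.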
[Figure: the full training set D of n samples is partitioned into per-class design sets D1, ..., Dk, ..., Dc; the samples in each Dj are drawn from the class-conditional density p(x | ωj) ~ N(µj, Σj).]

Maximum-likelihood estimation
• Use the information provided by the training samples to estimate θ = (θ1, θ2, ..., θc); each θi (i = 1, 2, ..., c) is associated with one category
• Suppose that D contains n samples, x1, x2, ..., xn:

    p(D | θ) = ∏_{k=1}^{n} p(xk | θ) = F(θ)

  p(D | θ) is called the likelihood of θ with respect to the set of samples
• The ML estimate of θ is, by definition, the value θ̂ that maximizes p(D | θ):
  “It is the value of θ that best agrees with the actually observed training samples”

Maximum-likelihood estimation
• Optimal estimation
  – Let θ = (θ1, θ2, ..., θp)^t and let ∇θ be the gradient operator:

        ∇θ = [∂/∂θ1, ∂/∂θ2, ..., ∂/∂θp]^t

  – We define l(θ) as the log-likelihood function:

        l(θ) = ln p(D | θ)

  – New problem statement: determine the θ that maximizes the log-likelihood,

        θ̂ = arg max_θ l(θ)

  Why take ln()?
  • Computational/analytical convenience for normal pdfs
  • Numerical accuracy (e.g., probabilities numerically tending to zero)
  • Since ln() is monotonically increasing, it does not affect the maximization

Maximum-likelihood estimation
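As an illustrative numerical counterpart to θ̂ = arg max l(θ) (not from the slides; the data and grid are made up): with σ² held fixed, a brute-force grid search over candidate means locates the maximizer of the log-likelihood:

```python
import math

def log_likelihood(samples, mu, sigma2=1.0):
    # l(mu) = sum_k ln N(x_k; mu, sigma2), with sigma2 held fixed
    return sum(-0.5 * math.log(2 * math.pi * sigma2)
               - (x - mu) ** 2 / (2 * sigma2) for x in samples)

D = [2.1, 1.9, 2.4, 1.6, 2.0]             # toy samples
grid = [i / 1000 for i in range(4001)]    # candidate mu values in [0, 4]
mu_hat = max(grid, key=lambda mu: log_likelihood(D, mu))
print(mu_hat)   # lands on the sample mean, 2.0
```

The analytic route on the following slides replaces this search with a closed-form stationarity condition.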
Let θ = (θ1, θ2, ..., θp)^t and ∇θ = [∂/∂θ1, ..., ∂/∂θp]^t.
Define l(θ) ≡ ln p(D | θ). Then

    l(θ) = ln ∏_{k=1}^{n} p(xk | θ) = ∑_{k=1}^{n} ln p(xk | θ)

    θ̂ = arg max_θ l(θ)

• The ML estimate is found by solving this equation:

    ∇θ l = ∇θ [∑_{k=1}^{n} ln p(xk | θ)] = ∑_{k=1}^{n} ∇θ ln p(xk | θ) = 0

• The solution to this equation can be a global maximum, a local maximum, or even an inflection point
• Under what conditions is it a global maximum?
• Take precautions if the extremum is close to the boundary of the parameter space

Gaussian Case: Unknown Mean
• Consider the case where only the mean, θ = µ, is unknown. The ML condition is

    ∑_{k=1}^{n} ∇θ ln p(xk | θ) = 0

  where

    ln p(xk | θ) = ln [ (2π)^{-d/2} |Σ|^{-1/2} exp( −(1/2)(xk − θ)^t Σ^{-1} (xk − θ) ) ]
                 = −(1/2) ln[(2π)^d |Σ|] − (1/2)(xk − θ)^t Σ^{-1} (xk − θ)

• Now,

    ∂/∂θ [ −(1/2) ln[(2π)^d |Σ|] − (1/2)(xk − θ)^t Σ^{-1} (xk − θ) ] = Σ^{-1}(xk − θ)

  which implies:

    ∇θ ln p(xk | θ) = Σ^{-1}(xk − θ)

Gaussian Case: Unknown Mean
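A quick sanity check (illustrative, univariate: d = 1, where Σ^{-1}(xk − θ) reduces to (xk − θ)/σ²; the constants are arbitrary) comparing the analytic gradient with a central finite difference:

```python
import math

def log_pdf(x, theta, sigma2):
    """ln p(x | theta) for a univariate Gaussian with mean theta."""
    return -0.5 * math.log(2 * math.pi * sigma2) - (x - theta) ** 2 / (2 * sigma2)

x, theta, sigma2, h = 1.7, 0.5, 2.0, 1e-6
analytic = (x - theta) / sigma2    # 1-D analogue of Sigma^{-1}(x_k - theta)
numeric = (log_pdf(x, theta + h, sigma2) - log_pdf(x, theta - h, sigma2)) / (2 * h)
print(analytic, numeric)           # the two values agree
```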
• Substituting into the expression for the total likelihood:

    ∇θ l = ∑_{k=1}^{n} ∇θ ln p(xk | θ) = ∑_{k=1}^{n} Σ^{-1}(xk − θ̂) = 0

• Rearranging terms:

    ∑_{k=1}^{n} Σ^{-1}(xk − θ̂) = 0
    ∑_{k=1}^{n} (xk − θ̂) = 0
    ∑_{k=1}^{n} xk − n θ̂ = 0
    θ̂ = (1/n) ∑_{k=1}^{n} xk

• Significance: the ML estimate of the mean is simply the sample mean of the training data.

Gaussian Case: Unknown Mean
and Variance
• Let θ = (θ1, θ2)^t = (µ, σ²)^t. The log-likelihood of a SINGLE point is:

    ln p(xk | θ) = −(1/2) ln(2π θ2) − (xk − θ1)² / (2 θ2)

  and its gradient is

    ∇θ ln p(xk | θ) = [ (xk − θ1)/θ2 ,  −1/(2 θ2) + (xk − θ1)²/(2 θ2²) ]^t

• The FULL likelihood (summed over all n data points) leads to:

    ∑_{k=1}^{n} (1/θ̂2)(xk − θ̂1) = 0

  and

    ∑_{k=1}^{n} [ −1/(2 θ̂2) + (xk − θ̂1)²/(2 θ̂2²) ] = 0   ⇒   ∑_{k=1}^{n} (xk − θ̂1)² = ∑_{k=1}^{n} θ̂2 = n θ̂2

Gaussian Case: Unknown Mean
and Variance
• This leads to these equations:

    θ̂1 = µ̂ = (1/n) ∑_{k=1}^{n} xk            (sample mean)
    θ̂2 = σ̂² = (1/n) ∑_{k=1}^{n} (xk − µ̂)²    (sample variance)

• In the multivariate case:

    µ̂ = (1/n) ∑_{k=1}^{n} xk                      (sample mean vector)
    Σ̂ = (1/n) ∑_{k=1}^{n} (xk − µ̂)(xk − µ̂)^t     (sample covariance matrix)

• The true covariance is the expected value of the matrix (x − µ)(x − µ)^t, which makes this a familiar result.

Maximum Likelihood Estimator
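The closed-form estimates can be computed directly; a small pure-Python sketch (the 2-D data values are made up) of the sample mean vector and the ML (1/n-normalized) sample covariance matrix:

```python
def sample_mean(X):
    """ML estimate of the mean: the sample mean vector."""
    n, d = len(X), len(X[0])
    return [sum(x[i] for x in X) / n for i in range(d)]

def sample_cov_ml(X):
    """ML estimate of the covariance (divides by n, not n - 1)."""
    n, d = len(X), len(X[0])
    m = sample_mean(X)
    return [[sum((x[i] - m[i]) * (x[j] - m[j]) for x in X) / n
             for j in range(d)] for i in range(d)]

X = [[1.0, 2.0], [2.0, 1.0], [3.0, 3.0], [2.0, 2.0]]  # four 2-D samples
print(sample_mean(X))     # [2.0, 2.0]
print(sample_cov_ml(X))   # [[0.5, 0.25], [0.25, 0.5]]
```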
Bias in the estimates
• Does the maximum-likelihood estimate of the variance converge to the true value of the variance? Let’s start with a few simple results we will need later.
• Mean and variance of the ML estimate of the mean:

    E[µ̂] = E[(1/n) ∑_{i=1}^{n} xi] = (1/n) ∑_{i=1}^{n} E[xi] = (1/n) ∑_{i=1}^{n} µ = µ

    var[µ̂] = E[µ̂²] − (E[µ̂])²
            = E[µ̂²] − µ²
            = E[ (1/n ∑_{i=1}^{n} xi)(1/n ∑_{j=1}^{n} xj) ] − µ²
            = (1/n²) ∑_{i=1}^{n} ∑_{j=1}^{n} E[xi xj] − µ²

• Since the expected value of the ML estimate of the mean equals the true value, it is an unbiased estimate.

Maximum Likelihood Estimator
Bias in the estimates
• The expected value of xi xj will be µ² for j ≠ i, since the two random variables are independent.
• The expected value of xi² will be µ² + σ².
• Hence, in the summation above, we have n² − n terms with expected value µ² and n terms with expected value µ² + σ².
• Thus,

    var[µ̂] = (1/n²) [ (n² − n) µ² + n (µ² + σ²) ] − µ² = σ²/n

  which implies:

    E[µ̂²] = var[µ̂] + (E[µ̂])² = σ²/n + µ²

• Note: the variance of the mean estimate goes to zero as n goes to infinity, so our estimate converges to the true value (the error goes to zero).

Maximum Likelihood Estimator
Bias in the estimates
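The var[µ̂] = σ²/n result can be spot-checked by simulation; a Monte Carlo sketch (the seed, sample size, and trial count are arbitrary choices, not from the slides):

```python
import random
import statistics

random.seed(0)
mu, sigma, n, trials = 0.0, 1.0, 10, 20000
# Empirical variance of the sample mean over many repeated draws of size n
means = [statistics.mean(random.gauss(mu, sigma) for _ in range(n))
         for _ in range(trials)]
print(statistics.pvariance(means))   # close to sigma**2 / n = 0.1
```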
• We will need one more result:

    σ² = E[(x − µ)²]
       = E[x²] − 2 E[x] µ + µ²
       = E[x²] − 2µ² + µ²
       = E[x²] − µ²

• Now we can combine these results. Recall our expression for the ML estimate of the variance:

    E[σ̂²] = E[(1/n) ∑_{i=1}^{n} (xi − µ̂)²]

Maximum Likelihood Estimator
Bias in the estimates
• Expand the variance expression and simplify:

    E[σ̂²] = E[(1/n) ∑_{i=1}^{n} (xi − µ̂)²]
          = E[(1/n) ∑_{i=1}^{n} (xi² − 2 xi µ̂ + µ̂²)]
          = (1/n) ∑_{i=1}^{n} ( E[xi²] − 2 E[xi µ̂] + E[µ̂²] )
          = (1/n) ∑_{i=1}^{n} ( (σ² + µ²) − 2 E[xi µ̂] + (µ² + σ²/n) )

• One more intermediate term to derive:

    E[xi µ̂] = E[ xi (1/n) ∑_{j=1}^{n} xj ] = (1/n) ∑_{j=1}^{n} E[xi xj]
            = (1/n) ( ∑_{j≠i} E[xi xj] + E[xi²] )
            = (1/n) ( (n − 1) µ² + µ² + σ² )
            = (1/n) ( n µ² + σ² )
            = µ² + σ²/n

Maximum Likelihood Estimator
Bias in the estimates
• Substitute our previously derived expression for the second term:

    E[σ̂²] = (1/n) ∑_{i=1}^{n} ( (σ² + µ²) − 2 E[xi µ̂] + (µ² + σ²/n) )
          = (1/n) ∑_{i=1}^{n} ( (σ² + µ²) − 2(µ² + σ²/n) + (µ² + σ²/n) )
          = (1/n) ∑_{i=1}^{n} ( σ² + µ² − 2µ² + µ² − 2σ²/n + σ²/n )
          = (1/n) ∑_{i=1}^{n} ( σ² − σ²/n )
          = (1/n) ∑_{i=1}^{n} σ² (1 − 1/n)
          = ((n − 1)/n) σ²

Maximum Likelihood Estimator
Bias in the estimates
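The ((n − 1)/n) σ² result can also be observed empirically; a simulation sketch using the standard library, with arbitrary constants (statistics.pvariance divides by n, matching the ML estimate):

```python
import random
import statistics

random.seed(2)
sigma2, n, trials = 4.0, 8, 20000
# Average the ML variance estimate over many repeated draws of size n
avg_ml_var = statistics.mean(
    statistics.pvariance([random.gauss(0.0, sigma2 ** 0.5) for _ in range(n)])
    for _ in range(trials))
print(avg_ml_var)   # theory predicts (n - 1) / n * sigma2 = 3.5
```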
• Therefore, the ML estimate is biased:

    E[σ̂²] = E[(1/n) ∑_{i=1}^{n} (xi − µ̂)²] = ((n − 1)/n) σ² ≠ σ²

  Since the expected value of the ML estimate of the variance does not equal the true value, it is a biased estimate. However, the ML estimate converges to the actual value as n becomes large.
• An unbiased estimator is:

    C = (1/(n − 1)) ∑_{i=1}^{n} (xi − µ̂)(xi − µ̂)^t

• These are related by:

    Σ̂ = ((n − 1)/n) C
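In the univariate case, the two estimators and the (n − 1)/n relation between them map directly onto Python's standard library (pvariance divides by n, variance by n − 1); the data values below are made up:

```python
import statistics

D = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]   # toy samples
n = len(D)
ml_var = statistics.pvariance(D)    # ML estimate: divides by n
unbiased = statistics.variance(D)   # unbiased estimator C: divides by n - 1
print(ml_var)                       # 4.0
# The two are related by the (n - 1)/n factor derived above:
print(abs(ml_var - (n - 1) / n * unbiased) < 1e-12)   # True
```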
This note was uploaded on 02/20/2012 for the course ECE 8443 taught by Professor Staff during the Spring '10 term at the University of Houston.