Finite Mixture Models and Expectation Maximization

Slides for this topic are courtesy of Dr. Anil Jain.

Recall: The Supervised Learning Problem

Given a set of n samples X = {(x_i, y_i)}, i = 1, ..., n (Chapter 3 of DHS):
• Assume the examples in each class come from a parameterized Gaussian density.
• Estimate the parameters (mean, variance) of the Gaussian density for each class, and use them for classification.
• Estimation uses the Maximum Likelihood approach.

Review of Maximum Likelihood

Given n i.i.d. examples from a density p(x; θ), with known form p and unknown parameter θ:
• Goal: estimate θ, denoted by θ̂, such that the observed data is most likely to have come from the distribution with that θ.
• Steps involved:
  • Write the likelihood of the observed data.
  • Maximize the likelihood with respect to the parameter (a minimal sketch follows).
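For a 1-D Gaussian these two steps can be carried out in closed form. A minimal numpy sketch (not part of the original slides; the true parameters below are made up for illustration):

```python
# MLE for a 1-D Gaussian: maximizing the log-likelihood analytically gives
# the sample mean and the (biased, divide-by-n) sample variance.
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.5, size=100)  # 100 i.i.d. samples, true mu = 2, sigma = 1.5

mu_hat = x.mean()                      # arg max of the likelihood w.r.t. mu
var_hat = ((x - mu_hat) ** 2).mean()   # arg max w.r.t. sigma^2

print(f"mu_hat = {mu_hat:.3f}, sigma_hat = {np.sqrt(var_hat):.3f}")
```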
Example: 1D Gaussian Distribution

Maximum likelihood estimation of the mean of a Gaussian distribution.
[Figure: the true density overlaid with the MLE from 10 samples and the MLE from 100 samples.]

Example: 2D Gaussian Distribution

2D parameter estimation for a Gaussian distribution.
[Figure: blue = true density; red = density estimated from 50 examples.]

Multimodal Class Distributions

A single Gaussian may not accurately model the classes.
• Find subclasses in handwritten "online" characters (122,000 characters written by 100 writers).
• Performance improves by modeling subclasses.
Connell and Jain, "Writer Adaptation for Online Handwriting Recognition", IEEE PAMI, Mar 2002.

Multimodal Classes
Handwritten 'f' vs. 'y' classification task: a single Gaussian distribution may not model the classes accurately.

An Extreme Example of Multimodal Classes: Limitations of Unimodal Class Modelling

Red vs. blue classification. The classes are well separated; however, incorrect model assumptions result in high classification error.
[Figure: scatter plot of the two classes in the (x, y) plane.]
The 'red' class is a "mixture of two Gaussian distributions". There is no class label information when modeling the density of just the red class.

Finite Mixtures
k random sources, with probability density functions f_i(x), i = 1, ..., k.
[Figure: one of the sources f_1(x), f_2(x), ..., f_k(x) is chosen at random, producing the random variable X.]

Finite Mixtures
Example: 3 species (Iris).

Finite Mixtures

Choose a source at random, with Prob(source i) = α_i.
[Figure: the chosen source f_i(x) generates the random variable X.]

Conditional:    f(x | source i) = f_i(x)
Joint:          f(x and source i) = f_i(x) α_i
Unconditional:  f(x) = ∑_{all sources} f(x and source i) = ∑_{i=1}^k α_i f_i(x)

Finite Mixtures
  f(x) = ∑_{i=1}^k α_i f_i(x)

• f_i(x): the component densities.
• Mixing probabilities: α_i ≥ 0 and ∑_{i=1}^k α_i = 1.
• Parameterized components (e.g., Gaussian): f_i(x) = f(x | θ_i), giving

  f(x | Θ) = ∑_{i=1}^k α_i f(x | θ_i),   Θ = {θ_1, θ_2, ..., θ_k, α_1, α_2, ..., α_k}

Gaussian Mixtures

f(x | θ_i) is Gaussian.
• Arbitrary covariances: f(x | θ_i) = N(x | μ_i, C_i),   Θ = {μ_1, ..., μ_k, C_1, ..., C_k, α_1, ..., α_{k−1}}
• Common covariance: f(x | θ_i) = N(x | μ_i, C),   Θ = {μ_1, ..., μ_k, C, α_1, ..., α_{k−1}}
(A small evaluation sketch follows.)
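As a concrete reading of these formulas, a minimal Python sketch (an assumed helper, not from the slides) that evaluates f(x | Θ) for a Gaussian mixture:

```python
# Evaluate f(x | Theta) = sum_i alpha_i * N(x | mu_i, C_i) at a batch of points.
import numpy as np
from scipy.stats import multivariate_normal

def mixture_pdf(X, alphas, mus, covs):
    """Mixture density for each row of X; alphas must sum to 1."""
    return sum(a * multivariate_normal.pdf(X, mean=m, cov=C)
               for a, m, C in zip(alphas, mus, covs))

# Example: an equal-weight two-component mixture in R^2.
alphas = [0.5, 0.5]
mus = [np.zeros(2), np.array([3.0, 3.0])]
covs = [np.eye(2), np.eye(2)]
print(mixture_pdf(np.array([[0.0, 0.0], [3.0, 3.0]]), alphas, mus, covs))
```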
Mixture Fitting / Estimation

Data: n independent observations, x = {x^{(1)}, x^{(2)}, ..., x^{(n)}}.
Goals: estimate the parameter set Θ; maybe "classify the observations".
Example:
• How many species? Mean of each species?
• Which points belong to each species?
[Figure: classified data (classes unknown) vs. observed data.]

Gaussian Mixtures (d = 1), an Example

μ_1 = −2, σ_1 = 3,   α_1 = 0.6
μ_2 = 4,  σ_2 = 2,   α_2 = 0.3
μ_3 = 7,  σ_3 = 0.1, α_3 = 0.1
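To make the generative story concrete, a minimal sketch (not from the slides) that samples from exactly this three-component mixture by first drawing the latent source label:

```python
# Sample from the 1-D example above: pick component i with probability alpha_i,
# then draw from N(mu_i, sigma_i^2).
import numpy as np

rng = np.random.default_rng(0)
alphas = np.array([0.6, 0.3, 0.1])
mus = np.array([-2.0, 4.0, 7.0])
sigmas = np.array([3.0, 2.0, 0.1])

n = 1000
labels = rng.choice(3, size=n, p=alphas)            # latent "source" of each sample
samples = rng.normal(mus[labels], sigmas[labels])   # draw from the chosen component
```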
Gaussian Mixtures, an R² Example

k = 3, 1500 points.
[Figure: scatter plot of the sample, annotated with the three component means μ_1, μ_2, μ_3 and covariance matrices C_1, C_2, C_3.]
Uses of Mixtures in Pattern Recognition

Unsupervised learning (model-based clustering):
• Each component models one cluster.
• Clustering = mixture fitting.
Observations: unclassified points.
Goals: find the classes, classify the points.

Uses of Mixtures in Pattern Recognition

Mixtures can approximate arbitrary densities, so they are good for representing class-conditional densities in supervised learning.
Example:
• Two strongly non-Gaussian classes.
• Use mixtures to model each class-conditional density.

Uses of Mixtures in Pattern Recognition

Find subclasses (lexemes), e.g., in "online" characters (122,000 characters written by 100 writers). Performance improves by modeling subclasses.
Connell and Jain, "Writer Adaptation for Online Handwriting Recognition", IEEE PAMI, 2002.

Fitting Mixtures
n independent observations, x = {x^{(1)}, x^{(2)}, ..., x^{(n)}}.
Maximum (log-)likelihood (ML) estimate of Θ:

  Θ̂ = arg max_Θ L(x, Θ)

  L(x, Θ) = log ∏_{j=1}^n f(x^{(j)} | Θ) = ∑_{j=1}^n log ∑_{i=1}^k α_i f(x^{(j)} | θ_i)

(the inner sum is the mixture density). The ML estimate has no closed-form solution.
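A minimal sketch (not from the slides) of computing L(x, Θ) for a 1-D Gaussian mixture; logsumexp is used for numerical stability, since the inner sum can underflow:

```python
# L(x, Theta) = sum_j log sum_i alpha_i * N(x_j | mu_i, sigma_i^2),
# computed in log space via logsumexp.
import numpy as np
from scipy.special import logsumexp
from scipy.stats import norm

def log_likelihood(x, alphas, mus, sigmas):
    # (n, k) matrix of log[alpha_i * N(x_j | mu_i, sigma_i)]
    log_terms = np.log(alphas)[None, :] + norm.logpdf(
        x[:, None], mus[None, :], sigmas[None, :])
    return logsumexp(log_terms, axis=1).sum()
```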
Gaussian Mixtures: A Peculiar Type of ML

  Θ = {μ_1, μ_2, ..., μ_k, C_1, C_2, ..., C_k, α_1, α_2, ..., α_{k−1}}

Maximum (log-)likelihood (ML) estimate of Θ:

  Θ̂ = arg max_Θ L(x, Θ)

subject to: C_i positive definite, α_i ≥ 0, and ∑_{i=1}^k α_i = 1.

Problem: the likelihood function is unbounded as det(C_i) → 0, so there is no global maximum.
Unusual goal: a "good" local maximum.

A Peculiar Type of ML Problem
Example: a 2-component Gaussian mixture,

  f(x | μ_1, μ_2, σ², α) = [α / √(2πσ²)] exp(−(x − μ_1)² / (2σ²)) + [(1 − α) / √(2π)] exp(−(x − μ_2)² / 2)

Given some data points {x_1, x_2, ..., x_n}, set μ_1 = x_1. Then

  L(x, Θ) = log{ [α / √(2πσ²)] exp(−(x_1 − μ_1)² / (2σ²)) + [(1 − α) / √(2π)] exp(−(x_1 − μ_2)² / 2) } + ∑_{j=2}^n log(...)

and the first term → ∞ as σ² → 0.
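This blow-up is easy to see numerically. A small sketch (not from the slides): pin μ_1 to the first data point and shrink σ, and the log-likelihood grows without bound:

```python
# With mu1 = x[0], the term alpha / sqrt(2*pi*sigma^2) at x[0] diverges as
# sigma -> 0, so the total log-likelihood increases without bound.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, size=50)
alpha, mu2 = 0.5, 0.0
mu1 = x[0]                      # the degenerate choice from the slide

for sigma in [1.0, 0.1, 0.01, 1e-4]:
    mix = alpha * norm.pdf(x, mu1, sigma) + (1 - alpha) * norm.pdf(x, mu2, 1.0)
    print(f"sigma = {sigma:g}: log-likelihood = {np.log(mix).sum():.1f}")
```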
Fitting Mixtures: A Missing Data Problem

The ML estimate has no closed-form solution; the standard alternative is the expectation-maximization (EM) algorithm. EM treats mixture fitting as a missing data problem:
• Observed data: x = {x^{(1)}, x^{(2)}, ..., x^{(n)}}
• Missing data: z = {z^{(1)}, z^{(2)}, ..., z^{(n)}}, the missing labels ("colors"):

  z^{(j)} = [z_1^{(j)}, z_2^{(j)}, ..., z_k^{(j)}]^T = [0 ... 0 1 0 ... 0]^T

  with the "1" at position i ⇔ x^{(j)} was generated by component i.

Fitting Mixtures: A Missing Data Problem
Each label vector z^{(j)} = [z_1^{(j)}, ..., z_k^{(j)}]^T has k − 1 zeros and one "1". The complete log-likelihood function:

  L_c(x, z, Θ) = ∑_{j=1}^n ∑_{i=1}^k z_i^{(j)} log[α_i f_i(x^{(j)} | θ_i)] = ∑_{j=1}^n log f(x^{(j)}, z^{(j)} | Θ)

In the presence of both x and z, Θ would be easy to estimate, but z is missing.
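A small sketch (not from the slides) of L_c for a 1-D Gaussian mixture: with one-hot labels z, each sample contributes exactly one log[α_i f_i(x | θ_i)] term:

```python
# Complete log-likelihood L_c(x, z, Theta) with one-hot labels z of shape (n, k).
import numpy as np
from scipy.stats import norm

def complete_log_likelihood(x, z, alphas, mus, sigmas):
    # (n, k) matrix of log[alpha_i * f_i(x_j | theta_i)]
    log_joint = np.log(alphas)[None, :] + norm.logpdf(
        x[:, None], mus[None, :], sigmas[None, :])
    return (z * log_joint).sum()   # z picks out one term per sample
```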
The EM Algorithm

An iterative procedure producing estimates Θ̂^{(0)}, Θ̂^{(1)}, ..., Θ̂^{(t)}, Θ̂^{(t+1)}, ...
Under mild conditions, Θ̂^{(t)} → a local maximum of L(x, Θ) as t → ∞.

The E-step: compute the expected value of L_c(x, z, Θ):

  E[L_c(x, z, Θ) | x, Θ̂^{(t)}] ≡ Q(Θ, Θ̂^{(t)})

The M-step: update the parameter estimates:

  Θ̂^{(t+1)} = arg max_Θ Q(Θ, Θ̂^{(t)})
The EM Algorithm: The Gaussian Case

The E-step:

  Q(Θ, Θ̂^{(t)}) ≡ E_z[L_c(x, z, Θ) | x, Θ̂^{(t)}] = L_c(x, E[z | x, Θ̂^{(t)}], Θ)

because L_c(x, z, Θ) is linear in z. Since each z_i^{(j)} is a binary variable, by Bayes' law

  E[z_i^{(j)} | x, Θ̂^{(t)}] = Pr{z_i^{(j)} = 1 | x^{(j)}, Θ̂^{(t)}} = α̂_i f(x^{(j)} | θ̂_i^{(t)}) / ∑_{m=1}^k α̂_m f(x^{(j)} | θ̂_m^{(t)}) ≡ w_i^{(j,t)}

w_i^{(j,t)} is the estimate, at iteration t, of the probability that x^{(j)} was produced by component i: a "soft" probabilistic assignment.
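A minimal sketch of this E-step for a 1-D Gaussian mixture (not from the slides): the responsibilities are just normalized prior-weighted densities:

```python
# E-step: w[j, i] = alpha_i * N(x_j | mu_i, sigma_i) / sum_m alpha_m * N(x_j | mu_m, sigma_m)
import numpy as np
from scipy.stats import norm

def e_step(x, alphas, mus, sigmas):
    """Return the (n, k) matrix of responsibilities; each row sums to 1."""
    joint = alphas[None, :] * norm.pdf(x[:, None], mus[None, :], sigmas[None, :])
    return joint / joint.sum(axis=1, keepdims=True)
```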
The EM Algorithm: The Gaussian Case

Result of the E-step: w_i^{(j,t)}, the estimate at iteration t of the probability that x^{(j)} was produced by component i.

The M-step:

  α̂_i^{(t+1)} = (1/n) ∑_{j=1}^n w_i^{(j,t)}

  μ̂_i^{(t+1)} = ∑_{j=1}^n w_i^{(j,t)} x^{(j)} / ∑_{j=1}^n w_i^{(j,t)}

  Ĉ_i^{(t+1)} = ∑_{j=1}^n w_i^{(j,t)} (x^{(j)} − μ̂_i^{(t+1)}) (x^{(j)} − μ̂_i^{(t+1)})^T / ∑_{j=1}^n w_i^{(j,t)}
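Putting the two steps together, a self-contained EM sketch for a 1-D mixture (an illustration, not the slides' code; in 1-D the covariance update reduces to a scalar variance):

```python
# EM for a 1-D Gaussian mixture, implementing the E- and M-step formulas above.
import numpy as np
from scipy.stats import norm

def em_gmm_1d(x, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    n = len(x)
    alphas = np.full(k, 1.0 / k)
    mus = rng.choice(x, size=k, replace=False)    # initialize means at data points
    sigmas = np.full(k, x.std())
    for _ in range(n_iter):
        # E-step: responsibilities w[j, i]
        joint = alphas[None, :] * norm.pdf(x[:, None], mus[None, :], sigmas[None, :])
        w = joint / joint.sum(axis=1, keepdims=True)
        # M-step: weighted updates from the slide
        nk = w.sum(axis=0)                        # effective sample count per component
        alphas = nk / n
        mus = (w * x[:, None]).sum(axis=0) / nk
        var = (w * (x[:, None] - mus[None, :]) ** 2).sum(axis=0) / nk
        sigmas = np.sqrt(np.maximum(var, 1e-12))  # guard against collapsing components
    return alphas, mus, sigmas
```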
Difficulties with EM

• It is a local (greedy) algorithm (the likelihood never decreases).
• It is initialization dependent.
[Figure: two runs from different initializations, converging after 74 and 270 iterations.]
Automatically Deciding the Number of Components

• Add a penalty term to the objective function which increases with the number of clusters.
• Start with a large number of clusters.
• Modify the M-step to include a "killer criterion" which removes components satisfying a certain criterion.
• Finally, choose the number of components with the largest objective function value (likelihood − penalty); a sketch of this selection follows.
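One standard instance of the "likelihood − penalty" idea is a BIC-style penalty (a swapped-in concrete criterion; the slides do not fix a particular penalty). A sketch, assuming hypothetical helpers fit_mixture and log_likelihood such as the EM and log-likelihood sketches above:

```python
# Choose k by penalized likelihood: score(k) = log-likelihood - 0.5 * (#params) * log(n).
import numpy as np

def select_k(x, fit_mixture, log_likelihood, k_max=10):
    n = len(x)
    best_k, best_score = None, -np.inf
    for k in range(1, k_max + 1):
        params = fit_mixture(x, k)
        n_params = 3 * k - 1     # 1-D mixture: k means, k sigmas, k - 1 free alphas
        score = log_likelihood(x, params) - 0.5 * n_params * np.log(n)
        if score > best_score:
            best_k, best_score = k, score
    return best_k
```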
Example

Same as in [Ueda and Nakano, 1998].

Example

k = 4 components with common covariance C = I and equal weights α_m = 1/4; n = 1200 points; start with k_max = 10.
[Figure: the slide lists the four component means and shows the fitted mixture.]
Example

Same as in [Ueda, Nakano, Ghahramani and Hinton, 2000].

Example

An example with overlapping components.

Example

The iris (4-dim.) dataset: 3 components correctly identified.

Another Supervised Learning Example
Problem: learn to classify textures from 19 Gabor features.
• Four classes: fit Gaussian mixtures to 800 randomly located feature vectors from each class/texture.
• Test on the remaining data.

  Method                   Error rate
  Mixture-based            0.0074
  Linear discriminant      0.0185
  Quadratic discriminant   0.0155

Resulting Decision Regions

[Figure: 2-D projection of the texture data and the obtained mixtures.]
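A sketch of this mixture-based classifier (an illustration; the slides do not name a library, and sklearn's GaussianMixture stands in for the EM fit):

```python
# Fit one Gaussian mixture per class, then classify a point by the largest
# prior-weighted class-conditional log-density.
import numpy as np
from sklearn.mixture import GaussianMixture

def fit_class_mixtures(X_by_class, k=3):
    """X_by_class: list of (n_c, d) arrays, one per class."""
    return [GaussianMixture(n_components=k, random_state=0).fit(X)
            for X in X_by_class]

def classify(x_new, mixtures, priors):
    scores = [np.log(p) + gm.score_samples(x_new.reshape(1, -1))[0]
              for p, gm in zip(priors, mixtures)]
    return int(np.argmax(scores))
```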
Properties of EM

EM is extremely popular because of the following properties:
• Easy to implement.
• Guarantees that the likelihood increases monotonically (why?).
• Guarantees convergence of the solution to a stationary point, i.e., a local maximum (why?).

Limitations of EM:
• The resulting solution depends highly on the initialization.
• It can be slow in several cases compared to direct optimization methods (e.g., iterative scaling).
EM as Lower Bound Optimization

• Start with an initial guess {θ_1, θ_2}^0.
• Come up with a lower bound of l(θ_1, θ_2):

    l(θ_1, θ_2) ≥ l(θ_1^0, θ_2^0) + Q(θ_1, θ_2)

  where Q(θ_1, θ_2) is a concave function and Q(θ_1 = θ_1^0, θ_2 = θ_2^0) = 0, so the bound touches l(θ_1, θ_2) at the current guess (the "touch point").
• Search for the solution {θ_1, θ_2}^1 that maximizes Q(θ_1, θ_2).
• Repeat the procedure, producing iterates {θ_1, θ_2}^0, {θ_1, θ_2}^1, {θ_1, θ_2}^2, ...
• Converge to the local optimum (the "optimal point").

[Figure: the log-likelihood surface l(θ_1, θ_2) with successive lower bounds, their touch points, and the iterates.]
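One standard way to construct such a touching lower bound (not spelled out on these slides) is Jensen's inequality applied to the log of the mixture sum:

\[
l(\Theta) = \sum_{j=1}^{n} \log \sum_{i=1}^{k} \alpha_i f_i(x^{(j)} \mid \theta_i)
          = \sum_{j=1}^{n} \log \sum_{i=1}^{k} w_{ij}\,
            \frac{\alpha_i f_i(x^{(j)} \mid \theta_i)}{w_{ij}}
        \ge \sum_{j=1}^{n} \sum_{i=1}^{k} w_{ij}
            \log \frac{\alpha_i f_i(x^{(j)} \mid \theta_i)}{w_{ij}},
\]

valid for any responsibilities \(w_{ij} \ge 0\) with \(\sum_i w_{ij} = 1\). Equality holds exactly when \(w_{ij} \propto \alpha_i f_i(x^{(j)} \mid \theta_i)\), i.e., at the E-step responsibilities, which is what makes the bound touch \(l\) at the current parameter estimate.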
Summary

• Expectation-Maximization algorithm:
  • E-step: compute the expected complete-data likelihood.
  • M-step: maximize that likelihood to find the parameters.
• Can be used with any model with hidden (latent) variables.
  • Hidden variables can be natural to the model or can be artificially introduced.
  • They make parameter estimation simpler and more efficient.
• The EM algorithm can be explained from many perspectives:
  • Bound optimization
  • Proximal point optimization, etc.
• Several generalizations/specializations exist.
• Easy to implement, and widely used!