SPR_LectureHandouts_EM_GMM

Finite Mixture Models and Expectation Maximization

Slides for this topic are courtesy of Dr. Anil Jain.

Recall: The Supervised Learning Problem

- Given a set of n samples X = {(x_i, y_i)}, i = 1, ..., n (Chapter 3 of DHS).
- Assume the examples in each class come from a parameterized Gaussian density.
- Estimate the parameters (mean, variance) of the Gaussian density for each class, and use them for classification.
- Estimation uses the maximum likelihood approach.

Review of Maximum Likelihood

- Given n i.i.d. examples from a density p(x; θ), with known form p and unknown parameter θ.
- Goal: estimate θ, denoted θ̂, such that the observed data is most likely to have come from the distribution with that θ.
- Steps involved:
  - Write the likelihood of the observed data.
  - Maximize the likelihood with respect to the parameter.

Example: 1D Gaussian Distribution

[Figure: maximum likelihood estimation of the mean of a Gaussian distribution; the true density compared with the MLE from 10 samples and the MLE from 100 samples.]

Example: 2D Gaussian Distribution

[Figure: 2D parameter estimation for a Gaussian distribution. Blue: true density; red: density estimated from 50 examples.]

Multimodal Class Distributions

- A single Gaussian may not accurately model the classes.
- Find subclasses in handwritten "online" characters (122,000 characters written by 100 writers).
- Performance improves by modeling subclasses.
- Connell and Jain, "Writer Adaptation for Online Handwriting Recognition", IEEE PAMI, Mar 2002.

Multimodal Classes

- Handwritten 'f' vs. 'y' classification task.
- A single Gaussian distribution may not model the classes accurately.

An extreme example of multimodal classes: limitations of unimodal class modelling

- Red vs. blue classification; the classes are well separated.
- However, incorrect model assumptions result in high classification error.
- The red class is a mixture of two Gaussian distributions.
- There is no class label information when modeling the density of just the red class.

Finite mixtures

- k random sources, with probability density functions f_i(x), i = 1, ..., k.
- One of the sources f_1(x), f_2(x), ..., f_k(x) is chosen at random, and the random variable X is drawn from the chosen source.
- Example: 3 species (Iris).
- Choose a source at random with Prob(source i) = α_i. Then:
  - Conditional: f(x | source i) = f_i(x)
  - Joint: f(x and source i) = f_i(x) α_i
  - Unconditional: f(x) = Σ_{all sources} f(x and source i) = Σ_{i=1}^{k} α_i f_i(x)
- The f_i are the component densities; the mixing probabilities satisfy α_i ≥ 0 and Σ_{i=1}^{k} α_i = 1.
- With parameterized components (e.g., Gaussian), f_i(x) = f(x | θ_i), so

  f(x | Θ) = Σ_{i=1}^{k} α_i f(x | θ_i),   Θ = {θ_1, θ_2, ..., θ_k, α_1, α_2, ..., α_k}.

Gaussian mixtures

- Each component f(x | θ_i) is Gaussian.
- Arbitrary covariances: f(x | θ_i) = N(x | μ_i, C_i), with Θ = {μ_1, μ_2, ..., μ_k, C_1, C_2, ..., C_k, α_1, α_2, ..., α_{k-1}}.
- Common covariance: f(x | θ_i) = N(x | μ_i, C), with Θ = {μ_1, μ_2, ..., μ_k, C, α_1, α_2, ..., α_{k-1}}.
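
To make the mixture definition concrete, here is a minimal Python/NumPy sketch of sampling from and evaluating f(x | Θ) = Σ_i α_i N(x | μ_i, C_i). The specific component parameters and the helper names sample_mixture and mixture_pdf are illustrative assumptions, not taken from the slides.

```python
# Minimal sketch: draw from and evaluate a finite Gaussian mixture
#   f(x | Theta) = sum_i alpha_i * N(x | mu_i, C_i).
# Component parameters below are purely illustrative.
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(0)

alphas = np.array([0.6, 0.3, 0.1])          # mixing probabilities (sum to 1)
mus = [np.array([-2.0, 0.0]), np.array([4.0, 1.0]), np.array([7.0, -3.0])]
Cs = [np.eye(2), 2.0 * np.eye(2), np.array([[1.0, 0.5], [0.5, 1.0]])]

def sample_mixture(n):
    """Pick a source i with probability alpha_i, then draw x ~ N(mu_i, C_i)."""
    sources = rng.choice(len(alphas), size=n, p=alphas)
    X = np.array([rng.multivariate_normal(mus[i], Cs[i]) for i in sources])
    return X, sources

def mixture_pdf(x):
    """Unconditional density f(x) = sum_i alpha_i f_i(x)."""
    return sum(a * multivariate_normal.pdf(x, mean=m, cov=C)
               for a, m, C in zip(alphas, mus, Cs))

X, labels = sample_mixture(500)
print(mixture_pdf(X[:5]))   # mixture density at the first few sampled points
```
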
Mixture fitting / estimation

- Data: n independent observations, x = {x^(1), x^(2), ..., x^(n)}.
- Goals: estimate the parameter set Θ, and maybe "classify the observations".
- Example: How many species? What is the mean of each species? Which points belong to each species?

[Figure: observed data vs. classified data (classes unknown).]

Gaussian mixtures (d = 1), an example

- μ_1 = -2, σ_1 = 3, α_1 = 0.6
- μ_2 = 4, σ_2 = 2, α_2 = 0.3
- μ_3 = 7, σ_3 = 0.1, α_3 = 0.1

Gaussian mixtures, an R^2 example (1500 points)

- k = 3, with μ_1 = (-4, 4)^T, C_1 = [[2, 0], [0, 8]]; μ_2 = (3, 3)^T, C_2 = [[2, -1], [-1, 2]]; μ_3 = (0, -4)^T, C_3 = I.

Uses of mixtures in pattern recognition

- Unsupervised learning (model-based clustering):
  - each component models one cluster, so clustering = mixture fitting;
  - observations: unclassified points;
  - goals: find the classes and classify the points.
- Mixtures can approximate arbitrary densities, so they are a good way to represent class-conditional densities in supervised learning. Example: two strongly non-Gaussian classes; use a mixture to model each class-conditional density.
- Find subclasses (lexemes), e.g., in "online" characters (122,000 characters written by 100 writers); performance improves by modeling subclasses. Connell and Jain, "Writer Adaptation for Online Handwriting Recognition", IEEE PAMI, 2002.

Fitting mixtures

- n independent observations x = {x^(1), x^(2), ..., x^(n)}.
- Maximum (log-)likelihood (ML) estimate of Θ:

  Θ̂ = arg max_Θ L(x, Θ),   L(x, Θ) = log Π_{j=1}^{n} f(x^(j) | Θ) = Σ_{j=1}^{n} log Σ_{i=1}^{k} α_i f(x^(j) | θ_i).

- The ML estimate has no closed-form solution.

Gaussian mixtures: a peculiar type of ML

- Θ = {μ_1, ..., μ_k, C_1, ..., C_k, α_1, ..., α_{k-1}}.
- ML estimate: Θ̂ = arg max_Θ L(x, Θ), subject to C_i positive definite, α_i ≥ 0, and Σ_{i=1}^{k} α_i = 1.
- Problem: the likelihood function is unbounded as det(C_i) → 0, so there is no global maximum.
- Unusual goal: a "good" local maximum.

A peculiar type of ML problem

- Example: a 2-component Gaussian mixture,

  f(x | μ_1, μ_2, σ^2, α) = (α / √(2πσ^2)) exp(-(x - μ_1)^2 / (2σ^2)) + ((1 - α) / √(2π)) exp(-(x - μ_2)^2 / 2).

- Given data points {x_1, x_2, ..., x_n}, set μ_1 = x_1. Then

  L(x, Θ) = log[ α / √(2πσ^2) + ((1 - α) / √(2π)) exp(-(x_1 - μ_2)^2 / 2) ] + Σ_{j=2}^{n} log(...) → ∞  as σ^2 → 0.

Fitting mixtures: a missing data problem

- The ML estimate has no closed-form solution; the standard alternative is the expectation-maximization (EM) algorithm.
- Observed data: x = {x^(1), x^(2), ..., x^(n)}.
- Missing data: z = {z^(1), z^(2), ..., z^(n)}, the missing labels ("colors"):

  z^(j) = [z_1^(j), z_2^(j), ..., z_k^(j)]^T = [0 ... 0 1 0 ... 0]^T,

  with k-1 zeros and a single "1"; the "1" at position i ⇔ x^(j) was generated by component i.
- Complete log-likelihood function:

  L_c(x, z, Θ) = Σ_{j=1}^{n} Σ_{i=1}^{k} z_i^(j) log[ α_i f(x^(j) | θ_i) ] = Σ_{j=1}^{n} log f(x^(j), z^(j) | Θ).

- In the presence of both x and z, Θ would be easy to estimate ... but z is missing.
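
Before turning to EM, note that although the ML estimate has no closed form, the incomplete-data log-likelihood L(x, Θ) itself is easy to evaluate, which is useful both for monitoring EM and for seeing the unbounded-likelihood problem numerically. Below is a minimal Python sketch; the function name mixture_log_likelihood and the use of SciPy's logsumexp are illustrative choices, not from the slides.

```python
# Minimal sketch: evaluate L(x, Theta) = sum_j log sum_i alpha_i N(x^(j) | mu_i, C_i)
# using log-sum-exp for numerical stability.
import numpy as np
from scipy.special import logsumexp
from scipy.stats import multivariate_normal

def mixture_log_likelihood(X, alphas, mus, Cs):
    """X: (n, d) data; alphas: (k,); mus: list of (d,) means; Cs: list of (d, d) covariances."""
    # log[alpha_i * N(x^(j) | mu_i, C_i)] for every point j and component i, shape (n, k)
    log_joint = np.column_stack([
        np.log(a) + multivariate_normal.logpdf(X, mean=m, cov=C)
        for a, m, C in zip(alphas, mus, Cs)
    ])
    return logsumexp(log_joint, axis=1).sum()

# Numerical illustration of the unbounded likelihood: center one component on a
# data point (mu_1 = x_1, as in the example above) and shrink its variance.
X = np.random.default_rng(0).normal(size=(100, 1))
for var in [1.0, 1e-2, 1e-4, 1e-6]:
    ll = mixture_log_likelihood(X, np.array([0.5, 0.5]),
                                [X[0], np.zeros(1)],
                                [var * np.eye(1), np.eye(1)])
    print(f"sigma^2 = {var:g}   L = {ll:.1f}")   # L grows without bound as var -> 0
```
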
The EM algorithm

- Iterative procedure: Θ̂^(0), Θ̂^(1), ..., Θ̂^(t), Θ̂^(t+1), ...
- Under mild conditions, Θ̂^(t) → a local maximum of L(x, Θ) as t → ∞.
- The E-step: compute the expected value of L_c(x, z, Θ):

  E[ L_c(x, z, Θ) | x, Θ̂^(t) ] ≡ Q(Θ, Θ̂^(t)).

- The M-step: update the parameter estimates:

  Θ̂^(t+1) = arg max_Θ Q(Θ, Θ̂^(t)).

The EM algorithm: the Gaussian case

- The E-step:

  Q(Θ, Θ̂^(t)) ≡ E_z[ L_c(x, z, Θ) | x, Θ̂^(t) ] = L_c(x, E[z | x, Θ̂^(t)], Θ),

  because L_c(x, z, Θ) is linear in z.
- Since z_i^(j) is a binary variable, its conditional expectation is a probability, given by Bayes law:

  E[ z_i^(j) | x, Θ̂^(t) ] = Pr{ z_i^(j) = 1 | x^(j), Θ̂^(t) } = α̂_i^(t) f(x^(j) | θ̂_i^(t)) / Σ_{m=1}^{k} α̂_m^(t) f(x^(j) | θ̂_m^(t)) ≡ w_i^(j,t).

- w_i^(j,t) is the estimate, at iteration t, of the probability that x^(j) was produced by component i: a "soft" probabilistic assignment.
- The M-step, using the w_i^(j,t) produced by the E-step:

  α̂_i^(t+1) = (1/n) Σ_{j=1}^{n} w_i^(j,t)

  μ̂_i^(t+1) = Σ_{j=1}^{n} w_i^(j,t) x^(j) / Σ_{j=1}^{n} w_i^(j,t)

  Ĉ_i^(t+1) = Σ_{j=1}^{n} w_i^(j,t) (x^(j) - μ̂_i^(t+1)) (x^(j) - μ̂_i^(t+1))^T / Σ_{j=1}^{n} w_i^(j,t)

Difficulties with EM

- It is a local (greedy) algorithm (the likelihood never decreases).
- It is initialization dependent.

[Figure: results from two different initializations, after 74 and 270 iterations.]

Automatically deciding the number of components

- Add a penalty term to the objective function which increases with the number of clusters.
- Start with a large number of clusters.
- Modify the M-step to include a "killer criterion" that removes components satisfying a certain criterion.
- Finally, choose the number of components that results in the largest objective function value (likelihood - penalty).

Example

- Same as in [Ueda and Nakano, 1998].

Example

- k = 4 Gaussian components with means at (0, 0), (0, 4), (4, 0), (4, 4), common covariance C = I, and equal mixing probabilities α_m = 1/4; n = 1200 points; the algorithm starts from kmax = 10 components.

Example

- Same as in [Ueda, Nakano, Ghahramani and Hinton, 2000].

Example

- An example with overlapping components.

Example

- The iris (4-dim.) data-set: 3 components correctly identified.

Another supervised learning example

- Problem: learn to classify textures from 19 Gabor features (four classes).
- Fit Gaussian mixtures to 800 randomly located feature vectors from each class/texture.
- Test on the remaining data.

  Classifier                Error rate
  Mixture-based             0.0074
  Linear discriminant       0.0185
  Quadratic discriminant    0.0155

Resulting decision regions

[Figure: 2-d projection of the texture data and the obtained mixtures.]

Properties of EM

EM is extremely popular because of the following properties:
- It is easy to implement.
- It guarantees that the likelihood increases monotonically (why?).
- It guarantees convergence to a stationary point, i.e., a local maximum (why?).
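
To make the E-step and M-step updates above concrete, and to watch the monotone increase of the likelihood, here is a compact NumPy sketch of EM for a Gaussian mixture with arbitrary covariances. The initialization strategy, the small covariance regularizer, and the fixed iteration count are illustrative assumptions, not part of the slides' algorithm.

```python
# Minimal sketch of EM for a Gaussian mixture (arbitrary covariances).
# E-step: responsibilities w_i^(j,t).  M-step: weighted updates of alpha, mu, C.
import numpy as np
from scipy.special import logsumexp
from scipy.stats import multivariate_normal

def em_gmm(X, k, n_iter=100, seed=0):
    n, d = X.shape
    rng = np.random.default_rng(seed)
    # Illustrative initialization: k random data points as means, the data
    # covariance for every component, and equal mixing probabilities.
    mus = X[rng.choice(n, size=k, replace=False)]
    Cs = np.stack([np.cov(X.T) + 1e-6 * np.eye(d) for _ in range(k)])
    alphas = np.full(k, 1.0 / k)
    log_liks = []
    for _ in range(n_iter):
        # E-step: log[alpha_i N(x^(j) | mu_i, C_i)], shape (n, k)
        log_joint = np.column_stack([
            np.log(alphas[i]) + multivariate_normal.logpdf(X, mean=mus[i], cov=Cs[i])
            for i in range(k)
        ])
        log_norm = logsumexp(log_joint, axis=1, keepdims=True)
        W = np.exp(log_joint - log_norm)    # responsibilities w_i^(j,t)
        log_liks.append(log_norm.sum())     # L(x, Theta^(t)); never decreases
        # M-step: weighted maximum-likelihood updates
        Nk = W.sum(axis=0)                  # effective number of points per component
        alphas = Nk / n
        mus = (W.T @ X) / Nk[:, None]
        for i in range(k):
            diff = X - mus[i]
            Cs[i] = (W[:, i, None] * diff).T @ diff / Nk[i] + 1e-6 * np.eye(d)
    return alphas, mus, Cs, log_liks

# Usage: fit k = 3 components to 2-D data X drawn from some mixture, e.g.
#   alphas, mus, Cs, log_liks = em_gmm(X, k=3)
#   assert all(b >= a - 1e-8 for a, b in zip(log_liks, log_liks[1:]))  # monotone
```
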
Limitations of EM

- The resulting solution depends strongly on the initialization.
- It can be slow in several cases, compared to direct optimization methods (e.g., iterative scaling).

EM as lower bound optimization

- Start with an initial guess {θ_1, θ_2}^0.
- Construct a lower bound on l(θ_1, θ_2):

  l(θ_1, θ_2) ≥ l(θ_1^0, θ_2^0) + Q(θ_1, θ_2),

  where Q(θ_1, θ_2) is a concave function and the bound touches l at the current guess: Q(θ_1 = θ_1^0, θ_2 = θ_2^0) = 0.
  (A derivation of this kind of bound via Jensen's inequality is sketched after the summary.)
- Search for the solution {θ_1, θ_2}^1 that maximizes Q(θ_1, θ_2).
- Repeat the procedure from the new point, producing {θ_1, θ_2}^2, {θ_1, θ_2}^3, ...
- The iterates converge to a local optimum of l(θ_1, θ_2).

Summary

- Expectation-Maximization algorithm:
  - E-step: compute the expected complete-data likelihood.
  - M-step: maximize that likelihood to find the parameters.
- Can be used with any model with hidden (latent) variables.
- Hidden variables can be natural to the model or can be artificially introduced.
- Makes the parameter estimation simpler and more efficient.
- The EM algorithm can be explained from many perspectives:
  - bound optimization,
  - proximal point optimization, etc.
- Several generalizations/specializations exist.
- Easy to implement, and widely used!
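
The lower bound referred to above can be obtained with a standard Jensen's-inequality argument. The sketch below uses generic notation, p(x, z | θ) for the complete-data model and q(z) for a distribution over the missing labels, rather than the slides' l(θ_1, θ_2) and Q(θ_1, θ_2); it is an assumed reformulation, not a transcription of the slides.

```latex
% Sketch: the log-likelihood is bounded below by an expected complete-data term.
\log p(x \mid \theta)
  = \log \sum_{z} p(x, z \mid \theta)
  = \log \sum_{z} q(z)\,\frac{p(x, z \mid \theta)}{q(z)}
  \;\ge\; \sum_{z} q(z) \log \frac{p(x, z \mid \theta)}{q(z)}
  \quad \text{for any distribution } q(z) \text{ (Jensen's inequality).}
```

Equality holds when q(z) = p(z | x, θ). The E-step therefore sets q(z) = p(z | x, Θ̂^(t)), which makes the bound touch the log-likelihood at the current estimate (the "touch point"), and the M-step maximizes the bound over θ; since the bound touches the log-likelihood at Θ̂^(t) and lies below it everywhere, the log-likelihood can never decrease.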