
# Machine Learning Lecture 11: Finite Mixture Models


Nevin L. Zhang ([email protected]), Department of Computer Science and Engineering, Hong Kong University of Science and Technology

## Outline

1. Finite Mixture Models and Clustering
2. Gaussian Mixture Models (for Continuous Data)
   - Learning Gaussian Mixture Models
   - Gaussian Mixtures and K-Means
3. Latent Class Models (for Discrete Data)

## Finite Mixture Models and Clustering

### Objective

Discuss finite mixture models for real-valued and discrete data.

Reading:

- C. M. Bishop (2006). *Pattern Recognition and Machine Learning*. Springer. Chapter 9.
- K. P. Murphy (2012). *Machine Learning: A Probabilistic Perspective*. MIT Press. Chapter 11.

### Finite Mixture Models

Variables:

- $z$: class of the object; a latent variable, i.e., not observed. It has $K$ possible values $\{1, 2, \ldots, K\}$.
- $\mathbf{x} = \{A_1, A_2, \ldots, A_n\}$: attributes of the object, observed.
- In general, we do not assume the attributes are mutually independent given $z$, so they are all placed inside one node.

Parameters:

- $P(z)$: distribution of $z$; $\pi_k = P(z = k)$ is the size of class $k$.
- $p(\mathbf{x}|z)$: conditional distribution of attribute values; $p(\mathbf{x}|z = k)$ is the attribute distribution for objects in class $k$.

Together they define a joint distribution:

$$p(z, \mathbf{x}) = P(z)\, p(\mathbf{x}|z).$$
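To make the generative reading of this joint distribution concrete, here is a minimal sketch of how a finite mixture model produces one object: draw the latent class $z$ from $P(z)$, then draw the attribute values from $p(\mathbf{x}|z)$. The three class sizes and the 1-D Gaussian components are made-up placeholders (the lecture only introduces Gaussian components in the next section):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 3-class mixture: pi[k] = P(z = k), one sampler per class.
pi = np.array([0.5, 0.3, 0.2])                 # class sizes; they sum to 1
components = [
    lambda: rng.normal(loc=-2.0, scale=0.5),   # p(x | z = 0)
    lambda: rng.normal(loc=0.0, scale=1.0),    # p(x | z = 1)
    lambda: rng.normal(loc=3.0, scale=0.8),    # p(x | z = 2)
]

def sample_fmm():
    """Draw one (z, x) pair from the joint p(z, x) = P(z) p(x|z)."""
    z = rng.choice(len(pi), p=pi)  # latent class; hidden in real data
    x = components[z]()            # attribute value for an object in class z
    return z, x

# Unlabeled data, as seen by the learner: keep x, discard z.
data = [sample_fmm()[1] for _ in range(5)]
```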
The marginal distribution of the attributes is

$$p(\mathbf{x}) = \sum_{k=1}^{K} P(z = k)\, p(\mathbf{x}|z = k) = \sum_{k=1}^{K} \pi_k\, p(\mathbf{x}|z = k).$$

It is a mixture of the distributions for the individual classes. Each $p(\mathbf{x}|z = k)$ is a component of the mixture, and the $\pi_k$ are the mixing coefficients. Hence the model is called a finite mixture model (FMM).

### Learning FMMs

Given:

- unlabeled data $\{\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_N\}$, and
- a number $K$,

determine the parameters $P(z)$ and $p(\mathbf{x}|z = k)$, that is, the size of each class $P(z = k)$ and the attribute distribution for each class. The parameters are determined using the maximum likelihood (MLE) principle.

### Class Assignment in FMMs

Given $P(z)$ and $p(\mathbf{x}|z)$:

Soft assignment: an object with attribute values $\mathbf{x}_n$ belongs to class $k$ with probability

$$P(z = k|\mathbf{x}_n) = \frac{P(z = k)\, p(\mathbf{x}_n|z = k)}{p(\mathbf{x}_n)} = \frac{\pi_k\, p(\mathbf{x}_n|z = k)}{\sum_{k'=1}^{K} \pi_{k'}\, p(\mathbf{x}_n|z = k')}.$$

Hard assignment: the object $\mathbf{x}_n$ is assigned to the class $k^*$ such that

$$P(z = k^*|\mathbf{x}_n) \ge P(z = k|\mathbf{x}_n) \quad \forall k \ne k^*.$$

## Gaussian Mixture Models (for Continuous Data)

### Normal/Gaussian Distribution

$\mathcal{N}(\mu, \sigma)$:

$$p(x|\mu, \sigma) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right)$$

### Multivariate Gaussian Distributions

$\mathcal{N}(\boldsymbol{\mu}, \Sigma)$:

$$p(\mathbf{x}|\boldsymbol{\mu}, \Sigma) = \frac{1}{\sqrt{(2\pi)^d \det(\Sigma)}} \exp\left(-\frac{(\mathbf{x} - \boldsymbol{\mu})'\, \Sigma^{-1} (\mathbf{x} - \boldsymbol{\mu})}{2}\right)$$

- $d$: dimension
- $\mathbf{x}$: vector of $d$ random variables, representing the data
- $\boldsymbol{\mu}$: vector of means
- $\Sigma$: covariance matrix

In a 2-D Gaussian distribution, $\boldsymbol{\mu}$ gives the center of the density contours, while $\Sigma$ determines their orientation and size.
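As a quick sanity check on this density formula, the sketch below (assuming NumPy and SciPy are available) evaluates a 2-D Gaussian both directly from the formula and via `scipy.stats.multivariate_normal`; the particular $\boldsymbol{\mu}$, $\Sigma$, and query point are arbitrary:

```python
import numpy as np
from scipy.stats import multivariate_normal

mu = np.array([0.0, 0.0])              # center of the density contours
Sigma = np.array([[2.0, 0.6],          # orientation and size of the contours
                  [0.6, 1.0]])
x = np.array([1.0, -0.5])              # arbitrary query point

d = len(mu)
diff = x - mu
# Direct evaluation of p(x | mu, Sigma) from the formula above
p_formula = (np.exp(-0.5 * diff @ np.linalg.inv(Sigma) @ diff)
             / np.sqrt((2 * np.pi) ** d * np.linalg.det(Sigma)))
# Library evaluation for comparison
p_scipy = multivariate_normal(mean=mu, cov=Sigma).pdf(x)
assert np.isclose(p_formula, p_scipy)
```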
### Gaussian Mixtures

A Gaussian mixture model (GMM) is a finite mixture model with mixture distribution

$$p(\mathbf{x}) = \sum_{k=1}^{K} \pi_k\, p(\mathbf{x}|z = k)$$

in which each component is a Gaussian distribution:

$$p(\mathbf{x}|z = k) = \mathcal{N}(\mathbf{x}|\boldsymbol{\mu}_k, \Sigma_k).$$

(The slides show a 2-D example.)

### Learning Gaussian Mixture Models

#### Problem Statement

Given:

- unlabeled data $\{\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_N\}$, and
- a number $K$,

find a $K$-component Gaussian mixture model, i.e.,

- mixing coefficients $\pi_1, \ldots, \pi_K$ and
- component parameters $\boldsymbol{\mu}_k, \Sigma_k$ ($k = 1, \ldots, K$),

such that the mixture distribution fits the data well, i.e., has maximum likelihood.

#### The Expectation-Maximization (EM) Algorithm

Choose initial values for $\pi_k, \boldsymbol{\mu}_k, \Sigma_k$, then repeat until convergence:

- Expectation: for each training example $\mathbf{x}_n$,
  - (a) compute $r_{nk} \equiv P(z = k|\mathbf{x}_n)$ for $k = 1, \ldots, K$;
  - (b) break $\mathbf{x}_n$ into $K$ fractional examples $\mathbf{x}_n[r_{nk}]$, $k = 1, \ldots, K$, according to these probabilities;
  - (c) assign each fractional example $\mathbf{x}_n[r_{nk}]$ to the corresponding cluster $k$.
- Maximization: re-estimate $\pi_k, \boldsymbol{\mu}_k, \Sigma_k$.

(The slides illustrate the iterations with ellipses marking the 1-standard-deviation contour of each component.)
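Before unpacking the E-step and M-step by hand, it may help to see the learning problem at the library level. The sketch below assumes scikit-learn is available; its `GaussianMixture` estimator fits this model with an internal EM loop, and the toy data set is made up for illustration:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Toy unlabeled 2-D data: two well-separated clusters
X = np.vstack([rng.normal([-2.0, 0.0], 0.5, size=(100, 2)),
               rng.normal([3.0, 3.0], 1.0, size=(100, 2))])

gmm = GaussianMixture(n_components=2, random_state=0).fit(X)  # K = 2
print(gmm.weights_)              # mixing coefficients pi_k
print(gmm.means_)                # component means mu_k
print(gmm.covariances_)          # component covariances Sigma_k
print(gmm.predict_proba(X[:3]))  # soft assignments P(z = k | x_n)
```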
#### The E-Step

Steps (b) and (c) are conceptual; no actual computation is involved. The computation in step (a) is

$$r_{nk} \equiv P(z = k|\mathbf{x}_n) = \frac{P(z = k)\, p(\mathbf{x}_n|z = k)}{\sum_{k'=1}^{K} P(z = k')\, p(\mathbf{x}_n|z = k')} = \frac{\pi_k\, \mathcal{N}(\mathbf{x}_n|\boldsymbol{\mu}_k, \Sigma_k)}{\sum_{k'=1}^{K} \pi_{k'}\, \mathcal{N}(\mathbf{x}_n|\boldsymbol{\mu}_{k'}, \Sigma_{k'})}.$$

The $r_{nk}$ are often called responsibilities.

#### Estimating Parameters for Gaussian Distributions

For a 1-D Gaussian distribution $\mathcal{N}(\mu, \sigma)$ and data $x_1, x_2, \ldots, x_N$, the parameter estimates are

$$\mu = \frac{1}{N} \sum_{n=1}^{N} x_n, \qquad \sigma^2 = \frac{1}{N} \sum_{n=1}^{N} (x_n - \mu)^2.$$

For fractional data $x_1[r_1], x_2[r_2], \ldots, x_N[r_N]$, the actual sample size is

$$M = \sum_{n=1}^{N} r_n,$$

and the estimates become

$$\mu = \frac{1}{M} \sum_{n=1}^{N} r_n x_n, \qquad \sigma^2 = \frac{1}{M} \sum_{n=1}^{N} r_n (x_n - \mu)^2.$$

Likewise, for a multivariate Gaussian distribution $\mathcal{N}(\boldsymbol{\mu}, \Sigma)$ and fractional data $\mathbf{x}_1[r_1], \mathbf{x}_2[r_2], \ldots, \mathbf{x}_N[r_N]$ with actual sample size $M = \sum_{n=1}^{N} r_n$, the estimates are

$$\boldsymbol{\mu} = \frac{1}{M} \sum_{n=1}^{N} r_n \mathbf{x}_n, \qquad \Sigma = \frac{1}{M} \sum_{n=1}^{N} r_n (\mathbf{x}_n - \boldsymbol{\mu})(\mathbf{x}_n - \boldsymbol{\mu})'.$$
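These weighted estimates translate directly into NumPy. A minimal sketch (the function name is mine, not from the slides) that treats $r_n$ as the fractional weight of example $\mathbf{x}_n$:

```python
import numpy as np

def weighted_gaussian_mle(X, r):
    """MLE of (mu, Sigma) from fractional data x_1[r_1], ..., x_N[r_N].

    X: (N, d) array of examples; r: (N,) array of weights in [0, 1].
    """
    M = r.sum()                               # actual sample size
    mu = (r[:, None] * X).sum(axis=0) / M     # weighted mean
    diff = X - mu
    # Weighted sum of outer products (x_n - mu)(x_n - mu)'
    Sigma = (r[:, None, None] * diff[:, :, None] * diff[:, None, :]).sum(axis=0) / M
    return mu, Sigma
```

With all weights equal to 1, $M = N$ and this reduces to the ordinary estimates above.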
#### The M-Step

Conceptually, the data assigned to cluster $k$ during the E-step are $\mathbf{x}_1[r_{1k}], \mathbf{x}_2[r_{2k}], \ldots, \mathbf{x}_N[r_{Nk}]$, so the total number of examples assigned to cluster $k$ is

$$N_k = \sum_{n=1}^{N} r_{nk}.$$

The re-estimates of $\pi_k, \boldsymbol{\mu}_k, \Sigma_k$ are therefore

$$\pi_k^{\text{new}} = \frac{N_k}{N}, \qquad \boldsymbol{\mu}_k^{\text{new}} = \frac{1}{N_k} \sum_{n=1}^{N} r_{nk} \mathbf{x}_n, \qquad \Sigma_k^{\text{new}} = \frac{1}{N_k} \sum_{n=1}^{N} r_{nk} (\mathbf{x}_n - \boldsymbol{\mu}_k^{\text{new}})(\mathbf{x}_n - \boldsymbol{\mu}_k^{\text{new}})'.$$

#### The EM Algorithm

Choose initial values for $\pi_k, \boldsymbol{\mu}_k, \Sigma_k$ and repeat until convergence:

- Expectation: for each training example $\mathbf{x}_n$, compute

$$r_{nk} = \frac{\pi_k\, \mathcal{N}(\mathbf{x}_n|\boldsymbol{\mu}_k, \Sigma_k)}{\sum_{k'=1}^{K} \pi_{k'}\, \mathcal{N}(\mathbf{x}_n|\boldsymbol{\mu}_{k'}, \Sigma_{k'})} \quad \text{for } k = 1, \ldots, K.$$

- Maximization: re-estimate $\pi_k, \boldsymbol{\mu}_k, \Sigma_k$ using the formulas above, with $N_k = \sum_{n=1}^{N} r_{nk}$.
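Putting the E-step and M-step together gives the following compact NumPy sketch of the whole loop (again assuming SciPy for the Gaussian density). The initialization scheme, the small ridge added to each covariance, and the log-likelihood stopping rule, which anticipates the convergence discussion below, are illustrative choices rather than prescriptions from the slides:

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, K, max_iter=200, tol=1e-6, seed=0):
    """Fit a K-component Gaussian mixture to X (an N x d array) by EM."""
    rng = np.random.default_rng(seed)
    N, d = X.shape
    pi = np.full(K, 1.0 / K)                      # initial mixing coefficients
    mu = X[rng.choice(N, size=K, replace=False)]  # initial means: K random examples
    Sigma = np.stack([np.cov(X, rowvar=False) + 1e-6 * np.eye(d)] * K)
    prev_ll = -np.inf
    for _ in range(max_iter):
        # E-step: responsibilities r[n, k] = P(z = k | x_n)
        dens = np.column_stack(
            [multivariate_normal(mu[k], Sigma[k]).pdf(X) for k in range(K)])
        weighted = pi * dens                      # pi_k * N(x_n | mu_k, Sigma_k)
        r = weighted / weighted.sum(axis=1, keepdims=True)
        # M-step: re-estimate pi_k, mu_k, Sigma_k from the fractional data
        Nk = r.sum(axis=0)
        pi = Nk / N
        mu = (r.T @ X) / Nk[:, None]
        for k in range(K):
            diff = X - mu[k]
            Sigma[k] = (r[:, k, None] * diff).T @ diff / Nk[k]
            Sigma[k] += 1e-6 * np.eye(d)          # crude guard against singularity
        # Log-likelihood under the parameters used in the E-step above;
        # stop when the improvement falls below the threshold.
        ll = np.log(weighted.sum(axis=1)).sum()
        if ll - prev_ll < tol:
            break
        prev_ll = ll
    return pi, mu, Sigma
```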
#### EM Convergence

Let $\boldsymbol{\pi} = (\pi_1, \ldots, \pi_K)$, $\boldsymbol{\mu} = (\boldsymbol{\mu}_1, \ldots, \boldsymbol{\mu}_K)$, and $\boldsymbol{\Sigma} = (\Sigma_1, \ldots, \Sigma_K)$. A specification of $\boldsymbol{\pi}, \boldsymbol{\mu}, \boldsymbol{\Sigma}$ defines a probability density function over the attributes $\mathbf{x}$:

$$p(\mathbf{x}|\boldsymbol{\pi}, \boldsymbol{\mu}, \boldsymbol{\Sigma}) = \sum_{k=1}^{K} \pi_k\, \mathcal{N}(\mathbf{x}|\boldsymbol{\mu}_k, \Sigma_k).$$

Given data $\mathbf{X} = \{\mathbf{x}_1, \ldots, \mathbf{x}_N\}$, the log-likelihood, viewed as a function of the model parameters, is

$$l(\boldsymbol{\pi}, \boldsymbol{\mu}, \boldsymbol{\Sigma}|\mathbf{X}) = \ln p(\mathbf{X}|\boldsymbol{\pi}, \boldsymbol{\mu}, \boldsymbol{\Sigma}) = \ln \prod_{n=1}^{N} p(\mathbf{x}_n|\boldsymbol{\pi}, \boldsymbol{\mu}, \boldsymbol{\Sigma}).$$

EM aims at computing the maximum likelihood estimate (MLE) of the parameters:

$$(\boldsymbol{\pi}^*, \boldsymbol{\mu}^*, \boldsymbol{\Sigma}^*) = \arg\max_{\boldsymbol{\pi}, \boldsymbol{\mu}, \boldsymbol{\Sigma}} \ln p(\mathbf{X}|\boldsymbol{\pi}, \boldsymbol{\mu}, \boldsymbol{\Sigma}).$$

Let $l(t)$ be the log-likelihood after iteration $t$. The series $l(1), l(2), l(3), \ldots$ increases monotonically with $t$. Terminate EM when $l(t+1) - l(t)$ falls below a threshold.

#### EM Convergence: Singularity

The maximum of the log-likelihood might be infinite: if a component collapses onto a single data point, its density (and hence $\ln p(\mathbf{X}|\boldsymbol{\pi}, \boldsymbol{\mu}, \boldsymbol{\Sigma})$) can be made arbitrarily large. Such singularities in the likelihood function happen often in the presence of outliers and repeated points.

Solution: bound the eigenvalues of the covariance matrices.

To avoid poor local maxima, use multiple restarts.
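The eigenvalue bound is easy to state in code. A small sketch (the function name and the floor value are arbitrary choices of mine) that clips the eigenvalues of an estimated covariance matrix from below:

```python
import numpy as np

def bound_eigenvalues(Sigma, floor=1e-3):
    """Clip the eigenvalues of a covariance matrix from below.

    Prevents a component from collapsing onto a single point,
    which would drive the likelihood to infinity.
    """
    vals, vecs = np.linalg.eigh(Sigma)   # Sigma is symmetric
    vals = np.maximum(vals, floor)       # enforce the lower bound
    return vecs @ np.diag(vals) @ vecs.T
```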
#### Model Selection

...

