Massachusetts Institute of Technology
Department of Electrical Engineering & Computer Science
6.345/HST.728 Automatic Speech Recognition, Spring 2010
3/2/10 Lecture Handouts

Acoustic Modelling I
Clustering & Vector Quantization (VQ)
Gaussian Mixture Models (GMMs)

Reading:
VQ: Rabiner et al., Fundamentals of ASR, Chp 3.4
VQ & GMM: Huang et al., Spoken Language Processing, Chp 3.1.7, 4.4

An Acoustic Modelling Problem

[Figure: Peterson & Barney vowel data, F1 (Hz, 200-1400) vs. F2 (Hz, 0-3500). Each point is one vowel token labeled with its phone (iy, ih, eh, ae, er, ah, aa, ao, uh, uw); the vowel classes form heavily overlapping clusters in formant space.]

Acoustic Modelling

Waveform -> Signal Representation -> Feature Vectors -> Acoustic Models -> Classes

Signal representation produces a feature vector sequence, {x}
Acoustic models assign feature vectors to classes {ω_i} via:
- Quantization methods that model discrete symbol sets, {C_j}
- Density estimation methods that model the feature space, p(x|ω_i)
- Discriminative methods that model class boundaries, P(ω_i|x)

Vector Quantization

Used in signal compression, speech and image coding
Based on standard clustering algorithms:
- Individual cluster centroids are called codewords
- The set of cluster centroids is called a codebook
- Basic VQ is K-means clustering
- Binary VQ is a form of top-down clustering (used for efficient quantization)
Used for discrete acoustic modelling since the early 1980s:
- Reduced storage and computation costs
- Potential loss of information due to quantization
- A probability profile, P(ω_i|C_j), is associated with each codeword, C_j

K-Means Clustering

Used to group data into K clusters, {C_1, ..., C_K}
Each cluster is represented by the mean of its assigned data
The iterative algorithm converges to a local optimum:
- Select an initial cluster mean, μ_i, for each cluster, C_i
- Iterate until a stopping criterion is satisfied (e.g., no changes):
  1. Assign each data sample to the closest cluster:
     x ∈ C_i  if  d(x, μ_i) ≤ d(x, μ_j),  for all j ≠ i
  2. Update the K means from the assigned samples:
     μ_i = E(x | x ∈ C_i),  1 ≤ i ≤ K
A nearest neighbor quantizer is used for unseen (e.g., test) data
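The two steps above translate directly into a few lines of NumPy. This is a minimal sketch (the function and variable names are mine, not the handout's):

```python
import numpy as np

def kmeans(X, K, n_iter=100, seed=0):
    """Minimal K-means: X is an (n, d) array of feature vectors."""
    rng = np.random.default_rng(seed)
    # Random selection of K data samples as the initial means.
    means = X[rng.choice(len(X), size=K, replace=False)]
    for _ in range(n_iter):
        # 1. Assign each sample to the closest mean (Euclidean distance).
        d2 = ((X[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # 2. Update each mean from its assigned samples.
        new_means = np.stack([X[labels == k].mean(axis=0) if (labels == k).any()
                              else means[k] for k in range(K)])
        if np.allclose(new_means, means):   # stopping criterion: no change
            return new_means, labels
        means = new_means
    return means, labels
```

Unseen (test) data would then be quantized by the same argmin step, i.e., mapped to the nearest mean.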
K-Means Example: K = 3

Random selection of 3 data samples for initial means
Euclidean distance metric between means and samples
[Figure: panels 0-5 showing successive K-means iterations on two-dimensional data.]

K-Means Properties

Usually used with a Euclidean distance metric:
  d(x, μ_i) = ‖x − μ_i‖² = (x − μ_i)ᵗ (x − μ_i)
The total distortion, D, is the sum of squared error:
  D = Σ_{i=1}^{K} Σ_{x ∈ C_i} ‖x − μ_i‖²
D is non-increasing from the nth to the (n+1)st iteration:
  D(n+1) ≤ D(n)
Also known as Isodata, or the generalized Lloyd algorithm

LPC VQ Example (Juang et al., 1982)

Quantizing LPC reflection coefficients k1 and k2
[Figure: scatter of 5000 data samples with a 1024-codeword codebook.]

Binary VQ

Often used to create a codebook of size M = 2^B (a B-bit codebook)
Uniform binary divisive clustering is used
On each iteration, each cluster centroid μ_i is split in two:
  μ_i⁺ = μ_i(1 + ε)
  μ_i⁻ = μ_i(1 − ε)
K-means is then used to determine the cluster centroids
Also known as the LBG (Linde, Buzo, Gray) algorithm

Example of Binary VQ

[Figure: panels 0-2 showing the codebook after successive binary splits.]
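A compact sketch of the LBG splitting loop just described, assuming Euclidean distortion (the function name and parameters are illustrative, not from the handout):

```python
import numpy as np

def lbg_codebook(X, B, eps=0.01, n_iter=20):
    """Grow a 2^B-codeword codebook by binary splitting (LBG)."""
    codebook = X.mean(axis=0, keepdims=True)        # start from the global mean
    for _ in range(B):
        # Split each centroid mu into mu(1 + eps) and mu(1 - eps).
        codebook = np.vstack([codebook * (1 + eps), codebook * (1 - eps)])
        # Refine the doubled codebook with K-means (Lloyd) iterations.
        for _ in range(n_iter):
            d2 = ((X[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
            labels = d2.argmin(axis=1)
            for k in range(len(codebook)):
                if (labels == k).any():
                    codebook[k] = X[labels == k].mean(axis=0)
    return codebook
```

For example, B = 10 would grow a 1024-codeword codebook like the one in the LPC VQ example above.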
Agglomerative Clustering

Clusters data in a bottom-up, greedy fashion
On each iteration, the two most similar clusters are merged
The structure is displayed in the form of a dendrogram
The dendrogram can yield insights into the natural grouping of the data
Used for a variety of unsupervised clustering tasks:
- Acoustic segmentation & broad acoustic classes
- Language model classes
- Speaker diarization

Clustering for Acoustic Segmentation

[Figure: spectrograms (0-8000 Hz, 0-2.5 s) with hierarchical segmentations.]
Within-utterance distances can be used to create a segment network.

Distance Dendrogram Example (One Dimension)

[Figure: one-dimensional data points and the dendrogram built by merging the closest pairs.]

Broad Acoustic Classes

Items can be clustered based on their statistical distributions.

Kullback-Leibler Distance

Can be used to compute a distance between two probability mass distributions, P(z_i) and Q(z_i):
  D(P ‖ Q) = Σ_i P(z_i) log [P(z_i) / Q(z_i)] ≥ 0
The proof makes use of the inequality log x ≤ x − 1:
  Σ_i P(z_i) log [Q(z_i) / P(z_i)] ≤ Σ_i P(z_i) (Q(z_i)/P(z_i) − 1) = Σ_i Q(z_i) − Σ_i P(z_i) = 0
Known as relative entropy in information theory
The divergence of P(z_i) and Q(z_i) is the symmetric sum D(P ‖ Q) + D(Q ‖ P)

Density Estimation Terminology

[Figure: two class-conditional PDFs, p(x|ω_1) and p(x|ω_2).]
Define:
  {ω_i}      a set of M mutually exclusive classes
  P(ω_i)     a priori probability for class ω_i
  p(x|ω_i)   PDF for feature vector x in class ω_i
  P(ω_i|x)   a posteriori probability of ω_i given x
From Bayes' rule:
  P(ω_i|x) = p(x|ω_i) P(ω_i) / p(x),  where  p(x) = Σ_{i=1}^{M} p(x|ω_i) P(ω_i)

Gaussian Distributions

Gaussian PDFs are reasonable when a feature vector can be viewed as a perturbation around a reference
[Figure: one-dimensional Gaussian PDF plotted against x − μ.]
Simple estimation procedures exist for the model parameters
Classification is often reduced to simple distance metrics
Gaussian distributions are also called Normal

Gaussian Distributions: One Dimension

A one-dimensional Gaussian PDF can be expressed as:
  p(x) = (1/√(2πσ²)) e^{−(x−μ)²/(2σ²)} ∼ N(μ, σ²)
The PDF is centered around the mean:
  μ = E(x) = ∫ x p(x) dx
The spread of the PDF is determined by the variance:
  σ² = E((x − μ)²) = ∫ (x − μ)² p(x) dx

Maximum Likelihood Parameter Estimation

Maximum likelihood parameter estimation determines an estimate θ̂ for a parameter θ by maximizing the likelihood L(θ) of observing the data X = {x_1, ..., x_n}:
  θ̂ = arg max_θ L(θ)
Assuming independent, identically distributed data:
  L(θ) = p(X|θ) = p(x_1, ..., x_n | θ) = Π_{i=1}^{n} p(x_i|θ)
ML solutions can often be obtained via the derivative:
  ∂L(θ)/∂θ = 0
For Gaussian distributions, log L(θ) is easier to solve

Gaussian ML Estimation: One Dimension

[Figure: histogram of [s] duration (1000 utterances, 100 speakers) with the fitted Gaussian PDF; μ̂ ≈ 120 ms, σ̂ ≈ 40 ms.]

Gaussian Distributions: Multiple Dimensions

A multi-dimensional Gaussian PDF can be expressed as:
  p(x) = 1/((2π)^{d/2} |Σ|^{1/2}) e^{−(1/2)(x−μ)ᵗ Σ⁻¹ (x−μ)} ∼ N(μ, Σ)
- d is the number of dimensions
- x = {x_1, ..., x_d} is the input vector
- μ = E(x) = {μ_1, ..., μ_d} is the mean vector
- Σ = E((x − μ)(x − μ)ᵗ) is the covariance matrix, with elements σ_ij, inverse Σ⁻¹, and determinant |Σ|:
  σ_ij = σ_ji = E((x_i − μ_i)(x_j − μ_j)) = E(x_i x_j) − μ_i μ_j

Diagonal Covariance Matrix: Σ = σ²I

[Figure: 3-dimensional PDF and contour plot; circular contours.]

Diagonal Covariance Matrix: σ_ij = 0, i ≠ j

[Figure: 3-dimensional PDF and contour plot; axis-aligned elliptical contours.]

General Covariance Matrix

[Figure: 3-dimensional PDF and contour plot; rotated elliptical contours.]
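As a concrete companion to the formulas above, here is a small NumPy sketch that evaluates the multi-dimensional Gaussian log-PDF, plus the cheaper diagonal-covariance special case (function and variable names are mine):

```python
import numpy as np

def gaussian_logpdf(x, mu, Sigma):
    """log N(x; mu, Sigma) for a full d x d covariance matrix."""
    d = len(mu)
    diff = x - mu
    _, logdet = np.linalg.slogdet(Sigma)            # log |Sigma|
    maha = diff @ np.linalg.solve(Sigma, diff)      # (x-mu)^t Sigma^{-1} (x-mu)
    return -0.5 * (d * np.log(2.0 * np.pi) + logdet + maha)

def gaussian_logpdf_diag(x, mu, var):
    """Diagonal case: var holds the per-dimension variances sigma_i^2,
    so the quadratic form reduces to a sum over dimensions."""
    return -0.5 * np.sum(np.log(2.0 * np.pi * var) + (x - mu) ** 2 / var)

x, mu = np.array([0.5, -0.25]), np.zeros(2)
print(gaussian_logpdf(x, mu, 2.0 * np.eye(2)))            # Sigma = sigma^2 I ...
print(gaussian_logpdf_diag(x, mu, np.array([2.0, 2.0])))  # ... matches the diagonal form
```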
Mixture Densities

The PDF is composed of a mixture of m component densities {ω_1, ..., ω_m}:
  p(x) = Σ_{j=1}^{m} p(x|ω_j) P(ω_j)
Component PDF parameters and mixture weights P(ω_j) are typically unknown, making parameter estimation a form of unsupervised learning
Gaussian mixture models (GMMs) assume Normal components:
  p(x|ω_k) ∼ N(μ_k, Σ_k)

Gaussian Mixture Example: One Dimension

[Figure: two Gaussian components, centered near −1 and 1.5, and their mixture.]
  p(x) = 0.6 p_1(x) + 0.4 p_2(x),  p_1(x) ∼ N(−1, σ²),  p_2(x) ∼ N(1.5, σ²)

Gaussian Example

First 9 MFCCs from [s]: Gaussian PDF
[Figure: nine panels, one per MFCC dimension, each showing a histogram of the data with a single fitted Gaussian PDF.]

Independent Mixtures

[s]: 2 Gaussian mixture components per dimension
[Figure: the same nine panels, each fit with a two-component Gaussian mixture.]

Mixture Components

[s]: 2 Gaussian mixture components per dimension
[Figure: the same nine panels, showing the individual scaled mixture components.]

Gaussian Mixture Example: Two Dimensions

[Figure: 3-dimensional PDF and contour plot of a two-dimensional Gaussian mixture.]

Two-Dimensional Mixtures

[Figure: scatter plots of [s] data in MFCC dimensions 5 and 6 overlaid with GMM density contours: diagonal vs. full covariance, and two vs. three components. A second slide contrasts the individual components with the resulting mixture.]
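A short sketch of evaluating a mixture density. The component means and variances below are illustrative assumptions chosen to match the one-dimensional example above (the handout's exact symbols were garbled in this copy):

```python
import numpy as np

def gmm_pdf(x, weights, means, variances):
    """p(x) = sum_j P(omega_j) p(x|omega_j) for 1-D Gaussian components."""
    x = np.asarray(x, dtype=float)[..., None]   # broadcast over components
    comp = np.exp(-0.5 * (x - means) ** 2 / variances) / np.sqrt(2.0 * np.pi * variances)
    return comp @ weights

# p(x) = 0.6 p1(x) + 0.4 p2(x), assuming unit variances and means -1 and 1.5:
xs = np.linspace(-4.0, 4.0, 9)
print(gmm_pdf(xs, np.array([0.6, 0.4]), np.array([-1.0, 1.5]), np.array([1.0, 1.0])))
```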
Gaussian Mixtures: ML Parameter Estimation

The maximum likelihood solutions are of the form:
  μ̂_k = Σ_i P̂(ω_k|x_i) x_i / Σ_i P̂(ω_k|x_i)
  Σ̂_k = Σ_i P̂(ω_k|x_i)(x_i − μ̂_k)(x_i − μ̂_k)ᵗ / Σ_i P̂(ω_k|x_i)
  P̂(ω_k) = (1/n) Σ_i P̂(ω_k|x_i)
An example of the Expectation-Maximization (EM) algorithm
The solutions are obtained iteratively (similar to K-means)

Example: 4 Samples, 2 Densities

1. Data: X = {x_1, x_2, x_3, x_4} = {2, 1, −1, −2}
2. Init: p(x|ω_1) ∼ N(1, 1),  p(x|ω_2) ∼ N(−1, 1),  P(ω_i) = 0.5
3. Estimate the posteriors:
              x_1     x_2     x_3     x_4
   P(ω_1|x)   0.98    0.88    0.12    0.02
   P(ω_2|x)   0.02    0.12    0.88    0.98
   p(X) ∝ (e^{−0.5} + e^{−4.5})(e^{0} + e^{−2})(e^{0} + e^{−2})(e^{−0.5} + e^{−4.5}) 0.5⁴
4. Recompute the mixture parameters (shown only for ω_1):
   P̂(ω_1) = (.98 + .88 + .12 + .02)/4 = 0.5
   μ̂_1 = [.98(2) + .88(1) + .12(−1) + .02(−2)] / (.98 + .88 + .12 + .02) = 1.34
   σ̂²_1 = [.98(2 − 1.34)² + .88(1 − 1.34)² + .12(−1 − 1.34)² + .02(−2 − 1.34)²] / (.98 + .88 + .12 + .02) = 0.70
5. Repeat steps 3 and 4 until convergence

[s] Duration: 2 Densities

  Iter   μ_1 (ms)  σ_1 (ms)  P(ω_1)   μ_2 (ms)  σ_2 (ms)  P(ω_2)   log p(X)
  0      152       35        .384     95        23        .616     2.727
  1      150       37        .376     97        24        .624     2.762
  2      148       39        .369     98        25        .631     2.773
  3      147       41        .362     100       25        .638     2.778
  4      146       42        .356     100       26        .644     2.781
  5      146       43        .349     102       26        .651     2.783
  6      146       44        .344     102       26        .656     2.784
  7      145       44        .338     102       26        .662     2.785

[Figure: histograms of [s] duration with the fitted two-component mixture at successive EM iterations.]
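The update equations and the four-sample example can be checked with a few lines of NumPy; this sketch (names are mine) performs one E-step and one M-step:

```python
import numpy as np

def em_step(x, mu, var, w):
    """One EM iteration for a 1-D, two-component Gaussian mixture."""
    # E-step: posteriors P(omega_k | x_i), one row per sample.
    comp = w * np.exp(-0.5 * (x[:, None] - mu) ** 2 / var) / np.sqrt(2.0 * np.pi * var)
    post = comp / comp.sum(axis=1, keepdims=True)
    # M-step: posterior-weighted ML re-estimates, as in the equations above.
    nk = post.sum(axis=0)
    mu_new = (post * x[:, None]).sum(axis=0) / nk
    var_new = (post * (x[:, None] - mu_new) ** 2).sum(axis=0) / nk
    return mu_new, var_new, nk / len(x)

x = np.array([2.0, 1.0, -1.0, -2.0])                      # the 4-sample example
mu, var, w = np.array([1.0, -1.0]), np.ones(2), np.array([0.5, 0.5])
mu, var, w = em_step(x, mu, var, w)
print(mu[0], var[0], w[0])   # -> ~1.34, ~0.70, 0.5, matching step 4 above
```

Calling em_step repeatedly until the parameters stop changing mirrors step 5, and is exactly the iteration traced in the [s] duration table above.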
Next Steps

Reading:
VQ: Rabiner et al., Fundamentals of ASR, Chp 3.4
VQ & GMM: Huang et al., Spoken Language Processing, Chp 3.1.7, 4.4
More on acoustic modeling...
- Dimensionality reduction & significance
- Discriminative techniques after HMMs