Massachusetts Institute of Technology
Department of Electrical Engineering & Computer Science
6.345/HST.728 Automatic Speech Recognition
Spring, 2010

4/8/10 Lecture Handouts
• HMM Training
• Homework: Hidden Markov Models

6.345/HST.728 Automatic Speech Recognition (2010), HMMs, 4/2/10

HMMs 41: Training an HMM-based Speech Recognition System
Larry Gillick

HMMs 42: Training the Acoustic Model via Maximum Likelihood Estimation (MLE)
Observe a series of utterances with transcription W and frames Y:
$(W_1, Y_1), \ldots, (W_M, Y_M)$
Let $\phi$ represent the (unknown) parameters of the acoustic model.
Let $P_\phi(y \mid w)$ be the distribution of the frame sequence given W.
We shall estimate $\phi$ as follows:
$\phi_{MLE} = \arg\max_{\phi} \prod_{i=1}^{M} P_\phi(y_i \mid w_i)$

HMMs 43: What data do we need?
• Dictionary of words
  – Pronunciation: string of phonemes
• Set of recorded utterances
  – Transcription: string of words
• Initial set of acoustic models
  – Not required but useful

HMMs 44: Some assumptions
• We have a set of output distributions and transition probabilities (duration distributions)
• We'll assume that the states have already been clustered somehow
• Let's also assume the output distributions are single Gaussians or mixtures of Gaussians
Output distributions: $f_i(y)$
Transition probabilities: $T(i, j)$

HMMs 45: What must be estimated?
• For each output distribution, need to estimate a set of mean vectors, a set of (usually diagonal) covariance matrices, and mixture probabilities
• Must also estimate the corresponding transition probabilities
• We'll focus on the output distributions

HMMs 46: How can we estimate these quantities?
• Align frames in each utterance to the corresponding output distributions
• Collect all of the frames (from all the training utterances) assigned to each output distribution
• Single Gaussian
  – Compute the means and variances of the assigned frames for each distribution
• Mixture of Gaussians
  – Estimate the mixture distribution via the EM algorithm, which we'll return to later

HMMs 47: Two alignment methods
• Viterbi algorithm
  – Deterministic alignment of training utterances
  – Each frame will be assigned to a unique output distribution
    * Based on the maximum likelihood state sequence
• Baum-Welch algorithm (forward-backward algorithm)
  – Probabilistic alignment of training utterances
  – Each frame will be distributed across (possibly multiple) output distributions

HMMs 48: Viterbi Algorithm
We wish to determine the most likely (ML) state sequence corresponding to the observed frames y:
$x_{ML} = \arg\max_{x} P(x, y)$
We can perform this computation via dynamic programming...
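
The Viterbi alignment and the single-Gaussian update described in the slides can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the course's implementation: frames are scalars rather than feature vectors, each state has one Gaussian output distribution, and the function names (`log_gauss`, `viterbi_align`, `reestimate_gaussians`) and the `var_floor` parameter are hypothetical names introduced here. Probabilities are kept in the log domain to avoid underflow on long utterances.

```python
import math

def log_gauss(y, mean, var):
    """Log density of a 1-D Gaussian (scalar frames keep the sketch short)."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (y - mean) ** 2 / var)

def viterbi_align(frames, means, variances, log_trans, log_init):
    """Return the state sequence x that maximizes P(x, y), found by dynamic programming.

    log_trans[i][j] is log T(i, j); log_init[j] is the log probability of starting in j.
    """
    n_states = len(means)
    # delta[j]: best log P(x_1..x_t, y_1..y_t) over all paths that end in state j
    delta = [log_init[j] + log_gauss(frames[0], means[j], variances[j])
             for j in range(n_states)]
    backptrs = []
    for y in frames[1:]:
        prev, delta, ptr = delta, [], []
        for j in range(n_states):
            # Best predecessor of state j at this frame
            best_i = max(range(n_states), key=lambda i: prev[i] + log_trans[i][j])
            ptr.append(best_i)
            delta.append(prev[best_i] + log_trans[best_i][j]
                         + log_gauss(y, means[j], variances[j]))
        backptrs.append(ptr)
    # Trace back from the best final state to recover the full path
    state = max(range(n_states), key=lambda j: delta[j])
    path = [state]
    for ptr in reversed(backptrs):
        state = ptr[state]
        path.append(state)
    path.reverse()
    return path

def reestimate_gaussians(frames, path, n_states, var_floor=1e-3):
    """Single-Gaussian update: mean and variance of the frames assigned to each state.

    Returns a list of (mean, variance) tuples, or None for a state with no frames.
    The variance floor is an assumption added here to avoid degenerate estimates.
    """
    stats = []
    for s in range(n_states):
        assigned = [y for y, x in zip(frames, path) if x == s]
        if not assigned:
            stats.append(None)
            continue
        mean = sum(assigned) / len(assigned)
        var = sum((y - mean) ** 2 for y in assigned) / len(assigned)
        stats.append((mean, max(var, var_floor)))
    return stats

if __name__ == "__main__":
    # Toy two-state, left-to-right model with made-up parameters.
    NEG_INF = float("-inf")
    frames = [0.1, -0.2, 0.0, 4.9, 5.2, 5.0]
    log_trans = [[math.log(0.5), math.log(0.5)], [NEG_INF, 0.0]]
    path = viterbi_align(frames, [0.0, 5.0], [1.0, 1.0], log_trans, [0.0, NEG_INF])
    print(path)
    print(reestimate_gaussians(frames, path, n_states=2))
```

Alternating these two steps (align, then re-estimate) over all training utterances gives the deterministic Viterbi-style training loop; the Baum-Welch alternative would replace the hard assignment in `path` with per-frame state posteriors.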
This note was uploaded on 05/08/2010 for the course CS 6.345 taught by Professor Glass during the Spring '10 term at MIT.
