224s.09.lec16


CS 224S/LING 281 Speech Recognition, Synthesis, and Dialogue
Dan Jurafsky
Lecture 16: Variation and Adaptation

Outline
• Variation in speech recognition
• Sources of variation
• Three classic problems:
  - Dealing with phonetic variation: triphones
  - Speaker differences (including accent): speaker adaptation with MLLR and MAP
  - Variation due to genre: conversational speech and pronunciation modeling issues. Unsolved!

Sources of Variability
• Phonetic context
• Environment
• Speaker
• Genre/Task

Most important: phonetic context. Different "eh"s:
• w eh d   y eh l   b eh n

Modeling phonetic context
• The strongest factor affecting phonetic variability is the neighboring phone.
• How do we model that in HMMs?
• Idea: have phone models that are specific to context.
• Instead of Context-Independent (CI) phones, we'll use Context-Dependent (CD) phones.

CD phones: triphones
• Each triphone captures facts about the preceding and following phone.
• Monophone: p, t, k
• Triphone: iy-p+aa
• a-b+c means "phone b, preceded by phone a, followed by phone c".

"Need" with triphone models (slide figure)

Word-Boundary Modeling
• Word-internal context-dependent models
  'OUR LIST': SIL AA+R AA-R L+IH L-IH+S IH-S+T S-T
• Cross-word context-dependent models
  'OUR LIST': SIL-AA+R AA-R+L R-L+IH L-IH+S IH-S+T S-T+SIL
• Dealing with cross-word triphones makes decoding harder! We will return to this.
• (A small code sketch of this expansion appears after these notes.)

Implications of Cross-Word Triphones
• Possible triphones: 50 × 50 × 50 = 125,000
• How many triphone types actually occur?
• On a 20K-word WSJ task (numbers from Young et al.):
  - cross-word models need 55,000 triphones,
  - but only 18,500 triphones occur in the training data!
• We need to generalize models.

Modeling phonetic context: some contexts look similar
• w iy   r iy   m iy   n iy

Solution: State Tying
• Young, Odell, and Woodland (1994)
• Decision-tree based clustering of triphone states
• States that are clustered together share their Gaussians.
• We call this "state tying", since these states are "tied together" to the same Gaussian.
• Previous work: generalized triphones, i.e. model-based clustering ('model' = 'phone'); clustering at the state level is more fine-grained.

Young et al. state tying (slide figure)

State tying/clustering
• How do we decide which triphones to cluster together? ...
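To make the a-b+c notation and the word-internal vs. cross-word distinction concrete, here is a minimal sketch in Python. It is not from the lecture: the function names, the list-of-phone-lists input format, and the simplified handling of silence are my own assumptions; it only reproduces the 'OUR LIST' expansion shown above.

```python
# Sketch: expanding monophone strings into triphones in a-b+c notation.
# Assumptions (not from the lecture): words are given as lists of phones,
# "SIL" marks utterance-boundary silence, and boundary phones that lack a
# left or right context drop that half of the label (e.g. AA+R, S-T).

def triphone(left, phone, right):
    """Build one context-dependent label: left-phone+right."""
    label = phone
    if left is not None:
        label = f"{left}-{label}"
    if right is not None:
        label = f"{label}+{right}"
    return label

def word_internal(words):
    """Word-internal CD models: context never crosses a word boundary."""
    out = []
    for word in words:
        for i, p in enumerate(word):
            left = word[i - 1] if i > 0 else None
            right = word[i + 1] if i < len(word) - 1 else None
            out.append(triphone(left, p, right))
    return out

def cross_word(words, sil="SIL"):
    """Cross-word CD models: context extends across word boundaries,
    with silence providing context at the utterance edges."""
    phones = [sil] + [p for word in words for p in word] + [sil]
    return [triphone(phones[i - 1], phones[i], phones[i + 1])
            for i in range(1, len(phones) - 1)]

if __name__ == "__main__":
    our_list = [["AA", "R"], ["L", "IH", "S", "T"]]
    print(word_internal(our_list))
    # ['AA+R', 'AA-R', 'L+IH', 'L-IH+S', 'IH-S+T', 'S-T']
    print(cross_word(our_list))
    # ['SIL-AA+R', 'AA-R+L', 'R-L+IH', 'L-IH+S', 'IH-S+T', 'S-T+SIL']
```

Note that the slide's word-internal transcription also lists an initial SIL model; the sketch omits it because silence is modeled separately from the word-internal triphones.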
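The notes end on the question of how to decide which triphone states to cluster. As a rough illustration of the decision-tree idea (a simplification, not the actual Young, Odell & Woodland 1994 algorithm or question set), the sketch below greedily picks the yes/no phonetic question about the left or right context that gives the largest log-likelihood gain, and ties all states that land in the same leaf to one shared Gaussian. The one-dimensional Gaussian statistics and the question list are illustrative assumptions.

```python
# Sketch of decision-tree state tying (simplified). Each triphone state
# carries sufficient statistics for a single 1-D Gaussian (occupancy count,
# mean, variance); a phonetic question about the left or right context
# splits the pool, and we greedily keep the split with the largest
# log-likelihood gain. States in the same leaf are tied.
import math
from dataclasses import dataclass

@dataclass
class StateStats:
    name: str       # e.g. "m-iy+d, state 2"
    left: str       # left-context phone
    right: str      # right-context phone
    count: float    # state occupancy (number of frames)
    mean: float
    var: float

def pooled_loglik(states):
    """Log-likelihood of all frames under one shared (tied) Gaussian."""
    n = sum(s.count for s in states)
    if n == 0:
        return 0.0
    mean = sum(s.count * s.mean for s in states) / n
    # pooled variance = E[x^2] - mean^2, from per-state means/variances
    ex2 = sum(s.count * (s.var + s.mean ** 2) for s in states) / n
    var = max(ex2 - mean ** 2, 1e-6)
    return -0.5 * n * (math.log(2 * math.pi * var) + 1)

def best_split(states, questions):
    """Pick the phonetic question giving the largest likelihood gain."""
    base = pooled_loglik(states)
    best = (0.0, None, None, None)
    for qname, phone_set, side in questions:
        yes = [s for s in states if getattr(s, side) in phone_set]
        no = [s for s in states if getattr(s, side) not in phone_set]
        if not yes or not no:
            continue
        gain = pooled_loglik(yes) + pooled_loglik(no) - base
        if gain > best[0]:
            best = (gain, qname, yes, no)
    return best

def grow_tree(states, questions, min_gain=1.0):
    """Recursively split; each leaf becomes one tied state (shared Gaussian)."""
    gain, qname, yes, no = best_split(states, questions)
    if qname is None or gain < min_gain:
        return [states]                      # leaf: these states are tied
    return grow_tree(yes, questions, min_gain) + grow_tree(no, questions, min_gain)

if __name__ == "__main__":
    # Hypothetical occupancy statistics for the second state of four
    # triphones of "iy" (cf. the "w iy, r iy, m iy, n iy" slide).
    states = [
        StateStats("m-iy+d s2", "m", "d", 100, 1.2, 0.5),
        StateStats("n-iy+d s2", "n", "d", 80, 1.1, 0.6),
        StateStats("r-iy+d s2", "r", "d", 90, -0.4, 0.4),
        StateStats("w-iy+d s2", "w", "d", 70, -0.5, 0.5),
    ]
    # Hypothetical questions: (name, phone set, which context side to test)
    questions = [("L-Nasal", {"m", "n", "ng"}, "left"),
                 ("L-Liquid/Glide", {"r", "w", "l", "y"}, "left")]
    for leaf in grow_tree(states, questions):
        print("tied:", [s.name for s in leaf])
```

Real systems use full acoustic feature vectors rather than a single dimension, and typically also enforce a minimum state-occupancy threshold so that every tied state has enough training data.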