10oov-handout

Massachusetts Institute of Technology
Department of Electrical Engineering & Computer Science
6.345/HST.728 Automatic Speech Recognition
Spring, 2010

4/29/10 Lecture Handouts
Out-of-Vocabulary (OOV) Modeling

OOV Models 1 6.345/HST.728 Automatic Speech Recognition (2010)

Modelling New Words

- Introduction
- Modelling out-of-vocabulary (OOV) words
- Probabilistic formulation
- Domain-independent methods
- Learning OOV subword units
- Multi-class OOV models
- Dynamic vocabulary applications

What is a new word?

[Figure: speech recognizer block diagram. Labels: Input Speech, Speech Recognizer, Acoustic Model, Lexical Model, Language Model, Word, Recognition Output.]

- Almost all speech recognizers search a finite lexicon
- A word not contained in the lexicon is called out-of-vocabulary
- Out-of-vocabulary (OOV) words are inevitable, and problematic!

$\arg\max_W P(W \mid A)$
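
The decoding criterion on this slide is usually expanded with Bayes' rule. The decomposition below is the standard textbook form rather than something spelled out in this handout; the hat on $W$ and the symbol $\mathcal{V}$ for the finite lexicon are notation assumed here for illustration.

```latex
\begin{align*}
\hat{W} &= \arg\max_{W \in \mathcal{V}^*} P(W \mid A) \\
        &= \arg\max_{W \in \mathcal{V}^*} \frac{P(A \mid W)\, P(W)}{P(A)} \\
        &= \arg\max_{W \in \mathcal{V}^*} P(A \mid W)\, P(W)
\end{align*}
```

The last step drops $P(A)$ because it does not depend on $W$. In the usual reading, $P(A \mid W)$ comes from the acoustic and lexical models and $P(W)$ from the language model. Since the search ranges only over sequences of lexicon words, a spoken word outside $\mathcal{V}$ can never appear in $\hat{W}$, so the recognizer has to replace it with in-vocabulary words.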

New Words are Inevitable!

- Vocabulary growth appears unbounded
  - New words are constantly appearing
  - Growth appears to be language independent
- Analysis of multiple speech and text corpora
  - Vocabulary size vs. amount of training data
  - Out-of-vocabulary rate vs. vocabulary size
- Out-of-vocabulary rate a function of data type
  - Human-machine speech
  - Human-human speech
  - Newspaper text

Example: Spoken Lecture Vocabulary Usage

- Most frequent words not present in all three subjects
- Difficult to cover content words w/o topic-specific material

Computer Science              Physics                    Linear Algebra
Word         BN     SB        Word      BN     SB        Word         BN     SB
Procedure    2683   5486      Field     1029   890       Matrix       23752  12918
Expression   4211   6935      Charge    1004   750       Transpose    51305  25829
Environment  1268   1055      Magnetic  10599  15961     Determinant  29023  --
Stream       5409   3210      Electric  3520   1733      Null         29431  --
Cons         14173  5385      Force     434    922       Eigenvalues  --     --
Program      370    410       Volts     33928  --        Rows         12440  8272
Procedures   3162   5487      Energy    1386   1620      Matrices     --     --
Machine      2201   906       Theta     --     --        Eigen        --     --
Arguments    2279   3738      Omega     24266  16279     Orthogonal   --     --
Cdr          --     --        Maximum   4107   3775      Diagonal     34008  14916
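
The "out-of-vocabulary rate vs. vocabulary size" analysis can be sketched as follows. This is a minimal illustration, assuming the vocabulary is built from the most frequent training words; the function name `oov_rate` and the toy token lists are hypothetical and are not taken from the lecture's corpora.

```python
from collections import Counter


def oov_rate(train_tokens, test_tokens, vocab_size):
    """Fraction of test tokens not covered by the vocab_size most
    frequent training words (assumed vocabulary-selection rule)."""
    counts = Counter(train_tokens)
    vocab = {word for word, _ in counts.most_common(vocab_size)}
    misses = sum(1 for word in test_tokens if word not in vocab)
    return misses / len(test_tokens)


# Toy data, invented for illustration only.
train = "the the the the matrix matrix matrix rows rows transpose".split()
test = "the determinant of the matrix".split()

for size in (1, 2, 100):
    print(size, oov_rate(train, test, size))
```

Growing the vocabulary lowers the OOV rate only up to a point: words that never occur in the training data ("determinant" and "of" in the toy test set) remain out-of-vocabulary at any vocabulary size.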

New Words Cause Errors!

- Out-of-vocabulary (OOV) words have higher word and sentence error rates compared to in-vocabulary (IV) words

        WER    SER
IV      14%    33%
OOV     51%    100%

WER: Word Error Rate
SER: Sentence Error Rate

- OOV words often cause multiple errors, e.g., “Symphony”
  Ref: “Members of Charleston Symphony Orchestra are being treated…”
  Hyp: “Members of Charleston simple your stroke are being treated…”

New Words Stress Recognizers!

- Search computation increases near presence of new words
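
To make the "multiple errors" point concrete, the sketch below counts word errors for the Ref/Hyp pair above with a standard word-level edit distance. It assumes the usual WER definition (substitutions, insertions, and deletions divided by the number of reference words) and scores only the visible part of the sentence, since the slide truncates it with an ellipsis. The function name `word_errors` is hypothetical.

```python
def word_errors(ref, hyp):
    """Minimum number of word substitutions, insertions, and deletions
    needed to turn the reference into the hypothesis."""
    r, h = ref.lower().split(), hyp.lower().split()
    # prev[j] = edit distance between the ref prefix seen so far and h[:j]
    prev = list(range(len(h) + 1))
    for i in range(1, len(r) + 1):
        curr = [i] + [0] * len(h)
        for j in range(1, len(h) + 1):
            substitute = prev[j - 1] + (r[i - 1] != h[j - 1])
            delete = prev[j] + 1      # ref word missing from hyp
            insert = curr[j - 1] + 1  # extra word in hyp
            curr[j] = min(substitute, delete, insert)
        prev = curr
    return prev[-1]


ref = "Members of Charleston Symphony Orchestra are being treated"
hyp = "Members of Charleston simple your stroke are being treated"

errors = word_errors(ref, hyp)
wer = errors / len(ref.split())
print(errors, wer)  # 3 errors over 8 reference words
```

A single OOV word ("Symphony") produces three word errors in this fragment, because the recognizer's substitute words also displace the neighbouring in-vocabulary word "Orchestra".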

New Words are Important!

- New words are often important content words

[Figure: new words by part of speech (NAME, NOUN, VERB, ADJECTIVE, ADVERB) for the Weather and Broadcast News domains.]

- Content words are more likely to be re-used (i.e., persistent)
- Example: 2,000 “rare” restaurant or street names
  - e.g., Aceituna, Jonquilles, Lastorias, Pepperoncinis, Chungs
  - 500 of these words are found in a 150k dictionary (75% OOV)
  - 600 are found in a 300k Google subset (70% OOV)
  - 1.4k are found in a 2.5 million Google subset (30% OOV)
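
The OOV percentages in the example follow directly from the counts: the OOV rate is the share of the 2,000 names that are not found in each word list. A minimal check, using only the figures quoted on the slide (the helper name `oov_percent` is hypothetical):

```python
def oov_percent(found, total):
    """Percentage of the word list that is missing from the vocabulary."""
    return 100 * (total - found) / total


total_names = 2000
found_in = {
    "150k dictionary": 500,
    "300k Google subset": 600,
    "2.5 million Google subset": 1400,
}

for source, found in found_in.items():
    print(f"{source}: {oov_percent(found, total_names):.0f}% OOV")
```

Even a vocabulary more than ten times larger than the 150k dictionary still misses roughly a third of these names, which is the slide's argument against relying on vocabulary growth alone.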


This note was uploaded on 05/08/2010 for the course CS 6.345 taught by Professor Glass during the Spring '10 term at MIT.
