10oov-handout

Massachusetts Institute of Technology
Department of Electrical Engineering & Computer Science
6.345/HST.728 Automatic Speech Recognition
Spring, 2010
4/29/10
Lecture Handouts
Out-of-Vocabulary (OOV) Modeling

Modelling New Words

• Introduction
• Modelling out-of-vocabulary (OOV) words
• Probabilistic formulation
• Domain-independent methods
• Learning OOV subword units
• Multi-class OOV models
• Dynamic vocabulary applications

What is a new word?

[Figure: speech recognizer block diagram. Labels: Input Speech, Speech Recognizer, Acoustic Model, Lexical Model, Language Model, Word, Recognition Output. Decoding criterion shown: \arg\max_{W} P(W \mid A)]

• Almost all speech recognizers search a finite lexicon
• A word not contained in the lexicon is called out-of-vocabulary
• Out-of-vocabulary (OOV) words are inevitable, and problematic!
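The slide shows only the decoding criterion, with A the acoustic observations and W a candidate word sequence. One common way to expand it is sketched below. The decomposition itself, the symbol U for a subword (pronunciation) sequence, and the conditional-independence assumption in the last step are not taken from the handout; they are the usual textbook formulation and are included only to show where the three models in the block diagram enter.

```latex
\begin{align}
\hat{W} &= \arg\max_{W} P(W \mid A) \\
        &= \arg\max_{W} \frac{P(A \mid W)\, P(W)}{P(A)}
           && \text{(Bayes' rule)} \\
        &= \arg\max_{W} P(A \mid W)\, P(W)
           && \text{($P(A)$ does not depend on $W$)} \\
        &= \arg\max_{W} \sum_{U} P(A \mid U)\, P(U \mid W)\, P(W)
           && \text{(assuming $A$ depends on $W$ only through $U$)}
\end{align}
```

Under this reading, P(A | U) plays the role of the acoustic model, P(U | W) the lexical model, and P(W) the language model. Because the maximization ranges only over sequences of words in the lexicon, a word outside the lexicon can never appear in the output, which is why OOV words are necessarily misrecognized.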
New Words are Inevitable!

• Vocabulary growth appears unbounded
  - New words are constantly appearing
  - Growth appears to be language independent
• Analysis of multiple speech and text corpora
  - Vocabulary size vs. amount of training data
  - Out-of-vocabulary rate vs. vocabulary size
• Out-of-vocabulary rate a function of data type
  - Human-machine speech
  - Human-human speech
  - Newspaper text

Example: Spoken Lecture Vocabulary Usage

• Most frequent words not present in all three subjects
• Difficult to cover content words w/o topic-specific material

Computer Science              Physics                     Linear Algebra
Word          BN      SB      Word       BN      SB       Word          BN      SB
Procedure     2683    5486    Field      1029    890      Matrix        23752   12918
Expression    4211    6935    Charge     1004    750      Transpose     51305   25829
Environment   1268    1055    Magnetic   10599   15961    Determinant   29023   --
Stream        5409    3210    Electric   3520    1733     Null          29431   --
Cons          14173   5385    Force      434     922      Eigenvalues   --      --
Program       370     410     Volts      33928   --       Rows          12440   8272
Procedures    3162    5487    Energy     1386    1620     Matrices      --      --
Machine       2201    906     Theta      --      --       Eigen         --      --
Arguments     2279    3738    Omega      24266   16279    Orthogonal    --      --
Cdr           --      --      Maximum    4107    3775     Diagonal      34008   14916
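The two analyses named on the "New Words are Inevitable!" slide (vocabulary size vs. amount of training data, and OOV rate vs. vocabulary size) can be sketched in a few lines. This is a minimal illustration, not the procedure used for the lecture's plots: the function names, the choice of the most frequent k training words as the vocabulary, and the toy sentences are all assumptions made for the example.

```python
from collections import Counter
from typing import List, Sequence, Set, Tuple


def vocabulary_growth(tokens: Sequence[str], step: int) -> List[Tuple[int, int]]:
    """Return (tokens seen, distinct words seen) after every `step` tokens."""
    seen: Set[str] = set()
    curve: List[Tuple[int, int]] = []
    for i, tok in enumerate(tokens, start=1):
        seen.add(tok)
        # Record a point at each step boundary and at the very end.
        if i % step == 0 or i == len(tokens):
            curve.append((i, len(seen)))
    return curve


def top_k_vocabulary(train_tokens: Sequence[str], k: int) -> Set[str]:
    """Assumed vocabulary policy: keep the k most frequent training words."""
    return {word for word, _ in Counter(train_tokens).most_common(k)}


def oov_rate(vocab: Set[str], test_tokens: Sequence[str]) -> float:
    """Fraction of test tokens that are not in the vocabulary."""
    if not test_tokens:
        return 0.0
    misses = sum(1 for tok in test_tokens if tok not in vocab)
    return misses / len(test_tokens)


# Toy data, invented for illustration only.
TRAIN = "the cat sat on the mat the dog sat".split()
TEST = "the cat ate the fish".split()

if __name__ == "__main__":
    print(vocabulary_growth(TRAIN, step=3))
    for k in (2, 6):
        print(k, oov_rate(top_k_vocabulary(TRAIN, k), TEST))
```

On real corpora the growth curve would be plotted against the amount of training data, and the OOV rate would be measured on held-out text for a range of vocabulary sizes.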

New Words Cause Errors!

• Out-of-vocabulary (OOV) words have higher word and sentence error rates compared to in-vocabulary (IV) words

         WER    SER
  IV     14%    33%
  OOV    51%    100%

  WER: Word Error Rate
  SER: Sentence Error Rate

• OOV words often cause multiple errors, e.g., “Symphony”
  Ref: “Members of Charleston Symphony Orchestra are being treated…”
  Hyp: “Members of Charleston simple your stroke are being treated…”

New Words Stress Recognizers!

• Search computation increases near presence of new words
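To make the “Symphony” example concrete, the sketch below scores the hypothesis against the reference with word-level edit distance. The handout does not spell out how WER is computed; the definition used here (substitutions + deletions + insertions, divided by the number of reference words) is the usual one and should be read as an assumption. Only the words visible on the slide are scored, since both sentences are truncated there.

```python
from typing import List


def edit_distance(ref: List[str], hyp: List[str]) -> int:
    """Word-level Levenshtein distance (substitutions, deletions, insertions)."""
    prev = list(range(len(hyp) + 1))  # distance from an empty reference
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # distance to an empty hypothesis: i deletions
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def word_error_rate(ref: str, hyp: str) -> float:
    """Edit distance divided by reference length, ignoring case."""
    ref_words = ref.lower().split()
    hyp_words = hyp.lower().split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hyp_words) / len(ref_words)


# Visible portions of the slide's example (trailing ellipses dropped).
REF = "Members of Charleston Symphony Orchestra are being treated"
HYP = "Members of Charleston simple your stroke are being treated"

if __name__ == "__main__":
    errors = edit_distance(REF.lower().split(), HYP.lower().split())
    print(errors, word_error_rate(REF, HYP))  # 3 errors, WER 0.375
```

The single OOV word accounts for three word errors in an eight-word reference: two substitutions and one insertion.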
New Words are Important!

• New words are often important content words

[Figure: chart with categories NAME, NOUN, VERB, ADJECTIVE, ADVERB for the Weather and Broadcast News domains]

• Content words are more likely to be re-used (i.e., persistent)
• Example: 2,000 “rare” restaurant or street names
  - e.g., Aceituna, Jonquilles, Lastorias, Pepperoncinis, Chungs
  - 500 of these words are found in a 150k dictionary (75% OOV)
  - 600 are found in a 300k Google subset (70% OOV)
  - 1.4k are found in a 2.5 million Google subset (30% OOV)
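The OOV percentages in the restaurant and street name example follow directly from the counts: the share of the 2,000 names that a lexicon does not contain. The short check below reproduces the arithmetic; the function name and the dictionary labels are illustrative choices, and the counts are the rounded figures quoted on the slide.

```python
def oov_percentage(total_words: int, found_in_lexicon: int) -> float:
    """Percentage of words that are missing from a lexicon."""
    if total_words <= 0:
        raise ValueError("total_words must be positive")
    if not 0 <= found_in_lexicon <= total_words:
        raise ValueError("found_in_lexicon must be between 0 and total_words")
    return 100.0 * (total_words - found_in_lexicon) / total_words


TOTAL_NAMES = 2000

# Number of the 2,000 names found in each lexicon, as quoted on the slide.
FOUND = {
    "150k dictionary": 500,
    "300k Google subset": 600,
    "2.5 million Google subset": 1400,
}

if __name__ == "__main__":
    for lexicon, found in FOUND.items():
        print(f"{lexicon}: {oov_percentage(TOTAL_NAMES, found):.0f}% OOV")
```

Growing the lexicon from 150k to 2.5 million entries lowers the OOV share from 75% to 30%, so even a very large static vocabulary leaves many of these names uncovered.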