CS 124/LINGUIST 180: From Languages to Information
Dan Jurafsky
Lecture 20: Speech Recognition

The final exam
- Friday March 18, 12:15-3:15 in 370-370
- Open book and open note
- You won't need a calculator
- Computers are OK for reading, e.g., the slides and the textbooks, but no use of the internet on your laptop or any internet-aware device, on the honor code
  - i.e., open book and notes, but not open-web
- The problems will be very much like homework 5, which I gave you specifically to be finals prep

Topics we covered
- http://www.stanford.edu/class/cs124/

Some classes in these areas
- cs276 Information Retrieval and Web Search, Nayak/Raghavan, Spring 2011
- cs224N Natural Language Processing, Manning, Spring 2012
- cs224W Social and Information Network Analysis, Leskovec (Winter 2011?)
- cs224U Natural Language Understanding, Fall 2011 (or Winter 2012)
- cs224S Speech Recognition, Understanding, Dialogue, Jurafsky, not taught next year
- ling284 History of Computational Linguistics, Jurafsky and Kay, Winter 2011
- cs121 Intro to AI, Latombe
- cs221 Artificial Intelligence, Thrun or Ng, often Winter
- cs228 Structured Probabilistic Models, Koller
- cs262 Computational Genomics, Batzoglou, often Winter
- cs229 Machine Learning, Ng
- cs270 Intro to Biomedical Informatics, Musen
- cs322 Network Analysis, Leskovec

Speech
- speech recognition
- speech synthesis
- dialogue
- spoken sentiment extraction
- speaker/language ID

Applications of Speech Recognition/Understanding (ASR/ASU)
- Dictation
- Telephone-based information
  - GOOG 411
  - Directions, air travel, banking, etc.
- "Google Voice" voice mail transcription
- Hands-free (in car)
- Second language ('L2') (accent reduction)
- Audio archive searching and aligning
(1/5/07)

Speaker Recognition tasks
- Speaker Recognition
  - Speaker Verification (Speaker Detection)
    - Is this speech sample from a particular speaker? ("Is that Jane?")
  - Speaker Identification
    - Which of this set of speakers does this speech sample come from? ("Who is that?")
- Related tasks: Gender ID, Language ID ("Is this a woman or a man?")
- Speaker Diarization
  - Segmenting a dialogue or multiparty conversation ("Who spoke when?")

Applications of Speaker Recognition and Language Recognition
- Language recognition for call routing
- Speaker Recognition:
  - Speaker verification (binary decision)
    - Voice password, telephone assistant
  - Speaker identification (one of N)
    - Criminal investigation

Speech synthesis
- Telephone dialogue systems
- Games
- The iPod shuffle
  - http://www.apple.com/ipodshuffle/voiceover.html
- Compare to state-of-the-art synthesis:
  - http://www.research.att.com/~ttsweb/tts/demo.php

LVCSR
- Large Vocabulary Continuous Speech Recognition
- ~20,000-64,000 words
- Speaker-independent (vs. speaker-dependent)
- Continuous speech (vs. isolated-word)
- Useful for:
  - Dictation
  - Voice-mail transcription

Outline for ASR
- ASR Tasks and Architecture
- Five easy pieces of an ASR system:
  1) The Lexicon (an HMM with phones as hidden states)
  2) The Language Model
  3) The Acoustic Model (phone detector)
  4) Feature extraction ("MFCC")
  5) HMM stuff:
     1) Viterbi decoding
     2) EM (Baum-Welch) training

Current error rates
- Ballpark numbers; exact numbers depend very much on the specific corpus

  Task                       Vocabulary   Error rate (%)
  Digits                     11           0.5
  WSJ read speech            5K           3
  WSJ read speech            20K          3
  Broadcast news             64,000+      10
  Conversational telephone   64,000+      20

HSR versus ASR

  Task                Vocab   ASR   Human SR
  Continuous digits   11      .5    .009
  WSJ 1995 clean      5K      3     0.9
  WSJ 1995 w/noise    5K      9     1.1
  SWBD 2004           65K     20    4

- Conclusions:
  - Machines are about 5 times worse than humans
  - The gap increases with noisy speech
  - These numbers are rough; take them with a grain of salt

Why is conversational speech harder?
- A piece of an utterance without context
- The same utterance with more context

LVCSR Design Intuition
- Build a statistical model of the speech-to-words process
- Collect lots and lots of speech, and transcribe all the words.
- Train the model on the labeled speech
- Paradigm: Supervised Machine Learning + Search

The Noisy Channel Model
- Search through the space of all possible sentences.
- Pick the one that is most probable given the waveform.

The Noisy Channel Model (II)
- What is the most likely sentence out of all sentences in the language L given some acoustic input O?
- Treat the acoustic input O as a sequence of individual observations:
  O = o1, o2, o3, ..., ot
- Define a sentence as a sequence of words:
  W = w1, w2, w3, ..., wn

The Noisy Channel Model (III)
- Probabilistic implication: pick the highest-probability W:
  W^ = argmax_{W in L} P(W | O)
- We can use Bayes' rule to rewrite this:
  W^ = argmax_{W in L} P(O | W) P(W) / P(O)
- Since the denominator is the same for each candidate sentence W, we can ignore it for the argmax:
  W^ = argmax_{W in L} P(O | W) P(W)
- Here P(O | W) is the likelihood and P(W) is the prior.

Starting with the HMM Lexicon
- A list of words
- Each one with a pronunciation in terms of phones
- We get these from an on-line pronunciation dictionary
- CMU dictionary: 127K words
  - http://www.speech.cs.cmu.edu/cgi-bin/cmudict
- We'll represent the lexicon as an HMM

ARPAbet
- http://www.stanford.edu/class/cs224s/arpabet.html

HMMs for speech: the word "six"
- Hidden states are phones
- Loopbacks because:
  - a phone is ~100 milliseconds long
  - an observation of speech arrives every 10 ms
  - so each phone repeats ~10 times (simplifying greatly)

HMM for the digit recognition task (figure)

The rest
- So if it's just HMMs, I just need to tell you how to build a phone detector
- The rest is the same as POS or Named Entity tagging
- Phone detection algorithm:
  - Supervised machine learning
  - Classifier: Gaussian Mixture Model
  - Features: "Mel-frequency Cepstral Coefficients" (MFCC)

Speech Production Process
- Respiration:
  - We (normally) speak while breathing out. Respiration provides airflow. ("Pulmonic egressive airstream")
- Phonation:
  - The airstream sets the vocal folds in motion.
  - Vibration of the vocal folds produces sound. Sound is then modulated by:
- Articulation and Resonance
  - Shape of the vocal tract, characterized by:
    - Oral tract
      - Teeth, soft palate (velum), hard palate
      - Tongue, lips, uvula
    - Nasal tract
(Text adapted from Sharon Rose)

Sagittal section of the vocal tract (Techmer 1880)
- Labels: Nasal Cavity, Pharynx, Vocal Folds (within the Larynx), Trachea, Lungs
(Text copyright J. J. Ohala, Sept 2001, from a Sharon Rose slide; further figures from Mark Liberman's website, from Ultimate Visual Dictionary and Language Files, 7th ed.)

Vocal tract (figure, thanks to John Coleman)

Vocal tract movie (high-speed x-ray)
(Figures of Ken Stevens, labels from Peter Ladefoged's web site; USC's SAIL Lab, Shri Narayanan)

Larynx and Vocal Folds
- The Larynx (voice box)
  - A structure made of cartilage and muscle
  - Located above the trachea (windpipe) and below the pharynx (throat)
  - Contains the vocal folds
  - (adjective for larynx: laryngeal)
- Vocal Folds (older term: vocal cords)
  - Two bands of muscle and tissue in the larynx
  - Can be set in motion to produce sound (voicing)
(Text from slides by Sharon Rose, UCSD LING 111 handout)

The larynx, external structure, from the front (figure, thanks to John Coleman)

Vertical slice through the larynx, as seen from the back (figure, thanks to John Coleman)

Voicing
- Air comes up from the lungs
- It forces its way through the vocal cords, pushing them open (2,3,4)
- This causes the air pressure in the glottis to fall, since:
  - when gas runs through a constricted passage, its velocity increases (Venturi tube effect)
  - this increase in velocity results in a drop in pressure (Bernoulli principle)
- Because of the drop in pressure, the vocal cords snap together again (6-10)
- A single cycle takes ~1/100 of a second.
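Stepping back from phonetics for a moment: the noisy-channel decision rule W^ = argmax P(O | W) P(W) described earlier can be sketched in a few lines of Python. The two candidate sentences and all probabilities below are invented purely for illustration; a real recognizer searches a vast sentence space, with the acoustic model and language model supplying the two terms.

```python
# Toy sketch of the noisy-channel decision rule:
#   W-hat = argmax over W in L of  P(O|W) * P(W)
# All sentences and probabilities here are made-up toy values.

candidates = {
    # sentence: (P(O|W) acoustic likelihood, P(W) language-model prior)
    "recognize speech":   (1e-5, 4e-6),
    "wreck a nice beach": (3e-5, 1e-7),
}

def decode(cands):
    """Pick the sentence maximizing P(O|W) * P(W); the denominator
    P(O) is ignored because it is constant across candidates."""
    return max(cands, key=lambda w: cands[w][0] * cands[w][1])

print(decode(candidates))  # -> recognize speech
```

Note that the language-model prior here outweighs the slightly better acoustic score of the competing sentence, which is exactly the interplay the argmax formula captures.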
(Figure and text from John Coleman's web site)

Voicelessness
- When the vocal cords are open, air passes through unobstructed
- Voiceless sounds: p/t/k/s/f/sh/th/ch
- If the air moves very quickly, the turbulence causes a different kind of phonation: whisper

Vocal folds open during breathing
(From Mark Liberman's web site, from Ultimate Visual Dictionary)

Vocal Fold Vibration (UCLA Phonetics Lab demo)

Consonants and Vowels
- Consonants: phonetically, sounds with audible noise produced by a constriction
- Vowels: phonetically, sounds with no audible noise produced by a constriction
- (It's more complicated than this, since we have to consider syllabic function, but this will do for now)
(Text adapted from John Coleman)

Acoustic Phonetics
- Sound waves
  - http://www.kettering.edu/~drussell/Demos/waves-intro/waves-intro.html

Simple Periodic Waves (sine waves)
- Characterized by:
  - period T
  - amplitude A
  - phase φ
- Fundamental frequency in cycles per second, or Hz: F0 = 1/T
(figure: one cycle of a sine wave over 0.02 s)

Simple periodic waves (continued)
- Computing the frequency of a wave:
  - 5 cycles in .5 seconds = 10 cycles/second = 10 Hz
- Amplitude: 1
- Equation: y = A sin(2πft)

Speech sound waves
- A little piece of the waveform of the vowel [iy]
- Y axis: amplitude = amount of air pressure at that time point
  - positive is compression
  - zero is normal air pressure
  - negative is rarefaction
- X axis: time

Digitizing Speech
- Analog-to-digital conversion (or A-D conversion)
- Two steps:
  - Sampling
  - Quantization

Sampling
- Measuring the amplitude of the signal at time t
- The sampling rate needs to give at least two samples for each cycle
  - Roughly speaking, one for the positive and one for the negative half of each cycle
- More than two samples per cycle is OK
- Fewer than two samples per cycle will cause frequencies to be missed
- So the maximum frequency that can be measured is half the sampling rate.
- The maximum frequency measurable at a given sampling rate is called the Nyquist frequency

Sampling (aliasing)
- Original signal in red: if we measure at the green dots, we will see a lower-frequency wave and miss the correct higher-frequency one!

Sampling rates in practice
- 16,000 Hz (samples/sec): microphone ("wideband")
- 8,000 Hz (samples/sec): telephone
- Why?
  - We need at least 2 samples per cycle
  - The max measurable frequency is half the sampling rate
  - Human speech < 10,000 Hz, so we need at most 20K
  - Telephone speech is filtered at 4K, so 8K is enough

Quantization
- Quantization
  - Representing the real value of each amplitude as an integer
  - 8-bit (-128 to 127) or 16-bit (-32768 to 32767)
- Formats:
  - 16-bit PCM
  - 8-bit mu-law (log compression)
- Byte order:
  - LSB (Intel) vs. MSB (Sun, Apple)
- Headers:
  - raw (no header)
  - Microsoft .wav
  - Sun .au

WAV format: 40-byte header (figure)

Waves have different frequencies
(figures: 100 Hz and 1000 Hz sine waves over 0.02 s)

Complex waves
(figure: a 100 Hz and a 1000 Hz wave added together)

Spectrum
(figure: the frequency components, 100 and 1000 Hz, on the x-axis; amplitude on the y-axis)

Spectra continued
- Fourier analysis: any wave can be represented as the (infinite) sum of sine waves of different frequencies (amplitudes and phases)
- Spectrum of one instant in an actual soundwave: many components across the frequency range (figure: 0-5000 Hz)

Part of the [ae] waveform from "had"
- Note the complex wave repeating nine times in the figure
- Plus smaller waves which repeat 4 times for every large pattern
- The large wave has a frequency of 250 Hz (9 repetitions in .036 seconds)
- The small wave is roughly 4 times this, or roughly 1000 Hz
- Two tiny waves sit on top of each peak of the 1000 Hz waves

Back to the spectrum
- The spectrum represents these frequency components
- It is computed by the Fourier transform, an algorithm that separates out each frequency component of a wave.
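As a minimal illustration of the Fourier transform separating out frequency components, the sketch below samples a 100 Hz + 1000 Hz mixture (like the one in the figures) at the 8,000 Hz telephone rate and recovers both components with a naive DFT. The window length and peak threshold are arbitrary toy choices, and real systems use the fast FFT rather than this O(N²) version.

```python
import cmath
import math

RATE = 8000   # samples per second (telephone-band rate)
N = 80        # a 10 ms window, so each DFT bin spans RATE/N = 100 Hz

# The 100 Hz + 1000 Hz complex wave from the slides, sampled at RATE.
signal = [math.sin(2 * math.pi * 100 * t / RATE) +
          math.sin(2 * math.pi * 1000 * t / RATE)
          for t in range(N)]

def dft_magnitudes(x):
    """Magnitude of each DFT bin k (bin k corresponds to k*RATE/N Hz);
    only bins below the Nyquist frequency are returned."""
    n = len(x)
    return [abs(sum(x[t] * cmath.exp(-2j * math.pi * k * t / n)
                    for t in range(n)))
            for k in range(n // 2)]

mags = dft_magnitudes(signal)
peaks = [k * RATE // N for k, m in enumerate(mags) if m > 1.0]
print(peaks)  # -> [100, 1000]
```

Only the two bins matching the input components carry energy; every other bin is numerically near zero, which is the spectrum picture from the slides.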
- The x-axis shows frequency; the y-axis shows magnitude (in decibels, a log measure of amplitude)
- Peaks at 930 Hz, 1860 Hz, and 3020 Hz

Spectrogram: spectrum + time dimension
(From Mark Liberman's Web site)

Detecting Phones
- Two stages:
  - Feature extraction
    - Basically a slice of a spectrogram
  - Building a phone classifier (using a GMM classifier)

MFCC: Mel-Frequency Cepstral Coefficients

Final Feature Vector
- 39 features per 10 ms frame:
  - 12 MFCC features
  - 12 delta MFCC features
  - 12 delta-delta MFCC features
  - 1 (log) frame energy
  - 1 delta (log) frame energy
  - 1 delta-delta (log) frame energy
- So each frame is represented by a 39-dimensional vector

Acoustic Modeling (= phone detection)
- Given a 39-dimensional vector corresponding to the observation of one frame, oi
- And given a phone q we want to detect
- Compute p(oi | q)
- Most popular method:
  - GMM (Gaussian mixture models)
- Other methods:
  - Neural nets, CRFs, SVMs, etc.

Gaussian Mixture Models
- Also called "fully-continuous HMMs"
- P(o | q) is computed by a Gaussian:
  p(o | q) = (1 / (σ sqrt(2π))) exp(-(o - µ)² / (2σ²))

Gaussians for Acoustic Modeling
- A Gaussian is parameterized by a mean and a variance (figure: Gaussians with different means)
- P(o | q) is highest at the mean, and low far from the mean

Training Gaussians
- A (single) Gaussian is characterized by a mean and a variance
- Imagine that we had some training data in which each phone was labeled
- And imagine that we were just computing 1 single spectral value (a real-valued number) as our acoustic observation
- We could just compute the mean and variance from the data:
  µi = (1/T) Σ_{t=1..T} ot      such that ot is phone i
  σi² = (1/T) Σ_{t=1..T} (ot - µi)²      such that ot is phone i

But we need 39 Gaussians, not 1!
- The observation o is really a vector of length 39
- So we need a vector of Gaussians (one dimension per feature):
  p(o | q) = (1 / ((2π)^(D/2) (Π_{d=1..D} σ²[d])^(1/2))) exp(-(1/2) Σ_{d=1..D} (o[d] - µ[d])² / σ²[d])

Actually, a mixture of Gaussians
(figure: Phone A, Phone B)
- Each phone is modeled by a sum of different Gaussians
- Hence we are able to model complex facts about the data

Gaussian acoustic modeling
- Summary: each phone is represented by a GMM parameterized by:
  - M mixture weights
  - M mean vectors
  - M covariance matrices
- We usually assume the covariance matrix is diagonal
  - i.e., we just keep a separate variance for each cepstral feature

Where we are
- Given: a wave file
- Goal: output a string of words
- What we know: the acoustic model
  - How to turn the wave file into a sequence of acoustic feature vectors, one every 10 ms
  - If we had a complete phonetic labeling of the training set, we know how to train a Gaussian "phone detector" for each phone
  - We also know how to represent each word as a sequence of phones
- What we knew from a few weeks ago: the language model
- Next time:
  - Seeing all this back in the context of HMMs
  - Search: how to combine the language model and the acoustic model to produce a sequence of words

(figures: HMM for the digit recognition task; Viterbi trellis for "five"; search space with bigrams; Viterbi trellis; Viterbi backtrace)

Summary
- ASR Architecture
- Phonetics background
- Five easy pieces of an ASR system:
  1) Lexicon
  2) Feature Extraction
  3) Acoustic Model (phone detector)
  4) Language Model
  5) Viterbi decoding

A few advanced topics

Why foreign accents are hard
- A word by itself
- The word in context

Sentence Segmentation
- Binary classification task: judge the juncture between each pair of words
- Features:
  - Pause
  - Duration of the previous phone and rime
  - Pitch change across the boundary; pitch range of the previous word

Disfluencies
- Reparandum: the thing repaired
- Interruption point (IP): where the speaker breaks off
- Editing phase (edit terms): uh, I mean, you know
- Repair: the fluent continuation
- Example (figure)

Fragments
- Incomplete or cut-off words:
  - "Uh yeah, yeah, well, it- it- that's right. And it- ..."
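To close, here is a minimal sketch of the acoustic-model scoring step described above: computing p(o | q) with a diagonal-covariance Gaussian over a 39-dimensional feature vector and picking the best-scoring phone. The phone set, means, and variances below are random stand-ins, not trained parameters, and a real system would use a mixture of Gaussians per phone rather than a single one.

```python
import math
import random

D = 39  # 12 MFCC + 12 delta + 12 delta-delta + 3 energy features

def log_gaussian(o, mu, var):
    """Log density of a diagonal-covariance Gaussian; summing in
    log space avoids the underflow a product of 39 tiny densities
    would cause."""
    return sum(-0.5 * (math.log(2 * math.pi * var[d]) +
                       (o[d] - mu[d]) ** 2 / var[d])
               for d in range(D))

# Three made-up phones, each with a random mean vector and unit variances.
random.seed(0)
phones = {q: ([random.gauss(0.0, 1.0) for _ in range(D)],  # mean vector
              [1.0] * D)                                   # variances
          for q in ("aa", "iy", "s")}

# Score one frame against every phone and pick the best match; this
# frame sits exactly at "iy"'s mean, so "iy" scores highest.
frame = list(phones["iy"][0])
best = max(phones, key=lambda q: log_gaussian(frame, *phones[q]))
print(best)  # -> iy
```

In a full recognizer these per-frame log scores become the HMM emission probabilities that Viterbi decoding combines with the language model.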

This document was uploaded on 06/01/2011.
