224s.09.lec9

CS 224S / LINGUIST 281
Speech Recognition, Synthesis, and Dialogue
Dan Jurafsky
Lecture 9: Feature Extraction and start of Acoustic Modeling (VQ)
IP Notice:

Outline for Today
  Speech Recognition Architectural Overview
  Hidden Markov Models in general and for speech
    Forward
    Viterbi Decoding
  How this fits into the ASR component of course
    Jan 27: HMMs, Forward, Viterbi
    Jan 29: Baum-Welch (Forward-Backward)
    Feb 3: Feature Extraction, MFCCs, start of AM
    Feb 5: Acoustic Modeling and GMMs
    Feb 10: N-grams and Language Modeling
    Feb 24: Search and Advanced Decoding
    Feb 26: Dealing with Variation
Outline for Today
  Feature Extraction
    Mel-Frequency Cepstral Coefficients
  Acoustic Model
    Increasingly sophisticated models
    Acoustic Likelihood for each state:
      Gaussians
      Multivariate Gaussians
      Mixtures of Multivariate Gaussians
    Where a state is progressively:
      CI Subphone (3ish per phone)
      CD phone (=triphones)
      State-tying of CD phone
  Evaluation
    Word Error Rate

Discrete Representation of Signal
  Convert a continuous signal into discrete form.
  Thanks to Bryan Pellom for this slide
Digitizing The Signal (A-D)
  Sampling: measuring amplitude of signal at time t
    16,000 Hz (samples/sec): Microphone (“Wideband”)
    8,000 Hz (samples/sec): Telephone
  Why?
    Need at least 2 samples per cycle
    Max measurable frequency is half the sampling rate
    Human speech < 10,000 Hz, so need max 20K
    Telephone filtered at 4K, so 8K is enough
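
The "half the sampling rate" limit above can be checked numerically. The following is a minimal sketch, not taken from the lecture: the function names are hypothetical, and the 5,000 Hz test tone is an assumed example chosen because it lies above the 4,000 Hz limit of an 8,000 Hz sampling rate.

```python
import math

def nyquist_frequency(sample_rate_hz):
    """Highest frequency representable at this sampling rate (half the rate)."""
    return sample_rate_hz / 2.0

def sample_sine(freq_hz, sample_rate_hz, n_samples):
    """Measure the amplitude of a unit sine wave at times t = n / sample_rate."""
    return [math.sin(2 * math.pi * freq_hz * n / sample_rate_hz)
            for n in range(n_samples)]

def apparent_frequency(freq_hz, sample_rate_hz):
    """Frequency a pure tone appears to have after sampling (aliasing)."""
    folded = freq_hz % sample_rate_hz
    return min(folded, sample_rate_hz - folded)

MIC_RATE = 16000    # wideband microphone rate from the slide
PHONE_RATE = 8000   # telephone rate from the slide

print(nyquist_frequency(MIC_RATE))            # limit for microphone speech
print(nyquist_frequency(PHONE_RATE))          # limit for telephone speech
print(apparent_frequency(5000, PHONE_RATE))   # where an assumed 5 kHz tone lands
```

A tone above the limit is not lost but folded back into the representable band: sampled at 8,000 Hz, the 5,000 Hz tone produces the same sample values as a 3,000 Hz tone with inverted sign. This is why the telephone signal is filtered at 4K before it is sampled at 8K.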

Digitizing Speech (II)
  Quantization: representing real value of each amplitude as integer
    8-bit (-128 to 127) or 16-bit (-32768 to 32767)
  Formats:
    16 bit PCM
    8 bit mu-law; log compression
  LSB (Intel) vs. MSB (Sun, Apple)
  Headers:
    Raw (no header)
    Microsoft wav
    Sun .au
    40 byte header
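
The two formats above can be sketched in a few lines. This is an illustrative sketch under stated assumptions, not the lecture's code: amplitudes are assumed to be normalized to [-1.0, 1.0], the function names are hypothetical, and the mu-law function uses the continuous textbook formula with mu = 255 rather than the piecewise-linear approximation used by telephony codecs.

```python
import math

def quantize_linear(x, bits=16):
    """Map an amplitude in [-1.0, 1.0] to a signed integer of the given bit depth."""
    max_int = 2 ** (bits - 1) - 1    # 127 for 8-bit, 32767 for 16-bit
    min_int = -(2 ** (bits - 1))     # -128 for 8-bit, -32768 for 16-bit
    value = int(round(x * max_int))
    return max(min_int, min(max_int, value))

def mu_law_compress(x, mu=255):
    """Log-compress an amplitude in [-1.0, 1.0]; the result stays in [-1.0, 1.0]."""
    return math.copysign(math.log1p(mu * abs(x)) / math.log1p(mu), x)

def mu_law_expand(y, mu=255):
    """Invert mu_law_compress."""
    return math.copysign(((1 + mu) ** abs(y) - 1) / mu, y)

quiet = 0.01  # an assumed low-amplitude sample
print(quantize_linear(quiet, bits=8))                   # plain 8-bit code
print(quantize_linear(mu_law_compress(quiet), bits=8))  # 8-bit code after log compression
```

The log compression spends more of the 8-bit range on quiet samples: the plain 8-bit code for the quiet sample is close to zero, while the compressed code is much larger, so small amplitudes keep more resolution.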
Discrete Representation of Signal
  Byte swapping
    Little-endian vs. Big-endian
  Some audio formats have headers
    Headers contain meta-information such as sampling rates, recording condition
    Raw file refers to 'no header'
    Example: Microsoft wav, NIST SPHERE
  Nice sound manipulation tool: sox
    change sampling rate
    convert speech formats
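
Byte swapping matters because a raw file carries no header to say which byte order it uses. The sketch below uses Python's standard `struct` module to show what happens when 16-bit samples are read with the wrong byte order; the sample values and the helper name `swap_bytes_16` are made up for illustration.

```python
import struct

samples = [1, -2, 258]  # made-up 16-bit sample values

little = struct.pack("<3h", *samples)  # least significant byte first
big = struct.pack(">3h", *samples)     # most significant byte first

def swap_bytes_16(raw):
    """Swap the two bytes of every 16-bit sample in a raw PCM byte string."""
    if len(raw) % 2 != 0:
        raise ValueError("raw 16-bit PCM data must have an even number of bytes")
    swapped = bytearray(len(raw))
    swapped[0::2] = raw[1::2]
    swapped[1::2] = raw[0::2]
    return bytes(swapped)

# Reading big-endian data as little-endian gives wrong sample values.
print(struct.unpack("<3h", big))
# Swapping the bytes first recovers the original samples.
print(struct.unpack("<3h", swap_bytes_16(big)))
```

Misread samples are not obviously invalid numbers, so the usual symptom of a byte-order mistake is audio that sounds like loud noise rather than an error message.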

MFCC
  Mel-Frequency Cepstral Coefficient (MFCC)
  Most widely used spectral representation in ASR
Pre-Emphasis
  Pre-emphasis: boosting the energy in the high frequencies
  Q: Why do this?
  A: The spectrum for voiced segments has more energy at lower frequencies than higher frequencies.
    This is called
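
The visible part of the slide does not say how the boost is implemented. A common choice, assumed here rather than taken from the lecture, is a first-order high-pass filter y[n] = x[n] - alpha * x[n-1]. The function name and the value alpha = 0.97 are illustrative assumptions.

```python
def pre_emphasize(signal, alpha=0.97):
    """First-order high-pass filter: y[n] = x[n] - alpha * x[n-1].

    alpha = 0.97 is an assumed typical value, not one given in the slides.
    The first sample is passed through unchanged.
    """
    if not signal:
        return []
    out = [signal[0]]
    for n in range(1, len(signal)):
        out.append(signal[n] - alpha * signal[n - 1])
    return out

slow = [1.0, 1.0, 1.0, 1.0]    # lowest possible frequency: a constant
fast = [1.0, -1.0, 1.0, -1.0]  # highest possible frequency: sign flips every sample

print(pre_emphasize(slow))  # after the first sample, values shrink toward 0
print(pre_emphasize(fast))  # after the first sample, magnitudes grow to about 1.97
```

The slowly varying signal is almost cancelled while the rapidly alternating one is nearly doubled, which is the high-frequency boost the slide describes.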