
# signal-representation-10 - MIT Speech Signal Representation...


## Speech Signal Representation

- Fourier Analysis
- Cepstral Analysis
- Linear Prediction
- Auditorily-Motivated Representations
- Comparisons

MIT 6.345 Automatic Speech Recognition (2010)

## Discrete-Time Fourier Transform

$$X(e^{j\omega}) = \sum_{n=-\infty}^{+\infty} x[n]\, e^{-j\omega n}$$

$$x[n] = \frac{1}{2\pi} \int_{-\pi}^{\pi} X(e^{j\omega})\, e^{j\omega n}\, d\omega$$

Although $x[n]$ is discrete, $X(e^{j\omega})$ is continuous and periodic with period $2\pi$.

Convolution/multiplication duality:

$$y[n] = x[n] * h[n] \;\Longleftrightarrow\; Y(e^{j\omega}) = X(e^{j\omega})\, H(e^{j\omega})$$

$$y[n] = x[n]\, w[n] \;\Longleftrightarrow\; Y(e^{j\omega}) = \frac{1}{2\pi} \int_{-\pi}^{\pi} W(e^{j\theta})\, X(e^{j(\omega-\theta)})\, d\theta$$

## Short-Time Fourier Analysis

[Figure: signal $x[m]$ plotted against $m$, with windows $w[50-m]$, $w[100-m]$, and $w[200-m]$ positioned at $n = 50$, $n = 100$, and $n = 200$]

$$X_n(e^{j\omega}) = \sum_{m=-\infty}^{+\infty} w[n-m]\, x[m]\, e^{-j\omega m}$$

If $n$ is fixed, then it can be shown that:

$$X_n(e^{j\omega}) = \frac{1}{2\pi} \int_{-\pi}^{\pi} W(e^{j\theta})\, e^{j\theta n}\, X(e^{j(\omega+\theta)})\, d\theta$$

- The above equation is meaningful only if we assume that $X(e^{j\omega})$ represents the Fourier transform of a signal whose properties continue outside the window, or simply that the signal is zero outside the window.
- In order for $X_n(e^{j\omega})$ to correspond to $X(e^{j\omega})$, $W(e^{j\omega})$ must resemble an impulse with respect to $X(e^{j\omega})$.

## Rectangular Window

$$w[n] = 1, \qquad 0 \le n \le N-1$$

## Hamming Window

$$w[n] = 0.54 - 0.46 \cos\!\left(\frac{2\pi n}{N-1}\right), \qquad 0 \le n \le N-1$$

## Comparison of Windows

[Figure]

## Comparison of Windows (cont'd)

[Figure]

## Discrete Fourier Transform

$$x[n] \ (N \text{ points}) \;\Longleftrightarrow\; X[k] = X(z)\,\Big|_{z = e^{j 2\pi k / M}} \ (M \text{ points})$$

$$X[k] = \sum_{n=0}^{N-1} x[n]\, e^{-j \frac{2\pi k}{M} n}$$

$$x[n] = \frac{1}{M} \sum_{k=0}^{M-1} X[k]\, e^{j \frac{2\pi k}{M} n}$$

In general, the number of input points, $N$, and the number of frequency samples, $M$, need not be the same.
- If $M > N$, we must zero-pad the signal.
- If $M < N$, we must time-alias the signal.

## A Wideband Spectrogram

[Figure: spectrogram of the utterance "Two plus seven is less than ten"]

## A Narrowband Spectrogram

[Figure: spectrogram of the utterance "Two plus seven is less than ten"]

## Speech from an Omni-Directional Microphone

[Figure: the utterance "The Thinker is a famous sculpture."]

## Speech from a Close-Talking Microphone

[Figure: the utterance "The Thinker is a famous sculpture."]

## Speech over Telephone Channel

[Figure: the utterance "The Thinker is a famous sculpture."]

## Cepstral Analysis of Speech

[Figure: block diagram in which a voiced or unvoiced excitation $u[n]$ drives $H(z)$ to produce $s[n]$]

- The speech signal is often assumed to be the output of an LTI system; i.e., it is the convolution of the input and the impulse response.
- If we are interested in characterizing the signal in terms of the parameters of such a model, we must go through the process of deconvolution.
- Cepstral analysis is a common procedure used for such deconvolution.

## Cepstral Analysis

Cepstral analysis for convolution is based on the observation that:

$$x[n] = x_1[n] * x_2[n] \;\Longleftrightarrow\; X(e^{j\omega}) = X_1(e^{j\omega})\, X_2(e^{j\omega})$$

By taking the complex logarithm of $X(e^{j\omega})$, we obtain

$$\log\{X(e^{j\omega})\} = \log\{X_1(e^{j\omega})\} + \log\{X_2(e^{j\omega})\} = \hat{X}(e^{j\omega})$$

If the complex logarithm is unique, and if $\hat{X}(e^{j\omega})$ is a valid Discrete-Time Fourier Transform, then

$$\hat{x}[n] = \hat{x}_1[n] + \hat{x}_2[n]$$

The two convolved signals will be additive in this new, cepstral domain.

Note that:

$$\hat{X}(e^{j\omega}) = \log|X(e^{j\omega})| + j \arg\{X(e^{j\omega})\}$$

It can be shown that one approach to dealing with the problem of uniqueness is to require that $\arg\{X(e^{j\omega})\}$ be a continuous, odd, periodic function of $\omega$.
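The additive property stated on the Cepstral Analysis slide can be checked numerically. The sketch below is not from the lecture: it uses the real cepstrum (the inverse transform of $\log|X|$), which avoids the phase-uniqueness issue, and a plain $O(M^2)$ DFT written with the standard library. The helper names and the two short test sequences are hypothetical, chosen so that neither spectrum has a zero on the unit circle.

```python
import cmath
import math


def dft(x, M):
    """M-point DFT of x; equivalent to zero-padding x when M > len(x)."""
    return [
        sum(xn * cmath.exp(-2j * math.pi * k * n / M) for n, xn in enumerate(x))
        for k in range(M)
    ]


def real_cepstrum(x, M):
    """Inverse DFT of log|X[k]|. Assumes no X[k] is exactly zero."""
    log_mag = [math.log(abs(Xk)) for Xk in dft(x, M)]
    return [
        sum(L * cmath.exp(2j * math.pi * k * n / M) for k, L in enumerate(log_mag)).real / M
        for n in range(M)
    ]


def convolve(a, b):
    """Linear convolution of two finite sequences."""
    y = [0.0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            y[i + j] += ai * bj
    return y


# Hypothetical test sequences (illustration only).
x1 = [1.0, 0.5]
x2 = [1.0, -0.3, 0.2]
x = convolve(x1, x2)

M = 16  # at least len(x), so the DFT samples the spectrum of the full convolution
c1 = real_cepstrum(x1, M)
c2 = real_cepstrum(x2, M)
c = real_cepstrum(x, M)

# Convolution in time should appear as addition in the cepstral domain.
max_err = max(abs(c[n] - (c1[n] + c2[n])) for n in range(M))
print(f"largest deviation from additivity: {max_err:.2e}")
```

The deviation should be at the level of floating-point rounding, because $X[k] = X_1[k]\,X_2[k]$ holds exactly at every DFT sample once $M$ covers the full length of the convolution.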
## Cepstral Analysis (cont'd)

To the extent that $\hat{X}(e^{j\omega}) = \log\{X(e^{j\omega})\}$ is valid,

$$\hat{x}[n] = \frac{1}{2\pi} \int_{-\pi}^{+\pi} \log\{X(e^{j\omega})\}\, e^{j\omega n}\, d\omega \qquad \text{(complex cepstrum)}$$

$$c[n] = \frac{1}{2\pi} \int_{-\pi}^{+\pi} \log|X(e^{j\omega})|\, e^{j\omega n}\, d\omega \qquad \text{(cepstrum)}$$

- Contributions to the cepstrum due to periodic excitation will occur at integer multiples of the fundamental period.
- Contributions due to the glottal waveform, vocal tract, and radiation will be concentrated in the low-quefrency region.
- Cepstral analysis has been used for fundamental frequency and formant tracking.

## Example of Cepstral Analysis of Vowel (Tapering Window)

[Figure]

## Example of Cepstral Analysis of Fricative (Tapering Window)

[Figure]

## The Use of Cepstrum for Speech Recognition

- Many current speech recognition systems represent the speech signal as a set of cepstral coefficients, computed at a fixed frame rate.
- In addition, the time derivatives of the cepstral coefficients have also been used.

## Statistical Properties of Cepstral Coefficients (Tohkura, 1987)

From a digit database (100 speakers) over dial-up telephone lines.
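The statement on the Cepstral Analysis (cont'd) slide that periodic excitation shows up at integer multiples of the fundamental period can be illustrated with a small synthetic signal. This is a minimal sketch, not the lecture's procedure: the decaying pulse train, the three-tap filter `h`, the period `P`, and the search range are all hypothetical choices, and the DFT is a direct $O(M^2)$ implementation so that the example needs only the standard library.

```python
import cmath
import math


def real_cepstrum(x, M):
    """c[n] = inverse M-point DFT of log|X[k]|. Assumes no X[k] is exactly zero."""
    log_mag = [
        math.log(abs(sum(xn * cmath.exp(-2j * math.pi * k * n / M) for n, xn in enumerate(x))))
        for k in range(M)
    ]
    return [
        sum(L * cmath.exp(2j * math.pi * k * n / M) for k, L in enumerate(log_mag)).real / M
        for n in range(M)
    ]


def convolve(a, b):
    """Linear convolution of two finite sequences."""
    y = [0.0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            y[i + j] += ai * bj
    return y


# Hypothetical "voiced" signal: a decaying pulse train with period P samples,
# shaped by a short filter standing in for the vocal tract.
P = 40        # assumed fundamental period in samples
PULSES = 6    # number of excitation pulses
DECAY = 0.8   # pulse-to-pulse amplitude decay
excitation = [0.0] * ((PULSES - 1) * P + 1)
for r in range(PULSES):
    excitation[r * P] = DECAY ** r
h = [1.0, 0.6, 0.2]  # illustrative filter, not a real vocal-tract response
s = convolve(excitation, h)

M = 256
c = real_cepstrum(s, M)

# The filter dominates the low-quefrency region, so search above it.
lo, hi = 20, M // 2
peak = max(range(lo, hi), key=lambda n: c[n])
print("quefrency of largest cepstral peak:", peak)
print("cepstrum at P, 2P, 3P:", [round(c[k * P], 3) for k in (1, 2, 3)])
```

In real speech the frame would normally be tapered (for example with a Hamming window) before the transform; that step is left out here to keep the cepstral peak locations easy to reason about.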
## Linear Prediction

Linear prediction assumes an all-pole model for production:

$$\tilde{x}[n] = \sum_{k=1}^{p} a_k\, x[n-k] \qquad\qquad H(e^{j\omega}) = \frac{1}{1 - \sum_{k=1}^{p} a_k\, e^{-j\omega k}}$$

The predictor coefficients $\{a_k\}$ can be efficiently determined by minimizing the prediction error

$$E = \sum_{n} \left(x[n] - \tilde{x}[n]\right)^2$$

- Linear prediction effectively models the spectral envelope of the power spectrum of the speech signal.
- The LP coefficients can be used to represent the speech signal directly, or cepstral coefficients can be recursively obtained.
- LPC (and related) methods remain popular for speech coding.

## Examples of Various Spectral Representations

[Figure]

## Auditorily-Motivated Representations

- Perceptually relevant properties of the human auditory system can be incorporated into the spectral estimation process:
  - Critical-band spectral resolution
  - Equal-loudness preemphasis
  - Intensity-loudness power law
- A variety of models have been explored (e.g., Ghitza, Hermansky, Lyon, Patterson, Seneff, Shamma, and many others).
- Perceptual linear prediction (PLP) has obtained good ASR results (Hermansky, 1990).

## Mel-Frequency Spectral Representations

- Mel-frequency spectral coefficients (MFSCs) mimic the auditory scale.
- The Mel frequency scale is linear to 1 kHz and logarithmic above.
- Log energy is computed in bands by weighting spectral power.
- Mel-frequency cepstral coefficients (MFCCs) are popular for ASR.
- MFCCs are computed via the discrete cosine transform (DCT).

[Figure: plot with vertical axis from 0 to 1 and horizontal axis Frequency (Hz) from 0 to 7000]

## Alternative Representations

- Phonetically-motivated representations are intuitive and concise.
- It has been difficult to achieve robust ASR
performance

## Signal Representation Comparisons

- Many researchers have compared cepstral representations with Fourier-, LPC-, and auditory-based representations.
- Cepstral representation typically outperforms Fourier- and LPC-based representations.

Example: Classification of 16 vowels using ANN (Meng, 1991)

[Figure: bar chart of Testing Accuracy (%) for Clean Data and Noisy Data by Acoustic Representation (Auditory Model, MFSC, MFCC, DFT). Bar labels present in the extracted text: 66.1, 61.7, 61.6, 61.2, 54.0, 45.0, 44.5, 36.6; the assignment of values to bars is not recoverable from the extraction.]

- Performance of various signal representations cannot be compared without considering how the features will be used, i.e., the pattern classification techniques used (Leung et al., 1993).

## Frames or Segments?

- Most speech recognition systems are frame-based.
- But a segment-based approach may better capture the relevant acoustic properties of speech sounds.

## Heterogeneous Acoustic Measurements

- Classification confusions are most likely within manner classes.
- Measurements tailored to each class might prove useful.

## Things to Ponder...

- What are some of the properties of the human auditory system that we are not capturing (masking, bilateral hearing, etc.)?
- What about representing the speech signal in terms of phonetically motivated attributes (e.g., formants, durations, fundamental frequency contours)?
- How do we make use of these (sometimes heterogeneous) features for recognition (i.e., what are the appropriate methods for modeling them)?
- What about non-acoustic features, e.g., facial features?
- ...

## References

- Huang, X., Acero, A., and Hon, H., Spoken Language Processing, Prentice-Hall, 2001 (Chapters 5-6).
- Hermansky, H., "Perceptual linear prediction (PLP) analysis of speech," J. Acoust. Soc. of Amer., 87(4), 1738-1752, 1990.
- Mermelstein, P. and Davis, S., "Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences," IEEE Trans. ASSP, Vol. ASSP-28, No. 4, 357-366, 1980.
- Leung, H., Chigier, B., and Glass, J., "A Comparative Study of Signal Representation and Classification Techniques for Speech Recognition," Proc. ICASSP, Vol. II, 680-683, 1993.
- Meng, H., The Use of Distinctive Features for Automatic Speech Recognition, SM Thesis, MIT EECS, 1991.
- Tohkura, Y., "A weighted cepstral distance measure for speech recognition," IEEE Trans. ASSP, Vol. ASSP-35, No. 10, 1414-1422, 1987.
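As a worked companion to the Linear Prediction slide, the sketch below estimates the predictor coefficients $\{a_k\}$ by minimizing the prediction error $E$. The slides do not say which algorithm is used; the autocorrelation method with the Levinson-Durbin recursion shown here is one common choice and is an assumption, as are the function names and the synthetic second-order test signal.

```python
def autocorrelation(x, max_lag):
    """r[i] = sum over n of x[n] * x[n + i], for i = 0..max_lag."""
    return [sum(x[n] * x[n + i] for n in range(len(x) - i)) for i in range(max_lag + 1)]


def levinson_durbin(r, p):
    """Solve the normal equations sum_k a_k r[|i-k|] = r[i], i = 1..p.

    Returns the predictor coefficients [a_1, ..., a_p] for
    x~[n] = sum_k a_k x[n-k], and the final prediction error.
    """
    a = [0.0] * (p + 1)  # a[0] is unused
    err = r[0]
    for i in range(1, p + 1):
        # Reflection coefficient for order i.
        k_i = (r[i] - sum(a[k] * r[i - k] for k in range(1, i))) / err
        updated = a[:]
        updated[i] = k_i
        for k in range(1, i):
            updated[k] = a[k] - k_i * a[i - k]
        a = updated
        err *= 1.0 - k_i * k_i
    return a[1:], err


# Hypothetical test signal: impulse response of a known all-pole filter,
# x[n] = 1.2 x[n-1] - 0.72 x[n-2] + delta[n], long enough to decay to ~0.
TRUE_A = [1.2, -0.72]
N = 400
x = [0.0] * N
for n in range(N):
    x[n] = 1.0 if n == 0 else 0.0
    if n >= 1:
        x[n] += TRUE_A[0] * x[n - 1]
    if n >= 2:
        x[n] += TRUE_A[1] * x[n - 2]

p = 2
r = autocorrelation(x, p)
a_hat, final_err = levinson_durbin(r, p)
print("estimated predictor coefficients:", [round(v, 6) for v in a_hat])
```

For this noise-free all-pole signal the estimate should match `TRUE_A` closely; on real speech frames the signal would typically be windowed first and the model order `p` chosen to suit the sampling rate.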

## This note was uploaded on 05/08/2010 for the course CS 6.345 taught by Professor Glass during the Spring '10 term at MIT.
