Chap. 3 Signal Processing and Analysis Methods

N  number of samples in the analysis frame
M  number of samples shift between frames
p  LPC analysis order
Q  dimension of LPC-derived cepstral vector
K  number of frames over which cepstral time derivatives are computed

Although each of these parameters can be varied over a wide range of values, the following table gives typical values for analysis systems at three different sampling rates (6.67 kHz, 8 kHz, 10 kHz).

Typical Values of LPC Analysis Parameters for Speech-Recognition Systems

parameter    Fs = 6.67 kHz     Fs = 8 kHz        Fs = 10 kHz
N            300 (45 msec)     240 (30 msec)     300 (30 msec)
M            100 (15 msec)     80 (10 msec)      100 (10 msec)
p            8                 10                10
Q            12                12                12
K            3                 3                 3

3.4 VECTOR QUANTIZATION

The results of either the filter-bank analysis or the LPC analysis are a series of vectors characteristic of the time-varying spectral characteristics of the speech signal. For convenience, we denote the spectral vectors as v_l, l = 1, 2, ..., L, where typically each vector is a p-dimensional vector. If we compare the information rate of the vector representation to that of the raw (uncoded) speech waveform, we see that the spectral analysis has significantly reduced the required information rate. Consider, for example, 10-kHz sampled speech with 16-bit speech amplitudes. A raw signal information rate of 160,000 bps is required to store the speech samples in uncompressed format. For the spectral analysis, consider vectors of dimension p = 10 using 100 spectral vectors per second. If we again represent each spectral component to 16-bit precision, the required storage is about 100 x 10 x 16 bps, or 16,000 bps, about a 10-to-1 reduction over the uncompressed signal. Such compressions in storage rate are impressive. Based on the concept of ultimately needing only a single spectral representation for each basic speech unit, it may be possible to further reduce the raw spectral representation of speech to those drawn from a small, finite number of "unique" spectral vectors, each corresponding to one of the basic speech units (i.e., the phonemes). This ideal representation is, of course, impractical, because there is so much variability in the spectral properties of each of the basic speech units. However, the concept of building a codebook of "distinct" analysis vectors, albeit with significantly more code words than the basic set of phonemes, remains an attractive idea and is the basis behind a set of techniques commonly called vector quantization (VQ) methods. Based on this line of reasoning, assume that we require a codebook with about 1024 unique spectral vectors
(i.e., about 25 variants for each of the 40 basic speech units). Then to represent an arbitrary spectral vector, all we need is a 10-bit number: the index of the codebook vector that best matches the input vector. Assuming a rate of 100 spectral vectors per second, we see that a total bit rate of about 1000 bps is required to represent the spectral vectors of a speech signal. This rate is about 1/16th the rate required by the continuous spectral vectors. Hence the VQ representation is potentially an extremely efficient representation of the spectral information in the speech signal. This is one of the main reasons for the interest in VQ methods.

Before discussing the concepts involved in designing and implementing a practical VQ system, we first discuss the advantages and disadvantages of this type of representation.
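The rate reductions described above follow directly from multiplying items per second by bits per item. A small sketch makes the arithmetic explicit (the function and variable names are ours; the values are taken from the running example in the text):

```python
# Information rates for raw speech samples, spectral vectors, and VQ indices.

def bits_per_second(items_per_second: float, bits_per_item: float) -> float:
    """Information rate in bits per second."""
    return items_per_second * bits_per_item

raw_rate = bits_per_second(10_000, 16)         # 10-kHz samples, 16 bits each
spectral_rate = bits_per_second(100, 10 * 16)  # 100 vectors/s, p = 10, 16 bits/component
vq_rate = bits_per_second(100, 10)             # 100 indices/s, 10-bit codebook index

print(raw_rate)                 # 160000.0 bps
print(spectral_rate)            # 16000.0 bps
print(vq_rate)                  # 1000.0 bps
print(spectral_rate / vq_rate)  # 16.0, the 1/16th factor noted above
```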
The key advantages of the VQ representation are:

• reduced storage for spectral analysis information. We have already shown that the VQ representation is potentially very efficient. This efficiency can be exploited in a number of ways in practical VQ-based speech-recognition systems.

• reduced computation for determining similarity of spectral analysis vectors. In speech recognition a major component of the computation is the determination of spectral similarity between a pair of vectors. Based on the VQ representation, this spectral similarity computation is often reduced to a table lookup of similarities between pairs of codebook vectors.

• discrete representation of speech sounds. By associating a phonetic label (or possibly a set of phonetic labels or a phonetic class) with each codebook vector, the process of choosing a best codebook vector to represent a given spectral vector becomes equivalent to assigning a phonetic label to each spectral frame of speech. A range of recognition systems exist that exploit these labels so as to recognize speech in an
efficient manner.

The disadvantages of the use of a VQ codebook to represent speech spectral vectors are:

• an inherent spectral distortion in representing the actual analysis vector. Since there is only a finite number of codebook vectors, the process of choosing the "best" representation of a given spectral vector inherently is equivalent to quantizing the vector and leads, by definition, to a certain level of quantization error. As the size of the codebook increases, the size of the quantization error decreases. However, with any finite codebook there will always be some nonzero level of quantization error.

• the storage required for codebook vectors is often nontrivial. The larger we make the codebook (so as to reduce quantization error), the more storage is required for the codebook entries. For codebook sizes of 1000 or larger, the storage is often nontrivial. Hence an inherent tradeoff among quantization error, processing for choosing the codebook vector, and storage of codebook vectors exists, and practical
designs balance each of these three factors.

3.4.1 Elements of a Vector Quantization Implementation

Figure 3.40 Block diagram of the basic VQ training and classification structure.

To build a VQ codebook and implement a VQ analysis procedure, we need the following:
1. a large set of spectral analysis vectors, v1, v2, ..., vL, which form a training set. The training set is used to create the "optimal" set of codebook vectors for representing the spectral variability observed in the training set. If we denote the size of the VQ codebook as M = 2^B vectors (we call this a B-bit codebook), then we require L > M so as to be able to find the best set of M codebook vectors in a robust manner. In practice, it has been found that L should be at least 10M in order to train a VQ codebook that works reasonably well.

2. a measure of similarity, or distance, between a pair of spectral analysis vectors, so as to be able to cluster the training set vectors as well as to associate or classify arbitrary spectral vectors into unique codebook entries. We denote the spectral distance d(v_i, v_j) between two vectors v_i and v_j as d_ij. We defer a discussion of spectral distance measures to Chapter 4.

3. a centroid computation procedure. On the basis of the partitioning that classifies the L training set vectors into M clusters, we choose the M codebook vectors as the centroid of each of the M clusters.

4. a classification procedure for arbitrary speech spectral analysis vectors that chooses the codebook vector closest to the input vector and uses the codebook index as the resulting spectral representation. This is often referred to as the nearest-neighbor labeling or optimal encoding procedure. The classification procedure is essentially a quantizer that accepts, as input, a speech spectral vector and provides, as output, the codebook index of the codebook vector that best matches the input.
Figure 3.40 shows a block diagram of the basic VQ training and classification structure. In the following sections we discuss each element of the VQ structure in more detail.

3.4.2 The VQ Training Set

To properly train the VQ codebook, the training set vectors should span the anticipated range of the following:

• talkers, including ranges in age, accent, gender, speaking rate, levels, and other
variables.

• speaking conditions, such as quiet room, automobile, and noisy workstation.

• transducers and transmission systems, including wideband microphones, telephone handsets (with both carbon and electret microphones), direct transmission, telephone channel, wideband channel, and other devices.

• speech units, including specific-recognition vocabularies (e.g., digits) and conversational speech.

The more narrowly focused the training set (i.e., limited talker populations, quiet-room speaking, carbon button telephone over a standard telephone channel, vocabulary of digits), the smaller the quantization error in representing the spectral information with a fixed-size codebook. However, for applicability to a wide range of problems, the training set should be as broad, in each of the above dimensions, as possible.
3.4.3 The Similarity or Distance Measure

The spectral distance measure for comparing spectral vectors v_i and v_j is of the form

    d(v_i, v_j) = d_ij,  with d_ij = 0 if v_i = v_j, and d_ij > 0 otherwise.        (3.93)
As we will see in Chapter 4, the distance measure commonly used for comparing filter-bank vectors is an L1, L2, or covariance-weighted spectral difference, whereas for LPC vectors (and related feature sets such as LPC-derived cepstral vectors), measures such as the likelihood and cepstral distance measures are generally used.
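As a concrete placeholder, a plain Euclidean (L2) distance between two feature vectors already satisfies the conditions of Eq. (3.93). This is only an illustrative sketch (the function name is ours), not one of the LPC-oriented measures of Chapter 4:

```python
import math

# A minimal spectral distance satisfying Eq. (3.93): the L2 (Euclidean)
# distance is zero exactly when the two vectors are identical, and
# positive otherwise.

def l2_distance(vi, vj):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(vi, vj)))

v1 = [1.0, 0.5, -0.2]
v2 = [0.8, 0.7, 0.1]
print(l2_distance(v1, v1))      # 0.0 for identical vectors
print(l2_distance(v1, v2) > 0)  # True for distinct vectors
```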
3.4.4 Clustering the Training Vectors

The way in which a set of L training vectors can be clustered into a set of M codebook vectors is the following (this procedure is known as the generalized Lloyd algorithm or the K-means clustering algorithm):
put. 1. Initialization: Arbitrarily choose M vectors (initially out of the training set of L
vectors) as the initial set of code words in the codebook. 2. NearestNeighbor Search: For each training vector, ﬁnd the code word in the current codebook that is closest (in terms of spectral distance), and assign that vector to the
corresponding cell (associated with the closest code word). iflcation structure. 1 more detail. 3. Centroid Update: Update the code word in each cell using the centroid of the training
vectors assigned to that cell. . Iteration: Repeat steps 2 and 3 until the average distance falls below a preset threshold. span the anticipd _ __ Figure 3.41 illustrates the result of designing a VQ codebook by showing the parti tioning of a (2dimensional) spectral vector space into distinct regions, each of which is 126 Chap. 3 Signal Processing and Analysis Methods em = 0.1 PARTITIONED VECTOR SPACE
X = CENTROID OF REGION Figure 3.41 Partitioning of a vector space into VQ cells with each cell represented by a centroid
vector. represented by a centroid vector. The shape of each partitioned cell is highly dependent
on the spectral distortion measure and the statistics of the vectors in the training set. (For example, if a Euclidean distance is used, the cell boundaries are hyperplanes.)

Although the above iterative procedure works well, it has been shown that it is advantageous to design an M-vector codebook in stages, i.e., by first designing a 1-vector codebook, then using a splitting technique on the code words to initialize the search for a 2-vector codebook, and continuing the splitting process until the desired M-vector codebook is obtained. This procedure is called the binary split algorithm and is formally implemented
by the following procedure:

1. Design a 1-vector codebook; this is the centroid of the entire set of training vectors (hence, no iteration is required here).

2. Double the size of the codebook by splitting each current code word y_n according to the rule

       y_n^+ = y_n (1 + ε)
       y_n^- = y_n (1 - ε),        (3.94)

   where n varies from 1 to the current size of the codebook, and ε is a splitting parameter (typically ε is chosen in the range 0.01 ≤ ε ≤ 0.05).

3. Use the K-means iterative algorithm (as discussed above) to get the best set of centroids for the split codebook (i.e., the codebook of twice the size).

4. Iterate steps 2 and 3 until a codebook of size M is designed.
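The K-means and binary split procedures can be sketched together in a few dozen lines of pure Python. This is a minimal illustration under simplifying assumptions (a squared-Euclidean distance instead of the spectral measures of Chapter 4, a fixed iteration count instead of the distortion-threshold test of Figure 3.42, and function names of our own choosing):

```python
import random

def l2(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def centroid(cluster):
    """Component-wise mean of a non-empty list of vectors."""
    dim = len(cluster[0])
    return [sum(v[d] for v in cluster) / len(cluster) for d in range(dim)]

def kmeans(train, codebook, iters=20):
    """Steps 2-4: nearest-neighbor search plus centroid update."""
    for _ in range(iters):
        cells = [[] for _ in codebook]
        for v in train:  # nearest-neighbor search
            m = min(range(len(codebook)), key=lambda i: l2(v, codebook[i]))
            cells[m].append(v)
        # centroid update (keep the old code word if a cell is empty)
        codebook = [centroid(c) if c else codebook[i]
                    for i, c in enumerate(cells)]
    return codebook

def binary_split_design(train, M, eps=0.01):
    """Design an M-vector codebook by binary splitting (Eq. 3.94)."""
    codebook = [centroid(train)]        # step 1: 1-vector codebook
    while len(codebook) < M:
        codebook = [v for y in codebook  # step 2: split each code word
                    for v in ([x * (1 + eps) for x in y],
                              [x * (1 - eps) for x in y])]
        codebook = kmeans(train, codebook)  # step 3: re-optimize
    return codebook

random.seed(0)
# Toy 2-D training set drawn around two well-separated points.
train = [[random.gauss(0, 0.1), random.gauss(0, 0.1)] for _ in range(50)] + \
        [[random.gauss(5, 0.1), random.gauss(5, 0.1)] for _ in range(50)]
cb = binary_split_design(train, M=2)
print(sorted(round(c[0]) for c in cb))  # one centroid near 0, one near 5
```

A production design would also track the total distortion D across iterations and stop when it stops decreasing, as the flow diagram of Figure 3.42 prescribes.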
Figure 3.42 Flow diagram of binary split codebook generation algorithm.

Figure 3.42 shows, in a flow diagram, the detailed steps of the binary split VQ codebook generation technique. The box labeled "Classify Vectors" is the nearest-neighbor search procedure, and the box labeled "Find Centroids" is the centroid update procedure of the K-means algorithm. The box labeled "Compute D (Distortion)" sums the distances of all training vectors in the nearest-neighbor search so as to determine whether the procedure has converged (i.e., D = D' of the previous iteration).

To illustrate the effect of codebook size (i.e., number of codebook vectors) on average
training set distortion, Figure 3.43 [12] shows experimentally measured values of distortion
(in terms of the likelihood ratio measure and the equivalent dB values; see Chapter 4 for
more details) versus codebook size (as measured in bits per frame, B) for vectors of both
voiced and unvoiced speech. It can be seen that very signiﬁcant reductions in distortion
are achieved in going from a codebook size of 1 bit (2 vectors) to about 7 bits (128 vectors)
for both voiced and unvoiced speech. Beyond this point, reductions in distortion are much
smaller.

One initial motivation for considering the use of a VQ codebook was the assumption that, in the limit, the codebook should ideally have about 40 vectors, i.e., one vector per speech sound. However, since the codebook vectors represent short-time spectral measurements, there is inherently a certain degree of variability in specific codebook entries. Figure 3.44 shows a comparison of codebook vector locations in the F1-F2 plane for a 32-vector codebook, along with the vowel ellipses discussed in Chapter 2. (The 32 codewords were generated from a training set of conversational speech spoken by a set of male talkers. The training set included both speech and background signals.) It can be seen that the correspondence between codebook vector location and vowel location is weak. Furthermore, there appears to be a tendency to cluster around the neutral vowel /ɝ/.
Figure 3.43 Codebook distortion versus codebook size (measured in bits per frame) for both voiced and unvoiced speech (after Juang et al. [12]).
Figure 3.44 Codebook vector locations in the F1-F2 plane (for a 32-vector codebook) superimposed on the vowel ellipses (after Juang et al. [12]).
This can be attributed, in part, to both the distortion measure and to the manner in which spectral centroids are computed.
3.4.5 Vector Classification Procedure
The classification procedure for arbitrary spectral vectors is basically a full search through the codebook to find the "best" match. Thus if we denote the codebook vectors of an
M-vector codebook as y_m, 1 ≤ m ≤ M, and we denote the spectral vector to be classified (and quantized) as v, then the index, m*, of the best codebook entry is

    m* = arg min_{1 ≤ m ≤ M} d(v, y_m).        (3.95)
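Eq. (3.95) amounts to a full search over the codebook. A minimal sketch (squared-Euclidean distance stands in for the distance measure d, and the names are ours, not the book's):

```python
# Full-search VQ classification, Eq. (3.95): return the index m* of the
# codebook vector closest to the input vector v.

def vq_classify(v, codebook):
    def d(a, b):  # squared-Euclidean distance as a stand-in for d(., .)
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(codebook)), key=lambda m: d(v, codebook[m]))

codebook = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]]
print(vq_classify([0.9, 1.2], codebook))  # 1: nearest code word is [1.0, 1.0]
```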
For codebooks with large values of M (e.g., M ≥ 1024), the computation of Eq. (3.95) could be excessive, depending on the exact details of the distance measure; hence, alternative, suboptimal procedures for designing VQ codebooks have been investigated. We will briefly discuss such methods in a later section of this chapter.

3.4.6 Comparison of Vector and Scalar Quantizers

To illustrate the power of the concept of quantizing an entire vector (rather than quantizing
individual components of the vector), Figures 3.45 and 3.46 show comparisons of the results of using vector and scalar quantizers on typical speech spectral frames. In Figure 3.45 we see both model (speech) spectra and the resulting quantization error spectrum for 10-bit and 24-bit scalar quantizers and for a 10-bit vector quantizer. It is clear that the quantization error of the 10-bit vector quantizer is comparable to that of the 24-bit scalar quantizer. This implies that the vector quantizer provides a 14-bit reduction in storage (per frame) over a scalar quantizer, i.e., more than a 50% reduction in storage for the same distortion.

Figure 3.46 shows temporal plots of distortion as well as distortion error histograms for the three quantizers of Figure 3.45. It can be seen that even though the average distortion of the 10-bit VQ is comparable to that of the 24-bit scalar quantizer, the peak distortion of the 10-bit VQ is much smaller than the peak distortion of the 24-bit scalar quantizer. This represents another distinct advantage of VQ over scalar quantization.

3.4.7 Extensions of Vector Quantization

As mentioned earlier, several straightforward extensions of the ideas of VQ have been
proposed and studied, including the following:

1. Use of multiple codebooks in which codebooks are created separately (and independently) for each of several spectral (or temporal) representations of speech. Thus we might consider using a separate codebook for cepstral vectors and a separate codebook for the time derivatives of the cepstral vectors. This method of multiple codebooks has been used extensively in large-vocabulary speech-recognition systems.

2. Binary search trees in which a series of suboptimal VQs is used to limit the search
space so as to reduce the computation of the overall VQ from M distances to log (M)
distances. The training procedure ﬁrst designs an optimal M = 2 VQ and then
assigns all training vectors to one of the VQ cells. Next the procedure designs a
pair of M = 2 VQs, one for each subset of the preceding stage. This process is
iterated until the desired size is obtained in log M steps. The suboptimality of the procedure is related to the fact that training vectors initially split along one branch of the VQ cannot join the other branch at a later stage of processing; hence, the overall
distortion is not minimal at each branch of the tree.

Figure 3.45 Model and distortion error spectra for scalar and vector quantizers (after Juang et al. [12]).

3. K-tuple (fixed-length block) quantizers in which K frames of speech are coded at a time, rather than single frames, as is conventionally the case. The idea is to exploit correlations in time for vowels and vowel-like sounds. The disadvantage occurs for sounds where the correlation along the K-tuple is low, i.e., transient sounds and many consonants.

4. Matrix quantization in which a codebook of sounds or words of variable sequence
length is created. The concept here is to handle time variability via some type of dynamic programming procedure and thereby create a codebook of sequences of vectors that represent typical sounds or words. Such techniques are most applicable to word-recognition systems.

Figure 3.46 Plots and histograms of temporal distortion for scalar and vector quantizers (after Juang et al. [12]).

5. Trellis codes in which time-sequential dependencies among codebook entries are explicitly determined as part of the training phase. The idea here is that when input vector v_n is quantized using codeword y_i, then input vector v_{n+1} is quantized using one of a limited subset of codebook entries that are related to y_i via a set of learned sequential constraints, thereby reducing the computation of encoding the input and increasing the ability to interpret the codebook output in terms of basic speech units.

6. Hidden Markov models in which both time and spectral constraints are used to quantize an entire speech utterance in a well-defined and efficient manner. We defer a discussion of hidden Markov models to Chapter 6.

3.4.8 Summary of the VQ Method

In later chapters of this book we will see several examples of how VQ concepts can be
exploited in speech-recognition systems. Here we have shown that the basic idea of VQ is to reduce the information rate of the speech signal to a low rate through the use of a codebook with a relatively small number of code words. The goal is to be able to
represent the spectral information of the signal in an efficient manner and in a way that direct connections to the acoustic-phonetic framework discussed in Chapter 2 can be made. Various techniques for achieving this efficiency of representation were discussed, and their properties were illustrated on representative examples of speech.

Figure 3.47 Physiological model of the human ear.

3.5 AUDITORY-BASED SPECTRAL ANALYSIS MODELS

The motivation for investigating spectral analysis methods that are physiologically based
is to gain an understanding of how the human auditory system processes speech, so as to
be able to design and implement robust, efﬁcient methods of analyzing and representing
speech. It is generally assumed that the better we understand the signal processing in the
human auditory system, the closer we will come to being able to design a system that can
truly understand meaning as well as content of speech.

With these considerations in mind, we first examine a physiological model of the human ear. Such a model is given in Figure 3.47 and it shows that the ear has three distinct regions called the outer ear, the middle ear, and the inner ear. The outer ear consists of the pinna (the ear surface surrounding the canal in which sound is funneled), and the external canal. Sound waves reach the ear and are guided through the outer ear to the middle ear,