Speech Processing
Analysis and Synthesis of Pole-Zero Speech Models
Veton Këpuska, February 13, 2012

Introduction
Deterministic: speech sounds with periodic or impulse sources.
Stochastic: speech sounds with noise sources.
The goal is to derive a vocal tract model for each class of sound source.
It will be shown that the solution equations for the two classes are similar in structure.
The solution approach is referred to as linear prediction analysis.
Linear prediction analysis leads to a method of speech synthesis based on the all-pole model.
Note that the all-pole model is intimately associated with the concatenated lossless tube model of the previous chapter (Chapter 4).

All-Pole Modeling of Deterministic Signals
Consider the vocal tract transfer function during a voiced source:
[Figure: an impulse train u_g[n] with period T (the pitch period), scaled by the gain A, passes through the glottal model G(z), the vocal tract model V(z), and the radiation model R(z) to produce the speech signal s[n].]
$$H(z) = A\,G(z)\,V(z)\,R(z) = \frac{A}{1 - \sum_{k=1}^{p} a_k z^{-k}}$$
All-Pole Modeling of Deterministic Signals
What about the fact that R(z) is a zero model?
A single zero can be expressed as an infinite set of poles. Note that:
$$\sum_{k=0}^{\infty} \left(a z^{-1}\right)^k = \sum_{k=0}^{\infty} a^k z^{-k} = \frac{1}{1 - a z^{-1}}, \qquad |a z^{-1}| < 1 \;\Leftrightarrow\; |a| < |z|$$
From the above expression one can derive:
$$\underbrace{1 - a z^{-1}}_{\text{simple zero}} = \frac{1}{\sum_{k=0}^{\infty} a^k z^{-k}} = \underbrace{\frac{1}{\prod_{k=0}^{\infty}\left(1 - b_k z^{-1}\right)}}_{\text{infinite number of poles}}, \qquad |z| > |a|$$
In practice, the infinite number of poles is approximated by a finite set of poles, since $a^k \to 0$ as $k \to \infty$.
H(z) can therefore be considered an all-pole representation, although:
  Representing a zero with a large number of poles is inefficient.
  Estimating zeros directly is a more efficient approach (covered later in this chapter).

Model Estimation
Goal: estimate the filter coefficients {a_1, a_2, ..., a_p}, for a particular order p, and the gain A, over a short time span of the speech signal (typically 20 ms) for which the signal is considered quasi-stationary.
Use the linear prediction method:
  Each speech sample is approximated as a linear combination of past speech samples.
  Linear prediction is a set of analysis techniques for estimating the parameters of the all-pole model.
Model Estimation
Consider the z-transform of the vocal tract model:
$$H(z) = \frac{S(z)}{U_g(z)} = \frac{A}{1 - \sum_{k=1}^{p} a_k z^{-k}}$$
which can be transformed into:
$$S(z)\left[1 - \sum_{k=1}^{p} a_k z^{-k}\right] = S(z) - \sum_{k=1}^{p} a_k S(z) z^{-k} = A\,U_g(z)$$
In the time domain it can be written as:
$$s[n] = \sum_{k=1}^{p} a_k s[n-k] + A\,u_g[n]$$
This is referred to as an autoregressive (AR) model. Here s[n] is the current sample, a_k are the linear prediction coefficients, s[n-k] are the past samples, A is the scaling factor, and u_g[n] is the input.
Model Estimation
The method used to predict the current sample from a linear combination of past samples is called linear prediction analysis.
Quantization of the linear prediction coefficients, or of a transformed version of these coefficients, is called linear prediction coding (LPC) (Chapter 12).
For u_g[n] = 0:
$$s[n] = \sum_{k=1}^{p} a_k s[n-k]$$
This observation motivates the analysis technique of linear prediction.
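The observation above can be checked numerically. The following is a minimal sketch, not the course's implementation: it assumes a hypothetical stable two-pole model (the coefficient values, the gain, and the function names are illustrative) driven by a single impulse, and shows that s[n] minus the linear combination of its past samples vanishes wherever the input is zero.

```python
def ar_synthesize(a, gain, excitation):
    """Run s[n] = sum_k a[k-1]*s[n-k] + gain*excitation[n] from zero initial state."""
    s = []
    for n, u in enumerate(excitation):
        past = sum(a[k - 1] * s[n - k] for k in range(1, len(a) + 1) if n - k >= 0)
        s.append(past + gain * u)
    return s


def prediction_error(alpha, s):
    """Return e[n] = s[n] - sum_k alpha[k-1]*s[n-k], taking s[n] = 0 for n < 0."""
    e = []
    for n in range(len(s)):
        pred = sum(alpha[k - 1] * s[n - k] for k in range(1, len(alpha) + 1) if n - k >= 0)
        e.append(s[n] - pred)
    return e


# Hypothetical two-pole model (poles at 0.6 +/- 0.6j, inside the unit circle).
a = [1.2, -0.72]
gain = 2.0
impulse = [1.0] + [0.0] * 19

s = ar_synthesize(a, gain, impulse)
e = prediction_error(a, s)  # nonzero only where the input is nonzero
```

With the predictor coefficients set equal to the model coefficients, the error sequence reproduces the scaled input: a single sample of height `gain` at n = 0 and zeros elsewhere.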
Model Estimation: Definitions
A linear predictor of order p is defined by:
$$\tilde{s}[n] = \sum_{k=1}^{p} \alpha_k s[n-k]$$
where $\tilde{s}[n]$ is the estimate of s[n] and $\alpha_k$ is the estimate of $a_k$. In the z-domain:
$$\tilde{S}(z) = S(z) \sum_{k=1}^{p} \alpha_k z^{-k} = P(z)\,S(z), \qquad P(z) = \sum_{k=1}^{p} \alpha_k z^{-k}$$
The prediction error sequence is the difference between the original sequence and its prediction:
$$e[n] = s[n] - \tilde{s}[n] = s[n] - \sum_{k=1}^{p} \alpha_k s[n-k]$$
The associated prediction error filter is defined as:
$$E(z) = S(z) - S(z)\sum_{k=1}^{p} \alpha_k z^{-k} = S(z)\left[1 - \sum_{k=1}^{p} \alpha_k z^{-k}\right] = S(z)\,A(z)$$
$$A(z) = 1 - \sum_{k=1}^{p} \alpha_k z^{-k} = 1 - P(z)$$
If $\{\alpha_k\} = \{a_k\}$, then $e[n] = A\,u_g[n]$.
[Figure: s[n] passes through P(z) to give the prediction $\tilde{s}[n]$; s[n] passes through A(z) to give e[n] = A u_g[n].]

Note 1:
$$e[n] = s[n] - \tilde{s}[n] = s[n] - \sum_{k=1}^{p} \alpha_k s[n-k] = \sum_{k=1}^{p} a_k s[n-k] + A\,u_g[n] - \sum_{k=1}^{p} \alpha_k s[n-k]$$
so that, for $\alpha_k = a_k$:
$$e[n] = A\,u_g[n]$$
Recovery of s[n]:
$$S(z)\,A(z) = A\,U_g(z) \;\Rightarrow\; S(z) = \frac{A\,U_g(z)}{A(z)}$$
[Figure: A u_g[n] passes through 1/A(z) to give s[n].]

Note 2: If
1. the vocal tract contains a finite number of poles and no zeros, and
2. the prediction order is correct,
then $\{\alpha_k\} = \{a_k\}$, and e[n] is an impulse train for voiced speech; for impulsive speech e[n] is just a single impulse.

Example 5.1
Consider an exponentially decaying impulse response of the form $h[n] = a^n u[n]$, where u[n] is the unit step. The response to the scaled unit sample $A\delta[n]$ is:
$$s[n] = A\delta[n] * h[n] = A a^n u[n]$$
Consider the prediction of s[n] using a linear predictor of order p = 1. It is a good fit since:
$$H(z) = \frac{1}{1 - a z^{-1}}$$
The prediction error sequence with $\alpha_1 = a$ is:
$$e[n] = s[n] - a\,s[n-1] = A\left(a^n u[n] - a\,a^{n-1} u[n-1]\right) = A a^n \left(u[n] - u[n-1]\right) = A\delta[n]$$
The prediction of the signal is exact except at the time origin.

Error Minimization
An important question is how to derive an estimate of the prediction coefficients $\alpha_k$, for a particular order p, that is optimal in some sense.
Optimality is measured against a criterion. An appropriate measure of optimality is the mean-squared error (MSE).
The goal is to minimize the mean-squared prediction error E, defined as:
$$E = \sum_{m=-\infty}^{\infty} \left(s[m] - \tilde{s}[m]\right)^2 = \sum_{m=-\infty}^{\infty} e^2[m]$$
In reality, a model must be valid over some short-time interval, say M samples on either side of n.
Thus in practice the MSE is time-dependent and is formed over a finite interval:
$$E_n = \sum_{m=n-M}^{n+M} e^2[m]$$
where [n-M, n+M] is the prediction error interval. Alternatively:
$$E_n = \sum_{m=-\infty}^{\infty} e_n^2[m]$$
where
$$e_n[m] = \begin{cases} s_n[m] - \sum_{k=1}^{p} \alpha_k s_n[m-k], & n-M \le m \le n+M \\ 0, & \text{elsewhere} \end{cases}$$
Because $e_n[m]$ vanishes outside the prediction error interval, the sums over m below effectively run over $n-M \le m \le n+M$.

Determine $\{\alpha_k\}$ for which $E_n$ is minimal:
$$\frac{\partial E_n}{\partial \alpha_i} = 0, \qquad i = 1, 2, 3, \ldots, p$$
which results in:
$$\frac{\partial E_n}{\partial \alpha_i} = \frac{\partial}{\partial \alpha_i} \sum_{m} \left(s_n[m] - \sum_{k=1}^{p} \alpha_k s_n[m-k]\right)^2$$
$$= 2 \sum_{m} \left(s_n[m] - \sum_{k=1}^{p} \alpha_k s_n[m-k]\right) \frac{\partial}{\partial \alpha_i}\left(-\sum_{k=1}^{p} \alpha_k s_n[m-k]\right)$$
$$= 2 \sum_{m} \left(s_n[m] - \sum_{k=1}^{p} \alpha_k s_n[m-k]\right)\left(-s_n[m-i]\right)$$
Setting the derivative to zero:
$$0 = 2 \sum_{m} \left(-s_n[m]\,s_n[m-i] + \sum_{k=1}^{p} \alpha_k s_n[m-k]\,s_n[m-i]\right)$$
The last equation can be rewritten by multiplying through:
$$\sum_{k=1}^{p} \alpha_k \sum_{m} s_n[m-i]\,s_n[m-k] = \sum_{m} s_n[m-i]\,s_n[m], \qquad 1 \le i \le p$$
Define the function:
$$\varphi_n[i,k] = \sum_{m} s_n[m-i]\,s_n[m-k], \qquad 1 \le i \le p,\; 0 \le k \le p$$
which gives the following:
$$\sum_{k=1}^{p} \alpha_k \varphi_n[i,k] = \varphi_n[i,0], \qquad i = 1, 2, 3, \ldots, p$$
These are referred to as the normal equations, given in matrix form below:
$$\Phi \boldsymbol{\alpha} = \mathbf{b}$$
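The normal equations can be set up and solved directly for a small example. The sketch below is one possible illustration under stated assumptions: the helper name `covariance_lpc`, the test signal (the impulse response of a hypothetical two-pole model), and the choice of plain Gaussian elimination are all illustrative, not part of the original slides. The sum over m runs over an explicit prediction error interval, and samples just before that interval are read from the signal rather than zeroed.

```python
def covariance_lpc(s, p, start, stop):
    """Solve sum_k alpha_k*phi[i,k] = phi[i,0], i = 1..p.

    phi[i,k] sums s[m-i]*s[m-k] over the error interval start <= m <= stop.
    Requires start >= p so that all referenced samples exist.
    """
    def phi(i, k):
        return sum(s[m - i] * s[m - k] for m in range(start, stop + 1))

    # Augmented matrix [Phi | b] with rows i = 1..p and columns k = 1..p.
    aug = [[phi(i, k) for k in range(1, p + 1)] + [phi(i, 0)]
           for i in range(1, p + 1)]

    # Forward elimination with partial pivoting.
    for col in range(p):
        pivot = max(range(col, p), key=lambda row: abs(aug[row][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(col + 1, p):
            factor = aug[row][col] / aug[col][col]
            for c in range(col, p + 1):
                aug[row][c] -= factor * aug[col][c]

    # Back substitution.
    alpha = [0.0] * p
    for i in range(p - 1, -1, -1):
        acc = aug[i][p] - sum(aug[i][c] * alpha[c] for c in range(i + 1, p))
        alpha[i] = acc / aug[i][i]
    return alpha


# Impulse response of a hypothetical model s[n] = 1.2 s[n-1] - 0.72 s[n-2] + delta[n].
a_true = [1.2, -0.72]
s = []
for n in range(40):
    u = 1.0 if n == 0 else 0.0
    past = sum(a_true[k - 1] * s[n - k] for k in (1, 2) if n - k >= 0)
    s.append(past + u)

alpha = covariance_lpc(s, p=2, start=2, stop=39)
```

Because the error interval here excludes the excitation sample at n = 0, the signal inside the interval is predicted exactly and the solution recovers the model coefficients up to rounding.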
Error Minimization
The minimum error for the optimal solution can be derived as follows:
$$E_n = \sum_{m} \left(s_n[m] - \sum_{k=1}^{p} \alpha_k s_n[m-k]\right)^2$$
$$E_n = \sum_{m} s_n^2[m] - 2\sum_{m} s_n[m] \sum_{k=1}^{p} \alpha_k s_n[m-k] + \sum_{m} \sum_{k=1}^{p} \alpha_k s_n[m-k] \sum_{l=1}^{p} \alpha_l s_n[m-l]$$
The last term in the equation above can be rewritten, using the normal equations in the final step, as:
$$\sum_{m} \sum_{k=1}^{p} \alpha_k s_n[m-k] \sum_{l=1}^{p} \alpha_l s_n[m-l] = \sum_{l=1}^{p} \alpha_l \sum_{k=1}^{p} \alpha_k \sum_{m} s_n[m-k]\,s_n[m-l] = \sum_{l=1}^{p} \alpha_l \sum_{m} s_n[m-l]\,s_n[m]$$
Thus the error can be expressed as:
$$E_n = \sum_{m} s_n^2[m] - 2\sum_{k=1}^{p} \alpha_k \sum_{m} s_n[m-k]\,s_n[m] + \sum_{l=1}^{p} \alpha_l \sum_{m} s_n[m-l]\,s_n[m]$$
$$= \sum_{m} s_n^2[m] - \sum_{k=1}^{p} \alpha_k \sum_{m} s_n[m-k]\,s_n[m] = \varphi_n[0,0] - \sum_{k=1}^{p} \alpha_k \varphi_n[0,k]$$

Remarks:
1. The order p of the actual underlying all-pole transfer function is not known.
   The order can be estimated by observing that a pth-order predictor in theory equals a (p+1)th-order predictor.
   Also, predictor coefficients for k > p equal zero (or in practice are close to zero and model only noise or random effects).
2. The prediction error e_n[m] is nonzero only "in the vicinity" of the time n: [n-M, n+M].
   In predicting values of the short-time sequence s_n[m], p values outside the prediction error interval [n-M, n+M] are required.
   The covariance method uses values outside the interval to predict values inside the interval.
   The autocorrelation method assumes that speech samples are zero outside the interval.

Matrix formulation:
$$\underbrace{\begin{bmatrix} s[n-M+0-1] & s[n-M+0-2] & \cdots & s[n-M+0-p] \\ s[n-M+1-1] & s[n-M+1-2] & \cdots & s[n-M+1-p] \\ \vdots & \vdots & & \vdots \\ s[n+M-1] & s[n+M-2] & \cdots & s[n+M-p] \end{bmatrix}}_{S_n} \underbrace{\begin{bmatrix} \alpha_1 \\ \alpha_2 \\ \vdots \\ \alpha_p \end{bmatrix}}_{\boldsymbol{\alpha}} = \underbrace{\begin{bmatrix} s[n-M] \\ s[n-M+1] \\ \vdots \\ s[n+M] \end{bmatrix}}_{\mathbf{s}_n}$$
Projection theorem:
  The columns of $S_n$ are the basis vectors.
  The error vector $\mathbf{e}_n$ is orthogonal to each basis vector: $S_n^T \mathbf{e}_n = 0$, where
$$e_n[m] = s_n[m] - \sum_{k=1}^{p} \alpha_k s_n[m-k], \qquad m \in [n-M, n+M]$$
  Orthogonality leads to:
$$\left(S_n^T S_n\right)\boldsymbol{\alpha} = S_n^T \mathbf{s}_n$$
Autocorrelation Method
In the previous section we described a general method of linear prediction that uses samples outside the prediction error interval, referred to as the covariance method.
An alternative approach that does not consider samples outside the analysis interval, referred to as the autocorrelation method, is presented next.
This method is suboptimal; however, it leads to an efficient and stable solution to the normal equations.

The autocorrelation method:
  Assumes that the samples outside the time interval [n-M, n+M] are all zero, and
  Extends the prediction error interval, i.e., the range over which we minimize the mean-squared error, to $\pm\infty$.
Conventions:
  Short-time interval: [n, n+N_w-1], where N_w = 2M+1 (note: it is not centered around sample n as in the previous derivation).
  The segment is shifted to the left by n samples so that the first nonzero sample falls at m = 0.
  This operation is equivalent to shifting the speech sequence s[m] by n samples to the left and windowing by an N_w-point rectangular window:
$$w[m] = 1, \qquad m = 0, 1, 2, \ldots, N_w - 1$$
The windowed sequence can be expressed as:
$$s_n[m] = s[m+n]\,w[m]$$
[Figure: the speech sequence shifted left by n samples and windowed by w[m].]

Important observations that are a consequence of zeroing the signal outside the interval:
1. The prediction error is nonzero only in the interval [0, N_w+p-1], where N_w is the window length and p is the predictor order.
2. The prediction error is largest at the left and right ends of the segment. This is due to edge effects caused by the way the prediction is done: at the left end, samples are predicted from the zeros to the left of the window; at the right end, the zeros to the right of the window are predicted from samples inside it.

To compensate for edge effects, a tapered window is typically used (e.g., Hamming):
  It removes the possibility that the mean-squared error is dominated by end (edge) effects.
  The data becomes distorted, hence biasing the estimates $\alpha_k$.
Let the mean-squared prediction error be given by:
$$E_n = \sum_{m=0}^{N_w+p-1} e_n^2[m]$$
where:
1. The limits of summation refer to the new time origin, and
2. The prediction error outside this interval is zero.

The normal equations take the following form (Exercise 5.1):
$$\sum_{k=1}^{p} \alpha_k \varphi_n[i,k] = \varphi_n[i,0], \qquad i = 1, 2, 3, \ldots, p$$
where
$$\varphi_n[i,k] = \sum_{m=0}^{N_w+p-1} s_n[m-i]\,s_n[m-k], \qquad 1 \le i \le p,\; 0 \le k \le p$$
Recognizing that only samples in the interval [i, k+N_w-1] contribute to the sum, the function $\varphi_n[i,k]$ can be written as:
$$\varphi_n[i,k] = \sum_{m=i}^{k+N_w-1} s_n[m-i]\,s_n[m-k]$$
Changing the variable $m \to m - i$:
$$\varphi_n[i,k] = \sum_{m=0}^{N_w-1-(i-k)} s_n[m]\,s_n[m+(i-k)], \qquad 1 \le i \le p,\; 0 \le k \le p$$
Since the above expression is a function only of the difference i-k, we denote it as:
$$r_n[i-k] = \varphi_n[i,k]$$
Letting $\tau = i-k$, referred to as the correlation "lag", leads to the short-time autocorrelation function:
$$r_n[\tau] = \sum_{m=0}^{N_w-1-\tau} s_n[m]\,s_n[m+\tau] = s_n[\tau] * s_n[-\tau]$$
The autocorrelation method thus leads to computation of the short-time sequence s_n[m] convolved with itself flipped in time.
The autocorrelation function is a measure of the "self-similarity" of the signal at different lags $\tau$.
When $r_n[\tau]$ is large, signal samples spaced by $\tau$ are said to be highly correlated.
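The short-time autocorrelation described above can be sketched in a few lines. The function name, argument names, and the toy signal are assumptions made for illustration; a rectangular window is used, so the segment is simply a slice of the signal starting at sample n.

```python
def short_time_autocorrelation(s, n, window_length, max_lag):
    """Return [r_n[0], ..., r_n[max_lag]] for a rectangular-windowed segment.

    The segment is s_n[m] = s[m + n] for 0 <= m < window_length, zero elsewhere.
    """
    segment = s[n:n + window_length]
    r = []
    for lag in range(max_lag + 1):
        # Only overlapping samples contribute, so the sum shortens as lag grows.
        r.append(sum(segment[m] * segment[m + lag]
                     for m in range(len(segment) - lag)))
    return r


signal = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
r = short_time_autocorrelation(signal, n=1, window_length=4, max_lag=3)
# The segment is [2, 3, 4, 5]; r[0] is its energy.
```

The number of overlapping products shrinks with the lag, which is why the envelope of the autocorrelation decays even for a perfectly periodic input.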
Autocorrelation Method
Properties of $r_n[\tau]$:
1. For an N-point sequence, $r_n[\tau]$ is zero outside the interval [-(N-1), N-1].
2. $r_n[\tau]$ is an even function of $\tau$.
3. $r_n[0] \ge |r_n[\tau]|$.
4. $r_n[0]$ equals the energy of s_n[m]:
$$r_n[0] = \sum_{m=-\infty}^{\infty} \left|s_n[m]\right|^2$$
5. If s_n[m] is a segment of a periodic sequence, then $r_n[\tau]$ is periodic-like with the same period:
   Because s_n[m] is short-time, the overlapping data in the correlation decreases as $\tau$ increases.
   The amplitude of $r_n[\tau]$ decreases as $\tau$ increases; with a rectangular window the envelope of $r_n[\tau]$ decreases linearly.
6. If s_n[m] is a random white noise sequence, then $r_n[\tau]$ is impulse-like, reflecting self-similarity only within a small neighborhood.
Autocorrelation Method
Letting $\varphi_n[i,k] = r_n[i-k]$, the normal equations take the form:
$$\sum_{k=1}^{p} \alpha_k r_n[i-k] = r_n[i], \qquad 1 \le i \le p$$
The expression represents p linear equations with p unknowns, $\alpha_k$ for $1 \le k \le p$.
Using the normal equation solution, it can be shown that the corresponding minimum mean-squared prediction error is given by:
$$E_n = r_n[0] - \sum_{k=1}^{p} \alpha_k r_n[k]$$
Matrix form of the normal equations: $R_n \boldsymbol{\alpha} = \mathbf{r}_n$.
Expanded form:
$$\underbrace{\begin{bmatrix} r_n[0] & r_n[1] & r_n[2] & \cdots & r_n[p-1] \\ r_n[1] & r_n[0] & r_n[1] & \cdots & r_n[p-2] \\ r_n[2] & r_n[1] & r_n[0] & \cdots & r_n[p-3] \\ \vdots & \vdots & \vdots & & \vdots \\ r_n[p-1] & r_n[p-2] & r_n[p-3] & \cdots & r_n[0] \end{bmatrix}}_{R_n} \underbrace{\begin{bmatrix} \alpha_1 \\ \alpha_2 \\ \alpha_3 \\ \vdots \\ \alpha_p \end{bmatrix}}_{\boldsymbol{\alpha}} = \underbrace{\begin{bmatrix} r_n[1] \\ r_n[2] \\ r_n[3] \\ \vdots \\ r_n[p] \end{bmatrix}}_{\mathbf{r}_n}$$
The $R_n$ matrix is Toeplitz:
  It is symmetric about the diagonal.
  All elements along each diagonal are equal.
  The matrix is invertible.
  This structure implies an efficient solution.
Example 5.3
Consider a system with an exponentially decaying impulse response of the form $h[n] = a^n u[n]$, with u[n] being the unit step function.
[Figure: the unit sample $\delta[n]$ is applied to h[n] to produce s[n].]
$$s[n] = \delta[n] * h[n] = h[n] = a^n u[n] \;\overset{Z}{\longleftrightarrow}\; S(z) = \frac{1}{1 - a z^{-1}}, \qquad |a| < 1$$
Estimate a using the autocorrelation method of linear prediction.
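A numerical sketch of this estimation problem follows. It assumes an N-point rectangular window starting at n = 0 and a first-order predictor, for which the single normal equation reduces to alpha_1 * r_0[0] = r_0[1]. The decay factor 0.9 and the function name are illustrative choices, not values from the slides.

```python
def first_order_autocorrelation_estimate(a, n_points):
    """Estimate alpha_1 = r0[1] / r0[0] from the first n_points samples of a**n."""
    s0 = [a ** m for m in range(n_points)]
    r0 = sum(x * x for x in s0)
    r1 = sum(s0[m] * s0[m + 1] for m in range(n_points - 1))
    return r1 / r0


a = 0.9  # hypothetical decay factor
short_window = first_order_autocorrelation_estimate(a, 4)
long_window = first_order_autocorrelation_estimate(a, 200)
```

With only four samples the estimate falls noticeably below the true value because the window truncates the decaying response; with a long window the estimate is essentially equal to a.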
Example 5.3 (continued)
Apply an N-point rectangular window [0, N-1] at n = 0. Compute $r_0[0]$ and $r_0[1]$:
$$r_0[0] = \sum_{m=0}^{N-1} s_0[m]\,s_0[m] = \sum_{m=0}^{N-1} a^m a^m = \sum_{m=0}^{N-1} a^{2m} = \frac{1 - a^{2N}}{1 - a^2}$$
$$r_0[1] = \sum_{m=0}^{N-2} s_0[m]\,s_0[m+1] = \sum_{m=0}^{N-2} a^m a^{m+1} = a \sum_{m=0}^{N-2} a^{2m} = a\,\frac{1 - a^{2N-2}}{1 - a^2}$$
Using the normal equations:
$$\alpha_1 r_0[0] = r_0[1] \;\Rightarrow\; \alpha_1 = \frac{r_0[1]}{r_0[0]} = a\,\frac{1 - a^{2N-2}}{1 - a^{2N}}, \qquad \lim_{N \to \infty} \alpha_1 = a$$
The minimum squared error, from the autocorrelation-method expression for $E_n$ given earlier, is thus (Exercise 5.5):
$$E_0 = r_0[0] - \sum_{k=1}^{1} \alpha_k r_0[k] = r_0[0] - \alpha_1 r_0[1] = \frac{1 - a^{4N-2}}{1 - a^{2N}}$$
For a first-order predictor, as in this example, the prediction error sequence for the true predictor (i.e., $\alpha_1 = a$) is given by $e[n] = s[n] - a\,s[n-1] = \delta[n]$ (see Example 5.1 presented earlier). Thus the prediction of the signal is exact except at the time origin.
This example illustrates that with enough data the autocorrelation method yields a solution close to the true single-pole model for an impulse input.

Limitations of the Linear Prediction Model
When the underlying measured sequence is the impulse response of an arbitrary all-pole system, the autocorrelation method yields the correct result.
There are a number of speech sounds for which, even with an arbitrarily long data sequence, a true solution cannot be obtained.
Consider a periodic sequence simulating a steady voiced sound, formed by convolving a periodic impulse train p[n] with an all-pole impulse response h[n]. The z-transform of h[n] is given by:
$$H(z) = \frac{1}{1 - \sum_{k=1}^{p} \alpha_k z^{-k}}$$
Thus:
$$h[n] = \sum_{k=1}^{p} \alpha_k h[n-k] + \delta[n]$$
The normal equations of this system are given by (see Exercise 5.7):
$$\sum_{k=1}^{p} \alpha_k r_h[i-k] = r_h[i], \qquad 1 \le i \le p$$
where the autocorrelation of h[n] is denoted by $r_h[\tau] = h[\tau] * h[-\tau]$.
Suppose now that the system is excited with an impulse train of period P:
$$s[n] = \sum_{k=-\infty}^{\infty} h[n - kP]$$
[Figure: an impulse train with period P applied to h[n].]
The normal equations associated with s[n] (windowed over multiple pitch periods) for an order-p predictor are given by:
$$\sum_{k=1}^{p} \alpha_k r_n[i-k] = r_n[i], \qquad 1 \le i \le p$$
It can be shown that $r_n$ is equal to periodically repeated replicas of $r_h$:
$$r_n[\tau] = \sum_{k=-\infty}^{\infty} r_h[\tau - kP]$$
but with decreasing amplitude due to the windowing (Exercise 5.7).
The autocorrelation function $r_n$ of the windowed signal s[n] can be thought of as an "aliased" version of $r_h$, due to overlap, which introduces distortion:
1. When aliasing is minor, the two solutions are approximately equal.
2. The accuracy of this approximation decreases as the pitch period decreases (e.g., high pitch), due to increased overlap of the autocorrelation replicas repeated every P samples.

Sources of error:
  Aliasing increases with high-pitched speakers (smaller pitch period P).
  The signal is not truly periodic.
  Speech is not always all-pole.
  The autocorrelation method is a suboptimal solution.
The covariance method is capable of giving the optimal solution; however, it is not guaranteed to converge when the underlying signal does not follow an all-pole model.

The Levinson Recursion of the Autocorrelation Method
Direct inversion method (Gaussian elimination): $\boldsymbol{\alpha} = R_n^{-1}\mathbf{r}_n$
requires on the order of $p^3$ multiplies and additions.
Levinson recursion (1947):
  Requires on the order of $p^2$ multiplies and additions.
  Links directly to the concatenated lossless tube model (Chapter 4) and thus provides a mechanism for estimating the vocal tract area function from an all-pole model estimate.

The Levinson Recursion of the Autocorrelation Method
Step 1: $\alpha_0^{(0)} = 0$, $E^{(0)} = r[0]$
for i = 1, 2, ..., p
  Step 2: $k_i = \dfrac{r[i] - \sum_{j=1}^{i-1} \alpha_j^{(i-1)} r[i-j]}{E^{(i-1)}}$ (the $k_i$ are the partial correlation coefficients, PARCOR)
  Step 3: $\alpha_i^{(i)} = k_i$, and $\alpha_j^{(i)} = \alpha_j^{(i-1)} - k_i\,\alpha_{i-j}^{(i-1)}$ for $1 \le j \le i-1$
  Step 4: $E^{(i)} = \left(1 - k_i^2\right) E^{(i-1)}$
end
$\alpha_j^{*} = \alpha_j^{(p)}$, $1 \le j \le p$
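The recursion steps above translate almost line for line into code. This is a minimal sketch (function and variable names are illustrative) that takes autocorrelation values r[0..p] and returns the predictor coefficients, the PARCOR coefficients, and the final prediction error.

```python
def levinson(r, p):
    """Levinson recursion on autocorrelation values r[0..p].

    Returns (alpha, parcor, error): alpha[j-1] holds alpha_j of the order-p
    predictor, parcor[i-1] holds k_i, and error is E^(p).
    """
    alpha = []       # coefficients of the current-order predictor
    parcor = []
    error = r[0]     # Step 1: E^(0) = r[0]
    for i in range(1, p + 1):
        # Step 2: partial correlation coefficient k_i.
        acc = r[i] - sum(alpha[j - 1] * r[i - j] for j in range(1, i))
        k = acc / error
        # Step 3: alpha_j^(i) = alpha_j^(i-1) - k_i * alpha_(i-j)^(i-1), alpha_i^(i) = k_i.
        alpha = [alpha[j - 1] - k * alpha[i - j - 1] for j in range(1, i)] + [k]
        # Step 4: E^(i) = (1 - k_i^2) * E^(i-1).
        error *= 1.0 - k * k
        parcor.append(k)
    return alpha, parcor, error


# Autocorrelation proportional to 0.5**lag, as for a one-pole signal with a = 0.5.
alpha, parcor, error = levinson([1.0, 0.5, 0.25], p=2)
```

For this input the second PARCOR coefficient is zero, so the order-2 predictor coincides with the order-1 predictor, which illustrates how coefficients beyond the true model order vanish.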
It can be shown that on each iteration the predictor coefficients $\alpha_k$ can be written solely as functions of the autocorrelation coefficients (Exercise 5.11).
The desired transfer function is given by:
$$H(z) = \frac{A}{1 - \sum_{k=1}^{p} \alpha_k^{*} z^{-k}}$$
The gain A has yet to be determined.

Properties of the Levinson Recursion of the Autocorrelation Method
1. The magnitude of the partial correlation coefficients is less than 1: $|k_i| < 1$ for all i.
2. The condition in property 1 is sufficient for stability: if all $|k_i| < 1$, then all roots of A(z) are inside the unit circle.
3. The autocorrelation method gives a minimum-phase solution even when the actual system is mixed-phase.

Example 5.4
Consider the discrete-time model of the complete transfer function from the glottis to the lips derived in Chapter 4 (Equation 4.40), but without zero contributions from the radiation and vocal tract:
$$H(z) = \frac{A}{(1 - \beta z)^2 \prod_{k=1}^{C_i} \left(1 - c_k z^{-1}\right)\left(1 - c_k^{*} z^{-1}\right)}$$
Suppose we measure a single impulse response, denoted by h[n], which is equal to the inverse z-transform of H(z), and estimate the model with the autocorrelation method, setting the number of poles of $\hat{H}(z)$ correctly, $p = 2 + 2C_i$, and with the prediction error defined over the entire duration of h[n]. This yields the solution:
$$\hat{H}(z) = \frac{A}{\left(1 - \beta z^{-1}\right)^2 \prod_{k=1}^{C_i} \left(1 - c_k z^{-1}\right)\left(1 - c_k^{*} z^{-1}\right)}$$
That is, the maximum-phase glottal poles are replaced by their minimum-phase counterparts.
[Figure: Experimentation Results.]

Formal explanation:
Suppose s[n] follows an all-pole model and the prediction error function is defined over all time (i.e., no window truncation effects). Then:
$$S(\omega) = M_s(\omega)\,e^{j\theta_s(\omega)} = M_s(\omega)\,e^{j\left[\theta_s^{\min}(\omega) + \theta_s^{\max}(\omega)\right]}$$
where $\theta_s^{\min}(\omega)$ and $\theta_s^{\max}(\omega)$ are the Fourier transform phase functions for the minimum- and maximum-phase contributions of $S(\omega)$, respectively.
The autocorrelation solution can be expressed as (Exercise 5.14):
$$\hat{S}(\omega) = M_s(\omega)\,e^{j\left[\theta_s^{\min}(\omega) - \theta_s^{\max}(\omega)\right]} = M_s(\omega)\,e^{j\left[\theta_s(\omega) - 2\theta_s^{\max}(\omega)\right]}$$
Rationalization of the result (Exercise 5.14):
$$H(\omega) = M_h(\omega)\,e^{j\theta_h(\omega)} = M_h(\omega)\,e^{j\left[\theta_v(\omega) + \theta_g(\omega)\right]}$$
where $\theta_v(\omega)$ is the minimum-phase contribution due to the vocal tract poles inside the unit circle, and $\theta_g(\omega)$ is the maximum-phase contribution due to the glottal poles outside the unit circle.
The resulting estimated frequency response can be expressed as:
$$\hat{H}(\omega) = M_h(\omega)\,e^{j\left[\theta_v(\omega) - \theta_g(\omega)\right]} = M_h(\omega)\,e^{j\left[\theta_h(\omega) - 2\theta_g(\omega)\right]}$$
The phase distortion of synthesized speech can have perceptual consequences, since a gradual onset of the glottal flow, and thus of the speech waveform during the open phase of the glottal cycle, is transformed into a "sharp attack" consistent with the energy concentration property of minimum-phase sequences (Chapter 2).

Properties of the Levinson Recursion of the Autocorrelation Method (continued)
4. Reverse Levinson recursion: how can lower-order models be obtained from higher-order ones?
$$k_i = \alpha_i^{(i)}, \qquad \alpha_j^{(i-1)} = \frac{\alpha_j^{(i)} + k_i\,\alpha_{i-j}^{(i)}}{1 - k_i^2}, \qquad j = 1, 2, \ldots, i-1$$
5. Autocorrelation matching: let $r_n$ be the autocorrelation of the speech signal s[n+m]w[m] and $r_h$ the autocorrelation of $h[n] = Z^{-1}\{H(z)\}$; then:
$$r_n[\tau] = r_h[\tau] \quad \text{for } |\tau| \le p$$

Autocorrelation Method: Gain Computation
$$A^2 = E_n = r_n[0] - \sum_{k=1}^{p} \alpha_k r_n[k]$$
$E_n$ is the average minimum prediction error for the pth-order predictor.
If the energy in the all-pole impulse response h[m] equals the energy in the measurement s_n[m], then the squared gain equals the minimum prediction error.

Relationship to the Lossless Tube Model
Recall that for the lossless concatenated tube model with glottal impedance $Z_g(z) = \infty$ (open circuit), the transfer function is:
$$V(z) = \frac{A}{D(z)}, \qquad D(z) = 1 - \sum_{k=1}^{N} \alpha_k z^{-k}$$
where D(z) is obtained recursively from:
$$D_0(z) = 1$$
$$D_k(z) = D_{k-1}(z) + r_k z^{-k} D_{k-1}(z^{-1}), \qquad k = 1, 2, \ldots, N$$
$$D(z) = D_N(z)$$
Here N is the number of tubes and the reflection coefficient $r_k$ is a function of the cross-sectional areas of successive tubes:
$$r_k = \frac{A_{k+1} - A_k}{A_{k+1} + A_k}$$
Levinson recursion:
$$H(z) = \frac{A}{A(z)}, \qquad A(z) = 1 - \sum_{k=1}^{p} \alpha_k z^{-k}$$
The recursion can be written in the z-domain (see Appendix 5.B):
$$A_0(z) = 1$$
$$A_i(z) = A_{i-1}(z) - k_i z^{-i} A_{i-1}(z^{-1}), \qquad i = 1, 2, \ldots, p$$
$$A(z) = A_p(z)$$
The starting condition is obtained by mapping $\alpha_0^{(0)} = 0$ to:
$$A_0(z) = 1 - \sum_{k=1}^{0} \alpha_k^{(0)} z^{-k} = 1$$
The sign of the $k_i$ term follows from Step 3 of the recursion, $\alpha_j^{(i)} = \alpha_j^{(i-1)} - k_i\,\alpha_{i-j}^{(i-1)}$. The two recursions are therefore identical when $r_i = -k_i$, which then makes $D_i(z) = A_i(z)$.
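The reflection-coefficient formula above can be inverted to recover relative tube areas. The sketch below is illustrative only: the function names and the coefficient values are assumptions, the first area is fixed at 1 because only area ratios are determined, and the input is taken to be the reflection coefficients r_k themselves (any mapping from PARCOR coefficients to r_k is left to the caller).

```python
def area_ratios(reflection):
    """Invert r_k = (A_{k+1} - A_k) / (A_{k+1} + A_k) to get A_{k+1} / A_k."""
    return [(1.0 + r) / (1.0 - r) for r in reflection]


def tube_areas(reflection, first_area=1.0):
    """Chain the ratios into areas A_1, A_2, ..., starting from first_area."""
    areas = [first_area]
    for ratio in area_ratios(reflection):
        areas.append(areas[-1] * ratio)
    return areas


def reflection_from_areas(areas):
    """Forward formula, used here only as a round-trip check."""
    return [(b - a) / (b + a) for a, b in zip(areas, areas[1:])]


r = [0.5, -0.2, 0.0]          # hypothetical reflection coefficients
areas = tube_areas(r)          # relative areas of four successive tubes
recovered = reflection_from_areas(areas)
```

A positive coefficient widens the next tube, a negative one narrows it, and a zero coefficient leaves the area unchanged.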
Relationship to the Lossless Tube Model
Since the boundary condition was not included in the lossless tube model, V(z) represents the ratio between an ideal volume velocity at the lips and at the glottis:
$$V(z) = \frac{U_L(z)}{U_g(z)}$$
The speech pressure measurement at the lips output, however, has embedded within it the glottal shape G(z), as well as the radiation at the lips R(z).
Recall that for the voiced case (with no vocal tract zeros):
$$H(z) = A\,G(z)\,V(z)\,R(z) = \frac{A\left(1 - \alpha z^{-1}\right)}{(1 - \beta z)^2 \prod_{k=1}^{C_i} \left(1 - c_k z^{-1}\right)\left(1 - c_k^{*} z^{-1}\right)}$$
The presence of the glottal shape, i.e., G(z), thus introduces poles that are not part of the vocal tract.
The net effect of the glottal shape is typically a 6 dB/octave falloff added to the spectral tilt of V(z) (see slide 94 of the presentation Acoustics of Speech Production).
The influence of the glottal flow shape and radiation load can be approximately removed with a pre-emphasis of 6 dB/octave spectral rise.
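A common way to approximate a 6 dB/octave spectral rise is a first-order difference filter. The sketch below assumes that form; the slides do not specify the pre-emphasis filter, and both the function name and the coefficient value 0.95 are illustrative assumptions.

```python
def pre_emphasize(x, coeff=0.95):
    """First-order pre-emphasis y[n] = x[n] - coeff * x[n-1], with x[-1] = 0."""
    y = []
    prev = 0.0
    for sample in x:
        y.append(sample - coeff * prev)
        prev = sample
    return y


# A constant (lowest-frequency) input is strongly attenuated after the first sample,
# while an alternating (highest-frequency) input is boosted.
constant = pre_emphasize([1.0] * 5)
alternating = pre_emphasize([1.0, -1.0, 1.0, -1.0, 1.0])
```

Applying such a filter before linear prediction analysis flattens the spectral tilt, so that more of the predictor's poles are available to model vocal tract resonances.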
Example 5.5
[Figure: two examples showing good matches to measured vocal tract area functions for the vowels /a/ and /i/, derived from estimates of the partial correlation coefficients.]

Frequency Domain Interpretation
Consider an all-pole model of speech production:
$$H(z) = \frac{A}{A(z)} \;\Rightarrow\; H(\omega) = \frac{A}{A(\omega)}$$
where $A(\omega)$ is given by:
$$A(\omega) = 1 - \sum_{k=1}^{p} \alpha_k e^{-j\omega k}$$
Define $Q(\omega)$ as the difference of the log-magnitude of the measured and modeled spectra:
$$Q(\omega) = \log|S(\omega)|^2 - \log|H(\omega)|^2 = \log\left|\frac{S(\omega)}{H(\omega)}\right|^2$$
Recall:
$$E(z) = A(z)\,S(z) \;\Rightarrow\; E(\omega) = A(\omega)\,S(\omega) = \frac{A\,S(\omega)}{H(\omega)} \;\Rightarrow\; H(\omega) = \frac{A\,S(\omega)}{E(\omega)}$$
Thus we can write $Q(\omega)$ as:
$$Q(\omega) = \log\left|\frac{E(\omega)}{A}\right|^2$$
Thus as e[n] is minimized, $E(\omega)$ is minimized, which in turn minimizes $Q(\omega)$: the spectral difference between the actual measured speech spectrum and the modeled spectrum is minimized.

Linear Prediction Analysis of Stochastic Speech Sounds
Linear prediction analysis was motivated by the observation that for a single impulse or periodic impulse train input to an all-pole vocal tract model, the prediction error is zero "most of the time".
Such analysis appears not to be applicable to speech sounds with fricative or aspirated sources modeled as a stochastic (or random) process.
However, the autocorrelation method of linear prediction can be formulated for the stochastic case, where a white noise input takes on the role of the single impulse.
The solution to a stochastic optimization problem, analogous to the minimization of the mean-squared error function $E_n$, leads to normal equations which are the stochastic counterparts to our earlier solution.
Derivation and interpretation of this stochastic optimization problem is left as an exercise.

Criterion of "Goodness"
How well does linear prediction describe the speech signal in time and in frequency?

Time Domain
Suppose:
  The underlying speech model is an all-pole model of order p, and
  The autocorrelation method is used in the estimation of the coefficients of the predictor polynomial P(z).
[Figure: the speech measurement s[n] passes through A(z) = 1 - P(z) to give the prediction error e[n].]
If the predictor coefficients are estimated exactly, then the prediction error is:
  A perfect impulse train for voiced speech.
  A single impulse for a plosive.
  White noise for noisy (stochastic) speech.
Time Domain
The autocorrelation method of linear prediction analysis does not yield such idealized outputs when the measurement s[n] is inverse filtered by the estimated system function A(z) (a limitation of the method):
1. Even when the vocal tract response follows an all-pole model, the true solution cannot be obtained, since the obtained solution approaches the true solution only in the limit when an infinite amount of data is available.
2. In a typical waveform segment, the actual vocal tract impulse response is not all-pole for a variety of reasons:
   Presence of zeros due to the radiation load, nasalization, and the back vocal cavity during frication and plosives.
   The glottal flow shape, even when adequately modeled, is not minimum phase (see Example 5.6).

Prediction Error Residuals
[Figure: prediction error residuals for the autocorrelation method of linear prediction of order 14, with estimation performed over 20 ms Hamming-windowed speech segments.]
Reconstructing the residual from an entire utterance, one typically hears in the prediction error:
  Not a noisy buzz, as expected from an idealized residual, but rather
  Roughly the speech itself: some of the vocal tract spectrum is passing through the inverse filter.

Frequency Domain
The behavior of linear prediction analysis can alternatively be studied in the frequency domain: how well does the spectrum derived from linear prediction analysis match the spectrum of a sequence that follows
  an all-pole model, and
  not an all-pole model?

Frequency Domain: Voiced Speech
Recall that for voiced speech s[n] the source is:
$$u_g[n] = \sum_{k=-\infty}^{\infty} \delta[n - kP]$$
with Fourier transform $U_g(\omega)$, and the vocal tract impulse response has an all-pole frequency response $H(\omega)$.
The windowed speech is:
$$s_n[m] = s[m+n]\,w[m]$$
The Fourier transform of the windowed speech is:
$$S(\omega) = \frac{2\pi}{P} \sum_{k=-\infty}^{\infty} H(k\omega_0)\,W(\omega - k\omega_0)$$
where $W(\omega)$ is the window transform and $\omega_0 = 2\pi/P$ is the fundamental frequency.

Frequency Domain: Unvoiced Speech
Recall that for unvoiced speech (stochastic sounds):
$$\left|S_{N_w}(\omega)\right|^2 \approx \left|H(\omega)\right|^2 \left|U_{oN_w}(\omega)\right|^2$$
where $|H(\omega)|^2$ is the spectral envelope and $|U_{oN_w}(\omega)|^2$ is the periodogram of the noise.
Linear prediction analysis attempts to estimate $|H(\omega)|$, the spectral envelope of the harmonic spectrum $S(\omega)$.
[Figure: schematics of spectra for periodic and stochastic speech sounds.]

Properties:
1. For large p, $|H(\omega)|$ matches the Fourier transform magnitude of the windowed signal, $|S(\omega)|$.
2. Spectral peaks are better matched than spectral valleys.

Synthesis Based on All-Pole Modeling
We are now able to synthesize the waveform from model parameters estimated using linear prediction analysis:
[Figure: analysis, in which the original speech s_o[n] passes through A(z) = 1 - P(z) to give e[n]; synthesis, in which the excitation A u[n] passes through H(z) = 1/A(z) to give s[n].]
Synthesized signal:
$$s[n] = \sum_{k=1}^{p} \alpha_k s[n-k] + A\,u[n]$$

Important parameters to consider:
Window duration:
  20-30 ms gives a satisfactory time-frequency tradeoff (Exercise 5.20).
  The duration can be adaptively varied to account for different time-frequency resolution requirements based on pitch, voicing state, and phoneme class.
Frame interval:
  A typical rate at which to perform analysis is every 10 ms.
Model order: there are three components to be considered:
1. Vocal tract: on average a "resonant density" of one resonance per 1000 Hz. Order of the system: #poles = 2 x #resonances (e.g., for a 5000 Hz bandwidth signal, 2 x 5 = 10 poles).
2. Glottal flow: 2-pole maximum-phase model.
3. Radiation at the lips: 1 zero inside the unit circle; 4 poles provide an adequate representation.
Total of 16 poles.
Remarks: the magnitude of the speech frequency response is preserved; the phase response is not preserved.

Voiced/Unvoiced State and Pitch Estimation:
  Currently no discrimination is made between, for example, plosive and fricative unvoiced speech sound categories.
  Pitch is estimated during voiced regions of speech only. However, pitch estimation algorithms typically estimate pitch as well as perform voiced/unvoiced classification.
  A degree of voicing may be desired in more complex analysis and synthesis methods, where voicing and turbulence occur simultaneously: voiced fricatives, breathy vowels.

Synthesis Structures:
1. Determine the excitation for each frame. Generate the excitation by:
   Concatenating an impulse train during voiced signal (spacing determined by the time-varying pitch contour).
   White noise during unvoiced signal.
2. Compute the gain, either directly by measuring frame energy or by using the autocorrelation method:
   Voiced speech: the magnitude of the impulse is the square root of the signal energy.
   Unvoiced speech: noise variance = signal variance.
3. Update the filter values on each frame.
4. Overlap and add the signal at consecutive frames.
[Figure: synthesis structures.]
[Figure: alternate synthesis structures.]
...
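The synthesis steps outlined above can be sketched as follows. This is a simplified illustration under stated assumptions, not the structure shown in the slides: function names and parameter values are hypothetical, the excitation has unit amplitude or unit variance with scaling left to the gain, and filter memory is carried from frame to frame in place of the overlap-add step.

```python
import random


def make_excitation(num_samples, voiced, pitch_period=80, seed=0):
    """Unit impulse train for voiced frames, unit-variance Gaussian noise otherwise."""
    if voiced:
        return [1.0 if n % pitch_period == 0 else 0.0 for n in range(num_samples)]
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(num_samples)]


def all_pole_filter(alpha, gain, excitation, state=None):
    """Apply s[n] = sum_k alpha[k-1]*s[n-k] + gain*u[n] to one frame.

    state holds the last len(alpha) output samples of the previous frame
    (most recent last), so the filter memory carries across frames.
    Assumes len(alpha) >= 1.
    """
    p = len(alpha)
    history = list(state) if state is not None else [0.0] * p
    out = []
    for u in excitation:
        sample = gain * u + sum(alpha[k] * history[-1 - k] for k in range(p))
        out.append(sample)
        history.append(sample)
    return out, history[-p:]


alpha = [0.9]  # hypothetical one-pole model
frame1, state = all_pole_filter(alpha, 1.0, make_excitation(4, True, pitch_period=2))
frame2, state = all_pole_filter(alpha, 1.0, make_excitation(2, True, pitch_period=2), state)
```

In a full synthesizer the coefficients, gain, voicing decision, and pitch period would be updated on every frame from the analysis stage; here they are held fixed to keep the example short.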