Introduction to Sound Processing

Davide Rocchesso
Università di Verona, Dipartimento di Informatica
email: D.Rocchesso@computer.org
www: http://www.scienze.univr.it/~rocchess

Copyright © 2003 Davide Rocchesso. This work is licensed under the Creative Commons Attribution-ShareAlike License. To view a copy of this license, visit http://creativecommons.org/licenses/by-sa/1.0/ or send a letter to Creative Commons, 559 Nathan Abbott Way, Stanford, California 94305, USA.

Attribution-ShareAlike 1.0

You are free:
• to copy, distribute, display, and perform the work
• to make derivative works
• to make commercial use of the work

under the following conditions:
• Attribution. You must give the original author credit.
• Share Alike. If you alter, transform, or build upon this work, you may distribute the resulting work only under a license identical to this one.

For any reuse or distribution, you must make clear to others the license terms of this work. Any of these conditions can be waived if you get permission from the author. Your fair use and other rights are in no way affected by the above.

The book is accessible from the author's web site: http://www.scienze.univr.it/~rocchess. The book is listed in http://www.theassayer.org, where reviews can be posted.

ISBN 8890112611

Cover Design: Claudia Calvaresi.
Editorial Production Staff: Nicola Bernardini, Federico Fontana, Alessandra Ceccherelli, Nicola Giosmin, Anna Meo.

Produced from LaTeX text sources and PostScript and TIFF images. Compiled with VTEX/free. Distributed online in Portable Document Format. Printed and bound in Italy by PHASAR Srl, Firenze.

Contents
1 Systems, Sampling and Quantization
  1.1 Continuous-Time Systems
  1.2 The Sampling Theorem
  1.3 Discrete-Time Spectral Representations
  1.4 Discrete-Time Systems
    1.4.1 The Impulse Response
    1.4.2 The Shift Theorem
    1.4.3 Stability and Causality
  1.5 Continuous-time to discrete-time system conversion
    1.5.1 Impulse Invariance
    1.5.2 Bilinear Transformation
  1.6 Quantization
2 Digital Filters
  2.1 FIR Filters
    2.1.1 The Simplest FIR Filter
    2.1.2 The Phase Response
    2.1.3 Higher-Order FIR Filters
    2.1.4 Realizations of FIR Filters
  2.2 IIR Filters
    2.2.1 The Simplest IIR Filter
    2.2.2 Higher-Order IIR Filters
    2.2.3 Allpass Filters
    2.2.4 Realizations of IIR Filters
  2.3 Complementary filters and filterbanks
  2.4 Frequency warping
3 Delays and Effects
  3.1 The Circular Buffer
  3.2 Fractional-Length Delay Lines
    3.2.1 FIR Interpolation Filters
    3.2.2 Allpass Interpolation Filters
  3.3 The Non-Recursive Comb Filter
  3.4 The Recursive Comb Filter
    3.4.1 The Comb-Allpass Filter
  3.5 Sound Effects Based on Delay Lines
  3.6 Spatial sound processing
    3.6.1 Spatialization
    3.6.2 Reverberation
4 Sound Analysis
  4.1 Short-Time Fourier Transform
    4.1.1 The Filterbank View
    4.1.2 The DFT View
    4.1.3 Windowing
    4.1.4 Representations
    4.1.5 Accurate partial estimation
  4.2 Linear predictive coding (with Federico Fontana)
5 Sound Modelling
  5.1 Spectral modelling
    5.1.1 The sinusoidal model
    5.1.2 Sines + Noise + Transients
    5.1.3 LPC Modelling
  5.2 Time-domain models
    5.2.1 The Digital Oscillator
    5.2.2 The Wavetable Oscillator
    5.2.3 Wavetable sampling synthesis
    5.2.4 Granular synthesis (with Giovanni De Poli)
  5.3 Nonlinear models
    5.3.1 Frequency and phase modulation
    5.3.2 Nonlinear distortion
  5.4 Physical models
    5.4.1 A physical oscillator
    5.4.2 Coupled oscillators
    5.4.3 One-dimensional distributed resonators
A Mathematical Fundamentals
  A.1 Classes of Numbers
    A.1.1 Fields
    A.1.2 Rings
    A.1.3 Complex Numbers
  A.2 Variables and Functions
  A.3 Polynomials
  A.4 Vectors and Matrices
    A.4.1 Square Matrices
  A.5 Exponentials and Logarithms
  A.6 Trigonometric Functions
  A.7 Derivatives and Integrals
    A.7.1 Derivatives of Functions
    A.7.2 Integrals of Functions
  A.8 Transforms
    A.8.1 The Laplace Transform
    A.8.2 The Fourier Transform
    A.8.3 The Z Transform
  A.9 Computer Arithmetics
    A.9.1 Integer Numbers
    A.9.2 Rational Numbers
B Tools for Sound Processing (with Nicola Bernardini)
  B.1 Sounds in Matlab and Octave
    B.1.1 Digression
  B.2 Languages for Sound Processing
    B.2.1 Unit generator
    B.2.2 Examples in Csound, SAOL, and CLM
  B.3 Interactive Graphical Building Environments
    B.3.1 Examples in ARES/MARS and pd
  B.4 Inline sound processing
    B.4.1 Time-Domain Graphical Editing and Processing
    B.4.2 Analysis/Resynthesis Packages
  B.5 Structure of a Digital Signal Processor
    B.5.1 Memory Management
    B.5.2 Internal Arithmetics
    B.5.3 The Pipeline
C Fundamentals of psychoacoustics
  C.1 The ear
  C.2 Sound Intensity
    C.2.1 Psychophysics
  C.3 Pitch
  C.4 Critical Band
  C.5 Masking
  C.6 Spatial sound perception
Index
References

Preface
What you have in your hands, or on your screen, is an introductory book on sound processing. By reading this book, you may expect to acquire some knowledge of the mathematical, algorithmic, and computational tools that I consider to be important in order to become proficient sound designers or manipulators. The book is targeted at both science-oriented and art-oriented readers, even though the latter may find it hard if they are not familiar with calculus. For this purpose an appendix of mathematical fundamentals has been prepared in such a way that the book becomes self-contained. Of course, the mathematical appendix is not intended to be a substitute for a thorough mathematical preparation, but rather a shortcut for those readers who are more eager to understand the applications.

Indeed, this book was conceived in 1997, when I was called to teach introductory audio signal processing in the course "Specialisti in Informatica Musicale" organized by the Centro Tempo Reale in Firenze. In that class, the majority of the students were excellent (no kidding, really superb!) music composers. Only two students had a scientific background (indeed, a really strong scientific background!). The task of introducing this audience to filters and transforms was so challenging for me that I started planning the lectures and laboratory material much earlier and in a structured form. This was the initial form of this book. The course turned out to be an exciting experience for me and, based on the music and the research material that I heard from them afterward, I have the impression that the students also made good use of it. After the course in Firenze, I expanded and improved the book during four editions of my course on sound processing for computer science students at the University of Verona.
The mathematical background of these students is different from that of typical electrical engineering students: it is stronger in discrete mathematics and algebra, with not much familiarity with advanced and applied calculus. Therefore, the book presents the basics of signals, systems, and transforms in a way that can be immediately used in applications and experienced in computer laboratory sessions.

This is a free book, meaning that it was written using free software tools, and it is freely downloadable, modifiable, and distributable in electronic or printed form, provided that the enclosed license and link to its original web location are included in any derivative distribution. The book web site also contains the source code listed in the book, and other auxiliary software modules. I encourage additions that may be useful to the reader. For instance, it would be nice to have each chapter ended by a section that collects annotations, solutions to the problems that I proposed in footnotes, and other problems or exercises. Feel free to exploit the open nature of this book to propose your additional contents.

Venezia, 11th February 2004
Davide Rocchesso

Chapter 1
Systems, Sampling and Quantization
1.1 Continuous-Time Systems

Sound is usually considered as a mono-dimensional signal (i.e., a function of time) representing the air pressure in the ear canal. For the purpose of this book, a Single-Input Single-Output (SISO) System is defined as any algorithm or device that takes a signal in input and produces a signal in output. Most of our discussion will regard linear systems, which can be defined as those systems for which the superposition principle holds:

Superposition Principle: if $y_1$ and $y_2$ are the responses to the inputs $x_1$ and $x_2$, respectively, then the input $a x_1 + b x_2$ produces the response $a y_1 + b y_2$.

The superposition principle allows us to study the behavior of a linear system starting from test signals such as impulses or sinusoids, and obtaining the responses to complicated signals by weighted sums of the basic responses.

A linear system is said to be linear time-invariant (LTI) if a time shift in the input results in the same time shift in the output or, in other words, if it does not change its behavior in time. Any continuous-time LTI system can be described by a differential equation. The Laplace transform, defined in appendix A.8.1, is a mathematical tool that is used to analyze continuous-time LTI systems, since it allows us to transform complicated differential equations into ratios of polynomials of a complex variable $s$. Such a ratio of polynomials is called the transfer function of the LTI system.

Example 1. Consider the LTI system having as input and output the functions of time (i.e., the signals) $x(t)$ and $y(t)$, respectively, and described by the differential equation

$$ \frac{dy}{dt} - s_0 y = x . \tag{1} $$

This equation, transformed into the Laplace domain according to the rules of appendix A.8.1, becomes

$$ s Y_L(s) - s_0 Y_L(s) = X_L(s) . \tag{2} $$

Here, as in most of the book, we implicitly assume that the initial conditions are zero, otherwise eq. (2) should also contain a term in $y(0)$.
From the algebraic equation (2) the transfer function is derived as the ratio between the output and input transforms:

$$ H(s) = \frac{1}{s - s_0} . \tag{3} $$

###

The coefficient $s_0$, root of the denominator polynomial of (3), is called the pole of the transfer function (or pole of the system). Any root of the numerator would be called a zero of the system.

The inverse Laplace transform of the transfer function is an equivalent description of the system. In the case of example 1, it takes the form

$$ h(t) = \begin{cases} e^{s_0 t} & t \geq 0 \\ 0 & t < 0 \end{cases} , \tag{4} $$

and such a function is called a causal exponential. In general, the function $h(t)$, inverse transform of the transfer function, is called the impulse response of the system, since it is the output obtained from the system as a response to an ideal impulse (see footnote 1).

The two equivalent descriptions of a linear system in the time domain (impulse response) and in the Laplace domain (transfer function) correspond to two alternative ways of expressing the operations that the system performs in order to obtain the output signal from the input signal.
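As a quick numerical check of the relation between the impulse response (4) and the transfer function (3), the following Octave/Matlab sketch approximates the Laplace integral of the causal exponential with the trapezoidal rule and compares it with $1/(s - s_0)$. The pole value, the evaluation point `s`, the step `dt`, and the truncation of the integral at 2 s are illustrative assumptions.

```octave
% Numerical check: the Laplace transform of h(t) = exp(s0*t), t >= 0,
% should equal H(s) = 1/(s - s0) wherever Re(s) > Re(s0).
s0 = -10 + 1i*100;          % pole of the system (illustrative value)
s  = 1i*50;                 % evaluation point on the imaginary axis (assumed)
dt = 1e-4;                  % integration step in seconds (assumed)
t  = 0:dt:2;                % exp(-10*2) is about 2e-9: the neglected tail is tiny
h  = exp(s0*t);             % samples of the causal exponential (4)
H_num = trapz(t, h .* exp(-s*t));   % truncated Laplace integral
H_ref = 1/(s - s0);                 % transfer function (3)
err = abs(H_num - H_ref);
fprintf('numerical: %g%+gj, closed form: %g%+gj, error: %g\n', ...
        real(H_num), imag(H_num), real(H_ref), imag(H_ref), err);
```

With these values the two results should agree to several decimal places; a coarser `dt` or a shorter time span would increase the discrepancy.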
Footnote 1: A rigorous definition of the ideal impulse, or Dirac function, is beyond the scope of this book. The reader can think of an ideal impulse as a signal having all its energy lumped at the time instant 0.

The description in the Laplace domain leads to simple multiplication between the Laplace transform of the input and the system transfer function:

$$ Y(s) = H(s) X(s) . \tag{5} $$

This operation can be interpreted as multiplication in the frequency domain if the complex variable $s$ is replaced by $j\Omega$, where $\Omega$ is the real variable of the Fourier domain. In other words, the frequency interpretation of (5) is obtained by restricting the variable $s$ from the complex plane to the imaginary axis. The transfer function, whose domain has been restricted to $j\Omega$, is called the frequency response. The frequency interpretation is particularly intuitive if we imagine the input signal as a complex sinusoid $e^{j\Omega_0 t}$, which has all its energy focused on the frequency $\Omega_0$ (in other words, we have a single spectral line at $\Omega_0$). The complex value of the frequency response (magnitude and phase) at the point $j\Omega_0$ corresponds to a joint magnitude scaling and phase shift of the sinusoid at that frequency.

The description in the time domain leads to the operation of convolution, which is defined as (see footnote 2)

$$ y(t) = (h * x)(t) = \int_{-\infty}^{+\infty} h(t - \tau) x(\tau) \, d\tau . \tag{6} $$

In order to obtain the signal coming out from a linear system it is sufficient to apply the convolution operator between the input signal and the impulse response.

Footnote 2: The convolution will be fully justified for discrete-time systems in section 1.4. Here, for continuous-time systems, we give only the definition.

1.2 The Sampling Theorem

In order to perform any form of processing by digital computers, the signals must be reduced to discrete samples of a discrete-time domain. The operation that transforms a signal from the continuous time to the discrete time is called sampling, and it is performed by picking up the values of the continuous-time signal at time instants that are multiples of a quantity $T$, called the sampling interval. The quantity $F_s = 1/T$ is called the sampling rate.

The presentation of a detailed theory of sampling would take too much space and it would easily become boring for the readership of this book. For a more extensive treatment there are many excellent books readily available, from the more rigorous [66, 65] to the more practical [67]. Luckily, the kernel of the theory can be summarized in a few rules that can be easily understood in terms of the frequency-domain interpretation of signals and systems.

The first rule is related to the frequency representation of discrete-time variables by means of the Fourier transform, defined in appendix A.8.3 as a specialization of the Z transform:

Rule 1.1 The Fourier transform of a function of a discrete variable is a function of the continuous variable $\omega$, periodic (see footnote 3) with period $2\pi$.

The second rule allows us to treat the sampled signals as functions of a discrete variable:

Rule 1.2 Sampling a continuous-time signal $x(t)$ with sampling interval $T$ produces a function $\hat{x}(n) = x(nT)$ of the discrete variable $n$.

If we call spectrum of a signal its Fourier-transformed counterpart, the fundamental rule of sampling is the following:

Rule 1.3 Sampling a continuous-time signal with sampling rate $F_s$ produces a discrete-time signal whose frequency spectrum is a periodic replication of the spectrum of the original signal, and the replication period is $F_s$. The Fourier variable $\omega$ for functions of a discrete variable is converted into the frequency variable $f$ (in Hz) by means of

$$ \omega = 2\pi f T = \frac{2\pi f}{F_s} . \tag{7} $$

Fig. 1 shows an example of the frequency spectrum of a signal sampled with sampling rate $F_s$. In the example, the continuous-time signal had all and only the frequency components between $-F_b$ and $F_b$. The replicas of the original spectrum are sometimes called images.
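Rule 1.3 and eq. (7) can be illustrated with a minimal Octave/Matlab sketch: a sinusoid at $f_0$ and its image at $f_0 + F_s$ produce the same samples, which is another way of saying that the spectrum of the sampled signal repeats with period $F_s$. The values of `Fs`, `f0`, and the number of samples are arbitrary choices made for this illustration.

```octave
Fs = 50;                    % sampling rate in Hz (assumed)
T  = 1/Fs;                  % sampling interval
n  = 0:19;                  % sample indices
f0 = 7;                     % frequency inside the base band, in Hz (assumed)
w0 = 2*pi*f0/Fs;            % digital frequency in radians/sample, eq. (7)
x_base  = cos(2*pi*f0*n*T);         % samples of the base-band sinusoid
x_image = cos(2*pi*(f0 + Fs)*n*T);  % samples of its first image
max_diff = max(abs(x_base - x_image));   % zero up to rounding errors
fprintf('w0 = %g rad/sample, largest sample difference = %g\n', w0, max_diff);
```

Once sampled, the two sinusoids cannot be told apart, so any further processing treats them as the same signal.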
Given the simple rules that we have just introduced, it is easy to understand the following Sampling Theorem, introduced by Nyquist in the twenties and popularized by Shannon in the forties:

Theorem 1.1 A continuous-time signal $x(t)$, whose spectral content is limited to frequencies smaller than $F_b$ (i.e., it is band-limited to $F_b$), can be recovered from its sampled version $\hat{x}(n) = x(nT)$ if the sampling rate $F_s = 1/T$ is such that

$$ F_s > 2 F_b . \tag{8} $$
Footnote 3 (to Rule 1.1): This periodicity is due to the periodicity of the complex exponential of the Fourier transform.

[Figure 1: Frequency spectrum of a sampled signal. Vertical axis: X(f); horizontal axis: f, with marks at −Fs, −Fb, 0, Fb, Fs/2, Fs.]

It is also clear how such a recovery might be obtained. Namely, by a linear reconstruction filter capable of eliminating the periodic images of the base band introduced by the sampling operation. Ideally, such a filter doesn't apply any modification to the frequency components lower than the Nyquist frequency, defined as $F_N = F_s/2$, and eliminates the remaining frequency components completely. The reconstruction filter can be defined in the continuous-time domain by its impulse response, which is given by the function

$$ h(t) = \mathrm{sinc}(t) = \frac{\sin(\pi t / T)}{\pi t / T} , \tag{9} $$

which is depicted in fig. 2.

[Figure 2: sinc function, impulse response of the ideal reconstruction filter. Panel title: "Impulse response of the Reconstruction Filter"; vertical axis: sinc; horizontal axis: time in sampling intervals, from −5 to 5.]

Ideally, the reconstruction of the continuous-time signal from the sampled signal should be performed in two steps:

• Conversion from discrete to continuous time by holding the signal constant in time intervals between two adjacent sampling instants. This is achieved by a device called a holder. The cascade of a sampler and a holder constitutes a sample and hold device.
• Convolution with an ideal sinc function.

The sinc function is ideal because its temporal extension is infinite on both sides, thus implying that the reconstruction process cannot be implemented exactly. However, it is possible to give a practical realization of the reconstruction filter by an impulse response that approximates the sinc function.

Whenever the condition (8) is violated, the periodic replicas of the spectrum have components that overlap with the base band. This phenomenon is called aliasing or foldover and is avoided by forcing the continuous-time original signal to be band-limited to the Nyquist frequency. In other words, a filter in the continuous-time domain cuts off the frequency components exceeding the Nyquist frequency. If aliasing is allowed, the reconstruction filter cannot give a perfect copy of the original signal.

Usually, the word aliasing has a negative connotation because the aliasing phenomenon can make audible some spectral components which are normally out of the frequency range of hearing. However, some sound synthesis techniques, such as frequency modulation, exploit aliasing to produce additional spectral lines by folding onto the base band spectral components that are outside the Nyquist bandwidth. In this case, where the connotation is positive, the term foldover is preferred.
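The reconstruction by sinc interpolation and the effect of aliasing can be sketched together in Octave/Matlab. The code below applies the interpolation sum $x(t) \approx \sum_n \hat{x}(n)\,\mathrm{sinc}((t - nT)/T)$ directly to the samples, skipping the holder stage, and truncates the ideally infinite sum to a finite window, so it is an approximation under stated assumptions rather than the ideal filter. All numerical values (`Fs`, the two test frequencies, the window size) are illustrative.

```octave
Fs = 50;  T = 1/Fs;              % sampling rate and interval (assumed values)
n  = -2000:2000;                 % finite set of samples: truncation of the ideal sum
t  = ((0:9) + 0.5) * T;          % instants midway between samples (sinc argument never 0)
% U(i,k) = (t(i) - n(k)*T)/T, argument of the sinc centered on sample k
U  = t(:) * ones(1, numel(n)) / T - ones(numel(t), 1) * n;
S  = sin(pi*U) ./ (pi*U);        % sinc kernel of eq. (9), one row per output instant

f_ok = 10;                       % below the Nyquist frequency Fs/2 = 25 Hz
y_ok = (S * cos(2*pi*f_ok*n*T).').';
err_ok = max(abs(y_ok - cos(2*pi*f_ok*t)));           % small: signal recovered

f_hi = 40;                       % above the Nyquist frequency: condition (8) violated
y_hi = (S * cos(2*pi*f_hi*n*T).').';
err_alias = max(abs(y_hi - cos(2*pi*(Fs - f_hi)*t))); % small: a 10 Hz alias comes out
err_true  = max(abs(y_hi - cos(2*pi*f_hi*t)));        % large: the 40 Hz tone is lost
fprintf('err_ok = %g, err_alias = %g, err_true = %g\n', err_ok, err_alias, err_true);
```

The 40 Hz tone yields the same samples as the 10 Hz tone, so the reconstruction returns the folded component at $F_s - f$; the residual errors in the first two cases come only from truncating the sum.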
1.3 Discrete-Time Spectral Representations

We have seen how the sampling operation essentially changes the nature of the signal domain, which switches from a continuous to a discrete set of points. We have also seen how this operation is transposed in the frequency domain as a periodic replication. It is now time to clarify the meaning of the variables which are commonly associated with the word "frequency" for signals defined in both the continuous and the discrete-time domain. The various symbols are collected in table 1.1, where the limits imposed by the Nyquist frequency are also indicated. With the term "digital frequencies" we indicate the frequencies of discrete-time signals.

Table 1.1: Frequency variables

  Nyquist domain              Symbol          Unit                 Note
  [−Fs/2 ... 0 ... Fs/2]      f               [Hz] = [cycles/s]
  [−1/2 ... 0 ... 1/2]        f/Fs            [cycles/sample]      digital freqs.
  [−π ... 0 ... π]            ω = 2πf/Fs      [radians/sample]     digital freqs.
  [−πFs ... 0 ... πFs]        Ω = 2πf         [radians/s]

Appendix A.8.3 shows how it is possible to define a Fourier transform for functions of a discrete variable. Here we can re-express such a definition, as a function of frequency, for discrete-variable functions obtained by sampling continuous-time signals with sampling interval $T$. This transform is called the Discrete-Time Fourier Transform (DTFT) and is expressed by

$$ Y(f) = \sum_{n=-\infty}^{+\infty} y(nT) \, e^{-j 2\pi \frac{f}{F_s} n} . \tag{10} $$

We have already seen that the function $Y(f)$ is periodic (see footnote 4) with period $F_s$. Therefore, it is easy to realize that the DTFT can be inverted by an integral calculated on a single period:

$$ y(nT) = \frac{1}{F_s} \int_{-F_s/2}^{F_s/2} Y(f) \, e^{j 2\pi f n T} \, df . \tag{11} $$

In practice, in order to compute the Fourier transform with numeric means we must consider a finite number of points in (10). In other words, we have to consider a window of $N$ samples and compute the discrete-time Fourier transform on that signal portion:

$$ \hat{Y}(f) = \sum_{n=0}^{N-1} \hat{y}(n) \, e^{-j 2\pi \frac{f}{F_s} n} . \tag{12} $$

In (12) we have taken a window of $N$ samples (i.e., $NT$ seconds) of the signal, starting from instant 0, thus forming an $N$-point vector. The result is still a function of a continuous variable: the larger the window, the closer the function is to $Y(f)$. Therefore, the "windowing" operation introduces some loss of precision in frequency analysis. On the other hand, it allows us to localize the analysis in the time domain. There is a tradeoff between the time domain and the frequency domain, governed by the Uncertainty Principle, which states that the product of the window length by the frequency resolution $\Delta f$ is constant:

$$ \Delta f \, N = 1 . \tag{13} $$

Footnote 4: Indeed, the expression (10) can be read as the Fourier series expansion of the periodic signal $Y(f)$ with coefficients $y(nT)$ and components which are "sinusoidal" in frequency and are multiples of the fundamental $1/F_s$.

Example 2. This example should clarify the spectral effects induced by sampling and windowing. Consider the causal complex exponential function in continuous time

$$ y(t) = \begin{cases} e^{s_0 t} & t \geq 0 \\ 0 & t < 0 \end{cases} , \tag{14} $$

where $s_0$ is the complex number $s_0 = a + jb$. To visualize such a complex signal we can consider its real part

$$ \Re(y(t)) = \Re(e^{at} e^{jbt}) = e^{at} \cos(bt) , \tag{15} $$

and obtain fig. 3.a from it. The Laplace transform of function (14) has been calculated in appendix A.8.1. It can be reduced to the Fourier transform by the substitution $s = j\Omega$:

$$ Y(\Omega) = \frac{1}{j\Omega - s_0} . \tag{16} $$

The magnitude of the complex function (16) is drawn in solid line in fig. 3. The sampled signal is also Fourier-transformable in closed form, by reducing the Z transform obtained in appendix A.8.3 by the substitution $z = e^{j\omega}$. The formula turns out to be (see footnote 5)

$$ Y(\omega) = \frac{1}{1 - e^{s_0/F_s} e^{-j\omega}} , \tag{17} $$

and its magnitude is drawn in dashed line in fig. 3 for $F_s = 50$ Hz. We can see that sampling induces a periodic replication in the spectrum and that the periodicity is established by the sampling rate. The fact that the spectrum is not identically zero for frequencies higher than the Nyquist limit determines aliasing. This can be seen, for instance, in the heightening of the peak at the frequency of the damped sinusoid.

If we consider only the sampled signal lying within a window of $N = 7$ samples, we can compute the DTFT by means of (12) and obtain the third curve of fig. 3. Two important artifacts emerge after windowing:
• The peak is enlarged. In general, we have a main lobe for each relevant spectral component, and the width of the lobe might prevent us from resolving two components that are close to each other. This is a loss of frequency resolution due to the uncertainty principle.
• There are side lobes (frequency leakage) due to the discontinuity at the edges of the rectangular window. Smaller side lobes can be obtained by using windows that are smoother at the edges.

Unfortunately, for signals that are not known analytically, the analysis can only be done on finite segments of sampled signal, and the artifacts due to windowing cannot be eliminated. However, as we will show in sec. 4.1.3, the tradeoff between width of the main lobe and height of the side lobes can be explored by choosing windows different from the rectangular one.

Footnote 5: If we compare this formula with (57) of appendix A, we see that here the variable $s_0$ in the exponent is divided by $F_s$. Indeed, the discrete-variable functions of appendix A.8.3 correspond to signals sampled with unit sampling rate.

[Figure 3, panel (a): "Exponentially-decayed sinusoid", y versus t [s]. Panel (b): "Frequency response of a damped sinusoid", Y [dB] versus f [Hz].]

Figure 3: (a): Exponentially-decayed sinusoid, obtained as the real part of the complex exponential $y(t) = e^{s_0 t}$, with $s_0 = -10 + j100$; (b): Frequency analysis of the complex exponential $y(t) = e^{s_0 t}$. Transform of the continuous-time signal (continuous line), transform of the signal sampled at $F_s = 50$ Hz (dashed line), and transform of the sampled signal windowed with a 7-sample rectangular window (dash-dotted line).

To conclude the example we report the Octave/Matlab code (see appendix B) that allows us to plot the curves of fig. 3. The computation of the DTFT is particularly instructive. We have expressed the sum in (12) as a vector-matrix multiply, thus obtaining a compact expression that is computed efficiently. We also notice how Matlab and Octave manage vectors of complex numbers with the proper arithmetics.

    % script that visualizes the effects of
    % sampling and windowing
    % (global_decl, platform, mygridon and myreplot are not built-in;
    %  they presumably belong to the auxiliary modules on the book web site)
    global_decl; platform('octave'); % put either 'octave' or 'matlab'
    a = -10.0; b = 100;
    s0 = a + i * b;                  % complex number s0 = a + jb
    t = [0:0.001:1];
    y = exp(s0*t);                   % complex exponential
    subplot(2,2,1); plot(t, real(y));
    eval(mygridon);
    title('Exponentially-decayed sinusoid');
    xlabel('t [s]'); ylabel('y');
    eval(myreplot); pause;
    f = [0:0.1:100];
    Y = 1 ./ (i * 2 * pi * f - s0*ones(size(f)));
    % closed-form Fourier transform, eq. (16)
    subplot(2,2,2); plot(f, 20*log10(abs(Y)), '-');
    title('Frequency response of a damped sinusoid');
    xlabel('f [Hz]'); ylabel('Y [dB]');
    hold on;
    Fs = 50;
    Ysamp = 1 ./ (1 - exp(s0/Fs) * exp(-i*2*pi*f/Fs)) / Fs;
    % closed-form Fourier transform of the sampled signal, eq. (17)
    plot(f, 20*log10(abs(Ysamp)), '--');
    n = [0:6];
    y = exp(s0*n/Fs);
    Ysampw = y * exp(-i*2*pi/Fs*n'*f) / Fs;
    % Fourier transform of the windowed signal, eq. (12),
    % obtained by vector-matrix multiply
    plot(f, 20*log10(abs(Ysampw)), '-.');
    hold off;
    eval(myreplot);

###

Finally, we define the Discrete Fourier Transform (DFT) as the collection
of $N$ samples of the DTFT of a discrete-time signal windowed by a length-$N$ rectangular window. The frequency sampling points (called bins) are equally spaced between 0 and $F_s$ according to the formula

$$ f_k = \frac{k F_s}{N} . \tag{18} $$

Therefore, the DFT is given by

$$ Y(k) = \sum_{n=0}^{N-1} y(n) \, e^{-j \frac{2\pi}{N} k n} , \quad k = [0 \ldots N-1] . \tag{19} $$

The DFT can also be expressed in matrix form. Just consider $y(n)$ and $Y(k)$ as elements of two $N$-component vectors $\mathbf{y}$ and $\mathbf{Y}$ related by

$$ \mathbf{Y} = \mathbf{F} \mathbf{y} , \tag{20} $$

where $\mathbf{F}$ is the Fourier matrix, whose generic element of indices $k, n$ is

$$ F_{k,n} = e^{-j \frac{2\pi}{N} k n} . \tag{21} $$

It is clear that the sequence $\mathbf{y}$ can be recovered by premultiplication of the sequence $\mathbf{Y}$ by the matrix $\mathbf{F}^{-1}$, which is the inverse Fourier matrix. This can be expressed as

$$ y(n) = \frac{1}{N} \sum_{k=0}^{N-1} Y(k) \, e^{j \frac{2\pi}{N} k n} , \quad n = [0 \ldots N-1] , \tag{22} $$

which is called the Inverse Discrete Fourier Transform.

The Fast Fourier Transform (FFT) [65, 67] is a fast algorithm for computing the sum (19). Namely, the FFT has computational complexity [24] of the order of $N \log N$, while the trivial procedure for computing the sum (19) would take an order of $N^2$ steps, thus being intractable in many practical cases. The FFT can be found as a predefined component in most systems for digital signal processing and sound processing languages. For instance, there is an fft built-in function in Octave, CSound, and CLM (see appendix B).

1.4 Discrete-Time Systems

A discrete-time system is any processing block that takes an input sequence of samples and produces an output sequence of samples. The actual processing can be performed sample by sample or as a sequence of transformations of data blocks. The linear and time-invariant systems are particularly interesting because a theory is available that describes them completely.

Since we have already seen in sec. 1.1 what we mean by linearity, here we restate the concept with formulas. If $y_1(n)$ and $y_2(n)$ are the system responses to the inputs $x_1(n)$ and $x_2(n)$ then, feeding the system with the input

$$ x(n) = a_1 x_1(n) + a_2 x_2(n) \tag{23} $$

we get, at each discrete instant $n$,

$$ y(n) = a_1 y_1(n) + a_2 y_2(n) . \tag{24} $$

In words, the superposition principle does hold.

The time invariance is defined by considering an input sequence $x(n)$, which gives an output sequence $y(n)$, and a version of $x(n)$ shifted by $D$ samples: $x(n - D)$. If the system is time invariant, the response to $x(n - D)$ is equal to $y(n)$ shifted by $D$ samples, i.e. $y(n - D)$. In other words, the time shift can be indifferently put before or after a time-invariant system.
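The two defining properties can be checked numerically on a small example. The Octave/Matlab sketch below uses a hypothetical two-tap system, $y(n) = x(n) + 0.5\,x(n-1)$, implemented with the standard `filter` function, and contrasts it with a hypothetical time-varying system, $y(n) = n\,x(n)$, which is linear but fails the shift test. Signals, coefficients, and the delay `D` are arbitrary choices made for this illustration.

```octave
sys   = @(x) filter([1 0.5], 1, x);        % hypothetical LTI system: y(n) = x(n) + 0.5 x(n-1)
tvsys = @(x) (0:numel(x)-1) .* x;          % hypothetical time-varying system: y(n) = n x(n)
D     = 2;                                 % delay in samples (assumed)
shift = @(x) [zeros(1, D), x(1:end-D)];    % delay by D samples, keeping the length

x1 = [1 2 3 0 0 0];  x2 = [0 -1 4 1 0 0];  % test inputs (trailing zeros leave room for the shift)
a1 = 2;  a2 = -3;                          % arbitrary weights

% linearity, eqs. (23)-(24): response to the weighted sum vs weighted sum of responses
lin_err = max(abs(sys(a1*x1 + a2*x2) - (a1*sys(x1) + a2*sys(x2))));
% time invariance: shifting before or after the system must give the same result
ti_err  = max(abs(sys(shift(x1)) - shift(sys(x1))));
tv_err  = max(abs(tvsys(shift(x1)) - shift(tvsys(x1))));
fprintf('lin_err = %g, ti_err = %g, tv_err = %g\n', lin_err, ti_err, tv_err);
```

The first two errors vanish for the two-tap system, while the time-varying system gives a nonzero `tv_err`, because delaying the input changes the weights that its samples meet.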
Cases where the time invariance does not hold are found in systems that change their functionality over time or that produce an output sequence at a rate different from that of the input sequence (e.g., a decimator that undersamples the input sequence).

An important property of linear and time-invariant (LTI) systems is that, in a cascade of LTI blocks, the order of such blocks is irrelevant for the global input-output relation.

As we have already mentioned for continuous-time systems, there are two important system descriptions: the impulse response and the transfer function. LTI discrete-time systems are completely described by either one of these two representations.

1.4.1 The Impulse Response

Any input sequence can be expressed as a weighted sum of discrete impulses properly shifted in time. A discrete impulse is defined as

$$ \delta(n) = \begin{cases} 1 & n = 0 \\ 0 & n \neq 0 \end{cases} . \tag{25} $$

If the impulse (25) gives as output a sequence (called, indeed, the impulse response) $h(n)$ defined in the discrete domain, then a linear combination of shifted impulses will produce a linear combination of shifted impulse responses. Therefore, it is easy to be convinced that the output can be expressed by the following general convolution (see footnote 6):

$$ y(n) = (h * x)(n) = \sum_{m} x(m) h(n - m) = \sum_{m} h(m) x(n - m) , \tag{26} $$

which is the discrete-time version of (6).

The Z transform $H(z)$ of the impulse response is called the transfer function of the LTI discrete-time system. By analogy to what we showed in sec. 1.1, the input-output relationship for LTI systems can be described in the transform domain by

$$ Y(z) = H(z) X(z) , \tag{27} $$

where the input and output signals $X(z)$ and $Y(z)$ have been capitalized to indicate that these are the Z transforms of the signals themselves.

The following general rule can be given:

• A linear and time-invariant system working in continuous or discrete time can be represented by an operation of convolution in the time domain or, equivalently, by a complex multiplication in the (respectively Laplace or Z) transform domain. The results of the two operations are related by a (Laplace or Z) transform.

Since the transforms can be inverted, the converse statement is also true:

• The convolution between two signals in the transform domain is the transform of a multiplication in the time domain between the inverse transforms of the signals.

1.4.2 The Shift Theorem

We have seen how two domains related by a transform operation such as the Z transform are characterized by the fact that the convolution in one domain corresponds to the multiplication in the other domain. We are now interested in knowing what happens in one domain if in the other domain we perform a shift operation. This is stated in the following theorem:

Theorem 1.2 (Shift Theorem) Given two domains related by a transform operator, the shift by $\tau$ in one domain corresponds, in the transform domain, to a multiplication by the kernel of the transform raised to the power $\tau$.
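Both the convolution sum (26) and the statement of Theorem 1.2 can be tried out on short sequences. The Octave/Matlab sketch below evaluates (26) with explicit loops, compares the result with the built-in `conv`, and then checks, at one arbitrary point of the complex plane, the product rule (27) and the shift rule in its Z-transform form, $Y(z) = z^{-\tau} X(z)$. The sequences `h` and `x`, the test point `z0`, and the helper `ztr` are hypothetical names and values introduced only for this example.

```octave
h = [1 0.5 0.25];            % hypothetical impulse response, nonzero in a few points only
x = [1 -2 3 0 1];            % hypothetical input sequence, starting at n = 0

% direct evaluation of the convolution sum (26)
N = numel(h) + numel(x) - 1; % length of the output sequence
y = zeros(1, N);
for n = 0:N-1
  for m = 0:numel(x)-1
    k = n - m;               % index into the impulse response
    if k >= 0 && k < numel(h)
      y(n+1) = y(n+1) + x(m+1) * h(k+1);
    end
  end
end
conv_err = max(abs(y - conv(h, x)));      % agreement with the built-in convolution

% Z transform of a finite causal sequence s, evaluated at the point z
ztr = @(s, z) sum(s .* z.^(-(0:numel(s)-1)));
z0  = 0.9 * exp(1i*0.3);                  % arbitrary test point (assumed)
prod_err = abs(ztr(y, z0) - ztr(h, z0) * ztr(x, z0));          % eq. (27)

tau = 3;                                  % shift in samples (assumed)
x_shift = [zeros(1, tau), x];             % x delayed by tau samples
shift_err = abs(ztr(x_shift, z0) - z0^(-tau) * ztr(x, z0));    % Theorem 1.2
fprintf('conv_err = %g, prod_err = %g, shift_err = %g\n', conv_err, prod_err, shift_err);
```

All three errors should be at the level of rounding noise; the check at a single point `z0` is only a spot test, not a proof.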
^6 The reader is invited to construct an example with an impulse response that is different from zero only in a few points.

We recall that the kernel of the Laplace transform^7 is $e^{-s}$ and the kernel of the Z transform is $z^{-1}$. The shift theorem can be easily justified in the discrete domain starting from the definition of the Z transform. Let $x(n)$ be a discrete-time signal, and let $y(n)$ be its version shifted by an integer number $\tau$ of samples. With the variable substitution $N = n - \tau$ we can produce the following chain of identities, which proves the theorem:

$$Y(z) = \sum_{n=-\infty}^{\infty} y(n) z^{-n} = \sum_{n=-\infty}^{\infty} x(n-\tau) z^{-n} = \sum_{N=-\infty}^{\infty} x(N) z^{-N-\tau} = z^{-\tau} X(z) \,. \tag{28}$$

^7 This is the kernel of the direct transform, $e^{s}$ being the kernel of the inverse transform.

1.4.3 Stability and Causality

The notion of causality is rather intuitive: it corresponds to the experience of exciting a system and getting its response back only in future time instants, i.e. in instants that follow the excitation time along the time arrow. It is easy to realize that, for an LTI system, causality is enforced by forbidding non-zero values of the impulse response for time instants preceding zero. Non-causal systems, even though not realizable by sample-by-sample processing, can be of interest for non-real-time applications or where a processing delay can be tolerated.

The notion of stability is more delicate and can be given in different ways. We define the so-called bounded-input bounded-output (BIBO) stability, which requires that any input bounded in amplitude can only produce a bounded output, even though the two bounds can be different. It can be shown that BIBO stability is equivalent to having an impulse response that is absolutely summable, i.e.

$$\sum_{n=-\infty}^{\infty} |h(n)| < \infty \,. \tag{29}$$

In particular, a necessary condition for BIBO stability is that the impulse response converges toward zero for time instants diverging from zero.

It is easy to detect stability on the complex plane for LTI causal systems [58, 66, 65]. In the continuous-time case, the system is stable if all the poles are on the left of the imaginary axis or, equivalently, if the strip of convergence (see appendix A.8.1) ranges from a negative real number to infinity. In the discrete-time case, the system is stable if all the poles are within the unit circle or, equivalently, if the ring of convergence (see appendix A.8.3) has an inner radius of magnitude less than one and an outer radius extending to infinity.

Stability is a condition that is almost always necessary for practical realizability of linear filters in computing systems. It is interesting to note that physical systems can be locally unstable but, by virtue of the principle of energy conservation, these instabilities must be compensated in other points of the systems themselves or of the other systems they are interacting with. However, in numeric implementations, even local instabilities can be a problem, since the numerical approximations introduced in the representations of variables can easily produce diverging signals that are difficult to control.

1.5 Continuous-time to discrete-time system conversion

In many applications, and in particular in sound synthesis by physical modeling, the design of a discrete-time system starts from the description of a physical continuous-time system by means of differential equations and constraints. This description of an analog system can itself be derived from the simplification of the physical reality into an assembly of basic mechanical elements, such as springs, dampers, frictions, nonlinearities, etc. Alternatively, our continuous-time physical template can result from measurements on a real physical system. In any case, in order to construct a discrete-time system capable of reproducing the behavior of the continuous-time physical system, we need to transform the differential equations into difference equations, in such a way that the resulting model can be expressed as a signal flowchart in discrete time.
The techniques that are most widely used in signal processing to discretize a continuous-time LTI system are impulse invariance and the bilinear transformation.

1.5.1 Impulse Invariance

In the impulse-invariance method, the impulse response $h(n)$ of the discrete-time system is a uniform sampling of the impulse response $h_s(t)$ of the continuous-time system, rescaled by the width of the sampling interval $T$, according to

$$h(n) = T h_s(nT) \,. \tag{30}$$

In the usual practice of digital filter design, the constant $T$ is neglected, since the design stems from specifications for the discrete-time filter, and the conversion to continuous time is only an intermediate stage. Since one should introduce $1/T$ when going from discrete to continuous time, and $T$ when returning to discrete time, the overall effect of the constant is canceled. Vice versa, if we start from a description in continuous time, such as in physical modeling, the constant $T$ should be considered.

From the sampling theorem we can easily deduce that the frequency response of the discrete-time system is the periodic replication of the frequency response of the continuous-time system, with a repetition period equal to $F_s = 1/T$. In terms of "discrete-time frequency" $\omega$ (in radians per sample), we can write

$$H(\omega) = \sum_{k=-\infty}^{\infty} H_s\left(j\frac{\omega}{T} + j\frac{2\pi}{T}k\right) = \sum_{k=-\infty}^{\infty} H_s\left(j\Omega + j 2\pi F_s k\right) \,, \tag{31}$$

where $\Omega = \omega/T$. Equation (31) shows that the frequency components in the two domains, discrete and continuous, can be identical in the base band only if the continuous-time system is bandlimited. If this is not the case (and it is almost never the case!), there will be some aliasing that introduces spurious components in the band of interest of the discrete-time system. However, if the frequency response of the continuous-time system is sufficiently close to zero at high frequencies, the aliasing can be neglected and the resulting discrete-time system turns out to be a good approximation of the continuous-time template.

Often, the continuous-time impulse response is derived from a decomposition of the transfer function of a system into simple fractions. Namely, the transfer function of a continuous-time system can be decomposed^8 into a sum of terms such as

$$H_s(s) = \frac{a}{s - s_a} \,, \tag{32}$$

which correspond to impulse responses such as

$$h_s(t) = a e^{s_a t} 1(t) \,, \tag{33}$$

where $1(t)$ is the ideal step function, or Heaviside function, which is zero for negative (anticausal) time instants. Sampling (33) we produce the discrete-time response

$$h(n) = T a \left(e^{s_a T}\right)^n 1(n) \,, \tag{34}$$

whose transfer function in $z$ is

$$H(z) = \frac{T a}{1 - e^{s_a T} z^{-1}} \,. \tag{35}$$

^8 This holds for simple distinct poles. The reader might try to extend the decomposition to the case of coincident double poles.

By comparing (35) and (32) it is clear what kind of operation we should apply to the s-domain transfer function in order to obtain the z-domain transfer function relative to the impulse response sampled with period $T$.

It is important to recognize that the impulse-invariance method preserves the stability of the system, since each pole of the left half of the $s$ plane is matched with a pole that stays within the unit circle of the $z$ plane, and vice versa. However, this kind of transformation cannot be considered a conformal mapping, since not all the points of the $s$ plane are coupled to points of the $z$ plane by the relation^9 $z = e^{sT}$. An important feature of the impulse-invariance method is that, being based on sampling, it is a linear transformation that preserves the shape of the frequency response of the continuous-time system, at least where aliasing can be neglected.

^9 To be convinced of that, consider a second-order continuous-time transfer function with simple poles and a zero and convert it with the method of the impulse invariance. Verify that the zero does not follow the same transformation that the poles are subject to.

It is clear that the impulse-invariance method can be used when the continuous-time reference model is a low-pass or a band-pass filter (see chapter 2 for a treatment of filters). If the template is a high-pass filtering block the method is not applicable because of aliasing.

1.5.2 Bilinear Transformation

An alternative to impulse invariance for discretizing continuous systems is given by the bilinear transformation, a conformal map that creates a correspondence between the imaginary axis of the $s$ plane and the unit circumference of the $z$ plane. A general formulation of the bilinear transformation is

$$s = h \frac{1 - z^{-1}}{1 + z^{-1}} \,. \tag{36}$$

It is clear from (36) that the dc component $j0$ of the continuous-time system corresponds to the dc component $1 + j0$ of the discrete-time system, and the infinity of the imaginary axis of the $s$ plane corresponds to the point $-1 + j0$, which represents the Nyquist frequency in the $z$ plane. The parameter $h$ allows us to impose the correspondence in a third point of the imaginary axis of the $s$ plane, thus controlling the compression of the axis itself when it gets transformed into the unit circumference.

A particular choice of the parameter $h$ derives from the numerical integration of differential equations by the trapezoid rule. To understand this point, consider the transfer function (32) and the corresponding differential equation that couples the input variable $x_s$ to the output variable $y_s$:

$$\frac{dy_s(t)}{dt} - s_a y_s(t) = a x_s(t) \,. \tag{37}$$

If we sample the output variable with period $T$ we can write

$$y_s(nT) = y_s(nT - T) + \int_{nT-T}^{nT} \dot{y}_s(\tau)\, d\tau \,, \tag{38}$$

where $\dot{y}_s = \frac{dy_s(t)}{dt}$, and integrate (38) with the trapezoid rule, thus obtaining

$$y_s(nT) \approx y_s((n-1)T) + \left(\dot{y}_s(nT) + \dot{y}_s((n-1)T)\right) T/2 \,. \tag{39}$$

By substituting (37) into (39) and setting $y(n) = y_s(nT)$ we get a difference equation represented, by virtue of the shift theorem 1.2, by the transfer function

$$H(z) = \frac{a (1 + z^{-1}) T/2}{1 - s_a T/2 - (1 + s_a T/2) z^{-1}} \,, \tag{40}$$

which can be obtained from $H_s(s)$ by bilinear transformation with $h = 2/T$.

It is easy to check that, with $h = \frac{2}{T}$, the continuous-time frequency $f = \frac{1}{\pi T}$ maps into the discrete-time frequency $\omega = \frac{\pi}{2}$, i.e. half the Nyquist limit. More generally, half the Nyquist frequency of the discrete-time system corresponds to the frequency $f = \frac{h}{2\pi}$ of the continuous-time system. The higher $h$ is, the more the low frequencies are compressed by the transformation.

To give a practical example, using the sampling frequency $F_s = 44100$ Hz and $h = \frac{2}{T} = 88200$, the frequency that is mapped into half the Nyquist rate of the discrete-time system (i.e., 11025 Hz) is $f = 14037.5$ Hz. The same transformation with $h = 100000$ maps the frequency $f = 15915.5$ Hz to half the Nyquist rate. If we are interested in preserving the magnitude and phase response at $f = 11025$ Hz we need to use $h = 69272.12$.

1.6 Quantization

With the adjectives “numeric” and “digital” we connote systems working on signals that are represented by numbers coded according to the conventions of appendix A.9. So far, in this chapter we have described discrete-time systems by means of signals that are functions of a discrete variable and have a codomain described by a continuous variable. Actually, the internal arithmetic of computing systems imposes a signal quantization, which can produce various kinds of effects on the output sounds.
For the scope of this book the most interesting quantization is the linear quantization introduced, for instance, in the process of conversion of an analog signal into a digital signal. If the word representing numerical data is $b$ bits long, the range of variation of the analog signal can be divided into $2^b$ quantization levels. Any signal amplitude between two quantization levels is quantized to the closest level. The processes of sampling and quantization are illustrated in fig. 4 for a wordlength of 3 bits. The minimal amplitude difference that can be represented is called the quantum interval and we indicate it with the symbol $q$. We can notice from fig. 4 that, due to two's complement representation, the representation levels for negative amplitude exceed by one the levels used for positive amplitude. It is also evident from fig. 4 how quantization introduces an approximation in the representation of a discrete-time signal.

[Figure 4: Sampling and 3-bit quantization of a continuous-time signal. Horizontal axis: time $t$ in multiples of $T$; vertical axis: amplitude $y(t)$ in multiples of $q$.]

This approximation is called quantization error and can be expressed as

$$\eta(n) = y_q(n) - y(n) \,, \tag{41}$$

where the symbol $y_q(n)$ indicates the value $y(n)$ quantized by rounding it to the nearest discrete level. From the viewpoint of the designer, the quantization error can be considered as a noise superimposed on the unquantized signal. This noise takes values in the range

$$-\frac{q}{2} \le \eta \le \frac{q}{2} \,, \tag{42}$$

and it is spectrally colored according to the nature and form of the unquantized signal.

What follows is a superficial analysis of quantization noise. A rigorous analysis would require the reader to have a background in random variables and processes. We rather refer to signal processing books [58, 67, 65] for a more accurate exposition.
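The rounding quantizer and the error bound (42) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the signal range is taken as $[-1, 1)$, so that $q = 2/2^b$, and the names `quantize`, `samples` and `errors` are hypothetical.

```python
import math

def quantize(y, q, b):
    """Round y to the nearest multiple of q, limited to the b-bit two's complement range."""
    k = math.floor(y / q + 0.5)                        # index of the nearest level
    k = max(-2 ** (b - 1), min(2 ** (b - 1) - 1, k))   # one more negative level than positive
    return k * q

b = 3                    # wordlength, as in fig. 4
q = 2.0 / 2 ** b         # quantum interval for an assumed range [-1, 1)
samples = [0.8 * math.sin(2 * math.pi * n / 20) for n in range(20)]
errors = [quantize(y, q, b) - y for y in samples]      # eta(n) of eq. (41)

print(max(abs(e) for e in errors) <= q / 2)            # the bound (42) holds inside the range
```

Amplitudes outside the representable range are clipped in this sketch, so the bound (42) applies only to samples that stay within the range.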
In order to study the effects of quantization noise analytically, it is often assumed that it is a white noise (i.e., a noise with a constant-magnitude spectrum) with values uniformly distributed in the interval (42), and that there is no correlation between the noise and the unquantized signal. This assumption is false in general but, nevertheless, it leads to results which are good estimates of many actual behaviors. The uniformly-distributed white noise has a zero mean but it has a nonzero quadratic mean (i.e., a power) with value

$$\overline{\eta^2} = \frac{1}{q/2} \int_{0}^{q/2} \eta^2 \, d\eta = \frac{q^2}{12} \,. \tag{43}$$

In the frequency domain, the quantization noise is interpreted by means of a spectrum such as that depicted in fig. 5, which represents the square of the magnitude of the Fourier transform. The area of the dashed rectangle is equal to the power $\overline{\eta^2}$.

[Figure 5: Squared magnitude spectrum of an ideal quantization noise. Horizontal axis: frequency $f$ from $-F_s/2$ to $F_s/2$; the rectangle has height $\overline{\eta^2}/F_s$.]

Usually the root-mean-square value (or RMS value) of the quantization noise is given, and this is defined as

$$\eta_{rms} = \sqrt{\overline{\eta^2}} = \frac{q}{\sqrt{12}} \,, \tag{44}$$

which can be directly compared with the maximal representable value in order to get the signal-to-quantization noise ratio (or SNR)

$$SNR = 20 \log_{10} \frac{q\, 2^{b-1}}{q/\sqrt{12}} = 20 \log_{10} \left(2^b \sqrt{3}\right) \approx 4.7 + 6b \ \mathrm{dB} \,. \tag{45}$$

As a general rule, each further quantization bit increases the SNR by about 6 dB. Therefore, with 16 bits we have a signal-to-quantization noise ratio of about 101.1 dB. When we are given an SNR of 96.3 dB with 16 bits, it means that the ratio has been computed using the maximum value $q/2$ of the quantization noise and not its RMS value, which is more significant for the human ear. The definition (45) is that proposed by Steiglitz [102].

The assumptions on the statistical properties of the quantization noise are better verified if the signal is large in amplitude and wide in its frequency extension. For quasi-sinusoidal signals the quantization noise is heavily colored and correlated with the unquantized signal, to such an extent that some additive noise called dither is sometimes introduced in order to whiten and decorrelate the quantization noise. In this way, the perceptual effects of quantization turn out to be less severe.

By considering the quantization noise as an additive signal we can easily study its effects within linear systems. The operations performed by a discrete-time linear system, especially when done in fixed-point arithmetic, can indeed modify the spectral content of noise signals, and different realizations of the same transfer function can behave very differently as far as their immunity to quantization noise is concerned. Several quantizations can occur within the realization of a linear system. For instance, the multiplication of two fixed-point numbers represented with $b$ bits requires $2b - 1$ bits to represent the result without any precision loss.
If successive operations use operands represented with $b$ bits it is clear that the least-significant bits must be eliminated, thus introducing a quantization. The effects of these quantizations can be studied by resorting to the additive white noise model, where the points of injection of noise are the points where the quantization actually occurs.

The fixed-point implementations of linear systems are subject to disappointing phenomena related to quantization: limit cycles and overflow oscillations. Both phenomena show up as nonzero signals that are maintained even when the system has stopped producing useful signals. The limit cycles are usually small oscillations due to the fact that, because of rounding, the sources of quantization noise determine a local amplification or attenuation of the signal (see fig. 4). If the signals within the system have a physical meaning (e.g., they are propagating waves), the limit cycles can be avoided by forcing a lossy quantization, which always truncates the numbers toward zero. This operation corresponds to introducing a small numerical dissipation. The overflow oscillations are more serious because they produce signals as large as the maximum amplitude that can be represented. They can be produced by operations whose results exceed the largest representable number, so that the result is wrapped back into the legal range of two's complement numbers. Such a destructive oscillation can be avoided by using overflow-protected operations, which are operations that saturate the result to the largest representable number (or to the most negative representable number).

The quantizations introduce nonlinear elements within otherwise linear structures. Indeed, limit cycles and overflow oscillations can persist only because there are nonlinearities, since any linear and stable system cannot give a persistent nonzero output with a zero input.
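Both phenomena can be reproduced with integer arithmetic. The Python sketch below is an illustration under assumptions, not a model of any particular hardware: a hypothetical first-order recursion $y(n) = Q[0.9\, y(n-1)]$ with zero input is run with two quantizers $Q$, and an 8-bit sum is computed with wraparound and with saturation. All names and values are illustrative.

```python
import math

def run_first_order(a, y0, steps, quantizer):
    """Zero-input response of y(n) = Q[a * y(n-1)] starting from the state y0."""
    y, out = y0, []
    for _ in range(steps):
        y = quantizer(a * y)
        out.append(y)
    return out

def round_nearest(v):
    return math.floor(v + 0.5)    # rounding to the nearest integer level

def toward_zero(v):
    return int(v)                 # int() truncates toward zero (lossy quantization)

rounded = run_first_order(0.9, 10, 12, round_nearest)
truncated = run_first_order(0.9, 10, 12, toward_zero)
print(rounded)     # gets stuck at a nonzero value although the input is zero
print(truncated)   # decays to zero

def wrap(v, b):
    """Two's complement wraparound of the integer v to b bits."""
    m = 1 << b
    return (v + (m >> 1)) % m - (m >> 1)

def saturate(v, b):
    """Overflow-protected result: clip v to the b-bit two's complement range."""
    return max(-(1 << (b - 1)), min((1 << (b - 1)) - 1, v))

print(wrap(100 + 60, 8), saturate(100 + 60, 8))   # -96 127
```

With rounding the state settles at a constant nonzero value, a degenerate case of limit cycle, while truncation toward zero dissipates it. The wrapped sum jumps to a large negative number, whereas the saturated sum stays at the largest representable value.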
Quantization in floating-point implementations is usually less of a concern for the designer. In this case, quantization occurs only in the mantissa. Therefore, the relative error

$$\eta_r(n) = \frac{y_q(n) - y(n)}{y(n)} \tag{46}$$

is more meaningful for the analysis. We refer to [65] for a discussion on the effects of quantization with floating-point implementations.

Some digital audio formats, such as the µ-law and A-law encodings, use a fixed-point representation where the quantization levels are distributed nonlinearly in the amplitude range. The idea, reminiscent of the quasi-logarithmic sensitivity of the ear, is to have many more levels where signals are small and a coarser quantization for large amplitudes. This is justified if the signals being quantized do not have a uniform statistical distribution but tend to assume small amplitudes more often than large amplitudes. Usually the distribution of levels is exponential, in such a way that the intervals between points increase exponentially with magnitude. This kind of quantization is called logarithmic because, in practical realizations, a logarithmic compressor precedes a linear quantization stage [69]. Floating-point quantization can be considered as a piecewise-linear logarithmic quantization, where each linear piece corresponds to a value of the exponent.

Chapter 2

Digital Filters
For the purpose of this book we call digital filter any linear, time-invariant system operating on discrete-time signals. As we saw in chapter 1, such a system is completely described by its impulse response or by its (rational) transfer function. Even though the adjective digital refers to the fact that parameters and signals are quantized, we will not be too concerned about the effects of quantization, which were briefly introduced in sec. 1.6. In this chapter, we will face the problem of designing impulse responses or transfer functions that satisfy some specifications in the time or frequency domain.

Traditionally, digital filters have been classified into two large families: those whose transfer function has no denominator, and those whose transfer function has a denominator. Since the filters of the first family admit a realization where the output is a linear combination of a finite number of input samples, they are sometimes called non-recursive filters^1. For these systems, it is more customary and correct to refer to the impulse response, which has a finite number of non-null samples, thus calling them Finite Impulse Response (FIR) filters. On the other hand, the filters of the second family admit only recursive realizations, meaning that the output signal is always computed by using previous samples of itself. The impulse response of these filters is infinitely long, thus justifying their name as Infinite Impulse Response (IIR) filters.
^1 Strictly speaking, this definition is not correct because the same transfer functions can be realized in recursive form.

2.1 FIR Filters

An FIR filter is nothing more than a linear combination of a finite number of samples of the input signal. In our examples we will treat causal filters, therefore we will not process input samples coming later than the time instant of the output sample that we are producing. The mathematical expression of an FIR filter is

$$y(n) = \sum_{m=0}^{N} h(m) x(n-m) \,. \tag{1}$$

In eq. (1) the reader can easily recognize the convolution (26) of chapter 1, here specialized to finite-length impulse responses. Since the time extension of the impulse response is $N + 1$ samples, we say that the FIR filter has length $N + 1$. The transfer function is obtained as the Z transform of the impulse response and it is a polynomial in the powers of $z^{-1}$:

$$H(z) = \sum_{m=0}^{N} h(m) z^{-m} = h(0) + h(1) z^{-1} + \cdots + h(N) z^{-N} \,. \tag{2}$$

Since such a polynomial has order $N$, we also say that the FIR filter has order $N$.

2.1.1 The Simplest FIR Filter

Let us now consider the simplest nontrivial FIR filter that one can imagine, the averaging filter

$$y(n) = \frac{1}{2} x(n) + \frac{1}{2} x(n-1) \,. \tag{3}$$

Appendix B.1 illustrates how such a filter can be implemented in Octave/Matlab in two different ways: block processing or sample-by-sample processing.

The simplest way to analyze the behavior of the filter [97] is probably the injection of a complex sinusoid having amplitude $A$ and initial phase $\phi$, i.e. the signal $x(n) = A e^{j(\omega_0 n + \phi)}$. Since the system is linear we do not lose any generality by considering unit-amplitude signals ($A = 1$). Since the system is time invariant we do not lose any generality by considering signals with zero initial phase ($\phi = 0$). Since the complex sinusoid can be expressed as the sum of a cosinusoidal real part and a sinusoidal imaginary part, we can imagine that feeding the system with such a complex signal corresponds to feeding two copies of the filter, one with a cosinusoidal real signal, the other with a sinusoidal real signal. The output of the filter fed with the complex sinusoid is obtained, thanks to linearity, as the sum of the outputs of the two copies (the second one multiplied by $j$). If we substitute the complex sinusoidal input into eq. (3) we readily get

$$y(n) = \frac{1}{2} e^{j\omega_0 n} + \frac{1}{2} e^{j\omega_0 (n-1)} = \left(\frac{1}{2} + \frac{1}{2} e^{-j\omega_0}\right) e^{j\omega_0 n} = \left(\frac{1}{2} + \frac{1}{2} e^{-j\omega_0}\right) x(n) \,. \tag{4}$$

We see that the output is a copy of the input multiplied by the complex number $\left(\frac{1}{2} + \frac{1}{2} e^{-j\omega_0}\right)$, which is the value taken by the transfer function at the point $z = e^{j\omega_0}$. In fact, the transfer function (2) can be rewritten, for the case under analysis, as

$$H(z) = \frac{1}{2} + \frac{1}{2} z^{-1} \,, \tag{5}$$

and its evaluation on the unit circle ($z = e^{j\omega}$) gives the frequency response

$$H(\omega) = \frac{1}{2} + \frac{1}{2} e^{-j\omega} \,. \tag{6}$$

For an input complex sinusoid having frequency $\omega_0$, the frequency response takes the value

$$H(\omega_0) = \frac{1}{2} + \frac{1}{2} e^{-j\omega_0} = \left(\frac{1}{2} e^{j\omega_0/2} + \frac{1}{2} e^{-j\omega_0/2}\right) e^{-j\omega_0/2} = \cos(\omega_0/2)\, e^{-j\omega_0/2} \,, \tag{7}$$

and we see that the magnitude response and the phase response are, respectively,

$$|H(\omega_0)| = \cos(\omega_0/2) \tag{8}$$

and

$$\angle H(\omega_0) = -\omega_0/2 \,. \tag{9}$$

These are respectively the magnitude and argument of the complex number that multiplies the input in (4). Therefore, we have verified a general property of linear and time-invariant systems, i.e., sinusoidal inputs give sinusoidal outputs, possibly with an amplitude rescaling and a phase shift^2.
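The identities (4), (8) and (9) can be checked numerically. The following Python sketch uses only the standard `cmath` and `math` modules; the test frequency `w0`, the sample index `n` and the function names are arbitrary choices made for illustration.

```python
import cmath
import math

def freq_response(w):
    """Frequency response (6) of the averaging filter at w rad/sample."""
    return 0.5 + 0.5 * cmath.exp(-1j * w)

def filter_sinusoid(w0, n):
    """Output y(n) of eq. (3) for the everlasting input x(n) = exp(j*w0*n)."""
    def x(k):
        return cmath.exp(1j * w0 * k)
    return 0.5 * x(n) + 0.5 * x(n - 1)

w0 = math.pi / 3    # arbitrary test frequency (an assumption of this sketch)
n = 7               # arbitrary time index

lhs = filter_sinusoid(w0, n)                        # left-hand side of (4)
rhs = freq_response(w0) * cmath.exp(1j * w0 * n)    # H(w0) x(n)
print(abs(lhs - rhs))                               # close to zero
print(abs(freq_response(w0)), math.cos(w0 / 2))     # magnitude (8)
print(cmath.phase(freq_response(w0)), -w0 / 2)      # phase (9)
```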
^2 The reader can easily verify that this is true not only for complex sinusoids, but also for real sinusoids. The real sinusoid can be expressed as a combination of complex sinusoids and linearity can be applied.

If the frequency of the input sine is thought of as a real variable $\omega$ in the interval $[0, \pi)$, the magnitude and phase responses become functions of this variable and can be plotted as in fig. 1. At this point, the interpretation of such curves as amplification and phase shift of sinusoidal inputs should be obvious.

[Figure 1: Frequency response (magnitude and phase) of an averaging filter. Panel (a): magnitude versus frequency [rad/sample]; panel (b): phase [rad] versus frequency [rad/sample].]

In order to plot curves such as those of fig. 1 it is not necessary to calculate closed forms of the functions representing the magnitude (8) and the phase response (9). Since with Octave/Matlab we can directly operate on arrays of complex numbers, the following simple script will do the job:

    global_decl; platform('octave');
    w = [0:0.01:pi];                     % frequency points
    H = 0.5 + 0.5*exp(-i * w);           % complex freq. resp., eq. (6)
    subplot(2,2,1); plot(w, abs(H));     % plot the magnitude
    xlabel('frequency [rad/sample]');
    ylabel('magnitude');
    eval(myreplot);
    subplot(2,2,2); plot(w, angle(H));   % plot the phase
    xlabel('frequency [rad/sample]');
    ylabel('phase [rad]');
    eval(myreplot);

The averaging filter is the simplest form of low-pass filter. In a low-pass filter the high frequencies are more attenuated than the low frequencies.

Another way to approach the analysis of a filter is to reason directly in the plane of the complex variable $z$. In this plane (fig. 2) two families of points are marked: the points where the transfer function vanishes, and the points where it diverges to infinity. Let us rewrite the transfer function as the ratio of two polynomials in $z$

$$H(z) = \frac{1}{2} \frac{z - z_0}{z} \,, \tag{10}$$

where $z_0 = -1$ is the root of the numerator. The roots of the numerator of a transfer function are called zeros of the filter, and the roots of the denominator are called poles of the filter. Usually, for reasons that will emerge in the following, only the nonzero roots are counted as poles or zeros. Therefore, in the example (10) we have only one zero and no pole. In order to evaluate the frequency response of the filter it is sufficient to replace the variable $z$ with $e^{j\omega}$ and to consider $e^{j\omega}$ as a geometric vector whose head moves along the unit circle.
The difference between this vector and the vector $z_0$ gives the chord drawn in fig. 2. The chord length is twice^3 the magnitude response of the filter. Such a chord, interpreted as a vector with the head in $e^{j\omega}$, has an angle from which we subtract the angle of the vector that links the pole at the origin to $e^{j\omega}$, thus obtaining the phase response of the filter at the frequency $\omega$.
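The chord construction can be verified numerically. The sketch below, in Python with the standard `cmath` module, multiplies the lengths of the vectors drawn from the zeros to $e^{j\omega}$, divides by those drawn from the poles, and combines the angles accordingly; the function name `response_from_pz` and its argument layout are assumptions made for this illustration.

```python
import cmath
import math

def response_from_pz(w, zeros, poles, gain=1.0):
    """Magnitude and phase at w from the vectors linking zeros and poles to exp(j*w)."""
    p = cmath.exp(1j * w)           # point on the unit circle
    mag, ph = gain, 0.0
    for z0 in zeros:
        mag *= abs(p - z0)          # chord length from the zero
        ph += cmath.phase(p - z0)   # chord angle from the zero
    for p0 in poles:
        mag /= abs(p - p0)
        ph -= cmath.phase(p - p0)
    return mag, ph

# Averaging filter of eq. (10): zero in -1, pole in the origin, gain 1/2
for w in (0.5, 1.0, 2.0, 3.0):
    mag, ph = response_from_pz(w, zeros=[-1.0], poles=[0.0], gain=0.5)
    print(w, mag, math.cos(w / 2), ph, -w / 2)
```

For each frequency the computed pair matches $\cos(\omega/2)$ and $-\omega/2$, i.e. eqs. (8) and (9). The phase is a plain sum of principal-value angles, so for filters with more poles and zeros it may differ from the wrapped phase by multiples of $2\pi$.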
[Figure 2: Single zero (◦) and pole in the origin (×)]

The following general rules can be given, for any number of poles and zeros:

• Given a point $e^{j\omega}$ on the unit circle, the magnitude of the frequency response (regardless of constant factors) at the frequency $\omega$ is obtained by multiplication of the magnitudes of the vectors linking the zeros with the point $e^{j\omega}$, divided by the magnitudes of the vectors linking the poles with the point $e^{j\omega}$.

• The phase response is obtained by addition of the phases of the vectors linking the zeros with the point $e^{j\omega}$, and by subtraction of the phases of the vectors linking the poles with the point $e^{j\omega}$.

^3 Do not forget the scaling factor $\frac{1}{2}$ in (10).

It is readily seen that poles or zeros in the origin only contribute to the phase of the frequency response, and this is the reason for their exclusion from the total count of poles and zeros.

The graphic method, based on pole and zero placement on the complex plane, is very useful for getting a rough idea of the frequency response. For instance, the reader is invited to reconstruct fig. 1 qualitatively using the graphic method.

The frequency response gives a clear picture of the behavior of a filter when its inputs are stationary signals, which can be decomposed as constant-amplitude sinusoids. Therefore, the frequency response represents the steady-state response of the system. In practice, even signals composed of sinusoids have to be turned on at a certain instant, thus producing a transient response that comes before the steady state. However, the knowledge of the Z transform of a causal complex sinusoid and the knowledge of the filter transfer function allow us to study the overall response analytically. As we show in appendix A.8.3, the Z transform of a causal exponential sequence is

$$X(z) = \frac{1}{1 - e^{j\omega_0} z^{-1}} \,. \tag{11}$$

If we multiply, in the $z$ domain, $X(z)$ by the transfer function $H(z)$ we get

$$Y(z) = H(z) X(z) = \frac{1}{2} (1 + z^{-1}) \frac{1}{1 - e^{j\omega_0} z^{-1}} = \frac{1}{2} \frac{1}{1 - e^{j\omega_0} z^{-1}} + \frac{1}{2} \frac{z^{-1}}{1 - e^{j\omega_0} z^{-1}} \,. \tag{12}$$

The second term of the last member of (12) is, by the shift theorem, the transform of a causal complex sinusoidal sequence delayed by one sample. Therefore, the overall response can be thought of as a sum of two identical sinusoids shifted by one sample, and this turns out to be another sinusoid, but only after the first sampling instant. The first instant has a different behavior since it is part of the transient of the response (see fig. 3). It is easy to realize that, for an FIR filter, the transient lasts for a number of samples that does not exceed the order (memory) of the filter itself. Since an order-$N$ FIR filter has a memory of $N$ samples, the transient is at most $N$ samples long.
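The one-sample transient can be observed by running the averaging filter sample by sample on a causal cosine and comparing the output with the steady-state sinusoid predicted by the magnitude and phase responses. This is a minimal Python sketch; the frequency `w0` and the length of 40 samples are illustrative assumptions.

```python
import math

w0 = 2 * math.pi / 20                              # assumed input frequency [rad/sample]
x = [math.cos(w0 * n) for n in range(40)]          # causal cosine: x(n) = 0 for n < 0

# Sample-by-sample averaging filter y(n) = x(n)/2 + x(n-1)/2
y, prev = [], 0.0
for xn in x:
    y.append(0.5 * xn + 0.5 * prev)
    prev = xn

# Steady-state prediction: amplitude cos(w0/2), phase shift -w0/2
steady = [math.cos(w0 / 2) * math.cos(w0 * n - w0 / 2) for n in range(40)]

print(abs(y[0] - steady[0]))                                  # clearly nonzero: the transient
print(max(abs(y[n] - steady[n]) for n in range(1, 40)))       # close to zero afterwards
```

The filter has order 1, so only the first output sample departs from the steady-state sinusoid.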
[Figure 3: Response of an FIR averaging filter to a causal cosine: input and delayed input (◦), actual response (×). Horizontal axis: samples.]

2.1.2 The Phase Response

If we filter a sound with a nonlinear-phase filter we alter its time-domain wave shape. This happens because the different frequency components are subject to different delays while being transferred from the input to the output of the filter. Therefore, a compact wavefront is dispersed during its traversal of the filter. Before defining this concept more precisely we illustrate what happens to the wave shape that is impressed by a hammer on the string of a piano. The string behaves like a nonlinear-phase filter, and the dispersion of the frequency components becomes increasingly evident as the wave shape propagates away from the hammer along the string. Fig. 4 illustrates the string displacement signal as it is produced by a physical model (see chapter 5 for details) of the hammer-string system. The initial wave shape progressively loses its initial form. In particular, the fact that high frequencies are subject to a smaller propagation delay than low frequencies is visible in the form of little precursors, i.e., small high-frequency oscillations that precede the return of the main components of the wave shape. Such an effect can be experienced with an aerial ropeway like those that are found in isolated mountain houses. If we shake the rope energetically and keep our hand on it, after a few seconds we perceive small oscillations preceding a strong echo.

The effects of the phase response of a filter can be better formalized by introducing two mathematical definitions: the phase delay and the group delay.
1.0 D. Rocchesso: Sound Processing 1.0 0.02 time .11 Figure 4: Struck string: string displacement at the bridge termination The phase delay is deﬁned as τph = − ∠H (ω ) , ω (13) i.e., at any frequency, it is given by the phase response divided by the frequency itself. In practice, given the phaseresponse curve, the phase delay at one point is obtained as the slope of the straight line that connects that point with the origin. The group delay is deﬁned in differential terms as τgr = − d∠H (ω ) . dω (14) Therefore, the group delay at one point of the phaseresponse curve, is equal to the slope of the curve. The ﬁg. 5 illustrates the difference between phase delay and group delay. It is clear that, if the phase is linear, the two delays are equal and coincident with the slope of the straight line that represents the phase response. The difference between local slope and slope to the origin is crucial to understand the physical meaning of the two delays. The phase delay at a certain frequency point is the delay that a single frequency component is subject to when it passes through the ﬁlter, and the quantity (13) is, indeed, a delay in samples. Vice versa, in order to interpret the group delay let us consider a local approximation of the phase response by the tangent line at one point. Locally, propagation can be considered linear and, therefore, a signal having frequency components focused around that point has a timedomain envelope Digital Filters
that is delayed by an amount proportional to the slope of the tangent. For instance, two sinusoids at slightly different frequencies are subject to beats, and the beat frequency is the difference of the component frequencies (see fig. 6). Therefore, beats are a frequency-local phenomenon, dependent only on the relative distance between the components rather than on their absolute positions. If we are interested in knowing how the beat pattern is delayed by a filter, we should consider local variations in the phase curve. In other words, we should consider the group delay.

Figure 5: Phase delay and group delay
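The definitions (13) and (14) can be tried numerically. The following Python sketch is only an illustration under stated assumptions: the helper names (`freq_response`, `unwrapped_phase`, `phase_delay`, `group_delay`) and the two test filters are hypothetical choices, not taken from the book's repository. The phase is tracked in small steps from dc so that it stays continuous, and the derivative in (14) is approximated by a central difference.

```python
import cmath

def freq_response(h, w):
    """H(w) = sum_n h[n] * exp(-j*w*n) for FIR coefficients h."""
    return sum(c * cmath.exp(-1j * w * n) for n, c in enumerate(h))

def unwrapped_phase(h, w, steps=2000):
    """Phase of H(w), accumulated in small steps from dc so it stays continuous.

    Assumes H has no zero on the unit circle between 0 and w.
    """
    prev = freq_response(h, 0.0)
    phase = cmath.phase(prev)
    for k in range(1, steps + 1):
        cur = freq_response(h, w * k / steps)
        phase += cmath.phase(cur / prev)  # small increment in (-pi, pi]
        prev = cur
    return phase

def phase_delay(h, w):
    """Eq. (13): minus the phase divided by the frequency (w must be nonzero)."""
    return -unwrapped_phase(h, w) / w

def group_delay(h, w, dw=1e-5):
    """Eq. (14): minus the phase derivative, via a central difference."""
    ratio = freq_response(h, w + dw) / freq_response(h, w - dw)
    return -cmath.phase(ratio) / (2 * dw)

if __name__ == "__main__":
    for name, h in [("averaging [0.5, 0.5]", [0.5, 0.5]),
                    ("asymmetric [1.0, 0.5]", [1.0, 0.5])]:
        print(name, phase_delay(h, 1.0), group_delay(h, 1.0))
```

For the symmetric averaging filter both delays come out as half a sample at every frequency, while for the asymmetric filter the two values differ, which is the situation sketched in fig. 5.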
Figure 6: Beats between a sine wave at 100 Hz and a sine wave at 110 Hz

In telecommunications the group delay is often the more significant of the two delays, since messages are sent via wave packets localized in a narrow frequency band, and preservation of the shape of such packets is important. Conversely, in sound processing it is more meaningful to consider the set of frequency components in the audio range as a whole, and the phase delay is more significant. In both cases, we have to be careful about a problem that often arises when dealing with phases: phase unwrapping. So far we have defined the phase response as the angle of the frequency response, without worrying about the fact that such an angle is defined uniquely only between $0$ and $2\pi$. There is no way to distinguish an angle $\theta$ from the angles obtained by adding multiples of $2\pi$ to $\theta$. However, in order to give continuity to the phase and group delays, we have to unwrap the phase into a continuous function. For instance, the Matlab Signal Processing Toolbox provides the function unwrap, which unwraps the phase in such a way that discontinuities larger than a given threshold are offset by $2\pi$. In Octave we can use the function unwrap found in the web repository of this book.

Example 1. Fig. 7 shows the phase response of the FIR filter $H(z) = 0.5 - 0.2z^{-1} - 0.3z^{-2} + 0.8z^{-3}$ before and after unwrapping. The following Octave/Matlab script plots the curves of fig. 7. It illustrates the usage of the function unwrap with the default unwrapping threshold set to $\pi$.

    w = [0:0.01:pi];
    H = 0.5 - 0.2*exp(-i*w) - 0.3*exp(-2*i*w) + ...
        0.8*exp(-3*i*w);
    plot(w, unwrap(angle(H)), '-'); hold on;   % unwrapped phase (solid)
    plot(w, angle(H), '--'); hold off;         % wrapped phase (dashed)
    xlabel('frequency [rad/sample]');
    ylabel('phase [rad]');
    title('Phase response');
    % replot; % Octave only

### 2.1.3 Higher-Order FIR Filters

An FIR filter is nothing more than the realization of the operation of convolution (1). The filter coefficients are the samples of the impulse response. FIR filters having a symmetric impulse response are particularly important, since the phase of their frequency response is linear. More precisely, a symmetric impulse response is such that

$$h(n) = h(N - n), \quad n = 0, \ldots, N , \tag{15}$$
and an antisymmetric impulse response is such that

$$h(n) = -h(N - n), \quad n = 0, \ldots, N . \tag{16}$$

It is possible to show that the symmetry (or antisymmetry) of the impulse response is a sufficient condition to ensure the linearity of phase (see footnote 4). This property is important to ensure the invariance of the shape of signals going through the filter. For instance, if a sawtooth signal is the input of a linear-phase lowpass filter, the output is still a sawtooth signal with rounded corners.

Figure 7: Wrapped (dashed line) and unwrapped (solid line) phase response of a third-order FIR filter having impulse response $0.5, -0.2, -0.3, 0.8$

Footnote 4: Actually, for antisymmetric odd-order filters, linear phase is achieved if $h\left(\frac{N-1}{2}\right) = 0$.

In order to prove that symmetry is a sufficient condition for phase linearity for an $N$th-order FIR filter (with $N$ an odd integer), we write the transfer function as

$$\begin{aligned}
H(z) &= h(0) + \cdots + h\left(\tfrac{N-1}{2}\right) z^{-\frac{N-1}{2}} + h\left(\tfrac{N-1}{2}\right) z^{-\frac{N+1}{2}} + \cdots + h(0) z^{-N} \\
&= \sum_{n=0}^{\frac{N-1}{2}} h(n) \left( z^{-n} + z^{-N+n} \right) .
\end{aligned} \tag{17}$$

The frequency response can be expressed as

$$\begin{aligned}
H(\omega) &= \sum_{n=0}^{\frac{N-1}{2}} h(n) \left( e^{-j\omega n} + e^{j\omega(-N+n)} \right) \\
&= \sum_{n=0}^{\frac{N-1}{2}} h(n)\, e^{-j\omega\frac{N}{2}} \left( e^{-j\omega\left(n-\frac{N}{2}\right)} + e^{j\omega\left(n-\frac{N}{2}\right)} \right) \\
&= 2\, e^{-j\omega\frac{N}{2}} \sum_{n=0}^{\frac{N-1}{2}} h(n) \cos\left(\omega\left(n - \tfrac{N}{2}\right)\right) .
\end{aligned} \tag{18}$$

In the last expression we have isolated the phase contribution from a (real) weighted sum of sinusoidal functions. The phase contribution is a straight line having slope $-N/2$, as we have already seen in the special case of the first-order averaging filter (5). Where the real term changes sign there are indeed 180° phase shifts, so that we should more precisely say that the phase is piecewise linear. However, phase discontinuities at isolated points do not alter the overall constancy of the group delay, and they are in any case irrelevant because at those points the magnitude is zero. The same property of piecewise phase linearity holds for antisymmetric impulse responses and for even values of $N$.

At this point, we are going to introduce a very useful FIR filter. It is linear phase and it has order 2 (i.e., length 3). The averaging filter (5) was also a linear-phase filter, but it is not possible to change the shape of its frequency response without giving up the phase linearity. In fact, filters having the form $H(z) = h(0) + h(1)z^{-1}$ can have linear phase only if $h(0) = \pm h(1)$, and this forces them to have a magnitude response such as that of fig. 1 or its highpass mirrored version (see footnote 5). The filter that we are going to analyze has transfer function

$$H(z) = a_0 + a_1 z^{-1} + a_0 z^{-2} . \tag{19}$$

The impulse response is symmetric and, therefore, its phase response is linear. The frequency response can be calculated as

$$\begin{aligned}
H(\omega) &= a_0 + a_1 e^{-j\omega} + a_0 e^{-2j\omega} \\
&= e^{-j\omega} \left( a_0 e^{j\omega} + a_1 + a_0 e^{-j\omega} \right) \\
&= e^{-j\omega} \left( a_1 + 2 a_0 \cos\omega \right) .
\end{aligned} \tag{20}$$

Footnote 5: The reader can analyze the filter $H(z) = 0.5 - 0.5z^{-1}$ and verify that it is a highpass filter.

As we have anticipated, the phase is linear and we have a phase delay of one sample. The magnitude of the frequency response is a function of the two parameters $a_0$ and $a_1$. Therefore, the designer has two degrees of freedom to control, for instance, the magnitude of the frequency response at two distinct frequencies. A first property that one might want to impose is a lowpass shape to the frequency response. The reader, starting from (20), can easily verify that a sufficient condition to ensure that the magnitude of the frequency response is a monotonically decreasing function is that

$$a_1 \geq 2 a_0 \geq 0 . \tag{21}$$

If we want to set the magnitude $A_1$ at the frequency $\omega_1$ and the magnitude $A_2$ at the frequency $\omega_2$ we have to solve the linear system of equations

$$\begin{aligned}
a_1 + 2 a_0 \cos\omega_1 &= A_1 \\
a_1 + 2 a_0 \cos\omega_2 &= A_2 ,
\end{aligned}$$

that can be expressed in matrix form as

$$\begin{bmatrix} 1 & 2\cos\omega_1 \\ 1 & 2\cos\omega_2 \end{bmatrix} \begin{bmatrix} a_1 \\ a_0 \end{bmatrix} = \begin{bmatrix} A_1 \\ A_2 \end{bmatrix} . \tag{22}$$

For instance, if $\omega_1 = 0.01$, $\omega_2 = 2.0$, $A_1 = 1.0$ and $A_2 = 0.5$, in Octave/Matlab a system such as this can be written and solved with the script

    w1 = 0.01; w2 = 2.0; A1 = 1.0; A2 = 0.5;
    A = [1 2*cos(w1); 1 2*cos(w2)];
    b = [A1; A2];
    a = A \ b; % solution of the system b = A a

and the solutions returned for the variables $a_1$ and $a_0$ are, respectively,

    a =
      0.64693
      0.17654

The frequency response of this filter is shown in fig. 8. If we design the second-order filter by specification of the frequency response at two arbitrary frequencies, we can easily get a magnitude response larger than one at zero frequency (also called dc frequency). Especially in signal processing flowgraphs
having loops it is often desirable to normalize the maximum value of the magnitude response to one, in such a way that amplifications generating instabilities can be avoided. Of course, it is always possible to rescale the filter input or output by a scalar that is the reciprocal of $H(0) = a_1 + 2a_0$, so that the response is forced to be unitary at dc (see footnote 6).

Figure 8: Frequency response (magnitude (a) and phase (b)) of the length-3 linear-phase FIR filter with coefficients $a_0 = 0.17654$ and $a_1 = 0.64693$

Instead of drawing the pole-zero diagram of the filter, let us represent the contours of the logarithm of the magnitude of the transfer function, evaluated on the complex plane in a square centered on the origin (see fig. 9). The effects of the double pole in the origin and of the zeros $z = -0.29695$ and $z = -3.36754$ are clearly visible. A filter such as (8) has been proposed as part of an algorithm for synthesis of plucked string sounds [104].

We have seen that an FIR filter is the realization of a convolution between the input signal and the sequence of coefficients. The computation of this convolution can be made explicit in a language such as Octave and, indeed, this is what we have done in appendix B.1 for the simple filter of length 2. For high-order filters it is more convenient to use algorithms that increase the efficiency of convolution. In Octave, there is the function fftfilt that, given a vector b of coefficients and an input signal x, returns the output of the FIR filter (see footnote 7). In order to perform this computation, fftfilt computes an FFT
Footnote 6: The reader is invited to reformulate the system (22) with $\omega_1 = 0$ and $\omega_2 = \pi$. This corresponds to setting the magnitude at dc and at the Nyquist frequency.

Footnote 7: In Matlab, the same function is available in the Signal Processing Toolbox. In any case, the Octave version of fftfilt, available in the web repository of this book, can also be used in Matlab.
of the coefficients and an FFT of the input signal, it multiplies the two transforms point by point (convolution in the time domain is multiplication in the transform domain), and it applies an inverse FFT to the result. Since the FFT of a length-$N$ sequence has complexity of the order of $N \log N$ and the point-by-point multiplication has complexity of the order of $N$, the convolution computed in this way has complexity of the order of $N \log N$. For sequences longer than a few samples, such a procedure is much faster than direct convolution. For even longer sequences, it is convenient to decompose the sequences into blocks and repeat the operations block by block. The partial results are then recomposed by partial addition of neighboring blocks of results. The detailed explanation of this technique is reported in several signal processing books, such as [67].

Figure 9: Magnitude of the transfer function [in dB] of an order-2 FIR filter on the complex $z$ plane

Most sound processing languages and real-time sound processing environments have primitive functions to compute the output of FIR filters. For instance, in SAOL (see appendix B.2) there is the function fir(input, h0, h1, h2, ...) that takes the input signal and the filter coefficients as arguments.

Example 2. In order to strengthen our understanding of FIR filters, we approach the design of a 10th-order linear-phase filter having unit response at dc and an attenuation of 20 dB at $F_s/6$. The impulse response of a 10th-order (or length-11) filter can be considered as the convolution of the responses of five 2nd-order filters. Therefore, it is sufficient to design a length-3 filter with one fifth of the attenuation (4 dB) at $F_s/6$ and to convolve five copies of this filter. The reader is invited to design the filter and to experience its effect using a sound processing language or real-time environment.
A related task is the design of a highpass filter of the same length having a magnitude response that is symmetric to the response of the lowpass filter. Is there any law of symmetry that relates the coefficients of the two filters? How are the zeros distributed in the complex plane in the two cases? A further interesting exercise is the analysis and experimentation of the frequency response of the parallel connection of the two filters.

Development. The Octave/Matlab script that follows answers most of the questions. The remaining questions are left to the reader.

    global_decl; plat = platform('octave');
    w0 = 0; A0 = 1;              % Response at dc
    w1 = pi/3; A1 = 0.1^(1/5);   % Response at Fs/6 (1/5 of 20 dB)
    %% coefficients of the length-3 FIR filter
    A = [1 2*cos(w0); 1 2*cos(w1)];
    b = [A0; A1];
    a = A\b;
    a1 = a(1)
    a0 = a(2)
    w = [0:0.01:pi];
    %% frequency response of the length-3 FIR filter
    H = a0 + a1*exp(-i*w) + a0*exp(-i*2*w);
    %% frequency response of the length-11 FIR filter
    %% (cascade of 5 length-3 filters)
    H11 = H.^5;
    subplot(2,2,1);
    plot(w, 20*log10(abs(H11)));
    xlabel('frequency [rad/sample]');
    ylabel('magnitude [dB]');
    axis([0,pi,-90,0]); grid;
    eval(myreplot); pause;
    %% pole-zero plot
    %% In Matlab, it can be done with
    %% the single line:
    %% zplane(roots([a0,a1,a0]),0);
    w_all = [0:0.05:2*pi];
    subplot(2,2,2);
    plot(exp(i*w_all), '.'); hold on;
    zeri = roots([a0, a1, a0]);
    plot(real(zeri),imag(zeri), 'o');
    plot(0,0, 'x'); hold off;
    xlabel('Re'); ylabel('Im');
    axis ([-1.2, 1.2, -1.2, 1.2]);
    if (plat=='matlab') axis ('square'); end;
    eval(myreplot); pause;
    k = [0:10]';
    kernelw = exp(-i*k*w);
    aa = H11 / kernelw
    subplot(2,2,3);
    plot([0:10],real(aa),'+');
    xlabel('samples'); ylabel('h');
    grid; axis;
    eval(myreplot);
    aa2 = conv([a0 a1 a0],[a0 a1 a0]);
    aa3 = conv(aa2,[a0 a1 a0]);
    aa4 = conv(aa3,[a0 a1 a0]);
    aa5 = conv(aa4,[a0 a1 a0])
    %% verify that aa5 = aa:
    %% by composition of convolutions we get
    %% the same length-11 filter

In the first couple of lines, the script converts the
specifications into those for a length-3 FIR filter. Then, this elementary filter is designed using the technique previously presented in this section. The frequency response H11 of the length-11 filter is obtained by raising the frequency response of the length-3 filter to the fifth power. The magnitude of the frequency response is depicted in fig. 10. We see that the specifications are met. However, the response is not monotonically decreasing. This is due to the fact that the specifications are quite demanding, so that condition (21) cannot be satisfied. In fact, the coefficients turn out to be $a_0 = 0.369$ and $a_1 = 0.262$, and the zeros are not real but complex conjugate, as shown in the pole-zero plot of fig. 11. The impulse response of the 10th-order FIR filter is obtained from its frequency response by solving for $[a_0\; a_1\; \ldots\; a_{10}]$ the matrix
equation

$$[a_0\; a_1\; \ldots\; a_{10}] \begin{bmatrix} 1 \\ e^{-j\omega} \\ \vdots \\ e^{-j10\omega} \end{bmatrix} = H_{11}(\omega) , \tag{23}$$

which is all contained in the lines

    k = [0:10]';
    kernelw = exp(-i*k*w);
    aa = H11 / kernelw;

Finally, the ending lines of the script aim at verifying that the same impulse response can be obtained by iterated convolution of the 2nd-order impulse response. The length-11 impulse response is shown in fig. 12.

Figure 10: Magnitude of the frequency response of the length-11 filter

### 2.1.4 Realizations of FIR Filters

Digital filters, especially FIR filters, can be implemented as a sequence of multiply-and-accumulate operations, often called MAC. In order to run an $N$th-order FIR filter we need to have, at any instant, the current input sample together with the sequence of the $N$ preceding samples. These $N$ samples constitute the memory of the filter. In practical implementations, it is customary to allocate the memory in contiguous cells of the data memory or, in any case, in locations that can be easily accessed sequentially. At every sampling instant, the state must be updated in such a way that $x(k)$ becomes $x(k-1)$, and this
seems to imply a shift of $N$ data words in the filter memory. Actually, instead of moving data, it is convenient to move the indexes that access the data. Consider the scheme depicted in fig. 13, which represents the realization of an FIR filter of order 3. The three memory words are put in an area organized as a circular buffer. The input is written to the word pointed to by the index, and the three preceding values of the input are read with the three preceding values of the index. At every sample instant, the four indexes are incremented by one, with the trick of beginning again from location 0 whenever we exceed the length $M$ of the buffer (this ensures the circularity of the buffer). The counterclockwise arrow indicates the direction taken by the indexes, while the clockwise arrow indicates the movement that the data would have to make if the indexes stayed in a fixed position. In fig. 13 we use small triangles to indicate the multiplications by the filter coefficients. This is a notation commonly used for multiplications within the signal flowgraphs that represent digital filters.

As a matter of fact, an FIR filter contains a delay line, since it stores $N$ consecutive samples of the input sequence and uses each of them with a delay of $N$ samples at most. The points where the circular buffer is read are called taps, and the whole structure is called a tapped delay line.

Figure 11: Pole-zero plot for the length-3 FIR filter

Figure 12: Impulse response of the length-11 FIR filter

Figure 13: Circular buffer that implements a 3rd-order FIR filter

### 2.2 IIR Filters

In general, a causal IIR filter is represented by a difference equation where the output signal at a given instant is obtained as a linear combination of samples of the input and output signals at previous time instants.
Moreover, an instantaneous dependency of the output on the input is also usually included in the IIR filter. The difference equation that represents an IIR filter is

$$y(n) = -\sum_{m=1}^{N} a_m y(n-m) + \sum_{m=0}^{M} b_m x(n-m) . \tag{24}$$

Eq. (24) is also called the Auto-Regressive Moving Average (ARMA) representation. While the impulse response of FIR filters has a finite time extension, the impulse response of IIR filters has, in general, an infinite extension. The transfer function is obtained by application of the Z transform to the difference equation (24). By virtue of the shift theorem, the Z transform amounts to a mere substitution of each translation by $m$ samples with a multiplication by $z^{-m}$. The result is the rational function $H(z)$ that relates the Z transform of the output to the Z transform of the input:

$$Y(z) = \frac{b_0 + b_1 z^{-1} + \cdots + b_M z^{-M}}{1 + a_1 z^{-1} + \cdots + a_N z^{-N}}\, X(z) = H(z) X(z) . \tag{25}$$

The filter order is defined as the degree of the polynomial in $z^{-1}$ that is the denominator of (25).

### 2.2.1 The Simplest IIR Filter

In this section we analyze the properties of the simplest nontrivial IIR filter that can be conceived: the one-pole filter having coefficients $a_1 = -\frac{1}{2}$ and $b_0 = \frac{1}{2}$:

$$y(n) = \frac{1}{2} y(n-1) + \frac{1}{2} x(n) . \tag{26}$$

The transfer function of this filter is

$$H(z) = \frac{1/2}{1 - \frac{1}{2} z^{-1}} . \tag{27}$$

If the filter (26) is fed with a unit impulse at instant 0, the response will be:

$$y = 0.5, 0.25, 0.125, 0.0625, \ldots . \tag{28}$$

It is clear that the impulse response is nonzero over an infinitely extended support, and every sample is obtained by halving the preceding one.

Similarly to what we did for the first-order FIR filter, we analyze the behavior of this filter using a complex sinusoid having magnitude $A$ and initial phase $\phi$, i.e., the signal $A e^{j(\omega_0 n + \phi)}$. Since the system is linear, we do not lose any generality by considering unit-magnitude signals ($A = 1$). Moreover, since the system is time invariant, we do not lose generality by considering signals having the initial phase set to zero ($\phi = 0$). In a linear and time-invariant system, the steady-state response to a complex sinusoidal input is a complex sinusoidal output.
To have a confirmation of that, we can consider the reversed form of (26)

$$x(n) = 2y(n) - y(n-1) , \tag{29}$$

and replace the output $y(n)$ with a complex sinusoid, thus obtaining

$$x(n) = 2e^{j\omega_0 n} - e^{j\omega_0 (n-1)} = \left(2 - e^{-j\omega_0}\right) y(n) . \tag{30}$$

Eq. (30) shows that a sinusoidal output gives a sinusoidal input, and vice versa. The input sinusoid gets rescaled in magnitude and shifted in phase. Namely, the output $y$ is a copy of the input multiplied by the complex quantity $\frac{1}{2 - e^{-j\omega_0}}$, which is the value taken by the transfer function (27) at the point $z = e^{j\omega_0}$. The frequency response is

$$H(\omega) = \frac{1/2}{1 - \frac{1}{2} e^{-j\omega}} , \tag{31}$$

and there are no simple formulas to express its magnitude and phase, so that we have to resort to the graphical representation, depicted in fig. 14. This simple filter still has a lowpass shape. As compared to the first-order FIR filter, the one-pole filter gives a steeper magnitude response curve. The fact that, for a given filter order, IIR filters give a steeper (or, in general, a more complex) frequency response is a general property that can be seen as an advantage in preferring IIR over FIR filters. The other side of the coin is that IIR filters cannot have a perfectly linear phase. Furthermore, IIR filters can produce numerical artifacts, especially in fixed-point implementations.

The one-pole filter can also be analyzed by watching its pole-zero distribution on the complex plane. To this end, we rewrite the transfer function as a ratio of polynomials in $z$ and give a name to the root of the denominator: $p_0 = \frac{1}{2}$. The transfer function has the form

$$H(z) = \frac{1}{2} \frac{z}{z - \frac{1}{2}} = \frac{1}{2} \frac{z}{z - p_0} . \tag{32}$$
Figure 14: Frequency response (magnitude (a) and phase (b)) of a one-pole IIR filter

We can apply the graphic method presented in sec. 2.1.1 to have a qualitative idea of the magnitude and phase responses. In order to do that, we consider the point $e^{j\omega}$ on the unit circle as the head of the vectors that connect it to the pole $p_0$ and to the zero in the origin. Fig. 15 is illustrative of the procedure. While we move along the unit circumference from dc to the Nyquist frequency, we go progressively away from the pole, and this is reflected by the monotonically decreasing shape of the magnitude response.
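As a minimal sketch of the one-pole filter discussed above, the following Python code runs the difference equation (26) on a unit impulse and evaluates the magnitude response in two ways: from the distance between the point $e^{j\omega}$ and the pole (the graphic method), and directly from (31). The function names are illustrative choices, not part of the book's software.

```python
import cmath
import math

P0 = 0.5  # pole position of the filter (26)
B0 = 0.5  # input gain of the filter (26)

def one_pole(x, p0=P0, b0=B0):
    """Run y(n) = p0*y(n-1) + b0*x(n) over the input sequence x."""
    y, prev = [], 0.0
    for sample in x:
        prev = p0 * prev + b0 * sample
        y.append(prev)
    return y

def magnitude_from_pole_distance(w, p0=P0, b0=B0):
    """|H(w)| as b0 times the distance from exp(j*w) to the zero in the
    origin (always 1), divided by the distance from exp(j*w) to the pole."""
    point = cmath.exp(1j * w)
    return b0 * abs(point) / abs(point - p0)

def magnitude_from_transfer_function(w, p0=P0, b0=B0):
    """|H(w)| evaluated directly from eq. (31)."""
    return abs(b0 / (1 - p0 * cmath.exp(-1j * w)))

if __name__ == "__main__":
    print(one_pole([1.0, 0.0, 0.0, 0.0]))  # first samples of the sequence (28)
    for k in range(5):
        w = k * math.pi / 4
        print(w, magnitude_from_pole_distance(w))
```

The two magnitude computations agree, and the values decrease steadily from dc to the Nyquist frequency as the point on the unit circle moves away from the pole.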
Figure 15: Single pole (×) and zero in the origin (◦)

To have a complete picture of the filter behavior we need to analyze the transient response to the causal complex exponential. The Z transform of the input has the well-known form

$$X(z) = \frac{1}{1 - e^{j\omega_0} z^{-1}} . \tag{33}$$

A multiplication of $X(z)$ by $H(z)$ in the Z domain gives

$$\begin{aligned}
Y(z) &= H(z) X(z) = \frac{1}{2}\, \frac{1}{1 - \frac{1}{2} z^{-1}}\, \frac{1}{1 - e^{j\omega_0} z^{-1}} \\
&= \frac{1/2}{1 - 2e^{j\omega_0}}\, \frac{1}{1 - \frac{1}{2} z^{-1}} + \frac{1/2}{1 - \frac{1}{2} e^{-j\omega_0}}\, \frac{1}{1 - e^{j\omega_0} z^{-1}} ,
\end{aligned} \tag{34}$$

where we have done a partial fraction expansion of $Y(z)$. The second addendum of the last member of (34) represents the steady-state response, and it is the product of the Z transform of the causal complex exponential sequence by the filter frequency response evaluated at the frequency of the input signal. The first addendum of the last member of (34) represents the transient response, and it can be represented as a causal exponential sequence:

$$y_t(n) = A\, p_0^{\,n} , \tag{35}$$

where $A = \frac{1/2}{1 - 2e^{j\omega_0}}$. Since $|p_0| < 1$ (i.e., the pole is within the unit circle), the transient response is doomed to die out for increasing values of $n$. In general, for causal systems, the stability condition (29) of chapter 1 is shown to be equivalent to having all the poles within the unit circle. If the condition is not satisfied, even if the steady-state response is bounded, the transient will diverge. In terms of the Z transform, a system is stable if the region of convergence is a geometric ring containing the unit circumference; the system is causal if such a ring extends to infinity outside the circle, and it is anticausal if it extends down to the origin.

It is useful to evaluate the time needed to exhaust the initial transient. We define the time constant $\tau_n$ (in samples) of the filter as the time taken by the exponential sequence $p_0^{\,n}$ to reduce its amplitude to 1% of the initial value.
We have

$$p_0^{\,\tau_n} = 0.01 , \tag{36}$$

and, therefore,

$$\tau_n = \frac{\ln 0.01}{\ln p_0} , \tag{37}$$

where the logarithm can be evaluated in any base. In our example, where $p_0 = 1/2$, we obtain $\tau_n \approx 6.64$ samples. The time constant in seconds $\tau$ is obtained by multiplication of $\tau_n$ by the sampling period, i.e., by division of $\tau_n$ by the sampling rate. This way of evaluating the time constant corresponds to evaluating the time needed to attenuate the transient response by 40 dB. When we refer to systems for artificial reverberation, this threshold of attenuation is moved to 60 dB, thus corresponding to 0.1% of the initial amplitude of the impulse response.

In the case of higher-order IIR filters, we can always do a partial fraction expansion of the response to a causal exponential sequence, in a way similar to what has been done in (34), where each addendum but the last one corresponds to a single complex pole of the transfer function. The transient response of these systems is, therefore, the superposition of causal complex exponentials, each corresponding to a complex pole of the transfer function. If the goal is to estimate the duration of the transient response, the pole that is closest to the unit circumference is the dominant pole, since its time constant is the longest. It is customary to define the time constant of the whole system as the constant associated with the dominant pole.

### 2.2.2 Higher-Order IIR Filters

The two-pole IIR filter is a very important component of any sound processing environment. Such a filter, which is capable of selecting the frequency components in a narrow range, can find practical applications as an elementary resonator. Instead of starting from the transfer function or from the difference equation, in this case we begin by positioning the two poles in the complex plane at the point

$$p_0 = r e^{j\omega_0} \tag{38}$$

and at its conjugate point $p_0^* = r e^{-j\omega_0}$. In fact, if $p_0$ is not real, the two poles must be complex conjugate if we want to have a real-coefficient transfer function.
In order to make sure that the filter is stable, we impose $r < 1$. The transfer function of the second-order filter can be written as

$$\begin{aligned}
H(z) &= \frac{G}{\left(1 - r e^{j\omega_0} z^{-1}\right)\left(1 - r e^{-j\omega_0} z^{-1}\right)} \\
&= \frac{G}{1 - r\left(e^{j\omega_0} + e^{-j\omega_0}\right) z^{-1} + r^2 z^{-2}} = \frac{G}{1 - 2r\cos\omega_0\, z^{-1} + r^2 z^{-2}} \\
&= \frac{G}{1 + a_1 z^{-1} + a_2 z^{-2}} ,
\end{aligned} \tag{39}$$

where $G$ is a parameter that allows us to control the total gain of the filter. As usual, we obtain the frequency response by substitution of $z$ with $e^{j\omega}$ in (39):

$$H(\omega) = \frac{G}{1 - 2r\cos\omega_0\, e^{-j\omega} + r^2 e^{-2j\omega}} . \tag{40}$$

If the input is a complex sinusoid at the (resonance) frequency $\omega_0$, the response is, from the first of (39):

$$H(\omega_0) = \frac{G}{(1 - r)\left(1 - r e^{-2j\omega_0}\right)} = \frac{G}{(1 - r)\left(1 - r\cos 2\omega_0 + j r \sin 2\omega_0\right)} . \tag{41}$$

In order to have a unit-magnitude response at the frequency $\omega_0$ we have to impose

$$|H(\omega_0)| = 1 \tag{42}$$

and, therefore,

$$G = (1 - r)\sqrt{1 - 2r\cos 2\omega_0 + r^2} . \tag{43}$$

The frequency response of this normalized filter is reported in fig. 16 for $r = 0.95$ and $\omega_0 = \pi/6$. It is interesting to notice the large step experienced by the phase response around the resonance frequency. This step approaches $\pi$ as the poles get closer to the unit circumference.
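The normalization (43) can be checked numerically. The Python sketch below, with illustrative function names, builds the coefficients of (39) from $r$ and $\omega_0$ and evaluates the frequency response (40); it assumes the same values $r = 0.95$ and $\omega_0 = \pi/6$ used for fig. 16.

```python
import cmath
import math

def resonator_coeffs(r, w0):
    """Gain and denominator coefficients of eq. (39), with G chosen as in eq. (43)."""
    a1 = -2 * r * math.cos(w0)
    a2 = r * r
    gain = (1 - r) * math.sqrt(1 - 2 * r * math.cos(2 * w0) + r * r)
    return gain, a1, a2

def resonator_response(w, r, w0):
    """Frequency response of eq. (40) at frequency w (radians per sample)."""
    gain, a1, a2 = resonator_coeffs(r, w0)
    z1 = cmath.exp(-1j * w)  # z^-1 evaluated on the unit circle
    return gain / (1 + a1 * z1 + a2 * z1 * z1)

if __name__ == "__main__":
    r, w0 = 0.95, math.pi / 6
    for w in (0.0, w0, math.pi):
        print(w, abs(resonator_response(w, r, w0)))
```

With this gain the magnitude is one at $\omega_0$ and smaller than one at dc and at the Nyquist frequency.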
Figure 16: Frequency response (magnitude (a) and phase (b)) of a two-pole IIR filter

It is useful to draw the pole-zero diagram in order to gain intuition about the frequency response. The magnitude of the frequency response is found by
taking the ratio of the product of the magnitudes of the vectors that go from the zeros to the unit circumference to the product of the magnitudes of the vectors that go from the poles to the unit circumference. The phase response is found by taking the difference between the sum of the angles of the vectors starting from the zeros and the sum of the angles of the vectors starting from the poles. If we move along the unit circumference from dc to the Nyquist frequency, we see that, as we approach the pole, the magnitude of the frequency response increases, and it decreases as we move away from the pole. Reasoning on the complex plane it is also easier to figure out why there is a step in the phase response and why the width of this step converges to $\pi$ as we move the pole toward the unit circumference.

Figure 17: Couple of poles on the complex plane

In the computation of the frequency response it is clear that, in the neighborhood of a pole close to the unit circumference, the vector that comes from that pole is dominant over the others. This means that, accepting some approximation, we can neglect the longer vectors and consider only the shortest vector while evaluating the frequency response in that region. This approximation is useful to calculate the bandwidth $\Delta\omega$ of the resonant filter, which is defined as the difference between the two frequencies corresponding to a magnitude attenuation by 3 dB, i.e., a ratio $1/\sqrt{2}$. Under the simplifying assumption that only the local pole is exerting some influence in the neighboring area, we can use the geometric construction of fig. 18 in order to find an expression for the bandwidth [67]. The segment $P_0A$ is $\sqrt{2}$ times larger than the segment $P_0P$. Therefore, the triangle formed by the points $P_0$, $A$, $P$ has two equal, orthogonal edges, and $AB = 2P_0P = 2(1 - r)$. If $AB$ is small enough, its length can be approximated with that of the arc subtended by it, which is the bandwidth that we are looking for.
Figure 18: Graphic construction of the bandwidth. $P_0$ is the pole. $P_0P \approx 1 - r$.

Summarizing, for poles that are close to the unit circumference, the bandwidth is given by

$$\Delta\omega = 2(1 - r) . \tag{44}$$

The formula (44) can be used during a filter design stage in order to guide the pole placement on the complex plane.

The transfer function (39) can be expanded in partial fractions as

$$\begin{aligned}
H(z) &= \frac{G}{\left(1 - r e^{j\omega_0} z^{-1}\right)\left(1 - r e^{-j\omega_0} z^{-1}\right)} \\
&= \frac{G / \left(1 - e^{-j2\omega_0}\right)}{1 - r e^{j\omega_0} z^{-1}} - \frac{G e^{-j2\omega_0} / \left(1 - e^{-j2\omega_0}\right)}{1 - r e^{-j\omega_0} z^{-1}} ,
\end{aligned} \tag{45}$$

and each addendum is the Z transform of a causal complex exponential sequence. By manipulating the two sequences algebraically and expressing the sine function as the difference of complex exponentials we can obtain the analytic expression of the impulse response (see footnote 8)

$$h(n) = \frac{G r^n}{\sin\omega_0} \sin\left(\omega_0 n + \omega_0\right) . \tag{46}$$

The impulse response is depicted in fig. 19, which shows that a resonant filter can be interpreted in the time domain as a damped oscillator with a characteristic frequency that corresponds to the phase of the poles in the complex plane.
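Both results of this passage lend themselves to a quick numerical check. The Python sketch below, whose function names are hypothetical, compares the closed form (46) with the impulse response obtained by running the difference equation associated with (39), and measures the 3 dB bandwidth by scanning the magnitude response so that it can be compared with the approximation (44).

```python
import cmath
import math

def impulse_response_recursive(g, r, w0, length):
    """Feed a unit impulse to y(n) = 2r cos(w0) y(n-1) - r^2 y(n-2) + g x(n)."""
    y1 = y2 = 0.0
    out = []
    for n in range(length):
        x = 1.0 if n == 0 else 0.0
        y = 2 * r * math.cos(w0) * y1 - r * r * y2 + g * x
        out.append(y)
        y2, y1 = y1, y
    return out

def impulse_response_closed_form(g, r, w0, length):
    """Eq. (46): h(n) = g r^n sin(w0 n + w0) / sin(w0)."""
    return [g * r ** n * math.sin(w0 * n + w0) / math.sin(w0)
            for n in range(length)]

def measured_bandwidth(r, w0, step=1e-4):
    """Width of the band where the magnitude stays above peak / sqrt(2),
    found by scanning the response between dc and the Nyquist frequency."""
    count = int(math.pi / step)
    ws = [k * step for k in range(count + 1)]
    mags = []
    for w in ws:
        z1 = cmath.exp(-1j * w)
        mags.append(1 / abs(1 - 2 * r * math.cos(w0) * z1 + r * r * z1 * z1))
    peak = max(range(len(mags)), key=mags.__getitem__)
    threshold = mags[peak] / math.sqrt(2)
    lo = hi = peak
    while lo > 0 and mags[lo] >= threshold:
        lo -= 1
    while hi < len(mags) - 1 and mags[hi] >= threshold:
        hi += 1
    return ws[hi] - ws[lo]

if __name__ == "__main__":
    r, w0 = 0.95, math.pi / 6
    print("approximation (44):", 2 * (1 - r))
    print("measured bandwidth:", measured_bandwidth(r, w0))
```

The measured bandwidth is expected to be close to, but not exactly equal to, $2(1 - r)$, since the construction of fig. 18 neglects the conjugate pole and replaces the arc with a chord.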
Footnote 8: The reader is invited to work out the expression (46).
Figure 19: Impulse response of a second-order resonant filter

As we have anticipated in sec. 2.2.1, the time constant is determined by evaluating the distance of one of the poles from the unit circumference. In the specific case that we are examining, such a time constant is

$$\tau_n = \frac{\ln 0.01}{\ln r} = \frac{\ln 0.01}{\ln 0.95} \approx 90 \text{ samples} , \tag{47}$$

and we can verify from fig. 19 that this value makes sense.

Example 3. With the example that follows we face the problem of doing a practical implementation of a filter. The platform that we adopt is the Csound language (see appendix B.2) and our prototypical implementation is the second-order all-pole IIR filter. This simple example can be extended to higher-order filters. We design an "orchestra" of two instruments: an excitation instrument and a filtering block. The excitation block generates white noise. The filtering block extracts from the noise the components in a band around a center frequency, passed as a parameter, that corresponds to the phase of the pole (see footnote 9). Another parameter is the decay time of the response of the resonant filter, which is related to the resonance bandwidth.

Footnote 9: Indeed, the central frequency of the passing frequency band is not coincident with the phase of the complex pole, since the conjugate pole can exert some influence and slightly modify the frequency response in the neighborhood of the other pole. However, for our purposes it is not dangerous to mix the two concepts, provided that the resulting spectrum corresponds to our needs.

The Csound orchestra that implements our two blocks is:

    ; res.orc: by Francesco Scagliola and Davide Rocchesso
    sr=44100
    kr=44100
    ksmps=1
    nchnls=1

    ga1    init 0
    gamp   init 30000

    instr 1
    a1     rand gamp   ; white noise generator
    ga1    = a1        ; sound to be passed to the filter
    endin

    instr 2
    ; p4   central frequency
    ; p5   decay time
    ipi    = 3.141592654
    ithres = 0.01  ; the duration of the frequency response
                   ; is measured in seconds until the response
                   ; goes below the threshold 20*log10(ithres)
                   ; [-40 dB]
    iw0    = 2*ipi*p4/sr  ; frequency corresponding to the pole phase
    ir     = exp((1/(sr*p5))*log(ithres))  ; radius of the pole
    ia1    = 2*ir*cos(iw0)  ; coefficient a1 of the filter denominator
    ia2    = ir*ir          ; coeff. a2 of the filter denominator
    ig     = (1-ir)*sqrt(1-2*ir*cos(2*iw0)+ir*ir)* \
             10*sqrt(p5)    ; coefficient to have unit gain at the
                            ; center of the band
    izero  = 0
    as1    init izero  ; initialize the filter status
    as2    init izero
    afilt  = ia1*as1-ia2*as2+ig*ga1  ; difference equation
           out afilt
    as2    = as1       ; filter status update
    as1    = afilt
    endin

The orchestra can be experimented with the score

    ;instr.  time  durat.  freq.  decay
    i1       0     30.0
    i2       0     5       700    0.1
    i2       5     5       700    1.0
    i2       10    5       1700   0.2
    i2       15    5       2900   2.0
    i2       20    5       700    1.0
    i2       20    5       1700   1.5
    i2       20    5       2900   2.0

The sounds resulting from the score performance are represented in the sonogram of fig. 20, where larger magnitudes are represented by darker points.

In the filtering instrument, the filter coefficients are computed according to the formulas (47) and (39), starting from the given decay time and central frequency. Moreover, the signal is rescaled by a gain such that the magnitude of the frequency response is one at the central frequency.
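The mapping from decay time and central frequency to filter coefficients can be sketched outside Csound as well. The following Python code is an illustration under stated assumptions, not a transcription of the orchestra: the function names are hypothetical, the denominator coefficients follow the sign convention of (39), and only the normalizing gain of (43) is computed.

```python
import math

def pole_radius_from_decay(decay_seconds, sample_rate, threshold=0.01):
    """Radius r such that r**n reaches `threshold` after
    decay_seconds * sample_rate samples."""
    return math.exp(math.log(threshold) / (sample_rate * decay_seconds))

def time_constant_samples(r, threshold=0.01):
    """Eq. (47): number of samples needed by r**n to reach `threshold`."""
    return math.log(threshold) / math.log(r)

def resonator_design(freq_hz, decay_seconds, sample_rate=44100.0):
    """Coefficients of the two-pole resonator (39) with the gain of eq. (43)."""
    w0 = 2 * math.pi * freq_hz / sample_rate
    r = pole_radius_from_decay(decay_seconds, sample_rate)
    a1 = -2 * r * math.cos(w0)
    a2 = r * r
    gain = (1 - r) * math.sqrt(1 - 2 * r * math.cos(2 * w0) + r * r)
    return {"w0": w0, "r": r, "a1": a1, "a2": a2, "gain": gain}

if __name__ == "__main__":
    print(time_constant_samples(0.95))  # close to the 90 samples of eq. (47)
    print(resonator_design(700.0, 1.0))
```

Dividing the time constant in samples by the sampling rate gives back the decay time in seconds, which is the relation used to compute the pole radius in the orchestra.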
Empirically, we have found that, in order to keep some homogeneity in the output energy level even for very narrow frequency responses, it is useful to insert a further factor equal to ten times the square root of the decay time.

Another observation concerns the difference equation. This equation uses two state variables, as1 and as2, which store the previous values of the output. The state variables are updated in the final two lines of the instrument.

It is interesting to reduce the control rate in the orchestra, for instance by a factor of ten. The resulting sounds will have fundamental frequencies lowered by the same factor, and the spectrum will be repeated at multiples of sr/10. This kind of artifact is often found when writing explicit filtering structures in CSound with a sample rate different from the control rate. The reason for this strange behavior lies in the special block processing used by the CSound interpreter, which uses sr/kr variables for each signal variable indicated in the orchestra, and updates all these variables in the same cycle. This means that, as a matter of fact, we get sr/kr filters, each working at a reduced sample rate on a signal undersampled by a factor sr/kr. The samples of the partial results are then interleaved to give the signal at the sampling rate sr. The output of each of the undersampled filters is subject to an upsampling that produces the sr/kr periodic replicas of the spectrum.

Figure 20: Sonogram of a musical phrase produced by filtering white noise

### Positioning the zeros

We have seen how the poles can be positioned within the unit circle in order to give resonances at the desired frequency and with the desired bandwidth. The ratio between the central frequency and the width of a band is often called quality factor and indicated with the symbol Q.
In many cases, it is necessary to design a filter having a flat frequency response (in magnitude) except for a narrow zone around a frequency $\omega_0$, where it amplifies or attenuates. The resonant filter that we have just introduced can be modified for this purpose by introducing a couple of zeros positioned near the poles. In particular, the numerator of the transfer function will be the polynomial in $z^{-1}$ having roots at $z_0 = r_0 e^{j\omega_0}$ and at $z_0^* = r_0 e^{-j\omega_0}$. By means of a qualitative analysis of the pole-zero diagram we can realize that, if $r_0 < r$, we have a boost of the frequency response, and if $r_0 > r$ we have an attenuation (a notch) of the response around $\omega_0$. The reader is invited to do this qualitative analysis on her own and to write the Octave/Matlab script that produces fig. 21, which is obtained using the values $r_0 = 0.9$ and $r_0 = 1.0$. We notice that the phase jumps down by 2π radians when we cross a zero lying on the unit circumference.
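The qualitative analysis can be backed by a few numbers. The sketch below is only an illustration in Python rather than the Octave/Matlab script mentioned in the text; the function name is hypothetical and the center frequency π/4 is an assumption, since the text does not state the value used for the figure.

```python
import cmath
import math

def pole_zero_magnitude(w, w0, r, r0):
    """Magnitude at w of a filter with poles r*exp(+-j*w0) and zeros r0*exp(+-j*w0)."""
    z1 = cmath.exp(-1j * w)                                   # z^-1 on the unit circle
    num = 1.0 - 2.0 * r0 * math.cos(w0) * z1 + (r0 * z1) ** 2
    den = 1.0 - 2.0 * r * math.cos(w0) * z1 + (r * z1) ** 2
    return abs(num / den)

W0 = math.pi / 4.0   # assumed center frequency [rad/sample]
R = 0.95             # pole radius, as in the text

boost_at_w0 = pole_zero_magnitude(W0, W0, R, 0.9)     # zeros inside the poles
notch_at_w0 = pole_zero_magnitude(W0, W0, R, 1.0)     # zeros on the unit circle
far_from_w0 = pole_zero_magnitude(math.pi, W0, R, 0.9)

print("boost at w0:", boost_at_w0)
print("notch at w0:", notch_at_w0)
print("boost filter far from w0:", far_from_w0)
```

The response stays close to one away from the center frequency because each zero nearly cancels the neighboring pole there.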
Figure 21: Frequency response, (a) magnitude and (b) phase [rad] versus frequency [rad/sample], of an IIR filter with two poles (r = 0.95) and two zeros. The notch filter (dashed line) has the zeros with magnitude 1.0. The boost filter (solid line) has the zeros with magnitude 0.9.

2.2.3 Allpass Filters

Imagine that we are designing a filter by positioning its poles within the unit circle in the complex plane. For each complex pole $p_i$, let us introduce a zero $z_i = 1/p_i^*$ in the transfer function. In other words, we form the pole-zero couple

$$H_i(z) = \frac{z^{-1} - p_i^*}{1 - p_i z^{-1}} , \tag{48}$$

which places the pole and the zero on reciprocal points about the unit circumference and along the same radius that links them to the origin. Moving along the circumference we can realize that the vectors drawn from the pole and the zero have lengths that keep a constant ratio. A more accurate analysis can be done using the frequency response of this pole-zero couple, which is written as

$$H_i(\omega) = \frac{e^{-j\omega} - p_i^*}{1 - p_i e^{-j\omega}} = e^{-j\omega}\,\frac{1 - p_i^* e^{j\omega}}{1 - p_i e^{-j\omega}} . \tag{49}$$

It is clear that the numerator and denominator of the fraction in the last member of (49) are complex conjugates of each other, which means that the rational function has unit magnitude at any frequency. Therefore, the couple (49) is the fundamental block for the construction of an allpass filter, whose frequency response is obtained by multiplication of blocks such as (49). The allpass filters are systems that leave all frequency component magnitudes unaltered. Stationary sinusoidal input signals can only be subject to phase delays, with no modification in magnitude. The phase response and phase delay of the fundamental pole-zero couple are depicted in fig. 22 for values of the pole set to $p_1 = 0.9$ and $p_1 = -0.9$.

Figure 22: Phase of the frequency response (a) and phase delay (b) for a first-order allpass filter. Pole in $p_1 = 0.9$ (solid line) and pole in $p_1 = -0.9$ (dashed line)

A second-order allpass filter with real coefficients is obtained by multiplication of two allpass pole-zero couples whose poles are the conjugate of each other. Fig. 23 shows the phase response and the phase delay of a second-order allpass filter with poles in $p_1 = 0.9 + i0.2$ and $p_2 = 0.9 - i0.2$ (solid line) and in $p_1 = -0.9 + i0.2$ and $p_2 = -0.9 - i0.2$ (dashed line). It can be shown that the phase response of any allpass filter is always negative and monotonically decreasing [65]. The group and phase delays are always functions that take positive values. This fact allows us to think about allpass filters as media where signals propagate with a frequency-dependent delay, without being subject to any absorption or amplification.

Figure 23: Phase of the frequency response (a) and phase delay (b) for a second-order allpass filter. Poles in $p_{1,2} = 0.9 \pm i0.2$ (solid line) and $p_{1,2} = -0.9 \pm i0.2$ (dashed line)

The reader might think that the allpass filters are like open doors for audio signals, since the phase shifts are barely distinguishable by the human hearing system. Actually, this is true only for stationary signals, i.e., signals formed by stable sinusoidal components. Real-world sounds are made of transients at least as much as they are made of stationary components, and the transient response of allpass filters can be characterized according to what we showed in sec. 2.2. During transients, the phase response plays an important role for perception, and in this sense the allpass filters can modify the sound signals appreciably. For instance, very-high-order allpass filters are used to construct artificial reverberators. These filters usually have a long time constant, so that the effects of their phase response are mainly perceived in the time domain in the form of a reverberation tail.

The importance of allpass filters becomes readily evident when they are inserted into complex computational structures, typically to construct filters whose properties should be easy to control. We will see an example of this use of allpass filters in sec. 2.3.

2.2.4 Realizations of IIR Filters

So far, we have studied the IIR filters by analysis of transfer functions or impulse responses. In this section we want to face the problem of implementing these filters as computational structures that can be directly coded using sound processing languages or real-time sound processing environments.

Consider a second-order filter with two poles and two zeros, which is represented by the transfer function (25) with N = M = 2. This can be realized by the signal flowgraph of fig.
24, where the nodes having converging edges are considered as points of addition, and the nodes having diverging edges are considered as branching points. Such a realization is called Direct Form I.
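A Direct Form I realization translates almost literally into code. The following Python sketch is only illustrative: the function name is hypothetical, and it assumes that the denominator is written as $1 + a_1 z^{-1} + a_2 z^{-2}$, so that the feedback terms enter the difference equation with a minus sign.

```python
def direct_form_1(x, b, a):
    """Filter x with H(z) = (b0 + b1 z^-1 + b2 z^-2) / (1 + a1 z^-1 + a2 z^-2).

    b = (b0, b1, b2) and a = (a1, a2); the sign convention of the
    denominator is an assumption of this sketch.
    """
    b0, b1, b2 = b
    a1, a2 = a
    x1 = x2 = 0.0   # unit delays on the input stage
    y1 = y2 = 0.0   # unit delays on the output stage
    y = []
    for xn in x:
        yn = b0 * xn + b1 * x1 + b2 * x2 - a1 * y1 - a2 * y2
        y.append(yn)
        x2, x1 = x1, xn   # shift the input delays
        y2, y1 = y1, yn   # shift the output delays
    return y

impulse = [1.0, 0.0, 0.0, 0.0, 0.0]
one_pole = direct_form_1(impulse, b=(1.0, 0.0, 0.0), a=(-0.5, 0.0))
fir_only = direct_form_1(impulse, b=(1.0, 2.0, 3.0), a=(0.0, 0.0))
print(one_pole)
print(fir_only)
```

The four state variables correspond to the four unit delays of the flowgraph, two on each side of the summing node.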
Figure 24: Second-order filter, Direct Form I

Signal flowgraphs can be manipulated in several ways, thus leading to alternative realizations that have different numerical properties and are, possibly, more computationally efficient. For instance, if we want to implement a filter as a cascade of second-order cells such as that of fig. 24, we can share, between two contiguous cells, the unit delays that are on the output stage of the first cell with the unit delays that are on the input stage of the second cell, thus saving a number of memory accesses. We are going to show some other kinds of manipulation of signal flowgraphs, in the special case of the realization of the second-order allpass filter, which has the property

$$b_i = a_{2-i} , \qquad i = 0, 1, 2 . \tag{50}$$

A first transformation comes from the observation that the structure of fig. 24 is formed by the cascade of two blocks, each being linear and time invariant. Therefore, the two blocks can be commuted without altering the input-output behavior. Moreover, from the block exchange we get a flowgraph with two side-by-side stages of pure delays, and these stages can be combined into a single one. The result of these transformations is shown in fig. 25 and it is called Direct Form II.

Figure 25: Second-order allpass filter, Direct Form II

Another transformation that can be done on a signal flowgraph without altering its input-output behavior is the transposition [65]. The transposition of a signal flowgraph is done with the following operations:

• Inversion of the direction of all the edges
• Transformation of the nodes of addition into branching nodes, and vice versa
• Exchange of the roles of the input and output edges

The transposition of a realization in Direct Form II leads to the Transposed Form II, which is shown in fig. 26. Similarly, the Transposed Form I is obtained by transposition of the Direct Form I.
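As an illustration of a structure with only two state variables, here is a hedged Python sketch of a second-order Transposed Form II, used with coefficients that satisfy the allpass property (50). The function names and the denominator coefficients are hypothetical, and the denominator is assumed to be $1 + a_1 z^{-1} + a_2 z^{-2}$ with $a_0 = 1$.

```python
import cmath

def transposed_form_2(x, b, a):
    """Second-order filter in Transposed Form II, with two state variables."""
    b0, b1, b2 = b
    a1, a2 = a
    s1 = s2 = 0.0
    y = []
    for xn in x:
        yn = b0 * xn + s1
        s1 = b1 * xn - a1 * yn + s2   # state feeding the output at the next step
        s2 = b2 * xn - a2 * yn        # state feeding s1 at the next step
        y.append(yn)
    return y

def dtft_magnitude(h, w):
    """Magnitude of the Fourier transform of the finite sequence h at w."""
    return abs(sum(hn * cmath.exp(-1j * w * n) for n, hn in enumerate(h)))

a1, a2 = -1.2, 0.72      # hypothetical denominator, poles at 0.6 +- j0.6
b = (a2, a1, 1.0)        # allpass property (50): b_i = a_(2-i), with a_0 = 1
impulse = [1.0] + [0.0] * 399
h = transposed_form_2(impulse, b, (a1, a2))
mags = [dtft_magnitude(h, w) for w in (0.3, 1.0, 2.5)]
print(mags)
```

The impulse response is truncated after 400 samples, which is harmless here because the pole radius is about 0.85 and the tail has decayed far below the numerical precision.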
Figure 26: Second-order allpass filter, Transposed Form II

By direct manipulation of the graph, we can also take advantage of the properties of special filters. For instance, in an allpass filter, the coefficients of the numerator are the same as those of the denominator, in reversed order (see (50)). With simple transformations of the graph of the Direct Form II it is possible to obtain the realization of fig. 27, which is interesting because it has only two multiplies. In fact, the multiplications by −1 can be avoided by replacing two additions with subtractions.

Figure 27: Second-order allpass filter, realization with two multipliers and four state variables

A special structure that plays a very important role in signal processing is the lattice structure, which can be used to implement FIR and IIR filters [65]. In particular, the IIR lattice filters are interesting because they have physical analogues that can be considered as physical sound processing systems. The lattice structure can be defined in a recursive fashion as indicated in fig. 28, where $H_{a_{M-1}}$ is an allpass filter of order $M - 1$, and $k_M$ is called reflection coefficient and is a real number not exceeding one in magnitude. Between the signals $x$ and $y$ there is an all-pole transfer function $1/A(z)$, while between the points $x$ and $y_a$ there is an allpass transfer function $H_{a_M}(z)$ having the same denominator $A(z)$. More precisely, it can be shown that, if $H_{a_{M-1}}$ is an allpass stable transfer function and $|k_M| < 1$, then $H_{a_M}$ is an allpass stable transfer function. Proceeding with the recursion, the allpass filter $H_{a_{M-1}}$ can be realized as a lattice structure, and so on. The recursion termination is obtained by replacing $H_{a_1}$ with a short circuit.

Figure 28: Lattice filter

The lattice section having coefficient $k_M$ can be interpreted as the junction between two cylindrical lossless tubes, where $k_M$ is determined by the ratio between the two cross-sectional areas. This number is also the scaling factor that an incoming wave is subject to when it hits the junction, so that the name reflection coefficient is justified. To have a physical understanding of lattice filters, think of modeling the human vocal tract. The lattice realization of the transfer function that relates the signals produced by the vocal folds to the pressure waves in the mouth can be interpreted as a piecewise cylindrical approximation of the vocal tract. In this book, we do not show how to derive the reflection coefficients from a given transfer function [65]. We just give the result that, for a second-order filter, a denominator such as $A(z) = 1 + a_1 z^{-1} + a_2 z^{-2}$ gives the reflection coefficients¹⁰

$$k_1 = \frac{a_1}{1 + a_2} , \qquad k_2 = a_2 . \tag{51}$$

¹⁰ Verify that the filter is stable if and only if $|k_1| < 1$ and $|k_2| < 1$.

2.3 Complementary filters and filterbanks

In sec. 2.2.4 we have presented several different realizations of allpass filters because they find many applications in signal processing [76]. In particular, a couple of allpass filters is often combined in a parallel structure in such a way that the overall response is not allpass.
If $H_{a1}$ and $H_{a2}$ are two different allpass filters, their parallel connection, having transfer function $H_l(z) = H_{a1}(z) + H_{a2}(z)$, is not allpass. To figure this out, just think about frequencies where the two phase responses are equal. At these points the signal will be doubled at the output of $H_l(z)$. On the other hand, at points where the phase responses differ by π (i.e., they are in phase opposition), the outputs of the two branches cancel out. In order to design a lowpass filter it is sufficient to connect in parallel two allpass filters having phase responses similar to those of fig. 29. The same parallel connection, with a subtraction instead of the addition at the output, gives rise to a highpass filter $H_h(z)$, and it is possible to show that the highpass and the lowpass transfer functions are complementary, in the sense that $|H_l(\omega)|^2 + |H_h(\omega)|^2$ is constant in frequency. Therefore, we have the compact realization of a crossover filter, as depicted in fig. 30, which is a device with one input and two outputs that conveys the low frequencies to one outlet, and the high frequencies to the other outlet. Devices such as this are found not only in loudspeakers, but also in musical instrument models. For instance, the bell of woodwinds transmits to the air the high frequencies and reflects the low frequencies back to the bore.

Figure 29: Phase responses of two allpass filters that, if connected in parallel, give a lowpass filter

Figure 30: Crossover implemented as a parallel of allpass filters and a lattice junction

The idea of connecting two allpass filters in parallel can be applied to the realization of resonant complementary filters. In particular, it is interesting to be able to tune the bandwidth and the center frequency independently. To construct such a filter, one of the two allpass filters is replaced by the identity (i.e., a short circuit) while the other one is a second-order allpass filter (see fig. 31). Recall that, close to the frequency $\omega_0$ that corresponds to the pole of the filter, the phase response takes values that are very close to −π (see fig. 23). Therefore, the frequency $\omega_0$ corresponds to a minimum in the overall frequency response. In other words, it is the notch frequency. The closer the pole is to the unit circumference, the narrower the notch. The lattice implementation of this allpass filter allows the notch position and width to be tuned independently, since the two reflection coefficients have the form [76]

$$k_1 = -\cos\omega_0 , \qquad k_2 = \frac{1 - \tan(B/2)}{1 + \tan(B/2)} , \tag{52}$$

where $B$ is the bandwidth for 3 dB of attenuation.
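The behavior of the notch can be checked numerically from (52). The Python sketch below is an illustration under stated assumptions: the function name is hypothetical, the allpass denominator coefficients are obtained by inverting (51), and the output is taken as half the sum of the input and the allpass output.

```python
import cmath
import math

def notch_response(w, w0, bandwidth):
    """Complex response at w of (1 + Ha(z)) / 2, with Ha tuned by (52)."""
    k1 = -math.cos(w0)
    t = math.tan(bandwidth / 2.0)
    k2 = (1.0 - t) / (1.0 + t)
    a1 = k1 * (1.0 + k2)          # inversion of (51)
    a2 = k2
    z1 = cmath.exp(-1j * w)
    allpass = (a2 + a1 * z1 + z1 * z1) / (1.0 + a1 * z1 + a2 * z1 * z1)
    return 0.5 * (1.0 + allpass)

W0 = 1.0    # notch frequency [rad/sample], an arbitrary choice
B = 0.1     # 3 dB bandwidth [rad/sample], an arbitrary choice

print("at the notch:", abs(notch_response(W0, W0, B)))
print("at dc:       ", abs(notch_response(0.0, W0, B)))
print("at Nyquist:  ", abs(notch_response(math.pi, W0, B)))
```

Changing W0 only affects $k_1$ and changing B only affects $k_2$, which is the decoupling that the lattice form makes explicit.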
Figure 31: Notch filter implemented by means of a second-order allpass filter

A structure that converts a notch into a boost with a continuous control is obtained by a weighted combination of the complementary outputs, and it is shown in fig. 32. For values of $k$ such that $0 < k < 1$ the filter is a notch, while for $k > 1$ the filter is a boost.

Figure 32: Notch/boost filter implemented by means of a second-order allpass filter and a lattice section

Filters such as those of figures 31 and 32, whose properties can be controlled by a few parameters decoupled from each other, are called parametric filters. For thorough surveys on structures for parametric filtering, with analyses of numerical properties in fixed-point implementations, we refer the reader to a book by Zölzer [109] and an article by Dattorro [29].

2.4 Frequency warping

Section 1.5.2 has shown how the bilinear transformation distorts the frequency axis while maintaining the "shape" of the frequency response. Such a transformation is a so-called conformal transformation [62] of the complex plane onto itself. In this section we are interested in conformal transformations that map the unit circumference (instead of the imaginary axis) onto itself, in such a way that, if applied to a discrete-time filter, they give a new discrete-time filter having the same stability properties. Indeed, the simplest non-trivial transformation of this kind is a bilinear transformation

$$z^{-1} = \frac{a + \theta^{-1}}{1 + a\,\theta^{-1}} . \tag{53}$$

The transformation (53) is allpass and, therefore, it maps the unit circumference onto itself. Moreover, if the transformation (53) is applied to a discrete-time filter described by a transfer function in $z$, it preserves the filter order in the variable $\theta$.

The reason for using conformal maps in digital filter design is that it might be easier to design a filter using a warped frequency axis. For instance, to design a presence filter it is convenient to start from a second-order resonant filter prototype having center frequency at π/2 and tunable bandwidth and boost. Then, it is possible to compute the coefficient of the conformal transformation (53) in such a way that the resonant peak gets moved to the desired position [62].
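The frequency mapping induced by (53) can be explored with a few lines of Python. This is a sketch under assumptions: the function names are hypothetical, and the closed form used for the coefficient $a$ is derived here from (53) by requiring that the prototype frequency π/2 appears at the desired frequency of the warped filter; it is not quoted from the cited references.

```python
import cmath
import math

def z_inverse(omega_theta, a):
    """Value of z^-1 given by (53) when theta lies on the unit circle."""
    t = cmath.exp(-1j * omega_theta)
    return (a + t) / (1.0 + a * t)

def warp(omega_theta, a):
    """Prototype (z-domain) frequency seen at the warped frequency omega_theta."""
    return omega_theta - 2.0 * math.atan2(a * math.sin(omega_theta),
                                          1.0 + a * math.cos(omega_theta))

def coefficient_for_peak(omega_target):
    """Coefficient a that brings the prototype peak at pi/2 to omega_target."""
    return math.tan(omega_target / 2.0 - math.pi / 4.0)

target = 0.3                      # desired peak position [rad/sample]
a = coefficient_for_peak(target)
print("a =", a)
print("prototype frequency seen at the target:", warp(target, a))
```

The endpoints 0 and π are left in place by the map, so only the position of the intermediate point is controlled by $a$.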
Conformal transformations of order higher than the first are often used to design multiband filters starting from the design of a lowpass filter, or to satisfy demanding specifications on the slope of the transition band that connects the pass band to the attenuated band.

When designing digital filters to be used in models of acoustic systems, the transformation (53) can be useful, especially if it is specialized in order to optimize some psychoacoustic-based quality measure. Namely, the warping of the frequency axis can be tuned in such a way that it resembles the frequency distribution of critical bands in the basilar membrane of the ear [99]. Similarly to what we saw in section 1.5.2 for the bilinear transformation, it can be shown that a first-order conformal map is determined by setting the correspondence in three points, two of them being $\omega = 0$ and $\omega = \pi$. The mapping of the third point is determined by the coefficient $a$ to be used in (53). Surprisingly enough, a simple first-order transformation is capable of following the distribution of critical bands quite accurately. Smith and Abel [99], using a technique that minimizes the squared equation error, have estimated the value that has to be assigned to $a$ for sampling frequencies ranging from 1 Hz to 50 kHz, in order to have an ear-based frequency distribution. An approximate expression to calculate such a coefficient is

$$a(F_s) \approx 1.0211 \left[ \frac{2}{\pi} \arctan\left( 76 \cdot 10^{-6} F_s \right) \right]^{1/2} - 0.19877 . \tag{54}$$

As an exercise, the reader can set a value of the sampling rate $F_s$, and compute the value of $a$ by means of (54). Then the curve that maps the frequencies in the $\theta$ plane to the frequencies in the $z$ plane can be drawn and compared to the curve obtained by uniform distribution of the center frequencies of the Bark scale¹¹ [99, 111] that are below the Nyquist rate.

A psychoacoustics-driven frequency warping is also useful to design digital filters in such a way that the approximation error gets distributed on the frequency axis in a way that is most tolerable by our ears. The procedure consists in transforming the desired frequency response according to (53), and designing a digital filter that approximates it using some filter design method [65]. Then the inverse conformal mapping (unwarping) is applied to the resulting digital filter. Some filter design techniques, beyond giving a better approximation in a psychoacoustic sense, take advantage of the expansion of low frequencies induced by the warping map, because low-frequency sharp transitions get smoother and the design algorithms become less sensitive to numerical errors.

¹¹ The Bark scale is based on measurements on critical bands, published by Zwicker in 1961. The center frequencies (in Hz) of the rectangular filters, equivalent in power to the critical bands, are: 50, 150, 250, 350, 450, 570, 700, 840, 1000, 1170, 1370, 1600, 1850, 2150, 2500, 2900, 3400, 4000, 4800, 5800, 7000, 8500, 10500, 13500, 20500, 27000.

Chapter 3
Delays and Effects
Most acoustic systems have some component where waves can propagate, such as a membrane, a string, or the air in an enclosure. If propagation in these media is ideal, i.e., free of losses, dispersion, and nonlinearities, it can be simulated by delay lines. A delay line is a linear time-invariant, single-input single-output system, whose output signal is a copy of the input signal delayed by τ seconds. In continuous time, the frequency response of such a system is

$$H_{Ds}(j\Omega) = e^{-j\Omega\tau} . \tag{1}$$

Equation (1) tells us that the magnitude response is unitary, and that the phase is linear in frequency with slope −τ.

3.1 The Circular Buffer
A discrete-time realization of the system (1) is given by a system that implements the transfer function

$$H_D(z) = z^{-\tau F_s} = z^{-m} , \tag{2}$$

where $m$ is the number of samples of delay. When the delay τ is an integral multiple of the sampling quantum, $m$ is an integer number and it is straightforward to implement the system (2) by means of a memory buffer. In fact, an m-sample delay line can be implemented by means of a circular buffer, that is, a set of $M$ contiguous memory cells accessed by a write pointer IN and a read pointer OUT, such that

$$\mathrm{IN} = (\mathrm{OUT} + m) \,\%\, M , \tag{3}$$

where the symbol % denotes the remainder of the division by $M$, i.e., the sum is taken modulo $M$. At each sampling instant, the input is written in the location pointed to by IN, the output is taken from the location pointed to by OUT, and the two pointers are updated with

$$\mathrm{IN} = (\mathrm{IN} + 1) \,\%\, M , \qquad \mathrm{OUT} = (\mathrm{OUT} + 1) \,\%\, M . \tag{4}$$

In words, the pointers are incremented respecting the circularity of the buffer.

In some architectures dedicated to sound processing, memory organization is optimized for wavetable synthesis, where a stored waveform is read with variable increments of the reading pointer. In these architectures, a quantity of $2^r$ memory locations is available, and $M = 2^s$ locations (with $s < r$) are uniformly chosen among the $2^r$ available cells. In this case the locations of the circular buffer are not contiguous, and the update of the pointers is done with the operations

$$\mathrm{IN} = (\mathrm{IN} + 2^{r-s}) \,\%\, 2^r , \qquad \mathrm{OUT} = (\mathrm{OUT} + 2^{r-s}) \,\%\, 2^r . \tag{5}$$

In practice, since the addresses are $r$ bits long, there is no need to compute the modulo explicitly. It is sufficient to do the sum neglecting any possible overflow. Of course, (3) is also replaced by

$$\mathrm{IN} = (\mathrm{OUT} + m\,2^{r-s}) \,\%\, 2^r . \tag{6}$$

3.2 Fractional-Length Delay Lines

It might be thought that, choosing a sufficiently high sampling rate, it is always possible to use delay lines having an integer number of samples.
Actually, there are some good reasons that lead us to state that this is not the case in sound synthesis and processing. In sound synthesis, the models have to be carefully tuned without resorting to very high sample rates. In particular, it is easy to verify that using integer-length delays in physical models we get errors in fundamental frequencies that go well beyond the just noticeable difference in pitch¹ (see the appendix C). For instance, for a pressure wave propagating in air at normal temperature conditions, the spatial discretization given by the sampling rate $F_s = 44100$ Hz gives intervals of 0.0075 m, a distance that can produce well-perceivable pitch differences in a wind instrument.

Another reason for using fractional delays is that we often want to vary the delay lengths continuously, in order to reproduce effects such as glissando or vibrato. The adoption of integer-length delays would produce annoying discontinuities.

The most widely used techniques for implementing fractional delays are interpolation by FIR filters or by allpass filters. These two techniques are, in some sense, complementary. The choice between the two has to be made according to the peculiarities of the system to be simulated or of the architecture chosen for the implementation. In any case, a delay of length $m$ is obtained by means of a delay line whose length is equal to the integer part of $m$, cascaded with a block capable of approximating a constant phase delay equal to the fractional part of $m$. We recall that the phase delay at a given frequency ω is the delay in time samples experienced by the sinusoidal component at frequency ω. For instance, consider a linear filtering block enclosed in a feedback loop (see sec. 3.4): the frequency $f_k$ of the k-th resonance of the whole feedback system is found at the points where the phase response equals a multiple of 2π.
At these frequencies, the components reappear in phase at every round trip in the loop, thus reinforcing their amplitude at the output. The phase delay at frequency $f_k$ is therefore the effective delay length at that frequency, that is, the length of an ideal (linear phase) delay line that gives the same k-th resonance. Fig. 1 shows a phase curve and its crossings with multiples of 2π, giving a distribution of resonances.

3.2.1 FIR Interpolation Filters

The easiest and most intuitive way to obtain a variable-length delay is to linearly interpolate the output of the line with the content of its preceding cell in the memory buffer. This corresponds to using the first-order FIR filter

$$H_l(z) = c_0 + c_1 z^{-1} . \tag{7}$$

¹ To figure this out, the reader can consider an m-sample delay line in a feedback loop. It gives a harmonic series of partials whose fundamental is $f_0 = F_s/m$ (see sec. 3.4). The set of integer delay lengths that give the best approximation to a tempered scale can be found and the curve of fundamental frequency errors can be drawn.
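A circular buffer read through the first-order FIR filter (7) can be sketched as follows. This Python class is only an illustration: its name and interface are hypothetical, the coefficients are taken as $c_1$ equal to the fractional part of the delay and $c_0 = 1 - c_1$ (the low-frequency choice), and the buffer is given two extra cells so that the additional tap never collides with the write pointer.

```python
class FractionalDelayLine:
    """Circular-buffer delay line with a linearly interpolated read-out."""

    def __init__(self, max_delay):
        self.size = max_delay + 2           # M cells, with room for the extra tap
        self.buffer = [0.0] * self.size
        self.write_index = 0                # pointer IN

    def tick(self, x, delay):
        """Store x and return the input delayed by `delay` samples.

        Assumes 0 <= delay <= max_delay.
        """
        m = int(delay)                      # integer part of the delay
        c1 = delay - m                      # fractional part
        c0 = 1.0 - c1
        self.buffer[self.write_index] = x
        out_index = (self.write_index - m) % self.size    # pointer OUT
        older_index = (out_index - 1) % self.size         # preceding cell
        y = c0 * self.buffer[out_index] + c1 * self.buffer[older_index]
        self.write_index = (self.write_index + 1) % self.size
        return y

line = FractionalDelayLine(max_delay=8)
ramp_out = [line.tick(float(n), 2.5) for n in range(10)]

line_int = FractionalDelayLine(max_delay=8)
impulse_out = [line_int.tick(1.0 if n == 0 else 0.0, 3.0) for n in range(6)]
print(ramp_out)
print(impulse_out)
```

A ramp is reproduced exactly by linear interpolation, so after the initial transient the output is the input ramp shifted by 2.5 samples. Since the delay is passed at every call, it can be varied continuously from sample to sample.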
Figure 1: Graphical construction to find the series of resonances produced by a linear block in a feedback loop. The slope of the dashed lines indicates the phase delay at each resonance frequency.

Given a certain phase delay

$$\tau_{ph0} = -\frac{1}{\omega_0} \arctan\left( \frac{-c_1 \sin\omega_0}{c_0 + c_1 \cos\omega_0} \right) \tag{8}$$

that has to be obtained at a given frequency $\omega_0$, the following formulas give the coefficient values:

$$c_0 + c_1 = 1 , \qquad c_1 = \frac{1}{1 + \dfrac{\sin\omega_0}{\tan(\tau_{ph0}\,\omega_0)} - \cos\omega_0} \approx \tau_{ph0} , \tag{9}$$

where the approximation is valid in the low-frequency range. The first of the (9) is needed in order to normalize the low-frequency response to one. In the special case that $c_0 = c_1 = 1/2$ (averaging filter) the phase is linear and the delay is half a sample. Unfortunately, the magnitude response of this interpolator is lowpass with a zero at the Nyquist frequency.

Fig. 2 shows the magnitude, phase, and phase delay responses for several first-order linear interpolators. We can see that the phase is linear in most of the audio range, but the magnitude varies from the allpass to the lowpass with a zero at the Nyquist rate. When the interpolator is inserted within a feedback loop, its lowpass behavior can be treated as an additional frequency-dependent loss, which should be taken into account.

Figure 2: Magnitude, phase, and phase delay responses of a linear interpolation filter $(1 - \alpha) + \alpha z^{-1}$ for $\alpha = k/16$, $k = 0, \ldots, 16$

Interpolation filters can be of order higher than the first. We can do quadratic, cubic, or other polynomial interpolations. In general, the problem of designing an interpolator can be turned into the design of an L-th order FIR filter approximating a constant magnitude and linear phase frequency response. Several criteria can be adopted to drive the approximation problem. One approach is to impose that the first $L$ derivatives of the error function be zero at zero frequency. In this way we obtain maximally-flat filters whose coefficients are the same used in Lagrange interpolation as it is taught in numerical analysis courses. For a thorough treatment of interpolation filters we suggest reading the article [51]. Here we only point out that using high orders makes it possible to keep the magnitude response close to unity and the phase response close to linear in a wide frequency band. Of course, this is paid for in terms of computational complexity.

In special architectures, where the access to delay lines is governed by (5) and (6), the linear interpolation is implemented very efficiently by using the $r - s$ bits that are not used to access the $2^s$-sample delay line. In fact, if the address is computed using $r$ bits, the $r - s$ least significant bits represent the fractional part of the delay or, equivalently, the coefficient $c_1$ of the interpolator. Therefore, it is sufficient to access two consecutive delay cells and keep the values $c_0$ and $c_1 = 1 - c_0$ in two registers. The implementation of a glissando with these architectures is immediate and free from complications.

3.2.2 Allpass Interpolation Filters

Another widely used technique to obtain the fractional part of a desired delay length makes use of unit-magnitude IIR filters, i.e., allpass filters. Since the magnitude of these filters is constant there is no frequency-dependent attenuation, a property that FIR interpolators cannot ensure. The simplest allpass filter has order one, and it has the following transfer function:

$$H_a(z) = \frac{c + z^{-1}}{1 + c z^{-1}} . \tag{10}$$

In order to make sure that the filter is stable, the coefficient $c$ has to stay within the unit circle. Moreover, if we stick with real coefficients, $c$ belongs to the real axis. The phase delay given by the filter (10) is shown in fig. 3 for several values of the coefficient $c$. It is clear that the phase delay is not as flat as in the case of the FIR interpolator, depicted in fig. 2.

Figure 3: Phase response and phase delay of a first-order allpass filter for the values of the coefficient $c = 1.998k/17 - 0.999$, $k = 0, \ldots, 16$

It is easy to verify² that, at frequencies close to dc, the phase response of (10) takes the approximate form

$$\angle H(\omega) \approx -\frac{\sin\omega}{c + \cos\omega} + \frac{c \sin\omega}{1 + c \cos\omega} \approx -\omega\,\frac{1 - c}{1 + c} , \tag{11}$$

where the first approximation is obtained by replacing each arctan with its argument and the second approximation, valid in an even smaller neighborhood, is obtained by approximating $\sin x$ with $x$ and $\cos x$ with 1. The phase and group delay around dc are

$$\tau_{ph}(\omega) \approx \tau_{gr}(\omega) \approx \frac{1 - c}{1 + c} . \tag{12}$$

Therefore, the filter coefficient $c$ can be easily determined from the desired low-frequency delay as

$$c = \frac{1 - \tau_{ph}(0)}{1 + \tau_{ph}(0)} . \tag{13}$$

² The proof of (11) is left to the reader as a useful exercise.

Fig. 3 shows that the delay of the allpass filter is approximately constant only in a narrow frequency range. We can reasonably assume that such a range, for positive values of $c$ smaller than one, extends from 0 to $F_s/5$. With $F_s = 50$ kHz we see that at $F_s/5 = 10$ kHz we have an error of about 0.05 time samples. In a note at that frequency produced by a feedback delay line, such an error produces a pitch deviation smaller than 1%. For lower fundamental frequencies, such as those found in actual musical instruments, the error is smaller than the just noticeable difference measured with slow pitch modulations (see the appendix C).

While the first-order filter represents an elegant and efficient solution to the problem of tuning a delay line, it also has the relevant side effect of detuning the upper partials, due to the marked phase nonlinearity. Such detuning can be tolerated in most cases, but has to be taken into account in some other contexts. If a phase response closer to linear is needed, we can use higher-order allpass filters [51]. In some cases, especially in sound synthesis by physical modeling, a specific inharmonic distribution of resonances has to be approximated. This can be obtained by designing allpass filters that approximate a given phase response along the whole frequency axis.
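Going back to the first-order design rule (13), its accuracy can be checked numerically. The following Python sketch is only illustrative and uses hypothetical function names; it computes the coefficient for a desired low-frequency delay of half a sample and evaluates the phase delay of (10) near dc and at $F_s/5$.

```python
import cmath
import math

def allpass_coefficient(delay):
    """Coefficient c of (10) for a desired low-frequency delay, from (13)."""
    return (1.0 - delay) / (1.0 + delay)

def phase_delay(c, w):
    """Phase delay in samples of Ha(z) = (c + z^-1) / (1 + c z^-1), for 0 < w < pi."""
    z1 = cmath.exp(-1j * w)
    h = (c + z1) / (1.0 + c * z1)
    return -cmath.phase(h) / w

c = allpass_coefficient(0.5)                       # c = 1/3
near_dc = phase_delay(c, 0.01)
at_fs_over_5 = phase_delay(c, 2.0 * math.pi / 5.0)
print("c =", c)
print("phase delay near dc:", near_dc)
print("phase delay at Fs/5:", at_fs_over_5)
```

For this particular choice of delay the deviation at $F_s/5$ is a few hundredths of a sample, in line with the order of magnitude quoted in the text.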
In these cases the problem of tuning is superseded by the more difficult problem of accurate partial positioning [83].

With allpass interpolators it is more complicated to handle continuous delay length variations, since the recursive structure of the filter does not show an obvious way of transferring memory cells from and to the delay line, as was the case with the FIR interpolator, which is constructed on the delay line by a certain number of taps. Indeed, the glissando can be implemented with the allpass filter by adding a new cell to the delay line whenever the filter coefficient becomes one and, at the same time, zeroing out the filter state variable and the coefficient. What is really more complicated with allpass filters is handling sudden variations of the delay length, as found, for instance, when a finger hole is opened in a wind instrument. In this case, the recursive nature of allpass filters causes annoying transients in the output signal. Ad hoc structures have been devised to cancel these transients [51].

3.3 The Non-Recursive Comb Filter

Sounds propagating in the air come into contact with surfaces and objects of various kinds, and this interaction produces physical phenomena such as reflection, refraction, and diffraction. A simple and very important phenomenon is the reflection of sound from a planar surface. Due to such a reflection, a listener receives two delayed copies of the same signal. If the delay between them is larger than about a hundred milliseconds, the second copy is perceived as a distinct echo, while if the delay is smaller than about ten milliseconds, the effect of a single reflection is perceived as a spectral coloration.

A simple model of single reflection can be constructed starting from the basic blocks described in this and in the preceding chapters.
It is constructed as an $m$-sample delay line, with the incidental fractional part of $m$ obtained by FIR interpolation or allpass filtering, cascaded with an attenuation coefficient $g$, possibly replaced by a filter if a frequency-dependent absorption has to be simulated. The output of this lossy delay line is summed to the direct signal. Let us analyze the structure in the case that $m$ is integer and $g$ is a positive constant not exceeding 1. The difference equation is expressed as
$$ y(n) = x(n) + g \cdot x(n - m) , \tag{14} $$
and, therefore, the transfer function is
$$ H(z) = 1 + g z^{-m} . \tag{15} $$
In the case that $g = 1$, it is easy to see by using the De Moivre formula (see section A.6) that the frequency response of the comb filter has the following magnitude and group delay:
$$ |H(\omega)| = \sqrt{2 (1 + \cos(\omega m))} , \qquad \tau_{gr,H}(\omega) = \frac{m}{2} , \tag{16} $$
and it is straightforward to verify that the frequency band ranging from dc to the Nyquist rate (i.e., from 0 to $F_s$) comprises $m$ zeros (antiresonances), equally spaced by $F_s/m$ Hz. The phase response (see footnote 3) is piecewise linear with discontinuities of $\pi$ at the odd multiples of $F_s/(2m)$. If $g < 1$, it is easy to see that the amplitude of the resonances is
$$ P = 1 + g , \tag{17} $$
while the amplitude of the points of minimum (halfway between contiguous resonances) is
$$ V = 1 - g . \tag{18} $$
An important parameter of this filtering structure, called non-recursive comb filter (or FIR comb), is the peak-to-valley ratio
$$ \frac{P}{V} = \frac{1 + g}{1 - g} . \tag{19} $$
Fig. 4 shows the response of a non-recursive comb filter having length $m = 11$ samples and a reflection attenuation $g = 0.9$. The shape of the frequency response justifies the name comb given to the filter.
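The peak and valley amplitudes (17) and (18) can be checked numerically. The following is a minimal Python sketch, not part of the original text; the function name is illustrative, and the values g = 0.9 and m = 11 are those quoted for fig. 4.

```python
import cmath
import math


def fir_comb_response(g, m, omega):
    """Frequency response of H(z) = 1 + g z^-m at omega [rad/sample]."""
    return 1 + g * cmath.exp(-1j * omega * m)


g, m = 0.9, 11  # values quoted for fig. 4

# Resonances fall at even multiples of pi/m, valleys at odd multiples.
peak = abs(fir_comb_response(g, m, 2 * math.pi / m))
valley = abs(fir_comb_response(g, m, math.pi / m))

print(f"P = {peak:.3f}, V = {valley:.3f}, P/V = {peak / valley:.1f}")
```

The two magnitudes agree with 1 + g and 1 − g, so the peak-to-valley ratio grows quickly as g approaches 1.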
Figure 4: Magnitude of the frequency response of the comb FIR filter having coefficient $g = 0.9$ and delay length $m = 11$. (Axes: magnitude versus frequency [rad/sample].)

The zeros of the comb filter are evenly distributed in angle at the $m$th roots of $-g$, i.e., along a circle of radius $g^{1/m}$ that coincides with the unit circle when $g = 1$, as shown in figure 5.
3 The reader is invited to calculate and plot the phase response.
Figure 5: Zeros and poles of an FIR comb filter. (Axes: Re and Im.)

3.4 The Recursive Comb Filter

A simple model of a one-dimensional resonator can be constructed using the basic blocks presented in this and in the preceding chapters. It is composed of an $m$-sample delay line, with the incidental fractional part of $m$ obtained by FIR interpolation or allpass filtering, in a feedback loop with an attenuation coefficient $g$, possibly replaced by a filter in order to give different decay times at different frequencies. Let us analyze the whole filtering structure in the case that $m$ is integer and $g$ is a positive constant not exceeding 1. The difference equation is expressed as
$$ y(n) = x(n - m) + g \cdot y(n - m) , \tag{20} $$
and the transfer function is
$$ H(z) = \frac{z^{-m}}{1 - g z^{-m}} . \tag{21} $$
Whenever $g < 1$, stability is ensured. In the case that $g = 1$, the frequency response of the filter has the following magnitude and group delay:
$$ |H(\omega)| = \frac{1}{2 \, |\sin(\omega m / 2)|} , \qquad \tau_{gr,H}(\omega) = \frac{m}{2} , \tag{22} $$
and it is easy to verify that the frequency band ranging from dc to the Nyquist rate (i.e., from 0 to $F_s$) comprises $m$ vertical asymptotes (resonances), equally spaced by $F_s/m$ Hz. If $g = 1$ the filter is at the limit of stability, and this is the only case when the phase response is piecewise linear (see footnote 4), starting with the value $-\pi/2$ at dc, with discontinuities of $\pi$ at the even multiples of $F_s/(2m)$. If $g < 1$, it is easy to verify that the amplitude of the resonances is
$$ P = \frac{1}{1 - g} , \tag{23} $$
while the amplitude of the points of minimum (halfway between contiguous resonances) is
$$ V = \frac{1}{1 + g} . \tag{24} $$
An important parameter of this filtering structure, called recursive comb filter (or IIR comb), is the peak-to-valley ratio
$$ \frac{P}{V} = \frac{1 + g}{1 - g} . \tag{25} $$
Fig. 6 shows the frequency response of a recursive comb filter having a delay line of $m = 11$ samples and feedback attenuation $g = 0.9$. The shape of the magnitude response justifies the name comb given to the filter.
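As a cross-check of (20) to (24), the sketch below (Python, not from the original text; function names are illustrative) evaluates the transfer function (21) at a resonance and at a valley, and runs the difference equation (20) on a unit impulse to show the train of echoes decaying by a factor g every m samples.

```python
import cmath
import math


def iir_comb_response(g, m, omega):
    """Frequency response of H(z) = z^-m / (1 - g z^-m) at omega [rad/sample]."""
    zm = cmath.exp(-1j * omega * m)
    return zm / (1 - g * zm)


def iir_comb_filter(x, g, m):
    """Run y(n) = x(n - m) + g * y(n - m) over the list x, from zero initial state."""
    y = []
    for n in range(len(x)):
        x_del = x[n - m] if n >= m else 0.0
        y_del = y[n - m] if n >= m else 0.0
        y.append(x_del + g * y_del)
    return y


g, m = 0.9, 11  # values quoted for fig. 6
peak = abs(iir_comb_response(g, m, 0.0))            # resonance at dc
valley = abs(iir_comb_response(g, m, math.pi / m))  # halfway to the next resonance

impulse = [1.0] + [0.0] * (3 * m)
h = iir_comb_filter(impulse, g, m)
print(peak, valley, h[m], h[2 * m], h[3 * m])
```

The impulse response is nonzero only at multiples of m, which is why a short noise burst at the input yields a pitched, decaying tone.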
Figure 6: Magnitude and phase delay response of the recursive comb filter having coefficient $g = 0.9$ and delay length $m = 11$. (Axes: magnitude and phase delay [samples] versus frequency [rad/sample].)

The poles of the comb filter are evenly distributed in angle at the $m$th roots of $g$, i.e., along a circle of radius $g^{1/m}$ that coincides with the unit circle when $g = 1$, as shown in figure 7.
4 The reader is invited to calculate and plot the phase response.
Figure 7: Zeros and poles of an IIR comb filter. (Axes: Re and Im.)

In sound synthesis by physical modeling, a recursive comb filter can be interpreted as a simple model of a lossy one-dimensional resonator, like a string or a tube. This model can be used to simulate several instruments whose resonator is not persistently excited. In fact, if the input is a short burst of filtered noise, we obtain the basic structure of the plucked string synthesis algorithm due to Karplus and Strong [47].

3.4.1 The Comb-Allpass Filter

The filter given by the difference equation (20) has a frequency response characterized by evenly distributed resonances. With a slight modification of its structure, such a filter can be made allpass. In other words, the magnitude response of the filter can be made flat even though the impulse response remains almost the same as that of (20). The modification is just a direct path connecting the input of the delay line to the filter output, as depicted in fig. 8.

Figure 8: Allpass comb filter. (Block diagram with input x, output y, an m-sample delay line, and two gain branches labeled g.)

It is easy to see that the transfer function of the filter of fig. 8, called the allpass comb filter, can be written as
$$ H(z) = \frac{-g + z^{-m}}{1 - g z^{-m}} , \tag{26} $$
which has the structure of an allpass filter. It is interesting to note that the direct path introduces a nonzero sample at time instant zero in the impulse response. All the following samples are just a scaled version of those of the impulse response of the comb filter, with a scaling factor equal to $1 - g^2$. The time properties, such as the decay time, are substantially unchanged. The allpass comb filter does not introduce any coloration in stationary signals. On the other hand, its effect is evident on signals exhibiting rapid transients, and for these signals we cannot state that the filter is transparent.

3.5 Sound Effects Based on Delay Lines

Many of the effects commonly used in electroacoustic music are obtained by composition of time-varying delay lines, i.e., lines whose length is modulated by slowly varying signals. In order to avoid discontinuities in the signals, it is necessary to interpolate the delay lines in some way. Interpolation by means of allpass filters is applicable only for very slow modulations or for narrow-width modulations, since sudden changes in the state of allpass filters give rise to transients that can be perceived as signal distortions [30]. On the other hand, linear (or, more generally, polynomial) interpolation introduces frequency-dependent losses whose magnitude depends on the fractional length of the delay line. As the delay length is varied, these variable losses give an amplitude distortion due to amplitude modulation of the various frequency components. Coupled to amplitude modulation, there is also phase modulation due to the phase nonlinearity of the interpolator, in both cases of FIR and IIR interpolation.
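The dependence of the interpolation loss on the fractional delay can be made concrete with a small Python sketch. It is not from the original text, and it assumes the usual two-tap linear interpolator $(1 - f) + f z^{-1}$ for a fractional delay $f$; the probe frequency and sample rate are arbitrary choices.

```python
import cmath
import math


def linear_interp_gain(frac, omega):
    """Magnitude of the two-tap linear interpolator (1 - frac) + frac * z^-1."""
    return abs((1 - frac) + frac * cmath.exp(-1j * omega))


# Assumed probe: a 10 kHz component at a 44.1 kHz sample rate.
omega = 2 * math.pi * 10000 / 44100

for frac in (0.0, 0.25, 0.5, 0.75, 1.0):
    gain_db = 20 * math.log10(linear_interp_gain(frac, omega))
    print(f"fractional delay {frac:.2f}: {gain_db:6.2f} dB")
```

The loss vanishes for integer delays and is largest for a fractional part of one half, where the gain equals cos(ω/2). Sweeping the delay therefore modulates the amplitude of high-frequency components.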
The terminology used for audio effects is not consistent, as terms such as flanger, chorus, and phaser are often associated with a large variety of effects that can be quite different from each other. A flanger is usually defined as an FIR comb filter whose delay length is sinusoidally modulated between a minimum and a maximum value. This has the effect of expanding and contracting the harmonic series of notches of the frequency response. The name flanger derives from the old practice, used in analog recording studios, of alternately slowing down two tape recorders or two turntables playing the same music track by pressing a finger on the flanges.

The name phaser is most often reserved for structures similar to the comb FIR filter, with the difference that the notches are not harmonically distributed. Orfanidis [67] proposes to use, instead of the delay line, a set of parametric notch filters such as those presented in sec. 2.2.4. Each notch is controllable in its frequency position and width. Smith [96], instead, proposes to use a large allpass filter in place of the delay line. If this allpass filter is obtained as a cascade of second-order allpass sections, it becomes possible to control and modulate the position of each pole pair, and the pole pairs account for the individual notches of the overall response.

A common feature of flangers and phasers is the relatively large distance between the notches. Conversely, if the notches are very dense, the term chorus is preferred. Orfanidis [67] suggests implementing a chorus as a parallel of FIR comb filters, where the delay lengths are randomly modulated around values that are slightly different from each other. This should simulate the deviations in time and pitch that are found in performances of a choir singing in unison.
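As an illustration of the flanger definition given above (an FIR comb whose delay is sinusoidally modulated), here is a minimal Python sketch. It is not taken from the cited references: the function name and all default parameter values are assumptions, and the fractional part of the delay is handled by linear interpolation for simplicity.

```python
import math


def flanger(x, fs, min_delay_ms=0.5, max_delay_ms=1.5, rate_hz=0.5, g=0.9):
    """FIR comb y(n) = x(n) + g * x(n - d(n)) with sinusoidally modulated d(n).

    d(n) sweeps between min_delay_ms and max_delay_ms at rate_hz.
    All default values are illustrative assumptions.
    """
    mid = 0.5 * (max_delay_ms + min_delay_ms) * fs / 1000.0    # samples
    depth = 0.5 * (max_delay_ms - min_delay_ms) * fs / 1000.0  # samples
    y = []
    for n, xn in enumerate(x):
        d = mid + depth * math.sin(2 * math.pi * rate_hz * n / fs)
        i = int(math.floor(d))
        frac = d - i
        # Samples before the start of the signal are taken as zero.
        x0 = x[n - i] if n - i >= 0 else 0.0
        x1 = x[n - i - 1] if n - i - 1 >= 0 else 0.0
        y.append(xn + g * ((1 - frac) * x0 + frac * x1))
    return y


fs = 8000
sine = [math.sin(2 * math.pi * 440 * n / fs) for n in range(fs)]
flanged = flanger(sine, fs)
```

Setting the minimum and maximum delays to the same value reduces the structure to the static comb filter of equation (14). Summing several such combs with slightly different, randomly modulated delays would give a chorus in the sense attributed to Orfanidis.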
By contrast, Dattorro [30] says that a chorus can be obtained by the same structure used for the flanger, with the difference that the delay lengths have to be set to larger values than for the flanger. In this way, the notches are made more dense. For the flanger the suggested nominal delay is 1 ms and for the chorus it is 5 ms. If the objective is to recreate the effect of a choir singing in unison, having many notches in the spectrum is generally undesirable. Dattorro [30] proposes a partial solution that makes use of a recursive allpass filter, where the delay line is read by two pointers: one is kept fixed and produces the feedback signal, the other is varied to pick up the signal that is fed directly to the output. In this way, when both pointers are at the nominal position, the structure does not introduce any coloration for stationary signals.

A final remark is reserved for the spatialization of these comb-based effects. In general, flanging, phasing, and chorusing effects can be obtained from two different time-varying allpass chains, whose outputs feed different loudspeakers. In this case, sums and subtractions between signals at the different frequencies happen "on air" in a way that depends on position. Therefore, the spatial sensation is largely due to the different spectral coloration found at different points of the listening area.

Exercise

The reader is invited to write a chorus/flanger based on comb or allpass comb filters using a language for sound processing (e.g., CSound). As an input signal, try a sine wave and a noisy signal. Then, implement a phaser by cascading several first-order allpass filters having coefficients between 0 and 1.

3.6 Spatial sound processing

The spatial processing of sound is a wide topic that would require at least a thick book chapter on its own [82]. Here we only describe very briefly a few techniques for sound spatialization and reverberation.
In particular, the techniques for sound spatialization differ depending on whether the target display is a pair of headphones or a set of loudspeakers.

3.6.1 Spatialization

Spatialization with headphones

Humans can localize sound sources in a 3D space with good accuracy using several cues. If we can rely on the assumption that the listener receives the sound material via a stereo headphone, we can reproduce most of the cues that are due to the filtering effect of the pinna–head–torso system, and inject the signal artificially affected by this filtering process directly into the ears.

Sound spatialization for headphones can be based on interaural intensity and time differences (see appendix C). It is possible to use only one of the two cues, but using both cues will provide a stronger spatial impression. Of course, interaural time and intensity differences are only capable of moving the apparent azimuth of a sound source, without any sense of elevation. Moreover, the apparent source position is likely to be located inside the head of the listener, without any sense of externalization. Special measures have to be taken in order to push the virtual sources out of the head.

A finer localization can be achieved by introducing frequency-dependent interaural differences. In fact, due to diffraction the low-frequency components are barely affected by the interaural intensity difference (IID), and the interaural time difference (ITD) is larger in the low-frequency range. Calculations done with a spherical head model and a binaural model [49, 73] allow us to draw approximate frequency-dependent ITD curves, one being displayed in fig. 9.a for 30° of azimuth. The curve can be further approximated by constant segments, one corresponding to a delay of about 0.38 ms at low frequency, and the other corresponding to a delay of about 0.26 ms at high frequency. The low-frequency limit can in general be obtained for a general incident angle $\theta$ by the formula
$$ \mathrm{ITD} = \frac{1.5 \, \delta \sin\theta}{c} , \tag{27} $$
where $\delta$ is the inter-ear distance in meters and $c$ is the speed of sound. The crossover point between high and low frequency is located around 1 kHz. Similarly, the IID should be made frequency dependent. Namely, the difference is larger for high-frequency components, so that we have IID curves such as that reported in fig. 9.b for 30° of azimuth.

Figure 9: Frequency-dependent interaural time (a) and intensity (b) difference for azimuth 30°. (Panels: (a) Time Difference, with levels marked 0.26 ms and 0.38 ms; (b) Intensity Difference, with levels marked 0 dB and 10 dB; both versus frequency, with a transition at 1 kHz.)

The IID and ITD are shown to change when the source is very close to the head [32]. In particular, sources closer than five times the head radius increase the intensity difference at low frequency. The ITD also increases for very close sources but its changes do not provide significant information about source range.

Several researchers have measured the filtering properties of the pinna–head–torso system by means of manikins or human subjects. A popular collection of measurements was taken by Gardner and Martin using a KEMAR dummy head, and made freely available [36, 38, 2]. Measurements of this kind are usually taken in an anechoic chamber, where a loudspeaker plays a test signal that reaches the head from the desired direction. The directions should be taken in such a way that the angle between two neighboring directions never exceeds the localization blur, which ranges from about ±3° in azimuth for frontal sources, to about ±20° in elevation for sources above and slightly behind the listener [13]. The result of the measurements is a set of Head-Related Impulse Responses (HRIR) that can be directly used as coefficients of a pair of FIR filters. Since the decay time of the HRIR is always less than a few milliseconds, 256 to 512 taps are sufficient at a sampling rate of 44.1 kHz.

A cookbook of HRIRs and direct convolution seems to be a viable solution for providing directionality to sound sources using current technology.
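Direct convolution with a pair of HRIRs amounts to running two FIR filters on the same mono signal. The Python sketch below is only an illustration of that idea: the function names are hypothetical, and the two short "HRIRs" are toy stand-ins (a pure delay and attenuation), not measured data.

```python
def fir_filter(x, h):
    """Direct convolution of the signal x with the FIR coefficients h."""
    y = [0.0] * (len(x) + len(h) - 1)
    for n, xn in enumerate(x):
        for k, hk in enumerate(h):
            y[n + k] += xn * hk
    return y


def binaural_render(mono, hrir_left, hrir_right):
    """Filter one mono signal with a left/right HRIR pair."""
    return fir_filter(mono, hrir_left), fir_filter(mono, hrir_right)


# Toy stand-ins for measured HRIRs (NOT real data): the right ear receives
# the sound earlier and louder, as for a source on the listener's right.
toy_hrir_right = [1.0, 0.0, 0.0, 0.0]
toy_hrir_left = [0.0, 0.0, 0.0, 0.5]

left, right = binaural_render([1.0, 0.5, 0.25], toy_hrir_left, toy_hrir_right)
print(left, right)
```

With measured responses of 256 to 512 taps the structure is the same, only the coefficient lists are longer, and one pair is needed for every direction in the catalog.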
A fundamental limitation comes from the fact that HRIRs vary widely between different subjects, to such an extent that front-back reversals are fairly common when listening through someone else's HRIRs. Using individualized HRIRs dramatically improves the quality of localization. Moreover, since we unconsciously use small head movements to resolve possible directional ambiguities, head-motion tracking is also desirable.

There are some reasons that make a model of the external hearing system more desirable than a raw catalog of HRIRs. First of all, a model might be implemented more efficiently, thus allowing more sources to be spatialized in real time. Second, if the model is well understood, it might be described with a few parameters having a direct relationship with physical or geometric quantities. This latter possibility can save memory and allow easy calibration.

Modeling the structural properties of the pinna–head–torso system gives us the possibility to apply continuous variation to the positions of sound sources and to the morphology of the listener. Many of the physical/geometric properties can be understood by careful analysis of the HRIRs, plotted as surfaces, functions of the variables time and azimuth, or time and elevation. This is the approach taken by Brown and Duda [19], who came up with a model that can be structurally divided into three parts:

• Head Shadow and ITD
• Shoulder Echo
• Pinna Reflections

Starting from the approximation of the head as a rigid sphere that diffracts a plane wave, the shadowing effect can be effectively approximated by a first-order continuous-time system, i.e., a pole-zero couple in the Laplace complex plane:
$$ s_z = \frac{-2\omega_0}{\alpha(\theta)} , \tag{28} $$
$$ s_p = -2\omega_0 , \tag{29} $$
where $\omega_0$ is related to the effective radius $a$ of the head and the speed of sound $c$ by
$$ \omega_0 = \frac{c}{a} . \tag{30} $$
The position of the zero varies with the azimuth $\theta$ (see fig. 10 of appendix C) according to the function
$$ \alpha(\theta) = 1.05 + 0.95 \cos\left( \frac{\theta - \theta_{ear}}{150^\circ} \, 180^\circ \right) , \tag{31} $$
where $\theta_{ear}$ is the angle of the ear that is being considered, typically 100° for the right ear and −100° for the left ear. The pole-zero couple can be directly translated into a stable IIR digital filter by bilinear transformation, and the resulting filter (with proper scaling) is
$$ H_{hs} = \frac{(\omega_0 + \alpha F_s) + (\omega_0 - \alpha F_s) z^{-1}}{(\omega_0 + F_s) + (\omega_0 - F_s) z^{-1}} . \tag{32} $$
The ITD can be obtained by means of a first-order allpass filter [65, 100] whose group delay in seconds is the following function of the azimuth angle $\theta$:
$$ \tau_h(\theta) = \frac{a}{c} + \begin{cases} -\dfrac{a}{c} \cos(\theta - \theta_{ear}) & \text{if } 0 \le |\theta - \theta_{ear}| < \dfrac{\pi}{2} \\[2ex] \dfrac{a}{c} \left( |\theta - \theta_{ear}| - \dfrac{\pi}{2} \right) & \text{if } \dfrac{\pi}{2} \le |\theta - \theta_{ear}| < \pi \end{cases} . \tag{33} $$
Actually, the group delay provided by the allpass filter varies with frequency, but for these purposes such variability can be neglected. Instead, the filter (32) gives an excess delay at DC that is about 50% of that given by (33). This increase of the group delay at DC is exactly what one observes for the real head [49], and it has already been outlined in fig. 9. The overall magnitude and group delay responses of the block responsible for head shadowing and ITD are reported in fig. 10.
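Equations (31) to (33) translate almost directly into code. The following Python sketch is an illustration under stated assumptions, not the author's implementation: the head radius a = 0.0875 m and the speed of sound c = 343 m/s are assumed values, the function names are hypothetical, and only the target delay of (33) is computed, not the allpass filter that realizes it.

```python
import cmath
import math


def alpha(theta_deg, theta_ear_deg=100.0):
    """Zero-position parameter of equation (31); angles in degrees."""
    arg_deg = (theta_deg - theta_ear_deg) * 180.0 / 150.0
    return 1.05 + 0.95 * math.cos(math.radians(arg_deg))


def head_shadow_coeffs(theta_deg, fs=44100.0, a=0.0875, c=343.0, theta_ear_deg=100.0):
    """Coefficients (b0, b1, a0, a1) of the head-shadow filter, equation (32)."""
    w0 = c / a  # equation (30); a and c are assumed values
    al = alpha(theta_deg, theta_ear_deg)
    return (w0 + al * fs, w0 - al * fs, w0 + fs, w0 - fs)


def response(coeffs, omega):
    """Frequency response of (b0 + b1 z^-1) / (a0 + a1 z^-1) at omega [rad/sample]."""
    b0, b1, a0, a1 = coeffs
    z1 = cmath.exp(-1j * omega)
    return (b0 + b1 * z1) / (a0 + a1 * z1)


def head_delay_seconds(theta_deg, a=0.0875, c=343.0, theta_ear_deg=100.0):
    """Delay of equation (33), valid for |theta - theta_ear| < 180 degrees."""
    inc = abs(math.radians(theta_deg - theta_ear_deg))
    if inc < math.pi / 2:
        return a / c - (a / c) * math.cos(inc)
    return a / c + (a / c) * (inc - math.pi / 2)


for theta in (100.0, 160.0, 250.0):
    gain_nyquist = abs(response(head_shadow_coeffs(theta), math.pi))
    print(theta, round(gain_nyquist, 3), head_delay_seconds(theta))
```

The filter has unit gain at dc for every azimuth, while its gain at the Nyquist frequency equals α(θ): a boost by a factor of 2 when the source faces the ear, and an attenuation to 0.1 at 150° away from it.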
Figure 10: Magnitude and Group Delay responses of the block responsible for head shadowing and ITD ($F_s = 44100$ Hz). Azimuth ranging from $\theta_{ear}$ to $\theta_{ear} + 150^\circ$. (Axes: magnitude [dB] and group delay [samples] versus frequency [kHz]; curves labeled 100, 130, 160, 190, 220, 250.)

In a rough approximation, the shoulder and torso effects are synthesized in a single echo. An approximate expression of the time delay can be deduced from the measurements reported in [19, fig. 8]:
$$ \tau_{sh} = 1.2 \, \frac{180^\circ - \theta}{180^\circ} \left( 1 - 0.00004 \left( (\phi - 80^\circ) \frac{180^\circ}{180^\circ + \theta} \right)^2 \right) \ [\mathrm{msec}] , \tag{34} $$
where $\theta$ and $\phi$ are azimuth and elevation, respectively (see fig. 10 of appendix C). The echo should also be attenuated as the source goes from frontal to lateral position.

Finally, the pinna provides multiple reflections that can be obtained by means of a tapped delay line. In the frequency domain, these short echoes translate into notches whose position is elevation dependent and that are frequently considered as the main cue for the perception of elevation [48]. A formula for the time delay of these echoes is given in [19].

The structural model of the pinna–head–torso system is depicted in fig. 11 with all its three functional blocks, repeated twice for the two ears. The only difference between the two halves of the system is in the azimuth parameter, which is $\theta$ for the right ear and $-\theta$ for the left ear.
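For reference, the shoulder-echo delay of equation (34) can be evaluated with a few lines of Python. This is a sketch of the formula as reconstructed here, with a hypothetical function name; angles are in degrees and the result is in milliseconds.

```python
def shoulder_echo_delay_ms(theta_deg, phi_deg):
    """Shoulder-echo delay of equation (34), in milliseconds (angles in degrees)."""
    azimuth_term = (180.0 - theta_deg) / 180.0
    elevation_term = (phi_deg - 80.0) * 180.0 / (180.0 + theta_deg)
    return 1.2 * azimuth_term * (1.0 - 0.00004 * elevation_term ** 2)


for theta, phi in ((0.0, 80.0), (0.0, 0.0), (90.0, 0.0)):
    print(theta, phi, shoulder_echo_delay_ms(theta, phi))
```

The delay is largest (1.2 ms) for a frontal source at 80° of elevation and shrinks as the source moves to the side or away from that elevation.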
Figure 11: Structural model of the pinna–head–torso system. (Block diagram: a monaural input feeds two chains, one for the left and one for the right output channel, each made of head shadow and ITD, shoulder echo, and pinna reflections blocks; the azimuth is $\theta$ in one chain and $-\theta$ in the other, the elevation is $\phi$ in both.)

3D panning

The most popular and easy way to spatialize sounds using loudspeakers is amplitude panning. This approach can be expressed in matrix form for an arbitrary number of loudspeakers located at any azimuth though nearly equidistant from the listener. Such a formulation is called Vector Base Amplitude Panning (VBAP) [72] and is based on a vector representation of positions in a Cartesian plane having its center in the position of the listener. In the two-loudspeaker case (figure 12), the unit-magnitude vector $u$ pointing toward the virtual source can be expressed as a linear combination of the unit-magnitude column vectors $l_L$ and $l_R$ pointing toward the left and right loudspeakers, respectively. In matrix form, this combination can be expressed as
$$ u = L \cdot g = \begin{bmatrix} l_L & l_R \end{bmatrix} \begin{bmatrix} g_L \\ g_R \end{bmatrix} . \tag{35} $$

Figure 12: Stereo panning. (Diagram: left and right loudspeakers, the virtual source vector u, the gain vectors, and the angles θ and θ_l.)

Except for degenerate loudspeaker positions, the linear system of equations (35) can be solved for the vector of gains $g$. This vector does not have, in general, unit magnitude, but can be normalized by appropriate amplitude scaling. The solution of system (35) implies the inversion of matrix $L$, but this can be done beforehand for a given loudspeaker configuration.

The generalization to more than two loudspeakers in a plane is obtained by considering, at any virtual source position, only one pair of loudspeakers, thus choosing the best vector base for that position. The generalization to three dimensions is obtained by considering vector bases formed by three independent vectors in space. The vector of gains for such a 3D vector base is obtained by solving the system
$$ u = L \cdot g = \begin{bmatrix} l_L & l_R & l_Z \end{bmatrix} \begin{bmatrix} g_L \\ g_R \\ g_Z \end{bmatrix} . \tag{36} $$
Of course, having more than three loudspeakers in a 3D space implies, for any virtual source position, the selection of a local 3D vector base.

As indicated in [72], VBAP ensures maximum sharpness in sound source location. In fact:

• If the virtual source is located at a loudspeaker position, only that loudspeaker has nonzero gain;
• If the virtual source is located on a line connecting two loudspeakers, only those two loudspeakers have nonzero gain;
• If the virtual source is located on the triangle delimited by three adjacent loudspeakers, only those three loudspeakers have nonzero gain.

The formulation of VBAP given here is consistent with the low-frequency formulation of directional psychoacoustics.
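The two-loudspeaker system (35) is small enough to be solved in closed form. The Python sketch below is an illustration, not code from [72]: the function name and the angle convention (azimuths in degrees, measured counterclockwise in the loudspeaker plane) are assumptions. It solves for the gains with Cramer's rule and normalizes the gain vector to unit magnitude.

```python
import math


def vbap_2d(source_deg, left_deg, right_deg):
    """Solve u = L g for (g_L, g_R), then normalize the gains to unit magnitude.

    Angles are azimuths in degrees in the plane of the loudspeakers
    (the counterclockwise convention is an assumption of this sketch).
    """
    ux, uy = math.cos(math.radians(source_deg)), math.sin(math.radians(source_deg))
    lx, ly = math.cos(math.radians(left_deg)), math.sin(math.radians(left_deg))
    rx, ry = math.cos(math.radians(right_deg)), math.sin(math.radians(right_deg))
    det = lx * ry - rx * ly  # determinant of L = [l_L l_R]
    if abs(det) < 1e-12:
        raise ValueError("degenerate loudspeaker positions: L is not invertible")
    g_left = (ux * ry - rx * uy) / det
    g_right = (lx * uy - ux * ly) / det
    norm = math.hypot(g_left, g_right)
    return g_left / norm, g_right / norm


print(vbap_2d(0.0, 30.0, -30.0))   # source midway between the loudspeakers
print(vbap_2d(30.0, 30.0, -30.0))  # source at the left loudspeaker
```

A source midway between the loudspeakers gets equal gains, while a source placed at one loudspeaker position gets a nonzero gain for that loudspeaker only, in agreement with the first property listed above. For a fixed loudspeaker layout the inverse of L could be precomputed instead of being recomputed on every call.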
An extension of VBAP to high frequencies has also been proposed under the name Vector Base Panning (VBP) [68].

Room within a room

A different approach to spatialization using loudspeakers can be taken by controlling the relative time delay between the loudspeaker feeds. A model supporting this approach was introduced by Moore [60], and can be described as a physical and geometric model. The metaphor underlying the Moore model is that of the Room within a Room, where the inner room has holes in the walls, corresponding to the positions of loudspeakers, and the outer room is the virtual room where sound events have to take place (fig. 13).

Figure 13: Moore's Room in a Room Model. (Diagram with loudspeakers numbered 1 to 4.)

The simplest form of spatialization is obtained by drawing direct sound rays from the virtual sound source to the holes of the inner room. If the outer room is anechoic these are the only paths taken by sound waves to reach the inner room. The loudspeakers will be fed by signals delayed by an amount proportional to the length of these paths, and attenuated according to the relationship of inverse proportionality valid for the propagation of spherical waves. In formulas, if $l_i$ is the path length from the source to the $i$th loudspeaker, and $c$ is the speed of sound in air, the delay in seconds is set to
$$ d_i = l_i / c , \tag{37} $$
and the gain is set to
$$ g_i = \begin{cases} \dfrac{1}{l_i} , & l_i > 1 \\[1.5ex] 1 , & l_i \le 1 \end{cases} . \tag{38} $$
The formula for the amplitude gain is such that sources within the distance of 1 m from the loudspeaker (see footnote 5) will be stuck to unity gain, thus avoiding the asymptotic divergence in amplitude implied by a point source of spherical waves.

The model is as accurate as the physical system being modeled would permit. A listener within a room would have a spatial perception of the outside soundscape whose accuracy will increase with the number of windows in the walls. Therefore, the perception becomes sharper by increasing the number of holes/loudspeakers. Indeed, some of the holes will be masked by some walls, so that not all the rays will be effective (see footnote 6), e.g. the rays to loudspeaker 3 in fig. 13. In practice, the directional clarity of spatialization is increased if some form of directional panning is added to the base model, so that loudspeakers opposite to the direction of the sound source are severely attenuated. With this trick, it is not necessary to burden the model with an algorithm of ray-wall collision detection.

The Moore model is suitable for providing consistent and robust spatialization to extended audiences [60]. A reason for robustness might be found in the fact that simultaneous level and time differences are applied to the loudspeakers. This has the effect of increasing the lateral displacement [13] even for virtual sources such that the rays to different loudspeakers have similar lengths. Indeed, the localization of the sound source gets even sharper if the level control is driven by laws that roll off more rapidly than the physical $1/d$ law of spherical waves. In practical realizations, the best results are obtained by tuning the model after psychophysical experimentation [54].

An added benefit of the Room within a Room model is that the Doppler effect is intrinsically implemented. As the virtual sound source is moved in the outer room, the delay lines representing the virtual rays change their lengths, thus producing the correct pitch shifts.
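The per-loudspeaker delay and gain of equations (37) and (38) can be sketched in Python as follows. This is an illustration only: the speed of sound, the loudspeaker layout, and the source position are assumed values, the function name is hypothetical, and only direct rays in a plane are considered.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s, assumed value for air at room temperature


def ray_delay_and_gain(source, speaker):
    """Delay [s] and gain of the direct ray from source to speaker (2D points in meters)."""
    length = math.hypot(source[0] - speaker[0], source[1] - speaker[1])
    delay = length / SPEED_OF_SOUND               # equation (37)
    gain = 1.0 / length if length > 1.0 else 1.0  # equation (38), unity within 1 m
    return delay, gain


# Hypothetical layout: four loudspeakers at the corners of a 2 m square inner room.
speakers = [(-1.0, 1.0), (1.0, 1.0), (1.0, -1.0), (-1.0, -1.0)]
source = (4.0, 5.0)  # virtual source in the outer room

for i, spk in enumerate(speakers, start=1):
    delay, gain = ray_delay_and_gain(source, spk)
    print(f"loudspeaker {i}: delay {1000 * delay:.2f} ms, gain {gain:.3f}")
```

Recomputing the delays while the source position changes, and reading the delay lines with interpolation, is what yields the pitch shifts of the Doppler effect mentioned above.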
It is true that different transpositions might affect different loudspeakers, as the variations are different for different rays, but this is consistent with the physical robustness of the technique.

5 This distance is merely conventional.
6 We are neglecting diffraction from this reasoning.

The model of the Room within a Room works fine if the movements of the sound source are confined to a virtual space external to the inner room. This corresponds to an enlargement of the actual listening space and it is often a highly desirable situation. Moreover, it is natural to model the physical properties of the outer room, adding reflections at the walls and increasing the number of rays going from a sound source to the loudspeakers. This configuration, illustrated in fig. 13 with first-order reflections, is a step from spatialization to reverberation.

3.6.2 Reverberation

Classic reverberation tools

In the second half of the twentieth century, several engineers and acousticians tried to invent electronic devices capable of simulating the long-term effects of sound propagation in enclosures [14]. The most important pioneering work in the field of artificial reverberation has been that of Manfred Schroeder at the Bell Laboratories in the early sixties [88, 89, 90, 91, 93]. Schroeder introduced the recursive comb filters (section 3.4) and the delay-based allpass filters (section 3.4.1) as computational structures suitable for the inexpensive simulation of complex patterns of echoes. These structures rapidly became standard components used in almost all the artificial reverberators designed to date [61]. It is usually assumed that the allpass filters do not introduce coloration in the input sound. However, this assumption is valid from a perceptual viewpoint only if the delay line is much shorter than the integration time of the ear, i.e. about 50 ms [111]. If this is not the case, the time-domain effects become much more relevant and the timbre of the incoming signal is significantly affected.
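The distinction between a flat magnitude response and a transparent filter can be seen directly on the allpass comb of equation (26). The Python sketch below is illustrative: the function names are hypothetical, and g = 0.7 and m = 5 are arbitrary demonstration values. It checks that the magnitude is 1 at several frequencies while the impulse response is still a decaying train of echoes spaced m samples apart.

```python
import cmath


def allpass_comb(x, g, m):
    """y(n) = -g x(n) + x(n - m) + g y(n - m), i.e. H(z) = (-g + z^-m) / (1 - g z^-m)."""
    y = []
    for n, xn in enumerate(x):
        x_del = x[n - m] if n >= m else 0.0
        y_del = y[n - m] if n >= m else 0.0
        y.append(-g * xn + x_del + g * y_del)
    return y


def allpass_comb_response(g, m, omega):
    """Frequency response of equation (26) at omega [rad/sample]."""
    zm = cmath.exp(-1j * omega * m)
    return (-g + zm) / (1 - g * zm)


g, m = 0.7, 5  # arbitrary demonstration values
h = allpass_comb([1.0] + [0.0] * (3 * m), g, m)
magnitudes = [abs(allpass_comb_response(g, m, w)) for w in (0.1, 0.3, 1.0, 2.5)]

print(h[0], h[m], h[2 * m], h[3 * m])
print(magnitudes)
```

The first sample equals −g, and the following echoes are those of the recursive comb scaled by 1 − g². When m corresponds to tens of milliseconds these echoes are heard individually, which is the time-domain effect the paragraph above warns about.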
In the seventies, Michael Gerzon generalized the single-input single-output allpass filter to a multi-input multi-output structure, where the delay line of $m$ samples is replaced by an order-$N$ unitary network [40]. Examples of trivial unitary networks are orthogonal matrices, parallel connections of delay lines, or allpass filters. The idea behind this generalization is that of increasing the complexity of the impulse response without introducing appreciable coloration in frequency. According to Gerzon's generalization, allpass filters can be nested within allpass structures, in a telescopic fashion. Such embedding is shown to be equivalent to lattice allpass structures [39], and it is realizable as long as there is at least one delay element in the block $A(z)$, which replaces the delay line in fig. 8.

An extensive experimentation on structures for artificial reverberation was conducted by Andy Moorer in the late seventies [61]. He extended the work done by Schroeder [90] in relating some basic computational structures (e.g., tapped delay lines, comb and allpass filters) with the physical behavior of actual rooms. In particular, it was noticed that the early reflections have great importance in the perception of the acoustic space, and that a direct-form FIR filter can reproduce these early reflections explicitly and accurately. Usually this FIR filter is implemented as a tapped delay line, i.e. a delay line with multiple reading points that are weighted and summed together to provide a single output. This output signal feeds, in Moorer's architecture, a series of allpass filters and a parallel of comb filters (see fig. 14). Another improvement introduced by Moorer was the replacement of the simple gain of feedback delay lines in comb filters with lowpass filters resembling the effects of air absorption and lossy reflections.

The construction of high-quality reverberators is half an art and half a science.
Several structures and many parameterizations were proposed in the past, especially in nondisclosed form within commercial reverb units [29]. In most cases, the various structures are combinations of comb and allpass elementary blocks, as suggested by Schroeder in the early works. As an example, we look more carefully at the Moorer’s preferred structure [61], depicted in ﬁg.14. The block (a) takes care of the early reﬂections by means of a tapped delay line. The resulting signal is forwarded to the block (b), which is the parallel of a direct path on one branch, and a delayed, attenuated diffuse reverberator on the other branch. The output of the reverberator is delayed in such a way that the last of the early echoes coming out of block (a) reaches the output before the ﬁrst of the nonnull samples coming out of the diffuse reverberator. In Moorer’s preferred implementation, the reverberator of block (b) is best implemented as a parallel of six comb ﬁlters, each with a ﬁrstorder lowpass ﬁlter in the loop, and a single allpass ﬁlter. In [61], it is suggested to set the allpass delay length to 6ms and the allpass coefﬁcient to 0.7. Despite the fact that any allpass ﬁlter does not add coloration in the magnitude frequency response, its time response can give a metallic character to the sound, or add some unwanted roughness and granularity. The feedback attenuation coefﬁcients gi and the lowpass ﬁlters of the comb ﬁlters can be tuned to resemble a realistic and smooth decay. In particular, the attenuation coefﬁcients gi determine the overall decay time of the series of echoes generated by each comb ﬁlter. If the desired decay time (usually deﬁned for an attenuation level of 60dB) is Td , the gain of each comb Delay Lines and Effects
filter has to be set to

$$ g_i = 10^{-3 \frac{m_i}{T_d F_s}} , \tag{39} $$

where $F_s$ is the sample rate and $m_i$ is the delay length in samples. Further attenuation at high frequencies is provided by the feedback lowpass filters, whose coefficient can also be related to the decay time at a specific frequency or fine-tuned by direct experimentation. In [61], an example set of feedback attenuation and allpass coefficients is provided, together with some suggested values of the delay lengths of the comb filters. As a rule of thumb, they should be distributed over a ratio 1 : 1.5 between 50 and 80 ms. Schroeder suggested a number-theoretic criterion for a more precise choice of the delay lengths [91]: the lengths in samples should be mutually coprime (or incommensurate) to reduce the superimposition of echoes in the impulse response, thus reducing the so-called flutter echoes. This same criterion might be applied to the distances between each echo and the direct sound in early reflections. However, as noticed by Moorer [61], the results are usually better if the taps are positioned according to the reflections computed by means of some geometric modeling technique, such as the image method [3, 18]. Indeed, even the lengths of the recirculating delays can be computed from the geometric analysis of the normal modes of actual room shapes.

[Figure 14: Moorer's reverberator. Block (a): tapped delay line fed by x(n), with taps m_1, ..., m_N and weights a_0, ..., a_N. Block (b): comb filters C_1, ..., C_6 in parallel, followed by the allpass filter A_1 and the delay z^{-d}, summed with the direct path to give y(n).]

Feedback Delay Networks

In 1982, J. Stautner and M. Puckette [101] introduced a structure for artificial reverberation based on delay lines interconnected in a feedback loop by means of a matrix (see fig. 15). Structures such as this were later called Feedback Delay Networks (FDNs). The Stautner-Puckette FDN was obtained as a vector generalization of the recursive comb filter (20), where the m-sample delay line was replaced by a set of delay lines of different lengths, and the feedback gain g was replaced by a feedback matrix G. Stautner and Puckette proposed the following feedback matrix:

$$ G = \frac{g}{\sqrt{2}} \begin{bmatrix} 0 & 1 & 1 & 0 \\ -1 & 0 & 0 & -1 \\ 1 & 0 & 0 & -1 \\ 0 & 1 & -1 & 0 \end{bmatrix} . \tag{40} $$

Due to its special sparse structure, G requires only one multiply per output channel.
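As a minimal sketch of the last remark, the following Python fragment builds the matrix of equation (40), checks that G/g is orthogonal (so that G^T G = g^2 I), and applies G to a vector of delay-line outputs using only additions, subtractions, and one multiplication per output channel. The function names are illustrative assumptions, not part of any reverberation library.

```python
import math

# Sign pattern of the Stautner-Puckette matrix (40), without the g/sqrt(2) factor.
SIGNS = [
    [0, 1, 1, 0],
    [-1, 0, 0, -1],
    [1, 0, 0, -1],
    [0, 1, -1, 0],
]


def feedback_matrix(g):
    """Return G = g * SIGNS / sqrt(2) as a list of rows."""
    k = g / math.sqrt(2.0)
    return [[k * s for s in row] for row in SIGNS]


def matvec(m, v):
    """Plain matrix-vector product, used here only as a reference."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def apply_feedback(g, s):
    """Compute G*s with sums/differences and a single multiply per output."""
    k = g / math.sqrt(2.0)
    return [
        k * (s[1] + s[2]),
        k * (-s[0] - s[3]),
        k * (s[0] - s[3]),
        k * (s[1] - s[2]),
    ]


def gram(m):
    """Return M^T M, which equals g^2 times the identity when M/g is orthogonal."""
    n = len(m)
    return [[sum(m[r][i] * m[r][j] for r in range(n)) for j in range(n)]
            for i in range(n)]
```

With g < 1 every recirculation loses energy by the factor g^2, which is the vector counterpart of the feedback gain of the recursive comb filter.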
[Figure 15: Fourth-order Feedback Delay Network (signal-flow diagram with feedback coefficients a_{i,j}, input gains b_i, output gains c_i, direct gain d, delays z^{-m_i}, and blocks H_i, for i = 1, ..., 4).]

More recently, Jean-Marc Jot investigated the possibilities of FDNs very thoroughly. He proposed the use of some classes of unitary matrices allowing efficient implementation. Moreover, he showed how to control the positions of the poles of the structure in order to impose a desired decay time at various frequencies [44]. His considerations were driven by perceptual criteria, with the general goal of obtaining an ideal diffuse reverb. In this context, Jot introduced the important design criterion that all the modes of a frequency neighborhood should decay at the same rate, in order to avoid the persistence of isolated, ringing resonances in the tail of the reverb [45]. This is not what happens in real rooms though, where different modes of close resonance frequencies can be differently affected by wall absorption [63]. However, it is generally believed that a slow variation of decay rates with frequency produces smooth and pleasant impulse responses.

Referring to fig. 15, an FDN is built starting from N delay lines, each being $\tau_i = m_i T_s$ seconds long, where $T_s = 1/F_s$ is the sampling interval. The FDN is completely described by the following equations:

$$ \begin{aligned} y(n) &= \sum_{i=1}^{N} c_i s_i(n) + d\, x(n) \\ s_i(n + m_i) &= \sum_{j=1}^{N} a_{i,j} s_j(n) + b_i x(n) \end{aligned} \tag{41} $$

where $s_i(n)$, $1 \le i \le N$, are the delay outputs at the n-th time sample. If $m_i = 1$ for every i, we obtain the well-known state-space description of a discrete-time linear system [46]. In the case of FDNs, the $m_i$ are typically numbers on the order of hundreds or thousands, and the variables $s_i(n)$ are only a small subset of the system state at time n, the whole state being represented by the content of all the delay lines.

From the state-variable description of the FDN it is possible to find the system transfer function [80, 84] as

$$ H(z) = \frac{Y(z)}{X(z)} = c^T [D(z^{-1}) - A]^{-1} b + d . \tag{42} $$

The diagonal matrix $D(z) = \mathrm{diag}(z^{-m_1}, z^{-m_2}, \dots, z^{-m_N})$ is called the delay matrix, and $A = [a_{i,j}]_{N \times N}$ is called the feedback matrix.

The stability properties of an FDN are all ascribed to the feedback matrix. The fact that $\|A\|^n$ decays exponentially with n ensures that the whole structure is stable [80, 84].

The poles of the FDN are found as the solutions of

$$ \det[A - D(z^{-1})] = 0 . \tag{43} $$

In order to have all the poles on the unit circle it is sufficient to choose a unitary matrix. This choice leads to the construction of a lossless prototype, but this is not the only choice allowed.

In practice, once we have constructed a lossless FDN prototype, we must insert attenuation coefficients and filters in the feedback loop (blocks $G_i$ in figure 15). For instance, following the indications of Jot [45], we can cascade every delay line with a gain

$$ g_i = \alpha^{m_i} . \tag{44} $$

This corresponds to replacing D(z) with D(z/α) in (42). With this choice of the attenuation coefficients, all the poles are contracted by the same factor α. As a consequence, all the modes decay at the same rate, and the reverberation time (defined for a level attenuation of 60 dB) is given by

$$ T_d = \frac{-3 T_s}{\log \alpha} . \tag{45} $$

In order to have a faster decay at higher frequencies, as happens in real enclosures, we must cascade the delay lines with lowpass filters.
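The recursion (41), with the gains (44) cascaded to the delay lines, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the circular-buffer layout, and the choice of an impulse as input are not taken from the text. Since replacing D(z) with D(z/α) contracts all poles by α, the attenuated impulse response should equal the lossless one multiplied by α^n, which gives a simple way to check the implementation.

```python
import math


def alpha_for_decay(td, fs):
    """Per-sample attenuation giving a 60 dB decay in td seconds, from (45)."""
    return 10.0 ** (-3.0 / (td * fs))


def fdn_impulse_response(a, b, c, d, delays, alpha, length):
    """Impulse response of the FDN (41) with gains g_i = alpha**m_i as in (44).

    a: feedback matrix (list of rows), b/c: input/output gains,
    d: direct gain, delays: delay lengths m_i in samples.
    """
    n_lines = len(delays)
    gains = [alpha ** m for m in delays]
    buffers = [[0.0] * m for m in delays]   # one circular buffer per delay line
    pos = [0] * n_lines
    out = []
    for n in range(length):
        x = 1.0 if n == 0 else 0.0
        # Attenuated delay outputs s_i(n): values written m_i samples ago.
        s = [gains[i] * buffers[i][pos[i]] for i in range(n_lines)]
        out.append(sum(ci * si for ci, si in zip(c, s)) + d * x)
        for i in range(n_lines):
            v = sum(a[i][j] * s[j] for j in range(n_lines)) + b[i] * x
            buffers[i][pos[i]] = v          # will be read again at n + m_i
            pos[i] = (pos[i] + 1) % delays[i]
    return out
```

In a realistic reverberator the delay lengths would be hundreds or thousands of samples; short coprime lengths such as 7, 11, 13, 17 are enough to exercise the sketch.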
If the attenuation coefficients $g_i$ are replaced by lowpass filters, we can still get a local smoothness of decay times at various frequencies by satisfying the condition (44), where $g_i$ and α have been made frequency dependent:

$$ G_i(z) = A^{m_i}(z) , \tag{46} $$

where A(z) can be interpreted as per-sample filtering [43, 45, 98].

It is important to notice that a uniform decay of neighbouring modes, even though commonly desired in artificial reverberation, is not found in real enclosures. The normal modes of a room are associated with stationary waves, whose absorption depends on the spatial directions taken by these waves. For instance, in a rectangular enclosure, axial waves are absorbed less than oblique waves [63]. Therefore, neighbouring modes associated with different directions can have different reverberation times. Actually, for commonly found rooms having irregularities in the geometry and in the materials, the response is close to that of a room having diffusive walls, where the energy rapidly spreads among the different modes. In these cases, we can find that the decay time is quite uniform among the modes [50].

The most delicate part of the structure is the feedback matrix. In fact, it governs the stability of the whole structure. In particular, it is desirable to start with a lossless prototype, i.e. a reference structure providing an endless, flat decay. The reader interested in general matrix classes that might work as prototypes is referred to the literature [44, 84, 81, 39]. Here we only mention the class of circulant matrices, having general form⁷

$$ A = \begin{bmatrix} a(0) & a(1) & \dots & a(N-1) \\ a(N-1) & a(0) & \dots & a(N-2) \\ \vdots & & \ddots & \vdots \\ a(1) & \dots & a(N-1) & a(0) \end{bmatrix} . $$

The stability of an FDN is related to the magnitude of the eigenvalues of its feedback matrix, which, in the case of a circulant matrix, can be computed by the Discrete Fourier Transform of the first row. By keeping these eigenvalues on the unit circle (i.e., magnitude one) we ensure that the whole structure is stable and lossless.
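The relation between a circulant matrix and the DFT of its first row can be checked numerically. The sketch below, whose function names are illustrative assumptions, builds the matrix from its first row, computes the DFT with the standard library, and tests whether all eigenvalues have unit magnitude. With the DFT sign convention used here, the k-th eigenvalue corresponds to the eigenvector with components e^{-j2πik/N}.

```python
import cmath


def circulant(first_row):
    """Circulant matrix whose i-th row is the first row rotated right i times."""
    n = len(first_row)
    return [[first_row[(j - i) % n] for j in range(n)] for i in range(n)]


def dft(seq):
    """Direct (O(N^2)) Discrete Fourier Transform of a short sequence."""
    n = len(seq)
    return [sum(seq[m] * cmath.exp(-2j * cmath.pi * k * m / n) for m in range(n))
            for k in range(n)]


def is_lossless(first_row, tol=1e-9):
    """True if every eigenvalue of the circulant matrix lies on the unit circle."""
    return all(abs(abs(lam) - 1.0) < tol for lam in dft(first_row))
```

For example, the first row (-0.5, 0.5, 0.5, 0.5) gives eigenvalues 1, -1, -1, -1 and therefore a lossless prototype, while (0.5, 0.5, 0, 0) has an eigenvalue equal to zero and is not lossless.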
The control over the angle of the eigenvalues can be translated into a direct control over the degree of diffusion of the enclosure that is being simulated by the FDN. The limiting cases are the diagonal matrix, corresponding to perfectly reflecting walls, and the matrix whose rows are sequences of equal-magnitude numbers and (pseudo-)randomly distributed signs [81].

Another critical set of parameters is given by the lengths of the delay lines. Several authors suggested using lengths in samples that are mutually coprime numbers in order to minimize the collision of echoes in the impulse response. However, if the FDN is linked to a physical and geometrical interpretation, as is done in the Ball-within-the-Box model [79], the delay lengths are derived from the geometry of the room being simulated, and the resulting digital reverb quality is related to the quality of the actual room. In the case of a rectangular room, a delay line will be associated with a harmonic series of normal modes, all obtainable from a plane wave loop that bounces back and forth within the enclosure.

⁷ A matrix such as this is used in the Csound babo opcode.

Convolution with Room Impulse Responses

If the impulse response of a target room is readily available, the most faithful reverberation method would be to convolve the input signal with such a response. Direct convolution can be done by storing each sample of the impulse response as a coefficient of an FIR filter whose input is the dry signal. Direct convolution easily becomes impractical if the length of the target response exceeds small fractions of a second, as it would translate into several hundreds of taps in the filter structure. A solution is to perform the convolution block by block in the frequency domain: given the Fourier transform of the impulse response, and the Fourier transform of a block of input signal, the two can be multiplied point by point and the result transformed back to the time domain. As this kind of processing is performed on successive blocks of the input signal, the output signal is obtained by overlapping and adding the partial results [65]. Thanks to the FFT computation of the discrete Fourier transform, such a technique can be significantly faster. A drawback is that, in order to be operated in real time, a block of N samples must be read and then processed while a second block is being read. Therefore, the input-output latency in samples is twice the size of a block, and this is not tolerable in practical real-time environments.

The complexity-latency tradeoff is illustrated in fig. 16, where the direct-form and the block-processing solutions can be located, together with a third efficient yet low-latency solution [37, 64]. This third realization of convolution is based on a decomposition of the impulse response into increasingly large chunks. The size of each chunk is twice the size of its predecessor, so that the latency of prior computation can be occupied by the computations related to the following impulse-response chunk.
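The uniform block-based scheme (overlap-add) can be sketched as follows. This is a plain illustration, not the low-latency non-uniform algorithm of [37, 64]: the small recursive radix-2 FFT and the function names are assumptions made to keep the example self-contained, and a practical convolver would rely on an optimized FFT library.

```python
import cmath


def fft(x):
    """Recursive radix-2 FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return [complex(x[0])]
    even = fft(x[0::2])
    odd = fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out


def ifft(spectrum):
    """Inverse FFT obtained from the forward FFT by conjugation."""
    n = len(spectrum)
    y = fft([v.conjugate() for v in spectrum])
    return [v.conjugate() / n for v in y]


def overlap_add(x, h, block):
    """Convolve x with h, processing x in blocks of `block` samples."""
    nfft = 1
    while nfft < block + len(h) - 1:     # room for one block's full response
        nfft *= 2
    h_spec = fft(list(h) + [0.0] * (nfft - len(h)))
    y = [0.0] * (len(x) + len(h) - 1)
    for start in range(0, len(x), block):
        seg = list(x[start:start + block])
        seg_spec = fft(seg + [0.0] * (nfft - len(seg)))
        part = ifft([p * q for p, q in zip(seg_spec, h_spec)])
        # Overlap and add the partial result at the position of the block.
        for i in range(len(seg) + len(h) - 1):
            y[start + i] += part[i].real
    return y


def direct_convolution(x, h):
    """Direct-form FIR convolution, used as a reference."""
    y = [0.0] * (len(x) + len(h) - 1)
    for i, xv in enumerate(x):
        for j, hv in enumerate(h):
            y[i + j] += xv * hv
    return y
```

The block size sets the latency: each output block becomes available only after a whole input block has been read and transformed.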
[Figure 16: Complexity vs. latency tradeoff in convolution (complexity plotted against latency for the direct-form FIR, the block-based FFT, and the non-uniform block-based FFT realizations).]

Even if we have enough computer power to compute convolutions with long impulse responses in real time, there are still serious reasons to prefer reverberation algorithms based on feedback delay networks in many practical contexts. The reasons are similar to those that make a CAD description of a scene preferable to a still picture whenever several views have to be extracted or the environment has to be modified interactively. In fact, it is not easy to modify a room impulse response to reflect some of the room attributes, e.g. its high-frequency absorption, and it is even less obvious how to spatialize the echoes of the impulse response in order to get a proper sense of envelopment. If the impulse response is coming from a spatial rendering algorithm, such as ray tracing, these manipulations can be operated at the level of room description, and the coefficients of the room impulse response transmitted to the real-time convolver. In the low-latency block-based implementations of convolution, we can even have faster update rates for the smaller early chunks of the impulse response, and slower update rates for the reverberant tail. Still, continuous variations of the room impulse response are easier to render using a model of reverberation operating on a sample-by-sample basis.

Chapter 4
Sound Analysis
Sounds are time-varying signals in the real world and, indeed, all of their meaning is related to such time variability. Therefore, it is interesting to develop sound analysis techniques that allow us to grasp at least some of the distinguishing features of time-varying sounds, in order to ease the tasks of understanding, comparison, modification, and resynthesis. In this chapter we present the most important sound analysis techniques. Special attention is devoted to criteria for choosing the analysis parameters, such as window length and type.

4.1 Short-Time Fourier Transform

The Short-Time Fourier Transform (STFT) is nothing more than Fourier analysis performed on slices of the time-domain signal. In order to slightly simplify the formulas, we are going to present the STFT under the assumption of unitary sample rate ($F_s = T^{-1} = 1$). There are two complementary views of the STFT: the filterbank view, and the DFT-based view.

4.1.1 The Filterbank View

Assume we have a prototype ideal lowpass filter, whose frequency response is depicted in fig. 1. Let w(·) and W(·) be the impulse response and transfer function, respectively, of such a prototype filter.
[Figure 1: Frequency response of a prototype lowpass filter (W plotted against frequency f, with F_s/N and F_s marked on the frequency axis).]

We define modulation of a signal y(n) by a carrier signal $e^{j\omega_0 n}$ as the (complex) multiplication $y(n) e^{j\omega_0 n}$. This translates, in the frequency domain, into a frequency shift by $\Delta\omega = \omega_0$ (shift theorem 1.2 of chapter 1). In other words, modulating a signal means moving its low-frequency content onto an area around the carrier frequency. On the other hand, we call demodulation of a signal y(n) its multiplication by $e^{-j\omega_0 n}$, which brings the components around $\omega_0$ onto a neighborhood of dc.

By demodulation we can obtain a filterbank that slices the spectrum (between 0 Hz and $F_s$) into N equal non-overlapping portions. Namely, we can translate the input signal in frequency and filter it by means of the prototype lowpass filter in order to isolate a specific slice of the frequency spectrum. This procedure is reported in fig. 2.

4.1.2 The DFT View

The scheme of fig. 2 can be obtained by Fourier transformation of a "windowed" sequence. We recall from section 1.3 that the DTFT of an infinite sequence is

$$ Y(\omega) = \sum_{n=-\infty}^{+\infty} y(n) e^{-j\omega n} . \tag{1} $$

[Figure 2: Decomposition of a signal into a set of non-overlapping frequency slices. In channel k the input y(n) is multiplied by $e^{-j\omega_k n}$ to give $y_{e,\omega_k}$ and filtered by W(·), producing $Y_m(\omega_k) = (w * y_{e,\omega_k})(m)$. $\omega_0, \dots, \omega_{N-1}$ are the central frequencies of the bands of the analysis channels.]

If the DTFT is computed on a portion of y(·), weighted by an analysis window w(m − n), we get a frame of the STFT:
$$ Y_m(\omega) = \sum_{n=-\infty}^{+\infty} w(m-n)\, y(n)\, e^{-j\omega n} = e^{-j\omega m} \sum_{r=-\infty}^{+\infty} w(r)\, y(m-r)\, e^{j\omega r} , \tag{2} $$

where the last member of the equality is obtained by defining r = m − n, and m is a variable accounting for the temporal dislocation of the window. Therefore, the STFT turns out to be a function of two variables: one can be thought of as frequency, the other is essentially a time shift.

The DTFT is a periodic function of a continuous variable, and it can be inverted by means of an integral computed over a period:

$$ w(m-n)\, y(n) = \frac{1}{2\pi} \int_{-\pi}^{\pi} Y_m(\omega)\, e^{j\omega n}\, d\omega . \tag{3} $$

By a proper alignment of the window (m = n) we can compute, if $w(0) \neq 0$,

$$ y(n) = \frac{1}{2\pi w(0)} \int_{-\pi}^{\pi} Y_n(\omega)\, e^{j\omega n}\, d\omega . \tag{4} $$

The STFT in its formulation (2) can be seen as the convolution

$$ Y_m(\omega) = (w * y_e)(m) , \tag{5} $$

where $y_e(n) = y(n) e^{-j\omega n}$ is the demodulated signal. If w is set to the impulse response of the ideal lowpass filter, and if we set $\omega = \omega_k$, we get a channel of the filterbank of fig. 2. In general, w(·) will be the impulse response of a non-ideal lowpass filter, but the filterbank view will keep its validity.

In practice, we need to compute the STFT on a finite set of N points. In what follows we assume that the window is R ≤ N samples long, so that we can use the DFT on N points, thus obtaining a sampling of the frequency axis between 0 and 2π in multiples of 2π/N. The k-th point in the transform domain (called the k-th bin of the DFT) is given by
$$ Y_m(k) = \sum_{n=0}^{N-1} w(m-n)\, y(n)\, e^{-j\frac{2\pi k n}{N}} \tag{6} $$

and, by means of an inverse DFT,

$$ w(m-n)\, y(n) = \frac{1}{N} \sum_{k=0}^{N-1} Y_m(k)\, e^{j\frac{2\pi k n}{N}} . \tag{7} $$

By a proper alignment of the window (m = n), and assuming that $w(0) \neq 0$, we get

$$ y(n) = \frac{1}{N w(0)} \sum_{k=0}^{N-1} Y_n(k)\, e^{j\frac{2\pi k n}{N}} . \tag{8} $$

More generally, we can reconstruct (resynthesize) the time-domain signal by means of

$$ y(n) = \frac{1}{N w(m-n)} \sum_{k=0}^{N-1} Y_m(k)\, e^{j\frac{2\pi k n}{N}} , \tag{9} $$

where $w(m-n) \neq 0$, which is true, given an integer $n_0$, for a non-trivial window defined for

$$ m + n_0 \le n \le m + n_0 + R - 1 . \tag{10} $$

Example

Figure 3 illustrates the operations involved in analysis and resynthesis of a frame of STFT (R = 5, N = 8). Reconstruction is possible for 1 ≤ n ≤ 5 ($n_0 = -2$).
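The example can be reproduced numerically with a direct implementation of (6) and (9). The sketch below assumes a symmetric triangular window of length R = 5 centered on m = 3 (the text does not specify the window shape used in figure 3), and the function names are illustrative.

```python
import cmath


def window(r, half=2):
    """Symmetric triangular window, non-zero only for -half <= r <= half."""
    if abs(r) > half:
        return 0.0
    return 1.0 - abs(r) / (half + 1.0)


def stft_frame(y, m, n_dft):
    """Frame m of the STFT on n_dft points, equation (6)."""
    return [sum(window(m - n) * y[n] * cmath.exp(-2j * cmath.pi * k * n / n_dft)
                for n in range(n_dft))
            for k in range(n_dft)]


def resynthesize(frame, m, n):
    """Sample y(n) recovered from frame m, equation (9)."""
    n_dft = len(frame)
    w = window(m - n)
    if w == 0.0:
        raise ValueError("sample n lies outside the support of the window")
    total = sum(frame[k] * cmath.exp(2j * cmath.pi * k * n / n_dft)
                for k in range(n_dft))
    return (total / (n_dft * w)).real
```

With N = 8 and m = 3, `resynthesize(stft_frame(y, 3, 8), 3, n)` returns y(n) for n = 1, ..., 5, while for the other values of n the window is zero and the division in (9) is not possible.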
[Figure 3: Analysis and resynthesis of a frame of STFT (N = 8, R = 5, m = 3). Panels: the symmetric window w(n) = w(−n); the time-centered window w(3 − n); the signal y(n); the product w(3 − n) y(n); its DFT on 8 points, $Y_3(0), \dots, Y_3(7)$; the IDFT on 8 points and the scaling by 1/w(2), ..., 1/w(−2), which reconstruct the 5 samples y(1), ..., y(5).]

4.1.3 Windowing

The rectangular window

The simplest analysis window is the rectangular window

$$ w_R(n) = \begin{cases} 1 & n = 0, \dots, R-1 \\ 0 & \text{elsewhere} . \end{cases} \tag{11} $$

Considering a filter having (11) as its impulse response, the frequency response is found by Fourier transformation of $w_R(n)$:
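$$ W_R(\omega) = \sum_{n=-\infty}^{+\infty} w_R(n)\, e^{-j\omega n} = \sum_{n=0}^{R-1} e^{-j\omega n} = \frac{1 - e^{-j\omega R}}{1 - e^{-j\omega}} = e^{-j\omega\frac{R-1}{2}}\, \frac{\sin\frac{\omega R}{2}}{\sin\frac{\omega}{2}} = \mathrm{sinc}_R(\omega) . \tag{12} $$

The real part of the function $\mathrm{sinc}_R(\omega)$ is plotted in figure 4 for different values of the window length R.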
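As a quick numerical check of (12), the following sketch compares the direct sum over the window samples with the closed form, and can be used to locate the first zero at ω = 2π/R. The function names are illustrative; the special case handles the points where sin(ω/2) vanishes, at which the sum equals R.

```python
import cmath
import math


def rect_transform_sum(omega, r):
    """W_R(omega) as the direct sum of R complex exponentials."""
    return sum(cmath.exp(-1j * omega * n) for n in range(r))


def rect_transform_closed(omega, r):
    """W_R(omega) in the closed form of equation (12)."""
    den = math.sin(omega / 2.0)
    if abs(den) < 1e-12:
        return complex(r)          # limit value for omega = 0 (mod 2*pi)
    phase = cmath.exp(-1j * omega * (r - 1) / 2.0)
    return phase * math.sin(omega * r / 2.0) / den
```

Evaluating either function at ω = 2π/R gives zero (up to rounding), in agreement with the position of the zero closest to dc.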
[Figure 4: $\mathrm{sinc}_R(\omega)$ for different values of window length R (curves for R = 4, R = 8, and R = 16, plotted against radian frequency).]

In figure 4, it can be noticed that 2π/R is the zero closest to dc. Therefore, we can say that if we use the rectangular window as the prototype filter of figure 2, the equivalent bandwidth is 2π/R. If we neglect aliasing for a moment, we realize that we can decimate each channel $Y_m(\omega_k)$ by a factor R without losing any information.

A superficial look at the expression (12) seems to indicate that the shifted replicas of $\mathrm{sinc}_R$ produce aliasing in the base band $[-2\pi/R, 2\pi/R]$. In fact, if we sum R shifted replicas we verify that the aliasing components cancel out. Therefore, with this window, it is possible to decimate the output channels by a factor equal to the window length. Furthermore, if we choose N = R, we can perform one FFT per frame and advance the window by N samples at each step.

According to (7), the reconstruction (resynthesis) of the analyzed signal can be obtained by filterbank summation, as depicted in figure 5. The reconstruction can be interpreted as a bank of oscillators driven by the analysis data. The two stages represented in figures 2 and 5, taken as a whole, are often called the phase vocoder.

[Figure 5: Reconstruction of a signal from a set of non-overlapping frequency slices. Each channel $Y_m(\omega_k)$ is multiplied by $e^{j\omega_k n}$; the channels are summed and scaled by 1/(N w(0)) to give y(m). $\omega_0, \dots, \omega_{N-1}$ are the central frequencies of the bands of the analysis channels.]

Between the analysis stage of figure 2 and the synthesis stage of figure 5, a decimation stage can be inserted. Namely, with the rectangular window we can reduce the intermediate sampling rate down to $F_s/R$. Of course, in order to do the filterbank summation of figure 5, an interpolation stage will be needed to take the sampling rate back to $F_s$. For the rectangular window, the window is shifted in time by R samples after each DFT computation.
This temporal shift is technically called the hop size. In the case of the rectangular window, hop sizes smaller than R do not add any information to the analysis.

Commonly-used windows

In practice, signal analysis is seldom performed using rectangular windows, because their frequency response has side lobes that are significantly high, thus potentially inducing erroneous estimations of frequency components. In general, there is a tradeoff between the main-lobe width and the side-lobe level that can be exploited by choosing or designing an appropriate window. Table 4.1 concisely describes the form and features of the most commonly used analysis windows.

Window name   w(n) for -(R-1)/2 <= n <= (R-1)/2                 Mainlobe width (x pi/R)   Sidelobe level [dB]
Rectangular   1                                                  4                         -13.3
Hann          (1/2) (1 + cos(2 pi n / R))                        8                         -31.5
Hamming       0.54 + 0.46 cos(2 pi n / R)                        8                         -42.7
Blackman      0.42 + 0.5 cos(2 pi n / R) + 0.08 cos(4 pi n / R)  12                        -58.1

Table 4.1: Characteristics of popular windows.

Each window is characterized by the main-lobe width and the side-lobe level. The larger the main-lobe width, the smaller the decimation that we can introduce between the analysis and synthesis stages. This has a consequence on the choice of the hop size. For instance, using Hann¹ or Hamming windows the hop size can be at most R/2 if all information is to be preserved at the analysis stage. Moreover, the larger the main-lobe width, the more difficult it is to separate two frequency components that are close to each other. In other words, we have a reduction in frequency resolution for windows with a large main lobe.

¹ The Hann window is often called Hanning window, probably for the same reason that in the US you may prefer saying "I xerox this document" rather than "I copy this document using a Xerox copier".

The side-lobe level indicates how much a sinusoidal component affects the nearby DFT bins. This phenomenon, called leakage, can induce an analysis procedure to detect false spectral peaks, or it can introduce errors in the measurements on actual peaks. For a given resolution considered to be acceptable, it is desirable that the side-lobe level be as small as possible.

The window length is chosen according to the tradeoff between spectral resolution and temporal resolution governed by the uncertainty principle. The STFT analysis is based on the assumption that, within one frame, the signal is stationary. The shorter the window, the closer the assumption is to the truth, but short windows give low spectral resolution.

The windows described in this section have a fixed shape. When they are multiplied by an ideal lowpass impulse response they impose a fixed transition bandwidth, i.e. a certain frequency space between the passband and the stopband. There are other, more versatile windows, whose behavior can be tuned by means of a parameter. The most widely used of these adjustable windows is the Kaiser window [58], whose parameter β can be related to the transition bandwidth.

Zero padding

It is quite common to use a window whose length R is smaller than the number N of points used to compute the DFT. In this way, we have a spectrum representation on a larger number of points, and the shape of the frequency response can be understood more easily. Usually, the sequence of R points is extended by means of N − R zeros, and this operation is called zero padding. Extending the time response with zeros corresponds to sampling the frequency response more densely, but it does not introduce any increase in frequency resolution. In fact, the resolution is only determined by the length and shape of the effective window, and additional zeros cannot change it.

Consider the zero-padded signal

$$ y(n) = \begin{cases} x(n) & n = 0, \dots, R-1 \\ 0 & n = R, \dots, N-1 . \end{cases} \tag{13} $$

The DFT is found as

$$ Y(k) = \sum_{n=0}^{N-1} y(n)\, e^{-j\frac{2\pi k n}{N}} = \sum_{n=0}^{R-1} y(n)\, e^{-j\frac{2\pi k n}{N}} = \mathrm{Resampling}_N(X, R) , \tag{14} $$

where the notation $\mathrm{Resampling}_N(X, R)$ indicates the resampling on N points of R points of the discrete-time signal X, obtained as DFT(x) = X.

Exercise

Draw the time-domain shape and the frequency response of each of the windows of table 4.1. Then, using a Rectangular, a Hann, and a Blackman window, analyze the signal

$$ x(n) = 0.8 \sin(2\pi f_1 n / F_s) + \sin(2\pi f_2 n / F_s) , \tag{15} $$

where $f_1 = 0.2 F_s$ and $f_2 = 0.23 F_s$, using N = R = 64. See the effects of halving and doubling N = R, and observe the presence of leakage. Finally, repeat the exercise with R = 32, and N = 64 or N = 128.

4.1.4 Representations

One of the most useful visual representations of audio signals is the sonogram, also called spectrogram, that is a color or grey-scale rendition of the magnitude of the STFT, on a 2D plane where time and frequency are the orthogonal axes. Figure 6 shows the sonogram of the signal analyzed in exercise 4.1.3. Time is on the horizontal axis and frequency is on the vertical axis. Another useful visualization is the 3D plot, also called waterfall plot in sound analysis programs, where the analysis frames are presented one after the other from back to front. Figure 7 shows the 3D representation of the same signal analysis of figure 6.

[Figure 6: Sonogram representation of the signal (15). N = 128 and R = 64.]

[Figure 7: 3D STFT representation of the signal (15). N = 128 and R = 64. Axes: time [seconds], frequency [Hz], magnitude [dB].]

The Matlab signal processing toolbox, as well as the octave-forge project (see appendix B), provide a function specgram that can be used to produce plots similar to those of figures 6 and 7. Specifically, these figures have been obtained by means of the Octave script:
```octave
Fs = 44100;
f1 = 0.2 * Fs;
f2 = 0.23 * Fs;
NMAX = 4096;
n = [1:NMAX];
x1 = 0.8 * sin(2*pi*f1/Fs*n);
x2 = sin(2*pi*f2/Fs*n);
y = x1 + x2;
N = 128;                         # DFT length
R = 64;                          # window length
[S,f,t] = specgram(y, N, Fs, hanning(R), R/2);
S = abs(S(2:N/2,:));             # magnitude in Nyquist range
S = S/max(S(:));                 # normalize magnitude so that max is 0 dB
imagesc(flipud(log(S)));         # display in log scale
mesh(t, f(1:length(f)-1), log(S));
# gset and replot belong to the gnuplot interface of older Octave releases
gset view 35, 65, 1, 1.2
xlabel('time [seconds]');
ylabel('frequency [Hz]');
zlabel('[dB]');
replot;
```

In this example, the DFT length has been set to N = 128, the analysis window is a Hann window with length R = 64, and the hop size is R/2. If the window length is doubled, the two components separate much more clearly, as shown in figure 8.

[Figure 8: Sonogram representation of the signal (15). N = 128 and R = 128.]

4.1.5 Accurate partial estimation

If the signal under analysis has a sinusoidal component that falls between two adjacent DFT bins, the magnitude spectrum is similar to that reported in figure 9. We notice the two following phenomena:

• The sinusoidal component "leaks" some of its energy into bins that lie within a neighborhood of its theoretical position;
• It is difficult to determine the exact frequency of the component from visual inspection.

To overcome the latter problem, we describe two techniques: parabolic interpolation and phase following.

Parabolic interpolation

Any kind of interpolation can be applied to estimate the value and position of a frequency peak in the magnitude spectrum of a signal. Degree-two polynomial interpolation, i.e. parabolic interpolation, is particularly convenient as it uses only three bins of the magnitude spectrum.
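A minimal sketch of three-point parabolic interpolation is given below. It assumes equally spaced bins and uses the closed-form vertex of the parabola through the three points, obtained by setting the derivative of the interpolating polynomial to zero; the function name and argument layout are illustrative choices.

```python
def parabolic_peak(x1, delta, y0, y1, y2):
    """Vertex of the parabola through (x1 - delta, y0), (x1, y1), (x1 + delta, y2).

    x1 is the frequency of the central bin, delta the bin spacing (Fs/N),
    and y0, y1, y2 are the magnitudes of three adjacent bins.
    Returns the pair (peak position, peak value).
    """
    curvature = y0 - 2.0 * y1 + y2
    if curvature == 0.0:
        raise ValueError("the three points are collinear: no parabola vertex")
    offset = 0.5 * (y0 - y2) / curvature      # in units of bins, relative to x1
    peak_x = x1 + offset * delta
    peak_y = y1 - 0.25 * (y0 - y2) * offset
    return peak_x, peak_y
```

In practice the central bin is the local maximum of the magnitude spectrum, so that the offset lies between -1/2 and 1/2 of a bin; applying the interpolation to magnitudes expressed in dB is a common variant.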
[Figure 9: DFT image (magnitude) of a sinusoidal component (DFT magnitude in dB plotted against the bin index).]

Given three adjacent bins of the magnitude DFT, we assign them the coordinates $(x_0, y_0)$, $(x_1, y_1)$, and $(x_2, y_2)$. Then, we simply apply the Lagrange interpolation formula

$$ y = \frac{(x - x_1)(x - x_2)}{(x_0 - x_1)(x_0 - x_2)}\, y_0 + \frac{(x - x_0)(x - x_2)}{(x_1 - x_0)(x_1 - x_2)}\, y_1 + \frac{(x - x_0)(x - x_1)}{(x_2 - x_0)(x_2 - x_1)}\, y_2 . \tag{16} $$

Since

$$ x_1 - x_0 = x_2 - x_1 = \Delta f = \frac{F_s}{N} \tag{17} $$

is the frequency quantum, any point in the parabola has coordinates (x, y) related by

$$ y = \left[ (x - x_1)(x - x_2)\, y_0 - 2 (x - x_0)(x - x_2)\, y_1 + (x - x_0)(x - x_1)\, y_2 \right] \frac{1}{2 \Delta f^2} . \tag{18} $$

From this expression, it is straightforward to find the peak as the point where the derivative vanishes: $y' = \frac{dy}{dx} = 0$.

Phase following

Let us assume that the signal to be analyzed can be expressed as a sum of sinusoids with time-varying amplitude and frequency (sinusoidal model, see sec. 5.1.1):
$$ y(t) = \sum_{i=1}^{I} A_i(t)\, e^{j\phi_i(t)} , \tag{19} $$

with

$$ \phi_i(t) = \int_{-\infty}^{t} \omega_i(\tau)\, d\tau , \tag{20} $$

$\omega_i$ being the frequency of the i-th partial. For clarity, let us consider a signal containing only the i-th partial. The k-th bin of the m-th frame of the STFT gives

$$ Y_m(k) = \sum_{n=0}^{N-1} w(m-n)\, A_i(n)\, e^{j\phi_i(n)}\, e^{-j\frac{2\pi}{N}kn} \tag{21} $$

$$ \phantom{Y_m(k)} = e^{-j\frac{2\pi}{N}km} \sum_{r=m-N+1}^{m} w(r)\, A_i(m-r)\, e^{j\phi_i(m-r)}\, e^{j\frac{2\pi}{N}kr} . \tag{22} $$

In order to proceed with the accurate partial frequency estimation, we have to make the following assumption.

Assumption 1. Frequency and amplitude of the i-th component are constant within an STFT frame:

$$ \phi_i(m-r) = \phi_i(m) - r\,\omega_i \tag{23} $$

$$ A_i(m-r) = A_i(m) . \tag{24} $$

We see that

$$ Y_m(k) = e^{-j\frac{2\pi}{N}km}\, A_i(m)\, e^{j\phi_i(m)}\, W\!\left(\frac{2\pi}{N}k - \omega_i\right) , \tag{25} $$

where $A_i(m) e^{j\phi_i(m)}$ contains the amplitude and instantaneous phase of the sinusoid that falls within the k-th bin, and $W(\frac{2\pi}{N}k - \omega_i)$ is the window transform.

If we have access to the instantaneous phase, we can deduce the instantaneous frequency by backward difference between two adjacent frames. This can be done as long as we deal with the problem of phase unwrapping, due to the fact that the phase is known modulo 2π.

It can be shown [52, pp. 287-288] that phase unwrapping can be unambiguous under the following assumption.

Assumption 2. Calling H the hop size and $\frac{2\pi}{N}$ the separation between adjacent bins, let

$$ \frac{2\pi}{N} H < \pi . \tag{26} $$

Assumption 2 holds for rectangular windows and imposes $H < \frac{N}{2}$. For Hann or Hamming windows the hop size must be such that $H < \frac{N}{4}$ (75% overlap). Therefore the frame rate to be used for accurate partial estimation is higher than the minimal frame rate needed for perfect reconstruction.

4.2 Linear predictive coding (with Federico Fontana)

The analysis/synthesis method known as linear predictive coding (LPC) was introduced in the sixties as an efficient and effective means to achieve synthetic speech and speech signal communication [92]. The efficiency of the method is due to the speed of the analysis algorithm and to the low bandwidth required for the encoded signals. The effectiveness is related to the intelligibility of the decoded vocal signal.

The LPC implements a type of vocoder [10], which is an analysis/synthesis scheme where the spectrum of a source signal is weighted by the spectral components of the target signal that is being analyzed. The phase vocoder of figures 2 and 5 is a special kind of vocoder where amplitude and phase information of the analysis channels are retained and can be used as weights for complex sinusoids in the synthesis stage.
In the standard formulation of LPC, the source signal is either a pulse train or white noise, thus resembling voiced or unvoiced excitations of the vocal tract, respectively.

The basic assumption behind LPC is the correlation between the n-th sample and the P previous samples of the target signal. Namely, the n-th signal sample is represented as a linear combination of the previous P samples, plus a residual representing the prediction error:

$$ x(n) = -a_1 x(n-1) - a_2 x(n-2) - \dots - a_P x(n-P) + e(n) . \tag{27} $$

Equation (27) is an autoregressive formulation of the target signal, and the analysis problem is equivalent to the identification of the coefficients $a_1, \dots, a_P$ of an allpole filter. If we try to minimize the error in a mean-square sense, the problem translates into a set of P equations
$$ \sum_{k=1}^{P} a_k \sum_{n} x(n-k)\, x(n-i) = - \sum_{n} x(n)\, x(n-i) , \tag{28} $$

or

$$ \sum_{k=1}^{P} a_k R(i-k) = -R(i) , \quad i = 1, \dots, P , \tag{29} $$

where

$$ R(i) = \sum_{n} x(n)\, x(n-i) \tag{30} $$

is the signal autocorrelation.

In the z domain, equation (27) reduces to

$$ E(z) = A(z) X(z) , \tag{31} $$

where $A(z) = 1 + a_1 z^{-1} + \dots + a_P z^{-P}$ is the polynomial with coefficients $a_1, \dots, a_P$. In the case of voice signal analysis, the filter 1/A(z) is called the allpole formant filter because, if the proper order P is chosen, its magnitude frequency response follows the envelope of the signal spectrum, with its broad resonances called formants. The filter A(z) is called the inverse formant filter because it extracts from the voice signal a residual resembling the vocal tract excitation. A(z) is also called a whitening filter because it produces a residual having a flat spectrum. However, we distinguish between two kinds of residuals, both having a flat spectrum: the pulse train and the white noise, the first being the idealized vocal-fold excitation for voiced speech, the second being the idealized excitation for unvoiced speech. In reality, the residual is neither one of the two idealized excitations. At the resynthesis stage the choice is either to use an encoded residual, possibly choosing from a code book of templates, or to choose one of the two idealized excitations according to a voiced/unvoiced decision made by the analysis stage.

When the target signal is periodic (voiced speech), a pitch detector can be added to the analysis stage, so that the resynthesis can be driven by periodic replicas of a basic pulse, with the correct inter-pulse period. Several techniques are available for pitch detection, either using the residual or the target signal [53]. Although not particularly efficient, one possibility is to do a Fourier analysis of the residual and estimate the fundamental frequency by the techniques of section 4.1.5.

Summarizing, the information extracted in a frame by the analysis stage is:

• the prediction coefficients $a_1, \dots, a_P$;
• the residual e;
• pitch of the excitation residual;
• voiced/unvoiced information;
• signal energy (RMS amplitude).
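The normal equations (29) can be solved, for instance, with the Levinson-Durbin recursion. The sketch below is a textbook form of that recursion written with the sign convention of (27); the function names and the returned tuple are illustrative assumptions, and no windowing or pre-emphasis is applied.

```python
def autocorrelation(x, max_lag):
    """R(i) = sum_n x(n) x(n - i) for i = 0, ..., max_lag, as in (30)."""
    return [sum(x[n] * x[n - i] for n in range(i, len(x)))
            for i in range(max_lag + 1)]


def levinson_durbin(r, order):
    """Solve sum_k a_k R(i - k) = -R(i), i = 1..order.

    Returns (a, reflection, err): the coefficients a_1..a_P, the reflection
    coefficients of the lattice realization, and the residual energy.
    """
    a = [1.0] + [0.0] * order          # a[0] = 1 is the leading term of A(z)
    err = r[0]
    reflection = []
    for i in range(1, order + 1):
        acc = r[i] + sum(a[k] * r[i - k] for k in range(1, i))
        k_i = -acc / err
        reflection.append(k_i)
        new_a = a[:]
        for k in range(1, i):
            new_a[k] = a[k] + k_i * a[i - k]
        new_a[i] = k_i
        a = new_a
        err *= 1.0 - k_i * k_i         # prediction error energy shrinks
    return a[1:], reflection, err
```

Applied to the impulse response of a known allpole filter, the recursion returns the coefficients of that filter, which is a convenient sanity check before using it on frames of real signals.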
These parameters, possibly modified, are used in the resynthesis, as explained in section 5.1.3.

The equations (29) are solved via the well-known Levinson-Durbin recursion [53], which provides the reflection coefficients of the lattice realization of the filter $1/A(z)$. As we mentioned in section 2.2.4, the reflection coefficients are related to a piecewise-cylindrical model of the vocal tract. The LPC analysis proceeds by frames lasting a few milliseconds. In each frame the signal is assumed to be stationary and a new estimate of the coefficients is made. For the human vocal tract, P = 12 is a good estimate of the degrees of freedom that are needed to represent most articulations.

Besides its applications in voice coding and transformation, LPC can be useful whenever it is necessary to represent the shape of a stationary spectrum. Spectral envelope extraction by LPC analysis can be accurate as long as the filter order is carefully chosen, as depicted in figure 10. The accuracy depends on the kind of signal that is being analyzed, as the all-pole nature of the LPC filter gives a spectral envelope with rather sharp peaks.

Figure 10: DFT image (magnitude) of a target signal and frequency response of all-pole filters, identified via LPC with three different values of the order P. [Plot: magnitude in dB versus frequency in Hz (0 to 20000); curves: input, LPC: 8, LPC: 16, LPC: 32.]

Chapter 5 Sound Modelling
5.1 Spectral modelling

5.1.1 The sinusoidal model

A sound is expressed according to the sinusoidal model if it has the form

$$y(t) = \sum_{i=1}^{I} A_i(t)\, e^{j\phi_i(t)} , \qquad (1)$$

where $\phi_i(t) = \int_{-\infty}^{t} \omega_i(\tau)\,d\tau$, and $A_i(t)$ and $\omega_i(t)$ are the instantaneous magnitude and frequency of the $i$-th sinusoidal component, respectively. In practice, we consider discrete-time real signals. Therefore, we can write

$$y(n) = \sum_{i=1}^{I} A_i(n) \cos\left(\phi_i(n)\right) , \qquad (2)$$

with

$$\phi_i(n) = \int_{0}^{nT} \omega_i(\tau)\,d\tau + \phi_{0,i} . \qquad (3)$$

In principle, if $I$ is arbitrarily high, any sound can be expressed according to the sinusoidal model. This principle states the generality of the additive synthesis approach. In practice, the noise components would require a multitude of sinusoids, and it is therefore convenient to treat them separately by introducing a "stochastic" part $e(n)$:

$$y(n) = \underbrace{\sum_{i=1}^{I} A_i(n) \cos\left(\phi_i(n)\right)}_{\text{Deterministic Part}} + \underbrace{e(n)}_{\text{Stochastic Part}} . \qquad (4)$$

The separation of the stochastic part from the deterministic part can be done by means of the Short-Time Fourier Transform using the scheme of figure 1. Here, we rely on the fact that the STFT analysis retains the phases of the sinusoidal components, thus allowing a reconstruction that preserves the wave shape [94]. In this way, the deterministic part can be subtracted from the original signal to give the stochastic residual. One popular implementation of the scheme in figure 1 is found in the software sms, an acronym for spectral modeling synthesis [5] (footnote 1).
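The deterministic part of the model can be sketched as a bank of sinusoids whose phases are obtained by accumulating the instantaneous frequency, which approximates the integral of equation (3) with a running sum. This is a minimal sketch under stated assumptions: the function name, the per-sample track layout, and the use of frequencies in Hz are choices made for this example only.

```python
import math


def additive_synth(amps, freqs_hz, fs, phases0=None):
    """Sum of sinusoids with time-varying amplitude and frequency.

    amps[i][n] and freqs_hz[i][n] are per-sample tracks of partial i
    (hypothetical layout). The phase of each partial is the running sum
    of its instantaneous frequency, plus an initial phase.
    """
    n_partials = len(amps)
    n_samples = len(amps[0])
    phases = list(phases0) if phases0 else [0.0] * n_partials
    y = [0.0] * n_samples
    for i in range(n_partials):
        phi = phases[i]
        for n in range(n_samples):
            y[n] += amps[i][n] * math.cos(phi)
            phi += 2.0 * math.pi * freqs_hz[i][n] / fs  # phase increment
    return y


# One partial at 1000 Hz, sampled at 8000 Hz: period of 8 samples.
fs = 8000
N = 16
y = additive_synth([[1.0] * N], [[1000.0] * N], fs)
```

Because the frequency enters only through the phase increment, a frequency track that changes from sample to sample never produces a phase discontinuity.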
Figure 1: Separation of the sinusoidal components from a stochastic residual. [Block diagram; labels: sound, analysis window, FFT (magnitude, phase), peak detection and continuation (magnitude, frequency, and phase trajectories), additive synthesis, smoothing window, deterministic component, residual, spectral fitting (filter coefficients, noise level).]

Footnote 1: The executable of sms is freely downloadable from http://www.iua.upf.es/~sms/

Peak detection and continuation

In order to separate the sinusoidal part from the residual we have to detect and track the most prominent frequency peaks, as they are indicators of strong sinusoidal components. One strategy is to draw "guides" across the STFT frames [94], in such a way that prolongation by continuity fills local holes that may occur in peak trajectories. If a guide finds no evidence of its supporting peak for more than a certain number of frames, the guide is killed. Similarly, we start a new guide only when we detect a persistent peak. Therefore, the generation and destruction of guides is governed by hysteresis (see figure 2).

Figure 2: Hysteretic procedure for guide activation and destruction. [Diagram; labels: Birth, Death.]

In order to better capture the deterministic structure during transients, it is better to run the analysis backward in time, since in most cases a sharp attack is followed by a stable release, and peak tracking is more effective when stable states are reached gradually and suddenly released, rather than vice versa. If we can rely on the assumption of harmonicity of the analyzed sounds, the partial tracking algorithm can be "encouraged" by superposition of a harmonic comb onto the spectral profile.

For a good separation, frequencies and phases must be determined accurately, following the procedures described in section 4.1.5. Moreover, for the purpose of smooth resynthesis, the amplitudes of partials should be interpolated between frames, the most common choice being linear interpolation. Frequencies and phases should be interpolated as well, but one should be careful to ensure that the frequency track is always the derivative of the phase track.
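The hysteretic birth and death of guides can be sketched as follows. This is only an illustration of the idea, not the algorithm of [94]: the function names, the thresholds `max_jump`, `max_missed`, and `min_persist`, and the nearest-peak matching rule are all assumptions introduced for this example.

```python
def nearest(peaks, target, max_jump):
    """Return the peak closest to target, or None if none is within max_jump."""
    best = min(peaks, key=lambda f: abs(f - target), default=None)
    if best is not None and abs(best - target) <= max_jump:
        return best
    return None


def track_guides(frames, max_jump=50.0, max_missed=2, min_persist=2):
    """frames: one list of peak frequencies (Hz) per STFT frame.

    Returns the guides as lists of (frame_index, frequency) pairs.
    Birth: a peak must persist for min_persist frames to start a guide.
    Death: a guide is killed after more than max_missed frames without peak.
    """
    active, candidates, done = [], [], []
    for t, peaks in enumerate(frames):
        free = list(peaks)
        for g in active:                       # continue the active guides
            f = nearest(free, g["freq"], max_jump)
            if f is None:
                g["missed"] += 1               # local hole: keep the guide alive
            else:
                free.remove(f)
                g["freq"], g["missed"] = f, 0
                g["track"].append((t, f))
        for g in [g for g in active if g["missed"] > max_missed]:
            active.remove(g)                   # death
            done.append(g["track"])
        survivors = []
        for c in candidates:                   # peaks waiting to become guides
            f = nearest(free, c["freq"], max_jump)
            if f is None:
                continue                       # not persistent: discard
            free.remove(f)
            c["freq"] = f
            c["track"].append((t, f))
            if len(c["track"]) >= min_persist:
                active.append({"track": c["track"], "freq": f, "missed": 0})
            else:
                survivors.append(c)
        candidates = survivors + [{"track": [(t, f)], "freq": f} for f in free]
    done.extend(g["track"] for g in active)
    return done


# A partial near 440 Hz with a one-frame hole, plus a spurious peak at 1000 Hz.
frames = [[440.0], [442.0, 1000.0], [], [441.0], [443.0], [], [], [], []]
tracks = track_guides(frames)
```

In this toy input the isolated 1000 Hz peak never becomes a guide, while the guide near 440 Hz survives the missing frame and is killed only after the peak has been absent for three consecutive frames.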
Since a third-order polynomial is uniquely determined by four degrees of freedom, by using a cubic interpolating polynomial for the phase one may impose the instantaneous phases and frequencies at both ends of the interval between any pair of frames.

Resynthesis of the sinusoidal components

In the resynthesis stage, the sinusoidal components can be generated by any of the methods described in section 5.2, namely the digital oscillator in wavetable or recursive form, or the FFT-based technique. The latter will be more convenient when the sound has many sinusoidal components.

The DTFT of a windowed sinusoidal signal is the transform of the window, centered on the frequency of the sinusoid, and multiplied by a complex number whose magnitude and phase are the magnitude and phase of the sine wave. A signal that is the weighted sum of sinusoids gives rise, in the frequency domain, to a weighted sum of window transforms centered around different central frequencies. If the window has

A. a sufficiently high sidelobe attenuation,

we are allowed to consider only a restricted neighborhood of the window transform peak. The sound resynthesis can be achieved by inverse transformation of a series of STFT frames, and by the procedure of overlap and add applied to the time-domain frames. The signal reconstruction is free of artifacts if

B. the shifted copies of the window overlap and add to give a constant.

If $w$ is the window that fulfills property (A), and $\Delta$ is the window that fulfills property (B), we can use $w$ for the analysis and multiply the sequence by $\Delta/w$ after the inverse transformation [35]. Using two windows gives good flexibility in satisfying both the requirements (A) and (B). A particularly simple and effective window that satisfies property (B) is the triangular window.

This FFT-based synthesis (or FFT$^{-1}$ synthesis) is convenient when the sinusoidal model gives many sine components, because its complexity is largely due to the cost of the FFT, which is independent of the number of components.
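Property (B) is easy to check numerically for the triangular window. The sketch below is a minimal example under stated assumptions: it uses a triangular window of even length with a hop of half the window length, and the function names are hypothetical.

```python
def triangular_window(length):
    """Triangular window of even length: rises from 0 to 1, then falls."""
    half = length // 2
    return [n / half if n < half else (length - n) / half
            for n in range(length)]


def overlap_add_sum(window, hop, n_frames):
    """Sum of n_frames copies of the window, each shifted by hop samples."""
    total = [0.0] * (hop * (n_frames - 1) + len(window))
    for m in range(n_frames):
        for n, w in enumerate(window):
            total[m * hop + n] += w
    return total


window = triangular_window(8)
hop = 4
n_frames = 5
total = overlap_add_sum(window, hop, n_frames)
# Away from the first and last half-window, the shifted copies sum to 1.
steady = total[hop:hop * n_frames]
```

Only the first and last half-window of `total` deviate from the constant, because there a single copy of the window contributes.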
It is quite easy to introduce noise components with arbitrary frequency distribution just by adding complex numbers with the desired magnitude (and arbitrary phase) in the frequency domain.

Extraction of the residual

The extraction of a broad-spectrum noise residual could be performed either in the frequency domain or, as proposed in figure 1, directly by subtraction in the time domain. This is possible because the STFT analysis preserves the information on phase, thus allowing the wave shape to be preserved. The stochastic component can itself be represented on a frame-by-frame basis, but the corresponding frame can be smaller than the analysis frame so that transients are captured more accurately.

Residual spectral fitting

The stochastic component is modeled as broadband noise filtered by a linear coloring block. Such a decomposition corresponds to a subtractive synthesis model [78], whose parameters may be obtained by LPC analysis (see section 4.2). However, if the purpose of the sines-plus-noise decomposition is sound modification, it is more convenient to model the stochastic part in the frequency domain. The magnitude spectrum of the residual can be approximated by means of a piecewise-linear function, which is described by the coordinates of its joints. The time-domain resynthesis can then be obtained by inverse FFT, after having imposed the desired magnitude profile and a random phase profile.

Sound modifications

The sinusoidal model is interesting because it allows musical transformations to be applied to sounds that are taken from actual recordings. The separation of the stochastic residual from the sinusoidal part allows a separate treatment of the two components.

Examples of musical transformations are:

Coloring: the spectral profile can be changed at will;
Emphasizing: the stochastic or the sinusoidal components can be exaggerated;
Time Stretching: the temporal extension of the sound can be altered without pitch modifications and with limited artifacts;
Pitch Shifting: the pitch can be transposed without changing the sound length and with limited artifacts;
Morphing: for instance,
• the spectral envelope of a sound can be imposed on another sound;
• a residual from a different sound can be used for resynthesis.

Figure 3 shows the framework for performing these musical modifications.
Figure 3: Framework for performing music transformations. [Block diagram; labels: deterministic (sinusoidal) part (frequency, magnitude), musical transformations, additive synthesis; stochastic part (noise intensity, spectral shape coefficients), musical transformations, subtractive synthesis; sound; control.]

5.1.2 Sines + Noise + Transients

The fundamental assumption behind the sinusoids + noise model is that sound signals are composed of slowly-varying sinusoids and quasi-stationary broadband noises. This view is quite schematic, as it neglects the most interesting part of sound events: transients. Sound modifications would be much more easily achieved if transients could be taken apart and treated separately. For instance, in most musical instruments extending the duration of a note does not have any effect on the quality of the attack, which should be maintained unaltered in a time-stretching task. For these reasons, a new sines + noise + transients (SNT) framework for sound analysis was established [108].

The key idea of practical transient extraction comes from the observation that, as sinusoidal signals in the time domain are mapped to well-localized spikes in the frequency domain, by duality short pulses in the time domain would correspond to sine-like curves in the frequency domain. Therefore, the sinusoidal model can be applied in the frequency domain to represent these sinusoidal components. The scheme of the SNT decomposition is represented in figure 4.

Figure 4: Decomposition of a sound into sines + noise + transients. [Block diagram; labels: sound, sinusoidal modelling, sines, residual e1, DCT, sinusoidal modelling, transient detector, DCT$^{-1}$, transients, residual e2, noise modelling, noise.]

The DCT block in figure 4 represents the operation of Discrete Cosine Transform, defined as

$$C(k) = \alpha \sum_{n=0}^{N-1} x(n) \cos\left(\frac{(2n+1)k\pi}{2N}\right) . \qquad (5)$$

The DCT has the property that an impulse is transformed into a cosine, and a cluster of impulses becomes a superposition of cosines. Therefore, in the transformed domain it makes sense to use the sinusoidal model and to extract a second residue that is given by transient components.

5.1.3 LPC Modelling

As explained in section 4.2, Linear Predictive Coding can be used to model piecewise stationary spectra. The LPC synthesis proceeds according to the feedforward scheme of figure 5. Essentially, it is a subtractive synthesis algorithm where a spectrally-rich excitation signal is filtered by an all-pole filter. The excitation signal can be the residual $e$ that comes directly from the analysis, or it can be selected from a code book. Alternatively, we can make use of voiced/unvoiced information to generate an excitation signal that can be either a random noise or a pulse train. In the latter case, the pulse repetition period is derived from pitch information, available as a parameter.

Between the analysis and synthesis stages, several modifications are possible:

• pitch shifting, obtained by modification of the pitch parameter;
• time stretching, obtained by stretching the window where the signal is assumed to be stationary;
• data reduction, by model order reduction or residual coding.
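The LPC synthesis described above can be sketched as an idealized excitation driving an all-pole filter. This is a minimal sketch under stated assumptions: the function names are hypothetical, and the sign convention (the filter is $1/A(z)$ with $A(z) = 1 + \sum_k a_k z^{-k}$) is chosen for this example and may differ from the one used in the analysis equations.

```python
import random


def excitation(n_samples, voiced, period=80, seed=0):
    """Idealized excitation: pulse train if voiced, white noise otherwise."""
    if voiced:
        return [1.0 if n % period == 0 else 0.0 for n in range(n_samples)]
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(n_samples)]


def allpole_filter(e, a, gain=1.0):
    """y(n) = gain * e(n) - sum_{k=1..P} a[k] * y(n - k), with a[0] = 1."""
    y = []
    for n in range(len(e)):
        acc = gain * e[n]
        for k in range(1, len(a)):
            if n - k >= 0:
                acc -= a[k] * y[n - k]
        y.append(acc)
    return y


# Hypothetical second-order model; pitch is set by the pulse period.
a = [1.0, -1.2, 0.72]
e = excitation(200, voiced=True, period=50)
y = allpole_filter(e, a)
```

In this sketch, pitch shifting amounts to changing `period` while keeping `a` fixed, so the spectral envelope imposed by the filter is unaffected.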
Figure 5: LPC synthesis. [Block diagram; labels: $a_1, \dots, a_P$; $e$; pitch; v/uv; RMS amplitude; excitation synthesis; all-pole filter.]

5.2 Time-domain models

While the description of sound is more meaningful if done in the spectral domain, in many applications it is convenient to approach the sound synthesis directly in the time domain.

5.2.1 The Digital Oscillator

We have seen in section 5.1.1 how a complex sound made of several sinusoidal partials is conveniently synthesized by the FFT$^{-1}$ method. If the sinusoidal components are not too many, it may be convenient to synthesize each partial by means of a digital oscillator. From the obvious identity

$$e^{j\omega_0 (n+1)} = e^{j\omega_0}\, e^{j\omega_0 n} , \qquad (6)$$

and writing $e^{j\omega_0 n} = x_R(n) + j x_I(n)$, it is evident that the oscillator can be implemented by one complex multiplication, i.e., 4 real multiplications, at each time step:

$$x_R(n+1) = \cos\omega_0\, x_R(n) - \sin\omega_0\, x_I(n) \qquad (7)$$
$$x_I(n+1) = \sin\omega_0\, x_R(n) + \cos\omega_0\, x_I(n) . \qquad (8)$$

The initial amplitude and phase can be imposed by scaling the initial phasor $e^{j\omega_0 0}$ and adding a phase shift to its exponent. It is easy to show (footnote 2) that the calculation of $x_R(n+1)$ can also be performed as

$$x_R(n+1) = 2\cos\omega_0\, x_R(n) - x_R(n-1) , \qquad (9)$$

or, in other words, as the free response of the filter

$$H_R(z) = \frac{1}{1 - 2\cos\omega_0\, z^{-1} + z^{-2}} = \frac{1}{(1 - e^{-j\omega_0} z^{-1})(1 - e^{j\omega_0} z^{-1})} . \qquad (10)$$

The poles of the filter (10) lie exactly on the unit circle, at the limit of the stability region. Therefore, after the filter has received an initial excitation, it keeps ringing forever. If we call $x_{R1}$ and $x_{R2}$ the two state variables containing the previous samples of the output variable $x_R$, an initial phase $\phi_0$ can be imposed by setting (footnote 3)

$$x_{R1} = \sin(\phi_0 - \omega_0) \qquad (11)$$
$$x_{R2} = \sin(\phi_0 - 2\omega_0) . \qquad (12)$$

The digital oscillator is particularly convenient for sound synthesis on general-purpose processors, where floating-point arithmetic is available at no additional cost. However, this method for generating sinusoids has two main drawbacks:

• Updating the parameter (i.e., the oscillation frequency) requires computing a cosine function. This is a problem for audio-rate modulations, where to compute a modulated sine we need to compute a cosine at each time sample.
• Changing the oscillation frequency changes the sinusoid amplitude as well. Therefore, some amplitude control logic is needed.

Footnote 2: The reader is invited to derive the difference equation (9).

5.2.2 The Wavetable Oscillator

The most classic and versatile approach to the synthesis of periodic waveforms (sinusoids included) is the cyclic reading of a table where a waveform period is pre-stored. If the waveform to be synthesized is a sinusoid, symmetry considerations allow us to store only one fourth of the period and play with the index arithmetic to reconstruct the whole period. Call buf the buffer that contains the waveform period, or wavetable. The wavetable oscillator works by circularly accessing the wavetable at multiples of an increment $I$ and reading the wavetable content at that position.
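Returning briefly to the recursive oscillator of section 5.2.1, the second-order recursion with the initial states $\sin(\phi_0 - \omega_0)$ and $\sin(\phi_0 - 2\omega_0)$ can be sketched as follows. The function name and the test values are assumptions made for this example; the sketch simply compares the recursion against a direct evaluation of the sine.

```python
import math


def recursive_oscillator(w0, phi0, n_samples):
    """Generate sin(w0 * n + phi0) with one multiplication per sample."""
    c = 2.0 * math.cos(w0)            # the only trigonometric coefficient
    x1 = math.sin(phi0 - w0)          # state holding x(n - 1)
    x2 = math.sin(phi0 - 2.0 * w0)    # state holding x(n - 2)
    out = []
    for _ in range(n_samples):
        x0 = c * x1 - x2              # free response of the resonant filter
        out.append(x0)
        x2, x1 = x1, x0
    return out


w0 = 2.0 * math.pi * 440.0 / 44100.0
phi0 = 0.3
osc = recursive_oscillator(w0, phi0, 1000)
reference = [math.sin(w0 * n + phi0) for n in range(1000)]
max_error = max(abs(u - v) for u, v in zip(osc, reference))
```

Note that `c` has to be recomputed with a cosine whenever the frequency changes, which is the first drawback listed for this oscillator.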
Footnote 3: The reader can verify, using formulas (29–32) of appendix A, that $x_R(0) = \sin\phi_0$, given $x_R(-1) = x_{R1}$ and $x_R(-2) = x_{R2}$.

If $B$ is the buffer length, and $f_0$ is the frequency that we want to generate at the sample rate $F_s$, the increment has to be set to

$$I = \frac{B f_0}{F_s} . \qquad (13)$$

It is easy to realize that the reading pointer accesses the wavetable at indexes that are, in general, fractional. Therefore, some form of interpolation has to be used. The following strategies have an increasing degree of accuracy (and complexity):

Truncation: buf[$\lfloor$index$\rfloor$]
Rounding: buf[$\lfloor$index + 0.5$\rfloor$]
Linear Interpolation: buf[$\lceil$index$\rceil$] (index − $\lfloor$index$\rfloor$) + buf[$\lfloor$index$\rfloor$] (1 − index + $\lfloor$index$\rfloor$)
Higher-order polynomial interpolation
"Multirate" interpolation: the problem is recast as a sampling-rate conversion.

By increasing the complexity of interpolation it is possible, given a certain level of acceptable digital noise, to decrease the wavetable size [41]. Linear interpolation is particularly attractive for implementations in custom or specialized hardware (see section B.5.1 of appendix B). The most-significant bits of the index can be used to access the buffer locations, and the least-significant bits are used to approximate the quantity (index − $\lfloor$index$\rfloor$) in the computation of the interpolation.

Sampling-rate conversion

The problem of designing a wavetable oscillator can be recast as a problem of sampling-rate conversion, i.e., transforming a signal sampled at rate $F_{s,1}$ into its copy resampled at rate $F_{s,2}$. If $F_{s,2}/F_{s,1} = L/M$, with $L$ and $M$ irreducible integers, we can resample by:

1. Upsampling by a factor $L$
2. Lowpass filtering
3. Downsampling by a factor $M$.

Figure 6 represents these three operations as a cascade of linear (but non-time-invariant) blocks, where the upward arrow denotes upsampling (or introducing zeros between nonzero samples) and the downward arrow denotes downsampling (or decimating).
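A wavetable oscillator with the increment of equation (13) and linear interpolation can be sketched as follows. This is a minimal example, not a reference implementation: the function names, the table size of 1024, and the use of a full-period sine table (rather than a quarter period) are assumptions made for this example.

```python
import math


def make_sine_table(size):
    """One full period of a sine wave stored in a table of the given size."""
    return [math.sin(2.0 * math.pi * i / size) for i in range(size)]


def wavetable_oscillator(buf, f0, fs, n_samples):
    """Read buf circularly with a fractional increment, interpolating linearly."""
    size = len(buf)
    increment = size * f0 / fs          # equation (13)
    index = 0.0
    out = []
    for _ in range(n_samples):
        i0 = int(index)                 # floor, since index >= 0
        frac = index - i0               # fractional part of the index
        i1 = (i0 + 1) % size            # next sample, wrapping around
        out.append(buf[i0] * (1.0 - frac) + buf[i1] * frac)
        index = (index + increment) % size
    return out


table = make_sine_table(1024)
tone = wavetable_oscillator(table, 440.0, 44100.0, 2000)
ideal = [math.sin(2.0 * math.pi * 440.0 * n / 44100.0) for n in range(2000)]
interp_error = max(abs(u - v) for u, v in zip(tone, ideal))
```

With a 1024-point table the interpolation error stays several orders of magnitude below the signal amplitude; replacing the interpolation with truncation (using `buf[i0]` alone) makes the error noticeably larger, which illustrates the trade-off between table size and interpolation complexity.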
Figure 6: Block decomposition of resampling. [Block diagram: $x(n)$ at rate $F_s$, upsampling by $L$ giving $x'$ at rate $F_s L$, filter $h(n)$ giving $y'$ at rate $F_s L$, downsampling by $M$ giving $y(m)$ at rate $F_s L/M$.]

Figure 7 shows the spectral effects of the various stages of resampling when $L/M = 3/2$. If the interpolation is realized by sampling-rate conversion the problem reduces to designing a good lowpass filter. However, since the resampling ratio $L/M$ changes for each different pitch that is obtained from the same wavetable, the characteristics of the lowpass filter have to be made pitch-dependent. Alternatively, a set of filters can be designed to accommodate all possible pitches, and the appropriate coefficient set is selected at run time [55].

Figure 7: Example of resampling with $L/M = 3/2$. [Four spectra versus frequency $f$: $X(f)$, $X'(f)$, $Y'(f)$, $Y(f)$, with marks at $\pm F_b$, $F_s/2$, $F_s$, $3F_s/2$, and $3F_s$.]

5.2.3 Wavetable sampling synthesis

Wavetable sampling synthesis is the extension of the wavetable oscillator to

• Non-sinusoidal waveforms;
• Wavetables storing several periods.

Usually, this kind of sound synthesis is based on the following tricks:

• The attack transient is reproduced "faithfully" by straight sampling;
• A selection of periods of the central part of the sound (sustain) is stored in a buffer and cyclically read (loop). The increment is selected in order to produce the desired pitch;
• The keyboard (footnote 4) is divided into segments of contiguous notes (splits). Each split uses transpositions of the same sample;
• Different dynamic levels are obtained by
  – Sampling at different dynamic levels and obtaining the intermediate samples by interpolation, or
  – Sampling fortissimo notes and obtaining lower intensities by dynamic filtering (usually lowpass).

Footnote 4: The keyboard metaphor is used very often even for sound timbres that do not come from keyboard instruments.

In wavetable sampling synthesis, the control signals are extremely important to achieve a natural sound behavior. The control signals are tied to the evolution of the musical gesture, thus evolving much more slowly than audio signals. Therefore, a control rate can be used to generate signals for

• Temporal envelopes (e.g., Attack - Decay - Sustain - Release);
• Low-Frequency Oscillators (LFO) for vibrato and tremolo;
• Dynamic control of filters.

5.2.4 Granular synthesis (with Giovanni De Poli)

Short wavetables can be read at different speeds and the resulting sound grains can be concatenated and overlapped in time. This time-domain approach to sound synthesis is called granular synthesis. Granular synthesis starts from the idea of analyzing sounds in the time domain by representing them as sequences of short elements called "grains". The parameters of this technique are the waveform of the grain $g_k(\cdot)$, its temporal location $l_k$, and its amplitude $a_k$:

$$s_g(n) = \sum_{k} a_k\, g_k(n - l_k) . \qquad (14)$$

A complex and dynamic acoustic event can be constructed starting from a large quantity of grains. The features of the grains and their temporal locations determine the sound timbre. We can see it as being similar to cinema, where a rapid sequence of static images gives the impression of objects in movement. The initial idea of granular synthesis dates back to Gabor [26], while in music it arises from early experiences of tape electronic music. The choice of parameters can be made according to various criteria, driven by interpretation models. In general, granular synthesis is not a single synthesis model but a way of realizing many different models using waveforms that are locally defined. The choice of the interpretation model implies operational processes that may affect the sonic material in various ways.

The most important and classic type of granular synthesis (asynchronous granular synthesis) distributes grains irregularly on the time-frequency plane in the form of clouds [77]. The grain waveform is

$$g_k(i) = w_d(i) \cos(2\pi f_k T_s i) , \qquad (15)$$

where $w_d(i)$ is a window of length $d$ samples that controls the time span and the spectral bandwidth around $f_k$. For example, randomly scattered grains within a mask, which delimits a particular frequency/amplitude/time region, result in a sound cloud or musical texture that varies over time. The density of the grains within the mask can be controlled. As a result, articulated sounds can be modeled and, wherever there is no interest in controlling the microstructure exactly, problems involving the detailed control of the temporal characteristics of the grains can be avoided. Another peculiarity of granular synthesis is that it eases the design of sound events as parts of a larger temporal architecture.
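An asynchronous cloud of windowed-cosine grains can be sketched as follows. This is a minimal illustration under stated assumptions: the Hann window, the uniform random distributions, the amplitude range, and all function names are choices made for this example; the text only prescribes a window of length $d$ and a mask delimiting frequency, amplitude, and time.

```python
import math
import random


def hann(d):
    """Hann window of length d (one possible choice for the grain window)."""
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * i / d) for i in range(d)]


def grain(freq_hz, d, fs):
    """Windowed cosine grain of length d samples at frequency freq_hz."""
    w = hann(d)
    return [w[i] * math.cos(2.0 * math.pi * freq_hz * i / fs) for i in range(d)]


def granular_cloud(n_samples, n_grains, fs, f_range, d, seed=0):
    """Scatter grains at random times and frequencies inside a simple mask."""
    rng = random.Random(seed)
    out = [0.0] * n_samples
    for _ in range(n_grains):
        g = grain(rng.uniform(*f_range), d, fs)   # grain waveform
        location = rng.randrange(0, n_samples - d + 1)
        amplitude = rng.uniform(0.1, 1.0)
        for i, v in enumerate(g):
            out[location + i] += amplitude * v    # overlap and sum the grains
    return out


# 0.1 s cloud of 50 grains, 10 ms each, between 400 and 1200 Hz.
cloud = granular_cloud(4410, 50, 44100, (400.0, 1200.0), 441)
```

The grain density is controlled here by `n_grains` relative to `n_samples`, and the mask by `f_range`; a time-varying mask would simply make these bounds depend on the grain location.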
For composers, this means a unification of compositional metaphors on different scales and, as a consequence, control over a time continuum ranging from milliseconds to tens of seconds. There are psychoacoustic effects that can be easily explored by using this algorithm, for example crumbling effects and waveform fusions, which have their counterparts in the effects of separation and fusion of tones.

5.3 Nonlinear models

5.3.1 Frequency and phase modulation

The most popular nonlinear synthesis technique is certainly frequency modulation (FM). In electrical communications, FM has been used for decades, but its use as a sound synthesis algorithm in the discrete-time domain is due to John Chowning [23]. Essentially, Chowning was doing experiments on different extents of vibrato applied to simple oscillators, when he realized that fast vibrato rates produce dramatic timbral changes. Therefore, modulating the frequency of an oscillator was enough to obtain complex audio spectra. Chowning's FM model is:

$$x(n) = A \sin\left(\omega_c n + I \sin(\omega_m n)\right) = A \sin\left(\omega_c n + \phi(n)\right) , \qquad (16)$$

where $\omega_c$ is called the carrier frequency, $\omega_m$ is called the modulation frequency, and $I$ is the modulation index. Strictly speaking, equation (16) represents a phase modulation because it is the instantaneous phase that is driven by the modulator. However, when both the modulator and the carrier are sinusoidal, there is no substantial difference between phase modulation and frequency modulation. The instantaneous frequency of (16), obtained by differentiating the instantaneous phase, is

$$\omega(n) = \omega_c + I\omega_m \cos(\omega_m n) , \qquad (17)$$

or, in Hertz,

$$f(n) = f_c + I f_m \cos(2\pi f_m n) . \qquad (18)$$

Figure 8 shows a pd patch implementing the simple FM algorithm. The modulation frequency is used to control an oscillator directly, while the carrier frequency controls a phasor~ unit generator. This block generates the cyclical phase ramp that, when given as index of a cosinusoidal table, produces the same result as the osc~ unit generator. However, this decomposition of the oscillator into two parts (i.e., the phase generation and the table read) makes it possible to sum the output coming from the modulator directly to the phase of the carrier.

Figure 8: pd patch for phase modulation. Adapted from a help patch of the pd distribution.

Given the carrier and modulation frequencies, and the modulation index, it is possible to predict the distribution of components in the frequency spectrum of the resulting sound.
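The simple FM model of equation (16) can be sketched in two equivalent ways: by direct evaluation, and by splitting the carrier into a wrapped phase ramp followed by a sinusoidal read, so that the modulator output is summed to the carrier phase. This is a minimal sketch: the function names and parameter values are assumptions made for this example, and it is not a transcription of the pd patch.

```python
import math


def fm_direct(fc, fm, index, fs, n_samples, amp=1.0):
    """x(n) = amp * sin(wc n + index * sin(wm n)), frequencies in Hz."""
    wc = 2.0 * math.pi * fc / fs
    wm = 2.0 * math.pi * fm / fs
    return [amp * math.sin(wc * n + index * math.sin(wm * n))
            for n in range(n_samples)]


def fm_phasor(fc, fm, index, fs, n_samples, amp=1.0):
    """Same signal, built from two phase ramps wrapped to [0, 1) cycles."""
    out = []
    carrier_phase = 0.0
    mod_phase = 0.0
    for _ in range(n_samples):
        modulator = math.sin(2.0 * math.pi * mod_phase)
        # The modulator is added to the carrier phase before the sine read.
        out.append(amp * math.sin(2.0 * math.pi * carrier_phase
                                  + index * modulator))
        carrier_phase = (carrier_phase + fc / fs) % 1.0
        mod_phase = (mod_phase + fm / fs) % 1.0
    return out


direct = fm_direct(440.0, 110.0, 2.0, 44100.0, 2000)
ramped = fm_phasor(440.0, 110.0, 2.0, 44100.0, 2000)
difference = max(abs(u - v) for u, v in zip(direct, ramped))
```

With `index = 0` both functions reduce to a plain sinusoid at the carrier frequency.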
The prediction of the spectrum is based on the trigonometric identity [1]

$$x(n) = A \sin\left(\omega_c n + I \sin(\omega_m n)\right) = A \Big\{ \underbrace{J_0(I) \sin(\omega_c n)}_{\text{carrier}} + \underbrace{\sum_{k=1}^{\infty} J_k(I) \left[ \sin\left((\omega_c + k\omega_m) n\right) + (-1)^k \sin\left((\omega_c - k\omega_m) n\right) \right]}_{\text{side frequencies}} \Big\} , \qquad (19)$$

where $J_k(I)$ is the $k$-th order Bessel function of the first kind. These Bessel functions are plotted in figure 9 for several values of $k$ (number of side frequency) and $I$ (modulation index).

Figure 9: Bessel functions of the first kind. [Surface plot: $J_k(I)$ versus modulation index (0 to 15) and number of side frequency.]

Therefore, the effect of phase modulation is to introduce side components that are shifted in frequency from the carrier by multiples of $\omega_m$ and whose amplitude is governed by $J_k(I)$. Generally speaking, the larger the modulation index, the wider the sound bandwidth. Since the number of side components that are stronger than one hundredth of the carrier magnitude is approximately

$$M = I + 0.24\, I^{0.27} , \qquad (20)$$

the bandwidth is approximately

$$BW = 2\left(I + 0.24\, I^{0.27}\right)\omega_m \approx 2 I \omega_m . \qquad (21)$$

If the ratio $\omega_c/\omega_m$ is rational the resulting spectrum is harmonic, and the partials are multiples of the fundamental frequency

$$\omega_0 = \frac{\omega_c}{N_1} = \frac{\omega_m}{N_2} , \qquad (22)$$

where

$$\frac{N_1}{N_2} = \frac{\omega_c}{\omega_m} , \quad \text{with } N_1, N_2 \text{ an irreducible pair.} \qquad (23)$$

For instance, if $N_2 = 1$, all the harmonics are present, and if $N_2 = 2$ only the odd harmonics are present. When calculating the spectral components, some of the partials on the left of the carrier may assume a negative frequency. Since $\sin(-\theta) = -\sin\theta = \sin(\theta - \pi)$, these components have to be flipped onto the positive axis and summed (magnitude and phase) with the components possibly already present at those frequencies.

Complex carrier

We can have a bank of oscillators sharing a single modulator or, equivalently, a non-sinusoidal carrier. In this case, each sinusoidal component of the complex carrier is enriched by side components as if it were the carrier of a simple FM pair. One application of FM with a complex carrier is the construction of vowel-like spectra, as was demonstrated by Chowning in the eighties.
Each partial of the carrier may be associated with the center of one formant, i.e., a prominent lobe in the envelope of the magnitude spectrum. For a given person's voice, each vowel is characterised by a certain frequency distribution of formants.

Exercise

The reader is invited to implement an FM instrument (in, e.g., Octave or pd) that reproduces the vowel /a/, whose formants are found at 700, 1200, and 2500 Hz. How can a vibrato be implemented in such a way that the formant position remains fixed?

Complex modulator

The modulating waveform can be non-sinusoidal. In this case the analysis can be quite complicated. For instance, a modulator with two partials $\omega_1$ and $\omega_2$, acting on a sinusoidal carrier, gives rise to the expansion

$$x(n) = A \sum_{k} \sum_{m} J_k(I_1)\, J_m(I_2) \sin\left((\omega_c + k\omega_1 + m\omega_2)\, n\right) . \qquad (24)$$

Partials are found at the positions $|\omega_c \pm k\omega_1 \pm m\omega_2|$. If $\omega_M = \mathrm{GCD}(\omega_1, \omega_2)$ is the greatest common divisor of the two modulating frequencies, the spectrum has partials at $|\omega_c \pm k\omega_M|$. For instance, a carrier $f_c = 700$ Hz and a modulator with partials at $f_1 = 200$ Hz and $f_2 = 300$ Hz produce a harmonic spectrum with fundamental at 100 Hz. The advantage of using complex modulators in this case is that the spectral envelope can be controlled with more degrees of freedom.

Feedback FM

A sinusoidal oscillator can be used to phase-modulate itself. This is a feedback mechanism that, with a unit-sample feedback delay, can be expressed as

$$x(n) = \sin\left(\omega_c n + \beta x(n-1)\right) , \qquad (25)$$

where $\beta$ is the feedback modulation index. The trigonometric expansion

$$x(n) = \sum_{k} \frac{2}{k\beta}\, J_k(k\beta) \sin(k\omega_c n) \qquad (26)$$

holds for the output signal. By a gradual increase of $\beta$ we can gradually transform a pure sinusoidal tone into a sawtooth wave [78]. If the feedback delay is longer than one sample we can easily produce routes to chaotic behaviors as $\beta$ is increased [12, 15].

FM with Amplitude Modulation

By introducing a certain degree of amplitude modulation we can achieve a more compact distribution of partials around the modulating frequency. In particular, we can use the expansion [74] (footnote 5)

$$e^{I \cos(\omega_m n)} \sin\left(\omega_c n + I \sin(\omega_m n)\right) = \sin(\omega_c n) + \sum_{k=1}^{\infty} \frac{I^k}{k!} \sin\left((\omega_c + k\omega_m)\, n\right) , \qquad (27)$$

which follows from the series of the complex exponential, to produce a sequence of partials that fade out as $I^k/k!$ in frequency, starting from the carrier. Figure 10 shows the magnitude spectrum of the sound produced by the mixed amplitude/frequency modulation (27) with carrier frequency at 3000 Hz, modulator at 1500 Hz, modulation index $I = 0.2$, and sample rate $F_s = 22100$ Hz.
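The expansion used for the mixed amplitude/frequency modulation can be checked numerically. The sketch below assumes that the series coefficients are $I^k/k!$, which is what the power series of the complex exponential $e^{I e^{j\omega_m n}}$ gives; the function names and the test frequencies (in radians per sample) are hypothetical.

```python
import math


def amfm_direct(wc, wm, index, n):
    """Left-hand side: exponential amplitude modulation times phase modulation."""
    return (math.exp(index * math.cos(wm * n))
            * math.sin(wc * n + index * math.sin(wm * n)))


def amfm_series(wc, wm, index, n, terms=30):
    """Right-hand side: carrier (k = 0) plus upper side frequencies only."""
    return sum(index ** k / math.factorial(k) * math.sin((wc + k * wm) * n)
               for k in range(terms))


wc, wm, index = 0.3, 0.11, 0.2
series_error = max(abs(amfm_direct(wc, wm, index, n)
                       - amfm_series(wc, wm, index, n))
                   for n in range(256))
```

The series contains only components above the carrier, so the resulting spectrum is one-sided, and for small modulation indexes the partial amplitudes decay very quickly with $k$.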
Footnote 5: The reader is invited to verify the expansion (27) using an Octave script with
    wm = 100; wc = 200; I = 0.2; n = [1:4096];
    y1 = exp(I*cos(wm*n)) .* sin(wc*n + I*sin(wm*n));

Figure 10: Spectrum of a sound produced by amplitude/frequency modulation as in (27). [Plot: magnitude spectrum in dB versus frequency in Hz, 0 to 10000.]

Discussion

Synthesis by frequency modulation was very popular in the eighties, especially because it was implemented in the most successful synthesizer of all time: the Yamaha DX7. At that time, obtaining complex time-varying spectra with a few multiplies and adds was a major achievement. There was a theory that made it possible to predict the spectra from the parameters, and the bandwidth of FM sounds could be controlled smoothly by means of the modulation index. However, it proved difficult to obtain FM patches starting from the analysis of real sounds, so that the most successful reproductions have been based on intuition and multiple trials. Some of the parameters (such as the carrier/modulator frequency ratio) are too critical and non-intuitive. Namely, small changes in a modulator frequency produce dramatic changes in timbre. The modulation index itself, despite displaying a global intuitive behavior, is related to each single partial amplitude by means of exotic functions that have no relationship with the human hearing system.

5.3.2 Nonlinear distortion

Sound synthesis by nonlinear distortion (NLD), or waveshaping [8], is conceptually very simple: the oscillator output is used as the argument of a nonlinear function. In the discrete-time digital domain, the nonlinear function is stored in a table, and the oscillator output is used as an index to access the table. The interesting thing about NLD is that there is a theory that makes it possible to design the distorting table given certain specifications of the desired spectrum. If the oscillator is sinusoidal, we can formulate NLD as

$$x(n) = A \cos(\omega_0 n) \qquad (28)$$
$$y(n) = F(x(n)) . \qquad (29)$$

For the nonlinear function, we use Chebyshev polynomials [1].
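As a hedged illustration of why Chebyshev polynomials are suitable distorting functions, the sketch below uses their standard three-term recursion ($T_0 = 1$, $T_1 = x$, $T_n = 2xT_{n-1} - T_{n-2}$) and the identity $T_n(\cos\theta) = \cos n\theta$. The function names and the dictionary of harmonic amplitudes are assumptions made for this example, and the polynomial is evaluated directly rather than stored in a table.

```python
import math


def chebyshev(n, x):
    """Evaluate the degree-n Chebyshev polynomial at x by recursion."""
    t_prev, t = 1.0, x
    if n == 0:
        return t_prev
    for _ in range(n - 1):
        t_prev, t = t, 2.0 * x * t - t_prev
    return t


def waveshaper(harmonics):
    """Build F(x) = sum_k h_k T_k(x) from a dict {k: h_k}."""
    def distort(x):
        return sum(h * chebyshev(k, x) for k, h in harmonics.items())
    return distort


# Target spectrum: fundamental plus a third harmonic at half amplitude.
shape = waveshaper({1: 1.0, 3: 0.5})
w0 = 0.05
shaped = [shape(math.cos(w0 * n)) for n in range(200)]
target = [math.cos(w0 * n) + 0.5 * math.cos(3.0 * w0 * n) for n in range(200)]
shaping_error = max(abs(u - v) for u, v in zip(shaped, target))
```

The match between `shaped` and `target` holds only for a unit-amplitude cosine at the input; with a different input amplitude the harmonics mix and the spectrum changes.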
The degreen Chebyshev polynomial is deﬁned by the recursive relation: T0 (x) = 1 T1 (x) = x Tn (x) = 2xTn−1 (x) − Tn−2 (x) , and it has the property Tn (cos θ) = cos nθ . (31) (30) In virtue of property (31), if the nonlinear distorting function is a degreem Chebyshev polynomial, the output y , obtained by using a sinusoidal oscillator x(n) = cos ω0 n, is y (n) = cos (mω0 n), i.e., the mth harmonic of x. In order to produce the spectrum y (n) =
k hk cos (kω0 n) , (32) it is sufﬁcient to use the linear composition of Chebyshev functions F (x) =
k hk Tk (x) (33) as a nonlinear distorting function. Varying the oscillator amplitude A, the amount of distortion and the spectrum of the output sound are varied as well. However, the overall output amplitude does also vary as a side effect, and some form of compensation has to be introduced if a constant amplitude is desired. This is a clear drawback of NLD as compared to FM. Timevarying spectral variations can also be introduced by adding a control signal to the oscillator output x, so that the nonlinear function is dynamically shifted. Sound Modelling 137 5.4 Physical models Instead of trying to model the air pressure signal as it appears at the entrance of the ear canal, we can simulate the physical behavior of mechanical systems that produce sound as a side effect. If the simulation is accurate enough, we would obtain veridical sound dynamics and a detailed control in terms of physical variables. This allows direct manipulation of the sound synthesis model and direct coupling with gestural controllers. 5.4.1 A physical oscillator Let us consider a simple mechanical massspringdamper system, as depicted in ﬁgure 11. Let f be an exogenous force that drives the system. It is a mechanical series connection, as the components share the same x position and the forces sum up to zero: fm = fR + fk + f ⇒ mx = −Rx − kx + f . ¨ ˙ (34) By taking the Laplace transform of (34) (with null initial conditions) we get R
Figure 11: Mass-spring-damper system

the algebraic relationship

$$s^2 m X(s) + s R X(s) + k X(s) = F(s) \tag{35}$$

and we can derive the transfer function between the forcing term $f$ and the displacement $x$:

$$H(s) = \frac{X(s)}{F(s)} = \frac{1/m}{s^2 + \frac{R}{m}s + \frac{k}{m}} \tag{36}$$

The system oscillates with characteristic frequency $\Omega_0 = \sqrt{k/m} = 2\pi f_0$, and the damping coefficient is $\rho = R/m$. The quality factor of the system is $Q = \Omega_0/\rho$, and it is the number of cycles that the characteristic oscillation takes to attenuate by a factor $1/e^{\pi}$. The damping coefficient $\rho$ is proportional to the resonance bandwidth. If we use the bilinear transformation to discretize the transfer function (36), we obtain the discrete-time system described by the transfer function

$$H(z) = \frac{1 + 2z^{-1} + z^{-2}}{mh^2 + Rh + k + 2(k - mh^2)z^{-1} + (k + mh^2 - Rh)z^{-2}} = \frac{b_0 + b_1 z^{-1} + b_2 z^{-2}}{1 + a_1 z^{-1} + a_2 z^{-2}} \tag{37}$$

Therefore, the damped mechanical oscillator can be simulated by means of a second-order discrete-time filter. For instance, the Direct Form I realization, depicted in figure 24 of chapter 2, can be used for this purpose. We notice that there is a delay-free path that connects the input $f$ with the output $x$, and this may be a problem when connecting several simulations of physical blocks together.

5.4.2 Coupled oscillators

Let us consider the system obtained by coupling the mass-spring-damper oscillator with a second mass-spring system (see figure 12):

$$m_1\ddot{x}_1 = -k_1(x_1 - x_2) - R(\dot{x}_1 - \dot{x}_2) + f$$
$$m_2\ddot{x}_2 = k_1(x_1 - x_2) - k_2 x_2 + R(\dot{x}_1 - \dot{x}_2) \tag{38}$$

Using the Laplace transform, the system (38) can be converted into
Figure 12: Two coupled mechanical oscillators

$$X_1(s) = \frac{1}{m_1 s^2 + Rs + k_1}\left[F(s) + (k_1 + Rs)X_2(s)\right] = H_1(s)\left[F(s) + G(s)X_2(s)\right]$$
$$X_2(s) = \frac{1}{m_2 s^2 + Rs + (k_1 + k_2)}\,(k_1 + Rs)X_1(s) = H_2(s)G(s)X_1(s) \tag{39}$$

and this can be represented as a feedback connection of filters, as depicted in figure 13.

Figure 13: Block decomposition of the coupled oscillators

This simple example gives us the possibility to discuss a few different ways of looking at physical models. One of these ways is the cellular approach, where complex linear systems are obtained by connecting mass points ($H_1$ and $H_2$ in our example) and visco-elastic links. This approach is the basis of the CORDIS-ANIMA software developed at ACROE in Grenoble [20]. Another possibility is to look for functional blocks in the system decomposition. In figure 13 we have outlined three functional blocks:

• E - exciter: a dynamic physical system that can elicit and sustain an oscillation by means of an external forcing term;
• R - resonator: a dynamic physical system (with small losses) that sustains the oscillations;
• I - interaction: a system that connects E and R in such a way that the physical variables at the two ends are compatible.

Although in our example the resonator is a lumped mechanical oscillator, usually the resonator is a medium where waves propagate. Therefore the resonator is a distributed system, described by partial differential equations (PDE). Among the different ways of discretizing it, we mention

• Network of elementary coupled oscillators (cellular models);
• Numerical integration of the PDE (for instance, finite difference methods);
• Discretization of the solutions of the PDE (waveguide models).

The exciter is usually a lumped system described by ordinary differential equations (ODE) that can be integrated using numerical methods, the bilinear transformation, or the impulse invariance method.
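As a hedged illustration of the bilinear-transformation route for a lumped second-order system, the sketch below simulates the damped oscillator of section 5.4.1 with the coefficients of (37). The numerical values of the mass, damping, characteristic frequency and sampling rate are arbitrary assumptions, and the constant h = 2*Fs assumes the bilinear mapping s = h(1 - z^-1)/(1 + z^-1); neither is prescribed by this section.

```octave
% Mass-spring-damper simulated through the bilinear transformation, eq. (37).
% All numerical values below are illustrative assumptions.
Fs = 44100;              % sampling rate [Hz]
h  = 2*Fs;               % bilinear constant, assuming s = h*(1 - z^-1)/(1 + z^-1)
m  = 0.001;              % mass [kg]
f0 = 440;                % characteristic frequency [Hz]
k  = m*(2*pi*f0)^2;      % stiffness chosen so that sqrt(k/m) = 2*pi*f0
R  = 0.01;               % damping [kg/s], giving rho = R/m = 10 1/s

% Numerator and denominator of (37), normalized so that a(1) = 1.
a0 = m*h^2 + R*h + k;
b  = [1, 2, 1]/a0;
a  = [a0, 2*(k - m*h^2), k + m*h^2 - R*h]/a0;

f = [1, zeros(1, Fs - 1)];   % unit impulse as the driving force
x = filter(b, a, f);         % displacement: one second of damped oscillation

poles   = roots(a);          % poles of the discrete-time filter
dc_gain = sum(b)/sum(a);     % H(z = 1), which should equal the static compliance 1/k
```

The nonzero coefficient b(1) is the delay-free path from f to x noted in section 5.4.1: the current output sample depends on the current input sample.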
Often the exciter exhibits strong nonlinearities, such as the pressure-flow characteristic of a clarinet reed [31]. The interaction block is the place where the different discretizations of the exciter and resonator blocks talk to each other. Moreover, this is the right place to insert sound components that are difficult to capture with a physical model, either because the physics is too complicated or because we just do not know how to model some phenomena. For instance, where the clarinet reed (exciter) is connected to the bore (resonator), small flow-dependent noise bursts can be injected to increase the realism of the simulation. In a system such as the one of figure 13, if each block is discretized separately, a computability problem may arise when the blocks are connected to each other. Namely, if the realization of each block has a delay-free input-output path, then a non-computable delay-free loop will appear in the model. There are techniques to cope with these delay-free loops (implicit solvers) or to eliminate them [16].

5.4.3 One-dimensional distributed resonators

Physical systems such as strings or acoustic tubes can be idealized as one-dimensional distributed resonators, described by a pair of dual variables, here called Kirchhoff variables, which are functions of time and longitudinal space. For a string, the Kirchhoff variables are force and velocity. For the acoustic tube, these variables are pressure and air flow. In any case, each of these variables is governed by the wave equation [63]

$$\frac{\partial^2 p(x,t)}{\partial t^2} = c^2 \frac{\partial^2 p(x,t)}{\partial x^2} \tag{40}$$

where $c$ is the wave speed in the medium. The symbol $p$ in (40) can be thought of as the instantaneous and local air pressure inside a tube. One of the most popular ways of solving PDEs such as (40) is finite differencing, where a grid is constructed in the spatial and time variables, and derivatives are replaced by linear combinations of the values on this grid.
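The replacement of a derivative by a linear combination of grid values can be sketched with a central second-order difference. The example below is a minimal illustration, assuming a sinusoidal test profile and a grid step chosen arbitrarily; it compares the difference quotient with the exact second derivative.

```octave
% Central second-order difference as an approximation of the second
% spatial derivative. The test profile and grid step are assumptions.
dx = 0.01;                     % spatial grid step
x  = 0:dx:1;                   % grid points
p  = sin(2*pi*x);              % sampled test profile

% (p(i+1) - 2*p(i) + p(i-1))/dx^2 at the interior grid points
d2 = (p(3:end) - 2*p(2:end-1) + p(1:end-2))/dx^2;

exact = -(2*pi)^2*sin(2*pi*x(2:end-1));   % analytic second derivative
err   = max(abs(d2 - exact));             % shrinks roughly as dx^2
```

The same three-point combination, applied along both the time and the space axes, is the usual starting point for an explicit scheme for the wave equation (40).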
Two main problems have to be faced when designing a finite-difference scheme for a partial differential equation: numerical losses and numerical dispersion. There is a standard technique [70], [103] for evaluating how well a finite-difference scheme counters these problems: the von Neumann analysis. Replacing the second derivatives by central second-order differences⁶, the explicit updating scheme for the $i$-th spatial sample of displacement (or pressure) is:

$$p(i, n+1) = 2\left(1 - \frac{c^2\Delta t^2}{\Delta x^2}\right)p(i,n) - p(i,n-1) + \frac{c^2\Delta t^2}{\Delta x^2}\left[p(i+1,n) + p(i-1,n)\right] \tag{41}$$

where $\Delta t$ and $\Delta x$ are the time and space grid steps. The von Neumann analysis assumes that the equation parameters are locally constant and checks the time evolution of a spatial Fourier transform of (41). In this way a spectral amplification factor is found whose deviations from unit magnitude and linear phase give, respectively, the numerical loss (or amplification) and dispersion errors. For the scheme (41) it can be shown that a unit-magnitude amplification factor is ensured as long as the Courant-Friedrichs-Lewy condition [70]

$$\frac{c\,\Delta t}{\Delta x} \le 1 \tag{42}$$

is satisfied, and that no numerical dispersion is found if equality applies in (42). A first consequence of (42) is that only strings whose length is an integer multiple of $c\Delta t$ are exactly simulated. Moreover, when the string deviates from ideality and higher spatial derivatives appear (physical dispersion), the simulation is always approximate. In these cases, resorting to implicit schemes allows the discrete algorithm to be tuned to the amount of physical dispersion, in such a way that as many partials as possible are reproduced in the band of interest [22].

⁶ The reader is invited to derive (41) by substituting in (40) the first-order spatial derivative with the difference $(p(i+1,n) - p(i,n))/\Delta x$, and the first-order time derivative with the difference $(p(i,n+1) - p(i,n))/\Delta t$.

It is worth noting that if $c$ in equation (40) is a function of time and space, the finite difference method retains its validity because it is based on a local (in time and space) discretization of the wave equation. Another advantage of finite differencing over other modeling techniques is that the medium is accessible at all the points of the time-space grid, thus maximizing the possibilities of interaction with other objects.

As opposed to finite differencing, which discretizes the wave equation (see eqs. (40) and (41)), waveguide models come from discretization of the solution of the wave equation. The solution to the one-dimensional wave equation (40) was found by D'Alembert in 1747 in terms of traveling waves⁷:

$$p(x,t) = p^+(t - x/c) + p^-(t + x/c) \tag{43}$$

Eq. (43) shows that the physical quantity $p$ (e.g., string displacement or acoustic pressure) can be expressed as the sum of two wave quantities traveling in opposite directions. In waveguide models, waves are sampled in space and time in such a way that equality holds in (42). If propagation along a one-dimensional medium, such as a cylinder, is ideal, i.e., linear, non-dissipative and non-dispersive, wave propagation is represented in the discrete-time domain by a pair of digital delay lines (Fig. 14), which propagate the wave variables $p^+$ and $p^-$.
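The link between the scheme (41), the condition (42) and the traveling waves of (43) can be checked numerically. The following sketch is only an illustration under stated assumptions: the Gaussian pulse shape, the grid size and the number of steps are arbitrary, and the two end points are simply held at zero. With c*dt = dx the update (41) should shift a right-going pulse by exactly one grid point per time step.

```octave
% Explicit scheme (41) run at the limit of condition (42), c*dt/dx = 1.
% Pulse shape, grid size and number of steps are illustrative assumptions.
c  = 1;                        % wave speed
dx = 0.01;                     % space step
dt = dx/c;                     % time step chosen so that c*dt/dx = 1
lambda2 = (c*dt/dx)^2;         % squared Courant number

N = 200;                       % number of spatial samples
x = (0:N-1)*dx;                % spatial grid
g = @(u) exp(-((u - 0.5)/0.05).^2);   % pulse shape centered at 0.5

% Right-going traveling wave p(x,t) = g(x - c*t) sampled at t = -dt and t = 0.
pprev = g(x + c*dt);
pcur  = g(x);

nsteps = 50;
for n = 1:nsteps
  pnext = zeros(1, N);         % end points held at zero
  pnext(2:N-1) = 2*(1 - lambda2)*pcur(2:N-1) - pprev(2:N-1) ...
                 + lambda2*(pcur(3:N) + pcur(1:N-2));
  pprev = pcur;
  pcur  = pnext;
end

% Compare with the exact traveling-wave solution after nsteps time steps.
err = max(abs(pcur - g(x - nsteps*c*dt)));
```

In this limit case the grid values are just shifted copies of the initial pulse, which is what a pair of delay lines computes directly in a waveguide model.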
⁷ The D'Alembert solution can be derived by inserting the exponential eigenfunction $e^{st+vx}$ into (40).

Figure 14: Wave propagation in an ideal (i.e., linear, non-dissipative and non-dispersive) medium can be represented, in the discrete-time domain, by a pair of digital delay lines.

Let us consider deviations from ideal propagation due to losses and dispersion in the resonator. Usually, these linear effects are lumped and simulated with a few filters which are cascaded with the delay lines. Losses due to terminations, internal frictions, etc., give rise to gentle lowpass filters, whose parameters can be identified from measurements. Wave dispersion, which is often due to medium stiffness, is simulated by means of allpass filters whose effect is to produce a frequency-dependent propagation velocity [83]. The reflecting terminations of the resonator (e.g., a guitar bridge) can also be modeled as filters. By virtue of linearity and time invariance, all the filters can be condensed into a single higher-order filtering block, and all the delays can be connected to form a single longer delay line. As a result, we get the recursive comb filter, described in chapter 3, which forms the structure of the Karplus-Strong synthesis algorithm [47].

One-dimensional waveguide models can be connected together by means of waveguide junctions, thus forming digital waveguide networks, which are used for the simulation of multidimensional media (e.g., membranes [34]) or complex acoustic systems (e.g., several strings attached to a bridge [17]). The general treatment of waveguide networks is beyond the scope of this book [85].

Appendix A
Mathematical Fundamentals

A.1 Classes of Numbers

A.1.1 Fields
Given a set $F$ of numbers, two operations called sum and product over these numbers, and some algebraic properties that we are going to enumerate, $F$ is called a field. The sum of two elements of the field $u, v \in F$ is still an element of the field and has the following properties:

S1, Associative Property: $(u + v) + w = u + (v + w)$
S2, Commutative Property: $u + v = v + u$
S3, Existence of the Zero: There exists one and only one element in $F$, called the zero, that is the neutral element for the sum, i.e., $u + 0 = u$ for all $u \in F$
S4, Existence of the Opposite: For each $u \in F$ there exists one and only one element in $F$, called the opposite of $u$ and written as $-u$, such that $u + (-u) = 0$.

The product of two elements of the field $u, v \in F$ is still an element of the field and has the following properties:

P1, Associative Property: $(uv)w = u(vw)$
P2, Commutative Property: $uv = vu$
P3, Existence of the Unity: There exists one and only one element in $F$, called the unity, that is the neutral element for the product, i.e., $u1 = u$ for all $u \in F$
P4, Existence of the Inverse: For each $u \in F$ different from zero, there exists one and only one element in $F$, called the inverse of $u$ and written as $u^{-1}$, such that $uu^{-1} = 1$.

The two operations of sum and product are jointly characterized by the distributive properties:

D1, Distributive Property: $u(v + w) = uv + uw$
D2, Distributive Property: $(v + w)u = vu + wu$

The existence of the opposite and of the inverse implies the existence of two other operations, namely, the difference $u - v = u + (-v)$ and the quotient $u/v = u(v^{-1})$. Given the properties of a field, we can say that the natural numbers $N = \{0, 1, \ldots\}$ do not form a field since, for instance, they do not have an opposite. Similarly, the integer numbers $Z = \{\ldots, -2, -1, 0, 1, \ldots\}$ do not form a field because, in general, they do not have an inverse.
On the other hand, the rational numbers $Q$, which are given by ratios of integers, do satisfy all the properties of a field. The real numbers $R$ are all those numbers that can be expressed in decimal notation as $x.y$, where the number of digits of $y$ is not necessarily bounded. Real numbers can be obtained as the union of the set of rational numbers with the set of irrational numbers, i.e., those numbers that cannot be expressed as a ratio of integers. An example of an irrational number is $\pi$, which is the ratio between the circumference and the diameter of any circle. The real numbers do form a field, and the rationals are a subfield of the reals.

A.1.2 Rings

A set of numbers provided with sum and product, and such that the properties S1–4, P1, and D1–2 are satisfied, is called a ring. If P2 is satisfied we have a commutative ring, and if P3 is satisfied the ring has a unity. For instance, the set $Z$ of integer numbers forms a commutative ring with a unity. Whenever we want to indicate the sets of ordered couples or triples of elements belonging to a field (or a ring) $F$ we will use the notation $F^2$ or $F^3$, respectively.

A.1.3 Complex Numbers
The classes of numbers introduced so far form a hierarchical system, where the natural numbers are contained in the integers, which are part of the rationals, and this latter class is contained in the real numbers. This hierarchy mirrors the historical evolution of the classes of numbers from antiquity to the XVI century. The extension of the hierarchy was always motivated by the ease with which practical and formal problems could be solved by manipulation of numerical symbols. The same kind of motivation led to the introduction of the class of complex numbers. As we will see in sec. A.3, they come into play when one wants to represent the solutions of a second-order equation. In order to define the complex numbers, we have to define the imaginary unit $i$ as that number that, multiplied by itself (i.e., squared), gives $-1$. Therefore,

$$i^2 = ii = -1 \tag{1}$$

In several branches of engineering the symbol $j$ is preferred to $i$, because it is more easily distinguished from the symbol of current. In this book, the symbol $i$ is used exclusively. Given the preliminary definition of $i$, the complex numbers are defined as the couples

$$x + iy \tag{2}$$

where $x$ and $y$ are real numbers called, respectively, the real and imaginary part of the complex number. Given two complex numbers $c_1 = x_1 + iy_1$ and $c_2 = x_2 + iy_2$, the four operations are defined as follows¹:

Sum: $c_1 + c_2 = (x_1 + x_2) + i(y_1 + y_2)$
Difference: $c_1 - c_2 = (x_1 - x_2) + i(y_1 - y_2)$
Product: $c_1 c_2 = (x_1 x_2 - y_1 y_2) + i(x_1 y_2 + x_2 y_1)$
Quotient: $\dfrac{c_1}{c_2} = \dfrac{(x_1 x_2 + y_1 y_2) + i(y_1 x_2 - x_1 y_2)}{x_2^2 + y_2^2}$

¹ The expressions can be derived by application of the usual algebraic operations on real numbers and by substituting $i^2$ with $-1$. In order to derive the quotient, it is useful to multiply and divide by $x_2 - iy_2$.

While the introduction of complex numbers dates back to the XVI century, their geometric interpretation, which gave an intuitive framework for widespread use, was introduced in the XVIII century. The geometric interpretation is simply obtained by considering the complex number $c = x + iy$ as a point of the plane having coordinates $x$ and $y$. This interpretation, depicted in fig. 1, allows us to switch from the orthogonal coordinates $x$ and $y$ to the polar coordinates $\rho$ and $\theta$, called magnitude (or absolute value) and phase (or argument), respectively. The $x$ and $y$ axes are called, respectively, the real and imaginary axes. The magnitude of a complex number is calculated by application of the Theorem of Pythagoras:

$$\rho^2 = x^2 + y^2 = (x + iy)(x - iy) = c\bar{c} \tag{3}$$

where $\bar{c}$ is the complex conjugate of $c$, also depicted in fig. 1². The argument of a complex number is the angle formed by the positive horizontal semi-axis with the line drawn from the origin of the complex plane to the point. The argument is signed, and the sign is positive for anticlockwise angles (see fig. 1).

A.2 Variables and Functions

In mathematics, the entities that one works with are often arbitrary elements of a class of numbers. In these cases, the entities can be represented by a variable $x$ defined in a domain $D$. In this appendix, we have already used some variables implicitly, for instance, to state the properties of a field. When the domain is an interval of the field of real numbers having extremes $a$ and $b$, we can say that $x$ is a continuous variable of the interval $[a, b]$ and we write $a \le x \le b$. When every value of the variable $x$ is associated with one and only one value of another variable $y$ we say that $y$ is a function of $x$, and we write

$$y = f(x) \tag{4}$$

$x$ is said to be the independent variable (argument) while $y$ is the dependent variable, and the set of values that $y$ takes for the different values assumed by $x$ in its domain is called the codomain.
If, for each $x_1 \neq x_2$, $f(x_1) \neq f(x_2)$, then domain and codomain are in one-to-one (biunivocal) correspondence. In that case the roles of domain and codomain can be inverted, and it is possible to define an inverse function $x = f^{-1}(y)$. In general, functions can have more than one independent variable, thus indicating a relation among many variables.

² It is easy to show that the magnitude of the product is equal to the product of the magnitudes. Conversely, the magnitude of the sum is not, in general, equal to the sum of the magnitudes.

Figure 1: Geometric interpretation of a complex number (the point $c = x + iy$ with magnitude $\rho$ and phase $\theta$, and its conjugate $\bar{c} = x - iy$ with phase $-\theta$)

Often functions are defined by means of algebraic expressions, and associated with domains and interpretations for the variables. For instance, the pitch $h$ (in Hz) of the note produced by an ideal string can be expressed by the function

$$h = \frac{1}{2l}\sqrt{\frac{t}{d}} \tag{5}$$

where $l$ is the length of the string in meters, $t$ is the string tension in newtons, and $d$ is the density per unit length (kg/m). This concise expression represents the pitch of a note whatever the values of length, tension, and density, as long as these values belong to the domain of non-negative real numbers (indicated by $R^+$).

Functions can be graphically represented in the cartesian plane. The abscissa corresponds to an independent variable, and the ordinate corresponds to the dependent variable. If we have more than one independent variable, only one is represented in abscissa, and the other ones are set to constant values. For example, fig. 2 shows the function (5), with values of tension and density³ set to 952 N and 0.0367 kg/m, respectively. The domain of string lengths ranges from 0.5 m to 4.0 m.
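The notion of inverse function can be tried out on (5): for fixed tension and density, the pitch decreases monotonically with the string length, so the relation can be inverted to find the length that gives a desired pitch. The sketch below is a small illustration, not part of the original scripts; the function handle names pitch_of and length_of are hypothetical, and the grid of lengths is an arbitrary choice.

```octave
% Pitch as a function of length, eq. (5), and its inverse function.
% The handle names and the grid of lengths are illustrative assumptions.
d = 0.0367;                                % density per unit length [kg/m]
t = 952;                                   % tension [N]

pitch_of  = @(l) 1./(2*l).*sqrt(t/d);      % h = f(l), eq. (5)
length_of = @(h) 1./(2*h).*sqrt(t/d);      % l = f^{-1}(h), solved from (5)

l      = 0.5:0.5:4.0;                      % a few string lengths [m]
h      = pitch_of(l);                      % corresponding pitches [Hz]
l_back = length_of(h);                     % lengths recovered from the pitches

err = max(abs(l_back - l));                % round-trip error, near machine precision
```

Here the function and its inverse happen to have the same algebraic form, because (5) depends on the length only through the factor 1/(2l).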
Figure 2: Pitch of a note as a function of string length (axes: l [m], h [Hz])

The chart of fig. 2 can be obtained by a simple script in Octave or Matlab:

    r=0.0367; t=952;        % definitions of density and tension
    l=[0.5:0.01:4.0];       % domain for the string length
    h=1./(2*l)*sqrt(t/r);   % expression for pitch
    plot(l,h); grid;
    title('Pitch of note as a function of string length');
    xlabel('l [m]'); ylabel('h [Hz]');
    % replot; % Octave only

³ These values are appropriate for the piano note C2.

In order to visualize functions of two variables, we can also use three-dimensional representations. For example, the function (5) can be visualized as in fig. 3 if the variables length and tension are defined over intervals and the density is set to a constant. In such a representation, the function of two independent variables becomes a surface in 3D. The Octave/Matlab script for fig. 3 is the following:

    r=0.0367;                    % definition of density
    l=[0.5:0.1:4.0];             % domain for the string length
    t=[800:10:1200];             % domain for the string tension
    h=(1./(2*l')*sqrt(t./r))';   % expression for pitch
    mesh(l,t,h); grid;
    title('Pitch of note as a function of string length and tension');
    xlabel('l [m]'); ylabel('t [N]'); zlabel('h [Hz]');
    % replot; % Octave only

Figure 3: Pitch of a note as a function of string length and tension (axes: l [m], t [N], h [Hz])

For a multivariable function we can also give the contour plot, i.e., the plot of curves obtained for constant values of the dependent variable. For example, in the function (5), if we let the dependent variable take only seven prescribed values, the cartesian plane of length and tension displays seven curves (see fig. 4). Each curve corresponds to a horizontal cut of the surface of fig. 3. The Octave/Matlab script producing fig. 4 is the following:

    r=0.0367;                    % definition of density
    l=[0.5:0.1:4.0];             % domain for the string length
    t=[800:10:1200];             % domain for the string tension
    h=(1./(2*l')*sqrt(t./r))';   % expression for pitch
    % contour(h', 7, l, t);      % Octave only
    co=contour(l, t, h, 7);      % Matlab only
    clabel(co);                  % Matlab only
    title('Pitch of note as a function of string length and tension');
    xlabel('l [m]'); ylabel('t [N]'); zlabel('h [Hz]');

Figure 4: Contour plot of pitch as a function of string length and tension (axes: l [m], t [N])

A.3 Polynomials

An important class of one-variable functions is the class of polynomials, which are weighted sums of non-negative powers of the independent variable. Each power with its coefficient is called a monomial. A polynomial has the form

$$y = f(x) = a_0 + a_1 x + a_2 x^2 + \cdots + a_n x^n \tag{6}$$

where the numbers $a_i$ are called coefficients and, for the moment, they can be considered as real numbers. The highest power that appears in (6) is called the order of the polynomial. The second-order polynomials, when represented in the $x$-$y$ plane, produce a class of curves called parabolas, while third-order polynomials generate cubic curves. We call solutions, or zeros, or roots of a polynomial those values of the independent variable that produce a zero value of the dependent variable. For second- and third-order polynomials there are formulas to derive the zeros in closed form. Particularly important is the formula for second-order polynomials:

$$ax^2 + bx + c = 0 \tag{7}$$
$$x = \frac{-b \pm \sqrt{b^2 - 4ac}}{2a} \tag{8}$$

As can easily be seen by application of (8) to the polynomial $x^2 + 1$, the roots of a real-coefficient polynomial are not necessarily real numbers. This observation was indeed the initial motivation for introducing the complex numbers as an extension of the field of real numbers. The Fundamental Theorem of Algebra states that every $n$-th order real-coefficient polynomial has exactly $n$ zeros in the field of complex numbers, even though these zeros are not necessarily all distinct from each other. Moreover, the roots that do not belong to the real axis of the complex plane come in pairs of conjugate complex numbers.
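Formula (8) can be tried directly on the polynomial $x^2 + 1$. The following sketch is a minimal illustration; it relies on the fact that in Octave/Matlab the square root of a negative real number returns a complex result, and it cross-checks the closed-form roots against the built-in roots function.

```octave
% Roots of x^2 + 1 by the closed-form formula (8).
a = 1; b = 0; c = 1;                 % coefficients of a*x^2 + b*x + c
delta = b^2 - 4*a*c;                 % discriminant, negative in this case

x1 = (-b + sqrt(delta))/(2*a);       % sqrt of a negative real is complex
x2 = (-b - sqrt(delta))/(2*a);

r = roots([a b c]);                  % numerical roots, for comparison

% Residuals of the polynomial at the computed roots.
res_formula = abs(polyval([a b c], [x1 x2]));
res_roots   = abs(polyval([a b c], r));
```

The two roots are the imaginary unit and its opposite, which form a conjugate pair as stated by the theorem above.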
For polynomials of order higher than three, it is convenient to use numerical methods in order to find their roots. These methods are usually based on some iterative search of the solution by increasingly precise approximations, and are often found in numerical software packages such as Octave. In Octave/Matlab a polynomial is represented by the list of its coefficients from $a_n$ to $a_0$. For instance, $1 + 2x^2 + 5x^5$ is represented by

    p = [5 0 0 2 0 1]

and its roots are computed by the function

    rt = roots(p)

In this example the roots found by the program are

    rt =
      -0.87199 + 0.00000i
       0.54302 + 0.57635i
       0.54302 - 0.57635i
      -0.10702 + 0.59525i
      -0.10702 - 0.59525i

and only the first one is real. If the previous result is saved in a variable rt, the complex numbers stored in it can be visualized in the complex plane by the directive

    plot(real(rt),imag(rt),'o'); axis([-1,1,-1,1]);
Figure 5: Roots of the polynomial $1 + 2x^2 + 5x^5$ in the complex plane

and the result is reported in fig. 5. It can be shown that the real-coefficient polynomials form a commutative ring with unity if the operations of sum and product are properly defined. The sum of two polynomials is a polynomial whose order is the highest of the orders of the operands, and whose coefficients are the sums of the respective coefficients of the operands. The product is obtained by application of the usual distributive and associative properties to the product of sums of powers. The order of the product is given by the sum of the orders of the polynomial operands, and the $k$-th coefficient of the product is obtained from the coefficients $a_i$ and $b_j$ of the operands by the formula

$$c_k = \sum_{i+j=k} a_i b_j \tag{9}$$

where this notation indicates a sum whose addenda are characterized by a couple of indices $i, j$ that sum up to $k$. As can be seen from sec. 1.4, polynomial multiplication is formally identical to the convolution of discrete signals, and this latter operation is fundamental in digital signal processing.

A.4 Vectors and Matrices

Physicists use arrows to indicate physical quantities having both an intensity and a direction (e.g., forces or velocities). These arrows, sometimes called vectors, are oriented according to the direction of the physical quantity and their length is proportional to the intensity. These vectors can be located in the plane (or the 3D space) as if they were departing from the origin. In this way, they can be represented by the couple (or triple) of coordinates of their second extremity. This representation allows us to perform the sum of vectors and the multiplication of a vector by a constant as the usual algebraic operations done on each separate coordinate:

$$(x_1, y_1, z_1) + (x_2, y_2, z_2) = (x_1 + x_2,\; y_1 + y_2,\; z_1 + z_2)$$
$$\alpha(x_1, y_1, z_1) = (\alpha x_1, \alpha y_1, \alpha z_1) \tag{10}$$

More generally, an $n$-coordinate vector is defined in a field $F$ as the ordered set of $n$ numbers⁴ $x_i \in F$:

$$\mathbf{v} = [x_1, \ldots, x_n] \tag{11}$$

The set of all $n$-coordinate vectors defined in the field $F$, for which the operations (10) give vectors within the set itself, forms the $n$-dimensional vector space $V_n(F)$. Every subset of $V_n(F)$ that is closed⁵ with respect to the operations (10) is called a vector subspace of $V_n(F)$. For instance, in the two-dimensional plane, the points of a cartesian axis form a subspace of the plane. Similarly, subspaces of the plane are given by any straight line passing through the origin, and subspaces of the 3D space are given by any plane passing through the origin.

⁴ In this book, square brackets are used to indicate vectors and matrices. This is also the notation used in Octave. Moreover, the variables representing vectors or matrices are always typed in bold font.
⁵ A set $I$ is closed with respect to an operation on its elements if the result of the operation is always an element of $I$.

The $m$ vectors $\mathbf{v}_1, \ldots, \mathbf{v}_m$ are said to be linearly independent if there is no choice of $m$ coefficients $a_1, \ldots, a_m$ (the choice of all zeros is excluded) such that

$$a_1\mathbf{v}_1 + \cdots + a_m\mathbf{v}_m = 0 \tag{12}$$

In the 2D plane, two points on different cartesian axes are linearly independent, as are any two points belonging to different straight lines passing through the origin. Conversely, points belonging to the same straight line passing through the origin are always linearly dependent.

It can be shown that, in an $n$-dimensional space $V_n(F)$, every set of $m > n$ vectors is linearly dependent. A set of $n$ linearly independent vectors (if they exist) is called a basis of $V_n(F)$, in the sense that any other vector of $V_n(F)$ can be obtained as a linear combination of the basis vectors. For instance, the vectors $[1, 0, 0]$, $[0, 1, 0]$, and $[0, 0, 1]$ form a basis for the 3D space, but there are infinitely many other bases.

Between any two vectors of the same vector space the operation of dot product is defined, and it returns the scalar sum of the component-by-component products. As a formula, the dot product is written as
$$\mathbf{v}'\mathbf{w} = \sum_{j=1}^{n} v_j w_j \tag{13}$$

By convention, with $\mathbf{v}$ we indicate a column vector, while $\mathbf{v}'$ denotes its transposition into a row. Therefore, the operation (13) can be referred to as a row-column product.

A matrix can be considered as a list of vectors, organized in a table where each element of the list occupies (by convention) one column. A matrix having $n$ rows and $m$ columns defined over the field $F$ can be written as

$$\mathbf{A} = \begin{bmatrix} a_{1,1} & \ldots & a_{1,m} \\ \vdots & & \vdots \\ a_{n,1} & \ldots & a_{n,m} \end{bmatrix} \in F^{n \times m} \tag{14}$$

The multiplication of a matrix $\mathbf{A} \in F^{n \times m}$ by a (column) vector $\mathbf{v} \in V_m(F)$ is defined as

$$\mathbf{A}\mathbf{v} = \begin{bmatrix} \sum_{j=1}^{m} a_{1,j} v_j \\ \vdots \\ \sum_{j=1}^{m} a_{n,j} v_j \end{bmatrix} \tag{15}$$

i.e., as a (column) vector whose $i$-th element is given by the dot product of the $i$-th row by the vector $\mathbf{v}$. The product of a matrix $\mathbf{A} \in R^{l \times m}$ by a matrix $\mathbf{B} \in R^{m \times n}$ can be obtained as a list of vectors, each being the product of matrix $\mathbf{A}$ by a column of $\mathbf{B}$, and it is a matrix $\mathbf{C} \in R^{l \times n}$. The product is properly defined only if the number of columns of the first matrix is equal to the number of rows of the second matrix. In general, the order of factors cannot be reversed, i.e., the matrix product is not commutative. Given a matrix $\mathbf{A}$, the matrix $\mathbf{A}'$ obtained by exchanging each row with the corresponding column is called the transpose of $\mathbf{A}$.

Languages such as Octave and Matlab were initially conceived as languages for matrix manipulation. Therefore, they offer data structures and built-in operators for representing and manipulating matrices. For example, a matrix $\mathbf{A} \in R^{2 \times 3}$ can be represented as

    A = [1, 2, 3; 4, 5, 6];

where the semicolon is used to separate one row from the following one. A column vector can be entered as

    b = [1; 2; 3];

or, alternatively, we can transpose a row vector

    b = [1, 2, 3]';

Given the definitions of the variables A and b, we can multiply the matrix by the vector and assign the result to a new vector variable c:

    c = A*b

thus obtaining the result

    c =
       14
       32

The product of a matrix $\mathbf{A} \in R^{l \times m}$ by a matrix $\mathbf{B} \in R^{m \times n}$ is represented by

    A*B

When we want to do element-wise operations between two or more vectors or matrices having the same size, we just have to place a dot before the operator symbol. For instance,

    [1, 2, 3] .* [4, 5, 6]

returns the (row) vector [4 10 18] as a result. Octave allows us to operate on scalars, vectors, and matrices belonging to the complex field, just by representing each number as a sum of real and imaginary parts (e.g., 2 + 3i).

When we use Octave/Matlab to handle functions, or to draw their plot, we usually operate on collections of points that are representative of the functions.
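The definitions of this section can be checked with a few lines of Octave/Matlab. The sketch below reuses the matrix A and vector b from the text; the matrices B, P and Q are arbitrary values introduced here only for illustration. It verifies that the matrix-vector product coincides with the row-by-column dot products of (15), and that exchanging the factors of a matrix product generally changes the result.

```octave
% Matrix-vector product as a stack of dot products, eq. (15).
A = [1, 2, 3; 4, 5, 6];            % 2x3 matrix from the text
b = [1; 2; 3];                     % column vector from the text
c = A*b;                           % built-in product
c_rows = [A(1,:)*b; A(2,:)*b];     % each row of A times b, as in (13) and (15)

% Matrix-matrix products: the sizes depend on the order of the factors.
B  = [1, 0; 1, 1; 0, 2];           % 3x2 matrix (arbitrary values)
AB = A*B;                          % 2x2 result
BA = B*A;                          % 3x3 result

% Even for square matrices the product is not commutative in general.
P  = [0, 1; 0, 0];
Q  = [0, 0; 1, 0];
PQ = P*Q;
QP = Q*P;

At = A';                           % transpose: rows and columns exchanged
```

Note that A*B and B*A are both defined here only because the inner dimensions match in both orders; for most pairs of rectangular matrices only one of the two products exists.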
There is a concise way to assign to a variable all the values regularly spaced (with step inc) between a min and a max:

    x = [min: inc: max];

This kind of instruction has been used to plot the function of fig. 2. After having defined the domain as the vector of points

    l=[0.5: 0.01: 4.0];

the vector representing the codomain has been computed by application of the function to the vector l:

    h=1./(2*l)*sqrt(t/r);

A.4.1 Square Matrices

The $n$-th order square matrices defined over a field $F$ are a set $F^{n \times n}$ which is very important for its affinity with the classes of numbers. In fact, for these matrices the sum and product are always defined, and it is easy to verify that the properties S1–4, P1, and D1–2 of appendix A.1 do hold. The property P3 is also verified, and the neutral element for the product is the unit diagonal matrix, which is a matrix that has ones in the main diagonal⁶ and zeros elsewhere. In general, commutativity is not ensured for the product, and a matrix might not admit an inverse matrix, i.e., an inverse obeying property P4. In the terminology introduced in appendix A.1, the square matrices $F^{n \times n}$ form a ring with a unity. This observation allows us to treat the square matrices with compact notation, as a class of numbers which is not much different from that of integers⁷.

A.5 Exponentials and Logarithms
Given a number a ∈ R+ , it is clear what is its natural mth power, that is the number obtained multiplying a by itself m times. The rational power a1/m , with m a natural number, is deﬁned as the number whose mth power gives a. If we extend the power operator to negative exponents by reciprocation of the positive power, we give meaning to all powers ar , with r being any rational number. The extension to any real exponent is obtained by imposing continuity to the power function. Intuitively, the function f (x) = ax describes a continuous curve that “interpolates” the values taken at the points where x is rational. The power operator has the following fundamental properties: E1 : ax ay = ax+y E2 : ax = ax − y ay main diagonal goes from the top leftmost corner to the bottom rightmost corner. important differences with the ring of integers is the non commutativity and the possibility that two nonzero matrices multiplied together give the zero matrix (the zero matrix admits nonzero divisors).
7 Two 6 The Mathematical Fundamentals E3 : (ax )y = axy E4 : (ab)x = ax bx . 159 The function f (x) = ax is called exponential with base a. Given these preliminary deﬁnitions and properties, we deﬁne the logarithm of y with base a x = loga y , (16) as the inverse function of y = ax . In other words, it is the exponent that must be given to the base in order to get the argument y . Since the power ax has been deﬁned only for a > 0 and it gives always a positive number, the logarithm is deﬁned only for positive values of the independent variable y . Logarithms are very useful because they translate products and divisions into sums and differences, and power operations into multiplications. Simply stated, by means of the logarithms it is possible to reduce the complexity of certain operations. In fact, the properties E1–3 allow to write down the following properties: L1 : loga xy = loga x + loga y x L2 : loga = loga x − loga y y L3 : loga xy = y loga x . In sound processing, the most interesting logarithm bases are 10 and 2. The base 10 is used to deﬁne the decibel (symbol dB) as a ratio of two quantities. If the quantities x and y are proportional to sound pressures (e.g., rms level), we say that x is wdB larger than y if x > y > 0 and w = 20 log10 x . y (17) When the quantities x and y are proportional to a physical power (or intensity), their ratio in decibel is measured by using a factor 10 instead of 208 in (17). The base 2 is used in all branches of computer sciences, since most computing systems are based upon binary representations of numbers (see the appendix A.9). For instance, the number of bits that is needed to form an address in a memory of 1024 locations is log2 1024 = 10 . (18) 8 In acoustics [86], the power is proportional to the square of a pressure. Therefore, applying property L3, we fall back into the deﬁnition (17). 160 D. 
Rocchesso: Sound Processing In Octave/Matlab, the logarithms of x having base 2 and 10 are indicated with log2(x) and log10(x), respectively. Fig. 6 shows the curves of the logarithms in base 2 and 10. From these curves we can intuitively infer how, in any base, log 1 = 0, and how the function approaches −∞ (minus inﬁnity) as the argument approaches zero.
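As a minimal sketch of how definition (17) and equation (18) translate into Octave/Matlab, the following lines compute a level ratio in decibel and a number of address bits. The variable names and the numeric values of x and y are assumptions made only for illustration; they do not come from the text.

```octave
% Level ratios in decibel, following definition (17).
% The values of x and y are illustrative assumptions.
x = 2.0;   % rms pressure of the louder signal (arbitrary units)
y = 1.0;   % rms pressure of the reference signal (arbitrary units)

w_pressure = 20 * log10(x / y);       % pressure-like quantities: factor 20
w_power    = 10 * log10(x^2 / y^2);   % power-like quantities: factor 10

% Number of address bits for a memory of 1024 locations, as in (18)
address_bits = log2(1024);

fprintf('pressure ratio: %.4f dB\n', w_pressure);
fprintf('power ratio:    %.4f dB\n', w_power);
fprintf('address bits:   %d\n', address_bits);
```

With these values the two decibel figures coincide, because squaring the pressure ratio is compensated by halving the multiplying factor, as property L3 predicts.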
Figure 6: Logarithms expressed in the fundamental bases 2 (solid line) and 10 (dashed line)

Given a logarithm expressed in base a, it is easy to convert it into the logarithm expressed in another base b. The formula that can be used is

$$ \log_b x = \frac{\log_a x}{\log_a b} \tag{19} $$

A base of capital importance in calculus is the Neper number e, a transcendental number approximately equal to 2.7183. As we will see in appendix A.7.1, the exponentials expressed in base e are eigenfunctions for the derivative operator. In other words, linear differential operators do not alter the form of these exponentials. Moreover, the exponential with base e admits an elegant expansion into an infinite series

$$ e^x = 1 + \frac{x}{1!} + \frac{x^2}{2!} + \frac{x^3}{3!} + \dots \tag{20} $$

where n! is the factorial of n and is equal to the product of all integers ranging from 1 to n. It can be proved that the infinite sum on the right-hand side of (20) gives meaning to the exponential function even where its argument is complex.

A.6 Trigonometric Functions

Trigonometry describes the relations between angles and segments subtended by these angles. The main trigonometric functions are easily visualized on the complex plane, as in fig. 7, where the unit circle is explicitly represented.
Figure 7: Trigonometric functions on the complex plane

An angle θ cuts on the unit circle an arc whose length is defined as the measure in radians of the angle. Since the circumference has length 2π, the 360° angle measures 2π radians, and the 90° angle corresponds to π/2 radians. The main trigonometric functions are:

    Sine:    sin θ = PQ
    Cosine:  cos θ = OQ
    Tangent: tan θ = PQ/OQ

It is clear from fig. 7 and from Pythagoras' theorem that, for any θ, the identity

$$ \sin^2 \theta + \cos^2 \theta = 1 \tag{21} $$

is valid. The angle, considered positive if oriented anticlockwise, can be considered the independent variable of trigonometric functions. Therefore, we can use Octave/Matlab to plot the main trigonometric functions, thus obtaining fig. 8. These plots can be obtained as subplots of the same figure by the following Octave/Matlab script:

    theta = [0:0.01:4*pi];
    s = sin(theta);
    c = cos(theta);
    t = tan(theta);
    subplot(2,2,1); plot(theta,s);
    axis([0,4*pi,-1,1]);
    grid; title('Sine of an angle');
    xlabel('angle [rad]'); ylabel('sin');
    % replot; % Octave only
    subplot(2,2,2); plot(theta,c);
    grid; title('Cosine of an angle');
    xlabel('angle [rad]'); ylabel('cos');
    % replot; % Octave only
    subplot(2,2,3); plot(theta,t);
    grid; title('Tangent of an angle');
    xlabel('angle [rad]'); ylabel('tan');
    axis([0,4*pi,-6,6]);
    % replot; % Octave only

It is clear from the plots that the functions sine and cosine are periodic with period 2π, while the function tangent is periodic with period π. Moreover, the codomain of sine and cosine is limited to the interval [−1, 1], while the tangent takes values over the whole real axis. The tangent approaches infinity for all the values of the argument that are odd multiples of π/2, i.e., at these points we have vertical asymptotes.

As we can see from fig. 7, a complex number c, having magnitude ρ and argument θ, can be represented in its real and imaginary parts as

$$ c = x + iy = \rho \cos \theta + i \rho \sin \theta \tag{22} $$

A fundamental identity, which links trigonometry with exponential functions, is the Euler formula

$$ e^{i\theta} = \cos \theta + i \sin \theta \tag{23} $$

which expresses a complex number lying on the unit circle as an exponential with imaginary exponent (footnote 9). When θ is left free to take any real value, the exponential (23) generates the so-called complex sinusoid.
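The Euler formula (23) can also be checked numerically. The following lines are only a sketch under stated assumptions: the grid of angles and the variable names are chosen here for illustration, and the check compares the two sides of (23) up to rounding errors rather than proving the identity.

```octave
% Numerical check of the Euler formula (23) on a grid of angles.
theta = 0:0.01:4*pi;                    % angles in radians (arbitrary grid)
lhs = exp(1i * theta);                  % complex exponential e^{i*theta}
rhs = cos(theta) + 1i * sin(theta);     % trigonometric form

max_err = max(abs(lhs - rhs));          % largest deviation between the two sides
max_mag_dev = max(abs(abs(lhs) - 1));   % deviation of the magnitude from 1

fprintf('max |lhs - rhs|        = %g\n', max_err);
fprintf('max | |lhs| - 1 |      = %g\n', max_mag_dev);
```

Both printed quantities should be at the level of the machine rounding error, which is consistent with the points e^{iθ} lying on the unit circle.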
Footnote 9: The actual meaning of the exponential comes from the series expansion (20).

Figure 8: Trigonometric functions (panels: sine, cosine, and tangent of an angle, plotted against the angle in radians)

Any complex number c having magnitude ρ and argument θ can be represented in compact form as

$$ c = \rho e^{i\theta} \tag{24} $$

and to it we can apply the usual rules of power functions. For instance, we can compute the mth power of c as

$$ c^m = \rho^m e^{im\theta} = \rho^m (\cos m\theta + i \sin m\theta) \tag{25} $$

thus showing that it is obtained by taking the mth power of the magnitude and multiplying the argument by m. Formula (25) is called the De Moivre formula.

The order-m root of a number c is a number b such that $b^m = c$. In general, a complex number admits m distinct complex roots of order m (footnote 10). The De Moivre formula establishes that (footnote 11) the order-m roots of 1 are evenly distributed along the unit circle, starting from 1 itself, and they are separated by a constant angle 2π/m.

Footnote 10: For instance, 1 admits two square roots (1 and −1) and four order-4 roots (1, −1, i, −i).

At this point, we propose some problems for the reader:

• Prove the following identities, which are corollaries of the Euler identity:

$$ \cos \theta = \frac{e^{i\theta} + e^{-i\theta}}{2} \tag{26} $$

$$ \sin \theta = \frac{e^{i\theta} - e^{-i\theta}}{2i} \tag{27} $$

• Prove the "most beautiful formula in mathematics" [59]:

$$ e^{i\pi} + 1 = 0 \tag{28} $$

• Prove, by means of the De Moivre formula, the following identities:

$$ \cos 2\theta = \cos^2 \theta - \sin^2 \theta \tag{29} $$

$$ \sin 2\theta = 2 \sin \theta \cos \theta \tag{30} $$

• Prove, by the representation of unit-magnitude complex numbers $e^{i\theta}$, that the following identities are true:

$$ \cos(\theta + \phi) = \cos \theta \cos \phi - \sin \theta \sin \phi \tag{31} $$

$$ \sin(\theta + \phi) = \cos \theta \sin \phi + \sin \theta \cos \phi \tag{32} $$

A.7 Derivatives and Integrals

A.7.1 Derivatives of Functions
Given the function y = f(x) (for the moment, we only consider functions of one variable), it might be interesting to find the places where local maxima and minima are located. It is natural, in such a search, to focus on the slope of the line that is tangent to the function curve, in such a way that local maxima and minima are found where the slope of the tangent is zero (i.e., the tangent is horizontal). This operation is possible for all regular functions, which are functions without discontinuities and without sharp corners. Given this assumption of regularity, the slope of the curve can be defined at any point, thus becoming itself a function of the same independent variable. This function is called derivative and is indicated with

$$ y' = \frac{dy}{dx} \tag{33} $$

Footnote 11: The reader is invited to justify this statement by an example. The simplest nontrivial example is obtained by considering the cubic roots of 1.

The notation (33) recalls how the local slope of a curve can be computed: the tangent line is drawn, two distinct points are taken on this line, and the ratio between the differences of the coordinates y and x of the points is formed. As we have already seen in appendix A.6, this operation corresponds to the computation of the trigonometric tangent, whose argument is the angle formed by the tangent line with the horizontal axis. This observation should make the terminology clearer.

In fig. 9 the polynomial $y = f(x) = 4 + 3x + 2x^2 - x^3$ is plotted for x ∈ [−4, 4], together with its derivative. As we can see, the derivative is positive where f(x) is increasing, negative where f(x) is decreasing, and zero where f(x) has a local extremal point.
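The idea of the derivative as a ratio of increments can be tried out numerically. The following is a minimal sketch, not the method used to produce fig. 9: it approximates the slope of the tangent by the slope of a secant through two nearby points. The evaluation point x0, the increment dx, and the function handles are assumptions introduced here, and the closed-form derivative 3 + 4x − 3x² is simply taken for granted as a reference value.

```octave
% Derivative as a ratio of increments, for y = 4 + 3x + 2x^2 - x^3.
f  = @(x) 4 + 3*x + 2*x.^2 - x.^3;   % the polynomial plotted in fig. 9
fp = @(x) 3 + 4*x - 3*x.^2;          % its derivative, written by hand

x0 = 1.0;     % point where the slope is evaluated (arbitrary choice)
dx = 1e-6;    % small increment (arbitrary choice)

% slope of the secant through two points placed symmetrically around x0
slope_numeric = (f(x0 + dx) - f(x0 - dx)) / (2*dx);
slope_exact   = fp(x0);

fprintf('numeric slope: %.6f, reference slope: %.6f\n', ...
        slope_numeric, slope_exact);
```

As dx is reduced (within the limits of floating-point precision), the secant slope approaches the reference value, which is the limiting process hinted at by the notation dy/dx.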
Figure 9: A degree-3 polynomial and its derivative

The Octave/Matlab script used to produce fig. 9 is the following:

    x = [-4:0.01:4];            % domain
    poli = [-1 2 3 4];          % coefficients of a degree-3 polynomial
    y = polyval(poli, x);       % evaluation of the polynomial
    % coefficients of the derivative of the polynomial
    % polid = polyderiv(poli);  % Octave only
    polid = poli(1:length(poli)-1).*[length(poli)-1:-1:1];  % Matlab only
                                % (polyderiv is not available)
    yp = polyval(polid, x);     % evaluation of the derivative
    plot(x, y, '-'); hold on;
    plot(x, yp, '--'); hold off;
    ylabel('y, y`'); xlabel('x'); title('y(x), dy/dx');
    grid;
    % replot; % Octave only

In the script there are two new directives. The first one is the function invocation polyval(poli, x), which returns the vector of values taken by the polynomial, whose coefficients are specified in poli, in correspondence with the points specified in x. The second directive is the function invocation polyderiv(poli), which returns the coefficients of the polynomial that is the derivative of poli. This function is not available in Matlab, but it can be replaced by an explicit calculation, as indicated in the script. The function polyder, which is available in both Octave and Matlab, returns the same coefficients.

The fact that the derivative of a polynomial is still a polynomial is ensured by the derivation rules of calculus. Namely, the derivative of a monomial is a lower-degree monomial given by the rule

$$ \frac{d(ax^n)}{dx} = a n x^{n-1} \tag{34} $$

The derivative is a linear operator, i.e.,

• The derivative of a sum of functions is the sum of the derivatives of the single functions
• The derivative of a product of a function by a constant is the product of the constant by the derivative of the function

Another important property of the derivative is that it transforms the composition of functions into a product of functions. Given two functions y = f(x) and z = g(y), the composed function z = g(f(x)) is obtained by replacing the domain of the second function with the codomain of the first one (footnote 12). The derivative of the composed function is expressed as

$$ \frac{dz}{dx} = g'(y) f'(x) = \frac{dz}{dy} \frac{dy}{dx} \tag{35} $$

which shows the effectiveness of the notation introduced for the derivatives.

Footnote 12: For instance, log x² is obtained by squaring x and then taking the logarithm or, by the property L3 of logarithms, ...

For the purpose of this book, it is useful to know the derivatives of the main trigonometric functions, which are given by

$$ \frac{d \sin x}{dx} = \cos x \tag{36} $$

$$ \frac{d \cos x}{dx} = -\sin x \tag{37} $$

$$ \frac{d \tan x}{dx} = \frac{1}{\cos^2 x} \tag{38} $$

Therefore, we can say that a sinusoidal function conserves its sinusoidal character (it is only translated along the x axis) when it is subject to derivation. This property comes from the fact, already anticipated, that the exponential with base e is an eigenfunction for the derivative operator, i.e.,

$$ \frac{de^x}{dx} = e^x \tag{39} $$

If we consider the complex exponential $e^{ix}$ as the composition of an exponential function with a monomial with imaginary coefficient, it is possible to apply the rule (35) and the linearity of the derivative to the composed function and derive the formulas (36) and (37). In order to derive (38) we also have to know the rule to derive quotients of functions. In general, products and quotients of functions are derived according to

$$ \frac{d[f(x)g(x)]}{dx} = f'(x)g(x) + f(x)g'(x) \tag{40} $$

$$ \frac{d[g(x)/f(x)]}{dx} = \frac{g'(x)f(x) - f'(x)g(x)}{f^2(x)} \tag{41} $$

A.7.2 Integrals of Functions
For the purpose of this book, it is sufficient to informally describe the definite integral of a function f(x), x ∈ R, as the area delimited by the function curve and the horizontal axis in the interval between two edges a and b (see fig. 10). The area has to be considered negative where the curve stays below the axis, and positive where it stays above the axis. The definite integral is represented in compact notation as

$$ \int_a^b f(x)\,dx \tag{42} $$

and it takes real values.
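The interpretation of (42) as a signed area can be made concrete with a small numerical experiment. The sketch below is only an illustration under stated assumptions: the integrand is the polynomial of appendix A.7.1, the edges a and b and the number of rectangles N are arbitrary choices, and the area is approximated by summing thin rectangles whose heights are taken at the midpoints of their bases.

```octave
% Definite integral as a signed area, approximated by thin rectangles.
f = @(x) 4 + 3*x + 2*x.^2 - x.^3;   % integrand (assumed for illustration)
a = -1; b = 2;                      % integration edges (arbitrary choice)
N = 100000;                         % number of rectangles (arbitrary choice)

w  = (b - a) / N;                   % width of each rectangle
xm = a + w * ((1:N) - 0.5);         % midpoints of the rectangle bases
area = sum(f(xm)) * w;              % sum of the rectangle areas

fprintf('approximate integral over [%g, %g]: %.6f\n', a, b, area);
```

For this integrand and these edges the exact value, obtained by integrating each monomial, is 18.75; increasing N brings the sum of rectangles closer to it.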
Figure 10: Integral defined as an area

In order to compute an integral we can use a limiting procedure, by approximating the curve with horizontal segments and computing an approximation of the integral as the sum of the areas of rectangles. If the segment width approaches zero, the computed integral converges to the actual measure.

There is a symbolic approach to integration, which is closely related to function derivation. First of all, we observe that the properties of linear operators hold for integrals:

• The integral of a sum of functions is the sum of the integrals of the single functions
• The integral of a product of a function by a constant is the product of the constant by the integral of the function.

Then, we generalize the integral operator in such a way that it doesn't give a single number but a whole function. In order to do that, the first integration edge is kept fixed, and the second one is left free on the x axis. This newly defined operator is called indefinite integral and is indicated with

$$ F(x) = \int_a^x f(u)\,du \tag{43} $$

The argument of function f(), also called integration variable, has been called u to distinguish it from the argument of the integral function F(). The brilliant intuition that came to Newton and Leibniz in the 17th century, and that opened the way to a great deal of modern mathematics and science, was that derivative and integral are inverse operations and, therefore, they are reversible. This idea is translated into a remarkably simple formula:

$$ F'(x) = f(x) \tag{44} $$

which is valid for regular functions. The reader can justify (44) intuitively by thinking of the derivative of F(x) as a ratio of increments. The increment at the numerator is given by the difference of two areas obtained by shifting the right edge by dx. The increment at the denominator is dx itself. If m is the average value taken by f() in the interval having length dx, the difference of areas is m dx, and m converges to f(x) as dx approaches zero.

F(x) is also called a primitive function of f(x), where the article "a" hints at the property that indefinite integrals can differ by a constant. This is due to the fact that the derivative of a constant is zero, and it justifies the fact that the position of the first integration edge doesn't come into play in the relationship (44) between a function and its primitive.

At this point, it is easy to be convinced that the availability of a primitive F(x) for a function f(x) allows us to compute the definite integral between any two edges a and b by the formula
$$ \int_a^b f(u)\,du = F(b) - F(a) \tag{45} $$

We encourage the reader to find the primitive functions of polynomials, sinusoids, and exponentials. To acquire better familiarity with the techniques of derivation and integration, the reader without a background in calculus is referred to chapter VIII of the book [25].

A.8 Transforms

The analysis and manipulation of functions can be very troublesome operations. Mathematicians have tried to find alternative ways of expressing functions and operations on them. This research has produced some transforms which, in many cases, make it easier to study and manipulate some classes of functions.

A.8.1 The Laplace Transform
The Laplace Transform was introduced in order to simplify differential calculus. The Laplace transform of a function y(t), t ∈ R, is defined as a function of the complex variable s:

$$ Y_L(s) = \int_{-\infty}^{+\infty} y(t) e^{-st}\,dt, \quad s \in \Gamma \subset \mathbb{C} \tag{46} $$

where Γ is the region where the integral is not divergent. The region Γ is always a vertical strip in the complex plane, and within this strip the transform can be inverted with

$$ y(t) = \frac{1}{2\pi j} \int_{\sigma - j\infty}^{\sigma + j\infty} Y_L(s) e^{st}\,ds, \quad t \in \mathbb{R} \tag{47} $$

The edges of the integration (47) indicate that the integration is performed along a vertical line with abscissa σ.

Example 1. The most important transform for the scope of this book is that of the causal complex exponential function, which is defined as

$$ y(t) = \begin{cases} e^{s_0 t} & t \geq 0 \\ 0 & t < 0 \end{cases}, \quad s_0 \in \mathbb{C} \tag{48} $$

Such transform is calculated as (footnote 13)
$$ Y_L(s) = \int_{-\infty}^{+\infty} y(t) e^{-st}\,dt = \int_0^{+\infty} e^{s_0 t} e^{-st}\,dt = \int_0^{+\infty} e^{-(s - s_0)t}\,dt = -\frac{1}{s - s_0} \left( e^{-(s - s_0)\infty} - e^{-(s - s_0)0} \right) = \frac{1}{s - s_0} \tag{49} $$

and it is convergent for those values of s having real part that is larger than the real part of $s_0$. We have seen in appendix A.7 that the exponential function is an eigenfunction for the operators derivative and integral, which are fundamental for the description of physical systems. Therefore, we can easily understand the practical importance of the transform (49).
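The closed form (49) can be compared with a numerical approximation of the integral. The lines below are a sketch under stated assumptions: the values of s0 and s, the truncation time T that replaces the infinite upper edge, and the time step are all arbitrary choices made here, with the real part of s larger than the real part of s0 so that the integrand decays.

```octave
% Numerical check of the transform (49) of the causal complex exponential.
s0 = -1 + 2i;               % exponent of the signal (assumed value)
s  = 0.5 + 1i;              % evaluation point, with real(s) > real(s0)

T = 40;                     % truncation of the infinite integration interval
t = 0:1e-4:T;               % time grid (assumed step)
integrand = exp(s0*t) .* exp(-s*t);

Y_numeric = trapz(t, integrand);   % trapezoidal approximation of the integral
Y_exact   = 1 / (s - s0);          % closed form from (49)

fprintf('|Y_numeric - Y_exact| = %g\n', abs(Y_numeric - Y_exact));
```

If s is instead chosen with a real part smaller than that of s0, the integrand grows with t and the truncated integral keeps increasing with T, which is the numerical counterpart of leaving the region of convergence.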
Footnote 13: In a rigorous treatment, the notation $e^{-(s - s_0)\infty}$ should be replaced by a limiting operation for t → ∞.

###

A central property of the Laplace transform is given by the transformation of the derivative operator into a multiplication by s:

$$ \frac{dy(t)}{dt} \leftrightarrow s Y_L(s) - [y(0)] \tag{50} $$

where the term within square brackets is the initial value in the case that y(t) is a causal function, i.e., y(t) = 0 for any t < 0. Conversely, the integral is converted into a division by the complex variable s:
$$ \int_{-\infty}^{t} y(u)\,du \leftrightarrow \frac{1}{s} Y_L(s) \tag{51} $$

Since physics describes systems by means of equations containing derivatives and integrals, these equations can be transformed into polynomial equations by means of the Laplace transform, and the calculus turns out to be simplified.

Example 2. Newton's second law states that, for a body having mass m, the relationship among force f, mass, acceleration a, displacement x, and time t can be expressed by

$$ f = ma = m \frac{d^2 x}{dt^2} \tag{52} $$

where the notation $\frac{d^2 x}{dt^2}$ indicates a second derivative, i.e., the derivative applied twice. The relation (52) is Laplace-transformed into the polynomial equation

$$ F_L(s) = s^2 m X_L(s) - [s m x(0) + m x'(0)] \tag{53} $$

where the term within square brackets is determined by the initial conditions of displacement and velocity at time 0.

###

A.8.2 The Fourier Transform

The Fourier transform of y(t), t ∈ R, can be obtained as a specialization of the Laplace transform in the case that the latter is defined in a region comprising the imaginary axis. In such case we define (footnote 14)

$$ Y(\Omega) = Y_L(j\Omega) \tag{54} $$

or, in detail,

$$ Y(\Omega) = \int_{-\infty}^{+\infty} y(t) e^{-j\Omega t}\,dt \tag{55} $$

where jΩ indicates a generic point on the imaginary axis.

Footnote 14: Often the Fourier transform is defined as a function of f, where 2πf = Ω.

Since the kernel of the Fourier transform is the complex sinusoid (i.e., the complex exponential) having radian frequency Ω, we can interpret each point of the transformed function as a component of the frequency spectrum of the function y(t). In fact, given a value Ω = Ω_0 and considering a signal that is the complex sinusoid $y(t) = e^{j\Omega_1 t}$, the integral (55) is maximized when choosing Ω_0 = Ω_1, i.e., when y(t) is the complex conjugate of the kernel (footnote 15). The codomain of the transformed function Y(Ω) belongs to the complex field. Therefore, the spectrum can be decomposed into a magnitude spectrum and a phase spectrum.

A.8.3 The Z Transform

The domains of functions can be classes of numbers of whatever kind and nature. If we stick with functions defined over rings, particularly important are the functions whose domain is the ring of integer numbers. These are called discrete-variable functions, to distinguish them from functions of variables defined over R or C, which are called continuous-variable functions. For discrete-variable functions the operators derivative and integral are replaced by the simpler operators difference and sum. This replacement brings a new definition of transform for a function y(n), n ∈ Z:
$$ Y_Z(z) = \sum_{n=-\infty}^{+\infty} y(n) z^{-n}, \quad z \in \Gamma \subset \mathbb{C} \tag{56} $$

The transform (56) is called Z transform and the region of convergence is a ring (footnote 16) of the complex plane. Within this ring the transform can be inverted.

Footnote 15: Exercise: find the Fourier transform of the causal complex exponential (48), with $s_0 = \alpha + j\Omega_0$, and show that it has maximum magnitude for Ω = Ω_0.

Footnote 16: A ring here is the area between two circles and not an algebraic structure.

Example 3. The Z transform of the discrete-variable causal exponential (i.e., $y(n) = e^{z_0 n}$ for n ≥ 0 and y(n) = 0 for n < 0) is (footnote 17)
$$ Y_Z(z) = \sum_{n=-\infty}^{+\infty} y(n) z^{-n} = \sum_{n=0}^{+\infty} e^{z_0 n} z^{-n} = \sum_{n=0}^{+\infty} \left( e^{z_0} z^{-1} \right)^n = \frac{1}{1 - e^{z_0} z^{-1}} \tag{57} $$

and it is convergent for values of z that are larger than $e^{\Re(z_0)}$ in magnitude (footnote 18).

Similarly to what we saw for continuous-variable functions, the Fourier transform for discrete-variable functions can be obtained as a specialization of the Z transform where the values of the complex variable are restricted to the unit circle:

$$ Y(\omega) = Y_Z(e^{j\omega}) \tag{58} $$

or, in detail,
$$ Y(\omega) = \sum_{n=-\infty}^{+\infty} y(n) e^{-j\omega n} \tag{59} $$

In this book, we use the symbol ω for the radian frequency in the case of discrete-variable functions, leaving Ω for the continuous-variable functions.

###

Footnote 17: The latter equality in (57) is due to the identity $\sum_{n=0}^{+\infty} a^n = \frac{1}{1 - a}$, |a| < 1, which can be verified by the reader with a = 1/2.

Footnote 18: $\Re(x)$ is the real part of the complex number x.

A.9 Computer Arithmetics

A.9.1 Integer Numbers

In order to fully understand the behavior of several hardware and software tools for sound processing, it is important to know something about the internal representation of numbers within computer systems. Numbers are represented as strings of binary digits (0 and 1), but the specific meaning of the string depends on the conventions used. The first convention is that of unsigned integer numbers, whose value is computed, in the case of 16 bits, by the following formula
$$ x = \sum_{i=0}^{15} x_i \times 2^i \tag{60} $$

where $x_i$ is the ith binary digit, counting from the right and starting from i = 0. The binary digits are called bits, the rightmost digit is called the least significant bit (LSB), and the leftmost digit is called the most significant bit (MSB). For instance, we have

$$ 0100001100100110_2 = 2^1 + 2^2 + 2^5 + 2^8 + 2^9 + 2^{14} = 17190 \tag{61} $$

where the subscript 2 indicates the binary representation, while the usual decimal representation is indicated with no subscript.

The leftmost bit is often interpreted as a sign bit: if it is set to one it means that the sign is minus and the absolute value is given by the bits that follow. However, this is not the representation that is used for the signed integers. For these numbers the two's complement representation is used, where the leftmost bit is still a sign bit, but the absolute value of a negative number is recovered by bitwise complementation of the following bits, interpretation of the result as a positive integer, and addition of one. For instance, with four bits we have

$$ 1010_2 = -(0101_2 + 1) = -(5 + 1) = -6 \tag{62} $$

The two's complement representation has the following advantages:

• there is only one representation of the zero (footnote 19)
• it has a cyclic structure: a unit increment of the largest representable positive number gives the negative number with the largest absolute value
• the sums between signed numbers are performed by simple bitwise operations, without caring about the sign (a carry on the left can be ignored)

Footnote 19: Vice versa, the sign and magnitude representation has one positive and one negative zero.

We note that

• the negative number with the largest absolute value is $100 \dots 0_2$. Its absolute value exceeds that of the largest positive number (i.e., $011 \dots 1_2$) by one
• the negative number with the smallest absolute value is represented by $111 \dots 1_2$
• the range of the numbers representable in two's complement with 16 bits is $[-2^{15}, 2^{15} - 1] = [-32768, 32767]$
• the range of the numbers representable in two's complement with 8 bits is $[-2^7, 2^7 - 1] = [-128, 127]$

Often, in computer memory, words and addresses are organized as collections of 8-bit packets, called bytes. Therefore, it is useful to use a representation where the bits are considered in packets of four units, each packet taking integer values from 0 to 15. This representation is called hexadecimal and, for the numbers between 10 and 15, it uses the hexadecimal "digits" A, B, C, D, E, F. For instance, a 16-bit binary number can be represented as

$$ 0100101100100110_2 = \mathrm{4B26}_{16} \tag{63} $$

A.9.2 Rational Numbers
We have two alternative possibilities to represent rational non-integer numbers:

• fixed point
• floating point

The fixed point representation is similar to the representation of integer numbers, with the difference that we have a decimal point at a prescribed position. The digits are divided into two sets: the integer part and the fractional part. The 16-bit representation, without sign and with 3 bits of integer part, is

$$ x = \sum_{i=-13}^{2} x_i \times 2^i \tag{64} $$

and is obtained by multiplication of the integer number on 16 bits by $2^{-13}$. In the two's complement representation, the operations can be done without caring about the position of the decimal point, as if we were operating on integer numbers. Often, the rational numbers are considered to be normalized to one, i.e., to be limited to the range [−1, 1). In such a case, the decimal point is placed before the leftmost binary digit.

For the floating point representation we can follow different conventions. In particular, the IEEE 754 floating-point single-precision numbers obey the following rules:

• the number is represented as

$$ 1.xx \dots x_2 \times 2^{yy \dots y_2} \tag{65} $$

  where x are the binary digits of the mantissa and y are the binary digits of the exponent
• the number is represented on 32 bits according to the following block decomposition
  – bit 31: sign bit
  – bits 23–30: exponent yy...y in biased representation (footnote 20), from the most negative 00...0 to the most positive 11...1
  – bits 0–22: mantissa in unsigned binary representation

The IEEE 754 standard of double-precision floating-point numbers uses 11 bits for the exponent and 52 bits for the mantissa.

It should be clear that both the fixed- and the floating-point representations take a subset of rational numbers. Fixed-point numbers are equally spaced between the minimum and the maximum representable value, with a quantization step equal to $2^{-d}$, where d is the number of digits on the right of the decimal point. Floating-point numbers are unevenly distributed, being more sparse for large values of the exponent and more dense for small exponents. Floating-point numbers can represent a large range of magnitudes, from about $1.2 \times 10^{-38}$ to about $3.4 \times 10^{38}$ in single precision, and from about $2.2 \times 10^{-308}$ to about $1.8 \times 10^{308}$ in double precision (normalized numbers). Therefore, it is possible to do many computations without worrying about errors due to overflow.
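The uneven spacing and the range of floating-point numbers described above can be inspected directly in Octave/Matlab. The lines below are a minimal sketch: the sample values 1 and 1e6 are arbitrary choices made here, and the functions eps and realmax are used only to read out the gap to the next representable number and the largest finite number.

```octave
% Spacing and range of IEEE 754 floating-point numbers.
gap_near_one     = eps(single(1));     % gap between 1 and the next single
gap_near_million = eps(single(1e6));   % gap near one million: much coarser
max_single = realmax('single');        % largest finite single-precision number
max_double = realmax('double');        % largest finite double-precision number

fprintf('single-precision gap near 1:   %g\n', gap_near_one);
fprintf('single-precision gap near 1e6: %g\n', gap_near_million);
fprintf('largest single: %g, largest double: %g\n', max_single, max_double);
```

The gap near 1 equals $2^{-23}$, in agreement with the 23 mantissa bits of the single-precision format, while the gap near one million is several orders of magnitude larger.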
Moreover, the high density of small numbers reduces the problems due to the quantization step. This is paid for in terms of a more complicated arithmetic.

Footnote 20: The bias is 127. Therefore, the exponent 1 is coded as $1 + 127 = 128 = 10000000_2$. The biased representation simplifies the bit-oriented sorting operations.

Appendix B

Tools for Sound Processing
(with Nicola Bernardini)

Audio signal processing is essentially an engineering discipline. Since engineering is about practical realizations, the discipline is best taught using real-world tools rather than special didactic software. At the roots of audio signal processing there are mathematics and computational science: therefore we strongly recommend using one of the advanced mathematical software packages available off the shelf. In particular, we have experience of teaching with Matlab, or with its Free Software counterpart Octave (footnote 1). Even though much of the code can be ported from Matlab to Octave with minor changes, there can still be some significant advantage in using the commercial product. However, Matlab is expensive and every specialized toolbox is sold separately, even though a less expensive student edition is available. On the other hand, Octave is free software distributed under the GNU public license. It is robust and highly integrated with other tools such as Emacs for editing and GNUPlot for plotting.

Footnote 1: http://www.octave.org

For actual sound applications, there are at least three other categories of software for sound synthesis that are worth considering: languages for sound processing, interactive graphical building environments, and inline sound editors.

When sound applications are targeted to the market of information appliances, it is likely that the processing algorithms will be implemented on low-cost hardware specifically tailored for typical signal-processing operations. Therefore, it is also useful to look at how signal-processing chips are usually structured.

B.1 Sounds in Matlab and Octave
In Octave/Matlab, monophonic sounds are simply one-dimensional vectors (rows or columns), so that they can be transformed by means of matrix algebra, since vectors are first-class variables. In these systems, the computations are vectorized, and the gain in efficiency is high whenever looped operations on matrices are transformed into compact matrix-algebra notation [9]. This peculiarity is sometimes difficult for students to assimilate, but the theory of matrices needed in order to start working is really limited to the basic concepts and can be condensed into a two-hour lecture.

Processing in Octave/Matlab usually proceeds using monophonic sounds, as stereo sounds are simply seen as pairs of vectors. It is necessary to make clear what the sound sample rate is at each step, i.e., how many samples are needed to produce one second of sound.

Let us give an example of how we can create a 440 Hz sinusoidal sound, lasting 2 seconds, and using the sample rate Fs = 44100 Hz:

    f  = 440;     % pitch in Hz
    Fs = 44100;   % sample rate in Hz
    l  = 2;       % sound length in seconds
    Y  = sin(2*pi*f/Fs*[0:Fs*l]);   % sound vector

The sound is simply defined by application of the function sin() to a vector of Fs*l + 1 elements (namely, 88201 elements) containing an increasing ramp, suitably scaled so that f cycles are represented in Fs samples.

Once the sound vector has been defined, one may like to listen to it. On this point, Matlab and Octave present different behaviors, also dependent on the machine and operating system where they are running. Matlab offers the function sound() that receives as input the vector containing the sound and, optionally, a second parameter indicating the sample rate. Without the second parameter, the default sample rate is 8192 Hz. Up to version 4.2 of Matlab, the number of reproduction bits was 8 on an Intel-compatible machine. More recent versions of Matlab reproduce sound vectors using 16 bits of sample resolution.
In order to reproduce the sound vector Y defined earlier we should write

    sound(Y, Fs);

Up to now, in the core Octave distribution the function that produces sounds from the Octave interpreter is playaudio(), which can receive "filename" and "extension" as the first and second argument, respectively. The extension contains information about the audio file format, but so far only the formats raw data linear and mu-law are supported. Alternatively, the argument of playaudio can be a vector name, such as Y in our example. The reproduction is done at 8 bits and 8192 Hz, but it would be easy to modify the function so that it can use better quantizations and sample rates. Fortunately, there is the octave-forge project 2, which contains useful functions for Octave that are not in the main distribution. In the audio section we notice the following interesting functions (quoting from the help lines):

sound(x [, fs])
    Play the signal through the speakers. Data is a matrix with one column per channel. Rate fs defaults to 8000 Hz. The signal is clipped to [-1, 1].

soundsc(x, fs, limit) or soundsc(x, fs, [ lo, hi ])
    Scale the signal so that [min(x), max(x)] → [-1, 1], then play it through the speakers at 8000 Hz sampling rate. The signal has one column per channel.

[x, fs, sampleformat] = auload('filename.ext')
    Reads an audio waveform from a file. Returns the audio samples in data, one column per channel, one row per time slice. Also returns the sample rate and stored format (one of ulaw, alaw, char, short, long, float, double). The sample value will be normalized to the range [-1,1) regardless of the stored format. This does not do any level correction or DC offset correction on the samples.

ausave('filename.ext', x, fs, format)
    Writes an audio file with the appropriate header. The extension on the filename determines the layout of the header. Currently supports .wav and .au layouts. Data is a matrix of audio samples, one row per time step, one column per channel. Fs defaults to 8000 Hz. Format is one of ulaw, alaw, char, short, long, float, double.

2 http://www.sourceforge.net

B.1.1 Digression

In Matlab versions older than 5, the function sound had a bug that is worth analyzing because it sheds some light on risks that may be connected with the internal representations of integer numbers. Let us construct a sound as a random sequence of numbers having values 1 and −1:

    Fs = 8192;
    W = rand(size(0:Fs)) - 0.5;
    for i = 1:length(W)
      if (W(i) > 0)
        W(i) = 1.0;
      else
        W(i) = -1.0;
      end;
    end;

In order to be convinced that such a sound is a spectrally rich noise we can plot its spectrum, which would look like that of fig. 1. Surprisingly enough, in old Matlab versions on Intel-compatible architectures, if the sound W was played using sound(W) the audio outcome was, at most, a couple of clicks corresponding to the start and end transients.
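The silent playback is consistent with a wraparound in 8-bit two's complement arithmetic. The Python sketch below models that mechanism under stated assumptions: the helper name to_int8_wrapped is hypothetical, and the actual conversion code inside old Matlab versions is not known here. The sketch only assumes that a sample is multiplied by 128, truncated to its integer part, and stored on 8 bits with wraparound.

```python
def to_int8_wrapped(x):
    """Model of an 8-bit conversion: scale by 128, truncate, wrap to -128..127.

    Assumed behaviour for illustration only, not a documented Matlab routine.
    """
    n = int(x * 128)               # integer part of the product
    return (n + 128) % 256 - 128   # two's complement wraparound on 8 bits

# +1.0 and -1.0 collapse onto the same code; intermediate values do not.
codes = [to_int8_wrapped(v) for v in (1.0, -1.0, 0.5, -0.5)]

# Dividing by 1.1 first (i.e. treating the range as [-1.1, 1.1]) avoids the overflow.
rescaled = [to_int8_wrapped(v / 1.1) for v in (1.0, -1.0)]
```

In this model both +1.0 and −1.0 map to −128, so a sequence that jumps randomly between them becomes constant, whereas after the division by 1.1 the two values remain distinct (116 and −116).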
Figure 1: Spectrum of a random 1 and −1 sequence (magnitude in dB versus frequency in Hz)

The missing sound can be explained by considering that, on 8 bits, 256 quantization levels can be represented. A number between −1.0 and +1.0 is recast into the 8-bit range by taking the integer part of its product by 128. The problem is that, when the resulting integer number is represented in two's complement, the number +1.0 is not representable since, on 8 bits, the largest positive number that can be represented is 127. Due to the circularity of two's complement representation, the multiplication 1.0 × 128 produces the number −128, which is also the representation of −1.0. Therefore, the audio device sees a constant sequence of numbers equal to the most negative representable number, and it does not produce any sound, except for the transients due to the initial and final steps. Once the problem had been discovered and understood, the user could circumvent it by rescaling the signal in a slightly larger range, e.g., [−1.1, 1.1].

In the Matlab environment the acquisition and writing of sound files from and to the disk is done by means of the functions auread(), auwrite(), wavread(), and wavwrite(). The former pair of functions works with files in au format, while the latter pair works with files in the popular wav format. In earlier versions of Matlab (before version 5) these functions only dealt with 8-bit files, thus precluding high-quality audio processing. For users of old Matlab versions, two routines are available for reading and writing 16-bit wav files, called wavr16.m and wavw16.m, written by F. Caron and modified to ensure Octave compatibility. An example of usage for wavr16() is

    [L,R,format] = wavr16('audiofile.wav')

which returns the left and right channels of the file audiofile.wav in the L and R vectors, respectively. The two vectors are identical if the file is monophonic.
The returned vector format has four components containing format information: the kind of encoding (only linear PCM is recognized), the number of channels, the sample rate, and the number of quantization bits. An example of invocation of the function wavw16() is

    wavw16('audiofile.wav', M, format)

where format is, again, a four-component vector containing format information, and M is a one- or two-column matrix containing the channels to be written in a monophonic or stereophonic file.

Since sounds are handled as one-dimensional vectors, sound processing can be reduced in most cases to vector operations. Iterative, sample-by-sample processing is quite inefficient with interpreters such as Octave or Matlab, which are optimized to handle matrices. As an example of elementary processing, consider a simple smoothing operation, obtained by replacing each input sound sample with the average of itself and the following sample. Here is a script that does this operation in Octave, after having loaded a monophonic sound file:

    [L,R,format] = wavr16('ma1.wav');
    S = (L + [L(2:length(L)); 0]) / 2;   % "smoothed" sound

The operation is expressed in a very compact way by summing the vector L with the vector itself left-shifted by one position 3. The smoothing operation may be expressed iteratively as follows:

    [L,R,format] = wavr16('ma1.wav');
    S = L/2;
    for i = 1:length(L)-1
      S(i) = (L(i) + L(i+1))/2;
    end;

The code turns out to be less compact but, probably, more easily understandable. However, the running time is significantly higher because of the for loop.

In the Matlab environment, there is a collection of functions called the Signal Processing Toolbox. In the examples of this book we do not use those functions, preferring public-domain routines written for Octave, possibly modified to be usable within Matlab. One such function is stft.m, which provides a time-frequency representation of a signal.
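Returning for a moment to the smoothing operation, the equivalence between the shifted-sum formulation and the explicit loop can be checked on a short vector. The following Python sketch is an illustration only; the function names smooth_shifted and smooth_loop are hypothetical, and plain lists stand in for the Octave column vector L.

```python
def smooth_shifted(x):
    """Average each sample with the next one, using a left-shifted copy of x."""
    shifted = x[1:] + [0.0]            # last element set to zero, as in the Octave script
    return [(a + b) / 2 for a, b in zip(x, shifted)]

def smooth_loop(x):
    """Same operation written as an explicit sample-by-sample loop."""
    s = [v / 2 for v in x]             # the last sample is averaged with zero
    for i in range(len(x) - 1):
        s[i] = (x[i] + x[i + 1]) / 2
    return s

x = [1.0, 3.0, 5.0, 7.0]
print(smooth_shifted(x))
print(smooth_loop(x))
```

Both functions return the same values for this input. In an interpreter that is optimized for whole-vector operations, the first form is the one expected to run faster.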
The stft function can be useful for time-frequency processing and representation, as in the script

    SS = stft(S);
    mesh(20*log10(SS));

whose result is a 3D representation of the time-frequency behavior of the sound contained in S.

B.2 Languages for Sound Processing
In this section we briefly show how sounds are acquired and processed using languages that have been explicitly designed for sound and music processing. The most widely used language is probably Csound, developed by Barry Vercoe at the Massachusetts Institute of Technology and available since the mid-eighties. Csound is a direct descendant of the family of Music-N languages that Max Mathews created at the Bell Laboratories starting in the late fifties. In this family, the language of choice for most computer-music composers between the sixties and the eighties was Music V, which established a standard in the symbology of basic operators, called Unit Generators (UG).
3 The last element is set to zero to fill the blank left by the left-shift operation on L. The reader can extend the example in such a way that the input sound is overlapped and summed with its echo delayed by 200 ms.

According to the Music-N tradition, the UGs are connected as if they were modules of an analog synthesizer, and the resulting patch is called an instrument. The actual connecting wires are variables whose names are passed as arguments to the UGs. An orchestra is a collection of instruments. For every instrument, there are control parameters which can be used to determine the behavior of the instrument. These parameters are accessible to the interpreter of a score, which is a collection of time-stamped invocations of instrument events (called notes). Fig. 2 shows a schematic description of how Music-V-like languages work: a) is a Music-V source text 4 while b) is its graphical representation.

Figure 2: Music-V file description

The orchestra/score metaphor, the decomposition of an orchestra into non-interacting instruments, and the description of a score as a sequence of notes are all design decisions that were taken out of respect for a traditional view of music. However, many musical and synthesis processes do not fit well in such a metaphorical frame. As an example, consider how difficult it is to express modulation processing effects that involve several notes played by a single synthesis instrument (such as those played within a single violin bowing): it would be desirable to have the possibility of modifying the instrument state as a result of a chain of weakly synchronized events (that is, to perform some sort of per-thread processing). Instead, languages such as Music V rely on special initialization steps encoded within instruments to handle articulatory gestures involving several pitches.

4 picked up from [56, page 45]
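The orchestra/score organization described above can be paraphrased in a few lines of Python. This is a sketch under stated assumptions and not the way any Music-N language is actually implemented: the names SR, sine_instr, orchestra, score, and render are all hypothetical. An instrument is modeled as a function of its note parameters, the orchestra as a collection of instruments, and the score as a list of time-stamped notes that an interpreter mixes into one output signal.

```python
import math

SR = 8000  # sample rate in Hz, chosen only for this illustration

def sine_instr(dur, freq, amp):
    """A one-UG 'instrument': a sine oscillator lasting dur seconds."""
    n = int(round(dur * SR))
    return [amp * math.sin(2 * math.pi * freq * i / SR) for i in range(n)]

# The orchestra maps instrument numbers to instruments.
orchestra = {1: sine_instr}

# The score is a list of notes: (instrument, start time, duration, parameters...).
score = [
    (1, 0.00, 0.5, 440.0, 0.4),
    (1, 0.25, 0.5, 660.0, 0.4),
]

def render(score, orchestra):
    """Interpret the score: run each note's instrument and mix it at its start time."""
    total = max(note[1] + note[2] for note in score)
    out = [0.0] * int(round(total * SR))
    for instr, start, dur, *params in score:
        offset = int(round(start * SR))
        for i, v in enumerate(orchestra[instr](dur, *params)):
            if offset + i < len(out):      # guard against rounding at the end
                out[offset + i] += v
    return out

mix = render(score, orchestra)
```

Each note here is rendered in isolation and simply added to the mix, which is exactly the limitation discussed above: no state is carried from one note to the next.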
Other models have been proposed for dealing with less rigid descriptions of sound and music events. One such model is tied to the Nyquist language 5, developed by the team of Roger Dannenberg at Carnegie Mellon University [28]. This language provides a unified treatment of music and sound events and is based on functional programming (Lisp language). Algorithmic manipulations of symbols, processing of signals, and structured temporal modifications are all possible without leaving a consistent framework. In particular, Nyquist exploits the idea of behavioral abstraction, i.e., time-domain transformations are interpreted in an abstract sense and the details are encapsulated in descriptions of behaviors [27]. In other words, musical concepts such as duration, onset time, loudness, and time stretching are specified differently in different UGs. Modern compositional paradigms benefit from this unification of control signals, audio signals, behavioral abstractions and continuous transformations.

Placing some of the most widely used languages for sound manipulation along an axis representing flexibility and expressiveness, the lower end is probably occupied by Csound while the upper one is probably occupied by Nyquist. Another notable language which lies somewhere in between is Common Lisp Music 6 (CLM), which was developed by Bill Schottstaedt as an extension of Common Lisp [87]. If CLM is not too far from Nyquist (thanks to the underlying Lisp language), there is another language closer to the other end of the axis, which represents a "modernization" of Csound. The language is called SAOL 7 and it has been adopted as the formal specification of Structured Audio for the MPEG-4 standard [107]. SAOL orchestras and scores can be translated into C by means of the software translator SFRONT 8, developed by John Lazzaro and John Wawrzynek at UC Berkeley.
The simple examples that we are presenting in this book are written in Csound, and realizations in CLM and SAOL are presented for comparison.

5 http://www.cs.cmu.edu/~rbd/nyquist.html
6 http://www-ccrma.stanford.edu/software/clm/
7 http://www.saol.net
8 http://www.cs.berkeley.edu/~lazzaro/sa/

B.2.1 Unit generator

The UGs are primitive modules that produce, modify, or acquire audio or control signals. For audio signal production, particularly important primitives are those that read tables (table) and run an oscillator (oscil), while for producing control signals the envelope generators (line) are important. For sound modification, there are UGs for digital filters (reson) and time-domain processing, such as delays (delay). For sound acquisition, there are special UGs (soundin).

According to the Music-V tradition, several UGs can be connected to form complex instruments. The connections are realized by means of variables. In Csound the instruments are collected in a file called orchestra. The instrument parameters can be initialized by arguments passed at invocation time, called p-fields. Invocations of events on single instruments are considered to be notes, and they are collected in a second file, called score. The dichotomy between orchestra and score, as well as the subdivision of the orchestra into autonomous non-interacting entities called instruments, are design choices derived from a rather traditional view of music composition. We have already mentioned how certain kinds of operation with synthesis instruments do not fit well in this view.

The way the control and communication variables are handled in instruments made of several UGs is another crucial aspect in understanding the effectiveness of a computer-music language. In Csound, variables are classified as audio-rate variables and control-rate variables. The former can vary at audio rate, the latter are usually band-limited to a lower rate.
In this way it is possible to update the control variables at a lower rate, thus saving some computations. Following the treatment of Roads [78], such run-time organization is called block-oriented computation, as opposed to sample-oriented computation. This is not to say that block-oriented computations are vectorized, or intrinsically parallel on data blocks, but rather that control variables are not loaded in the machine registers at each audio cycle. The split of variables between audio rate and control rate does not offer any semantic benefit for the composer; it is only a way to reach higher computation speeds. Vice versa, sometimes the sound designer is forced to choose a control rate equal to the audio rate in order to avoid some artifacts. Namely, this occurs in computational structures with delayed feedback loops 9.
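The difference between updating a variable at audio rate and at control rate can be made concrete with a linear ramp. The Python sketch below is only an illustration of the idea and does not reproduce the internals of Csound; the function names line_audio_rate and line_control_rate, and the parameter ksmps (borrowed from Csound's name for the ratio between audio rate and control rate), are used here as assumptions.

```python
def line_audio_rate(start, end, n):
    """Ramp recomputed at every sample (sample-oriented computation)."""
    return [start + (end - start) * i / n for i in range(n)]

def line_control_rate(start, end, n, ksmps):
    """Ramp recomputed once per block of ksmps samples (block-oriented computation)."""
    out = []
    for block_start in range(0, n, ksmps):
        k = start + (end - start) * block_start / n      # one control update
        out.extend([k] * min(ksmps, n - block_start))    # value held for the whole block
    return out

a = line_audio_rate(1.0, 2.0, 1000)        # 1000 updates
k = line_control_rate(1.0, 2.0, 1000, 100) # only 10 updates, staircase-shaped
```

With 1000 samples and ksmps = 100 the control-rate ramp is computed 10 times instead of 1000, at the price of a staircase approximation of the envelope.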
9 Consider the case, brought to my attention by Gianantonio Patella, of a Csound instrument with a feedback delay line. Since the UG delay is iterated seamlessly for a number of times equal to the ratio between sample rate and control rate, the effective length of the delay turns out to be extended by such number of samples.

On the other hand, vectorized computations are an alternative way to arrange the operations, which in many cases can lead to compact and efficient code, as was shown in the smoothing example of section B.1. In the languages that we are considering there are UGs for time-frequency processing that operate on a frame basis. Typically, the operations on a single frame can be vectorized and we can have block-oriented computations when the control rate coincides with the frame rate.

Csound also presents a third family of variables, the initialization variables, whose value is computed only when a note starts in the score. In order to partially overcome the problems of articulation between different notes, Csound allows a note to be held (ihold), in such a way that the following note of the same instrument can be treated differently during the initialization (tigoto). For instance, these commands can be used to implement a smooth transition between notes, as in a legato.

An interesting aspect that has to be considered is how the sound-processing languages acquire pre-recorded material for processing. In Csound there is the primitive soundin, which acquires the samples from an audio file for as long as the note that invoked the instrument remains active. Alternatively, with the function table statement f, a table can be loaded with the content of an audio file, and such a table can be read later on by UGs such as table or oscil. This strategy makes it possible to perform important modifications, such as transposition, stretching, and grain extraction, already at the reading stage.

The Csound architecture, largely inherited from Music V, is more oriented toward sound synthesis than sound manipulation. For instance, a reverb continues to produce meaningful signal even when its input has ceased to be active, and this fact has produced the practice of calling the UG reverb by means of a separate instrument that takes its input from global orchestra variables. On the other hand, in CLM sound transformations are more clearly stated since any filter can have a sound file name among its parameters. For CLM, a reverb is any filter whose invocation is made explicit as an argument of a sound-generation function.

B.2.2 Examples in Csound, SAOL, and CLM

Let us face the problem of reading an audio fragment stored in the file "march.aiff" and processing it by means of a linearly-increasing transposition and a 100 ms echo. A Csound solution is found in the following orchestra and score files:

    ; sweep.orc
    sr     = 22000     ;audio rate
    kr     = 220       ;control rate
    ksmps  = 100       ;audio rate / control rate
    nchnls = 1         ;number of channels

            instr 1                    ;sound production
    ilt     =       ftlen(1)/sr        ;table length in seconds
    kfreq   line    1, p3, p4          ;linear envelope from 1 to p4 in p3 seconds
    gas     loscil  25000, kfreq/ilt, 1, 1/ilt, 0, 1, \
                    ftlen(1)           ;frequency-varying oscillator on table 1
            endin

            instr 2                    ;sound processing
    as      delay   gas, p4            ;p4 seconds delay on global
                                       ;variable gas
            out     p5*as + gas        ;input + delayed and
                                       ;attenuated signal
            endin

    ; sweep.sco
    ; table stored from sound file
    ; #  time  size        file          skip  format  chan
    f 1  0     1048576  1  "march.aiff"  0     0       1
    ; p1  p2  p3  p4   p5
    i 1   0   25  2.0          ;sound-production note
    i 2   0   25  0.1  1.0     ;sound-processing note

The code can be easily understood by means of the comments and by reference to the Csound manual [106]. We only observe that both the sound production and processing are activated by means of notes on different instruments. The communication between the two instruments is done by means of the global variable gas. The audio file is preliminarily loaded in memory by means of the statement f of the score file. The table containing the sound file is then read by instrument 1 using the UG loscil, which is a sort of sampling device where the reading speed and iteration points (loops) can be imposed.

To understand how SAOL is structurally similar to Csound but syntactically more modern, we propose some SAOL code for the same solution to our processing problem.
The orchestra is

    global {
      outchannels 1;
      srate 22000;
      krate 220;
      table tabl(soundfile, -1, "march.aiff");
      route (bus1, generator);
      //            delay  amplitude
      send (effect; 0.1,   1.0;      bus1);
    }

    instr generator(env_ext) {
      // env_ext: target point of the linear envelope
      // (from 1 to env_ext)
      ksig freq;
      asig signa;
      imports table tabl;
      ivar lentab;

      lentab = ftlen(tabl)/s_rate;   //table length in seconds
      freq = kline(1, dur, env_ext);
      signa = oscil(tabl, freq/lentab, 1);
      output(signa);
    }

    instr effect(del,ampl) {
      // del: echo delay in seconds
      // ampl: amplitude of the echo
      asig signa;

      signa = delay(input, del);
      output(input + ampl*signa);
    }

while the score reduces to the line

    0.00 generator 25.0 2.0

In SAOL, variable names, parameters, and instruments are handled more clearly. The block enclosed by the keyword global contains some features shared by all instruments in the orchestra, such as the sample and control rate, or the audio files that are accessed by means of tables. Moreover, this section contains a configuration of the audio busses where signals travel. In the example the generator instrument sends its output to the bus called bus1. From here, signals are sent to the effect unit together with the processing parameters del and ampl. In the global section it is possible to program arbitrarily complex paths among production and processing units.

Let us examine how the same kind of processing can be done in CLM. Here we do not have an orchestra file, but we compose as many files as there are generation or processing instruments. Every instrument is defined by means of the LISP macro definstrument, and afterwards it can be compiled and loaded within the LISP environment as a primitive function. The code segment that is responsible for audio sample generation is enclosed within the Run macro, which is expanded into C code at compilation time.
In the past, the Run macro could also generate code for the fixed-point Digital Signal Processor (DSP) Motorola 56000, which was available in NeXT computers, in order to speed up the computations. In contemporary general-purpose computers there is no longer an advantage in using DSP code, as the C-compiled functions are very efficient and they do not suffer from artifacts due to fixed-point arithmetic. Here is the CLM instrument that reads an audio file at variable speed:

    (definstrument sweep (file &key
        ;; parameters:
        ;; DURATION of the audio segment to be
        ;; acquired (seconds)
        ;; AMPSCL: amplitude scaling
        ;; FREQENV: frequency envelope
        (duration 1.0)
        (ampscl 1.0)
        (freqenv '(0 1.0 100 1.0)))
      (let ((f (open-input file)))      ;; input file assigned to variable f
        (let* ((beg 0)                  ;; initial inst.
               (end (+ beg (floor (* sampling-rate duration))))  ;; final inst.
               (freqreadenv (make-env :envelope freqenv))        ;; freq. env.
               (srconverta (make-resample :file f :srate 1.0))
               ;; srconverta: var. containing the acquired file
               (outsiga 0.0))           ;; dummy var.
          (Run
            (loop for i from beg to end do
              (setf outsiga (* ampscl (resample srconverta (env freqreadenv))))
              ;; transposition envelope (in octaves)
              (outa i outsiga)
              (if *reverb* (revout i outsiga)))))))

The reader can notice how, within the parentheses that follow the instrument name (sweep), there are mandatory parameters, such as the file to be read, and optional parameters, such as duration, ampscl, and freqenv. For the optional parameters a default value is given. It is interesting how several kinds of objects can be used as parameters, namely strings (file), numbers (duration, ampscl), or envelopes with an arbitrary number of segments (freqenv). The intermediate code section contains various definitions of variables and objects used by the instrument. In this section envelopes and UGs are prepared to act as desired.
The Run section contains a loop that is iterated for a number of times equal to the samples to be produced. This loop contains the signal processing kernel. The read at increasing pace is performed by the UG resample, whose reading step is governed by the envelope passed as a parameter. The last code line sends the signal to the post-processing unit reverb, when that is present. In our example, the post-processing unit is a second instrument, called eco:

    (definstrument eco (startime dur &optional (volume 1.0) (length 0.1))
      (let* ((d1 (make-zdelay (* sampling-rate length)))
             (vol volume)
             (beg 0)
             (end (+ beg (floor (* dur sampling-rate)))))
        (run
          (loop for i from beg to end do
            (outa i (* vol (zdelay d1 (revin i))))))))

The eco instrument will have to be compiled and loaded as well. Afterwards, the entire processing will be activated by

    (with-sound (:reverb eco :reverb-data (1.0 0.1))
      (sweep "march.wav" :duration 25 :freqenv '(0 0.0 100 1.0)))

The macro with-sound makes a clear distinction between sound production and modification, as any kind of modification is considered as a reverb.

The three sound-processing examples written in Csound, CLM, and SAOL produce almost identical results 10. The resulting sound waveshape and its sonogram are depicted in fig. 3. This figure has been obtained by means of the analysis program snd, a companion program of CLM (see section B.4). From the sonogram we can visually verify that the audio file is read at increasing speed and that such read does not contain discontinuities.

Figure 3: Waveshape and sonogram of a sound file that is echoed and read at increasing speed

10 Subtle differences are possible due to the diversity of implementation of the UGs.

B.3 Interactive Graphical Building Environments
In recent times, several software packages have been written to ease the task of designing sound synthesis and processing algorithms. Such packages make extensive use of graphical metaphors and object abstraction, reducing the processing flow to a number of small boxes with zero, one or more audio/control inputs and outputs connected by lines, thus replicating once again the old and well-known modular synthesizer interface taxonomy.

The steady increase in performance of modern computers has allowed the interactive use of these graphical building environments, which effectively become rapid prototyping tools. The speed of modern processors allows sophisticated signal computations at a rate faster than the sampling rate. For instance, if the sampling rate is Fs = 44.1 kHz, it is possible that the processor is capable of producing one or more sound samples in a time quantum T = 1/Fs ≈ 22.7 µs. If such condition holds, even the languages of section B.2 can be used for real-time processing, i.e., they can produce an audio stream directly into the digital-to-analog converters. The user may alter this processing by control signals introduced by external means, such as MIDI messages 11.

Initially, many interactive graphical building packages were created to tame the daunting task of writing specialized code for dedicated signal processing tasks. In these packages, each object would contain some portion of DSP assembly code or microcode which would be loaded on demand in the appropriate DSP card. With a graphical interface the user would then easily construct complex DSP algorithms with detailed controls coming from different sources (audio, MIDI, sensors, etc.).
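The real-time condition mentioned above reduces to a comparison between the time needed to compute one sample and the sampling period. As a small worked example, the following Python sketch computes that budget; the function names sample_period_us and runs_in_real_time are hypothetical, and the per-sample costs passed to them are made-up figures used only to exercise the comparison.

```python
def sample_period_us(fs):
    """Time available to compute one sample, in microseconds (T = 1/Fs)."""
    return 1e6 / fs

def runs_in_real_time(cost_per_sample_us, fs):
    """True if one sample can be computed within one sampling period."""
    return cost_per_sample_us <= sample_period_us(fs)

budget = sample_period_us(44100)   # about 22.68 microseconds at 44.1 kHz
print(round(budget, 2))
print(runs_in_real_time(10.0, 44100), runs_in_real_time(30.0, 44100))
```

In practice the computation is done on blocks of samples and the budget must also absorb operating-system latencies, so the comparison above is only a necessary condition.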
Several such applications still exist and are fairly widely used in the live-electronics music field. To quote just a few of the latest remaining ones: the Kyma/Capybara environment written by Carla Scaletti and Kurt Hebel 12, the ARES/MARS environment [7, 11, 21, 6] developed by IRIS-Bontempi, and the Scope package produced by the German firm Creamware 13.

While these specialized packages for music composers and sound designers are bound to disappear with the rapid and manifold power increase of general-purpose processors 14, the concept of graphic object-oriented abstraction to easily and visually construct signal processing algorithms has spurred an entire new line of software products. The most widespread one is the Max package suite conceived and written by Miller Puckette at IRCAM. Born as a generic MIDI control logic builder, this package has seen an enormous expansion in its commercial version produced by Cycling '74 and maintained by Dave Zicarelli 15. A recent extension to Max, written by Zicarelli, is MSP, which features real-time signal processing objects on Apple PowerMacs (i.e., on general-purpose RISC architectures). Another interesting path is currently being followed by Miller Puckette himself, who is the principal author of Pure Data (pd) [71], an open-source, public-domain counterpart of Max which handles MIDI, audio and graphics (extensions by Mark Danks 16).

11 MIDI (Musical Instrument Digital Interface) is a standard protocol for communication of musical information.
12 http://www.symbolicsound.com
13 http://www.creamware.de
14 This is not a personal but rather a classic Darwinian consideration: the maintenance costs of such packages, added to the intrinsic tight binding of such code with rapidly obsolescent hardware, expose them to an inevitable extinction.
pd is developed keeping the actual processing and its graphical display as two cooperating separate processes, thus enhancing portability and easily modeling its processing priorities (sound first, graphics later) on the underlying operating system thread/task switching capabilities. pd is currently a very early-stage work in progress, but it already features most of the graphic objects found in the experimental version of Max plus several audio signal processing objects. Its tcl/tk graphical interface makes its porting extremely easy (virtually "no porting at all") 17.

15 http://www.cycling74.com
16 http://www.danks.org/mark/GEM/
17 Pure Data currently runs on Silicon Graphics workstations, on Linux boxes and on Windows NT platforms; sources and binaries can be found at http://crca.ucsd.edu/~msp/software.html

Figure 4: A Pd screen shot

B.3.1 Examples in ARES/MARS and pd

While the use of systems that are based on specialized digital signal processors is fading out in the music and sound communities, those kinds of chips still play a crucial role in communication and embedded systems. In general, wherever one needs signal processing capabilities at a very low cost, digital signal processors come into play, with their corollary of peculiar assembly language and parallel datapaths. For this reason, it is useful to look at the ARES/MARS workstation as a prototypical example of such systems, and to see how our problem of sound echoing and continuous transposition would have been solved with such a system.

In the IRIS ARES/MARS workstation there is a host computer, which is used to program the audio patches and the control environments, a microcontroller that uses its proprietary real-time operating system to handle the control signals, and one or more digital signal processors that are used to process the signals at audio rate. The audio patch that solves our processing problem is shown in fig. 5. The input signal is directly taken from an analog-to-digital converter, and the output signal is sent to a digital-to-analog converter. There are two main blocks: the first, called HARMO, is responsible for input signal transposition. The second, having a small clock as an icon, produces the echo. Since we want a gradually-increasing transposition, the HARMO block is controlled by a slowly-varying envelope, updated at a lower rate, programmed to ramp from trasp_iniziale to trasp_finale. The transposed signal goes into the delay unit and produces the echo that gets summed to the transposed signal itself before being sent to the output. Among the parameters of the HARMO and delay units, there are those responsible for memory management, since both units use memory buffers that must be properly allocated, as explained in section B.5.

Figure 5: MARS patch for echoing and linearly-increasing transposition

Figure 6 shows a possible solution to our sweep-and-echo problem using pd. Again, we have a harmo block that performs the pitch transposition. However, in pd this harmonizer is not a native module, but it is implemented in a separate patch by means of cross-fading delay lines [110]. Similarly, the ramped_phase block encapsulates the operations necessary to perform a one-pass read of the wavetable containing the sound file. The subgraph in the lower right corner represents the linear increase in pitch transposition, obtained by means of the line UG and used by the harmo unit.

B.4 Inline sound processing
A completely different category of music software deals with inline sound processing. The software included in this category implies direct user control over sound on several levels, from its inner microscopic details up to its full external form. In its various forms, it allows the user to: (i) process single or multiple sounds, (ii) build complex sound structures into a sound stream, (iii) view different graphical representations of sounds. Hence, the major difference between this category and the one outlined in the preceding paragraphs lies perhaps in this software's more general usage at the expense of less 'inherent' musical capabilities: as an example, the difference between single event and event organization (the above-mentioned orchestra/score metaphor and other organizational forms), which is pervasive in the languages for sound processing, hardly exists in this category. However, this software allows direct manipulation of various sound parameters in many different ways and is often indispensable in musical pre-production and post-production stages.

Compared to the Music-N-type software, the software of this category belongs to a sort of "second generation" of computer hardware: it makes widespread and intensive use of high-definition graphical devices, high-speed sound-dedicated hardware, large core memory, large hard disks, etc. In fact, we will shortly show that the most hardware-intensive software in music processing (the digital live-electronics real-time control software) belongs to one of the subcategories described below.

Figure 6: pd patch for echoing and linearly-increasing transposition

B.4.1 Time-Domain Graphical Editing and Processing

The most obvious application for inline sound processing is that of graphical editing of sounds.
While text data files lend themselves very conveniently to musical data description, high-resolution graphics are fundamental to this specific field of applications, where single-sample accuracy can be sacrificed in favor of a more intuitive global view of sound events. Most graphic sound editors allow the user to splice and process sound files in different ways. As fig. 7 (footnote 18) shows, the typical graphical editor displays one or more soundfiles in the time domain, allowing the user to modify them with a variety of tools. The important concepts in digital audio editing can be summarised as follows:
18 The editor in this example is called Audacity, a Free Software audio editing and processing application written by Dominic Mazzoni, Roger Dannenberg et al. [57] ( http://audacity.sourceforge.net ) for Unix, Windows and MacOS workstations.

Figure 7: A typical sound editing application

• regions: these are graphically selected portions of sound in which the processing and/or splicing takes place;
• in-core editing versus window editing: while simpler editors load the sound in RAM for editing, the most professional ones offer buffered on-disk editing to allow editing of sounds of any length. Since, with current storage techniques, high-quality sound is fairly expensive in terms of storage (ca. 100 kbytes per second and growing), on-disk editing is absolutely essential to serious editing;
• editing and rearranging large soundfiles can be extremely expensive in terms of hardware resources, and hardly lends itself to the general editing features that are expected of any multimedia application (multiple-level undos, quick trial-and-error, non-destructive editing, etc.). Several techniques have been developed to implement these features, the most important one being the playlist, which allows soundfile editing and rearranging without actually touching the soundfile itself, simply by storing pointers to the beginning and end of each region. This technique has the advantages of being extremely fast and non-destructive.

In fig. 8, a collection of soundfiles is aligned on the time axis according to a playlist indicating the starting time and duration of each soundfile reference (i.e., a pointer to the actual soundfile). Notice the on-the-fly amplitude rescaling of some of the soundfiles (footnote 19).

Graphical sound editors are extremely widespread on most hardware platforms: while there is no current favourite application, each platform sports one
or more widely used editors, which may range from the US$ 10000 professional editing suites for the Apple Macintosh to the many Free Software programs for unix workstations. In the latter category, it is worthwhile to mention the snd application by Bill Schottstaedt (footnote 20), which features back-end processing in CLM. More precisely, sounds and commands can be exchanged back and forth between CLM and snd, so that the user can choose at any time the most adequate between inline and language-based processing.

19 ProTools © is manufactured by Digidesign ( http://www.digidesign.com )

Figure 8: A snapshot of a typical ProTools © editing session

B.4.2 Analysis/Resynthesis Packages

Analysis/Resynthesis packages belong to a closely related but substantially different category: they are generally medium-sized applications which offer different editing capabilities. These packages are termed analysis/resynthesis packages because editing and processing are preceded by an analysis phase, which extracts the desired parameters in their most significant and convenient form. Editing is then performed on the extracted parameters in a variety of ways and, after editing, a resynthesis stage is needed to transform the edited parameters back into a sound in the time domain. In different forms, these applications: (i) perform various types of analyses on a sound; (ii) modify the analysis data; (iii) resynthesize the modified analysis.
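The three stages of analysis, modification, and resynthesis can be illustrated with a deliberately small sketch. It is not the algorithm of any of the packages discussed in this section: it assumes a single 64-sample frame, a plain discrete Fourier transform rather than a phase vocoder, and a hypothetical test signal made of two sinusoids. The names dft, idft, and frame are illustrative only.

```python
import cmath
import math

def dft(x):
    """(i) Analysis: plain discrete Fourier transform of one frame."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]

def idft(spectrum):
    """(iii) Resynthesis: inverse transform back to the time domain."""
    n = len(spectrum)
    return [sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n)
                for k in range(n)).real / n
            for t in range(n)]

N = 64  # frame length (assumed for this sketch)

# Hypothetical input: two partials, at bins 4 and 10 of the frame.
frame = [math.sin(2 * math.pi * 4 * t / N) + 0.5 * math.sin(2 * math.pi * 10 * t / N)
         for t in range(N)]

spectrum = dft(frame)

# (ii) Modification: silence the partial at bin 10 and its mirror bin.
for k in (10, N - 10):
    spectrum[k] = 0

resynth = idft(spectrum)

# Only the partial at bin 4 should survive the edit.
expected = [math.sin(2 * math.pi * 4 * t / N) for t in range(N)]
max_error = max(abs(a - b) for a, b in zip(resynth, expected))
```

Real packages work on overlapping windowed frames and on more refined parameter sets (partial tracks, noise envelopes), but the analyze, edit, resynthesize loop has the same shape.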
20 http://wwwccrma.stanford.edu/software/snd/

Many applications feature a graphical interface that allows direct editing in the frequency domain: the prototypical application in this field is Audiosculpt, developed by Philippe Depalle, Chris Rogers and Gilles Poirot at IRCAM (footnote 21) (Institut de Recherche et Coordination Acoustique/Musique) for the Apple Macintosh platform. Based on a versatile FFT-based phase vocoder called SVP (which stands for Super Vocodeur de Phase), Audiosculpt is essentially a drawing program which allows the user to “draw” on the spectrum surface of a sound.

Figure 9: A typical AudioSculpt session

In fig. 9, some portions of the spectrogram have been delimited and different magnitude reductions have been applied to them. Other applications, such as Lemur (footnote 22) (running on Apple Macintoshes) [33] or Ceres (developed by Oyvind Hammer at NoTam, footnote 23), perform different sets of operations such as partial tracking and tracing, logical and algorithmic editing, timbre morphing, etc.

The contemporary sound designer can also benefit from tools which are specifically designed to transform sound objects in a controlled fashion. One
such tool is SMS (footnote 24) (Spectral Modeling Synthesis), designed by Xavier Serra as an offspring of his and Smith’s idea of analyzing sounds by decomposing them into stochastic and deterministic components [95] or, in other words, noise and sinusoids. SMS uses the Short-Time Fourier Transform (STFT) for analysis, tracking the most relevant peaks and resynthesizing from them the deterministic component of sound, while the stochastic component is obtained by subtraction. The decomposition allows flexible transformations of the analysis parameters, thus allowing good-quality time warping, pitch contouring, and sound morphing. In order to further improve the quality of transformations, extensions of the SMS model have been proposed, though they are not yet included in the distributed software. Namely, a special treatment of transients has been devised as a way of getting rid of the artifacts which can easily come into play when severe transformations are applied [108]. SMS comes with a very appealing graphical interface under Microsoft Windows and with a web-based interface, and it is available as a command-line program for other operating systems, such as the various flavors of unix. SMS uses an implementation of the Spectral Description Interchange Format (footnote 25), which could potentially be used by other packages that perform transformations based on the STFT. As an example, consider the following SMS synthesis score, which takes the results of an analysis and resynthesizes them while applying a pitch-shifting envelope and an accentuation of inharmonicity:

    InputSmsFile march.sms
    OutputSoundFile exroc.snd
    FreqSine 0 1.2 .5 1.1 .8 1 1 1
    FreqSineStretch 0.2

21 http://www.ircam.fr
22 http://www.cerlsoundgroup.org/Lemur/
23 http://www.NoTam.uio.no/

B.5 Structure of a Digital Signal Processor
In this section we examine the ARES/MARS workstation as a prototypical case of hardware/software systems dedicated to digital audio processing. Namely, we explain the internal arithmetic of the X20 processor, the computational core of the workstation, and the memory management system. We have mentioned that the ARES/MARS workstation uses an expansion board divided into two parts: a control part based on the Motorola MC68302 microcontroller, and an audio processing part based on two proprietary X20
processors. The X20 processor runs, for each audio cycle, a 512-instruction microprogram contained in a static external memory. Each microinstruction is 64 bits long and is executed in a 25 ns machine cycle. Multiplying this cycle by the 512 instructions we get the working sampling rate of the machine, that is $F_s = 39062.5$ Hz.

24 http://www.iua.upf.es/˜sms/
25 http://cnmat.cnmat.Berkeley.edu/SDIF/

A rough scheme of the X20 processor is shown in figure 10, where we can notice three units:
• Functional Unit: adder (ALU), multiplier (MUL), registers (RM), data busses (C and Z);
• Data Memory Unit: data memories DMA and DMB, data busses (A, B, and W);
• Control Unit: addresses of data memories (ADR), access to external memory (FUN), connection to the DAC/ADC audio bus, connection to microprogram memory and microcontroller (not shown in figure 10).
Figure 10: Block structure of the X20 processor

The computations are based on a circular data flow that involves the data memories and the functional unit. The presence of two data memories and one functional unit allows a parallel organization of microprograms. The data flow can be divided into four phases:
• Data gathering from memories DMA, DMB, or external memory (FUN);
• Selection of input data for the functional unit;
• Data processing by the functional unit;
• Insertion of the result back into the functional unit (by means of the C and Z busses) or storage into the data memories (W bus).

B.5.1 Memory Management

Waveforms, tables, samples, and delay lines are allocated in the external memory (footnote 26), which is organized in, at most, 16 banks of 1 MWord (footnote 27). Each word is 16 bits long. In order to access the external memory we have to specify the base address in a 16-bit control word. Those bits are divided into two variable-length fields, separated by a zero bit. On the right there are ones, in a number $n$ such that $32 \times 2^n$ is the size of the table (footnote 28). The field on the left is a binary number that denotes the ordinal number of the area of $32 \times 2^n$ words allocated in memory. For instance, the control word 0001110111111111 ($\mathrm{1DFF}_{16}$ in hexadecimal) ends with $n = 9$ ones and represents the eighth area of 16 KWords. Summarizing, in order to select an external table, the user has to specify the memory bank (0 to 15), the table size as a power of two, and the offset, i.e., the ordinal number of the table among those of the size under consideration. The 16-bit control word is actually only part of the 24-bit CWO register, the remaining 8 bits being used to select a waveform derived from reading the quarter of a sine wave stored in 1024 words of internal read-only memory. Another 24-bit register, called VAD, stores the table-reading phase pointer.
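The layout of the 16-bit control word can be made concrete with a short decoding sketch. The function name decode_control_word is hypothetical, and counting the areas from zero (so that the left field 000111 = 7 designates the eighth area) is an interpretation inferred from the single example given in the text, not a documented convention of the workstation.

```python
def decode_control_word(word):
    """Split a 16-bit control word into (table_size_in_words, area_index).

    The ones on the right are counted to obtain n, the next bit is the zero
    separator, and the remaining bits on the left give the ordinal number of
    the area (assumed here to be counted from zero).
    """
    if not 0 <= word <= 0xFFFF:
        raise ValueError("control word must fit in 16 bits")
    n = 0
    while n < 16 and (word >> n) & 1:   # count the ones on the right
        n += 1
    if n == 16:
        raise ValueError("no zero separator bit found")
    table_size = 32 * 2 ** n            # table size in words
    area_index = word >> (n + 1)        # field to the left of the separator
    return table_size, area_index

# The example from the text: 0001110111111111 (1DFF in hexadecimal).
size, index = decode_control_word(0b0001110111111111)
```

For this word the sketch finds nine trailing ones, hence a table of 32 × 512 = 16384 words (16 KWords), and a left field equal to 7.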
To access successive elements of the table, the VAD register is summed with the content of a 24-bit increment register. For example, 4 KWord tables are accessed using an increment equal to $001000_{16}$, while for 2 KWord tables the increment is $002000_{16}$. A 4 KWord table is not stored in contiguous locations of the memory bank, but uses locations that are separated by $1024/4 = 0010_{16}$ positions. The first 2 bytes of the increment account for this distance. The extension of the phase to 3 bytes allows fractional addressing, with interpolation between logically contiguous samples. For instance, consider reading a 4 KWord table: only the 12 most significant bits of the phase are used to address the table, the remaining 12 bits (footnote 29) being considered as the fractional part of the address and assigned to a register ALFA. If the 12 bits
of the phase give the value $n$, an interpolated read of the table will return the value (footnote 30)

$$ y = (1 - \mathrm{ALFA})\,\mathrm{table}(n) + \mathrm{ALFA}\,\mathrm{table}(n+1) . \qquad (1) $$

With an increasing table size, the number of bits available for the fractional part decreases, and this corresponds to a decrease in interpolation accuracy for tables larger than 64 KWord.

26 Called FUN or function memory.
27 1 MWord is equal to $2^{20} \approx 1000000$ words.
28 The minimal number of words in a table is, therefore, 32.
29 Actually, only the first 8.

B.5.2 Internal Arithmetics

The data memories of the X20 processor are made of 24-bit locations, and 24 bits are also used for the registers feeding the ALU and for the busses C, Z, and W. On the other hand, we have only 16 bits for external functions and for the registers feeding the MUL. The internal arithmetic of the X20 can be summarized as:
• Representation of signals in two’s complement fixed point, with normalization to one;
• Algebraic sum with 24-bit precision;
• Multiplication of two 16-bit numbers with 24-bit result;
• Tables and delay lines stored with 16-bit precision (FUN memory);
• 16-bit digital-to-analog and analog-to-digital conversion.

The addition can be performed in the following modes:
• Normal mode: the whole range of two’s complement 24-bit numbers is used for the result, with no handling of overflows. Ex.: $500000_{16} + 400000_{16} = 900000_{16} = -700000_{16}$.
• Overflow-protected mode: when an overflow occurs, the result is set to the maximum or minimum representable number. Ex.: $500000_{16} + 400000_{16}$, which would give $900000_{16}$, is set to $\mathrm{7FFFFF}_{16}$.
• Zero-protected mode: every negative result is forced to zero.
• Overflow- and Zero-protected mode: the sum is first executed in overflow-protected mode, and any negative result is then forced to zero.
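The four addition modes can be emulated on ordinary integers. The following is a sketch under the assumption that signals are 24-bit two's complement numbers held in Python integers; the names wrap24 and x20_add are hypothetical and do not come from the ARES/MARS documentation.

```python
BITS = 24
MAX_VAL = (1 << (BITS - 1)) - 1    # 0x7FFFFF, largest representable value
MIN_VAL = -(1 << (BITS - 1))       # -0x800000, smallest representable value

def wrap24(x):
    """Interpret the low 24 bits of x as a two's complement number."""
    x &= (1 << BITS) - 1
    return x - (1 << BITS) if x > MAX_VAL else x

def x20_add(a, b, overflow_protect=False, zero_protect=False):
    """Add two signed 24-bit values in one of the four modes."""
    total = a + b
    if overflow_protect:
        result = max(MIN_VAL, min(MAX_VAL, total))   # saturate
    else:
        result = wrap24(total)                       # wrap around
    if zero_protect and result < 0:
        result = 0                                   # clip negatives to zero
    return result

# The hexadecimal example from the text, in each of the four modes.
normal = x20_add(0x500000, 0x400000)
ovp = x20_add(0x500000, 0x400000, overflow_protect=True)
zep = x20_add(0x500000, 0x400000, zero_protect=True)
ovpzep = x20_add(0x500000, 0x400000, overflow_protect=True, zero_protect=True)
```

In normal mode the sum wraps to the negative value -0x700000, which is what makes this mode suitable for cyclic phase pointers, while in overflow-protected mode it saturates at 0x7FFFFF.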
30 The reader may observe that for ALFA equal to zero the value table(n) is returned, while for ALFA equal to one the returned value is table(n + 1).

The first mode is useful whenever one has to generate cyclic waveforms or access the memory cyclically, for instance to compute the phase pointer of an oscillator. The second mode is used when we are doing signal processing, since it protects from large-amplitude discontinuities and limit cycles (see section 1.6). The following table shows some examples of sums performed with the different modes (footnote 31):

     a      b     a+b    a+b (OVP)   a+b (ZEP)   a+b (OVPZEP)
   -0.5    0.7    0.2      0.2         0.2          0.2
    0.5    0.7   -0.8      1.0         0.0          1.0
    0.5   -0.7   -0.2     -0.2         0.0          0.0
   -0.5   -0.7    0.8     -1.0         0.8          0.0

Multiplications are performed on the 16 most significant bits of the operands in order to give a 24-bit result. The multiplication can be summarized in the following steps:
1) Consider only the 16 most significant bits of the operands;
2) Multiply with 16-bit operand precision;
3) Consider only the 24 most significant bits of the (31-bit) result.
Steps 1 and 3 imply quantization operations and, therefore, a loss of precision. The following table shows some examples of multiplications expressed in decimal and hexadecimal notations (footnote 32):

     a       b        ab          a_16     b_16     ab_16
    1.0     1.0      0.999939    7FFFFF   7FFFFF   7FFE00
    1.0     0.5      0.499985    7FFFFF   400000   3FFF80
    0.001   0.001    0.000001    0020C5   0020C5   000008
   -1.0     1.0     -0.999970    800000   7FFFFF   800100
   -1.0    -1.0     -1.0         800000   800000   800000

The examples highlight the need to look at the results of multiplications with special care. The worst mistake is the one in the last line, where the result is off by 200%!
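The three multiplication steps can be sketched in a few lines. This is an emulation on Python integers, assuming truncation (rather than rounding) at steps 1 and 3, an assumption that happens to reproduce the hexadecimal patterns of the table; the names to_signed, x20_mul, and as_fraction are hypothetical.

```python
def to_signed(x, bits):
    """Interpret the low `bits` bits of x as a two's complement number."""
    x &= (1 << bits) - 1
    return x - (1 << bits) if x >= 1 << (bits - 1) else x

def x20_mul(a24, b24):
    """Multiply two 24-bit patterns and return the 24-bit result pattern."""
    # Step 1: keep only the 16 most significant bits of each operand.
    a16 = to_signed(a24, 24) >> 8
    b16 = to_signed(b24, 24) >> 8
    # Step 2: 16-bit by 16-bit signed multiplication.
    product = a16 * b16
    # Step 3: align the product to 32 bits and keep the top 24 bits.
    return ((product << 1) >> 8) & 0xFFFFFF

def as_fraction(pattern24):
    """Convert a 24-bit pattern to its value in the range [-1, 1)."""
    return to_signed(pattern24, 24) / float(1 << 23)

almost_one = x20_mul(0x7FFFFF, 0x7FFFFF)   # 1.0 times 1.0
worst_case = x20_mul(0x800000, 0x800000)   # -1.0 times -1.0
```

The last case shows the pitfall mentioned above: the exact product of -1.0 and -1.0 is +1.0, which is not representable, so the 24-bit pattern that comes out reads as -1.0.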
31 Copied from the online help system of the ARES/MARS workstation.
32 Copied from the online help system of the ARES/MARS workstation.

Another observation concerns jump operations, which seem to be forbidden in an architecture that is based on the cyclic reading of a fixed number of microinstructions. Indeed, there are conditional instructions that can change the selection of operands feeding the ALU according to a control value taken, for instance, from bus C. The presence of these instructions justifies the name ALU for the adder, since it is indeed an Arithmetic Logic Unit.

B.5.3 The Pipeline

We have seen that the architecture of a Digital Signal Processor allows some operations to be performed in parallel. For instance, we can simultaneously perform data transfers, multiplication, and addition. Most digital filters are based on the iterative repetition of operations such as

$$ y = y + h_i s_i \qquad (2) $$

where $h_i$ are the coefficients of the filter and $s_i$ are memory words containing the filter state. A DSP architecture such as that of the X20 allows one to specify, in a single microinstruction, the product of two registers containing $h_i$ and $s_i$, the accumulation of the product obtained at the prior cycle into another register (containing $y$), and the loading of the registers with the values $h_i$ and $s_i$ to be used at the next cycle. In other terms, the Multiply and Accumulate (MAC) operation is distributed over three clock cycles, but at each cycle three MAC operations are in execution simultaneously. This is a realization of the principle of the pipeline, where the sample being “manufactured” has a latency of three clock cycles, but the frequency of sample delivery is one per clock cycle. In digital filters, another fundamental operation is the state update. In practice, after $s_i$ has been used, it has to assume the value $s_{i-1}$.
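The accumulation of Eq. (2) and the state update can be written out sequentially as follows. This is a plain, non-pipelined sketch of a generic FIR filter step, not X20 microcode; the function name fir_step and the two-tap averaging coefficients are hypothetical, and the state is shifted before the accumulation so that the newest sample enters the sum.

```python
def fir_step(h, s, x):
    """Process one input sample x; s holds the filter state (updated in place)."""
    # State update: each s[i] takes the value of s[i-1]; the new sample enters s[0].
    for i in range(len(s) - 1, 0, -1):
        s[i] = s[i - 1]
    s[0] = x
    # Multiply and accumulate, as in Eq. (2): y = y + h[i] * s[i].
    y = 0.0
    for h_i, s_i in zip(h, s):
        y = y + h_i * s_i
    return y

h = [0.5, 0.5]      # hypothetical coefficients (two-point average)
s = [0.0, 0.0]      # filter state, initially at rest
outputs = [fir_step(h, s, x) for x in (1.0, 0.0, 0.0)]
```

Feeding a unit impulse returns the coefficients themselves, followed by zero. Note that the shifting loop costs one memory move per state word at every sample.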
As shown in chapter 2, the state update can be avoided by proper indexing of memory accesses (circular buffering): instead of moving the data with $s_i \leftarrow s_{i-1}$, we shift the indexes with $i \leftarrow i - 1$, in a circular fashion.

Appendix C

Fundamentals of psychoacoustics
Psychoacoustics is a “discipline within psychology concerned with sound, its perception and the physiological foundations of hearing” [75]. A few concepts and facts of psychoacoustics are certainly useful to the sound designer and to any computer scientist interested in working with sound. Several books provide a wider treatment of this topic, at different degrees of depth [86, 105, 42, 111].

C.1 The ear

The human ear is usually described as composed of three parts. This system is schematically depicted in figure 1.

the outer ear: The pinna couples the external space to the ear canal. Its shape is exploited by the hearing system to extract directional information from incoming sounds. The ear canal is a tube (length $l \approx 2.6$ cm, diameter $d \approx 0.6$ cm) closed on the inner side by a membrane called the ear drum. The tube acts as a quarter-of-wavelength resonator, exciting frequencies in the neighborhood of $f_0 = \frac{c}{4l} \approx 3.3$ kHz, where $c$ is the speed of sound in air;

the middle ear: It transmits mechanical energy, received from the ear drum, to the inner ear through a membrane called the oval window. To do so, it uses a chain of small bones, called the hammer, the anvil, and the stirrup;

the inner ear: It is a cavity, called the cochlea, shaped like a snail shell, which is shown straightened for clarity in figure 1. It contains a fluid and is divided by the basilar membrane into two chambers: the scala vestibuli and the scala tympani. The length of the cochlea is about 3.5 cm. Its diameter is about 2 mm at the oval window (base) and it gets narrower at the other extreme (apex), where a narrow aperture (the helicotrema) allows the two chambers to communicate. On top of the basilar membrane, the tectorial membrane sustains about 16,000 hair cells that pick up the transversal motion of the basilar membrane and transmit it to the brain.
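As a rough check of the quarter-wavelength resonance quoted for the ear canal, and assuming a speed of sound in air of about 343 m/s (a value that is not given in the text), the figure works out as:

```latex
f_0 = \frac{c}{4l} \approx \frac{343\ \mathrm{m/s}}{4 \times 0.026\ \mathrm{m}} \approx 3298\ \mathrm{Hz} \approx 3.3\ \mathrm{kHz}
```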
Figure 1: Cartoon physiology of the ear

The vibrations of the oval window excite the fluid of the scala vestibuli. Because of the pressure differences between the scala vestibuli and the scala tympani, the basilar membrane oscillates and transversal waves propagate along it. The basilar membrane can be thought of as a string whose tension decreases as we move from the base to the apex. This tension changes by about four orders of magnitude from base to apex. Along a string, waves propagate at speed

$$ c = \sqrt{\frac{T}{\rho_L}} = \sqrt{\frac{\text{Tension}}{\text{Linear density}}} , \qquad (1) $$

and the wavelength associated with the component at frequency $f$ is

$$ \lambda = \frac{c}{f} = \frac{1}{f} \sqrt{\frac{T}{\rho_L}} . \qquad (2) $$

The impedance of the string is $z_0 = \sqrt{\rho_L T}$ and, if $v_{max}$ is the peak value of transversal velocity, the wave power is

$$ P = \frac{1}{2} z_0 v_{max}^2 = \frac{1}{2} \sqrt{\rho_L T}\, v_{max}^2 . \qquad (3) $$

While a wave component at frequency $f$ is propagating from the base to the apex, its wavelength decreases (because tension decreases) and, due to the physical requirement of power constancy, its amplitude increases. However, this propagation is not lossless, and dissipation increases with the amplitude, so that a frequency-dependent maximum region will emerge along the basilar membrane (see figure 2). Since the high frequencies are more affected by propagation losses, their characteristic resonance areas are clustered close to the base, while low frequencies are more widely distributed toward the apex. About two thirds of the length of the cochlea is devoted to low frequencies (about one fourth of the audio bandwidth), thus giving more frequency resolution to the slowly varying components.
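The effect of the decreasing tension can be checked numerically from Eqs. (1) to (3) of this appendix. The sketch below ignores losses and uses hypothetical, unit-free values for tension, linear density, and power, since only the ratios between base and apex matter; the function name string_wave is an assumption, and the peak transversal velocity is taken as the measure of amplitude.

```python
import math

def string_wave(tension, linear_density, frequency, power):
    """Return (wavelength, peak velocity) of a lossless wave on a string."""
    speed = math.sqrt(tension / linear_density)        # Eq. (1)
    wavelength = speed / frequency                     # Eq. (2)
    impedance = math.sqrt(linear_density * tension)    # z0
    v_max = math.sqrt(2.0 * power / impedance)         # Eq. (3) solved for v_max
    return wavelength, v_max

# Same frequency and power, tension reduced by four orders of magnitude.
base = string_wave(tension=1.0, linear_density=1.0, frequency=1000.0, power=1.0)
apex = string_wave(tension=1.0e-4, linear_density=1.0, frequency=1000.0, power=1.0)

wavelength_ratio = apex[0] / base[0]   # wavelength shrinks
velocity_ratio = apex[1] / base[1]     # peak velocity grows
```

With a tension drop of four orders of magnitude, the wavelength shrinks by a factor of 100 while the peak velocity grows by a factor of 10, in line with the qualitative description above.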
Figure 2: Cartoon of the transversal velocity pattern elicited by an incoming pure sine tone (horizontal axis: position along the basilar membrane, from base to apex; vertical axis: transversal velocity)

C.2 Sound Intensity

Consider a sinusoidal point source in free space. It generates spherical pressure waves that carry energy. The acoustic intensity is the power per unit surface that is carried by a wave front. It is a vectorial quantity having magnitude

$$ I = \frac{1}{2} \frac{p_{max}^2}{z_0} = \frac{p_{max}^2}{2 \rho c} = \frac{p_{RMS}^2}{\rho c} , \qquad (4) $$

where $p_{max}$ and $p_{RMS}$ are the peak and root-mean-square (RMS) values of the pressure wave, respectively, and $z_0 = \rho c = \text{density} \times \text{speed}$ is the impedance of air. At 1000 Hz the human ear can detect sound intensities ranging from $I_{min} = 10^{-12}\ \mathrm{W/m^2}$ (threshold of hearing) to $I_{max} = 1\ \mathrm{W/m^2}$ (threshold of pain). Consider two spherical shells of areas $a_1$ and $a_2$, at distances $r_1$ and $r_2$ from the point source. The lossless propagation of a wavefront implies that the intensities registered at the two distances are related to the areas by

$$ I_1 a_1 = I_2 a_2 . \qquad (5) $$

Since the area is proportional to the square of the distance from the source, we also have

$$ \frac{I_1}{I_2} = \left( \frac{r_2}{r_1} \right)^2 . \qquad (6) $$

The intensity level is defined as

$$ IL = 10 \log_{10} \frac{I}{I_0} , \qquad (7) $$

where $I_0 = 10^{-12}\ \mathrm{W/m^2}$ is the sound intensity at the threshold of hearing. The intensity level is measured in decibels (dB), so that multiplications by a factor are turned into additions of an offset, as represented in table C.1. Similarly, the sound pressure level is defined as

$$ SPL = 20 \log_{10} \frac{p_{max}}{p_{0,max}} = 20 \log_{10} \frac{p_{RMS}}{p_{0,RMS}} , \qquad (8) $$

where $p_{0,max}$ and $p_{0,RMS}$ are the peak and RMS pressure values at the threshold of hearing. For a propagating wave, we have that $IL = SPL$. For a standing wave, since there is no power transfer and since IL is a power-based measure, the SPL is more appropriate.
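Equations (6) and (7) translate directly into code. The following sketch uses hypothetical function names (intensity_level, intensity_at) and arbitrary example intensities; it only illustrates how factors on the intensity scale become offsets on the dB scale.

```python
import math

I0 = 1e-12  # reference intensity at the threshold of hearing, in W/m^2

def intensity_level(intensity):
    """Intensity level in dB, Eq. (7)."""
    return 10.0 * math.log10(intensity / I0)

def intensity_at(distance, ref_intensity, ref_distance):
    """Intensity at a new distance for lossless spherical spreading, Eq. (6)."""
    return ref_intensity * (ref_distance / distance) ** 2

pain_level = intensity_level(1.0)   # level at the threshold of pain
doubling_shift = intensity_level(2e-6) - intensity_level(1e-6)
distance_shift = intensity_level(intensity_at(2.0, 1e-3, 1.0)) - intensity_level(1e-3)
```

The threshold of pain sits at about 120 dB, doubling the intensity adds about 3 dB, and doubling the distance from the source removes about 6 dB.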
Given a reference tone with a certain value of IL at 1 kHz, we can ask a subject to adjust the intensity of a probe tone at a different frequency until it perceptually matches the loudness of the reference. What we would obtain are the Fletcher-Munson curves, or equal-loudness curves, sketched in figure 3. Each curve is parameterized on a value of loudness level (LL), measured in phons. The loudness level coincides with the intensity level at 1 kHz.

     I        IL
   ×1.26      +1
   ×2         +3
   ×10        +10

Table C.1: Relation between factors in the linear intensity scale and shifts in the dB intensity-level scale

Figure 3: Equal-loudness curves. The parameters express values of loudness level in phons. (Axes: frequency [Hz], IL [dB]; curves: thresholds, 20 phons, 60 phons, 90 phons)

Even though the Fletcher-Munson curves are obtained by averaging the responses of human subjects, the LL is still a physical quantity, because it refers to the physical quantity IL and does not represent the perceived loudness in absolute terms. In other words, doubling the loudness level does not mean doubling the perceived loudness. A genuine psychophysical measure is the loudness in sones, which can be obtained as a function of LL by asking listeners to compare sounds and decide when one sound is “twice as loud” as another. Somewhat arbitrarily, an LL of 40 phons is set equal to 1 sone. Figure 4 represents a possible average curve that may emerge from an experiment. The standardized loudness scale (ISO) uses the straight-line approximation of figure 4, which corresponds to the power law

$$ L[\text{sones}] = \frac{1}{15.849} \left( \frac{I}{I_0} \right)^{0.3} . \qquad (9) $$

Roughly speaking, an increment of 9 phons is needed to double the perceived subjective loudness in sones. This holds for tones at the same frequency or within the same critical band.
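The power law of Eq. (9) can be evaluated for a few levels. The sketch assumes a 1 kHz tone, for which the loudness level in phons coincides with the intensity level in dB; the function names loudness_sones and intensity_from_level are hypothetical.

```python
I0 = 1e-12  # reference intensity at the threshold of hearing, in W/m^2

def loudness_sones(intensity):
    """Loudness in sones from the power law of Eq. (9)."""
    return (intensity / I0) ** 0.3 / 15.849

def intensity_from_level(level_db):
    """Intensity in W/m^2 corresponding to an intensity level in dB."""
    return I0 * 10.0 ** (level_db / 10.0)

one_sone = loudness_sones(intensity_from_level(40.0))                  # 40 phons
ratio_9_phons = loudness_sones(intensity_from_level(49.0)) / one_sone  # +9 phons
ratio_10_phons = loudness_sones(intensity_from_level(50.0)) / one_sone # +10 phons
```

Under this law, 40 phons map to about 1 sone; a 9-phon increase multiplies the loudness by roughly 1.86 and a 10-phon increase by almost exactly 2, which is consistent with the rough doubling rule quoted above.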
From a physiological perspective, the critical band can be defined as the band of frequencies whose positions along the basilar membrane stay within the area excited by a single pure tone (see figure 2 and section C.4). We can say that the intensities of uncorrelated signals effectively sum:

$$ I = I_1 + I_2 ; \qquad p^2 = p_1^2 + p_2^2 \;\Rightarrow\; p = \sqrt{p_1^2 + p_2^2} . \qquad (10) $$

For uncorrelated pure tones within a critical band, if the law represented by the straight line in figure 4 applies, doubling the intensity gives an increment of 3 phons. Therefore, 3 doublings (×8) are needed to have an increase of 9 phons. This is the increase that roughly corresponds to a doubling in loudness. For example, 8 violins playing the same note at the same loudness level are needed to effectively double the perceived loudness of a single violin. If two sounds are far apart in frequency, their intensities sum much more effectively. In this case, using two sources at different frequencies also doubles the loudness.
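The arithmetic behind the eight-violin example can be written out explicitly. Assuming eight uncorrelated sources of equal intensity (the symbol $I_1$ for the intensity of a single source is used here only for illustration), Eqs. (10) and (7) give:

```latex
\Delta IL = 10 \log_{10} \frac{8 I_1}{I_1} = 3 \times 10 \log_{10} 2 \approx 3 \times 3.01\ \mathrm{dB} \approx 9\ \mathrm{dB}
```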
Figure 4: Sones vs. phons (axes: Loudness Level (phons), Loudness (sones))

   Physics                    Psychophysics
   Physical Sound Φ           Perceived Sound Ψ
   Intensity I, ∆I            Loudness L, ∆L
   Frequency f, ∆f            Pitch p, ∆p
   Duration d, ∆d             Apparent Duration $\tilde{d}$, $\Delta\tilde{d}$

Table C.2: Physics vs. Psychophysics

C.2.1 Psychophysics
In psychophysics, the Just Noticeable Difference (JND) of a physical quantity is the minimal difference of that quantity that can be noticed between two stimuli, or in the modulation of a single stimulus. Since our perception is driven by neural firings statistically distributed in time, the appropriate way to measure JNDs is by subjective experimentation and statistical analysis. The random nature of perception is indeed the cause of JNDs, because the accuracy of our internal representations is limited by the intrinsic noise of these random processes. The relation between physics and psychophysics is represented in table C.2 by means of three important acoustic quantities. The JNDs are represented by the symbol ∆ preceding the physical or psychophysical variable name, in the latter case being a mnemonic for the internal noise variance. The construction of psychophysical scales relies on Fechner’s idea (footnote 1) that:

• The value of the perceived quantity is obtained by counting the JNDs, and the result of such counting is the same whether we count physical or sensed JNDs.
• There is a “zero level” for sensation, i.e., the scale of sensations is a ratio scale (all four arithmetic operations are allowed).

For instance, for loudness:

$$ \Delta L \cdot N_{JND} = L \;\Rightarrow\; N_{JND} = \frac{L}{\Delta L} . \qquad (11) $$

If the JND is not constant:

$$ N_{JND} = \int_0^L \frac{dL}{\Delta L(L)} . \qquad (12) $$

1 Gustav Theodor Fechner (1801-1887) is considered the father of psychophysics.

From Fechner’s idea we have

$$ N_{JND} = \int \frac{dL}{\Delta L(L)} = \int \frac{dI}{\Delta I(I)} . \qquad (13) $$

Fechner’s psychophysics is based on two assumptions (exemplified for loudness):
1. $\Delta L$ is constant;
2. $\Delta I$ is proportional to $I$, or $\frac{\Delta I}{I} = k$, with $k$ constant (Weber’s law).

Based on the two assumptions, Fechner’s law is derived as

$$ L = \Delta L \cdot N_{JND} = \Delta L \int \frac{dI}{kI} = \tilde{k} \log(I) , \qquad (14) $$

for a certain value of the constant $\tilde{k}$. For the loudness of pure tones, neither assumption 1 nor assumption 2 is valid. Therefore, Fechner’s law (14) does not hold (footnote 2). However, Fechner’s paradigm is the basis of new developments that provide models matching the experimental results quite closely. More details can be found in [42, 4]. Experimental curves similar to that reported in figure 4 show in many cases significant deviations from (14). For instance, the relation between intensity and loudness is more similar to

$$ L \propto \sqrt[3]{I} , \qquad (15) $$

as three doublings of intensity are needed to approximate one doubling in loudness. Power laws such as (15) are the natural outcome of the so-called direct methods of psychophysical experimentation, where it is the sensation itself that is the unit for measuring other sensations. This experimental paradigm was largely established by Stevens (footnote 3), and it is the one in use when the experimenter asks the subject to double or halve the perceived loudness of a tone, or when a direct magnitude production or estimation is used.
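The JND-counting idea behind Eq. (14) can be tried out numerically. The sketch below assumes Weber's law with a hypothetical fraction k = 0.1 and simply counts how many just-noticeable steps separate the reference intensity from a target intensity; the function name count_jnds is an assumption.

```python
import math

def count_jnds(intensity_ratio, k):
    """Count the Weber steps of size k*I needed to climb from I0 to intensity_ratio*I0."""
    level = 1.0     # current intensity, relative to I0
    steps = 0
    while level * (1.0 + k) <= intensity_ratio:
        level *= 1.0 + k    # one JND up: I becomes I + k*I
        steps += 1
    return steps

k = 0.1                                  # hypothetical Weber fraction
n_20_db = count_jnds(100.0, k)           # target 20 dB above the reference
n_40_db = count_jnds(10000.0, k)         # target 40 dB above the reference
closed_form = math.log(100.0) / math.log(1.0 + k)
```

The count grows with the logarithm of the intensity ratio: doubling the level difference in dB doubles the number of JNDs, and the step count agrees with the closed form $\log(I/I_0)/\log(1+k)$ up to the integer part. With a constant $\Delta L$ this is exactly the logarithmic law of Eq. (14).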
2 Weber’s and Fechner’s laws are taken for granted quite often in human-computer interaction.
3 Stanley Smith Stevens (1906-1973).

C.3 Pitch

Periodic tones elicit a sensation of pitch, meaning that they can be ordered on a scale from low to high. Many aperiodic or even stochastic sounds can elicit pitch sensations, with different degrees of strength. If we stick with pure tones for this section, pitch is the sensorial correlate of frequency, and it makes sense to measure the frequency JND using the tools of psychophysics. For instance, if a pure tone is slowly modulated in frequency, we may look for the threshold of modulation audibility. The resulting curve of average results would look similar to figure 5.
Figure 5: JND in frequency for a slowly modulated pure tone. (Axes: central frequency in Hz, JND in Hz; reference lines: 3%, 1%, 0.6%, and 0.5% resolution)

Again, from the curve of figure 5 we notice a significant deviation from Weber’s law $\Delta f \propto f$. The physiological interpretation is that there is more internal noise in frequency detection in the very low range. If we integrate $\frac{1}{\Delta f(f)}$ we obtain a curve such as that of figure 6, which can be interpreted as a subjective scale for pitch, whose unit is called the mel. Conventionally, 1000 Hz corresponds to 1000 mel. This curve shouldn’t be confused with the scales that organize musical height. Musical scales are based on the subdivision of the musical octave into a certain number of intervals. The musical octave is usually defined as the frequency range whose upper bound is twice the value in Hertz of its lower bound. On the other hand, the subjective scale for pitch measures the subjective pitch relationship between two sounds, and it is strictly connected with the spatial distribution of frequencies along the basilar membrane. In musical reasoning, pitch is referred to as chroma, which is a different thing from the tonal height that is captured by figure 6.
Figure 6: Subjective frequency curve, mel vs. Hz. (Axes: frequency in Hz, pitch in mels)

So far, we have described pitch phenomena referring to the position of the hair cells that get excited along the basilar membrane. However, the place theory of hearing is not sufficient to explain the accuracy of pitch perception and some intriguing effects such as the virtual pitch. In this effect, if a pure tone at frequency $f_1$ is superimposed on a pure tone at frequency $f_2 = \frac{3}{2} f_1$, the perceived pitch matches the missing fundamental at $f_0 = f_1/2$. If the reader, as an exercise, plots this superposition of waveforms, she may notice that the apparent periodicity of the resulting waveform is $1/f_0$. This indicates that a temporal processing of sound may occur at some stages of our perception. The hair cells convey signals to the fibers of the acoustic nerve. These neural contacts fire at a rate that depends on the transversal velocity of the basilar membrane and on its lateral displacement. The rate gets higher for displacements that go from the apex to the base, and this creates a periodicity in the firing rate that is a multiple of the waveform periodicity. Therefore, the statistical distribution of neural spikes keeps track of the temporal behavior of the acoustic signals, and this may be useful at higher levels to extract periodicity information, for instance by autocorrelation processes [86]. Even for pure tones, pitch perception is a complex business. For instance, it depends on loudness and on the nature and quality of interfering sounds [42]. The pitch of complex tones is too complex a topic to be discussed in this appendix. It suffices to know that pitch perception of complex tones is linked to the third (after loudness and pitch) and most elusive attribute of sound, that is timbre.
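The exercise on virtual pitch proposed earlier in this section can also be done numerically instead of graphically. The sketch below assumes $f_1 = 200$ Hz (an arbitrary choice) and a hypothetical 48 kHz sampling grid; it checks whether the two-tone waveform repeats after a given time shift. The names two_tones and max_mismatch are illustrative.

```python
import math

f1 = 200.0        # first tone, in Hz (arbitrary choice)
f2 = 1.5 * f1     # second tone, a fifth above
f0 = f1 / 2.0     # missing fundamental

def two_tones(t):
    """Superposition of the two pure tones at time t (in seconds)."""
    return math.sin(2 * math.pi * f1 * t) + math.sin(2 * math.pi * f2 * t)

# 20 ms of time instants on an assumed 48 kHz grid.
times = [n / 48000.0 for n in range(960)]

def max_mismatch(shift):
    """Largest difference between the waveform and its copy delayed by `shift`."""
    return max(abs(two_tones(t + shift) - two_tones(t)) for t in times)

mismatch_f0 = max_mismatch(1.0 / f0)   # shift by one period of the missing fundamental
mismatch_f1 = max_mismatch(1.0 / f1)   # shift by one period of the lower tone
```

The waveform repeats (up to rounding errors) after $1/f_0$ but not after $1/f_1$, which confirms that its period is that of the missing fundamental.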
C.4 Critical Band

As illustrated in figure 6 of chapter 2, two pure tones whose frequencies are close to each other give rise to the phenomenon of beating. In formula, from simple trigonometry

$$\sin \Omega_1 t + \sin \Omega_2 t = 2 \sin\frac{(\Omega_1 + \Omega_2)\,t}{2} \cos\frac{(\Omega_1 - \Omega_2)\,t}{2} , \qquad (16)$$

where the first sinusoidal term in the product can be interpreted as a carrier signal modulated by the second, cosinusoidal term.

As we vary the distance between the frequencies $\Omega_1$ and $\Omega_2$, the resulting sound is perceived differently, and a sense of roughness emerges for distances smaller than a certain threshold. A schematic view of the sensed signal is represented in figure 7. The solid lines may be interpreted as time-varying sensed pitch tracks. If they are far enough apart we perceive two tones. When they get closer, at a certain point a sensation of roughness emerges, but they are still resolved. As they get even closer, we stop perceiving two separate tones and, at a certain point, we hear a single tone that beats. Also, when they are very close to each other, the roughness sensation decreases.

The region where roughness sets in defines a critical band, and that frequency region roughly corresponds to the segment of basilar membrane that gets excited by the tone at frequency $\Omega_1$. The sensation of roughness is related to that property of sound quality that is called consonance, and that can be evaluated along a continuous scale, as reported in figure 8. We notice that the maximum degree of dissonance is found at about one quarter of critical bandwidth.

C.5 Masking

When a sinusoidal tone impinges on the outer ear, it propagates mechanically up to the basilar membrane, where it affects the reception of other sinusoidal tones at nearby frequencies. If the incoming 400 Hz tone, called the masker, has 70 dB of intensity level (IL), a tone at 600 Hz has to be more than 30 dB louder than its minimal threshold level in order to become audible in the presence of the masker.
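As a numerical aside to equation (16) of section C.4, the sketch below checks that the sum of two sinusoids equals a carrier at the mean frequency multiplied by a slowly varying cosine, and that the cosine bounds the sum as an envelope. The frequencies 440 Hz and 444 Hz, the sampling rate, and the name `beating_pair` are arbitrary choices made for illustration.

```python
import numpy as np


def beating_pair(f1, f2, fs=8000.0, duration=1.0):
    """Return time axis, direct sum, product form, and envelope for two sinusoids."""
    t = np.arange(int(fs * duration)) / fs
    w1, w2 = 2 * np.pi * f1, 2 * np.pi * f2
    direct = np.sin(w1 * t) + np.sin(w2 * t)        # left-hand side of (16)
    carrier = np.sin(0.5 * (w1 + w2) * t)           # sine at the mean frequency
    envelope = 2.0 * np.cos(0.5 * (w1 - w2) * t)    # slow cosinusoidal modulator
    return t, direct, carrier * envelope, envelope


t, direct, product, envelope = beating_pair(440.0, 444.0)
max_error = np.max(np.abs(direct - product))        # round-off only

# |envelope| peaks twice per period of the cosine, so the number of
# beats per second equals the frequency difference in Hz.
beats_per_second = abs(440.0 - 444.0)
```

The two tones in this example are 4 Hz apart, so the loudness of the resulting single tone fluctuates four times per second.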
Figure 7: Schematic representation of the subjective phenomena of beats and roughness (adapted from [86]). Frequency Ω versus time t: the tracks at Ω1 and Ω2 converge through the regions labeled Roughness (Critical Band), Beats, and One Tone.

This raising of the audibility threshold of a tone by a nearby masker is called masking, and it is sketched in cartoon form in figure 9. Actually, masking is ill-defined in the immediate proximity of the masker, because there the presence of beats may reveal the interference between masker and masked tone.

Two features of masking can be noticed in figure 9. First, masking is much more effective towards high frequencies (note also the log scale in frequency). Second, high-intensity maskers spread their effects even more towards high frequencies. The latter phenomenon is called upward spread of masking, and it is due to the nonlinear behavior of the outer hair cells of the cochlea, whose stiffness depends on the excitation they receive [4]. A high-frequency cell, excited by a lower-frequency tone, increases its stiffness and becomes less sensitive to components at its characteristic frequency.

In complex tones, the partials affect each other as far as masking is concerned, so that it may well happen that in a tone with a few dozen partials, only five or six emerge from a collective masking threshold. In a sound coding task, it is obvious that we should use all our resources (i.e., the bits) to encode those partials, thus neglecting the components that are masked. This idea is the basis for perceptual audio coding, as it is found in the MPEG-1 standard [69].

Figure 8: Degree of consonance between two sine tones as a function of their frequency distance, measured as a fraction of critical bandwidth (measurement by Plomp and Levelt (1965), reported also in [105]).

For coding purposes, it is also useful to look at temporal masking. Namely, the effects of masking extend into the future for up to 40 ms (forward masking), and into the past for up to 10 ms (backward masking). These temporal effects may occur because the brain integrates sound information over time, and there are inherent delays in this operation. Therefore, a soft tone preceding a louder tone by a couple of milliseconds is likely to be simply canceled from our perceptual system.

C.6 Spatial sound perception

Classic psychoacoustic experiments showed that, when excited with simple sine waves, the hearing system uses two strong cues for estimating the apparent direction of a sound source. Namely, interaural intensity and time differences (IID and ITD) are jointly used for that purpose. IID is mainly useful above 1500 Hz, where the acoustic shadow produced by the head becomes effective, thus reducing the intensity of the waves reaching the contralateral ear. For this high-frequency range and for stationary waves, the ITD is also far less reliable, since it produces phase differences in sine waves which often exceed 360°. Below 1500 Hz the IID becomes smaller due to head diffraction, which overcomes the shadowing effect. In this low-frequency range it is possible to rely on phase differences produced by the ITD.

Figure 9: Schematic view of masking level for a sinusoidal masker at 400 Hz at 30, 50, and 70 dB of intensity level (masking level in dB versus frequency in Hz).

IID and ITD can only partially explain the ability to discriminate among different spatial directions. In fact, if the sound source moved laterally along a circle (see figure 10), the IID and ITD would not change. The cone formed by the circle with the center of the head has been called the cone of confusion.

Figure 10: Interaural polar coordinate system and cone of confusion (axes x, y, z; angles θ and φ).

Front-back and vertical discrimination within a cone of confusion are better understood in terms of broadband signals and Head-Related Transfer Functions (HRTF). The pinna-head-torso system acts like a linear filter for a plane wave coming from a given direction. The magnitude and phase responses of this filter are very complex and direction dependent, so that it is possible for the listener to disambiguate between directions having the same, stationary, ITD and IID. In some cases, it is advantageous to think about these filtering effects in the time domain, thus considering the Head-Related Impulse Responses (HRIR) [13, 82].
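The role of the 1500 Hz boundary in section C.6 can be made plausible with a back-of-the-envelope computation. The sketch below assumes a rigid spherical head of radius 8.75 cm, a speed of sound of 343 m/s, and a Woodworth-style far-field approximation of the ITD, $(a/c)(\theta + \sin\theta)$; none of these values or formulas comes from the text, and the function names are hypothetical. It converts the ITD into an interaural phase difference for a sine wave and finds the frequency at which that difference reaches a full cycle.

```python
import math

HEAD_RADIUS = 0.0875     # m; assumed average head radius
SPEED_OF_SOUND = 343.0   # m/s; assumed, air at about 20 degrees Celsius


def itd_seconds(azimuth_rad):
    """Woodworth-style ITD approximation for a rigid sphere and a distant source.
    azimuth_rad is measured from straight ahead, in [0, pi/2]."""
    return HEAD_RADIUS / SPEED_OF_SOUND * (azimuth_rad + math.sin(azimuth_rad))


def interaural_phase_degrees(freq_hz, azimuth_rad):
    """Phase difference between the two ears for a stationary sine wave."""
    return 360.0 * freq_hz * itd_seconds(azimuth_rad)


def ambiguity_frequency(azimuth_rad):
    """Frequency at which the interaural phase difference reaches 360 degrees."""
    return 1.0 / itd_seconds(azimuth_rad)


lateral = math.pi / 2    # source directly to one side
limit_hz = ambiguity_frequency(lateral)
```

Under these assumptions the phase difference for a fully lateral source reaches a whole cycle at roughly 1.5 kHz, which is consistent with the ITD becoming unreliable for stationary sine waves above that range.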
Bibliography
[1] M. Abramowitz and I. Stegun, editors. Handbook of Mathematical Functions. Dover Publications, New York, 1972.
[2] V. Algazi, R. Duda, D. Thompson, and C. Avendano. The CIPIC HRTF database. In Proc. IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, pages 99–102, Mohonk, NY, Oct. 2001.
[3] J. Allen and D. Berkley. Image method for efficiently simulating small-room acoustics. J. Acoustical Soc. of America, 65(4):943–950, Apr. 1979.
[4] J. B. Allen. Psychoacoustics. In J. G. Webster, editor, Wiley Encyclopedia of Electrical and Electronics Engineering, pages 422–437. John Wiley & Sons, 1999.
[5] X. Amatriain, J. Bonada, A. Loscos, and X. Serra. Spectral processing. In U. Zölzer, editor, Digital Audio Effects. John Wiley and Sons, Ltd., Chichester Sussex, UK, 2002.
[6] P. Andrenacci, F. Armani, R. Bessegato, A. Paladin, P. Pisani, A. Prestigiacomo, C. Rosati, S. Sapir, and M. Vetuschi. The new MARS workstation. In Proc. International Computer Music Conference, pages 215–219, Thessaloniki, Greece, Sept. 1997. ICMA.
[7] P. Andrenacci, E. Favreau, N. Larosa, A. Prestigiacomo, C. Rosati, and S. Sapir. MARS: RT20M/EDIT20 Development tools and graphical user interface for a sound generation board. In A. Strange, editor, Proc. International Computer Music Conference, pages 340–343, San Jose, CA, Oct. 1992. ICMA.
[8] D. Arfib. Digital synthesis of complex spectra by means of multiplication of nonlinear distorted sine waves. J. Audio Eng. Soc., 27(10):757–779, 1979.
[9] D. Arfib. Different ways to write digital audio effects programs. In Proc. Conf. Digital Audio Effects (DAFx-98), Barcelona, Spain, pages 188–191, Nov. 1998.
[10] D. Arfib, F. Keiler, and U. Zölzer. Source-filter processing. In U. Zölzer, editor, Digital Audio Effects, pages 299–372. John Wiley and Sons, Ltd., Chichester Sussex, UK, 2002.
[11] F. Armani, L. Bizzarri, E. Favreau, and A. Paladin. MARS - DSP Environment and Applications. In A. Strange, editor, Proc. International Computer Music Conference, pages 344–347, San Jose, CA, Oct. 1992. ICMA.
[12] A. Bernardi, G. Bugna, and G. D. Poli. Music signal analysis with chaos. In C. Roads, S. Pope, A. Picialli, and G. D. Poli, editors, Musical Signal Processing, pages 187–220. Swets & Zeitlinger, 1997.
[13] J. Blauert. Spatial Hearing: the Psychophysics of Human Sound Localization. MIT Press, Cambridge, MA, 1983.
[14] B. Blesser. An interdisciplinary synthesis of reverberation viewpoints. J. Audio Eng. Soc., 49(10):867–903, 2001.
[15] G. Borin, G. De Poli, and A. Sarti. Sound Synthesis by Dynamic Systems Interaction. In D. Baggi, editor, Readings in Computer-Generated Music, pages 139–160. IEEE Computer Society Press, 1992.
[16] G. Borin, G. D. Poli, and D. Rocchesso. Elimination of delay-free loops in discrete-time models of nonlinear acoustic systems. IEEE Transactions on Speech and Audio Processing, 8(5):597–605, 2000.
[17] G. Borin, D. Rocchesso, and F. Scalcon. A physical piano model for music performance. In Proc. International Computer Music Conference, pages 350–353, Thessaloniki, Greece, Sept. 1997. ICMA.
[18] J. Borish. An Auditorium Simulation for Home Use. In Audio Eng. Soc. Convention, New York, 1983. AES.
[19] C. P. Brown and R. O. Duda. A structural model for binaural sound synthesis. IEEE Trans. Speech and Audio Processing, 6(5):476–488, Sept. 1998.
[20] C. Cadoz, A. Luciani, and J.-L. Florens. CORDIS-ANIMA: A modeling and simulation system for sound synthesis - the general formalism. Computer Music J., 17(1):19–29, Spring 1993.
[21] S. Cavaliere, G. D. Giugno, and E. Guarino. MARS - The X20 device and the SM1000 board. In A. Strange, editor, Proc. International Computer Music Conference, pages 348–351, San Jose, CA, Oct. 1992. ICMA.
[22] A. Chaigne. On the Use of Finite Differences for Musical Synthesis. Application to Plucked Stringed Instruments. J. Acoustique, 5:181–211, 1992.
[23] J. M. Chowning. The synthesis of complex audio spectra by means of frequency modulation. Journal of the Audio Eng. Soc., 21(7):526–534, 1973.
[24] T. H. Cormen, C. E. Leiserson, and R. L. Rivest. Introduction to Algorithms. MIT Press, Cambridge, MA, 1990.
[25] R. Courant and H. Robbins. What is Mathematics?: an elementary approach to ideas and methods. Oxford Un. Press, New York, 1941. Italian translation: Che Cos'è la Matematica?, Universale Scientifica Boringhieri, 1971.
[26] D. Gabor. Acoustical Quanta and the Theory of Hearing. Nature, 159(4044):591–594, May 1947.
[27] R. B. Dannenberg. Abstract time warping of compound events and signals. Computer Music J., 21(3):61–70, 1997.
[28] R. B. Dannenberg. Machine tongues XIX: Nyquist, a language for composition and sound synthesis. Computer Music J., 21(3):50–60, 1997.
[29] J. Dattorro. Effect design - part 1: Reverberator and other filters. J. Audio Eng. Soc., 45(19):660–684, Sept. 1997.
[30] J. Dattorro. Effect design - part 2: Delay-line modulation and chorus. J. Audio Eng. Soc., 45(10):764–788, Oct. 1997.
[31] G. De Poli and D. Rocchesso. Physically-based sound modeling. Organised Sound, 3(1):61–76, 1998.
[32] R. O. Duda and W. L. Martens. Range dependence of the response of a spherical head model. J. Acoustical Soc. of America, 104(5):3048–3058, Nov. 1998.
[33] K. Fitz and L. Haken. Sinusoidal modeling and manipulation using lemur. Computer Music J., 20(4):44–59, 1997.
[34] F. Fontana and D. Rocchesso. Physical modeling of membranes for percussion instruments. Acustica, 84(13):529–542, Jan. 1998. S. Hirzel Verlag.
[35] A. Freed, X. Rodet, and P. Depalle. Synthesis and control of hundreds of sinusoidal partials on a desktop computer without custom hardware. In Proc. 1993 Int. Computer Music Conf., Tokyo, pages 98–101, 1993.
[36] B. Gardner and K. Martin. HRTF measurements of a KEMAR dummy-head microphone. Technical report # 280, MIT Media Lab, Cambridge, MA, 1994.
[37] W. G. Gardner. Efficient convolution without input-output delay. J. Audio Eng. Soc., 43(3):127–136, 1995.
[38] W. G. Gardner. 3D Audio using Loudspeakers. Kluwer Academic Publishers, Norwell, MA, 1998.
[39] W. G. Gardner. Reverberation algorithms. In M. Kahrs and K. Brandenburg, editors, Applications of Digital Signal Processing to Audio and Acoustics, pages 85–131. Kluwer Academic Publishers, Norwell, MA, 1998.
[40] M. A. Gerzon. Unitary (Energy Preserving) Multichannel Networks with Feedback. Electronics Letters V, 12(11):278–279, 1976.
[41] W. M. Hartmann. Digital waveform generation by fractional addressing. J. Acoustical Soc. of America, 82(6):1883–1891, 1987.
[42] W. M. Hartmann. Signals, Sound, and sensation. Springer-Verlag, New York, 1998.
[43] D. A. Jaffe and J. O. Smith. Extensions of the Karplus-Strong Plucked String Algorithm. Computer Music J., 7(2):56–69, 1983.
[44] J.-M. Jot. Etude et Réalisation d'un Spatialisateur de Sons par Modèles Physiques et Perceptifs. PhD thesis, TELECOM, Paris 92 E 019, 1992.
[45] J.-M. Jot and A. Chaigne. Digital Delay Networks for Designing Artificial Reverberators. In Audio Eng. Soc. Convention, Paris, France, Feb. 1991. AES.
[46] T. Kailath. Linear Systems. Prentice-Hall, Englewood Cliffs, 1980.
[47] K. Karplus and A. Strong. Digital Synthesis of Plucked String and Drum Timbres. Computer Music J., 7(2):43–55, 1983.
[48] G. S. Kendall. A 3D Sound Primer: Directional Hearing and Stereo Reproduction. Computer Music J., 19(4):23–46, Winter 1995.
[49] G. Kuhn. Model for the interaural time differences in the azimuthal plane. J. Acoustical Soc. of America, 62:157–167, July 1977.
[50] H. Kuttruff. A Simple Iteration Scheme for the Computation of Decay Constants in Enclosures with Diffusely Reflecting Boundaries. J. Acoustical Soc. of America, 98(1):288–293, July 1995.
[51] T. I. Laakso, V. Välimäki, M. Karjalainen, and U. K. Laine. Splitting the Unit Delay: Tools for Fractional Delay Filter Design. IEEE Signal Processing Magazine, 13(1):30–60, Jan. 1996.
[52] J. Laroche. Time and pitch scale modification of audio signals. In M. Kahrs and K. Brandenburg, editors, Applications of Digital Signal Processing to Audio and Acoustics, pages 279–309. Kluwer Academic Publishers, 1998.
[53] J. Makhoul. Linear prediction: A tutorial review. Proc. IEEE, 63(4):561–580, Apr. 1975.
[54] W. L. Martens. Psychophysical calibration for controlling the range of a virtual sound source: multidimensional complexity in spatial auditory display. In Proc. Int. Conf. Auditory Display (ICAD-01), pages 197–207, Espoo, Finland, 2001.
[55] D. C. Massie. Wavetable sampling synthesis. In M. Kahrs and K. Brandenburg, editors, Applications of Digital Signal Processing to Audio and Acoustics, pages 311–341. Kluwer Academic Publishers, 1998.
[56] M. Mathews, J. E. Miller, F. R. Moore, J. R. Pierce, and J.-C. Risset. The Technology of Computer Music. MIT Press, Cambridge, MA, 1969.
[57] D. Mazzoni and R. Dannenberg. A fast data structure for disk-based audio editing. In Proc. International Computer Music Conference, La Habana, Cuba, Sep. 2001. ICMA.
[58] S. K. Mitra. Digital Signal Processing: A Computer-Based Approach. McGraw-Hill, New York, 1998.
[59] F. R. Moore. An Introduction to the Mathematics of Digital Signal Processing. Part I: Algebra, Trigonometry, and the Most Beautiful Formula in Mathematics. Computer Music J., 2(1):38–47, 1978.
[60] F. R. Moore. A General Model for Spatial Processing of Sounds. Computer Music J., 7(3):6–15, 1982.
[61] J. A. Moorer. About this Reverberation Business. Computer Music J., 3(2):13–18, 1979.
[62] J. A. Moorer. The Manifold Joys of Conformal Mapping: Applications to Digital Filtering in the Studio. J. Audio Eng. Soc., 31(11):826–840, 1983.
[63] P. M. Morse. Vibration and Sound. American Institute of Physics for the Acoustical Society of America, New York, 1991. 1st ed. 1936, 2nd ed. 1948.
[64] C. Müller-Tomfelde. Low-latency convolution for real-time applications. In Proc. Audio Eng. Soc. Int. Conference, pages 454–459, Rovaniemi, Finland, April 1999. Journal of the Audio Eng. Soc.
[65] A. V. Oppenheim and R. W. Schafer. Discrete-Time Signal Processing. Prentice-Hall, Inc., Englewood Cliffs, NJ, 1989.
[66] A. V. Oppenheim and A. S. Willsky (with S. H. Nawab). Signals and Systems. Prentice-Hall, Inc., Upper Saddle River, NJ, 1997. Second edition.
[67] S. J. Orfanidis. Introduction to Signal Processing. Prentice Hall, Englewood Cliffs, N.J., 1996.
[68] J.-M. Pernaux, P. Boussard, and J.-M. Jot. Virtual sound source positioning and mixing in 5.1 implementation on the real-time system genesis. In Proc. Conf. Digital Audio Effects (DAFx-98), Barcelona, Spain, pages 76–80, Nov. 1998.
[69] K. C. Pohlmann. Principles of Digital Audio. McGraw-Hill, New York, 1995.
[70] W. H. Press, B. P. Flannery, S. A. Teukolsky, and W. T. Vetterling. Numerical Recipes in C. 1988. Sections available online at .
[71] M. Puckette. Pure data. In Proc. International Computer Music Conference, pages 224–227, Thessaloniki, Greece, Sept. 1997. ICMA.
[72] V. Pulkki. Virtual sound source positioning using vector base amplitude panning. J. Audio Eng. Soc., 45(6):456–466, 1997.
[73] V. Pulkki, M. Karjalainen, and J. Huopaniemi. Analyzing virtual sound source attributes using binaural auditory models. J. Audio Eng. Soc., 47(4):203–217, Apr. 1999.
[74] T. Quatieri and R. McAulay. Audio signal processing based on sinusoidal analysis/synthesis. In M. Kahrs and K. Brandenburg, editors, Applications of Digital Signal Processing to Audio and Acoustics, pages 343–416. Kluwer Academic Publishers, 1998.
[75] A. S. Reber and E. Reber. The Penguin Dictionary of Psychology. Penguin Books Ltd., London, UK, 2001. Third Edition.
[76] P. A. Regalia, S. K. Mitra, and P. P. Vaidyanathan. The Digital All-Pass Filter: A Versatile Signal Processing Building Block. Proc. IEEE, 76(1):19–37, Jan. 1988.
[77] C. Roads. Asynchronous granular synthesis. In Representations of Musical Signals, pages 143–186. MIT Press, Cambridge, MA, 1991.
[78] C. Roads. The Computer Music Tutorial. MIT Press, Cambridge, Mass., 1996.
[79] D. Rocchesso. The Ball within the Box: a sound-processing metaphor. Computer Music J., 19(4):47–57, Winter 1995.
[80] D. Rocchesso. Strutture ed Algoritmi per l'Elaborazione del Suono basati su Reti di Linee di Ritardo Interconnesse. PhD thesis, Università di Padova, Dipartimento di Elettronica e Informatica, Feb. 1996.
[81] D. Rocchesso. Maximally-Diffusive yet Efficient Feedback Delay Networks for Artificial Reverberation. IEEE Signal Processing Letters, 4(9):252–255, Sept. 1997.
[82] D. Rocchesso. Spatial effects. In U. Zölzer, editor, Digital Audio Effects, pages 137–200. John Wiley and Sons, Ltd., Chichester Sussex, UK, 2002.
[83] D. Rocchesso and F. Scalcon. Bandwidth of perceived inharmonicity for physical modeling of dispersive strings. IEEE Transactions on Speech and Audio Processing, 7(5):597–601, Sept. 1999.
[84] D. Rocchesso and J. O. Smith. Circulant and Elliptic Feedback Delay Networks for Artificial Reverberation. IEEE Transactions on Speech and Audio Processing, 5(1):51–63, Jan. 1997.
[85] D. Rocchesso and J. O. Smith. Generalized digital waveguide networks. IEEE Transactions on Speech and Audio Processing, 11(5), 2003.
[86] J. G. Roederer. Introduction to the Physics and Psychophysics of Music. Springer-Verlag, Heidelberg, 1975.
[87] B. Schottstaedt. Machine tongues XVII: CLM: Music V meets common lisp. Computer Music J., 18(2):30–37, 1994.
[88] M. R. Schroeder. Improved Quasi-Stereophony and "Colorless" Artificial Reverberation. J. Acoustical Soc. of America, 33(8):1061–1064, Aug. 1961.
[89] M. R. Schroeder. Natural-Sounding Artificial Reverberation. J. Audio Eng. Soc., 10(3):219–233, July 1962.
[90] M. R. Schroeder. Digital Simulation of Sound Transmission in Reverberant Spaces. J. Acoustical Soc. of America, 47(2):424–431, 1970.
[91] M. R. Schroeder. Computer Models for Concert Hall Acoustics. American Journal of Physics, 41:461–471, 1973.
[92] M. R. Schroeder. Computer Speech: Recognition, Compression, and Synthesis. Springer Verlag, Berlin, Germany, 1999.
[93] M. R. Schroeder and B. Logan. "Colorless" Artificial Reverberation. J. Audio Eng. Soc., 9:192–197, July 1961. Reprinted in the IRE Trans. on Audio.
[94] X. Serra. Musical sound modeling with sinusoids plus noise. In C. Roads, S. Pope, A. Picialli, and G. D. Poli, editors, Musical Signal Processing, pages 91–122. Swets & Zeitlinger, 1997.
[95] X. Serra and J. O. Smith. Spectral modeling synthesis: A sound analysis/synthesis system based on a deterministic plus stochastic decomposition. Computer Music Journal, 14(4):12–24, 1990.
[96] J. O. Smith. An allpass approach to digital phasing and flanging. In Proc. International Computer Music Conference, page 236, Paris, France, 1984. ICMA. Also available as Rep. STAN-M-21, CCRMA, Stanford University.
[97] J. O. Smith. Fundamentals of Digital Filter Theory. Computer Music J., 9(3):13–23, 1985.
[98] J. O. Smith. Physical modeling using digital waveguides. Computer Music J., 16(4):74–91, Winter 1992.
[99] J. O. Smith and J. S. Abel. The Bark Bilinear Transform. In Proc. IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, Mohonk, NY, Oct. 1995.
[100] J. O. Smith and B. Friedlander. Adaptive interpolated time-delay estimation. IEEE Trans. Aerospace and Electronic Systems, 21(2):180–199, Mar. 1985.
[101] J. Stautner and M. Puckette. Designing Multichannel Reverberators. Computer Music J., 6(1):52–65, Spring 1982.
[102] K. Steiglitz. A Digital Signal Processing Primer. Addison-Wesley, Menlo Park, CA, 1996.
[103] J. Strikwerda. Finite Difference Schemes and Partial Differential Equations. Wadsworth & Brooks, Pacific Grove, CA, 1989.
[104] C. R. Sullivan. Extending the Karplus-Strong Algorithm to Synthesize Electric Guitar Timbres with Distortion and Feedback. Computer Music J., 14(3):26–37, 1990.
[105] J. Sundberg. The Science of Musical Sounds. Academic Press, San Diego, CA, 1989. First Ed. 1973.
[106] B. Vercoe. Csound: A manual for the audio processing system and supporting programs with tutorials. Technical report, Media Lab, M.I.T., Cambridge, Massachusetts, 1993. Software and manuals available from ftp://ftp.maths.bath.ac.uk/pub/dream/.
[107] B. L. Vercoe, W. G. Gardner, and E. D. Scheirer. Structured Audio: Creation, Transmission, and Rendering of Parametric Sound Representations. Proc. IEEE, 86(5):922–940, May 1998.
[108] T. S. Verma, S. N. Levine, and T. H. Y. Meng. Transient modeling synthesis: a flexible analysis/synthesis tool for transient signals. In Proc. International Computer Music Conference, pages 164–167, Thessaloniki, Greece, Sept. 1997. ICMA.
[109] U. Zölzer. Digital Audio Signal Processing. John Wiley and Sons, Inc., Chichester, England, 1997.
[110] U. Zölzer, editor. Digital Audio Effects. John Wiley and Sons, Ltd., Chichester Sussex, UK, 2002.
[111] E. Zwicker and H. Fastl. Psychoacoustics: Facts and Models. Springer Verlag, Berlin, Germany, 1990.
This note was uploaded on 10/25/2010 for the course FOSEE CVL1040 taught by Professor None during the Spring '09 term at Multimedia University, Cyberjaya.