Ch3-Production_and_Classification_of_Speech_Sounds2

Speech Processing: Production and Classification of Speech Sounds
Veton Kpuska, February 13, 2012

Introduction

Simplified view of speech production (see Figure 3.1):
- Lungs act as a power supply and provide airflow to the larynx stage.
- Larynx modulates the airflow and provides either a periodic puff-like airflow or a noisy airflow to the vocal tract.
- Vocal tract gives the modulated airflow its "color" (spectrally shaping the source) with the oral, nasal, and pharynx cavities.

Figure 3.1

Sound sources can also be generated by constrictions and boundaries that are made within the vocal tract itself. There are three general categories of source for speech sounds:
1. Periodic
2. Noisy
3. Impulsive

Note that the speech production mechanism does not generate a perfectly periodic, impulsive, or noisy source.

Illustration of each in the word "shop": "sh" is noisy, "o" is periodic, "p" is impulsive.

Figure: Example of "shop" (noise-like signal, periodic source, impulse source)

Distinguishable speech sounds are determined not only by the source, but also by different vocal tract configurations, and by the combination of both.
- Speech sound classes are referred to as phonemes.
- Phonemics is the discipline that studies phoneme realizations (e.g., in a language).
- Each phoneme class provides a certain meaning in a word.
- Within a phoneme class there exist many sound variations that provide the same meaning. The study of these sound variations is called phonetics.
- Phonemes are the basic building blocks of a language: they are concatenated (more or less) as discrete elements into words, according to certain phonemic and grammatical rules.
This chapter will cover:
- A description of the speech production mechanism.
- The resulting variety of phonetic sound patterns.
- How these sounds differ among different speakers.

Anatomy and Physiology of Speech Production

The anatomy of speech production is shown in Figure 3.2.

Lungs:
- Inhalation and exhalation of air.
- Connected to the vocal tract through the trachea ("windpipe"), a pipe ~12 cm long and ~1.5 to 2 cm in diameter, and the epiglottis.
- During speaking, the rhythmical cycle of inhalation and exhalation changes to accommodate speech production: the duration of exhalation becomes roughly equal to the length of the sentence or phrase, and lung air pressure during this time is maintained at a constant level, slightly above atmospheric pressure.

Larynx:
- A complicated system of cartilages, flesh, muscles, and ligaments.
- Its primary function (in the context of speech production) is to control the vocal cords (vocal folds), as illustrated in Figure 3.3.
- Vocal fold length is ~15 mm in men and ~13 mm in women.

Three primary states of the vocal folds:
- Breathing: the arytenoid cartilages are held outward.
- Voiced: the arytenoid cartilages are held close together.
- Unvoiced: the arytenoid cartilages are held outward or partially closed.

The complex motion of the vocal folds is illustrated in Figure 3.4 and by the nonlinear two-mass model of Flanagan et al.
(Figure 3.5).

Arytenoid (dictionary entry). Pronunciation: ar-uh-TEE-noid, uh-RIT-uh-noid. Function: adjective. Etymology: New Latin arytaenoides, from Greek arytainoeidēs, literally "ladle shaped", from arytaina, "ladle". Date: circa 1751.
1: relating to or being either of two small laryngeal cartilages to which the vocal cords are attached.
2: relating to or being either of a pair of small muscles or an unpaired muscle of the larynx.
Also used as a noun.

If one were to measure the airflow velocity at the glottis as a function of time, the waveform obtained would be approximately similar to that of Figure 3.6:
- Closed phase: the folds are closed and no flow occurs.
- Open phase: the folds are open and the flow increases up to a maximum.
- Return phase: the time interval from the maximum airflow until glottal closure.

The specific flow shape can change with the speaker, the speaking style, and the specific speech sound.

Glottal airflow is referred to as glottal flow. The time duration of one glottal cycle is referred to as the pitch period. The reciprocal of the pitch period is referred to as the pitch, also called the fundamental frequency.

Example 3.1

Consider a glottal flow waveform model of the form

$$u[n] = g[n] * p[n]$$

where $g[n]$ is the glottal flow waveform over a single cycle and $p[n]$ is an impulse train with spacing $P$:

$$p[n] = \sum_{k=-\infty}^{\infty} \delta[n - kP]$$

Because the waveform is infinitely long, a segment is extracted by multiplying $u[n]$ by a short sequence called an analysis window or simply a window.
The window, denoted by $w[n,\tau]$, is centered at time $\tau$, as illustrated in Figure 3.7, and the resulting waveform segment is written as

$$u[n,\tau] = w[n,\tau]\,\big(g[n] * p[n]\big)$$

Using the Multiplication and Convolution Theorems of Chapter 2, with $p[n] = \sum_{k=-\infty}^{\infty} \delta[n - kP]$, the following expression is obtained in the frequency domain (here $\circledast$ denotes convolution in frequency):

$$U(\omega,\tau) = \frac{1}{P}\, W(\omega,\tau) \circledast \Big[ G(\omega) \sum_{k=-\infty}^{\infty} \delta(\omega - \omega_k) \Big]$$

$$U(\omega,\tau) = \frac{1}{P} \sum_{k=-\infty}^{\infty} G(\omega_k)\, W(\omega - \omega_k, \tau)$$

where $W(\omega,\tau)$ is the Fourier transform of $w[n,\tau]$, $G(\omega)$ is the Fourier transform of $g[n]$, and $\omega_k = (2\pi/P)k$, where $2\pi/P$ is the fundamental frequency or pitch.

As illustrated in Figure 3.7, the Fourier transform of the window sequence is characterized by a narrow main lobe centered at $\omega = 0$ with lower surrounding side lobes.

Figure 3.7: Effect of the harmonics of the glottal waveform on the spectrum.

- A decrease in the pitch period $P$ causes an increase in the spacing of the harmonics of the glottal waveform, since $\omega_k = (2\pi/P)k$.
- The first harmonic is also the fundamental frequency.
- At each harmonic frequency there is a translated window Fourier transform $W(\omega - \omega_k)$ weighted by $G(\omega_k)$.
- The magnitude of the spectral shaping function, i.e., of the glottal flow transform $|G(\omega)|$, is referred to as the spectral envelope of the harmonics.

The Fourier transform of a periodic glottal waveform is characterized by harmonics. Typically the spectral envelope of the harmonics (governed by the glottal flow over one cycle) has on average a 12 dB/octave rolloff. The rolloff depends on the nature of the airflow and on speaker characteristics. See Exercise 3.18 for further details.

The model in Example 3.1 is ideal in the sense that, even for sustained voicing, a fixed pitch period is almost never maintained in time. The pitch period can vary "randomly" over successive periods; this is called pitch "jitter".
The amplitude of the airflow velocity within a glottal cycle may also differ across consecutive pitch periods; this is called amplitude "shimmer".

These variations are (perhaps!) due to:
- Time-varying characteristics of the vocal tract and vocal folds,
- Nonlinear behavior in the speech anatomy, or
- An underlying deterministic (chaotic) system whose output only appears random.

Jitter and shimmer are one component that gives vowels their naturalness. In contrast, a monotone pitch and fixed amplitude result in a machine-like sound. Voice character is determined by the extent of jitter and shimmer in the voice (e.g., a hoarse voice).

States of the vocal folds:
- Breathing
- Voicing
- Unvoicing: turbulence at the vocal folds, called aspiration. Example: "he", whispered sounds.

Aspiration also occurs with voiced sounds (breathy voice): part of the vocal folds vibrates and part is nearly fixed.

Other forms of atypical vocal fold movement:
- Creaky voice: very tense vocal folds with only a short portion of the folds oscillating, resulting in a voice that has a high and irregular pitch.
- Vocal fry: the vocal folds are massy and relaxed, resulting in a voice with an abnormally low and irregular pitch. It is characterized by secondary glottal pulses close to and overlapping the primary glottal pulse, a result of coupling of the false vocal folds with the true vocal folds.
- Diplophonic voice: secondary glottal pulses occur between the primary pulses within the closed phase (see Figure 3.9b and Figure 3.16).

Figure: Examples of atypical voice types

Vocal Tract

The vocal tract is comprised of the oral cavity, from the larynx to the lips, including the nasal passage, which is coupled to the oral tract by way of the velum. The articulators are the tongue, teeth, lips, and jaw.
The oral tract takes on many different lengths and cross-sections. This is accomplished by moving the articulators. The average length for an adult male is 17 cm, with a cross-sectional area of up to 20 cm².

The purpose of the vocal tract is to:
- Spectrally "color" the source, and
- Generate new sources for sound production.

Spectral Shaping

Under certain conditions, the relation between the glottal airflow velocity input and the vocal tract airflow velocity output can be approximated by a linear filter with resonances.
- The resonance frequencies of the vocal tract are called formant frequencies or simply formants.
- Formants change with different vocal tract configurations, as depicted in Figure 3.10.

Figure 3.10

The peaks of the spectrum of the vocal tract response correspond approximately to its formants. For a time-invariant all-pole linear system model of the vocal tract, a pole at $z_0 = r_0 e^{j\omega_0}$ corresponds approximately to a vocal tract formant:
- The frequency of the formant is $\omega_0$.
- The bandwidth depends on the distance of the pole from the unit circle ($r_0$).

Because the vocal tract is assumed stable (with poles inside the unit circle), its transfer function can be expressed either in product form or in partial fraction expansion form:

$$H(z) = \frac{A}{\prod_{k=1}^{N_i} (1 - c_k z^{-1})(1 - c_k^{*} z^{-1})}$$

$$H(z) = \sum_{k=1}^{N_i} \frac{A_k}{(1 - c_k z^{-1})(1 - c_k^{*} z^{-1})}$$

Formants of the vocal tract are numbered from low to high according to their location: F1, F2, etc.

In general, the formant frequencies decrease as the vocal tract length increases:
- Male speakers tend to have lower formants than female speakers.
- Female speakers have lower formants than children.
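The link between a pole and a formant can be checked numerically. The sketch below is only an illustration under stated assumptions: it evaluates a single conjugate pole pair (one term of the product form, with gain 1), and the pole radii 0.95 and 0.80, the formant frequency 0.3π rad/sample, and the helper names are all chosen here for the example. The rule of thumb that the half-power bandwidth is roughly −2 ln r₀ rad/sample is a commonly used approximation, not something stated in the slides.

```python
import numpy as np

def pole_pair_response(r0, w0, n_freqs=4096):
    """Magnitude of H(z) = 1 / ((1 - c z^-1)(1 - c* z^-1)), with c = r0*exp(j*w0)."""
    w = np.linspace(0.0, np.pi, n_freqs, endpoint=False)  # rad/sample
    z_inv = np.exp(-1j * w)
    c = r0 * np.exp(1j * w0)
    H = 1.0 / ((1.0 - c * z_inv) * (1.0 - np.conj(c) * z_inv))
    return w, np.abs(H)

def peak_and_bandwidth(r0, w0):
    """Return (peak frequency, half-power bandwidth), both in rad/sample."""
    w, mag = pole_pair_response(r0, w0)
    power = mag ** 2
    peak = w[np.argmax(power)]
    # Width of the region within 3 dB of the peak (one contiguous region here).
    bandwidth = np.count_nonzero(power >= power.max() / 2.0) * (w[1] - w[0])
    return peak, bandwidth

w0 = 0.3 * np.pi                                     # assumed formant frequency
peak_sharp, bw_sharp = peak_and_bandwidth(0.95, w0)  # pole close to the unit circle
peak_broad, bw_broad = peak_and_bandwidth(0.80, w0)  # pole farther inside

print("peak near w0:", peak_sharp, "vs", w0)
print("bandwidths:", bw_sharp, bw_broad, "rule of thumb:", -2 * np.log(0.95))
```

The spectral peak sits close to the pole angle, and moving the pole toward the unit circle narrows the resonance, which is the sense in which the bandwidth depends on r₀.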
Under the assumptions that the vocal tract is linear and time-invariant and that the sound source occurs at the glottis, the speech waveform (the airflow velocity at the vocal tract output) can be expressed as the convolution of the glottal flow input and the vocal tract impulse response.

Example 3.2

Consider a periodic glottal flow source of the form

$$u[n] = g[n] * p[n]$$

where $g[n]$ is the airflow over one glottal cycle and $p[n]$ is the unit sample train with spacing $P$. When the sequence $u[n]$ is passed through a linear time-invariant vocal tract with impulse response $h[n]$, the vocal tract output is given by

$$x[n] = h[n] * (g[n] * p[n])$$

A window centered at time $\tau$, $w[n,\tau]$, is applied to the vocal tract output to obtain the speech segment

$$x[n,\tau] = w[n,\tau]\,\{h[n] * (g[n] * p[n])\}$$

Using the Multiplication and Convolution Theorems, the frequency-domain representation of the speech segment is obtained (here $\circledast$ denotes convolution in frequency):

$$X(\omega,\tau) = \frac{1}{P}\, W(\omega,\tau) \circledast \Big[ H(\omega) G(\omega) \sum_{k=-\infty}^{\infty} \delta(\omega - \omega_k) \Big]$$

$$X(\omega,\tau) = \frac{1}{P} \sum_{k=-\infty}^{\infty} H(\omega_k)\, G(\omega_k)\, W(\omega - \omega_k, \tau)$$

where $W(\omega,\tau)$ is the Fourier transform of $w[n,\tau]$, $\omega_k = (2\pi/P)k$, and $2\pi/P$ is the fundamental frequency or pitch.

Figure 3.11 illustrates that the spectral shaping of the windowed transform at the harmonics $\omega_1, \omega_2, \ldots, \omega_N$ is determined by the spectral envelope $|H(\omega)G(\omega)|$, consisting of both glottal and vocal tract contributions (unlike Example 3.1, which consists only of the glottal contribution).

The general upward or downward slope of the spectral envelope, also called spectral tilt, is influenced by:
- The nature of the glottal flow waveform over a cycle, e.g., a gradual or abrupt closing, and
- The manner in which formant tails add.
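The harmonic sampling of the envelope in Example 3.2 can be reproduced with a short numerical sketch. Everything specific here is an assumption made for illustration: the raised-cosine pulse standing in for g[n], the single pole pair standing in for h[n], the pitch period of 80 samples, and a rectangular window spanning exactly M pitch periods (chosen so that each shifted window transform is sampled at its own nulls). With a finite DFT the scale factor is M rather than the 1/P of the continuous-frequency expression.

```python
import numpy as np

P, M, warmup = 80, 8, 10   # pitch period (samples), periods analysed, periods discarded

# Illustrative glottal pulse over one cycle (raised cosine; an assumed shape).
g = 0.5 * (1.0 - np.cos(2.0 * np.pi * np.arange(40) / 40))

# Single-formant vocal tract: pole pair at r*exp(+/- j*w0), truncated impulse response.
r, w0 = 0.9, 2.0 * np.pi * 0.1
n = np.arange(400)
h = r ** n * np.sin(w0 * (n + 1)) / np.sin(w0)

# Periodic source u[n] = g[n] * p[n] and vocal tract output x[n] = h[n] * u[n].
L = (warmup + M) * P
p = np.zeros(L)
p[::P] = 1.0
x = np.convolve(h, np.convolve(g, p))[:L]

# DFT over exactly M steady-state pitch periods (rectangular window).
X = np.fft.fft(x[warmup * P:])

def dtft(s, omegas):
    """Evaluate the DTFT of the finite sequence s at the given frequencies."""
    idx = np.arange(len(s))
    return np.array([np.sum(s * np.exp(-1j * wk * idx)) for wk in omegas])

omega_k = 2.0 * np.pi * np.arange(P) / P         # harmonic frequencies (2*pi/P)*k
envelope = dtft(h, omega_k) * dtft(g, omega_k)    # H(w_k) G(w_k)

harmonic_bins = np.arange(0, M * P, M)            # DFT bins that fall on harmonics
off_harmonic = np.delete(X, harmonic_bins)

print("largest off-harmonic magnitude:", np.max(np.abs(off_harmonic)))
print("harmonic bins match M*H*G:", np.allclose(X[harmonic_bins], M * envelope, atol=1e-6))
```

The spectrum is nonzero only at multiples of 2π/P, and the value at each harmonic is set by the product of the vocal tract and glottal transforms, which is the sense in which the source harmonics sample the envelope.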
Note also from Figure 3.11 that the formant locations are not always clear from the short-time Fourier transform magnitude $|X(\omega,\tau)|$ because of the sparse sampling of the spectral envelope $|H(\omega)G(\omega)|$ by the source harmonics. This is especially the case for high-pitched speech.

The previous example is important because it illustrates the difference between a formant (a resonance frequency of the vocal tract) and a harmonic frequency:
- A formant corresponds to a vocal tract pole (resonant frequency).
- Harmonics arise from the periodicity of the glottal source (pitch).

In developing signal processing algorithms that require formants, the scarcity of spectral information can be a detriment to formant estimation. On the other hand, the spectral sampling by the harmonics can be exploited to enhance the perception of a sound (as in the singing voice).

Example 3.3

A soprano singer often sings a tone whose first harmonic (fundamental frequency) $\omega_1$ is much higher than the first formant frequency (F1) of the vowel being sung. As shown in Figure 3.12, when the nulls of the vocal tract spectrum are sampled at the harmonics, the resulting sound is weak, especially in the face of competing instruments. To enhance the sound, the singer creates a vocal tract configuration with a widened jaw, which increases the first formant frequency (Exercise 3.4) so that it can match the frequency of the first harmonic, thus generating a louder sound.

Figure 3.12

Spectral Shaping: Nasal Sounds

The nasal and oral components of the vocal tract are coupled by the velum. When the velum is lowered, introducing an opening into the nasal passage, and the oral tract is shut off by the tongue or lips, sound propagates through the nasal passage and out through the nose.
The resulting sounds have a spectrum that is dominated by the low-frequency formants of the large volume of the nasal cavity and are appropriately called nasal sounds. Examples: "nose" and "mouse".

Figure: Spectral Shaping: "Nose"

Figure: Spectral Shaping: "Mouse"

Because the nasal cavity (unlike the oral tract) is essentially constant, the characteristics of nasal sounds may be particularly useful in speaker identification.

The velum can be lowered even when the vocal tract is open. When this coupling occurs, the resulting sound is said to be nasalized (e.g., a nasalized vowel). Two dominant effects characterize nasalization:
- Broadening of the formant bandwidths of the oral tract because of the loss of energy through the nasal passage.
- Introduction of anti-resonances (i.e., zeros in the vocal tract transfer function) due to the absorption of energy at the resonances of the nasal passage.

Source Generation: Plosives

The previous section discussed the effect of vocal tract shape on sound production. Figure 3.10(b) depicts a complete closure of the tract (the tongue pressing against the palate). This closure is required when making an impulsive sound (a plosive):
- Buildup of pressure behind the palate, and
- Abrupt release of the pressure.

Figure: Source Generation: Plosives, "Drop"

Source Generation: Fricatives

Another sound source is created when the tongue is very close to the palate (but the airflow is not completely impeded), which generates turbulence and thus a noise source (e.g., fricatives).

As with the periodic glottal sound source, spectral shaping also occurs for either type of input (i.e., impulse or noise source). There is no harmonic structure with these types of input: the source spectrum is shaped at all frequencies by $|H(\omega)|$.
Note that the spectrum of the noise was idealized by assuming a flat spectrum. In reality these sources will themselves have a non-flat spectral shape.

Figure: Source Generation: Fricatives, "NASA"

There is another class of source generated within the vocal tract; however, it is less understood than the noisy and impulsive sources at oral tract constrictions.
- This source arises from the interaction of vortices with vocal tract boundaries such as the false vocal folds, teeth, or occlusions in the oral tract.
- A vortex can be thought of as a tiny rotational airflow in the oral tract.
- There is evidence that sources due to vortices influence the temporal, spectral, and perhaps perceptual characteristics of speech sounds.

Categorization of Sound by Source

- Voiced: speech sounds generated with a periodic glottal source.
- Unvoiced: speech sounds not generated with a periodic glottal source.

There are a variety of unvoiced sounds:
- Fricatives: sounds generated from the friction of the moving air against an oral tract constriction. Example: "thin".
- Plosives: created with an impulsive source within the oral tract. Example: "top".
- Whispers: a barrier is made at the vocal folds by partially closing them, but without oscillations. Example: "he".

However, these source types are not exclusive to unvoiced sounds: the vocal folds can be vibrating simultaneously with impulsive or noisy sources. Thus the above subcategories may also exist for voiced sounds. Examples:
- Fricatives: "zebra" vs. "sheba"
- Plosives: "bin" vs. "pin"

Figure: Categorization of Sound by Source

Spectrographic Analysis of Speech

The speech waveform consists of a sequence of different events. This time variation corresponds to highly fluctuating spectral characteristics over time. Consider the example of the word "to".
A single Fourier transform of the entire acoustic signal of the word "to" cannot capture this time-varying frequency content. In contrast, the short-time Fourier transform (STFT), which consists of a separate Fourier transform of pieces of the waveform under a sliding window, can capture this temporal variability.

In Examples 3.1 and 3.2, presented earlier, the concept of a sliding (analysis) window was introduced. This window, $w[n,\tau]$, is typically tapered at its ends (Figure 3.14) to avoid unnatural discontinuities in the speech segment and distortion in its underlying spectrum. Example, the Hamming window:

$$w[n,\tau] = 0.54 - 0.46 \cos\!\left[\frac{2\pi (n - \tau)}{N_w - 1}\right] \quad \text{for } 0 \le n - \tau \le N_w - 1$$

The window typically does not move one sample at a time, but rather moves at some frame interval (which determines the frame rate) consistent with the temporal structure one wants to reveal.

The short-time Fourier transform is

$$X(\omega,\tau) = \sum_{n=-\infty}^{\infty} x[n,\tau]\, e^{-j\omega n}$$

where $x[n,\tau] = w[n,\tau]\,x[n]$ represents the windowed speech segments as a function of the window center at time $\tau$.

The spectrogram is graphically displayed as

$$S(\omega,\tau) = |X(\omega,\tau)|^2$$

- $S(\omega,\tau)$ is a 2-D (two-dimensional) representation of the "energy density" of the signal.
- For each window position $\tau$, one could plot $S(\omega,\tau)$.
- A better and more compact time-frequency display of the spectrogram places the spectral magnitude measurements vertically in a three-dimensional mesh, or two-dimensionally with intensity coming out of the page. This display is illustrated (as a caricature) in Figure 3.14.

The figure also illustrates two kinds of spectrograms:
- Narrowband, which gives good spectral resolution: a good view of the frequency content of sinewaves with closely spaced frequencies.
- Wideband, which gives good temporal resolution: a good view of the temporal content of impulses closely spaced in time.
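The two kinds of spectrogram can be compared with a minimal sketch of the definitions above. The test signal (an idealized impulse-train source), the 8 kHz sampling rate, the 200 Hz pitch, the hop of 8 samples, and the function names are assumptions made for this illustration; the 20 ms and 4 ms window lengths follow the values quoted for Figure 3.15. Each window is indexed from its start rather than its center, which only shifts the time axis.

```python
import numpy as np

def hamming(Nw):
    """Hamming window w[n] = 0.54 - 0.46 cos(2*pi*n / (Nw - 1)), 0 <= n <= Nw - 1."""
    n = np.arange(Nw)
    return 0.54 - 0.46 * np.cos(2.0 * np.pi * n / (Nw - 1))

def spectrogram(x, Nw, hop, nfft):
    """S[frame, bin] = |X(w, tau)|^2 computed frame by frame with a Hamming window."""
    w = hamming(Nw)
    starts = range(0, len(x) - Nw + 1, hop)
    frames = np.array([x[s:s + Nw] * w for s in starts])
    return np.abs(np.fft.rfft(frames, n=nfft, axis=1)) ** 2

fs = 8000            # assumed sampling rate (Hz)
P = 40               # pitch period in samples, i.e. a 200 Hz pitch
x = np.zeros(800)
x[::P] = 1.0         # idealized periodic source: impulse train p[n]

S_nb = spectrogram(x, Nw=160, hop=8, nfft=320)   # 20 ms window: four pitch periods
S_wb = spectrogram(x, Nw=32, hop=8, nfft=320)    # 4 ms window: less than one pitch period

# Narrowband: bin 8 is the 200 Hz harmonic, bin 12 is midway to the next harmonic.
print("narrowband, harmonic vs valley:", S_nb[0, 8], S_nb[0, 12])

# Wideband: the energy under the window rises and falls once per pitch period.
frame_energy = S_wb.sum(axis=1)
print("wideband frame energy:", frame_energy[:10])
```

With the long window each frame shows strong peaks at the harmonics and deep valleys between them; with the short window each frame holds at most one impulse, so its spectrum is flat across frequency while the frame energy switches on and off with the pitch period.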
Figure: Wideband Spectrogram

Figure: Narrowband Spectrogram

Note that for voiced speech, the speech waveform was approximated as the output of a linear time-invariant system with impulse response $h[n]$ and with a glottal flow input given by the convolution of the glottal flow over one cycle, $g[n]$, with the impulse train $p[n] = \sum_{k=-\infty}^{\infty} \delta[n - kP]$:

$$x[n,\tau] = w[n,\tau]\,\{(p[n] * g[n]) * h[n]\}$$

$$x[n,\tau] = w[n,\tau]\,\{p[n] * \tilde{h}[n]\}$$

where the glottal waveform over a cycle and the vocal tract impulse response are combined as $\tilde{h}[n] = g[n] * h[n]$.

From the result of Example 3.2, the spectrogram of $x[n]$ can therefore be expressed as

$$S(\omega,\tau) = \frac{1}{P^2} \left| \sum_{k=-\infty}^{\infty} \tilde{H}(\omega_k)\, W(\omega - \omega_k, \tau) \right|^2$$

where $\tilde{H}(\omega) = H(\omega)G(\omega)$, $\omega_k = \frac{2\pi}{P}k$, and $\frac{2\pi}{P}$ is the fundamental frequency.

The difference between the narrowband and wideband spectrograms is in the length of the (analysis) window $w[n,\tau]$.

Narrowband spectrogram:
- Uses a "long" window with a duration of typically at least two pitch periods.
- Under the conditions that the main lobes of the shifted window Fourier transforms are non-overlapping and that the corresponding transform sidelobes are negligible, the following approximation follows from the previous equation (Exercise 3.8):

$$S(\omega,\tau) \approx \frac{1}{P^2} \sum_{k=-\infty}^{\infty} \big|\tilde{H}(\omega_k)\big|^2\, \big|W(\omega - \omega_k, \tau)\big|^2$$

- Harmonic lines are "resolved" as horizontal striations in the time-frequency plane of the spectrogram.
- The long window, which covers several pitch periods, smears closely spaced temporal events and thus gives poor time resolution (e.g., plosives that are closely spaced to a succeeding voiced sound are poorly represented).
Wideband spectrogram:
- The wideband spectrogram is defined by a short window with a duration of less than one pitch period (see Figure 3.14).
- Shortening the window widens its Fourier transform (recall the uncertainty principle).
- The widened window transforms at neighboring harmonics overlap and add, smearing the harmonic line structure and roughly tracing out the spectral envelope $|\tilde{H}(\omega)|$ due to the vocal tract and glottal flow contributions.
- From a temporal perspective, since the window length is less than a pitch period, the window "sees" essentially pieces of the periodically occurring sequence $\tilde{h}[n]$.

For a steady-state voiced sound, the wideband spectrogram can therefore be expressed roughly as (see Exercise 3.9):

$$S(\omega,\tau) \approx \beta\, \big|\tilde{H}(\omega)\big|^2\, E[\tau]$$

where $\beta$ is a constant scale factor and $E[\tau]$ is the energy in the waveform under the sliding window:

$$E[\tau] = \sum_{n=-\infty}^{\infty} \big|x[n,\tau]\big|^2$$

The wideband spectrogram:
- Shows the formants of the vocal tract in frequency.
- Gives vertical striations in time every pitch period, rather than the horizontal harmonic striations of the narrowband spectrogram.
- The vertical striations arise because the short window is sliding through fluctuating energy regions of the speech waveform.

Figure 3.15 compares the narrowband (20 ms Hamming window) and wideband (4 ms Hamming window) spectrograms.

Figure 3.15

Figure 3.16

Categorization of Speech Sounds

Classification of speech sounds can also be done from the following perspectives:

1. The nature of the source: periodic, noisy, impulsive, or a combination of the three. The sound source can be created with either the
vocal folds or a constriction in the vocal tract.
2. The shape of the vocal tract (place and manner of articulation): the place of the tongue hump along the oral tract and the degree of constriction of the hump. The shape is also determined by a possible connection to the nasal passage by way of the velum.
3. The time-domain waveform, which gives the pressure change with time at the lips output.
4. The time-varying spectral characteristics revealed through the spectrogram.

If the first two factors are used to study speech sounds, this is referred to as articulatory phonetics. If the last two descriptors are used, this is referred to as acoustic phonetics.

Elements of a Language

- A phoneme is a fundamental distinctive unit of a language. To emphasize the distinction between the concept of a phoneme and the sounds that convey a phoneme, speech scientists use the term phone to mean a particular instantiation of a phoneme.
- Different languages contain different phoneme sets.
- Syllables contain one or more phonemes.
- Words are formed from one or more syllables.
- Phrases are concatenations of words.

One broad classification for the English language is in terms of vowels, consonants, diphthongs, affricates, and semi-vowels. This classification is illustrated in Figure 3.17.

Figure 3.17

Phonemes arise from a combination of vocal fold and vocal tract articulatory features. Articulatory features (corresponding to the first two category descriptors) include:
- Vocal fold state: vibrating or open.
- Tongue position and height: front, central, or back along the palate.
- Constriction: partial or complete.
- Velum state: nasal sound or not a nasal sound.

In English the combinations of features are such as to give 40 phonemes.
Other languages can yield a smaller or larger number: 11 in Polynesian, 141 in the "click" language of Khoisan.

The rules of a language define which phones can be strung together and how they form words. Example: in Italian, consonants are not allowed at the end of words.

A phoneme is not strictly defined by the precise adjustment of the articulators (consider dialects and accents). The articulatory properties are influenced by:
- Adjacent phonemes,
- Speaking rate,
- Emphasis in speaking, and
- The time-varying nature of the articulators.

The variants of sounds, or phones, that convey the same phoneme are called the allophones of the phoneme. Example: "butter", "but", and "to", where /t/ in each word is somewhat different.

The motor theory of perception uses articulatory features from the speech waveform, together with its acoustic temporal and spectral features, to study the sounds in a language.

Elements of a Language: Vowels

- Source: quasi-periodic. Pitch is not important for categorizing a sound in English; however, in Mandarin Chinese some sounds are interpreted based on the pitch (tonal languages).
- System: each vowel phoneme corresponds to a different vocal tract configuration.
- Spectrogram: the particular shape of the vocal tract determines its resonances (concentrations of energy in the spectrogram).
- Waveform: certain vowel properties are also seen in the speech waveform within a pitch period (see Figure 3.19).

In spite of the specific properties of different vowels, there is much variability of vowel characteristics among speakers. Articulatory differences between speakers are one cause of allophonic variation: the place and degree of constriction of the tongue hump, and the cross-section and length of the vocal tract, and therefore the vocal tract formants, will vary with the speaker.
Figure 3.18

Figure 3.19

Elements of a Language: Nasals

- Source: quasi-periodic airflow puffs from the vibrating vocal folds.
- System: the velum is lowered and the air flows mainly through the nasal cavity. Because the oral tract is constricted, the sound is radiated at the nostrils. Nasal consonants are distinguished by the place along the oral tract at which the tongue makes a constriction (Figure 3.20).
- Spectrogram: dominated by the low resonance of the large volume of the nasal cavity. The closed oral cavity acts as a side branch with its own resonances, which change with the place of constriction of the tongue. These resonances absorb acoustic energy and thus are anti-resonances of the vocal tract. Anti-resonances of the oral tract tend to lie beyond the low resonances of the nasal tract. Consequently, nasals have very low energy in the high-frequency range.

Figure 3.20

Figure 3.21

Elements of a Language: Fricatives

There are two broad classes of fricatives: voiced and unvoiced.
- Source: the vocal folds are relaxed and not vibrating for unvoiced fricatives. For voiced fricatives, the vocal folds vibrate simultaneously with noise generation at the constriction. Noise is generated by turbulent airflow at some point of constriction along the oral tract.
- System: the constriction is narrower than with vowels. The location of the constriction determines which sound is produced: the back, center, or front of the oral tract, as well as the teeth or lips.
- Spectrogram: noise-like. Energy is concentrated in the higher frequencies.

Example 3.4

A voiced fricative is generated with both a periodic and a noise source.
The periodic glottal flow component can be expressed as

$$u[n] = g[n] * p[n]$$

where $g[n]$ is the glottal flow over one cycle and $p[n]$ is an impulse train with pitch period $P$.

In a simplified model of the voiced fricative, the periodic component of the output at the lips is

$$x_g[n] = h[n] * (g[n] * p[n])$$

where $h[n]$ is the impulse response of a linear time-invariant vocal tract driven by the periodic signal $u[n]$.

The noise source component is the turbulent airflow velocity source at the constriction, denoted by $q[n]$ (assumed to be white noise). The glottal flow $u[n]$ modulates this noise function $q[n]$, which in turn excites the front oral cavity with impulse response $h_f[n]$:

$$x_q[n] = h_f[n] * (q[n]\,u[n])$$

In the simplified model we assume that the results of the two airflow sources add:

$$x[n] = x_g[n] + x_q[n] = h[n] * u[n] + h_f[n] * (q[n]\,u[n])$$

See Exercise 3.10 for special characteristics of $x[n]$.

Issues that have been ignored:
- $u[n]$ is modified by the oral cavity.
- $x_q[n]$ can be influenced by the back cavity.
- Sources of nonlinear effects (distributed sources due to traveling vortices).

Fricatives, continued:
- Spectrogram: unvoiced fricatives are characterized by a "noisy" spectrum, while voiced fricatives often show both noise and harmonics.
- Waveform: an unvoiced fricative contains only noise; a voiced fricative contains noise superimposed on a quasi-periodic signal.

Whisper:
- Forms a class of its own under the general category of consonants.
- Turbulent flow is produced at the glottis rather than at a vocal tract constriction.

Figure 3.24: Fricatives

Figure 3.23

Elements of a Language: Plosives

Plosives form a class of sounds in which the constriction is complete, however brief, and is followed by a burst of flow. As with fricatives, plosives can be voiced or unvoiced.

System: the constriction can occur at the front, center, or back of the oral tract. The sequence of events in producing a plosive is as follows
(Figure 3.24):
1. Complete closure of the oral tract and buildup of air pressure.
2. Release of the air pressure and generation of turbulence over a very short time duration.
3. Generation of aspiration due to turbulence at the open vocal folds.
4. Onset of the following vowel about 40 to 50 ms after the burst.

With voiced plosives, the vocal folds vibrate for the duration of all four steps.
- During the period when the oral tract is closed, we hear a low-frequency vibration due to the propagation of the vocal fold vibrations through the walls of the throat. This activity is referred to as a "voice bar".
- Figure 3.26 compares a voiced/unvoiced plosive pair. For the voiced plosive, after the release of the burst there is little or no aspiration, unlike the unvoiced plosive.
- There is a much shorter delay between the burst and the voicing of the vowel onset.

Figure: Plosives, waveform

Figure: Plosives, spectrogram

Example 3.5: A time-varying system model for the voiced plosive.

A voiced plosive is generated with a burst source and can also have a periodic source present throughout the burst and into the following vowel. Assuming that the burst occurs at time $n = 0$, we idealize the burst source as an impulse $\delta[n]$. The glottal flow velocity model for the periodic source component is given by

$$u[n] = g[n] * p[n]$$

where $g[n]$ is the glottal flow over one cycle and $p[n]$ is an impulse train with pitch period $P$.

Assume that the vocal tract is linear but time-varying, due to the changing vocal tract shape during its transition from the burst to a following steady vowel. This implies that the vocal tract output cannot be obtained with the (time-invariant) convolution operator. The vocal tract output must therefore be computed using the time-varying impulse response concept introduced in Chapter 2.
In this simple model, the periodic glottal flow excites a time-varying vocal tract with impulse response denoted by h[n,m], while the burst excites a time-varying front cavity beyond the constriction, with impulse response denoted by hf[n,m].
h[n,m] and hf[n,m] represent time-varying impulse responses at time n due to a unit sample applied m samples earlier, at time n-m.
The output can then be written using a generalization of the convolution operator:
x[n] = \sum_{m=-\infty}^{\infty} h[n,m]\,u[n-m] + \sum_{m=-\infty}^{\infty} h_f[n,m]\,\delta[n-m]
We have assumed that the two outputs can be linearly combined.

Elements of a Language: Transitional Speech Sounds
Diphthongs:
Vowel-like in nature, with vibrating vocal folds.
Do not have a steady vocal tract configuration: they are produced by smoothly varying the vocal tract in time between two vowel configurations.
Characterized by movement from one vowel target to another.
Examples: /Y/ as in "hide", /W/ as in "out", /O/ as in "boy", /JU/ as in "new".

Elements of a Language: Transitional Speech Sounds
Semi-Vowels: two categories of vowel-like sounds:
1. Glides (/w/ as in "we" and /y/ as in "you"), and
2. Liquids (/r/ as in "read" and /l/ as in "let").
Glides, compared to diphthongs, have:
1. Greater constriction of the oral tract during the transition, and
2. Greater speed of the oral tract movement.

Figure 3.28 Liquids & Glides

Elements of a Language: Transitional Speech Sounds
Affricates: the counterpart of diphthongs, consisting of consonant plosive-fricative combinations.
Compared to fricatives, affricates have a fricative portion preceded by a complete constriction of the oral cavity, formed at the same place as for the plosive.
Examples:
/tS/ as in "chew" (unvoiced)
/J/ as in "just" (voiced)

Coarticulation
Coarticulation refers to the influence of the articulation of one sound on the articulation of another sound in the same utterance.
Vocal fold and vocal tract muscles are "programmed" to seek a target state or shape, but often the target is never reached:
1. Our speech anatomy cannot move to a desired position instantaneously, so past positions influence the present.
2. To make anatomical movement easy and graceful, the brain anticipates the future, so the articulators at any time instant are influenced both by where they have been and by where they are going.
Coarticulation can occur on different temporal levels:
Local: the articulation of a phoneme is influenced by its adjacent neighbors or by neighbors close in time.
Global: the articulators are influenced by phonemes that occur some time in the future, beyond the succeeding or nearby phonemes.
Examples: "horse" vs. "horseshoe"; "sweep" vs. "seep".

Prosody: The Melody of Speech
The prosody of a language is defined by the rules that govern changes in speech extending over more than one phoneme:
1. Intonation (change in pitch)
2. Amplitude/Energy (loudness)
3. Timing (articulation rate or rhythm)
These rules are followed to convey different meaning, stress, and emotion.

Figure 3.29 Prosody

Figure 3.30 Global Coarticulation
...