Ch4-Hidden_Markov_Models



Speech Recognition: Hidden Markov Models
Veton Kpuska, February 13, 2012

Outline
- Introduction
- Problem formulation
- Forward-Backward algorithm
- Viterbi search
- Baum-Welch parameter estimation
- Other considerations
  - Multiple observation sequences
  - Phone-based models for continuous speech recognition
  - Continuous density HMMs
  - Implementation issues

Information Theoretic Approach to ASR

[Figure: block diagram with the labels Speaker's Mind, W, Speaker, Speech Producer, Speech, Acoustic Processor, A, Linguistic Decoder, Acoustic Channel, and Speech Recognizer.]

Statistical formulation of speech recognition:
- A denotes the acoustic evidence (a collection of feature vectors, or data in general) on which the recognizer bases its decision about which words were spoken.
- W denotes a string of words, each belonging to a fixed and known vocabulary.

Assume that A is a sequence of symbols taken from some alphabet $\mathcal{A}$:

$A = a_1, a_2, \ldots, a_m, \quad a_i \in \mathcal{A}$

W denotes a string of n words, each belonging to a fixed and known vocabulary $\mathcal{V}$:

$W = w_1, w_2, \ldots, w_n, \quad w_i \in \mathcal{V}$

If $P(W|A)$ denotes the probability that the words W were spoken, given that the evidence A was observed, then the recognizer should decide in favor of a word string $\hat{W}$ satisfying:

$\hat{W} = \arg\max_{W} P(W|A)$

That is, the recognizer picks the most likely word string given the observed acoustic evidence.
From the well-known Bayes' rule of probability theory:

$P(W|A) = \dfrac{P(A|W)\,P(W)}{P(A)}$

- $P(W)$ is the probability that the word string W will be uttered.
- $P(A|W)$ is the probability that the acoustic evidence A will be observed when W is uttered.
- $P(A)$ is the average probability that A will be observed:

$P(A) = \sum_{W'} P(A|W')\,P(W')$

Equivalently, $P(W|A)\,P(A) = P(A|W)\,P(W)$.

The maximization in

$\hat{W} = \arg\max_{W} P(W|A)$

is carried out with the variable A fixed (there is no acoustic data other than what we are given), so $P(A)$ does not affect the result. It follows from Bayes' rule that the recognizer's aim is to find the word string that maximizes the product $P(A|W)P(W)$:

$\hat{W} = \arg\max_{W} P(A|W)\,P(W)$

Hidden Markov Models

About Markov chains: let $X_1, X_2, \ldots, X_n, \ldots$ be a sequence of random variables taking their values in the same finite alphabet $\mathcal{X} = \{1, 2, 3, \ldots, c\}$. If nothing more is said, then Bayes' formula applies:

$P(X_1, X_2, \ldots, X_n) = \prod_{i=1}^{n} P(X_i \mid X_1, X_2, \ldots, X_{i-1})$

The random variables are said to form a Markov chain, however, if

$P(X_i \mid X_1, X_2, \ldots, X_{i-1}) = P(X_i \mid X_{i-1}) \quad \text{for all } i$

Thus for Markov chains the following holds:

$P(X_1, X_2, \ldots, X_n) = \prod_{i=1}^{n} P(X_i \mid X_{i-1})$

Markov Chains

The Markov chain is time invariant, or homogeneous, if regardless of the value of the time index i,

$P(X_i = x' \mid X_{i-1} = x) = p(x'|x) \quad \text{for all } x, x' \in \mathcal{X}$

$p(x'|x)$ is referred to as the transition function and can be represented as a c x c matrix. One can think of the values of $X_i$ as states, and thus of the Markov chain as a finite-state process with transitions between states specified by the function $p(x'|x)$. The transition function satisfies the usual conditions:
$\sum_{x'} p(x'|x) = 1 \ \text{for all } x; \qquad p(x'|x) \ge 0 \ \text{for all } x, x'$

Markov Chains

If the alphabet is not too large, the chain can be completely specified by an intuitively appealing diagram:

[Figure: three-state transition diagram with arcs labeled p(1|1), p(2|1), p(3|1), p(3|2), p(1|3), p(2|3).]

- Arrows with attached transition probability values mark the transitions between states.
- Missing transitions imply zero transition probability: p(1|2) = p(2|2) = p(3|3) = 0.

Markov chains are capable of modeling processes of arbitrary complexity even though they are restricted to one-step memory. Consider a process $Z_1, Z_2, \ldots, Z_n, \ldots$ of memory length k:

$P(Z_1, Z_2, \ldots, Z_n) = \prod_{i=1}^{n} P(Z_i \mid Z_{i-k}, Z_{i-k+1}, \ldots, Z_{i-1})$

If we define new random variables

$X_i = (Z_{i-k+1}, Z_{i-k+2}, \ldots, Z_i)$

then the Z sequence specifies the X sequence (and vice versa), and the X process is a Markov chain as defined earlier.

Hidden Markov Model Concept

Hidden Markov Models allow more freedom to the random process while avoiding substantial complication of the basic structure of Markov chains. This freedom is gained by letting the states of the chain generate observable data while hiding the state sequence itself from the observer.

We focus on three fundamental problems of HMM design:
1. The evaluation of the probability (likelihood) of a sequence of observations given a specific HMM;
2. The determination of a best sequence of model states;
3. The adjustment of model parameters so as to best account for the observed signal.

Discrete-Time Markov Processes: Examples

Define:
- A system with N distinct states, S = {1, 2, ..., N}.
- Time instances associated with state changes as t = 1, 2, ...
- The actual state at time t as $s_t$.
- State-transition probabilities as $a_{ij} = P(s_t = j \mid s_{t-1} = i)$, $1 \le i, j \le N$.

State-transition probability properties:

$a_{ij} \ge 0 \ \text{for all } i, j; \qquad \sum_{j=1}^{N} a_{ij} = 1 \ \text{for all } i$

Consider a simple three-state Markov model of the weather:
- State 1: Precipitation (rain or snow)
- State 2: Cloudy
- State 3: Sunny

[Figure: three-state weather model; the arc probabilities are the entries of the matrix A below.]

Matrix of state-transition probabilities:

$A = \{a_{ij}\} = \begin{bmatrix} 0.4 & 0.3 & 0.3 \\ 0.2 & 0.6 & 0.2 \\ 0.1 & 0.1 & 0.8 \end{bmatrix}$

Given this model we can now ask (and answer) several interesting questions about weather patterns over time.

Problem 1: What is the probability (according to the model) that the weather for eight consecutive days is "sun-sun-sun-rain-rain-sun-cloudy-sun"?

Solution: Define the observation sequence, O, as:

Day:  1      2      3      4     5     6      7       8
O = ( sunny, sunny, sunny, rain, rain, sunny, cloudy, sunny )
O = ( 3,     3,     3,     1,    1,    3,     2,      3 )

We want to calculate P(O|Model), the probability of the observation sequence O given the model above. Given that

$P(s_1, s_2, \ldots, s_k) = \prod_{i=1}^{k} p(s_i \mid s_{i-1})$

(where the factor for i = 1 is read as the initial-state probability $P(s_1)$), we get:

$P(O|\text{Model}) = P(3,3,3,1,1,3,2,3 \mid \text{Model})$
$= P(3)\,P(3|3)^2\,P(1|3)\,P(1|1)\,P(3|1)\,P(2|3)\,P(3|2)$
$= \pi_3\,(a_{33})^2\,a_{31}\,a_{11}\,a_{13}\,a_{32}\,a_{23}$
$= (1.0)(0.8)^2(0.1)(0.4)(0.3)(0.1)(0.2)$
$= 1.536 \times 10^{-4}$

Above, the following notation was used for the initial-state probabilities:

$\pi_i = P(s_1 = i), \quad 1 \le i \le N$

Problem 2: Given that the system is in a known state, what is the probability (according to the model) that it stays in that state for exactly d consecutive days?
Solution:

Day:  1  2  3  ...  d  d+1
O = ( i, i, i, ..., i, j ≠ i )

$P(O \mid \text{Model}, s_1 = i) = \dfrac{P(O, s_1 = i \mid \text{Model})}{P(s_1 = i)} = \dfrac{\pi_i\,(a_{ii})^{d-1}(1 - a_{ii})}{\pi_i} = (a_{ii})^{d-1}(1 - a_{ii}) = p_i(d)$

The quantity $p_i(d)$ is the probability distribution function of duration d in state i. This exponential (geometric) distribution is characteristic of the state duration in Markov chains.

The expected number of observations (duration) in a state, conditioned on starting in that state, can be computed as

$\bar{d}_i = \sum_{d=1}^{\infty} d\,p_i(d) = \sum_{d=1}^{\infty} d\,(a_{ii})^{d-1}(1 - a_{ii}) = \dfrac{1}{1 - a_{ii}}$

where we have used the formula

$\sum_{k=0}^{\infty} k\,b^k = \dfrac{b}{(1-b)^2}$

Thus, according to the model, the expected number of consecutive days of
- sunny weather is 1/0.2 = 5,
- cloudy weather is 1/0.4 = 2.5,
- rainy weather is 1/0.6 = 1.67.

Exercise problem: derive the above formula, or directly the mean of $p_i(d)$. Hint: $\dfrac{\partial}{\partial x}\left(x^k\right) = k\,x^{k-1}$.

Extensions to Hidden Markov Model

The examples so far considered only Markov models in which each state corresponded to a deterministically observable event. This model is too restrictive to be applicable to many problems of interest. The obvious extension is to let the observation probabilities be a function of the state. The resulting model is a doubly embedded stochastic process, with an underlying stochastic process that is not directly observable (it is hidden) but can be observed only through another set of stochastic processes that produce the sequence of observations.

Illustration of Basic Concept of HMM

Exercise 1. Given a single fair coin, i.e., P(Heads) = P(Tails) = 0.5, which you toss once and observe Tails:
1. What is the probability that the next 10 tosses will produce the sequence (H H T H T T H T T H)?
2. What is the probability that the next 10 tosses will produce the sequence (H H H H H H H H H H)?
3. What is the probability that 5 out of the next 10 tosses will be tails? What is the expected number of tails over the next 10 tosses?
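As a numerical check on the weather-model results earlier in this section (the Problem 1 sequence probability and the expected state durations), the calculations can be sketched in a few lines of Python. This is only an illustrative sketch: the function names `sequence_probability` and `expected_duration` are made up for this example and do not come from the slides, and the initial probability is taken as 1.0 because the first day is given as sunny.

```python
# Transition matrix of the weather example (rows: from-state, columns: to-state).
# States use the 1-based labels of the text: 1 = precipitation, 2 = cloudy, 3 = sunny.
A = [
    [0.4, 0.3, 0.3],
    [0.2, 0.6, 0.2],
    [0.1, 0.1, 0.8],
]


def sequence_probability(states, A, initial_prob=1.0):
    """P(s_1, ..., s_k) for an observable Markov chain.

    `initial_prob` plays the role of pi_{s_1}; it is assumed to be 1.0 here
    because the first state is taken as known.
    """
    prob = initial_prob
    for prev, cur in zip(states, states[1:]):
        prob *= A[prev - 1][cur - 1]  # a_{prev,cur}, converting to 0-based indices
    return prob


def expected_duration(state, A):
    """Mean of p_i(d) = a_ii^(d-1) * (1 - a_ii), which equals 1 / (1 - a_ii)."""
    return 1.0 / (1.0 - A[state - 1][state - 1])


problem1 = sequence_probability([3, 3, 3, 1, 1, 3, 2, 3], A)
durations = {s: expected_duration(s, A) for s in (1, 2, 3)}
print(problem1)   # close to 1.536e-4
print(durations)  # close to 1.67 (rain), 2.5 (cloudy), 5.0 (sunny)
```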
Solution 1.

1. For a fair coin with independent tosses, the probability of any specific observation sequence of length 10 (10 tosses) is $(1/2)^{10}$, since there are $2^{10}$ such sequences and all are equally probable. Thus:

$P(H\,H\,T\,H\,T\,T\,H\,T\,T\,H) = \left(\tfrac{1}{2}\right)^{10}$

2. Using the same argument:

$P(H\,H\,H\,H\,H\,H\,H\,H\,H\,H) = \left(\tfrac{1}{2}\right)^{10}$

3. The probability of 5 tails in the next 10 tosses is the number of observation sequences with 5 tails and 5 heads (in any order) times the probability of each such sequence:

$P(5H, 5T) = \binom{10}{5}\left(\tfrac{1}{2}\right)^{10} = \dfrac{252}{1024} \approx 0.25$

The expected number of tails in 10 tosses is:

$E(\text{T in 10 tosses}) = \sum_{d=0}^{10} d\,\binom{10}{d}\left(\tfrac{1}{2}\right)^{10} = 5$

Thus, on average, there will be 5 heads and 5 tails in 10 tosses, but the probability of exactly 5 heads and 5 tails is only about 0.25.

Coin-Toss Models

Assume the following scenario: you are in a room with a barrier (e.g., a curtain) through which you cannot see what is happening. On the other side of the barrier is another person who is performing a coin-tossing experiment (using one or more coins). The person behind the curtain will not tell you which coin he selects at any time; he will only tell you the result of each coin flip. Thus a sequence of hidden coin-tossing experiments is performed, with the observation sequence consisting of a series of heads and tails.

A typical observation sequence could be:

$O = (o_1\,o_2\,o_3 \ldots o_T) = (H\,H\,T\,T\,T\,H\,T\,T\,H \ldots H)$

Given the above scenario, the question is: how do we build an HMM to explain (model) the observed sequence of heads and tails?
- The first problem we face is deciding what the states in the model correspond to.
- The second is deciding how many states should be in the model.
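The figures quoted in Solution 1 above can be reproduced directly. The following is a small sketch using only the Python standard library; the variable names are illustrative and not taken from the slides.

```python
from math import comb

n = 10        # number of tosses
p_head = 0.5  # fair coin, as stated in Exercise 1

# Items 1 and 2: any specific sequence of 10 independent fair tosses.
p_specific = p_head ** n

# Item 3, first part: exactly 5 tails among 10 tosses (binomial probability).
p_five_tails = comb(n, 5) * p_head ** n

# Item 3, second part: expected number of tails, summed from the definition.
expected_tails = sum(d * comb(n, d) * p_head ** n for d in range(n + 1))

print(p_specific)      # 1/1024
print(p_five_tails)    # 252/1024, roughly 0.25
print(expected_tails)  # 5.0
```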
One possible choice would be to assume that only a single biased coin is being tossed. In this case, we could model the situation with a two-state model in which each state corresponds to the outcome of the previous toss (i.e., heads or tails).

[Figure: 1-coin model (observable Markov model). State 1 = HEADS, state 2 = TAILS; transitions into state 1 have probability P(H), and transitions into state 2 have probability P(T) = 1 - P(H).]

O = H H T T H T H H T T H ...
S = 1 1 2 2 1 2 1 1 2 2 1 ...

A second HMM for explaining the observed sequence of coin-toss outcomes is the two-coin model. In this case:
- There are two states in the model.
- Each state corresponds to a different, biased coin being tossed.
- Each state is characterized by a probability distribution of heads and tails.
- Transitions between states are characterized by a state-transition matrix.

The physical mechanism that accounts for how state transitions are selected could itself be a set of independent coin tosses or some other probabilistic event.

[Figure: 2-coins model (hidden Markov model). States 1 and 2 with self-loops a11 and a22 and cross transitions 1 - a11 and 1 - a22.]

O = H H T T H T H H T T H ...
S = 2 1 1 2 2 2 1 2 2 1 2 ...

State 1: P(H) = P1, P(T) = 1 - P1
State 2: P(H) = P2, P(T) = 1 - P2

A third form of HMM for explaining the observed sequence of coin-toss outcomes is the three-coin model. In this case:
- There are three states in the model.
- Each state corresponds to using one of the three biased coins.
- Selection is based on some probabilistic event.

[Figure: 3-coins model (hidden Markov model) with states 1, 2, 3; arcs include a11, a12, a22.]

O = H H T T H T H H T T H ...
S = 3 1 2 3 3 1 1 2 3 1 3 ...
(3-coins model, continued) Arcs a21, a31, a13, a32, a23, a33 complete the diagram. State emission probabilities:

State   1        2        3
P(H)    P1       P2       P3
P(T)    1 - P1   1 - P2   1 - P3

Given the choice among the three models shown for explaining the observed sequence of heads and tails, a natural question is which model best matches the actual observations. It should be clear that:
- the simple one-coin model has only one unknown parameter,
- the two-coin model has four unknown parameters, and
- the three-coin model has nine unknown parameters.

An HMM with a larger number of parameters inherently has more degrees of freedom and is thus potentially more capable of modeling a series of coin-tossing experiments than HMMs with fewer parameters. Although this is theoretically true, practical considerations impose strong limitations on the size of models that we can consider.

Another fundamental question is whether the observed head-tail sequence is long and rich enough to specify a complex model. Also, it might just be the case that only a single coin is being tossed. In such a case it would be inappropriate to use the three-coin model, because we would be using an underspecified system.

The Urn-and-Ball Model

To extend the ideas of the HMM to a somewhat more complicated situation, consider the urn-and-ball system depicted in the figure.
- Assume that there are N (large) brass urns in a room.
- Assume that there are M distinct colors.
- Within each urn there is a large quantity of colored marbles.

A physical process for obtaining observations, which generates a finite observation sequence of colors that we would like to model as the observable output of an HMM, is as follows:
- A genie is in the room and, according to some random procedure, chooses an initial urn.
- From this urn, a ball is chosen at random, and its color is recorded as the observation.
- The ball is then replaced in the urn from which it was selected.
- A new urn is then selected according to the random selection procedure associated with the current urn.
- The ball selection process is repeated.

The simplest HMM that corresponds to the urn-and-ball process is one in which:
- each state corresponds to a specific urn, and
- a (marble) color probability is defined for each state.

The choice of state is dictated by the state-transition matrix of the HMM.

It should be noted that the colors of the marbles in each urn may be the same, and the distinction among the various urns lies in the way the collection of colored marbles is composed. Therefore, an isolated observation of a ball of a particular color does not immediately tell which urn it was drawn from.

[Figure: an N-state urn-and-ball model illustrating the general case of a discrete-symbol HMM.]

O = {GREEN, GREEN, BLUE, RED, YELLOW, ..., BLUE}

URN 1: P(RED) = b_1(1), P(BLUE) = b_1(2), P(GREEN) = b_1(3), P(YELLOW) = b_1(4), ..., P(ORANGE) = b_1(M)
URN 2: P(RED) = b_2(1), P(BLUE) = b_2(2), P(GREEN) = b_2(3), P(YELLOW) = b_2(4), ..., P(ORANGE) = b_2(M)
...
URN N: P(RED) = b_N(1), P(BLUE) = b_N(2), P(GREEN) = b_N(3), P(YELLOW) = b_N(4), ...,
P(ORANGE) = b_N(M)

Elements of a Discrete HMM

- N: number of states in the model
  - states, $s = \{s_1, s_2, \ldots, s_N\}$
  - state at time t, $q_t \in s$
- M: number of (distinct) observation symbols (i.e., discrete observations) per state
  - observation symbols, $v = \{v_1, v_2, \ldots, v_M\}$
  - observation at time t, $o_t \in v$
- $A = \{a_{ij}\}$: state-transition probability distribution
  - $a_{ij} = P(q_{t+1} = s_j \mid q_t = s_i), \quad 1 \le i, j \le N$
- $B = \{b_j(k)\}$: observation symbol probability distribution in state j
  - $b_j(k) = P(v_k \text{ at } t \mid q_t = s_j), \quad 1 \le j \le N,\ 1 \le k \le M$
- $\pi = \{\pi_i\}$: initial state distribution
  - $\pi_i = P(q_1 = s_i), \quad 1 \le i \le N$

An HMM is typically written as $\lambda = \{A, B, \pi\}$. This notation also defines (includes) the probability measure for O, i.e., $P(O|\lambda)$.

HMM: An Example

For our simple example: [the example's content is not present in the extracted text]

HMM Generator of Observations

Given appropriate values of N, M, A, B, and $\pi$, the HMM can be used as a generator to give an observation sequence

$O = (o_1\,o_2\,o_3 \ldots o_T)$

where each observation $o_t$ is one of the symbols from V, and T is the number of observations in the sequence.

The algorithm:
1. Choose an initial state $q_1 = s_i$ according to the initial state distribution $\pi$, and set t = 1.
2. Choose $o_t = v_k$ according to the symbol probability distribution in state $s_i$, i.e., $b_i(k)$. Then transit to a new state $q_{t+1} = s_j$ according to the state-transition probability distribution for state $s_i$, i.e., $a_{ij}$.
3. Increment t, t = t + 1; return to step 2 if t ≤ T; otherwise, terminate the procedure.

Three Basic HMM Problems

1. Scoring: Given an observation sequence $O = \{o_1, o_2, \ldots, o_T\}$ and a model $\lambda = \{A, B, \pi\}$, how do we compute $P(O|\lambda)$, the probability of the observation sequence?
2. Matching: Given an observation sequence $O = \{o_1, o_2, \ldots, o_T\}$, how do we choose a state sequence $Q = \{q_1, q_2, \ldots, q_T\}$ which is optimum in some sense?
3. Training: How do we adjust the model parameters $\lambda = \{A, B, \pi\}$ to maximize $P(O|\lambda)$?
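The generator procedure described above (choose an initial state, then alternately emit a symbol and move to a new state) can be sketched as follows. This is a minimal illustration rather than a reference implementation: the function name `generate`, the 0-based state and symbol indices, and the two-coin parameter values in the usage example are all assumptions made for this sketch.

```python
import random


def generate(A, B, pi, T, rng=None):
    """Sample a state sequence and an observation sequence of length T.

    A[i][j] = a_ij, B[i][k] = b_i(k), pi[i] = initial probability of state i.
    States and symbols are 0-based indices here (an assumption of this sketch).
    """
    rng = rng or random.Random()
    n_states, n_symbols = len(pi), len(B[0])
    states, observations = [], []
    # Step 1: choose the initial state according to pi.
    state = rng.choices(range(n_states), weights=pi)[0]
    for _ in range(T):
        states.append(state)
        # Step 2a: emit a symbol according to b_state(k).
        observations.append(rng.choices(range(n_symbols), weights=B[state])[0])
        # Step 2b: move to the next state according to row `state` of A.
        state = rng.choices(range(n_states), weights=A[state])[0]
    return states, observations


# Hypothetical two-coin model: symbol 0 = heads, symbol 1 = tails.
A_demo = [[0.8, 0.2], [0.3, 0.7]]
B_demo = [[0.9, 0.1], [0.4, 0.6]]
pi_demo = [0.5, 0.5]
demo_states, demo_obs = generate(A_demo, B_demo, pi_demo, T=11, rng=random.Random(0))
print("".join("HT"[o] for o in demo_obs))
```

Only the observation string would be visible to an observer; `demo_states` corresponds to the hidden part of the model.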
The corresponding solutions are:
1. Scoring: the probability evaluation (forward and backward procedures).
2. Matching: the Viterbi algorithm.
3. Training: the Baum-Welch re-estimation.

Problem 1, Scoring: This is the evaluation problem; namely, given a model and a sequence of observations, how do we compute the probability that the observed sequence was produced by the model? It can also be viewed as the problem of scoring how well a given model matches a given observation sequence. The latter viewpoint is extremely useful in cases in which we are trying to choose among several competing models: the solution to Problem 1 allows us to choose the model that best matches the observations.

Problem 2, Matching: This is the problem in which we attempt to uncover the hidden part of the model, that is, to find the "correct" state sequence. It must be noted that for all but the case of degenerate models, there is no "correct" state sequence to be found. Hence, in practice one can only find an optimal state sequence based on a chosen optimality criterion. Several reasonable optimality criteria can be imposed, and thus the choice of criterion is a strong function of the intended use. Typical uses are:
- Learn about the structure of the model.
- Find optimal state sequences for continuous speech recognition.
- Get average statistics of individual states, etc.

Problem 3, Training: This problem attempts to optimize the model parameters to best describe how a given observation sequence comes about. The observation sequence used to adjust the model parameters is called a training sequence because it is used to "train" the HMM. The training problem is the crucial one, since it allows us to optimally adapt model parameters to observed training data and so create the best HMMs for real phenomena.

Simple Isolated-Word Speech Recognition

For each word of a W-word vocabulary, design a separate N-state HMM.
The speech signal of a given word is represented as a time sequence of coded spectral vectors (How?). There are M unique spectral vectors; hence each observation is the index of the spectral vector closest (in some spectral distortion sense) to the original speech signal. For each vocabulary word, we have a training sequence consisting of a number of repetitions of sequences of codebook indices of the word (by one or more speakers).

The first task is to build individual word models:
- Use the solution to Problem 3 to optimally estimate model parameters for each word model.

To develop an understanding of the physical meaning of the model states:
- Use the solution to Problem 2 to segment each word's training sequences into states.
- Study the properties of the spectral vectors that led to the observations occurring in each state.

The goal is to make refinements of the model:
- Improve and optimize the model (more states, different codebook size, etc.).

Once the set of W HMMs has been designed and optimized, recognition of an unknown word is performed using the solution to Problem 1 to score each word model on the given test observation sequence, and selecting the word whose model score is highest (i.e., the highest likelihood).

Computation of P(O|λ)

Solution to Problem 1: We wish to calculate the probability of the observation sequence $O = \{o_1, o_2, \ldots, o_T\}$ given the model $\lambda$. The most straightforward way is through enumeration of every possible state sequence of length T (the number of observations).
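To make the enumeration idea concrete, here is a deliberately naive sketch that visits every state sequence of length T and accumulates the joint probability of the observations and that sequence (initial probability, then alternating emission and transition probabilities). The function name `brute_force_likelihood`, the 0-based indexing, and the small model in the usage example are assumptions for illustration only; the approach is usable only for very small N and T.

```python
from itertools import product


def brute_force_likelihood(A, B, pi, obs):
    """P(O | lambda) by summing P(O, Q | lambda) over all N**T state sequences Q."""
    n_states = len(pi)
    total = 0.0
    for path in product(range(n_states), repeat=len(obs)):
        # Initial state probability times the first emission probability.
        p = pi[path[0]] * B[path[0]][obs[0]]
        # One transition probability and one emission probability per later step.
        for t in range(1, len(obs)):
            p *= A[path[t - 1]][path[t]] * B[path[t]][obs[t]]
        total += p
    return total


# Hypothetical two-state, two-symbol model.
A_demo = [[0.7, 0.3], [0.4, 0.6]]
B_demo = [[0.9, 0.1], [0.2, 0.8]]
pi_demo = [0.6, 0.4]
print(brute_force_likelihood(A_demo, B_demo, pi_demo, [0, 1, 0]))
```

A useful sanity check is that the likelihoods of all possible observation sequences of a fixed length add up to 1.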
Thus there are $N^T$ such state sequences:

$P(O|\lambda) = \sum_{\text{all } Q} P(O, Q|\lambda)$

where

$P(O, Q|\lambda) = P(O|Q, \lambda)\,P(Q|\lambda)$

Consider the fixed state sequence $Q = q_1 q_2 \ldots q_T$. The probability of the observation sequence O given the state sequence, assuming statistical independence of observations, is:

$P(O|Q, \lambda) = \prod_{t=1}^{T} P(o_t \mid q_t, \lambda)$

Thus:

$P(O|Q, \lambda) = b_{q_1}(o_1)\,b_{q_2}(o_2) \cdots b_{q_T}(o_T)$

The probability of such a state sequence Q can be written as:

$P(Q|\lambda) = \pi_{q_1}\,a_{q_1 q_2}\,a_{q_2 q_3} \cdots a_{q_{T-1} q_T}$

The joint probability of O and Q, i.e., the probability that O and Q occur simultaneously, is simply the product of the previous terms:

$P(O, Q|\lambda) = P(O|Q, \lambda)\,P(Q|\lambda)$

The probability of O given the model $\lambda$ is obtained by summing this joint probability over all possible state sequences Q:

$P(O|\lambda) = \sum_{Q} P(O|Q, \lambda)\,P(Q|\lambda) = \sum_{q_1, q_2, \ldots, q_T} \pi_{q_1} b_{q_1}(o_1)\,a_{q_1 q_2} b_{q_2}(o_2) \cdots a_{q_{T-1} q_T} b_{q_T}(o_T)$

Interpretation of the previous expression:
- Initially, at time t = 1, we are in state $q_1$ with probability $\pi_{q_1}$, and generate the symbol $o_1$ (in this state) with probability $b_{q_1}(o_1)$.
- At the next time instance, t = t + 1 (t = 2), a transition is made to state $q_2$ from state $q_1$ with probability $a_{q_1 q_2}$, and the symbol $o_2$ is generated with probability $b_{q_2}(o_2)$.
- The process is repeated until the last transition is made at time T to state $q_T$ from state $q_{T-1}$ with probability $a_{q_{T-1} q_T}$, and the symbol $o_T$ is generated with probability $b_{q_T}(o_T)$.

Practical problem:
- The calculation requires about $2T \cdot N^T$ operations (there are $N^T$ such sequences).
- For example, with N = 5 (states) and T = 100 (observations): $2 \cdot 100 \cdot 5^{100} \approx 10^{72}$ computations!
- A more efficient procedure is required: the forward algorithm.

The Forward Algorithm

Let us define the forward variable, $\alpha_t(i)$, as the probability of the partial observation sequence up to time t and state $s_i$ at time t, given the model, i.e.
$\alpha_t(i) = P(o_1 o_2 \ldots o_t,\ q_t = s_i \mid \lambda)$

It can easily be shown that:

$\alpha_1(i) = \pi_i\,b_i(o_1), \quad 1 \le i \le N$

$P(O|\lambda) = \sum_{i=1}^{N} \alpha_T(i)$

Thus the algorithm:

1. Initialization:

$\alpha_1(i) = \pi_i\,b_i(o_1), \quad 1 \le i \le N$

2. Induction:

$\alpha_{t+1}(j) = \left[\sum_{i=1}^{N} \alpha_t(i)\,a_{ij}\right] b_j(o_{t+1}), \quad 1 \le t \le T-1,\ 1 \le j \le N$

3. Termination:

$P(O|\lambda) = \sum_{i=1}^{N} \alpha_T(i)$

[Figure: trellis step of the forward algorithm. States s1, s2, s3, ..., sN at time t, carrying α_t(i), connect through arcs a1j, a2j, a3j, ..., aNj to state sj at time t+1, carrying α_{t+1}(j).]

[Figure: "The Forward Algorithm" (slide content not present in the extracted text).]

The Backward Algorithm

Similarly, let us define the backward variable, $\beta_t(i)$, as the probability of the partial observation sequence from time t+1 to the end, given state $s_i$ at time t and the model, i.e.

$\beta_t(i) = P(o_{t+1} o_{t+2} \ldots o_T \mid q_t = s_i, \lambda)$

It can easily be shown that:

$\beta_T(i) = 1, \quad 1 \le i \le N$

$P(O|\lambda) = \sum_{i=1}^{N} \pi_i\,b_i(o_1)\,\beta_1(i)$

By induction the following algorithm is obtained:

1. Initialization:

$\beta_T(i) = 1, \quad 1 \le i \le N$

2. Induction:

$\beta_t(i) = \sum_{j=1}^{N} a_{ij}\,b_j(o_{t+1})\,\beta_{t+1}(j), \quad t = T-1, T-2, \ldots, 1,\ 1 \le i \le N$

3. Termination:

$P(O|\lambda) = \sum_{i=1}^{N} \pi_i\,b_i(o_1)\,\beta_1(i)$

[Figure: trellis step of the backward algorithm. State si at time t, carrying β_t(i), connects through arcs ai1, ai2, ai3, ..., aiN to states s1, s2, s3, ..., sN at time t+1, carrying β_{t+1}(j).]

[Figure: "The Backward Algorithm" (slide content not present in the extracted text).]

Finding Optimal State Sequences

One criterion chooses the states $q_t$ that are individually most likely. This maximizes the expected number of correct states. Let us define $\gamma_t(i)$ as the probability of being in state $s_i$ at time t, given the observation sequence and the model, i.e.
$\gamma_t(i) = P(q_t = s_i \mid O, \lambda), \qquad \sum_{i=1}^{N} \gamma_t(i) = 1 \ \text{for all } t$

Then the individually most likely state, $q_t$, at time t is:

$q_t = \arg\max_{1 \le i \le N} \gamma_t(i), \quad 1 \le t \le T$

Note that it can be shown that:

$\gamma_t(i) = \dfrac{\alpha_t(i)\,\beta_t(i)}{P(O|\lambda)} = \dfrac{\alpha_t(i)\,\beta_t(i)}{\sum_{i=1}^{N} \alpha_t(i)\,\beta_t(i)}$

The individual optimality criterion has the problem that the optimum state sequence may not obey state-transition constraints. Another optimality criterion is to choose the state sequence which maximizes $P(Q, O|\lambda)$; this can be found by the Viterbi algorithm.

The Viterbi Algorithm

Let us define $\delta_t(i)$ as the highest probability along a single path, at time t, which accounts for the first t observations and ends in state $s_i$, i.e.

$\delta_t(i) = \max_{q_1, q_2, \ldots, q_{t-1}} P(q_1, q_2, \ldots, q_{t-1}, q_t = s_i,\ o_1, o_2, \ldots, o_t \mid \lambda)$

By induction:

$\delta_{t+1}(j) = \left[\max_{i} \delta_t(i)\,a_{ij}\right] b_j(o_{t+1})$

To retrieve the state sequence, we must keep track of the state which gave the best path, at time t, to state $s_i$. We do this in a separate array $\psi_t(i)$.

1. Initialization:

$\delta_1(i) = \pi_i\,b_i(o_1), \qquad \psi_1(i) = 0, \quad 1 \le i \le N$

2. Recursion:

$\delta_t(j) = \max_{1 \le i \le N}\left[\delta_{t-1}(i)\,a_{ij}\right] b_j(o_t), \quad 2 \le t \le T,\ 1 \le j \le N$

$\psi_t(j) = \arg\max_{1 \le i \le N}\left[\delta_{t-1}(i)\,a_{ij}\right], \quad 2 \le t \le T,\ 1 \le j \le N$

3. Termination:

$p^* = \max_{1 \le i \le N}\left[\delta_T(i)\right], \qquad q_T^* = \arg\max_{1 \le i \le N}\left[\delta_T(i)\right]$

The Viterbi Algorithm (continued)

4.
Path (state-sequence) backtracking:

$q_t^* = \psi_{t+1}(q_{t+1}^*), \quad t = T-1, T-2, \ldots, 1$

Computation order: $N^2 T$.

The Viterbi Algorithm: Example

[Figure: trellis for a Viterbi example; arc labels include the products 0.5*0.8, 0.3*0.7, 0.2*1, and 0.4*0.5.]

[Figure: "The Viterbi Algorithm: An Example (cont'd)" (slide content not present in the extracted text).]

[Figure: "Matching Using Forward-Backward Algorithm" (slide content not present in the extracted text).]

Baum-Welch Reestimation

Solution to Problem 3: Baum-Welch re-estimation uses EM to determine ML parameters. Define $\xi_t(i, j)$ as the probability of being in state $s_i$ at time t and state $s_j$ at time t+1, given the model and the observation sequence:

$\xi_t(i, j) = P(q_t = s_i,\ q_{t+1} = s_j \mid O, \lambda)$

Then, from the definitions of the forward and backward variables, we can write $\xi_t(i, j)$ in the form:

$\xi_t(i, j) = \dfrac{P(q_t = s_i,\ q_{t+1} = s_j,\ O \mid \lambda)}{P(O|\lambda)} = \dfrac{\alpha_t(i)\,a_{ij}\,b_j(o_{t+1})\,\beta_{t+1}(j)}{P(O|\lambda)} = \dfrac{\alpha_t(i)\,a_{ij}\,b_j(o_{t+1})\,\beta_{t+1}(j)}{\sum_{i=1}^{N}\sum_{j=1}^{N} \alpha_t(i)\,a_{ij}\,b_j(o_{t+1})\,\beta_{t+1}(j)}$

Since we have defined $\gamma_t(i)$ as the probability of being in state $s_i$ at time t, we can relate $\gamma_t(i)$ to $\xi_t(i, j)$ by summing over j:

$\gamma_t(i) = \sum_{j=1}^{N} \xi_t(i, j)$

Summing $\gamma_t(i)$ and $\xi_t(i, j)$ over t, we get:

$\sum_{t=1}^{T-1} \gamma_t(i)$ = expected number of transitions from state $s_i$

$\sum_{t=1}^{T-1} \xi_t(i, j)$ = expected number of transitions from state $s_i$ to state $s_j$

[Figure: "Baum-Welch Reestimation Procedures" (slide content not present in the extracted text).]

Baum-Welch Reestimation Formulas

$\bar{\pi}_i$ = expected number of times in state $s_i$ at t = 1 $= \gamma_1(i)$

$\bar{a}_{ij} = \dfrac{\text{expected number of transitions from state } s_i \text{ to state } s_j}{\text{expected number of transitions from state } s_i} = \dfrac{\sum_{t=1}^{T-1} \xi_t(i, j)}{\sum_{t=1}^{T-1} \gamma_t(i)}$

$\bar{b}_j(k) = \dfrac{\text{expected number of times in state } s_j \text{ and observing symbol } v_k}{\text{expected number of times in state } s_j} = \dfrac{\sum_{t=1,\ o_t = v_k}^{T} \gamma_t(j)}{\sum_{t=1}^{T} \gamma_t(j)}$
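The re-estimation formulas can be sketched in code by computing the forward and backward variables, forming the state-occupancy and transition posteriors from them, and then taking the ratios of expected counts. The sketch below is illustrative only: all function names are made up for this example, indices are 0-based, a single observation sequence is used, no scaling is applied (so it would underflow on long sequences), and every state is assumed to have non-zero occupancy so that no denominator vanishes.

```python
def forward(A, B, pi, obs):
    """alpha[t][i] for t = 0..T-1 (0-based time), unscaled."""
    N = len(pi)
    alpha = [[pi[i] * B[i][obs[0]] for i in range(N)]]
    for o in obs[1:]:
        prev = alpha[-1]
        alpha.append([sum(prev[i] * A[i][j] for i in range(N)) * B[j][o]
                      for j in range(N)])
    return alpha


def backward(A, B, obs):
    """beta[t][i] for t = 0..T-1 (0-based time), unscaled."""
    N = len(A)
    beta = [[1.0] * N]  # beta at the final time step
    for o in reversed(obs[1:]):
        nxt = beta[0]
        beta.insert(0, [sum(A[i][j] * B[j][o] * nxt[j] for j in range(N))
                        for i in range(N)])
    return beta


def likelihood(A, B, pi, obs):
    """P(O | lambda) from the termination step of the forward algorithm."""
    return sum(forward(A, B, pi, obs)[-1])


def baum_welch_step(A, B, pi, obs):
    """One re-estimation pass; returns (new_A, new_B, new_pi)."""
    N, M, T = len(pi), len(B[0]), len(obs)
    alpha, beta = forward(A, B, pi, obs), backward(A, B, obs)
    p_obs = sum(alpha[-1])
    # gamma[t][i]: probability of state i at time t given O.
    gamma = [[alpha[t][i] * beta[t][i] / p_obs for i in range(N)]
             for t in range(T)]
    # xi[t][i][j]: probability of state i at t and state j at t+1 given O.
    xi = [[[alpha[t][i] * A[i][j] * B[j][obs[t + 1]] * beta[t + 1][j] / p_obs
            for j in range(N)] for i in range(N)] for t in range(T - 1)]
    new_pi = gamma[0][:]
    new_A = [[sum(xi[t][i][j] for t in range(T - 1)) /
              sum(gamma[t][i] for t in range(T - 1))
              for j in range(N)] for i in range(N)]
    new_B = [[sum(gamma[t][j] for t in range(T) if obs[t] == k) /
              sum(gamma[t][j] for t in range(T))
              for k in range(M)] for j in range(N)]
    return new_A, new_B, new_pi


# Hypothetical two-state, two-symbol model and a short observation sequence.
A0 = [[0.7, 0.3], [0.4, 0.6]]
B0 = [[0.9, 0.1], [0.2, 0.8]]
pi0 = [0.6, 0.4]
obs0 = [0, 0, 1, 0, 1, 1, 0]
A1, B1, pi1 = baum_welch_step(A0, B0, pi0, obs0)
print(likelihood(A0, B0, pi0, obs0), likelihood(A1, B1, pi1, obs0))
```

Repeating `baum_welch_step` on its own output gives the iterative procedure; practical implementations typically add scaling or work in the log domain.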
Baum-Welch Reestimation Formulas

If $\lambda = (A, B, \pi)$ is the initial model and $\bar{\lambda} = (\bar{A}, \bar{B}, \bar{\pi})$ is the re-estimated model, then it can be proved that either:
1. The initial model, $\lambda$, defines a critical point of the likelihood function, in which case $\bar{\lambda} = \lambda$, or
2. Model $\bar{\lambda}$ is more likely than $\lambda$ in the sense that $P(O|\bar{\lambda}) > P(O|\lambda)$, i.e., we have found a new model $\bar{\lambda}$ from which the observation sequence is more likely to have been produced.

Thus we can improve the probability of O being observed from the model if we iteratively use $\bar{\lambda}$ in place of $\lambda$ and repeat the re-estimation until some limiting point is reached. The resulting model is called the maximum likelihood HMM.

Multiple Observation Sequences

Speech recognition typically uses left-to-right HMMs. These HMMs cannot be trained using a single observation sequence, because only a small number of observations are available to train each state. To obtain reliable estimates of model parameters, one must use multiple observation sequences. In this case, the re-estimation procedure needs to be modified.

Let us denote the set of K observation sequences as

$O = \{O^{(1)}, O^{(2)}, \ldots, O^{(K)}\}$

where $O^{(k)} = \{o_1^{(k)}, o_2^{(k)}, \ldots, o_{T_k}^{(k)}\}$ is the k-th observation sequence.

Assuming that the observation sequences are mutually independent, we want to estimate the parameters so as to maximize

$P(O|\lambda) = \prod_{k=1}^{K} P(O^{(k)}|\lambda) = \prod_{k=1}^{K} P_k$

Since the re-estimation formulas are based on the frequency of occurrence of various events, we can modify them by adding up the individual frequencies of occurrence for each sequence.
The modified re-estimation formulas for $\bar{a}_{ij}$ and $\bar{b}_j(l)$ are:

$\bar{a}_{ij} = \dfrac{\sum_{k=1}^{K} \sum_{t=1}^{T_k - 1} \xi_t^k(i, j)}{\sum_{k=1}^{K} \sum_{t=1}^{T_k - 1} \gamma_t^k(i)} = \dfrac{\sum_{k=1}^{K} \frac{1}{P_k} \sum_{t=1}^{T_k - 1} \alpha_t^k(i)\,a_{ij}\,b_j(o_{t+1}^{(k)})\,\beta_{t+1}^k(j)}{\sum_{k=1}^{K} \frac{1}{P_k} \sum_{t=1}^{T_k - 1} \alpha_t^k(i)\,\beta_t^k(i)}$

$\bar{b}_j(l) = \dfrac{\sum_{k=1}^{K} \sum_{t=1,\ o_t^{(k)} = v_l}^{T_k} \gamma_t^k(j)}{\sum_{k=1}^{K} \sum_{t=1}^{T_k} \gamma_t^k(j)} = \dfrac{\sum_{k=1}^{K} \frac{1}{P_k} \sum_{t=1,\ o_t^{(k)} = v_l}^{T_k} \alpha_t^k(j)\,\beta_t^k(j)}{\sum_{k=1}^{K} \frac{1}{P_k} \sum_{t=1}^{T_k} \alpha_t^k(j)\,\beta_t^k(j)}$

Note: $\pi_i$ is not re-estimated, since for a left-to-right model $\pi_1 = 1$ and $\pi_i = 0$ for $i \ne 1$.

HMMs in Speech Recognition

Example of HMM for Speech Recognition

[Figure: left-to-right HMM with states q0, q1, q2, q3, qe; self-loops a11, a22, a33; transitions a01, a12, a13, a23, a3e.]

What are states used for and what do they model? For speech the (hidden) states can be:
- Phones,
- Parts of speech, or
- Words.

Observation is information about the spectrum and energy of the waveform at a point in time. The decoding process maps this sequence of acoustic information to phones, words, and ultimately sentences.

The observation sequence for speech recognition is a sequence of acoustic feature vectors. Each acoustic feature vector represents information such as the amount of energy in different frequency bands at a particular point in time. Each observation consists of a vector of, e.g., 39 real-valued features indicating spectral information. Observations are generally drawn every 10 milliseconds, so 1 second of speech requires 100 spectral feature vectors, each vector of length, e.g., 39.

Hidden States of HMM

The hidden states of Hidden Markov Models can be used to model speech in a number of different ways. For small tasks, like digit recognition (the recognition of the 10 digit words zero through nine) or yes-no recognition (recognition of the two words yes and no), we could build an HMM whose states correspond to entire words.
For most larger tasks, however, the hidden states of the HMM correspond to phone-like units, and words are sequences of these phone-like units.

Word-based HMMs

Word-based HMMs are appropriate for small-vocabulary speech recognition. For large-vocabulary ASR, subword-based (e.g., phone-based) models are more appropriate.

Example of HMM

[Figure: left-to-right HMM with nonemitting states q0 and qe and emitting states q1 (s), q2 (ih), q3 (k), q4 (s); transitions a01, a12, a23, a34, a4e and self-loops a11, a22, a33, a44.]

An HMM for the word six, consisting of four emitting states and two nonemitting states, the transition probabilities A, the observation probabilities B, and a sample observation sequence. This kind of left-to-right HMM structure is called a Bakis network.

Phone-based HMMs (cont'd)

The phone models can have many states, and words are made up from a concatenation of phone models.

Duration Modeling

The use of self-loops allows a single phone to repeat so as to cover a variable amount of the acoustic input. Phone durations vary hugely, depending on the phone identity, the speaker's rate of speech, the phonetic context, and the level of prosodic prominence of the word. In the Switchboard corpus, the phone [aa] varies in length from 7 to 387 milliseconds (1 to 40 frames), while the phone [z] varies in duration from 7 milliseconds to more than 1.3 seconds (130 frames) in some utterances. Self-loops thus allow a single state to be repeated many times.

What Do HMM States Model?

For very simple speech tasks (recognizing small numbers of words such as the 10 digits), using an HMM state to represent a phone is sufficient. In general LVCSR tasks, however, a more fine-grained representation is necessary. This is because phones can last over 1 second, i.e., over 100 frames, but the 100 frames are not acoustically identical. The spectral characteristics of a phone, and the amount of energy, vary dramatically across a phone.
For example, recall that stop consonants have a closure portion, which has very little acoustic energy, followed by a release burst. Similarly, diphthongs are vowels whose F1 and F2 change significantly. Fig. 9.6 (next slide) shows these large changes in spectral characteristics over time for each of the two phones in the word "Ike", ARPAbet [ay k].

Ike [ay k]

[Figure: Ike, ARPAbet [ay k].]

Spectrogram of Mike [m ay k]

[Figure: two panels. SPEECH: waveform amplitude (-1 to 1) versus time (0 to 0.8 s). SPECTROGRAM: frequency (0 to 4000 Hz) versus time (0 to 0.8 s).]

Phone Modeling with HMMs

To capture the nonhomogeneous nature of phones over time, in LVCSR we generally model a phone with more than one HMM state. The most common configuration uses three HMM states: a beginning, a middle, and an end state. Each phone thus consists of 3 emitting HMM states instead of one (plus two nonemitting states at either end), as shown in the next slide.

It is common to reserve the terms "model" or "phone model" for the entire 5-state phone HMM, and to use the term "HMM state" (or just "state" for short) for each of the 3 individual subphone HMM states.

Phone Modeling with HMMs

[Figure: left-to-right HMM with nonemitting states q0 and qe and emitting states q1 (begin), q2 (middle), q3 (end); transitions a01, a12, a23, a3e and self-loops a11, a22, a33.]

A standard 5-state HMM model for a phone, consisting of three emitting states (corresponding to the transition-in, steady-state, and transition-out regions of the phone) and two nonemitting states.

Phone Modeling with HMMs

To build an HMM for an entire word using these more complex phone models, we can simply replace each phone/state of the word model in "Example of HMM for Speech Recognition" with a 3-state phone HMM.
We replace the nonemitting start and end states of each phone model with transitions directly to the emitting states of the preceding and following phones, leaving only two nonemitting states for the entire word. The figure in the next slide shows the expanded word.

Phone Modeling with HMMs

[Figure: three diagrams. (1) The word model for "six" with emitting states q1 (s), q2 (ih), q3 (k), q4 (s). (2) The phone model with emitting states q1 (begin), q2 (middle), q3 (end). (3) The expanded word model with nonemitting states q0 and qe and twelve emitting states: q1 to q3 (s-begin, s-middle, s-end), q4 to q6 (ih-begin, ih-middle, ih-end), q7 to q9 (k-begin, k-middle, k-end), q10 to q12 (s-begin, s-middle, s-end), each with a self-loop and a transition to the next state.]

A composite word model for "six", [s ih k s], formed by concatenating four phone models, each with three emitting states.

Phone Modeling with HMMs

Another way of looking at the transition probabilities A and the states Q is that together they represent a lexicon: a set of pronunciations for words, each pronunciation consisting of a set of subphones, with the order of the subphones specified by the transition probabilities A.

We have now covered the basic structure of HMM states for representing phones and words in speech recognition. Later in this chapter we will see further augmentations of the HMM word model shown in the figure in the previous slide, such as the use of triphone models, which make use of phone context, and the use of special phones to model silence.

HMM for Acoustic Observations Modeling

We have now almost completed the discussion of acoustic modeling with HMMs. The one remaining factor is the computation of the output observation probability $b_j(o_t)$ for a given observation $o_t$ at time t. This topic is discussed next.
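Before moving on, the composite word model described above can be sketched in Python. This is only an illustration under stated assumptions: the function name `build_word_hmm`, the state-naming scheme, and the single shared self-loop probability are hypothetical, and in practice each transition probability would be trained rather than fixed.

```python
def build_word_hmm(phones, self_loop=0.5):
    """Concatenate 3-state phone models into one left-to-right word HMM.

    Returns (names, A). names[0] and names[-1] are the two nonemitting
    states; A[i][j] is the transition probability from state i to state j.
    The self_loop value is a placeholder, not a trained parameter.
    """
    names = ["start"]
    for phone in phones:
        names += [f"{phone}_{part}" for part in ("begin", "middle", "end")]
    names.append("end")

    n = len(names)
    A = [[0.0] * n for _ in range(n)]
    A[0][1] = 1.0                        # start always enters the first subphone
    for i in range(1, n - 1):            # emitting states only
        A[i][i] = self_loop              # stay: covers variable phone duration
        A[i][i + 1] = 1.0 - self_loop    # advance to the next subphone (or end)
    return names, A


names, A = build_word_hmm(["s", "ih", "k", "s"])   # the word "six"
```

For "six" this yields 14 states: 12 emitting subphone states between the two nonemitting states, matching the layout of the expanded word model. The final state has no outgoing transitions in this sketch.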
Continuous Density Hidden Markov Models

A continuous density HMM replaces the discrete observation probabilities, $b_j(k)$, by a continuous PDF $b_j(x)$. A common practice is to represent $b_j(x)$ as a mixture of Gaussians:

$$b_j(x) = \sum_{k=1}^{M} c_{jk}\,\mathcal{N}\big(x; \mu_{jk}, \Sigma_{jk}\big), \qquad 1 \le j \le N$$

where:
$c_{jk}$ is the mixture weight, with $c_{jk} \ge 0$ for $1 \le j \le N$, $1 \le k \le M$, and $\sum_{k=1}^{M} c_{jk} = 1$ for $1 \le j \le N$;
$\mathcal{N}$ is the normal density;
$\mu_{jk}$ and $\Sigma_{jk}$ are the mean vector and covariance matrix associated with state j and mixture k.

Continuous Density Hidden Markov Models

Continuous density hidden Markov models are described in detail in the previous lecture series (Ch3Pattern Classification2.ppt).

Acoustic Modeling Variations

Semi-continuous HMMs first compute a VQ codebook of size M.
The VQ codebook is then modeled as a family of Gaussian PDFs.
Each codeword is represented by a Gaussian PDF, and may be used together with others to model the acoustic vectors.
From the CDHMM viewpoint, this is equivalent to using the same set of M mixtures to model all the states.
It is therefore often referred to as a tied-mixture HMM.

All three methods (discrete, semi-continuous, and continuous density HMMs) have been used in many speech recognition tasks, with varying outcomes. For large-vocabulary, continuous speech recognition with a sufficient amount (i.e., tens of hours) of training data, CDHMM systems currently yield the best performance, but with a considerable increase in computation.

Implementation Issues

Scaling: to prevent underflow.
Segmental K-means training: to train observation probabilities by first performing Viterbi alignment.
Initial estimates of $\lambda$: to provide robust models.
Pruning: to reduce search computation.

Hidden Markov Model Concept

Definitions:
1. An output alphabet $\mathcal{Y} = \{0, 1, \ldots, b-1\}$.
2. A state space $\mathcal{S} = \{1, 2, \ldots, c\}$ with a unique starting state $s_0$.
3. A probability distribution of transitions between states, $p(s'|s)$.
4. An output probability distribution $q(y|s,s')$ associated with transitions from state s to state s'.

Hidden Markov Model Concept

The probability of observing an HMM output string $y_1, y_2, \ldots, y_k$ is given by:

$$P(y_1, y_2, \ldots, y_k) = \sum_{s_1, \ldots, s_k} \prod_{i=1}^{k} p(s_i|s_{i-1})\, q(y_i|s_{i-1}, s_i)$$

The next figure is an example of an HMM with b = 2 and c = 3.

[Figure: three states 1, 2, 3 with transitions labeled q(y|1,1), q(y|1,2), q(y|1,3), q(y|2,3), q(y|3,1), q(y|3,2).]

Three-state hidden Markov model with outputs $y \in \{0, 1\}$.

Hidden Markov Model Concept

The underlying state process still has only one-step memory:

$$P(s_1, s_2, \ldots, s_k) = \prod_{i=1}^{k} p(s_i|s_{i-1})$$

However, the memory of the observables is unlimited (except in degenerate cases). That is, in general, for all $k \ge 2$:

$$P(y_{k+1}|y_1, y_2, \ldots, y_k) \ne P(y_{k+1}|y_j, y_{j+1}, \ldots, y_k), \qquad k \ge j \ge 2$$

Hidden Markov Model Concept

It will frequently be convenient to regard the HMM as having multiple transitions between pairs of states, each associated with a different output symbol that is generated, with probability 1, when the transition is taken. The HMM example given below can generate the same random sequence as the previous example, assuming that q(1|1,1) = q(1|1,3) = q(0|3,1) = q(0|3,2) = 0.

[Figure: the three-state HMM redrawn with one transition per output symbol, each transition labeled 0 or 1.]

Hidden Markov model representation attaching outputs to transitions.

Hidden Markov Model Concept

This view has the advantage of allowing us to provide each transition of the entire HMM with a different identifier t and to define an output function Y(t) that assigns to t a unique output symbol taken from the alphabet $\mathcal{Y}$.
We then denote by L(t) and R(t) the source and target states of the transition t, respectively. We let p(t) denote the probability that the state L(t) is exited via the transition t, so that for all $s \in \mathcal{S}$:

$$\sum_{t:\,L(t)=s} p(t) = 1$$

The correspondence between the two ways of viewing an HMM is given by the relationship

$$p(t) = q\big(Y(t)\,|\,L(t), R(t)\big)\; p\big(R(t)\,|\,L(t)\big)$$

Hidden Markov Model Concept

When transitions determine outputs, the probability $P(y_1, y_2, \ldots, y_k)$ becomes equal to the sum of the products $\prod_{i=1}^{k} p(t_i)$ over all transition sequences $t_1, \ldots, t_k$ such that $L(t_1) = s_0$, $Y(t_i) = y_i$, and $R(t_i) = L(t_{i+1})$ for $i = 1, \ldots, k$, or formally:

$$P(y_1, y_2, \ldots, y_k) = \sum_{S(y_1, y_2, \ldots, y_k)} \prod_{i=1}^{k} p(t_i)$$

where $S(y_1, y_2, \ldots, y_k) = \{t_1, t_2, \ldots, t_k : L(t_1) = s_0,\ Y(t_i) = y_i,\ R(t_i) = L(t_{i+1}) \text{ for } i = 1, \ldots, k\}$.

In the following sections we will take whichever point of view is more convenient for the problem at hand: either multiple transitions between states s and s', or multiple possible outputs generated by the single transition s → s'.

The Evaluation of the Probability: The Trellis

There is an easy way to calculate the probability $P(y_1, y_2, \ldots, y_k)$ with the help of a trellis. A trellis consists of the concatenation of elementary stages determined by the particular outputs $y_i$. The number of distinct elementary stages is equal to the number of different output symbols.

[Figure: two trellis stages, one for y = 0 and one for y = 1, each connecting states 1, 2, 3 in one column to states 1, 2, 3 in the next.]

Two different trellis stages corresponding to the binary HMM presented in slide 16.

The Trellis

Trellis corresponding to the output sequence 0110.

[Figure: five columns of states 1, 2, 3 joined by the stages y = 0, y = 1, y = 1, y = 0.]

Trellis for the output sequence 0110 generated by the binary HMM presented in slide 16.

The required probability P(0110) is equal to the sum of the probabilities of all complete paths through the trellis (those ending in the last column) that start in the obligatory starting state.
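The path-sum definition can be checked directly by enumerating every state sequence. The sketch below does this in Python for a three-state binary HMM with the same transition structure as the example; all numeric probabilities are made-up illustrative values (the slides do not give any), chosen so that q(1|1,1) = q(1|1,3) = q(0|3,1) = q(0|3,2) = 0, and the names `P_TRANS`, `Q_OUT`, and `path_sum_probability` are hypothetical.

```python
from itertools import product

STATES = (1, 2, 3)
START = 1  # the obligatory starting state s0

# P_TRANS[(s, s2)] = p(s2 | s); missing pairs have probability zero.
# All numbers here are illustrative assumptions, not values from the slides.
P_TRANS = {(1, 1): 0.2, (1, 2): 0.5, (1, 3): 0.3,
           (2, 3): 1.0,
           (3, 1): 0.6, (3, 2): 0.4}

# Q_OUT[(s, s2)][y] = q(y | s, s2) for y in {0, 1}.
Q_OUT = {(1, 1): (1.0, 0.0), (1, 2): (0.5, 0.5), (1, 3): (1.0, 0.0),
         (2, 3): (0.3, 0.7),
         (3, 1): (0.0, 1.0), (3, 2): (0.0, 1.0)}


def path_sum_probability(outputs):
    """Sum, over all state sequences, of prod p(s_i|s_{i-1}) q(y_i|s_{i-1},s_i)."""
    total = 0.0
    for path in product(STATES, repeat=len(outputs)):
        prob, prev = 1.0, START
        for y, s in zip(outputs, path):
            if (prev, s) not in P_TRANS:    # transition does not exist
                prob = 0.0
                break
            prob *= P_TRANS[(prev, s)] * Q_OUT[(prev, s)][y]
            prev = s
        total += prob
    return total


p_0110 = path_sum_probability((0, 1, 1, 0))
```

The enumeration visits c^k state sequences for an output string of length k, so it is only practical for very short strings.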
The Trellis

Example of paths for $s_0 = 1$ that could generate the sequence 0110.

[Figure: five columns of states 1, 2, 3 joined by the stages y = 0, y = 1, y = 1, y = 0, starting from s0 = 1.]

Trellis of the previous slide purged of all paths that could not have generated the output sequence 0110.

The Trellis

The probability $P(y_1, y_2, \ldots, y_k)$ can be obtained recursively. Define the probabilities:

$$\alpha_i(s) = P(y_1, y_2, \ldots, y_i, s_i = s)$$

Using the boundary conditions:

$$\alpha_0(s) = 1 \text{ for } s = s_0, \qquad \alpha_0(s) = 0 \text{ for } s \ne s_0$$

and applying the simplified notation:

$$p(y_i, s|s') = q(y_i|s', s)\, p(s|s')$$

The Trellis

we get the recursion:

$$\alpha_i(s) = \sum_{s'} p(y_i, s|s')\, \alpha_{i-1}(s')$$

From the definition of $\alpha_i(s)$, the desired probability is then:

$$P(y_1, y_2, \ldots, y_k) = \sum_{s} \alpha_k(s)$$

The Trellis

Unit probability is assigned to the starting state $s_0 = 1$, i.e., $\alpha_0(s_0) = 1$. Computing the flows $p(0, s|1)\,\alpha_0(1)$ for $s \in \{1, 2, 3\}$ gives the first column, and the same recursion gives the second:

$$\alpha_1(s) = p(y_1 = 0, s|1)\, \alpha_0(1), \qquad s \in \{1, 2, 3\}$$

$$\alpha_2(s) = \sum_{s'} p(y_2 = 1, s|s')\, \alpha_1(s'), \qquad s \in \{1, 2, 3\}$$

[Figure: the purged trellis for 0110 starting from s0 = 1, with each node annotated by the predecessor states s' that contribute to it.]

Trellis of the previous slide purged of all paths that could not have generated the output sequence 0110.

...
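The recursion over the trellis can be sketched in a few lines of Python. This is a minimal illustration, assuming the transition-output formulation used here, where p(y, s | s') = q(y | s', s) p(s | s'); the function name `forward_probability` and every numeric probability below are hypothetical placeholders, since the slides give no concrete values.

```python
def forward_probability(outputs, states, start, p_trans, q_out):
    """Return P(y_1..y_k) using alpha_i(s) = sum_s' p(y_i, s|s') alpha_{i-1}(s').

    p_trans[(s_prev, s)] = p(s | s_prev); missing pairs mean probability zero.
    q_out[(s_prev, s)][y] = q(y | s_prev, s).
    """
    # Boundary condition: all probability mass sits on the starting state.
    alpha = {s: 1.0 if s == start else 0.0 for s in states}
    for y in outputs:
        # One trellis stage: push the flow along every existing transition.
        alpha = {
            s: sum(q_out[(sp, s)][y] * p_trans[(sp, s)] * alpha[sp]
                   for sp in states if (sp, s) in p_trans)
            for s in states
        }
    return sum(alpha.values())


# Illustrative three-state binary HMM (all numbers are assumptions).
STATES = (1, 2, 3)
P_TRANS = {(1, 1): 0.2, (1, 2): 0.5, (1, 3): 0.3,
           (2, 3): 1.0,
           (3, 1): 0.6, (3, 2): 0.4}
Q_OUT = {(1, 1): (1.0, 0.0), (1, 2): (0.5, 0.5), (1, 3): (1.0, 0.0),
         (2, 3): (0.3, 0.7),
         (3, 1): (0.0, 1.0), (3, 2): (0.0, 1.0)}

p_0110 = forward_probability((0, 1, 1, 0), STATES, 1, P_TRANS, Q_OUT)
```

Each stage costs on the order of c^2 operations for c states, so the whole computation grows linearly with the length of the output string instead of exponentially, as it would if every path were enumerated separately.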

This note was uploaded on 02/11/2012 for the course ECE 5526 taught by Professor Staff during the Summer '09 term at FIT.
