# huang-vq-gmm - Pattern Recognition (Unsupervised Estimation Methods)


ALGORITHM 4.1: THE BACKPROPAGATION ALGORITHM

Step 1: Initialization: Set t = 0 and choose initial weight matrices W for each layer. Denote by w_{ij}^k(t) the weighting coefficient connecting the i-th input node in layer k-1 and the j-th output node in layer k at time t.

Step 2: Forward Propagation: Compute the values in each node from the input layer to the output layer in a propagating fashion, for k = 1 to K:

v_j^k = \mathrm{sigmoid}\left( w_{0j}^k(t) + \sum_{i=1}^{N} w_{ij}^k(t)\, v_i^{k-1} \right) \quad \forall j \qquad (4.72)

where \mathrm{sigmoid}(x) = \frac{1}{1 + e^{-x}} and v_j^k denotes the j-th node in the k-th layer.

Step 3: Back Propagation: Update the weight matrix for each layer from the output layer to the input layer according to:

w_{ij}^k(t+1) = w_{ij}^k(t) - \alpha \frac{\partial E}{\partial w_{ij}^k(t)} \qquad (4.73)

where E = \sum_i \| y_i - o_i \|^2 and (y_1, y_2, \ldots) is the computed output vector in Step 2. \alpha is referred to as the learning rate and has to be small enough to guarantee convergence. One popular choice is 1/(t+1).

Step 4: Iteration: Let t = t + 1. Repeat Steps 2 and 3 until some convergence condition is met.
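Algorithm 4.1 can be sketched in a few lines of NumPy. This is a minimal illustration, not the book's code: the one-hidden-layer network, the training example, and the fixed learning rate (instead of the 1/(t+1) schedule mentioned above) are all choices made for the example.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def backprop_step(W1, W2, x, o, alpha):
    """One forward/backward pass for a one-hidden-layer net.

    Following the book's notation, y is the computed output and o the
    desired output; E = sum((y - o)^2).
    """
    # Step 2: forward propagation, Eq. (4.72).
    h = sigmoid(W1 @ x)                        # hidden-layer node values
    y = sigmoid(W2 @ h)                        # output-layer node values
    # Step 3: back propagation of the gradient of E, Eq. (4.73).
    delta_out = 2 * (y - o) * y * (1 - y)      # output-layer error signal
    delta_hid = (W2.T @ delta_out) * h * (1 - h)
    W2 = W2 - alpha * np.outer(delta_out, h)
    W1 = W1 - alpha * np.outer(delta_hid, x)
    return W1, W2, float(np.sum((y - o) ** 2))

rng = np.random.default_rng(0)
W1 = rng.normal(0, 0.5, (4, 3))                # arbitrary small network
W2 = rng.normal(0, 0.5, (2, 4))
x = np.array([0.2, 0.7, 1.0])                  # made-up input
o = np.array([0.9, 0.1])                       # made-up desired output

errors = []
for t in range(200):                           # Step 4: iterate
    W1, W2, e = backprop_step(W1, W2, x, o, alpha=0.5)
    errors.append(e)
print(errors[0], errors[-1])                   # the error shrinks over time
```
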
4.4. UNSUPERVISED ESTIMATION METHODS

As described in Section 4.2, in unsupervised learning, information about the class \omega of the data sample is unavailable. One might wonder why we are interested in such an unpromising problem, and whether or not it is possible to learn anything from incomplete data. Interestingly enough, the formal answer is yes, as the methods in this section show.

4.4.1. Vector Quantization

As described in Chapter 3, source coding refers to techniques that convert the signal source into a sequence of bits that are transmitted over a communication channel and then used to reproduce the original signal at a different location or time. In speech communication, the reproduced sound usually allows some acceptable level of distortion to achieve a low bit rate. The goal of source coding is to reduce the number of bits necessary to transmit or store data, subject to a distortion or fidelity criterion, or, equivalently, to achieve the minimum possible distortion for a prescribed bit rate. Vector quantization (VQ) is one of the most efficient source-coding techniques, with applications in speech coding, image coding, and speech recognition [36, 85]. In both speech recognition and synthesis systems, vector quantization serves an important role in many aspects, from robust signal processing to data compression.

A vector quantizer is described by a codebook, which is a set of fixed prototype vectors. The design of the vector quantization process includes:

1. the distortion measure;
2. the generation of each codeword in the codebook.

Assume that x = (x_1, x_2, \ldots, x_d)^t \in R^d is a d-dimensional vector whose components \{x_k, 1 \le k \le d\} are real-valued, continuous-amplitude random variables. After vector quantization, the vector x is mapped (quantized) to another discrete-amplitude d-dimensional vector z:

z = q(x) \qquad (4.74)

In Eq. (4.74), q(\cdot) is the quantization operator. Typically, z is a vector from a finite set Z = \{z_j, 1 \le j \le M\}, where z_j is also a d-dimensional vector. The set Z is referred to as the codebook, M is the size of the codebook, and z_j is the j-th codeword. The size M of the codebook is also called the number of partitions (or levels) in the codebook.
To design a codebook, the d-dimensional space of the original random vector x can be partitioned into M regions or cells \{C_i, 1 \le i \le M\}, with each cell C_i associated with a codeword vector z_i. VQ then maps (quantizes) the input vector x to codeword z_i if x lies in C_i. That is,

q(x) = z_i \quad \text{if } x \in C_i \qquad (4.75)

An example of partitioning a two-dimensional space (d = 2) for the purpose of vector quantization is shown in Figure 4.12. The shaded region enclosed by the dashed lines is the cell C_i. Any input vector x that lies in the cell C_i is quantized as z_i. The shapes of the various cells can be different. The positions of the codewords within each cell are shown by dots in Figure 4.12. The codeword z_i is also referred to as the centroid of the cell C_i, because it can be viewed as the central point of the cell.

Figure 4.12 Partitioning of a two-dimensional space into 16 cells.

When x is quantized as z, a quantization error results. A distortion measure d(x, z) can be defined between x and z to measure the quantization quality. Using this distortion measure, Eq. (4.75) can be reformulated as follows:

q(x) = z_i \quad \text{if and only if} \quad i = \arg\min_k d(x, z_k) \qquad (4.76)

The distortion measure between x and z is also known as a distance measure in the text. The measure must be tractable in order to be computed and analyzed, and it must also be subjectively relevant, so that differences in distortion values can be used to indicate differences in signal quality.
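The quantization rule of Eqs. (4.74)-(4.76) is just a nearest-neighbor search over the codebook. A minimal sketch, using squared Euclidean distance as the distortion measure (the codebook values below are invented for illustration):

```python
import numpy as np

def quantize(x, codebook):
    """Eq. (4.76): map x to the codeword with minimum distortion."""
    distortions = np.sum((codebook - x) ** 2, axis=1)  # d(x, z_k) for all k
    i = np.argmin(distortions)                         # i = argmin_k d(x, z_k)
    return i, codebook[i]                              # cell index and z_i

# A toy 2-D codebook with M = 4 codewords (illustrative values only).
Z = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
i, z = quantize(np.array([0.9, 0.2]), Z)
print(i, z)   # the closest codeword is (1, 0), at zero-based index 1
```
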
The simplest choice is the sum of squared error, which weights the contributions of the different parameters equally:

d(x, z) = \sum_{k=1}^{d} (x_k - z_k)^2 \qquad (4.77)

In general, the contributions of the different dimensions need not be weighted equally; one common modification is to weight them by the inverse covariance matrix. Therefore, the distortion measure d(x, z) can be defined as follows:

d(x, z) = (x - z)^t \Sigma^{-1} (x - z) \qquad (4.78)

This distortion measure, known as the Mahalanobis distance, is actually the exponential term in a Gaussian density function.

Another way to weight the contributions to the distortion measure is to use perceptually based distortion measures. Such distortion measures take advantage of subjective judgments: signal changes that make the sound be perceived as different should be associated with large distances, while signal changes that keep the sound perceived the same should be associated with small distances. A number of perceptually based distortion measures have been used in speech coding [3, 75, 76].

4.4.1.2. The K-Means Algorithm

To design an M-level codebook, it is necessary to partition the d-dimensional space into M cells and associate a quantized vector with each cell. Based on the source-coding principle, the criterion for optimization of the vector quantizer is to minimize the overall average distortion over all M levels of the VQ. The overall average distortion can be defined by

D = E[d(x, z)] = \sum_{i=1}^{M} p(x \in C_i)\, E[d(x, z_i) \mid x \in C_i]
  = \sum_{i=1}^{M} p(x \in C_i) \int_{x \in C_i} d(x, z_i)\, p(x \mid x \in C_i)\, dx = \sum_{i=1}^{M} D_i \qquad (4.79)

where the integral is taken over all components of vector x; p(x \in C_i) denotes the prior probability of codeword z_i; p(x \mid x \in C_i) denotes the multidimensional probability density function of x in cell C_i; and D_i is the average distortion in cell C_i. No analytic solution exists to guarantee global minimization of this average distortion.
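Before turning to codebook optimization, a quick sanity check on the distortion measures above: with \Sigma equal to the identity matrix, the Mahalanobis distance of Eq. (4.78) reduces to the sum of squared error of Eq. (4.77). The vectors and the diagonal covariance below are invented for the example.

```python
import numpy as np

def mahalanobis(x, z, sigma_inv):
    """Eq. (4.78): d(x, z) = (x - z)' Sigma^{-1} (x - z)."""
    d = x - z
    return float(d @ sigma_inv @ d)

x = np.array([1.0, 2.0])
z = np.array([0.0, 0.0])

# With Sigma = I this is the plain sum of squared error, Eq. (4.77).
print(mahalanobis(x, z, np.eye(2)))             # 1^2 + 2^2 = 5.0

# An illustrative diagonal covariance: dimensions with larger variance
# contribute less to the distortion.
sigma = np.diag([1.0, 4.0])
print(mahalanobis(x, z, np.linalg.inv(sigma)))  # 1/1 + 4/4 = 2.0
```
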
space into M cells iing principle, the average distortion Unsupervised Estimation Methods 167 We say a quantizer is optimal if the overall average distortion is minimized over all M- levels of the quantizer. There are two necessary conditions for optimality. The ﬁrst is that the optimal quantizer is realized by using a nearest-neighbor selection rule as speciﬁed by Eq. (4.76). Note that the average distortion for each cell C, E[d(x,z,.) | xe C] (4.80) can be minimized when z,. is selected such that d(x,z,.) is minimized for x. This means that the quantizer must choose the codeword that results in the minimum distortion with respect to x. The second condition for optimality is that each codeword z, is chosen to minimize the average distortion in cell C, . That is, z,. is the vector that minimizes D.- =p(z.-)E[d(x,z)|xe 0.] (4.81) Since the overall average distortion D is a linear combination of average distortions in C, , they can be independently computed after classiﬁcation of x. The vector 2,. is called the centroid of the cell Cl. and is written = cent(C,.) (4.82) The centroid for a particular region (cell) depends on the deﬁnition of the distortion measure. In practice, given a set of training vectors {x,, IS! S T}, a subset of K, vectors will be located in cell Q. In this case, p(x | 2,.) can be assumed to be 1/ K. , and p(z,.) be- comes K,. / T . The average distortion D, in cell C, can then be given by D.=—2d(x,z) (4-33) szC, The second condition for optimality can then be rewritten as follows: = argminD. (z )= argmin— 12 d(x,z.) (4.84) 1: Txe C, When the sum of squared error in Eq. (4. 77) 18 used for the distortion measure, the at- R 168 Pattern Recognition By solving Eq. (4.85), we obtain the least square error estimate of centroid 2,. simply as the sample mean of all the training vectors x, quantized to cell C, : . 1 .=— x 4.86 2, K12 < ) x5 C, If the Mahalanobis distance measure [Eq. (4.78)] is used, minimization of D, in Eq. 
(4.84) can be done similarly: Vlei=V 12(x—zi)'2'l(x—zi) 1: T EC, = % 2 V1, (x — z, )‘>:-I (x — 2,.) (4.87) xe C, =_7222“(x—z,)=0 xe C, and centroid 2,. is obtained from 1 i. =— X 4.88 . 192 ( ) xEC, One can see that 2,. is again the sample mean of all the training vectors x, quantized to cell Ci. Although Eq. (4.88) is obtained based on the Mahalanobis distance measure, it also works with a large class of Euclidean-like distortion measures [61]. Since the Mahalanobis distance measure is actually the exponential term in a Gaussian density, minimization of the distance criterion can be easily translated into maximization of the logarithm of the Gaussian likelihood. Therefore, similar to the relationship between least square error estimation for the linear discrimination function and the Gaussian classiﬁer described in Section 4.3.3.1, the distance minimization process (least square error estimation) above is in fact a maximum likelihood estimation. According to these two conditions for VQ optimality, one can iteratively apply the nearest-neighbor selection rule and Eq. (4.88) to get the new centroid 2,. for each cell in order to minimize the average distortion measure. This procedure is known as the k—means algorithm or the generalized Lloyd algorithm [29, 34, 56]. In the k-means algorithm, the basic idea is to partition the set of training vectors into M clusters Ci (1 S i S M) in such a way that the two necessary conditions for optimality described above are satisﬁed. The k- means algorithm can be described as in Algorithm 4.2. Step 1: (1,, ISi Step 2: N C, by Cht classificat Step 3: C the trainin Step 4: It rent iterat old. In the actually bre. mean) for e. partitioning measure. A] ﬁnding the 1 distortion L D smaller th Theon thermore, ar the quality c repeating tht one can Ch01 section we v 4.4.1.3. Since the ini shown that i means algor. The LBG al; the codewor until the des: Algorithm 4. _____‘-——- I Recognition simply as the (4.86) of D,. 
ALGORITHM 4.2: THE K-MEANS ALGORITHM

Step 1: Initialization: Choose some adequate method to derive the initial VQ codewords (z_i, 1 \le i \le M) in the codebook.

Step 2: Nearest-Neighbor Classification: Classify each training vector \{x_t\} into one of the cells C_i by choosing the closest codeword z_i (x \in C_i iff d(x, z_i) \le d(x, z_j) for all j \ne i). This classification is also called a minimum-distance classifier.

Step 3: Codebook Updating: Update the codeword of every cell by computing the centroid of the training vectors in each cell according to Eq. (4.84) (\hat{z}_i = \mathrm{cent}(C_i), 1 \le i \le M).

Step 4: Iteration: Repeat Steps 2 and 3 until the ratio of the new overall distortion D at the current iteration relative to the overall distortion at the previous iteration is above a preset threshold.

In the process of minimizing the average distortion measure, the k-means procedure actually breaks the minimization into two steps. Assuming that the centroid z_i (or mean) for each cell C_i has been found, the minimization proceeds simply by partitioning all the training vectors into their corresponding cells according to the distortion measure. After all of the new partitions are obtained, the minimization involves finding the new centroid within each cell to minimize its corresponding within-cell average distortion D_i based on Eq. (4.84). By iterating over these two steps, a new overall distortion D smaller than that of the previous step can be obtained.

Theoretically, the k-means algorithm can converge only to a local optimum [56]. Furthermore, any such solution is, in general, not unique [33]. Initialization is often critical to the quality of the eventual converged codebook.
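Algorithm 4.2 translates directly into code. The sketch below uses the squared-error distortion; the synthetic four-cluster data, M = 4, the random-sample initialization (Step 1 leaves the method open), and the fixed iteration count standing in for the distortion-ratio test of Step 4 are all choices made for the example.

```python
import numpy as np

def kmeans(data, M, iters=20, seed=0):
    """Algorithm 4.2 with squared-error distortion.

    Returns the codebook and the overall distortion D per iteration.
    """
    rng = np.random.default_rng(seed)
    # Step 1: initialize the codebook from randomly chosen training vectors.
    Z = data[rng.choice(len(data), size=M, replace=False)].copy()
    history = []
    for _ in range(iters):
        # Step 2: nearest-neighbor classification of every training vector.
        d2 = ((data[:, None, :] - Z[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        history.append(float(d2.min(axis=1).mean()))   # overall distortion D
        # Step 3: codebook update, z_i = cent(C_i) = sample mean of the cell.
        for i in range(M):
            if np.any(labels == i):
                Z[i] = data[labels == i].mean(axis=0)
        # Step 4: iterate (a fixed count here instead of a ratio threshold).
    return Z, history

rng = np.random.default_rng(42)
data = np.concatenate([rng.normal(m, 0.3, size=(100, 2))
                       for m in ([0, 0], [3, 0], [0, 3], [3, 3])])
Z, history = kmeans(data, M=4)
print(history[0], history[-1])   # D never increases across iterations
```
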
Global optimality may be approximated by repeating the k-means algorithm for several sets of codebook initialization values and then choosing the codebook that produces the minimum overall distortion. In the next subsection we describe methods for finding a decent initial codebook.

4.4.1.3. The LBG Algorithm

Since the initial codebook is critical to the ultimate quality of the final codebook, it has been shown that it is advantageous to design an M-vector codebook in stages. This extended k-means algorithm is known as the LBG algorithm, proposed by Linde, Buzo, and Gray [56]. The LBG algorithm first computes a 1-vector codebook, then uses a splitting algorithm on the codewords to obtain the initial 2-vector codebook, and continues the splitting process until the desired M-vector codebook is obtained. The procedure is formally implemented by Algorithm 4.3.

ALGORITHM 4.3: THE LBG ALGORITHM

Step 1: Initialization: Set M (the number of partitions or cells) = 1. Find the centroid of all the training data according to Eq. (4.84).

Step 2: Splitting: Split M into 2M partitions by splitting each current codeword: find two points that are far apart in each partition using a heuristic method, and use these two points as the new centroids for the new 2M codebook. Now set M = 2M.

Step 3: K-Means Stage: Use the k-means iterative algorithm described in the previous section to reach the best set of centroids for the new codebook.

Step 4: Termination: If M equals the VQ codebook size required, STOP; otherwise go to Step 2.

4.4.2. The EM Algorithm

We now introduce the EM algorithm, which is important to hidden Markov models and other learning techniques. It discovers model parameters by maximizing the log-likelihood of incomplete data and by iteratively maximizing the expectation of the log-likelihood from complete data. The EM algorithm is a generalization of the VQ algorithm described above.
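A sketch of Algorithm 4.3, reusing a k-means refinement step. The book leaves the far-apart-points heuristic of Step 2 open; the perturbation used here (scaling each centroid by 1 ± ε) is one common choice and should be treated as an assumption, as should the toy data.

```python
import numpy as np

def kmeans_refine(data, Z, iters=10):
    """Step 3: plain k-means refinement of codebook Z (squared error)."""
    for _ in range(iters):
        d2 = ((data[:, None, :] - Z[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        for i in range(len(Z)):
            if np.any(labels == i):
                Z[i] = data[labels == i].mean(axis=0)
    return Z

def lbg(data, M_target, eps=0.05):
    """Algorithm 4.3: grow the codebook by splitting, 1 -> 2 -> ... -> M."""
    Z = data.mean(axis=0, keepdims=True)         # Step 1: M = 1 centroid
    while len(Z) < M_target:
        # Step 2: split each codeword into two perturbed copies (M -> 2M).
        Z = np.concatenate([Z * (1 + eps), Z * (1 - eps)])
        Z = kmeans_refine(data, Z)               # Step 3: k-means stage
    return Z                                     # Step 4: stop at M_target

rng = np.random.default_rng(7)
data = np.concatenate([rng.normal(m, 0.2, size=(80, 2))
                       for m in ([1, 1], [1, 3], [3, 1], [3, 3])])
Z = lbg(data, M_target=4)
print(len(Z))   # 4
```
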
The EM algorithm can also be viewed as a generalization of the MLE method to the case where the observed data are incomplete. Without loss of generality, we use scalar random variables to describe the observable data y and the hidden data x (that is, the unobserved data). For example, x may be a hidden index that refers to the component densities of observable data y, or x may be the underlying hidden state sequence in hidden Markov models (as discussed in Chapter 8). Without knowing this hidden data x, we can still estimate a parameter vector that maximizes the probability of the observed data as if x had in fact been observed, and we can iterate this process to refine the estimate of \Phi.

The issue now is whether or not the process (the EM algorithm) described above converges. Without loss of generality, we assume that both random variables X (unobserved) and Y (observed) are discrete random variables. According to Bayes' rule,

P(X = x, Y = y \mid \hat{\Phi}) = P(X = x \mid Y = y, \hat{\Phi})\, P(Y = y \mid \hat{\Phi}) \qquad (4.89)

Our goal is to maximize the log-likelihood of the observable, real data y generated by parameter vector \hat{\Phi}. Based on Eq. (4.89), the log-likelihood can be expressed as follows:

\log P(Y = y \mid \hat{\Phi}) = \log P(X = x, Y = y \mid \hat{\Phi}) - \log P(X = x \mid Y = y, \hat{\Phi}) \qquad (4.90)

Now, we take the conditional expectation of \log P(Y = y \mid \hat{\Phi}) over X, computed with parameter vector \Phi:

E_{\Phi}[\log P(Y = y \mid \hat{\Phi})]_{X \mid Y = y} = \sum_{x} P(X = x \mid Y = y, \Phi) \log P(Y = y \mid \hat{\Phi}) = \log P(Y = y \mid \hat{\Phi}) \qquad (4.91)

where we denote E_{\Phi}[f]_{X \mid Y = y} as the expectation of function f over X computed with parameter vector \Phi.
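Eq. (4.90) holds for any value of the hidden x, which is what lets us take the expectation over X in Eq. (4.91) without changing the left-hand side. A quick numerical check on a made-up 2x2 discrete joint distribution:

```python
import numpy as np

# A tiny discrete joint P(X, Y): rows index x in {0, 1}, columns y in {0, 1}.
# The probabilities are arbitrary; they only need to sum to 1.
P_xy = np.array([[0.10, 0.25],
                 [0.30, 0.35]])

y = 1                                    # the observed value of Y
P_y = P_xy[:, y].sum()                   # P(Y = y)
for x in (0, 1):
    P_x_given_y = P_xy[x, y] / P_y       # P(X = x | Y = y), Bayes' rule (4.89)
    # Eq. (4.90): log P(Y=y) = log P(X=x, Y=y) - log P(X=x | Y=y), for ANY x.
    lhs = np.log(P_y)
    rhs = np.log(P_xy[x, y]) - np.log(P_x_given_y)
    print(x, np.isclose(lhs, rhs))       # True for both values of x
```
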
Then, using Eqs. (4.90) and (4.91), the following expression is obtained:

\log P(Y = y \mid \hat{\Phi}) = E_{\Phi}[\log P(X, Y = y \mid \hat{\Phi})]_{X \mid Y = y} - E_{\Phi}[\log P(X \mid Y = y, \hat{\Phi})]_{X \mid Y = y} = Q(\Phi, \hat{\Phi}) - H(\Phi, \hat{\Phi}) \qquad (4.92)

where

Q(\Phi, \hat{\Phi}) = E_{\Phi}[\log P(X, Y = y \mid \hat{\Phi})]_{X \mid Y = y} = \sum_{x} P(X = x \mid Y = y, \Phi) \log P(X = x, Y = y \mid \hat{\Phi}) \qquad (4.93)

and

H(\Phi, \hat{\Phi}) = E_{\Phi}[\log P(X \mid Y = y, \hat{\Phi})]_{X \mid Y = y} = \sum_{x} P(X = x \mid Y = y, \Phi) \log P(X = x \mid Y = y, \hat{\Phi}) \qquad (4.94)

The convergence of the EM algorithm lies in the fact that if we choose \hat{\Phi} so that

Q(\Phi, \hat{\Phi}) \ge Q(\Phi, \Phi) \qquad (4.95)

then

\log P(Y = y \mid \hat{\Phi}) \ge \log P(Y = y \mid \Phi) \qquad (4.96)

since it follows from Jensen's inequality that H(\Phi, \hat{\Phi}) \le H(\Phi, \Phi) [21]. The function Q(\Phi, \hat{\Phi}) is known as the Q-function or auxiliary function. This fact implies that we can maximize the Q-function, which is the expectation of the log-likelihood from the complete data pair (x, y), to update the parameter vector from \Phi to \hat{\Phi}, so that the incomplete log-likelihood L(x, \Phi) increases monotonically. Eventually, the likelihood converges to a local maximum if we iterate the process.

The name of the EM algorithm comes from E for expectation and M for maximization. The implementation of the EM algorithm includes the E (expectation) step, which calculates the auxiliary Q-function Q(\Phi, \hat{\Phi}), and the M (maximization) step, which maximizes Q(\Phi, \hat{\Phi}) over \hat{\Phi}. The general EM algorithm can be described as follows.

ALGORITHM 4.4: THE EM ALGORITHM

Step 1: Initialization: Choose an initial estimate \Phi.

Step 2: E-Step: Compute the auxiliary Q-function Q(\Phi, \hat{\Phi}) (which is also the expectation of the log-likelihood from complete data) based on \Phi.

Step 3: M-Step: Compute \hat{\Phi} = \arg\max_{\hat{\Phi}} Q(\Phi, \hat{\Phi}) to maximize the auxiliary Q-function.

Step 4: Iteration: Set \Phi = \hat{\Phi}; repeat from Step 2 until convergence.

The M-step of the EM algorithm is actually a maximum likelihood estimation from complete data (assuming we know the unobserved data x based on the observed data y and the initial parameter vector \Phi).
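Algorithm 4.4 can be illustrated with the classic case where the hidden data x is the mixture-component index of each observation y: EM for a two-component one-dimensional Gaussian mixture. All data and initialization choices below are invented for the example; the property guaranteed by Eqs. (4.95)-(4.96) is that the incomplete-data log-likelihood never decreases across iterations.

```python
import numpy as np

def normal_pdf(y, mu, var):
    """Density of N(mu, var) evaluated at y."""
    return np.exp(-(y - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

def em_gmm2(y, iters=30):
    """Algorithm 4.4 for a 2-component 1-D Gaussian mixture.

    The hidden data x is the component index of each observation in y.
    """
    w = np.array([0.5, 0.5])             # Step 1: initial estimate Phi
    mu = np.array([-1.0, 1.0])
    var = np.array([1.0, 1.0])
    loglik = []
    for _ in range(iters):
        # E-step (Step 2): posteriors P(X=k | Y=y, Phi), the weights in Q.
        joint = w * np.stack([normal_pdf(y, mu[k], var[k]) for k in (0, 1)],
                             axis=1)
        total = joint.sum(axis=1)
        gamma = joint / total[:, None]
        loglik.append(float(np.log(total).sum()))  # incomplete log-likelihood
        # M-step (Step 3): closed-form maximizer of the Q-function.
        Nk = gamma.sum(axis=0)
        w = Nk / len(y)
        mu = (gamma * y[:, None]).sum(axis=0) / Nk
        var = (gamma * (y[:, None] - mu) ** 2).sum(axis=0) / Nk
    return w, mu, var, loglik            # Step 4: iterated a fixed number of times

rng = np.random.default_rng(3)
y = np.concatenate([rng.normal(-2.0, 0.5, 150), rng.normal(2.0, 0.5, 150)])
w, mu, var, loglik = em_gmm2(y)
print(loglik[0], loglik[-1])   # the log-likelihood is non-decreasing
```

The recovered means should end up close to the true component means of -2 and 2, since the two components are well separated.
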
The EM algorithm is usually used in applications where no analytic solution exists for maximization of the log-likelihood of incomplete data. Instead, the Q-function is iteratively maximized to obtain the estimate of the parameter vector.

4.4.3. Multivariate Gaussian Mixture Density Estimation

The vector quantization process described in Section 4.4.1 partitions the data space into separate regions based on some distance measure, regardless of the probability distributions of the original data. This process may introduce errors in partitions that could potentially …