ALGORITHM 4.1: THE BACK PROPAGATION ALGORITHM

Step 1: Initialization: Set t = 0 and choose initial weight matrices W for each layer. Denote w_ij^k(t) as the weighting coefficient connecting the i-th input node in layer k-1 and the j-th output node in layer k at time t.

Step 2: Forward Propagation: Compute the values in each node from input layer to output layer in a propagating fashion, for k = 1 to K:

$$v_j^k = \mathrm{sigmoid}\Big( w_{0j}^k(t) + \sum_{i=1}^{N} w_{ij}^k(t)\, v_i^{k-1} \Big) \quad \forall j \qquad (4.72)$$

where $\mathrm{sigmoid}(x) = \frac{1}{1+e^{-x}}$ and v_j^k denotes the j-th node in the k-th layer.

Step 3: Back Propagation: Update the weight matrix for each layer, from output layer to input layer, according to:

$$w_{ij}^k(t+1) = w_{ij}^k(t) - \alpha \frac{\partial E}{\partial w_{ij}^k(t)} \qquad (4.73)$$

where $E = \sum_{i=1}^{S} \| y_i - o_i \|^2$, (y_1, y_2, ..., y_S) is the computed output vector in Step 2, and (o_1, o_2, ..., o_S) is the desired output vector.
α is referred to as the learning rate; it has to be small enough to guarantee convergence. One popular choice is 1/(t+1).

Step 4: Iteration: Let t = t + 1. Repeat Steps 2 and 3 until some convergence condition is met.
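To make Steps 2 and 3 concrete, here is a minimal sketch of one back-propagation iteration for a fully connected network with the sigmoid of Eq. (4.72). The NumPy representation, the bias stored in column 0 of each weight matrix, and the function name backprop_step are illustrative assumptions, not part of the original algorithm statement.

```python
import numpy as np

def sigmoid(x):
    # sigmoid(x) = 1 / (1 + e^(-x)), as in Eq. (4.72)
    return 1.0 / (1.0 + np.exp(-x))

def backprop_step(weights, v0, target, t):
    """One iteration of Algorithm 4.1.

    weights: list of (N_out, N_in + 1) matrices per layer, column 0 = bias w_0j.
    v0: input vector; target: desired output o; t: iteration counter.
    """
    alpha = 1.0 / (t + 1)                       # one popular learning-rate choice

    # Step 2: forward propagation from input layer to output layer.
    activations = [np.asarray(v0, dtype=float)]
    for W in weights:
        v_prev = np.concatenate(([1.0], activations[-1]))  # prepend bias input
        activations.append(sigmoid(W @ v_prev))

    # Step 3: back propagation of the error E = ||y - o||^2, Eq. (4.73).
    y = activations[-1]
    delta = 2.0 * (y - target) * y * (1.0 - y)  # d sigmoid / dx = y (1 - y)
    for k in range(len(weights) - 1, -1, -1):
        v_prev = np.concatenate(([1.0], activations[k]))
        grad = np.outer(delta, v_prev)          # dE/dw for layer k
        if k > 0:                               # pass the error down to layer k-1
            vk = activations[k]
            delta = (weights[k][:, 1:].T @ delta) * vk * (1.0 - vk)
        weights[k] -= alpha * grad              # weight update, Eq. (4.73)
    return y
```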
4.4. UNSUPERVISED ESTIMATION METHODS

As described in Section 4.2, in unsupervised learning, information about class ω of the data is unavailable. One might wonder why we are interested in such an unpromising problem, and whether or not it is possible to learn anything from incomplete data.

4.4.1. Vector Quantization

As described in Chapter 3, source coding refers to techniques that convert the signal source into a sequence of bits that are transmitted over a communication channel and then used to reproduce the original signal at a different location or time. In speech communication, the reproduced sound usually allows some acceptable level of distortion to achieve a low bit rate. The goal of source coding is to reduce the number of bits necessary to transmit or store data, subject to a distortion or fidelity criterion, or equivalently, to achieve the minimum possible distortion for a prescribed bit rate. Vector quantization (VQ) is one of the most efficient source-coding techniques; it has been used effectively in speech coding, image coding, and speech recognition [36, 85]. In both speech recognition and synthesis systems, vector quantization serves an important role in many aspects of the systems, from the discrete HMM to robust signal processing and data compression.
A vector quantizer is described by a codebook, which is a set of fixed prototype vectors, each referred to as a codeword. The design of the vector quantization process includes:

1. the distortion measure; and
2. the generation of each codeword in the codebook.

Assume that x = (x_1, x_2, ..., x_d)^t ∈ R^d is a d-dimensional vector whose components {x_k, 1 ≤ k ≤ d} are real-valued, continuous-amplitude random variables. After vector quantization, the vector x is mapped (quantized) to another discrete-amplitude d-dimensional vector z:

$$z = q(x) \qquad (4.74)$$

In Eq. (4.74), q(·) is the quantization operator. Typically, z is a vector from a finite set Z = {z_j, 1 ≤ j ≤ M}, where z_j is also a d-dimensional vector. The set Z is referred to as the codebook, M is the size of the codebook, and z_j is the j-th codeword. The size M of the codebook
is also called the number of partitions (or levels) in the codebook. To design a codebook, the d-dimensional space of the original random vector x can be partitioned into M regions or
cells {C_i, 1 ≤ i ≤ M}, and each cell C_i is associated with a codeword vector z_i. VQ then maps (quantizes) the input vector x to codeword z_i if x lies in C_i. That is,

$$q(x) = z_i \quad \text{if } x \in C_i \qquad (4.75)$$

An example of partitioning of a two-dimensional space (d = 2) for the purpose of vector quantization is shown in Figure 4.12. The shaded region enclosed by the dashed lines is the cell C_i. Any input vector x that lies in the cell C_i is quantized as z_i. The shapes of the various cells can be different. The positions of the codewords within each cell are shown by dots in Figure 4.12. The codeword z_i is also referred to as the centroid of the cell C_i because it can be viewed as the central point of the cell C_i.

Figure 4.12 Partitioning of a two-dimensional space into 16 cells.

When x is quantized as z, a quantization error results. A distortion measure d(x, z) can be defined between x and z to measure the quantization quality. Using this distortion measure, Eq. (4.75) can be reformulated as follows:

$$q(x) = z_i \quad \text{if and only if} \quad i = \arg\min_k d(x, z_k) \qquad (4.76)$$
The distortion measure between x and z is also known as a distance measure in the speech context. The measure must be tractable in order to be computed and analyzed, and it must be subjectively relevant so that differences in distortion values can be used to indicate differences in speech quality. The simplest measure is the sum of squared error, which assumes that the distortions contributed by quantizing the different parameters are equal:

$$d(x, z) = (x - z)^t (x - z) = \sum_{k=1}^{d} (x_k - z_k)^2 \qquad (4.77)$$

When the different parameters should not be weighted equally, the contributions can instead be weighted by the inverse covariance matrix Σ, so the distortion measure d(x, z) can be defined as follows:

$$d(x, z) = (x - z)^t \Sigma^{-1} (x - z) \qquad (4.78)$$

This distortion measure, known as the Mahalanobis distance, is actually the exponential term in a Gaussian density function.
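A minimal sketch of the quantization rule of Eq. (4.76) under both distortion measures above; the NumPy codebook layout and the function name quantize are illustrative assumptions.

```python
import numpy as np

def quantize(x, codebook, inv_cov=None):
    """Return the index i of the nearest codeword, Eq. (4.76).

    codebook: (M, d) array of codewords z_1 .. z_M.
    inv_cov:  optional (d, d) inverse covariance; when given, the
              Mahalanobis distance of Eq. (4.78) replaces the sum of
              squared error of Eq. (4.77).
    """
    diff = codebook - x                          # residuals x - z_k, one row per codeword
    if inv_cov is None:
        dists = np.sum(diff * diff, axis=1)      # Eq. (4.77)
    else:
        dists = np.einsum('md,de,me->m', diff, inv_cov, diff)  # Eq. (4.78)
    return int(np.argmin(dists))
```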
Another way to weight the contributions to the distortion measure is to use perceptually based distortion measures. Such distortion measures take advantage of subjective judgments: signal changes that make the sounds being perceived as different should be associated with large distances, and similarly, signal changes that keep the sound perceived the same should be associated with small distances. A number of perceptually based distortion measures have been used in speech coding [3, 75, 76].

4.4.1.2. The K-Means Algorithm

To design an M-level codebook, it is necessary to partition the d-dimensional space into M cells
and associate a quantized vector with each cell. Based on the source-coding principle, the criterion for optimization of the vector quantizer is to minimize the overall average distortion over all M levels of the VQ. The overall average distortion can be defined by

$$D = E[d(x, z)] = \sum_{i=1}^{M} p(x \in C_i)\, E[d(x, z_i) \mid x \in C_i] = \sum_{i=1}^{M} p(x \in C_i) \int d(x, z_i)\, p(x \mid x \in C_i)\, dx = \sum_{i=1}^{M} D_i \qquad (4.79)$$

where the integral is taken over all components of vector x; p(x ∈ C_i) denotes the prior probability of codeword z_i; p(x | x ∈ C_i) denotes the multidimensional probability density function of x in cell C_i; and D_i is the average distortion in cell C_i. No analytic solution exists to guarantee global optimality of the quantizer.
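In practice, D is estimated from training data rather than computed from the density. A short sketch, reusing the hypothetical quantize function above on a (T, d) array of training vectors:

```python
def average_distortion(data, codebook, inv_cov=None):
    # Empirical counterpart of Eq. (4.79): quantize each training vector
    # and average the per-vector distortions d(x, q(x)).
    total = 0.0
    for x in data:
        diff = x - codebook[quantize(x, codebook, inv_cov)]
        total += diff @ inv_cov @ diff if inv_cov is not None else diff @ diff
    return total / len(data)
```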
We say a quantizer is optimal if the overall average distortion is minimized over all M
levels of the quantizer. There are two necessary conditions for optimality. The ﬁrst is that
the optimal quantizer is realized by using a nearest-neighbor selection rule as specified by Eq. (4.76). Note that the average distortion for each cell C_i,

$$E[d(x, z_i) \mid x \in C_i] \qquad (4.80)$$

can be minimized when z_i is selected such that d(x, z_i) is minimized for x. This means that
the quantizer must choose the codeword that results in the minimum distortion with respect
to x. The second condition for optimality is that each codeword z_i is chosen to minimize the average distortion in cell C_i. That is, z_i is the vector that minimizes

$$D_i = p(z_i)\, E[d(x, z) \mid x \in C_i] \qquad (4.81)$$

Since the overall average distortion D is a linear combination of the average distortions in C_i, they can be independently computed after classification of x. The vector z_i is called the centroid of the cell C_i and is written

$$z_i = \mathrm{cent}(C_i) \qquad (4.82)$$

The centroid for a particular region (cell) depends on the definition of the distortion measure. In practice, given a set of training vectors {x_t, 1 ≤ t ≤ T}, a subset of K_i vectors will be located in cell C_i. In this case, p(x | z_i) can be assumed to be 1/K_i, and p(z_i) becomes K_i / T. The average distortion D_i in cell C_i can then be given by

$$D_i = \frac{1}{T} \sum_{x \in C_i} d(x, z_i) \qquad (4.83)$$

The second condition for optimality can then be rewritten as follows:

$$\hat{z}_i = \arg\min_{z_i} D_i(z_i) = \arg\min_{z_i} \frac{1}{T} \sum_{x \in C_i} d(x, z_i) \qquad (4.84)$$

When the sum of squared error in Eq. (4.77) is used for the distortion measure, the attempt to find this minimum amounts to setting the gradient of D_i with respect to z_i to zero:

$$\nabla_{z_i} D_i = \nabla_{z_i} \frac{1}{T} \sum_{x \in C_i} (x - z_i)^t (x - z_i) = -\frac{2}{T} \sum_{x \in C_i} (x - z_i) = 0 \qquad (4.85)$$

By solving Eq. (4.85), we obtain the least square error estimate of centroid z_i simply as the sample mean of all the training vectors x quantized to cell C_i:

$$\hat{z}_i = \frac{1}{K_i} \sum_{x \in C_i} x \qquad (4.86)$$
If the Mahalanobis distance measure [Eq. (4.78)] is used, minimization of D_i in Eq. (4.84) can be done similarly:

$$\nabla_{z_i} D_i = \nabla_{z_i} \frac{1}{T} \sum_{x \in C_i} (x - z_i)^t \Sigma^{-1} (x - z_i) = \frac{1}{T} \sum_{x \in C_i} \nabla_{z_i} (x - z_i)^t \Sigma^{-1} (x - z_i) = -\frac{2}{T} \sum_{x \in C_i} \Sigma^{-1} (x - z_i) = 0 \qquad (4.87)$$

and centroid z_i is obtained from

$$\hat{z}_i = \frac{1}{K_i} \sum_{x \in C_i} x \qquad (4.88)$$

One can see that z_i is again the sample mean of all the training vectors x quantized to cell C_i. Although Eq. (4.88) is obtained based on the Mahalanobis distance measure, it also works with a large class of Euclidean-like distortion measures [61]. Since the Mahalanobis
distance measure is actually the exponential term in a Gaussian density, minimization of the
distance criterion can be easily translated into maximization of the logarithm of the Gaussian
likelihood. Therefore, similar to the relationship between least square error estimation for
the linear discrimination function and the Gaussian classiﬁer described in Section 4.3.3.1,
the distance minimization process (least square error estimation) above is in fact a maximum
likelihood estimation.

According to these two conditions for VQ optimality, one can iteratively apply the nearest-neighbor selection rule and Eq. (4.88) to get the new centroid z_i for each cell in order to minimize the average distortion measure. This procedure is known as the k-means algorithm or the generalized Lloyd algorithm [29, 34, 56]. In the k-means algorithm, the basic idea is to partition the set of training vectors into M clusters C_i (1 ≤ i ≤ M) in such a way that the two necessary conditions for optimality described above are satisfied. The k-means algorithm can be described as in Algorithm 4.2.
ns a1gorithm thei'. _ _. Unsupervised Estimation Methods 169 ALGORITHM 4.2: THE KMEANS ALGORITHM Step 1: initialization: Choose some adequate method to derive initial VQ codewords
(z,_ ISiSM ) in the codebook. Step 2: Nearestneighbor Classification: Classify each training vector {x, } into one of the cells
C, by choosing the closest codeword z, (xeC,, i.f.f.d(x,z,)sd(x,zj) for all j ¢i). This classification is also called minimumdistance classifier. Step 3: Codebook Updating: Update the codeword of every cell by computing the centroid of
the training vectors in each cell according to Eq. (4.84) (i, = cent(C,), l s i s M ). Step 4: iteration: Repeat steps 2 and 3 until the ratio of the new overall distortion D at the cur
rent iteration relative to the overall distortion at the previous iteration is above a preset thresh
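A minimal sketch of Algorithm 4.2 under the sum-of-squared-error distortion of Eq. (4.77); the random-sampling initialization, the ratio threshold value, and the optional init argument (included so the LBG sketch in the next subsection can reuse this routine) are illustrative assumptions.

```python
import numpy as np

def kmeans(data, M, threshold=0.999, init=None, rng=None):
    """Algorithm 4.2 with the squared-error distortion of Eq. (4.77).

    data: (T, d) training vectors; stops when D_new / D_old > threshold.
    """
    rng = rng or np.random.default_rng(0)
    # Step 1: initialize the codebook, here by sampling training vectors.
    codebook = (init.copy() if init is not None
                else data[rng.choice(len(data), size=M, replace=False)].copy())
    prev_D = np.inf
    while True:
        # Step 2: nearest-neighbor classification of every training vector.
        d2 = ((data[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        D = d2[np.arange(len(data)), labels].mean()   # overall distortion
        if D / prev_D > threshold:                    # Step 4: convergence test
            return codebook, labels, D
        prev_D = D
        # Step 3: codebook updating; the centroid is the sample mean, Eq. (4.86).
        for i in range(M):
            members = data[labels == i]
            if len(members) > 0:
                codebook[i] = members.mean(axis=0)
```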
In the process of minimizing the average distortion measure, the k-means procedure actually breaks the minimization process into two steps. Assuming that the centroid z_i (or mean) for each cell C_i has been found, the minimization is accomplished simply by
partitioning all the training vectors into their corresponding cells according to the distortion
measure. After all of the new partitions are obtained, the minimization process involves
finding the new centroid within each cell to minimize its corresponding within-cell average distortion D_i based on Eq. (4.84). By iterating over these two steps, a new overall distortion
D smaller than that of the previous step can be obtained.

Theoretically, the k-means algorithm can converge only to a local optimum [56]. Furthermore, any such solution is, in general, not unique [33]. Initialization is often critical to the quality of the eventual converged codebook. Global optimality may be approximated by repeating the k-means algorithm for several sets of codebook initialization values, and then
one can choose the codebook that produces the minimum overall distortion. In the next subsection we describe methods for finding a decent initial codebook.

4.4.1.3. The LBG Algorithm

Since the initial codebook is critical to the ultimate quality of the final codebook, it has been
shown that it is advantageous to design an M-vector codebook in stages. This extended k-
means algorithm is known as the LBG algorithm proposed by Linde, Buzo, and Gray [56].
The LBG algorithm first computes a 1-vector codebook, then uses a splitting algorithm on the codewords to obtain the initial 2-vector codebook, and continues the splitting process until the desired M-vector codebook is obtained. The procedure is formally implemented by Algorithm 4.3.

ALGORITHM 4.3: THE LBG ALGORITHM

Step 1: Initialization: Set M (number of partitions or cells) = 1. Find the centroid of all the training data according to Eq. (4.84).

Step 2: Splitting: Split M into 2M partitions by splitting each current codeword: find two points that are far apart in each partition using a heuristic method, and use these two points as the new centroids for the new 2M codebook. Now set M = 2M.

Step 3: K-Means Stage: Use the k-means iterative algorithm described in the previous section to reach the best set of centroids for the new codebook.

Step 4: Termination: If M equals the VQ codebook size required, STOP; otherwise go to Step 2.
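A sketch of the LBG stages built on the kmeans sketch above; splitting each codeword by a small symmetric perturbation ±ε is one common heuristic, assumed here for illustration, and M_target is taken to be a power of two.

```python
import numpy as np

def lbg(data, M_target, eps=0.01):
    """Algorithm 4.3: design an M-vector codebook in stages by splitting."""
    # Step 1: M = 1; the single codeword is the centroid of all training data.
    codebook = data.mean(axis=0, keepdims=True)
    while len(codebook) < M_target:
        # Step 2: split every codeword into two nearby points (heuristic).
        codebook = np.vstack([codebook * (1.0 + eps), codebook * (1.0 - eps)])
        # Step 3: refine the doubled codebook with the k-means stage.
        codebook, _, _ = kmeans(data, len(codebook), init=codebook)
    # Step 4: M now equals the required codebook size.
    return codebook
```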
4.4.2. The EM Algorithm

We now introduce the EM algorithm, which is important to hidden Markov models and other learning techniques. It discovers model parameters by maximizing the log-likelihood of incomplete data and by iteratively maximizing the expectation of the log-likelihood from complete data. The EM algorithm is a generalization of the VQ algorithm described above.

The EM algorithm can also be viewed as a generalization of the MLE method when the data observed is incomplete. Without loss of generality, we use scalar random variables to describe it. In order to determine the model parameters from the observable data y, we also need to know some hidden data x (that is unobserved). For example, x may be a hidden number that refers to the component densities of observable data y, or x may be the underlying hidden state sequence in hidden Markov models (as discussed in Chapter 8). Without knowing this hidden data x, we can choose the estimate Φ̂ that maximizes the probability of the complete data as if x had in fact been observed, and repeat this process to iteratively improve the estimate of Φ.

The issue now is whether or not the process (EM algorithm) described above converges. Without loss of generality, we assume that both random variables X (unobserved) and Y (observed) are discrete. According to Bayes' rule,

$$P(X = x, Y = y \mid \hat{\Phi}) = P(X = x \mid Y = y, \hat{\Phi})\, P(Y = y \mid \hat{\Phi}) \qquad (4.89)$$

Our goal is to maximize the log-likelihood of the observable, real data y generated by parameter vector Φ̂. Based on Eq. (4.89), the log-likelihood can be expressed as follows:

$$\log P(Y = y \mid \hat{\Phi}) = \log P(X = x, Y = y \mid \hat{\Phi}) - \log P(X = x \mid Y = y, \hat{\Phi}) \qquad (4.90)$$
Now, we take the conditional expectation of log P(Y = y | Φ̂) over X, computed with parameter vector Φ:

$$E_{\Phi}[\log P(Y = y \mid \hat{\Phi})]_{X \mid Y = y} = \sum_{x} P(X = x \mid Y = y, \Phi) \log P(Y = y \mid \hat{\Phi}) = \log P(Y = y \mid \hat{\Phi}) \qquad (4.91)$$

where we denote E_Φ[f]_{X|Y=y} as the expectation of function f over X computed with parameter vector Φ. Then, using Eqs. (4.90) and (4.91), the following expression is obtained:

$$\log P(Y = y \mid \hat{\Phi}) = E_{\Phi}[\log P(X, Y = y \mid \hat{\Phi})]_{X \mid Y = y} - E_{\Phi}[\log P(X \mid Y = y, \hat{\Phi})]_{X \mid Y = y} = Q(\Phi, \hat{\Phi}) - H(\Phi, \hat{\Phi}) \qquad (4.92)$$

where

$$Q(\Phi, \hat{\Phi}) = E_{\Phi}[\log P(X, Y = y \mid \hat{\Phi})]_{X \mid Y = y} = \sum_{x} P(X = x \mid Y = y, \Phi) \log P(X = x, Y = y \mid \hat{\Phi}) \qquad (4.93)$$

and

$$H(\Phi, \hat{\Phi}) = E_{\Phi}[\log P(X \mid Y = y, \hat{\Phi})]_{X \mid Y = y} = \sum_{x} P(X = x \mid Y = y, \Phi) \log P(X = x \mid Y = y, \hat{\Phi}) \qquad (4.94)$$
The convergence of the EM algorithm lies in the fact that if we choose Φ̂ so that

$$Q(\Phi, \hat{\Phi}) \ge Q(\Phi, \Phi) \qquad (4.95)$$

then

$$\log P(Y = y \mid \hat{\Phi}) \ge \log P(Y = y \mid \Phi) \qquad (4.96)$$

since it follows from Jensen's inequality that H(Φ, Φ̂) ≤ H(Φ, Φ) [21]. The function
Q(Φ, Φ̂) is known as the Q-function or auxiliary function. This fact implies that we can maximize the Q-function, which is the expectation of the log-likelihood from the complete data pair (x, y), to update the parameter vector from Φ to Φ̂, so that the incomplete-data log-likelihood log P(Y = y | Φ) increases monotonically. Eventually, the likelihood will converge to a local maximum if we iterate the process.
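The Jensen step can be spelled out in one line; this is a standard argument, included here for completeness rather than taken from the original text:

$$H(\Phi, \hat{\Phi}) - H(\Phi, \Phi) = \sum_{x} P(x \mid y, \Phi) \log \frac{P(x \mid y, \hat{\Phi})}{P(x \mid y, \Phi)} \le \log \sum_{x} P(x \mid y, \Phi)\, \frac{P(x \mid y, \hat{\Phi})}{P(x \mid y, \Phi)} = \log 1 = 0$$

Combined with Eq. (4.92), any Φ̂ satisfying Eq. (4.95) therefore cannot decrease the incomplete-data log-likelihood.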
The name of the EM algorithm comes from E for expectation and M for maximization. The implementation of the EM algorithm includes the E (expectation) step, which calculates the auxiliary Q-function Q(Φ, Φ̂), and the M (maximization) step, which maximizes it over Φ̂ to obtain the new estimate. The general EM algorithm can be described in the following
way.

ALGORITHM 4.4: THE EM ALGORITHM

Step 1: Initialization: Choose an initial estimate Φ.

Step 2: E-Step: Compute the auxiliary Q-function Q(Φ, Φ̂) (which is also the expectation of the log-likelihood from the complete data) based on Φ.

Step 3: M-Step: Compute Φ̂ = argmax_Φ̂ Q(Φ, Φ̂) to maximize the auxiliary Q-function.

Step 4: Iteration: Set Φ = Φ̂; repeat from Step 2 until convergence.
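As a concrete instance of Algorithm 4.4, the sketch below runs EM for a two-component univariate Gaussian mixture, where the hidden data x is the component label of each observation; the model, the parameter names, and the fixed iteration count are illustrative assumptions.

```python
import numpy as np

def em_gmm2(y, iters=100):
    """EM for a 2-component 1-D Gaussian mixture; hidden x = component label."""
    # Step 1: initialize Phi = (weights w, means mu, variances var).
    w = np.array([0.5, 0.5])
    mu = np.array([y.min(), y.max()], dtype=float)
    var = np.array([y.var(), y.var()])
    for _ in range(iters):                        # Step 4: iterate
        # E-step: posteriors P(X = k | Y = y_t, Phi), the weights in Q.
        lik = np.stack([w[k] * np.exp(-(y - mu[k]) ** 2 / (2 * var[k]))
                        / np.sqrt(2 * np.pi * var[k]) for k in (0, 1)])
        gamma = lik / lik.sum(axis=0)             # (2, T) responsibilities
        # M-step: closed-form maximization of Q over the new parameters.
        Nk = gamma.sum(axis=1)
        w = Nk / len(y)
        mu = (gamma @ y) / Nk
        var = np.array([(gamma[k] * (y - mu[k]) ** 2).sum() / Nk[k]
                        for k in (0, 1)])
    return w, mu, var
```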
The M-step of the EM algorithm is actually a maximum likelihood estimation of the complete data (assuming we know the unobserved data x based on the observed data y and the initial parameter vector Φ). The EM algorithm is usually used in applications where no analytic solution exists for maximization of the log-likelihood of the incomplete data. Instead, the Q-function is iteratively maximized to obtain the estimate of the parameter vector.

4.4.3. Multivariate Gaussian Mixture Density Estimation

The vector quantization process described in Section 4.4.1 partitions the data space into
separate regions based on some distance measure regardless of the probability distributions
of the original data. This process may introduce errors in partitions that could potentially ...