CHAPTER 4

Pattern Recognition

Spoken language processing relies heavily on pattern recognition, one of the most challenging problems for machines. In a broader sense, the ability to recognize patterns forms the core of our intelligence. If we can incorporate the ability to reliably recognize patterns in our work and life, we can make machines much easier to use. The process of human pattern recognition is not well understood.

Due to the inherent variability of spoken language patterns, we emphasize the use of statistical approaches in this book. The decision for pattern recognition is based on appropriate probabilistic models of the patterns. This chapter presents several mathematical fundamentals for statistical pattern recognition and classification. In particular, Bayes' decision theory and estimation techniques for parameters of classifiers are introduced. Bayes' decision theory combines the likelihood of the observed data and prior knowledge of the categories. To build such a classifier or predictor, it is critical to estimate the prior class probabilities and the class-conditional probabilities.

Supervised learning has class information for the data. Only the probabilistic structure
needs to be learned. Maximum likelihood estimation (MLE) and maximum posterior probability estimation (MAP), which we discussed in Chapter 3, are two of the most powerful methods. Both MLE and MAP aim to maximize the likelihood function (for MAP, weighted by a prior on the parameters). The MLE criterion does not necessarily minimize the recognition error rate. Various discriminant estimation methods are introduced for that purpose. Maximum mutual information estimation (MMIE) is based on criteria to achieve maximum model separation (the model for the correct class is well separated from other competing models) instead of likelihood criteria. The MMIE criterion is one step closer but still is not directly related to minimizing the error rate. Other discriminant estimation methods, such as minimum-error-rate estimation, use the ultimate goal of pattern recognition: minimizing the classification errors. Neural networks are one class of discriminant estimation methods.

The EM algorithm is an iterative algorithm for unsupervised learning, in which class information is unavailable or only partially available. The EM algorithm forms the theoretical basis for training hidden Markov models (HMMs), as described in Chapter 8. To better understand the relationship between MLE and the EM algorithm, we first introduce vector quantization (VQ), a widely used source-coding technique in speech analysis. The well-known k-means clustering algorithm best illustrates the relationship between MLE and the EM algorithm. We close this chapter by introducing a powerful binary prediction and regression technique, classification and regression trees (CART). CART represents an important technique that combines rule-based expert knowledge and statistical learning.

4.1. BAYES' DECISION THEORY

Bayes' decision theory is based on the assumption that the decision problem is posed in probabilistic terms and that all of the relevant probability values are known. It can be viewed as a formalization of a common-sense procedure, i.e., the aim to achieve minimum-error-rate classification. This common-sense procedure can best be observed in the following real-world decision examples.

Consider the problem of making predictions for the stock market. We use the Dow
Jones Industrial average index to formulate our example, where we have to decide tomorrow's Dow Jones Industrial average index in one of three categories (events): Up, Down, or Unchanged. The available information is the probability function $P(\omega)$ of the three categories. The variable $\omega$ is a discrete random variable taking the values $\omega = \omega_i$ $(i = 1, 2, 3)$. We call the probability $P(\omega_i)$ a prior probability, since it reflects prior knowledge of tomorrow's Dow Jones Industrial index. If we have to make a decision based only on the prior probability, the most plausible decision may be made by selecting the class $\omega_i$ with the highest prior probability $P(\omega_i)$. This decision is unreasonable, in that we always make the same decision even though we know that all three categories of Dow Jones Industrial index changes will possibly appear. If we are given further observable data, such as the federal
funds interest rate or the jobless rate, we can make a more informed decision. Let x be a continuous random variable whose value is the federal-funds interest rate, and let $f_{X|\omega}(x \mid \omega)$ be a class-conditional pdf. For simplicity, we denote the pdf $f_{X|\omega}(x \mid \omega)$ as $p(x \mid \omega_i)$, where $i = 1, 2, 3$ unless there is ambiguity. The class-conditional probability density function is often referred to as the likelihood function as well, since it measures how likely it is that the underlying parametric model of class $\omega_i$ will generate the data sample x. Since we know the prior probability $P(\omega_i)$ and the class-conditional pdf $p(x \mid \omega_i)$, we can compute the posterior probability $P(\omega_i \mid x)$ using Bayes' rule:

$$P(\omega_i \mid x) = \frac{p(x \mid \omega_i) P(\omega_i)}{p(x)} \qquad (4.1)$$

where $p(x) = \sum_{i=1}^{3} p(x \mid \omega_i) P(\omega_i)$.

The probability term on the left-hand side of Eq. (4.1) is called the posterior probability, as it is the probability of class $\omega_i$ after observing the federal-funds interest rate x. An intuitive decision rule would be to choose the class $\omega_k$ with the greatest posterior probability. That is,

$$k = \arg\max_i P(\omega_i \mid x) \qquad (4.2)$$

In general, the denominator $p(x)$ in Eq. (4.1) is unnecessary because it is a constant term for all classes. Therefore, Eq. (4.2) becomes

$$k = \arg\max_i P(\omega_i \mid x) = \arg\max_i p(x \mid \omega_i) P(\omega_i) \qquad (4.3)$$

The rule in Eq. (4.3) is referred to as Bayes' decision rule. It shows how the observed data x changes the decision based on the prior probability $P(\omega_i)$ to one based on the posterior probability $P(\omega_i \mid x)$.
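As a concrete illustration of Eqs. (4.1) through (4.3), the following minimal Python sketch applies Bayes' rule to the three-category Dow Jones example; the prior and likelihood values are hypothetical, chosen only to make the computation visible:

```python
import numpy as np

# Hypothetical priors P(w_i) for Up, Down, Unchanged (not from the text).
priors = np.array([0.4, 0.3, 0.3])

# Hypothetical class-conditional likelihoods p(x | w_i), evaluated at one
# observed federal-funds interest rate x.
likelihoods = np.array([0.10, 0.45, 0.20])

# Bayes' rule, Eq. (4.1): P(w_i | x) = p(x | w_i) P(w_i) / p(x),
# with p(x) = sum_i p(x | w_i) P(w_i).
evidence = np.sum(likelihoods * priors)
posteriors = likelihoods * priors / evidence

# Bayes' decision rule, Eq. (4.3): pick the class with the largest
# p(x | w_i) P(w_i); the evidence p(x) is constant across classes.
k = int(np.argmax(likelihoods * priors))
classes = ["Up", "Down", "Unchanged"]
print(posteriors, "->", classes[k])
```

Since $p(x)$ is common to all classes, ranking by $p(x \mid \omega_i) P(\omega_i)$ as in Eq. (4.3) selects the same class as ranking by the posteriors of Eq. (4.2).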
Decision making based on the posterior probability is more reliable, because it employs prior knowledge together with the present observed data. As a matter of fact, when the prior knowledge is non-informative ($P(\omega_1) = P(\omega_2) = P(\omega_3) = 1/3$), the observed data fully control the decision. On the other hand, when the observed data are ambiguous, prior knowledge controls the decision. There are many kinds of decision rules based on posterior probability. Our interest is to find the decision rule that leads to minimum overall risk, or minimum error rate, in decision making.

4.1.1. Minimum-Error-Rate Decision Rules

Bayes' decision rule is designed to minimize the overall risk involved in making a decision.
Bayes’ decision based on posterior probability P(a),. x) instead of prior probability P((o,)
IS a natural choice. Given an observation x, if P(a)k x) 2 P(a),. x) for all i¢ k , we can de cide that the true class is (0/. _ To justify this procedure, we show such a decision results in
nununum decision error. 136 Pattern Recognitiou Let Q={cq,..., I} be the ﬁnite Set of s possible categories to be predicted and
$\Delta = \{\delta_1, \ldots, \delta_t\}$ be a finite set of t possible decisions. Let $l(\delta_i \mid \omega_j)$ be the loss function incurred for making decision $\delta_i$ when the true class is $\omega_j$. Using the prior probability $P(\omega_j)$ and the class-conditional pdf $p(x \mid \omega_j)$, the posterior probability $P(\omega_j \mid x)$ is computed by Bayes' rule as shown in Eq. (4.1). Since the posterior probability $P(\omega_j \mid x)$ is the probability that the true class is $\omega_j$ after observing the data x, the expected loss associated with making decision $\delta_i$ is:

$$R(\delta_i \mid x) = \sum_{j=1}^{s} l(\delta_i \mid \omega_j) P(\omega_j \mid x) \qquad (4.4)$$

In decision-theoretic terminology, the above expression is called the conditional risk.
The overall risk R is the expected loss associated with a given decision rule. The decision rule is expressed as a decision function $\delta(x)$ that maps the data x to one of the decisions $\Delta = \{\delta_1, \ldots, \delta_t\}$. Since $R(\delta_i \mid x)$ is the conditional risk associated with decision $\delta_i$, the overall risk is given by:

$$R = \int R(\delta(x) \mid x)\, p(x)\, dx \qquad (4.5)$$

If the decision function $\delta(x)$ is chosen so that the conditional risk $R(\delta(x) \mid x)$ is minimized
for every x, the overall risk is minimized. This leads to the Bayes' decision rule: to minimize the overall risk, we compute the conditional risk shown in Eq. (4.4) for $i = 1, \ldots, t$ and select the decision $\delta_i$ for which the conditional risk $R(\delta_i \mid x)$ is minimum. The resulting minimum overall risk is known as the Bayes' risk, which gives the best performance possible.
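To make Eqs. (4.4) and (4.5) concrete, here is a minimal sketch that evaluates the conditional risk of each decision and selects the minimum-risk one; the posterior values and the loss matrix are hypothetical:

```python
import numpy as np

# Posterior probabilities P(w_j | x) for s = 3 classes (hypothetical values).
posteriors = np.array([0.2, 0.5, 0.3])

# Hypothetical loss matrix l(delta_i | w_j): rows are decisions, columns are
# true classes. Deciding delta_1 when the truth is w_0 is made costlier here.
loss = np.array([[0.0, 1.0, 1.0],
                 [4.0, 0.0, 1.0],
                 [1.0, 1.0, 0.0]])

# Conditional risk, Eq. (4.4): R(delta_i | x) = sum_j l(delta_i | w_j) P(w_j | x).
cond_risk = loss @ posteriors

# Bayes' decision rule: choose the decision with minimum conditional risk.
i = int(np.argmin(cond_risk))
print(cond_risk, "-> decision", i)
```

With an asymmetric loss matrix like this one, the minimum-risk decision can differ from the maximum-posterior decision, which is exactly why the general rule is stated in terms of risk.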
The loss function $l(\delta_i \mid \omega_j)$ in the Bayes' decision rule can be defined as:

$$l(\delta_i \mid \omega_j) = \begin{cases} 0, & i = j \\ 1, & i \neq j \end{cases} \qquad i, j = 1, \ldots, s \qquad (4.6)$$

This loss function assigns no loss to a correct decision, where the true class is $\omega_i$ and the decision is $\delta_i$, which implies that the true class must be $\omega_i$. It assigns a unit loss to any error where $i \neq j$; i.e., all errors are equally costly. This type of loss function is known as a symmetrical or zero-one loss function. The risk corresponding to this loss function equals
the classification error rate, as shown in the following equation:

$$R(\delta_i \mid x) = \sum_{j=1}^{s} l(\delta_i \mid \omega_j) P(\omega_j \mid x) = \sum_{j \neq i} P(\omega_j \mid x) = \sum_{j=1}^{s} P(\omega_j \mid x) - P(\omega_i \mid x) = 1 - P(\omega_i \mid x) \qquad (4.7)$$
Here $P(\omega_i \mid x)$ is the conditional probability that decision $\delta_i$ is correct after observing the data x. Therefore, in order to minimize the classification error rate, we have to choose the decision of class i that maximizes the posterior probability $P(\omega_i \mid x)$. Furthermore, since $p(x)$ is a constant, the decision is equivalent to picking the class i that maximizes $p(x \mid \omega_i) P(\omega_i)$. The Bayes' decision rule can be formulated as follows:

$$\delta(x) = \arg\max_i P(\omega_i \mid x) = \arg\max_i p(x \mid \omega_i) P(\omega_i) \qquad (4.8)$$

This decision rule, which is based on the maximum of the posterior probability $P(\omega_i \mid x)$, is called the minimum-error-rate decision rule. It minimizes the classification error rate. Although our description is for a random variable x, Bayes' decision rule is applicable to a multivariate random vector x without loss of generality.
A pattern classifier can be regarded as a device for partitioning the feature space into decision regions. Without loss of generality, we consider a two-class case. Assume that the classifier divides the space $\mathfrak{R}$ into two regions, $\mathfrak{R}_1$ and $\mathfrak{R}_2$. To compute the likelihood of errors, we need to consider two cases. In the first case, x falls in $\mathfrak{R}_1$, but the true class is $\omega_2$. In the other case, x falls in $\mathfrak{R}_2$, but the true class is $\omega_1$. Since these two cases are mutually exclusive, we have

$$\begin{aligned} P(\text{error}) &= P(x \in \mathfrak{R}_1, \omega_2) + P(x \in \mathfrak{R}_2, \omega_1) \\ &= P(x \in \mathfrak{R}_1 \mid \omega_2) P(\omega_2) + P(x \in \mathfrak{R}_2 \mid \omega_1) P(\omega_1) \\ &= \int_{\mathfrak{R}_1} p(x \mid \omega_2) P(\omega_2)\, dx + \int_{\mathfrak{R}_2} p(x \mid \omega_1) P(\omega_1)\, dx \end{aligned} \qquad (4.9)$$

Figure 4.1 illustrates the calculation of the classification error in Eq. (4.9). The two
terms in the summation are merely the tail areas of the function $p(x \mid \omega_i) P(\omega_i)$.

[Figure 4.1 Calculation of the likelihood of classification error [22]. The shaded area represents the integral value in Eq. (4.9).]

It is clear that this decision boundary is not optimal. If we move the decision boundary a little bit to the left, so that the decision is made to choose the class i based on the maximum value of $p(x \mid \omega_i) P(\omega_i)$, the tail integral area $P(\text{error})$ becomes minimum, which is the Bayes' decision rule.
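The effect described above can be checked numerically. The sketch below evaluates Eq. (4.9) for a two-class problem with hypothetical Gaussian class-conditional densities (the means, variances, and priors are illustrative, not from the text) and compares an arbitrary boundary against the best one:

```python
import numpy as np
from scipy.stats import norm

# Hypothetical two-class problem: Gaussian class-conditional pdfs and priors.
p1, p2 = 0.6, 0.4                    # P(w_1), P(w_2)
g1 = norm(loc=-1.0, scale=1.0)       # p(x | w_1)
g2 = norm(loc=2.0, scale=1.0)        # p(x | w_2)

def p_error(b):
    """Eq. (4.9) with regions R1 = (-inf, b] and R2 = (b, inf):
    P(error) = P(x in R1 | w_2)P(w_2) + P(x in R2 | w_1)P(w_1)."""
    return g2.cdf(b) * p2 + g1.sf(b) * p1

# Sweep candidate boundaries; the minimum lands where the weighted densities
# cross, i.e. where p(x|w_1)P(w_1) = p(x|w_2)P(w_2): the Bayes boundary.
bs = np.linspace(-4.0, 5.0, 901)
best = bs[np.argmin(p_error(bs))]
print("P(error) at b=0:", p_error(0.0))
print("Bayes boundary :", best, "P(error):", p_error(best))
```

Moving the boundary to the crossing point of the two weighted densities shrinks the shaded tail areas of Figure 4.1 to their minimum.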
4.1.2. Discriminant Functions

The decision problem above can also be viewed as a pattern classification problem where unknown data x¹ are classified into known categories, such as the classification of sounds into phonemes using spectral data x. A classifier is designed to classify data x into s categories by using s discriminant functions, $d_i(\mathbf{x})$, computing the similarities between the unknown data x and each class $\omega_i$ and assigning x to class $\omega_j$ if

$$d_j(\mathbf{x}) > d_i(\mathbf{x}) \quad \forall i \neq j \qquad (4.10)$$

This representation of a classifier is illustrated in Figure 4.2.
[Figure 4.2 Block diagram of a classifier based on discriminant functions [22]: a feature vector $(x_1, x_2, \ldots, x_d)$ feeds s discriminant functions $d_1(\mathbf{x}), \ldots, d_s(\mathbf{x})$, followed by a maximum selector that outputs the decision.]
A Bayes' classifier can be represented in the same way. Based on the Bayes' classifier, unknown data x are classified on the basis of Bayes' decision rule, which minimizes the conditional risk $R(\delta_i \mid x)$. Since the classification decision of a pattern classifier is based on the maximum discriminant function shown in Eq. (4.10), we define our discriminant function as:

$$d_i(\mathbf{x}) = -R(\delta_i \mid \mathbf{x}) \qquad (4.11)$$

¹ Assuming x is a d-dimensional vector.
As such, the maximum discriminant function corresponds to the minimum conditional risk. In the minimum-error-rate classifier, the decision rule is to maximize the posterior probability $P(\omega_i \mid x)$. Thus, the discriminant function can be written as follows:

$$d_i(\mathbf{x}) = P(\omega_i \mid \mathbf{x}) = \frac{p(\mathbf{x} \mid \omega_i) P(\omega_i)}{p(\mathbf{x})} = \frac{p(\mathbf{x} \mid \omega_i) P(\omega_i)}{\sum_{j=1}^{s} p(\mathbf{x} \mid \omega_j) P(\omega_j)} \qquad (4.12)$$
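The structure of Figure 4.2 maps directly onto code: a bank of discriminant functions followed by a maximum selector. The sketch below uses the posterior-based discriminant of Eq. (4.12) with hypothetical one-dimensional Gaussian likelihoods and priors:

```python
import numpy as np
from scipy.stats import norm

# A bank of s = 3 discriminant functions followed by a maximum selector,
# as in Figure 4.2. Priors and Gaussian likelihoods are hypothetical.
priors = np.array([0.3, 0.4, 0.3])                          # P(w_i)
models = [norm(-2.0, 1.0), norm(0.0, 1.0), norm(2.5, 1.0)]  # p(x | w_i)

def discriminants(x):
    """d_i(x) of Eq. (4.12): p(x | w_i) P(w_i) normalized by p(x)."""
    scores = np.array([m.pdf(x) for m in models]) * priors
    return scores / scores.sum()

def maximum_selector(x):
    """Assign x to the class whose discriminant is largest, Eq. (4.10)."""
    return int(np.argmax(discriminants(x)))

print(maximum_selector(-1.8), maximum_selector(2.2))  # -> 0 2
```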
There is a very interesting relationship between Bayes' decision rule and the hypothesis testing method described in Chapter 3. For a two-class pattern recognition problem, the Bayes' decision rule in Eq. (4.2) can be written as follows:

$$p(x \mid \omega_1) P(\omega_1) \underset{\omega_2}{\overset{\omega_1}{\gtrless}} p(x \mid \omega_2) P(\omega_2) \qquad (4.13)$$

Eq. (4.13) can be rewritten as:

$$\ell(x) = \frac{p(x \mid \omega_1)}{p(x \mid \omega_2)} \underset{\omega_2}{\overset{\omega_1}{\gtrless}} \frac{P(\omega_2)}{P(\omega_1)} \qquad (4.14)$$

The term $\ell(x)$ is called the likelihood ratio and is the basic quantity in hypothesis testing [73].
The term $P(\omega_2)/P(\omega_1)$ is called the threshold value of the likelihood ratio for the decision. Often it is convenient to use the log-likelihood ratio instead of the likelihood ratio for the decision rule. Namely, the following single discriminant function can be used instead of $d_1(x)$ and $d_2(x)$:

$$d(x) = \log \ell(x) = \log p(x \mid \omega_1) - \log p(x \mid \omega_2) \underset{\omega_2}{\overset{\omega_1}{\gtrless}} \log P(\omega_2) - \log P(\omega_1) \qquad (4.15)$$
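A minimal sketch of the two-class log-likelihood-ratio test of Eq. (4.15), assuming hypothetical Gaussian likelihoods and illustrative priors:

```python
import numpy as np
from scipy.stats import norm

# Hypothetical two-class setup; means, variances, and priors are illustrative.
p_w1, p_w2 = 0.7, 0.3
g1 = norm(loc=0.0, scale=1.0)   # p(x | w_1)
g2 = norm(loc=3.0, scale=1.0)   # p(x | w_2)

def decide(x):
    """Eq. (4.15): compare d(x) = log p(x|w_1) - log p(x|w_2)
    against the threshold log P(w_2) - log P(w_1)."""
    d = g1.logpdf(x) - g2.logpdf(x)
    return "w_1" if d > np.log(p_w2) - np.log(p_w1) else "w_2"

print(decide(1.6), decide(2.5))  # -> w_1 w_2
```

With these priors the decision boundary shifts from x = 1.5 (the equal-prior midpoint) to about x = 1.78, so x = 1.6 is still assigned to $\omega_1$.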
As the classifier assigns data x to class $\omega_i$, the data space is divided into s regions, $\mathfrak{R}_1, \mathfrak{R}_2, \ldots, \mathfrak{R}_s$, called decision regions. The boundaries between decision regions are called decision boundaries and are represented as follows (if they are contiguous):

$$d_i(\mathbf{x}) = d_j(\mathbf{x}) \quad i \neq j \qquad (4.16)$$

For points on the decision boundary, the classification can go either way. For a Bayes' classifier, the conditional risk associated with either decision is the same, and how the tie is broken does not matter. Figure 4.3 illustrates an example of decision boundaries and regions for a three-class classifier on a scalar data sample x.

[Figure 4.3 An example of decision boundaries and regions. For simplicity, we use a scalar variable x instead of a multidimensional vector [22].]

4.2. HOW TO CONSTRUCT CLASSIFIERS

In the Bayes' classifier, or the minimum-error-rate classifier, the prior probability $P(\omega_i)$
and the class-conditional pdf $p(x \mid \omega_i)$ are known. Unfortunately, in pattern recognition, we rarely have complete knowledge of the class-conditional pdfs and/or the prior probabilities. They often must be estimated or learned from the training data. In practice, the estimation of the prior probabilities is relatively easy; estimation of the class-conditional pdf is more complicated. There is always the concern of having sufficient training data relative to the huge dimensionality of the sample data x. In this chapter we focus on estimation methods for the class-conditional pdf.

The estimation of the class-conditional pdfs can be nonparametric or parametric. In nonparametric estimation, no model structure is assumed, and the pdf is directly estimated from the training data. When large amounts of sample data are available, nonparametric learning can accurately reflect the underlying probabilistic structure of the training data. However, available sample data are normally limited in practice, and parametric learning can achieve better estimates if valid model assumptions are made. In parametric learning, some general knowledge about the problem space allows one to parameterize the class-conditional pdf, so the severity of sparse training data can be reduced significantly. Suppose the pdf $p(x \mid \omega_i)$ is assumed to have a certain probabilistic structure, such as a Gaussian pdf. In such cases, only the mean vector $\boldsymbol{\mu}_i$ (or mean $\mu_i$) and covariance matrix $\Sigma_i$ (or variance $\sigma_i^2$) need to be estimated.
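For instance, under the Gaussian assumption just described, estimating the class-conditional pdf reduces to estimating $\boldsymbol{\mu}_i$ and $\Sigma_i$ from that class's training samples. A minimal sketch with synthetic data (the true mean and covariance are made up for the demonstration); the maximum likelihood estimates, in the sense of Chapter 3, are the sample mean and the 1/n sample covariance:

```python
import numpy as np

# Hypothetical training samples for one class (rows are d-dimensional vectors).
rng = np.random.default_rng(0)
X = rng.multivariate_normal(mean=[1.0, -2.0],
                            cov=[[2.0, 0.5], [0.5, 1.0]], size=500)

# MLE of the Gaussian parameters: sample mean and (biased, 1/n) covariance.
mu_hat = X.mean(axis=0)
Sigma_hat = (X - mu_hat).T @ (X - mu_hat) / len(X)

print(mu_hat)     # close to [1.0, -2.0]
print(Sigma_hat)  # close to [[2.0, 0.5], [0.5, 1.0]]
```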
When the observed data x take only discrete values from a finite set of N values, the class-conditional pdf is often assumed nonparametric, so there will be $N - 1$ free parameters in the probability function $p(x \mid \omega_i)$.² When the observed data x take continuous values, parametric approaches are usually necessary. In many systems, the continuous class-conditional pdf (likelihood) $p(x \mid \omega_i)$ is assumed to be a Gaussian distribution or a mixture of Gaussian distributions.

² Since all the probabilities must sum to one.
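For the discrete case above, the nonparametric estimate of $p(x \mid \omega_i)$ is just the table of relative frequencies, with $N - 1$ free parameters since the entries must sum to one. A minimal sketch with hypothetical samples:

```python
from collections import Counter

# Hypothetical discrete observations for one class, drawn from N = 4 values.
samples = ["a", "b", "a", "c", "a", "b", "d", "a"]

# Nonparametric estimate of p(x | w_i): relative frequencies. With N values
# the probabilities must sum to one, leaving N - 1 free parameters.
counts = Counter(samples)
total = sum(counts.values())
p_hat = {v: c / total for v, c in counts.items()}
print(p_hat)  # {'a': 0.5, 'b': 0.25, 'c': 0.125, 'd': 0.125}
```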
In pattern recognition, the set of data samples that is collected to estimate the parameters of the recognizer (including the prior and class-conditional pdf) is referred to as the training set. In contrast to the training set, the testing set is the independent set of data samples used to evaluate the recognition performance of the recognizer.

For parameter estimation or learning, it is also important to distinguish between supervised learning and unsupervised learning. Let's denote the pair $(x, \omega)$ as a sample, where x is the observed data and $\omega$ is the class from which the data x comes. From the definition, it is clear that $(x, \omega)$ are jointly distributed random variables. In supervised learning, $\omega$, the information about the class of the sample data x, is given. Such sample data are usually called labeled data or complete data, in contrast to incomplete data, where the class information $\omega$ is missing, as in unsupervised learning. Techniques for parametric unsupervised learning are discussed in Section 4.4.

In Chapter 3 we introduced the two most popular parameter estimation techniq...