CHAPTER 4

Pattern Recognition

Spoken language processing relies heavily on pattern recognition, one of the most challenging problems for machines. In a broader sense, the ability to recognize patterns forms the core of our intelligence. If we can incorporate the ability to reliably recognize patterns in our work and life, we can make machines much easier to use. The process of human pattern recognition is not well understood.

Due to the inherent variability of spoken language patterns, we emphasize the use of statistical approaches in this book. The decision for pattern recognition is based on appropriate probabilistic models of the patterns. This chapter presents several mathematical fundamentals for statistical pattern recognition and classification. In particular, Bayes' decision theory and estimation techniques for the parameters of classifiers are introduced. Bayes' decision theory combines the observed data with prior knowledge of the categories. To build such a classifier or predictor, it is critical to estimate the prior class probabilities and the class-conditional probabilities.

Supervised learning has class information for the data; only the probabilistic structure needs to be learned. Maximum likelihood estimation (MLE) and maximum posterior probability estimation (MAP), which we discussed in Chapter 3, are two of the most powerful methods. Both MLE and MAP aim to maximize the likelihood function (MAP additionally weights it by a prior over the parameters). The MLE criterion does not necessarily minimize the recognition error rate, so various discriminant estimation methods are introduced for that purpose. Maximum mutual information estimation (MMIE) is based on criteria that achieve maximum model separation (the model for the correct class is well separated from other competing models) instead of likelihood criteria. The MMIE criterion is one step closer, but still is not directly related to minimizing the error rate. Other discriminant estimation methods, such as minimum-error-rate estimation, use the ultimate goal of pattern recognition: minimizing the classification errors. Neural networks are one class of discriminant estimation methods.

The EM algorithm is an iterative algorithm for unsupervised learning in which class information is unavailable or only partially available. The EM algorithm forms the theoretical basis for training hidden Markov models (HMMs), as described in Chapter 8. To better understand the relationship between MLE and the EM algorithm, we first introduce vector quantization (VQ), a widely used source-coding technique in speech analysis. The well-known k-means clustering algorithm best illustrates the relationship between MLE and the EM algorithm. We close this chapter by introducing a powerful binary prediction and regression technique, classification and regression trees (CART). CART represents an important technique that combines rule-based expert knowledge and statistical learning.

4.1. BAYES' DECISION THEORY

Bayes' decision theory forms the basis of statistical pattern recognition. The theory is based on the assumption that the decision problem can be specified in probabilistic terms and that all of the relevant probability values are known. It can be viewed as a formalization of a common-sense procedure, i.e., the aim to achieve minimum-error-rate classification. This common-sense procedure can be best observed in the following real-world decision examples.

Consider the problem of making predictions for the stock market. We use the Dow Jones Industrial Average index to formulate our example, where we have to classify tomorrow's Dow Jones Industrial Average index into one of three categories (events): Up, Down, or Unchanged. The available information is the probability function P(ω) of the three categories.
The variable ω is a discrete random variable with the value ω = ω_i (i = 1, 2, 3). We call the probability P(ω_i) a prior probability, since it reflects prior knowledge of tomorrow's Dow Jones Industrial index. If we have to make a decision based only on the prior probability, the most plausible decision may be made by selecting the class ω_i with the highest prior probability P(ω_i). This decision is unreasonable, in that we always make the same decision even though we know that all three categories of Dow Jones Industrial index changes will possibly appear. If we are given further observable data, such as the federal-funds interest rate or the jobless rate, we can make a more informed decision. Let x be a continuous random variable whose value is the federal-funds interest rate, and f_{x|ω}(x|ω) be a class-conditional pdf. For simplicity, we denote the pdf f_{x|ω}(x|ω) as p(x|ω_i), where i = 1, 2, 3, unless there is ambiguity. The class-conditional probability density function is often referred to as the likelihood function as well, since it measures how likely it is that the underlying parametric model of class ω_i will generate the data sample x. Since we know the prior probability P(ω_i) and the class-conditional pdf p(x|ω_i), we can compute the conditional probability P(ω_i|x) using Bayes' rule:

    P(\omega_i \mid x) = \frac{p(x \mid \omega_i)\, P(\omega_i)}{p(x)}    (4.1)

where p(x) = \sum_{i=1}^{3} p(x \mid \omega_i) P(\omega_i).

The probability term on the left-hand side of Eq. (4.1) is called the posterior probability, as it is the probability of class ω_i after observing the federal-funds interest rate x. An intuitive decision rule would be to choose the class ω_k with the greatest posterior probability. That is,

    k = \arg\max_i P(\omega_i \mid x)    (4.2)

In general, the denominator p(x) in Eq. (4.1) is unnecessary because it is a constant term for all classes. Therefore, Eq. (4.2) becomes

    k = \arg\max_i P(\omega_i \mid x) = \arg\max_i p(x \mid \omega_i) P(\omega_i)    (4.3)

The rule in Eq. (4.3) is referred to as Bayes' decision rule. It shows how the observed data x changes the decision based on the prior probability P(ω_i) to one based on the posterior probability P(ω_i|x). Decision making based on the posterior probability is more reliable, because it employs prior knowledge together with the present observed data. As a matter of fact, when the prior knowledge is non-informative (P(ω_1) = P(ω_2) = P(ω_3) = 1/3), the observed data fully control the decision. On the other hand, when the observed data are ambiguous, prior knowledge controls the decision.
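To make Eqs. (4.1) through (4.3) concrete, the following minimal Python sketch (not from the book) applies Bayes' rule to the Dow Jones example. The priors and the Gaussian class-conditional densities of the federal-funds interest rate are hypothetical numbers chosen purely for illustration.

import math

# Hypothetical prior probabilities P(w_i) for the three categories.
priors = {"Up": 0.4, "Down": 0.35, "Unchanged": 0.25}

# Hypothetical class-conditional pdfs p(x | w_i): Gaussians over the
# federal-funds interest rate x, parameterized by (mean, std. deviation).
likelihood_params = {"Up": (3.0, 0.5), "Down": (4.5, 0.7), "Unchanged": (3.8, 0.4)}

def gaussian_pdf(x, mean, std):
    """Univariate Gaussian density N(x; mean, std^2)."""
    return math.exp(-0.5 * ((x - mean) / std) ** 2) / (std * math.sqrt(2.0 * math.pi))

def posteriors(x):
    """Posterior P(w_i | x) via Bayes' rule, Eq. (4.1)."""
    joint = {w: gaussian_pdf(x, *likelihood_params[w]) * priors[w] for w in priors}
    evidence = sum(joint.values())          # p(x) = sum_i p(x | w_i) P(w_i)
    return {w: joint[w] / evidence for w in joint}

def bayes_decision(x):
    """Bayes' decision rule, Eq. (4.3): pick the class maximizing p(x | w_i) P(w_i)."""
    return max(priors, key=lambda w: gaussian_pdf(x, *likelihood_params[w]) * priors[w])

x = 3.6                                     # observed interest rate (illustrative)
print(posteriors(x))                        # posterior probabilities, summing to 1
print(bayes_decision(x))                    # class with the largest posterior

Note that dividing by p(x) only rescales the scores, so the argmax in Eq. (4.3) is unchanged; this is why the evidence term can be dropped in practice.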
There are many kinds of decision rules based on the posterior probability. Our interest is to find the decision rule that leads to minimum overall risk, or minimum error rate, in decision.

4.1.1. Minimum-Error-Rate Decision Rules

Bayes' decision rule is designed to minimize the overall risk involved in making a decision. Basing Bayes' decision on the posterior probability P(ω_i|x) instead of the prior probability P(ω_i) is a natural choice. Given an observation x, if P(ω_k|x) ≥ P(ω_i|x) for all i ≠ k, we can decide that the true class is ω_k. To justify this procedure, we show that such a decision results in minimum decision error.

Let Ω = {ω_1, ..., ω_s} be the finite set of s possible categories to be predicted and Δ = {δ_1, ..., δ_t} be a finite set of t possible decisions. Let l(δ_i|ω_j) be the loss incurred for making decision δ_i when the true class is ω_j. Using the prior probability P(ω_i) and the class-conditional pdf p(x|ω_i), the posterior probability P(ω_i|x) is computed by Bayes' rule as shown in Eq. (4.1). Since the posterior probability P(ω_j|x) is the probability that the true class is ω_j after observing the data x, the expected loss associated with making decision δ_i is:

    R(\delta_i \mid x) = \sum_{j=1}^{s} l(\delta_i \mid \omega_j)\, P(\omega_j \mid x)    (4.4)

In decision-theoretic terminology, the above expression is called the conditional risk. The overall risk R is the expected loss associated with a given decision rule. The decision rule is employed as a decision function δ(x) that maps the data x to one of the decisions Δ = {δ_1, ..., δ_t}. Since R(δ_i|x) is the conditional risk associated with decision δ_i, the overall risk is given by:

    R = \int R(\delta(\mathbf{x}) \mid \mathbf{x})\, p(\mathbf{x})\, d\mathbf{x}    (4.5)

If the decision function δ(x) is chosen so that the conditional risk R(δ(x)|x) is minimized for every x, the overall risk is minimized. This leads to Bayes' decision rule: to minimize the overall risk, we compute the conditional risk shown in Eq. (4.4) for i = 1, ..., t and select the decision δ_i for which the conditional risk R(δ_i|x) is minimum. The resulting minimum overall risk is known as the Bayes' risk, which is the best performance possible.

The loss function l(δ_i|ω_j) in Bayes' decision rule can be defined as:

    l(\delta_i \mid \omega_j) = \begin{cases} 0, & i = j \\ 1, & i \neq j \end{cases} \qquad i, j = 1, \ldots, s    (4.6)

This loss function assigns no loss to a correct decision, where the true class is ω_i and the decision is δ_i, which implies that the true class must be ω_i. It assigns a unit loss to any error where i ≠ j; i.e., all errors are equally costly. This type of loss function is known as a symmetrical or zero-one loss function. The risk corresponding to this loss function equals the classification error rate, as shown in the following equation:

    R(\delta_i \mid x) = \sum_{j=1}^{s} l(\delta_i \mid \omega_j)\, P(\omega_j \mid x) = \sum_{j \neq i} P(\omega_j \mid x) = \sum_{j=1}^{s} P(\omega_j \mid x) - P(\omega_i \mid x) = 1 - P(\omega_i \mid x)    (4.7)

Here P(ω_i|x) is the conditional probability that decision δ_i is correct after observing the data x.
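The short Python sketch below (illustrative, not from the book) evaluates the conditional risk of Eq. (4.4) for an arbitrary loss matrix and checks numerically that the zero-one loss of Eq. (4.6) reduces it to 1 - P(ω_i|x), as in Eq. (4.7). The posterior values used are hypothetical.

def conditional_risk(loss, posteriors):
    """R(delta_i | x) = sum_j loss[i][j] * P(w_j | x), Eq. (4.4).

    loss[i][j] is the loss for deciding class i when the true class is j;
    posteriors[j] is P(w_j | x) for the observation at hand.
    """
    s = len(posteriors)
    return [sum(loss[i][j] * posteriors[j] for j in range(s)) for i in range(s)]

# Hypothetical posteriors P(w_j | x) for a three-class problem (they sum to 1).
post = [0.5, 0.3, 0.2]

# Zero-one loss, Eq. (4.6): no loss for a correct decision, unit loss otherwise.
zero_one = [[0 if i == j else 1 for j in range(3)] for i in range(3)]
print(conditional_risk(zero_one, post))   # approximately [0.5, 0.7, 0.8] = 1 - P(w_i | x)

# An asymmetric loss matrix: confusing class 0 with class 2 is penalized heavily.
asym = [[0, 1, 5],
        [1, 0, 1],
        [2, 1, 0]]
risks = conditional_risk(asym, post)
decision = min(range(3), key=lambda i: risks[i])   # Bayes' rule: minimum conditional risk
print(risks, decision)                             # class 1 wins despite a smaller posterior

With a non-uniform loss matrix the minimum-risk decision can differ from the maximum-posterior decision; with the zero-one loss the two coincide, which is the case pursued next.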
Therefore, in order to minimize the classification error rate, we have to choose the decision of class i that maximizes the posterior probability P(ω_i|x). Furthermore, since p(x) is a constant, the decision is equivalent to picking the class i that maximizes p(x|ω_i)P(ω_i). Bayes' decision rule can be formulated as follows:

    \delta(\mathbf{x}) = \arg\max_i P(\omega_i \mid \mathbf{x}) = \arg\max_i p(\mathbf{x} \mid \omega_i) P(\omega_i)    (4.8)

This decision rule, which is based on the maximum of the posterior probability P(ω_i|x), is called the minimum-error-rate decision rule. It minimizes the classification error rate. Although our description is for a random variable x, Bayes' decision rule is applicable to a multivariate random vector x without loss of generality.

A pattern classifier can be regarded as a device for partitioning the feature space into decision regions. Without loss of generality, we consider a two-class case. Assume that the classifier divides the space ℜ into two regions, ℜ_1 and ℜ_2. To compute the likelihood of errors, we need to consider two cases. In the first case, x falls in ℜ_1, but the true class is ω_2. In the other case, x falls in ℜ_2, but the true class is ω_1. Since these two cases are mutually exclusive, we have

    P(\text{error}) = P(x \in \Re_1, \omega_2) + P(x \in \Re_2, \omega_1)
                    = P(x \in \Re_1 \mid \omega_2) P(\omega_2) + P(x \in \Re_2 \mid \omega_1) P(\omega_1)
                    = \int_{\Re_1} p(x \mid \omega_2) P(\omega_2)\, dx + \int_{\Re_2} p(x \mid \omega_1) P(\omega_1)\, dx    (4.9)

Figure 4.1 illustrates the calculation of the classification error in Eq. (4.9). The two terms in the summation are merely the tail areas of the functions p(x|ω_i)P(ω_i). It is clear that the decision boundary shown there is not optimal. If we move the decision boundary a little to the left, so that the decision is made to choose the class i based on the maximum value of p(x|ω_i)P(ω_i), the tail integral area P(error) becomes a minimum, which is Bayes' decision rule.

Figure 4.1 Calculation of the likelihood of classification error [22]. The shaded area represents the integral value in Eq. (4.9). (The plot shows p(x|ω_1)P(ω_1) and p(x|ω_2)P(ω_2) over the regions ℜ_1 and ℜ_2, with both the chosen and the optimal decision boundary marked.)

4.1.2. Discriminant Functions

The decision problem above can also be viewed as a pattern classification problem where unknown data x¹ are classified into known categories, such as the classification of sounds into phonemes using spectral data x. A classifier is designed to classify data x into s categories by using s discriminant functions, d_i(x), computing the similarities between the unknown data x and each class ω_i and assigning x to class ω_j if

    d_j(\mathbf{x}) > d_i(\mathbf{x}) \quad \forall\, i \neq j    (4.10)

This representation of a classifier is illustrated in Figure 4.2.

Figure 4.2 Block diagram of a classifier based on discriminant functions [22]. (The feature vector x feeds s discriminant functions d_1(x), ..., d_s(x); a maximum selector then outputs the corresponding decision.)

A Bayes' classifier can be represented in the same way. In a Bayes' classifier, unknown data x are classified on the basis of Bayes' decision rule, which minimizes the conditional risk R(δ_i|x). Since the classification decision of a pattern classifier is based on the maximum discriminant function shown in Eq. (4.10), we define our discriminant function as:

    d_i(\mathbf{x}) = -R(\delta_i \mid \mathbf{x})    (4.11)

¹ Assuming x is a d-dimensional vector.
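As an illustration of Figure 4.2 and Eqs. (4.10) and (4.11), the minimal Python sketch below (an assumption-laden example, not code from the book) implements a classifier as a bank of discriminant functions followed by a maximum selector. Under the zero-one loss, d_i(x) = -R(δ_i|x) = P(ω_i|x) - 1, so up to a constant the discriminant is simply the unnormalized posterior p(x|ω_i)P(ω_i); the Gaussian class models and priors used here are hypothetical.

import math

def gaussian_pdf(x, mean, std):
    """Univariate Gaussian density used as a hypothetical class model."""
    return math.exp(-0.5 * ((x - mean) / std) ** 2) / (std * math.sqrt(2.0 * math.pi))

class DiscriminantClassifier:
    """Bank of discriminant functions d_i(x) plus a maximum selector (Figure 4.2)."""

    def __init__(self, discriminants):
        # discriminants: dict mapping class label -> callable d_i(x)
        self.discriminants = discriminants

    def classify(self, x):
        # Eq. (4.10): assign x to the class with the largest discriminant value.
        return max(self.discriminants, key=lambda w: self.discriminants[w](x))

# Zero-one loss makes -R(delta_i | x) equivalent, up to a constant, to p(x|w_i)P(w_i).
params = {"w1": (0.0, 1.0, 0.6), "w2": (2.5, 1.2, 0.4)}   # (mean, std, prior), hypothetical
disc = {w: (lambda x, p=p: gaussian_pdf(x, p[0], p[1]) * p[2]) for w, p in params.items()}

clf = DiscriminantClassifier(disc)
print(clf.classify(0.3))   # -> "w1"
print(clf.classify(2.0))   # -> "w2"

Any monotonically increasing transform of the discriminants (taking logarithms, for example) leaves the maximum selector's decision unchanged, which is why the log-likelihood ratio can serve as a single discriminant later in this section.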
As such, the maximum discriminant function corresponds to the minimum conditional risk. In the minimum-error-rate classifier, the decision rule is to maximize the posterior probability P(ω_i|x). Thus, the discriminant function can be written as follows:

    d_i(\mathbf{x}) = P(\omega_i \mid \mathbf{x}) = \frac{p(\mathbf{x} \mid \omega_i) P(\omega_i)}{\sum_{j=1}^{s} p(\mathbf{x} \mid \omega_j) P(\omega_j)}    (4.12)

There is a very interesting relationship between Bayes' decision rule and the hypothesis testing method described in Chapter 3. For a two-class pattern recognition problem, the Bayes' decision rule in Eq. (4.2) can be written as follows:

    p(x \mid \omega_1) P(\omega_1) \; \underset{\omega_2}{\overset{\omega_1}{\gtrless}} \; p(x \mid \omega_2) P(\omega_2)    (4.13)

Eq. (4.13) can be rewritten as:

    \ell(x) = \frac{p(x \mid \omega_1)}{p(x \mid \omega_2)} \; \underset{\omega_2}{\overset{\omega_1}{\gtrless}} \; \frac{P(\omega_2)}{P(\omega_1)}    (4.14)

The term ℓ(x) is called the likelihood ratio and is the basic quantity in hypothesis testing [73]. The term P(ω_2)/P(ω_1) is called the threshold value of the likelihood ratio for the decision. Often it is convenient to use the log-likelihood ratio instead of the likelihood ratio for the decision rule. Namely, the following single discriminant function can be used instead of d_1(x) and d_2(x):

    d(x) = \log \ell(x) = \log p(x \mid \omega_1) - \log p(x \mid \omega_2) \; \underset{\omega_2}{\overset{\omega_1}{\gtrless}} \; \log P(\omega_2) - \log P(\omega_1)    (4.15)

As the classifier assigns data x to class ω_i, the data space is divided into s regions, ℜ_1, ℜ_2, ..., ℜ_s, called decision regions. The boundaries between decision regions are called decision boundaries and are represented as follows (if they are contiguous):

    d_i(\mathbf{x}) = d_j(\mathbf{x}) \qquad i \neq j    (4.16)

For points on the decision boundary, the classification can go either way. For a Bayes' classifier, the conditional risk associated with either decision is the same, and how to break the tie does not matter. Figure 4.3 illustrates an example of decision boundaries and regions for a three-class classifier on a scalar data sample x.

Figure 4.3 An example of decision boundaries and regions. For simplicity, we use a scalar variable x instead of a multi-dimensional vector [22].

4.2. HOW TO CONSTRUCT CLASSIFIERS

In the Bayes' classifier, or the minimum-error-rate classifier, the prior probability P(ω_i) and the class-conditional pdf p(x|ω_i) are assumed known. Unfortunately, in pattern recognition, we rarely have complete knowledge of the class-conditional pdfs and/or the prior probabilities. They often must be estimated or learned from the training data. In practice, the estimation of the prior probabilities is relatively easy; estimation of the class-conditional pdf is more complicated. There is always the concern of having sufficient training data relative to the tractability of the huge dimensionality of the sample data x. In this chapter we focus on estimation methods for the class-conditional pdf.

The estimation of the class-conditional pdfs can be nonparametric or parametric. In nonparametric estimation, no model structure is assumed and the pdf is directly estimated from the training data. When large amounts of sample data are available, nonparametric learning can accurately reflect the underlying probabilistic structure of the training data. However, available sample data are normally limited in practice, and parametric learning can achieve better estimates if valid model assumptions are made. In parametric learning, some general knowledge about the problem space allows one to parameterize the class-conditional pdf, so the severity of sparse training data can be reduced significantly.
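To make the nonparametric/parametric distinction concrete, here is a small illustrative Python sketch (not from the book) that estimates a class-conditional pdf from the same synthetic sample in both ways: nonparametrically with a simple histogram estimate, and parametrically by fitting a single Gaussian. The sample data and bin width are arbitrary choices for illustration.

import math
import random

random.seed(0)
# Synthetic training sample for one class; in practice these would be measured features.
samples = [random.gauss(1.5, 0.8) for _ in range(200)]

# --- Nonparametric estimate: histogram with fixed-width bins ---------------
bin_width = 0.25
counts = {}
for x in samples:
    b = math.floor(x / bin_width)
    counts[b] = counts.get(b, 0) + 1

def hist_pdf(x):
    """Histogram density estimate: bin frequency divided by (N * bin width)."""
    b = math.floor(x / bin_width)
    return counts.get(b, 0) / (len(samples) * bin_width)

# --- Parametric estimate: fit a Gaussian by maximum likelihood -------------
mean = sum(samples) / len(samples)
var = sum((x - mean) ** 2 for x in samples) / len(samples)

def gauss_pdf(x):
    """Gaussian density with the estimated mean and variance."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

for x in (0.0, 1.5, 3.0):
    print(f"x={x:4.1f}  histogram={hist_pdf(x):.3f}  gaussian={gauss_pdf(x):.3f}")

With only 200 samples the histogram estimate is noticeably bumpy, while the two-parameter Gaussian is smooth; if the true density really is close to Gaussian, the parametric estimate makes better use of limited data, which is exactly the trade-off described above.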
Suppose the pdf p(x|ω_i) is assumed to have a certain probabilistic structure, such as the Gaussian pdf. In such cases, only the mean vector μ_i (or mean μ_i) and covariance matrix Σ_i (or variance σ_i²) need to be estimated.

When the observed data x take only discrete values from a finite set of N values, the class-conditional pdf is often assumed nonparametric, so there will be N - 1 free parameters in the probability function p(x|ω_i).² When the observed data x take continuous values, parametric approaches are usually necessary. In many systems, the continuous class-conditional pdf (likelihood) p(x|ω_i) is assumed to be a Gaussian distribution or a mixture of Gaussian distributions.

² Since all the probabilities must sum to one.

In pattern recognition, the set of data samples collected to estimate the parameters of the recognizer (including the prior and the class-conditional pdf) is referred to as the training set. In contrast to the training set, the testing set refers to an independent set of data samples that is used to evaluate the recognition performance of the recognizer.

For parameter estimation or learning, it is also important to distinguish between supervised learning and unsupervised learning. Let's denote the pair (x, ω) as a sample, where x is the observed data and ω is the class from which the data x comes. From the definition, it is clear that (x, ω) are jointly distributed random variables. In supervised learning, ω, the information about the class of the sample data x, is given. Such sample data are usually called labeled data or complete data, in contrast to incomplete data, where the class information ω is missing, as in unsupervised learning. Techniques for parametric unsupervised learning are discussed in Section 4.4.

In Chapter 3 we introduced the two most popular parameter estimation techniq...
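As a preview of the parametric, supervised case, the following Python sketch (illustrative only; the labeled data are synthetic) estimates from a labeled training set the quantities a Bayes' classifier needs: the prior P(ω_i) from class counts, and the mean and variance of a Gaussian class-conditional pdf for each class.

import math
import random
from collections import defaultdict

random.seed(1)
# Synthetic labeled (complete) data: pairs (x, class), one scalar feature per sample.
training_set = ([(random.gauss(0.0, 1.0), "w1") for _ in range(300)]
                + [(random.gauss(2.0, 0.7), "w2") for _ in range(100)])

# Group observations by class label (supervised learning: the labels are given).
by_class = defaultdict(list)
for x, w in training_set:
    by_class[w].append(x)

n_total = len(training_set)
model = {}
for w, xs in by_class.items():
    prior = len(xs) / n_total                         # estimate of P(w_i)
    mean = sum(xs) / len(xs)                          # ML estimate of the Gaussian mean
    var = sum((x - mean) ** 2 for x in xs) / len(xs)  # ML estimate of the variance
    model[w] = (prior, mean, var)

def classify(x):
    """Minimum-error-rate rule, Eq. (4.8), using the estimated plug-in model."""
    def score(w):
        prior, mean, var = model[w]
        return (math.log(prior)
                - 0.5 * math.log(2.0 * math.pi * var)
                - 0.5 * (x - mean) ** 2 / var)        # log p(x|w_i) + log P(w_i)
    return max(model, key=score)

print(model)
print(classify(-0.5), classify(1.8))

This is exactly the supervised, complete-data situation described above; when the labels ω are missing, the same Gaussian parameters must instead be estimated with the unsupervised, EM-style techniques discussed in Section 4.4.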