Pattern Recognition (ECE-8443)
Chapter 2: Bayesian Decision Theory
Saurabh Prasad, Electrical and Computer Engineering Department, Mississippi State University

Outline
• Bayesian decision theory (continuous features)
  – Prior and posterior probabilities
  – Likelihood and evidence
  – Bayesian decision theory and conditional posterior probabilities
  – Probability of error and the concept of risk

Recapping from last lecture…
• Training and test data
• The concept of “classes” (or “categories”) and the class label ω, where ω ∈ {ω1, ω2, …, ωc}
• The concept of a d-dimensional feature space: x ∈ ℝ^d
• The initial discussion that follows assumes d = 1 and c = 2, but we will generalize this later on.

Prior probability
• The sea bass/salmon example: state of nature, prior
  – The state of nature is a random variable
  – The catch of salmon (ω1) and sea bass (ω2) is equiprobable: P(ω1) = P(ω2) (uniform priors)
  – P(ω1) + P(ω2) = 1 (exclusivity and exhaustivity)
• “Prior” probabilities are learned from training datasets

Decision rule using only prior probability information
• Decision rule with only the prior information:
  – Decide ω1 if P(ω1) > P(ω2); otherwise decide ω2
  – Very crude: a high rate of mislabeling is expected
• Use of the class-conditional information:
  – p(x | ω1) and p(x | ω2) describe the difference in lightness between the sea bass and salmon populations

Decision rule – Posterior, Likelihood and Evidence
• Bayes theorem for arbitrary events A and B: P(A | B) = P(B | A) P(A) / P(B)
• Continuous version of Bayes theorem, where A is any event and x is a continuous random variable (for example, a feature):
  – p(x | A) = P(A | X = x) p(x) / P(A)
  – which can be rewritten as P(A | X = x) = p(x | A) P(A) / p(x)
• Note: lowercase p(·) refers to probability density functions (pdfs), while uppercase P(·) refers to probabilities.
• In the previous equation, substitute A with ωj to obtain the posterior in terms of the likelihood and evidence:
  – P(ωj | x) = p(x | ωj) P(ωj) / p(x)
  – Posterior = (Likelihood × Prior) / Evidence
  – where, in the case of two categories, the evidence is p(x) = Σ_{j=1}^{2} p(x | ωj) P(ωj)
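The following short Python sketch illustrates the posterior computation above for a one-dimensional feature. The Gaussian class-conditional densities and the prior values are assumptions chosen purely for illustration; they are not from the handout.

```python
import numpy as np
from scipy.stats import norm

# Assumed class-conditional densities (likelihoods) for a 1-D feature x,
# e.g., "lightness" in the sea bass / salmon example.
priors = {"salmon": 0.5, "sea_bass": 0.5}             # P(w1), P(w2)
likelihoods = {
    "salmon":   norm(loc=4.0, scale=1.0),             # p(x | w1), assumed
    "sea_bass": norm(loc=6.0, scale=1.5),             # p(x | w2), assumed
}

def posteriors(x):
    """Return P(wj | x) for every class via Bayes rule."""
    joint = {c: likelihoods[c].pdf(x) * priors[c] for c in priors}  # p(x|wj) P(wj)
    evidence = sum(joint.values())                                  # p(x) = sum_j p(x|wj) P(wj)
    return {c: joint[c] / evidence for c in joint}

x = 5.2
post = posteriors(x)
decision = max(post, key=post.get)   # decide the class with the largest posterior
print(post, "->", decision)
```

Because the evidence p(x) is common to every class, it only normalizes the posteriors and does not affect which class maximizes them.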
Decision rule – Posterior, Likelihood and Evidence (continued)
• Decision given the posterior probabilities. x is an observation for which:
  – if P(ω1 | x) > P(ω2 | x), decide that the true state of nature is ω1
  – if P(ω1 | x) < P(ω2 | x), decide that the true state of nature is ω2
• Therefore, whenever we observe a particular x, the probability of error is:
  – P(error | x) = P(ω1 | x) if we decide ω2
  – P(error | x) = P(ω2 | x) if we decide ω1
• The total probability of error is P(error) = ∫ P(error, x) dx = ∫ P(error | x) p(x) dx, with the integrals taken from −∞ to ∞.
• Bayes decision rule (minimizing the probability of error): decide ω1 if P(ω1 | x) > P(ω2 | x); otherwise decide ω2.
• Therefore, under the Bayes decision rule, P(error | x) = min[P(ω1 | x), P(ω2 | x)].
• Alternately: decide ω1 if p(x | ω1) P(ω1) > p(x | ω2) P(ω2); otherwise decide ω2.
• Notes on this rule:
  – If p(x | ω1) = p(x | ω2), then the observation x gives us no information about the state of nature (class); the decision hinges entirely upon the prior probabilities.
  – If P(ω1) = P(ω2) (equal priors), the decision hinges entirely upon the likelihoods p(x | ωj).

Bayes Decision Theory – Generalizing preceding ideas
• Generalization of the preceding ideas:
  – Use of more than one feature
  – Use of more than two states of nature
  – Allowing actions other than merely deciding on the state of nature
  – Introducing a loss function that is more general than the probability of error
• Allowing actions other than classification primarily allows the possibility of rejection, i.e., refusing to make a decision in close or bad cases.
• The loss function states how costly each action will be.
• Let {ω1, ω2, …, ωc} be the set of c states of nature (or “categories”, or “classes”).
• Let {α1, α2, …, αa} be the set of possible actions.
• Let λ(αi | ωj) be the loss incurred for taking action αi when the state of nature is ωj.
• In what follows, x need not be restricted to a scalar; it can be generalized to a feature “vector”.
• Conditional risk: R(αi | x) = Σ_{j=1}^{c} λ(αi | ωj) P(ωj | x)
• Overall risk: R = ∫ R(α(x) | x) p(x) dx = ∫ [Σ_{j=1}^{c} λ(α(x) | ωj) P(ωj | x)] p(x) dx
• Minimizing R: minimizing R is equivalent to minimizing R(αi | x) over i = 1, …, a at every x, i.e., selecting the action αi for which R(αi | x) is minimum. The resulting R is the minimum achievable; it is called the Bayes risk and corresponds to the best performance that can be achieved.
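As a concrete, purely illustrative sketch of the conditional-risk machinery above, the snippet below selects the risk-minimizing action for a given posterior vector. The loss matrix and posterior values are assumptions, not values from the handout.

```python
import numpy as np

# loss[i, j] = lambda(alpha_i | omega_j): cost of taking action alpha_i
# when the true state of nature is omega_j (assumed values).
loss = np.array([[0.0, 10.0],    # action a1: decide w1
                 [1.0,  0.0]])   # action a2: decide w2

def conditional_risks(posterior):
    """R(alpha_i | x) = sum_j lambda(alpha_i | omega_j) P(omega_j | x)."""
    return loss @ posterior

posterior = np.array([0.3, 0.7])          # [P(w1 | x), P(w2 | x)] for some observed x
risks = conditional_risks(posterior)
best_action = int(np.argmin(risks))       # Bayes decision: minimize conditional risk
print(risks, "-> take action", best_action + 1)
```

With a zero-one loss matrix, this reduces to picking the class with the largest posterior, as discussed below.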
Bayes Risk – Two-Class Example
• α1: deciding ω1; α2: deciding ω2
• λij = λ(αi | ωj) = loss incurred for deciding ωi when the true state of nature is ωj
• Conditional risks:
  – R(α1 | x) = λ11 P(ω1 | x) + λ12 P(ω2 | x)
  – R(α2 | x) = λ21 P(ω1 | x) + λ22 P(ω2 | x)
• Our rule is the following: if R(α1 | x) < R(α2 | x), action α1 (“decide ω1”) is taken.
• This results in the equivalent rule: decide ω1 if (λ21 − λ11) p(x | ω1) P(ω1) > (λ12 − λ22) p(x | ω2) P(ω2), and decide ω2 otherwise.
• Likelihood ratio: the preceding rule is equivalent to the following rule. If the likelihood ratio exceeds a threshold that is independent of the feature vector x, i.e., if
  p(x | ω1) / p(x | ω2) > [(λ12 − λ22) / (λ21 − λ11)] · [P(ω2) / P(ω1)],
  then take action α1 (decide ω1); otherwise take action α2 (decide ω2).
• The threshold can be learned from “training” data.

Minimum Error Rate Classification
• Let’s come back to loss functions and error rates. If action αi is taken and the true state of nature is ωj, then the decision is correct if i = j and in error if i ≠ j.
• Zero-one loss function: λ(αi, ωj) = 0 if i = j, and 1 if i ≠ j, for i, j = 1, …, c.
• The conditional risk corresponding to this loss function is:
  R(αi | x) = Σ_{j=1}^{c} λ(αi | ωj) P(ωj | x) = Σ_{j≠i} P(ωj | x) = 1 − P(ωi | x)
• The risk corresponding to this loss function is therefore the average probability of error.
• Task: seek a decision rule that minimizes the probability of error, i.e., the error rate.
• Minimizing the risk is equivalent to maximizing P(ωi | x), since R(αi | x) = 1 − P(ωi | x).
• Hence, for the zero-one loss function, the minimum-error-rate decision rule degenerates to what we saw previously: decide ωi if P(ωi | x) > P(ωj | x) for all j ≠ i.
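Below is a minimal sketch of the likelihood-ratio test just described, with the threshold computed from the loss values. The densities, priors, and losses are assumptions for illustration only.

```python
from scipy.stats import norm

# Assumed class-conditional densities, priors, and losses.
p1, p2 = norm(loc=4.0, scale=1.0), norm(loc=6.0, scale=1.5)   # p(x|w1), p(x|w2)
P1, P2 = 0.6, 0.4                                             # priors P(w1), P(w2)
l11, l12, l21, l22 = 0.0, 2.0, 1.0, 0.0                       # losses lambda_ij

threshold = (l12 - l22) / (l21 - l11) * (P2 / P1)   # independent of x

def decide(x):
    ratio = p1.pdf(x) / p2.pdf(x)                   # likelihood ratio
    return "w1" if ratio > threshold else "w2"

for x in (3.5, 5.0, 6.5):
    print(x, "->", decide(x))
```

Setting λ12 = λ21 = 1 and λ11 = λ22 = 0 recovers the zero-one threshold P(ω2)/P(ω1) used in minimum-error-rate classification.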
Minimum Error Rate Classification (continued)
• Let θλ = [(λ12 − λ22) / (λ21 − λ11)] · P(ω2) / P(ω1); then decide ω1 if p(x | ω1) / p(x | ω2) > θλ.
• For the zero-one loss function, λ = [0 1; 1 0], so θλ = P(ω2) / P(ω1) = θa.
• If instead λ = [0 2; 1 0], then θλ = 2 P(ω2) / P(ω1) = θb.
• To sum up, the choice of loss function can:
  – Set an appropriate threshold θλ
  – Allow us to weigh actions/decisions (for example, mislabeling classes) non-uniformly
  – Allow us to penalize certain actions/decisions more than others, depending upon the application at hand (for example, automated target recognition, or computer-aided diagnosis from biomedical signals)

Classifiers, Discriminant Functions and Decision Surfaces
• The multi-category (multi-class) case:
  – Set of discriminant functions gi(x), i = 1, …, c
  – The classifier assigns a feature vector x to class ωi if gi(x) > gj(x) for all j ≠ i
• The choice of discriminant functions is not unique. In general, if we replace every gi(x) with f(gi(x)), the resulting classification is unchanged, provided f(·) is monotonically increasing.
• Decision regions and decision rules: for a c-class problem, a decision rule essentially partitions the feature space ℝ^d into c decision regions R1, R2, R3, …, Rc. If gi(x) > gj(x) for all j ≠ i, then x is in Ri (Ri means: assign x to ωi).
• For the general case with risks, we take gi(x) = −R(αi | x) (the maximum discriminant corresponds to the minimum conditional risk).
• For the minimum-error-rate case, we take gi(x) = P(ωi | x) (the maximum discriminant corresponds to the maximum posterior). Equivalent choices are:
  – gi(x) = p(x | ωi) P(ωi)
  – gi(x) = ln p(x | ωi) + ln P(ωi) (ln: natural logarithm)
• Why take ln(·)?
  – Computational/analytical convenience for normal pdfs
  – Numerical accuracy (e.g., probabilities numerically tending to zero)
  – Decomposing likelihood and prior: if needed, they can be weighed separately
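The sketch below builds a minimum-error-rate classifier directly from the log-discriminants gi(x) = ln p(x | ωi) + ln P(ωi); the three class densities and priors are assumed for illustration only.

```python
import numpy as np
from scipy.stats import norm

# Assumed 1-D class-conditional densities p(x | w_i) and priors P(w_i).
classes = ["w1", "w2", "w3"]
densities = [norm(2.0, 1.0), norm(5.0, 0.8), norm(8.0, 1.5)]
priors = np.array([0.5, 0.3, 0.2])

def discriminants(x):
    """g_i(x) = ln p(x | w_i) + ln P(w_i); logs avoid numerical underflow."""
    return np.array([d.logpdf(x) for d in densities]) + np.log(priors)

def classify(x):
    g = discriminants(x)
    return classes[int(np.argmax(g))]   # assign x to the class with the largest g_i(x)

print([classify(x) for x in (1.0, 4.6, 7.9)])
```

Working in the log domain illustrates the numerical-accuracy point above: products of very small probabilities become sums of logarithms.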
Classifiers, Discriminant Functions and Decision Surfaces (continued)
• The two-category case: a classifier with two discriminant functions g1 and g2 is a “dichotomizer”. A convenient representation of the discriminant function in this case is g(x) ≡ g1(x) − g2(x):
  – Decide ω1 if g(x) > 0; otherwise decide ω2
  – g(x) = ln [p(x | ω1) / p(x | ω2)] + ln [P(ω1) / P(ω2)]

The Normal Density Function
• The structure of the Bayes classifier is determined by the conditional densities p(x | ωi).
• The Gaussian (normal) density function has been the most widely studied, because of:
  – Analytical tractability
  – A continuous pdf
  – The fact that a lot of processes are asymptotically Gaussian (central limit theorem)
• The univariate normal density is
  p(x) = [1 / (√(2π) σ)] exp[ −(1/2) ((x − µ) / σ)² ],
  where µ is the mean (or expected value) of x and σ² is the expected squared deviation, or variance.
• Recall that the multivariate Gaussian pdf in d dimensions is given by
  p(x) = [1 / ((2π)^(d/2) |Σ|^(1/2))] exp[ −(1/2) (x − µ)ᵗ Σ⁻¹ (x − µ) ],
  where:
  – x = (x1, x2, …, xd)ᵗ (the superscript t denotes the transpose)
  – µ = (µ1, µ2, …, µd)ᵗ is the mean vector
  – Σ is the d×d covariance matrix; |Σ| and Σ⁻¹ are its determinant and inverse, respectively
• The multivariate Gaussian pdf in d dimensions is completely specified by d + d(d+1)/2 parameters.
• Samples drawn from a multivariate Gaussian pdf tend to fall in a single cloud/cluster:
  – the center of the cluster is determined by the mean vector
  – the shape of the cluster is determined by the covariance matrix
• The shape can be determined analytically by finding contours of equal probability density of the pdf.
• Entropy of a Gaussian random variable (the maximum entropy for a given mean and variance):
  H(p(x)) = −∫ p(x) ln p(x) dx = (1/2) ln(2πeσ²)
• Higher-order moments?
• Reading assignment: Section 2.5.

Discriminant Functions for the Normal Density
• We saw that minimum-error-rate classification can be achieved by the discriminant function gi(x) = ln p(x | ωi) + ln P(ωi).
• For the case of multivariate normal pdfs (each class is represented by a Gaussian pdf in the feature space), this discriminant function becomes:
  gi(x) = −(1/2) (x − µi)ᵗ Σi⁻¹ (x − µi) − (d/2) ln 2π − (1/2) ln |Σi| + ln P(ωi)

Case 1: Σi = σ² I (I stands for the identity matrix)
• Statistically independent features, each with the same variance σ²
• Equiprobability contours are circular/spherical/hyperspherical
• A good assumption if we wish to minimize the number of parameters we have to learn (i.e., instead of learning all d(d+1)/2 entries of the covariance matrix, we need to learn only one parameter, σ²)
• For this case: Σi = σ² I, so |Σi| = σ^(2d) (independent of i) and Σi⁻¹ = (1/σ²) I
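As a hedged illustration of evaluating the multivariate normal density above, the helper below computes the log of p(x) directly from µ and Σ; all parameter values are assumptions.

```python
import numpy as np

def gaussian_logpdf(x, mu, sigma):
    """Log of the d-dimensional Gaussian density
    p(x) = (2*pi)^(-d/2) |Sigma|^(-1/2) exp(-0.5 (x-mu)^t Sigma^{-1} (x-mu))."""
    d = len(mu)
    diff = x - mu
    sign, logdet = np.linalg.slogdet(sigma)          # stable log-determinant of Sigma
    maha = diff @ np.linalg.solve(sigma, diff)       # (x-mu)^t Sigma^{-1} (x-mu)
    return -0.5 * (d * np.log(2 * np.pi) + logdet + maha)

# Example with assumed parameters (d = 2):
mu = np.array([1.0, 2.0])
sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])
x = np.array([1.5, 1.0])
print(gaussian_logpdf(x, mu, sigma))
```

Using the log-determinant and a linear solve avoids explicitly forming Σ⁻¹ and is numerically better behaved in higher dimensions.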
Discriminant Functions for the Normal Density – Case 1 (continued)
• The discriminant function can be reduced as follows:
  gi(x) = −(1/2) (x − µi)ᵗ Σi⁻¹ (x − µi) − (d/2) ln(2π) − (1/2) ln |Σi| + ln P(ωi)
  – The terms −(d/2) ln(2π) and −(1/2) ln |Σi| are constant with respect to i and can be dropped:
    gi(x) = −(1/2) (x − µi)ᵗ Σi⁻¹ (x − µi) + ln P(ωi) = −(1/(2σ²)) (xᵗx − 2 µiᵗx + µiᵗµi) + ln P(ωi)
  – The term xᵗx is also constant with respect to i, which leaves the linear discriminant function
    gi(x) = wiᵗ x + wi0, where wi = µi / σ² and wi0 = −(1/(2σ²)) µiᵗ µi + ln P(ωi)
    (wi0 is called the threshold for the i’th category)
• A classifier that uses linear discriminant functions is called a “linear machine”.
• The decision surfaces for a linear machine are pieces of hyperplanes defined by gi(x) = gj(x).
• When the priors are equal and the support regions are spherical, the decision boundary is simply halfway between the means (Euclidean distance).
• The hyperplane separating Ri and Rj passes through the point
  x0 = (1/2)(µi + µj) − [σ² / ‖µi − µj‖²] ln[P(ωi) / P(ωj)] (µi − µj)
  and is always orthogonal to the line linking the means.
• If P(ωi) = P(ωj), then x0 = (1/2)(µi + µj).

Discriminant Functions for the Normal Density – Case 2
• Case Σi = Σ (the covariance matrices of all classes are identical but otherwise arbitrary).
• It can be shown (see Section 2.6.2) that the discriminant functions simplify to
  gi(x) = wiᵗ x + wi0 (a linear discriminant function),
  where wi = Σ⁻¹ µi and wi0 = −(1/2) µiᵗ Σ⁻¹ µi + ln P(ωi)
  (wi0 is called the threshold for the i’th category).
• It can also be shown (see Section 2.6.2) that the hyperplane separating Ri and Rj passes through
  x0 = (1/2)(µi + µj) − [ln(P(ωi) / P(ωj)) / ((µi − µj)ᵗ Σ⁻¹ (µi − µj))] (µi − µj).
• The decision surface is still linear, but the hyperplane separating Ri and Rj is generally not orthogonal to the line between the means.
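Here is a short sketch of the Case 2 linear machine (shared covariance): it precomputes wi and wi0 from the formulas above and classifies by the largest gi(x). The means, covariance, and priors are assumed for illustration; with Σ = σ²I the same code reduces to Case 1.

```python
import numpy as np

# Assumed per-class means, a shared covariance matrix, and priors.
means = [np.array([0.0, 0.0]), np.array([3.0, 3.0])]
sigma = np.array([[1.0, 0.3],
                  [0.3, 2.0]])
priors = [0.5, 0.5]

sigma_inv = np.linalg.inv(sigma)
w = [sigma_inv @ m for m in means]                               # w_i = Sigma^{-1} mu_i
w0 = [-0.5 * m @ sigma_inv @ m + np.log(p)                       # w_i0 = -0.5 mu_i^t Sigma^{-1} mu_i + ln P(w_i)
      for m, p in zip(means, priors)]

def classify(x):
    g = [wi @ x + wi0 for wi, wi0 in zip(w, w0)]                 # linear discriminants g_i(x)
    return int(np.argmax(g)) + 1                                 # 1-based class index

print(classify(np.array([1.0, 1.0])), classify(np.array([2.5, 3.0])))
```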
Discriminant Functions for the Normal Density – Case 3
• Case Σi arbitrary:
  – The most general representation for the Gaussian pdfs
  – But we now need to estimate d + d(d+1)/2 parameters per class
  – The covariance matrices are different for each category
• The discriminant functions become quadratic:
  gi(x) = xᵗ Wi x + wiᵗ x + wi0,
  where Wi = −(1/2) Σi⁻¹, wi = Σi⁻¹ µi, and wi0 = −(1/2) µiᵗ Σi⁻¹ µi − (1/2) ln |Σi| + ln P(ωi).
• The decision surfaces are hyperquadrics: hyperplanes, pairs of hyperplanes, hyperspheres, hyperellipsoids, hyperparaboloids, etc.
• Review Example 1 from the book (page 44).

Bayes Decision Theory – Recap
P(ωj | x) = p(x | ωj) P(ωj) / p(x), i.e., Posterior = (Likelihood × Prior) / Evidence

Traditional classification (actions amount to assigning class labels; zero-one loss function):
• Decision rule: decide ω1 if p(x | ω1) / p(x | ω2) > P(ω2) / P(ω1); else decide ω2. Ensures the minimum error rate (error probability).
• Error rate (conditional and total):
  P(error | x) = Σ_{i=1, i≠j}^{c} P(ωi | x), for x ∈ ωj
  P(error) = ∫ P(error, x) dx = ∫ P(error | x) p(x) dx
• Discriminant function: gi(x) = P(ωi | x) (the maximum discriminant corresponds to the maximum posterior).

Generalized classification (arbitrary actions follow classification; arbitrary loss function):
• Decision rule: decide ω1 if p(x | ω1) / p(x | ω2) > [(λ12 − λ22) / (λ21 − λ11)] · P(ω2) / P(ω1); else decide ω2. Ensures the minimum total risk.
• Risk (conditional and total):
  R(αi | x) = Σ_{j=1}^{c} λ(αi | ωj) P(ωj | x)
  R = ∫ R(α(x) | x) p(x) dx = ∫ [Σ_{j=1}^{c} λ(α(x) | ωj) P(ωj | x)] p(x) dx
• Discriminant function: gi(x) = −R(αi | x) (the maximum discriminant corresponds to the minimum conditional risk).
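A sketch of the Case 3 quadratic discriminant built from the formulas above; the per-class means, covariances, and priors are assumptions chosen only to make the example runnable.

```python
import numpy as np

# Assumed class parameters with class-specific covariance matrices.
params = [
    {"mu": np.array([0.0, 0.0]), "sigma": np.array([[1.0, 0.0], [0.0, 1.0]]), "prior": 0.5},
    {"mu": np.array([2.0, 2.0]), "sigma": np.array([[2.0, 0.5], [0.5, 1.5]]), "prior": 0.5},
]

def discriminant(x, mu, sigma, prior):
    """g_i(x) = x^t W_i x + w_i^t x + w_i0 for a Gaussian class model."""
    sigma_inv = np.linalg.inv(sigma)
    W = -0.5 * sigma_inv                              # quadratic term W_i
    w = sigma_inv @ mu                                # linear term w_i
    w0 = (-0.5 * mu @ sigma_inv @ mu
          - 0.5 * np.log(np.linalg.det(sigma))
          + np.log(prior))                            # bias term w_i0
    return x @ W @ x + w @ x + w0

def classify(x):
    g = [discriminant(x, **p) for p in params]
    return int(np.argmax(g)) + 1

print(classify(np.array([0.5, 0.2])), classify(np.array([2.2, 1.8])))
```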
Bayes Decision Theory – Bayes error and reducible error
• Revisiting error probabilities. Recall:
  P(error) = ∫ P(error, x) dx = ∫ P(error | x) p(x) dx
• Alternately, the error probability can also be expressed as:
  P(error) = ∫ P(error, x) dx = ∫ p(x | error) P(error) dx
• For a two-class problem, the event “error” corresponds to deciding ω1 when the true state of nature is ω2, or deciding ω2 when the true state of nature is ω1.
• [Figure: the error probability for the Bayes-optimal decision threshold xB compared with an arbitrarily chosen threshold x*.]

Bayes Decision Theory – Error bounds
• The Bayes decision rule guarantees the lowest average error rate.
• A closed-form solution exists for two-class problems with Gaussian distributions (likelihoods).
• It is difficult to extend this to high-dimensional spaces.
• Although we may not be able to derive exact error probabilities in many cases, we can estimate upper bounds on the probability of error.
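Before turning to those bounds, the following sketch evaluates the exact two-class Bayes error numerically for one-dimensional Gaussian likelihoods, using P(error) = ∫ min[P(ω1) p(x | ω1), P(ω2) p(x | ω2)] dx. The priors and densities are assumed for illustration.

```python
from scipy.stats import norm
from scipy.integrate import quad

# Assumed priors and 1-D Gaussian likelihoods.
P1, P2 = 0.5, 0.5
p1, p2 = norm(0.0, 1.0), norm(2.0, 1.0)

def integrand(x):
    return min(P1 * p1.pdf(x), P2 * p2.pdf(x))

bayes_error, _ = quad(integrand, -10.0, 10.0)   # numerical integration over x
print(f"Bayes error: {bayes_error:.4f}")        # ~0.1587 for these assumed parameters
```

This brute-force integration is only practical in low dimensions, which is exactly why the bounds discussed next are useful.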
Bayes Decision Theory – Error bounds (Chernoff bound for a two-class problem)
• We will need the following inequality: min[a, b] ≤ a^β b^(1−β) for all a, b ≥ 0 and 0 ≤ β ≤ 1.
  – Assume a ≥ b without loss of generality, so min[a, b] = b. Also, a^β b^(1−β) = (a/b)^β b and (a/b)^β ≥ 1. Therefore b ≤ (a/b)^β b, which implies min[a, b] ≤ a^β b^(1−β).
• Apply this to our expression for P(error). Recall that one way of expressing P(error) is:
  P(error) = ∫ min[P(ω1 | x), P(ω2 | x)] p(x) dx
           = ∫ min[P(ω1) p(x | ω1) / p(x), P(ω2) p(x | ω2) / p(x)] p(x) dx
           = ∫ min[P(ω1) p(x | ω1), P(ω2) p(x | ω2)] dx
           ≤ ∫ P^β(ω1) p^β(x | ω1) P^(1−β)(ω2) p^(1−β)(x | ω2) dx
           = P^β(ω1) P^(1−β)(ω2) ∫ p^β(x | ω1) p^(1−β)(x | ω2) dx
• The integral is over the entire feature space ℝ^d, not over the individual decision regions R1 and R2.
• For Gaussian density functions (likelihoods), this expression can be further simplified:
  ∫ p^β(x | ω1) p^(1−β)(x | ω2) dx = exp(−k(β)),
  where
  k(β) = [β(1 − β) / 2] (µ2 − µ1)ᵗ [β Σ1 + (1 − β) Σ2]⁻¹ (µ2 − µ1) + (1/2) ln( |β Σ1 + (1 − β) Σ2| / (|Σ1|^β |Σ2|^(1−β)) )
• Finding the bound: find the value of β that minimizes exp(−k(β)), and then compute the bound on P(error). This is a 1-D optimization over β instead of an optimization over the (possibly) high-dimensional feature space.

Bayes Decision Theory – Error bounds (Bhattacharyya bound for a two-class problem)
• The Chernoff bound is loose for extreme values of β.
• The Bhattacharyya bound is obtained by setting β = 0.5:
  P(error) ≤ P^β(ω1) P^(1−β)(ω2) ∫ p^β(x | ω1) p^(1−β)(x | ω2) dx
           = √(P(ω1) P(ω2)) ∫ √(p(x | ω1) p(x | ω2)) dx
           = √(P(ω1) P(ω2)) exp(−k(1/2)),
  where
  k(1/2) = (1/8) (µ2 − µ1)ᵗ [(Σ1 + Σ2) / 2]⁻¹ (µ2 − µ1) + (1/2) ln( |(Σ1 + Σ2) / 2| / √(|Σ1| |Σ2|) )
• These bounds can still be used if the distributions are not Gaussian (Occam’s razor); however, they might not be adequately tight.

Bayes Decision Theory – What did we learn?
• Bayes formula: factors a posterior into a combination of a likelihood, a prior and the evidence.
• The Bayes decision rule, and its relationship to minimum error.
• Error probabilities.
• Loss functions and generalized risk: there are various applications where such formulations are appropriate.
• Decision surfaces: the geometric interpretation of a Bayesian classifier.
• Bayesian classifiers for Gaussian distributions: how does the decision surface change as a function of the mean and covariance?
• Bounds on performance (i.e., Chernoff, Bhattacharyya) are useful abstractions for obtaining closed-form solutions to problems.
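As a closing illustration of the Chernoff and Bhattacharyya bounds above, here is a sketch with assumed class means, covariances, and priors (not values from the handout):

```python
import numpy as np

# Assumed two-class Gaussian parameters.
mu1, mu2 = np.array([0.0, 0.0]), np.array([2.0, 1.0])
S1 = np.array([[1.0, 0.2], [0.2, 1.0]])
S2 = np.array([[1.5, -0.3], [-0.3, 0.8]])
P1, P2 = 0.5, 0.5

def k(beta):
    """Chernoff exponent k(beta) for two Gaussian class-conditional densities."""
    dm = mu2 - mu1
    S = beta * S1 + (1 - beta) * S2
    quad = beta * (1 - beta) / 2 * dm @ np.linalg.solve(S, dm)
    logdet = 0.5 * np.log(np.linalg.det(S)
                          / (np.linalg.det(S1) ** beta * np.linalg.det(S2) ** (1 - beta)))
    return quad + logdet

# Chernoff bound: 1-D search over beta in (0, 1).
betas = np.linspace(0.01, 0.99, 99)
chernoff = min(P1 ** b * P2 ** (1 - b) * np.exp(-k(b)) for b in betas)

# Bhattacharyya bound: the special case beta = 0.5.
bhattacharyya = np.sqrt(P1 * P2) * np.exp(-k(0.5))
print(f"Chernoff bound:      {chernoff:.4f}")
print(f"Bhattacharyya bound: {bhattacharyya:.4f}")
```

The Bhattacharyya value is simply the Chernoff expression evaluated at β = 0.5, so it can never be tighter than the optimized Chernoff bound.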