Pattern Recognition
ECE 8443, Chapter 2
Bayesian Decision Theory
Electrical and Computer Engineering Department,
Mississippi State University

Chapter 2, Saurabh Prasad, Pattern Recognition, Electrical and Computer Engineering Department

Outline
• Bayesian Decision Theory (Continuous features)
– Prior and posterior probabilities
– Likelihood and evidence
– Bayesian decision theory and conditional posterior probabilities
– Probability of error and the concept of risk

Recapping from last lecture…
• Training and test data
• Concept of "classes", or "categories", and class label ω, ω ∈ {ω1, ω2, ..., ωc}
• Concept of a d-dimensional feature space: x ∈ ℝ^d
• The initial discussion that follows assumes d = 1 and c = 2, but we will generalize this later on.

Prior probability
• The sea bass/salmon example
– State of nature, prior
• State of nature is a random variable
• The catch of salmon (ω1) and sea bass (ω2) is equiprobable
  – P(ω1) = P(ω2) (uniform priors)
  – P(ω1) + P(ω2) = 1 (exclusivity and exhaustivity)
• "Prior" probabilities are learned from training datasets

Decision rule using only prior probability information
• Decision rule with only the prior information
– Decide ω1 if P(ω1) > P(ω2); otherwise decide ω2
– Very crude
– A high rate of mislabeling is expected
• Use of the class-conditional information
• p(x | ω1) and p(x | ω2) describe the difference in lightness between populations of sea bass and salmon
Decision rule with only prior probability

Decision rule – Posterior, Likelihood and Evidence
• Bayes theorem for arbitrary events A and B:
  P(A|B) = P(B|A) P(A) / P(B)
• Continuous version of Bayes theorem:
  p(x|A) = P(A | X = x) p(x) / P(A)
  – This can be rewritten as:
  P(A | X = x) = p(x|A) P(A) / p(x)
where A is any event and x is a continuous random variable (e.g., a feature)
*Lowercase p(·) refers to PDFs, while capital P(·) refers to probabilities.
Decision rule – Posterior, Likelihood and Evidence
• In the previous equation, substitute A with ωj
• Posterior, likelihood, evidence:
  – P(ωj | x) = p(x | ωj) · P(ωj) / p(x)
  – Posterior = (Likelihood · Prior) / Evidence
  – where, in the case of two categories:
    p(x) = Σ_{j=1}^{2} p(x | ωj) P(ωj)
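The posterior computation above can be sketched numerically; a minimal Python sketch, with made-up likelihood and prior values:

```python
# A minimal numeric sketch of Bayes' rule for two categories.
# The likelihood values and priors below are made-up illustrative numbers.

def posterior(likelihoods, priors):
    """Return P(w_j | x) for each class from p(x | w_j) and P(w_j)."""
    evidence = sum(l * p for l, p in zip(likelihoods, priors))  # p(x)
    return [l * p / evidence for l, p in zip(likelihoods, priors)]

# Suppose p(x | w1) = 0.6 and p(x | w2) = 0.2 at the observed x, with equal priors:
post = posterior([0.6, 0.2], [0.5, 0.5])
print(post)  # the posteriors always sum to 1
```

Because the evidence p(x) is the same for every class, it acts purely as a normalizer.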
Decision rule – Posterior, Likelihood and Evidence
• Decision given the posterior probabilities
  Let x be an observation for which:
    if P(ω1 | x) > P(ω2 | x), we decide the true state of nature is ω1
    if P(ω1 | x) < P(ω2 | x), we decide the true state of nature is ω2
  Therefore, whenever we observe a particular x, the probability of error is:
    P(error | x) = P(ω1 | x) if we decide ω2
    P(error | x) = P(ω2 | x) if we decide ω1
Decision rule – Posterior, Likelihood and Evidence
  P(error | x) = P(ω1 | x) if we decide ω2;  P(ω2 | x) if we decide ω1
  P(error) = ∫_{−∞}^{∞} P(error, x) dx = ∫_{−∞}^{∞} P(error | x) p(x) dx
Bayes decision:
• Minimizing the probability of error
• Decide ω1 if P(ω1 | x) > P(ω2 | x); otherwise decide ω2
Therefore, under the Bayes decision rule:
  P(error | x) = min[ P(ω1 | x), P(ω2 | x) ]
Decision rule – Posterior, Likelihood and Evidence
Decide ω1 if P(ω1 | x) > P(ω2 | x); otherwise decide ω2
Alternately,
Decide ω1 if p(x | ω1) · P(ω1) > p(x | ω2) · P(ω2); otherwise decide ω2

Decision rule – Posterior, Likelihood and Evidence
Rule: Decide ω1 if
  p(x | ω1) · P(ω1) > p(x | ω2) · P(ω2);
otherwise decide ω2
Notes:
• If p(x | ω1) = p(x | ω2), then the observation x gives us no information about the state of nature (class)
  – The decision hinges entirely upon the prior probabilities
• If P(ω1) = P(ω2) (equal priors)
  – The decision hinges entirely upon the likelihoods p(x | ωj)
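The decision rule above can be illustrated with a small sketch; the 1-D Gaussian class-conditionals, their means, and the priors below are hypothetical toy values:

```python
import math

# Sketch of the two-class Bayes decision rule with 1-D Gaussian
# class-conditional densities. Means, variances, and priors are
# hypothetical values chosen for illustration.

def norm_pdf(x, mu, sigma):
    """Univariate normal density."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (math.sqrt(2 * math.pi) * sigma)

def decide(x, priors=(0.5, 0.5), params=((2.0, 1.0), (4.0, 1.0))):
    """Decide w1 if p(x|w1) P(w1) > p(x|w2) P(w2), else w2."""
    g1 = norm_pdf(x, *params[0]) * priors[0]
    g2 = norm_pdf(x, *params[1]) * priors[1]
    return 1 if g1 > g2 else 2

print(decide(2.5))  # 1: closer to the mean of class 1
print(decide(3.9))  # 2: closer to the mean of class 2
```

With equal priors and equal variances, the boundary sits midway between the two means.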
Bayes Decision Theory – Generalizing preceding ideas
• Generalization of the preceding ideas
– Use of more than one feature
– Use more than two states of nature
– Allowing actions other than merely deciding on the
state of nature
– Introduce a loss function which is more general than the probability of error

Bayes Decision Theory – Generalizing preceding ideas
• Allowing actions other than classification
primarily allows the possibility of rejection
• Refusing to make a decision in close or bad cases
• The loss function states how costly each action will be

Bayes Decision Theory – Generalizing preceding ideas
Let {ω1, ω2,…, ωc} be the set of c states of nature
(or “categories”, or “classes”)
Let {α1, α2,…, αa} be the set of possible actions
Let λ(αi | ωj) be the loss incurred for taking action αi when the state of nature is ωj
In what follows, x need not be restricted to a scalar, but can be generalized to a feature "vector"
Bayes Decision Theory – Generalizing preceding ideas
Conditional risk:
  R(αi | x) = Σ_{j=1}^{c} λ(αi | ωj) P(ωj | x)
Overall risk:
  R = ∫ R(α(x) | x) p(x) dx = ∫ Σ_{j=1}^{c} λ(α(x) | ωj) P(ωj | x) p(x) dx
Minimizing R is achieved by minimizing R(αi | x) for i = 1, ..., a: select the action αi for which R(αi | x) is minimum
The resulting R is then minimum; R in this case is called the Bayes risk and corresponds to the best performance that can be achieved
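The conditional-risk minimization above can be sketched as follows; the loss matrix and posterior values are hypothetical:

```python
# Sketch of selecting the minimum-conditional-risk action.
# lam[i][j] = loss for taking action a_i when the state of nature is w_j;
# both the loss matrix and the posteriors are hypothetical numbers.

def conditional_risk(lam_row, posteriors):
    """R(a_i | x) = sum_j lam(a_i | w_j) P(w_j | x)."""
    return sum(l * p for l, p in zip(lam_row, posteriors))

def bayes_action(lam, posteriors):
    """Return the index of the action with minimum conditional risk."""
    risks = [conditional_risk(row, posteriors) for row in lam]
    return min(range(len(risks)), key=risks.__getitem__)  # argmin over actions

lam = [[0.0, 1.0],   # action a1 (decide w1)
       [1.0, 0.0]]   # action a2 (decide w2)
post = [0.3, 0.7]
print(bayes_action(lam, post))  # 1, i.e., decide w2
```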
Bayes Risk – Two Class Example
α1 : deciding ω1
α2 : deciding ω2
λij = λ(αi | ωj) = loss incurred for deciding ωi when the true state of nature is ωj
Conditional risk:
  R(α1 | x) = λ11 P(ω1 | x) + λ12 P(ω2 | x)
  R(α2 | x) = λ21 P(ω1 | x) + λ22 P(ω2 | x)

Bayes Risk – Two Class Example
Our rule is the following:
  if R(α1 | x) < R(α2 | x), then action α1 ("decide ω1") is taken
This results in the equivalent rule:
Decide ω1 if:
  (λ21 − λ11) p(x | ω1) P(ω1) > (λ12 − λ22) p(x | ω2) P(ω2)
and decide ω2 otherwise.
Bayes Risk – Two Class Example
Likelihood ratio:
The preceding rule is equivalent to the following:
  if  p(x | ω1) / p(x | ω2)  >  [(λ12 − λ22) / (λ21 − λ11)] · [P(ω2) / P(ω1)]
  then take action α1 (decide ω1); otherwise take action α2 (decide ω2)
The left-hand side is the likelihood ratio; the right-hand side is a threshold independent of the feature vector x
Decide ω1 if the likelihood ratio is greater than the threshold
The threshold can be learned from "training" data
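The likelihood-ratio rule above can be sketched numerically; the loss matrix, priors, and likelihood values are hypothetical:

```python
# Sketch of the likelihood-ratio test. lam[i][j] is the loss for deciding
# w_i when the truth is w_j; all numbers here are hypothetical.

def lr_threshold(lam, priors):
    """theta = [(lam12 - lam22) / (lam21 - lam11)] * P(w2) / P(w1)."""
    return (lam[0][1] - lam[1][1]) / (lam[1][0] - lam[0][0]) * priors[1] / priors[0]

def decide(px_w1, px_w2, lam, priors):
    """Decide w1 when the likelihood ratio exceeds the threshold."""
    return 1 if px_w1 / px_w2 > lr_threshold(lam, priors) else 2

zero_one = [[0.0, 1.0], [1.0, 0.0]]
# With zero-one loss and equal priors the threshold is 1:
print(lr_threshold(zero_one, (0.5, 0.5)))       # 1.0
print(decide(0.6, 0.2, zero_one, (0.5, 0.5)))   # ratio 3 > 1, so decide w1
```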
Minimum Error Rate Classification
Let’s come back to Loss Functions and Error Rates
• If action αi is taken and the true state of nature is ωj, then the decision is correct if i = j and in error if i ≠ j
• Zero-one loss function:
  λ(αi, ωj) = 0 if i = j, 1 if i ≠ j,   for i, j = 1, ..., c

Minimum Error Rate Classification
The conditional risk corresponding to this loss function is:
  R(αi | x) = Σ_{j=1}^{c} λ(αi | ωj) P(ωj | x) = Σ_{j≠i} P(ωj | x) = 1 − P(ωi | x)
The risk corresponding to this loss function is the average probability of error
Task: seek a decision rule that minimizes the probability of error, which is the error rate
Minimum Error Rate Classification
Minimizing risk is equivalent to maximizing P(ωi | x), since R(αi | x) = 1 − P(ωi | x)
Hence, for the zero-one loss function, the decision rule corresponding to the minimum error rate degenerates to what we saw previously:
  Decide ωi if P(ωi | x) > P(ωj | x) ∀ j ≠ i

Minimum Error Rate Classification
Let θλ = [(λ12 − λ22) / (λ21 − λ11)] · [P(ω2) / P(ω1)]; then decide ω1 if:
  p(x | ω1) / p(x | ω2) > θλ
For the zero-one loss function:
  if λ = [0 1; 1 0], then θλ = P(ω2) / P(ω1) = θa
  if λ = [0 2; 1 0], then θλ = 2 P(ω2) / P(ω1) = θb

Minimum Error Rate Classification
To sum up, the choice of loss function:
• Sets an appropriate threshold θλ
• Allows us to weigh actions/decisions (e.g., mislabeling classes) non-uniformly
• Allows us to penalize certain actions/decisions more than others, depending upon the application at hand
  – Example: Automated Target Recognition
  – Example: Computer-Aided Diagnosis from Biomedical Signals
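The effect of the loss entries on the threshold θλ (the θa versus θb example) can be checked numerically; a minimal sketch with hypothetical loss matrices and equal priors:

```python
# Sketch of how the loss matrix shifts the decision threshold.
# lam[i][j] = loss for deciding w_i when the truth is w_j; values hypothetical.

def lr_threshold(lam, priors):
    """theta = [(lam12 - lam22) / (lam21 - lam11)] * P(w2) / P(w1)."""
    return (lam[0][1] - lam[1][1]) / (lam[1][0] - lam[0][0]) * priors[1] / priors[0]

priors = (0.5, 0.5)
theta_a = lr_threshold([[0.0, 1.0], [1.0, 0.0]], priors)  # zero-one loss
theta_b = lr_threshold([[0.0, 2.0], [1.0, 0.0]], priors)  # mislabeling w2 costs twice as much
print(theta_a, theta_b)  # doubling lam12 doubles the threshold
```

A higher threshold means more evidence for ω1 is required before we risk the costlier mistake.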
Classifiers, Discriminant Functions and Decision Surfaces
• The multicategory (multiclass) case
– Set of discriminant functions gi(x), i = 1,…, c
– The classifier assigns a feature vector x to class ωi
if:
  gi(x) > gj(x) ∀ j ≠ i

Classifiers, Discriminant Functions and Decision Surfaces
• The choice of discriminant functions is not unique
• In general, if we replace every gi(x) with f(gi(x)), the resulting classification is unchanged, provided f(·) is monotonically increasing
• Decision regions and decision rules: for a c-class problem, a decision rule essentially partitions the feature space ℝ^d into c decision regions R1, R2, R3, ..., Rc
• If gi(x) > gj(x) ∀ j ≠ i, then x is in Ri
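The argmax rule above can be sketched directly; the three discriminant functions below are arbitrary toy functions, not derived from any density:

```python
# Sketch of a discriminant-function classifier: assign x to the class
# whose g_i(x) is largest. The g_i here are arbitrary toy functions.

def classify(x, discriminants):
    scores = [g(x) for g in discriminants]
    return max(range(len(scores)), key=scores.__getitem__)  # index of best class

# Three toy discriminants peaking at x = 0, 2, 4 respectively:
gs = [lambda x: -(x - 0.0) ** 2,
      lambda x: -(x - 2.0) ** 2,
      lambda x: -(x - 4.0) ** 2]
print(classify(1.8, gs))  # nearest peak is class index 1
```

Applying any monotonically increasing f to every gi leaves the argmax, and hence the decision regions, unchanged.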
  (Ri means: assign x to ωi)

Classifiers, Discriminant Functions and Decision Surfaces
• For the general case with risks, we take
  gi(x) = −R(αi | x)  (max. discriminant corresponds to min. conditional risk!)
• For the minimum error rate case, we take
  gi(x) = P(ωi | x)  (max. discriminant corresponds to max. posterior!)
  Equivalently, gi(x) ≡ p(x | ωi) P(ωi), or
  gi(x) = ln p(x | ωi) + ln P(ωi)  (ln: natural logarithm)
Why take ln(·)?
• Computational/analytical convenience for Normal pdfs
• Numerical accuracy (e.g., probabilities numerically tending to zero)
• Decomposing likelihood and prior: if needed, they can be weighed separately
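The numerical-accuracy point can be demonstrated concretely: a product of many small likelihood values underflows double precision, while the sum of their logs does not. The likelihood values below are made up:

```python
import math

# Sketch of the numerical-accuracy motivation for ln: a product of many
# small likelihoods underflows to 0.0, while the sum of logs stays finite.

likelihoods = [1e-5] * 100          # 100 i.i.d. feature likelihoods (toy values)

product = 1.0
for l in likelihoods:
    product *= l                     # 1e-500 is far below double-precision range

log_sum = sum(math.log(l) for l in likelihoods)

print(product)   # underflows to 0.0
print(log_sum)   # a perfectly representable finite number
```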
Classifiers, Discriminant Functions and Decision Surfaces
• The two-category case
  – A classifier is a "dichotomizer" that has two discriminant functions g1 and g2. A convenient representation of the discriminant function in this case is:
    g(x) ≡ g1(x) − g2(x)
    Decide ω1 if g(x) > 0; otherwise decide ω2
    g(x) = ln [p(x | ω1) / p(x | ω2)] + ln [P(ω1) / P(ω2)]
The Normal Density Function
• Structure of the Bayes classifier is determined by the
conditional densities p(x | ωi)
• Gaussian (Normal) density function has been most widely
studied
– Analytical tractability
– Continuous pdf
– A lot of processes are asymptotically Gaussian
– Central limit theorem

The Normal Density Function
  p(x) = [1 / (√(2π) σ)] exp[ −(1/2) ((x − µ)/σ)² ]
where:
µ = mean (or expected value) of x
σ² = expected squared deviation, or variance

The Normal Density Function
Recall that the multivariate Gaussian pdf in d dimensions is given by:
  p(x) = [1 / ((2π)^{d/2} |Σ|^{1/2})] exp[ −(1/2) (x − µ)ᵗ Σ⁻¹ (x − µ) ]
where:
  x = (x1, x2, ..., xd)ᵗ (t denotes the transpose)
  µ = (µ1, µ2, ..., µd)ᵗ, the mean vector
  Σ = the d×d covariance matrix
  |Σ| and Σ⁻¹ are its determinant and inverse, respectively

The Normal Density Function
Recall that…
The multivariate Gaussian pdf in d dimensions is completely specified by d + d(d+1)/2 parameters
Samples drawn from a multivariate Gaussian pdf tend to fall in a single cloud/cluster:
  – the center of the cluster is determined by the mean vector
  – the shape of the cluster is determined by the covariance matrix
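The multivariate Gaussian pdf above can be evaluated directly from the formula; a minimal NumPy sketch with toy mean and covariance values:

```python
import numpy as np

# Sketch of evaluating the multivariate Gaussian pdf from the slide's
# formula. The mean vector and covariance matrix are toy values.

def mvn_pdf(x, mu, Sigma):
    d = len(mu)
    diff = x - mu
    norm = 1.0 / ((2 * np.pi) ** (d / 2) * np.sqrt(np.linalg.det(Sigma)))
    return norm * np.exp(-0.5 * diff @ np.linalg.inv(Sigma) @ diff)

mu = np.array([0.0, 0.0])
Sigma = np.eye(2)                       # identity covariance: a spherical cloud
val = mvn_pdf(np.array([0.0, 0.0]), mu, Sigma)
print(val)  # the peak value at the mean is 1/(2*pi) for d = 2, Sigma = I
```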
Shape can be determined analytically by finding contours of equal probability density in the pdf.
Entropy for a Gaussian random variable (maximum entropy for a given mean and variance):
  H(p(x)) = −∫_{−∞}^{∞} p(x) ln p(x) dx = (1/2) ln(2πeσ²)
Higher-order moments?

The Normal Density Function
Reading assignment:
section 2.5

Discriminant Functions for the Normal Density
• We saw that minimum error-rate classification can be achieved by the discriminant function
  gi(x) = ln p(x | ωi) + ln P(ωi)
• For the case of multivariate normal pdfs (each class is represented by a Gaussian pdf in the feature space), this discriminant function becomes:
  gi(x) = −(1/2)(x − µi)ᵗ Σi⁻¹ (x − µi) − (d/2) ln 2π − (1/2) ln |Σi| + ln P(ωi)
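The discriminant above can be sketched as follows; class means, covariances, and priors are toy values:

```python
import numpy as np

# Sketch of the general Gaussian discriminant
#   g_i(x) = -0.5 (x - mu_i)^T Sigma_i^{-1} (x - mu_i) - (d/2) ln 2pi
#            - 0.5 ln|Sigma_i| + ln P(w_i)
# Class means, covariances, and priors are toy values.

def g(x, mu, Sigma, prior):
    d = len(mu)
    diff = x - mu
    return (-0.5 * diff @ np.linalg.inv(Sigma) @ diff
            - 0.5 * d * np.log(2 * np.pi)
            - 0.5 * np.log(np.linalg.det(Sigma))
            + np.log(prior))

mus = [np.array([0.0, 0.0]), np.array([3.0, 3.0])]
Sigmas = [np.eye(2), np.eye(2)]
priors = [0.5, 0.5]

x = np.array([0.5, 0.2])
scores = [g(x, m, S, p) for m, S, p in zip(mus, Sigmas, priors)]
print(int(np.argmax(scores)))  # 0: x is much closer to the mean of class 0
```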
Discriminant Functions for the Normal Density – Case 1
• Case Σi = σ²·I (I is the identity matrix)
  – Statistically independent features in the feature space
  – Equiprobable contours are circular/spherical/hyperspherical
  – A good assumption if we wish to minimize the number of parameters we have to learn (i.e., instead of learning all d(d+1)/2 entries of the covariance matrix, we need to learn only one parameter, σ²)
  – For this case:
    Σi = diag(σ², σ², ..., σ²)
    |Σi| = σ^{2d}, which is independent of i
    Σi⁻¹ = (1/σ²) I

Discriminant Functions for the Normal Density – Case 1
• The discriminant function can be reduced to:
  gi(x) = −(1/2)(x − µi)ᵗ Σi⁻¹ (x − µi) − (d/2) ln(2π) − (1/2) ln |Σi| + ln P(ωi)
Dropping the terms that are constant with respect to i:
  gi(x) = −(1/2)(x − µi)ᵗ Σi⁻¹ (x − µi) + ln P(ωi)
  gi(x) = −(1/(2σ²)) (xᵗx − 2µiᵗx + µiᵗµi) + ln P(ωi)
Dropping xᵗx, which is also constant with respect to i:
  gi(x) = wiᵗ x + wi0  (a linear discriminant function)
where:
  wi = µi / σ²;  wi0 = −µiᵗµi / (2σ²) + ln P(ωi)
(wi0 is called the threshold for the i-th category!)
Discriminant Functions for the Normal Density – Case 1
– A classifier that uses linear discriminant functions is called a "linear machine"
– The decision surfaces for a linear machine are pieces of hyperplanes defined by: gi(x) = gj(x)

Discriminant Functions for the Normal Density – Case 1
The decision boundary, when the priors are equal and the support regions are spherical, lies simply halfway between the means (Euclidean distance).

Discriminant Functions for the Normal Density – Case 1
– The hyperplane separating Ri and Rj passes through the point:
  x0 = (1/2)(µi + µj) − [σ² / ‖µi − µj‖²] · ln[P(ωi) / P(ωj)] · (µi − µj)
It is always orthogonal to the line linking the means!
  If P(ωi) = P(ωj), then x0 = (1/2)(µi + µj)
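The boundary point x0 above can be verified numerically; the means, σ², and priors are toy values:

```python
import numpy as np

# Sketch of the Case 1 boundary point
#   x0 = 0.5 (mu_i + mu_j)
#        - [sigma^2 / ||mu_i - mu_j||^2] * ln(P(w_i)/P(w_j)) * (mu_i - mu_j)
# Means, sigma^2, and priors are toy values.

def x0(mu_i, mu_j, sigma2, p_i, p_j):
    diff = mu_i - mu_j
    return 0.5 * (mu_i + mu_j) - sigma2 / (diff @ diff) * np.log(p_i / p_j) * diff

mu1, mu2 = np.array([0.0, 0.0]), np.array([4.0, 0.0])

mid = x0(mu1, mu2, 1.0, 0.5, 0.5)      # equal priors: exactly halfway
shifted = x0(mu1, mu2, 1.0, 0.9, 0.1)  # boundary moves away from the likelier class
print(mid, shifted)
```

Raising P(ω1) pushes the boundary toward µ2, enlarging the region assigned to the likelier class.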
Discriminant Functions for the Normal Density – Case 2
• Case Σi = Σ (the covariance matrices of all classes are identical but otherwise arbitrary!)
• It can be shown (see section 2.6.2) that the discriminant functions simplify to:
  gi(x) = wiᵗ x + wi0  (a linear discriminant function)
where:
  wi = Σ⁻¹ µi;  wi0 = −(1/2) µiᵗ Σ⁻¹ µi + ln P(ωi)
(wi0 is called the threshold for the i-th category!)

Discriminant Functions for the Normal Density – Case 2
It can also be shown (see section 2.6.2) that the hyperplane separating Ri and Rj passes through:
  x0 = (1/2)(µi + µj) − [ln(P(ωi) / P(ωj)) / ((µi − µj)ᵗ Σ⁻¹ (µi − µj))] · (µi − µj)
(The decision surface is still linear, but the hyperplane separating Ri and Rj is generally not orthogonal to the line between the means!)

Discriminant Functions for the Normal Density – Case 3
• Case Σi = arbitrary
  – The most generic representation for the Gaussian pdfs
  – But we now need to estimate d + d(d+1)/2 parameters per class
  – The covariance matrices are different for each category
  gi(x) = xᵗ Wi x + wiᵗ x + wi0
where:
  Wi = −(1/2) Σi⁻¹
  wi = Σi⁻¹ µi
  wi0 = −(1/2) µiᵗ Σi⁻¹ µi − (1/2) ln |Σi| + ln P(ωi)
(The decision surfaces are hyperquadrics: hyperplanes, pairs of hyperplanes, hyperspheres, hyperellipsoids, hyperparaboloids, etc.)
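The quadratic discriminant above can be sketched as follows; all parameter values are toy examples:

```python
import numpy as np

# Sketch of the arbitrary-covariance quadratic discriminant:
#   g_i(x) = x^T W_i x + w_i^T x + w_i0, with
#   W_i = -0.5 Sigma_i^{-1}, w_i = Sigma_i^{-1} mu_i,
#   w_i0 = -0.5 mu_i^T Sigma_i^{-1} mu_i - 0.5 ln|Sigma_i| + ln P(w_i)
# All parameter values below are toy examples.

def quadratic_g(x, mu, Sigma, prior):
    Sinv = np.linalg.inv(Sigma)
    W = -0.5 * Sinv
    w = Sinv @ mu
    w0 = -0.5 * mu @ Sinv @ mu - 0.5 * np.log(np.linalg.det(Sigma)) + np.log(prior)
    return x @ W @ x + w @ x + w0

mu1, mu2 = np.array([0.0, 0.0]), np.array([3.0, 0.0])
S1, S2 = np.eye(2), np.diag([4.0, 4.0])   # different covariances: curved boundary

x = np.array([0.4, 0.1])
print(quadratic_g(x, mu1, S1, 0.5) > quadratic_g(x, mu2, S2, 0.5))  # near mu1
```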
Discriminant Functions for the Normal Density
Review example 1 from the book (page 44)

Bayes Decision Theory – Recap
P(ωj | x) = p(x | ωj) · P(ωj) / p(x)
Posterior = (Likelihood · Prior) / Evidence

Traditional classification (actions equal assigning class labels; zero-one loss function):
  Decision rule: if p(x | ω1) / p(x | ω2) > P(ω2) / P(ω1), decide ω1; else decide ω2
    (ensures the minimum error rate, i.e., error probability)
  Error rate (conditional and total):
    P(error | x) = Σ_{i=1, i≠j}^{c} P(ωi | x), for x ∈ ωj
    P(error) = ∫_{−∞}^{∞} P(error, x) dx = ∫_{−∞}^{∞} P(error | x) p(x) dx
  Discriminant function: gi(x) = P(ωi | x)
    (max. discriminant corresponds to max. posterior!)

Generalized classification (arbitrary actions follow classification; arbitrary loss function):
  Decision rule: if p(x | ω1) / p(x | ω2) > [(λ12 − λ22) / (λ21 − λ11)] · [P(ω2) / P(ω1)], decide ω1; else decide ω2
    (ensures the minimum total risk)
  Risk (conditional and total):
    R(αi | x) = Σ_{j=1}^{c} λ(αi | ωj) P(ωj | x)
    R = ∫ R(α(x) | x) p(x) dx = ∫ Σ_{j=1}^{c} λ(α(x) | ωj) P(ωj | x) p(x) dx
  Discriminant function: gi(x) = −R(αi | x)
    (max. discriminant corresponds to min. conditional risk!)

Bayes Decision Theory – Bayes error and reducible error
Revisiting error probabilities… Recall:
  P(error) = ∫_{−∞}^{∞} P(error, x) dx = ∫_{−∞}^{∞} P(error | x) p(x) dx
Alternately, the error probability can also be expressed as:
  P(error) = ∫_{−∞}^{∞} P(error, x) dx = ∫_{−∞}^{∞} p(x | error) P(error) dx
For a two-class problem, the event "error" corresponds to:
  deciding ω1 when the true state of nature is ω2, or
  deciding ω2 when the true state of nature is ω1

Bayes Decision Theory – Bayes error and reducible error
(figure: xB is the Bayes-optimal decision boundary; x* is chosen arbitrarily)

Bayes Decision Theory – Error bounds
• The Bayes decision rule guarantees the lowest average error rate
• Closed-form solutions exist for two-class problems with Gaussian distributions (likelihoods)
• It is difficult to extend this to high-dimensional spaces
• Although we may not be able to derive exact error probabilities in many cases, we can estimate upper bounds on the probability of error

Bayes Decision Theory – Error bounds (Chernoff bound for a two-class problem)
We will need the following inequality:
  min[a, b] ≤ a^β b^(1−β)  ∀ a, b ≥ 0 and 0 ≤ β ≤ 1
Assume a ≥ b without loss of generality: min[a, b] = b
Also, a^β b^(1−β) = (a/b)^β · b, and (a/b)^β ≥ 1
Therefore, b ≤ (a/b)^β · b, which implies min[a, b] ≤ a^β b^(1−β)
Apply this to our expression for P(error). Recall that one way of expressing P(error) is:
  P(error) = ∫ min[ P(ω1 | x), P(ω2 | x) ] p(x) dx

Bayes Decision Theory – Error bounds (Chernoff bound for a two-class problem)
P(error) = ∫ min[ P(ω1 | x), P(ω2 | x) ] p(x) dx
         = ∫ min[ P(ω1) p(x | ω1) / p(x), P(ω2) p(x | ω2) / p(x) ] p(x) dx
         = ∫ min[ P(ω1) p(x | ω1), P(ω2) p(x | ω2) ] dx
         ≤ ∫ P^β(ω1) p^β(x | ω1) P^(1−β)(ω2) p^(1−β)(x | ω2) dx
         = P^β(ω1) P^(1−β)(ω2) ∫ p^β(x | ω1) p^(1−β)(x | ω2) dx
The integral is over the entire feature space ℝ^d, not over the individual decision regions R1 and R2
For Gaussian density functions (likelihoods), this expression can be further simplified

Bayes Decision Theory – Error bounds (Chernoff bound for a two-class problem)
For Gaussian density functions (likelihoods), this expression can be further simplified:
  ∫ p^β(x | ω1) p^(1−β)(x | ω2) dx = exp(−k(β))
where:
  k(β) = [β(1 − β) / 2] (µ2 − µ1)ᵗ [βΣ1 + (1 − β)Σ2]⁻¹ (µ2 − µ1)
         + (1/2) ln { |βΣ1 + (1 − β)Σ2| / (|Σ1|^β |Σ2|^(1−β)) }
Finding the bound: find the value of β that minimizes exp(−k(β)), and then compute P(error) using the bound
This is a 1-D optimization over β, instead of over the (possibly) high-dimensional feature space
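Finding the Chernoff bound (minimizing exp(−k(β)), i.e., maximizing k(β)) can be sketched with a coarse 1-D search; the means and covariances are toy values, and β = 0.5 recovers the Bhattacharyya exponent:

```python
import numpy as np

# Sketch of the Chernoff k(beta) for two Gaussian likelihoods, plus the
# Bhattacharyya special case beta = 0.5. Means/covariances are toy values.

def k(beta, mu1, mu2, S1, S2):
    Sb = beta * S1 + (1 - beta) * S2
    dmu = mu2 - mu1
    term1 = beta * (1 - beta) / 2 * dmu @ np.linalg.inv(Sb) @ dmu
    term2 = 0.5 * np.log(np.linalg.det(Sb)
                         / (np.linalg.det(S1) ** beta * np.linalg.det(S2) ** (1 - beta)))
    return term1 + term2

mu1, mu2 = np.array([0.0, 0.0]), np.array([2.0, 0.0])
S1 = S2 = np.eye(2)

# Coarse 1-D search over beta (instead of the full feature space).
# Minimizing exp(-k(beta)) is the same as maximizing k(beta):
betas = np.linspace(0.01, 0.99, 99)
best = betas[np.argmax([k(b, mu1, mu2, S1, S2) for b in betas])]
print(best)                       # equal covariances: optimum at beta = 0.5
print(k(0.5, mu1, mu2, S1, S2))   # the Bhattacharyya exponent
```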
Bayes Decision Theory – Error bounds (Bhattacharyya bound for a two-class problem)
• The Chernoff bound is loose for extreme values of β
• The Bhattacharyya bound is obtained by setting β = 0.5:
  P(error) ≤ P^β(ω1) P^(1−β)(ω2) ∫ p^β(x | ω1) p^(1−β)(x | ω2) dx
          = √(P(ω1) P(ω2)) ∫ √(p(x | ω1) p(x | ω2)) dx
          = √(P(ω1) P(ω2)) exp(−k(1/2))
where:
  k(1/2) = (1/8) (µ2 − µ1)ᵗ [(Σ1 + Σ2)/2]⁻¹ (µ2 − µ1) + (1/2) ln { |(Σ1 + Σ2)/2| / √(|Σ1| |Σ2|) }
• These bounds can still be used if the distributions are not Gaussian; however, they might not be adequately tight.

Bayes Decision Theory – What did we learn?
• Bayes formula: factors a posterior into a combination of a likelihood, a prior, and the evidence
• The Bayes decision rule, and its relationship to minimum error
• Error probabilities
• Loss functions and generalized risk: there are various applications where such formulations are appropriate
• Decision surfaces: a geometric interpretation of a Bayesian classifier
• Bayesian classifiers for Gaussian distributions: how does the decision surface change as a function of the mean and covariance?
• Bounds on performance (i.e., Chernoff, Bhattacharyya) are useful abstractions for obtaining closed-form solutions to problems.
This note was uploaded on 02/20/2012 for the course ECE 8443 taught by Professor Staff during the Spring '10 term at University of Houston.