Pattern Recognition
ECE8443 Chapter 3, Part 3
Parameter estimation of the feature space –
Practical aspects
Electrical and Computer Engineering Department,
Mississippi State University

Saurabh Prasad

Outline
• When do ML and Bayes estimation methods differ?
• Different sources of error
• Problems introduced by high dimensional feature spaces
• Practical aspects of ML estimates when building discriminant
functions
– Handling inverses and determinants of covariance estimates when they are
not full rank

Maximum Likelihood versus Bayesian Estimation
• For infinite amounts of data, the solutions converge.
However, limited data is always a problem.
• If prior information is reliable, a Bayesian estimate can be
superior.
• Bayesian estimates for uniform priors are similar to an ML
solution.
• If p(θ|D) is broad or asymmetric around the true value, the two
approaches are likely to produce different solutions.
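The convergence claim above can be checked numerically. The following is a minimal sketch (the function names and the simple known-variance Gaussian setting are illustrative choices, not from the slides): the ML estimate of a Gaussian mean is the sample mean, while the Bayesian posterior mean under a Gaussian prior blends the prior with the data; the gap between the two shrinks as the sample size grows.

```python
import numpy as np

def ml_mean(x):
    """Maximum-likelihood estimate of a Gaussian mean: the sample mean."""
    return x.mean()

def bayes_mean(x, prior_mu, prior_var, noise_var):
    """Posterior mean of a Gaussian mean with a Gaussian prior and known
    noise variance: a precision-weighted blend of prior mean and sample mean."""
    n = len(x)
    w = (n / noise_var) / (n / noise_var + 1.0 / prior_var)
    return w * x.mean() + (1.0 - w) * prior_mu

rng = np.random.default_rng(1)
gaps = []
for n in (5, 50, 5000):
    x = rng.normal(3.0, 1.0, size=n)  # data drawn around the true mean 3.0
    # gap between the ML and Bayesian estimates shrinks as n grows
    gaps.append(abs(ml_mean(x) - bayes_mean(x, 0.0, 1.0, 1.0)))
```

With a uniform (very broad) prior, `prior_var` grows large, `w` approaches 1, and the Bayesian estimate collapses to the ML solution, matching the bullet above.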
• When designing a classifier using these techniques, there are
three sources of error:
– Bayes error (irreducible / indistinguishability error): the error due to
overlapping class-conditional distributions. It is an inherent property of the
problem for the given features and can never be eliminated.
– Model error: the error due to an incorrect model, i.e., an incorrect
assumption about the parametric form of the distributions.
– Estimation error: the error arising from the fact that the parameters are
estimated from a finite amount of data (unreliable statistical estimates,
unstable inverses, etc.)
• In the limit of infinite training data, estimation error vanishes
• Total error will then be the same for both ML and Bayesian estimation approaches
• ML classifiers (where likelihood functions for each class are represented by parameters that
are estimated by ML techniques) are simpler
• Lead to classifiers nearly as accurate as those based on Bayesian estimation of the
parameters
Dimensionality and Probability of Error
• Consider a two-class problem, where the two likelihoods are multivariate Gaussian with
equal covariance but different means. Also assume that the priors are equal. It can be shown
that:

P(error) = (1/√(2π)) ∫_{r/2}^{∞} e^{−u²/2} du

where r² = (µ₁ − µ₂)ᵗ Σ⁻¹ (µ₁ − µ₂) is the squared Mahalanobis distance between the class
means, and

lim_{r→∞} P(error) = 0

• If the features are independent, then Σ = diag(σ₁², σ₂², …, σ_d²) and

r² = Σ_{i=1}^{d} ((µ_{i1} − µ_{i2}) / σ_i)²

• Hence, for this simple case, each additional feature theoretically contributes to lowering
P(error).
• The most useful features are those for which the difference in means is large relative to their
variance.
• Each new feature may not add much to the performance, but if r can be increased without
limit, the probability of error can “theoretically” be made arbitrarily small, and the
performance must therefore improve.
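These relationships can be verified numerically. A minimal sketch (assuming NumPy, and using the identity that the Gaussian tail integral above equals ½·erfc(r/(2√2))): compute P(error) from the Mahalanobis distance and watch it shrink as independent, equally informative features are added.

```python
import numpy as np
from math import erfc, sqrt

def p_error(mu1, mu2, cov):
    """Bayes error for two equal-prior Gaussians with a common covariance:
    P(error) = Q(r/2), where r is the Mahalanobis distance between means."""
    diff = np.asarray(mu1, float) - np.asarray(mu2, float)
    r2 = diff @ np.linalg.inv(cov) @ diff   # squared Mahalanobis distance
    r = sqrt(r2)
    # standard normal tail beyond r/2, via the complementary error function
    return 0.5 * erfc(r / (2 * sqrt(2)))

# Independent unit-variance features with unit mean separation:
# every added feature increases r^2 by 1 and lowers P(error)
mu1 = np.zeros(3)
mu2 = np.ones(3)
errs = [p_error(mu1[:d], mu2[:d], np.eye(d)) for d in (1, 2, 3)]
```

For d = 1 this gives Q(0.5) ≈ 0.309, and the sequence decreases monotonically with d, illustrating the bullet above.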
Problem of Dimensionality
• For many real-life pattern recognition tasks, it is not unusual to
have feature spaces that are very high dimensional (10, 100, even
1000…)
• We do hope that each feature is at least useful for some of the
discriminations;
• While we may doubt that each feature provides independent
information (even though that is a convenient assumption to make
sometimes), intentionally superfluous features have not been
included
• Two key questions arise:
• How does classification accuracy depend upon the dimensionality of the feature space?
• Statistical estimation problems
• Overfitting problems
• How does the computational complexity of the system scale with the dimensionality?
High dimensionality and limited training data size
• The most useful features are the ones for which the
difference between the means is large relative to the
standard deviation.
• How do we tell if a given collection (set) of features is useful
for the problem at hand?
• Bhattacharyya distance
• KL divergence
• Mahalanobis distance
• Jeffries-Matusita distance

• These quantify the degree of separability between classes, or
the upper bound on the probability of error (Bayes error).
• These are hence, in some sense, a “distance” between classes
in the feature space (though they may not all satisfy all the
properties required to qualify as a true distance metric)

• Fusion of different types of information, referred to as
feature fusion, is a good application for Principal
Components Analysis (PCA).
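As one concrete example of the separability measures listed above, here is a minimal sketch of the Bhattacharyya distance between two multivariate Gaussians (the function name is a hypothetical choice; the formula is the standard Gaussian Bhattacharyya distance, whose first term reduces to one eighth of the squared Mahalanobis distance when the covariances are equal):

```python
import numpy as np

def bhattacharyya(mu1, cov1, mu2, cov2):
    """Bhattacharyya distance between two multivariate Gaussians."""
    mu1, mu2 = np.asarray(mu1, float), np.asarray(mu2, float)
    cov = 0.5 * (np.asarray(cov1) + np.asarray(cov2))  # average covariance
    diff = mu2 - mu1
    # Mahalanobis-like term measuring mean separation
    term1 = 0.125 * diff @ np.linalg.inv(cov) @ diff
    # log-determinant term measuring covariance mismatch
    _, logdet = np.linalg.slogdet(cov)
    _, logdet1 = np.linalg.slogdet(cov1)
    _, logdet2 = np.linalg.slogdet(cov2)
    term2 = 0.5 * (logdet - 0.5 * (logdet1 + logdet2))
    return term1 + term2

# Two unit-covariance classes separated by 2 along the first axis
b = bhattacharyya([0, 0], np.eye(2), [2, 0], np.eye(2))
# For equal priors, the Bayes error is bounded above by 0.5 * exp(-b)
bound = 0.5 * np.exp(-b)
```

The exponential of the negative distance yields the Bhattacharyya bound on the Bayes error mentioned above, which is why such measures serve as practical feature-set quality scores.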
• Too many features can lead to a decrease in performance
• Increasing the feature vector dimension can significantly increase the
memory (e.g., the number of elements in the covariance matrix grows as the
square of the dimension of the feature vector) and computational
complexity.
• Good rule of thumb (The “10n rule”): 10 independent data samples for every
parameter to be estimated.
• For practical systems, such as speech or hyperspectral image classification
systems, even this simple rule can result in a need for vast amounts of data.
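The 10n rule can be turned into a quick back-of-the-envelope calculation. A sketch (assuming one full-covariance Gaussian model per class, i.e., d mean parameters plus d(d+1)/2 distinct covariance parameters; the function name is illustrative):

```python
def samples_needed(d, samples_per_param=10):
    """Training samples per class suggested by the 10n rule of thumb for a
    full-covariance Gaussian class model in d dimensions."""
    # d mean entries + d(d+1)/2 distinct entries of a symmetric covariance
    n_params = d + d * (d + 1) // 2
    return samples_per_param * n_params

# Parameter count, and hence the data requirement, grows quadratically with d
for d in (10, 100, 1000):
    print(d, samples_needed(d))
```

At d = 10 the rule asks for a modest 650 samples per class, but at d = 1000 it demands over five million, which is why hyperspectral and speech systems so quickly outgrow available training data.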
[Figure: target detection accuracy vs. number of features (dimensionality), illustrating the
curse of dimensionality. With infinite training samples, accuracy is limited only by the Bayes
error plus the model error; with a finite set of N samples, estimation error is added, and
accuracy peaks and then degrades as more features are included.]

[Figure: overfitting and poor generalization]
• How do we get a handle on this problem of over-dimensionality?
• If the likelihoods are parametrized, simplify (reduce) the parameters, for
example:
• Assume the likelihoods to be Gaussian, if possible
• For a Gaussian likelihood, if needed, assume a diagonal (or even an
identity) covariance matrix
• Assume equal covariance structure across classes (“pooled covariance”)
• Regularize ill-conditioned covariance matrices to stabilize inverse and
determinant estimates (Will be covered later)
• Shrinkage:

Σ_i(α) = ((1 − α) n_i Σ_i + α n Σ) / ((1 − α) n_i + α n)

or

Σ(β) = (1 − β) Σ + β I

where Σ_i is the covariance estimate for class i from its n_i samples, Σ is the pooled
covariance estimate from all n samples, and I is the identity matrix.
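A minimal sketch of the second shrinkage form, Σ(β) = (1 − β)Σ + βI, showing how it restores full rank to a singular sample covariance (toy data and a hypothetical function name, assuming NumPy):

```python
import numpy as np

def shrink_covariance(cov, beta):
    """Shrink a covariance estimate toward the identity:
    cov(beta) = (1 - beta) * cov + beta * I."""
    d = cov.shape[0]
    return (1.0 - beta) * cov + beta * np.eye(d)

# A rank-deficient sample covariance: 2 samples in 3 dimensions
X = np.array([[1.0, 2.0, 3.0],
              [2.0, 1.0, 0.0]])
S = np.cov(X, rowvar=False)        # singular: rank <= n_samples - 1
S_reg = shrink_covariance(S, 0.1)  # full rank, so inverse and determinant exist
```

Since the shrunk matrix has all eigenvalues at least β, its inverse and determinant, which the Gaussian discriminant function needs, are stable even when the raw estimate is not full rank.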
• How do we get a handle on this problem of over-dimensionality (contd. from the last slide)?
• Dimensionality reduction: two key approaches perform the reduction ℜᴺ → ℜᴹ (M < N):
• Feature “selection”: identify a subset of features such that the selected subset
maximizes useful class-specific information
• Transform-based reduction techniques (also known as feature extraction): a
transformation (linear or nonlinear) projects the higher dimensional features
onto a lower dimensional subspace. Examples: Principal Component Analysis (PCA),
Fisher’s Linear Discriminant Analysis (LDA), Stepwise LDA, etc.
In either case, the goal is to identify a lower dimensional subspace (/representation
of the data) that preserves (most of) the useful information or structure of the
data.

Dimensionality Reduction – Signal Representation vs. Classification
• Two criteria can be used to find the “optimal” mapping y=f(x) for feature extraction
• Signal representation: The goal is to represent the samples as accurately as possible in a lower
dimensional subspace (A typical metric that needs to be optimized for such methods is the Mean
Squared Error, MSE – For such tasks, the feature extractor is optimal in a MSE sense)
• Examples: PCA, Kernel PCA etc.
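The signal-representation criterion can be sketched concretely. The following is a minimal, hypothetical PCA implementation via eigendecomposition of the sample covariance (not the lecture's notation): projecting onto the top-m eigenvectors gives the linear m-dimensional representation with minimum reconstruction MSE.

```python
import numpy as np

def pca_project(X, m):
    """Project samples onto the top-m principal components, the minimum-MSE
    linear reconstruction of the centered data."""
    mu = X.mean(axis=0)
    Xc = X - mu
    cov = np.cov(Xc, rowvar=False)          # sample covariance
    evals, evecs = np.linalg.eigh(cov)      # ascending eigenvalues
    order = np.argsort(evals)[::-1]         # sort by decreasing variance
    W = evecs[:, order[:m]]                 # d x m projection matrix
    Y = Xc @ W                              # lower dimensional representation
    X_rec = Y @ W.T + mu                    # reconstruction in original space
    return Y, X_rec

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
X[:, 0] *= 10.0                             # one dominant high-variance direction
Y, X_rec = pca_project(X, 1)
mse = np.mean((X - X_rec) ** 2)             # small: top component captures most variance
```

Note that PCA keeps the highest-variance directions regardless of class labels, which is exactly why the classification criterion below may prefer a different subspace.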
• Classification: The goal is to project the samples in a lower dimensional subspace such that
information specific to classification is maximized, while irrelevant or noisy information is discarded
(A typical metric that needs to be optimized for such methods is the classseparation in the feature
space, for example, the Bhattacharya distance, Fisher’s ratio etc.)
• Examples: Fisher’s LDA, Stepwise LDA, Kernel Discriminant Analysis (KDA), etc.

In the next couple of lectures…
• We will study the following dimensionality reduction techniques:
• Feature selection: feature subset selection methods, such as those based on entropy or
Bhattacharyya distance, etc.
• Linear transformation based projection techniques for dimensionality reduction, including: PCA,
LDA and subspace LDA
• A combination of feature selection and linear transformation based techniques such as Stepwise
LDA
• Towards the end of this course, we will also study a nonlinear transformation based technique
called Kernel Discriminant Analysis.