SPR_LectureHandouts_Chapter_03_Part3



Pattern Recognition, ECE-8443
Chapter 3, Part 3: Parameter estimation of the feature space – Practical aspects
Saurabh Prasad, Electrical and Computer Engineering Department, Mississippi State University

Outline
• When do ML and Bayesian estimation methods differ?
• Different sources of error
• Problems introduced by high-dimensional feature spaces
• Practical aspects of ML estimates when building discriminant functions
  – Handling inverses and determinants of covariance estimates when they are not full rank

Maximum-likelihood versus Bayesian estimation
• For infinite amounts of data the two solutions converge; in practice, however, limited data is always a problem.
• If prior information is reliable, a Bayesian estimate can be superior.
• Bayesian estimates with uniform priors are similar to an ML solution.
• If p(θ|D) is broad or asymmetric around the true value, the two approaches are likely to produce different solutions.

Maximum-likelihood versus Bayesian estimation
• When designing a classifier using these techniques, there are three sources of error:
  – Bayes error (irreducible/indistinguishability error): the error due to overlapping class distributions. It is an inherent property of the problem for the given features and can never be eliminated.
  – Model error: the error due to an incorrect model, i.e., an incorrect assumption about the parametric form.
  – Estimation error: the error arising from the fact that the parameters are estimated from a finite amount of data (unreliable statistical estimates, unstable inverses, etc.).

Maximum-likelihood versus Bayesian estimation
• In the limit of infinite training data the estimation error vanishes, and the total error is the same for both the ML and the Bayesian estimation approaches.
• ML classifiers (where the likelihood function of each class is represented by parameters estimated with ML techniques) are simpler, and they lead to classifiers nearly as accurate as those based on Bayesian estimation of the parameters.

Dimensionality and probability of error
• Consider a two-class problem in which the two likelihoods are multivariate Gaussian with equal covariance but different means, and assume the priors are equal. It can be shown that

    P(error) = (1/√(2π)) ∫_{r/2}^{∞} e^{−u²/2} du,

  where r² = (µ₁ − µ₂)ᵗ Σ⁻¹ (µ₁ − µ₂) is the squared Mahalanobis distance between the class means, and

    lim_{r→∞} P(error) = 0.

• If the features are independent, then Σ = diag(σ₁², σ₂², …, σ_d²) and

    r² = ∑_{i=1}^{d} ((µ_{i1} − µ_{i2}) / σ_i)².

• Hence, for this simple case, each additional feature theoretically contributes to lowering P(error).
• The most useful features are those for which the difference in means is large relative to their variance.
• Each new feature may not add much to the performance, but if r can be increased without limit, the probability of error can "theoretically" be made arbitrarily small, so the performance must improve.
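To make the expression above concrete, here is a minimal Python sketch (numpy/scipy). The class means and shared covariance are made-up values for illustration, not taken from the handout; the computation itself is just the formula above, with P(error) evaluated as the Gaussian upper-tail probability at r/2.

```python
# Bayes error for two equal-covariance, equal-prior Gaussian classes,
# via the Mahalanobis distance r between the class means.
import numpy as np
from scipy.stats import norm

# Hypothetical class statistics (illustrative only).
mu1 = np.array([0.0, 0.0])
mu2 = np.array([1.0, 2.0])
sigma = np.array([[1.0, 0.3],
                  [0.3, 2.0]])            # shared (pooled) covariance

diff = mu1 - mu2
r2 = diff @ np.linalg.solve(sigma, diff)  # r^2 = (mu1-mu2)^T Sigma^{-1} (mu1-mu2)
r = np.sqrt(r2)

# P(error) = (1/sqrt(2*pi)) * integral_{r/2}^{inf} exp(-u^2/2) du
p_error = norm.sf(r / 2.0)                # Gaussian upper-tail probability
print(f"Mahalanobis distance r = {r:.3f}, P(error) = {p_error:.4f}")
```

Increasing r (for instance, by adding a feature whose means differ by much more than its standard deviation) drives P(error) toward zero, matching the limit stated above.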
Dimensionality and probability of error
[Figure slide]

Problem of dimensionality
• For many real-life pattern recognition tasks it is not unusual to have feature spaces that are very high dimensional (10, 100, even 1000 features).
• We hope that each feature is at least useful for some of the discriminations; and while we may doubt that each feature provides independent information (even though that is sometimes a convenient assumption), no intentionally superfluous features have been included.
• Two key questions arise:
  – How does classification accuracy depend on the dimensionality of the feature space? (statistical estimation problems, overfitting problems)
  – How does the computational complexity of the system scale with the dimensionality?

High dimensionality and limited training data size
• The most useful features are the ones for which the difference between the means is large relative to the standard deviation.
• How do we tell whether a given collection (set) of features is useful for the problem at hand? Common measures include:
  – Bhattacharyya distance
  – KL divergence
  – Mahalanobis distance
  – Jeffries-Matusita distance
• These quantify the degree of separability between the classes, or bound the probability of error (Bayes error) from above.
• They are hence, in some sense, a "distance" between classes in the feature space (though they may not all satisfy all the properties required to qualify as a distance). A small numerical sketch of two of these measures appears after the figures below.
• Fusion of different types of information, referred to as feature fusion, is a good application for Principal Components Analysis (PCA).

High dimensionality and limited training data size
• Too many features can lead to a decrease in performance.
• Increasing the feature-vector dimension can significantly increase the memory requirements (e.g., the number of elements in the covariance matrix grows as the square of the dimension of the feature vector) and the computational complexity.
• A good rule of thumb (the "10-n rule"): use 10 independent data samples for every parameter to be estimated.
• For practical systems, such as speech or hyperspectral image classification systems, even this simple rule can imply a need for vast amounts of data.
[Figure: "The curse of dimensionality" – target detection accuracy versus number of features (dimensionality), with curves for ∞ samples, >N samples, and N samples; the ∞-sample curve is limited by Bayes error + model error, while the finite-sample curves also include estimation error.]

High dimensionality and limited training data size
[Figure: overfitting and poor generalization]
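As referenced above, here is a minimal Python sketch of two of the separability measures listed on the earlier slide: the Bhattacharyya distance between two Gaussian class-conditional densities and a Jeffries-Matusita distance derived from it. The class statistics are invented, the function name is mine, and the JM convention used (√(2(1 − e^{−B}))) is one common choice rather than the course's reference formulation.

```python
# Separability between two classes modeled as multivariate Gaussians.
import numpy as np

def bhattacharyya_gaussian(mu1, sigma1, mu2, sigma2):
    """Bhattacharyya distance between two multivariate Gaussians."""
    sigma_avg = 0.5 * (sigma1 + sigma2)
    diff = mu2 - mu1
    term_mean = 0.125 * diff @ np.linalg.solve(sigma_avg, diff)
    # slogdet is used for numerical stability on near-singular covariances.
    _, logdet_avg = np.linalg.slogdet(sigma_avg)
    _, logdet_1 = np.linalg.slogdet(sigma1)
    _, logdet_2 = np.linalg.slogdet(sigma2)
    term_cov = 0.5 * (logdet_avg - 0.5 * (logdet_1 + logdet_2))
    return term_mean + term_cov

# Hypothetical two-class example (illustrative only).
mu1, mu2 = np.array([0.0, 0.0]), np.array([2.0, 1.0])
sigma1 = np.array([[1.0, 0.2], [0.2, 1.0]])
sigma2 = np.array([[1.5, 0.0], [0.0, 0.5]])

b = bhattacharyya_gaussian(mu1, sigma1, mu2, sigma2)
jm = np.sqrt(2.0 * (1.0 - np.exp(-b)))   # saturates for well-separated classes
print(f"Bhattacharyya = {b:.3f}, Jeffries-Matusita = {jm:.3f}")
```

Both quantities grow as the class means move apart relative to the covariances, which is exactly the behavior the bullet above describes for a useful feature set.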
High dimensionality and limited training data size
• How do we get a handle on this problem of over-dimensionality?
• If the likelihoods are parametrized, simplify (reduce) the parameters, for example:
  – Assume the likelihoods to be Gaussian, if possible.
  – For a Gaussian likelihood, if needed, assume a diagonal (or even an identity) covariance matrix.
  – Assume an equal covariance structure across classes ("pooled covariance").
  – Regularize ill-conditioned covariance matrices to stabilize the inverse and determinant estimates (covered later).
  – Shrinkage: shrink each class covariance toward the pooled covariance,

      Σ_i(α) = [(1 − α) n_i Σ_i + α n Σ] / [(1 − α) n_i + α n],

    or shrink a covariance toward the identity,

      Σ(β) = (1 − β) Σ + β I.

    (A small numerical sketch of these estimators appears at the end of this handout.)

High dimensionality and limited training data size
• How do we get a handle on this problem of over-dimensionality (continued from the last slide)?
• Dimensionality reduction. There are two key approaches to performing dimensionality reduction ℜ^N → ℜ^M (M < N):
  – Feature selection: identify a subset of features such that the selected subset maximizes the useful class-specific information.
  – Transform-based reduction techniques (also known as feature extraction): a transformation (linear or nonlinear) projects the higher-dimensional features onto a lower-dimensional subspace. Examples: Principal Component Analysis (PCA), Fisher's Linear Discriminant Analysis (LDA), stepwise LDA, etc.
• In either case, the goal is to identify a lower-dimensional subspace (representation of the data) that preserves most of the useful information or structure of the data.

Dimensionality reduction – signal representation vs. classification
• Two criteria can be used to find the "optimal" mapping y = f(x) for feature extraction:
  – Signal representation: the goal is to represent the samples as accurately as possible in a lower-dimensional subspace. A typical metric to optimize is the mean squared error (MSE); for such tasks the feature extractor is optimal in an MSE sense. Examples: PCA, kernel PCA, etc.
  – Classification: the goal is to project the samples into a lower-dimensional subspace such that information specific to classification is maximized, while irrelevant or noisy information is discarded. A typical metric to optimize is the class separation in the feature space, for example the Bhattacharyya distance or Fisher's ratio. Examples: Fisher's LDA, stepwise LDA, kernel discriminant analysis (KDA), etc.

In the next couple of lectures…
• We will study the following dimensionality reduction techniques:
  – Feature selection: feature subset selection methods, such as those based on entropy or the Bhattacharyya distance.
  – Linear-transformation-based projection techniques for dimensionality reduction, including PCA, LDA and subspace LDA.
  – A combination of feature selection and linear-transformation-based techniques, such as stepwise LDA.
• Towards the end of this course, we will also study a nonlinear-transformation-based technique called Kernel Discriminant Analysis.
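As promised on the shrinkage slide earlier in this handout, here is a minimal numpy sketch of the two shrinkage estimators. The function names and the toy data are my own assumptions for illustration; the formulas are the ones given above, and the example shows how identity shrinkage turns a rank-deficient sample covariance into a full-rank (invertible) matrix.

```python
# Covariance shrinkage: toward the pooled covariance, or toward the identity.
import numpy as np

def shrink_to_pooled(sigma_i, n_i, sigma_pooled, n, alpha):
    """Sigma_i(a) = [(1-a) n_i Sigma_i + a n Sigma] / [(1-a) n_i + a n]."""
    num = (1.0 - alpha) * n_i * sigma_i + alpha * n * sigma_pooled
    den = (1.0 - alpha) * n_i + alpha * n
    return num / den

def shrink_to_identity(sigma, beta):
    """Sigma(b) = (1-b) Sigma + b I."""
    return (1.0 - beta) * sigma + beta * np.eye(sigma.shape[0])

# Example: with fewer samples than dimensions, the sample covariance is
# singular; shrinkage toward the identity restores full rank.
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 10))                 # 5 samples, 10 features
sigma = np.cov(X, rowvar=False)              # rank <= 4, hence not invertible
sigma_reg = shrink_to_identity(sigma, beta=0.1)
print(np.linalg.matrix_rank(sigma), np.linalg.matrix_rank(sigma_reg))
```

This is exactly the situation the "10-n rule" warns about: too few samples per parameter make the raw covariance estimate unusable for discriminant functions that need its inverse and determinant.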
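As a small preview of the transform-based techniques listed on the final slide (PCA in particular), here is a minimal sketch of PCA as an eigendecomposition of the sample covariance, projecting ℜ^N data onto the top-M principal components. The data, dimensions, and function name are assumptions made for illustration, not the course's reference implementation.

```python
# PCA preview: project samples onto the leading principal components.
import numpy as np

def pca_project(X, m):
    """Project rows of X (samples x features) onto the top-m principal components."""
    X_centered = X - X.mean(axis=0)
    cov = np.cov(X_centered, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)           # eigenvalues in ascending order
    top = eigvecs[:, np.argsort(eigvals)[::-1][:m]]  # m leading eigenvectors
    return X_centered @ top                          # (samples x m) scores

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 50))   # 200 samples in a 50-dimensional feature space
Y = pca_project(X, m=5)          # reduced representation in R^5
print(Y.shape)                   # (200, 5)
```

Note that this criterion is purely one of signal representation (minimum MSE); the class-oriented techniques such as LDA, to be covered next, instead optimize class separability.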