Bayes Decision Theory
Discriminant Functions for the Normal Density
CSE 555: Srihari
Discriminant Functions for the Normal Density
We saw that minimum-error-rate classification can be achieved by the discriminant function
gi(x) = ln p(x | ωi) + ln P(ωi)
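As a minimal sketch of this rule for one-dimensional Gaussian class-conditional densities (the function names, the `classes` dictionary, and the particular means, variances, and priors are illustrative assumptions, not from the slides):

```python
import math

def discriminant(x, mu, sigma, prior):
    """g_i(x) = ln p(x | w_i) + ln P(w_i) for a 1-D Gaussian class model."""
    log_pdf = -0.5 * math.log(2 * math.pi * sigma ** 2) - (x - mu) ** 2 / (2 * sigma ** 2)
    return log_pdf + math.log(prior)

def classify(x, classes):
    """classes maps a label to (mu, sigma, prior); pick the class with the largest g_i(x)."""
    return max(classes, key=lambda label: discriminant(x, *classes[label]))

# Two equal-prior classes with unit variance: the boundary falls midway between the means.
classes = {"w1": (0.0, 1.0, 0.5), "w2": (4.0, 1.0, 0.5)}
```

With equal priors and equal variances the decision boundary sits at x = 2, so points below it are assigned to w1 and points above to w2.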
Bayes Decision Theory
Sargur Srihari
CSE 555
Introduction to Pattern Recognition
Reverend Thomas Bayes
1702-1761
Bayes set out his theory of probability in An Essay towards solving a Problem in the Doctrine of Chances, published in the Philosophical Transactions of the Royal Society of London (1763).
Bayes Decision Theory
Minimum-Error-Rate Classification
Classifiers, Discriminant Functions and Decision Surfaces
The Normal Density
Minimum-Error-Rate Classification
Actions are decisions on classes.
If action αi is taken and the true state of nature is ωj, the decision is correct if i = j and in error otherwise.
Maximum-Likelihood and Bayesian
Parameter Estimation
Sufficient Statistics
Common Probability Distributions
Problems of Dimensionality
Sufficient Statistics
Any function of the samples is a statistic.
A sufficient statistic is a (possibly vector-valued) function s of the samples that contains all the information relevant to estimating the parameter θ.
Problems of Dimensionality
Problems involving 50 or 100 features (binary-valued).
Classification accuracy depends upon the dimensionality and the amount of training data.
Discriminant Analysis
1. Fisher Linear Discriminant
2. Multiple Discriminant Analysis
Motivation
Projection that best separates the data in a least-squares sense.
PCA finds components that are useful for representing the data.
However, there is no reason to assume that these components are useful for discriminating between classes.
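A small sketch of the two-class Fisher direction w ∝ S_W⁻¹(m1 − m2), where S_W is the within-class scatter matrix (the function name and the toy data are assumptions for illustration; this is plain Python for 2-D points only):

```python
def fisher_direction(class1, class2):
    """Fisher's linear discriminant for 2-D points: w ~ S_W^-1 (m1 - m2)."""
    def mean(pts):
        return [sum(p[i] for p in pts) / len(pts) for i in range(2)]
    def scatter(pts, m):
        # Within-class scatter: sum of outer products of deviations from the mean.
        s = [[0.0, 0.0], [0.0, 0.0]]
        for p in pts:
            d = [p[0] - m[0], p[1] - m[1]]
            for i in range(2):
                for j in range(2):
                    s[i][j] += d[i] * d[j]
        return s
    m1, m2 = mean(class1), mean(class2)
    s1, s2 = scatter(class1, m1), scatter(class2, m2)
    sw = [[s1[i][j] + s2[i][j] for j in range(2)] for i in range(2)]
    # Invert the 2x2 scatter matrix explicitly.
    det = sw[0][0] * sw[1][1] - sw[0][1] * sw[1][0]
    inv = [[sw[1][1] / det, -sw[0][1] / det],
           [-sw[1][0] / det, sw[0][0] / det]]
    diff = [m1[0] - m2[0], m1[1] - m2[1]]
    return [inv[0][0] * diff[0] + inv[0][1] * diff[1],
            inv[1][0] * diff[0] + inv[1][1] * diff[1]]
```

For two clusters separated along the x1 axis with identical scatter, the recovered direction points along x1, unlike PCA, which would pick whichever axis has the largest overall variance.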
Hidden Markov Models
and
Sequential Data
Sequential Data
Often arise through measurement of time series:
Snowfall measurements on successive days in Buffalo
Rainfall measurements in Cherrapunji
Daily values of currency exchange rate
Acoustic features
Maximum-Likelihood and Bayesian
Parameter Estimation
Expectation Maximization (EM)
Estimating Missing Feature Value
Estimating a missing variable with known parameters.
In the absence of x1, the most likely class is ω2.
Choose that value of x1 which makes the observed data most likely.
Component Analysis and Discriminants
Reducing dimensionality when:
1. Classes are disregarded
Principal Component Analysis (PCA)
2. Classes are considered
Discriminant Analysis
Component Analysis vs Discriminants
Two classical approaches for finding effective linear transformations: PCA (directions efficient for representation) and discriminant analysis (directions efficient for discrimination).
Bayesian Belief Networks
Compound Bayesian Decision Theory
Bayesian Belief Networks
In certain situations, statistical properties are not directly expressed by a parameter vector but by causal relationships among variables.
Useful Probability Distributions
Standard Normal Distribution
Binomial
Multinomial
Hypergeometric
Poisson
Beta Binomial
Student's t
Beta
Gamma
Dirichlet
Multivariate Normal and Correlation
Standard Normal Distribution
X ~ N(μ, σ²)
Standardization:
Z = (X − μ) / σ
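The standardization above can be sketched in a few lines (the function name and the N(10, 2²) example are illustrative assumptions):

```python
import random

def standardize(x, mu, sigma):
    """Z = (X - mu) / sigma maps X ~ N(mu, sigma^2) to the standard normal N(0, 1)."""
    return (x - mu) / sigma

# Standardizing samples drawn from N(10, 2^2) yields values whose sample mean
# is near 0 and whose spread is near 1.
rng = random.Random(0)
zs = [standardize(rng.gauss(10.0, 2.0), 10.0, 2.0) for _ in range(10000)]
mean_z = sum(zs) / len(zs)
```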
Introduction to Pattern Recognition
Sargur N. Srihari
[email protected]
Dept. of Computer Science & Engineering
State University of New York at Buffalo
What is a Pattern?
A pattern is the opposite of chaos; it is an entity, vaguely defined, that could be given a name.
CSE555: Introduction to Pattern Recognition
Spring, 2007
Mid-Term Exam (with solutions)
(100 points, Closed book/notes)
The last page contains some formulas that might be useful.
1. Part(i) (10 pts)
Suppose a bank classifies customers as either good or bad
Maximum-Likelihood and Bayesian
Parameter Estimation (part 2)
Bayesian Estimation
Bayesian Parameter Estimation: Gaussian Case
Bayesian Parameter Estimation: General Estimation
Bayesian Estimation
The parameter θ is a random variable.
Computation of the posterior probabilities lies at the heart of Bayesian estimation.
Maximum-Likelihood & Bayesian
Parameter Estimation
Maximum Likelihood Versus Bayesian
Parameter Estimation
An optimal classifier can be designed knowing the priors P(ωi) and the class-conditional densities p(x | ωi).
Obtain them from training samples, assuming known forms for the pdfs, e.g., p(x | ωi) ~ N(μi, Σi).
Bayes Decision Theory
Bayes Error Rate and Error Bounds
Receiver Operating Characteristics
Discrete Features
Missing Features
Example of Bayes Decision Boundary
Two Gaussian distributions each with four data points
[Figure: the two sets of sample points plotted in the (x1, x2) plane, with the estimated means and the resulting decision boundary.]
Nearest Neighbor Classification
The Nearest-Neighbor Rule
Error Bounds
k-Nearest Neighbor Rule
Computational Considerations
Example of Nearest Neighbor Rule
Two-class problem: yellow triangles and blue squares.
The circle represents the unknown sample x.
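The nearest-neighbor rule in the example can be sketched as follows (the function name and the toy coordinates standing in for the triangles-and-squares figure are assumptions):

```python
def nearest_neighbor(x, training):
    """training is a list of (point, label) pairs; assign x the label of the
    closest training point under squared Euclidean distance."""
    def sq_dist(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q))
    return min(training, key=lambda pair: sq_dist(pair[0], x))[1]

# Toy version of the figure: triangles and squares in the plane.
train = [((0, 0), "triangle"), ((0, 1), "triangle"),
         ((5, 5), "square"), ((6, 5), "square")]
```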
Artificial Neural Networks
Sargur Srihari
Role of ANNs in Pattern Recognition
Generalization of linear discriminant functions, which have limited capability.
SVMs need a proper choice of kernel functions.
Three- and four-layer nets overcome these limitations.
Notion of Distance
Metric Distance
Binary Vector Distances
Tangent Distance
Distance Measures
Many pattern recognition/data mining techniques are
based on similarity measures between objects
e.g., nearest-neighbor classification, cluster analysis, multidimensional scaling
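As one concrete binary-vector distance from the list of topics above, the Hamming distance counts differing positions (the function name is an assumption for illustration):

```python
def hamming(u, v):
    """Hamming distance: the number of positions at which two equal-length
    binary vectors differ.  It satisfies the metric properties."""
    if len(u) != len(v):
        raise ValueError("vectors must have equal length")
    return sum(a != b for a, b in zip(u, v))
```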
Nonmetric Methods
Recognition with Strings
Pattern Recognition
CSE 555/655
Attribute Lists
Nominal Data
No natural notion of similarity or ordering
E.g., a fruit = (red, shiny, sweet, small).
Distance between such vectors cannot be measured, so nearest-neighbor rules cannot be applied directly.
String Matching
Whether a candidate string x is a factor (substring) of a given text
Definitions
Length[x] = |x|
|text| > |x|
Alphabet A:
Binary, A = {0, 1}
Decimal, A = {0, 1, 2, …, 9}
English letters, A = {a, b, c, …, z}
DNA bases, A = {A, G, C, T}
Naïve String Matching
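Naïve matching tests the |x| characters of the candidate at each of the |text| − |x| + 1 possible shifts; a minimal sketch (the function name is an assumption):

```python
def naive_string_match(text, x):
    """Return every shift s at which candidate x occurs as a factor of text,
    by comparing x character-by-character at each possible shift."""
    n, m = len(text), len(x)
    return [s for s in range(n - m + 1) if text[s:s + m] == x]
```

The worst-case cost is O((n − m + 1) · m) character comparisons, which motivates the faster string-matching algorithms.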
Combining Classifiers
Sargur Srihari
[email protected]
Synonyms for the Topic
Mixture of Experts
Ensemble Classifiers
Modular Classifiers
Pooled Classifiers
Combining Models
Typical Scenario
Application in Biometrics: Several Modalities
Two Approaches
Linear Discriminant Functions:
Gradient Descent and Perceptron
Convergence
The Two-Category Linearly Separable Case (5.4)
Minimizing the Perceptron Criterion Function (5.5)
Role of Linear Discriminant Functions
A discriminative approach, as opposed to the generative approach of first estimating class-conditional densities.
Estimating Misclassification Probability
Estimating and Comparing Classifiers
Cross-validation: divide the training set into m disjoint sets of equal size.
The classifier is trained m times, each time holding out one of the sets for testing.
Estimated performance is the mean of the m error rates.
Confidence Intervals
If the true but unknown error rate of the classifier is p, and k of the n independent test samples are misclassified, then k has a binomial distribution, which yields confidence intervals for p.
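The m-fold procedure above can be sketched as follows (the function names, the fold-assignment scheme, and the toy majority-vote base learner are all illustrative assumptions):

```python
import random

def cross_validation_error(data, m, train_and_test):
    """m-fold cross-validation: split data into m disjoint folds, train on
    m - 1 folds, test on the held-out fold, and average the m error rates."""
    data = list(data)
    random.Random(0).shuffle(data)          # fixed seed keeps the folds reproducible
    folds = [data[i::m] for i in range(m)]
    errors = []
    for i in range(m):
        held_out = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        errors.append(train_and_test(train, held_out))
    return sum(errors) / m

# Toy base learner: always predict the majority label of the training folds.
def majority_rule(train, test):
    labels = [y for _, y in train]
    guess = max(set(labels), key=labels.count)
    return sum(1 for _, y in test if y != guess) / len(test)
```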
Bias and Variance
Bias of ML Estimate of Variance
For a Gaussian distribution, maximum likelihood
estimates for mean and variance are
μ_ML = (1/N) Σ_{n=1}^{N} x_n
σ²_ML = (1/N) Σ_{n=1}^{N} (x_n − μ_ML)²
The ML estimate systematically underestimates the variance:
E[σ²_ML] = ((N − 1)/N) σ²
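The bias can be checked empirically by averaging σ²_ML over many small samples (the function name, sample size, and trial count are assumptions chosen for the demonstration):

```python
import random

def ml_variance(sample):
    """sigma^2_ML = (1/N) sum_n (x_n - mu_ML)^2 -- note the divisor N, not N - 1."""
    n = len(sample)
    mu = sum(sample) / n
    return sum((x - mu) ** 2 for x in sample) / n

# Average sigma^2_ML over many size-N samples from N(0, 1).  The average settles
# near (N - 1)/N = 0.8 rather than the true variance 1, matching
# E[sigma^2_ML] = ((N - 1)/N) sigma^2.
rng = random.Random(0)
N, trials = 5, 20000
avg = sum(ml_variance([rng.gauss(0.0, 1.0) for _ in range(N)])
          for _ in range(trials)) / trials
```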
Bagging and Boosting:
Resampling for Classifier Design
Sargur Srihari
[email protected]
Bagging
Arcing (adaptive re-weighting and combining) refers to reusing or selecting data to improve classification.
It includes both bagging and boosting.
Bagging is the simplest approach: train component classifiers on bootstrap replicates of the training set and combine their outputs by voting.
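A minimal sketch of bagging with majority voting (the function names, the scalar 1-NN base learner, and the toy data are illustrative assumptions):

```python
import random

def fit_1nn(train):
    """Base learner: 1-nearest-neighbor on scalar features."""
    def predict(x):
        return min(train, key=lambda pair: abs(pair[0] - x))[1]
    return predict

def bagged_predict(train, x, n_models, fit, seed=0):
    """Bagging: fit n_models base learners, each on a bootstrap replicate
    (sampled with replacement) of the training set, then majority-vote."""
    rng = random.Random(seed)
    votes = []
    for _ in range(n_models):
        boot = [rng.choice(train) for _ in range(len(train))]
        votes.append(fit(boot)(x))
    return max(set(votes), key=votes.count)

train = [(0.0, "a"), (1.0, "a"), (9.0, "b"), (10.0, "b")]
```

Each bootstrap replicate has the same size as the training set but omits some points and duplicates others, so the component classifiers differ; voting averages out their individual instabilities.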
Non-Parametric Techniques
Generative and Discriminative Methods
Density Estimation
Parametric versus Non-parametric
1. Parametric: assumed densities are unimodal (have a single local maximum), whereas practical problems often involve multi-modal densities.
2. Non-parametric: procedures usable with arbitrary distributions, without assuming the forms of the underlying densities.
Linear Discriminant Functions
Linear Discriminant Functions and Decisions Surfaces
Generalized Linear Discriminant Functions
Introduction
Parametric Methods
Underlying pdfs are assumed known; training samples are used to estimate the pdf parameters.
Linear discriminant functions, in contrast, assume a known form for the discriminant function itself and use the samples to estimate its parameters.