Prepared by Prof. Hui Jiang (CSE6328) 120201
CSE6328 3.0 Speech & Language Processing, No. 5
Pattern Classification (III) & Pattern Verification
Prof. Hui Jiang
Department of Computer Science and Engineering
York University

Model Parameter Estimation
· Maximum Likelihood (ML) Estimation:
– ML method: the most popular model estimation method
– EM (Expectation-Maximization) algorithm
– Examples:
• Univariate Gaussian distribution
• Multivariate Gaussian distribution
• Multinomial distribution
• Gaussian mixture model
• Markov chain model: n-gram for language modeling
• Hidden Markov Model (HMM)
Alternative model estimation methods:
· Discriminative Training
– Maximum Mutual Information (MMI)
– Minimum Classification Error (MCE)
· Bayesian Model Estimation: Bayesian theory
· MDI (Minimum Discrimination Information)
Discriminative Training (I): Maximum Mutual Information Estimation (1)
· The model is viewed as a noisy data-generation channel: class id ω → observation feature X.
· Determine the model parameters to maximize the mutual information between ω and X (i.e., make the relation between ω and X as close as possible):

{λ_1 … λ_N}_MMI = arg max_{λ_1…λ_N} I(ω, X)

I(ω, X) = Σ_ω Σ_X p(ω, X) log_2 [ p(ω, X) / ( p(ω) p(X) ) ]
        = Σ_ω Σ_X p(ω, X) log_2 [ p(X | ω) / p(X) ]
        = Σ_ω Σ_X p(ω, X) log_2 [ p(X | λ_ω) / Σ_ω' p(X | λ_ω') ]

Discriminative Training (I): Maximum Mutual Information Estimation (2)
· Difficulty: the joint distribution p(ω, X) is unknown.
· Solution: collect a representative training set (X_1, ω_1), (X_2, ω_2), …, (X_T, ω_T) to approximate the joint distribution:

{λ_1 … λ_N}_MMI = arg max_{λ_1…λ_N} Σ_ω Σ_X p(ω, X) log_2 [ p(X | λ_ω) / Σ_ω' p(X | λ_ω') ]
               ≈ arg max_{λ_1…λ_N} Σ_{t=1}^{T} log_2 [ p(X_t | λ_{ω_t}) / Σ_ω p(X_t | λ_ω) ]

· Optimization:
– Iterative gradient-ascent method
– Growth-transformation method
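The empirical MMI objective above can be sketched numerically. A minimal Python sketch, assuming hypothetical likelihood values p(X_t | λ_ω) stored in a table; a real system would compute these from Gaussian, GMM, or HMM class models:

```python
import math

def mmi_objective(likelihoods, labels):
    """Empirical MMI objective: sum over training pairs (X_t, omega_t) of
    log2[ p(X_t | lambda_{omega_t}) / sum_omega p(X_t | lambda_omega) ].
    `likelihoods[t][w]` holds the (hypothetical) value p(X_t | lambda_w)."""
    total = 0.0
    for lik_t, w_t in zip(likelihoods, labels):
        total += math.log2(lik_t[w_t] / sum(lik_t))
    return total

# Two samples, two classes: made-up likelihood values for illustration.
liks = [[0.8, 0.2], [0.1, 0.9]]
labels = [0, 1]
print(mmi_objective(liks, labels))
```

Maximizing this objective pushes each correct-class likelihood to dominate the sum over all classes, which is what makes the criterion discriminative.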
Discriminative Training (II): Minimum Classification Error Estimation (1)
· In an N-class pattern classification problem, given a set of training data D = {(X_1, ω_1), (X_2, ω_2), …, (X_T, ω_T)}, estimate the model parameters of all classes to minimize the total number of classification errors in D.
– MCE: minimize empirical classification errors
· Objective function: the total number of classification errors in D
– For each training sample (X_t, ω_t), define a misclassification measure:

d(X_t, ω_t) = − p(ω_t) p(X_t | λ_{ω_t}) + max_{ω_t' ≠ ω_t} p(ω_t') p(X_t | λ_{ω_t'})

or

d(X_t, ω_t) = − ln[ p(ω_t) p(X_t | λ_{ω_t}) ] + max_{ω_t' ≠ ω_t} ln[ p(ω_t') p(X_t | λ_{ω_t'}) ]

If d(X_t, ω_t) > 0: X_t is incorrectly classified → 1 error.
If d(X_t, ω_t) ≤ 0: X_t is correctly classified → 0 errors.

Discriminative Training (II): Minimum Classification Error Estimation (2)
· Approximate d(X_t, ω_t) by a differentiable function:

d(X_t, ω_t) ≈ − p(ω_t) p(X_t | λ_{ω_t}) + (1/η) ln[ (1/(N−1)) Σ_{ω_t' ≠ ω_t} exp( η · p(ω_t') p(X_t | λ_{ω_t'}) ) ]

or

d(X_t, ω_t) ≈ − ln[ p(ω_t) p(X_t | λ_{ω_t}) ] + (1/η) ln[ (1/(N−1)) Σ_{ω_t' ≠ ω_t} exp( η · ln( p(ω_t') p(X_t | λ_{ω_t'}) ) ) ]

where η > 1. (As η grows, the soft-max term approaches the hard max over competing classes.)
Discriminative Training (II): Minimum Classification Error Estimation (3)
· The error count for one sample (X_t, ω_t) is H(d(X_t, ω_t)), where H(·) is the step function.
· Total number of errors in the training set:

Q(Λ) = Σ_{t=1}^{T} H(d(X_t, ω_t))

· The step function is not differentiable, so it is approximated by a sigmoid function, giving the smoothed total error in the training set:

Q(Λ) ≈ Q'(Λ) = Σ_{t=1}^{T} l(d(X_t, ω_t)),  where l(d) = 1 / (1 + e^{−a·d})

and a > 0 is a parameter that controls the shape of the sigmoid.

Discriminative Training (II): Minimum Classification Error Estimation (4)
· MCE estimation of the model parameters of all classes:

{λ_1 … λ_N}_MCE = arg min_{λ_1…λ_N} Q'(λ_1 … λ_N)

· Optimization: no closed-form solution is available
– Iterative gradient-descent method
– GPD (generalized probabilistic descent) method:

λ_i^{(n+1)} = λ_i^{(n)} − ε · ∂Q'(λ_1 … λ_N)/∂λ_i |_{λ_i = λ_i^{(n)}}
The MCE/GPD Method
· Find initial model parameters, e.g., the ML estimates.
· Derive the gradient of the objective function.
· Evaluate the gradient at the current model parameters.
· Update the model parameters:

λ_i^{(n+1)} = λ_i^{(n)} − ε · ∂Q'(λ_1 … λ_N)/∂λ_i |_{λ_i = λ_i^{(n)}}

· Iterate until convergence.

How to calculate gradient?

∂Q'(λ_1 … λ_N)/∂λ_i = Σ_{t=1}^{T} ∂l[d(X_t, ω_t)]/∂λ_i
                    = Σ_{t=1}^{T} (∂l(d)/∂d) · (∂d(X_t, ω_t)/∂λ_i)
                    = Σ_{t=1}^{T} a · l(d) · [1 − l(d)] · ∂d(X_t, ω_t)/∂λ_i

· The key practical issue in MCE/GPD is how to set a proper step size experimentally.
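The update loop above can be sketched on a toy problem. A minimal sketch, assuming two classes modeled by unit-variance 1-D Gaussians with unknown means, so the log-form misclassification measure reduces to d = (x − μ_correct)²/2 − (x − μ_other)²/2; the data values, sigmoid slope a, step size ε, and iteration count are all made-up illustration settings:

```python
import math

def mce_gpd(data, mus, a=1.0, eps=0.1, iters=200):
    """MCE/GPD for a toy 2-class problem with unit-variance 1-D Gaussians.
    data: list of (x, label); mus: initial class means (modified in place).
    Misclassification measure (log form, single competing class):
        d = -ln p(x|mu_c) + ln p(x|mu_w) = (x-mu_c)**2/2 - (x-mu_w)**2/2
    smoothed by the sigmoid l(d) = 1/(1 + exp(-a*d))."""
    for _ in range(iters):
        grad = [0.0, 0.0]
        for x, c in data:
            w = 1 - c                                  # the competing class
            d = 0.5*(x - mus[c])**2 - 0.5*(x - mus[w])**2
            l = 1.0/(1.0 + math.exp(-a*d))             # smoothed error count
            coef = a*l*(1.0 - l)                       # dl/dd
            grad[c] += coef * (-(x - mus[c]))          # dd/dmu_correct
            grad[w] += coef * (x - mus[w])             # dd/dmu_competitor
        mus[0] -= eps*grad[0]                          # GPD update step
        mus[1] -= eps*grad[1]
    return mus

data = [(-1.2, 0), (-0.8, 0), (0.9, 1), (1.1, 1)]
mus = mce_gpd(data, [0.5, -0.5])   # deliberately bad initial means
print(mus)  # the means move so each sample scores its own class higher
```

Note how the sigmoid term l(d)·[1 − l(d)] weights each sample: samples near the decision boundary contribute most, while confidently classified (or hopeless) samples contribute little, which matches the gradient formula on the slide.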
Overtraining (Overfitting)
· A low classification error rate on the training set does not always lead to a low error rate on a new test set, due to overtraining.
[Figure: classification error (in %) and the objective function plotted against training iterations]

Measuring Performance of MCE
· When to stop: monitor three quantities during MCE/GPD training:
– The objective function
– The error rate on the training set
– The error rate on the test set
Bayesian Theory
· Bayesian methods view model parameters as random variables having some known prior distribution (prior specification).
– Specify the prior distribution of the model parameters θ as p(θ).
· Training data D allow us to convert the prior distribution into a posterior distribution (Bayesian learning):

p(θ | D) = p(θ) · p(D | θ) / p(D) ∝ p(θ) · p(D | θ)

· We infer or decide everything solely based on the posterior distribution (Bayesian inference):
– Model estimation: MAP (maximum a posteriori) estimation
– Pattern classification: Bayesian classification
– Sequential (online, incremental) learning
– Others: prediction, model selection, etc.

Bayesian Learning
[Figure: prior p(θ), likelihood p(D | θ), and posterior p(θ | D) as curves over θ, with θ_MAP at the posterior peak and θ_ML at the likelihood peak]
The MAP Estimation of Model Parameters
· Make a point estimate of θ based on the posterior distribution:

θ_MAP = arg max_θ p(θ | D) = arg max_θ p(θ) · p(D | θ)

· Then θ_MAP is treated as the estimate of the model parameters (just like the ML estimate). Sometimes the EM algorithm is needed to derive it.
· MAP estimation optimally combines prior knowledge with the new information provided by the data.
· MAP estimation is used in speech recognition to adapt speech models to a particular speaker, e.g., to cope with various accents:
– Start from a generic speaker-independent speech model (the prior).
– Collect a small set of data from the particular speaker.
– The MAP estimate gives a speaker-adaptive model that suits this particular speaker better.

Bayesian Classification
· Assume we have N classes ω_i (i = 1, 2, …, N); each class has a class-conditional pdf p(X | ω_i, θ_i) with parameters θ_i.
· The prior knowledge about θ_i is included in a prior p(θ_i).
· For each class ω_i, we have a training data set D_i.
· Problem: classify an unknown observation Y into one of the classes.
· Bayesian classification is done as:

ω_Y = arg max_i p(Y | D_i) = arg max_i ∫ p(Y | ω_i, θ_i) · p(θ_i | D_i) dθ_i

where

p(θ_i | D_i) = p(θ_i) · p(D_i | ω_i, θ_i) / p(D_i) ∝ p(θ_i) · p(D_i | ω_i, θ_i)
Recursive Bayes Learning (Sequential Bayesian Learning)
· Bayesian theory provides a framework for online learning (a.k.a. incremental learning, adaptive learning).
· When we observe training data one at a time, we can dynamically adjust the model to learn incrementally from the data.
· Assume we observe the training data set D = {X_1, X_2, …, X_n} one sample at a time:

p(θ) → (X_1) → p(θ | X_1) → (X_2) → p(θ | X_1, X_2) → … → p(θ | D^(n))

· Learning rule at each stage: posterior ∝ prior × likelihood. The posterior after each observation summarizes the current knowledge about the model and serves as the prior for the next observation.

How to specify priors
· Non-informative priors
– In case we don't have enough prior knowledge, just use a flat prior at the beginning.
· Conjugate priors: for computational convenience
– For some models whose probability functions are a reproducing density, we can choose the prior in a special form (called the conjugate prior), so that after Bayesian learning the posterior has exactly the same functional form as the prior, except that all its parameters are updated.
– Not every model has a conjugate prior.
Conjugate Prior
· For a univariate Gaussian model with only the mean unknown:

p(x | ω_i) = N(x | μ, σ²) = (1 / √(2πσ²)) · exp[ −(x − μ)² / (2σ²) ]

· Choose the prior to be a Gaussian distribution (the Gaussian's conjugate prior for its mean is Gaussian):

p(μ) = N(μ | μ_0, σ_0²) = (1 / √(2πσ_0²)) · exp[ −(μ − μ_0)² / (2σ_0²) ]

· After observing a new data point x_1, the posterior is still Gaussian:

p(μ | x_1) = N(μ | μ_1, σ_1²) = (1 / √(2πσ_1²)) · exp[ −(μ − μ_1)² / (2σ_1²) ]

where

μ_1 = ( σ_0² / (σ_0² + σ²) ) · x_1 + ( σ² / (σ_0² + σ²) ) · μ_0
σ_1² = σ_0² σ² / (σ_0² + σ²)

The Sequential MAP Estimate of Gaussian
· For a univariate Gaussian with unknown mean, the MAP estimate of the mean after observing x_1:

μ_1 = ( σ_0² / (σ_0² + σ²) ) · x_1 + ( σ² / (σ_0² + σ²) ) · μ_0

· After observing the next data point x_2:

μ_2 = ( σ_1² / (σ_1² + σ²) ) · x_2 + ( σ² / (σ_1² + σ²) ) · μ_1
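The sequential update above can be coded directly, since every step reuses the same closed-form Gaussian posterior. A minimal sketch with made-up observations and a hypothetical broad prior N(0, 10):

```python
def posterior_update(mu_prior, var_prior, x, var_lik):
    """One step of Bayesian learning for the mean of a Gaussian with known
    variance var_lik: a Gaussian prior goes in, a Gaussian posterior comes
    out (conjugacy), using the slide's update formulas:
        mu_post  = var_prior/(var_prior+var_lik) * x
                 + var_lik/(var_prior+var_lik) * mu_prior
        var_post = var_prior*var_lik / (var_prior + var_lik)"""
    k = var_prior / (var_prior + var_lik)
    return k*x + (1.0 - k)*mu_prior, var_prior*var_lik/(var_prior + var_lik)

# Observe data one sample at a time, starting from a broad prior N(0, 10).
mu, var = 0.0, 10.0
for x in [2.1, 1.9, 2.0, 2.2]:
    mu, var = posterior_update(mu, var, x, var_lik=1.0)
print(mu, var)  # mean is pulled toward the data; variance keeps shrinking
```

Because the updates are conjugate, feeding the samples one by one gives exactly the same posterior as a single batch update on all four samples, which is the point of recursive Bayes learning.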
Pattern Verification
· For an unknown pattern/object P, we can observe/measure some features X of the pattern P.
· Based on the features X, we need to answer a binary question (Yes/No) regarding P.
· Example of pattern verification: speaker ID verification
– A user claims his/her ID to be abc.
– The system prompts and records some voice X from the user.
– Based on the voice X, the system decides whether the user is abc or not (voiceprints for security).
· Pattern verification can be viewed as a two-class classification problem, but it is better not to treat it this way.
· A more proper view is to cast it as a statistical hypothesis testing problem.

Statistical Hypothesis Testing (I)
· In statistics, we often need to test a hypothesis based on some observed data. The problem is formulated as a test between two complementary hypotheses:
– H0: the null hypothesis
– H1: the alternative hypothesis
· Example: given a random sample {X_1, …, X_n} from a Gaussian distribution N(μ, σ²), where the variance σ² is known, verify whether the mean equals a given value μ_0 or not. Thus we test:
– H0: μ = μ_0 against H1: μ ≠ μ_0
· In hypothesis testing, there are two types of errors:
– Type I: false rejection error; falsely reject H0 when H0 is true.
– Type II: false alarm error; falsely accept H0 when H1 is true.
Statistical Hypothesis Testing (II)
· In essence, a hypothesis test partitions the observation space into two disjoint parts, C and U. When an observation X lies in the region C, we reject H0; when X lies in U, we accept H0. C is called the critical region (or rejection region).
· The Type I error probability (also called the significance level) of a test:

α = Pr(E_1) = Pr(X ∈ C | H_0)

· The Type II error probability of a test:

β = Pr(E_2) = Pr(X ∈ U | H_1) = 1 − Pr(X ∈ C | H_1) = 1 − γ

where γ = Pr(X ∈ C | H_1) is defined as the power of the test.
· At significance level α, the most powerful test is the one that maximizes the power γ (and in turn minimizes the Type II error β).

Statistical Hypothesis Testing (III)
· A hypothesis can be simple or composite:
– Simple hypothesis: completely specifies the distribution, e.g., H_0: θ = θ_0
– Composite hypothesis: involves a region or interval, e.g., H_1: θ ≠ θ_0 or H_1: θ > θ_0
Statistical Hypothesis Testing (IV)
· Neyman-Pearson Theorem:
– For a simple H0 and a simple H1, suppose the distributions under both hypotheses are known, i.e., f_0(X | θ_0) and f_1(X | θ_1). Given i.i.d. observation data D = {X_1, …, X_T}, for any significance level α, the most powerful test is:

If LR = [ Π_{t=1}^{T} f_0(X_t | θ_0) ] / [ Π_{t=1}^{T} f_1(X_t | θ_1) ] > τ, accept H0; otherwise reject H0.

The threshold τ is adjusted to make the significance level of the test equal α. The ratio is called the likelihood ratio (LR); if both pdfs have the same form, the only difference is their parameters.

Statistical Hypothesis Testing (V)
· The Neyman-Pearson theorem provides a method of constructing the most powerful test for simple hypotheses when the distribution of the observation is known.
· What if the hypothesis is composite?
· Likelihood Ratio Test (LRT): assume the distributions are known except for some parameters:

If T = [ max_{θ ∈ H_0} f_{H_0}(X | θ) ] / [ max_{θ ∈ H_1 ∪ H_0} f_{H_1}(X | θ) ] > τ, accept H0; otherwise reject H0.

– The LRT is not always uniformly most powerful, but it has some desirable properties.
– The distribution of T, p(T), is complicated and only computable asymptotically.
– The LRT is widely used in many practical applications.
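The Neyman-Pearson test is easy to sketch for two simple Gaussian hypotheses about a mean with known variance. A minimal sketch; the means, sample size, threshold, and trial count are made-up illustration values, and the two error probabilities are estimated by simulation rather than derived analytically:

```python
import math
import random

def lr_test(samples, mu0, mu1, sigma, tau=1.0):
    """Neyman-Pearson test between two simple hypotheses about a Gaussian
    mean (known sigma): accept H0 iff prod_t f0(x_t)/f1(x_t) > tau.
    Computed in the log domain for numerical stability."""
    log_lr = sum((-(x - mu0)**2 + (x - mu1)**2) / (2*sigma**2)
                 for x in samples)
    return log_lr > math.log(tau)

random.seed(0)
trials = 2000
# Type I error estimate: H0 (mean 0) is true but the test rejects it.
false_rej = sum(not lr_test([random.gauss(0, 1) for _ in range(20)], 0, 1, 1)
                for _ in range(trials)) / trials
# Type II error estimate: H1 (mean 1) is true but the test accepts H0.
false_acc = sum(lr_test([random.gauss(1, 1) for _ in range(20)], 0, 1, 1)
                for _ in range(trials)) / trials
print(false_rej, false_acc)  # both small: 20 samples separate the means well
```

Raising τ trades one error for the other: a larger τ rejects H0 more often, lowering the Type II error at the cost of the Type I error, exactly the α-versus-β trade-off described above.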
Pattern Verification as Statistical Hypothesis Testing
· Based on the question to be answered, design two complementary hypotheses:
– The null hypothesis H0 corresponds to the answer YES.
– The alternative hypothesis H1 corresponds to NO.
· The feature distribution under either H0 or H1 is unknown.
· Training: apply the same idea of data modeling:
– Choose a proper statistical model for each of H0 and H1.
– The model parameters are estimated from training samples collected under H0 or H1.
· Decision: use the likelihood ratio test (LRT):

If T = f_0(X | θ̂_0) / f_1(X | θ̂_1) > τ, answer YES; otherwise NO.

where f_0(·) is the model chosen for H0, f_1(·) the model for H1, and θ̂_0, θ̂_1 are parameters estimated from data.

Distributions of LR
[Figure: the two densities g(T | H_0) and g(T | H_1) over the test statistic T, with the threshold τ separating the acceptance region U from the critical region C]
Pattern Verification
· More generally, T can be any test statistic computed from the observation data.
– The LRT is a special case of such a T.
· Given a test statistic T, we cannot minimize both the Type I error and the Type II error at the same time.
· Verification can be improved by choosing a different test statistic:
– If the distributions of T under H0 and H1 overlap less, we get better separation and thus better verification accuracy (smaller Type I and Type II errors).
· The key in designing a pattern verifier is to find a test statistic T, and its corresponding parameters, such that the overlap between the two distributions is minimized.
· What does better verification accuracy mean? Smaller:
– Type I error (false rejection error)
– Type II error (false alarm error)

Evaluating Verification (I)
· The two error probabilities as functions of the decision threshold τ:

α = ∫_{−∞}^{τ} g(T | H_0) dT   (Type I error, false rejection)
β = ∫_{τ}^{∞} g(T | H_1) dT   (Type II error, false alarm)

· The total error α + β varies with τ; the threshold can be set at the minimum-total-error point, or at the equal-error (EE) point where α = β.
[Figure: the Type I error, Type II error, and total error α + β plotted against the threshold τ, marking the minimum-total-error threshold and the equal-error threshold]
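The equal-error point can be found by sweeping the threshold over the observed scores. A minimal sketch with hypothetical score samples for the two hypotheses; a real evaluation would use large held-out trial lists:

```python
def equal_error_rate(scores_h0, scores_h1):
    """Sweep the threshold tau over all observed scores and return the point
    where the false-rejection rate (Type I: H0 scores below tau) and the
    false-alarm rate (Type II: H1 scores at or above tau) are closest.
    A higher score means more H0-like, as with the LRT statistic T."""
    best = None
    for tau in sorted(scores_h0 + scores_h1):
        fr = sum(s < tau for s in scores_h0) / len(scores_h0)
        fa = sum(s >= tau for s in scores_h1) / len(scores_h1)
        if best is None or abs(fr - fa) < abs(best[1] - best[2]):
            best = (tau, fr, fa)
    return best

# Hypothetical, well-separated score samples for the two hypotheses.
h0 = [2.0, 2.5, 3.0, 3.5]   # scores from true-target trials
h1 = [0.0, 0.5, 1.0, 1.5]   # scores from impostor trials
tau, fr, fa = equal_error_rate(h0, h1)
print(tau, fr, fa)  # perfectly separated scores give fr = fa = 0
```

Sweeping τ over the full score range and recording (fa, fr) pairs instead of just the closest pair is exactly how the ROC curve on the next slide is traced.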
Evaluating Verification (II): ROC Curve (Receiver Operating Characteristic)
[Figure: ROC curves plotting the false rejection error (Type I) against the false alarm error (Type II), each axis from 0% to 100%; a better system's curve lies closer to the origin than a not-so-good system's, and the equal-error performance is where a curve crosses the diagonal]

Speaker Verification (SV)
[Figure: a call-center scenario; the system prompts "What is your account number?" (53020312302390) and "What is your secret passphrase?" ("Open Sesame"), and a speaker verification server checks the answers against the customer's voice model]
Example (I): Speaker Verification (1)
· Speaker verification: verify a user's ID based on his/her voice. The user first claims a user ID; the system records a voice sample from the user and tries to answer YES/NO to the question "Is this person the claimed user or not?"
· If a person claims to be user A:
– Observation: a segment of voice feature vectors X
– H0: X is from the claimed user A.
– H1: X is NOT from the claimed user A.
· Data modeling: GMMs are commonly used for both H0 and H1.
– The number of mixture components depends on the amount of available data, usually from 16 to 256.
– For simplicity and estimation reliability, each Gaussian component is assumed to have a diagonal covariance matrix.
– For each user a registered in the system, we must estimate two GMMs: Λ_a for its H0 and Λ̄_a for its H1.

Example (I): Speaker Verification (2)
· Model estimation:
– For Λ_a in H0: collect training samples from the known user a and train the GMM with the ML criterion. (How do we do ML estimation for a GMM?)
– How about Λ̄_a in H1? Several options:
• Anti-speaker model: train it on the training data collected from all other known users (except a), using ML estimation.
• Train it on training data from some "cohort" speakers who are easily confused with the current speaker a. (How do we choose the cohort speakers?)
• For simplicity, use the same background model Λ̄ for all known users in the system; Λ̄ is trained on all users' training data.
Example (I): Speaker Verification (3)
· Verification decision:
[Figure: block diagram; the claimed ID selects the speaker model, the input speech is scored against it, and a speaker-specific threshold drives the decision making that produces the output decision]
– A new user claims the ID A; based on the recorded voice features Y:

If T = p(Y | H_0) / p(Y | H_1) = p(Y | Λ_A) / p(Y | Λ̄_A) > τ, accept the user as A; otherwise, reject the user.

The decision threshold τ is determined empirically in practice.

Example (II): Reject Outliers in Pattern Classification
· How do we reject outliers (patterns belonging to none of the known classes) in pattern classification?
– In speech recognition, how do we detect unknown words, called out-of-vocabulary (OOV) words, used by users?
· Solution 1: treat outliers as another class → an (N+1)-class pattern classification problem.
· Solution 2:
– Stage 1: do N-class pattern classification and find the best match, say class k.
– Stage 2: verify the decision made in stage 1.
– Stage 2 is a pattern verification problem:
• H0: the pattern X really comes from class k.
• H1: the pattern X does NOT come from class k.

If Λ = Pr(X | H_0) / Pr(X | H_1) > ζ, accept the decision; otherwise reject it.
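The GMM-based likelihood-ratio scoring behind both examples can be sketched in one dimension. A minimal sketch; the mixture weights, means, and variances below are made-up stand-ins for the trained speaker model Λ_A and background model Λ̄_A, and real systems score multivariate cepstral frames instead of scalars:

```python
import math

def log_gmm(x, weights, means, variances):
    """Log-likelihood of a 1-D observation under a Gaussian mixture model:
    log sum_m w_m * N(x | mu_m, var_m), via log-sum-exp for stability."""
    terms = [math.log(w) - 0.5*math.log(2*math.pi*v) - (x - m)**2/(2*v)
             for w, m, v in zip(weights, means, variances)]
    top = max(terms)
    return top + math.log(sum(math.exp(t - top) for t in terms))

def verify(frames, spk_gmm, bkg_gmm, tau=0.0):
    """Accept the claimed ID iff the average log-likelihood ratio between
    the speaker GMM (H0) and the background GMM (H1) exceeds tau."""
    llr = sum(log_gmm(x, *spk_gmm) - log_gmm(x, *bkg_gmm) for x in frames)
    return llr / len(frames) > tau

# Hypothetical 2-component models: the "speaker" sits near 1.0,
# the background covers a broader region around 0.
spk = ([0.5, 0.5], [0.8, 1.2], [0.2, 0.2])
bkg = ([0.5, 0.5], [-0.5, 0.5], [1.0, 1.0])
print(verify([0.9, 1.1, 1.0], spk, bkg))    # frames near 1.0: accept
print(verify([-1.0, -0.8, 0.1], spk, bkg))  # frames far from 1.0: reject
```

Averaging the log-likelihood ratio over frames, rather than summing it, makes the threshold τ roughly independent of the utterance length, a common practical choice in speaker verification.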
This note was uploaded on 02/14/2012 for the course CSE 6590 taught by Professor Kotakoski during the Winter '12 term at York University.