CSE6328 3.0 Speech & Language Processing
No. 5: Pattern Classification (III) & Pattern Verification
Prof. Hui Jiang
Department of Computer Science and Engineering, York University

Model Parameter Estimation

·  Maximum Likelihood (ML) estimation:
   –  The ML method is the most popular model estimation method.
   –  The EM (Expectation-Maximization) algorithm.
   –  Examples:
      •  Univariate Gaussian distribution
      •  Multivariate Gaussian distribution
      •  Multinomial distribution
      •  Gaussian mixture model (GMM)
      •  Markov chain model: the n-gram for language modeling
      •  Hidden Markov Model (HMM)
·  Discriminative training: alternative model estimation methods
   –  Maximum Mutual Information (MMI)
   –  Minimum Classification Error (MCE)
·  Bayesian model estimation: Bayesian theory
·  MDI (Minimum Discrimination Information)

Discriminative Training (I): Maximum Mutual Information Estimation (1)

·  The model is viewed as a noisy data-generation channel: a class id ω passes through the channel and produces the observed features X.
·  Determine the model parameters λ_1, …, λ_N so as to maximize the mutual information between ω and X, i.e., make the relation between ω and X as close as possible:

   \{\lambda_1, \ldots, \lambda_N\}_{MMI} = \arg\max_{\lambda_1 \cdots \lambda_N} I(\omega, X)

   I(\omega, X) = \sum_{\omega}\sum_{X} p(\omega, X) \log_2 \frac{p(\omega, X)}{p(\omega)\,p(X)}
                = \sum_{\omega}\sum_{X} p(\omega, X) \log_2 \frac{p(X \mid \omega)}{p(X)}
                = \sum_{\omega}\sum_{X} p(\omega, X) \log_2 \frac{p(X \mid \lambda_\omega)}{\sum_{\omega'} p(X \mid \lambda_{\omega'})}

   (The last step assumes equal class priors, so that p(X) ∝ \sum_{\omega'} p(X \mid \lambda_{\omega'}) and the constant factor does not affect the maximization.)

Discriminative Training (I): Maximum Mutual Information Estimation (2)

·  Difficulty: the joint distribution p(ω, X) is unknown.
·  Solution: collect a representative training set (X_1, ω_1), (X_2, ω_2), …, (X_T, ω_T) to approximate the joint distribution:

   \{\lambda_1, \ldots, \lambda_N\}_{MMI} = \arg\max_{\lambda_1 \cdots \lambda_N} I(\omega, X)
   \approx \arg\max_{\lambda_1 \cdots \lambda_N} \sum_{t=1}^{T} \log_2 \frac{p(X_t \mid \lambda_{\omega_t})}{\sum_{\omega} p(X_t \mid \lambda_{\omega})}

·  Optimization (the empirical objective itself is sketched in code below):
   –  Iterative gradient-ascent method
   –  Growth-transformation method
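The empirical MMI objective is easy to evaluate once each class has a trained model. Below is a minimal numpy sketch, assuming single diagonal-Gaussian class models and equal class priors as on the slide; gaussian_logpdf, mmi_objective, and the toy data are illustrative stand-ins rather than anything from the course material:

```python
import numpy as np
from scipy.special import logsumexp

def gaussian_logpdf(X, mean, var):
    """Row-wise log density of a diagonal Gaussian N(mean, diag(var))."""
    return -0.5 * np.sum(np.log(2.0 * np.pi * var) + (X - mean) ** 2 / var,
                         axis=1)

def mmi_objective(X, labels, means, variances):
    """Empirical MMI criterion (in bits), with equal class priors:
    sum_t log2 [ p(X_t | lam_{w_t}) / sum_w p(X_t | lam_w) ]."""
    # loglik[t, w] = ln p(X_t | lambda_w), shape (T, N)
    loglik = np.stack([gaussian_logpdf(X, m, v)
                       for m, v in zip(means, variances)], axis=1)
    num = loglik[np.arange(len(labels)), labels]   # correct-class term
    den = logsumexp(loglik, axis=1)                # sum over all classes
    return float(np.sum(num - den) / np.log(2.0))  # nats -> bits

# Toy usage: two well-separated 2-D Gaussian classes.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (50, 2)), rng.normal(3.0, 1.0, (50, 2))])
labels = np.repeat([0, 1], 50)
means = np.array([[0.0, 0.0], [3.0, 3.0]])
variances = np.ones((2, 2))
print(mmi_objective(X, labels, means, variances))
```

Working in the log domain with logsumexp avoids the underflow that multiplying raw likelihoods would cause; the objective grows as the correct-class likelihood comes to dominate the sum over all classes.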
Discriminative Training (II): Minimum Classification Error Estimation (1)

·  In an N-class pattern classification problem, given a training set D = {(X_1, ω_1), (X_2, ω_2), …, (X_T, ω_T)}, estimate the model parameters of all classes so as to minimize the total number of classification errors in D.
   –  MCE: minimize the empirical classification error.
·  Objective function: the total number of classification errors in D.
   –  For each training sample (X_t, ω_t), define a misclassification measure:

      d(X_t, \omega_t) = -\,p(\omega_t)\,p(X_t \mid \lambda_{\omega_t}) + \max_{\omega' \neq \omega_t} p(\omega')\,p(X_t \mid \lambda_{\omega'})

      or

      d(X_t, \omega_t) = -\ln\big[p(\omega_t)\,p(X_t \mid \lambda_{\omega_t})\big] + \max_{\omega' \neq \omega_t} \ln\big[p(\omega')\,p(X_t \mid \lambda_{\omega'})\big]

   –  If d(X_t, ω_t) > 0, X_t is classified incorrectly: 1 error.
      If d(X_t, ω_t) ≤ 0, X_t is classified correctly: 0 error.

Discriminative Training (II): Minimum Classification Error Estimation (2)

·  Approximate d(X_t, ω_t) by a differentiable function:

   d(X_t, \omega_t) \approx -\,p(\omega_t)\,p(X_t \mid \lambda_{\omega_t}) + \frac{1}{\eta} \ln\Big[\frac{1}{N-1} \sum_{\omega' \neq \omega_t} \exp\big(\eta \cdot p(\omega')\,p(X_t \mid \lambda_{\omega'})\big)\Big]

   or

   d(X_t, \omega_t) \approx -\ln\big[p(\omega_t)\,p(X_t \mid \lambda_{\omega_t})\big] + \frac{1}{\eta} \ln\Big[\frac{1}{N-1} \sum_{\omega' \neq \omega_t} \exp\big(\eta \cdot \ln\big(p(\omega')\,p(X_t \mid \lambda_{\omega'})\big)\big)\Big]

   where η > 1. (As η → ∞, the smoothed term approaches the max used on the previous slide.)

Discriminative Training (II): Minimum Classification Error Estimation (3)

·  The error count for one sample (X_t, ω_t) is H(d(X_t, ω_t)), where H(·) is the step function.
·  Total number of errors in the training set:

   Q(\Lambda) = \sum_{t=1}^{T} H\big(d(X_t, \omega_t)\big)

·  The step function is not differentiable, so it is approximated by a sigmoid, giving the smoothed total error on the training set:

   Q(\Lambda) \approx Q'(\Lambda) = \sum_{t=1}^{T} \ell\big(d(X_t, \omega_t)\big), \qquad \ell(d) = \frac{1}{1 + e^{-a d}}

   where a > 0 is a parameter that controls the shape of the sigmoid.

Discriminative Training (II): Minimum Classification Error Estimation (4)

·  MCE estimation of the model parameters of all classes:

   \{\lambda_1, \ldots, \lambda_N\}_{MCE} = \arg\min_{\lambda_1 \cdots \lambda_N} Q'(\lambda_1, \ldots, \lambda_N)

·  Optimization: no closed-form solution is available.
   –  Iterative gradient-descent method
   –  GPD (generalized probabilistic descent) method:

      \lambda_i^{(n+1)} = \lambda_i^{(n)} - \varepsilon \cdot \frac{\partial Q'(\lambda_1, \ldots, \lambda_N)}{\partial \lambda_i}\Big|_{\lambda_i = \lambda_i^{(n)}}

The MCE/GPD Method

·  Find initial model parameters, e.g., the ML estimates.
·  Derive the gradient of the objective function.
·  Evaluate the gradient at the current model parameters.
·  Update the model parameters:

   \lambda_i^{(n+1)} = \lambda_i^{(n)} - \varepsilon \cdot \frac{\partial Q'(\lambda_1, \ldots, \lambda_N)}{\partial \lambda_i}\Big|_{\lambda_i = \lambda_i^{(n)}}

·  Iterate until convergence.

How to calculate the gradient?

   \frac{\partial Q'(\lambda_1, \ldots, \lambda_N)}{\partial \lambda_i} = \sum_{t=1}^{T} \frac{\partial \ell(d)}{\partial d} \cdot \frac{\partial d(X_t, \omega_t)}{\partial \lambda_i} = \sum_{t=1}^{T} a \cdot \ell(d) \cdot \big[1 - \ell(d)\big] \cdot \frac{\partial d(X_t, \omega_t)}{\partial \lambda_i}

·  The key practical issue in MCE/GPD is setting a proper step size ε, which is usually tuned experimentally. (A per-sample sketch of the update follows below.)
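To make the update rule concrete, here is a minimal sketch of one stochastic (per-sample) MCE/GPD pass, under assumptions the slides leave open: spherical Gaussian class models with a shared, fixed variance, equal priors, and the hard-max competitor (the η → ∞ limit of the smoothed misclassification measure). Only the class means are updated; gpd_epoch and its default step size are hypothetical choices:

```python
import numpy as np
from scipy.special import expit  # numerically stable sigmoid

def gpd_epoch(X, labels, means, var=1.0, eps=0.05, a=1.0):
    """One stochastic MCE/GPD pass, updating the class means in place."""
    N = means.shape[0]
    for x, w in zip(X, labels):
        # Discriminant g_i = ln p(x | lambda_i) up to a shared constant.
        g = -0.5 * np.sum((x - means) ** 2, axis=1) / var
        # Best competing class (the true class w is masked out).
        rival = int(np.argmax(np.where(np.arange(N) == w, -np.inf, g)))
        d = -g[w] + g[rival]        # misclassification measure d(X_t, w_t)
        l = expit(a * d)            # sigmoid loss l(d)
        coeff = a * l * (1.0 - l)   # dl/dd
        # dd/dmu_w = -(x - mu_w)/var and dd/dmu_rival = +(x - mu_rival)/var
        means[w] -= eps * coeff * (-(x - means[w]) / var)
        means[rival] -= eps * coeff * ((x - means[rival]) / var)
    return means
```

Each sample pulls the true-class mean toward x and pushes the best rival's mean away, weighted by a·ℓ(d)·[1−ℓ(d)], which peaks for samples near the decision boundary; confidently correct or hopelessly wrong samples barely change the model.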
Overtraining (Overfitting)

·  A low classification error rate on the training set does not always lead to a low error rate on a new test set, due to overtraining.
[Figure: the objective function and the classification error (in %) over the course of training.]

Measuring the Performance of MCE

·  To decide when training has converged, monitor three quantities during MCE/GPD:
   –  The objective function
   –  The error rate on the training set
   –  The error rate on a test set

Bayesian Theory

·  Bayesian methods view model parameters as random variables having a known prior distribution (prior specification).
   –  Specify the prior distribution of the model parameters θ as p(θ).
·  The training data D allow us to convert the prior distribution into a posterior distribution (Bayesian learning):

   p(\theta \mid D) = \frac{p(\theta)\,p(D \mid \theta)}{p(D)} \propto p(\theta)\,p(D \mid \theta)

·  We then infer or decide everything solely based on the posterior distribution (Bayesian inference):
   –  Model estimation: MAP (maximum a posteriori) estimation
   –  Pattern classification: Bayesian classification
   –  Sequential (on-line, incremental) learning
   –  Others: prediction, model selection, etc.

Bayesian Learning

[Figure: the prior p(θ), the likelihood p(D | θ), and the posterior p(θ | D) plotted over θ, with θ_MAP and θ_ML marked.]

The MAP Estimation of Model Parameters

·  Make a point estimate of θ based on the posterior distribution:

   \theta_{MAP} = \arg\max_{\theta} p(\theta \mid D) = \arg\max_{\theta} p(\theta)\,p(D \mid \theta)

·  θ_MAP is then treated as the estimate of the model parameters (just like an ML estimate). Sometimes the EM algorithm is needed to derive it.
·  MAP estimation optimally combines prior knowledge with the new information provided by the data.
·  MAP estimation is used in speech recognition to adapt speech models to a particular speaker, e.g., to cope with various accents:
   –  Start from a generic speaker-independent speech model as the prior.
   –  Collect a small set of data from the particular speaker.
   –  The MAP estimate gives a speaker-adaptive model that suits this particular speaker better.

Bayesian Classification

·  Assume we have N classes ω_i (i = 1, 2, …, N), where each class has a class-conditional pdf p(X | ω_i, θ_i) with parameters θ_i.
·  The prior knowledge about θ_i is included in a prior p(θ_i).
·  For each class ω_i, we have a training data set D_i.
·  Problem: classify an unknown datum Y into one of the classes.
·  Bayesian classification is done as:

   \omega_Y = \arg\max_i p(Y \mid D_i) = \arg\max_i \int p(Y \mid \omega_i, \theta_i)\,p(\theta_i \mid D_i)\,d\theta_i

   where

   p(\theta_i \mid D_i) = \frac{p(\theta_i)\,p(D_i \mid \omega_i, \theta_i)}{p(D_i)} \propto p(\theta_i)\,p(D_i \mid \omega_i, \theta_i)

Recursive Bayes Learning (Sequential Bayesian Learning)

·  Bayesian theory provides a framework for on-line learning (a.k.a. incremental learning, adaptive learning).
·  When training data are observed one by one, the model can be adjusted dynamically to learn incrementally from the data.
·  Assume we observe the training set D = {X_1, X_2, …, X_n} one sample at a time:

   p(\theta) \xrightarrow{X_1} p(\theta \mid X_1) \xrightarrow{X_2} p(\theta \mid X_1, X_2) \xrightarrow{X_3} \cdots \xrightarrow{X_n} p(\theta \mid D^{(n)})

·  Learning rule: posterior ∝ prior × likelihood. The posterior at each stage summarizes all our knowledge about the model so far, and serves as the prior for the next observation.

How to Specify Priors

·  Noninformative priors:
   –  If we do not have enough prior knowledge, just use a flat prior at the beginning.
·  Conjugate priors, for computational convenience:
   –  For some models whose probability functions are reproducing densities, the prior can be chosen in a special form (called the conjugate prior) such that, after Bayesian learning, the posterior has exactly the same functional form as the prior, with only the parameters updated.
   –  Not every model has a conjugate prior.

Conjugate Prior

·  Consider a univariate Gaussian model with only the mean unknown:

   p(x \mid \omega_i) = N(x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\Big[-\frac{(x - \mu)^2}{2\sigma^2}\Big]

·  Choose the prior to be a Gaussian distribution (the Gaussian's conjugate prior is Gaussian):

   p(\mu) = N(\mu \mid \mu_0, \sigma_0^2) = \frac{1}{\sqrt{2\pi\sigma_0^2}} \exp\Big[-\frac{(\mu - \mu_0)^2}{2\sigma_0^2}\Big]

·  After observing a new datum x_1, the posterior is still Gaussian:

   p(\mu \mid x_1) = N(\mu \mid \mu_1, \sigma_1^2) = \frac{1}{\sqrt{2\pi\sigma_1^2}} \exp\Big[-\frac{(\mu - \mu_1)^2}{2\sigma_1^2}\Big]

   where

   \mu_1 = \frac{\sigma_0^2}{\sigma_0^2 + \sigma^2}\,x_1 + \frac{\sigma^2}{\sigma_0^2 + \sigma^2}\,\mu_0, \qquad \sigma_1^2 = \frac{\sigma_0^2\,\sigma^2}{\sigma_0^2 + \sigma^2}

The Sequential MAP Estimate of a Gaussian

·  For a univariate Gaussian with unknown mean, the MAP estimate of the mean after observing x_1 is:

   \mu_1 = \frac{\sigma_0^2}{\sigma_0^2 + \sigma^2}\,x_1 + \frac{\sigma^2}{\sigma_0^2 + \sigma^2}\,\mu_0

·  After observing the next datum x_2:

   \mu_2 = \frac{\sigma_1^2}{\sigma_1^2 + \sigma^2}\,x_2 + \frac{\sigma^2}{\sigma_1^2 + \sigma^2}\,\mu_1

·  (A small numerical sketch of this recursion follows below.)
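The recursion above amounts to two lines of code. A minimal sketch, with bayes_update_mean as a hypothetical helper name; the data and the deliberately broad prior are illustrative:

```python
import numpy as np

def bayes_update_mean(mu0, var0, x, var):
    """One recursive Bayes step for a Gaussian mean with known variance:
    prior N(mu0, var0) x likelihood N(x | mu, var) -> posterior N(mu1, var1).
    """
    mu1 = (var0 * x + var * mu0) / (var0 + var)
    var1 = (var0 * var) / (var0 + var)
    return mu1, var1

# Sequential (on-line) learning: feed the data in one sample at a time.
rng = np.random.default_rng(1)
data = rng.normal(2.0, 1.0, size=20)  # true mean 2.0, known variance 1.0
mu, v = 0.0, 10.0                     # broad (nearly noninformative) prior
for x in data:
    mu, v = bayes_update_mean(mu, v, x, var=1.0)
print(mu, v)  # MAP estimate of the mean, and its posterior variance
```

Note how the posterior variance shrinks with every observation, so later samples move the estimate less and less: the model gradually hardens around the data.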
Pattern Verification

·  For an unknown pattern/object P, we can observe or measure some features X of the pattern P.
·  Based on the features X, we need to answer a binary question (Yes/No) regarding P.
·  Example of pattern verification: speaker ID verification.
   –  A user claims their ID to be abc.
   –  The system prompts for and records some voice X from the user.
   –  Based on the voice X, the system decides whether the user is abc or not (voiceprints for security).
·  Pattern verification can be viewed as a two-class classification problem, but it is better not to treat it that way.
·  The proper view is to cast it as a statistical hypothesis testing problem.

Statistical Hypothesis Testing (I)

·  In statistics, we often need to test a hypothesis based on some observed data. The problem is formulated as a test between two complementary hypotheses:
   –  H0: the null hypothesis
   –  H1: the alternative hypothesis
·  Example: given a random sample {x_1, …, x_n} from a Gaussian distribution N(μ, σ²) whose variance σ² is known, we need to verify whether its mean equals a given value μ_0 or not. Thus we do hypothesis testing between:

   H_0: \mu = \mu_0 \quad \text{against} \quad H_1: \mu \neq \mu_0

·  In hypothesis testing, there are two types of errors:
   –  Type I: false rejection error; falsely rejecting H0 when H0 is true.
   –  Type II: false alarm error; falsely accepting H0 when H1 is true.

Statistical Hypothesis Testing (II)

·  In essence, a hypothesis test partitions the observation space into two disjoint parts, C and U. When an observation X lies in the region C, we reject H0; when X lies in U, we accept H0. C is called the critical region (or rejection region).
·  The Type I error probability (also called the significance level) of a test is:

   \alpha = \Pr(E_1) = \Pr(X \in C \mid H_0)

·  The Type II error probability of a test is:

   \beta = \Pr(E_2) = \Pr(X \in U \mid H_1) = 1 - \Pr(X \in C \mid H_1) = 1 - \gamma

   where γ = Pr(X ∈ C | H1) is defined as the power of the test.
·  At significance level α, the most powerful test is defined as the one that maximizes the power γ (and in turn minimizes the Type II error β).

Statistical Hypothesis Testing (III)

·  A hypothesis can be simple or composite:
   –  A simple hypothesis completely specifies the distribution, e.g., H_0: \theta = \theta_0.
   –  A composite hypothesis involves a region or an interval, e.g., H_1: \theta \neq \theta_0 or H_1: \theta > \theta_0.

Statistical Hypothesis Testing (IV)

·  Neyman-Pearson theorem:
   –  For a simple H0 and a simple H1 whose distributions are both fully known, i.e., f_0(X | θ_0) and f_1(X | θ_1): given i.i.d. observations D = {X_1, …, X_T}, for any significance level α the most powerful test is:

      \text{If } LR = \frac{\prod_{t=1}^{T} f_0(X_t \mid \theta_0)}{\prod_{t=1}^{T} f_1(X_t \mid \theta_1)} > \tau, \text{ accept } H_0; \text{ otherwise reject } H_0.

   –  The threshold τ is adjusted to make the significance of the test equal to α.
   –  If both pdfs have the same functional form and differ only in their parameters, the ratio is also called the likelihood ratio (LR).

Statistical Hypothesis Testing (V)

·  The Neyman-Pearson theorem provides a method of constructing the most powerful test for simple hypotheses when the distribution of the observation is known.
·  What if a hypothesis is composite?
·  Likelihood Ratio Test (LRT): assume the distributions are known except for some parameters:

   \text{If } T = \frac{\max_{\theta \in H_0} f_{H_0}(X \mid \theta)}{\max_{\theta \in H_0 \cup H_1} f_{H_1}(X \mid \theta)} > \tau, \text{ accept } H_0; \text{ otherwise reject } H_0.

   –  The LRT is not always uniformly most powerful, but it has some desirable properties.
   –  The distribution p(T) of the statistic T is complicated, and is only computable asymptotically.
   –  The LRT is widely used in many practical applications.

Pattern Verification as Statistical Hypothesis Testing

·  Based on the question to be answered, design two complementary hypotheses:
   –  The null hypothesis H0 corresponds to the answer YES.
   –  The alternative hypothesis H1 corresponds to NO.
·  The feature distribution under either H0 or H1 is unknown.
·  Training: apply the same idea as in data modeling:
   –  Choose a proper statistical model for each of H0 and H1.
   –  Estimate the model parameters from training samples collected for H0 and H1.
·  Decision: use the likelihood ratio test (LRT), sketched in code below:

   \text{If } T = \frac{f_0(X \mid \hat{\theta}_0)}{f_1(X \mid \hat{\theta}_1)} > \tau, \text{ answer YES; otherwise NO.}

   where f_0(·) is the model chosen for H0, f_1(·) is the model chosen for H1, and \hat{\theta}_0, \hat{\theta}_1 are parameters estimated from data.
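As a concrete instance of the Neyman-Pearson test from slide (IV), here is a sketch for the slide's Gaussian-mean example, simplified to a simple-vs-simple pair H0: μ = μ0 against H1: μ = μ1 so that the theorem applies; the log domain is used to avoid underflow, and the closing comment describes one common way to calibrate τ, as an assumption rather than a prescription from the slides:

```python
import numpy as np

def gaussian_loglik(x, mu, sigma):
    """Total log likelihood of an i.i.d. sample under N(mu, sigma^2)."""
    return np.sum(-0.5 * np.log(2.0 * np.pi * sigma ** 2)
                  - (x - mu) ** 2 / (2.0 * sigma ** 2))

def neyman_pearson_test(x, mu0, mu1, sigma, log_tau=0.0):
    """Accept H0 iff ln LR = ln[prod_t f0(x_t) / prod_t f1(x_t)] > ln tau."""
    log_lr = gaussian_loglik(x, mu0, sigma) - gaussian_loglik(x, mu1, sigma)
    return log_lr > log_tau

# Calibration: in practice log_tau is adjusted so that the false-rejection
# probability Pr(reject H0 | H0) equals the desired significance level
# alpha, e.g. by simulating samples under H0 and taking the alpha-quantile
# of the resulting log LR values.
```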
Distributions of the LR

[Figure: the distributions g(T | H0) and g(T | H1) of the test statistic T, with the threshold τ separating the acceptance region U from the critical region C.]

Pattern Verification (continued)

·  More generally, T can be any test statistic computed from the observed data.
   –  The LRT is a special case of such a T.
·  Given a test statistic T, we cannot minimize both the Type I error and the Type II error at the same time.
·  Verification can be improved by choosing a different test statistic:
   –  Less overlap between the distributions of T under H0 and H1 means better separation, and hence better verification accuracy (smaller Type I and Type II errors).
·  The key in designing a pattern verifier is to find a test statistic T, and its corresponding parameters, such that the overlap between the two distributions is minimized.
·  What does "better verification accuracy" mean? Smaller errors of both types:
   –  Type I error (false rejection error)
   –  Type II error (false alarm error)

Evaluating Verification (I)

·  As the threshold τ varies, the two error probabilities trade off against each other:

   \alpha(\tau) = \int_{-\infty}^{\tau} g(T \mid H_0)\,dT \ \ \text{(Type I error)}, \qquad \beta(\tau) = \int_{\tau}^{\infty} g(T \mid H_1)\,dT \ \ \text{(Type II error)}

·  The minimum total error is attained at the threshold that minimizes α + β; the equal error (EE) point is the threshold at which α = β.
[Figure: the Type I error, the Type II error, and the total error α + β plotted against the threshold τ, with the equal-error point marked.]

Evaluating Verification (II): The ROC Curve (Receiver Operating Characteristic)

[Figure: ROC curves plotting the false rejection error (Type I) against the false alarm error (Type II), both from 0% to 100%; a better system's curve lies closer to the origin than a not-so-good system's, and the equal-error performance is where a curve crosses the diagonal.]

Speaker Verification (SV)

[Figure: a call-center scenario. The system prompts "What is your account number?" (530-203-1230-2390) and "What is your secret passphrase?" ("Open Sesame"); the speaker verification server scores the reply against the customer's voice model.]

Example (I): Speaker Verification (1)

·  Speaker verification: verify a user's ID based on voice. The user first claims a user ID; the system then records a voice sample from the user and tries to answer YES/NO to the question "Is this person the claimed user or not?"
·  If a person claims to be user A:
   –  Observation: a segment of voice feature vectors X.
   –  H0: X is from the claimed user A.
   –  H1: X is NOT from the claimed user A.
·  Data modeling: GMMs are commonly used for both H0 and H1.
   –  The number of mixture components depends on the amount of available data, usually from 16 to 256.
   –  For simplicity and estimation reliability, each Gaussian component is assumed to have a diagonal covariance.
   –  For each user a registered in the system, we must estimate two GMMs, Λ_a and \bar{Λ}_a, for its H0 and H1.

Example (I): Speaker Verification (2)

·  Model estimation:
   –  For Λ_a in H0: collect training samples from the known user a and train the GMM by the ML criterion. (How do we do ML estimation for a GMM? With the EM algorithm.)
   –  How about \bar{Λ}_a in H1? Three common options:
      •  Anti-speaker model: train it on the training data collected from all other known users (except a), by ML estimation.
      •  Cohort model: train it on data from some "cohort" speakers who are easily confused with the current speaker a. (How do we choose the cohort speakers?)
      •  For simplicity, use the same background model Λ for all known users in the system; Λ is trained on all users' training data.

Example (I): Speaker Verification (3)

·  Verification decision (a GMM-based scoring sketch follows below):
[Figure: the input speech and the claimed ID are scored against the speaker's model; decision making compares the score with a speaker-specific threshold to produce the output decision.]
   –  A new user claims the ID A; based on the recorded voice features Y:

      \text{If } T = \frac{p(Y \mid H_0)}{p(Y \mid H_1)} = \frac{p(Y \mid \Lambda_A)}{p(Y \mid \bar{\Lambda}_A)} > \tau, \text{ accept the user as } A; \text{ otherwise reject.}

   –  The decision threshold τ is determined empirically in practice.
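A sketch of the GMM-based decision rule, using scikit-learn's GaussianMixture as a stand-in for the (unspecified) GMM trainer; the helper names, the synthetic "features", and the zero threshold are assumptions, and a real system would score MFCC-style feature frames instead:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def train_gmm(frames, n_components=16):
    """ML (EM) training of a diagonal-covariance GMM on feature frames."""
    return GaussianMixture(n_components=n_components,
                           covariance_type="diag").fit(frames)

def verify(Y, speaker_gmm, background_gmm, log_tau=0.0):
    """Accept the claimed identity iff the average per-frame log LR
    ln p(Y | Lambda_A) - ln p(Y | background) exceeds the threshold."""
    llr = (speaker_gmm.score_samples(Y)
           - background_gmm.score_samples(Y)).mean()
    return llr > log_tau

# Enrollment and test with synthetic data standing in for real features:
rng = np.random.default_rng(2)
user_frames = rng.normal(0.0, 1.0, (500, 12))    # user A's training data
other_frames = rng.normal(0.5, 1.2, (2000, 12))  # all other users' data
speaker_gmm = train_gmm(user_frames)             # Lambda_A for H0
background_gmm = train_gmm(other_frames)         # shared model for H1
print(verify(rng.normal(0.0, 1.0, (200, 12)), speaker_gmm, background_gmm))
```

Averaging the log likelihood ratio per frame makes the score roughly independent of utterance length, which keeps a single threshold usable for recordings of different durations.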
Example (II): Rejecting Outliers in Pattern Classification

·  How do we reject outliers (patterns belonging to none of the known classes) in pattern classification?
   –  For example, in speech recognition: how do we detect unknown words, called out-of-vocabulary (OOV) words, used by speakers?
·  Solution 1: treat the outliers as one more class, turning the task into (N+1)-class pattern classification.
·  Solution 2: classify first, then verify (see the sketch below):
   –  Stage 1: perform N-class pattern classification and find the best match, say class k.
   –  Stage 2: verify the decision made in stage 1.
   –  Stage 2 is a pattern verification problem:
      •  H0: the pattern X really comes from class k.
      •  H1: the pattern X does NOT come from class k.

      \text{If } \Lambda = \frac{\Pr(X \mid H_0)}{\Pr(X \mid H_1)} > \zeta, \text{ accept the decision; otherwise reject it.}
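A minimal sketch of the two-stage classify-then-verify rule referenced above; the per-class scorers and the background model are hypothetical callables supplied by the caller (for instance, score_samples-style log-likelihood functions bound to trained models):

```python
import numpy as np

def classify_with_rejection(x, class_loglik, background_loglik, log_zeta):
    """Stage 1: N-class classification (equal priors assumed).
    Stage 2: likelihood-ratio verification of the stage-1 decision.

    class_loglik: list of callables, class_loglik[k](x) = ln p(x | class k).
    background_loglik: callable, ln p(x | H1), e.g. a broad outlier model.
    Returns the class index, or None to reject x as an outlier (e.g. OOV).
    """
    scores = np.array([g(x) for g in class_loglik])
    k = int(np.argmax(scores))                   # best-matching class
    if scores[k] - background_loglik(x) > log_zeta:
        return k                                 # accept the decision
    return None                                  # reject as an outlier
```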