CSE 6740 Lecture 6
How Do I Predict a Discrete Variable? (Classification)
Alexander Gray
agray@cc.gatech.edu
Georgia Institute of Technology

Today
1. Classification (How can I predict a discrete variable?)
2. Generative classification (What if I reduce classification to density estimation?)
3. Discriminative classification (Can I do classification while avoiding density estimation?)

Classification
How can I predict a discrete variable?

Classification Loss
The most common loss function for classification is zero-one loss:

    L(Y, \hat{c}(X)) = I(Y \neq \hat{c}(X)).    (1)

We can generalize this to specify arbitrary costs for misclassifying one class as another:

    L(Y, \hat{c}(X)) = C_{ab}    (2)

where C is a K \times K matrix, a = Y, and b = \hat{c}(X).

The test error, or expected loss, called the error rate in this case, is

    E = E[L(Y, \hat{c}(X))]    (3)
      = E_{X,Y}[L(Y, \hat{c}(X))]    (4)
      = E_X E_{Y|X}[L(Y, \hat{c}(X))]    (5)

or, for a given x,

    E(x) = E_{Y|X}[L(Y, \hat{c}(x))].    (6)

Discriminant Function
Suppose Y \in \{0, 1\}. The value \hat{f}(x) that minimizes E(x) is the regression function, which we now call the discriminant function:

    g(x) = E(Y \mid X = x)    (7)
         = \int y \, f(y \mid x) \, dy    (8)
         = 1 \cdot P(Y = 1 \mid X = x) + 0 \cdot P(Y = 0 \mid X = x)    (9)
         = P(Y = 1 \mid X = x)    (10)
         = \frac{f(x \mid Y = 1) P(Y = 1)}{f(x \mid Y = 1) P(Y = 1) + f(x \mid Y = 0) P(Y = 0)}    (11)
         = \frac{\pi_1 f_1(x)}{\pi_1 f_1(x) + \pi_0 f_0(x)}.    (12)

Bayes Classifier
Making this yield a binary prediction gives the Bayes classifier, or Bayes rule:

    c^*(x) = \begin{cases} 1 & \text{if } g(x) > 1/2 \\ 0 & \text{otherwise} \end{cases}    (13)
           = \begin{cases} 1 & \text{if } P(Y = 1 \mid X = x) > P(Y = 0 \mid X = x) \\ 0 & \text{otherwise} \end{cases}    (14)
           = \begin{cases} 1 & \text{if } \pi_1 f_1(x) > \pi_0 f_0(x) \\ 0 & \text{otherwise.} \end{cases}    (15)

This is easily generalized to any number of classes K.

Optimal Classification
Keep in mind that this is "Bayes" only in the sense of conditional distributions, not in the sense of Bayesian inference. The Bayes classifier is optimal, i.e. if c(x) is any other classification rule, then

    E[L(Y, c(X))] \geq E[L(Y, c^*(X))].

Generative vs. Discriminative
The set

    \{x : P(Y = 1 \mid X = x) = P(Y = 0 \mid X = x)\}    (16)

is called the decision boundary.

In a generative classifier, we'll model the class-conditional densities f_k(x) explicitly. This means we'll be doing two separate density estimates ...
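To make the loss definitions concrete, here is a minimal NumPy sketch of zero-one loss (Eq. 1) and its cost-matrix generalization (Eq. 2). The labels, predictions, and cost values below are hypothetical; the sample mean of the loss acts as the empirical counterpart of the error rate in Eq. 3.

```python
import numpy as np

def zero_one_loss(y, y_hat):
    """Zero-one loss I(y != y_hat), Eq. (1): 1 for a wrong prediction, else 0."""
    return (np.asarray(y) != np.asarray(y_hat)).astype(float)

def cost_matrix_loss(y, y_hat, C):
    """General misclassification cost C_ab with a = y, b = y_hat, Eq. (2)."""
    return C[np.asarray(y), np.asarray(y_hat)]

# Hypothetical labels and predictions; averaging the loss over the sample
# estimates the error rate E = E[L(Y, c_hat(X))] of Eq. (3).
y_true = np.array([0, 1, 1, 0, 1])
y_pred = np.array([0, 1, 0, 0, 0])
print(zero_one_loss(y_true, y_pred).mean())  # 0.4: two of five wrong

# A 2x2 cost matrix where missing class 1 (predicting 0 when y = 1)
# costs five times as much as the reverse error.
C = np.array([[0.0, 1.0],
              [5.0, 0.0]])
print(cost_matrix_loss(y_true, y_pred, C).mean())  # 2.0
```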
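The next sketch illustrates the discriminant function (Eq. 12) and the resulting Bayes classifier (Eqs. 13-15) for a toy generative model. The 1-D Gaussian class-conditional densities, their parameters, and the equal priors are all assumptions chosen for illustration; the lecture does not fix a particular model, and in practice the f_k would come from density estimation on training data.

```python
import numpy as np
from scipy.stats import norm

# ASSUMPTION: 1-D Gaussian class-conditionals with illustrative parameters.
pi0, pi1 = 0.5, 0.5                  # class priors P(Y=0), P(Y=1)
f0 = norm(loc=-1.0, scale=1.0).pdf   # class-conditional density f_0(x)
f1 = norm(loc=+1.0, scale=1.0).pdf   # class-conditional density f_1(x)

def posterior_y1(x):
    """Discriminant function g(x) = P(Y=1 | X=x) via Bayes' theorem, Eq. (12)."""
    return pi1 * f1(x) / (pi1 * f1(x) + pi0 * f0(x))

def bayes_classifier(x):
    """Predict 1 iff g(x) > 1/2, equivalently pi1*f1(x) > pi0*f0(x), Eqs. (13)-(15)."""
    return (posterior_y1(x) > 0.5).astype(int)

x = np.array([-2.0, 0.0, 0.3, 2.0])
print(posterior_y1(x))      # posterior probability of class 1 at each x
print(bayes_classifier(x))  # -> [0 0 1 1] for this symmetric setup
```

Note that thresholding the posterior at 1/2 and comparing the weighted class-conditional densities are the same rule; the second form is what a generative classifier actually computes.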
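Continuing the same assumed Gaussian setup, the decision boundary (Eq. 16) can be found in closed form; the equal-variance assumption and the parameter values are again hypothetical, and the closed-form solution is a standard consequence of the Gaussian model rather than something derived in the lecture.

```python
import numpy as np

# Under the equal-variance Gaussian model, the decision boundary of Eq. (16)
# is where pi1 * f1(x) = pi0 * f0(x). Taking logs of the two Gaussian
# densities and solving for x gives a single point:
#   x* = (mu0 + mu1)/2 + sigma^2 * log(pi0/pi1) / (mu1 - mu0)
mu0, mu1, sigma = -1.0, 1.0, 1.0
pi0, pi1 = 0.5, 0.5
x_star = (mu0 + mu1) / 2 + sigma**2 * np.log(pi0 / pi1) / (mu1 - mu0)
print(x_star)  # 0.0 with equal priors; unequal priors shift the boundary
               # toward the rarer class, enlarging the common class's region
```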