CSE 6740 Lecture 6
How Do I Predict a Discrete Variable? (Classification)
Alexander Gray, agray@cc.gatech.edu
Georgia Institute of Technology

Today
1. Classification (How can I predict a discrete variable?)
2. Generative classification (What if I reduce classification to density estimation?)
3. Discriminative classification (Can I do classification while avoiding density estimation?)

Classification
How can I predict a discrete variable?

Classification Loss
The most common loss function for classification is zero-one loss:

    L(Y, \hat{c}(X)) = I(Y \neq \hat{c}(X)).    (1)

We can generalize this to specify arbitrary costs for misclassifying one class as another:

    L(Y, \hat{c}(X)) = C_{ab},    (2)

where C is a K \times K matrix, a = Y, and b = \hat{c}(X).

Classification Loss
The test error, or expected loss, called the error rate in this case, is

    E = E[L(Y, \hat{c}(X))]    (3)
      = E_{X,Y}[L(Y, \hat{c}(X))]    (4)
      = E_X E_{Y \mid X}[L(Y, \hat{c}(X))]    (5)

or, for a given x,

    E(x) = E_{Y \mid X}[L(Y, \hat{c}(x))].    (6)

Discriminant Function
Suppose Y \in \{0, 1\}. The value that minimizes E(x) is the regression function, which we now call the discriminant function:

    g(x) = E(Y \mid X = x)    (7)
         = \int y \, f(y \mid x) \, dy    (8)
         = 1 \cdot P(Y = 1 \mid X = x) + 0 \cdot P(Y = 0 \mid X = x)    (9)
         = P(Y = 1 \mid X = x)    (10)
         = \frac{f(x \mid Y = 1) P(Y = 1)}{f(x \mid Y = 1) P(Y = 1) + f(x \mid Y = 0) P(Y = 0)}    (11)
         = \frac{\pi_1 f_1(x)}{\pi_1 f_1(x) + \pi_0 f_0(x)},    (12)

where \pi_k = P(Y = k) denotes the class prior and f_k(x) = f(x \mid Y = k) the class-conditional density.
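As an illustrative sketch (not part of the lecture), the discriminant function (12) can be evaluated directly once the priors and class-conditional densities are known; here hypothetical unit-variance Gaussians stand in for f_0 and f_1:

```python
import math

def gaussian_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) at x."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def discriminant(x, pi1, f1, f0):
    """g(x) = P(Y=1 | X=x) via equation (12): pi1*f1(x) / (pi1*f1(x) + pi0*f0(x))."""
    num = pi1 * f1(x)
    den = num + (1 - pi1) * f0(x)
    return num / den

# Hypothetical setup: equally likely classes, Gaussian class-conditionals
f1 = lambda x: gaussian_pdf(x, mu=1.0, sigma=1.0)
f0 = lambda x: gaussian_pdf(x, mu=-1.0, sigma=1.0)

g_mid = discriminant(0.0, 0.5, f1, f0)  # midpoint between the means: g(0) = 0.5
```

By symmetry, g(x) crosses 1/2 exactly at the midpoint of the two means, which is where the Bayes classifier of the next slide switches its prediction.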
Bayes Classifier
Making this yield a binary prediction gives the Bayes classifier, or Bayes rule:

    c^*(x) = 1 if g(x) > 1/2, 0 otherwise    (13)
           = 1 if P(Y = 1 \mid X = x) > P(Y = 0 \mid X = x), 0 otherwise    (14)
           = 1 if \pi_1 f_1(x) > \pi_0 f_0(x), 0 otherwise.    (15)

This is easily generalized to any number of classes K.

Optimal Classification
Keep in mind that this is "Bayes" only in the sense of conditional distributions, not in the sense of Bayesian inference. The Bayes classifier is optimal, i.e. if c(x) is any other classification rule then

    E[L(Y, c(X))] \geq E[L(Y, c^*(X))].

Generative vs. Discriminative
The set

    \{x : P(Y = 1 \mid X = x) = P(Y = 0 \mid X = x)\}    (16)

is called the decision boundary. In a generative classifier, we'll model the class-conditional densities f_k(x) explicitly. This means we'll be doing two separate density estimates ...
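The generative recipe above can be sketched as a minimal plug-in classifier (a hypothetical toy, assuming 1-D features and Gaussian class-conditionals; this is one possible instantiation, not the lecture's code): estimate the priors from class frequencies, fit one density per class, then apply rule (15).

```python
import math

def fit_gaussian(xs):
    """Maximum-likelihood mean and standard deviation of a 1-D sample."""
    n = len(xs)
    mu = sum(xs) / n
    var = sum((x - mu) ** 2 for x in xs) / n
    return mu, math.sqrt(var)

def gaussian_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) at x."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def fit_generative(data):
    """data: list of (x, y) pairs with y in {0, 1}.
    Returns a plug-in Bayes classifier built from two density estimates."""
    xs0 = [x for x, y in data if y == 0]
    xs1 = [x for x, y in data if y == 1]
    pi1 = len(xs1) / len(data)          # estimated prior P(Y=1)
    mu0, s0 = fit_gaussian(xs0)         # class-conditional density estimate f_0
    mu1, s1 = fit_gaussian(xs1)         # class-conditional density estimate f_1

    def classify(x):
        # Rule (15): predict 1 iff pi_1 f_1(x) > pi_0 f_0(x)
        return 1 if pi1 * gaussian_pdf(x, mu1, s1) > (1 - pi1) * gaussian_pdf(x, mu0, s0) else 0

    return classify

# Toy labelled sample (made up for illustration)
data = [(-2.1, 0), (-1.0, 0), (-1.5, 0), (0.9, 1), (2.0, 1), (1.4, 1)]
c_hat = fit_generative(data)
```

Note that the two `fit_gaussian` calls are exactly the "two separate density estimates" the slide refers to; a discriminative method would instead model P(Y = 1 | X = x) directly and skip them.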