4 Linear Methods for Classification

4.1 Introduction

In this chapter we revisit the classification problem and focus on linear methods for classification. Since our predictor G(x) takes values in a discrete set G, we can always divide the input space into a collection of regions labeled according to the classification. We saw in Chapter 2 that the boundaries of these regions can be rough or smooth, depending on the prediction function. For an important class of procedures, these decision boundaries are linear; this is what we will mean by linear methods for classification.

There are several different ways in which linear decision boundaries can be found. In Chapter 2 we fit linear regression models to the class indicator variables, and classify to the largest fit. Suppose there are K classes, for convenience labeled 1, 2, ..., K, and the fitted linear model for the kth indicator response variable is f̂_k(x) = β̂_{k0} + β̂_k^T x. The decision boundary between class k and ℓ is that set of points for which f̂_k(x) = f̂_ℓ(x), that is, the set {x : (β̂_{k0} − β̂_{ℓ0}) + (β̂_k − β̂_ℓ)^T x = 0}, an affine set or hyperplane.¹ Since the same is true for any pair of classes, the input space is divided into regions of constant classification, with piecewise hyperplanar decision boundaries. This regression approach is a member of a class of methods that model discriminant functions δ_k(x) for each class, and then classify x to the class with the largest value for its discriminant function. Methods that model the posterior probabilities Pr(G = k | X = x) are also in this class.

¹ Strictly speaking, a hyperplane passes through the origin, while an affine set need not. We sometimes ignore the distinction and refer in general to hyperplanes.

© Springer Science+Business Media, LLC 2009. T. Hastie et al., The Elements of Statistical Learning, Second Edition, DOI: 10.1007/b94608_4.
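The indicator-regression approach described above can be sketched in a few lines of NumPy. This is an illustrative implementation, not the book's code; the function names and the toy data are assumptions.

```python
import numpy as np

def fit_indicator_regression(X, g, K):
    """Fit one least-squares linear model per class indicator.

    X : (n, p) feature matrix; g : (n,) integer labels in {0, ..., K-1}.
    Returns B : (p+1, K), one coefficient column per class (intercept first).
    """
    n = X.shape[0]
    Y = np.zeros((n, K))
    Y[np.arange(n), g] = 1.0               # indicator response matrix
    X1 = np.hstack([np.ones((n, 1)), X])   # prepend intercept column
    B, *_ = np.linalg.lstsq(X1, Y, rcond=None)
    return B

def predict(X, B):
    """Classify each row to the class with the largest fitted value f_k(x)."""
    X1 = np.hstack([np.ones((X.shape[0], 1)), X])
    return np.argmax(X1 @ B, axis=1)
```

Because each fitted f̂_k is affine in x, the boundary between any two predicted classes, where two columns of X1 @ B tie, is a hyperplane, exactly as in the text.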
Clearly, if either the δ_k(x) or Pr(G = k | X = x) are linear in x, then the decision boundaries will be linear. Actually, all we require is that some monotone transformation of δ_k or Pr(G = k | X = x) be linear for the decision boundaries to be linear. For example, if there are two classes, a popular model for the posterior probabilities is

    Pr(G = 1 | X = x) = exp(β_0 + β^T x) / (1 + exp(β_0 + β^T x)),
    Pr(G = 2 | X = x) = 1 / (1 + exp(β_0 + β^T x)).                      (4.1)

Here the monotone transformation is the logit transformation, log[p/(1 − p)], and in fact we see that

    log [ Pr(G = 1 | X = x) / Pr(G = 2 | X = x) ] = β_0 + β^T x.         (4.2)

The decision boundary is the set of points for which the log-odds are zero, and this is a hyperplane defined by {x : β_0 + β^T x = 0}. We discuss two very popular but different methods that result in linear log-odds or logits: linear discriminant analysis and linear logistic regression. Although they differ in their derivation, the essential difference between them is in the way the...
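A short numerical check of (4.1) and (4.2): the two posteriors sum to one, and their log-ratio recovers the linear predictor β_0 + β^T x. The particular values of beta0, beta, and x below are arbitrary illustrative choices, not quantities from the text.

```python
import numpy as np

def posteriors(x, beta0, beta):
    """Two-class logistic posteriors of model (4.1)."""
    eta = beta0 + beta @ x                  # linear log-odds
    p1 = np.exp(eta) / (1.0 + np.exp(eta))  # Pr(G = 1 | X = x)
    p2 = 1.0 / (1.0 + np.exp(eta))          # Pr(G = 2 | X = x)
    return p1, p2

beta0, beta = -1.0, np.array([2.0, 0.5])
x = np.array([0.3, 1.2])
p1, p2 = posteriors(x, beta0, beta)

# The logit of p1 is exactly the linear function in (4.2),
# and the two posteriors form a proper probability distribution.
assert np.isclose(np.log(p1 / p2), beta0 + beta @ x)
assert np.isclose(p1 + p2, 1.0)
```

The decision boundary is where p1 = p2 = 1/2, i.e. where the log-odds β_0 + β^T x vanish, which is the hyperplane stated in the text.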