Linear Methods for Classification
Jia Li
Department of Statistics, The Pennsylvania State University
Email: jiali@stat.psu.edu
http://www.stat.psu.edu/jiali

Classification

- Supervised learning. Training data: {(x_1, g_1), (x_2, g_2), ..., (x_N, g_N)}.
- The feature vector X = (X_1, X_2, ..., X_p), where each variable X_j is quantitative.
- The response variable G is categorical: G \in \mathcal{G} = {1, 2, ..., K}.
- Form a predictor \hat{G}(x) to predict G based on X.
- Email spam example: G has only two values, say 1 denoting a useful email and 2 denoting a junk email. X is a 57-dimensional vector, each element being the relative frequency of a word or a punctuation mark.
- \hat{G}(x) divides the input space (feature vector space) into a collection of regions, each labeled by one class.

Linear Methods

- Methods whose decision boundaries are linear are called linear methods for classification.
- Two-class problem: the decision boundary between the two classes is a hyperplane in the feature vector space. A hyperplane in the p-dimensional input space is the set
    {x : \beta_0 + \sum_{j=1}^p \beta_j x_j = 0} .
- The two regions separated by the hyperplane:
    {x : \beta_0 + \sum_{j=1}^p \beta_j x_j > 0}  and  {x : \beta_0 + \sum_{j=1}^p \beta_j x_j < 0} .
- More than two classes: the decision boundary between any pair of classes k and l is a hyperplane (shown in the previous figure).
- Different criteria lead to different algorithms: linear regression of an indicator matrix, linear discriminant analysis, logistic regression, and Rosenblatt's perceptron learning algorithm.
- Question: which hyperplanes to use? Linear decision boundaries are not necessarily constrained.

The Bayes Classification Rule

- Suppose the marginal distribution of G is specified by the pmf p_G(g), g = 1, 2, ..., K.
- The conditional distribution of X given G = g is f_{X|G}(x | G = g).
- The training data (x_i, g_i), i = 1, 2, ..., N, are independent samples from the joint distribution of X and G:
    f_{X,G}(x, g) = p_G(g) f_{X|G}(x | G = g) .
- Assume the loss of predicting G as \hat{G}(X) = \hat{g} is L(\hat{g}, G).
- Goal of classification: minimize the expected loss
    E_{X,G} L(\hat{G}(X), G) = E_X ( E_{G|X} L(\hat{G}(X), G) ) .
- To minimize the left-hand side, it suffices to minimize E_{G|X} L(\hat{G}(X), G) for each X. Hence the optimal predictor is
    \hat{G}(x) = \arg\min_g E_{G|X=x} L(g, G) .
This is the Bayes classification rule.

- For 0-1 loss, i.e.,
    L(g, g') = 1 if g ≠ g',  0 if g = g' ,
  we have
    E_{G|X=x} L(g, G) = 1 - Pr(G = g | X = x) .
- The Bayes rule becomes the rule of maximum a posteriori probability:
    \hat{G}(x) = \arg\min_g E_{G|X=x} L(g, G) = \arg\max_g Pr(G = g | X = x) .
- Many classification algorithms attempt to estimate Pr(G = g | X = x) and then apply the Bayes rule.
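The MAP rule above can be made concrete with a minimal sketch. The setup here is hypothetical (not from the lecture): two classes in one dimension with known priors and Gaussian class-conditional densities. Since the marginal f_X(x) is common to all classes, maximizing the posterior is the same as maximizing p_G(g) f(x | G = g).

```python
import numpy as np

# Hypothetical setup: priors p_G(1) = 0.7, p_G(2) = 0.3, and
# class-conditional densities f(x|G=1) = N(-1, 1), f(x|G=2) = N(2, 1).
priors = np.array([0.7, 0.3])
means = np.array([-1.0, 2.0])

def gaussian_pdf(x, mu, sigma=1.0):
    """Density of N(mu, sigma^2) evaluated at x."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

def bayes_rule(x):
    """MAP rule: argmax_g Pr(G=g | X=x) = argmax_g p_G(g) f(x | G=g),
    because the marginal f_X(x) is the same for every class."""
    joint = priors * gaussian_pdf(x, means)   # p_G(g) * f(x|g) for g = 1, 2
    return int(np.argmax(joint)) + 1          # class labels in {1, 2}

print(bayes_rule(-0.5), bayes_rule(1.8))      # prints: 1 2
```

Note that with unequal priors the decision threshold is shifted toward the less likely class, exactly as the weighted joint densities dictate.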
Linear Regression of an Indicator Matrix

- If G has K classes, there will be K class indicators Y_k, k = 1, ..., K. For example, with K = 4:

    g | y_1 y_2 y_3 y_4
    --+----------------
    3 |  0   0   1   0
    1 |  1   0   0   0
    2 |  0   1   0   0
    4 |  0   0   0   1
    1 |  1   0   0   0

- Fit a linear regression model for each Y_k, k = 1, 2, ..., K, using X:
    \hat{y}_k = X (X^T X)^{-1} X^T y_k .
- Define Y = (y_1, y_2, ..., y_K); then
    \hat{Y} = X (X^T X)^{-1} X^T Y .
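The indicator expansion and least-squares fit described above can be sketched in a few lines of NumPy. The labels are the ones from the table; the two input variables are made-up stand-ins for illustration:

```python
import numpy as np

# Labels from the example table; K = 4 classes.
g = np.array([3, 1, 2, 4, 1])
K = 4
Y = (g[:, None] == np.arange(1, K + 1)).astype(float)  # N x K indicator matrix

# Made-up inputs for illustration; first column is the intercept.
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(len(g)), rng.normal(size=(len(g), 2))])

# Least-squares coefficients B = (X^T X)^{-1} X^T Y (computed stably by
# lstsq) and fitted values Yhat = X B = X (X^T X)^{-1} X^T Y.
B, *_ = np.linalg.lstsq(X, Y, rcond=None)
Yhat = X @ B

# Because the indicator columns sum to 1 in every row and the model has
# an intercept, the fitted values in each row also sum to 1.
print(np.allclose(Yhat.sum(axis=1), 1.0))   # prints: True
```

The row-sum property printed at the end reappears later in the diabetes example, where \hat{Y}_1 + \hat{Y}_2 = 1.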
Classification Procedure

- Define \hat{B} = (X^T X)^{-1} X^T Y.
- For a new observation with input x, compute the fitted output
    \hat{f}(x) = [(1, x) \hat{B}]^T = [(1, x_1, x_2, ..., x_p) \hat{B}]^T = (\hat{f}_1(x), \hat{f}_2(x), ..., \hat{f}_K(x))^T .
- Identify the largest component of \hat{f}(x) and classify accordingly:
    \hat{G}(x) = \arg\max_{k \in \mathcal{G}} \hat{f}_k(x) .
Rationale

- The linear regression of Y_k on X is a linear approximation to E(Y_k | X = x), and
    E(Y_k | X = x) = Pr(Y_k = 1 | X = x) · 1 + Pr(Y_k = 0 | X = x) · 0
                   = Pr(Y_k = 1 | X = x)
                   = Pr(G = k | X = x) .
- According to the Bayes rule, the optimal classifier is
    \hat{G}(x) = \arg\max_{k \in \mathcal{G}} Pr(G = k | X = x) .
- Linear regression of an indicator matrix: approximate Pr(G = k | X = x) by a linear function of x using linear regression, then apply the Bayes rule to the approximated probabilities.

Example: Diabetes Data
- The diabetes data set is taken from the UCI machine learning database repository at http://www.ics.uci.edu/~mlearn/MachineLearning.html. The original source of the data is the National Institute of Diabetes and Digestive and Kidney Diseases.
- There are 768 cases in the data set, of which 268 show signs of diabetes according to World Health Organization criteria.
- Each case contains 8 quantitative variables, including diastolic blood pressure, triceps skin fold thickness, a body mass index, etc.
- Two classes: with or without signs of diabetes.
- Denote the 8 original variables by \tilde{X}_1, \tilde{X}_2, ..., \tilde{X}_8. Remove the mean of each \tilde{X}_j and normalize it to unit variance.
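The standardization step just described, and the principal-component projection used next, can be sketched as follows. The data here are random stand-ins, since the actual diabetes measurements are not reproduced in these notes:

```python
import numpy as np

# Stand-in for the 768 x 8 matrix of original variables (hypothetical
# values; the real data come from the UCI repository cited above).
rng = np.random.default_rng(0)
Xorig = rng.normal(size=(768, 8))

# Remove the mean of each column and scale to unit variance.
Z = (Xorig - Xorig.mean(axis=0)) / Xorig.std(axis=0)

# Principal directions are the right singular vectors of the
# standardized data; keep the scores on the first two components.
U, s, Vt = np.linalg.svd(Z, full_matrices=False)
scores = Z @ Vt[:2].T   # plays the role of (X_1, X_2) below
```

On the real data, the two columns of `scores` would be the X_1 and X_2 whose loadings are listed next.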
- The two principal components X_1 and X_2 are used in classification:
    X_1 = 0.1284 \tilde{X}_1 + 0.3931 \tilde{X}_2 + 0.3600 \tilde{X}_3 + 0.4398 \tilde{X}_4 + 0.4350 \tilde{X}_5 + 0.4519 \tilde{X}_6 + 0.2706 \tilde{X}_7 + 0.1980 \tilde{X}_8
    X_2 = 0.5938 \tilde{X}_1 + 0.1740 \tilde{X}_2 + 0.1839 \tilde{X}_3 - 0.3320 \tilde{X}_4 - 0.2508 \tilde{X}_5 - 0.1010 \tilde{X}_6 - 0.1221 \tilde{X}_7 + 0.6206 \tilde{X}_8

[Scatter plot of the training data in the (X_1, X_2) plane. Without diabetes: stars (class 1); with diabetes: circles (class 2).]

- The fitted coefficient matrix:
    \hat{B} = (X^T X)^{-1} X^T Y =
        [  0.6510   0.3490
          -0.1256   0.1256
          -0.0729   0.0729 ]
  so that
    \hat{Y}_1 = 0.6510 - 0.1256 X_1 - 0.0729 X_2
    \hat{Y}_2 = 0.3490 + 0.1256 X_1 + 0.0729 X_2 .
  Note that \hat{Y}_1 + \hat{Y}_2 = 1.

Classification Rule

    \hat{G}(x) = 1 if \hat{Y}_1 >= \hat{Y}_2, 2 if \hat{Y}_1 < \hat{Y}_2
               = 1 if 0.151 - 0.1256 X_1 - 0.0729 X_2 >= 0, 2 otherwise.

- Within-training-data classification error rate: 28.52%.
- Sensitivity (probability of claiming positive when the truth is positive): 44.03%.
- Specificity (probability of claiming negative when the truth is negative): 86.20%.

The Phenomenon of Masking

- When the number of classes K >= 3, a class may be masked by others; that is, there is no region in the feature space that is labeled as this class.
- The linear regression model is too rigid.
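The masking effect is easy to reproduce in a hypothetical one-dimensional example (not from the lecture): with three balanced classes ordered along a line, the middle class's fitted indicator is nearly flat at 1/3, so it is almost never the largest component and that class essentially disappears from the predictions.

```python
import numpy as np

# Hypothetical 1-D data: three balanced classes in order along the line.
rng = np.random.default_rng(0)
n = 100
x = np.concatenate([rng.normal(-2.0, 0.4, n),   # class 1
                    rng.normal(0.0, 0.4, n),    # class 2 (gets masked)
                    rng.normal(2.0, 0.4, n)])   # class 3
g = np.repeat([1, 2, 3], n)

# Indicator matrix and design matrix with intercept.
Y = (g[:, None] == np.array([1, 2, 3])).astype(float)
X = np.column_stack([np.ones_like(x), x])

# B = (X^T X)^{-1} X^T Y; classify by the largest fitted indicator.
B, *_ = np.linalg.lstsq(X, Y, rcond=None)
pred = np.argmax(X @ B, axis=1) + 1

# The slope of the class-2 fit is near zero, so its fitted line hugs 1/3
# and rarely rises above the class-1 and class-3 lines.
print(np.bincount(pred, minlength=4)[1:])   # per-class prediction counts
```

A quadratic term in the design matrix would let the middle class's fitted curve bulge upward near its center, which is one standard remedy; linear discriminant analysis avoids the problem as well.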
Fall '09, JIALI, Statistics