ESL Chapter 4 -- Linear Methods for Classification
Trevor Hastie and Rob Tibshirani

Outline
- linear regression
- linear and quadratic discriminant functions
- example: gene expression arrays
- reduced rank LDA
- logistic regression
- separating hyperplanes

Linear classifiers
Some concepts:
- linear regression: $f_k(x) = \beta_{k0} + \beta_k^T x$
- decision boundary between classes $k$ and $\ell$: $\{x : f_k(x) = f_\ell(x)\}$
- linear discriminant analysis and logistic regression model the log-odds: $\log\frac{P(G=1\mid X=x)}{P(G=2\mid X=x)} = \beta_0 + \beta^T x$
- explicit approaches: separating hyperplanes
- discriminant functions $\delta_k(x)$, $k = 1, 2, \ldots, K$, with $\hat G(x) = \arg\max_k \delta_k(x)$

Linear regression
Indicator response matrix: code the response $g_i \in \{1, \ldots, K\}$ as an $N \times K$ matrix $Y$ with $y_{ik} = 1$ if $g_i = k$ and 0 otherwise (e.g. $g_i = 3$ gives the row $(0, 0, 1, 0, \ldots, 0)$). The fitted values are
$\hat Y = X(X^T X)^{-1} X^T Y = X \hat B.$

Writing $\hat f(x) = \hat B^T x = (\hat f_1(x), \ldots, \hat f_K(x))^T$:
- Targets: $t_k = (0, \ldots, 0, 1, 0, \ldots, 0)^T$ (1 in the $k$th position).
- Note: $E(Y \mid X = x) = (p_1(x), \ldots, p_K(x))^T$, where $p_k(x) = P(G = k \mid X = x)$, so the regression is estimating the posterior probabilities.
- The fit solves $\min_B \sum_{i=1}^N \|y_i - B^T x_i\|^2$, where $y_i$ and $x_i$ are the $i$th rows of $Y$ and $X$.
- Classify with $\hat G(x) = \arg\min_k \|\hat f(x) - t_k\|^2$.
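As a concrete illustration of the indicator-response fit above, here is a minimal R sketch (not from the slides; the function names are made up). Since $\|\hat f(x) - t_k\|^2 = \|\hat f(x)\|^2 - 2\hat f_k(x) + 1$, classifying to the nearest target is the same as classifying to the largest fitted value.

```r
# Linear regression on an indicator response matrix (illustrative sketch).
indicator_regression <- function(X, g, K) {
  N <- nrow(X)
  Y <- matrix(0, N, K)
  Y[cbind(1:N, g)] <- 1                    # N x K indicator response matrix (g coded 1..K)
  X1 <- cbind(1, X)                        # include an intercept column
  solve(crossprod(X1), crossprod(X1, Y))   # B-hat = (X'X)^{-1} X'Y
}

classify_indicator <- function(B, Xnew) {
  F <- cbind(1, Xnew) %*% B                # fitted values f-hat_k(x)
  max.col(F)                               # nearest target t_k = largest fitted value
}
```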
Masking problems with linear regression

[Figure: two scatterplots of the three-class data with fitted decision boundaries; left panel "Linear Regression", right panel "Linear Discriminant Analysis"; axes $X_1$, $X_2$.] The data come from three classes in $\mathbb{R}^2$ and are easily separated by linear decision boundaries. The right plot shows the boundaries found by linear discriminant analysis. The left plot shows the boundaries found by linear regression of the indicator response variables. The middle class is completely masked (never dominates).
[Figure: one-dimensional illustration, two panels titled "Degree = 1; Error = 0.25" and "Degree = 2; Error = 0.03".] The effects of masking on linear regression in $\mathbb{R}$ for a three-class problem. The rug plot at the base indicates the positions and class membership of each observation. The three curves in each panel are the fitted regressions to the three-class indicator variables; for example, for the green class, $y_{green}$ is 1 for the green observations, and 0 for the orange and blue. The fits are linear and quadratic polynomials. Above each plot is the training error rate. The Bayes error rate is 0.025 for this problem, as is the LDA error rate.

Linear discriminant analysis
- $f_k(x)$ -- density of $X$ in class $G = k$
- $\pi_k$ -- class prior $\Pr(G = k)$
- Bayes theorem: $\Pr(G = k \mid X = x) = \dfrac{f_k(x)\pi_k}{\sum_{\ell=1}^K f_\ell(x)\pi_\ell}$
- leads to LDA, QDA, MDA (mixture DA), kernel DA, naive Bayes
- LDA assumes Gaussian densities with a common covariance matrix:
  $f_k(x) = \dfrac{1}{(2\pi)^{p/2}|\Sigma|^{1/2}}\, e^{-\frac{1}{2}(x-\mu_k)^T\Sigma^{-1}(x-\mu_k)},$
  so that
  $\log\dfrac{\Pr(G=k\mid x)}{\Pr(G=\ell\mid x)} = \log\dfrac{\pi_k}{\pi_\ell} - \dfrac{1}{2}(\mu_k+\mu_\ell)^T\Sigma^{-1}(\mu_k-\mu_\ell) + x^T\Sigma^{-1}(\mu_k-\mu_\ell).$

More on LDA
- estimate $\mu_k$ by the centroid of class $k$, and $\Sigma$ by the pooled within-class covariance matrix
- estimated Bayes rule: classify to the class $k$ that maximizes the discriminant function
  $\hat\delta_k(x) = x^T\hat\Sigma^{-1}\hat\mu_k - \dfrac{1}{2}\hat\mu_k^T\hat\Sigma^{-1}\hat\mu_k + \log\hat\pi_k$
- for two classes, we classify to class 2 if
  $x^T\hat\Sigma^{-1}(\hat\mu_2-\hat\mu_1) > \dfrac{1}{2}(\hat\mu_2+\hat\mu_1)^T\hat\Sigma^{-1}(\hat\mu_2-\hat\mu_1) - \log\dfrac{N_2}{N_1},$
  where $N_1$, $N_2$ are the numbers of observations in each class.

[Figure: two panels showing two class centroids (marked "+") and the projected data.] Although the line joining the centroids defines the direction of greatest centroid spread, the projected data overlap because of the covariance (left panel). The discriminant direction minimizes this overlap for Gaussian data (right panel).
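A minimal R sketch of the plug-in LDA estimates and the discriminant function $\hat\delta_k(x)$ above (illustrative only; the names are made up, and MASS::lda provides a full implementation).

```r
# Plug-in LDA: class centroids, pooled within-class covariance, discriminant scores.
lda_fit <- function(X, g) {
  K <- length(unique(g)); N <- nrow(X); p <- ncol(X)
  nk <- as.vector(table(g))
  mu <- rowsum(X, g) / nk                          # K x p matrix of class centroids
  S <- matrix(0, p, p)
  for (k in 1:K) {
    Xk <- scale(X[g == sort(unique(g))[k], , drop = FALSE], center = TRUE, scale = FALSE)
    S <- S + crossprod(Xk)
  }
  list(mu = mu, prior = nk / N, Sigma = S / (N - K))   # pooled covariance estimate
}

lda_scores <- function(fit, xnew) {                # xnew: n x p matrix of new points
  Sinv <- solve(fit$Sigma)
  sapply(seq_along(fit$prior), function(k) {
    mk <- fit$mu[k, ]
    drop(xnew %*% Sinv %*% mk) - 0.5 * drop(mk %*% Sinv %*% mk) + log(fit$prior[k])
  })                                               # classify to the column with the largest score
}
```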
Linear boundaries and their projections

[Figure: two scatterplots of three-class data.] The left plot shows some data from three classes, with linear decision boundaries found by linear discriminant analysis. The right plot shows quadratic decision boundaries. These were obtained by finding linear boundaries in the five-dimensional space $X_1, X_2, X_1X_2, X_1^2, X_2^2$. Linear inequalities in this space are quadratic inequalities in the original space.
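A small R sketch of the augmented-feature trick in the caption: build the five-dimensional quadratic basis and fit an ordinary LDA in that space (the helper name is made up; MASS::lda is one standard LDA implementation).

```r
library(MASS)

# Expand (x1, x2) into the quadratic basis (x1, x2, x1*x2, x1^2, x2^2).
quad_basis <- function(X) {
  cbind(X[, 1], X[, 2], X[, 1] * X[, 2], X[, 1]^2, X[, 2]^2)
}

# Linear boundaries in the expanded space are quadratic in the original space:
# fit <- lda(quad_basis(X), grouping = g)
# predict(fit, quad_basis(Xnew))$class
```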
Quadratic discriminant analysis
$\delta_k(x) = -\dfrac{1}{2}\log|\Sigma_k| - \dfrac{1}{2}(x-\mu_k)^T\Sigma_k^{-1}(x-\mu_k) + \log\pi_k$

[Figure: two scatterplots of the three-class data with quadratic boundaries.] Two methods for fitting quadratic boundaries. [Left] Quadratic decision boundaries, obtained using LDA in the five-dimensional "quadratic" space. [Right] Quadratic decision boundaries found by QDA. The differences are small, as is usually the case.

[Figure: left panel, three Gaussian densities with centroids marked "+"; right panel, a sample with fitted LDA boundaries.] The left panel shows three Gaussian distributions, with the same covariance and different means. Included are the contours of constant density enclosing 95% of the probability in each case. The Bayes decision boundaries between each pair of classes are shown (broken straight lines), and the Bayes decision boundaries separating all three classes are the thicker solid lines (a subset of the former). On the right we see a sample of 30 drawn from each Gaussian distribution, and the fitted LDA decision boundaries.

Regularized discriminant analysis
- Regularized QDA: $\hat\Sigma_k(\alpha) = \alpha\hat\Sigma_k + (1-\alpha)\hat\Sigma$
- Regularized LDA: $\hat\Sigma(\gamma) = \gamma\hat\Sigma + (1-\gamma)\hat\sigma^2 I$
- Together: $\hat\Sigma_k(\alpha, \gamma)$
- could use $\hat\Sigma(\gamma) = \gamma\hat\Sigma + (1-\gamma)\,\mathrm{diag}(\hat\Sigma)$
- in the "Nearest Shrunken Centroid" work we use
  $\delta_k(x) = \sum_{j=1}^p \dfrac{(x_j - \bar x'_{jk})^2}{s_j^2} - 2\log\pi_k,$
  where $\bar x'_{jk}$ is a shrunken centroid. Details later.
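A hedged R sketch of the covariance blending used by regularized discriminant analysis above (function names invented; here $\hat\sigma^2$ is taken as the average diagonal element of $\hat\Sigma$, one common choice).

```r
# Sigma_k(alpha) = alpha * Sigma_k + (1 - alpha) * Sigma  (regularized QDA)
rda_cov <- function(Sigma_k, Sigma_pooled, alpha) {
  alpha * Sigma_k + (1 - alpha) * Sigma_pooled
}

# Sigma(gamma) = gamma * Sigma + (1 - gamma) * target  (regularized LDA)
# target is either sigma^2 * I or diag(Sigma), as on the slide.
shrink_pooled <- function(Sigma, gamma, target = c("scalar", "diagonal")) {
  target <- match.arg(target)
  tgt <- if (target == "scalar") mean(diag(Sigma)) * diag(nrow(Sigma))
         else diag(diag(Sigma))
  gamma * Sigma + (1 - gamma) * tgt
}
```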
Regularized Discriminant Analysis on the Vowel Data

[Figure: test and training misclassification rate (0.0-0.5) versus $\alpha \in [0, 1]$.] Test and training errors for the vowel data, using regularized discriminant analysis with a series of values of $\alpha \in [0, 1]$. The optimum for the test data occurs around $\alpha = 0.9$, close to quadratic discriminant analysis.

Classification in high dimensions
- important for gene expression microarray problems and other genomics problems
- starting point: diagonal LDA, which uses $\mathrm{diag}(\hat\Sigma)$
- nearest centroid classification on standardized features is equivalent to diagonal LDA
- nearest shrunken centroids regularizes further, by discarding noisy features

Classification of microarray samples
- Example: small round blue cell tumors; Khan et al., Nature Medicine, 2001.
- Tumors classified as BL (Burkitt lymphoma), EWS (Ewing), NB (neuroblastoma) and RMS (rhabdomyosarcoma).
- There are 63 training samples and 25 test samples, although five of the latter were not SRBCTs. 2308 genes.
- Khan et al. report zero training and test errors, using a complex neural network model. They decided that 96 genes were "important". Too complicated!

[Figure: the Khan data (samples labelled BL, EWS, NB, RMS).]

[Figure: the neural network approach of Khan et al.]

[Figure: class centroids for BL, EWS, NB and RMS -- average expression for each gene, centered at the overall centroid.]

Shrunken centroids
Idea: shrink each class centroid towards the overall centroid. First normalize by the within-class standard deviation for each gene.
Let $x_{ij}$ be the expression for samples $i = 1, 2, \ldots, n$ and genes $j = 1, 2, \ldots, p$. We have classes $1, 2, \ldots, K$, and let $C_k$ be the indices of the $n_k$ samples in class $k$. The $j$th component of the centroid for class $k$ is $\bar x_{jk} = \sum_{i \in C_k} x_{ij}/n_k$, the mean expression value in class $k$ for gene $j$; the $j$th component of the overall centroid is $\bar x_j = \sum_{i=1}^n x_{ij}/n$.

Let
$d_{jk} = (\bar x_{jk} - \bar x_j)/s_j, \qquad (1)$
where $s_j$ is the pooled within-class standard deviation for gene $j$:
$s_j^2 = \dfrac{1}{n-K}\sum_k \sum_{i \in C_k} (x_{ij} - \bar x_{jk})^2. \qquad (2)$
Shrink each $d_{jk}$ towards zero, giving $d'_{jk}$ and new shrunken centroids, or prototypes,
$\bar x'_{jk} = \bar x_j + s_j d'_{jk}. \qquad (3)$

[Figure: the soft-thresholding function, passing through (0, 0).] The shrinkage is soft-thresholding: each $d_{jk}$ is reduced by an amount $\Delta$ in absolute value, and is set to zero if its absolute value is less than $\Delta$. Algebraically, this is expressed as
$d'_{jk} = \mathrm{sign}(d_{jk})(|d_{jk}| - \Delta)_+, \qquad (4)$
where $+$ means positive part ($t_+ = t$ if $t > 0$, and zero otherwise). Choose $\Delta$ by cross-validation.
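A minimal R sketch of the shrinkage step in equations (1)-(4) (illustrative; the pamr package implements the full nearest-shrunken-centroid method).

```r
# Soft-threshold the standardized centroid differences and rebuild shrunken centroids.
soft_threshold <- function(d, Delta) sign(d) * pmax(abs(d) - Delta, 0)   # eq. (4)

shrunken_centroid <- function(xbar_k, xbar, s, Delta) {
  # xbar_k: centroid of class k (length p); xbar: overall centroid; s: pooled within-class SDs
  d  <- (xbar_k - xbar) / s          # eq. (1)
  dp <- soft_threshold(d, Delta)     # shrink towards zero
  xbar + s * dp                      # eq. (3)
}
```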
Advantages
- Simple; includes the nearest centroid classifier as a special case.
- Thresholding denoises large effects and sets small ones to zero, thereby selecting genes.
- With more than two classes, the method can select different genes, and different numbers of genes, for each class.

Class probabilities
For a test sample $x^* = (x_1^*, x_2^*, \ldots, x_p^*)$, we define the discriminant score for class $k$
$\delta_k(x^*) = \sum_{j=1}^p \dfrac{(x_j^* - \bar x'_{jk})^2}{s_j^2} - 2\log\pi_k. \qquad (5)$
The classification rule is then
$C(x^*) = \ell \ \text{ if } \ \delta_\ell(x^*) = \min_k \delta_k(x^*). \qquad (6)$
Estimates of the class probabilities, by analogy to Gaussian linear discriminant analysis, are
$\hat p_k(x^*) = \dfrac{e^{-\frac{1}{2}\delta_k(x^*)}}{\sum_{\ell=1}^K e^{-\frac{1}{2}\delta_\ell(x^*)}}. \qquad (7)$
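A short R sketch of the discriminant score (5), rule (6) and probability estimates (7) (names invented; the softmax is computed after subtracting the smallest score for numerical stability, which leaves the probabilities unchanged).

```r
# Nearest-shrunken-centroid scores, class assignment and class probabilities.
nsc_classify <- function(xstar, centroids, s, prior) {
  # centroids: K x p matrix of shrunken centroids; s: pooled SDs; prior: class priors
  delta <- apply(centroids, 1, function(ck) sum((xstar - ck)^2 / s^2)) - 2 * log(prior)  # eq. (5)
  w <- exp(-(delta - min(delta)) / 2)             # eq. (7), stabilized
  list(class = which.min(delta),                  # eq. (6)
       prob  = w / sum(w))
}
```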
Results on Khan data
At the optimal point, there are 43 active genes.

[Figure: training (tr), cross-validation (cv) and test (te) error curves versus the amount of shrinkage $\Delta$; the number of active genes decreases from 2308 towards 1 as $\Delta$ grows.]

[Figure: estimated class probabilities for the training and test samples, by class (BL, EWS, NB, RMS).]

The genes that matter
Image clone IDs of the selected genes:
813841, 859359, 207274, 296448, 898219, 784224, 796258, 244618, 789253, 298062, 461425, 1409509, 42558, 769716, 25725, 44563, 325182, 812105, 41591, 810057, 52076, 866702, 814260, 43733, 357031, 1435862, 770394, 377461, 1473131, 295985, 241412, 80109, 183337, 233721, 897788, 563673, 504791, 212542, 365826, 204545, 308163, 21652, 486110.
Gene annotations (in the order listed on the slide):
tissue plasminogen activator; quinone oxidoreductase homolog; insulin-like growth factor 2; insulin-like growth factor 2 (somatomedin A); homolog of mouse mesoderm specific transcript; fibroblast growth factor receptor 4; sarcoglycan alpha (dystrophin-associated glycoprotein); EST; presenilin 2 (Alzheimer disease 4); troponin T2, cardiac muscle isoforms; myosin MYL4; troponin T1, slow skeletal muscle isoforms; L-arginine:glycine amidinotransferase; neurofibromin 2 (mutated in neurofibromatosis type 2); farnesyl-diphosphate farnesyltransferase 1; growth associated protein 43 (GAP43); N-cadherin (neuronal); ALL1-fused gene from chromosome 1q; meningioma 1 (disrupted in balanced translocation); cold shock domain protein A; neuroblastoma protein (NOE1); Fas-associated protein tyrosine phosphatase 1; follicular lymphoma variant translocation protein 1; glycogenin 2; tumor necrosis factor alpha-induced protein 6; MIC2 surface antigen (CD99); IgG Fc fragment receptor transporter, alpha chain; caveolin 1 (caveolae protein); transducin-like enhancer of split 2; EST; E74-like factor 1 (ets domain transcription factor); major histocompatibility complex, class II, DQ alpha 1; major histocompatibility complex, class II, DM alpha; insulin-like growth factor binding protein 2; receptor type protein tyrosine phosphatase F; antiquitin 1; glutathione S-transferase A4; cDNA DKFZp586J2118; growth arrest-specific protein 1; EST; EST; alpha 1 catenin (cadherin-associated protein); profilin 2.

Heatmap of selected genes
[Figure: expression heatmap of the selected genes across the BL, EWS, NB and RMS samples.]

Reduced rank LDA
- let $\hat\Sigma = UDU^T$ (eigendecomposition)
- let $x^* = D^{-1/2}U^T x = \hat\Sigma^{-1/2}x$ and $\mu_k^* = \hat\Sigma^{-1/2}\hat\mu_k$
- LDA: classify to the class minimizing $\frac{1}{2}\|x^* - \mu_k^*\|^2 - \log\hat\pi_k$ (closest centroid in the sphered space, with a correction for class size)
- hence if $p > K-1$, we can project the data onto the $(K-1)$-dimensional space spanned by the $\mu_k^*$ and lose nothing

Can project onto even lower dimensions, using the principal components of the $\mu_k^*$:
- compute the $K \times p$ matrix of centroids $M$, and $M^* = M\hat\Sigma^{-1/2}$
- reduce $M^*$ by principal components; i.e. compute $B^*$, the covariance matrix of $M^*$ (with mass $\pi_k$ on each row), $B^* = V^* D_B V^{*T}$
- $z_\ell = v_\ell^T x$ is the $\ell$th discriminant (or canonical) variable, with $v_\ell = \hat\Sigma^{-1/2} v_\ell^*$

Linear Discriminant Analysis

[Figure: the vowel training data plotted in the first two canonical coordinates, with the projected class means shown as heavy circles.] A two-dimensional plot of the vowel training data. There are eleven classes with $X \in \mathbb{R}^{10}$, and this is the best view in terms of an LDA model. The heavy circles are the projected mean vectors for each class. The class overlap is considerable.
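An illustrative R sketch of the canonical-coordinate construction above: sphere with $\hat\Sigma^{-1/2}$, take a weighted PCA of the sphered centroids, and project the data onto the leading discriminant variables (function names made up).

```r
# Discriminant (canonical) coordinates from centroids mu (K x p), pooled Sigma and priors.
discriminant_coords <- function(X, mu, Sigma, prior, L = 2) {
  e <- eigen(Sigma, symmetric = TRUE)
  Sinv_half <- e$vectors %*% diag(1 / sqrt(e$values)) %*% t(e$vectors)  # Sigma^{-1/2}
  Mstar <- mu %*% Sinv_half                       # sphered centroids M*
  center <- colSums(prior * Mstar)                # prior-weighted mean of the centroids
  Mc <- sweep(Mstar, 2, center)
  Bstar <- t(Mc) %*% (prior * Mc)                 # between-class covariance B* (mass pi_k per row)
  Vstar <- eigen(Bstar, symmetric = TRUE)$vectors[, 1:L, drop = FALSE]
  (X %*% Sinv_half) %*% Vstar                     # first L discriminant variables z_l
}
```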
Projections onto pairs of discriminant variates

[Figure: the vowel training data plotted in several pairs of canonical coordinates (coordinate 3 versus 1, 3 versus 2, 7 versus 1, and 10 versus 9), with the projected class means marked.]
[Figure: two class centroids and projected data, repeated from earlier.] Although the line joining the centroids defines the direction of greatest centroid spread, the projected data overlap because of the covariance (left panel). The discriminant direction minimizes this overlap for Gaussian data (right panel).

Fisher's formulation of discriminant analysis
Find $z = a^T x$ such that the between-class variance is maximized relative to the within-class variance $W = \hat\Sigma$:
$\max_{a \in \mathbb{R}^p} \dfrac{a^T B a}{a^T W a}$, or equivalently $\max_{a \in \mathbb{R}^p} a^T B a$ subject to $a^T W a = 1$,
which gives $v_1 = a$; then find the next direction orthogonal to the first:
$\max_{a \in \mathbb{R}^p} a^T B a$ subject to $a^T W a = 1$ and $a^T W v_1 = 0$,
which gives $v_2$, and so on. This is equivalent to the PCA of the standardized centroids on the earlier slide.

LDA and Dimension Reduction on the Vowel Data

[Figure: training and test misclassification rate (roughly 0.3-0.7) versus the dimension of the discriminant subspace.] Training and test error rates for the vowel data, as a function of the dimension of the discriminant subspace. In this case the best error rate is for dimension 2.

Performance on Vowel Data

                      Train   Test
Linear Regression      0.48   0.67
LDA                    0.32   0.56
Reduced Rank LDA       0.36   0.50
QDA                    0.01   0.53
Logistic Regression    0.22   0.51

Classification in Reduced Subspace

[Figure: the vowel training data and classification regions plotted in canonical coordinates 1 and 2.]

Linear Logistic Regression
Two-class case; $Y = 0/1$ codes the classes. Model $p(x) = \Pr(Y = 1 \mid x)$:
$\mathrm{logit}\, p(x) \equiv \log\dfrac{p(x)}{1-p(x)} = \beta^T x$, so $p(x) = \dfrac{e^{\beta^T x}}{1 + e^{\beta^T x}}$.
Log-likelihood $= \sum_{i=1}^n \{y_i \log p_i + (1-y_i)\log(1-p_i)\}$.
IRLS algorithm:
1. Initialize $\beta$.
2. Form linearized responses $z_i = \beta^T x_i + (y_i - p_i)/\{p_i(1-p_i)\}$.
3. Form weights $w_i = p_i(1-p_i)$.
4. Update $\beta$ by weighted least squares of $z_i$ on $x_i$ with weights $w_i$.
Steps 2-4 are repeated until convergence.
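A compact R sketch of the IRLS loop above (illustrative; in practice glm(y ~ X, family = binomial) performs the same iteration).

```r
# Iteratively reweighted least squares for two-class logistic regression.
irls_logistic <- function(X, y, maxit = 25, tol = 1e-8) {
  X <- cbind(1, X)                                  # add an intercept column
  beta <- rep(0, ncol(X))
  for (it in 1:maxit) {
    eta <- drop(X %*% beta)
    p <- 1 / (1 + exp(-eta))                        # current fitted probabilities
    w <- p * (1 - p)                                # IRLS weights
    z <- eta + (y - p) / w                          # linearized (working) response
    beta_new <- drop(solve(crossprod(X, w * X), crossprod(X, w * z)))  # weighted LS step
    if (max(abs(beta_new - beta)) < tol) { beta <- beta_new; break }
    beta <- beta_new
  }
  beta
}
```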
Properties of logistic regression solutions
- The solutions satisfy the score equations $X^T(y - \hat p) = 0$.
- If $W$ is a diagonal matrix with weights $\hat w_i = \hat p_i(1 - \hat p_i)$, then the asymptotic covariance matrix of $\hat\beta$ is
  $\mathrm{cov}(\hat\beta) = (X^T W X)^{-1}$.
- If the two classes are linearly separable, then the solution is undefined! [The MLE tries to achieve fitted probabilities of 0 and 1, and for this some elements of $\hat\beta$ must go to $\pm\infty$.]
- Inference proceeds in a manner very similar to that for linear regression.

IRLS Newton algorithm
$\dfrac{\partial \ell(\beta)}{\partial \beta} = X^T(y - p), \qquad \dfrac{\partial^2 \ell(\beta)}{\partial \beta\, \partial \beta^T} = -X^T W X.$
A Newton step is thus
$\beta^{new} = \beta^{old} + (X^T W X)^{-1}X^T(y-p) = (X^T W X)^{-1}X^T W\big(X\beta^{old} + W^{-1}(y-p)\big) = (X^T W X)^{-1}X^T W z.$
In the second and third expressions we have re-expressed the Newton-Raphson step as a weighted least squares step, with the response $z = X\beta^{old} + W^{-1}(y-p)$.
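A small follow-on sketch (assumptions: X already contains the intercept column and p, beta come from a converged IRLS fit such as the one above): the estimated covariance $(X^TWX)^{-1}$ yields standard errors and Z scores of the kind tabulated below.

```r
# Standard errors and Z scores from the asymptotic covariance (X' W X)^{-1}.
logistic_inference <- function(X, p, beta) {
  w <- p * (1 - p)
  covb <- solve(crossprod(X, w * X))
  se <- sqrt(diag(covb))
  cbind(Coefficient = beta, Std.Error = se, Z.score = beta / se)
}
```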
[Figure: scatterplot matrix of the risk factors sbp, tobacco, ldl, famhist, obesity, alcohol and age.] A scatterplot matrix of the South African heart disease data. Each plot shows a pair of risk factors, and the cases (160) and controls (302) are color coded (red is a case). The variable famhist (family history of heart disease) is binary (yes or no).

Results from a logistic regression fit to the South African heart disease data:

              Coefficient   Standard Error   Z Score
(Intercept)        -4.130            0.964    -4.285
sbp                 0.006            0.006     1.023
tobacco             0.080            0.026     3.034
ldl                 0.185            0.057     3.219
famhist             0.939            0.225     4.178
obesity            -0.035            0.029    -1.187
alcohol             0.001            0.004     0.136
age                 0.043            0.010     4.184

Model building
- Deviance: $\mathrm{dev}(y, \hat p) = -2\ell(\hat\beta)$.
- $H_0$: only the first $q$ components of $\beta$ are non-zero (the remaining $p-q$ are zero); $H_1$: $\beta$ is unrestricted.
- Under $H_0$, $\mathrm{dev}(y, \hat p_0) - \mathrm{dev}(y, \hat p_1) \sim \chi^2_{p-q}$ asymptotically (as $N \to \infty$).
- "Chi-square statistic" -- a quadratic approximation to the deviance:
  $\sum_{i=1}^n w_i(z_i - x_i^T\hat\beta)^2 = \sum_{i=1}^n \dfrac{(y_i - \hat p_i)^2}{\hat p_i(1 - \hat p_i)}.$
- $\hat\beta \sim N(\beta, (X^T W X)^{-1})$ asymptotically, if the model is correct.

Case-control sampling and logistic regression
- In the South African data there are 160 cases and 302 controls, so $\tilde\pi = 0.35$ of the sample are cases. Yet the prevalence of MI in this region is $\pi = 0.05$.
- With case-control samples, we can estimate the regression parameters $\beta_j$ accurately; the constant term $\beta_0$ is incorrect.
- We can correct the estimated intercept by a simple transformation:
  $\hat\beta_0^* = \hat\beta_0 + \log\dfrac{\pi}{1-\pi} - \log\dfrac{\tilde\pi}{1-\tilde\pi}.$
- Often cases are rare and we take them all; up to about five times that number of controls is sufficient. See the next figure.

[Figure: coefficient variance (simulation and theoretical curves) versus control/case ratio from 2 to 14.] Sampling more controls than cases reduces the variance of the parameter estimates. But after a ratio of about 5 to 1 the variance reduction flattens out.
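A one-line R sketch of the intercept correction above, using the fitted intercept and the case proportions quoted on the slides.

```r
# Correct a case-control intercept to the population prevalence.
correct_intercept <- function(beta0_hat, pi_sample, pi_pop) {
  beta0_hat + log(pi_pop / (1 - pi_pop)) - log(pi_sample / (1 - pi_sample))
}

correct_intercept(-4.130, pi_sample = 0.35, pi_pop = 0.05)   # sample has 35% cases, population 5%
```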
Risk estimates and classification
- We can estimate the risk for a new observation $x_0$ via $\hat\eta(x_0) = x_0^T\hat\beta$ and $\hat{\Pr}(Y=1\mid X=x_0) = e^{\hat\eta(x_0)}/(1 + e^{\hat\eta(x_0)})$.
- To obtain a 95% confidence interval for $\Pr(Y=1\mid X=x_0)$, we first obtain one for $\hat\eta(x_0)$ (using the estimated covariance of $\hat\beta$). We then apply the sigmoid transformation to the lower and upper values.
- To classify a new observation, we threshold $\hat{\Pr}(Y=1\mid X=x_0)$ at 0.5. Other thresholds change the sensitivity and specificity, and are used to construct ROC curves.

Multiple Logistic Regression
The model is defined in terms of $J-1$ logits $\eta_j(X) = \beta_j^T X$:
$\log\dfrac{P(G=1\mid X)}{P(G=J\mid X)} = \eta_1(X), \quad \log\dfrac{P(G=2\mid X)}{P(G=J\mid X)} = \eta_2(X), \quad \ldots, \quad \log\dfrac{P(G=J-1\mid X)}{P(G=J\mid X)} = \eta_{J-1}(X),$
so that
$P(G=j\mid X) = \dfrac{e^{\eta_j(X)}}{1 + \sum_{\ell=1}^{J-1} e^{\eta_\ell(X)}}.$
Fit by least squares or multinomial maximum likelihood.

Logistic Regression with $p \gg N$
- Typically linear models are sufficient: $\mathrm{logit}(p_i) = \beta^T x_i$.
- Models have to be regularized.
- Ridge penalty -- similar to the SVM:
  $\mathrm{PLL} = \sum_{i=1}^N \{y_i\log p_i + (1-y_i)\log(1-p_i)\} - \lambda\|\beta\|^2$
- Lasso penalty -- selects variables:
  $\mathrm{PLL} = \sum_{i=1}^N \{y_i\log p_i + (1-y_i)\log(1-p_i)\} - \lambda\sum_{j=1}^p|\beta_j|$
- IRLS algorithm for ridge, and a LARS-like algorithm for the lasso.

Glmnet software in R
- Glmnet fits the GLM family of models by penalized maximum likelihood. This includes (multiple) logistic regression.
- Glmnet computes the entire "regularization path" for the "elastic net" penalty family:
  $\max_\beta\ \ell(\beta) - \lambda\Big[\tfrac{1}{2}(1-\alpha)\|\beta\|^2 + \alpha\|\beta\|_1\Big].$
- The regularization path follows a complete grid of values for $\lambda$, with $\alpha$ fixed. $\alpha$ spans ridge ($\alpha = 0$) to lasso ($\alpha = 1$).
- For multiple logistic regression, the model is symmetric, with $\eta_j(x) = x^T\beta_j$ and
  $P(G=j\mid x) = e^{\eta_j(x)}\Big/\sum_{\ell=1}^K e^{\eta_\ell(x)}.$
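A brief usage sketch of the glmnet interface just described (the data objects x and y are assumed to exist; x is a numeric predictor matrix and y a binary outcome).

```r
library(glmnet)

# Elastic-net logistic regression: alpha = 1 is the lasso, alpha = 0 is ridge.
fit <- glmnet(x, y, family = "binomial", alpha = 0.5)      # whole path over lambda
cv  <- cv.glmnet(x, y, family = "binomial", alpha = 0.5)   # choose lambda by cross-validation

coef(cv, s = "lambda.min")                                  # coefficients at the selected lambda
predict(cv, newx = x, s = "lambda.min", type = "response")  # fitted probabilities

# For K > 2 classes, use family = "multinomial" for the symmetric multilogit model.
```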
Logistic regression or LDA?
LDA:
$\log\dfrac{\Pr(G=j\mid X=x)}{\Pr(G=K\mid X=x)} = \log\dfrac{\pi_j}{\pi_K} - \dfrac{1}{2}(\mu_j+\mu_K)^T\Sigma^{-1}(\mu_j-\mu_K) + x^T\Sigma^{-1}(\mu_j-\mu_K) = \alpha_{j0} + \alpha_j^T x.$
This linearity is a consequence of the Gaussian assumption for the class densities, as well as the assumption of a common covariance matrix.
Logistic model:
$\log\dfrac{\Pr(G=j\mid X=x)}{\Pr(G=K\mid X=x)} = \beta_{j0} + \beta_j^T x.$
They use the same form for the logits.

Discriminative vs generative (informative) learning: logistic regression uses the conditional distribution of $Y$ given $x$ to estimate the parameters, while LDA uses the full joint distribution (assuming normality),
$\Pr(X, G=j) = \Pr(X)\Pr(G=j\mid X).$
If normality holds, LDA is up to 30% more efficient; otherwise logistic regression can be more robust. But the methods are similar in practice. The additional efficiency is obtained from using observations far from the decision boundary to help estimate $\Sigma$ (dubious!).

Naive Bayes Models
Suppose we estimate the class densities $f_1(X)$ and $f_2(X)$ for the features in class 1 and 2 respectively. Bayes' formula tells us how to convert these to class posterior probabilities:
$\Pr(Y=1\mid X) = \dfrac{f_1(X)\pi_1}{f_1(X)\pi_1 + f_2(X)\pi_2},$
where $\pi_1 = \Pr(Y=1)$ and $\pi_2 = 1 - \pi_1$.
Since $X$ is often high dimensional, the following independence model is convenient:
$f_j(X) = \prod_{m=1}^p f_{jm}(X_m).$
Works for more than two classes as well.

Each of the component densities $f_{jm}$ is estimated separately within each class:
- discrete components via histograms
- quantitative components via Gaussians or smooth density estimates.
The PAM model has this structure, and in addition:
- assumes the Gaussian densities have the same variance in each class
- shrinks the class centroids towards the overall mean in each class.
More general models have less bias but are typically hard to estimate in high dimensions, so the independence assumption may not hurt too much.

Naive Bayes vs Quadratic Discriminant Analysis

[Figure: two scatterplots of the three-class data with quadratic boundaries.] Two methods for fitting quadratic boundaries. [Left] Quadratic decision boundaries, obtained using naive Bayes. [Right] Quadratic decision boundaries found by QDA.

Naive Bayes and GAMs
Note that
$\log\dfrac{f_1(X)\pi_1}{f_2(X)\pi_2} = \alpha + \sum_{m=1}^p g_m(X_m),$
a generalized additive logistic regression model. GAMs are fit by binomial maximum likelihood; naive Bayes models are fit using the full likelihood. GAMs are discussed in Chapters 5 and 6.
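A minimal R sketch of a Gaussian naive Bayes classifier matching the independence model above (illustrative; names made up, each feature gets its own per-class mean and standard deviation).

```r
# Gaussian naive Bayes: per-class, per-feature normal densities plus class priors.
nb_fit <- function(X, g) {
  lapply(sort(unique(g)), function(k) {
    Xk <- X[g == k, , drop = FALSE]
    list(mean = colMeans(Xk), sd = apply(Xk, 2, sd), prior = nrow(Xk) / nrow(X))
  })
}

nb_posterior <- function(fit, xnew) {              # xnew: a single p-vector
  logpost <- sapply(fit, function(s)
    sum(dnorm(xnew, mean = s$mean, sd = s$sd, log = TRUE)) + log(s$prior))
  w <- exp(logpost - max(logpost))                 # stabilized softmax over classes
  w / sum(w)
}
```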
Separating hyperplanes
$\{x : \beta_0 + \beta_1 x_1 + \beta_2 x_2 = 0\}$

[Figure: a two-class toy example with several separating lines.] A toy example with two classes separable by a hyperplane. The orange line is the least squares solution, which misclassifies one of the training points. Also shown are two blue separating hyperplanes found by the "perceptron learning algorithm" with different random starts.

Rosenblatt's Perceptron Learning Algorithm
If a response $y_i = 1$ is misclassified, then $x_i^T\beta + \beta_0 < 0$, and the opposite for a misclassified response with $y_i = -1$. The goal is to minimize
$D(\beta, \beta_0) = -\sum_{i \in \mathcal{M}} y_i(x_i^T\beta + \beta_0)$
over $\|\beta\| = 1$, where $\mathcal{M}$ indexes the set of misclassified points.
$\dfrac{\partial D(\beta,\beta_0)}{\partial\beta} = -\sum_{i\in\mathcal{M}} y_i x_i, \qquad \dfrac{\partial D(\beta,\beta_0)}{\partial\beta_0} = -\sum_{i\in\mathcal{M}} y_i.$
Stochastic gradient descent converges if the data are separable (Ex. 4.6):
$\begin{pmatrix}\beta\\ \beta_0\end{pmatrix} \leftarrow \begin{pmatrix}\beta\\ \beta_0\end{pmatrix} + \rho\begin{pmatrix}y_i x_i\\ y_i\end{pmatrix}.$

Geometry
Consider the line $L$: $f(x) = \beta_0 + \beta^T x = 0$. For any $x_1$ and $x_2$ on the line, $\beta^T(x_1 - x_2) = 0$. Hence $\beta^* = \beta/\|\beta\|$ is the normal to the affine set $f(x) = 0$. The signed distance of any point $x$ to $L$ is given by
$\beta^{*T}(x - x_0) = \dfrac{1}{\|\beta\|}(\beta^T x + \beta_0) = \dfrac{1}{\|f'(x)\|}\, f(x),$
where $x_0$ is any point on $L$.

Optimal Separating Hyperplanes
Problem:
$\max_{\beta, \beta_0, \|\beta\|=1} C \quad \text{subject to } y_i(x_i^T\beta + \beta_0) \ge C,\ i = 1, \ldots, N.$
Convex optimization shows that the solution has the form
$\hat\beta = \sum_{i=1}^N \hat\alpha_i y_i x_i,$
with $\hat\alpha_i > 0$ if $x_i$ is on the boundary of the margin, and $\hat\alpha_i = 0$ otherwise. Such boundary points are called support points (3 in the toy example).

[Figure: the same toy example with the maximum margin shaded.] The shaded region delineates the maximum margin separating the two classes. There are three support points indicated, which lie on the boundary of the margin, and the optimal separating hyperplane (blue line) bisects the slab. Included in the figure is the boundary found using logistic regression (red line), which is very close to the optimal separating hyperplane (see Chapter 12 of ESL).
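A small R sketch of Rosenblatt's update above (illustrative; it cycles over the currently misclassified points until the data are separated or an epoch limit is reached; rho is the learning rate).

```r
# Perceptron learning algorithm for y in {-1, +1}.
perceptron <- function(X, y, rho = 1, max_epochs = 100) {
  beta <- rep(0, ncol(X)); beta0 <- 0
  for (epoch in 1:max_epochs) {
    mis <- which(y * (drop(X %*% beta) + beta0) <= 0)   # misclassified points
    if (length(mis) == 0) break                          # separated: done
    for (i in mis) {                                     # stochastic gradient updates
      beta  <- beta + rho * y[i] * X[i, ]
      beta0 <- beta0 + rho * y[i]
    }
  }
  list(beta = beta, beta0 = beta0)
}
```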