ESL Chapter 4 — Linear Methods for Classification
Trevor Hastie and Rob Tibshirani
…, with $\eta_j(x) = x^T \beta_j \sim \log \Pr(G = j \mid x)$ and

$$\Pr(G = j \mid x) = \frac{e^{\eta_j(x)}}{\sum_{\ell=1}^{K} e^{\eta_\ell(x)}}.$$

Logistic regression or LDA?

• LDA:

$$\log \frac{\Pr(G = j \mid X = x)}{\Pr(G = K \mid X = x)} = \log \frac{\pi_j}{\pi_K} - \frac{1}{2}(\mu_j + \mu_K)^T \Sigma^{-1} (\mu_j - \mu_K) + x^T \Sigma^{-1} (\mu_j - \mu_K) = \alpha_{j0} + \alpha_j^T x.$$

This linearity is a consequence of the Gaussian assumption for the class densities, together with the assumption of a common covariance matrix.

• Logistic model:

$$\log \frac{\Pr(G = j \mid X = x)}{\Pr(G = K \mid X = x)} = \beta_{j0} + \beta_j^T x.$$

The two models use the same linear form for the logits.

• Discriminative vs. generative (informative) learning: logistic regression uses only the conditional distribution of $G$ given $X$ to estimate the parameters, while LDA uses the full joint distribution (assuming normality):

$$\Pr(X, G = j) = \Pr(X)\,\Pr(G = j \mid X).$$

• If normality holds, LDA is up to 30% more efficient (Efron 1975); otherwise logistic regression can be more robust. In practice, though, the two methods give similar results.

• The additional efficiency comes from using observations far from the decision boundary to help estimate $\Sigma$ (dubious!).

Naive Bayes Models

Suppose we estimate the class densities $f_1(X)$ and $f_2(X)$ for the features in classes 1 and 2 respectively. Bayes' formula tells us how to convert these into class posterior probabilities:

$$\Pr(Y = 1 \mid X) = \frac{f_1(X)\,\pi_1}{f_1(X)\,\pi_1 + f_2(X)\,\pi_2},$$

where $\pi_1 = \Pr(Y = 1)$ and $\pi_2 = 1 - \pi_1$. Since $X$ is often high-dimensional, the following within-class independence model is convenient:

$$f_j(X) \approx \prod_{m=1}^{p} f_{jm}(X_m).$$

This works for more than two classes as well.

• Each component density $f_{jm}$ is estimated separately within each class:
  – discrete components via histograms;
  – quantitative components via Gaussians or smooth density estimates.

• The nearest shrunken centroids model has this structure, and in addition assu…
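As a quick illustration of the softmax formula above, here is a minimal sketch (the function name `softmax_posteriors` and the coefficient matrix `B` are made up for the example) that converts linear logits $\eta_j(x) = x^T \beta_j$ into class posteriors:

```python
import numpy as np

def softmax_posteriors(X, B):
    """Class posteriors Pr(G = j | x) from linear logits eta_j(x) = x^T beta_j.

    X: (n, p) feature matrix; B: (p, K) coefficients, one column per class.
    """
    eta = X @ B                                    # (n, K) logits
    eta -= eta.max(axis=1, keepdims=True)          # stabilize the exponentials
    num = np.exp(eta)
    return num / num.sum(axis=1, keepdims=True)    # each row sums to 1

# Toy example with arbitrary coefficients (illustrative only)
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))                        # n = 5, p = 3
B = rng.normal(size=(3, 4))                        # K = 4 classes
print(softmax_posteriors(X, B).sum(axis=1))        # all ones
```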
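The linearity of the LDA log-odds can be checked numerically. Below is a sketch, with made-up class means, priors, and a shared covariance, comparing the closed-form logit $\alpha_{j0} + \alpha_j^T x$ to the log-odds computed directly from the two Gaussian class densities:

```python
import numpy as np
from scipy.stats import multivariate_normal

# Made-up two-class problem with a shared covariance (the LDA assumption)
rng = np.random.default_rng(1)
p = 3
mu_j, mu_K = rng.normal(size=p), rng.normal(size=p)
A = rng.normal(size=(p, p))
Sigma = A @ A.T + p * np.eye(p)                    # symmetric positive definite
pi_j, pi_K = 0.3, 0.7
Sinv = np.linalg.inv(Sigma)

# Closed-form coefficients of the linear logit alpha_{j0} + alpha_j^T x
alpha_j0 = np.log(pi_j / pi_K) - 0.5 * (mu_j + mu_K) @ Sinv @ (mu_j - mu_K)
alpha_j = Sinv @ (mu_j - mu_K)

x = rng.normal(size=p)
closed_form = alpha_j0 + alpha_j @ x

# Log-odds computed directly from the Gaussian densities and priors
direct = (np.log(pi_j) + multivariate_normal(mu_j, Sigma).logpdf(x)
          - np.log(pi_K) - multivariate_normal(mu_K, Sigma).logpdf(x))
print(np.isclose(closed_form, direct))             # True: the log-odds is linear in x
```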
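To see the discriminative/generative contrast concretely, one can fit both classifiers on data simulated from LDA's own model (two Gaussian classes with a common covariance). A sketch using scikit-learn, which is not part of the original slides:

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression

# Simulated data satisfying the LDA assumptions (made-up parameters)
rng = np.random.default_rng(2)
n, cov = 500, [[1.0, 0.5], [0.5, 1.0]]
X = np.vstack([rng.multivariate_normal([0.0, 0.0], cov, size=n),
               rng.multivariate_normal([1.5, 1.0], cov, size=n)])
y = np.r_[np.zeros(n), np.ones(n)]

lda = LinearDiscriminantAnalysis().fit(X, y)   # generative: models Pr(X, G)
lr = LogisticRegression().fit(X, y)            # discriminative: models Pr(G | X)

# Both boundaries are linear; the fitted coefficients are typically close
# when normality holds, which is where LDA's extra efficiency comes from.
print("LDA:", lda.intercept_, lda.coef_)
print("LR: ", lr.intercept_, lr.coef_)
```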
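Finally, a minimal sketch of the naive Bayes recipe, assuming all features are quantitative so that each within-class component density $f_{jm}$ is a univariate Gaussian (for discrete components, histogram estimates would take their place):

```python
import numpy as np
from scipy.stats import norm

# Made-up training features for two classes (n observations, p features each)
rng = np.random.default_rng(3)
n, p = 200, 5
X1 = rng.normal(0.0, 1.0, size=(n, p))         # class 1
X2 = rng.normal(1.0, 1.2, size=(n, p))         # class 2
pi1, pi2 = 0.5, 0.5                            # class priors

# Each f_{jm} is estimated separately within its class: one Gaussian per feature
mu1, sd1 = X1.mean(axis=0), X1.std(axis=0)
mu2, sd2 = X2.mean(axis=0), X2.std(axis=0)

def posterior_class1(x):
    # f_j(x) ~ prod_m f_{jm}(x_m); sum the logs for numerical stability
    log_f1 = norm.logpdf(x, mu1, sd1).sum()
    log_f2 = norm.logpdf(x, mu2, sd2).sum()
    a, b = np.log(pi1) + log_f1, np.log(pi2) + log_f2
    return 1.0 / (1.0 + np.exp(b - a))         # Pr(Y = 1 | X = x) via Bayes' formula

print(posterior_class1(np.zeros(p)))           # near the class-1 mean: close to 1
```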