Methods for large p, small n problems

- Regression has (at least) three major purposes:
  1. Estimate coefficients in a prespecified model
  2. Discover an appropriate model
  3. Predict values for new observations
- Regression includes classification, because classification can be viewed as a logistic regression.
- Up to now (500, 511), we have worked with data with a small to moderate number of observations and a small number of potentially important variables.
- What if there are many potential variables?
- e.g. 10,000 variables and 100 observations: p >> n
- Linear regression with all 10,000 variables will not work (X'X is singular).
- Forward stepwise or all-subsets regression will not work well when predictors are highly correlated.

© 2011 Dept. Statistics (Iowa State University), Stat 511 section 29

- Often prediction is the most important goal, but you need 1) and 2) to do prediction.
- Practical methods are sometimes called data mining.
- Huge area of current research: Statistics, Computer Science.
- Large number of potential methods.
- We look very quickly at four methods:
  1. Principal Components Regression
  2. Partial Least Squares Regression
  3. Ridge Regression
  4. Lasso
- Support Vector Machines, Neural Networks, and elastic nets are other popular (or upcoming) techniques that we will not talk about.
- Hastie, Tibshirani, and Friedman (2009), The Elements of Statistical Learning, is my primary source for this material. It also discusses SVMs, NNs, and ENs.

Examples

- Arcene data:
  - Use abundance of proteins in blood to predict whether an individual has cancer (ovarian or prostate).
  - Training data: 10,000 variables, 100 subjects: 44 with cancer, 56 without.
  - Validation data: 100 subjects, used to assess performance of the model on new data (out-of-sample prediction).
- Corn grain data:
  - Use the near-infrared absorption spectrum to predict % oil in corn grain.
  - Data from the ISU grain quality lab, provided by Dr. Charlie Hurburgh.
  - Training Data 1: 20 samples from 2008.
  - Training Data 2: 337 samples from 2005-2008.
  - Validation Data: 126 samples from 2009, 2010.
  - 38 variables, all highly correlated: pairwise correlations 0.79-0.9999.

Principal Components Regression

- Useful for:
  - Reducing the number of X variables
  - Eliminating collinearity
- Example:
  - Corn grain data, training set 2: 337 samples, 38 X variables.
  - X variables highly correlated: large off-diagonal elements of X'X.
- Remember the spectral (aka eigen) decomposition (Koehler notes, p. 87): A = U D U', where U'U = I.
- Center all X variables: X_ci = X_i - mu_i.
- Apply the decomposition to X_c' X_c = U D U'.
- Compute new X variables: X* = X_c U.
- This is a rigid rotation around (0, 0).

[Scatterplot illustrating the rigid rotation of the centered X variables omitted.]
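The claim that ordinary least squares breaks down when p >> n can be checked numerically. This is a minimal sketch (not from the notes) using simulated Gaussian data: with more predictors than observations, the p x p matrix X'X has rank at most n, so it is singular and the normal equations have no unique solution.

```python
import numpy as np

# Simulated illustration: 10 observations, 50 predictors (p >> n).
rng = np.random.default_rng(0)
n, p = 10, 50
X = rng.normal(size=(n, p))

XtX = X.T @ X                          # 50 x 50 matrix, but rank <= n = 10
rank = np.linalg.matrix_rank(XtX)
print(rank)                            # → 10, far below p = 50
```

Because rank(X'X) < p, the matrix cannot be inverted, which is exactly why the slides turn to methods like principal components, ridge, and the lasso.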
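The principal components steps above (center, eigendecompose X_c'X_c, rotate, then regress on the leading components) can be sketched in a few lines. This is an illustrative implementation under assumptions not in the notes: the data are simulated with highly correlated columns, and the choice of k = 2 retained components is arbitrary.

```python
import numpy as np

# Simulate highly correlated predictors, as in the corn grain example.
rng = np.random.default_rng(0)
n, p, k = 50, 6, 2                      # k = number of components retained
base = rng.normal(size=(n, 1))
X = base + 0.05 * rng.normal(size=(n, p))   # columns nearly collinear
y = 3.0 * base[:, 0] + 0.1 * rng.normal(size=n)

# 1. Center the X variables: X_ci = X_i - mu_i.
Xc = X - X.mean(axis=0)

# 2. Spectral decomposition of X_c'X_c = U D U'.
evals, U = np.linalg.eigh(Xc.T @ Xc)
order = np.argsort(evals)[::-1]         # sort eigenvalues descending
evals, U = evals[order], U[:, order]

# 3. Rigid rotation: X* = X_c U. Columns of X* are orthogonal,
#    since X*'X* = U'(X_c'X_c)U = D is diagonal.
Xstar = Xc @ U

# 4. Regress centered y on the first k principal components; the
#    orthogonality makes the OLS coefficients simple ratios.
yc = y - y.mean()
gamma = (Xstar[:, :k].T @ yc) / evals[:k]
yhat = Xstar[:, :k] @ gamma + y.mean()
```

The collinearity that makes X'X nearly singular is concentrated into the trailing components, so regressing on the first few components sidesteps it while keeping most of the predictive information.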
This note was uploaded on 02/11/2012 for the course STAT 511 taught by Professor Staff during the Spring '08 term at Iowa State.