# Methods for large P, small N problems: Regression


- Regression has (at least) three major purposes:
  1. Estimate coefficients in a pre-specified model
  2. Discover an appropriate model
  3. Predict values for new observations
- Regression includes classification, because classification can be viewed as a logistic regression.
- Up to now (500, 511), we have worked with data with a small to moderate number of observations and a small number of potentially important variables.
- What if there are many potential variables? E.g., 10,000 variables and 100 observations: p >> n.
- Linear regression with all 10,000 variables will not work (X'X singular).
- Forward stepwise or all-subsets regression will not work well when predictors are highly correlated.

© 2011 Dept. Statistics (Iowa State University), Stat 511 section 29
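The singularity problem is easy to demonstrate with simulated data. This is a minimal sketch (not from the notes): a scaled-down n = 20, p = 200 stands in for the slide's 100 observations and 10,000 variables, but the rank argument is identical.

```python
import numpy as np

# Hypothetical small-scale illustration: n = 20 observations, p = 200 variables
rng = np.random.default_rng(0)
n, p = 20, 200
X = rng.standard_normal((n, p))

# X'X is p x p, but its rank is at most n < p, so it is singular
# and ordinary least squares has no unique solution.
XtX = X.T @ X
rank_XtX = np.linalg.matrix_rank(XtX)
print(rank_XtX)  # 20: far below p = 200, so X'X cannot be inverted
```

Any p > n design hits the same wall: the p x p matrix X'X can never have rank greater than n, which is why the methods below either reduce dimension or regularize.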

- Often prediction is the most important goal, but we need purposes 1) and 2) to do prediction.
- These practical methods are sometimes called data mining.
- Huge area of current research: Statistics, Computer Science.
- Large number of potential methods. We look very quickly at four:
  1. Principal Components Regression
  2. Partial Least Squares Regression
  3. Ridge Regression
  4. Lasso
- Support Vector Machines, Neural Networks, and elastic nets are other popular (or upcoming) techniques that we will not talk about.
- Hastie, Tibshirani, and Friedman, 2009, *The Elements of Statistical Learning*, is my primary source for this material. It also discusses SVMs, NNs, and ENs.
## Examples

- Arcene data:
  - Use abundance of proteins in blood to predict whether an individual has cancer (ovarian or prostate).
  - Training data: 10,000 variables, 100 subjects (44 with cancer, 56 without).
  - Validation data: 100 subjects, used to assess performance of the model on new data (out-of-sample prediction).
- Corn grain data:
  - Use the near-infrared absorption spectrum to predict % oil in corn grain.
  - Data from the ISU grain quality lab, provided by Dr. Charlie Hurburgh.
  - Training data 1: 20 samples from 2008.
  - Training data 2: 337 samples from 2005–2008.
  - Validation data: 126 samples from 2009 and 2010.
  - 38 variables, all highly correlated: pairwise correlations 0.79–0.9999.

## Principal Components Regression

- Useful for:
  - Reducing the number of X variables
  - Eliminating collinearity
- Example:
  - Corn grain data, training set 2
  - 337 samples, 38 X variables
  - X variables highly correlated: large off-diagonal elements of X'X
- Remember the spectral (aka eigen) decomposition (Koehler notes, p. 87): A = UDU', where U'U = I.
- Center all X variables: X_ci = X_i − μ_i.
- Apply the decomposition to X_c'X_c = UDU'.
- Compute new X variables: X* = X_c U.
- This is a rigid rotation around (0, 0).
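The steps above can be sketched in a few lines of NumPy. This uses simulated correlated data; all sizes (n, p, k) and variable names are illustrative, not the actual corn spectra.

```python
import numpy as np

# Simulated stand-in for correlated spectra: n samples, p variables, keep k components
rng = np.random.default_rng(1)
n, p, k = 50, 8, 3
A = rng.standard_normal((p, p))
X = rng.standard_normal((n, p)) @ A            # mixing induces correlated columns
y = X @ rng.standard_normal(p) + 0.1 * rng.standard_normal(n)

Xc = X - X.mean(axis=0)                        # center all X variables
D, U = np.linalg.eigh(Xc.T @ Xc)               # spectral decomposition: Xc'Xc = U D U', U'U = I
order = np.argsort(D)[::-1]                    # eigh returns eigenvalues in ascending order
U, D = U[:, order], D[order]

Xstar = Xc @ U                                 # rigid rotation; new columns are uncorrelated
Z = Xstar[:, :k]                               # leading k principal components
beta, *_ = np.linalg.lstsq(Z, y - y.mean(), rcond=None)  # regress y on them
```

Because Xstar'Xstar = D is diagonal, the rotated variables are orthogonal, so the regression on the leading k components has no collinearity problem even when the original X'X nearly is singular.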
