Stat 511, Section 29: Methods for large P, small N problems (Dept. of Statistics, Iowa State University, 2011)

- Regression has (at least) three major purposes:
  1. Estimate coefficients in a pre-specified model
  2. Discover an appropriate model
  3. Predict values for new observations
- Regression includes classification, because classification can be viewed as logistic regression.
- Up to now (Stat 500, 511), we have worked with data sets with a small to moderate number of observations and a small number of potentially important variables.
- What if there are many potential variables? E.g., 10,000 variables and 100 observations, so p >> n.
- Linear regression with all 10,000 variables will not work: X'X is singular (see the sketch after this list).
- Forward stepwise or all-subsets regression will not work well when predictors are highly correlated.
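The singularity claim is easy to check numerically. Below is a minimal numpy sketch (my own illustration, not from the notes) using scaled-down dimensions in place of n = 100, p = 10,000: with more columns than rows, X'X has rank at most n < p, so the normal equations (X'X)b = X'y have no unique solution.

```python
# Why OLS fails when p >> n: X'X is p x p but has rank at most n.
import numpy as np

rng = np.random.default_rng(0)
n, p = 10, 50                    # scaled-down stand-ins for n = 100, p = 10,000
X = rng.normal(size=(n, p))

XtX = X.T @ X                                      # p x p matrix
print("rank of X'X:", np.linalg.matrix_rank(XtX))  # at most n = 10, far below p = 50
print("condition number:", np.linalg.cond(XtX))    # effectively infinite: singular
```
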
- Often prediction is the most important goal, but purposes 1) and 2) are needed in order to predict well.
- These practical methods are sometimes called data mining.
- This is a huge area of current research, in both Statistics and Computer Science, with a large number of potential methods.
- We look very quickly at four methods:
  1. Principal Components Regression
  2. Partial Least Squares Regression
  3. Ridge Regression (see the sketch after this list)
  4. Lasso
- Support Vector Machines, Neural Networks, and elastic nets are other popular (or up-and-coming) techniques that we will not talk about.
- Hastie, Tibshirani, and Friedman (2009), The Elements of Statistical Learning, is my primary source for this material; it also discusses SVMs, NNs, and ENs.
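Of these four, ridge regression repairs the singular X'X problem from the previous slide most directly: it replaces X'X with X'X + lam*I, which is positive definite (hence invertible) for any lam > 0. The sketch below is my own minimal numpy illustration, not code from the course; the penalty lam = 1.0 and the simulated data are arbitrary choices.

```python
# Minimal ridge regression sketch: beta = (X'X + lam*I)^{-1} X'y.
# For lam > 0, X'X + lam*I is full rank even when p > n.
import numpy as np

def ridge(X, y, lam):
    """Ridge coefficients for (ideally centered) X and y; lam > 0 is the penalty."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

rng = np.random.default_rng(1)
n, p = 10, 50                                  # p >> n, as in the Arcene setting
X = rng.normal(size=(n, p))
y = X[:, 0] + rng.normal(scale=0.1, size=n)    # signal only in the first column
print(ridge(X, y, lam=1.0)[:3])                # solvable despite p > n
```
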
Examples

- Arcene data:
  - Use the abundance of proteins in blood to predict whether an individual has cancer (ovarian or prostate).
  - Training data: 10,000 variables, 100 subjects (44 with cancer, 56 without).
  - Validation data: 100 subjects, used to assess the performance of the model on new data (out-of-sample prediction).
- Corn grain data:
  - Use the near-infrared absorption spectrum to predict % oil in corn grain.
  - Data from the ISU grain quality lab, provided by Dr. Charlie Hurburgh.
  - Training data 1: 20 samples from 2008.
  - Training data 2: 337 samples from 2005-2008.
  - Validation data: 126 samples from 2009 and 2010.
  - 38 variables, all highly correlated: pairwise correlations from 0.79 to 0.9999.
Principal Components Regression

- Useful for:
  - reducing the number of X variables
  - eliminating collinearity
- Example: corn grain data, training set 2: 337 samples, 38 X variables.
- The X variables are highly correlated: large off-diagonal elements of X'X.
- Remember the spectral (a.k.a. eigen) decomposition (Koehler notes, p. 87): A = U D U', where U'U = I.
- Center all X variables: X_c,i = X_i - mu_i.
- Apply the decomposition to X_c' X_c = U D U'.
- Compute new X variables: X* = X_c U.
- This is a rigid rotation around the origin. (A numpy sketch of these steps follows below.)
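To make the recipe concrete, here is a short numpy sketch of exactly these steps (my own illustration, not course code): center X, eigendecompose X_c'X_c, rotate to X* = X_c U, then regress centered y on the first k principal-component columns. The data are simulated, and k = 5 is an arbitrary choice; in practice k would be chosen by cross-validation.

```python
# Principal components regression following the slide's recipe.
import numpy as np

def pcr(X, y, k):
    """Regress y on the first k principal components of X."""
    Xc = X - X.mean(axis=0)              # center: X_c = X - mu
    yc = y - y.mean()                    # center y as well (no intercept below)
    D, U = np.linalg.eigh(Xc.T @ Xc)     # X_c'X_c = U D U', with U'U = I
    order = np.argsort(D)[::-1]          # eigh returns eigenvalues in ascending order
    U = U[:, order]
    Xstar = Xc @ U                       # rigid rotation: X* = X_c U
    Z = Xstar[:, :k]                     # keep the k largest-variance components
    gamma = np.linalg.solve(Z.T @ Z, Z.T @ yc)   # OLS on the components
    return U[:, :k] @ gamma              # coefficients back on the centered-X scale

rng = np.random.default_rng(2)
n, p = 337, 38                           # sizes matching training set 2
X = rng.normal(size=(n, p))
y = X @ rng.normal(size=p) + rng.normal(size=n)
print(pcr(X, y, k=5)[:3])
```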