# lec25_slides.pdf - Shrinkage, May 1, 2019

Shrinkage (May 1, 2019)
## Large p problems

More and more statistical datasets have $p$ large, sometimes much larger than $n$.

• What variations in the genome are associated with disease?
## p > n

• If we regress our response on all the variables, what is the problem? There is no unique solution to the OLS problem if $p > n$.
• Even if $p < n$, we will get a lot of variability in our estimates of $\hat{\beta}$ if both $p$ and $n$ are large. Model selection can help, i.e. reduce the number of variables, but the methods we've discussed so far are not very compelling in the face of hundreds of variables, since the number of candidate submodels, $2^p$, can be very large.
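The non-uniqueness claim can be checked numerically. The following is a minimal sketch, assuming simulated Gaussian data and NumPy; the sizes `n = 10`, `p = 25` and all variable names are illustrative choices, not taken from the slides. It builds two different coefficient vectors that both fit the response exactly, and counts the candidate submodels.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 10, 25                      # illustrative sizes with p > n
X = rng.normal(size=(n, p))        # simulated design matrix
y = rng.normal(size=n)             # simulated response

# lstsq returns the minimum-norm least squares solution when X is rank deficient.
beta_min_norm, *_ = np.linalg.lstsq(X, y, rcond=None)

# The last rows of Vt (full SVD) span the null space of X, because rank(X) = n < p.
_, _, Vt = np.linalg.svd(X)
null_vec = Vt[-1]

# Adding any null-space vector leaves the fitted values unchanged.
beta_other = beta_min_norm + 5.0 * null_vec

rank = np.linalg.matrix_rank(X)
same_fit = np.allclose(X @ beta_min_norm, X @ beta_other)

# Number of candidate submodels for all-subsets selection.
n_submodels = 2 ** p

print("rank of X:", rank, "out of p =", p)
print("same fitted values:", same_fit)
print("candidate submodels:", n_submodels)
```

Both vectors reproduce `y` exactly, so the data alone cannot choose between them.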
## Multiple Testing

• With a large number of tests, one for each variable, we are likely to get a great number of false positives. If we have a hundred variables and test each at the 0.05 level, we get, on average, 5 significant variables even if there is absolutely no relationship between the variables and the response, and that number grows with the number of variables.
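The "5 out of 100" figure is just the level times the number of tests. A small sketch of that arithmetic, with a simulation as a check, follows. It assumes a 0.05 level and independent tests, and it draws p-values directly from a Uniform(0, 1) distribution, which is their distribution when every null hypothesis is true; the names and the number of simulations are illustrative.

```python
import numpy as np

alpha = 0.05   # assumed per-test level
m = 100        # number of variables, one test each

# Expected number of false positives when no variable is related to the response.
expected_false_positives = alpha * m

# Chance of at least one false positive, assuming the tests are independent.
prob_at_least_one = 1 - (1 - alpha) ** m

# Simulation: under the null, each p-value is Uniform(0, 1).
rng = np.random.default_rng(1)
n_sims = 10_000
pvals = rng.uniform(size=(n_sims, m))
false_positives = (pvals < alpha).sum(axis=1)
avg_false_positives = false_positives.mean()

print("expected false positives:", expected_false_positives)
print("simulated average:", avg_false_positives)
print("P(at least one false positive):", prob_at_least_one)
```

Raising `m` to a few thousand shows how quickly the count of spurious "significant" variables grows.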
## Example: Marginal Testing

• When there are a large number of variables, some people focus on individual relationships rather than a joint analysis. This is very common in genomics, for example. In this approach, they run a separate regression on each variable and test each variable separately.
• This can be done even if $p \gg n$. In this setting the joint relationship is less important than finding single variables of interest (e.g. genes). The number of variables could be in the thousands, while the number of observations is in the tens. Often in this case we control the false discovery rate (the proportion of discoveries that are false positives) rather than the absolute number of false positives. Otherwise we would never find anything.
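The slides do not name a procedure for controlling the proportion of false discoveries. One common choice is the Benjamini-Hochberg step-up procedure, sketched below as an assumption rather than as the method used in the lecture. The function name, the target level `q`, and the five p-values are hypothetical; in practice the p-values would come from the separate per-variable regressions.

```python
import numpy as np

def benjamini_hochberg(pvals, q=0.05):
    """Return a boolean mask of rejected hypotheses (Benjamini-Hochberg step-up)."""
    p = np.asarray(pvals, dtype=float)
    m = p.size
    order = np.argsort(p)                      # indices that sort the p-values
    thresholds = q * np.arange(1, m + 1) / m   # k * q / m for k = 1..m
    below = p[order] <= thresholds
    reject = np.zeros(m, dtype=bool)
    if below.any():
        k = np.max(np.nonzero(below)[0])       # largest k with p_(k) <= k * q / m
        reject[order[: k + 1]] = True          # reject all hypotheses up to that rank
    return reject

# Hypothetical p-values from five marginal tests.
pvals = np.array([0.9, 0.001, 0.2, 0.025, 0.012])

bh_reject = benjamini_hochberg(pvals, q=0.05)

# For comparison: Bonferroni, which controls the chance of any false positive.
bonferroni_reject = pvals <= 0.05 / pvals.size

print("BH rejections:        ", bh_reject)
print("Bonferroni rejections:", bonferroni_reject)
```

On these inputs the false-discovery-rate approach keeps more variables than the stricter Bonferroni correction, which is the trade-off the slide alludes to.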
## Shrinkage Methods

• Another approach to model selection/improvement in the face of a large number of variables is to use "shrinkage" methods, which adjust $\hat{\beta}$ by making some of its entries closer to zero.
• First, let's get all the variables on the same scale by subtracting off the mean and dividing by the standard deviation (per variable). Similarly, let's center the response (which means we can omit the intercept $\beta_0$).
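The preprocessing step can be sketched as follows, assuming NumPy and simulated data; the helper name `standardize`, the sample standard deviation (`ddof=1`), and the data-generating values are illustrative assumptions. The last lines check the claim about the intercept: after centering, fitting an intercept anyway returns a value that is numerically zero.

```python
import numpy as np

def standardize(X, y):
    """Center and scale each column of X; center y. Hypothetical helper."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)  # per-variable mean 0, sd 1
    y_centered = y - y.mean()
    return Z, y_centered

# Simulated variables on very different scales.
rng = np.random.default_rng(2)
n, p = 50, 3
X = rng.normal(loc=[10.0, -3.0, 100.0], scale=[1.0, 20.0, 0.5], size=(n, p))
y = 4.0 + X @ np.array([1.5, 0.1, -2.0]) + rng.normal(size=n)

Z, y_centered = standardize(X, y)

# Fit OLS with an explicit intercept column to see that it is not needed.
design = np.column_stack([np.ones(n), Z])
coef, *_ = np.linalg.lstsq(design, y_centered, rcond=None)
intercept = coef[0]

print("column means:", Z.mean(axis=0))
print("column sds:  ", Z.std(axis=0, ddof=1))
print("fitted intercept:", intercept)
```

Putting the variables on a common scale matters for shrinkage because the amount by which a coefficient is pulled toward zero would otherwise depend on the units of its variable.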