18
High-Dimensional Problems: p ≫ N

18.1 When p is Much Bigger than N
In this chapter we discuss prediction problems in which the number of features p is much larger than the number of observations N, often written p ≫ N. Such problems have become of increasing importance, especially in genomics and other areas of computational biology. We will see that high variance and overfitting are a major concern in this setting. As a result, simple, highly regularized approaches often become the methods of choice. The first part of the chapter focuses on prediction in both the classification and regression settings, while the second part discusses the more basic problem of feature selection and assessment.

To get us started, Figure 18.1 summarizes a small simulation study that demonstrates the "less fitting is better" principle that applies when p ≫ N.
For each of N = 100 samples, we generated p standard Gaussian features X with pairwise correlation 0.2. The outcome Y was generated according to a linear model

    Y = \sum_{j=1}^{p} X_j \beta_j + \sigma \varepsilon                (18.1)
where ε was generated from a standard Gaussian distribution. For each dataset, the coefficients β_j were also generated from a standard Gaussian distribution. We investigated three cases: p = 20, 100, and 1000. The standard deviation σ was chosen in each case so that the signal-to-noise ratio Var[E(Y|X)]/Var(ε) equaled 2. As a result, the number of significant
© Springer Science+Business Media, LLC 2009. T. Hastie et al., The Elements of Statistical Learning, Second Edition. DOI: 10.1007/b94608_18
[Figure: three boxplot panels — 20 features, 100 features, 1000 features — plotting relative test error against effective degrees of freedom.]

FIGURE 18.1. Test-error results for simulation experiments. Shown are boxplots of the relative test errors over 100 simulations, for three different values of p, the number of features. The relative error is the test error divided by the Bayes error, σ². From left to right, results are shown for ridge regression with three different values of the regularization parameter λ: 0.001, 100 and 1000. The (average) effective degrees of freedom in the fit is indicated below each plot.
univariate regression coefficients¹ was 9, 33 and 331, respectively, averaged over the 100 simulation runs. The p = 1000 case is designed to mimic the kind of data that we might see in a high-dimensional genomic or proteomic dataset, for example.
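This data-generating process can be sketched in a few lines of NumPy. The shared-factor construction below (which induces a pairwise correlation of ρ between every pair of columns) and all variable names are my own; the book does not specify how the correlated features were drawn.

```python
import numpy as np

rng = np.random.default_rng(0)
N, p, rho, snr = 100, 100, 0.2, 2.0

# Equicorrelated standard Gaussian features: a single shared factor z0
# gives Var(X_j) = rho + (1 - rho) = 1 and Cov(X_j, X_k) = rho.
z0 = rng.standard_normal((N, 1))
X = np.sqrt(rho) * z0 + np.sqrt(1 - rho) * rng.standard_normal((N, p))

beta = rng.standard_normal(p)   # coefficients, also standard Gaussian
signal = X @ beta               # E(Y | X)

# Choose sigma so that Var[E(Y|X)] / Var(eps) equals the target SNR of 2.
sigma = np.sqrt(signal.var() / snr)
y = signal + sigma * rng.standard_normal(N)
```

The same construction covers all three cases in the text by setting p to 20, 100, or 1000.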
We fit a ridge regression to the data, with three different values for the regularization parameter λ: 0.001, 100, and 1000. When λ = 0.001, this is nearly the same as least squares regression, with a little regularization just to ensure that the problem is nonsingular when p > N. Figure 18.1 shows boxplots of the relative test error achieved by the different estimators in each scenario. The corresponding average degrees of freedom used in each ridge-regression fit is indicated (computed using formula (3.50) on page 68²). The degrees of freedom is a more interpretable parameter than λ. We see that ridge regression with λ = 0.001 (20 df) wins when p = 20; λ = 100 (35 df) wins when p = 100; and λ = 1000 (43 df) wins when p = 1000.
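A minimal sketch of this fit follows. The synthetic uncorrelated data and the helper names are my own; formula (3.50) expresses the effective degrees of freedom of a ridge fit through the singular values d_j of X, namely df(λ) = Σ_j d_j²/(d_j² + λ).

```python
import numpy as np

rng = np.random.default_rng(1)
N, p = 100, 100
X = rng.standard_normal((N, p))
y = X @ rng.standard_normal(p) + rng.standard_normal(N)

d = np.linalg.svd(X, compute_uv=False)   # singular values of X

def ridge_df(lam):
    # Effective degrees of freedom, formula (3.50):
    # df(lambda) = sum_j d_j^2 / (d_j^2 + lambda)
    return np.sum(d**2 / (d**2 + lam))

def ridge_coef(lam):
    # Ridge solution: beta_hat = (X'X + lambda I)^{-1} X'y
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

for lam in (0.001, 100.0, 1000.0):
    print(f"lambda={lam:8.3f}  df={ridge_df(lam):6.1f}")
```

As λ grows, df(λ) shrinks monotonically from min(N, p) toward 0, which is why df is the more interpretable way to index the three fits.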