CS229 Lecture notes
Andrew Ng
Part VI
Regularization and model selection
Suppose we are trying to select among several different models for a learning problem. For instance, we might be using a polynomial regression model $h_\theta(x) = g(\theta_0 + \theta_1 x + \theta_2 x^2 + \cdots + \theta_k x^k)$, and wish to decide if $k$ should be 0, 1, ..., or 10. How can we automatically select a model that represents a good tradeoff between the twin evils of bias and variance¹? Alternatively, suppose we want to automatically choose the bandwidth parameter $\tau$ for locally weighted regression, or the parameter $C$ for our $\ell_1$-regularized SVM. How can we do that?
For the sake of concreteness, in these notes we assume we have some finite set of models $\mathcal{M} = \{M_1, \ldots, M_d\}$ that we’re trying to select among. For instance, in our first example above, the model $M_i$ would be an $i$-th order polynomial regression model. (The generalization to infinite $\mathcal{M}$ is not hard.²) Alternatively, if we are trying to decide between using an SVM, a neural network or logistic regression, then $\mathcal{M}$ may contain these models.
¹ Given that we said in the previous set of notes that bias and variance are two very different beasts, some readers may be wondering if we should be calling them “twin” evils here. Perhaps it’d be better to think of them as non-identical twins. The phrase “the fraternal twin evils of bias and variance” doesn’t have the same ring to it, though.
² If we are trying to choose from an infinite set of models, say corresponding to the possible values of the bandwidth $\tau \in \mathbb{R}^+$, we may discretize $\tau$ and consider only a finite number of possible values for it. More generally, most of the algorithms described here can be viewed as performing optimization search in the space of models, and we can perform this search over infinite model classes as well.
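As a rough illustration of what a finite model set might look like in code, the sketch below lists the polynomial degrees from the first example and a discretized grid of bandwidths. The helper name `discretize_bandwidth`, the dictionary layout, the log spacing, and the range 0.01 to 10 are all assumptions made for illustration; the notes do not prescribe any of them.

```python
import math

def discretize_bandwidth(tau_min, tau_max, num_values):
    """Return a finite, log-spaced list of bandwidths in [tau_min, tau_max].

    Hypothetical helper: log spacing is one common choice for a positive
    scale parameter, not something the notes specify.
    """
    if not (0 < tau_min < tau_max) or num_values < 2:
        raise ValueError("need 0 < tau_min < tau_max and num_values >= 2")
    log_min, log_max = math.log(tau_min), math.log(tau_max)
    step = (log_max - log_min) / (num_values - 1)
    return [math.exp(log_min + i * step) for i in range(num_values)]

# Finite model set for the polynomial example: one candidate per degree k = 0..10.
poly_models = [{"family": "polynomial", "degree": k} for k in range(11)]

# Finite model set for locally weighted regression: one candidate per grid value of tau.
taus = discretize_bandwidth(0.01, 10.0, 7)  # illustrative range, an assumption
lwr_models = [{"family": "lwr", "tau": tau} for tau in taus]
```

Either list can then play the role of $\mathcal{M}$ in a model selection procedure.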
1 Cross validation
Let’s suppose we are, as usual, given a training set $S$. Given what we know about empirical risk minimization, here’s what might initially seem like an algorithm, resulting from using empirical risk minimization for model selection:
1. Train each model $M_i$ on $S$, to get some hypothesis $h_i$.
2. Pick the hypothesis with the smallest training error.
This algorithm does not work. Consider choosing the order of a polynomial. The higher the order of the polynomial, the better it will fit the training set $S$, and thus the lower the training error. Hence, this method will always select a high-variance, high-degree polynomial model, which we saw previously is often a poor choice.
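The failure described above can be seen in a small numerical sketch. It assumes plain least-squares polynomial regression (that is, $g$ taken to be the identity) and synthetic data from a noisy linear function; the function name `training_errors` and every constant are illustrative assumptions.

```python
import numpy as np

def training_errors(x, y, max_degree):
    """Fit polynomials of degree 0..max_degree by least squares on (x, y)
    and return the training mean squared error of each fit."""
    errors = []
    for k in range(max_degree + 1):
        coeffs = np.polyfit(x, y, deg=k)   # least-squares fit of degree k
        preds = np.polyval(coeffs, x)      # predictions on the same training set
        errors.append(float(np.mean((preds - y) ** 2)))
    return errors

# Synthetic training set S (an assumption): the true relationship is linear.
rng = np.random.default_rng(0)
x = np.linspace(-1.0, 1.0, 30)
y = 1.0 + 2.0 * x + rng.normal(scale=0.3, size=x.shape)

errs = training_errors(x, y, max_degree=6)

# Selecting by training error alone: the lower-degree fits are special cases
# of the higher-degree ones, so the error cannot increase with the degree and
# this rule tends to pick the largest degree available.
naive_choice = int(np.argmin(errs))
```

Even though the data were generated from a degree-1 model, the training error keeps shrinking as the degree grows, so it gives no signal about when to stop.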
Here’s an algorithm that works better. In hold-out cross validation (also called simple cross validation), we do the following:
1. Randomly split $S$ into $S_{\text{train}}$ (say, 70% of the data) and $S_{\text{cv}}$ (the remaining 30%). Here, $S_{\text{cv}}$ is called the hold-out cross validation set.
2. Train each model
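A minimal sketch of hold-out cross validation for the polynomial example follows. Because the remaining steps are cut off in this copy, the sketch assumes the standard continuation: each model is trained on $S_{\text{train}}$ only, and the hypothesis with the smallest error on $S_{\text{cv}}$ is selected. The function name `holdout_select`, the squared-error loss, and the synthetic data are assumptions made for illustration.

```python
import numpy as np

def holdout_select(x, y, degrees, train_frac=0.7, seed=0):
    """Pick a polynomial degree by hold-out cross validation.

    Returns (best_degree, cv_errors), where cv_errors maps each degree to
    its mean squared error on the held-out set.
    """
    rng = np.random.default_rng(seed)
    perm = rng.permutation(len(x))              # random split of S
    n_train = int(round(train_frac * len(x)))
    train_idx, cv_idx = perm[:n_train], perm[n_train:]

    cv_errors = {}
    for k in degrees:
        # Assumed step: train on S_train only.
        coeffs = np.polyfit(x[train_idx], y[train_idx], deg=k)
        # Assumed step: measure error on S_cv, which was not used for fitting.
        preds = np.polyval(coeffs, x[cv_idx])
        cv_errors[k] = float(np.mean((preds - y[cv_idx]) ** 2))

    best_degree = min(cv_errors, key=cv_errors.get)
    return best_degree, cv_errors

# Synthetic data (an assumption): a noisy quadratic.
rng = np.random.default_rng(1)
x = rng.uniform(-1.0, 1.0, size=100)
y = 0.5 - x + 2.0 * x ** 2 + rng.normal(scale=0.2, size=x.shape)

best_degree, cv_errors = holdout_select(x, y, degrees=range(0, 11))
```

Unlike the training error, the held-out error is measured on examples the fitted hypothesis never saw, so a higher degree does not automatically look better.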
Fall '09