3 Linear Methods for Regression

3.1 Introduction
A linear regression model assumes that the regression function E(Y | X) is
linear in the inputs X_1, . . . , X_p. Linear models were largely developed in
the precomputer age of statistics, but even in today’s computer era there
are still good reasons to study and use them. They are simple and often
provide an adequate and interpretable description of how the inputs affect
the output. For prediction purposes they can sometimes outperform fancier
nonlinear models, especially in situations with small numbers of training
cases, low signal-to-noise ratio or sparse data. Finally, linear methods can be
applied to transformations of the inputs and this considerably expands their
scope. These generalizations are sometimes called basis-function methods,
and are discussed in Chapter 5.
In this chapter we describe linear methods for regression, while in the
next chapter we discuss linear methods for classification. On some topics we
go into considerable detail, as it is our firm belief that an understanding
of linear methods is essential for understanding nonlinear ones. In fact,
many nonlinear techniques are direct generalizations of the linear methods
discussed here.
T. Hastie et al., The Elements of Statistical Learning, Second Edition.
© Springer Science+Business Media, LLC 2009. DOI: 10.1007/b94608_3
3.2 Linear Regression Models and Least Squares
As introduced in Chapter 2, we have an input vector X^T = (X_1, X_2, . . . , X_p),
and want to predict a real-valued output Y. The linear regression model
has the form
f(X) = \beta_0 + \sum_{j=1}^{p} X_j \beta_j.    (3.1)
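As a concrete illustration of (3.1) (our own sketch, not from the text), the prediction function can be evaluated as an intercept plus an inner product; the function name `f` and the example numbers are assumptions for demonstration:

```python
import numpy as np

def f(x, beta):
    """Evaluate the linear model f(x) = beta_0 + sum_j x_j * beta_j.

    x    : array of p input values
    beta : array of p + 1 coefficients; beta[0] is the intercept beta_0
    """
    return beta[0] + np.dot(x, beta[1:])

# Hypothetical example with p = 2: beta_0 = 1, beta_1 = 2, beta_2 = -1
beta = np.array([1.0, 2.0, -1.0])
x = np.array([3.0, 4.0])
print(f(x, beta))  # 1 + 2*3 - 1*4 = 3.0
```

The same function applies unchanged whether the X_j are raw inputs, transformations, or dummy codes, since the model is linear in the coefficients.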
The linear model either assumes that the regression function E(Y | X) is
linear, or that the linear model is a reasonable approximation. Here the
β_j's are unknown parameters or coefficients, and the variables X_j can come
from different sources:
• quantitative inputs;
• transformations of quantitative inputs, such as log, square-root or square;
• basis expansions, such as X_2 = X_1^2, X_3 = X_1^3, leading to a polynomial representation;
• numeric or “dummy” coding of the levels of qualitative inputs. For example, if G is a five-level factor input, we might create X_j, j = 1, . . . , 5, such that X_j = I(G = j). Together this group of X_j represents the effect of G by a set of level-dependent constants, since in \sum_{j=1}^{5} X_j β_j, one of the X_j's is one, and the others are zero.
• interactions between variables, for example, X_3 = X_1 · X_2.
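The dummy-coding item above can be sketched in code. The helper below is our own illustration (the name `dummy_code` and the coefficient values are assumptions), showing how the five indicators pick out one level-dependent constant:

```python
import numpy as np

def dummy_code(g, levels=5):
    """Encode a factor g in {1, ..., levels} as indicators X_j = I(G = j)."""
    return np.array([1.0 if g == j else 0.0 for j in range(1, levels + 1)])

# Hypothetical coefficients beta_1, ..., beta_5 for the five levels of G.
beta = np.array([2.0, 4.0, 6.0, 8.0, 10.0])

x = dummy_code(3)   # [0, 0, 1, 0, 0]: exactly one X_j is one
print(x @ beta)     # 6.0 -- sum_j X_j beta_j reduces to beta_3
```

Because exactly one indicator is nonzero, the sum \sum_j X_j β_j collapses to the constant for the observed level, as the text describes.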
No matter the source of the X_j, the model is linear in the parameters.
Typically we have a set of training data (x_1, y_1), . . . , (x_N, y_N) from which
to estimate the parameters β. Each x_i = (x_{i1}, x_{i2}, . . . , x_{ip})^T is a vector
of feature measurements for the ith case. The most popular estimation
method is least squares, in which we pick the coefficients β = (β_0, β_1, . . . , β_p)^T
to minimize the residual sum of squares
RSS(β) = \sum_{i=1}^{N} (y_i − f(x_i))^2
       = \sum_{i=1}^{N} \Big( y_i − β_0 − \sum_{j=1}^{p} x_{ij} β_j \Big)^2.    (3.2)
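A minimal sketch of least-squares fitting (our own, not from the text): on simulated data with assumed true coefficients, we augment the design matrix with a column of ones so the intercept β_0 is estimated jointly, solve with `numpy.linalg.lstsq`, and check that the solution attains a lower RSS (3.2) than the generating coefficients:

```python
import numpy as np

rng = np.random.default_rng(0)
N, p = 50, 3
X = rng.normal(size=(N, p))

# Hypothetical true coefficients, beta_0 first, plus a little noise.
beta_true = np.array([1.0, 2.0, -1.0, 0.5])
y = beta_true[0] + X @ beta_true[1:] + 0.1 * rng.normal(size=N)

# Column of ones turns beta_0 into an ordinary coefficient.
X1 = np.column_stack([np.ones(N), X])
beta_hat, *_ = np.linalg.lstsq(X1, y, rcond=None)

def rss(beta):
    """Residual sum of squares RSS(beta) from (3.2)."""
    return np.sum((y - X1 @ beta) ** 2)

# The least-squares solution minimizes RSS over all beta:
print(rss(beta_hat) <= rss(beta_true))  # True
```

With only mild noise, `beta_hat` also lands close to `beta_true`; the point of the comparison is that no coefficient vector, including the one that generated the data, can achieve a smaller RSS than the least-squares fit.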
From a statistical point of view, this criterion is reasonable if the training
observations (x_i, y_i) represent independent random draws from their population. Even if the x_i