3. Linear Methods for Regression
3.2 Linear Regression Models and Least Squares
As introduced in Chapter 2, we have an input vector $X^T = (X_1, X_2, \ldots, X_p)$, and want to predict a real-valued output $Y$. The linear regression model has the form
\[
f(X) = \beta_0 + \sum_{j=1}^{p} X_j \beta_j. \tag{3.1}
\]
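As a minimal numerical sketch of evaluating model (3.1) — the coefficients and input vector below are hypothetical, chosen only for illustration:

```python
import numpy as np

def f(x, beta0, beta):
    # Linear model (3.1): f(X) = beta_0 + sum_j X_j * beta_j
    return beta0 + np.dot(x, beta)

# hypothetical coefficients and input (p = 3)
beta0 = 1.0
beta = np.array([2.0, 0.5, -1.0])
x = np.array([1.0, 2.0, 3.0])
print(f(x, beta0, beta))  # 1 + 2*1 + 0.5*2 - 1*3 = 1.0
```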
The linear model either assumes that the regression function $\mathrm{E}(Y \mid X)$ is linear, or that the linear model is a reasonable approximation. Here the $\beta_j$'s are unknown parameters or coefficients, and the variables $X_j$ can come from different sources:
• quantitative inputs;
• transformations of quantitative inputs, such as log, square-root or square;
• basis expansions, such as $X_2 = X_1^2$, $X_3 = X_1^3$, leading to a polynomial representation;
• numeric or “dummy” coding of the levels of qualitative inputs. For example, if $G$ is a five-level factor input, we might create $X_j$, $j = 1, \ldots, 5$, such that $X_j = I(G = j)$. Together this group of $X_j$ represents the effect of $G$ by a set of level-dependent constants, since in $\sum_{j=1}^{5} X_j \beta_j$, one of the $X_j$'s is one, and the others are zero;
• interactions between variables, for example, $X_3 = X_1 \cdot X_2$.
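The dummy coding described above can be sketched as follows — `dummy_code` is a hypothetical helper, not from the text, assuming the factor levels are coded $1, \ldots, 5$:

```python
import numpy as np

def dummy_code(g, levels=5):
    # X_j = I(G = j): one indicator column per level of the factor G,
    # so each row has a single 1 in the column of its level
    X = np.zeros((len(g), levels))
    for i, gi in enumerate(g):
        X[i, gi - 1] = 1.0  # levels assumed coded 1..levels
    return X

g = [2, 5, 1]           # hypothetical observations of a five-level factor
X = dummy_code(g)
print(X)
```

Each row sums to one, reflecting that exactly one $X_j$ is one and the others are zero.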
No matter the source of the $X_j$, the model is linear in the parameters.

Typically we have a set of training data $(x_1, y_1) \ldots (x_N, y_N)$ from which to estimate the parameters $\beta$. Each $x_i = (x_{i1}, x_{i2}, \ldots, x_{ip})^T$ is a vector of feature measurements for the $i$th case. The most popular estimation method is least squares, in which we pick the coefficients $\beta = (\beta_0, \beta_1, \ldots, \beta_p)^T$ to minimize the residual sum of squares
\[
\mathrm{RSS}(\beta) = \sum_{i=1}^{N} \left( y_i - f(x_i) \right)^2
= \sum_{i=1}^{N} \Big( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij} \beta_j \Big)^2. \tag{3.2}
\]
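A short sketch of criterion (3.2) and its minimization, using NumPy's least-squares solver on the intercept-augmented data matrix; the toy data below are simulated for illustration only:

```python
import numpy as np

def rss(beta, X, y):
    # RSS(beta) from (3.2); X is N x p, beta = (beta_0, beta_1, ..., beta_p)
    resid = y - (beta[0] + X @ beta[1:])
    return float(resid @ resid)

# hypothetical toy data: N = 20 cases, p = 2 inputs
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 2))
y = 1.0 + X @ np.array([2.0, -1.0]) + 0.1 * rng.normal(size=20)

# least-squares minimizer: prepend a column of ones for the intercept beta_0
X1 = np.column_stack([np.ones(len(y)), X])
beta_hat, *_ = np.linalg.lstsq(X1, y, rcond=None)
print(beta_hat, rss(beta_hat, X, y))
```

Any perturbation of `beta_hat` can only increase the residual sum of squares, which is what makes it the least-squares estimate.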
From a statistical point of view, this criterion is reasonable if the training observations $(x_i, y_i)$ represent independent random draws from their population. Even if the $x_i$