The assumptions on the errors in this model can also be written in vector form. We write $\epsilon \sim N(\mathbf{0}, \sigma^2 I)$, a multivariate normal distribution with mean vector $E(\epsilon) = \mathbf{0}$ and covariance matrix $V(\epsilon) = \sigma^2 I$. Similarly, we write $y \sim N(X\beta, \sigma^2 I)$, a multivariate normal distribution with mean vector $E(y) = X\beta$ and covariance matrix $V(y) = \sigma^2 I$.
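The distributional statement $y \sim N(X\beta, \sigma^2 I)$ can be made concrete by simulating from the model. The following is a minimal sketch, with a hypothetical design matrix and parameter values chosen purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical design matrix X: n = 200 cases, intercept column plus p = 2 regressors
n, p = 200, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
beta = np.array([1.0, 2.0, -0.5])   # hypothetical true parameter vector
sigma = 0.3

# y ~ N(X beta, sigma^2 I): mean vector X beta, independent errors with variance sigma^2
y = X @ beta + rng.normal(scale=sigma, size=n)
```

Because the errors are independent with common variance, the covariance matrix of $y$ is diagonal, which is what the notation $\sigma^2 I$ expresses.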
4.2 ESTIMATION OF THE MODEL
We now consider the estimation of the unknown parameters: the $(p+1)$ regression parameters $\beta$, and the variance of the errors $\sigma^2$. Since the $y_i \sim N(\mu_i, \sigma^2)$ with $\mu_i = \beta_0 + \beta_1 x_{i1} + \cdots + \beta_p x_{ip}$ are independent, it is straightforward to write down the joint probability density $p(y_1, \ldots, y_n \mid \beta, \sigma^2)$. Treating this, for given data $y$, as a function of the parameters leads to the likelihood function

$$L(\beta, \sigma^2 \mid y_1, \ldots, y_n) = \left(1/\sqrt{2\pi}\sigma\right)^n \exp\left[-\sum_{i=1}^{n} (y_i - \mu_i)^2 / 2\sigma^2\right] \qquad (4.8)$$
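The equivalence between maximizing the likelihood and minimizing the sum of squares can be checked numerically. The sketch below, using hypothetical simulated data and a fixed $\sigma^2$, verifies that the log of Eq. (4.8) and $-S(\beta)/2\sigma^2$ differ only by a constant that does not depend on $\beta$:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical data: y_i ~ N(mu_i, sigma^2) with mu = X beta
n = 50
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = X @ np.array([1.0, 2.0]) + rng.normal(scale=0.5, size=n)
sigma2 = 0.25

def log_likelihood(beta):
    """Log of Eq. (4.8): -(n/2) log(2 pi sigma^2) - S(beta) / (2 sigma^2)."""
    resid = y - X @ beta
    return -0.5 * n * np.log(2 * np.pi * sigma2) - resid @ resid / (2 * sigma2)

def S(beta):
    """Sum of squares S(beta) = sum_i (y_i - mu_i)^2."""
    resid = y - X @ beta
    return resid @ resid

# For fixed sigma^2, differences in log L across beta values equal the
# corresponding differences in -S(beta)/(2 sigma^2): the constant term cancels.
b1, b2 = np.array([1.0, 2.0]), np.array([0.0, 0.0])
diff_logL = log_likelihood(b1) - log_likelihood(b2)
diff_S = -(S(b1) - S(b2)) / (2 * sigma2)
print(np.isclose(diff_logL, diff_S))  # → True
```

Since only the exponent of Eq. (4.8) involves $\beta$, the $\beta$ that maximizes $L$ is exactly the $\beta$ that minimizes $S(\beta)$.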
Maximizing the likelihood function $L$ with respect to $\beta$ is equivalent to minimizing $S(\beta) = \sum_{i=1}^{n} (y_i - \mu_i)^2$ with respect to $\beta$. This is because the exponent in Eq. (4.8) is the only term containing $\beta$. The sum of squares $S(\beta)$ can be written in vector notation,

$$S(\beta) = (y - \mu)'(y - \mu) = (y - X\beta)'(y - X\beta), \quad \text{since } \mu = X\beta \qquad (4.9)$$
The minimization of $S(\beta)$ with respect to $\beta$ is known as *least squares estimation*, and for normal errors it is equivalent to maximum likelihood estimation. We determine the least squares estimates by obtaining the first derivatives of $S(\beta)$ with respect to the parameters $\beta_0, \beta_1, \ldots, \beta_p$, and by setting these $(p+1)$ derivatives equal to zero.
The appendix shows that this leads to the $(p+1)$ equations

$$X'X\hat{\beta} = X'y \qquad (4.10)$$

These equations are referred to as the *normal equations*. The matrix $X$ is assumed to have full column rank $p+1$. Hence, the $(p+1) \times (p+1)$ matrix $X'X$ is nonsingular and the solution of Eq. (4.10) is given by

$$\hat{\beta} = (X'X)^{-1}X'y \qquad (4.11)$$

The estimate $\hat{\beta}$ in Eq. (4.11) minimizes $S(\beta)$, and is known as the *least squares estimate* (LSE) of $\beta$.
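Eqs. (4.10) and (4.11) translate directly into a few lines of linear algebra. A minimal sketch with a hypothetical full-rank design matrix follows; note that solving the linear system of Eq. (4.10) is numerically preferable to forming the inverse $(X'X)^{-1}$ explicitly:

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical data with a full-column-rank design matrix (intercept + 2 regressors)
n = 100
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(scale=0.3, size=n)

# Solve the normal equations X'X beta_hat = X'y of Eq. (4.10)
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# np.linalg.lstsq minimizes S(beta) directly and should give the same answer
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.allclose(beta_hat, beta_lstsq))  # → True
```

The agreement between the two computations reflects the fact that, when $X'X$ is nonsingular, the normal equations have the unique solution given by Eq. (4.11).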
4.2.1 A GEOMETRIC INTERPRETATION OF LEAST SQUARES
The model in Eq. (4.7) can be written as

$$y = \beta_0 \mathbf{1} + \beta_1 x_1 + \cdots + \beta_p x_p + \epsilon = \mu + \epsilon \qquad (4.12)$$

where the $(n \times 1)$ vectors $y$ and $\epsilon$ are as defined before, and the $(n \times 1)$ vectors $\mathbf{1} = (1, 1, \ldots, 1)'$ and $x_j = (x_{1j}, x_{2j}, \ldots, x_{nj})'$, for $j = 1, 2, \ldots, p$, represent the columns of the matrix $X$. Thus, $X = (\mathbf{1}, x_1, \ldots, x_p)$ and $\mu = X\beta = \beta_0 \mathbf{1} + \beta_1 x_1 + \cdots + \beta_p x_p$.
The representation in Eq. (4.12) shows that the deterministic component $\mu$ is a linear combination of the vectors $\mathbf{1}, x_1, \ldots, x_p$. Let $L(\mathbf{1}, x_1, \ldots, x_p)$ be the set of all linear combinations of these vectors. If we assume that these vectors are not linearly dependent, $L(X) = L(\mathbf{1}, x_1, \ldots, x_p)$ is a subspace of $R^n$ of dimension $p+1$. Note that the assumption that $\mathbf{1}, x_1, \ldots, x_p$ are not linearly dependent is the same as saying that $X$ has rank $p+1$.
We want to explain these concepts slowly because they are essential for understanding the geometric interpretation that follows. First, note that the dimension of the regressor vectors $\mathbf{1}, x_1, \ldots, x_p$ is $n$, the number of cases. When we display the $(p+1)$ regressor vectors, we do that in $n$-dimensional Euclidean space $R^n$.