If we add an extra component, equal to 1, to each data vector $x_j$ so that now $x_j = [1, x_{j1}, \ldots, x_{jd}]^T$, for $j = 1, \ldots, N$, then we can write (5.64) as
$$f(x) = a^T x, \qquad a = [a_0, a_1, \ldots, a_d]^T, \qquad (5.65)$$
and the dimension $d$ is increased by one. We then seek a vector $a \in \mathbb{R}^d$ that minimizes
$$E(a) = \sum_{j=1}^{N} \left[ f_j - a^T x_j \right]^2. \qquad (5.66)$$
Putting the data vectors $x_j$ as the rows of an $N \times d$ ($N \geq d$) matrix $X$ and the $f_j$ as the components of a (column) vector $f$, i.e.
$$X = \begin{bmatrix} x_1^T \\ x_2^T \\ \vdots \\ x_N^T \end{bmatrix} \quad \text{and} \quad f = \begin{bmatrix} f_1 \\ f_2 \\ \vdots \\ f_N \end{bmatrix}, \qquad (5.67)$$
CHAPTER 5. LEAST SQUARES APPROXIMATION
we can write (5.66) as
$$E(a) = (f - Xa)^T (f - Xa) = \| f - Xa \|^2. \qquad (5.68)$$
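As a concrete illustration of (5.66)–(5.68), here is a minimal NumPy sketch with hypothetical data (the variable names and data are illustrative, not from the text): each data vector is augmented with a leading 1, and the objective is evaluated in its matrix form.

```python
import numpy as np

# Hypothetical data: N = 5 points in d = 2 original dimensions.
rng = np.random.default_rng(0)
pts = rng.standard_normal((5, 2))  # raw data vectors x_j (as rows)
f = rng.standard_normal(5)         # values f_j to be fit

# Augment each data vector with a leading 1, so a[0] plays the role of a_0.
X = np.hstack([np.ones((5, 1)), pts])  # N x d matrix, d = 3 after augmentation

def E(a):
    """Least squares objective in matrix form (5.68): ||f - X a||^2."""
    r = f - X @ a
    return r @ r
```

At $a = 0$ the residual is just $f$, so $E(0) = \|f\|^2$, which gives a quick sanity check of the implementation.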
The normal equations are given by the condition $\nabla_a E(a) = 0$. Since $\nabla_a E(a) = -2 X^T f + 2 X^T X a$, we get the linear system of equations
$$X^T X a = X^T f. \qquad (5.69)$$
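A minimal sketch (hypothetical data) of solving the normal equations (5.69) with NumPy, cross-checked against the library's own least squares routine; forming $X^T X$ explicitly is fine for illustration, though not what one would do in production code.

```python
import numpy as np

# Hypothetical data with N = 8 >= d = 3; a generic random matrix has
# linearly independent columns, so X^T X is nonsingular.
rng = np.random.default_rng(1)
X = rng.standard_normal((8, 3))
f = rng.standard_normal(8)

# Solve the normal equations X^T X a = X^T f of (5.69).
a_normal = np.linalg.solve(X.T @ X, X.T @ f)

# Reference solution from NumPy's SVD-based least squares solver.
a_lstsq, *_ = np.linalg.lstsq(X, f, rcond=None)
```

For well-conditioned data the two computations agree to machine precision.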
Every solution of the least squares problem is necessarily a solution of the normal equations. We will prove that the converse is also true and that the solutions have a geometric characterization.
Let $W$ be the linear space spanned by the columns of $X$; clearly $W \subseteq \mathbb{R}^N$. Then, the least squares problem is equivalent to minimizing $\| f - w \|^2$ among all vectors $w$ in $W$. There is always at least one solution, which can be obtained by projecting $f$ onto $W$, as Fig. 5.2 illustrates. First, note that if
$a \in \mathbb{R}^d$ is a solution of the normal equations (5.69) then the residual $f - Xa$ is orthogonal to $W$, because
$$X^T (f - Xa) = X^T f - X^T X a = 0 \qquad (5.70)$$
and a vector $r \in \mathbb{R}^N$ is orthogonal to $W$ if it is orthogonal to each column of $X$, i.e. if $X^T r = 0$.
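The orthogonality property (5.70) is easy to spot-check numerically; a small sketch with hypothetical data:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.standard_normal((10, 4))
f = rng.standard_normal(10)

# A least squares solution a*, computed with NumPy's solver.
a_star, *_ = np.linalg.lstsq(X, f, rcond=None)

# The residual r = f - X a* satisfies X^T r = 0 up to rounding, i.e.
# r is orthogonal to every column of X and hence to W.
r = f - X @ a_star
```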
Let $a^*$ be a solution of the normal equations, let $r = f - X a^*$, and for arbitrary $a \in \mathbb{R}^d$, let $s = X a - X a^*$. Then, we have
$$\| f - X a \|^2 = \| f - X a^* - (X a - X a^*) \|^2 = \| r - s \|^2. \qquad (5.71)$$
But $r$ and $s$ are orthogonal, since $s = X(a - a^*)$ lies in $W$ while, by (5.70), $r$ is orthogonal to $W$. Therefore,
$$\| r - s \|^2 = \| r \|^2 + \| s \|^2 \geq \| r \|^2 \qquad (5.72)$$
and so we have proved that
$$\| f - X a \|^2 \geq \| f - X a^* \|^2 \qquad (5.73)$$
for arbitrary $a \in \mathbb{R}^d$, i.e. $a^*$ minimizes $\| f - X a \|^2$.
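The minimality property (5.73) can also be checked empirically: with hypothetical data, no randomly chosen $a$ should attain a smaller objective value than $a^*$.

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.standard_normal((12, 3))
f = rng.standard_normal(12)

a_star, *_ = np.linalg.lstsq(X, f, rcond=None)
best = np.linalg.norm(f - X @ a_star) ** 2

# Objective values at 100 random trial vectors a; by (5.73) none of
# them can fall below the value at a*.
trials = [np.linalg.norm(f - X @ rng.standard_normal(3)) ** 2
          for _ in range(100)]
```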
If the columns of $X$ are linearly independent, i.e. if for every $a \neq 0$ we have that $X a \neq 0$, then the $d \times d$ matrix $X^T X$ is positive definite and hence
nonsingular. Therefore, in this case, there is a unique solution to the least squares problem $\min_a \| f - X a \|^2$, given by
$$a^* = (X^T X)^{-1} X^T f. \qquad (5.74)$$
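A direct sketch of formula (5.74) with hypothetical data; the explicit inverse is written only to mirror the formula, and a solver-based computation agrees with it.

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.standard_normal((9, 3))  # generic, so the columns are independent
f = rng.standard_normal(9)

# (5.74): a* = (X^T X)^{-1} X^T f.  The explicit inverse mirrors the
# formula; in practice one solves the linear system instead of inverting.
a_star = np.linalg.inv(X.T @ X) @ X.T @ f

# Cross-check against NumPy's least squares solver.
a_ref, *_ = np.linalg.lstsq(X, f, rcond=None)
```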
5.5. HIGH-DIMENSIONAL DATA FITTING
Figure 5.2: Geometric interpretation of the solution $Xa$ of the least squares problem as the orthogonal projection of $f$ onto the approximating linear subspace $W$.
The $d \times N$ matrix
$$X^{\dagger} = (X^T X)^{-1} X^T \qquad (5.75)$$
is called the pseudoinverse of the $N \times d$ matrix $X$. Note that if $X$ were square and nonsingular, $X^{\dagger}$ would coincide with the inverse, $X^{-1}$.
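The pseudoinverse (5.75) can be compared against NumPy's SVD-based `np.linalg.pinv`; a sketch with a hypothetical full-column-rank matrix, including the square nonsingular case where it reduces to the ordinary inverse:

```python
import numpy as np

rng = np.random.default_rng(5)
X = rng.standard_normal((7, 3))  # generic 7 x 3 matrix, full column rank

# (5.75): X^dagger = (X^T X)^{-1} X^T, valid when X has independent columns.
X_dagger = np.linalg.inv(X.T @ X) @ X.T

# For a square nonsingular matrix, the same formula collapses to the
# ordinary inverse: (S^T S)^{-1} S^T = S^{-1} S^{-T} S^T = S^{-1}.
S = rng.standard_normal((3, 3))
S_dagger = np.linalg.inv(S.T @ S) @ S.T
```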
As we have done in the other least squares problems we have seen so far, rather than working with the normal equations, whose matrix $X^T X$ may be very sensitive to perturbations in the data, we use an orthogonal basis for the approximating subspace ($W$ in this case) to find a solution. While in principle this can be done by applying the Gram-Schmidt process to the columns of $X$, this is a numerically unstable procedure; when two columns are nearly linearly dependent, errors introduced by the finite precision representation