There are actually two equivalent ways to think about PCA. The first
is statistical: we are trying to find a transform that is carefully tuned
to the (second-order) statistics of the data. The second perspective,
which is what we will adopt in this course, is more geometric: given
a set of vectors, we are trying to find a subspace of a certain dimension
that comes closest to containing this set.
Specifically, suppose that we have data points $x_1, \ldots, x_N \in \mathbb{R}^D$, and
want to find the $K$-dimensional affine space (subspace plus offset)
that comes closest to containing them. Here is a picture:

[Figure: an example of fitting an affine space to a point cloud, from Ch. 14 of Hastie, Tibshirani, and Friedman's *Elements of Statistical Learning*.]
Georgia Tech ECE 6250 Fall 2019; Notes by J. Romberg and M. Davenport. Last updated 23:01, November 5, 2019
Our goal is to find an offset $\mu \in \mathbb{R}^D$ and a matrix $Q$ with orthonormal
columns such that
$$ x_n \approx \mu + Q\theta_n \quad \text{for all } n = 1, \ldots, N, $$
for some $\theta_n \in \mathbb{R}^K$. We cast this as the following optimization problem. Given $x_1, \ldots, x_N$, solve
$$ \underset{\mu,\, Q,\, \{\theta_n\}}{\text{minimize}} \;\; \sum_{n=1}^{N} \| x_n - \mu - Q\theta_n \|_2^2 \quad \text{subject to} \quad Q^T Q = I. $$
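As a quick numerical sketch of the objective and constraint (the function name `pca_objective` and the random test data are our own, not from the notes):

```python
import numpy as np

def pca_objective(X, mu, Q, Theta):
    """Sum over n of ||x_n - mu - Q theta_n||_2^2.

    X is D x N (data points in columns), mu has length D,
    Q is D x K with orthonormal columns, Theta is K x N.
    """
    R = X - mu[:, None] - Q @ Theta   # residuals for all n at once
    return np.sum(R ** 2)

# small random instance; QR gives us a Q with orthonormal columns
rng = np.random.default_rng(0)
D, N, K = 5, 20, 2
X = rng.standard_normal((D, N))
Q, _ = np.linalg.qr(rng.standard_normal((D, K)))
mu = X.mean(axis=1)
Theta = Q.T @ (X - mu[:, None])
print(pca_objective(X, mu, Q, Theta))
```

The constraint $Q^T Q = I$ is what makes `Q` a legitimate orthonormal basis for the candidate subspace; QR factorization is just one convenient way to generate such a matrix for testing.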
Note that if we fix $\mu$ and define $\widetilde{x}_n = x_n - \mu$, then we can recast
the optimization with respect to $Q$ and the $\theta_n$ as
$$ \underset{Q,\, \{\theta_n\}}{\text{minimize}} \;\; \sum_{n=1}^{N} \| \widetilde{x}_n - Q\theta_n \|_2^2 \quad \text{subject to} \quad Q^T Q = I. $$
If $\widetilde{X}$ and $\Theta$ denote the matrices whose columns are given by
$\widetilde{x}_1, \ldots, \widetilde{x}_N$ and $\theta_1, \ldots, \theta_N$ respectively, then we can also write this as
$$ \underset{\substack{Q:\, D \times K \\ \Theta:\, K \times N}}{\text{minimize}} \;\; \| \widetilde{X} - Q\Theta \|_F^2 \quad \text{subject to} \quad Q^T Q = I. $$
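The matrix form is just a repackaging of the sum-of-norms form, which a small numpy check makes concrete (variable names here are ours):

```python
import numpy as np

rng = np.random.default_rng(1)
D, N, K = 6, 30, 3
Xt = rng.standard_normal((D, N))          # plays the role of X-tilde
Q, _ = np.linalg.qr(rng.standard_normal((D, K)))
Theta = rng.standard_normal((K, N))

# column-by-column sum of squared 2-norms ...
sum_form = sum(np.linalg.norm(Xt[:, n] - Q @ Theta[:, n]) ** 2
               for n in range(N))
# ... equals the squared Frobenius norm of the matrix difference
frob_form = np.linalg.norm(Xt - Q @ Theta, 'fro') ** 2
print(np.isclose(sum_form, frob_form))
```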
This is exactly the optimization problem that we looked at previously
in our Subspace Approximation Lemma! Thus the solution is
given by computing the SVD of $\widetilde{X} = U \Sigma V^T$ and then taking as our
solution
$$ \widehat{Q} = U_K, \qquad \widehat{\Theta} = U_K^T \widetilde{X}, $$
where $U_K = \begin{bmatrix} u_1 & u_2 & \cdots & u_K \end{bmatrix}$ contains the first $K$ columns of $U$.
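This subproblem is a few lines of numpy (the function name `subspace_fit` is ours; the notes do not give code). The sanity check below relies on the Eckart–Young fact that the rank-$K$ truncation error equals the sum of the squared discarded singular values:

```python
import numpy as np

def subspace_fit(Xt, K):
    """Solve min ||Xt - Q Theta||_F^2 s.t. Q^T Q = I via the SVD Xt = U S V^T."""
    U, s, Vt = np.linalg.svd(Xt, full_matrices=False)
    Q_hat = U[:, :K]            # first K left singular vectors
    Theta_hat = Q_hat.T @ Xt    # optimal coefficients
    return Q_hat, Theta_hat

rng = np.random.default_rng(2)
D, N, K = 8, 40, 3
Xt = rng.standard_normal((D, N))
Q_hat, Theta_hat = subspace_fit(Xt, K)

U, s, Vt = np.linalg.svd(Xt, full_matrices=False)
err = np.linalg.norm(Xt - Q_hat @ Theta_hat, 'fro') ** 2
print(np.isclose(err, np.sum(s[K:] ** 2)))
```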
Finally, let us return to the question of how to set $\mu$. For any given
$\mu$, the solution for $Q$ and $\Theta$ is given by the Subspace Approximation
Lemma. This results in setting
$$ \theta_n = Q^T (x_n - \mu). $$
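One can spot-check numerically that for fixed $Q$ and $\mu$ these coefficients beat any perturbed choice of $\theta_n$ (a sketch with our own variable names; the perturbations are random):

```python
import numpy as np

rng = np.random.default_rng(5)
D, K = 6, 2
x = rng.standard_normal(D)
mu = rng.standard_normal(D)
Q, _ = np.linalg.qr(rng.standard_normal((D, K)))  # orthonormal columns

theta_star = Q.T @ (x - mu)                       # the closed-form coefficients
best = np.linalg.norm(x - mu - Q @ theta_star) ** 2
# any perturbed theta can only increase the residual
others = [np.linalg.norm(x - mu - Q @ (theta_star + rng.standard_normal(K))) ** 2
          for _ in range(100)]
print(best <= min(others))
```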
Plugging this in for $\theta_n$ in our objective function, we have that
$$ x_n - \mu - Q\theta_n = x_n - \mu - Q Q^T (x_n - \mu) = (I - QQ^T)(x_n - \mu). $$
Hence, the problem of selecting $\mu$ reduces to the optimization problem
$$ \underset{\mu}{\text{minimize}} \;\; \sum_{n=1}^{N} \| (I - QQ^T)(x_n - \mu) \|_2^2. $$
The vector $\mu$ is unconstrained; we can solve for the optimal $\mu$ by
taking a gradient and setting it equal to zero. To make this easier,
note that
$$ \| (I - QQ^T)(x_n - \mu) \|_2^2 = (x_n - \mu)^T (I - QQ^T)(x_n - \mu) $$
by simply expanding out the norm squared as an inner product and
then using the fact that $I - QQ^T$ is a projector, i.e., it is symmetric
and $(I - QQ^T)^2 = I - QQ^T$. Thus, by taking a gradient and setting
it equal to zero we have
$$ 0 = -2 \sum_{n=1}^{N} (I - QQ^T)(x_n - \mu) = -2\, (I - QQ^T) \left( \sum_{n=1}^{N} x_n - N\mu \right). $$
This is satisfied by taking $\mu$ to be the sample mean, $\mu = \frac{1}{N} \sum_{n=1}^{N} x_n$.
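Setting $\mu$ to the sample mean makes this gradient vanish, which is easy to confirm numerically (a sketch with our own variable names):

```python
import numpy as np

rng = np.random.default_rng(4)
D, N, K = 5, 25, 2
X = rng.standard_normal((D, N))
Q, _ = np.linalg.qr(rng.standard_normal((D, K)))
P = np.eye(D) - Q @ Q.T                   # the projector I - QQ^T

mu = X.mean(axis=1)                       # candidate: the sample mean
grad = -2 * P @ (X.sum(axis=1) - N * mu)  # -2 (I - QQ^T)(sum_n x_n - N mu)
print(np.allclose(grad, 0))
```

Note that the sum inside the parentheses is exactly zero at the sample mean, so the gradient vanishes regardless of which subspace $Q$ spans.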