
We can satisfy this condition by taking the offset $\mu$ to be the sample mean (average of all the observed vectors):
$$\hat{\mu} = \frac{1}{N}\sum_{n=1}^{N} x_n.$$
Note that this choice is not unique – any choice of $\mu$ that results in $\sum_n (x_n - \mu)$ living in the nullspace of $I - QQ^T$ would also suffice – but $\hat{\mu}$ is the easy and obvious choice, and also what is usually done in practice, because it makes computing the solution to the PCA problem straightforward.
Computing the PCA solution
Specifically, in practice you would typically proceed by first computing the mean $\hat{\mu}$ of your data as described above. Given $\hat{\mu}$, you can then form the matrix $\widetilde{X}$ whose columns are given by
$$\widetilde{x}_n = x_n - \hat{\mu}.$$
Alternatively, if you know a priori that the columns of $X$ have zero mean (or should have zero mean) based on the underlying process generating the data, then you can skip this step, setting $\widetilde{X} = X$.
In either case, once you have formed $\widetilde{X}$, you simply compute the SVD of $\widetilde{X} = U\Sigma V^T$ and then set
$$\widehat{Q} = U_K, \qquad \widehat{\theta}_n = U_K^T \widetilde{x}_n,$$
where $U_K = \begin{bmatrix} u_1 & u_2 & \cdots & u_K \end{bmatrix}$ contains the first $K$ columns of $U$.
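As a concrete illustration, here is a minimal numpy sketch of this whole procedure. The function name pca_via_svd and the assumption that the data matrix X holds one observation per column are our own choices, not from the notes:

```python
import numpy as np

def pca_via_svd(X, K):
    """Rank-K PCA of a data matrix X whose columns are the observations.

    Returns the sample mean, the basis Q_hat (first K left singular
    vectors), and the K-dimensional coefficients Theta_hat.
    """
    # Sample mean of the columns, kept as a column vector
    mu_hat = X.mean(axis=1, keepdims=True)

    # Center the data: column n of X_tilde is x_n - mu_hat
    X_tilde = X - mu_hat

    # Thin SVD of the centered data matrix
    U, s, Vt = np.linalg.svd(X_tilde, full_matrices=False)

    # Basis for the K-dimensional subspace and the coefficients
    Q_hat = U[:, :K]
    Theta_hat = Q_hat.T @ X_tilde

    return mu_hat, Q_hat, Theta_hat
```

The best rank-$K$ approximation to the centered data is then recovered as Q_hat @ Theta_hat.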
We can think of $\widehat{\theta}_n$ as a representation of $x_n$ in a $K$-dimensional subspace, with $\widehat{Q}$ giving us a basis for that subspace (which is useful for projecting vectors $x \in \mathbb{R}^N$ into the subspace).
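For instance, a short sketch of such a projection, assuming Q_hat and mu_hat as computed in the snippet above (the helper name project is hypothetical):

```python
import numpy as np

def project(x, mu_hat, Q_hat):
    """Project a column vector x onto the affine subspace mu_hat + range(Q_hat).

    Assumes Q_hat has orthonormal columns (true of left singular vectors)
    and that x and mu_hat are column vectors of the same shape.
    """
    theta = Q_hat.T @ (x - mu_hat)   # K-dimensional coefficients
    return mu_hat + Q_hat @ theta    # projected vector back in R^N
```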
Note that if you look up a discussion of PCA in most textbooks or online, you will typically see a slightly different presentation. Specifically, most texts describe an approach to the problem that involves forming the matrix
$$S = \sum_{n=1}^{N} (x_n - \hat{\mu})(x_n - \hat{\mu})^T,$$
taking an eigenvalue decomposition $S = V\Lambda V^T$, and then taking
$$Q = \begin{bmatrix} v_1 & v_2 & \cdots & v_K \end{bmatrix},$$
where $v_1, \ldots, v_K$ are the eigenvectors of $S$ corresponding to the $K$ largest eigenvalues.
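A minimal numpy sketch of this covariance-based route, under the same column-wise data layout assumed earlier (the name pca_via_eig is ours):

```python
import numpy as np

def pca_via_eig(X, K):
    """Rank-K PCA via the eigendecomposition of S = sum_n (x_n - mu)(x_n - mu)^T."""
    mu_hat = X.mean(axis=1, keepdims=True)
    X_tilde = X - mu_hat

    # S can be formed in one shot as X_tilde @ X_tilde.T
    S = X_tilde @ X_tilde.T

    # eigh returns eigenvalues of a symmetric matrix in ascending order
    eigvals, V = np.linalg.eigh(S)

    # Keep the eigenvectors associated with the K largest eigenvalues
    Q = V[:, ::-1][:, :K]
    return mu_hat, Q
```

Up to sign flips of individual columns (and ordering among repeated eigenvalues), this Q matches Q_hat from the SVD sketch above.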
This approach is completely equivalent to our approach above.$^5$ The reason that PCA is typically presented in this way is that $S$ can be interpreted as a scaled version of an empirical estimate of the covariance matrix for the underlying distribution generating the data. While this provides a nice connection with the other (statistical) interpretation of PCA, I personally find the SVD approach more intuitive. In PCA, we are simply trying to find a low-rank approximation to our dataset, which is directly and optimally handled by computing a truncated SVD.
$^5$Recall the relationship between the SVD of $\widetilde{X}$ and the eigendecomposition of $\widetilde{X}\widetilde{X}^T$.
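To spell that relationship out: since $S = \widetilde{X}\widetilde{X}^T$, substituting the SVD gives
$$\widetilde{X}\widetilde{X}^T = (U\Sigma V^T)(U\Sigma V^T)^T = U\Sigma V^T V\Sigma^T U^T = U\Sigma\Sigma^T U^T,$$
so the eigenvectors of $S$ are exactly the left singular vectors of $\widetilde{X}$, with eigenvalues equal to the squared singular values.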
Technical Details: Subspace Approx. Lemma
We prove the subspace approximation lemma from above. First, with $Q$ fixed, we can break the optimization over $\Theta$ into a series of least-squares problems. Let $a_1, \ldots, a_N$ be the columns of $A$, and $\theta_1, \ldots, \theta_N$ be the columns of $\Theta$. Then
$$\underset{\Theta}{\text{minimize}}\ \|A - Q\Theta\|_F^2$$
is exactly the same as
$$\underset{\theta_1,\ldots,\theta_N}{\text{minimize}}\ \sum_{n=1}^{N} \|a_n - Q\theta_n\|_2^2.$$
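This is just the fact that a squared Frobenius norm is the sum of the squared $\ell_2$ norms of the columns. A quick numerical sanity check (shapes and names here are arbitrary illustration choices):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((8, 5))      # columns a_1, ..., a_N
Q = rng.standard_normal((8, 3))      # fixed basis matrix
Theta = rng.standard_normal((3, 5))  # columns theta_1, ..., theta_N

# Frobenius-norm objective on the full matrices
frob = np.linalg.norm(A - Q @ Theta, 'fro') ** 2

# Sum of the per-column least-squares objectives
per_col = sum(np.linalg.norm(A[:, n] - Q @ Theta[:, n]) ** 2
              for n in range(A.shape[1]))

assert np.isclose(frob, per_col)
```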