Fall 2006 ORIE474: Section 7 notes
Nikolai Blizniouk
The goal today is to discuss how to do principal components (PC) regression, how to create interactions in SAS EM and SAS Analyst, and how to use SAS Analyst to extract diagnostic information not provided by the Regression node of SAS EM.
Setup: we’ll use the BASEBALL data set. Load it as before. In the Input Data Source node, set SALARY to be your response (target).
Some notation: $I_p$ is the $p \times p$ identity matrix and $1_n$ is the length-$n$ column vector of ones. For an arbitrary matrix $A$, $A_j$ will denote the $j$th column of $A$, and $A_{ij}$ will denote its $ij$th entry. Also, $Y$ denotes the vector of responses and $\varepsilon$ is the vector of errors.
PC regression
Review of SVD and spectral decomposition
Let $Z$ be an $n \times p$ matrix (assume $n \ge p$). The singular value decomposition (SVD) of $Z$ is given by the equation $Z = U S V^T$, where $U$ is of size $n \times p$, and $S$ and $V$ are of size $p \times p$. Furthermore, $S$ is diagonal with $S_{ii} \ge S_{jj} \ge 0$ if $i < j$, $V V^T = V^T V = I_p$ and $U^T U = I_p$. Notice that $Z^T Z = V S U^T U S V^T = V (SS) V^T$, which implies that $\lambda_i = S_{ii}^2$ is the $i$th largest eigenvalue of $Z^T Z$ and $V_i$ is the corresponding eigenvector.
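The properties listed above can be checked numerically. What follows is a minimal sketch in Python with NumPy, which is not part of the SAS workflow used in these notes; the matrix Z is random illustrative data and all variable names are chosen here for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 8, 3
Z = rng.standard_normal((n, p))  # arbitrary illustrative matrix with n >= p

# Thin SVD: U is n x p, s holds the diagonal of S, Vt is V^T (p x p).
U, s, Vt = np.linalg.svd(Z, full_matrices=False)
S = np.diag(s)
V = Vt.T

# Eigenvalues of Z^T Z. eigh returns them in ascending order, so reverse
# them to match the descending order of the singular values.
eigvals = np.linalg.eigh(Z.T @ Z)[0][::-1]

print("Z == U S V^T:        ", np.allclose(Z, U @ S @ Vt))
print("U^T U == I_p:        ", np.allclose(U.T @ U, np.eye(p)))
print("V V^T == V^T V == I: ", np.allclose(V @ V.T, np.eye(p))
      and np.allclose(V.T @ V, np.eye(p)))
print("lambda_i == S_ii^2:  ", np.allclose(eigvals, s**2))
# Each column V_i satisfies (Z^T Z) V_i = lambda_i V_i.
print("V_i are eigenvectors:", np.allclose(Z.T @ Z @ V, V @ np.diag(s**2)))
```

Note that `np.linalg.svd` returns $V^T$ rather than $V$, and the singular values as a vector rather than as the diagonal matrix $S$.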
How does this relate to PCA?
Recall that in PCA using the sample correlation matrix, we were looking for eigenvalues and eigenvectors of the matrix $Z^T Z$, where $Z_{ji} = (X_{ji} - \mathrm{ave}(X_i)) / \mathrm{std}(X_i)$. After that, one would do a change of variables $z \to V^T z$, thereby decomposing the variation in the original variables into orthogonal directions. In PC regression, the idea is similar: instead of working with the model (1), we work with the equivalent model (3):
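\begin{align}
Y &= \beta_0 \cdot 1_n + Z\beta + \varepsilon, \tag{1} \\
  &= \beta_0 \cdot 1_n + (US)(V^T \beta) + \varepsilon \tag{2} \\
  &= \beta_0 \cdot 1_n + P\gamma + \varepsilon, \quad \text{where } P = US,\ \gamma = V^T \beta. \tag{3}
\end{align}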
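The equivalence of models (1) and (3) can be checked on simulated data. The sketch below, in Python with NumPy, fits both by ordinary least squares and compares the results. Everything in it is an assumption made for illustration: the helper names `standardize` and `ols`, the synthetic `X` and `Y`, and the use of the sample standard deviation (`ddof=1`) for std.

```python
import numpy as np

def standardize(X):
    """Center each column and divide by its sample standard deviation."""
    return (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

def ols(D, Y):
    """Least-squares fit of Y on an intercept plus the columns of D.

    Returns (intercept, slope_vector).
    """
    A = np.column_stack([np.ones(len(Y)), D])
    coef, *_ = np.linalg.lstsq(A, Y, rcond=None)
    return coef[0], coef[1:]

# Synthetic data, for illustration only (not the BASEBALL data set).
rng = np.random.default_rng(1)
X = rng.standard_normal((50, 4))
Y = 2.0 + X @ np.array([1.0, -0.5, 0.0, 3.0]) + 0.1 * rng.standard_normal(50)

Z = standardize(X)
U, s, Vt = np.linalg.svd(Z, full_matrices=False)
P = U * s  # same as U @ np.diag(s); the columns of P are the PCs of Z

b0, beta = ols(Z, Y)    # model (1): Y = b0 * 1_n + Z beta + error
g0, gamma = ols(P, Y)   # model (3): Y = g0 * 1_n + P gamma + error

print("P == Z V:           ", np.allclose(P, Z @ Vt.T))
print("gamma == V^T beta:  ", np.allclose(gamma, Vt @ beta))
print("same intercept:     ", np.isclose(b0, g0))
print("same fitted values: ", np.allclose(b0 + Z @ beta, g0 + P @ gamma))
```

With all $p$ components kept, the two fits differ only in how the coefficients are parametrized; the fitted values are the same.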
In the model (3), one can set $\gamma_{q+1}, \dots, \gamma_p$ to zero, which is equivalent to regressing $Y$ on the first $q$ principal components of $Z$, which are the first $q$ columns of $P$ (plus an intercept, of course). The interpretation is the same: in the presence of multicollinearity, the variation of $P_{q+1}, \dots, P_p$ is small, and thus these components can be ignored.
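To make the truncation concrete, here is a minimal sketch in Python with NumPy of regressing on the first $q$ principal components. The helper `pc_regression` is hypothetical, and the data are synthetic: two predictors are made nearly collinear on purpose so that the last component carries almost no variation. This illustrates the idea only; it is not the procedure SAS EM or SAS Analyst runs.

```python
import numpy as np

def pc_regression(X, Y, q):
    """Regress Y on an intercept plus the first q principal components.

    Hypothetical helper: X is the n x p matrix of raw predictors, which is
    standardized column by column before the SVD is taken.
    Returns (intercept, gamma_q, s), where s holds all p singular values.
    """
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    U, s, Vt = np.linalg.svd(Z, full_matrices=False)
    P_q = U[:, :q] * s[:q]  # first q columns of P = US
    A = np.column_stack([np.ones(len(Y)), P_q])
    coef, *_ = np.linalg.lstsq(A, Y, rcond=None)
    return coef[0], coef[1:], s

# Synthetic, nearly collinear predictors: x2 is x1 plus a little noise.
rng = np.random.default_rng(2)
n = 100
x1 = rng.standard_normal(n)
x2 = x1 + 0.01 * rng.standard_normal(n)
x3 = rng.standard_normal(n)
X = np.column_stack([x1, x2, x3])
Y = 1.0 + x1 + x2 + x3 + 0.1 * rng.standard_normal(n)

b0_full, gamma_full, s = pc_regression(X, Y, q=3)  # all components kept
b0_q, gamma_q, _ = pc_regression(X, Y, q=2)        # gamma_3 set to zero

# Share of the total variation carried by component j: s_j^2 / sum_k s_k^2.
share = s**2 / np.sum(s**2)
print("variation share per component:", np.round(share, 5))
print("gamma with q = 3:", gamma_full)
print("gamma with q = 2:", gamma_q)
```

In this sketch the columns of $P$ are mutually orthogonal and, because $Z$ is centered, orthogonal to $1_n$. Dropping the last component therefore leaves the intercept and the remaining estimated $\gamma_j$ unchanged.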