Fall 2006 ORIE474: Section 6 notes
Nikolai Blizniouk
The goal of these notes is to provide some guidance for the use of the Regression node in SAS. The setup assumes that you have drawn a diagram similar to the one from Section 5.
Doing regression with categorical variables using SAS
Suppose we have a population divided into $J$ disjoint strata/categories and we make a measurement on a subject that belongs to one of these strata. Consider a simple regression model of the form

$$Y_i = \beta_0 + \sum_{j=1}^{J} \beta_{1,j} \cdot I(\text{$i$th subject belongs to $j$th category}) + \epsilon_i,$$

where $j = 1, \ldots, J$ and $\epsilon_i$ is a zero-mean noise term.$^{1}$
The coefficient $\beta_0$ is treated as the overall population mean (aka “grand mean”) and the $\beta_{1,j}$'s measure the amount by which the mean of the $j$th stratum deviates from the grand mean. Given that we have the data $Y_1, \ldots, Y_n$, the goal is to estimate the parameters $(\beta_0, \beta_{1,1}, \ldots, \beta_{1,J})$ of the model. It is known which subpopulation each $Y_i$ comes from.
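The model above can be made concrete with a small simulation. The following is only a sketch in plain Python, not SAS code and not what the Regression node does internally; the function names `indicator_row` and `simulate`, the Gaussian noise, and all numeric values are assumptions chosen for illustration.

```python
import random


def indicator_row(category, num_categories):
    """Return [1, I(cat = 1), ..., I(cat = J)] for a subject in `category` (1-based)."""
    return [1.0] + [1.0 if category == j else 0.0
                    for j in range(1, num_categories + 1)]


def simulate(beta0, beta1, categories, noise_sd, seed=0):
    """Draw Y_i = beta0 + sum_j beta1[j] * I(subject i in category j) + eps_i.

    The noise eps_i is taken to be Gaussian with mean zero (an assumption;
    the notes only require a zero-mean noise term).
    """
    rng = random.Random(seed)
    coefs = [beta0] + list(beta1)
    ys = []
    for c in categories:
        row = indicator_row(c, len(beta1))
        mean = sum(x * b for x, b in zip(row, coefs))
        ys.append(mean + rng.gauss(0.0, noise_sd))
    return ys


# Hypothetical example: J = 2 strata, five subjects with known memberships.
memberships = [1, 2, 2, 1, 2]
print(simulate(beta0=50.0, beta1=[-5.0, 5.0],
               categories=memberships, noise_sd=1.0))
```

Each subject contributes exactly one nonzero indicator, so the mean of $Y_i$ for a subject in stratum $j$ is $\beta_0 + \beta_{1,j}$.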
Notice that the above model does not determine the parameters uniquely. In statistics, a formal statement is “parameterization is not identifiable”. This means that no matter how much data we have, it will not be possible to say for certain which parameter values actually were used to generate the data. Why? Because

$$Y_i = \beta_0 + \beta_{1,j} + \epsilon_i = (\beta_0 - \alpha) + (\beta_{1,j} + \alpha) + \epsilon_i,$$

for every value of $\alpha$, so even if you knew the expectation (i.e., true mean) $\mu_i$ of each of the $Y_i$'s, you would still be unable to determine the $\beta$'s uniquely. For example, suppose we have two subpopulations of ORIE graduates, one of which consists of individuals whose highest degree is a Bachelor's and the other of those with a Master's. Let $Y_i$ denote the random variable for the current salary of the $i$th individual. Then the $\beta_{1,j}$'s will capture deviations of the subpopulation means ($\mu_j = \beta_0 + \beta_{1,j}$) from some “base level” mean $\beta_0$. Obviously, unless $\beta_0$ is fixed, one would be unable to determine the $\beta_{1,j}$'s even if the $Y_i$'s did not include the error term $\epsilon_i$.
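The salary example can be checked numerically. The sketch below is plain Python with made-up numbers (salaries in thousands, purely hypothetical); the helper names `stratum_means` and `offsets_given_base` are introduced here for illustration only. It shows that shifting $\beta_0$ down by $\alpha$ and every $\beta_{1,j}$ up by $\alpha$ leaves the subpopulation means unchanged, and that the offsets become determined once $\beta_0$ is fixed.

```python
def stratum_means(beta0, beta1):
    """Subpopulation means mu_j = beta0 + beta_{1,j}."""
    return [beta0 + b for b in beta1]


def offsets_given_base(means, beta0):
    """Recover beta_{1,j} = mu_j - beta0 once the base level beta0 is fixed."""
    return [m - beta0 for m in means]


# Hypothetical parameters: Bachelor's and Master's strata.
beta0, beta1 = 60.0, [-10.0, 10.0]
alpha = 7.5  # any real number gives the same means

original = stratum_means(beta0, beta1)
shifted = stratum_means(beta0 - alpha, [b + alpha for b in beta1])
print(original == shifted)  # the two parameter sets cannot be told apart

# Different choices of the base level give different, equally valid offsets.
print(offsets_given_base(original, 60.0))
print(offsets_given_base(original, 50.0))
```

Because only the sums $\beta_0 + \beta_{1,j}$ enter the distribution of the data, no amount of data can distinguish the original parameters from the shifted ones.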
It is thus desirable to avoid this kind of situation by putting constraints on pa-