Simple regression (Aug 23, 2010)
Review
Basic equations for simple regression
The purpose of regression is to describe the relationship between a response variable $y$ and a
predictor variable $X$, and to evaluate the strength of the relationship if it exists.
In its simplest form, the “linear model,” the relationship between a continuous variable $y$ and a
variable $X$ is assumed to be linear (a straight line), and the equation expressing the linear
relationship between $y$ and $X$ is written as:

$$y_i = \beta_0 + \beta_1 X_i + \varepsilon_i, \quad \text{where } \varepsilon_i \sim NID(0, \sigma^2).$$
In this expression, $\beta_0$ is called the “intercept” (the value of $y$ when $X = 0$) and
$\beta_1$ is the “slope” (the change in $y$ associated with each unit change in $X$). The Greek
letters signify that they are “population parameters,” or characteristics of the population, rather
than “sample statistics” used to estimate population parameters.
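As a small illustration of how the intercept and slope are read, the sketch below evaluates a straight line at a few values of $X$. The parameter values (2.0 and 0.5) and the function name `line` are hypothetical, chosen only to make the arithmetic easy to follow.

```python
# Hypothetical population parameters, chosen only for illustration.
beta0 = 2.0  # intercept: the value of y when X = 0
beta1 = 0.5  # slope: the change in y for each unit change in X


def line(x):
    """Value of y on the straight line beta0 + beta1 * x (no error term)."""
    return beta0 + beta1 * x


at_zero = line(0)                   # equals the intercept
unit_change = line(11) - line(10)   # equals the slope

print(at_zero)      # 2.0
print(unit_change)  # 0.5
```

The same one-unit difference would be obtained at any other pair of adjacent $X$ values, which is what makes the relationship a straight line.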
The equation says that when you consider all the people or communities or organizations in your
population, one attribute of the population (e.g., political conservatism of individuals), $y$, is
linearly (straight line) related to a second attribute of the population (e.g., age), $X$:
$y_i = \beta_0 + \beta_1 X_i$. The relationship is not perfect, however, and the departure of an
individual observation from the value predicted by the equation ($\varepsilon_i$) is what we mean by
the term “statistical error” or “residual.” Taken together, the two “parameters” of the model
provide a complete description of the straight line relationship.
One characteristic of population parameters is that they are fixed quantities of the population; that
is, if you have $N$ people living in a community at noon on September 1, 2010, then $\beta_1$ is an
expression of the relationship between the variables $y$ and $X$ for all of the people living in the
community at that time. Departures of the observations from the prediction line are referred to as
statistical error ($\varepsilon_i$), and the error terms are assumed to have particular
characteristics: they are said to be “normally and independently distributed with mean zero and
variance $\sigma^2$,” hence we say $\varepsilon_i \sim NID(0, \sigma^2)$.
The error term for each observation, $\varepsilon_i$, and the population variance, $\sigma^2$, are
likewise population quantities rather than sample statistics.
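To make the NID assumption concrete, here is a minimal simulation sketch. It builds a hypothetical population from the straight line model and checks that the departures from the line average close to zero, with variance close to $\sigma^2$. All numeric values, the age range for $X$, and the variable names are assumptions made for illustration and do not come from real data.

```python
import random

rng = random.Random(2010)  # fixed seed so the sketch is repeatable

# Hypothetical values, not taken from any real population.
N = 10_000    # population size
beta0 = 20.0  # intercept
beta1 = 0.3   # slope
sigma = 2.0   # standard deviation of the errors

# X: ages drawn uniformly between 18 and 80 (an assumed distribution).
X = [rng.uniform(18, 80) for _ in range(N)]

# Errors: independent normal draws with mean zero and variance sigma**2.
errors = [rng.gauss(0.0, sigma) for _ in range(N)]

# y follows the straight line plus error.
y = [beta0 + beta1 * x + e for x, e in zip(X, errors)]

# Departure of each observation from the value predicted by the line.
departures = [yi - (beta0 + beta1 * xi) for xi, yi in zip(X, y)]

mean_error = sum(departures) / N
var_error = sum((d - mean_error) ** 2 for d in departures) / N

print("mean of errors:", round(mean_error, 3))     # expected to be near 0
print("variance of errors:", round(var_error, 3))  # expected to be near sigma**2
```

In a finite simulated population the mean and variance will not equal 0 and $\sigma^2$ exactly; they should only be close.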
Usually, we don’t know the values of the population means ($\mu_y$ and $\mu_X$), variances
($\sigma_y^2$ and $\sigma_X^2$), correlation ($\rho_{yX}$), intercepts or slopes, so we estimate
them by taking a sample from the population. These estimates are our “sample statistics.” If the
sample is random, so that neither the researchers nor the subjects of the study have any influence
over who is selected, then the sample statistics are “unbiased” estimates of the population
parameters.
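The idea of an “unbiased” estimate can be sketched by repeated random sampling from a simulated population. The example below assumes the usual least squares formula for the slope, which has not yet been introduced at this point in the notes, and uses hypothetical parameter values and names throughout. It compares the slope computed from the whole population with the average of the slopes computed from many random samples.

```python
import random

rng = random.Random(23)  # fixed seed so the sketch is repeatable

# Hypothetical settings, chosen only for illustration.
N, n, reps = 5_000, 50, 2_000          # population size, sample size, number of samples
beta0, beta1, sigma = 20.0, 0.3, 2.0   # assumed model parameters

# Build a finite population of (X, y) pairs from the straight line model.
population = []
for _ in range(N):
    x = rng.uniform(18, 80)
    population.append((x, beta0 + beta1 * x + rng.gauss(0.0, sigma)))


def slope(pairs):
    """Least squares slope: sum((x - xbar)(y - ybar)) / sum((x - xbar)^2)."""
    m = len(pairs)
    xbar = sum(x for x, _ in pairs) / m
    ybar = sum(y for _, y in pairs) / m
    sxy = sum((x - xbar) * (y - ybar) for x, y in pairs)
    sxx = sum((x - xbar) ** 2 for x, _ in pairs)
    return sxy / sxx


# Slope for the whole population: a fixed quantity for this population.
population_slope = slope(population)

# Slopes from repeated random samples of size n, drawn without replacement.
sample_slopes = [slope(rng.sample(population, n)) for _ in range(reps)]
mean_sample_slope = sum(sample_slopes) / reps

print("population slope:", round(population_slope, 4))
print("mean of sample slopes:", round(mean_sample_slope, 4))
```

Individual sample slopes vary from one sample to the next, but their average should land very close to the population value, which is the behavior the word “unbiased” describes.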
In the discussion that follows, we will focus on computing sample statistics, and we will assume
the data are from a random sample of size $n$ ($n < N$).