MN1025 – Business Statistics
Lecture 8—Friday 29/2/2008
LINEAR REGRESSION
Reference: Lind et al., Chapter 13.
8.1 Regression: introduction
In the last lecture we introduced the concept of the best-fit line, which is an approximation to the data. The closeness of this approximation is measured by the correlation coefficient r. In this lecture we will see how the best-fit line can be used for prediction.
Example: Suppose the College wishes to save money and asks: can we predict exam results well from weekly work? If the answer is yes, we dispense with exams. To test this, we need a sample of students who have already been examined, and we check whether each student's average weekly mark predicts their exam result. To estimate the predictive power, one uses linear regression.
8.2 Back to sales and scores
Back to Example 7.7 (sales and scores). Here are the data again:

Data Display

Row  scores  sales
  1       4      5
  2       7     12
  3       3      4
  4       6      8
  5      10     11
We wish to analyse how good an approximation to the data the best-fit line is. We use STAT → REGRESSION → REGRESSION. We are asked to choose a RESPONSE column and a PREDICTOR column. In this case the only reasonable choice is “scores” as predictor (or cause) and “sales” as response (or effect). We get the Regression Analysis table shown below.
Regression Analysis: sales versus scores

The regression equation is
sales = 1.20 + 1.13 scores

Predictor     Coef  SE Coef     T      P
Constant     1.200    2.313  0.52  0.640
scores      1.1333   0.3569  3.18  0.050

S = 1.955   R-Sq = 77.1%   R-Sq(adj) = 69.4%

Analysis of Variance

Source          DF     SS     MS      F      P
Regression       1  38.53  38.53  10.08  0.050
Residual Error   3  11.47   3.82
Total            4  50.00
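The main figures in this printout can be checked by hand from the five data rows. The following Python sketch is not part of the Minitab session. It assumes the usual least-squares formulas (slope = Sxy / Sxx, intercept = mean of sales minus slope times mean of scores), and all variable names are illustrative.

```python
# Reproduce the main figures of the printout from the five data rows,
# assuming the standard least-squares formulas for a single predictor.
scores = [4, 7, 3, 6, 10]   # predictor column
sales = [5, 12, 4, 8, 11]   # response column

n = len(scores)
mean_x = sum(scores) / n
mean_y = sum(sales) / n

# Sums of squares and cross-products about the means.
sxx = sum((x - mean_x) ** 2 for x in scores)
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(scores, sales))
ss_total = sum((y - mean_y) ** 2 for y in sales)

# Coefficients of the best-fit line.
slope = sxy / sxx
intercept = mean_y - slope * mean_x

# Analysis-of-variance quantities.
ss_regression = slope * sxy
ss_error = ss_total - ss_regression
ms_error = ss_error / (n - 2)      # residual degrees of freedom = n - 2
s = ms_error ** 0.5
r_sq = ss_regression / ss_total
f_ratio = ss_regression / ms_error

print(f"sales = {intercept:.2f} + {slope:.2f} scores")
print(f"S = {s:.3f}, R-Sq = {100 * r_sq:.1f}%")
print(f"SS: regression {ss_regression:.2f}, error {ss_error:.2f}, total {ss_total:.2f}")
print(f"F = {f_ratio:.2f}")
```

Up to rounding, the printed values should agree with the regression equation, S, R-Sq, the SS column and the F value in the printout.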
The regression equation sales = 1.20 + 1.13 scores in the printout is the equation of the best-fit line. We can plot this on the scatter plot or get Minitab to plot it for us: we use STAT → REGRESSION → FITTED LINE PLOT and again enter “sales” as response and “scores” as predictor.
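To plot the line by hand we need its height at each observed score. As a minimal sketch (not Minitab output), the Python below evaluates the regression equation at the five scores and lists the residuals, that is, the observed sales minus the value on the line. The coefficients are copied from the Coef column of the printout; the function name predict_sales is illustrative.

```python
# Evaluate the best-fit line at each observed score and list the residuals.
scores = [4, 7, 3, 6, 10]
sales = [5, 12, 4, 8, 11]

# Coefficients taken from the Coef column of the printout.
intercept = 1.200
slope = 1.1333

def predict_sales(score):
    """Height of the best-fit line at the given score (illustrative helper)."""
    return intercept + slope * score

fitted = [predict_sales(x) for x in scores]
residuals = [y - f for y, f in zip(sales, fitted)]

for x, y, f, e in zip(scores, sales, fitted, residuals):
    print(f"score {x:2d}: observed {y:2d}, fitted {f:6.2f}, residual {e:6.2f}")

# The squared residuals add up to the Residual Error SS of the printout.
ss_residual = sum(e ** 2 for e in residuals)
print(f"sum of squared residuals = {ss_residual:.2f}")
```

The points (score, fitted) all lie on the best-fit line, so joining any two of them draws the line on the scatter plot.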
In all these examples we assume that the underlying populations have an approximately normal distribution, and that a relation of the form

    sales = m × scores + c + random error

is reasonable. In general, there could be more than one predictor. For instance, we could think that staff experience was a relevant factor and get a relation of the form

    sales = m_1 × scores + m_2 × experience + c + random error.

Here we have two predictors, “scores” and “experience”. Generally, by a suitable choice of additional predictors we can reduce the random error. In this course, we will always use a single predictor only.
8.3 Testing if the slope is nonzero
For the population of scores and sales there is an underlying (population) regression line:

    sales = m_population × scores + c_population.

In this equation, m_population is the slope of the (population) regression line, and c_population is its intercept. The sample slope of m = 1.13 is our estimate for m_population, and the sample intercept of c = 1.20 is our estimate for c_population.
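The printout in Section 8.2 reports T = 3.18 and P = 0.050 in the row for the slope. As a hedged sketch of where such numbers can come from, the Python below assumes the standard t-test of the hypothesis m_population = 0: T is the sample slope divided by its standard error, compared with a t distribution on n − 2 = 3 degrees of freedom (the DF shown for Residual Error). The closed-form tail probability used here is valid only for exactly 3 degrees of freedom, and the function name two_sided_p_3df is illustrative.

```python
import math

# Data from Section 8.2.
scores = [4, 7, 3, 6, 10]
sales = [5, 12, 4, 8, 11]

n = len(scores)
mean_x = sum(scores) / n
mean_y = sum(sales) / n
sxx = sum((x - mean_x) ** 2 for x in scores)
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(scores, sales))

# Sample slope and intercept (estimates of m_population and c_population).
slope = sxy / sxx
intercept = mean_y - slope * mean_x

# Residual standard deviation S and the standard error of the slope.
ss_error = sum((y - intercept - slope * x) ** 2 for x, y in zip(scores, sales))
s = math.sqrt(ss_error / (n - 2))
se_slope = s / math.sqrt(sxx)

# Test statistic for the hypothesis that the population slope is zero.
t_stat = slope / se_slope

def two_sided_p_3df(t):
    """Two-sided tail probability of Student's t with exactly 3 degrees of freedom."""
    u = abs(t) / math.sqrt(3)
    cdf = 0.5 + (u / (1 + u * u) + math.atan(u)) / math.pi
    return 2 * (1 - cdf)

p_value = two_sided_p_3df(t_stat)
print(f"SE Coef = {se_slope:.4f}, T = {t_stat:.2f}, P = {p_value:.3f}")
```

Under these assumptions the computed values should match the SE Coef, T and P entries for “scores” in the printout, up to rounding.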
- Spring '08
- SCHACK
- Regression Analysis, Errors and residuals in statistics, best-fit line, Minitab Regression Analysis