Regression Analysis
Author: John M. Cimbala, Penn State University
Latest revision: 12 September 2007
Introduction
• Consider a set of n measurements of some variable y as a function of another variable x.
• Typically, y is some measured output as a function of some known input, x. Recall that the linear correlation coefficient is used to determine if there is a trend.
• If there is a trend, regression analysis is useful. Regression analysis is used to find an equation for y as a function of x that provides the best fit to the data.
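The linear correlation coefficient recalled above is not written out in this section. As a reminder, here is a minimal sketch that assumes the standard (Pearson) definition, $r = \dfrac{n\sum x_i y_i - \sum x_i \sum y_i}{\sqrt{\left(n\sum x_i^2 - (\sum x_i)^2\right)\left(n\sum y_i^2 - (\sum y_i)^2\right)}}$. The function name and the sample data are illustrative only and do not come from the notes.

```python
import math


def linear_correlation_coefficient(x, y):
    """Return the linear (Pearson) correlation coefficient of paired data.

    Assumption: this is the standard textbook definition; the notes only
    refer to the coefficient by name in this section.
    """
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    sum_x = sum(x)
    sum_y = sum(y)
    sum_xy = sum(xi * yi for xi, yi in zip(x, y))
    sum_x2 = sum(xi * xi for xi in x)
    sum_y2 = sum(yi * yi for yi in y)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    if denominator == 0:
        raise ValueError("x or y is constant; r is undefined")
    return numerator / denominator


# Made-up data lying exactly on straight lines (illustration only).
r_up = linear_correlation_coefficient([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
r_down = linear_correlation_coefficient([1.0, 2.0, 3.0], [6.0, 4.0, 2.0])
```

A value of r near +1 or -1 suggests a strong linear trend, in which case a regression fit is worth computing.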
Linear regression analysis
• Linear regression analysis is also called linear least-squares fit analysis.
• The goal of linear regression analysis is to find the “best fit” straight line through a set of y vs. x data.
• The technique for deriving equations for this best-fit or least-squares fit line is as follows:
o An equation for a straight line that attempts to fit the data pairs is chosen as $Y = ax + b$.
o In the above equation, a is the slope (a = dy/dx; most of us are more familiar with the symbol m rather than a for the slope of a line), and b is the y-intercept, the y location where the line crosses the y axis (in other words, the value of Y at x = 0).
o An upper case Y is used for the fitted line to distinguish the fitted data from the actual data values, y.
o In linear regression analysis, coefficients a and b are optimized for the best possible fit to the data.
o The optimization process itself is actually very straightforward:
o For each data pair $(x_i, y_i)$, error $e_i$ is defined as the difference between the predicted or fitted value and the actual value: $e_i$ = error at data pair i, or
$$e_i = Y_i - y_i = a x_i + b - y_i .$$
$e_i$ is also called the residual. Note: Here, what we call the actual value does not necessarily mean the “correct” value, but rather the value of the actual measured data point.
o We define E as the sum of the squared errors of the fit, a global measure of the error associated with all n data points. The equation for E is
$$E = \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} \left( a x_i + b - y_i \right)^2 .$$
o It is now assumed that the best fit is the one for which E is the smallest.
o In other words, coefficients a and b that minimize E need to be found. These coefficients will be the ones that create the best-fit straight line Y = ax + b.
o How can a and b be found such that E is minimized? Well, as any good engineer or mathematician knows, to find a minimum (or maximum) of a quantity, that quantity is differentiated, and the derivative is set to zero.
o Here, two partial derivatives are required, since E is a function of two variables, a and b. Therefore, we set
$$\frac{\partial E}{\partial a} = 0 \quad \text{and} \quad \frac{\partial E}{\partial b} = 0 .$$
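o After some algebra, which can be verified, the following equations result for coefficients a and b:
$$a = \frac{n \sum_{i=1}^{n} x_i y_i - \left( \sum_{i=1}^{n} x_i \right) \left( \sum_{i=1}^{n} y_i \right)}{n \sum_{i=1}^{n} x_i^2 - \left( \sum_{i=1}^{n} x_i \right)^2}
\quad \text{and} \quad
b = \frac{\left( \sum_{i=1}^{n} x_i^2 \right) \left( \sum_{i=1}^{n} y_i \right) - \left( \sum_{i=1}^{n} x_i \right) \left( \sum_{i=1}^{n} x_i y_i \right)}{n \sum_{i=1}^{n} x_i^2 - \left( \sum_{i=1}^{n} x_i \right)^2} .$$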
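The procedure above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names and the sample data are made up and are not part of the original notes. The sketch evaluates the closed-form expressions for a and b, computes E, and checks numerically that both partial derivatives of E vanish at the fitted coefficients.

```python
def linear_least_squares(x, y):
    """Return (a, b) for the best-fit line Y = a*x + b (closed-form sums)."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    sum_x = sum(x)
    sum_y = sum(y)
    sum_xy = sum(xi * yi for xi, yi in zip(x, y))
    sum_x2 = sum(xi * xi for xi in x)
    denom = n * sum_x2 - sum_x ** 2  # shared denominator of a and b
    if denom == 0:
        raise ValueError("all x values are identical; the slope is undefined")
    a = (n * sum_xy - sum_x * sum_y) / denom
    b = (sum_x2 * sum_y - sum_x * sum_xy) / denom
    return a, b


def sum_squared_errors(x, y, a, b):
    """E = sum of e_i**2, with residual e_i = a*x_i + b - y_i."""
    return sum((a * xi + b - yi) ** 2 for xi, yi in zip(x, y))


def gradient_of_E(x, y, a, b):
    """Partial derivatives (dE/da, dE/db), obtained by differentiating E."""
    dE_da = sum(2.0 * (a * xi + b - yi) * xi for xi, yi in zip(x, y))
    dE_db = sum(2.0 * (a * xi + b - yi) for xi, yi in zip(x, y))
    return dE_da, dE_db


# Hypothetical measurements, chosen only to illustrate the calculation.
x_data = [0.0, 1.0, 2.0, 3.0, 4.0]
y_data = [1.1, 2.9, 5.2, 7.1, 8.8]

a, b = linear_least_squares(x_data, y_data)
E_min = sum_squared_errors(x_data, y_data, a, b)
dE_da, dE_db = gradient_of_E(x_data, y_data, a, b)

print(f"a = {a:.4f}, b = {b:.4f}, E = {E_min:.4f}")
print(f"dE/da = {dE_da:.2e}, dE/db = {dE_db:.2e}")
```

For this made-up data set the sums give a = 98/50 = 1.96 and b = 55/50 = 1.1. Nudging either coefficient away from these values increases E, which is a quick way to sanity-check a hand calculation.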