Relationships
Regression
PSLS chapter 4
© 2009 W.H. Freeman and Company

Objectives (PSLS chapter 4)
Regression
Regression lines
The least-squares regression line
Using technology
Facts about least-squares regression
Residuals
Influential observations
Cautions about correlation and regression
Association does not imply causation

Correlation tells us about strength (scatter) and direction of the linear relationship between two quantitative variables.
In addition, we would like to have a numerical description of how both variables vary together. And we would like to make predictions based on the observed association.
But which line best describes our data?

The least-squares regression line
The least-squares regression line is the unique line such that the sum of the signed vertical (y) distances between the data points and the line is zero and the sum of the squared vertical (y) distances is the smallest possible.
Distances between the points and the line are squared so all are positive values. This is done so that distances can be properly added (Pythagoras).

Facts about least-squares regression
1. The distinction between explanatory and response variables is essential in regression.
2. There is a close connection between correlation and the slope of the least-squares line.
3. The least-squares regression line always passes through the point $(\bar{x}, \bar{y})$.
4. The correlation r describes the strength of a linear relationship. The square of the correlation, $r^2$, is the fraction of the variation in the values of y that is explained by the least-squares regression of y on x.

Properties
The least-squares regression line can be shown to have this equation:

$\hat{y} = \bar{y} + r \frac{s_y}{s_x} (x - \bar{x})$, or $\hat{y} = a + bx$

$\hat{y}$ is the predicted y value (y hat)
b is the slope
a is the y-intercept
"a" is in units of y
"b" is in units of y / units of x

How to:
First we calculate the slope of the line, b, from statistics we already know:

$b = r \frac{s_y}{s_x}$

r is the correlation
$s_y$ is the standard deviation of the response variable y
$s_x$ is the standard deviation of the explanatory variable x

Once we know b, the slope, we can calculate a, the y-intercept:

$a = \bar{y} - b\bar{x}$

where $\bar{x}$ and $\bar{y}$ are the sample means of the x and y variables

This means that we don’t have to calculate a lot of squared distances to find the least-squares regression line for a data set. We can instead rely on the equation. But typically, we use a 2-var stats calculator or stats software.

BEWARE!!!
Not all calculators and software use the same convention:

$\hat{y} = a + bx$

Some use instead:

$\hat{y} = ax + b$

Make sure you know what YOUR calculator gives you for a and b before you answer homework or exam questions.

[Figure: software output, with the intercept, slope, r, and $R^2$ labeled]

The equation completely describes the regression line.
To plot the regression line, you only need to plug two x values into the
equation, get y, and draw the line that goes through those two points.
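
The recipe so far (slope from r, s_y and s_x, intercept from the two means, then two points from the equation) can be sketched in a few lines of Python. This is only an illustration: it uses the powerboat and manatee-death counts from the data table in these notes, and the names `correlation`, `boats`, and `deaths` are chosen here, not taken from any calculator or software package.

```python
from statistics import mean, stdev

# Powerboat registrations (thousands) and manatee deaths, copied from the
# data table in these notes.
boats = [447, 460, 481, 498, 513, 512, 526, 559, 585, 614,
         645, 675, 711, 719, 681, 679, 678, 696, 713, 732,
         755, 809, 830, 880, 944, 962, 978, 983, 1010, 1024]
deaths = [13, 21, 24, 16, 24, 20, 15, 34, 33, 33,
          39, 43, 50, 47, 55, 38, 35, 49, 42, 60,
          54, 66, 82, 78, 81, 95, 73, 69, 79, 92]


def correlation(xs, ys):
    """Sample correlation r, computed from standardized values."""
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)
    n = len(xs)
    return sum((x - x_bar) / s_x * (y - y_bar) / s_y
               for x, y in zip(xs, ys)) / (n - 1)


r = correlation(boats, deaths)
b = r * stdev(deaths) / stdev(boats)   # slope: b = r * s_y / s_x
a = mean(deaths) - b * mean(boats)     # intercept: a = y_bar - b * x_bar

# Two points are enough to draw the line; both come from the equation,
# not from the sample.
x1, x2 = min(boats), max(boats)
p1 = (x1, a + b * x1)
p2 = (x2, a + b * x2)

print(f"y_hat = {a:.1f} + {b:.4f} x   (r^2 = {r * r:.4f})")
print("draw the line through", p1, "and", p2)
```

If the table has been copied correctly, the slope, intercept, and r^2 printed here should agree closely with the values shown on the manatee plot (0.1301, −43.7, and 0.9061).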
Hint: The regression line always passes through the mean of x and y.

The points you use for drawing the regression line are derived from the equation. They are NOT points from your sample data (except by pure coincidence).

The distinction between explanatory and response variables is crucial in regression. If you exchange y for x in calculating the regression line, you will get the wrong line.
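
A small sketch of what "the wrong line" means. The data below are hypothetical, invented only for this illustration, and `least_squares` is a helper name chosen here. The code fits y on x, then x on y, and rewrites the second line in the form y = ... so both can be compared on the same axes.

```python
from statistics import mean


def least_squares(xs, ys):
    """Return (intercept, slope) of the least-squares line of ys on xs.

    The slope s_xy / s_xx is algebraically the same as r * s_y / s_x.
    """
    x_bar, y_bar = mean(xs), mean(ys)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    slope = s_xy / s_xx
    return y_bar - slope * x_bar, slope


# Hypothetical data, invented for this illustration only.
x = [1, 2, 3, 4, 5, 6]
y = [2.0, 2.9, 3.2, 4.8, 4.9, 6.2]

a_yx, b_yx = least_squares(x, y)   # correct roles: y regressed on x
a_xy, b_xy = least_squares(y, x)   # roles exchanged: x regressed on y

# Rewrite x = a_xy + b_xy * y as y = swapped_intercept + swapped_slope * x.
swapped_slope = 1 / b_xy
swapped_intercept = -a_xy / b_xy

print(f"y on x: y = {a_yx:.3f} + {b_yx:.3f} x")
print(f"x on y: y = {swapped_intercept:.3f} + {swapped_slope:.3f} x")
```

Both lines pass through the point of means, but their slopes differ unless the points fall exactly on a straight line.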
Regression examines the distance of all points from the line in the y direction only.

Data from a study of the effect of fidgeting on weight gain: These two lines are the two regression lines calculated either correctly (x = nonexercise activity, y = fat gain, solid line) or incorrectly (x = fat gain, y = nonexercise activity, dotted line).

Correlation and regression
The correlation is a measure of spread (scatter) in both the x and y directions in the linear relationship.
In regression we examine the variation in the response variable (y) given change in the explanatory variable (x).

Coefficient of determination, $r^2$
$r^2$, the coefficient of determination, is the square of the correlation coefficient.
$r^2$ represents the fraction of the variance in y that can be explained by changes in x. The unexplained remainder is the vertical scatter from the regression line.

$b = r \frac{s_y}{s_x}$

r = −1, $r^2$ = 1: Changes in x explain 100% of the variations in y. y can be entirely predicted for any given value of x.
r = 0.87, $r^2$ = 0.76: Here the change in x only explains 76% of the change in y. The rest of the change in y (the vertical scatter, shown as red arrows) must be explained by something other than x.
r = 0, $r^2$ = 0: Changes in x explain 0% of the variations in y. The value(s) y takes show(s) no linear dependence on what value x takes.

r = 0.7, $r^2$ = 0.49: There is quite some variation in BAC for the same number of beers drunk. A person’s blood volume is a factor that was overlooked.
We changed number of beers to number of beers/weight of a person in pounds.
r = 0.9, $r^2$ = 0.81: In the first plot, number of beers only explains 49% of the variation in BAC. But number of beers/weight explains 81% of the variation in BAC.
Additional factors contribute to variations in BAC among individuals (like maybe some genetic ability to process alcohol).

Grade performance
If class attendance explains 16% of the variation in grades, what is the correlation between percent of classes attended and grade?
1. We need to make an assumption: Attendance and grades are positively correlated. We also assume that the association between attendance and grades is linear. So r will be positive too.
2. $r^2$ = 0.16, so r = +√0.16 = +0.4. A weak correlation.

Residuals
The vertical distances from each point to the least-squares regression line are called residuals. The residuals of a least-squares line always sum to 0.
The pattern made by the residuals over x is actually very informative.
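
A minimal sketch of the residual calculation, taking each residual as observed y minus predicted y. The data are hypothetical and the helper name `least_squares` is chosen here for illustration. The last lines use the standard identity that, for a least-squares line, 1 − (sum of squared residuals) / (total sum of squares about the mean of y) equals $r^2$, the fraction of the variation in y explained by the regression.

```python
from statistics import mean


def least_squares(xs, ys):
    """Return (intercept, slope) of the least-squares line of ys on xs."""
    x_bar, y_bar = mean(xs), mean(ys)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    slope = s_xy / s_xx
    return y_bar - slope * x_bar, slope


# Hypothetical data, invented for this illustration only.
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [1.2, 2.3, 2.9, 4.4, 4.8, 6.1, 7.3, 7.6]

a, b = least_squares(x, y)
predicted = [a + b * xi for xi in x]
residuals = [yi - y_hat for yi, y_hat in zip(y, predicted)]  # observed - predicted

y_bar = mean(y)
ss_residual = sum(e ** 2 for e in residuals)      # unexplained variation
ss_total = sum((yi - y_bar) ** 2 for yi in y)     # total variation in y
r_squared = 1 - ss_residual / ss_total

for xi, e in zip(x, residuals):
    print(f"x = {xi}: residual = {e:+.3f}")
print(f"sum of residuals = {sum(residuals):.2e}")
print(f"r^2 = {r_squared:.3f}")
```

The printed sum of residuals is zero up to floating-point rounding. Plotting `residuals` against `x` gives the residual plot.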
Points above the line have a positive residual (underestimation).
Points below the line have a negative residual (overestimation).

residual = observed y − predicted y = $y - \hat{y}$

Residual plots
Residuals are the distances between y-observed and y-predicted. We plot them in a residual plot.
If residuals are scattered randomly around 0, chances are your data fit a linear model, were normally distributed, and you didn’t have outliers.

[Figure: scatterplot and residual plot]
The x-axis in a residual plot is the same as on the scatterplot. Only the y-axis is different.
The line on both plots is the regression line.

Residuals are randomly scattered: good!
A curved pattern means the relationship you are looking at is not linear.
A change in variability across the plot is a warning sign. You need to find out why it is there, and remember that predictions made in areas of larger variability will not be as good.

Outliers and influential points
Outlier: An observation that lies outside the overall pattern of
observations.
“Influential individual”: An observation that markedly changes the
regression if removed. This is often an outlier on the x-axis.
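
The difference between the two kinds of unusual point can be sketched with a small example. The numbers are hypothetical, invented for this illustration, and `least_squares` is a helper name chosen here. The same extreme y value is added once far out in x and once at the mean of x.

```python
from statistics import mean


def least_squares(xs, ys):
    """Return (intercept, slope) of the least-squares line of ys on xs."""
    x_bar, y_bar = mean(xs), mean(ys)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    slope = s_xy / s_xx
    return y_bar - slope * x_bar, slope


# Hypothetical data: five points with almost no trend.
x = [1, 2, 3, 4, 5]
y = [2.1, 1.9, 2.0, 2.1, 1.9]

a_base, b_base = least_squares(x, y)

# Extra point far out in x (outlier in the x direction).
a_xout, b_xout = least_squares(x + [15], y + [9.0])

# Extra point with the same y value, placed at the mean of x
# (outlier in the y direction only).
a_yout, b_yout = least_squares(x + [3], y + [9.0])

print(f"without extra point:   y = {a_base:.2f} + {b_base:.3f} x")
print(f"with outlier in x:     y = {a_xout:.2f} + {b_xout:.3f} x")
print(f"with outlier in y:     y = {a_yout:.2f} + {b_yout:.3f} x")
```

In this sketch the point far out in x pulls the slope strongly toward itself, while the point at the mean of x leaves the slope unchanged and only raises the intercept.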
[Figure: scatterplot with two labeled points]
Child 19 = outlier in y direction. Child 19 is an outlier of the relationship.
Child 18 = outlier in x direction. Child 18 is only an outlier in the x direction and thus might be an influential point.

[Figure: regression lines for all data, without child 18, and without child 19; points labeled "Outlier in y-direction" and "Influential"]
Are these points influential?

Correlation/regression using averages
Many regression or correlation studies use average data.
While this is appropriate, you should know that correlations based on averages are usually much higher than those based on the raw data.

The correlation is a measure of spread (scatter) in a linear relationship. Using averages greatly reduces the scatter. Therefore, r and $r^2$ are typically much stronger when averages are used.

[Figure: Boys] Each dot represents an average. The variation among boys per age class is not shown.
[Figure: Boys] These histograms illustrate that each mean represents a distribution of boys of a particular age.

Should parents be worried if their son does not match the point for his age?
If the raw values were used in the correlation instead of the mean, there would be a lot of spread in the y-direction, and thus, the correlation would be smaller.

That’s why typically growth charts show a range of values (here from 5th to 95th percentiles). This is a more comprehensive way of displaying the same information.

Making predictions: Interpolation
The equation of the least-squares regression allows you to predict y for any x within the range studied. This is called interpolating.

$\hat{y} = 0.0144x + 0.0008$

Nobody in the study drank 6.5 beers, but by finding the value of $\hat{y}$ from the regression line for x = 6.5, we would expect a BAC of 0.094 mg/ml.

$\hat{y} = 0.0144 \times 6.5 + 0.0008$
$\hat{y} = 0.0936 + 0.0008 = 0.0944$ mg/ml

There is a positive linear relationship between the number of powerboats registered and the number of manatee deaths.

[Figure: scatterplot of manatee deaths against powerboats (x1000), with fitted line y = 0.1301x − 43.7, $R^2$ = 0.9061]

The least-squares regression line is: $\hat{y} = 0.13x - 43.7$

If Florida were to limit the number of powerboat registrations to 500,000,
what could we expect for the number of manatee deaths in a year?

$\hat{y} = 0.13(500) - 43.7 \Rightarrow \hat{y} = 65 - 43.7 = 21.3$

Roughly 21 manatee deaths.

Powerboats (thousands)    Manatee deaths
447                       13
460                       21
481                       24
498                       16
513                       24
512                       20
526                       15
559                       34
585                       33
614                       33
645                       39
675                       43
711                       50
719                       47
681                       55
679                       38
678                       35
696                       49
713                       42
732                       60
755                       54
809                       66
830                       82
880                       78
944                       81
962                       95
978                       73
983                       69
1010                      79
1024                      92

Caution with regression
Do not use a regression on inappropriate data:
    Pattern in the residuals
    Presence of large outliers
    Clumped data falsely appearing linear
Use residual plots for help.
Recognize when the correlation/regression is performed on averages.
A relationship, however strong, does not itself imply causation.
Beware of lurking variables.
Avoid extrapolating (going beyond interpolation).

Lurking variables
A lurking variable is a variable not included in the study design that
does or may have an effect on the variables studied.
Lurking variables can falsely suggest a relationship.

What is the lurking variable here? Some are more obvious than others.
Strong positive association between the number of firefighters at a fire site and the amount of damage a fire does.
Negative association between moderate amounts of wine-drinking and death rates from heart disease in developed nations.

Extrapolation
Extrapolation is the use of a regression line for predictions outside the range of x values used to obtain the line.
This can be misleading, as seen here.

The y-intercept
Sometimes the y-intercept is not biologically possible. Here we have negative BAC for zero beer drunk, which makes no sense…
[Figure: y-intercept shows negative blood alcohol]
But this negative y-intercept helps describe mathematically the regression line. Here, we didn’t collect data for “zero beer,” there is a lot of scatter overall and the line is just an estimate.

ALWAYS PLOT YOUR DATA!
The correlations all give r ≈ 0.816, and the regression lines are all approximately $\hat{y} = 3 + 0.5x$. For all four sets, we would predict $\hat{y} = 8$ when x = 10.
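
The slide does not name its four data sets, but the summary values quoted (r ≈ 0.816, ŷ ≈ 3 + 0.5x) match the well-known Anscombe quartet. Assuming that is the data meant here, the following sketch recomputes the summaries; the function name `fit` and the dictionary layout are choices made for this illustration.

```python
from math import sqrt
from statistics import mean


def fit(xs, ys):
    """Return (intercept, slope, r) for the least-squares line of ys on xs."""
    x_bar, y_bar = mean(xs), mean(ys)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    slope = s_xy / s_xx
    return y_bar - slope * x_bar, slope, s_xy / sqrt(s_xx * s_yy)


# Anscombe's quartet (assumed to be the four data sets on the slide).
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
quartet = {
    "I": (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96,
                 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10,
                  6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84,
                   6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04,
            5.25, 12.50, 5.56, 7.91, 6.89]),
}

results = {name: fit(xs, ys) for name, (xs, ys) in quartet.items()}
for name, (a, b, r) in results.items():
    print(f"set {name}: y_hat = {a:.2f} + {b:.3f} x, r = {r:.3f}, "
          f"prediction at x = 10: {a + 10 * b:.2f}")
```

The numerical summaries come out nearly identical for all four sets, so they cannot by themselves tell the sets apart.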
However, making the scatterplots shows us that the correlation/regression analysis is not appropriate for all data sets.
Moderate linear association; regression OK.
Obvious nonlinear relationship; regression inappropriate.
One point deviates from the (highly linear) pattern of the other points; requires further examination.
Just one very influential point and a series of other points all with the same x value; a redesign is due here…

Association and causation
Association, however strong, does NOT imply causation.
Only careful experimentation can show causation.

[Figure: "Children reading skills with shoe size"; scatterplot of reading index against shoe size, showing a strong positive linear relationship]
Not all examples are so obvious…

Vocabulary: lurking vs. confounding
LURKING VARIABLE
A lurking variable is a variable that is not among the explanatory or response variables in a study and yet may influence the interpretation of relationships among those variables.

CONFOUNDING
Two variables are confounded when their effects on a response variable cannot be distinguished from each other. The confounded variables may be either explanatory variables or lurking variables.

But you often see them used interchangeably…

Association and causation
Lung cancer is clearly associated with smoking.
How do we know that both variables are not being
affected by an unobserved third (lurking) variable?
For instance, what if a genetic mutation caused people to both get lung cancer
and become addicted to smoking, but smoking itself didn’t CAUSE lung cancer?

We can evaluate an association using the following criteria:
1) The association is strong.
2) The association is consistent.
3) Higher doses are associated with stronger responses.
4) The alleged cause precedes the effect.
5) The alleged cause is plausible. ...
This note was uploaded on 10/07/2011 for the course BSTT 400 taught by Professor Sallyfreels during the Fall '11 term at Ill. Chicago.