Week 3, Session one (Chapter 9, Regression Wisdom)
Professor Esfandiari

Regression Assumptions
Linearity: Checked by the scatterplot of Y vs. X
Equality of error variance: Checked by the scatterplot of e vs. X
Independence: Making sure that the data are independent (each subject has an equal
chance of being selected, and the choice of one participant does not depend on the
choice of another)

Checking for the assumption of linear relationship between the outcome (Y) and the predictor (X)

The best way to check for linearity of the relationship between two quantitative variables is not
to look at the scatterplot of Y vs. X; rather, it is to examine the scatterplot of the residuals
($e = Y - \hat{y}$, which we discussed in Chapter 8) vs. X, the predictor or independent variable.

If the relationship between Y and X is linear, there should be no pattern in the plot of
residuals vs. X, implying that the standard deviation of the residuals (Se) is similar for the different values of X. Notice that when we draw a line at the mean of the errors, which is equal to zero, there is no pattern in the data. The data
are equally scattered above and below the line.

[Figure: Scatterplot of standardized reading scores in 2002 and 2003]
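The residual checks described above can be sketched in Python. This is only an illustration on made-up numbers: the data and the helper names `fit_line` and `residuals` are assumptions, not part of the lecture, and splitting the residuals into a lower and an upper half of X is just one rough way to compare their spread.

```python
from statistics import mean, pstdev


def fit_line(x, y):
    """Return (intercept, slope) of the least squares line y = b0 + b1 * x."""
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sxy / sxx
    return y_bar - b1 * x_bar, b1


def residuals(x, y):
    """Residuals e = y - y_hat from the least squares line."""
    b0, b1 = fit_line(x, y)
    return [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]


# Hypothetical data: a straight-line trend plus small scatter.
x = list(range(1, 13))
scatter = [0.5, -0.4, 0.3, -0.6, 0.4, -0.2, 0.6, -0.5, 0.2, -0.3, 0.5, -0.5]
y = [10 + 2 * xi + si for xi, si in zip(x, scatter)]

e = residuals(x, y)
half = len(e) // 2
mean_e = mean(e)               # least squares residuals average to zero
sd_low_x = pstdev(e[:half])    # spread of residuals for the smaller X values
sd_high_x = pstdev(e[half:])   # spread of residuals for the larger X values

print(f"mean residual: {mean_e:.6f}")
print(f"SD of residuals, lower X half: {sd_low_x:.2f}")
print(f"SD of residuals, upper X half: {sd_high_x:.2f}")
```

When the two spreads are close and the residuals sit evenly around zero, the linearity and equal-variance checks look reasonable. A plot of e vs. X remains the primary tool; the numbers only summarize what the plot shows.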
The researchers equip emperor penguins with devices that record their heart
rates during dives. Here's a scatterplot of the Dive Heart Rate (beats per minute)
and the Duration (minutes) of dives by these high-tech penguins:

[Figure 9.1: The scatterplot of Dive Heart Rate in beats per minute (bpm) vs. Duration (minutes) shows a strong, roughly linear, negative association.]

The scatterplot looks fairly linear with a moderately strong negative association
($R^2 = 71.5\%$). The linear regression equation

$\widehat{\text{DiveHeartRate}} = 96.9 - 5.47\,\text{Duration}$

says that for longer dives, the average Dive Heart Rate is lower by about 5.47 beats per dive minute, starting from a value of 96.9 beats per minute.

The scatterplot of the residuals against Duration holds a surprise. The Linearity Assumption says we should not see a pattern, but instead there's a bend, starting high on the left, dropping down in the middle of the plot, and rising again at the right. Graphs of residuals often reveal patterns such as this that were easy to miss in the original scatterplot.

[Figure 9.2: Plotting the residuals against Duration reveals a bend. It was also in the original scatterplot, but here it's easier to see.]
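To see how a bend shows up in residuals, here is a minimal Python sketch. The data are hypothetical (a gently curved, decreasing relationship, not the penguin measurements), and averaging the residuals over the left, middle, and right thirds of X is only a crude stand-in for looking at the residual plot.

```python
from statistics import mean

# Hypothetical curved data: y decreases as x grows, but along a curve.
x = list(range(1, 16))
y = [100 - 9 * xi + 0.3 * xi ** 2 for xi in x]

# Fit a straight line by least squares.
x_bar, y_bar = mean(x), mean(y)
slope = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
         / sum((xi - x_bar) ** 2 for xi in x))
intercept = y_bar - slope * x_bar

# Residuals e = y - y_hat.
e = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]

# Average residual in the left, middle, and right thirds of X.
third = len(x) // 3
left = mean(e[:third])
middle = mean(e[third:2 * third])
right = mean(e[2 * third:])

print(left > 0, middle < 0, right > 0)  # True True True
```

The residuals are high on the left, low in the middle, and high again on the right, which is the same kind of bend described for Figure 9.2. For a truly linear relationship, all three averages would be close to zero.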
Outliers: Outliers stand away from the rest of the data. They can have a high residual or
high leverage. A high residual could be any point whose standardized residual (Z value) is larger than Z = +3 or Z
= +4, or smaller than Z = -3 or Z = -4. To decide how influential an outlier is, one can
analyze the data with and without the outlier and see how $R^2$ and the model change.

Lurking variables and their role in observational studies

Observational studies do not lead to causal conclusions because
- Control groups are not possible
- Random assignment of subjects to groups is not possible
Example: Studies have shown that among the elderly there is a relationship between
having a pet and depression: elderly people who have pets have a lower level
of depression. The only way a causal conclusion could be drawn would be to take
a group of elderly people, randomly assign them to own or not own a pet, and measure
their level of depression before and after owning a pet. Such an experiment is
unethical and not feasible. Thus, we cannot conclude that having pets leads to less
depression among the elderly. There are many other potential factors (called
lurking variables or confounding factors) that underlie owning a pet. Maybe elderly people
who own a pet are more loving, more giving, have more energy, have more zest for
life, etc. Thus, in observational studies, no matter how large the $R^2$, we cannot
conclude that one variable is the cause of the other.

There may be a common cause underlying both the predictor and the outcome

Suppose that a researcher finds a positive relationship between career
aspiration and the extra hours that employees choose to spend on the job. He cannot
conclude that career aspiration is the cause underlying the extra hours
that the employees choose to spend at work. Maybe there is a third factor, such as
passion toward the job, that underlies both career aspiration and the extra hours
that employees choose to spend at work.

4. The United Nations Development
Programme (UNDP) uses the Human Development Index (HDI) in an attempt to summarize in one number the
progress in health, education, and economics of a country.
The number of cell phone subscribers per 1000 people is
positively associated with economic progress in a country.
Can the number of cell phone subscribers be used to
predict the HDI? Here is a scatterplot of HDI against cell
phone subscribers:

[Figure: scatterplot of HDI vs. Cell Phone Subscribers]

a) Explain why fitting a linear model to these data might
be misleading.
b) If you fit a linear model to the data, what do you think a scatterplot of residuals versus predicted HDI will
look like?

Here's a scatterplot of the production
budgets (in millions of dollars) vs. the running time (in minutes) for major release movies in 2005. Dramas are
plotted in red and all other genres are plotted in black. A
separate least squares regression line has been fitted to each
group. For the following questions, just examine the plot.

[Figure: scatterplot of production budget vs. Run Time (minutes); legend: Drama, Other]

a) What are the units for the slopes of these lines?
b) In what way are dramas and other movies similar with respect to this relationship?
c) In what way are dramas different with respect to this relationship?

12. Each of the following scatterplots shows a cluster of points and one "stray" point. For each, answer these questions:
1) In what way is the point unusual? Does it have high leverage, a large residual, or both?
2) Do you think that point is an influential point?
3) If that point were removed from the data, would the correlation become stronger or weaker? Explain.
4) If that point were removed from the data, would the slope of the regression line increase or decrease? Explain.

25. For women, pregnancy lasts about 9 months. In other species of animals, the length of time
from conception to birth varies. Is there any evidence that
the gestation period is related to the animal's lifespan?
The first scatterplot shows Gestation Period (in days) vs.
Life Expectancy (in years) for 18 species of mammals.
The highlighted point at the far right represents humans.

[Figure: scatterplot of Gestation Period (days) vs. Life Expectancy (yr) for 18 species of mammals]

a) For these data, r = 0.54, not a very strong relationship.
Do you think the association would be stronger or
weaker if humans were removed? Explain.
b) Is there reasonable justification for removing humans
from the data set? Explain.
c) Here are the scatterplot and regression analysis for the
17 nonhuman species. Comment on the strength of the association.

[Figure: scatterplot of Gestation Period (days) vs. Life Expectancy (yr) for the 17 nonhuman species]

Dependent variable is: Gestation
R-squared = 72.2%

Variable    Coefficient
Constant    -39.5172
LifExp      15.4980

d) Interpret the slope of the line.
e) Some species of monkeys have a life expectancy of about 20 years. Estimate the expected gestation period
of one of these monkeys.
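The arithmetic behind parts a), c), and e) of the gestation exercise can be sketched in Python. This sketch reads the printed regression output as Constant = -39.5172 and LifExp = 15.4980; the function name `predicted_gestation_days` is made up for illustration. It only substitutes numbers given in the exercise and does not replace the written explanations the questions ask for.

```python
# Coefficients as read from the regression output for the 17 nonhuman species.
CONSTANT = -39.5172   # intercept, in days
SLOPE = 15.4980       # days of gestation per year of life expectancy


def predicted_gestation_days(life_expectancy_years):
    """Predicted gestation (days) from the fitted line."""
    return CONSTANT + SLOPE * life_expectancy_years


# Part e): a monkey species with a life expectancy of about 20 years.
monkey = predicted_gestation_days(20)
print(round(monkey, 1))  # 270.4

# Parts a) and c): compare the strength of the association on the R-squared scale.
r_with_humans = 0.54
r2_with_humans = r_with_humans ** 2   # about 0.29 with humans included
r2_without_humans = 0.722             # reported for the 17 nonhuman species
print(round(100 * r2_with_humans, 1), round(100 * r2_without_humans, 1))
```

Squaring r = 0.54 gives an R-squared of roughly 29% with humans included, against 72.2% once humans are removed. This is the with-and-without-outlier comparison described in the notes on outliers.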
This note was uploaded on 12/03/2011 for the course STATISTICS 10 taught by Professor Gould during the Fall '11 term at UCLA.