10_W3_S1_regresion_wisdom-chapter_9_

10_W3_S1_regresion_w - Week 3 — Session one(Chapter 9 Regression Wisdom Professor Esfandiari Regression Assumptions Linearity Checked by the


Week 3, Session one (Chapter 9, Regression Wisdom)
Professor Esfandiari

Regression Assumptions

Linearity: checked by the scatterplot of Y vs. X.
Equality of error variance: checked by the scatterplot of e vs. X.
Independence: making sure that the data are independent (each subject has an equal chance of being selected, and the choice of one participant does not depend on the choice of another).

Checking the assumption of a linear relationship between the outcome (Y) and the predictor (X)

The best way to check for linearity of the relationship between two quantitative variables is not to look at the scatterplot of Y vs. X; rather, it is to examine the scatterplot of the residuals (e = Y - ŷ, discussed in chapter 8) vs. X, the predictor or independent variable. If the relationship between Y and X is linear, there should be no pattern in the plot of residuals vs. X, implying that the standard deviation of the residuals (Se) is similar for the different values of X. Notice that when we draw a line at the mean of the errors, which is equal to zero, there is no pattern in the data. The data are equally scattered above and below the line.

[Figure: Scatterplot of standardized reading scores in 2002 and 2003. Axis labels: standardized reading scores for 2002; standardized reading scores 2003.]

The researchers equip emperor penguins with devices that record their heart rates during dives. Here's a scatterplot of the Dive Heart Rate (beats per minute) and the Duration (minutes) of dives by these high-tech penguins:

[Figure 9.1: The scatterplot of Dive Heart Rate in beats per minute (bpm) vs. Duration (minutes) shows a strong, roughly linear, negative association.]
The scatterplot looks fairly linear with a moderately strong negative association (R^2 = 71.5%). The linear regression equation

Predicted DiveHeartRate = 96.9 - 5.47 Duration

says that for longer dives, the average Dive Heart Rate is lower by about 5.47 beats per dive minute, starting from a value of 96.9 beats per minute.

The scatterplot of the residuals against Duration holds a surprise. The Linearity Assumption says we should not see a pattern, but instead there's a bend, starting high on the left, dropping down in the middle of the plot, and rising again at the right. Graphs of residuals often reveal patterns such as this that were easy to miss in the original scatterplot.

[Figure 9.2: Plotting the residuals against Duration reveals a bend. It was also in the original scatterplot, but here it's easier to see. Horizontal axis: Duration (mins).]

Outliers

Outliers stand away from the rest of the data. They can have a high residual or high leverage. A high residual could be any value with a standardized value (Z) larger than +3 or +4, or smaller than -3 or -4. To decide how influential an outlier is, one can analyze the data with and without the outlier and see how R^2 and the model change.

Lurking variables and their role in observational studies

Observational studies do not lead to causal conclusions because
- Control groups are not possible
- Random assignment of subjects to groups is not possible

Example: studies have shown that among the elderly there is a relationship between having a pet and depression, so that those elderly who have pets have a lower level of depression. The only way that a causal conclusion could be drawn would be to take a group of elderly people, randomly assign them to own or not own a pet, and measure their level of depression before and after owning a pet. Such an experiment is unethical and not doable.
Thus, we cannot conclude that having pets leads to less depression among the elderly. There are many other potential factors (called lurking variables or confounding factors) that underlie owning a pet. Maybe the elderly who own a pet are more loving, more giving, have more energy, have more zest for life, etc. Thus, in observational studies, no matter how large the R^2, we cannot conclude that one variable is the cause of the other.

There may be a common cause underlying both the predictor and the outcome

Suppose that a researcher finds a positive relationship between career aspiration and the extra hours that employees choose to spend on the job. He cannot conclude that career aspiration is the cause of the extra hours that the employees choose to spend at work. Maybe there is a third factor, such as passion for the job, that underlies both career aspiration and the extra hours that employees choose to spend at work.

The United Nations Development Programme (UNDP) uses the Human Development Index (HDI) in an attempt to summarize in one number the progress in health, education, and economics of a country. The number of cell phone subscribers per 1000 people is positively associated with economic progress in a country. Can the number of cell phone subscribers be used to predict the HDI? Here is a scatterplot of HDI against cell phone subscribers:

[Figure: Scatterplot of HDI (axis from 0.3 to 0.9) against Cell Phone Subscribers (axis from 0 to 1200).]

Explain why fitting a linear model to these data might be misleading. If you fit a linear model to the data, what do you think a scatterplot of residuals versus predicted HDI will look like?

Here's a scatterplot of the production budgets (in millions of dollars) vs. the running time (in minutes) for major release movies in 2005. Dramas are plotted in red and all other genres are plotted in black. A separate least squares regression line has been fitted to each group.
For the following questions, just examine the plot.

[Figure: Scatterplot with separate regression lines for Drama and Other movies. Horizontal axis: Run Time (minutes).]

What are the units for the slopes of these lines?
In what way are dramas and other movies similar with respect to this relationship?
In what way are dramas different with respect to this relationship?

12. Each of the following scatterplots shows a cluster of points and one "stray" point. For each, answer these questions:
1) In what way is the point unusual? Does it have high leverage, a large residual, or both?
2) Do you think that point is an influential point?
3) If that point were removed from the data, would the correlation become stronger or weaker? Explain.
4) If that point were removed from the data, would the slope of the regression line increase or decrease? Explain.

For women, pregnancy lasts about 9 months. In other species of animals, the length of time from conception to birth varies. Is there any evidence that the gestation period is related to the animal's lifespan? The first scatterplot shows Gestation Period (in days) vs. Life Expectancy (in years) for 18 species of mammals. The highlighted point at the far right represents humans.

[Figure: Scatterplot of Gestation Period (days) vs. Life Expectancy (yr) for 18 species of mammals.]

a) For these data, r = 0.54, not a very strong relationship. Do you think the association would be stronger or weaker if humans were removed? Explain.
b) Is there reasonable justification for removing humans from the data set? Explain.
c) Here are the scatterplot and regression analysis for the 17 nonhuman species. Comment on the strength of the association.

[Figure: Scatterplot of Gestation Period (days) vs. Life Expectancy (yr) for the 17 nonhuman species.]

Dependent variable is: Gestation
R-squared = 72.2%

Variable    Coefficient
Constant    -39.5172
LifExp      15.4980

d) Interpret the slope of the line.
e) Some species of monkeys have a life expectancy of about 20 years. Estimate the expected gestation period of one of these monkeys. ...
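
The two checks described in this session (plotting residuals against X to look for a bend, and refitting with and without an outlier) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: every data value below is synthetic and invented for the sketch, and the helper names fit_line, residuals, and r_squared are hypothetical, not taken from the course notes or the textbook.

```python
# Illustrative sketch only: all data are synthetic, not from the course notes.

def fit_line(x, y):
    """Return (intercept, slope) of the least squares line y_hat = b0 + b1 * x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    return y_bar - b1 * x_bar, b1


def residuals(x, y, b0, b1):
    """Residuals e = y - y_hat for each observation."""
    return [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]


def r_squared(x, y, b0, b1):
    """Fraction of the variation in y accounted for by the fitted line."""
    y_bar = sum(y) / len(y)
    sse = sum(e ** 2 for e in residuals(x, y, b0, b1))
    sst = sum((yi - y_bar) ** 2 for yi in y)
    return 1 - sse / sst


# 1) Linearity check: a gently curved (made-up) relationship.
x_curved = list(range(1, 13))
y_curved = [100 - 12 * xi + 0.4 * xi ** 2 for xi in x_curved]
b0_c, b1_c = fit_line(x_curved, y_curved)
e_curved = residuals(x_curved, y_curved, b0_c, b1_c)
# The residuals start high, dip below zero in the middle, and rise again:
# the bend that signals a violation of the linearity assumption.
print("residuals vs. x:", [round(e, 1) for e in e_curved])

# 2) Influence check: a nearly straight (made-up) data set plus one stray point.
noise = [0.5, -0.5, 0.3, -0.3, 0.2, -0.2, 0.4, -0.4, 0.1, -0.1]
x_line = list(range(1, 11))
y_line = [2 + 3 * xi + ni for xi, ni in zip(x_line, noise)]
x_all = x_line + [20]   # stray point far to the right (high leverage)
y_all = y_line + [10]   # and far below the trend (large residual)

fit_without = fit_line(x_line, y_line)
fit_with = fit_line(x_all, y_all)
r2_without = r_squared(x_line, y_line, *fit_without)
r2_with = r_squared(x_all, y_all, *fit_with)
print("slope without / with outlier:", round(fit_without[1], 2), round(fit_with[1], 2))
print("R^2 without / with outlier:", round(r2_without, 3), round(r2_with, 3))
```

In this made-up example the stray point changes both the slope and R^2 substantially, which is what the notes mean by an influential outlier. With real data the same comparison would be run on the observed values, and the residuals would normally be plotted rather than printed.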

This note was uploaded on 12/03/2011 for the course STATISTICS 10 taught by Professor Gould during the Fall '11 term at UCLA.
