M316 Chapter 5                                                          Dr. Berg

Regression

The simplest relation between two quantitative variables is a linear (straight-line) relationship. These are common and easy to understand.

Regression Lines

Definition
A regression line is a straight line that describes how a response variable y changes as an explanatory variable x changes. We often use a regression line to predict the value of y for a given value of x.

Example (5.1) Does fidgeting keep you slim?
State: Perhaps fidgeting and other "nonexercise activity" (NEA) would help explain why some people don't gain weight even when they overeat. Researchers overfed 16 healthy young adults for 8 weeks. They measured fat gain (in kilograms) and, as an explanatory variable, change in energy use (in calories). Here are the data.

  NEA change (cal)  Fat gain (kg)      NEA change (cal)  Fat gain (kg)
        -94              4.2                 392              3.8
        -57              3.0                 473              1.7
        -29              3.7                 486              1.6
        135              2.7                 535              2.2
        143              3.2                 571              1.0
        151              3.6                 580              0.4
        245              2.4                 620              2.3
        355              1.3                 690              1.1

Do people with larger increases in NEA tend to gain less fat?
Formulate: Make a scatterplot, measure the correlation, and draw a regression line.
Solve: The plot shows a moderately strong negative linear association with no outliers. The correlation is r = -0.7786. The line is a regression line for predicting fat gain from change in NEA.
Conclude: People with larger increases in nonexercise activity do gain less fat.

Review of Straight Lines

Suppose that y is a response variable and x is an explanatory variable. A straight line relating y to x has an equation of the form

    y = a + bx

In this equation, b is the slope (the rate of change of y with respect to x), and a is the intercept (the value of y when x = 0).
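The correlation quoted in the Solve step can be checked directly from the data. Here is a minimal sketch in Python (standard library only; the variable names are illustrative, not from the text):

```python
import math

# NEA change (calories) and fat gain (kg) for the 16 subjects in Example 5.1
nea = [-94, -57, -29, 135, 143, 151, 245, 355,
       392, 473, 486, 535, 571, 580, 620, 690]
fat = [4.2, 3.0, 3.7, 2.7, 3.2, 3.6, 2.4, 1.3,
       3.8, 1.7, 1.6, 2.2, 1.0, 0.4, 2.3, 1.1]

n = len(nea)
x_bar = sum(nea) / n   # mean NEA change
y_bar = sum(fat) / n   # mean fat gain

# Sums of squared deviations and cross-products
sxx = sum((x - x_bar) ** 2 for x in nea)
syy = sum((y - y_bar) ** 2 for y in fat)
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(nea, fat))

r = sxy / math.sqrt(sxx * syy)  # correlation, about -0.78
print(round(r, 4))
```

A value of r near -0.78 confirms the moderately strong negative association described in the Solve step.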
Example
Any straight line describing the NEA data has the form

    fat gain = a + (b × NEA change)

For the data in Example 5.1, the regression line has the equation

    fat gain = 3.505 - (0.00344 × NEA change)

The roles of the two numbers a and b in this equation are:
1) The slope b = -0.00344 tells us that fat gained goes down by 0.00344 kilogram for each added calorie of NEA.
2) The intercept a = 3.505 means that zero added calories of NEA results in a fat gain of 3.505 kilograms.

The Least-Squares Regression Line

Because the regression line is used to make predictions, we want the line that is as accurate as possible. The errors are the (vertical) distances between the actual value and the predicted value for each individual.

Example
One subject had a decrease of 57 calories in NEA. The line predicts a fat gain of

    y = 3.505 - 0.00344(-57) = 3.7 kilograms.

The actual fat gain was 3.0 kilograms. The prediction error is

    error = observed response - predicted response = 3.0 - 3.7 = -0.7 kilogram.

The usual method for making these errors as small as possible is the "least-squares" method.

Definition
The least-squares regression line is the line that makes the sum of the squares of the errors as small as possible.

Procedure
If x̄ and ȳ are the means, sx and sy are the standard deviations, and r is the correlation, then the least-squares regression line is given by

    ŷ = a + bx

where the slope is

    b = r(sy/sx)

and the intercept is

    a = ȳ - b·x̄.

Exercise
Two points uniquely determine a line. Let's find the line that best fits the three noncollinear points (1, 3), (2, 2), and (3, 4).

Using Technology
Least-squares regression is one of the most common statistical procedures. [Software outputs omitted in this preview.]

Facts About Least-Squares Regression
1) The distinction between the explanatory and response variables is essential. The errors being minimized are in the y direction, so reversing the roles of x and y has a radical effect on the line.
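A short sketch (in Python; the helper function is my own, not from the text) that both solves the three-point exercise above and illustrates Fact 1 by reversing the roles of x and y:

```python
def least_squares(xs, ys):
    """Slope and intercept of the least-squares line predicting ys from xs."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    # b = Sxy / Sxx, then a = y_bar - b * x_bar
    b = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
         / sum((x - x_bar) ** 2 for x in xs))
    a = y_bar - b * x_bar
    return a, b

# The three noncollinear points from the exercise
x = [1, 2, 3]
y = [3, 2, 4]

a, b = least_squares(x, y)    # y on x
a2, b2 = least_squares(y, x)  # x on y: a different line (Fact 1)
print(a, b)    # 2.0 0.5  -> y-hat = 2.0 + 0.5x
print(a2, b2)  # 0.5 0.5  -> x-hat = 0.5 + 0.5y, which is NOT the same line
```

Solving x-hat = 0.5 + 0.5y for y gives y = 2x - 1, visibly different from ŷ = 2 + 0.5x: minimizing vertical errors and minimizing horizontal errors are different problems.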
2) Because b = r(sy/sx), a change of one standard deviation in x results in a change of r standard deviations in y.
3) The least-squares line always passes through the point (x̄, ȳ) on the graph of y against x.
4) The correlation r describes the strength of the straight-line relationship. The square of the correlation, r², is the fraction of the variation in the values of y that is explained by the least-squares regression of y on x. Algebraically:

    r² = (variation in ŷ as x pulls it along the line) / (total variation in observed values of y).

Example (5.4) Using r²
For the NEA data, r = -0.7786 and r² = 0.6062, so about 61% of the variation in fat gained is accounted for by the linear relationship with the change in NEA. The other 39% is due to other things.

Residuals

Large deviations from the regression line show up in the residuals.

Definition
A residual is the difference between an observed value of the response variable and the value predicted by the regression line. Thus:

    residual = observation - prediction = y - ŷ.

Example (5.5) I Feel Your Pain
Empathy means being able to understand what others feel. In an experiment, the subjects were 16 couples in their mid-twenties who were married or had been dating for at least two years. The male partner of each couple was zapped with electricity while the female partner watched and had the area of her brain that responds to pain monitored. Each woman also completed a test measuring her empathy. Will women who are higher in empathy respond more strongly when their partner has a painful experience? Here are the data.

  Subject          1      2      3      4      5      6      7      8
  Empathy Score   38     53     41     55     56     61     62     48
  Brain Activity   0.12   0.392  0.005  0.369  0.016  0.415  0.107  0.506

  Subject          9     10     11     12     13     14     15     16
  Empathy Score   43     47     56     65     19     61     32    105
  Brain Activity   0.153  0.745  0.255  0.574  0.21   0.722  0.358  0.779

The resulting scatterplot and regression line are shown below.
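The residual definition above can be applied to the empathy data directly. A sketch in Python (it uses the values exactly as printed in the table, so any transcription errors there carry over):

```python
# Empathy score (x) and brain activity (y) for the 16 subjects, as printed above
empathy = [38, 53, 41, 55, 56, 61, 62, 48, 43, 47, 56, 65, 19, 61, 32, 105]
brain = [0.12, 0.392, 0.005, 0.369, 0.016, 0.415, 0.107, 0.506,
         0.153, 0.745, 0.255, 0.574, 0.21, 0.722, 0.358, 0.779]

n = len(empathy)
x_bar = sum(empathy) / n
y_bar = sum(brain) / n

# Least-squares slope and intercept: b = Sxy/Sxx, a = y_bar - b * x_bar
b = (sum((x - x_bar) * (y - y_bar) for x, y in zip(empathy, brain))
     / sum((x - x_bar) ** 2 for x in empathy))
a = y_bar - b * x_bar

# residual = observation - prediction = y - y_hat
residuals = [y - (a + b * x) for x, y in zip(empathy, brain)]
print(residuals[0])  # residual for subject 1
```

A useful check: the residuals from a least-squares fit always sum to (essentially) zero, which is why a residual plot scatters around the horizontal axis.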
The empathy score is the explanatory variable and brain activity is the response variable. The residual for subject 1 is detailed in the figure, followed by a plot of all the residuals.

Influential Observations

Notice that in the previous example, subject 16 is an outlier in the x direction. Because of its extreme position on the empathy scale, this point has a strong influence on the correlation. Dropping subject 16 reduces the correlation from r = 0.515 to r = 0.331. We say that subject 16 is influential for calculating the correlation.

Definition
An observation is influential for a statistical calculation if removing it would markedly change the result of the calculation. Points that are outliers in either the x or y direction are often influential for the least-squares regression line.

Example (5.6)
Is observation 16 also influential for the regression line? As seen in Figure 5.7, it is close to the regression line, so removing it makes little difference.

Cautions About Correlation and Regression

1) Correlation and regression lines describe only linear relationships.
2) Correlation and least-squares regression lines are not resistant.
3) Beware of extrapolation, the use of a regression line to make predictions far outside the range of values of the explanatory variable.
4) Beware of lurking variables: a lurking variable is one that is not among the explanatory or response variables in a study and yet may influence the interpretation of the relationships among those variables.

Example
One expert on heart disease noted that deaths from heart attacks recorded from year to year have a very high correlation with the number of TV antennas on houses. Did the number of TV antennas have a strong influence on heart disease?

Association Does Not Imply Causation

A strong association between two variables does not imply that one causes the other.
Example (5.8)
One study found that people who own two cars live longer than people who own only one car. Owning three cars is even better. Could we lengthen our lives by buying more cars? There is a lurking variable here that would tend to explain both longer life and having more cars: affluence.
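As a closing numeric check of the influential-observation discussion above, here is a sketch (in Python; the `corr` helper is my own) that recomputes the correlation for the empathy data with and without subject 16. Because it uses the values exactly as printed in Example 5.5, the results may differ slightly from the 0.515 and 0.331 quoted there, but the drop should still be clear:

```python
import math

def corr(xs, ys):
    """Pearson correlation of two equal-length lists."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Empathy data as printed in Example 5.5; subject 16 is the last entry
empathy = [38, 53, 41, 55, 56, 61, 62, 48, 43, 47, 56, 65, 19, 61, 32, 105]
brain = [0.12, 0.392, 0.005, 0.369, 0.016, 0.415, 0.107, 0.506,
         0.153, 0.745, 0.255, 0.574, 0.21, 0.722, 0.358, 0.779]

r_all = corr(empathy, brain)
r_without = corr(empathy[:-1], brain[:-1])  # drop subject 16
print(r_all, r_without)  # the correlation drops noticeably without subject 16
```

Removing one point changes r markedly, which is exactly what "influential for calculating the correlation" means.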
Fall '08