M316 Chapter 24 Dr. Berg

Inference for Regression

When a scatterplot shows a linear relationship between a quantitative explanatory variable x and a quantitative response variable y, we can use the least-squares line fitted to the data to predict y for a given value of x. When the data are a sample from a larger population, we need statistical inference to answer these questions:

• Is there really a linear relationship between x and y in the population, or might the pattern we see in the scatterplot plausibly arise just by chance?
• How large is the slope (rate of change) that relates y to x in the population, including a margin of error for our estimate of the slope?
• If we use the least-squares line to predict y for a given value of x, how accurate is our prediction (again, with a margin of error)?

Example (24.1) Crying and IQ

STATE: Infants who cry easily may be more easily stimulated than others. This may be a sign of higher IQ. Child development researchers explored the relationship between the crying of infants four to ten days old and their later IQ test scores. A snap of a rubber band on the sole of the foot caused the infants to cry. The researchers recorded the crying and measured its intensity by the number of peaks in the most active 20 seconds. They later measured the children's IQ at age three years using the Stanford-Binet IQ test. Here are the data.

crycount    10   20   17   12   12   16   19   12    9   23
iq          87   90   94   94   97  100  103  103  103  103

crycount    13   14   16   27   18   10   18   15   18   23
iq         104  106  106  108  109  109  109  112  112  113

crycount    15   21   16    9   12   12   19   16   20
iq         114  114  118  119  119  120  120  124  132

crycount    15   22   31   16   17   30   22   33   13
iq         133  135  135  136  141  155  157  159  162

FORMULATE: Make a scatterplot. If the relationship appears linear, use correlation and regression to describe it. Finally, ask whether there is a statistically significant relationship between crying and IQ.
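The correlation and least-squares line called for in the FORMULATE step can be computed directly from the data table. The following is a minimal Python sketch and is not part of the original notes: the helper name `least_squares` is hypothetical, and the two lists are transcribed from the table in Example 24.1 on the assumption that each crycount value pairs with the iq value printed beneath it.

```python
import math

# Crying intensity (peaks in the most active 20 seconds) and IQ at age three,
# transcribed from the table in Example 24.1 (38 infants).
crycount = [10, 20, 17, 12, 12, 16, 19, 12, 9, 23,
            13, 14, 16, 27, 18, 10, 18, 15, 18, 23,
            15, 21, 16, 9, 12, 12, 19, 16, 20,
            15, 22, 31, 16, 17, 30, 22, 33, 13]
iq = [87, 90, 94, 94, 97, 100, 103, 103, 103, 103,
      104, 106, 106, 108, 109, 109, 109, 112, 112, 113,
      114, 114, 118, 119, 119, 120, 120, 124, 132,
      133, 135, 135, 136, 141, 155, 157, 159, 162]


def least_squares(x, y):
    """Hypothetical helper: return (a, b, r) for the line y-hat = a + b*x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Sums of squares and cross-products about the means.
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b = sxy / sxx                    # least-squares slope
    a = y_bar - b * x_bar            # least-squares intercept
    r = sxy / math.sqrt(sxx * syy)   # correlation
    return a, b, r


a, b, r = least_squares(crycount, iq)
print(f"y-hat = {a:.2f} + {b:.3f} x,  r = {r:.3f},  r^2 = {r * r:.3f}")
```

If the transcription is right, the printed slope, intercept, and correlation should agree with the values reported in the SOLVE step of this example.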
SOLVE (first steps): Chapters 4 and 5 introduced the data analysis that must come before inference. The first steps we take are a review of this data analysis. The scatterplot that follows indicates a possible moderate linear relationship.

The correlation is $r = 0.455$ and the least-squares regression line is $\hat{y} = a + bx = 91.27 + 1.493x$.

CONCLUDE: Children who cry more vigorously do tend to have higher IQs. Because $r^2 = 0.207$, only about 21% of the variation in IQ scores is explained by crying intensity. Prediction of IQ will not be very accurate, but it is still an interesting relationship.

Conditions for Regression Inference

Our regression line is useful only if the relationship is actually linear. The values of a and b are computed from the data in the experiment or study, and are therefore statistics used to estimate the actual unknown parameters.

Conditions for Regression Inference
We have n observations on an explanatory variable x and a response variable y. Our goal is to study or predict the behavior of y for given values of x. We work with these assumptions.
• For any fixed value of x, the response y varies according to a Normal distribution. Repeated responses y are independent of each other.
• The mean response $\mu_y$ has a straight-line relationship with x given by a population regression line $\mu_y = \alpha + \beta x$, where α and β are unknown parameters.
• The standard deviation of y (call it σ) is the same for all values of x. The value of σ is unknown.
There are thus three population parameters that we must estimate from the data: α, β, and σ.

These conditions say that in the population there is an "on the average" straight-line relationship between y and x. The population regression line $\mu_y = \alpha + \beta x$ says that the mean response $\mu_y$ moves along a straight line as the explanatory variable x changes. The values of y that we observe vary about the means according to a Normal distribution.
If we hold x fixed and take many observations, the distribution of y will be Normal. The standard deviation σ determines the spread of y for a fixed x.

Estimating the Parameters

The first step in inference is to estimate the unknown parameters α, β, and σ.

Estimating the Parameters
When the conditions for regression are met and we calculate the least-squares line $\hat{y} = a + bx$, the slope b of the regression line is an unbiased estimator of the population slope β, and the intercept a of the least-squares line is an unbiased estimator of the population intercept α.

Example (24.2) Crying and IQ

The data in our example satisfy the condition of scatter about an invisible population regression line reasonably well. The least-squares line is $\hat{y} = 91.27 + 1.493x$. The slope is particularly important. The slope is the rate of change and says how much the response variable changes with a change of one unit in the explanatory variable. In this case, one more peak in crying intensity is associated with an increase of about 1.5 points in mean IQ. The intercept a = 91.27 has little meaning in this problem since we would never expect x to be zero.

The remaining parameter is the standard deviation σ, which describes the variability of the response y about the regression line. The residuals tell us how much y varies about the regression line.

Regression Standard Error
The regression standard error is
$$ s = \sqrt{\frac{1}{n-2}\sum \text{residual}^2} = \sqrt{\frac{1}{n-2}\sum (y - \hat{y})^2} $$
Use s to estimate the standard deviation σ of responses about the mean given by the population regression line.

Notice that $s^2$ is an average of squared deviations. The degrees of freedom are $n - 2$ since, if we know $n - 2$ of the residuals, the other two are determined. We usually use software to do the calculation.

Exercise (24.1) Coffee and Deforestation

Coffee is a leading export from several developing countries. When coffee prices are high, farmers often clear forest to plant more coffee trees.
Here are five years of data on prices paid to coffee growers in Indonesia and the percent of forest area lost in a national park that lies in a coffee-producing region:

Price (cents per pound)     29     40     54     55     72
Forest lost (percent)     0.49   1.59   1.69   1.82   3.10

a) Examine the data. Make a scatterplot with coffee price as the explanatory variable. What are the correlation r and the equation of the least-squares regression line? Do you think that coffee price will allow a good prediction of forest lost?
b) Explain in words what the slope β of the population regression line would tell us if we knew it. Based on the data, what are the estimates of β and the intercept α of the population regression line?
c) Calculate by hand the residuals for the five data points. Check that their sum is 0 (up to roundoff error). Use the residuals to estimate the standard deviation σ of percents of forest lost about the means given by the population regression line. You have now estimated all three parameters.

Using Technology

Basic "two variable statistics" calculators will find the slope and intercept of the least-squares regression line. Inference about regression requires in addition the regression standard error s. What follows is the output from a graphing calculator, two statistics programs, and a spreadsheet program using the crycount and IQ data from our first example.

Testing the Hypothesis of No Linear Relationship

Example 24.1 asked, "Do children with higher crying counts tend to have higher IQ?" Data analysis supports this conjecture, but is the positive correlation statistically significant? That is, is it too strong to often occur just by chance? To answer this question, test hypotheses about the slope β of the population regression line: $H_0: \beta = 0$ versus $H_a: \beta > 0$. Having slope equal to zero says that there is no straight-line relationship between the explanatory and response variables.
The test statistic is just the standardized version of the least-squares slope b and is a t statistic.

Significance Test for Regression Slope
To test the hypothesis $H_0: \beta = 0$, compute the t statistic
$$ t = \frac{b}{SE_b} $$
where the standard error of the least-squares slope b is
$$ SE_b = \frac{s}{\sqrt{\sum (x - \bar{x})^2}}. $$
The sum is over all observations on the explanatory variable x. In terms of a random variable T having the $t(n-2)$ distribution, the P-value for a test of $H_0$ against
    $H_a: \beta > 0$ is $P(T \ge t)$
    $H_a: \beta < 0$ is $P(T \le t)$
    $H_a: \beta \ne 0$ is $2P(T \ge |t|)$.

Example (24.4) Crying and IQ: Is It Significant?

The hypothesis $H_0: \beta = 0$ says that crying has no straight-line relationship with IQ. We conjecture that there is a positive relationship, so we use the one-sided alternative $H_a: \beta > 0$. Software gives t = 3.07 and P = 0.002, so the evidence is strong.

Exercise (24.4) Coffee and Deforestation

Exercise 24.1 presents data on coffee prices and loss of forest in Indonesia. In that exercise, you estimated the parameters using only a two-variable statistics calculator. Software tells us that the least-squares slope is b = 0.0543 with standard error $SE_b = 0.0097$.
a) What is the t statistic for testing $H_0: \beta = 0$?
b) How many degrees of freedom does t have? Use Table C to approximate the P-value of t against the one-sided alternative $H_a: \beta > 0$. What do you conclude?

Testing Lack of Correlation

Testing the null hypothesis $H_0: \beta = 0$ is exactly the same as testing that there is no correlation in the population, since the correlation is zero exactly when the slope is zero. Because correlation also makes sense when there is no explanatory-response distinction, it is handy to be able to test correlation without doing regression. Table F gives critical values of the sample correlation r under the null hypothesis that the correlation is zero. See Example 24.5 in the textbook to see how this works.
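The claim that testing the slope is the same as testing the correlation can be checked numerically. The sketch below is an illustration added to these notes, not the author's calculation. It computes the regression standard error s, the standard error $SE_b$, and the slope t statistic for the crying and IQ data, then compares the result with $t = r\sqrt{n-2}/\sqrt{1-r^2}$. That second formula is a standard identity that the notes do not state; it is used here as an assumption about what "exactly the same" means. The data lists are transcribed from the table in Example 24.1.

```python
import math

# Data transcribed from Example 24.1 (38 infants).
crycount = [10, 20, 17, 12, 12, 16, 19, 12, 9, 23,
            13, 14, 16, 27, 18, 10, 18, 15, 18, 23,
            15, 21, 16, 9, 12, 12, 19, 16, 20,
            15, 22, 31, 16, 17, 30, 22, 33, 13]
iq = [87, 90, 94, 94, 97, 100, 103, 103, 103, 103,
      104, 106, 106, 108, 109, 109, 109, 112, 112, 113,
      114, 114, 118, 119, 119, 120, 120, 124, 132,
      133, 135, 135, 136, 141, 155, 157, 159, 162]

n = len(crycount)
x_bar = sum(crycount) / n
y_bar = sum(iq) / n
sxx = sum((x - x_bar) ** 2 for x in crycount)
syy = sum((y - y_bar) ** 2 for y in iq)
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(crycount, iq))

# Least-squares slope and intercept.
b = sxy / sxx
a = y_bar - b * x_bar

# Regression standard error: s = sqrt( sum(residual^2) / (n - 2) ).
residuals = [y - (a + b * x) for x, y in zip(crycount, iq)]
s = math.sqrt(sum(e ** 2 for e in residuals) / (n - 2))

# Standard error of the slope and the slope t statistic, df = n - 2.
se_b = s / math.sqrt(sxx)
t_slope = b / se_b

# The same statistic written in terms of the correlation r alone.
r = sxy / math.sqrt(sxx * syy)
t_corr = r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)

print(f"s = {s:.2f}, SE_b = {se_b:.4f}, df = {n - 2}")
print(f"t from slope = {t_slope:.3f}, t from correlation = {t_corr:.3f}")
```

The two t values agree up to floating-point rounding, which is why a table of critical values for r can stand in for the regression test. The P-value is left out because it needs the t distribution, which the Python standard library does not provide.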
Confidence Intervals for the Regression Slope

The slope β of the population regression line is usually the most important parameter in a regression problem. We often want to estimate β. The slope b of the least-squares regression line is an unbiased estimator of β. A confidence interval shows how accurate b is likely to be.

Confidence Interval for Regression Slope
A level C confidence interval for the slope β of the population regression line is
$$ b \pm t^* SE_b $$
Here $t^*$ is the critical value for the $t(n-2)$ density curve with area C between $-t^*$ and $t^*$.

Example (24.6) Crying and IQ

Software gives b = 1.4929 and $SE_b = 0.4870$. There are 38 data points, so we have 38 − 2 = 36 degrees of freedom. Table C has no row for df = 36, so a hand calculation would use the next smaller value, df = 30 ($t^* = 2.042$); software uses df = 36 exactly, which gives $t^* = 2.02809$. The 95% confidence interval for the population slope β is
$$ b \pm t^* SE_b = 1.4929 \pm (2.02809)(0.4870) = 1.4929 \pm 0.9877 $$
or 0.505 to 2.481.

Exercise (24.9) Coffee and Deforestation

Give a 95% confidence interval for β when b = 0.0543 and $SE_b = 0.0097$ with 5 data points.

Inference About Prediction

One of the most common reasons to fit a line to data is to predict the response to a particular value of the explanatory variable. We want a margin of error that describes how accurate the prediction is likely to be.

Example (24.7) Beer and Blood Alcohol

STATE: The EESEE story "Blood and Alcohol Content" describes a study in which 16 student volunteers at the Ohio State University drank a randomly assigned number of cans of beer. Thirty minutes later, a police officer measured their blood alcohol content (BAC) in grams of alcohol per deciliter of blood. Here are the data.

Student     1      2      3      4      5      6      7      8
Beers       5      2      9      8      3      7      3      5
BAC      0.10   0.03   0.19   0.12   0.04  0.095   0.07   0.06

Student     9     10     11     12     13     14     15     16
Beers       3      5      4      6      5      7      1      4
BAC      0.02   0.05   0.07   0.10  0.085   0.09   0.01   0.05

The students were equally divided between men and women and differed in weight and usual drinking habits.
Because of this variation, many students don't believe that number of drinks predicts blood alcohol well. Steve thinks he can drive legally 30 minutes after he finishes drinking 5 beers. The legal limit for driving is BAC 0.08 in all states. We want to predict Steve's blood alcohol content, using no information except that he drinks 5 beers.

FORMULATE: Regress BAC on number of beers. Use the regression line to predict Steve's BAC. Give a margin of error that allows us to have 95% confidence in our prediction.

SOLVE: The scatterplot and regression output show that the number of beers predicts BAC quite well. In fact $r^2 = 0.80$, so that number of beers explains 80% of the observed variation in BAC. To predict Steve's BAC after 5 beers, use the equation of the regression line:
$$ \hat{y} = -0.0127 + 0.0180x = -0.0127 + 0.0180(5) = 0.077. $$
That's dangerously close to the legal limit of 0.08. What about 95% confidence? The "predicted values" part of the output shows two 95% intervals. Which should we use?

To decide which interval to use, you must answer this question: do you want to predict the mean BAC for all students who drink 5 beers, or do you want to predict the BAC of one individual who drinks 5 beers? The actual prediction is the same, but the margin of error is different for the two kinds of predictions. Individual students who drink 5 beers don't all have the same BAC, so we need a larger margin of error to predict an individual's BAC with 95% confidence than to predict the mean for all students who drink 5 beers.

Write the given value of the explanatory variable x as $x^*$. In our example $x^* = 5$. To emphasize the distinction between the two types of predictions, we use different terms for the two intervals.
• To estimate the mean response, we use a confidence interval. It is an ordinary confidence interval for the mean response when $x = x^*$, which is $\mu_y = \alpha + \beta x^*$. This is a parameter, a fixed number whose value we don't know.
• To estimate an individual response y, we use a prediction interval. A prediction interval estimates a single random response y rather than a parameter like $\mu_y$. The response y is not a fixed number.

Example (24.8) Beer and Blood Alcohol

CONCLUDE: Steve is one individual, so we use the prediction interval. The confidence interval is labeled C.I. and the prediction interval is labeled P.I. We are 95% confident that Steve's BAC after 5 beers will lie between 0.032 and 0.122. The upper end of the range will get him arrested if he drives. The 95% confidence interval for the mean BAC of all students who drink 5 beers is much narrower, 0.066 to 0.088.

The meaning of a prediction interval is very much like the meaning of a confidence interval. A 95% prediction interval, like a 95% confidence interval, is right 95% of the time in repeated use.

Confidence and Prediction Intervals for Regression Response
A level C confidence interval for the mean response $\mu_y$ when x takes the value $x^*$ is
$$ \hat{y} \pm t^* SE_{\hat{\mu}}. $$
The standard error $SE_{\hat{\mu}}$ is
$$ SE_{\hat{\mu}} = s\sqrt{\frac{1}{n} + \frac{(x^* - \bar{x})^2}{\sum (x - \bar{x})^2}}. $$
A level C prediction interval for a single observation y when x takes the value $x^*$ is
$$ \hat{y} \pm t^* SE_{\hat{y}}. $$
The standard error $SE_{\hat{y}}$ is
$$ SE_{\hat{y}} = s\sqrt{1 + \frac{1}{n} + \frac{(x^* - \bar{x})^2}{\sum (x - \bar{x})^2}}. $$
In both intervals, $t^*$ is the critical value for the $t(n-2)$ density curve with area C between $-t^*$ and $t^*$.

The extra 1 under the square root sign in the standard error for prediction makes the interval wider. Both standard errors are multiples of the regression standard error s. The degrees of freedom are again $n - 2$, the degrees of freedom of s.

Exercise (24.12) Coffee and Deforestation: Prediction

Regarding Exercise 24.1, if the world coffee price next year is 60 cents per pound, what percent of the national park forest do you predict will be cleared? The next figure is part of the output of CrunchIt! for prediction when $x^* = 60$.
a) Which interval in the output is the proper 95% interval for predicting next year's loss of forest?
b) CrunchIt! gives only one of the standard errors used in prediction. It is $SE_{\hat{\mu}}$, the standard error for estimating the mean response. Use this fact along with the CrunchIt! output to give a 90% confidence interval for the mean percent of forest lost in years when the price is 60 cents per pound.
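The two interval formulas in the box "Confidence and Prediction Intervals for Regression Response" can be applied by hand to the beer and BAC data of Example 24.7. The following Python sketch is an added illustration, not the software output the notes refer to. The function name `regression_intervals` is hypothetical, and the critical value $t^* = 2.145$ is an assumption taken from a standard t table for 95% confidence with 16 − 2 = 14 degrees of freedom.

```python
import math

# Beers consumed and blood alcohol content for the 16 students in Example 24.7.
beers = [5, 2, 9, 8, 3, 7, 3, 5, 3, 5, 4, 6, 5, 7, 1, 4]
bac = [0.10, 0.03, 0.19, 0.12, 0.04, 0.095, 0.07, 0.06,
       0.02, 0.05, 0.07, 0.10, 0.085, 0.09, 0.01, 0.05]


def regression_intervals(x, y, x_star, t_star):
    """Hypothetical helper: prediction at x_star with both kinds of interval.

    Returns (y_hat, confidence interval for the mean response,
    prediction interval for a single response).
    """
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b = sxy / sxx
    a = y_bar - b * x_bar
    # Regression standard error with n - 2 degrees of freedom.
    s = math.sqrt(sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y)) / (n - 2))
    y_hat = a + b * x_star
    # Term shared by both standard errors: 1/n + (x* - x_bar)^2 / Sxx.
    shared = 1 / n + (x_star - x_bar) ** 2 / sxx
    se_mu = s * math.sqrt(shared)        # standard error for the mean response
    se_pred = s * math.sqrt(1 + shared)  # standard error for one new response
    conf_int = (y_hat - t_star * se_mu, y_hat + t_star * se_mu)
    pred_int = (y_hat - t_star * se_pred, y_hat + t_star * se_pred)
    return y_hat, conf_int, pred_int


# Assumed table value: t* for 95% confidence with df = 14.
T_STAR_95_DF14 = 2.145

y_hat, conf_int, pred_int = regression_intervals(beers, bac, 5, T_STAR_95_DF14)
print(f"predicted BAC at 5 beers: {y_hat:.3f}")
print(f"95% confidence interval (mean response): {conf_int[0]:.3f} to {conf_int[1]:.3f}")
print(f"95% prediction interval (one student):  {pred_int[0]:.3f} to {pred_int[1]:.3f}")
```

The only difference between the two intervals is the extra 1 inside the square root, so the prediction interval is always the wider of the two at the same $x^*$.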
This note was uploaded on 09/02/2010 for the course BIO 325 taught by Professor Saxena during the Spring '08 term at University of Texas at Austin.