Stat 250 Gunderson Lecture Notes
Chapters 5 and 14: Regression Analysis

"The invalid assumption that correlation implies cause is probably among the two or three most serious and common errors of human reasoning."
- Stephen Jay Gould, The Mismeasure of Man

Describing and assessing the significance of relationships between variables is very important in research. We will first learn how to do this in the case when the two variables are quantitative. Quantitative variables have numerical values that can be ordered according to those values. We will study the material from Chapters 5 and 14 together, merging the two chapters into one overall discussion of these ideas.

Main idea

We wish to study the relationship between two quantitative variables.

Generally one variable is the RESPONSE variable, denoted by y. This variable measures the outcome of the study and is also called the DEPENDENT variable (it is thought to depend on x).

The other variable is the EXPLANATORY variable, denoted by x. It is the variable that is thought to explain the changes we see in the response variable. The explanatory variable is also called the INDEPENDENT variable.

The first step in examining the relationship is to use a graph, a scatterplot, to display the relationship. We will look for an overall pattern and see if there are any departures from this overall pattern.

If a linear relationship appears to be reasonable from the scatterplot, we will take the next step of finding a model (an equation of a line) to summarize the relationship. The resulting equation may be used for predicting the response for various values of the explanatory variable. If certain assumptions hold, we can assess the significance of the linear relationship and make some confidence intervals for our estimations and predictions.

Let's begin with an example that we will carry throughout our discussions.
Graphing the Relationship: Exam 2 versus Final Scores

How well does the exam 2 score for a Stats 350 student predict their final exam score? Below are the scores for a random sample of n = 6 students from a previous term.

  Exam 2 Score    Final Exam Score
       33               53
       65               80
       44               78
       64               93
       60               88
       40               58

Response (dependent) variable y = FINAL EXAM SCORE.
Explanatory (independent) variable x = EXAM 2 SCORE.

Step 1: Examine the data graphically with a scatterplot. Add the points to the scatterplot below.

[Figure: blank scatterplot grid with axes to be labeled, y = final exam score and x = exam 2 score.]

Interpret the scatterplot in terms of ...
  overall form (does the average pattern look like a straight line or is it curved?)
  direction of association (positive or negative)
  strength of association (how much do the points vary around the average pattern?)
  any deviations from the overall form? None here!

Describing a Linear Relationship with a Regression Line

Regression analysis is the area of statistics used to examine the relationship between a quantitative response variable and one or more explanatory variables. A key element is the estimation of an equation that describes how, on average, the response variable is related to the explanatory variables. A regression equation can also be used to make predictions.

The simplest kind of relationship between two variables is a straight line; the analysis in this case is called linear regression.

Regression Line for Exam 2 vs Final

Remember the equation of a line? y = mx + b
In statistics we denote the regression line for a sample as $\hat{y} = b_0 + b_1 x$, where:
  $\hat{y}$ ("y-hat") = the predicted y or estimated y value
  $b_0$ = y-intercept = the estimated y when x = 0 (not always meaningful)
  $b_1$ = slope = how much of an increase or decrease we expect to see in y when x increases by 1 unit

Goal: To find a line that is "close" to the data points, that is, to find the "best fitting" line.

How? What do we mean by best? One measure of how well a line fits is to look at the "observed errors" in prediction.

[Figure: scatterplot with a possible line, marking an observed value, the predicted value on the line, and the observed error between them.]

Observed error if we used this line to predict = $y - \hat{y}$.
These differences $y - \hat{y}$ are called residuals.

So we want to choose the line for which the sum of squares of the observed errors (the sum of squared residuals) is the least. The line that does this is called the Least Squares Regression Line.

The equations for the estimated slope and intercept are given by:

$$b_1 = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sum (x - \bar{x})^2} = \frac{\sum (x - \bar{x})\,y}{\sum (x - \bar{x})^2} = \frac{S_{XY}}{S_{XX}} = r\,\frac{s_y}{s_x}$$

$$b_0 = \bar{y} - b_1 \bar{x}$$

The least squares regression line (estimated regression function) is:

$$\hat{y} = \hat{\mu}_y(x) = b_0 + b_1 x$$

The same equation serves two purposes: $\hat{y}$ predicts a single y at a given x, and $\hat{\mu}_y(x)$ estimates the average y for all individuals with that x. More on this distinction later, when we talk about prediction intervals versus confidence intervals for a mean.

To find this estimated regression line for our exam data by hand, it is easier if we set up a calculation table. By filling in this table and computing the column totals, we will have all of the main summaries needed to perform a complete linear regression analysis. Note that here we have n = 6 observations. The first five rows have been completed for you. In general, use SPSS or a calculator to help with the graphing and numerical computations!
  x (exam 2)  y (final)  x - x̄           (x - x̄)²        (x - x̄)y             y - ȳ           (y - ȳ)²
  33          53         33 - 51 = -18   (-18)² = 324    (-18)(53) = -954     53 - 75 = -22   (-22)² = 484
  65          80         65 - 51 = 14    (14)² = 196     (14)(80) = 1120      80 - 75 = 5     (5)² = 25
  44          78         44 - 51 = -7    (-7)² = 49      (-7)(78) = -546      78 - 75 = 3     (3)² = 9
  64          93         64 - 51 = 13    (13)² = 169     (13)(93) = 1209      93 - 75 = 18    (18)² = 324
  60          88         60 - 51 = 9     (9)² = 81       (9)(88) = 792        88 - 75 = 13    (13)² = 169
  40          58         40 - 51 = -11   (-11)² = 121    (-11)(58) = -638     58 - 75 = -17   (-17)² = 289
  Total: 306  450        0               940             983                  0               1300

$\bar{x} = 306/6 = 51$ and $\bar{y} = 450/6 = 75$

Slope Estimate: $b_1 = \dfrac{\sum (x - \bar{x})\,y}{\sum (x - \bar{x})^2} = \dfrac{983}{940} = 1.0457$

y-intercept Estimate: $b_0 = \bar{y} - b_1 \bar{x} = 75 - (1.0457)(51) = 21.67$
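The hand calculation above can be cross-checked with a few lines of code. The following is a minimal sketch, not part of the original notes, that uses only the Python standard library. The function name `least_squares` and the variable names are illustrative assumptions.

```python
# Exam data from the notes: x = exam 2 score, y = final exam score.
exam2 = [33, 65, 44, 64, 60, 40]
final = [53, 80, 78, 93, 88, 58]


def least_squares(x, y):
    """Return (b0, b1) for the least squares line y-hat = b0 + b1 * x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # S_XX = sum of squared deviations of x; S_XY = sum of cross-products.
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = s_xy / s_xx          # slope
    b0 = y_bar - b1 * x_bar   # intercept
    return b0, b1


b0, b1 = least_squares(exam2, final)
print(f"b0 = {b0:.3f}, b1 = {b1:.4f}")
```

The printed values should agree with the table-based estimates (slope near 1.0457, intercept near 21.67), apart from rounding in the hand calculation.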
Estimated Regression Line: $\hat{y} = b_0 + b_1 x = 21.67 + 1.046(x)$

Predict the final exam score for a student who scored 60 on exam 2.
$\hat{y} = 21.67 + 1.046(60) = 84.43$ points

Note: The 5th student had an exam 2 score of 60 and the observed final exam score was 88 points. Find the residual for the 5th observation.
Notation for a residual: $e_5 = y_5 - \hat{y}_5 = 88 - 84.43 = 3.57$

The residuals …
You found the residual for one observation. You could compute the residual for each observation. The following table shows each residual.

  x (exam 2)  y (final)  predicted value           residual        squared residual
                         ŷ = 21.67 + 1.046(x)      e = y - ŷ       e² = (y - ŷ)²
  33          53         56.19                     -3.19           10.18
  65          80         89.66                     -9.66           93.31
  44          78         67.69                     10.31           106.29
  64          93         88.61                     4.39            19.27
  60          88         84.43                     3.57            12.74
  40          58         63.51                     -5.51           30.36
  Total       --         --                        0 (rounded)     272.1

SSE = sum of squared errors (or residuals) = 272.1

Measuring Strength and Direction of a Linear Relationship with Correlation

The correlation coefficient r is a measure of the strength of the linear relationship between y and x.

Properties about the Correlation Coefficient r
1. r ranges from ... -1 to +1 (and it is unitless)
2. Sign of r indicates ... direction of the association
3. Magnitude of r indicates ... strength (r = -0.8 and r = +0.8 indicate equally strong linear associations)
   A "strong" r is discipline specific:
   r = 0.8 might be an important (or strong) correlation in engineering
   r = 0.6 might be a strong correlation in psychology or medical research
4. r ONLY measures the strength of the LINEAR relationship.

Some pictures:
[Figure: three example scatterplots of y versus x, labeled r = +0.7, a panel with |r| = 0.4 (sign lost in extraction), and r ≈ 0.]

The formula for the correlation (but we will get it from computer output or from $r^2$):

$$r = \frac{S_{XY}}{\sqrt{S_{XX}\,S_{YY}}}$$

Exam Scores Example: r = 0.889
Interpretation: A fairly strong positive linear association between exam 2 scores and final exam scores.

The square of the correlation $r^2$

The squared correlation coefficient $r^2$ always has a value between 0 and 1 and is sometimes presented as a percent. It can be shown that the square of the correlation is related to the sums of squares that arise in regression.

The responses (the final exam scores) in the data set are not all the same; they do vary. We would measure the total variation in these responses as $SSTO = \sum (y - \bar{y})^2$ (this was the last column total in our calculation table that we said we would use later).
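The residual table, SSE, SSTO, and the correlation for the exam data can be reproduced with a short sketch. This is an illustration rather than part of the original notes; it uses unrounded coefficients, so SSE comes out near 272.03 (the value SPSS reports) instead of the hand-rounded 272.1. All variable names are assumptions.

```python
import math

exam2 = [33, 65, 44, 64, 60, 40]   # x
final = [53, 80, 78, 93, 88, 58]   # y
n = len(exam2)

x_bar = sum(exam2) / n
y_bar = sum(final) / n
s_xx = sum((x - x_bar) ** 2 for x in exam2)
s_yy = sum((y - y_bar) ** 2 for y in final)          # this is SSTO
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(exam2, final))

b1 = s_xy / s_xx
b0 = y_bar - b1 * x_bar

# Residual = observed y minus predicted y for each student.
predicted = [b0 + b1 * x for x in exam2]
residuals = [y - y_hat for y, y_hat in zip(final, predicted)]

sse = sum(e ** 2 for e in residuals)     # variation not explained by the line
ssto = s_yy                              # total variation in the responses
r = s_xy / math.sqrt(s_xx * s_yy)        # correlation coefficient

for x, y, y_hat, e in zip(exam2, final, predicted, residuals):
    print(f"x={x:2d}  y={y:2d}  y_hat={y_hat:6.2f}  e={e:6.2f}")
print(f"SSE = {sse:.3f}, SSTO = {ssto:.1f}, r = {r:.3f}")
```

Least squares residuals always sum to zero (up to floating-point error), which is a quick sanity check on any hand-built residual table.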
[Figure: diagram of the fitted line with labels "Total variation in the y's" and "Variation not accounted for".]

Part of the reason why the final exam scores vary is that there is a linear relationship between final exam scores and exam 2 scores, and the study included students with different exam 2 scores. After fitting the least squares regression line, some variation of the responses around the line still remains. This amount of variation that is not accounted for by the linear relationship is called the SSE.

The amount of variation that is accounted for by the linear relationship is called the sum of squares due to the model (or regression), denoted by SSM (or sometimes as SSR).

So we have: SSTO = SSM + SSE

It can be shown that

$$r^2 = \frac{SSTO - SSE}{SSTO} = \frac{SSM}{SSTO}$$

= the proportion of total variability in the responses that can be explained by the linear relationship with the explanatory variable x.

Note: As we will see, the value of $r^2$ and these sums of squares are summarized in an ANOVA table that is standard output from computer packages when doing regression.

Measuring Strength and Direction for Exam 2 vs Final

From our first calculation table we have: $SSTO = \sum (y - \bar{y})^2 = 1300$
From our residual calculation table we have: SSE = 272.1

So the squared correlation coefficient for our exam scores regression is:

$$r^2 = \frac{SSTO - SSE}{SSTO} = \frac{1300 - 272.1}{1300} = \frac{1027.9}{1300} = 0.791$$

Interpretation: 79.1% of the variation in final exam scores can be accounted for by its linear relationship with exam 2 scores.

The correlation coefficient is $r = \sqrt{r^2} = \sqrt{0.791} = 0.889$ (we take the positive root because the slope is positive).

Be sure you read Sections 5.4 and 5.5 (pages 171 – 177) for good examples and discussion on the following topics:
  Nonlinear relationships
  Detecting outliers and their influence on regression results. Outliers affect means and standard deviations, so they will affect b0 and b1.
  Dangers of extrapolation (predicting outside the range of your data)
  Dangers of combining groups inappropriately (Simpson's Paradox)

[Figure: scatterplot with y = reading level and x = # TV hours; circles = grade levels; Grade/Age = confounding variable.]

Correlation does not prove causation.

SPSS Regression Analysis for Exam 2 vs Final

Let's look at the SPSS output for our exam data. We will see that many of the computations are done for us.
Variables Entered/Removed (b)
  Model   Variables Entered                 Variables Removed   Method
  1       exam 2 scores (out of 75) (a)     .                   Enter
  a. All requested variables entered.
  b. Dependent Variable: final exam scores (out of 100)

Model Summary (b)
  Model   R          R Square   Adjusted R Square   Std. Error of the Estimate
  1       .889 (a)   .791       .738                8.24671
  a. Predictors: (Constant), exam 2 scores (out of 75)
  b. Dependent Variable: final exam scores (out of 100)

ANOVA (b)
  Model 1       Sum of Squares   df   Mean Square   F        Sig.
  Regression    1027.967         1    1027.967      15.115   .018 (a)
  Residual      272.033          4    68.008
  Total         1300.000         5
  a. Predictors: (Constant), exam 2 scores (out of 75)
  b. Dependent Variable: final exam scores (out of 100)

Coefficients (a)
                              Unstandardized Coefficients   Standardized Coefficients
  Model 1                     B         Std. Error          Beta                        t       Sig.
  (Constant)                  21.667    14.125                                          1.534   .200
  exam 2 scores (out of 75)   1.046     .269                .889                        3.888   .018
  a. Dependent Variable: final exam scores (out of 100)

Predicted y = yhat = 21.667 + 1.046(x)

Inference in Linear Regression Analysis

The material covered so far is presented in Chapter 5 and focuses on using the data for a sample to graph and describe the relationship. The slope and intercept values we have computed are statistics; they are estimates of the underlying true relationship for the larger population.

Chapter 14 focuses on making inferences about the relationship for the larger population. Here is a nice summary to help us distinguish between the regression line for the sample and the regression line for the population.

  Regression Line for the Sample:       $\hat{y} = b_0 + b_1 x$
  Regression Line for the Population:   $E(Y) = \beta_0 + \beta_1 x$

Aside: $E(Y) = \mu_Y(x)$ = mean response at a given x; sometimes called the regression function. It can take on many forms; we will consider the simple linear regression function: $\beta_0 + \beta_1 x$.

To do formal inference, we think of our b0 and b1 as estimates of the unknown parameters $\beta_0$ and $\beta_1$. Below we have the somewhat statistical way of expressing the underlying model that produces our data:

Linear Model: the response $y = [\beta_0 + \beta_1(x)] + \varepsilon$ = [Population relationship] + Randomness

This statistical model for simple linear regression assumes that for each value of x the observed values of the response (the population of y values) are normally distributed, varying around some true mean (that may depend on x in a linear way) with a standard deviation $\sigma$ that does not depend on x. This true mean is sometimes expressed as $E(Y) = \beta_0 + \beta_1(x)$. The components and assumptions of this statistical model are shown visually below.

[Figure: the population regression line with a normal curve of equal spread centered on the line at each of several x values.]

The $\varepsilon$ represents the true error term. These would be the deviations of a particular value of the response y from the true regression line. As these are the deviations from the mean, these error terms should have a normal distribution with mean 0 and constant standard deviation $\sigma$.

Now, we cannot observe these $\varepsilon$'s. However, we will be able to use the estimated (observable) errors, namely the residuals, to come up with an estimate of the standard deviation and to check the conditions about the true errors.

So what have we done, and where are we going?
1. Estimate the regression line based on some data. DONE!
2. Measure the strength of the linear relationship with the correlation. DONE!
3. Use the estimated equation for predictions. DONE!
4. Assess if the linear relationship is statistically significant.
5. Provide interval estimates (confidence intervals) for our predictions.
6. Understand and check the assumptions of our model.

We have already discussed the descriptive goals of 1, 2, and 3.
For the inferential goals of 4 and 5, we will need an estimate of the unknown standard deviation in regression, $\sigma$.

Estimating the Standard Deviation for Regression

The standard deviation for regression can be thought of as measuring the average size of the residuals. A relatively small standard deviation from the regression line indicates that individual data points generally fall close to the line, so predictions based on the line will be close to the actual values.

It seems reasonable that our estimate of this average size of the residuals be based on the residuals, using the sum of squared residuals and dividing by the appropriate degrees of freedom. Our estimate of $\sigma$ is given by:

$$s = \sqrt{\frac{\text{sum of squared residuals}}{n - 2}} = \sqrt{\frac{SSE}{n - 2}} = \sqrt{MSE}, \quad \text{where } SSE = \sum e_i^2 = \sum (y - \hat{y})^2$$

Note: Why n – 2? In estimating the mean response we had to estimate 2 quantities, the y-intercept and the slope; so we lose 2 df.

Estimating the Standard Deviation: Exam 2 vs Final

Below are the portions of the SPSS regression output that we could use to obtain the estimate of $\sigma$ for our regression analysis.

Model Summary (b)
  Model   R          R Square   Adjusted R Square   Std. Error of the Estimate
  1       .889 (a)   .791       .738                8.24671
  a. Predictors: (Constant), exam 2 scores (out of 75)
  b. Dependent Variable: final exam scores (out of 100)

ANOVA (b)
  Model 1       Sum of Squares   df   Mean Square   F        Sig.
  Regression    1027.967         1    1027.967      15.115   .018 (a)
  Residual      272.033          4    68.008
  Total         1300.000         5
  a. Predictors: (Constant), exam 2 scores (out of 75)
  b. Dependent Variable: final exam scores (out of 100)

The estimate is s = 8.24671 (the "Std. Error of the Estimate"), which is the square root of the residual mean square: $\sqrt{68.008} = 8.24671$.

Significant Linear Relationship?

Consider the following hypotheses: $H_0: \beta_1 = 0$ versus $H_a: \beta_1 \neq 0$

What happens if the null hypothesis is true?
If $\beta_1 = 0$ then $E(Y) = \beta_0$, a constant no matter what the value of x is; i.e., knowing x does not help to predict the response. So these hypotheses are testing if there is a significant nonzero linear relationship between y and x.
There are a number of ways to test this hypothesis. One way is through a t-test statistic (think about why it is a t and not a z test).

The general form for a t test statistic is:

$$t = \frac{\text{sample statistic} - \text{null value}}{\text{standard error of the sample statistic}}$$

We have our sample estimate for $\beta_1$; it is $b_1$. And we have the null value of 0. So we need the standard error for $b_1$. We could "derive" it, using the idea of sampling distributions (think about the population of all possible $b_1$ values if we were to repeat this procedure over and over many times). Here is the result:

t-test for the population slope $\beta_1$

To test $H_0: \beta_1 = 0$ we would use

$$t = \frac{b_1 - 0}{\text{s.e.}(b_1)}, \quad \text{where } \text{s.e.}(b_1) = \frac{s}{\sqrt{\sum (x - \bar{x})^2}}$$

and the degrees of freedom for the t-distribution are n – 2.

This t-statistic could be modified to test a variety of hypotheses about the population slope (different null values and various directions of extreme).

Try It! Significant Relationship between Exam 2 Scores and Final Scores?

Is there a significant (non-zero) linear relationship between the exam 2 scores and the final exam scores? (i.e., is exam 2 score a useful linear predictor for final exam score?) That is, test $H_0: \beta_1 = 0$ versus $H_a: \beta_1 \neq 0$ using a 5% level of significance.

1. $\text{s.e.}(b_1) = \dfrac{s}{\sqrt{\sum (x - \bar{x})^2}} = \dfrac{8.24671}{\sqrt{940}} = 0.269$
2. $t = \dfrac{b_1 - 0}{\text{s.e.}(b_1)} = \dfrac{1.046 - 0}{0.269} = 3.89$
3. Using Table A.3 with df = 6 – 2 = 4, we have p-value < 2(0.020) = 0.04.

We can reject H0 and conclude that the exam 2 score is a significant linear predictor of final exam score.

Think about it: Based on the results of the previous t-test conducted at the 5% significance level, do you think a 95% confidence interval for the true slope $\beta_1$ would contain the value of 0?

Confidence Interval for the population slope $\beta_1$

$$b_1 \pm t^* \, \text{s.e.}(b_1), \quad \text{where df = n – 2 for the } t^* \text{ value}$$

Compute the interval and check your answer. Could you interpret the 95% confidence level here?

1.046 ± (2.78)(0.269)  =>  1.046 ± 0.748  =>  (0.298, 1.794)
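The three quantities in this Try It (the estimate s, the standard error of the slope, and the t statistic), along with the confidence interval, can be sketched in code. This is an illustrative sketch, not part of the original notes: the summary values are taken from the tables above, the critical value t* = 2.78 is the table value the notes use for df = 4, and the variable names are assumptions.

```python
import math

# Summary quantities taken from the notes (exam 2 vs. final example).
n = 6
sse = 272.033          # sum of squared residuals, from the ANOVA table
s_xx = 940.0           # sum of (x - x_bar)^2, from the calculation table
b1 = 983.0 / 940.0     # slope estimate, about 1.046
t_star = 2.78          # table value for 95% confidence with df = 4 (as used in the notes)

df = n - 2
s = math.sqrt(sse / df)            # estimate of sigma: sqrt(MSE)
se_b1 = s / math.sqrt(s_xx)        # standard error of the slope
t_stat = (b1 - 0.0) / se_b1        # test statistic for H0: beta1 = 0

# 95% confidence interval for the population slope.
margin = t_star * se_b1
ci_low, ci_high = b1 - margin, b1 + margin

print(f"s = {s:.5f}, s.e.(b1) = {se_b1:.3f}, t = {t_stat:.2f}")
print(f"95% CI for slope: ({ci_low:.3f}, {ci_high:.3f})")
print("Reject H0 at the 5% level:", abs(t_stat) > t_star)
```

Comparing |t| with t* gives the same decision as the p-value approach for a two-sided test, and it matches the observation that the interval does not contain 0.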
Here t* = 2.78 comes from the t-distribution with df = 4 and 95% confidence.

Interpreting the confidence level: if this experiment were repeated many times, we'd expect 95% of the resulting confidence intervals to contain the population slope $\beta_1$.

Inference about the Population Slope using SPSS

Below are the portions of the SPSS regression output that we could use to perform the t-test and obtain the confidence interval for the population slope $\beta_1$.

Coefficients (a)
                              Unstandardized Coefficients   Standardized Coefficients
  Model 1                     B         Std. Error          Beta                        t       Sig.
  (Constant)                  21.667    14.125                                          1.534   .200
  exam 2 scores (out of 75)   1.046     .269                .889                        3.888   .018
  a. Dependent Variable: final exam scores (out of 100)

Note: There is a third way to test $H_0: \beta_1 = 0$ versus $H_a: \beta_1 \neq 0$. It involves another F-test from an ANOVA for regression.

ANOVA (b)
  Model 1       Sum of Squares   df   Mean Square   F        Sig.
  Regression    1027.967         1    1027.967      15.115   .018 (a)
  Residual      272.033          4    68.008
  Total         1300.000         5
  a. Predictors: (Constant), exam 2 scores (out of 75)
  b. Dependent Variable: final exam scores (out of 100)

* The t-test is more flexible than the F-test; the F-test handles only the two-sided alternative with null value 0. In simple linear regression the two agree: here F = 15.115 equals t² = (3.888)² up to rounding.

Predicting for Individuals versus Estimating the Mean

Consider the relationship between exam 2 and final exam scores …
Least squares regression line (or estimated regression function):
$\hat{y} = 21.67 + 1.046(x)$, which is also the estimate of $E(Y)$: $\widehat{E(Y)} = 21.67 + 1.046(x)$
We also have: s = 8.24671

How would you predict the final exam score for Barb who scored 60 points on exam 2?
$\hat{y} = 21.67 + 1.046(60) = 84.43$ points.

How would you estimate the mean final exam score for all students who scored 60 points on exam 2?
$\widehat{E(Y)} = 21.67 + 1.046(60) = 84.43$ points.

So our estimate for predicting a future observation and for estimating the mean response are found using the same least squares regression equation. What about their standard errors? (We would need the standard errors to be able to produce an interval estimate.)

Idea: Consider a population of individuals and a population of means:

[Figure: two distributions, a wide "Population of individuals" and a narrower "Population of means".]

What is the standard deviation for a population of individuals? $\sigma$
What is the standard deviation for a population of means? $\sigma / \sqrt{n}$
Which standard deviation is larger? The one for individuals.

So a prediction interval for an individual response will be wider than a confidence interval for a mean response.

Here are the (somewhat messy) formulas:

Confidence interval for a mean response:

$$\hat{y} \pm t^* \, \text{s.e.(fit)}, \quad \text{where } \text{s.e.(fit)} = s \sqrt{\frac{1}{n} + \frac{(x - \bar{x})^2}{\sum (x_i - \bar{x})^2}}, \quad \text{df} = n - 2$$

Here x is the value at which you are predicting or estimating, and $\sum (x_i - \bar{x})^2 = S_{XX}$ is a sum that comes up a lot.

Prediction interval for an individual response:

$$\hat{y} \pm t^* \, \text{s.e.(pred)}, \quad \text{where } \text{s.e.(pred)} = \sqrt{s^2 + \left(\text{s.e.(fit)}\right)^2}, \quad \text{df} = n - 2$$

The $s^2$ term is the extra variability that makes the prediction interval wider.

Try It! Exam 2 vs Final

Construct a 95% confidence interval for the mean final exam score for all students who scored x = 60 points on exam 2.
Recall: n = 6, $\bar{x} = 51$, $\sum (x - \bar{x})^2 = S_{XX} = 940$, $\hat{y} = 21.67 + 1.046(x)$, and s = 8.24671.

$\hat{y} = 21.67 + 1.046(60) = 84.43$ points, and t* = 2.78 (with df = 4)

$$\text{s.e.(fit)} = s \sqrt{\frac{1}{n} + \frac{(x - \bar{x})^2}{\sum (x_i - \bar{x})^2}} = 8.24671 \sqrt{\frac{1}{6} + \frac{(60 - 51)^2}{940}} = 4.147$$

$\hat{y} \pm t^* \, \text{s.e.(fit)}$  =>  84.43 ± (2.78)(4.147)  =>  84.43 ± 11.53  =>  (72.90, 95.96)

Construct a 95% prediction interval for the final exam score for an individual student who scored x = 60 points on exam 2.

$$\text{s.e.(pred)} = \sqrt{s^2 + \left(\text{s.e.(fit)}\right)^2} = \sqrt{(8.24671)^2 + (4.147)^2} = 9.23$$

$\hat{y} \pm t^* \, \text{s.e.(pred)}$  =>  84.43 ± 2.78(9.23)  =>  84.43 ± 25.66  =>  (58.77, 110.09)
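Both intervals at x = 60 can be computed side by side. The sketch below is an illustration under stated assumptions, not the notes' own procedure: it plugs in the rounded summary values quoted in the notes (b0 = 21.67, b1 = 1.046, s = 8.24671, t* = 2.78), and the function name `intervals_at` is hypothetical.

```python
import math

# Summary values quoted in the notes for the exam example.
n = 6
x_bar = 51.0
s_xx = 940.0
s = 8.24671
b0, b1 = 21.67, 1.046
t_star = 2.78          # table value for 95% confidence with df = n - 2 = 4


def intervals_at(x_new):
    """Return (y_hat, confidence interval for the mean, prediction interval)."""
    y_hat = b0 + b1 * x_new
    # Standard error for estimating the mean response at x_new.
    se_fit = s * math.sqrt(1.0 / n + (x_new - x_bar) ** 2 / s_xx)
    # Standard error for predicting one individual: adds s^2 for individual scatter.
    se_pred = math.sqrt(s ** 2 + se_fit ** 2)
    ci = (y_hat - t_star * se_fit, y_hat + t_star * se_fit)
    pi = (y_hat - t_star * se_pred, y_hat + t_star * se_pred)
    return y_hat, se_fit, se_pred, ci, pi


y_hat, se_fit, se_pred, ci, pi = intervals_at(60)
print(f"y_hat = {y_hat:.2f}, s.e.(fit) = {se_fit:.3f}, s.e.(pred) = {se_pred:.2f}")
print(f"95% CI for mean response:   ({ci[0]:.2f}, {ci[1]:.2f})")
print(f"95% PI for one individual:  ({pi[0]:.2f}, {pi[1]:.2f})")
```

Because s.e.(pred) always exceeds s.e.(fit), the prediction interval is wider than the confidence interval at every x, not just at x = 60.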
Note that the prediction interval IS WIDER than the confidence interval (and it even includes values above the maximum possible score of 100)!

[Figure: prediction interval and confidence interval bands shown on the scatterplot.]

Checking Assumptions in Regression

Let's recall the statistical way of expressing the underlying model that produces our data:

Linear Model: the response $y = [\beta_0 + \beta_1(x)] + \varepsilon$ = [Population relationship] + Randomness

where the $\varepsilon$'s, the true error terms, should be normally distributed with mean 0 and constant standard deviation $\sigma$, and this randomness is independent from one case to another.

Thus there are four essential technical assumptions required for inference in linear regression:
(1) Relationship is in fact linear.
(2) Errors should be normally distributed.
(3) Errors should have constant variance.
(4) Errors should not display obvious 'patterns'.

Now, we cannot observe these $\varepsilon$'s. However, we will be able to use the estimated (observable) errors, namely the residuals, to come up with an estimate of the standard deviation and to check the conditions about the true errors.

So how can we check these assumptions with our data and estimated model?
(1) Relationship is in fact linear. Examine the scatterplot of y versus x.
(2) Errors should be normally distributed. Examine a histogram or QQ plot of the residuals.
(3) Errors should have constant variance, and
(4) Errors should not display obvious 'patterns'. Examine a residual plot (plot the residuals against x); if it shows random scatter with no pattern in a horizontal band => ok.

If we saw …
[Figure: example residual plots; the examples are not recoverable from the extracted text.]

Let's turn to one last full regression problem that will include checking of the assumptions.

Relationship between height and foot length for College Men

The heights (in inches) and foot lengths (in centimeters) of 32 college men were used to develop a model for the relationship between height and foot length. The scatterplot and SPSS regression output are provided.

[Figure: scatterplot of foot length versus height.] Comment on scatterplot here!

Descriptive Statistics
            Mean    Std. Deviation   N
  foot      27.8    1.5497           32
  height    71.7    3.0579           32

Model Summary
  Model   R      R Square   Adjusted R Square   Std. Error of the Estimate
  1       .758   .574       .560                1.0280

ANOVA
  Model 1       Sum of Squares   df   Mean Square   F       Sig.
  Regression    42.74            1    42.74         40.45   .000001
  Residual      31.70            30   1.06
  Total         74.45            31

Coefficients
                 Unstandardized Coefficients   Standardized Coefficients
  Model 1        B       Std. Error            Beta                        t      Sig.
  (Constant)     .25     4.33                                              .06    .954
  height         .38     .06                   .758                        6.36   .000001

Also note that: $S_{XX} = \sum (x - \bar{x})^2 = 289.87$

a. How much would you expect foot length to increase for each 1-inch increase in height? Include the units.
   This is asking about the slope: 0.38 centimeters.

b. What is the correlation between height and foot length?
   r = 0.758 (would you be able to interpret the value of r²?)

c. Give the equation of the least squares regression line for predicting foot length from height.
   predicted y = yhat = 0.25 + 0.38(x)

d. Suppose Max is 70 inches tall and has a foot length of 28.5 centimeters. Based on the least squares regression line, what is the value of the prediction error (residual) for Max? Show all work.
   predicted y = yhat = 0.25 + 0.38(70) = 26.85
   observed y – predicted y = 28.5 – 26.85 = 1.65

e. Use a 1% significance level to assess if there is a significant positive linear relationship between height and foot length. State the hypotheses to be tested, the observed value of the test statistic, the corresponding p-value, and your decision.
   Hypotheses: $H_0: \beta_1 = 0$   $H_a: \beta_1 > 0$
   Test Statistic Value: 6.36
   p-value: 0.000001/2 = 0.0000005
   Decision: Reject H0 (the p-value is far below 0.01)
   Conclusion: Thus it appears there is a significant positive linear relationship between height and foot lengths for the population of college men represented by the sample.

f. Calculate a 95% confidence interval for the average foot length for all college men who are 70 inches tall. (Just clearly plug in all numerical values.)

$$\hat{y} \pm t^* \, s \sqrt{\frac{1}{n} + \frac{(x - \bar{x})^2}{\sum (x_i - \bar{x})^2}} = 26.85 \pm (2.04)(1.028) \sqrt{\frac{1}{32} + \frac{(70 - 71.7)^2}{289.87}}$$

   =>  26.85 ± 0.425  =>  (26.425, 27.275)

g. Consider the residual plot [figure not recoverable from the extracted text]. Does this plot support the conclusion that the linear regression model is appropriate?
   Yes. Explain: The plot shows a random scatter in a horizontal band around 0 with no pattern.
   Note: on the exam, students who said 'NO, because the variation appears to change with x' were marked as ok too.
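Parts (d) and (f) of the height and foot length problem are plug-in calculations, so they are easy to check numerically. The following sketch is an illustration only; it uses the rounded SPSS values quoted in the problem (b0 = 0.25, b1 = 0.38, s = 1.028, t* = 2.04), and the variable names are assumptions.

```python
import math

# Rounded values quoted in the problem (height in inches, foot length in cm).
b0, b1 = 0.25, 0.38
s = 1.028              # standard error of the estimate
n = 32
x_bar = 71.7           # mean height
s_xx = 289.87          # sum of (x - x_bar)^2
t_star = 2.04          # value used in the notes for 95% confidence, df = 30

# Part (d): residual for Max (70 inches tall, 28.5 cm foot length).
height_max, foot_max = 70, 28.5
y_hat = b0 + b1 * height_max
residual = foot_max - y_hat

# Part (f): 95% confidence interval for the mean foot length at height 70.
se_fit = s * math.sqrt(1.0 / n + (height_max - x_bar) ** 2 / s_xx)
margin = t_star * se_fit

print(f"predicted = {y_hat:.2f} cm, residual = {residual:.2f} cm")
print(f"95% CI for mean: {y_hat:.2f} +/- {margin:.3f}")
```

The margin should land close to the 0.425 shown in part (f); small differences in the third decimal come from rounding in the hand calculation.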
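Regression Formula Summary

Linear Regression Model
  Population Version:
    Mean: $\mu_Y(x) = E(Y) = \beta_0 + \beta_1 x$
    Individual: $y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$, where $\varepsilon_i$ is $N(0, \sigma)$
  Sample Version:
    Mean: $\hat{y} = b_0 + b_1 x$
    Individual: $y_i = b_0 + b_1 x_i + e_i$

Parameter Estimators
  $b_1 = \dfrac{S_{XY}}{S_{XX}} = \dfrac{\sum (x - \bar{x})(y - \bar{y})}{\sum (x - \bar{x})^2} = \dfrac{\sum (x - \bar{x})\,y}{\sum (x - \bar{x})^2}$
  $b_0 = \bar{y} - b_1 \bar{x}$

Residuals
  $e = y - \hat{y}$ = observed y – predicted y

Correlation and its square
  $r = \dfrac{S_{XY}}{\sqrt{S_{XX}\,S_{YY}}}$
  $r^2 = \dfrac{SSTO - SSE}{SSTO} = \dfrac{SSREG}{SSTO}$, where $SSTO = S_{YY} = \sum (y - \bar{y})^2$

Estimate of $\sigma$
  $s = \sqrt{MSE} = \sqrt{\dfrac{SSE}{n - 2}}$, where $SSE = \sum (y - \hat{y})^2 = \sum e^2$

Standard Error of the Sample Slope
  $\text{s.e.}(b_1) = \dfrac{s}{\sqrt{S_{XX}}} = \dfrac{s}{\sqrt{\sum (x - \bar{x})^2}}$

Confidence Interval for $\beta_1$
  $b_1 \pm t^* \, \text{s.e.}(b_1)$, df = n – 2

t-Test for $\beta_1$
  To test $H_0: \beta_1 = 0$: $t = \dfrac{b_1 - 0}{\text{s.e.}(b_1)}$, df = n – 2
  or $F = \dfrac{MSREG}{MSE}$, df = 1, n – 2

Confidence Interval for the Mean Response
  $\hat{y} \pm t^* \, \text{s.e.(fit)}$, df = n – 2, where $\text{s.e.(fit)} = s \sqrt{\dfrac{1}{n} + \dfrac{(x - \bar{x})^2}{S_{XX}}}$

Prediction Interval for an Individual Response
  $\hat{y} \pm t^* \, \text{s.e.(pred)}$, df = n – 2, where $\text{s.e.(pred)} = \sqrt{s^2 + \left(\text{s.e.(fit)}\right)^2}$

Standard Error of the Sample Intercept
  $\text{s.e.}(b_0) = s \sqrt{\dfrac{1}{n} + \dfrac{\bar{x}^2}{S_{XX}}}$

Confidence Interval for $\beta_0$
  $b_0 \pm t^* \, \text{s.e.}(b_0)$, df = n – 2

t-Test for $\beta_0$
  To test $H_0: \beta_0 = 0$: $t = \dfrac{b_0 - 0}{\text{s.e.}(b_0)}$, df = n – 2

Additional Notes
A place to … jot down questions you may have and ask during office hours, take a few extra notes, write out an extra practice problem or summary completed in lecture, create your own short summary about this chapter.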
Term: Winter '10. Instructor: Gunderson. Topics: Statistics, Correlation.