Stat 104: Quantitative Methods for Economists
Class 33: Regression Diagnostics

The Model and Data

Given X's X_1, X_2, ..., X_m, we assume the following model holds:

    Y_i = β1 + β2 X_i + ε_i,    ε_i ~ N(0, σ²) independent

[Scatterplot: price vs. size]

Given data, we (hope, assume) we see a linear pattern and a level of variation about the line. Our model is designed to capture these two features of the data.

In practice we need to check our assumption that the model captures the important features of the data. Is the model a good way to describe the data? This is called model checking, and it is done using the residuals from the regression.

Why do we have to check our model?

- All estimates, intervals, and hypothesis tests have been developed assuming that the model is correct.
- If the model is incorrect, then the formulas and methods we use are at risk of being incorrect.

To drive this point home, let's look at the "famous" Anscombe data sets...

Data Set 1:  y1 = 3.00 + 0.500 x1

Predictor   Coef     SE Coef   T      P
Constant    3.000    1.125     2.67   0.026
x1          0.5001   0.1179    4.24   0.002

s = 1.237   R-sq = 66.7%   R-sq(adj) = 62.9%

Analysis of Variance
Source       DF      SS       MS       F       P
Regression    1   27.510   27.510   17.99   0.002
Error         9   13.763    1.529
Total        10   41.273

[Scatterplot: y1 vs. x1]

Data Set 2:  y2 = 3.00 + 0.500 x2

Predictor   Coef     SE Coef   T      P
Constant    3.001    1.125     2.67   0.026
x2          0.5000   0.1180    4.24   0.002

s = 1.237   R-sq = 66.6%   R-sq(adj) = 62.9%

Analysis of Variance
Source       DF      SS       MS       F       P
Regression    1   27.500   27.500   17.97   0.002
Error         9   13.776    1.531
Total        10   41.276

[Scatterplot: y2 vs. x2]

Data Set 3:  y3 = 3.00 + 0.500 x3

Predictor   Coef     SE Coef   T      P
Constant    3.002    1.124     2.67   0.026
x3          0.4997   0.1179    4.24   0.002

s = 1.236   R-sq = 66.6%   R-sq(adj) = 62.9%

Analysis of Variance
Source       DF      SS       MS       F       P
Regression    1   27.470   27.470   17.97   0.002
Error         9   13.756    1.528
Total        10   41.226

[Scatterplot: y3 vs. x3]

Data Set 4:  y4 = 3.00 + 0.500 x4
Predictor   Coef     SE Coef   T      P
Constant    3.002    1.124     2.67   0.026
x4          0.4999   0.1178    4.24   0.002

s = 1.236   R-sq = 66.7%   R-sq(adj) = 63.0%

Analysis of Variance
Source       DF      SS       MS       F       P
Regression    1   27.490   27.490   18.00   0.002
Error         9   13.742    1.527
Total        10   41.232

[Scatterplot: y4 vs. x4]

Anscombe Conclusion

- No data is bad; it just might not meet your assumptions. The data could simply be naughty.
- If your assumptions aren't met, the computer output might appear perfectly reasonable, but in reality be uninterpretable.
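The point of the quartet is easy to verify numerically. The sketch below (not part of the original lecture) fits all four data sets by ordinary least squares using NumPy; the data values are the standard Anscombe (1973) quartet. Each fit reproduces essentially the same equation, y ≈ 3.00 + 0.500 x, with R-sq ≈ 66.7%, even though only Data Set 1 looks like a textbook linear relationship.

```python
import numpy as np

# Anscombe's quartet: four (x, y) data sets with nearly identical
# regression summaries but very different shapes.
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
quartet = {
    "1": (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "2": (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "3": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "4": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
          [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

def ols(x, y):
    """Simple-regression OLS: returns (intercept, slope, R-squared)."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    X = np.column_stack([np.ones_like(x), x])          # design matrix [1, x]
    (b0, b1), *_ = np.linalg.lstsq(X, y, rcond=None)   # least-squares fit
    resid = y - (b0 + b1 * x)                          # residuals e_i
    r2 = 1 - (resid @ resid) / ((y - y.mean()) @ (y - y.mean()))
    return b0, b1, r2

for name, (x, y) in quartet.items():
    b0, b1, r2 = ols(x, y)
    print(f"Data Set {name}: y = {b0:.2f} + {b1:.3f} x,  R-sq = {r2:.1%}")
```

Because the printed summaries are indistinguishable, the only way to tell the four situations apart is to inspect the residuals, e.g. by plotting `resid` against the fitted values: Data Set 2 shows curvature, Data Set 3 an outlier, and Data Set 4 a single high-leverage point.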
This note was uploaded on 03/27/2012 for the course STATS 104, taught by Professor Michael Parzen during the Fall '11 term at Harvard.