Statistical Data Mining
ORIE 474, Fall 2007
Tatiyana Apanasovich
10/10/07

Assumptions of the Multiple Linear Regression Model

1. Linearity: E(Y | X_1 = x_1, ..., X_K = x_K) = β_0 + β_1 x_1 + ... + β_K x_K.
2. Constant variance: the standard deviation of Y for the subpopulation of units with X_1 = x_1, ..., X_K = x_K is the same for all subpopulations.
3. Normality: the distribution of Y for the subpopulation of units with X_1 = x_1, ..., X_K = x_K is normal for all subpopulations.
4. Independence: the observations are independent.

Assumptions for linear regression and their importance to inferences

  Inference                                  Important assumptions
  Point prediction, point estimation         Linearity, independence
  Confidence interval for slope,             Linearity, constant variance, independence,
    hypothesis test for slope,                 normality (asymptotic normality)
    confidence interval for mean response
  Prediction interval                        Linearity, constant variance, independence,
                                               normality

Polynomials and Transformations in Multiple Regression

Example: Fast Food Locations. An analyst working for a fast-food chain is asked to construct a multiple regression model to identify new locations that are likely to be profitable. For a sample of 25 locations, the analyst has the annual gross revenue of the restaurant (y), the mean annual household income, and the mean age of children in the area. Data are in fastfoodchain.jmp.

Fast Food Chain Data (Response: Revenue)

Summary of Fit
  RSquare                     0.325221
  RSquare Adj                 0.263877
  Root Mean Square Error      111.6051
  Mean of Response            1085.56
  Observations (or Sum Wgts)  25

Analysis of Variance
  Source    DF  Sum of Squares  Mean Square  F Ratio  Prob > F
  Model      2       132070.87      66035.4   5.3016    0.0132
  Error     22       274025.29      12455.7
  C. Total  24       406096.16

Parameter Estimates
  Term       Estimate   Std Error  t Ratio  Prob > |t|
  Intercept  667.80548  132.3305     5.05     <.0001
  Income     11.429981  4.677122     2.44     0.0230
  Age        16.819467  7.999592     2.10     0.0472

[Figure: Residual by Predicted plot, residuals vs. predicted Revenue]

Checking Linearity

Plot the residuals versus each of the explanatory variables. Each of these plots should look like random scatter, with no pattern in the mean of the residuals. If the residual plots show a problem, we can try transforming the x-variable and/or the y-variable.

Residual plot in JMP: use Fit Y by X with Y being the residuals; Fit Line will then draw a horizontal line.

[Figure: residuals vs. Income]  [Figure: residuals vs. Age]

Residual by Predicted Plot

Fit Model displays the Residual by Predicted plot automatically in its output. It plots the residuals against the predicted Ys; we can think of the predicted Ys as summarizing all the information in the Xs. As usual, we would like this plot to show random scatter.
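The fit above comes from JMP, but the same quantities (estimates, fitted values, residuals, the ANOVA decomposition, and RSquare) can be computed with any least-squares routine. A minimal sketch in Python with NumPy; since the fastfoodchain.jmp data are not reproduced here, the numbers below are made-up stand-in values and every variable name is illustrative:

```python
import numpy as np

# Hypothetical stand-in for fastfoodchain.jmp (6 made-up locations):
# annual revenue (y), mean household income, mean age of children.
income = np.array([20.0, 25.0, 30.0, 22.0, 28.0, 35.0])
age    = np.array([ 5.0, 12.0,  7.0,  9.0,  6.0, 11.0])
y      = np.array([905.0, 1110.0, 1095.0, 1008.0, 1042.0, 1240.0])

# Design matrix with an intercept column: E(Y) = b0 + b1*income + b2*age
X = np.column_stack([np.ones_like(income), income, age])

# Least-squares estimates, as in the Parameter Estimates table
b, *_ = np.linalg.lstsq(X, y, rcond=None)

fitted = X @ b          # predicted Ys
resid  = y - fitted     # residuals

# ANOVA-style decomposition: SS(C. Total) = SS(Model) + SS(Error)
ss_total = np.sum((y - y.mean()) ** 2)
ss_error = np.sum(resid ** 2)
ss_model = ss_total - ss_error
r_square = ss_model / ss_total   # the RSquare in Summary of Fit

print("estimates (b0, b1, b2):", np.round(b, 3))
print("RSquare:", round(r_square, 4))
```

With the real 25-location data this would reproduce the table above (RSquare 0.325221 and so on); here the outputs are only as meaningful as the invented inputs.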
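The residual plots above are visual checks, but the idea behind them can also be given a crude numeric form: if linearity fails, the residuals from a straight-line fit correlate with a curvature term in x, whereas after a suitable transformation of the x-variable they do not. A sketch on deliberately nonlinear made-up data (all names and numbers are illustrative, not from the fast-food example):

```python
import numpy as np

# Made-up data where y depends on x quadratically, so a straight-line
# fit violates the linearity assumption.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
y = x**2 + np.array([0.3, -0.2, 0.1, -0.4, 0.2, -0.1, 0.3, -0.2])

# Fit the simple linear model y = b0 + b1*x and get residuals
X = np.column_stack([np.ones_like(x), x])
b, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ b

# Numeric stand-in for "look for a pattern in the residual plot":
# correlate residuals with a centered quadratic term. A large value
# means the residual-vs-x plot would show a U-shaped pattern.
curv = np.corrcoef(resid, (x - x.mean()) ** 2)[0, 1]

# Transforming the x-variable (here, using x**2) removes the pattern.
X2 = np.column_stack([np.ones_like(x), x**2])
b2, *_ = np.linalg.lstsq(X2, y, rcond=None)
resid2 = y - X2 @ b2
curv2 = np.corrcoef(resid2, (x - x.mean()) ** 2)[0, 1]

print("curvature signal, linear fit:     ", round(curv, 3))
print("curvature signal, after transform:", round(curv2, 3))
```

The first number is large (the residuals trace out the missed curvature); the second is much smaller, which is the numeric analogue of the residual plot turning into random scatter after the transformation.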
This note was uploaded on 12/23/2009 for the course ORIE 474 (Fall '07, Apanasovich) at Cornell University (Engineering School).
