
# multiple regression 3 ppt - Variable Selection Strategies...


Variable Selection Strategies
Section 15.4 and elsewhere

Multiple Regression 3: Variable Selection

## Variable Selection

- Up to now, we have used all variables available.
- Here we discuss how you might choose a "best" subset of variables.

## Approaches

- Using some kind of selection algorithm
- Using an "all possible" or "many" regressions approach

## Using a selection algorithm

- Backwards elimination: put all Xs in the model to start. Remove one at a time by using t-tests.
- Forward selection: use correlation to find the best X. Fit the model and get residuals. Use residual correlation to find the next X.

## The Backward Elimination Heuristic

1. Regress Y on all k potential X variables.
2. Use t-tests to determine which X has the least amount of significance.
3. If this X does not meet some minimum level of significance, remove it from the model.
4. Regress Y on the set of k-1 X variables.

Repeat 2-4 until all remaining Xs meet the minimum.

## Idea behind backward elimination

- If the t statistic is not significant, we can remove an X and simplify the model while still maintaining the model's high R².
- A typical stopping rule: continue until all Xs meet some target "significance level to stay" (often .10 or .15).
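The backward elimination steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the data are made up (not the newsprint data), the function names `invert`, `t_stats` and `backward_eliminate` are hypothetical, and the stopping rule uses a cutoff on |t| (`t_to_stay=2.0`) rather than a p-value so that no t-distribution table is needed.

```python
import math


def invert(a):
    """Invert a small square matrix by Gauss-Jordan elimination."""
    n = len(a)
    # Augment each row with the matching row of the identity matrix.
    m = [list(row) + [float(i == j) for j in range(n)] for i, row in enumerate(a)]
    for col in range(n):
        # Partial pivoting: move the largest entry in this column to the diagonal.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        scale = m[col][col]
        m[col] = [v / scale for v in m[col]]
        for r in range(n):
            if r != col:
                factor = m[r][col]
                m[r] = [v - factor * w for v, w in zip(m[r], m[col])]
    return [row[n:] for row in m]


def t_stats(y, xs):
    """Regress y on an intercept plus the columns in xs; return {name: t stat}."""
    names = list(xs)
    rows = [[1.0] + [xs[name][i] for name in names] for i in range(len(y))]
    p = len(names) + 1  # number of estimated coefficients, k + 1
    xtx = [[sum(r[a] * r[b] for r in rows) for b in range(p)] for a in range(p)]
    xty = [sum(r[a] * yi for r, yi in zip(rows, y)) for a in range(p)]
    inv = invert(xtx)
    beta = [sum(inv[a][b] * xty[b] for b in range(p)) for a in range(p)]
    sse = sum((yi - sum(c * v for c, v in zip(beta, r))) ** 2 for r, yi in zip(rows, y))
    s2 = sse / (len(y) - p)  # residual variance
    return {name: beta[j + 1] / math.sqrt(s2 * inv[j + 1][j + 1])
            for j, name in enumerate(names)}


def backward_eliminate(y, xs, t_to_stay=2.0):
    """Drop the least significant X until every remaining |t| >= t_to_stay."""
    kept = dict(xs)
    while kept:
        t = t_stats(y, kept)                           # steps 1 and 4: fit the model
        worst = min(t, key=lambda name: abs(t[name]))  # step 2: least significant X
        if abs(t[worst]) >= t_to_stay:
            break                                      # all remaining Xs meet the minimum
        del kept[worst]                                # step 3: remove it
    return list(kept)


# Made-up toy data: y depends strongly on x1 and only very weakly on x2.
x1 = [-1, -1, -1, -1, 1, 1, 1, 1]
x2 = [-1, -1, 1, 1, -1, -1, 1, 1]
noise = [-0.5, 0.5, 0.5, -0.5, 0.5, -0.5, -0.5, 0.5]
y = [10 + 5 * a + 0.1 * b + e for a, b, e in zip(x1, x2, noise)]

print(backward_eliminate(y, {"x1": x1, "x2": x2}))  # prints ['x1']
```

In practice a statistics package would do the fitting; the hand-rolled least squares here only keeps the sketch self-contained.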
## Backwards elimination with newsprint data

| Variables In | R² | Adj R² | S | Worst X | Action |
|---|---|---|---|---|---|
| ___________ | ______ | ______ | ______ | _________ | ________ |
| ___________ | ______ | ______ | ______ | _________ | ________ |
| ___________ | ______ | ______ | ______ | _________ | ________ |
| ___________ | ______ | ______ | ______ | _________ | ________ |
| ___________ | ______ | ______ | ______ | _________ | ________ |

## First iteration

| | Coefficients | Standard Error | t Stat | P-value |
|---|---|---|---|---|
| Intercept | 4153.9527 | 6620.8318 | 0.6274 | 0.5322 |
| NFamily | 0.0853 | 0.0664 | 1.2858 | 0.2022 |
| Retail\$ | 0.0398 | 0.0095 | 4.2031 | 0.0001 |
| TwoPaper | 1731.7032 | 898.0573 | 1.9283 | 0.0573 |
| PctAdult | 14.2979 | 73.3783 | 0.1949 | 0.8460 |
| MedianED | -490.7901 | 580.2036 | -0.8459 | 0.4001 |
| PctWhCol | -74.2575 | 82.9173 | -0.8956 | 0.3731 |

Order of importance (most to least significant, by p-value):

1. Retail\$
2. TwoPaper
3. NFamily
4. PctWhCol
5. MedianED
6. PctAdult

## Eliminating PctAdult

- We now have to run a 5-variable model without PctAdult, which is column E.
- For the Xs, select the first part, columns B, C and D.
- While holding the Ctrl key down, now select columns F and G.
- This should copy over just the 5 Xs you want.

## Second iteration

| | Coefficients | Standard Error | t Stat | P-value |
|---|---|---|---|---|
| Intercept | 5129.5976 | 4306.2815 | 1.1912 | 0.2370 |
| NFamily | 0.0858 | 0.0659 | 1.3016 | 0.1967 |
| Retail\$ | 0.0397 | 0.0094 | 4.2248 | 0.0001 |
| TwoPaper | 1717.2731 | 889.7330 | 1.9301 | 0.0571 |
| MedianED | -523.8256 | 551.6155 | -0.9496 | 0.3451 |
| PctWhCol | -67.0559 | 73.7879 | -0.9088 | 0.3661 |

It is close, but PctWhCol is now the least significant variable. We will now run a 4-variable model.

## 3rd and 4th iterations

- Next eliminate __________.
- Finally, try ____________.

## Forward Selection

- This approach builds "up".
- The first iteration always uses the X which is most highly correlated with Y.
- After that it gets too tedious to do by hand.
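The first two forward selection steps described above (pick the X most correlated with Y, fit, then correlate the residuals with the remaining Xs) can be sketched as follows. This is a minimal illustration on made-up data, not the newsprint data; the helper names `corr`, `residuals` and `first_two_forward_steps` are hypothetical, and the sketch leaves out the significance test a real procedure would apply before admitting each variable.

```python
import math


def corr(a, b):
    """Pearson correlation between two equal-length lists."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    sab = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    saa = sum((u - ma) ** 2 for u in a)
    sbb = sum((v - mb) ** 2 for v in b)
    return sab / math.sqrt(saa * sbb)


def residuals(y, x):
    """Residuals from the simple regression of y on x (with an intercept)."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((u - mx) * (v - my) for u, v in zip(x, y))
    sxx = sum((u - mx) ** 2 for u in x)
    slope = sxy / sxx
    intercept = my - slope * mx
    return [v - (intercept + slope * u) for u, v in zip(x, y)]


def first_two_forward_steps(y, xs):
    """Return the first two Xs that forward selection would consider."""
    # Step 1: the X most highly correlated with Y.
    first = max(xs, key=lambda name: abs(corr(xs[name], y)))
    # Step 2: fit Y on that X, then correlate the residuals with the other Xs.
    resid = residuals(y, xs[first])
    rest = {name: col for name, col in xs.items() if name != first}
    second = max(rest, key=lambda name: abs(corr(rest[name], resid)))
    return first, second


# Made-up toy data: y depends on x1 (strongly) and x2 (moderately), not on x3.
xs = {
    "x1": [-1, -1, -1, -1, 1, 1, 1, 1],
    "x2": [-1, -1, 1, 1, -1, -1, 1, 1],
    "x3": [-1, 1, -1, 1, -1, 1, -1, 1],
}
noise = [-0.5, 0.5, 0.5, -0.5, 0.5, -0.5, -0.5, 0.5]
y = [10 + 5 * a + 2 * b + e for a, b, e in zip(xs["x1"], xs["x2"], noise)]

print(first_two_forward_steps(y, xs))  # prints ('x1', 'x2')
```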
## Stepwise Regression

- Stepwise regression is a popular procedure that is in many of the major statistics packages like Minitab, SAS or SPSS.
- It combines the forward and backwards approaches.
- It fits a sequence of models, usually in a forward manner.
- If a variable that is "in" loses significance, it can be eliminated.

## PHStat Stepwise Regression

- Will automatically run the stepwise procedure.
- Can also be used to run forward selection or backwards elimination.
- You specify the stopping rule by choosing the p-value or t-value you want.
- Very fast with lots of output to sort through.

## Running the procedure

## Stepwise results (lots hidden)

Stepwise Analysis
Table of Results for General Stepwise

Retail\$ entered.

| | Coefficients | Standard Error | t Stat | P-value | Lower 95% | Upper 95% |
|---|---|---|---|---|---|---|
| Intercept | -3870.024982 | 506.6223749 | -7.638874977 | 2.82691E-11 | -4877.15691 | -2862.893053 |
| Retail\$ | 0.052177526 | 0.002225546 | 23.44481537 | 4.0645E-39 | 0.047753286 | 0.056601766 |

MedianED entered.

| | Coefficients | Standard Error | t Stat | P-value | Lower 95% | Upper 95% |
|---|---|---|---|---|---|---|
| Intercept | 8877.923093 | 4001.435205 | 2.218684706 | 0.029173189 | 921.9980469 | 16833.84814 |
| Retail\$ | 0.052807597 | 0.00212328 | 24.87075971 | 8.98204E-41 | 0.048585947 | 0.057029248 |
| MedianED | -1134.567428 | 353.5424888 | -3.209140241 | 0.001878686 | -1837.504598 | -431.6302572 |

TwoPaper entered.

| | Coefficients | Standard Error | t Stat | P-value | Lower 95% | Upper 95% |
|---|---|---|---|---|---|---|
| Intercept | 8219.415922 | 3966.554805 | 2.072180097 | 0.041315319 | 331.486843 | 16107.345 |
| Retail\$ | 0.051657483 | 0.00219096 | 23.5775599 | 8.53668E-39 | 0.047300519 | 0.056014446 |
| MedianED | -1085.503563 | 350.0302129 | -3.101171051 | 0.002624243 | -1781.577015 | -389.4301116 |
| TwoPaper | 1605.273664 | 891.4137403 | 1.800817725 | 0.075320593 | -167.4002831 | 3377.947611 |

No other variables could be entered into the model. Stepwise ends.
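When sorting through output like this, it can help to recompute a couple of columns by hand. The sketch below copies the final "TwoPaper entered" rows from the output above and checks two relationships: the t statistic is the coefficient divided by its standard error, and each 95% interval is the coefficient plus or minus a common multiple of the standard error. The dictionary name `final_model` is hypothetical, and the multiplier is backed out of the printed interval rather than taken from a t table.

```python
# Final stepwise model, copied from the "TwoPaper entered" output above.
# name: (coefficient, standard error, reported t stat, lower 95%, upper 95%)
final_model = {
    "Intercept": (8219.415922, 3966.554805, 2.072180097, 331.486843, 16107.345),
    "Retail$": (0.051657483, 0.00219096, 23.5775599, 0.047300519, 0.056014446),
    "MedianED": (-1085.503563, 350.0302129, -3.101171051, -1781.577015, -389.4301116),
    "TwoPaper": (1605.273664, 891.4137403, 1.800817725, -167.4002831, 3377.947611),
}

for name, (coef, se, t_reported, lower, upper) in final_model.items():
    t_stat = coef / se                       # t statistic = coefficient / standard error
    multiplier = (upper - lower) / (2 * se)  # critical value implied by the interval
    print(f"{name:10s} t = {t_stat:9.4f} (reported {t_reported:9.4f}), "
          f"interval = coef +/- {multiplier:.4f} * SE")
```

Every row implies the same multiplier, a little under 2, which is what a 95% interval from a t distribution looks like for a sample of this size. Note that TwoPaper's interval contains zero, matching its p-value of about 0.075.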
## "Many Regressions" Approach

- Even with the stepwise adjustment, we will never look at all possible combinations of Xs, so we may miss something.
- This is even more critical if there are missing values.
- Computer packages (SAS, Minitab) have efficient ways to search over all or many models.
- Will give a table showing which Xs are included, plus fit statistics.

## Best Subsets in PHStat

- Will fit every combo of Xs
- 6 one-variable models
- 15 two-variable models
- Etc.
- There are 2^6 = 64 possibilities

## Part of output

| Model | Cp | k+1 | R Square | Adj. R Square | Std. Error |
|---|---|---|---|---|---|
| X1 | 21.04458 | 2 | 0.855325 | 0.853642966 | 3398.539 |
| X1X2 | 9.911608 | 3 | 0.873413 | 0.870434367 | 3197.646 |
| X1X2X3 | 7.790583 | 4 | 0.879089 | 0.874770389 | 3143.684 |
| X1X2X3X4 | 9.592669 | 5 | 0.879361 | 0.873547316 | 3158.999 |
| X1X2X3X4X5 | 5.80203 | 6 | 0.887337 | 0.880466782 | 3071.353 |
| X1X2X3X4X5X6 | 7 | 7 | 0.888441 | 0.880177498 | 3075.067 |
| X1X2X3X4X6 | 5.715535 | 6 | 0.887456 | 0.880593173 | 3069.729 |
| X1X2X3X5 | 3.854134 | 5 | 0.887265 | 0.881831722 | 3053.767 |
| X1X2X3X5X6 | 5.037967 | 6 | 0.888389 | 0.881583269 | 3056.976 |
| X1X2X3X6 | 3.929167 | 5 | 0.887161 | 0.8817234 | 3055.166 |
| X1X2X4 | 11.65079 | 4 | 0.873772 | 0.869263962 | 3212.056 |
| X1X2X4X5 | 7.130998 | 5 | 0.882752 | 0.877101094 | 3114.293 |
| X1X2X4X5X6 | 8.718252 | 6 | 0.88332 | 0.876205452 | 3125.62 |
| X1X2X4X6 | 8.044994 | 5 | 0.881493 | 0.87578161 | 3130.966 |
| X1X2X5 | 5.206545 | 4 | 0.882648 | 0.878456411 | 3097.073 |
| X1X2X5X6 | 6.719545 | 5 | 0.883318 | 0.877695086 | 3106.758 |
| X1X2X6 | 6.188332 | 4 | 0.881295 | 0.877055934 | 3114.865 |
| X1X3 | 17.96049 | 3 | 0.862327 | 0.859088053 | 3334.72 |

## Comparing models

- Same # of Xs? Can use R² or Se.
- Different # of Xs? Adjusted R² or Se.

$$\text{Adj } R^2 = 1 - (1 - R^2)\,\frac{n-1}{n-k-1}$$

- Higher "penalty" with more Xs.

## Why not regular R²?

- It can be shown that R² will never decrease if another X is added.
- "Max R²" implies using every possible X.

## Another idea

- The Cp statistic (page 590) is a favorite of some people.
- You want to choose a model where Cp is smaller than k+1, where k = # Xs.
- Plot the Cp statistic against the number of terms in the model (nterms = k+1).

## Cp Plot

Plot of the Cp statistic (vertical axis, 0 to 30) against the number of terms (horizontal axis, 0 to 8), annotated "Use 4 Xs??".

...
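The comparison rules above can be applied directly to the "Part of output" table. The sketch below copies the Cp, R Square and Adj. R Square columns from that table, counts the possible subsets, recomputes adjusted R² from the formula, and lists the models that pass the Cp rule of thumb. One assumption is flagged loudly: the sample size `N = 88` is not stated in the slides; it is inferred here because it reproduces the reported adjusted R² values. The names `rows`, `n_x` and `adjusted_r2` are hypothetical.

```python
import math

# (model, Cp, R Square, Adj. R Square) copied from the "Part of output" table.
rows = [
    ("X1", 21.04458, 0.855325, 0.853642966),
    ("X1X2", 9.911608, 0.873413, 0.870434367),
    ("X1X2X3", 7.790583, 0.879089, 0.874770389),
    ("X1X2X3X4", 9.592669, 0.879361, 0.873547316),
    ("X1X2X3X4X5", 5.80203, 0.887337, 0.880466782),
    ("X1X2X3X4X5X6", 7.0, 0.888441, 0.880177498),
    ("X1X2X3X4X6", 5.715535, 0.887456, 0.880593173),
    ("X1X2X3X5", 3.854134, 0.887265, 0.881831722),
    ("X1X2X3X5X6", 5.037967, 0.888389, 0.881583269),
    ("X1X2X3X6", 3.929167, 0.887161, 0.8817234),
    ("X1X2X4", 11.65079, 0.873772, 0.869263962),
    ("X1X2X4X5", 7.130998, 0.882752, 0.877101094),
    ("X1X2X4X5X6", 8.718252, 0.88332, 0.876205452),
    ("X1X2X4X6", 8.044994, 0.881493, 0.87578161),
    ("X1X2X5", 5.206545, 0.882648, 0.878456411),
    ("X1X2X5X6", 6.719545, 0.883318, 0.877695086),
    ("X1X2X6", 6.188332, 0.881295, 0.877055934),
    ("X1X3", 17.96049, 0.862327, 0.859088053),
]

N = 88  # ASSUMPTION: sample size inferred from the table, not stated in the slides


def n_x(model):
    """Number of X variables in a label such as 'X1X2X5'."""
    return model.count("X")


def adjusted_r2(r2, n, k):
    """Adj R^2 = 1 - (1 - R^2)(n - 1)/(n - k - 1)."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)


# Number of subsets of 6 candidate Xs by size (size 0 is the intercept-only model).
subset_counts = [math.comb(6, k) for k in range(7)]
print(subset_counts, sum(subset_counts))  # [1, 6, 15, 20, 15, 6, 1] 64

# Largest gap between the recomputed and the reported adjusted R-square.
max_gap = max(abs(adjusted_r2(r2, N, n_x(m)) - adj) for m, _, r2, adj in rows)
print(f"largest adjusted R-square discrepancy: {max_gap:.1e}")

# Models that pass the rule of thumb: Cp smaller than k + 1.
passes_cp = [m for m, cp, _, _ in rows if cp < n_x(m) + 1]
print(passes_cp)

# Model with the highest adjusted R-square among the rows shown.
best_adj = max(rows, key=lambda row: row[3])[0]
print(best_adj)  # prints X1X2X3X5
```

Among the rows shown, the four-variable model X1X2X3X5 has both the smallest Cp and the highest adjusted R², which fits the "Use 4 Xs??" note on the plot. The full model always has Cp equal to k+1, so it never passes the strict rule.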