EXST7015 Fall2011 Lect14

Statistical Techniques II Page 58

To request the procedure, ask for the model option "selection=rsquare". I also included the options "start=3 stop=6 best=8"; this instructs SAS to start with 3-variable models, go up to 6 variables, and show the best 8 models for each number of variables. As requested, the RSQUARE selection option first produces the best eight 3-factor models (plus intercept).

Number in
Model    R-square      Variables in Model
3        0.49340010    LTOFSTAY CULRATIO SERVICES
3        0.48523075    LTOFSTAY CULRATIO NURSES
3        0.47356336    LTOFSTAY CULRATIO NOBEDS
3        0.47347050    LTOFSTAY CULRATIO CENSUS
3        0.46955655    LTOFSTAY CULRATIO XRAY
3        0.46300398    CULRATIO XRAY SERVICES
3        0.46191250    CULRATIO XRAY CENSUS
3        0.45384242    CULRATIO XRAY NURSES

And then the best 4-factor, 5-factor, etc.

Number in
Model    R-square      Variables in Model
4        0.51613081    LTOFSTAY CULRATIO XRAY SERVICES
4        0.51023237    LTOFSTAY CULRATIO XRAY NURSES
4        0.50002851    LTOFSTAY CULRATIO XRAY CENSUS
4        0.49971593    LTOFSTAY CULRATIO XRAY NOBEDS
4        0.49556642    LTOFSTAY CULRATIO NURSES SERVICES
4        0.49556459    LTOFSTAY AGE CULRATIO SERVICES
4        0.49348607    LTOFSTAY CULRATIO NOBEDS SERVICES
4        0.49341314    LTOFSTAY CULRATIO CENSUS SERVICES

The best model we found was a 4-factor model. Here we can check for alternative 4-factor models. Note that frequently very little is lost by replacing one or two variables with different variables, often less than a few percentage points on the R2 value. The other variables may be more interpretable, more reliably measured, cheaper and easier to measure, or have some other advantage.

Other Regression Topics

As mentioned earlier, the intercept for our last problem was not very meaningful (when all Xi equal zero we have no beds, no nurses, a length of stay of zero days, etc.). This is not an uncommon problem.
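The RSQUARE selection method simply fits every subset of the candidate variables and ranks the subsets by R2 within each model size. A minimal sketch of the same idea in Python (this is an illustration with made-up data and variable names, not the SAS procedure or the SENIC data used in the notes):

```python
from itertools import combinations

import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 5 candidate predictors; only x0 and x1 actually drive y.
X = rng.normal(size=(100, 5))
y = 2.0 * X[:, 0] + 1.0 * X[:, 1] + rng.normal(size=100)
names = ["x0", "x1", "x2", "x3", "x4"]

def r_square(cols):
    """R-square of an OLS fit of y on the chosen columns plus an intercept."""
    Z = np.column_stack([np.ones(len(y)), X[:, list(cols)]])
    b, *_ = np.linalg.lstsq(Z, y, rcond=None)
    resid = y - Z @ b
    return 1.0 - (resid @ resid) / ((y - y.mean()) @ (y - y.mean()))

# Like "start=2 stop=3 best=2": show the best 2 subsets at each size, 2 to 3.
for k in (2, 3):
    ranked = sorted(combinations(range(5), k), key=r_square, reverse=True)
    for cols in ranked[:2]:
        print(k, round(r_square(cols), 4), [names[c] for c in cols])
```

As in the SAS output, adding a variable never lowers R2, so the interesting comparison is among models of the same size.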
In evaluating the abundance of marine organisms against salinity, temperature, and depth, for example, a salinity of zero is not a marine environment, a temperature of zero is not liquid, and a depth of zero is not wet, so the intercept is meaningless. So, if you want to plot your data on one of the Xi values, what can you do? If you just extract the intercept and slope of interest, you are essentially setting all other Xi equal to zero. This can lead to unreasonable values of Yhat even if you do not show the intercept.

Yhati = b0 + b1 X1i + b2 X2i + b3 X3i + b4 X4i
Yhati = b0 + b1 X1i + b2 (0) + b3 (0) + b4 (0)
Yhati = b0 + b1 X1i

James P. Geaghan - Copyright 2011 Statistical Techniques II Page 59

[Figure: Plot of observed and predicted Infection Risk (0 to 9) against Length of Stay (5 to 19 days).]

In order to do a plot of Yj and Yhatj on a single Xij value, it is best to set the other Xij values to their mean values.

Yhati = b0 + b1 X1i + b2 X2i + b3 X3i + b4 X4i
Yhati = b0 + b1 X1i + b2 Xbar2 + b3 Xbar3 + b4 Xbar4
Yhati = [b0 + b2 Xbar2 + b3 Xbar3 + b4 Xbar4] + b1 X1i = b'0 + b1 X1i

Since all bj Xbarj are "constant", the part in brackets is a new "intercept", b'0. For the final 4-factor model, if I wanted to plot our observed and predicted SENIC values on Length of Stay (with a meaningful range of values) I would get the following results.

Variable    Parameter Estimate    Means    Constants       SLR
INTERCEP    -0.06358059                    -0.06358059     2.53702
LTOFSTAY     0.18841053                                    0.18841
CULRATIO     0.04644573           15.79     0.733513715
XRAY         0.01205242           81.63     0.983818779
SERVICES     0.02046537           43.16     0.883270880
                                  Sum:      2.537022785

SLR: Yhati = 2.53702 + 0.18841 LTOFSTAYi

Notice the change in intercept; it is no longer negative, suggesting that even for a very short stay in the hospital (near zero time) there is still a positive risk of infection. This seems more reasonable. Now let's look at the plot of the adjusted model for observed and predicted infection risk.

[Figure: Plot of observed and predicted Infection Risk (0 to 9) from the adjusted model against Length of Stay (5 to 19 days).]
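The new intercept b'0 is just the old intercept plus each remaining coefficient times its variable's mean. A quick check of the arithmetic above in Python, using the coefficients and means as printed in the notes (the means are rounded there, so the sum agrees with 2.53702 only to a few decimal places):

```python
# Coefficients from the final 4-factor SENIC model, as printed in the notes.
b0 = -0.06358059            # original (negative) intercept
b_culratio = 0.04644573
b_xray = 0.01205242
b_services = 0.02046537

# Means of the variables being held constant (rounded values from the notes).
mean_culratio, mean_xray, mean_services = 15.79, 81.63, 43.16

# New intercept: b'0 = b0 + sum of (coefficient * mean) over the fixed variables.
b0_adj = (b0
          + b_culratio * mean_culratio
          + b_xray * mean_xray
          + b_services * mean_services)

print(round(b0_adj, 3))     # close to the 2.53702 reported in the notes
```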
The predicted values line up nicely, as we would expect for a simple linear regression, and could be connected to fit a line. Also, though the origin is not shown, the intercept of 2.5 would be reasonable. Compare this to the graph that used the full-model slope and intercept to get the predicted values, with all other Xi values essentially set to zero. Obviously the unadjusted line does not fit well, and its negative intercept is too low. There appears to be a great deal of scatter, but remember we are looking at the Y variable on only one X variable. There are 3 other significant independent variables doing their share to explain the variation.

Cause & Effect

I must reiterate: you cannot prove cause and effect with correlation or regression. Cause and effect are "proved" with a controlled experiment. However, once proved, relationships can be quantified with regression, and a good correlation may prove to be a useful predictive tool even where there is no cause and effect.

Linear combinations

Regression is a linear combination. It is linear because the terms are additive. There are some properties of linear combinations that are useful not only for regression, but for other applications as well. Take the linear combination Ai = a Xi + b Yi + c Zi. The variance is given by

Var(Ai) = a^2 Var(Xi) + b^2 Var(Yi) + c^2 Var(Zi) + 2 * (covariances)
Var(Ai) = a^2 Var(Xi) + b^2 Var(Yi) + c^2 Var(Zi) + 2ab Cov(Xi,Yi) + 2ac Cov(Xi,Zi) + 2bc Cov(Yi,Zi)

unless the variables are independent, in which case the covariances may be assumed to be zero. For our variance calculation purposes in multiple regression:

We need not consider the covariance among observations, because they are independent.
We need not consider the covariance between Yhati and ei, because they are independent.
We DO NOT consider the parameter estimates of a multiple regression independent, and we use the covariance estimates from the analysis.
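The variance formula above is an exact identity, which makes it easy to check numerically: the combination of sample variances and covariances must reproduce the sample variance of Ai computed directly. A small sketch with made-up correlated data (the variables and coefficients here are arbitrary illustrations):

```python
import numpy as np

rng = np.random.default_rng(1)

# Three correlated variables: y and z are built from x, so covariances are nonzero.
x = rng.normal(size=200)
y = 0.5 * x + rng.normal(size=200)
z = -0.3 * x + rng.normal(size=200)

a, b, c = 2.0, -1.0, 0.5
A = a * x + b * y + c * z          # the linear combination Ai = aXi + bYi + cZi

S = np.cov(np.vstack([x, y, z]))   # 3x3 sample variance-covariance matrix

# Var(A) = a^2 Var(X) + b^2 Var(Y) + c^2 Var(Z)
#          + 2ab Cov(X,Y) + 2ac Cov(X,Z) + 2bc Cov(Y,Z)
var_formula = (a**2 * S[0, 0] + b**2 * S[1, 1] + c**2 * S[2, 2]
               + 2*a*b * S[0, 1] + 2*a*c * S[0, 2] + 2*b*c * S[1, 2])

print(np.isclose(var_formula, np.var(A, ddof=1)))  # True: identity holds exactly
```

Dropping the covariance terms here would give the wrong answer, which is exactly why correlated parameter estimates in multiple regression require the covariance estimates from the analysis.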
Other applications: in a two-sample t-test, and later on in analysis of variance, you may want to test a hypothesis between two or more independent estimates, like

H0: mu1 = 0.5 mu2   or   H0: mu1 - 0.5 mu2 = 0

We note that since these estimates are independent, the variance for this t-test will be

Variance = Var(Ybar1) + 0.5^2 Var(Ybar2) = Var(Ybar1) + 0.25 Var(Ybar2)

Linear combinations also are used in sampling. If random sampling is done on a heterogeneous population, the heterogeneity will cause a large variance. If the population is broken into smaller, more homogeneous units, the variance of each of the units will be smaller. The overall variance is then calculated by summing the individual variances (multiplied by the squares of the coefficients). Since the units are sampled independently, no covariance is needed. For an example, with calculations, see "Linear combinations" under the EXST7005 notes.

Multiple Regression Summary

Although the observation diagnostics are similar between SLR and MLR, there are a number of new diagnostics for variables. There is also a new problem (multicollinearity) that needs to be addressed. Don't forget, or underestimate, this problem. The assumptions for MLR are basically the same as for SLR. Most diagnostics on assumptions and model adequacy are similar (normality, curvature, etc.). We have partial residual plots (which could have been done for SLR) as a new diagnostic tool. Extra SS are important to understanding the various types of SS and the General Linear Test. You should now be able to interpret the parameter estimates provided by SLR or MLR, and use most of the diagnostics produced by SAS to determine variable "importance", evaluate observations, and determine if the model is adequate and if the assumptions are met!

Curvilinear Regression

As the name implies, these are regressions that fit curves.
However, the regressions we will discuss are also linear models, so most of the techniques and SAS procedures we have discussed will still be relevant. We will discuss two basic types of curvilinear model. Models that are not linear, but that can be "linearized" by transformation, are called intrinsically linear because after transformation they are linear, often SLR. These have already been discussed. The other category is polynomial regressions. These are an extraordinarily flexible family of curves that will fit almost anything. Unfortunately, they rarely have a good interpretation of the parameter estimates.

Polynomial Regression

Polynomial regressions are multiple regressions that use power terms of the Xi variable to fit curves. As long as the value of the power is known, the model is linear. Only a single Xi is needed (though more can be used). The assumptions are the same as for any other multiple regression. Polynomial regressions are of the form

Yi = b0 + b1 Xi + b2 Xi^2 + b3 Xi^3 + ... + bk Xi^k + ei

The simplest in this family of models is the "linear", which is just a simple linear regression. Polynomials proceed:

Quadratic: Yi = b0 + b1 Xi + b2 Xi^2 + ei
Cubic:     Yi = b0 + b1 Xi + b2 Xi^2 + b3 Xi^3 + ei
Quartic:   Yi = b0 + b1 Xi + b2 Xi^2 + b3 Xi^3 + b4 Xi^4 + ei
Quintic, etc.

The quadratic fits a simple parabolic curve, either concave or convex, depending on the sign of the regression coefficient.

[Figure: concave and convex parabolas, Y against X.]

The cubic fits parabolic curves with an inflection. The inflection does not always occur within the range of the data.

[Figure: cubic curves, Y against X, with the inflection point marked.]

The quartic polynomial adds another inflection, and another peak or valley (maximum or minimum point). These are not usually symmetric.

[Figure: quartic curves, Y against X.]

The same pattern continues for larger models.

[Figure: higher-order polynomial curve, Y against X.]

What good are polynomials? They will fit anything. In fact, if no two X values are repeated, then a large enough polynomial will go through every observation.
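That last claim is easy to verify numerically: with n distinct X values, a polynomial of degree n-1 has n coefficients, exactly enough to pass through all n points. A small Python sketch with arbitrary made-up points:

```python
import numpy as np

# Five arbitrary points with distinct X values (pure "random scatter").
x = np.array([1.0, 2.0, 3.0, 5.0, 8.0])
y = np.array([2.0, -1.0, 4.0, 0.0, 3.0])

# A degree n-1 = 4 polynomial has 5 coefficients: enough to hit every point.
coefs = np.polyfit(x, y, deg=len(x) - 1)
fitted = np.polyval(coefs, x)

print(np.allclose(fitted, y))   # True: the curve passes through every observation
```

The fit is "perfect", but it has modeled nothing except the scatter, which is the danger discussed next.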
A SLR exactly fits 2 points. A quadratic polynomial will exactly fit 3 points. A cubic will pass through each of 4 points. For n points, a polynomial with n-1 terms will pass through every point.

Sounds like a good thing? Only if you want to fit random scatter! How would you interpret the graph below?

[Figure: random scatter of Y against X with a high-order polynomial passing through every point.]

About polynomial regressions

Polynomial regressions are fitted successively, starting with the linear term (a first-order polynomial). These are tested in order, so Sequential SS are appropriate. When the highest order term is determined, then all lower order terms are also included. For example, if we fit a fifth-order polynomial and only the CUBIC term is significant, then we would OMIT THE HIGHER ORDER NON-SIGNIFICANT TERMS, BUT RETAIN THOSE TERMS OF SMALLER ORDER THAN THE CUBIC. This does not mean that Yi = b0 + b3 Xi^3 + ei is not a potentially useful model, only that this is not a "polynomial" model.

If there are "s" different values of Xi, then s-1 polynomial terms (plus the intercept) will pass through every point (or the mean of every point if there is more than one observation per Xi value). It is often recommended that not more than 1/3 of the total number of points (different Xi values) be tied up in polynomial terms. For example, if we are fitting a polynomial to the 12 months of the year, don't use more than 4 polynomial terms (quartic).

All of the assumptions for regression apply to polynomials. Polynomials are WORTHLESS outside the range of observed data!!! Do NOT try to extend predictions beyond the range of data. Polynomials generally do not have "biologically interpretable" regression coefficients. Since the successive variables are all powers of Xi, they are correlated, so multicollinearity could be an issue, but for two facts. First, using sequential SS gives exactly the needed tests, so collinearity is not an issue for the tests.
Second, regression coefficients may be affected and variances inflated, but we are unlikely to be interested in the regression coefficients for polynomials anyway. Recall that transformations of Xi will not influence the variance. This is true for polynomials.

[Figure: plots of Yi against Xi.]

Polynomial Regression Example (10 K Race Results – Vermont) – Appendix 9

There are separate race results for 527 women and 963 men. We will hypothesize that the fastest runners will be neither the oldest nor the youngest. This can be fitted with a polynomial. See the output in Appendix 9. Examine the scatter plots, done separately for the two sexes. Examine the regression models, also done separately for the two sexes. High-resolution graphics were prepared in SAS and processed in Freelance. Graphics for the two models were done separately.

[Figure: Time to run marathon (min), 125 to 325, against Age (years), 10 to 70, sex=F.]

[Figure: Time to run marathon (min), 125 to 325, against Age (years), 10 to 70, sex=M.]

Test of separate parameters for the two genders

Remember the General Linear Hypothesis test? Once again we have a full model (3 parameters fitted to each gender = 6 parameters fitted) versus a single fit to both genders combined (only 3 parameters). The full model is the 6-parameter fit and the reduced model is the 3-parameter fit. The sums of squares from the separate fits by gender can be added to give the following result.

We fit the Reduced model and the Full model (as two separate models).

Full model results: dfError = 524 + 960 = 1484, SSE = 304949 + 799234 = 1104183
Reduced model results: dfError = 1487, SSE = 1206036.845

Then we set up the table below to test the difference.

Source          d.f.    SSE           MSE          F        P>F
Reduced model   1487    1206036.85
Full model      1484    1104183.01
Difference         3     101853.84    33951.2791   45.6298  3.320764E-28
Full model      1484    1104183.01      744.0586

In this case we would decide that there was clearly a difference between genders. We don't know which one or more of the 3 parameters is different (different curvature or different intercepts), but some difference exists. Later, with analysis of covariance, we could determine which parameters differ. It actually turns out to be the intercept; the curvatures are the same for both genders.

So, given the curvature, there is an intermediate age that runs the 10 K race fastest, and younger and older individuals take longer. What is that age? The fitted model for females is

Time = 270.94 - 1.7668 Age + 0.02906 Age^2

If we take the first derivative, set it equal to zero, and solve for Age, we get:

Age at minimum time = 1.7668 / (2 * 0.02906) = 30.4

Using the equation to solve for the average time at age 30.4, we get a mean of 244 minutes for women, the best average time for any age. The fitted model for males is

Time = 265.60 - 2.3003 Age + 0.03392 Age^2

Men had a minimum at age 33.9 and a mean time of 226.6 minutes at that age. Do the results seem worthwhile? Are they meaningful? Are they interpretable? Do they have value? Note that the R2 for females was only 0.026822 and for males only 0.038138. However, the linear and quadratic terms were significant, indicating that there is a significant fit to the means.

Polynomial Regression Summary

Polynomial regressions are treated like any other multiple regression, except that we use Type I SS for testing hypotheses. Note that the FULLY ADJUSTED regression coefficients are still used to fit the model. The ability to determine a minimum or maximum point is a useful application of polynomials (optimum performance at some age, optimum yield at some fertilizer level, etc.). We have some new capabilities as far as what we can do with regression. We can test for a curvilinear relationship between Y and X. Test if the curvature is Quadratic?
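The F test and the two age-at-minimum calculations above can be reproduced directly from the numbers printed in the notes. A quick Python check (results agree with the table to rounding):

```python
# General Linear Test: combined (reduced) fit versus separate fits by gender.
sse_reduced, df_reduced = 1206036.845, 1487
sse_full, df_full = 304949 + 799234, 524 + 960   # sums over the two separate fits

ms_diff = (sse_reduced - sse_full) / (df_reduced - df_full)
mse_full = sse_full / df_full
F = ms_diff / mse_full
print(round(F, 2))                       # about 45.63, as in the table

# Age at minimum time: for Time = b0 + b1*Age + b2*Age^2, setting the
# derivative b1 + 2*b2*Age to zero gives Age = -b1 / (2*b2).
def vertex(b0, b1, b2):
    age = -b1 / (2 * b2)
    return age, b0 + b1 * age + b2 * age**2

print(vertex(270.94, -1.7668, 0.02906))  # women: about age 30.4, 244 minutes
print(vertex(265.60, -2.3003, 0.03392))  # men: about age 33.9, 226.6 minutes
```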
Cubic? Quartic? ... We can now obtain a curvilinear predictive equation for Y on X.
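Obtaining such a curvilinear predictive equation is just a multiple regression of Y on X and X^2. A final sketch in Python, using simulated data from a known parabola (hypothetical numbers, not the race data), showing that the quadratic fit gives usable predictions within the range of the data:

```python
import numpy as np

rng = np.random.default_rng(7)

# Simulated data from a known parabola, y = 5 - 2x + 0.1x^2, plus noise.
x = rng.uniform(10, 70, size=300)
y = 5 - 2 * x + 0.1 * x**2 + rng.normal(scale=2.0, size=300)

# Quadratic polynomial regression: design matrix with intercept, x, and x^2.
X = np.column_stack([np.ones_like(x), x, x**2])
b, *_ = np.linalg.lstsq(X, y, rcond=None)
b0, b1, b2 = b

# Predict only within the observed range; polynomials are worthless outside it.
x_new = 40.0
print(b0 + b1 * x_new + b2 * x_new**2)   # near the true value 5 - 80 + 160 = 85
```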