EXST7015 Fall2011 Lect06

Statistical Techniques II

SAS example of a Simple Linear Regression (Appendix 2)

Our analyses will be done in SAS. Other, simpler options, such as EXCEL, work well for simple linear regression, but only SAS will cover all of the analyses and all of the options that we want to discuss this semester. If you are not familiar with SAS, see the information available on my EXST7005 page, and talk to the TA about getting up to speed.

The numerical example used is from your textbook. It is a data set taken from 47 trees. Each tree was measured for diameter, height, weight of harvestable wood and other values. Our objective will be to predict the weight of harvestable wood using just the diameter. The diameter variable is measured about 4 feet off the ground and is called Diameter at Breast Height (DBH).

SAS Programming

The SAS program. I will presume you are familiar with the SAS data step. I will discuss it briefly only for this first example.

SAS statements – all SAS statements end in a semicolon;

Comments – comments are statements that start with an asterisk. They do nothing in the program; they are included only for the purpose of documenting the program. My Simple Linear Regression (SLR) example starts with the comments,

*********************************************;
*** Data from Freund & Wilson (1993)      ***;
*** TABLE 8.24 : ESTIMATING TREE WEIGHTS  ***;
*********************************************;

This is for documentation purposes only. It does not affect the program.

Options – options can be specified to modify output appearance. The options I prefer are,

options ps=61 ls=78 nocenter nodate nonumber nolabel FORMCHAR="|----|+|---+=|-/\<>*";

This OPTIONS statement creates a page size (ps) of 61 lines (use 54 for the lab) and a line size (ls) of 78 character columns, and suppresses the centering of output and the printing of the date and page numbers.

The DATA step. All our programs will include a DATA section.
In this section the data to be analyzed is entered into the SAS system and, if necessary, modified for analysis.

data one;

A second statement informs SAS that the data is included in the program (CARDS) and that if there are missing values the system should NOT go to the next line to get the data (MISSOVER).

infile cards missover;

The TITLE statement ends in a semicolon as usual, and the text to be used as the title is enclosed in single quotes.

TITLE1 'Estimating tree weights from other morphometric variables';

The INPUT statement. Along with the DATA statement, this is an important statement. It names the variables to be used, tells SAS what type of variables they are (numeric or alphanumeric) and gets the data into the SAS data set.

input ObsNo Dbh Height Age Grav Weight ObsID $;

Note that only one variable in the list is followed by a $. This will cause SAS to assume that all variables are numeric except the variable called OBSID.

James P. Geaghan - Copyright 2011

The variable OBSID is one I created by adding a different letter to each observation. The first line got an "a", the second a "b", etc. The 26th observation got a "z" and the 27th an "A", etc. This was done to have a way of distinguishing each observation.

The LABEL statement provides a way of identifying each variable. It is optional, but if present it will be used by SAS in a number of places to identify the variables.

label ObsNo = 'Original observation number'
      Dbh   = 'Diameter at breast height (inches)'
      etc. ... ;

I have deactivated the labels by making them a comment statement.

If data must be modified, it is done in the data step after the INPUT statement. I have two statements that create logarithms. These are not used in the first analysis, but will be used later in the semester.

lweight = log(weight); ldbh = log(DBH);

These statements create two new variables (LWEIGHT and LDBH) that are the natural logs of the original variables.

Two last statements before the data.
The CARDS statement tells SAS that the data step is done and the data follows. The RUN statement tells SAS to process all the information that it has so far and output any messages about the analysis to the LOG.

cards; run;

Note that two statements can occur on the same line. The SAS DATA step is now complete. The data will be entered into the SAS system and processing will continue.

The rest of the statements in this program are procedures (PROCs) and associated statements. I will briefly discuss some of these statements. For most of the semester we will concentrate on the PROCs that actually do statistics, such as REG, GLM, LOGISTIC, ANOVA, and MIXED. The first PROC is,

proc print data=one; TITLE2 'Raw data print'; run;

This PROC causes the data to be printed, with the second title line added as "Raw data print". See the data list from PROC PRINT. Notice that this is a TITLE2, so any previous TITLE1 is kept. Also notice that I usually follow PROCs with a RUN statement. This causes the procedure to be executed, and any comments regarding the statement are placed in the LOG prior to the next PROC.

The next PROC is a PLOT.

options ls=111 ps=61;
proc plot data=one; plot weight*Dbh=obsid; TITLE1 'Scatter plot'; run;
options ps=512 ls=132;

It is surrounded by OPTIONS statements. Although I usually like a large page size (512), I don't want the plot to cover 512 lines, so I set the page size to 61 for the plot, and then reset it to 512 for subsequent output. The plot is for weight on DBH. Notice the "=ObsID" at the end of the PLOT statement. This will cause SAS to plot a single character (the ObsID I created) as a symbol representing each observation in the plot. I do this to be able to distinguish between the observations in the plot.

See the output from PROC PLOT. The MEANS procedure is often used to examine variables and determine the number of observations of each variable, its minimum and maximum.
proc means data=one n mean max min var std stderr; TITLE1 'Raw data means';
var Dbh Height Age Grav Weight; run;

This has limited utility for regression analysis. See the output from PROC MEANS titled "Raw data means". You might use it to look for outliers, or to get the range of values for a plot.

The SAS UNIVARIATE procedure is very useful in regression analysis. However, the application to the RAW variables is not very useful.

proc univariate data=one normal plot; TITLE1 'Raw data Univariate analysis';
var Weight Dbh; run;

See the output from PROC UNIVARIATE. As far as regression is concerned, the preceding material is ancillary, used to prepare or enhance our analysis. The important information for regression will be provided by PROC REG or PROC GLM.

proc reg data=one LINEPRINTER; ID ObsID DBH;
   TITLE2 'Simple linear regression';
   model Weight = Dbh / clb alpha=0.01; *** p xpx i influence CLI CLM;
   Slope:Test DBH = 180;
   Joint:TEST intercept = 0, DBH = 180;
run; options ls=78 ps=45;
plot residual.*predicted.=obsid / VREF=0; run;
OUTPUT OUT=NEXT1 P=Predicted R=Resid cookd=cooksd dffits=dffits
   STUDENT=student rstudent=rstudent lclm=lclm uclm=uclm lcl=lcl ucl=ucl;
run; options ps=61 ls=95;

PROC REG output. See the output from PROC REG.

The ANOVA table. The ANOVA table has one of the key tests of hypothesis and an estimate of the Mean Square Error (MSE).

Supplemental information. This section is ancillary information: not necessary, but informative in some cases. A popular statistic from this section is the R-Square (R2).

Parameter estimates and tests. This section is the most important in terms of interpreting the regression. It provides estimates of the regression coefficients (intercept and slope) with their standard errors and a test of each regression coefficient against zero. The confidence intervals are available on request (option CLB).
The default confidence interval is 95%, but other values of alpha can be specified.

The parameter estimates are: Intercept = b0 = –729.396300 and Slope = b1 = 178.563714.

The linear equation is: Yi = b0 + b1*Xi = –729.4 + 178.6*Xi for any value of X.

Interpretation: The weight starts at –729 when the diameter is zero and increases by about 179 pounds for each additional inch in diameter.

For a t-test of either parameter against a hypothesized value, or a confidence interval on either parameter, we would use the standard errors provided by SAS: Sb0 = 55.69366336 and Sb1 = 8.57640103.

A 99% confidence interval is calculated by SAS because the option CLB was requested on the model statement and a value of alpha = 0.01 was specified: P(155.5 ≤ β1 ≤ 201.6) = 0.99.

A 95% confidence interval is calculated as Parameter ± t-value * standard error. The t-value has n – 2 = 45 d.f. and can be found in a t-table. For a two-tailed interval and a value of α = 0.05, the t-value is 2.014. For the slope the estimate is 178.6 and the standard error is 8.576. The confidence interval is then: 178.6 ± 2.014*8.576. The preferred expression is: P(161.328 ≤ β1 ≤ 195.872) = 0.95.

SAS automatically provides a t-test of each parameter against a hypothesized value of zero, the most common test. The t-values and P-values are:

Intercept: calculated t = –13.097, P-value < 0.0001
Slope: calculated t = 20.820, P-value < 0.0001

Interpretation: The slope and intercept differ from zero. Therefore, the line does not pass through the origin, and the line is not a "flat" line; the calculated regression line is an improvement over the original flat line fitted by the correction factor.

Other values may be of interest besides zero. These can be tested by hand, or with a "TEST" statement in SAS. I added an additional, optional, test. I decided to test two specific hypotheses about the regression coefficients:

Slope:Test DBH = 180;
Joint:TEST intercept = 0, DBH = 180;

SAS provides a mechanism to do this.
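Before looking at the TEST statement output, the interval arithmetic above is easy to verify. A minimal Python sketch (not SAS output), using the rounded slope, standard error, and tabled t-value quoted in the text:

```python
# 95% CI for the slope, following the hand calculation above.
# b1 and its standard error are the rounded values from the SAS output;
# 2.014 is the tabled two-tailed t-value for 45 d.f. and alpha = 0.05.
b1, se_b1, t45 = 178.6, 8.576, 2.014
lo, hi = b1 - t45 * se_b1, b1 + t45 * se_b1
print(round(lo, 3), round(hi, 3))  # 161.328 195.872, matching the text
```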
The statement "TEST DBH = 200;" is added to the program after the model statement. The test outputs the test result (in this program the output follows the list of observation diagnostics). This tests the hypothesis H0: βDBH = 200, and you can see that it is rejected here. SAS uses an F test for this (more flexible); we would probably use a t-test (computationally and conceptually easier). The second hypothesis tested is a joint test of the two hypotheses together: H0: β0 = 0 and H0: βDBH = 200. This is a two degree of freedom test.

Other useful information. Other useful output from PROC REG includes observation diagnostics, residual plots and the ability to output residuals for testing. AS YOU KNOW, WE TEST NORMALITY OF THE RESIDUALS, NOT THE RAW DATA!!!

Observation diagnostics. There are a few diagnostics calculated from individual observations that are of interest.

First, the residuals are of interest only for their sign. Long strings of residuals with the same sign can indicate either curvature or a lack of independence. Since we don't know what constitutes an overly large residual, these are not very useful for detecting outliers.

Another value of interest is the standardized residuals, in SAS the values "STUDENT" and "RSTUDENT". These are standardized residuals, and should have a mean of zero and a variance of one. They should follow a t distribution, so that for our example with 45 d.f. we expect that 99% would be between ±2.690.

The HAT diag values. Hat diag is a relative measure of how far an X value is from the center of the X space. A high value indicates an unusual value of X. This is not necessarily bad, but unusual values should be examined for correctness. The hat diag values will sum to "p", where p is the number of parameters estimated in the model (2 for SLR).
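The leverage values can be computed directly from the design matrix as hii = diag(X(X'X)^-1 X'). A minimal numpy sketch using a handful of synthetic X values (not the tree data) confirms the sum-to-p property:

```python
import numpy as np

# Hat diagonal (leverage) for a simple linear regression.
# Synthetic X values for illustration -- not the textbook tree data.
x = np.array([3.0, 5.0, 6.0, 7.0, 12.0])
X = np.column_stack([np.ones_like(x), x])   # design matrix: intercept, x
H = X @ np.linalg.inv(X.T @ X) @ X.T        # hat matrix
h = np.diag(H)
print(np.round(h.sum(), 6))  # 2.0 -- the hat values sum to p
```

The observation farthest from the mean of X (here x = 12) gets the largest leverage, which is exactly the "distance from the center of the X space" interpretation above.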
The mean of the hat diag values will be p/n, and any values more than twice this value are considered "large". Again, this is not necessarily a problem.

Influence diagnostics examine how the regression would change if an observation were removed from the analysis. If an observation is removed and the regression does not change, the observation is not influential. If the regression changes a lot, the observation is very influential.

DFFITS measures the change in terms of the "fit", as judged by the predicted (Yhat) value. If a point is removed and Yhat changes a lot, the point is influential. For small to medium size databases, DFFITS should not exceed 1, while for large databases it should not exceed 2*sqrt(p/n).

DFBETAS measures the change in terms of the "fit", as judged by changes in b0 and b1. If a point is removed and b0 or b1 change a lot, the point is influential. For small to medium size databases, DFBETAS should not exceed 1, while for large databases it should not exceed 2/sqrt(n) (see Appendix 1).

Observation diagnostics: look for RSTUDENT values over 2.7, Hat diag values over 0.08, and DFFITS & DFBETAS over 1. In this example there are large Hat diag values on both ends of the regression, large DFFITS and DFBETAS for observations f & q, and a large RSTUDENT for observation f.

Residual plots (produced by either PROC REG or PROC PLOT) are a useful tool for detecting various problems: outliers, curvature, non-homogeneous variance and more.

Univariate tests & graphics (with options "normal" and "plot"):

proc univariate data=next1 normal plot; var Resid;
TITLE3 'Residual analysis'; run;

Note that the residuals sum to zero. Examine the tests of normality. Check the stem and leaf and box plots for symmetry and outliers. Examine the "Normal Probability Plot" to study departure from normality.

The SAS UNIVARIATE procedure is very useful in regression analysis. However, the application to the RAW variables is not very useful.
We will be interested in using this PROC to evaluate normality. We will be ESPECIALLY interested in the tests,

Shapiro-Wilk  W = 0.710878   Pr < W  <0.0001
Shapiro-Wilk  W = 0.89407    Pr < W   0.0005

We will also be interested in other tools to evaluate normality (STEM & LEAF, BOX PLOT, NORMAL PROBABILITY PLOT), but NOT FOR THE RAW DATA for either variable (X or Y). Note that these tests of normality are not particularly useful. We will actually test residuals, for reasons explained later. We will be assuming normality and testing for normality, but not on the original variables. Later we will be testing the Deviations or Residuals!!! These are the appropriate tests of normality, not the tests of the original variables!!!

High quality graphics are available in SAS with procedures such as GPLOT and GCHART. PROC GPLOT was used to produce the following plot. Among other things GPLOT will automagically produce regression lines (linear, quadratic or cubic) with confidence intervals (α = 0.05 or 0.01) for the regression line or the individual points.

GOPTIONS DEVICE=CGMflwa GSFMODE=REPLACE GSFNAME=OUT1 NOPROMPT noROTATE
   ftext='TimesRoman' ftitle='TimesRoman' htext=1 htitle=1
   ctitle=black ctext=black;
FILENAME OUT1 'C:\SAS\SLR-Trees1.CGM';
PROC GPLOT DATA=ONE;
   TITLE1 font='TimesRoman' H=1 'Simple Linear Regression Example';
   TITLE2 font='TimesRoman' H=1 'Wood harvest from trees';
   PLOT weight*Dbh=1 weight*Dbh=2 / overlay HAXIS=AXIS1 VAXIS=AXIS2;
   AXIS1 LABEL=(font='TimesRoman' H=1 'Diameter at breast height (inches)')
      WIDTH=1 MINOR=(N=1) VALUE=(font='TimesRoman' H=1) color=black
      ORDER=3 TO 13 BY 1;
   AXIS2 LABEL=(ANGLE=90 H=1 'Weight of wood harvested (lbs)') WIDTH=1
      VALUE=(font='TimesRoman' H=1) MINOR=(N=5) color=black
      ORDER=0 TO 1800 BY 200;
   SYMBOL1 color=red  V=None I=RLcli99 L=1 MODE=INCLUDE;
   SYMBOL2 color=blue V=dot  I=None    L=1 MODE=INCLUDE;
RUN;

[Figure: PROC GPLOT output, "Simple Linear Regression Example – Wood harvest from trees": scatter of Weight of wood harvested (lbs), 0 to 1800, against Diameter at breast height (inches), 3 to 13, with the fitted regression line and confidence limits.]

Summary of simple linear regression

For this relationship a significant correlation exists between the diameter of the tree and the weight of the wood harvested from the tree. In fact, we get 178.6 pounds of wood for each additional inch of diameter: P(161.328 ≤ β1 ≤ 195.872) = 0.95. The equation to predict wood harvest from diameter is Yi = –729.4 + 178.6*Xi.

We might expect a tree with a diameter of zero to have a weight of zero, but our model says that the weight for such a tree would actually be –729.4. The first question is whether this differs "statistically significantly" from the hypothesized value of zero. It does (P < 0.0001). A negative weight is impossible, so either there is something about tree growth we don't understand, or we do not have the right model. So we try to evaluate our model.

Are the observations correct and reasonable? Examine the RSTUDENT values. We note a potential problem with Obs "f". Examine the residual plot. This plot appears to show that the line is actually curved and possibly has non-homogeneous variance!!!

The Hat diag values indicated that the values on the higher end of the regression were possibly "unusual". This is not uncommon for simple linear regression, which is kind of one dimensional for X. This statistic will be more useful for multiple regression.

The influence diagnostics indicated that a number of observations were "influential". If the observations are correct, and there are no outliers, there is no problem. Also, if an observation IS an outlier but it is not influential, we don't have much of a problem. Problems occur when an observation is BOTH an outlier and influential. Like observation "f"!!!
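The curvature symptom described above, long runs of residuals with the same sign, is easy to reproduce. A sketch with deliberately curved synthetic data (not the tree data):

```python
import numpy as np

# Fit a straight line to quadratic data and inspect the residual signs.
# Long same-sign runs (+ + - ... - + +) indicate curvature.
x = np.arange(1.0, 11.0)
y = x ** 2                       # a curved relationship
b1, b0 = np.polyfit(x, y, 1)     # straight-line least squares fit
resid = y - (b0 + b1 * x)
print(np.sign(resid))  # positive at both ends, negative in the middle
```

This bowed pattern is the same one the tree-weight residual plot suggests.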
Examine the PROC UNIVARIATE output for tests and graphics of normality and for outliers. The Shapiro-Wilk test indicates the residuals do not depart from normality. The graphics do not show a great departure from normality, but there is a possible outlier (observation "f"). The normal probability plot shows only one departure, and it appears to be the outlier on the upper end (f again).

So this regression appears to fit "well". Everything is significant and the R2 is pretty high, but there are a lot of problems. The basic problem is that we do not have the right model. The model should really have some curvature (we will cover this later). Then, observations that are outliers on the ends might fit right on the line.

Curvilinear Regression or Intrinsically Linear Regression

As the name implies, these are regressions that fit curves. However, the regressions we will discuss are also linear models, so most of the techniques and SAS procedures we have discussed will still be relevant. We will discuss two basic types of curvilinear model. The first are models that are not linear, but that can be "linearized" by transformation. These models are referred to as "intrinsically linear", because after transformation they are linear. Later we will cover polynomial regressions. These are an extraordinarily flexible family of curves that will fit almost anything. Unfortunately, they rarely have a good interpretation of the parameter estimates.

Intrinsically linear models

These are models that contain some transformed variable: logarithms, inverses, square roots, sines, etc. We will concentrate on logarithms, since these models are some of the most useful. What is the effect of taking a logarithm of a dependent or independent variable?
For example, instead of Yi = b0 + b1*Xi + ei, fit log(Yi) = b0 + b1*Xi + ei. If we fit log(Yi) = b0 + b1*Xi + ei, then the original model, before we took logarithms, must have been Yi = b0*e^(b1*Xi)*ei, where "e" is the base of the natural logarithm (2.718281828). This model is called the "exponential growth model" if b1 is positive, or the exponential decay model if it is not. It is used in the biological sciences to fit exponential growth (positive b1) or mortality (negative b1). This function produces a curve that increases or decreases proportionally. [Figure: exponential growth and decay curves for the model Yi = b0*e^(b1*Xi)*ei.]

Other examples of curvilinear models.

Power model: Yi = b0*Xi^b1*ei. Taking logarithms of Yi = b0*Xi^b1*ei gives log(Yi) = log(b0) + b1*log(Xi) + log(ei). This is a simple linear regression with a dependent variable of log(Yi) and an independent variable of log(Xi). This model is used to fit many things, including morphometric data. [Figure: power model curves fitted for (b0, b1) values of (29, –1), (19, 0), (4, 0.5), (1, 1) and (0.03, 2), illustrating b1 negative, b1 = 0, 0 < b1 < 1, b1 = 1 and b1 > 1.]

Models following a hyperbolic shape, with an asymptote, can be fitted with inverses. A model with an inverse (e.g. 1/Xi) will fit a "hyperbola", with its asymptote: Yi = b0 + b1*(1/Xi) + ei, where b0 fits the asymptote. [Figure: hyperbolic curves Yhat = 10 + 10*(1/Xi) and Yhat = 10 – 10*(1/Xi).]

These are a few of many possible curvilinear regressions. Models including power terms, exponents, logarithms, inverses, roots, and trigonometric functions may be curvilinear.

A note on logarithms. The model described above requires natural logs. In SAS the function "LOG" produces natural logs (LOG10 gives log base 10). In EXCEL and on many calculators the natural log function is "LN".

However, not all curves can be fitted by linear models with transformations.
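Before turning to genuinely nonlinear models, the linearization idea can be demonstrated numerically. A Python sketch using synthetic, noise-free data from the power model with (b0, b1) = (4, 0.5), one of the pairs plotted above; fitting a straight line to the logs recovers the parameters:

```python
import math

# Power model y = b0 * x**b1 becomes linear after taking logs:
# log(y) = log(b0) + b1*log(x). Synthetic noise-free data, b0=4, b1=0.5.
xs = [1.0, 2.0, 4.0, 8.0, 16.0]
ys = [4.0 * x ** 0.5 for x in xs]

lx = [math.log(x) for x in xs]
ly = [math.log(y) for y in ys]

# ordinary least-squares slope and intercept on the logged data
n = len(lx)
mx, my = sum(lx) / n, sum(ly) / n
b1 = sum((a - mx) * (b - my) for a, b in zip(lx, ly)) \
     / sum((a - mx) ** 2 for a in lx)
b0 = math.exp(my - b1 * mx)   # back-transform the intercept
print(round(b0, 3), round(b1, 3))  # 4.0 0.5
```

In SAS, this is exactly what the LWEIGHT and LDBH variables created in the data step are for.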
Some curves are nonlinear, and require nonlinear curve fitting techniques. For example,

Yi = b0*Xi^b1*ei is curvilinear (intrinsically linear)
Yi = b0*Xi^b1 + ei is nonlinear
Yi = b0 + b1*Xi + b2*Xi^2 + ei is linear (a polynomial)
Yi = b0 + b1*Xi + b2*Xi^b3 + ei is nonlinear

Note that the power model, Yi = b0*Xi^b1*ei, has an error multiplied by the term in Xi. This is interesting because when the error is multiplied by the independent variable, the variability about the regression line of the raw data should appear to increase as Xi increases. The log transformation (of Yi) should remove this nonhomogeneous variance. This would not be true after a log transformation of Xi.

Curvilinear Residual Patterns

Transformations of Yi, like log transformations, will affect homogeneity of variance. The raw data should actually appear nonhomogeneous. [Figure: scatter of Yi on Xi before and after transformation of Yi.]

Transformations of Xi will fit curves but will not affect the homogeneity of variance. [Figure: scatter of Yi on Xi before and after transformation of Xi.]

Polynomials assume homogeneous variance and will not adjust variance. [Figure: scatter of Yi on Xi for a polynomial fit.]

Intrinsically Linear (Curvilinear) regression example

Remember our SLR example about the amount of wood harvested from trees, predicted on the basis of DBH (diameter at breast height)? Recall that it looked a little curved, and maybe even had nonhomogeneous variance? Let's take another look at that model. Typically, morphometric relationships (between parts of an organism) are best fitted with models with both Log(Y) and Log(X): fish length – scale length, fish total length – fish fork length, crab width – crab length, fish length – fish weight, etc.

Statistics quote: There are three kinds of lies: lies, damned lies, and statistics. Benjamin Disraeli (1804–1881)

Regression Diagnostic Criteria (Appendix 1 Supplement)

Criteria for the interpretation of selected regression statistics from the SAS output. The reference was primarily Neter, J., Kutner, M. H., Nachtsheim, C. J., and Wasserman, W., Applied Linear Statistical Models, 4th Edition, Richard D. Irwin, Inc., Burr Ridge, Illinois, 1996.

General regression diagnostics

Adjusted R2: R2adj = 1 – (SSError/SSTotal)*((n–1)/(n–p)) = 1 – (1–R2)*((n–1)/(n–p))

This is intended to be an adjustment to R2 for additional variables in the model. Unlike the usual R2, this value can decrease as more variables are entered in the model if the variables do not account for sufficient additional variation (equal to the MSE).

Standardized regression coefficient bj': bj' = bj*(Sxj/Sy). Unlike the usual regression coefficient, the magnitude of the standardized coefficient provides a meaningful comparison among the regression coefficients. Larger standardized regression coefficients have more impact on the calculation of the predicted value and are more "important".

Partial correlations

Squared semi-partial correlation TYPE I = SCORR1 = SeqSSXj / SSTotal
Squared partial correlation TYPE I = PCORR1 = SeqSSXj / (SeqSSXj + SSError)
Squared semi-partial correlation TYPE II = SCORR2 = PartialSSXj / SSTotal
Squared partial correlation TYPE II = PCORR2 = PartialSSXj / (PartialSSXj + SSError)

Note that for regression, TYPE II SS and TYPE III SS are the same.

Residual Diagnostics

The hat matrix main diagonal elements, hii ("Hat Diag H" values in SAS), are called "leverage values"; they are used to detect unusual observations in the X space. This can also identify substantial extrapolation for new values. As a general rule, hii values greater than 0.5 are "large" while those between 0.2 and 0.5 are moderately large; also look for a leverage value which is noticeably larger than the next largest. The hii values sum to p, so the mean hii = p/n (note that this is < 1). A value may be an "outlier" if it is more than twice the mean hii (i.e. hii > 2p/n).

Studentized residuals ("Student Residual" in SAS). Also called Internally Studentized Residuals.
There are two versions:
the simpler calculation = ei / root(MSE)
the more common application = ei / root(MSE*(1–hii)) [SAS produces these]

We already assume these are normally distributed, so these values would approximately follow a t distribution, where for large samples about 65% are between –1 and +1, about 95% are between –2 and +2, and about 99% are between –2.6 and +2.6.

Deleted Studentized residuals ("RStudent" in SAS). Also called Externally Studentized Residuals. There are also two versions, as with the studentized residuals above:
Deleted Studentized = ei(i) / root(MSE(i))
Deleted Internally Studentized = ei(i) / root(MSE(i)*(1–hii)) [SAS produces these values]

As with the studentized residuals above, these values would approximately follow a t distribution.

Influence Diagnostics

DFFITS: an influence statistic; it measures the difference in fits as judged by the change in predicted value when the point is omitted. This is a standardized value and can be interpreted as a number of standard deviation units. For small to medium size databases, DFFITS should not exceed 1, while for large databases it should not exceed 2*sqrt(p/n).

DFBETAS: an influence statistic; it measures the difference in fits as judged by the change in the values of the regression coefficients. Note that this is also a standardized value. For small to medium size databases, DFBETAS should not exceed 1, while for large databases it should not exceed 2/sqrt(n).

Cook's D: an influence statistic (D is for distance), based on the boundary of a simultaneous regional confidence region for all regression coefficients. This does not follow an F distribution, but it is useful to compare it to the percentiles of the F distribution [F(1–α; p, n–p)], where a change below the 10th or 20th percentile shows little effect, while the 50th percentile is considered large.

Multicollinearity Diagnostics

VIF is related to the severity of multicollinearity; a standardized estimate of a regression coefficient would be expected to have a value of 1 if the regressors are uncorrelated. If the mean of this value is much greater than 2, serious problems are indicated. No single VIF should exceed 10. Tolerance is the inverse of VIF, where Tolerance_k = 1 – R2_k.

The Condition number (a multivariate evaluation). Eigenvalues are extracted from the regressors. These are variances of linear combinations of the regressors, and go from larger to smaller. If one or more are zero (at the end) then the matrix is not of full rank. They sum to p, and if the Xk are independent, each would equal 1. The condition number is the square root of the ratio of the largest (always the first) to each of the others. If this value exceeds 30 then multicollinearity may be a problem.

Model Evaluation and Validation

R2p, AdjR2p and MSEp can be used to graphically compare and evaluate models. The subscript p refers to the number of parameters in the model.

Mallow's Cp criterion. Use of this statistic presumes no bias in the full model MSE, so the full model should be carefully chosen to have little or no multicollinearity. Cp criterion = (SSEp / TrueMSE) – (n – 2p). The Cp statistic will be approximately equal to p if there is no bias in the regression model.

PRESSp criterion (PRESS = Prediction SS). This criterion is based on deleted residuals. There are n deleted residuals in each regression, and PRESSp is the SS of the deleted residuals. This value should approximately equal the MSE if predictions are good; it will get larger as predictions are poorer. The values may be plotted, and the models with smaller PRESS statistics represent better predictive models. This statistic can also be used for model validation.

Simple Linear Regression: Appendix 2, Annotated SAS example

The SAS program. I will presume you are familiar with the SAS data step. I will discuss it briefly only for this first example.
SAS statements – all SAS statements end in a semicolon;

Comments – comments are statements that start with an asterisk. They do nothing in the program; they are included only for the purpose of documenting the program.

Options can be specified to modify output appearance. The option statement below creates a page size (ps) of 256 lines (use 54 for the lab) and a line size of 80 character columns, and suppresses the centering of output and the printing of the date and page numbers.

The DATA step. All our programs will include a DATA section. In this section the data to be analyzed is entered into the SAS system and, if necessary, modified for analysis. A second statement informs SAS that the data is included in the program (CARDS) and that if there are missing values the system should NOT go to the next line to get the data (MISSOVER).

The next statement in my program is a TITLE statement. Up to 9 titles can be active (TITLE1 through TITLE9) and, once set, they are printed at the top of each page. Setting a new title, say TITLE3, would not affect lower numbered titles (TITLE1 and TITLE2) but would delete all higher numbered titles (TITLE4 ...). The TITLE statement ends in a semicolon as usual, and the text to be used as the title is enclosed in single quotes.

The INPUT statement. Along with the DATA statement, this is an important statement. It names the variables to be used, tells SAS what type of variables they are (numeric or alphanumeric) and gets the data into the SAS data set. Note that only one variable in the list is followed by a $. This will cause SAS to assume that all variables are numeric except the variable called OBSID. The variable OBSID is one I created by adding a different letter to each observation. The first line got an "a", the second a "b", etc. The 26th observation got a "z" and the 27th an "A", etc. This was done to have a way of distinguishing each observation.

The LABEL statement provides a way of identifying each variable.
It is optional, but if present it will be used by SAS in a number of places to identify the variables.

label ObsNo = 'Original observation number'
      Dbh   = 'Diameter at breast height (inches)'
      etc. ... ;

I have deactivated the labels by making them a comment statement.

If data must be modified, it is done in the data step after the INPUT statement. I have two statements that create logarithms. These are not used in the first analysis, but will be used later in the semester.

lweight = log(weight);  ldbh = log(DBH);

These statements create two new variables (LWEIGHT and LDBH) that are the natural logs of the original variables.

Two last statements before the data. The CARDS statement tells SAS that the data step is done and that data follows. The RUN statement tells SAS to process all the information it has so far and output any messages about the analysis to the LOG.

cards; run;

Note that two statements can occur on the same line. The SAS DATA step is now complete. The data will be entered into the SAS system and processing will continue. The rest of the statements in this program are procedures (PROCs) and associated statements.

James P. Geaghan - Copyright 2011

Statistical Techniques II   Simple Linear Regression: Appendix 2   Annotated SAS example   Page 146

dm'log;clear;output;clear';
options ps=512 ls=120 nocenter nodate nonumber FORMCHAR="|----|+|---+=|-/\<>*";
TITLE1 'Appendix02: Estimating tree harvest weights';

ODS HTML style=minimal body='C:\SAS\Appendix02 Slr-Trees.HTML' ;
ODS rtf style=minimal body='C:\SAS\Appendix02 Slr-Trees.RTF' ;
filename input1 'C:\SAS\Appendix02 Slr-Trees.csv';
FILENAME OUT1 'C:\SAS\Appendix02 Slr-Trees01.CGM';
FILENAME OUT2 'C:\SAS\Appendix02 Slr-Trees02.CGM';

***********************************************;
***  Data from Freund & Wilson (1993)       ***;
***  TABLE 8.24 : ESTIMATING TREE WEIGHTS   ***;
***********************************************;

data one; infile input1 missover DSD dlm="," firstobs=2;
  input ObsNo Dbh Height Age Grav Weight;
  *********** label ObsNo = 'Original observation number'
     Dbh    = 'Diameter at breast height (inches)'
     Height = 'Height of the tree (feet)'
     Age    = 'Age of the tree (years)'
     Grav   = 'Specific gravity of the wood'
     Weight = 'Harvest weight of the tree (lbs)'
     ObsId  = 'Identification letter added to dataset';
  lweight = log(weight);
  ldbh = log(DBH);
  observation + 1;
  if observation ge 27 then ObsID = byte(observation+64-26);  * upper case *;
  if observation le 26 then ObsID = byte(observation+96);     * lower case *;
  keep Dbh Height Age Grav Weight ldbh lweight obsid;
datalines;
run;

proc print data=one; TITLE2 'Raw data print'; run;

options ls=95 ps=61;
proc plot data=one; plot weight*Dbh=obsid;
  TITLE2 'Scatter plot';
run;
options ps=256 ls=85;

proc means data=one n mean max min std stderr;
  TITLE2 'Raw data means';
  var Dbh Height Age Grav Weight;
run;

proc univariate data=one normal plot;
  TITLE2 'Raw data Univariate analysis';
  var Weight Dbh;
run;

proc reg data=one LINEPRINTER; ID ObsID DBH;
  TITLE2 'Simple linear regression';
  model Weight = Dbh / clb alpha=0.01;  *** p xpx i influence CLI CLM;
  Slope: Test DBH = 180;
  Joint: TEST intercept = 0, DBH = 180;
run;
options ls=78 ps=45;
  plot residual.*predicted.=obsid / VREF=0;
run;
OUTPUT OUT=NEXT1 P=Predicted R=Resid cookd=cooksd dffits=dffits
  STUDENT=student rstudent=rstudent lclm=lclm uclm=uclm lcl=lcl ucl=ucl;
run;
options ps=61 ls=95;

proc print data=next1; TITLE3 'Listing of observation diagnostics';
  var ObsId DBH Weight Predicted Resid student rstudent;
run;
proc print data=next1; TITLE3 'Listing of observation diagnostics';
  var ObsId cooksd dffits lclm uclm lcl ucl;
run;

options ps=512 ls=85;
proc univariate data=next1 normal plot; var Resid;
  TITLE3 'Residual analysis';
run;

options ls=95 ps=61;
proc plot data=one; plot weight*Dbh=obsid;
  TITLE2 'Scatter plot';
run;
options ps=512 ls=85;
ods html close; ods rtf close;
run; quit;

  1   dm'log;clear;output;clear';
  2   options ps=512 ls=120 nocenter nodate nonumber FORMCHAR="|----|+|---+=|-/\<>*";
  3   TITLE1 'Appendix02: Estimating tree harvest weights';
  4
  5   ODS HTML style=minimal body='C:\SAS\Appendix02 Slr-Trees.HTML' ;
NOTE: Writing HTML Body file: C:\SAS\Appendix02 Slr-Trees.HTML
  6   ODS rtf style=minimal body='C:\SAS\Appendix02 Slr-Trees.RTF' ;
NOTE: Writing RTF Body file: C:\SAS\Appendix02 Slr-Trees.RTF
  7   filename input1 'C:\SAS\Appendix02 Slr-Trees.csv';
  8   FILENAME OUT1 'C:\SAS\Appendix02 Slr-Trees.CGM';
  9   FILENAME OUT2 'C:\SAS\Appendix02 Slr-Trees.CGM';
 10
 11   ***********************************************;
 12   ***  Data from Freund & Wilson (1993)       ***;
 13   ***  TABLE 8.24 : ESTIMATING TREE WEIGHTS   ***;
 14   ***********************************************;
 15
 16   data one; infile input1 missover DSD dlm="," firstobs=2;
 17     input ObsNo Dbh Height Age Grav Weight;
 18     *********** label ObsNo = 'Original observation number'
 19        Dbh    = 'Diameter at breast height (inches)'
 20        Height = 'Height of the tree (feet)'
 21        Age    = 'Age of the tree (years)'
 22        Grav   = 'Specific gravity of the wood'
 23        Weight = 'Harvest weight of the tree (lbs)'
 24        ObsId  = 'Identification letter added to dataset';
 25     lweight = log(weight);
 26     ldbh = log(DBH);
 27     observation + 1;
 28     if observation ge 27 then ObsID = byte(observation+64-26);  * upper case *;
 29     if observation le 26 then ObsID = byte(observation+96);     * lower case *;
 30     keep Dbh Height Age Grav Weight ldbh lweight obsid;
 31   datalines;

NOTE: The infile INPUT1 is:
      Filename=C:\SAS\Appendix02 Slr-Trees.csv, RECFM=V, LRECL=256,
      File Size (bytes)=1125, Last Modified=18Jan2009:19:36:54,
      Create Time=20Dec2009:11:35:59
NOTE: 47 records were read from the infile INPUT1.
      The minimum record length was 19. The maximum record length was 24.
NOTE: The data set WORK.ONE has 47 observations and 8 variables.
NOTE: DATA statement used (Total process time): real time 0.03 seconds; cpu time 0.04 seconds

 31 ! run;

 33   proc print data=one; TITLE2 'Raw data print'; run;

NOTE: There were 47 observations read from the data set WORK.ONE.
NOTE: The PROCEDURE PRINT printed page 1.
NOTE: PROCEDURE PRINT used (Total process time): real time 0.26 seconds; cpu time 0.07 seconds

EXST7015: Estimating tree weights from other morphometric variables
Raw data print

Obs  ObsNo   Dbh  Height  Age   Grav  Weight  ObsID  lweight     ldbh
  1      1   5.7     34    10  0.409    174     a    5.15906  1.74047
  2      2   8.1     68    17  0.501    745     b    6.61338  2.09186
  3      3   8.3     70    17  0.445    814     c    6.70196  2.11626
  4      4   7.0     54    17  0.442    408     d    6.01127  1.94591
  5      5   6.2     37    12  0.353    226     e    5.42053  1.82455
  6      6  11.4     79    27  0.429   1675     f    7.42357  2.43361
  7      7  11.6     70    26  0.497   1491     g    7.30720  2.45101
  8      8   4.5     37    12  0.380    121     h    4.79579  1.50408
  9      9   3.5     32    15  0.420     58     i    4.06044  1.25276
  .      .    .       .     .      .      .     .          .        .
 46     46   5.2     47    13  0.432    194     T    5.26786  1.64866
 47     47   3.7     33    13  0.389     66     U    4.18965  1.30833

 35   options ls=95 ps=61; proc plot data=one; plot weight*Dbh=obsid;
 36     TITLE2 'Scatter plot'; run;
 37   options ps=256 ls=85;
 38

NOTE: There were 47 observations read from the data set WORK.ONE.
NOTE: The PROCEDURE PLOT printed page 2.
NOTE: PROCEDURE PLOT used (Total process time): real time 0.14 seconds; cpu time 0.00 seconds

EXST7015: Estimating tree weights from other morphometric variables
Scatter plot
Plot of Weight*Dbh. Symbol is value of ObsID.

[Page 2: line-printer scatter plot of Weight (0 to 1800 lbs, vertical axis) against Dbh (3 to 13 inches, horizontal axis), each point plotted as its ObsID letter; the column alignment was lost in extraction. Weight increases with diameter, with the largest trees (f, q, g) in the upper right.]

NOTE: 11 obs hidden.

 39   proc means data=one n mean max min std stderr;
 40     TITLE2 'Raw data means';
 41     var Dbh Height Age Grav Weight; run;

NOTE: There were 47 observations read from the data set WORK.ONE.
NOTE: The PROCEDURE MEANS printed page 3.
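The ObsID letters used as plot symbols come from the data step's character arithmetic: byte(observation+96) yields "a" through "z" for the first 26 observations, and byte(observation+64-26) yields "A", "B", ... for the rest. As an illustration (not part of the SAS program), the same scheme can be sketched in Python, where chr plays the role of SAS's byte function:

```python
# Reproduce the ObsID lettering scheme from the SAS data step:
# observations 1-26 get 'a'-'z', observations 27 onward get 'A', 'B', ...
def obs_id(observation: int) -> str:
    if observation <= 26:
        return chr(observation + 96)   # lower case: 1 -> 'a', 26 -> 'z'
    return chr(observation + 64 - 26)  # upper case: 27 -> 'A'

ids = [obs_id(i) for i in range(1, 48)]  # one letter per tree, 47 trees
print(ids[0], ids[25], ids[26], ids[46])  # a z A U
```

Observation 47 maps to "U", which matches the last ObsID in the raw data print.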
NOTE: PROCEDURE MEANS used (Total process time): real time 0.28 seconds; cpu time 0.04 seconds

EXST7015: Estimating tree weights from other morphometric variables
Raw data means
The MEANS Procedure

Variable     N         Mean      Maximum      Minimum     Variance      Std Dev    Std Error
--------------------------------------------------------------------------------------------
Dbh         47    6.1531915   12.1000000    3.5000000    4.4016744    2.0980168    0.3060272
Height      47   49.5957447   79.0000000   27.0000000  167.6808511   12.9491641    1.8888297
Age         47   16.9574468   27.0000000   10.0000000   26.9111933    5.1876000    0.7566892
Grav        47    0.4452979    0.5080000    0.3530000    0.0014853    0.0385402    0.0056217
Weight      47  369.3404255      1692.00   58.0000000    154916.75  393.5946534   57.4116808
--------------------------------------------------------------------------------------------

 43   proc univariate data=one normal plot;
 44     TITLE2 'Raw data Univariate analysis';
 45     var Weight Dbh; run;

NOTE: The PROCEDURE UNIVARIATE printed pages 4-5.
NOTE: PROCEDURE UNIVARIATE used (Total process time): real time 0.31 seconds; cpu time 0.09 seconds

Appendix02: Estimating tree harvest weights
Raw data Univariate analysis
The UNIVARIATE Procedure
Variable: Weight

Moments
N                          47    Sum Weights                47
Mean               369.340426    Sum Observations        17359
Std Deviation      393.594653    Variance           154916.751
Skewness           2.20870748    Kurtosis           4.83581557
Uncorrected SS       13537551    Corrected SS       7126170.55
Coeff Variation    106.566903    Std Error Mean     57.4116808

Basic Statistical Measures
Location                  Variability
Mean     369.3404         Std Deviation        393.59465
Median   224.0000         Variance                154917
Mode      84.0000         Range                     1634
                          Interquartile Range  341.00000

Note: The mode displayed is the smallest of 3 modes with a count of 2.
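The summary statistics printed by PROC MEANS and PROC UNIVARIATE are related: the standard error of the mean is the standard deviation divided by the square root of n, the coefficient of variation is 100 times the standard deviation over the mean, and the variance is the squared standard deviation. A quick cross-check in Python using the printed Weight values (illustrative only, not part of the SAS program):

```python
import math

# Weight summary statistics copied from the output above
n, mean, sd = 47, 369.340426, 393.594653

stderr = sd / math.sqrt(n)   # Std Error Mean, approx. 57.412
cv = 100 * sd / mean         # Coeff Variation (percent), approx. 106.567
variance = sd ** 2           # Variance, approx. 154916.75

print(round(stderr, 3), round(cv, 3), round(variance, 2))
```

These agree (to rounding of the printed inputs) with the 57.4116808, 106.566903, and 154916.751 shown in the output.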
Tests for Location: Mu0=0
Test           -Statistic-     -----p Value------
Student's t    t  6.433193     Pr > |t|    <.0001
Sign           M      23.5     Pr >= |M|   <.0001
Signed Rank    S       564     Pr >= |S|   <.0001

Tests for Normality
Test                  --Statistic--     -----p Value------
Shapiro-Wilk          W     0.710878    Pr < W     <0.0001
Kolmogorov-Smirnov    D      0.24806    Pr > D     <0.0100
Cramer-von Mises      W-Sq   0.77793    Pr > W-Sq  <0.0050
Anderson-Darling      A-Sq  4.435579    Pr > A-Sq  <0.0050

Quantiles (Definition 5)
Quantile      Estimate
100% Max          1692
99%               1692
95%               1491
90%                814
75% Q3             462
50% Median         224
25% Q1             121
10%                 74
5%                  66
1%                  58
0% Min              58

Extreme Observations
----Lowest----        ----Highest---
Value    Obs          Value    Obs
   58      9            814      3
   60     16            815     26
   66     47           1491      7
   70     35           1675      6
   74     18           1692     17

[Page 150: line-printer stem-and-leaf display, box plot, and normal probability plot for Weight; the column alignment was lost in extraction. Stems run from 0 to 16 (multiply Stem.Leaf by 10**+2); the plots show the strong right skew and high outliers already indicated by the skewness statistic.]

Appendix02: Estimating tree harvest weights
Raw data Univariate analysis
The UNIVARIATE Procedure
Variable: Dbh

Moments
N                          47    Sum Weights                47
Mean               6.15319149    Sum Observations        289.2
Std Deviation      2.09801677    Variance           4.40167438
Skewness           1.17285986    Kurtosis           1.18369068
Uncorrected SS        1981.98    Corrected SS       202.477021
Coeff Variation    34.0963998    Std Error Mean      0.3060272

Basic Statistical Measures
Location                 Variability
Mean     6.153191        Std Deviation      2.09802
Median   5.700000        Variance           4.40167
Mode     4.000000        Range              8.60000
                         Interquartile Range 2.90000

Note: The mode displayed is the smallest of 2 modes with a count of 4.

Tests for Location: Mu0=0
Test           -Statistic-     -----p Value------
Student's t    t  20.10668     Pr > |t|    <.0001
Sign           M      23.5     Pr >= |M|   <.0001
Signed Rank    S       564     Pr >= |S|   <.0001

Tests for Normality
Test                  --Statistic--     -----p Value------
Shapiro-Wilk          W      0.89407    Pr < W      0.0005
Kolmogorov-Smirnov    D     0.171951    Pr > D     <0.0100
Cramer-von Mises      W-Sq  0.214712    Pr > W-Sq  <0.0050
Anderson-Darling      A-Sq  1.387777    Pr > A-Sq  <0.0050

Quantiles (Definition 5)
Quantile      Estimate
100% Max          12.1
99%               12.1
95%               11.4
90%                8.8
75% Q3             7.4
50% Median         5.7
25% Q1             4.5
10%                4.0
5%                 3.7
1%                 3.5
0% Min             3.5

Extreme Observations
----Lowest----        ----Highest---
Value    Obs          Value    Obs
  3.5      9            8.8     26
  3.7     47            9.3     20
  3.7     35           11.4      6
  3.9     37           11.6      7
  4.0     44           12.1     17

[Pages 151-152: line-printer stem-and-leaf display, box plot, and normal probability plot for Dbh; the column alignment was lost in extraction.]

 47   proc reg data=one LINEPRINTER; ID ObsID DBH;
 48     TITLE2 'Simple linear regression';
 49     model Weight = Dbh / clb alpha=0.01;  *** p xpx i influence CLI CLM;
 50     Slope: Test DBH = 180;
 51     Joint: TEST intercept = 0, DBH = 180; run;
      options ls=78 ps=45;
 53     plot residual.*predicted.=obsid / VREF=0; run;
 54   OUTPUT OUT=NEXT1 P=Predicted R=Resid cookd=cooksd dffits=dffits
 55     STUDENT=student rstudent=rstudent lclm=lclm uclm=uclm lcl=lcl ucl=ucl;
 56   run;
 57   options ps=61 ls=95;

NOTE: The data set WORK.NEXT1 has 47 observations and 18 variables.
NOTE: The PROCEDURE REG printed pages 6-9.
NOTE: PROCEDURE REG used (Total process time): real time 0.59 seconds; cpu time 0.28 seconds

Appendix02: Estimating tree harvest weights
Simple linear regression
The REG Procedure
Model: MODEL1
Dependent Variable: Weight

Number of Observations Read    47
Number of Observations Used    47

Analysis of Variance
                             Sum of        Mean
Source             DF       Squares      Square    F Value    Pr > F
Model               1       6455980     6455980     433.49    <.0001
Error              45        670191       14893
Corrected Total    46       7126171

Root MSE          122.03740    R-Square    0.9060
Dependent Mean    369.34043    Adj R-Sq    0.9039
Coeff Var          33.04198

Parameter Estimates
                  Parameter     Standard
Variable    DF     Estimate        Error    t Value    Pr > |t|      99% Confidence Limits
Intercept    1   -729.39630     55.69366     -13.10      <.0001    -879.18914    -579.60346
Dbh          1    178.56371      8.57640      20.82      <.0001     155.49675     201.63067

Test Slope Results for Dependent Variable Weight
                        Mean
Source         DF     Square    F Value    Pr > F
Numerator       1  417.69334       0.03    0.8678
Denominator    45      14893

Test Joint Results for Dependent Variable Weight
                        Mean
Source         DF     Square    F Value    Pr > F
Numerator       2   12807462     859.96    <.0001
Denominator    45      14893

[Page 153: line-printer plot of RESIDUAL against the Predicted Value of Weight, points plotted as ObsID letters with a reference line at zero; the column alignment was lost in extraction. The plot carries the annotation: "Residual plots are a useful tool for detecting various problems: Outliers, Curvature, Non-homogeneous variance, and more."]

 58   proc print data=next1;
 59     TITLE3 'Listing of observation diagnostics';
 60     var ObsId DBH Weight Predicted Resid student rstudent; run;

NOTE: There were 47 observations read from the data set WORK.NEXT1.
NOTE: The PROCEDURE PRINT printed page 10.
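Each Predicted value in the diagnostics listing that follows is just the fitted equation b0 + b1*Dbh, and each Resid is Weight minus that prediction; the t value for a coefficient is its estimate divided by its standard error. A quick cross-check in Python for the first tree (Dbh = 5.7, Weight = 174, observation "a"), using the estimates printed above (illustrative only, not part of the SAS program):

```python
# Parameter estimates copied from the PROC REG output above
b0, b1 = -729.39630, 178.56371
se_b1 = 8.57640

dbh, weight = 5.7, 174           # observation 'a' in the listing
predicted = b0 + b1 * dbh        # about 288.42, as printed
resid = weight - predicted       # about -114.417, as printed

t_slope = b1 / se_b1             # estimate / standard error, about 20.82
print(round(predicted, 2), round(resid, 3), round(t_slope, 2))  # 288.42 -114.417 20.82
```

These match the first row of the Predicted/Resid listing and the Dbh line of the Parameter Estimates table.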
NOTE: PROCEDURE PRINT used (Total process time): real time 0.13 seconds cpu time 0.03 seconds James P. Geaghan - Copyright 2011 Statistical Techniques II Simple Linear Regression: Appendix 2 Annotated SAS example Page 154 Appendix02: Estimating tree harvest weights Simple linear regression Listing of observation diagnostics Obs 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 Obs ID a b c d e f g h i j k l m n o p q r s t u v w x y z A B C D E F G H I J K L M N O P Q R S T U Dbh 5.7 8.1 8.3 7.0 6.2 11.4 11.6 4.5 3.5 6.2 5.7 6.0 5.6 4.0 6.7 4.0 12.1 4.5 8.6 9.3 6.5 5.6 4.3 4.5 7.7 8.8 5.0 5.4 6.0 7.4 5.6 5.5 4.3 4.2 3.7 6.1 3.9 5.2 5.6 7.8 6.1 6.1 4.0 4.0 8.0 5.2 3.7 Weight 174 745 814 408 226 1675 1491 121 58 278 220 342 209 84 313 60 1692 74 515 766 345 210 100 122 539 815 194 280 296 462 200 229 125 84 70 224 99 200 214 712 297 238 89 76 614 194 66 Predicted 288.42 716.97 752.68 520.55 377.70 1306.23 1341.94 74.14 -104.42 377.70 288.42 341.99 270.56 -15.14 466.98 -15.14 1431.22 74.14 806.25 931.25 431.27 270.56 38.43 74.14 645.54 841.96 163.42 234.85 341.99 591.98 270.56 252.70 38.43 20.57 -68.71 359.84 -33.00 199.14 270.56 663.40 359.84 359.84 -15.14 -15.14 699.11 199.14 -68.71 Resid -114.417 28.030 61.317 -112.550 -151.699 368.770 149.057 46.860 162.423 -99.699 -68.417 0.014 -61.560 99.141 -153.981 75.141 260.775 -0.140 -291.252 -165.246 -86.268 -60.560 61.572 47.860 -106.544 -26.964 30.578 45.152 -45.986 -129.975 -70.560 -23.704 86.572 63.429 138.711 -135.842 131.998 0.865 -56.560 48.599 -62.842 -121.842 104.141 91.141 -85.113 -5.135 134.711 student -0.94818 0.23442 0.51389 -0.93392 -1.25650 3.29162 1.33889 0.39083 1.36987 -0.82579 -0.56698 0.00012 -0.51029 0.83095 -1.27635 0.62979 2.38302 -0.00117 -2.44967 -1.40424 -0.71476 -0.50200 0.51447 0.39917 -0.88786 -0.22740 0.25412 0.37452 -0.38092 -1.08081 -0.58489 -0.19655 0.72336 0.53050 1.16676 -1.12516 1.10759 0.00718 -0.46884 
0.40532 -0.52051 -1.00920 0.87285 0.76389 -0.71112 -0.04263 1.13312 rstudent -0.94710 0.23194 0.50965 -0.93256 -1.26484 3.73546 1.35112 0.38712 1.38372 -0.82282 -0.56266 0.00011 -0.50605 0.82804 -1.28558 0.62552 2.52082 -0.00116 -2.60199 -1.42001 -0.71082 -0.49778 0.51022 0.39541 -0.88573 -0.22498 0.25146 0.37092 -0.37727 -1.08288 -0.58057 -0.19444 0.71947 0.52622 1.17159 -1.12858 1.11046 0.00710 -0.46474 0.40153 -0.51625 -1.00941 0.87050 0.76031 -0.70716 -0.04215 1.13679 61 proc print data=next1; 62 TITLE3 'Listing of observation diagnostics'; 63 var ObsId cooksd dffits lclm uclm lcl ucl; run; NOTE: There were 47 observations read from the data set WORK.NEXT1. NOTE: The PROCEDURE PRINT printed page 11. NOTE: PROCEDURE PRINT used (Total process time): real time 0.12 seconds cpu time 0.03 seconds 64 options ps=512 ls=85; James P. Geaghan - Copyright 2011 Statistical Techniques II Simple Linear Regression: Appendix 2 Annotated SAS example Page 155 Appendix02: Estimating tree harvest weights Simple linear regression Listing of observation diagnostics Obs 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 Obs ID a b c d e f g h i j k l m n o p q r s t u v w x y z A B C D E F G H I J K L M N O P Q R S T U cooksd 0.01025 0.00114 0.00608 0.01110 0.01717 1.01075 0.18073 0.00275 0.05571 0.00742 0.00366 0.00000 0.00304 0.01596 0.01896 0.00917 0.69191 0.00000 0.16073 0.07442 0.00571 0.00294 0.00526 0.00287 0.01349 0.00153 0.00092 0.00173 0.00159 0.01742 0.00399 0.00046 0.01040 0.00588 0.03658 0.01377 0.02981 0.00000 0.00256 0.00295 0.00295 0.01108 0.01761 0.01348 0.01002 0.00002 0.03450 dffits -0.14301 0.04734 0.10939 -0.14877 -0.18654 1.61350 0.60670 0.07348 0.33716 -0.12135 -0.08496 0.00002 -0.07728 0.17801 -0.19616 0.13447 1.24438 -0.00022 -0.60223 -0.39013 -0.10629 -0.07602 0.10174 0.07505 -0.16386 -0.05473 0.04256 0.05826 -0.05578 -0.18699 -0.08866 -0.03009 0.14346 0.10758 0.27160 -0.16646 
0.24481 0.00115 -0.07097 0.07610 -0.07614 -0.14888 0.18714 0.16345 -0.14078 -0.00686 0.26353 lclm 239.41 651.33 683.80 468.84 329.81 1176.08 1207.49 12.93 -182.13 329.81 239.41 293.98 221.01 -84.13 417.47 -84.13 1285.93 12.93 732.24 844.29 382.73 221.01 -25.76 12.93 585.83 764.38 108.65 183.92 293.98 536.12 221.01 202.51 -25.76 -45.17 -142.83 311.95 -103.66 146.45 221.01 602.28 311.95 311.95 -84.13 -84.13 635.03 146.45 -142.83 uclm 337.42 782.61 821.56 572.26 425.59 1436.38 1476.40 135.35 -26.72 425.59 337.42 389.99 320.11 53.84 516.49 53.84 1576.51 135.35 880.26 1018.20 479.81 320.11 102.61 135.35 705.25 919.55 218.19 285.78 389.99 647.83 320.11 302.90 102.61 86.31 5.41 407.74 37.67 251.82 320.11 724.52 407.74 407.74 53.84 53.84 763.20 251.82 5.41 lcl -43.45 382.24 417.30 188.27 45.99 953.14 987.24 -259.75 -441.73 45.99 -43.45 10.26 -61.39 -350.54 135.04 -350.54 1072.28 -259.75 469.78 591.69 99.47 -61.39 -296.02 -259.75 311.93 504.69 -169.35 -97.31 10.26 259.03 -61.39 -79.34 -296.02 -314.18 -405.21 28.14 -368.75 -133.30 -61.39 329.53 28.14 28.14 -350.54 -350.54 364.69 -133.30 -405.21 ucl 620.28 1051.70 1088.06 852.83 709.40 1659.32 1696.64 408.03 232.88 709.40 620.28 673.71 602.51 320.26 798.92 320.26 1790.17 408.03 1142.72 1270.80 763.07 602.51 372.87 408.03 979.16 1179.24 496.19 567.01 673.71 924.92 602.51 584.75 372.87 355.32 267.79 691.55 302.75 531.57 602.51 997.27 691.55 691.55 320.26 320.26 1033.54 531.57 267.79 66 proc univariate data=next1 normal plot; var Resid; 67 TITLE3 'Residual analysis'; run; NOTE: The PROCEDURE UNIVARIATE printed page 12. NOTE: PROCEDURE UNIVARIATE used (Total process time): real time 0.14 seconds cpu time 0.04 seconds 68 James P. 
Geaghan - Copyright 2011

EXST7015: Estimating tree weights from other morphometric variables
Simple linear regression
Residual analysis
The UNIVARIATE Procedure
Variable: E (Residual)

Moments
N                          47    Sum Weights                47
Mean                        0    Sum Observations            0
Std Deviation      120.703619    Variance           14569.3637
Skewness           0.47869472    Kurtosis           1.04153074
Uncorrected SS     670190.732    Corrected SS       670190.732
Coeff Variation             .    Std Error Mean     17.6064324

Basic Statistical Measures
Location                  Variability
Mean      0.00000         Std Deviation        120.70362
Median   -0.14041         Variance                 14569
Mode      .               Range                660.02160
                          Interquartile Range  161.40929

Tests for Location: Mu0=0
Test           -Statistic-     -----p Value------
Student's t    t         0     Pr > |t|    1.0000
Sign           M      -0.5     Pr >= |M|   1.0000
Signed Rank    S       -25     Pr >= |S|   0.7946

Tests for Normality
Test                  --Statistic--     -----p Value------
Shapiro-Wilk          W     0.973389    Pr < W      0.3544
Kolmogorov-Smirnov    D     0.084574    Pr > D     >0.1500
Cramer-von Mises      W-Sq  0.044081    Pr > W-Sq  >0.2500
Anderson-Darling      A-Sq  0.354877    Pr > A-Sq  >0.2500

Quantiles (Definition 5)
Quantile         Estimate
100% Max       368.769960
99%            368.769960
95%            162.423301
90%            138.710558
75% Q3          75.141444
50% Median      -0.140413
25% Q1         -86.267841
10%           -135.842356
5%            -153.980584
1%            -291.251641
0% Min        -291.251641

Extreme Observations
------Lowest------        ------Highest-----
   Value    Obs              Value    Obs
-291.252     19            138.711     35
-165.246     20            149.057      7
-153.981     15            162.423      9
-151.699      5            260.775     17
-135.842     36            368.770      6

James P.
Geaghan - Copyright 2011

[Page 157: line-printer stem-and-leaf display, box plot, and normal probability plot of the residuals; the column alignment was lost in extraction. Stems run from -2 to 3 (multiply Stem.Leaf by 10**+2); the residuals follow the normal reference line closely except for one high value near 369.]

110   GOPTIONS DEVICE=CGMflwa GSFMODE=REPLACE GSFNAME=OUT NOPROMPT noROTATE
111     ftext='TimesRoman' ftitle='TimesRoman' htext=1 htitle=1 ctitle=black ctext=black;
112
113   GOPTIONS GSFNAME=OUT1;
114   FILENAME OUT1 'C:\SAS\SLR-Trees1.CGM';

NOTE: There were 47 observations read from the data set WORK.ONE.
NOTE: The PROCEDURE PLOT printed page 14.
NOTE: PROCEDURE PLOT used: real time 0.09 seconds; cpu time 0.04 seconds

115   PROC GPLOT DATA=ONE;
116     TITLE1 font='TimesRoman' H=1 'Simple Linear Regression Example';
117     TITLE2 font='TimesRoman' H=1 'Wood harvest from trees';
118     PLOT weight*Dbh=1 weight*Dbh=2 / overlay HAXIS=AXIS1 VAXIS=AXIS2;
119     AXIS1 LABEL=(font='TimesRoman' H=1 'Diameter at breast height (inches)') WIDTH=1 MINOR=(N=1)
120       VALUE=(font='TimesRoman' H=1) color=black ORDER=3 TO 13 BY 1;
121     AXIS2 LABEL=(ANGLE=90 font='TimesRoman' H=1 'Weight of wood harvested (lbs)') WIDTH=1
122       VALUE=(font='TimesRoman' H=1) MINOR=(N=5) color=black ORDER=0 TO 1800 BY 200;
123     SYMBOL1 color=red V=None I=RLcli99 L=1 MODE=INCLUDE;
124     SYMBOL2 color=blue V=dot I=None L=1 MODE=INCLUDE; RUN;

NOTE: Regression equation : Weight = -729.3963 + 178.5637*Dbh.
NOTE: Foreground color BLACK same as background. Part of your graph may not be visible.
NOTE: 52 RECORDS WRITTEN TO C:\SAS\SLR-Trees1.CGM

125   **** V = "dot" would place a dot for each point;
126   **** I = for regression: R requests a fitted regression line; L, Q or C requests Linear,
127        Quadratic or cubic; CLM or CLI requests the corresponding confidence interval and
128        95 specifies the alpha level for the CI (any value from 50 to 99);
129   **** I = for categories: requests STD (std dev) 1 (1 width, 2 or 3) M (of mean=std err)
130        J (join means of bars) t (add top & bottom hash) p (use pooled variance);
131   **** Other options for categories: omit M=std dev, use B to get bar for min/max;
132   RUN;

NOTE: There were 47 observations read from the data set WORK.ONE.
NOTE: PROCEDURE GPLOT used: real time 0.22 seconds; cpu time 0.10 seconds

[Page 158: GPLOT graph "Simple Linear Regression Example / Wood harvest from trees": Weight of wood harvested (lbs, 0 to 1800) plotted against Diameter at breast height (inches, 3 to 13), with the data as dots and the fitted regression line with 99% confidence limits overlaid.]
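The Slope test shown earlier (Test DBH = 180) illustrates how PROC REG's TEST statement handles a single linear hypothesis: the F statistic is the squared distance of the estimate from the hypothesized value in standard-error units, which equals the test's numerator mean square divided by the error mean square. A cross-check in Python using the printed values (illustrative only, not part of the SAS program):

```python
# Values copied from the PROC REG output above
b1, se = 178.56371, 8.57640      # slope estimate and its standard error
mse = 14893                      # error mean square from the ANOVA table
num_ms = 417.69334               # numerator mean square for the Slope test

f_from_estimate = ((b1 - 180) / se) ** 2   # squared t statistic for H0: slope = 180
f_from_ms = num_ms / mse                   # same statistic computed from mean squares

print(round(f_from_estimate, 2), round(f_from_ms, 2))  # 0.03 0.03
```

Both routes reproduce the F Value of 0.03 (Pr > F = 0.8678) printed for the Slope test, so a slope of 180 lbs per inch is entirely consistent with these data.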

This note was uploaded on 12/29/2011 for the course EXST 7015 taught by Professor Wang, J. during the Fall '08 term at LSU.
