EXST 7015 Fall 2011, Lecture 04 - Statistical Techniques II (James P. Geaghan, Copyright 2011)

Numerical Example of a Simple Linear Regression
(data from Freund, Mohr & Wilson (2010): Estimating tree weights, Table 8.26)

Obs   Dbh   Weight   Dbh*Dbh      Wt*Wt    Dbh*Wt   Predicted   Residual
  1   5.7      174     32.49      30276     991.8      288.42    -114.42
  2   8.1      745     65.61     555025    6034.5      716.97      28.03
  3   8.3      814     68.89     662596    6756.2      752.68      61.32
  4   7.0      408     49.00     166464    2856.0      520.55    -112.55
  5   6.2      226     38.44      51076    1401.2      377.70    -151.70
  6  11.4     1675    129.96    2805625   19095.0     1306.23     368.77
  7  11.6     1491    134.56    2223081   17295.6     1341.94     149.06
  8   4.5      121     20.25      14641     544.5       74.14      46.86
  9   3.5       58     12.25       3364     203.0     -104.42     162.42
 10   6.2      278     38.44      77284    1723.6      377.70     -99.70
 11   5.7      220     32.49      48400    1254.0      288.42     -68.42
 12   6.0      342     36.00     116964    2052.0      341.99       0.01
 13   5.6      209     31.36      43681    1170.4      270.56     -61.56
 14   4.0       84     16.00       7056     336.0      -15.14      99.14
 15   6.7      313     44.89      97969    2097.1      466.98    -153.98
 16   4.0       60     16.00       3600     240.0      -15.14      75.14
 17  12.1     1692    146.41    2862864   20473.2     1431.22     260.78
 18   4.5       74     20.25       5476     333.0       74.14      -0.14
 19   8.6      515     73.96     265225    4429.0      806.25    -291.25
 20   9.3      766     86.49     586756    7123.8      931.25    -165.25
 21   6.5      345     42.25     119025    2242.5      431.27     -86.27
 22   5.6      210     31.36      44100    1176.0      270.56     -60.56
 23   4.3      100     18.49      10000     430.0       38.43      61.57
 24   4.5      122     20.25      14884     549.0       74.14      47.86
 25   7.7      539     59.29     290521    4150.3      645.54    -106.54
 26   8.8      815     77.44     664225    7172.0      841.96     -26.96
 27   5.0      194     25.00      37636     970.0      163.42      30.58
 28   5.4      280     29.16      78400    1512.0      234.85      45.15
 29   6.0      296     36.00      87616    1776.0      341.99     -45.99
 30   7.4      462     54.76     213444    3418.8      591.98    -129.98
 31   5.6      200     31.36      40000    1120.0      270.56     -70.56
 32   5.5      229     30.25      52441    1259.5      252.70     -23.70
 33   4.3      125     18.49      15625     537.5       38.43      86.57
 34   4.2       84     17.64       7056     352.8       20.57      63.43
 35   3.7       70     13.69       4900     259.0      -68.71     138.71
 36   6.1      224     37.21      50176    1366.4      359.84    -135.84
 37   3.9       99     15.21       9801     386.1      -33.00     132.00
 38   5.2      200     27.04      40000    1040.0      199.14       0.86
 39   5.6      214     31.36      45796    1198.4      270.56     -56.56
 40   7.8      712     60.84     506944    5553.6      663.40      48.60
 41   6.1      297     37.21      88209    1811.7      359.84     -62.84
 42   6.1      238     37.21      56644    1451.8      359.84    -121.84
 43   4.0       89     16.00       7921     356.0      -15.14     104.14
 44   4.0       76     16.00       5776     304.0      -15.14      91.14
 45   8.0      614     64.00     376996    4912.0      699.11     -85.11
 46   5.2      194     27.04      37636    1008.8      199.14      -5.14
 47   3.7       66     13.69       4356     244.2      -68.71     134.71

Sum   289.2   17359   1981.98   13537551  142968.3
Mean  6.15   369.34     42.17     288033    3041.9
n        47      47        47         47        47

For the Residual column: Sum = 0, SS = 670190.7322.
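A minimal sketch (added here, not part of the original notes) of how the derived columns and the n, Sum and Mean rows of the table above could be reproduced in SAS. It assumes a data set named TREES holding the 47 Dbh and Weight values listed above; the Appendix 2 program later in these notes builds an equivalent data set named ONE.

   /* TREES is an assumed input data set with variables Dbh and Weight */
   data table13;
      set trees;
      DbhSq = Dbh*Dbh;        * third column of the table above   ;
      WtSq  = Weight*Weight;  * fourth column                     ;
      DbhWt = Dbh*Weight;     * fifth column, the cross-product   ;
   run;

   proc means data=table13 n sum mean;
      var Dbh Weight DbhSq WtSq DbhWt;   * reproduces the n, Sum and Mean rows ;
   run;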
Intermediate Calculations

Sum X  = sum(Xi)    = 289.2          Sum Y  = sum(Yi)    = 17359
Sum X2 = sum(Xi^2)  = 1981.98        Sum Y2 = sum(Yi^2)  = 13537551
Sum XY = sum(Xi*Yi) = 142968.3
Mean X = Xbar = 6.153191489          Mean Y = Ybar = 369.3404255          n = 47

Correction factors and corrected values (sums of squares and cross-products)
CF for X  = Cxx = (Sum X)^2 / n       = 1779.502979
CF for Y  = Cyy = (Sum Y)^2 / n       = 6411380.447
CF for XY = Cxy = (Sum X)(Sum Y) / n  = 106813.2511
Corrected SS X  = Sxx = Sum X2 - Cxx  = 202.4770213
Corrected SS Y  = Syy = Sum Y2 - Cyy  = 7126170.553
Corrected SS XY = Sxy = Sum XY - Cxy  = 36155.04894

Model Parameter Estimates
Slope     = b1 = Sxy / Sxx = 36155.04894 / 202.4770213 = 178.5637141
Intercept = b0 = Ybar - b1*Xbar = 369.3404255 - 178.5637141 * 6.153191489 = -729.3963003

Regression line: Yi = b0 + b1*Xi + ei = -729.3963003 + 178.5637141*Xi + ei

ANOVA Table
SSTotal = 7126170.553 (the uncorrected value was USSTotal = 13537551)
SSRegression = Sxy^2 / Sxx = 36155.04894^2 / 202.4770213 = 6455979.821
SSError = 7126170.553 - 6455979.821 = 670190.7322

Source        df    SS             MS             F
Regression     1    6455979.821    6455979.821    433.4871821
Error         45     670190.732      14893.128
Total         46    7126170.553

Standard error of b1: Sb1 = sqrt( MSE / sum((Xi - Xbar)^2) ) = sqrt( 14893.128 / 202.4770213 ) = 8.576401034,
with t(0.05/2, 45 df) = 2.014103.
P( 178.5637 - 2.0141*8.5764 <= β1 <= 178.5637 + 2.0141*8.5764 ) = 0.95
P( 161.289956 <= β1 <= 195.8375 ) = 0.95

Testing b1 against a specified value: H0: β1 = 200 versus H1: β1 not equal to 200
t = (b1 - β1|H0) / Sb1 = (178.5637141 - 200) / 8.576401034 = -2.49945
Note that t^2 = F = 6.247251; SAS would do this test as an F test. This allows multiple degree of freedom tests that test several parameters jointly.

Regression Diagnostic Criteria (Appendix 1 Supplement)

Criteria for the interpretation of selected regression statistics from the SAS output. The reference was primarily Neter, J., Kutner, M. H., Nachtsheim, C. J., and Wasserman, W., Applied Linear Statistical Models, 4th Edition, Richard D. Irwin, Inc., Burr Ridge, Illinois, 1996.

General regression diagnostics

Adjusted R2: R2adj = 1 - (SSError/SSTotal)*((n-1)/(n-p)) = 1 - (1 - R2)*((n-1)/(n-p))
This is intended to be an adjustment to R2 for additional variables in the model. Unlike the usual R2, this value can decrease as more variables are entered in the model if the variables do not account for sufficient additional variation (equal to the MSE).

Standardized regression coefficient bj': bj' = bj * (Sxj / Sy)
Unlike the usual regression coefficient, the magnitude of the standardized coefficient provides a meaningful comparison among the regression coefficients. Larger standardized regression coefficients have more impact on the calculation of the predicted value and are more "important".

Partial correlations
Squared semi-partial correlation TYPE I  = SCORR1 = SeqSSXj / SSTotal
Squared partial correlation TYPE I       = PCORR1 = SeqSSXj / (SeqSSXj + SSError)
Squared semi-partial correlation TYPE II = SCORR2 = PartialSSXj / SSTotal
Squared partial correlation TYPE II      = PCORR2 = PartialSSXj / (PartialSSXj + SSError)
Note that for regression, TYPE II SS and TYPE III SS are the same.

Residual Diagnostics

The hat matrix main diagonal elements, hii ("Hat Diag" H values in SAS), are called "leverage values"; they are used to detect unusual observations in the X space. They can also identify substantial extrapolation for new values. As a general rule, hii values greater than 0.5 are "large" and those between 0.2 and 0.5 are moderately large; also look for a leverage value that is noticeably larger than the next largest. The hii values sum to p, so the mean hii is p/n (note that this is < 1). A value may be an "outlier" in the X space if it is more than twice the mean value (i.e. hii > 2p/n).
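As a quick worked check (added here, not in the original notes): for a simple linear regression the leverage reduces to hii = 1/n + (Xi - Xbar)^2 / Sxx. For the largest tree in the example above (observation 17, Dbh = 12.1),
   hii = 1/47 + (12.1 - 6.1532)^2 / 202.477 = 0.0213 + 0.1747 = 0.196,
which is just below the 0.2 "moderately large" guideline but well above twice the mean leverage, 2p/n = 4/47 = 0.085, so this observation would be flagged for a closer look.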
Studentized residuals ("Student Residual" in SAS), also called internally studentized residuals. There are two versions:
   Simpler calculation      = ei / sqrt(MSE)
   More common application  = ei / sqrt(MSE * (1 - hii))   [SAS produces these]
We already assume the residuals are normally distributed, so these values should approximately follow a t distribution, where for large samples
   about 65% are between -1 and +1,
   about 95% are between -2 and +2,
   about 99% are between -2.6 and +2.6.

Deleted studentized residuals ("RStudent" in SAS), also called externally studentized residuals. There are also two versions, as with the studentized residuals above:
   Deleted studentized            = e(i) / sqrt(MSE(i))
   Deleted internally studentized = e(i) / sqrt(MSE(i) * (1 - hii))   [SAS produces these values]
As with the studentized residuals above, these values approximately follow a t distribution.

Influence Diagnostics

DFFITS: an influence statistic; it measures the difference in fits, as judged by the change in the predicted value when the point is omitted. This is a standardized value and can be interpreted as a number of standard deviation units. For small to medium sized data sets, DFFITS should not exceed 1, while for large data sets it should not exceed 2*sqrt(p/n).

DFBETAS: an influence statistic; it measures the difference in fits as judged by the change in the values of the regression coefficients when the point is omitted. Note that this is also a standardized value. For small to medium sized data sets, DFBETAS should not exceed 1, while for large data sets it should not exceed 2/sqrt(n).

Cook's D: an influence statistic (D is for distance). It relates the change in the fitted coefficients, when the observation is omitted, to a simultaneous confidence region for all the regression coefficients. It does not follow an F distribution, but it is useful to compare it to the percentiles of the F distribution, F(1-alpha; p, n-p): a value below the 10th or 20th percentile shows little effect, while a value near the 50th percentile is considered large.

Multicollinearity Diagnostics

VIF is related to the severity of multicollinearity. For standardized regression coefficient estimates, the VIF would be expected to have a value of 1 if the regressors are uncorrelated. If the mean of the VIFs is much greater than 2, serious problems are indicated, and no single VIF should exceed 10.

Tolerance is the inverse of VIF, where Tolerance(k) = 1 - Rk2.

The condition number (a multivariate evaluation). Eigenvalues are extracted from the regressors; these are variances of linear combinations of the regressors, and they go from larger to smaller. If one or more are zero (at the end), then the matrix is not of full rank. The eigenvalues sum to p, and if the Xk are independent each would equal 1. The condition number is the square root of the ratio of the largest eigenvalue (always the first) to each of the others. If this value exceeds 30, multicollinearity may be a problem.
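The tree example has only one regressor, so multicollinearity is not an issue there; for a multiple regression these criteria correspond to options on the PROC REG MODEL statement. A minimal sketch follows; the data set MYDATA and the regressors X1-X3 are placeholders, not variables from these notes.

   proc reg data=mydata;              /* placeholder data set and regressors */
      model y = x1 x2 x3 / stb        /* standardized coefficients bj'              */
                           scorr1 pcorr1   /* Type I squared semi-partial / partial  */
                           scorr2 pcorr2   /* Type II squared semi-partial / partial */
                           vif tol         /* variance inflation factors, tolerances */
                           collin;         /* eigenvalues and condition indices      */
   run;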
Model Evaluation and Validation

R2p, AdjR2p and MSEp can be used to graphically compare and evaluate models. The subscript p refers to the number of parameters in the model.

Mallow's Cp criterion. Use of this statistic presumes no bias in the full model MSE, so the full model should be carefully chosen to have little or no multicollinearity.
   Cp = (SSEp / TrueMSE) - (n - 2p)
The Cp statistic will be approximately equal to p if there is no bias in the regression model.

PRESSp criterion (PRESS = Prediction SS). This criterion is based on deleted residuals. There are n deleted residuals in each regression, and PRESSp is the SS of the deleted residuals. This value should be approximately equal to the error SS if predictions are good, and it gets larger as the predictions get poorer. The PRESS statistics may be plotted, and the models with smaller PRESS statistics represent better predictive models. This statistic can also be used for model validation.

Simple Linear Regression: Appendix 2 - Annotated SAS Example

The SAS program. I will presume you are familiar with the SAS data step, so I will discuss it briefly only for this first example.

SAS statements: all SAS statements end in a semicolon.

Comments: comments are statements that start with an asterisk. They do nothing in the program; they are included only for the purpose of documenting the program.

Options can be specified to modify output appearance. The OPTIONS statement below sets a page size (ps) of 256 lines (use 54 for the lab) and a line size of 80 character columns, and suppresses the centering of output and the printing of the date and page numbers.

The DATA step. All our programs will include a DATA section. In this section the data to be analyzed are entered into the SAS system and, if necessary, modified for analysis. A second statement informs SAS that the data are included in the program (CARDS) and that, if there are missing values, the system should NOT go to the next line to get the data (MISSOVER).

The next statement in my program is a TITLE statement. Up to 9 titles can be active (TITLE1 through TITLE9) and, once set, they are printed at the top of each page. Setting a new title, say TITLE3, would not affect lower numbered titles (TITLE1 and TITLE2) but would delete all higher numbered titles (TITLE4, ...). The TITLE statement ends in a semicolon as usual, and the text to be used as the title is enclosed in single quotes.

The INPUT statement. Along with the DATA statement, this is an important statement. It names the variables to be used, tells SAS what type of variables they are (numeric or alphanumeric), and gets the data into the SAS data set. Note that only one variable in the list is followed by a $. This will cause SAS to assume that all variables are numeric except the variable called OBSID. The variable OBSID is one I created by adding a different letter to each observation: the first line got an "a", the second a "b", etc.; the 26th observation got a "z" and the 27th an "A", etc. This was done to have a way of distinguishing each observation.

The LABEL statement provides a way of identifying each variable. It is optional, but if present it will be used by SAS in a number of places to identify the variables.
   label ObsNo = 'Original observation number'
         Dbh   = 'Diameter at breast height (inches)'
         etc. ... ;
I have deactivated the labels by making them a comment statement.

If data must be modified, it is done in the data step after the INPUT statement. I have two statements that create logarithms. These are not used in the first analysis, but will be used later in the semester.
   lweight = log(weight);
   ldbh = log(DBH);
These statements create two new variables (LWEIGHT and LDBH) that are the natural logs of the original variables.

Two last statements come before the data. The CARDS statement tells SAS that the data step is done and data follows. The RUN statement tells SAS to process all the information that it has so far and output any messages about the analysis to the LOG.
   cards; run;
Note that two statements can occur on the same line. The SAS DATA step is now complete. The data will be entered into the SAS system and processing will continue. The rest of the statements in this program are procedures (PROCs) and associated statements.

The full program:

dm'log;clear;output;clear';
options ps=512 ls=120 nocenter nodate nonumber FORMCHAR="|----|+|---+=|-/\<>*";
TITLE1 'Appendix02: Estimating tree harvest weights';
ODS HTML style=minimal body='C:\SAS\Appendix02 Slr-Trees.HTML';
ODS rtf  style=minimal body='C:\SAS\Appendix02 Slr-Trees.RTF';
filename input1 'C:\SAS\Appendix02 Slr-Trees.csv';
FILENAME OUT1 'C:\SAS\Appendix02 Slr-Trees01.CGM';
FILENAME OUT2 'C:\SAS\Appendix02 Slr-Trees02.CGM';
***********************************************;
*** Data from Freund & Wilson (1993)        ***;
*** TABLE 8.24 : ESTIMATING TREE WEIGHTS    ***;
***********************************************;
data one; infile input1 missover DSD dlm="," firstobs=2;
   input ObsNo Dbh Height Age Grav Weight;
   *********** label ObsNo = 'Original observation number'
      Dbh    = 'Diameter at breast height (inches)'
      Height = 'Height of the tree (feet)'
      Age    = 'Age of the tree (years)'
      Grav   = 'Specific gravity of the wood'
      Weight = 'Harvest weight of the tree (lbs)'
      ObsId  = 'Identification letter added to dataset';
   lweight = log(weight);
   ldbh = log(DBH);
   observation + 1;
   if observation ge 27 then ObsID = byte(observation+64-26);  * upper case *;
   if observation le 26 then ObsID = byte(observation+96);     * lower case *;
   keep Dbh Height Age Grav Weight ldbh lweight obsid;
datalines;
run;
proc print data=one; TITLE2 'Raw data print'; run;
options ls=95 ps=61;
proc plot data=one; plot weight*Dbh=obsid; TITLE2 'Scatter plot'; run;
options ps=256 ls=85;
proc means data=one n mean max min std stderr; TITLE2 'Raw data means';
   var Dbh Height Age Grav Weight;
run;
proc univariate data=one normal plot; TITLE2 'Raw data Univariate analysis';
   var Weight Dbh;
run;
proc reg data=one LINEPRINTER; ID ObsID DBH;
   TITLE2 'Simple linear regression';
   model Weight = Dbh / clb alpha=0.01;   *** p xpx i influence CLI CLM;
   Slope: Test DBH = 180;
   Joint: TEST intercept = 0, DBH = 180;
run;
options ls=78 ps=45;
   plot residual.*predicted.=obsid / VREF=0; run;
   OUTPUT OUT=NEXT1 P=Predicted R=Resid cookd=cooksd dffits=dffits
      STUDENT=student rstudent=rstudent lclm=lclm uclm=uclm lcl=lcl ucl=ucl;
run;
options ps=61 ls=95;
proc print data=next1; TITLE3 'Listing of observation diagnostics';
   var ObsId DBH Weight Predicted Resid student rstudent;
run;
proc print data=next1; TITLE3 'Listing of observation diagnostics';
   var ObsId cooksd dffits lclm uclm lcl ucl;
run;
options ps=512 ls=85;
proc univariate data=next1 normal plot; var Resid; TITLE3 'Residual analysis'; run;
options ls=95 ps=61;
proc plot data=one; plot weight*Dbh=obsid; TITLE2 'Scatter plot'; run;
options ps=512 ls=85;
ods html close; ods rtf close;
run; quit;
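The OUTPUT statement in the PROC REG step writes the case diagnostics to the data set NEXT1. As a possible follow-up step (this is a sketch added here, not part of the original program), the Appendix 1 guidelines could be applied to that data set to flag influential observations. With p = 2 parameters and n = 47, twice the mean leverage is 2p/n = 0.085 and the large-data-set DFFITS cutoff 2*sqrt(p/n) is about 0.41.

   /* follow-up step, not in the original Appendix 2 program */
   data influential;
      set next1;
      * keep observations that look extreme on any of the Appendix 1 criteria ;
      if abs(rstudent) > 2               /* large deleted studentized residual     */
         or abs(dffits) > 1              /* small/medium data set DFFITS guideline */
         or cooksd > finv(0.50, 2, 45);  /* Cook's D near the 50th percentile of F */
   run;

   proc print data=influential;
      var ObsID Dbh Weight Predicted Resid rstudent dffits cooksd;
   run;

For these data the rule flags observations 6, 17 and 19 (ObsID f, q and s in the listings that follow).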
SAS log for the DATA step and the first PROC PRINT. The log first echoes the DATA step statements (log lines 1-31, identical to the program listing above) and then reports:

NOTE: Writing HTML Body file: C:\SAS\Appendix02 Slr-Trees.HTML
NOTE: Writing RTF Body file: C:\SAS\Appendix02 Slr-Trees.RTF
NOTE: The infile INPUT1 is: Filename=C:\SAS\Appendix02 Slr-Trees.csv, RECFM=V, LRECL=256,
      File Size (bytes)=1125, Last Modified=18Jan2009:19:36:54, Create Time=20Dec2009:11:35:59
NOTE: 47 records were read from the infile INPUT1.
      The minimum record length was 19. The maximum record length was 24.
NOTE: The data set WORK.ONE has 47 observations and 8 variables.
NOTE: DATA statement used (Total process time): real time 0.03 seconds, cpu time 0.04 seconds
31 !  run;
33    proc print data=one; TITLE2 'Raw data print'; run;
NOTE: There were 47 observations read from the data set WORK.ONE.
NOTE: The PROCEDURE PRINT printed page 1.
NOTE: PROCEDURE PRINT used (Total process time): real time 0.26 seconds, cpu time 0.07 seconds

EXST7015: Estimating tree weights from other morphometric variables - Raw data print
(first nine and last two observations shown in the preview)

Obs  ObsNo   Dbh  Height  Age   Grav  Weight  ObsID  lweight     ldbh
  1      1   5.7      34   10  0.409     174    a    5.15906  1.74047
  2      2   8.1      68   17  0.501     745    b    6.61338  2.09186
  3      3   8.3      70   17  0.445     814    c    6.70196  2.11626
  4      4   7.0      54   17  0.442     408    d    6.01127  1.94591
  5      5   6.2      37   12  0.353     226    e    5.42053  1.82455
  6      6  11.4      79   27  0.429    1675    f    7.42357  2.43361
  7      7  11.6      70   26  0.497    1491    g    7.30720  2.45101
  8      8   4.5      37   12  0.380     121    h    4.79579  1.50408
  9      9   3.5      32   15  0.420      58    i    4.06044  1.25276
 ...
 46     46   5.2      47   13  0.432     194    T    5.26786  1.64866
 47     47   3.7      33   13  0.389      66    U    4.18965  1.30833

35    options ls=95 ps=61; proc plot data=one; plot weight*Dbh=obsid;
36    TITLE2 'Scatter plot'; run;
37    options ps=256 ls=85;
NOTE: There were 47 observations read from the data set WORK.ONE.
NOTE: The PROCEDURE PLOT printed page 2.
NOTE: PROCEDURE PLOT used (Total process time): real time 0.14 seconds, cpu time 0.00 seconds

EXST7015: Estimating tree weights from other morphometric variables - Scatter plot
Plot of Weight*Dbh. Symbol is value of ObsID.
(The line-printer scatter plot of Weight, 0 to 1800 lbs, against Dbh, 3 to 13 inches, is not reproduced here. NOTE: 11 obs hidden.)

39    proc means data=one n mean max min std stderr;
40    TITLE2 'Raw data means';
41    var Dbh Height Age Grav Weight; run;
NOTE: There were 47 observations read from the data set WORK.ONE.
NOTE: The PROCEDURE MEANS printed page 3.
NOTE: PROCEDURE MEANS used (Total process time): real time 0.28 seconds, cpu time 0.04 seconds

EXST7015: Estimating tree weights from other morphometric variables - Raw data means
The MEANS Procedure
Variable    N          Mean       Maximum      Minimum      Variance       Std Dev     Std Error
Dbh        47     6.1531915    12.1000000    3.5000000     4.4016744     2.0980168     0.3060272
Height     47    49.5957447    79.0000000   27.0000000   167.6808511    12.9491641     1.8888297
Age        47    16.9574468    27.0000000   10.0000000    26.9111933     5.1876000     0.7566892
Grav       47     0.4452979     0.5080000    0.3530000     0.0014853     0.0385402     0.0056217
Weight     47   369.3404255       1692.00   58.0000000     154916.75   393.5946534    57.4116808

43    proc univariate data=one normal plot;
44    TITLE2 'Raw data Univariate analysis';
45    var Weight Dbh; run;
NOTE: The PROCEDURE UNIVARIATE printed pages 4-5.
NOTE: PROCEDURE UNIVARIATE used (Total process time): real time 0.31 seconds, cpu time 0.09 seconds

Appendix02: Estimating tree harvest weights - Raw data Univariate analysis
The UNIVARIATE Procedure, Variable: Weight

Moments
N                47            Sum Weights         47
Mean             369.340426    Sum Observations    17359
Std Deviation    393.594653    Variance            154916.751
Skewness         2.20870748    Kurtosis            4.83581557
Uncorrected SS   13537551      Corrected SS        7126170.55
Coeff Variation  106.566903    Std Error Mean      57.4116808

Basic Statistical Measures
Location                 Variability
Mean     369.3404        Std Deviation        393.59465
Median   224.0000        Variance                154917
Mode      84.0000        Range                     1634
                         Interquartile Range  341.00000
Note: The mode displayed is the smallest of 3 modes with a count of 2.

Tests for Location: Mu0=0
Test           Statistic         p Value
Student's t    t  6.433193       Pr > |t|   <.0001
Sign           M  23.5           Pr >= |M|  <.0001
Signed Rank    S  564            Pr >= |S|  <.0001

Tests for Normality
Test                  Statistic          p Value
Shapiro-Wilk          W     0.710878     Pr < W     <0.0001
Kolmogorov-Smirnov    D     0.24806      Pr > D     <0.0100
Cramer-von Mises      W-Sq  0.77793      Pr > W-Sq  <0.0050
Anderson-Darling      A-Sq  4.435579     Pr > A-Sq  <0.0050

Quantiles (Definition 5)
Quantile      Estimate
100% Max          1692
99%               1692
95%               1491
90%                814
75% Q3             462
50% Median         224
25% Q1             121
10%                 74
5%                  66
1%                  58
0% Min              58

Extreme Observations
----Lowest----      ----Highest---
Value    Obs        Value    Obs
   58      9          814      3
   60     16          815     26
   66     47         1491      7
   70     35         1675      6
   74     18         1692     17
(The stem-and-leaf plot, box plot, and normal probability plot for Weight, stems 0 through 16 with Stem.Leaf multiplied by 10**2, are not reproduced here; they show the same strong right skew indicated by the tests above.)

Appendix02: Estimating tree harvest weights - Raw data Univariate analysis
The UNIVARIATE Procedure, Variable: Dbh

Moments
N                47            Sum Weights         47
Mean             6.15319149    Sum Observations    289.2
Std Deviation    2.09801677    Variance            4.40167438
Skewness         1.17285986    Kurtosis            1.18369068
Uncorrected SS   1981.98       Corrected SS        202.477021
Coeff Variation  34.0963998    Std Error Mean      0.3060272

Basic Statistical Measures
Location               Variability
Mean     6.153191      Std Deviation        2.09802
Median   5.700000      Variance             4.40167
Mode     4.000000      Range                8.60000
                       Interquartile Range  2.90000
Note: The mode displayed is the smallest of 2 modes with a count of 4.

Tests for Location: Mu0=0
Test           Statistic         p Value
Student's t    t  20.10668       Pr > |t|   <.0001
Sign           M  23.5           Pr >= |M|  <.0001
Signed Rank    S  564            Pr >= |S|  <.0001

Tests for Normality
Test                  Statistic          p Value
Shapiro-Wilk          W     0.89407      Pr < W     0.0005
Kolmogorov-Smirnov    D     0.171951     Pr > D     <0.0100
Cramer-von Mises      W-Sq  0.214712     Pr > W-Sq  <0.0050
Anderson-Darling      A-Sq  1.387777     Pr > A-Sq  <0.0050

Quantiles (Definition 5)
Quantile      Estimate
100% Max          12.1
99%               12.1
95%               11.4
90%                8.8
75% Q3             7.4
50% Median         5.7
25% Q1             4.5
10%                4.0
5%                 3.7
1%                 3.5
0% Min             3.5

Extreme Observations
----Lowest----      ----Highest---
Value    Obs        Value    Obs
  3.5      9          8.8     26
  3.7     47          9.3     20
  3.7     35         11.4      6
  3.9     37         11.6      7
  4.0     44         12.1     17

(The stem-and-leaf plot, box plot, and normal probability plot for Dbh, stems 3 through 12, are likewise not reproduced here.)

47    proc reg data=one LINEPRINTER; ID ObsID DBH;
48    TITLE2 'Simple linear regression';
49    model Weight = Dbh / clb alpha=0.01; *** p xpx i influence CLI CLM;
50    Slope:Test DBH = 180;
51    Joint:TEST intercept = 0, DBH = 180; run;
      options ls=78 ps=45;
53    plot residual.*predicted.=obsid / VREF=0; run;
54    OUTPUT OUT=NEXT1 P=Predicted R=Resid cookd=cooksd dffits=dffits
55       STUDENT=student rstudent=rstudent lclm=lclm uclm=uclm lcl=lcl ucl=ucl;
56    run;
57    options ps=61 ls=95;
NOTE: The data set WORK.NEXT1 has 47 observations and 18 variables.
NOTE: The PROCEDURE REG printed pages 6-9.
NOTE: PROCEDURE REG used (Total process time): real time 0.59 seconds, cpu time 0.28 seconds

Appendix02: Estimating tree harvest weights - Simple linear regression
The REG Procedure, Model: MODEL1, Dependent Variable: Weight

Number of Observations Read    47
Number of Observations Used    47

Analysis of Variance
Source             DF    Sum of Squares    Mean Square    F Value    Pr > F
Model               1           6455980        6455980     433.49    <.0001
Error              45            670191          14893
Corrected Total    46           7126171

Root MSE          122.03740    R-Square    0.9060
Dependent Mean    369.34043    Adj R-Sq    0.9039
Coeff Var          33.04198

Parameter Estimates
Variable    DF    Parameter Estimate    Standard Error    t Value    Pr > |t|    99% Confidence Limits
Intercept    1            -729.39630          55.69366     -13.10      <.0001    -879.18914   -579.60346
Dbh          1             178.56371           8.57640      20.82      <.0001     155.49675    201.63067

Test Slope Results for Dependent Variable Weight
Source         DF    Mean Square    F Value    Pr > F
Numerator       1      417.69334       0.03    0.8678
Denominator    45          14893

Test Joint Results for Dependent Variable Weight
Source         DF    Mean Square    F Value    Pr > F
Numerator       2       12807462     859.96    <.0001
Denominator    45          14893

(The line-printer plot of Residual against Predicted Value of Weight, plotting symbol = ObsID, with a reference line at 0, is not reproduced here. The annotation on the plot reads: "Residual plots are a useful tool for detecting various problems: outliers, curvature, non-homogeneous variance, and more.")

58    proc print data=next1;
59    TITLE3 'Listing of observation diagnostics';
60    var ObsId DBH Weight Predicted Resid student rstudent; run;
NOTE: There were 47 observations read from the data set WORK.NEXT1.
NOTE: The PROCEDURE PRINT printed page 10.
NOTE: PROCEDURE PRINT used (Total process time): real time 0.13 seconds, cpu time 0.03 seconds
Appendix02: Estimating tree harvest weights - Simple linear regression
Listing of observation diagnostics (observations 1 through 47 in order; ObsID runs a-z, then A-U; values listed by column)

Dbh:       5.7 8.1 8.3 7.0 6.2 11.4 11.6 4.5 3.5 6.2 5.7 6.0 5.6 4.0 6.7 4.0 12.1 4.5 8.6 9.3 6.5 5.6 4.3 4.5 7.7 8.8 5.0 5.4 6.0 7.4 5.6 5.5 4.3 4.2 3.7 6.1 3.9 5.2 5.6 7.8 6.1 6.1 4.0 4.0 8.0 5.2 3.7

Weight:    174 745 814 408 226 1675 1491 121 58 278 220 342 209 84 313 60 1692 74 515 766 345 210 100 122 539 815 194 280 296 462 200 229 125 84 70 224 99 200 214 712 297 238 89 76 614 194 66

Predicted: 288.42 716.97 752.68 520.55 377.70 1306.23 1341.94 74.14 -104.42 377.70 288.42 341.99 270.56 -15.14 466.98 -15.14 1431.22 74.14 806.25 931.25 431.27 270.56 38.43 74.14 645.54 841.96 163.42 234.85 341.99 591.98 270.56 252.70 38.43 20.57 -68.71 359.84 -33.00 199.14 270.56 663.40 359.84 359.84 -15.14 -15.14 699.11 199.14 -68.71

Resid:     -114.417 28.030 61.317 -112.550 -151.699 368.770 149.057 46.860 162.423 -99.699 -68.417 0.014 -61.560 99.141 -153.981 75.141 260.775 -0.140 -291.252 -165.246 -86.268 -60.560 61.572 47.860 -106.544 -26.964 30.578 45.152 -45.986 -129.975 -70.560 -23.704 86.572 63.429 138.711 -135.842 131.998 0.865 -56.560 48.599 -62.842 -121.842 104.141 91.141 -85.113 -5.135 134.711

student:   -0.94818 0.23442 0.51389 -0.93392 -1.25650 3.29162 1.33889 0.39083 1.36987 -0.82579 -0.56698 0.00012 -0.51029 0.83095 -1.27635 0.62979 2.38302 -0.00117 -2.44967 -1.40424 -0.71476 -0.50200 0.51447 0.39917 -0.88786 -0.22740 0.25412 0.37452 -0.38092 -1.08081 -0.58489 -0.19655 0.72336 0.53050 1.16676 -1.12516 1.10759 0.00718 -0.46884 0.40532 -0.52051 -1.00920 0.87285 0.76389 -0.71112 -0.04263 1.13312

rstudent:  -0.94710 0.23194 0.50965 -0.93256 -1.26484 3.73546 1.35112 0.38712 1.38372 -0.82282 -0.56266 0.00011 -0.50605 0.82804 -1.28558 0.62552 2.52082 -0.00116 -2.60199 -1.42001 -0.71082 -0.49778 0.51022 0.39541 -0.88573 -0.22498 0.25146 0.37092 -0.37727 -1.08288 -0.58057 -0.19444 0.71947 0.52622 1.17159 -1.12858 1.11046 0.00710 -0.46474 0.40153 -0.51625 -1.00941 0.87050 0.76031 -0.70716 -0.04215 1.13679

61    proc print data=next1;
62    TITLE3 'Listing of observation diagnostics';
63    var ObsId cooksd dffits lclm uclm lcl ucl; run;
NOTE: There were 47 observations read from the data set WORK.NEXT1.
NOTE: The PROCEDURE PRINT printed page 11.
NOTE: PROCEDURE PRINT used (Total process time): real time 0.12 seconds, cpu time 0.03 seconds
64    options ps=512 ls=85;
Appendix02: Estimating tree harvest weights - Simple linear regression
Listing of observation diagnostics (observations 1 through 47 in order; ObsID runs a-z, then A-U; values listed by column)

cooksd:  0.01025 0.00114 0.00608 0.01110 0.01717 1.01075 0.18073 0.00275 0.05571 0.00742 0.00366 0.00000 0.00304 0.01596 0.01896 0.00917 0.69191 0.00000 0.16073 0.07442 0.00571 0.00294 0.00526 0.00287 0.01349 0.00153 0.00092 0.00173 0.00159 0.01742 0.00399 0.00046 0.01040 0.00588 0.03658 0.01377 0.02981 0.00000 0.00256 0.00295 0.00295 0.01108 0.01761 0.01348 0.01002 0.00002 0.03450

dffits:  -0.14301 0.04734 0.10939 -0.14877 -0.18654 1.61350 0.60670 0.07348 0.33716 -0.12135 -0.08496 0.00002 -0.07728 0.17801 -0.19616 0.13447 1.24438 -0.00022 -0.60223 -0.39013 -0.10629 -0.07602 0.10174 0.07505 -0.16386 -0.05473 0.04256 0.05826 -0.05578 -0.18699 -0.08866 -0.03009 0.14346 0.10758 0.27160 -0.16646 0.24481 0.00115 -0.07097 0.07610 -0.07614 -0.14888 0.18714 0.16345 -0.14078 -0.00686 0.26353

lclm:    239.41 651.33 683.80 468.84 329.81 1176.08 1207.49 12.93 -182.13 329.81 239.41 293.98 221.01 -84.13 417.47 -84.13 1285.93 12.93 732.24 844.29 382.73 221.01 -25.76 12.93 585.83 764.38 108.65 183.92 293.98 536.12 221.01 202.51 -25.76 -45.17 -142.83 311.95 -103.66 146.45 221.01 602.28 311.95 311.95 -84.13 -84.13 635.03 146.45 -142.83

uclm:    337.42 782.61 821.56 572.26 425.59 1436.38 1476.40 135.35 -26.72 425.59 337.42 389.99 320.11 53.84 516.49 53.84 1576.51 135.35 880.26 1018.20 479.81 320.11 102.61 135.35 705.25 919.55 218.19 285.78 389.99 647.83 320.11 302.90 102.61 86.31 5.41 407.74 37.67 251.82 320.11 724.52 407.74 407.74 53.84 53.84 763.20 251.82 5.41

lcl:     -43.45 382.24 417.30 188.27 45.99 953.14 987.24 -259.75 -441.73 45.99 -43.45 10.26 -61.39 -350.54 135.04 -350.54 1072.28 -259.75 469.78 591.69 99.47 -61.39 -296.02 -259.75 311.93 504.69 -169.35 -97.31 10.26 259.03 -61.39 -79.34 -296.02 -314.18 -405.21 28.14 -368.75 -133.30 -61.39 329.53 28.14 28.14 -350.54 -350.54 364.69 -133.30 -405.21

ucl:     620.28 1051.70 1088.06 852.83 709.40 1659.32 1696.64 408.03 232.88 709.40 620.28 673.71 602.51 320.26 798.92 320.26 1790.17 408.03 1142.72 1270.80 763.07 602.51 372.87 408.03 979.16 1179.24 496.19 567.01 673.71 924.92 602.51 584.75 372.87 355.32 267.79 691.55 302.75 531.57 602.51 997.27 691.55 691.55 320.26 320.26 1033.54 531.57 267.79

66    proc univariate data=next1 normal plot; var Resid;
67    TITLE3 'Residual analysis'; run;
NOTE: The PROCEDURE UNIVARIATE printed page 12.
NOTE: PROCEDURE UNIVARIATE used (Total process time): real time 0.14 seconds, cpu time 0.04 seconds

EXST7015: Estimating tree weights from other morphometric variables - Simple linear regression - Residual analysis
The UNIVARIATE Procedure, Variable: E (Residual)

Moments
N                47            Sum Weights         47
Mean             0             Sum Observations    0
Std Deviation    120.703619    Variance            14569.3637
Skewness         0.47869472    Kurtosis            1.04153074
Uncorrected SS   670190.732    Corrected SS        670190.732
Coeff Variation  .             Std Error Mean      17.6064324

Basic Statistical Measures
Location                Variability
Mean      0.00000       Std Deviation        120.70362
Median   -0.14041       Variance                 14569
Mode      .             Range                660.02160
                        Interquartile Range  161.40929
Tests for Location: Mu0=0
Test           Statistic         p Value
Student's t    t   0             Pr > |t|   1.0000
Sign           M  -0.5           Pr >= |M|  1.0000
Signed Rank    S  -25            Pr >= |S|  0.7946

Tests for Normality
Test                  Statistic          p Value
Shapiro-Wilk          W     0.973389     Pr < W     0.3544
Kolmogorov-Smirnov    D     0.084574     Pr > D     >0.1500
Cramer-von Mises      W-Sq  0.044081     Pr > W-Sq  >0.2500
Anderson-Darling      A-Sq  0.354877     Pr > A-Sq  >0.2500

Quantiles (Definition 5)
Quantile        Estimate
100% Max      368.769960
99%           368.769960
95%           162.423301
90%           138.710558
75% Q3         75.141444
50% Median     -0.140413
25% Q1        -86.267841
10%          -135.842356
5%           -153.980584
1%           -291.251641
0% Min       -291.251641

Extreme Observations
------Lowest------      ------Highest-----
Value       Obs         Value       Obs
-291.252     19         138.711      35
-165.246     20         149.057       7
-153.981     15         162.423       9
-151.699      5         260.775      17
-135.842     36         368.770       6

(The stem-and-leaf plot, box plot, and normal probability plot of the residuals, stems -2 to 3 with Stem.Leaf multiplied by 10**2, are not reproduced here; they are consistent with the normality tests above.)
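To tie the Student Residual, RStudent, and Cook's D columns in the listings above back to the Appendix 1 formulas, a small cross-check can be run against the output data set NEXT1 (this step is added here, not part of the original program; the variable names h, student2, and cook2 are made up). It uses the constants from the page 14 hand calculations: n = 47, mean Dbh = 6.153191489, Sxx = 202.4770213, MSE = 14893.128, and p = 2.

   /* cross-check sketch, not in the original Appendix 2 program */
   data check;
      set next1;
      h        = 1/47 + (Dbh - 6.153191489)**2 / 202.4770213;  * leverage for a simple linear regression ;
      student2 = Resid / sqrt(14893.128 * (1 - h));            * internally studentized residual         ;
      cook2    = (student2**2 / 2) * (h / (1 - h));            * Cook's D with p = 2 parameters          ;
   run;

   proc print data=check;
      var ObsID h student2 student cook2 cooksd;   * computed values next to the PROC REG values ;
   run;

For example, observation 17 (ObsID q, Dbh = 12.1) gives h = 0.196, student2 = 2.383, and cook2 = 0.692, matching the Student Residual (2.38302) and Cook's D (0.69191) printed in the listings above.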