00b SLR-Trees and Intrinsically Linear

EXST7015 : Statistical Techniques II
Simple Linear Regression
Annotated SAS example
Geaghan

The SAS program. I will presume you are familiar with the SAS data step. I will discuss it briefly only for this first example.

SAS Statements: all SAS statements end in a semicolon.

Comments: comments are statements that start with an asterisk. They do nothing in the program; they are included only to document the program.

Options can be specified to modify output appearance. The OPTIONS statement below sets a page size (ps) of 256 lines (use 54 for the lab) and a line size (ls) of 80 character columns, and suppresses the centering of output and the printing of the date and page numbers.

The DATA step. All our programs will include a DATA section. In this section the data to be analyzed is entered into the SAS system and, if necessary, modified for analysis. A second statement (INFILE) informs SAS that the data is included in the program (CARDS) and that if there are missing values the system should NOT go to the next line to get the data (MISSOVER).

The next statement in my program is a TITLE statement. Up to 10 titles can be active (TITLE1 through TITLE10) and once set are printed at the top of each page. Setting a new title, say TITLE3, would not affect lower numbered titles (TITLE1 and TITLE2) but would delete all higher numbered titles (TITLE4 ...). The TITLE statement ends in a semicolon as usual, and the text to be used as the title is enclosed in single quotes.

The INPUT statement. Along with the DATA statement, this is an important statement. It names the variables to be used, tells SAS what type of variables they are (numeric or alphanumeric) and gets the data into the SAS data set. Note that only one variable in the list is followed by a $. This will cause SAS to assume that all variables are numeric except the variable called OBSID. The variable OBSID is one I created by adding to each observation a different letter.
The first line got an "a", the second a "b", etc. The 26th observation got a "z" and the 27th an "A", etc. This was done to have a way of distinguishing each observation.

The LABEL statement provides a way of identifying each variable. It is optional, but if present it will be used by SAS in a number of places to identify the variables.

label ObsNo = 'Original observation number'
      Dbh = 'Diameter at breast height (inches)'
      etc. ... ;

I have deactivated the labels by making them a comment statement.

If data must be modified, it is done in the data step after the INPUT statement. I have two statements that create logarithms. These are not used in the first analysis, but will be used later in the semester.

lweight = log(weight);
ldbh = log(DBH);

These statements create two new variables (LWEIGHT and LDBH) that are the natural logs of the original variables.

Two last statements before the data. The CARDS statement tells SAS that the data step is done and data follows. The RUN statement tells SAS to process all information that it has so far and output any messages about the analysis to the LOG.

cards; run;

Note that two statements can occur on the same line. The SAS DATA step is now complete. The data will be entered into the SAS system and processing will continue. The rest of the statements in this program are procedures (PROCs) and associated statements.
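As a minimal sketch of the DATA step just described, the following reads three of the observations that appear in the raw data listing (ObsID a, b and c). The data set name DEMO is hypothetical, chosen here so that the sketch does not overwrite the data set ONE used in the rest of the program.

```sas
/* Minimal DATA step sketch; the data set name DEMO is hypothetical */
data demo;
   infile cards missover;   /* data are in the program; do not read the next line for missing values */
   input ObsNo Dbh Height Age Grav Weight ObsID $;   /* only ObsID is character */
   lweight = log(Weight);   /* natural log of weight */
   ldbh    = log(Dbh);      /* natural log of diameter */
cards;
1 5.7 34 10 0.409 174 a
2 8.1 68 17 0.501 745 b
3 8.3 70 17 0.445 814 c
;
run;

proc print data=demo; run;
```

The comments use the /* ... */ form because a comment that starts with an asterisk ends at the first semicolon it contains.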
dm'log;clear;output;clear';
***********************************************;
*** Data from Freund & Wilson (1993) ***;
*** TABLE 8.24 : ESTIMATING TREE WEIGHTS ***;
***********************************************;
options ps=256 ls=80 nocenter nodate nonumber;

ODS HTML style=minimal rs=none
    body='C:\Geaghan\EXST\EXST7015New\Spring2003\SAS\04-Slr-Trees.html' ;

data one; infile cards missover;
TITLE1 'EXST7015: Estimating tree weights from other morphometric variables';
input ObsNo Dbh Height Age Grav Weight ObsID $;
*********** label ObsNo = 'Original observation number'
      Dbh = 'Diameter at breast height (inches)'
      Height = 'Height of the tree (feet)'
      Age = 'Age of the tree (years)'
      Grav = 'Specific gravity of the wood'
      Weight = 'Harvest weight of the tree (lbs)'
      ObsId = 'Identification letter added to dataset';
lweight = log(weight);
ldbh = log(DBH);
cards; run;
the data set goes here
;
proc print data=one; TITLE2 'Raw data print'; run;
options ls=111 ps=61; proc plot data=one; plot weight*Dbh=obsid;
   TITLE2 'Scatter plot'; run;
options ps=256 ls=132;
proc means data=one n mean max min var std stderr;
   TITLE2 'Raw data means';
   var Dbh Height Age Grav Weight; run;
proc univariate data=one normal plot;
   TITLE2 'Raw data Univariate analysis';
   var Weight Dbh; run;
proc reg data=one LINEPRINTER; ID ObsID DBH;
   TITLE2 'Simple linear regression';
   model Weight = Dbh / p xpx i influence clb alpha=0.01; *** CLI CLM;
   Slope:Test DBH = 200;
   Joint:TEST intercept = 0, DBH = 200;
run;
options ls=78 ps=45; plot residual.*predicted.=obsid / VREF=0; run;
OUTPUT OUT=NEXT1 P=YHat R=E STUDENT=student rstudent=rstudent
       lcl=lcl lclm=lclm ucl=ucl uclm=uclm; run;
options ps=61 ls=132;
proc print data=next1; TITLE3 'Listing of observation diagnostics';
   var ObsId DBH Weight YHat E student rstudent lcl lclm ucl uclm; run;
options ps=256 ls=80;
proc univariate data=next1 normal plot; var e;
   TITLE3 'Residual analysis'; run;
options ls=111 ps=61; proc plot data=one; plot weight*Dbh=obsid;
   TITLE2 'Scatter plot'; run;
options ps=256 ls=132;
ods html close;

********************************************;
*** Data from Freund & Wilson (1993) ***;
*** TABLE 8.24 : ESTIMATING TREE WEIGHTS ***;
********************************************;
options ps=256 ls=80 nocenter nodate nonumber;

ODS HTML file='C:\Geaghan\EXST\EXST7015New\Fall2002\SAS\01-Slr-Trees.html';
NOTE: Writing HTML Body file: C:\Geaghan\EXST\EXST7015New\Fall2002\SAS\01-Slr-Trees.html

data one; infile cards missover;
TITLE1 'EXST7015: Estimating tree weights from morphometric variables';
input ObsNo Dbh Height Age Grav Weight ObsID $;
******** label ObsNo = 'Original observation number'
      Dbh = 'Diameter at breast height (inches)'
      Height = 'Height of the tree (feet)'
      Age = 'Age of the tree (years)'
      Grav = 'Specific gravity of the wood'
      Weight = 'Harvest weight of the tree (lbs)'
      ObsId = 'Identification letter added to dataset';
lweight = log(weight);
ldbh = log(DBH);
cards;
NOTE: The data set WORK.ONE has 47 observations and 9 variables.
NOTE: DATA statement used:
      real time 1.24 seconds
      cpu time 0.20 seconds
run;
;

The first PROC is a PROC PRINT. This PROC causes the data to be printed in the SAS output. Note that a second title line is added as "Raw data print". This is a TITLE2, so any previous TITLE1 is retained. Also notice that there are multiple statements on the same line and the PROC is followed by a RUN statement. This causes the procedure to be executed, and any comments regarding the statement are placed in the LOG prior to the next PROC.

proc print data=one; TITLE2 'Raw data print'; run;
NOTE: There were 47 observations read from the data set WORK.ONE.
NOTE: The PROCEDURE PRINT printed page 1.
NOTE: PROCEDURE PRINT used:
      real time 0.97 seconds
      cpu time 0.17 seconds

EXST7015: Estimating tree weights from other morphometric variables
Raw data print

Obs  ObsNo   Dbh  Height  Age   Grav  Weight  ObsID  lweight     ldbh
  1      1   5.7      34   10  0.409     174  a      5.15906  1.74047
  2      2   8.1      68   17  0.501     745  b      6.61338  2.09186
  3      3   8.3      70   17  0.445     814  c      6.70196  2.11626
  4      4   7.0      54   17  0.442     408  d      6.01127  1.94591
  5      5   6.2      37   12  0.353     226  e      5.42053  1.82455
  6      6  11.4      79   27  0.429    1675  f      7.42357  2.43361
  7      7  11.6      70   26  0.497    1491  g      7.30720  2.45101
  8      8   4.5      37   12  0.380     121  h      4.79579  1.50408
  9      9   3.5      32   15  0.420      58  i      4.06044  1.25276
 10     10   6.2      45   15  0.449     278  j      5.62762  1.82455
  . . .
 46     46   5.2      47   13  0.432     194  T      5.26786  1.64866
 47     47   3.7      33   13  0.389      66  U      4.18965  1.30833

The next PROC is a PLOT request.

options ls=111 ps=61; proc plot data=one; plot weight*Dbh=obsid;
   TITLE2 'Scatter plot'; run;
options ps=256 ls=132;

It is surrounded by OPTIONS statements. Although I usually like a large page size (256), I don't want the plot to cover 256 lines, so I set the page size to 61 for the plot, and then reset it to 256 for subsequent output.

The plot is for weight on DBH. Notice the "=ObsID" at the end of the PLOT statement. This will cause SAS to plot a single character (the ObsID I created) as a symbol representing each observation in the plot. I do this to be able to distinguish between the observations in the plot.

options ls=111 ps=61; proc plot data=one; plot weight*Dbh=obsid;
TITLE2 'Scatter plot'; run;
options ps=256 ls=132;
NOTE: There were 47 observations read from the data set WORK.ONE.
NOTE: The PROCEDURE PLOT printed page 2.
NOTE: PROCEDURE PLOT used:
      real time 0.22 seconds
      cpu time 0.02 seconds

EXST7015: Estimating tree weights from other morphometric variables
Scatter plot

Plot of Weight*Dbh.
Symbol is value of ObsID.

[Line-printer scatter plot: Weight on the vertical axis (0 to 1800) against Dbh on the horizontal axis (3 to 13), one ObsID letter per observation.]

NOTE: 11 obs hidden.

PROC MEANS is often used to examine variables and determine the number of observations of each variable, its minimum and its maximum. This has limited utility for regression analysis. You might use it to look for outliers, or to get the range of values for a plot.

proc means data=one n mean max min var std stderr;
TITLE2 'Raw data means';
var Dbh Height Age Grav Weight; run;
NOTE: There were 47 observations read from the data set WORK.ONE.
NOTE: The PROCEDURE MEANS printed page 3.
NOTE: PROCEDURE MEANS used:
      real time 0.25 seconds
      cpu time 0.04 seconds

EXST7015: Estimating tree weights from other morphometric variables
Raw data means

The MEANS Procedure

Variable   N         Mean      Maximum     Minimum      Variance      Std Dev   Std Error
Dbh       47    6.1531915   12.1000000   3.5000000     4.4016744    2.0980168   0.3060272
Height    47   49.5957447   79.0000000  27.0000000   167.6808511   12.9491641   1.8888297
Age       47   16.9574468   27.0000000  10.0000000    26.9111933    5.1876000   0.7566892
Grav      47    0.4452979    0.5080000   0.3530000     0.0014853    0.0385402   0.0056217
Weight    47  369.3404255      1692.00  58.0000000     154916.75  393.5946534  57.4116808

The SAS UNIVARIATE procedure is very useful in regression analysis. However, the application to the RAW variables is not very useful.
We will be interested in using this PROC to evaluate normality. We will be ESPECIALLY interested in the tests, in particular the Shapiro-Wilk test. For the two raw variables it gives:

Variable   Test           Statistic      p Value
Weight     Shapiro-Wilk   W  0.710878    Pr < W  <0.0001
Dbh        Shapiro-Wilk   W  0.89407     Pr < W   0.0005

We will also be interested in other tools to evaluate normality (STEM & LEAF, BOX PLOT, NORMAL PROBABILITY PLOT), but NOT FOR THE RAW DATA for either variable (X or Y). Note that these tests of normality on the raw variables are not particularly useful. We will actually test residuals, for reasons explained later. We will be assuming normality and testing for normality, but not on the original variables. Later we will be testing the Deviations or Residuals!!! These are the appropriate tests of normality, not the tests of the original variables!!!

proc univariate data=one normal plot;
TITLE2 'Raw data Univariate analysis';
var Weight Dbh; run;
NOTE: The PROCEDURE UNIVARIATE printed pages 4-5.
NOTE: PROCEDURE UNIVARIATE used:
      real time 0.53 seconds
      cpu time 0.06 seconds

EXST7015: Estimating tree weights from other morphometric variables
Raw data Univariate analysis

The UNIVARIATE Procedure
Variable: Weight

Moments
N                         47   Sum Weights               47
Mean              369.340426   Sum Observations       17359
Std Deviation     393.594653   Variance          154916.751
Skewness          2.20870748   Kurtosis          4.83581557
Uncorrected SS      13537551   Corrected SS      7126170.55
Coeff Variation   106.566903   Std Error Mean    57.4116808

Basic Statistical Measures
Location                Variability
Mean     369.3404       Std Deviation          393.59465
Median   224.0000       Variance                  154917
Mode      84.0000       Range                       1634
                        Interquartile Range    341.00000

NOTE: The mode displayed is the smallest of 3 modes with a count of 2.
Tests for Location: Mu0=0
Test           Statistic        p Value
Student's t    t  6.433193      Pr > |t|    <.0001
Sign           M  23.5          Pr >= |M|   <.0001
Signed Rank    S  564           Pr >= |S|   <.0001

Tests for Normality
Test                  Statistic           p Value
Shapiro-Wilk          W     0.710878      Pr < W      <0.0001
Kolmogorov-Smirnov    D     0.24806       Pr > D      <0.0100
Cramer-von Mises      W-Sq  0.77793       Pr > W-Sq   <0.0050
Anderson-Darling      A-Sq  4.435579      Pr > A-Sq   <0.0050

Quantiles (Definition 5)
Quantile      Estimate
100% Max          1692
99%               1692
95%               1491
90%                814
75% Q3             462
50% Median         224
25% Q1             121
10%                 74
5%                  66
1%                  58
0% Min              58

Extreme Observations
----Lowest----     ----Highest---
Value    Obs       Value    Obs
   58      9         814      3
   60     16         815     26
   66     47        1491      7
   70     35        1675      6
   74     18        1692     17

[Line-printer stem-and-leaf display, box plot and normal probability plot of Weight; stems are multiplied by 10**+2.]

EXST7015: Estimating tree weights from other morphometric variables
Raw data Univariate analysis

The UNIVARIATE Procedure
Variable: Dbh

Moments
N                         47   Sum Weights               47
Mean              6.15319149   Sum Observations       289.2
Std Deviation     2.09801677   Variance          4.40167438
Skewness          1.17285986   Kurtosis          1.18369068
Uncorrected SS       1981.98   Corrected SS      202.477021
Coeff Variation   34.0963998   Std Error Mean     0.3060272

Basic Statistical Measures
Location                Variability
Mean     6.153191       Std Deviation          2.09802
Median   5.700000       Variance               4.40167
Mode     4.000000       Range                  8.60000
                        Interquartile Range    2.90000

NOTE: The mode displayed is the smallest of 2 modes with a count of 4.
Tests for Location: Mu0=0
Test           Statistic        p Value
Student's t    t  20.10668      Pr > |t|    <.0001
Sign           M  23.5          Pr >= |M|   <.0001
Signed Rank    S  564           Pr >= |S|   <.0001

Tests for Normality
Test                  Statistic           p Value
Shapiro-Wilk          W     0.89407       Pr < W       0.0005
Kolmogorov-Smirnov    D     0.171951      Pr > D      <0.0100
Cramer-von Mises      W-Sq  0.214712      Pr > W-Sq   <0.0050
Anderson-Darling      A-Sq  1.387777      Pr > A-Sq   <0.0050

Extreme Observations
----Lowest----     ----Highest---
Value    Obs       Value    Obs
  3.5      9         8.8     26
  3.7     47         9.3     20
  3.7     35        11.4      6
  3.9     37        11.6      7
  4.0     44        12.1     17

[Line-printer stem-and-leaf display, box plot and normal probability plot of Dbh.]

The preceding material is ancillary to regression, used to prepare or enhance our analysis. The important information for regression will be provided by PROC REG or PROC GLM.

proc reg data=one LINEPRINTER; ID ObsID DBH;
TITLE2 'Simple linear regression';
model Weight = Dbh / p xpx i influence clb alpha=0.01; *** CLI CLM;
Slope:Test DBH = 200;
Joint:TEST intercept = 0, DBH = 200;
run;
NOTE: 47 observations read.
NOTE: 47 observations used in computations.

Additional useful statements that can be added to PROC REG include PLOT and OUTPUT statements.

options ls=78 ps=45;
plot residual.*predicted.=obsid; run;
OUTPUT OUT=NEXT1 P=YHat R=E STUDENT=student rstudent=rstudent
       lcl=lcl lclm=lclm ucl=ucl uclm=uclm; run;
options ps=61 ls=132;
NOTE: The data set WORK.NEXT1 has 47 observations and 17 variables.
NOTE: The PROCEDURE REG printed pages 6-11.
NOTE: PROCEDURE REG used:
      real time 1.09 seconds
      cpu time 0.17 seconds

EXST7015: Estimating tree weights from other morphometric variables
Simple linear regression

The REG Procedure
Model: MODEL1

Model Crossproducts X'X X'Y Y'Y
Variable      Intercept          Dbh       Weight
Intercept            47        289.2        17359
Dbh               289.2      1981.98     142968.3
Weight            17359     142968.3     13537551

X'X Inverse, Parameter Estimates, and SSE
Variable         Intercept             Dbh          Weight
Intercept     0.2082694963    -0.030389579    -729.3963003
Dbh           -0.030389579     0.004938832    178.56371409
Weight        -729.3963003    178.56371409     670190.7322

The ANOVA table and supplemental information.

Analysis of Variance
Source             DF   Sum of Squares   Mean Square   F Value   Pr > F
Model               1          6455980       6455980    433.49   <.0001
Error              45           670191         14893
Corrected Total    46          7126171

Root MSE          122.03740    R-Square   0.9060
Dependent Mean    369.34043    Adj R-Sq   0.9039
Coeff Var          33.04198

Parameter estimates and tests, and confidence intervals if requested.

Parameter Estimates
Variable    DF   Parameter Estimate   Standard Error   t Value   Pr > |t|   99% Confidence Limits
Intercept    1           -729.39630         55.69366    -13.10     <.0001   -879.18914   -579.60346
Dbh          1            178.56371          8.57640     20.82     <.0001    155.49675    201.63067

The parameter estimates are:
Intercept = -729.396300
Slope = 178.563714
Equation: Yi = -729.4 + 178.6*Xi

Interpretation: The weight starts at -729 when the diameter is zero and increases by 179 pounds for each additional inch in diameter.

For a t-test of either parameter against an hypothesized value, or a confidence interval on either parameter, we would use the standard errors provided by SAS.
Sb0 = 55.69366336
Sb1 = 8.57640103

A confidence interval is calculated as: Parameter ± tvalue*standard error

The t-value has n-2 = 45 d.f. and can be found in a t-table.
For a two tailed interval and a value of α = 0.05, the t-value is 2.014. For the slope the estimate is 178.6 and the standard error is 8.576. The confidence interval is given by 178.6 ± 2.014*8.576 and is best stated as P(161.328 ≤ β1 ≤ 195.872) = 0.95.

A 99% confidence interval is calculated by SAS because the option CLB was requested on the MODEL statement and a value of alpha of 0.01 (α=0.01) was specified.

A t-test of an hypothesized value for the slope would be calculated as $t = \dfrac{b_1 - \beta_{1 \mid H_0}}{S_{b_1}}$.

SAS automatically provides a t-test of each parameter against an hypothesized value of zero, the most common test. The t values and P values are:
Intercept: t = -13.097, P value < 0.0001
Slope: t = 20.820, P value < 0.0001

Interpretation: The slope and intercept differ from zero. Therefore, the line does not pass through the origin, and the line is not "flat". Basically, the regression line is an improvement over the original flat line fitted by the correction factor.

Other values may be of interest besides zero. These can be tested by hand, or with a "TEST" statement in SAS. I added an additional, optional, test. I decided to test two specific hypotheses about the regression coefficients.

Slope:Test DBH = 200;
Joint:TEST intercept = 0, DBH = 200;

SAS provides a mechanism to do this. The statement "TEST DBH = 200;" is added to the program after the MODEL statement. The test outputs the test result (in this program the output follows the list of observation diagnostics).

This tests the hypothesis H0: βDBH = 200, and you can see that it is rejected here at α = 0.05 (P = 0.0162), although not at the α = 0.01 level used for the confidence limits, which include 200. SAS used an F test to test this (more flexible); we would probably use a t-test (computationally and conceptually easier).
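The hand calculations described above (the confidence interval on the slope and the t-test of the slope against 200) can be checked in a short DATA step. This is a sketch only, assuming the estimate and standard error are typed in from the Parameter Estimates table; the variable names are hypothetical. TINV and PROBT are the SAS quantile and distribution functions for the t distribution, used here in place of a t-table.

```sas
/* Hand calculations for the slope; results are written to the LOG */
data _null_;
   b1  = 178.56371;                   /* slope estimate from PROC REG    */
   sb1 = 8.57640;                     /* standard error of the slope     */
   df  = 45;                          /* n - 2 error degrees of freedom  */

   /* 95 percent confidence interval on the slope */
   t95   = tinv(0.975, df);           /* two tailed critical value       */
   lower = b1 - t95*sb1;
   upper = b1 + t95*sb1;

   /* t-test of H0: slope = 200 */
   t = (b1 - 200) / sb1;
   p = 2*(1 - probt(abs(t), df));     /* two tailed P value              */
   f = t**2;                          /* comparable to the F from TEST   */

   put t95= lower= upper=;
   put t= p= f=;
run;
```

For a one degree of freedom test the F statistic is the square of the t statistic, so f and p should agree, up to rounding, with the F Value and Pr > F that SAS reports for the Slope test. The interval printed here will differ slightly from the hand calculation in the text, which uses rounded values (178.6 and 8.576).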
Test Slope Results for Dependent Variable Weight
Source         DF   Mean Square   F Value   Pr > F
Numerator       1         93041      6.25   0.0162
Denominator    45         14893

This tests the second hypothesis, a joint test of the two hypotheses H0: β0 = 0 and H0: βDBH = 200. Note that this is a two degree of freedom test.

Test Joint Results for Dependent Variable Weight
Source         DF   Mean Square   F Value   Pr > F
Numerator       2      17479620   1173.67   <.0001
Denominator    45         14893

[Line-printer residual plot: RESIDUAL on the vertical axis (-400 to 400) against Predicted Value of Weight (PRED) on the horizontal axis (-200 to 1600); plotting symbol is the ObsID letter, with "?" marking hidden observations.]

Residual plots are a useful tool for detecting various problems:
Outliers
Curvature
Non-homogeneous variance
and more

Other useful output from PROC REG includes observation diagnostics, residual plots and the ability to output residuals for testing. AS YOU KNOW, WE TEST NORMALITY OF THE RESIDUALS, NOT THE RAW DATA!

The output below is produced automatically by PROC REG because of options requested on the MODEL statement.

model Weight = Dbh / p xpx i influence clb alpha=0.01; *** CLI CLM;

These are discussed on the page following the output.
04d-SLR-Tree example.doc EXST7015 : Statistical Techniques II Simple Linear Regression Geaghan Page 11 Anotated SAS example The REG Procedure Model: MODEL1 Dependent Variable: Weight Obs 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 ObsID a b c d e f g h i j k l m n o p q r s t u v w x y z A B C D E F G H I J K L M N O P Q R S T U Dbh 5.7 8.1 8.3 7 6.2 11.4 11.6 4.5 3.5 6.2 5.7 6 5.6 4 6.7 4 12.1 4.5 8.6 9.3 6.5 5.6 4.3 4.5 7.7 8.8 5 5.4 6 7.4 5.6 5.5 4.3 4.2 3.7 6.1 3.9 5.2 5.6 7.8 6.1 6.1 4 4 8 5.2 3.7 Dep Var Weight 174.0000 745.0000 814.0000 408.0000 226.0000 1675 1491 121.0000 58.0000 278.0000 220.0000 342.0000 209.0000 84.0000 313.0000 60.0000 1692 74.0000 515.0000 766.0000 345.0000 210.0000 100.0000 122.0000 539.0000 815.0000 194.0000 280.0000 296.0000 462.0000 200.0000 229.0000 125.0000 84.0000 70.0000 224.0000 99.0000 200.0000 214.0000 712.0000 297.0000 238.0000 89.0000 76.0000 614.0000 194.0000 66.0000 Predicted Value 288.4169 716.9698 752.6825 520.5497 377.6987 1306 1342 74.1404 -104.4233 377.6987 288.4169 341.9860 270.5605 -15.1414 466.9806 -15.1414 1431 74.1404 806.2516 931.2462 431.2678 270.5605 38.4277 74.1404 645.5443 841.9644 163.4223 234.8478 341.9860 591.9752 270.5605 252.7041 38.4277 20.5713 -68.7106 359.8424 -32.9978 199.1350 270.5605 663.4007 359.8424 359.8424 -15.1414 -15.1414 699.1134 199.1350 -68.7106 Output Statistics Hat Diag Residual RStudent H -114.4169 -0.9471 0.0223 28.0302 0.2319 0.0400 61.3175 0.5096 0.0440 -112.5497 -0.9326 0.0248 -151.6987 -1.2648 0.0213 368.7700 3.7355 0.1572 149.0572 1.3511 0.1678 46.8596 0.3871 0.0348 162.4233 1.3837 0.0560 -99.6987 -0.8228 0.0213 -68.4169 -0.5627 0.0223 0.0140 0.000115 0.0214 -61.5605 -0.5061 0.0228 99.1414 0.8280 0.0442 -153.9806 -1.2856 0.0228 75.1414 0.6255 0.0442 260.7754 2.5208 0.1959 -0.1404 -0.001158 0.0348 -291.2516 -2.6020 0.0508 -165.2462 -1.4200 0.0702 -86.2678 -0.7108 0.0219 -60.5605 -0.4978 
0.0228 61.5723 0.5102 0.0382 47.8596 0.3954 0.0348 -106.5443 -0.8857 0.0331 -26.9644 -0.2250 0.0559 30.5777 0.2515 0.0278 45.1522 0.3709 0.0241 -45.9860 -0.3773 0.0214 -129.9752 -1.0829 0.0290 -70.5605 -0.5806 0.0228 -23.7041 -0.1944 0.0234 86.5723 0.7195 0.0382 63.4287 0.5262 0.0401 138.7106 1.1716 0.0510 -135.8424 -1.1286 0.0213 131.9978 1.1105 0.0464 0.8650 0.007101 0.0258 -56.5605 -0.4647 0.0228 48.5993 0.4015 0.0347 -62.8424 -0.5163 0.0213 -121.8424 -1.0094 0.0213 104.1414 0.8705 0.0442 91.1414 0.7603 0.0442 -85.1134 -0.7072 0.0381 -5.1350 -0.0422 0.0258 134.7106 1.1368 0.0510 04d-SLR-Tree example.doc Cov Ratio 1.0275 1.0869 1.0814 1.0314 0.9950 0.7154 1.1587 1.0763 1.0176 1.0366 1.0546 1.0688 1.0580 1.0610 0.9942 1.0751 0.9932 1.0837 0.8277 1.0285 1.0452 1.0584 1.0748 1.0760 1.0442 1.1053 1.0728 1.0651 1.0620 1.0220 1.0542 1.0692 1.0624 1.0761 1.0365 1.0094 1.0378 1.0736 1.0599 1.0756 1.0559 1.0209 1.0576 1.0661 1.0631 1.0735 1.0402 DFFITS -0.1430 0.0473 0.1094 -0.1488 -0.1865 1.6135 0.6067 0.0735 0.3372 -0.1213 -0.0850 0.0000 -0.0773 0.1780 -0.1962 0.1345 1.2444 -0.0002 -0.6022 -0.3901 -0.1063 -0.0760 0.1017 0.0751 -0.1639 -0.0547 0.0426 0.0583 -0.0558 -0.1870 -0.0887 -0.0301 0.1435 0.1076 0.2716 -0.1665 0.2448 0.0012 -0.0710 0.0761 -0.0761 -0.1489 0.1871 0.1634 -0.1408 -0.0069 0.2635 ------DFBETAS-----Intercept Dbh -0.0736 0.0305 -0.0197 0.0324 -0.0502 0.0786 0.0092 -0.0562 -0.0556 -0.0042 -1.2320 1.5004 -0.4681 0.5669 0.0617 -0.0458 0.3180 -0.2656 -0.0362 -0.0027 -0.0437 0.0181 0.0000 -0.0000 -0.0427 0.0199 0.1609 -0.1282 -0.0133 -0.0500 0.1216 -0.0968 -0.9822 1.1749 -0.0002 0.0001 0.3106 -0.4592 0.2399 -0.3257 -0.0169 -0.0175 -0.0420 0.0196 0.0885 -0.0678 0.0631 -0.0468 0.0508 -0.0979 0.0300 -0.0431 0.0315 -0.0207 0.0363 -0.0199 -0.0217 0.0041 0.0400 -0.0963 -0.0490 0.0228 -0.0177 0.0090 0.1247 -0.0955 0.0949 -0.0737 0.2525 -0.2073 -0.0572 0.0043 0.2236 -0.1801 0.0008 -0.0005 -0.0392 0.0183 -0.0258 0.0473 -0.0262 0.0020 -0.0512 0.0038 0.1692 -0.1347 
0.1478 -0.1177 0.0551 -0.0936 -0.0047 0.0029 0.2450 -0.2012

Sum of Residuals                          0
Sum of Squared Residuals             670191
Predicted Residual SS (PRESS)        810382

There are a few diagnostics calculated from individual observations that are of interest.

First, the residuals are of interest only for their sign. Long strings of residuals with the same sign can indicate either curvature or a lack of independence. Since we don't know what constitutes an overly large residual, these are not very useful for detecting outliers.

Another value of interest is the standardized residuals, in SAS the values "STUDENT" and "RSTUDENT". These are standardized residuals, and should have a mean of zero and a variance of one. They should follow a t distribution, so that for our example with 47 observations (45 error degrees of freedom) we expect that 99% would be between ±2.690.

The HAT diag values: Hat diag is a relative measure of how far an X value is from the center of the X space. A high value indicates an unusual value of X. This is not necessarily bad, but unusual values should be examined for correctness. The hat diag values will sum to "p", where p is the number of parameters estimated in the model (2 for SLR). The mean of the hat diag values will be p/n, and any values more than twice this value are considered "large". Again, this is not necessarily a problem.

Influence diagnostics examine how the regression would change if an observation were removed from the analysis. If an observation is removed and the regression does not change, the observation is not influential. If the regression changes a lot, the observation is very influential.

DFFITS measures the change in terms of the "fit", as judged by the predicted (Yhat) value. If a point is removed and Yhat changes a lot, the point is influential.
For small to medium size databases, DFFITS should not exceed 1, while for large databases it should not exceed 2*sqrt(p/n).

DFBETAS measures the change in terms of the "fit", as judged by changes in b0 and b1. If a point is removed and b0 or b1 change a lot, the point is influential. For small to medium size databases, DFBETAS should not exceed 1, while for large databases it should not exceed 2/sqrt(n).

Look for RSTUDENT values over 2.7, Hat diag values over 0.08 (twice the mean, 2p/n = 4/47 ≈ 0.085) and DFFITS & DFBETAS over 1. Note large Hat diag values on both ends of the regression, large DFFITS and DFBETAS for observations 6 and 17 (ObsID f and q), and a large RSTUDENT for observation 6.

proc print data=next1;
TITLE3 'Listing of observation diagnostics';
var ObsId DBH Weight YHat E student rstudent lcl lclm ucl uclm; run;
NOTE: There were 47 observations read from the data set WORK.NEXT1.
NOTE: The PROCEDURE PRINT printed page 12.
NOTE: PROCEDURE PRINT used:
      real time 0.21 seconds
      cpu time 0.04 seconds
options ps=256 ls=80;

EXST7015: Estimating tree weights from other morphometric variables
Simple linear regression
Listing of observation diagnostics

Obs 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47
Obs ID a b c d e f g h i j k l m n o p q r s t u v w x y z A B C D E F G H I J K L M N O P Q R S T U
Dbh 5.7 8.1 8.3 7.0 6.2 11.4 11.6 4.5 3.5 6.2 5.7 6.0 5.6 4.0 6.7 4.0 12.1 4.5 8.6 9.3 6.5 5.6 4.3 4.5 7.7 8.8 5.0 5.4 6.0 7.4 5.6 5.5 4.3 4.2 3.7 6.1 3.9 5.2 5.6 7.8 6.1 6.1 4.0 4.0 8.0 5.2 3.7
Weight 174 745 814 408 226 1675 1491 121 58 278 220 342 209 84 313 60 1692 74 515 766 345 210 100 122 539 815 194 280 296 462 200 229 125 84 70 224 99 200 214 712 297 238 89 76 614 194 66
YHat 288.42 716.97 752.68 520.55 377.70 1306.23 1341.94 74.14 -104.42 377.70 288.42 341.99 270.56 -15.14 466.98 -15.14 1431.22
74.14 806.25 931.25 431.27 270.56 38.43 74.14 645.54 841.96 163.42 234.85 341.99 591.98 270.56 252.70 38.43 20.57 -68.71 359.84 -33.00 199.14 270.56 663.40 359.84 359.84 -15.14 -15.14 699.11 199.14 -68.71 E -114.417 28.030 61.317 -112.550 -151.699 368.770 149.057 46.860 162.423 -99.699 -68.417 0.014 -61.560 99.141 -153.981 75.141 260.775 -0.140 -291.252 -165.246 -86.268 -60.560 61.572 47.860 -106.544 -26.964 30.578 45.152 -45.986 -129.975 -70.560 -23.704 86.572 63.429 138.711 -135.842 131.998 0.865 -56.560 48.599 -62.842 -121.842 104.141 91.141 -85.113 -5.135 134.711 student -0.94818 0.23442 0.51389 -0.93392 -1.25650 3.29162 1.33889 0.39083 1.36987 -0.82579 -0.56698 0.00012 -0.51029 0.83095 -1.27635 0.62979 2.38302 -0.00117 -2.44967 -1.40424 -0.71476 -0.50200 0.51447 0.39917 -0.88786 -0.22740 0.25412 0.37452 -0.38092 -1.08081 -0.58489 -0.19655 0.72336 0.53050 1.16676 -1.12516 1.10759 0.00718 -0.46884 0.40532 -0.52051 -1.00920 0.87285 0.76389 -0.71112 -0.04263 1.13312 rstudent -0.94710 0.23194 0.50965 -0.93256 -1.26484 3.73546 1.35112 0.38712 1.38372 -0.82282 -0.56266 0.00011 -0.50605 0.82804 -1.28558 0.62552 2.52082 -0.00116 -2.60199 -1.42001 -0.71082 -0.49778 0.51022 0.39541 -0.88573 -0.22498 0.25146 0.37092 -0.37727 -1.08288 -0.58057 -0.19444 0.71947 0.52622 1.17159 -1.12858 1.11046 0.00710 -0.46474 0.40153 -0.51625 -1.00941 0.87050 0.76031 -0.70716 -0.04215 1.13679 04d-SLR-Tree example.doc lcl -43.45 382.24 417.30 188.27 45.99 953.14 987.24 -259.75 -441.73 45.99 -43.45 10.26 -61.39 -350.54 135.04 -350.54 1072.28 -259.75 469.78 591.69 99.47 -61.39 -296.02 -259.75 311.93 504.69 -169.35 -97.31 10.26 259.03 -61.39 -79.34 -296.02 -314.18 -405.21 28.14 -368.75 -133.30 -61.39 329.53 28.14 28.14 -350.54 -350.54 364.69 -133.30 -405.21 lclm 239.41 651.33 683.80 468.84 329.81 1176.08 1207.49 12.93 -182.13 329.81 239.41 293.98 221.01 -84.13 417.47 -84.13 1285.93 12.93 732.24 844.29 382.73 221.01 -25.76 12.93 585.83 764.38 108.65 183.92 293.98 536.12 221.01 202.51 -25.76 
-45.17 -142.83 311.95 -103.66 146.45 221.01 602.28 311.95 311.95 -84.13 -84.13 635.03 146.45 -142.83 ucl 620.28 1051.70 1088.06 852.83 709.40 1659.32 1696.64 408.03 232.88 709.40 620.28 673.71 602.51 320.26 798.92 320.26 1790.17 408.03 1142.72 1270.80 763.07 602.51 372.87 408.03 979.16 1179.24 496.19 567.01 673.71 924.92 602.51 584.75 372.87 355.32 267.79 691.55 302.75 531.57 602.51 997.27 691.55 691.55 320.26 320.26 1033.54 531.57 267.79 uclm 337.42 782.61 821.56 572.26 425.59 1436.38 1476.40 135.35 -26.72 425.59 337.42 389.99 320.11 53.84 516.49 53.84 1576.51 135.35 880.26 1018.20 479.81 320.11 102.61 135.35 705.25 919.55 218.19 285.78 389.99 647.83 320.11 302.90 102.61 86.31 5.41 407.74 37.67 251.82 320.11 724.52 407.74 407.74 53.84 53.84 763.20 251.82 5.41 EXST7015 : Statistical Techniques II Simple Linear Regression 100 101 Geaghan Page 14 Anotated SAS example proc univariate data=next1 normal plot; var e; TITLE3 'Residual analysis'; run; NOTE: The PROCEDURE UNIVARIATE printed page 13. NOTE: PROCEDURE UNIVARIATE used: real time 0.11 seconds cpu time 0.03 seconds EXST7015: Estimating tree weights from other morphometric variables Simple linear regression Residual analysis The UNIVARIATE Procedure Variable: E (Residual) N Mean Std Deviation Skewness Uncorrected SS Coeff Variation Moments 47 Sum Weights 0 Sum Observations 120.703619 Variance 0.47869472 Kurtosis 670190.732 Corrected SS . Std Error Mean Basic Statistical Measures Location Variability Mean 0.00000 Std Deviation Median -0.14041 Variance Mode . Range Interquartile Range 47 0 14569.3637 1.04153074 670190.732 17.6064324 120.70362 14569 660.02160 161.40929 Tests for Location: Mu0=0 Test -Statistic-----p Value-----Student's t t 0 Pr > |t| 1.0000 Sign M -0.5 Pr >= |M| 1.0000 Signed Rank S -25 Pr >= |S| 0.7946 PROC UNIVARIATE test of normality. This is one of the major tools of interest from PROC UNIVARIATE. 

                   Tests for Normality
Test                  --Statistic---    -----p Value------
Shapiro-Wilk          W     0.973389    Pr < W      0.3544
Kolmogorov-Smirnov    D     0.084574    Pr > D     >0.1500
Cramer-von Mises      W-Sq  0.044081    Pr > W-Sq  >0.2500
Anderson-Darling      A-Sq  0.354877    Pr > A-Sq  >0.2500

Quantiles (Definition 5)
Quantile         Estimate
100% Max       368.769960
99%            368.769960
95%            162.423301
90%            138.710558
75% Q3          75.141444
50% Median      -0.140413
25% Q1         -86.267841
10%           -135.842356
5%            -153.980584
1%            -291.251641
0% Min        -291.251641

        Extreme Observations
------Lowest------    ------Highest-----
    Value    Obs          Value    Obs
 -291.252     19        138.711     35
 -165.246     20        149.057      7
 -153.981     15        162.423      9
 -151.699      5        260.775     17
 -135.842     36        368.770      6

   Stem Leaf                     #  Boxplot
      3 7                        1     0
      3
      2 6                        1     |
      2                                |
      1 56                       2     |
      1 00334                    5     |
      0 5555666899              10  +-----+
      0 0033                     4  |  +  |
     -0 3210                     4  *-----*
     -0 997766665                9  +-----+
     -1 4321110                  7     |
     -1 755                      3     |
     -2                                |
     -2 9                        1     |
        ----+----+----+----+
    Multiply Stem.Leaf by 10**+2

[Normal Probability Plot of the residuals: vertical axis from -275 to 375, horizontal axis from -2 to +2]

GOPTIONS DEVICE=CGMflwa GSFMODE=REPLACE GSFNAME=OUT NOPROMPT noROTATE
   ftext='TimesRoman' ftitle='TimesRoman' htext=1 htitle=1 ctitle=black ctext=black;

GOPTIONS GSFNAME=OUT1;
FILENAME OUT1 'C:\Geaghan\EXST\EXST7015New\Fall2002\SAS\SLR-Trees1.CGM';
NOTE: There were 47 observations read from the data set WORK.ONE.
NOTE: The PROCEDURE PLOT printed page 14.
NOTE: PROCEDURE PLOT used:
      real time           0.09 seconds
      cpu time            0.04 seconds
PROC GPLOT DATA=ONE;
   TITLE1 font='TimesRoman' H=1 'Simple Linear Regression Example';
   TITLE2 font='TimesRoman' H=1 'Wood harvest from trees';
   PLOT weight*Dbh=1 weight*Dbh=2 / overlay HAXIS=AXIS1 VAXIS=AXIS2;
   AXIS1 LABEL=(font='TimesRoman' H=1 'Diameter at breast height (inches)') WIDTH=1 MINOR=(N=1)
         VALUE=(font='TimesRoman' H=1) color=black ORDER=3 TO 13 BY 1;
   AXIS2 LABEL=(ANGLE=90 font='TimesRoman' H=1 'Weight of wood harvested (lbs)') WIDTH=1
         VALUE=(font='TimesRoman' H=1) MINOR=(N=5) color=black ORDER=0 TO 1800 BY 200;
   SYMBOL1 color=red V=None I=RLcli99 L=1 MODE=INCLUDE;
   SYMBOL2 color=blue V=dot I=None L=1 MODE=INCLUDE; RUN;
NOTE: Regression equation : Weight = -729.3963 + 178.5637*Dbh.
NOTE: Foreground color BLACK same as background. Part of your graph may not be visible.
NOTE: 52 RECORDS WRITTEN TO C:\Geaghan\EXST\EXST7015New\Fall2002\SAS\SLR-Trees1.CGM
**** V = "dot" would place a dot for each point;
**** I = for regression: R requests fitted regression line, L, Q or C requests Linear,
     Quadratic or cubic, CLM or CLI requests corresponding confidence interval and
     95 specifies the confidence level for the CI (any value from 50 to 99);
**** I = for categories: requests STD (std dev) 1 (1 width, 2 or 3) M (of mean=std err)
     J (join means of bars) t (add top & bottom hash) p (use pooled variance);
**** Other options for categories: omit M=std dev, use B to get bar for min/max;
RUN;
NOTE: There were 47 observations read from the data set WORK.ONE.
NOTE: PROCEDURE GPLOT used:
      real time           0.22 seconds
      cpu time            0.10 seconds

[Figure: "Simple Linear Regression Example. Wood harvest from trees." Weight of wood harvested (lbs), 0 to 1800, plotted against Diameter at breast height (inches), 3 to 13, with the fitted regression line and its 99% individual confidence limits.]

EXST7015 : Statistical Techniques II     Simple Linear Regression     Geaghan     Page 1
Intrinsically linear example

***********************************************;
*** Data from Freund & Wilson (1993)        ***;
*** TABLE 8.24 : ESTIMATING TREE WEIGHTS    ***;
***********************************************;
options ps=256 ls=132 nocenter nodate nonumber;
data one; infile cards missover;
TITLE1 'EXST7015: Estimating tree weights from other morphometric variables';
input ObsNo Dbh Height Age Grav Weight ObsID $;
******** label ObsNo  = 'Original observation number'
               Dbh    = 'Diameter at breast height (inches)'
               Height = 'Height of the tree (feet)'
               Age    = 'Age of the tree (years)'
               Grav   = 'Specific gravity of the wood'
               Weight = 'Harvest weight of the tree (lbs)'
               ObsId  = 'Identification letter added to dataset';
lweight = log(weight);
ldbh = log(DBH);
cards;
NOTE: The data set WORK.ONE has 47 observations and 9 variables.
NOTE: DATA statement used:
      real time           0.06 seconds
      cpu time            0.06 seconds
run;
;
proc print data=one; TITLE2 'Raw data print'; run;
NOTE: There were 47 observations read from the data set WORK.ONE.
NOTE: The PROCEDURE PRINT printed page 1.
NOTE: PROCEDURE PRINT used:
      real time           0.03 seconds
      cpu time            0.03 seconds

EXST7015: Estimating tree weights from other morphometric variables
Raw data print

Obs  ObsNo   Dbh  Height  Age   Grav  Weight  ObsID  lweight     ldbh
  1      1   5.7      34   10  0.409     174  a      5.15906  1.74047
  2      2   8.1      68   17  0.501     745  b      6.61338  2.09186
  3      3   8.3      70   17  0.445     814  c      6.70196  2.11626
  .
  .   (see previous handout for full listing)
  .
 45     45   8.0      61   13  0.508     614  S      6.41999  2.07944
 46     46   5.2      47   13  0.432     194  T      5.26786  1.64866
 47     47   3.7      33   13  0.389      66  U      4.18965  1.30833

options ls=99 ps=55; TITLE2 'Scatter plot';
proc plot data=one; plot weight*Dbh=obsid;
run;
options ps=256 ls=132;
NOTE: There were 47 observations read from the data set WORK.ONE.
NOTE: The PROCEDURE PLOT printed page 2.
NOTE: PROCEDURE PLOT used:
      real time           0.01 seconds
      cpu time            0.01 seconds

05d-SLR-IntrinsicallyLinear-Trees.doc

EXST7015: Estimating tree weights from other morphometric variables
Scatter plot

Plot of Weight*Dbh. Symbol is value of ObsID.

[Line-printer scatter plot: Weight, 0 to 1800, against Dbh, 3 to 13]

NOTE: 8 obs hidden.

proc reg data=one lineprinter; ID ObsID;
   TITLE2 'Simple linear regression';
   model Weight = Dbh / CLB;
   output out=next1 p=yhat r=e;
run;
NOTE: 47 observations read.
NOTE: 47 observations used in computations.
options ls=99 ps=55;
   plot residual.*predicted.=obsid / VREF=0;
run;
options ps=256 ls=132;
NOTE: The data set WORK.NEXT1 has 47 observations and 11 variables.
NOTE: The PROCEDURE REG printed pages 3-4.
NOTE: PROCEDURE REG used:
      real time           0.10 seconds
      cpu time            0.10 seconds

EXST7015: Estimating tree weights from other morphometric variables
Simple linear regression

The REG Procedure
Model: MODEL1
Dependent Variable: Weight

                        Analysis of Variance
                          Sum of       Mean
Source             DF    Squares     Square    F Value    Pr > F
Model               1    6455980    6455980     433.49    <.0001
Error              45     670191      14893
Corrected Total    46    7126171

Root MSE          122.03740    R-Square    0.9060
Dependent Mean    369.34043    Adj R-Sq    0.9039
Coeff Var          33.04198

                               Parameter Estimates
                  Parameter    Standard
Variable    DF     Estimate       Error    t Value    Pr > |t|    95% Confidence Limits
Intercept    1   -729.39630    55.69366     -13.10      <.0001    -841.56910   -617.22350
Dbh          1    178.56371     8.57640      20.82      <.0001     161.28996    195.83747

Recall the original SLR. The fit didn't seem too bad except for the negative intercept. The apparent curvature and possible lack of homogeneity of the residuals also indicated possible problems. There were even some possible outliers. The residuals, however, appeared to be normal (Pr < W = 0.3544).

[Line-printer residual plot: RESIDUAL, -300 to 400, against Predicted Value of Weight (PRED), -100 to 1500, with a reference line at 0. Symbol is value of ObsID.]

proc reg data=one lineprinter; ID ObsID;
   TITLE2 'Simple linear regression on logarithms';
   model lWeight = lDbh / CLB;
   output out=next2 p=yhat r=le;
   TEST lDBH = 3;
run;
options ls=99 ps=55;
   plot residual.*predicted.=obsid / VREF=0;
run;
options ps=256 ls=132;
NOTE: The data set WORK.NEXT2 has 47 observations and 11 variables.
NOTE: The PROCEDURE REG printed pages 6-8.
NOTE: PROCEDURE REG used (Total process time):
      real time           0.01 seconds
      cpu time            0.01 seconds

The REG Procedure
Model: MODEL1
Dependent Variable: lweight

                        Analysis of Variance
                          Sum of        Mean
Source             DF    Squares      Square    F Value    Pr > F
Model               1   35.94979    35.94979    1236.37    <.0001
Error              45    1.30846     0.02908
Corrected Total    46   37.25825

Root MSE          0.17052    R-Square    0.9649
Dependent Mean    5.49466    Adj R-Sq    0.9641
Coeff Var         3.10337

                               Parameter Estimates
                  Parameter    Standard
Variable    DF     Estimate       Error    t Value    Pr > |t|    95% Confidence Limits
Intercept    1      0.55219     0.14275       3.87      0.0004    0.26469    0.83970
ldbh         1      2.79854     0.07959      35.16      <.0001    2.63824    2.95884

The model fits well. The $R^2$ value (0.9649) is actually higher than the value for the SLR (0.9060). Recall that the original power model (before transformation) was $Y_i = b_0 X_i^{b_1} e_i$. After taking logarithms we actually fitted a simple linear regression, $Y_i' = b_0' + b_1 X_i'$, where $b_0'$ is the log of the original $b_0$. After the model is fitted we can back-transform it, taking the antilog of $b_0'$ and putting the model back in the original form. Since the antilog of 0.55219 is $e^{0.55219} = 1.7371$, the model in the original form is $Y_i = 1.7371 X_i^{2.79854}$. In this form the line will graph as a curve.

The value of the slope is not too different from the hypothesized value of 3. However, the test of the slope against 3 (below) shows a significant difference (Pr > F = 0.0149).

Test 1 Results for Dependent Variable lweight
                       Mean
Source         DF    Square    F Value    Pr > F
Numerator       1   0.18629       6.41    0.0149
Denominator    45   0.02908

So how about the residual plot? Recall that for the linear model it showed curvature and possibly nonhomogeneous variance. The plot now looks good: no curvature, good homogeneous spread and no outliers.

[Line-printer residual plot: RESIDUAL, -0.6 to 0.4, against Predicted Value of lweight (PRED), 4.00 to 7.75, with a reference line at 0. Symbol is value of ObsID.]

proc univariate data=next2 normal plot; var le;
   TITLE3 'Residual analysis (log model)'; run;
NOTE: The PROCEDURE UNIVARIATE printed page 9.
NOTE: PROCEDURE UNIVARIATE used (Total process time):
      real time           0.00 seconds
      cpu time            0.00 seconds

EXST7015: Estimating tree weights from other morphometric variables
Simple linear regression on logarithms
Residual analysis (log model)

The UNIVARIATE Procedure
Variable: le (Residual)

                          Moments
N                          47    Sum Weights                47
Mean                        0    Sum Observations            0
Std Deviation      0.16865578    Variance           0.02844477
Skewness           -0.3773174    Kurtosis           0.45701405
Uncorrected SS     1.30845959    Corrected SS       1.30845959
Coeff Variation             .    Std Error Mean     0.02460097

Residuals from the transformed model were also tested for normality. The Shapiro-Wilk test does not reject the hypothesis of normality for these results (Pr < W = 0.5634).

                   Tests for Normality
Test                  --Statistic---    -----p Value------
Shapiro-Wilk          W     0.979294    Pr < W      0.5634
Kolmogorov-Smirnov    D     0.128993    Pr > D      0.0483
Cramer-von Mises      W-Sq  0.069887    Pr > W-Sq  >0.2500
Anderson-Darling      A-Sq  0.396238    Pr > A-Sq  >0.2500

        Extreme Observations
------Lowest------    ------Highest-----
     Value    Obs         Value    Obs
 -0.457352     18      0.227337      3
 -0.337451     16      0.234177     37
 -0.329823     19      0.267333     40
 -0.263905      1      0.268304     12
 -0.237736      5      0.363138     28

   Stem Leaf                     #  Boxplot
      3 6                        1     |
      3                                |
      2 77                       2     |
      2 1133                     4     |
      1 9                        1     |
      1 0123                     4  +-----+
      0 556668                   6  |     |
      0 013334                   6  *--+--*
     -0 333332210                9  |     |
     -0 8                        1  |     |
     -1 443000                   6  +-----+
     -1 5                        1     |
     -2 40                       2     |
     -2 6                        1     |
     -3 43                       2     |
     -3
     -4
     -4 6                        1     0
        ----+----+----+----+
    Multiply Stem.Leaf by 10**-1

[Normal Probability Plot of the log-model residuals: vertical axis from -0.475 to 0.375, horizontal axis from -2 to +2]
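
The two numerical claims made for the log model, the back-transformed intercept and the test of the slope against 3, can be checked by hand from the reported parameter estimates. What follows is only a worked sketch in LaTeX notation. It assumes natural logarithms (which is what the SAS log function computes) and takes the slope estimate 2.79854 and its standard error 0.07959 from the Parameter Estimates table for lweight. The symbols $s_{b_1}$ and $t$ are introduced here for illustration and do not appear in the handout.

```latex
\begin{align*}
b_0 &= e^{b_0'} = e^{0.55219} \approx 1.7371,
  & Y_i &= 1.7371\, X_i^{2.79854} \\[4pt]
t &= \frac{b_1 - 3}{s_{b_1}} = \frac{2.79854 - 3}{0.07959} \approx -2.531,
  & t^2 &\approx 6.41 = F_{1,\,45}
\end{align*}
```

With one numerator degree of freedom, $F = t^2$, so the hand calculation agrees with the F Value of 6.41 (Pr > F = 0.0149) printed for the TEST statement. The 95% confidence limits for the slope (2.63824 to 2.95884) likewise exclude 3, which is the same conclusion reached from the interval side.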