SAS example of a Simple Linear Regression (Appendix 2)
Our analyses will be done in SAS. Other, simpler options, such as EXCEL, work well for simple
linear regression, but only SAS will cover all of the analyses and all of the options that we want to
discuss this semester.
If you are not familiar with SAS, see information available on my EXST7005 page, and talk to the TA
about getting up to speed.
The numerical example used is from your textbook. It is a data set taken from 47 trees. Each tree was
measured for diameter, height, weight of harvestable wood and other values.
Our objective will be to predict the weight of harvestable wood using just the diameter. The diameter
variable is measured about 4 feet off the ground and is called Diameter at Breast Height (DBH).
SAS Programming
The SAS program. I will presume you are familiar with the SAS data step. I will discuss it briefly
only for this first example.
SAS Statements – all SAS statements end in a semicolon;
Comments – comments are statements that start with an asterisk. They do nothing in the program; they are included only to document it.
My Simple Linear Regression (SLR) example starts with the comments,
*********************************************;
*** Data from Freund & Wilson (1993)      ***;
*** TABLE 8.24 : ESTIMATING TREE WEIGHTS  ***;
*********************************************;

This is for documentation purposes only. It does not affect the program.
Options – options can be specified to modify output appearance. The options I prefer are,
options ps=61 ls=78 nocenter nodate nonumber nolabel
  FORMCHAR="|----|+|---+=|-/\<>*";

This creates a page size (ps) of 61 lines (use 54 for the lab) and a line size (ls) of 78 character columns, and suppresses the centering of output and the printing of the date and page numbers.
The DATA step. All our programs will include a DATA section. In this section the data to be
analyzed is entered into the SAS system and, if necessary, modified for analysis.
data one;

A second statement informs SAS that the data are included in the program (CARDS) and that if there are missing values the system should NOT go to the next line to get the data (MISSOVER).
infile cards missover;

The TITLE statement ends in a semicolon as usual, and the text to be used as the title is enclosed in single quotes.

TITLE1 'Estimating tree weights from other morphometric variables';

The INPUT statement. Along with the DATA statement, this is an important statement. It names the
variables to be used, tells SAS what type of variables they are (numeric or alphanumeric) and gets
the data into the SAS data set.
input ObsNo Dbh Height Age Grav Weight ObsID $;

Note that only one variable in the list is followed by a $. This will cause SAS to assume that all variables are numeric except the variable called OBSID.
The variable OBSID is one I created by adding a different letter to each observation. The first line got an "a", the second a "b", etc. The 26th observation got a "z" and the 27th an "A", etc. This was done to have a way of distinguishing each observation.
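The appendix data step (shown later) builds these letters with SAS's BYTE function, which returns the character for a given ASCII code. In sketch form:

   observation + 1;                                  * running observation counter (sum statement);
   if observation le 26 then ObsID = byte(observation + 96);       * 1 -> byte(97) = 'a';
   if observation ge 27 then ObsID = byte(observation + 64 - 26);  * 27 -> byte(65) = 'A';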
The LABEL statement provides a way of identifying each variable. It is optional, but if present will
be used by SAS in a number of places to identify the variables.
label ObsNo = 'Original observation number'
      Dbh   = 'Diameter at breast height (inches)'
      etc. ... ;

I have deactivated the labels by making them a comment statement.
If data must be modified, it is done in the data step after the INPUT statement. I have two
statements that create logarithms. These are not used in the first analysis, but will be used later in
the semester.
lweight = log(weight);
ldbh = log(DBH);

These statements create two new variables (LWEIGHT and LDBH) that are the natural logs of the original variables.
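As a quick check, the first tree has Weight = 174, and the data listing shows lweight = 5.15906, which is ln(174); LOG in SAS is the natural log.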
Two last statements before the data. The CARDS statement tells SAS that the data step is done and
data follows. The RUN statement tells SAS to process all information that it has so far and output
any messages about the analysis to the LOG.
cards; run;

Note that two statements can occur on the same line.
The SAS DATA step is now complete. The data will be entered into the SAS system and processing
will continue.
The rest of the statements in this program are procedures (PROCs) and associated statements.
I will briefly discuss some of these statements. For most of the semester we will concentrate on the
PROCs that actually do statistics, such as REG, GLM, LOGISTIC, ANOVA, and MIXED.
The first PROC is,
proc print data=one; TITLE2 'Raw data print'; run;

This PROC causes the data to be printed, with the second title line "Raw data print" added.
See the data listing from PROC PRINT. Notice that this is a TITLE2, so any previous TITLE1 is kept.
Also notice I usually follow PROCs with a RUN statement. This causes the procedure to be
executed and any comments regarding the statement are placed in the LOG prior to the next
PROC.
The next PROC is a PLOT.
options ls=111 ps=61;
proc plot data=one; plot weight*Dbh=obsid;
TITLE1 'Scatter plot'; run;
options ps=512 ls=132;

The PROC PLOT step is surrounded by OPTIONS statements. Although I usually like a large page size (512), I don't want the plot to cover 512 lines, so I set the page size to 61 for the plot, and then reset it to 512 for subsequent output.
The plot is for weight on DBH. Notice the “=ObsID” at the end of the plot statement. This will
cause SAS to plot a single character (the ObsID I created) as a symbol representing each
observation in the plot. I do this to be able to distinguish between the observations in the plot.
See the output from PROC PLOT.
The MEANS procedure is often used to examine variables and determine the number of observations of each variable, and its minimum and maximum.
proc means data=one n mean max min var std stderr;
  TITLE1 'Raw data means';
  var Dbh Height Age Grav Weight;
run;

This has limited utility for regression analysis.
See the output from PROC MEANS titled “Raw data means”
You might use it to look for outliers, or to get the range of values for a plot.
The SAS UNIVARIATE procedure is very useful in regression analysis. However, the application to
the RAW variables is not very useful.
proc univariate data=one normal plot;
  TITLE1 'Raw data Univariate analysis';
  var Weight Dbh;
run;

See the output from PROC UNIVARIATE.
As far as regression is concerned, the preceding material is ancillary, used to prepare or enhance our
analysis. The important information for regression will be provided by PROC REG or PROC GLM.
proc reg data=one LINEPRINTER; ID ObsID DBH;
  TITLE2 'Simple linear regression';
  model Weight = Dbh / clb alpha=0.01; *** p xpx i influence CLI CLM;
  Slope:Test DBH = 180;
  Joint:TEST intercept = 0, DBH = 180;
run; options ls=78 ps=45;
  plot residual.*predicted.=obsid / VREF=0; run;
  OUTPUT OUT=NEXT1 P=Predicted R=Resid cookd=cooksd dffits=dffits
    STUDENT=student rstudent=rstudent lclm=lclm uclm=uclm lcl=lcl ucl=ucl;
run;
options ps=61 ls=95;

PROC REG Output
See the output from PROC REG.
The ANOVA table
The ANOVA table has one of the key tests of hypothesis and an estimate of the Mean Square
Error (MSE).
Supplemental information
This section is ancillary information. Not necessary, but informative in some cases. A popular statistic from this section is the R-square (R²).
Parameter estimates and tests
This section is the most important in terms of interpreting the regression. It provides estimates of the regression coefficients (intercept and slope) with their standard errors and a test of each regression coefficient against zero. The confidence intervals are available on request (option CLB). The default confidence interval is 95%, but other values of alpha can be specified.
The parameter estimates are: Intercept = b0 = –729.396300 and Slope = b1 = 178.563714
The linear equation is: Yi = b0 + b1 Xi = –729.4 + 178.6* Xi for any value of X
Interpretation : The weight starts at –729 when the diameter is zero and increases by 179
pounds for each additional inch in diameter.
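For example, a tree with a DBH of 10 inches has a predicted weight of –729.4 + 178.6(10) = 1056.6 pounds.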
For a t-test of either parameter against an hypothesized value or a confidence interval on either
parameter we would use the standard errors provided by SAS.
Sb0 = 55.69366336 and Sb1 = 8.57640103

A 99% confidence interval is calculated by SAS because the option CLB was requested on the model statement and a value of alpha = 0.01 was specified: P(155.5 ≤ β1 ≤ 201.6) = 0.99.
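This matches the hand calculation: the two-tailed t-value for 45 d.f. and alpha = 0.01 is 2.690, and 178.564 ± 2.690(8.576) = 178.564 ± 23.07 gives (155.5, 201.6).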
A 95% confidence interval is calculated as: Parameter ± t-value * standard error.

The t-value has n – 2 = 45 d.f. and can be found in a t-table. For a two-tailed interval and a value of α = 0.05, the t-value is 2.014.
For the slope the estimate is 178.6 and the standard error is 8.576.
The confidence interval is then: 178.6 ± 2.014*8.576

The preferred expression is: P(161.328 ≤ β1 ≤ 195.872) = 0.95
SAS automatically provides a t-test of each parameter against an hypothesized value of zero, the
most common test.
t values and P values are
Intercept : calculated t = –13.097, P value < 0.0001
Slope : calculated t = 20.820, P value < 0.0001
Interpretation: The slope and intercept differ from zero. Therefore, the line does not pass
through the origin, and the line is not a “flat” line; the calculated regression line is an
improvement over the original flat line fitted by the correction factor.
Other values may be of interest besides zero. These can be tested by hand, or with a "TEST" statement in SAS.
I added an additional, optional, test. I decided to test two specific hypotheses about the regression
coefficients.
Slope:Test DBH = 180;
Joint:TEST intercept = 0, DBH = 180;

SAS provides a mechanism to do this. The statement "TEST DBH = 180;" is added to the program after the model statement. The test outputs the test result (in this program the output follows the list of observation diagnostics).
This tests the hypothesis H0: βDBH = 180, and from the output (F = 0.03, P = 0.8678) you can see that it is not rejected here. SAS used an F test (more flexible); we would probably use a t-test (computationally and conceptually easier).
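By hand, that t-test is t = (b1 – 180) / Sb1 = (178.564 – 180) / 8.576 = –0.167, nowhere near the critical value of 2.014. Note that t² = 0.028, which matches (to rounding) the F = 0.03 that SAS reports, since F = t² for a one degree of freedom test.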
The second hypothesis tested is a joint test of the two hypotheses together: H0: β0 = 0 and βDBH = 180. This is a two degree of freedom test; from the output (F = 859.96, P < .0001) it is rejected.
Other useful information
Other useful output from PROC REG includes observation diagnostics, residual plots and the
ability to output residuals for testing.
AS YOU KNOW, WE TEST NORMALITY OF THE RESIDUALS, NOT THE RAW DATA!!!
Observation diagnostics
There are a few diagnostics calculated from individual observations that are of interest.
First the residuals are of interest only for their sign. Long strings of residuals with the same sign
can indicate either curvature or a lack of independence.
Since we don't know what constitutes an overly large residual, these are not very useful for
detecting outliers.
Another value of interest is the standardized residuals; in SAS, the values "STUDENT" and "RSTUDENT".

These are standardized residuals, and should have a mean of zero and a variance of one. They should follow a t distribution, so for our example with 45 degrees of freedom we expect that 99% would be between ±2.690.
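A minimal sketch of how to get that cutoff in SAS, using the TINV quantile function:

   data _null_;
     cutoff = tinv(0.995, 45);  * 99.5th percentile of t with 45 d.f.;
     put cutoff=;               * prints 2.6896, the two-tailed 99% bound;
   run;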
The HAT diag values.
Hat diag is a relative measure of how far an X value is from the center of the X space. A high
value indicates an unusual value of X. This is not necessarily bad, but unusual values should
be examined for correctness.
The hat diag values will sum to “p”, where p is the number of parameters estimated in the
model (2 for SLR).
The mean of the hat diag values will be p/n, and any values more than twice this value are
considered “large”. Again, this is not necessarily a problem.
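For this example p = 2 and n = 47, so the mean hat diag value is 2/47 ≈ 0.043, and twice that is about 0.085; this is where the "over 0.08" screening value used below comes from.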
Influence diagnostics examine how the regression would change if an observation were removed
from the analysis. If an observation is removed and the regression does not change, the
observation is not influential. If the regression changes a lot, the observation is very influential.
DFFITS measures the change in terms of the “fit”, as judged by the predicted (Yhat) value. If a
point is removed and Yhat changes a lot, the point is influential.
For small to medium size databases, DFFITS should not exceed 1, while for large databases it should not exceed 2*sqrt(p/n).
DFBETAS measures the change in terms of the “fit”, as judged by changes in b0 and b1. If a point
is removed and b0 or b1 change a lot, the point is influential.
For small to medium size databases, DFBETAS should not exceed 1, while for large databases it should not exceed 2/sqrt(n) (see Appendix 1).
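For our data those large-database cutoffs would be 2*sqrt(2/47) ≈ 0.41 for DFFITS and 2/sqrt(47) ≈ 0.29 for DFBETAS; with only 47 observations, the simpler cutoff of 1 is the operative rule.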
Observation diagnostics for this example:
Look for RSTUDENT values over 2.7.
Look for Hat diag values over 0.08.
Look for DFFITS & DFBETAS over 1.
Large Hat diag values occur on both ends of the regression.
Large DFFITS and DFBETAS for observations f & q.
Large RSTUDENT for observation f.

Residual plots (produced by either PROC REG or PROC PLOT)
Residual plots are a useful tool for detecting various problems:
Outliers
Curvature
Nonhomogeneous variance
and more.

Univariate tests & graphics (with options "normal" and "plot")
proc univariate data=next1 normal plot; var Resid;
  TITLE3 'Residual analysis'; run;

Note that the residuals sum to zero.
Examine the tests of normality.
Check the stem-and-leaf and box plots for symmetry and outliers.
Examine the "Normal Probability Plot" to study departure from normality.

The SAS UNIVARIATE procedure is very useful in regression analysis. However, the application to the RAW variables is not very useful. We will be interested in using this PROC to evaluate normality. We will be ESPECIALLY interested in the tests; for the raw variables these are

Shapiro-Wilk (Weight):  W = 0.710878   Pr < W  <0.0001
Shapiro-Wilk (Dbh):     W = 0.89407    Pr < W  0.0005

We will also be interested in other tools to evaluate normality (STEM & LEAF, BOX PLOT, NORMAL PROBABILITY PLOT), but NOT FOR THE RAW DATA for either variable (X or Y). Note that these tests of normality on the raw data are not particularly useful. We will actually test residuals, for reasons explained later.
We will be assuming normality and testing for normality, but not on the original variables.
Later we will be testing the Deviations or Residuals!!! These are the appropriate tests of
normality, not the tests of the original variables!!!
High quality graphics are available in SAS with procedures such as GPLOT and GCHART.
PROC GPLOT was used to produce the following plot. Among other things GPLOT will
automagically produce regression lines (linear, quadratic or cubic) with confidence intervals (α = 0.05 or 0.01) for the regression line or the individual points.
GOPTIONS DEVICE=CGMflwa GSFMODE=REPLACE GSFNAME=OUT1 NOPROMPT noROTATE
  ftext='TimesRoman' ftitle='TimesRoman' htext=1 htitle=1 ctitle=black ctext=black;
FILENAME OUT1 'C:\SAS\SLRTrees1.CGM';
PROC GPLOT DATA=ONE;
  TITLE1 font='TimesRoman' H=1 'Simple Linear Regression Example';
  TITLE2 font='TimesRoman' H=1 'Wood harvest from trees';
  PLOT weight*Dbh=1 weight*Dbh=2 / overlay HAXIS=AXIS1 VAXIS=AXIS2;
  AXIS1 LABEL=(font='TimesRoman' H=1 'Diameter at breast height (inches)') WIDTH=1
    MINOR=(N=1) VALUE=(font='TimesRoman' H=1) color=black ORDER=3 TO 13 BY 1;
  AXIS2 LABEL=(ANGLE=90 H=1 'Weight of wood harvested (lbs)') WIDTH=1
    VALUE=(font='TimesRoman' H=1) MINOR=(N=5) color=black ORDER=0 TO 1800 BY 200;
  SYMBOL1 color=red V=None I=RLcli99 L=1 MODE=INCLUDE;
  SYMBOL2 color=blue V=dot I=None L=1 MODE=INCLUDE;
RUN;
[Figure: GPLOT output "Simple Linear Regression Example / Wood harvest from trees"; Weight of wood harvested (lbs), 0 to 1800, plotted against Diameter at breast height (inches), 3 to 13, with the fitted regression line and 99% confidence limits.]

Summary of simple linear regression
For this relationship a significant correlation exists between the diameter of the tree and the weight
of the wood harvested from the tree. In fact, we get 178.6 pounds of wood for each additional
inch of diameter: P(161.328 ≤ β1 ≤ 195.872) = 0.95.
The equation to predict wood harvest from diameter is Yi = –729.4 + 178.6*Xi
We might expect a tree with a diameter of zero to have a weight of zero, but our model
says that the weight for such a tree would actually be –729.4. The first question is whether this
is “statistically significant” in differing from the hypothesized value of zero. It is (P<0.0001).
This is impossible, so either there is something about tree growth we don't understand, or we do
not have the right model.
So we try to evaluate our model.
Are the observations correct and reasonable?
Examine the RSTUDENT values. We note a potential problem with Obs “f”
Examine the residual plot. This plot appears to show that the line is actually curved and
possibly has nonhomogeneous variance!!!
The Hat diag values indicated that the values on the higher end of the regression were possibly
“unusual”.
This is not uncommon for simple linear regression, which is kind of one dimensional for X.
These statistics will be more useful for multiple regression.
The influence diagnostics indicated that a number of observations were “influential”.
If the observations are correct, and there are no outliers, there is no problem. Also if an
observation IS an outlier, but it is not influential, we don't have much of a problem.
Problems occur when an observation is BOTH an outlier and influential.
Like observation "f"!!!

Examine the PROC UNIVARIATE output for tests and graphics of normality and for outliers.
The Shapiro-Wilk test indicates the residuals do not depart from normality.
The graphics do not show a great departure from normality, but there is a possible outlier
(observation “f”).
The normal probability plot shows only one departure, and it appears to be the outlier on
the upper end (f again).
So this regression appears to fit “well”. Everything is significant and the R2 is pretty high, but
there are a lot of problems.
The basic problem is that we do not have the right model. The model should really have some curvature (we will cover this later). Then, observations that are outliers on the ends might fit right on the line.

Curvilinear Regression or Intrinsically Linear Regression
As the name implies, these are regressions that fit curves. However, the regressions we will
discuss are also linear models, so most of the techniques and SAS procedures we have discussed
will still be relevant.
We will discuss two basic types of curvilinear model.
The first are models that are not linear, but that can be “linearized” by transformation. These
models are referred to as “intrinsically linear”, because after transformation they are linear.
Later we will cover polynomial regressions. These are an extraordinarily flexible family of curves that will fit almost anything. Unfortunately, they rarely have a good interpretation of the parameter estimates.
Intrinsically linear models
These are models that contain some transformed variable (logarithms, inverses, square roots, sines, etc.). We will concentrate on logarithms, since these models are some of the most useful.
What is the effect of taking a logarithm of a dependent or independent variable? For example, instead of fitting Yi = b0 + b1*Xi + ei, fit log(Yi) = b0 + b1*Xi + ei.

If we fit log(Yi) = b0 + b1*Xi + ei, then the original model, before we took logarithms, must have been Yi = b0*e^(b1*Xi)*ei, where "e" is the base of the natural logarithm (2.718281828). This is called the "exponential growth model" if b1 is positive, or the "exponential decay model" if it is negative. It is used in the biological sciences to fit exponential growth (b1 > 0) or mortality (b1 < 0). This function produces a curve that increases or decreases proportionally.

[Figure: exponential growth and decay curves for the model Yi = b0*e^(b1*Xi)*ei]
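Fitting the exponential model is then just a simple linear regression of log(Y) on X. A minimal sketch, using the LWEIGHT variable (the natural log of Weight) created in the data step:

   * Intrinsically linear fit of the exponential model: log(Y) on X;
   proc reg data=one;
     model lweight = Dbh;
   run;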
Other examples of curvilinear models.

Power model: Yi = b0*Xi^b1*ei. Taking logarithms gives log(Yi) = log(b0) + b1*log(Xi) + log(ei).

This is a simple linear regression with a dependent variable of log(Yi) and an independent variable of log(Xi). This model is used to fit many things, including morphometric data. The following graphic was fitted for (b0, b1) values of (29, –1), (19, 0), (4, 0.5), (1, 1) and (0.03, 2).

[Figure: power-model curves illustrating b1 negative, b1 = 0, 0 < b1 < 1, b1 = 1 and b1 > 1]

Models following a hyperbolic shape, with an asymptote, can be fitted with inverses. A model with an inverse term (e.g. 1/Xi) will fit a "hyperbola" with its asymptote: Yi = b0 + b1*(1/Xi) + ei, where b0 fits the asymptote.

[Figure: hyperbolic curves, Yhat = 10 + 10*(1/Xi) and Yhat = 10 – 10*(1/Xi)]
These are a few of many possible curvilinear regressions. Models including power terms, exponents, logarithms, inverses, roots and trigonometric functions may be curvilinear.

A note on logarithms. The models described above require natural logs. In SAS the function "LOG" produces natural logs (LOG10 gives log base 10). In EXCEL and on many calculators the natural log function is "LN".
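A quick check of the two SAS functions:

   data _null_;
     natural = log(10);    * natural log: 2.30259;
     base10  = log10(10);  * log base 10: 1;
     put natural= base10=;
   run;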
However, not all curves can be fitted by linear models with transformations. Some are nonlinear, and require nonlinear curve fitting techniques. For example,

Yi = b0*Xi^b1*ei is curvilinear (intrinsically linear)
Yi = b0*Xi^b1 + ei is nonlinear
Yi = b0 + b1*Xi + b2*Xi^2 + ei is linear (polynomial)
Yi = b0 + b1*Xi + b2*Xi^b3 + ei is nonlinear

Note that the power model, Yi = b0*Xi^b1*ei, has a multiplicative error. This is interesting because when the error is multiplied by a function of the independent variable, the variability of the raw data about the regression line should appear to increase as Xi increases. The log transformation (of Yi) should remove this nonhomogeneous variance. This would not be true after a log transformation of Xi alone.

Curvilinear Residual Patterns
Transformations of Yi, like log transformations, will affect homogeneity of variance. The raw data should actually appear nonhomogeneous.

[Figure: residual patterns for a Yi transformation, on the raw and transformed scales]

Transformations of Xi will fit curves but will not affect the homogeneity of variance.

[Figure: residual patterns for an Xi transformation, on the raw and transformed scales]

Polynomials assume homogeneous variance and will not adjust variance.

[Figure: residual pattern for a polynomial fit]

Intrinsically Linear (Curvilinear) regression example
basis of DBH (diameter at breast height)? Recall that it looked a little curved, and maybe even
had nonhomogeneous variance? Let's take another look at that model.
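A minimal sketch of that second look, fitting the log-log (power) model with the LWEIGHT and LDBH variables created in the original data step:

   * Power model fit: log(Weight) on log(Dbh);
   proc reg data=one;
     model lweight = ldbh;
   run;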
Typically, morphometric relationships (between parts of an organism) are best fitted with models with both Log(Y) and Log(X):

Fish length - scale length
Fish total length - fish fork length
Crab width - crab length
Fish length - fish weight, etc.

Statistics quote: There are three kinds of lies: lies, damned lies, and statistics. Benjamin Disraeli (1804-1881)
Regression Diagnostic Criteria (Appendix 1 Supplement)

Criteria for the interpretation of selected regression statistics from the SAS output.
Reference was primarily Neter, J., Kutner, M. H., Nachtsheim, C. J., and Wasserman, W.,
Applied Linear Statistical Models, 4th Edition, Richard D. Irwin, Inc., Burr Ridge, Illinois,
1996.
General regression diagnostics
Adjusted R²: R²adj = 1 - ((n-1)/(n-p)) * (SSError/SSTotal) = 1 - ((n-1)/(n-p)) * (1 - R²)

This is intended to be an adjustment to R² for additional variables in the model.
Unlike the usual R2, this value can decrease as more variables are entered in the model if
the variables do not account for sufficient additional variation (equal to the MSE).
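As a check against the tree example: R² = 0.9060 with n = 47 and p = 2, so R²adj = 1 - (46/45)(1 - 0.9060) = 0.9039, which matches the Adj R-Sq in the PROC REG output.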
Standardized regression coefficient bj': bj' = bj * (Sxj / Sy)

Unlike the usual regression coefficient, the magnitude of the standardized coefficient provides a meaningful comparison among the regression coefficients. Larger standardized regression coefficients have more impact on the calculation of the predicted value and are more "important".

Partial correlations
Squared semipartial correlation TYPE I = SCORR1 = SeqSS(Xj) / SSTotal
Squared partial correlation TYPE I = PCORR1 = SeqSS(Xj) / (SeqSS(Xj) + SSError)
Squared semipartial correlation TYPE II = SCORR2 = PartialSS(Xj) / SSTotal
Squared partial correlation TYPE II = PCORR2 = PartialSS(Xj) / (PartialSS(Xj) + SSError)
Note that for regression, TYPE II SS and TYPE III SS are the same.

Residual Diagnostics
The hat matrix main diagonal elements, hii ("Hat Diag H" values in SAS), are called "leverage values"; they are used to detect unusual observations in the X space. They can also identify substantial extrapolation for new values. As a general rule, hii values greater than 0.5 are "large" while those between 0.2 and 0.5 are moderately large; also look for a leverage value noticeably larger than the next largest. The hii values sum to p, so the mean hii = p/n (note that this is < 1). A value may be an "outlier" in the X space if it is more than twice that mean (i.e. hii > 2p/n).
Studentized residuals ("Student Residual" in SAS), also called internally studentized residuals. There are two versions:
Simpler calculation = ei / sqrt(MSE)
More common application = ei / sqrt(MSE * (1 - hii))  [SAS produces these]
We already assume the residuals are normally distributed, so these values would approximately follow a t distribution, where for large samples
about 65% are between -1 and +1
about 95% are between -2 and +2
about 99% are between -2.6 and +2.6
Deleted studentized residuals ("RStudent" in SAS), also called externally studentized residuals. There are also two versions, as with the studentized residuals above:
Deleted studentized = e(i) / sqrt(MSE(i))
Deleted internally studentized = e(i) / sqrt(MSE(i) * (1 - hii))  [SAS produces these values]
As with the studentized residuals above, these values would approximately follow a t distribution.
Influence Diagnostics
DFFITS: an influence statistic; it measures the difference in fits as judged by the change in predicted value when the point is omitted. This is a standardized value and can be interpreted as a number of standard deviation units. For small to medium size databases, DFFITS should not exceed 1, while for large databases it should not exceed 2*sqrt(p/n).
DFBETAS: an influence statistic; it measures the difference in fits as judged by the change in the values of the regression coefficients. Note that this is also a standardized value. For small to medium size databases, DFBETAS should not exceed 1, while for large databases it should not exceed 2/sqrt(n).
Cook's D: an influence statistic (D is for distance), based on the boundary of a simultaneous regional confidence region for all regression coefficients. This does not follow an F distribution, but it is useful to compare it to the percentiles of the F distribution [F(1-α; p, n-p)], where a value below the 10th or 20th percentile shows little effect, while a value at the 50th percentile is considered large.
Multicollinearity Diagnostics

VIF is related to the severity of multicollinearity. The VIF of a standardized regression coefficient would be expected to have a value of 1 if the regressors are uncorrelated. If the mean VIF is much greater than 2, serious problems are indicated. No single VIF should exceed 10. Tolerance is the inverse of VIF, where Tolerance(k) = 1 - R²(k).
The condition number (a multivariate evaluation). Eigenvalues are extracted from the regressors. These are variances of linear combinations of the regressors, and go from larger to smaller. If one or more are zero (at the end) then the matrix is not of full rank. They sum to p, and if the Xk are independent, each would equal 1. The condition number is the square root of the ratio of the largest (always the first) to each of the others. If this value exceeds 30 then multicollinearity may be a problem.
Model Evaluation and Validation

R²(p), AdjR²(p) and MSE(p) can be used to graphically compare and evaluate models. The subscript p refers to the number of parameters in the model.

Mallow's Cp criterion. Use of this statistic presumes no bias in the full model MSE, so the full model should be carefully chosen to have little or no multicollinearity. The criterion is Cp = (SSEp / TrueMSE) - (n - 2p). The Cp statistic will be approximately equal to p if there is no bias in the regression model.

PRESSp criterion (PRESS = Prediction SS). This criterion is based on deleted residuals. There are n deleted residuals in each regression, and PRESSp is the SS of the deleted residuals. This value should approximately equal the MSE if predictions are good; it will get larger as predictions are poorer. The values may be plotted, and models with smaller PRESS statistics represent better predictive models. This statistic can also be used for model validation.
Simple Linear Regression: Appendix 2 - Annotated SAS example

The SAS program.
I will presume you are familiar with the SAS data step. I will discuss it briefly only for this first
example.
SAS Statements – all SAS statements end in a semicolon;
Comments – comments are statements that start with an asterisk. They do nothing in the program,
they are included only for the purpose of documenting the program.
Options can be specified to modify output appearance. The OPTIONS statement in the program creates a page size (ps) of 512 lines (use 54 for the lab) and a line size of 120 character columns, and suppresses the centering of output and the printing of the date and page numbers.
The DATA step. All our programs will include a DATA section. In this section the data to be
analyzed is entered into the SAS system and, if necessary, modified for analysis.
A second statement informs SAS that the data are included in the program (CARDS) and that if there are missing values the system should NOT go to the next line to get the data (MISSOVER).
The next statement in my program is a TITLE statement. Up to 9 titles can be active (TITLE1
through TITLE9) and once set are printed at the top of each page. Setting a new title, say
TITLE3, would not affect lower numbered titles (TITLE1 and TITLE2) but would delete all
higher numbered titles (TITLE4 ...). The TITLE statement ends in a semicolon as usual, and the text to be used as the title is enclosed in single quotes.
The input statement. Along with the DATA statement, this is an important statement. It names
the variables to be used, tells SAS what type of variables they are (numeric or alphanumeric)
and gets the data into the SAS data set.
Note that only one variable in the list is followed by a $. This will cause SAS to assume that all
variables are numeric except the variable called OBSID.
The variable OBSID is one I created by adding to each observation a different letter. The first
line got an “a”, the second a “b”, etc. The 26th observation got a “z” and the 27th an “A”,
etc. This was done to have a way of distinguishing each observation.
The LABEL statement provides a way of identifying each variable. It is optional, but if present
will be used by SAS in a number of places to identify the variables.
label ObsNo = 'Original observation number'
Dbh = 'Diameter at breast height (inches)'
etc. ... ;
I have deactivated the labels by making them a comment statement.
If data must be modified, it is done in the data step after the INPUT statement. I have two
statements that create logarithms. These are not used in the first analysis, but will be used
later in the semester.
lweight = log(weight);
ldbh = log(DBH);
These statements create two new variables (LWEIGHT and LDBH) that are the natural logs of
the original variables.
Two last statements before the data. The CARDS statement tells SAS that the data step is done
and data follows. The RUN statement tells SAS to process all information that it has so far
and output any messages about the analysis to the LOG.
cards; run;
Note that two statements can occur on the same line.
The SAS DATA step is now complete. The data will be entered into the SAS system and
processing will continue. The rest of the statements in this program are procedures (PROCs)
and associated statements.
The complete program:

dm 'log;clear;output;clear';
options ps=512 ls=120 nocenter nodate nonumber FORMCHAR="|----|+|---+=|-/\<>*";
TITLE1 'Appendix02: Estimating tree harvest weights';
ODS HTML style=minimal body='C:\SAS\Appendix02 SlrTrees.HTML' ;
ODS rtf style=minimal body='C:\SAS\Appendix02 SlrTrees.RTF' ;
filename input1 'C:\SAS\Appendix02 SlrTrees.csv';
FILENAME OUT1 'C:\SAS\Appendix02 SlrTrees01.CGM';
FILENAME OUT2 'C:\SAS\Appendix02 SlrTrees02.CGM';
***********************************************;
*** Data from Freund & Wilson (1993)        ***;
*** TABLE 8.24 : ESTIMATING TREE WEIGHTS    ***;
***********************************************;
data one; infile input1 missover DSD dlm="," firstobs=2;
input ObsNo Dbh Height Age Grav Weight;
*********** label ObsNo = 'Original observation number'
   Dbh    = 'Diameter at breast height (inches)'
   Height = 'Height of the tree (feet)'
   Age    = 'Age of the tree (years)'
   Grav   = 'Specific gravity of the wood'
   Weight = 'Harvest weight of the tree (lbs)'
   ObsId  = 'Identification letter added to dataset';
lweight = log(weight);
ldbh = log(DBH);
observation + 1;
if observation ge 27 then ObsID = byte(observation+64-26); * upper case *;
if observation le 26 then ObsID = byte(observation+96);    * lower case *;
keep Dbh Height Age Grav Weight ldbh lweight obsid;
datalines; run;
proc print data=one; TITLE2 'Raw data print'; run;
options ls=95 ps=61; proc plot data=one; plot weight*Dbh=obsid;
TITLE2 'Scatter plot'; run;
options ps=256 ls=85;
proc means data=one n mean max min std stderr;
TITLE2 'Raw data means';
var Dbh Height Age Grav Weight; run;
proc univariate data=one normal plot;
TITLE2 'Raw data Univariate analysis';
var Weight Dbh; run;
proc reg data=one LINEPRINTER; ID ObsID DBH;
TITLE2 'Simple linear regression';
model Weight = Dbh / clb alpha=0.01; *** p xpx i influence CLI CLM;
Slope:Test DBH = 180;
Joint:TEST intercept = 0, DBH = 180;
run; options ls=78 ps=45;
plot residual.*predicted.=obsid / VREF=0; run;
OUTPUT OUT=NEXT1 P=Predicted R=Resid cookd=cooksd dffits=dffits
STUDENT=student rstudent=rstudent lclm=lclm uclm=uclm lcl=lcl ucl=ucl;
run;
options ps=61 ls=95;
proc print data=next1;
TITLE3 'Listing of observation diagnostics';
var ObsId DBH Weight Predicted Resid student rstudent; run;
proc print data=next1;
TITLE3 'Listing of observation diagnostics';
var ObsId cooksd dffits lclm uclm lcl ucl; run;
options ps=512 ls=85;
proc univariate data=next1 normal plot; var Resid;
TITLE3 'Residual analysis'; run;
options ls=95 ps=61; proc plot data=one; plot weight*Dbh=obsid;
TITLE2 'Scatter plot'; run;
options ps=512 ls=85;
ods html close;
ods rtf close;
run;
quit;
The SAS log:

1    dm 'log;clear;output;clear';
2    options ps=512 ls=120 nocenter nodate nonumber FORMCHAR="|----|+|---+=|-/\<>*";
3    TITLE1 'Appendix02: Estimating tree harvest weights';
4
5    ODS HTML style=minimal body='C:\SAS\Appendix02 SlrTrees.HTML' ;
NOTE: Writing HTML Body file: C:\SAS\Appendix02 SlrTrees.HTML
6    ODS rtf style=minimal body='C:\SAS\Appendix02 SlrTrees.RTF' ;
NOTE: Writing RTF Body file: C:\SAS\Appendix02 SlrTrees.RTF
7    filename input1 'C:\SAS\Appendix02 SlrTrees.csv';
8    FILENAME OUT1 'C:\SAS\Appendix02 SlrTrees.CGM';
9    FILENAME OUT2 'C:\SAS\Appendix02 SlrTrees.CGM';
10
11   ***********************************************;
12   *** Data from Freund & Wilson (1993)        ***;
13   *** TABLE 8.24 : ESTIMATING TREE WEIGHTS    ***;
14   ***********************************************;
15
16   data one; infile input1 missover DSD dlm="," firstobs=2;
17   input ObsNo Dbh Height Age Grav Weight;
18   *********** label ObsNo = 'Original observation number'
19      Dbh    = 'Diameter at breast height (inches)'
20      Height = 'Height of the tree (feet)'
21      Age    = 'Age of the tree (years)'
22      Grav   = 'Specific gravity of the wood'
23      Weight = 'Harvest weight of the tree (lbs)'
24      ObsId  = 'Identification letter added to dataset';
25   lweight = log(weight);
26   ldbh = log(DBH);
27   observation + 1;
28   if observation ge 27 then ObsID = byte(observation+64-26); * upper case *;
29   if observation le 26 then ObsID = byte(observation+96); * lower case *;
30   keep Dbh Height Age Grav Weight ldbh lweight obsid;
31   datalines;

NOTE: The infile INPUT1 is:
      Filename=C:\SAS\Appendix02 SlrTrees.csv, RECFM=V, LRECL=256,
      File Size (bytes)=1125, Last Modified=18Jan2009:19:36:54,
      Create Time=20Dec2009:11:35:59
NOTE: 47 records were read from the infile INPUT1.
      The minimum record length was 19.
      The maximum record length was 24.
NOTE: The data set WORK.ONE has 47 observations and 8 variables.
NOTE: DATA statement used (Total process time):
      real time    0.03 seconds
      cpu time     0.04 seconds

31 ! run;

33   proc print data=one; TITLE2 'Raw data print'; run;

NOTE: There were 47 observations read from the data set WORK.ONE.
NOTE: The PROCEDURE PRINT printed page 1.
NOTE: PROCEDURE PRINT used (Total process time):
      real time    0.26 seconds
      cpu time     0.07 seconds
EXST7015: Estimating tree weights from other morphometric variables
Raw data print

Obs  ObsNo   Dbh  Height  Age   Grav  Weight  ObsID  lweight     ldbh
  1      1   5.7      34   10  0.409     174      a  5.15906  1.74047
  2      2   8.1      68   17  0.501     745      b  6.61338  2.09186
  3      3   8.3      70   17  0.445     814      c  6.70196  2.11626
  4      4   7.0      54   17  0.442     408      d  6.01127  1.94591
  5      5   6.2      37   12  0.353     226      e  5.42053  1.82455
  6      6  11.4      79   27  0.429    1675      f  7.42357  2.43361
  7      7  11.6      70   26  0.497    1491      g  7.30720  2.45101
  8      8   4.5      37   12  0.380     121      h  4.79579  1.50408
  9      9   3.5      32   15  0.420      58      i  4.06044  1.25276
. . .
 46     46   5.2      47   13  0.432     194      T  5.26786  1.64866
 47     47   3.7      33   13  0.389      66      U  4.18965  1.30833
35   options ls=95 ps=61; proc plot data=one; plot weight*Dbh=obsid;
36   TITLE2 'Scatter plot'; run;
37   options ps=256 ls=85;
38

NOTE: There were 47 observations read from the data set WORK.ONE.
NOTE: The PROCEDURE PLOT printed page 2.
NOTE: PROCEDURE PLOT used (Total process time):
      real time    0.14 seconds
      cpu time     0.00 seconds

EXST7015: Estimating tree weights from other morphometric variables
Scatter plot

[Line-printer plot of Weight*Dbh (symbol is the value of ObsID); Weight runs from 0 to 1800 and Dbh from 3 to 13. NOTE: 11 obs hidden.]
39   proc means data=one n mean max min std stderr;
40   TITLE2 'Raw data means';
41   var Dbh Height Age Grav Weight; run;

NOTE: There were 47 observations read from the data set WORK.ONE.
NOTE: The PROCEDURE MEANS printed page 3.
NOTE: PROCEDURE MEANS used (Total process time):
      real time    0.28 seconds
      cpu time     0.04 seconds

EXST7015: Estimating tree weights from other morphometric variables
Raw data means

The MEANS Procedure

Variable     N          Mean       Maximum       Minimum      Variance       Std Dev     Std Error
Dbh         47     6.1531915    12.1000000     3.5000000     4.4016744     2.0980168     0.3060272
Height      47    49.5957447    79.0000000    27.0000000   167.6808511    12.9491641     1.8888297
Age         47    16.9574468    27.0000000    10.0000000    26.9111933     5.1876000     0.7566892
Grav        47     0.4452979     0.5080000     0.3530000     0.0014853     0.0385402     0.0056217
Weight      47   369.3404255       1692.00    58.0000000     154916.75   393.5946534    57.4116808
43   proc univariate data=one normal plot;
44   TITLE2 'Raw data Univariate analysis';
45   var Weight Dbh; run;

NOTE: The PROCEDURE UNIVARIATE printed pages 4-5.
NOTE: PROCEDURE UNIVARIATE used (Total process time):
      real time    0.31 seconds
      cpu time     0.09 seconds

Appendix02: Estimating tree harvest weights
Raw data Univariate analysis

The UNIVARIATE Procedure
Variable: Weight

Moments
N                47           Sum Weights         47
Mean             369.340426   Sum Observations    17359
Std Deviation    393.594653   Variance            154916.751
Skewness         2.20870748   Kurtosis            4.83581557
Uncorrected SS   13537551     Corrected SS        7126170.55
Coeff Variation  106.566903   Std Error Mean      57.4116808

Basic Statistical Measures
Location               Variability
Mean     369.3404      Std Deviation        393.59465
Median   224.0000      Variance             154917
Mode      84.0000      Range                1634
                       Interquartile Range  341.00000

Note: The mode displayed is the smallest of 3 modes with a count of 2.

Tests for Location: Mu0=0
Test          Statistic          p Value
Student's t   t  6.433193        Pr > |t|   <.0001
Sign          M  23.5            Pr >= |M|  <.0001
Signed Rank   S  564             Pr >= |S|  <.0001

Tests for Normality
Test                  Statistic          p Value
Shapiro-Wilk          W     0.710878     Pr < W     <0.0001
Kolmogorov-Smirnov    D     0.24806      Pr > D     <0.0100
Cramer-von Mises      W-Sq  0.77793      Pr > W-Sq  <0.0050
Anderson-Darling      A-Sq  4.435579     Pr > A-Sq  <0.0050

Quantiles (Definition 5)
Quantile      Estimate
100% Max      1692
99%           1692
95%           1491
90%           814
75% Q3        462
50% Median    224
25% Q1        121
10%           74
5%            66
1%            58
0% Min        58

Extreme Observations
      Lowest              Highest
Value     Obs        Value     Obs
  58        9          814       3
  60       16          815      26
  66       47         1491       7
  70       35         1675       6
  74       18         1692      17
[Stem-and-leaf plot, box plot and normal probability plot for Weight; multiply Stem.Leaf by 10**+2. The distribution is strongly right skewed, with the largest weights plotted as outlying points.]

Appendix02: Estimating tree harvest weights
Raw data Univariate analysis
The UNIVARIATE Procedure
Variable: Dbh

Moments
N                47           Sum Weights         47
Mean             6.15319149   Sum Observations    289.2
Std Deviation    2.09801677   Variance            4.40167438
Skewness         1.17285986   Kurtosis            1.18369068
Uncorrected SS   1981.98      Corrected SS        202.477021
Coeff Variation  34.0963998   Std Error Mean      0.3060272

Basic Statistical Measures
Location               Variability
Mean     6.153191      Std Deviation        2.09802
Median   5.700000      Variance             4.40167
Mode     4.000000      Range                8.60000
                       Interquartile Range  2.90000

Note: The mode displayed is the smallest of 2 modes with a count of 4.

Tests for Location: Mu0=0
Test          Statistic          p Value
Student's t   t  20.10668        Pr > |t|   <.0001
Sign          M  23.5            Pr >= |M|  <.0001
Signed Rank   S  564             Pr >= |S|  <.0001

Tests for Normality
Test                  Statistic          p Value
Shapiro-Wilk          W     0.89407      Pr < W     0.0005
Kolmogorov-Smirnov    D     0.171951     Pr > D     <0.0100
Cramer-von Mises      W-Sq  0.214712     Pr > W-Sq  <0.0050
Anderson-Darling      A-Sq  1.387777     Pr > A-Sq  <0.0050
Quantile
100% Max
99%
95%
90%
75% Q3
50% Median
25% Q1
10%
5%
1%
0% Min Estimate
12.1
12.1
11.4
8.8
7.4
5.7
4.5
4.0
3.7
3.5
3.5 Extreme Observations
Lowest Highest Value
3.5
3.7
3.7
3.9
4.0 Value
8.8
9.3
11.4
11.6
12.1 Stem
12
11
11
10
10
9
9
8
8
7
7
6
6
5
5
4
4
3 Obs
9
47
35
37
44
Leaf
1
6
4 Obs
26
20
6
7
17 Boxplot 3
68
013
78
04
57
0011122
5666677
0224
555
0000233
5779
++++ 1
1
1 1
2
3
2
2
2
7
7
4
3
7
4 0









++


 + 
**


++

 James P. Geaghan  Copyright 2011 Statistical Techniques II
Simple Linear Regression: Appendix 2 Annotated SAS example
Page 152 Normal Probability Plot
47   proc reg data=one LINEPRINTER; ID ObsID DBH;
48   TITLE2 'Simple linear regression';
49   model Weight = Dbh / clb alpha=0.01; *** p xpx i influence CLI CLM;
50   Slope:Test DBH = 180;
51   Joint:TEST intercept = 0, DBH = 180; run; options ls=78 ps=45;
53   plot residual.*predicted.=obsid / VREF=0; run;
54   OUTPUT OUT=NEXT1 P=Predicted R=Resid cookd=cooksd dffits=dffits
55   STUDENT=student rstudent=rstudent lclm=lclm uclm=uclm lcl=lcl ucl=ucl;
56   run;
57   options ps=61 ls=95;

NOTE: The data set WORK.NEXT1 has 47 observations and 18 variables.
NOTE: The PROCEDURE REG printed pages 6-9.
NOTE: PROCEDURE REG used (Total process time):
      real time    0.59 seconds
      cpu time     0.28 seconds
Appendix02: Estimating tree harvest weights
Simple linear regression

The REG Procedure
Model: MODEL1
Dependent Variable: Weight

Number of Observations Read    47
Number of Observations Used    47

Analysis of Variance
Source            DF   Sum of Squares   Mean Square   F Value   Pr > F
Model              1          6455980       6455980    433.49   <.0001
Error             45           670191         14893
Corrected Total   46          7126171

Root MSE          122.03740    R-Square    0.9060
Dependent Mean    369.34043    Adj R-Sq    0.9039
Coeff Var          33.04198

Parameter Estimates
                   Parameter     Standard
Variable   DF       Estimate        Error   t Value   Pr > |t|     99% Confidence Limits
Intercept   1     -729.39630     55.69366    -13.10     <.0001   -879.18914   -579.60346
Dbh         1      178.56371      8.57640     20.82     <.0001    155.49675    201.63067
Test Slope Results for Dependent Variable Weight
Source        DF   Mean Square   F Value   Pr > F
Numerator      1     417.69334      0.03   0.8678
Denominator   45         14893

Test Joint Results for Dependent Variable Weight
Source        DF   Mean Square   F Value   Pr > F
Numerator      2      12807462    859.96   <.0001
Denominator   45         14893

[Line-printer plot of Residual against Predicted Value of Weight (VREF=0); symbol is the value of ObsID. Annotation on the plot: residual plots are a useful tool for detecting various problems - outliers, curvature, nonhomogeneous variance, and more. Observations f and q sit well above the zero reference line.]

58   proc print data=next1;
59   TITLE3 'Listing of observation diagnostics';
60   var ObsId DBH Weight Predicted Resid student rstudent; run;

NOTE: There were 47 observations read from the data set WORK.NEXT1.
NOTE: The PROCEDURE PRINT printed page 10.
NOTE: PROCEDURE PRINT used (Total process time):
      real time    0.13 seconds
      cpu time     0.03 seconds
Appendix02: Estimating tree harvest weights
Simple linear regression
Listing of observation diagnostics

[Listing of all 47 observations with ObsID, Dbh, Weight, Predicted, Resid, student and rstudent. The rows the discussion singles out include:

Obs  ObsID   Dbh  Weight  Predicted     Resid   student  rstudent
  6      f  11.4    1675    1306.23   368.770   3.29162   3.73546
 17      q  12.1    1692    1431.22   260.775   2.38302   2.52082
 19      s   8.6     515     806.25  -291.252  -2.44967  -2.60199 ]

61   proc print data=next1;
62   TITLE3 'Listing of observation diagnostics';
63   var ObsId cooksd dffits lclm uclm lcl ucl; run;

NOTE: There were 47 observations read from the data set WORK.NEXT1.
NOTE: The PROCEDURE PRINT printed page 11.
NOTE: PROCEDURE PRINT used (Total process time):
      real time    0.12 seconds
      cpu time     0.03 seconds

64   options ps=512 ls=85;

Appendix02: Estimating tree harvest weights
Simple linear regression
Listing of observation diagnostics

[Listing of all 47 observations with ObsID, cooksd, dffits, lclm, uclm, lcl and ucl. The rows the discussion singles out include:

Obs  ObsID    cooksd    dffits      lclm      uclm       lcl       ucl
  6      f   1.01075   1.61350   1176.08   1436.38    953.14   1659.32
 17      q   0.69191   1.24438   1285.93   1576.51   1072.28   1790.17
 19      s   0.16073  -0.60223    732.24    880.26    469.78   1142.72 ]
66   proc univariate data=next1 normal plot; var Resid;
67   TITLE3 'Residual analysis'; run;

NOTE: The PROCEDURE UNIVARIATE printed page 12.
NOTE: PROCEDURE UNIVARIATE used (Total process time):
      real time    0.14 seconds
      cpu time     0.04 seconds

68
EXST7015: Estimating tree weights from other morphometric variables
Simple linear regression
Residual analysis

The UNIVARIATE Procedure
Variable: E (Residual)

Moments
N                47           Sum Weights         47
Mean             0            Sum Observations    0
Std Deviation    120.703619   Variance            14569.3637
Skewness         0.47869472   Kurtosis            1.04153074
Uncorrected SS   670190.732   Corrected SS        670190.732
Coeff Variation  .            Std Error Mean      17.6064324

Basic Statistical Measures
Location               Variability
Mean     0.00000       Std Deviation        120.70362
Median   0.14041       Variance             14569
Mode     .             Range                660.02160
                       Interquartile Range  161.40929

Tests for Location: Mu0=0
Test          Statistic        p Value
Student's t   t  0             Pr > |t|   1.0000
Sign          M  0.5           Pr >= |M|  1.0000
Signed Rank   S  25            Pr >= |S|  0.7946

Tests for Normality
Test                  Statistic          p Value
Shapiro-Wilk          W     0.973389     Pr < W     0.3544
Kolmogorov-Smirnov    D     0.084574     Pr > D     >0.1500
Cramer-von Mises      W-Sq  0.044081     Pr > W-Sq  >0.2500
Anderson-Darling      A-Sq  0.354877     Pr > A-Sq  >0.2500

Quantiles (Definition 5)
Quantile       Estimate
100% Max     368.769960
99%          368.769960
95%          162.423301
90%          138.710558
75% Q3        75.141444
50% Median     0.140413
25% Q1       -86.267841
10%         -135.842356
5%          -153.980584
1%          -291.251641
0% Min      -291.251641

Extreme Observations
        Lowest                 Highest
  Value      Obs         Value      Obs
-291.252      19       138.711       35
-165.246      20       149.057        7
-153.981      15       162.423        9
-151.699       5       260.775       17
-135.842      36       368.770        6
[Stem-and-leaf plot, box plot and normal probability plot for the residuals; multiply Stem.Leaf by 10**+2. The plots are roughly symmetric, with one large positive residual (observation 6, "f") standing apart at the top.]
110   GOPTIONS DEVICE=CGMflwa GSFMODE=REPLACE GSFNAME=OUT NOPROMPT noROTATE
111   ftext='TimesRoman' ftitle='TimesRoman' htext=1 htitle=1 ctitle=black ctext=black;
112
113   GOPTIONS GSFNAME=OUT1;
114   FILENAME OUT1 'C:\SAS\SLRTrees1.CGM';

NOTE: There were 47 observations read from the data set WORK.ONE.
NOTE: The PROCEDURE PLOT printed page 14.
NOTE: PROCEDURE PLOT used:
      real time    0.09 seconds
      cpu time     0.04 seconds

115   PROC GPLOT DATA=ONE;
116   TITLE1 font='TimesRoman' H=1 'Simple Linear Regression Example';
117   TITLE2 font='TimesRoman' H=1 'Wood harvest from trees';
118   PLOT weight*Dbh=1 weight*Dbh=2 / overlay HAXIS=AXIS1 VAXIS=AXIS2;
119   AXIS1 LABEL=(font='TimesRoman' H=1 'Diameter at breast height (inches)') WIDTH=1 MINOR=(N=1)
120   VALUE=(font='TimesRoman' H=1) color=black ORDER=3 TO 13 BY 1;
121   AXIS2 LABEL=(ANGLE=90 font='TimesRoman' H=1 'Weight of wood harvested (lbs)') WIDTH=1
122   VALUE=(font='TimesRoman' H=1) MINOR=(N=5) color=black ORDER=0 TO 1800 BY 200;
123   SYMBOL1 color=red V=None I=RLcli99 L=1 MODE=INCLUDE;
124   SYMBOL2 color=blue V=dot I=None L=1 MODE=INCLUDE; RUN;

NOTE: Regression equation : Weight = -729.3963 + 178.5637*Dbh.
NOTE: Foreground color BLACK same as background. Part of your graph may not be visible.
NOTE: 52 RECORDS WRITTEN TO C:\SAS\SLRTrees1.CGM

125   **** V = "dot" would place a dot for each point;
126   **** I = for regression: R requests fitted regression line, L, Q or C requests Linear,
127        Quadratic or cubic, CLM or CLI requests corresponding confidence interval and
128        95 specifies alpha level for CI (any value from 50 to 99);
129   **** I = for categories: requests STD (std dev) 1 (1 width, 2 or 3) M (of mean=std err)
130        J (join means of bars) t (add top & bottom hash) p (use pooled variance);
131   **** Other options for categories: omit M=std dev, use B to get bar for min/max;
132   RUN;

NOTE: There were 47 observations read from the data set WORK.ONE.
NOTE: PROCEDURE GPLOT used:
      real time    0.22 seconds
      cpu time     0.10 seconds

[Figure: GPLOT output "Simple Linear Regression Example / Wood harvest from trees"; Weight of wood harvested (lbs), 0 to 1800, against Diameter at breast height (inches), 3 to 13, with the fitted regression line and 99% confidence limits.]