Statistical Techniques II

Curvilinear Residual Patterns
Transformations of Yi, like log transformations, will affect homogeneity of variance. For a
transformation of Yi to be justified, the raw data should actually appear nonhomogeneous.
[Figure: plots of Yi against Xi]
Transformations of Xi will fit curves but will not affect the homogeneity of variance.
[Figure: plots of Yi against Xi]
Polynomials assume homogeneous variance and will not adjust variance.
[Figure: plot of Yi against Xi]

Intrinsically Linear (Curvilinear) regression example
Remember our SLR example about the amount of wood harvested from trees, predicted on the
basis of DBH (diameter at breast height)? Recall that it looked a little curved, and maybe even
had nonhomogeneous variance? Let's take another look at that model.
Typically, morphometric relationships (between parts of an organism) are best fitted with
models with both Log(Y) and Log(X).
Fish length vs. scale length
Fish total length vs. fish fork length
Crab width vs. crab length
Fish length vs. fish weight, etc.

Statistics quote: There are three kinds of lies: lies, damned lies, and statistics. Benjamin Disraeli (1804-1881)
James P. Geaghan, Copyright 2011

Here we have tree diameter and tree weight. Let's try a log-log model (using natural logs).
Since we are fitting a linear measurement to a volumetric or weight measurement, I
expect the following relationship to apply.
1 g = 1 cm$^3$ in the metric system if the material has the same density as water (specific
gravity = 1).
In other words, I expect
\[ \text{Wood weight} = b_0\,(\text{wood length})^{3} \]
\[ \log(\text{Wood weight}) = \log(b_0) + 3\,\log(\text{wood length}) \]
Note that this model will go through the origin, no problem there.
\[ Y_i = b_0 X_i^{b_1} e_i \]
The power term should be approximately 3, so we will test the coefficient of DBH
against a value of 3.
See computer output (Appendix 3).
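The test of the power term against 3 can be sketched in a few lines. This is only an illustration: the DBH and weight values below are hypothetical (they are not the course data in Appendix 3), and `fit_line` is a helper name introduced here. The sketch fits the log-log model by least squares and forms the t statistic for H0: b1 = 3.

```python
import math

# Hypothetical (DBH, weight) pairs for illustration only; not the course data.
dbh = [3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 10.0, 12.0]
weight = [60.0, 130.0, 280.0, 450.0, 720.0, 1100.0, 2050.0, 3700.0]

def fit_line(x, y):
    """Least-squares intercept, slope, and standard error of the slope."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = ybar - slope * xbar
    sse = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    mse = sse / (n - 2)
    return intercept, slope, math.sqrt(mse / sxx)

# Fit log(Y) = log(b0) + b1 * log(X) using natural logs.
log_x = [math.log(v) for v in dbh]
log_y = [math.log(v) for v in weight]
log_b0, b1, se_b1 = fit_line(log_x, log_y)

# t statistic for H0: b1 = 3, to be compared with a t table on n - 2 d.f.
t_stat = (b1 - 3.0) / se_b1

# Detransform the intercept to get the multiplicative constant b0.
b0 = math.exp(log_b0)

print(f"b0 = {b0:.4f}, b1 = {b1:.4f}, se(b1) = {se_b1:.4f}, t = {t_stat:.3f}")
```

Note that the hypothesized value 3 is subtracted from the slope before dividing by its standard error; the default test printed by most regression software is against 0 instead.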
The residual plot for the log-log model appears to show no curvature, no nonhomogeneous
variance, no obvious outliers, and no significant departure from the normal distribution.
In short, it is much improved, and probably fits better than the linear model and it is
interpretable. Geometrically, the model for a tree trunk should approximate a cylinder or cone,
depending on the taper in the bole.
\[ \text{Weight} = C\,\pi\,(\text{specific gravity})\,(D/2)^{2}\,H \]
where D is the diameter, H is the height, and C = 1 for a cylinder and 1/3 for a cone.
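One way to connect this geometry to the expected power of 3 is sketched below. It rests on an assumption that the notes do not state: that height is roughly proportional to diameter, H = kD, where k is a hypothetical constant introduced only for this illustration.

```latex
\[
\text{Weight}
  = C\,\pi\,(\text{specific gravity})\left(\frac{D}{2}\right)^{2} H
  = C\,\pi\,(\text{specific gravity})\,\frac{D^{2}}{4}\,(kD)
  = \underbrace{\frac{C\,\pi\,k\,(\text{specific gravity})}{4}}_{b_0}\; D^{3}
\]
```

Under that assumption the constants collect into b0 and the diameter carries the power 3, which is the form of the model being fitted.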
Curvilinear Regression Notes and Summary
For transformed models,
The usual regression assumptions must be met for the transformed model, not the raw data
(homogeneity, normality, etc.).
Estimates, hypothesis tests and confidence intervals would be calculated for the transformed
model. The estimates and limits can then be detransformed.
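As an illustration of detransforming, consider the log-log model with natural logs. The symbols a, L and U are introduced here only for the example: a is the fitted intercept on the log scale, and (L, U) are confidence limits for it.

```latex
\[
\ln \hat{Y}_i = a + b_1 \ln X_i
\quad\Longrightarrow\quad
\hat{Y}_i = e^{a} X_i^{b_1}, \qquad b_0 = e^{a}
\]
\[
P(L \le \ln b_0 \le U) = 1-\alpha
\quad\Longrightarrow\quad
P(e^{L} \le b_0 \le e^{U}) = 1-\alpha
\]
```

The detransformed limits are not symmetric about the detransformed estimate, because the exponential function is not linear.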
A wide range of biometrics situations call for established curvilinear models. These
include exponential growth, mortality, morphometric models, instrument standardization,
some other growth models (power models and quadratics have been used), and recruitment models.
Check the literature in your field to see what models are used.

Statistics quote: "In earlier times, they had no statistics, and so they had to fall back on lies". Stephen Leacock

Matrix Algebra (see Appendix 4)
We will not be doing our regressions with matrix algebra, except that the computer does employ
matrices. In fact, there is really no other way to do the basic calculations.
You will be responsible for knowing about matrices only to the extent that PROC REG or PROC GLM
produces information. These are primarily the initial and final matrices.
So, what is a matrix?
A matrix is a rectangular arrangement of numbers. The matrix is usually denoted by a capital letter.
\[
A = \begin{bmatrix} 1 & 3 \\ 7 & 9 \end{bmatrix}
\qquad
D = \begin{bmatrix} 4 & 2 & 4 \\ 1 & 6 & 0 \\ 3 & 0 & 5 \\ 2 & 3 & 0 \end{bmatrix}
\]
The dimensions of a matrix are given by the number of rows and columns in the matrix (i.e. the
dimensions are r by c). For the matrices above,
A is 2 by 2
D is 4 by 3
For a simple linear regression the matrices of initial interest would be the data matrices, a Y matrix of
values of the dependent variable and an X matrix of values of the independent variable.
The X matrix also has a column of ones added to fit the intercept.
\[
Y = \begin{bmatrix} Y_1 \\ Y_2 \\ Y_3 \\ Y_4 \\ Y_5 \\ Y_6 \\ Y_7 \end{bmatrix}
\qquad
X = \begin{bmatrix} 1 & X_1 \\ 1 & X_2 \\ 1 & X_3 \\ 1 & X_4 \\ 1 & X_5 \\ 1 & X_6 \\ 1 & X_7 \end{bmatrix}
\]
As with our algebraic calculations we need some intermediate values: sums, sums of squares and
crossproducts. These are obtained by the following calculations.
First, a transpose matrix for both X and Y. This is simply the matrix turned on its side so the rows of
the original matrix become the columns of the transpose.
These are denoted X' and Y'.
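\[
X' = \begin{bmatrix} 1 & 1 & 1 & 1 & 1 & 1 & 1 \\ X_1 & X_2 & X_3 & X_4 & X_5 & X_6 & X_7 \end{bmatrix}
\qquad
Y' = \begin{bmatrix} Y_1 & Y_2 & Y_3 & Y_4 & Y_5 & Y_6 & Y_7 \end{bmatrix}
\]
We now calculate 3 matrices, X'X, Y'Y and X'Y. This requires matrix multiplication.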
The results of these 3 calculations are:
\[
X'X =
\begin{bmatrix} 1 & 1 & \cdots & 1 \\ X_1 & X_2 & \cdots & X_7 \end{bmatrix}
\begin{bmatrix} 1 & X_1 \\ 1 & X_2 \\ \vdots & \vdots \\ 1 & X_7 \end{bmatrix}
=
\begin{bmatrix} n & \sum_{i=1}^{n} X_i \\ \sum_{i=1}^{n} X_i & \sum_{i=1}^{n} X_i^2 \end{bmatrix}
\]
\[
Y'Y =
\begin{bmatrix} Y_1 & Y_2 & \cdots & Y_7 \end{bmatrix}
\begin{bmatrix} Y_1 \\ Y_2 \\ \vdots \\ Y_7 \end{bmatrix}
= \sum_{i=1}^{n} Y_i^2
\]
\[
X'Y =
\begin{bmatrix} 1 & 1 & \cdots & 1 \\ X_1 & X_2 & \cdots & X_7 \end{bmatrix}
\begin{bmatrix} Y_1 \\ Y_2 \\ \vdots \\ Y_7 \end{bmatrix}
=
\begin{bmatrix} \sum_{i=1}^{n} Y_i \\ \sum_{i=1}^{n} X_i Y_i \end{bmatrix}
\]
Notice that the contents of these 3 matrices are the same as the values we used for the algebraic
solution: $n$, $\sum X_i$, $\sum X_i^2$, $\sum Y_i$, $\sum Y_i^2$ and $\sum X_i Y_i$.

Normal equations: when the equations needed to solve a simple linear regression are derived, the
result is two equations with two unknowns that must be resolved. These are called the normal
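equations. The normal equations are:
\[ b_0\,n + b_1 \sum X_i = \sum Y_i \]
\[ b_0 \sum X_i + b_1 \sum X_i^2 = \sum X_i Y_i \]
If you solve these algebraically, you get the two equations we use to solve for b0 and b1.
When expressed as matrices this factors out to
\[
\begin{bmatrix} n & \sum_{i=1}^{n} X_i \\ \sum_{i=1}^{n} X_i & \sum_{i=1}^{n} X_i^2 \end{bmatrix}
\begin{bmatrix} b_0 \\ b_1 \end{bmatrix}
=
\begin{bmatrix} \sum_{i=1}^{n} Y_i \\ \sum_{i=1}^{n} X_i Y_i \end{bmatrix}
\]
In simple matrix notation, X'X times the B matrix (vector) equals X'Y: (X'X)B=X'Y.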
As with the algebraic equations we need to solve for B (i.e. b0 and b1). If we do this with
algebra, we get the usual equations.
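The link between the matrix products and the algebraic sums can be checked numerically. The following is a minimal sketch in plain Python, using seven hypothetical observations (not course data); `transpose` and `matmul` are helper names introduced here. It builds X'X, X'Y and Y'Y by matrix multiplication, then solves the two normal equations algebraically.

```python
# Hypothetical data, seven observations (illustration only).
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8]

def transpose(m):
    """Rows of the original matrix become the columns of the result."""
    return [list(col) for col in zip(*m)]

def matmul(a, b):
    """Ordinary matrix product of two lists-of-rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

X = [[1.0, xi] for xi in x]   # column of ones, then the X values
Y = [[yi] for yi in y]        # column vector

XtX = matmul(transpose(X), X)   # [[n, sum X], [sum X, sum X^2]]
XtY = matmul(transpose(X), Y)   # [[sum Y], [sum XY]]
YtY = matmul(transpose(Y), Y)   # [[sum Y^2]]

# The same quantities computed as plain sums.
n = len(x)
sum_x = sum(x)
sum_x2 = sum(v * v for v in x)
sum_y = sum(y)
sum_xy = sum(a * b for a, b in zip(x, y))

# Solve the two normal equations for b0 and b1 (Cramer's rule).
det = n * sum_x2 - sum_x ** 2
b0 = (sum_y * sum_x2 - sum_x * sum_xy) / det
b1 = (n * sum_xy - sum_x * sum_y) / det

print(XtX, XtY, YtY)
print(b0, b1)
```

Each entry of the three matrix products matches one of the sums, so the matrix route and the algebraic route use exactly the same intermediate values.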
Solving the matrix equations we get
\[ B = (X'X)^{-1}X'Y \]
This equation is the matrix algebra solution for a simple linear regression. Note that there is no
such thing as matrix "division". The solution requires multiplying by the inverse matrix.
As with the algebraic values, if we multiply the B values (B matrix) by the X values we get the
predicted values.
\[ XB = X(X'X)^{-1}X'Y = \hat{Y} \]
What do we need to know about these matrix calculations?
We need to know that the solution to the problem using matrix algebra involves the same
values as for the simple linear regression.
We need to know that $(X'X)^{-1}$ is a key component of this solution.
We need to know that the predicted values require the matrix segment $X(X'X)^{-1}X'$ times the Y
vector, and that the MAIN DIAGONAL of this matrix is of particular interest.
Why? We can get the matrices from SAS, but we want to understand what we have. $(X'X)^{-1}$ is
a key component, not only of the solution for the regression coefficients, but also for the
variance-covariance matrix. The main diagonal of the $X(X'X)^{-1}X'$ matrix is a diagnostic that we will
use (hat diag).
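A small numerical sketch may make these pieces concrete. It uses seven hypothetical observations (not course data) and helper functions written for this illustration; `inverse_2x2` only handles the 2 by 2 case of a simple linear regression.

```python
# Hypothetical data, seven observations (illustration only).
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8]

def transpose(m):
    return [list(col) for col in zip(*m)]

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def inverse_2x2(m):
    """Inverse of a 2 by 2 matrix from its determinant."""
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

X = [[1.0, xi] for xi in x]   # column of ones, then the X values
Y = [[yi] for yi in y]
Xt = transpose(X)
XtX_inv = inverse_2x2(matmul(Xt, X))

B = matmul(XtX_inv, matmul(Xt, Y))         # B = (X'X)^-1 X'Y, holds b0 and b1
H = matmul(matmul(X, XtX_inv), Xt)         # X (X'X)^-1 X', an n by n matrix
y_hat = [row[0] for row in matmul(H, Y)]   # predicted values
hat_diag = [H[i][i] for i in range(len(x))]

print(B)
print(y_hat)
print(hat_diag)
```

In this sketch the predicted values from the n by n matrix agree with b0 + b1*Xi, and the main diagonal values sum to 2, the number of parameters fitted.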
But the most important reason for using matrices is that the solution for simple linear and
multiple regression is the same. Basically, matrix algebra is the ONLY way to solve multiple
regressions.
So, what do we get from SAS? If the options XPX and I are placed on the model statement, we can get
the X'X matrix and the $(X'X)^{-1}$ matrix.
For the simple linear regression that we saw for the tree weights and diameters, these options produce
the following output.
Model Crossproducts X'X X'Y Y'Y
X'X         INTERCEP        DBH             WEIGHT
INTERCEP    47              289.2           17359
DBH         289.2           1981.98         142968.3
WEIGHT      17359           142968.3        13537551

X'X Inverse, Parameter Estimates, and SSE
            INTERCEP        DBH             WEIGHT
INTERCEP    0.2082694963    -0.030389579    -729.3963003
DBH         -0.030389579    0.004938832     178.56371409
WEIGHT      -729.3963003    178.56371409    670190.7322

The first two rows and columns of numbers contain the X'X matrix, which has the values for
$n$, $\sum_{i=1}^{n} X_i$ and $\sum_{i=1}^{n} X_i^2$.

X'X         INTERCEP        DBH
INTERCEP    47              289.2
DBH         289.2           1981.98

The last column has X'Y (values for $\sum_{i=1}^{n} Y_i$ and $\sum_{i=1}^{n} X_i Y_i$) and the last value is Y'Y ($\sum_{i=1}^{n} Y_i^2$).

Model Crossproducts X'X X'Y Y'Y
X'X         WEIGHT
INTERCEP    17359
DBH         142968.3
WEIGHT      13537551

In the X'X inverse matrix section, the first two rows and columns of numbers contain the $(X'X)^{-1}$
matrix and the value in the third row and third column is the SSE. The other values are b0 and b1.
            INTERCEP        DBH             WEIGHT
INTERCEP    0.2082694963    -0.030389579    -729.3963003
DBH         -0.030389579    0.004938832     178.56371409
WEIGHT      -729.3963003    178.56371409    670190.7322

You will be responsible only for knowing where the 6 intermediate values are for simple linear
regression, and where to find the $(X'X)^{-1}$ matrix.
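The relationship between the two parts of the SAS output can be checked by hand. The sketch below takes the six intermediate values from the crossproducts matrix reported above for the tree example and recomputes the inverse, the parameter estimates and the SSE in plain Python. The variable names are chosen for this illustration, and because the printed crossproducts are rounded, the results should agree with the SAS values only approximately.

```python
# The six intermediate values, read from the Model Crossproducts output.
n = 47
sum_x = 289.2          # sum of DBH
sum_x2 = 1981.98       # sum of DBH squared
sum_y = 17359.0        # sum of WEIGHT
sum_xy = 142968.3      # sum of DBH * WEIGHT
sum_y2 = 13537551.0    # sum of WEIGHT squared

# Inverse of the 2 by 2 X'X matrix [[n, sum_x], [sum_x, sum_x2]].
det = n * sum_x2 - sum_x ** 2
inv = [[sum_x2 / det, -sum_x / det],
       [-sum_x / det, n / det]]

# B = (X'X)^-1 X'Y, with X'Y = [sum_y, sum_xy].
b0 = inv[0][0] * sum_y + inv[0][1] * sum_xy
b1 = inv[1][0] * sum_y + inv[1][1] * sum_xy

# SSE = Y'Y - B'X'Y.
sse = sum_y2 - (b0 * sum_y + b1 * sum_xy)

print(inv)
print(b0, b1, sse)
```

The off-diagonal elements of the inverse and the intercept come out negative, since the inverse of a 2 by 2 matrix changes the sign of its off-diagonal elements.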
Multiple Regression with matrix algebra
The only difference between simple linear regression and multiple regression is the fact that multiple
regression has several independent variables (Xi variables).
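Therefore the matrix X'X will be larger. For a simple linear regression, X'X is 2×2. For a 3 factor
multiple regression (X1, X2, X3 and an intercept) the X'X matrix will be 4×4.
\[
Y = \begin{bmatrix} Y_1 \\ Y_2 \\ Y_3 \\ Y_4 \\ Y_5 \\ Y_6 \\ Y_7 \end{bmatrix}
\qquad
X = \begin{bmatrix}
1 & X_{11} & X_{21} & X_{31} \\
1 & X_{12} & X_{22} & X_{32} \\
1 & X_{13} & X_{23} & X_{33} \\
1 & X_{14} & X_{24} & X_{34} \\
1 & X_{15} & X_{25} & X_{35} \\
1 & X_{16} & X_{26} & X_{36} \\
1 & X_{17} & X_{27} & X_{37}
\end{bmatrix}
\]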
This note was uploaded on 12/29/2011 for the course EXST 7015 taught by Professor Wang,j during the Fall '08 term at LSU.