Statistical Methods I (EXST 7005)
Simple Linear Regression
Simple regression applications are used to fit a model describing a linear relationship between two
variables. The aspects of least squares regression and correlation were developed by Sir
Francis Galton in the late 1800s.
The application can be used to test for a statistically significant correlation between the variables.
Finding a relationship does not prove a “cause and effect” relationship, but the model can be
used to quantify a relationship where one is known to exist. The model provides a measure of
the rate of change of one variable relative to another variable.
There is a potential change in the value of variable Y as the value of variable X changes.
Variable values will always be paired, one termed an independent variable (often referred to
as the X variable) and the other a dependent variable (termed a Y variable). For each value of
X there is assumed to be a normally distributed population of values for the variable Y.
[Figure: Y against X]
The linear model which describes the relationship between two variables is given as
$$Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i$$
The “Y” variable is called the dependent variable or response variable (vertical axis).
μy.x = β0 + β1Xi is the population equation for a straight line. No error is needed in this
equation because it describes the line itself. The term μy.x is estimated at each
value of Xi with Ŷi.
μy.x = the true population mean of Y at each value of X
The “X” variable is called the independent variable or predictor variable (horizontal axis).
β0 = the true value of the intercept (the value of Y when X = 0)
β1 = the true value of the slope, the amount of change in Y for each unit change in X (i.e.
if X changes by 1 unit, Y changes by β1 units).
The two population parameters to be estimated, β0 and β1, are also referred to as the
regression coefficients.
All variability in the model is assumed to be due to Yi, so variance is measured vertically.
The variability is assumed to be normally distributed at each value of Xi.
The Xi variable is assumed to have no variance since all variability is in Yi (this is a new
assumption).
The values β0 and β1 (b0 and b1 for a sample) are called the regression coefficients.
The β0 value is the value of Y at the point where the line crosses the Y axis. This value is
called the intercept. If this value is zero the line crosses at the origin of the X and Y
axes, and the linear equation reduces from “Yi=b0+b1Xi” to “Yi=b1Xi” and is said to
have “no intercept”, even though the regression line does cross the Y axis. The units
on b0 are the same units as for Yi.
The β1 value is called the slope. It determines the incline or angle of the regression line. If
the slope is 0, the line is horizontal. At this point the linear model reduces to “Yi=b0”,
and the regression is said to have “no slope”. The slope gives the change in Y per unit
of X. The units on the slope are then “Y units per X unit”.
The population equation for the line describes a perfect line with no variation. In practice there
is always variation about the line. We include an additional term to represent this variation.
$$Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i \quad \text{for a population}$$
$$Y_i = b_0 + b_1 X_i + e_i \quad \text{for a sample}$$
When we put this term in the model, we are describing individual points as their position
on the line plus or minus some deviation.
The Sum of Squares of deviations from the line will form the basis of a variance for the
regression line.
[Figure: Y against X]
When we leave the ei off the sample model we are describing a point on the regression line,
predicted from the sample estimates. To indicate this we put a “hat” on the Yi value,
$$\hat{Y}_i = b_0 + b_1 X_i$$
Characteristics of a Regression Line
The line will pass through the point (X̄, Ȳ) (and also the point (0, b0)).
The sum of squared deviations (measured vertically) of the points from the regression line
will be a minimum.
Values on the line for any value of Xi can be described by the equation
$$\hat{Y}_i = b_0 + b_1 X_i$$
Common objectives in Regression: there are a number of possible objectives
Determine if there is a relationship between Yi and Xi .
This would be determined by some hypothesis test.
The strength of the relationship is, to some extent, reflected in the correlation or R2 value.
Determine the value of the rate of change of Yi relative to Xi .
This is measured by the slope of the regression line.
This objective would usually be accompanied by a test of the slope against 0 (or some
other value) and/or a confidence interval on the slope.
Establish and employ a predictive equation for Yi from Xi .
This objective would usually be preceded by Objective 1 above to show that a
relationship exists.
The predicted values would usually be given with their confidence interval, or the
regression with its confidence band.
Assumptions in Regression Analysis
Independence
The best guarantee of this assumption is random sampling. This is a difficult assumption
to check.
This assumption is made for all tests we will see in this course.
Normality of the observations at each value of Xi (or the pooled deviations from the
regression line)
This is relatively easy to test if the appropriate values are tested (e.g. residuals in ANOVA
or Regression, not the raw Yi values). This can be tested with the Shapiro-Wilk W statistic
in PROC UNIVARIATE.
This assumption is made for all tests we have seen this semester except the Chi square tests
of Goodness of Fit and Independence.
[Figure: Y against X]
Homogeneity of error (homogeneous variances or homoscedasticity)
This is easy to check for and to test in analysis of variance (S² on mean or tests like
Bartlett’s in ANOVA). In Regression the simplest way to check is by examining the
residual plot.
This assumption is made for ANOVA (for pooled variance) and Regression. Recall that
in 2 sample t-tests the equality of the variances need not be assumed, it can be readily
tested.
Xi measured without error: This must be assumed in ordinary least squares regressions,
since all error is measured in a vertical direction and occurs in Yi .
Assumptions – general assumptions
The Y variable is normally distributed at each value of X
The variance is homogeneous (across X).
Observations are independent of each other and ei independent of the rest of the model.
Special assumption for regression.
Assume that all of the variation is attributable to the dependent variable (Y), and that the
variable X is measured without error.
Note that the deviations are measured vertically, not horizontally or perpendicular to the
line.
Fitting the line
Fitting the line starts with a corrected SSDeviation, this is the SSDeviation of the
observations from a horizontal line through the mean.
The line will pass through the point ⎯X, ⎯Y.
The fitted line is pivoted on this point
until it has a minimum SSDeviations.
[Figure: Y against X]
How do we know the SSDeviations are a minimum? Actually, we solve the equation
for ei, and use calculus to determine the solution that has a minimum of the sum of
squared deviations.
$$Y_i = b_0 + b_1 X_i + e_i$$
$$e_i = Y_i - (b_0 + b_1 X_i) = Y_i - \hat{Y}_i$$
$$\sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} [Y_i - (b_0 + b_1 X_i)]^2 = \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2$$
The line has some desirable properties
E(b0) = β0
E(b1) = β1
E(Ŷ|X) = μY.X
Therefore, the parameter estimates and predicted values are unbiased estimates.
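The unbiasedness property can be illustrated with a small simulation. This is only a sketch: the population values (β0 = 5, β1 = 2, σ = 3), the X values and the helper name `fit_line` are assumptions made for the illustration, and the fit uses the least squares formulas derived later in these notes. Averaged over many samples, the estimates should settle close to the population values.

```python
import random

random.seed(7005)  # fixed seed so repeated runs give the same result

# Hypothetical population values, chosen only for this illustration.
BETA0, BETA1, SIGMA = 5.0, 2.0, 3.0
X_VALUES = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]  # X is fixed and measured without error


def fit_line(x, y):
    """Return the least squares estimates (b0, b1)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = s_xy / s_xx
    return y_bar - b1 * x_bar, b1


n_samples = 5000
b0_total = 0.0
b1_total = 0.0
for _ in range(n_samples):
    # Each Y is its population mean on the line plus a normal deviation.
    y_values = [BETA0 + BETA1 * x + random.gauss(0.0, SIGMA) for x in X_VALUES]
    b0, b1 = fit_line(X_VALUES, y_values)
    b0_total += b0
    b1_total += b1

mean_b0 = b0_total / n_samples
mean_b1 = b1_total / n_samples
print(f"mean of b0 estimates = {mean_b0:.3f} (beta0 = {BETA0})")
print(f"mean of b1 estimates = {mean_b1:.3f} (beta1 = {BETA1})")
```

Any single sample gives estimates that miss the population values; it is the average over repeated samples that matches them.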
Derivation of the formulas
You do not need to learn this derivation for this class! However you should be aware of the
process and its objectives.
Any observation from a sample can be written as Yi = b0 + b1Xi + ei,
where ei = a deviation of the observed point from the regression line
The idea of regression is to minimize the deviation of the observations from the regression
line, this is called a Least Squares Fit. The simple sum of the deviations is zero,
Σei = 0 , so minimizing will require a square or an absolute value to remove the sign.
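A short numerical check of these two points, as a sketch only. It uses the Age and Parasites values from the numerical example later in these notes and the least squares formulas derived below; the helper names `fit_line` and `sum_sq_dev` are hypothetical.

```python
# Age (X) and parasite count (Y) from the numerical example in these notes.
age = [1, 2, 3, 3, 3, 4, 4, 5, 6, 6, 6, 7, 7, 8, 9, 9]
parasites = [3, 7, 8, 12, 10, 15, 14, 16, 17, 15, 16, 19, 21, 18, 17, 20]


def fit_line(x, y):
    """Least squares intercept and slope (hypothetical helper)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = s_xy / s_xx
    return y_bar - b1 * x_bar, b1


def sum_sq_dev(x, y, b0, b1):
    """Sum of squared vertical deviations from the line b0 + b1*x."""
    return sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))


b0, b1 = fit_line(age, parasites)

# The plain sum of deviations is zero (up to rounding), so it cannot be minimized.
sum_resid = sum(yi - (b0 + b1 * xi) for xi, yi in zip(age, parasites))

# The sum of squared deviations gets larger when either estimate is nudged.
sse_fit = sum_sq_dev(age, parasites, b0, b1)
sse_nudged_slope = sum_sq_dev(age, parasites, b0, b1 + 0.1)
sse_nudged_intercept = sum_sq_dev(age, parasites, b0 + 0.5, b1)

print(f"sum of deviations = {sum_resid:.10f}")
print(f"SS at fit = {sse_fit:.4f}")
print(f"SS with slope nudged = {sse_nudged_slope:.4f}")
print(f"SS with intercept nudged = {sse_nudged_intercept:.4f}")
```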
The sum of the squared deviations is,
$$\sum e_i^2 = \sum (Y_i - \hat{Y}_i)^2 = \sum (Y_i - b_0 - b_1 X_i)^2$$
The objective is to select b0 and b1 such that Σei² is a minimum, by using some
techniques from calculus. We have previously defined the uncorrected sum of
squares and corrected sum of squares of a variable Yi.
The corrected sum of squares of Y
The uncorrected SS is $\sum Y_i^2$
The correction factor is $(\sum Y_i)^2 / n$
The corrected SS is
$$CSS = S_{YY} = \sum (Y_i - \bar{Y})^2 = \sum Y_i^2 - \frac{(\sum Y_i)^2}{n}$$
We will call this corrected sum of squares SYY and the correction factor CYY
The corrected sum of squares of X
We could define the exact same series of calculations for Xi, and call it SXX
The corrected cross products of Y and X
We need a cross product for regression, and a corrected cross product. The cross product
is XiYi.
The uncorrected sum of cross products is $\sum Y_i X_i$
The correction factor for the cross products is
$$C_{XY} = \frac{(\sum Y_i)(\sum X_i)}{n}$$
The corrected cross product is
$$CCP = S_{XY} = \sum (Y_i - \bar{Y})(X_i - \bar{X}) = \sum Y_i X_i - \frac{(\sum Y_i)(\sum X_i)}{n}$$
The formulas for calculating the slope and intercept can be derived as follows
Take the partial derivative with respect to each of the parameter estimates, b0 and b1.
For b0:
$$\frac{\partial \left( \sum_{i=1}^{n} e_i^2 \right)}{\partial b_0} = 2 \sum_{i=1}^{n} (Y_i - b_0 - b_1 X_i)(-1)$$
which is set equal to 0 and solved for b0.
$$-\sum Y_i + n b_0 + b_1 \sum X_i = 0$$
(this is the first “normal equation”)
Likewise, for b1 we obtain the partial derivative, set it equal to 0 and solve for b1.
$$\frac{\partial \left( \sum_{i=1}^{n} e_i^2 \right)}{\partial b_1} = 2 \sum_{i=1}^{n} (Y_i - b_0 - b_1 X_i)(-X_i)$$
$$-\sum (Y_i X_i - b_0 X_i - b_1 X_i^2) = -\sum Y_i X_i + b_0 \sum X_i + b_1 \sum X_i^2 = 0$$
(second “normal equation”)
The normal equations can be written as,
$$b_0 n + b_1 \sum X_i = \sum Y_i$$
$$b_0 \sum X_i + b_1 \sum X_i^2 = \sum Y_i X_i$$
At this point we have two equations and two unknowns so we can solve for the
unknown regression coefficient values b0 and b1.
For b0 the solution is:
$$n b_0 = \sum Y_i - b_1 \sum X_i \quad \text{and} \quad b_0 = \frac{\sum Y_i}{n} - b_1 \frac{\sum X_i}{n} = \bar{Y} - b_1 \bar{X}$$
Note that estimating β0 requires a prior estimate of b1 and the means of the variables X
and Y.
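For b1, given that
$$b_0 = \frac{\sum Y_i}{n} - b_1 \frac{\sum X_i}{n} \quad \text{and} \quad \sum Y_i X_i = b_0 \sum X_i + b_1 \sum X_i^2$$
then
$$\sum Y_i X_i = \left( \frac{\sum Y_i}{n} - b_1 \frac{\sum X_i}{n} \right) \sum X_i + b_1 \sum X_i^2 = \frac{\sum Y_i \sum X_i}{n} - b_1 \frac{(\sum X_i)^2}{n} + b_1 \sum X_i^2$$
$$\sum Y_i X_i - \frac{\sum Y_i \sum X_i}{n} = b_1 \sum X_i^2 - b_1 \frac{(\sum X_i)^2}{n} = b_1 \left( \sum X_i^2 - \frac{(\sum X_i)^2}{n} \right)$$
$$b_1 = \frac{\sum Y_i X_i - \dfrac{\sum Y_i \sum X_i}{n}}{\sum X_i^2 - \dfrac{(\sum X_i)^2}{n}} = \frac{S_{YX}}{S_{XX}}$$
so b1 is the corrected cross products over the corrected SS of X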
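The solution of the normal equations can be sketched in a few lines. The function name `slope_intercept_from_sums` is hypothetical; the sums passed in are those of the numerical example later in these notes (n = 16, ΣX = 83, ΣY = 228, ΣX² = 521, ΣXY = 1348). The last two lines substitute the estimates back into the two normal equations as a check.

```python
def slope_intercept_from_sums(n, sum_x, sum_y, sum_x2, sum_xy):
    """Solve the two normal equations using only the intermediate sums."""
    s_xx = sum_x2 - sum_x ** 2 / n      # corrected SS of X
    s_xy = sum_xy - sum_x * sum_y / n   # corrected cross products
    b1 = s_xy / s_xx
    b0 = sum_y / n - b1 * sum_x / n
    return b0, b1


# Intermediate sums from the numerical example in these notes.
n, sum_x, sum_y, sum_x2, sum_xy = 16, 83, 228, 521, 1348
b0, b1 = slope_intercept_from_sums(n, sum_x, sum_y, sum_x2, sum_xy)

# Both normal equations should balance (differences of zero up to rounding).
normal_eq_1 = b0 * n + b1 * sum_x - sum_y
normal_eq_2 = b0 * sum_x + b1 * sum_x2 - sum_xy

print(f"b1 = {b1:.9f}, b0 = {b0:.9f}")
print(f"normal equation differences: {normal_eq_1:.2e}, {normal_eq_2:.2e}")
```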
The intermediate statistics needed to solve all elements of a SLR are
ΣYi, ΣXi, ΣYi², ΣXi², ΣYiXi and n. We have not seen ΣYi² used in the calculations yet,
but we will need it later to calculate variance.
Review
We want to fit the best possible line through some observed data points. We define this as the line
that minimizes the sum of the squared vertical distances from the observed values to the fitted line.
The line that achieves this is defined by the equations
$$b_0 = \frac{\sum Y_i}{n} - b_1 \frac{\sum X_i}{n} = \bar{Y} - b_1 \bar{X}$$
$$b_1 = \frac{\sum Y_i X_i - \dfrac{\sum Y_i \sum X_i}{n}}{\sum X_i^2 - \dfrac{(\sum X_i)^2}{n}} = \frac{S_{YX}}{S_{XX}}$$
These calculations provide us with two parameter estimates that we can then use to get the
equation for the fitted line, $\hat{Y}_i = b_0 + b_1 X_i$.
Testing hypotheses about regressions
The total variation about a regression is exactly the same calculation as the total for Analysis of
Variance. SSTotal = SSDeviations from the mean = Uncorrected SSTotal – Correction factor
The simple regression analysis will produce two sources of variation.
SSRegression – the variation explained by the regression
SSError – the remaining, unexplained variation about the regression line.
These sources of variation are expressed in an ANOVA source table.
Source        d.f.
Regression    1      d.f. used to fit slope
Error         n–2    error d.f.
Total         n–1    one d.f. lost in adjusting for (“correcting for”) the mean
Note that one degree of freedom is lost from the total for the “correction for the mean”, which
actually fits the intercept. The single regression d.f. is for fitting the slope.
[Figure: Y against X. The correction fits a flat line through the mean]
[Figure: Y against X. The “regression” actually fits the slope]
The difference between these two models is that one has no slope, or a slope equal to zero (b1 = 0)
and the other has a slope fitted. Testing for a difference between these two cases is the
common hypothesis test of interest in regression and it is expressed as H0: β1 = 0.
The results of a regression are expressed in an ANOVA table. The regression is tested with an F
test, formed by dividing the MSRegression by the MSError.
Source        df     SS             MS             F
Regression    1      SSRegression   MSRegression   MSRegression/MSError
Error         n–2    SSError        MSError
Total         n–1    SSTotal
This is a one tailed F test, as it was with ANOVA, and it has 1 and n–2 d.f. It tests the null
hypothesis H0: β1 = 0 versus the alternative H1: β1 ≠ 0.
The R2 statistic
This is a popular statistic for interpretation. The concept is that we want to know what
proportion of the corrected total sum of squares is explained by the regression line.
Source        d.f.   SS
Regression    1      SSReg
Error         n–2    SSError
Total         n–1    SSTotal
In the process of fitting the regression the SSTotal is divided into two parts, the
sum of squares “explained” by the regression (SSRegression) and the remaining
unexplained variation (SSError). Since these sum to the SSTotal, we can calculate what
fraction of the total was fitted or explained by the regression. This is the
proportion of the total sum of squares explained by the model, and is given by R2 =
SSRegression / SSTotal.
This is often multiplied by 100% and expressed as a percent.
We might state that the regression explains 75% of the total variation.
This is a very popular statistic, but it can be very misleading.
For some studies an R2 value of 25% or 35% can be pretty good. For example, if you are
trying to relate the abundance of an organism to environmental variables. On the other
hand, if you are doing morphometric relationships, like relating a crab’s width to its
length, an R2 value of less than 90% is pretty bad.
A note on regression models applied to transformed variables.
Studies of morphometric relationships, including relationships of lengths to weights, should
be done with logarithmic values of both X and Y. The log(Y) on log(X) model, called a
power model, is a very flexible model used for many purposes.
Many other models involving logs, powers, inverses are possible. These will fit curves of
one shape or another. When using transformed variables in regression, all tests and
confidence intervals are placed on the transformed values. Otherwise, they are used
like any other simple linear regression.
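A minimal sketch of the log(Y) on log(X) power model follows. The lengths, the curve values a = 0.02 and b = 3, and all names here are hypothetical, chosen so that the fit can be checked; the weights lie exactly on the curve Y = aX^b, whereas real measurements would scatter about it. Natural logs are used, but any base gives the same slope.

```python
import math

# Hypothetical lengths; weights follow an exact power curve W = a * L**b.
A_TRUE, B_TRUE = 0.02, 3.0
lengths = [5.0, 8.0, 12.0, 15.0, 20.0, 26.0]
weights = [A_TRUE * length ** B_TRUE for length in lengths]

# Transform both variables, then fit an ordinary simple linear regression:
# log(W) = log(a) + b * log(L)
log_x = [math.log(v) for v in lengths]
log_y = [math.log(v) for v in weights]

n = len(log_x)
x_bar = sum(log_x) / n
y_bar = sum(log_y) / n
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(log_x, log_y))
s_xx = sum((x - x_bar) ** 2 for x in log_x)

b1 = s_xy / s_xx            # slope on the log scale = exponent b
b0 = y_bar - b1 * x_bar     # intercept on the log scale = log(a)
a_hat = math.exp(b0)        # back-transform the intercept to get a

print(f"exponent b = {b1:.6f}, multiplier a = {a_hat:.6f}")
```

The slope and intercept, and any tests or intervals on them, refer to the log scale; only the back-transformed intercept is on the original scale.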
Numerical Example : Some freshwater fish ectoparasites accumulate on the fish as it grows.
Once the parasite is on the fish, it does not leave. The parasite completes its life cycle after
the fish is consumed by a bird and finds its way again into the water. Since the parasite attaches
and does not leave, older fish should accumulate more parasites. We want to test this
hypothesis.
Raw data with squares and crossproducts
Observation   Age      Parasites   Age²      Parasites²   Age*Parasites
1             1        3           1         9            3
2             2        7           4         49           14
3             3        8           9         64           24
4             3        12          9         144          36
5             3        10          9         100          30
6             4        15          16        225          60
7             4        14          16        196          56
8             5        16          25        256          80
9             6        17          36        289          102
10            6        15          36        225          90
11            6        16          36        256          96
12            7        19          49        361          133
13            7        21          49        441          147
14            8        18          64        324          144
15            9        17          81        289          153
16            9        20          81        400          180
Summary data
              Age      Parasites   Age²      Parasites²   Age*Parasites
Sum           83       228         521       3628         1348
Mean          5.1875   14.25       32.5625   226.75       84.25
n             16       16          16        16           16
Intermediate Calculations
Σ X = 83                      Σ Y = 228
Σ X² = 521                    Σ Y² = 3628
Mean of Xi = X̄ = 5.1875       Mean of Yi = Ȳ = 14.25
Σ XY = 1348                   n = 16
Correction factors and Corrected values (Sums of squares and crossproducts)
CF for X     Cxx = 430.5625     Corrected SS X     Sxx = 90.4375
CF for Y     Cyy = 3249         Corrected SS Y     Syy = 379
CF for XY    Cxy = 1182.75      Corrected CP XY    Sxy = 165.25
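The intermediate calculations can be reproduced from the raw data with a short script. This is a sketch; the variable names are illustrative and simply mirror the labels above.

```python
# Raw data from the table above.
age = [1, 2, 3, 3, 3, 4, 4, 5, 6, 6, 6, 7, 7, 8, 9, 9]
parasites = [3, 7, 8, 12, 10, 15, 14, 16, 17, 15, 16, 19, 21, 18, 17, 20]

n = len(age)
sum_x = sum(age)
sum_y = sum(parasites)
sum_x2 = sum(x * x for x in age)                       # uncorrected SS of X
sum_y2 = sum(y * y for y in parasites)                 # uncorrected SS of Y
sum_xy = sum(x * y for x, y in zip(age, parasites))    # uncorrected cross products

# Correction factors.
c_xx = sum_x ** 2 / n
c_yy = sum_y ** 2 / n
c_xy = sum_x * sum_y / n

# Corrected sums of squares and cross products.
s_xx = sum_x2 - c_xx
s_yy = sum_y2 - c_yy
s_xy = sum_xy - c_xy

print(f"Cxx = {c_xx}, Sxx = {s_xx}")
print(f"Cyy = {c_yy}, Syy = {s_yy}")
print(f"Cxy = {c_xy}, Sxy = {s_xy}")
```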
ANOVA Table (values needed):
SSTotal = 379
SSRegression = 165.25² / 90.4375 = 301.9495508
SSError = 379 – 301.9495508 = 77.05044921
Source        df     SS            MS            F
Regression    1      301.9495508   301.9495508   54.8639723
Error         14     77.05044921   5.503603515
Total         15     379
Tabular F0.05; 1, 14 = 4.600
Tabular F0.01; 1, 14 = 8.862
Model Parameter Estimates
$$\text{Slope} = b_1 = \frac{\sum_{i=1}^{n} (Y_i - \bar{Y})(X_i - \bar{X})}{\sum_{i=1}^{n} (X_i - \bar{X})^2} = \frac{S_{xy}}{S_{xx}} = \frac{165.25}{90.4375} = 1.827228749$$
Intercept = b0 = Ȳ – b1X̄ = 14.25 – 1.827228749 * 5.1875 = 4.771250864
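The ANOVA values and parameter estimates above can be reproduced from the corrected sums. This sketch starts from the corrected values already given (Sxx, Syy, Sxy, the means and n) and takes the tabular F values from the notes rather than computing them; the variable names are illustrative.

```python
# Corrected sums and means from the intermediate calculations above.
n = 16
s_xx, s_yy, s_xy = 90.4375, 379.0, 165.25
x_bar, y_bar = 5.1875, 14.25

# Partition the corrected total sum of squares.
ss_total = s_yy
ss_regression = s_xy ** 2 / s_xx
ss_error = ss_total - ss_regression

# Mean squares and the F test of H0: beta1 = 0.
df_regression, df_error = 1, n - 2
ms_regression = ss_regression / df_regression
ms_error = ss_error / df_error
f_value = ms_regression / ms_error

# Parameter estimates.
b1 = s_xy / s_xx
b0 = y_bar - b1 * x_bar

# Tabular values quoted in the notes for 1 and 14 d.f.
F_05, F_01 = 4.600, 8.862
print(f"SSRegression = {ss_regression:.7f}, SSError = {ss_error:.8f}")
print(f"MSError = {ms_error:.9f}, F = {f_value:.7f}")
print(f"exceeds tabular F at 0.05: {f_value > F_05}, at 0.01: {f_value > F_01}")
print(f"b1 = {b1:.9f}, b0 = {b0:.9f}")
```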
Regression Equation: Yi = b0 + b1 * Xi + ei = 4.771250864 + 1.827228749 * Xi + ei
Regression Line: Ŷi = b0 + b1 * Xi = 4.771250864 + 1.827228749 * Xi
Standard error of b1:
$$S_{b_1} = \sqrt{\frac{MSE}{\sum_{i=1}^{n} (X_i - \bar{X})^2}} = \sqrt{\frac{MSE}{S_{xx}}} \quad \text{so} \quad S_{b_1} = \sqrt{\frac{5.5036}{90.4375}} = 0.2467$$
Confidence interval on b1, where b1 = 1.827228749 and t(0.05/2, 14df) = 2.145
P(1.827228749 – 2.145*0.246688722 ≤ β1 ≤ 1.827228749 + 2.145*0.246688722) = 0.95
P(1.29808144 ≤ β1 ≤ 2.356376058) = 0.95
Testing b1 against a specified value: e.g. H0: β1 = 5 versus H1: β1 ≠ 5
where b1 = 1.827228749, Sb1 = 0.246688722 and t(0.05/2, 14df) = 2.145
t = (b1 – 5) / Sb1 = (1.827228749 – 5) / 0.246688722 = –12.86144
Since the absolute value of t exceeds 2.145, H0: β1 = 5 is rejected.
Standard error of the regression line (i.e. Ŷi):
$$S_{\hat{\mu}_{Y|X}} = \sqrt{MSE \left( \frac{1}{n} + \frac{(X_i - \bar{X})^2}{\sum_{i=1}^{n} (X_i - \bar{X})^2} \right)}$$
Standard error of the individual points (i.e. Yi): This is a linear combination of Ŷi and ei, so the
variances are the sum of the variance of these two, where the variance of ei is MSE. The
standard error is then
$$S_{\hat{Y}|X} = \sqrt{S_{\hat{\mu}_{Y|X}}^2 + MSE} = \sqrt{MSE \left( \frac{1}{n} + \frac{(X_i - \bar{X})^2}{\sum_{i=1}^{n} (X_i - \bar{X})^2} \right) + MSE} = \sqrt{MSE \left( 1 + \frac{1}{n} + \frac{(X_i - \bar{X})^2}{\sum_{i=1}^{n} (X_i - \bar{X})^2} \right)}$$
Standard error of b0 is the same as the standard error of the regression line where Xi = 0
Square Root of [5.503603515 (0.0625 + 26.91015625/90.4375)] = 1.407693696
Confidence interval on b0, where b0 = 4.771250864 and t(0.05/2, 14df) = 2.145
P(4.771250864 – 2.145*1.407693696 ≤ β0 ≤ 4.771250864+2.145*1.407693696) = 0.95
P(1.751747886 ≤ β0 ≤ 7.790753842) = 0.95
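The standard errors, confidence intervals and the test of β1 = 5 can be checked with a short script. This is a sketch that starts from values already computed in these notes; the critical value 2.145 is the tabular t quoted above and is entered by hand rather than computed. Variable names are illustrative.

```python
import math

# Values from the calculations above.
n, x_bar = 16, 5.1875
s_xx = 90.4375
mse = 5.503603515
b0, b1 = 4.771250864, 1.827228749
T_CRIT = 2.145  # tabular t(0.05/2, 14 d.f.) as quoted in the notes

# Standard errors of the slope and the intercept.
se_b1 = math.sqrt(mse / s_xx)
se_b0 = math.sqrt(mse * (1 / n + (0 - x_bar) ** 2 / s_xx))  # regression line at X = 0

# 95% confidence intervals.
ci_b1 = (b1 - T_CRIT * se_b1, b1 + T_CRIT * se_b1)
ci_b0 = (b0 - T_CRIT * se_b0, b0 + T_CRIT * se_b0)

# t-test of H0: beta1 = 5.
t_vs_5 = (b1 - 5) / se_b1

print(f"Sb1 = {se_b1:.9f}, CI on beta1 = ({ci_b1[0]:.6f}, {ci_b1[1]:.6f})")
print(f"Sb0 = {se_b0:.9f}, CI on beta0 = ({ci_b0[0]:.6f}, {ci_b0[1]:.6f})")
print(f"t for H0: beta1 = 5 is {t_vs_5:.5f}; reject H0: {abs(t_vs_5) > T_CRIT}")
```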
Estimate the standard error of an individual observation for number of parasites for a ten-year-old
fish: Ŷ = b0 + b1Xi = 4.77125 + 1.82723*X = 4.77125 + 1.82723*10 = 23.04354
Square Root of [5.503603515*(1+0.0625+(10 – 5.1875)²/90.4375)] =
Square Root of [5.503603515*(1+0.0625+(23.16015625)/90.4375)] = 2.693881509
Interval on an individual observation at X = 10. Because it uses the standard error of an individual
observation, this is an interval for a single fish (Y|X=10) rather than for the mean μY|X=10.
P(23.04353836 – 2.145*2.693881509 ≤ Y|X=10 ≤ 23.04353836 + 2.145*2.693881509) = 0.95
P(17.26516252 ≤ Y|X=10 ≤ 28.82191419) = 0.95
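The calculation at X = 10 can be sketched as follows, again starting from values given in these notes and entering the tabular t of 2.145 by hand. The script also computes the standard error of the regression line (the mean) at X = 10, which the notes define but do not evaluate, to show that it is smaller than the standard error of an individual observation. Variable names are illustrative.

```python
import math

# Values from the calculations above.
n, x_bar = 16, 5.1875
s_xx = 90.4375
mse = 5.503603515
b0, b1 = 4.771250864, 1.827228749
T_CRIT = 2.145  # tabular t(0.05/2, 14 d.f.) as quoted in the notes

x_new = 10  # age of the fish, in years
y_hat = b0 + b1 * x_new

# Shared term 1/n + (X - Xbar)^2 / Sxx from both standard error formulas.
distance_term = 1 / n + (x_new - x_bar) ** 2 / s_xx

se_line = math.sqrt(mse * distance_term)              # regression line (mean) at X = 10
se_individual = math.sqrt(mse * (1 + distance_term))  # individual observation at X = 10

interval_individual = (y_hat - T_CRIT * se_individual,
                       y_hat + T_CRIT * se_individual)

print(f"predicted parasites at age 10 = {y_hat:.5f}")
print(f"SE of the line = {se_line:.6f}, SE of an individual = {se_individual:.9f}")
print(f"interval = ({interval_individual[0]:.6f}, {interval_individual[1]:.6f})")
```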
Calculate the coefficient of Determination and correlation
R2 = 0.796700662 or 79.67006617 %
r = 0.892580899
See SAS output
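These two values follow directly from the sums of squares already computed. A brief sketch with illustrative names; the second calculation of r uses the standard identity r = Sxy / sqrt(Sxx * Syy), which is not written out in these notes.

```python
import math

ss_regression = 301.9495508
ss_total = 379.0
s_xx, s_yy, s_xy = 90.4375, 379.0, 165.25

r_squared = ss_regression / ss_total

# In simple linear regression r has the sign of the slope (positive here).
r = math.sqrt(r_squared)

# The same correlation computed directly from the corrected sums.
r_direct = s_xy / math.sqrt(s_xx * s_yy)

print(f"R2 = {r_squared:.9f} ({100 * r_squared:.4f} %)")
print(f"r = {r:.9f}, r from corrected sums = {r_direct:.9f}")
```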
Overview of results and findings from the SAS program
I. Objective 1 : Determine if older fish have more parasites. (SAS can provide this)
A. This determination would be made by examining the slope. The slope is the mean change in
parasite number for each unit increase in age. The hypothesis tested is H0: β1=0 versus H1:
β1≠0
1. If this number does not differ from zero, then there is no apparent relationship between age
and number of parasites. If it differs from zero and is positive, then parasites increase with
age. If it differs from zero and is negative, then parasites decrease with age.
2. For a simple linear regression we can examine the F test of the model, the F test of the
Type I, the F test of the Type II, the F test of the Type III or the t-test of the slope. For a
simple linear regression these all provide the same result. For multiple regressions (more
than 1 independent variable) we would examine the Type II or Type III F test (these are the
same in regression) or the t-test of regression coefficients. [Alternatively, a confidence
interval can be placed on the coefficient, and if the interval does not include 0, the estimate
of the coefficient is significantly different from zero].
B. In this case, the F tests mentioned had values of 54.86, and the probability of this F value with
1 and 14 d.f. is less than 0.0001. Likewise, the t test of the slope was 7.41, which was also
significant at the same level. Note that t² = F, these are the same test. We can therefore
conclude that the slope does differ from zero. Since it is positive we further conclude that
older fish have more parasites.
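The equivalence of the two tests can be checked from the hand calculations. A brief sketch with illustrative names, using values from the numerical example above:

```python
import math

# Values from the hand calculations in these notes.
b1 = 1.827228749
s_xx = 90.4375
mse = 5.503603515
ms_regression = 301.9495508

# t-test of H0: beta1 = 0 and the F test of the regression.
t_value = b1 / math.sqrt(mse / s_xx)
f_value = ms_regression / mse

print(f"t = {t_value:.4f}, t squared = {t_value ** 2:.4f}, F = {f_value:.4f}")
```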
II. Objective 2 : Estimate the rate of accumulation of parasites. (SAS can provide this)
A. The slope for this example is 1.827228749 parasites per year (note the units). It is positive, so
we expect parasite numbers to increase by 1.8 per year.
B. The standard error for the slope was 0.24668872. This value is provided by SAS and can be
used for hypothesis testing or confidence intervals. SAS provides a t-test of H0: β1=0, but
hypotheses about values other than zero must be requested (SAS TEST statement) or
calculated by hand. The confidence interval in this case was calculated
previously and is partly repeated below.
P[b1 – tα/2,14 d.f. Sb1 ≤ β1 ≤ b1 + tα/2,14 d.f. Sb1] = 0.95
P[1.827228749 – 2.144789(0.246689) ≤ β1 ≤ 1.827228749 + 2.144789(0.246689)] = 0.95
P[1.298134 ≤ β1 ≤ 2.356324] = 0.95
Note that this confidence interval does not include zero, so the slope differs significantly from zero.
III. Estimate the intercept with confidence interval.
A. The intercept may also require a confidence interval. This was calculated previously and was;
P(1.751747886 ≤ β0 ≤ 7.790753842) = 0.95
James P. Geaghan Copyright 2010
...