EXST7015: Statistical Techniques II
Simple Linear Regression Intro & Review (Geaghan)

Introduction
Major topics (a comprehensive outline is provided elsewhere):
Regression: SLR, Multiple, Curvilinear & Logistic
Experimental Design: CRD, RBD, LSD, Split-plot & Repeated Measures
Treatment arrangements: Single factor, Factorial, Nested

Course Objectives

The objectives of the introductory course were to develop an understanding of elementary statistics and the ability to understand and apply basic statistical procedures. We will develop
those concepts further, applying the terminology and notation from the basic methods
courses to advanced techniques for making statistical inferences.
We will cover the major methodologies of parametric statistics used for prediction and hypothesis testing (primarily regression and experimental design).
Our emphasis will be on RECOGNIZING analytical problems and on being able to do the
statistical analysis with SAS software. We will see SAS programs and output for virtually
all analyses covered this semester.

Simple Linear Regression (review?)

The objective: given points plotted on two coordinates, Y and X, find the best line to fit the data.

(Figure: scatter plot of Y, the dependent variable, against X, the independent variable.)
The concept: Data consists of paired observations with a presumed potential for the existence of
some underlying relationship. We wish to determine the nature of the relationship and quantify it if
it exists.
Note that we cannot prove that the relationship exists by using regression (i.e. we cannot
prove cause and effect). Regression can only show if a “correlation” exists, and provide an
equation for the relationship.
Given a data set consisting of paired, quantitative variables, and recognizing that there is variation
in the data set, we will define,
POPULATION MODEL (SLR): Yᵢ = β₀ + β₁Xᵢ + εᵢ
This is the model we will fit. It is the equation describing a straight line for a population, and we want to estimate the parameters in the equation. The population parameters to be estimated, for the underlying model μY.X = β₀ + β₁Xᵢ, are:

μY.X = the true population mean of Y at each value of X
β₀ = the true value of the Y intercept
β₁ = the true value of the slope, the change in Y per unit of X

Terminology

Dependent variable: variable to be predicted
Y = dependent variable (all variation occurs in Y)
Independent variable: predictor or regressor variable
X = independent variable (X is measured without error)
Intercept: value of Y when X = 0, point where the regression line passes through the Y axis.
The units on the intercept are the same as the “Y” units
Slope: the value of the change in Y for each unit increase in X. The units on the slope are “Y”
units per “X” unit
Deviation: distance from an observed point to the regression line, also called a residual.
Least squares regression line: the line that minimizes the squared distances from the line to the
individual observations.

(Figure: scatter plot with the fitted regression line, showing the deviations of points from the line and the intercept.)

The regression line itself represents the mean of Y at each value of X (μY.X).
Regression calculations

All calculations for simple linear regression start with the same values. These are:

ΣXᵢ, ΣXᵢ², ΣYᵢ, ΣYᵢ², ΣXᵢYᵢ   (each sum taken over i = 1 to n)
i 1 i i Calculations for simple linear regression are first adjusted for the mean. These are called “corrected
values”. They are corrected for the MEAN by subtracting a “correction factor”.
As a result, all simple linear regressions are adjusted for the means of X and Y and pass through the point (X̄, Ȳ).
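As a quick numeric sketch of these "corrected values" (using a small made-up data set, not one from the course notes):

```python
# Corrected sums of squares and crossproducts for simple linear regression.
# The data are purely illustrative.
n = 5
X = [1, 2, 3, 4, 5]
Y = [2, 4, 5, 4, 5]

sum_x, sum_y = sum(X), sum(Y)
sum_x2 = sum(x * x for x in X)
sum_y2 = sum(y * y for y in Y)
sum_xy = sum(x * y for x, y in zip(X, Y))

# "Corrected" values: the raw (uncorrected) sums of squares minus a
# correction factor for the mean.
Sxx = sum_x2 - sum_x**2 / n        # 55 - 225/5 = 10.0
Syy = sum_y2 - sum_y**2 / n        # 86 - 400/5 = 6.0
Sxy = sum_xy - sum_x * sum_y / n   # 66 - 300/5 = 6.0
```

The correction factor (e.g., (ΣXᵢ)²/n) is what "adjusts for the mean": it converts distances from zero into distances from X̄ or Ȳ.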
(Figure: regression line passing through the point (X̄, Ȳ).)

The original sums and sums of squares of Y are distances and squared distances from zero. These are referred to as "uncorrected", meaning unadjusted for the mean.

(Figure: deviations of Y measured from zero.)

The "corrected" deviations sum to zero (some negative, some positive) and the sums of squares are squared distances from the mean of Y.

(Figure: deviations of Y measured from the mean, Ȳ.)
Once the means (X̄, Ȳ) and the corrected sums of squares and cross products (Sxx, Syy, Sxy) are obtained, the calculations for the parameter estimates are:

Slope: b₁ = Sxy / Sxx
Intercept: b₀ = Ȳ − b₁X̄

We have fitted the sample equation Yᵢ = b₀ + b₁Xᵢ + eᵢ, which estimates the population parameters of the model Yᵢ = β₀ + β₁Xᵢ + εᵢ.

Variance estimates for regression

After the regression line is fitted, variance calculations are based on the deviations from the regression. From the regression model Yᵢ = b₀ + b₁Xᵢ + eᵢ we derive the formula for the deviations, eᵢ = Yᵢ − b₀ − b₁Xᵢ, or eᵢ = Yᵢ − Ŷᵢ.

(Figure: deviations of the observed Y values from the fitted regression line.)
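The slope, intercept, and residual calculations can be sketched in a few lines of Python (again with a small made-up data set):

```python
# Fit a simple linear regression by hand: b1 = Sxy/Sxx, b0 = Ybar - b1*Xbar.
# Illustrative data only; any small paired data set would do.
X = [1, 2, 3, 4, 5]
Y = [2, 4, 5, 4, 5]
n = len(X)

xbar, ybar = sum(X) / n, sum(Y) / n
Sxx = sum((x - xbar) ** 2 for x in X)
Sxy = sum((x - xbar) * (y - ybar) for x, y in zip(X, Y))

b1 = Sxy / Sxx          # slope, here 6/10 = 0.6
b0 = ybar - b1 * xbar   # intercept, here approximately 2.2

# Deviations (residuals) e_i = Y_i - Yhat_i; they sum to (essentially) zero.
resid = [y - (b0 + b1 * x) for x, y in zip(X, Y)]
```

Because the fitted line passes through (X̄, Ȳ), the residuals always sum to zero (up to floating-point rounding).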
As with other calculations of variance, we calculate a sum of squares (corrected for the mean).
This is simplified by the fact that the deviations, or residuals, already have a mean of zero,
SSResiduals = Σᵢ eᵢ² = SSError

The degrees of freedom (d.f.) for the variance calculation is n−2, since two parameters (β₀ and β₁) are estimated prior to the variance.
The variance estimate is called the MSE (Mean square error). It is the SSError divided by the
d.f.,
MSE = SSE / (n−2)

The variances for the two parameter estimates and the predicted values are all different, but all are based on the MSE, and all have n−2 d.f. (t-tests) or n−2 d.f. for the denominator (F tests).

Variance of the slope: MSE / Sxx
Variance of the intercept: MSE (1/n + X̄² / Sxx)
Variance of a predicted value at Xᵢ: MSE (1/n + (Xᵢ − X̄)² / Sxx)

Any of these variances can be used for a t-test of an estimate against a hypothesized value for
the appropriate parameter (i.e. slope, intercept or predicted value respectively).
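A minimal sketch of the MSE and the three variance formulas, using the same kind of small illustrative data set (not data from the course):

```python
# Variance estimates for the regression, all based on the MSE.
X = [1, 2, 3, 4, 5]
Y = [2, 4, 5, 4, 5]
n = len(X)

xbar, ybar = sum(X) / n, sum(Y) / n
Sxx = sum((x - xbar) ** 2 for x in X)
Sxy = sum((x - xbar) * (y - ybar) for x, y in zip(X, Y))
b1 = Sxy / Sxx
b0 = ybar - b1 * xbar

# Sum of squared deviations from the fitted line, with n-2 d.f.
SSE = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(X, Y))
MSE = SSE / (n - 2)

var_slope = MSE / Sxx
var_intercept = MSE * (1 / n + xbar**2 / Sxx)
# Variance of the predicted mean of Y at a particular Xi, here Xi = 4:
Xi = 4
var_pred = MSE * (1 / n + (Xi - xbar) ** 2 / Sxx)
```

Note how the variance of a predicted value grows as Xᵢ moves away from X̄: predictions are most precise near the center of the data.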
ANOVA table for regression

A common representation of regression results is an ANOVA table. Given the SSError (sum of
squared deviations from the regression), and the initial total sum of squares (SYY), the sum of
squares of Y adjusted for the mean, we can construct an ANOVA table
Simple Linear Regression ANOVA table:

Source       d.f.    Sum of Squares    Mean Square
Regression   1       SSRegression      MSReg
Error        n−2     SSError           MSError
Total        n−1     SYY = SSTotal

F = MSReg / MSError

In the ANOVA table:

The SSRegression and SSError sum to the SSTotal, so given the total (SYY) and one of the
two terms, we can get the other.
The easiest to calculate first is usually the SSRegression since we usually already have the
necessary intermediate values.
SSRegression = Sxy² / Sxx

The SSRegression is a measure of the "improvement" in the fit due to the regression line.
The deviations start at SYY and are reduced to SSError. The difference is the improvement,
and is equal to the SSRegression.
This gives another statistic called R². What portion of the SSTotal (SYY) is accounted for by the regression?

R² = SSRegression / SSTotal
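The ANOVA quantities and R² can be checked numerically; a sketch with made-up data (not from the course notes):

```python
# Build the simple linear regression ANOVA quantities from corrected sums.
X = [1, 2, 3, 4, 5]
Y = [2, 4, 5, 4, 5]
n = len(X)

xbar, ybar = sum(X) / n, sum(Y) / n
Sxx = sum((x - xbar) ** 2 for x in X)
Syy = sum((y - ybar) ** 2 for y in Y)          # SSTotal
Sxy = sum((x - xbar) * (y - ybar) for x, y in zip(X, Y))

SSReg = Sxy**2 / Sxx            # "improvement" due to the regression
SSError = Syy - SSReg           # since SSTotal = SSReg + SSError
MSReg = SSReg / 1               # 1 d.f. for the slope
MSError = SSError / (n - 2)     # n-2 d.f. for the error
F = MSReg / MSError
R2 = SSReg / Syy                # portion of SSTotal explained
```

As the notes say, having SYY and either SSRegression or SSError is enough; the third follows by subtraction.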
The degrees of freedom in the ANOVA table are:
n−1 for the total, one lost for the correction for the mean (which also fits the intercept)
n−2 for the error, since two parameters are estimated to get the regression line
1 d.f. for the regression, which is the d.f. for the slope
The F test is constructed by calculating MSRegression / MSError. This F test has 1 d.f. in the numerator and (n−2) d.f. in the denominator. This is exactly the same test as the t-test of the slope against zero.
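This equivalence (the F statistic equals the squared t statistic for the slope) can be verified numerically; a sketch with made-up data:

```python
import math

# t-test of the slope against zero, and its relation t^2 = F.
# Illustrative data only.
X = [1, 2, 3, 4, 5]
Y = [2, 4, 5, 4, 5]
n = len(X)

xbar, ybar = sum(X) / n, sum(Y) / n
Sxx = sum((x - xbar) ** 2 for x in X)
Syy = sum((y - ybar) ** 2 for y in Y)
Sxy = sum((x - xbar) * (y - ybar) for x, y in zip(X, Y))
b1 = Sxy / Sxx

SSReg = Sxy**2 / Sxx
MSE = (Syy - SSReg) / (n - 2)

t = (b1 - 0) / math.sqrt(MSE / Sxx)   # t with n-2 d.f.
F = SSReg / MSE                       # F with 1 and n-2 d.f.
# t**2 equals F (up to rounding): the same test in two forms.
```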
To test the slope against a hypothesized value (say zero) using the t-test with n−2 d.f., calculate:

t = (b₁ − b₁,hypothesized) / Sb₁ = (b₁ − 0) / sqrt(MSE / Sxx)

Assumptions for the Regression

We will recognize 4 assumptions:
1) Normality – We take the deviations from regression and pool them all together into one
estimate of variance. Some of the tests we use require the assumption of normality, so these
deviations should be normally distributed.
(Figure: a normal distribution of Y values at each value of X along the regression line.)

For each value of X there is a population of values for the variable Y (normally distributed).
2) Homogeneity of variance – When we pool these deviations (variances) we also assume that
the variances are the same at each value of Xi. In some cases this is not true, particularly when
the variance increases as X increases.
3) X is measured without error! Since deviations are measured only vertically, all variance is in
Y, no provisions are made for variance in X.
4) Independence. This enters in several places. First, the observations should be independent of
each other (i.e. the value of ei should be independent of ej, for i ≠ j). Also, in the equation for
the line Yi b0 b1 X i ei we assume that the term ei is independent of the rest of the model. We
will talk more of this when we get to multiple regression.
So the four assumptions are:
Normality
Homogeneity of variance
Independence
X measured without error

These are explicit assumptions, and we will examine or test these assumptions when possible.
There are also some other assumptions that I consider implicit. We will not state these, but in some cases they can be tested. For example:

There is order in the Universe. Otherwise, what are you investigating?

The underlying fundamental relationship that I just fitted a straight line to really is a straight line. Sometimes this one can be examined statistically.

Characteristics of a Regression Line

The line will pass through the point (X̄, Ȳ), and also the point (0, b₀).
The sum of deviations will be zero (Σeᵢ = 0).
The sum of squared deviations (measured vertically), Σeᵢ² = Σ(Yᵢ − b₀ − b₁Xᵢ)², of the points from the regression line will be a minimum.
Values on the line can be described by the equation Ŷᵢ = b₀ + b₁Xᵢ.

The line has some desirable properties (if the assumptions are met):
E(b₀) = β₀
E(b₁) = β₁
E(ŶX) = μY.X

Therefore, the parameter estimates and predicted values are unbiased estimates.

Note that linear regression is considered statistically robust. That is, the tests of hypothesis tend to give good results if the assumptions are not violated to a great extent.

Crossproducts and correlation

Crossproducts are used in a number of related calculations (they can be + or −).

A crossproduct: XᵢYᵢ
Sum of crossproducts: ΣXᵢYᵢ
Corrected sum of crossproducts: Sxy
Covariance: Sxy / (n−1)
Slope: b₁ = Sxy / Sxx
SSRegression: Sxy² / Sxx
Correlation: r = Sxy / sqrt(Sxx Syy)
R² = r² = Sxy² / (Sxx Syy) = SSRegression / SSTotal

Simple Linear Regression Summary

See Simple linear regression notes from EXST7005 for additional information, including the
derivation of the equations for the slope and intercept. You are not responsible for these
derivations.
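As a numeric check of the crossproduct and correlation identities listed above (made-up data, purely illustrative):

```python
import math

# Verify r = Sxy/sqrt(Sxx*Syy) and r^2 = R^2 = SSRegression/SSTotal.
X = [1, 2, 3, 4, 5]
Y = [2, 4, 5, 4, 5]
n = len(X)

xbar, ybar = sum(X) / n, sum(Y) / n
Sxx = sum((x - xbar) ** 2 for x in X)
Syy = sum((y - ybar) ** 2 for y in Y)
Sxy = sum((x - xbar) * (y - ybar) for x, y in zip(X, Y))

cov = Sxy / (n - 1)                 # sample covariance
r = Sxy / math.sqrt(Sxx * Syy)      # correlation coefficient
R2 = (Sxy**2 / Sxx) / Syy           # SSRegression / SSTotal
# For simple linear regression, r squared equals R squared.
```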
Know the terminology, characteristics and properties of a regression line, the assumptions,
and the components to the ANOVA table.
You will not be fitting regressions by hand, but I will expect you to understand where the
values on SAS output come from and what they mean.
Particular emphasis will be placed on working with, and interpreting, numerical regression
analyses. Analyses will mostly be done with SAS.

James P. Geaghan, Copyright 2011
This note was uploaded on 12/29/2011 for the course EXST 7015 taught by Professor Wang, J. during the Fall '08 term at LSU.