EXST7015 Fall2011 SLR Intro

EXST7015: Statistical Techniques II
Simple Linear Regression: Intro & Review (Geaghan)

Introduction

Major topics (a comprehensive outline is provided elsewhere):
    Regression: SLR, multiple, curvilinear & logistic
    Experimental design: CRD, RBD, LSD, split-plot & repeated measures
    Treatment arrangements: single factor, factorial, nested

Course Objectives

The objectives of the introductory course were to develop an understanding of elementary statistics and the ability to understand and apply basic statistical procedures. We will develop those concepts further, applying the terminology and notation from the basic methods courses to advanced techniques for making statistical inferences. We will cover the major methodologies of parametric statistics used for prediction and hypothesis testing (primarily regression and experimental design). Our emphasis will be on RECOGNIZING analytical problems and on being able to do the statistical analysis with SAS software. We will see SAS programs and output for virtually all analyses covered this semester.

Simple Linear Regression (review?)

The objective: given points plotted on two coordinates, Y and X, find the best line to fit the data.

[Figure: scatter plot of Y (the dependent variable) against X (the independent variable)]

The concept: the data consist of paired observations with a presumed potential for the existence of some underlying relationship. We wish to determine the nature of the relationship and quantify it if it exists. Note that we cannot prove that the relationship exists by using regression (i.e., we cannot prove cause and effect). Regression can only show whether a "correlation" exists, and provide an equation for the relationship.

Given a data set consisting of paired, quantitative variables, and recognizing that there is variation in the data set, we will define the population model (SLR):

    Y_i = β0 + β1 X_i + ε_i
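To make the population model concrete, here is a minimal simulation sketch. This is not from the course notes: the parameter values (β0 = 2, β1 = 0.5, σ = 1) are invented for illustration, and Python is used here although the course software is SAS.

```python
import random

# Invented population parameters, for illustration only.
beta0, beta1, sigma = 2.0, 0.5, 1.0

random.seed(1)  # make the example reproducible

# Draw paired observations (X_i, Y_i) from Y_i = beta0 + beta1*X_i + eps_i,
# where eps_i is a normal error with mean 0 and standard deviation sigma.
data = [(x, beta0 + beta1 * x + random.gauss(0.0, sigma)) for x in range(11)]
```

Fitting a straight line to pairs generated this way should recover estimates near β0 and β1; the estimation formulas are reviewed next.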
This is the model we will fit. It is the equation describing a straight line for a population, and we want to estimate the parameters in the equation. The population parameters to be estimated, for the underlying model μ_{Y.X} = β0 + β1 X_i, are:

    μ_{Y.X} = the true population mean of Y at each value of X
    β0 = the true value of the Y intercept
    β1 = the true value of the slope, the change in Y per unit of X

Terminology

Dependent variable: the variable to be predicted. Y = dependent variable (all variation occurs in Y).
Independent variable: the predictor or regressor variable. X = independent variable (X is measured without error).
Intercept: the value of Y when X = 0, the point where the regression line passes through the Y axis. The units on the intercept are the same as the Y units.
Slope: the change in Y for each unit increase in X. The units on the slope are Y units per X unit.
Deviation: the distance from an observed point to the regression line, also called a residual.
Least squares regression line: the line that minimizes the squared distances from the line to the individual observations.

[Figure: scatter plot with the fitted regression line, showing the intercept and the deviations of the points from the line]

The regression line itself represents the mean of Y at each value of X (μ_{Y.X}).

Regression calculations

All calculations for simple linear regression start with the same values:

    n, ΣX_i, ΣX_i², ΣY_i, ΣY_i², ΣX_iY_i   (sums over i = 1, ..., n)

Calculations for simple linear regression are first adjusted for the mean. These are called "corrected values": they are corrected for the MEAN by subtracting a "correction factor". As a result, all simple linear regressions are adjusted for the means of X and Y, and the fitted line passes through the point (X̄, Ȳ).
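A minimal numeric sketch of these calculations, assuming an invented five-point data set (Python here rather than the course's SAS):

```python
# Invented five-point example data (not from the course notes).
X = [1, 2, 3, 4, 5]
Y = [2, 4, 5, 4, 5]
n = len(X)

xbar = sum(X) / n    # mean of X
ybar = sum(Y) / n    # mean of Y

# Corrected sums of squares and cross products.
Sxx = sum((x - xbar) ** 2 for x in X)
Syy = sum((y - ybar) ** 2 for y in Y)
Sxy = sum((x - xbar) * (y - ybar) for x, y in zip(X, Y))

b1 = Sxy / Sxx           # slope = Sxy / Sxx
b0 = ybar - b1 * xbar    # intercept = Ybar - b1*Xbar

# The fitted line passes through (xbar, ybar).
print(round(b1, 4), round(b0, 4))  # -> 0.6 2.2
```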
The original sums and sums of squares of Y are distances and squared distances from zero. These are referred to as "uncorrected", meaning unadjusted for the mean. The "corrected" deviations sum to zero (half negative and half positive), and their sums of squares are squared distances from the mean of Y.

[Figure: scatter plots contrasting uncorrected deviations, measured from zero, with corrected deviations, measured from the mean of Y]

Once the means (X̄, Ȳ) and the corrected sums of squares and cross products (S_XX, S_YY, S_XY) are obtained, the calculations for the parameter estimates are:

    Slope: b1 = S_XY / S_XX
    Intercept: b0 = Ȳ - b1 X̄

We have fitted the sample equation Y_i = b0 + b1 X_i + e_i, which estimates the population parameters of the model Y_i = β0 + β1 X_i + ε_i.

Variance estimates for regression

After the regression line is fitted, variance calculations are based on the deviations from the regression. From the regression model Y_i = b0 + b1 X_i + e_i we derive the formula for the deviations, e_i = Y_i - (b0 + b1 X_i), or e_i = Y_i - Ŷ_i.

[Figure: scatter plot showing the residuals as vertical deviations from the fitted line]

As with other calculations of variance, we calculate a sum of squares (corrected for the mean). This is simplified by the fact that the deviations, or residuals, already have a mean of zero:

    SSResiduals = Σ e_i² = SSError

The degrees of freedom (d.f.) for the variance calculation are n - 2, since two parameters are estimated prior to the variance (β0 and β1). The variance estimate is called the MSE (mean square error). It is the SSError divided by the d.f.: MSE = SSE / (n - 2).

The variances for the two parameter estimates and the predicted values are all different, but all are based on the MSE, and all have n - 2 d.f. (t-tests) or n - 2 d.f. for the denominator (F tests).
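Continuing the same sketch (invented five-point data, Python rather than SAS), the residuals, the MSE, and the MSE-based variances can be computed directly:

```python
# Invented five-point example data (not from the course notes).
X = [1, 2, 3, 4, 5]
Y = [2, 4, 5, 4, 5]
n = len(X)
xbar, ybar = sum(X) / n, sum(Y) / n
Sxx = sum((x - xbar) ** 2 for x in X)
b1 = sum((x - xbar) * (y - ybar) for x, y in zip(X, Y)) / Sxx
b0 = ybar - b1 * xbar

# Residuals e_i = Y_i - (b0 + b1*X_i); they sum to zero.
e = [y - (b0 + b1 * x) for x, y in zip(X, Y)]

SSE = sum(ei ** 2 for ei in e)   # SSError (about 2.4 for this toy data)
MSE = SSE / (n - 2)              # mean square error, n - 2 d.f.

# All three variance estimates are based on the MSE.
var_slope = MSE / Sxx
var_intercept = MSE * (1 / n + xbar ** 2 / Sxx)
var_pred = [MSE * (1 / n + (x - xbar) ** 2 / Sxx) for x in X]
```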
    Variance of the slope: s²(b1) = MSE / S_XX
    Variance of the intercept: s²(b0) = MSE (1/n + X̄² / S_XX)
    Variance of a predicted value at X_i: s²(Ŷ_i) = MSE (1/n + (X_i - X̄)² / S_XX)

Any of these variances can be used for a t-test of an estimate against a hypothesized value for the appropriate parameter (i.e., the slope, intercept, or predicted value, respectively).

ANOVA table for regression

A common representation of regression results is an ANOVA table. Given the SSError (the sum of squared deviations from the regression) and the initial total sum of squares S_YY (the sum of squares of Y adjusted for the mean), we can construct an ANOVA table:

    Source       d.f.     Sum of Squares     Mean Square    F
    Regression   1        SSRegression       MSReg          MSReg / MSError
    Error        n - 2    SSError            MSError
    Total        n - 1    S_YY = SSTotal

In the ANOVA table, the SSRegression and SSError sum to the SSTotal, so given the total (S_YY) and one of the two terms, we can get the other. The easiest to calculate first is usually the SSRegression, since we usually already have the necessary intermediate values:

    SSRegression = S_XY² / S_XX

The SSRegression is a measure of the "improvement" in the fit due to the regression line. The deviations start at S_YY and are reduced to SSError; the difference is the improvement, and it is equal to the SSRegression. This gives another statistic, the R²: what portion of the SSTotal (S_YY) is accounted for by the regression?

    R² = SSRegression / SSTotal

The degrees of freedom in the ANOVA table are:
    n - 1 for the total (one d.f. lost for the correction for the mean, which also fits the intercept);
    n - 2 for the error, since two parameters are estimated to get the regression line;
    1 d.f. for the regression, which is the d.f. for the slope.

The F test is constructed by calculating MSRegression / MSError. This F test has 1 d.f. in the numerator and n - 2 d.f. in the denominator.
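As a sketch with invented data (Python rather than SAS), the ANOVA quantities can be checked numerically:

```python
import math

# Invented five-point example data (not from the course notes).
X = [1, 2, 3, 4, 5]
Y = [2, 4, 5, 4, 5]
n = len(X)
xbar, ybar = sum(X) / n, sum(Y) / n
Sxx = sum((x - xbar) ** 2 for x in X)
Syy = sum((y - ybar) ** 2 for y in Y)          # SSTotal
Sxy = sum((x - xbar) * (y - ybar) for x, y in zip(X, Y))

SSReg = Sxy ** 2 / Sxx        # SSRegression = Sxy^2 / Sxx
SSErr = Syy - SSReg           # SSTotal = SSRegression + SSError
MSReg = SSReg / 1             # 1 d.f. (the slope)
MSErr = SSErr / (n - 2)       # n - 2 d.f.

F = MSReg / MSErr             # F test with 1 and n - 2 d.f.
R2 = SSReg / Syy              # portion of SSTotal explained

# The F test equals the square of the t-test of the slope against zero.
t = (Sxy / Sxx) / math.sqrt(MSErr / Sxx)
assert abs(F - t ** 2) < 1e-9
```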
This is exactly the same test as the t-test of the slope against zero. To test the slope against a hypothesized value (say zero) using the t-test with n - 2 d.f., calculate:

    t = (b1 - b1,Hypothesized) / s_b1 = (b1 - 0) / sqrt(MSE / S_XX)

Assumptions for the regression

We will recognize 4 assumptions.

1) Normality. We take the deviations from regression and pool them all together into one estimate of variance. Some of the tests we use require the assumption of normality, so these deviations should be normally distributed. For each value of X there is a population of values for the variable Y (normally distributed).

[Figure: normal distributions of Y centered on the regression line at each value of X]

2) Homogeneity of variance. When we pool these deviations (variances), we also assume that the variances are the same at each value of X_i. In some cases this is not true, particularly when the variance increases as X increases.

3) X is measured without error! Since variances are measured only vertically, all variance is in Y; no provision is made for variance in X.

4) Independence. This enters in several places. First, the observations should be independent of each other (i.e., the value of e_i should be independent of e_j for i ≠ j). Also, in the equation for the line, Y_i = b0 + b1 X_i + e_i, we assume that the term e_i is independent of the rest of the model. We will talk more of this when we get to multiple regression.

So the four assumptions are: normality, homogeneity of variance, independence, and X measured without error. These are explicit assumptions, and we will examine or test them when possible.

There are also some other assumptions that I consider implicit. We will not state these, but in some cases they can be tested. For example:
    There is order in the Universe. Otherwise, what are you investigating?
    The underlying fundamental relationship that I just fitted a straight line to really is a straight line.
Sometimes this one can be examined statistically.

Characteristics of a regression line

    The line will pass through the point (X̄, Ȳ) (and also the point (0, b0)).
    The sum of the deviations will be zero: Σ e_i = 0.
    The sum of squared deviations, Σ e_i² = Σ (Y_i - b0 - b1 X_i)², measured vertically from the points to the regression line, will be a minimum.
    Values on the line can be described by the equation Ŷ_i = b0 + b1 X_i.
    The line has some desirable properties (if the assumptions are met):
        E(b0) = β0
        E(b1) = β1
        E(Ŷ_X) = μ_{Y.X}
    Therefore, the parameter estimates and predicted values are unbiased estimates.

Note that linear regression is considered statistically robust; that is, the tests of hypothesis tend to give good results as long as the assumptions are not violated to a great extent.

Crossproducts and correlation

Crossproducts are used in a number of related calculations (they can be + or -):

    A crossproduct: X_i Y_i
    Sum of crossproducts: Σ X_i Y_i
    Corrected sum of crossproducts: S_XY = Σ (X_i - X̄)(Y_i - Ȳ)
    Covariance: S_XY / (n - 1)
    Slope: b1 = S_XY / S_XX
    SSRegression: S_XY² / S_XX
    Correlation: r = S_XY / sqrt(S_XX S_YY)
    R² = r² = S_XY² / (S_XX S_YY) = SSRegression / SSTotal

Simple Linear Regression Summary

See the simple linear regression notes from EXST7005 for additional information, including the derivation of the equations for the slope and intercept. You are not responsible for these derivations. Know the terminology, characteristics, and properties of a regression line, the assumptions, and the components of the ANOVA table. You will not be fitting regressions by hand, but I will expect you to understand where the values on SAS output come from and what they mean. Particular emphasis will be placed on working with, and interpreting, numerical regression analyses. Analyses will mostly be done with SAS.

James P. Geaghan - Copyright 2011
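The crossproduct and correlation identities above can be verified numerically. A closing sketch with invented data (Python rather than SAS):

```python
import math

# Invented five-point example data (not from the course notes).
X = [1, 2, 3, 4, 5]
Y = [2, 4, 5, 4, 5]
n = len(X)
xbar, ybar = sum(X) / n, sum(Y) / n
Sxx = sum((x - xbar) ** 2 for x in X)
Syy = sum((y - ybar) ** 2 for y in Y)
Sxy = sum((x - xbar) * (y - ybar) for x, y in zip(X, Y))

cov = Sxy / (n - 1)              # covariance = Sxy / (n - 1)
r = Sxy / math.sqrt(Sxx * Syy)   # correlation
R2 = Sxy ** 2 / (Sxx * Syy)      # equals r**2 and SSRegression / SSTotal
```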
This note was uploaded on 12/29/2011 for the course EXST 7015, taught by Professor Wang, J., during the Fall '08 term at LSU.