EXST7005 Fall2010 24a Regression


Statistical Methods I (EXST 7005) Page 140

Simple Linear Regression

Simple regression applications are used to fit a model describing a linear relationship between two variables. The methods of least squares regression and correlation were developed by Sir Francis Galton in the late 1800s. The application can be used to test for a statistically significant correlation between the variables. Finding a relationship does not prove a "cause and effect" relationship, but the model can be used to quantify a relationship where one is known to exist. The model provides a measure of the rate of change of one variable relative to another: there is a potential change in the value of variable Y as the value of variable X changes.

Variable values will always be paired, one termed an independent variable (often referred to as the X variable) and one a dependent variable (the Y variable). For each value of X there is assumed to be a normally distributed population of values for the variable Y.

The linear model which describes the relationship between the two variables is given as

    Yi = β0 + β1·Xi + εi

The "Y" variable is called the dependent variable or response variable (vertical axis). The "X" variable is called the independent variable or predictor variable (horizontal axis).

    μY.X = β0 + β1·Xi

is the population equation for a straight line. No error term is needed in this equation because it describes the line itself. The term μY.X is estimated at each value of Xi with Ŷi.

    μY.X = the true population mean of Y at each value of X
    β0 = the true value of the intercept (the value of Y when X = 0)
    β1 = the true value of the slope, the amount of change in Y for each unit change in X (i.e. if X changes by 1 unit, Y changes by β1 units)

The two population parameters to be estimated, β0 and β1, are also referred to as the regression coefficients.

James P. Geaghan, Copyright 2010
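The model just described — a normal population of Y values centered on the line at every X — can be illustrated by simulating from it. This is only a sketch; the parameter values, sample size, and seed below are arbitrary choices for illustration, not values from these notes.

```python
import random

# Hypothetical population parameters (arbitrary illustration values)
beta0, beta1, sigma = 4.0, 1.8, 2.0

random.seed(1)
x_values = list(range(1, 11))

# For each Xi, Y is drawn from a normal population centered on the line:
# mu_Y.X = beta0 + beta1 * Xi, with the same variance at every Xi.
y_values = [beta0 + beta1 * x + random.gauss(0.0, sigma) for x in x_values]

for x, y in zip(x_values, y_values):
    print(f"X = {x:2d}   mu_Y.X = {beta0 + beta1 * x:5.1f}   Y = {y:5.1f}")
```

Each printed Y scatters around its population mean μY.X, which is exactly the role of the εi term in the model.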
All variability in the model is assumed to be due to Yi, so variance is measured vertically. The variability is assumed to be normally distributed at each value of Xi. The Xi variable is assumed to have no variance, since all variability is in Yi (this is a new assumption).

The values β0 and β1 (b0 and b1 for a sample) are called the regression coefficients.

The β0 value is the value of Y at the point where the line crosses the Y axis. This value is called the intercept. If this value is zero the line crosses at the origin of the X and Y axes, and the linear equation reduces from Yi = b0 + b1·Xi to Yi = b1·Xi; the model is said to have "no intercept," even though the regression line does cross the Y axis. The units on b0 are the same units as for Yi.

The β1 value is called the slope. It determines the incline or angle of the regression line. If the slope is 0, the line is horizontal; the linear model reduces to Yi = b0, and the regression is said to have "no slope." The slope gives the change in Y per unit of X, so the units on the slope are "Y units per X unit."

The population equation for the line describes a perfect line with no variation. In practice there is always variation about the line, so we include an additional term to represent this variation:

    Yi = β0 + β1·Xi + εi   for a population
    Yi = b0 + b1·Xi + ei   for a sample

When we put this term in the model, we are describing individual points as their position on the line plus or minus some deviation. The sum of squares of deviations from the line will form the basis of a variance for the regression line.

When we leave the ei off the sample model we are describing a point on the regression line, predicted from the sample estimates. To indicate this we put a "hat" on the Yi value: Ŷi = b0 + b1·Xi.
Characteristics of a Regression Line

The line will pass through the point (X̄, Ȳ) (and also through the point (0, b0)). The sum of squared deviations (measured vertically) of the points from the regression line will be a minimum. Values on the line for any value of Xi can be described by the equation Ŷi = b0 + b1·Xi.

Common objectives in regression (there are a number of possible objectives):

1. Determine if there is a relationship between Yi and Xi. This would be determined by some hypothesis test. The strength of the relationship is, to some extent, reflected in the correlation or R² value.
2. Determine the value of the rate of change of Yi relative to Xi. This is measured by the slope of the regression line. This objective would usually be accompanied by a test of the slope against 0 (or some other value) and/or a confidence interval on the slope.
3. Establish and employ a predictive equation for Yi from Xi. This objective would usually be preceded by objective 1 above, to show that a relationship exists. The predicted values would usually be given with their confidence interval, or the regression with its confidence band.

Assumptions in Regression Analysis

Independence. The best guarantee of this assumption is random sampling. This is a difficult assumption to check. This assumption is made for all tests we will see in this course.

Normality of the observations at each value of Xi (or of the pooled deviations from the regression line). This is relatively easy to test if the appropriate values are tested (e.g. residuals in ANOVA or regression, not the raw Yi values). This can be tested with the Shapiro-Wilk W statistic in PROC UNIVARIATE.
This assumption is made for all tests we have seen this semester except the chi-square tests of goodness of fit and independence.

Homogeneity of error (homogeneous variances, or homoscedasticity). This is easy to check for and to test in analysis of variance (tests like Bartlett's in ANOVA). In regression the simplest way to check is by examining the residual plot. This assumption is made for ANOVA (for the pooled variance) and regression. Recall that in two-sample t-tests the equality of the variances need not be assumed; it can be readily tested.

Xi measured without error. This must be assumed in ordinary least squares regression, since all error is measured in a vertical direction and occurs in Yi.

General assumptions:
- The Y variable is normally distributed at each value of X.
- The variance is homogeneous (across X).
- Observations are independent of each other, and ei is independent of the rest of the model.

Special assumption for regression: assume that all of the variation is attributable to the dependent variable (Y), and that the variable X is measured without error. Note that the deviations are measured vertically, not horizontally or perpendicular to the line.

Fitting the line

Fitting the line starts with a corrected SSDeviation: the SS of deviations of the observations from a horizontal line through the mean. The line will pass through the point (X̄, Ȳ). The fitted line is pivoted on this point until it has a minimum SSDeviations. How do we know the SSDeviations are a minimum? We solve the equation for ei and use calculus to determine the solution that has a minimum sum of squared deviations.
    Yi = b0 + b1·Xi + ei
    ei = Yi − (b0 + b1·Xi) = Yi − Ŷi
    Σ ei² = Σ [Yi − (b0 + b1·Xi)]² = Σ (Yi − Ŷi)²   (sums over i = 1 to n)

The line has some desirable properties:

    E(b0) = β0,   E(b1) = β1,   E(Ŷ) = μY.X

Therefore, the parameter estimates and predicted values are unbiased estimates.

Derivation of the formulas

You do not need to learn this derivation for this class! However, you should be aware of the process and its objectives. Any observation from a sample can be written as Yi = b0 + b1·Xi + ei, where ei = the deviation of the observed point from the regression line. The idea of regression is to minimize the deviation of the observations from the regression line; this is called a least squares fit. The simple sum of the deviations is zero (Σ ei = 0), so minimizing will require a square or an absolute value to remove the sign. The sum of the squared deviations is

    Σ ei² = Σ (Yi − Ŷi)² = Σ (Yi − b0 − b1·Xi)²

The objective is to select b0 and b1 such that Σ ei² is a minimum, using some techniques from calculus.

We have previously defined the uncorrected sum of squares and corrected sum of squares of a variable Yi.

The corrected sum of squares of Y:
    The uncorrected SS is Σ Yi².
    The correction factor is CYY = (Σ Yi)² / n.
    The corrected SS is CSS = SYY = Σ (Yi − Ȳ)² = Σ Yi² − (Σ Yi)² / n.
We will call this corrected sum of squares SYY and the correction factor CYY.

The corrected sum of squares of X: we could define the exact same series of calculations for Xi, and call it SXX.

The corrected cross products of Y and X: we need a cross product for regression, and a corrected cross product. The cross product is XiYi.
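The equivalence between the "corrected SS via the correction factor" and the direct sum of squared deviations from the mean can be checked numerically. A minimal sketch — the small data vector is an arbitrary illustration, not data from these notes:

```python
y = [3.0, 7.0, 8.0, 12.0]  # arbitrary illustration values
n = len(y)

uncorrected_ss = sum(yi**2 for yi in y)      # sum of Yi^2
correction_factor = sum(y)**2 / n            # C_YY = (sum Yi)^2 / n
syy = uncorrected_ss - correction_factor     # corrected SS, S_YY

# S_YY equals the sum of squared deviations from the mean
ybar = sum(y) / n
assert abs(syy - sum((yi - ybar)**2 for yi in y)) < 1e-9

print(syy)
```

The same pattern gives SXX (replace y with x) and, with the product XiYi in place of the square, the corrected cross product SXY defined next.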
The uncorrected sum of cross products is Σ YiXi.
The correction factor for the cross products is CXY = (Σ Yi)(Σ Xi) / n.
The corrected cross product is CCP = SXY = Σ (Yi − Ȳ)(Xi − X̄) = Σ YiXi − (Σ Yi)(Σ Xi) / n.

The formulas for calculating the slope and intercept can be derived as follows. Take the partial derivative of Σ ei² with respect to each of the parameter estimates, b0 and b1.

For b0:

    ∂(Σ ei²)/∂b0 = 2 Σ (Yi − b0 − b1·Xi)(−1), which is set equal to 0 and solved for b0:
    −Σ Yi + n·b0 + b1·Σ Xi = 0   (this is the first "normal equation")

Likewise, for b1 we obtain the partial derivative, set it equal to 0, and solve for b1:

    ∂(Σ ei²)/∂b1 = 2 Σ (Yi − b0 − b1·Xi)(−Xi) = 0, giving
    −Σ YiXi + b0·Σ Xi + b1·Σ Xi² = 0   (the second "normal equation")

The normal equations can be written as

    b0·n + b1·Σ Xi = Σ Yi
    b0·Σ Xi + b1·Σ Xi² = Σ YiXi

At this point we have two equations and two unknowns, so we can solve for the unknown regression coefficient values b0 and b1.

For b0 the solution is: n·b0 = Σ Yi − b1·Σ Xi, so

    b0 = Σ Yi / n − b1·(Σ Xi / n) = Ȳ − b1·X̄

Note that estimating β0 requires a prior estimate of b1 and the means of the variables X and Y.

For b1, substituting b0 = Ȳ − b1·X̄ into Σ YiXi = b0·Σ Xi + b1·Σ Xi² gives

    Σ YiXi = (Σ Yi / n − b1·Σ Xi / n)·Σ Xi + b1·Σ Xi²
           = (Σ Yi)(Σ Xi)/n − b1·(Σ Xi)²/n + b1·Σ Xi²

and solving for b1,

    b1 = [Σ YiXi − (Σ Yi)(Σ Xi)/n] / [Σ Xi² − (Σ Xi)²/n] = SXY / SXX

so b1 is the corrected cross products over the corrected SS of X.

The intermediate statistics needed to solve all elements of a SLR are Σ Yi, Σ Xi, Σ Yi², Σ Xi², Σ YiXi and n. We have not seen Σ Yi² used in the calculations yet, but we will need it later to calculate variance.

Review

We want to fit the best possible line through some observed data points.
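The closed-form solution of the two normal equations can be sketched as a small function built only from the intermediate statistics listed above. The function name and the tiny check data are my own, for illustration:

```python
def slr_coefficients(x, y):
    """Solve the two normal equations for (b0, b1) via corrected sums."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sxx = sum(xi**2 for xi in x) - sum_x**2 / n                  # corrected SS of X
    sxy = sum(xi * yi for xi, yi in zip(x, y)) - sum_x * sum_y / n  # corrected CP
    b1 = sxy / sxx                       # slope = S_XY / S_XX
    b0 = sum_y / n - b1 * sum_x / n      # intercept = Ybar - b1 * Xbar
    return b0, b1

# Check that the estimates satisfy both normal equations
x = [1.0, 2.0, 3.0, 4.0]
y = [2.0, 3.0, 5.0, 6.0]
b0, b1 = slr_coefficients(x, y)
n = len(x)
assert abs(b0 * n + b1 * sum(x) - sum(y)) < 1e-9          # first normal equation
assert abs(b0 * sum(x) + b1 * sum(xi**2 for xi in x)
           - sum(xi * yi for xi, yi in zip(x, y))) < 1e-9  # second normal equation
print(b0, b1)
```

The two assertions are exactly the normal equations derived above, so passing them confirms the algebra.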
We define this as the line that minimizes the vertically measured distances from the observed values to the fitted line. The line that achieves this is defined by the equations

    b1 = [Σ YiXi − (Σ Yi)(Σ Xi)/n] / [Σ Xi² − (Σ Xi)²/n] = SXY / SXX
    b0 = Σ Yi / n − b1·(Σ Xi / n) = Ȳ − b1·X̄

These calculations provide us with two parameter estimates that we can then use to get the equation for the fitted line: Ŷi = b0 + b1·Xi.

Testing hypotheses about regressions

The total variation about a regression is exactly the same calculation as the total for analysis of variance:

    SSTotal = SSDeviations from the mean = uncorrected SSTotal − correction factor

The simple regression analysis will produce two sources of variation:
    SSRegression — the variation explained by the regression
    SSError — the remaining, unexplained variation about the regression line

These sources of variation are expressed in an ANOVA source table.

    Source       d.f.
    Regression   1        d.f. used to fit the slope
    Error        n − 2    error d.f.
    Total        n − 1    d.f. lost in adjusting for ("correcting for") the mean

Note that one degree of freedom is lost from the total for the "correction for the mean," which actually fits the intercept. The single regression d.f. is for fitting the slope. The correction fits a flat line through the mean; the "regression" actually fits the slope. The difference between these two models is that one has no slope, or a slope equal to zero (b1 = 0), and the other has a slope fitted. Testing for a difference between these two cases is the common hypothesis test of interest in regression, and it is expressed as H0: β1 = 0.

The results of a regression are expressed in an ANOVA table. The regression is tested with an F test, formed by dividing the MSRegression by the MSError.
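The partition of the corrected total SS into SSRegression and SSError, and the resulting F ratio, can be sketched as follows. This is a hypothetical helper for illustration (not the notes' SAS output); it uses the shortcut SSRegression = SXY²/SXX, which follows from the formulas above:

```python
def regression_anova(x, y):
    """Return (ss_regression, ss_error, ss_total, f_statistic) for a SLR."""
    n = len(x)
    sxx = sum(xi**2 for xi in x) - sum(x)**2 / n
    syy = sum(yi**2 for yi in y) - sum(y)**2 / n              # corrected SSTotal
    sxy = sum(xi * yi for xi, yi in zip(x, y)) - sum(x) * sum(y) / n
    ss_reg = sxy**2 / sxx          # SSRegression, 1 d.f.
    ss_err = syy - ss_reg          # SSError, n - 2 d.f.
    f = (ss_reg / 1) / (ss_err / (n - 2))   # F = MSRegression / MSError
    return ss_reg, ss_err, syy, f

# Tiny arbitrary example: the two sources must sum to the total
x = [1.0, 2.0, 3.0, 4.0]
y = [2.0, 3.0, 5.0, 6.0]
ss_reg, ss_err, ss_tot, f = regression_anova(x, y)
assert abs(ss_reg + ss_err - ss_tot) < 1e-9
print(round(ss_reg, 3), round(ss_err, 3), round(f, 3))
```

The assertion encodes the key identity of the source table: SSRegression + SSError = SSTotal.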
    Source       df      SS            MS
    Regression   1       SSRegression  MSRegression
    Error        n − 2   SSError       MSError
    Total        n − 1   SSTotal

    F = MSRegression / MSError

This is a one-tailed F test, as it was with ANOVA, and it has 1 and n − 2 d.f. It tests the null hypothesis H0: β1 = 0 versus the alternative H1: β1 ≠ 0.

The R² statistic

This is a popular statistic for interpretation. The concept is that we want to know what proportion of the corrected total sum of squares is explained by the regression line. In the process of fitting the regression, the SSTotal is divided into two parts: the sum of squares "explained" by the regression (SSRegression) and the remaining unexplained variation (SSError). Since these sum to the SSTotal, we can calculate what fraction of the total was fitted, or explained, by the regression:

    R² = SSRegression / SSTotal

This is often multiplied by 100% and expressed as a percentage of the total sum of squares explained by the model. We might state that the regression explains 75% of the total variation.

This is a very popular statistic, but it can be very misleading. For some studies an R² value of 25% or 35% can be pretty good — for example, if you are trying to relate the abundance of an organism to environmental variables. On the other hand, if you are fitting morphometric relationships, like relating a crab's width to its length, an R² value of less than 90% is pretty bad.

A note on regression models applied to transformed variables: studies of morphometric relationships, including relationships of lengths to weights, should be done with logarithmic values of both X and Y. The log(Y) on log(X) model, called a power model, is a very flexible model used for many purposes. Many other models involving logs, powers, and inverses are possible.
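R² and the correlation coefficient r are tightly linked: r = SXY / √(SXX·SYY), and r² equals SSRegression/SSTotal. A short sketch of that consistency check, using arbitrary corrected sums for illustration:

```python
import math

def r_squared(ss_regression, ss_total):
    """Proportion of the corrected total SS explained by the regression."""
    return ss_regression / ss_total

def correlation(sxy, sxx, syy):
    """r = S_XY / sqrt(S_XX * S_YY)."""
    return sxy / math.sqrt(sxx * syy)

# Arbitrary corrected sums: SSRegression = S_XY^2 / S_XX, so r^2 = R^2
sxy, sxx, syy = 7.0, 5.0, 10.0
r = correlation(sxy, sxx, syy)
assert abs(r**2 - r_squared(sxy**2 / sxx, syy)) < 1e-12
print(round(r, 4))
```

Note that r carries the sign of the slope (via SXY), while R² is always between 0 and 1.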
These will fit curves of one shape or another. When using transformed variables in regression, all tests and confidence intervals are placed on the transformed values. Otherwise, they are used like any other simple linear regression.

Numerical Example

Some freshwater-fish ectoparasites accumulate on the fish as it grows. Once the parasite is on the fish, it does not leave. The parasite completes its life cycle after the fish is consumed by a bird and finds its way again into the water. Since the parasite attaches and does not leave, older fish should accumulate more parasites. We want to test this hypothesis.

Raw data with squares and cross products:

    Obs   Age   Parasites   Age²   Parasites²   Age·Parasites
    1     1     3           1      9            3
    2     2     7           4      49           14
    3     3     8           9      64           24
    4     3     12          9      144          36
    5     3     10          9      100          30
    6     4     15          16     225          60
    7     4     14          16     196          56
    8     5     16          25     256          80
    9     6     17          36     289          102
    10    6     15          36     225          90
    11    6     16          36     256          96
    12    7     19          49     361          133
    13    7     21          49     441          147
    14    8     18          64     324          144
    15    9     17          81     289          153
    16    9     20          81     400          180

Summary data:

            Age      Parasites   Age²      Parasites²   Age·Parasites
    Sum     83       228         521       3628         1348
    Mean    5.1875   14.25       32.5625   226.75       84.25
    n       16       16          16        16           16

Intermediate calculations:

    Σ X = 83,   Σ X² = 521,   X̄ = 5.1875
    Σ Y = 228,  Σ Y² = 3628,  Ȳ = 14.25
    Σ XY = 1348,  n = 16

Correction factors and corrected values (sums of squares and cross products):

    CF for X:   Cxx = 430.5625    Corrected SS X:   Sxx = 90.4375
    CF for Y:   Cyy = 3249        Corrected SS Y:   Syy = 379
    CF for XY:  Cxy = 1182.75     Corrected CP XY:  Sxy = 165.25

ANOVA table (values needed):

    SSTotal = 379
    SSRegression = 165.25² / 90.4375 = 301.9495508
    SSError = 379 − 301.9495508 = 77.0504492

    Source       df   SS            MS            F
    Regression   1    301.9495508   301.9495508   54.8639723
    Error        14   77.0504492    5.5036035
    Total        15   379

    Tabular F(0.05; 1, 14) = 4.600    Tabular F(0.01; 1, 14) = 8.862

Model parameter estimates:

    Slope = b1 = Σ (Yi − Ȳ)(Xi − X̄) / Σ (Xi − X̄)² = Sxy / Sxx = 165.25 / 90.4375 = 1.827228749
    Intercept = b0 = Ȳ − b1·X̄ = 14.25 − 1.827228749·5.1875 = 4.771250864

Regression equation:  Yi = b0 + b1·Xi + ei = 4.771250864 + 1.827228749·Xi + ei
Regression line:      Ŷi = b0 + b1·Xi = 4.771250864 + 1.827228749·Xi

Standard error of b1:

    Sb1 = √( MSE / Σ (Xi − X̄)² ) = √( MSE / Sxx ) = √( 5.5036 / 90.4375 ) = 0.2467

Confidence interval on b1, where b1 = 1.827228749 and t(0.05/2, 14 d.f.) = 2.145:

    P(1.827228749 − 2.145·0.246688722 ≤ β1 ≤ 1.827228749 + 2.145·0.246688722) = 0.95
    P(1.29808144 ≤ β1 ≤ 2.356376058) = 0.95

Testing b1 against a specified value, e.g. H0: β1 = 5 versus H1: β1 ≠ 5, where b1 = 1.827228749, Sb1 = 0.246688722 and t(0.05/2, 14 d.f.) = 2.145:

    t = (b1 − 5) / Sb1 = (1.827228749 − 5) / 0.246688722 = −12.86144

Standard error of the regression line (i.e. of Ŷi as an estimate of μY|X):

    S(μ̂Y|X) = √( MSE·[ 1/n + (Xi − X̄)² / Σ (Xi − X̄)² ] )

Standard error of the individual points (i.e. of Yi): an individual point is a linear combination of Ŷi and ei, so the variances are the sum of the variances of these two, where the variance of ei is MSE:

    S²(Ŷi, individual) = S²(μ̂Y|X) + MSE,  so the standard error is
    √( MSE·[ 1 + 1/n + (Xi − X̄)² / Σ (Xi − X̄)² ] )

Standard error of b0: this is the standard error of the regression line where Xi = 0:

    Sb0 = √( 5.503603515·(0.0625 + 26.91015625/90.4375) ) = 1.407693696

Confidence interval on b0, where b0 = 4.771250864 and t(0.05/2, 14 d.f.) = 2.145:

    P(4.771250864 − 2.145·1.407693696 ≤ β0 ≤ 4.771250864 + 2.145·1.407693696) = 0.95
    P(1.751747886 ≤ β0 ≤ 7.790753842) = 0.95

Estimate the standard error of an individual observation of parasite number for a ten-year-old fish:

    Ŷ = b0 + b1·Xi = 4.77125 + 1.82723·10 = 23.04354
    SE = √( 5.503603515·[1 + 0.0625 + (10 − 5.1875)²/90.4375] )
       = √( 5.503603515·[1 + 0.0625 + 23.16015625/90.4375] ) = 2.693881509

Interval for an individual observation at X = 10 (using the individual-observation standard error above):

    P(23.04353836 − 2.145·2.693881509 ≤ Y|X=10 ≤ 23.04353836 + 2.145·2.693881509) = 0.95
    P(17.26516252 ≤ Y|X=10 ≤ 28.82191419) = 0.95

Coefficient of determination and correlation:

    R² = 0.796700662 (i.e. 79.67%)
    r = 0.892580899

See SAS output.

Overview of results and findings from the SAS program

I. Objective 1: Determine if older fish have more parasites. (SAS can provide this.)
   A. This determination would be made by examining the slope. The slope is the mean change in parasite number for each unit increase in age. The hypothesis tested is H0: β1 = 0 versus H1: β1 ≠ 0.
      1. If this number does not differ from zero, then there is no apparent relationship between age and number of parasites. If it differs from zero and is positive, then parasites increase with age. If it differs from zero and is negative, then parasites decrease with age.
      2. For a simple linear regression we can examine the F test of the model, the F test of the Type I SS, the F test of the Type II SS, the F test of the Type III SS, or the t-test of the slope. For a simple linear regression these all provide the same result.
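The hand calculations above can be reproduced end to end from the raw data. This sketch recomputes the printed values in pure Python (no SAS); the tabulated t value 2.145 is taken from the notes rather than computed:

```python
import math

age = [1, 2, 3, 3, 3, 4, 4, 5, 6, 6, 6, 7, 7, 8, 9, 9]
parasites = [3, 7, 8, 12, 10, 15, 14, 16, 17, 15, 16, 19, 21, 18, 17, 20]
n = len(age)

# Corrected sums of squares and cross products
sxx = sum(x * x for x in age) - sum(age)**2 / n                       # 90.4375
syy = sum(y * y for y in parasites) - sum(parasites)**2 / n           # 379
sxy = sum(x * y for x, y in zip(age, parasites)) - sum(age) * sum(parasites) / n  # 165.25

# Parameter estimates
b1 = sxy / sxx                               # slope ~ 1.8272 parasites/year
b0 = sum(parasites) / n - b1 * sum(age) / n  # intercept ~ 4.7713

# ANOVA quantities
ss_reg = sxy**2 / sxx            # ~ 301.95
mse = (syy - ss_reg) / (n - 2)   # ~ 5.5036
se_b1 = math.sqrt(mse / sxx)     # ~ 0.2467
r2 = ss_reg / syy                # ~ 0.7967

# 95% CI on the slope, t(0.025, 14 d.f.) = 2.145 from the notes' table
t = 2.145
ci_b1 = (b1 - t * se_b1, b1 + t * se_b1)   # ~ (1.298, 2.356)

print(round(b1, 4), round(b0, 4), round(se_b1, 4), round(r2, 4))
```

Every value agrees with the worked example, which is a useful way to check hand arithmetic of this kind.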
      For multiple regressions (more than one independent variable) we would examine the Type II or Type III F test (these are the same in regression) or the t-tests of the regression coefficients. [Alternatively, a confidence interval can be placed on the coefficient; if the interval does not include 0, the estimate of the coefficient is significantly different from zero.]
   B. In this case, the F tests mentioned had values of 54.86, and the probability of this F value with 1 and 14 d.f. is less than 0.0001. Likewise, the t-test of the slope was 7.41, which was also significant at the same level. Note that t² = F; these are the same test. We can therefore conclude that the slope does differ from zero. Since it is positive, we further conclude that older fish have more parasites.

II. Objective 2: Estimate the rate of accumulation of parasites. (SAS can provide this.)
   A. The slope for this example is 1.827228749 parasites per year (note the units). It is positive, so we expect parasite numbers to increase by about 1.8 per year.
   B. The standard error for the slope was 0.24668872. This value is provided by SAS and can be used for hypothesis testing or confidence intervals. SAS provides a t-test of H0: β1 = 0, but hypotheses about values other than zero must be requested (SAS TEST statement) or calculated by hand. The confidence interval in this case (calculated previously and partly repeated below) is:

      P[b1 − t(α/2, 14 d.f.)·Sb1 ≤ β1 ≤ b1 + t(α/2, 14 d.f.)·Sb1] = 0.95
      P[1.827228749 − 2.144789·0.246689 ≤ β1 ≤ 1.827228749 + 2.144789·0.246689] = 0.95
      P[1.298134 ≤ β1 ≤ 2.356324] = 0.95

      Note that this confidence interval does not include zero, so the slope differs significantly from zero.

III. Estimate the intercept with confidence interval.
   A. The intercept may also require a confidence interval. This was calculated previously and was:

      P(1.751747886 ≤ β0 ≤ 7.790753842) = 0.95