Chapter 1. Simple Linear Regression
STAT6220

1. Introduction

We are interested in establishing the relationship between two variables, especially in predicting one variable (y) from the other (x).

Deterministic Model vs. Probabilistic Model
‐ A deterministic model assumes that knowing x lets us predict y exactly. It hypothesizes an exact relationship between the variables and makes no allowance for error in the prediction.
‐ A probabilistic model: if y is the variable of interest, y = Deterministic component + Random error.
‐ We assume the mean value of the random error is 0, i.e., E(y) = Deterministic component.
‐ In this chapter we consider the simplest probabilistic model, in which the deterministic portion of the model graphs as a straight line.
‐ Fitting this model to a set of data is called regression.

2. Model

‐ Suppose we are given observations in pairs: $(X_1, Y_1), \ldots, (X_n, Y_n)$, where $X_i, Y_i \in \mathbb{R}$.
‐ Suppose we want to predict Y as a function of X because we believe there is some underlying relationship between Y and X; for example, Y can be approximated by a function of X, i.e., $Y \approx f(X)$.
‐ We will consider the case when f(x) is a linear function of x: $f(x) = \beta_0 + \beta_1 x$.
‐ The probabilistic model is $y = \beta_0 + \beta_1 x + \varepsilon$.
‐ Y: dependent or response variable; X: independent or predictor variable.
‐ $\varepsilon$ (epsilon): random error component.
‐ $\beta_0$: y‐intercept of the line, the point at which the line intersects or cuts through the y‐axis.
‐ $\beta_1$: slope of the line, the amount of increase (or decrease) in the deterministic component of y for every 1‐unit increase in x.
‐ Exercise: Draw $y = 2 - \frac{1}{2}x$.
‐ Note that linearity is not always a reasonable assumption, but it is a simple and good starting point for more complicated models.
‐ Model assumptions:
1) Linearity: $E(\varepsilon_i) = 0$ for all i. This implies that the mean of y given x is ______________
2) Homoscedasticity: the errors have the same variance, i.e., $\mathrm{Var}(\varepsilon_i) = \sigma^2$ for all i. This implies that the variance of y is _______________________
3) Independence: the errors are independent of each other.
4) Normality: $\varepsilon_i$ is normally distributed for all i.
‐ The above model has the following parameters to estimate from the sample: ________________________

3. Method of Least Squares

‐ Example 1: Suppose an experiment involving five subjects is conducted to determine the relationship between the percentage of a certain drug in the bloodstream and the length of time it takes to react to a stimulus. The results are

Subject   Amount of drug x (%)   Reaction time y (sec)
1         1                      1
2         2                      1
3         3                      2
4         4                      2
5         5                      4

‐ First we need to determine whether a linear relationship between y and x is plausible. It is helpful to plot the sample data. Such a plot, known as a scatter diagram, locates each of the five data points in the plane of x and y.
(Draw here)
Would a straight line adequately describe the trend in the data?
‐ The estimate of the model is $\hat{y} = \hat\beta_0 + \hat\beta_1 x$, or $\hat{y}_i = \hat\beta_0 + \hat\beta_1 x_i$.
‐ Note: $\hat{y}_i$ is the predicted value of $y_i$; $e_i = y_i - \hat{y}_i$ is called the residual (estimated error).
‐ The most common method to estimate the parameters $\beta_0, \beta_1$ is the least‐squares (LS) method. That is, choose $\hat\beta_0, \hat\beta_1$ by minimizing the sum of squared errors:
  $L = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \sum_{i=1}^{n} [y_i - (\hat\beta_0 + \hat\beta_1 x_i)]^2$
‐ Exercise: Calculate the LS estimators $\hat\beta_0, \hat\beta_1$ by differentiating L and setting $\partial L / \partial \hat\beta_0 = 0$, $\partial L / \partial \hat\beta_1 = 0$.
‐ Define
  $S_{xx} = \sum_i (x_i - \bar{x})^2 = \sum_i x_i^2 - \frac{(\sum_i x_i)^2}{n}$,
  $S_{yy} = \sum_i (y_i - \bar{y})^2 = \sum_i y_i^2 - \frac{(\sum_i y_i)^2}{n}$,
  $S_{xy} = \sum_i (x_i - \bar{x})(y_i - \bar{y}) = \sum_i x_i y_i - \frac{(\sum_i x_i)(\sum_i y_i)}{n}$.
‐ Then
  $\hat\beta_1 = \frac{S_{xy}}{S_{xx}}$ and $\hat\beta_0 = \bar{y} - \hat\beta_1 \bar{x}$.
‐ Example 1:
1) Calculate the LS estimates for the example.
2) Calculate the residuals (errors) and the sum of squared errors (SSE). This SSE is smaller than what any other straight‐line model can produce.

Decomposition of Sum of Squares
‐ Note that $y_i - \bar{y} = (\hat{y}_i - \bar{y}) + (y_i - \hat{y}_i)$.
‐ (Illustration):
  $\sum (y_i - \bar{y})^2 = \sum (\hat{y}_i - \bar{y})^2 + \sum (y_i - \hat{y}_i)^2$
‐ Total sum of squares (TSS) equals sum of squares due to regression (SSR) (explained by the model) plus sum of squares due to error (SSE) (unexplained variability).
‐ Note: SSE = sum of squared errors = sum of squares due to error = residual sum of squares (RSS).

Estimation of the variance of random error $\sigma^2$
‐ The estimator of the error variance ($\sigma^2$) for the simple linear model is the mean squared error (MSE), the sample variance of the errors:
  $s^2 = \hat\sigma^2 = MSE = \frac{SSE}{n-2}$
‐ The square root of MSE, $s = \sqrt{MSE}$, is called the sample standard deviation or the standard error of the residuals.
‐ We expect most (about 95%) of the observed values to lie within 2s of their respective least squares predicted values $\hat{y}$. That is, about 95% of the residuals are expected to be less than 2s in absolute value.
‐ Degrees of freedom (n − 2): we estimate two parameters, $\beta_0$ and $\beta_1$. If we have only two data points, there are no degrees of freedom left.
‐ Example 1: Calculate s for the example. How many residuals are within 2s?

4. Assessing the Usefulness of the Model

‐ Now that we have estimated the regression line and error variance, we are ready to make statistical inferences about the model's usefulness for estimation and prediction.
‐ The estimated slope, intercept, and standard error in a simple regression model are all estimates based on limited data. Their significance is affected by random error.

Inferences about $\beta_1$
‐ The sampling distribution of $\hat\beta_1$ is normal with mean $E(\hat\beta_1) = \beta_1$ and variance $\mathrm{Var}(\hat\beta_1) = \sigma^2 / S_{xx}$.
‐ (Estimated) S.E. of $\hat\beta_1$: $s / \sqrt{S_{xx}}$
‐ $100(1-\alpha)\%$ C.I. for $\beta_1$:
  $\hat\beta_1 \pm t_{\alpha/2}(n-2) \frac{s}{\sqrt{S_{xx}}}$

Hypothesis testing for $\beta_1$
‐ Refer again to Example 1 and suppose the reaction times are completely unrelated to the percentage of drug in the bloodstream. If x contributes no information about y, what could be said about the value of $\beta_1$?
‐ To test the null hypothesis that the linear model contributes no information about y against the alternative hypothesis that it does, we test
  $H_0: \beta_1 = 0$ vs. $H_a: \beta_1 \neq 0$.
  Alternatively, we could construct a $100(1-\alpha)\%$ confidence interval for $\beta_1$. $H_0$ would be supported if the C.I. contained zero; otherwise $H_a$ would be supported.
Test statistic:
  $t = \frac{\hat\beta_1 - 0}{s / \sqrt{S_{xx}}} \sim T(n-2)$ under $H_0$
Rejection region: reject $H_0$ if
  $t > t_\alpha$ ($H_a: \beta_1 > 0$)
  $t < -t_\alpha$ ($H_a: \beta_1 < 0$)
  $|t| > t_{\alpha/2}$ ($H_a: \beta_1 \neq 0$)
‐ p‐value: You can find bounds for the p‐value from a t‐table. Computer output gives the exact value.
‐ Hypothesis test for a specified value $\beta_1 = c_0$, where $c_0$ is any nonzero constant: replace 0 in the test statistic by $c_0$.
‐ Alternative F‐test of $H_0: \beta_1 = 0$ vs. $H_a: \beta_1 \neq 0$:
  o Test statistic: $F = \frac{MSR}{MSE} = \frac{SSR/1}{SSE/(n-2)} \sim F(1, n-2)$ under $H_0$
  o Reject $H_0$ if $F > F_\alpha(1, n-2)$.
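As a hedged illustration of the least‐squares formulas and the slope tests above, the following Python sketch works through the Example 1 data using only the standard library. The function name `fit_simple_linear` and the dictionary keys are hypothetical, and the critical value 3.182 for $t_{0.025}(3)$ is an assumed value read from a standard t‐table rather than computed.

```python
import math

def fit_simple_linear(x, y):
    """Least-squares fit of y = b0 + b1*x using the Sxx, Sxy, Syy shortcuts."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sxx = sum(xi * xi for xi in x) - sum_x ** 2 / n
    syy = sum(yi * yi for yi in y) - sum_y ** 2 / n
    sxy = sum(xi * yi for xi, yi in zip(x, y)) - sum_x * sum_y / n

    b1 = sxy / sxx                      # slope estimate
    b0 = sum_y / n - b1 * sum_x / n     # intercept estimate
    residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
    sse = sum(e * e for e in residuals)
    ssr = syy - sse                     # TSS = SSR + SSE, with TSS = Syy
    s = math.sqrt(sse / (n - 2))        # square root of MSE
    se_b1 = s / math.sqrt(sxx)          # estimated S.E. of the slope
    return {
        "b0": b0, "b1": b1, "sxx": sxx, "syy": syy, "sse": sse, "ssr": ssr,
        "s": s, "se_b1": se_b1,
        "t": b1 / se_b1,                    # tests H0: beta1 = 0
        "F": (ssr / 1) / (sse / (n - 2)),   # MSR / MSE
        "residuals": residuals,
    }

# Example 1 data: drug percentage (x) and reaction time in seconds (y).
x = [1, 2, 3, 4, 5]
y = [1, 1, 2, 2, 4]
fit = fit_simple_linear(x, y)

T_CRIT = 3.182  # assumed table value of t_{0.025} with n - 2 = 3 d.f.
print(f"b0 = {fit['b0']:.3f}, b1 = {fit['b1']:.3f}")
print(f"SSE = {fit['sse']:.3f}, s = {fit['s']:.3f}")
print(f"t = {fit['t']:.3f}, F = {fit['F']:.3f}")
print("reject H0 at the 5% level:", abs(fit["t"]) > T_CRIT)
ci = (fit["b1"] - T_CRIT * fit["se_b1"], fit["b1"] + T_CRIT * fit["se_b1"])
print(f"95% C.I. for beta1: ({ci[0]:.3f}, {ci[1]:.3f})")
```

In simple linear regression the F statistic equals the square of the t statistic, so the two tests agree; the sketch can be used to check hand calculations for Example 1.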
‐ Example 1:

Inferences about $\beta_0$
There are only infrequent occasions when we wish to make inferences concerning $\beta_0$, the intercept of the regression line. These occur when the scope of the model includes x = 0.
‐ $E(\hat\beta_0) = \beta_0$
‐ $\mathrm{Var}(\hat\beta_0) = \sigma^2 \left( \frac{1}{n} + \frac{\bar{x}^2}{S_{xx}} \right) = \sigma^2 \frac{\sum_i x_i^2}{n S_{xx}}$
‐ The $100(1-\alpha)\%$ confidence interval for $\beta_0$ is
  $\hat\beta_0 \pm t_{\alpha/2}(n-2)\, s \sqrt{\frac{\sum_i x_i^2}{n S_{xx}}}$
‐ A hypothesis test can be done with the test statistic
  $t = \frac{\hat\beta_0 - 0}{s \sqrt{\sum_i x_i^2 / (n S_{xx})}} \sim T(n-2)$ under $H_0$

5. Correlation Coefficient

‐ The claim is often made that the number of cigarettes smoked and the incidence of lung cancer are "highly correlated". Also, the amount of calcium intake and the incidence of osteoporosis are "correlated".
‐ The concept of correlation provides us with another measure of the usefulness of the model in regression analysis. The coefficient of correlation is a measure of the strength of the relationship between two variables x and y.
‐ The sample correlation coefficient is denoted by r. The population correlation coefficient is denoted by $\rho$.
‐ Testing $H_0: \rho = 0$ vs. $H_a: \rho \neq 0$ is identical to testing $H_0: \beta_1 = 0$ vs. $H_a: \beta_1 \neq 0$.
‐ The population correlation coefficient $\rho$ of X and Y is defined by
  $\rho = \frac{\mathrm{Cov}(X, Y)}{\sqrt{\mathrm{Var}(X)\,\mathrm{Var}(Y)}}$,
  where $\mathrm{Cov}(X, Y) = E[(X - E(X))(Y - E(Y))]$.
‐ The sample correlation coefficient r:
  $r_{yx} = \frac{S_{xy}}{\sqrt{S_{xx}} \sqrt{S_{yy}}}$
‐ A correlation measures the strength of the linear relation between x and y. The stronger the correlation, the better x linearly predicts y.
‐ $r_{yx} > 0$ if y tends to increase as x increases (p.591)
‐ $r_{yx} < 0$ if y tends to decrease as x increases
‐ $r_{yx} = 0$ if there is no linear relation between y and x
‐ Generally $-1 \le r_{yx} \le 1$.
‐ If $r_{yx} = \pm 1$, it indicates perfect predictability.
‐ Hypothesis test for $H_0: \rho = 0$ vs. $H_a: \rho \neq 0$:
  o Test statistic: $t = r \frac{\sqrt{n-2}}{\sqrt{1 - r^2}} \sim T(n-2)$ under $H_0$
‐ Relation with simple linear regression: if $\hat{y} = \hat\beta_0 + \hat\beta_1 x$, then
  $\hat\beta_1 = \frac{S_{xy}}{S_{xx}} = r_{yx} \sqrt{\frac{S_{yy}}{S_{xx}}}$

6. Coefficient of Determination

‐ The coefficient of determination, $R^2$, gives the proportion of the total variability in the y‐values that can be explained by the linear relationship (model) between y and x:
  $R^2 = \frac{SSR}{TSS} = 1 - \frac{SSE}{TSS}$
‐ Thus $R^2$ gives the proportion of the total variability in the y‐values that can be accounted for by the independent variable x through regression.
‐ $0 \le R^2 \le 1$;
‐ $R^2 = 1$: the predictor variable x accounts for all variation in the observations y;
‐ $R^2 = 0$: the predictor variable x is of no help in reducing the variation in the observations y.
‐ For simple linear regression, $R^2$ is simply the square of the correlation coefficient, $r^2$.
‐ Example 1:

7. Using the Model for Estimation and Prediction

‐ The most common uses of a regression model for making inferences can be divided into two categories:
  (1) Estimating the mean value of y, i.e., E(y), for a specific value of x.
  (2) Predicting a new individual y‐value for a given x.
‐ The error in predicting a particular value of y (2) will always be larger than the error in estimating the mean value of y (1) for a given x. This is because we are predicting a random variable (a future value of y) rather than a constant, E(y).
‐ Illustration of Estimation vs. Prediction

Estimating E(y) for a given x, say $x_p$
‐ The best estimator of the mean of y, E(y), for a given specific value of x, say $x_p$, is $\hat{y} = \hat\beta_0 + \hat\beta_1 x_p$. The standard deviation of the sampling distribution of $\hat{y}$ is
  $\sigma \sqrt{\frac{1}{n} + \frac{(x_p - \bar{x})^2}{S_{xx}}}$
‐ The $100(1-\alpha)\%$ confidence interval is
  $\hat{y} \pm t_{\alpha/2}(n-2)\, s \sqrt{\frac{1}{n} + \frac{(x_p - \bar{x})^2}{S_{xx}}}$

Predicting y for a given x, say $x_p$
‐ The best predictor of an individual new y for a given specific value of x, say $x_p$, is $\hat{y} = \hat\beta_0 + \hat\beta_1 x_p$. The standard error of prediction is
  $\sigma \sqrt{1 + \frac{1}{n} + \frac{(x_p - \bar{x})^2}{S_{xx}}}$
‐ The $100(1-\alpha)\%$ prediction interval is
  $\hat{y} \pm t_{\alpha/2}(n-2)\, s \sqrt{1 + \frac{1}{n} + \frac{(x_p - \bar{x})^2}{S_{xx}}}$
‐ This interval is a prediction interval for the value of y that would be observed for a given value of x. It is typically much wider than the confidence interval for $E(y \mid x)$ given above.
‐ Example 1:
‐ Illustration of Confidence Bands and Prediction Bands

Extrapolation
‐ Using the least squares prediction equation to estimate the mean value of y, or to predict a particular value of y, for values of x that fall outside the range of the x values contained in your sample data may lead to errors of estimation or prediction that are much larger. (Extrapolation penalty = $\frac{(x_p - \bar{x})^2}{S_{xx}}$.)
‐ Although the least squares model may provide a very good fit to the data over the range of x values contained in the sample, it could give a poor representation of the true model for values of x outside this region.

8. Diagnostics

Residual Analysis
So far we have been concerned with how well a linear regression model $y = \beta_0 + \beta_1 x + \varepsilon$ fits the data.
‐ A scatterplot of y vs. x can help us see whether a linear model is appropriate.
‐ We never know for certain that the model assumptions are satisfied.
‐ Because the assumptions all concern the random error component $\varepsilon$, the first step is to estimate the random error.
‐ The residuals can be used to check the model assumptions. Such checks are generally referred to as residual analysis.

Residual plots: $e_i$ vs. $\hat{y}_i$
See if there is any pattern.
(a) No apparent pattern.
(b) Shows nonlinearity. There is need for a higher‐order model.
(c) Presence of outliers (i.e., unusually large residuals).
(d) Non‐constancy of error variance. Weighted LS can be applied.
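Returning to the interval formulas of Section 7: the confidence interval for the mean response and the prediction interval for a new observation differ only by the extra 1 under the square root. A minimal Python sketch for the Example 1 data follows. The function name `intervals_at` is hypothetical, and the critical value 3.182 for $t_{0.025}(3)$ is an assumed table value.

```python
import math

def intervals_at(x, y, xp, t_crit):
    """Return (y_hat, ci_half_width, pi_half_width) at x = xp."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    s = math.sqrt(sse / (n - 2))

    penalty = (xp - x_bar) ** 2 / sxx   # grows as xp moves away from x_bar
    y_hat = b0 + b1 * xp
    ci_half = t_crit * s * math.sqrt(1 / n + penalty)       # mean response
    pi_half = t_crit * s * math.sqrt(1 + 1 / n + penalty)   # new observation
    return y_hat, ci_half, pi_half

x = [1, 2, 3, 4, 5]
y = [1, 1, 2, 2, 4]
T_CRIT = 3.182  # assumed table value of t_{0.025} with 3 d.f.

# xp = 3 is the sample mean of x; xp = 8 lies outside the observed range.
for xp in (3, 4, 8):
    y_hat, ci_half, pi_half = intervals_at(x, y, xp, T_CRIT)
    print(f"xp = {xp}: y_hat = {y_hat:.2f}, "
          f"95% CI = +/- {ci_half:.2f}, 95% PI = +/- {pi_half:.2f}")
```

Both half‐widths are smallest at $x_p = \bar{x}$ and grow as $x_p$ moves away from it, which is the extrapolation penalty at work.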
Transformation
‐ If a simple linear regression model is not appropriate for a data set, there are two choices:
  (1) Abandon the regression model and develop and use a more appropriate model;
  (2) Employ some transformation of the data so that the simple linear regression model is appropriate for the transformed data.
‐ We now consider the use of transformations of one or both of the original variables before carrying out the regression analysis. Simple transformations of either y or x or both are often sufficient to make the simple linear regression model appropriate for the transformed data.
‐ Here are some guidelines. However, the best way of deciding on the transformation is to try several and pick the best one.
1. If the plot indicates a relation that is increasing but at a decreasing rate, and if variability around the curve is roughly constant, transform x to $\sqrt{x}$ or $\log(x)$; or, if the plot indicates a relation that is decreasing but at a decreasing rate, try $x^{-1}$.
2. If the plot indicates a relation that is increasing at an increasing rate, and if variability is roughly constant, or if the plot indicates a relation that increases to a maximum and then decreases, and if variability around the curve is roughly constant, try using both x and $x^2$ as predictors. This is related to multiple regression.
3. If the plot indicates a relation that is increasing at a decreasing rate, and if variability around the curve increases as the predicted y value increases, try using $y^2$ as the dependent variable.
4. If the plot indicates a relation that is increasing at an increasing rate, and if variability around the curve increases as the predicted y value increases, try using $\log(y)$ as the dependent variable.
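The advice to try several transformations and keep the best one can be sketched in Python. The data below are synthetic (generated exactly as y = 2 + 3 log x, an assumption made purely for illustration), and `r_squared` is a hypothetical helper that compares candidates by $R^2$ alone; in practice the residual plots should be checked as well.

```python
import math

def r_squared(x, y):
    """R^2 of the simple linear regression of y on x (equals r^2)."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    return sxy ** 2 / (sxx * syy)

# Synthetic data: increasing at a decreasing rate (guideline 1).
x = list(range(1, 11))
y = [2 + 3 * math.log(xi) for xi in x]

# Candidate transformations of the predictor x.
candidates = {
    "x": lambda v: v,
    "sqrt(x)": math.sqrt,
    "log(x)": math.log,
}
scores = {name: r_squared([f(xi) for xi in x], y)
          for name, f in candidates.items()}

for name, score in scores.items():
    print(f"{name:8s} R^2 = {score:.4f}")
best = max(scores, key=scores.get)
print("best transformation of x:", best)
```

Because the synthetic response is exactly linear in log x, the log transformation wins here; with real data the ranking is rarely this clear‐cut.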
Assumption: Linearity, $E(\varepsilon) = 0$
  Violation: Patterns (in the residual plot)
  Solution: Transformation; add more variables
Assumption: Constant variance, $\mathrm{Var}(\varepsilon) = \sigma^2$
  Violation: Patterns (in the residual plot)
  Solution: Log transformation; weighted least squares
Assumption: Normality
  Violation: Outliers (outside 3s); skewed distribution; nonlinearity in normal‐probability plot
  Solution: Remove outliers; other distributions (generalized linear models)
Assumption: Independence
  Violation: Positive or negative correlations (Durbin‐Watson test)
  Solution: Time series

Durbin‐Watson test
  $d = \frac{\sum_{i=2}^{n} (e_i - e_{i-1})^2}{\sum_{i=1}^{n} e_i^2}$
If d < 1.5, positive serial correlation. If d > 2.5, negative serial correlation. If d is approximately 2.0, no serial correlation.
Idea:
‐ positive correlation: $e_i \approx e_{i-1}$, so d is small
‐ negative correlation: big variation between $e_i$ and $e_{i-1}$, so d is large
‐ no correlation: d is in between

9. Outliers

Outliers: Data points that diverge from the overall pattern and have large residuals are called outliers. In the residual plot, look for points more than 3s above or below the zero line. A safe rule is to discard an outlier only if it represents an error in recording, a miscalculation, or a similar circumstance.

Standardized residuals
The residuals divided by the estimates of their standard errors. They have approximately mean 0 and standard deviation 1. There are two common ways to calculate the standardized residual for the i‐th observation. One uses the residual mean square error from the model fitted to the full dataset (standardized residuals). The other uses the residual mean square error from the model fitted to all of the data except the i‐th observation (studentized residuals).

Influential Points
Influential points are data points with extreme values that greatly affect the slope of the regression line. Influential points should be investigated.

Influential points diagnosis
Outliers and high leverage points can be influential points; that is, they can greatly influence what the intercept and the slope of the least squares line will be.
(Source: http://www.public.iastate.edu/~rdecook/stat101/slides_for_web/Chapter9_mine.ppt)

What is a leverage point?
[Figure: two scatterplots against the x‐variable. In the first, a point that is an outlier with respect to the x‐variable is a point of high leverage. In the second, a point that is not an outlier with respect to the x‐variable is not a point of high leverage.]
Detection of influential points
1. Cook's distance for the i‐th observation is based on the differences between the predicted responses from the model constructed from all of the data and the predicted responses from the model constructed by setting the i‐th observation aside. Some analysts suggest investigating observations for which Cook's distance is greater than 1. Others suggest looking at a dot plot to find extreme values.
2. DFFITS is the scaled difference between the predicted responses from the model constructed from all of the data and the predicted responses from the model constructed by setting the i‐th observation aside. Some analysts suggest investigating observations for which $|DFFITS_i|$ is greater than $2\sqrt{(p+1)/(n-p-1)}$. Others suggest looking at a dot plot to find extreme values.
3. DFBETAS are similar to DFFITS. Instead of looking at the difference in fitted value when the i‐th observation is included or excluded, DFBETAS looks at the change in each regression coefficient. Investigate if $|DFBETAS| > 2$ or $2/\sqrt{n}$.
4. COVRATIO measures the change in the determinant of the covariance matrix of the estimates caused by deleting the i‐th observation. Investigate if $|COVRATIO - 1| \ge 3p/n$.
...
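For simple linear regression, several of the diagnostics in Sections 8 and 9 have closed forms. The Python sketch below, run on the Example 1 data, assumes the usual textbook definitions: leverage $h_i = 1/n + (x_i - \bar{x})^2 / S_{xx}$, internally standardized residual $r_i = e_i / (s\sqrt{1 - h_i})$, Cook's distance $D_i = r_i^2 h_i / (k(1 - h_i))$ with k = 2 estimated coefficients, and the Durbin‐Watson statistic $d = \sum_{i=2}^{n}(e_i - e_{i-1})^2 / \sum_i e_i^2$. The leverage and Cook's distance formulas and the function name `diagnostics` are not taken from the notes and should be checked against your software's documentation.

```python
import math

def diagnostics(x, y):
    """Leverage, standardized residuals, Cook's distance, Durbin-Watson d."""
    n = len(x)
    k = 2  # number of estimated coefficients (intercept and slope)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    e = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
    sse = sum(ei ** 2 for ei in e)
    s = math.sqrt(sse / (n - k))

    leverage = [1 / n + (xi - x_bar) ** 2 / sxx for xi in x]
    std_resid = [ei / (s * math.sqrt(1 - hi)) for ei, hi in zip(e, leverage)]
    cooks_d = [ri ** 2 * hi / (k * (1 - hi))
               for ri, hi in zip(std_resid, leverage)]
    # Durbin-Watson: squared successive differences over squared residuals.
    dw = sum((e[i] - e[i - 1]) ** 2 for i in range(1, n)) / sse
    return leverage, std_resid, cooks_d, dw

x = [1, 2, 3, 4, 5]
y = [1, 1, 2, 2, 4]
leverage, std_resid, cooks_d, dw = diagnostics(x, y)

# Flag rules from the notes: residual beyond 3 s.d. or Cook's distance > 1.
for i, (h, r, d) in enumerate(zip(leverage, std_resid, cooks_d), start=1):
    flag = "  <- investigate" if abs(r) > 3 or d > 1 else ""
    print(f"obs {i}: leverage = {h:.2f}, std. resid = {r:+.2f}, "
          f"Cook's D = {d:.2f}{flag}")
print(f"Durbin-Watson d = {dw:.2f}")
```

The Durbin‐Watson value is only meaningful when the observations have a natural order, such as time; for Example 1 it is computed purely to show the arithmetic.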
This note was uploaded on 03/16/2011 for the course STAT 6220 taught by Professor Staff during the Spring '08 term at UGA.