Chapter 10, Section 10.1: Statistical Model for Linear Regression

We now expand the basic linear regression model from Chapter 2. Linear regression fits a straight line through the data points that minimizes the sum of the squared vertical distances between the data points and the line; for this reason the procedure is often called least-squares regression.

- It minimizes $\sum_{i=1}^{n} e_i^2$, the sum of the squared residuals.
- The equation of the line is $\hat{y} = b_0 + b_1 x$, where $\hat{y}$ is the predicted value of y.
- The slope of the line is $b_1 = r \frac{s_y}{s_x}$. The slope measures the change in the response variable y associated with a one-unit increase in the explanatory variable x.
- The intercept of the line is $b_0 = \bar{y} - b_1 \bar{x}$. The intercept is the value of the response variable y when the explanatory variable x = 0.

[Figure: a least-squares regression line, with the y-intercept and the equation of the line labeled.]

Population

- In linear regression the explanatory (independent) variable x can take many different values, and for each value of x there exists a distribution of values for y. (For example, the response y to a given x is a random variable that will take different values if we have several observations with the same x-value.) The statistical model for simple linear regression assumes that the values of y are normally distributed about a mean that depends on x; $\mu_y$ denotes these means. We are interested in how these means $\mu_y$ change as x changes.
- For any fixed value of x, the response y varies according to a normal distribution, and repeated responses y are independent of each other. One can check this normality assumption with a normal quantile plot of the residuals.
- The mean response $\mu_y$ has a straight-line relationship with x: $\mu_y = \beta_0 + \beta_1 x$. We have changed our notation because we are now focusing on the population slope $\beta_1$ and the population intercept $\beta_0$. To check whether the relationship is roughly linear, make a scatterplot or a residual plot.
- The standard deviation of y, $\sigma$, is the same for all values of x. The value of $\sigma$ is unknown. To check for constant variability, look at a residual plot.
- The statistical model for simple linear regression also states that the observed response $y_i$ for an explanatory value $x_i$ is
  $y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$.
  Here $\beta_0 + \beta_1 x_i$ is the mean response when $x = x_i$. The deviations $\varepsilon_i$ are assumed to be independent and normally distributed with mean 0 and standard deviation $\sigma$.

Before you can start formal inference, you need to check that the above assumptions have been met. In addition, you need to check for outliers and/or influential observations.

- The parameters of the model are $\beta_0$, $\beta_1$, and $\sigma$.

Estimating the regression parameters

We now estimate the parameter values. Our goal is to study or predict the behavior of y for given values of x. We take n observations on an explanatory variable x and a response variable y:
$(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$

The slope $b_1$ and intercept $b_0$ of the least-squares line $\hat{y} = b_0 + b_1 x$ are sample statistics; they are estimates of the population slope $\beta_1$ and the population intercept $\beta_0$, respectively. Again, the formula for the slope of the least-squares line is
$b_1 = r \frac{s_y}{s_x}$
and the intercept is
$b_0 = \bar{y} - b_1 \bar{x}$.

The residuals are
$e_i = \text{observed} - \text{predicted} = y_i - \hat{y}_i = y_i - b_0 - b_1 x_i$.

The estimate of $\sigma$, which measures the variation of y about the population regression line, is
$s^2 = \frac{\sum e_i^2}{n - 2}$, with $s = \sqrt{s^2}$.
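These estimates are easy to compute directly. Below is a minimal Python sketch (assuming NumPy is available; the data are the degree-days and gas-consumption values from the Problem 10.12 example later in this section) that reproduces the slope, intercept, and s from the formulas above.

```python
import numpy as np

# Degree-days (x) and gas consumption (y) from Problem 10.12 below
x = np.array([15.6, 26.8, 37.8, 36.4, 35.5, 18.6, 15.3, 7.9, 0.0])
y = np.array([5.2, 6.1, 8.7, 8.5, 8.8, 4.9, 4.5, 2.5, 1.1])
n = len(x)

r = np.corrcoef(x, y)[0, 1]              # sample correlation
b1 = r * y.std(ddof=1) / x.std(ddof=1)   # slope: b1 = r * s_y / s_x
b0 = y.mean() - b1 * x.mean()            # intercept: b0 = ybar - b1 * xbar

e = y - (b0 + b1 * x)                    # residuals: e_i = y_i - yhat_i
s = np.sqrt(np.sum(e**2) / (n - 2))      # estimate of sigma, df = n - 2

print(b1, b0, s)                         # roughly 0.202, 1.232, 0.435
```

The printed values agree with the SPSS coefficients shown later in this section.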
Confidence intervals and significance tests

A level C confidence interval for the intercept $\beta_0$ is
$b_0 \pm t^* SE_{b_0}$.

A level C confidence interval for the slope $\beta_1$ is
$b_1 \pm t^* SE_{b_1}$.

Check whether the value 0 is included in the confidence interval. If it is, the slope could be 0, and that would mean that there is no linear relationship between y and x.

In these expressions $t^*$ is the value for the $t(df)$ density curve with area C between $-t^*$ and $t^*$, where the degrees of freedom are df = n - 2.

Significance test for the slope. To test the hypothesis $H_0: \beta_1 = 0$, compute the test statistic
$t = \frac{b_1}{SE_{b_1}}$   (the SPSS output shows this value).

The degrees of freedom are df = n - 2. In terms of the random variable T having the t(df) distribution, the P-value for a test of $H_0$ against

$H_a: \beta_1 > 0$ is $P(T \ge t)$ (one-sided, right)
$H_a: \beta_1 < 0$ is $P(T \le t)$ (one-sided, left)
$H_a: \beta_1 \ne 0$ is $2P(T \ge |t|)$ (two-sided)

(The SPSS output shows the two-sided P-value.)

Example, Problem 10.12 (textbook): Utility companies need to estimate the amount of energy that will be used by their customers. The consumption of natural gas required for heating homes depends on the outdoor temperature: when the weather is cold, more gas is consumed. A study of one home recorded the average daily gas consumption y (in hundreds of cubic feet) for each month during one heating season. The explanatory variable x is the average number of heating degree-days per day during the month. One heating degree-day is accumulated for each degree a day's average temperature falls below 65 degrees F. An average temperature of 50 degrees, for example, corresponds to 15 degree-days. The data for October through June are given in the following table:

Degree-days        15.6  26.8  37.8  36.4  35.5  18.6  15.3  7.9  0.0
Gas consumption     5.2   6.1   8.7   8.5   8.8   4.9   4.5  2.5  1.1

We will start by checking the assumptions for the regression model:

- The mean response $\mu_y$ has a straight-line relationship with x: $\mu_y = \beta_0 + \beta_1 x$. The slope $\beta_1$ and intercept $\beta_0$ are unknown population parameters. To check whether the relationship is linear, we can make a scatterplot and a residual plot of the data.

[Scatterplot: gas consumption (0.0 to 8.0) versus degree-days (0.0 to 40.0).]

Notice that the scatterplot shows a linear relationship. Now let's look at the residual plot.

[Residual plot: unstandardized residuals (-0.60 to 0.90) versus degree-days (0.0 to 40.0).]

The residual plot shows random scatter, so it too indicates that the relationship is linear.

- The standard deviation of y, $\sigma$, is the same for all values of x. The value of $\sigma$ is unknown. The residual plot shows no funneling, so we can safely say the assumption of constant variability has been met.

- For any fixed value of x, the response y varies according to a normal distribution, and repeated responses y are independent of each other. To check the assumption of normality we will make a normal quantile/probability plot.

[Normal P-P plot of the regression standardized residuals (dependent variable: gas consumption), expected versus observed cumulative probability.]

The plot is okay except for one point which is well below the line. Nevertheless we will proceed.
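Before reading the estimates off the SPSS printout, here is a minimal sketch of how the confidence intervals and slope test above can be computed by hand. It continues from the previous sketch (reusing x, n, b0, b1, s) and uses the standard textbook standard-error formulas $SE_{b_1} = s/\sqrt{\sum(x_i-\bar{x})^2}$ and $SE_{b_0} = s\sqrt{1/n + \bar{x}^2/\sum(x_i-\bar{x})^2}$, which these notes use implicitly via SPSS.

```python
import numpy as np
from scipy import stats

# x, n, b0, b1, s are assumed to come from the previous sketch
Sxx = np.sum((x - x.mean())**2)
se_b1 = s / np.sqrt(Sxx)                           # standard error of the slope
se_b0 = s * np.sqrt(1/n + x.mean()**2 / Sxx)       # standard error of the intercept

tstar = stats.t.ppf(0.975, df=n - 2)               # t* for 95% confidence, df = n - 2
ci_b1 = (b1 - tstar * se_b1, b1 + tstar * se_b1)   # about (0.175, 0.229)
ci_b0 = (b0 - tstar * se_b0, b0 + tstar * se_b0)   # about (0.556, 1.909)

t = b1 / se_b1                                     # test statistic for H0: beta1 = 0
p = 2 * stats.t.sf(abs(t), df=n - 2)               # two-sided P-value, ~ 4.6e-07
```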
We can find all the estimates we need on the SPSS printout.

Descriptive Statistics

                    Mean    Std. Deviation
Gas Consumption     5.589        2.7438
Degree-days        21.544       13.4194

Model Summary(b)

   R     R Square   Adjusted R Square   Std. Error of the Estimate
 .989      .978           .975                    .4345

a Predictors: (Constant), Degree-days
b Dependent Variable: Gas Consumption

ANOVA(b)

             Sum of Squares   df   Mean Square      F       Sig.
Regression       58.907        1      58.907     311.972   .000(a)
Residual          1.322        7        .189
Total            60.229        8

a Predictors: (Constant), Degree-days
b Dependent Variable: Gas Consumption

Coefficients(a)

              Unstandardized Coefficients   Standardized Coefficients
                  B        Std. Error               Beta                 t       Sig.
(Constant)      1.232         .286
Degree-days      .202         .011                   .989             17.663    .000

a Dependent Variable: Gas Consumption

From the printout we get the following regression line:
$\hat{y} = 1.232 + 0.202x$,
where $\hat{y}$ is gas consumption and x is degree-days, with s = 0.4345.

The 95% CI for $\beta_0$ is (0.556, 1.909); 0 is not included in the interval.
The 95% CI for $\beta_1$ is (0.175, 0.229); 0 is not included in the interval.

To test $H_0: \beta_1 = 0$ versus $H_a: \beta_1 \ne 0$:
$t = \frac{b_1}{SE_{b_1}} = \frac{0.202}{0.011} \approx 17.663$
(SPSS computes 17.663 from the unrounded values). From the printout we get the P-value = .000, so we reject the null hypothesis. This agrees with the CI not including 0.

Note: if we were testing for significant evidence that the slope is positive, we would test the hypotheses $H_0: \beta_1 = 0$ versus $H_a: \beta_1 > 0$. In this case, we would divide the P-value from the printout by 2.

Also, SPSS can save some calculations for you for each value of your explanatory variable. Look at the printout below for the gas consumption and heating degree-days example. Note that it shows the predicted value and the residual for each observation; this appears in the Data View.

Predicted value   Residual
    4.38685        0.81315
    6.65162       -0.55162
    8.87595       -0.17595
    8.59285       -0.09285
    8.41086        0.38914
    4.99349       -0.09349
    4.32619        0.17381
    2.82983       -0.32983
    1.23235       -0.13235

Inference for Correlation

Recall: the correlation coefficient is a measure of the strength and direction of a linear relationship. When the population correlation $\rho = 0$, there is no linear association in the population between y and x. In the important case where the two variables x and y are both normally distributed, the condition $\rho = 0$ is equivalent to the statement that x and y are independent.

Test for a zero population correlation. To test the hypothesis $H_0: \rho = 0$, compute the t statistic
$t = \frac{r\sqrt{n-2}}{\sqrt{1-r^2}}$,
where n is the sample size and r is the sample correlation.

In terms of a random variable T having the t(n - 2) distribution, the P-value for a test of $H_0$ against

$H_a: \rho > 0$ is $P(T \ge t)$ (one-sided, right)
$H_a: \rho < 0$ is $P(T \le t)$ (one-sided, left)
$H_a: \rho \ne 0$ is $2P(T \ge |t|)$ (two-sided)

Example: for our example, test $H_0: \rho = 0$ versus $H_a: \rho > 0$:
$t = \frac{0.9890\sqrt{9-2}}{\sqrt{1-0.9781}} \approx 17.68$,
with P-value < 0.0005, so we reject our null hypothesis.

Testing whether the population correlation is zero is equivalent to testing whether the population slope is zero. Note that the two t values (17.68 here and 17.663 earlier) are very close; they differ only due to rounding.
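As a quick check, here is a minimal sketch of the correlation test (assuming SciPy; r and n are taken from the example above):

```python
import numpy as np
from scipy import stats

r, n = 0.9890, 9
t = r * np.sqrt(n - 2) / np.sqrt(1 - r**2)   # ~ 17.68, as computed above
p = stats.t.sf(t, df=n - 2)                  # one-sided P(T >= t), well below 0.0005
print(t, p)
```

Given the raw data instead of r, scipy.stats.pearsonr(x, y) returns the sample correlation together with the two-sided P-value for this same test.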
Analysis of Variance F test

Analysis of variance (ANOVA) determines what portion of the variation in the response variable can be explained by changes in the explanatory variable. The ANOVA portion of the SPSS printout gives us several values; we will briefly look at how these values were found.

ANOVA(b)

             Sum of Squares   df   Mean Square      F       Sig.
Regression       58.907        1      58.907     311.972   .000(a)
Residual          1.322        7        .189
Total            60.229        8

a Predictors: (Constant), Degree-days
b Dependent Variable: Gas Consumption

Below is a chart showing what the values above represent:

ANOVA         df      SS                                    MS              F         Significance F
Regression     1      $\sum(\hat{y}_i - \bar{y})^2$ = SSM   MSM = SSM/DFM   MSM/MSE   4.6E-07
Residual     n - 2    $\sum(y_i - \hat{y}_i)^2$ = SSE       MSE = SSE/DFE
Total        n - 1    $\sum(y_i - \bar{y})^2$ = SST         SST/DFT

In the simple linear regression model, the hypotheses
$H_0: \beta_1 = 0$ versus $H_a: \beta_1 \ne 0$
are also tested by the F statistic
$F = \frac{MSM}{MSE}$   (mean square for the model over mean square for error).

The P-value is the probability that the F statistic could be equal to or greater than the calculated value of F when $H_0$ is true. Notice that for our example F = 311.972, with P-value = 0.00000046, which implies that we reject $H_0$.

The F statistic tests the same null hypothesis as the t statistic for the slope; in simple linear regression, F = t^2. We prefer the t-test since it allows us to test one-sided alternatives.

Summary

We have used the following procedure to examine the relationship between two quantitative variables:

1. Graph the relationship, usually with a scatterplot. Describe the form, direction, and strength, and look for outliers.
2. Look at the correlation to get a numerical value for the direction and strength.
3. If the data are reasonably linear, get an equation of the line using least-squares regression.
4. Look at the residual plot to see whether there are any outliers and whether the residuals are of approximately equal magnitude across the x-axis. Outliers may signal the possibility of lurking variables.
5. Look at the normal probability plot to determine whether the residuals are normally distributed. (Dots sticking close to the 45-degree line are good.)
6. Look at hypothesis tests for the correlation, slope, and intercept, and at confidence intervals for the slope and intercept. The slope and the correlation should be significantly different from zero.
7. If you had an outlier, re-work the data without the outlier and comment on the differences in your results.
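Finally, a short sketch tying the ANOVA table back to the earlier estimates. It continues from the previous sketches (reusing x, y, n, b0, b1) and includes a check of the F = t^2 relationship noted above.

```python
import numpy as np
from scipy import stats

# x, y, n, b0, b1 are assumed to come from the earlier sketches
yhat = b0 + b1 * x                    # fitted values
SST = np.sum((y - y.mean())**2)       # total sum of squares, ~ 60.229
SSE = np.sum((y - yhat)**2)           # residual sum of squares, ~ 1.322
SSM = SST - SSE                       # model sum of squares, ~ 58.907

MSM = SSM / 1                         # DFM = 1 (one explanatory variable)
MSE = SSE / (n - 2)                   # DFE = n - 2 = 7
F = MSM / MSE                         # ~ 311.97; equals t**2 from the slope test
p = stats.f.sf(F, 1, n - 2)           # ~ 4.6e-07, matching the printout
print(F, p)
```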