Chapter 10, Section 10.1: Statistical Model for Linear Regression
We now expand our basic linear regression model from Chapter 2. Linear regression fits a straight line through the data points that minimizes the sum of the squared vertical distances between the data points and the line. Because of this, the procedure is often called Least Squares Regression.

• It minimizes Σ e_i², the sum of the squared residuals (summing over i = 1, ..., n).
• The equation of the line is ŷ = b0 + b1x, with ŷ the predicted y.
• The slope of the line is b1 = r·(s_y / s_x). The slope measures the change in the response variable Y when the explanatory variable X is increased by one unit.
• The intercept of the line is b0 = ȳ − b1·x̄. The intercept is the value of the response variable Y when the explanatory variable X = 0.
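The slope and intercept formulas above can be checked numerically. The sketch below uses a small made-up data set (the x and y values are hypothetical, not from the text) and computes b1 and b0 exactly as written: b1 = r·(s_y/s_x) and b0 = ȳ − b1·x̄.

```python
import math

# Hypothetical toy data, for illustration only (not from the text).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 8.1, 9.8]
n = len(x)

xbar = sum(x) / n
ybar = sum(y) / n

# Sample standard deviations s_x, s_y and the correlation r.
sx = math.sqrt(sum((xi - xbar) ** 2 for xi in x) / (n - 1))
sy = math.sqrt(sum((yi - ybar) ** 2 for yi in y) / (n - 1))
r = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / ((n - 1) * sx * sy)

b1 = r * sy / sx       # slope: b1 = r * (s_y / s_x)
b0 = ybar - b1 * xbar  # intercept: b0 = ybar - b1 * xbar

print(b1, b0)
```

The same b1 falls out of minimizing Σ e_i² directly; b1 = r·(s_y/s_x) is the least-squares solution rewritten in terms of the correlation.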
Population

• In linear regression the explanatory/independent variable X can take many different values. For each value of x there exists a distribution of values for Y. (For example, the response y to a given x is a random variable that will take different values if we have several observations with the same x-value.) The statistical model for simple linear regression assumes that the distribution of values for y is normal about a mean that depends on x; μ_y represents these means. We are interested in how these means, μ_y, change as x changes.
• For any fixed value of x, the response y varies according to a normal distribution. Repeated responses y are independent of each other. One can check this assumption of normality by doing a normal quantile plot of the residuals.

Lecture 12, Chapter 2 & Section 10.1

• The mean response μ_y has a straight-line relationship with x: μ_y = β0 + β1x. We have changed our notation because we are now focusing on the population slope, β1, and the population intercept, β0. To check whether the relationship is roughly linear, you can do a scatterplot or a residual plot.
• The standard deviation of y (σ) is the same for all values of x. The value of σ is unknown. To check for constant variability you can look at a residual plot.
• The statistical model for simple linear regression also states that the observed
response y_i for an explanatory variable x_i is

y_i = β0 + β1·x_i + ε_i

Here β0 + β1·x_i is the mean response when x = x_i. The deviations ε_i are assumed to be independent and normally distributed with mean 0 and standard deviation σ.

Before you can start formal inference, you need to check that the above assumptions have been met. In addition you need to check for outliers and/or influential observations.

• The parameters of the model are β0, β1, and σ.

Estimating the regression parameters

We now estimate the parameter values. Our goal is to study or predict the behavior of y for given values of x. We take n observations on an explanatory variable x and a response variable y:

(x1, y1), (x2, y2), ..., (xn, yn)
The slope b1 and intercept b0 of the least-squares line ŷ = b0 + b1x are sample statistics, and they are estimates of the population slope β1 and the population intercept β0, respectively. Again, the formula for the slope of the least-squares line is

b1 = r·(s_y / s_x)

and the intercept is

b0 = ȳ − b1·x̄

The residuals are

e_i = observed − predicted = y_i − ŷ_i = y_i − b0 − b1·x_i

The estimate for σ, which measures the variation of y about the population regression line, is

s² = Σ e_i² / (n − 2)   and   s = √s²

Confidence intervals and significance tests:

A level C confidence interval for the intercept β0 is b0 ± t*·SE_b0.

A level C confidence interval for the slope β1 is b1 ± t*·SE_b1.

Check to see whether the value 0 is included in the confidence interval for the slope. If it is, the slope could be 0, and that would mean that there is no linear relationship between Y and X. In these expressions t* is the value for the t(df) density curve with area C between −t* and t*, where the degrees of freedom are df = n − 2.
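The pieces above (residuals, s, and the slope interval) fit together as sketched below, using the same hypothetical toy data as before (n = 5, so df = 3). Two assumptions to flag: the standard-error formula SE_b1 = s/√Σ(x_i − x̄)² is the standard one (the text itself reads SE_b1 off the SPSS printout), and the critical value t* = 3.182 for df = 3 at 95% confidence is looked up from a t table rather than computed.

```python
import math

# Hypothetical toy data, for illustration only (not from the text).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 8.1, 9.8]
n = len(x)

xbar, ybar = sum(x) / n, sum(y) / n
sxx = sum((xi - xbar) ** 2 for xi in x)
b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
b0 = ybar - b1 * xbar

# Residuals e_i = y_i - b0 - b1*x_i, and s^2 = sum(e_i^2) / (n - 2).
resid = [yi - b0 - b1 * xi for xi, yi in zip(x, y)]
s = math.sqrt(sum(e ** 2 for e in resid) / (n - 2))

# Standard error of the slope (standard formula, not given in the text).
se_b1 = s / math.sqrt(sxx)

# 95% CI for beta_1: b1 +/- t* * SE_b1, with t* = 3.182 for df = n - 2 = 3
# (taken from a t table, not computed here).
t_star = 3.182
lo, hi = b1 - t_star * se_b1, b1 + t_star * se_b1

# t statistic for H0: beta_1 = 0, used in the significance test.
t_stat = b1 / se_b1

print(s, se_b1, (lo, hi), t_stat)
```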
Significance Test for slope

To test the hypothesis H0: β1 = 0, compute the test statistic

t = b1 / SE_b1

(The SPSS output shows this statistic.) The degrees of freedom are df = n − 2. In terms of the random variable T having the t(df) distribution, the P-value for a test of H0 against

Ha: β1 > 0 is P(T ≥ t) (one-sided, right)
Ha: β1 < 0 is P(T ≤ t) (one-sided, left)
Ha: β1 ≠ 0 is 2P(T ≥ |t|) (two-sided; the SPSS output shows this P-value)

Example, Problem 10.12 (Textbook):

Utility companies need to estimate the amount of energy that will be used by their
customers. The consumption of natural gas required for heating homes depends on
the outdoor temperature. When the weather is cold, more gas will be consumed. A study of one home recorded the average daily gas consumption y (in hundreds of
cubic feet) for each month during one heating season. The explanatory variable x is
the average number of heating degree-days per day during the month. One heating
degree-day is accumulated for each degree a day's average temperature falls below
65 degrees F. An average temperature of 50 degrees, for example, corresponds to
15 degree-days. The data for October through June are given in the following table:

Degree-days        15.6   26.8   37.8   36.4   35.5   18.6   15.3   7.9   0.0
Gas consumption     5.2    6.1    8.7    8.5    8.8    4.9    4.5   2.5   1.1

We will start by checking the assumptions for the regression model:

• The mean response μ_y has a straight-line relationship with x: μ_y = β0 + β1x. The slope β1 and intercept β0 are unknown population parameters.
To check whether the relationship is linear, we can make a scatterplot and residual plot of the data.

[Scatterplot: Gas Consumption (vertical axis) vs. Degree-days (horizontal axis)]

Notice that the scatterplot shows a linear relationship. Now let's look at the residual plot.
[Residual plot: Unstandardized Residual (vertical axis) vs. Degree-days (horizontal axis)]

The residual plot shows random scatter, so it too indicates that the relationship is linear.

• The standard deviation of y (σ) is the same for all values of x. The value of σ is unknown.

The residual plot shows no funneling, so we can safely say the assumption of constant variability has been met.
• For any fixed value of x, the response y varies according to a normal distribution. Repeated responses y are independent of each other.

To check the assumption of normality we will do a normal quantile/probability plot.

[Normal P-P Plot of Regression Standardized Residual; Dependent Variable: Gas Consumption; Observed Cum Prob (horizontal axis) vs. Expected Cum Prob (vertical axis)]

The plot is okay except for one point which is well below the line.
Nevertheless we will proceed. We can find all the estimates we need on the SPSS printout.

Descriptive Statistics
                    Mean     Std. Deviation
Gas Consumption     5.589    2.7438
Degree-days        21.544   13.4194
Model Summary(b)
Model    R       R Square    Std. Error of the Estimate
1        .989    .978        .4345
a Predictors: (Constant), Degree-days
b Dependent Variable: Gas Consumption

ANOVA(b): Regression, Residual, and Total rows (the full table with its values appears in the Analysis of Variance section below).
a Predictors: (Constant), Degree-days
b Dependent Variable: Gas Consumption

Coefficients
               Unstandardized Coefficients    Standardized Coefficients
               B        Std. Error            Beta                         t         Sig.
(Constant)     1.232    .286
Degree-days    .202     .011                  .989                         17.663    .000

From above we get the following regression line:

ŷ = 1.232 + 0.202x

where y is gas consumption and x is degree-days.
s = 0.4345

The 95% CI for β0 is (0.556, 1.909); 0 is not included in the interval.
The 95% CI for β1 is (0.175, 0.229); 0 is not included in the interval.

To test H0: β1 = 0 versus Ha: β1 ≠ 0:

t = b1 / SE_b1 = 0.202 / 0.011 ≈ 17.663 (SPSS computes this from the unrounded coefficient and standard error)

From the printout we get the following P-value = 0.000.
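The printout values above can be reproduced by hand from the data table; a sketch in plain Python (the results match SPSS up to rounding):

```python
import math

# Data from the example (degree-days, gas consumption in 100s of cubic feet).
x = [15.6, 26.8, 37.8, 36.4, 35.5, 18.6, 15.3, 7.9, 0.0]
y = [5.2, 6.1, 8.7, 8.5, 8.8, 4.9, 4.5, 2.5, 1.1]
n = len(x)

xbar, ybar = sum(x) / n, sum(y) / n
sxx = sum((xi - xbar) ** 2 for xi in x)
b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
b0 = ybar - b1 * xbar                 # b1 ~ .202, b0 ~ 1.232 (printout values)

resid = [yi - b0 - b1 * xi for xi, yi in zip(x, y)]
s = math.sqrt(sum(e ** 2 for e in resid) / (n - 2))   # ~ .4345

se_b1 = s / math.sqrt(sxx)
t = b1 / se_b1                        # ~ 17.663, as on the printout

print(round(b1, 3), round(b0, 3), round(s, 4), round(t, 2))
```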
So we would reject the null hypothesis. This agrees with the CI not including 0.

Note: if we were testing to see whether there is significant evidence that the slope is positive, we would be testing the hypotheses H0: β1 = 0 versus Ha: β1 > 0. In this case, we would need to divide the P-value from the printout by 2.

Also, SPSS can save some calculations for you for each value of your explanatory variable. Look at the printout below for the gas consumption / heating degree-days example. Note that it shows the predicted value and the residual value. This is in the Data View.
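The predicted and residual columns SPSS saves are just ŷ_i = b0 + b1·x_i and y_i − ŷ_i, computed with the unrounded coefficients; a sketch (refitting from the data rather than reusing the rounded 1.232 and 0.202):

```python
# Reproduce SPSS's saved predicted values and residuals for the example data.
x = [15.6, 26.8, 37.8, 36.4, 35.5, 18.6, 15.3, 7.9, 0.0]
y = [5.2, 6.1, 8.7, 8.5, 8.8, 4.9, 4.5, 2.5, 1.1]
n = len(x)

xbar, ybar = sum(x) / n, sum(y) / n
b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / \
     sum((xi - xbar) ** 2 for xi in x)
b0 = ybar - b1 * xbar

pred = [b0 + b1 * xi for xi in x]            # SPSS: 4.38685, 6.65162, ...
resid = [yi - p for yi, p in zip(y, pred)]   # SPSS: .81315 for the first row

print([round(p, 5) for p in pred])
```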
Predicted values: 4.38685, 6.65162, 8.87595, 8.59285, 8.41086, 4.99349, 4.32619, 2.82983, 1.23235
(The residual shown for the first observation is .81315 = 5.2 − 4.38685; the remaining residuals are computed the same way.)

Inference for Correlation

Recall: The correlation coefficient is a measure of strength and direction for a linear
relationship. When the population correlation (ρ) = 0, there is no linear association in the population between Y and X. In the important case where the two variables X and Y are both normally distributed, the condition ρ = 0 is equivalent to the statement that X and Y are independent.

Test for a Zero Population Correlation

To test the hypothesis H0: ρ = 0, compute the t statistic

t = r·√(n − 2) / √(1 − r²)

where n is the sample size and r is the sample correlation. In terms of a random variable T having the t(n − 2) distribution, the P-value for a test of H0 against

Ha: ρ > 0 is P(T ≥ t) (one-sided, right)
Ha: ρ < 0 is P(T ≤ t) (one-sided, left)
Ha: ρ ≠ 0 is 2P(T ≥ |t|) (two-sided)

Example:
For our example, test H0: ρ = 0 versus Ha: ρ > 0:

t = 0.9890·√(9 − 2) / √(1 − 0.9781) ≈ 17.68

(Same as the earlier t = 17.663, except for rounding.)

P-value < 0.0005, so we reject our null hypothesis.

Testing whether the population correlation is zero is equivalent to testing whether the population slope is zero. Note that both t values are very close; they are only different due to rounding.

Analysis of Variance F test:

Analysis of Variance (ANOVA) determines what portion of the variation of the
response variable can be explained by changes in the explanatory variable. The ANOVA SPSS printout gives us several values. We will briefly look at how these values were found.

ANOVA(b)
              Sum of Squares    df    Mean Square    F          Sig.
Regression    58.907            1     58.907         311.972    .000(a)
Residual       1.322            7       .189
Total         60.229            8
a Predictors: (Constant), Degree-days
b Dependent Variable: Gas Consumption

Below is a chart showing what the values above represent:

ANOVA
              df       SS                        MS               F          Significance F
Regression    1        Σ(ŷ_i − ȳ)² = SSM        MSM = SSM/DFM    MSM/MSE    4.6E-07
Residual      n − 2    Σ(y_i − ŷ_i)² = SSE      MSE = SSE/DFE
Total         n − 1    Σ(y_i − ȳ)² = SST        SST/DFT

In the simple linear regression model, the hypotheses

H0: β1 = 0
Ha: β1 ≠ 0

are also tested by the F statistic

F = MSM / MSE   (Mean Square Model over Mean Square Error)
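The chart above can be verified numerically on the example data; a sketch showing that SSM + SSE = SST and that F equals the square of the slope's t statistic:

```python
import math

# Verify the ANOVA decomposition for the degree-days example.
x = [15.6, 26.8, 37.8, 36.4, 35.5, 18.6, 15.3, 7.9, 0.0]
y = [5.2, 6.1, 8.7, 8.5, 8.8, 4.9, 4.5, 2.5, 1.1]
n = len(x)

xbar, ybar = sum(x) / n, sum(y) / n
sxx = sum((xi - xbar) ** 2 for xi in x)
b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
b0 = ybar - b1 * xbar
pred = [b0 + b1 * xi for xi in x]

ssm = sum((p - ybar) ** 2 for p in pred)             # Regression SS (~58.907)
sse = sum((yi - p) ** 2 for yi, p in zip(y, pred))   # Residual SS   (~1.322)
sst = sum((yi - ybar) ** 2 for yi in y)              # Total SS     (~60.229)

msm = ssm / 1          # DFM = 1 (one explanatory variable)
mse = sse / (n - 2)    # DFE = n - 2
f = msm / mse          # ~311.97, matching the printout

# In simple regression, F is the square of the slope's t statistic.
t = b1 / (math.sqrt(mse) / math.sqrt(sxx))

print(round(f, 2), round(t ** 2, 2))
```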
The P-value is the probability that the F statistic could be equal to or greater than the calculated value of F when H0 is true.

Notice, for our example F = 311.972, with P-value = 0.00000046, which implies that we reject H0.

The F statistic tests the same null hypothesis as one of the t statistics. We prefer the t test since it allows us to test one-sided alternatives.

Summary

We have used the following procedure to examine the relationship between two
quantitative variables:

1. Graph the relationship, usually with a scatterplot. Describe the form, direction, and strength. Look for outliers.
2. Look at the correlation to get a numerical value for the direction and strength.
3. If the data are reasonably linear, get an equation of the line using least-squares regression.
4. Look at the residual plot to see if there are any outliers, and whether the residuals are of approximately equal magnitude across the x-axis. Outliers may signal the possibility of lurking variables.
5. Look at the normal probability plot to determine whether the residuals are normally distributed. (Dots sticking close to the 45-degree line are good.)
6. Look at hypothesis tests for the correlation, slope, and intercept. Look at confidence intervals for the slope and intercept. The slope and the correlation should be significantly different from zero.
7. If you have an outlier, you should rework the data without the outlier and comment on the differences in your results.
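Step 7 can be sketched on the chapter's example by refitting without the observation that has the largest absolute residual. (Illustration only: which point, if any, to set aside should be guided by the plots and the context, and the point flagged on the P-P plot need not be the one removed here.)

```python
# Step 7 sketch: refit without the largest-|residual| point and compare slopes.
x = [15.6, 26.8, 37.8, 36.4, 35.5, 18.6, 15.3, 7.9, 0.0]
y = [5.2, 6.1, 8.7, 8.5, 8.8, 4.9, 4.5, 2.5, 1.1]

def fit(xs, ys):
    """Least-squares slope and intercept."""
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(xs, ys)) / \
         sum((xi - xbar) ** 2 for xi in xs)
    return b1, ybar - b1 * xbar

b1, b0 = fit(x, y)
resid = [yi - b0 - b1 * xi for xi, yi in zip(x, y)]

# Index of the observation with the largest absolute residual.
k = max(range(len(x)), key=lambda i: abs(resid[i]))

x2 = x[:k] + x[k + 1:]
y2 = y[:k] + y[k + 1:]
b1_new, b0_new = fit(x2, y2)

print(x[k], round(b1, 3), round(b1_new, 3))
```

Here the slope barely moves, so the conclusions of the analysis would not change; with a more influential point the comparison could look very different.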
Page 11 ...
View
Full Document
 Spring '08
 Staff
 Normal Distribution, Regression Analysis, Errors and residuals in statistics, Residual Plot, gas consumption

Click to edit the document details