Unformatted text preview: Chapter 12
Simple Linear Regression
Part A • Simple Linear Regression Model
• Least Squares Method
• Coefficient of Determination
• Model Assumptions
• Testing for Significance Simple Linear Regression
Simple Linear Regression Managerial decisions often are based on the relationship between two or more variables. Regression analysis can be used to develop an equation showing how the variables are related. The variable being predicted is called the dependent variable and is denoted by y. The variables being used to predict the value of the dependent variable are called the independent variables and are denoted by x. Simple Linear Regression
Simple Linear Regression Simple linear regression involves one independent variable and one dependent variable. The relationship between the two variables is approximated by a straight line. Regression analysis involving two or more independent variables is called multiple regression. Simple Linear Regression Model The equation that describes how y is related to x and an error term is called the regression model. The simple linear regression model is:
y = β 0 + β 1x +ε where: β 0 and β 1 are called parameters of the model, ε is a random variable called the error term. Simple Linear Regression Equation
Simple Linear Regression Equation
s The simple linear regression equation is:
E(y) = β 0 + β 1x • Graph of the regression equation is a straight line.
• β 0 is the y intercept of the regression line.
• β 1 is the slope of the regression line.
• E(y) is the expected value of y for a given x value. Simple Linear Regression
Equation
s Positive Linear Relationship
E (y )
Regression line
Intercept β0 Slope β 1
is positive
x Simple Linear Regression Equation
Simple Linear Regression Equation
s Negative Linear Relationship
E (y )
Intercept β0 Regression line Slope β 1
is negative
x Simple Linear Regression Equation
Simple Linear Regression Equation
s No Relationship
E (y )
Intercept Regression line β0 Slope β 1
is 0
x Estimated Simple Linear Regression Equation
Estimated Simple Linear Regression Equation
s The estimated simple linear regression equation ˆ
y = b0 + b1 x • The graph is called the estimated regression line. • b0 is the y intercept of the line.
• b1 is the slope of the line.
ˆ •y is the estimated value of y for a given x value. Estimation Process
Regression Model
y = β 0 + β 1x +ε
Regression Equation
E(y) = β 0 + β 1x
Unknown Parameters
β 0, β 1 Sample Data:
x y
x1 y1
. . . . xn yn b0 and b1
provide estimates of
β 0 and β 1 Estimated
Regression Equation
ˆ
y = b0 + b1 x Sample Statistics
b0, b1 Least Squares Method
• Least Squares Criterion m in ∑ ( y i − y i ) 2 where: yi = observed value of the dependent variable for the ith observation
^
yi = estimated value of the dependent variable for the ith observation Least Squares Method
• Slope for the Estimated Regression Equation b1 ∑ (x − x )( y − y )
=
∑(x − x )
i i i 2 Least Squares Method
s yIntercept for the Estimated Regression Equation b0 = y − b1 x
where: xi = value of independent variable for ith observation
yi = value of dependent variable for ith
_ observation
x = mean value for independent variable
_
y = mean value for dependent variable
n = total number of observations Simple LinearRegression
s Example: Reed Auto Sales
Reed Auto periodically has a special weeklong sale. As part of the advertising
campaign Reed runs one or
more television commercials
during the weekend preceding the sale. Data from a sample of 5 previous sales are shown on the next slide. Simple Linear Regression
Simple Linear Regression
s Example: Reed Auto Sales
Number of TV Ads
1
3
2
1
3 Number of
Cars Sold
14
24
18
17
27 Estimated Regression Equation
• Slope for the Estimated Regression Equation
b1 ∑( x − x )( y − y ) = 20 = 5
=
4
∑(x − x )
i i i 2 • yIntercept for the Estimated Regression Equation
b0 = y − b1 x = 20 − 5(2) = 10
• Estimated Regression Equation ˆ
y = 10 + 5x Estimated Regression Equation
s Excel Worksheet (showing data)
1
2
3
4
5
6
7 A
Week
1
2
3
4
5 B
TV Ads
1
3
2
1
3 C
Cars Sold
14
24
18
17
27 D Estimated Regression Equation
s Producing a Scatter Diagram
Step 1 Select cells B1:C6
Step 2 Click the Chart Wizard
Step 3 When the Chart Type dialog box appears: Choose XY (Scatter) in the Chart type list Choose Scatter from the Chart subtype display Select Next >
Step 4 When the Chart Source Data dialog box appears Select Next > Estimated Regression Equation
s Producing a Scatter Diagram
Step 5 When the Chart Options dialog box appears: Select the Titles tab and then
Delete Cars Sold in the Chart title box
Enter TV Ads in the Value (X) axis box
Enter Cars Sold in the Value (Y) axis box Select the Legend tab and then
Remove the check in the Show Legend box
Select Next > Estimated Regression Equation
s Producing a Scatter Diagram
Step 6 When the Chart Location dialog box appears: Specify the location for the new chart Select Finish to display the scatter diagram Estimated Regression Equation
s Adding the Trendline
Step 1 Position the mouse pointer over any data point and right click to display the Chart menu
Step 2 Select the Add Trendline option
Step 3 When the Add Trendline dialog box appears: On the Type tab select Linear On the Options tab select the Display
equation on chart box Click OK Scatter Diagram and Trend Line
30 Cars Sold 25
20 y = 5x + 10 15
10
5
0
0 1 2
TV Ads 3 4 Coefficient of Determination
• Relationship Among SST, SSR, SSE
SST = SSR + SSE
ˆ
ˆ
( y i − y )2 = ∑ ( y i − y )2 + ∑ ( y i − y i )2
∑ where: SST = total sum of squares SSR = sum of squares due to regression SSE = sum of squares due to error Coefficient of Determination
s The coefficient of determination is:
r2 = SSR/SST
where:
SSR = sum of squares due to regression
SST = total sum of squares Coefficient of Determination
Coefficient of Determination
r2 = SSR/SST = 100/114 = .8772 The regression relationship is very strong; 88%
of the variability in the number of cars sold can be
explained by the linear relationship between the
number of TV ads and the number of cars sold. Sample Correlation Coefficient
rxy = (sign of b1 ) Coefficient of Determination
rxy = (sign of b1 ) r 2
where: b1 = the slope of the estimated regression ˆ equation y = b0 + b1 x Sample Correlation Coefficient
rxy = (sign of b1 ) r 2
ˆ
5x
The sign of b1 in the equationy = 10 + is “+”. rxy = + .8772 rxy = +.9366 Assumptions About the Error Term
ε
1. The error ε is a random variable with mean of zero.
2. The variance of ε , denoted by σ 2, is the same for all values of the independent variable.
3. The values of ε are independent. are independent.
4. The error ε is a normally distributed random variable. Testing for Significance To test for a significant regression relationship, we must conduct a hypothesis test to determine whether the value of β 1 is zero. Two tests are commonly used:
t Test and F Test Both the t test and F test require an estimate of σ 2, the variance of ε in the regression model. in the regression model. Testing for Significance
• An Estimate of σ The mean square error (MSE) provides the estimate
of σ 2, and the notation s2 is also used.
s 2 = MSE = SSE/(n − 2)
where: ˆ
SSE = ∑ ( yi − yi ) 2 = ∑ ( yi − b0 − b1 xi ) 2 Testing for Significance
• An Estimate of σ • To estimate σ we take the square root of σ 2.
• The resulting s is called the standard error of the estimate. SSE
s = MSE =
n−2 Testing for Significance: t Test
• Hypotheses • Test Statistic H 0 : β 1 = 0
H a : β 1 ≠ 0 b1
t=
sb1 Testing for Significance: t Test
s Rejection Rule
Reject H0 if pvalue < α or t < tα/2 or t > tα/2
or where: tα/2 is based on a t distribution
with n 2 degrees of freedom Testing for Significance: t Test
1. Determine the hypotheses.
1. Determine the hypotheses. H 0 : β 1 = 0 H a : β 1 ≠ 0 2. Specify the level of significance.
3. Select the test statistic.
4. State the rejection rule. α = .05 b1
t=
sb1
Reject H0 if pvalue < .05
or t > 3.182 (with
3 degrees of freedom) Testing for Significance: t Test
Testing for Significance: 5. Compute the value of the test statistic. b1
5
t=
=
= 4.63
sb1 1.08
6. Determine whether to reject H0.
t = 4.541 provides an area of .01 in the upper
tail. Hence, the pvalue is less than .02. (Also,
t = 4.63 > 3.182.) We can reject H0. Confidence Interval for β 1 We can use a 95% confidence interval for β 1 to test the hypotheses just used in the t test. H0 is rejected if the hypothesized value of β 1 is not included in the confidence interval for β 1. Confidence Interval for β 1
• The form of a confidence interval for β 1 is:
b1 is the
point
estimator b1 ± tα / 2 sb1
wheretα is the t value providing an area
/2
of α/2 in the upper tail of a t distribution
with n 2 degrees of freedom tα / 2 sb1 is the
margin
of error Confidence Interval for β 1
• Rejection Rule
Reject H0 if 0 is not included in
the confidence interval for β 1.
95% Confidence Interval for β
•
b1 ± tα / 2 sb1 = 5 +/ 3.182(1.08) = 5 +/ 3.44
1
or 1.56 to 8.44 • Conclusion
0 is not included in the confidence interval. Reject H0 Testing for Significance: F Test
s Hypotheses s H 0 : β 1 = 0 H a : β 1 ≠ 0 Test Statistic
F = MSR/MSE Testing for Significance: F Test
s Rejection Rule
Reject H0 if pvalue < α
or F > Fα
where: Fα is based on an F distribution with
1 degree of freedom in the numerator and
n 2 degrees of freedom in the denominator Testing for Significance: F Test
1. Determine the hypotheses.
1. Determine the hypotheses. H 0 : β 1 = 0
H a : β 1 ≠ 0 2. Specify the level of significance.
3. Select the test statistic.
4. State the rejection rule. α = .05 F = MSR/MSE
Reject H0 if pvalue < .05
or F > 10.13 (with 1 d.f.
in numerator and 3 d.f. in denominator) Testing for Significance: F Test
Testing for Significance: 5. Compute the value of the test statistic.
F = MSR/MSE = 100/4.667 = 21.43
6. Determine whether to reject H0. F = 17.44 provides an area of .025 in the upper tail. Thus, the pvalue corresponding to F = 21.43 is less than 2(.025) = .05. Hence, we reject H0. The statistical evidence is sufficient to conclude
that we have a significant relationship between the
number of TV ads aired and the number of cars sold. Some Cautions about the
Interpretation of Significance
T Rejecting H : β = 0 and concluding that the ests
0 1 relationship between x and y is significant does not enable us to conclude that a causeandeffect
relationship is present between x and y. Just because we are able to reject H0: β 1 = 0 and demonstrate statistical significance does not enable
us to conclude that there is a linear relationship
between x and y. Chapter 12
Simple Linear Regression
Part B • Excel’s Regression Tool • Using the Estimated Regression Equation for Estimation and Prediction
• Residual Analysis: Validating Model Assumptions
• Outliers and Influential Observations Using Excel’s Regression Tool
Using Excel’s Regression Tool Up to this point, you have seen how Excel can be used for various parts of a regression analysis. Excel also has a comprehensive tool in its Data Analysis package called Regression. The Regression tool can be used to perform a complete regression analysis. Using Excel’s Regression Tool
Using Excel’s Regression Tool
s Excel Worksheet (showing data)
1
2
3
4
5
6
7 A
Week
1
2
3
4
5 B
TV Ads
1
3
2
1
3 C
Cars Sold
14
24
18
17
27 D Using Excel’s Regression Tool
Using Excel’s Regression Tool
s Performing the Regression Analysis
Step 1 Select the Tools menu
Step 2 Choose the Data Analysis option
Step 3 Choose Regression from the list of Analysis Tools Using Excel’s Regression Tool
Using Excel’s Regression Tool
s Excel Regression Dialog Box Using Excel’s Regression Tool
Using Excel’s Regression Tool
s
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26 Excel Value Worksheet
A
Week
1
2
3
4
5 B
TV Ads
1
3
2
1
3 C
Cars Sold
14
24
18
17
27 SUMMARY OUTPUT Intercept
TV Ads SS
1
3
4 F G H I Data ANOVA
ANOVA
Output
Output ANOVA
Regression
Residual
Total E Regression Statistics
Regression
Output
Output Regression Statistics
Multiple R
0.936585812
R Square
0.877192982
Adjusted R Square
0.83625731
Standard Error
2.160246899
Observations
5 df D MS F
Significance F
100
100 21.42857
0.018986231
14 4.666667
114 Coefficients
Standard Error
t Stat
Pvalue
10
2.366431913 4.225771 0.024236
5
1.08012345
4.6291 0.018986 Estimated
Estimated
Regression
Equation Output
Equation Lower 95%
Upper 95% Lower 95.0% Upper 95.0%
2.468950436 17.53104956 2.468950436 17.53104956
1.562561893 8.437438107 1.562561893 8.437438107 Using Excel’s Regression Tool
Using Excel’s Regression Tool
s Excel Value Worksheet (bottomleft portion)
A B C D E 22
23
Coeffic. Std. Err. t Stat Pvalue
24 Intercept
10 2.36643 4.2258 0.02424
25 TV Ads
5 1.08012 4.6291 0.01899
26 Note: Columns FI are not shown. Using Excel’s Regression Tool
Using Excel’s Regression Tool
s Excel Value Worksheet (bottomright portion)
A B F G H I 22
23
Coeffic. Low. 95% Up. 95% Low. 95.0% Up. 95.0%
24 Intercept
10 2.46895 17.53105 2.46895044 17.5310496
25 TV Ads
5 1.562562 8.437438 1.56256189 8.43743811
26 Note: Columns CE are hidden. Using Excel’s Regression Tool
Using Excel’s Regression Tool
s Excel Value Worksheet (middle portion)
A
16
17
18
19
20
21
22 B C D E F ANOVA
df
Regression
Residual
Total SS
MS
F
Significance F
1 100
100 21.4286
0.018986231
3
14 4.66667
4 114 Using Excel’s Regression Tool
Using Excel’s Regression Tool
s Excel Value Worksheet (top portion)
A
9
10
11
12
13
14
15
16 B Regression Statistics
Multiple R
0.936585812
R Square
0.877192982
Adjusted R Square
0.83625731
Standard Error
2.160246899
Observations
5 C Using the Estimated Regression Equation
Using the Estimated Regression Equation
for Estimation and Prediction
s Confidence Interval Estimate of E(y )
y p ± t α / 2 sy p s p Prediction Interval Estimate of y
p y p ± tα / 2 sind
where: confidence coefficient is 1 α and
and
tα/2 is based on a t distribution
with n 2 degrees of freedom Point Estimation
If 3 TV ads are run prior to a sale, we expect
the mean number of cars sold to be:
^
y = 10 + 5(3) = 25 cars y = 10 + 5(3) = 25 cars Confidence Interval for E(y )
p s Formula Worksheet (confidence interval portion)
D
1
2
3
4 E
F
CONFIDENCE INTERVAL
xp
3
x bar
=AVERAGE(B2:B6)
x p x bar
=F2F3 5 (x p x bar)2 =F4^2 2
p (x p x bar)
Variance of y hat
Std. Dev of y hat
t Value
Margin of Error
Point Estimate
Lower Limit
Upper Limit =DEVSQ(B2:B6)
=D20*(1/B15+F5/F6)
=SQRT(F7)
=TINV(0.05,3)
=F9*F8
=B24+B25*F2
=F11F10
=F11+F10 6
7
8
9
10
11
12
13 G Confidence Interval for E(y )
p s Excel’s Confidence Interval Output
D
1
2
3
4 E
CONFIDENCE INTERVAL
xp
3
x bar
2.0
x p x bar
1.0 5 (x p x bar)2 6
7
8
9
10
11
12
13 1.0
2 Ξ (x p x bar)
Variance of y hat
Std. Dev of y hat
t Value
Margin of Error
Point Estimate
Lower Limit
Upper Limit 4.0
2.1000
1.4491
3.1824
4.6118
25.0
20.39
29.61 F G Confidence Interval for E(y )
p The 95% confidence interval estimate of the mean The 95% confidence interval estimate of the mean number of cars sold when 3 TV ads are run is:
25 + 4.61 = 20.39 to 29.61 cars Prediction Interval for y
p s Excel Formula Worksheet
1
2
3
4 H
I
PREDICTION INTERVAL
Variance of y ind
=D20+F7
Std. Dev. of y ind
=SQRT(I2)
Margin of Error
=F9*I3 5 Lower Limit =F11I4 6
7 Upper Limit =F11+I4 Prediction Interval for y
p s Excel Value Worksheet
1
2
3
4 H
PREDICTION INTERVAL
Variance of y ind
6.76667
Std. Dev. of y ind
2.60128
Margin of Error
8.27844 5 Lower Limit 16.72 6
7 Upper Limit 33.28 I Prediction Interval for y
p The 95% prediction interval estimate of the number of The 95% prediction interval estimate of the number of cars sold in one particular week when 3 TV ads are run is:
25 + 8.28 = 16.72 to 33.28 cars Residual Analysis
Residual Analysis If the assumptions about the error term ε appear questionable, the hypothesis tests about the significance of the regression relationship and the interval estimation results may not be valid. The residuals provide the best information about ε . Residual for Observation i ˆ
yi − yi Much of the residual analysis is based on an examination of graphical plots. Residual Plot Against x
Residual Plot Against s If the assumption that the variance of ε is the same for all values of x is valid, and the assumed regression model is an adequate representation of the relationship between the variables, then The residual plot should give an overall impression of a horizontal band of points Residual Plot Against x Residual ˆ
y −y Good Pattern 0 x Residual Plot Against x
Residual Plot Against Residual ˆ
y −y Nonconstant Variance 0 x Residual Plot Against x
Residual Plot Against Residual ˆ
y −y Model Form Not Adequate 0 x Residual Plot Against x
s Residuals
Observation Predicted Cars Sold Residuals 1 15 1 2 25 1 3 20 2 4 15 2 5 25 2 Residual Plot Against x
s Using Excel to Produce a Residual Plot
• The steps outlined earlier to obtain the regression output are performed with one change.
• When the Regression dialog box appears, we must also select the Residual Plot option. • The output will include two new items: • A plot of the residuals against the independent variable, and
• A list of predicted values of y and the corresponding residual values. Residual Plot Against x
s Excel Value Worksheet (bottom portion)
A B 28
29 RESIDUAL OUTPUT
30
31
Observation
Predicted Cars Sold
32
1
15
33
2
25
34
3
20
35
4
15
36
5
25
37 C Residuals
1
1
2
2
2 Residual Plot Against x
Residual Plot Against TV Ads Residual Plot 3 Residuals 2
1
0
1
2
3
0 1 2 TV Ads 3 4 Standardized Residuals
s Standardized Residual for Observation i
ˆ
yi − yi
sy i −yˆ i where: sy i −yˆ i = s 1 − hi ( x i − x )2
1
hi = +
n ∑ ( x i − x )2 Standardized Residual Plot
Standardized Residual Plot
s s The standardized residual plot can provide insight about the assumption that the error term ε has a normal distribution.
If this assumption is satisfied, the distribution of the standardized residuals should appear to come from a standard normal probability distribution. Standardized Residual Plot
s s s Excel’s Regression tool be used to obtain the standardized residuals.
The steps described earlier to conduct a regression analysis are performed with one change:
• When the Regression dialog box appears, we must select the Standardized Residuals option
The Standardized Residuals option does not automatically produce a standardized residual plot. Standardized Residual Plot
s Excel Value Worksheet
A B 28
29 RESIDUAL OUTPUT
30
31 Observation
Predicted Y
32
1
15
33
2
25
34
3
20
35
4
15
36
5
25 C D Residuals Standard Residuals
0.534522484
0.534522484
1.069044968
1.069044968
1.069044968 1
1
2
2
2 Standardized Residual Plot
s s Excel’s Chart Wizard can be used to construct the standardized residual plot.
A scatter diagram is developed in which:
• The values of the independent variable are placed on the horizontal axis
• The values of the standardized residuals are placed on the vertical axis Standardized Residual Plot
s Excel Standardized Residual Plot
Standard Residuals 28
29
30
31
32
33
34
35
36
37 1.5 A B C D 1
RESIDUAL OUTPUT
0.5 Observation
0 0.5 0
1
1.5 1
2
3
4
5 Predicted Y
15
10
25
20
15
25 Residuals
Standard Residuals
1 0.534522
20
30
1 0.534522
2 1.069045
2 1.069045
2 1.069045 Cars Sold Standardized Residual Plot
Standardized Residual Plot
s All of the standardized residuals are between –1.5 and +1.5 indicating that there is no reason to question the assumption that ε has a normal distribution. Outliers and Influential Observations
Outliers and Influential Observations
s Detecting Outliers
• An outlier is an observation that is unusual in comparison with the other data.
• Minitab classifies an observation as an outlier if its standardized residual value is < 2 or > +2.
• This standardized residual rule sometimes fails to identify an unusually large observation as being an outlier.
• This rule’s shortcoming can be circumvented by using studentized deleted residuals.
• The i th studentized deleted residual will be larger than the i th standardized residual. ...
View
Full Document
 Spring '09
 White
 Regression Analysis

Click to edit the document details