STA 108
Regression Analysis
Lecture 8
Irina Udaltsova
Department of Statistics
University of California, Davis
January 23rd, 2015

Admin for the Day

Homework 2 is due today, Friday, January 23rd, in class.
Homework 3 will be assigned soon and is due the coming Wednesday, January 28th, in class.
Midterm: Friday, January 30th, in class.

References for Today: Ch. 2.7-2.10, 3.1-3.2 (Kutner, 5th Ed.)

Topics For Today

Recap:
Analysis of variance approach to regression
Measures of Association
Today:
1. ANOVA in R
2. General linear test
3. Diagnostics

ANOVA table

Recap: the ANOVA table is a table that summarizes the variance decomposition in the response variable, useful in testing H0: β1 = 0 against H1: β1 ≠ 0.

Source      df                 SS     MS    F*
Regression  df(SSR) = 1        SSR    MSR   F* = MSR/MSE
Error       df(SSE) = n − 2    SSE    MSE
Total       df(SSTO) = n − 1   SSTO

Toluca example: ANOVA table in R
Let’s review all the commands we need in order to produce the ANOVA table in R for the Toluca Company example.

> workData = read.table("toluca.txt", header=TRUE)
> fit = lm(WorkHrs ~ LotSize, data=workData)
> anova(fit)
Analysis of Variance Table

Response: WorkHrs
          Df Sum Sq Mean Sq F value    Pr(>F)
LotSize    1 252378  252378  105.88 4.449e-10 ***
Residuals 23  54825    2384
---
Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1   1

Interpretation: the F statistic is 105.88, and the p-value is approximately 0. Hence, at the 0.05 level of significance, there is a linear relationship between work hours and lot size.
Toluca example: additional output in R

Let’s review additional important output in R for the Toluca Company example.

> fit = lm(WorkHrs ~ LotSize, data=workData)
> summary(fit)

Call:
lm(formula = WorkHrs ~ LotSize, data = workData)

Residuals:
    Min      1Q  Median      3Q     Max
-83.876 -34.088  -5.982  38.826 103.528

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   62.366     26.177   2.382   0.0259 *
LotSize        3.570      0.347  10.290 4.45e-10 ***
---
Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1   1

Residual standard error: 48.82 on 23 degrees of freedom
Multiple R-squared: 0.8215, Adjusted R-squared: 0.8138
F-statistic: 105.9 on 1 and 23 DF, p-value: 4.449e-10

Toluca example: additional output in R

Residual standard error: 48.82 on 23 degrees of freedom
Multiple R-squared: 0.8215, Adjusted R-squared: 0.8138
F-statistic: 105.9 on 1 and 23 DF, p-value: 4.449e-10

The residual standard error is the square root of MSE, i.e., √MSE = 48.82.
The degrees of freedom associated with SSE is n − 2 = 23.
The adjusted coefficient of determination is R²_adj = 0.8138. Hence, 81.38% of the variation in the work hours is explained by the regression relationship on the lot size, after adjusting for the number of parameters in the model.

Coefficient of determination, revisited
Important things to remember about R² (and R²_adj):
A high R² does not necessarily imply that useful predictions can be made, because prediction intervals can still be wide.
A high R² does not necessarily imply that the estimated regression line is a good fit, because the actual regression relationship could be curvilinear.
An R² near 0 does not necessarily imply that X and Y are not related, because the actual regression relationship could be curvilinear.
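The R², adjusted R², and residual standard error reported by summary(fit) can all be reproduced from the ANOVA sums of squares. A short Python cross-check of the arithmetic, using the Toluca values (n = 25):

```python
import math

# Reproduce summary(fit)'s R-squared, adjusted R-squared, and residual
# standard error from the ANOVA sums of squares (Toluca example, n = 25).
n = 25
SSR, SSE = 252378.0, 54825.0
SSTO = SSR + SSE

R2 = SSR / SSTO                                  # coefficient of determination
R2_adj = 1 - (SSE / (n - 2)) / (SSTO / (n - 1))  # penalizes the number of parameters
s = math.sqrt(SSE / (n - 2))                     # residual standard error

print(round(R2, 4), round(R2_adj, 4), round(s, 2))  # → 0.8215 0.8138 48.82
```

All three values match the last block of the summary(fit) output above.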
General linear test

Consider testing for the dependence on the predictor variable from a different point of view. The ideas of the ANOVA approach to regression are applicable here.

Define a Full model:
Yi = β0 + β1 Xi + εi
Define a Reduced model:
Yi = β0 + εi

Goal: We want to test
H0: the reduced model holds, against H1: the full model holds.

Example: suppose we want to test H0: β1 = 0 against H1: β1 ≠ 0. Then, under H0: β1 = 0, we have the reduced model.

In this case, under the full model, SSE_full = Σi (Yi − Ŷi)² = SSE.
Under the reduced model, SSE_red = Σi (Yi − Ȳ)² = SSTO.
Observe that: d.f.(SSE_full) = n − 2, d.f.(SSE_red) = n − 1, and
SSE_red − SSE_full = SSR.
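These identities can be checked numerically with the Toluca sums of squares. A short Python sketch of the arithmetic only (values from the ANOVA table above):

```python
# For H0: beta1 = 0, the reduced model is Yi = beta0 + eps_i, so
# SSE_red = SSTO and SSE_full = SSE. Check that SSE_red - SSE_full = SSR
# and that the resulting general linear test F* equals MSR/MSE (Toluca, n = 25).
n = 25
SSE_full, SSE_red = 54825.0, 307203.0   # SSE and SSTO from the ANOVA table
df_full, df_red = n - 2, n - 1

SSR = SSE_red - SSE_full                # regression sum of squares
F = (SSR / (df_red - df_full)) / (SSE_full / df_full)

print(SSR, round(F, 2))  # → 252378.0 105.88
```

The result agrees with F* = MSR/MSE = 105.88 from the ANOVA table.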
Test statistic: F* = MSR/MSE.

General linear test

Goal: Test
H0: the reduced model holds, against H1: the full model holds.

General linear test statistic:

F* = [ (SSE_red − SSE_full) / (d.f.(SSE_red) − d.f.(SSE_full)) ] / [ SSE_full / d.f.(SSE_full) ]

For H0: β1 = 0, this equals [ SSR / d.f.(SSR) ] / [ SSE / d.f.(SSE) ] = MSR/MSE.

Under the normal error model, and under H0: β1 = 0,
F* ~ F( d.f.(SSE_red) − d.f.(SSE_full), d.f.(SSE_full) ),
the F distribution with numerator and denominator degrees of freedom d.f.(SSE_red) − d.f.(SSE_full) and d.f.(SSE_full), respectively.

P-value: p = P( F( d.f.(SSE_red) − d.f.(SSE_full), d.f.(SSE_full) ) > F* ).

Decision rule: reject H0 at level of significance α
if F* > F( 1 − α; d.f.(SSE_red) − d.f.(SSE_full), d.f.(SSE_full) ), or if p < α.

Toluca example: General linear test

In the Toluca example, suppose we are interested in testing
H0: β0 = 20 and β1 = 3, against
H1: either β0 ≠ 20 or β1 ≠ 3, or both.

Full model:
Yi = β0 + β1 Xi + εi, i = 1, ..., n
The estimates of β0 and β1 are b0 and b1, respectively.
Fitted values: Ŷi = b0 + b1 Xi
Residuals: ei = Yi − Ŷi
Residual sum of squares: SSE_full = Σ ei²
d.f.(SSE_full) = n − (# of beta parameters estimated) = n − 2

Toluca example: General linear test
In the Toluca example, suppose we are interested in testing
H0: β0 = 20 and β1 = 3, against
H1: either β0 ≠ 20 or β1 ≠ 3, or both.

Reduced model:
Yi = 20 + 3 Xi + εi, i = 1, ..., n
Fitted values: Ŷi = 20 + 3 Xi
Residuals: ẽi = Yi − Ŷi = Yi − (20 + 3 Xi)
Residual sum of squares: SSE_red = Σ ẽi²
d.f.(SSE_red) = n − (# of beta parameters estimated) = n − 0 = n
Numerator d.f. = d.f.(SSE_red) − d.f.(SSE_full) = n − (n − 2) = 2
Denominator d.f. = d.f.(SSE_full) = n − 2

Toluca example: General linear test

> anova(fit)   # full model
Analysis of Variance Table

Response: WorkHrs
          Df Sum Sq Mean Sq F value    Pr(>F)
LotSize    1 252378  252378  105.88 4.449e-10 ***
Residuals 23  54825    2384

> yhat = 20 + 3*workData$LotSize   # reduced model
> SSEred = sum((workData$WorkHrs - yhat)^2)
> SSEred
[1] 230513

Toluca example: General linear test

Test statistic:
F* = [ (SSE_red − SSE_full) / (d.f.(SSE_red) − d.f.(SSE_full)) ] / [ SSE_full / d.f.(SSE_full) ]
   = [ (230513 − 54825) / 2 ] / [ 54825 / 23 ] = 36.85202

P-value:
p = P( F(2, 23) > F* ) = 6.718263 × 10⁻⁸
(In R: 1 - pf(36.85202, 2, 23))

Since the p-value is nearly 0 (smaller than 0.05), we conclude at the 0.05 level of significance that either the intercept is not equal to 20, or the slope is not equal to 3, or both.
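The arithmetic above is easy to reproduce. A short Python sketch; note that when the numerator degrees of freedom equal 2, the F upper-tail probability has the exact closed form P(F(2, ν) > f) = (1 + 2f/ν)^(−ν/2), so no statistics library is needed:

```python
# Toluca general linear test: reproduce F* = 36.85202 and its p-value.
# For numerator df = 2, P(F(2, nu) > f) = (1 + 2*f/nu)**(-nu/2) exactly.
SSE_red, SSE_full = 230513.0, 54825.0
df_red, df_full = 25, 23                 # n and n - 2

F = ((SSE_red - SSE_full) / (df_red - df_full)) / (SSE_full / df_full)
p = (1 + 2 * F / df_full) ** (-df_full / 2)
F_crit = (df_full / 2) * (0.05 ** (-2 / df_full) - 1)   # F(0.95; 2, 23)

print(round(F, 5))   # → 36.85202
print(p)             # ≈ 6.72e-08, matching 1 - pf(36.85202, 2, 23) in R
print(F > F_crit)    # → True, so reject H0 at alpha = 0.05
```

The critical value F(0.95; 2, 23) ≈ 3.42 comes from inverting the same closed form, confirming the decision-rule version of the test as well.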
Considerations in applying regression analysis

Making inferences for the future:
We assume the regression model will remain the same over time. However, the relationship between X and Y often changes with time.
For example, an admissions officer using her model that predicts first-year GPA from ACT scores is assuming that the relationship between GPA and ACT stays the same. If the ACT changes in any way, or if the academic grading policies for first-year college students change, then the model she estimated may no longer be valid.

When the assumption of normality is inappropriate, then as long as the sample size is large:
most of the inference methods we discussed (Ch. 1-2) remain valid;
with one exception, however: prediction intervals are no longer valid.

Considerations in applying regression analysis
Predicting observations outside the range of values of X:
We only assume a linear relationship between X and Y within the range of the observed values of X.
Extrapolation outside the range of X is inadvisable.

A linear association does not imply a cause-and-effect relationship:
If we conclude β1 ≠ 0, that does not establish a causal relationship between X and Y.
There could be another variable that influences both X and Y (a confounding variable).
This is a serious problem in observational studies, yet it is still possible in controlled experiments.

Model Assumptions

Model: Yi = β0 + β1 Xi + εi, i = 1, ..., n,
where ε1, ..., εn are independent and normally distributed with mean 0 and variance σ².
Assumptions:
Linearity
Equal Variance
Normality of errors
Independence of errors

Departures from the model

Possible departures from the model:
1. The regression function is not linear
2. The error terms (i.e., εi ’s) do not have constant variance
3. The error terms are not independent
4. The model ﬁts all but one or few outlier observations
5. The error terms are not normally distributed
6. One or several important predictor variables have been
omitted from the model

Diagnostics
Diagnostic plots use residuals or semistudentized residuals.
[Residuals: ei = Yi − Ŷi; semistudentized residuals: ei* = ei / √MSE.]
1. Plot of residuals against predictor variable (useful for
explaining departures 1,2,4)
2. Plot of absolute or squared residuals against predictor variable
(useful for explaining departure 2)
3. Plot of residuals against ﬁtted values (useful for explaining
departures 1,2,4)
4. Plot of residuals against time or other sequence (useful for
examining departure 3)
5. Plot of residuals against omitted predictor variables (useful for
examining departure 6)
6. Box plot or stem-and-leaf plot or histogram or normal
probability plot of residuals (useful for examining departure 5,
and also 4). Residual plots
Residual plots

When plotting residuals vs. X, look for NO PATTERN!
Interpretation: if there is no obvious pattern, the plot is fine.

Residual plots in R

> fit = lm(WorkHrs ~ LotSize, data=workData)
> fit$fitted.values   # Y-hat
> fit$residuals       # e, residuals
> residuals(fit)      # e, residuals

> # Plot Residuals vs X
> plot(workData$LotSize, fit$resid, xlab="Lot Size", ylab="Residuals")
> abline(h=0, lty=2)

Work Hours example: Residual plot, ei vs. X
[Figure: scatter plot of residuals against Lot Size, with a dashed horizontal reference line at 0.]

All points appear without pattern; hence there are no obvious departures from the assumptions of linearity and constant variance. Also, there are no obvious outliers.
Note: in the case of simple linear regression, the plot of e vs. X gives the same information as the plot of e vs. Ŷ.

Residual plots in R

> # Plot Normal Probability plot
> qqnorm(fit$resid)
> qqline(fit$resid, col="red")

Work Hours example: Normal Probability plot of residuals
[Figure: normal Q-Q plot of the residuals with a red reference line.]

All points, except for one mild outlier, appear to follow the reference line. Hence, the normality assumption for the error terms is adequate.

Cartoon of the Day ...
