STAT 350 – Spring 2009 Homework #12 SOLUTION
covers through Lecture S
1. An ecologist is studying the tradeoffs individuals make between investment in survival and
growth and the investment in reproduction. In particular she is studying this relationship in a
highly variable species of prairie grasses. She has sampled 25 individual (randomly selected)
plants across the range of this species. For each plant, she has measured seed mass and root
volume. Let $x_i$ denote the root volume (in mL) of the ith plant and $y_i$ denote the seed mass (in mg)
of the ith plant. She finds
$\sum_{i=1}^{25} x_i = 926$, $\sum_{i=1}^{25} y_i = 1{,}928$, $\sum_{i=1}^{25} x_i^2 = 51{,}860$, $\sum_{i=1}^{25} y_i^2 = 153{,}997$, and $\sum_{i=1}^{25} x_i y_i = 63{,}677$.
The root volumes ranged from 8.2 to 102.3 mL and the seed masses ranged from 44.7 to 102.5 mg.
Warning: do not round off too much in intermediate calculations!
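a. Calculate SSxy.
$SS_{xy} = \sum x_i y_i - \frac{(\sum x_i)(\sum y_i)}{n} = 63677 - \frac{(926)(1928)}{25} = -7{,}736.12$
b. Calculate SSxx.
$SS_{xx} = \sum x_i^2 - \frac{(\sum x_i)^2}{n} = 51860 - \frac{926^2}{25} = 17{,}560.96$
c. Calculate SSyy.
$SS_{yy} = \sum y_i^2 - \frac{(\sum y_i)^2}{n} = 153997 - \frac{1928^2}{25} = 5309.64$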
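The three sums of squares in parts (a) through (c) come straight from the five summary totals, so they are easy to re-check. The following is a minimal Python sketch; the variable names are illustrative and not part of the original solution.

```python
# Summary totals reported in the problem statement (n = 25 plants).
n = 25
sum_x, sum_y = 926.0, 1928.0          # totals of root volume and seed mass
sum_x2, sum_y2 = 51860.0, 153997.0    # totals of squared values
sum_xy = 63677.0                      # total of cross products

# Computational ("shortcut") formulas for the corrected sums of squares.
ss_xy = sum_xy - sum_x * sum_y / n
ss_xx = sum_x2 - sum_x ** 2 / n
ss_yy = sum_y2 - sum_y ** 2 / n

print(f"SSxy = {ss_xy:.2f}, SSxx = {ss_xx:.2f}, SSyy = {ss_yy:.2f}")
```

Keeping full precision here matters because every later quantity in this problem is built from these three numbers.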
d. Calculate the standard deviation of root volume.
$s_x = \sqrt{\frac{SS_{xx}}{n-1}} = \sqrt{\frac{17560.96}{25-1}} = 27.05008$
e. Calculate the standard deviation of seed mass.
$s_y = \sqrt{\frac{SS_{yy}}{n-1}} = \sqrt{\frac{5309.64}{25-1}} = 14.87397$
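As a quick check of parts (d) and (e), each sample standard deviation is the square root of the corresponding sum of squares divided by n − 1. A small Python sketch with illustrative variable names:

```python
import math

n = 25
ss_xx = 17560.96   # from part (b)
ss_yy = 5309.64    # from part (c)

# Sample standard deviation: sqrt(SS / (n - 1)).
s_x = math.sqrt(ss_xx / (n - 1))
s_y = math.sqrt(ss_yy / (n - 1))

print(f"s_x = {s_x:.5f} mL, s_y = {s_y:.5f} mg")
```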
f. Give a 95% CI for the true mean seed mass of individuals from this species.
95% CI for $\mu_y$: $\bar{y} \pm t_{crit,\,df=25-2}\left(\frac{s_y}{\sqrt{n}}\right)$
$\frac{1928}{25} \pm 2.069\left(\frac{14.87397}{\sqrt{25}}\right)$, where $\bar{y} = 1928/25 = 77.12$
→ (70.965, 83.275)
g. The biologist finds another individual of this species, but during a time of year when the plant
is not producing seeds. Give a 95% prediction interval for the seed mass that would be produced
by this individual.
$\bar{y} \pm t_{crit,\,df=25-2}\,(s_y)\sqrt{1+\frac{1}{n}}$ → $77.12 \pm 2.069\,(14.87397)\sqrt{1+\frac{1}{25}}$
→ (45.736, 108.504)
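Parts (f) and (g) differ only in the factor that multiplies the standard deviation: 1/√n for the mean, and √(1 + 1/n) for a single new observation. The Python sketch below uses illustrative names and hard-codes the critical value 2.069 used in the solution rather than looking it up.

```python
import math

n = 25
y_bar = 1928 / n      # sample mean seed mass (mg)
s_y = 14.87397        # standard deviation from part (e)
t_crit = 2.069        # critical value as used in the solution (assumed, not computed)

# Confidence interval for the mean: the margin shrinks with sqrt(n).
ci_margin = t_crit * s_y / math.sqrt(n)
ci = (y_bar - ci_margin, y_bar + ci_margin)

# Prediction interval for one new plant: adds the plant's own variability.
pi_margin = t_crit * s_y * math.sqrt(1 + 1 / n)
pi = (y_bar - pi_margin, y_bar + pi_margin)

print(f"95% CI for the mean: ({ci[0]:.3f}, {ci[1]:.3f})")
print(f"95% PI for a new plant: ({pi[0]:.3f}, {pi[1]:.3f})")
```

The prediction interval is several times wider because it must cover an individual plant, not just the average.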
h. Find the Pearson Correlation between the root volume and seed mass of this species.
$r = \frac{SS_{xy}}{\sqrt{SS_{xx}\,SS_{yy}}} = \frac{-7736.12}{\sqrt{(17560.96)(5309.64)}} = -0.801155$
i. Find the least squares regression equation to predict seed mass from root volume for this
species (give a and b to at least 5 decimal places).
$b = \frac{SS_{xy}}{SS_{xx}} = \frac{-7736.12}{17560.96} = -0.4405294471$
$a = \bar{y} - b\bar{x} = \left(\frac{1928}{25}\right) - (-0.4405294471)\left(\frac{926}{25}\right) = 93.43721072$, where $\bar{y} = 1928/25 = 77.12$ and $\bar{x} = 926/25 = 37.04$
Answer: $\hat{y} = 93.43721 - 0.44053x$
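The correlation, slope, and intercept in parts (h) and (i) all reuse the same three sums of squares. A short Python sketch (variable names are illustrative) that reproduces them at full precision:

```python
import math

n = 25
sum_x, sum_y = 926.0, 1928.0
ss_xy, ss_xx, ss_yy = -7736.12, 17560.96, 5309.64

# Pearson correlation: covariation scaled by both spreads.
r = ss_xy / math.sqrt(ss_xx * ss_yy)

# Least squares slope and intercept for predicting seed mass from root volume.
b = ss_xy / ss_xx
a = sum_y / n - b * (sum_x / n)

print(f"r = {r:.6f}")
print(f"y-hat = {a:.5f} + ({b:.5f}) x")
```

The slope and the correlation necessarily share the sign of SSxy, which is negative here.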
j. Create the ANOVA table for the regression analysis (use Excel to obtain the p-value for your
table).
Total Sums of Squares (SSTo) = SSyy
= 5309.64
Residual Sums of Squares (SSResid) = SSTo – b(SSxy) = 5309.64 – (−0.44053)(−7736.12)
= 1901.647056
Regression Sums of Squares (SSReg) = SSTo – SSResid = 5309.64 – 1901.647056
= 3407.992944
ANOVA TABLE
Source             df   SS            MS            F         p
Regression          1   3407.992944   3407.992944   41.2189   $1.5 \times 10^{-6}$
Residual (Error)   23   1901.647056   82.6803068
Total              24   5309.64
Note, the degrees of freedom for the F statistic are: df1 = 1, df2 = 23
k. What percent of the total variation in seed mass is explained by the linear relationship between
root volume and seed mass?
This is asking for the Coefficient of Determination ($R^2$), but as a % instead of as a proportion.
$R^2$ = SSReg/SSTo = 3407.992944 / 5309.64 = 0.64185
Alternative calculation: Square the Pearson Correlation from (h): $(-0.801155)^2$ = 0.64185
Answer: 64.185%
l. Find the standard deviation of the least squares line (se).
$s_e = \sqrt{MSE} = \sqrt{82.6803068} = 9.09287$
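The ANOVA quantities in parts (j) through (l) can be rebuilt from SSxy, SSxx, and SSyy alone. The Python sketch below uses illustrative names and the unrounded slope, so its results agree with the table only to about two decimal places (the table was computed with b rounded to 0.44053). The p-value is left out because Python's standard library has no F distribution; the solution obtained it from Excel.

```python
import math

n = 25
ss_xy, ss_xx, ss_yy = -7736.12, 17560.96, 5309.64

b = ss_xy / ss_xx                 # unrounded slope
ss_total = ss_yy                  # SSTo
ss_resid = ss_total - b * ss_xy   # SSResid = SSTo - b * SSxy
ss_reg = ss_total - ss_resid      # SSReg

df_reg, df_resid = 1, n - 2
ms_reg = ss_reg / df_reg
mse = ss_resid / df_resid
f_stat = ms_reg / mse

r_squared = ss_reg / ss_total     # coefficient of determination
s_e = math.sqrt(mse)              # standard deviation about the line

print(f"SSReg = {ss_reg:.3f}, SSResid = {ss_resid:.3f}, F = {f_stat:.4f}")
print(f"R^2 = {r_squared:.5f}, s_e = {s_e:.5f}")
```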
m. Obtain a 95% CI for β (the true average increase in seed mass with a 1 mL increase in root
volume).
$b \pm t_{crit,\,df=25-2}\; s_b$, where $s_b = \frac{s_e}{\sqrt{SS_{xx}}}$
$-0.44053 \pm 2.069\left(\frac{9.09287}{\sqrt{17560.96}}\right)$, where $s_b = 0.068616$
→ (−0.58250, −0.29856)
n. Based on your answers to parts (j) through (m), what can you conclude about the relationship
between root volume and seed mass in this species? Be sure to use proper statistical language
and support your answer appropriately!
There is a statistically significant relationship between root volume and seed mass in this
species.
We estimate that the seed mass declines by an average of 0.44 mg for every 1 mL increase in
root volume.
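The slope estimate and its interval from part (m), which support the two statements above, can be re-derived as follows. This is a Python sketch with illustrative names; the critical value 2.069 is taken from the solution rather than computed, and the squared t statistic should match the F statistic from part (j) up to rounding.

```python
import math

ss_xy, ss_xx = -7736.12, 17560.96
s_e = 9.09287      # from part (l)
t_crit = 2.069     # critical value as used in the solution (assumed, not computed)

b = ss_xy / ss_xx
s_b = s_e / math.sqrt(ss_xx)          # standard error of the slope

lower = b - t_crit * s_b
upper = b + t_crit * s_b
t_stat = b / s_b                      # test statistic for H0: beta = 0

print(f"s_b = {s_b:.6f}")
print(f"95% CI for beta: ({lower:.5f}, {upper:.5f})")
print(f"t = {t_stat:.4f}, t^2 = {t_stat ** 2:.3f}")
```

Because the whole interval lies below zero, it is consistent with a significant negative relationship.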
Conceptually, this would mean that there is an ecological tradeoff between resources devoted
to growth and survival (roots) and reproduction (seeds).
o. The biologist finds another individual of this species, but during a time of year when the plant
is not producing seeds. However, she can measure the root volume and finds that it is 40 mL.
Give a 95% prediction interval for the seed mass that would be produced by this individual.
Predicted seed mass: $\hat{y}$ = 93.43721 – 0.44053(40) = 75.81601 mg
$\hat{y} \pm (t_{crit,\,df=n-2})\, s_e \sqrt{1 + \frac{1}{n} + \frac{(x^* - \bar{x})^2}{SS_{xx}}}$
$75.81601 \pm 2.069\,(9.09287)\sqrt{1 + \frac{1}{25} + \frac{(40 - 37.04)^2}{17560.96}}$ → (56.626, 95.006)
p. The biologist finds another individual of this species, during a time of year when the plant is
not producing seeds. She measures the root volume and finds that it is 100 mL. Give a 95%
prediction interval for the seed mass that would be produced by this individual.
Predicted seed mass: $\hat{y}$ = 93.43721 – 0.44053(100) = 49.38421 mg
$49.38421 \pm 2.069\,(9.09287)\sqrt{1 + \frac{1}{25} + \frac{(100 - 37.04)^2}{17560.96}}$ → (28.219, 70.550)
q. Compare the 3 prediction intervals obtained from parts (g), (o), and (p). Which is the
narrowest? Which is the widest? Explain why this is.
Width in (g): (108.504 – 45.736) = 62.768
Width in (o): (95.006 – 56.626) = 38.380
Width in (p): (70.550 – 28.219) = 42.331
The narrowest interval was in (o), when we knew that x = 40.
The widest interval was in (g), when we had no knowledge about x.
The reason the interval in (o) is narrower than that in (p) is that the value of x in (o) is
closer to the mean of x than the value of x in (p).
As long as x and y are correlated, a prediction interval for y should be narrower when we have a
value for x. Remember, correlation between x and y means that if we know something about x,
we should know something about the possible/likely values of y.
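The comparison in part (q) can be reproduced with one small function that evaluates the regression prediction interval at any root volume. This is a sketch in Python; the function name `prediction_interval` and its defaults are illustrative, with the defaults set to the rounded values used in the solution.

```python
import math

def prediction_interval(x_star, a=93.43721, b=-0.44053, s_e=9.09287,
                        n=25, x_bar=37.04, ss_xx=17560.96, t_crit=2.069):
    """95% prediction interval for seed mass at root volume x_star (mL)."""
    y_hat = a + b * x_star
    # The last term grows as x_star moves away from the mean root volume.
    margin = t_crit * s_e * math.sqrt(1 + 1 / n + (x_star - x_bar) ** 2 / ss_xx)
    return y_hat - margin, y_hat + margin

pi_40 = prediction_interval(40)
pi_100 = prediction_interval(100)

for label, (lo, hi) in (("x = 40", pi_40), ("x = 100", pi_100)):
    print(f"{label}: ({lo:.3f}, {hi:.3f}), width = {hi - lo:.3f}")
```

The interval at x = 100 is wider only because of the (x* − x̄)² term; everything else in the formula is identical for the two plants.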
r. The 8th individual in the sample had root volume = 23.2 mL and seed mass = 74.5 mg. Give
the residual for this observation.
Predicted seed mass: $\hat{y}$ = 93.43721 – 0.44053(23.2) = 83.216914
The residual is the observed value of y minus the predicted value:
74.5 – 83.216914 = −8.716914
Note: the sign (+/−) is important. "8.716914" is WRONG!
2. For each of the following, determine whether the correlation is "statistically significant" (that is,
test the null hypothesis ρ = 0 against the alternative that ρ ≠ 0). Be sure to give the value of the
test statistic as well as the p-value.
a. $r^2 = 0.01$, n = 400
Because the sample size is large, we can use z for the test statistic.
Also, because we are doing a 2-tailed test, it doesn't matter if r = 0.1 or −0.1; we will get the
same p-value.
$z = \frac{r\sqrt{n-2}}{\sqrt{1-r^2}} = \frac{0.1\sqrt{398}}{\sqrt{1-.01}} = 2.00504$
p-value = 2(0.0222) = 0.0444
Although the linear relationship between x and y explains only 1% of the variation in y, there is
a statistically significant linear relationship between x and y.
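The test statistic in this problem depends only on r and n, so both cases can be checked with one helper. The Python sketch below uses illustrative function names. It computes the normal tail with `math.erfc`, which gives a p-value near 0.045 for part (a) instead of the table-based 0.0444. For the r = 0.90, n = 4 case it uses the closed-form CDF of the t distribution with 2 degrees of freedom, which gives a two-sided p-value near 0.10. That closed form holds only for df = 2.

```python
import math

def corr_test_statistic(r, n):
    """Test statistic r*sqrt(n-2)/sqrt(1-r^2) for H0: rho = 0."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)

def two_sided_normal_p(z):
    """Two-sided p-value under the standard normal distribution."""
    return math.erfc(abs(z) / math.sqrt(2))

def two_sided_t2_p(t):
    """Two-sided p-value for a t statistic with exactly 2 degrees of freedom."""
    return 1 - abs(t) / math.sqrt(t ** 2 + 2)

z_a = corr_test_statistic(0.1, 400)   # part (a): large n, normal approximation
p_a = two_sided_normal_p(z_a)

t_b = corr_test_statistic(0.9, 4)     # r = 0.90, n = 4: only 2 degrees of freedom
p_b = two_sided_t2_p(t_b)

print(f"(a) z = {z_a:.5f}, p = {p_a:.4f}")
print(f"(b) t = {t_b:.2f}, p = {p_b:.3f}")
```

The contrast is the point of the exercise: a weak correlation can be significant with n = 400, while a strong one is not with n = 4.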
b. r = 0.90, n = 4
$t = \frac{r\sqrt{n-2}}{\sqrt{1-r^2}} = \frac{0.9\sqrt{2}}{\sqrt{1-.81}} = 2.92$
With only 2 degrees of freedom, the p-value is 2(0.051) = 0.102.
Thus we cannot conclude there is a statistically significant relationship between x and y,
despite the apparently high correlation. This is due to the very small sample size, which gives a
very imprecise estimate of ρ.
3. Jason is a sociology major. For his senior thesis, Jason randomly selected a number of residents
from his hometown to survey. He asked each subject a range of demographic questions. Among
the questions he asked were: "How many years of schooling have you had?" and "What is your
annual income?" Limiting his sample to just those 30 subjects who were no longer in school (that
is, who had completed their schooling), the number of years of schooling ranged from 9 to 22 years
(mean 15.4 years) and the annual incomes ranged from $28,984 to $61,267. Using these 30
subjects, he conducted a regression analysis to explore whether the amount of schooling affects
income. The SAS output from this analysis is given below. Use only the SAS output below and
the appendix tables from your textbook to answer the following questions.
The REG Procedure
Model: MODEL1
Dependent Variable: income

Number of Observations Read    30
Number of Observations Used    30

Analysis of Variance
                           Sum of         Mean
Source             DF     Squares       Square    F Value    Pr > F
Model               1   297006095    297006095       7.05    0.0129
Error              28  1179094138     42110505
Corrected Total    29  1476100233

Root MSE          6489.26074    R-Square    0.2012
Dependent Mean         44790    Adj R-Sq    0.1727
Coeff Var           14.48804

Parameter Estimates
                      Parameter     Standard
Variable        DF     Estimate        Error    t Value    Pr > |t|
Intercept        1        30207   5617.60177       5.38      <.0001
years_school     1    946.97283    356.57432       2.66      0.0129

a. What percent of the variation in incomes is explained by the linear relationship between income
and schooling?
20.12%
b. What is the correlation between income and years of schooling?
This is the positive square root of R-Square: $(0.2012)^{0.5} = 0.44855$
I know it is positive, because b is positive (946.97283)
c. Based on the above analysis, what is the income you would expect for an individual from this
town who has had 17 years of schooling?
yhat = 30207 + 946.97283(17) = $46,305.54
d. The 5th subject in this analysis had 17 years of schooling and has an annual income of $41,019.
What is the value of the 5th residual?
41019 – 46305.54 = −5286.54
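Parts (c) and (d) use only the two parameter estimates from the SAS output. In the Python sketch below, the function names `predicted_income` and `residual` are illustrative. It computes the fitted income at 17 years of schooling, and the residual as observed minus predicted, so the sign is kept.

```python
# Parameter estimates read from the SAS "Parameter Estimates" table.
INTERCEPT = 30207.0
SLOPE = 946.97283      # dollars per additional year of schooling

def predicted_income(years_school):
    """Fitted annual income for a given number of years of schooling."""
    return INTERCEPT + SLOPE * years_school

def residual(observed_income, years_school):
    """Residual = observed value minus predicted value (sign matters)."""
    return observed_income - predicted_income(years_school)

y_hat_17 = predicted_income(17)
resid_5 = residual(41019, 17)

print(f"Predicted income at 17 years: {y_hat_17:.2f}")
print(f"Residual for subject 5: {resid_5:.2f}")
```

A negative residual means this subject earns less than the fitted line predicts for 17 years of schooling.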
e. How much extra money should an individual in this town expect to earn for every additional
year of school he or she has completed?
(i) Give a point estimate.
$946.97
(ii) Give a 95% confidence interval
946.97283±2.048(356.57432)
→ (216.7086, 1677.2370)
f. Jason wants to determine if the relationship between years of schooling and annual income is
"statistically significant"?
(i) Give the value of the appropriate test statistic
t = 2.66 (equally acceptable is F = 7.05)
(ii) Give the degrees of freedom for that test statistic
if answer to (i) was t = 2.66 → answer: 28
if answer to (i) was F = 7.05 → answer: df1 = 1, df2 = 28
(iii) Give the p-value.
0.0129
(iv) Based on this, is the relationship "statistically significant"? Just answer "yes" or "no".
yes
g. Now assume that Jason wants to test the null hypothesis that years of schooling does not affect
annual income (that is, average annual income does not change with an increase in the number
of years of schooling) versus the alternative hypothesis that average annual income increases as
the number of years of schooling increases.
Here the alternative hypothesis is ONE-TAILED! Above it was 2-tailed.
(i) Give the value of the appropriate test statistic
t = 2.66 (here F = 7.05 is NOT acceptable)
(ii) Give the degrees of freedom for that test statistic
28
(iii) Give the p-value.
0.0129/2 = 0.00645
h. According to the Bureau of Labor Statistics, nationwide, average income increased $1750 for
each additional year of schooling. Jason wants to compare his town to the national average.
He will test the null hypothesis that the trend in his town is the same as the national average
against the alternative that the trend in his town is different from the national average.
H0: β = 1750, Ha: β ≠ 1750
(i) Give the value of the appropriate test statistic
$t = \frac{946.97283 - 1750}{356.57432} = -2.25206$
(ii) Give the degrees of freedom for that test statistic
28
(iii) Give the p-value.
From appendix Table VI: (0.015)×2 = 0.030
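Testing the slope against a value other than zero only changes what is subtracted in the numerator. The Python sketch below uses illustrative names and takes the critical value 2.048 (df = 28) from part (e) rather than computing it. It also checks that 1750 falls outside the 95% confidence interval for the slope, which is consistent with rejecting H0 at the 5% level in a two-sided test.

```python
b = 946.97283        # slope estimate from the SAS output
se_b = 356.57432     # its standard error
beta_0 = 1750.0      # national average under H0
t_crit = 2.048       # critical value for df = 28, as used in part (e)

# Test statistic for H0: beta = beta_0 (not the default beta = 0).
t_stat = (b - beta_0) / se_b

# 95% confidence interval for the slope, for comparison with beta_0.
ci = (b - t_crit * se_b, b + t_crit * se_b)
outside_ci = not (ci[0] <= beta_0 <= ci[1])

print(f"t = {t_stat:.5f}")
print(f"95% CI for beta: ({ci[0]:.4f}, {ci[1]:.4f}); 1750 outside: {outside_ci}")
```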
i. Give a 95% confidence interval for the true mean income of all residents of this town (who
have completed their schooling).
$s_y = \sqrt{\frac{SS_{yy}}{n-1}} = \sqrt{\frac{SS(Total)}{n-1}} = \sqrt{\frac{1476100233}{30-1}} = 7134.42156$
$\bar{y} \pm t_{df=n-1=29}\,\frac{s_y}{\sqrt{n}}$ ⇒ $44790 \pm (2.045)\frac{7134.42156}{\sqrt{30}}$
→ (42126.26, 47453.74)
j. Based solely on the preceding analysis, would it be appropriate for Jason to conclude that
additional schooling causes increased income? Justify your answer.
No! We cannot conclude anything about causation because the study was observational, not
experimental!
This note was uploaded on 02/16/2010 for the course STAT 350 taught by Professor Staff during the Spring '08 term at Purdue.