Assignment 5
Statistics 231
Due: Tuesday, March 29
You can use the following R code to calculate confidence intervals and p-values for both
blocked and unblocked comparative investigations.
For blocked investigations, the data are stored one row per block with the response
variates in two columns y1 and y2 corresponding to the two values of the explanatory
variate. Use the R code t.test(y1, y2, paired=TRUE).
For unblocked investigations, the data are stored one row per unit. Suppose the response
variate is y and the explanatory variate is x. Use the R code t.test(y~x, var.equal=TRUE).
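As a minimal sketch of the two calls, the following uses simulated data. The variable values, group labels and sample sizes are made up for illustration and are not the assignment data.

```r
set.seed(231)  # arbitrary seed so the sketch is reproducible

# Blocked (paired) layout: one row per block, two response columns
y1 <- rnorm(10, mean = 50, sd = 5)
y2 <- y1 + rnorm(10, mean = 2, sd = 1)
paired_fit <- t.test(y1, y2, paired = TRUE)

# Unblocked layout: one row per unit, response y and two-level variate x
y <- c(rnorm(12, mean = 50, sd = 5), rnorm(15, mean = 53, sd = 5))
x <- factor(rep(c("a", "b"), times = c(12, 15)))
pooled_fit <- t.test(y ~ x, var.equal = TRUE)

paired_fit$conf.int   # interval for the mean within-block difference
pooled_fit$p.value    # p-value for the hypothesis of equal group means
```

The paired call uses one fewer than the number of blocks as degrees of freedom, while the pooled call uses the total number of units minus two.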
1. Long exposure to radon, a naturally occurring radioactive gas, is thought to cause
lung cancer. In a case-control study, researchers selected 549 lung cancer patients
from a cancer registry and 621 community controls (individuals without lung cancer).
Each subject had lived in a single family dwelling with a basement for at least 15
years. The researchers then measured the concentration of radon (Bq/m³) in the
homes of the 1170 subjects in the sample. They also measured other variates,
especially smoking history, since smoking is a known cause of lung cancer. You can
access the data with the R command
The variate names and values are:
type: case (1) or control (0)
exposure: radon concentration
smoking: heavy smoker(2), moderate smoker(1), non-smoker(0)
a) Suppose the target population is all people. Define what it means to say that “radon
causes lung cancer”.
Hold all explanatory variates on all people (the target population) fixed.
Set the radon concentration experienced by all people and determine the proportion
who get lung cancer.
Change the radon concentration experienced by all people and again determine the
proportion who get lung cancer.
If the proportion of people who get lung cancer has changed, the change in radon
concentration caused the change in lung cancer rate, or more informally “radon
causes lung cancer.”
b) Ignoring smoking history, construct histograms for the radon concentrations for the
cases and controls. Does a Gaussian model seem appropriate?
The histograms below show that the distribution of radon exposure has a long tail to
the right, i.e. there are many large exposures.
Consider the model $Y_{ij} = \mu_i + R_{ij}$, $i = 0, 1$, $j = 1, \ldots, n_i$, $R_{ij} \sim G(0, \sigma)$ independent, where $n_0 = 621$ and $n_1 = 549$. Then we have $Y_{ij} \sim G(\mu_i, \sigma)$. The long tail to the right
suggests the Gaussian assumption is not ideal.
[Figure: histograms (percent) of exposure for Controls and Cases]
c) Repeat (b) using the log concentration.
[Figure: histograms (percent) of log(exposure) for Controls and Cases]
With the log transformation the Gaussian assumption seems more reasonable.
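Histograms of this kind could be produced along the following lines. This is only a sketch: the data frame `radon` is a simulated, hypothetical stand-in (log-normal exposures with the group sizes from the question), since the actual data set is not reproduced here.

```r
set.seed(1)
# Hypothetical stand-in for the radon data: right-skewed (log-normal) exposures
radon <- data.frame(
  type     = rep(c(0, 1), times = c(621, 549)),   # 0 = control, 1 = case
  exposure = rlnorm(1170, meanlog = 3.95, sdlog = 0.5)
)

op <- par(mfrow = c(2, 2))   # one row of panels per group
for (g in c(0, 1)) {
  lab <- if (g == 0) "Controls" else "Cases"
  hist(radon$exposure[radon$type == g], main = lab, xlab = "exposure")
  hist(log(radon$exposure[radon$type == g]), main = lab, xlab = "log(exposure)")
}
par(op)
```

For right-skewed data the raw histograms show the long right tail, while the log-scale histograms look roughly symmetric.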
d) Using the log concentration as the response variate, is there any evidence of a
difference in average exposure in the cases and controls? (Be sure to write down the
model you use to carry out the appropriate test of hypothesis.)
We can use the model $Y_{ij} = \mu_i + R_{ij}$, $i = 0, 1$, $j = 1, \ldots, n_i$, $R_{ij} \sim G(0, \sigma)$ independent, where
$n_0 = 621$ and $n_1 = 549$.
The hypothesis to test is $H_0: \mu_0 - \mu_1 = 0$.
Data summaries (on the natural log scale) give $\bar{y}_0 = 3.92$, $\bar{y}_1 = 3.98$ and the within-group
standard deviations are $s_0 = 0.51$, $s_1 = 0.45$. The pooled standard deviation
(the estimate $\hat{\sigma}$) equals
$\hat{\sigma} = \sqrt{\dfrac{(n_0 - 1) s_0^2 + (n_1 - 1) s_1^2}{n_0 + n_1 - 2}}$
Thus $\hat{\mu}_0 - \hat{\mu}_1 = -0.0535$ and $SE(\hat{\mu}_0 - \hat{\mu}_1) = \hat{\sigma} \sqrt{1/n_0 + 1/n_1} = 0.0285$.
So the discrepancy measure is d = 0.0535/0.0285 = 1.88. The corresponding two-sided p-value is
given by $\Pr(|t_{1168}| > 1.88) = 0.06$.
There is weak evidence of a difference between the case and control groups.
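The arithmetic can be checked with a short sketch. It assumes the standard pooled formula with weights n - 1 and uses the rounded summaries quoted in the text, so the standard error it produces only approximately matches the reported 0.0285, which presumably came from unrounded data. The discrepancy and p-value are computed from the reported estimate and standard error.

```r
n0 <- 621; n1 <- 549
s0 <- 0.51; s1 <- 0.45            # rounded within-group SDs from the text

# Pooled SD and standard error from the rounded summaries
sp <- sqrt(((n0 - 1) * s0^2 + (n1 - 1) * s1^2) / (n0 + n1 - 2))
se_rounded <- sp * sqrt(1 / n0 + 1 / n1)

# Discrepancy and two-sided p-value from the reported estimate and SE
est <- -0.0535
se  <- 0.0285
d   <- abs(est) / se
p   <- 2 * pt(d, df = n0 + n1 - 2, lower.tail = FALSE)

round(c(sp = sp, se_rounded = se_rounded, d = d, p = p), 3)
```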
e) What conclusion can you draw? Does ignoring smoking history introduce a limitation
to this conclusion? Explain.
There is weak evidence of a difference in average radon exposure between cases and
controls. That is, there is a weak association between radon and lung cancer. If we wish to
conclude that radon causes lung cancer, smoking history is a possible confounder. To check we
can compare the smoking history in the case and control groups. In fact the smoking
rates for people in the case group are higher than for people in the control group.
Thus, possible confounding of radon exposure with smoking history
imposes a limitation on the conclusion.
f) How might you take smoking history into account? You do not need to carry out this analysis.
The data for cases and controls could be subdivided into smaller groups by smoking
history. In this way we could compare cases and control with similar smoking history.
More formally, we could include smoking history as an explanatory variate in the
model: $Y_{ijk} = \mu_i + \theta_j + R_{ijk}$, $i = 0, 1$, $j = 0, 1, 2$, $k = 1, \ldots, n_{ij}$, $R_{ijk} \sim G(0, \sigma)$ independent, where $n_{ij}$ equals the number of people in group i with smoking history j.
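One way such an additive model might be fitted in R is with lm, treating type and smoking as factors. The sketch below uses a simulated, hypothetical data frame `dat` with the same variate names as the question. The sample size and effect sizes are assumptions made for illustration only.

```r
set.seed(2)
n <- 300
# Hypothetical data with the assignment's variate names
dat <- data.frame(
  type    = factor(sample(0:1, n, replace = TRUE)),   # 0 = control, 1 = case
  smoking = factor(sample(0:2, n, replace = TRUE))    # 0, 1, 2 as in the question
)
dat$exposure <- rlnorm(n, meanlog = 3.9 + 0.05 * (dat$type == 1), sdlog = 0.5)

# Additive model on the log scale: group effect plus smoking effect
fit <- lm(log(exposure) ~ type + smoking, data = dat)
coef(summary(fit))["type1", ]   # case-control difference adjusted for smoking
```

The coefficient labelled type1 estimates the difference in mean log exposure between cases and controls with the same smoking history.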
2. In an agricultural investigation, researchers wanted to compare the average yields of
two new varieties of corn, here called A and B, designed for planting in cool climates.
It is well known that variates such as the weather and soil fertility affect the yield of
corn. Accordingly, they selected 40 two-hectare plots from a study population of over
1000 plots in Southern Manitoba. Each plot was divided into two subplots. The
researchers assigned A and B to the two subplots at each plot. At the end of the
growing season the yield (in kg/hectare) was measured on each subplot. You can access
the data with the R code
The variate names are plot, yieldA and yieldB.
a) What are the target and study populations?
target population: all agricultural plots in the world now and in the future
study population: over 1000 plots in Southern Manitoba over the next year
b) How should the researchers select the 40 plots? Why?
We want to make it likely that sample error will be as small as necessary to answer
the question adequately. The plots could be selected randomly or the plots could be
selected by judgment to cover a range of conditions representative of the study population.
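Random selection could be sketched as a simple random sample without replacement. The population size N = 1000 and the labelling of plots by the integers 1 to N are assumptions; the text only says "over 1000 plots".

```r
set.seed(3)
N <- 1000                                        # assumed study population size
selected <- sort(sample(seq_len(N), size = 40))  # 40 distinct plot labels
selected
```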
c) This plan uses blocking. How does blocking prevent confounding from variates such
as the weather?
The two subplots in each block should experience essentially the same weather since
they are physically close together. In this way the weather is the same for the two
samples of plots planted with corn varieties A and B. As a result, we have prevented
the possibility of confounding between corn variety and weather.
d) How can the researchers reduce the risk of confounding due to local changes in the
soil fertility? Explain.
By randomly assigning the corn varieties to the two subplots in each block we reduce
the risk of confounding between corn variety and other (unknown and unmeasured
explanatory) variates that differ between the two subplots, such as soil fertility. With
random assignment we are likely to have similar average soil fertility across the two
groups of subplots planted with corn varieties A and B.
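Random assignment within blocks might be carried out as in this sketch, which draws an independent random order of the two varieties for each of the 40 plots. The object name `plan` and the column names are hypothetical.

```r
set.seed(4)
n_blocks <- 40

# Each row is one plot; the two entries are a random order of A and B
plan <- t(replicate(n_blocks, sample(c("A", "B"))))
colnames(plan) <- c("subplot1", "subplot2")

head(plan)
table(plan[, "subplot1"])   # about half of the plots get A on subplot 1
```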
e) Using a suitable model, analyze the data to produce a 95% confidence interval for the
difference in average yields for the two varieties.
Since the investigation involves blocking we use the model $Y_{ij} = \mu_i + \gamma_j + R_{ij}$,
$i = A, B$, $j = 1, \ldots, 40$, $R_{ij} \sim G(0, \sigma^*)$ independent.
We do the analysis based on the differences within each block, i.e. $d_j = y_{Aj} - y_{Bj}$.
The model for the difference is $D_j = \mu + R_j$, where $\mu = \mu_A - \mu_B$, $j = 1, \ldots, 40$,
$R_j \sim G(0, \sigma)$ independent.
From the data (differences yieldA-yieldB) we get $\hat{\mu} = 14.84$, $\hat{\sigma} = 19.37$.
A 95% confidence interval for the difference in average yield is of the form
$\hat{\mu} \pm c \, SD(\tilde{\mu})$, where $\Pr(t_{39} < c) = 0.975$ and $SD(\tilde{\mu}) = \hat{\sigma} / \sqrt{40} = 3.06$. From t-tables
we get c = 2.02, and the 95% confidence interval is 14.84 ± 2.02(3.06) or (8.65,
21.03) kg/hectare.
f) Write a conclusion about the difference in the two varieties.
Since the 95% confidence interval does not include zero, there is a statistically
significant difference between the average yields for varieties A and B. The average
yield for variety A is estimated to be 14.8 kg/hectare higher than for variety B.
Due to the good Plan the risk of important confounding is small. Also, sample and
measurement error are not likely to be appreciable. However, there is a risk of
substantial study error. We are not sure that variety A will have a higher yield in other
areas of the world (that have different weather conditions than southern Manitoba).
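The interval in part (e) can be reproduced from the two summary values quoted there. The sketch below uses only those reported numbers; with the raw data in columns yieldA and yieldB, a call of the form t.test(yieldA, yieldB, paired=TRUE) would be expected to give the same kind of interval.

```r
mu_hat    <- 14.84   # mean of the 40 within-plot differences (from the text)
sigma_hat <- 19.37   # SD of the differences (from the text)
n         <- 40

c_val <- qt(0.975, df = n - 1)       # t quantile with 39 degrees of freedom
se    <- sigma_hat / sqrt(n)
ci    <- mu_hat + c(-1, 1) * c_val * se

round(c(c = c_val, se = se, lower = ci[1], upper = ci[2]), 2)
```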
3. Suppose we have a sample of n units with data $(x_1, y_1), \ldots, (x_n, y_n)$. Consider the
model $Y_i = \beta x_i + R_i$, $R_i \sim G(0, \sigma)$, $i = 1, \ldots, n$ independent.
a) Find the least squares estimate $\hat{\beta}$ of $\beta$.
Minimize $W(\beta) = \sum_{i=1}^{n} (y_i - \beta x_i)^2$. Taking the derivative we get
$\dfrac{dW}{d\beta} = -2 \sum_{i=1}^{n} (y_i - \beta x_i) x_i$
Setting equal to zero gives
$\sum_{i=1}^{n} (y_i x_i - \hat{\beta} x_i^2) = 0$
Solving gives
$\hat{\beta} = \dfrac{\sum_{i=1}^{n} y_i x_i}{\sum_{i=1}^{n} x_i^2}$
Using linear algebra is also an acceptable solution.
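The closed-form estimate from part (a) can be compared with R's lm fitted without an intercept. The data below are simulated with an arbitrary slope of 2.5, purely for illustration.

```r
set.seed(5)
x <- runif(20, 1, 10)
y <- 2.5 * x + rnorm(20, sd = 1.5)     # simulated with beta = 2.5

beta_hat <- sum(x * y) / sum(x^2)      # closed-form least squares estimate
fit <- lm(y ~ 0 + x)                   # same model: no intercept term

c(closed_form = beta_hat, lm = unname(coef(fit)))
```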
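b) What is the estimate $\hat{\sigma}$ of $\sigma$?
$\hat{\sigma} = \sqrt{\dfrac{\sum_{i=1}^{n} \hat{r}_i^2}{n - 1}} = \sqrt{\dfrac{\sum_{i=1}^{n} (y_i - \hat{\beta} x_i)^2}{n - 1}}$
c) Find the distribution of the corresponding estimators $\tilde{\beta}$, $\tilde{\sigma}$.
The distribution of $\tilde{\sigma}$ is $\sigma K_{n-1}$.
$\tilde{\beta} = \dfrac{\sum_{i=1}^{n} Y_i x_i}{\sum_{i=1}^{n} x_i^2} = \dfrac{\sum_{i=1}^{n} (\beta x_i + R_i) x_i}{\sum_{i=1}^{n} x_i^2} = \beta + \sum_{i=1}^{n} a_i R_i$ for constants $a_i = x_i \big/ \sum_{j=1}^{n} x_j^2$
We know (from the course notes) that $sd\left(\sum_{i=1}^{n} a_i R_i\right) = \sigma \sqrt{\sum_{i=1}^{n} a_i^2}$ when $sd(R_i) = \sigma$.
So, $sd(\tilde{\beta}) = \sigma \sqrt{\sum_{i=1}^{n} a_i^2} = \sigma \sqrt{\sum_{i=1}^{n} \left( x_i \big/ \sum_{j=1}^{n} x_j^2 \right)^2} = \sigma \big/ \sqrt{\sum_{j=1}^{n} x_j^2}$
Since $\tilde{\beta}$ is a linear combination of independent Gaussian random variables, $\tilde{\beta} \sim G\left(\beta, \sigma \big/ \sqrt{\sum_{j=1}^{n} x_j^2}\right)$.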
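The formulas for the estimate of σ and for the standard deviation of the slope estimator can be checked numerically against lm. The sketch uses simulated data; the slope 2.5 and error standard deviation 1.5 are arbitrary choices.

```r
set.seed(6)
n <- 25
x <- runif(n, 1, 10)
y <- 2.5 * x + rnorm(n, sd = 1.5)

beta_hat  <- sum(x * y) / sum(x^2)
sigma_hat <- sqrt(sum((y - beta_hat * x)^2) / (n - 1))  # divisor n - 1
sd_beta   <- sigma_hat / sqrt(sum(x^2))                 # estimated sd of the slope

fit <- lm(y ~ 0 + x)
c(sigma_hat = sigma_hat, lm_sigma = summary(fit)$sigma)
c(sd_beta = sd_beta, lm_se = summary(fit)$coefficients["x", "Std. Error"])
```

The divisor is n - 1 because the model has a single parameter β for the mean, so lm reports n - 1 residual degrees of freedom.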
d) Derive a 95% confidence interval for $\beta$.
A 95% CI for $\beta$ is of the form $\hat{\beta} \pm c \, SD(\tilde{\beta})$, where $SD(\tilde{\beta}) = \hat{\sigma} \big/ \sqrt{\sum_{j=1}^{n} x_j^2}$ and c is
a constant from the t-tables that satisfies $\Pr(t_{n-1} < c) = 0.975$.
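A numerical sketch of this interval, again on simulated data with arbitrary parameter values, can be compared with confint applied to the no-intercept fit.

```r
set.seed(7)
n <- 25
x <- runif(n, 1, 10)
y <- 2.5 * x + rnorm(n, sd = 1.5)

beta_hat  <- sum(x * y) / sum(x^2)
sigma_hat <- sqrt(sum((y - beta_hat * x)^2) / (n - 1))
c_val     <- qt(0.975, df = n - 1)          # Pr(t_{n-1} < c) = 0.975
ci <- beta_hat + c(-1, 1) * c_val * sigma_hat / sqrt(sum(x^2))

ci
confint(lm(y ~ 0 + x), level = 0.95)        # should agree with ci
```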