A5sol - Assignment 5 Statistics 231 Due Tuesday March 29...


Assignment 5
Statistics 231
Due: Tuesday, March 29

You can use the following R code to calculate confidence intervals and p-values for both blocked and unblocked comparative investigations. For blocked investigations, the data are stored one row per block, with the response variate in two columns y1 and y2 corresponding to the two values of the explanatory variate. Use the R code t.test(y1, y2, paired=TRUE). For unblocked investigations, the data are stored one row per unit. Suppose the response variate is y and the explanatory variate is x. Use the R code t.test(y~x, var.equal=TRUE).

1. Long exposure to radon, a naturally occurring radioactive gas, is thought to cause lung cancer. In a case-control study, researchers selected 549 lung cancer patients from a cancer registry and 621 community controls (individuals without lung cancer). Each subject had lived in a single family dwelling with a basement for at least 15 years. The researchers then measured the concentration of radon (Bq/m³) in the homes of the 1170 subjects in the sample. They also measured other variates, especially smoking history, since smoking is a known cause of lung cancer. You can access the data with the R command

source('http://uwangel.uwaterloo.ca/AngelUploadsuwangel/Content/MRG-041122144725-_admin/_assoc/04C00F4D113D4ABE958960DCA3358257/sourceAss5q1.txt')

The variate names and values are:
subject type: case (1) or control (0)
exposure: radon concentration
smoking: heavy smoker (2), moderate smoker (1), non-smoker (0)

a) Suppose the target population is all people. Define what it means to say that “radon causes lung cancer”.

Hold all other explanatory variates on all people (the target population) fixed. Set the radon concentration experienced by all people and determine the proportion who get lung cancer. Change the radon concentration experienced by all people and again determine the proportion who get lung cancer.
If the proportion of people who get lung cancer has changed, the change in radon concentration caused the change in lung cancer rate, or more informally “radon causes lung cancer.”

b) Ignoring smoking history, construct histograms for the radon concentrations for the cases and controls. Does a Gaussian model seem appropriate?

The histograms show that the distribution of radon exposure has a long tail to the right, i.e. there are many large exposures. Consider the model
$Y_{ij} = \mu_i + R_{ij}$, $i = 0, 1$, $j = 1, \ldots, n_i$, $R_{ij} \sim G(0, \sigma)$ indep.,
where $n_0 = 621$ and $n_1 = 549$. Then we have $Y_{ij} \sim G(\mu_i, \sigma)$. The long tail to the right suggests the Gaussian assumption is not ideal.

[Figure: histograms of exposure (horizontal axis, 0 to 300) against percent (vertical axis), in two panels: Controls and Cases]

c) Repeat (b) using the log concentration.

[Figure: histograms of log(exposure) (horizontal axis, 2.5 to 5.5) against percent (vertical axis), in two panels: Controls and Cases]

With the log transformation the Gaussian assumption seems more reasonable.

d) Using the log concentration as the response variate, is there any evidence of a difference in average exposure in the cases and controls? (Be sure to write down the model you use to carry out the appropriate test of hypothesis.)

We can use the model
$Y_{ij} = \mu_i + R_{ij}$, $i = 0, 1$, $j = 1, \ldots, n_i$, $R_{ij} \sim G(0, \sigma)$ indep.,
where $n_0 = 621$ and $n_1 = 549$. The hypothesis to test is $\mu_0 - \mu_1 = 0$.

Data summaries (on the natural log scale) give $\bar{y}_0 = 3.92$, $\bar{y}_1 = 3.98$, and the within-group standard deviations are $s_0 = 0.51$, $s_1 = 0.45$. The pooled standard deviation (the estimate $\hat\sigma$) equals
$\hat\sigma = \sqrt{\dfrac{(n_0 - 1) s_0^2 + (n_1 - 1) s_1^2}{n_0 + n_1 - 2}} = 0.485$.
Thus $\hat\mu_0 - \hat\mu_1 = -0.0535$ and
$SE(\hat\mu_0 - \hat\mu_1) = \sqrt{\dfrac{\hat\sigma^2}{n_0} + \dfrac{\hat\sigma^2}{n_1}} = 0.0285$.
So the discrepancy measure is $d = 0.0535 / 0.0285 = 1.88$. The corresponding (two-sided) p-value is $\Pr(|t_{1168}| > 1.88) = 0.06$.

There is weak evidence of a difference between the case and control groups.

e) What conclusion can you draw?
Does ignoring smoking history introduce a limitation to this conclusion? Explain.

There is weak evidence of a difference in average radon exposure between cases and controls. That is, there is a weak association between radon and lung cancer. If we wish to conclude that radon causes lung cancer, smoking history is a possible confounder. To check, we can compare the smoking history in the case and control groups. In fact, the smoking rates for people in the case group are higher than for people in the control group. Thus possible confounding of radon exposure with smoking history limits the conclusion.

f) How might you take smoking history into account? You do not need to carry out this analysis.

The data for cases and controls could be subdivided into smaller groups by smoking history. In this way we could compare cases and controls with similar smoking history. More formally, we could include smoking history as an explanatory variate in the model
$Y_{ijk} = \mu_i + \theta_j + R_{ijk}$, $i = 0, 1$, $j = 0, 1, 2$, $k = 1, \ldots, n_{ij}$, $R_{ijk} \sim G(0, \sigma)$ indep.,
where $n_{ij}$ equals the number of people in group $i$ with smoking history $j$.

2. In an agricultural investigation, researchers wanted to compare the average yields of two new varieties of corn, here called A and B, designed for planting in cool climates. It is well known that variates such as the weather and soil fertility affect the yield of corn. Accordingly, they selected 40 two-hectare plots from a study population of over 1000 plots in Southern Manitoba. Each plot was divided into two subplots. The researchers assigned A and B to the two subplots at each plot. At the end of the growing season the yield (in kg/hectare) was measured on each subplot. You can access the data with the R code

source('http://uwangel.uwaterloo.ca/AngelUploadsuwangel/Content/MRG-041122144725-_admin/_assoc/04C00F4D113D4ABE958960DCA3358257/sourceAss5q2.txt')

The variate names are plot, yieldA and yieldB.

a) What are the target and study populations?
target population: all agricultural plots in the world now and in the future
study population: over 1000 plots in Southern Manitoba over the next year

b) How should the researchers select the 40 plots? Why?

We want to make it likely that sample error will be as small as necessary to answer the question adequately. The plots could be selected randomly, or they could be selected by judgment to cover a range of conditions representative of the study population.

c) This plan uses blocking. How does blocking prevent confounding from variates such as the weather?

The two subplots in each block should experience essentially the same weather since they are physically close together. In this way the weather is the same for the two sets of subplots planted with corn varieties A and B. As a result, we have prevented the possibility of confounding between corn variety and weather.

d) How can the researchers reduce the risk of confounding due to local changes in the soil fertility? Explain.

By randomly assigning the corn varieties to the two subplots in each block, we reduce the risk of confounding between corn variety and other (unknown and unmeasured) explanatory variates that differ between the two subplots, such as soil fertility. With random assignment, we are likely to have similar average soil fertility across the two groups of subplots planted with corn varieties A and B.

e) Using a suitable model, analyze the data to produce a 95% confidence interval for the difference in average yields for the two varieties.

Since the investigation involves blocking, we use the model
$Y_{ij} = \mu_i + \gamma_j + R_{ij}$, $i = A, B$, $j = 1, \ldots, 40$, $R_{ij} \sim G(0, \sigma^*)$ indep.
We do the analysis based on the differences within each block, i.e. $d_j = y_{Aj} - y_{Bj}$. The model for the difference is
$D_j = \mu + R_j$, where $\mu = \mu_A - \mu_B$, $j = 1, \ldots, 40$, $R_j \sim G(0, \sigma)$ indep.
From the data (differences yieldA − yieldB) we get $\hat\mu = 14.84$, $\hat\sigma = 19.37$.
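The calculation based on within-block differences could be sketched in R as follows. This is only an illustration under stated assumptions: the assignment's data file is not reproduced here, so yieldA and yieldB below are simulated stand-ins (hypothetical values, as are the names block_effect, ci_manual and ci_ttest), and the numbers it produces are not the assignment's results.

```r
# Hypothetical data: 40 blocks, simulated only so that the sketch runs.
# The real values come from sourceAss5q2.txt and are NOT used here.
set.seed(231)
block_effect <- rnorm(40, mean = 0, sd = 50)   # shared by both subplots of a block
yieldA <- 500 + block_effect + rnorm(40, sd = 14)
yieldB <- 485 + block_effect + rnorm(40, sd = 14)

d <- yieldA - yieldB              # within-block differences d_j
n <- length(d)
mu_hat <- mean(d)                 # estimate of mu = mu_A - mu_B
sigma_hat <- sd(d)                # estimate of sigma (divisor n - 1)
se <- sigma_hat / sqrt(n)         # estimated SD of the estimator of mu
c_val <- qt(0.975, df = n - 1)    # c such that Pr(t_{n-1} < c) = 0.975
ci_manual <- c(mu_hat - c_val * se, mu_hat + c_val * se)

# The built-in paired t-test gives the same 95% interval
ci_ttest <- t.test(yieldA, yieldB, paired = TRUE)$conf.int

print(ci_manual)
print(ci_ttest)
```

Working with the differences removes the block effect, which is why the paired call to t.test and the hand calculation on d agree.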
A 95% confidence interval for the difference in average yield is of the form $\hat\mu \pm c\,SD(\hat\mu)$, where $\Pr(t_{39} < c) = 0.975$ and $SD(\hat\mu) = \hat\sigma / \sqrt{40} = 3.06$. From t-tables we get $c = 2.02$, and the 95% confidence interval is $14.84 \pm 2.02(3.06)$ or (8.65, 21.03) kg/hectare.

f) Write a Conclusion about the difference in the two varieties.

Since the 95% confidence interval does not include zero, there is a statistically significant difference between the average yields for varieties A and B. The average yield for variety A is estimated to be 14.8 kg/hectare higher than for variety B. Due to the good Plan, the risk of important confounding is small. Also, sample and measurement error are not likely to be appreciable. However, there is a risk of substantial study error. We are not sure that variety A will have a higher yield in other areas of the world (that have different weather conditions than southern Manitoba).

3. Suppose we have a sample of $n$ units with data $(x_1, y_1), \ldots, (x_n, y_n)$. Consider the model
$Y_i = \beta x_i + R_i$, $R_i \sim G(0, \sigma)$, $i = 1, \ldots, n$ independent.

a) Find the least squares estimate $\hat\beta$ of $\beta$.

Minimize $W(\beta) = \sum_{i=1}^{n} (y_i - \beta x_i)^2$.
Taking the derivative we get $\dfrac{dW}{d\beta} = -2 \sum_{i=1}^{n} (y_i - \beta x_i) x_i$.
Setting equal to zero gives $\sum_{i=1}^{n} (y_i x_i - \hat\beta x_i^2) = 0$.
Solving gives $\hat\beta = \dfrac{\sum_{i=1}^{n} y_i x_i}{\sum_{i=1}^{n} x_i^2}$.
Using linear algebra is also an acceptable solution.

b) What is the estimate $\hat\sigma$ of $\sigma$?

$\hat\sigma = \sqrt{\dfrac{\sum_{i=1}^{n} \hat{r}_i^2}{n - 1}} = \sqrt{\dfrac{\sum_{i=1}^{n} (y_i - \hat\beta x_i)^2}{n - 1}}$

c) Find the distribution of the corresponding estimators $\tilde\beta$, $\tilde\sigma$.

The distribution of $\tilde\sigma$ is $\sigma K_{n-1}$.
$\tilde\beta = \dfrac{\sum_{i=1}^{n} Y_i x_i}{\sum_{i=1}^{n} x_i^2} = \dfrac{\sum_{i=1}^{n} (\beta x_i + R_i) x_i}{\sum_{i=1}^{n} x_i^2} = \beta + \sum_{i=1}^{n} a_i R_i$ for constants $a_i = \dfrac{x_i}{\sum_{j=1}^{n} x_j^2}$.
We know (from the course notes) that $sd\left(\sum_{i=1}^{n} a_i R_i\right) = \sigma \sqrt{\sum_{i=1}^{n} a_i^2}$ when $sd(R_i) = \sigma$.
So $sd(\tilde\beta) = \sigma \sqrt{\sum_{i=1}^{n} a_i^2} = \sigma \sqrt{\dfrac{\sum_{i=1}^{n} x_i^2}{\left(\sum_{j=1}^{n} x_j^2\right)^2}} = \dfrac{\sigma}{\sqrt{\sum_{j=1}^{n} x_j^2}}$.
Since $\tilde\beta$ is a linear combination of independent Gaussian random variables, $\tilde\beta \sim G\left(\beta, \sigma / \sqrt{\sum_{j=1}^{n} x_j^2}\right)$.

d) Derive a 95% confidence interval for $\beta$.
A 95% CI for $\beta$ is of the form $\hat\beta \pm c\,SD(\hat\beta)$, where $SD(\hat\beta) = \dfrac{\hat\sigma}{\sqrt{\sum_{j=1}^{n} x_j^2}}$ and $c$ is a constant from the t-tables that satisfies $\Pr(t_{n-1} < c) = 0.975$ ...
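The formulas in question 3 can be checked numerically. The following is a minimal sketch, not part of the original solution: x and y are simulated, and the sample size, the value 2 for β and the value 1.5 for σ are hypothetical choices made only so the code runs. It computes the estimates and interval by hand and compares them with R's lm fitted without an intercept.

```r
# Hypothetical data for the model Y_i = beta * x_i + R_i (no intercept)
set.seed(231)
n <- 25
x <- runif(n, min = 1, max = 10)
y <- 2 * x + rnorm(n, sd = 1.5)      # assumed beta = 2, sigma = 1.5

beta_hat <- sum(x * y) / sum(x^2)            # least squares estimate
r_hat <- y - beta_hat * x                    # estimated residuals
sigma_hat <- sqrt(sum(r_hat^2) / (n - 1))    # divisor n - 1: one parameter fitted
se_beta <- sigma_hat / sqrt(sum(x^2))        # estimated SD of the estimator of beta
c_val <- qt(0.975, df = n - 1)               # Pr(t_{n-1} < c) = 0.975
ci_manual <- c(beta_hat - c_val * se_beta, beta_hat + c_val * se_beta)

# Same fit through the origin using lm; "0 +" removes the intercept
fit <- lm(y ~ 0 + x)
ci_lm <- confint(fit, level = 0.95)

print(ci_manual)
print(ci_lm)
```

Because the no-intercept fit estimates a single parameter, lm uses n − 1 residual degrees of freedom, matching the $t_{n-1}$ quantile in the interval above.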

This note was uploaded on 11/21/2011 for the course MATH STAT 231 taught by Professor Marsh during the Spring '10 term at Waterloo.
