Statistics 500 Fall 2009 Solutions to Homework 7

1. Regression Design

a) The s.e. of the slope is $\sigma / \sqrt{\sum (x_i - \bar{x})^2}$. When you are using n copies of the same set of X's, $\sum (x_i - \bar{x})^2$ for the entire data set can be written as $n \sum (x_i - \bar{x})^2$, where the sum is over only one copy of each of the unique densities. I gave you that sum, $\sum (x_i - \bar{x})^2 = 0.4557$. So, for n copies of the 9 densities, the s.e. of the slope is $\sigma / \sqrt{0.4557\, n}$. I gave you $\sigma = 0.25$ and the desired s.e. = 0.075. The rest is solving for n: $n = \sigma^2 / (\mathrm{se}^2 \times 0.4557) = 24.4$. Since we want se < 0.075, we need n = 25.

b) Same ideas, but with a different configuration of x's. Here, $\sum (x_i - \bar{x})^2$ for one copy of the two densities (0.001 and 0.686) is $(0.001 - 0.3435)^2 + (0.686 - 0.3435)^2 = 0.2346$. Hence, $n = \sigma^2 / (\mathrm{se}^2 \times 0.2346) = 47.3$, i.e. use n = 48.

* You need more replicates per density in the two-density design, but notice that you need many fewer blocks in total. In the first design (9 densities), you use a total of 25*9 = 225 blocks. In the second design (2 densities), you use a total of 48*2 = 96 blocks.

c) Not graded, since this was a 'think about the issues' question, for which there are many reasonable responses. I would be concerned that the two-density design assumes that the relationship is a straight line. It provides no data to assess that, either graphically or formally. You can always draw a line through 2 points, even if the true response is very nonlinear. I might recommend the two-density design if I knew that the response was a straight line (perhaps from previous studies or the physics of the relationship between density and log gain). It needs fewer blocks to achieve the same precision, so it is the 'cheaper' experiment. I might recommend the 9-density design when I needed to figure out an appropriate model for the mean response.

d) The lack of fit SS = 0.416 - 0.095 = 0.321 with 7 d.f. I get F = (0.321 / 7) / (0.095 / 81) = 39.1.
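The F ratio just computed can be checked with a few lines of Python. This is only a sketch of the arithmetic in part (d): the function name `lack_of_fit_f` and its argument names are illustrative, and reading 0.416 as the residual SS and 0.095 as the pure-error SS is an interpretation of the numbers quoted above.

```python
def lack_of_fit_f(ss_residual, ss_pure_error, df_lack_of_fit, df_pure_error):
    """Return (lack-of-fit SS, pure-error mean square, F ratio).

    Assumption: ss_residual is the residual SS of the fitted line and
    ss_pure_error is the within-density (pure error) SS, as in part (d).
    """
    ss_lack_of_fit = ss_residual - ss_pure_error
    ms_lack_of_fit = ss_lack_of_fit / df_lack_of_fit
    ms_pure_error = ss_pure_error / df_pure_error
    return ss_lack_of_fit, ms_pure_error, ms_lack_of_fit / ms_pure_error


# Values quoted in part (d): SS of 0.416 and 0.095, with 7 and 81 d.f.
ss_lof, ms_pure, f_ratio = lack_of_fit_f(0.416, 0.095, 7, 81)
print(f"lack-of-fit SS = {ss_lof:.3f}")
print(f"pure-error MS  = {ms_pure:.5f}")
print(f"F              = {f_ratio:.1f}")
```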
This is much larger than the 0.999 quantile of the $F_{7,60}$ distribution (or the $F_{7,120}$ distribution), so p < 0.001. There is very strong evidence of lack of fit.

e) The MSE from 4d is 0.00117, which is a lot smaller than the MSE from the original data (0.062). The two values are measuring different variabilities. The smaller value from part 4d is the variability between measurements on the same block. The larger value from the original data is the variability between measurements on different blocks. I would never recommend the design in 4d unless I knew there was absolutely no variability between two blocks of the same density.

Another way to think about these issues is in terms of observational and experimental units. This is not a randomized experiment, but you could imagine that density is 'assigned' to a block. So, the block is an experimental unit. In the original data, each of the 90 blocks is measured once, so a block is the observational unit. The assumption of independence is ok, because o.u. = e.u. In the part 4d study, the o.u. is a measurement, not the block. There are 90 measurements, but only 9 blocks. There is a problem: the o.u. is not the same as the e.u. Part 4d illustrates 'cluster effects'. Each block is a cluster.

Either (or any other reasonable) explanation is acceptable.

2. Inference on correlations: fuel efficiency

(a) We have two continuous random variables where neither variable could be thought of as a response. We want a measure of association and are not trying to predict either variable. Pearson's correlation coefficient seems to be an adequate measure of linear association.

(b) r = 0.942. The t statistic is $t = r \sqrt{n-2} / \sqrt{1 - r^2} = 9.31$ with 11 d.f. The p-value is < 0.001.

* You can also get the p-value directly from the SAS output.
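As a check on the arithmetic in part (b), the t statistic can be recomputed from r and n alone. The snippet below is a small Python sketch; n = 13 is inferred from the 11 d.f. reported above, and the function name `correlation_t` is made up for illustration.

```python
import math


def correlation_t(r, n):
    """t statistic and d.f. for testing H0: rho = 0 from a sample correlation r on n pairs."""
    df = n - 2
    t = r * math.sqrt(df) / math.sqrt(1.0 - r * r)
    return t, df


# r is quoted in part (b); n = 13 is an inference from the 11 d.f.
t_stat, df = correlation_t(r=0.942, n=13)
print(f"t = {t_stat:.2f} on {df} d.f.")
```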
(c) A 99% CI for $\tfrac{1}{2}\log\tfrac{1+\rho}{1-\rho}$ is $\tfrac{1}{2}\log\tfrac{1+r}{1-r} \pm z_{0.995}\,\tfrac{1}{\sqrt{n-3}} = 1.753 \pm 2.576\,(0.316) = (0.938, 2.568)$. A 99% CI for $\rho$ is $\left(\tfrac{e^{2 \times 0.938} - 1}{e^{2 \times 0.938} + 1},\ \tfrac{e^{2 \times 2.568} - 1}{e^{2 \times 2.568} + 1}\right) = (0.734, 0.988)$.

(d) $H_0: \rho = 0.90$ vs $H_a: \rho \neq 0.90$. With $Z_r = \tfrac{1}{2}\log\tfrac{1+r}{1-r}$, the test statistic is $z = \sqrt{n-3}\,\left[Z_r - \tfrac{1}{2}\log\tfrac{1+\rho_0}{1-\rho_0}\right] = \sqrt{10}\,\left[1.753 - \tfrac{1}{2}\log\tfrac{1+0.9}{1-0.9}\right] = 0.888$, p-value = 0.38. There is no evidence that the population correlation differs from 0.90.

(e) The test in (d) is a one-sample test, comparing a sample correlation against a hypothesized value for the population correlation. To compare two population correlations based on two samples, we need to account for the variability in the estimate of the second sample correlation. The two samples (luxury and compact cars) are independent, so we should use a two-sample test. Let $\rho_1$ and $\rho_2$ stand for the correlations of the two populations, respectively. Then $Z_{r_1} \sim N\left(\tfrac{1}{2}\log\tfrac{1+\rho_1}{1-\rho_1},\ \tfrac{1}{n_1-3}\right)$ and $Z_{r_2} \sim N\left(\tfrac{1}{2}\log\tfrac{1+\rho_2}{1-\rho_2},\ \tfrac{1}{n_2-3}\right)$. Under $H_0: \rho_1 = \rho_2$, $Z_{r_1} - Z_{r_2} \sim N\left(0,\ \tfrac{1}{n_1-3} + \tfrac{1}{n_2-3}\right)$. Therefore an appropriate statistic would be $z = (Z_{r_1} - Z_{r_2}) / \sqrt{2/(13-3)} = 0.623$, p-value = 0.53. There is no evidence of a difference in correlations.

3. Review of matrix operations

For both parts, the starting point is the definition $H = X (X^T X)^{-1} X^T$.

(a) H is symmetric if $H = H^T$. Let's see: $H^T = [X (X^T X)^{-1} X^T]^T = (X^T)^T [(X^T X)^{-1}]^T X^T = X (X^T X)^{-1} X^T = H$. Yes, H is symmetric. Note: $[(X^T X)^{-1}]^T = (X^T X)^{-1}$ because $X^T X$ is symmetric (given in the problem).

(b) H is idempotent if $H H = H$. Let's see: $X (X^T X)^{-1} X^T X (X^T X)^{-1} X^T = X (X^T X)^{-1} I X^T = X (X^T X)^{-1} X^T = H$.

4. Weighing bears.

(a) As the head length increases, the size and the weight of the bear increase. The relationship between weight and headlen doesn't appear linear; the increment is larger for higher values of headlen. It might be quadratic.

[Figure: scatterplot of weight (vertical axis) against headlen (horizontal axis)]

You may notice other possible difficulties with the models we've been considering (e.g.
the apparent increase in variability at large head lengths). If you log transformed weight, then replotted the data, you would probably log transform all the X variables. We'll talk a lot more about these issues later.

(b) $\hat{\beta}_H = 47.3896$ pounds per inch. The estimated average weight of a bear increases by 47.4 pounds for a 1-inch increment in the length of the head. People with a biological background might make an assessment about 47 pounds being a "reasonable" value or not, but even without that background we can say that having a positive number makes sense: the bigger the head, the bigger the bear, and therefore the heavier.

(c) $\hat{\beta}_H = -4.7914$ pounds per inch. Something about the relationship between the explanatory variables is making the coefficient for headlen negative, which does not make sense. The smaller the head, the heavier the bear?

(d) No. If headlen were uncorrelated with the other predictor variables then $\hat{\beta}_H$ would be the same, but if some correlation with the other predictors exists then the coefficient would change. In particular, the correlation between headlen and chest is 0.86, and the correlation between headlen and neck is 0.88. (The correlation between chest and neck is 0.93.)

(e) t-statistic = 1.0913 with p-value = 0.2804. You can write many possible conclusions. Some focus on the parameter estimate, others on predictions from the model.
1) There is no evidence that head length of a bear is associated with its average weight, after adjusting for neck and chest size.
2) Adding head length to a model with chest size and neck size does not significantly improve the prediction of bear weight.
3) No evidence that head length needs to be included in a model that predicts weight from chest and neck measurements.

(f) Compare the reduced model $\text{weight} = \beta_0 + \beta_C\,\text{chest} + \beta_N\,\text{neck} + \varepsilon$ with the full model $\text{weight} = \beta_0 + \beta_H\,\text{headlen} + \beta_C\,\text{chest} + \beta_N\,\text{neck} + \varepsilon$.
$F = \dfrac{(SSE_{reduced} - SSE_{full}) / r}{MSE_{full}} = \dfrac{(49815.41 - 48656.42) / 1}{973.1} = 1.19$, p-value = 0.28. I didn't ask for a conclusion here. This is the same test as in 2e, so the same conclusions are appropriate.

(g) Compare the reduced model $\text{weight} = \beta_0 + \beta_C\,\text{chest} + \varepsilon$ with the full model. $F = \dfrac{(SSE_{reduced} - SSE_{full}) / r}{MSE_{full}} = \dfrac{(56895.06 - 48656.42) / 2}{973.1} = 4.23$, p-value = 0.02. Again, there is more than one conclusion; either is appropriate:
1) Predictions of the average weight of a bear are improved by including either (or both) of the neck and head-length variables.
2) One or more of the regression slopes for neck and head length are nonzero.

(h) F-statistic: 252.7 on 3 and 50 degrees of freedom; the p-value is < 0.0001. At least one of these three variables is useful to predict the weight of a bear.

* This is one of the default tests provided by SAS. You could also get it by comparing models.

(i) The estimated coefficients are: $\hat{\beta}_0 = -267$, $\hat{\beta}_1 = 9.29$ (chest), $\hat{\beta}_2 = 5.77$ (neck). Predicted weights (and s.e.'s for the next part) are in the table:

Bear   Chest   Neck   Predicted weight   s.e.
A      35      20     173 lbs            4.3 lbs
B      55      30     417 lbs            10.3 lbs
C      50      15     284 lbs            30.3 lbs

(j) The s.e. for the predicted weight is given by $\sqrt{MSE \; x_n^T (X^T X)^{-1} x_n}$, where $x_n^T$ = [1 35 20], [1 55 30], or [1 50 15].

(k) The explanation is obvious when you plot the chest size and neck length and superimpose the prediction locations (plot below). Bear C is much farther from the center of the data than bear B. Even though each dimension is individually closer to the mean, bear C has an unusually small neck for the chest size (or an unusually large chest for the neck size). That puts it a long way from the mean and increases the s.e.

* This is one of the reasons why it is hard to identify X outliers in multiple regression. A multivariate outlier (here, point C) may not be a univariate outlier.

[Figure: neck length (vertical axis) against chest size (horizontal axis), with the prediction locations A, B and C superimposed]
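The two model-comparison F tests in parts (f) and (g) follow the same recipe, which can be sketched in Python as below. The SSE and MSE values are the ones quoted in those parts; the helper name `partial_f` is hypothetical. P-values are left out because they need an F distribution routine that the Python standard library does not provide.

```python
def partial_f(sse_reduced, sse_full, n_dropped, mse_full):
    """Extra-sum-of-squares F statistic for dropping n_dropped terms from the full model."""
    return ((sse_reduced - sse_full) / n_dropped) / mse_full


# Part (f): drop headlen from the full model (1 term).
f_headlen = partial_f(49815.41, 48656.42, 1, 973.1)

# Part (g): drop headlen and neck from the full model (2 terms).
f_neck_headlen = partial_f(56895.06, 48656.42, 2, 973.1)

print(f"(f) F = {f_headlen:.2f}")
print(f"(g) F = {f_neck_headlen:.2f}")

# When a single term is dropped, F is the square of that term's t statistic,
# so the t value quoted in part (e) should give the same number as part (f).
t_headlen = 1.0913
print(f"t^2  = {t_headlen ** 2:.2f}")
```

The agreement between the part (f) F statistic and the squared t statistic from part (e) is why the two tests lead to the same conclusions.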