STATS-305, FALL 2010 SOLUTIONS FOR HW 2

Instructor: Professor Trevor Hastie
TAs: Rahul Mazumder and Luo Lu
For questions about grading, please see Rahul Mazumder or Luo Lu.

(1) (a) Weisberg Exercise 3.2.

(i) We consider regression on $X = (x_1 \cdots x_p)$ and also on $X_-$, the first $p-1$ columns of $X$. For a matrix $A$ we let $[A]$ denote the column span. Let $\hat\beta$ and $\hat\beta_-$ denote the coefficients of regression of $y$ on $X$ and $X_-$ respectively:
$$y = X\hat\beta + \hat e, \quad \hat e \perp [X],$$
$$y = X_-\hat\beta_- + y_{\cdot[p-1]}, \quad y_{\cdot[p-1]} \perp [X_-].$$
Let $\hat\alpha$ denote the coefficient of regression of $x_p$ on $X_-$,
$$x_p = X_-\hat\alpha + x_{\cdot[p-1]}, \quad x_{\cdot[p-1]} \perp [X_-],$$
and finally let $\hat\gamma$ denote the coefficient of regression of $y_{\cdot[p-1]}$ on $x_{\cdot[p-1]}$,
$$y_{\cdot[p-1]} = x_{\cdot[p-1]}\hat\gamma + \tilde e, \quad \tilde e \perp [x_{\cdot[p-1]}].$$
But $\tilde e = y_{\cdot[p-1]} - \hat\gamma x_{\cdot[p-1]} \perp [X_-]$, so $\tilde e \perp [X_-, x_{\cdot[p-1]}] = [X]$. Thus
$$y = (X_-\hat\beta_- + \hat\gamma x_{\cdot[p-1]}) + \tilde e,$$
where the first term is in $[X]$ while the second term is in $[X]^\perp$, so we must have $\hat e = \tilde e$ and
$$X\hat\beta = X_-(\hat\beta_- - \hat\gamma\hat\alpha) + \hat\gamma x_p,$$
which shows $\hat\beta_p = \hat\gamma$. In particular, the result of the exercise follows by taking $X = (1_n \; x_2 \; x_1)$.

(ii) Let $X$ be $n \times p$ of full rank. Under the normal model $y \mid X \sim N(X\beta, \sigma^2 I_n)$ we have
$$\frac{\hat\beta_1 - \beta_1}{\mathrm{se}(\hat\beta_1)} \sim t_{n-p},$$
where $\mathrm{se}(\hat\beta_1)$ is computed using $\hat\sigma^2 = \mathrm{RSS}/(n-p)$. This is the $t$ statistic in the usual regression of $y$ on $X$. On the other hand, when we compute the regression of $y_{\cdot[p-1]}$ on $x_{\cdot[p-1]}$, the $t$ statistic is computed under the (incorrect) model $y_{\cdot[p-1]} \sim N(x_{\cdot[p-1]}\gamma, \sigma^2 I_n)$, and so $\mathrm{se}(\hat\beta_1)$ will be computed using $\hat\sigma^2 = \mathrm{RSS}/(n-1)$, i.e. the residual degrees of freedom used is incorrect.

(b) Weisberg Exercise 3.3. We consider the mean function $E[Y \mid X_1, X_2] = \beta_0 + \beta_1 X_1 + \beta_2 X_2$. Let $\hat\beta$ be the coefficient of regression of $y$ on $x_1$ (with intercept) and let $\hat\alpha$ be the coefficient of regression of $x_2$ on $x_1$ (with intercept).

(i) The added-variable plot for $x_2$ after $x_1$ is the plot of $y_{\cdot 1} = y - \hat\beta_0 - \hat\beta_1 x_1$ against $x_{2\cdot 1} = x_2 - \hat\alpha_0 - \hat\alpha_1 x_1$, the residual from the regression of $x_2$ on $x_1$ (with intercept). Since $x_1 = 2.2\, x_2$ with no error, $x_{2\cdot 1} = 0_n$, so the added-variable plot will be a series of $n$ points arranged on the vertical line $x_{2\cdot 1} = 0$.

(ii) If instead $y = 3 x_1$ with no error, $y_{\cdot 1} = 0_n$, so the added-variable plot will be a series of $n$ points arranged on the horizontal line $y_{\cdot 1} = 0$.

(iii) If $\hat\beta_1 = \hat\alpha_1 = 0$, the plot of $y_{\cdot 1}$ on $x_{2\cdot 1}$ will be the same as the plot of $y$ on $x_2$ except for an overall shift of $(-\hat\alpha_0, -\hat\beta_0)$. Another way to answer this problem: since the added-variable plot shows the effect of $X_2$ adjusted for the effect of $X_1$, it will have exactly the same shape as the scatterplot of $Y$ on $X_2$ (ignoring $X_1$) only when both $X_2$ and $Y$ are uncorrelated with $X_1$. In this case, the residuals from regressing $X_2$ on $X_1$ and $Y$ on $X_1$ are just demeaned versions of $X_2$ and $Y$, which does not affect the shape of the plot.

(iv) Suppose we measure the variability of a vector $v \in \mathbb{R}^n$ by $\sum_{i=1}^n (v_i - \bar v)^2$. The vertical variability in the original plot of $y$ on $x_2$ is simply the sum of squared residuals from the regression of $y$ on $1_n$, while the vertical variability in the added-variable plot is the sum of squared residuals from the regression of $y$ on $1_n$ and $x_1$. Thus the vertical variability in the added-variable plot cannot be greater.
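As a quick numerical illustration (not part of the original exercise), the identity $\hat\beta_p = \hat\gamma$ from (1)(a)(i) can be checked in R on simulated data; the sample size, dimension, and coefficients below are arbitrary.

## numerical check of (1)(a)(i) on simulated data (illustrative only)
set.seed(1)
n <- 50; p <- 4
X <- cbind(1, matrix(rnorm(n * (p - 1)), n, p - 1))   # include an intercept column
y <- drop(X %*% c(1, 2, -1, 0.5)) + rnorm(n)
Xminus <- X[, -p]                                     # first p-1 columns of X
beta.hat <- coef(lm(y ~ X - 1))                       # full regression of y on X
y.res  <- resid(lm(y ~ Xminus - 1))                   # y residual on [X_minus]
xp.res <- resid(lm(X[, p] ~ Xminus - 1))              # x_p residual on [X_minus]
gamma.hat <- coef(lm(y.res ~ xp.res - 1))             # regression of residuals on residuals
c(beta.hat[p], gamma.hat)                             # the two coefficients agree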
(2) (a) For any function $g \in L^2$, we expand
$$(Y - g(X))^2 = (Y - f(X))^2 + (f(X) - g(X))^2 + 2(Y - f(X))(f(X) - g(X)).$$
By the law of iterated expectations,
$$E[2(Y - f(X))(f(X) - g(X))] = 2E\big[(f(X) - g(X))\, E[(Y - f(X)) \mid X]\big] = 0,$$
and so
$$E[(Y - g(X))^2] = E[(Y - f(X))^2] + E[(f(X) - g(X))^2] \ge E[(Y - f(X))^2].$$

(b) Here $f(X) = E(Y \mid X)$. We have
$$E_{X,Y}[(Y - X^T\beta)^2] = E_{X,Y}[Y^2] - 2\, E_{X,Y}[Y X^T]\beta + \beta^T E_{X}[X X^T]\beta.$$
Setting the gradient with respect to $\beta$ to zero,
$$0 = -2\, E_{X,Y}[X Y] + 2\, E_{X}[X X^T]\beta,$$
$$\beta^* = (E_{X}[X X^T])^{-1} E_{X,Y}[X Y] = (E_{X}[X X^T])^{-1} E_{X}[X f(X)].$$
It can be shown, as in part (a), that
$$E_{X,Y}[(Y - X^T\beta)^2] = E_{X,Y}[(Y - f(X) + f(X) - X^T\beta)^2] = \text{constant} + E_{X}[(f(X) - X^T\beta)^2],$$
so $\beta^*$ is the best linear approximation to $f(X)$ in mean squared error.

(c) $\hat\beta = (X^T X)^{-1} X^T Y$, so
$$E_{X,Y}(\hat\beta) = E_X\big[E_Y\big((X^T X)^{-1} X^T Y \mid X\big)\big] = E_X\big[(X^T X)^{-1} X^T E_Y(Y \mid X)\big] = E_X\big[(X^T X)^{-1} X^T f(X)\big].$$
In general, $\hat\beta$ is not an unbiased estimator for $\beta^*$.

(3) Problem 3. Here we provide a simple method for solving the problem. The linear regression model in this problem is
$$E(Y \mid X) = \beta_0 + \sum_{i=1}^{10} x_i \beta_i,$$
where $\beta_1 = \cdots = \beta_5$, $\beta_6 = \cdots = \beta_{10}$, and $\beta_0$ is the intercept. Let $\beta_i = \beta^{(1)}$ for $i = 1, \ldots, 5$ and $\beta_i = \beta^{(2)}$ for $i = 6, \ldots, 10$. Under these constraints the model becomes
$$E(Y \mid X) = \beta_0 + \Big(\sum_{i=1}^{5} x_i\Big)\beta^{(1)} + \Big(\sum_{i=6}^{10} x_i\Big)\beta^{(2)}. \tag{0.1}$$
The above model can be interpreted as a "collapsed version" of the original problem, with the two new predictors given by (a) the sum of the first 5 columns of gene measurements and (b) the sum of the next 5 columns of gene measurements. We would like to mention that the same problem can be solved via the general principle of obtaining least squares estimates under linear side-constraints, a topic described in class. However, that method is unnecessarily complicated for this problem, given the simple solution above.
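For concreteness, here is a minimal sketch of the collapsed fit in R. The data frame gene and its column names x1, ..., x10 and y are hypothetical placeholders, since the problem's data set is not reproduced here.

## collapsed model of (0.1): two summed predictors plus an intercept
gene$z1 <- rowSums(gene[, paste0("x", 1:5)])     # sum of the first 5 gene columns
gene$z2 <- rowSums(gene[, paste0("x", 6:10)])    # sum of the next 5 gene columns
fit <- lm(y ~ z1 + z2, data = gene)              # coefficients: beta0, beta^(1), beta^(2)
summary(fit)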
(4) Problem 4. The code for the problem is provided in the Appendix. The scatterplot matrix is shown in Fig. 1. Using lpsa as the response, we fit a multiple regression model using the other variables as predictors.

> summary.lm(lmout)

Call:
lm(formula = lpsa ~ lcavol + lweight + age + lbph + svi + lcp + gleason + pgg45)

Residuals:
     Min       1Q   Median       3Q      Max
-1.73312 -0.37132 -0.01699  0.41417  1.63812

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept)  0.669217   1.296385   0.516  0.60700
lcavol       0.587018   0.087920   6.677 2.11e-09 ***
lweight      0.454465   0.170012   2.673  0.00896 **
age         -0.019636   0.011173  -1.758  0.08231 .
lbph         0.107053   0.058449   1.832  0.07040 .
svi          0.766170   0.244308   3.136  0.00233 **
lcp         -0.105479   0.091013  -1.159  0.24961
gleason      0.045151   0.157464   0.287  0.77499
pgg45        0.004525   0.004421   1.024  0.30887
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.7084 on 88 degrees of freedom
Multiple R-squared: 0.6548,     Adjusted R-squared: 0.6234
F-statistic: 20.86 on 8 and 88 DF,  p-value: < 2.2e-16

[Figure 1. Exercise 4: scatterplot matrix of the prostate variables (lcavol, lweight, age, lbph, svi, lcp, gleason, pgg45, lpsa).]

It is tempting to "throw away" multiple variables after looking at the above table, but this is not right, due to the correlation among the predictors, which is apparent from the pairs() plot. Moreover, each Pr(>|t|) value above is the p-value for that single variable after adjusting for all the remaining variables in the model. The strategy is therefore to examine the variables one at a time, interactively, removing them from the model (or bringing them back), looking at the model summary and also accounting for the correlations among the predictors. We would like to note that this strategy is a heuristic, and may not lead to the best few predictors in the model: searching for the best few predictors is a combinatorially hard problem.
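One convenient way to support this kind of interactive examination is sketched below. It uses the prostate data frame and the full fit lmout from the Appendix code; drop1 is a standard R utility, not the method used in the displays that follow.

round(cor(prostate[, c("lcavol", "lweight", "lcp", "gleason", "pgg45")]), 2)  # predictor correlations
drop1(lmout, test = "F")   # single-term deletions from the current model, with F (equivalently t) tests

Below we instead proceed by refitting the model by hand after each deletion, as in the Appendix code.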
We notice that gleason has the smallest t-value (largest tail probability). We examine the effect of removing it. The R-squared value is almost unchanged, which supports removing gleason from the regression. Additionally, an investigation of the correlation matrix of the predictors shows that gleason and pgg45 are fairly correlated, with correlation 0.7519. This suggests that pgg45 is a surrogate for gleason, as seen in the following display in the change of the tail probability for pgg45.

> summary.lm(lmout)

Call:
lm(formula = lpsa ~ lcavol + lweight + age + lbph + svi + lcp + pgg45)

Residuals:
     Min       1Q   Median       3Q      Max
-1.73114 -0.38137 -0.01731  0.43360  1.63514

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept)  0.953865   0.829437   1.150  0.25322
lcavol       0.591612   0.086001   6.879 8.07e-10 ***
lweight      0.448288   0.167770   2.672  0.00897 **
age         -0.019335   0.011066  -1.747  0.08404 .
lbph         0.107671   0.058107   1.853  0.06720 .
svi          0.757744   0.241281   3.141  0.00229 **
lcp         -0.104487   0.090477  -1.155  0.25124
pgg45        0.005318   0.003433   1.549  0.12488
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.7048 on 89 degrees of freedom
Multiple R-squared: 0.6544,     Adjusted R-squared: 0.6273
F-statistic: 24.08 on 7 and 89 DF,  p-value: < 2.2e-16

From the above display, lcp has the largest tail probability, so we try removing it. The R-squared of the resulting model shows almost no change from the previous one, which suggests that lcp was perhaps not important.
> summary.lm(lmout)

Call:
lm(formula = lpsa ~ lcavol + lweight + age + lbph + svi + pgg45)

Residuals:
       Min         1Q     Median         3Q        Max
-1.777e+00 -4.171e-01  1.453e-05  4.067e-01  1.597e+00

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept)  0.980024   0.830663   1.180  0.24119
lcavol       0.545765   0.076431   7.141 2.31e-10 ***
lweight      0.449446   0.168078   2.674  0.00890 **
age         -0.017469   0.010967  -1.593  0.11472
lbph         0.105754   0.058191   1.817  0.07249 .
svi          0.641671   0.219756   2.920  0.00442 **
pgg45        0.003528   0.003068   1.150  0.25332
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.7061 on 90 degrees of freedom
Multiple R-squared: 0.6493,     Adjusted R-squared: 0.6259
F-statistic: 27.77 on 6 and 90 DF,  p-value: < 2.2e-16

We now remove pgg45 from the above model, based on the above tail probabilities. We see that lweight (though still significant) undergoes a large change in its tail probability.

> summary.lm(lmout)

Call:
lm(formula = lpsa ~ lcavol + lweight + age + lbph + svi)

Residuals:
      Min        1Q    Median        3Q       Max
-1.835021 -0.393973  0.004167  0.463354  1.578880

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept)  0.95094    0.83174   1.143 0.255911
lcavol       0.56560    0.07459   7.583 2.77e-11 ***
lweight      0.42369    0.16687   2.539 0.012815 *
age         -0.01489    0.01075  -1.385 0.169564
lbph         0.11184    0.05805   1.927 0.057160 .
svi          0.72096    0.20902   3.449 0.000854 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.7073 on 91 degrees of freedom
Multiple R-squared: 0.6441,     Adjusted R-squared: 0.6245
F-statistic: 32.94 on 5 and 91 DF,  p-value: < 2.2e-16

Removing age from the above model, we get:

Call:
lm(formula = lpsa ~ lcavol + lweight + lbph + svi)

Residuals:
     Min       1Q   Median       3Q      Max
-1.82650 -0.42273  0.04368  0.47040  1.48531

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  0.14555    0.59747   0.244  0.80807
lcavol       0.54960    0.07405   7.422 5.64e-11 ***
lweight      0.39087    0.16600   2.355  0.02067 *
lbph         0.09009    0.05617   1.604  0.11212
svi          0.71174    0.20996   3.390  0.00103 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.7108 on 92 degrees of freedom
Multiple R-squared: 0.6366,     Adjusted R-squared: 0.6208
F-statistic: 40.29 on 4 and 92 DF,  p-value: < 2.2e-16

We could continue and remove further variables one by one, as done above. But we choose to stop here, or at the model that still includes age, based on the larger drops in the adjusted R-squared values across successive models. We would also like to note that it may be tempting to remove the intercept from the model based on its tail probability in the above display; however, the intercept is unlike the other predictors (in a sense, the model with an intercept alone is the "null model"), so we keep it in the model. We note that this is a rather ad hoc stopping criterion; in passing, one could also explore methods like cross-validation to choose the number of variables in the model, i.e. the stopping criterion.

[Figure 2. Exercise 5: coefficient profiles (COEFFICIENTS vs RHO VALUES).]

(5) Problem 5.

Part (a). $\mathrm{Cov}(X_j, X_{j'})$ equals $\sigma_Z^2$ if $j \ne j'$ and $1 + \sigma_Z^2$ if $j = j'$. Thus for $j \ne j'$,
$$\mathrm{Corr}(X_j, X_{j'}) = \frac{\sigma_Z^2}{1 + \sigma_Z^2}.$$
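As a quick sanity check (not part of the original solution), the Part (a) correlation can be verified by simulation; the sample size and value of $\sigma_Z^2$ below are arbitrary.

## empirical vs. theoretical correlation for X_j = tilde-X_j + Z
set.seed(1)
n <- 1e5; varz <- 0.5
Z <- rnorm(n, sd = sqrt(varz))
X <- matrix(rnorm(n * 3), n, 3) + Z      # Z is recycled down the three columns
cor(X)[1, 2]                             # close to the theoretical value
varz / (1 + varz)                        # theoretical correlation, = 1/3 here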
[Figure 3. Exercise 5: averaged correlations vs ρ (AVERAGED CORRELATION vs RHO VALUES).]

Part (b). The (unconditional) variance of $f(X) = X^T\beta = \beta^T X$ is
$$\beta^T \Sigma\, \beta = \begin{pmatrix} 1 & 1 & 1 \end{pmatrix} \begin{pmatrix} 1+\sigma_Z^2 & \sigma_Z^2 & \sigma_Z^2 \\ \sigma_Z^2 & 1+\sigma_Z^2 & \sigma_Z^2 \\ \sigma_Z^2 & \sigma_Z^2 & 1+\sigma_Z^2 \end{pmatrix} \begin{pmatrix} 1 \\ 1 \\ 1 \end{pmatrix} = 3(1 + 3\sigma_Z^2).$$

Part (c). We create $X$ using the procedure described in (a), with
$$\frac{\sigma_Z^2}{1 + \sigma_Z^2} = \rho, \quad \text{i.e. } \sigma_Z^2 = \rho/(1-\rho).$$
We then generate noise $\epsilon_i$ i.i.d. normal with variance specified by the signal-to-noise ratio. The code is presented in Appendix 2.2.

Parts (d)-(f). Please see the code in Appendix 2.2. The figures for Part (f) are Fig. 2 and Fig. 3. Fig. 2 shows that with increasing ρ the coefficient profiles become more and more erratic. This is because the variance of the estimates becomes larger, since the model matrix becomes close to singular.

1. Acknowledgments

Portions of this solution set are based on those of Nike Sun and Robert Chang; thanks!

2. Appendix

2.1. Code for Problem 4.

prostate <- read.table("http://www-stat.stanford.edu/~hastie/stats305/Data/lprostate.dat",
                       header=TRUE, row.names=1)
pairs(prostate)
attach(prostate)
cor(prostate)
lmout <- lm(lpsa ~ lcavol + lweight + age + lbph + svi + lcp + gleason + pgg45, data=prostate)
summary.lm(lmout)
lmout <- lm(lpsa ~ lcavol + lweight + age + lbph + svi + lcp + pgg45)
### alternative: lmout <- update(lmout, ~ . - gleason)
summary.lm(lmout)
lmout <- lm(lpsa ~ lcavol + lweight + age + lbph + svi + lcp + gleason)
summary.lm(lmout)
lmout <- lm(lpsa ~ lcavol + lweight + age + lbph + svi + pgg45)
summary.lm(lmout)
lmout <- lm(lpsa ~ lcavol + lweight + age + lbph + svi)
summary.lm(lmout)
lmout <- lm(lpsa ~ lcavol + lweight + lbph + svi)
summary.lm(lmout)

2.2. Code for Problem 5.
Here are the code templates for the different parts.

Part (c):

## N: sample size; rho: correlation coefficient among the predictors;
## SNR: signal-to-noise ratio; RNseed: random number seed
gendata <- function(N, rho, SNR, RNseed=2010) {
  set.seed(RNseed)
  # build model with 3 predictors, using parts (a) and (b)
  X0 <- matrix(rnorm(N*3), N, 3)     # these are tilde-X of part (a)
  Z0 <- rnorm(N)
  ep0 <- rnorm(N)
  varz <- rho/(1 - rho)
  X <- X0 + sqrt(varz) * Z0          # R recycles this vector to fill a matrix
  Y <- X %*% rep(1, 3) + ep0 * sqrt(3 * (1 + 3 * varz)/SNR)
  # the factor sqrt(3 * (1 + 3 * varz)/SNR) is required to set the SNR at the desired level
  return(list(X=X, y=Y))
}

Part (d):

coefcor <- function(N, rhovec, SNR, RNseed=2010) {
  coef.matrix <- array(0, dim=c(3, length(rhovec)))
  i <- 0
  for (rho in rhovec) {
    i <- i + 1
    data <- gendata(N, rho, SNR, RNseed)
    lsout <- lm(y ~ X - 1, data=data)    # -1 removes the intercept
    coef.matrix[, i] <- coef(lsout)
  }
  return(coef.matrix)
}

Parts (e)-(f):

coefcor(N=10, rhovec=c(0.1, 0.2), SNR=1)
rhovec <- seq(from=0, to=0.95, length=20)
a <- coefcor(30, rhovec, 1, RNseed=20101)
matplot(rhovec, t(a), lty=1, ylim=c(-8, 8), ylab="COEFFICIENTS", xlab="RHO VALUES")

## part (f)
rhovec <- seq(0, 0.95, length.out=20)
a <- coefcor(30, rhovec, 1, 1)
matplot(rhovec, t(a), type='l', ylim=c(-8, 8), ylab="COEFFICIENTS", xlab="RHO VALUES")
## define a matrix for each coefficient (100 runs, one column per rho value)
coefmat <- array(0, dim=c(3, length(rhovec), 100))   # 3-dimensional array
for (i in 1:100) {
  a <- coefcor(30, rhovec, 1, i)
  coefmat[,,i] <- a
  matlines(rhovec, t(a))
}
ave_cor <- rep(0, length(rhovec))
# vector holding the averaged correlation for every value of rho
coefmat <- aperm(coefmat, c(3, 1, 2))   # rearrange the indices for convenience
### the next line shows how you can create your own functions for use in apply, sapply, etc.
ave_cor <- apply(coefmat, 3, function(x) { a <- cor(x); mean(a[row(a) > col(a)]) })
plot(rhovec, ave_cor, xlab="RHO VALUES", ylab="AVERAGED CORRELATION")