

STAT 444/844/CM 464/764 Assignment #2, Winter 2013
Instructor: S. Chenouri
Due: March 6, 2013
Undergraduate students are only required to work on 4 out of 5 questions.

Problem 1) Compare the performance of forward selection, backward elimination, and all-subset selection on the Boston housing data, which are available in the R package MASS.
i) What are the final selected covariates for each of the three methods?
ii) What are the five-fold CV errors for each of the three methods?

Problem 2) (Overfitting and underfitting) Variable selection is important for regression, since problems arise from using either too many (irrelevant) or too few (omitted) variables in a regression model. Consider the linear regression model $y_i = \beta^T x_i + \varepsilon_i$, where the vector of covariates (input vector) is $x_i = (x_{i1}, \ldots, x_{ip})^T \in \mathbb{R}^p$ and the errors are independent and identically distributed (i.i.d.) with $E(\varepsilon_i) = 0$ and $\mathrm{Var}(\varepsilon_i) = \sigma^2$. Let $y = (y_1, \ldots, y_n)^T$ be the response vector and $X = (x_{ij};\ i = 1, \ldots, n,\ j = 1, \ldots, p)$ be the design matrix. Assume that only the first $p_0$ variables are important. Let $A = \{1, \ldots, p\}$ be the index set for the full model and $A_0 = \{1, \ldots, p_0\}$ be the index set for the true model. The true regression coefficients can be written $\beta^* = (\beta_{A_0}^{*\,T}, 0^T)^T$. Now consider three different modelling strategies:

Strategy I: Fit the full model. Denote the full design matrix by $X_A$ and the corresponding OLS estimator by $\hat{\beta}_A^{ols}$.
Strategy II: Fit the true model using the first $p_0$ covariates. Denote the corresponding design matrix by $X_{A_0}$ and the OLS estimator by $\hat{\beta}_{A_0}^{ols}$.
Strategy III: Fit a subset model using only the first $q$ covariates for some $q < p_0$. Denote the corresponding design matrix by $X_{A_1}$ and the OLS estimator by $\hat{\beta}_{A_1}^{ols}$.

1. One possible consequence of including irrelevant variables in a regression model is that the predictions are not efficient (i.e., have larger variances), though they are unbiased.
For any $x \in \mathbb{R}^p$, show that
$$E\big(\hat{\beta}_A^{ols\,T} x_A\big) = \beta_{A_0}^{*\,T} x_{A_0}, \qquad \mathrm{Var}\big(\hat{\beta}_A^{ols\,T} x_A\big) \geq \mathrm{Var}\big(\hat{\beta}_{A_0}^{ols\,T} x_{A_0}\big),$$
where $x_{A_0}$ consists of the first $p_0$ elements of $x$.

2. One consequence of excluding important variables in a linear model is that the predictions are biased, though they have smaller variances. For any $x \in \mathbb{R}^p$, show that
$$E\big(\hat{\beta}_{A_1}^{ols\,T} x_{A_1}\big) \neq \beta_{A_0}^{*\,T} x_{A_0}, \qquad \mathrm{Var}\big(\hat{\beta}_{A_1}^{ols\,T} x_{A_1}\big) \leq \mathrm{Var}\big(\hat{\beta}_{A_0}^{ols\,T} x_{A_0}\big),$$
where $x_{A_1}$ consists of the first $q$ elements of $x$.

Problem 3) In an enzyme kinetics study the velocity of a reaction ($Y$) is expected to be related to the concentration ($X$) as follows:
$$Y_i = \frac{\beta_0 x_i}{\beta_1 + x_i} + \varepsilon_i.$$
The dataset "Enzyme.txt" posted on D2L contains eighteen data points related to this study.
i) To obtain starting values for $\beta_0$ and $\beta_1$, observe that when the error term is ignored we have $Y_i' = \alpha_0 + \alpha_1 x_i'$, where $Y_i' = 1/Y_i$, $\alpha_0 = 1/\beta_0$, $\alpha_1 = \beta_1/\beta_0$, and $x_i' = 1/x_i$. Therefore fit a linear regression function to the transformed data to obtain initial estimates for $\beta_0$ and $\beta_1$ to be used in nls.
ii) Using the starting values obtained in part (i), find the least squares estimates of the parameters $\beta_0$ and $\beta_1$.
iii) Plot the estimated nonlinear regression function and the data. Does the fit appear to be adequate?
iv) Obtain the residuals and plot them against the fitted values and against $X$ on separate graphs. Also obtain a normal probability plot. What do your plots show?
v) Can you conduct an approximate formal lack-of-fit test here? Explain.
vi) Given that only 18 trials can be made, what are some advantages and disadvantages of considering fewer concentration levels but with some replications, as compared to considering 18 different concentration levels as was done here?
vii) Assume that the fitted model is appropriate and that large-sample inferences can be employed here. (1) Obtain an approximate 95 percent confidence interval for $\beta_0$. (2) Test whether or not $\beta_1 = 20$; use $\alpha = 0.05$. State the alternatives, decision rule, and conclusion.

The following questions are from the book (SJS) "A Modern Approach to Regression with R" by S. J. Sheather. This book is available from the library; I do recommend reading it for parts of the course. The dataset used in these questions is "cars04.csv", posted on D2L.

Problem 4) Do Exercise 5 in Chapter 3 of SJS.
Problem 5) Do Exercise 3 in Chapter 6 of SJS.
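The selection-by-CV procedure asked for in Problem 1 can be sketched generically. The assignment itself calls for R (package MASS with the Boston data); the Python/NumPy sketch below only illustrates the mechanics of greedy forward selection scored by five-fold CV error, on an invented synthetic dataset (all dimensions and coefficients here are made up for illustration).

```python
import numpy as np

# Synthetic stand-in for a regression dataset: only columns 0 and 2 matter.
rng = np.random.default_rng(1)
n, p = 100, 8
X = rng.standard_normal((n, p))
y = 3.0 * X[:, 0] - 2.0 * X[:, 2] + 0.5 * rng.standard_normal(n)

def cv_error(cols, k=5):
    """Mean squared 5-fold CV prediction error of OLS on the given columns."""
    idx = np.arange(n)
    errs = []
    for fold in np.array_split(idx, k):
        train = np.setdiff1d(idx, fold)
        A = np.column_stack([np.ones(len(train)), X[np.ix_(train, cols)]])
        bhat, *_ = np.linalg.lstsq(A, y[train], rcond=None)
        A_test = np.column_stack([np.ones(len(fold)), X[np.ix_(fold, cols)]])
        errs.append(np.mean((y[fold] - A_test @ bhat) ** 2))
    return float(np.mean(errs))

# Greedy forward selection: keep adding the covariate that most reduces
# CV error, and stop when no addition improves it.
selected, best = [], float("inf")
improved = True
while improved:
    improved = False
    scores = {j: cv_error(selected + [j]) for j in range(p) if j not in selected}
    j_best = min(scores, key=scores.get)
    if scores[j_best] < best:
        best = scores[j_best]
        selected.append(j_best)
        improved = True

print("selected columns:", sorted(selected), " CV error:", round(best, 3))
```

Backward elimination is the mirror image (start from all columns, drop the one whose removal most reduces CV error), and all-subset selection scores every subset, which is feasible here only because $p$ is small.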

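The bias/variance claims of Problem 2 can also be checked by simulation. A minimal Monte Carlo sketch in Python/NumPy (the setup is invented for illustration: $p = 6$ covariates of which only the first $p_0 = 2$ matter, an underfit model keeping $q = 1$, and $\sigma = 1$):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, p0, q = 50, 6, 2, 1
beta_star = np.array([2.0, -1.5, 0.0, 0.0, 0.0, 0.0])  # beta* = (beta*_{A0}, 0)
X = rng.standard_normal((n, p))   # fixed design, reused in every replication
x_new = np.ones(p)                # prediction point; x_{A0}, x_{A1} are prefixes
target = x_new[:p0] @ beta_star[:p0]   # true mean response at x_new

# Refit each strategy on 2000 fresh error draws and record its prediction.
preds = {"full (p=6)": [], "true (p0=2)": [], "underfit (q=1)": []}
for _ in range(2000):
    y = X @ beta_star + rng.standard_normal(n)   # sigma = 1
    for name, k in zip(preds, (p, p0, q)):
        bhat, *_ = np.linalg.lstsq(X[:, :k], y, rcond=None)
        preds[name].append(x_new[:k] @ bhat)

summary = {name: (float(np.mean(v) - target), float(np.var(v)))
           for name, v in preds.items()}
for name, (bias, var) in summary.items():
    print(f"{name:15s} bias = {bias:+.3f}  variance = {var:.4f}")
```

The empirical pattern matches the statements to be proved: the full and true models are (essentially) unbiased with the full model's prediction variance the largest, while the underfit model has the smallest variance but a clearly nonzero bias.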

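The reciprocal transformation in Problem 3(i) is the standard Lineweaver–Burk linearization of the Michaelis–Menten mean function; writing it out makes the starting-value recipe explicit:

```latex
\frac{1}{E(Y_i)}
  = \frac{\beta_1 + x_i}{\beta_0 x_i}
  = \underbrace{\frac{1}{\beta_0}}_{\alpha_0}
  + \underbrace{\frac{\beta_1}{\beta_0}}_{\alpha_1}\,
    \underbrace{\frac{1}{x_i}}_{x_i'}
```

so regressing $1/Y_i$ on $1/x_i$ by OLS yields $\hat{\alpha}_0, \hat{\alpha}_1$, and the starting values for nls are recovered as $\beta_0^{(0)} = 1/\hat{\alpha}_0$ and $\beta_1^{(0)} = \hat{\alpha}_1/\hat{\alpha}_0$.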
