STAT444/844/CM464/764 Assignment #2, Winter 2013. Instructor: S. Chenouri. Due: March 6, 2013. Undergraduate students are only required to work on 4 out of 5 questions.
Problem 2) (Overfitting and underfitting) Variable selection is important for regression, since there are problems in using either too many (irrelevant) or too few (omitted) variables in a regression model. Consider the linear regression model y_i = β^T x_i + ε_i, where the vector of covariates (input vector) is x_i = (x_{i1}, ..., x_{ip})^T ∈ R^p and the errors are independent and identically distributed (i.i.d.), satisfying E(ε_i) = 0 and Var(ε_i) = σ². Let y = (y_1, ..., y_n)^T be the response vector and X = (x_{ij}; i = 1, ..., n, j = 1, ..., p) be the design matrix.
Assume that only the first p_0 variables are important. Let A = {1, ..., p} be the index set for the full model and A_0 = {1, ..., p_0} be the index set for the true model. The true regression coefficients can be denoted as β* = (β*_{A_0}^T, 0^T)^T. Now consider three different modelling strategies:
Strategy I: Fit the full model. Denote the full design matrix by X_A and the corresponding OLS estimator by β̂^{ols}_A.

Strategy II: Fit the true model using the first p_0 covariates. Denote the corresponding design matrix by X_{A_0} and the OLS estimator by β̂^{ols}_{A_0}.

Strategy III: Fit a subset model using only the first q covariates, for some q < p_0. Denote the corresponding design matrix by X_{A_1} and the OLS estimator by β̂^{ols}_{A_1}.
1. One possible consequence of including irrelevant variables in a regression model is that the predictions are not efficient (i.e., have larger variances), though they are unbiased. For any x ∈ R^p, show that

   E(β̂^{ols T}_A x) = β*_{A_0}^T x_{A_0},   Var(β̂^{ols T}_A x) ≥ Var(β̂^{ols T}_{A_0} x_{A_0}),

   where x_{A_0} consists of the first p_0 elements of x.
2. One consequence of excluding important variables in a linear model is that the predictions are biased, though they have smaller variances. For any x ∈ R^p, show that

   E(β̂^{ols T}_{A_1} x_{A_1}) ≠ β*_{A_0}^T x_{A_0},   Var(β̂^{ols T}_{A_1} x_{A_1}) ≤ Var(β̂^{ols T}_{A_0} x_{A_0}),

   where x_{A_1} consists of the first q elements of x.
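Both claims can be checked empirically before proving them. The sketch below (Python/NumPy rather than the course's R; the dimensions, coefficients, and variable names are illustrative, not part of the assignment) simulates repeated draws from the true model and compares the empirical bias and variance of the three strategies' predictions at a fixed point x:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, p0, q = 50, 10, 4, 2              # sample size; full / true / underfit model sizes
beta_star = np.zeros(p)
beta_star[:p0] = [3.0, -2.0, 1.5, 0.5]  # only the first p0 coefficients are nonzero

X = rng.normal(size=(n, p))             # fixed design matrix
x_new = rng.normal(size=p)              # fixed prediction point x
true_mean = beta_star[:p0] @ x_new[:p0]

preds = {"full": [], "true": [], "under": []}
for _ in range(2000):                   # repeated draws of the response
    y = X @ beta_star + rng.normal(size=n)               # sigma = 1
    for name, k in [("full", p), ("true", p0), ("under", q)]:
        bhat, *_ = np.linalg.lstsq(X[:, :k], y, rcond=None)  # OLS on first k columns
        preds[name].append(bhat @ x_new[:k])

for name, vals in preds.items():
    vals = np.asarray(vals)
    print(f"{name:>5}: bias = {vals.mean() - true_mean:+.3f}, var = {vals.var():.4f}")
```

With settings like these, the output should show negligible bias for the full and true models with Var(full) ≥ Var(true), and the reverse trade-off (nonzero bias, smaller variance) for the underfit model.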
Problem 1) Compare the performance of forward selection, backward elimination, and all-subset selection on the Boston housing data, which are available in the R package MASS.

i) What are the final selected covariates for each of the three methods?

ii) What are the five-fold CV errors for each of the three methods?
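The mechanics of the three search strategies can be sketched directly. Below is a Python/NumPy stand-in for the R workflow (synthetic data replace MASS::Boston here, and using the five-fold CV error itself as the add/drop criterion is one convention among several — the assignment leaves the stepwise criterion to you):

```python
import itertools
import numpy as np

rng = np.random.default_rng(1)
n, p = 100, 6
X = rng.normal(size=(n, p))
y = X[:, :3] @ np.array([2.0, -1.0, 0.5]) + rng.normal(size=n)  # 3 relevant covariates

def cv_error(cols, folds=5):
    """Five-fold cross-validated mean squared prediction error of an OLS
    model (with intercept) using the given column subset."""
    idx = np.arange(n)
    err = 0.0
    for k in range(folds):
        test = idx[k::folds]
        train = np.setdiff1d(idx, test)
        A = np.column_stack([np.ones(len(train)), X[np.ix_(train, list(cols))]])
        b, *_ = np.linalg.lstsq(A, y[train], rcond=None)
        B = np.column_stack([np.ones(len(test)), X[np.ix_(test, list(cols))]])
        err += np.sum((y[test] - B @ b) ** 2)
    return err / n

def forward():
    chosen = []
    while len(chosen) < p:
        cand = min((c for c in range(p) if c not in chosen),
                   key=lambda c: cv_error(chosen + [c]))
        if chosen and cv_error(chosen + [cand]) >= cv_error(chosen):
            break                      # no candidate improves the CV error
        chosen.append(cand)
    return sorted(chosen)

def backward():
    chosen = list(range(p))
    while len(chosen) > 1:
        cand = min(chosen, key=lambda c: cv_error([d for d in chosen if d != c]))
        if cv_error([d for d in chosen if d != cand]) >= cv_error(chosen):
            break                      # no deletion improves the CV error
        chosen.remove(cand)
    return sorted(chosen)

def all_subsets():
    best = min((s for r in range(1, p + 1)
                for s in itertools.combinations(range(p), r)),
               key=lambda s: cv_error(list(s)))
    return sorted(best)

for name, fn in [("forward", forward), ("backward", backward), ("all-subset", all_subsets)]:
    cols = fn()
    print(f"{name:>10}: columns {cols}, CV error {cv_error(cols):.3f}")
```

By construction the all-subset search attains the smallest CV error of the three, since it minimizes over every nonempty subset; the stepwise methods only trace greedy paths through that space.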
Problem 3) In an enzyme kinetics study, the velocity of a reaction (Y) is expected to be related to the concentration (X) as follows:

Y_i = β_0 x_i / (β_1 + x_i) + ε_i.

The dataset "Enzyme.txt" posted on D2L contains eighteen data points related to this study.
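This is the Michaelis–Menten form. As part (i) of this problem notes, ignoring the error term and inverting the model gives a relation that is linear in 1/x, which supplies starting values for the nonlinear fit. A sketch in Python/NumPy rather than R's nls (the data, seed, and "true" parameter values are synthetic stand-ins, since Enzyme.txt is not reproduced here):

```python
import numpy as np

rng = np.random.default_rng(2)
x = np.linspace(2.0, 40.0, 18)           # 18 concentration levels (synthetic)
b0_true, b1_true = 28.0, 13.0            # illustrative "true" parameters
y = b0_true * x / (b1_true + x) + rng.normal(scale=0.3, size=18)

# Part (i): ignore the error term and invert the model:
# 1/Y = alpha0 + alpha1 * (1/x), with alpha0 = 1/beta0 and alpha1 = beta1/beta0.
A = np.column_stack([np.ones(18), 1.0 / x])
(a0, a1), *_ = np.linalg.lstsq(A, 1.0 / y, rcond=None)
b0_start, b1_start = 1.0 / a0, a1 / a0   # back-transform to starting values

# Part (ii): nonlinear least squares from those starting values.
# (R's nls iterates internally; here a few plain Gauss-Newton steps.)
beta = np.array([b0_start, b1_start])
for _ in range(50):
    f = beta[0] * x / (beta[1] + x)
    J = np.column_stack([x / (beta[1] + x),                   # d f / d beta0
                         -beta[0] * x / (beta[1] + x) ** 2])  # d f / d beta1
    step, *_ = np.linalg.lstsq(J, y - f, rcond=None)
    beta = beta + step
    if np.max(np.abs(step)) < 1e-10:
        break

print("starting values:", round(float(b0_start), 2), round(float(b1_start), 2))
print("least squares:  ", beta.round(2))
```

The linearized fit is only a device for initialization; the inferences in parts (iii)–(vii) should come from the nonlinear least squares fit, not from the regression on the transformed data.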
i) To obtain starting values for β_0 and β_1, observe that when the error term is ignored we have Y′_i = α_0 + α_1 x′_i, where Y′_i = 1/Y_i, α_0 = 1/β_0, α_1 = β_1/β_0, and x′_i = 1/x_i. Therefore fit a linear regression function to the transformed data to obtain initial estimates for β_0 and β_1 to be used in nls.

ii) Using the starting values obtained in part (i), find the least squares estimates of the parameters β_0 and β_1.

iii) Plot the estimated nonlinear regression function and the data. Does the fit appear to be adequate?

iv) Obtain the residuals and plot them against the fitted values and against X on separate graphs. Also obtain a normal probability plot. What do your plots show?

v) Can you conduct an approximate formal lack-of-fit test here? Explain.

vi) Given that only 18 trials can be made, what are some advantages and disadvantages of considering fewer concentration levels but with some replications, as compared to considering 18 different concentration levels as was done here?

vii) Assume that the fitted model is appropriate and that large-sample inferences can be employed here.
(1) Obtain an approximate 95 percent confidence interval for β_0.
(2) Test whether or not β_1 = 20; use α = 0.05. State the alternatives, decision rule, and conclusion.

The following questions are from the book (SJS) "A Modern Approach to Regression with R" by S. J. Sheather. This book is available from the library. I do recommend reading it for parts of the course. The dataset used in these questions is "cars04.csv", posted on D2L.

Problem 4) Do Exercise 5 in Chapter 3 of SJS.

Problem 5) Do Exercise 3 in Chapter 6 of SJS.