STAT444/844/CM464/764 Assignment #2, Winter 2013. Instructor: S. Chenouri. Due: March 6, 2013. Undergraduate students are only required to work on 4 out of 5 questions.

Problem 1) Compare the performance of forward selection, backward elimination, and all-subset selection on the Boston housing data, which are available in the R package MASS.
i) What are the final selected covariates for each of the three methods?
ii) What are the five-fold CV errors for each of the three methods?

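For Problem 1, here is a minimal R sketch of one possible workflow. It assumes the leaps package for the all-subset search and AIC-driven stepwise search via MASS::stepAIC; the assignment does not prescribe a selection criterion, so BIC, Mallows' Cp, or p-value rules would be equally defensible, and the five-fold CV helper below is a hand-rolled illustration rather than a required interface.

library(MASS)    # Boston housing data and stepAIC()
library(leaps)   # regsubsets() for all-subset selection

data(Boston)

## Forward selection and backward elimination (AIC-based stepwise search)
null_fit <- lm(medv ~ 1, data = Boston)
full_fit <- lm(medv ~ ., data = Boston)
fwd <- stepAIC(null_fit,
               scope = list(lower = ~ 1, upper = formula(full_fit)),
               direction = "forward", trace = FALSE)
bwd <- stepAIC(full_fit, direction = "backward", trace = FALSE)

## All-subset selection: choose the subset size minimizing BIC
all_sub   <- regsubsets(medv ~ ., data = Boston, nvmax = 13)
best_size <- which.min(summary(all_sub)$bic)
best_vars <- names(coef(all_sub, best_size))[-1]   # drop "(Intercept)"

## i) Final selected covariates for each method
attr(terms(fwd), "term.labels")
attr(terms(bwd), "term.labels")
best_vars

## ii) Five-fold CV error (mean squared prediction error) for a given formula
cv5 <- function(form, data, K = 5, seed = 1) {
  set.seed(seed)
  fold <- sample(rep(1:K, length.out = nrow(data)))
  mse  <- numeric(K)
  for (k in 1:K) {
    fit  <- lm(form, data = data[fold != k, ])
    pred <- predict(fit, newdata = data[fold == k, ])
    mse[k] <- mean((data$medv[fold == k] - pred)^2)
  }
  mean(mse)
}

cv5(formula(fwd), Boston)
cv5(formula(bwd), Boston)
cv5(reformulate(best_vars, response = "medv"), Boston)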
Problem 2) (Overfitting and underfitting) Variable selection is important for regression, since there are problems in using either too many (irrelevant) or too few (omitted) variables in a regression model. Consider the linear regression model $y_i = \beta^T x_i + \varepsilon_i$, where the vector of covariates (input vector) is $x_i = (x_{i1}, \ldots, x_{ip})^T \in \mathbb{R}^p$ and the errors are independent and identically distributed (i.i.d.), satisfying $E(\varepsilon_i) = 0$ and $\mathrm{Var}(\varepsilon_i) = \sigma^2$. Let $y = (y_1, \ldots, y_n)^T$ be the response vector and $X = (x_{ij};\ i = 1, \ldots, n,\ j = 1, \ldots, p)$ be the design matrix.
Assume that only the first $p_0$ variables are important. Let $A = \{1, \ldots, p\}$ be the index set for the full model and $A_0 = \{1, \ldots, p_0\}$ be the index set for the true model. The true regression coefficients can be written as $\beta^* = (\beta^{*T}_{A_0}, 0^T)^T$. Now consider three different modelling strategies:
Strategy I: Fit the full model. Denote the full design matrix by $X_A$ and the corresponding OLS estimator by $\hat{\beta}^{ols}_A$.
Strategy II: Fit the true model using the first $p_0$ covariates. Denote the corresponding design matrix by $X_{A_0}$ and the OLS estimator by $\hat{\beta}^{ols}_{A_0}$.
Strategy III: Fit a subset model using only the first $q$ covariates for some $q < p_0$. Denote the corresponding design matrix by $X_{A_1}$ and the OLS estimator by $\hat{\beta}^{ols}_{A_1}$.
1. One possible consequence of including irrelevant variables in a regression model is that the predictions are not efficient (i.e., have larger variances), though they are unbiased. For any $x \in \mathbb{R}^p$, show that
$$E\big(\hat{\beta}^{ols\,T}_A x_A\big) = \beta^{*T}_{A_0} x_{A_0}, \qquad \mathrm{Var}\big(\hat{\beta}^{ols\,T}_A x_A\big) \ge \mathrm{Var}\big(\hat{\beta}^{ols\,T}_{A_0} x_{A_0}\big),$$
where $x_{A_0}$ consists of the first $p_0$ elements of $x$.
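One possible route for part 1, sketched (any proof via the partitioned-inverse identity works). Since $\beta^*$ puts zeros on the irrelevant coordinates, $E(y) = X\beta^* = X_{A_0}\beta^*_{A_0}$, and with $X_A = X$:
$$E\big(\hat{\beta}^{ols}_A\big) = (X_A^T X_A)^{-1} X_A^T X_A \beta^* = \beta^*, \quad\text{so}\quad E\big(\hat{\beta}^{ols\,T}_A x_A\big) = \beta^{*T} x = \beta^{*T}_{A_0} x_{A_0}.$$
For the variance, write $X_A = (X_{A_0}, X_2)$ and $x = (x_{A_0}^T, x_2^T)^T$, and let $S = X_2^T X_2 - X_2^T X_{A_0}(X_{A_0}^T X_{A_0})^{-1} X_{A_0}^T X_2$ be the Schur complement. The partitioned-inverse identity gives
$$x^T (X_A^T X_A)^{-1} x = x_{A_0}^T (X_{A_0}^T X_{A_0})^{-1} x_{A_0} + u^T S^{-1} u, \qquad u = X_2^T X_{A_0} (X_{A_0}^T X_{A_0})^{-1} x_{A_0} - x_2,$$
and since $S$ is positive definite for a full-rank design,
$$\mathrm{Var}\big(\hat{\beta}^{ols\,T}_A x_A\big) = \sigma^2 x^T (X_A^T X_A)^{-1} x \ \ge\ \sigma^2 x_{A_0}^T (X_{A_0}^T X_{A_0})^{-1} x_{A_0} = \mathrm{Var}\big(\hat{\beta}^{ols\,T}_{A_0} x_{A_0}\big).$$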
2. One consequence of excluding important variables in a linear model is that the predictions are biased, though they have smaller variances. For any $x \in \mathbb{R}^p$, show that
$$E\big(\hat{\beta}^{ols\,T}_{A_1} x_{A_1}\big) \ne \beta^{*T}_{A_0} x_{A_0}, \qquad \mathrm{Var}\big(\hat{\beta}^{ols\,T}_{A_1} x_{A_1}\big) \le \mathrm{Var}\big(\hat{\beta}^{ols\,T}_{A_0} x_{A_0}\big),$$
where $x_{A_1}$ consists of the first $q$ elements of $x$.
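A similarly hedged sketch for part 2 (again only one possible route). Split the true design as $X_{A_0} = (X_{A_1}, X_2)$, where $X_2$ holds covariates $q+1, \ldots, p_0$ with coefficients $\beta^*_2 \ne 0$. Then $E(y) = X_{A_1}\beta^*_{A_1} + X_2\beta^*_2$, so
$$E\big(\hat{\beta}^{ols}_{A_1}\big) = (X_{A_1}^T X_{A_1})^{-1} X_{A_1}^T E(y) = \beta^*_{A_1} + (X_{A_1}^T X_{A_1})^{-1} X_{A_1}^T X_2 \beta^*_2,$$
which is the usual omitted-variable bias, and hence $E\big(\hat{\beta}^{ols\,T}_{A_1} x_{A_1}\big) \ne \beta^{*T}_{A_0} x_{A_0}$ in general. Because $\mathrm{Var}(y) = \sigma^2 I$ regardless of the fitted mean, $\mathrm{Var}\big(\hat{\beta}^{ols}_{A_1}\big) = \sigma^2 (X_{A_1}^T X_{A_1})^{-1}$, and the same Schur-complement identity as in part 1, now applied to the partition $X_{A_0} = (X_{A_1}, X_2)$, gives
$$\mathrm{Var}\big(\hat{\beta}^{ols\,T}_{A_1} x_{A_1}\big) = \sigma^2 x_{A_1}^T (X_{A_1}^T X_{A_1})^{-1} x_{A_1} \ \le\ \sigma^2 x_{A_0}^T (X_{A_0}^T X_{A_0})^{-1} x_{A_0} = \mathrm{Var}\big(\hat{\beta}^{ols\,T}_{A_0} x_{A_0}\big).$$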
Problem 3) In an enzyme kinetics study, the velocity of a reaction ($Y$) is expected to be related to the concentration ($X$) as follows:
$$Y_i = \frac{\beta_0 x_i}{\beta_1 + x_i} + \varepsilon_i.$$
The dataset "Enzyme.txt" posted on D2L contains eighteen data points related to this study.
i) To obtain starting values for $\beta_0$ and $\beta_1$, observe that when the error term is ignored we have $Y'_i = \alpha_0 + \alpha_1 x'_i$, where $Y'_i = 1/Y_i$, $\alpha_0 = 1/\beta_0$, $\alpha_1 = \beta_1/\beta_0$, and $x'_i = 1/x_i$. Therefore fit a linear regression function to the transformed data to obtain initial estimates of $\beta_0$ and $\beta_1$ to use in nls.
ii) Using the starting values obtained in part (i), find the least squares estimates of the parameters $\beta_0$ and $\beta_1$.
iii) Plot the estimated nonlinear regression function and the data. Does the fit appear to be adequate?
iv) Obtain the residuals and plot them against the fitted values and against $X$ on separate graphs. Also obtain a normal probability plot. What do your plots show?
v) Can you conduct an approximate formal lack-of-fit test here? Explain.
vi) Given that only 18 trials can be made, what are some advantages and disadvantages of considering fewer concentration levels but with some replications, as compared to considering 18 different concentration levels as was done here?
vii) Assume that the fitted model is appropriate and that large-sample inferences can be employed here. (1) Obtain an approximate 95 percent confidence interval for $\beta_0$. (2) Test whether or not $\beta_1 = 20$; use $\alpha = 0.05$. State the alternatives, decision rule, and conclusion.
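For Problem 3, a minimal R sketch under stated assumptions: the column names conc and vel are hypothetical (match them to whatever Enzyme.txt actually contains), and the Wald-type interval and test in part (vii) are one standard large-sample choice, not the only one.

# Columns assumed to be named "vel" (Y) and "conc" (X); adjust to the file.
enz <- read.table("Enzyme.txt", header = TRUE)

# (i) Starting values from the linearization 1/Y = alpha0 + alpha1*(1/X),
#     where alpha0 = 1/beta0 and alpha1 = beta1/beta0
lin <- lm(I(1 / vel) ~ I(1 / conc), data = enz)
a0  <- unname(coef(lin)[1])
a1  <- unname(coef(lin)[2])
b0_start <- 1 / a0
b1_start <- a1 / a0

# (ii) Nonlinear least squares with those starting values
fit <- nls(vel ~ beta0 * conc / (beta1 + conc), data = enz,
           start = list(beta0 = b0_start, beta1 = b1_start))
summary(fit)

# (iii) Fitted curve over the data
plot(enz$conc, enz$vel, xlab = "concentration", ylab = "velocity")
curve(coef(fit)["beta0"] * x / (coef(fit)["beta1"] + x), add = TRUE)

# (iv) Residual diagnostics
plot(fitted(fit), resid(fit)); abline(h = 0)   # residuals vs fitted values
plot(enz$conc, resid(fit));    abline(h = 0)   # residuals vs X
qqnorm(resid(fit)); qqline(resid(fit))         # normal probability plot

# (vii) Large-sample Wald inference
est <- summary(fit)$coefficients
est["beta0", "Estimate"] +
  c(-1, 1) * qnorm(0.975) * est["beta0", "Std. Error"]   # approx. 95% CI
z <- (est["beta1", "Estimate"] - 20) / est["beta1", "Std. Error"]
2 * pnorm(-abs(z))   # two-sided p-value for H0: beta1 = 20 vs H1: beta1 != 20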
The following questions are from the book (SJS) "A Modern Approach to Regression with R" by S. J. Sheather. This book is available from the library, and I recommend reading it for parts of the course. The dataset used in these questions is "cars04.csv", posted on D2L.

Problem 4) Do Exercise 5 in Chapter 3 of SJS.

Problem 5) Do Exercise 3 in Chapter 6 of SJS.
Top Answer

Problem 2) (Overfitting and Underfitting)
SOLUTION: True model: $y = X_{A_0}\beta^*_{A_0} + \varepsilon$, where the $\varepsilon_i$ are i.i.d. with $E(\varepsilon_i) = 0$. (1) In the model as specified, we know that $\hat{\beta}^{ols}_A$ is given by $(X_A^T X_A)^{-1} X_A^T y$, where $X_A$ is the full design matrix and $y$ is the ...
