Problem 2 (Overfitting and underfitting). Variable selection is important in regression, since using too many (irrelevant) variables or too few (omitting relevant ones) each causes problems. Consider the linear regression model $y_i = \beta^T x_i + \varepsilon_i$, where the covariate (input) vector is $x_i = (x_{i1}, \dots, x_{ip})^T \in \mathbb{R}^p$ and the errors are independent and identically distributed (i.i.d.) with $E(\varepsilon_i) = 0$ and $\mathrm{Var}(\varepsilon_i) = \sigma^2$. Let $y = (y_1, \dots, y_n)^T$ be the response vector and $X = (x_{ij};\ i = 1, \dots, n,\ j = 1, \dots, p)$ the design matrix.

Assume that only the first $p_0$ variables are important. Let $A = \{1, \dots, p\}$ be the index set for the full model and $A_0 = \{1, \dots, p_0\}$ the index set for the true model. The true regression coefficients can then be written as $\beta^* = (\beta_{A_0}^{*\,T},\, 0^T)^T$. Now consider three different modelling strategies:

Strategy I: Fit the full model. Denote the full design matrix by $X_A$ and the corresponding OLS estimator by $\hat\beta_A^{\mathrm{ols}}$.

Strategy II: Fit the true model using the first $p_0$ covariates. Denote the corresponding design matrix by $X_{A_0}$ and the OLS estimator by $\hat\beta_{A_0}^{\mathrm{ols}}$.

Strategy III: Fit a subset model using only the first $q$ covariates, for some $q < p_0$. Denote the corresponding design matrix by $X_{A_1}$ and the OLS estimator by $\hat\beta_{A_1}^{\mathrm{ols}}$.
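Both parts below rest on the standard mean and variance formulas for OLS, which the problem statement takes as known. As a reminder (standard facts, not part of the problem itself): for a fixed design matrix $Z$ of full column rank and a response $y$ with $\mathrm{Var}(y) = \sigma^2 I$,

$$\hat\gamma^{\mathrm{ols}} = (Z^T Z)^{-1} Z^T y, \qquad \mathrm{Var}\big(\hat\gamma^{\mathrm{ols}\,T} z\big) = \sigma^2\, z^T (Z^T Z)^{-1} z \quad \text{for any fixed } z,$$

and when the fitted model is correctly specified, $E(\hat\gamma^{\mathrm{ols}}) = \gamma$. Note that the variance formula holds whether or not the fitted model is correctly specified, since the variance of $y$ does not depend on its mean.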

1. One possible consequence of including irrelevant variables in a regression model is that the predictions are not efficient (i.e., have larger variances), though they are unbiased. For any $x \in \mathbb{R}^p$, show that

$$E\big(\hat\beta_A^{\mathrm{ols}\,T} x\big) = \beta_{A_0}^{*\,T} x_{A_0}, \qquad \mathrm{Var}\big(\hat\beta_A^{\mathrm{ols}\,T} x\big) \ge \mathrm{Var}\big(\hat\beta_{A_0}^{\mathrm{ols}\,T} x_{A_0}\big),$$

where $x_{A_0}$ consists of the first $p_0$ elements of $x$.
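The claim in part 1 can be checked numerically before proving it. The sketch below (all dimensions, seeds, and coefficient values are made up for illustration, with $\sigma^2 = 1$) estimates the bias of the full-model prediction by Monte Carlo and compares the two prediction variances in closed form via $\mathrm{Var}(\hat\beta^{\mathrm{ols}\,T} x) = \sigma^2 x^T (X^T X)^{-1} x$:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, p0 = 50, 10, 4
# Hypothetical true coefficients: only the first p0 entries are nonzero,
# matching beta* = (beta*_{A0}^T, 0^T)^T.
beta_star = np.concatenate([np.array([2.0, -1.0, 0.5, 1.5]), np.zeros(p - p0)])

X = rng.normal(size=(n, p))   # full design matrix X_A, fixed across replications
x = rng.normal(size=p)        # an arbitrary test point x in R^p
target = beta_star @ x        # beta*^T x = beta*_{A0}^T x_{A0}

# Empirical check of unbiasedness: average the full-model prediction
# over many fresh draws of the noise.
preds_full = []
for _ in range(5000):
    y = X @ beta_star + rng.normal(size=n)          # sigma^2 = 1
    b_full = np.linalg.lstsq(X, y, rcond=None)[0]   # hat beta_A^ols
    preds_full.append(b_full @ x)
bias_full = np.mean(preds_full) - target            # should be near 0

# Exact prediction variances (sigma^2 = 1):
var_full = x @ np.linalg.inv(X.T @ X) @ x
var_true = x[:p0] @ np.linalg.inv(X[:, :p0].T @ X[:, :p0]) @ x[:p0]
print(abs(bias_full) < 0.05)   # unbiased, up to Monte Carlo error
print(var_full >= var_true)    # the claimed variance ordering
```

The variance comparison uses the exact formula rather than sample variances, so the ordering it reports is the one the problem asks you to prove, not a sampling artifact.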

2. One consequence of excluding important variables in a linear model is that the predictions are biased, though they have smaller variances. For any $x \in \mathbb{R}^p$, show that

$$E\big(\hat\beta_{A_1}^{\mathrm{ols}\,T} x_{A_1}\big) \ne \beta_{A_0}^{*\,T} x_{A_0}, \qquad \mathrm{Var}\big(\hat\beta_{A_1}^{\mathrm{ols}\,T} x_{A_1}\big) \le \mathrm{Var}\big(\hat\beta_{A_0}^{\mathrm{ols}\,T} x_{A_0}\big),$$

where $x_{A_1}$ consists of the first $q$ elements of $x$.
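Part 2 can likewise be illustrated numerically. The sketch below (again with made-up dimensions, seed, and coefficients, and $\sigma^2 = 1$) computes the underfit prediction's mean in closed form, using $E(\hat\beta_{A_1}^{\mathrm{ols}}) = (X_{A_1}^T X_{A_1})^{-1} X_{A_1}^T X_{A_0} \beta_{A_0}^*$, to exhibit the omitted-variable bias, and compares the exact prediction variances:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p0, q = 50, 4, 2
beta0 = np.array([2.0, -1.0, 0.5, 1.5])   # hypothetical beta*_{A0}
X0 = rng.normal(size=(n, p0))             # true-model design X_{A0}
x0 = rng.normal(size=p0)                  # the first p0 elements of a test point

X1 = X0[:, :q]                            # underfit design X_{A1}
H1 = np.linalg.inv(X1.T @ X1)

# Closed-form mean of the underfit prediction:
# E[hat beta_{A1}] = (X_{A1}^T X_{A1})^{-1} X_{A1}^T X_{A0} beta0,
# which generally differs from the first q entries of beta0.
mean_pred_sub = x0[:q] @ (H1 @ X1.T @ X0 @ beta0)
target = x0 @ beta0                       # beta*_{A0}^T x_{A0}

# Exact prediction variances (sigma^2 = 1):
var_sub = x0[:q] @ H1 @ x0[:q]
var_true = x0 @ np.linalg.inv(X0.T @ X0) @ x0
print(abs(mean_pred_sub - target) > 0)    # biased, in general
print(var_sub <= var_true)                # but with smaller variance
```

Note the bias is "in general": for special designs (e.g., when the omitted columns are orthogonal to the retained ones and to $x_{A_1}$'s direction) it can vanish, which is why the problem statement writes $\ne$ rather than a quantified inequality.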

