BIOST/EPI 513
Spring Quarter 2011
Dr. McKnight
HOMEWORK 7 KEY
See Appendix I for STATA commands and Appendix II for output.
1. Using the data in prosHW.dta on the Assignments page of the class website, use the training data set (for which training == 1) and the command swaic (type findit swaic) to improve upon the predictive model derived in class. The improvement will be made possible by allowing dummy variables for the dre variable (it was treated as grouped linear in class) and by also allowing a log(psa) term.
(a) Generate a new variable equal to log(psa), and fit a logistic model using only the training data set to predict the binary "penetration" outcome using all of the following variables: psa, log(psa), age, race, i.dre, i.dcaps, vol (main effects only). Present the value of the area under the ROC curve (AUC) for this model for both the training and validation data sets.
AUC for the training data set: 0.787
AUC for the validation data set: 0.738
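
As a reminder of what these AUC values measure, the following is a minimal Python sketch (not the Stata code from Appendix I) of the pairwise definition: the AUC is the proportion of (event, non-event) pairs in which the event has the higher predicted probability, with ties counted as one half. The function name `auc` and the toy data are hypothetical and are not taken from prosHW.dta.

```python
def auc(labels, scores):
    """Area under the ROC curve via the pairwise (Mann-Whitney) definition.

    labels: 0/1 outcomes; scores: predicted probabilities from any model.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one event and one non-event")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # event ranked above non-event
            elif p == n:
                wins += 0.5   # tie counts as half
    return wins / (len(pos) * len(neg))


# Hypothetical toy data (illustration only, not the homework data).
toy_labels = [1, 1, 0, 1, 0, 0]
toy_scores = [0.9, 0.7, 0.6, 0.4, 0.3, 0.2]
toy_auc = auc(toy_labels, toy_scores)  # 8 of the 9 pairs are ordered correctly
```

Applying the same function to predicted probabilities for the observations with training == 1 and for the remaining observations would give the training and validation AUCs, respectively, assuming the predictions come from the model fit to the training data only.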
(b) Use swaic and the training data set to determine the model with the lowest AIC from among those considered. List the subset of the original variables that are included in this model.
The model with the lowest AIC (209.9) contains log(psa), dre (as a factor variable), vol, and race.
(c) Fit the minimum AIC model derived using swaic in (b) to the training data set and present AIC and BIC for it.
AIC: 209.9
BIC: 232.2
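
For reference, these two criteria follow the standard definitions, written here with symbols that are assumptions of this note rather than notation from the key: $\hat{L}$ is the maximized likelihood of the fitted model, $k$ the number of estimated parameters (including the intercept), and $N$ the number of observations used in the fit.

```latex
\begin{align*}
\mathrm{AIC} &= -2\ln\hat{L} + 2k, \\
\mathrm{BIC} &= -2\ln\hat{L} + k\ln N, \\
\mathrm{BIC} - \mathrm{AIC} &= k\,(\ln N - 2).
\end{align*}
```

Because $\ln N > 2$ whenever $N > e^{2} \approx 7.4$, BIC penalizes each additional parameter more heavily than AIC for any realistic sample size, which is consistent with the BIC reported above being larger than the AIC.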
(d) Compute and present the AUC of the model fit in (c) for both training and validation data sets.
AUC for the training data set: 0.783
AUC for the validation data set: 0.743
(e) Compare the values of AUC applied to the validation data set from both the full model fit in (a) and the reduced model fit in (c). What do you conclude?
The AUC of the reduced model on the validation data set (0.743) is slightly higher than that of the full model (0.738), even though the full model has the higher AUC on the training data set (0.787 versus 0.783). The reduced model is therefore preferred: it predicts at least as accurately on new data and is more parsimonious. This illustrates that a model containing more variables is not necessarily better in terms of prediction accuracy.
2. The data file leuk.dta