• Values of X are known precisely
  – Special methods exist that account for measurement error in X
• No interaction between variables (beyond interactions included in the model)
  – Check for interactions and include interaction terms as necessary
Issues in Fitting Multiple Regression Models

Which and how many variables to include?
• In the current study, the researchers collected data on many additional covariates: smoking, blood pressure, serum ferritin levels, residence in rural vs. urban areas, and urinary cadmium levels.
• Adding these variables to the model did not improve the fit, so they were excluded from the final (reported) model
  – i.e., the researchers computed a P-value for every covariate in the model, removed those with P > 0.05, and reran the model without them

Automatic Variable Selection
• Many statistical programs can perform automatic variable selection:
  – All-subsets selection: fit every possible model (i.e., every combination of variables) to the data and pick the one that fits best
  – Forward selection: start with a single predictor and add other predictors one at a time, keeping only those that improve model fit (that is, are significantly associated with the outcome)
  – Backward selection: start with the full set of predictors and eliminate, one at a time, the predictors that contribute the least; stop when no more predictors can be removed
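The forward-selection procedure described above can be sketched in a few lines. This is a hypothetical illustration, not the analysis from the study: it greedily adds whichever predictor raises R² the most and stops when the gain drops below a threshold (`min_gain`, a crude stand-in for a formal significance test). All variable names and the simulated data are invented for the example.

```python
import numpy as np

def r_squared(X, y):
    """OLS fit with an intercept; returns the coefficient of determination."""
    A = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    return 1.0 - (resid @ resid) / ((y - y.mean()) ** 2).sum()

def forward_select(X, y, names, min_gain=0.01):
    """Greedy forward selection: repeatedly add the predictor that raises
    R^2 the most; stop when the best gain falls below min_gain."""
    chosen, best_r2 = [], 0.0
    remaining = list(range(X.shape[1]))
    while remaining:
        r2, j = max((r_squared(X[:, chosen + [j]], y), j) for j in remaining)
        if r2 - best_r2 < min_gain:
            break
        chosen.append(j)
        remaining.remove(j)
        best_r2 = r2
    return [names[j] for j in chosen], best_r2

# Simulated example: y depends on x0 and x1 but not on x2.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = 2 * X[:, 0] + X[:, 1] + 0.1 * rng.normal(size=100)
selected, r2 = forward_select(X, y, ["x0", "x1", "x2"])
print(selected)  # x2 should be left out
```

Backward selection is the mirror image: start from the full model and repeatedly drop the predictor whose removal costs the least R², stopping when any further removal would cost too much.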
Problems with Automatic Variable Selection
• Testing too many combinations can lead to spurious associations
• The final model fits too well: R² is too high
• The best-fit parameters are too far from zero
• The CIs are too narrow
• The P-values are too low
• When reading (or presenting) the results of a study, it is important to know (or report) how many variables the investigator started with
• Good practice is to validate the model in an independent population

Issues in Fitting Multiple Regression Models

How many variables to include?
• A rule of thumb is that you need at least 10–20 (or even 40) participants for each predictor
• Too few variables – the model may be too simple and predict poorly
• Too many variables – can result in overfitting: the model fits the current data set well, but its predictions will not generalize to other data sets
• Goal – find the simplest model that still adequately predicts the outcome
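The overfitting problem behind the 10–20-participants-per-predictor rule can be demonstrated directly. In this minimal sketch (simulated data, not from any study), 12 participants and 10 pure-noise predictors yield a high training-set R² even though the outcome is unrelated to every predictor:

```python
import numpy as np

# With 12 participants and 10 predictors that are pure noise, the fitted
# model still "explains" most of the variance in the fitting data:
# a direct demonstration of overfitting.
rng = np.random.default_rng(1)
n, p = 12, 10
X = rng.normal(size=(n, p))
y = rng.normal(size=n)                # outcome unrelated to every predictor

A = np.column_stack([np.ones(n), X])  # design matrix with intercept
beta, *_ = np.linalg.lstsq(A, y, rcond=None)
resid = y - A @ beta
r2 = 1.0 - (resid @ resid) / ((y - y.mean()) ** 2).sum()
print(f"training R^2 = {r2:.3f}")     # typically close to 1
```

With only one residual degree of freedom left, the model has enough flexibility to chase the noise, which is exactly why such a fit will not generalize to a new data set.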
Issues in Fitting Multiple Regression Models
• Which and how many variables to include?
• This depends on the study's primary goal:
  – Single hypothesis testing: estimate the effect of a given variable adjusted for all available potential confounders
  – Exploratory study: identify a set of variables independently associated with the outcome
  – Predictive modeling: predict the outcome with as few variables as possible (parsimony vs. adequacy)

Issues in Fitting Multiple Regression Models
• Multicollinearity – occurs when two predictor (X) variables are highly correlated with each other
  – They contain redundant information
  – Including the second adds little information once the first one is known (i.e., it does not improve model fit)
  – Coefficients cannot be estimated reliably
• Check the correlation among the predictors!
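Checking the correlations among predictors is easy to automate; a common companion diagnostic (not mentioned in the slides, added here as an assumption of standard practice) is the variance inflation factor, VIF_j = 1/(1 − R²_j), where R²_j comes from regressing predictor j on the remaining predictors. Values above roughly 5–10 are often taken as a warning sign of multicollinearity. A sketch with simulated data, where x2 is deliberately made nearly redundant with x1:

```python
import numpy as np

def vif(X):
    """Variance inflation factor for each column of X: regress column j on
    the remaining columns and return 1 / (1 - R^2_j)."""
    out = []
    for j in range(X.shape[1]):
        others = np.delete(X, j, axis=1)
        A = np.column_stack([np.ones(len(X)), others])
        beta, *_ = np.linalg.lstsq(A, X[:, j], rcond=None)
        resid = X[:, j] - A @ beta
        r2 = 1.0 - (resid @ resid) / ((X[:, j] - X[:, j].mean()) ** 2).sum()
        out.append(1.0 / (1.0 - r2))
    return out

# Simulated predictors: x2 is nearly a copy of x1; x3 is independent.
rng = np.random.default_rng(2)
x1 = rng.normal(size=200)
x2 = x1 + 0.1 * rng.normal(size=200)
x3 = rng.normal(size=200)
X = np.column_stack([x1, x2, x3])

print(np.corrcoef(X, rowvar=False).round(2))  # corr(x1, x2) is near 1
print([round(v, 1) for v in vif(X)])          # VIFs for x1 and x2 are large
```

When two predictors are this strongly correlated, the model cannot tell their effects apart, so their individual coefficients become unstable even though the overall fit is unaffected.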