Five classification models were built for predicting whether a neighborhood will soon see a large rise in home prices, based on public elementary school ratings and other factors. The training data set was missing the school rating variable for every new school (3% of the data points).
Because ratings are unavailable for newly opened schools, locations that have recently experienced high population growth are believed to be more likely to have missing school rating data.
- Model 1 used imputation, filling in the missing data with the average school rating from the rest of the data.
- Model 2 used imputation, building a regression model to fill in the missing school rating data based on other variables.
- Model 3 used imputation in two stages. First, a classification model estimated (based on other variables) whether a new school was likely built as a result of recent population growth or for another purpose, e.g. to replace a very old school. That classification then selected one of two regression models (each based on other variables) to fill in an estimate of the school rating: one for neighborhoods with new schools built due to population growth, and one for neighborhoods with new schools built for other reasons.
- Model 4 used a binary variable to identify locations with missing information.
- Model 5 used a categorical variable: first, a classification model was used to estimate whether a new school is likely to have been built as a result of recent population growth; and then each neighborhood was categorized as "data available", "missing, population growth", or "missing, other reason".
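
The treatments listed above might be sketched on a toy data set as follows. This is only an illustration under stated assumptions: the records, the single `growth` predictor, the one-variable least-squares fit, the `0.0` placeholder used alongside the missing-data flag, and the threshold rule standing in for the classification model are all hypothetical and do not come from the problem statement. Model 3 is omitted because it combines the classifier stand-in with two regressions of the kind shown for Model 2.

```python
from statistics import mean

# Hypothetical toy records; rating is None where the school is new.
rows = [
    {"growth": 1.0, "rating": 6.0},
    {"growth": 2.0, "rating": 7.0},
    {"growth": 3.0, "rating": 8.0},
    {"growth": 4.0, "rating": None},
]
observed = [r for r in rows if r["rating"] is not None]

# Model 1: fill missing ratings with the mean of the observed ratings.
mean_rating = mean(r["rating"] for r in observed)
model1 = [mean_rating if r["rating"] is None else r["rating"] for r in rows]


def fit_line(xs, ys):
    """Ordinary least squares for one predictor; returns (slope, intercept)."""
    x_bar, y_bar = mean(xs), mean(ys)
    slope = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
             / sum((x - x_bar) ** 2 for x in xs))
    return slope, y_bar - slope * x_bar


# Model 2: fill missing ratings with a regression prediction
# fitted on the rows where the rating is observed.
slope, intercept = fit_line([r["growth"] for r in observed],
                            [r["rating"] for r in observed])
model2 = [slope * r["growth"] + intercept if r["rating"] is None
          else r["rating"] for r in rows]

# Model 4: keep a binary "missing" flag next to the rating.
# The 0.0 placeholder is an assumption; the flag carries the signal.
model4 = [(0.0 if r["rating"] is None else r["rating"],
           int(r["rating"] is None)) for r in rows]


def built_for_growth(row):
    """Hypothetical stand-in for the classification model."""
    return row["growth"] > 3.5


def category(row):
    """Model 5: three-level categorical variable."""
    if row["rating"] is not None:
        return "data available"
    if built_for_growth(row):
        return "missing, population growth"
    return "missing, other reason"


model5 = [category(r) for r in rows]
```

Note that the Model 5 categories are only as reliable as the classifier behind `built_for_growth`, and the Model 2 fill is only as reliable as the regression fit.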
a. If school ratings can be reasonably well-predicted from the other factors, and new schools built due to recent population growth cannot be reasonably well-classified using the other factors, which model would you recommend?