Multiple regression methods using the model
generally depend on the following four assumptions:
- the residuals of the model are nearly normal,
- the variability of the residuals is nearly constant,
- the residuals are independent, and
- each variable is linearly related to the outcome.
can be used to check each of these assumptions. We will consider the model from the Mario Kart auction data, and check whether there are any notable concerns:
Normal probability plot.
A normal probability plot of the residuals is shown in Figure 1. While the plot exhibits some minor irregularities, there are no outliers that might be cause for concern. In a normal probability plot for residuals, we tend to be most worried about residuals that appear to be outliers, since these indicate long tails in the distribution of residuals.
Absolute values of residuals against fitted values.
Figure 1. A normal probability plot of the residuals is helpful in identifying observations that might be outliers.
A plot of the absolute value of the residuals against their corresponding fitted values
is shown in Figure 2.
This plot is helpful to check the condition that the variance of the residuals is approximately constant. We don't see any obvious deviations from constant variance in this example.
Residuals in order of their data collection.
Figure 2. Comparing the absolute value of the residuals against the fitted values
is helpful in identifying deviations from the constant variance assumption.
A plot of the residuals in the order their corresponding auctions were observed is shown in Figure 3. Such a plot is helpful in identifying any connection between cases that are close to one another, e.g. we could look for declining prices over time or if there was a time of the day when auctions tended to fetch a higher price. Here we see no structure that indicates a problem.
Residuals against each predictor variable.
Figure 3. Plotting residuals in the order that their corresponding observations were collected helps identify connections between successive observations. If it seems that consecutive observations tend to be close to each other, this indicates the independence assumption of the observations would fail.
We consider a plot of the residuals against the cond_new variable, the residuals against the stock photo variable, and the residuals against the wheels variable. These plots are shown in Figure 4. For the two-level condition variable, we are guaranteed not to see any remaining trend, and instead we are checking that the variability doesn't ﬂuctuate across groups, which it does not. However, looking at the stock photo variable, we find that there is some difference in the variability of the residuals in the two groups. Additionally, when we consider the residuals against the wheels variable, we see some possible structure. There appears to be curvature in the residuals, indicating the relationship is probably not linear.
Figure 4. For the condition and stock photo variables, we check for differences in the distribution shape or variability of the residuals. In the case of the stock photos variable, we see a little less variability in the unique photo group than the stock photo group. For numerical predictors, we also check for trends or other structure. We see some slight bowing in the residuals against the wheels variable in the bottom plot.
It is necessary to summarize diagnostics for any model fit. If the diagnostics support the model assumptions, this would improve credibility in the findings. If the diagnostic assessment shows remaining underlying structure in the residuals, we should try to adjust the model to account for that structure. If we are unable to do so, we may still report the model but must also note its shortcomings. In the case of the auction data, we report that there appears to be non-constant variance in the stock photo variable and that there may be a nonlinear relationship between the total price and the number of wheels included for an auction. This information would be important to buyers and sellers who may review the analysis, and omitting this information could be a setback to the very people who the model might assist.
"All models are wrong, but some are useful"
—George E.P. Box
The truth is that no model is perfect. However, even imperfect models can be useful. Reporting a ﬂawed model can be reasonable so long as we are clear and report the model's shortcomings.
Caution: Don't report results when assumptions are grossly violated
While there is a little leeway in model assumptions, don't go too far. If model assumptions are very clearly violated, consider a new model, even if it means learning more statistical methods or hiring someone who can help.
TIP: Confidence intervals in multiple regression
Confidence intervals for coefficients in multiple regression can be computed using the same formula as in the single predictor model:
is the appropriate t
-value corresponding to the confidence level and model degrees of freedom, df
Licenses and Attributions