Model Comparison 3. Principle of Model validation - continued
A good regression model predicts values of the response variable that are very close to the observed response values. The difference between the predicted value and the observed value of the response variable is called the prediction error. In an Ordinary Least Squares (OLS) regression model, four statistics are used to evaluate the fit of the model:
1. R squared and Adjusted R squared
2. F test
3. Root Mean Square Error (RMSE)
4. Mean Absolute Percentage Error (MAPE)
The above statistics are based on the Total Sum of Squares (SST) and the Error Sum of Squares (SSE). SST measures how far the data are from the mean, while SSE measures how far the data are from the model's predicted values.
i. R squared is obtained by dividing the difference between SST and SSE by SST, that is, R squared = (SST - SSE) / SST. It is interpreted as the proportion of total variance that is explained by the model. Adjusted R squared incorporates the model's degrees of freedom, so it penalizes predictors that add little explanatory power.
ii. The F test evaluates the null hypothesis that all regression coefficients (other than the intercept) are equal to zero versus the alternative that at least one is not. The F test determines whether the proposed relationship between the dependent variable and the set of independent variables is statistically reliable.
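The relationships among SST, SSE, R squared, Adjusted R squared, and the F statistic can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation: the function names `fit_simple_ols` and `fit_statistics` and the toy data are assumptions made for the example, and the F statistic formula used here assumes the predictions come from an OLS fit that includes an intercept.

```python
def fit_simple_ols(x, y):
    """Fit y = intercept + slope * x by ordinary least squares (one predictor)."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    return intercept, slope


def fit_statistics(y, y_hat, n_predictors):
    """Compute SST, SSE, R squared, Adjusted R squared and the F statistic."""
    n = len(y)
    y_mean = sum(y) / n
    # SST: how far the data are from the mean
    sst = sum((yi - y_mean) ** 2 for yi in y)
    # SSE: how far the data are from the model's predicted values
    sse = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))
    r2 = (sst - sse) / sst
    # Residual degrees of freedom: observations minus predictors minus intercept
    df_resid = n - n_predictors - 1
    adj_r2 = 1 - (1 - r2) * (n - 1) / df_resid
    # Explained mean square divided by residual mean square
    f_stat = ((sst - sse) / n_predictors) / (sse / df_resid)
    return {"sst": sst, "sse": sse, "r2": r2, "adj_r2": adj_r2, "f_stat": f_stat}


# Toy data, made up for illustration only
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

intercept, slope = fit_simple_ols(x, y)
y_hat = [intercept + slope * xi for xi in x]
stats = fit_statistics(y, y_hat, n_predictors=1)
print(stats)
```

To turn the F statistic into a p-value, it would be compared against an F distribution with (number of predictors, residual degrees of freedom) degrees of freedom; that step needs a statistics library and is left out of this sketch.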
iii. RMSE is the square root of the variance of the residuals. It is an absolute measure of how well the model fits the data. RMSE can be interpreted as the standard deviation of the unexplained variance, and it has the same unit of measure as the response variable. It is a good measure of how accurately the model predicts the response.
iv. Mean Absolute Percentage Error (MAPE) is a measure of the prediction accuracy of a forecasting method, for example in trend estimation. The related Mean Absolute Error (MAE) is the mean of the absolute errors; because it is expressed in the units of the response, it is difficult to judge from MAE alone whether an error is big or small. MAPE expresses the mean absolute error in percentage terms, thus allowing us to compare forecasts of data with different units of measure using different models.
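The error measures described above can be sketched as follows. This is a minimal example under stated assumptions: the helper names `rmse`, `mae`, and `mape` and the sample values are invented for illustration, and RMSE is computed here by dividing the squared errors by the number of observations (some texts divide by the residual degrees of freedom instead). MAPE is undefined when an observed value is zero, so the sketch rejects such input.

```python
import math


def rmse(actual, predicted):
    """Root mean square error, in the same units as the response."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(actual))


def mae(actual, predicted):
    """Mean absolute error, in the same units as the response."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def mape(actual, predicted):
    """Mean absolute percentage error, expressed in percent."""
    if any(a == 0 for a in actual):
        raise ValueError("MAPE is undefined when an observed value is zero")
    ratios = [abs((a - p) / a) for a, p in zip(actual, predicted)]
    return 100 * sum(ratios) / len(actual)


# Made-up observed and predicted values, for illustration only
actual = [100, 200, 300, 400]
predicted = [110, 190, 330, 360]

print("RMSE:", rmse(actual, predicted))
print("MAE: ", mae(actual, predicted))
print("MAPE:", mape(actual, predicted), "%")
```

Because MAPE is a percentage, it can be compared across responses measured in different units, whereas RMSE and MAE are only comparable for the same response variable.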
Classification problems
In classification problems, we measure how well the model assigns the sample data to their correct categories. Here, the objective is to find a classifier, such as Logistic Regression, CART, or Random Forest, that performs well in predicting classes for new data for which the response is not known. In classification models, the error (the counterpart of the residual in regression) is the count of misclassified observations.
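Counting misclassified observations takes only a few lines. The sketch below is illustrative: the function names `misclassification_count` and `error_rate` and the labels are assumptions, and the predicted labels stand in for the output of whatever classifier is being evaluated.

```python
def misclassification_count(actual, predicted):
    """Number of observations whose predicted class differs from the actual class."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    return sum(1 for a, p in zip(actual, predicted) if a != p)


def error_rate(actual, predicted):
    """Share of observations that were misclassified (between 0 and 1)."""
    return misclassification_count(actual, predicted) / len(actual)


# Made-up labels, for illustration only
actual_labels = ["yes", "no", "yes", "yes", "no", "no"]
predicted_labels = ["yes", "no", "no", "yes", "yes", "no"]

n_errors = misclassification_count(actual_labels, predicted_labels)
rate = error_rate(actual_labels, predicted_labels)
print(n_errors, "misclassified; error rate =", rate)
```

Since the goal is good performance on new data, these counts are most informative when computed on observations that were not used to fit the classifier.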

