So the lovable kiwi above is a GOOD fruit to help keep you FIT. Can you guess what we are discussing this week? Yup, model goodness-of-fit!
Just because we are able to fit a regression model to a data set does not mean that it is the right model to use. It is imperative to assess the goodness-of-fit of a regression model with established metrics and graphical displays. What actions then constitute an analysis of goodness-of-fit for a regression model? What can go wrong in the interpretation of the results and the use of a regression model that would be deemed to “fit poorly”? Today, we tackle what makes a regression model well fitting. One quick caveat: this post deals only with models fitted using ordinary least squares (OLS) regression, not maximum likelihood.
A regression model can be declared well fitting if its predicted values closely match the observed values. The most basic regression model is the “mean model”, in which every predicted value is simply the mean of the observed data. Obviously, a useful regression model should fit better than this. Assessing fit can involve many tests and the calculation of various statistics. However, the three statistics most commonly used to evaluate model fit are R-squared, the overall F-test, and the Root Mean Square Error (RMSE).
All three are based on two sums of squares, Sum of Squares Total (SST) and Sum of Squares Error (SSE). SST measures how far the data are from the mean and SSE measures how far the data are from the model’s predicted values. Different combinations of these two values provide different information about how the regression model compares to the mean model.
The difference between SST and SSE is the improvement in prediction from the regression model, compared to the mean model. Dividing that difference by SST gives R-squared, or the coefficient of determination. R-squared is therefore the proportional improvement in prediction from the regression model, compared to the mean model. Additionally, the F-test evaluates the null hypothesis that all regression coefficients other than the intercept are equal to zero, which is equivalent to saying that the true R-squared equals zero, versus the alternative that at least one coefficient is not zero.
A significant F-test indicates that the observed R-squared is reliable and not a spurious result of oddities in the data set. Thus, the F-test determines whether the proposed relationship between the response variable and the set of predictors is statistically reliable. Finally, the RMSE is the square root of the variance of the residuals. It indicates the absolute fit of the model to the data, that is, how close the observed data points are to the model’s predicted values, in the units of the response variable. Lower values of RMSE indicate better fit. If you have any background in statistics, you will recognize all of these values from the ANOVA table.
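To make these three statistics concrete, here is a minimal sketch in plain Python that computes them for a one-predictor OLS fit. The data are made up for illustration, the function name `fit_statistics` is our own, and we assume the usual residual degrees of freedom of n - p - 1 for the F-statistic and the RMSE.

```python
import math


def fit_statistics(x, y):
    """Fit y = b0 + b1 * x by OLS and return the fit statistics discussed above."""
    n = len(x)
    p = 1  # number of predictors, not counting the intercept
    x_bar = sum(x) / n
    y_bar = sum(y) / n

    # Closed-form OLS estimates for a single predictor.
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = s_xy / s_xx
    intercept = y_bar - slope * x_bar
    predicted = [intercept + slope * xi for xi in x]

    # SST: how far the data are from the mean model.
    sst = sum((yi - y_bar) ** 2 for yi in y)
    # SSE: how far the data are from the regression model's predictions.
    sse = sum((yi - y_hat) ** 2 for yi, y_hat in zip(y, predicted))

    # Assumption: residual degrees of freedom are n - p - 1.
    df_error = n - p - 1
    return {
        "sst": sst,
        "sse": sse,
        "r_squared": (sst - sse) / sst,
        "f_stat": ((sst - sse) / p) / (sse / df_error),
        "rmse": math.sqrt(sse / df_error),
    }


# Made-up illustrative data, roughly following y = 2x.
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
stats = fit_statistics(x, y)
for name, value in stats.items():
    print(f"{name}: {value:.4f}")
```

Note that this sketch stops at the F-statistic itself. Turning it into a p-value requires the F distribution with p and n - p - 1 degrees of freedom, which a statistics package would normally supply.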
To understand these statistics conceptually, it is first prudent to understand the residuals, or the differences between the observed and predicted values. The residuals thus measure the error, and the error has to be stochastic. Stochastic is basically a fancy word for unpredictable and random. Consequently, the residuals also need to look random. This idea is easy to understand with a die-rolling analogy. When you roll a die, you should not be able to predict which number will show on any given toss. However, you can assess a series of tosses to determine whether the displayed numbers follow a random pattern.
For example, if the number one shows up more often than the one-sixth of the time that randomness dictates, you know something is wrong with your understanding of how the die actually behaves. The die below shows a one on every single roll. Consequently, we can assume this die is weighted, or actually, that each roll is the same one just looped. Thanks to James Neilson for this awesome animation! The full GIF can be found here on giphy.
Like the die, the errors should be unpredictable for any given observation. Since they are the differences between observed and predicted values, the residuals should not be systematically high or low. So, the residuals should be centered on zero throughout the range of fitted values. In other words, the model is correct on average for all fitted values. The plot used to check this is a Residual vs Fitted Value graph. Additionally, residuals for OLS regressions should be approximately normally distributed. Naturally then, a QQ-plot is used to assess the normality of the residuals.
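Both checks are usually done by eye, but the same ideas can be sketched numerically. The example below is only an illustration under our own assumptions: the data are simulated, the seed is arbitrary, the fitted values are split into three equal bins, and the plotting positions (i + 0.5) / n stand in for whatever convention your plotting tool uses.

```python
import math
import random
import statistics

random.seed(42)  # arbitrary seed so the sketch is reproducible

# Simulated data: a true straight line plus normal random error.
n = 200
x = [random.uniform(0, 10) for _ in range(n)]
y = [3.0 + 2.0 * xi + random.gauss(0, 1) for xi in x]

# Closed-form OLS fit for one predictor.
x_bar = statistics.fmean(x)
y_bar = statistics.fmean(y)
s_xx = sum((xi - x_bar) ** 2 for xi in x)
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
slope = s_xy / s_xx
intercept = y_bar - slope * x_bar
fitted = [intercept + slope * xi for xi in x]
residuals = [yi - fi for yi, fi in zip(y, fitted)]

# Check 1 (the idea behind a Residual vs Fitted plot): the average residual
# should stay near zero in the low, middle, and high range of fitted values.
pairs = sorted(zip(fitted, residuals))
third = n // 3
bins = [pairs[:third], pairs[third:2 * third], pairs[2 * third:]]
bin_means = [statistics.fmean([r for _, r in b]) for b in bins]
print("mean residual by fitted-value bin:", [round(m, 3) for m in bin_means])

# Check 2 (the idea behind a QQ-plot): sorted standardized residuals should
# line up with the matching quantiles of a standard normal distribution.
sd = statistics.stdev(residuals)
sample_q = sorted(r / sd for r in residuals)
normal = statistics.NormalDist()
theory_q = [normal.inv_cdf((i + 0.5) / n) for i in range(n)]


def pearson(a, b):
    """Pearson correlation of two equal-length lists."""
    a_bar, b_bar = statistics.fmean(a), statistics.fmean(b)
    cov = sum((ai - a_bar) * (bi - b_bar) for ai, bi in zip(a, b))
    var_a = sum((ai - a_bar) ** 2 for ai in a)
    var_b = sum((bi - b_bar) ** 2 for bi in b)
    return cov / math.sqrt(var_a * var_b)


qq_corr = pearson(theory_q, sample_q)
print("QQ correlation:", round(qq_corr, 4))
```

Bin means close to zero and a QQ correlation close to one are what a well-behaved model should produce. In practice you would still draw both plots, since a picture reveals patterns that a single summary number can hide.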
In a poorly fit regression model, you can predict non-zero values for the residuals based on the fitted value. The non-random pattern in the residuals indicates that the deterministic portion of the model is not capturing some relevant explanatory information. Possible explanations include a missing variable, a missing higher-order term of a variable already in the model that would explain curvature, or a missing interaction between terms already in the model. Everything that your predictors can explain should be captured by the model, so that only random error is left over. If there are non-random patterns in your residuals, the model is missing something, and that omission is hurting the regression model.
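Here is a small sketch of the missing higher-order term case. The data are hypothetical and deliberately noise-free: the response is exactly x squared, but we fit only a straight line, so everything the line cannot explain ends up in the residuals.

```python
# Hypothetical data with pure curvature and no noise: y = x^2.
x = list(range(11))
y = [xi ** 2 for xi in x]
n = len(x)

# Fit a straight line by OLS, deliberately omitting the squared term.
x_bar = sum(x) / n
y_bar = sum(y) / n
s_xx = sum((xi - x_bar) ** 2 for xi in x)
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
slope = s_xy / s_xx
intercept = y_bar - slope * x_bar
fitted = [intercept + slope * xi for xi in x]
residuals = [yi - fi for yi, fi in zip(y, fitted)]

for fi, ri in zip(fitted, residuals):
    print(f"fitted = {fi:6.1f}   residual = {ri:6.1f}")

# x is sorted and the slope is positive, so the fitted values are already in
# increasing order and we can slice the residuals into low, middle, and high.
low, mid, high = residuals[:4], residuals[4:7], residuals[7:]
print("mean residual (low fitted): ", sum(low) / len(low))
print("mean residual (mid fitted): ", sum(mid) / len(mid))
print("mean residual (high fitted):", sum(high) / len(high))
```

The residuals are positive at both ends of the fitted range and negative in the middle, a U shape that you can predict from the fitted value alone. That is exactly the kind of non-random pattern that points to a missing squared term.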
In the real world, the best measure of a model’s fit depends on your objectives, and more than one model is often deemed useful. Remember, the metrics and strategies discussed above are applicable only to regression models that use OLS estimation. Many types of regression models, such as mixed models, generalized linear models, and event history models, to name a few, use maximum likelihood estimation. Different metrics and criteria should be used in those cases. Look out for an explanation of those in a future post!
Have you ever created a model only to discover it is less than useful? Let us know in the comments below! We learn and grow as analysts by sharing our experiences. We can’t wait to join the conversation!
The SaberSmart Team