In multiple regression, machine learning, and predictive analytics more generally, a common goal is to determine which independent variables contribute most to explaining the variability in the dependent variable. Unfortunately, many people fall into the same trap: ranking variable importance by comparing relative P-values. In this post, we caution modelers about biased and misleading statistics and offer alternatives for discovering which predictors fit your model best.
The answer to which variable is most important is more complicated than it first appears. For one thing, how one defines "most important" often depends on the subject area, the specific context, and the goals for the model. Furthermore, how the data are collected and measured can influence the apparent relative importance of each variable. We suggest that the analyst consider specific, individual changes in each predictor and the effect each change would have on the response variable.
Two methods that should not be used to evaluate the importance of a predictor variable are comparing the sizes of the regression coefficients and comparing the relative P-values in the model output. While regular regression coefficients describe the relationship between each predictor and the response, the predictors almost always have different units, making direct comparison impossible. For example, the coefficient for a variable called age, initially recorded in years, would be divided by twelve if age were instead recorded in months. However, this change of units does not change the variable's importance. And because converting every predictor to a common unit is generally impossible, comparisons across raw coefficients are essentially meaningless.
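To see the units problem concretely, here is a minimal sketch in plain NumPy with made-up data (the variable names `age_years` and `income` are our own illustration, not from any real dataset): recording age in months instead of years divides its coefficient by twelve, while the fit itself, and the variable's importance, is unchanged.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
age_years = rng.uniform(20, 60, n)       # hypothetical predictor, in years
income = rng.normal(50, 10, n)           # second predictor, different units
y = 2.0 * age_years + 0.5 * income + rng.normal(0, 5, n)

def ols_coefs(X, y):
    # Ordinary least squares; the column of ones adds an intercept.
    X = np.column_stack([np.ones(len(X)), X])
    return np.linalg.lstsq(X, y, rcond=None)[0]

b_years = ols_coefs(np.column_stack([age_years, income]), y)
b_months = ols_coefs(np.column_stack([age_years * 12, income]), y)

# Same data, same fit -- only the age coefficient is rescaled by 1/12,
# so its raw size says nothing about importance.
print(b_years[1], b_months[1] * 12)   # equal up to floating-point error
```

Note that the income coefficient is identical in both fits; only the arbitrarily rescaled predictor changes its coefficient.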
Additionally, even though we look for low P-values to help decide whether a variable belongs in the model in the first place, they are poor guides to importance. A very low P-value can reflect properties other than importance, such as a very precise estimate or a large sample size. For example, even when predictors are measured on the same scale, a small coefficient that can be estimated precisely will have a small P-value, while a large coefficient that is not estimated precisely will have a large P-value. A small P-value indicates that the predictor is statistically significant, but statistical significance does not translate into relative importance. Nor does keeping only the statistically significant predictors guarantee the most accurate model; selecting variables this way can itself introduce bias. It is vital to remember that it is NOT valid to compare the importance of predictors based solely on this value.
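A toy illustration of this point, using synthetic data of our own invention: below, a small coefficient on a well-measured predictor earns a large t-statistic (and hence a tiny P-value), while a much larger coefficient earns a small t-statistic, because near-collinearity with another predictor makes its estimate imprecise. (Collinearity is just one of several ways an estimate can be imprecise; we use it here because it is easy to simulate.)

```python
import numpy as np

rng = np.random.default_rng(1)
n = 2000
x1 = rng.normal(0, 1, n)                 # well-behaved predictor
x3 = rng.normal(0, 1, n)
x2 = x3 + rng.normal(0, 0.02, n)         # nearly collinear with x3 -> imprecise estimate
y = 0.2 * x1 + 2.0 * x2 + rng.normal(0, 1, n)

X = np.column_stack([np.ones(n), x1, x2, x3])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta
sigma2 = resid @ resid / (n - X.shape[1])          # residual variance
se = np.sqrt(sigma2 * np.diag(np.linalg.inv(X.T @ X)))
t = beta / se                                      # t-statistics

# The small coefficient (x1) gets a much larger |t| -- and hence a smaller
# P-value -- than the large but imprecisely estimated coefficient (x2).
print(f"x1: beta={beta[1]:.2f}, t={t[1]:.1f}")
print(f"x2: beta={beta[2]:.2f}, t={t[2]:.1f}")
```

The true effect of x2 is ten times that of x1, yet x1 is the one that looks "significant" in the output.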
The two methods that should be used to evaluate importance are comparing the standardized regression coefficients and evaluating the change in adjusted R-squared that each variable produces when it is added to a model that already contains all of the other variables. The standardized coefficients are straightforward to calculate: before fitting the multiple regression equation, all variables, both the response and the predictors, are standardized by subtracting the mean and dividing by the standard deviation. Each standardized regression coefficient then represents the change in the response, in standard deviations, for a one-standard-deviation change in a predictor.
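The standardization recipe can be sketched as follows (again with synthetic data; the variable names and coefficients are ours): after z-scoring everything, the coefficients are directly comparable even though age is in years and income is in dollars.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 300
age = rng.uniform(20, 60, n)               # years (hypothetical)
income = rng.normal(50_000, 12_000, n)     # dollars (hypothetical)
y = 100.0 * age + 0.05 * income + rng.normal(0, 500, n)

def zscore(v):
    # Subtract the mean, divide by the standard deviation.
    return (v - v.mean()) / v.std()

# Standardize the response and the predictors, then fit OLS without an
# intercept (standardization forces the intercept to zero).
Z = np.column_stack([zscore(age), zscore(income)])
beta_std, *_ = np.linalg.lstsq(Z, zscore(y), rcond=None)

# Each coefficient: standard deviations of change in y per one-SD change
# in that predictor -- now comparable despite the different raw units.
print(dict(zip(["age", "income"], np.round(beta_std, 3))))
```

Here the standardized coefficient for age comes out larger, reflecting that a typical (one-SD) change in age moves the response more than a typical change in income does.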
The second method is evaluating the change in adjusted R-squared. Because this analysis treats each variable as the last one entered into the model, the change represents the percentage of the variance that a variable explains and that the other variables in the model cannot. Essentially, this change encapsulates the amount of unique variance each variable explains above and beyond the other variables in the model. Note that if an automated variable selection algorithm selects a variable, that variable is not necessarily important: the algorithm may settle on only a subset of the candidates and omit important predictors. Automated selection also raises the question of practical versus statistical importance.
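The change-in-adjusted-R-squared calculation above can be sketched in plain NumPy (a toy example with made-up data; in it, x1 matters most, x2 a little, and x3 not at all):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 300
x1 = rng.normal(0, 1, n)
x2 = rng.normal(0, 1, n)
x3 = rng.normal(0, 1, n)
y = 1.5 * x1 + 0.5 * x2 + rng.normal(0, 1, n)   # x3 is pure noise

def adj_r2(X, y):
    # Adjusted R-squared for an OLS fit with an intercept.
    X = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    r2 = 1 - (resid @ resid) / ((y - y.mean()) @ (y - y.mean()))
    return 1 - (1 - r2) * (len(y) - 1) / (len(y) - X.shape[1])

cols = {"x1": x1, "x2": x2, "x3": x3}
full = adj_r2(np.column_stack(list(cols.values())), y)
for name in cols:
    rest = np.column_stack([v for k, v in cols.items() if k != name])
    # Unique contribution: adjusted R-squared lost when this variable is
    # dropped, i.e. gained when it is entered last.
    print(f"{name}: {full - adj_r2(rest, y):+.3f}")
```

The noise variable x3 contributes essentially nothing when entered last, while x1's unique contribution dominates, matching how the data were generated.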
In reality, some predictors cannot be changed, regardless of their relative importance or their accuracy in predicting the response variable. This is not an issue if the question is simply which variables best determine the response, but it is critical if the point of the model is to develop a plan or intervention that changes the response. Cost also enters the discussion. For example, suppose a change in the response variable can be obtained by either a large change in one predictor or a small change in another. In some situations, the large change may prove more cost-effective than the small one.
How do you determine variable importance in your models? Let us know in the comments below!
The SaberSmart Team