With so much terminology thrown around in big data, and especially in machine learning, we thought it would be helpful to explore some of the more common terms. In this post, we delve into two concepts that come up frequently in predictive analytics: statistical inference and predictive modeling. While sometimes used in similar situations, they really are distinct concepts.
When analyzing big data, we can build statistical models for inference purposes or for predictive purposes. For instance, imagine that we fit a simple linear regression model Y = b0 + b1X. If we fit this model for the purpose of statistical inference, our primary motivations and the conclusions we draw from statistical tests are about the data itself. If we built this model for predictive purposes, statistical inference and tests are still important, but our motivations are steered primarily by the success of the predicted values. Consequently, the metrics used to evaluate these models must reflect the underlying motivations.
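To make the distinction concrete, here is a minimal sketch of fitting the regression above with NumPy on synthetic data (the data-generating coefficients 2 and 3 are assumptions for illustration). The inference view inspects the fitted coefficients themselves; the prediction view inspects the fitted values.

```python
import numpy as np

# Synthetic data (illustrative assumption): Y = 2 + 3X + noise
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, 100)
Y = 2.0 + 3.0 * X + rng.normal(0, 1, 100)

# Closed-form least-squares fit of Y = b0 + b1*X
b1 = np.cov(X, Y, bias=True)[0, 1] / np.var(X)
b0 = Y.mean() - b1 * X.mean()

# Inference view: the coefficients are the object of interest
print(f"b0 = {b0:.2f}, b1 = {b1:.2f}")

# Prediction view: the fitted values are the object of interest
Y_hat = b0 + b1 * X
```

The same fitted model serves both purposes; only what we look at afterward changes.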
With statistical models developed for inference, the primary focus is on the data at hand and the journey to discover the underlying relationships in that data after statistical fluctuations have been accounted for. Essentially, there is a large focus on theory and domain knowledge, and the data is used to test those assumptions. Predictive models, on the other hand, extrapolate into the unknown using known relationships in the data. Those relationships may emerge from a causal or descriptive analysis, or from another technique such as machine learning.
When using models to explain existing behavior, which is essentially the purpose of statistical inference, the typical procedure is to first form a hypothesis about which variables will be useful and what the true form of the model is. This holds for almost any analysis project. If the coefficients in the model are determined to be wrong, or if the model errors are too egregious, we have to rebuild the model to get it right, which may mean transforming the input or response variables so that the model conforms to our assumptions. However, when predicting the target variable accurately is paramount, the actual distribution and construction of the model take a back seat. One does not need to explain precisely why individuals behave as they do, as long as one can predict how they will behave.
Take the simple case from the introduction, a fitted simple linear regression model Y = b0 + b1X. If we fit this model for the purpose of statistical inference, then we are typically interested in learning more about the relationship between the independent variable, X, and the dependent variable, Y. This could simply mean determining how to adjust the input X to achieve a predetermined result for Y. In more complex models, this could mean identifying which particular independent variables to adjust to produce a particular change in Y.
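The inference question "what X yields a target Y?" amounts to inverting the fitted relationship. A short sketch, with hypothetical fitted coefficients and a hypothetical target chosen purely for illustration:

```python
# Hypothetical fitted coefficients for Y = b0 + b1*X (assumed for illustration)
b0, b1 = 2.0, 3.0
target_Y = 17.0

# Invert the model: the X needed to hit the target on average
required_X = (target_Y - b0) / b1
print(required_X)  # 5.0
```

With many predictors, the same idea generalizes to asking which coefficients are large and well-estimated enough that adjusting their inputs would move Y meaningfully.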
If we use this model to predict Y, we must understand how to utilize the input variables to make better decisions, but we do not necessarily need an explanation of how the model actually works. Predictive analytics models may well be explicable, but a real-world explanation of why a model has a particular coefficient is not required. With our simple linear regression model, we want to determine Y-hat, an estimate of the actual value of Y. In this case, we input various X values and measure how close the resulting Y-hat values are to the observed values of Y. Measures such as R^2 and the F-test describe how well the model fits; evaluating them on held-out data tells us how "good" these models actually are in their predictive abilities.
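Computing Y-hat and R^2 is straightforward. A minimal sketch, using small made-up observations and hypothetical fitted coefficients (all values below are illustrative assumptions, not real results):

```python
import numpy as np

# Hypothetical observations and fitted coefficients (assumed for illustration)
X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
Y = np.array([5.1, 7.9, 11.2, 13.8, 17.1])
b0, b1 = 2.0, 3.0

Y_hat = b0 + b1 * X                    # predicted values
ss_res = np.sum((Y - Y_hat) ** 2)      # residual sum of squares
ss_tot = np.sum((Y - Y.mean()) ** 2)   # total sum of squares
r_squared = 1 - ss_res / ss_tot        # fraction of variance explained
```

An R^2 near 1 says the predictions track the observed Y values closely; computing the same quantity on data the model never saw is the honest test of predictive ability.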
Since a predictive model has a specific, clearly defined prediction goal, its performance and value can be measured without explaining causality. While statistical inference is mostly about explanation, it is important to remember that even strong correlations do not necessarily imply causation.
In one sentence: predictive modeling tells us what is likely to happen, while statistical inference explains why it happens and how we can change the expected result.
When do you use statistical inference? With predictive modeling dominating big data analytics headlines in recent years, are data explainers even necessary anymore? Let us know your opinions below in the comments!
The SaberSmart Team