Last week, we talked about the differences between building models for statistical inference and building models for prediction. Predictive modeling has not always been part of the statistics community. One person largely responsible for bridging the gap between the computer science and statistics communities is Leo Breiman, known colloquially as the father of CART and random forests. In his 2001 article, "Statistical Modeling: The Two Cultures", he articulated his views on the difference between the statistics community of his day and the machine learning community. Here are our thoughts on this groundbreaking paper, and on the topic in general.
While machine learning and statistical modeling may seem like two identical branches of predictive modeling, there are nuanced differences between them. In "Statistical Modeling: The Two Cultures", Breiman argued that statisticians rely too heavily on data modeling, and that machine learning techniques make progress by instead judging models on their predictive accuracy. In fact, Breiman states:
“Statisticians assume that the data are generated by a given stochastic data model. The other [machine learning] uses algorithmic models and treats the data mechanism as unknown. The statistical community has been committed to the almost exclusive use of data models. This commitment has led to irrelevant theory, questionable conclusions, and has kept statisticians from working on a large range of interesting current problems.”
Statisticians try to understand how the data were generated by deriving the statistical properties of their estimators, e.g. p-values and unbiasedness. They also study the underlying distribution of the population in question and the conclusions one would expect if the experiment were reproduced a great number of times. This is akin to the simulation of the Super Bowl we performed back in February.
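To make that concrete, here is a minimal sketch of the repeated-experiment idea (this is an invented illustration, not our original Super Bowl simulation): draw many samples from a known population and check that the sample mean behaves as an unbiased estimator.

```python
import random
import statistics

# Hypothetical population: Gaussian with a known mean and spread.
random.seed(42)
POP_MEAN, POP_SD = 100.0, 15.0
N_EXPERIMENTS, SAMPLE_SIZE = 2000, 30

# Reproduce the "experiment" many times, recording the estimate each time.
sample_means = []
for _ in range(N_EXPERIMENTS):
    sample = [random.gauss(POP_MEAN, POP_SD) for _ in range(SAMPLE_SIZE)]
    sample_means.append(statistics.mean(sample))

# Across many repetitions, the average of the sample means should land
# close to the true population mean: the estimator is unbiased.
grand_mean = statistics.mean(sample_means)
print(round(grand_mean, 2))
```

The spread of `sample_means` around the true mean is exactly the sampling distribution that classical inference (confidence intervals, p-values) reasons about analytically.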
Statistical modeling techniques are usually applied to low-dimensional data sets. Conversely, machine learning requires no prior assumptions about the underlying relationships between the variables. You pretty much toss in all the data you have, and the algorithm processes it to discover patterns, from which you can make predictions on new data. Machine learning treats an algorithm like a black box: as long as it works, it's fine! It is generally applied to high-dimensional data sets, since the more data you have, the more accurate your predictions tend to become.
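A toy sketch of the "black box" mindset, using invented data and a 1-nearest-neighbour rule as the stand-in algorithm: nothing is assumed about how the features relate to the label, the model simply memorises the training points and matches new ones to them.

```python
import math

def predict(train_X, train_y, x):
    """Return the label of the training point closest to x."""
    best_i = min(range(len(train_X)),
                 key=lambda i: math.dist(train_X[i], x))
    return train_y[best_i]

# Toy training data: two well-separated groups of 2-D points.
train_X = [[1.0, 2.0], [1.5, 1.8], [8.0, 8.0], [9.0, 9.5]]
train_y = ["low", "low", "high", "high"]

print(predict(train_X, train_y, [1.2, 1.9]))  # point near the "low" group
print(predict(train_X, train_y, [8.5, 9.0]))  # point near the "high" group
```

No distributional assumptions, no parameters to interpret: the only question asked of the model is whether its predictions on new points are any good.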
Perhaps the biggest difference between these two fields is their emphasis. Although statistical modeling and machine learning both try to learn from data, they emphasize different aspects of how and when their methods should be applied. Statistics emphasizes formal inference, with confidence intervals, hypothesis tests, and optimal estimators on smaller data sets. In contrast, machine learning focuses on making accurate predictions. Because of this emphasis on prediction above all, some people characterize machine learning as less "strict" about testing assumptions: in machine learning, making good predictions trumps more formal considerations like p-values.
In a nutshell, for statistics the model comes first; for machine learning the data are paramount. Because the emphasis in machine learning is on the data rather than the model, validation techniques that separate the data into training and test sets are very important. The quality of a solution lies not in a p-value, but in demonstrating how well the solution performs on previously unseen data. That said, fitting a statistical model and training a decision tree or random forest both involve estimating unknown quantities from the data: the best split points of a tree are determined from the data, just as the parameters of the conditional distribution of the dependent variable are.
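The holdout idea can be sketched in a few lines. Here is a minimal, invented example: fit a one-parameter line on a training portion and judge it only by its error on the held-out test portion.

```python
import random

random.seed(0)

# Toy dataset: y = 2x plus Gaussian noise.
data = [(x, 2 * x + random.gauss(0, 1)) for x in range(100)]
random.shuffle(data)

# Separate the data into training and test sets (80/20 split).
split = int(0.8 * len(data))
train, test = data[:split], data[split:]

# "Fit" on the training set only: least-squares slope through the origin.
slope = (sum(x * y for x, y in train) /
         sum(x * x for x, y in train))

# Judge quality on the unseen test set, not on an in-sample statistic.
test_mse = sum((y - slope * x) ** 2 for x, y in test) / len(test)
print(round(slope, 2))
```

The fitted slope should recover something close to the true value of 2, and the test-set mean squared error, not a p-value, is the headline number a machine learner would report.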
Increasingly, however, these communities are interacting and sharing ideas. Research in machine learning draws extensively on applied statistics: a quick survey of any machine learning or data mining software package will reveal clustering techniques such as k-means, also found in statistics, as well as dimension-reduction techniques such as principal components analysis and factor analysis. Both fields also rely heavily on computing and software. Interestingly, the main language for machine learning is Python, while statistics makes heavy use of R. Data analysis is widely done in Python as well, as a quick perusal of our GitHub shows.
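As an illustration of the shared toolbox, here is a from-scratch k-means sketch (k = 2) on invented 2-D points, the kind of clustering routine that ships in both statistics and machine learning packages:

```python
import math
import random

random.seed(1)

# Two toy clusters of 2-D points, centred near (0, 0) and (5, 5).
points = ([(random.gauss(0, 0.5), random.gauss(0, 0.5)) for _ in range(20)] +
          [(random.gauss(5, 0.5), random.gauss(5, 0.5)) for _ in range(20)])

# Naive initialisation: one seed point from each end of the list.
centers = [points[0], points[-1]]

for _ in range(10):
    # Assignment step: attach each point to its nearest centre.
    clusters = [[], []]
    for p in points:
        j = min((0, 1), key=lambda j: math.dist(p, centers[j]))
        clusters[j].append(p)
    # Update step: move each centre to the mean of its cluster.
    centers = [(sum(x for x, _ in c) / len(c),
                sum(y for _, y in c) / len(c))
               for c in clusters]

print([tuple(round(v, 1) for v in c) for c in centers])
```

After a few iterations the two centres settle near the true cluster locations; library versions (scikit-learn in Python, `kmeans` in R) add smarter initialisation and stopping rules, but the core loop is the same.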
To stay successful in the dynamic, fast-paced world of analytics, we think data scientists should give equal weight to learning both practices, so they can help their customers and organizations stay relevant.
What are your opinions? Can you think of examples where a statistically sound model is preferred over the one that predicts best? Let us know in the comments!
The SaberSmart Team