Have you heard of machine learning but haven’t found a way to implement any algorithms? Do you use R for all of your machine learning models and are wondering how to scale and deploy your models to production quickly and efficiently? Do you solely use R, or caret, for your machine learning models and want to diversify your skillset?
No judgement if you do, but let me introduce you to what is, in my opinion, a superior way to craft and deploy machine learning models: Python and scikit-learn.
Scikit-learn provides a range of supervised and unsupervised learning algorithms via a consistent interface in Python. While there are many packages that have the ability to implement machine learning models, scikit-learn is a robust library which you can use to bring machine learning into a production system through Python.
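To make the "consistent interface" concrete, here is a minimal sketch, assuming the iris dataset bundled with scikit-learn and three models picked only for illustration. The point is that supervised and unsupervised estimators are all driven by the same fit and predict calls.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

# Small example dataset that ships with scikit-learn.
X, y = load_iris(return_X_y=True)

# Two different supervised models, trained and queried with identical calls.
supervised_predictions = {}
for model in (LogisticRegression(max_iter=1000), DecisionTreeClassifier(random_state=0)):
    model.fit(X, y)
    supervised_predictions[type(model).__name__] = model.predict(X[:5])

# An unsupervised model follows the same pattern, just without labels.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
kmeans.fit(X)
cluster_labels = kmeans.predict(X[:5])

for name, predictions in supervised_predictions.items():
    print(name, predictions)
print("KMeans", cluster_labels)
```

Swapping in another algorithm usually means changing only the line that constructs the model.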
Python, as an open-source platform, is a straightforward and convenient way to work with machine learning algorithms due to the sheer size of its community. Because Python is a general-purpose programming language, with access to tools for everything from scientific algorithms to plotting, web scraping, and general data wrangling, it is becoming one of the industry’s preferred tools for data exploration and complex artificial intelligence models.
One vision for scikit-learn is to provide the level of robustness and support required for use in production systems. This means a deep focus on concerns such as ease of use, code quality, collaboration, documentation and performance. The library is focused on modeling data, not necessarily on loading, manipulating and summarizing data. However, those features are available through two other popular Python libraries, NumPy and Pandas, which work in harmony with scikit-learn.
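As a rough illustration of that division of labor, the sketch below lets Pandas hold and summarize the data while scikit-learn does the modeling. The column names and numbers are made up for the example, and the linear regression is just a stand-in model.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical toy data; Pandas handles loading, wrangling and summarizing.
df = pd.DataFrame({
    "hours_studied": [1.0, 2.0, 3.0, 4.0, 5.0],
    "hours_slept": [8.0, 7.0, 7.5, 6.0, 6.5],
    "exam_score": [52.0, 58.0, 65.0, 70.0, 78.0],
})
summary = df.describe()

features = df[["hours_studied", "hours_slept"]]
target = df["exam_score"]

# scikit-learn accepts the DataFrame directly for modeling.
model = LinearRegression().fit(features, target)
new_rows = pd.DataFrame({"hours_studied": [6.0], "hours_slept": [7.0]})
prediction = model.predict(new_rows)

# The same model can be fit on plain NumPy arrays instead.
array_model = LinearRegression().fit(features.to_numpy(), target.to_numpy())
array_prediction = array_model.predict(np.array([[6.0, 7.0]]))

print(summary)
print(prediction, array_prediction)
```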
Scikit-learn, therefore, provides a clean and consistent interface to a wide variety of different models. It provides an analyst with many options, or parameters, for each model, but also chooses sensible defaults. Its documentation is exceptional, as it helps one to understand the models as well as how to use them properly. It is also being actively developed. While that does mean that at some point our models may be out of date, it also means that new features reach us quickly, allowing us to keep up with the latest technologies and developments.
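Here is a small sketch of what "many parameters, sensible defaults" looks like in practice, using a random forest as an arbitrary example. The specific values I override are illustrative assumptions, not recommendations.

```python
from sklearn.ensemble import RandomForestClassifier

# Constructing a model with no arguments uses the library's defaults.
default_model = RandomForestClassifier()
default_params = default_model.get_params()

# Any default can be overridden by name when the model is created.
tuned_model = RandomForestClassifier(n_estimators=200, max_depth=5, random_state=0)
tuned_params = tuned_model.get_params()

# Collect only the parameters that differ from the defaults.
changed = {
    name: (default_params[name], value)
    for name, value in tuned_params.items()
    if default_params[name] != value
}

print(len(default_params), "parameters available")
for name, (old, new) in changed.items():
    print(f"{name}: {old} -> {new}")
```

Every estimator exposes get_params, so the same inspection works for any model in the library.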
There does exist a similar library in R, a popular statistical programming language. While caret is an excellent R package that attempts to provide a consistent interface for machine learning models in R, it is nowhere near as elegant a solution as scikit-learn. Furthermore, in R, switching between different models usually means learning a new package written by a different author. The interface may be completely different, the documentation may or may not be helpful in learning the package, and the package may or may not be in a beta type deployment state.
With R and caret, you might be able to build and launch your first model more quickly. However, mastering scikit-learn and similar libraries will provide you with a deeper and more complete toolset that you can feel safe using in your machine learning production systems.
Some popular groups of models provided by scikit-learn include:
Perhaps the most useful feature is the ability of scikit-learn to pare down the number of features used in the model by importance. In the era of big data, far more numbers and statistics are available as candidate model features than any one model needs. By narrowing your model’s scope to only those features that have predictive power, you reduce the likelihood of encountering problems, such as over-fitting, in your final model for your production environment.
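One way to do this, sketched below, is SelectFromModel wrapped around a random forest. The synthetic dataset and the "median" threshold are assumptions chosen for the example; scikit-learn offers several other feature selection tools as well.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

# Synthetic data: 20 candidate features, only 4 of which carry signal.
X, y = make_classification(
    n_samples=500,
    n_features=20,
    n_informative=4,
    n_redundant=0,
    random_state=0,
)

# Keep the features whose importance is at or above the median importance.
selector = SelectFromModel(
    RandomForestClassifier(n_estimators=100, random_state=0),
    threshold="median",
)
X_reduced = selector.fit_transform(X, y)
kept = selector.get_support(indices=True)

print("columns before:", X.shape[1])
print("columns after:", X_reduced.shape[1])
print("kept column indices:", kept)
```

The reduced matrix can then be passed to whichever model you plan to ship.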
Finally, scikit-learn provides more than just models. For example, the library also provides the ability to create professional-looking graphics from the results of most machine learning algorithms. Almost all of the graphical parameters are customizable; however, the defaults again do a great job. The reliability of the default parameters, in both modelling and visualization, means scikit-learn can provide actionable results faster than most competitors. Below are a couple of visualization examples:
Testimonials and Experience
While machine learning is still gaining traction in the mainstream, the scikit-learn testimonials page lists Spotify, Inria, Mendeley, wise.io, Evernote, Telecom ParisTech and AWeber as public users of the library. If these are only the companies that have publicly presented on their use, then there are very likely tens to hundreds of larger organizations using the library. Scikit-learn also has good test coverage and managed releases and, consequently, is suitable for prototype and production projects alike.
For my thesis project to complete my Master’s Degree in Data Science, I used scikit-learn and Python to develop a machine learning model to predict the potential career worth in the Major Leagues of current minor league hitters, based on their public playing statistics at the various lower development levels. Scikit-learn gave me the flexibility to quickly pivot and test the accuracy of various model types, including random forests, support vector machines, and neural nets, before finally settling on a gradient boosting model.
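That kind of quick pivoting can be sketched roughly as follows. This is not the thesis code: the data here is synthetic, and the model settings and the R-squared scoring are assumptions made only to show the pattern of comparing model types with cross-validation.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic stand-in for real playing statistics (hypothetical data).
X, y = make_regression(n_samples=200, n_features=10, noise=10.0, random_state=0)

# Candidate model types; the SVM and neural net get scaled inputs.
candidates = {
    "random forest": RandomForestRegressor(n_estimators=100, random_state=0),
    "support vector machine": make_pipeline(StandardScaler(), SVR()),
    "neural net": make_pipeline(
        StandardScaler(), MLPRegressor(max_iter=500, random_state=0)
    ),
    "gradient boosting": GradientBoostingRegressor(random_state=0),
}

# Score every candidate the same way with 3-fold cross-validation.
scores = {
    name: cross_val_score(model, X, y, cv=3, scoring="r2").mean()
    for name, model in candidates.items()
}
best_name = max(scores, key=scores.get)

for name, score in scores.items():
    print(f"{name}: mean R^2 = {score:.3f}")
print("best on this synthetic data:", best_name)
```

Which model comes out on top depends entirely on the data, so the ranking printed here says nothing about the baseball problem itself.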
I also used scikit-learn to create and submit my predictions for the now famous, at least within data science circles, Titanic problem, known on Kaggle as Machine Learning From Disaster, because who doesn’t love puns? If you’re wondering, I cracked the top 45% at the time of my submission.
If you want to quickly see the power of scikit-learn, one example on their site provides the entire source code and data source to build your own facial image recognition software!
For further learning, I recommend starting out with the quick-start tutorial and moving to the user guide and example gallery for explanations on the specific algorithms that are available.
Ultimately, scikit-learn is a library and the API reference will be the best documentation for getting things done and solving any issues that arise in the analytics pipeline.
What do you think? Will you take scikit-learn out for a test drive in machine learning, or will you make the switch? Do you use more exotic software for your machine learning needs, like Alteryx? Let me know in the comments below!
The SaberSmart Team