SaberSmart

Why use Scikit-Learn for Machine Learning?

8/16/2018


Have you heard of machine learning but haven’t found a way to implement any algorithms? Do you use R, or caret, for all of your machine learning models and wonder how to scale and deploy them to production quickly and efficiently? Or do you simply want to diversify your skillset?

No judgement if you do, but let me introduce you to what is, in my opinion, a superior way to craft and deploy machine learning models: Python and scikit-learn.

Scikit-learn provides a range of supervised and unsupervised learning algorithms via a consistent interface in Python. While many packages can implement machine learning models, scikit-learn is a robust library that you can use to bring machine learning into a production system through Python.
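
To make the "consistent interface" concrete, here is a minimal sketch. The dataset (the built-in iris data) and the two models are my own picks for illustration, not anything specific to this post; the point is only that unrelated model families are driven through the same fit, predict and score methods.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Load a small built-in dataset and hold out a test split.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Two unrelated model families (arbitrary choices for this sketch),
# driven through the same three methods.
models = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "decision_tree": DecisionTreeClassifier(random_state=0),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)                 # learn from the training data
    predictions = model.predict(X_test)         # predict labels for unseen rows
    scores[name] = model.score(X_test, y_test)  # mean accuracy on the test split
    print(f"{name}: accuracy={scores[name]:.3f}")
```

Swapping in a different classifier should only require changing the entry in the dictionary; the loop itself stays the same.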

Python, as an open-source platform, is a straightforward and convenient way to work with machine learning algorithms, thanks in large part to the sheer size of its community. Because Python is a general-purpose programming language, with access to tools for everything from scientific computing to plotting, web scraping, and general data wrangling, it is becoming one of the industry’s preferred tools for data exploration and complex artificial intelligence models.

One vision for scikit-learn is to provide the level of robustness and support required for use in production systems. This means a deep focus on concerns such as ease of use, code quality, collaboration, documentation and performance. The library is focused on modeling data, not necessarily on loading, manipulating and summarizing data. However, those features are available through two other popular Python libraries, NumPy and pandas, which work in harmony with scikit-learn.
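
As a rough sketch of that division of labor, the snippet below builds a small table with NumPy and pandas and hands it straight to a scikit-learn model. The column names (at_bats, walk_rate, runs_created) and the numbers are entirely made up for illustration; the target is generated from a known linear rule so the fit is easy to sanity-check.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical data: NumPy generates the columns, pandas holds the table.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "at_bats": rng.integers(100, 600, size=50),
    "walk_rate": rng.uniform(0.02, 0.18, size=50),
})
# Synthetic target: a known linear rule plus a little noise.
df["runs_created"] = (
    0.12 * df["at_bats"] + 150 * df["walk_rate"] + rng.normal(0, 1, size=50)
)

# pandas does the wrangling...
features = df[["at_bats", "walk_rate"]]
target = df["runs_created"]

# ...and scikit-learn accepts the DataFrame and Series directly.
model = LinearRegression().fit(features, target)
r_squared = model.score(features, target)
print(dict(zip(features.columns, model.coef_.round(3))), round(r_squared, 3))
```

The fitted coefficients should land close to the 0.12 and 150 used to generate the data, which is a quick way to confirm the pieces are wired together correctly.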

Scikit-learn, therefore, provides a clean and consistent interface to a wide variety of models. It gives an analyst many options, or parameters, for each model, but also chooses sensible defaults. Its documentation is exceptional, helping one understand the models as well as how to use them properly. The library is also actively developed. That does mean our models may eventually fall out of date, but it also means that new features reach us quickly, allowing us to keep up with the latest technologies and developments.
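
Here is a small sketch of what "many options, but sensible defaults" looks like in practice. The random forest and the specific values I override are arbitrary choices for illustration; the default values themselves can differ between scikit-learn releases, so the code prints them rather than assuming them.

```python
from sklearn.ensemble import RandomForestClassifier

# With no arguments, every hyperparameter takes its documented default.
default_model = RandomForestClassifier()
defaults = default_model.get_params()
print(defaults["n_estimators"], defaults["max_depth"])

# Any option can be overridden at construction time...
tuned_model = RandomForestClassifier(n_estimators=300, max_depth=5, random_state=0)

# ...or later through set_params, which uses the same parameter names.
tuned_model.set_params(max_depth=8)
print(tuned_model.get_params()["max_depth"])
```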

There is a similar library in R, a popular statistical programming language. While caret is an excellent R package that attempts to provide a consistent interface for machine learning models in R, it is nowhere near as elegant a solution as scikit-learn. Furthermore, in R, switching between different models usually means learning a new package written by a different author. The interface may be completely different, the documentation may or may not be helpful in learning the package, and the package may or may not still be in a beta state.

With R and caret, you might be able to build and launch your first model more quickly. However, mastering scikit-learn and similar libraries will provide you with a deeper and more complete toolset that you can feel safe using in your machine learning production systems.

Ability
Picture
Source: http://scikit-learn.org

Some popular groups of models provided by scikit-learn include:

  • Clustering: for grouping unlabeled data, using algorithms such as K-Means.
  • Cross Validation: for estimating the performance of supervised models on unseen data.
  • Datasets: for test datasets and for generating datasets with specific properties for investigating model behavior.
  • Dimensionality Reduction: for reducing the number of attributes in data for summarization, visualization and feature selection, using techniques such as Principal Component Analysis.
  • Ensemble Methods: for combining the predictions of multiple supervised models.
  • Feature Extraction: for defining attributes in image and text data.
  • Feature Selection: for identifying meaningful attributes from which to create supervised models.
  • Parameter Tuning: for getting the most out of supervised models.
  • Manifold Learning: for summarizing and depicting complex multi-dimensional data.
  • Supervised Models: a vast array, including generalized linear models, discriminant analysis, naive Bayes, lazy methods (such as nearest neighbors), neural networks, support vector machines and decision trees, to name just a few.

This is just a small sampling; more information can be found on the scikit-learn site.
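
Several of the groups above can be combined in a few lines. The sketch below is one possible arrangement, not a recommended recipe: it generates a synthetic dataset, chains dimensionality reduction into an ensemble model using scikit-learn's Pipeline class, estimates accuracy with cross validation, and then tunes a small grid of parameters. The dataset sizes and grid values are arbitrary assumptions on my part.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.pipeline import Pipeline

# Datasets: generate a synthetic classification problem.
X, y = make_classification(
    n_samples=300, n_features=20, n_informative=5, random_state=0
)

# Dimensionality reduction feeding an ensemble method.
pipeline = Pipeline([
    ("reduce", PCA(n_components=10)),
    ("forest", RandomForestClassifier(n_estimators=50, random_state=0)),
])

# Cross validation: estimate accuracy on unseen data with 5 folds.
cv_scores = cross_val_score(pipeline, X, y, cv=5)
print(f"baseline accuracy: {cv_scores.mean():.3f}")

# Parameter tuning: search a small, arbitrary grid of settings.
# Step names from the pipeline prefix each parameter ("step__parameter").
search = GridSearchCV(
    pipeline,
    param_grid={
        "reduce__n_components": [5, 10],
        "forest__max_depth": [3, None],
    },
    cv=5,
)
search.fit(X, y)
print(search.best_params_, f"{search.best_score_:.3f}")
```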

Perhaps the most useful feature is scikit-learn’s ability to pare down the number of features used in a model by importance. In the era of big data, a wide variety of numbers and statistics are available to use as model features. By narrowing your model’s scope to only those features that have predictive power, you reduce the likelihood of encountering problems, such as over-fitting, in the final model for your production environment.
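
One way to do this, sketched here under my own assumptions, is SelectFromModel wrapped around a random forest: the forest ranks the features, and only those with above-average importance are kept. The data is synthetic, with 25 columns of which only the first 4 carry signal, so it is easy to see whether the selection behaves sensibly.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

# Synthetic data: 25 columns, only 4 of which carry signal.
# shuffle=False keeps the informative columns first (indices 0 to 3).
X, y = make_classification(
    n_samples=500, n_features=25, n_informative=4, n_redundant=0,
    shuffle=False, random_state=0,
)

# Fit a forest, then keep the features whose importance is at or above the
# mean importance (the default threshold SelectFromModel uses here).
selector = SelectFromModel(RandomForestClassifier(n_estimators=200, random_state=0))
selector.fit(X, y)

kept = selector.get_support(indices=True)  # column indices that survived
X_reduced = selector.transform(X)
print(f"kept {X_reduced.shape[1]} of {X.shape[1]} features: {kept.tolist()}")
```

The reduced matrix can then be passed to any other model in place of the original one.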

Finally, scikit-learn offers more than just models. For example, its example gallery shows how to create professional-looking graphics from the results of most machine learning algorithms, with the plotting itself handled by matplotlib. Almost all of the graphical parameters are customizable; however, the defaults again do a great job. The reliability of the default parameters, in both modelling and visualization, means scikit-learn can often provide actionable results faster than its competitors. Below are a couple of visualization examples:
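
For a sense of how little code a plot like this takes, here is a minimal sketch that assumes matplotlib is installed alongside scikit-learn. It is my own small example rather than one taken from the gallery: it projects the built-in iris data onto two principal components and draws a scatter plot using mostly default styling.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line to open a window
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

# Project the four iris measurements onto two principal components.
iris = load_iris()
points = PCA(n_components=2).fit_transform(iris.data)

# Scatter the projected points, one color per species (default color cycle).
fig, ax = plt.subplots(figsize=(6, 4))
for label, name in enumerate(iris.target_names):
    mask = iris.target == label
    ax.scatter(points[mask, 0], points[mask, 1], label=name)
ax.set_xlabel("First principal component")
ax.set_ylabel("Second principal component")
ax.set_title("Iris measurements projected with PCA")
ax.legend()
```

Calling fig.savefig with a file name of your choice, or plt.show() with an interactive backend, would then write or display the figure.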
Picture
Picture

Testimonials and Experience

While machine learning is still gaining traction in the mainstream, the scikit-learn testimonials page lists Spotify, Inria, Mendeley, wise.io, Evernote, Telecom ParisTech and AWeber as public users of the library. If these are only a small sample of the companies that have publicly described their use, then there are very likely tens to hundreds of larger organizations using the library. Scikit-learn also has good test coverage and managed releases, which makes it suitable for prototype and production projects alike.

For the thesis project that completed my Master’s Degree in Data Science, I used scikit-learn and Python to develop a machine learning model that predicts the potential Major League career worth of current minor league hitters, based on their public playing statistics at the various lower development levels. Scikit-learn gave me the flexibility to quickly pivot and test the accuracy of various model types, including random forests, support vector machines, and neural nets, before finally settling on a gradient boosting model.
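
That kind of quick pivoting can be sketched as a loop over candidate models, each scored the same way. To be clear, this is not the thesis code: the data is a synthetic stand-in generated by make_regression, the settings are arbitrary, and I have left out the neural net to keep the sketch fast, though it would slot into the dictionary like any other entry.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic stand-in for a table of player statistics and a numeric target.
X, y = make_regression(n_samples=300, n_features=10, noise=10.0, random_state=0)

# Candidate models with arbitrary, illustrative settings.
candidates = {
    "random_forest": RandomForestRegressor(n_estimators=100, random_state=0),
    "support_vector": make_pipeline(StandardScaler(), SVR(C=100.0)),
    "gradient_boosting": GradientBoostingRegressor(random_state=0),
}

# Score every candidate the same way: mean R^2 over 5 cross-validation folds.
results = {
    name: cross_val_score(model, X, y, cv=5, scoring="r2").mean()
    for name, model in candidates.items()
}

# Report from best to worst and remember the winner.
for name, score in sorted(results.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: mean R^2 = {score:.3f}")
best_name = max(results, key=results.get)
```

Which model wins on synthetic data says nothing about which would win on real player statistics; the sketch only shows how cheap the comparison is to set up.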

I also used scikit-learn to create and submit my predictions for the now famous, at least within data science circles, Titanic problem, known on Kaggle as Machine Learning From Disaster because who doesn’t love puns. If you’re wondering, I cracked the top 45% at the time of my submission.

If you want to quickly see the power of scikit-learn, one example on their site provides the entire source code and data source to build your own facial image recognition software!
Picture

Additional Resources

For further learning, I recommend starting out with the quick-start tutorial and moving to the user guide and example gallery for explanations on the specific algorithms that are available.
Ultimately, scikit-learn is a library and the API reference will be the best documentation for getting things done and solving any issues that arise in the analytics pipeline.

  • Quick Start Tutorial: http://scikit-learn.org/stable/tutorial/basic/tutorial.html
  • User Guide: http://scikit-learn.org/stable/user_guide.html
  • API Reference: http://scikit-learn.org/stable/modules/classes.html
  • Example Gallery: http://scikit-learn.org/stable/auto_examples/index.html

What do you think? Will you take scikit-learn out for a test drive in machine learning, or will you make the switch? Do you use more exotic software for your machine learning needs, like Alteryx? Let me know in the comments below!

The SaberSmart Team
​