Integrating Pandas with Scikit-Learn, an Exciting New Workflow

Abstract: For Python data scientists, a typical workflow consists of using Pandas for exploratory data analysis before turning to Scikit-Learn for machine learning. Pandas and Scikit-Learn arose independently, each focused on its own tasks, and were never designed to be integrated. There was never a clearly defined, standardized process for transitioning between the two libraries. This lack of a concrete handoff led practitioners to create a variety of markedly different workflows for making the transition.

One of the main hurdles in the Pandas-to-Scikit-Learn transition was the handling of string columns. Scikit-Learn's machine learning models accept only numeric arrays as input, so the common scenario of taking a Pandas DataFrame with string columns and converting it to an array of only numeric values was quite painful. Another hurdle was processing separate groups of columns with separate functions.

With the recent release of Scikit-Learn version 0.20, many workflows will start looking similar. The brand new ColumnTransformer allows for direct Pandas integration with Scikit-Learn by applying separate transformations to specific subsets of columns. The upgraded OneHotEncoder standardizes the encoding of string columns; before version 0.20, it only encoded columns containing numeric categorical data.
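
As a rough illustration of how these two pieces fit together, the sketch below one-hot encodes a string column and standardizes two numeric columns in a single step. The DataFrame, its column names, and the choice of StandardScaler for the numeric columns are assumptions made for this example, not part of the tutorial materials.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical data: one string column and two numeric columns.
df = pd.DataFrame({
    "neighborhood": ["A", "B", "A", "C"],
    "sq_ft": [1200.0, 1500.0, 900.0, 2000.0],
    "year_built": [1990.0, 2005.0, 1978.0, 2015.0],
})

# Each entry is (name, transformer, columns). Columns are selected by
# their DataFrame labels, so no manual conversion to arrays is needed.
ct = ColumnTransformer(
    transformers=[
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["neighborhood"]),
        ("num", StandardScaler(), ["sq_ft", "year_built"]),
    ],
    sparse_threshold=0,  # always return a dense NumPy array
)

# The result is a purely numeric array: three one-hot columns for the
# three neighborhoods, followed by the two scaled numeric columns.
X = ct.fit_transform(df)
print(X.shape)
```

The fitted transformer can then be passed, together with a model, to a Scikit-Learn Pipeline so that the same preprocessing is applied at training and prediction time.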

In this hands-on tutorial, we will use these new additions to Scikit-Learn to build a modern, robust, and efficient workflow for those starting from a Pandas DataFrame. There will be ample practice problems and detailed notes available so that you can apply the workflow immediately upon completion.

Bio: Ted Petrou is the author of Pandas Cookbook and founder of both Dunder Data and the Houston Data Science Meetup group. He worked as a data scientist at Schlumberger, where he spent the vast majority of his time exploring data. Ted received his Master’s degree in statistics from Rice University and used his analytical skills to play poker professionally and teach math before becoming a data scientist.