Introduction to scikit-learn: Machine Learning in Python


Scikit-learn is a Python machine-learning library used by data science practitioners from many disciplines. We will start this training by learning about scikit-learn's API for supervised machine learning. This API mainly consists of three methods: fit to build models, predict to make predictions from models, and transform to modify data. This consistent and straightforward interface abstracts away the underlying algorithm, enabling us to focus on our particular problems. We will learn why splitting our data into train and test sets matters for model evaluation. Next, we will learn about combining preprocessing techniques with machine learning models using scikit-learn's Pipeline. The Pipeline allows us to connect transformers with a classifier or regressor to build a data flow where the output of one step is the input of the next. Finally, we will look at the pandas output API introduced in version 1.2. After this training, you will have the foundations to apply scikit-learn to your machine-learning problems.

Session Outline:

Module 1: Introduction to Supervised Learning with Scikit-learn
We introduce the workflow for supervised machine learning and how this workflow fits into scikit-learn's API. We will learn the data representation used by scikit-learn and experiment with different machine learning models.
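
As a rough illustration of this workflow (not taken from the workshop materials), the sketch below assumes the built-in iris dataset and a LogisticRegression model; both are arbitrary choices made for this example. It shows the data representation (a 2D feature array X and a 1D target array y), a train/test split, and the fit and predict calls.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# X is a 2D array of shape (n_samples, n_features); y is a 1D array of labels.
# The iris dataset is an assumption made for this example.
X, y = load_iris(return_X_y=True)

# Hold out a quarter of the samples so the model is evaluated on unseen data.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# fit builds the model from the training data.
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# predict returns one label per test sample; score reports mean accuracy.
y_pred = model.predict(X_test)
test_accuracy = model.score(X_test, y_test)
print(f"test accuracy: {test_accuracy:.3f}")
```

Swapping LogisticRegression for another estimator, such as RandomForestClassifier, leaves the fit, predict, and score calls unchanged, which is one way to experiment with different models.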

Module 2: Preprocessing
Scikit-learn's transformer API enables us to preprocess our data or transform its representation. We will learn about the importance of preprocessing and how machine learning models behave with and without preprocessing.
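
One way to see the effect of preprocessing is to train the same model on raw and on scaled features. The sketch below is an assumption-laden example rather than the workshop's own: it uses the built-in wine dataset, StandardScaler, and a k-nearest-neighbors classifier, a distance-based model that tends to be sensitive to feature scale.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

# The wine dataset is an assumption; its features span very different ranges.
X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=0, stratify=y
)

# fit learns each feature's mean and standard deviation from the training set
# only; transform then applies the same scaling to both splits.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Same model, trained without and with preprocessing.
raw_score = KNeighborsClassifier().fit(X_train, y_train).score(X_test, y_test)
scaled_score = (
    KNeighborsClassifier().fit(X_train_scaled, y_train).score(X_test_scaled, y_test)
)
print(f"accuracy without scaling: {raw_score:.3f}")
print(f"accuracy with scaling:    {scaled_score:.3f}")
```

Fitting the scaler on the training split alone keeps information from the test set out of the preprocessing step.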

Module 3: Pipelines
Scikit-learn's Pipeline greatly simplifies how we can express machine learning pipelines. In this section, we will learn how a Pipeline chains multiple preprocessing steps together and passes the transformed data into a machine-learning model.
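
A minimal sketch of such a chain follows. The step names, the choice of SimpleImputer and StandardScaler as preprocessing steps, and the wine dataset are all assumptions made for illustration. The point is that a single fit call runs every transformer in order and hands the result to the final classifier.

```python
from sklearn.datasets import load_wine
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# The wine dataset is an assumption made for this example.
X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=0, stratify=y
)

# Each step is a (name, estimator) pair. Every step except the last must be a
# transformer; the last step here is a classifier.
pipe = Pipeline(
    [
        ("imputer", SimpleImputer(strategy="median")),
        ("scaler", StandardScaler()),
        ("classifier", LogisticRegression(max_iter=1000)),
    ]
)

# fit calls fit_transform on each transformer in order, then fit on the
# classifier. score and predict reuse the fitted transformers on new data.
pipe.fit(X_train, y_train)
pipeline_accuracy = pipe.score(X_test, y_test)
print(f"pipeline test accuracy: {pipeline_accuracy:.3f}")
```

Because the whole chain behaves like one estimator, it can be passed directly to tools such as cross_val_score, and the preprocessing is refit on each training fold.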

Module 4: Pandas Output
With version 1.2, scikit-learn transformers are configurable to output pandas DataFrames! This section will explore the Pandas output API and how this feature is helpful in a machine-learning pipeline.

Background Knowledge:

We recommend a basic understanding of Python for this workshop.


Thomas J. Fan is a Staff Software Engineer at Quansight Labs and a maintainer of scikit-learn, an open-source machine learning library for Python. Previously, Thomas worked at Columbia University to improve interoperability between scikit-learn and AutoML systems. He is also a maintainer of skorch, a neural network library that wraps PyTorch. Thomas has a Master's in Mathematics from NYU and a Master's in Physics from Stony Brook University.

Open Data Science
One Broadway
Cambridge, MA 02142
