Intermediate Machine Learning with scikit-learn: Pandas Interoperability, Categorical Data, Parameter Tuning, and Model Evaluation

Abstract: 

Scikit-learn is a Python machine-learning library used by data science practitioners from many disciplines. In this workshop, we will learn about Pandas interoperability, categorical data, parameter tuning, and model evaluation. For Pandas interoperability, the ColumnTransformer applies different transformations to different columns of a Pandas DataFrame. As of version 1.2, all of scikit-learn's transformers can be configured to output Pandas DataFrames. Next, we will learn about categorical data and how to use scikit-learn's encoders to convert categorical features into numerical features that a machine learning algorithm can consume. We will then explore hyper-parameter tuning in scikit-learn with grid search and random search. Model evaluation is an essential part of the machine learning workflow, so we will cover the metrics provided by scikit-learn, how to use the scoring API, and how to use the plotting API to visualize a model's performance. Finally, we will use all the techniques we learned to train and evaluate a model on a house pricing dataset with Histogram-based Gradient Boosted Trees.

Session Outline:

Module 1: Pandas Interoperability
Pandas DataFrames are frequently used together with scikit-learn to build machine learning models. In this section, we learn how to use the ColumnTransformer to apply transformers to different columns of a Pandas DataFrame. We will also learn how to configure scikit-learn's transformers to output Pandas DataFrames.

Module 2: Categorical data
Many datasets are heterogeneous, consisting of both numerical features and categorical features. In this section, we learn how to use scikit-learn's encoders to convert categorical features into numerical features for a machine-learning model to consume.
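
A minimal sketch of two of scikit-learn's encoders follows. The category values are invented for illustration, and the choice of `OrdinalEncoder` and `OneHotEncoder` is an assumption about which encoders the section covers. Both are configured here to tolerate a category that was not seen during fitting.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Hypothetical categorical feature; "huge" never appears in the training data.
X_train = np.array([["small"], ["medium"], ["large"], ["medium"]], dtype=object)
X_new = np.array([["large"], ["huge"]], dtype=object)

# OrdinalEncoder maps each category to an integer code. Passing the
# categories explicitly fixes their order; unseen values are encoded as -1.
ordinal = OrdinalEncoder(
    categories=[["small", "medium", "large"]],
    handle_unknown="use_encoded_value",
    unknown_value=-1,
)
ordinal.fit(X_train)
ordinal_codes = ordinal.transform(X_new)

# OneHotEncoder creates one indicator column per category (sorted
# alphabetically by default); unseen values become an all-zero row.
onehot = OneHotEncoder(handle_unknown="ignore")
onehot.fit(X_train)
onehot_rows = onehot.transform(X_new).toarray()

print(ordinal_codes)
print(onehot_rows)
```

Ordinal codes are compact and often suit tree-based models, while one-hot columns avoid implying an order between categories, which tends to matter for linear models.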

Module 3: Parameter Tuning
Scikit-learn's machine learning models expose many hyper-parameters. This section explores the available tuning algorithms, such as grid search and random search, for selecting hyper-parameters for your data and machine learning pipelines.
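
The sketch below shows one way grid search and random search might be applied to a pipeline. The synthetic dataset, the logistic-regression model, and the search ranges are assumptions chosen to keep the example small; they are not taken from the workshop.

```python
from scipy.stats import loguniform
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data standing in for a real dataset.
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

pipe = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Grid search: evaluate every listed value with 5-fold cross-validation.
# "clf__C" addresses the C parameter of the pipeline step named "clf".
grid = GridSearchCV(pipe, {"clf__C": [0.01, 0.1, 1.0, 10.0]}, cv=5)
grid.fit(X, y)

# Random search: draw a fixed number of candidates from a distribution.
rand = RandomizedSearchCV(
    pipe,
    {"clf__C": loguniform(1e-3, 1e2)},
    n_iter=8,
    cv=5,
    random_state=0,
)
rand.fit(X, y)

print(grid.best_params_, grid.best_score_)
print(rand.best_params_, rand.best_score_)
```

Grid search cost grows multiplicatively with each added parameter, whereas random search keeps the budget fixed at `n_iter` candidates regardless of how many parameters are searched.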

Module 4: Model Evaluation
Model evaluation is an essential part of the machine learning workflow. In this section, we will cover a selection of metrics provided by scikit-learn and how to use the scoring and plotting APIs for model evaluation.
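
As a small, hedged example of the metrics and the scoring API, the sketch below computes a metric directly and then passes several scorers to cross-validation. The dataset, model, and choice of metrics are illustrative assumptions rather than the workshop's own.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, fbeta_score, make_scorer
from sklearn.model_selection import cross_validate, train_test_split

# Synthetic binary classification data standing in for a real dataset.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Metric functions compare true labels with predictions.
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
test_accuracy = accuracy_score(y_test, model.predict(X_test))

# The scoring API accepts metric names as strings, or custom scorers
# built with make_scorer (here an F-beta score with beta=2).
scoring = {
    "accuracy": "accuracy",
    "roc_auc": "roc_auc",
    "f2": make_scorer(fbeta_score, beta=2),
}
cv_results = cross_validate(
    LogisticRegression(max_iter=1000), X, y, cv=5, scoring=scoring
)

print("held-out accuracy:", test_accuracy)
for name in scoring:
    print(name, cv_results[f"test_{name}"].mean())
```

The plotting API follows a similar pattern: for example, `ConfusionMatrixDisplay.from_estimator(model, X_test, y_test)` draws a confusion matrix when matplotlib is installed.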

Background Knowledge:

We recommend a basic understanding of Python and scikit-learn for this workshop.

Bio: 

Thomas J. Fan is a Staff Software Engineer at Quansight Labs and a maintainer of scikit-learn, an open-source machine learning library for Python. Previously, Thomas worked at Columbia University to improve interoperability between scikit-learn and AutoML systems. He is also a maintainer of skorch, a neural network library that wraps PyTorch. Thomas has a Master's in Mathematics from NYU and a Master's in Physics from Stony Brook University.

Open Data Science
One Broadway
Cambridge, MA 02142
info@odsc.com
