
Abstract: Scikit-learn is a Python machine learning library used by data science practitioners from many disciplines. In this training, we will learn about cross-validation, tuning machine learning algorithms, and pandas interoperability. Cross-validation enables us to evaluate our machine learning models by splitting our data into multiple training and testing datasets. We will learn to handle missing values with imputation using univariate and multivariate techniques. Next, we will explore tuning algorithms in scikit-learn with grid search and random search. We will learn about categorical features and how to use scikit-learn's encoders to convert them into numerical features that a machine learning algorithm can consume. Finally, we will apply these machine learning techniques to a house pricing dataset with scikit-learn's Histogram-based Gradient Boosted Trees. scikit-learn's boosted tree implementation is inspired by LightGBM and has similar performance characteristics.
Session Outline
Module 0: Quick Review of scikit-learn
We start by reviewing scikit-learn's API for fitting and evaluating models. We revisit splitting data into training and test sets, preprocessing data, and pipelines.
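As a rough illustration of the workflow this module reviews, the sketch below splits a dataset, chains preprocessing and a model in a pipeline, then fits and scores it. The synthetic dataset, the choice of StandardScaler, and the choice of LogisticRegression are assumptions made for the example, not the material used in the session.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption; the session uses its own datasets).
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# Hold out a test set; stratify=y keeps the class proportions in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0
)

# A pipeline fits the scaler on the training data only, then the classifier.
pipe = make_pipeline(StandardScaler(), LogisticRegression())
pipe.fit(X_train, y_train)

# score() reports mean accuracy for classifiers.
test_accuracy = pipe.score(X_test, y_test)
print(f"Test accuracy: {test_accuracy:.2f}")
```

Keeping the scaler inside the pipeline means the test data never influences the scaling statistics.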
Module 1: Cross-Validation in scikit-learn
Cross-validation enables us to evaluate our machine learning models by splitting our data into multiple training and testing datasets. We cover cross-validation schemes such as K-Fold cross-validation and the importance of stratifying your data.
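A minimal sketch of stratified K-Fold cross-validation follows. The imbalanced synthetic dataset and the LogisticRegression model are assumptions chosen to show why stratification matters; they are not taken from the session.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Imbalanced synthetic data (assumed): roughly 90% class 0 and 10% class 1.
X, y = make_classification(
    n_samples=200, n_features=5, weights=[0.9, 0.1], random_state=0
)

# StratifiedKFold keeps the class ratio roughly equal in every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# One score per fold; each fold serves as the test set exactly once.
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv)
print(f"Fold accuracies: {scores}")
print(f"Mean accuracy: {scores.mean():.2f}")

# Count minority-class samples in each test fold to inspect the stratification.
minority_per_fold = [int((y[test_idx] == 1).sum()) for _, test_idx in cv.split(X, y)]
print(f"Minority samples per test fold: {minority_per_fold}")
```

Without stratification, a fold of an imbalanced dataset can end up with very few or no minority-class samples, which makes that fold's score misleading.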
Module 2: Parameter tuning
Next, we learn about tuning algorithms in scikit-learn with grid search, random search, and successive halving. These hyper-parameter search techniques help find configurations that are well suited to your data. We explore how to specify hyper-parameter spaces when working with scikit-learn's Pipelines.
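The sketch below shows one way to define a hyper-parameter space over a Pipeline, using the `<step name>__<parameter>` naming convention. The step names `scaler` and `clf`, the candidate values for `C`, and the synthetic data are assumptions for illustration. Successive halving is left out of the sketch to keep it short.

```python
from scipy.stats import loguniform
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption).
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# Named steps let us address parameters as "<step name>__<parameter>".
pipe = Pipeline([("scaler", StandardScaler()), ("clf", LogisticRegression())])

# Grid search tries every listed value of the classifier's C parameter.
grid = GridSearchCV(pipe, param_grid={"clf__C": [0.1, 1.0, 10.0]}, cv=3)
grid.fit(X, y)
print("Grid search best params:", grid.best_params_)

# Random search samples C from a distribution for a fixed number of iterations.
rand = RandomizedSearchCV(
    pipe,
    param_distributions={"clf__C": loguniform(1e-2, 1e2)},
    n_iter=5,
    cv=3,
    random_state=0,
)
rand.fit(X, y)
print("Random search best params:", rand.best_params_)
```

Random search is often preferred when the space has many dimensions, since its cost is set by `n_iter` rather than by the size of the grid.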
Module 3: Missing values in scikit-learn
We learn to handle missing values with imputation using univariate techniques and multivariate techniques such as k-Nearest Neighbors. We cover missing value indicators and use parameter tuning to find the best imputer for your data.
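To make the distinction concrete, here is a small sketch comparing a univariate imputer, a k-Nearest Neighbors imputer, and a missing value indicator. The tiny array is invented for the example.

```python
import numpy as np
from sklearn.impute import KNNImputer, SimpleImputer

# Toy data (assumed): columns 0 and 1 each have one missing value.
X = np.array(
    [
        [1.0, 2.0, 1.0],
        [np.nan, 3.0, 2.0],
        [7.0, np.nan, 3.0],
        [4.0, 6.0, 4.0],
    ]
)

# Univariate: each missing value is replaced by its own column's mean.
mean_imputed = SimpleImputer(strategy="mean").fit_transform(X)

# Multivariate: each missing value is the average of that feature over the
# nearest rows, with distances computed on the features both rows have.
knn_imputed = KNNImputer(n_neighbors=2).fit_transform(X)

# add_indicator appends one binary column per feature that had missing values.
with_indicator = SimpleImputer(strategy="mean", add_indicator=True).fit_transform(X)

print(mean_imputed)
print(knn_imputed)
print(with_indicator.shape)
```

Because imputers are ordinary transformers, they can be placed in a Pipeline and compared with the same search tools used for other hyper-parameters.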
Module 4: Pandas Interoperability
The ColumnTransformer enables us to specify which transformer to apply to which columns of a pandas DataFrame. Specifically, we use numerical transformations on numerical columns and encoders on categorical columns.
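A minimal sketch of this idea follows. The DataFrame, its column names (`sqft`, `rooms`, `city`), and the choice of StandardScaler and OneHotEncoder are hypothetical and only serve to illustrate column selection by name.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical housing-style data with two numerical and one categorical column.
df = pd.DataFrame(
    {
        "sqft": [850.0, 1200.0, 640.0, 1500.0],
        "rooms": [2, 3, 1, 4],
        "city": ["Austin", "Boston", "Austin", "Chicago"],
    }
)

# Each entry is (name, transformer, columns); columns are selected by name.
preprocessor = ColumnTransformer(
    [
        ("num", StandardScaler(), ["sqft", "rooms"]),
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["city"]),
    ]
)

# Two scaled numerical columns plus one column per distinct city.
transformed = preprocessor.fit_transform(df)
print(transformed.shape)
```

Setting `handle_unknown="ignore"` lets the encoder handle categories at prediction time that were not seen during fitting, by encoding them as all zeros.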
Background Knowledge
Python
Bio: Thomas J. Fan is a Senior Software Engineer at Quansight Labs, working to sustain and evolve the PyData open-source ecosystem. He is a maintainer of scikit-learn, an open-source machine learning library written for Python. Previously, he worked at Columbia University, improving the interoperability between scikit-learn and AutoML systems. Thomas holds a Master's in Physics from Stony Brook University and a Master's in Mathematics from New York University.

Thomas Fan
Title: Senior Software Engineer | Quansight Labs
