
Abstract: Scikit-learn is a Python machine learning library used by data science practitioners from many disciplines. During this training, we will learn about processing text data, working with imbalanced data, and Poisson regression. We will start by processing text data with scikit-learn's vectorizers. Since the output of these vectorizers is sparse, we will also review scikit-learn estimators that can handle sparse data. Next, we will explore how to work with imbalanced data, where one of the classes appears more frequently than the others. We will look at estimators with class weights, resampling techniques provided by imbalanced-learn, and using a bagging classifier with balancing. Finally, we will learn about generalized linear models, focusing on Poisson regression. Poisson regression models targets that are counts or relative frequencies. We will also use tree-based models such as Histogram-based Gradient Boosted Trees with a Poisson loss to model relative frequencies.
Session Outline
Module 1: Text Data
The CountVectorizer converts a collection of text documents into a matrix of token counts. The TfidfVectorizer reweights these counts into floating-point values using term frequency and inverse document frequency (tf-idf). Because both vectorizers output sparse matrices, we explore scikit-learn estimators that can handle sparse input data.
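As a rough illustration of the two vectorizers described above, the following sketch fits both on a tiny made-up corpus and passes the sparse tf-idf output straight to an estimator. The documents, labels, and the choice of LogisticRegression are assumptions for illustration only, not material from the training itself.

```python
from scipy import sparse
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Tiny made-up corpus and labels (hypothetical, for illustration only).
docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "stocks fell as markets closed",
    "markets rallied and stocks rose",
]
labels = [0, 0, 1, 1]  # 0 = animals, 1 = finance

# CountVectorizer: sparse matrix of integer token counts.
count_vec = CountVectorizer()
X_counts = count_vec.fit_transform(docs)

# TfidfVectorizer: same vocabulary, but floating-point tf-idf weights.
tfidf_vec = TfidfVectorizer()
X_tfidf = tfidf_vec.fit_transform(docs)

print(type(X_counts).__name__, X_counts.shape, X_counts.dtype)
print(type(X_tfidf).__name__, X_tfidf.shape, X_tfidf.dtype)

# LogisticRegression accepts sparse input directly, so no densifying is needed.
clf = LogisticRegression().fit(X_tfidf, labels)
pred = clf.predict(tfidf_vec.transform(["the cat chased the dog"]))
print(pred)
```

New documents must go through `transform` on the already fitted vectorizer so that their columns line up with the vocabulary learned during `fit_transform`.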
Module 2: Imbalanced data
Next, we learn how to work with imbalanced data. A dataset is imbalanced when one of the classes appears more frequently than the others. Specifically, we look at estimators with class weights, resampling techniques provided by imbalanced-learn, and using a bagging classifier with balancing. Imbalanced-learn defines a simple API for over-sampling, under-sampling, and sample generation with SMOTE.
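The class-weight idea can be sketched with scikit-learn alone. The example below uses a synthetic dataset from `make_classification` (an assumption, not the training's data), shows the weights that `class_weight="balanced"` implies, and then performs a hand-rolled random over-sampling with `sklearn.utils.resample`. That last step is only a stand-in for the resamplers in imbalanced-learn, which are not used here.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.utils import resample
from sklearn.utils.class_weight import compute_class_weight

# Synthetic binary problem with roughly 95% / 5% class proportions (illustrative).
X, y = make_classification(
    n_samples=2000, n_features=10, weights=[0.95, 0.05], random_state=0
)

# "balanced" gives class c the weight n_samples / (n_classes * count of c),
# so the rare class receives the larger weight.
classes = np.unique(y)
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y)
print(dict(zip(classes.tolist(), weights.tolist())))

# Estimators with a class_weight parameter apply these weights in their loss.
clf = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X, y)

# Stand-in for a resampler: randomly over-sample the minority class
# (with replacement) until both classes have the same number of rows.
X_min, y_min = X[y == 1], y[y == 1]
n_majority = int(np.sum(y == 0))
X_min_up, y_min_up = resample(
    X_min, y_min, replace=True, n_samples=n_majority, random_state=0
)
X_bal = np.vstack([X[y == 0], X_min_up])
y_bal = np.concatenate([y[y == 0], y_min_up])
print(np.bincount(y_bal))
```

Reweighting leaves the data untouched and changes only the loss, while resampling changes the training set itself; in either case, evaluation should be done on data that keeps the original class proportions.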
Module 3: Poisson regression
We examine generalized linear models focusing on Poisson regression. Poisson regression can model target distributions that are counts or frequencies. We apply Poisson regression to a case study using an insurance claims dataset. Furthermore, we show that Poisson regression can better model the target distributions compared to other estimators. Lastly, we explore tree-based models such as Histogram-based Gradient Boosted Trees with a Poisson loss to model frequencies.
Background Knowledge
Python and intermediate understanding of scikit-learn's API
Bio: Thomas J. Fan is a Senior Software Engineer at Quansight Labs, working to sustain and evolve the PyData open-source ecosystem. He is a maintainer of scikit-learn, an open-source machine learning library written for Python. Previously, he worked at Columbia University, improving the interoperability between scikit-learn and AutoML systems. Thomas holds a Master's in Physics from Stony Brook University and a Master's in Mathematics from New York University.

Thomas Fan
Title: Senior Software Engineer | Quansight Labs
