Advanced Machine Learning with Scikit-learn: Text Data, Imbalanced Data, and Poisson Regression


Scikit-learn is a Python machine learning library used by many data science practitioners. In this training, we will learn about processing text data, working with imbalanced data, and Poisson regression. We will start with text processing using scikit-learn's CountVectorizer and TfidfVectorizer. The CountVectorizer converts a collection of text documents into a matrix of token counts, and we will explore the hyperparameters it provides for controlling how those counts are built. The TfidfVectorizer reweights the counts into floating-point values using term frequency and inverse document frequency. Because both vectorizers output sparse matrices, we will also review the scikit-learn estimators that can handle sparse input data. Next, we will learn how to work with imbalanced data, that is, datasets in which one class appears much more frequently than the others. Specifically, we will look at estimators with class weights, resampling techniques provided by imbalanced-learn, and a bagging classifier with balancing. Imbalanced-learn defines a simple API for over-sampling, under-sampling, and sample generation with SMOTE. We will then turn to generalized linear models with a focus on Poisson regression, which is used to model targets that are counts or relative frequencies. We will perform Poisson regression in a case study using an insurance claims dataset and show that it can model the target distribution better than other estimators. Lastly, we will learn how to use tree-based models such as histogram-based gradient boosted trees with a Poisson loss to model relative frequencies.
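As a rough preview of the topics above, the following minimal sketch vectorizes a tiny made-up corpus, fits a class-weighted classifier directly on the sparse output, and fits a Poisson regressor on synthetic counts. The corpus, labels, synthetic data, and parameter values are all assumptions chosen for illustration. They are not the training's actual materials or the insurance claims dataset.

```python
import numpy as np
from scipy import sparse
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression, PoissonRegressor

# Hypothetical toy corpus: class 1 is rare (1 of 5 documents).
docs = [
    "the claim was approved quickly",
    "the claim was approved",
    "claim approved after review",
    "routine claim approved",
    "suspicious claim flagged for fraud",
]
labels = np.array([0, 0, 0, 0, 1])

# Token counts; ngram_range is one of the hyperparameters that changes
# which tokens are counted (here unigrams and bigrams).
count_vec = CountVectorizer(ngram_range=(1, 2))
X_counts = count_vec.fit_transform(docs)  # sparse matrix of integer counts

# TF-IDF reweights counts into floating-point values.
tfidf_vec = TfidfVectorizer()
X_tfidf = tfidf_vec.fit_transform(docs)  # sparse matrix of floats

# LogisticRegression accepts sparse input; class_weight="balanced"
# weights each class inversely to its frequency.
clf = LogisticRegression(class_weight="balanced", max_iter=1000)
clf.fit(X_tfidf, labels)

# Poisson regression on synthetic count targets (assumed data).
rng = np.random.default_rng(0)
X = rng.uniform(size=(200, 2))
y = rng.poisson(lam=np.exp(0.5 + X[:, 0]))
glm = PoissonRegressor(alpha=1e-3)
glm.fit(X, y)
preds = glm.predict(X)  # predicted expected counts, always positive

print("count matrix:", X_counts.shape, X_counts.dtype)
print("tf-idf matrix:", X_tfidf.shape, X_tfidf.dtype)
print("vocabulary size (unigrams + bigrams):", len(count_vec.vocabulary_))
print("mean predicted count:", preds.mean())
```

Resampling with imbalanced-learn and gradient boosting with a Poisson loss are left out of this sketch to keep it dependent on scikit-learn, NumPy, and SciPy only.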


Thomas J. Fan is a Staff Associate at the Data Science Institute at Columbia University. He is one of the core developers of scikit-learn, an open-source machine learning library written in Python. Thomas holds a Master's in Mathematics from NYU and a Master's in Physics from Stony Brook University. He also maintains skorch, a scikit-learn compatible neural network library that wraps PyTorch. He believes that developing open-source software is one of the best ways to maximize one's impact.