Advanced Machine Learning with Scikit-learn: Text Data, Imbalanced Data, and Poisson Regression

Abstract: 

Scikit-learn is a Python machine learning library used by data science practitioners from many disciplines. In this training, we will learn about processing text data, working with imbalanced data, and Poisson regression. We will start by processing text data with scikit-learn's vectorizers. Since the output of these vectorizers is sparse, we will also review scikit-learn estimators that can handle sparse data. Next, we will explore how to work with imbalanced data, where one class appears much more frequently than the others. We will look at estimators with class weights, resampling techniques provided by imbalanced-learn, and a bagging classifier with balancing. Finally, we will learn about generalized linear models, focusing on Poisson regression. Poisson regression models targets that are counts or relative frequencies. We will also use tree-based models such as Histogram-based Gradient Boosted Trees with a Poisson loss to model relative frequencies.

Session Outline
Module 1: Text Data
The CountVectorizer converts a collection of text documents into a matrix of token counts. The TfidfVectorizer reweights these counts into floating-point values using term frequency and inverse document frequency. Since both vectorizers output sparse matrices, we explore scikit-learn estimators that can handle sparse input data.
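
As a rough illustration of the workflow described above, the following sketch vectorizes a tiny corpus and fits an estimator directly on the sparse output. The documents, labels, and the choice of LogisticRegression are assumptions made for this example only; they are not taken from the training materials.

```python
from scipy.sparse import issparse
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Tiny made-up corpus and labels, purely for illustration.
docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "stocks fell on monday",
    "markets and stocks rallied",
]
labels = [0, 0, 1, 1]  # hypothetical classes: 0 = animals, 1 = finance

# Raw token counts: one row per document, one column per vocabulary term.
count_vec = CountVectorizer()
X_counts = count_vec.fit_transform(docs)

# TF-IDF reweights the counts into floating-point values.
tfidf_vec = TfidfVectorizer()
X_tfidf = tfidf_vec.fit_transform(docs)

# Both vectorizers return SciPy sparse matrices.
print("sparse outputs:", issparse(X_counts), issparse(X_tfidf))
print("shape:", X_counts.shape, "tf-idf dtype:", X_tfidf.dtype)

# LogisticRegression accepts sparse input without densifying it.
clf = LogisticRegression().fit(X_tfidf, labels)
pred = clf.predict(tfidf_vec.transform(["the dog sat on the mat"]))
print("predicted class:", pred[0])
```

New documents must be passed through the already fitted vectorizer with `transform`, so that their columns line up with the vocabulary learned during `fit_transform`.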

Module 2: Imbalanced Data
Next, we learn how to work with imbalanced data, which arises when one class appears much more frequently than the others. Specifically, we look at estimators with class weights, resampling techniques provided by imbalanced-learn, and a bagging classifier with balancing. Imbalanced-learn defines a simple API for over-sampling, under-sampling, and sample generation with SMOTE.
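
Two of the ideas above, class weights and over-sampling, can be sketched with scikit-learn alone. The synthetic dataset and all variable names below are assumptions for illustration. The over-sampling step uses `sklearn.utils.resample` as a plain stand-in for the samplers that imbalanced-learn provides; it does not call imbalanced-learn and it is not SMOTE.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split
from sklearn.utils import resample
from sklearn.utils.class_weight import compute_class_weight

# Synthetic binary problem with roughly 5% positives (an assumption).
X, y = make_classification(
    n_samples=2000, n_features=10, weights=[0.95, 0.05], random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0
)

# "balanced" weights are n_samples / (n_classes * class_count).
balanced_weights = compute_class_weight(
    "balanced", classes=np.array([0, 1]), y=y_train
)
print("balanced class weights:", balanced_weights)

# Same estimator, with and without class weights.
plain = LogisticRegression(max_iter=1000).fit(X_train, y_train)
weighted = LogisticRegression(max_iter=1000, class_weight="balanced").fit(
    X_train, y_train
)
recall_plain = recall_score(y_test, plain.predict(X_test))
recall_weighted = recall_score(y_test, weighted.predict(X_test))
print("minority recall, plain vs. weighted:", recall_plain, recall_weighted)

# Random over-sampling by hand: draw minority rows with replacement
# until both classes have the same number of training samples.
majority_n = int((y_train == 0).sum())
X_min_up = resample(
    X_train[y_train == 1], replace=True, n_samples=majority_n, random_state=0
)
X_bal = np.vstack([X_train[y_train == 0], X_min_up])
y_bal = np.concatenate(
    [np.zeros(majority_n, dtype=int), np.ones(majority_n, dtype=int)]
)
print("class counts after over-sampling:", np.bincount(y_bal))
```

Resampling is applied to the training split only; the test split keeps its original class ratio so that the reported recall reflects the real distribution.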

Module 3: Poisson Regression
We examine generalized linear models, focusing on Poisson regression. Poisson regression models targets that are counts or frequencies. We apply Poisson regression to a case study using an insurance claims dataset. Furthermore, we show that Poisson regression can model such targets better than other estimators. Lastly, we explore tree-based models such as Histogram-based Gradient Boosted Trees with a Poisson loss to model frequencies.

Background Knowledge
Python and an intermediate understanding of scikit-learn's API

Bio: 

Thomas J. Fan is a Senior Software Engineer at Quansight Labs, working to sustain and evolve the PyData open-source ecosystem. He is a maintainer of scikit-learn, an open-source machine learning library written for Python. Previously, he worked at Columbia University, improving the interoperability between scikit-learn and AutoML systems. Thomas holds a Master's in Physics from Stony Brook University and a Master's in Mathematics from New York University.

Open Data Science
One Broadway
Cambridge, MA 02142
info@odsc.com
