Abstract: Scikit-learn is a Python machine learning library used by data science practitioners from many disciplines. We start this training by learning about scikit-learn's API for supervised machine learning. scikit-learn's API mainly consists of three methods: fit to build models, predict to make predictions from models, and transform to modify data. This consistent and straightforward interface helps to abstract away the algorithm, thus allowing us to focus on our domain-specific problems. First, we learn the importance of splitting your data into train and test sets for model evaluation. Then, we explore preprocessing techniques for numerical, categorical, and missing data. We see how different machine learning models are impacted by preprocessing. For example, linear and distance-based models require standardization, but tree-based models do not. We explore how to use the Pandas output API, which allows scikit-learn's transformers to output Pandas DataFrames! The Pandas output API enables us to connect the feature names with the state of a machine learning model. Next, we learn about the Pipeline, which connects transformers with a classifier or regressor to build a data flow where the output of one step is the input of the next. Lastly, we look at scikit-learn's Histogram-based Gradient Boosting model, which can natively handle numerical and categorical data with missing values. After this training, you will have the foundations to apply scikit-learn to your machine learning problems.
Module 1: Supervised learning with scikit-learn
We explore scikit-learn's API for supervised machine learning: fit to build models, predict to make predictions, and transform to modify data. We also see how to split data into training and test sets with scikit-learn.
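The fit/predict workflow and the train/test split described above can be sketched as follows; the iris toy dataset and logistic regression are stand-ins chosen for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# Hold out a test set so the model is evaluated on unseen data.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)          # fit: build the model
predictions = clf.predict(X_test)  # predict: make predictions
score = clf.score(X_test, y_test)  # mean accuracy on the test set
```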
Module 2: Preprocessing
We learn about preprocessing numerical data and its significance for particular machine learning models. For example, preprocessing is essential for models that compute distances, such as nearest neighbors. Scikit-learn also provides imputers to handle data with missing values.
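A short sketch of the two preprocessing steps mentioned above, on a small made-up array with one missing value: SimpleImputer fills in the gap, then StandardScaler standardizes each feature.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

# A made-up dataset with a missing value in the second column.
X = np.array([[1.0, 2.0], [3.0, np.nan], [5.0, 6.0]])

# Replace missing values with the column mean.
imputer = SimpleImputer(strategy="mean")
X_imputed = imputer.fit_transform(X)

# Scale each feature to zero mean and unit variance, which
# distance-based models such as nearest neighbors rely on.
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_imputed)
```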
Module 3: Pipelines
We explore combining preprocessing techniques with a machine learning model using scikit-learn's Pipeline. The Pipeline enables us to connect transformers with a classifier or regressor to build a data flow where the output of one step is the input of the next.
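The data flow above can be sketched with make_pipeline; the scaler/nearest-neighbors pairing and the iris dataset are illustrative choices.

```python
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# Chain a transformer with a classifier into a single estimator.
pipe = make_pipeline(StandardScaler(), KNeighborsClassifier())

# fit runs each transformer's fit_transform, passing its output
# to the next step, then fits the final estimator.
pipe.fit(X, y)
preds = pipe.predict(X)  # predict applies the same steps in order
```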
Module 4: Categorical data
Many datasets are heterogeneous, consisting of both numerical and categorical features. In this module, we learn how to use scikit-learn's encoders to convert categorical features into numerical features for a machine learning model to consume. Lastly, we look at scikit-learn's Histogram-based Gradient Boosting model, which can natively handle heterogeneous data.
Attendees will learn the foundations of scikit-learn. We will use the following open source tools: scikit-learn, pandas, numpy, seaborn, matplotlib, and jupyter.
Basic Python skills are required.
Bio: Thomas J. Fan is a Senior Machine Learning Engineer at Union.ai and a maintainer for scikit-learn, an open-source machine learning library for Python. He led the development of scikit-learn's set_output API, which allows transformers to return pandas DataFrames. Previously, Thomas worked at Columbia University to improve interoperability between scikit-learn and AutoML systems. He also maintains skorch, a neural network library that wraps PyTorch. Thomas has a Master's in Mathematics from NYU and a Master's in Physics from Stony Brook University.