Abstract: Scikit-learn is a Python machine learning library used by data science practitioners from many disciplines. We start this training by learning about scikit-learn's API for supervised machine learning. The API mainly consists of three methods: fit to build models, predict to make predictions from models, and transform to modify data. This consistent and straightforward interface abstracts away the underlying algorithm, letting us focus on our particular problems. We then learn why splitting data into training and test sets matters for model evaluation. Next, we learn to combine preprocessing techniques with machine learning models using scikit-learn's Pipeline. The Pipeline connects transformers with a classifier or regressor to build a data flow, where the output of one step is the input of the next. After this training, you will have the foundations to apply scikit-learn to your own machine learning problems.
Module 1: Introduction to machine learning and loading data
We review machine learning use cases focusing on the supervised learning problem. We learn about loading data in Python and the semantics of input data for scikit-learn.
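As a minimal sketch of the input-data semantics, the snippet below loads one of scikit-learn's bundled datasets. The choice of the iris dataset is an assumption made for illustration; the training may use different data. The point is the shape convention: the feature matrix X is two-dimensional with one row per sample, and the target y is one-dimensional with one value per sample.

```python
from sklearn.datasets import load_iris

# Load a small built-in dataset as NumPy arrays (illustrative choice).
X, y = load_iris(return_X_y=True)

# X is 2-D: one row per sample, one column per feature.
print(X.shape)  # (150, 4)

# y is 1-D: one target label per sample.
print(y.shape)  # (150,)
```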
Module 2: Supervised learning with scikit-learn
Next, we explore scikit-learn's API for training models, making predictions, and transforming data. We also see how to split data into a training and test set with scikit-learn.
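The fit and predict workflow, together with a train/test split, can be sketched as follows. The dataset (iris) and the estimator (KNeighborsClassifier) are assumptions picked for illustration; any scikit-learn classifier follows the same pattern.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Hold out part of the data so the model is evaluated on samples
# it did not see during training. stratify=y keeps class proportions.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=0, stratify=y
)

# fit builds the model from the training data only.
clf = KNeighborsClassifier(n_neighbors=5)
clf.fit(X_train, y_train)

# predict returns one predicted label per test sample.
y_pred = clf.predict(X_test)

# score reports mean accuracy on the held-out test set.
accuracy = clf.score(X_test, y_test)
print(f"Test accuracy: {accuracy:.3f}")
```

Evaluating on the held-out test set, rather than on the training data, gives a more honest estimate of how the model behaves on new samples.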
Module 3: Preprocessing
We learn about preprocessing numerical data and why it matters for particular machine learning models. For example, preprocessing is essential for models that rely on distances between samples, such as nearest neighbor models, because features on larger scales would otherwise dominate the distance.
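A small sketch of scaling numerical features before a nearest neighbor model is shown below. The wine dataset and StandardScaler are assumptions chosen for illustration, since the wine features span very different ranges. The scaler is fit on the training data only and then applied to both splits with transform.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=0, stratify=y
)

# Learn per-feature mean and standard deviation from the training set only.
scaler = StandardScaler()
scaler.fit(X_train)

# transform rescales each feature to zero mean and unit variance
# (exactly so on the training set, approximately on the test set).
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Compare a nearest neighbor model on raw versus scaled features.
raw_score = KNeighborsClassifier().fit(X_train, y_train).score(X_test, y_test)
scaled_score = (
    KNeighborsClassifier()
    .fit(X_train_scaled, y_train)
    .score(X_test_scaled, y_test)
)
print(f"raw: {raw_score:.3f}, scaled: {scaled_score:.3f}")
```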
Module 4: Pipelines
Lastly, we combine the preprocessing and ML models from the previous modules into pipelines using scikit-learn's Pipeline object. The Pipeline connects transformers with a classifier or regressor to build a data flow, where the output of one step is the input of the next.
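One way to sketch such a pipeline is shown below. The step names ("scaler", "knn"), the wine dataset, and the specific transformer and classifier are illustrative assumptions. Calling fit on the pipeline fits the scaler on the training data and passes the scaled output to the classifier; predict and score apply the same transformation before the final step.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=0, stratify=y
)

# Each step is a (name, estimator) pair; every step but the last
# must be a transformer, and the last is the classifier or regressor.
pipe = Pipeline(
    [
        ("scaler", StandardScaler()),
        ("knn", KNeighborsClassifier()),
    ]
)

# A single fit call trains the scaler and then the classifier in order.
pipe.fit(X_train, y_train)

# predict scales the test data with the fitted scaler before classifying.
y_pred = pipe.predict(X_test)
print(f"Test accuracy: {pipe.score(X_test, y_test):.3f}")
```

Because the scaler lives inside the pipeline, it is only ever fit on the data passed to fit, which helps avoid leaking test-set statistics into preprocessing.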
Bio: Thomas J. Fan is a Senior Software Engineer at Quansight Labs, working to sustain and evolve the PyData open-source ecosystem. He is a maintainer of scikit-learn, an open-source machine learning library written for Python. Previously, he worked at Columbia University, improving the interoperability between scikit-learn and AutoML systems. Thomas holds a Master's in Physics from Stony Brook University and a Master's in Mathematics from New York University.