Introduction to Scikit-learn: Machine Learning in Python


Scikit-learn is a Python machine learning library used by data science practitioners from many disciplines. We will start this training by learning about scikit-learn's API for supervised machine learning. scikit-learn's API mainly consists of three methods: fit to build models, predict to make predictions from models, and transform to modify data. This consistent and straightforward interface abstracts away the underlying algorithm, thus enabling us to focus on our particular problems. We will learn about the importance of splitting data into train and test sets for model evaluation. Next, we will learn about combining preprocessing techniques with machine learning models using scikit-learn's Pipeline. The Pipeline allows us to connect transformers with a classifier or regressor to build a data flow, where the output of one step is the input of another. After this training, you will have the foundations to apply scikit-learn to your machine learning problems.

Session Outline
Module 1: Introduction to machine learning and loading data
We review machine learning use cases focusing on the supervised learning problem. We learn about loading data in Python and the semantics of input data for scikit-learn.
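
As a rough illustration of the input semantics mentioned above, the sketch below loads a small dataset bundled with scikit-learn. The choice of the iris dataset is an assumption made for this example and is not taken from the session material. scikit-learn expects the features as a 2D array with one row per sample and the target as a 1D array with one value per sample.

```python
from sklearn.datasets import load_iris

# load_iris ships with scikit-learn; it is used here only as an example dataset.
# return_X_y=True returns the features and the target as NumPy arrays.
X, y = load_iris(return_X_y=True)

# X is 2D: (n_samples, n_features). y is 1D: (n_samples,).
print(X.shape)  # (150, 4)
print(y.shape)  # (150,)
```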

Module 2: Supervised learning with scikit-learn
Next, we explore scikit-learn's API for training models, making predictions, and transforming data. We also see how to split data into a training and test set with scikit-learn.
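
A minimal sketch of the workflow described above, assuming the iris dataset and a logistic regression classifier (both chosen for illustration, not taken from the session material). The data is split first, the model is fit on the training portion only, and the held-out test portion is used for predictions and evaluation.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# Hold out part of the data for evaluation (25% by default).
# stratify=y keeps the class proportions similar in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=0, stratify=y
)

# fit builds the model from the training data only.
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# predict returns one predicted label per test sample.
y_pred = model.predict(X_test)

# score reports mean accuracy on the held-out test set for classifiers.
accuracy = model.score(X_test, y_test)
print(accuracy)
```

Any other scikit-learn classifier can be swapped in for LogisticRegression without changing the rest of the code, since estimators share the same fit and predict methods.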

Module 3: Preprocessing
We learn about preprocessing numerical data and its significance for particular machine learning models. For example, preprocessing is essential for ML models that rely on distances between samples, such as nearest neighbor models.
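
The sketch below illustrates why scaling can matter for a distance-based model. It assumes the wine dataset bundled with scikit-learn, StandardScaler as the preprocessing step, and a k-nearest neighbors classifier; these choices are illustrative and not taken from the session material. The scaler is fit on the training data only, and the same learned transformation is then applied to both splits.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

# The wine features have very different ranges, which affects distances.
X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=0, stratify=y
)

# fit learns the per-feature mean and standard deviation from the training set.
scaler = StandardScaler()
scaler.fit(X_train)

# transform applies the learned scaling to both splits.
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Compare the same model with and without scaling.
knn_raw = KNeighborsClassifier().fit(X_train, y_train)
knn_scaled = KNeighborsClassifier().fit(X_train_scaled, y_train)

print("raw:   ", knn_raw.score(X_test, y_test))
print("scaled:", knn_scaled.score(X_test_scaled, y_test))
```

Without scaling, the features with the largest numeric ranges dominate the distance computation, so the two scores often differ noticeably on data like this.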

Module 4: Pipelines
Lastly, we combine preprocessing and ML models from the previous modules and create pipelines using scikit-learn's Pipeline object. The Pipeline enables us to connect transformers with a classifier or regressor to build a data flow, where the output of one step is the input of another.
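
A minimal sketch of such a pipeline, assuming the wine dataset, a StandardScaler, and a k-nearest neighbors classifier (illustrative choices; the step names "scaler" and "knn" are also made up for this example). Calling fit on the pipeline fits the scaler, transforms the training data, and passes the result to the classifier. Calling predict applies the already-fitted scaler before the classifier predicts.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=0, stratify=y
)

# Each step is a (name, estimator) pair; the last step is the model.
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("knn", KNeighborsClassifier()),
])

# The pipeline exposes the same fit / predict / score interface as a single model.
pipe.fit(X_train, y_train)
y_pred = pipe.predict(X_test)
accuracy = pipe.score(X_test, y_test)
print(accuracy)
```

Because the scaler is fit inside the pipeline, it only ever sees the data passed to fit, which helps avoid leaking test-set statistics into preprocessing.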

Background Knowledge


Thomas J. Fan is a Senior Software Engineer at Quansight Labs, working to sustain and evolve the PyData open-source ecosystem. He is a maintainer of scikit-learn, an open-source machine learning library written for Python. Previously, he worked at Columbia University, improving the interoperability between scikit-learn and AutoML systems. Thomas holds a Master's in Physics from Stony Brook University and a Master's in Mathematics from New York University.

Open Data Science