Introduction to Scikit-learn: Machine Learning in Python

Scikit-learn is a Python machine learning library used by many data science practitioners. Machine learning is a valuable tool across many domains, such as medicine, physics, and finance. We will start this training by learning about scikit-learn's API for supervised machine learning. The API mainly consists of three methods: fit, to build models; predict, to make predictions from models; and transform, to change the representation of the input data. Supervised machine learning models in scikit-learn that make predictions are called classifiers or regressors. Models that transform data are called transformers. This simple and consistent interface abstracts away the algorithm, allowing us to focus on our particular problems.

We will use this interface to apply traditional machine learning algorithms such as linear models and tree-based models. We will learn why splitting your data into train and test sets matters for model evaluation. Next, we will learn about preprocessing numerical data and its importance when working with linear models. Linear models with regularization, such as logistic regression, can converge faster to a solution when the training data is scaled. Finally, we will learn how to combine these preprocessing techniques with a machine learning model by using a Pipeline. The Pipeline enables us to combine transformers with a classifier or regressor to build a data flow, where the output of one step is the input of another.

After this training, you will be able to use and apply scikit-learn to your machine learning problems.
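As a rough illustration of the workflow described above, the following minimal sketch splits a dataset, scales it with a transformer, fits a classifier, and then repeats the same steps with a Pipeline. The dataset (scikit-learn's bundled breast cancer data), the random_state, and the max_iter value are assumptions chosen for the example and are not taken from the training material.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Assumption: a small built-in dataset stands in for "your" data.
X, y = load_breast_cancer(return_X_y=True)

# Hold out a test set so the model is evaluated on data it has not seen.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0
)

# Transformer: fit learns each feature's mean and standard deviation,
# transform rescales the data using those statistics.
scaler = StandardScaler()
scaler.fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Classifier: fit builds the model, predict makes predictions from it.
clf = LogisticRegression(max_iter=1000)
clf.fit(X_train_scaled, y_train)
predictions = clf.predict(X_test_scaled)

# Pipeline: the scaler's output becomes the classifier's input,
# so the two steps above collapse into a single estimator.
pipe = Pipeline(
    [
        ("scaler", StandardScaler()),
        ("classifier", LogisticRegression(max_iter=1000)),
    ]
)
pipe.fit(X_train, y_train)
test_accuracy = pipe.score(X_test, y_test)
print(f"Pipeline accuracy on the held-out test set: {test_accuracy:.3f}")
```

Note that the scaler is fit only on the training split. The Pipeline handles this automatically when fit is called, which is one reason to prefer it over scaling by hand.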


Thomas J. Fan is a Staff Associate at the Data Science Institute at Columbia University. He is one of the core developers of scikit-learn, an open-source machine learning library written in Python. Thomas holds a Master's in Mathematics from NYU and a Master's in Physics from Stony Brook University. He also maintains skorch, a scikit-learn compatible neural network library that wraps PyTorch. He believes that developing open-source software is one of the best ways to maximize one's impact.

Open Data Science

Innovation Center
101 Main St
Cambridge, MA 02142
