Intermediate Machine Learning with Scikit-learn: Cross-validation, Parameter Tuning, Pandas Interoperability, and Missing Values


Scikit-learn is a Python machine learning library used by many data science practitioners. In this training, we will learn about cross-validation, tuning machine learning algorithms, pandas interoperability, and handling missing values. We will start with cross-validation, which lets us evaluate machine learning models by repeatedly splitting our data into training and testing sets. We will cover cross-validation schemes such as K-Fold cross-validation and the importance of stratifying your data. Next, we will learn about tuning algorithms in scikit-learn with grid search and random search. These hyper-parameter search techniques help find hyper-parameter combinations that are suited to your dataset. We will learn how to specify hyper-parameter spaces when working with scikit-learn's Pipelines. Next, we will learn about categorical features and how to use scikit-learn's encoders to convert them into numerical features that a machine learning algorithm can consume. We will learn how to handle heterogeneous data with scikit-learn and pandas DataFrames. scikit-learn's ColumnTransformer enables us to specify which columns of the DataFrame a given transformer is applied to. Specifically, we will learn how to apply numerical transformations to the numerical columns and encoders to the categorical columns. Then we will learn how to handle missing values with imputation, using univariate techniques and a k-Nearest Neighbors approach. Finally, we will apply the machine learning techniques we have learned to a house pricing dataset with scikit-learn's Histogram-based Gradient Boosted Trees. scikit-learn's implementation of boosted trees is based on LightGBM and has similar performance characteristics.
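To give a flavor of how these pieces fit together, here is a minimal sketch and not the training's actual material. It assumes a small synthetic DataFrame with hypothetical column names (`age`, `income`, `city`), uses LogisticRegression as a stand-in estimator, and picks an arbitrary parameter grid. It combines a ColumnTransformer, univariate imputation, one-hot encoding, a Pipeline, and a grid search over stratified K-Fold splits.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data: two numerical columns and one categorical column.
rng = np.random.default_rng(0)
n = 60
df = pd.DataFrame({
    "age": rng.normal(40, 10, n),
    "income": rng.normal(50, 15, n),
    "city": rng.choice(["A", "B", "C"], n),
})
df.loc[::7, "income"] = np.nan  # inject some missing values
# Balanced binary target, chosen only for illustration.
y = (df["age"] > df["age"].median()).astype(int)

# Numerical columns: impute missing values, then scale.
numeric = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])

# Route each group of DataFrame columns to its own transformer.
preprocess = ColumnTransformer([
    ("num", numeric, ["age", "income"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["city"]),
])

pipe = Pipeline([
    ("preprocess", preprocess),
    ("clf", LogisticRegression()),
])

# Nested steps are addressed with double underscores.
param_grid = {
    "clf__C": [0.1, 1.0, 10.0],
    "preprocess__num__impute__strategy": ["mean", "median"],
}

cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
search = GridSearchCV(pipe, param_grid, cv=cv)
search.fit(df, y)
print(search.best_params_)
```

Because the imputer and scaler live inside the Pipeline, they are refit on the training portion of each fold, so no information from the held-out fold leaks into preprocessing.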


Thomas J. Fan is a Staff Associate at the Data Science Institute at Columbia University. He is one of the core developers of scikit-learn, an open-source machine learning library written in Python. Thomas holds a Master's in Mathematics from NYU and a Master's in Physics from Stony Brook University. He also maintains skorch, a scikit-learn compatible neural network library that wraps PyTorch. He believes that developing open-source software is one of the best ways to maximize one's impact.

Open Data Science

