Training Gradient Boosting Models on Large Datasets with CatBoost

Abstract: Gradient boosting is a machine-learning technique that achieves state-of-the-art results in a variety of practical tasks. For a number of years, it has remained the primary method for learning problems with heterogeneous features, noisy data, and complex dependencies: web search, recommendation systems, weather forecasting, and many others.

CatBoost is one of the three most popular gradient boosting libraries. It has a set of advantages that differentiate it from other libraries:

1. CatBoost provides great quality without parameter tuning.
2. CatBoost is able to incorporate categorical features in your data (like music genre, device id, URL, etc.) in predictive models with no additional preprocessing.
3. CatBoost prediction is 20-60 times faster than in other open-source gradient boosting libraries, which makes it possible to use CatBoost for latency-critical tasks.
4. CatBoost has the fastest GPU and multi-GPU training implementations of all the openly available gradient boosting libraries.

This workshop will feature a comprehensive tutorial on using the CatBoost library. We will walk you through all the steps of building a good predictive model, covering topics such as:

- Choosing suitable loss functions and metrics to optimize
- Training a CatBoost model
- Visualizing the process of training
- CatBoost's built-in overfitting detector and other means of reducing overfitting in gradient boosting models
- Feature selection and explaining model predictions
- Testing a trained CatBoost model on unseen data

Bio: Nikita Dmitriev is a member of the CatBoost team. He graduated from Lomonosov Moscow State University and the Yandex School of Data Analysis. He works as a machine learning developer at Yandex.