Advanced Gradient Boosting (I): Fundamentals, Interpretability, and Categorical Structure


Gradient Boosting remains among the most effective methods for classification and regression on tabular data. This session is Part One of two. We will start with the fundamentals of how boosting works and best practices for model building and hyper-parameter tuning. Next, we will discuss how to interpret the model: which features matter overall, and which drive a specific prediction. Finally, we will discuss how to exploit categorical structure, where the different values of a categorical variable have a known relationship to one another.

Session Outline:

Section 1: Fundamentals of Gradient Boosting: We will quickly review how an ensemble of decision trees is built iteratively using the gradient of the loss function. This will highlight the meanings of the various hyper-parameters involved in model creation.
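The iterative procedure sketched above can be illustrated in a few lines. The following is a minimal sketch (on made-up toy data) of gradient boosting with squared-error loss, where the negative gradient is simply the residual; the variables `n_estimators`, `learning_rate`, and `max_depth` correspond to the hyper-parameters the session will discuss.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Toy regression data (assumed purely for illustration)
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)

# Hyper-parameters that appear throughout the session
n_estimators = 50      # number of boosting rounds (trees)
learning_rate = 0.1    # shrinkage applied to each tree's contribution
max_depth = 2          # depth of each individual tree

# Start from a constant prediction (the mean minimizes squared error)
pred = np.full_like(y, y.mean())
trees = []

for _ in range(n_estimators):
    # For squared-error loss, the negative gradient is the residual
    residual = y - pred
    tree = DecisionTreeRegressor(max_depth=max_depth)
    tree.fit(X, residual)
    pred += learning_rate * tree.predict(X)
    trees.append(tree)

mse = np.mean((y - pred) ** 2)
```

Each round fits a small tree to the current residuals and adds a shrunken copy of its predictions to the ensemble, which is the core loop that libraries like XGBoost and LightGBM optimize heavily.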

Section 2: Model Building: How should one approach a new modeling problem? How do we balance the twin goals of making good predictions and understanding the relationship between our features and the target? We will cover best practices for setting the various hyper-parameters involved in a gradient boosting model.

Section 3: Model Interpretation: We focus on both "global" interpretability (which features are generally important) and "local" interpretability (explaining an individual prediction). ICE plots and SHAP values will be the main tools we discuss.
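As a taste of the ICE idea: for each row, vary one feature over a grid while holding the other features fixed, and record how the prediction changes; averaging those per-row curves recovers the classic partial dependence plot. A minimal sketch with scikit-learn (SHAP values, computed via the separate `shap` package, are not shown here):

```python
from sklearn.datasets import make_friedman1
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.inspection import partial_dependence

# Synthetic data standing in for a real problem
X, y = make_friedman1(n_samples=300, random_state=0)
model = GradientBoostingRegressor(random_state=0).fit(X, y)

# ICE: one curve per row, showing how that row's prediction changes
# as feature 0 is varied over a grid of values.
ice = partial_dependence(model, X[:50], features=[0], kind="individual")
curves = ice["individual"][0]   # shape: (50 rows, n_grid_points)

# The partial dependence plot is the average of the ICE curves.
pdp = curves.mean(axis=0)
```

Plotting the individual curves alongside their average reveals interactions that the averaged PDP alone would hide.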

Section 4: Categorical Structure: Standard approaches to categorical variables either treat them as having no structure (as in one-hot encoding) or an ordinal (linear) structure. However, categorical variables may exhibit many different kinds of structure including circular (like the months of the year), hierarchical (as in the categories for image classification), or other graphical structures (such as the U.S. states). We will show how to use the StructureBoost package to exploit known structure in both features and target variables.
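To make the circular case concrete: December (month 12) is adjacent to January (month 1), yet one-hot encoding treats all months as equally unrelated and ordinal encoding places them maximally far apart. The sketch below shows a common workaround, mapping months onto the unit circle with sine/cosine features; this illustrates the structure but is not the StructureBoost approach itself, which exploits the structure directly in the boosting algorithm.

```python
import numpy as np

# Months 1..12 have circular structure: 12 is adjacent to 1.
months = np.arange(1, 13)

# Map each month to a point on the unit circle so that adjacent
# months are close in feature space.
angle = 2 * np.pi * (months - 1) / 12
month_sin = np.sin(angle)
month_cos = np.cos(angle)

# Under this encoding, December and January are close, while
# December and June are diametrically opposed.
dec = np.array([month_sin[11], month_cos[11]])
jan = np.array([month_sin[0], month_cos[0]])
jun = np.array([month_sin[5], month_cos[5]])
dist_dec_jan = np.linalg.norm(dec - jan)
dist_dec_jun = np.linalg.norm(dec - jun)
```

The hierarchical and graph-structured cases (image-classification label trees, adjacency of U.S. states) have no such simple numeric encoding, which is where structure-aware methods earn their keep.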

Background Knowledge:

All examples will be in Python using Jupyter notebooks. Students should have experience using Gradient Boosting models in practice, but all are welcome to follow along.


Brian Lucena is Principal at Numeristical, where he advises companies of all sizes on how to apply modern machine learning techniques to solve real-world problems with data. He is the creator of three Python packages: StructureBoost, ML-Insights, and SplineCalib. In previous roles he has served as Principal Data Scientist at Clover Health, Senior VP of Analytics at PCCI, and Chief Mathematician at Guardian Analytics. He has taught at numerous institutions including UC-Berkeley, Brown, USF, and the Metis Data Science Bootcamp.

Open Data Science



