StructureBoost: Gradient Boosting with Categorical Structure


The values of a categorical variable frequently have a structure that is neither ordinal nor linear. For example, the months of the year have a circular structure, and the US states have a geographical structure. Standard approaches such as one-hot or numerical encoding cannot effectively exploit this structural information. In this tutorial, we will introduce the StructureBoost gradient boosting package, in which the structure of a categorical variable can be represented by a graph and exploited to improve predictive performance. Moreover, StructureBoost can make informed predictions on categorical values for which there is little or no data by leveraging knowledge of the structure. We will walk through examples of how to configure and train models using StructureBoost and demonstrate other features of the package.
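As a rough illustration of what representing structure by a graph can mean, the months of the year can be written as a cycle graph in which each month is adjacent to its two neighbors. The sketch below uses plain Python only; the names `MONTHS` and `month_cycle_graph` are hypothetical and are not part of the StructureBoost API.

```python
# Illustrative sketch only: a plain-Python graph, not the StructureBoost API.
MONTHS = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]


def month_cycle_graph(values):
    """Return an adjacency dict linking each value to its two circular neighbors."""
    n = len(values)
    return {
        value: {values[(i - 1) % n], values[(i + 1) % n]}
        for i, value in enumerate(values)
    }


graph = month_cycle_graph(MONTHS)

# December is adjacent to both November and January, which a
# numerical encoding (Jan=1, ..., Dec=12) would not capture.
print(sorted(graph["Dec"]))
```

A one-hot encoding would treat all twelve months as equally unrelated, while this adjacency structure records which months are neighbors.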

Session Outline
Section 1: Structured Categorical Decision Trees. We will review how to extend the standard decision tree to accept structured categorical variables. This extension is the theoretical underpinning of StructureBoost.

Section 2: Configuring and using StructureBoost. Working through a Jupyter notebook, we will demonstrate how to fit and predict using StructureBoost on real datasets involving structured categorical variables.

Section 3: Advanced features and capabilities of StructureBoost. We will dive deeper into some of the advantages of StructureBoost over other boosting packages, and demonstrate some of the more advanced features.
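One way to picture the structured categorical splits of Section 1 is to allow only those partitions of the category values in which both sides remain connected in the structure graph. The following is a minimal sketch under that assumption, not a description of the package's internals; `cycle_graph`, `is_connected`, and `is_structured_split` are hypothetical helper names.

```python
from collections import deque


def cycle_graph(values):
    """Adjacency dict for a circular structure (e.g. months of the year)."""
    n = len(values)
    return {
        value: {values[(i - 1) % n], values[(i + 1) % n]}
        for i, value in enumerate(values)
    }


def is_connected(graph, subset):
    """Breadth-first search check that `subset` forms one connected piece."""
    subset = set(subset)
    if not subset:
        return False
    start = next(iter(subset))
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in graph[node]:
            if neighbor in subset and neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return seen == subset


def is_structured_split(graph, left):
    """Assumed rule: a split is allowed if both sides are connected."""
    left = set(left)
    right = set(graph) - left
    return is_connected(graph, left) and is_connected(graph, right)


months = list(range(1, 13))  # 1 = January, ..., 12 = December
graph = cycle_graph(months)

winter = {12, 1, 2}          # wraps around the end of the year
scattered = {1, 4, 7, 10}    # no two of these months are adjacent

print(is_structured_split(graph, winter))     # True
print(is_structured_split(graph, scattered))  # False
```

Note that a single threshold split on months encoded as 1 through 12 cannot separate December, January, and February from the rest of the year, whereas the circular structure admits that split directly.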

Background Knowledge
Attendees should be familiar with the Python toolkit: numpy, pandas, scikit-learn, etc.
Attendees should be familiar with the fit -> predict -> evaluate workflow of model creation using train/test splits. Experience with gradient boosting or random forests in particular will be helpful.


Brian Lucena is a Principal at Lucena Consulting and a consulting Data Scientist at Agentero. An applied mathematician in every sense, he is passionate about applying modern machine learning techniques to understand the world and act upon it. In previous roles, he has served as SVP of Analytics at PCCI, Principal Data Scientist at Clover Health, and Chief Mathematician at Guardian Analytics. He has taught at numerous institutions including UC-Berkeley, Brown, USF, and the Metis Data Science Bootcamp.

Open Data Science

