
Abstract: Categorical variables often possess a natural structure that is neither linear nor ordinal. The months of the year have a circular structure, while the US states have a structure that can be represented by a graph. StructureBoost uses novel techniques that allow this known structure to be exploited to yield better predictions. Recently, StructureBoost has been enhanced to utilize structure in the target variable (i.e., in multi-classification) as well as in the predictor variables. This hands-on workshop will demonstrate how to use StructureBoost on different problems involving categorical variables with known structure.
Session Outline:
1. Overview: What is categorical structure and why is it important?
2. Categorical Predictors in StructureBoost: We will show how to use StructureBoost when you have a categorical predictor with known structure and demonstrate the improvement yielded by exploiting this structure. We will also discuss a bit of theory about how StructureBoost works 'under the hood'.
3. Multi-classification in StructureBoost: Next, we demonstrate how to configure StructureBoost when the target (rather than the predictor) possesses a known categorical structure. Again, we will talk a bit about the theory behind the exploitation of structure.
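As a small illustration of the kind of structure discussed in the outline above, the sketch below represents the months of the year as a cycle graph in plain Python and compares graph distance with the distance implied by an ordinal encoding. It does not use StructureBoost itself; the helper names `cycle_graph` and `graph_distance` are hypothetical and exist only for this example.

```python
from collections import deque

MONTHS = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]


def cycle_graph(nodes):
    """Return an adjacency dict linking each node to its two ring neighbors."""
    n = len(nodes)
    return {
        nodes[i]: {nodes[(i - 1) % n], nodes[(i + 1) % n]}
        for i in range(n)
    }


def graph_distance(adj, start, goal):
    """Shortest path length between two nodes, found by breadth-first search."""
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, dist = queue.popleft()
        if node == goal:
            return dist
        for neighbor in adj[node]:
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, dist + 1))
    raise ValueError("no path between nodes")


month_graph = cycle_graph(MONTHS)

# An ordinal encoding (Jan=0, ..., Dec=11) places December and January far apart.
ordinal_gap = abs(MONTHS.index("Dec") - MONTHS.index("Jan"))

# On the cycle graph they are direct neighbors.
ring_gap = graph_distance(month_graph, "Dec", "Jan")

print(ordinal_gap, ring_gap)  # prints: 11 1
```

The same adjacency-dict representation would work for the US states, with an edge between each pair of states that share a border.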
Background Knowledge:
The ideal participant will have used Gradient Boosting before (with packages such as XGBoost, CatBoost, or LightGBM) and be experienced in the model development cycle of training a model, making predictions, evaluating results, and iterating. However, participants of all levels should be able to appreciate the workshop and learn from it.
Bio: Brian Lucena is Principal at Numeristical, where he advises companies of all sizes on how to apply modern machine learning techniques to solve real-world problems with data. He is the creator of three Python packages: StructureBoost, ML-Insights, and SplineCalib. In previous roles, he has served as Principal Data Scientist at Clover Health, Senior VP of Analytics at PCCI, and Chief Mathematician at Guardian Analytics. He has taught at numerous institutions, including UC Berkeley, Brown, USF, and the Metis Data Science Bootcamp.