
Abstract: Gradient Boosting remains among the most effective methods for classification and regression problems on tabular data. This hands-on training is for those with some experience using Gradient Boosting who want to learn more advanced, cutting-edge techniques. We begin with a review of best practices for building Gradient Boosting models, then discuss parameter tuning and model interpretability. Next, we cover probabilistic regression, where the goal is to predict a full conditional density of the target variable rather than just a point estimate. Finally, we cover categorical structure, that is, how to encode knowledge about the values of a categorical variable so that a model can exploit it effectively.
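To make the distinction between a point estimate and a full predictive density concrete, here is a minimal sketch in plain Python. It does not use the API of any package covered in the session; the helper names (`normal_logpdf`, `histogram_logpdf`, `mean_nll`) and the toy numbers are assumptions chosen for illustration. It scores two candidate densities on a bimodal sample by average negative log-likelihood: a single normal distribution fitted by moment matching, and a hand-specified piecewise-constant (histogram) density.

```python
import math

def normal_logpdf(y, mu, sigma):
    """Log-density of a Normal(mu, sigma) forecast evaluated at y."""
    z = (y - mu) / sigma
    return -0.5 * z * z - math.log(sigma) - 0.5 * math.log(2.0 * math.pi)

def histogram_logpdf(y, edges, probs):
    """Log-density of a piecewise-constant forecast evaluated at y.

    edges: increasing bin boundaries; probs: probability mass per bin.
    """
    for lo, hi, p in zip(edges[:-1], edges[1:], probs):
        if lo <= y < hi:
            return math.log(p / (hi - lo))
    return float("-inf")  # y lies outside the support of the forecast

def mean_nll(logpdf_values):
    """Average negative log-likelihood; lower is better."""
    return -sum(logpdf_values) / len(logpdf_values)

# Toy bimodal outcomes (illustrative only): values cluster near 2 and near 8.
observed = [1.8, 2.1, 2.3, 1.9, 7.9, 8.2, 8.0, 7.7]

# A point estimate (the mean) falls between the clusters, where no outcomes occur.
point_estimate = sum(observed) / len(observed)

# Parametric forecast: one normal distribution fitted by moment matching.
variance = sum((y - point_estimate) ** 2 for y in observed) / len(observed)
sigma = math.sqrt(variance)
nll_normal = mean_nll([normal_logpdf(y, point_estimate, sigma) for y in observed])

# Nonparametric forecast: most of the mass sits in the bins around 2 and 8.
edges = [0.0, 1.0, 3.0, 7.0, 9.0, 10.0]
probs = [0.02, 0.46, 0.04, 0.46, 0.02]
nll_hist = mean_nll([histogram_logpdf(y, edges, probs) for y in observed])

print(f"point estimate: {point_estimate:.2f}")
print(f"normal NLL:     {nll_normal:.3f}")
print(f"histogram NLL:  {nll_hist:.3f}")
```

On this toy sample the histogram density attains the lower negative log-likelihood, because a single normal distribution has to spread its mass across the empty region between the two clusters.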
Session Outline:
Section 1: Gradient Boosting Best Practices: Which hyper-parameters are most important? How should I tune them? How should I approach feature selection? What tools can I use to explain my model?
Section 2: Probabilistic Regression: We will demonstrate two packages that perform Probabilistic Regression: NGBoost and StructureBoost. NGBoost takes a parametric approach, making it ideal for problems where you expect the target to follow a normal distribution (or another known parametric distribution). StructureBoost takes a nonparametric approach, and is a great choice for larger data sets and/or situations where the target distribution may be multimodal, asymmetric, or otherwise unusual.
Section 3: Categorical Structure: We will dive further into StructureBoost and demonstrate how you can exploit known structure in categorical variables, whether they serve as predictors or as the target variable (i.e. in multi-class classification).
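As an illustration of what "known structure in categorical variables" can mean, the sketch below encodes the months of the year as a cycle graph and checks whether a candidate split keeps neighboring months together. This is plain Python, not StructureBoost's API; the function names (`cycle_graph`, `is_connected`, `respects_structure`) and the rule that both sides of a split should be connected are assumptions made for the example, shown as one plausible way to use such a graph.

```python
from collections import deque
from itertools import combinations

MONTHS = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]

def cycle_graph(values):
    """Adjacency map in which each value neighbors the previous and next one, wrapping around."""
    n = len(values)
    return {v: {values[(i - 1) % n], values[(i + 1) % n]}
            for i, v in enumerate(values)}

def is_connected(subset, graph):
    """Return True if the vertices in subset form one connected piece of graph."""
    subset = set(subset)
    if not subset:
        return False
    start = next(iter(subset))
    seen = {start}
    queue = deque([start])
    while queue:  # breadth-first search restricted to the subset
        node = queue.popleft()
        for neighbor in graph[node] & subset:
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return seen == subset

def respects_structure(left, graph):
    """A split respects the graph if both sides of it are connected."""
    right = set(graph) - set(left)
    return is_connected(left, graph) and is_connected(right, graph)

graph = cycle_graph(MONTHS)

winter = {"Dec", "Jan", "Feb"}      # contiguous once the year wraps around
scattered = {"Jan", "Apr", "Aug"}   # ignores the ordering of months

print(respects_structure(winter, graph))
print(respects_structure(scattered, graph))

# Enumerate every nonempty proper subset and keep the structure-respecting ones.
structured = [s for k in range(1, len(MONTHS))
              for s in combinations(MONTHS, k)
              if respects_structure(s, graph)]
print(len(structured))
```

Under these assumptions only 132 of the 4094 nonempty proper subsets of months qualify, which shows how much a structural prior can shrink the space of candidate splits compared with treating the twelve values as unrelated labels.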
Background Knowledge:
All examples will be in Python using Jupyter notebooks. Students should have experience with using Gradient Boosting models in practice, but all are welcome to follow along.
Bio: Brian Lucena is Principal at Numeristical, where he advises companies of all sizes on how to apply modern machine learning techniques to solve real-world problems with data. He is the creator of three Python packages: StructureBoost, ML-Insights, and SplineCalib. In previous roles he has served as Principal Data Scientist at Clover Health, Senior VP of Analytics at PCCI, and Chief Mathematician at Guardian Analytics. He has taught at numerous institutions including UC Berkeley, Brown, USF, and the Metis Data Science Bootcamp.