Abstract: The values of a categorical variable frequently have a structure that is neither ordinal nor linear. For example, the months of the year have a circular structure, and the US states have a geographical structure. Standard approaches such as one-hot or numerical encoding cannot effectively exploit the structural information in such variables. In this tutorial, we will introduce the StructureBoost gradient boosting package, in which the structure of a categorical variable can be represented by a graph and exploited to improve predictive performance. Moreover, StructureBoost can make informed predictions on categorical values for which there is little or no data by leveraging knowledge of the structure. We will walk through examples of how to configure and train models using StructureBoost and demonstrate other features of the package.
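To make the idea of "structure represented by a graph" concrete, here is a minimal sketch of the circular structure of the months, written in plain Python. It does not use the StructureBoost API; the helper names `cycle_graph` and `neighbors` are hypothetical and chosen only for illustration. The point is that December and January are adjacent in the graph, whereas a numerical encoding (12 and 1) would place them at opposite ends.

```python
# Illustration only: plain Python, NOT the StructureBoost API.
# The function names below are hypothetical helpers for this sketch.

MONTHS = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]


def cycle_graph(values):
    """Return (vertices, edges) linking each value to the next, wrapping around."""
    n = len(values)
    # Each edge is an unordered pair, so store it as a frozenset.
    edges = {frozenset((values[i], values[(i + 1) % n])) for i in range(n)}
    return set(values), edges


def neighbors(value, edges):
    """Return the sorted list of values adjacent to `value` in the graph."""
    return sorted(
        other
        for edge in edges if value in edge
        for other in edge if other != value
    )


vertices, edges = cycle_graph(MONTHS)
print(len(edges))               # number of edges in the cycle
print(neighbors("Dec", edges))  # months adjacent to December
```

A geographical structure such as the US states could be sketched the same way, with an edge between each pair of states that share a border.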
Bio: Brian Lucena is a Principal at Lucena Consulting and a consulting Data Scientist at Agentero. An applied mathematician in every sense, he is passionate about applying modern machine learning techniques to understand the world and act upon it. In previous roles, he has served as SVP of Analytics at PCCI, Principal Data Scientist at Clover Health, and Chief Mathematician at Guardian Analytics. He has taught at numerous institutions, including UC Berkeley, Brown, USF, and the Metis Data Science Bootcamp.