Missing Data in Supervised Machine Learning

Abstract: Most implementations of supervised machine learning algorithms are designed to work with complete datasets, but datasets are rarely complete. This mismatch is usually addressed either by deleting points with missing elements, which loses potentially valuable information, or by imputing (trying to guess the values of the missing elements), which can lead to increased bias and false conclusions. I will quickly review the three types of missing data (missing completely at random, missing at random, and missing not at random) and a couple of simple but often misleading ways to impute. I will spend most of the time describing three advanced methods for handling missing data: multiple imputation, the reduced-feature (aka pattern submodel) approach, and XGBoost, which is one of the few machine learning algorithms that work with incomplete datasets. We will discuss the advantages and limitations of each of these methods. By the end of the workshop, you will have a deeper understanding of the intricate nature of missing data and several state-of-the-art techniques in your arsenal for dealing with it. I will use Python to implement and visualize the techniques, relying on packages like pandas, scikit-learn, and xgboost, with matplotlib or plotly for visualizations. The workshop is at the intermediate level. I assume that participants have encountered missing data before and have worked through at least one regression or classification problem.
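To give a flavor of the reduced-feature (pattern submodel) approach mentioned above, here is a minimal sketch and not the workshop's actual material. It assumes a numeric feature matrix with NaN marking missing values, uses a plain NumPy least-squares fit as a stand-in for whatever estimator you would really use, and runs on synthetic, noise-free data. The function names fit_pattern_submodels and predict_pattern_submodels are hypothetical. One design choice is also an assumption: each submodel is trained on every row in which that pattern's observed features are present, not only on rows that share the exact pattern.

```python
import numpy as np


def fit_pattern_submodels(X, y):
    """Fit one linear model per missingness pattern, using only observed features."""
    mask = np.isnan(X)
    models = {}
    for pattern in np.unique(mask, axis=0):
        observed = ~pattern
        # Assumption: train on every row where this pattern's observed
        # features are all present (a superset of the rows with this pattern).
        rows = ~mask[:, observed].any(axis=1)
        design = np.column_stack([np.ones(rows.sum()), X[rows][:, observed]])
        coef, *_ = np.linalg.lstsq(design, y[rows], rcond=None)
        models[tuple(pattern)] = coef  # coef[0] is the intercept
    return models


def predict_pattern_submodels(models, X):
    """Predict each row with the submodel that matches its missingness pattern."""
    mask = np.isnan(X)
    y_pred = np.empty(X.shape[0])
    for i, (row, pattern) in enumerate(zip(X, mask)):
        coef = models[tuple(pattern)]  # raises KeyError for an unseen pattern
        y_pred[i] = coef[0] + row[~pattern] @ coef[1:]
    return y_pred


# Synthetic, noise-free data for illustration only.
rng = np.random.default_rng(0)
X_full = rng.normal(size=(200, 3))
y = 1.0 + 2.0 * X_full[:, 0] - 3.0 * X_full[:, 1] + 0.5 * X_full[:, 2]

# Knock out values in two of the three features.
X = X_full.copy()
X[rng.random(200) < 0.3, 2] = np.nan
X[rng.random(200) < 0.2, 1] = np.nan

models = fit_pattern_submodels(X, y)
y_pred = predict_pattern_submodels(models, X)

complete = ~np.isnan(X).any(axis=1)
rmse_complete = np.sqrt(np.mean((y_pred[complete] - y[complete]) ** 2))
rmse_incomplete = np.sqrt(np.mean((y_pred[~complete] - y[~complete]) ** 2))
print(f"patterns: {len(models)}")
print(f"RMSE on complete rows: {rmse_complete:.3g}")
print(f"RMSE on incomplete rows: {rmse_incomplete:.3g}")
```

No value is ever imputed here: rows with missing features are simply routed to a model that never saw those features. The cost is that the number of submodels grows with the number of distinct missingness patterns, and a pattern that never appears in training cannot be predicted without a fallback.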

Bio: Andras Zsom is a Lead Data Scientist at the Center for Computation and Visualization at Brown University. He manages a small but dedicated team of data scientists whose mission is to help high-level university administrators make better data-driven decisions through data analysis and predictive modeling. The team also collaborates with faculty members on various data-intensive academic projects and trains data science interns.
Andras is passionate about using machine learning and predictive modeling for good. He is an astrophysicist by training and has been fascinated by all fields of the natural and life sciences since childhood. He was a postdoctoral researcher at MIT for 3.5 years before coming to Brown. He obtained his PhD from the Max Planck Institute for Astronomy in Heidelberg, Germany, and he was born and raised in Hungary.