Missing Data in Supervised Machine Learning

Abstract: 

Most implementations of supervised machine learning algorithms are designed to work with complete datasets, but real datasets are rarely complete. This dichotomy is usually addressed either by deleting points with missing elements, which loses potentially valuable information, or by imputing (trying to guess the values of the missing elements), which can introduce bias and lead to false conclusions. I will quickly review the three types of missing data (missing completely at random, missing at random, missing not at random) and a couple of simple but often misleading ways to impute. I will spend most of the time describing three advanced methods for handling missing data: multiple imputation, the reduced-feature (a.k.a. pattern submodel) approach, and XGBoost, one of the few machine learning algorithms that works with incomplete datasets. We will discuss the advantages and limitations of each of these methods. By the end of the workshop, you will have a deeper understanding of the intricate nature of missing data, and you will have multiple state-of-the-art techniques in your arsenal for dealing with it. I will use Python to implement and visualize the techniques, relying on packages such as pandas, scikit-learn, xgboost, and matplotlib or plotly for visualization. The workshop is at the intermediate level: I assume that this is not the first time participants are learning about missing data and that they have worked through at least one regression or classification problem.
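
To give a flavor of the material, the sketch below is illustrative only, not taken from the workshop: the toy MCAR dataset, all variable names, and the use of LogisticRegression as a stand-in estimator for the pattern submodels are my assumptions. It walks through single imputation with scikit-learn's SimpleImputer, multiple imputation with IterativeImputer, a reduced-feature/pattern-submodel loop, and XGBoost's native handling of NaNs:

import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer, SimpleImputer
from sklearn.linear_model import LogisticRegression
import xgboost as xgb

# Toy data (illustrative): 200 points, 3 features, ~20% of entries knocked
# out completely at random (MCAR), the simplest missingness mechanism.
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(200, 3)), columns=["f0", "f1", "f2"])
y = (X["f0"] + X["f1"] > 0).astype(int)
X = X.mask(rng.random(X.shape) < 0.2)

# Simple (single) imputation: easy but often misleading, because it shrinks
# the variance of the imputed features and distorts correlations.
X_mean = pd.DataFrame(SimpleImputer(strategy="mean").fit_transform(X),
                      columns=X.columns)

# Multiple imputation: draw several plausible completions, fit a model on
# each, and pool the results instead of trusting a single guess.
completions = [
    IterativeImputer(sample_posterior=True, random_state=seed).fit_transform(X)
    for seed in range(5)
]

# Reduced-feature (pattern submodel) approach: one model per missingness
# pattern, trained only on the features observed in that pattern.
patterns = X.notna().apply(tuple, axis=1)
submodels = {}
for pattern, idx in X.groupby(patterns).groups.items():
    cols = [c for c, present in zip(X.columns, pattern) if present]
    if cols and y.loc[idx].nunique() == 2:  # need >=1 feature and both classes
        submodels[pattern] = LogisticRegression().fit(X.loc[idx, cols], y.loc[idx])

# XGBoost accepts NaNs directly: each tree split learns a default direction
# for missing values, so no separate imputation step is required.
model = xgb.XGBClassifier(n_estimators=50, eval_metric="logloss")
model.fit(X, y)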

Bio: 

Andras Zsom is a Lead Data Scientist in the Center for Computation and Visualization at Brown University, Providence, RI. He works with high-level academic administrators to tackle predictive modeling problems, collaborates with faculty members on data-intensive research projects, and was the instructor of a data science course offered to the data science master's students at Brown.