
Abstract: The problem of missing data is exacerbated by the dimensionality of the data and by the multiplicity of data sources, which makes any analysis challenging. Deleting observations with missing entries is not a viable option: in addition to introducing potential bias, it may lead to the deletion of almost all the data. An abundant literature addresses missing data, and many implementations exist (more than 150 R packages).
In this tutorial, we will review the main approaches and implementations (in R and Python) to tackle this issue. We will start with the inferential framework, where the aim is to estimate the parameters and their variance as well as possible in the presence of missing data. We will discuss the dangers of single imputation methods and detail multiple imputation. Then we will cover recent results in a supervised-learning setting. A striking one is that the widely used method of imputing with the mean prior to learning can be consistent. That such a simple approach can be relevant may have important consequences in practice. We will also explain how to fit random forests with missing values and give first results on neural nets with incomplete data.
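As a rough illustration of mean imputation prior to learning, the following minimal Python sketch computes column means on the observed training entries and reuses them to fill missing values in both the training and the test data. The function names and the toy arrays are hypothetical and chosen only for illustration; this is not the tutorial's own material, and NumPy is assumed to be available.

```python
import numpy as np

def fit_column_means(X_train):
    """Column means computed from the observed (non-NaN) training entries only."""
    return np.nanmean(X_train, axis=0)

def impute_with_means(X, col_means):
    """Return a copy of X where each NaN is replaced by the mean of its column."""
    return np.where(np.isnan(X), col_means, X)

# Toy data (hypothetical): NaN marks a missing entry.
X_train = np.array([[1.0, 10.0],
                    [np.nan, 20.0],
                    [3.0, np.nan]])
X_test = np.array([[np.nan, 40.0]])

# The means are learned on the training set and reused as-is on the test set,
# so that the same imputation rule is applied at training and prediction time.
means = fit_column_means(X_train)
X_train_imp = impute_with_means(X_train, means)
X_test_imp = impute_with_means(X_test, means)
```

Any supervised learner can then be trained on X_train_imp and applied to X_test_imp. Reusing the training means on the test set is a design choice of this sketch: it keeps the preprocessing identical on both sets.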
Session Outline
Explore a database with missing values: what is the pattern of missing values, and what are the types of missing values (informative, non-informative)?
Learn how to perform logistic regression with missing values.
Learn what confidence should be given to an analysis performed on incomplete data.
Learn that even with a high percentage of missing values some analyses can be conducted.
Implement the main matrix completion and multiple imputation techniques.
Learn how to do supervised learning with missing values when you have missing values in both train and test sets.
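The first item of the outline, exploring the pattern of missing values, can be sketched in Python as follows. The toy matrix is hypothetical and NumPy is assumed to be available. The sketch computes the fraction of missing entries per variable, the distinct missingness patterns across rows, and the number of rows that would survive deletion of incomplete observations.

```python
import numpy as np

# Toy data (hypothetical): 4 observations, 3 variables, NaN marks a missing entry.
X = np.array([[1.0, np.nan, 3.0],
              [np.nan, np.nan, 6.0],
              [7.0, 8.0, 9.0],
              [1.5, np.nan, 2.0]])

# 1 where the entry is missing, 0 where it is observed.
mask = np.isnan(X).astype(int)

# Fraction of missing entries in each column.
frac_missing = mask.mean(axis=0)

# Distinct row-wise missingness patterns and how often each one occurs.
patterns, counts = np.unique(mask, axis=0, return_counts=True)

# Number of fully observed rows, i.e. what remains after deleting incomplete rows.
complete_rows = int((mask.sum(axis=1) == 0).sum())

print("fraction missing per column:", frac_missing)
for pattern, count in zip(patterns, counts):
    print("pattern", pattern, "occurs in", count, "row(s)")
print("complete rows:", complete_rows, "out of", X.shape[0])
```

On this toy matrix only one of the four rows is fully observed, which shows in miniature how deleting incomplete observations can discard most of the data.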
Background Knowledge
Basic knowledge of statistical/ML methods (regression, logistic regression, PCA, random forests, etc.). Knowledge of R and Python.
Bio: Julie Josse is a senior researcher at Inria and a scientific collaborator and teacher at Ecole Polytechnique (Institut Polytechnique de Paris) in France. She is an expert in handling missing values (inference, multiple imputation, matrix completion, missing not at random data, supervised learning with missing values). She has created a platform that collects work on the topic and provides resources to users (https://rmisstastic.netlify.app/) and has organized workshops on the topic. Her vocation is to push methodological innovation so that her research leads to useful applications for users, in particular in the biosciences and health. Her current research focuses on causal inference techniques for personalized medicine. She leads an important project with the Traumabase group, dedicated to the management of polytraumatized patients, to help emergency doctors make decisions. Julie Josse is dedicated to reproducible research with the R statistical software: she has developed packages including missMDA and FactoMineR to transfer her work, and she is a member of the R Foundation and of Rforwards, working to increase the participation of minorities in the community.