Handling missing data in Python/Pandas and R

Abstract: Missing data is a widespread challenge for data analysts. It is so common in practice that studies should be designed with this problem in mind from the beginning. The question is not whether data will be missing at the end of the study, but how much. Even in the most robust settings, malfunctioning instruments, malfunctioning researchers, and study subjects who drop out are all facts of data collection in the real world. Knowing how to handle missing data is therefore such a crucial skill for data analysts that Wainer (2010) considers it one of the “six necessary tools” that researchers need to master in order to successfully tackle problems in their fields in this century.

The missing data research field emerged through several articles published in the 1970s (Dempster et al., 1977; Heckman, 1979; Rubin, 1976). It took off with the seminal texts by Rubin and by Little and Rubin, both published in 1987. Since then, Little and Rubin’s additional contributions, along with those of Schafer, Allison, Graham and van Buuren, have completed the theory underpinning this subfield of statistics. In parallel, software advances from SPSS to Stata to R and, more recently, Python now allow researchers to implement robust methods for handling missing data in their studies.

While several robust statistical methods of handling missing data have been developed and are now widely accepted in the statistical community, researchers from other fields have lagged in adopting them. In their survey of RCTs published in top medical journals, Bell et al. (2013) concluded that “A large gap is apparent between statistical methods research related to missing data and use of these methods in application settings, including RCTs in top medical journals.” Concerning the educational field, Pampaka et al. (2016) noted that “even though missing data is an important issue, it is rarely dealt with or even acknowledged in educational research.” Indeed, most data analyses address missing data through complete case analysis (deleting all observations with missing data), and the most common advice in the data community is to perform single imputation, i.e., replacing all missing values with the mean or mode for that variable. These methods are suboptimal: they can bias the study results and reduce their statistical power.
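The two basic approaches named above can be illustrated with a minimal sketch. The toy DataFrame and its column names are hypothetical and are not taken from the tutorial; the only library calls used are the standard pandas `dropna` and `fillna`. The sketch shows one cost of each approach: complete case analysis discards rows, and mean imputation shrinks the spread of the imputed variable.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: two variables, each with some missing entries.
df = pd.DataFrame({
    "age": [23.0, 35.0, np.nan, 41.0, 29.0, np.nan],
    "income": [31.0, np.nan, 48.0, 52.0, 38.0, 45.0],
})

# Complete case analysis: drop every row with at least one missing value.
complete_cases = df.dropna()
rows_lost = len(df) - len(complete_cases)

# Single (mean) imputation: fill each column's gaps with that column's mean.
mean_imputed = df.fillna(df.mean())

# Compare the spread of "age" before and after mean imputation.
std_observed = df["age"].std()            # missing values are skipped
std_imputed = mean_imputed["age"].std()   # filled values sit exactly at the mean

print(f"rows lost to complete case analysis: {rows_lost} of {len(df)}")
print(f"std of age, observed values only: {std_observed:.2f}")
print(f"std of age, after mean imputation: {std_imputed:.2f}")
```

Because every filled value sits exactly at the column mean, the imputed column has a smaller standard deviation than the observed values alone, which in turn tends to understate standard errors in downstream analyses.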

There are still heuristic aspects to missing data methods. However, as Graham (2009) states, multiple imputation and maximum likelihood procedures take us “at least 90% of the way to the hypothetical ideal from where we were 25 years ago. Newer procedures will continually fine-tune the existing MI and ML procedures, but the main missing data solutions are already available and should be used now.”

In the ODSC tutorial I will aim to:

1. describe missing data and the challenges it poses,

2. clarify the confusing terminology that adds to the field’s complexity,

3. review methods for handling missing data, and

4. apply robust multiple imputation methods to a varied dataset in Python/Pandas.
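To give a flavor of the multiple imputation mentioned in the last item, here is a simplified sketch using only NumPy and pandas. It is not the tutorial's own code: the simulated data, the helper `impute_once`, and the choice of a bootstrap followed by stochastic regression imputation are all assumptions made for illustration. The pooling step follows Rubin's rules, where the total variance is the within-imputation variance plus (1 + 1/m) times the between-imputation variance.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical simulated data: y depends on x, and roughly 30% of y is missing.
n = 200
x = rng.normal(size=n)
y = 2.0 + 1.5 * x + rng.normal(scale=1.0, size=n)
df = pd.DataFrame({"x": x, "y": y})
df.loc[rng.random(n) < 0.3, "y"] = np.nan


def impute_once(data, rng):
    """Return one completed copy of `data` (illustrative helper)."""
    observed = data.dropna()
    # Bootstrap the observed rows so each imputation reflects
    # uncertainty about the regression parameters.
    idx = rng.integers(0, len(observed), size=len(observed))
    bx = observed["x"].to_numpy()[idx]
    by = observed["y"].to_numpy()[idx]
    slope, intercept = np.polyfit(bx, by, deg=1)
    resid_sd = np.std(by - (intercept + slope * bx), ddof=2)

    completed = data.copy()
    missing = completed["y"].isna().to_numpy()
    x_missing = completed["x"].to_numpy()[missing]
    # Stochastic regression imputation: prediction plus random noise.
    noise = rng.normal(scale=resid_sd, size=int(missing.sum()))
    completed.loc[missing, "y"] = intercept + slope * x_missing + noise
    return completed


# Create m completed datasets and estimate the mean of y in each.
m = 20
estimates, variances = [], []
for _ in range(m):
    completed = impute_once(df, rng)
    estimates.append(completed["y"].mean())
    variances.append(completed["y"].var(ddof=1) / len(completed))

# Pool with Rubin's rules.
pooled_estimate = float(np.mean(estimates))
within = float(np.mean(variances))
between = float(np.var(estimates, ddof=1))
total_variance = within + (1 + 1 / m) * between

print(f"pooled mean of y: {pooled_estimate:.3f}")
print(f"pooled standard error: {np.sqrt(total_variance):.3f}")
```

Unlike single imputation, the pooled standard error includes the between-imputation term, so the uncertainty caused by the missing values is carried into the final estimate rather than hidden.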

The tutorial is aimed at all data scientists and researchers who want to understand the methodology for handling missing data in their own studies and datasets. My own journey into these methods has amazed me at every step: from the statistical theory behind them, to how suboptimal the basic methods are, to the lack of any off-the-shelf robust methods in the Python data landscape at this point (my colleagues and I are considering adding them to either sklearn or pandas in the coming months).

Bio: Alexandru Agachi is a co-founder of Empiric Capital, an algorithmic, data-driven asset management firm headquartered in London. He is also a guest lecturer in big data and machine learning at Pierre et Marie Curie University in Paris, and is involved in neuro-oncogenetic research, in particular applications of machine learning. After initial studies at LSE, he completed four graduate and postgraduate degrees and diplomas in technology and science, focusing on the thorium nuclear fuel cycle, surgical robotics, neuroanatomy and imagery, and biomedical innovation. He previously worked in hedge fund research at UBP, at Deutsche Bank, and at the Kyoto University Research Reactor Institute, and conducted an investment consulting project for the CIO office at Investec. He was nominated as one of Forbes’ 30 Under 30 in Finance in 2018.

Open Data Science Conference