Generating Realistic Data While Preserving Privacy

Abstract: 

Individual-level data is very valuable to researchers, as it allows for arbitrary analyses. However, the value of individual-level data must be balanced against the privacy concerns of individuals who appear in data sets.

Traditional approaches to protecting privacy, such as reporting summary statistics or adding noise directly to the data, can throw away more information than necessary and may not even protect privacy. In this talk, I discuss an alternative approach: generating synthetic data sets that preserve relevant statistical properties of input data while also providing privacy guarantees for individuals who appear in the data.

I begin with a high-level introduction to differential privacy, a mathematical framework for analyzing and proving privacy guarantees. I introduce the notion of generative adversarial networks (GANs), which are used to create the synthetic data, and I illustrate how GANs are trained with a toy example using handwritten digits. I explain a small modification to the procedure for training GANs that ensures differential privacy.
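A common way to make gradient-based training differentially private, assumed here for illustration and not necessarily the exact modification presented in the talk, is to clip each example's gradient to a fixed norm and add Gaussian noise before applying the update. The sketch below uses plain Python rather than PyTorch, and the names `clip_by_l2_norm`, `private_mean_gradient`, `max_norm`, and `noise_multiplier` are hypothetical.

```python
import math
import random


def clip_by_l2_norm(grad, max_norm):
    """Scale grad down so its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm <= max_norm or norm == 0.0:
        return list(grad)
    scale = max_norm / norm
    return [g * scale for g in grad]


def private_mean_gradient(per_example_grads, max_norm, noise_multiplier, rng):
    """Clip each per-example gradient, sum, add Gaussian noise, then average.

    Assumption: noise standard deviation is noise_multiplier * max_norm,
    so the noise is calibrated to the most any one example can contribute.
    """
    dim = len(per_example_grads[0])
    total = [0.0] * dim
    for grad in per_example_grads:
        clipped = clip_by_l2_norm(grad, max_norm)
        for i, g in enumerate(clipped):
            total[i] += g
    sigma = noise_multiplier * max_norm
    noisy = [t + rng.gauss(0.0, sigma) for t in total]
    return [n / len(per_example_grads) for n in noisy]


# Toy per-example gradients for a two-parameter discriminator (made-up values).
rng = random.Random(0)
grads = [[3.0, 4.0], [0.3, 0.4], [-6.0, 8.0]]
update = private_mean_gradient(grads, max_norm=1.0, noise_multiplier=1.1, rng=rng)
print(update)
```

Clipping bounds how much any single record can move the model, and the noise masks whatever influence remains. The actual privacy guarantee depends on the noise scale, the batch size, and the number of training steps, none of which this sketch accounts for.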

I offer an implementation of a differentially private Wasserstein generative adversarial network (DPWGAN) using PyTorch. To illustrate how this method works, I apply the DPWGAN to ACS PUMS data, a collection of individual-level records from the U.S. Census Bureau. I show that the DPWGAN models correlations in the data, and that cross-tabulations on the synthetic data are close to those in the original data. The PyTorch code is open source and publicly available.

Participants will come away with an understanding of the basics of differential privacy and generative adversarial networks, as well as the ability to apply the DPWGAN code to their own data sets.

Bio: 

Joshua Falk is a data scientist at Civis Analytics, where he focuses on survey methodology and infrastructure, pipeline robustness, and causal inference. Prior to Civis, he studied linguistics and statistics at the University of Chicago. Joshua also contributes data analysis and visualizations to the South Side Weekly.