How I Learned to Stop Worrying and Create Messy Data


The biggest hurdle in building a great model is often not the choice of model but finding a realistic, high-quality training dataset. Multiple industries are booming on the back of this pain point: data vendors are collecting and curating extensive datasets, and more and more companies are investing in the development of synthetic datasets -- data that is generated programmatically and (ideally) resembles real data.

Whereas we have extensive libraries that produce random noise, generating realistic noisy data is trickier. For example, a dataset of words with random letters swapped out may not be the best replacement for a dataset of typos. And obtaining a dataset that includes all misspellings of a word may be difficult, expensive, and unrealistic. In an ideal future we would be able to add realistic noise to a dataset as easily as we can draw a number from a normal distribution.
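
To make the contrast concrete, here is a minimal sketch of the naive baseline described above: replacing randomly chosen positions in a word with random letters. The function name `swap_random_letters` and its parameters are hypothetical, chosen only for illustration. The output is noisy, but nothing in it reflects how people actually mistype words.

```python
import random
import string


def swap_random_letters(word, n_swaps=1, seed=None):
    """Replace up to n_swaps randomly chosen positions with random lowercase letters.

    Hypothetical helper for illustration: this is uniform random noise,
    not a model of real typos.
    """
    rng = random.Random(seed)
    chars = list(word)
    # Pick distinct positions; never ask for more positions than the word has.
    positions = rng.sample(range(len(chars)), k=min(n_swaps, len(chars)))
    for i in positions:
        chars[i] = rng.choice(string.ascii_lowercase)
    return "".join(chars)


noisy = swap_random_letters("address", n_swaps=2, seed=0)
print(noisy)
```

Because every position and every replacement letter is equally likely, this kind of noise ignores patterns such as dropped double letters or common abbreviations, which is the gap the rest of this abstract addresses.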

In the field of Master Data Management, golden records refer to single sources of truth in one’s datasets; they are data points that contain all the information associated with one entity. For example, credit rating companies rely on golden records to keep track of the different names, addresses, and credit lines associated with one individual. Golden records also contain organic noise, such as different spellings and abbreviations of names and street addresses.
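
As a rough sketch of why golden records are a natural source of organic noise, the snippet below represents one golden record as a small data structure and derives pairs of variants from it. The `GoldenRecord` class, the `variant_pairs` helper, and the sample values are all hypothetical, made up for illustration rather than taken from any real dataset or product schema.

```python
from dataclasses import dataclass, field
from itertools import permutations


@dataclass
class GoldenRecord:
    """Hypothetical golden record: all known variants for a single entity."""
    entity_id: str
    names: list = field(default_factory=list)
    addresses: list = field(default_factory=list)


def variant_pairs(values):
    """Return every ordered (source, target) pair of distinct variants."""
    return list(permutations(values, 2))


# Made-up example values; a real golden record would come from curated data.
record = GoldenRecord(
    entity_id="hypothetical-001",
    names=["Jonathan Smith", "Jon Smith", "J. Smith"],
    addresses=["12 Main Street", "12 Main St."],
)

name_pairs = variant_pairs(record.names)
address_pairs = variant_pairs(record.addresses)
print(len(name_pairs), len(address_pairs))
```

Each pair links two spellings that are known to refer to the same entity, so a collection of such pairs is one plausible form of training signal for a model that learns to reproduce realistic variations.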

In this session, we discuss how golden record data can be used to train models for generating synthetic datasets. By learning to reproduce organic noise from golden record data, these algorithms can generate realistic noisy training datasets. As an example, we train a Recurrent Neural Network (RNN) on golden records with name and address fields. We anticipate useful applications of this technique, as models trained on noisy datasets can perform better when encountering real-world data.


Julia is the Director of Analytics at Tamr, where she is expanding the company's analytics and data science solutions. Before joining Tamr, she led end-to-end modeling and development of data science products at Aon's Intellectual Property Solutions group. Her previous experience includes technology-focused litigation consulting, quantitative finance, and private equity. Julia has a PhD in Physics from Harvard.