Accelerating Machine Learning with Training Data Management

Abstract: One of the key bottlenecks in building machine learning systems is creating and managing the massive training datasets that today's models learn from. In this talk, I will describe my work on data management systems that let users specify training datasets in higher-level, faster, and more flexible ways, leading to applications that can be built in hours or days, rather than months or years.

I will start by describing Snorkel, an open-source system for programmatically labeling training data that has been deployed by major technology companies, academic labs, and government agencies. In Snorkel, rather than hand-labeling training data, users write labeling functions that label data using heuristic strategies such as pattern matching, distant supervision, and other models. These labeling functions can have noisy, conflicting, and correlated outputs, which Snorkel models and combines into clean training labels. We solve this novel data cleaning problem without any ground truth labels using a matrix-completion-style approach, which we show has strong consistency guarantees, and demonstrate that Snorkel leads to significant gains in applications ranging from knowledge base construction to medical imaging.
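
To make the idea of labeling functions concrete, here is a minimal sketch in plain Python. It does not use Snorkel's API, and it combines votes with a simple majority vote as a stand-in for the matrix-completion-style model described above. The spam-detection task, the function names, and the label constants are all hypothetical and chosen only for illustration.

```python
import re
from collections import Counter

# Hypothetical label constants for an illustrative spam-detection task.
ABSTAIN, NEGATIVE, POSITIVE = -1, 0, 1


def lf_contains_free(text):
    """Pattern-matching heuristic: the word 'free' suggests spam."""
    return POSITIVE if re.search(r"\bfree\b", text, re.IGNORECASE) else ABSTAIN


def lf_many_exclamations(text):
    """Heuristic: three or more exclamation marks suggest spam."""
    return POSITIVE if text.count("!") >= 3 else ABSTAIN


def lf_greeting(text):
    """Heuristic: a message opening with a greeting is probably not spam."""
    return NEGATIVE if re.match(r"(hi|hello|dear)\b", text, re.IGNORECASE) else ABSTAIN


LABELING_FUNCTIONS = [lf_contains_free, lf_many_exclamations, lf_greeting]


def apply_lfs(texts):
    """Build the label matrix: one row per text, one column per labeling function."""
    return [[lf(t) for lf in LABELING_FUNCTIONS] for t in texts]


def majority_vote(votes):
    """Combine one row of votes; abstain when nothing fires or the top vote is tied."""
    counts = Counter(v for v in votes if v != ABSTAIN)
    if not counts:
        return ABSTAIN
    ranked = counts.most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return ABSTAIN
    return ranked[0][0]


texts = [
    "Hello Bob, see you at lunch",
    "FREE prize!!! Click now",
    "Dear friend, free gift inside",
    "Quarterly report attached",
]
label_matrix = apply_lfs(texts)
labels = [majority_vote(row) for row in label_matrix]
```

The third message shows why a learned combination is attractive: two labeling functions conflict, and majority vote can only abstain, whereas a model that estimates the accuracy of each labeling function could weight one vote over the other.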

Next, I will give an overview of two other systems that accelerate training data creation and management: TANDA, a system for optimizing and managing data augmentation strategies, wherein a labeled dataset is artificially expanded by transforming data points; and MeTaL, a system for integrating training labels across multiple related tasks. I will conclude by outlining future research directions for further accelerating and democratizing machine learning workflows, such as higher-level interfaces and massively multi-task frameworks.
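
As a rough illustration of data augmentation, the sketch below expands a small labeled text dataset by applying short random sequences of label-preserving transformations. This is not TANDA's method or API: the sequences here are sampled uniformly at random rather than optimized, and the transformation functions, parameter names, and example data are hypothetical.

```python
import random


def swap_adjacent_words(text, rng):
    """Swap one randomly chosen pair of neighboring words."""
    words = text.split()
    if len(words) < 2:
        return text
    i = rng.randrange(len(words) - 1)
    words[i], words[i + 1] = words[i + 1], words[i]
    return " ".join(words)


def drop_random_word(text, rng):
    """Delete one randomly chosen word, keeping at least one word."""
    words = text.split()
    if len(words) < 2:
        return text
    del words[rng.randrange(len(words))]
    return " ".join(words)


def lowercase(text, rng):
    """Lowercase the text; rng is unused but keeps the signature uniform."""
    return text.lower()


TRANSFORMATIONS = [swap_adjacent_words, drop_random_word, lowercase]


def augment(dataset, copies_per_example=2, sequence_length=2, seed=0):
    """Return the original (text, label) pairs plus transformed copies.

    Each copy is produced by a random sequence of transformations and
    keeps the label of the example it was derived from.
    """
    rng = random.Random(seed)
    augmented = list(dataset)
    for text, label in dataset:
        for _ in range(copies_per_example):
            new_text = text
            for transform in rng.choices(TRANSFORMATIONS, k=sequence_length):
                new_text = transform(new_text, rng)
            augmented.append((new_text, label))
    return augmented


dataset = [
    ("The service was excellent and fast", 1),
    ("The product broke after one day", 0),
]
augmented = augment(dataset, copies_per_example=2, sequence_length=2, seed=0)
```

Sampling transformation sequences uniformly is the simplest possible policy. A system that optimizes augmentation strategies would instead choose which transformations to compose, and in what order, based on how useful the resulting examples are.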

Bio: Alex Ratner is a Ph.D. candidate in computer science at Stanford, advised by Chris Ré, where his research focuses on weak supervision: the idea of using higher-level, noisier input from domain experts to train complex state-of-the-art models where limited or no hand-labeled training data is available. He leads the development of the Snorkel framework (snorkel.stanford.edu) for weakly supervised ML, which has been applied to machine learning problems in domains like genomics, radiology, and political science. He is supported by a Stanford Bio-X SIGF fellowship.