Abstract: One of the key bottlenecks in building machine learning systems today is creating and managing large labeled training datasets. In this talk, I’ll describe our work on systems that support and accelerate the creation of training data in higher-level, programmatic, but noisier ways, an approach often referred to as weak supervision. This work is motivated by the observation that ML developers spend an increasing share of their time on training data engineering (i.e., labeling, augmenting, reshaping, cleaning, and maintaining training datasets) and that we can better support these emerging workflows with tools and principles from both data management and statistical learning. I’ll cover Snorkel, our open-source system for training data labeling (snorkel.stanford.edu), which can reduce training data creation time from months to days; other recent work on data augmentation and multi-task supervision; and applications of this work in domains ranging from medical imaging to unstructured data extraction.
Bio: Alex Ratner is a Ph.D. candidate in computer science at Stanford, advised by Chris Ré, where his research focuses on weak supervision: the idea of using higher-level, noisier input from domain experts to train complex state-of-the-art models when little or no hand-labeled training data is available. He leads the development of the Snorkel framework (snorkel.stanford.edu) for weakly supervised ML, which has been applied to machine learning problems in domains like genomics, radiology, and political science. He is supported by a Stanford Bio-X SIGF fellowship.