Software 2.0 and Snorkel: Beyond hand-labeled data

Abstract: In the last several years, deep learning models have simultaneously become more performant and more readily available as easy-to-use, commodity tools--however, their deployment in practice is bottlenecked by the need for large, hand-labeled training sets. This talk describes Snorkel, a system that focuses on this emerging training data bottleneck in the software 2.0 stack. In Snorkel, instead of tediously hand-labeling individual data items, a user implicitly defines large training sets by writing simple programs, called labeling functions, that label subsets of data points. This allows users to build high-quality models despite the fact that these labeling functions will have varying quality, coverage, and specificity--and be correlated in unknown ways. A key technical challenge in Snorkel is to estimate the quality and correlations among these labeling functions without hand-labeled data. This talk will explain a theory of learning without labeled data and survey a host of recent applications in natural language processing, structured data problems, and computer vision. This talk will also briefly discuss recent extensions of these core ideas to automatically generating data augmentations, synthesizing training data, and learning from multi-task supervision.
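To make the idea of labeling functions concrete, here is a minimal sketch in plain Python. It does not use the Snorkel library or its API. The spam/ham task, the three heuristics, and all names (`lf_contains_link`, `majority_vote`, and so on) are hypothetical and chosen only for illustration. The votes are combined with a simple majority vote, which is a baseline stand-in: Snorkel itself estimates the accuracies and correlations of the labeling functions rather than weighting them equally.

```python
from collections import Counter

# Hypothetical label values for an illustrative spam/ham task.
ABSTAIN, HAM, SPAM = -1, 0, 1

# Each labeling function votes on a data point or abstains,
# so it labels only the subset of points its heuristic covers.
def lf_contains_link(text):
    return SPAM if "http" in text.lower() else ABSTAIN

def lf_mentions_prize(text):
    return SPAM if "prize" in text.lower() else ABSTAIN

def lf_short_message(text):
    return HAM if len(text.split()) <= 4 else ABSTAIN

LABELING_FUNCTIONS = [lf_contains_link, lf_mentions_prize, lf_short_message]

def apply_lfs(texts):
    """Build a label matrix: one row per data point, one column per function."""
    return [[lf(t) for lf in LABELING_FUNCTIONS] for t in texts]

def majority_vote(votes):
    """Baseline combiner (an assumption, not Snorkel's learned model):
    take the most common non-abstain vote, abstaining on ties or no votes."""
    counts = Counter(v for v in votes if v != ABSTAIN)
    if not counts:
        return ABSTAIN
    top = counts.most_common()
    if len(top) > 1 and top[0][1] == top[1][1]:
        return ABSTAIN
    return top[0][0]

texts = [
    "Claim your prize at http://example.com now",
    "see you at lunch",
    "The quarterly report is attached for your review",
]
label_matrix = apply_lfs(texts)
train_labels = [majority_vote(row) for row in label_matrix]
print(label_matrix)
print(train_labels)
```

In this sketch the third message is covered by no labeling function, so it stays unlabeled. That is the coverage limitation the abstract mentions. An unweighted vote also treats every function as equally reliable, which is the assumption that estimating labeling-function quality is meant to remove.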

Snorkel is open source on GitHub. Technical blog posts and tutorials are available at snorkel.stanford.edu.

Bio: Alex Ratner is a Ph.D. candidate in computer science at Stanford, advised by Chris Ré. His research focuses on weak supervision: the idea of using higher-level, noisier input from domain experts to train complex state-of-the-art models where limited or no hand-labeled training data is available. He leads the development of the Snorkel framework (snorkel.stanford.edu) for weakly supervised ML, which has been applied to machine learning problems in domains like genomics, radiology, and political science. He is supported by a Stanford Bio-X SIGF fellowship.

Open Data Science Conference