Abstract: While the general idea of self-supervised learning is identical across modalities, the actual algorithms and objectives differ widely because they were developed with a single modality in mind. In this talk, I will present data2vec, a framework for general self-supervised learning that uses the same learning method for speech, NLP, and computer vision. The core idea is to predict latent representations of the full input data based on a masked view of the input, in a self-distillation setup using a standard Transformer architecture. Instead of predicting modality-specific targets such as words, visual tokens, or units of human speech, which are local in nature, data2vec predicts contextualized latent representations that contain information from the entire input. Experiments on major benchmarks for speech recognition, image classification, and natural language understanding demonstrate a new state of the art or performance competitive with predominant approaches.
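The self-distillation setup described above can be sketched in a few lines. This is a deliberately toy illustration, not the paper's implementation: the `encode` function stands in for a Transformer encoder, and names such as `ema_update` and `tau` are hypothetical. The essential structure is faithful to the description, though: a teacher (an exponential moving average of the student) encodes the full input to produce contextualized targets, while the student sees a masked view and regresses those targets at the masked positions.

```python
# Toy sketch of a data2vec-style objective (illustrative only; the real
# model uses a Transformer and averages targets over several top layers).

def ema_update(teacher, student, tau=0.999):
    """Teacher weights track the student via an exponential moving average."""
    return [tau * t + (1.0 - tau) * s for t, s in zip(teacher, student)]

def encode(weights, inputs):
    """Stand-in 'encoder': elementwise weighting replaces a Transformer."""
    return [w * x for w, x in zip(weights, inputs)]

def masked_regression_loss(student_w, teacher_w, inputs, mask_idx):
    """Student sees a masked view; the loss is squared error against the
    teacher's latent targets at the masked positions only."""
    masked = [0.0 if i in mask_idx else x for i, x in enumerate(inputs)]
    preds = encode(student_w, masked)
    targets = encode(teacher_w, inputs)  # teacher sees the full input
    return sum((preds[i] - targets[i]) ** 2 for i in mask_idx) / len(mask_idx)

# One training "step" on a toy sequence.
inputs = [0.5, -1.2, 0.3, 0.8]
student = [1.0, 1.0, 1.0, 1.0]
teacher = list(student)  # teacher initialized from the student
loss = masked_regression_loss(student, teacher, inputs, mask_idx={1, 3})
teacher = ema_update(teacher, student)
```

Because the targets are the teacher's contextualized representations rather than discrete tokens, the same loss applies unchanged whether the input is an image, a waveform, or text.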
Learning objectives: General self-supervised learning across multiple modalities.
Bio: Michael Auli is a principal research scientist/director at FAIR in Menlo Park, California. His work focuses on speech and NLP, and he helped create projects such as wav2vec/data2vec, the widely used fairseq toolkit, the first modern feed-forward seq2seq models to outperform RNNs for NLP, and several top-ranked submissions to the WMT news translation task in 2018 and 2019. Before that, Michael was at Microsoft Research, where he did early work on neural machine translation and on using neural language models for conversational applications. During his PhD at the University of Edinburgh, he worked on natural language processing and parsing. http://michaelauli.github.io