Look, Listen, Read: Unified AI with TorchMultimodal


Multimodal AI is a fast-growing field in which deep neural networks are trained on multiple types of input data simultaneously (e.g., text, image, video, audio). Multimodal models perform better than single-modality models in content understanding applications and are setting new standards for content generation in models such as DALL-E and Stable Diffusion. Building multimodal models is hard. In this session, we share more about multimodal AI, why you should care about it, some challenges you might face, and how TorchMultimodal, our new PyTorch domain library, eases the developer experience of building multimodal models.


Evan Smothers is a software engineer on the PyTorch Multimodal team at Meta. His work focuses on supporting researchers who build state-of-the-art vision and language models and on helping to scale these models to billions of parameters. Previously, Evan was a data scientist at Uber, where he used ML to improve the company's matching algorithms. His academic background is in mathematics, and he completed his PhD at UC Davis with a research focus on partial differential equations.

Open Data Science
One Broadway
Cambridge, MA 02142
