Distributed Training Platform at Facebook
Distributed Training Platform at Facebook

Abstract: 

Large scale distributed training has become an essential element to scaling the productivity for ML engineers. Today, ML models are getting larger and more complex in terms of compute and memory requirements. The amount of data we train on at Facebook is huge. In this talk, we will learn about the Distributed Training Platform to support large scale data and model parallelism. We will touch base on Distributed Training support for PyTorch and how we are offering a flexible training platform for ML engineers to increase their productivity at facebook scale.

Bio: 

Senior Software Engineering Manager - Facebook AI Infrastructure Mohamed Fawzy leads the Distributed AI Group at Facebook. The Distributed AI group is focused on scaling and product-ionizing machine learning training at facebook scale by building platform and infrastructure to support large scale distributed training on heterogeneous hardware.

Open Data Science

Open Data Science
Innovation Center
101 Main St
Cambridge, MA 02142
info@odsc.com

Privacy Settings
We use cookies to enhance your experience while using our website. If you are using our Services via a browser you can restrict, block or remove cookies through your web browser settings. We also use content and scripts from third parties that may use tracking technologies. You can selectively provide your consent below to allow such third party embeds. For complete information about the cookies we use, data we collect and how we process them, please check our Privacy Policy
Youtube
Consent to display content from Youtube
Vimeo
Consent to display content from Vimeo
Google Maps
Consent to display content from Google