
Abstract: In 2020, it was estimated that training GPT-3 on a single GPU would cost around $4.6 million and take 355 years. Training large models in the cloud successfully requires optimizations that we get out of the box with the open-source Lightning library by Lightning AI. In this workshop we will walk through how to use Lightning to speed up machine learning workflows in the cloud. We will begin with an introduction to the different methods of speeding up training, along with their cost implications and technical complexities. Then we’ll learn how to leverage key features of the Lightning library, like LightningDataset and multi-GPU training, to speed up our training workflows in the cloud. By the end of this workshop, you should be able to train a basic model with fast data loading on multiple GPUs.
Session Outline:
Lesson 1: Overview & Environment setup
Learn why training fast is important and the impact it has on costs. We’ll review the current challenges of efficient training and how Lightning was built to solve them. Bring your computer so you can set up a basic model that we’ll learn how to train efficiently together.
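To give a flavor of the starter model we’ll set up in this lesson, here is a minimal sketch of a LightningModule. It assumes Lightning 2.x-style imports; the class name, ResNet-18 backbone, learning rate, and class count are illustrative assumptions, not the workshop’s exact code.

```python
# Minimal sketch of a basic image classifier as a LightningModule
# (class name, backbone, and hyperparameters are illustrative assumptions).
import torch
from torch import nn
import lightning as L
from torchvision.models import resnet18


class LitClassifier(L.LightningModule):  # hypothetical class name
    def __init__(self, num_classes=1000, lr=1e-3):
        super().__init__()
        self.lr = lr
        # Small, standard backbone as a stand-in for "a basic model".
        self.model = resnet18(num_classes=num_classes)

    def training_step(self, batch, batch_idx):
        x, y = batch
        loss = nn.functional.cross_entropy(self.model(x), y)
        self.log("train_loss", loss)  # logged each step by Lightning
        return loss

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=self.lr)
```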
Lesson 2: Large datasets
Understand the cost considerations when working with large datasets in the cloud. We’ll also review the most common libraries for training with large datasets and learn how to create a custom LightningDataset that works efficiently with the ImageNet dataset on S3.
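As a rough preview of the data-loading side, below is a minimal sketch using a LightningDataModule that wraps an ImageNet-style folder dataset. The S3-backed LightningDataset covered in the workshop may differ; the class name, local data path, batch size, and worker count here are placeholder assumptions.

```python
# Minimal sketch of a DataModule for an ImageNet-style dataset
# (local placeholder path; the workshop's S3-backed LightningDataset may differ).
import lightning as L
from torch.utils.data import DataLoader
from torchvision import datasets, transforms


class ImageNetDataModule(L.LightningDataModule):  # hypothetical class name
    def __init__(self, data_dir="/data/imagenet", batch_size=256):
        super().__init__()
        self.data_dir = data_dir
        self.batch_size = batch_size
        self.transform = transforms.Compose([
            transforms.RandomResizedCrop(224),
            transforms.ToTensor(),
        ])

    def setup(self, stage=None):
        # Assumes an ImageNet-style directory of per-class subfolders.
        self.train_set = datasets.ImageFolder(
            f"{self.data_dir}/train", transform=self.transform
        )

    def train_dataloader(self):
        return DataLoader(
            self.train_set,
            batch_size=self.batch_size,
            shuffle=True,
            num_workers=8,  # parallel workers to keep the GPU fed
        )
```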
Lesson 3: Multi-GPU
Here we’ll go over the cost and operational complexities of multi-GPU training and learn how to use Lightning’s out-of-the-box multi-GPU support.
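As a preview, scaling to multiple GPUs with the Lightning Trainer typically comes down to Trainer arguments rather than changes to the model code. The sketch below assumes the LitClassifier and ImageNetDataModule from the earlier sketches are defined in the same script, and that the machine has four GPUs; both are assumptions for illustration.

```python
# Minimal sketch: distributed data-parallel training across 4 GPUs
# (device count and the model/datamodule classes are assumptions
#  carried over from the sketches above).
import lightning as L

model = LitClassifier()
datamodule = ImageNetDataModule()

trainer = L.Trainer(
    accelerator="gpu",
    devices=4,        # number of GPUs to use on this machine
    strategy="ddp",   # distributed data parallel across those GPUs
    max_epochs=1,
)
trainer.fit(model, datamodule=datamodule)
```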
Lesson 4 (optional): Further challenges
What happens when training doesn’t fit on a single GPU? If time allows, we’ll talk about some of the ongoing challenges of large-scale training and how Lightning is constantly evolving to solve the hardest and most common ones.
https://github.com/Lightning-AI/lightning
Background Knowledge:
python, elementary knowledge of ML
Bio: Noha Alon joined Lightning AI as a founding team member and currently holds an engineering leadership position. She leads parts of the effort to build the Lightning AI platform, which aims to revolutionize the AI development workflow. Previously she worked on ML projects at Glossier and on the LLM team at Microsoft. She holds a bachelor's degree in Software Engineering from Cal Poly, San Luis Obispo.

Noha Alon
Director of Engineering | Lightning AI
