The following post introduces PyTorch Lightning, outlines its core design philosophy, and provides inline examples of how this philosophy enables more reproducible and production-capable deep learning code.

What is PyTorch Lightning?

PyTorch Lightning is a lightweight PyTorch wrapper for high-performance AI research.

Simply put, PyTorch Lightning is just organized PyTorch code.

Organizing PyTorch code with Lightning enables seamless training on multiple GPUs, TPUs, and CPUs, as well as the use of difficult-to-implement best practices such as model sharding and mixed precision.

Losing the Boilerplate: PyTorch Lightning Design Philosophy Explained

1. Self-Contained Models and Data

One of the traditional bottlenecks to reproducibility in deep learning is that models are often thought of as just a graph of computations and weights.

Figure: Example PyTorch computation graph from the PyTorch Autograd docs.

In reality, reproducing deep learning results requires mechanisms to keep track of components such as initializations, optimizers, loss functions, data transforms, and augmentations.

A core design philosophy of PyTorch Lightning is that all the components and code related to reproducibility should be self-contained. To test how self-contained your model is, ask yourself this question: “Can someone drop this file into a Trainer without knowing anything about the internals?”

The Lightning module contains all the default initialization parameters needed for reproducibility.
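To make the "drop this file into a Trainer" test concrete, here is a minimal, dependency-free sketch of the idea rather than actual Lightning code. `ToyLinearModule` and `fit` are hypothetical names invented for illustration; only the method names `forward` and `training_step` come from the text above, and `configure_optimizers` mirrors the Lightning hook of the same name. The point is that initialization, loss, and optimizer choice all live on the module, so a generic loop can train it without knowing its internals.

```python
import random


class ToyLinearModule:
    """Hypothetical stand-in for a Lightning-style module (not a Lightning class)."""

    def __init__(self, lr=0.05, seed=0):
        # Hyperparameters are stored on the module, so the file documents the run.
        self.hparams = {"lr": lr, "seed": seed}
        rng = random.Random(seed)  # seeded weight initialization
        self.w = rng.uniform(-1.0, 1.0)
        self.b = 0.0

    def forward(self, x):
        return self.w * x + self.b

    def training_step(self, batch):
        # The loss (squared error) is defined next to the model it trains.
        x, y = batch
        error = self.forward(x) - y
        loss = error ** 2
        grads = {"w": 2 * error * x, "b": 2 * error}
        return loss, grads

    def configure_optimizers(self):
        # The optimizer choice (plain SGD here) is also part of the module.
        lr = self.hparams["lr"]

        def sgd_step(grads):
            self.w -= lr * grads["w"]
            self.b -= lr * grads["b"]

        return sgd_step


def fit(module, data, epochs=50):
    """Generic loop: it knows the method names, not the model internals."""
    step = module.configure_optimizers()
    for _ in range(epochs):
        for batch in data:
            _, grads = module.training_step(batch)
            step(grads)
    return module


# Noise-free samples of y = 3x + 1 on [-1, 1].
data = [(x / 10, 3.0 * (x / 10) + 1.0) for x in range(-10, 11)]
first = fit(ToyLinearModule(seed=42), data)
second = fit(ToyLinearModule(seed=42), data)
print(first.w == second.w, round(first.w, 2), round(first.b, 2))
```

Because the seed and learning rate travel with the module, two runs from the same file end at identical weights. In real Lightning code the same role is played by a subclass of the Lightning module handed to the Trainer.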

2. Modular Code

PyTorch Lightning provides a modular framework that decouples research code from data code, resulting in faster iteration and more reproducible code.

Figure: PyTorch Lightning module example. Visualized modularization of deep learning code with Lightning.

The modular nature of this code increases readability. For example, if I want to understand how a module trains or runs inference, instead of guessing where in the code this is implemented, I can look at the module’s training_step and forward functions.

Similarly, if I want to know how data was preprocessed, transformed, or split, I can look at the data module’s __init__, prepare_data, and setup functions.
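The data side of that split can be sketched in a few lines. This is a dependency-free toy, not the Lightning data module API: `ToyDataModule` and its constructor arguments are hypothetical, the data is synthetic, and the real `setup` hook takes an extra stage argument that is omitted here. What it shows is the division of labor: configuration in `__init__`, one-time acquisition in `prepare_data`, and transforms plus a seeded split in `setup`.

```python
import random


class ToyDataModule:
    """Hypothetical stand-in mirroring __init__ / prepare_data / setup."""

    def __init__(self, n_samples=100, val_fraction=0.2, seed=0):
        # Only configuration is stored here; no data is touched yet.
        self.n_samples = n_samples
        self.val_fraction = val_fraction
        self.seed = seed
        self.raw = None
        self.train, self.val = None, None

    def prepare_data(self):
        # One-time acquisition step (a download in real code; synthetic here).
        self.raw = [(float(i), 2.0 * i) for i in range(self.n_samples)]

    def setup(self):
        # Transform (rescale) and split; seeded so the split never changes.
        scale = float(self.n_samples)
        samples = [(x / scale, y / scale) for x, y in self.raw]
        random.Random(self.seed).shuffle(samples)
        n_val = round(len(samples) * self.val_fraction)
        self.val, self.train = samples[:n_val], samples[n_val:]

    def train_dataloader(self, batch_size=10):
        # Yield the training split in fixed-size batches.
        for start in range(0, len(self.train), batch_size):
            yield self.train[start:start + batch_size]


dm = ToyDataModule()
dm.prepare_data()
dm.setup()
print(len(dm.train), len(dm.val), len(next(dm.train_dataloader())))
```

Anyone asking "how was the data split?" only has to read `setup`, and the answer is the same on every machine because the shuffle is seeded.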

3. Reduce the Boilerplate

After you decouple your research and data code, Lightning manages the remaining boilerplate, providing implementations of proven deep learning best practices that reduce errors and training time.

Figure: The Lightning Trainer standardizes the boilerplate with best practices to reduce ~80% of the most common deep learning errors.

Without this standardization, state-of-the-art methods are often abandoned because of boilerplate mistakes, such as accidentally leaving the model in evaluation mode during fine-tuning.

Lightning manages boilerplate code, such as device optimization, logging, process rank management, and more, so that researchers can focus on building the best models possible. If users have special use cases requiring additional abstraction, they can create and share their own callbacks.
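The callback idea can be illustrated without any framework. The sketch below is a toy under stated assumptions: `ToyTrainer`, `ToyCallback`, `LossHistory`, and `StopWhenLossBelow` are hypothetical names, and the hook signature is simplified compared with Lightning's own callbacks. It shows how a training loop can expose a fixed hook point so that extra behavior, such as recording losses or stopping early, lives in small shareable objects instead of inside the loop.

```python
class ToyCallback:
    """Base class: every hook is a no-op unless a subclass overrides it."""

    def on_train_epoch_end(self, trainer, epoch, loss):
        pass


class LossHistory(ToyCallback):
    """Records the loss reported at the end of each epoch."""

    def __init__(self):
        self.losses = []

    def on_train_epoch_end(self, trainer, epoch, loss):
        self.losses.append(loss)


class StopWhenLossBelow(ToyCallback):
    """Asks the trainer to stop once the loss drops under a threshold."""

    def __init__(self, threshold):
        self.threshold = threshold

    def on_train_epoch_end(self, trainer, epoch, loss):
        if loss < self.threshold:
            trainer.should_stop = True


class ToyTrainer:
    def __init__(self, max_epochs, callbacks=()):
        self.max_epochs = max_epochs
        self.callbacks = list(callbacks)
        self.should_stop = False

    def fit(self, run_epoch):
        # run_epoch is any callable that trains one epoch and returns a loss.
        for epoch in range(self.max_epochs):
            loss = run_epoch(epoch)
            for callback in self.callbacks:
                callback.on_train_epoch_end(self, epoch, loss)
            if self.should_stop:
                break


history = LossHistory()
trainer = ToyTrainer(max_epochs=10, callbacks=[history, StopWhenLossBelow(0.3)])
trainer.fit(lambda epoch: 1.0 / (epoch + 1))  # stand-in for a real epoch
print(history.losses)
```

The loop itself never changes: adding or removing behavior is a matter of editing the `callbacks` list.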

4. Maximum Flexibility

For research to flourish, tools must be flexible. Lightning’s standardized best practices are accessible to the end user as overridable hooks, enabling maximum flexibility for those who want to experiment with crazy ideas that stray off the standard path.

Figure: An example autoencoder training_step override that demonstrates full access to the underlying boilerplate when needed.

Since hooks for processes such as loss configuration are standardized, Lightning makes it much easier to experiment with custom losses for domain-specific applications and to combine different models for more complex multi-input modeling scenarios.
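As a rough illustration of overridable hooks, the following dependency-free sketch gives a base class a default for each step and lets a subclass replace just one of them. `ToyModule`, `ClippedModule`, and `fit` are hypothetical names, and the `optimizer_step` signature here is a simplification chosen for the example, not Lightning's. The override adds gradient clipping and then defers to the default update.

```python
class ToyModule:
    """Hypothetical base class with a sensible default for each hook."""

    def __init__(self, w=0.0, lr=0.1):
        self.w = w
        self.lr = lr

    def training_step(self, batch):
        x, y = batch
        error = self.w * x - y
        return error ** 2, 2 * error * x  # (loss, gradient with respect to w)

    def optimizer_step(self, grad):
        # Default behavior: plain gradient descent.
        self.w -= self.lr * grad


class ClippedModule(ToyModule):
    def optimizer_step(self, grad):
        # Override a single hook to try an idea, then reuse the default update.
        clipped = max(-1.0, min(1.0, grad))
        super().optimizer_step(clipped)


def fit(module, data, epochs=1):
    """The loop is unchanged no matter which hooks a subclass overrides."""
    for _ in range(epochs):
        for batch in data:
            _, grad = module.training_step(batch)
            module.optimizer_step(grad)
    return module


data = [(10.0, 50.0)]  # one large-magnitude sample; raw gradient is -1000 at w = 0
default = fit(ToyModule(), data)
clipped = fit(ClippedModule(), data)
print(default.w, clipped.w)
```

The default module takes one huge step while the clipped variant moves by at most `lr`, yet both run through the same `fit` loop.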

From Reproducible Research to Production Deep Learning

Now that we have a better understanding of the core design philosophy underlying PyTorch Lightning, let’s look at some of the cool features this enables out of the box, from multi-GPU and TPU training to one-line ONNX and TorchScript export.

Figure: PyTorch Lightning code example and gpus.

If you want to get started quickly, Lightning also provides example implementations of common deep learning tasks, from text summarization to object detection, as part of the PyTorch Lightning Flash repo.


Flash can be installed with pip or conda: pip install lightning-flash -U

Production Scale Training with Grid

While Lightning helps keep your PyTorch code organized, reproducible, and scalable, there is one remaining barrier to production that we have yet to discuss: managing infrastructure. Orchestrating compute and data pipelines to train and serve models at scale often requires extensive configuration or code modification. That is where Grid comes in. Grid manages this overhead for you, enabling PyTorch Lightning code to scale from a laptop to the cloud without changing a single line of code.

With Grid trains, you can take a PyTorch Lightning script and scale and track hundreds of experiments as follows.

Figure: PyTorch Lightning code example.

If you don’t have a powerful enough laptop, interactive nodes provide optimized development environments that make it easier to hit the ground running when developing your code, enabling true production-scale training of applied state-of-the-art deep learning models.

Conclusions on PyTorch Lightning

This post shows how Lightning’s core design principles enable more reproducible and production-ready deep learning code.

  1. Lightning code is clearer to read because engineering code is abstracted away, and common functions such as training_step and prepare_data are standardized. Lightning handles the tricky engineering, preventing common mistakes while enabling access to all the flexibility of PyTorch when needed.
  2. Lightning modules are hardware agnostic; if your code runs on a CPU, it will run on GPUs, TPUs, and clusters without requiring you to handle gradient accumulation or process rank management yourself. You can even implement your own custom accelerators.
  3. Each release is tested rigorously, with every new PR run against every supported version of PyTorch, Python, and OS, on multiple GPUs and even TPUs. Lightning also has dozens of integrations with popular machine learning tools such as TensorBoard, CometML, and Neptune.
  4. Grid enables you to scale production training of your PyTorch Lightning code from your laptop to the cloud without having to modify a single line of code. 

Article written by Ari Bornstein & Sean Narenthiran.

Editor’s note: Learn more about PyTorch Lightning in William Falcon’s ODSC East 2021 talk, “From Research to Production, Minus the Boilerplate.”

About the ODSC Speaker:

William Falcon is the creator of the popular open-source project PyTorch Lightning and of the recently announced Grid AI. William created Lightning while doing his PhD at NYU and as a PhD researcher at Facebook AI; Lightning allows users to scale models without the boilerplate, and Grid enables large-scale training on the cloud. Previously, he co-founded the now-acquired NextGenVest and spent time at Goldman Sachs. His PhD (currently on leave to focus on Lightning) is funded by Google DeepMind and the NSF. His research interest is in unsupervised learning and the intersection of AI and neuroscience. William is a native of Venezuela and holds a BA from Columbia University in Computer Science and Statistics, with a minor in Math.