Editor’s note: The authors are speaking at ODSC West 2021. Be sure to check out their talk, “Deep Dive into Flyte,” there!

Machine learning (ML) has been deployed in the industry for over a few decades, but the tooling to support researchers and engineers in this space is still maturing. What once fell into the blanket title of “data scientist” has now specialized into data engineers, ML engineers, and ML researchers, of course, with data scientists still playing a crucial role in an organization’s data team.

As models get complex and data sources become diverse, infrastructure becomes a bottleneck. Often, data-scientists have to get entangled in the low-level infrastructure concerns — Kubernetes, networking, GPU drivers, resource management, etc. Moreover, machine learning experimentation requires individuals to work rapidly on independent and isolated experiments, which demands a way to enable collaboration.

If infrastructure is tied to a data scientist’s job, the scope is too huge to handle. Building pipelines that train models in production and running increasingly complex models with team collaboration in place need a lot of infrastructure wrangling. This is a vast area that needs a dedicated team or platform.

Dev needs DevOps; likewise, ML needs MLOps. Connecting various components in the MLOps stack is a need for most data scientists, which led to the development of tools like Flyte. 


About Flyte

Flyte is an open-source, container-native, structured programming and distributed processing platform that supports building highly concurrent, scalable, and maintainable workflows for machine learning and data processing. It enables the user to focus on the business logic while outsourcing infrastructure management to a centralized team. It also allows platform teams to provide a self-serve platform for their users.

Here is a simple Flyte code snippet using the Python Flytekit API that defines a Flyte task to compute the total pay, which returns the output as a pandas DataFrame:

import pandas as pd
from flytekit import Resources, task
@task(limits=Resources(cpu="2", mem="150Mi"))
def total_pay(hourly_pay: float, hours_worked: int, df: pd.DataFrame) ->
     return df.assign(total_pay=hourly_pay * hours_worked)

Key Benefits of Flyte

  • Kubernetes-Native Workflow Automation Platform
  • Ergonomic SDKs in Python, Java & Scala
  • Versioned & Auditable
  • Reproducible Pipelines
  • Strongly Typed System

Flyte is a feature-rich platform. In this post, we will understand the three essential features of Flyte.

Type Checking

* Flytekit SDK is available in Python, Java, and Scala. In this article, our primary focus will be on Python.

We all believe that Python’s ease of use is partly attributed to its dynamic nature. However, recently, this notion has been shifting gears. The Python community encourages introducing types in Python code because it helps write resilient and error-free code and improves code readability. 

The Flyte team believes that typing is a crucial aspect of writing code; hence we introduced a native-type system within Flyte.

Flytekit Python SDK automatically maps Python type-hints to Flyte-understood, cross-language types. You can refer to the documentation to see all of the types that map from Python to Flyte’s type system.

To understand why we value typing in Flyte, let’s analyze a function in Python that has no types.

def concat(a, b):
     return a + b

Looking at this function, we don’t understand what a and b are. Precisely, we cannot accurately predict the types—they can be integers, floats, or strings because concatenation can happen in all three of them. When we try to develop code on top of this, we may need to reexamine this code and emulate a type system in our heads whenever we utilize the output returned by this function, which can get tedious. It is also difficult to debug when an error occurs because we may have to analyze all the data types to understand the error source.

All in all, types can help eliminate this tediousness to some extent. Consider the following function with types included:

def concat(a: int, b: int) -> int:
     return a + b

Ah! We now understand that a and b are integers. We can decipher what the function output is at a glance! It’s neat, easy to understand, and easy to debug.

For that reason, Flyte requires the usage of types. It supports strong data typing and has a robust type system to support multiple types.

Look at the following example that showcases the use of dataclass_json as a type in Flyte:

import typing
from dataclasses import dataclass
from dataclasses_json import dataclass_json
from flytekit import task, workflow
class Datum(object):
     Example of a simple custom class that is modeled as a dataclass.
     x: int
     y: str
     z: typing.Dict[int, str]
def stringify(x: int) -> Datum:
     A dataclass return will be regarded as a complex single JSON return.
     return Datum(x=x, y=str(x), z={x: str(x)})
def add(x: Datum, y: Datum) -> Datum:
     Flytekit will automatically convert the passed in JSON to a DataClass. If the structures do not match, it will raise a runtime failure.
     return Datum(x=x.x + y.x, y=x.y + y.y, z=x.z)
def wf(x: int, y: int) -> Datum:
     Dataclasses (JSON) can be returned from a workflow.
     return add(x=stringify(x=x), y=stringify(x=y))
if __name__ == "__main__":
     This workflow can be run locally. During local execution, the dataclasses are marshalled to and from json.
     wf(x=10, y=20)

Extending the type system is easy, and Flyte has many custom types that are compact and hence, easy to use.

Here is an example of FlyteFile—a custom file type defined in Flyte.

from flytekit import task, workflow
from flytekit.types.file import FlyteFile
def t1(f: FlyteFile) -> str:
     with open(f) as fp:
          data = fp.readlines()
     return data[0]
def wf() -> str:
     return t1(
if __name__ == "__main__":
     print(f"Running {__file__} main...")
     print(f"Running wf(), first line of the file is '{wf()}'")

FlyteFile tries to download the remote CSV file; however, the URL in the code snippet doesn’t exist (the actual URL is https://raw.githubusercontent.com/mwaskom/seaborn-data/master/iris.csv). On running the code, we see the following error:

flytekit.common.exceptions.user.FlyteAssertion: Failed to get data from
https://raw.githubusercontent.com/mwaskom/seaborndata/master/iris.csv to
/var/folders/6r/9pdkgpkd5nx1t34ndh1f_3q80000gn/T/flyteabhpuq8t/20211011_163211/local_flytekit/d50f36f4119018dda42d601f76ea0999/iris.csv (recursive=False).
Original exception: Value error! Received: 404. Request for data
@ https://raw.githubusercontent.com/mwaskom/seaborndata/master/iris.csv failed.
Expected status code 200

Thus, with the Flyte type system, we can eliminate disruptions in data pipelines due to the simplification of the debugging process and code readability.

We understand that not all software can be migrated to using types overnight. Hence, we have been working on providing enough flexibility to the typing system in Flyte by adding support to pickle arbitrary Python data types, which will be available in Flytekit 0.24.0. However, we recommend that you plan to include types for maintainability. 

Reproducibility and Fault Tolerance


From the perspective of machine learning or data pipelines, going back in time to access multiple versions of our work is of utmost importance because sometimes a previous version performs better than newer iterations. ML is a vast area where we can churn tons of data across multiple pipelines and algorithms. Likewise, data pipelines deal with lots of data processing. In such cases, we ought to meticulously manage our work (at the least, code and data), and due to the inherent complexity of ML, this task gets even more challenging.

Flyte inherently supports reproducibility. Reproducibility is all about the ability to “reproduce” or “reuse” our work. The idea of including reproducibility in Flyte stemmed from the problems that the founder members of the Flyte team faced at Lyft. One notable example we observed was when one of our colleagues had left Lyft. This colleague developed a system that quantized the pickup and drop points using Geohash and estimated the time required to commute between them. Due to the ad-hoc nature of ML back then, to reproduce the result, the team spent months of effort only to resurrect the original algorithm.

This is why Flyte is built for deterministic compute, which plays a crucial role in ML. We want our algorithm to produce the same set of outputs, given the same set of inputs. To achieve this, we need to plug the relevant reproducibility mechanisms into our system.

Here’s how we support reproducibility in Flyte:

  • Every task runs in an isolated environment; this allows tasks not to override each other
  • The task code is captured as a point in time snapshot using Docker containers, Protobufs, and a robust versioning system

Fault Tolerance

When an abrupt failure interrupts our workflows, we don’t want a loss of service. Fault tolerance is all about tolerating failures. It enables a service to run continuously without any hiccups.

In machine learning and data pipelines, the code and compute complexity demands a fault tolerance service to be in place. Flyte fully understands that; it doesn’t have a single point of failure and has some in-built fault-tolerance mechanisms. Hence, Flyte ensures that the platform would be resilient by design.

Here’s how we support fault tolerance in Flyte:

  • User and System Retries
  • Timeouts
  • Caching/Memoization
  • Guaranteed recovery of any past workflow execution up to the failure point

The following blogs talk about reproducibility and fault tolerance in detail: 

Incremental Development

Incremental development is the process of developing a system piece by piece. When building complex models through ML or data pipelines, we’d want to do it step by step by working on various modules, building one on top of the other. 

Flyte firmly abides by incremental development. Here are a few examples:

  • Start locally, run the code in Python, and move on to deploying the model to production
  • Run one single task in remote Flyte, and then run a workflow
  • Flyte supports the use of domains: development, staging, and production
  • Run entire workflow, cache successful steps, and iterate on the failures


Caching helps in faster retrieval of work, faster executions, and minimal waste of computing resources. Caching is useful when the same task needs to be repeated with the same exact input.

Flyte has a neat way of enabling cache. Here’s an example:

@task(cache=True, cache_version="1.0")
def fetch_dataset(...) -> ...:

cache_version field indicates that the task functionality has changed. Bumping the cache_version is similar to invalidating the cache.


This article just skims the surface of Flyte’s capabilities. It’s a Kubernetes-native platform that supports end-to-end workflow orchestration, from transforming raw data into a usable form, to deploying a model in production to serve your users. Once deployed on your system, it acts as a perfect companion to the ML and Data processing pipelines. Lyft, Spotify, Freenome, Blackshark.ai, and many others have already deployed Flyte on a production scale where millions of workflows are running successfully!

Learn more about Flyte at our upcoming talk at ODSC West 2021. We will conduct a Deep Dive into Flyte hands-on workshop by deep diving into Flyte’s features, concepts, and extensibility mechanisms.

To get started, refer to the documentation. Join Flyte’s Slack or Twitter to stay tuned for the updates!

About the authors/ODSC West 2021 speakers on Flyte:

Ketan Umare is the TSC Chair for Flyte (incubating under LF AI & Data). He is also the Co-founder and CEO of Union.ai. Previously he had multiple Senior Lead roles at Lyft, Oracle and Amazon ranging from Cloud, Distributed storage, Mapping (map making) and machine learning systems. He is passionate about building software that makes developer and other engineers’ lives easier and provides simplified access to large scale systems. With Flyte he is trying to bridge the gap from ideation to productionization for data and ML pipelines and bring a battle tested approach and structure to the data and ML world.

Besides software, he is a proud father, husband and enjoys traveling and outdoor activities.

Samhita is a Machine Learning Engineer and Tech Evangelist at Union.ai. Previously, she worked as a tools developer at Oracle. She likes to architect applications, write blogs, and speak about tech. She has self-published a book that digs into the intricacies of the tech industry. Her research revolves around Problem Solving, Web development, and Machine Learning.