Abstract: Using Spark, Dask, or Ray is not an all-or-nothing thing. It may seem daunting for new practitioners expecting to translate existing Pandas pipelines to these big data frameworks. In reality, distributed computing can be incrementally adopted. There are many use cases where only one or two steps of a pipeline require expensive computation. This talk covers the strategies and best practices around moving portions of workloads to distributed computing through the open-source Fugue project. The Fugue API has a suite of standalone functions compatible with Pandas, Spark, Dask, and Ray. Collectively, these functions allow users to scale any part of their pipeline when ready for full-scale production workloads on big data.
Data practitioners often find Pandas becoming a bottleneck in their data workloads. Either data becomes too big for Pandas to handle effectively, or computationally expensive workloads can take hours to run. These scenarios call for distributed computing frameworks such as Spark, Dask, or Ray to accelerate. The problem then becomes how to migrate already existing code to these frameworks that have different syntax. In this talk, we approach the migration in a strategic way, only moving portions that truly benefit from additional resources. These are cases like:
Training several machine learning models in parallel on small data
Expensive feature engineering on each group of data, but small machine learning models
Downsampling big data to smaller data with stratified sampling
In all the above scenarios, only a single step requires cluster and distributed computing. The Fugue open-source project focused on this over the last two years, allowing data practitioners to port code to Spark with one function call. This function was called transform(). However, users still faced friction in still having to write Spark, Dask, or Ray code around the transform() step in several cases. The learning curve to utilize the these engines was still present.
As a result, Fugue released a new API with 60 standalone functions that are compatible across Pandas, Spark, Dask, and Ray. They collectively provide a minimal interface for users to distribute some steps of their pipelines. By being both compatible with Pandas and big data frameworks, users can still retain the quick iteration speed of Pandas. When ready for production, the code can be executed on a cluster with just one line of code change.
The newly released Fugue API is intuitive and incrementally adoptable, removing the need to fully learn distributed computing framework to move workloads to the distributed setting. In this talk, we'll show real workloads that can be written in a few lines of code.
Bio: Han Wang is the tech lead of Lyft Machine Learning Platform, focusing on distributed computing solutions. Before joining Lyft, he worked at Microsoft, Hudson River Trading, Amazon and Quantlab. Han is the creator of the Fugue project, aiming at democratizing distributed computing and machine learning.