Next-Generation Big Data Pipelines with Prefect and Dask

Abstract: 

Data pipelines are crucial to an organization’s data science efforts. They ensure data is collected and organized in a timely and accurate manner, and is made available for analysis and modeling. In many cases, these pipelines require parallel computing. That might be because they involve “big compute” (many tasks to execute in parallel) or “big data” (large datasets which have to be processed in chunks). In this talk we’ll introduce the next-generation stack for big data pipelines built upon Prefect and Dask, and compare it to popular tools like Spark, Airflow, and the Hadoop ecosystem. We’ll discuss pros and cons of each, then take a deep dive into Prefect and Dask.

Dask is a Python-native parallel computing framework that can distribute computation of arbitrary Python functions up to high-level DataFrame and Array objects. It also has machine learning modules that are optimized to take advantage of these distributed data structures. Prefect is a workflow management system created by engineers who contributed to Airflow, and was specifically designed to address some of Airflow's shortcomings. It is built around the “negative engineering” paradigm - it takes care of all the little things that might go wrong in a data pipeline. Then when computations need to be distributed, Prefect integrates seamlessly with Dask clusters through its executor interface.

Bio: 

Aaron Richter is a software developer turned data engineer and data scientist. He has pioneered the development and implementation of large-scale data science infrastructure in both business and research environments. Inevitably, he spent a lot of time finding efficient ways to clean data, run pipelines, and tune models. Aaron is currently a Senior Data Scientist at Saturn Cloud, where he works to make data scientists faster and happier. He holds a PhD in machine learning from Florida Atlantic University.