Next-Generation Big Data Pipelines with Prefect and Dask

Abstract: 

Data pipelines are crucial to an organization’s data science efforts. They ensure that data is collected and organized accurately and on time, and that it is available for analysis and modeling. In many cases, these pipelines require parallel computing, either because they involve “big compute” (many tasks to execute in parallel) or “big data” (large datasets that have to be processed in chunks). In this talk we’ll introduce the next-generation stack for big data pipelines built upon Prefect and Dask, and compare it to popular tools like Spark, Airflow, and the Hadoop ecosystem. We’ll discuss the pros and cons of each, then take a deep dive into Prefect and Dask.

Dask is a Python-native parallel computing framework that can distribute anything from arbitrary Python functions to high-level DataFrame and Array objects. It also has machine learning modules that are optimized to take advantage of these distributed data structures. Prefect is a workflow management system created by engineers who contributed to Airflow, and it was specifically designed to address some of Airflow's shortcomings. It is built around the “negative engineering” paradigm: it takes care of all the little things that might go wrong in a data pipeline. When computations need to be distributed, Prefect integrates with Dask clusters through its executor interface.
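
As a rough illustration of the "arbitrary Python functions" case described above, here is a minimal sketch that uses dask.delayed to build a small extract/transform/load task graph and run it in parallel. The function names (extract, transform, load, build_pipeline) and the toy data are hypothetical and chosen only for this example; they are not taken from the talk. The sketch uses Dask's local threaded scheduler, so no cluster is assumed.

```python
from dask import delayed


@delayed
def extract(chunk_id):
    # Hypothetical data source: each chunk is three consecutive integers.
    start = chunk_id * 3
    return list(range(start, start + 3))


@delayed
def transform(rows):
    # Square every value in the chunk.
    return [r * r for r in rows]


@delayed
def load(parts):
    # Combine all transformed chunks into a single total.
    return sum(sum(p) for p in parts)


def build_pipeline(n_chunks):
    # Nothing runs here; the calls only build a lazy task graph.
    parts = [transform(extract(i)) for i in range(n_chunks)]
    return load(parts)


# Independent chunks can run in parallel on the local thread pool.
total = build_pipeline(4).compute(scheduler="threads")
print(total)  # 506, the sum of squares of 0..11
```

Because the graph is built lazily, the same pipeline could in principle be sent to a distributed Dask cluster instead of local threads. A workflow tool such as Prefect adds scheduling, retries, and failure handling around tasks like these.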

Bio: 

Aaron Richter is a software developer turned data engineer and data scientist. He has pioneered the development and implementation of large-scale data science infrastructure in both business and research environments. Inevitably, he has spent a lot of time finding efficient ways to clean data, run pipelines, and tune models. Aaron is currently a Senior Data Scientist at Saturn Cloud, where he works to make data scientists faster and happier. He holds a PhD in machine learning from Florida Atlantic University.

Open Data Science