Designing ETL Pipelines with Delta Lake and Structured Streaming — How to Architect Things Right


Structured Streaming has proven to be the best framework for building distributed stream processing applications. Its unified SQL/Dataset/DataFrame APIs and Spark's built-in functions make it easy for developers to express complex computations. Delta Lake, on the other hand, is the best way to store structured data: it is an open-source storage layer that brings ACID transactions to Apache Spark and big data workloads. Together, they make it very easy to build pipelines for many common scenarios.

However, expressing the business logic is only part of the larger problem of building end-to-end streaming pipelines that interact with a complex ecosystem of storage systems and workloads. It is important for the developer to truly understand the business problem that needs to be solved. Apache Spark, being a unified analytics engine for both batch and stream processing, often provides multiple ways to solve the same problem, so understanding the requirements carefully helps you architect a pipeline that meets your business needs in the most resource-efficient manner. In this talk, I am going to examine a number of common streaming design patterns in the context of the following questions.
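To make the kind of pipeline discussed here concrete, below is a minimal sketch of a common pattern: continuously ingesting raw JSON files into a Delta table with Structured Streaming. The schema, paths, and function name are illustrative assumptions, not details from the talk.

```python
def raw_to_bronze(spark, source_path, bronze_path, checkpoint_path):
    """Continuously append raw JSON records to a Delta 'bronze' table.

    `spark` is an active SparkSession (with the Delta Lake package
    configured); the paths and the schema below are hypothetical.
    """
    # Streaming file sources require an explicit schema up front.
    raw = (spark.readStream
                .schema("device STRING, reading DOUBLE, ts TIMESTAMP")
                .json(source_path))

    # Append-only write into Delta; the checkpoint location is what
    # gives the query exactly-once, restartable semantics.
    return (raw.writeStream
               .format("delta")
               .option("checkpointLocation", checkpoint_path)
               .outputMode("append")
               .start(bronze_path))
```

The same readStream/writeStream shape applies whether the sink is a Delta table, Kafka, or a dashboard; what changes per the talk's "what/why/how" questions is the source, the trigger, and how much you are willing to pay to meet the latency target.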

Session Outline:

- WHAT are you trying to consume? What are you trying to produce? What is the final output that the business wants? What are your throughput and latency requirements?
- WHY do you really have those requirements? Would solving the requirements of the individual pipeline actually solve your end-to-end business requirements?
- HOW are you going to architect the solution? And how much are you willing to pay for it?

Background Knowledge:

Clarity in understanding the 'what and why' of any problem can automatically bring much clarity on the 'how' to architect it using Structured Streaming and, in many cases, Delta Lake.


Tathagata Das is a Staff Software Engineer at Databricks, an Apache Spark™ committer and a member of the Apache Spark Project Management Committee (PMC). He is one of the original developers of Apache Spark, the lead developer of Spark Streaming (DStreams), and is currently one of the core developers of Structured Streaming and Delta Lake.

Previously, he was a grad student in the AMPLab at UC Berkeley, where he conducted research on data-center frameworks and networks with Scott Shenker and Ion Stoica. He is also an author of the book "Learning Spark, 2nd Edition", published by O'Reilly.

Open Data Science
One Broadway
Cambridge, MA 02142
