Best Practices for CDC Pipelines to Apache Iceberg Tables


Change data capture (CDC) is a technique for efficiently streaming changes from transactional database tables as they happen. Because distributed analytic engines like Trino and Spark can easily overwhelm transactional databases, CDC streams play a critical role: they supply the data used to maintain mirror tables in analytic formats like Apache Iceberg.

This talk covers best practices for building CDC pipelines that maintain analytic tables while preserving the ability to work with the change records as a stream, including how to avoid pitfalls like weak or eventual consistency and how to structure the Apache Iceberg table to deliver fast query results.

You'll learn about:
* CDC patterns and how to choose one
* Common mistakes to avoid
* Iceberg tables and best practices
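To make the core idea concrete, here is a minimal, hypothetical sketch in plain Python of what a CDC pipeline logically does: replay an ordered stream of insert/update/delete change records against a mirror of the source table. The record shape (`op` plus a row keyed by `id`) is an assumption for illustration; in a real pipeline the records would come from a CDC tool and be merged into an Iceberg table rather than a dict.

```python
# Hypothetical CDC record shape: {"op": ..., "row": {...}} with "id" as the
# primary key. A dict keyed by primary key stands in for the mirror table.

def apply_cdc(mirror, changes):
    """Replay insert/update/delete change records, in order, onto the mirror."""
    for change in changes:
        op, row = change["op"], change["row"]
        key = row["id"]
        if op in ("insert", "update"):
            mirror[key] = row          # upsert: the last write for a key wins
        elif op == "delete":
            mirror.pop(key, None)      # deletes must be replayed too
        else:
            raise ValueError(f"unknown op: {op}")
    return mirror

mirror = {}
changes = [
    {"op": "insert", "row": {"id": 1, "name": "a"}},
    {"op": "insert", "row": {"id": 2, "name": "b"}},
    {"op": "update", "row": {"id": 1, "name": "a2"}},
    {"op": "delete", "row": {"id": 2, "name": "b"}},
]
apply_cdc(mirror, changes)
# mirror now holds only id 1, with its updated row
```

The key property the sketch illustrates is that ordering matters: applying the same records out of order (the delete before the inserts, say) yields a different mirror, which is why consistency guarantees are central to CDC pipeline design.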


Ryan is the co-creator of Apache Iceberg and has spent the last decade working on big data infrastructure at Netflix, Cloudera, and now Tabular. He is an ASF member and a committer in the Apache Parquet, Avro, and Spark communities.

Open Data Science
One Broadway
Cambridge, MA 02142
