Best Practices for CDC Pipelines to Apache Iceberg Tables

Abstract: 

Change data capture (CDC) is a technique for efficiently streaming changes from transactional database tables as they happen. Because distributed analytic engines like Trino or Spark can easily overwhelm transactional databases, CDC streams play a critical role: they supply data to mirror tables in analytic formats like Apache Iceberg.

This talk covers best practices for building CDC pipelines that maintain analytic tables while preserving the ability to work with the change records as a stream. It explains how to avoid pitfalls such as weak or eventual consistency, and how to structure the Apache Iceberg table so that it delivers fast query results.

You'll learn about:
* CDC patterns and how to choose one
* Common mistakes to avoid
* Iceberg tables and best practices

Bio: 

Ryan is the co-creator of Apache Iceberg and has spent the last decade working on big data infrastructure at Netflix, Cloudera, and now Tabular. He is an ASF member and a committer in the Apache Parquet, Avro, and Spark communities.

Open Data Science