What Is Change Data Capture?
Change Data Capture (CDC) is a data integration technique that captures row-level changes from operational databases in near-real-time by reading the database's transaction log (called the Write-Ahead Log, or WAL, in PostgreSQL; the binlog in MySQL). Instead of periodically querying the source database for changed records (slow, expensive, misses deletes), CDC monitors the WAL — the same log the database itself uses for crash recovery — and publishes each committed INSERT, UPDATE, or DELETE as a structured change event.
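To make the event structure concrete, here is a simplified sketch of a Debezium UPDATE event for a hypothetical inventory.products table (the optional schema envelope is omitted), parsed with Jackson to pull out the operation type and the after-image; the field values are illustrative.

```java
import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;

public class DebeziumEventShape {
    public static void main(String[] args) throws Exception {
        // Simplified Debezium change event for an UPDATE on a hypothetical
        // inventory.products table (optional schema envelope omitted).
        String event = """
            {
              "before": { "id": 42, "name": "widget", "quantity": 7 },
              "after":  { "id": 42, "name": "widget", "quantity": 5 },
              "source": { "db": "inventory", "table": "products", "lsn": 24023128 },
              "op": "u",
              "ts_ms": 1700000000000
            }
            """;

        JsonNode payload = new ObjectMapper().readTree(event);
        String op = payload.get("op").asText();  // "c" insert, "u" update, "d" delete, "r" snapshot read
        JsonNode after = payload.get("after");   // row image after the change; null for deletes

        System.out.println("op=" + op + ", new quantity=" + after.get("quantity").asInt());
    }
}
```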
CDC is the key technology enabling the transition from batch ETL (where lakehouse data is hours or days old) to near-real-time lakehouse analytics (where data is seconds or minutes old). For business use cases requiring current operational data — inventory levels, real-time order status, live customer activity — CDC is the pipeline pattern that makes the lakehouse competitive with direct database queries for freshness.
CDC Architecture: Debezium → Kafka → Flink → Iceberg
The standard lakehouse CDC architecture has four components:
- Debezium: Connects to source databases (PostgreSQL, MySQL, etc.) as a logical replication client. Reads WAL entries and publishes change events (before/after row images plus change type: INSERT/UPDATE/DELETE) to Kafka topics in Debezium JSON or Avro format.
- Apache Kafka: The durable, scalable message queue that buffers change events between Debezium and downstream consumers. Provides fault tolerance and allows multiple consumers (Flink, other processors) to read the same events independently.
- Apache Flink: Consumes the Debezium change events from Kafka, typically through Flink's Kafka connector with the debezium-json (or debezium-avro-confluent) format; the separate flink-cdc-connectors library instead embeds Debezium and reads the source database directly, bypassing Kafka. Applies optional transformations, then writes to Iceberg with the Flink Iceberg sink in upsert mode: inserts append new rows, while updates and deletes produce equality delete files (updates also write the new row image). A minimal sketch of this leg of the pipeline follows the list.
- Apache Iceberg: The destination Silver table maintains current-state records. V2 equality delete files record deleted and updated rows; periodic compaction rewrites the affected data files with those deletes applied, preserving read performance.
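Here is a minimal sketch of that Flink leg, written against the Table API. It assumes Debezium is already publishing change events for a hypothetical inventory.products table to the dbserver1.inventory.products topic and that the target Iceberg V2 table already exists (a DDL sketch appears in the next section); topic, catalog, and connection settings are placeholders, and exact connector options vary across Flink and Iceberg versions.

```java
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class ProductsCdcToIceberg {
    public static void main(String[] args) {
        TableEnvironment tEnv =
            TableEnvironment.create(EnvironmentSettings.inStreamingMode());

        // Kafka source carrying Debezium-formatted change events for inventory.products.
        tEnv.executeSql(
            "CREATE TABLE products_cdc (" +
            "  id BIGINT," +
            "  name STRING," +
            "  quantity INT" +
            ") WITH (" +
            "  'connector' = 'kafka'," +
            "  'topic' = 'dbserver1.inventory.products'," +
            "  'properties.bootstrap.servers' = 'kafka:9092'," +
            "  'properties.group.id' = 'silver-products-sync'," +
            "  'scan.startup.mode' = 'earliest-offset'," +
            "  'format' = 'debezium-json'" +
            ")");

        // Iceberg catalog that holds the Silver table (table DDL shown in the next section).
        tEnv.executeSql(
            "CREATE CATALOG lakehouse WITH (" +
            "  'type' = 'iceberg'," +
            "  'catalog-type' = 'hive'," +
            "  'uri' = 'thrift://metastore:9083'," +
            "  'warehouse' = 's3://lake/warehouse'" +
            ")");

        // Continuous upsert into the Silver table: the changelog produced by the
        // debezium-json format turns inserts into appends and updates/deletes into
        // equality deletes on the sink side.
        tEnv.executeSql(
            "INSERT INTO lakehouse.silver.products " +
            "SELECT id, name, quantity FROM products_cdc");
    }
}
```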

CDC and Iceberg V2 DML
CDC pipelines require Iceberg's V2 table format, which provides the row-level UPDATE and DELETE support they depend on (see the DDL sketch after this list):
- Inserts: New records from CDC are appended as new data files using standard Iceberg append operations
- Updates: CDC updates generate an equality delete file (marking the old row as deleted by its primary key) plus a new data file (the updated row) — this is Merge-on-Read semantics
- Deletes: CDC deletes generate equality delete files marking the deleted row's primary key
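One way to provision a table that supports these semantics is the Flink SQL DDL sketched below, issued through the Table API: 'format-version' = '2' opts into row-level deletes, and 'write.upsert.enabled' = 'true' makes the sink emit equality deletes keyed on the declared primary key. The catalog settings and names are illustrative assumptions carried over from the pipeline sketch above, not the only valid configuration.

```java
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class CreateSilverProducts {
    public static void main(String[] args) {
        TableEnvironment tEnv =
            TableEnvironment.create(EnvironmentSettings.inStreamingMode());

        // Register the Iceberg catalog (same placeholder settings as the pipeline sketch).
        tEnv.executeSql(
            "CREATE CATALOG lakehouse WITH (" +
            "  'type' = 'iceberg'," +
            "  'catalog-type' = 'hive'," +
            "  'uri' = 'thrift://metastore:9083'," +
            "  'warehouse' = 's3://lake/warehouse'" +
            ")");

        // V2 table with upsert writes: the primary key becomes the equality field
        // recorded in equality delete files for updates and deletes.
        tEnv.executeSql(
            "CREATE TABLE lakehouse.silver.products (" +
            "  id BIGINT," +
            "  name STRING," +
            "  quantity INT," +
            "  PRIMARY KEY (id) NOT ENFORCED" +
            ") WITH (" +
            "  'format-version' = '2'," +
            "  'write.upsert.enabled' = 'true'" +
            ")");
    }
}
```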
Equality delete files accumulate over time, so periodic compaction is required to rewrite the affected data files with the deletes applied. Without compaction, queries on heavily updated Silver tables must merge many delete files at read time to determine current row state.
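As one sketch of how that compaction can be scheduled, the snippet below uses Iceberg's Flink maintenance action to rewrite data files. Whether a given Iceberg version folds equality deletes into the rewritten files through this action varies, and many deployments instead run Spark's rewrite_data_files procedure or Dremio's OPTIMIZE TABLE; the warehouse path here is a placeholder.

```java
import org.apache.iceberg.Table;
import org.apache.iceberg.flink.TableLoader;
import org.apache.iceberg.flink.actions.Actions;

public class CompactSilverProducts {
    public static void main(String[] args) {
        // Load the Silver table; a filesystem (Hadoop) table path is used here for
        // brevity, but TableLoader can also load the table from a Hive catalog.
        TableLoader tableLoader =
            TableLoader.fromHadoopTable("s3://lake/warehouse/silver/products");
        tableLoader.open();
        Table table = tableLoader.loadTable();

        // Submit a Flink batch job that rewrites small data files into larger ones,
        // reducing the number of files readers must merge at query time.
        Actions.forTable(table)
            .rewriteDataFiles()
            .execute();
    }
}
```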

Summary
Change Data Capture is the data integration technology that makes the data lakehouse competitive with operational databases for data freshness. The Debezium → Kafka → Flink → Iceberg pipeline delivers near-real-time data from operational source databases into governed, analytically optimized Silver Iceberg tables — without the latency of batch ETL windows or the performance impact of direct database queries. Combined with compaction for read performance and Dremio's query engine for BI analytics, CDC-fed Iceberg tables close the gap between operational and analytical data to seconds.