What Is Change Data Capture?
Change Data Capture (CDC) is a data integration technique that captures row-level changes from operational databases in near-real-time by reading the database's transaction log (called the Write-Ahead Log, or WAL, in PostgreSQL; the binlog in MySQL). Instead of periodically querying the source database for changed records (slow, expensive, misses deletes), CDC monitors the WAL — the same log the database itself uses for crash recovery — and publishes each committed INSERT, UPDATE, or DELETE as a structured change event.
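To make the event structure concrete, here is a simplified sketch of a Debezium UPDATE event for a hypothetical inventory.products table (the optional schema envelope is omitted), parsed with Jackson to pull out the operation type and the after-image; the field values are illustrative.

```java
import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;

public class DebeziumEventShape {
    public static void main(String[] args) throws Exception {
        // Simplified Debezium change event for an UPDATE on a hypothetical
        // inventory.products table (optional schema envelope omitted).
        String event = """
            {
              "before": { "id": 42, "name": "widget", "quantity": 7 },
              "after":  { "id": 42, "name": "widget", "quantity": 5 },
              "source": { "db": "inventory", "table": "products", "lsn": 24023128 },
              "op": "u",
              "ts_ms": 1700000000000
            }
            """;

        JsonNode payload = new ObjectMapper().readTree(event);
        String op = payload.get("op").asText();  // "c" insert, "u" update, "d" delete, "r" snapshot read
        JsonNode after = payload.get("after");   // row image after the change; null for deletes

        System.out.println("op=" + op + ", new quantity=" + after.get("quantity").asInt());
    }
}
```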
CDC is the key technology enabling the transition from batch ETL (where lakehouse data is hours or days old) to near-real-time lakehouse analytics (where data is seconds or minutes old). For business use cases requiring current operational data — inventory levels, real-time order status, live customer activity — CDC is the pipeline pattern that makes the lakehouse competitive with direct database queries for freshness.
CDC Architecture: Debezium → Kafka → Flink → Iceberg
The standard lakehouse CDC architecture has four components:
- Debezium: Connects to source databases (PostgreSQL, MySQL, etc.) as a logical replication client. Reads WAL entries and publishes change events (before/after row images plus change type: INSERT/UPDATE/DELETE) to Kafka topics in Debezium JSON or Avro format.
- Apache Kafka: The durable, scalable message queue that buffers change events between Debezium and downstream consumers. Provides fault tolerance and allows multiple consumers (Flink, other processors) to read the same events independently.
- Apache Flink: Consumes the Debezium change events from Kafka, typically through Flink's Kafka connector with the debezium-json (or debezium-avro-confluent) format; the separate flink-cdc-connectors library instead embeds Debezium and reads the source database directly, bypassing Kafka. Applies optional transformations, then writes to Iceberg with the Flink Iceberg sink in upsert mode: inserts append new rows, while updates and deletes produce equality delete files (updates also write the new row image). A minimal sketch of this leg of the pipeline follows the list.
- Apache Iceberg: The destination Silver table maintains current-state records. V2 equality delete files record deleted and updated rows; periodic compaction rewrites the affected data files with those deletes applied, preserving read performance.
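Here is a minimal sketch of that Flink leg, written against the Table API. It assumes Debezium is already publishing change events for a hypothetical inventory.products table to the dbserver1.inventory.products topic and that the target Iceberg V2 table already exists (a DDL sketch appears in the next section); topic, catalog, and connection settings are placeholders, and exact connector options vary across Flink and Iceberg versions.

```java
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class ProductsCdcToIceberg {
    public static void main(String[] args) {
        TableEnvironment tEnv =
            TableEnvironment.create(EnvironmentSettings.inStreamingMode());

        // Kafka source carrying Debezium-formatted change events for inventory.products.
        tEnv.executeSql(
            "CREATE TABLE products_cdc (" +
            "  id BIGINT," +
            "  name STRING," +
            "  quantity INT" +
            ") WITH (" +
            "  'connector' = 'kafka'," +
            "  'topic' = 'dbserver1.inventory.products'," +
            "  'properties.bootstrap.servers' = 'kafka:9092'," +
            "  'properties.group.id' = 'silver-products-sync'," +
            "  'scan.startup.mode' = 'earliest-offset'," +
            "  'format' = 'debezium-json'" +
            ")");

        // Iceberg catalog that holds the Silver table (table DDL shown in the next section).
        tEnv.executeSql(
            "CREATE CATALOG lakehouse WITH (" +
            "  'type' = 'iceberg'," +
            "  'catalog-type' = 'hive'," +
            "  'uri' = 'thrift://metastore:9083'," +
            "  'warehouse' = 's3://lake/warehouse'" +
            ")");

        // Continuous upsert into the Silver table: the changelog produced by the
        // debezium-json format turns inserts into appends and updates/deletes into
        // equality deletes on the sink side.
        tEnv.executeSql(
            "INSERT INTO lakehouse.silver.products " +
            "SELECT id, name, quantity FROM products_cdc");
    }
}
```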

CDC and Iceberg V2 DML
CDC pipelines require Iceberg's V2 table format, which provides the row-level UPDATE and DELETE support they depend on (see the DDL sketch after this list):
- Inserts: New records from CDC are appended as new data files using standard Iceberg append operations
- Updates: CDC updates generate an equality delete file (marking the old row as deleted by its primary key) plus a new data file (the updated row) — this is Merge-on-Read semantics
- Deletes: CDC deletes generate equality delete files marking the deleted row's primary key
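One way to provision a table that supports these semantics is the Flink SQL DDL sketched below, issued through the Table API: 'format-version' = '2' opts into row-level deletes, and 'write.upsert.enabled' = 'true' makes the sink emit equality deletes keyed on the declared primary key. The catalog settings and names are illustrative assumptions carried over from the pipeline sketch above, not the only valid configuration.

```java
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class CreateSilverProducts {
    public static void main(String[] args) {
        TableEnvironment tEnv =
            TableEnvironment.create(EnvironmentSettings.inStreamingMode());

        // Register the Iceberg catalog (same placeholder settings as the pipeline sketch).
        tEnv.executeSql(
            "CREATE CATALOG lakehouse WITH (" +
            "  'type' = 'iceberg'," +
            "  'catalog-type' = 'hive'," +
            "  'uri' = 'thrift://metastore:9083'," +
            "  'warehouse' = 's3://lake/warehouse'" +
            ")");

        // V2 table with upsert writes: the primary key becomes the equality field
        // recorded in equality delete files for updates and deletes.
        tEnv.executeSql(
            "CREATE TABLE lakehouse.silver.products (" +
            "  id BIGINT," +
            "  name STRING," +
            "  quantity INT," +
            "  PRIMARY KEY (id) NOT ENFORCED" +
            ") WITH (" +
            "  'format-version' = '2'," +
            "  'write.upsert.enabled' = 'true'" +
            ")");
    }
}
```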
Equality delete files accumulate over time, so periodic compaction is required to rewrite the affected data files with the deletes applied. Without compaction, queries on heavily updated Silver tables must merge many delete files at read time to determine current row state.
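As one sketch of how that compaction can be scheduled, the snippet below uses Iceberg's Flink maintenance action to rewrite data files. Whether a given Iceberg version folds equality deletes into the rewritten files through this action varies, and many deployments instead run Spark's rewrite_data_files procedure or Dremio's OPTIMIZE TABLE; the warehouse path here is a placeholder.

```java
import org.apache.iceberg.Table;
import org.apache.iceberg.flink.TableLoader;
import org.apache.iceberg.flink.actions.Actions;

public class CompactSilverProducts {
    public static void main(String[] args) {
        // Load the Silver table; a filesystem (Hadoop) table path is used here for
        // brevity, but TableLoader can also load the table from a Hive catalog.
        TableLoader tableLoader =
            TableLoader.fromHadoopTable("s3://lake/warehouse/silver/products");
        tableLoader.open();
        Table table = tableLoader.loadTable();

        // Submit a Flink batch job that rewrites small data files into larger ones,
        // reducing the number of files readers must merge at query time.
        Actions.forTable(table)
            .rewriteDataFiles()
            .execute();
    }
}
```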

Summary
Change Data Capture is the data integration technology that makes the data lakehouse competitive with operational databases for data freshness. The Debezium → Kafka → Flink → Iceberg pipeline delivers near-real-time data from operational source databases into governed, analytically optimized Silver Iceberg tables — without the latency of batch ETL windows or the performance impact of direct database queries. Combined with compaction for read performance and Dremio's query engine for BI analytics, CDC-fed Iceberg tables close the gap between operational and analytical data to seconds.