What Is Apache Kafka?
Apache Kafka is an open-source distributed event streaming platform, originally developed at LinkedIn and donated to the Apache Software Foundation. Kafka stores events (messages) in partitioned, append-only logs called topics; each partition is an ordered, immutable sequence of events that producers write to and consumers read from independently, at any time, and at any pace.
Kafka's design provides four properties critical for lakehouse ingestion pipelines (the consumer sketch after this list illustrates the last two):
Durability: events are replicated across brokers and persisted to disk, so an accepted event survives individual machine failures.
Scalability: topics are partitioned across brokers, letting a cluster absorb millions of events per second.
Replayability: consumers track their own offsets and can re-read events from any point in a topic's retained history.
Fan-out: multiple independent consumers can read the same topic simultaneously (Flink for Iceberg writes, a monitoring consumer for alerting, an ML consumer for real-time feature computation).
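To make replayability and fan-out concrete, here is a minimal Java sketch of a consumer that rewinds to the start of a partition and re-reads history. The broker address, topic name, and group id are hypothetical placeholders:

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.serialization.StringDeserializer;

public class ReplayConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed local broker
        props.put("group.id", "replay-demo");             // hypothetical group id
        props.put("key.deserializer", StringDeserializer.class.getName());
        props.put("value.deserializer", StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            // Replayability: pin a specific partition and rewind to offset 0,
            // re-reading the topic's full retained history regardless of any
            // offsets this group committed earlier.
            TopicPartition partition = new TopicPartition("user-events", 0); // hypothetical topic
            consumer.assign(List.of(partition));
            consumer.seekToBeginning(List.of(partition));

            // Fan-out: other consumers (a Flink job, an alerting service) read
            // the same topic under their own group ids, each at its own pace.
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(5));
            for (ConsumerRecord<String, String> record : records) {
                System.out.printf("offset=%d key=%s value=%s%n",
                        record.offset(), record.key(), record.value());
            }
        }
    }
}
```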
Kafka in the Lakehouse Ingestion Architecture
Kafka is the ingestion backbone of the real-time lakehouse — the central nervous system connecting data producers (operational systems, applications, devices) to data consumers (Flink, Spark, Iceberg sinks):
Producers → Kafka Topics: Application services emit events (user actions, transactions, IoT readings) to Kafka. Debezium connectors read database transaction logs (the PostgreSQL WAL, the MySQL binlog) and publish CDC events. SaaS webhook receivers forward events from external systems. A minimal producer sketch follows this list.
Kafka Topics → Iceberg Sinks: Apache Flink reads topics, applies transformations and stateful aggregations, and writes to Bronze Iceberg tables (sketched below, after the producer example). Spark Structured Streaming reads topics and writes micro-batches to Iceberg. Kafka Connect Iceberg sink connectors write directly to Iceberg tables, without Flink or Spark, for simple passthrough ingestion.
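The producer side can be as small as the following Java sketch of an application service emitting a user-action event. The broker address, topic name, and JSON payload are illustrative assumptions; keying by user id keeps each user's events in one partition and therefore in order:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class EventProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed local broker
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());
        props.put("acks", "all"); // wait for full replication: durability first

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Keyed by user id, so all of this user's events land in the same
            // partition and are consumed in the order they were produced.
            producer.send(new ProducerRecord<>(
                    "user-events", "user-42",
                    "{\"action\":\"login\",\"ts\":\"2024-05-01T12:00:00Z\"}"));
            producer.flush();
        }
    }
}
```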

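On the consuming side, here is a sketch of the Flink path, written as Flink SQL driven from Java: it declares the Kafka topic as a source table, an Iceberg catalog, and a continuous INSERT into a Bronze table. The table names, the local Hadoop-catalog warehouse, and the JSON format are assumptions for illustration, and the flink-sql-connector-kafka and iceberg-flink-runtime jars are assumed to be on the classpath:

```java
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class KafkaToIcebergBronzeJob {
    public static void main(String[] args) {
        TableEnvironment tEnv = TableEnvironment.create(
                EnvironmentSettings.newInstance().inStreamingMode().build());

        // Kafka topic exposed as a dynamic source table.
        tEnv.executeSql(
            "CREATE TABLE user_events (" +
            "  user_id STRING, action STRING, ts TIMESTAMP(3)" +
            ") WITH (" +
            "  'connector' = 'kafka'," +
            "  'topic' = 'user-events'," +
            "  'properties.bootstrap.servers' = 'localhost:9092'," +
            "  'scan.startup.mode' = 'earliest-offset'," +
            "  'format' = 'json')");

        // Iceberg catalog and Bronze target table.
        tEnv.executeSql(
            "CREATE CATALOG lakehouse WITH (" +
            "  'type' = 'iceberg'," +
            "  'catalog-type' = 'hadoop'," +
            "  'warehouse' = 'file:///tmp/warehouse')");
        tEnv.executeSql(
            "CREATE TABLE IF NOT EXISTS lakehouse.`default`.bronze_user_events (" +
            "  user_id STRING, action STRING, ts TIMESTAMP(3))");

        // Continuous insert: every event on the topic lands in the Bronze table.
        tEnv.executeSql(
            "INSERT INTO lakehouse.`default`.bronze_user_events " +
            "SELECT user_id, action, ts FROM user_events");
    }
}
```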
Kafka Schema Registry and Iceberg
The Confluent Schema Registry (and its open-source equivalent, Apicurio) is a critical companion to Kafka in the lakehouse ingestion pipeline. It manages Avro schemas for Kafka messages, ensuring that producers and consumers agree on each message's structure and that schema changes follow a configured compatibility policy (backward compatibility by default).
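As a sketch of how this looks from a producer, the following uses Confluent's KafkaAvroSerializer, which registers the schema with the registry on first use and embeds the schema id in each message so consumers can fetch the exact writer schema. The registry URL, topic, and record schema are illustrative assumptions:

```java
import java.util.Properties;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class AvroEventProducer {
    // Hypothetical event schema; adding an optional field later is a
    // backward-compatible change the registry will accept.
    private static final String USER_EVENT_SCHEMA =
        "{\"type\":\"record\",\"name\":\"UserEvent\",\"fields\":[" +
        "{\"name\":\"user_id\",\"type\":\"string\"}," +
        "{\"name\":\"action\",\"type\":\"string\"}]}";

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        // Serializer from Confluent's kafka-avro-serializer artifact; it
        // registers the schema under the subject "user-events-value".
        props.put("value.serializer",
                "io.confluent.kafka.serializers.KafkaAvroSerializer");
        props.put("schema.registry.url", "http://localhost:8081"); // assumed registry

        Schema schema = new Schema.Parser().parse(USER_EVENT_SCHEMA);
        GenericRecord event = new GenericData.Record(schema);
        event.put("user_id", "user-42");
        event.put("action", "login");

        try (KafkaProducer<String, GenericRecord> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("user-events", "user-42", event));
        }
    }
}
```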
Schema Registry integration with Iceberg is valuable for schema evolution: when a producer adds a new field to its Avro schema (registered in Schema Registry), the Flink consumer detects the new field, and Iceberg's schema evolution capability allows the Bronze table schema to be updated to include the new column — without breaking existing queries on the table. This automatic schema propagation from Kafka Schema Registry → Flink → Iceberg schema evolution is one of the most powerful features of the Kafka-to-Iceberg pipeline.
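On the Iceberg side, the table change itself is a one-line additive update. Here is a sketch using Iceberg's Java API, with a hypothetical local Hadoop catalog and table name; in a real pipeline, the Flink job or an automation step would apply this when the new Avro field appears:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.iceberg.Table;
import org.apache.iceberg.catalog.TableIdentifier;
import org.apache.iceberg.hadoop.HadoopCatalog;
import org.apache.iceberg.types.Types;

public class AddColumnOnSchemaChange {
    public static void main(String[] args) {
        // Hypothetical catalog; in practice, the same one the Flink job writes through.
        HadoopCatalog catalog =
                new HadoopCatalog(new Configuration(), "file:///tmp/warehouse");
        Table table = catalog.loadTable(
                TableIdentifier.of("default", "bronze_user_events"));

        // Additive change mirroring the new optional Avro field. Existing data
        // files are untouched; old queries keep working and read NULL for the
        // new column in rows written before the change.
        table.updateSchema()
             .addColumn("device_type", Types.StringType.get())
             .commit();
    }
}
```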

Summary
Apache Kafka is the real-time ingestion platform that transforms the data lakehouse from a batch-oriented historical archive into a live, continuously updated operational intelligence platform. By acting as the durable, scalable message bus between operational systems and Apache Iceberg tables, Kafka enables CDC pipelines, application event streaming, and IoT data ingestion with seconds-level data freshness. Combined with Apache Flink for stream processing and Iceberg's V2 DML for CDC writes, Kafka completes the real-time lakehouse ingestion stack.