What Is Stream Processing?
Stream processing is a computational model in which data is processed continuously as it arrives: each event or small micro-batch is handled immediately upon ingestion, rather than accumulated and processed periodically in bulk. Stream processing systems maintain persistent state (aggregations, join buffers, session windows) across events, apply time-based operations (tumbling windows, sliding windows, session windows), and emit results continuously to downstream systems.
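One of those time-based operations, the tumbling window, can be sketched in a few lines of plain Python. This is a conceptual illustration, not Flink or Spark API code; the function name `tumbling_window_counts` and the sample events are invented for the example. Each event falls into exactly one fixed-size, non-overlapping window, and state (the running counts) persists across events:

```python
from collections import defaultdict

def tumbling_window_counts(events, window_seconds):
    """Assign each (timestamp, key) event to a fixed, non-overlapping
    window and count events per (window_start, key)."""
    counts = defaultdict(int)
    for ts, key in events:
        # Tumbling windows partition the time axis into equal slots.
        window_start = (ts // window_seconds) * window_seconds
        counts[(window_start, key)] += 1
    return dict(counts)

events = [(0, "click"), (3, "click"), (7, "view"), (12, "click")]
# 10-second tumbling windows: [0, 10) and [10, 20)
result = tumbling_window_counts(events, 10)
# {(0, 'click'): 2, (0, 'view'): 1, (10, 'click'): 1}
```

A sliding window differs only in that an event can belong to several overlapping windows; a session window closes after a gap of inactivity rather than at a fixed boundary.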
In the data lakehouse, stream processing is the ingestion layer that feeds real-time data into Bronze Apache Iceberg tables. Apache Kafka delivers events from operational systems; stream processing engines (Apache Flink, Spark Structured Streaming) consume those events and write them to Iceberg with low latency.
Flink Stream Processing to Iceberg
Apache Flink is the leading stream processing engine for lakehouse ingestion. Key capabilities:
- Exactly-once semantics: Flink's checkpointing mechanism ensures that even after failures, each event's effect appears exactly once in the output Iceberg table (no duplicates, no losses)
- Iceberg sink connector: Native Flink Iceberg sink supports both append-only (Bronze CDC ingestion) and UPSERT mode (Silver current-state CDC tables)
- Watermark-based event time: Flink processes events by event time (when they occurred) rather than processing time (when they arrive), using watermarks to handle late-arriving events and keep time-series aggregations accurate
- Stateful stream joins: Flink can join streams with lookups against Iceberg tables (dimension lookups) or join two streams together within time windows
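The watermark idea above can be illustrated with a simplified pure-Python model (this is not Flink's actual API; `process_with_watermark` and the sample stream are invented for the sketch). The watermark trails the maximum event time seen so far by an allowed-lateness bound; events whose timestamps fall behind the watermark are classified as late:

```python
def process_with_watermark(events, allowed_lateness):
    """Classify events as on-time or late using a simple watermark:
    watermark = (max event time seen so far) - allowed_lateness.
    Events with event_time < watermark are considered late."""
    watermark = float("-inf")
    on_time, late = [], []
    for event_time, payload in events:
        if event_time < watermark:
            late.append((event_time, payload))
        else:
            on_time.append((event_time, payload))
            watermark = max(watermark, event_time - allowed_lateness)
    return on_time, late

# Out-of-order arrival: the event stamped t=2 arrives after t=10
# has already been seen, so it falls behind the watermark (10 - 3 = 7).
stream = [(5, "a"), (10, "b"), (8, "c"), (2, "d")]
on_time, late = process_with_watermark(stream, allowed_lateness=3)
# on_time == [(5, 'a'), (10, 'b'), (8, 'c')], late == [(2, 'd')]
```

A real engine would route late events to a side output or drop them, and would use the watermark to decide when a window's result is final.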
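The exactly-once guarantee can also be sketched conceptually: if committed source offsets are tracked alongside the sink, replaying input after a failure produces no duplicates. This mirrors the *effect* of checkpoint-based exactly-once delivery, not Flink's actual mechanism; `apply_exactly_once` and the record layout `(partition, offset, value)` are invented for the illustration:

```python
def apply_exactly_once(sink, committed_offsets, records):
    """Replay-safe apply: skip any record whose (partition, offset)
    was already committed, so reprocessing after a failure leaves
    the sink free of duplicates."""
    for partition, offset, value in records:
        if offset <= committed_offsets.get(partition, -1):
            continue  # already applied before the failure
        sink.append(value)
        committed_offsets[partition] = offset
    return sink

sink, offsets = [], {}
batch = [(0, 0, "a"), (0, 1, "b")]
apply_exactly_once(sink, offsets, batch)
# Simulated restart: the same batch is replayed plus one new record.
apply_exactly_once(sink, offsets, batch + [(0, 2, "c")])
# sink == ['a', 'b', 'c']  (no duplicate 'a' or 'b')
```

In the real system, the offset bookkeeping and the Iceberg commit happen atomically as part of the checkpoint, which is what makes the replay safe.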

Micro-Batch vs True Streaming
Two stream processing models exist for lakehouse ingestion:
- True streaming (event-by-event): Apache Flink's default model. Each event is processed immediately as it arrives, minimizing latency. Used for time-critical workloads.
- Micro-batch streaming: Spark Structured Streaming's model. Events are accumulated over a short interval (seconds to minutes) and processed as a small batch. Higher throughput per resource unit, at slightly higher latency. Simpler to reason about, since each micro-batch is effectively a mini-ETL job.
For Iceberg ingestion, micro-batch Spark Structured Streaming is often sufficient (providing minute-level freshness) and simpler to operate than Flink. Flink's true streaming model is the better fit when second-level freshness is required or when complex stateful operations (session detection, fraud pattern matching) are needed.
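The contrast between the two models can be sketched in plain Python (a conceptual illustration, not engine code; `micro_batches` and `process_event` are invented names, and batch size stands in for the time-based trigger interval real engines use):

```python
def micro_batches(events, batch_size):
    """Group a stream into fixed-size micro-batches, the way a
    micro-batch engine accumulates input until its trigger fires.
    (Real engines trigger on time intervals; a size threshold is
    used here so the example is deterministic.)"""
    batch = []
    for event in events:
        batch.append(event)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # flush the final partial batch
        yield batch

def process_event(event):
    """Stand-in for the per-record transformation."""
    return event * 2

# True streaming: one result is emitted per event, immediately.
streamed = [process_event(e) for e in [1, 2, 3, 4, 5]]

# Micro-batch: the same work, applied one small batch at a time.
batched = [[process_event(e) for e in b]
           for b in micro_batches([1, 2, 3, 4, 5], 2)]
# streamed == [2, 4, 6, 8, 10]; batched == [[2, 4], [6, 8], [10]]
```

Both models compute the same results; they differ in when results become visible and in how much per-event overhead is amortized, which is exactly the latency/throughput trade-off described above.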

Summary
Stream processing is the real-time ingestion capability that keeps lakehouse data fresh for operational analytics. By combining Apache Kafka as the event transport layer with Apache Flink or Spark Structured Streaming for continuous processing and Apache Iceberg as the ACID-safe destination, organizations achieve second-to-minute data freshness in their Bronze tables. They do so without sacrificing the governance, time travel, and schema evolution capabilities that make Iceberg the superior lakehouse table format.