What Is Stream Processing?
Stream processing is a computational model in which data is processed continuously as it arrives: each event or small micro-batch is handled immediately upon ingestion, rather than accumulated and processed periodically in bulk. Stream processing systems maintain persistent state (aggregations, join buffers, session windows) across events, apply time-based operations (tumbling windows, sliding windows, session windows), and emit results continuously to downstream systems.
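One of those time-based operations, the tumbling window, can be sketched in a few lines of plain Python. This is a conceptual illustration, not Flink or Spark API code; the function name `tumbling_window_counts` and the sample events are invented for the example. Each event falls into exactly one fixed-size, non-overlapping window, and state (the running counts) persists across events:

```python
from collections import defaultdict

def tumbling_window_counts(events, window_seconds):
    """Assign each (timestamp, key) event to a fixed, non-overlapping
    window and count events per (window_start, key)."""
    counts = defaultdict(int)
    for ts, key in events:
        # Tumbling windows partition the time axis into equal slots.
        window_start = (ts // window_seconds) * window_seconds
        counts[(window_start, key)] += 1
    return dict(counts)

events = [(0, "click"), (3, "click"), (7, "view"), (12, "click")]
# 10-second tumbling windows: [0, 10) and [10, 20)
result = tumbling_window_counts(events, 10)
# {(0, 'click'): 2, (0, 'view'): 1, (10, 'click'): 1}
```

A sliding window differs only in that an event can belong to several overlapping windows; a session window closes after a gap of inactivity rather than at a fixed boundary.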
In the data lakehouse, stream processing is the ingestion layer that feeds real-time data into Bronze Apache Iceberg tables. Apache Kafka delivers events from operational systems; stream processing engines (Apache Flink, Spark Structured Streaming) consume those events and write them to Iceberg with low latency.
Flink Stream Processing to Iceberg
Apache Flink is the leading stream processing engine for lakehouse ingestion. Key capabilities:
- Exactly-once semantics: Flink's checkpointing mechanism ensures that even after failures, each event's effect appears exactly once in the output Iceberg table (no duplicates, no losses)
- Iceberg sink connector: Native Flink Iceberg sink supports both append-only (Bronze CDC ingestion) and UPSERT mode (Silver current-state CDC tables)
- Watermark-based event time: Flink processes events by event time (when they occurred) rather than processing time (when they arrive), using watermarks to handle late-arriving events and keep time-series aggregations accurate
- Stateful stream joins: Flink can join streams with lookups against Iceberg tables (dimension lookups) or join two streams together within time windows
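The watermark idea above can be illustrated with a simplified pure-Python model (this is not Flink's actual API; `process_with_watermark` and the sample stream are invented for the sketch). The watermark trails the maximum event time seen so far by an allowed-lateness bound; events whose timestamps fall behind the watermark are classified as late:

```python
def process_with_watermark(events, allowed_lateness):
    """Classify events as on-time or late using a simple watermark:
    watermark = (max event time seen so far) - allowed_lateness.
    Events with event_time < watermark are considered late."""
    watermark = float("-inf")
    on_time, late = [], []
    for event_time, payload in events:
        if event_time < watermark:
            late.append((event_time, payload))
        else:
            on_time.append((event_time, payload))
            watermark = max(watermark, event_time - allowed_lateness)
    return on_time, late

# Out-of-order arrival: the event stamped t=2 arrives after t=10
# has already been seen, so it falls behind the watermark (10 - 3 = 7).
stream = [(5, "a"), (10, "b"), (8, "c"), (2, "d")]
on_time, late = process_with_watermark(stream, allowed_lateness=3)
# on_time == [(5, 'a'), (10, 'b'), (8, 'c')], late == [(2, 'd')]
```

A real engine would route late events to a side output or drop them, and would use the watermark to decide when a window's result is final.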
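The exactly-once guarantee can also be sketched conceptually: if committed source offsets are tracked alongside the sink, replaying input after a failure produces no duplicates. This mirrors the *effect* of checkpoint-based exactly-once delivery, not Flink's actual mechanism; `apply_exactly_once` and the record layout `(partition, offset, value)` are invented for the illustration:

```python
def apply_exactly_once(sink, committed_offsets, records):
    """Replay-safe apply: skip any record whose (partition, offset)
    was already committed, so reprocessing after a failure leaves
    the sink free of duplicates."""
    for partition, offset, value in records:
        if offset <= committed_offsets.get(partition, -1):
            continue  # already applied before the failure
        sink.append(value)
        committed_offsets[partition] = offset
    return sink

sink, offsets = [], {}
batch = [(0, 0, "a"), (0, 1, "b")]
apply_exactly_once(sink, offsets, batch)
# Simulated restart: the same batch is replayed plus one new record.
apply_exactly_once(sink, offsets, batch + [(0, 2, "c")])
# sink == ['a', 'b', 'c']  (no duplicate 'a' or 'b')
```

In the real system, the offset bookkeeping and the Iceberg commit happen atomically as part of the checkpoint, which is what makes the replay safe.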

Micro-Batch vs True Streaming
Two stream processing models exist for lakehouse ingestion:
- True streaming (event-by-event): Apache Flink's default model. Each event is processed immediately as it arrives, minimizing latency. Used for time-critical workloads.
- Micro-batch streaming: Spark Structured Streaming's model. Events are accumulated over a short interval (seconds to minutes) and processed as a small batch. Higher throughput per resource unit, at slightly higher latency. Simpler to reason about, since each micro-batch is effectively a mini-ETL job.
For Iceberg ingestion, micro-batch Spark Structured Streaming is often sufficient (providing minute-level freshness) and simpler to operate than Flink. Flink's true streaming model is the better fit when second-level freshness is required or when complex stateful operations (session detection, fraud pattern matching) are needed.
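The contrast between the two models can be sketched in plain Python (a conceptual illustration, not engine code; `micro_batches` and `process_event` are invented names, and batch size stands in for the time-based trigger interval real engines use):

```python
def micro_batches(events, batch_size):
    """Group a stream into fixed-size micro-batches, the way a
    micro-batch engine accumulates input until its trigger fires.
    (Real engines trigger on time intervals; a size threshold is
    used here so the example is deterministic.)"""
    batch = []
    for event in events:
        batch.append(event)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # flush the final partial batch
        yield batch

def process_event(event):
    """Stand-in for the per-record transformation."""
    return event * 2

# True streaming: one result is emitted per event, immediately.
streamed = [process_event(e) for e in [1, 2, 3, 4, 5]]

# Micro-batch: the same work, applied one small batch at a time.
batched = [[process_event(e) for e in b]
           for b in micro_batches([1, 2, 3, 4, 5], 2)]
# streamed == [2, 4, 6, 8, 10]; batched == [[2, 4], [6, 8], [10]]
```

Both models compute the same results; they differ in when results become visible and in how much per-event overhead is amortized, which is exactly the latency/throughput trade-off described above.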

Summary
Stream processing is the real-time ingestion capability that keeps lakehouse data fresh for operational analytics. By combining Apache Kafka as the event transport layer with Apache Flink or Spark Structured Streaming for continuous processing and Apache Iceberg as the ACID-safe destination, organizations achieve second-to-minute data freshness in their Bronze tables. They do so without sacrificing the governance, time travel, and schema evolution capabilities that make Iceberg the superior lakehouse table format.