What Is Batch Processing?
Batch processing is the data processing model where data is accumulated over a time period and processed together in a single scheduled run — as opposed to stream processing, which processes data continuously as it arrives. Batch processing has been the dominant model for data transformation since the earliest days of computing, and it remains the most common pattern for Silver and Gold layer transformations in the Medallion Architecture.
In the data lakehouse, batch processing is typically implemented with Apache Spark — reading Bronze Apache Iceberg tables that have been populated by streaming ingestion, applying complex transformations (joins, aggregations, business rule application), and writing Silver or Gold Iceberg tables. Batch jobs are scheduled by Apache Airflow or similar orchestrators.
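As a concrete illustration, here is a minimal sketch of such a job. The catalog name (lakehouse), warehouse path, and table names are illustrative assumptions, and the Iceberg Spark runtime jar is assumed to be on the classpath:

```python
# Minimal sketch of a Bronze -> Silver batch job. Catalog name, warehouse
# path, and table names (bronze.orders, silver.orders) are placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (
    SparkSession.builder
    .appName("bronze-to-silver-orders")
    # Enable Iceberg SQL extensions (needed later for MERGE INTO and CALL).
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    # Register an Iceberg catalog backed by an object-store warehouse.
    .config("spark.sql.catalog.lakehouse", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.lakehouse.type", "hadoop")
    .config("spark.sql.catalog.lakehouse.warehouse", "s3://example-bucket/warehouse")
    .getOrCreate()
)

# Read the previous day's Bronze partition, validate, and deduplicate.
silver_batch = (
    spark.table("lakehouse.bronze.orders")
    .where(F.col("order_date") == F.date_sub(F.current_date(), 1))
    .filter(F.col("order_id").isNotNull())
    .dropDuplicates(["order_id"])
)

# Append the cleansed batch to the Silver table.
silver_batch.writeTo("lakehouse.silver.orders").append()
```

In production, a job like this would be parameterized by run date and triggered by the orchestrator rather than relying on current_date().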
Spark Batch Processing on Iceberg
Apache Spark is the dominant batch processing engine for lakehouse ETL because it combines:
- Python ecosystem: PySpark integrates with NumPy, Pandas, scikit-learn, and PyTorch — enabling ML feature engineering and data science workflows as part of the batch ETL pipeline
- Iceberg native integration: The Spark Iceberg connector supports all Iceberg DML (INSERT INTO, MERGE INTO, DELETE, UPDATE) and DDL (ALTER TABLE, schema evolution) operations natively
- Distributed execution: Spark scales from a single local machine to thousand-node EMR or Databricks clusters — the same PySpark code runs at any scale
- SQL and DataFrame API: Both Spark SQL and the DataFrame API can query and write Iceberg tables, allowing teams to choose the interface that fits their skills; both are shown in the sketch after this list
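
For example, continuing the session and placeholder tables from the sketch above, the same aggregation can be expressed through either interface:

```python
from pyspark.sql import functions as F

# Spark SQL: compute daily revenue by region from the Silver table.
revenue_sql = spark.sql("""
    SELECT region, order_date, SUM(amount) AS revenue
    FROM lakehouse.silver.orders
    GROUP BY region, order_date
""")

# DataFrame API: the identical aggregation expressed programmatically.
revenue_df = (
    spark.table("lakehouse.silver.orders")
    .groupBy("region", "order_date")
    .agg(F.sum("amount").alias("revenue"))
)

# Either result can be written to a Gold Iceberg table the same way.
revenue_df.writeTo("lakehouse.gold.daily_revenue").createOrReplace()
```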

Batch Patterns for Medallion Layers
Each Medallion layer has its own characteristic batch patterns:
- Bronze → Silver (Cleansing batch): Daily Spark job reading the previous day's Bronze partitions, deduplicating, validating, and writing to Silver via MERGE INTO, applying CDC events to current state (see the MERGE sketch after this list)
- Silver → Gold (Aggregation batch): Hourly or daily Spark SQL job computing business metrics (daily revenue by region, weekly cohort retention rates) and writing pre-aggregated Gold tables
- Gold optimization batch: Weekly Spark job invoking Iceberg's rewrite_data_files maintenance procedure to compact small files and re-cluster the data layout, for example with a z-order sort (see the maintenance sketch below)
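
A hedged sketch of the Bronze → Silver MERGE INTO pattern, reusing the session from the first example and assuming a hypothetical bronze.order_events table whose op column carries the CDC event type:

```python
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Keep only the latest CDC event per key within the day's batch.
latest = Window.partitionBy("order_id").orderBy(F.col("event_ts").desc())
updates = (
    spark.table("lakehouse.bronze.order_events")
    .where(F.col("event_date") == F.date_sub(F.current_date(), 1))
    .withColumn("rn", F.row_number().over(latest))
    .where("rn = 1")
    .drop("rn")
)
updates.createOrReplaceTempView("updates")

# Apply the deduplicated events to the current-state Silver table.
spark.sql("""
    MERGE INTO lakehouse.silver.orders AS t
    USING updates AS s
    ON t.order_id = s.order_id
    WHEN MATCHED AND s.op = 'DELETE' THEN DELETE
    WHEN MATCHED THEN UPDATE SET *
    WHEN NOT MATCHED AND s.op != 'DELETE' THEN INSERT *
""")
```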

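For the optimization pattern, Iceberg exposes maintenance operations as Spark SQL procedures; the table name and z-order columns below are illustrative:

```python
# Compact small files and re-cluster the Gold table's data layout
# using a z-order sort over common filter columns.
spark.sql("""
    CALL lakehouse.system.rewrite_data_files(
        table => 'gold.daily_revenue',
        strategy => 'sort',
        sort_order => 'zorder(region, order_date)'
    )
""")

# Expire old snapshots so files replaced by compaction can be cleaned up.
spark.sql("""
    CALL lakehouse.system.expire_snapshots(
        table => 'gold.daily_revenue',
        retain_last => 50
    )
""")
```
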
Summary
Batch processing remains the workhorse of data lakehouse ETL, producing the cleansed Silver tables and business-ready Gold tables that analysts and BI tools depend on. Apache Spark's combination of Python ecosystem richness, native Iceberg integration, and horizontal scalability makes it the ideal batch processing engine for complex, high-volume transformations that streaming alone cannot handle efficiently. The mature lakehouse architecture uses streaming (Flink) and batch (Spark) as complements, each in the role it performs best.