What Is Batch Processing?
Batch processing is the data processing model where data is accumulated over a time period and processed together in a single scheduled run — as opposed to stream processing, which processes data continuously as it arrives. Batch processing has been the dominant model for data transformation since the earliest days of computing, and it remains the most common pattern for Silver and Gold layer transformations in the Medallion Architecture.
In the data lakehouse, batch processing is typically implemented with Apache Spark — reading Bronze Apache Iceberg tables that have been populated by streaming ingestion, applying complex transformations (joins, aggregations, business rule application), and writing Silver or Gold Iceberg tables. Batch jobs are scheduled by Apache Airflow or similar orchestrators.
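As a concrete illustration, here is a minimal sketch of such a job. The catalog name (lakehouse), warehouse path, and table names are illustrative assumptions, and the Iceberg Spark runtime jar is assumed to be on the classpath:

```python
# Minimal sketch of a Bronze -> Silver batch job. Catalog name, warehouse
# path, and table names (bronze.orders, silver.orders) are placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (
    SparkSession.builder
    .appName("bronze-to-silver-orders")
    # Enable Iceberg SQL extensions (needed later for MERGE INTO and CALL).
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    # Register an Iceberg catalog backed by an object-store warehouse.
    .config("spark.sql.catalog.lakehouse", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.lakehouse.type", "hadoop")
    .config("spark.sql.catalog.lakehouse.warehouse", "s3://example-bucket/warehouse")
    .getOrCreate()
)

# Read the previous day's Bronze partition, validate, and deduplicate.
silver_batch = (
    spark.table("lakehouse.bronze.orders")
    .where(F.col("order_date") == F.date_sub(F.current_date(), 1))
    .filter(F.col("order_id").isNotNull())
    .dropDuplicates(["order_id"])
)

# Append the cleansed batch to the Silver table.
silver_batch.writeTo("lakehouse.silver.orders").append()
```

In production, a job like this would be parameterized by run date and triggered by the orchestrator rather than relying on current_date().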
Spark Batch Processing on Iceberg
Apache Spark is the dominant batch processing engine for lakehouse ETL because it combines:
- Python ecosystem: PySpark integrates with NumPy, Pandas, scikit-learn, and PyTorch — enabling ML feature engineering and data science workflows as part of the batch ETL pipeline
- Iceberg native integration: The Spark Iceberg connector supports all Iceberg DML (INSERT INTO, MERGE INTO, DELETE, UPDATE) and DDL (ALTER TABLE, schema evolution) operations natively
- Distributed execution: Spark scales from a single local machine to thousand-node EMR or Databricks clusters — the same PySpark code runs at any scale
- SQL and DataFrame API: Both Spark SQL and the DataFrame API can query and write Iceberg tables, allowing teams to choose the interface that fits their skills; both are shown in the sketch after this list
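
For example, continuing the session and placeholder tables from the sketch above, the same aggregation can be expressed through either interface:

```python
from pyspark.sql import functions as F

# Spark SQL: compute daily revenue by region from the Silver table.
revenue_sql = spark.sql("""
    SELECT region, order_date, SUM(amount) AS revenue
    FROM lakehouse.silver.orders
    GROUP BY region, order_date
""")

# DataFrame API: the identical aggregation expressed programmatically.
revenue_df = (
    spark.table("lakehouse.silver.orders")
    .groupBy("region", "order_date")
    .agg(F.sum("amount").alias("revenue"))
)

# Either result can be written to a Gold Iceberg table the same way.
revenue_df.writeTo("lakehouse.gold.daily_revenue").createOrReplace()
```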

Batch Patterns for Medallion Layers
Each Medallion layer has its own characteristic batch patterns:
- Bronze → Silver (Cleansing batch): Daily Spark job reading the previous day's Bronze partitions, deduplicating, validating, and writing to Silver via MERGE INTO, applying CDC events to current state (see the MERGE sketch after this list)
- Silver → Gold (Aggregation batch): Hourly or daily Spark SQL job computing business metrics (daily revenue by region, weekly cohort retention rates) and writing pre-aggregated Gold tables
- Gold optimization batch: Weekly Spark job invoking Iceberg's rewrite_data_files maintenance procedure to compact small files and re-cluster the data layout, for example with a z-order sort (see the maintenance sketch below)
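
A hedged sketch of the Bronze → Silver MERGE INTO pattern, reusing the session from the first example and assuming a hypothetical bronze.order_events table whose op column carries the CDC event type:

```python
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Keep only the latest CDC event per key within the day's batch.
latest = Window.partitionBy("order_id").orderBy(F.col("event_ts").desc())
updates = (
    spark.table("lakehouse.bronze.order_events")
    .where(F.col("event_date") == F.date_sub(F.current_date(), 1))
    .withColumn("rn", F.row_number().over(latest))
    .where("rn = 1")
    .drop("rn")
)
updates.createOrReplaceTempView("updates")

# Apply the deduplicated events to the current-state Silver table.
spark.sql("""
    MERGE INTO lakehouse.silver.orders AS t
    USING updates AS s
    ON t.order_id = s.order_id
    WHEN MATCHED AND s.op = 'DELETE' THEN DELETE
    WHEN MATCHED THEN UPDATE SET *
    WHEN NOT MATCHED AND s.op != 'DELETE' THEN INSERT *
""")
```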

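For the optimization pattern, Iceberg exposes maintenance operations as Spark SQL procedures; the table name and z-order columns below are illustrative:

```python
# Compact small files and re-cluster the Gold table's data layout
# using a z-order sort over common filter columns.
spark.sql("""
    CALL lakehouse.system.rewrite_data_files(
        table => 'gold.daily_revenue',
        strategy => 'sort',
        sort_order => 'zorder(region, order_date)'
    )
""")

# Expire old snapshots so files replaced by compaction can be cleaned up.
spark.sql("""
    CALL lakehouse.system.expire_snapshots(
        table => 'gold.daily_revenue',
        retain_last => 50
    )
""")
```
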
Summary
Batch processing remains the workhorse of data lakehouse ETL, producing the cleansed Silver tables and business-ready Gold tables that analysts and BI tools depend on. Apache Spark's combination of Python ecosystem richness, native Iceberg integration, and horizontal scalability makes it the ideal batch processing engine for complex, high-volume transformations that streaming alone cannot handle efficiently. The mature lakehouse architecture uses streaming (Flink) and batch (Spark) as complements, each in the role it performs best.