What Is ETL?
ETL (Extract, Transform, Load) is the classic data integration pattern: Extract data from source systems (operational databases, SaaS APIs, event streams), Transform it (clean, validate, join, aggregate) in an intermediate processing layer, and Load the transformed result into the destination analytical system (historically a data warehouse, now often an Apache Iceberg lakehouse).
ETL was the dominant data integration pattern for decades because traditional data warehouses had expensive, limited storage — it was impractical to store raw, uncleaned data in the warehouse. Transformation had to happen before loading to minimize the volume of data stored in expensive warehouse storage.
The data lakehouse changed these economics: object storage is inexpensive enough that keeping raw data adds little cost. This enabled ELT (Extract, Load, Transform), which loads raw data first and then transforms it using the lakehouse's own compute. ELT is now the dominant pattern in modern lakehouse deployments.
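The difference in ordering can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: plain dictionaries stand in for the warehouse and the lakehouse, and the names `extract`, `transform`, `run_etl`, `run_elt`, `bronze_orders`, and `silver_orders` are hypothetical, not part of any tool's API.

```python
# Hypothetical in-memory stand-ins: a real pipeline would use source
# connectors and Iceberg tables instead of lists and dicts.

def extract():
    """Pretend to pull raw rows from a source system."""
    return [
        {"order_id": 1, "amount": "19.99"},
        {"order_id": 2, "amount": "not-a-number"},  # dirty row
        {"order_id": 3, "amount": "5.00"},
    ]

def transform(rows):
    """Keep rows whose amount parses as a number and cast it to float."""
    clean = []
    for row in rows:
        try:
            amount = float(row["amount"])
        except ValueError:
            continue  # drop rows that fail validation
        clean.append({"order_id": row["order_id"], "amount": amount})
    return clean

def run_etl(warehouse):
    """ETL: transform in an intermediate step, then load only the result."""
    warehouse["orders"] = transform(extract())

def run_elt(lakehouse):
    """ELT: load raw rows first, then transform from the stored raw copy."""
    lakehouse["bronze_orders"] = extract()
    lakehouse["silver_orders"] = transform(lakehouse["bronze_orders"])

warehouse, lakehouse = {}, {}
run_etl(warehouse)
run_elt(lakehouse)
```

After both runs, the ETL destination holds only the cleaned rows, while the ELT destination holds the cleaned rows plus the untouched raw copy, including the dirty row that was filtered out.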
ETL vs ELT in the Lakehouse
| Aspect | ETL | ELT |
|---|---|---|
| Transform timing | Before loading | After loading (in-place) |
| Raw data preservation | No — only transformed data stored | Yes — raw Bronze layer preserved |
| Transform compute | Separate ETL engine | Lakehouse compute (Spark, Dremio) |
| Re-processing | Must re-extract from source | Re-process from Bronze layer |
| Debugging | Harder — raw data not available | Easier — raw data always accessible |
| Storage cost | Lower (only transformed data) | Higher (raw + transformed stored) |
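
The re-processing row is the one that tends to matter most in practice. The sketch below illustrates it under stated assumptions: `Source` is a hypothetical class that counts how often the source system is read, and `transform_v1` / `transform_v2` are made-up transformation rules, where the first contains a bug that the second fixes.

```python
class Source:
    """Hypothetical source system that counts how often it is read."""

    def __init__(self, rows):
        self._rows = rows
        self.reads = 0

    def extract(self):
        self.reads += 1
        return [dict(row) for row in self._rows]

def transform_v1(rows):
    # Buggy rule: truncates amounts to whole currency units.
    return [{"id": r["id"], "amount_cents": int(r["amount"]) * 100} for r in rows]

def transform_v2(rows):
    # Fixed rule: keeps the fractional part by rounding to cents.
    return [{"id": r["id"], "amount_cents": round(r["amount"] * 100)} for r in rows]

rows = [{"id": 1, "amount": 19.99}, {"id": 2, "amount": 5.5}]

# ETL: raw data is not kept, so fixing the rule means reading the source again.
etl_source = Source(rows)
warehouse = transform_v1(etl_source.extract())
warehouse = transform_v2(etl_source.extract())

# ELT: raw data sits in Bronze, so the fix re-runs against the stored copy.
elt_source = Source(rows)
bronze = elt_source.extract()
silver = transform_v1(bronze)
silver = transform_v2(bronze)

print(etl_source.reads, elt_source.reads)
```

Both paths end with the same corrected table, but the ETL path touched the source system twice and the ELT path only once. With a rate-limited SaaS API or a source that no longer retains the history, that second read may be slow or impossible.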

Modern ETL/ELT Tools for Iceberg
The modern lakehouse ETL/ELT toolchain:
- Apache Spark: The workhorse for batch ELT — reading Bronze Iceberg tables, applying Python/Scala transformations, writing Silver and Gold Iceberg tables
- Apache Flink: Streaming ETL — continuously reading Kafka topics (or CDC streams) and writing to Bronze/Silver Iceberg tables with exactly-once semantics
- dbt (data build tool): SQL-first ELT, defining Silver and Gold table transformations as SQL models that dbt compiles to Iceberg-compatible DDL and DML and executes on an engine such as Dremio, Spark, or Trino
- Apache Airflow: Pipeline orchestration — scheduling and monitoring multi-step ETL/ELT DAGs across Spark jobs, Flink deployments, and dbt runs
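
To make the orchestration idea concrete, the following sketch orders a multi-step pipeline by its dependencies using Python's standard-library `graphlib`. It is not Airflow's API; it only illustrates the dependency-ordering idea behind a DAG. The task names are hypothetical, and each task is a stub that records its own name rather than launching a real Spark, Flink, or dbt job.

```python
from graphlib import TopologicalSorter

# Task name -> set of tasks that must finish first (names are hypothetical).
dag = {
    "ingest_bronze": set(),
    "build_silver": {"ingest_bronze"},
    "build_gold": {"build_silver"},
    "publish_report": {"build_gold"},
}

def run_pipeline(dag, tasks):
    """Run each task only after all of its upstream tasks have run."""
    executed = []
    for name in TopologicalSorter(dag).static_order():
        tasks[name]()
        executed.append(name)
    return executed

# Stub tasks: each one just appends its name to a shared log.
log = []
tasks = {name: (lambda n=name: log.append(n)) for name in dag}

order = run_pipeline(dag, tasks)
print(order)
```

A real orchestrator adds scheduling, retries, and monitoring on top of this ordering, which is the part the list above attributes to Airflow.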

Summary
ETL and ELT are both data pipeline patterns with specific trade-offs, and both remain relevant in the modern lakehouse. The dominant pattern is ELT: raw data lands in Bronze Iceberg tables first, and Spark, dbt, or SQL transforms then produce the Silver and Gold layers inside the lakehouse. This approach preserves raw data for reprocessing, simplifies debugging, and uses the lakehouse's own scalable compute for transformation. Understanding ETL and ELT fundamentals, combined with the Medallion Architecture framework, gives data engineers the conceptual foundation to design robust, maintainable lakehouse pipelines.