What Is ETL?

ETL (Extract, Transform, Load) is the classic data integration pattern: Extract data from source systems (operational databases, SaaS APIs, event streams), Transform it (clean, validate, join, aggregate) in an intermediate processing layer, and Load the transformed result into the destination analytical system (historically a data warehouse, now an Apache Iceberg lakehouse).

ETL dominated data integration for decades because traditional data warehouses had expensive, limited storage, which made it impractical to keep raw, uncleaned data in the warehouse. Transforming before loading kept the volume of stored data to a minimum.

The data lakehouse changed these economics: object storage is inexpensive enough that keeping raw data is affordable. This enabled ELT (Extract, Load, Transform), in which raw data is loaded first and then transformed using the lakehouse's own compute. ELT is now the dominant pattern in modern lakehouse deployments.
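The difference in ordering can be sketched with Python's built-in sqlite3 module standing in for the destination engine. This is a minimal illustration under stated assumptions, not a lakehouse implementation: the sample rows, the table names (etl_orders, bronze_orders, silver_orders), and the crude GLOB check for numeric amounts are all hypothetical.

```python
import sqlite3

# Hypothetical raw rows as they might arrive from a source system.
RAW_ORDERS = [
    ("o1", " alice ", "19.99"),
    ("o2", "BOB", "5.00"),
    ("o3", "carol", "not-a-number"),  # invalid amount
]


def is_number(text):
    """Return True if text parses as a float."""
    try:
        float(text)
        return True
    except ValueError:
        return False


def run_etl(conn):
    """ETL: clean rows in the pipeline process, then load only the cleaned rows."""
    cleaned = [
        (order_id, customer.strip().lower(), float(amount))
        for order_id, customer, amount in RAW_ORDERS
        if is_number(amount)
    ]
    conn.execute("CREATE TABLE etl_orders (order_id TEXT, customer TEXT, amount REAL)")
    conn.executemany("INSERT INTO etl_orders VALUES (?, ?, ?)", cleaned)


def run_elt(conn):
    """ELT: load raw rows untouched, then transform with the engine's own SQL."""
    conn.execute("CREATE TABLE bronze_orders (order_id TEXT, customer TEXT, amount TEXT)")
    conn.executemany("INSERT INTO bronze_orders VALUES (?, ?, ?)", RAW_ORDERS)
    # Crude validity check (assumption): keep amounts shaped like digits.digits.
    conn.execute(
        """
        CREATE TABLE silver_orders AS
        SELECT order_id,
               lower(trim(customer)) AS customer,
               CAST(amount AS REAL) AS amount
        FROM bronze_orders
        WHERE amount GLOB '[0-9]*.[0-9]*'
        """
    )


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    run_etl(conn)
    run_elt(conn)
    for table in ("etl_orders", "bronze_orders", "silver_orders"):
        count = conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
        print(table, count)
```

Both paths produce the same cleaned rows. The ELT path additionally keeps the invalid row in bronze_orders, where it remains available for inspection.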

ETL vs ELT in the Lakehouse

Aspect                | ETL                              | ELT
----------------------+----------------------------------+-----------------------------------
Transform timing      | Before loading                   | After loading (in place)
Raw data preservation | No: only transformed data stored | Yes: raw Bronze layer preserved
Transform compute     | Separate ETL engine              | Lakehouse compute (Spark, Dremio)
Re-processing         | Must re-extract from source      | Re-process from Bronze layer
Debugging             | Harder: raw data not available   | Easier: raw data always accessible
Storage cost          | Lower (only transformed data)    | Higher (raw + transformed stored)
Figure 1: ETL vs ELT — transformation timing and data preservation trade-offs.
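The re-processing trade-off can be made concrete with a small sketch. It assumes a hypothetical change in business rules (from a single total to per-country totals) and models each store as a plain dictionary. The names etl_store, elt_store, and reprocess are illustrative only.

```python
# Hypothetical raw events; the first transform ignores the "country" field.
RAW_EVENTS = [
    {"user": "u1", "amount": 120, "country": "DE"},
    {"user": "u2", "amount": 80, "country": "US"},
    {"user": "u3", "amount": 200, "country": "DE"},
]


def transform_v1(rows):
    """Original rule: one total across all events."""
    return {"total": sum(r["amount"] for r in rows)}


def transform_v2(rows):
    """Revised rule: totals per country."""
    totals = {}
    for r in rows:
        totals[r["country"]] = totals.get(r["country"], 0) + r["amount"]
    return totals


# ETL-style store: only the transformed output was kept.
etl_store = {"gold": transform_v1(RAW_EVENTS)}

# ELT-style store: raw rows retained in bronze alongside the derived output.
elt_store = {"bronze": list(RAW_EVENTS), "gold": transform_v1(RAW_EVENTS)}


def reprocess(store, transform):
    """Rebuild gold from bronze; fail if the raw rows were never stored."""
    if "bronze" not in store:
        raise LookupError("raw data not retained; re-extract from the source system")
    store["gold"] = transform(store["bronze"])
    return store["gold"]
```

Calling reprocess(elt_store, transform_v2) rebuilds the output from the retained raw rows. The same call on etl_store raises LookupError, because the per-country detail was discarded before loading.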

Modern ETL/ELT Tools for Iceberg

The modern lakehouse ETL/ELT toolchain:

  • Apache Spark: The workhorse for batch ELT — reading Bronze Iceberg tables, applying Python/Scala transformations, writing Silver and Gold Iceberg tables
  • Apache Flink: Streaming ETL — continuously reading Kafka topics (or CDC streams) and writing to Bronze/Silver Iceberg tables with exactly-once semantics
  • dbt (data build tool): SQL-first ELT — defining Silver and Gold table transformations as SQL models that compile to Iceberg-compatible DDL and DML, run on Dremio, Spark, or Trino
  • Apache Airflow: Pipeline orchestration — scheduling and monitoring multi-step ETL/ELT DAGs across Spark jobs, Flink deployments, and dbt runs
Figure 2: Modern ELT toolchain — Flink streaming, Spark batch, dbt SQL models, Airflow orchestration.
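The orchestration idea, running each step only after its upstream steps have finished, can be sketched with the standard-library graphlib module (Python 3.9+). This is a stand-in for illustration and does not use Airflow's API. The task names are hypothetical, and the runner callback is a placeholder for whatever would actually submit a Spark job or a dbt run.

```python
from graphlib import TopologicalSorter

# Hypothetical tasks; each maps to the set of tasks it depends on.
PIPELINE = {
    "load_orders_bronze": set(),
    "load_customers_bronze": set(),
    "spark_orders_silver": {"load_orders_bronze"},
    "spark_customers_silver": {"load_customers_bronze"},
    "dbt_sales_gold": {"spark_orders_silver", "spark_customers_silver"},
}


def run_pipeline(dag, runner):
    """Call runner(task) for every task, upstream dependencies first.

    Returns the order in which tasks were run. Raises graphlib.CycleError
    if the dependency graph contains a cycle.
    """
    order = []
    for task in TopologicalSorter(dag).static_order():
        runner(task)
        order.append(task)
    return order


if __name__ == "__main__":
    run_pipeline(PIPELINE, lambda task: print("running", task))
```

A real orchestrator adds scheduling, retries, and monitoring on top of this ordering. The sketch covers only the dependency resolution.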

Summary

ETL and ELT are both data pipeline patterns with specific trade-offs, and both remain relevant in the modern lakehouse. The dominant pattern is ELT: raw data lands in Bronze Iceberg tables first, and Spark, dbt, or SQL transforms produce the Silver and Gold layers in place. This approach preserves raw data for reprocessing, simplifies debugging, and uses the lakehouse's own scalable compute for transformation. Understanding ETL and ELT fundamentals, combined with the Medallion Architecture framework, gives data engineers the conceptual foundation to design robust, maintainable lakehouse pipelines.