ETL stands for Extract, Transform, Load — the data integration pattern where data is extracted from source systems, transformed (cleaned, joined, aggregated) in an intermediate processing layer, and then loaded into the destination system (data warehouse, lakehouse). Transformation happens before the data reaches the destination.

What is the difference between ETL and ELT?

In ETL, transformation happens before loading — data is cleaned and processed in the pipeline before arriving at the destination. In ELT (Extract, Load, Transform), raw data is loaded first into the destination (Bronze Iceberg tables) and transformed in-place using the destination system's compute (Spark on Iceberg, Dremio SQL). ELT is the dominant pattern in modern lakehouses because object storage is cheap and lakehouse compute is powerful.

What tools implement ETL/ELT for Apache Iceberg?

Apache Spark is the primary ELT engine for Iceberg batch transformations. Apache Flink handles streaming ETL (Kafka → Iceberg). dbt-dremio and dbt-spark allow data teams to define ELT transformations as SQL models that generate Iceberg tables. Apache Airflow and similar tools orchestrate multi-step ETL/ELT pipeline DAGs.

ETL: The Definitive Guide for the Data Lakehouse

What Is ETL?

ETL (Extract, Transform, Load) is the classic data integration pattern: Extract data from source systems (operational databases, SaaS APIs, event streams), Transform it (clean, validate, join, aggregate) in an intermediate processing layer, and Load the transformed result into the destination analytical system (historically a data warehouse, now an Apache Iceberg lakehouse).
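The three stages can be sketched in a few lines of plain Python. This is only an illustration of the ordering, not a production pipeline: the `SOURCE_ROWS` data, the `Order` type, and the in-memory `warehouse` list are all hypothetical stand-ins for a real source system and destination table.

```python
from dataclasses import dataclass


@dataclass
class Order:
    order_id: int
    customer: str
    amount: float


# Hypothetical source rows, standing in for an operational database or API.
SOURCE_ROWS = [
    {"order_id": "1", "customer": " alice ", "amount": "19.99"},
    {"order_id": "2", "customer": "BOB", "amount": "not-a-number"},
    {"order_id": "3", "customer": "alice", "amount": "5.01"},
]


def extract():
    """Extract: pull raw rows from the source system."""
    return list(SOURCE_ROWS)


def transform(rows):
    """Transform: clean and validate, dropping rows that fail validation."""
    cleaned = []
    for row in rows:
        try:
            amount = float(row["amount"])
        except ValueError:
            continue  # in ETL, invalid rows never reach the destination
        customer = row["customer"].strip().lower()
        cleaned.append(Order(int(row["order_id"]), customer, amount))
    return cleaned


def load(orders, destination):
    """Load: write only the transformed rows to the destination."""
    destination.extend(orders)


warehouse = []  # stand-in for the destination table
load(transform(extract()), warehouse)
```

Note that the row with the unparseable amount is discarded before loading, so the destination holds no trace of it. That is the property the ETL vs ELT comparison turns on.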

ETL was the dominant data integration pattern for decades because traditional data warehouses had expensive, limited storage, which made it impractical to keep raw, uncleaned data in the warehouse. Transformation had to happen before loading so that only cleaned, reduced data consumed warehouse storage.

The data lakehouse changed these economics: object storage is inexpensive enough that keeping raw data is practical. This enabled ELT (Extract, Load, Transform), in which raw data is loaded first and then transformed using the lakehouse's own compute. ELT is now the dominant pattern in modern lakehouse deployments.
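A minimal sketch of the ELT ordering, again in plain Python with hypothetical names: the `lakehouse` dictionary stands in for Bronze and Silver tables, and `build_silver` stands in for a transformation that a real deployment would run on lakehouse compute. The point it illustrates is that a revised transformation rule can be re-run against the preserved raw data without going back to the source.

```python
import json

# Hypothetical raw events as they arrive from a source system.
RAW_EVENTS = [
    '{"user": "alice", "amount": "19.99"}',
    '{"user": "bob", "amount": "oops"}',
    '{"user": "alice", "amount": "5.01"}',
]

lakehouse = {"bronze": [], "silver": []}  # stand-in for Bronze/Silver tables


def load_raw(events):
    """Load: land every record unmodified in the Bronze layer."""
    lakehouse["bronze"].extend(events)


def build_silver(default_amount=None):
    """Transform: rebuild Silver from the raw records held in Bronze.

    If default_amount is given, records with an unparseable amount are kept
    with that value instead of being dropped.
    """
    silver = []
    for line in lakehouse["bronze"]:
        record = json.loads(line)
        try:
            amount = float(record["amount"])
        except ValueError:
            if default_amount is None:
                continue
            amount = default_amount
        silver.append({"user": record["user"], "amount": amount})
    lakehouse["silver"] = silver


load_raw(RAW_EVENTS)
build_silver()                    # first version of the rule drops bad rows
first_count = len(lakehouse["silver"])
build_silver(default_amount=0.0)  # revised rule, re-run without re-extracting
second_count = len(lakehouse["silver"])
```

Because Bronze still holds all three raw records, changing the rule only requires rebuilding Silver. The cost is that both the raw and the transformed copies are stored.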

ETL vs ELT in the Lakehouse

Aspect	ETL	ELT
Transform timing	Before loading	After loading (in-place)
Raw data preservation	No — only transformed data stored	Yes — raw Bronze layer preserved
Transform compute	Separate ETL engine	Lakehouse compute (Spark, Dremio)
Re-processing	Must re-extract from source	Re-process from Bronze layer
Debugging	Harder — raw data not available	Easier — raw data always accessible
Storage cost	Lower (only transformed data)	Higher (raw + transformed stored)

Figure 1. ETL vs ELT: transformation timing and data preservation trade-offs.

Modern ETL/ELT Tools for Iceberg

The modern lakehouse ETL/ELT toolchain:

Apache Spark: The workhorse for batch ELT — reading Bronze Iceberg tables, applying Python/Scala transformations, writing Silver and Gold Iceberg tables
Apache Flink: Streaming ETL — continuously reading Kafka topics (or CDC streams) and writing to Bronze/Silver Iceberg tables with exactly-once semantics
dbt (data build tool): SQL-first ELT — defining Silver and Gold table transformations as SQL models that compile to Iceberg-compatible DDL and DML, run on Dremio, Spark, or Trino
Apache Airflow: Pipeline orchestration — scheduling and monitoring multi-step ETL/ELT DAGs across Spark jobs, Flink deployments, and dbt runs
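The orchestration role in the list above comes down to running tasks in dependency order. The sketch below shows that idea using Python's standard-library `graphlib`. It is not the Airflow API, and the task names are hypothetical placeholders for real ingestion jobs, Spark jobs, and dbt runs.

```python
from graphlib import TopologicalSorter

# Hypothetical task names; each maps to the set of tasks it depends on.
PIPELINE = {
    "load_bronze_orders": set(),
    "load_bronze_customers": set(),
    "spark_build_silver": {"load_bronze_orders", "load_bronze_customers"},
    "dbt_build_gold": {"spark_build_silver"},
}


def run_order(dag):
    """Return one valid execution order in which dependencies come first."""
    return list(TopologicalSorter(dag).static_order())


def run_pipeline(dag, runner):
    """Call runner(task) for each task, respecting the dependency order."""
    for task in run_order(dag):
        runner(task)


executed = []
run_pipeline(PIPELINE, executed.append)  # a real runner would submit jobs
```

A real orchestrator adds scheduling, retries, and monitoring on top of this ordering. The two Bronze loads have no dependency on each other, so their relative order is not guaranteed, and an orchestrator is free to run them in parallel.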

Figure 2. Modern ELT toolchain: Flink streaming, Spark batch, dbt SQL models, Airflow orchestration.

Summary

ETL and ELT are both data pipeline patterns with specific trade-offs, and both remain relevant in the modern lakehouse. The dominant pattern is ELT — raw data lands in Bronze Iceberg tables first, and Spark, dbt, or SQL transforms produce Silver and Gold layers in-place. This approach preserves raw data for reprocessing, simplifies debugging, and leverages the lakehouse's own scalable compute for transformation. Understanding ETL and ELT fundamentals, combined with the Medallion Architecture framework, gives data engineers the conceptual foundation to design robust, maintainable lakehouse pipelines at any scale.
