What is the data lakehouse architecture?

The data lakehouse architecture is a data platform design that stores data in open formats (Parquet files organized as Apache Iceberg tables) on low-cost, scalable object storage (S3, ADLS, GCS). An open table format layer adds database-like capabilities (ACID transactions, schema enforcement, time travel, query optimization), and multiple decoupled query engines serve different workloads against the same data.

What are the key layers of the lakehouse architecture?

The lakehouse architecture has five key layers: (1) Storage: object storage (S3, ADLS, GCS); (2) Table Format: Apache Iceberg managing ACID transactions, schema, and metadata; (3) Catalog: an Iceberg REST Catalog (Polaris, Nessie, Glue) tracking table locations; (4) Compute: decoupled engines (Dremio for BI, Spark for ETL, Flink for streaming); (5) Semantic & Governance: a semantic layer (Dremio virtual datasets, or VDSs) and governance (RBAC, lineage, quality).

What makes the lakehouse better than a data warehouse?

The key lakehouse advantages over traditional warehouses are open data formats (no vendor lock-in), lower storage cost (commodity object storage instead of proprietary storage), multi-engine access (Dremio, Spark, and Flink on the same data), storage that scales without cluster resizing, native ML/AI integration (Spark ML, PyIceberg), and time travel and schema evolution without downtime.

Lakehouse Architecture: The Definitive Guide

The Five-Layer Lakehouse Architecture

The lakehouse architecture is organized into five functional layers, each providing specific capabilities:

Layer 1: Storage

All data lives in cloud object storage (S3, ADLS, GCS) as Parquet data files alongside Iceberg metadata files (JSON table metadata plus Avro manifests and manifest lists). Storage capacity is effectively unlimited, always available, and billed separately from compute.
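To make the storage layout concrete, here is a minimal sketch that labels the objects under one table's location by the role each file plays. The object keys, the table path, and the classify helper are hypothetical, and the sketch assumes Parquet is the only data file format in use, so every .avro object is treated as Iceberg metadata.

```python
from collections import Counter

# Hypothetical object keys for a single table; real names and paths will differ.
OBJECT_KEYS = [
    "warehouse/sales/orders/metadata/00001-a1b2.metadata.json",
    "warehouse/sales/orders/metadata/snap-100-1-c3d4.avro",
    "warehouse/sales/orders/metadata/c3d4-m0.avro",
    "warehouse/sales/orders/data/order_date=2025-01-01/part-0001.parquet",
    "warehouse/sales/orders/data/order_date=2025-01-02/part-0002.parquet",
]


def classify(key: str) -> str:
    """Label an object key by the role its file plays in the table."""
    if key.endswith(".metadata.json"):
        return "table metadata (JSON)"
    if key.endswith(".avro"):
        # Assumption: data files are Parquet only, so Avro means metadata.
        return "manifest or manifest list (Avro)"
    if key.endswith(".parquet"):
        return "data file (Parquet)"
    return "other"


# Count how many objects of each kind the table location holds.
counts = Counter(classify(key) for key in OBJECT_KEYS)
for kind, n in sorted(counts.items()):
    print(f"{kind}: {n}")
```

Everything in this listing is an ordinary object in a bucket, which is why storage can be sized and billed independently of any query engine.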

Layer 2: Table Format

Apache Iceberg organizes data files into tables with ACID transactions, schema enforcement, partition management, snapshot isolation, and time travel — providing database semantics on top of raw object storage.
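The snapshot idea behind time travel can be illustrated with a small in-memory model. This is a conceptual sketch, not Iceberg's implementation or API: the Snapshot and Table classes, the commit and as_of methods, and the file names are all hypothetical, and commits are assumed to arrive in increasing timestamp order.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Snapshot:
    snapshot_id: int
    timestamp_ms: int
    data_files: tuple  # files visible to readers of this snapshot


@dataclass
class Table:
    snapshots: list = field(default_factory=list)

    def commit(self, timestamp_ms, added=(), removed=()):
        """Append a new snapshot; earlier snapshots are never modified."""
        current = self.snapshots[-1].data_files if self.snapshots else ()
        files = tuple(f for f in current if f not in removed) + tuple(added)
        snap = Snapshot(len(self.snapshots) + 1, timestamp_ms, files)
        self.snapshots.append(snap)
        return snap

    def as_of(self, timestamp_ms):
        """Time travel: the latest snapshot committed at or before the timestamp."""
        eligible = [s for s in self.snapshots if s.timestamp_ms <= timestamp_ms]
        if not eligible:
            raise LookupError("no snapshot at or before that time")
        return eligible[-1]


table = Table()
table.commit(1000, added=["a.parquet"])
table.commit(2000, added=["b.parquet"])
table.commit(3000, added=["c.parquet"], removed=["a.parquet"])

print(table.as_of(2500).data_files)     # state before the third commit
print(table.snapshots[-1].data_files)   # current state
```

Because each commit produces a new immutable snapshot instead of editing files in place, a reader pinned to an older snapshot keeps seeing a consistent view while writers continue to commit.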

Layer 3: Catalog

An Iceberg REST Catalog (Apache Polaris, Project Nessie, AWS Glue) tracks the current metadata location of each table and enforces role-based access control (RBAC). It is the shared metadata service that all engines connect through.
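One way to picture the catalog's role is as a map from table name to the current metadata file, updated with an atomic compare-and-swap so that concurrent writers cannot silently overwrite each other. The sketch below is an in-memory illustration under that assumption. It does not use the REST API of Polaris, Nessie, or Glue; InMemoryCatalog, CommitConflict, and the s3://example-bucket paths are hypothetical.

```python
import threading


class CommitConflict(Exception):
    """Raised when a writer commits against a stale metadata pointer."""


class InMemoryCatalog:
    def __init__(self):
        self._pointers = {}            # table name -> current metadata location
        self._lock = threading.Lock()

    def load(self, table):
        """Return the current metadata location, or None if the table is unknown."""
        return self._pointers.get(table)

    def commit(self, table, expected, new):
        """Swap the pointer only if it still equals what the writer started from."""
        with self._lock:
            if self._pointers.get(table) != expected:
                raise CommitConflict(f"{table}: pointer changed since it was read")
            self._pointers[table] = new


catalog = InMemoryCatalog()
prefix = "s3://example-bucket/sales/orders/metadata/"
catalog.commit("sales.orders", None, prefix + "v1.metadata.json")

# Two writers read the same base pointer, then race to commit.
base = catalog.load("sales.orders")
catalog.commit("sales.orders", base, prefix + "v2.metadata.json")        # writer A wins
try:
    catalog.commit("sales.orders", base, prefix + "v2-b.metadata.json")  # writer B is stale
    conflict = False
except CommitConflict:
    conflict = True

print(catalog.load("sales.orders"), conflict)
```

In this model the losing writer would reload the pointer, reapply its changes on top of the new state, and retry, which is how a single shared pointer can serialize commits from several independent engines.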

Layer 4: Compute

Decoupled query engines each handle specific workloads: Dremio for BI analytics and semantic layer, Spark for batch ETL, Flink for streaming ingestion, Trino for federated SQL.

Layer 5: Semantic & Governance

A semantic layer (Dremio virtual datasets, or VDSs), governance (RBAC, lineage, quality), and a data catalog (discovery and documentation) make the lakehouse usable and trustworthy for the entire organization.
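As a rough illustration of the RBAC part of governance, the following sketch checks whether a user holds a privilege on an object through any of their roles. It is a toy model, not the permission system of Dremio or of any catalog: the role names, users, object names, and the is_allowed function are hypothetical.

```python
# Hypothetical grants: role -> set of (object, privilege) pairs.
GRANTS = {
    "analyst": {("sales.orders_view", "SELECT")},
    "engineer": {("sales.orders", "SELECT"), ("sales.orders", "INSERT")},
}

# Hypothetical role assignments: user -> list of roles.
USER_ROLES = {
    "ana": ["analyst"],
    "eli": ["analyst", "engineer"],
}


def is_allowed(user: str, obj: str, privilege: str) -> bool:
    """True if any role held by the user grants the privilege on the object."""
    return any(
        (obj, privilege) in GRANTS.get(role, set())
        for role in USER_ROLES.get(user, [])
    )


print(is_allowed("ana", "sales.orders_view", "SELECT"))  # analyst reads the curated view
print(is_allowed("ana", "sales.orders", "INSERT"))       # analyst cannot write the base table
print(is_allowed("eli", "sales.orders", "INSERT"))       # engineer can
```

Granting analysts access to a curated view rather than the base table mirrors how a semantic layer and access control tend to work together: consumers query governed definitions, while write access to the underlying tables stays with a smaller group.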

Figure 1: The five-layer lakehouse architecture (storage, table format, catalog, compute, and governance).

Lakehouse vs Data Warehouse vs Data Lake

Dimension	Data Warehouse	Data Lake	Data Lakehouse
Storage	Proprietary	Object storage	Object storage
File format	Proprietary	Any (Parquet, CSV)	Open (Iceberg + Parquet)
ACID transactions	Yes	No	Yes (Iceberg)
Schema enforcement	Yes	No	Yes (Iceberg)
Multi-engine support	No	Partial	Yes (open format)
Storage cost	High	Low	Low
ML/AI support	Limited	Good	Excellent

Figure 2: Architecture comparison of the data warehouse, data lake, and data lakehouse on key dimensions.

Summary

The data lakehouse architecture combines the best properties of data warehouses (ACID, governance, performance) and data lakes (open formats, low cost, scalability) in a unified, open, multi-engine platform. It is built on Apache Iceberg tables in cloud object storage, governed through open Iceberg REST Catalogs, and accessed by decoupled compute engines including Dremio. As of 2025, the lakehouse is a leading enterprise data architecture, offering the analytical capabilities of the warehouse at the economics and flexibility of the open data lake.