What is a Data Lakehouse?

The definitive guide to the architecture that unifies data lakes and data warehouses.

Introduction: The Convergence of Analytics

For over thirty years, enterprise data teams were forced to choose between two fundamentally flawed paradigms. You could put your data in a Data Warehouse, which offered fast, reliable SQL analytics but trapped your data in expensive, proprietary formats that were useless for machine learning. Or, you could put your data in a Data Lake, which offered cheap, infinitely scalable storage for all data types but lacked the transactional guarantees, governance, and performance required for business intelligence.

The Data Lakehouse represents the convergence of these two systems. A data lakehouse is an open data architecture that implements warehouse-style data structures and data management features directly on top of low-cost cloud data lakes.

Definition: A Data Lakehouse is a modern data management architecture that uses open file formats (like Parquet) on cloud object storage (like Amazon S3), managed by an open table format (like Apache Iceberg), to deliver ACID transactions and high-performance SQL analytics without copying data into a proprietary warehouse.

Why the Lakehouse Architecture Emerged

The shift to the lakehouse was not driven by a single vendor, but by an academic and engineering consensus that the two-tier architecture (moving data from a lake into a warehouse) was no longer sustainable. As formalized in the seminal 2021 UC Berkeley RISE Lab paper on the topic, the two-tier architecture suffered from four major flaws:

  1. Reliability: keeping the lake and the warehouse consistent requires continuous, fragile ETL pipelines, and every pipeline is a new place for data to silently break.
  2. Data staleness: data lands in the lake first and only reaches the warehouse after the next ETL cycle, so BI users are always querying yesterday's data.
  3. Limited support for advanced analytics: machine learning workloads cannot efficiently read data locked inside a proprietary warehouse, forcing yet more exports.
  4. Total cost of ownership: organizations pay to store, secure, and govern the same data twice.

The lakehouse emerged because technological breakthroughs in open file formats (Apache Parquet) and the invention of open table formats (Apache Iceberg, Delta Lake, Apache Hudi) made it possible to bring transactional consistency and indexing directly to the raw object storage layer.

How a Data Lakehouse Works: The 3-Layer Architecture

A true data lakehouse is characterized by the decoupling of storage and compute, bridged by an open metadata layer. All major implementations agree on this foundational three-layer architecture.

```mermaid
graph TD
    subgraph "Layer 3: Compute & Semantic"
        BI[BI Tools / Dashboards]
        AI[AI Agents / ML Models]
        Spark[Apache Spark / Batch]
        Dremio[Dremio / Fast SQL]
        Flink[Apache Flink / Streaming]
    end

    subgraph "Layer 2: Open Table Format (The Brain)"
        Cat[Iceberg REST Catalog / Apache Polaris]
        Meta[Table Metadata, Manifests, Snapshots]
    end

    subgraph "Layer 1: Object Storage (The Brawn)"
        S3[(Amazon S3, Azure ADLS, Google Cloud Storage)]
        Parquet[(Open Files: Parquet, ORC, Avro)]
    end

    BI -.-> Dremio
    AI -.-> Dremio
    Spark --> Cat
    Dremio --> Cat
    Flink --> Cat

    Cat --> Meta
    Meta --> S3
    S3 --> Parquet

    style Cat fill:#dbeafe,stroke:#2563eb
    style Meta fill:#e0f2fe,stroke:#0284c7
    style S3 fill:#f0fdf4,stroke:#16a34a
    style Parquet fill:#dcfce7,stroke:#22c55e
    style Dremio fill:#fef08a,stroke:#ca8a04
```

1. The Storage Layer (Object Storage)

At the bottom is the storage layer. Instead of local disks attached to compute nodes, the lakehouse uses cloud object storage—Amazon S3, Azure Data Lake Storage (ADLS), or Google Cloud Storage. Object storage provides infinite scalability, 99.999999999% durability, and the lowest possible storage cost. The data itself is stored in open, standard file formats. The overwhelming industry standard is Apache Parquet, a columnar format that compresses heavily and allows engines to skip reading columns they don't need.

2. The Open Table Format Layer (Metadata)

This is the defining layer of the lakehouse. Without this layer, you just have a data lake. The Open Table Format—most prominently Apache Iceberg—sits on top of the Parquet files and provides a transaction log, schema enforcement, and an index. It tracks which files belong to a table, which version (snapshot) of the table is currently active, and exactly where each piece of data is located.

This layer provides the ACID guarantees (Atomicity, Consistency, Isolation, Durability) that allow multiple query engines to read and write to the same data at the exact same time without corrupting the files or seeing partial data.
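The core trick behind these guarantees is that data files are immutable and a commit is a single atomic swap of a metadata pointer. The toy sketch below illustrates that idea with the Python standard library; it mimics the concept of a snapshot log, not Apache Iceberg's actual on-disk metadata layout, and all file names are made up.

```python
# Toy sketch of atomic commits over immutable data files, the idea
# behind a table format's snapshot log. This is NOT Iceberg's real
# metadata layout; names and structure are illustrative only.
import json
import os
import tempfile

table_dir = tempfile.mkdtemp()
pointer = os.path.join(table_dir, "current_snapshot.json")

def commit(data_files, snapshot_id):
    """Write a new snapshot, then atomically swap the current pointer."""
    snapshot = {"snapshot_id": snapshot_id, "data_files": data_files}
    tmp = os.path.join(table_dir, f"snapshot-{snapshot_id}.json.tmp")
    with open(tmp, "w") as f:
        json.dump(snapshot, f)
    # os.replace is atomic on POSIX: a concurrent reader sees either
    # the old snapshot or the new one, never a half-written state.
    os.replace(tmp, pointer)

def read_snapshot():
    with open(pointer) as f:
        return json.load(f)

commit(["part-000.parquet"], snapshot_id=1)
commit(["part-000.parquet", "part-001.parquet"], snapshot_id=2)
print(read_snapshot()["data_files"])
# ['part-000.parquet', 'part-001.parquet']
```

Because old snapshot files are never mutated in place, a reader that started on snapshot 1 keeps a consistent view even while snapshot 2 is being committed, which is exactly the isolation property the lakehouse needs.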

3. The Compute and Semantic Layer

Because the data and the metadata are completely open and standard, the lakehouse supports Multi-Engine Interoperability. You are no longer locked into a single vendor's compute engine. In a lakehouse, Apache Spark might run the nightly batch jobs, Apache Flink might stream in real-time events, and Dremio might serve sub-second SQL queries to Tableau and PowerBI—all pointing at the exact same data files.

The top of this compute tier often includes a Semantic Layer, which translates raw technical tables into business-friendly views, ensuring that metrics like "Revenue" are defined consistently regardless of which BI tool or AI agent is asking the question.
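The semantic-layer idea can be sketched with any SQL engine. Here sqlite3 stands in for a lakehouse query engine, and the table, view, and metric definitions are invented for illustration: "Revenue" is defined once as a view, so every consumer gets the same number.

```python
# Sketch of a semantic layer: "Revenue" is defined once as a view, so
# every consumer (dashboard, notebook, AI agent) gets the same answer.
# sqlite3 stands in for a lakehouse SQL engine; names are illustrative.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (amount REAL, refunded INTEGER)")
conn.executemany("INSERT INTO orders VALUES (?, ?)",
                 [(100.0, 0), (250.0, 0), (40.0, 1)])

# The one canonical definition of "Revenue": paid orders only.
conn.execute("""
    CREATE VIEW revenue AS
    SELECT SUM(amount) AS total FROM orders WHERE refunded = 0
""")

# Two different "tools" asking the same question get the same answer.
dashboard = conn.execute("SELECT total FROM revenue").fetchone()[0]
notebook = conn.execute("SELECT total FROM revenue").fetchone()[0]
print(dashboard, notebook)  # 350.0 350.0
```

The point is not the SQL itself but where the definition lives: in the shared layer, not copy-pasted into each BI tool where the definitions inevitably drift apart.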

Lake vs. Warehouse vs. Lakehouse

The easiest way to grasp the lakehouse is to directly compare it with the legacy architectures it replaces.

| Feature | Data Warehouse | Data Lake | Data Lakehouse |
| --- | --- | --- | --- |
| Storage Format | Proprietary / Closed | Open (CSV, Parquet) | Open (Parquet) + Open Metadata |
| ACID Transactions | Yes | No | Yes |
| Cost | Very High (premium storage) | Very Low (object storage) | Very Low (object storage) |
| Vendor Lock-in | Extreme | Low | Low (Open Standards) |
| Compute Model | Single-engine (Tightly coupled) | Multi-engine (Decoupled) | Multi-engine (Decoupled) |
| Machine Learning Support | Poor | Excellent | Excellent |

The Role of Open Table Formats

If you are building a lakehouse today, you must choose an Open Table Format. Three main contenders emerged from the Hadoop/Spark ecosystem (Apache Iceberg, Delta Lake, and Apache Hudi), but the industry has overwhelmingly converged on one standard: Apache Iceberg.

When is a Data Lakehouse the Right Fit?

For organizations starting greenfield data projects in 2026, the lakehouse is almost universally the recommended architecture. However, there are specific scenarios where transitioning to a lakehouse is most urgent:

  1. Exploding Data Volume and Costs: If your Snowflake or BigQuery compute/storage costs are skyrocketing because you are storing massive amounts of historical event data inside the warehouse, moving that data to an Iceberg lakehouse on S3 can cut those costs substantially; savings of up to 80% are commonly cited.
  2. The Need for AI and Machine Learning: If your data scientists are constantly exporting data out of the warehouse via CSV to train models, a lakehouse allows them to connect their Python notebooks directly to the source of truth.
  3. Complex Data Ecosystems: If you have multiple engineering teams who want to use different tools (Spark, Flink, Dremio, Trino), the lakehouse allows them all to collaborate on the same data without stepping on each other's toes.

Common Misconceptions

Misconception 1: "The Lakehouse is just a marketing term for a Data Lake."
False. A data lake is a chaotic collection of files. A lakehouse uses a transactional metadata layer (Iceberg) to enforce schemas, guarantee snapshot isolation, and provide the exact same SQL semantics as a traditional relational database.

Misconception 2: "Lakehouses are too slow for BI dashboards."
Historically, object storage was too slow for sub-second queries. However, modern query engines like Dremio use Apache Arrow (vectorized memory processing), column pruning, and Data Reflections (materialized views) to deliver dashboard-speed performance directly on the lake, bypassing the need for a warehouse completely.

Conclusion

The data lakehouse is no longer a theoretical architecture—it is the established baseline for modern data engineering. By unifying the low cost and infinite scalability of cloud object storage with the governance, reliability, and performance of a data warehouse, it provides a single, open data foundation that can power everything from nightly BI reports to the next generation of Agentic AI.