Apache Iceberg vs Delta Lake vs Apache Hudi

Choosing the right Open Table Format for your Data Lakehouse.

The Battle for the Lakehouse

If you are building a modern data lakehouse, you must choose an Open Table Format. This layer sits on top of your raw Parquet files in cloud object storage and provides the ACID transactions, schema enforcement, and metadata tracking necessary to turn a swamp of files into a high-performance database.

There are three major contenders, all born from large-scale tech companies trying to solve the limitations of the Apache Hive metastore: Apache Iceberg (born at Netflix), Delta Lake (born at Databricks), and Apache Hudi (born at Uber). While they all solve the same fundamental problem, their underlying architectural philosophies—and their ideal use cases—are completely different.

TL;DR: Choose Apache Iceberg for vendor-neutral, multi-engine lakehouses and evolving data models; Delta Lake if you are all-in on Databricks and Spark; Apache Hudi if your workload is dominated by streaming ingestion and CDC upserts. The rest of this post explains why.

1. How They Manage Table State (Architecture)

The most critical difference between the formats is how they track the files that belong to a table.

Apache Iceberg: The Metadata Tree

Iceberg uses an explicit, hierarchical metadata tree. A single root metadata JSON file points to a Manifest List (representing a snapshot), which points to Manifest Files, which individually list every Parquet data file. Iceberg does not care about directories; it relies entirely on explicit file-level tracking. This makes query planning fast, because engines can prune data files using the min/max column statistics embedded directly in the Manifest Files before they ever touch object storage.
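As a rough mental model (plain Python, not the pyiceberg API; the file names, field names, and statistics below are made up), file pruning against the metadata tree looks like this:

```python
# Illustrative only: a toy model of Iceberg's metadata tree.
# Real manifests are Avro files; names and fields here are hypothetical.
metadata = {
    "current-snapshot": {
        "manifest-list": "snap-1234.avro",
        "manifests": [
            {
                "path": "manifest-a.avro",
                "data_files": [
                    {"path": "data/file-001.parquet",
                     "event_ts_min": "2026-01-01", "event_ts_max": "2026-01-31"},
                    {"path": "data/file-002.parquet",
                     "event_ts_min": "2026-02-01", "event_ts_max": "2026-02-28"},
                ],
            }
        ],
    }
}

def plan_files(metadata, ts_lower, ts_upper):
    """Prune data files using min/max stats before touching object storage."""
    selected = []
    for manifest in metadata["current-snapshot"]["manifests"]:
        for f in manifest["data_files"]:
            # Keep the file only if its [min, max] range overlaps the query range.
            if f["event_ts_max"] >= ts_lower and f["event_ts_min"] <= ts_upper:
                selected.append(f["path"])
    return selected

print(plan_files(metadata, "2026-02-01", "2026-02-15"))  # -> ['data/file-002.parquet']
```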

Delta Lake: The Transaction Log (_delta_log)

Delta Lake relies on a directory-based transaction log called the _delta_log. Every transaction adds a new JSON commit file to this log (e.g., `000001.json`). Periodically, Delta writes a "checkpoint" (a Parquet file summarizing the log up to that point). To find the current state of a table, an engine reads the latest checkpoint and then replays the JSON commits written after it. Delta is highly optimized for Spark, but its reliance on sequential log replay can create bottlenecks if checkpoints aren't managed properly.
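A stripped-down sketch of that replay logic (plain Python; it ignores Parquet checkpoint parsing and the protocol's other action types, and assumes a local _delta_log directory):

```python
# Simplified sketch of Delta Lake state reconstruction from _delta_log/.
# Real readers also parse Parquet checkpoints, protocol/metaData actions, etc.
import json
from pathlib import Path

def current_files(table_path: str, last_checkpoint_version: int = -1) -> set[str]:
    """Replay JSON commits after the last checkpoint to find live data files."""
    log_dir = Path(table_path) / "_delta_log"
    live: set[str] = set()  # in reality, seeded from the checkpoint's file list
    for commit in sorted(log_dir.glob("*.json")):
        version = int(commit.stem)
        if version <= last_checkpoint_version:
            continue  # already summarized by the checkpoint
        for line in commit.read_text().splitlines():
            action = json.loads(line)
            if "add" in action:
                live.add(action["add"]["path"])
            elif "remove" in action:
                live.discard(action["remove"]["path"])
    return live
```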

Apache Hudi: The Timeline

Hudi manages state using a .hoodie directory that acts as a timeline of all actions (commits, rollbacks, compactions) performed on the table. Hudi is deeply opinionated about how data is laid out physically on disk, organizing data into file groups and leveraging primary keys heavily. Its architecture is explicitly designed to handle continuous streaming and upserts efficiently.
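A toy sketch of the idea behind that key-based routing (the real implementation uses bloom filters or an index table, not an in-memory dict; the keys and file group names here are made up):

```python
# Toy sketch of Hudi-style upsert routing: each record key maps to a file group,
# so an upsert touches only the file group that already owns that key.
key_to_file_group = {
    "order-1001": "fg-0",
    "order-1002": "fg-0",
    "order-2001": "fg-1",
}

def route_upserts(records):
    """Split incoming records into updates (existing key) and inserts (new key)."""
    updates, inserts = {}, []
    for rec in records:
        fg = key_to_file_group.get(rec["key"])
        if fg is not None:
            updates.setdefault(fg, []).append(rec)  # rewrite or log-append within fg
        else:
            inserts.append(rec)                     # assign to a new/underfull file group
    return updates, inserts

batch = [{"key": "order-1002", "amount": 99}, {"key": "order-3001", "amount": 15}]
print(route_upserts(batch))
```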

```mermaid
graph TD
    subgraph "Iceberg: The Tree"
        I_JSON[Metadata.json] --> I_Snap[Snapshot]
        I_Snap --> I_Man[Manifest Files]
        I_Man --> I_Data[(Data Files)]
    end

    subgraph "Delta Lake: The Log"
        D_Dir[_delta_log/] --> D_JSON[001.json, 002.json]
        D_Dir --> D_Check[Checkpoint.parquet]
        D_JSON --> D_Data[(Data Files)]
    end

    subgraph "Hudi: The Timeline"
        H_Dir[.hoodie/] --> H_Time[Timeline / Commits]
        H_Time --> H_Keys[Index / Bloom Filters]
        H_Keys --> H_Data[(Data Files)]
    end

    style I_JSON fill:#e0f2fe,stroke:#0284c7
    style D_Dir fill:#fef08a,stroke:#ca8a04
    style H_Dir fill:#fce7f3,stroke:#db2777
```

2. Schema and Partition Evolution

Business requirements change. Columns get added, dropped, or renamed. Partitioning strategies change as datasets grow.

Schema Evolution

All three formats support adding, dropping, and renaming columns, but they track columns differently. Iceberg identifies every column by a unique ID rather than by name or position, so renames and reorders are pure metadata operations that never require rewriting data files and cannot silently resurrect a dropped column's old data. Delta Lake and Hudi have closed much of this gap (Delta's column mapping, for example), but Iceberg's ID-based approach remains the most robust as models evolve.

Partition Evolution

This is where Iceberg stands apart. Its "hidden partitioning" stores the partition transform (e.g., days(event_ts)) in metadata rather than as physical directory names, and the partition spec can be changed on a live table: existing data keeps its old layout, new data uses the new one, and queries still prune correctly across both. In Delta Lake and Hudi, changing the partitioning scheme of an existing table generally means rewriting it.
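As a sketch of what this looks like in practice, assuming a SparkSession named spark configured with the Iceberg runtime and SQL extensions, and a hypothetical table demo.db.events:

```python
# Sketch: Iceberg schema and partition evolution via Spark SQL.
# Assumes `spark` has the Iceberg runtime and
# spark.sql.extensions=...IcebergSparkSessionExtensions configured;
# table and column names are hypothetical.

# Original table partitioned by day.
spark.sql("""
    CREATE TABLE demo.db.events (id BIGINT, event_ts TIMESTAMP, payload STRING)
    USING iceberg
    PARTITIONED BY (days(event_ts))
""")

# Later, traffic grows: add a bucket on id without rewriting existing data.
spark.sql("ALTER TABLE demo.db.events ADD PARTITION FIELD bucket(16, id)")

# Columns are tracked by ID, so a rename is a metadata-only change as well.
spark.sql("ALTER TABLE demo.db.events RENAME COLUMN payload TO body")
```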

3. Upserts, Deletes, and Updates (ACID)

In a lakehouse, changing existing data is hard because object storage files (Parquet) are immutable. You have to rewrite the file.

All three formats support Copy-on-Write (CoW): when you update a row, the engine rewrites the entire Parquet file containing that row. This is slow for writes, but makes reads very fast.

All three formats now also support Merge-on-Read (MoR): when you update a row, the engine just writes a tiny "delta" or "delete" file. This makes writes blazingly fast, but readers have to reconcile the data on the fly.
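Regardless of the strategy, the user-facing API is usually the same MERGE statement. A sketch (again assuming a configured spark session; all table and column names are hypothetical) that works on Iceberg, Delta, and Hudi tables in Spark, with the CoW/MoR behavior decided by table configuration rather than by the SQL:

```python
# Sketch: an upsert via MERGE INTO, supported by Iceberg, Delta Lake, and Hudi
# on Spark. Whether this triggers a Copy-on-Write rewrite or a Merge-on-Read
# delta/delete file depends on how the table is configured, not on this SQL.
spark.sql("""
    MERGE INTO lakehouse.sales.orders AS t
    USING staging.new_orders AS s
    ON t.order_id = s.order_id
    WHEN MATCHED THEN UPDATE SET t.amount = s.amount, t.status = s.status
    WHEN NOT MATCHED THEN INSERT (order_id, amount, status)
        VALUES (s.order_id, s.amount, s.status)
""")
```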

4. Engine Interoperability and Ecosystem

The entire promise of the Data Lakehouse is that you aren't locked into one vendor's compute engine. Iceberg currently has the broadest ecosystem: Spark, Trino, Flink, Dremio, Snowflake, and the managed services on AWS and GCP can all work with it natively. Delta Lake is first-class on Databricks and Spark, with support elsewhere growing steadily. Hudi is well supported in Spark and Flink, but has a smaller footprint among commercial warehouses and query engines.
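For example, a table written by Spark can be read by a completely different process with no Spark involved at all. A sketch using pyiceberg, where the catalog name, REST endpoint, and table identifier are assumptions:

```python
# Sketch: reading an Iceberg table outside of Spark with pyiceberg.
# The catalog name, URI, and table identifier below are hypothetical.
from pyiceberg.catalog import load_catalog

catalog = load_catalog("lakehouse", **{"type": "rest", "uri": "http://localhost:8181"})
table = catalog.load_table("sales.orders")

# The filter is pushed down to metadata/manifest pruning; the result comes back as Arrow.
arrow_table = table.scan(row_filter="status = 'OPEN'").to_arrow()
print(arrow_table.num_rows)
```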

5. Governance and Catalogs

To use these formats, you need a Catalog to track them: something that maps a table name to its current metadata so engines can find it and commit to it atomically. Iceberg supports several interchangeable catalogs (Hive Metastore, AWS Glue, JDBC, and REST-based catalogs such as Project Nessie). Delta Lake tables are typically governed through the Hive Metastore or, on Databricks, Unity Catalog. Hudi tables are usually registered by syncing to the Hive Metastore or AWS Glue.
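As an illustration, here is a sketch of wiring an Iceberg REST catalog into Spark; the catalog name, endpoint, and warehouse path are hypothetical, and the matching iceberg-spark-runtime package must be on the classpath:

```python
# Sketch: configuring an Iceberg REST catalog for a SparkSession.
# Catalog name, URI, and warehouse location are hypothetical.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("iceberg-catalog-demo")
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .config("spark.sql.catalog.lakehouse", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.lakehouse.type", "rest")
    .config("spark.sql.catalog.lakehouse.uri", "http://localhost:8181")
    .config("spark.sql.catalog.lakehouse.warehouse", "s3://my-bucket/warehouse")
    .getOrCreate()
)

spark.sql("SHOW TABLES IN lakehouse.sales").show()
```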

Decision Framework: Which should you choose?

| Scenario / Workload | Winner | Why? |
| --- | --- | --- |
| Vendor Neutrality & Multi-Engine | Iceberg | Designed to be engine-agnostic, with broad native support across Dremio, Trino, Snowflake, AWS, and GCP. |
| Heavy Streaming & CDC Upserts | Hudi | Built specifically for this; primary-key indexing makes massive continuous updates highly efficient. |
| 100% Databricks & Spark Shop | Delta Lake | If you pay for Databricks, use Delta. The out-of-the-box integration and proprietary optimizations are excellent. |
| Evolving Data Models | Iceberg | Hidden partitioning and ID-based schema evolution prevent expensive data rewrites when business logic changes. |
| Data Versioning & Branching | Iceberg | Paired with Project Nessie, you get Git-like branching and tagging for data. |

Conclusion

In 2026, the "format wars" have largely stabilized. Apache Iceberg has emerged as the industry standard for the broader ecosystem because of its elegant architecture, vendor neutrality, and unmatched support for schema and partition evolution. Delta Lake remains the undisputed choice for dedicated Databricks customers. Apache Hudi continues to serve as a powerful specialist tool for engineering teams dealing with extreme streaming CDC workloads.

For most organizations building a modern, flexible data lakehouse intended to outlast their current choice of query engine, Apache Iceberg is the safest and most powerful architectural choice.