The Battle for the Lakehouse
If you are building a modern data lakehouse, you must choose an Open Table Format. This layer sits on top of your raw Parquet files in cloud object storage and provides the ACID transactions, schema enforcement, and metadata tracking necessary to turn a swamp of files into a high-performance database.
There are three major contenders, all born from large-scale tech companies trying to solve the limitations of the Apache Hive metastore: Apache Iceberg (born at Netflix), Delta Lake (born at Databricks), and Apache Hudi (born at Uber). While they all solve the same fundamental problem, their underlying architectural philosophies—and their ideal use cases—are completely different.
- Choose Apache Iceberg for pure vendor neutrality, the broadest ecosystem interoperability, and the most elegant metadata architecture.
- Choose Delta Lake if you are entirely committed to the Databricks ecosystem and Spark.
- Choose Apache Hudi if your primary workloads are heavy streaming ingestion and massive, continuous upserts.
1. How They Manage Table State (Architecture)
The most critical difference between the formats is how they track the files that belong to a table.
Apache Iceberg: The Metadata Tree
Iceberg uses an explicit, hierarchical metadata tree. A single root metadata JSON file records the table's snapshots; each snapshot points to a Manifest List, which points to Manifest Files, which individually list every single Parquet data file. Iceberg does not care about directories; it relies entirely on explicit file-level tracking. This makes query planning fast because engines can prune data files using min/max column statistics embedded directly in the Manifest Files before they ever touch object storage.
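Iceberg even exposes this tree as queryable metadata tables. A minimal PySpark sketch, assuming a SparkSession `spark` already configured with an Iceberg catalog and a hypothetical table `demo.db.events`:

```python
# Each snapshot recorded in the root metadata file is one row here.
spark.sql(
    "SELECT snapshot_id, manifest_list FROM demo.db.events.snapshots"
).show(truncate=False)

# The manifest files behind the current snapshot.
spark.sql(
    "SELECT path, added_data_files_count FROM demo.db.events.manifests"
).show(truncate=False)

# Per-data-file entries with the stats engines use for pruning.
spark.sql(
    "SELECT file_path, record_count FROM demo.db.events.files"
).show(truncate=False)
```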
Delta Lake: The Transaction Log (_delta_log)
Delta Lake relies on a directory-based transaction log called the _delta_log. Every transaction adds a new JSON file to this log (e.g., `000001.json`). Periodically, Delta computes a "checkpoint" (a Parquet file summarizing the log). To find the current state of a table, an engine reads the latest checkpoint and plays the recent JSON log files forward. Delta is highly optimized for Spark, but its reliance on sequential log reading can occasionally create bottlenecks if checkpoints aren't managed properly.
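Because the log is newline-delimited JSON, the replay is easy to sketch. The following is a simplified illustration rather than a real reader: it ignores checkpoints and protocol checks, and the table path is hypothetical.

```python
import json
import os

# Hypothetical local table path; real tables usually live in object storage.
log_dir = "/data/my_table/_delta_log"

active_files = set()
# Commit files are zero-padded, so lexicographic order matches commit order.
for name in sorted(os.listdir(log_dir)):
    if not name.endswith(".json"):
        continue  # skip checkpoint parquet files and _last_checkpoint in this sketch
    with open(os.path.join(log_dir, name)) as f:
        for line in f:  # each line is one action: add, remove, metaData, ...
            action = json.loads(line)
            if "add" in action:
                active_files.add(action["add"]["path"])
            elif "remove" in action:
                active_files.discard(action["remove"]["path"])

print(f"Current table state: {len(active_files)} live data files")
```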
Apache Hudi: The Timeline
Hudi manages state using a .hoodie directory that acts as a timeline of all actions (commits, rollbacks, compactions) performed on the table. Hudi is deeply opinionated about how data is laid out physically on disk, organizing data into file groups and leveraging primary keys heavily. Its architecture is explicitly designed to handle continuous streaming and upserts efficiently.
```mermaid
graph TD
    subgraph "Iceberg: The Tree"
        I_JSON[Metadata.json] --> I_Snap[Snapshot]
        I_Snap --> I_Man[Manifest Files]
        I_Man --> I_Data[(Data Files)]
    end
    subgraph "Delta Lake: The Log"
        D_Dir[_delta_log/] --> D_JSON[001.json, 002.json]
        D_Dir --> D_Check[Checkpoint.parquet]
        D_JSON --> D_Data[(Data Files)]
    end
    subgraph "Hudi: The Timeline"
        H_Dir[.hoodie/] --> H_Time[Timeline / Commits]
        H_Time --> H_Keys[Index / Bloom Filters]
        H_Keys --> H_Data[(Data Files)]
    end
    style I_JSON fill:#e0f2fe,stroke:#0284c7
    style D_Dir fill:#fef08a,stroke:#ca8a04
    style H_Dir fill:#fce7f3,stroke:#db2777
```
2. Schema and Partition Evolution
Business requirements change. Columns get added, dropped, or renamed. Partitioning strategies change as datasets grow.
Schema Evolution
- Iceberg: Best in class. Iceberg tracks columns by unique ID rather than by name. You can drop a column and reuse its name later without data corruption. It supports full, safe, in-place schema evolution with no data rewrites (see the example after this list).
- Delta Lake: Strong support. Delta supports adding, reordering, and dropping columns. Historically it tracked columns by name, but modern Delta versions introduced column mapping to achieve capabilities similar to Iceberg's (safe renames and drops).
- Hudi: Supports schema evolution, but traditionally relied heavily on Avro schema resolution; comprehensive in-place schema evolution has improved significantly in recent versions.
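To make the Iceberg bullet concrete, here is a hedged sketch of in-place evolution using Iceberg's Spark SQL DDL. The table and column names are hypothetical, and none of these statements rewrite data files:

```python
# Assumes a SparkSession `spark` with an Iceberg catalog configured.
spark.sql("ALTER TABLE demo.db.events ADD COLUMNS (device_type STRING)")
spark.sql("ALTER TABLE demo.db.events RENAME COLUMN device_type TO device")
spark.sql("ALTER TABLE demo.db.events DROP COLUMN device")
# Because columns are tracked by ID, the name 'device' can later be reused
# without resurrecting data from the dropped column.
```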
Partition Evolution
- Iceberg: Iceberg is the only format that natively supports Hidden Partitioning and Partition Evolution. You can start partitioning a table by `Month`. A year later, you can change the partition spec to `Day`. Iceberg simply uses the new spec for new data and the old spec for old data. Queries don't break, and you don't rewrite historical data (see the sketch after this list).
- Delta & Hudi: If you want to change the partition strategy of a Delta or Hudi table, you generally must rewrite the entire table. Furthermore, users must explicitly query the partition column.
3. Upserts, Deletes, and Updates (ACID)
In a lakehouse, changing existing data is hard because object storage files (Parquet) are immutable. You have to rewrite the file.
All three formats support Copy-on-Write (CoW): when you update a row, the engine rewrites the entire Parquet file containing that row. This is slow for writes, but makes reads very fast.
All three formats now also support Merge-on-Read (MoR): when you update a row, the engine just writes a tiny "delta" or "delete" file. This makes writes blazingly fast, but readers have to reconcile the data on the fly.
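In Iceberg, for instance, the choice is a per-operation table property rather than a separate table type; a minimal sketch with a hypothetical table name:

```python
# Ask writers to emit delete files for DELETE/UPDATE/MERGE (merge-on-read)
# instead of rewriting whole data files; 'copy-on-write' is the default.
# Merge-on-read requires Iceberg table format v2.
spark.sql("""
    ALTER TABLE demo.db.events SET TBLPROPERTIES (
        'write.delete.mode' = 'merge-on-read',
        'write.update.mode' = 'merge-on-read',
        'write.merge.mode'  = 'merge-on-read'
    )
""")
```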
- Hudi: The undisputed king of upserts. Hudi was built by Uber to handle millions of streaming database CDC (Change Data Capture) updates. Its primary key indexing and bloom filters make finding and updating specific rows incredibly fast (a write sketch follows this list).
- Iceberg: Supports MoR using Position and Equality delete files. It handles deletes and updates elegantly at the metadata level, and relies on background compaction to keep read performance high.
- Delta Lake: Introduced Deletion Vectors (a MoR-style technique) to speed up deletes and updates. They are powerful, but Delta's roots are in batch CoW processing.
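A hedged PySpark sketch of a Hudi upsert, assuming a SparkSession `spark` with the Hudi bundle on the classpath; the table name, columns, and bucket path are hypothetical:

```python
# A tiny batch of CDC-style records keyed by `id`; in production this would
# arrive from Kafka/Debezium rather than a literal list.
cdc_df = spark.createDataFrame(
    [(1, "login", "2026-01-15", "2026-01-15T09:00:00Z")],
    ["id", "event_type", "event_date", "updated_at"],
)

(
    cdc_df.write.format("hudi")
    .option("hoodie.table.name", "events")
    .option("hoodie.datasource.write.recordkey.field", "id")           # primary key
    .option("hoodie.datasource.write.precombine.field", "updated_at")  # latest version wins per key
    .option("hoodie.datasource.write.partitionpath.field", "event_date")
    .option("hoodie.datasource.write.operation", "upsert")
    .option("hoodie.datasource.write.table.type", "MERGE_ON_READ")     # fast writes, async compaction
    .mode("append")
    .save("s3://my-bucket/lake/events")
)
```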
4. Engine Interoperability and Ecosystem
The entire promise of the Data Lakehouse is that you aren't locked into one vendor's compute engine.
- Apache Iceberg: The most open and interoperable ecosystem. Because Iceberg was built independently of any single compute engine, it has first-class native support across Spark, Flink, Trino, Dremio, Snowflake, AWS Athena, and BigQuery. The Iceberg REST Catalog specification keeps even the catalog layer vendor-neutral (a configuration sketch follows this list).
- Delta Lake: Deeply intertwined with Databricks and Apache Spark. While it is fully open-source (Delta Lake 3.0+), other engines often treat Delta as a second-class citizen compared to Databricks' own proprietary Photon engine, which has exclusive optimizations for Delta. If you live entirely in Databricks, Delta is perfect. If you want a diverse, multi-vendor ecosystem, it can feel restrictive.
- Apache Hudi: Strong ecosystem, but heavily reliant on Spark or Flink for its advanced table services (like compaction and indexing). It requires more complex configuration and tuning to integrate with other engines compared to Iceberg.
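As one illustration of Iceberg's neutrality, pointing Spark at a shared REST catalog is pure configuration; the catalog name and endpoint below are hypothetical, and the same endpoint could just as well serve Trino, Flink, or PyIceberg:

```python
from pyspark.sql import SparkSession

# Assumes the iceberg-spark-runtime jar is on the classpath.
spark = (
    SparkSession.builder
    .config("spark.sql.catalog.lake", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.lake.type", "rest")
    .config("spark.sql.catalog.lake.uri", "https://catalog.example.com/api")
    .getOrCreate()
)

spark.sql("SELECT * FROM lake.db.events LIMIT 10").show()
```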
5. Governance and Catalogs
To use these formats, you need a Catalog to track them.
- Iceberg relies on the open Iceberg REST Catalog specification. Implementations like Apache Polaris and Project Nessie build on it. Nessie, for example, enables "Git for Data": you can branch an entire petabyte-scale Iceberg catalog, make experimental changes, and merge them back to main (sketched after this list).
- Delta Lake relies heavily on the Hive Metastore or Databricks Unity Catalog. Unity Catalog is a phenomenal governance tool, but it pulls you deeper into the Databricks ecosystem.
- Hudi typically relies on the Hive Metastore or AWS Glue.
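A hedged sketch of the Nessie workflow via its Spark SQL extensions; branch, catalog, and table names are hypothetical, and the exact syntax varies slightly across Nessie versions:

```python
# Assumes a SparkSession configured with a Nessie-backed Iceberg catalog named
# 'nessie' plus the Nessie and Iceberg Spark SQL extensions on the classpath.
spark.sql("CREATE BRANCH IF NOT EXISTS etl_experiment IN nessie FROM main")
spark.sql("USE REFERENCE etl_experiment IN nessie")

# Changes on the branch stay invisible to readers of main until merged.
spark.sql("DELETE FROM nessie.db.events WHERE event_type = 'test'")

spark.sql("MERGE BRANCH etl_experiment INTO main IN nessie")
```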
Decision Framework: Which should you choose?
| Scenario / Workload | Winner | Why? |
|---|---|---|
| Vendor Neutrality & Multi-Engine | Iceberg | Designed to be engine-agnostic, with first-class support across Dremio, Trino, Snowflake, AWS, and GCP. |
| Heavy Streaming & CDC Upserts | Hudi | Built specifically for this. Primary key indexing makes massive continuous updates highly efficient. |
| 100% Databricks & Spark Shop | Delta Lake | If you pay for Databricks, use Delta. The out-of-the-box integration and proprietary optimizations are excellent. |
| Evolving Data Models | Iceberg | Hidden partitioning and ID-based schema evolution prevent expensive data rewrites when business logic changes. |
| Data Versioning & Branching | Iceberg | When paired with Project Nessie, you get literal Git-like branching and tagging for data. |
Conclusion
In 2026, the "format wars" have largely stabilized. Apache Iceberg has emerged as the industry standard for the broader ecosystem because of its elegant architecture, vendor neutrality, and unmatched support for schema/partition evolution. Delta Lake remains the undisputed choice for dedicated Databricks customers. Apache Hudi remains a powerful niche tool for engineering teams dealing with extreme streaming CDC workloads.
For most organizations building a modern, flexible data lakehouse intended to outlast their current choice of query engine, Apache Iceberg is the safest and most powerful architectural choice.