What Is an Open Table Format?
Imagine a massive library with millions of books scattered across thousands of rooms, with no card catalog, no numbering system, and no organization. Finding a specific book requires walking through every room and scanning every shelf. That is essentially what querying a data lake looks like without an open table format.
An Open Table Format is a metadata specification that sits between raw data files (typically Apache Parquet on cloud object storage) and the query engines that process them. It acts as the library's card catalog: an explicit, structured index that tells a query engine exactly which files contain the data it needs, so the engine never has to touch a file it doesn't.
Why Open Table Formats Exist: The Hive Problem
The story of open table formats begins with Apache Hive and its metastore, which was the dominant way to organize big data for over a decade. The Hive model defined a table as a directory on a distributed file system. Partitions were subdirectories. When you queried a table, the engine listed the directory contents to find files.
This "directory-first" approach had five catastrophic failure modes that became increasingly obvious as data lakes scaled to petabytes on cloud object storage:
- Slow Directory Listing: Listing millions of files on Amazon S3 is extremely slow — S3 is not a file system. Queries began timing out just during the "planning" phase, before reading a single byte of data.
- No ACID Guarantees: If two Spark jobs wrote to the same directory simultaneously, files would be overwritten and corrupted. There was no "safe concurrent write" primitive.
- No Safe Deletes or Updates: Deleting a row from a Hive table required rewriting the entire partition — a massive, expensive operation for any table with significant data.
- Schema Drift: If a data producer changed a column name, every query against that table broke. There was no mechanism for backward-compatible schema evolution.
- No Time Travel: Once data was overwritten, it was gone. You couldn't query the table as it looked yesterday.
Netflix, Databricks, and Uber engineers independently arrived at the same conclusion: the solution required moving from directory-level tracking to file-level tracking with explicit transactional metadata.
How Open Table Formats Work: The Core Mechanism
All open table formats — regardless of which one you choose — share the same foundational mechanism: they maintain an explicit, atomic log of every data file that belongs to a table.
When a writer wants to add data, it writes new files to object storage, then updates the metadata to atomically "register" those files as part of the table. Readers always see a consistent, committed snapshot — they never see partially-written data. When data is deleted, the files aren't immediately erased; instead, they are "de-registered" from the metadata, making them invisible to readers. Actual file deletion happens later during a garbage-collection sweep.
```mermaid
sequenceDiagram
    participant Writer
    participant Catalog
    participant Metadata
    participant S3
    Writer->>S3: 1. Write new Parquet files
    Note over S3: Files exist but are invisible to readers
    Writer->>Metadata: 2. Create new Manifest / Log entry
    Writer->>Catalog: 3. Atomic swap to new metadata pointer
    Note over Catalog: Atomic! Either succeeds or fails completely
    Catalog-->>Writer: Commit confirmed
    Note over S3,Catalog: All readers now see the new data instantly
```
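The commit flow above can be sketched in plain Python. This is a toy model, not any format's real API: the `ToyCatalog`, the in-memory `store` standing in for object storage, and the file-naming scheme are all illustrative assumptions.

```python
import json
import threading

class ToyCatalog:
    """Toy catalog: holds one atomic pointer per table to its latest metadata."""
    def __init__(self):
        self._pointers = {}          # table name -> metadata key in "object storage"
        self._lock = threading.Lock()

    def swap(self, table, expected, new):
        """Compare-and-swap: commit succeeds only if no other writer committed first."""
        with self._lock:
            if self._pointers.get(table) != expected:
                return False         # a concurrent writer won; caller must retry
            self._pointers[table] = new
            return True

    def current(self, table):
        return self._pointers.get(table)

store = {}                           # toy object storage: key -> bytes

def commit(catalog, table, new_files):
    # 1. Write data files; they exist but are invisible until registered.
    for name, payload in new_files.items():
        store[name] = payload
    # 2. Write a new metadata file listing every registered data file.
    prev = catalog.current(table)
    prev_files = json.loads(store[prev])["files"] if prev else []
    meta_key = f"{table}/meta-{len(prev_files) + len(new_files)}.json"
    store[meta_key] = json.dumps({"files": prev_files + list(new_files)})
    # 3. Atomic pointer swap: readers see the old or new snapshot, never a mix.
    return meta_key if catalog.swap(table, prev, meta_key) else None

cat = ToyCatalog()
commit(cat, "sales", {"sales/data-1.parquet": b"..."})
commit(cat, "sales", {"sales/data-2.parquet": b"..."})
snapshot = json.loads(store[cat.current("sales")])
print(snapshot["files"])  # both files visible in the committed snapshot
```

Note that step 3 is the only step that needs atomicity: if two writers race, one compare-and-swap fails and that writer retries against the new snapshot, which is exactly the optimistic-concurrency model the real formats use.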
The Three Major Open Table Formats
Apache Iceberg
Born at Netflix and donated to the Apache Software Foundation, Iceberg is built around a metadata tree. A root JSON file (managed by the Catalog) points to a snapshot, which points to a Manifest List, which points to Manifest Files, which explicitly list every data file along with column-level statistics (min/max values, null counts).
Iceberg's defining architectural choices are:
- ID-based column tracking: Columns are tracked by unique integer IDs, not by name. This enables true, safe schema evolution — you can rename, drop, and re-add columns without rewriting data.
- Hidden Partitioning: The partition strategy is defined in the metadata, not embedded in the file paths. Users query by data values; Iceberg handles the partition translation automatically.
- Partition Evolution: You can change the partitioning scheme (e.g., from monthly to daily) without rewriting historical data.
- Vendor Neutrality: Iceberg's spec was designed to be implemented by any engine without proprietary extensions. It has first-class support in Spark, Flink, Trino, Dremio, Snowflake, AWS, GCP, and Azure.
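The metadata tree described above can be sketched as a small scan planner. This is a simplified stand-in (real Iceberg stores the tree as JSON and Avro files, and the `event_day` column and file names here are made up), but it shows how column-level min/max statistics let the engine prune files without opening them:

```python
# Toy model of Iceberg's metadata tree: root -> snapshot -> manifests -> data files.
root = {
    "current_snapshot": {
        "manifest_list": [                    # snapshot points to manifest files
            {"data_files": [                  # each manifest lists data files + stats
                {"path": "data/a.parquet",
                 "stats": {"event_day": ("2026-01-01", "2026-01-15")}},
                {"path": "data/b.parquet",
                 "stats": {"event_day": ("2026-01-16", "2026-01-31")}},
            ]},
        ]
    }
}

def plan_scan(table, column, lo, hi):
    """Walk the metadata tree, keeping only files whose min/max overlaps the query range."""
    files = []
    for manifest in table["current_snapshot"]["manifest_list"]:
        for f in manifest["data_files"]:
            fmin, fmax = f["stats"][column]
            if fmin <= hi and fmax >= lo:     # range overlap => file may contain matches
                files.append(f["path"])
    return files

print(plan_scan(root, "event_day", "2026-01-20", "2026-01-25"))
# data/a.parquet is pruned purely from metadata, without reading a byte of it
```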
Delta Lake
Born at Databricks, Delta Lake uses a transaction log stored in a _delta_log/ directory alongside the data. Every transaction appends a new JSON file to this log. Periodically, Delta computes a "checkpoint" (a Parquet file summarizing the entire history) to prevent the log from becoming too long.
Delta's architectural choices:
- Log-replay model: To find the current table state, an engine reads the latest checkpoint and "replays" recent JSON log entries on top of it.
- Deep Spark integration: Delta was designed from day one around Apache Spark on Databricks, where it benefits from proprietary optimizations like the Photon engine.
- Deletion Vectors: A newer Delta feature (similar to Iceberg's MoR delete files) that tracks row-level deletions efficiently without rewriting entire files.
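The log-replay model can be sketched as follows. This is a toy illustration, not Delta's actual on-disk layout (real checkpoints are Parquet and log actions carry much more detail than the `add`/`remove` dictionaries assumed here):

```python
# Toy _delta_log/: a checkpoint summarizing history, then per-commit entries.
checkpoint = {"version": 2, "live_files": ["part-0.parquet", "part-1.parquet"]}
log_entries = {
    3: [{"add": "part-2.parquet"}],
    4: [{"remove": "part-0.parquet"}, {"add": "part-3.parquet"}],
}

def current_state(checkpoint, log_entries):
    """Replay log entries newer than the checkpoint to compute the live file set."""
    live = set(checkpoint["live_files"])
    for version in sorted(v for v in log_entries if v > checkpoint["version"]):
        for action in log_entries[version]:
            if "add" in action:
                live.add(action["add"])
            if "remove" in action:
                live.discard(action["remove"])   # de-registered, not yet deleted
    return sorted(live)

print(current_state(checkpoint, log_entries))
```

The checkpoint is what keeps replay cheap: without it, computing the current state of a long-lived table would mean re-reading every JSON entry since version 0.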
Apache Hudi
Born at Uber, Hudi uses a timeline stored in a .hoodie/ directory to track all actions on the table. Hudi's architecture is explicitly designed around primary keys and upsert workloads. It maintains a built-in index (using Bloom filters or HBase) so it can efficiently locate which file contains a specific record ID during an update.
Hudi's architectural choices:
- Primary key-first design: Every Hudi table has a declared primary key. This enables ultra-efficient upserts at scale.
- Built-in table services: Hudi has a sophisticated scheduler for compaction, clustering, and archiving as first-class integrated features.
- Incremental processing: Hudi makes it trivial to read "only the new or changed records since a given timestamp," a pattern critical for streaming CDC pipelines.
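The primary-key upsert path can be sketched with a toy index. Real Hudi uses Bloom filters or an external index to locate file groups; here a plain dictionary stands in, and the `file_groups` / `group_of` names are illustrative assumptions:

```python
# Toy upsert routing: the record-key index tells the writer which file group
# already holds each key, so only the affected file groups are rewritten.
index = {}          # record key -> file group id
file_groups = {}    # file group id -> {record key: row}

def upsert(records, group_of):
    """Route each record: update its existing file group, or insert into a new one."""
    touched = set()
    for key, row in records.items():
        group = index.get(key)
        if group is None:                 # insert: assign a file group for this key
            group = group_of(key)
            index[key] = group
        file_groups.setdefault(group, {})[key] = row
        touched.add(group)
    return touched                        # only these file groups need rewriting

upsert({"u1": {"city": "SF"}, "u2": {"city": "NY"}}, group_of=lambda k: "fg-0")
changed = upsert({"u1": {"city": "LA"}}, group_of=lambda k: "fg-1")
print(changed)  # u1 routes back to fg-0, its existing file group, not fg-1
```

The point of the index is in that last line: without it, the writer would have to scan every file to discover where `u1` lives before it could update the record.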
Feature Comparison Matrix
| Feature | Apache Iceberg | Delta Lake | Apache Hudi |
|---|---|---|---|
| Origin | Netflix / Apache | Databricks / Linux Foundation | Uber / Apache |
| Metadata Model | Hierarchical Tree (JSON + Avro) | Sequential Log (JSON + Parquet) | Timeline (.hoodie dir) |
| Schema Evolution | Best-in-class (ID-based) | Strong (name-based + column mapping) | Good (Avro-based) |
| Partition Evolution | Yes (Hidden Partitioning) | No (requires full rewrite) | No (requires full rewrite) |
| Time Travel | Yes (Snapshots) | Yes (Log version history) | Yes (Timeline) |
| Row-Level Deletes | Yes (Position + Equality Deletes) | Yes (Deletion Vectors) | Yes (MoR) |
| Best Engine | Any (Dremio, Spark, Flink, Trino) | Apache Spark / Databricks | Apache Spark / Flink |
| Upsert Performance | Good | Good | Excellent (primary key index) |
| Vendor Neutrality | Excellent | Good (open-source core) | Good |
Open Table Formats vs. File Formats
A common source of confusion is conflating "table formats" with "file formats." They are completely different layers of the stack:
- File Formats (Apache Parquet, ORC, Avro) define how individual records are physically encoded in a binary file on disk. Parquet is columnar and compressed; Avro is row-based and schema-rich.
- Table Formats (Iceberg, Delta, Hudi) define how a collection of files constitutes a logical, versioned, transactional table. They live one layer above the file format.
An Apache Iceberg table typically stores its actual data as Parquet files. The Iceberg metadata layer (JSON + Avro manifests) is layered on top, tracking which Parquet files belong to the table.
The Role of Catalogs
Open table formats describe how metadata is structured, but they don't tell you how to find that metadata for a given table name. That's the Catalog's job. A catalog maps table names (like my_catalog.my_database.sales) to the root metadata file location on object storage.
The Iceberg REST Catalog specification provides a standard HTTP API for catalog operations, enabling true multi-engine interoperability. Implementations like Apache Polaris and Project Nessie implement this spec.
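A catalog's core job can be sketched in a few lines. This is a toy, in-memory model (the namespaces, table names, and S3 paths are made up), not a real catalog implementation:

```python
# Toy catalog: maps a namespaced table name to its root metadata location.
catalog = {
    ("my_database", "sales"): "s3://lake/sales/metadata/v42.metadata.json",
    ("my_database", "orders"): "s3://lake/orders/metadata/v7.metadata.json",
}

def resolve(name):
    """Split 'database.table' and look up the current root metadata pointer."""
    db, table = name.split(".")
    location = catalog.get((db, table))
    if location is None:
        raise LookupError(f"table not found: {name}")
    return location

print(resolve("my_database.sales"))  # the engine starts its metadata walk here
```

Everything else an engine does (snapshot resolution, file pruning, reads) starts from the single location this lookup returns, which is why the catalog is also the natural place to make commits atomic.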
Choosing the Right Format for Your Workload
- Choose Apache Iceberg if you value vendor neutrality, want the best schema/partition evolution, and plan to use multiple query engines (Dremio for BI, Spark for ETL, Flink for streaming). This is the safe, future-proof default for 2026.
- Choose Delta Lake if your entire organization is committed to the Databricks platform and Apache Spark, and you don't anticipate needing other query engines.
- Choose Apache Hudi if your primary workload is continuous streaming ingestion with millions of upserts (e.g., database CDC replication), where Hudi's primary-key indexing provides measurable performance advantages.
Conclusion
Open Table Formats are the foundational innovation that made the Data Lakehouse possible. By replacing Hive's chaotic directory-listing model with a precise, hierarchical, transactional metadata tree, they deliver ACID guarantees, schema evolution, time travel, and blazing query performance — all directly on cheap cloud object storage.
For most modern data teams starting fresh in 2026, Apache Iceberg is the pragmatic default. Its open specification, multi-engine support, and architectural elegance have earned it dominant adoption across the cloud ecosystem.