Iceberg REST Catalog, Apache Polaris & Interoperability

How catalogs enable true multi-engine interoperability in the data lakehouse.

What Is a Catalog in the Iceberg Ecosystem?

Apache Iceberg defines a brilliant metadata architecture for organizing data files into versioned, ACID-compliant tables. But Iceberg itself doesn't specify how to find a table's metadata given just a table name. That's the catalog's job.

An Iceberg catalog is a service that maps human-readable table names (like analytics.prod.sales) to the physical URI of that table's current metadata file on object storage. It also enforces the atomic commit protocol — ensuring that when two writers try to update the same table simultaneously, exactly one succeeds and the other retries.

Without a catalog, each query engine would need to be individually configured with hardcoded metadata file paths — an unmanageable approach at any scale. With a centralized catalog, any engine that knows the catalog's address can discover, read, and write any table in the lakehouse.

The Catalog's Two Jobs:
  1. Table Discovery: "Given the name prod.orders, what is the current metadata file location?"
  2. Atomic Commit: "Only allow a new metadata pointer to be committed if the previous version hasn't changed since the writer started."

The Iceberg REST Catalog Specification

For the first few years of Iceberg's existence, catalogs were tightly coupled to specific backends — Hive Metastore, AWS Glue, or JDBC databases. Each engine needed a custom plugin for each catalog type, creating an M×N compatibility matrix that was difficult to maintain.

The Iceberg REST Catalog Specification (sometimes called the "Iceberg Open API" or "REST Catalog Spec") solved this elegantly. It defines a standard HTTP/REST API that any catalog can implement and any engine can connect to. Once an engine implements the REST catalog client, it automatically works with every catalog that implements the server side of the spec.

Key REST catalog API operations include:

  1. Listing namespaces, and listing the tables within a namespace.
  2. Loading a table — returning its current metadata file location (and, optionally, storage credentials).
  3. Creating, renaming, and dropping tables.
  4. Committing table updates — the atomic swap of the metadata pointer.

Architecturally, every engine speaks the same protocol to a single catalog server, which persists metadata pointers and brokers access to object storage:

      graph TD
        Dremio[Dremio] -->|REST API| Cat[Iceberg REST Catalog Server]
        Spark[Apache Spark] -->|REST API| Cat
        Flink[Apache Flink] -->|REST API| Cat
        Trino[Trino] -->|REST API| Cat
        PyIceberg[PyIceberg Python] -->|REST API| Cat

        Cat -->|Persists metadata pointers| S3[(S3 / Object Storage)]
        Cat -->|Returns credentials| Engines["Query Engines (Dremio, Spark…)"]
        Engines -->|Direct read of data files| S3

        style Cat fill:#dbeafe,stroke:#2563eb,stroke-width:2px
        style S3 fill:#dcfce7,stroke:#22c55e
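From a client's perspective, the protocol is ordinary HTTP. As a sketch, here is how the load-table URL is shaped — the endpoint form follows the REST spec's `GET /v1/{prefix}/namespaces/{namespace}/tables/{table}` route, while the host and prefix below are illustrative:

```python
# Build the REST catalog URL for loading a table. The server's response
# includes the current metadata location (and, if credential vending is
# enabled, scoped storage credentials).

def load_table_url(base: str, namespace: str, table: str, prefix: str = "") -> str:
    parts = [base.rstrip("/"), "v1"]
    if prefix:  # some servers scope requests under a catalog prefix
        parts.append(prefix)
    parts += ["namespaces", namespace, "tables", table]
    return "/".join(parts)

url = load_table_url("https://catalog.example.com", "analytics", "orders")
# -> https://catalog.example.com/v1/namespaces/analytics/tables/orders
```

In practice a client library hides this entirely — for example, PyIceberg's `load_catalog(..., uri=...)` followed by `catalog.load_table("analytics.orders")` issues the same request under the hood.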
      

Credential Vending: A Critical Security Feature

One of the most underappreciated features of the REST catalog spec is credential vending. When an engine calls the REST catalog to load a table, the catalog response can include short-lived, scoped credentials for accessing only the specific S3 (or GCS, ADLS) prefixes where that table's data lives.

This means query engines never need permanent cloud storage credentials. An engine is granted access to exactly the paths it needs for exactly the duration of the query. This creates a powerful security boundary — even if an engine is compromised, its temporary credentials can only access the data it was explicitly authorized to query.
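A sketch of what consuming a vended response looks like. The JSON below is illustrative of the spec's pattern — storage credentials returned alongside the metadata pointer in the load-table response's `config` map — not an exact payload from any particular server:

```python
import json

# Illustrative load-table response carrying short-lived, scoped S3
# credentials alongside the metadata pointer. The "s3.*" keys are
# examples of the config properties engines understand.
response = json.loads("""
{
  "metadata-location": "s3://lake/sales/metadata/v12.json",
  "config": {
    "s3.access-key-id": "ASIA...EXAMPLE",
    "s3.secret-access-key": "secret-example",
    "s3.session-token": "token-example"
  }
}
""")

creds = response["config"]
# The engine uses these credentials only for this table's prefixes and only
# until they expire; it never holds long-lived, bucket-wide credentials.
```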

Major Catalog Implementations

Apache Polaris (The Open Standard)

Originally developed by Snowflake and donated to the Apache Software Foundation in 2024, Apache Polaris is now the de facto open-source, vendor-neutral implementation of the Iceberg REST Catalog spec.

Key Polaris capabilities:

  1. A complete implementation of the Iceberg REST Catalog spec, so any REST-capable engine can connect.
  2. Credential vending for short-lived, scoped storage access.
  3. Role-based access control over catalogs, namespaces, and tables.
  4. Support for multiple logical catalogs within a single Polaris instance.

Dremio's Open Catalog offering is powered by Apache Polaris, providing a managed, enterprise-grade implementation available through Dremio Cloud.

Project Nessie (Git for Data)

Project Nessie is a catalog with a unique superpower: it provides Git-like branching and tagging semantics for your entire data lakehouse. Just like you branch code in Git to make experimental changes without affecting the main branch, Nessie lets you branch your entire catalog — creating an isolated environment where you can test transformations, validate data pipelines, or run experiments on production-scale data.
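Nessie's model can be pictured as a Git-style pointer structure: each branch names a commit, and each commit is an immutable snapshot of the whole catalog's table pointers. A toy sketch (names and URIs illustrative, not Nessie's actual API):

```python
# Toy branching catalog: branches point at commits; each commit is an
# immutable snapshot of the name -> metadata-location mapping.

class BranchingCatalog:
    def __init__(self):
        self.commits = [{}]             # commit 0: empty catalog
        self.branches = {"main": 0}     # branch name -> commit index

    def create_branch(self, name, source="main"):
        # A new branch is just a new pointer to an existing commit.
        self.branches[name] = self.branches[source]

    def commit(self, branch, table, metadata_uri):
        # Writing creates a new snapshot and advances only this branch.
        snapshot = dict(self.commits[self.branches[branch]])
        snapshot[table] = metadata_uri
        self.commits.append(snapshot)
        self.branches[branch] = len(self.commits) - 1

    def load(self, branch, table):
        return self.commits[self.branches[branch]][table]

cat = BranchingCatalog()
cat.commit("main", "prod.orders", "s3://lake/orders/metadata/v1.json")
cat.create_branch("etl_test")   # isolated environment for an experiment
cat.commit("etl_test", "prod.orders", "s3://lake/orders/metadata/v2.json")
# main is untouched until etl_test is validated and merged
```

Because branches are only pointers, creating one is instant and copies no data — the same reason Git branches are cheap.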

Nessie capabilities:

  1. Branches and tags over the entire catalog, with Git-like semantics.
  2. A full commit history, enabling audit and rollback of catalog-level changes.
  3. Atomic merges, so changes spanning multiple tables land on the main branch together.
  4. An Iceberg REST Catalog endpoint for multi-engine interoperability.

Dremio uses Project Nessie as its internal catalog engine, exposing Nessie's branching capabilities directly in the Dremio UI for data engineers and data platform teams.

AWS Glue Data Catalog

AWS Glue Data Catalog is Amazon's managed metadata catalog, deeply integrated with the AWS ecosystem. It now exposes an Iceberg REST Catalog endpoint natively, meaning AWS Athena, EMR, and other services — as well as external engines — can use it as an Iceberg catalog. For organizations running primarily on AWS, Glue provides low operational overhead and seamless integration with IAM for access control.

Hive Metastore (Legacy)

The Hive Metastore (HMS) remains widely used for Iceberg catalog management, particularly in on-premises Hadoop environments and older cloud deployments. However, HMS was not designed for the REST catalog spec, lacks credential vending, and has significant operational overhead. Most organizations are migrating from HMS to REST-catalog-compatible alternatives.

Catalog Comparison

| Catalog | REST Spec | Cred Vending | Branching | Best For |
|---|---|---|---|---|
| Apache Polaris | ✅ Full | ✅ Yes | ❌ No | Open standard, multi-engine |
| Project Nessie | ✅ Full | ⚠️ Partial | ✅ Yes (Git-like) | Data versioning, Dremio |
| AWS Glue | ✅ Full | ✅ Yes (IAM) | ❌ No | AWS-native deployments |
| Databricks Unity | ⚠️ Partial | ✅ Yes | ❌ No | Databricks ecosystem |
| Hive Metastore | ❌ No | ❌ No | ❌ No | Legacy Hadoop environments |

Multi-Engine Interoperability in Practice

The REST catalog's greatest value is enabling true multi-engine interoperability. Here's what this looks like in a production lakehouse:

  1. Apache Flink streams events from Kafka into Bronze-layer Iceberg tables, committing every 60 seconds. It connects to the REST catalog to register each new snapshot.
  2. Apache Spark runs nightly batch jobs transforming Bronze to Silver. It connects to the same REST catalog, reads the latest snapshot committed by Flink, and writes new Silver snapshots.
  3. Dremio serves sub-second interactive queries for BI dashboards against both Silver and Gold tables. It uses the REST catalog to discover table locations and uses metadata pruning to skip irrelevant files.
  4. PyIceberg (Python library) allows data scientists to read Iceberg tables directly into Pandas or PyArrow DataFrames for model training, again via the same REST catalog endpoint.

Every engine reads and writes to the same physical tables, governed by the same catalog, with the same ACID guarantees — zero data copying required.
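This shared-endpoint setup is mostly configuration. For example, Spark's standard Iceberg runtime properties register a REST catalog like this (the catalog name `lake` and the URI are illustrative; the property names are the standard Iceberg Spark settings):

```properties
# Register the REST catalog as "lake" in a Spark session
spark.sql.catalog.lake=org.apache.iceberg.spark.SparkCatalog
spark.sql.catalog.lake.type=rest
spark.sql.catalog.lake.uri=https://catalog.example.com
```

Flink, Trino, and PyIceberg each have an equivalent handful of settings pointing at the same URI — that is the whole integration surface.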

Conclusion

The Iceberg REST Catalog specification is the connective tissue of the modern data lakehouse. By standardizing how engines discover, read, and commit to Iceberg tables through a simple HTTP API, it eliminates the catalog fragmentation that previously forced organizations to maintain separate data copies for different tools.

Apache Polaris provides the open-source foundation. Project Nessie adds Git-like data versioning. And managed offerings like Dremio's Open Catalog provide enterprise-grade reliability on top of these foundations. Choose based on your priorities: pure openness (Polaris), data versioning (Nessie), or managed simplicity (Dremio or AWS Glue).