What Is a Catalog in the Iceberg Ecosystem?
Apache Iceberg defines a brilliant metadata architecture for organizing data files into versioned, ACID-compliant tables. But Iceberg itself doesn't specify how to find a table's metadata given just a table name. That's the catalog's job.
An Iceberg catalog is a service that maps human-readable table names (like analytics.prod.sales) to the physical URI of that table's current metadata file on object storage. It also enforces the atomic commit protocol — ensuring that when two writers try to update the same table simultaneously, exactly one succeeds and the other retries.
Without a catalog, each query engine would need to be individually configured with hardcoded metadata file paths — an unmanageable approach at any scale. With a centralized catalog, any engine that knows the catalog's address can discover, read, and write any table in the lakehouse.
- Table Discovery: "Given the name prod.orders, what is the current metadata file location?"
- Atomic Commit: "Only allow a new metadata pointer to be committed if the previous version hasn't changed since the writer started."
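These two responsibilities can be sketched as a toy in-memory catalog. This is illustration only, with invented table names and URIs: a real catalog persists the pointer in a durable store and performs the compare-and-swap there, but the contract is the same.

```python
import threading

class InMemoryCatalog:
    """Toy model of a catalog's two jobs: name -> metadata-pointer lookup,
    and compare-and-swap (CAS) commits. Not a real Iceberg client."""

    def __init__(self):
        self._pointers = {}           # table name -> current metadata file URI
        self._lock = threading.Lock()

    def load_table(self, name):
        """Table discovery: return the current metadata file location."""
        return self._pointers[name]

    def commit(self, name, expected, new):
        """Atomic commit: swing the pointer only if nobody else moved it."""
        with self._lock:
            current = self._pointers.get(name)
            if current != expected:
                raise RuntimeError(
                    f"commit conflict on {name}: expected {expected}, found {current}")
            self._pointers[name] = new

cat = InMemoryCatalog()
cat.commit("prod.orders", None, "s3://lake/orders/metadata/v1.json")
cat.commit("prod.orders", "s3://lake/orders/metadata/v1.json",
           "s3://lake/orders/metadata/v2.json")
print(cat.load_table("prod.orders"))  # s3://lake/orders/metadata/v2.json
```

A writer that lost the race (its `expected` pointer is stale) gets a conflict, re-reads the current metadata, and retries — which is exactly how Iceberg achieves serializable commits without table locks.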
The Iceberg REST Catalog Specification
For the first few years of Iceberg's existence, catalogs were tightly coupled to specific backends — Hive Metastore, AWS Glue, or JDBC databases. Each engine needed a custom plugin for each catalog type, creating an M×N compatibility matrix that was difficult to maintain.
The Iceberg REST Catalog Specification (sometimes called the "Iceberg Open API" or "REST Catalog Spec") solved this elegantly. It defines a standard HTTP/REST API that any catalog can implement and any engine can connect to. Once an engine implements the REST catalog client, it automatically works with every catalog that implements the server side of the spec.
Key REST catalog API operations include:
- GET /v1/{prefix}/namespaces — list all namespaces
- GET /v1/{prefix}/namespaces/{namespace}/tables — list all tables in a namespace
- GET /v1/{prefix}/namespaces/{namespace}/tables/{table} — load a table (returns current metadata + credentials)
- POST /v1/{prefix}/namespaces/{namespace}/tables — create a table
- POST /v1/{prefix}/namespaces/{namespace}/tables/{table}/metrics — report scan metrics
- POST /v1/{prefix}/namespaces/{namespace}/tables/{table} — atomic commit of metadata updates (the heart of ACID)
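A client's first job is turning a table identifier into these URLs. Here is a minimal path-builder sketch — the base URL and prefix are hypothetical; per the spec, multi-level namespace parts are joined with the 0x1F unit separator when encoded into the path:

```python
from urllib.parse import quote

def table_endpoints(base_url, prefix, namespace_parts, table):
    """Build REST catalog paths for one table. The spec joins multi-level
    namespaces with the unit separator (0x1F), URL-encoded as %1F."""
    ns = quote("\x1f".join(namespace_parts), safe="")
    root = f"{base_url}/v1/{prefix}/namespaces/{ns}/tables"
    return {
        "list":   root,              # GET: list tables in the namespace
        "load":   f"{root}/{table}", # GET: current metadata + credentials
        "create": root,              # POST: create a table
        "commit": f"{root}/{table}", # POST: atomic metadata update
    }

eps = table_endpoints("https://catalog.example.com", "main", ["prod"], "orders")
print(eps["load"])  # https://catalog.example.com/v1/main/namespaces/prod/tables/orders
```

Note that load and commit target the same path — a GET reads the current state, a POST proposes an atomic update to it.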
```mermaid
graph TD
    Dremio[Dremio] -->|REST API| Cat[Iceberg REST Catalog Server]
    Spark[Apache Spark] -->|REST API| Cat
    Flink[Apache Flink] -->|REST API| Cat
    Trino[Trino] -->|REST API| Cat
    PyIceberg[PyIceberg Python] -->|REST API| Cat
    Cat -->|Persists metadata pointers| S3[(S3 / Object Storage)]
    Cat -->|Returns credentials| Engines["Query Engines (Dremio, Spark…)"]
    Engines -->|Direct read of data files| S3
    style Cat fill:#dbeafe,stroke:#2563eb,stroke-width:2px
    style S3 fill:#dcfce7,stroke:#22c55e
```
Credential Vending: A Critical Security Feature
One of the most underappreciated features of the REST catalog spec is credential vending. When an engine calls the REST catalog to load a table, the catalog response can include short-lived, scoped credentials for accessing only the specific S3 (or GCS, ADLS) prefixes where that table's data lives.
This means query engines never need permanent cloud storage credentials. An engine is granted access to exactly the paths it needs for exactly the duration of the query. This creates a powerful security boundary — even if an engine is compromised, its temporary credentials can only access the data it was explicitly authorized to query.
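In practice, the vended credentials arrive inside the load-table response. The payload below is a trimmed, hypothetical example (real responses carry the full table metadata, and the exact config keys depend on the storage backend); the `s3.*` property names shown are the ones Iceberg FileIO implementations conventionally use:

```python
import json

# Trimmed, hypothetical LoadTableResult; secret values elided with "…".
response = json.loads("""
{
  "metadata-location": "s3://lake/analytics/sales/metadata/00042.metadata.json",
  "metadata": {"location": "s3://lake/analytics/sales"},
  "config": {
    "s3.access-key-id": "ASIA…",
    "s3.secret-access-key": "…",
    "s3.session-token": "…"
  }
}
""")

metadata_location = response["metadata-location"]
# Short-lived credentials scoped to this table's storage prefix:
creds = {k: v for k, v in response["config"].items() if k.startswith("s3.")}
print(metadata_location)
```

The engine plugs these properties into its FileIO layer for the duration of the query; when the session token expires, so does its access.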
Major Catalog Implementations
Apache Polaris (The Open Standard)
Originally developed by Snowflake and donated to the Apache Software Foundation in 2024, Apache Polaris is now the de facto open-source, vendor-neutral implementation of the Iceberg REST Catalog spec.
Key Polaris capabilities:
- Full REST Catalog compliance: Every Iceberg-compatible engine works with Polaris out of the box.
- Credential Vending: Fine-grained, temporary credential issuance per table access.
- Multi-principal governance: RBAC at the namespace, table, and view level.
- Multi-catalog federation: One Polaris instance can manage multiple "external" catalog connections, acting as a federated metadata hub.
Dremio's Open Catalog offering is powered by Apache Polaris, providing a managed, enterprise-grade implementation available through Dremio Cloud.
Project Nessie (Git for Data)
Project Nessie is a catalog with a unique superpower: it provides Git-like branching and tagging semantics for your entire data lakehouse. Just like you branch code in Git to make experimental changes without affecting the main branch, Nessie lets you branch your entire catalog — creating an isolated environment where you can test transformations, validate data pipelines, or run experiments on production-scale data.
Nessie capabilities:
- Catalog-level branching: Create a dev branch, run 10 ETL jobs against it, then merge to main only if the results are correct.
- Atomic multi-table transactions: Because branches can span multiple tables, you can commit changes to 5 tables simultaneously as a single atomic operation.
- Audit history: Every commit to every branch is permanently logged with its author, timestamp, and description.
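The branching model can be sketched with a toy in-memory structure. This is a simplification for intuition only — real Nessie stores a commit log, not whole-catalog snapshots, and merges can detect conflicts — but the semantics it illustrates (isolated branches, atomic multi-table commits, publish-on-merge) are the ones described above:

```python
class ToyNessie:
    """Toy model of catalog-level branching: each branch maps
    table name -> metadata pointer."""

    def __init__(self):
        self.branches = {"main": {}}

    def create_branch(self, name, from_branch="main"):
        # A new branch starts from the source branch's current table pointers.
        self.branches[name] = dict(self.branches[from_branch])

    def commit(self, branch, updates):
        # Multi-table commit: all pointer updates land together on one branch.
        self.branches[branch].update(updates)

    def merge(self, source, target="main"):
        # Publish everything validated on the source branch to the target.
        self.branches[target].update(self.branches[source])

nessie = ToyNessie()
nessie.commit("main", {"orders": "v1", "customers": "v1"})
nessie.create_branch("dev")
nessie.commit("dev", {"orders": "v2", "customers": "v2"})  # atomic 2-table change
print(nessie.branches["main"]["orders"])  # v1 -- main untouched until merge
nessie.merge("dev")
print(nessie.branches["main"]["orders"])  # v2
```

Readers on main never see the half-finished dev state; they see v1 of both tables until the merge, then v2 of both — never a mix.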
Dremio uses Project Nessie as its internal catalog engine, exposing Nessie's branching capabilities directly in the Dremio UI for data engineers and data platform teams.
AWS Glue Data Catalog
AWS Glue Data Catalog is Amazon's managed metadata catalog, deeply integrated with the AWS ecosystem. It now exposes an Iceberg REST Catalog interface, meaning AWS Athena, EMR, and other services can use it as a standard Iceberg catalog. For organizations running primarily on AWS, Glue provides low operational overhead and seamless integration with IAM for access control.
Hive Metastore (Legacy)
The Hive Metastore (HMS) remains widely used for Iceberg catalog management, particularly in on-premises Hadoop environments and older cloud deployments. However, HMS was not designed for the REST catalog spec, lacks credential vending, and has significant operational overhead. Most organizations are migrating from HMS to REST-catalog-compatible alternatives.
Catalog Comparison
| Catalog | REST Spec | Cred Vending | Branching | Best For |
|---|---|---|---|---|
| Apache Polaris | ✅ Full | ✅ Yes | ❌ No | Open standard, multi-engine |
| Project Nessie | ✅ Full | ⚠️ Partial | ✅ Yes (Git-like) | Data versioning, Dremio |
| AWS Glue | ✅ Full | ✅ Yes (IAM) | ❌ No | AWS-native deployments |
| Databricks Unity | ⚠️ Partial | ✅ Yes | ❌ No | Databricks ecosystem |
| Hive Metastore | ❌ No | ❌ No | ❌ No | Legacy Hadoop environments |
Multi-Engine Interoperability in Practice
The REST catalog's greatest value is enabling true multi-engine interoperability. Here's what this looks like in a production lakehouse:
- Apache Flink streams events from Kafka into Bronze-layer Iceberg tables, committing every 60 seconds. It connects to the REST catalog to register each new snapshot.
- Apache Spark runs nightly batch jobs transforming Bronze to Silver. It connects to the same REST catalog, reads the latest snapshot committed by Flink, and writes new Silver snapshots.
- Dremio serves sub-second interactive queries for BI dashboards against both Silver and Gold tables. It uses the REST catalog to discover table locations and uses metadata pruning to skip irrelevant files.
- PyIceberg (Python library) allows data scientists to read Iceberg tables directly into Pandas or PyArrow DataFrames for model training, again via the same REST catalog endpoint.
Every engine reads and writes to the same physical tables, governed by the same catalog, with the same ACID guarantees — zero data copying required.
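As a concrete example of pointing a client at that shared endpoint, a PyIceberg configuration (in `.pyiceberg.yaml`) might look like the sketch below. The catalog name, URI, and warehouse value are hypothetical placeholders; the exact properties depend on your catalog server's authentication scheme:

```yaml
catalog:
  lakehouse:                                    # hypothetical catalog name
    uri: https://catalog.example.com/api/catalog  # hypothetical REST endpoint
    credential: <client-id>:<client-secret>       # OAuth2 client credentials
    warehouse: analytics                          # hypothetical warehouse
```

With that in place, `load_catalog("lakehouse")` gives a data scientist the same view of the same tables that Flink, Spark, and Dremio are writing.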
Conclusion
The Iceberg REST Catalog specification is the connective tissue of the modern data lakehouse. By standardizing how engines discover, read, and commit to Iceberg tables through a simple HTTP API, it eliminates the catalog fragmentation that previously forced organizations to maintain separate data copies for different tools.
Apache Polaris provides the open-source foundation. Project Nessie adds Git-like data versioning. And managed offerings like Dremio's Open Catalog provide enterprise-grade reliability on top of these foundations. Choose based on your priorities: pure openness (Polaris), data versioning (Nessie), or managed simplicity (Dremio or AWS Glue).