What Is AWS Glue Data Catalog?

AWS Glue Data Catalog is Amazon Web Services' serverless, managed metadata catalog: the central repository for table schemas, partition metadata, and data locations for data stored in Amazon S3. It serves as the catalog for the AWS analytics ecosystem: Amazon Athena reads table metadata from Glue, Amazon EMR (Spark/Hive) uses Glue in place of a Hive Metastore (HMS), Amazon Redshift Spectrum queries Glue-registered tables in S3, and AWS Lake Formation uses Glue as its metadata backbone for fine-grained access control.
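To make "table schemas, partition metadata, and data locations" concrete, here is a minimal sketch that pulls those three pieces out of a single catalog entry. The record is hypothetical (the `sales.orders` table and the bucket name are invented for illustration); its field names follow the shape of the `Table` object returned by Glue's GetTable API.

```python
from typing import Any, Dict


def summarize_glue_table(table: Dict[str, Any]) -> Dict[str, Any]:
    """Extract schema, partition keys, and data location from a Glue table record."""
    storage = table.get("StorageDescriptor", {})
    return {
        "name": table["Name"],
        # Regular (non-partition) columns live under the storage descriptor.
        "columns": [(c["Name"], c["Type"]) for c in storage.get("Columns", [])],
        # Partition columns are listed separately from regular columns.
        "partition_keys": [k["Name"] for k in table.get("PartitionKeys", [])],
        # Where the data itself sits in S3.
        "location": storage.get("Location"),
    }


# Hypothetical record, shaped like the "Table" field of a GetTable response.
example_table = {
    "Name": "orders",
    "DatabaseName": "sales",
    "StorageDescriptor": {
        "Columns": [
            {"Name": "order_id", "Type": "bigint"},
            {"Name": "amount", "Type": "double"},
        ],
        "Location": "s3://example-bucket/sales/orders/",
    },
    "PartitionKeys": [{"Name": "order_date", "Type": "date"}],
}

summary = summarize_glue_table(example_table)
print(summary)
```

With boto3 installed and AWS credentials configured, a real record could be fetched with `boto3.client("glue").get_table(DatabaseName="sales", Name="orders")["Table"]` and passed to the same function.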

Glue Data Catalog began as a Hive Metastore (HMS)-compatible catalog. Rather than exposing the HMS Thrift endpoint itself, it is reached through a Glue client that implements the Hive metastore client interface, which lets it stand in for a self-hosted HMS on engines such as EMR while providing serverless scaling and AWS-managed availability. Over time, AWS has added native Apache Iceberg support and an Iceberg REST Catalog API, positioning Glue as the managed catalog for the AWS-based open lakehouse.

Glue and Apache Iceberg

AWS Glue supports Apache Iceberg in two modes:

Iceberg Tables in Glue (Backend Mode)

Spark jobs running on Amazon EMR can write Iceberg tables using Glue as the catalog backend. Glue stores only a pointer to the current metadata file for each Iceberg table; the metadata file itself, the manifest lists and manifests it references, and the data files all live in S3. This is the most common EMR + Iceberg pattern on AWS.
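The backend mode can be sketched from both sides: the Spark settings that point an Iceberg catalog at Glue, and the pointer Glue keeps for each table. The property names below are the ones used by the Iceberg AWS module (`GlueCatalog`, `S3FileIO`), and the `metadata_location` and `table_type` parameters are what that catalog writes into the Glue table; the catalog name, bucket, and sample record are hypothetical, and the exact settings should be checked against the Iceberg version in use.

```python
from typing import Any, Dict, Optional


def glue_iceberg_spark_conf(catalog: str, warehouse: str) -> Dict[str, str]:
    """Build Spark settings for an Iceberg catalog backed by Glue Data Catalog."""
    prefix = f"spark.sql.catalog.{catalog}"
    return {
        "spark.sql.extensions": (
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions"
        ),
        prefix: "org.apache.iceberg.spark.SparkCatalog",
        f"{prefix}.catalog-impl": "org.apache.iceberg.aws.glue.GlueCatalog",
        f"{prefix}.io-impl": "org.apache.iceberg.aws.s3.S3FileIO",
        f"{prefix}.warehouse": warehouse,
    }


def current_metadata_location(table: Dict[str, Any]) -> Optional[str]:
    """Return the Iceberg metadata pointer from a Glue table record, if any."""
    params = table.get("Parameters", {})
    if params.get("table_type", "").upper() != "ICEBERG":
        return None  # not an Iceberg table, so there is no pointer to read
    return params.get("metadata_location")


# Hypothetical catalog name and warehouse path.
conf = glue_iceberg_spark_conf("glue", "s3://example-bucket/warehouse/")

# Hypothetical Glue table record for an Iceberg table.
iceberg_table = {
    "Name": "orders",
    "Parameters": {
        "table_type": "ICEBERG",
        "metadata_location": (
            "s3://example-bucket/warehouse/sales.db/orders/metadata/00003-abc.metadata.json"
        ),
    },
}
pointer = current_metadata_location(iceberg_table)

for key, value in conf.items():
    print(f"--conf {key}={value}")
print(pointer)
```

Each commit to the table writes a new metadata file to S3 and swaps this single pointer in Glue, which is why the catalog entry stays small no matter how large the table grows.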

Iceberg REST Catalog API

Glue also exposes an Iceberg REST Catalog API endpoint, so any Iceberg REST catalog client can connect to Glue directly, with no Hive Metastore client configuration. Spark, Trino, Flink, and other engines can therefore use Glue as a REST catalog through their standard Iceberg catalog configuration.
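A rough sketch of what that standard configuration might look like for Spark follows. The `type`, `uri`, and `warehouse` keys are generic Iceberg REST catalog properties; the endpoint pattern, the use of the AWS account ID as the warehouse, and the SigV4 signing properties reflect AWS's documented setup as best I recall it, so treat them as assumptions to verify against current AWS documentation. The catalog name, region, and account ID are placeholders.

```python
from typing import Dict


def glue_rest_catalog_conf(catalog: str, region: str, account_id: str) -> Dict[str, str]:
    """Build Spark settings for reaching Glue through its Iceberg REST endpoint."""
    prefix = f"spark.sql.catalog.{catalog}"
    return {
        prefix: "org.apache.iceberg.spark.SparkCatalog",
        f"{prefix}.type": "rest",
        # Assumed endpoint pattern for the Glue Iceberg REST API.
        f"{prefix}.uri": f"https://glue.{region}.amazonaws.com/iceberg",
        # Assumption: the warehouse identifies the Glue catalog by account ID.
        f"{prefix}.warehouse": account_id,
        # Requests are signed with AWS SigV4 instead of a bearer token.
        f"{prefix}.rest.sigv4-enabled": "true",
        f"{prefix}.rest.signing-name": "glue",
        f"{prefix}.rest.signing-region": region,
    }


# Placeholder values for illustration only.
rest_conf = glue_rest_catalog_conf("glue_rest", "us-east-1", "123456789012")
for key, value in rest_conf.items():
    print(f"--conf {key}={value}")
```

Compared with the backend mode, nothing here names a Glue-specific Java class: the engine only needs a generic REST catalog client plus request signing.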

Figure 1: Glue Data Catalog in the AWS lakehouse (HMS-compatible and Iceberg REST API for all engines).

Glue and AWS Lake Formation

AWS Lake Formation is AWS's data lake governance service, built on top of Glue Data Catalog. Lake Formation adds fine-grained access control to Glue-registered tables: column-level permissions, row-level filters, tag-based access policies, and cross-account data sharing.

For organizations building a governed lakehouse on AWS, the Glue + Lake Formation combination provides the access control layer without requiring a separate catalog deployment. Lake Formation permissions are enforced at the Glue catalog API level, so integrated engines such as Athena and EMR that query Glue-registered Iceberg tables are subject to the same Lake Formation policies, whichever engine makes the request.
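As an illustration of column-level permissions, the sketch below builds the request for a Lake Formation grant that lets one principal run SELECT on a subset of columns. The structure mirrors the parameters of the Lake Formation GrantPermissions API; the role ARN, database, table, and column names are hypothetical.

```python
from typing import Any, Dict, Iterable


def column_select_grant(
    principal_arn: str, database: str, table: str, columns: Iterable[str]
) -> Dict[str, Any]:
    """Build a GrantPermissions request limiting SELECT to the given columns."""
    return {
        "Principal": {"DataLakePrincipalIdentifier": principal_arn},
        "Resource": {
            "TableWithColumns": {
                "DatabaseName": database,
                "Name": table,
                # Only these columns become visible to the principal.
                "ColumnNames": list(columns),
            }
        },
        "Permissions": ["SELECT"],
    }


# Hypothetical analyst role that may read order IDs and dates but not amounts.
grant_request = column_select_grant(
    "arn:aws:iam::123456789012:role/analyst",
    "sales",
    "orders",
    ["order_id", "order_date"],
)
print(grant_request)
```

With boto3 and suitable credentials, the request could be applied with `boto3.client("lakeformation").grant_permissions(**grant_request)`. Because the grant is attached to the catalog resource rather than to an engine, it does not need to be repeated per query engine.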

Glue vs. Polaris and Nessie

| Dimension | AWS Glue | Apache Polaris | Project Nessie |
|---|---|---|---|
| Deployment | Managed AWS service | Self-hosted or managed | Self-hosted (open source) |
| Open source | No (proprietary) | Yes (ASF) | Yes (Apache 2.0 license) |
| Cloud portability | AWS only | Any cloud | Any cloud |
| Git-like branching | No | No | Yes |
| REST Catalog API | Yes | Yes | Yes |
| Lake Formation integration | Native | No | No |

Figure 2: Glue vs open catalog alternatives (AWS-native governance vs cloud-portable open standards).

Summary

AWS Glue Data Catalog is the practical default catalog for AWS-based data lakehouses. Its native integration with Athena, EMR, Redshift Spectrum, and Lake Formation makes it the path of least resistance for AWS-centric organizations. For organizations prioritizing cloud portability, open governance, or Git-like data versioning, Apache Polaris and Project Nessie are the open alternatives. Many AWS lakehouse deployments start with Glue and evaluate migration to open catalogs as multi-cloud or portability requirements emerge.