What Is a Data Catalog?

A data catalog is a curated, searchable inventory of an organization's data assets — tables, reports, dashboards, ML models, streaming topics, and APIs — enriched with business context. That context makes data discoverable, understandable, and trustworthy for everyone who needs to use it, from data analysts to executives to AI systems.

The distinction from a technical catalog (Iceberg REST Catalog, Hive Metastore) is purpose: technical catalogs are designed for query engines — they store schema metadata, file locations, and partition specs that engines need to execute queries. Business data catalogs are designed for humans — they store descriptions, ownership, quality ratings, business glossary linkages, lineage graphs, and usage statistics that analysts need to find and trust data.

In the modern data lakehouse, both layers are necessary. The technical catalog (Iceberg REST) enables engines like Dremio and Spark to access tables. The business data catalog enables analysts to discover which tables exist, what they mean, how fresh they are, and who owns them — before they write a single query.
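The two layers can be contrasted by sketching the metadata each one holds for the same table. This is an illustrative sketch only — the field names below are simplified, not the actual schema of any Iceberg REST Catalog or business catalog:

```python
# Sketch: the same table as seen by each catalog layer.
# Field names and values are illustrative, not any catalog's real schema.

# Technical catalog view: what a query engine needs to execute queries.
technical_metadata = {
    "table": "sales.orders",
    "schema": [("order_id", "bigint"), ("amount", "decimal(10,2)"), ("order_ts", "timestamp")],
    "location": "s3://warehouse/sales/orders/",          # hypothetical path
    "partition_spec": ["days(order_ts)"],
    "current_snapshot_id": 874623,
}

# Business catalog view: what a human needs to find and trust the table.
business_metadata = {
    "table": "sales.orders",
    "description": "One row per customer order, refreshed hourly.",
    "owner": "sales-data-team@example.com",              # hypothetical owner
    "glossary_terms": ["Order", "Gross Revenue"],
    "quality": {"freshness_hours": 1, "completeness": 0.998},
    "classifications": ["contains-pii"],
}

# Both layers describe the same asset, keyed by the same table name,
# but serve different audiences: engines versus humans.
print(technical_metadata["table"] == business_metadata["table"])  # True
```

The shared table identifier is what lets a business catalog link its human-facing context back to the physical table that engines actually query.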

Core Data Catalog Capabilities

A full-featured data catalog provides:

  • Asset discovery: Full-text search across all registered assets with filtering by type, domain, owner, quality, and classification
  • Business descriptions: Human-readable descriptions of tables, columns, and their business meaning — written by data owners and data stewards
  • Business glossary: Centralized definitions of business terms (LTV, churn, conversion) linked to specific tables and columns that implement those concepts
  • Data lineage: Visual graph showing how data flows from source systems through transformations to reports — enabling impact analysis and root cause investigation
  • Data quality: Quality metrics and profiling results (completeness, uniqueness, freshness, anomalies) displayed alongside asset metadata
  • Ownership and stewardship: Clear data ownership assignment with contact information for questions and access requests
  • Classifications and tags: PII classification, sensitivity labels, regulatory scope tags

Figure 1: A data catalog sits above the technical catalog — adding business context for human discovery and governance.
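Two of the capabilities above — asset discovery and lineage-based impact analysis — can be sketched with a minimal in-memory catalog. This is a toy model under a deliberately simplified asset schema; real catalogs such as DataHub and OpenMetadata store far richer metadata and back it with a search index and graph store:

```python
from dataclasses import dataclass, field

@dataclass
class Asset:
    """Simplified catalog entry; real catalogs track many more fields."""
    name: str
    asset_type: str                                   # "table", "dashboard", ...
    owner: str
    description: str
    upstream: list = field(default_factory=list)      # names of direct inputs

class Catalog:
    def __init__(self):
        self.assets = {}

    def register(self, asset: Asset):
        self.assets[asset.name] = asset

    def search(self, text: str, asset_type: str = None):
        """Naive full-text search over names and descriptions, with a type filter."""
        text = text.lower()
        return [
            a for a in self.assets.values()
            if (text in a.name.lower() or text in a.description.lower())
            and (asset_type is None or a.asset_type == asset_type)
        ]

    def downstream_impact(self, name: str):
        """Impact analysis: every asset that transitively depends on `name`."""
        impacted, frontier = set(), {name}
        while frontier:
            nxt = {
                a.name for a in self.assets.values()
                if any(u in frontier for u in a.upstream) and a.name not in impacted
            }
            impacted |= nxt
            frontier = nxt
        return impacted

# Hypothetical assets for illustration.
cat = Catalog()
cat.register(Asset("raw.orders", "table", "ingest-team", "Raw order events"))
cat.register(Asset("sales.orders", "table", "sales-team", "Cleaned orders", upstream=["raw.orders"]))
cat.register(Asset("rev_dashboard", "dashboard", "bi-team", "Revenue overview", upstream=["sales.orders"]))

print([a.name for a in cat.search("orders", asset_type="table")])
print(cat.downstream_impact("raw.orders"))  # both downstream assets are affected
```

The `downstream_impact` walk is the core of the "who breaks if this table changes?" question that lineage answers; production catalogs run the same traversal over a persisted lineage graph rather than a Python dictionary.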

Open Source Data Catalogs

The open-source data catalog ecosystem has matured significantly:

DataHub (LinkedIn)

Originally developed at LinkedIn and open-sourced, DataHub is a metadata platform with push-based ingestion from hundreds of data sources, lineage tracking, business glossary, and ML feature store integration.

OpenMetadata

A modern, API-first open-source data catalog with a strong focus on data quality integration, automated profiling, and collaboration features. Widely adopted in the lakehouse community.

Apache Atlas

The original open-source metadata governance framework from the Hadoop ecosystem, deeply integrated with Apache Ranger for access control. Widely deployed in on-premises Hadoop environments.

Figure 2: Open-source and commercial data catalog options for the modern lakehouse.

Summary

A data catalog is the business intelligence layer of the data lakehouse — transforming raw technical metadata into a curated, searchable, trusted inventory that enables self-service analytics at scale. Without a catalog, the lakehouse is a sophisticated storage and query system that only specialists can navigate. With a catalog, every analyst can discover trusted data, understand its business meaning, verify its quality, and begin analysis with confidence. For organizations building towards self-service analytics, the data catalog is as essential as the query engine itself.