What Is a Data Catalog?

A data catalog is a curated, searchable inventory of an organization's data assets — tables, reports, dashboards, ML models, streaming topics, and APIs — enriched with business context. That context makes data discoverable, understandable, and trustworthy for everyone who needs to use it, from data analysts to executives to AI systems.

The distinction from a technical catalog (Iceberg REST Catalog, Hive Metastore) is purpose: technical catalogs are designed for query engines — they store schema metadata, file locations, and partition specs that engines need to execute queries. Business data catalogs are designed for humans — they store descriptions, ownership, quality ratings, business glossary linkages, lineage graphs, and usage statistics that analysts need to find and trust data.

In the modern data lakehouse, both layers are necessary. The technical catalog (Iceberg REST) enables engines like Dremio and Spark to access tables. The business data catalog enables analysts to discover which tables exist, what they mean, how fresh they are, and who owns them — before they write a single query.
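The two layers can be contrasted by sketching the metadata each one holds for the same table. This is an illustrative sketch only — the field names below are simplified, not the actual schema of any Iceberg REST Catalog or business catalog:

```python
# Sketch: the same table as seen by each catalog layer.
# Field names and values are illustrative, not any catalog's real schema.

# Technical catalog view: what a query engine needs to execute queries.
technical_metadata = {
    "table": "sales.orders",
    "schema": [("order_id", "bigint"), ("amount", "decimal(10,2)"), ("order_ts", "timestamp")],
    "location": "s3://warehouse/sales/orders/",          # hypothetical path
    "partition_spec": ["days(order_ts)"],
    "current_snapshot_id": 874623,
}

# Business catalog view: what a human needs to find and trust the table.
business_metadata = {
    "table": "sales.orders",
    "description": "One row per customer order, refreshed hourly.",
    "owner": "sales-data-team@example.com",              # hypothetical owner
    "glossary_terms": ["Order", "Gross Revenue"],
    "quality": {"freshness_hours": 1, "completeness": 0.998},
    "classifications": ["contains-pii"],
}

# Both layers describe the same asset, keyed by the same table name,
# but serve different audiences: engines versus humans.
print(technical_metadata["table"] == business_metadata["table"])  # True
```

The shared table identifier is what lets a business catalog link its human-facing context back to the physical table that engines actually query.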

Core Data Catalog Capabilities

A full-featured data catalog provides:

  • Asset discovery: Full-text search across all registered assets with filtering by type, domain, owner, quality, and classification
  • Business descriptions: Human-readable descriptions of tables, columns, and their business meaning — written by data owners and data stewards
  • Business glossary: Centralized definitions of business terms (LTV, churn, conversion) linked to specific tables and columns that implement those concepts
  • Data lineage: Visual graph showing how data flows from source systems through transformations to reports — enabling impact analysis and root cause investigation
  • Data quality: Quality metrics and profiling results (completeness, uniqueness, freshness, anomalies) displayed alongside asset metadata
  • Ownership and stewardship: Clear data ownership assignment with contact information for questions and access requests
  • Classifications and tags: PII classification, sensitivity labels, regulatory scope tags

Figure 1: A data catalog sits above the technical catalog — adding business context for human discovery and governance.
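Two of the capabilities above — asset discovery and lineage-based impact analysis — can be sketched with a minimal in-memory catalog. This is a toy model under a deliberately simplified asset schema; real catalogs such as DataHub and OpenMetadata store far richer metadata and back it with a search index and graph store:

```python
from dataclasses import dataclass, field

@dataclass
class Asset:
    """Simplified catalog entry; real catalogs track many more fields."""
    name: str
    asset_type: str                                   # "table", "dashboard", ...
    owner: str
    description: str
    upstream: list = field(default_factory=list)      # names of direct inputs

class Catalog:
    def __init__(self):
        self.assets = {}

    def register(self, asset: Asset):
        self.assets[asset.name] = asset

    def search(self, text: str, asset_type: str = None):
        """Naive full-text search over names and descriptions, with a type filter."""
        text = text.lower()
        return [
            a for a in self.assets.values()
            if (text in a.name.lower() or text in a.description.lower())
            and (asset_type is None or a.asset_type == asset_type)
        ]

    def downstream_impact(self, name: str):
        """Impact analysis: every asset that transitively depends on `name`."""
        impacted, frontier = set(), {name}
        while frontier:
            nxt = {
                a.name for a in self.assets.values()
                if any(u in frontier for u in a.upstream) and a.name not in impacted
            }
            impacted |= nxt
            frontier = nxt
        return impacted

# Hypothetical assets for illustration.
cat = Catalog()
cat.register(Asset("raw.orders", "table", "ingest-team", "Raw order events"))
cat.register(Asset("sales.orders", "table", "sales-team", "Cleaned orders", upstream=["raw.orders"]))
cat.register(Asset("rev_dashboard", "dashboard", "bi-team", "Revenue overview", upstream=["sales.orders"]))

print([a.name for a in cat.search("orders", asset_type="table")])
print(cat.downstream_impact("raw.orders"))  # both downstream assets are affected
```

The `downstream_impact` walk is the core of the "who breaks if this table changes?" question that lineage answers; production catalogs run the same traversal over a persisted lineage graph rather than a Python dictionary.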

Open Source Data Catalogs

The open-source data catalog ecosystem has matured significantly:

DataHub (LinkedIn)

Originally developed at LinkedIn and open-sourced, DataHub is a metadata platform with push-based ingestion from hundreds of data sources, lineage tracking, business glossary, and ML feature store integration.

OpenMetadata

A modern, API-first open-source data catalog with a strong focus on data quality integration, automated profiling, and collaboration features. Widely adopted in the lakehouse community.

Apache Atlas

The original open-source metadata governance framework from the Hadoop ecosystem, deeply integrated with Apache Ranger for access control. Widely deployed in on-premises Hadoop environments.

Figure 2: Open-source and commercial data catalog options for the modern lakehouse.

Summary

A data catalog is the business intelligence layer of the data lakehouse — transforming raw technical metadata into a curated, searchable, trusted inventory that enables self-service analytics at scale. Without a catalog, the lakehouse is a sophisticated storage and query system that only specialists can navigate. With a catalog, every analyst can discover trusted data, understand its business meaning, verify its quality, and begin analysis with confidence. For organizations building towards self-service analytics, the data catalog is as essential as the query engine itself.