What Is Data Governance?

Data governance is the framework of policies, processes, roles, and technologies that ensures an organization's data assets are managed as strategic resources: accurate, trustworthy, accessible to authorized users, protected from unauthorized access, and compliant with applicable regulations.

In the data lakehouse, data governance is both more critical and more technically complex than in traditional data warehouses. More critical because the lakehouse's openness (multiple engines, open formats, direct storage access) creates more potential access paths that must be governed. More complex because governance must span the catalog layer, storage layer, query engine layer, and semantic layer — with policies enforced consistently regardless of which engine or interface accesses the data.

Enterprise lakehouse governance has six pillars: access control, data quality, data lineage, metadata management, data classification, and regulatory compliance.

Access Control: The Catalog Layer

The most important architectural decision in lakehouse governance is where to enforce access control. The correct answer is the catalog layer — enforcing policies in the Iceberg REST Catalog API, where table metadata and storage credentials are served.

Catalog-layer enforcement is engine-agnostic: every engine (Spark, Dremio, Trino, Flink) must load table metadata from the catalog before accessing any data files. If the catalog denies metadata access to an unauthorized principal, that principal cannot access the table regardless of engine. This is fundamentally superior to engine-layer enforcement, where each engine must independently implement and maintain the same access policies.
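
To make the mechanism concrete, here is a minimal Python sketch using PyIceberg to load a table through an Iceberg REST Catalog. The endpoint URL, warehouse name, credential, and table identifier are all hypothetical; the point is that the catalog authenticates the principal and serves (or withholds) metadata and storage credentials before any engine touches a data file.

    # A minimal sketch, assuming a Polaris-style Iceberg REST Catalog at a
    # hypothetical endpoint; credential, warehouse, and table names are
    # illustrative, not any specific vendor's defaults.
    from pyiceberg.catalog import load_catalog

    catalog = load_catalog(
        "lakehouse",
        **{
            "type": "rest",
            "uri": "https://catalog.example.com/api/catalog",  # hypothetical
            "credential": "client_id:client_secret",  # OAuth2 client credentials
            "warehouse": "analytics",
        },
    )

    try:
        # The catalog authenticates the principal, checks its grants, and only
        # then returns table metadata (and, typically, scoped storage credentials).
        table = catalog.load_table("sales.orders")
        print(table.schema())
    except Exception as exc:
        # A principal without a grant never receives metadata or storage
        # credentials, so no engine can read the table's data files.
        print(f"Catalog denied access: {exc}")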

Leading catalog-layer access control implementations include Apache Polaris RBAC, Dremio Open Catalog access policies, AWS Lake Formation column-level and row-level policies, and Unity Catalog fine-grained access control.

[Diagram: Lakehouse Governance Architecture]
Figure 1: Governance enforcement at the catalog layer — engine-agnostic access control for the open lakehouse.

Data Quality Governance

Data quality governance ensures that data meets defined standards for completeness, accuracy, consistency, timeliness, and uniqueness. In the lakehouse context:

  • Quality rules: Defined expectations for each table (e.g., 'order_value must be positive', 'customer_id must be non-null', 'event_date must be within the last 90 days for new records')
  • Quality checks: Automated validation runs (via dbt tests, Great Expectations, or Soda) that evaluate rules against actual data and produce quality metrics
  • Quality metadata: Quality scores and check results stored in the data catalog alongside table descriptions — enabling analysts to assess data trustworthiness before using it
  • Quality gates: Pipeline stages that halt promotion of data to the next Medallion layer if quality checks fail (see the sketch after this list)
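
None of the frameworks above is required to understand the mechanics. The sketch below implements rules, checks, and a gate in plain Python under stated assumptions: the rule names and the sample row are hypothetical, and a real pipeline would source rows from a Silver-layer table.

    # A minimal quality-gate sketch with no specific framework (dbt tests,
    # Great Expectations, and Soda all express the same idea).
    from datetime import date, timedelta

    # Quality rules: one named predicate per expectation.
    RULES = {
        "order_value must be positive": lambda r: r["order_value"] > 0,
        "customer_id must be non-null": lambda r: r["customer_id"] is not None,
        "event_date within last 90 days": (
            lambda r: r["event_date"] >= date.today() - timedelta(days=90)
        ),
    }

    def run_quality_gate(rows):
        """Evaluate every rule against every row; return pass/fail and failure counts."""
        failures = {name: 0 for name in RULES}
        for row in rows:
            for name, check in RULES.items():
                if not check(row):
                    failures[name] += 1
        return all(n == 0 for n in failures.values()), failures

    # Gate a (hypothetical) Silver-to-Gold promotion on the check results.
    rows = [{"order_value": 42.0, "customer_id": "c-1041", "event_date": date.today()}]
    ok, report = run_quality_gate(rows)
    if not ok:
        raise SystemExit(f"Quality gate failed; halting promotion: {report}")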

Data Lineage and Compliance

Data lineage tracks how data flows from source systems through transformations to analytical outputs — which upstream tables each table depends on, which downstream tables and reports depend on each table. Lineage enables: impact analysis (if source table schema changes, which downstream tables break?), root cause investigation (if a metric is wrong, which upstream transformation introduced the error?), and regulatory compliance (prove to auditors that financial report data originated from validated, controlled source systems).

In the lakehouse, lineage is captured at multiple levels: pipeline-level lineage (which Spark jobs produce which Iceberg tables), column-level lineage (which source columns contribute to each target column), and query-level lineage (which tables are read by each query that produces a dashboard or report).
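
To illustrate impact analysis, the sketch below models table-level lineage as a directed graph and walks it breadth-first to find everything downstream of a changed source. The table and report names are hypothetical Medallion-layer examples; in practice these edges are captured automatically from pipeline and query metadata rather than recorded by hand.

    # A minimal sketch of table-level lineage as a directed graph, with
    # impact analysis as a downstream traversal.
    from collections import defaultdict, deque

    downstream = defaultdict(set)  # source -> set of assets derived from it

    def record_lineage(source, target):
        """Record that `target` is derived from `source`."""
        downstream[source].add(target)

    record_lineage("bronze.orders_raw", "silver.orders")
    record_lineage("silver.orders", "gold.daily_revenue")
    record_lineage("gold.daily_revenue", "reports.revenue_dashboard")

    def impact_analysis(table):
        """Return every asset transitively downstream of `table` (breadth-first)."""
        seen, queue = set(), deque([table])
        while queue:
            for child in downstream[queue.popleft()]:
                if child not in seen:
                    seen.add(child)
                    queue.append(child)
        return seen

    # If bronze.orders_raw changes schema, these assets may break:
    print(impact_analysis("bronze.orders_raw"))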

[Diagram: Data Lineage in the Lakehouse]
Figure 2: Data lineage across the Medallion Architecture — source systems to reports with full traceability.

Summary

Data governance is the foundation that transforms a technical data lakehouse into a trusted enterprise asset. Without governance, the lakehouse's openness becomes a liability: inconsistent access, quality issues, and regulatory exposure. With governance (catalog-layer access control, quality management, lineage tracking, and metadata stewardship), it becomes a platform the organization can trust, where data consumers at every level use data confidently, knowing it is accurate, authorized, and compliant. An open lakehouse governed by Apache Polaris or Dremio Open Catalog delivers the same governance rigor as proprietary cloud warehouses, without the vendor lock-in.