What Is Data Governance?

Data governance is the framework of policies, processes, roles, and technologies that ensures an organization's data assets are managed as strategic resources: accurate, trustworthy, accessible to authorized users, protected from unauthorized access, and compliant with applicable regulations.

In the data lakehouse, data governance is both more critical and more technically complex than in traditional data warehouses. More critical because the lakehouse's openness (multiple engines, open formats, direct storage access) creates more potential access paths that must be governed. More complex because governance must span the catalog layer, storage layer, query engine layer, and semantic layer — with policies enforced consistently regardless of which engine or interface accesses the data.

Enterprise lakehouse governance has six pillars: access control, data quality, data lineage, metadata management, data classification, and regulatory compliance.

Access Control: The Catalog Layer

The most important architectural decision in lakehouse governance is where to enforce access control. The correct answer is the catalog layer — enforcing policies in the Iceberg REST Catalog API, where table metadata and storage credentials are served.

Catalog-layer enforcement is engine-agnostic: every engine (Spark, Dremio, Trino, Flink) must load table metadata from the catalog before accessing any data files. If the catalog denies metadata access to an unauthorized principal, that principal cannot access the table regardless of engine. This is fundamentally superior to engine-layer enforcement, where each engine must independently implement and maintain the same access policies.
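
To make the mechanism concrete, here is a minimal Python sketch using PyIceberg to load a table through an Iceberg REST Catalog. The endpoint URL, warehouse name, credential, and table identifier are all hypothetical; the point is that the catalog authenticates the principal and serves (or withholds) metadata and storage credentials before any engine touches a data file.

    # A minimal sketch, assuming a Polaris-style Iceberg REST Catalog at a
    # hypothetical endpoint; credential, warehouse, and table names are
    # illustrative, not any specific vendor's defaults.
    from pyiceberg.catalog import load_catalog

    catalog = load_catalog(
        "lakehouse",
        **{
            "type": "rest",
            "uri": "https://catalog.example.com/api/catalog",  # hypothetical
            "credential": "client_id:client_secret",  # OAuth2 client credentials
            "warehouse": "analytics",
        },
    )

    try:
        # The catalog authenticates the principal, checks its grants, and only
        # then returns table metadata (and, typically, scoped storage credentials).
        table = catalog.load_table("sales.orders")
        print(table.schema())
    except Exception as exc:
        # A principal without a grant never receives metadata or storage
        # credentials, so no engine can read the table's data files.
        print(f"Catalog denied access: {exc}")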

Leading catalog-layer access control implementations include Apache Polaris RBAC, Dremio Open Catalog access policies, AWS Lake Formation column-level and row-level policies, and Unity Catalog fine-grained access control.

[Diagram: Lakehouse Governance Architecture]
Figure 1: Governance enforcement at the catalog layer — engine-agnostic access control for the open lakehouse.

Data Quality Governance

Data quality governance ensures that data meets defined standards for completeness, accuracy, consistency, timeliness, and uniqueness. In the lakehouse context:

  • Quality rules: Defined expectations for each table (e.g., 'order_value must be positive', 'customer_id must be non-null', 'event_date must be within the last 90 days for new records')
  • Quality checks: Automated validation runs (via dbt tests, Great Expectations, or Soda) that evaluate rules against actual data and produce quality metrics
  • Quality metadata: Quality scores and check results stored in the data catalog alongside table descriptions — enabling analysts to assess data trustworthiness before using it
  • Quality gates: Pipeline stages that halt promotion of data to the next Medallion layer if quality checks fail (see the sketch after this list)
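
None of the frameworks above is required to understand the mechanics. The sketch below implements rules, checks, and a gate in plain Python under stated assumptions: the rule names and the sample row are hypothetical, and a real pipeline would source rows from a Silver-layer table.

    # A minimal quality-gate sketch with no specific framework (dbt tests,
    # Great Expectations, and Soda all express the same idea).
    from datetime import date, timedelta

    # Quality rules: one named predicate per expectation.
    RULES = {
        "order_value must be positive": lambda r: r["order_value"] > 0,
        "customer_id must be non-null": lambda r: r["customer_id"] is not None,
        "event_date within last 90 days": (
            lambda r: r["event_date"] >= date.today() - timedelta(days=90)
        ),
    }

    def run_quality_gate(rows):
        """Evaluate every rule against every row; return pass/fail and failure counts."""
        failures = {name: 0 for name in RULES}
        for row in rows:
            for name, check in RULES.items():
                if not check(row):
                    failures[name] += 1
        return all(n == 0 for n in failures.values()), failures

    # Gate a (hypothetical) Silver-to-Gold promotion on the check results.
    rows = [{"order_value": 42.0, "customer_id": "c-1041", "event_date": date.today()}]
    ok, report = run_quality_gate(rows)
    if not ok:
        raise SystemExit(f"Quality gate failed; halting promotion: {report}")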

Data Lineage and Compliance

Data lineage tracks how data flows from source systems through transformations to analytical outputs — which upstream tables each table depends on, which downstream tables and reports depend on each table. Lineage enables: impact analysis (if source table schema changes, which downstream tables break?), root cause investigation (if a metric is wrong, which upstream transformation introduced the error?), and regulatory compliance (prove to auditors that financial report data originated from validated, controlled source systems).

In the lakehouse, lineage is captured at multiple levels: pipeline-level lineage (which Spark jobs produce which Iceberg tables), column-level lineage (which source columns contribute to each target column), and query-level lineage (which tables are read by each query that produces a dashboard or report).
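
To illustrate impact analysis, the sketch below models table-level lineage as a directed graph and walks it breadth-first to find everything downstream of a changed source. The table and report names are hypothetical Medallion-layer examples; in practice these edges are captured automatically from pipeline and query metadata rather than recorded by hand.

    # A minimal sketch of table-level lineage as a directed graph, with
    # impact analysis as a downstream traversal.
    from collections import defaultdict, deque

    downstream = defaultdict(set)  # source -> set of assets derived from it

    def record_lineage(source, target):
        """Record that `target` is derived from `source`."""
        downstream[source].add(target)

    record_lineage("bronze.orders_raw", "silver.orders")
    record_lineage("silver.orders", "gold.daily_revenue")
    record_lineage("gold.daily_revenue", "reports.revenue_dashboard")

    def impact_analysis(table):
        """Return every asset transitively downstream of `table` (breadth-first)."""
        seen, queue = set(), deque([table])
        while queue:
            for child in downstream[queue.popleft()]:
                if child not in seen:
                    seen.add(child)
                    queue.append(child)
        return seen

    # If bronze.orders_raw changes schema, these assets may break:
    print(impact_analysis("bronze.orders_raw"))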

[Diagram: Data Lineage in the Lakehouse]
Figure 2: Data lineage across the Medallion Architecture — source systems to reports with full traceability.

Summary

Data governance is the foundation that transforms a technical data lakehouse into a trusted enterprise asset. Without governance, the lakehouse's openness becomes a liability: inconsistent access, quality issues, and regulatory exposure. With governance (catalog-layer access control, quality management, lineage tracking, and metadata stewardship), it becomes a platform the organization can trust, where data consumers at every level use data confidently, knowing it is accurate, authorized, and compliant. An open lakehouse governed by Apache Polaris or Dremio Open Catalog delivers the same governance rigor as proprietary cloud warehouses, without the vendor lock-in.