What Is Data Lineage?

Data lineage is the systematic tracking of data's journey through an organization's data systems — from the original source (an operational database, an API, a sensor) through every transformation, join, aggregation, and copy, to its final use in reports, dashboards, ML models, or analytical decisions. Lineage maps the full graph of data dependencies: for each data asset, lineage tells you where it came from (upstream lineage) and where it goes (downstream lineage).

In the data lakehouse, lineage typically spans: source operational databases → Kafka CDC events → Bronze Iceberg tables → Silver Iceberg tables (after cleansing) → Gold Iceberg tables (after aggregation) → Dremio Virtual Datasets (semantic layer) → BI dashboards and reports. Without captured lineage, this chain is invisible — when a dashboard shows wrong numbers, finding which upstream transformation introduced the error requires manual investigation of every pipeline step.
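To make the upstream/downstream distinction concrete, the sketch below models a simplified version of this chain as a plain Python dependency graph and answers both questions for a given asset. The asset names are illustrative and not tied to any particular tool or catalog.

```python
from collections import defaultdict

# Illustrative lineage edges: each asset maps to the assets it is derived from.
# The names are hypothetical examples of the chain described above.
UPSTREAM = {
    "bronze.orders": ["kafka.orders_cdc"],
    "silver.orders": ["bronze.orders"],
    "gold.daily_revenue": ["silver.orders"],
    "dremio.vds_revenue": ["gold.daily_revenue"],
    "bi.revenue_dashboard": ["dremio.vds_revenue"],
}

# Invert the edges so we can also walk downstream.
DOWNSTREAM = defaultdict(list)
for target, sources in UPSTREAM.items():
    for source in sources:
        DOWNSTREAM[source].append(target)

def walk(asset, edges):
    """Return every asset reachable from `asset` by following `edges` transitively."""
    seen, stack = set(), list(edges.get(asset, []))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(edges.get(current, []))
    return seen

# Upstream lineage: where did this dashboard's data come from?
print(walk("bi.revenue_dashboard", UPSTREAM))
# Downstream lineage / impact analysis: what is affected if bronze.orders changes?
print(walk("bronze.orders", DOWNSTREAM))
```

These two traversals are essentially what lineage UIs render as the upstream and downstream views of an asset.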

Table-Level vs Column-Level Lineage

Lineage granularity has a significant impact on its utility:

Table-Level Lineage

Records which tables are read and written by each pipeline job. Captures the dependency graph between Bronze, Silver, and Gold Iceberg tables. Sufficient for impact analysis (which downstream tables break if a source table schema changes?) and pipeline debugging at a coarse level. Relatively easy to capture from pipeline orchestration metadata (Airflow DAG definitions).
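As a sketch of what that capture can look like, the snippet below declares table-level inlets and outlets on an Airflow task (assuming Airflow 2.4+ and its Dataset API; the DAG, task, and table URIs are hypothetical). Lineage platforms with Airflow integrations, such as DataHub, can harvest these declarations to build the table-level graph.

```python
from datetime import datetime

from airflow import DAG
from airflow.datasets import Dataset
from airflow.operators.python import PythonOperator

def build_silver_orders():
    # Placeholder for the real cleansing/transformation job (Spark, dbt, etc.).
    pass

with DAG(dag_id="orders_bronze_to_silver",
         start_date=datetime(2024, 1, 1),
         schedule=None) as dag:
    # inlets/outlets record which tables this task reads and writes,
    # giving lineage tools the table-level dependency edge.
    PythonOperator(
        task_id="clean_orders",
        python_callable=build_silver_orders,
        inlets=[Dataset("iceberg://lakehouse/bronze/orders")],
        outlets=[Dataset("iceberg://lakehouse/silver/orders")],
    )
```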

Column-Level Lineage

Records which specific source columns contribute to each target column in the output. Required for regulatory compliance (proving that a financial metric derives from validated, controlled source data) and for precise debugging (which upstream column mapping introduced a calculation error in a specific metric). Harder to capture — requires SQL parsing or engine-native API integration.
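One way to approximate that capture is to parse the transformation SQL and record which source columns feed each output column. The sketch below uses the open-source sqlglot parser on a single, made-up SELECT; a production implementation would also need to resolve star expansion, CTEs, subqueries, and table aliases against the catalog.

```python
import sqlglot
from sqlglot import exp

sql = """
SELECT
    o.order_id,
    o.amount * fx.rate AS revenue_usd,
    o.order_date       AS revenue_date
FROM silver.orders o
JOIN silver.fx_rates fx ON o.currency = fx.currency
"""

query = sqlglot.parse_one(sql)

# Map each output column to the source columns referenced in its expression.
column_lineage = {}
for projection in query.selects:
    sources = {
        f"{col.table}.{col.name}" if col.table else col.name
        for col in projection.find_all(exp.Column)
    }
    column_lineage[projection.alias_or_name] = sorted(sources)

print(column_lineage)
# e.g. {'order_id': ['o.order_id'],
#       'revenue_usd': ['fx.rate', 'o.amount'],
#       'revenue_date': ['o.order_date']}
```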

Figure 1: Data lineage levels — table-level dependency graph and column-level transformation tracking.

Lineage Tools for the Lakehouse

Key data lineage tools for the lakehouse ecosystem:

  • DataHub (originated at LinkedIn, now open source): Automatic lineage ingestion from Spark, Airflow, dbt, and SQL query logs. Visualizes lineage graphs in a web UI. One of the most widely adopted open-source metadata and lineage platforms.
  • OpenMetadata: Unified metadata platform with built-in lineage capture from pipeline metadata and SQL parsing. Strong Iceberg catalog integration.
  • Apache Atlas: The Hadoop-era governance and lineage platform. Deeply integrated with Apache Ranger for combined lineage + access control governance.
  • Unity Catalog (Databricks): Captures column-level lineage automatically from Spark SQL operations within Databricks workspaces — the most comprehensive automated lineage for Databricks users.
  • dbt lineage: dbt's manifest.json captures table-level lineage for all dbt model dependencies — visualized in dbt docs and ingested by DataHub or OpenMetadata (see the parsing sketch below).

Figure 2: Data lineage tool ecosystem — DataHub, OpenMetadata, Atlas, Unity Catalog for lakehouse lineage.
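As an example of how lightweight table-level extraction can be, the sketch below reads the parent_map from a dbt manifest and prints the upstream tree for one model. This is a minimal sketch assuming a standard target/manifest.json layout; the project and model names are hypothetical.

```python
import json
from pathlib import Path

# dbt writes manifest.json into the target/ directory on every run.
manifest = json.loads(Path("target/manifest.json").read_text())

# parent_map: unique_id of each node -> unique_ids of the nodes it depends on.
parent_map = manifest["parent_map"]

def print_upstream(unique_id, depth=0):
    """Print the upstream lineage of a dbt node as an indented tree."""
    print("  " * depth + unique_id)
    for parent in parent_map.get(unique_id, []):
        print_upstream(parent, depth + 1)

# Hypothetical model; unique_ids look like "model.<project>.<model_name>".
print_upstream("model.analytics.gold_daily_revenue")
```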

Summary

Data lineage is the transparency layer that makes the data lakehouse trustworthy and governable. Without lineage, every debugging session and compliance inquiry requires manual pipeline archaeology. With comprehensive lineage — from source databases through Bronze/Silver/Gold Iceberg tables to Dremio Virtual Datasets and BI dashboards — data teams can instantly trace errors to their source, assess the impact of upstream changes, and provide regulators with complete data provenance documentation. Investing in data lineage tooling is one of the highest-ROI governance investments for any mature lakehouse organization.