What Is Data Quality?

Data quality is the measure of how well data meets the requirements of its intended use — encompassing accuracy, completeness, consistency, timeliness, uniqueness, and validity across all data assets in the data lakehouse. Poor data quality is one of the most expensive and pervasive problems in data-driven organizations: analysts who discover that data cannot be trusted spend time validating rather than analyzing, business decisions made on incorrect data have real operational consequences, and regulatory reports based on bad data create compliance exposure.

In the lakehouse context, data quality is not just a monitoring activity; it is an active engineering discipline that builds quality checks into ETL pipelines, enforces quality gates between Medallion Architecture layers, surfaces quality metrics in the data catalog, and alerts data engineers to quality degradation before bad data reaches business consumers.

Six Data Quality Dimensions

  • Accuracy: Values are correct representations of real-world facts. (Revenue amounts match transaction system totals)
  • Completeness: All required fields are populated; no missing values for critical attributes. (customer_id is never null in the orders table)
  • Consistency: The same fact has the same value across all systems and tables. (Customer address in CRM matches address in orders table)
  • Timeliness: Data is available and updated within the freshness requirement for its use case. (Order data is available within 5 minutes of transaction for operational dashboards)
  • Uniqueness: No duplicate records for the same real-world entity. (Each order appears exactly once in the orders Silver table)
  • Validity: Values conform to defined formats, ranges, and business rules. (order_status is always one of ['pending', 'shipped', 'delivered', 'cancelled'])
Figure 1: Six data quality dimensions — the framework for defining and measuring lakehouse data quality.
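
These dimensions translate directly into measurable checks. As a rough sketch, the query below computes completeness, uniqueness, validity, and timeliness signals for a hypothetical silver_orders table; accuracy and consistency usually require a comparison against the source system or a second table, so they are omitted here. All table and column names are illustrative, not from any specific schema.

-- Hypothetical quality-metrics query for an assumed silver_orders table.
-- Each expression measures one of the dimensions listed above.
SELECT
  AVG(CASE WHEN customer_id IS NOT NULL THEN 1.0 ELSE 0.0 END) AS customer_id_completeness,  -- Completeness
  COUNT(*) - COUNT(DISTINCT order_id)                          AS duplicate_order_ids,        -- Uniqueness
  SUM(CASE WHEN order_status NOT IN ('pending', 'shipped', 'delivered', 'cancelled')
           THEN 1 ELSE 0 END)                                  AS invalid_status_count,       -- Validity
  MAX(order_ts)                                                AS latest_order_ts             -- Timeliness (compare to now)
FROM silver_orders;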

Implementing Quality Gates in Medallion Pipelines

The Medallion Architecture provides natural quality gate insertion points between layers:

  • Bronze → Silver gate: Validate that Bronze records meet minimum standards before Silver transformation (not-null checks on primary keys, format validation on critical dates, range checks on numeric values); a SQL sketch of this check follows the list
  • Silver → Gold gate: Validate Silver data completeness and referential integrity before Gold aggregation (join key existence checks, referential integrity between related tables, freshness checks)
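
Before promoting Bronze records, a quarantine-style query can surface the rows that would fail the gate. The sketch below assumes a bronze_orders table where order_ts lands as a raw string and order_amount is numeric; TRY_CAST exists in several lakehouse SQL dialects (Databricks SQL, Snowflake), but the exact function name varies by engine.

-- Hypothetical Bronze → Silver gate query: returns Bronze rows that fail the
-- minimum standards above so they can be quarantined or fixed upstream.
-- Table and column names (bronze_orders, order_ts, order_amount) are assumptions.
SELECT *
FROM bronze_orders
WHERE order_id IS NULL                          -- not-null check on the primary key
   OR TRY_CAST(order_ts AS TIMESTAMP) IS NULL   -- format validation on a critical date
   OR order_amount < 0;                         -- range check on a numeric value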

dbt's built-in test framework makes implementing these gates straightforward:

# dbt schema.yml
version: 2
models:
  - name: silver_orders
    columns:
      - name: order_id
        tests: [not_null, unique]
      - name: order_status
        tests:
          - accepted_values:
              values: ['pending','shipped','delivered','cancelled']
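
The same idea extends to the Silver → Gold gate. One way to express the referential integrity check in dbt is a singular test: a SQL file placed under tests/ that returns the offending rows, and dbt fails the test if any rows come back. The sketch below assumes silver_orders and silver_customers models; the file name and model names are illustrative.

-- tests/assert_orders_reference_known_customers.sql (hypothetical singular test)
-- Returns Silver orders whose customer_id has no match in silver_customers;
-- dbt marks the test as failed if this query returns any rows.
SELECT o.order_id, o.customer_id
FROM {{ ref('silver_orders') }} AS o
LEFT JOIN {{ ref('silver_customers') }} AS c
  ON o.customer_id = c.customer_id
WHERE c.customer_id IS NULL

A freshness check at this gate is typically handled separately, for example with a package test such as dbt_utils.recency or a custom query against the latest load timestamp.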
Figure 2: Data quality gates between Medallion layers — preventing bad data from propagating to Gold.

Summary

Data quality is the foundation of analytical trust in the data lakehouse. Without quality enforcement, every dataset is suspect, every metric is questioned, and the value of the lakehouse investment erodes through analyst distrust and flawed business decisions. Building quality checks into ETL pipelines as quality gates, surfacing quality metrics in the data catalog, and alerting on quality degradation through data observability platforms creates a lakehouse where consumers can use data for analysis and AI with confidence, without manual validation overhead.