What Is Data Quality?

Data quality is the measure of how well data meets the requirements of its intended use — encompassing accuracy, completeness, consistency, timeliness, uniqueness, and validity across all data assets in the data lakehouse. Poor data quality is one of the most expensive and pervasive problems in data-driven organizations: analysts who discover that data cannot be trusted spend time validating rather than analyzing, business decisions made on incorrect data have real operational consequences, and regulatory reports based on bad data create compliance exposure.

In the lakehouse context, data quality is not just a monitoring activity; it is an active engineering discipline that builds quality checks into ETL pipelines, enforces quality gates between Medallion Architecture layers, surfaces quality metrics in the data catalog, and alerts data engineers to quality degradation before bad data reaches business consumers.

Six Data Quality Dimensions

  • Accuracy: Values are correct representations of real-world facts. (Revenue amounts match transaction system totals)
  • Completeness: All required fields are populated; no missing values for critical attributes. (customer_id is never null in the orders table)
  • Consistency: The same fact has the same value across all systems and tables. (Customer address in CRM matches address in orders table)
  • Timeliness: Data is available and updated within the freshness requirement for its use case. (Order data is available within 5 minutes of transaction for operational dashboards)
  • Uniqueness: No duplicate records for the same real-world entity. (Each order appears exactly once in the orders Silver table)
  • Validity: Values conform to defined formats, ranges, and business rules. (order_status is always one of ['pending', 'shipped', 'delivered', 'cancelled'])
Figure 1: Six data quality dimensions — the framework for defining and measuring lakehouse data quality.
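
These dimensions translate directly into measurable checks. As a rough sketch, the query below computes completeness, uniqueness, validity, and timeliness signals for a hypothetical silver_orders table; accuracy and consistency usually require a comparison against the source system or a second table, so they are omitted here. All table and column names are illustrative, not from any specific schema.

-- Hypothetical quality-metrics query for an assumed silver_orders table.
-- Each expression measures one of the dimensions listed above.
SELECT
  AVG(CASE WHEN customer_id IS NOT NULL THEN 1.0 ELSE 0.0 END) AS customer_id_completeness,  -- Completeness
  COUNT(*) - COUNT(DISTINCT order_id)                          AS duplicate_order_ids,        -- Uniqueness
  SUM(CASE WHEN order_status NOT IN ('pending', 'shipped', 'delivered', 'cancelled')
           THEN 1 ELSE 0 END)                                  AS invalid_status_count,       -- Validity
  MAX(order_ts)                                                AS latest_order_ts             -- Timeliness (compare to now)
FROM silver_orders;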

Implementing Quality Gates in Medallion Pipelines

The Medallion Architecture provides natural quality gate insertion points between layers:

  • Bronze → Silver gate: Validate that Bronze records meet minimum standards before Silver transformation (not-null checks on primary keys, format validation on critical dates, range checks on numeric values); a SQL sketch of this check follows the list
  • Silver → Gold gate: Validate Silver data completeness and referential integrity before Gold aggregation (join key existence checks, referential integrity between related tables, freshness checks)
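
Before promoting Bronze records, a quarantine-style query can surface the rows that would fail the gate. The sketch below assumes a bronze_orders table where order_ts lands as a raw string and order_amount is numeric; TRY_CAST exists in several lakehouse SQL dialects (Databricks SQL, Snowflake), but the exact function name varies by engine.

-- Hypothetical Bronze → Silver gate query: returns Bronze rows that fail the
-- minimum standards above so they can be quarantined or fixed upstream.
-- Table and column names (bronze_orders, order_ts, order_amount) are assumptions.
SELECT *
FROM bronze_orders
WHERE order_id IS NULL                          -- not-null check on the primary key
   OR TRY_CAST(order_ts AS TIMESTAMP) IS NULL   -- format validation on a critical date
   OR order_amount < 0;                         -- range check on a numeric value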

dbt's built-in test framework makes implementing these gates straightforward:

# dbt schema.yml
version: 2
models:
  - name: silver_orders
    columns:
      - name: order_id
        tests: [not_null, unique]
      - name: order_status
        tests:
          - accepted_values:
              values: ['pending','shipped','delivered','cancelled']
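
The same idea extends to the Silver → Gold gate. One way to express the referential integrity check in dbt is a singular test: a SQL file placed under tests/ that returns the offending rows, and dbt fails the test if any rows come back. The sketch below assumes silver_orders and silver_customers models; the file name and model names are illustrative.

-- tests/assert_orders_reference_known_customers.sql (hypothetical singular test)
-- Returns Silver orders whose customer_id has no match in silver_customers;
-- dbt marks the test as failed if this query returns any rows.
SELECT o.order_id, o.customer_id
FROM {{ ref('silver_orders') }} AS o
LEFT JOIN {{ ref('silver_customers') }} AS c
  ON o.customer_id = c.customer_id
WHERE c.customer_id IS NULL

A freshness check at this gate is typically handled separately, for example with a package test such as dbt_utils.recency or a custom query against the latest load timestamp.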
Figure 2: Data quality gates between Medallion layers — preventing bad data from propagating to Gold.

Summary

Data quality is the foundation of analytical trust in the data lakehouse. Without quality enforcement, every dataset is suspect, every metric is questioned, and the value of the lakehouse investment erodes through analyst distrust and flawed business decisions. Building quality checks into ETL pipelines as quality gates, surfacing quality metrics in the data catalog, and alerting on quality degradation through data observability platforms creates a lakehouse where consumers can use data for analysis and AI with confidence, without manual validation overhead.