What Is Z-Ordering?

Z-Ordering is a data layout optimization technique based on the Z-order (Morton) space-filling curve, which maps multi-dimensional data to a one-dimensional order by interleaving the bits of each dimension's value. Points that are close together in several dimensions tend to stay close together in the resulting order. In the context of Apache Iceberg table optimization, Z-Ordering means sorting (or clustering) rows so that those with similar values across multiple filter columns are physically located in the same data files. This narrows each file's per-column min/max statistics, which is what makes data skipping based on file-level column statistics effective.
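To make the curve concrete, here is a minimal sketch of the bit interleaving behind a two-dimensional Z-value. The function name `interleave_bits` and the 4x4 grid are illustrative assumptions, not part of Iceberg; real engines typically also have to convert strings, dates, and signed numbers into an order-preserving form before interleaving, which this sketch skips.

```python
def interleave_bits(x: int, y: int, bits: int = 16) -> int:
    """Return the Z-order (Morton) value of two non-negative integers.

    Bit i of x is placed at position 2*i and bit i of y at position 2*i + 1,
    so the two coordinates alternate bit by bit in the result.
    """
    if x < 0 or y < 0:
        raise ValueError("coordinates must be non-negative")
    if x >= (1 << bits) or y >= (1 << bits):
        raise ValueError("coordinate does not fit in the given bit width")
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)
        z |= ((y >> i) & 1) << (2 * i + 1)
    return z


# Sort the cells of a 4x4 grid by their Z-value.
cells = [(x, y) for x in range(4) for y in range(4)]
z_sorted = sorted(cells, key=lambda c: interleave_bits(c[0], c[1], bits=2))

# The first four cells in Z-order form the 2x2 block at the origin,
# the next four form the neighbouring 2x2 block, and so on.
first_block = z_sorted[:4]
second_block = z_sorted[4:8]
print(first_block)
print(second_block)
```

Each run of four consecutive cells covers a compact 2x2 square rather than a full row or column, which is the locality property that file-level clustering relies on.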

The problem Z-Ordering solves: standard sorting by a single column clusters rows by that column but disperses rows along all other columns. A table sorted by date has excellent date statistics (narrow date ranges per file) but poor statistics for region, product, or user, so filters on those columns must read nearly every file. Z-Ordering clusters by multiple columns simultaneously, providing good (though not perfect) statistics for all specified columns. The trade-off is that each individual column's per-file range is somewhat wider than a dedicated single-column sort would produce.

Z-Ordering vs Single-Column Sorting

Consider a 100-file Iceberg table with 100M rows that is frequently queried by date, by region, and by both together:

Sorted by date only: Queries filtering by date alone are fast (most files skipped). Queries filtering by region alone must read all 100 files (region is randomly distributed across all files — no statistics benefit).

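Z-Ordered by (date, region): Files cluster rows with similar date and region combinations together. As an illustration, queries filtering by date might skip around 90% of files, queries filtering by region around 80%, and queries filtering by both date AND region around 95%. The actual ratios depend on column cardinality, data distribution, and file count. A date-only filter skips fewer files here than on the date-sorted table; that is the price of making region filters selective as well. The Z-order curve's locality property tends to keep related (date, region) combinations near each other in sort order.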
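The comparison above can be reproduced on a toy table. The sketch below is a simulation under stated assumptions, not a model of the 100-file table: dates and regions are encoded as integers 0 to 63, there is exactly one row per (date, region) pair, and each of the 64 files holds 64 rows. The helper names (`z_value`, `write_files`, `files_read`) are hypothetical.

```python
from typing import Callable, Dict, List, Optional, Tuple

Row = Tuple[int, int]  # (date, region), both encoded as small integers
FileStats = Dict[str, Tuple[int, int]]  # column name -> (min, max)


def z_value(date: int, region: int, bits: int = 6) -> int:
    """Interleave the bits of date and region into one Z-order sort key."""
    z = 0
    for i in range(bits):
        z |= ((date >> i) & 1) << (2 * i)
        z |= ((region >> i) & 1) << (2 * i + 1)
    return z


def write_files(rows: List[Row], key: Callable[[Row], object],
                rows_per_file: int) -> List[FileStats]:
    """Sort rows by key, cut them into files, keep only min/max statistics."""
    ordered = sorted(rows, key=key)
    files = []
    for start in range(0, len(ordered), rows_per_file):
        chunk = ordered[start:start + rows_per_file]
        dates = [r[0] for r in chunk]
        regions = [r[1] for r in chunk]
        files.append({"date": (min(dates), max(dates)),
                      "region": (min(regions), max(regions))})
    return files


def files_read(files: List[FileStats], date: Optional[int] = None,
               region: Optional[int] = None) -> int:
    """Count files whose min/max ranges cannot rule out the equality filter."""
    count = 0
    for f in files:
        if date is not None and not f["date"][0] <= date <= f["date"][1]:
            continue
        if region is not None and not f["region"][0] <= region <= f["region"][1]:
            continue
        count += 1
    return count


rows = [(d, r) for d in range(64) for r in range(64)]
by_date = write_files(rows, key=lambda r: (r[0], r[1]), rows_per_file=64)
by_zorder = write_files(rows, key=lambda r: z_value(r[0], r[1]), rows_per_file=64)

for label, files in (("sorted by date", by_date), ("z-ordered", by_zorder)):
    print(label,
          "| date filter:", files_read(files, date=10),
          "| region filter:", files_read(files, region=10),
          "| both:", files_read(files, date=10, region=10))
```

In this toy layout the date-sorted table reads 1 of 64 files for a date filter but all 64 for a region filter, while the Z-ordered table reads 8 of 64 for either filter and 1 of 64 for both. Real tables with skewed or high-cardinality data will show different ratios.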

Figure 1: Z-Ordering vs single-column sort — multi-dimensional clustering for multi-column query patterns.

Running Z-Order Optimization in Iceberg

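Apache Iceberg supports Z-Order optimization through its data file rewrite action. In Spark this is exposed as the rewrite_data_files procedure, called with a sort strategy and a zorder(...) sort order. Other engines such as Dremio and Trino provide their own OPTIMIZE commands; which sort options they accept depends on the engine and version, so check the documentation for the release you run: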

In Spark with Iceberg:

CALL catalog.system.rewrite_data_files(
  table => 'silver.orders',
  strategy => 'sort',
  sort_order => 'zorder(region, product_category, order_date)'
);

In Dremio SQL:

OPTIMIZE TABLE silver.orders
USING SORT (region, product_category, order_date);

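Z-Order optimization is typically run as a scheduled maintenance job (daily or weekly for large tables), for example orchestrated with Apache Airflow, as part of the table maintenance pipeline alongside compaction. A rewrite only reorganizes files that exist when it runs, so data written afterwards stays unclustered until the next run.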
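A scheduled job of this kind often reduces to generating one rewrite_data_files call per table. The following is a minimal sketch of that step, assuming a Spark session object with a `sql(str)` method; the helper names `build_zorder_call` and `run_maintenance`, and the idea of a table-to-columns plan, are hypothetical. The scheduler wiring (Airflow or otherwise) is deliberately left out.

```python
import re
from typing import List, Mapping, Sequence

_IDENTIFIER = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def _validated(name: str) -> str:
    """Accept dot-separated identifiers such as 'silver.orders', reject the rest."""
    if not name or not all(_IDENTIFIER.match(part) for part in name.split(".")):
        raise ValueError(f"unsafe identifier: {name!r}")
    return name


def build_zorder_call(catalog: str, table: str, columns: Sequence[str]) -> str:
    """Build the Spark SQL CALL statement shown earlier for one table."""
    if len(columns) < 2:
        # Design choice of this sketch: with a single column a Z-order is
        # just an ordinary sort, so the helper asks for at least two.
        raise ValueError("Z-ordering needs at least two columns")
    cols = ", ".join(_validated(c) for c in columns)
    return (
        f"CALL {_validated(catalog)}.system.rewrite_data_files("
        f"table => '{_validated(table)}', "
        "strategy => 'sort', "
        f"sort_order => 'zorder({cols})')"
    )


def run_maintenance(spark, catalog: str,
                    plan: Mapping[str, Sequence[str]]) -> List[str]:
    """Run one rewrite per table; `spark` is any object with a sql(str) method."""
    statements = [build_zorder_call(catalog, table, cols)
                  for table, cols in plan.items()]
    for statement in statements:
        spark.sql(statement)
    return statements


print(build_zorder_call("catalog", "silver.orders",
                        ["region", "product_category", "order_date"]))
```

Keeping statement construction separate from execution makes the plan easy to log or dry-run before a long rewrite is started.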

Figure 2: Z-Order optimization workflow — scheduled maintenance producing clustered, statistics-rich files.

Summary

Z-Ordering is one of the highest-ROI table optimization techniques available in the open data lakehouse. By clustering rows with similar multi-column values into the same Parquet files, it narrows per-file column statistics and can let data skipping eliminate 80–99% of data files for selective analytical filters on the Z-ordered columns. Compaction (right-sized files), Z-Ordering (clustered layout), and Dremio Reflections (pre-computed materializations) complement one another, and together they form the core optimization stack for high-performance Apache Iceberg lakehouses serving low-latency analytics on very large tables.