What is Z-Ordering in Apache Iceberg?

Z-Ordering (also called Z-order curve clustering) is a data layout optimization that sorts table data by multiple columns simultaneously using a space-filling curve. In Apache Iceberg, it is applied by rewriting data files, for example with the Spark rewrite_data_files procedure using strategy => 'sort' and sort_order => 'zorder(...)'. The rewrite clusters rows with similar values in the specified columns together, making file-level statistics more selective for queries filtering on those columns.

How does Z-Ordering improve query performance?

By clustering rows with similar values together in the same data files, Z-Ordering narrows each file's per-column min/max range. When a query filters WHERE region = 'US-WEST', only files whose region range includes 'US-WEST' need to be read. After Z-Ordering on region, far fewer files have a range that includes 'US-WEST'; most files contain only rows from other regions and are skipped entirely.
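The range check described above can be sketched in a few lines of Python. This is a simplified illustration, not Iceberg's planner: the FileStats structure, the file names, and the region values other than 'US-WEST' are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FileStats:
    """Lower/upper bound of one column for one data file (hypothetical structure)."""
    path: str
    lower: str
    upper: str


def files_to_scan(files, value):
    """Keep only files whose [lower, upper] range could contain `value`."""
    return [f.path for f in files if f.lower <= value <= f.upper]


# Unclustered layout: every file spans nearly the whole range of regions.
unclustered = [
    FileStats("f1.parquet", "AP-EAST", "US-WEST"),
    FileStats("f2.parquet", "AP-EAST", "US-WEST"),
    FileStats("f3.parquet", "EU-WEST", "US-WEST"),
]

# Clustered layout: each file covers a narrow slice of regions.
clustered = [
    FileStats("f1.parquet", "AP-EAST", "EU-CENTRAL"),
    FileStats("f2.parquet", "EU-WEST", "US-EAST"),
    FileStats("f3.parquet", "US-EAST", "US-WEST"),
]

print(files_to_scan(unclustered, "US-WEST"))  # all three files must be read
print(files_to_scan(clustered, "US-WEST"))    # only f3.parquet must be read
```

A file is skipped only when its range provably excludes the filter value, so narrower ranges translate directly into fewer files read.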

When should I use Z-Ordering vs standard partitioning?

Use both together. Partitioning provides coarse-grained elimination at the partition level, removing entire groups of files from the scan. Z-Ordering provides fine-grained clustering within partitions, so individual files can be eliminated using their statistics. Z-Ordering is most valuable for high-cardinality columns where partitioning would create too many partitions (user_id, product_sku), or for multi-dimensional filtering where a single partition column cannot capture all filter dimensions.

Z-Ordering: The Definitive Guide for Apache Iceberg

What Is Z-Ordering?

Z-Ordering is a data layout optimization technique based on the Z-order space-filling curve — a mathematical curve that maps multi-dimensional data to one-dimensional order while preserving locality in multiple dimensions simultaneously. In the context of Apache Iceberg table optimization, Z-Ordering means sorting (or clustering) data files so that rows with similar values across multiple filter columns are physically located in the same files — maximizing the effectiveness of data skipping based on file-level column statistics.
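To make the curve concrete, here is a minimal sketch of a two-dimensional Z-value (Morton code) computed by bit interleaving. It assumes small non-negative integers; the function name z_value is illustrative, and this is not Iceberg's implementation, since real engines first convert each column value into an order-preserving binary form before interleaving.

```python
def z_value(x: int, y: int, bits: int = 16) -> int:
    """Interleave the low `bits` bits of x and y into one Morton code.

    Bit i of x lands at position 2*i, bit i of y at position 2*i + 1.
    """
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)
        z |= ((y >> i) & 1) << (2 * i + 1)
    return z


# Walk a 4x4 grid in Z-order: neighbouring cells tend to stay close together.
cells = sorted(
    ((x, y) for x in range(4) for y in range(4)),
    key=lambda cell: z_value(*cell),
)
print(cells[:4])  # [(0, 0), (1, 0), (0, 1), (1, 1)]
```

The first four cells form a 2x2 block, which is the locality property at work: rows that are close in both dimensions receive nearby Z-values and therefore end up in the same files after sorting.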

The problem Z-Ordering solves: standard sorting by a single column clusters rows by that column but disperses rows along all other columns. A table sorted by date has excellent date statistics (narrow date ranges per file) but poor statistics for region, product, or user — filters on those columns must read nearly every file. Z-Ordering clusters by multiple columns simultaneously, providing good (though not perfect) statistics for all specified columns in all files.

Z-Ordering vs Single-Column Sorting

Consider a 100-file Iceberg table with 100M rows, queried by both date AND region frequently:

Sorted by date only: Queries filtering by date alone are fast (most files skipped). Queries filtering by region alone must read all 100 files (region is randomly distributed across all files — no statistics benefit).

Z-Ordered by (date, region): Files cluster rows with similar date + region combinations together. As illustrative figures for this scenario, queries filtering by date might skip ~90% of files, queries filtering by region ~80%, and queries filtering by both date AND region ~95%; actual ratios depend on data distribution, file count, and column cardinality. The Z-order curve's locality property keeps related (date, region) combinations near each other in sort order.
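The trade-off can be reproduced with a toy simulation. The sketch below assumes a 16 x 16 grid of integer-coded (date, region) pairs packed into 16 files of 16 rows each; all names are hypothetical, and the ratios it prints are specific to this grid and should not be read as benchmarks.

```python
def z_value(x, y, bits=4):
    """Interleave the bits of x and y into a single Z-order sort key."""
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)
        z |= ((y >> i) & 1) << (2 * i + 1)
    return z


def pack_into_files(rows, rows_per_file):
    """Split already-sorted rows into files and record min/max per column."""
    files = []
    for start in range(0, len(rows), rows_per_file):
        chunk = rows[start:start + rows_per_file]
        dates = [d for d, _ in chunk]
        regions = [r for _, r in chunk]
        files.append({
            "date": (min(dates), max(dates)),
            "region": (min(regions), max(regions)),
        })
    return files


def files_scanned(files, **equals):
    """Count files whose min/max ranges admit every equality filter."""
    return sum(
        all(f[col][0] <= val <= f[col][1] for col, val in equals.items())
        for f in files
    )


rows = [(d, r) for d in range(16) for r in range(16)]
by_date = pack_into_files(sorted(rows), 16)
by_z = pack_into_files(sorted(rows, key=lambda row: z_value(*row)), 16)

for name, layout in (("sorted by date", by_date), ("z-ordered", by_z)):
    print(
        name,
        files_scanned(layout, date=5),
        files_scanned(layout, region=5),
        files_scanned(layout, date=5, region=5),
    )
# sorted by date 1 16 1
# z-ordered 4 4 1
```

In this grid the date-sorted layout reads 1 file for a date filter but all 16 for a region filter, while the Z-ordered layout reads 4 files for either column. Z-ordering gives up some selectivity on the leading column in exchange for useful pruning on every clustered column.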

Figure 1: Z-Ordering vs single-column sort: multi-dimensional clustering for multi-column query patterns.

Running Z-Order Optimization in Iceberg

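In Spark, Apache Iceberg supports Z-Order optimization through the rewrite_data_files procedure with a zorder(...) sort order. Other engines such as Dremio and Trino expose their own OPTIMIZE-style commands; sorting and Z-order support differs by engine and version, so check the engine's documentation before relying on the syntax below: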

In Spark with Iceberg:

CALL catalog.system.rewrite_data_files(
  table => 'silver.orders',
  strategy => 'sort',
  sort_order => 'zorder(region, product_category, order_date)'
);

In Dremio SQL:

OPTIMIZE TABLE silver.orders
USING SORT (region, product_category, order_date);

Z-Order optimization is typically run as a scheduled maintenance job (daily or weekly for large tables), for example orchestrated with Apache Airflow, as part of the table maintenance pipeline alongside compaction.
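A scheduled job ultimately has to issue the Spark procedure call shown earlier. The helper below is a minimal sketch of how such a job might build that statement; the name build_zorder_call and the identifier check are assumptions for illustration, not part of Iceberg or Airflow. In a real job the returned string would be passed to an existing SparkSession, for example spark.sql(statement).

```python
import re

# Conservative check so that only plain identifiers end up in the SQL text.
_IDENTIFIER = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def build_zorder_call(catalog, table, columns):
    """Build a rewrite_data_files CALL statement with a zorder(...) sort order."""
    if not columns:
        raise ValueError("at least one column is required")
    for part in [catalog, *table.split("."), *columns]:
        if not _IDENTIFIER.match(part):
            raise ValueError(f"unsafe identifier: {part!r}")
    column_list = ", ".join(columns)
    return (
        f"CALL {catalog}.system.rewrite_data_files("
        f"table => '{table}', "
        "strategy => 'sort', "
        f"sort_order => 'zorder({column_list})')"
    )


statement = build_zorder_call(
    "catalog", "silver.orders", ["region", "product_category", "order_date"]
)
print(statement)
```

Keeping statement construction in one small function lets the orchestrator pass only a table name and column list per task, and the identifier check avoids interpolating arbitrary text into SQL.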

Figure 2: Z-Order optimization workflow: scheduled maintenance producing clustered, statistics-rich files.

Summary

Z-Ordering is one of the highest-ROI table optimization techniques available in the open data lakehouse. By clustering rows with similar multi-column values into the same Parquet files, it narrows per-file column statistics and can let data skipping eliminate 80–99% of data files for selective filters on the clustered columns. Combined with compaction (right-sized files) and Dremio Reflections (pre-computed materializations), Z-Ordering (clustered layout) forms a layered optimization stack for high-performance Apache Iceberg lakehouses, aimed at interactive analytics on very large tables.
