What Is Z-Ordering?

Z-ordering (also called Z-order clustering or multi-dimensional clustering) is a data layout optimization technique that physically organizes rows in data files to maximize the effectiveness of data skipping for multi-dimensional query filters. Unlike simple sorting by a single column (which is only optimal for queries filtering on that column), Z-ordering produces a data layout that is efficient for queries filtering on any combination of the clustered columns.

The technique is named for the Z-shaped path that the Z-order curve (also known as the Morton curve) traces through two-dimensional integer coordinates. Applied to tabular data, the curve interleaves the bits of multiple column values into a single composite sort key, so that rows with similar values across all dimensions tend to land near each other in sorted order.
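The bit interleaving described above can be sketched in a few lines of Python. This is a minimal illustration for two non-negative integer columns, not the encoding any particular engine uses; the function name `interleave_bits` and the 16-bit width are assumptions made for the example.

```python
def interleave_bits(x: int, y: int, bits: int = 16) -> int:
    """Interleave the low `bits` bits of x and y into one Morton (Z-order) key.

    Bit i of x goes to position 2*i, bit i of y goes to position 2*i + 1.
    """
    key = 0
    for i in range(bits):
        key |= ((x >> i) & 1) << (2 * i)
        key |= ((y >> i) & 1) << (2 * i + 1)
    return key


# Walk a 4x4 grid in Z-order: sorting by the interleaved key visits
# each 2x2 block completely before moving on to the next block.
points = [(x, y) for y in range(4) for x in range(4)]
z_sorted = sorted(points, key=lambda p: interleave_bits(p[0], p[1]))
print(z_sorted[:4])  # [(0, 0), (1, 0), (0, 1), (1, 1)]
```

Real column values (strings, timestamps, signed numbers) would first need to be mapped to comparable unsigned integers; that step is left out here.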

In practical terms: if a table is Z-ordered by (customer_id, product_category), data files tend to contain rows with similar customer_ids and similar product_categories. A query filtering on customer_id = 'abc' alone can skip many files, as can a query filtering on product_category = 'Electronics' alone. A query filtering on both can skip even more. All three query patterns benefit from data skipping, which no single-column sort can offer.

How Z-Ordering Improves Data Skipping

Data skipping in Apache Iceberg works through per-file column statistics: for each data file, the table metadata records lower and upper bounds (min and max values) for the columns it tracks. At query time, the engine uses these statistics to skip files whose min-max range does not overlap with the query's filter predicates.
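As a rough sketch of that planning step, the snippet below keeps only the files whose bounds could contain the filtered values. `FileStats`, `may_contain`, and `plan_scan` are hypothetical names invented for this illustration and are not part of the Iceberg API; the file names and bounds are made up.

```python
from dataclasses import dataclass


@dataclass
class FileStats:
    """Hypothetical stand-in for the per-file bounds kept in table metadata."""
    path: str
    lower: dict  # column name -> minimum value in the file
    upper: dict  # column name -> maximum value in the file


def may_contain(stats: FileStats, column: str, value) -> bool:
    """An equality filter can only match if value lies within [lower, upper]."""
    return stats.lower[column] <= value <= stats.upper[column]


def plan_scan(files, equality_filters: dict):
    """Keep only files whose bounds overlap every equality predicate."""
    return [
        f for f in files
        if all(may_contain(f, col, val) for col, val in equality_filters.items())
    ]


files = [
    FileStats("f1.parquet", {"customer_id": 1, "product_id": 10},
              {"customer_id": 50, "product_id": 40}),
    FileStats("f2.parquet", {"customer_id": 51, "product_id": 10},
              {"customer_id": 100, "product_id": 40}),
    FileStats("f3.parquet", {"customer_id": 1, "product_id": 41},
              {"customer_id": 50, "product_id": 80}),
]

to_read = plan_scan(files, {"customer_id": 20, "product_id": 60})
print([f.path for f in to_read])  # ['f3.parquet']
```

A file that survives this check is not guaranteed to contain matching rows; the bounds only prove when a file cannot match.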

For data skipping to be effective, files must contain rows with similar values for the queried columns, so that each file's min-max range is narrow and most files fall clearly outside any given filter. Poor data layout (random file assignment) produces wide min-max ranges for every file, so few, if any, files can be skipped.

Z-ordering produces narrow per-file min-max ranges for all clustered dimensions simultaneously. A file in a table Z-ordered by (region, product_id) contains rows from a limited range of regions and a limited range of product IDs, so its min-max statistics are tight. A query for a specific region AND product_id can skip every file whose region range or product_id range does not include the query values.
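To make the effect concrete, here is a small, self-contained simulation under simplifying assumptions: 256 synthetic rows over two integer columns `a` and `b` (each 0 to 15), written as 16 files of 16 rows. It compares a plain sort on `(a, b)` with a Z-order sort and counts how many files survive min-max pruning. All names are illustrative, and real tables will not divide this cleanly.

```python
def interleave_bits(x: int, y: int, bits: int = 4) -> int:
    """Morton key: bit i of x -> position 2*i, bit i of y -> position 2*i + 1."""
    key = 0
    for i in range(bits):
        key |= ((x >> i) & 1) << (2 * i)
        key |= ((y >> i) & 1) << (2 * i + 1)
    return key


def write_files(rows, rows_per_file):
    """Split already-ordered rows into files and record (min, max) per column."""
    files = []
    for start in range(0, len(rows), rows_per_file):
        chunk = rows[start:start + rows_per_file]
        a_vals = [a for a, _ in chunk]
        b_vals = [b for _, b in chunk]
        files.append({"a": (min(a_vals), max(a_vals)),
                      "b": (min(b_vals), max(b_vals))})
    return files


def files_to_read(files, **equals):
    """Count files whose min-max bounds overlap every equality filter."""
    return sum(
        all(f[col][0] <= val <= f[col][1] for col, val in equals.items())
        for f in files
    )


rows = [(a, b) for a in range(16) for b in range(16)]  # 256 synthetic rows

linear = write_files(sorted(rows), 16)  # sorted by a, then b
zorder = write_files(sorted(rows, key=lambda r: interleave_bits(*r)), 16)

for name, layout in (("sorted by a", linear), ("z-ordered", zorder)):
    print(name,
          files_to_read(layout, a=5),
          files_to_read(layout, b=5),
          files_to_read(layout, a=5, b=5))
# sorted by a 1 16 1
# z-ordered 4 4 1
```

In this toy setup the single-column sort reads 1 file for a filter on `a` but all 16 for a filter on `b`, while the Z-ordered layout reads 4 files for either column alone and 1 for both. That is the trade-off: somewhat less skipping on the leading column in exchange for useful skipping on every clustered column.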

Figure 1: Z-ordering clusters rows by multiple dimensions, enabling aggressive data skipping for any filter combination.

Z-Ordering vs. Single-Column Sorting

| Dimension | Single-Column Sort | Z-Ordering |
|---|---|---|
| Best query filter | Sort column only | Any clustered column or combination |
| File skipping for filter on col A | Excellent (if sorted on A) | Good |
| File skipping for filter on col B | Poor (if sorted on A) | Good |
| File skipping for filter on A AND B | Good for A, poor for B | Excellent |
| Best for | Predictable, single-column query patterns | Ad-hoc analytics with variable filters |

Z-ordering is most valuable for tables used in exploratory analytics where query patterns vary — analysts filter on different column combinations depending on the question they are answering. For operational dashboards with fixed, predictable filters, simple sorting on the primary filter column is often sufficient and has lower computational overhead.

Z-Ordering in Practice with Dremio

Dremio applies Z-ordering as part of its Automated Table Optimization feature. When Z-ordering is configured for an Iceberg table, Dremio's background optimization process applies the Z-curve algorithm to the table's data files during compaction runs, rewriting files with optimally clustered row ordering.

Configuring Z-ordering in Dremio:

    ALTER TABLE my_table CLUSTER BY (region, product_category, customer_segment)

After configuration, Dremio's optimizer automatically maintains the clustering during future compaction cycles. Dremio can also combine Z-ordering with file size optimization, ensuring that Z-ordered files are optimally sized as well as optimally ordered.

Figure 2: Compaction with Z-ordering produces optimally clustered files for multi-dimensional data skipping.

Z-Ordering Best Practices

Applying Z-ordering effectively requires careful column selection:

  • Cluster by 2–4 high-impact columns. Z-ordering effectiveness diminishes with more than 4 columns — the Z-curve space becomes too high-dimensional for effective clustering. Choose the 2–4 columns most commonly used in query filters.
  • Prioritize high-cardinality filter columns. Z-ordering is most effective for columns with many distinct values (product_id, customer_id, transaction_id). Low-cardinality columns (status, region) benefit more from partitioning than Z-ordering.
  • Apply Z-ordering at the Gold layer. The Gold layer tables queried by BI tools benefit most from Z-ordering — the query patterns are known, and the read performance improvement directly impacts user-facing dashboard latency.
  • Combine with partitioning. Z-ordering works within partitions. A table partitioned by days(event_date) and Z-ordered by (customer_id, product_id) within partitions benefits from partition pruning for date filters AND Z-order skipping for customer/product filters.
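The last point, partition pruning followed by Z-order skipping inside the surviving partition, can be sketched as a two-stage filter. The file metadata below is made up for illustration, and `prune` is a hypothetical helper, not an engine API.

```python
from datetime import date

# Hypothetical file metadata: a partition value plus min-max bounds per column.
files = [
    {"day": date(2024, 1, 1), "customer_id": (1, 500), "product_id": (1, 50)},
    {"day": date(2024, 1, 1), "customer_id": (501, 1000), "product_id": (1, 50)},
    {"day": date(2024, 1, 2), "customer_id": (1, 500), "product_id": (51, 100)},
    {"day": date(2024, 1, 2), "customer_id": (501, 1000), "product_id": (51, 100)},
]


def prune(files, day, **equals):
    """Two-stage pruning: partition value first, then min-max bounds."""
    # Stage 1: partition pruning drops every file outside the requested day.
    in_partition = [f for f in files if f["day"] == day]
    # Stage 2: min-max skipping inside the surviving partition.
    return [
        f for f in in_partition
        if all(f[col][0] <= val <= f[col][1] for col, val in equals.items())
    ]


survivors = prune(files, date(2024, 1, 2), customer_id=742)
print(len(survivors))  # 1
```

The date filter removes half of the files before any column bounds are inspected; the tighter the Z-ordered bounds inside each partition, the more the second stage can remove.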

Summary

Z-ordering is one of the most effective data layout optimizations for tables with diverse, multi-dimensional query patterns. By co-locating rows with similar values across multiple clustered dimensions, Z-ordering enables aggressive data skipping for any combination of filter predicates on those columns, which makes it well suited to the ad-hoc analytical workloads of the Gold layer.

Combined with hidden partitioning (for coarse-grained file elimination) and file size compaction (for optimal I/O), Z-ordering is the final layer of a comprehensive data layout optimization strategy for Apache Iceberg tables. Dremio's automated table optimization applies all three layers transparently, maintaining optimal table performance without manual DBA intervention.