What is data skipping in Apache Iceberg?

Data skipping is the process where a query engine reading an Iceberg table uses per-file column statistics (min/max values, null counts) from manifest files to eliminate any data files whose statistics prove they cannot contain rows satisfying the query's WHERE clause. For an equality filter, a file whose min is greater than the filter value, or whose max is less than it, cannot contain matches and is skipped entirely.

How do column statistics enable data skipping?

Each data file entry in an Iceberg manifest stores lower_bounds and upper_bounds for each column: the minimum and maximum values of that column in the file. If a query filters WHERE order_date = '2026-01-15', any file whose order_date max is earlier than '2026-01-15' or whose order_date min is later than '2026-01-15' cannot contain matching rows and is skipped without reading its Parquet content.

What is the difference between partition pruning and data skipping?

Partition pruning eliminates files based on partition metadata — it works at a coarser grain (entire partition boundaries). Data skipping uses per-file column statistics — it works at the individual file level regardless of partitioning. Both operate before any Parquet bytes are read. Z-ordering improves data skipping by clustering related values, making per-file statistics more selective.

Data Skipping: The Definitive Guide

What Is Data Skipping?

Data skipping is the query optimization technique where a query engine eliminates data files from a scan before reading their content — using per-file column statistics stored in Iceberg manifest files to prove that certain files cannot contain rows satisfying the query's filter conditions.

Data skipping is the second level of predicate pushdown in the Iceberg hierarchy (after partition pruning, before Parquet row group pruning). While partition pruning coarsely eliminates entire partitions' worth of files, data skipping operates at the individual file level, eliminating files within a partition whose column value ranges don't overlap with the query's filter values.
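The first two levels of that hierarchy can be sketched in a few lines. This is an illustrative model only, not Iceberg's planner: the DataFile class, the plan_scan function, the file names, and the single month partition value are all hypothetical, and the query is assumed to be WHERE month = '2026-01' AND revenue > 10000.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataFile:
    """Hypothetical stand-in for one data file entry in a manifest."""
    path: str
    partition: str  # assumed: a single month partition value
    lower: int      # lower bound of `revenue` in this file
    upper: int      # upper bound of `revenue` in this file


def plan_scan(files, month, min_revenue):
    """Return the files that must be read for: month = ? AND revenue > ?"""
    # Level 1: partition pruning drops every file outside the requested partition.
    in_partition = [f for f in files if f.partition == month]
    # Level 2: data skipping drops files whose revenue range cannot exceed the threshold.
    return [f for f in in_partition if f.upper > min_revenue]


files = [
    DataFile("a.parquet", "2026-01", 100, 9_000),
    DataFile("b.parquet", "2026-01", 8_000, 25_000),
    DataFile("c.parquet", "2026-02", 12_000, 40_000),
]
selected = plan_scan(files, "2026-01", 10_000)
print([f.path for f in selected])  # ['b.parquet']
```

Here c.parquet is removed by partition pruning, a.parquet is removed by data skipping because its upper bound is below the threshold, and only b.parquet is left for the Parquet reader.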

How Column Statistics Enable Skipping

Each data file entry in an Iceberg manifest file stores per-column statistics:

lower_bounds: The minimum value of each column across all rows in the file
upper_bounds: The maximum value of each column across all rows in the file
null_value_counts: The count of null values for each column in the file

These statistics are written when data files are created (during Iceberg writes) and stored in the manifest. They require no separate indexing step — they are a natural part of the Iceberg commit process.

For a query with WHERE revenue > 10000, the engine reads manifest statistics and skips any file where upper_bound(revenue) <= 10000 — those files contain no rows with revenue above 10,000 and cannot contribute to the result.
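The skip decision for a single column can be sketched as a small function. This is a simplified model under stated assumptions, not Iceberg's evaluator: ColumnStats and may_contain are hypothetical names, values are plain integers, and a value_count field (total rows in the column) is assumed alongside the statistics listed above so that all-null files can be detected.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ColumnStats:
    """Hypothetical per-file statistics for one column."""
    lower_bound: Optional[int]  # None when no bound was recorded
    upper_bound: Optional[int]
    null_value_count: int
    value_count: int            # assumed: total number of values, nulls included


def may_contain(stats: ColumnStats, op: str, value: int) -> bool:
    """Return False only when the stats prove no row matches `column <op> value`."""
    # A file holding only nulls (or no rows) cannot satisfy a comparison predicate.
    if stats.null_value_count == stats.value_count:
        return False
    # Missing bounds prove nothing, so the file has to be read.
    if stats.lower_bound is None or stats.upper_bound is None:
        return True
    lo, hi = stats.lower_bound, stats.upper_bound
    if op == "=":
        return lo <= value <= hi
    if op == ">":
        return hi > value
    if op == ">=":
        return hi >= value
    if op == "<":
        return lo < value
    if op == "<=":
        return lo <= value
    raise ValueError(f"unsupported operator: {op}")


revenue = ColumnStats(lower_bound=120, upper_bound=9_500,
                      null_value_count=0, value_count=1_000)
print(may_contain(revenue, ">", 10_000))  # False: upper bound is not above 10000
print(may_contain(revenue, "=", 5_000))   # True: 5000 lies inside [120, 9500]
```

Note the asymmetry: a False result is a proof that the file can be skipped, while True only means the file might contain matches and still has to be scanned.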

Figure 1: Data skipping. Manifest statistics eliminate files before any Parquet bytes are read.

Z-Ordering Amplifies Data Skipping

Z-Ordering dramatically improves data skipping effectiveness. In an unordered table, data files contain a random mixture of values — the min/max range for most columns in each file spans nearly the entire value domain. A filter like WHERE customer_id = 'C12345' would find that almost every file's customer_id range includes 'C12345', requiring every file to be read.

After Z-Ordering by customer_id, files cluster data with similar customer IDs together. Now most files have a narrow customer_id range that doesn't include 'C12345', and are skipped. Only the few files whose range includes 'C12345' are read. In a table of 100 files where one to three files match, for example, that reduces I/O from 100% of files to 1–3%.
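The effect of clustering on min/max ranges can be reproduced with a small simulation. It is a sketch under simplifying assumptions: customer IDs are random integers rather than strings, each "file" is a fixed-size chunk of 1,000 rows, and a plain sort on the single column stands in for Z-ordering. The helper names file_ranges and files_to_read are hypothetical.

```python
import random


def file_ranges(values, rows_per_file):
    """Split values into consecutive files and return each file's (min, max)."""
    chunks = [values[i:i + rows_per_file]
              for i in range(0, len(values), rows_per_file)]
    return [(min(c), max(c)) for c in chunks]


def files_to_read(ranges, target):
    """Count files whose [min, max] range could contain the target value."""
    return sum(1 for lo, hi in ranges if lo <= target <= hi)


rng = random.Random(42)
customer_ids = [rng.randrange(1_000_000) for _ in range(100_000)]
# Look up an ID from the middle of the domain that is known to exist.
target = sorted(customer_ids)[len(customer_ids) // 2]

unordered = file_ranges(customer_ids, rows_per_file=1_000)
clustered = file_ranges(sorted(customer_ids), rows_per_file=1_000)

print("files:", len(unordered))
print("read when unordered:", files_to_read(unordered, target))
print("read when clustered:", files_to_read(clustered, target))
```

In the unordered layout nearly every file's range spans the whole ID domain, so almost all of the 100 files have to be read. After sorting, the target falls inside the range of one file, or two if duplicate IDs straddle a file boundary.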

The combination of compaction (optimal file sizes), Z-Ordering (clustered values), and Iceberg manifest statistics (data skipping) is a particularly effective way to improve lakehouse query performance on large tables, because each technique makes the next one more selective.

Figure 2: Z-Ordering narrows file statistics, making data skipping dramatically more selective.

Summary

Data skipping is a foundational query optimization for the data lakehouse. By reading per-file column statistics from Iceberg manifest files and eliminating files whose value ranges cannot satisfy query filters, data skipping can cut physical I/O by orders of magnitude for selective queries. Maximizing its effectiveness requires pairing it with good table management practices: compaction for right-sized files, and Z-ordering for clustered, narrow statistics. Together, these techniques are what make fast, selective analytical queries on very large Iceberg tables practical.
