What Is Data Skipping?
Data skipping is a query optimization technique in which a query engine eliminates data files from a scan before reading their contents. It uses per-file column statistics stored in Iceberg manifest files to prove that certain files cannot contain rows satisfying the query's filter conditions.
Data skipping is the second level of predicate pushdown in the Iceberg hierarchy (after partition pruning, before Parquet row group pruning). Partition pruning is coarse: it eliminates entire partitions' worth of files at once. Data skipping operates at the individual file level, eliminating files within a partition whose column value ranges don't overlap with the query's filter values.
How Column Statistics Enable Skipping
Each data file entry in an Iceberg manifest file stores per-column statistics:
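- lower_bound: A value no greater than any non-null value of that column in the file (normally the minimum; long string values may be stored in truncated form)
- upper_bound: A value no smaller than any non-null value of that column in the file (normally the maximum; long string values may be stored in truncated form)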
- null_value_count: The count of null values for that column in the file
These statistics are written when data files are created (during Iceberg writes) and stored in the manifest. They require no separate indexing step — they are a natural part of the Iceberg commit process.
For a query with WHERE revenue > 10000, the engine reads manifest statistics and skips any file where upper_bound(revenue) <= 10000 — those files contain no rows with revenue above 10,000 and cannot contribute to the result.
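The skip decision above can be sketched in a few lines of Python. This is a minimal illustration, not Iceberg's actual evaluator: the `FileStats` class, the helper functions, and the file names are hypothetical, and only the three statistics listed earlier are modeled for a single `revenue` column.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FileStats:
    """Hypothetical stand-in for one manifest entry's stats on `revenue`."""
    path: str
    lower_bound: float
    upper_bound: float
    null_value_count: int


def might_match_greater_than(stats: FileStats, threshold: float) -> bool:
    # A file can contain revenue > threshold only if its upper bound exceeds it.
    return stats.upper_bound > threshold


def might_match_is_null(stats: FileStats) -> bool:
    # A file can contain NULL revenue only if it recorded at least one null.
    return stats.null_value_count > 0


files = [
    FileStats("a.parquet", 100.0, 9_500.0, 0),
    FileStats("b.parquet", 8_000.0, 25_000.0, 3),
    FileStats("c.parquet", 10_000.0, 10_000.0, 0),
]

# WHERE revenue > 10000: a.parquet and c.parquet are provably irrelevant.
to_scan = [f.path for f in files if might_match_greater_than(f, 10_000)]
print(to_scan)  # ['b.parquet']

# WHERE revenue IS NULL: only files with a non-zero null count remain.
null_scan = [f.path for f in files if might_match_is_null(f)]
print(null_scan)  # ['b.parquet']
```

Note that the check is conservative. A file that passes it may still contain no matching rows; the statistics can only prove that a file is irrelevant, never that it is relevant.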

Z-Ordering Amplifies Data Skipping
Z-Ordering dramatically improves data skipping effectiveness. In an unordered table, data files contain a random mixture of values, so the min/max range for most columns in each file spans nearly the entire value domain. A filter like WHERE customer_id = 'C12345' would find that almost every file's customer_id range includes 'C12345', so nearly every file has to be read.
After Z-Ordering on customer_id (with a single column this is equivalent to sorting; Z-ordering proper interleaves several columns), files cluster data with similar customer IDs together. Now most files have a narrow customer_id range that doesn't include 'C12345', and are skipped. Only the handful of files whose range includes 'C12345' are read. In a table of 100 files, for example, that might be 1–3 files, or 1–3% of the files instead of nearly all of them.
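The effect of clustering on min/max ranges can be seen in a small simulation. This is a sketch under stated assumptions: customer IDs are modeled as random integers rather than strings like 'C12345', a plain single-column sort stands in for Z-ordering, and the `file_ranges` and `files_to_read` helpers are hypothetical names introduced for illustration.

```python
import random


def file_ranges(values, rows_per_file):
    """Split values into consecutive 'files' and return each file's (min, max)."""
    return [
        (min(values[i:i + rows_per_file]), max(values[i:i + rows_per_file]))
        for i in range(0, len(values), rows_per_file)
    ]


def files_to_read(ranges, target):
    """Count files whose [min, max] range could contain the target value."""
    return sum(1 for lo, hi in ranges if lo <= target <= hi)


rng = random.Random(42)  # fixed seed so runs are repeatable
customer_ids = [rng.randrange(1_000_000) for _ in range(100_000)]
rows_per_file = 1_000  # 100 simulated files

# Look up the median ID, a value that is present in the data.
target = sorted(customer_ids)[len(customer_ids) // 2]

unordered = file_ranges(customer_ids, rows_per_file)
clustered = file_ranges(sorted(customer_ids), rows_per_file)

print("total files:", len(unordered))
print("unordered, files to read:", files_to_read(unordered, target))
print("clustered, files to read:", files_to_read(clustered, target))
```

In the unordered layout, nearly every file's range straddles the target, so almost nothing can be skipped. After sorting, the ranges of consecutive files barely overlap, and the target falls inside only one or two of them.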
The combination of compaction (optimal file sizes), Z-Ordering (clustered values), and Iceberg manifest statistics (data skipping) is one of the most effective ways to improve lakehouse query performance on large tables.

Summary
Data skipping is a foundational query optimization for the data lakehouse. By reading per-file column statistics from Iceberg manifest files and eliminating files whose value ranges cannot satisfy query filters, data skipping can cut the physical I/O of selective queries by orders of magnitude. Maximizing data skipping effectiveness requires pairing it with good table management practices: compaction for right-sized files, and Z-ordering for clustered, narrow statistics. Together, these techniques help keep selective analytical queries fast even on very large Iceberg tables.