What is an Iceberg manifest file?

An Iceberg manifest file is an Avro file that lists a subset of a table's data files (or delete files), along with each file's partition values and column-level statistics (lower and upper bounds, value counts, and null counts per column, plus the file's row count). A manifest can be referenced by more than one snapshot. Manifest files are the primary mechanism for partition pruning and data skipping in Apache Iceberg.

What is the difference between a manifest file and a manifest list?

A manifest file lists individual data files and their statistics. A manifest list (snapshot file) groups together all the manifest files belonging to a specific snapshot. The hierarchy is: Snapshot → Manifest List → Manifest Files → Data Files.

How do manifest files enable data skipping?

Each manifest file records per-file statistics: the min and max value for each column in each data file. At query time, the engine reads manifest files and skips any data file whose min-max range for a queried column does not overlap with the query's filter predicate — avoiding all I/O for those files.

Iceberg Manifest Files: The Definitive Guide

What Are Iceberg Manifest Files?

Iceberg manifest files are Avro files that form the middle layer of Apache Iceberg's three-level metadata tree, sitting between the manifest list (snapshot) at the top and the actual data files at the bottom. Each manifest file tracks a subset of the table's data files — listing their storage locations, partition values, and per-column statistics.

Manifest files are the primary mechanism for two of Iceberg's most important query optimization capabilities: partition pruning (skipping entire groups of data files based on partition values) and data skipping (skipping individual data files based on column-level min/max statistics).

At query time, the engine reads the manifest list to find the snapshot's manifests, then reads those manifests to find data files, pruning at both levels. It first eliminates manifests whose partition summaries in the manifest list show no relevant data, then eliminates individual data files within the remaining manifests whose partition values or per-file statistics show no relevant rows. Only after this two-level pruning does the engine read actual data files.
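The two-level pruning described above can be sketched in a few lines of Python. This is a simplified model, not Iceberg's implementation: the DataFile and Manifest classes, the plan_scan function, and the single string-valued partition column are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class DataFile:
    path: str
    partition: str   # partition value, e.g. an event date (illustrative)
    lower: dict      # column name -> min value in this file
    upper: dict      # column name -> max value in this file


@dataclass
class Manifest:
    path: str
    part_lower: str  # smallest partition value among this manifest's files
    part_upper: str  # largest partition value among this manifest's files
    files: list = field(default_factory=list)


def plan_scan(manifest_list, part_value, column, value):
    """Plan the filter `partition = part_value AND column = value`.

    Returns (paths of files that may match, number of manifests opened).
    """
    selected, manifests_opened = [], 0
    for m in manifest_list:
        # Level 1: the partition summary rules out the whole manifest.
        if not (m.part_lower <= part_value <= m.part_upper):
            continue
        manifests_opened += 1
        for f in m.files:
            # Level 2: per-file partition value, then min/max bounds.
            if f.partition != part_value:
                continue
            if f.lower[column] <= value <= f.upper[column]:
                selected.append(f.path)
    return selected, manifests_opened


manifest_list = [
    Manifest("m1.avro", "2024-01-01", "2024-01-02", [
        DataFile("a.parquet", "2024-01-01", {"customer_id": 1}, {"customer_id": 100}),
        DataFile("b.parquet", "2024-01-02", {"customer_id": 1}, {"customer_id": 50}),
        DataFile("c.parquet", "2024-01-02", {"customer_id": 60}, {"customer_id": 120}),
    ]),
    Manifest("m2.avro", "2024-02-01", "2024-02-01", [
        DataFile("d.parquet", "2024-02-01", {"customer_id": 1}, {"customer_id": 200}),
    ]),
]

files, opened = plan_scan(manifest_list, "2024-01-02", "customer_id", 75)
print(files, opened)
```

In this toy table, m2.avro is never opened because its partition summary excludes the requested date, and within m1.avro only c.parquet has a customer_id range that can contain 75.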

Manifest File Structure

Each record in a manifest file represents one data file or delete file and contains:

file_path: The full S3/ADLS/GCS URL of the data file
file_format: The format of the data file (PARQUET, ORC, AVRO)
partition: The partition values for this file, encoded according to the table's partition spec
record_count: The number of rows in the file
file_size_in_bytes: The file's size in bytes
column_sizes: Per-column storage sizes (for projection optimization)
value_counts: Per-column value counts (including null values)
null_value_counts: Per-column null value counts
lower_bounds: Per-column minimum values (encoded in binary format)
upper_bounds: Per-column maximum values (encoded in binary format)

The lower_bounds and upper_bounds fields, the per-column min/max statistics, are what enable aggressive data skipping. Both maps are keyed by column field ID; column names are used here for readability. A file whose upper_bounds[customer_id] is less than the value in an equality or greater-than filter on customer_id can be skipped entirely. A file whose lower_bounds[event_date] is after the end of the query's date range can be skipped. Because these bounds are small and live in metadata, the engine eliminates irrelevant files without touching their contents.
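As a rough illustration of how a bound check might work, the sketch below decodes binary bounds and tests three simple predicates. It assumes the little-endian encodings used by Iceberg's single-value serialization for int, long, and date, and UTF-8 for strings. The decode_bound and can_skip helpers and the field ID 1 are hypothetical, and the sketch ignores nulls, NaN values, and missing or truncated bounds.

```python
import struct


def decode_bound(raw: bytes, type_name: str):
    """Decode one binary bound for a few primitive types (illustrative)."""
    if type_name in ("int", "date"):   # 4-byte little-endian
        return struct.unpack("<i", raw)[0]
    if type_name == "long":            # 8-byte little-endian
        return struct.unpack("<q", raw)[0]
    if type_name == "string":          # UTF-8 bytes
        return raw.decode("utf-8")
    raise ValueError(f"unsupported type: {type_name}")


def can_skip(op: str, value, lower, upper) -> bool:
    """True if no x with lower <= x <= upper can satisfy `x op value`."""
    if op == "==":
        return value < lower or value > upper
    if op == ">":
        return upper <= value
    if op == "<":
        return lower >= value
    raise ValueError(f"unsupported operator: {op}")


CUSTOMER_ID = 1  # hypothetical field ID for the customer_id column

# Bounds as they might appear in one manifest entry: field ID -> bytes.
lower_bounds = {CUSTOMER_ID: struct.pack("<q", 1000)}
upper_bounds = {CUSTOMER_ID: struct.pack("<q", 1999)}

lo = decode_bound(lower_bounds[CUSTOMER_ID], "long")
hi = decode_bound(upper_bounds[CUSTOMER_ID], "long")

skip_eq_5000 = can_skip("==", 5000, lo, hi)  # 5000 lies outside [1000, 1999]
skip_eq_1500 = can_skip("==", 1500, lo, hi)  # 1500 lies inside the range
skip_gt_1999 = can_skip(">", 1999, lo, hi)   # no value in the file exceeds 1999
print(skip_eq_5000, skip_eq_1500, skip_gt_1999)
```

Note that a file that cannot be skipped is only a candidate: the bounds prove that matching rows are possible, not that they exist.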

Figure 1: Manifest file contents (file location, partition values, and column-level statistics for data skipping).

Manifest Lists vs. Manifest Files

The distinction between manifest lists and manifest files is a frequent source of confusion:

Attribute	Manifest List	Manifest File
Also called	Snapshot file	Manifest
Format	Avro	Avro
Lists	Manifest files (one per entry)	Data/delete files (thousands per file)
Statistics	Partition summaries for each referenced manifest	Per-file column statistics
Created by	Each committed transaction	Each batch of new data files
Reused across snapshots?	No (new manifest list per snapshot)	Yes (unchanged manifests reused)

The manifest list contains partition-level summary statistics for each manifest it references — these are used for the first level of pruning (eliminating entire manifests). The manifest files contain per-file statistics — used for the second level of pruning within a manifest.

How Iceberg Reuses Manifests

One of Iceberg's most important performance optimizations is manifest reuse: when a transaction adds data to only some partitions, the manifests for unchanged partitions are reused in the new snapshot — not copied or rewritten.

For example, consider a table with 1000 manifest files covering 1 million data files. A new INSERT adds 100 new files to 5 partitions. In the simplest case (an append that does not merge manifests), Iceberg writes new manifests only for those 100 files, often a single small manifest, and commits a new snapshot whose manifest list references the new manifests plus all 1000 existing manifests. The existing manifests are not touched; they are simply referenced by the new snapshot's manifest list.

This reuse property makes Iceberg's snapshot commits fast and cheap even for very large tables: committing a new snapshot requires writing only the new manifests (for the changed partitions) and a new manifest list — not reading or writing all existing manifests.
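A toy model makes the cost of such a commit concrete. Here a snapshot's manifest list is represented as a plain Python list of manifest paths. The commit_append function and the file names are invented for illustration, and the sketch assumes an append that writes one new manifest and merges nothing.

```python
def commit_append(manifest_list, new_manifests):
    """Build the manifest list for a new snapshot.

    Existing manifests are carried over by reference (their paths);
    only the entries in new_manifests correspond to newly written files.
    """
    return list(manifest_list) + list(new_manifests)


# Snapshot 1: a hypothetical table with 1000 manifests.
snapshot_1 = [f"manifest-{i:04d}.avro" for i in range(1000)]

# An append writes one new manifest describing the newly added data files.
snapshot_2 = commit_append(snapshot_1, ["manifest-1000.avro"])

written = set(snapshot_2) - set(snapshot_1)   # files the commit had to write
reused = set(snapshot_2) & set(snapshot_1)    # files shared by both snapshots
print(len(written), len(reused), len(snapshot_1))
```

The new snapshot shares every existing manifest with its predecessor, and the older snapshot's manifest list is left unchanged, which is also what keeps the earlier snapshot readable.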

Figure 2: Unchanged manifests are reused across snapshots; only new manifests are written per transaction.

Manifest File Compaction

Over time, a table accumulates many small manifest files, particularly if data is added in many small batches. Many small manifests slow down query planning, because the engine must open and read each manifest separately, and they add overhead to metadata maintenance.

Iceberg's rewriteManifests operation consolidates small manifest files into fewer, larger manifests while preserving all the file-level statistics. After rewriting, the engine reads fewer manifest files per query, reducing query planning latency. In Dremio, manifest rewriting is part of the automated table optimization — it runs automatically when manifest count exceeds configured thresholds.
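The effect of consolidation can be illustrated with a simple packing routine. This is not the algorithm rewriteManifests uses (a real rewrite can also cluster entries, for example by partition). The consolidate function and the target_entries parameter are assumptions for the sketch, and each manifest is modeled as a list of data-file entries.

```python
def consolidate(manifests, target_entries):
    """Repack manifest entries into manifests of up to target_entries each.

    Every entry (and therefore its statistics) is carried over unchanged;
    only the grouping into manifest files differs.
    """
    rewritten, current = [], []
    for manifest in manifests:
        for entry in manifest:
            current.append(entry)
            if len(current) == target_entries:
                rewritten.append(current)
                current = []
    if current:
        rewritten.append(current)
    return rewritten


# 50 small manifests with 10 data-file entries each (hypothetical paths).
small = [[f"data-{m:02d}-{i:02d}.parquet" for i in range(10)] for m in range(50)]

compacted = consolidate(small, target_entries=200)
print(len(small), "->", len(compacted))
```

In this example the planner would open 3 manifests instead of 50 while seeing exactly the same 500 entries.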

Manifest Files and Data Skipping Effectiveness

The effectiveness of manifest file-based data skipping depends on the correlation between the queried columns and the physical data layout. When rows with similar column values are co-located in the same files (via Z-ordering or sort ordering), the per-file min-max statistics are narrow — the engine can skip most files. When rows are randomly distributed across files, min-max statistics span wide ranges, and few files can be skipped.

This is why compaction with sort ordering or Z-ordering is so important: it creates data files with narrow, well-targeted min-max statistics. Manifest files faithfully record these statistics, enabling query engines to leverage the sorted layout for aggressive pruning.
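A small simulation shows how much layout matters. It writes the same synthetic values into 100 files twice, once in arrival order and once sorted, and counts how many files' min/max ranges could contain a single looked-up value. The row counts, value range, and helper names are arbitrary choices for the sketch.

```python
import random


def file_stats(values, rows_per_file):
    """Split values into files and return each file's (min, max)."""
    chunks = (values[i:i + rows_per_file]
              for i in range(0, len(values), rows_per_file))
    return [(min(chunk), max(chunk)) for chunk in chunks]


def files_to_read(stats, value):
    """Count files whose [min, max] range could contain value."""
    return sum(1 for lo, hi in stats if lo <= value <= hi)


rng = random.Random(42)
values = [rng.randrange(1_000_000) for _ in range(100_000)]

random_layout = file_stats(values, rows_per_file=1000)
sorted_layout = file_stats(sorted(values), rows_per_file=1000)

target = sorted(values)[len(values) // 2]  # look up the median value

unsorted_reads = files_to_read(random_layout, target)
sorted_reads = files_to_read(sorted_layout, target)
print(f"random layout: {unsorted_reads} of {len(random_layout)} files")
print(f"sorted layout: {sorted_reads} of {len(sorted_layout)} files")
```

With randomly distributed rows, almost every file's range spans nearly the whole value domain, so almost nothing can be skipped. After sorting, the ranges barely overlap and only a file or two remains a candidate.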

Summary

Iceberg manifest files are the metadata layer that makes petabyte-scale query performance possible in the data lakehouse. By recording per-file column statistics for millions of data files, manifests enable a two-level pruning hierarchy that can eliminate most storage reads before any data is accessed, provided the data layout matches the query's filters. Manifest reuse across snapshots makes transaction commits fast; manifest compaction keeps query planning efficient as tables grow. Understanding manifest files is key to understanding why well-maintained Iceberg tables can answer selective queries quickly even over petabytes of data.