Introduction to Iceberg's Abstraction
Before Apache Iceberg, data lakes relied on the Apache Hive metastore model. In Hive, a table is defined as a directory in a file system, and partitions are subdirectories. When a query engine needed to read a table, it had to list the contents of those directories. This "directory-first" approach caused severe performance bottlenecks on cloud object storage (like Amazon S3 or Google Cloud Storage), where listing operations are slow and expensive; it made safe concurrent writes nearly impossible; and it left tables vulnerable to the dreaded "small file problem".
Apache Iceberg fundamentally reverses this paradigm. In Iceberg, a table is defined not by a directory, but by an explicit, hierarchical list of files tracked in a metadata tree. By tracking data at the file level rather than the directory level, Iceberg eliminates expensive directory listing operations, enables safe concurrent transactions (ACID), and unlocks advanced features like time travel, schema evolution, and hidden partitioning.
The Metadata Tree: A Layered Architecture
The genius of Apache Iceberg lies in its metadata structure. The architecture is best understood as a tree that points downward from a single root file to the actual data records. This tree has four distinct layers:
- The Catalog: The external system that tracks the current root metadata file.
- Metadata Files (JSON): Defines the table state, schema, partitioning, and all snapshots.
- Manifest Lists (Avro): Summarizes the manifest files for a specific snapshot.
- Manifest Files (Avro): Lists the actual data files and tracks file-level statistics.
- Data Files (Parquet, ORC, Avro): The underlying raw data.
```mermaid
graph TD
Cat[Iceberg Catalog] -->|Points to current| Meta1[Table Metadata v2.json]
Meta1 -->|Contains Snapshots| S1[Snapshot 1]
Meta1 --> S2[Snapshot 2 Current]
S2 --> ML[Manifest List .avro]
ML -->|Tracks Manifests| MF1[Manifest File 1 .avro]
ML --> MF2[Manifest File 2 .avro]
MF1 -->|Tracks Data| D1[(Data File .parquet)]
MF1 --> D2[(Data File .parquet)]
MF2 --> D3[(Data File .parquet)]
MF2 --> D4[(Data File .parquet)]
style Cat fill:#f8fafc,stroke:#334155,stroke-width:2px
style Meta1 fill:#e0f2fe,stroke:#0284c7
style ML fill:#dbeafe,stroke:#2563eb
style MF1 fill:#bfdbfe,stroke:#3b82f6
style MF2 fill:#bfdbfe,stroke:#3b82f6
style D1 fill:#f0fdf4,stroke:#16a34a
```
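You can inspect every layer of this tree yourself. Iceberg exposes its metadata layers as queryable system tables; the sketch below assumes Spark SQL with the Iceberg runtime and a hypothetical `db.sales` table:

```sql
-- Snapshot layer: each snapshot points to exactly one manifest list
SELECT snapshot_id, operation, manifest_list
FROM db.sales.snapshots;

-- Manifest list layer: one row per manifest file, with file counts
SELECT path, added_data_files_count, existing_data_files_count, deleted_data_files_count
FROM db.sales.manifests;

-- Manifest layer: one row per tracked data file
SELECT file_path, file_format, record_count
FROM db.sales.files;
```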
1. The Catalog: The Entry Point
Because Iceberg files sit in cloud object storage, an engine needs to know where to begin reading. This is the job of the Iceberg Catalog. A catalog acts as the atomic swap mechanism that guarantees ACID transactions. Examples of catalogs include the Iceberg REST Catalog, Apache Polaris, Project Nessie, AWS Glue, and Hive Metastore.
The catalog holds a pointer to the absolute URI of the current Table Metadata JSON file. When a writer wants to commit a change (like inserting data), it creates a new Table Metadata JSON file. It then asks the Catalog to atomically swap the pointer from the old JSON file to the new JSON file. If two writers attempt to swap the pointer at the exact same millisecond, the catalog enforces Optimistic Concurrency Control (OCC)—one writer succeeds, and the other is forced to retry.
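The trail of these pointer swaps is visible after the fact. Assuming the same hypothetical `db.sales` table in Spark SQL, the `metadata_log_entries` metadata table lists each metadata JSON file the catalog pointer has moved through:

```sql
-- One row per committed metadata file, in commit order
SELECT timestamp, file, latest_snapshot_id
FROM db.sales.metadata_log_entries
ORDER BY timestamp;
```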
2. Table Metadata Files (JSON)
The root of the Iceberg tree is the Table Metadata file. It is a JSON document that defines the complete state of the table at a specific point in time. Every time the table is modified (schema change, data insert, partition spec change), a brand new Metadata JSON file is written. The previous JSON file is retained to support time travel.
Inside the Metadata JSON file, you will find:
- Table Schema: The column names, types, and their unique IDs. Iceberg uses unique IDs to track columns, enabling safe schema evolution (renaming, dropping, adding) without breaking historical data.
- Partition Specs: Defines how the data is partitioned. Iceberg supports Hidden Partitioning, meaning the partition spec can change over time without requiring data rewrites.
- Snapshots Array: An array of all historical states of the table.
- Current Snapshot ID: A pointer to the specific snapshot that currently represents the active state of the table.
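Because the schema and partition specs live in this JSON file, evolving them is a metadata-only operation. A minimal sketch in Spark SQL (the `ADD PARTITION FIELD` statement requires the Iceberg SQL extensions; the table and column names here are assumptions):

```sql
-- Renaming is safe because Iceberg tracks columns by unique ID, not by name
ALTER TABLE db.sales RENAME COLUMN amount TO total_amount;

-- Adding a column rewrites no data files
ALTER TABLE db.sales ADD COLUMN discount decimal(10, 2);

-- Changing the partition spec affects only newly written data
ALTER TABLE db.sales ADD PARTITION FIELD days(order_ts);
```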
3. Snapshots and Manifest Lists (Avro)
A Snapshot is the state of the table at a specific moment in time. Because every change generates a new snapshot, Iceberg natively supports Time Travel. You can tell your query engine `SELECT * FROM table FOR SYSTEM_VERSION AS OF 123456789`, and it will read exactly the data files that were active in that snapshot.
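In Spark SQL, for example, time travel reads look like this (the snapshot ID and timestamp are placeholders):

```sql
-- Read the table as of a specific snapshot ID
SELECT * FROM db.sales VERSION AS OF 123456789;

-- Or as of a wall-clock timestamp
SELECT * FROM db.sales TIMESTAMP AS OF '2026-01-01 00:00:00';
```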
Every Snapshot points to a single Manifest List file (stored in Avro format). The Manifest List is an index of all the Manifest Files that make up that snapshot. But it doesn't just list them; it tracks vital statistics about each Manifest File, such as:
- The partition bounds (minimum and maximum partition values) contained in that manifest.
- The number of added, existing, and deleted data files.
During a query, the engine reads the Manifest List first. If a query filters by a specific date, and the Manifest List shows that a particular Manifest File only contains data for a different year, the engine skips reading that entire Manifest File. This is the first layer of pruning.
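Those per-manifest partition bounds are visible in the `manifests` metadata table (Spark SQL again, with the same hypothetical table):

```sql
-- partition_summaries holds the lower/upper partition bounds per manifest,
-- which the engine compares against the query predicate to skip manifests
SELECT path, partition_summaries
FROM db.sales.manifests;
```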
4. Manifest Files (Avro)
If the Manifest List is the index, the Manifest Files are the actual ledgers. A manifest file explicitly lists a subset of the data files that make up the table. It also stores deep, column-level statistics for every single data file it tracks.
For each data file, the manifest tracks:
- The absolute URI of the data file in object storage.
- The file format (Parquet, ORC, Avro).
- The partition tuple it belongs to.
- Column-level Min/Max values: The lowest and highest value for every column in that specific file.
- Null and NaN counts: How many null values (and, for floating-point columns, NaN values) each column contains in that file.
This is where Iceberg achieves its legendary query performance. When a query includes a `WHERE` clause (e.g., `WHERE customer_id = 500`), the query engine inspects the Manifest File. It looks at the min/max statistics for `customer_id` for every data file. If a data file's min is 1000 and its max is 2000, the engine knows that `customer_id = 500` cannot possibly exist in that file, so it skips downloading and parsing that Parquet file entirely. This is known as min/max filtering: the predicate is pushed down into query planning and evaluated against file statistics instead of the data itself.
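The same statistics are exposed in the `files` metadata table, so you can see exactly what the engine sees (the bounds are maps keyed by column ID):

```sql
-- Per-file column statistics used for min/max pruning
SELECT file_path, record_count,
       lower_bounds, upper_bounds,
       null_value_counts, nan_value_counts
FROM db.sales.files;
```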
5. Data and Delete Files
At the very bottom of the tree are the actual data files. These are typically stored in Apache Parquet, a highly compressed, columnar storage format optimized for analytical queries. Iceberg is format-agnostic, meaning it also supports ORC and Avro, but Parquet is the industry standard.
Iceberg's v2 format (the current standard) introduced Row-Level Deletes. When you run a `DELETE` or `UPDATE` statement on a massive table, rewriting a 1GB Parquet file just to change a single row is incredibly inefficient. Iceberg solves this with Delete Files.
There are two types of delete files:
- Position Delete Files: Stores the exact file path and row position of the deleted row. When the query engine reads the data, it applies the position delete file on the fly to hide the deleted rows.
- Equality Delete Files: Stores the condition of the deletion (e.g., `id = 5`). The engine filters out rows matching this condition during read.
This approach is known as Merge-on-Read (MoR). It makes writes blazingly fast because deletes are simply appended as new small files. Eventually, background compaction jobs will merge the data and delete files into fresh, clean Parquet files (a process called Rewrite Data Files or Compaction).
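A sketch of opting a table into merge-on-read behavior in Spark SQL (the property names come from Iceberg's table-property reference; `db.sales` remains a hypothetical table):

```sql
-- Ask writers to produce delete files instead of rewriting data files
ALTER TABLE db.sales SET TBLPROPERTIES (
  'write.delete.mode' = 'merge-on-read',
  'write.update.mode' = 'merge-on-read'
);

-- This delete now appends a small delete file
-- rather than rewriting the affected Parquet files
DELETE FROM db.sales WHERE id = 5;
```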
The Read Path: How a Query Executes
Let's trace exactly what happens when you run `SELECT count(*) FROM sales WHERE date = '2026-01-01' AND region = 'NA'`:
- Catalog Lookup: The query engine (like Dremio or Spark) asks the Catalog for the current Table Metadata JSON file.
- Read Metadata: The engine parses the JSON to find the Current Snapshot ID, and locates the Manifest List for that snapshot.
- Manifest List Pruning: The engine reads the Manifest List and checks each manifest's partition bounds. It skips any Manifest File whose bounds show it cannot contain the `region = 'NA'` and `date = '2026-01-01'` partitions.
- Manifest File Pruning: For the surviving Manifest Files, the engine reads the file-level statistics. It checks the min/max bounds for the `date` and `region` columns on every data file. It skips data files that cannot possibly match the predicate.
- Data File Read: The engine has narrowed down petabytes of data to perhaps just a dozen relevant Parquet files. It downloads those specific files from object storage.
- Result Processing: The engine applies any active Delete Files to the data, computes the count, and returns the result to the user.
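You can watch this plan take shape with a standard `EXPLAIN`; the exact output varies by engine and version, but the Iceberg scan node will show the pushed-down filters:

```sql
EXPLAIN
SELECT count(*) FROM sales WHERE date = '2026-01-01' AND region = 'NA';
```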
The Write Path: The Commit Flow
Writing data to Iceberg is designed to be completely safe, regardless of how many engines are reading or writing at the same time.
- The write engine begins writing new Parquet data files to object storage. (Readers ignore these files because they aren't in the metadata yet).
- The engine creates a new Manifest File tracking these new Parquet files.
- The engine creates a new Manifest List that includes the old manifests plus the new manifest.
- The engine generates a new Snapshot and writes a new Table Metadata JSON file pointing to it.
- The Atomic Commit: The engine attempts to swap the catalog pointer to this new JSON file. If successful, the transaction is committed, and all new readers instantly see the new data.
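A single `INSERT` exercises this whole flow, and the commit shows up as a new snapshot at the head of the log (the schema and values below are assumptions):

```sql
-- Steps 1 through 5 all happen behind this one statement
INSERT INTO db.sales VALUES (DATE '2026-01-01', 'NA', 500, 99.99);

-- The commit appears as a new 'append' snapshot
SELECT committed_at, snapshot_id, operation
FROM db.sales.snapshots
ORDER BY committed_at DESC
LIMIT 1;
```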
Maintenance: Compaction and Garbage Collection
Because Iceberg creates a new snapshot and new metadata files for every transaction, the metadata tree grows over time. Furthermore, streaming workloads can create thousands of tiny Parquet files, slowing down query planning.
To keep the architecture healthy, Iceberg relies on background maintenance operations:
- Compaction (Rewrite Data Files): A scheduled job reads thousands of tiny Parquet files and rewrites them into larger, optimized files (typically 128MB to 512MB). It creates a new snapshot pointing to the large files and removes pointers to the small ones.
- Expire Snapshots: A job that deletes old snapshots (e.g., snapshots older than 7 days) from the Metadata JSON file. This limits how far back Time Travel can go but keeps the JSON file small.
- Remove Orphan Files: A job that scans the table location for files referenced by no snapshot or metadata file (including leftovers from failed writes) and permanently deletes them from object storage to save cloud costs.
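All three operations are exposed as stored procedures by the Iceberg Spark runtime (procedure and parameter names follow the Iceberg Spark documentation; the catalog name and cutoff timestamp are placeholders):

```sql
-- Compaction: bin-pack small files into larger, optimized files
CALL spark_catalog.system.rewrite_data_files(table => 'db.sales');

-- Drop snapshots older than the cutoff, bounding the time-travel window
CALL spark_catalog.system.expire_snapshots(
  table => 'db.sales',
  older_than => TIMESTAMP '2026-01-01 00:00:00'
);

-- Delete files in the table location that no metadata references
CALL spark_catalog.system.remove_orphan_files(table => 'db.sales');
```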
Conclusion
The Apache Iceberg architecture represents a monumental leap forward in data engineering. By shifting the source of truth from the file system directory to an explicit, hierarchical metadata tree, it brings database-like ACID transactions, performance, and reliability to commodity cloud object storage.
Understanding this architecture—snapshots, manifest lists, and atomic catalog swaps—is essential for any data engineer building a modern data lakehouse. It is the foundation that allows engines like Dremio, Spark, and Flink to interoperate seamlessly on a single, governed copy of data.