What Is Merge-on-Read?
Merge-on-Read (MoR) is a write strategy in Apache Iceberg (V2 and later) where UPDATE and DELETE operations write small auxiliary delete files alongside existing data files, rather than rewriting the data files. The merge — reconciling data files with delete files to produce the correct view of the table — is deferred to query time.
MoR's fundamental advantage is write efficiency. Deleting one row from a 500 MB data file in MoR mode costs the I/O of writing a tiny delete file (a few kilobytes) rather than reading and rewriting the entire 500 MB file. For workloads with thousands of record-level changes per minute — such as CDC pipelines from production databases — this difference in write efficiency is often what makes the lakehouse practical as a target.
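The write-cost gap can be made concrete with rough arithmetic. A minimal sketch, using the 500 MB file from the example above and an assumed ~4 KB positional delete file (both numbers illustrative, not measurements):

```python
# Illustrative write-amplification comparison for deleting one row
# from a single large data file. Sizes are assumptions, not measurements.
DATA_FILE_BYTES = 500 * 1024 * 1024   # the 500 MB data file from the example
DELETE_FILE_BYTES = 4 * 1024          # an assumed ~4 KB positional delete file

# Copy-on-Write: read and rewrite the whole data file.
cow_write_bytes = DATA_FILE_BYTES

# Merge-on-Read: write only the tiny delete file.
mor_write_bytes = DELETE_FILE_BYTES

amplification = cow_write_bytes / mor_write_bytes
print(f"CoW writes {amplification:,.0f}x more bytes than MoR for this delete")
```

Multiplied across thousands of record-level changes per minute, this per-operation gap is what separates a workable CDC target from an unworkable one.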
The trade-off is increased read overhead: queries must read both data files and their associated delete files, then merge them to filter out deleted rows and apply updated values. For tables with heavy delete file accumulation, this merge overhead can significantly slow reads. Compaction, which rewrites data files and their accumulated delete files into clean Copy-on-Write (CoW) style files, is the maintenance operation that restores read performance.
Iceberg V2 Delete File Types
Iceberg V2 defines two delete file types, each optimized for different use cases:
Positional Delete Files
A positional delete file records the exact location (file path + row position) of each deleted row. When a query reads a data file, it checks the associated positional delete files and skips rows at the recorded positions. Positional deletes are efficient to apply at read time and well suited to sparse, random row deletes — each file path + position record is small and indexable. Writing them, however, requires the engine to determine each deleted row's position within its data file.
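A minimal sketch of how a reader applies positional deletes. The data structures are simplified stand-ins (real delete files are Parquet/ORC/Avro files with `file_path` and `pos` columns), and the file path is hypothetical:

```python
# Sketch: apply positional deletes while scanning one data file.
# A positional delete record pairs a data file path with a 0-based row position.
positional_deletes = {
    ("s3://bucket/data/file-001.parquet", 2),
    ("s3://bucket/data/file-001.parquet", 5),
}

def scan_with_positional_deletes(file_path, rows):
    """Yield rows from a data file, skipping positions marked as deleted."""
    for pos, row in enumerate(rows):
        if (file_path, pos) not in positional_deletes:
            yield row

rows = ["r0", "r1", "r2", "r3", "r4", "r5"]
live = list(scan_with_positional_deletes("s3://bucket/data/file-001.parquet", rows))
# rows at positions 2 and 5 are filtered out of the result
```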
Equality Delete Files
An equality delete file records the column values that identify rows to delete — for example, customer_id = 'abc123'. When a query reads data files, it applies the equality delete predicates: rows whose column values match any equality delete record are filtered out. Equality deletes are natural for CDC MERGE INTO workloads, where the source system identifies records by a natural key rather than a physical position. Because the writer needs only the key values, it can emit equality deletes without reading existing data files to locate row positions, at the cost of more work at read time.
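The matching logic can be sketched in a few lines of Python (simplified in-memory structures; the customer_id example follows the text, and the row values are made up):

```python
# Sketch: apply equality deletes while scanning data rows.
# Each equality delete record holds column values identifying rows to remove.
equality_deletes = [
    {"customer_id": "abc123"},   # the example predicate from the text
]

def matches_any_delete(row, deletes):
    """True if the row matches every column of some equality delete record."""
    return any(all(row.get(col) == val for col, val in d.items()) for d in deletes)

rows = [
    {"customer_id": "abc123", "name": "Ana"},
    {"customer_id": "xyz789", "name": "Ben"},
]
live = [r for r in rows if not matches_any_delete(r, equality_deletes)]
```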

MoR Write Flow
The MoR write flow for a DELETE operation:
- The engine applies the WHERE clause to Iceberg metadata, identifying which data files contain rows matching the delete predicate
- For each affected data file, the engine writes a positional delete file recording the positions of deleted rows within that file
- The new delete files are recorded in a new manifest alongside their associated data files
- A new snapshot is committed referencing the new manifest
No data files are rewritten. The affected data files remain in storage unchanged; the delete files mark which rows within them should be excluded from query results. The entire write operation touches only a tiny fraction of the data's total size.
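The four steps above can be sketched end to end. Everything here is simplified in-memory bookkeeping, not the Iceberg API — the file names, manifest, and snapshot shapes are illustrative assumptions:

```python
# Sketch of a MoR DELETE: find matching positions, record them as
# positional deletes, and commit a new snapshot referencing them.
data_files = {
    "file-001.parquet": [{"id": 1}, {"id": 2}, {"id": 3}],
    "file-002.parquet": [{"id": 4}, {"id": 5}],
}

def mor_delete(predicate):
    # Steps 1-2: locate matching rows, record (path, position) per data file.
    delete_records = [
        (path, pos)
        for path, rows in data_files.items()
        for pos, row in enumerate(rows)
        if predicate(row)
    ]
    # Steps 3-4: a new manifest lists the delete records alongside the
    # untouched data files; a new snapshot references that manifest.
    manifest = {"delete_files": delete_records, "data_files": list(data_files)}
    return {"manifest": manifest}

snap = mor_delete(lambda row: row["id"] in (2, 5))
# data_files is never rewritten; only small (path, position) records are produced
```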
MoR Read Overhead and Compaction
MoR's write efficiency comes at the cost of read overhead. Each query against a MoR table must:
- Read data files
- For each data file, check for associated delete files
- For positional deletes: skip rows at the recorded positions
- For equality deletes: filter rows matching the equality predicates
As delete files accumulate from many CDC operations, this merge overhead grows. A table that was initially fast to read becomes progressively slower as delete files pile up. Scheduled compaction (using Iceberg's RewriteDataFiles with delete file handling) merges data and delete files into clean CoW files, restoring read performance. Dremio's automated table optimization handles this transparently.
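The effect of compaction can be sketched as merging a data file with its accumulated positional deletes into one clean rewritten file (a toy model; real compaction also bin-packs files and handles equality deletes):

```python
# Sketch of compaction: fold a data file's positional deletes into a
# clean rewritten file, so future reads need no merge step.
rows = ["r0", "r1", "r2", "r3"]
deleted_positions = {1, 3}   # accumulated from earlier MoR deletes

compacted = [row for pos, row in enumerate(rows) if pos not in deleted_positions]
# The new snapshot references only the compacted file; the old data file
# and its delete files are no longer needed for reads.
```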

MoR for CDC Pipelines
MoR is the standard write strategy for CDC pipeline targets in the Silver layer. A typical CDC pipeline flow with MoR:
- Source database changes are captured by Debezium as Kafka events
- Flink or Spark Structured Streaming reads Kafka events and writes them to Bronze Iceberg tables (append-only)
- A Silver transformation job runs MERGE INTO against the Silver Iceberg table (MoR mode): inserts new records, updates existing records (writes equality delete + new insert), deletes removed records (writes equality delete)
- Periodic compaction merges accumulated delete files into clean Silver data files
This pattern delivers near-real-time data freshness in Silver with manageable operational complexity.
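The Silver MERGE step above can be sketched per change event: updates and deletes emit an equality delete for the old version, and creates and updates emit the new row. The event shape (`op` codes c/u/d) is an assumption loosely modeled on Debezium output, not its actual schema:

```python
# Sketch of MERGE INTO in MoR mode: each CDC event becomes an equality
# delete (drop the old version, if any) and/or an insert (the new version).
def merge_into(change_events, key="customer_id"):
    equality_deletes, inserts = [], []
    for event in change_events:
        if event["op"] in ("u", "d"):      # update or delete: remove old row
            equality_deletes.append({key: event["row"][key]})
        if event["op"] in ("c", "u"):      # create or update: write new row
            inserts.append(event["row"])
    return equality_deletes, inserts

events = [
    {"op": "c", "row": {"customer_id": "c3", "tier": "free"}},  # new record
    {"op": "u", "row": {"customer_id": "a1", "tier": "pro"}},   # updated record
    {"op": "d", "row": {"customer_id": "b2", "tier": "free"}},  # removed record
]
deletes, inserts = merge_into(events)
```

No existing Silver data file is read or rewritten by this step; the accumulated equality deletes are what periodic compaction later folds away.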
Choosing Between MoR and CoW
The write frequency and read/write ratio should drive the CoW vs MoR decision:
- Gold layer tables: Use CoW. Reads dominate; batch partition overwrites are the primary write pattern. Clean files without delete overhead maximize BI query performance.
- Silver layer with frequent CDC updates: Use MoR. Write frequency is high; write cost of CoW would be prohibitive. Schedule regular compaction to manage delete file accumulation.
- Bronze layer: Append-only. Neither CoW nor MoR applies — no UPDATE or DELETE operations at the Bronze layer.
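In Iceberg, these choices are expressed per operation through table write properties. A hedged Spark SQL sketch (the table name is hypothetical; the property names and values are standard Iceberg V2 write options):

```sql
-- Hypothetical Silver CDC table: use MoR for all row-level operations.
ALTER TABLE silver.customers SET TBLPROPERTIES (
  'write.delete.mode' = 'merge-on-read',
  'write.update.mode' = 'merge-on-read',
  'write.merge.mode'  = 'merge-on-read'
);
```

A Gold table would set the same properties to 'copy-on-write' (also the default), keeping its files clean for read-heavy BI workloads.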
Summary
Merge-on-Read is Apache Iceberg's write strategy for high-frequency update workloads. By writing small delete files instead of rewriting entire data files, MoR makes CDC upsert pipelines practical at scale. The read overhead from delete file accumulation is managed through regular compaction, which merges delete files into clean data files and restores read performance. Understanding when to use MoR vs Copy-on-Write is fundamental to designing efficient Iceberg table pipelines.