What Is an Upsert?
An upsert (a portmanteau of update and insert) is a database write operation that combines two behaviors in a single statement: if a record matching the specified key already exists in the target table, update it with the new values; if no matching record exists, insert it as a new row. Upsert is the fundamental operation for maintaining current-state tables from change data streams.
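The same insert-or-update semantics can be sketched with SQLite's INSERT ... ON CONFLICT clause (an upsert variant available in Python's standard library; the table and column names here are illustrative, not part of any Iceberg schema):

```python
import sqlite3

# In-memory table keyed by order_id; a minimal sketch of upsert semantics.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER PRIMARY KEY, status TEXT)")

def upsert(order_id, status):
    # If a row with this order_id exists, update its status;
    # otherwise insert a new row. One statement, both behaviors.
    conn.execute(
        "INSERT INTO orders (order_id, status) VALUES (?, ?) "
        "ON CONFLICT(order_id) DO UPDATE SET status = excluded.status",
        (order_id, status),
    )

upsert(1, "pending")   # no match -> insert
upsert(1, "shipped")   # match -> update
print(conn.execute("SELECT order_id, status FROM orders").fetchall())
# -> [(1, 'shipped')]
```

Running both calls leaves exactly one row per key, which is the defining property of an upsert-maintained table.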
In the data lakehouse, upsert is the operation that makes Silver tables valuable: where Bronze tables hold a raw historical log, a Silver table maintained via upsert holds a continuously updated current state that mirrors the source operational database. A Silver orders table maintained via upsert always shows the current status of each order (not a log of all status changes), so analysts can ask 'how many orders are currently pending?' rather than 'how many order_status = pending events have ever occurred?'
MERGE INTO: Iceberg's Upsert Implementation
Apache Iceberg implements upsert through the SQL-standard MERGE INTO statement (standardized in SQL:2003):
MERGE INTO silver.orders AS target
USING bronze.orders_cdc AS source
ON target.order_id = source.order_id
WHEN MATCHED AND source.op_type = 'D'
THEN DELETE
WHEN MATCHED AND source.op_type IN ('U', 'R')
THEN UPDATE SET
target.status = source.status,
target.updated_at = source.updated_at
WHEN NOT MATCHED AND source.op_type = 'I'
THEN INSERT (order_id, status, created_at, updated_at)
VALUES (source.order_id, source.status, source.created_at, source.updated_at);

This MERGE INTO statement processes a batch of CDC events (inserts, updates, and deletes) from the Bronze CDC table and applies them atomically to the Silver current-state table, maintaining a correct, current-state representation of the source orders table.
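The branching logic of the MERGE INTO above can be sketched in plain Python, with a dict keyed by order_id standing in for the Silver table and each CDC event carrying an op_type of 'I', 'U', 'R', or 'D' (the event fields mirror the SQL example; the dict is only a model, not how Iceberg stores data):

```python
silver = {}  # order_id -> row dict (stands in for the Silver current-state table)

def apply_cdc_batch(events):
    for e in events:
        key = e["order_id"]
        if e["op_type"] == "D":
            # WHEN MATCHED AND op_type = 'D' THEN DELETE
            silver.pop(key, None)
        elif e["op_type"] in ("U", "R"):
            # WHEN MATCHED AND op_type IN ('U', 'R') THEN UPDATE SET ...
            if key in silver:
                silver[key].update(status=e["status"], updated_at=e["updated_at"])
        elif e["op_type"] == "I":
            # WHEN NOT MATCHED AND op_type = 'I' THEN INSERT ...
            if key not in silver:
                silver[key] = {"status": e["status"],
                               "created_at": e["created_at"],
                               "updated_at": e["updated_at"]}

apply_cdc_batch([
    {"op_type": "I", "order_id": 1, "status": "pending", "created_at": "t0", "updated_at": "t0"},
    {"op_type": "U", "order_id": 1, "status": "shipped", "created_at": "t0", "updated_at": "t1"},
])
print(silver[1]["status"])  # -> shipped
```

After the batch, the table holds one row per key reflecting the latest event, exactly the current-state guarantee the SQL statement provides.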

CoW vs MoR Upsert Strategies
Apache Iceberg supports two physical write strategies for upsert operations:
Copy-on-Write (CoW)
When a MERGE INTO updates a row, Iceberg rewrites the entire Parquet data file containing that row, incorporating the updated value into a new file and removing the old file from the new table snapshot. Read queries require no merge logic, because every data file contains only current values. Best for tables with low update frequency but high read volume.
Merge-on-Read (MoR)
When a MERGE INTO updates a row, Iceberg writes a small delete file (for CDC workloads, typically an equality delete that marks the old row as deleted by its key) and a new data file containing the updated row. Old data files are not rewritten. Read queries must merge data files with delete files to produce current values. Best for high-frequency CDC streams where rewriting large files on every update is impractical. Compaction periodically merges delete files back into data files to maintain read performance.
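The read-time merge can be sketched under simplifying assumptions: data "files" are lists of rows, an equality delete "file" is a set of keys, and the delete set is assumed to apply only to the data file written before it (real Iceberg scopes deletes by sequence number; this model collapses that to two files):

```python
data_files = [
    # original file, left in place (not rewritten)
    [{"order_id": 1, "status": "pending"}, {"order_id": 2, "status": "pending"}],
    # new file produced by the update
    [{"order_id": 1, "status": "shipped"}],
]
equality_deletes = {1}  # keys deleted from files written before the delete file

def read_current(data_files, equality_deletes):
    # Mask rows in the older file whose key appears in the delete set,
    # then append the newer file's rows to yield current values.
    current = [r for r in data_files[0] if r["order_id"] not in equality_deletes]
    current += data_files[1]
    return current

print(read_current(data_files, equality_deletes))
# -> [{'order_id': 2, 'status': 'pending'}, {'order_id': 1, 'status': 'shipped'}]
```

The extra filtering work on every read is the cost MoR pays for cheap writes; compaction eliminates it by folding the delete files back into rewritten data files.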

Summary
Upsert via MERGE INTO is the cornerstone DML operation for maintaining current-state Silver tables in the data lakehouse. By combining CDC events with Iceberg's MERGE INTO (implemented as either Copy-on-Write or Merge-on-Read depending on update frequency), data teams create Silver tables that continuously reflect the present state of source operational databases — enabling analysts to ask current-state questions of lakehouse data with the same confidence as querying the operational system directly, but with the scale, governance, and analytical richness of the Apache Iceberg lakehouse.