What Is Nessie Branching and Tagging?

Project Nessie implements a Git-inspired version control model for the Iceberg catalog — introducing branches (isolated parallel namespaces for experimentation and development) and tags (named snapshots of the catalog at a specific point in time) to the data lakehouse.

The core insight behind Nessie branching is that data teams face the same challenges that software teams faced before version control: how to develop and test changes to production data assets without risking data corruption, how to maintain reproducible experiments, and how to deploy changes safely. Git solved these problems for code; Nessie applies the same concepts to data tables.

Branching for Safe Experimentation

A typical Nessie branching workflow for safe data experimentation has four steps:

  1. Create a branch: run nessie branch experiment-branch main to create a new branch that starts from main's current state.
  2. Work on the branch: Configure Spark to write to experiment-branch and run the experimental transformation. All writes are isolated to the branch.
  3. Validate: Query the branch (with Dremio or Trino pointed at experiment-branch) to verify that the transformation produced correct results.
  4. Merge or discard: If the experiment succeeds, merge the branch into main. If it fails, delete the branch. No rollback is needed because main was never touched.
Figure 1: Nessie branching workflow, showing isolated experimentation, validation, and safe merge to production.
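
The isolation property behind the four steps above can be illustrated with a small toy model. This is a minimal sketch and not the Nessie API: the ToyCatalog class, its method names, the table name sales, and the snapshot labels are all hypothetical, and the merge is deliberately simplified to an overwrite with no conflict detection.

```python
from typing import Dict


class ToyCatalog:
    """Toy in-memory model of branch isolation; NOT the Nessie API."""

    def __init__(self) -> None:
        # Each branch maps table name -> snapshot id (both hypothetical labels).
        self.branches: Dict[str, Dict[str, str]] = {"main": {}}

    def create_branch(self, name: str, source: str = "main") -> None:
        if name in self.branches:
            raise ValueError(f"branch {name!r} already exists")
        # The new branch starts from a copy of the source branch's current state.
        self.branches[name] = dict(self.branches[source])

    def write(self, branch: str, table: str, snapshot_id: str) -> None:
        # A write only moves the table pointer on the named branch.
        self.branches[branch][table] = snapshot_id

    def merge(self, source: str, target: str = "main") -> None:
        # Simplification: the source's pointers overwrite the target's,
        # with no conflict detection.
        self.branches[target].update(self.branches[source])

    def delete_branch(self, name: str) -> None:
        if name == "main":
            raise ValueError("refusing to delete main")
        del self.branches[name]


catalog = ToyCatalog()
catalog.write("main", "sales", "snap-1")

# Steps 1-2: branch off main and write only to the branch.
catalog.create_branch("experiment-branch")
catalog.write("experiment-branch", "sales", "snap-2")
main_during_experiment = dict(catalog.branches["main"])  # copy of main's state mid-experiment

# Step 4: merge on success, then drop the branch.
catalog.merge("experiment-branch")
catalog.delete_branch("experiment-branch")
```

In this sketch main only changes at the merge call. Discarding a failed experiment would be the delete_branch call alone, which leaves main exactly as it was.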

Tagging for ML Reproducibility

ML experiment reproducibility requires that the exact dataset used for training can be reconstructed at any future point. Nessie tags provide this capability:

# Shell: tag the data state when training begins
nessie tag ml-experiment-2026-05-14 main

# Python: train the model against the tagged data state
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    # Other Nessie catalog settings (implementation class, URI, warehouse) are omitted here
    .config('spark.sql.catalog.nessie.ref', 'ml-experiment-2026-05-14')
    .getOrCreate()
)

# Six months later, reproduce exactly the same training dataset
# by pointing a new session at the same tag

This eliminates the 'training data drift' problem where re-running a training pipeline months later produces different results because the underlying Silver tables have been updated.
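
Why a tag avoids this drift can be shown with a toy model in which main keeps moving while the tag stays fixed. This is a minimal sketch and not the Nessie API: the ToyRefs class, the table name silver.features, and the snapshot labels are hypothetical.

```python
from types import MappingProxyType
from typing import Dict, Mapping


class ToyRefs:
    """Toy model of a moving branch and fixed tags; NOT the Nessie API."""

    def __init__(self) -> None:
        self.main: Dict[str, str] = {}  # table name -> snapshot id
        self.tags: Dict[str, Mapping[str, str]] = {}

    def commit_to_main(self, table: str, snapshot_id: str) -> None:
        self.main[table] = snapshot_id

    def create_tag(self, name: str) -> None:
        # Freeze a read-only copy of main's current table pointers.
        self.tags[name] = MappingProxyType(dict(self.main))

    def resolve(self, ref: str) -> Mapping[str, str]:
        return self.main if ref == "main" else self.tags[ref]


refs = ToyRefs()
refs.commit_to_main("silver.features", "snap-10")
refs.create_tag("ml-experiment-2026-05-14")
refs.commit_to_main("silver.features", "snap-27")  # the table keeps changing on main

# Resolving the tag still yields the state captured when training began.
training_view = refs.resolve("ml-experiment-2026-05-14")
```

Reading through the tag returns the pointer recorded at tagging time, while reading through main returns whatever the latest commit wrote.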

Figure 2: Nessie tags for ML reproducibility, with the exact data state preserved for future experiment reproduction.
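
Pinning a Spark session to a tag (or a branch) usually needs more than the ref setting alone. The sketch below collects the catalog settings into one helper so the same code can target main, experiment-branch, or a tag. The helper names nessie_catalog_conf and build_session are hypothetical, and the URI and warehouse values are placeholders that depend on the deployment. It also assumes the Iceberg and Nessie Spark jars are already on the classpath.

```python
from typing import Dict


def nessie_catalog_conf(
    ref: str, uri: str, warehouse: str, catalog: str = "nessie"
) -> Dict[str, str]:
    """Build Spark settings for an Iceberg catalog backed by Nessie at one ref."""
    prefix = f"spark.sql.catalog.{catalog}"
    return {
        prefix: "org.apache.iceberg.spark.SparkCatalog",
        f"{prefix}.catalog-impl": "org.apache.iceberg.nessie.NessieCatalog",
        f"{prefix}.uri": uri,              # Nessie server endpoint (placeholder)
        f"{prefix}.ref": ref,              # branch or tag the session reads and writes
        f"{prefix}.warehouse": warehouse,  # table storage location (placeholder)
    }


def build_session(conf: Dict[str, str]):
    """Create a SparkSession from the settings; requires PySpark to be installed."""
    # Imported lazily so nessie_catalog_conf can be used without PySpark.
    from pyspark.sql import SparkSession

    builder = SparkSession.builder
    for key, value in conf.items():
        builder = builder.config(key, value)
    return builder.getOrCreate()


# Settings for a session pinned to the training tag (placeholder endpoints).
training_conf = nessie_catalog_conf(
    ref="ml-experiment-2026-05-14",
    uri="http://localhost:19120/api/v1",
    warehouse="s3a://example-bucket/warehouse",
)
```

Calling build_session(training_conf) months later would open a session against the same tag, and swapping the ref argument for experiment-branch would target the experiment instead.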

Summary

Project Nessie's branching and tagging bring the software engineering discipline of version control to the data lakehouse. Branches enable isolated experimentation and schema migrations that leave production untouched until they are merged. Tags enable ML experiment reproducibility and point-in-time data audits. For data teams that have seen experimental transformations corrupt production data, or ML experiments that could not be reproduced, Nessie brings Git-like confidence to data engineering workflows on Apache Iceberg tables.