How does Nessie tagging work?

Nessie tags attach a named label to a specific point in the catalog's commit history, much like a Git tag. A tag records the exact state of all tables at that commit. For ML experiments, tagging the data state used for training lets the same dataset be reproduced months later by checking out the tag, which makes experiments reproducible at the data level.

Can all engines use Nessie branches?

Any engine with Nessie catalog support can. Engines connect to Nessie by specifying a branch or tag name in their catalog configuration (or per query). Spark, Trino, Dremio, and Flink can all be configured to read and write against a specific Nessie branch. This enables branch-isolated pipelines in which the same ETL job logic runs against both a development branch and the production main branch.
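As a rough sketch of that catalog-level configuration, the helper below builds the Spark settings for a Nessie-backed Iceberg catalog. The helper name nessie_spark_conf, the endpoint URL, and the warehouse path are placeholders invented for illustration. The property keys should be checked against the Iceberg and Nessie versions in use.

```python
def nessie_spark_conf(ref, uri, warehouse, catalog="nessie"):
    """Return Spark settings that bind an Iceberg catalog to one Nessie ref.

    Hypothetical helper for illustration; not part of any library.
    """
    prefix = f"spark.sql.catalog.{catalog}"
    return {
        # Iceberg's Spark catalog, backed by the Nessie catalog implementation
        prefix: "org.apache.iceberg.spark.SparkCatalog",
        f"{prefix}.catalog-impl": "org.apache.iceberg.nessie.NessieCatalog",
        f"{prefix}.uri": uri,              # Nessie REST endpoint (placeholder)
        f"{prefix}.warehouse": warehouse,  # table storage root (placeholder)
        f"{prefix}.ref": ref,              # branch or tag this session uses
    }


# Placeholder values; substitute your own endpoint and storage location.
URI = "http://localhost:19120/api/v1"
WAREHOUSE = "s3a://example-bucket/warehouse"

dev_conf = nessie_spark_conf("experiment-branch", URI, WAREHOUSE)
prod_conf = nessie_spark_conf("main", URI, WAREHOUSE)

# The same job logic runs against either branch; only the ref setting differs.
changed = {key for key in dev_conf if dev_conf[key] != prod_conf[key]}
print(changed)
```

Each key and value pair can then be passed to SparkSession.builder.config(key, value) before the session is created.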

Nessie Branching and Tagging: The Definitive Guide

What is Nessie branching?

Nessie branching is a catalog-level feature that creates an isolated line of commits over the table namespace. Creating a branch copies no table data, and new commits on the branch don't affect the main branch. Data engineers can run experimental transformations, test schema changes, or debug data quality issues on a branch without any impact on production queries running against main.

What Is Nessie Branching and Tagging?

Project Nessie implements a Git-inspired version control model for the Iceberg catalog — introducing branches (isolated parallel namespaces for experimentation and development) and tags (named snapshots of the catalog at a specific point in time) to the data lakehouse.

The core insight behind Nessie branching is that data teams face the same challenges that software teams faced before version control: how to develop and test changes to production data assets without risking data corruption, how to maintain reproducible experiments, and how to deploy changes safely. Git solved these problems for code; Nessie applies the same concepts to data tables.
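The branch and tag semantics can be illustrated with a toy in-memory model. This is a teaching sketch only: the ToyCatalog class and its methods are invented here and are not the Nessie API. It assumes a commit is an immutable mapping from table name to a metadata pointer, a branch is a movable pointer to a commit, and a tag is a fixed pointer.

```python
class ToyCatalog:
    """In-memory illustration of branches and tags; not the Nessie API."""

    def __init__(self):
        # Each commit is an immutable snapshot: table name -> metadata pointer
        self.commits = [{}]
        self.branches = {"main": 0}  # branch name -> index of its head commit
        self.tags = {}               # tag name -> fixed commit index

    def create_branch(self, name, source="main"):
        # A new branch copies only a pointer, never table data
        self.branches[name] = self.branches[source]

    def create_tag(self, name, source="main"):
        # A tag records the commit the source branch points to right now
        self.tags[name] = self.branches[source]

    def commit(self, branch, table, pointer):
        # Build a new snapshot from the branch head; advance only that branch
        snapshot = dict(self.commits[self.branches[branch]])
        snapshot[table] = pointer
        self.commits.append(snapshot)
        self.branches[branch] = len(self.commits) - 1

    def read(self, ref, table):
        # Resolve a branch or tag name to its commit, then look up the table
        index = self.branches[ref] if ref in self.branches else self.tags[ref]
        return self.commits[index].get(table)


catalog = ToyCatalog()
catalog.commit("main", "sales", "metadata-v1.json")
catalog.create_tag("ml-experiment-2026-05-14")
catalog.create_branch("experiment-branch")
catalog.commit("experiment-branch", "sales", "metadata-v2-experimental.json")
catalog.commit("main", "sales", "metadata-v3.json")

print(catalog.read("main", "sales"))                      # metadata-v3.json
print(catalog.read("experiment-branch", "sales"))         # metadata-v2-experimental.json
print(catalog.read("ml-experiment-2026-05-14", "sales"))  # metadata-v1.json
```

The experimental commit never becomes visible on main, and the tag keeps resolving to the first snapshot even after main moves on.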

Branching for Safe Experimentation

The Nessie branching workflow for safe data experimentation:

1. Create a branch: nessie branch experiment-branch main creates a new branch starting from main's current state.
2. Work on the branch: configure Spark to write to experiment-branch and run the experimental transformation. All writes are isolated to the branch.
3. Validate: query the branch (with Dremio or Trino pointing to experiment-branch) to verify that the transformation produced correct results.
4. Merge or discard: if the experiment succeeds, merge the branch into main. If it fails, delete the branch. No rollback is needed because main was never touched.
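The workflow above can be sketched as a small driver function. It assumes the Nessie Spark SQL extensions are enabled and the catalog is named nessie. The statement forms (CREATE BRANCH, USE REFERENCE, MERGE BRANCH, DROP BRANCH) are taken from those extensions and should be verified against the version in use. The function name run_experiment, the table nessie.db.orders, and the branch name are hypothetical. The underscore name experiment_branch is used to sidestep identifier quoting rules.

```python
def run_experiment(run_sql, transform_sql, validate,
                   branch="experiment_branch", catalog="nessie"):
    """Run a transformation on an isolated branch, then merge or discard it.

    run_sql:  callable that executes one SQL statement (e.g. spark.sql)
    validate: callable returning True when the branch data looks correct
    """
    # 1. Create the branch from main's current state
    run_sql(f"CREATE BRANCH IF NOT EXISTS {branch} IN {catalog} FROM main")
    # 2. Point the session at the branch and run the experimental write
    run_sql(f"USE REFERENCE {branch} IN {catalog}")
    run_sql(transform_sql)
    # 3. Validate while still reading from the branch
    merged = bool(validate())
    # 4. Merge on success; on failure drop the branch and leave main untouched
    run_sql(f"USE REFERENCE main IN {catalog}")
    if merged:
        run_sql(f"MERGE BRANCH {branch} INTO main IN {catalog}")
    else:
        run_sql(f"DROP BRANCH {branch} IN {catalog}")
    return merged


# Dry run: record the statements instead of sending them to Spark.
statements = []
merged = run_experiment(
    statements.append,
    "UPDATE nessie.db.orders SET status = 'archived' WHERE year < 2020",
    validate=lambda: False,  # pretend validation failed
)
for statement in statements:
    print(statement)
```

With a live session, passing spark.sql as run_sql would execute the same sequence. The dry run makes the order of operations easy to inspect first.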

Figure 1: Nessie branching workflow. Isolated experimentation, validation, and safe merge to production.

Tagging for ML Reproducibility

ML experiment reproducibility requires that the exact dataset used for training can be reconstructed at any future point. Nessie tags provide this capability:

# Shell: tag the data state when training begins
nessie tag ml-experiment-2026-05-14 main

# Python: train the model against the tagged data state.
# Other catalog settings (implementation, URI, warehouse) are omitted here.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .config("spark.sql.catalog.nessie.ref", "ml-experiment-2026-05-14")
    .getOrCreate()
)

# Six months later, reproduce exactly the same training dataset
# by pointing a new session at the same tag

This eliminates the 'training data drift' problem where re-running a training pipeline months later produces different results because the underlying Silver tables have been updated.
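A training job can also pin individual reads to the tag rather than the whole session. The sketch below assumes the table@ref identifier form described in Nessie's Spark integration, with the identifier wrapped in backticks. Treat the exact quoting as an assumption to verify. The helper table_at_ref, the silver namespace, and the training_features table are hypothetical names.

```python
def table_at_ref(catalog, namespace, table, ref):
    """Build a Spark SQL identifier that reads `table` as of a Nessie ref.

    Hypothetical helper; assumes the `table@ref` form in backticks.
    """
    if "`" in table or "`" in ref:
        raise ValueError("backticks are not allowed in table or ref names")
    return f"{catalog}.{namespace}.`{table}@{ref}`"


TAG = "ml-experiment-2026-05-14"
source = table_at_ref("nessie", "silver", "training_features", TAG)
query = f"SELECT * FROM {source}"
print(query)
```

Storing the tag name alongside the model's metadata lets a later run rebuild the same query without guessing which data state was used.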

Figure 2: Nessie tags for ML reproducibility. Exact data state preserved for future experiment reproduction.

Summary

Project Nessie's branching and tagging bring version control, a core software engineering discipline, to the data lakehouse. Branches enable isolated experimentation and low-risk schema migrations without production impact. Tags enable ML experiment reproducibility and point-in-time data audits. For data teams that have seen experimental transformations corrupt production data, or ML experiments that could not be reproduced, Nessie branching brings Git-like confidence to data engineering workflows on Apache Iceberg tables.
