Can Apache Iceberg serve as a feature store?

Yes. Apache Iceberg Gold tables work as an effective offline feature store: they provide versioned feature history (time travel), efficient point-in-time feature retrieval for training datasets (historical joins using event timestamps), governed access through RBAC, and discoverability through data catalog integration. For real-time online serving, Iceberg is typically paired with a low-latency serving layer (Redis, DynamoDB).

What is the difference between an offline and an online feature store?

An offline feature store provides historical feature values for model training: batch training jobs query features for any past time period. An online feature store provides current (latest) feature values for real-time inference, using low-latency key-value lookups to serve predictions on live traffic. Iceberg handles the offline store; low-latency databases (Redis, DynamoDB) handle the online store.

Feature Store: The Definitive Guide for Data Lakehouse ML

Q: What is a feature store?

A feature store is a centralized repository that stores pre-computed ML features — making them discoverable, reusable, versioned, and accessible for both offline training (historical feature values by time period) and online serving (current feature values for real-time inference). It prevents the common problem of data scientists duplicating feature computation across teams.

What Is a Feature Store?

A feature store is a data infrastructure component that manages the lifecycle of ML features — the engineered inputs to machine learning models. It serves two distinct use cases: the offline store provides historical feature values for model training (a data scientist needs the last 30 days of customer purchase frequency, computed daily, for each customer in the training dataset), and the online store provides current feature values for real-time model serving (a recommendation system needs the customer's current browsing history feature, updated in real time).
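To make the offline example concrete, here is a minimal sketch of the "purchase frequency over the last 30 days, computed daily" feature. The function names, the row layout (customer_id, feature_date, purchase_count_30d), and the sample purchases are assumptions made for illustration; a real pipeline would typically run this logic in Spark or dbt rather than in plain Python.

```python
from collections import defaultdict
from datetime import date, timedelta


def purchase_frequency_30d(purchases, as_of):
    """Count each customer's purchases in the 30 days ending on `as_of` (inclusive)."""
    window_start = as_of - timedelta(days=29)
    counts = defaultdict(int)
    for customer_id, purchase_date in purchases:
        if window_start <= purchase_date <= as_of:
            counts[customer_id] += 1
    return dict(counts)


def daily_feature_rows(purchases, first_day, last_day):
    """Emit one (customer_id, feature_date, purchase_count_30d) row per active customer per day."""
    rows = []
    day = first_day
    while day <= last_day:
        daily_counts = purchase_frequency_30d(purchases, day)
        for customer_id, count in sorted(daily_counts.items()):
            rows.append((customer_id, day, count))
        day += timedelta(days=1)
    return rows


# Hypothetical raw purchase events: (customer_id, purchase_date)
purchases = [
    ("c1", date(2024, 1, 1)),
    ("c1", date(2024, 1, 20)),
    ("c2", date(2024, 1, 25)),
    ("c1", date(2024, 2, 5)),
]

# Two daily runs; each run appends new rows and never rewrites earlier days.
rows = daily_feature_rows(purchases, date(2024, 2, 4), date(2024, 2, 5))
for row in rows:
    print(row)
```

Because each daily run only appends rows keyed by feature_date, the table accumulates the full history that the offline store needs, while the newest row per customer is what an online store would serve.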

Without a shared feature store, each data science team computes features independently — duplicating computation, creating inconsistencies between training and serving (the 'training-serving skew' problem), and making features invisible to other teams that could reuse them. The feature store centralizes feature computation once, serves it consistently, and makes features discoverable across the entire ML organization.

Apache Iceberg as an Offline Feature Store

Apache Iceberg Gold tables are an effective offline feature store because they provide:

Feature history: Append-only feature tables partitioned by date preserve complete historical feature values for point-in-time training dataset construction
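Point-in-time correctness: Because every feature row carries its own timestamp, training jobs can do 'as-of' retrieval, joining each training event to the feature values that were valid at that event's timestamp. This keeps future information from leaking into the training set. Iceberg time travel complements the join by pinning the query to a fixed table snapshot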
Versioning: Iceberg snapshots and Nessie tags mark the exact feature table version used for each model training run
Governance: Role-based access control (RBAC), enforced by the catalog or query engine, ensures only authorized teams can access sensitive feature data (e.g., customer financial features)
Discoverability: Feature tables registered in the data catalog with rich descriptions enable ML teams to find and reuse existing features
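The point-in-time retrieval described in the list above can be sketched as an as-of join. This is a pure-Python illustration under stated assumptions: the row layouts, function names, and sample values are hypothetical, and a production job would more likely express the same join in PySpark or SQL over the Iceberg table.

```python
from bisect import bisect_right
from collections import defaultdict
from datetime import datetime


def build_history(feature_rows):
    """Group (entity_id, feature_ts, value) rows into per-entity lists sorted by timestamp."""
    history = defaultdict(list)
    for entity_id, feature_ts, value in feature_rows:
        history[entity_id].append((feature_ts, value))
    for rows in history.values():
        rows.sort(key=lambda row: row[0])
    return history


def point_in_time_join(events, history):
    """For each (entity_id, event_ts, label), attach the newest feature value
    whose timestamp is at or before event_ts (None if no such value exists)."""
    joined = []
    for entity_id, event_ts, label in events:
        rows = history.get(entity_id, [])
        timestamps = [ts for ts, _ in rows]
        idx = bisect_right(timestamps, event_ts)  # number of rows with ts <= event_ts
        value = rows[idx - 1][1] if idx > 0 else None
        joined.append((entity_id, event_ts, value, label))
    return joined


# Hypothetical feature history: (customer_id, feature_ts, purchase_count_30d)
feature_rows = [
    ("c1", datetime(2024, 1, 1), 3),
    ("c1", datetime(2024, 1, 2), 4),
    ("c1", datetime(2024, 1, 3), 9),
    ("c2", datetime(2024, 1, 2), 1),
]

# Hypothetical training events: (customer_id, event_ts, label)
events = [
    ("c1", datetime(2024, 1, 2, 12), 1),
    ("c2", datetime(2024, 1, 1, 8), 0),
    ("c1", datetime(2024, 1, 3), 0),
]

training_set = point_in_time_join(events, build_history(feature_rows))
for row in training_set:
    print(row)
```

The first event picks up the January 2 value (4) rather than the later January 3 value (9), which is the leakage the as-of join exists to prevent. Whether a feature stamped exactly at the event time should count is a design choice; this sketch includes it.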

Figure 1: Apache Iceberg as an offline feature store (versioned, governed, point-in-time feature history).

Feature Store Architecture in the Lakehouse

The complete feature store architecture on the lakehouse:

Feature computation: Spark batch jobs or dbt models compute features from Silver Iceberg tables and write them to Gold feature tables (e.g., gold.customer_features, gold.product_features)
Offline store: Gold Iceberg feature tables — queried by MLflow-tracked training jobs using PyIceberg or PySpark for historical feature retrieval
Online store sync: A materialization job (Spark or Flink) reads the latest feature values from Iceberg and writes them to a low-latency online store (Redis, DynamoDB, Bigtable)
Serving: ML inference services query Redis/DynamoDB for current feature values in milliseconds during real-time serving
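The online store sync and serving steps above can be sketched as follows. This is an illustration only: InMemoryOnlineStore is a hypothetical stand-in for Redis or DynamoDB, and the key format, function names, and sample rows are assumptions rather than any particular product's API.

```python
from datetime import date


def latest_feature_values(feature_rows):
    """Reduce (entity_id, feature_date, features) history to the newest row per entity."""
    latest = {}
    for entity_id, feature_date, features in feature_rows:
        current = latest.get(entity_id)
        if current is None or feature_date > current[0]:
            latest[entity_id] = (feature_date, features)
    return latest


class InMemoryOnlineStore:
    """Hypothetical key-value store standing in for Redis/DynamoDB."""

    def __init__(self):
        self._data = {}

    def put(self, key, features):
        self._data[key] = dict(features)

    def get(self, key):
        return self._data.get(key)


def materialize(feature_rows, store, table="customer_features"):
    """Write each entity's latest features to the online store; return the number of keys written."""
    latest = latest_feature_values(feature_rows)
    for entity_id, (_, features) in latest.items():
        store.put(f"{table}:{entity_id}", features)
    return len(latest)


# Hypothetical rows as they might be read from gold.customer_features
feature_rows = [
    ("c1", date(2024, 2, 4), {"purchase_count_30d": 1}),
    ("c1", date(2024, 2, 5), {"purchase_count_30d": 2}),
    ("c2", date(2024, 2, 5), {"purchase_count_30d": 1}),
]

store = InMemoryOnlineStore()
written = materialize(feature_rows, store)

# At inference time the service does a single key lookup.
print(written, store.get("customer_features:c1"))
```

Keeping only the newest row per entity is what separates the online store from the offline one: the full history stays in Iceberg, and the serving layer holds just enough to answer a lookup by key.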

Figure 2: Feature store architecture (Iceberg offline store plus Redis online store for training and serving).

Summary

The feature store is the ML engineering complement to the data lakehouse: it provides organized, governed, versioned, and reusable ML features, and it eliminates the duplicated computation and training-serving skew that plague teams without one. For the offline feature store, Apache Iceberg Gold tables provide a solid foundation: versioned history, point-in-time queries, RBAC governance, and data catalog discoverability. These are the properties a production offline feature store requires, available without a proprietary feature store platform. Combined with a low-latency online store and MLflow for experiment tracking, an Iceberg-based feature store provides a complete, open ML platform on the lakehouse.
