What Is a Feature Store?

A feature store is a data infrastructure component that manages the lifecycle of ML features, the engineered inputs to machine learning models. It serves two distinct use cases. The offline store provides historical feature values for model training: for example, a data scientist needs the last 30 days of daily-computed purchase frequency for each customer in the training dataset. The online store provides current feature values for real-time model serving: for example, a recommendation system needs a customer's browsing-history feature, updated in real time.
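
As a rough illustration of the two access patterns, the sketch below computes a daily purchase-frequency feature, keeps the last 30 days of values as a stand-in for the offline store, and keeps only the newest day's values as a stand-in for the online store. The purchase data, the 7-day counting window, and all function names are hypothetical; a real deployment would use tables and a key-value service rather than in-memory structures.

```python
from datetime import date, timedelta

# Hypothetical raw events: (customer_id, purchase_date)
PURCHASES = [
    ("c1", date(2024, 1, 1)),
    ("c1", date(2024, 1, 20)),
    ("c1", date(2024, 2, 5)),
    ("c1", date(2024, 2, 8)),
    ("c2", date(2024, 1, 15)),
]


def purchase_frequency(purchases, customer_id, as_of, window_days=7):
    """Count a customer's purchases in the window ending on as_of (inclusive).

    The 7-day default is an assumption made for this example only.
    """
    window_start = as_of - timedelta(days=window_days - 1)
    return sum(
        1
        for cid, day in purchases
        if cid == customer_id and window_start <= day <= as_of
    )


def build_offline_rows(purchases, end, history_days=30):
    """One row per customer per day: the history a training job would read."""
    customers = sorted({cid for cid, _ in purchases})
    rows = []
    for offset in range(history_days - 1, -1, -1):
        day = end - timedelta(days=offset)
        for cid in customers:
            rows.append((cid, day, purchase_frequency(purchases, cid, day)))
    return rows


TODAY = date(2024, 2, 10)  # hypothetical "current" date

# Offline view: 30 days of daily feature values per customer.
offline = build_offline_rows(PURCHASES, TODAY)

# Online view: only the newest value per customer, keyed for fast lookup.
online = {cid: value for cid, day, value in offline if day == TODAY}
```

Both views come from the same feature definition, so a model trained on `offline` and served from `online` sees values computed the same way.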

Without a shared feature store, each data science team computes features independently. This duplicates computation, creates inconsistencies between training and serving (the 'training-serving skew' problem), and leaves features invisible to other teams that could reuse them. A feature store computes each feature once, serves it consistently, and makes features discoverable across the entire ML organization.

Apache Iceberg as an Offline Feature Store

Apache Iceberg Gold tables are an effective offline feature store because they provide:

  • Feature history: Append-only feature tables partitioned by date preserve complete historical feature values for point-in-time training dataset construction
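  • Point-in-time correctness: Iceberg time travel reads a table as it existed at a given snapshot or timestamp, and an 'as-of' join on the feature date retrieves the values that were valid at each training event's timestamp, preventing future information from leaking into the training data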
  • Versioning: Iceberg snapshots + Nessie tags mark the exact feature table version used for each model training run
  • Governance: RBAC, enforced by the catalog or query engine, ensures only authorized teams can access sensitive feature data (e.g., customer financial features)
  • Discoverability: Feature tables registered in the data catalog with rich descriptions enable ML teams to find and reuse existing features
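
Point-in-time retrieval is commonly implemented as an as-of join: for each training event, pick the most recent feature row dated on or before the event. The sketch below shows that logic in plain Python on hypothetical data. In practice the same join would more likely run in Spark or SQL against the date-partitioned feature table, and every name here is illustrative only.

```python
from bisect import bisect_right
from collections import defaultdict
from datetime import date

# Hypothetical feature history: (customer_id, feature_date, purchase_freq)
FEATURE_ROWS = [
    ("c1", date(2024, 1, 1), 3),
    ("c1", date(2024, 1, 2), 4),
    ("c1", date(2024, 1, 5), 7),
    ("c2", date(2024, 1, 3), 1),
]

# Hypothetical labeled training events: (customer_id, event_date, label)
EVENTS = [
    ("c1", date(2024, 1, 4), 1),
    ("c1", date(2024, 1, 1), 0),
    ("c2", date(2024, 1, 2), 0),
]


def index_by_entity(rows):
    """Group feature rows per entity, sorted by date, for binary search."""
    grouped = defaultdict(list)
    for entity, day, value in rows:
        grouped[entity].append((day, value))
    index = {}
    for entity, pairs in grouped.items():
        pairs.sort()
        index[entity] = ([d for d, _ in pairs], [v for _, v in pairs])
    return index


def as_of_join(events, rows):
    """Attach to each event the latest feature value dated on or before it."""
    index = index_by_entity(rows)
    joined = []
    for entity, event_day, label in events:
        days, values = index.get(entity, ([], []))
        pos = bisect_right(days, event_day)  # number of rows with day <= event_day
        feature = values[pos - 1] if pos > 0 else None  # None: no history yet
        joined.append((entity, event_day, feature, label))
    return joined


training_set = as_of_join(EVENTS, FEATURE_ROWS)
```

The event for `c1` on 2024-01-04 receives the value from 2024-01-02, not the later value from 2024-01-05. Whether a same-day feature row is safe to use depends on when the daily job runs; a stricter variant would use `bisect_left` to require a strictly earlier date.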
Figure 1: Apache Iceberg as offline feature store — versioned, governed, point-in-time feature history.
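
To make the versioning idea concrete, here is a toy model of an append-only table with snapshots and named tags. It is not the Iceberg or Nessie API; the class, its methods, and the tag name are invented for illustration. It only shows why pinning a training run to a tagged snapshot gives reproducible reads even after later appends.

```python
class ToyVersionedTable:
    """In-memory stand-in for an append-only table with snapshots and tags."""

    def __init__(self):
        self._rows = []        # all rows ever appended, in commit order
        self._snapshots = []   # snapshot i covers self._rows[:self._snapshots[i]]
        self._tags = {}        # tag name -> snapshot id

    def append(self, rows):
        """Commit a batch of rows and return the new snapshot id."""
        self._rows.extend(rows)
        self._snapshots.append(len(self._rows))
        return len(self._snapshots) - 1

    def tag(self, name, snapshot_id):
        """Give a snapshot a stable name, e.g. one per training run."""
        if not 0 <= snapshot_id < len(self._snapshots):
            raise ValueError(f"unknown snapshot id: {snapshot_id}")
        self._tags[name] = snapshot_id

    def read(self, tag=None):
        """Read the table as of a tag, or the latest state if no tag is given."""
        if tag is None:
            return list(self._rows)
        return list(self._rows[: self._snapshots[self._tags[tag]]])


table = ToyVersionedTable()
snap = table.append([("c1", "2024-01-01", 3), ("c2", "2024-01-01", 1)])
table.tag("churn-model-run-1", snap)     # hypothetical training-run tag
table.append([("c1", "2024-01-02", 4)])  # later data does not change the tag

rows_for_run_1 = table.read("churn-model-run-1")
rows_latest = table.read()
```

Recording the tag name alongside the trained model is what lets a later audit or retraining job read exactly the same feature rows.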

Feature Store Architecture in the Lakehouse

The complete feature store architecture on the lakehouse:

  1. Feature computation: Spark batch jobs or dbt models compute features from Silver Iceberg tables and write them to Gold feature tables (e.g., gold.customer_features, gold.product_features)
  2. Offline store: Gold Iceberg feature tables — queried by MLflow-tracked training jobs using PyIceberg or PySpark for historical feature retrieval
  3. Online store sync: A materialization job (Spark or Flink) reads the latest feature values from Iceberg and writes them to a low-latency online store (Redis, DynamoDB, Bigtable)
  4. Serving: ML inference services query Redis/DynamoDB for current feature values in milliseconds during real-time serving
Figure 2: Feature store architecture — Iceberg offline store + Redis online store for training and serving.
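
The online sync in step 3 reduces to "latest value per entity, written under a lookup key". The sketch below shows that reduction with a plain dict standing in for Redis or DynamoDB. The key layout, the row shape, and the function names are assumptions; a real job would read from Iceberg with Spark or Flink and write through the online store's own client.

```python
from datetime import date

# Hypothetical rows from gold.customer_features:
# (customer_id, feature_date, {feature_name: value})
GOLD_ROWS = [
    ("c1", date(2024, 1, 1), {"purchase_freq": 3}),
    ("c1", date(2024, 1, 2), {"purchase_freq": 4}),
    ("c2", date(2024, 1, 1), {"purchase_freq": 1}),
]


def latest_per_entity(rows):
    """Reduce feature history to the newest row for each entity."""
    latest = {}
    for entity, day, features in rows:
        if entity not in latest or day > latest[entity][0]:
            latest[entity] = (day, features)
    return latest


def materialize(rows, online_store, table="customer_features"):
    """Write the newest features per entity into a key-value store (a dict here)."""
    for entity, (day, features) in latest_per_entity(rows).items():
        key = f"{table}:{entity}"  # assumed key layout
        online_store[key] = {**features, "_feature_date": day.isoformat()}
    return online_store


def get_online_features(online_store, entity, table="customer_features"):
    """Serving-side lookup: a single key read, no scan over history."""
    return online_store.get(f"{table}:{entity}")


store = materialize(GOLD_ROWS, {})
c1_features = get_online_features(store, "c1")
```

Storing the feature date next to the values lets the serving path detect stale entries if the materialization job falls behind.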

Summary

The feature store is the ML engineering complement to the data lakehouse: it provides organized, governed, versioned, and reusable ML features, eliminating the duplicated computation and training-serving skew that arise when teams build features in isolation. For the offline feature store, Apache Iceberg Gold tables provide a solid foundation: versioned history, point-in-time queries, RBAC governance, and data catalog discoverability. These are the core properties a production offline feature store requires, without needing a proprietary feature store platform. Combined with a real-time online store and MLflow for experiment tracking, Iceberg-based feature stores provide a complete, open ML platform on the lakehouse.