The Lakehouse as a Data Science Platform

The data lakehouse solves the most frustrating problem in data science: access to good data. Traditional data science workflows were bottlenecked on data engineering: data scientists submitted tickets to get data extracted from warehouses, waited days, received stale dumps in CSV files, and then realized they needed different columns and started over. The lakehouse eliminates this bottleneck by giving data scientists direct, governed, self-service access to curated Gold and Silver Iceberg tables through Python interfaces they already know.

Python Access to Iceberg Data

Data scientists access Iceberg data through several Python interfaces:

PyIceberg: Direct Iceberg Access

from pyiceberg.catalog import load_catalog
# Connect to the Polaris REST catalog and load a curated Gold table
catalog = load_catalog('polaris', uri='https://polaris.example.com/api/catalog')
table = catalog.load_table('gold.customer_features')
# The row filter is pushed down to Iceberg, so only matching files are read
df = table.scan(row_filter="region = 'US-WEST'").to_arrow().to_pandas()

Dremio + Arrow Flight SQL

from pyarrow import flight
client = flight.FlightClient('grpc+tls://dremio.example.com:32010')
# Submit the query, then stream the Arrow result set back over Flight
descriptor = flight.FlightDescriptor.for_command("SELECT * FROM gold.customer_features")
info = client.get_flight_info(descriptor)
df = client.do_get(info.endpoints[0].ticket).read_all().to_pandas()
# Sub-second for Reflection-accelerated queries

PySpark + Iceberg for Large-Scale Feature Engineering

from pyspark.sql import functions as F
# Assumes an active SparkSession bound to `spark`
df = spark.table('catalog.gold.customer_features')
features = df.groupBy('customer_id').agg(F.sum('ltv'), F.count('orders'))
Figure 1: Python data science access to lakehouse data — PyIceberg, Arrow Flight SQL, and PySpark.

Feature Engineering on the Lakehouse

The lakehouse is the ideal platform for ML feature engineering:

  • Historical features: Time travel queries recreate exact historical feature values — essential for training models on point-in-time correct features (no label leakage)
  • Fresh features: Streaming-ingested Iceberg tables provide near-real-time feature values for online inference
  • Feature sharing: Gold Iceberg feature tables serve as a shared feature store — data scientists publish computed features as Iceberg tables that other teams can reuse
  • Version control: Nessie tags snapshot the exact data state used for each training run — enabling perfect ML reproducibility
Figure 2: ML feature engineering on Iceberg — historical, fresh, shared, and version-controlled features.
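The point-in-time correctness guarantee in the first bullet can be sketched in plain Python. Given a per-customer feature history (as successive Iceberg snapshots would expose it), training-set construction must look up the latest feature value effective at or before each label's timestamp, never a later one. The `feature_history` data and the function name here are illustrative, not part of any library:

```python
from bisect import bisect_right

# Illustrative feature history, as Iceberg snapshots might expose it:
# customer_id -> [(effective_ts, ltv), ...] sorted by effective_ts
feature_history = {
    "c1": [(100, 10.0), (200, 25.0), (300, 40.0)],
}

def point_in_time_ltv(customer_id, label_ts):
    """Latest ltv effective at or before label_ts (no label leakage)."""
    history = feature_history[customer_id]
    timestamps = [ts for ts, _ in history]
    idx = bisect_right(timestamps, label_ts) - 1
    return history[idx][1] if idx >= 0 else None
```

A label observed at ts=250 sees the ltv written at ts=200, never the later value from ts=300. In production the same guarantee comes from querying the Iceberg snapshot (or Nessie tag) that was current at the label's timestamp rather than the table's latest state.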

Summary

The data lakehouse is transforming data science from a bottlenecked, ticket-driven, CSV-file workflow into a governed, self-service, feature-rich analytical platform. With PyIceberg for direct Python access, Dremio Arrow Flight SQL for high-throughput query results, PySpark for large-scale feature engineering, Nessie tagging for experiment reproducibility, and the AI Semantic Layer for autonomous agent data access, the lakehouse provides everything data scientists need — on trusted, governed, current data — without waiting for data engineering support on every feature extraction task.