What Is Lakehouse Federation?

Lakehouse federation is the ability of a query engine to execute SQL queries that seamlessly span multiple heterogeneous data sources simultaneously — Apache Iceberg tables in object storage alongside PostgreSQL tables, MySQL databases, MongoDB collections, REST APIs, and other sources — joining and aggregating across all of them in a single SQL query without any data movement or ETL preprocessing.

Federation eliminates the data lake anti-pattern of 'copy everything into the lake before you can analyze it.' Instead of running an ETL pipeline to copy operational PostgreSQL data into Iceberg tables (introducing latency, storage duplication, and pipeline maintenance), a federated query engine reads PostgreSQL directly at query time — always using the most current operational data, joined in real-time with historical Iceberg analytics data.

How Federation Works

Federation engines use a connector (or source) model: each external data source type has a connector that translates SQL operations into the source's native query protocol:

  • A PostgreSQL connector translates SQL predicates into PostgreSQL queries via JDBC, returning results as rows
  • A MongoDB connector translates SQL filters into MongoDB queries, returning documents as rows
  • An Iceberg connector reads Parquet files from S3 directly via the Iceberg metadata API

Predicate pushdown is critical for federation performance: the engine pushes WHERE clause predicates into each source's native query — so PostgreSQL returns only matching rows (not the entire table), and Iceberg's metadata eliminates irrelevant files. Without predicate pushdown, federation would transfer entire tables over the network for every query.

Lakehouse Federation Architecture diagram
Figure 1: Lakehouse federation — single SQL query spanning Iceberg tables and external databases simultaneously.

Federation in Dremio

Dremio is architecturally designed for federation. Every data source in Dremio — whether an Iceberg table in its Open Catalog or a PostgreSQL database registered as a source — appears as a table in Dremio's unified namespace. Analysts write standard SQL joining any combination of sources without knowing where each table lives physically.

Dremio's federation capabilities include: PostgreSQL, MySQL, Oracle, SQL Server, MongoDB, Elasticsearch, S3 files (Parquet, CSV, JSON), and REST API sources. Reflections can be created on federated Virtual Datasets that join Iceberg with external sources — pre-computing the federated join so subsequent queries hit the Reflection rather than re-executing the cross-source join at query time.

Federation Tradeoffs

Federation is powerful but not always the right pattern:

  • Use federation when: operational data must be current (no ETL lag acceptable), data volume from external sources is small, or the join produces a small result set that doesn't require moving large volumes
  • Use ETL instead when: external source data is large and queries filter a small fraction (ELT to Iceberg is more efficient), the external source has performance constraints that federation queries would overwhelm, or data must be enriched and cleaned before joining with analytical data
Federation vs ETL Decision Framework diagram
Figure 2: Federation vs ETL — when to query in place vs when to copy data into the lakehouse.

Summary

Lakehouse federation is the capability that makes the data lakehouse a unified analytical platform rather than just another data silo. By enabling real-time SQL across Iceberg tables and operational databases, federation eliminates the ETL copy lag for operational analytics and allows analysts to work with current data alongside historical lakehouse data in a single query. Dremio's federation model — combining multi-source connectivity with Reflection acceleration for federated VDSs — is the most complete federation implementation in the open lakehouse ecosystem.