Which engines support lakehouse federation?

Dremio and Trino are the leading engines for lakehouse federation. Dremio provides federation through its source connector model (PostgreSQL, MySQL, MongoDB, Elasticsearch, REST API sources) alongside Iceberg table access. Trino's connector architecture supports dozens of federated sources. Apache Spark can federate via JDBC sources, though with less optimization than dedicated federation engines.

What is the difference between federation and ETL?

ETL copies data from source systems into the lakehouse, creating a latency gap and storage duplication. Federation queries source systems directly at query time — always reading the most current data without any copy lag or storage overhead. Federation is ideal for operational data that changes frequently and must be joined with historical lakehouse data.

Lakehouse Federation: The Definitive Guide

Q: What is lakehouse federation?

Lakehouse federation is the capability to execute SQL queries that span Apache Iceberg tables (in object storage) and external data sources (PostgreSQL, MySQL, MongoDB, Salesforce, etc.) simultaneously — joining, aggregating, and filtering across all sources in a single query without ETL or data movement.

What Is Lakehouse Federation?

Lakehouse federation is the ability of a query engine to execute SQL queries that seamlessly span multiple heterogeneous data sources simultaneously — Apache Iceberg tables in object storage alongside PostgreSQL tables, MySQL databases, MongoDB collections, REST APIs, and other sources — joining and aggregating across all of them in a single SQL query without any data movement or ETL preprocessing.

Federation eliminates the data lake anti-pattern of 'copy everything into the lake before you can analyze it.' Instead of running an ETL pipeline to copy operational PostgreSQL data into Iceberg tables (introducing latency, storage duplication, and pipeline maintenance), a federated query engine reads PostgreSQL directly at query time — always using the most current operational data, joined in real-time with historical Iceberg analytics data.

How Federation Works

Federation engines use a connector (or source) model: each external data source type has a connector that translates SQL operations into the source's native query protocol:

A PostgreSQL connector translates SQL predicates into PostgreSQL queries via JDBC, returning results as rows
A MongoDB connector translates SQL filters into MongoDB queries, returning documents as rows
An Iceberg connector reads Parquet files from S3 directly via the Iceberg metadata API

Predicate pushdown is critical for federation performance: the engine pushes WHERE clause predicates into each source's native query — so PostgreSQL returns only matching rows (not the entire table), and Iceberg's metadata eliminates irrelevant files. Without predicate pushdown, federation would transfer entire tables over the network for every query.

Lakehouse Federation Architecture diagram — Figure 1: Lakehouse federation — single SQL query spanning Iceberg tables and external databases simultaneously.

Federation in Dremio

Dremio is architecturally designed for federation. Every data source in Dremio — whether an Iceberg table in its Open Catalog or a PostgreSQL database registered as a source — appears as a table in Dremio's unified namespace. Analysts write standard SQL joining any combination of sources without knowing where each table lives physically.

Dremio's federation capabilities include: PostgreSQL, MySQL, Oracle, SQL Server, MongoDB, Elasticsearch, S3 files (Parquet, CSV, JSON), and REST API sources. Reflections can be created on federated Virtual Datasets that join Iceberg with external sources — pre-computing the federated join so subsequent queries hit the Reflection rather than re-executing the cross-source join at query time.

Federation Tradeoffs

Federation is powerful but not always the right pattern:

Use federation when: operational data must be current (no ETL lag acceptable), data volume from external sources is small, or the join produces a small result set that doesn't require moving large volumes
Use ETL instead when: external source data is large and queries filter a small fraction (ELT to Iceberg is more efficient), the external source has performance constraints that federation queries would overwhelm, or data must be enriched and cleaned before joining with analytical data

Federation vs ETL Decision Framework diagram — Figure 2: Federation vs ETL — when to query in place vs when to copy data into the lakehouse.

Summary

Lakehouse federation is the capability that makes the data lakehouse a unified analytical platform rather than just another data silo. By enabling real-time SQL across Iceberg tables and operational databases, federation eliminates the ETL copy lag for operational analytics and allows analysts to work with current data alongside historical lakehouse data in a single query. Dremio's federation model — combining multi-source connectivity with Reflection acceleration for federated VDSs — is the most complete federation implementation in the open lakehouse ecosystem.

What Is Lakehouse Federation?

How Federation Works

Federation in Dremio

Federation Tradeoffs

Summary

Related Concepts

Go Deeper — Recommended Resources