Data Lakehouse Knowledge Base

The definitive reference for modern data lakehouse architecture — 98 expert-written guides on Apache Iceberg, Dremio, open table formats, catalogs, governance, and Agentic AI.

98 Definitive Guides
11 Topic Categories
4K+ Words per Guide
100% Open Access

Analytics & BI

Architecture Patterns

Agentic Lakehouse: Learn what the Agentic Lakehouse is, how AI agents autonomously discover and query Apache Iceberg da…
Bronze Layer: Learn what the Bronze Layer is in the Medallion Architecture, how it stores raw ingested data in Apa…
Data Engineering: Learn what data engineering is, the key skills and tools data engineers use to build Apache Iceberg …
Data Science on the Lakehouse: Learn how data scientists use Apache Iceberg tables, PyIceberg, and Dremio's AI Semantic Layer for M…
ETL (Extract, Transform, Load): Learn what ETL is, how it differs from ELT in the data lakehouse, and how Apache Spark and Flink imp…
Feature Store: Learn what a feature store is, how Apache Iceberg tables serve as open feature stores, and how featu…
Gold Layer: Learn what the Gold Layer is in the Medallion Architecture, how it delivers pre-aggregated business …
Lakehouse Architecture: Learn what the data lakehouse architecture is, its key layers and components, and how Apache Iceberg…
Medallion Architecture: Learn what the Medallion Architecture is, how Bronze, Silver, and Gold layers organize data lakehous…
MLflow: Learn what MLflow is, how it tracks ML experiments on Apache Iceberg features, and why it is the sta…
MCP (Model Context Protocol): Learn what the Model Context Protocol (MCP) is, how Dremio's MCP server enables AI agents to autonom…
Nessie Branching and Tagging: Learn how Project Nessie's Git-like branching and tagging enables isolated data experimentation, rep…
Open Lakehouse: Learn what the Open Lakehouse is, how open standards (Apache Iceberg, Parquet, Iceberg REST Catalog)…
Silver Layer: Learn what the Silver Layer is in the Medallion Architecture, how it cleanses and conforms Bronze da…

Catalogs & Metadata

Core Concepts

File Formats & Storage

Governance

Governance & Quality

Ingestion

Ingestion & Streaming

Query Engines & Platforms

Apache Flink: Learn what Apache Flink is, how it enables real-time streaming ingestion into Apache Iceberg tables,…
Apache Spark: Learn what Apache Spark is, how it powers Iceberg ETL and ML workloads, and when to use Spark vs Dre…
Autonomous Reflections: Learn how Dremio Autonomous Reflections automatically analyze query patterns and create, update, and…
Column Pruning: Learn what column pruning is, how it works with Apache Parquet's columnar storage, and why it dramat…
Dremio Cloud: Learn what Dremio Cloud is, how its serverless lakehouse platform works on AWS and Azure, and how it…
Dremio Open Catalog: Learn what Dremio Open Catalog is, how it implements the Iceberg REST Catalog spec with Git-like Nes…
Dremio Intelligent Query Engine: Learn how Dremio's Intelligent Query Engine uses Apache Arrow vectorized execution, Reflection accel…
Dremio Reflections: Learn what Dremio Reflections are, how raw and aggregation Reflections work, and how they deliver su…
Dremio: Learn what Dremio is, how its intelligent query engine works, and why it is the leading data lakehou…
Physical Datasets (Dremio): Learn what Physical Datasets are in Dremio, how they register data sources including Apache Iceberg …
Predicate Pushdown: Learn what predicate pushdown is, how it works in Apache Iceberg and Parquet, and why it is one of t…
Presto: Learn what Presto is, how it compares to Trino, and its role in distributed SQL analytics across Apa…
Trino: Learn what Trino is, how its federated SQL query engine works across Apache Iceberg and other data s…
Vectorized Query Execution: Learn what vectorized query execution is, how it uses Apache Arrow and SIMD instructions to accelera…
Virtual Datasets (Dremio): Learn what Dremio Virtual Datasets are, how they create a semantic layer above raw Iceberg data, and…

Table Formats

Apache Hudi: Learn what Apache Hudi is, how its incremental processing model works, how it compares to Apache Ice…
Apache Iceberg: Learn what Apache Iceberg is, how its metadata architecture works, and why it is the industry-standa…
Compaction: Learn what compaction is in Apache Iceberg, why it is essential for lakehouse performance, how Copy-…
Copy-on-Write (CoW): Learn what Copy-on-Write means in Apache Iceberg, when to use CoW vs Merge-on-Read, and how CoW upda…
Delta Lake: Learn what Delta Lake is, how it works, how it compares to Apache Iceberg, and when to choose Delta …
Hidden Partitioning: Learn how Apache Iceberg hidden partitioning works, why it eliminates the need to write partition-aw…
Iceberg Manifest Files: Learn what Apache Iceberg manifest files are, how they store file-level statistics for data skipping…
Iceberg REST Catalog: Learn what the Apache Iceberg REST Catalog specification is, how it enables multi-engine catalog int…
Iceberg Snapshots: Learn what Apache Iceberg snapshots are, how they enable ACID transactions and time travel, and how …
Merge-on-Read (MoR): Learn what Merge-on-Read means in Apache Iceberg, how delete files work, when to use MoR vs Copy-on-…
Partition Evolution: Learn how Apache Iceberg partition evolution works, why it solves the static partitioning problem, a…
Row-Level Deletes: Learn how row-level deletes work in Apache Iceberg V2, the difference between positional and equalit…
Schema Evolution: Learn how Apache Iceberg schema evolution works, what changes are safe vs breaking, and how to evolv…
Time Travel: Learn how Apache Iceberg time travel works, how to query historical snapshots by timestamp or snapsh…
Z-Ordering (Data Sorting): Learn what Z-Ordering is in Apache Iceberg, how it clusters data to improve data skipping, and how t…