Data Lakehouse Knowledge Base — 98 Definitive Guides

Analytics & BI

AI Semantic Layer (Dremio): Learn what Dremio's AI Semantic Layer is, how it enables natural language data access for AI agents …
Automated Table Optimization: Learn what automated table optimization is, how Dremio and Apache Iceberg automate compaction, Z-ord…
Data Skipping: Learn what data skipping is in Apache Iceberg, how file-level statistics enable skipping irrelevant …
Lakehouse Federation: Learn what lakehouse federation is, how engines like Dremio and Trino query across Iceberg tables an…
Multi-Engine Interoperability: Learn what multi-engine interoperability means in the data lakehouse, how Apache Iceberg and the RES…
Query Optimization: Learn what query optimization is, the key optimization techniques used in Apache Iceberg and Dremio,…
Self-Service Analytics: Learn what self-service analytics is, how the data lakehouse enables it at scale, and why the semant…
Semantic Layer: Learn what a semantic layer is, how it translates raw data into business-friendly metrics, and why i…
SQL Analytics: Learn how SQL analytics powers the data lakehouse, the key SQL patterns for Iceberg queries, and how…

Architecture Patterns

Catalogs & Metadata

Core Concepts

ACID Transactions: Learn what ACID transactions are, why they matter in the data lakehouse, and how Apache Iceberg impl…
Data Lake: Learn what a data lake is, how it works, its strengths and weaknesses, and how it evolved into the m…
Data Lakehouse: Learn what a data lakehouse is, how it works, why it replaced the data warehouse for modern analytic…
Data Mesh: Learn what Data Mesh is, its four core principles, how it differs from centralized data platforms, a…
Data Warehouse: Learn what a data warehouse is, how it works, its architecture patterns, key limitations, and how it…
Decoupled Storage and Compute: Learn what decoupled storage and compute means in the data lakehouse, how Apache Iceberg on object s…
ELT (Extract, Load, Transform): Learn what ELT is, how Extract Load Transform differs from ETL, why ELT is the dominant pattern in t…
Open Table Format: Learn what an open table format is, how Apache Iceberg, Delta Lake, and Apache Hudi compare, and why…

File Formats & Storage

Governance

Data Governance: Learn what data governance means in the data lakehouse, how access control, lineage, quality, and ca…

Governance & Quality

Ingestion

Apache Kafka: Learn how Apache Kafka enables streaming data ingestion into the lakehouse, its role in CDC pipeline…
Change Data Capture (CDC): Learn what Change Data Capture is, how Debezium and Apache Flink stream CDC events into Apache Icebe…

Ingestion & Streaming

Batch Processing: Learn what batch processing is, how Apache Spark handles large-scale batch ETL into Apache Iceberg, …
Data Ingestion: Learn what data ingestion is, the key ingestion patterns for Apache Iceberg lakehouses, and how to c…
dbt (Data Build Tool): Learn what dbt is, how it implements SQL-based ELT transformations on Apache Iceberg tables, and why…
Real-Time Analytics: Learn what real-time analytics means in the data lakehouse, how streaming ingestion and query engine…
Stream Processing: Learn what stream processing is, how Apache Flink enables real-time stream processing into Apache Ic…
Upsert: Learn what upsert is, how Apache Iceberg's MERGE INTO implements upsert semantics, and why upsert is…

Query Engines & Platforms

Table Formats

Apache Hudi: Learn what Apache Hudi is, how its incremental processing model works, how it compares to Apache Ice…
Apache Iceberg: Learn what Apache Iceberg is, how its metadata architecture works, and why it is the industry-standa…
Compaction: Learn what compaction is in Apache Iceberg, why it is essential for lakehouse performance, how Copy-…
Copy-on-Write (CoW): Learn what Copy-on-Write means in Apache Iceberg, when to use CoW vs Merge-on-Read, and how CoW upda…
Delta Lake: Learn what Delta Lake is, how it works, how it compares to Apache Iceberg, and when to choose Delta …
Hidden Partitioning: Learn how Apache Iceberg hidden partitioning works, why it eliminates the need to write partition-aw…
Iceberg Manifest Files: Learn what Apache Iceberg manifest files are, how they store file-level statistics for data skipping…
Iceberg REST Catalog: Learn what the Apache Iceberg REST Catalog specification is, how it enables multi-engine catalog int…
Iceberg Snapshots: Learn what Apache Iceberg snapshots are, how they enable ACID transactions and time travel, and how …
Merge-on-Read (MoR): Learn what Merge-on-Read means in Apache Iceberg, how delete files work, when to use MoR vs Copy-on-…
Partition Evolution: Learn how Apache Iceberg partition evolution works, why it solves the static partitioning problem, a…
Row-Level Deletes: Learn how row-level deletes work in Apache Iceberg V2, the difference between positional and equalit…
Schema Evolution: Learn how Apache Iceberg schema evolution works, what changes are safe vs breaking, and how to evolv…
Time Travel: Learn how Apache Iceberg time travel works, how to query historical snapshots by timestamp or snapsh…
Z-Ordering (Data Sorting): Learn what Z-Ordering is in Apache Iceberg, how it clusters data to improve data skipping, and how t…