What Is Data Ingestion?

Data ingestion is the process of acquiring data from source systems and loading it into the data lakehouse — specifically into Bronze Apache Iceberg tables that serve as the raw data landing zone of the Medallion Architecture. Ingestion is the first stage of the lakehouse data lifecycle: data cannot be transformed, governed, or analyzed until it has been ingested.

Ingestion decisions have cascading effects: the ingestion pattern chosen determines data freshness (seconds vs hours), pipeline complexity (managed service vs custom Flink jobs), cost (compute for streaming vs cheaper batch schedules), and recovery characteristics (can the pipeline be replayed if a failure occurs?). Getting ingestion architecture right is foundational to lakehouse success.

Ingestion Patterns Compared

| Pattern | Freshness | Complexity | Use Case |
| --- | --- | --- | --- |
| Streaming CDC (Flink) | Seconds | High | Operational databases needing real-time freshness |
| Batch ETL (Spark) | Hours–Days | Medium | High-volume historical loads, complex transformations |
| File landing | Hours | Low | Partner data drops, SaaS CSV exports |
| Managed ELT (Airbyte) | Minutes–Hours | Low | SaaS sources (Salesforce, HubSpot, Google Ads) |
| Federation (Dremio) | Real-time | None | Small, frequently changing operational data |
Figure 1: Ingestion pattern comparison — freshness, complexity, and use case fit for each pattern.
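The trade-offs in the table above can be sketched as a small decision helper. This is an illustrative sketch only: the `Source` fields, the 60-second freshness threshold, and the category names are assumptions chosen to mirror the table, not fixed rules.

```python
from dataclasses import dataclass

@dataclass
class Source:
    kind: str                         # "operational_db", "saas_api", or "file_drop" (assumed categories)
    freshness_seconds: int            # required data freshness for this source
    small_and_volatile: bool = False  # small table that changes constantly

def recommend_pattern(src: Source) -> str:
    """Map a source's characteristics to an ingestion pattern from the table above."""
    if src.small_and_volatile:
        return "Federation (Dremio)"          # skip ingestion; query in place
    if src.kind == "operational_db" and src.freshness_seconds < 60:
        return "Streaming CDC (Flink)"        # second-level freshness requires CDC
    if src.kind == "saas_api":
        return "Managed ELT (Airbyte)"        # pre-built connectors beat custom pollers
    if src.kind == "file_drop":
        return "File landing"                 # partner drops and CSV exports
    return "Batch ETL (Spark)"                # default for high-volume historical loads

print(recommend_pattern(Source("operational_db", freshness_seconds=5)))
# → Streaming CDC (Flink)
```

In practice the decision also weighs cost and team skills, but encoding the first-pass logic keeps pattern selection consistent across sources.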

Managed ELT Tools for Iceberg

Managed ELT platforms simplify ingestion by providing pre-built connectors for hundreds of data sources, eliminating the need to build and maintain custom JDBC batch jobs or API pollers:

  • Airbyte (open source): 350+ source connectors, supports writing to Iceberg destinations (via S3 Parquet landing + Spark conversion or direct Iceberg write). Self-hosted or cloud managed.
  • Fivetran: Commercial managed ELT with automated schema change handling. Writes to data warehouse destinations; Iceberg support through partner integrations.
  • dbt Cloud: Handles the T in ELT — runs scheduled dbt transformations after other tools have loaded data into Bronze tables.

For SaaS source ingestion (Salesforce, HubSpot, Stripe, Google Analytics), managed ELT tools provide dramatically faster time-to-value than custom API integration code.
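To make concrete what managed ELT tools automate, here is a minimal sketch of the custom cursor-based API poller they replace. Everything here is hypothetical: `fetch_page` stands in for any paginated SaaS API client, and the state file and landing layout are illustrative assumptions.

```python
import json
from pathlib import Path

def sync_incremental(fetch_page, state_file: Path, landing_dir: Path) -> int:
    """Pull records newer than the saved cursor and land them as JSON files.

    `fetch_page(cursor)` must return (records, next_cursor); an empty
    record list signals that the source is fully synced.
    """
    # Resume from the last checkpoint so the sync is replayable after failure
    cursor = json.loads(state_file.read_text())["cursor"] if state_file.exists() else None
    total = 0
    while True:
        records, next_cursor = fetch_page(cursor)
        if not records:
            break
        # Land the raw page as a Bronze file, then checkpoint the cursor
        (landing_dir / f"batch_{next_cursor}.json").write_text(json.dumps(records))
        state_file.write_text(json.dumps({"cursor": next_cursor}))
        cursor, total = next_cursor, total + len(records)
    return total
```

Managed ELT platforms handle this loop, plus the parts the sketch omits: authentication, rate limiting, schema change handling, and retries — which is where the time-to-value gap comes from.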

Figure 2: Managed ELT tools for Iceberg — Airbyte, Fivetran, and dbt in the ingestion pipeline.

Summary

Data ingestion is the foundation of the lakehouse pipeline: without reliable, timely ingestion, no amount of transformation, governance, or query tuning can produce business value. Choosing the right ingestion pattern for each data source (streaming CDC for operational databases, batch ETL for high-volume historical loads, managed ELT for SaaS, federation for small operational tables) is the architecture decision that most directly determines the lakehouse's data freshness, cost, and operational reliability.