What Is Data Ingestion?

Data ingestion is the process of acquiring data from source systems and loading it into the data lakehouse — specifically into Bronze Apache Iceberg tables that serve as the raw data landing zone of the Medallion Architecture. Ingestion is the first stage of the lakehouse data lifecycle: data cannot be transformed, governed, or analyzed until it has been ingested.

Ingestion decisions have cascading effects: the ingestion pattern chosen determines data freshness (seconds vs hours), pipeline complexity (managed service vs custom Flink jobs), cost (compute for streaming vs cheaper batch schedules), and recovery characteristics (can the pipeline be replayed if a failure occurs?). Getting ingestion architecture right is foundational to lakehouse success.

Ingestion Patterns Compared

| Pattern | Freshness | Complexity | Use Case |
| --- | --- | --- | --- |
| Streaming CDC (Flink) | Seconds | High | Operational databases needing real-time freshness |
| Batch ETL (Spark) | Hours–Days | Medium | High-volume historical loads, complex transformations |
| File landing | Hours | Low | Partner data drops, SaaS CSV exports |
| Managed ELT (Airbyte) | Minutes–Hours | Low | SaaS sources (Salesforce, HubSpot, Google Ads) |
| Federation (Dremio) | Real-time | None | Small, frequently changing operational data |
Figure 1: Ingestion pattern comparison — freshness, complexity, and use case fit for each pattern.
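The trade-offs in the table above can be sketched as a small decision helper. This is an illustrative sketch only: the `Source` fields, the 60-second freshness threshold, and the category names are assumptions chosen to mirror the table, not fixed rules.

```python
from dataclasses import dataclass

@dataclass
class Source:
    kind: str                         # "operational_db", "saas_api", or "file_drop" (assumed categories)
    freshness_seconds: int            # required data freshness for this source
    small_and_volatile: bool = False  # small table that changes constantly

def recommend_pattern(src: Source) -> str:
    """Map a source's characteristics to an ingestion pattern from the table above."""
    if src.small_and_volatile:
        return "Federation (Dremio)"          # skip ingestion; query in place
    if src.kind == "operational_db" and src.freshness_seconds < 60:
        return "Streaming CDC (Flink)"        # second-level freshness requires CDC
    if src.kind == "saas_api":
        return "Managed ELT (Airbyte)"        # pre-built connectors beat custom pollers
    if src.kind == "file_drop":
        return "File landing"                 # partner drops and CSV exports
    return "Batch ETL (Spark)"                # default for high-volume historical loads

print(recommend_pattern(Source("operational_db", freshness_seconds=5)))
# → Streaming CDC (Flink)
```

In practice the decision also weighs cost and team skills, but encoding the first-pass logic keeps pattern selection consistent across sources.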

Managed ELT Tools for Iceberg

Managed ELT platforms simplify ingestion by providing pre-built connectors for hundreds of data sources, eliminating the need to build and maintain custom JDBC batch jobs or API pollers:

  • Airbyte (open source): 350+ source connectors, supports writing to Iceberg destinations (via S3 Parquet landing + Spark conversion or direct Iceberg write). Self-hosted or cloud managed.
  • Fivetran: Commercial managed ELT with automated schema change handling. Writes to data warehouse destinations; Iceberg support through partner integrations.
  • dbt Cloud: Handles the T in ELT — runs scheduled dbt transformations after other tools have loaded data into Bronze tables.

For SaaS source ingestion (Salesforce, HubSpot, Stripe, Google Analytics), managed ELT tools provide dramatically faster time-to-value than custom API integration code.
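To make concrete what managed ELT tools automate, here is a minimal sketch of the custom cursor-based API poller they replace. Everything here is hypothetical: `fetch_page` stands in for any paginated SaaS API client, and the state file and landing layout are illustrative assumptions.

```python
import json
from pathlib import Path

def sync_incremental(fetch_page, state_file: Path, landing_dir: Path) -> int:
    """Pull records newer than the saved cursor and land them as JSON files.

    `fetch_page(cursor)` must return (records, next_cursor); an empty
    record list signals that the source is fully synced.
    """
    # Resume from the last checkpoint so the sync is replayable after failure
    cursor = json.loads(state_file.read_text())["cursor"] if state_file.exists() else None
    total = 0
    while True:
        records, next_cursor = fetch_page(cursor)
        if not records:
            break
        # Land the raw page as a Bronze file, then checkpoint the cursor
        (landing_dir / f"batch_{next_cursor}.json").write_text(json.dumps(records))
        state_file.write_text(json.dumps({"cursor": next_cursor}))
        cursor, total = next_cursor, total + len(records)
    return total
```

Managed ELT platforms handle this loop, plus the parts the sketch omits: authentication, rate limiting, schema change handling, and retries — which is where the time-to-value gap comes from.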

Figure 2: Managed ELT tools for Iceberg — Airbyte, Fivetran, and dbt in the ingestion pipeline.

Summary

Data ingestion is the foundation of the lakehouse pipeline: without reliable, timely ingestion, no amount of transformation, governance, or query tuning can produce business value. Choosing the right ingestion pattern for each data source (streaming CDC for operational databases, batch ETL for high-volume historical loads, managed ELT for SaaS, federation for small operational tables) is the architecture decision that most directly determines the lakehouse's data freshness, cost, and operational reliability.