What Is Data Ingestion?
Data ingestion is the process of acquiring data from source systems and loading it into the data lakehouse — specifically into Bronze Apache Iceberg tables that serve as the raw data landing zone of the Medallion Architecture. Ingestion is the first stage of the lakehouse data lifecycle: data cannot be transformed, governed, or analyzed until it has been ingested.
Ingestion decisions have cascading effects: the ingestion pattern chosen determines data freshness (seconds vs hours), pipeline complexity (managed service vs custom Flink jobs), cost (compute for streaming vs cheaper batch schedules), and recovery characteristics (can the pipeline be replayed if a failure occurs?). Getting ingestion architecture right is foundational to lakehouse success.
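To make the Bronze landing step concrete, here is a minimal batch-ingest sketch in PySpark writing to an Iceberg table. The catalog name (`lake`), warehouse and source paths, and table name are illustrative assumptions, and the Iceberg Spark runtime is assumed to be on the classpath.

```python
from pyspark.sql import SparkSession

# Configure an Iceberg catalog named "lake" (placeholder) backed by an
# S3 warehouse path; all locations here are illustrative.
spark = (
    SparkSession.builder
    .appName("bronze-ingest")
    .config("spark.sql.catalog.lake", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.lake.type", "hadoop")
    .config("spark.sql.catalog.lake.warehouse", "s3://my-bucket/warehouse")
    .getOrCreate()
)

# Bronze keeps the source shape: read the raw files as-is, no cleaning yet.
raw = spark.read.json("s3://landing/orders/2024-06-01/")

# createOrReplace() bootstraps the table on the first run;
# subsequent incremental loads would call .append() instead.
raw.writeTo("lake.bronze.orders").createOrReplace()
```

Bronze tables deliberately preserve the source schema; cleaning, deduplication, and conforming happen downstream in Silver.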
Ingestion Patterns Compared
| Pattern | Freshness | Complexity | Use Case |
|---|---|---|---|
| Streaming CDC (Flink) | Seconds | High | Operational databases needing real-time freshness |
| Batch ETL (Spark) | Hours–Days | Medium | High-volume historical loads, complex transformations |
| File landing | Hours | Low | Partner data drops, SaaS CSV exports |
| Managed ELT (Airbyte) | Minutes–Hours | Low | SaaS sources (Salesforce, HubSpot, Google Ads) |
| Federation (Dremio) | Real-time | None | Small, frequently changing operational data |
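As a sketch of the streaming row in the table above, the same ingest can be expressed with Spark Structured Streaming reading change events from Kafka and appending to a Bronze Iceberg table; a Flink CDC job has the same overall shape. This reuses the `spark` session from the earlier sketch, the broker, topic, checkpoint path, and table name are assumptions, and Iceberg streaming writes expect the target table to exist already.

```python
# Streaming ingest sketch (Structured Streaming stands in for Flink here).
# Broker, topic, and checkpoint location are placeholders.
events = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("subscribe", "orders-cdc")
    .load()
)

query = (
    events.selectExpr(
        "CAST(key AS STRING) AS key",
        "CAST(value AS STRING) AS payload",
        "timestamp",
    )
    .writeStream
    .format("iceberg")
    .outputMode("append")
    .option("checkpointLocation", "s3://my-bucket/checkpoints/orders_cdc")
    .toTable("lake.bronze.orders_cdc")  # table assumed to be created beforehand
)
query.awaitTermination()
```

The checkpoint location is what gives the pipeline its recovery characteristics: after a failure, the query resumes from the last committed offsets rather than reprocessing or dropping events.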

Managed ELT Tools for Iceberg
Managed ELT platforms simplify ingestion by providing pre-built connectors for hundreds of data sources, eliminating the need to build and maintain custom JDBC batch jobs or API pollers:
- Airbyte (open source): 350+ source connectors, supports writing to Iceberg destinations (via S3 Parquet landing + Spark conversion or direct Iceberg write). Self-hosted or cloud managed.
- Fivetran: Commercial managed ELT with automated schema change handling. Writes to data warehouse destinations; Iceberg support through partner integrations.
- dbt Cloud: Handles the T in ELT, running scheduled dbt transformations after other tools have landed raw data in Bronze tables.
For SaaS source ingestion (Salesforce, HubSpot, Stripe, Google Analytics), managed ELT tools provide dramatically faster time-to-value than custom API integration code.
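For contrast, the hand-rolled JDBC incremental pull that a managed connector replaces looks roughly like the sketch below. The connection URL, credentials, the `updated_at` watermark column, and table names are all hypothetical, and a production version would also need watermark persistence, retries, and schema-change handling, which is exactly what the managed tools automate.

```python
# A hand-rolled incremental JDBC pull: the job a managed connector replaces.
# URL, credentials, and the "updated_at" watermark column are hypothetical.
last_watermark = "2024-06-01 00:00:00"  # normally read from durable state

incremental = (
    spark.read
    .format("jdbc")
    .option("url", "jdbc:postgresql://db-host:5432/crm")
    .option(
        "dbtable",
        f"(SELECT * FROM contacts WHERE updated_at > '{last_watermark}') AS src",
    )
    .option("user", "etl_user")
    .option("password", "***")
    .load()
)

incremental.writeTo("lake.bronze.crm_contacts").append()
```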

Summary
Data ingestion is the foundation of the lakehouse pipeline: without reliable, timely ingestion, no amount of transformation, governance, or query performance tuning can produce business value. Choosing the right ingestion pattern for each data source (streaming CDC for operational databases, batch ETL for high-volume historical sources, managed ELT for SaaS, federation for small operational tables) is the architecture decision that most directly determines the lakehouse's data freshness, cost, and operational reliability.