What Is Object Storage?
Object storage is a cloud storage architecture where data is stored as discrete objects, each consisting of a unique key (the object's address), a data payload (arbitrary binary data), and metadata (key-value pairs describing the object). Unlike traditional file systems with hierarchical directory trees, object storage uses a flat namespace: every object in a bucket is addressed by its full key, and slashes in keys merely mimic directory paths; there is no actual directory hierarchy.
The leading object storage services are Amazon S3 (AWS), Azure Data Lake Storage Gen2 (Azure), Google Cloud Storage (GCP), and MinIO (open-source, self-hosted). All expose HTTP/HTTPS APIs for uploading, downloading, listing, and deleting objects, making storage access language-agnostic and infrastructure-agnostic.
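The object model and its four core operations can be sketched with a small in-memory stand-in. This is an illustrative toy, not any vendor's API: the `Bucket` class and its method names are invented here, but the semantics mirror object storage: a bucket is a flat key-to-object map, and "directories" are nothing more than shared key prefixes scanned at list time.

```python
class Bucket:
    """Toy in-memory model of object-store semantics (illustrative only)."""

    def __init__(self):
        # Flat namespace: one dict of key -> (payload bytes, metadata dict).
        self._objects = {}

    def put(self, key, payload, metadata=None):
        self._objects[key] = (payload, metadata or {})

    def get(self, key):
        return self._objects[key]

    def delete(self, key):
        self._objects.pop(key, None)

    def list(self, prefix=""):
        # No directory tree to walk: listing is just a prefix scan over keys.
        return sorted(k for k in self._objects if k.startswith(prefix))


bucket = Bucket()
bucket.put("warehouse/db/table/data/part-0.parquet", b"...", {"rows": "100"})
bucket.put("warehouse/db/table/metadata/v1.json", b"{}")
print(bucket.list("warehouse/db/table/metadata/"))
# -> ['warehouse/db/table/metadata/v1.json']
```

Note that `warehouse/db/table/` never exists as a directory; it is only a prefix that several keys happen to share, which is why "renaming a folder" in object storage means rewriting every key under that prefix.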
Object storage is the storage layer of the data lakehouse because it combines unlimited scale, extremely low cost, high durability (S3 is designed for 99.999999999%, or "eleven nines," of durability), and a universal API that any engine can access without proprietary storage drivers or specialized network configuration.
Object Storage and the Lakehouse Architecture
Object storage enables the defining characteristic of the data lakehouse: decoupled storage and compute. Because object storage is:
- Independently scalable: Storage grows independently of compute — no need to provision more compute to store more data
- Independently billed: Storage costs are separate from compute costs — idle data incurs only storage costs, not compute costs
- Multi-engine accessible: Any engine (Spark, Dremio, Trino, Flink) can read from and write to the same S3 bucket simultaneously
- Highly durable: Replication across availability zones ensures data survives hardware failures without any application-level handling
...the data lakehouse can place all data in one location and bring any number of compute engines to it — the opposite of the traditional data warehouse model where data must be loaded into the warehouse's proprietary storage.

S3 Consistency and Iceberg
A historical challenge for data lakehouses built on S3 was S3's eventual consistency model: after writing a new object, list operations might not immediately return the new object, causing race conditions in table management operations. Amazon S3 achieved strong read-after-write consistency in December 2020 — a landmark change that made building correct Iceberg implementations on S3 dramatically simpler.
Apache Iceberg's atomic commit model also works correctly with S3's consistency: Iceberg commits by writing a new metadata file and updating the catalog pointer atomically. With S3's strong consistency, readers immediately see the new metadata file after it is written, and the catalog update is the atomic serialization point that prevents concurrent write conflicts.
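The commit protocol described above can be sketched as a compare-and-swap on the catalog pointer. This is a simplified model, not Iceberg's implementation: the `Catalog` class and its `commit` signature are invented for illustration, and real catalogs (REST catalog, Nessie, AWS Glue, and others) each realize the atomic swap in their own way. The key idea it demonstrates is that metadata files in the object store are immutable, and only the pointer update decides which writer wins.

```python
import threading


class Catalog:
    """Toy catalog: one mutable pointer per table, updated via compare-and-swap."""

    def __init__(self):
        self._ptr = {}  # table name -> key of current metadata file
        self._lock = threading.Lock()

    def current(self, table):
        return self._ptr.get(table)

    def commit(self, table, expected, new):
        # Atomic serialization point: the swap succeeds only if the pointer
        # still equals what the writer read before producing `new`.
        with self._lock:
            if self._ptr.get(table) != expected:
                return False  # a concurrent writer won; caller must retry
            self._ptr[table] = new
            return True


store = {}  # stands in for the object store: key -> bytes
catalog = Catalog()

# Two writers read the same base version, then each writes a new,
# immutable metadata file to the object store before committing.
base = catalog.current("db.events")
store["metadata/v1.json"] = b"{...snapshot from writer A...}"
store["metadata/v2.json"] = b"{...snapshot from writer B...}"

print(catalog.commit("db.events", base, "metadata/v1.json"))  # True: A wins
print(catalog.commit("db.events", base, "metadata/v2.json"))  # False: B retries
print(catalog.current("db.events"))  # metadata/v1.json
```

Because writer B's metadata file was already safely written to the store, its retry only needs to re-read the new current pointer, rebase its changes, and attempt the swap again; no partially written table state is ever visible to readers.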

Summary
Object storage is the inexpensive, infinitely scalable, durably reliable foundation that makes the data lakehouse economically viable at any scale. By storing all data as objects on S3, ADLS, or GCS, organizations eliminate the storage scaling bottlenecks, proprietary storage costs, and single-engine lock-in of traditional data warehouses — while gaining the multi-engine interoperability and open format flexibility that define the open lakehouse architecture.