What Is Amazon S3?

Amazon S3 (Simple Storage Service) is AWS's flagship object storage service, launched in 2006 as one of the first cloud services. It provides effectively unlimited storage capacity at low cost and is designed for 99.999999999% (11 nines) of object durability through automatic replication across multiple Availability Zones. S3 is the most widely used object storage service in the world and the dominant storage backend for AWS-based data lakehouses.

For Apache Iceberg lakehouses on AWS, S3 is where everything lives: Parquet data files, Avro manifest files, manifest lists, table metadata JSON files, and Iceberg catalog state. Query engines (Dremio, Spark, Trino) access S3 through the S3 API — reading and writing objects using the same universal interface.
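To make that layout concrete, here is a small sketch of how an Iceberg table's objects are typically keyed under an S3 prefix. The bucket and table names are hypothetical; the `metadata/` and `data/` split follows Iceberg's standard table layout convention.

```python
# Sketch of an Iceberg table's object layout under an S3 prefix.
# Bucket, database, and table names are hypothetical.
TABLE_LOCATION = "s3://analytics-lake/warehouse/sales/orders"

def metadata_key(version: int, uuid: str) -> str:
    """Key for a table-metadata JSON file (a new one per commit)."""
    return f"{TABLE_LOCATION}/metadata/{version:05d}-{uuid}.metadata.json"

def data_key(partition: str, filename: str) -> str:
    """Key for a Parquet data file inside a partition directory."""
    return f"{TABLE_LOCATION}/data/{partition}/{filename}"

print(metadata_key(3, "9b2d"))
# s3://analytics-lake/warehouse/sales/orders/metadata/00003-9b2d.metadata.json
print(data_key("order_date=2024-01-15", "part-00000.parquet"))
```

Because metadata files are never overwritten, each commit simply PUTs new objects; engines discover the current state through the catalog, not by scanning keys.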

S3 Storage Classes for Lakehouse Data

S3 offers multiple storage classes with different cost/access-time trade-offs — enabling lakehouse cost optimization based on data access patterns:

  • S3 Standard: Default class for frequently accessed data. Millisecond access time. Highest cost per GB. Best for Bronze/Silver/Gold layer tables actively queried.
  • S3 Intelligent-Tiering: Automatically moves objects between access tiers based on usage patterns. No retrieval fee, though a small per-object monitoring charge applies. Best for data with unpredictable access patterns (older partitions that are occasionally queried).
  • S3 Standard-IA (Infrequent Access): Lower storage cost, retrieval fee per GB, and a 30-day minimum storage duration. Best for data accessed less than once per month.
  • S3 Glacier Instant Retrieval: Very low storage cost, millisecond retrieval, higher retrieval fee. Best for data retained for compliance but rarely queried.
Figure 1: S3 storage class tiering for lakehouse data — optimize costs based on partition access frequency.
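This tiering can be automated with S3 lifecycle rules rather than manual moves. Below is an illustrative lifecycle configuration in the shape expected by boto3's `put_bucket_lifecycle_configuration`; the bucket name, prefix, and day thresholds are assumptions to be tuned to your partition access patterns.

```python
# Illustrative lifecycle configuration tiering older lakehouse partitions
# down the storage classes. Prefix and day thresholds are hypothetical.
lifecycle_config = {
    "Rules": [
        {
            "ID": "tier-old-partitions",
            "Status": "Enabled",
            "Filter": {"Prefix": "warehouse/sales/orders/data/"},
            "Transitions": [
                # After 90 days, let S3 manage tiering automatically.
                {"Days": 90, "StorageClass": "INTELLIGENT_TIERING"},
                # After a year, compliance-retained data moves to
                # Glacier Instant Retrieval.
                {"Days": 365, "StorageClass": "GLACIER_IR"},
            ],
        }
    ]
}

# Applying it requires boto3 and AWS credentials:
# import boto3
# boto3.client("s3").put_bucket_lifecycle_configuration(
#     Bucket="analytics-lake", LifecycleConfiguration=lifecycle_config)
```

Note that lifecycle rules operate on object age, not query frequency; Intelligent-Tiering is the safer default when access patterns are unknown.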

S3 Strong Consistency and Iceberg

Amazon S3 achieved strong read-after-write consistency for all GET, PUT, LIST, and DELETE operations in December 2020. This was a critical milestone for Iceberg on S3: previously, Iceberg implementations had to work around S3's eventual consistency with mechanisms like DynamoDB lock tables. With strong consistency, Iceberg's native atomic commit mechanism (writing metadata and updating the catalog pointer) works correctly on S3 without any workarounds.

S3 strong consistency means: immediately after an Iceberg commit writes a new metadata file to S3, any subsequent GET of that key will return the new file. LIST operations immediately reflect new objects. This is the consistency model Iceberg's optimistic concurrency control requires.
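The commit flow this enables can be sketched in a few lines. Metadata files are immutable S3 objects (plain PUTs of new keys), so only the catalog's pointer swap must be atomic; the in-memory dict below is a stand-in for a real catalog such as Glue, Polaris, or Nessie, and the key names are hypothetical.

```python
# Minimal sketch of Iceberg's optimistic-concurrency commit on S3.
# The dict stands in for a real catalog (Glue, Polaris, Nessie).
catalog = {"db.orders": "metadata/00002-abc.metadata.json"}

def commit(table: str, expected: str, new_metadata: str) -> bool:
    """Compare-and-swap the table's current-metadata pointer.

    Returns False if another writer committed first; the caller must
    then re-read the latest metadata and retry on top of it.
    """
    if catalog[table] != expected:
        return False  # a conflicting commit won the race
    catalog[table] = new_metadata
    return True

ok = commit("db.orders",
            "metadata/00002-abc.metadata.json",
            "metadata/00003-def.metadata.json")
print(ok, catalog["db.orders"])
```

Strong consistency guarantees that once the pointer references the new metadata file, every engine that GETs that key sees the committed version immediately.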

S3 Access Control for Lakehouses

S3 provides multiple access control mechanisms for securing lakehouse data:

  • IAM Policies: Identity-based policies granting specific AWS principals (IAM users, roles) access to specific S3 buckets and object prefixes
  • Bucket Policies: Resource-based policies attached to S3 buckets, defining who can access which objects
  • S3 Block Public Access: Account and bucket-level settings preventing any public access regardless of individual object ACLs
  • AWS Lake Formation: Fine-grained table and column level access control layered on top of Glue-cataloged S3 data
  • Credential Vending: Iceberg REST catalogs (Polaris, Nessie, Glue) return short-lived STS credentials scoped to specific S3 prefixes — engines get exactly the permissions needed for specific table access
Figure 2: S3 security layers for lakehouse data — IAM, bucket policies, Lake Formation, and credential vending.
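Credential vending in particular works by attaching a prefix-scoped session policy when the catalog calls STS AssumeRole on the engine's behalf. The sketch below builds such a policy; the bucket and prefix are hypothetical, and the effective permissions are the intersection of this session policy and the assumed role's own policy.

```python
import json

# Illustrative session policy a REST catalog might attach when vending
# short-lived credentials for a single Iceberg table. Bucket and prefix
# names are hypothetical.
def table_scoped_policy(bucket: str, table_prefix: str) -> str:
    return json.dumps({
        "Version": "2012-10-17",
        "Statement": [
            {
                # Object access only under the table's own prefix.
                "Effect": "Allow",
                "Action": ["s3:GetObject", "s3:PutObject", "s3:DeleteObject"],
                "Resource": f"arn:aws:s3:::{bucket}/{table_prefix}/*",
            },
            {
                # Listing restricted to the same prefix.
                "Effect": "Allow",
                "Action": "s3:ListBucket",
                "Resource": f"arn:aws:s3:::{bucket}",
                "Condition": {"StringLike": {"s3:prefix": f"{table_prefix}/*"}},
            },
        ],
    })

policy = table_scoped_policy("analytics-lake", "warehouse/sales/orders")
# A catalog would pass this as the Policy parameter of sts.assume_role(...)
```

Because the credentials expire and are scoped to one table's prefix, a compromised engine session cannot read neighboring tables in the same bucket.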

Summary

Amazon S3 is the default storage layer for AWS-based data lakehouses, combining unlimited scale, 11-nine durability, strong consistency, cost-optimized storage tiers, and the universal S3 API that every lakehouse engine supports. For organizations building on Apache Iceberg on AWS, S3 is the natural and optimal storage foundation — directly enabling the decoupled storage-and-compute architecture that makes the open lakehouse economically superior to proprietary cloud warehouses.