What does decoupled storage and compute mean?

Decoupled storage and compute means that data storage (files in S3, ADLS, GCS) is completely separate from query processing infrastructure (Dremio executors, Spark clusters, Trino workers). Storage scales independently of compute. Compute scales independently of storage. Storage is always-on; compute can be turned off when not in use. They are billed separately.

How is decoupled architecture different from a traditional data warehouse?

In a traditional data warehouse (Teradata, or Redshift on its original node types), storage and compute are tightly coupled: you provision a cluster that determines both storage capacity and compute power together. More data requires more nodes, which adds both storage and compute. In the lakehouse, adding data requires only more object storage; scaling queries requires only more compute nodes.

What technologies enable decoupled storage and compute in the lakehouse?

Three technologies work together: object storage (S3, ADLS, GCS) provides the decoupled storage layer; Apache Iceberg provides the shared data format and metadata model that multiple compute engines can read simultaneously; and elastic compute services (Dremio Cloud, EMR, Databricks) provide on-demand query processing that can scale to zero when idle.

Decoupled Storage and Compute: The Definitive Guide

What Is Decoupled Storage and Compute?

Decoupled storage and compute is the architectural principle that separates data storage infrastructure from query processing infrastructure, allowing each to scale independently. In the data lakehouse, this means:

Storage: Data lives in object storage (S3, ADLS, GCS), which offers effectively unlimited capacity, is always on, is billed per byte stored per month, and is independent of any compute infrastructure
Compute: Query engines (Dremio, Spark, Trino, Flink) are elastic compute clusters that connect to object storage on demand, execute queries, and scale down (or shut off entirely) when workloads are idle

The decoupling is enabled by a common interface: Apache Iceberg's open table format, stored in object storage, provides the shared data contract that any compatible compute engine can read. There is no proprietary storage format or engine-specific storage protocol, just open data files (typically Parquet) and Iceberg metadata in object storage.

Economic Advantages Over Coupled Architectures

Decoupling storage and compute provides dramatic economic advantages over tightly coupled alternatives:

Storage Cost

Object storage costs about $0.023/GB/month (S3 Standard list price at the entry tier; rates vary by region and volume). A coupled data warehouse that uses proprietary storage can cost 5–10x more for equivalent durability and capacity. For petabyte-scale data, this difference can reach millions of dollars per year.
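The arithmetic behind that claim can be sketched as follows. This is an illustration only: it assumes a flat $0.023/GB/month rate, a decimal petabyte, and the 5–10x multiplier quoted above, and it ignores request, egress, and tiered-pricing effects. The function and variable names are hypothetical.

```python
# Rough storage-cost comparison under the assumptions stated above.
GB_PER_PB = 1_000_000  # decimal petabyte (assumption)


def annual_storage_cost(petabytes: float, usd_per_gb_month: float) -> float:
    """Yearly cost of keeping `petabytes` of data at a flat per-GB monthly rate."""
    return petabytes * GB_PER_PB * usd_per_gb_month * 12


# 1 PB on object storage at the S3 Standard rate quoted in the text.
object_cost = annual_storage_cost(1, 0.023)

# The same data at the 5x and 10x multipliers quoted for coupled storage.
coupled_low = object_cost * 5
coupled_high = object_cost * 10

print(f"object storage:  ${object_cost:,.0f} per year")
print(f"coupled (5x):    ${coupled_low:,.0f} per year")
print(f"coupled (10x):   ${coupled_high:,.0f} per year")
print(f"difference:      ${coupled_low - object_cost:,.0f} to "
      f"${coupled_high - object_cost:,.0f} per year")
```

Under these assumptions, one petabyte costs roughly $276,000 per year in object storage, and the gap to a 5–10x coupled system is on the order of $1.1 million to $2.5 million per year.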

Compute Elasticity

Decoupled compute can scale to zero when idle: no queries running means no compute costs. A coupled warehouse typically stays provisioned 24/7, even during nights and weekends when no one is querying. Lakehouse organizations report 60–80% lower total compute costs through elasticity, although the actual figure depends on how much of the week the workload is idle.
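A back-of-the-envelope model shows where savings in that range can come from. It is a sketch under simplifying assumptions: the same hourly rate applies whether or not the cluster is elastic, compute stops completely when idle, and startup time is ignored. The function name and the usage patterns are hypothetical.

```python
# Savings from stopping compute while idle, relative to an always-on cluster.
HOURS_PER_WEEK = 24 * 7  # 168


def elasticity_savings(active_hours_per_week: float) -> float:
    """Fraction of the always-on bill avoided when compute runs only while active."""
    if not 0 <= active_hours_per_week <= HOURS_PER_WEEK:
        raise ValueError("active hours must be between 0 and 168")
    return 1 - active_hours_per_week / HOURS_PER_WEEK


# Two hypothetical usage patterns.
weekday_only = elasticity_savings(8 * 5)  # 8 hours a day, Monday to Friday
every_day = elasticity_savings(8 * 7)     # 8 hours a day, all week

print(f"weekday business hours: {weekday_only:.0%} lower compute cost")
print(f"8 hours every day:      {every_day:.0%} lower compute cost")
```

With these inputs, a cluster that is active only during weekday business hours avoids about 76% of the always-on cost, and one that is active 8 hours every day avoids about 67%.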

Independent Scaling

Data grows faster than query volume in most organizations. Decoupled storage allows data to grow without proportionally growing compute: a 10x data increase doesn't require a 10x compute increase, only more object storage capacity, which is billed incrementally per byte stored rather than provisioned in advance.

Figure 1: Decoupled lakehouse architecture, with object storage shared by multiple independent compute engines.

Multi-Engine Concurrent Access

A critical capability enabled by decoupled storage is multi-engine access: multiple compute engines can read from and write to the same Iceberg tables in object storage at the same time, with Iceberg's atomic commits and snapshot isolation keeping every reader's view consistent.

In a coupled warehouse, adding a second query engine means either duplicating data or building complex synchronization mechanisms. In the decoupled lakehouse, Spark writes new data while Dremio serves BI queries and Flink ingests streaming events, all on the same Iceberg tables, in the same S3 bucket, at the same time. Each engine's writes are committed atomically via the Iceberg catalog; each engine reads a consistent snapshot without locking others out.
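The commit behaviour described above can be illustrated with a toy model of optimistic concurrency. This is not Iceberg's API: ToyCatalog, Snapshot, and append_with_retry are hypothetical names, and the in-memory lock stands in for whatever atomic pointer swap a real catalog provides. The sketch only shows the idea that a writer commits against the snapshot it started from, fails if another writer got there first, and retries against the refreshed state, while readers keep the immutable snapshot they pinned.

```python
import threading
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple


@dataclass(frozen=True)
class Snapshot:
    """Immutable view of a table: an id plus the data files it contains."""
    snapshot_id: int
    data_files: Tuple[str, ...]


class ToyCatalog:
    """Hypothetical in-memory stand-in for a catalog's atomic pointer swap."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._current = Snapshot(0, ())

    def current(self) -> Snapshot:
        return self._current

    def commit(self, base: Snapshot, new_files: Sequence[str]) -> Optional[Snapshot]:
        """Compare-and-swap: succeed only if `base` is still the current snapshot."""
        with self._lock:
            if self._current.snapshot_id != base.snapshot_id:
                return None  # another writer committed first; caller must retry
            self._current = Snapshot(
                base.snapshot_id + 1, base.data_files + tuple(new_files)
            )
            return self._current


def append_with_retry(catalog: ToyCatalog, new_files: Sequence[str],
                      max_attempts: int = 5) -> Snapshot:
    """Re-read the current snapshot and retry until the commit succeeds."""
    for _ in range(max_attempts):
        committed = catalog.commit(catalog.current(), new_files)
        if committed is not None:
            return committed
    raise RuntimeError("commit failed after repeated conflicts")


catalog = ToyCatalog()
reader_view = catalog.current()   # a BI reader pins snapshot 0
spark_base = catalog.current()    # two writers start from the same snapshot
flink_base = catalog.current()

spark_result = catalog.commit(spark_base, ["spark-0001.parquet"])   # succeeds
stale_result = catalog.commit(flink_base, ["flink-0001.parquet"])   # conflict
flink_result = append_with_retry(catalog, ["flink-0001.parquet"])   # succeeds on retry
```

In this run the second writer's first attempt is rejected because its base snapshot is stale, the retry succeeds, and the reader's pinned snapshot never changes, which is the property that lets engines share tables without locking each other out.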

Figure 2: Multiple engines concurrently accessing the same Iceberg tables, the decoupled lakehouse advantage.

Summary

Decoupled storage and compute is the foundational economic and architectural principle that makes the data lakehouse superior to tightly coupled data warehouse architectures for most modern analytical workloads. By storing data in open Parquet files on cheap object storage and processing it with elastic compute engines, organizations achieve effectively unlimited data scale at low cost, pay only for active compute, and keep the freedom to choose any engine for any workload, without the vendor lock-in, storage costs, or scaling constraints of proprietary coupled systems.