What Is Apache Hadoop HDFS?

Apache Hadoop HDFS (Hadoop Distributed File System) is the distributed file system that powered the Hadoop data lake era (2008–2020). Designed to store very large files across clusters of commodity hardware, HDFS splits files into large blocks (128 MB by default) and replicates each block across multiple nodes (three replicas by default) for fault tolerance. It provided the scalable, fault-tolerant storage layer that made petabyte-scale data lakes possible before cloud object storage became the default choice.

HDFS was the foundation of the original data lake stack: Apache Hive running on MapReduce (and later Spark) read and wrote Hive tables stored on HDFS, with the Hive Metastore tracking table metadata. This architecture served the enterprise data world for over a decade, and many organizations still operate HDFS-based data lakes in production today.
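
As a concrete illustration of that stack, the sketch below uses PySpark with Hive Metastore support to read a Hive table whose data files live on HDFS. The metastore URI, warehouse path, and table name are illustrative placeholders, not values from this article.

```python
# Minimal sketch of the classic Hive-on-HDFS stack from Spark.
# The metastore URI, warehouse path, and table name are placeholders.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("hive-on-hdfs-example")
    # Hive Metastore that tracks table metadata (schemas, partitions, locations).
    .config("hive.metastore.uris", "thrift://metastore-host:9083")
    # Hive warehouse directory on HDFS where managed table data lives.
    .config("spark.sql.warehouse.dir", "hdfs://namenode:8020/user/hive/warehouse")
    .enableHiveSupport()
    .getOrCreate()
)

# The Metastore resolves 'sales.orders' to its HDFS location; Spark then reads
# the underlying ORC/Parquet files directly from the DataNodes.
spark.sql("SELECT * FROM sales.orders LIMIT 10").show()
```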

HDFS Architecture

HDFS uses a master-worker architecture:

  • NameNode: The master node that maintains the file system namespace — the directory tree, file metadata, and the mapping of file blocks to DataNode locations. The NameNode is a single point of failure (mitigated by HA NameNode configurations).
  • DataNodes: Worker nodes that store actual file blocks on local disk. DataNodes replicate blocks to peer DataNodes and report block health to the NameNode.
  • Secondary NameNode / JournalNodes: Auxiliary services that handle NameNode checkpointing (Secondary NameNode) and the shared edit log used for high availability (JournalNodes).
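
To make the division of labor concrete, the sketch below (assuming a Python client with pyarrow, a Hadoop client install providing libhdfs, and a placeholder NameNode address) asks the NameNode for namespace metadata; reading file contents would then stream block data from the DataNodes that hold the replicas.

```python
# Sketch: browsing the HDFS namespace through the NameNode with pyarrow.
# Requires libhdfs from a Hadoop client install; host, port, and path are placeholders.
from pyarrow import fs

# Namespace operations (listing, stat) are answered by the NameNode.
hdfs = fs.HadoopFileSystem(host="namenode.example.com", port=8020)

# get_file_info returns metadata only; reading file contents would pull
# block data from the DataNodes holding the replicas.
for info in hdfs.get_file_info(fs.FileSelector("/data/events", recursive=False)):
    print(info.path, info.size, info.type)
```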

This architecture requires dedicated hardware infrastructure, careful capacity planning, and ongoing operational management; cloud object storage, by contrast, is serverless and scales on demand without upfront capacity planning.

HDFS Architecture vs Cloud Object Storage diagram
Figure 1: HDFS master-worker architecture vs cloud object storage serverless model.

Migrating from HDFS to Object Storage with Iceberg

The migration from HDFS-based data lakes to cloud object storage lakehouses is one of the most common infrastructure modernization projects in enterprise data engineering in 2025–2026. Apache Iceberg is a key enabler of this migration:

  1. Iceberg on HDFS: Migrate Hive ORC/Parquet tables to Iceberg format while still on HDFS, gaining snapshots, ACID transactions, and schema evolution without changing the storage layer (see the sketch after this list)
  2. DistCp to S3/ADLS: Copy Iceberg data and metadata files from HDFS to cloud object storage using DistCp or cloud-native sync tools
  3. Update catalog pointers: Re-register Iceberg tables in the new cloud catalog (Glue, Polaris, Nessie) pointing to the cloud storage location
  4. Decommission HDFS: Once all tables are migrated and validated, decommission the HDFS cluster
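
A hedged sketch of step 1: Iceberg ships Spark procedures (snapshot and migrate) that convert an existing Hive table in place, keeping the data files on HDFS. The catalog configuration and table names below are illustrative and assume the Iceberg Spark runtime is on the classpath.

```python
# Step 1 sketch: convert an existing Hive table on HDFS to Iceberg in place.
# Assumes the Iceberg Spark runtime JAR is available; names are placeholders.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("hive-to-iceberg")
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    # Wrap the session catalog so existing Hive tables and new Iceberg tables
    # are both visible during the migration.
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.iceberg.spark.SparkSessionCatalog")
    .config("spark.sql.catalog.spark_catalog.type", "hive")
    .enableHiveSupport()
    .getOrCreate()
)

# 'snapshot' creates an Iceberg table referencing the existing files
# (non-destructive trial run); 'migrate' replaces the Hive table in place.
spark.sql("CALL spark_catalog.system.snapshot('sales.orders', 'sales.orders_iceberg')")
# spark.sql("CALL spark_catalog.system.migrate('sales.orders')")
```

Step 2 is then typically a plain `hadoop distcp` run from the HDFS table paths to an `s3a://` (or `abfs://`) destination, copying the data and metadata files.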

Iceberg's file-level metadata model (all table metadata lives in ordinary files alongside the data, not in a proprietary cluster service) makes this migration path straightforward: beyond the catalog pointer updated in step 3, there is no cluster-internal state to migrate.
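
For step 3, Iceberg's register_table Spark procedure can point a cloud catalog at the copied metadata. The sketch below assumes an AWS Glue catalog and placeholder S3 paths; it also assumes the metadata's internal file paths already reference the cloud location (rewriting paths during the copy is outside this sketch).

```python
# Step 3 sketch: register the copied Iceberg table in a cloud catalog (Glue here).
# Catalog names, configuration, and paths are illustrative; adjust for Polaris or Nessie.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("register-iceberg-in-glue")
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .config("spark.sql.catalog.glue", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.glue.catalog-impl",
            "org.apache.iceberg.aws.glue.GlueCatalog")
    .config("spark.sql.catalog.glue.io-impl", "org.apache.iceberg.aws.s3.S3FileIO")
    .config("spark.sql.catalog.glue.warehouse", "s3://my-lakehouse/warehouse")
    .getOrCreate()
)

# Point the catalog at the latest metadata file copied over from HDFS.
spark.sql("""
    CALL glue.system.register_table(
        table => 'sales.orders',
        metadata_file => 's3://my-lakehouse/warehouse/sales.db/orders/metadata/v12.metadata.json'
    )
""")
```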

HDFS to Cloud Object Storage Migration Path diagram
Figure 2: HDFS to cloud object storage migration using Apache Iceberg as the bridge.

Summary

Apache Hadoop HDFS was the foundational storage technology of the data lake era, enabling petabyte-scale distributed storage before cloud object storage was widely adopted. In 2025, HDFS is being systematically replaced by cloud object storage (S3, ADLS, GCS) as organizations migrate to the cloud data lakehouse architecture. Apache Iceberg's support for HDFS as a storage backend provides a practical migration path: organizations can adopt Iceberg's benefits on HDFS first, then migrate storage to the cloud without changing the table format.