What Is Apache Hadoop HDFS?

Apache Hadoop HDFS (Hadoop Distributed File System) is the distributed file system that powered the Hadoop data lake era (2008–2020). Designed to store very large files across clusters of commodity hardware, HDFS splits files into large blocks (128 MB by default) and replicates each block across multiple nodes (three replicas by default) for fault tolerance. It provided the scalable, fault-tolerant storage layer that made petabyte-scale data lakes possible before cloud object storage became the default choice.

HDFS was the foundation of the original data lake stack: Apache Hive running on MapReduce (and later Spark) read and wrote Hive tables stored on HDFS, with the Hive Metastore tracking table metadata. This architecture served the enterprise data world for over a decade, and many organizations still operate HDFS-based data lakes in production today.
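
As a concrete illustration of that stack, the sketch below uses PySpark with Hive Metastore support to read a Hive table whose data files live on HDFS. The metastore URI, warehouse path, and table name are illustrative placeholders, not values from this article.

```python
# Minimal sketch of the classic Hive-on-HDFS stack from Spark.
# The metastore URI, warehouse path, and table name are placeholders.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("hive-on-hdfs-example")
    # Hive Metastore that tracks table metadata (schemas, partitions, locations).
    .config("hive.metastore.uris", "thrift://metastore-host:9083")
    # Hive warehouse directory on HDFS where managed table data lives.
    .config("spark.sql.warehouse.dir", "hdfs://namenode:8020/user/hive/warehouse")
    .enableHiveSupport()
    .getOrCreate()
)

# The Metastore resolves 'sales.orders' to its HDFS location; Spark then reads
# the underlying ORC/Parquet files directly from the DataNodes.
spark.sql("SELECT * FROM sales.orders LIMIT 10").show()
```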

HDFS Architecture

HDFS uses a master-worker architecture:

  • NameNode: The master node that maintains the file system namespace — the directory tree, file metadata, and the mapping of file blocks to DataNode locations. The NameNode is a single point of failure (mitigated by HA NameNode configurations).
  • DataNodes: Worker nodes that store actual file blocks on local disk. DataNodes replicate blocks to peer DataNodes and report block health to the NameNode.
  • Secondary NameNode / JournalNodes: Auxiliary services that handle NameNode checkpointing (Secondary NameNode) and the shared edit log used for high availability (JournalNodes).
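
To make the division of labor concrete, the sketch below (assuming a Python client with pyarrow, a Hadoop client install providing libhdfs, and a placeholder NameNode address) asks the NameNode for namespace metadata; reading file contents would then stream block data from the DataNodes that hold the replicas.

```python
# Sketch: browsing the HDFS namespace through the NameNode with pyarrow.
# Requires libhdfs from a Hadoop client install; host, port, and path are placeholders.
from pyarrow import fs

# Namespace operations (listing, stat) are answered by the NameNode.
hdfs = fs.HadoopFileSystem(host="namenode.example.com", port=8020)

# get_file_info returns metadata only; reading file contents would pull
# block data from the DataNodes holding the replicas.
for info in hdfs.get_file_info(fs.FileSelector("/data/events", recursive=False)):
    print(info.path, info.size, info.type)
```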

This architecture requires dedicated hardware infrastructure, careful capacity planning, and ongoing operational management; cloud object storage, by contrast, is serverless and scales on demand without upfront capacity planning.

HDFS Architecture vs Cloud Object Storage diagram
Figure 1: HDFS master-worker architecture vs cloud object storage serverless model.

Migrating from HDFS to Object Storage with Iceberg

The migration from HDFS-based data lakes to cloud object storage lakehouses is one of the most common infrastructure modernization projects in enterprise data engineering in 2025–2026. Apache Iceberg is a key enabler of this migration:

  1. Iceberg on HDFS: Migrate Hive ORC/Parquet tables to Iceberg format while still on HDFS, gaining snapshots, ACID transactions, and schema evolution without changing the storage layer (see the sketch after this list)
  2. DistCp to S3/ADLS: Copy Iceberg data and metadata files from HDFS to cloud object storage using DistCp or cloud-native sync tools
  3. Update catalog pointers: Re-register Iceberg tables in the new cloud catalog (Glue, Polaris, Nessie) pointing to the cloud storage location
  4. Decommission HDFS: Once all tables are migrated and validated, decommission the HDFS cluster
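
A hedged sketch of step 1: Iceberg ships Spark procedures (snapshot and migrate) that convert an existing Hive table in place, keeping the data files on HDFS. The catalog configuration and table names below are illustrative and assume the Iceberg Spark runtime is on the classpath.

```python
# Step 1 sketch: convert an existing Hive table on HDFS to Iceberg in place.
# Assumes the Iceberg Spark runtime JAR is available; names are placeholders.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("hive-to-iceberg")
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    # Wrap the session catalog so existing Hive tables and new Iceberg tables
    # are both visible during the migration.
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.iceberg.spark.SparkSessionCatalog")
    .config("spark.sql.catalog.spark_catalog.type", "hive")
    .enableHiveSupport()
    .getOrCreate()
)

# 'snapshot' creates an Iceberg table referencing the existing files
# (non-destructive trial run); 'migrate' replaces the Hive table in place.
spark.sql("CALL spark_catalog.system.snapshot('sales.orders', 'sales.orders_iceberg')")
# spark.sql("CALL spark_catalog.system.migrate('sales.orders')")
```

Step 2 is then typically a plain `hadoop distcp` run from the HDFS table paths to an `s3a://` (or `abfs://`) destination, copying the data and metadata files.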

Iceberg's file-level metadata model (all table metadata lives in ordinary files alongside the data, not in a proprietary cluster service) makes this migration path straightforward: beyond the catalog pointer updated in step 3, there is no cluster-internal state to migrate.
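
For step 3, Iceberg's register_table Spark procedure can point a cloud catalog at the copied metadata. The sketch below assumes an AWS Glue catalog and placeholder S3 paths; it also assumes the metadata's internal file paths already reference the cloud location (rewriting paths during the copy is outside this sketch).

```python
# Step 3 sketch: register the copied Iceberg table in a cloud catalog (Glue here).
# Catalog names, configuration, and paths are illustrative; adjust for Polaris or Nessie.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("register-iceberg-in-glue")
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .config("spark.sql.catalog.glue", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.glue.catalog-impl",
            "org.apache.iceberg.aws.glue.GlueCatalog")
    .config("spark.sql.catalog.glue.io-impl", "org.apache.iceberg.aws.s3.S3FileIO")
    .config("spark.sql.catalog.glue.warehouse", "s3://my-lakehouse/warehouse")
    .getOrCreate()
)

# Point the catalog at the latest metadata file copied over from HDFS.
spark.sql("""
    CALL glue.system.register_table(
        table => 'sales.orders',
        metadata_file => 's3://my-lakehouse/warehouse/sales.db/orders/metadata/v12.metadata.json'
    )
""")
```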

HDFS to Cloud Object Storage Migration Path diagram
Figure 2: HDFS to cloud object storage migration using Apache Iceberg as the bridge.

Summary

Apache Hadoop HDFS was the foundational storage technology of the data lake era, enabling petabyte-scale distributed storage before cloud object storage was widely adopted. In 2025, HDFS is being systematically replaced by cloud object storage (S3, ADLS, GCS) as organizations migrate to the cloud data lakehouse architecture. Apache Iceberg's support for HDFS as a storage backend provides a practical migration path: organizations can adopt Iceberg's benefits on HDFS first, then migrate storage to the cloud without changing the table format.