What Is Apache Avro?

Apache Avro is an open-source data serialization system developed within the Apache Hadoop ecosystem. Unlike the columnar analytical formats — Apache Parquet and Apache ORC — Avro is a row-oriented format designed for efficient data serialization and exchange rather than analytical query optimization.

Avro schemas are defined in JSON, and records are serialized into a compact binary format. In Avro's object container file format, the full schema is written into the file header, making the files self-describing: any Avro reader can determine the data structure from the header alone. Kafka messages, by contrast, typically carry only a small schema identifier and rely on a schema registry to supply the schema itself, since embedding the full schema in every message would be wasteful. Both mechanisms let producers and consumers evolve independently.
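To make the JSON schema definition concrete, here is a minimal record schema (the record and field names are invented for illustration), parsed with nothing more than Python's standard `json` module:

```python
import json

# An illustrative Avro record schema. The names here (ClickEvent,
# user_id, page, timestamp) are assumptions, not from any real system.
schema_json = """
{
  "type": "record",
  "name": "ClickEvent",
  "namespace": "com.example.events",
  "fields": [
    {"name": "user_id",   "type": "long"},
    {"name": "page",      "type": "string"},
    {"name": "timestamp", "type": "long"}
  ]
}
"""

# An Avro schema is itself just JSON, so any JSON parser can inspect it.
schema = json.loads(schema_json)
field_names = [f["name"] for f in schema["fields"]]
print(field_names)  # ['user_id', 'page', 'timestamp']
```

Because the schema is plain JSON, this same document can be stored in a file header, registered in a schema registry, or checked into version control alongside the application code.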

Avro's schema evolution model is one of its strongest features: it supports adding new fields with defaults, removing unused fields, and renaming fields via aliases, with the reader's schema able to resolve data written by older or newer writer schemas. This makes Avro well suited to long-running streaming pipelines whose schemas evolve over time.
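The core of this resolution can be sketched in pure Python. This is an illustrative simulation of two of Avro's resolution rules (missing fields take the reader's default; fields the reader dropped are ignored), not the Avro library's actual implementation:

```python
def resolve_record(writer_record, reader_fields):
    """Sketch of Avro-style schema resolution for one record:
    a field the reader expects but the writer omitted gets the
    reader's default; a field the writer wrote but the reader
    no longer declares is simply ignored."""
    resolved = {}
    for field in reader_fields:
        name = field["name"]
        if name in writer_record:
            resolved[name] = writer_record[name]
        elif "default" in field:
            resolved[name] = field["default"]
        else:
            raise ValueError(f"no value or default for field {name!r}")
    return resolved

# An old producer wrote records with 'user_id' and 'page'; since then
# the reader schema added 'region' (with a default) and dropped 'page'.
old_record = {"user_id": 7, "page": "/home"}
reader_fields = [
    {"name": "user_id", "type": "long"},
    {"name": "region", "type": "string", "default": "us-east"},
]
print(resolve_record(old_record, reader_fields))
# {'user_id': 7, 'region': 'us-east'}
```

Note that in real Avro, adding a field without a default makes old data unreadable under the new reader schema, which is exactly the kind of breaking change a schema registry's compatibility checks are meant to catch.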

Avro in Apache Iceberg Metadata

One of Avro's most important roles in the modern lakehouse is as the metadata file format for Apache Iceberg. Iceberg's manifest-layer metadata files are Avro files (the top-level table metadata file itself is JSON):

  • Manifest lists (snapshot files): Avro files listing manifest file locations and partition summaries
  • Manifest files: Avro files listing individual data file locations with column statistics

Avro's self-describing schema, efficient row encoding, and schema evolution capabilities make it a natural choice for metadata records: each manifest record is a compact structured row with well-defined fields, and Avro's schema evolution allows Iceberg to extend manifest record schemas across spec versions without breaking older readers.
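To show why compact, well-typed manifest rows matter, here is a heavily simplified sketch of manifest-list pruning. The field names loosely follow the Iceberg spec (`manifest_path`, partition bound summaries), but this is an illustration of the idea, not the real Iceberg reader:

```python
# Simplified manifest-list entries: each row points at a manifest file
# and summarizes the partition range of the data files it tracks.
# Paths and dates here are invented for illustration.
manifest_list = [
    {"manifest_path": "s3://bucket/meta/m1.avro",
     "partitions": [{"lower_bound": "2024-01-01", "upper_bound": "2024-01-31"}]},
    {"manifest_path": "s3://bucket/meta/m2.avro",
     "partitions": [{"lower_bound": "2024-02-01", "upper_bound": "2024-02-29"}]},
]

def prune(manifests, day):
    """Keep only manifests whose partition range could contain `day`.
    ISO date strings compare correctly as plain strings."""
    return [m["manifest_path"] for m in manifests
            if m["partitions"][0]["lower_bound"] <= day
            <= m["partitions"][0]["upper_bound"]]

print(prune(manifest_list, "2024-02-14"))  # ['s3://bucket/meta/m2.avro']
```

Because each manifest-list row is a small structured record, a query planner can skip whole manifest files (and every data file they track) without opening them, which is where Avro's efficient row encoding pays off.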

[Figure 1: Avro as Iceberg's metadata format — manifest lists and manifest files are both Avro files.]

Avro for Kafka Streaming

In streaming ingestion pipelines — the pipelines that feed data from operational systems into lakehouse Bronze tables — Avro is the dominant message serialization format for Apache Kafka. The typical streaming-to-lakehouse architecture:

  1. Application services emit events to Kafka topics, serializing each event as an Avro binary record with a schema registered in the Confluent Schema Registry
  2. The Schema Registry checks each new producer schema against the topic's configured compatibility rules, preventing breaking schema changes from propagating through the pipeline
  3. Apache Flink or Spark Structured Streaming reads Avro messages from Kafka, deserializes them using the registered schema, and writes them to Bronze Iceberg tables (in Parquet format)

The Avro-Kafka-Iceberg pipeline is one of the most common streaming ingestion patterns in the lakehouse ecosystem.
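On the wire, Confluent's Avro serializers frame each Kafka message with a small header: a magic byte (0) followed by the 4-byte big-endian schema ID, then the Avro-encoded body. A minimal sketch of that framing in pure Python (the payload bytes below are placeholders, not real Avro output):

```python
import struct

MAGIC_BYTE = 0  # Confluent wire-format marker

def frame(schema_id: int, avro_payload: bytes) -> bytes:
    """Prepend the Confluent wire-format header: one magic byte,
    then the schema ID as a 4-byte big-endian integer."""
    return struct.pack(">bI", MAGIC_BYTE, schema_id) + avro_payload

def unframe(message: bytes):
    """Split a framed message back into (schema_id, avro_payload)."""
    magic, schema_id = struct.unpack(">bI", message[:5])
    if magic != MAGIC_BYTE:
        raise ValueError("not a Confluent-framed message")
    return schema_id, message[5:]

msg = frame(42, b"placeholder-avro-bytes")
sid, payload = unframe(msg)
print(sid)  # 42
```

The consumer uses the extracted schema ID to fetch the writer schema from the registry, then applies Avro's schema resolution against its own reader schema, which is how producers and consumers stay decoupled.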

[Figure 2: Avro serialization in the Kafka-to-Iceberg streaming pipeline.]

Summary

Apache Avro plays a distinct and important role in the lakehouse ecosystem: not as a data storage format for analytical reads (where Parquet dominates), but as the serialization format for streaming messages in Kafka and as the file format for Apache Iceberg's manifest metadata. Its self-describing files, schema evolution capabilities, and compact binary row encoding make it the right tool for these use cases, complementing Parquet's columnar analytical strengths with row-oriented streaming and metadata efficiency.