What Is Apache Avro?

Apache Avro is an open-source data serialization system developed within the Apache Hadoop ecosystem. Unlike the columnar analytical formats — Apache Parquet and Apache ORC — Avro is a row-oriented format designed for efficient data serialization and exchange rather than analytical query optimization.

Avro schemas are defined in JSON, and records are serialized into a compact binary format. In Avro's object container file format, the full schema is written into the file header, making the files self-describing: any Avro reader can determine the data structure from the header alone. Kafka messages, by contrast, typically carry only a small schema identifier and rely on a schema registry to supply the schema itself, since embedding the full schema in every message would be wasteful. Both mechanisms let producers and consumers evolve independently.
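To make the JSON schema definition concrete, here is a minimal record schema (the record and field names are invented for illustration), parsed with nothing more than Python's standard `json` module:

```python
import json

# An illustrative Avro record schema. The names here (ClickEvent,
# user_id, page, timestamp) are assumptions, not from any real system.
schema_json = """
{
  "type": "record",
  "name": "ClickEvent",
  "namespace": "com.example.events",
  "fields": [
    {"name": "user_id",   "type": "long"},
    {"name": "page",      "type": "string"},
    {"name": "timestamp", "type": "long"}
  ]
}
"""

# An Avro schema is itself just JSON, so any JSON parser can inspect it.
schema = json.loads(schema_json)
field_names = [f["name"] for f in schema["fields"]]
print(field_names)  # ['user_id', 'page', 'timestamp']
```

Because the schema is plain JSON, this same document can be stored in a file header, registered in a schema registry, or checked into version control alongside the application code.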

Avro's schema evolution model is one of its strongest features: it supports adding new fields with defaults, removing unused fields, and renaming fields via aliases, with the reader's schema able to resolve data written by older or newer writer schemas. This makes Avro well suited to long-running streaming pipelines whose schemas evolve over time.
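The core of this resolution can be sketched in pure Python. This is an illustrative simulation of two of Avro's resolution rules (missing fields take the reader's default; fields the reader dropped are ignored), not the Avro library's actual implementation:

```python
def resolve_record(writer_record, reader_fields):
    """Sketch of Avro-style schema resolution for one record:
    a field the reader expects but the writer omitted gets the
    reader's default; a field the writer wrote but the reader
    no longer declares is simply ignored."""
    resolved = {}
    for field in reader_fields:
        name = field["name"]
        if name in writer_record:
            resolved[name] = writer_record[name]
        elif "default" in field:
            resolved[name] = field["default"]
        else:
            raise ValueError(f"no value or default for field {name!r}")
    return resolved

# An old producer wrote records with 'user_id' and 'page'; since then
# the reader schema added 'region' (with a default) and dropped 'page'.
old_record = {"user_id": 7, "page": "/home"}
reader_fields = [
    {"name": "user_id", "type": "long"},
    {"name": "region", "type": "string", "default": "us-east"},
]
print(resolve_record(old_record, reader_fields))
# {'user_id': 7, 'region': 'us-east'}
```

Note that in real Avro, adding a field without a default makes old data unreadable under the new reader schema, which is exactly the kind of breaking change a schema registry's compatibility checks are meant to catch.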

Avro in Apache Iceberg Metadata

One of Avro's most important roles in the modern lakehouse is as the metadata file format for Apache Iceberg. Iceberg's manifest-layer metadata files are Avro files (the top-level table metadata file itself is JSON):

  • Manifest lists (snapshot files): Avro files listing manifest file locations and partition summaries
  • Manifest files: Avro files listing individual data file locations with column statistics

Avro's self-describing schema, efficient row encoding, and schema evolution capabilities make it a natural choice for metadata records: each manifest record is a compact structured row with well-defined fields, and Avro's schema evolution allows Iceberg to extend manifest record schemas across spec versions without breaking older readers.
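To show why compact, well-typed manifest rows matter, here is a heavily simplified sketch of manifest-list pruning. The field names loosely follow the Iceberg spec (`manifest_path`, partition bound summaries), but this is an illustration of the idea, not the real Iceberg reader:

```python
# Simplified manifest-list entries: each row points at a manifest file
# and summarizes the partition range of the data files it tracks.
# Paths and dates here are invented for illustration.
manifest_list = [
    {"manifest_path": "s3://bucket/meta/m1.avro",
     "partitions": [{"lower_bound": "2024-01-01", "upper_bound": "2024-01-31"}]},
    {"manifest_path": "s3://bucket/meta/m2.avro",
     "partitions": [{"lower_bound": "2024-02-01", "upper_bound": "2024-02-29"}]},
]

def prune(manifests, day):
    """Keep only manifests whose partition range could contain `day`.
    ISO date strings compare correctly as plain strings."""
    return [m["manifest_path"] for m in manifests
            if m["partitions"][0]["lower_bound"] <= day
            <= m["partitions"][0]["upper_bound"]]

print(prune(manifest_list, "2024-02-14"))  # ['s3://bucket/meta/m2.avro']
```

Because each manifest-list row is a small structured record, a query planner can skip whole manifest files (and every data file they track) without opening them, which is where Avro's efficient row encoding pays off.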

[Figure 1: Avro as Iceberg's metadata format — manifest lists and manifest files are both Avro files.]

Avro for Kafka Streaming

In streaming ingestion pipelines — the pipelines that feed data from operational systems into lakehouse Bronze tables — Avro is the dominant message serialization format for Apache Kafka. The typical streaming-to-lakehouse architecture:

  1. Application services emit events to Kafka topics, serializing each event as an Avro binary record with a schema registered in the Confluent Schema Registry
  2. The Schema Registry checks each new producer schema against the topic's configured compatibility rules, preventing breaking schema changes from propagating through the pipeline
  3. Apache Flink or Spark Structured Streaming reads Avro messages from Kafka, deserializes them using the registered schema, and writes them to Bronze Iceberg tables (in Parquet format)

The Avro-Kafka-Iceberg pipeline is one of the most common streaming ingestion patterns in the lakehouse ecosystem.
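On the wire, Confluent's Avro serializers frame each Kafka message with a small header: a magic byte (0) followed by the 4-byte big-endian schema ID, then the Avro-encoded body. A minimal sketch of that framing in pure Python (the payload bytes below are placeholders, not real Avro output):

```python
import struct

MAGIC_BYTE = 0  # Confluent wire-format marker

def frame(schema_id: int, avro_payload: bytes) -> bytes:
    """Prepend the Confluent wire-format header: one magic byte,
    then the schema ID as a 4-byte big-endian integer."""
    return struct.pack(">bI", MAGIC_BYTE, schema_id) + avro_payload

def unframe(message: bytes):
    """Split a framed message back into (schema_id, avro_payload)."""
    magic, schema_id = struct.unpack(">bI", message[:5])
    if magic != MAGIC_BYTE:
        raise ValueError("not a Confluent-framed message")
    return schema_id, message[5:]

msg = frame(42, b"placeholder-avro-bytes")
sid, payload = unframe(msg)
print(sid)  # 42
```

The consumer uses the extracted schema ID to fetch the writer schema from the registry, then applies Avro's schema resolution against its own reader schema, which is how producers and consumers stay decoupled.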

[Figure 2: Avro serialization in the Kafka-to-Iceberg streaming pipeline.]

Summary

Apache Avro plays a distinct and important role in the lakehouse ecosystem: not as a data storage format for analytical reads (where Parquet dominates), but as the serialization format for streaming messages in Kafka and as the file format for Apache Iceberg's manifest metadata. Its self-describing files, schema evolution capabilities, and compact binary row encoding make it the right tool for these use cases, complementing Parquet's columnar analytical strengths with row-oriented streaming and metadata efficiency.