What Is Apache Parquet?
Apache Parquet is an open-source, columnar storage file format originally developed by Twitter and Cloudera. First released in 2013, it later moved to the Apache Software Foundation and became a top-level Apache project in 2015. It is the dominant file format for analytical data storage in the lakehouse ecosystem and the default data file format for Apache Iceberg, Delta Lake, and Apache Hudi.
Parquet's defining design choice is columnar storage: rather than storing each row's values together (row-oriented, like CSV or traditional OLTP storage), Parquet stores each column's values together. All revenue values for all rows are stored contiguously; all customer_id values are stored contiguously; all event_date values are stored contiguously. This layout enables two critical query optimizations that are foundational to lakehouse performance: column pruning and predicate pushdown.
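To make the two optimizations concrete, here is a minimal sketch using pyarrow. The file name and columns (a hypothetical events.parquet with customer_id, revenue, and event_date) are assumptions for illustration, not from any real dataset:

```python
import datetime

import pyarrow.parquet as pq

# Column pruning: only the listed columns' chunks are read from storage.
# Predicate pushdown: the filter is checked against footer statistics first,
# so row groups that cannot possibly match are skipped before decoding.
table = pq.read_table(
    "events.parquet",                    # hypothetical file
    columns=["customer_id", "revenue"],  # column pruning
    filters=[("event_date", ">=", datetime.date(2024, 1, 1))],  # pushdown
)
print(table.num_rows)
```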
Parquet File Structure
A Parquet file has a layered structure:
- Row Groups: A Parquet file is divided into row groups: horizontal partitions of rows, typically sized to hold about 128MB of data each. Each row group contains all columns for its set of rows.
- Column Chunks: Within each row group, data is organized into column chunks — one per column. Each column chunk stores all values for that column within the row group, encoded and compressed together.
- Pages: Each column chunk is further divided into pages (typically 1MB each), the smallest unit of encoding, compression, and I/O. Dictionary-encoded column chunks store their dictionary in a dedicated dictionary page at the start of the chunk.
- File Footer: The Parquet file footer (at the end of the file) contains the schema definition, the byte offset and size of every row group and column chunk, and per-column-chunk statistics (min, max, null count). Query engines read the footer first to plan column-level reads; the sketch after this list shows how to inspect it.
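As a sketch of what a query engine does first, the snippet below uses pyarrow to parse only the footer of a hypothetical events.parquet (no row data is read) and print the schema, row group layout, and per-column-chunk statistics:

```python
import pyarrow.parquet as pq

pf = pq.ParquetFile("events.parquet")  # parses the footer only
meta = pf.metadata

print(pf.schema_arrow)                 # schema definition from the footer
print("row groups:", meta.num_row_groups)

# Per-column-chunk byte sizes and statistics for the first row group.
rg = meta.row_group(0)
for i in range(rg.num_columns):
    col = rg.column(i)
    stats = col.statistics
    if stats is not None and stats.has_min_max:
        print(col.path_in_schema, col.total_compressed_size,
              stats.min, stats.max, stats.null_count)
```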

Parquet and Apache Iceberg
Apache Iceberg is built around Parquet as its default data file format (Iceberg also supports ORC and Avro, but Parquet is the overwhelming default in practice). Iceberg's manifest file column statistics (lower_bounds and upper_bounds) mirror Parquet's footer statistics: they are derived from the Parquet footers when data files are written or imported.
The synergy between Iceberg and Parquet creates a multi-level optimization hierarchy: Iceberg's manifest metadata provides file-level statistics (for file skipping), and Parquet's row group statistics provide within-file row group skipping. Together, they deliver the two-level predicate pushdown that enables precise, efficient reads from petabyte-scale tables.
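As a rough sketch of the within-file half of that hierarchy, the function below (the names are illustrative, not an Iceberg or engine API, and it assumes a flat schema) evaluates a range predicate against Parquet row group statistics and reads only the surviving groups:

```python
import pyarrow.parquet as pq

def row_groups_to_read(path: str, column: str, lo, hi) -> list[int]:
    """Indices of row groups whose [min, max] range overlaps [lo, hi]."""
    meta = pq.ParquetFile(path).metadata
    col_idx = meta.schema.names.index(column)  # flat schema assumed
    keep = []
    for rg in range(meta.num_row_groups):
        stats = meta.row_group(rg).column(col_idx).statistics
        # Without usable statistics we cannot prove the group is
        # irrelevant, so it must be read.
        if stats is None or not stats.has_min_max:
            keep.append(rg)
        elif stats.min <= hi and stats.max >= lo:
            keep.append(rg)
    return keep

# Read only the row groups that might contain revenue in [100, 500].
pf = pq.ParquetFile("events.parquet")  # hypothetical file
table = pf.read_row_groups(
    row_groups_to_read("events.parquet", "revenue", 100, 500))
```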
Parquet Encoding and Compression
Parquet applies column-specific encodings before compression, exploiting the homogeneity of column values:
- Dictionary encoding: Low-cardinality columns (status, region, category) store a dictionary of unique values and index references instead of repeated full values — dramatically reducing storage size
- RLE (Run Length Encoding): Sequences of repeated values encoded as a count + value — effective for sorted or clustered data
- Delta encoding: Integer columns store differences between consecutive values — very effective for timestamps and monotonically increasing IDs
Combined with compression codecs (Snappy, ZSTD), these encodings typically achieve 5–10x compression ratios for business data, turning 100GB of raw CSV into 10–20GB of Parquet.
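A small, self-contained experiment (the data and file name are made up for illustration) shows both layers at work: a low-cardinality column gets dictionary-encoded and compresses far better than a high-entropy one, which you can verify from the footer metadata:

```python
import random

import pyarrow as pa
import pyarrow.parquet as pq

n = 1_000_000
table = pa.table({
    "status": [random.choice(["active", "inactive", "pending"])
               for _ in range(n)],                              # low cardinality
    "session_key": [random.random() for _ in range(n)],         # high entropy
})
pq.write_table(table, "demo.parquet", compression="zstd")

# Inspect which encodings the writer chose and how well each column compressed.
rg = pq.ParquetFile("demo.parquet").metadata.row_group(0)
for i in range(rg.num_columns):
    col = rg.column(i)
    print(col.path_in_schema, col.encodings,
          col.total_uncompressed_size, "->", col.total_compressed_size)
```

On typical runs the status column shrinks by orders of magnitude while the random session_key barely compresses, which is exactly the homogeneity effect described above.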

Summary
Apache Parquet is the data file format foundation of the lakehouse. Its columnar layout, nested type support, per-column statistics, and efficient compression make it uniquely suited for the analytical query patterns of the data lakehouse: read few columns, skip irrelevant files, decompress fast, and process with vectorized SIMD operations. Understanding Parquet's structure is fundamental to understanding why column pruning, predicate pushdown, and compaction have such dramatic effects on lakehouse query performance.