What Is Apache Parquet?
Apache Parquet is an open-source, columnar storage file format originally developed by Twitter and Cloudera. First released in 2013, it later moved to the Apache Software Foundation and became a top-level Apache project in 2015. It is the dominant file format for analytical data storage in the lakehouse ecosystem and the default data file format for Apache Iceberg, Delta Lake, and Apache Hudi.
Parquet's defining design choice is columnar storage: rather than storing each row's values together (row-oriented, like CSV or traditional OLTP storage), Parquet stores each column's values together. All revenue values for all rows are stored contiguously; all customer_id values are stored contiguously; all event_date values are stored contiguously. This layout enables two critical query optimizations that are foundational to lakehouse performance: column pruning and predicate pushdown.
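To make the two optimizations concrete, here is a minimal sketch using pyarrow. The file name and columns (a hypothetical events.parquet with customer_id, revenue, and event_date) are assumptions for illustration, not from any real dataset:

```python
import datetime

import pyarrow.parquet as pq

# Column pruning: only the listed columns' chunks are read from storage.
# Predicate pushdown: the filter is checked against footer statistics first,
# so row groups that cannot possibly match are skipped before decoding.
table = pq.read_table(
    "events.parquet",                    # hypothetical file
    columns=["customer_id", "revenue"],  # column pruning
    filters=[("event_date", ">=", datetime.date(2024, 1, 1))],  # pushdown
)
print(table.num_rows)
```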
Parquet File Structure
A Parquet file has a layered structure:
- Row Groups: A Parquet file is divided into row groups: horizontal partitions of rows, typically sized to hold about 128MB of data each. Each row group contains all columns for its set of rows.
- Column Chunks: Within each row group, data is organized into column chunks — one per column. Each column chunk stores all values for that column within the row group, encoded and compressed together.
- Pages: Each column chunk is further divided into pages (typically 1MB each), the smallest unit of encoding, compression, and I/O. Dictionary-encoded column chunks store their dictionary in a dedicated dictionary page at the start of the chunk.
- File Footer: The Parquet file footer (at the end of the file) contains the schema definition, the byte offset and size of every row group and column chunk, and per-column-chunk statistics (min, max, null count). Query engines read the footer first to plan column-level reads; the sketch after this list shows how to inspect it.
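As a sketch of what a query engine does first, the snippet below uses pyarrow to parse only the footer of a hypothetical events.parquet (no row data is read) and print the schema, row group layout, and per-column-chunk statistics:

```python
import pyarrow.parquet as pq

pf = pq.ParquetFile("events.parquet")  # parses the footer only
meta = pf.metadata

print(pf.schema_arrow)                 # schema definition from the footer
print("row groups:", meta.num_row_groups)

# Per-column-chunk byte sizes and statistics for the first row group.
rg = meta.row_group(0)
for i in range(rg.num_columns):
    col = rg.column(i)
    stats = col.statistics
    if stats is not None and stats.has_min_max:
        print(col.path_in_schema, col.total_compressed_size,
              stats.min, stats.max, stats.null_count)
```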

Parquet and Apache Iceberg
Apache Iceberg is built around Parquet as its default data file format (Iceberg also supports ORC and Avro, but Parquet is the overwhelming default in practice). Iceberg's manifest file column statistics (lower_bounds and upper_bounds) mirror Parquet's footer statistics: they are derived from the Parquet footers when data files are written or imported.
The synergy between Iceberg and Parquet creates a multi-level optimization hierarchy: Iceberg's manifest metadata provides file-level statistics (for file skipping), and Parquet's row group statistics provide within-file row group skipping. Together, they deliver the two-level predicate pushdown that enables precise, efficient reads from petabyte-scale tables.
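As a rough sketch of the within-file half of that hierarchy, the function below (the names are illustrative, not an Iceberg or engine API, and it assumes a flat schema) evaluates a range predicate against Parquet row group statistics and reads only the surviving groups:

```python
import pyarrow.parquet as pq

def row_groups_to_read(path: str, column: str, lo, hi) -> list[int]:
    """Indices of row groups whose [min, max] range overlaps [lo, hi]."""
    meta = pq.ParquetFile(path).metadata
    col_idx = meta.schema.names.index(column)  # flat schema assumed
    keep = []
    for rg in range(meta.num_row_groups):
        stats = meta.row_group(rg).column(col_idx).statistics
        # Without usable statistics we cannot prove the group is
        # irrelevant, so it must be read.
        if stats is None or not stats.has_min_max:
            keep.append(rg)
        elif stats.min <= hi and stats.max >= lo:
            keep.append(rg)
    return keep

# Read only the row groups that might contain revenue in [100, 500].
pf = pq.ParquetFile("events.parquet")  # hypothetical file
table = pf.read_row_groups(
    row_groups_to_read("events.parquet", "revenue", 100, 500))
```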
Parquet Encoding and Compression
Parquet applies column-specific encodings before compression, exploiting the homogeneity of column values:
- Dictionary encoding: Low-cardinality columns (status, region, category) store a dictionary of unique values and index references instead of repeated full values — dramatically reducing storage size
- RLE (Run Length Encoding): Sequences of repeated values encoded as a count + value — effective for sorted or clustered data
- Delta encoding: Integer columns store differences between consecutive values — very effective for timestamps and monotonically increasing IDs
Combined with compression codecs (Snappy, ZSTD), these encodings typically achieve 5–10x compression ratios for business data, turning 100GB of raw CSV into 10–20GB of Parquet.
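A small, self-contained experiment (the data and file name are made up for illustration) shows both layers at work: a low-cardinality column gets dictionary-encoded and compresses far better than a high-entropy one, which you can verify from the footer metadata:

```python
import random

import pyarrow as pa
import pyarrow.parquet as pq

n = 1_000_000
table = pa.table({
    "status": [random.choice(["active", "inactive", "pending"])
               for _ in range(n)],                              # low cardinality
    "session_key": [random.random() for _ in range(n)],         # high entropy
})
pq.write_table(table, "demo.parquet", compression="zstd")

# Inspect which encodings the writer chose and how well each column compressed.
rg = pq.ParquetFile("demo.parquet").metadata.row_group(0)
for i in range(rg.num_columns):
    col = rg.column(i)
    print(col.path_in_schema, col.encodings,
          col.total_uncompressed_size, "->", col.total_compressed_size)
```

On typical runs the status column shrinks by orders of magnitude while the random session_key barely compresses, which is exactly the homogeneity effect described above.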

Summary
Apache Parquet is the data file format foundation of the lakehouse. Its columnar layout, nested type support, per-column statistics, and efficient compression make it uniquely suited for the analytical query patterns of the data lakehouse: read few columns, skip irrelevant files, decompress fast, and process with vectorized SIMD operations. Understanding Parquet's structure is fundamental to understanding why column pruning, predicate pushdown, and compaction have such dramatic effects on lakehouse query performance.