What Is Query Optimization?
Query optimization is the process a query engine performs between receiving a SQL query and executing it: transforming the declarative SQL statement into an efficient physical execution plan that minimizes resource consumption and returns results as quickly as possible. The query optimizer is one of the most complex and performance-critical components of any analytical query engine.
In the data lakehouse, query optimization spans multiple layers: the logical query plan (join ordering, predicate placement), the metadata access pattern (how much of Iceberg's metadata tree must be read), the physical file access (which Parquet files and row groups are read), the execution strategy (hash join vs merge join, sort-based vs hash-based aggregation), and the materialization routing (whether to use a Reflection). Each layer compounds: optimal choices at each level can reduce query time by orders of magnitude compared to naive choices.
Optimization Levels in the Lakehouse
Lakehouse query optimization operates at five distinct levels:
L1: Reflection/Materialization Routing
The highest-impact optimization: does a pre-computed Reflection or materialized view exist that can serve this query? If yes, query planning routes to the Reflection — avoiding all raw data scanning. A 30-second query becomes 50ms.
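At its core, routing is a coverage test: does some materialization contain every column the query needs and group by at least the query's dimensions? The sketch below is a hypothetical illustration in Python, not any engine's actual planner API; the dict shapes and the `route_to_reflection` name are invustrative inventions.

```python
def route_to_reflection(query, reflections):
    """Pick the first reflection that covers the query's table, its columns,
    and (for aggregation reflections) its GROUP BY dimensions."""
    for r in reflections:
        covers_cols = set(query["columns"]) <= set(r["columns"])
        covers_dims = (r["group_by"] is None  # raw reflection: no rollup needed
                       or set(query["group_by"]) <= set(r["group_by"]))
        if r["table"] == query["table"] and covers_cols and covers_dims:
            return r
    return None  # no match: fall back to scanning the raw table
```

A real planner would also check predicate coverage and reflection freshness before routing, but the subset tests above capture the essential matching logic.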
L2: Iceberg Metadata Pruning
Multi-level predicate pushdown into Iceberg's manifest list and manifest files — eliminating irrelevant manifests and data files before any storage reads. Often eliminates 95%+ of data from consideration.
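The two levels of pruning can be sketched with simple min/max ranges. This is a simplified stand-in for Iceberg's real metadata structures (manifest lists carry partition summaries, manifest files carry per-file column stats); the data shapes here are invented for illustration.

```python
def prune(manifests, column, lo, hi):
    """Two-level pruning: drop whole manifests whose partition-value range
    cannot overlap [lo, hi], then drop individual data files by their
    per-file column min/max stats. Returns the surviving file paths."""
    survivors = []
    for m in manifests:
        m_lo, m_hi = m["partition_range"]  # manifest-list-level summary
        if m_hi < lo or m_lo > hi:
            continue                       # skip the entire manifest
        for f in m["files"]:
            f_lo, f_hi = f["stats"][column]  # per-file min/max stats
            if not (f_hi < lo or f_lo > hi):
                survivors.append(f["path"])
    return survivors
```

Note the compounding: a manifest eliminated at the first level never has its file entries examined at all, which is why wide queries over well-partitioned tables touch so little metadata.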
L3: File Format Optimization
Within remaining Parquet files: row group statistics pruning (eliminating row groups), column pruning (reading only needed columns), and dictionary filter pushdown.
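Row group pruning works the same way one level down: Parquet footers store per-row-group min/max statistics, so a reader can skip any row group whose range cannot satisfy the predicate. A minimal sketch, with the stats represented as plain tuples rather than real Parquet footer metadata:

```python
def select_row_groups(row_groups, column, lo, hi):
    """Keep only the indices of row groups whose [min, max] footer stats
    overlap the predicate range [lo, hi]; the rest are skipped without I/O."""
    kept = []
    for i, rg in enumerate(row_groups):
        mn, mx = rg[column]
        if mx >= lo and mn <= hi:
            kept.append(i)
    return kept
```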
L4: Physical Execution Planning
Join ordering and build-side selection (the smaller input feeds the hash join's build side), aggregation strategy (hash-based vs sort-based), and partitioning strategy for distributed execution (data shuffling).
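Build-side selection is the easiest of these choices to show concretely. A textbook hash join builds an in-memory hash table over the smaller input and streams the larger input through it as the probe side; the sketch below is illustrative, not any engine's operator code.

```python
def hash_join(left, right, key):
    """Hash join with build-side selection: hash the smaller input,
    then stream the larger input through as the probe side."""
    build, probe = (left, right) if len(left) <= len(right) else (right, left)
    ht = {}
    for row in build:                       # build phase: small side in memory
        ht.setdefault(row[key], []).append(row)
    return [{**b, **p}                      # probe phase: stream the large side
            for p in probe
            for b in ht.get(p[key], [])]
```

Putting the large table on the build side instead would inflate the hash table's memory footprint, which is why join order matters even when the result is identical either way.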
L5: Adaptive Runtime Adjustment
Adaptive query execution — adjusting the plan at runtime based on actual cardinalities observed, switching join strategies mid-query when estimates were wrong.
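In miniature, adaptive execution is a re-decision at a runtime boundary: the planner commits to a strategy from estimated cardinality, then revisits it once a stage has produced actual row counts. The function name and the 10,000-row broadcast threshold below are invented for illustration.

```python
def join_strategy(estimated_rows, observed_rows=None, broadcast_limit=10_000):
    """Pick broadcast vs partitioned (shuffle) hash join. At planning time
    only the estimate exists; once a stage finishes, the observed count
    overrides it and the plan can switch mid-query."""
    rows = observed_rows if observed_rows is not None else estimated_rows
    return "broadcast" if rows <= broadcast_limit else "shuffle"
```

For example, a build side estimated at 5,000 rows is planned as a broadcast join, but if the stage actually emits 2 million rows, the engine falls back to a shuffle join rather than replicating 2 million rows to every node.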

Data Layout Optimizations
Beyond query planning, the physical layout of data on storage dramatically affects optimization effectiveness:
- Compaction: Merging many small files into fewer, optimally sized Parquet files reduces per-file open overhead and makes row group statistics more effective
- Z-Ordering: Sorting data by multiple frequently-filtered columns clusters related rows together, making column statistics highly selective and sharply reducing the number of row groups read for filtered queries
- Partition design: Choosing partition columns that match common query filters enables coarse-grained file elimination before statistics even need to be consulted
Data engineers who understand these layout principles can make queries 10–100x faster without any query rewriting — purely through table management operations.
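Z-ordering is usually described abstractly, but its core is simple bit interleaving: the bits of several column values are woven into one integer sort key, so rows that are close in any filtered dimension land in nearby row groups. A minimal two-column sketch (real implementations first map values of arbitrary types onto fixed-width unsigned integers):

```python
def z_value(x, y, bits=16):
    """Interleave the bits of two non-negative column values into a single
    Z-order sort key: x's bits occupy even positions, y's bits odd ones."""
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)      # bit i of x -> position 2i
        z |= ((y >> i) & 1) << (2 * i + 1)  # bit i of y -> position 2i + 1
    return z
```

Sorting a table by this key before writing means a filter on either column maps to a small number of contiguous key ranges, so min/max statistics prune aggressively for both columns, not just the leading sort column.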

Summary
Query optimization in the data lakehouse is a multi-layered discipline combining query planner intelligence (Reflection routing, join ordering, adaptive execution), metadata-level pruning (Iceberg manifest and file statistics), physical format optimization (predicate pushdown, column pruning), and data engineering practices (compaction, Z-ordering, partitioning). Organizations that understand and invest in all five levels of optimization consistently achieve sub-second analytics on petabyte-scale Iceberg tables — the performance standard that makes the open lakehouse competitive with proprietary cloud warehouses for demanding analytical workloads.