Managing Large-Scale Optimizations — Parallelism, Checkpointing, and Fail Recovery

When working with Apache Iceberg at scale, optimization jobs can become heavy and time-consuming. Rewriting thousands of files, scanning massive partitions, and coordinating metadata updates requires careful execution planning—especially in environments with limited compute or strict SLAs.

In this post, we’ll look at strategies for making compaction and metadata cleanup operations scalable, resilient, and efficient, including:

  • Tuning parallelism
  • Using partition pruning
  • Applying checkpointing for long-running jobs
  • Handling failures safely and automatically

Why Scaling Optimization Matters

As your Iceberg tables grow:

  • File counts increase
  • Partition cardinality rises
  • Manifest files balloon
  • Compaction jobs touch terabytes of data

Without scaling strategies:

  • Jobs may fail due to timeouts or memory errors
  • Optimization may lag behind ingestion
  • Query performance continues to degrade despite efforts

1. Leveraging Partition Pruning

Partition pruning ensures that only the parts of the table that need compaction are touched.

Use metadata tables to target only problem areas:

SELECT partition
FROM my_table.files
GROUP BY partition
HAVING COUNT(*) > 20 AND AVG(file_size_in_bytes) < 100000000; -- many files averaging under ~100 MB

You can then pass this list to a compaction job to limit the scope of the rewrite.
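For example, here is a minimal sketch that feeds one of those problem partitions into Spark's rewrite_data_files procedure via its where filter. The catalog name (my_catalog), table (db.my_table), and partition column (event_date) are illustrative placeholders.

# Scope the rewrite to a single problem partition identified by the query above
spark.sql("""
    CALL my_catalog.system.rewrite_data_files(
      table   => 'db.my_table',
      where   => 'event_date = "2024-01-15"',
      options => map('min-input-files', '5')
    )
""")

Looping over the full list of flagged partitions keeps each rewrite small and easy to retry.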

2. Tuning Parallelism

Large optimization jobs should run with enough parallel tasks to distribute I/O and computation.

In Spark:

  • Use spark.sql.shuffle.partitions to increase default parallelism

  • Tune executor memory and cores to handle larger partitions

  • Use .option("partial-progress.enabled", "true") for better resilience in Iceberg rewrite actions

import org.apache.iceberg.spark.actions.SparkActions

spark.conf.set("spark.sql.shuffle.partitions", "200")

// Rewrite small files, committing progress incrementally as file groups finish
SparkActions.get(spark)
  .rewriteDataFiles(table)
  .option("min-input-files", "5")
  .option("partial-progress.enabled", "true")
  .execute()
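If you drive compaction from PySpark rather than Scala, the same knobs can be set when the session is built. The resource values below are purely illustrative; size them to your cluster and typical partition sizes.

from pyspark.sql import SparkSession

# Illustrative resource settings -- tune to your cluster and partition sizes
spark = (
    SparkSession.builder
    .appName("iceberg-compaction")
    .config("spark.sql.shuffle.partitions", "200")  # default shuffle parallelism
    .config("spark.executor.memory", "8g")          # headroom for large partitions
    .config("spark.executor.cores", "4")            # cores per executor
    .getOrCreate()
)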

In Flink:

  • Use fine-grained task managers

  • Enable incremental compaction and checkpointing (a PyFlink checkpointing sketch follows)
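Here is a minimal PyFlink sketch of the checkpointing side, with illustrative intervals; the compaction itself would still be triggered through your Flink Iceberg maintenance job.

from pyflink.datastream import StreamExecutionEnvironment, CheckpointingMode

env = StreamExecutionEnvironment.get_execution_environment()
env.enable_checkpointing(60_000)  # checkpoint every 60s (illustrative)
env.get_checkpoint_config().set_checkpointing_mode(CheckpointingMode.EXACTLY_ONCE)
env.get_checkpoint_config().set_min_pause_between_checkpoints(30_000)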

3. Incremental and Windowed Compaction

Don’t try to compact the entire table at once. Instead:

  • Group partitions into batches

  • Use rolling windows (e.g., compact N partitions per hour)

  • Resume from the last successfully compacted partition on failure

You can build this logic into orchestration tools like Airflow or Dagster; a minimal sketch of the windowed pattern follows.
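In this sketch, the helpers list_candidate_partitions, load_watermark, save_watermark, and compact_partition are hypothetical; you would implement them against your catalog, your orchestrator's state store, and the scoped rewrite shown earlier.

BATCH_SIZE = 10  # partitions per run -- tune to ingestion rate and SLA

def run_compaction_window():
    # The helpers below are hypothetical placeholders for your own implementations
    last_done = load_watermark()  # last successfully compacted partition, or None
    candidates = sorted(p for p in list_candidate_partitions()
                        if last_done is None or p > last_done)
    for partition in candidates[:BATCH_SIZE]:
        compact_partition(partition)  # e.g. a rewrite scoped with a WHERE filter
        save_watermark(partition)     # persist progress so a failed run resumes here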

4. Checkpointing and Partial Progress

Iceberg supports partial progress mode in Spark:

.option("partial-progress.enabled", "true")

This allows successfully compacted partitions to commit, even if others fail—making retries cheaper and safer.
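The same behavior is available through the SQL procedure interface. A hedged sketch, with illustrative catalog and table names, that also caps how many incremental commits one run may produce via partial-progress.max-commits:

spark.sql("""
    CALL my_catalog.system.rewrite_data_files(
      table   => 'db.my_table',
      options => map(
        'partial-progress.enabled',     'true',
        'partial-progress.max-commits', '10'
      )
    )
""")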

In Flink, this is handled more granularly via stateful streaming checkpointing.

5. Retry and Failover Strategies

Wrap compaction logic in robust retry mechanisms:

  • Use exponential backoff

  • Separate retries by partition

  • Alert on repeated failures for human intervention

For example, in Airflow:

from datetime import timedelta

from airflow.operators.python import PythonOperator

# run_compaction is your own callable that triggers the Iceberg rewrite for one partition
PythonOperator(
    task_id="compact_partition",
    python_callable=run_compaction,
    retries=3,
    retry_delay=timedelta(minutes=5),
    retry_exponential_backoff=True,  # back off between attempts
)

Also consider:

  • Writing logs to object storage for audit

  • Emitting metrics to Prometheus/Grafana for observability (a minimal sketch follows)
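As one way to wire up that last point, here is a minimal sketch using the prometheus_client library and a Pushgateway; the gateway address, metric names, and the variables produced by your compaction wrapper are all illustrative.

from prometheus_client import CollectorRegistry, Gauge, push_to_gateway

registry = CollectorRegistry()
duration = Gauge("iceberg_compaction_duration_seconds",
                 "Wall-clock duration of the compaction run", registry=registry)
rewritten = Gauge("iceberg_compaction_files_rewritten",
                  "Data files rewritten in this run", registry=registry)

duration.set(job_duration_seconds)  # values computed by your compaction wrapper
rewritten.set(files_rewritten)

# Batch jobs push (rather than expose) metrics because they are short-lived
push_to_gateway("pushgateway:9091", job="iceberg_compaction", registry=registry)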

6. Monitoring Job Health

Track:

  • Job duration

  • Files rewritten vs skipped

  • Failed partitions

  • Number of manifests reduced

  • Snapshot size pre- and post-job

These metrics help tune parameters and detect regressions over time. Most of them can be read directly from Iceberg's metadata tables, as sketched below.
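A minimal sketch of pulling a few of these numbers from Iceberg's metadata tables; the table name is illustrative, and the summary keys shown are standard Iceberg snapshot counters.

# The latest snapshot's summary map carries counters such as 'added-data-files',
# 'deleted-data-files', and 'total-data-files' that show what the rewrite touched
latest = spark.sql("""
    SELECT operation, summary
    FROM my_table.snapshots
    ORDER BY committed_at DESC
    LIMIT 1
""").collect()[0]
print(latest["operation"], latest["summary"])

# Current number of live data files -- compare before and after a compaction run
file_count = spark.sql("SELECT COUNT(*) AS c FROM my_table.files").collect()[0]["c"]
print(f"data files: {file_count}")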

Summary

Scaling Iceberg optimization jobs requires thoughtful execution planning:

  • Use metadata to limit scope

  • Tune parallelism to avoid resource waste

  • Use partial progress and checkpointing to survive failure

  • Automate retries and monitor outcomes

In the final post of this series, we’ll bring it all together—showing how to build a fully autonomous optimization pipeline using orchestration, metadata triggers, and smart defaults.