Tune Spark by choosing the right partition count, using broadcast joins for small tables, caching reused DataFrames, avoiding unnecessary shuffles, and storing data in Parquet format.
Spark Performance Tuning
# 1. Partitioning strategy
# Too few partitions: cores sit idle and the cluster is underutilised
# Too many partitions: per-task scheduling overhead dominates the work
# Rule of thumb: 2-4 partitions per CPU core
df = df.repartition(200, "user_id") # full shuffle; hash-partitions rows by user_id
df = df.coalesce(10) # reduce the partition count without a full shuffle (merges existing partitions)
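# A minimal sketch of the 2-4 partitions-per-core rule above. suggest_partitions and
# repartition_for_cluster are hypothetical helper names, and per_core=3 is an assumed
# midpoint of that range; only DataFrame.repartition is Spark API.

```python
def suggest_partitions(total_cores, per_core=3):
    """Return a partition count of per_core partitions for every CPU core."""
    if total_cores < 1:
        raise ValueError("total_cores must be at least 1")
    if not 2 <= per_core <= 4:
        raise ValueError("per_core should stay within the 2-4 range")
    return total_cores * per_core


def repartition_for_cluster(df, key, total_cores):
    """Hash-partition df by key into a count sized for the cluster (full shuffle)."""
    return df.repartition(suggest_partitions(total_cores), key)
```

# For example, 16 executors with 4 cores each give 64 cores, so roughly 192 partitions.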
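# 2. Broadcast join (for small tables, roughly <200MB, that fit in driver and executor memory)
# Without the hint, Spark only auto-broadcasts tables below spark.sql.autoBroadcastJoinThreshold (default 10MB)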
from pyspark.sql.functions import broadcast
result = large_df.join(broadcast(small_lookup_df), "product_id")
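# Why a broadcast join avoids a shuffle, as a plain-Python sketch (an illustration, not
# Spark's implementation): the small table becomes an in-memory lookup that every task
# holds a copy of, so rows of the large table are joined where they already sit.
# broadcast_hash_join and the sample rows are hypothetical.

```python
def broadcast_hash_join(large_rows, small_rows, key):
    """Inner-join two lists of dicts on key by hashing the small side."""
    # Build phase: index the small table once; Spark would ship this copy to every executor.
    lookup = {}
    for row in small_rows:
        lookup.setdefault(row[key], []).append(row)
    # Probe phase: stream the large table; its rows never move between partitions.
    joined = []
    for row in large_rows:
        for match in lookup.get(row[key], []):
            joined.append({**match, **row})
    return joined


orders = [
    {"order_id": 1, "product_id": "p1"},
    {"order_id": 2, "product_id": "p2"},
    {"order_id": 3, "product_id": "p9"},  # no matching product, dropped by the inner join
]
products = [
    {"product_id": "p1", "name": "keyboard"},
    {"product_id": "p2", "name": "mouse"},
]
joined_rows = broadcast_hash_join(orders, products, "product_id")
```

# The cost is memory: the lookup is held once per executor, which is why this only
# suits a genuinely small table.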
# 3. Cache frequently used DataFrames
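df.cache() # lazy: filled on the first action; default level MEMORY_AND_DISK spills partitions that do not fit in memory to disk
df.persist() # same as cache() when called without a StorageLevel argument
df.unpersist() # free the cached data when done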
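# One way to make sure unpersist() always runs: a sketch of a hypothetical helper,
# run_cached, that persists a DataFrame, runs several actions against it, and releases
# it afterwards even if an action fails. It relies only on DataFrame.persist() and
# DataFrame.unpersist().

```python
def run_cached(df, *actions):
    """Persist df, apply each action to it, and always unpersist afterwards."""
    cached = df.persist()  # lazy: the first action below fills the cache
    try:
        return [action(cached) for action in actions]
    finally:
        cached.unpersist()  # release the cached data even if an action raised
```

# Usage, assuming df has a user_id column:
# total, users = run_cached(df, lambda d: d.count(), lambda d: d.select("user_id").distinct().count())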
# 4. Avoid shuffles when possible
# Shuffle-heavy: join (unless broadcast), groupBy, distinct, repartition
# Shuffle-free: filter, map, union, coalesce, broadcast join
# 5. Predicate pushdown (filter early)
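from pyspark.sql import functions as F
df = df.filter(F.col("date") == "2025-01-01") # filter right after the read so Parquet/ORC scans can skip non-matching data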
# 6. Use Parquet format (columnar, compressed)
df.write.parquet("output/", compression="snappy") # snappy is already the default Parquet codec
# 7. Tune memory and parallelism
# spark.executor.memory=8g
# spark.executor.cores=4
# spark.sql.shuffle.partitions=400 # default is 200; raise for large joins and aggregations
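# The settings above can be applied when the session is created. A sketch, assuming
# these values suit the cluster: TUNING_CONF, build_session, as_submit_flags, and the
# app name "tuned-job" are hypothetical. pyspark is imported inside build_session so
# the dict can also be rendered as spark-submit flags without it.

```python
TUNING_CONF = {
    "spark.executor.memory": "8g",
    "spark.executor.cores": "4",
    "spark.sql.shuffle.partitions": "400",
}


def build_session(app_name="tuned-job"):
    """Create a SparkSession with the tuning values applied."""
    from pyspark.sql import SparkSession

    # Executor settings only take effect if no session is already running;
    # getOrCreate() would otherwise return the existing one.
    builder = SparkSession.builder.appName(app_name)
    for conf_key, conf_value in TUNING_CONF.items():
        builder = builder.config(conf_key, conf_value)
    return builder.getOrCreate()


def as_submit_flags(conf):
    """Render the same settings as spark-submit arguments."""
    flags = []
    for conf_key, conf_value in conf.items():
        flags += ["--conf", f"{conf_key}={conf_value}"]
    return flags
```

# Keeping the values in one dict means the notebook session and the spark-submit job
# share the same settings.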