
Spark Performance Tuning


Tune Spark with the right partitioning, broadcast joins, caching, fewer shuffles, predicate pushdown, the Parquet format, and sensible memory settings.


# 1. Partitioning strategy
# Too few partitions: cores sit idle and the cluster is underutilised
# Too many partitions: per-task scheduling overhead dominates
# Rule of thumb: 2-4 partitions per CPU core

df = df.repartition(200, "user_id")   # hash partition by user_id (full shuffle)
df = df.coalesce(10)                  # reduce partition count without a full shuffle

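# 2. Broadcast join (for small tables, roughly <200MB)
# Spark only auto-broadcasts below spark.sql.autoBroadcastJoinThreshold
# (default 10MB); the broadcast() hint requests it explicitly for larger lookups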
from pyspark.sql.functions import broadcast
result = large_df.join(broadcast(small_lookup_df), "product_id")

# 3. Cache frequently used DataFrames
df.cache()      # lazy: filled on the first action; keeps data in memory and spills to disk if needed
df.persist()    # same as cache() when no storage level is passed
df.unpersist()  # free memory when done

# 4. Avoid shuffles when possible
# Shuffle-heavy (wide): join, groupBy, distinct, repartition
# Shuffle-free (narrow): filter, map, union, broadcast join

# 5. Predicate pushdown (filter early)
from pyspark.sql import functions as F
df = df.filter(F.col("date") == "2025-01-01")  # filter right after the read so it can be pushed to the file scan

# 6. Use Parquet format (columnar, compressed)
df.write.parquet("output/", compression="snappy")

# 7. Tune memory and parallelism
# spark.executor.memory=8g
# spark.executor.cores=4
# spark.sql.shuffle.partitions=400  # default is 200; raise for large joins
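
The settings in step 7 can also be derived from the cluster size instead of being hard-coded. The following is a minimal sketch, not a recommended configuration: `build_spark_conf` and `create_session` are hypothetical helpers (not part of Spark), the partitions-per-core factor comes from the rule of thumb in step 1, and the executor count and app name are illustrative assumptions.

```python
def build_spark_conf(num_executors, cores_per_executor=4,
                     executor_memory="8g", partitions_per_core=3):
    """Hypothetical helper: derive the step 7 settings from cluster size."""
    total_cores = num_executors * cores_per_executor
    return {
        "spark.executor.memory": executor_memory,
        "spark.executor.cores": str(cores_per_executor),
        # Rule of thumb from step 1: 2-4 partitions per CPU core.
        "spark.sql.shuffle.partitions": str(total_cores * partitions_per_core),
    }


def create_session(conf, app_name="tuning-demo"):
    """Apply the settings through SparkSession.builder.config."""
    # Imported lazily so build_spark_conf can be used without pyspark installed.
    from pyspark.sql import SparkSession

    builder = SparkSession.builder.appName(app_name)
    for key, value in conf.items():
        builder = builder.config(key, value)
    return builder.getOrCreate()
```

For example, `build_spark_conf(25, 4, "8g", 4)` describes 100 cores at 4 partitions per core, which gives the `spark.sql.shuffle.partitions=400` shown above. Executor memory and cores are fixed when the application starts, so on an existing cluster they are usually passed through `spark-submit --conf`; `spark.sql.shuffle.partitions` can still be changed at runtime.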


Topic Quiz · 1 question

Test your understanding before moving on

1. What does broadcasting a DataFrame in a Spark join do?

💡 Broadcasting copies the small DataFrame to every executor, so the large DataFrame is joined in place without an expensive shuffle. Use it when one side is under roughly 200MB.
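
To see why no shuffle of the large side is needed, the idea can be modelled in plain Python. This is a toy sketch and not Spark's implementation: `broadcast_hash_join` is a hypothetical function, each inner list stands in for one partition of the large DataFrame, and the rows are made-up sample data.

```python
from collections import defaultdict


def broadcast_hash_join(large_partitions, small_rows, key):
    """Toy inner join: the small side becomes a lookup table shared by every partition."""
    # Build the hash table once from the small side.
    # In Spark, this is the copy that is shipped to each executor.
    lookup = defaultdict(list)
    for row in small_rows:
        lookup[row[key]].append(row)

    joined = []
    for partition in large_partitions:
        # Each partition probes the lookup locally; its rows never move.
        out = []
        for row in partition:
            for match in lookup.get(row[key], []):
                out.append({**match, **row})
        joined.append(out)
    return joined


# Made-up sample data: two partitions of orders and a small product lookup.
orders = [
    [{"product_id": 1, "qty": 2}, {"product_id": 3, "qty": 1}],
    [{"product_id": 2, "qty": 5}],
]
products = [
    {"product_id": 1, "name": "pen"},
    {"product_id": 2, "name": "ink"},
]
result = broadcast_hash_join(orders, products, "product_id")
```

The output keeps the same two partitions as the input, which is the point: only the small table was copied, and the order with `product_id` 3 drops out because it has no match.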
