Medallion architecture: Bronze (raw) -> Silver (clean) -> Gold (aggregated). Partition by date so queries read only the dates they need.
Data Lake Architecture
# Data Lake layers (Medallion Architecture)
# Bronze: raw ingested data (immutable)
# Silver: cleaned and validated data
# Gold: business-ready aggregated data (marts)
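The three layers above can be sketched as a toy pipeline in plain Python. This is only an illustration under stated assumptions: the sample records, the field names (`order_id`, `amount`, `date`), and the helpers `to_silver` and `to_gold` are hypothetical and not part of any library.

```python
import json
from collections import defaultdict

# Bronze: raw events kept exactly as ingested (hypothetical sample records)
bronze = [
    '{"order_id": 1, "amount": "19.99", "date": "2025-01-15"}',
    '{"order_id": 2, "amount": "bad", "date": "2025-01-15"}',
    '{"order_id": 3, "amount": "5.01", "date": "2025-01-16"}',
]

def to_silver(raw_lines):
    """Parse and validate raw records; drop rows that fail validation."""
    clean = []
    for line in raw_lines:
        rec = json.loads(line)
        try:
            rec["amount"] = float(rec["amount"])
        except ValueError:
            continue  # invalid amount: excluded from Silver, still present in Bronze
        clean.append(rec)
    return clean

def to_gold(silver_rows):
    """Aggregate cleaned rows into monthly sales totals."""
    totals = defaultdict(float)
    for rec in silver_rows:
        totals[rec["date"][:7]] += rec["amount"]  # key is 'YYYY-MM'
    return {month: round(total, 2) for month, total in totals.items()}

silver = to_silver(bronze)
gold = to_gold(silver)
```

Bronze is never modified here: the invalid row is filtered out on the way to Silver, so it can still be inspected or reprocessed later.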
# File organisation
s3://datalake/
  bronze/
    raw_orders/year=2025/month=01/day=15/
      orders_20250115_000001.json.gz
  silver/
    cleaned_orders/year=2025/month=01/day=15/
      part-00000.parquet
  gold/
    monthly_sales_summary/
      part-00000.parquet
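A small helper can keep the `year=/month=/day=` layout consistent across layers. The sketch below assumes the bucket name from the listing above. The function `partition_path` is a hypothetical name introduced for illustration.

```python
from datetime import date

def partition_path(layer: str, table: str, day: date, root: str = "s3://datalake") -> str:
    """Build a Hive-style partition prefix such as .../year=2025/month=01/day=15/."""
    return (
        f"{root}/{layer}/{table}/"
        f"year={day.year:04d}/month={day.month:02d}/day={day.day:02d}/"
    )

bronze_prefix = partition_path("bronze", "raw_orders", date(2025, 1, 15))
silver_prefix = partition_path("silver", "cleaned_orders", date(2025, 1, 15))
```

Zero-padding the month and day keeps the directories in chronological order when they are listed alphabetically.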
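# Partition pruning: read only the partitions a query needs
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.read.parquet('s3://datalake/silver/cleaned_orders/')
jan_2025 = df.filter(
    (F.col('year') == 2025) &
    (F.col('month') == 1)
)  # Spark scans only the year=2025/month=01 directories, not the whole dataset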
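To make the idea behind partition pruning concrete, here is a toy model in plain Python. It is not how Spark implements pruning internally. It only shows that a filter on partition columns can be answered from directory names alone, before any file is opened. The helpers `partition_values` and `prune` and the sample paths are hypothetical.

```python
import re

def partition_values(path):
    """Extract key=value directory segments from a Hive-style path as integers."""
    return {key: int(value) for key, value in re.findall(r"(\w+)=(\d+)", path)}

def prune(paths, **wanted):
    """Keep only paths whose partition values match every requested key."""
    return [
        p for p in paths
        if all(partition_values(p).get(k) == v for k, v in wanted.items())
    ]

paths = [
    "silver/cleaned_orders/year=2024/month=12/day=31/part-00000.parquet",
    "silver/cleaned_orders/year=2025/month=01/day=15/part-00000.parquet",
    "silver/cleaned_orders/year=2025/month=02/day=01/part-00000.parquet",
]

# Only the January 2025 file survives; the other two are never read.
selected = prune(paths, year=2025, month=1)
```

Note that `month=01` in the path is compared as the integer 1, which mirrors a filter written as `month == 1`.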
# File formats comparison
# Parquet: columnar, compressed, best for analytics
# Avro: row-based, schema evolution, good for streaming
# ORC: columnar, best for Hive, good compression
# Delta/Iceberg/Hudi: table formats that add ACID transactions on top of data files such as Parquet
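The row-based versus columnar distinction in the comparison above can be illustrated with plain Python structures. This is a conceptual sketch only: real Parquet, ORC, and Avro files add encoding, compression, and metadata that are not modelled here, and the sample records are hypothetical.

```python
# Row-based layout (Avro-like): each record is stored together
rows = [
    {"order_id": 1, "country": "DE", "amount": 20.0},
    {"order_id": 2, "country": "FR", "amount": 5.0},
    {"order_id": 3, "country": "DE", "amount": 12.5},
]

# Columnar layout (Parquet/ORC-like): each column is stored together
columns = {key: [row[key] for row in rows] for key in rows[0]}

# An analytic query over one column must walk every full record in the row layout...
total_from_rows = sum(row["amount"] for row in rows)

# ...but touches a single contiguous list in the columnar layout
total_from_columns = sum(columns["amount"])
```

Appending one new record is cheap in the row layout, which is part of why row formats suit streaming, while scanning a few columns out of many is cheaper in the columnar layout.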