
Data Pipeline Design

Choose among ETL, ELT, streaming, and Lambda architectures based on latency requirements and data volume.

Data Pipeline Architecture Patterns

| Pattern | Description | Use Case |
| --- | --- | --- |
| ETL | Extract, Transform, Load: transform before loading | Data warehouse loading |
| ELT | Extract, Load, Transform: transform after loading | Cloud DW (BigQuery, Snowflake) |
| Streaming | Continuous real-time processing | Fraud detection, IoT |
| Micro-batch | Small batches every 1-5 minutes | Near-real-time analytics |
| Lambda | Batch + streaming layers | Historical + real-time combined |
| Kappa | Streaming only, no separate batch layer | Simplified real-time architecture |

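To make the micro-batch row concrete, here is a minimal sketch that groups timestamped events into fixed windows. The function name `micro_batches`, the `(timestamp_seconds, payload)` event shape, and the 60-second default are assumptions chosen for illustration, not part of any particular framework.

```python
from collections import defaultdict


def micro_batches(events, window_seconds=60):
    """Group (timestamp_seconds, payload) events into fixed-size time windows.

    Hypothetical helper: a real micro-batch engine would also handle
    late data, checkpoints, and triggering on a schedule.
    """
    batches = defaultdict(list)
    for ts, payload in events:
        # Align each event to the start of the window it falls in.
        window_start = int(ts // window_seconds) * window_seconds
        batches[window_start].append(payload)
    # Return batches in time order so downstream steps process them sequentially.
    return [(start, batches[start]) for start in sorted(batches)]


events = [(0, "a"), (30, "b"), (61, "c"), (125, "d")]
for window_start, payloads in micro_batches(events, window_seconds=60):
    print(window_start, payloads)
```

Each window is processed as one small batch, which is why latency is bounded by the window length rather than by per-event processing time.
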
# ETL vs ELT decision
# ETL: transform on separate (often expensive) compute before loading; common with legacy DWs
# ELT: load raw data first, then transform inside the warehouse (modern cloud DWs)

# ELT pipeline
1. Extract raw data from sources -> S3 (data lake)
2. Load raw data into Snowflake staging tables
3. Transform with dbt (SQL models)
4. Expose mart tables to BI tools
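
The four ELT steps above can be sketched end to end with Python's built-in `sqlite3` module. This is only an illustration: an in-memory SQLite database stands in for Snowflake, a Python list stands in for files in S3, and the SQL statement plays the role a dbt model would. The table names `stg_orders` and `mart_customer_revenue` and the sample rows are hypothetical.

```python
import sqlite3

# Step 1 (extract): hypothetical raw records, as they might arrive from a source.
# Amounts are still strings because nothing has been transformed yet.
RAW_ORDERS = [
    ("o1", "alice", "1999"),
    ("o2", "bob", "500"),
    ("o3", "alice", "1001"),
]


def run_elt(raw_rows):
    conn = sqlite3.connect(":memory:")  # stand-in for the warehouse

    # Step 2 (load): land the raw data untouched in a staging table.
    conn.execute(
        "CREATE TABLE stg_orders (order_id TEXT, customer TEXT, amount_cents TEXT)"
    )
    conn.executemany("INSERT INTO stg_orders VALUES (?, ?, ?)", raw_rows)

    # Step 3 (transform): cast and aggregate with SQL inside the warehouse.
    conn.execute(
        """
        CREATE TABLE mart_customer_revenue AS
        SELECT customer,
               SUM(CAST(amount_cents AS INTEGER)) AS revenue_cents,
               COUNT(*) AS order_count
        FROM stg_orders
        GROUP BY customer
        """
    )

    # Step 4 (expose): BI tools would query the mart table, not the staging table.
    rows = conn.execute(
        "SELECT customer, revenue_cents, order_count "
        "FROM mart_customer_revenue ORDER BY customer"
    ).fetchall()
    conn.close()
    return rows


print(run_elt(RAW_ORDERS))
```

Because the raw rows stay in the staging table, the transform can be rewritten and rerun later without extracting from the sources again, which is the main practical difference from ETL.
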

# Streaming pipeline
Kafka -> Flink/Spark Streaming -> Kafka/ClickHouse
(continuous processing, typically sub-second latency)
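
As a rough sketch of what the processing stage in a streaming pipeline does, the generator below handles events one at a time and keeps a small amount of per-key state, in the spirit of the fraud-detection use case. It is plain Python, not Flink or Spark code: the iterable stands in for a Kafka topic, events are assumed to arrive in timestamp order, and `detect_bursts` with its thresholds is a hypothetical name chosen for illustration.

```python
from collections import defaultdict, deque


def detect_bursts(events, max_events=3, window_seconds=10):
    """Yield (card, ts) when a card has more than max_events in the last window_seconds.

    Assumes events are (timestamp_seconds, card_id) tuples in timestamp order.
    """
    recent = defaultdict(deque)  # per-card timestamps still inside the window
    for ts, card in events:  # each event is processed as soon as it arrives
        window = recent[card]
        window.append(ts)
        # Evict timestamps that have fallen out of the sliding window.
        while window and window[0] <= ts - window_seconds:
            window.popleft()
        if len(window) > max_events:
            yield (card, ts)


stream = [(0, "A"), (1, "A"), (2, "B"), (3, "A"), (4, "A"), (20, "A")]
for alert in detect_bursts(stream):
    print("possible fraud:", alert)
```

The key property is that results are emitted per event rather than per batch, so latency depends on processing time for a single event, not on a schedule.
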

# Lambda architecture
Batch layer: Spark -> HDFS -> BigQuery (historical accuracy)
Speed layer: Kafka -> Flink -> Redis (low latency recent)
Serving layer: merges both views for queries
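
The serving layer's merge step can be sketched as follows. This assumes both views hold additive counts per key and that the speed view only covers events after the last batch run, so nothing is counted twice. The function `merge_views` and the sample page counts are hypothetical.

```python
def merge_views(batch_view, speed_view):
    """Combine precomputed batch totals with recent increments from the speed layer."""
    merged = dict(batch_view)  # copy so the batch view is left untouched
    for key, delta in speed_view.items():
        merged[key] = merged.get(key, 0) + delta
    return merged


# Batch view: accurate totals recomputed from full history up to the last batch run.
batch_view = {"page_a": 1000, "page_b": 250}
# Speed view: low-latency counts for events that arrived since that run.
speed_view = {"page_a": 7, "page_c": 3}

print(merge_views(batch_view, speed_view))
```

When the next batch run completes, its output replaces the batch view and the corresponding entries in the speed view are discarded, which is how the batch layer eventually corrects any approximation made by the speed layer.
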
Topic Quiz · 1 question

Test your understanding before moving on

1. What is the Lambda architecture?
💡 Lambda architecture runs a batch layer (e.g. Spark) and a speed layer (e.g. Kafka with Flink) in parallel; a serving layer merges their views at query time.