Data Pipeline Design

ETL, ELT, streaming, micro-batch, Lambda, and Kappa are the main pipeline patterns. Choose among them based on latency requirements and data volume.

Data Pipeline Architecture Patterns

Pattern	Description	Use Case
ETL	Extract, Transform, Load: data is transformed before it is loaded	Data warehouse loading
ELT	Extract, Load, Transform: data is transformed after it is loaded	Cloud DW (BigQuery, Snowflake)
Streaming	Continuous real-time processing	Fraud detection, IoT
Micro-batch	Small batches every 1-5 minutes	Near-real-time analytics
Lambda	Batch + streaming layers	Historical + real-time combined
Kappa	Streaming only, no separate batch layer	Simplified real-time architecture
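
To make the "choose based on latency" advice concrete, here is a minimal sketch of a selection helper. The function name `choose_pattern`, its parameters, and the thresholds (5 seconds, 300 seconds) are assumptions made for illustration; only the 1-5 minute micro-batch range comes from the table above. Real decisions also weigh data volume, cost, and team experience.

```python
def choose_pattern(max_latency_seconds: float, needs_batch_recompute: bool = False) -> str:
    """Suggest a pipeline pattern from a latency budget.

    Thresholds are illustrative assumptions, not industry standards.
    """
    if max_latency_seconds < 0:
        raise ValueError("latency budget must be non-negative")
    if max_latency_seconds <= 5:
        # Seconds or less: a streaming engine is needed. If historical views
        # must also be recomputed in batch, Lambda adds a batch layer on top.
        return "Lambda" if needs_batch_recompute else "Streaming"
    if max_latency_seconds <= 300:
        # Up to about five minutes: small, frequent batches are usually enough.
        return "Micro-batch"
    # Anything slower can run as a scheduled batch job.
    return "Batch (ETL/ELT)"


print(choose_pattern(1))        # fraud-detection style budget
print(choose_pattern(120))      # near-real-time dashboard
print(choose_pattern(3600))     # hourly warehouse load
```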

# ETL vs ELT decision
# ETL: transform on separate compute before loading; common with legacy data warehouses
# ELT: load raw data first, then transform inside the warehouse; common with modern cloud warehouses
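
As a rough illustration of the ETL side of this decision, the sketch below cleans records in application code and loads only the cleaned rows. It uses an in-memory SQLite database as a stand-in for a warehouse; the `RAW_ORDERS` sample data, the `orders` table, and the function names are hypothetical.

```python
import sqlite3

# Hypothetical raw records as they might arrive from a source system.
RAW_ORDERS = [
    {"order_id": "1", "amount": "19.99", "country": "us"},
    {"order_id": "2", "amount": "5.00", "country": "DE"},
    {"order_id": "3", "amount": "not-a-number", "country": "us"},
]


def transform(row):
    """Clean one raw record; return None if it cannot be parsed."""
    try:
        return (int(row["order_id"]), float(row["amount"]), row["country"].upper())
    except ValueError:
        return None


def run_etl(conn, raw_rows):
    """Transform in Python first, then load only the clean rows."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders "
        "(order_id INTEGER PRIMARY KEY, amount REAL, country TEXT)"
    )
    clean = [t for t in map(transform, raw_rows) if t is not None]
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", clean)
    conn.commit()
    return len(clean)


conn = sqlite3.connect(":memory:")  # stand-in for a real warehouse
loaded = run_etl(conn, RAW_ORDERS)
print(f"loaded {loaded} of {len(RAW_ORDERS)} rows")
```

Because the transform runs before the load, the malformed third record never reaches the warehouse. The trade-off is that the raw value is lost unless it is archived separately.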

# ELT pipeline
1. Extract raw data from sources -> S3 (data lake)
2. Load raw data into Snowflake staging tables
3. Transform with dbt (SQL models)
4. Expose mart tables to BI tools
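
Steps 2 to 4 above can be sketched with an in-memory SQLite database standing in for Snowflake, and a plain SQL statement standing in for a dbt model. The table names `stg_orders` and `mart_revenue_by_country` and the sample rows are assumptions made for illustration.

```python
import sqlite3

# Hypothetical raw rows, as they might be extracted from a data lake file.
RAW_ROWS = [
    ("1", "19.99", "us"),
    ("2", "5.00", "de"),
    ("3", "7.50", "us"),
]

conn = sqlite3.connect(":memory:")  # stand-in for the warehouse

# Step 2: load raw data unchanged into a staging table (all TEXT columns).
conn.execute("CREATE TABLE stg_orders (order_id TEXT, amount TEXT, country TEXT)")
conn.executemany("INSERT INTO stg_orders VALUES (?, ?, ?)", RAW_ROWS)

# Step 3: transform inside the warehouse with SQL (the role a dbt model plays).
conn.execute(
    """
    CREATE TABLE mart_revenue_by_country AS
    SELECT UPPER(country) AS country,
           ROUND(SUM(CAST(amount AS REAL)), 2) AS revenue
    FROM stg_orders
    GROUP BY UPPER(country)
    """
)
conn.commit()

# Step 4: a BI tool would query the mart table.
mart = conn.execute(
    "SELECT country, revenue FROM mart_revenue_by_country ORDER BY country"
).fetchall()
for country, revenue in mart:
    print(country, revenue)
```

Since the raw rows stay in the staging table, the transformation can be changed and re-run later without re-extracting from the sources.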

# Streaming pipeline
Kafka -> Flink/Spark Streaming -> Kafka/ClickHouse
(continuous processing; latency is typically sub-second to a few seconds)
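
The core idea of a streaming job is to process each event as it arrives while keeping a small amount of state. The sketch below shows this with a sliding-window burst detector in plain Python, in the spirit of the fraud detection use case. It does not use Kafka or Flink; the `detect_bursts` function, its parameters, and the sample events are hypothetical, and events are assumed to arrive in timestamp order.

```python
from collections import defaultdict, deque


def detect_bursts(events, window_seconds=60, max_events=3):
    """Yield (timestamp, key) whenever a key exceeds max_events within the window.

    `events` is any iterable of (timestamp, key) pairs ordered by timestamp.
    """
    recent = defaultdict(deque)  # key -> timestamps inside the current window
    for ts, key in events:
        window = recent[key]
        window.append(ts)
        # Evict timestamps that have fallen out of the sliding window.
        while window and window[0] <= ts - window_seconds:
            window.popleft()
        if len(window) > max_events:
            yield ts, key


# Hypothetical card transactions: (seconds since start, card id).
stream = [
    (0, "card-A"),
    (10, "card-A"),
    (20, "card-B"),
    (25, "card-A"),
    (30, "card-A"),   # fourth card-A event within 60 seconds
    (200, "card-A"),  # the earlier events have expired by now
]

alerts = list(detect_bursts(stream))
print(alerts)
```

A production engine adds what this sketch leaves out: partitioned state, handling of out-of-order events, and fault-tolerant checkpoints.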

# Lambda architecture
Batch layer: Spark -> HDFS -> BigQuery (complete, accurate historical views; high latency)
Speed layer: Kafka -> Flink -> Redis (low-latency views of recent data)
Serving layer: merges both views to answer queries
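
The serving-layer merge can be sketched as combining a precomputed batch view with the increments the speed layer has accumulated since the last batch run. The dictionaries below stand in for BigQuery and Redis; the `merge_views` function and the page-count data are hypothetical, and the sketch assumes the speed view only contains events after the batch cutoff, so nothing is counted twice.

```python
def merge_views(batch_view, speed_view):
    """Combine batch totals with recent increments from the speed layer."""
    merged = dict(batch_view)  # copy so the batch view is not mutated
    for key, delta in speed_view.items():
        merged[key] = merged.get(key, 0) + delta
    return merged


# Page-view counts computed by the last batch run.
batch_view = {"page_a": 1000, "page_b": 250}
# Counts from events that arrived after the batch cutoff.
speed_view = {"page_a": 12, "page_c": 3}

serving = merge_views(batch_view, speed_view)
print(serving)
```

When the next batch run completes, it absorbs those recent events and the speed view is reset, which is how the batch layer eventually corrects any approximation in the speed layer.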

Topic Quiz · 1 question

Test your understanding before moving on

1. What is the Lambda architecture?

💡 Lambda architecture runs a separate batch layer (e.g. Spark) and speed layer (e.g. Kafka/Flink), and a serving layer merges their views at query time.
