Data Engineering Interview Prep
Data engineering interview: ETL vs ELT, star schema, SCD, partitioning, Spark, Iceberg, idempotency, CDC.
Data Engineering Interview Topics
- ETL vs ELT -- ETL transforms before loading (legacy); ELT loads raw and transforms in warehouse (modern cloud DW)
- Star schema -- fact tables (events, many FKs) and dimension tables (lookups); optimised for analytics queries
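- SCD Type 2 -- slowly changing dimension that adds a new row for each change, with valid_from/valid_to columns, so full history is kept
- Partitioning -- split large tables by a commonly filtered column, typically date; queries that filter on it scan only the matching partitions, which cuts cost and latency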
- dbt ref() -- references another model; builds dependency graph for correct execution order
- Spark DAG -- lazy transformations build a Directed Acyclic Graph; actions trigger execution
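- Broadcast join -- send the small table to every executor; the large table is joined in place without being shuffled
- Iceberg vs Delta -- both add ACID transactions to a data lake; Iceberg is a vendor-neutral Apache project; Delta Lake is open source but originated at Databricks and is most tightly integrated there
- Idempotency -- rerunning the pipeline on the same input yields the same result; use MERGE or truncate+insert rather than blind appends
- CDC -- Change Data Capture reads the database transaction log (e.g. MySQL binlog, PostgreSQL WAL) with a tool such as Debezium to stream row-level changes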
- Great Expectations -- define data quality rules as code; fail pipeline on violation
- Data Mesh -- domain teams own their data products; central platform provides tooling
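Sketch for the SCD Type 2 bullet: a minimal in-memory version of the valid_from/valid_to pattern, in plain Python instead of warehouse SQL. The names DimRow, apply_scd2, city_on and the customer/city columns are made up for illustration; a real pipeline would usually do this with a MERGE statement or a dbt snapshot.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class DimRow:
    # Hypothetical customer dimension with one tracked attribute (city).
    customer_id: int
    city: str
    valid_from: date
    valid_to: Optional[date] = None  # None marks the current version


def apply_scd2(history: list[DimRow], customer_id: int, city: str, as_of: date) -> None:
    """Record a new attribute value, keeping earlier versions as closed rows."""
    current = next(
        (r for r in history if r.customer_id == customer_id and r.valid_to is None),
        None,
    )
    if current is not None:
        if current.city == city:
            return  # nothing changed, so no new version is written
        current.valid_to = as_of  # close the old version
    history.append(DimRow(customer_id, city, valid_from=as_of))


def city_on(history: list[DimRow], customer_id: int, day: date) -> Optional[str]:
    """Point-in-time lookup over half-open intervals [valid_from, valid_to)."""
    for r in history:
        if (
            r.customer_id == customer_id
            and r.valid_from <= day
            and (r.valid_to is None or day < r.valid_to)
        ):
            return r.city
    return None


history: list[DimRow] = []
apply_scd2(history, 1, "Berlin", date(2023, 1, 1))
apply_scd2(history, 1, "Munich", date(2024, 6, 1))
apply_scd2(history, 1, "Munich", date(2024, 7, 1))  # unchanged value, no new row
```

The point-in-time lookup is what Type 2 buys you: a fact dated 2023 still joins to the Berlin row even after the customer moved.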
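Sketch for the Idempotency bullet: a keyed upsert loaded twice leaves the table in the same state. This uses Python's built-in sqlite3, and SQLite's INSERT OR REPLACE stands in for a warehouse MERGE; the orders table and load_batch function are hypothetical.

```python
import sqlite3


def load_batch(conn: sqlite3.Connection, rows: list[tuple[int, str, float]]) -> None:
    """Upsert a batch keyed on order_id so a re-run overwrites instead of duplicating."""
    conn.executemany(
        "INSERT OR REPLACE INTO orders (order_id, status, amount) VALUES (?, ?, ?)",
        rows,
    )
    conn.commit()


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (order_id INTEGER PRIMARY KEY, status TEXT, amount REAL)"
)

batch = [(1, "shipped", 20.0), (2, "pending", 35.5)]
load_batch(conn, batch)
load_batch(conn, batch)  # simulated retry of the same batch

rows_after_retry = conn.execute(
    "SELECT order_id, status, amount FROM orders ORDER BY order_id"
).fetchall()
```

A plain INSERT into a table without a key would have produced four rows after the retry; the primary key plus upsert is what makes the load safe to re-run.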
Topic Quiz · 1 question
Test your understanding before moving on
1. What is the main difference between dbt ref() and source() functions?
💡 Use source() for raw data coming from ingestion, ref() for transformed dbt models — they serve different layers.
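To make the distinction concrete, here is a toy Python model of how ref() edges can drive execution order while source() tables are treated as already loaded. It is an illustration only, not dbt's implementation: the model names, the dictionary layout, and build_order are all made up. It uses graphlib from the standard library (Python 3.9+).

```python
from graphlib import TopologicalSorter

# Hypothetical project: each model lists what it selects from.
# "source" entries are raw tables loaded by ingestion; "ref" entries are other models.
models = {
    "fct_orders": {"source": [], "ref": ["stg_orders", "stg_customers"]},
    "stg_orders": {"source": ["raw.orders"], "ref": []},
    "stg_customers": {"source": ["raw.customers"], "ref": []},
}


def build_order(models: dict) -> list[str]:
    """Return a build order in which every model comes after the models it refs."""
    # Only ref() edges enter the graph: sources already exist and are never built.
    graph = {name: set(deps["ref"]) for name, deps in models.items()}
    return list(TopologicalSorter(graph).static_order())


order = build_order(models)
```

In this sketch both staging models are ordered before fct_orders, and the raw tables never appear in the build order. In dbt itself, sources still show up in the lineage graph even though dbt does not build them.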