Data engineering is the discipline of designing, building, and maintaining the infrastructure and pipelines that collect, store, transform, and deliver data reliably at scale. The table below compares the data engineer with adjacent data roles.

| Role | Responsibility | Tools |
|---|---|---|
| Data Engineer | Build pipelines, manage data infrastructure | Spark, Airflow, dbt, Kafka |
| Data Analyst | Query and visualise data | SQL, Tableau, Looker |
| Data Scientist | Build ML models | Python, scikit-learn, PyTorch |
| MLOps Engineer | Deploy and monitor models | MLflow, Kubeflow, SageMaker |

# Data Engineering Stack

| Layer | Tools |
|---|---|
| Ingestion | Kafka, Fivetran, Airbyte, Debezium |
| Storage | S3, GCS, ADLS, HDFS |
| Processing | Apache Spark, Flink, dbt |
| Orchestration | Apache Airflow, Prefect, Dagster |
| Warehouse | Snowflake, BigQuery, Redshift, ClickHouse |
| Catalog | Apache Atlas, DataHub, Amundsen |
| Quality | Great Expectations, Soda, Monte Carlo |

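
To make the layers above concrete, here is a minimal sketch of how they fit together, written with the Python standard library only. It is an illustration, not how any of the listed tools work internally: the CSV string stands in for an ingestion source, an in-memory SQLite database stands in for a warehouse, and every name (`RAW_CSV`, `extract`, `transform`, `check_quality`, `load`, `run_pipeline`, the `orders` table) is hypothetical.

```python
import csv
import io
import sqlite3

# Hypothetical raw input standing in for an ingestion source.
RAW_CSV = """order_id,customer,amount
1,alice,19.99
2,bob,
3,carol,42.50
"""


def extract(raw):
    """Ingestion: read raw records from a source (here, a CSV string)."""
    return list(csv.DictReader(io.StringIO(raw)))


def transform(rows):
    """Processing: cast types and drop rows with a missing amount."""
    cleaned = []
    for row in rows:
        if not row["amount"]:
            continue  # skip incomplete records
        cleaned.append({
            "order_id": int(row["order_id"]),
            "customer": row["customer"],
            "amount": float(row["amount"]),
        })
    return cleaned


def check_quality(rows):
    """Quality: fail fast if basic expectations do not hold."""
    ids = [r["order_id"] for r in rows]
    if len(ids) != len(set(ids)):
        raise ValueError("duplicate order_id")
    if any(r["amount"] < 0 for r in rows):
        raise ValueError("negative amount")


def load(rows, conn):
    """Warehouse: write cleaned rows to a queryable table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders "
        "(order_id INTEGER PRIMARY KEY, customer TEXT, amount REAL)"
    )
    conn.executemany(
        "INSERT INTO orders VALUES (:order_id, :customer, :amount)", rows
    )
    conn.commit()


def run_pipeline(raw, conn):
    """Orchestration: run the stages in dependency order."""
    rows = transform(extract(raw))
    check_quality(rows)
    load(rows, conn)
    return len(rows)


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    print("rows loaded:", run_pipeline(RAW_CSV, conn))
```

In a production stack each function would typically be replaced by a dedicated tool from the corresponding layer, with the orchestrator handling scheduling, retries, and dependencies instead of a single function call.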