A practical, opinionated roadmap built from years of real data engineering experience across Azure, Databricks, Microsoft Fabric, AWS, and Snowflake.
Most roadmaps list every tool that exists. This one lists what you need to get hired and deliver at a senior level, based on what enterprise companies use, what shows up in interviews, and what makes production pipelines succeed or fail.
- Data types, comprehensions, generators, decorators
- File I/O (CSV, JSON, Parquet, Avro)
- `requests`, `boto3`, `azure-storage-blob`, `pandas`
- Virtual environments, packaging (`pyproject.toml`)
Practice: Build a REST API ingestion script that writes Parquet to local disk.
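As a starting point for the practice task, here is a minimal sketch of a paginated ingestion loop. The endpoint shape (a `page` query parameter returning a JSON array), the function names, and the output path are all hypothetical. It uses the standard-library `urllib` instead of `requests` to stay dependency-free, and the Parquet step assumes `pandas` with `pyarrow` installed.

```python
import json
import urllib.request
from typing import Callable, Iterator


def fetch_page(base_url: str, page: int) -> list[dict]:
    """Fetch one page of records.

    Assumes a hypothetical API that accepts ?page=N and returns a JSON
    array, which is empty once there is no more data.
    """
    with urllib.request.urlopen(f"{base_url}?page={page}", timeout=30) as resp:
        return json.load(resp)


def iter_records(
    get_page: Callable[[int], list[dict]], max_pages: int = 100
) -> Iterator[dict]:
    """Yield records page by page until an empty page or max_pages is reached."""
    for page in range(1, max_pages + 1):
        batch = get_page(page)
        if not batch:
            break
        yield from batch


def write_parquet(records: list[dict], path: str) -> None:
    """Write records to a local Parquet file (needs pandas and pyarrow)."""
    import pandas as pd  # imported lazily so the rest runs without pandas

    pd.DataFrame(records).to_parquet(path, index=False)


# Example wiring (hypothetical endpoint, not executed here):
# url = "https://api.example.com/orders"
# rows = list(iter_records(lambda p: fetch_page(url, p)))
# write_parquet(rows, "orders.parquet")
```

Passing the page fetcher in as a callable keeps the pagination logic testable without network access, which is worth practising early.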
- JOINs (inner, left, anti, semi)
- Window functions: `ROW_NUMBER`, `RANK`, `LAG`, `LEAD`, `SUM OVER`, `NTILE`
- CTEs, recursive CTEs
- Query optimisation: indexes, execution plans, partition pruning
- `MERGE` (upsert) statements
Practice: Solve 50 LeetCode SQL problems (medium difficulty).
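To make the window-function and CTE topics concrete, the sketch below runs a small query through Python's built-in `sqlite3` module (window functions need SQLite 3.25 or later). The `orders` table and its rows are invented for illustration. It deduplicates with `ROW_NUMBER`, then uses `LAG` and a running `SUM ... OVER`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (order_id INTEGER, customer TEXT, amount REAL, updated_at TEXT);
    INSERT INTO orders VALUES
        (1, 'alice', 10.0, '2024-01-01'),
        (1, 'alice', 12.0, '2024-01-03'),  -- later version of order 1
        (2, 'alice', 30.0, '2024-01-05'),
        (3, 'bob',   20.0, '2024-01-02');
""")

query = """
WITH ranked AS (
    -- Number the versions of each order, newest first
    SELECT *,
           ROW_NUMBER() OVER (PARTITION BY order_id ORDER BY updated_at DESC) AS rn
    FROM orders
),
latest AS (
    -- Keep only the newest version of each order
    SELECT order_id, customer, amount, updated_at FROM ranked WHERE rn = 1
)
SELECT order_id,
       customer,
       amount,
       LAG(amount) OVER (PARTITION BY customer ORDER BY updated_at) AS prev_amount,
       SUM(amount) OVER (PARTITION BY customer ORDER BY updated_at) AS running_total
FROM latest
ORDER BY customer, updated_at
"""

rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
# (1, 'alice', 12.0, None, 12.0)
# (2, 'alice', 30.0, 12.0, 42.0)
# (3, 'bob', 20.0, None, 20.0)
```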
- DataFrames, RDDs, Datasets
- Transformations vs Actions — lazy evaluation
- Partitioning, shuffling, broadcast joins
- Window functions in PySpark
- Performance tuning: AQE, Z-ORDER, salting for skew
- Delta Lake: MERGE, OPTIMIZE, VACUUM, Z-ORDER, CDF, time travel
Resource: PySpark Interview Prep
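Salting is easier to reason about with a toy model. The following is a pure-Python illustration of the idea, not Spark code: the data, the salt count, and the variable names are assumptions. In PySpark the same pattern would add a salt column to the skewed side and replicate the small side once per salt value before joining.

```python
import random
from collections import Counter

NUM_SALTS = 4  # assumed salt count; tune to the degree of skew

# Large, skewed fact side: the key 'hot' dominates.
facts = [("hot", i) for i in range(1000)] + [("cold", i) for i in range(10)]
# Small dimension side.
dims = {"hot": "Hot Customer", "cold": "Cold Customer"}

rng = random.Random(42)  # seeded so the run is repeatable

# 1. Add a random salt to every row on the skewed side.
salted_facts = [((key, rng.randrange(NUM_SALTS)), value) for key, value in facts]

# 2. Replicate the small side once per salt value so every salted key can match.
salted_dims = {
    (key, salt): name for key, name in dims.items() for salt in range(NUM_SALTS)
}

# 3. Join on the composite (key, salt), then drop the salt from the output.
joined = [
    (key, value, salted_dims[(key, salt)]) for (key, salt), value in salted_facts
]

# The hot key is now spread over NUM_SALTS join keys instead of one.
rows_per_join_key = Counter(join_key for join_key, _ in salted_facts)
print(rows_per_join_key)
```

The join result is identical to the unsalted join; only the distribution of rows per join key changes, which is what lets Spark spread the hot key across several tasks.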
- Bronze: raw landing, Auto Loader, schema evolution
- Silver: cleansing, deduplication, DQ checks, SCD Type 2
- Gold: aggregations, business entities, serving layer
Resource: Azure Data Engineering Project
- Star schema, snowflake schema
- Slowly Changing Dimensions (SCD1, SCD2, SCD3)
- Data Vault 2.0 basics (Hubs, Links, Satellites)
- Kimball vs Inmon approaches
Resource: Data Modeling and Design
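The mechanics of SCD Type 2 can be sketched in plain Python before moving to a Delta `MERGE`. The `DimRow` shape, the column names, and the `apply_scd2` helper below are illustrative assumptions, tracking a single attribute (`city`) per customer.

```python
from dataclasses import dataclass, replace
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class DimRow:
    customer_id: int
    city: str
    valid_from: date
    valid_to: Optional[date]  # None means "still current"
    is_current: bool


def apply_scd2(dim: list[DimRow], updates: dict[int, str], as_of: date) -> list[DimRow]:
    """Return a new dimension with SCD2 history applied for changed or new keys."""
    result: list[DimRow] = []
    seen: set[int] = set()  # keys that already have a current row
    for row in dim:
        new_city = updates.get(row.customer_id)
        if row.is_current:
            seen.add(row.customer_id)
        if row.is_current and new_city is not None and new_city != row.city:
            # Close the old version and open a new current one.
            result.append(replace(row, valid_to=as_of, is_current=False))
            result.append(DimRow(row.customer_id, new_city, as_of, None, True))
        else:
            # Historical rows and unchanged current rows pass through untouched.
            result.append(row)
    # Keys without a current row get a fresh current version.
    for customer_id, city in updates.items():
        if customer_id not in seen:
            result.append(DimRow(customer_id, city, as_of, None, True))
    return result


dim = [DimRow(1, "Leeds", date(2023, 1, 1), None, True)]
new_dim = apply_scd2(dim, {1: "York", 2: "Bath"}, date(2024, 6, 1))
for row in new_dim:
    print(row)
```

Customer 1 ends up with a closed "Leeds" row and a current "York" row, while customer 2 gets its first version. SCD1 would instead overwrite `city` in place and keep no history.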
| Tool | What to Learn |
|---|---|
| Azure Data Factory | Copy Activity, Data Flows, ForEach, triggers, ARM export, CI/CD |
| ADLS Gen2 | Hierarchical namespace, lifecycle policies, RBAC |
| Azure Databricks | Clusters, notebooks, jobs, Unity Catalog, DLT, DABs |
| Azure Synapse | Serverless SQL pool, dedicated pool, external tables |
| Microsoft Fabric | Lakehouses, Dataflow Gen2, Fabric Warehouse, OneLake |
| Azure Event Hubs | Partitions, consumer groups, Kafka compatibility |
Resource: Azure End-to-End Project | Microsoft Fabric Analytics
| Tool | What to Learn |
|---|---|
| AWS Glue | PySpark jobs, job bookmarks, Glue Catalog, crawlers |
| Amazon S3 | Partitioning strategies, lifecycle policies, event notifications |
| AWS Lambda | S3-triggered ETL, Kinesis consumer, Step Functions integration |
| Amazon Redshift | Distribution keys, sort keys, Redshift Spectrum, COPY command |
| Amazon Kinesis | Data Streams, Firehose, consumer libraries |
Resource: AWS Data Engineering Pipeline
| Topic | What to Learn |
|---|---|
| Snowflake | Virtual warehouses, resource monitors, Snowpipe, Streams & Tasks, Dynamic Tables |
| dbt Core | Models, sources, tests, macros, snapshots (SCD2), incremental models, state:modified |
| Analytics Engineering | Staging → Intermediate → Marts pattern, data contracts |
Resource: Snowflake + dbt Project
- DAG structure, task dependencies, XCom, Variables, Connections
- Branching (`BranchPythonOperator`), TaskGroups, dynamic DAGs
- Sensors (FileSensor, ExternalTaskSensor)
- Custom operators and hooks
- Docker Compose local setup
Resource: Apache Airflow Data Pipelines
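Before writing real DAGs, it helps to see how a scheduler turns task dependencies into a run order. This sketch uses the standard-library `graphlib` (Python 3.9+), not the Airflow API, and the task names are hypothetical. In Airflow itself the same graph would be declared with operators and the `>>` dependency syntax.

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline: each task maps to the set of tasks it depends on.
dependencies = {
    "extract_orders": set(),
    "extract_customers": set(),
    "transform": {"extract_orders", "extract_customers"},
    "dq_checks": {"transform"},
    "load_gold": {"dq_checks"},
}

sorter = TopologicalSorter(dependencies)
sorter.prepare()  # raises graphlib.CycleError if the graph has a cycle

waves = []
while sorter.is_active():
    # Tasks whose upstream tasks have all finished; these could run in parallel.
    ready = sorted(sorter.get_ready())
    waves.append(ready)
    print(f"wave {len(waves)}: {ready}")
    sorter.done(*ready)  # mark as succeeded so downstream tasks unlock

# wave 1: ['extract_customers', 'extract_orders']
# wave 2: ['transform']
# wave 3: ['dq_checks']
# wave 4: ['load_gold']
```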
- ADF ARM template export and parameterised deploy (dev → staging → prod)
- Databricks Asset Bundles (DABs) deployment
- dbt slim CI (`state:modified+`)
- Git branching strategy for data pipelines
- Azure DevOps / GitHub Actions YAML pipelines
Resource: Azure DevOps CI/CD
- Docker Compose for local Airflow, Kafka, Spark
- Multi-stage builds for dbt projects
- Container networking for data stack
Resource: Docker for Data Engineers
- Azure OpenAI API — embeddings, chat completions, function calling
- RAG (Retrieval-Augmented Generation) — chunk, embed, retrieve, generate
- Text-to-SQL — let business users query data in natural language
- LangChain / LangGraph — agent frameworks for data workflows
- Vector databases — Azure AI Search, ChromaDB, pgvector
Resource: Azure AI Data Engineering
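The chunk, embed, retrieve, generate flow of RAG can be walked through without any cloud service. In this toy sketch the "embedding" is a bag-of-words count standing in for a real embedding model, and the generate step stops at building the prompt. All function names and sample documents are assumptions made for illustration.

```python
import math
import re
from collections import Counter


def chunk(text: str, size: int = 8) -> list[str]:
    """Split text into chunks of `size` words (no overlap, for simplicity)."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


def embed(text: str) -> Counter:
    """Toy embedding: lowercase word counts. A real system would call an embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve(question: str, chunks: list[str], k: int = 1) -> list[str]:
    """Return the k chunks most similar to the question."""
    q = embed(question)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]


def build_prompt(question: str, context: list[str]) -> str:
    """The generate step would send this prompt to a chat model."""
    joined = "\n".join(context)
    return f"Answer using only this context:\n{joined}\n\nQuestion: {question}"


docs = [
    "The orders table is loaded nightly by the ADF pipeline",
    "Customer churn is defined as no purchase in 90 days",
]
question = "How is churn defined?"
top = retrieve(question, docs)
print(build_prompt(question, top))
```

In a production setup the `embed` call would go to an embedding endpoint and the chunks would live in a vector store such as the ones listed above; the retrieval logic keeps the same shape.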
- `exceptAll` row-level diff for pipeline validation
- Skew handling with salting
- SCD Type 2 with Delta MERGE
- Sessionisation with window functions
Resource: PySpark Interview Prep
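The semantics behind an `exceptAll` diff are those of a multiset difference: duplicates count. The sketch below mirrors that in pure Python with `collections.Counter`. The `except_all` helper and sample rows are illustrative; in PySpark the equivalent call would be `expected_df.exceptAll(actual_df)`.

```python
from collections import Counter


def except_all(left: list[tuple], right: list[tuple]) -> list[tuple]:
    """Rows in `left` not matched in `right`, keeping duplicates (multiset difference)."""
    diff = Counter(left) - Counter(right)  # Counter subtraction drops counts <= 0
    return sorted(diff.elements())


expected = [(1, "a"), (2, "b"), (2, "b"), (3, "c")]
actual = [(1, "a"), (2, "b"), (3, "x")]

missing = except_all(expected, actual)     # rows the pipeline failed to produce
unexpected = except_all(actual, expected)  # rows the pipeline should not have produced

print(missing)     # [(2, 'b'), (3, 'c')]
print(unexpected)  # [(3, 'x')]
```

Running the diff in both directions is what makes it a validation check: both results being empty means the two datasets match row for row, including duplicate counts.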
- Design a real-time fraud detection pipeline (< 100ms latency)
- Design a data lake for 100TB+ with cost optimisation
- Design a CI/CD strategy for 50+ ADF pipelines
- Design a metadata-driven ETL framework
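For the metadata-driven ETL question, one possible core is a registry of reusable steps driven by configuration rather than hand-written pipelines. The step names, config layout, and `run_pipeline` helper below are hypothetical; in practice the metadata might live in a control table or a YAML file.

```python
from typing import Any, Callable

Row = dict[str, Any]


def drop_nulls(rows: list[Row], column: str) -> list[Row]:
    """Remove rows where `column` is missing or None."""
    return [r for r in rows if r.get(column) is not None]


def dedupe(rows: list[Row], key: str) -> list[Row]:
    """Keep one row per key; later rows win, first-seen key order is preserved."""
    return list({r[key]: r for r in rows}.values())


def rename(rows: list[Row], old: str, new: str) -> list[Row]:
    """Rename column `old` to `new` in every row."""
    return [{(new if k == old else k): v for k, v in r.items()} for r in rows]


# Registry: the names here are what the metadata refers to.
STEPS: dict[str, Callable[..., list[Row]]] = {
    "drop_nulls": drop_nulls,
    "dedupe": dedupe,
    "rename": rename,
}

# Hypothetical pipeline metadata for one source table.
PIPELINE_CONFIG = [
    {"step": "drop_nulls", "args": {"column": "id"}},
    {"step": "dedupe", "args": {"key": "id"}},
    {"step": "rename", "args": {"old": "amt", "new": "amount"}},
]


def run_pipeline(rows: list[Row], config: list[dict]) -> list[Row]:
    """Apply each configured step in order; unknown step names raise KeyError."""
    for entry in config:
        step = STEPS[entry["step"]]
        rows = step(rows, **entry.get("args", {}))
    return rows


raw = [
    {"id": 1, "amt": 10},
    {"id": None, "amt": 5},
    {"id": 1, "amt": 12},
    {"id": 2, "amt": 7},
]
print(run_pipeline(raw, PIPELINE_CONFIG))
# [{'id': 1, 'amount': 12}, {'id': 2, 'amount': 7}]
```

Onboarding a new source then means adding a config entry instead of writing a new pipeline, which is the property interviewers usually probe for.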
| Cert | Value |
|---|---|
| DP-203: Azure Data Engineer Associate | High — widely recognised |
| Databricks Certified Data Engineer Associate | High — vendor specific, practical |
| DP-600: Fabric Analytics Engineer Associate | High — very new, low competition |
| DEA-C01: AWS Data Engineer Associate | Medium — good for AWS shops |
| SnowPro Core | Medium — useful for Snowflake-heavy roles |
| Project | Stack | Link |
|---|---|---|
| Azure End-to-End Pipeline | ADF · Databricks · Synapse | → |
| Databricks Lakehouse | DLT · Unity Catalog · Streaming | → |
| Microsoft Fabric Analytics | Fabric · OneLake · Power BI | → |
| Netflix Data Pipeline | Azure · Databricks · Airflow | → |
| AWS Serverless Pipeline | Glue · Lambda · Redshift | → |
| Snowflake + dbt ELT | Snowflake · dbt · Streams | → |
| Apache Airflow DAGs | Airflow · dbt · Databricks | → |
| Azure AI for Data Eng | Azure OpenAI · LangChain · RAG | → |
| Azure DevOps CI/CD | ADF · DABs · dbt slim CI | → |
| PySpark Interview Prep | PySpark · Delta Lake | → |
Built by Naveen Donthula — Senior Data Engineer | ndonthula3@gmail.com