donthula9908/data-engineer-roadmap

Data Engineer Roadmap 2025

A practical, opinionated roadmap built from years of real data engineering experience across Azure, Databricks, Microsoft Fabric, AWS, and Snowflake.


Why This Roadmap?

Most roadmaps list every tool that exists. This one lists what you actually need to get hired and deliver at a senior level — based on what enterprise companies actually use, what shows up in interviews, and what makes production pipelines succeed or fail.


Phase 1 — Non-Negotiables (Weeks 1–6)

Python

  • Data types, comprehensions, generators, decorators
  • File I/O (CSV, JSON, Parquet, Avro)
  • requests, boto3, azure-storage-blob, pandas
  • Virtual environments, packaging (pyproject.toml)

Practice: Build a REST API ingestion script that writes Parquet to local disk.

SQL

  • JOINs (inner, left, anti, semi)
  • Window functions: ROW_NUMBER, RANK, LAG, LEAD, SUM OVER, NTILE
  • CTEs, recursive CTEs
  • Query optimisation: indexes, execution plans, partition pruning
  • MERGE (upsert) statements

Practice: Solve 50 LeetCode SQL problems (medium difficulty).
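
Two of the patterns above, sketched with Python's built-in `sqlite3` so they can be run locally: a `ROW_NUMBER` deduplication inside a CTE and an anti join written as `LEFT JOIN ... IS NULL`. The table names and rows are made up for illustration, and the window function assumes the bundled SQLite is version 3.25 or newer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (order_id INTEGER, customer_id INTEGER, amount REAL, updated_at TEXT);
INSERT INTO orders VALUES
  (1, 10, 50.0, '2025-01-01'),
  (1, 10, 55.0, '2025-01-03'),
  (2, 10, 20.0, '2025-01-02'),
  (3, 11, 70.0, '2025-01-02');
CREATE TABLE customers (customer_id INTEGER, name TEXT);
INSERT INTO customers VALUES (10, 'Ada'), (11, 'Lin'), (12, 'Sam');
""")

# Deduplicate: keep only the most recently updated row per order_id.
latest = conn.execute("""
WITH ranked AS (
  SELECT order_id, amount,
         ROW_NUMBER() OVER (PARTITION BY order_id ORDER BY updated_at DESC) AS rn
  FROM orders
)
SELECT order_id, amount FROM ranked WHERE rn = 1 ORDER BY order_id
""").fetchall()

# Anti join: customers that have no matching order at all.
no_orders = conn.execute("""
SELECT c.name
FROM customers c
LEFT JOIN orders o ON o.customer_id = c.customer_id
WHERE o.customer_id IS NULL
""").fetchall()
```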


Phase 2 — Core Data Engineering (Weeks 7–16)

Apache Spark / PySpark

  • DataFrames, RDDs, Datasets (the typed Dataset API is Scala/Java only; PySpark works with DataFrames)
  • Transformations vs Actions — lazy evaluation
  • Partitioning, shuffling, broadcast joins
  • Window functions in PySpark
  • Performance tuning: AQE, Z-ORDER, salting for skew
  • Delta Lake: MERGE, OPTIMIZE, VACUUM, Z-ORDER, CDF, time travel

Resource: PySpark Interview Prep
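
Salting for skew is easier to reason about with a small simulation. The sketch below is plain Python, not PySpark: `partition_of` uses CRC32 as a deterministic stand-in for a shuffle partitioner (Spark's real partitioner differs), and the salt is derived from the row index rather than a random number so the run is reproducible. In PySpark the same idea is usually expressed by adding a salt column to the large side and replicating the small side once per salt value.

```python
import zlib
from collections import Counter

NUM_PARTITIONS = 4
SALT_BUCKETS = 4


def partition_of(key: str) -> int:
    """Deterministic stand-in for a shuffle's hash partitioner (an assumption)."""
    return zlib.crc32(key.encode()) % NUM_PARTITIONS


# Skewed fact side: 90 rows for one hot key, 10 rows spread over other keys.
facts = [("hot", i) for i in range(90)] + [(f"k{i}", i) for i in range(10)]
dims = {"hot": "Hot Co", **{f"k{i}": f"Cust {i}" for i in range(10)}}

# Unsalted: every "hot" row hashes to the same partition.
unsalted = Counter(partition_of(k) for k, _ in facts)

# Salted: the large side joins on key#salt, the small side is replicated per salt.
salted_facts = [(f"{k}#{i % SALT_BUCKETS}", v) for i, (k, v) in enumerate(facts)]
salted_dims = {f"{k}#{s}": name for k, name in dims.items() for s in range(SALT_BUCKETS)}
salted = Counter(partition_of(k) for k, _ in salted_facts)

# The join still matches every fact row; strip the salt to recover the key.
joined = [(k.split("#")[0], v, salted_dims[k]) for k, v in salted_facts]

print("largest partition, unsalted:", max(unsalted.values()))
print("largest partition, salted:  ", max(salted.values()))
```

The trade-off is visible in `salted_dims`: the small side grows by a factor of `SALT_BUCKETS`, so salting only pays off when the skewed side is much larger than the side being replicated.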

Medallion Architecture

  • Bronze: raw landing, Auto Loader, schema evolution
  • Silver: cleansing, deduplication, DQ checks, SCD Type 2
  • Gold: aggregations, business entities, serving layer

Resource: Azure Data Engineering Project
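
The three layers can be sketched in plain Python to make the responsibilities concrete. This is an illustration only: the rows, column names, and the `to_silver` / `to_gold` helpers are hypothetical, and a real implementation would use Spark and Delta tables rather than lists of dicts.

```python
from collections import defaultdict

# Bronze: raw landing, strings as received, duplicates and bad rows included.
bronze = [
    {"order_id": "1", "country": "de", "amount": "10.5", "ingested_at": "2025-01-01T10:00"},
    {"order_id": "1", "country": "DE", "amount": "12.0", "ingested_at": "2025-01-01T11:00"},
    {"order_id": "2", "country": "fr", "amount": "abc", "ingested_at": "2025-01-01T10:05"},
    {"order_id": "3", "country": "FR", "amount": "7.5", "ingested_at": "2025-01-01T10:10"},
]


def to_silver(rows):
    """Cast types, reject rows that fail the cast, keep the latest row per order_id."""
    latest, rejected = {}, []
    for row in rows:
        try:
            clean = {
                "order_id": int(row["order_id"]),
                "country": row["country"].upper(),
                "amount": float(row["amount"]),
                "ingested_at": row["ingested_at"],
            }
        except ValueError:
            rejected.append(row)  # DQ failure: keep for inspection, not for serving
            continue
        current = latest.get(clean["order_id"])
        if current is None or clean["ingested_at"] > current["ingested_at"]:
            latest[clean["order_id"]] = clean
    return list(latest.values()), rejected


def to_gold(rows):
    """Aggregate revenue per country for the serving layer."""
    totals = defaultdict(float)
    for row in rows:
        totals[row["country"]] += row["amount"]
    return dict(totals)


silver, rejected = to_silver(bronze)
gold = to_gold(silver)
```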

Data Modelling

  • Star schema, snowflake schema
  • Slowly Changing Dimensions (SCD1, SCD2, SCD3)
  • Data Vault 2.0 basics (Hubs, Links, Satellites)
  • Kimball vs Inmon approaches

Resource: Data Modeling and Design
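
SCD2 is the variant that trips most people up, so here is a minimal sketch of the logic in plain Python. Assumptions: `city` is the only tracked attribute, `9999-12-31` marks an open-ended row, and `apply_scd2` is a hypothetical helper. On a lakehouse the same close-and-insert logic is typically written as a MERGE.

```python
from datetime import date

OPEN_END = date(9999, 12, 31)  # conventional "still current" end date (an assumption)


def apply_scd2(dimension, incoming, load_date):
    """Close changed current rows and append new versions; unchanged rows stay as-is."""
    result = [dict(r) for r in dimension]  # copy so the input is not mutated
    current = {r["customer_id"]: r for r in result if r["is_current"]}
    for row in incoming:
        existing = current.get(row["customer_id"])
        if existing is not None and existing["city"] == row["city"]:
            continue  # tracked attribute unchanged: nothing to do
        if existing is not None:
            existing["valid_to"] = load_date  # close the old version
            existing["is_current"] = False
        result.append({
            "customer_id": row["customer_id"],
            "city": row["city"],
            "valid_from": load_date,
            "valid_to": OPEN_END,
            "is_current": True,
        })
    return result


dim = [
    {"customer_id": 1, "city": "Berlin", "valid_from": date(2024, 1, 1),
     "valid_to": OPEN_END, "is_current": True},
    {"customer_id": 2, "city": "Paris", "valid_from": date(2024, 1, 1),
     "valid_to": OPEN_END, "is_current": True},
]
updates = [
    {"customer_id": 1, "city": "Munich"},  # changed
    {"customer_id": 2, "city": "Paris"},   # unchanged
    {"customer_id": 3, "city": "Rome"},    # new
]
new_dim = apply_scd2(dim, updates, date(2025, 1, 1))
```

SCD1 would simply overwrite `city` in place, and SCD3 would keep a single `previous_city` column instead of a new row.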


Phase 3 — Cloud Platforms (Weeks 17–28)

Azure Data Engineering ☁️

| Tool | What to Learn |
| --- | --- |
| Azure Data Factory | Copy Activity, Data Flows, ForEach, triggers, ARM export, CI/CD |
| ADLS Gen2 | Hierarchical namespace, lifecycle policies, RBAC |
| Azure Databricks | Clusters, notebooks, jobs, Unity Catalog, DLT, DABs |
| Azure Synapse | Serverless SQL pool, dedicated pool, external tables |
| Microsoft Fabric | Lakehouses, Dataflow Gen2, Fabric Warehouse, OneLake |
| Azure Event Hubs | Partitions, consumer groups, Kafka compatibility |

Resource: Azure End-to-End Project | Microsoft Fabric Analytics

AWS Data Engineering 🌩️

| Tool | What to Learn |
| --- | --- |
| AWS Glue | PySpark jobs, job bookmarks, Glue Catalog, crawlers |
| Amazon S3 | Partitioning strategies, lifecycle policies, event notifications |
| AWS Lambda | S3-triggered ETL, Kinesis consumer, Step Functions integration |
| Amazon Redshift | Distribution keys, sort keys, Redshift Spectrum, COPY command |
| Amazon Kinesis | Data Streams, Firehose, consumer libraries |

Resource: AWS Data Engineering Pipeline

Snowflake + dbt ❄️

| Topic | What to Learn |
| --- | --- |
| Snowflake | Virtual warehouses, resource monitors, Snowpipe, Streams & Tasks, Dynamic Tables |
| dbt Core | Models, sources, tests, macros, snapshots (SCD2), incremental models, state:modified |
| Analytics Engineering | Staging → Intermediate → Marts pattern, data contracts |

Resource: Snowflake + dbt Project


Phase 4 — Orchestration & DevOps (Weeks 29–36)

Apache Airflow

  • DAG structure, task dependencies, XCom, Variables, Connections
  • BranchPythonOperator, TaskGroups, dynamic DAGs
  • Sensors (FileSensor, ExternalTaskSensor)
  • Custom operators and hooks
  • Docker Compose local setup

Resource: Apache Airflow Data Pipelines
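
To build intuition for DAG structure and task dependencies before touching Airflow itself, the scheduling idea can be sketched with the standard library's `graphlib`. This is not Airflow's API: the task names are hypothetical, and in an Airflow DAG the same dependencies would usually be declared between operators with `>>`.

```python
from graphlib import TopologicalSorter

# task -> set of upstream tasks it depends on (hypothetical task names)
dependencies = {
    "extract_orders": set(),
    "extract_customers": set(),
    "transform": {"extract_orders", "extract_customers"},
    "dq_checks": {"transform"},
    "load_gold": {"dq_checks"},
}


def run_order(deps):
    """Return batches of tasks in dependency order; tasks in one batch could run in parallel."""
    sorter = TopologicalSorter(deps)
    sorter.prepare()  # raises graphlib.CycleError if the graph is not acyclic
    batches = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        batches.append(ready)
        sorter.done(*ready)
    return batches


batches = run_order(dependencies)
for i, batch in enumerate(batches, start=1):
    print(f"wave {i}: {batch}")
```

The two extract tasks land in the same wave because neither depends on the other, which is the same reason a scheduler can run them concurrently.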

CI/CD for Data Engineering

  • ADF ARM template export and parameterised deploy (dev → staging → prod)
  • Databricks Asset Bundles (DABs) deployment
  • dbt slim CI (state:modified+)
  • Git branching strategy for data pipelines
  • Azure DevOps / GitHub Actions YAML pipelines

Resource: Azure DevOps CI/CD
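
The idea behind dbt slim CI (`state:modified+`) is to build only what changed plus everything downstream of it. The sketch below models that selection in plain Python as a conceptual aid; it is not how dbt implements it, and the checksums, model names, and `modified_plus` helper are hypothetical. In a real project the selection is typically done with something like `dbt build --select state:modified+ --state <path-to-prod-artifacts>`.

```python
def modified_plus(prod_checksums, pr_checksums, children):
    """Models that are new or changed in the PR, plus all of their descendants."""
    modified = {m for m, c in pr_checksums.items() if prod_checksums.get(m) != c}
    selected, stack = set(modified), list(modified)
    while stack:
        for child in children.get(stack.pop(), ()):
            if child not in selected:
                selected.add(child)
                stack.append(child)
    return selected


# Hypothetical checksums from the production run and from the pull request.
prod = {"stg_orders": "a1", "stg_customers": "b1", "int_orders": "c1",
        "fct_revenue": "d1", "dim_customers": "e1"}
pr = {**prod, "stg_orders": "a2"}  # only stg_orders was edited

# parent -> direct children in the model graph
children = {
    "stg_orders": ["int_orders"],
    "int_orders": ["fct_revenue"],
    "stg_customers": ["dim_customers"],
}

to_build = modified_plus(prod, pr, children)
```

Here `stg_customers` and `dim_customers` are skipped, which is where the CI time saving comes from.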

Docker

  • Docker Compose for local Airflow, Kafka, Spark
  • Multi-stage builds for dbt projects
  • Container networking for data stack

Resource: Docker for Data Engineers


Phase 5 — AI for Data Engineers (Weeks 37–40)

  • Azure OpenAI API — embeddings, chat completions, function calling
  • RAG (Retrieval-Augmented Generation) — chunk, embed, retrieve, generate
  • Text-to-SQL — let business users query data in natural language
  • LangChain / LangGraph — agent frameworks for data workflows
  • Vector databases — Azure AI Search, ChromaDB, pgvector

Resource: Azure AI Data Engineering
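
The chunk, embed, retrieve, generate loop can be sketched end to end without any cloud service. Everything here is a toy under stated assumptions: `embed` is a bag-of-words counter standing in for a real embedding model, the documents are invented, and the "generate" step stops at building the prompt that would be sent to a chat completion endpoint.

```python
import math
import re
from collections import Counter


def chunk(text, max_words=12):
    """Split text into fixed-size word windows."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def embed(text):
    """Toy embedding: lowercase token counts. A real system would call an embedding model."""
    return Counter(re.findall(r"[a-z0-9_]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(question, chunks, k=1):
    """Return the k chunks most similar to the question."""
    q = embed(question)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]


def build_prompt(question, context):
    """Assemble the grounded prompt that the generation step would receive."""
    joined = "\n".join(context)
    return f"Answer using only this context:\n{joined}\n\nQuestion: {question}"


docs = [
    "The orders table is refreshed hourly by the ingestion job.",
    "The customers dimension uses SCD Type 2 with valid_from and valid_to columns.",
    "Gold revenue aggregates are rebuilt nightly.",
]
chunks = [c for d in docs for c in chunk(d)]
question = "How often is the orders table refreshed?"
top = retrieve(question, chunks, k=1)
prompt = build_prompt(question, top)
```

Swapping `embed` for a real model and the list scan for a vector database changes the quality and scale, not the shape of the pipeline.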


Phase 6 — Interview Preparation

PySpark Scenarios

  • exceptAll row-level diff for pipeline validation
  • Skew handling with salting
  • SCD Type 2 with Delta MERGE
  • Sessionisation with window functions

Resource: PySpark Interview Prep
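
Sessionisation is a two-step window pattern: flag rows where the gap to the previous event exceeds a threshold, then take a running sum of the flags. The sketch below shows it with `sqlite3` so it runs anywhere (assuming SQLite 3.25 or newer); the table, the 30-minute gap, and the data are made up. The same `LAG` plus running `SUM` translates to PySpark window functions over a window partitioned by user and ordered by timestamp.

```python
import sqlite3

SESSION_GAP_S = 1800  # assumed threshold: a gap over 30 minutes starts a new session

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE events (user_id TEXT, ts INTEGER);  -- ts in seconds
INSERT INTO events VALUES
  ('u1', 0), ('u1', 600), ('u1', 3000), ('u1', 3100),
  ('u2', 100), ('u2', 5000);
""")

rows = conn.execute("""
WITH flagged AS (
  SELECT user_id, ts,
         -- First event per user has no previous row, so the comparison is NULL
         -- and the ELSE branch marks it as a new session.
         CASE WHEN ts - LAG(ts) OVER (PARTITION BY user_id ORDER BY ts) <= ?
              THEN 0 ELSE 1 END AS is_new_session
  FROM events
)
SELECT user_id, ts,
       SUM(is_new_session) OVER (PARTITION BY user_id ORDER BY ts) AS session_no
FROM flagged
ORDER BY user_id, ts
""", (SESSION_GAP_S,)).fetchall()
```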

System Design Questions

  • Design a real-time fraud detection pipeline (< 100ms latency)
  • Design a data lake for 100TB+ with cost optimisation
  • Design a CI/CD strategy for 50+ ADF pipelines
  • Design a metadata-driven ETL framework
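
For the metadata-driven ETL question, it helps to show the core idea in a few lines: one generic function, with per-source behaviour coming from configuration rather than code. The sketch below is an assumption-laden toy; `PIPELINES`, `READERS`, and `run_pipeline` are hypothetical names, and in practice the metadata often lives in a control table or a YAML file.

```python
import csv
import io
import json

# Hypothetical metadata: one entry per source, no source-specific code.
PIPELINES = [
    {"name": "orders", "format": "csv", "key": "order_id", "required": ["order_id", "amount"]},
    {"name": "customers", "format": "json", "key": "customer_id", "required": ["customer_id"]},
]

# Reader registry: adding a new format means adding one entry here.
READERS = {
    "csv": lambda text: list(csv.DictReader(io.StringIO(text))),
    "json": lambda text: json.loads(text),
}


def run_pipeline(meta, raw_text):
    """Read, validate required columns, and deduplicate by key, all driven by `meta`."""
    rows = READERS[meta["format"]](raw_text)
    valid = [r for r in rows if all(r.get(col) not in (None, "") for col in meta["required"])]
    deduped = {r[meta["key"]]: r for r in valid}  # last row wins per key
    return list(deduped.values())


# Made-up raw payloads standing in for files in a landing zone.
raw = {
    "orders": "order_id,amount\n1,10\n1,12\n2,\n",
    "customers": '[{"customer_id": 7}, {"customer_id": null}]',
}
results = {meta["name"]: run_pipeline(meta, raw[meta["name"]]) for meta in PIPELINES}
```

In an interview, the follow-up is usually about what else belongs in the metadata: schedules, watermarks for incremental loads, target tables, and DQ rules.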

Certifications Worth Getting

| Cert | Value |
| --- | --- |
| DP-203: Azure Data Engineer Associate | High: widely recognised (Microsoft retired this exam on 31 March 2025, so check current availability before booking) |
| Databricks Certified Data Engineer Associate | High: vendor specific, practical |
| DP-600: Fabric Analytics Engineer Associate | High: very new, low competition |
| DEA-C01: AWS Data Engineer Associate | Medium: good for AWS shops |
| SnowPro Core | Medium: useful for Snowflake-heavy roles |

My Projects (All Free, All Code)

| Project | Stack |
| --- | --- |
| Azure End-to-End Pipeline | ADF · Databricks · Synapse |
| Databricks Lakehouse | DLT · Unity Catalog · Streaming |
| Microsoft Fabric Analytics | Fabric · OneLake · Power BI |
| Netflix Data Pipeline | Azure · Databricks · Airflow |
| AWS Serverless Pipeline | Glue · Lambda · Redshift |
| Snowflake + dbt ELT | Snowflake · dbt · Streams |
| Apache Airflow DAGs | Airflow · dbt · Databricks |
| Azure AI for Data Eng | Azure OpenAI · LangChain · RAG |
| Azure DevOps CI/CD | ADF · DABs · dbt slim CI |
| PySpark Interview Prep | PySpark · Delta Lake |

Built by Naveen Donthula — Senior Data Engineer | ndonthula3@gmail.com

About

Practical Data Engineer Roadmap 2025 — Azure, Databricks, Fabric, AWS, Snowflake, dbt, Airflow, AI — built from 6+ years experience
