What you'll learn

Build a production-grade 7-layer data platform end-to-end with storage, compute, transform, stream, orchestrate, validate, and serve, using open-source tools.
Process millions of rows with PySpark, write distributed batch pipelines, read Spark execution plans, and tune slow joins with broadcast hints and AQE.
Build a Lakehouse with Apache Iceberg, ACID transactions, time-travel queries, snapshot management, and painless schema evolution on object storage.
Model analytical data with dbt, layered staging→marts projects, automated tests, generated docs, and lineage DAGs that analysts can trust.
Stream events with Kafka and Flink, build a real-time fraud detection consumer and stateful tumbling and sliding window aggregations using PyFlink.
Orchestrate pipelines with Airflow, author DAGs, manage dependencies, pass data with XCom, and add retry/alert logic to a nightly end-to-end ETL.
Enforce data quality with Great Expectations and ship self-service dashboards in Apache Superset on top of a curated DuckDB analytical mart.
Build a RAG pipeline on your warehouse: embed policy docs with SentenceTransformers, index them in ChromaDB, and ground OpenAI API answers in the retrieved documents.
Use Claude Code as a pair engineer: design prompting strategies and learn file-context patterns, anchored in a trust-but-verify discipline.
Walk away with an end-to-end portfolio project, a runnable GitHub repo of the full data platform to demo in interviews and link from your résumé.

Course content

5 sections • 50 lectures • 4h 57m total length

Week 0: Orientation and Structure (5:02)
Week 0: Overview and Premise (4:12)
### Objectives
By the end of this course, students will be capable of:
* **Understanding** the definition and scope of modern data engineering, and how it differs from data science and software engineering.
* **Familiarizing** with the pedagogical philosophy of the course: learning concepts over specific tools, and understanding trade-offs.
* **Learning** the structure of the 12-week curriculum and how the six modules build upon one another sequentially.
* **Exploring** the concept of the "Data Engineering Lifecycle" (generation, storage, ingestion, transformation, serving).
* **Gaining** an appreciation for the non-technical skills required for success, such as DataOps, documentation, and stakeholder management.
* **Knowing** how to prepare your local development environment for the practical labs that accompany each chapter.

Week 1 - The Architects of Data (4:39)
### Objectives
* **Understanding** the distinct role and responsibilities of a data engineer within a modern data organisation.
* **Familiarizing** with the Four V's of Big Data (Volume, Velocity, Variety, Veracity) and why they present unique engineering challenges.
* **Learning** about the historical evolution of big data processing, from vertical scaling to the Hadoop revolution and MapReduce paradigm.
* **Exploring** the shift from on-premises Hadoop clusters to decoupled, cloud-native data architectures.
* **Gaining** an appreciation for the modern data engineering stack, including ingestion, storage, processing, and orchestration layers.
* **Knowing** the key differences between data engineering, data science, and data analysis roles.

Week 2 - Modern Data Storage Architectures (5:07)
### Objectives
* **Understanding** the architectural shift from on-premises HDFS to cloud object storage and why it matters.
* **Familiarizing** with the dominant columnar file formats — Apache Parquet and Apache ORC — and their performance advantages over row-based formats.
* **Learning** the trade-offs between different serialisation formats including Avro, JSON, and CSV.
* **Exploring** the landscape of NoSQL databases and their appropriate use cases in data engineering.
* **Gaining** hands-on experience benchmarking CSV versus Parquet in a practical Python demonstration.
* **Knowing** how to select the right storage format and system for a given data engineering use case.

Week 3 - Distributed Processing with Spark (5:01)
### Objectives
* **Understanding** why Apache Spark was created and how it overcomes the limitations of Hadoop MapReduce.
* **Familiarizing** with the Spark cluster architecture, including the Driver, Executors, and Cluster Manager.
* **Learning** the difference between RDDs (Resilient Distributed Datasets) and the higher-level DataFrame API.
* **Exploring** the concept of lazy evaluation and the distinction between transformations and actions.
* **Gaining** hands-on experience installing PySpark and running a complete batch processing pipeline.
* **Knowing** how the Catalyst Optimizer automatically improves query performance.

Week 4 - Advanced Spark and Performance Optimisation (4:53)
### Objectives
* **Understanding** what data skew is, why it degrades performance, and how to diagnose it using the Spark UI.
* **Familiarizing** with the salting technique for resolving data skew in `groupBy` and `join` operations.
* **Learning** the difference between shuffle joins and broadcast joins, and when to use each.
* **Exploring** Adaptive Query Execution (AQE) and how it automatically optimises query plans at runtime.
* **Gaining** hands-on experience writing a complete, optimised end-to-end PySpark batch pipeline.
* **Knowing** how to use `.explain()` to inspect and interpret Spark's physical execution plan.
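
As a rough illustration of the salting idea named in the objectives above, here is a plain-Python sketch that needs no Spark installation. The key names, `NUM_SALTS`, and both helper functions are hypothetical and not taken from the course labs; in PySpark the same two-stage pattern would be written with DataFrame operations.

```python
from collections import Counter

NUM_SALTS = 4  # hypothetical number of salt buckets


def salted_key(key: str, row_index: int) -> str:
    """Append a salt so one hot key spreads across NUM_SALTS sub-keys."""
    return f"{key}#{row_index % NUM_SALTS}"


def two_stage_count(keys: list[str]) -> Counter:
    """Count rows per key by aggregating on salted keys, then unsalting."""
    # Stage 1: partial counts per salted key. In a distributed engine each
    # salted key could land on a different partition, spreading the load.
    partial = Counter(salted_key(k, i) for i, k in enumerate(keys))
    # Stage 2: strip the salt and combine the partial counts.
    final = Counter()
    for salted, count in partial.items():
        original, _, _ = salted.rpartition("#")
        final[original] += count
    return final


# A skewed toy dataset: one customer dominates the rows.
rows = ["customer_1"] * 1000 + ["customer_2"] * 3 + ["customer_3"] * 2
result = two_stage_count(rows)
```

The final counts match a direct count of `rows`; the difference is that no single intermediate key carries the whole hot customer.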

Week 5 - The Data Lakehouse Paradigm (4:50)
### Objectives
* **Understanding** the limitations of both traditional data warehouses and data lakes that motivated the Lakehouse architecture.
* **Familiarizing** with the three major open table formats — Apache Iceberg, Delta Lake, and Apache Hudi — and their key differentiating features.
* **Learning** how ACID transactions are implemented on top of immutable object storage using metadata layers.
* **Exploring** time travel queries and how they enable debugging, auditing, and accidental deletion recovery.
* **Gaining** hands-on experience writing to and querying an Apache Iceberg table using PySpark.
* **Knowing** how to perform schema evolution in a Lakehouse without downtime or data rewrites.

Week 6 - Data Transformation with dbt (4:25)
### Objectives
* **Understanding** the difference between ETL and ELT and articulating why ELT has become the dominant pattern in cloud data platforms.
* **Familiarizing** with how dbt brings software engineering best practices to SQL-based data transformations.
* **Learning** the staging-intermediate-mart layering pattern for organising dbt models.
* **Exploring** how dbt's `{{ ref() }}` function builds a dependency graph (DAG) that ensures models execute in the correct order.
* **Gaining** hands-on experience writing dbt models, defining YAML tests, and running a complete dbt project.
* **Knowing** how to use dbt's built-in documentation and lineage graph to communicate data models to stakeholders.
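
The dependency-graph idea behind `{{ ref() }}` can be sketched with Python's standard-library `graphlib`. This is only an analogy for how a DAG determines execution order, not dbt's own implementation, and the model names below are hypothetical.

```python
from graphlib import TopologicalSorter

# Hypothetical models mapped to the models they reference (their upstream
# dependencies), mimicking what ref() calls declare in a dbt project.
deps = {
    "stg_orders": set(),
    "stg_customers": set(),
    "int_orders_enriched": {"stg_orders", "stg_customers"},
    "mart_daily_revenue": {"int_orders_enriched"},
}

# static_order() yields each model only after all of its dependencies.
run_order = list(TopologicalSorter(deps).static_order())
```

Any valid ordering places both staging models before `int_orders_enriched` and the mart last, which is the guarantee a DAG-based runner relies on.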

Week 7 - Distributed Event Streaming with Kafka (4:20)
### Objectives
* **Understanding** why event streaming is necessary for real-time data use cases and how it fundamentally differs from batch processing.
* **Familiarizing** with Kafka's core architectural components: topics, partitions, offsets, brokers, and consumer groups.
* **Learning** how Kafka achieves fault tolerance through replication and how consumers track their position using offsets.
* **Exploring** the role of the Kafka Schema Registry in enforcing data contracts on event streams.
* **Gaining** hands-on experience writing a Kafka producer and consumer in Python using the `kafka-python` library.
* **Knowing** how to configure producers for durability (`acks=all`) and consumers for reliable offset management.

Week 8 - Stateful Stream Processing with Flink (7:22)
### Objectives
* **Understanding** why a dedicated stream processing engine like Flink is needed beyond what Kafka alone provides.
* **Familiarizing** with the difference between event time and processing time and explaining why event time is preferred for accurate analytics.
* **Learning** the three window types — tumbling, sliding, and session — and selecting the appropriate type for a given use case.
* **Exploring** how watermarks allow Flink to handle late-arriving data without waiting indefinitely.
* **Gaining** an understanding of Flink's checkpointing mechanism and how it guarantees exactly-once processing semantics.
* **Knowing** how to write a continuous streaming query using Flink SQL to detect anomalies in a payment stream.
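
To make the tumbling and sliding window types above concrete, the sketch below shows only the arithmetic of assigning an event to windows. It is plain Python, not Flink's API; the function names are hypothetical, and timestamps are assumed to be non-negative integer seconds.

```python
def tumbling_window(ts: int, size: int) -> tuple[int, int]:
    """Return the [start, end) tumbling window that contains timestamp ts."""
    start = ts - (ts % size)
    return (start, start + size)


def sliding_windows(ts: int, size: int, slide: int) -> list[tuple[int, int]]:
    """Return every [start, end) sliding window that contains timestamp ts."""
    windows = []
    # The latest window containing ts starts at the last slide boundary <= ts.
    start = ts - (ts % slide)
    # Walk backwards one slide at a time while the window still covers ts.
    while start > ts - size:
        windows.append((start, start + size))
        start -= slide
    return windows


# An event at second 125 with 60-second windows that slide every 30 seconds.
one = tumbling_window(125, 60)
many = sliding_windows(125, 60, 30)
```

An event belongs to exactly one tumbling window but to `size / slide` sliding windows when the slide divides the size evenly, which is why sliding aggregations cost more state.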

Week 9 - Pipeline Orchestration with Airflow (7:48)
### Objectives
* **Understanding** why time-based schedulers like cron are insufficient for complex data pipelines and what problems Airflow solves.
* **Familiarizing** with Airflow's core components: DAGs, Tasks, Operators, Sensors, and the Scheduler.
* **Learning** how to define a DAG in Python, specify task dependencies, and schedule it.
* **Exploring** how Sensors enable event-driven pipeline triggering based on external conditions.
* **Gaining** hands-on experience writing a production-grade Airflow DAG with retries, sensors, and Slack alerting.
* **Knowing** how to use the Airflow UI to monitor pipeline execution, inspect logs, and manually trigger or re-run tasks.

Week 10 - DataOps and Defence in Depth (7:28)
### Objectives
* **Understanding** the distinction between loud failures and silent failures in data pipelines, and why silent failures are more dangerous.
* **Familiarizing** with the concept of DataOps and how it applies DevOps principles to data engineering.
* **Learning** how to write data quality tests using Great Expectations and integrate them into an Airflow pipeline.
* **Exploring** the concept of a Data Contract and how it prevents schema-breaking changes from reaching production.
* **Gaining** an understanding of the five pillars of data observability (freshness, volume, schema, distribution, lineage).
* **Knowing** how to design a multi-layer data quality strategy combining testing, contracts, and observability.
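
The idea of turning silent failures into loud ones can be sketched as a small validation gate. This is plain Python rather than the Great Expectations API, and the column names (`order_id`, `amount`) and the function name are assumptions made for illustration.

```python
def validate_orders(rows: list[dict]) -> list[str]:
    """Return human-readable failures; an empty list means the batch passes."""
    failures = []
    # Volume check: an empty batch is a classic silent failure.
    if not rows:
        failures.append("volume: batch is empty")
    for i, row in enumerate(rows):
        # Completeness check: every order needs an identifier.
        if row.get("order_id") is None:
            failures.append(f"row {i}: order_id is null")
        # Validity check: amounts must be present and non-negative.
        amount = row.get("amount")
        if amount is None or amount < 0:
            failures.append(f"row {i}: invalid amount {amount!r}")
    return failures


good = [{"order_id": 1, "amount": 19.99}]
bad = [{"order_id": None, "amount": 5.0}, {"order_id": 2, "amount": -1}]
```

A pipeline step could raise an exception whenever the returned list is non-empty, so that bad data stops the run instead of flowing downstream unnoticed.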

Week 11 - The Last Mile: Insights Consumption (7:16)
### Objectives
* **Understanding** the role of the serving layer in a data platform and why it requires deliberate engineering design.
* **Familiarizing** with the semantic layer concept and how it ensures metric consistency across an organisation.
* **Learning** query optimisation techniques — pre-aggregation, materialised views, partitioning, and clustering — that make BI dashboards performant at scale.
* **Exploring** the distinction between batch dashboards and real-time operational dashboards, and the appropriate tool for each use case.
* **Gaining** hands-on experience building a multi-chart dashboard with Apache Superset.
* **Knowing** how to design a data serving architecture that balances performance, cost, and accessibility.

Week 12 - Data Engineering for AI (8:03)
### Objectives
* **Understanding** why data engineering is the foundational layer for AI and ML, and how data quality directly impacts model quality.
* **Familiarizing** with the concept of training-serving skew and how Feature Stores eliminate it.
* **Learning** the architecture of a Retrieval-Augmented Generation (RAG) pipeline and the role of embeddings and vector databases within it.
* **Exploring** the data engineering challenges specific to multimodal data — images, audio, and video.
* **Gaining** hands-on experience building a simple RAG pipeline using embeddings and a vector store.
* **Knowing** how to articulate the connections between all six modules of this course as a complete, end-to-end AI-ready data platform.
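
The retrieval step of a RAG pipeline reduces to nearest-neighbour search over embeddings. The sketch below uses hand-made three-dimensional vectors in place of real model embeddings and a dictionary in place of a vector database; the document names, vectors, and function names are all hypothetical.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy "embeddings": a real pipeline would get these from an embedding model.
documents = {
    "refund policy": [0.9, 0.1, 0.0],
    "shipping policy": [0.1, 0.9, 0.0],
    "privacy policy": [0.0, 0.1, 0.9],
}


def retrieve(query_vec: list[float], k: int = 1) -> list[str]:
    """Return the names of the k documents most similar to the query."""
    ranked = sorted(
        documents,
        key=lambda name: cosine_similarity(query_vec, documents[name]),
        reverse=True,
    )
    return ranked[:k]


# A query vector that points mostly in the "refund" direction.
top = retrieve([0.8, 0.2, 0.0], k=2)
```

The retrieved documents would then be placed in the prompt so the language model answers from them rather than from memory alone.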

Lab 1 - Project DataShop Initialisation (9:00)
Lab 2 - 14x Data Architecture Advantage (4:58)
Lab 3 - Distributed Processing with PySpark (5:59)
Lab 4 - Spark Performance Tuning (6:27)
Lab 5 - Upgrading DataShop to the Iceberg Lakehouse (4:16)
Lab 6 - Raw Data to Production Grade Analytics (6:44)
Lab 7 - Moving to the Millisecond (5:34)
Lab 8 - Realtime Fraud Detection (6:59)
Lab 9 - Nightly Data Orchestration (6:57)
Lab 10 - The Quality Gate (7:00)
Lab 11 - Building the Analytics Dashboard (10:02)
Lab 12 - The AI Serving Layer (6:57)

Overview and Companion Website (4:17)
Orientation 1.1 - System Onboarding and Paradigm Shift (4:56)
Orientation 1.2 - Files and Context in Claude Code (4:45)
Orientation 1.3 - Prompting Patterns in Claude Code (5:38)
Orientation 1.4 - Command, Agents and Memory (6:04)
Orientation 1.5 - Ready for Big Data Engineering (5:06)
Foundation 1 - Terminal Open. Ready to Build (5:11)
Foundation 2 - The DataShop Platform (3:56)
Foundation 3 - The DataShop Blueprint (8:33)
Foundation 4 - Signal in the Noise (3:53)
Foundation 5 - Data Engineering Environment Setup (5:23)
Lab 1 - DataShop Recapped (5:56)
Lab 2 - Analytical Storage (5:13)
Lab 3 - PySpark Engineering Blueprint (5:07)
Lab 4 - PySpark Performance Mastery (5:33)
Lab 5 - The Iceberg Paradigm (6:41)
Lab 6 - The Glass Refinery with dbt (5:36)
Lab 7 - Real-Time Streaming with Kafka (5:22)
Lab 8 - Stateful Streaming with Flink (5:14)
Lab 9 - Data Orchestration with Airflow (5:26)
Lab 10 - Quality Stress Test with Great Expectations (5:33)
Lab 11 - Self Service Analytics with Superset (6:32)
Lab 12 - The Intelligent Engine with AI (7:25)

Closing Remarks (8:37)
### Objectives
* **Understanding** how the six modules of the course integrate to form a complete, end-to-end modern data platform.
* **Familiarizing** with the macro-trends shaping the future of data engineering, including the convergence of batch and streaming, and the rise of AI-driven infrastructure.
* **Learning** the distinction between building pipelines and operating a data platform as a product.
* **Exploring** the non-technical competencies — stakeholder management, cost awareness, and domain knowledge — that distinguish senior data engineers.
* **Gaining** perspective on how to evaluate and adopt new tools without falling victim to hype cycles.
* **Knowing** how to continuously update your skills in a field characterized by rapid technological obsolescence.

Requirements

Basic Python. Write functions, use loops, work with lists and dicts. No prior PySpark, Flink, Kafka, dbt, or other course-tool experience needed.
Basic SQL. SELECT, JOIN, WHERE, GROUP BY, ORDER BY. We build on this with window functions, CTEs, and dbt-flavored analytical SQL during the course.
Command-line basics. Navigate directories, run commands, and edit files on macOS, Linux, or Windows (WSL2). If you've used `cd` and `ls`, you're ready.
A laptop with at least 8 GB RAM (16 GB recommended) and Docker Desktop installed. All 12 labs run locally, no cloud account, no cloud bill, ever.
Curiosity and persistence. Labs will break, errors will appear, and you'll learn the stack by debugging real failures.
Claude Code installed with an Anthropic API key or Claude subscription. Setup is covered in Foundation Module F4, no prior AI tooling experience required. (Optional, relevant for pair engineering only)
A code editor you're comfortable with. VS Code or Cursor recommended for first-class Claude Code integration, but any editor works fine. (Optional, relevant for pair engineering only)
Willingness to pair with an AI assistant. You don't need to be a power user; you just need to be open to a new way of building software. (Optional, relevant for pair engineering only)

Description

Data engineering is one of the fastest-growing roles in the technology industry, and this course is your complete, practical guide to mastering it.

Most data engineering courses teach tools in isolation. You learn Spark in one course, Kafka in another, and dbt somewhere else. By the end, you have a collection of disconnected skills but no idea how to wire them together into a real platform. This course is different.

Over 12 structured weeks, you will build a complete, production-grade data platform for DataShop, a fictional global e-commerce company processing 2 million orders per day. Every week, you add a new layer to the same platform: first the storage foundation, then the batch processing engine, then the Lakehouse, then real-time streaming, then orchestration and data quality, and finally analytics dashboards and an AI-powered assistant. By Week 12, you have not just learned the tools; you have built something that works end to end.

The course covers the full modern data stack: Apache Spark for distributed batch processing, Apache Kafka and Apache Flink for real-time event streaming, Apache Iceberg for the Data Lakehouse, dbt for version-controlled SQL transformations, Apache Airflow for pipeline orchestration, Great Expectations for data quality, Apache Superset for dashboards, and ChromaDB for Retrieval-Augmented Generation (RAG) AI pipelines.

Every chapter is paired with a standalone Practice Lab, a realistic, hands-on exercise grounded in the DataShop scenario. You will not be copying tutorial code; you will be solving engineering problems. All labs run locally using Docker, so there are no cloud costs.

What makes this course unlike anything else: Claude Code, while optional, may be used as your pair engineer the entire way. You'll learn the prompting patterns, file-context strategies, and trust-but-verify workflows that turn a six-hour debugging session into a forty-minute one. You'll read Spark execution plans together, refactor brittle DAGs together, and ship features faster than you thought possible, without skipping the fundamentals that make a senior engineer senior.

Whether you are a software engineer pivoting into data, a data analyst ready to build your own pipelines, or an aspiring data engineer who wants a rigorous, concept-first education, this course will give you the architecture, the code, and the confidence to build the modern data platform.

Who this course is for:

Software engineers and backend developers pivoting into data engineering who want to learn distributed systems, streaming, and the analytical-data world.
Data analysts and data scientists who want to build their own pipelines and stop waiting on engineering for every new dataset or dashboard refresh.
Aspiring data engineers with Python and SQL basics who want one rigorous, project-driven path into the field, not 47 disconnected YouTube tutorials.
Tech leads, staff engineers, and architects evaluating modern-stack tradeoffs: batch vs. streaming, Lakehouse vs. warehouse, ELT vs. ETL, and where AI fits.
ML engineers and AI builders who want a proper warehouse, real data quality, and a working RAG pattern they can adapt to their own LLM applications.
Bootcamp grads and self-taught engineers ready to bridge the gap between "I finished a course" and "I can build something a real company would run."
Career changers from analytics, finance, or operations with the Python and SQL basics who want a structured, end-to-end on-ramp into a data role.
Senior engineers exploring AI-paired workflows: the prompting patterns, file-context strategy, and trust-but-verify habits transfer to your day job.

Course content

Welcome and Introduction (2 lectures • 9min)

Concepts and Tools (12 lectures • 1hr 11min)

Practice Labs (12 lectures • 1hr 21min)

Practice Labs - Pair Engineering with Claude Code (23 lectures • 2hr 7min)

Conclusion (1 lecture • 9min)
