
### Objectives
By the end of this course, students will be capable of:
* **Understanding** the definition and scope of modern data engineering, and how it differs from data science and software engineering.
* **Familiarizing** with the pedagogical philosophy of the course: learning concepts over specific tools, and understanding trade-offs.
* **Learning** the structure of the 12-week curriculum and how the six modules build upon one another sequentially.
* **Exploring** the concept of the "Data Engineering Lifecycle" (generation, storage, ingestion, transformation, serving).
* **Gaining** an appreciation for the non-technical skills required for success, such as DataOps, documentation, and stakeholder management.
* **Knowing** how to prepare a local development environment for the practical labs that accompany each chapter.
### Objectives
* **Understanding** the distinct role and responsibilities of a data engineer within a modern data organisation.
* **Familiarizing** with the Four V's of Big Data (Volume, Velocity, Variety, Veracity) and why they present unique engineering challenges.
* **Learning** about the historical evolution of big data processing, from vertical scaling to the Hadoop revolution and MapReduce paradigm.
* **Exploring** the shift from on-premises Hadoop clusters to decoupled, cloud-native data architectures.
* **Gaining** an appreciation for the modern data engineering stack, including ingestion, storage, processing, and orchestration layers.
* **Knowing** the key differences between data engineering, data science, and data analysis roles.
### Objectives
* **Understanding** the architectural shift from on-premises HDFS to cloud object storage and why it matters.
* **Familiarizing** with the dominant columnar file formats — Apache Parquet and Apache ORC — and their performance advantages over row-based formats.
* **Learning** the trade-offs between different serialisation formats including Avro, JSON, and CSV.
* **Exploring** the landscape of NoSQL databases and their appropriate use cases in data engineering.
* **Gaining** hands-on experience benchmarking CSV versus Parquet in a practical Python demonstration.
* **Knowing** how to select the right storage format and system for a given data engineering use case.
### Objectives
* **Understanding** why Apache Spark was created and how it overcomes the limitations of Hadoop MapReduce.
* **Familiarizing** with the Spark cluster architecture, including the Driver, Executors, and Cluster Manager.
* **Learning** the difference between RDDs (Resilient Distributed Datasets) and the higher-level DataFrame API.
* **Exploring** the concept of lazy evaluation and the distinction between transformations and actions.
* **Gaining** hands-on experience installing PySpark and running a complete batch processing pipeline.
* **Knowing** how the Catalyst Optimizer automatically improves query performance.
### Objectives
* **Understanding** what data skew is, why it degrades performance, and how to diagnose it using the Spark UI.
* **Familiarizing** with the salting technique for resolving data skew in `groupBy` and `join` operations.
* **Learning** the difference between shuffle joins and broadcast joins, and when to use each.
* **Exploring** Adaptive Query Execution (AQE) and how it automatically optimises query plans at runtime.
* **Gaining** hands-on experience writing a complete, optimised end-to-end PySpark batch pipeline.
* **Knowing** how to use `.explain()` to inspect and interpret Spark's physical execution plan.
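
To make the salting idea concrete, here is a minimal pure-Python sketch of a two-stage aggregation. It is not Spark code: the `salted_sum` and `largest_group` helpers, the round-robin salt, and the sample records are all illustrative assumptions. The size of the largest `(key, salt)` group stands in for the work done by the slowest task.

```python
from collections import Counter, defaultdict


def salted_sum(records, num_salts):
    """Two-stage aggregation: sum by (key, salt), then merge by key."""
    # Stage 1: spread each key over num_salts sub-keys. A round-robin salt
    # keeps this sketch deterministic; a random salt is also common.
    partial = defaultdict(int)
    for i, (key, value) in enumerate(records):
        partial[(key, i % num_salts)] += value
    # Stage 2: drop the salt and combine the partial sums per original key.
    totals = defaultdict(int)
    for (key, _salt), subtotal in partial.items():
        totals[key] += subtotal
    return dict(totals)


def largest_group(records, num_salts):
    """Size of the biggest (key, salt) group, a proxy for the slowest task."""
    sizes = Counter((key, i % num_salts) for i, (key, _v) in enumerate(records))
    return max(sizes.values())


# Hypothetical skewed input: one hot key dominates the dataset.
records = [("hot_customer", 1)] * 900 + [(f"c{n}", 1) for n in range(100)]

totals = salted_sum(records, num_salts=8)
unsalted_max = largest_group(records, num_salts=1)
salted_max = largest_group(records, num_salts=8)
print(totals["hot_customer"], unsalted_max, salted_max)
```

The final totals are unchanged by the salt, while the largest single group shrinks from all 900 hot-key records to roughly one eighth of them. The cost is a second aggregation step.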
### Objectives
* **Understanding** the limitations of both traditional data warehouses and data lakes that motivated the Lakehouse architecture.
* **Familiarizing** with the three major open table formats — Apache Iceberg, Delta Lake, and Apache Hudi — and their key differentiating features.
* **Learning** how ACID transactions are implemented on top of immutable object storage using metadata layers.
* **Exploring** time travel queries and how they enable debugging, auditing, and accidental deletion recovery.
* **Gaining** hands-on experience writing to and querying an Apache Iceberg table using PySpark.
* **Knowing** how to perform schema evolution in a Lakehouse without downtime or data rewrites.
### Objectives
* **Understanding** the difference between ETL and ELT, and why ELT has become the dominant pattern in cloud data platforms.
* **Familiarizing** with how dbt brings software engineering best practices to SQL-based data transformations.
* **Learning** the staging-intermediate-mart layering pattern for organising dbt models.
* **Exploring** how dbt's `{{ ref() }}` function builds a dependency graph (DAG) that ensures models execute in the correct order.
* **Gaining** hands-on experience writing dbt models, defining YAML tests, and running a complete dbt project.
* **Knowing** how to use dbt's built-in documentation and lineage graph to communicate data models to stakeholders.
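
The way `ref()` calls imply an execution order can be sketched in plain Python. This is only an illustration of the idea, not how dbt is implemented: the regular expression (single-quoted model names only), the `build_order` helper, and the four model names are assumptions, and `graphlib` requires Python 3.9 or later.

```python
import re
from graphlib import TopologicalSorter

# Matches {{ ref('model_name') }} with single quotes only (a simplification).
REF_PATTERN = re.compile(r"\{\{\s*ref\(\s*'([^']+)'\s*\)\s*\}\}")


def build_order(models):
    """Return model names ordered so every model follows the models it refs."""
    # Map each model to the set of models it depends on.
    graph = {name: set(REF_PATTERN.findall(sql)) for name, sql in models.items()}
    return list(TopologicalSorter(graph).static_order())


# Hypothetical project following a staging -> intermediate -> mart layering.
models = {
    "mart_daily_revenue": (
        "select order_date, sum(amount) as revenue "
        "from {{ ref('int_orders_enriched') }} group by 1"
    ),
    "int_orders_enriched": (
        "select o.*, c.country from {{ ref('stg_orders') }} o "
        "join {{ ref('stg_customers') }} c on o.customer_id = c.id"
    ),
    "stg_orders": "select * from raw.orders",
    "stg_customers": "select * from raw.customers",
}

order = build_order(models)
print(order)
```

Even though the mart model is listed first in the dictionary, it is placed last in the order because both staging models and the intermediate model must exist before it can be built.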
### Objectives
* **Understanding** why event streaming is necessary for real-time data use cases and how it fundamentally differs from batch processing.
* **Familiarizing** with Kafka's core architectural components: topics, partitions, offsets, brokers, and consumer groups.
* **Learning** how Kafka achieves fault tolerance through replication and how consumers track their position using offsets.
* **Exploring** the role of the Kafka Schema Registry in enforcing data contracts on event streams.
* **Gaining** hands-on experience writing a Kafka producer and consumer in Python using the `kafka-python` library.
* **Knowing** how to configure producers for durability (`acks=all`) and consumers for reliable offset management.
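
Offset tracking can be modelled without a broker. The sketch below is a toy in-memory stand-in, not the `kafka-python` API: `PartitionLog`, `GroupConsumer`, and the record values are hypothetical names chosen for illustration. It shows a consumer that commits its position only after processing a batch, so a restarted consumer resumes where the previous one left off.

```python
class PartitionLog:
    """Append-only list standing in for a single topic partition."""

    def __init__(self):
        self._records = []

    def append(self, value):
        """Store a record and return the offset assigned to it."""
        self._records.append(value)
        return len(self._records) - 1

    def read_from(self, offset, max_records):
        """Return up to max_records records starting at the given offset."""
        return self._records[offset:offset + max_records]


class GroupConsumer:
    """Consumer that resumes from the last committed offset."""

    def __init__(self, log, committed=0):
        self.log = log
        self.committed = committed  # offset of the next record to read

    def poll_and_process(self, max_records, handler):
        batch = self.log.read_from(self.committed, max_records)
        for record in batch:
            handler(record)
        # Commit only after the whole batch was handled (at-least-once).
        self.committed += len(batch)
        return len(batch)


log = PartitionLog()
offsets = [log.append(f"order-{n}") for n in range(5)]

seen = []
first = GroupConsumer(log)
first.poll_and_process(max_records=3, handler=seen.append)

# Simulated restart: a new consumer instance picks up the committed offset.
second = GroupConsumer(log, committed=first.committed)
second.poll_and_process(max_records=10, handler=seen.append)
print(offsets, first.committed, second.committed, seen)
```

Committing after processing means that a crash in the middle of a batch leads to reprocessing rather than data loss, which is why downstream handlers are usually written to tolerate duplicates.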
### Objectives
* **Understanding** why a dedicated stream processing engine like Flink is needed beyond what Kafka alone provides.
* **Familiarizing** with the difference between event time and processing time, and why event time is preferred for accurate analytics.
* **Learning** the three window types (tumbling, sliding, and session) and how to select the appropriate type for a given use case.
* **Exploring** how watermarks allow Flink to handle late-arriving data without waiting indefinitely.
* **Gaining** an understanding of Flink's checkpointing mechanism and how it supports exactly-once state consistency, with end-to-end exactly-once delivery also depending on the sources and sinks used.
* **Knowing** how to write a continuous streaming query using Flink SQL to detect anomalies in a payment stream.
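
The three window types can be illustrated with a few lines of plain Python. This is a conceptual sketch, not Flink code: event times are assumed to be integer seconds, the function names are made up for this example, and boundary handling for session gaps is simplified.

```python
def tumbling_window(event_time, size):
    """Return the single fixed-size window [start, end) containing the event."""
    start = event_time - event_time % size
    return (start, start + size)


def sliding_windows(event_time, size, slide):
    """Return every overlapping window [start, end) containing the event."""
    windows = []
    start = event_time - event_time % slide  # latest window start <= event
    while start > event_time - size:
        windows.append((start, start + size))
        start -= slide
    return sorted(windows)


def session_windows(event_times, gap):
    """Group events into sessions that close after `gap` seconds of silence."""
    sessions = []
    for t in sorted(event_times):
        if sessions and t - sessions[-1][-1] < gap:
            sessions[-1].append(t)  # still within the current session
        else:
            sessions.append([t])  # inactivity gap exceeded: new session
    return [(s[0], s[-1] + gap) for s in sessions]


tumbling = tumbling_window(125, size=60)
sliding = sliding_windows(125, size=60, slide=30)
sessions = session_windows([0, 10, 25, 100, 105], gap=30)
print(tumbling, sliding, sessions)
```

An event at second 125 falls into exactly one tumbling window but two sliding windows, while session windows have no fixed length and are shaped entirely by gaps in activity.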
### Objectives
* **Understanding** why time-based schedulers like cron are insufficient for complex data pipelines and what problems Airflow solves.
* **Familiarizing** with Airflow's core components: DAGs, Tasks, Operators, Sensors, and the Scheduler.
* **Learning** how to define a DAG in Python, specify task dependencies, and schedule it.
* **Exploring** how Sensors enable event-driven pipeline triggering based on external conditions.
* **Gaining** hands-on experience writing a production-grade Airflow DAG with retries, sensors, and Slack alerting.
* **Knowing** how to use the Airflow UI to monitor pipeline execution, inspect logs, and manually trigger or re-run tasks.
### Objectives
* **Understanding** the distinction between loud failures and silent failures in data pipelines, and why silent failures are more dangerous.
* **Familiarizing** with the concept of DataOps and how it applies DevOps principles to data engineering.
* **Learning** how to write data quality tests using Great Expectations and integrate them into an Airflow pipeline.
* **Exploring** the concept of a Data Contract and how it prevents schema-breaking changes from reaching production.
* **Gaining** an understanding of the five pillars of data observability (freshness, volume, schema, distribution, lineage).
* **Knowing** how to design a multi-layer data quality strategy combining testing, contracts, and observability.
### Objectives
* **Understanding** the role of the serving layer in a data platform and why it requires deliberate engineering design.
* **Familiarizing** with the semantic layer concept and how it ensures metric consistency across an organisation.
* **Learning** query optimisation techniques — pre-aggregation, materialised views, partitioning, and clustering — that make BI dashboards performant at scale.
* **Exploring** the distinction between batch dashboards and real-time operational dashboards, and the appropriate tool for each use case.
* **Gaining** hands-on experience building a multi-chart dashboard with Apache Superset.
* **Knowing** how to design a data serving architecture that balances performance, cost, and accessibility.
### Objectives
* **Understanding** why data engineering is the foundational layer for AI and ML, and how data quality directly impacts model quality.
* **Familiarizing** with the concept of training-serving skew and how Feature Stores help prevent it.
* **Learning** the architecture of a Retrieval-Augmented Generation (RAG) pipeline and the role of embeddings and vector databases within it.
* **Exploring** the data engineering challenges specific to multimodal data — images, audio, and video.
* **Gaining** hands-on experience building a simple RAG pipeline using embeddings and a vector store.
* **Knowing** how to articulate the connections between all six modules of this course as a complete, end-to-end AI-ready data platform.
### Objectives
* **Understanding** how the six modules of the course integrate to form a complete, end-to-end modern data platform.
* **Familiarizing** with the macro-trends shaping the future of data engineering, including the convergence of batch and streaming, and the rise of AI-driven infrastructure.
* **Learning** the distinction between building pipelines and operating a data platform as a product.
* **Exploring** the non-technical competencies — stakeholder management, cost awareness, and domain knowledge — that distinguish senior data engineers.
* **Gaining** perspective on how to evaluate and adopt new tools without falling victim to hype cycles.
* **Knowing** how to continuously update your skills in a field characterized by rapid technological obsolescence.
Data engineering is one of the fastest-growing roles in the technology industry, and this course is your complete, practical guide to mastering it.
Most data engineering courses teach tools in isolation. You learn Spark in one course, Kafka in another, and dbt somewhere else. By the end, you have a collection of disconnected skills but no idea how to wire them together into a real platform. This course is different.
Over 12 structured weeks, you will build a complete, production-grade data platform for DataShop, a fictional global e-commerce company processing 2 million orders per day. Every week, you add a new layer to the same platform: first the storage foundation, then the batch processing engine, then the Lakehouse, then real-time streaming, then orchestration and data quality, and finally analytics dashboards and an AI-powered assistant. By Week 12, you will have done more than learn the tools: you will have built something that works end to end.
The course covers the full modern data stack: Apache Spark for distributed batch processing, Apache Kafka and Apache Flink for real-time event streaming, Apache Iceberg for the Data Lakehouse, dbt for version-controlled SQL transformations, Apache Airflow for pipeline orchestration, Great Expectations for data quality, Apache Superset for dashboards, and ChromaDB for Retrieval-Augmented Generation (RAG) AI pipelines.
Every chapter is paired with a standalone Practice Lab, a realistic, hands-on exercise grounded in the DataShop scenario. You will not be copying tutorial code; you will be solving engineering problems. All labs run locally using Docker, so there are no cloud costs.
One distinctive feature of this course: Claude Code is optional, but you can use it as your pair engineer the entire way. You'll learn the prompting patterns, file-context strategies, and trust-but-verify workflows that can turn a six-hour debugging session into a forty-minute one. You'll read Spark execution plans together, refactor brittle DAGs together, and ship features faster than you thought possible, without skipping the fundamentals that make a senior engineer senior.
Whether you are a software engineer pivoting into data, a data analyst ready to build your own pipelines, or an aspiring data engineer who wants a rigorous, concept-first education, this course will give you the architecture, the code, and the confidence to build the modern data platform.