Data Engineering Masterclass for Beginners

Rating: 4.5 (1935 reviews)

Master Hadoop, Spark with PySpark & Scala, AWS Glue, Databricks, Delta Lake, and NiFi. Build real projects and ETL pipelines.

Created by FutureX Skills

Last updated 12/2025

English

What you'll learn

Big Data, Hadoop, and Spark from scratch by solving a real-world use case using Python and Scala
Spark Scala & PySpark real-world coding framework
Real-world coding best practices: logging, error handling, and configuration management using both Scala and Python
Serverless big data solution using AWS Glue, Athena and S3

Course content

19 sections • 170 lectures • 16h 31m total length

Introduction (2:32)
What is Data Engineering (1:16)
Learn how data engineering underpins the data-driven world by building pipelines in the cloud using Hadoop, Spark, Databricks, and NiFi to collect, store, and process real-time data for analytics.

Big Data concepts (6:05)
Hadoop concepts (1:44)
Hadoop Distributed File System (HDFS) (1:59)
Explore how HDFS distributes large data across a cluster using 128 MB blocks and threefold replication on data nodes for fault tolerance, and how to set up a cluster on Google Cloud.
Understanding Google Cloud (GCP) Dataproc (1:27)
Signing up for a Google Cloud free trial (1:44)
Creating a Dataproc Cluster (3:34)
Create a production-grade data cluster on Google Cloud Dataproc with one master and two workers, enable the Dataproc API, and access the master via SSH.
Storing a file in HDFS (9:45)
MapReduce and YARN (4:21)
Hive (1:29)
Querying HDFS data using Hive (11:43)
Analyzing a billion records with Hive (12:14)
Fast queries with Hive Partitioning (9:29)
Fast queries with Hive Bucketing (2:29)
Big Data and Hadoop Concepts Revision + Interview Questions (8:03)
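
The HDFS lecture in this section quotes 128 MB blocks and threefold replication. As a rough illustration of what those two numbers imply, here is a minimal plain-Python sketch. The function name `hdfs_footprint` and the 1000 MB example file are assumptions made for illustration; they are not taken from the course material.

```python
import math

BLOCK_SIZE_MB = 128      # block size quoted in the HDFS lecture description
REPLICATION_FACTOR = 3   # "threefold replication" from the same description


def hdfs_footprint(file_size_mb):
    """Estimate block count and raw cluster storage for a single file.

    The last block of a file only occupies the bytes it actually holds,
    so raw storage is the file size times the replication factor.
    """
    blocks = math.ceil(file_size_mb / BLOCK_SIZE_MB)
    return {
        "blocks": blocks,
        "block_replicas": blocks * REPLICATION_FACTOR,
        "raw_storage_mb": file_size_mb * REPLICATION_FACTOR,
    }


if __name__ == "__main__":
    # Hypothetical 1000 MB file: 8 blocks, 24 block replicas, 3000 MB raw.
    print(hdfs_footprint(1000))
```

With these defaults, a 1000 MB file splits into 8 blocks, which are stored as 24 block replicas across the data nodes.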

What is Spark? (5:25)
Explore how Apache Spark speeds big data processing by keeping intermediate results in memory, enabling real-time streaming and batch analytics, and replacing MapReduce in Hadoop ecosystems.
Spark Hello World on Dataproc (5:01)
Running Python Spark 3 on Google Colab (1:38)
Spark for data transformation (1:39)
What is a DataFrame? (1:30)
RDDs - The fundamental building block (1:59)
Python basics (6:08)
PySpark - Creating RDDs (4:53)
Python functions and lambda expressions (2:57)
RDD - Transformation & Action (10:45)
PySpark Data Engineering: Solve Real Business Problems (22:04)
Scala basics (8:06)
Boost Productivity: PySpark to Scala Conversion via ChatGPT (2:08)
A Real-World Use Case in Action (2:29)
Apply distributed data processing to cleanse bank marketing prospects by replacing missing values with column averages and removing unknowns, using Hadoop storage and Spark processing via PySpark and Spark Scala.
Use-Case Solution with PySpark on Colab (9:44)
Use-Case Solution with PySpark on Dataproc (3:49)
Use-Case Solution with Spark Scala on Dataproc (2:54)
Spark SQL and Temporary Views - Querying DataFrames with SQL (14:31)
Explore Spark SQL and temporary views to run SQL queries on data frames, register temporary views, and perform joins, aggregations, subqueries, and CTEs using both SQL and the DataFrame API.
Assignment: Spark SQL with Scala on Google Cloud Dataproc (1:09)
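
The use case described in this section cleanses bank marketing prospects by replacing missing values with column averages and removing unknowns. The following is a plain-Python sketch of that logic only, not the course's PySpark or Scala solution. The column names `age` and `job`, the sample rows, and the order of the two steps (fill first, then drop) are all assumptions made for illustration. In PySpark the same idea would typically be expressed with DataFrame operations such as `fillna` and `filter`.

```python
def cleanse(rows, numeric_col, categorical_col, unknown="unknown"):
    """Fill missing numeric values with the column average, then drop unknowns.

    Assumption: the average is computed over every non-missing value in the
    input, before any rows are dropped.
    """
    known = [r[numeric_col] for r in rows if r[numeric_col] is not None]
    average = sum(known) / len(known) if known else None

    cleansed = []
    for row in rows:
        if row[categorical_col] == unknown:
            continue  # remove rows whose category is unknown
        fixed = dict(row)  # copy so the input rows stay untouched
        if fixed[numeric_col] is None:
            fixed[numeric_col] = average
        cleansed.append(fixed)
    return cleansed


if __name__ == "__main__":
    # Hypothetical prospects; None marks a missing age.
    prospects = [
        {"age": 30, "job": "admin"},
        {"age": None, "job": "technician"},
        {"age": 50, "job": "unknown"},
        {"age": 40, "job": "services"},
    ]
    for row in cleanse(prospects, "age", "job"):
        print(row)
```

Computing the average before dropping rows means the dropped prospect still contributes to the fill value; reversing the two steps is an equally defensible choice and would change the result.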

What is Databricks (2:07)
Explore how Databricks, built on Apache Spark, enables data engineering, analytics, and machine learning with a collaborative lakehouse platform and Delta tables for scalable data pipelines.
Getting Started with Databricks Free Edition (2:37)
Working with Unity Catalog and Delta Tables (10:17)
Exploring DBFS and dbutils - Working with Sample Datasets (5:26)
Unity Catalog Volumes - File Storage and Operations (10:31)
Using Generative AI in Databricks for Data Transformation and Querying (2:08)
No-Code Data Engineering with Databricks Generative AI (3:07)
Important: Interface Differences Between Databricks Editions (1:37)
Compare the Databricks community edition interface with the free edition; notebooks, code, and datasets work the same, while serverless versus cluster setups and Scala availability are highlighted.
Sample transformations on Databricks using PySpark (9:09)
Sample transformations on Databricks using Spark Scala (5:45)
Explore sample transformations on Databricks using Spark with Scala and Python in a notebook, performing groupBy and avg on a DataFrame, with explicit val and var declarations and camel casing.
Spark User defined functions (UDF) (14:24)
Explore Spark user defined functions (UDF) to create and apply custom Python functions to data frames or via SQL, using Databricks, UDFs, concat, lit, and lambda expressions.
Joining Datasets using DataFrame APIs and Spark SQL (14:57)
More join operations using Spark (4:48)

Understanding Data Warehouse, Data Lake and Data Lakehouse (7:31)
Databricks Lakehouse Architecture and Delta Lake (4:38)
Delta tables (1:32)
Storing data in a Delta table, Databricks SQL and time travel (12:35)
Databricks SQL vs Spark SQL (5:50)
Compare Spark SQL and Databricks SQL while creating a Delta table from the diamonds data, and perform transformations with both interfaces using describe commands and the optimized engine.
Delta Table caching (10:43)
Delta Table partitioning (5:31)
Delta Table Z-ordering (5:07)
Explore Z-ordering, a Delta table optimization that sorts data on disk by specified columns to speed up range queries, demonstrated on payment type with the OPTIMIZE command.
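
The Z-ordering lecture above is about laying data out on disk so that rows with similar column values sit near each other. The sketch below shows the classic Z-order (Morton) curve idea on two integer columns by interleaving their bits. It is a conceptual illustration in plain Python, not how Delta Lake implements its optimization, and the name `z_value` and the tiny 2x2 grid are assumptions made for the example.

```python
def z_value(x, y, bits=16):
    """Interleave the low `bits` bits of x and y into one Z-order key.

    Bit i of x lands at position 2*i and bit i of y at position 2*i + 1,
    so points that are close in both dimensions tend to get nearby keys.
    """
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)
        z |= ((y >> i) & 1) << (2 * i + 1)
    return z


if __name__ == "__main__":
    # Hypothetical 2x2 grid of (x, y) pairs, sorted along the Z-order curve.
    points = [(x, y) for y in range(2) for x in range(2)]
    for point in sorted(points, key=lambda p: z_value(*p)):
        print(point, z_value(*point))
```

Sorting by the interleaved key keeps neighbouring values of both columns close together, which is what allows a range query to skip large parts of the data.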

Building Python Spark Code Outside a Notebook Environment (0:35)
Setting Up a PySpark Hadoop Development Environment in PyCharm (8:25)
Windows Guide: PyCharm Setup for PySpark, Hadoop, and Hive with Winutils (14:25)
Following Along on Windows and Mac (0:25)
Structuring code with classes and methods (6:19)
Creating and reusing SparkSession (6:13)
Create and reuse a SparkSession across a PySpark data pipeline by initializing it in a class, storing it as self.spark, and passing it to ingest and pipeline components.
Spark DataFrame (2:27)
Separating out Ingestion, Transformation and Persistence code (4:57)
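
The SparkSession lecture above describes initializing the session in a class, storing it as `self.spark`, and passing it to the ingest and pipeline components. The sketch below shows one way that structure could look. It is not the course's code: the class names, method names, and sample rows are hypothetical, and `StubSession` is a stand-in so the example runs without Spark installed. With PySpark available, the factory could instead be `lambda: SparkSession.builder.appName("pipeline").getOrCreate()`.

```python
class StubSession:
    """Stand-in for a SparkSession (assumption: used so no Spark install is needed)."""

    instances = 0  # counts how many sessions were ever created

    def __init__(self):
        StubSession.instances += 1


class Ingest:
    def __init__(self, spark):
        self.spark = spark  # reuse the session handed in; never create a new one

    def ingest_data(self):
        # A real implementation would read through self.spark here.
        return [{"id": 1, "value": 10}, {"id": 2, "value": 20}]


class Persist:
    def __init__(self, spark):
        self.spark = spark
        self.stored = []

    def persist_data(self, rows):
        # A real implementation would write through self.spark here.
        self.stored = list(rows)


class Pipeline:
    def __init__(self, session_factory=StubSession):
        self.session_factory = session_factory
        self.spark = None

    def create_spark_session(self):
        # Create the session once, then hand back the same object every time.
        if self.spark is None:
            self.spark = self.session_factory()
        return self.spark

    def run_pipeline(self):
        spark = self.create_spark_session()
        self.ingest = Ingest(spark)
        self.persist = Persist(spark)
        self.persist.persist_data(self.ingest.ingest_data())
        return self.persist.stored


if __name__ == "__main__":
    pipeline = Pipeline()
    print(pipeline.run_pipeline())
    print("sessions created:", StubSession.instances)
```

Passing the session into each component keeps ingestion and persistence code free of session setup, which also makes the components easier to test in isolation.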

Requirements

Students should have some programming background and some knowledge of SQL queries.

Description

Become a Job-Ready Data Engineer with Real-World, Hands-On Projects!

The Data Engineering Masterclass prepares you for an actual Data Engineer role, covering everything from Hadoop and Spark to AWS Glue, Databricks, Delta Lake, and Apache NiFi — the complete modern data engineering ecosystem.

Data Engineering powers every data-driven organization — it’s the foundation behind analytics, AI, and business intelligence. In this course, you’ll master how large-scale data is collected, processed, stored, and analyzed using today’s most in-demand Big Data tools.

Through step-by-step, hands-on labs and real-world projects, you’ll build end-to-end data pipelines using Hadoop, Spark, Databricks, and NiFi — applying both Python (PySpark) and Scala.

You’ll also learn professional-grade coding techniques including logging, error handling, unit testing, and configuration management — to code like an industry data engineer.

With Apache NiFi, you’ll go beyond traditional ETL. You’ll learn how to design, automate, and monitor data flows between systems, and understand where NiFi fits in a modern cloud-based architecture.

By the end, you’ll confidently work with cloud platforms, data lakes, and ETL pipelines, and know how to leverage ChatGPT and other generative AI tools to boost productivity, automate repetitive tasks, and think critically in an AI-driven world.

What You’ll Learn

Big Data and Hadoop fundamentals
Create a free Hadoop and Spark cluster using Google Dataproc
Hands-on Hadoop: HDFS and Hive projects
Python and PySpark basics for Big Data
PySpark RDD, SQL, and DataFrame operations — hands-on
Spark SQL and Temporary Views - Querying DataFrames with SQL
Build an end-to-end project using PySpark and Hive
Scala basics and Spark Scala DataFrames
Real-world Spark Scala project with IntelliJ and Maven
Databricks and Delta Lakehouse fundamentals
Manage Delta Tables — versioning, restoring, and time travel
Unity Catalog Volumes - File Storage and Operations
Optimize Spark queries using Delta Cache
Build a full data pipeline with Hive, PostgreSQL, and Spark
Logging, error handling, and unit testing for PySpark & Scala applications
Apache NiFi fundamentals — build, automate, and monitor data flows
Integrate AWS Glue, Athena, and S3 for data transformation and analytics
Use ChatGPT to accelerate learning and automate repetitive tasks
Vibe coding with GitHub Copilot to build data pipelines using simple natural language conversation.

Tools & Technologies Covered

Hadoop • Spark • Hive • PySpark • Scala • Databricks • Delta Lake • NiFi • AWS Glue • Athena • PostgreSQL • IntelliJ • Maven • PyCharm

Who This Course Is For

Beginners who want to become Data Engineers
Software or SQL developers looking to move into Big Data
Data Analysts or Scientists wanting to understand data pipelines
Anyone preparing for a Data Engineer job or interview

Prerequisites

No prior programming experience is required — you’ll learn Python and Scala from scratch.
A basic understanding of databases and SQL will help, but it’s not mandatory.

Outcome

By completing this masterclass, you will:

Understand Big Data and distributed computing concepts
Build and deploy Spark and NiFi data pipelines on cloud platforms
Work confidently with Databricks, Delta Lake, and AWS Glue
Apply best practices in logging, testing, error handling, and performance tuning
Be ready for real-world Data Engineering roles with hands-on, practical experience

Who this course is for:

Beginners who want to learn Big Data or experienced people who want to transition to a Big Data role
Big data beginners who want to learn how to code in the real world

Course content

Introduction (2 lectures • 4min)

Big Data Hadoop concepts and hands-on (14 lectures • 1hr 16min)

Spark concepts and hands-on (19 lectures • 1hr 49min)

Review and Path Forward (3 lectures • 23min)

Learning Apache Spark on Databricks (13 lectures • 1hr 27min)

Deep dive into Databricks Delta Lake Lakehouse Platform (8 lectures • 53min)

Creating a PySpark real world coding framework (8 lectures • 44min)

PySpark Logging and Error Handling (5 lectures • 33min)

Creating a Data Pipeline with Hadoop PySpark and PostgreSQL (6 lectures • 25min)

PySpark - Reading Configuration from properties file (2 lectures • 5min)
