PySpark - Apache Spark Programming for Beginners (2026)

Rating: 4.6 (17087 reviews)

Master Apache Spark Programming in Python (PySpark) Using Databricks Free Edition - Recreated for 2026

Bestseller

Created by Prashant Kumar Pandey, Learning Journal

Last updated 1/2026

English

English [Auto]

What you'll learn

Apache Spark Programming in Python (PySpark)
Spark Programming in Databricks Free Account
Working with DataFrame Transformations and Actions
Handling Schema and working with different data types
Working with Complex Data Types, Aggregation, Joins and UDF
Working with Data Sources and Sinks
Unit Testing and Data Engineering Techniques

Course content

14 sections • 81 lectures • 28h 28m total length

What is Big Data and How it Started • 22:08
Hadoop Architecture, History and Evolution • 30:40
Data Lake and Lakehouse Architecture • 16:08

What is Apache Spark • 8:54
Understand Apache Spark as a multi-language engine and framework for data engineering, data science, and machine learning on single-node or cluster setups. Learn its SQL, batch, and stream processing capabilities.
Apache Spark System Architecture • 14:11
Examine the Spark system architecture, covering Spark Core and RDDs, language wrappers, SQL and DataFrame APIs, Structured Streaming, MLlib, and Spark Connect, plus cloud deployment and storage.
Spark Platform and Development Environment • 9:56
Explain the Spark platform and development environments, detailing Spark Core, libraries, resource manager, and storage. Compare cloud platforms like Databricks and EMR with on-premise Cloudera, and notebook versus IDE development.
What is Databricks Cloud • 7:33
Understand Databricks Cloud, a unified data intelligence platform built on Spark, MLflow, and Delta Lake, enabling on-demand and serverless Spark workloads on AWS, Azure, and Google Cloud.
Create your Databricks Free Account • 6:51
Create a Databricks Free Edition account with a step-by-step signup, email verification, and login, then customize preferences like dark mode and editor theme.
Setup your hands-on environment • 19:50
Set up your Databricks free environment, import the Spark programming DBC file, and initialize a dev catalog, spark_db database, and datasets volume to load diamonds data from GitHub.
Download Resources • 0:12

Starting Point - Data Engineering, Spark, and Spark Session • 38:20
Dataframe - A view to Structured data • 39:51
Explore Spark DataFrames to process and transform data in memory, using DataFrame APIs for select, filter, join, and aggregate, with hands-on practice loading sf-fire-calls.csv and saving results to a table.
Dataframe Transformations and Actions • 30:26
Learn Spark DataFrames by reading data, applying transformations, and triggering actions to execute and display results, while understanding four categories of methods: transformations, actions, DataFrame writer methods, and auxiliary methods.
Dataframe Concepts • 23:36
Explore Spark DataFrame concepts, including the optimized query plan and explain plans. Understand immutable DataFrames and composable transformations that return new DataFrames.
Exploring Dataframe Transformations • 9:10
Explore Spark DataFrame transformations and the DataFrame API through hands-on practice with select, where, distinct, group by, order by, and limit, using the SF Fire Calls table.
Creating Spark Dataframe • 29:49
Explore five approaches to creating a Spark DataFrame: using a connector, reading from Spark tables, SQL, Python lists, and ranges.

Spark Data Types • 12:43
Explore Spark data types, from numeric primitives (byte, short, int, long, float, double) to complex types (array, map, struct) and interval types, with Python and Spark SQL keywords.
Schema on read in Spark • 35:50
Apply schema on read in Spark by defining a flight schema for a JSON file, ensuring correct field order and date types, and writing the data to a table.
Correcting Data Types • 41:42
Explore data type correction in Spark and schema on read. Convert time fields into interval data types to enable straightforward SQL operations and analysis on flight time data.
Exploratory analysis and data type conversion • 43:11
Learn to read the sales sample CSV with a defined schema on read, perform exploratory data analysis, and fix data type and schema issues in Spark DataFrames.

Adding, Removing, and Renaming Columns • 39:04
Learn to add, remove, and rename columns in a Spark DataFrame using withColumn, withColumns, and case when expressions, and to define schema for efficient DataFrame creation.
Dataframe Column Expressions • 46:03
Learn to write DataFrame column expressions in Spark using select expr and withColumn, rename and compute arrival and departure dates, and compare SQL-style and column-based approaches.
Filtering and removing duplicates • 25:22
Learn row-level transformations in PySpark to filter records and remove duplicates, using four expression approaches, and apply distinct or drop duplicates on chosen columns.
Sorting, Limiting and Collecting • 32:56
Read the flight time DataFrame, filter for 2000-01-16 from US to AUD, compute delay as actual minus scheduled arrival, sort by delay, take three, and collect results into Python.
Transforming Unstructured data • 30:52
Transform unstructured Apache logs with Spark by extracting the IP address, visit timestamp, visit resource, and referring URL using regex, then refine timestamps and the root URL, and compare the result with AI-based extraction.
Transforming data with LLM • 33:49
Transform unstructured data with Apache Spark using an AI prompt to extract log records into a structured JSON format with IP address, visit timestamp, visit resources, and referring URL.

Working with Nulls • 29:07
Explore how to work with nulls in Apache Spark, including equality checks with is null, null handling in logical and mathematical operations, and using nvl for safe aggregations.
Working with Numbers • 30:53
Learn to work with numbers in Spark DataFrames by writing mathematical expressions, applying round and aggregate functions, creating total value columns, and performing analysis with describe, summary, and percentile.
Manipulating Strings • 32:14
Master string manipulation in Spark DataFrames using concat, concat_ws, and to_date to create dates, and format_number and format_string for presentation; explore translate and replace for data cleaning.
Working with Date • 39:13
Explore date types in Apache Spark, using date functions to convert strings to dates, perform date arithmetic, and format results while handling invalid dates as null.
Working with Timestamps • 23:45
Discover how to process timestamps in Spark by converting strings to timestamps with formats, using to_timestamp and try_to_timestamp, and understanding timestamp components and the default UTC time zone.
Handling Time Zone Information • 39:52
Explore handling time zone information in Spark by contrasting timestamp with time zone and timestamp without time zone, using the session time zone and convert_timezone to normalize to UTC.
Working with Complex Data Types • 47:48
Explore complex data types in Spark (struct, array, and map), parsing JSON fields with from_json and building optimized offline tables for fast analytics, including country-wise counts.
Working with JSON data • 27:19
Load JSON data with Spark using the JSON connector and a schema, then query complex types to analyze country counts and Spark experience.
Working with Variant Type • 37:17
Learn how to use the variant data type in PySpark to store semi-structured fields from a CSV, parse JSON without a fixed schema, and query with variant_explode.

Introduction to Joins in Spark • 8:10
Introduce joins in Apache Spark and explore eight join types, including inner, outer, natural, cross, self, semi, and anti joins, with hands-on data prep using facilities, members, and bookings tables.
Inner Joins • 35:11
Master inner joins with Spark DataFrames to combine members and bookings, alias tables, filter Smith with more than five slots, and sort by first name and slots.
Outer Joins • 37:47
Explore outer joins in Apache Spark, including left, right, and full outer joins on DataFrame APIs, with Darren Smith scenarios and nulls first sorting for bookings.
Lateral Join • 31:44
Master lateral joins in PySpark DataFrame APIs to query a right table for each left row, enabling per-parent top N results and table-valued function calls.
Other Types of Joins • 29:21
Explore five Spark join types beyond the basics (natural, cross, self, left semi, and left anti joins) using real examples with member, bookings, and facilities data.

User-Defined Functions • 20:05
Discover how to create scalar Python UDFs, Pandas vectorized UDFs, and UDTFs in Apache Spark, and register them for Spark SQL to use in DataFrames.
Vectorized UDF • 19:21
Explore vectorized Pandas UDFs in Spark, learn how they process input as Pandas Series or Pandas DataFrame in blocks, use Arrow serialization, and boost performance over Python UDFs.
User Defined Table Functions • 24:18
Unit Testing Spark Code • 50:32

Requirements

Programming knowledge of Python
Knowledge of SQL programming

Description

This course does not require any prior knowledge of Apache Spark or Hadoop. We have taken sufficient care to explain the fundamental concepts of Spark, helping you come up to speed and grasp the content of this course.

About the Course

I am creating the PySpark - Apache Spark Programming for Beginners course to help you understand Spark programming and apply that knowledge to build data engineering solutions. This course is example-driven and follows a working session-like approach. We will take a live coding approach and explain all the necessary concepts along the way.

Who should take this Course?

I designed this course for software engineers who want to develop data engineering pipelines and applications using Apache Spark. I am also creating this course for data architects and data engineers who are responsible for designing and building the organisation’s data-centric infrastructure. A third group is managers and architects who do not work directly on Spark implementation but do work with the people who implement Apache Spark at the ground level.

Spark Version used in the Course

This course uses Apache Spark 4.1. I have tested all the source code and examples used in this course on Apache Spark 4.1 in the Databricks environment.

Who this course is for:

Software engineers and architects who want to design and develop Big Data engineering projects using Apache Spark
Programmers and developers who aspire to grow and learn data engineering using Apache Spark

Course content

Understanding Big Data and Distributed Data Processing • 3 lectures • 1hr 9min

Introduction to Apache Spark • 7 lectures • 1hr 7min

Getting Started with Spark Programming • 6 lectures • 2hr 51min

Spark Data Types and Schema • 4 lectures • 2hr 13min

Dataframe Transformations • 6 lectures • 3hr 28min

Working with different data types • 9 lectures • 5hr 7min

Joins in Spark Dataframe • 5 lectures • 2hr 22min

Aggregations in Spark Dataframe • 4 lectures • 1hr 43min

UDF and Unit Testing • 4 lectures • 1hr 54min

Spark On Your Laptop IDE • 3 lectures • 1hr 44min
