
Create a Databricks Free Edition account through step-by-step signup, email verification, and login, then customize preferences such as dark mode and the editor theme.
Explore Spark DataFrames to process and transform data in memory, using the DataFrame API to select, filter, join, and aggregate, with hands-on practice loading sf-fire-calls.csv and saving the results to a table.
Learn to read the sales sample CSV with a schema defined on read, perform exploratory data analysis, and fix data type and schema issues in Spark DataFrames.
Learn to add, remove, and rename columns in a Spark DataFrame using withColumn, withColumns, and case when expressions, and to define a schema for efficient DataFrame creation.
Learn to write DataFrame column expressions in Spark using select expr and withColumn, rename and compute arrival and departure dates, and compare SQL-style and column-based approaches.
Read the flight time DataFrame, filter for 2000-01-16 from US to AUD, compute the delay as actual minus scheduled arrival, sort by delay, take three rows, and collect the results into Python.
Transform unstructured data with Apache Spark using an AI prompt to extract log records into a structured JSON format with IP address, visit timestamp, visit resources, and referring URL.
Learn to work with numbers in Spark DataFrames by writing mathematical expressions, applying round and aggregate functions, creating total value columns, and performing analysis with describe, summary, and percentile.
Master string manipulation in Spark DataFrames: use concat, concat_ws, and to_date to create dates, format_number and format_string for presentation, and translate and replace for data cleaning.
Explore complex data types in Spark (struct, array, and map), parse JSON fields with from_json, and build optimized offline tables for fast analytics, including country-wise counts.
Explore five Spark join types beyond the basics—natural, cross, self, left semi, and left anti joins—using real examples with member, bookings, and facilities data.
Develop a local Spark application in Python using PyCharm by creating a Spark session, loading survey.csv into a DataFrame, and computing a count grouped by country.
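To give a feel for the log-extraction lesson listed above, here is a minimal sketch in plain Python. It is not the AI-prompt approach the lesson describes: it swaps in a hand-written regular expression, and it assumes the input follows the Apache "combined" log format, which the lesson summary does not specify. The sample line, the `parse_log_line` helper, and the JSON key names are all made up for illustration.

```python
import json
import re
from typing import Optional

# Pattern for the Apache "combined" log format. This format is an assumption;
# the lesson summary does not say which log format is used.
LOG_PATTERN = re.compile(
    r'^(?P<ip_address>\S+) \S+ \S+ '              # client IP, identity, user
    r'\[(?P<visit_timestamp>[^\]]+)\] '           # [timestamp]
    r'"(?:\S+) (?P<visit_resource>\S+)[^"]*" '    # "METHOD resource PROTOCOL"
    r'\d{3} \S+ '                                 # status code, response size
    r'"(?P<referring_url>[^"]*)"'                 # "referrer"
)


def parse_log_line(line: str) -> Optional[dict]:
    """Return the four fields as a dict, or None if the line does not match."""
    match = LOG_PATTERN.match(line)
    return match.groupdict() if match else None


# Hypothetical sample record, invented for this sketch.
sample = (
    '203.0.113.7 - - [16/Jan/2000:10:05:03 +0000] '
    '"GET /index.html HTTP/1.1" 200 1024 '
    '"http://example.com/start" "Mozilla/5.0"'
)

record = parse_log_line(sample)
print(json.dumps(record, indent=2))
```

In Spark, a similar pattern could be applied column-wise with `regexp_extract` from `pyspark.sql.functions`, though the named groups above use Python syntax and would need adapting to numbered groups.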
This course does not require any prior knowledge of Apache Spark or Hadoop. We have taken sufficient care to explain the fundamental concepts of Spark, helping you come up to speed and grasp the content of this course.
About the Course
I am creating the PySpark - Apache Spark Programming for Beginners course to help you understand Spark programming and apply that knowledge to build data engineering solutions. This course is example-driven and follows a working session-like approach. We will take a live coding approach and explain all the necessary concepts along the way.
Who should take this Course?
I designed this course for software engineers who want to develop data engineering pipelines and applications using Apache Spark. I am also creating this course for data architects and data engineers who are responsible for designing and building the organisation’s data-centric infrastructure. A third audience is managers and architects who do not work directly on Spark implementation but work with the people who implement Apache Spark at the ground level.
Spark Version used in the Course
This course uses Apache Spark 4.1. I have tested all the source code and examples used in this course on Apache Spark 4.1 in the Databricks environment.