Databricks Certified Associate Spark Developer Certification

Name: Databricks Certified Associate Spark Developer Certification
Rating: 4.9 (10 reviews)

2026 | 2 practice tests included. Hands-on PySpark, DataFrames & Structured Streaming for the Databricks Certification

Created by Graeme Gordon

Last updated 3/2026

English

What you'll learn

Feel exam-ready with practice questions and topic breakdowns covering all 7 sections of the Databricks Certified Associate Developer for Apache Spark exam.
Explain Spark's core architecture including driver, executors, lazy evaluation, shuffles, and the job/stage/task execution hierarchy.
Manipulate Spark DataFrames using the API to filter, join, aggregate, handle missing data, and apply UDFs and Spark SQL functions.
Get hands-on experience writing and running real Spark code in Databricks, reinforcing key concepts through practical exercises and examples.
Understand and apply Structured Streaming in Spark to handle real-time data pipelines using windowing, watermarking, and output sinks.

Course content

10 sections • 74 lectures • 6h 50m total length

Introduction (3:11)
Download Course Resources Here (0:26)
Disclaimer (0:45)

Introduction to Databricks and Apache Spark (6:54)
Compare Apache Spark with Databricks, define Spark as an open-source distributed computing engine, and summarize Databricks as a managed cloud platform with notebooks and Delta Lake.
Setting up a Databricks Account (7:01)
Sign up for Databricks Free Edition and verify your email. Explore notebooks, the workspace, the catalog, and compute clusters for learning streaming topics with an all-purpose cluster.
Core Components of Spark's Execution Architecture (8:54)
Explore the core components of Spark's execution architecture: the driver node, cluster manager, and worker nodes with executors, and how the Spark session, DAG, and Catalyst Optimizer shape execution.
Introduction to Databricks Notebooks (9:22)
Learn how Databricks notebooks enable interactive, multi-language coding with code cells and markdown, connect to compute, and use serverless and all-purpose compute with Spark sessions.
How to Import a Notebook to Databricks (0:44)
Spark Core & API Architecture (7:58)
Understand Spark's layered architecture, from Spark Core to the DataFrame and Dataset APIs, and explore Spark SQL, Structured Streaming, machine learning, and graph libraries built on Spark Core.
Understanding Variables in Python (3:48)
Introduction to DataFrames in Databricks (10:13)
Learn to read a CSV into a Spark DataFrame in Databricks, define a schema, enable the header and inferSchema options, and assign the result to a DataFrame variable.
Spark Transformations, Actions & Lazy Evaluation (9:22)
Explore how Spark uses lazy evaluation with transformations and actions, building a logical execution plan into a DAG, and how actions trigger job execution and optimization.
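
The split between transformations and actions described above can be illustrated with a toy class in plain Python. This is only an analogy, not Spark code: the `LazyFrame` class and its methods are hypothetical names invented for this sketch, and real Spark additionally optimizes the recorded plan before running it.

```python
class LazyFrame:
    """Toy stand-in for a DataFrame: transformations are recorded, not run."""

    def __init__(self, rows, plan=()):
        self._rows = rows
        self._plan = plan  # tuple of (kind, function) steps, in order

    def filter(self, predicate):
        # Transformation: returns a new object with a longer plan, does no work.
        return LazyFrame(self._rows, self._plan + (("filter", predicate),))

    def map(self, func):
        # Transformation: also only extends the plan.
        return LazyFrame(self._rows, self._plan + (("map", func),))

    def collect(self):
        # Action: only now is the recorded plan applied to the data.
        out = []
        for row in self._rows:
            keep = True
            for kind, fn in self._plan:
                if kind == "filter":
                    if not fn(row):
                        keep = False
                        break
                else:
                    row = fn(row)
            if keep:
                out.append(row)
        return out


calls = []  # records every value the map function actually touches


def traced_double(x):
    calls.append(x)
    return x * 2


frame = LazyFrame([1, 2, 3, 4]).filter(lambda x: x % 2 == 0).map(traced_double)
calls_before_action = len(calls)  # nothing has executed yet
result = frame.collect()          # the action triggers the work
```

Until `collect()` is called, `traced_double` never runs, which mirrors how a chain of Spark transformations does nothing until an action starts a job.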

Working with CSV (4:51)
Learn how CSVs, or comma-separated values files, offer a readable format with no enforced schema, where all values are strings, and explore reading them with spark.read.csv using the header and inferSchema options.
Working with JSON (8:49)
Explore JSON as a plain text, human-readable, hierarchical data format with nested objects and arrays. Learn to read and write JSON in Databricks using Spark APIs.
Working with ORC and Parquet (5:47)
Explore ORC and Parquet, binary columnar formats that enable selective reads with columnar storage, self-describing schemas, strong typing, and high compression across Spark, Hive, and data lake storage.
Introduction to Delta Files and Delta Lake (5:13)
Discover Delta Lake, an open-source storage layer that adds ACID transactions to Spark, stores data as Parquet, tracks changes with a transaction log, and enables time travel.
Defining Explicit DataFrame Schemas Using PySpark (9:40)
Define an explicit PySpark DataFrame schema using a StructType for hotel bookings, convert price and total amount to decimals, and parse the booking date with a day-month-year format.
Partitioning Data in Databricks (6:08)
Partition data with PySpark and SQL by a country column, creating folders for each country and partitioned Delta tables to speed up queries that filter by country.
Sorting, Optimizing and Z-Ordering Tables (5:03)
Sort and optimize Delta Lake tables to boost query performance, data skipping, and compression, while partitioning, ordering by booking date and hotel ID, and applying Z-ordering for faster joins.

Selecting and Renaming Columns Using the PySpark DataFrame API (7:57)
Learn to select and rename columns in PySpark using DataFrame APIs, including aliasing and SQL expressions, demonstrated on the hotel bookings dataset.
Applying Column Transformations in PySpark (8:03)
Apply column transformations in PySpark to create a discounted amount from total amount and to uppercase room types, using withColumn and withColumns in a Databricks notebook.
Applying Filters to a DataFrame (6:51)
Drop the discounted amount column from the hotel bookings DataFrame, then apply filters for premium bookings using nights, total amount, and logical operators, including where for USA results.
Creating Conditional Columns in PySpark (7:20)
Create conditional columns in PySpark using when to label bookings as premium or standard, or to produce an is_premium boolean flag, based on nights and total_amount conditions.
Introduction to Aggregate Functions (5:31)
Learn to use PySpark aggregate functions on a hotel bookings DataFrame, counting records and summing the total amount with aliasing and both select and groupBy approaches.
Grouping Data and Performing Aggregate Operations (4:54)
Learn to perform multi-aggregate analysis with Spark by grouping hotel bookings by country and computing total bookings, average nights, average booking value, and total revenue in a single DataFrame.
Understanding the Order of PySpark Code Execution (2:52)
Appending Additional Data to a DataFrame (5:15)
Append additional data to a bookings DataFrame by turning a list into a DataFrame with the booking schema, then use union to handle duplicates, nulls, and date and decimal types.
Handling Duplicates in PySpark DataFrames (3:45)
Identify and remove duplicates in PySpark DataFrames by grouping by booking ID, counting records, filtering counts greater than one, then dropping duplicates to create a clean DataFrame.
Counting Unique Values in PySpark (3:01)
Learn to count unique values in PySpark with countDistinct and approx_count_distinct, compare accuracy and speed, and explore the HyperLogLog algorithm for large data sets.
Handling Null Values in PySpark DataFrames (6:56)
Count nulls per column in PySpark DataFrames, then drop rows or fill with unknown using dropna and fillna, applying to columns like country, room type, and nights.
Generating Summary Statistics (2:30)
Display numeric data statistics to snapshot a clean DataFrame using summary, including average, minimum, maximum, and 25/50/75 percentiles, with optional 10/90 percentiles, and selectively apply to columns.
Changing Column Data Types in PySpark (4:29)
Learn to cast and change column data types in PySpark after reading data, using withColumn and cast to double for price per night, then print the schema to verify.
Introduction to Date Functions (6:10)
Explore PySpark date functions to extract year, month, day, quarter, and week of year, format dates, and build a unique date dimension DataFrame for joining to analyses.
Working with Unix Timestamps in PySpark (6:56)
Demonstrate converting Unix epoch seconds into readable dates in Spark by using from_unixtime and to_date to create date columns, format dates, and compare epoch and date data types.
Performing Date Calculations in PySpark (4:53)
Learn to compute date differences in PySpark by using date_diff with the current date and booking date, add a days-since-last-booking column, and preview results.
Introduction to DataFrame Joins in PySpark (7:07)
Join two DataFrames in PySpark by loading a hotels DataFrame and joining on hotel_id to enrich bookings with hotel details, using inner and left joins.
Using Broadcast Joins in PySpark (4:56)
Explore how broadcast joins in PySpark optimize performance by broadcasting a small DataFrame to every executor, eliminating shuffle and reducing network IO during joins.
Understanding Cross Joins in PySpark (5:23)
Master cross joins in PySpark by creating two DataFrames (room types and meal plans) and generating all combinations, using either the cross join type or the crossJoin method.
Combining Columns in PySpark (7:38)
Concatenate PySpark columns with concat and withColumn to form a room meal package column, using lit for spaces. Build a hotel full column by concatenating hotel name and country name after left joins.
Creating a Top Hotels Report (3:35)
Develop a top hotels report by grouping by hotel full, calculating total bookings, average stay length, total revenue, and revenue per booking, and ordering by revenue descending to identify top hotels.
Converting a Spark DataFrame to a Python List (6:06)
Convert a Spark DataFrame to a Python list with collect, then transform rows to dictionaries using row.asDict for easy access to hotel names and ratings.
User-Defined Functions (UDFs) in PySpark (5:56)
Create a PySpark user-defined function with def to classify nights as short, medium, or long stays. Apply it to the hotel bookings DataFrame as a stateless UDF.
Assigning Row Numbers Within Partitions (6:00)
Using Window Functions for Running Totals (3:45)
Flattening Arrays with Explode in PySpark (5:23)
Debug a Spark Data Pipeline: Cleaning, Transforming & Aggregating Data
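
The row-number and running-total lectures in this section rely on window functions. What such a window computes can be sketched in plain Python. This is not the PySpark API: the `window_over` helper is a hypothetical name, the column names follow the hotel bookings dataset mentioned above, and the sample rows are made up for illustration.

```python
from collections import defaultdict
from itertools import accumulate

# Made-up sample rows; only the column names come from the course outline.
bookings = [
    {"hotel_id": "H1", "booking_date": "2024-01-03", "total_amount": 200},
    {"hotel_id": "H2", "booking_date": "2024-01-01", "total_amount": 150},
    {"hotel_id": "H1", "booking_date": "2024-01-01", "total_amount": 100},
    {"hotel_id": "H1", "booking_date": "2024-01-02", "total_amount": 50},
]


def window_over(rows, partition_key, order_key, value_key):
    """Add a row number and a running total within each partition."""
    # Step 1: split rows into partitions by the partition key.
    groups = defaultdict(list)
    for row in rows:
        groups[row[partition_key]].append(row)

    out = []
    for key in sorted(groups):
        # Step 2: order the rows inside each partition.
        ordered = sorted(groups[key], key=lambda r: r[order_key])
        # Step 3: number the rows and accumulate the value column.
        totals = accumulate(r[value_key] for r in ordered)
        for n, (row, total) in enumerate(zip(ordered, totals), start=1):
            out.append({**row, "row_number": n, "running_total": total})
    return out


result = window_over(bookings, "hotel_id", "booking_date", "total_amount")
summary = [(r["hotel_id"], r["row_number"], r["running_total"]) for r in result]
```

Each hotel is numbered and totalled independently, which is the effect of partitioning a window by hotel and ordering it by booking date.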

Creating an Azure Account and Databricks Workspace (4:53)
Learn to create an Azure account and deploy a Databricks workspace, configure a resource group and region, and use the 14-day DBU trial with hybrid compute.
How Spark Executes a Query (4:01)
Explore how Spark executes a DataFrame query, from unresolved to analyzed to optimized logical plans, through Catalyst optimization to a physical plan, with Databricks Photon Engine.
Understanding Spark Performance & Optimization (6:11)
Explore how partitions, tasks, and resource usage shape Spark performance, and learn optimization factors like executor cores, memory, disk IO, shuffling, data skew, and garbage collection.
Creating an All Purpose Compute Cluster (7:20)
Create an all-purpose compute cluster in Databricks via the Azure portal, configuring runtime, node type, termination settings, and dedicated access mode to view Spark UI partitions and shuffles.
Checking the Number of Partitions in Spark (8:08)
Learn to check Spark partitions by converting a DataFrame to an RDD, counting its partitions, and verifying that a Delta table write produces four files, one per partition.
Understanding Data Partitioning and Shuffle Partitions (5:58)
Explore how Spark partitions drive parallelism, differentiate initial and shuffle partitions, and see how joins trigger shuffles, with a practical count by partition ID.
Adaptive Query Execution (AQE) in Spark (7:53)
Learn how adaptive query execution optimizes Spark queries at runtime by dynamically adjusting shuffle partitions, selecting smarter join strategies such as broadcast joins, and addressing data skew for efficient workloads.
Managing Shuffle Partitions and Repartitioning (3:47)
Learn to repartition the joined DataFrame to six partitions to reduce empty partitions and avoid creating 200 small files, and tune shuffle partitions with spark.conf.set for better performance.
Reducing Partitions with Coalesce in Spark (2:51)
Coalesce reduces the number of partitions in a DataFrame without a full shuffle, improving efficiency for many small partitions, but it does not rebalance data.
Explaining Query Plans in Spark (4:42)
Explore how to use Spark’s explain to analyze query execution plans, compare efficient and inefficient queries, and understand physical, logical, and extended plans, including broadcast joins and column pruning.
Storage Levels and Caching in Spark (5:53)
Reduce Spark data shuffling by early filtering and pruning columns, optimize partitioning, and leverage storage levels like memory and disk to optimize cache performance.
Optimize a Slow Spark Job: Performance Tuning
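
The difference between coalesce and repartition noted in this section (coalesce avoids a full shuffle but does not rebalance data) can be sketched with plain Python lists standing in for partitions. This is a simplified model, not Spark's implementation: both functions are hypothetical, and the way partitions are grouped here is an assumption chosen for brevity.

```python
def coalesce(partitions, n):
    """Merge whole partitions into n groups; rows never leave their partition."""
    merged = [[] for _ in range(n)]
    for i, part in enumerate(partitions):
        # Assumption: partitions are assigned to targets in round-robin order.
        merged[i % n].extend(part)
    return merged


def repartition(partitions, n):
    """Full redistribution: every row is reassigned, so sizes even out."""
    rows = [row for part in partitions for row in part]
    out = [[] for _ in range(n)]
    for i, row in enumerate(rows):
        out[i % n].append(row)
    return out


# One large partition plus five tiny ones: a skewed starting layout.
partitions = [list(range(100))] + [[i] for i in range(100, 105)]

coalesced_sizes = [len(p) for p in coalesce(partitions, 2)]
repartitioned_sizes = [len(p) for p in repartition(partitions, 2)]
```

In this model the coalesced result stays skewed because the large partition is moved as a unit, while the repartitioned result is balanced at the cost of touching every row.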

Introduction to Structured Streaming (5:09)
Hands-On Structured Streaming Demo in Databricks (10:34)
Set up a streaming notebook and managed volume in Databricks, then read JSON files with a Structured Streaming DataFrame and monitor progress with checkpointing and a dashboard.
Ingesting New Arriving Data into a Streaming Table (5:22)
Demonstrates ingesting newly arriving data into a streaming table by generating files every five seconds in the webseals json folder, and deleting the checkpoint location to reset processing.
Exactly-Once Semantics and Fault Tolerance in Structured Streaming (4:26)
Explains exactly-once semantics and fault tolerance in Spark Structured Streaming, detailing deterministic micro-batches, checkpointing, and transactional writes to prevent duplicates and data loss.
Writing Streaming Data to an Output Sink (11:54)
Write the streaming DataFrame to a Delta table using a Delta sink, configuring checkpoint and output volumes and a five-second trigger.
Streaming to a Delta Table in Databricks (5:08)
Viewing Aggregated Results in Structured Streaming (5:25)
Time-Based Window Aggregations in Spark (6:00)
Tumbling and Sliding Windows in Structured Streaming (3:00)
Stream-to-Static Joins in Structured Streaming (4:36)
Joining Two Streaming Tables with Watermarking (9:49)
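
The windowing and watermarking lectures in this section rest on simple timestamp arithmetic, sketched here in plain Python. This is not the Structured Streaming API: the function names are hypothetical, timestamps are integer seconds, and the watermark check is applied per event, whereas Spark advances its watermark between micro-batches.

```python
def tumbling_window(ts, size):
    """Return the single [start, end) window of length `size` containing ts."""
    start = (ts // size) * size
    return (start, start + size)


def sliding_windows(ts, size, slide):
    """Return every [start, end) window of length `size`, spaced `slide` apart, containing ts."""
    windows = []
    start = (ts // slide) * slide  # latest window start at or before ts
    while start > ts - size:       # the window still covers ts
        windows.append((start, start + size))
        start -= slide
    return sorted(windows)


def apply_watermark(event_times, delay):
    """Split events into accepted and dropped, given a lateness threshold."""
    max_seen = None
    accepted, dropped = [], []
    for ts in event_times:
        # An event is too late if it falls behind (max event time - delay).
        if max_seen is not None and ts < max_seen - delay:
            dropped.append(ts)
        else:
            accepted.append(ts)
        max_seen = ts if max_seen is None else max(max_seen, ts)
    return accepted, dropped


one_window = tumbling_window(125, size=60)
overlapping = sliding_windows(125, size=60, slide=30)
accepted, dropped = apply_watermark([100, 130, 95, 200, 150, 185], delay=30)
```

An event falls into exactly one tumbling window but into several sliding windows when the slide is shorter than the window length. The watermark bounds how long late events are still accepted.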

Requirements

A basic understanding of Python and SQL is recommended. No prior experience with Apache Spark or Databricks is needed.

Description

This course is designed to prepare students for success in the Databricks Certified Associate Developer for Apache Spark exam by providing hands-on training in building, managing, and optimizing data pipelines on the Databricks Lakehouse Platform. Whether you're a data analyst, data engineer, or cloud analytics professional, this course will help you gain practical experience with PySpark, Spark SQL, DataFrames, Structured Streaming, and performance tuning, giving you the skills needed to pass the certification and work effectively with large-scale data workloads.

What You Will Learn:

Section 1: Spark Fundamentals and Databricks Essentials

Understand the core components of Apache Spark’s execution architecture.
Set up a Databricks account and explore notebooks for development.
Learn Spark Core, API architecture, and Python variables in a Spark context.
Work with DataFrames and understand transformations, actions, and lazy evaluation.

Section 2: Working with Data

Read and write data in CSV, JSON, ORC, Parquet, and Delta formats.
Define explicit DataFrame schemas, partition data, and optimize tables with Z-Ordering.
Apply column transformations, filters, conditional columns, and aggregate functions.

Section 3: Advanced Data Processing

Group, join, and combine data efficiently using Spark SQL and PySpark APIs.
Use window functions for running totals and row numbering.
Handle duplicates, nulls, and summary statistics for robust data pipelines.

Section 4: Performance Tuning and Optimization

Understand Spark query execution, partitions, caching, and storage levels.
Apply techniques like repartitioning, coalesce, and adaptive query execution (AQE).
Explain query plans and optimize performance for large-scale workloads.

Section 5: Structured Streaming

Build real-time data pipelines with structured streaming.
Perform windowed aggregations, stream-to-stream and stream-to-static joins.
Explore exactly-once semantics, fault tolerance, and output sinks in Databricks.

Section 6: Spark Connect, Deployment, and Pandas API

Understand local, client, and cluster deployment modes for Spark applications.
Leverage the Pandas API and Pandas UDFs for scalable data processing.

Who Should Take This Course:
Ideal for beginners, data analysts, cloud professionals, and aspiring Spark developers looking to gain hands-on experience with Apache Spark on Databricks and prepare for the certification exam. No prior Spark or Databricks experience is required.

By the end of this course, you'll be able to:

Confidently work with Spark DataFrames, SQL, and structured streaming.
Build, optimize, and troubleshoot data pipelines at scale.
Apply performance tuning techniques to improve Spark workloads.
Prepare thoroughly and confidently for the Databricks Certified Associate Developer for Apache Spark exam.

Who this course is for:

This course is for anyone, from beginners with basic Python and SQL knowledge to experienced data professionals, looking to learn Apache Spark on Databricks and earn the Databricks Certified Associate Developer for Apache Spark certification.

Course content

Introduction • 3 lectures • 4min

Introduction to Apache Spark Architecture and Databricks • 9 lectures • 1hr 4min

Using Spark SQL • 7 lectures • 46min

Data Processing with the Apache Spark DataFrame API • 27 lectures • 2hr 23min

Performance Tuning & Troubleshooting in Apache Spark • 12 lectures • 1hr 2min

Structured Streaming with Apache Spark • 11 lectures • 1hr 11min

Spark Connect and Deployment Modes • 4 lectures • 10min

Using the Pandas API in Spark • 2 lectures • 9min

Congratulations & Next Steps • 1 lecture • 1min

Practice Exams0