
Compare Apache Spark with Databricks, define Spark as an open source distributed computing engine, and summarize Databricks as a managed cloud platform with notebooks and Delta Lake.
Sign up for the Databricks Free Edition and verify your email. Explore notebooks, the workspace, the catalog, and compute clusters, and use an all-purpose cluster for the streaming topics.
Explore the core components of Spark's execution architecture: the driver node, cluster manager, and worker nodes with executors, and how the Spark session, DAG, and Catalyst Optimizer shape execution.
Learn how Databricks notebooks enable interactive, multi-language coding with code cells and markdown, connect to compute, and use serverless and all-purpose compute with Spark sessions.
Understand Spark's layered architecture, from Spark Core to the DataFrame and Dataset APIs, and explore the Spark SQL, Structured Streaming, machine learning, and graph libraries built on Spark Core.
Learn to read a CSV into a Spark DataFrame in Databricks, define a schema, enable the header and inferSchema options, and assign the result to a DataFrame variable.
Explore how Spark uses lazy evaluation with transformations and actions, building a logical execution plan into a DAG, and how actions trigger job execution and optimization.
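The lazy-evaluation idea in this lesson can be sketched without a cluster. The toy class below is a hypothetical stand-in, not Spark's API: LazyFrame, its map and filter transformations, and its collect action are names made up for illustration. Transformations only record steps in a plan, and nothing runs until the action is called.

```python
from typing import Any, Callable, Iterable, List, Tuple


class LazyFrame:
    """Toy model of lazy evaluation (hypothetical, not a Spark class)."""

    def __init__(self, rows: Iterable[Any],
                 plan: Tuple[Tuple[str, Callable], ...] = ()):
        self._rows = list(rows)
        self._plan = plan  # recorded transformations, not yet executed

    def map(self, fn: Callable) -> "LazyFrame":
        # Transformation: return a new frame with one more planned step.
        return LazyFrame(self._rows, self._plan + (("map", fn),))

    def filter(self, pred: Callable) -> "LazyFrame":
        # Transformation: also only recorded, never run here.
        return LazyFrame(self._rows, self._plan + (("filter", pred),))

    def collect(self) -> List[Any]:
        # Action: run every planned step in order and return the rows.
        rows = self._rows
        for kind, fn in self._plan:
            if kind == "map":
                rows = [fn(r) for r in rows]
            else:
                rows = [r for r in rows if fn(r)]
        return rows


calls = []  # records every time the transformation function really runs


def double(x):
    calls.append(x)
    return x * 2


nights = LazyFrame([1, 2, 3, 4])
planned = nights.map(double).filter(lambda x: x > 4)
calls_before_action = len(calls)   # still 0: only a plan exists so far
result = planned.collect()         # the action triggers execution
```

Unlike this toy, Spark also optimizes the recorded plan before running it, but the ordering is the same: the plan is built first and executed only when an action asks for results.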
Learn how CSV (comma-separated values) files offer a readable format with no enforced schema, where all values are strings, and explore reading them with spark.read.csv using the header and inferSchema options.
Explore JSON as a plain text, human-readable, hierarchical data format with nested objects and arrays. Learn to read and write JSON in Databricks using Spark APIs.
Explore ORC and Parquet, binary columnar formats that enable selective reads with columnar storage, self-describing schemas, strong typing, and high compression across Spark, Hive, and data lake storage.
Discover Delta Lake, an open source storage layer that adds ACID transactions to Spark, stores data as Parquet, tracks changes with a transaction log, and enables time travel.
Define an explicit PySpark DataFrame schema using StructType for hotel bookings, convert price and total amount to decimals, and parse the booking date with a day-month-year format.
Partition data with PySpark and SQL by a country column, creating folders for each country and partitioned delta tables to speed up queries that filter by country.
Sort and optimize Delta Lake tables to boost query performance, data skipping, and compression by partitioning, ordering by booking date and hotel ID, and applying Z-ordering for faster joins.
Learn to select and rename columns in PySpark using DataFrame APIs, including aliasing and SQL expressions, demonstrated on the hotel bookings dataset.
Apply column transformations in PySpark to create a discounted amount from total amount and to uppercase room types, using withColumn and withColumns in a Databricks notebook.
Drop the discounted amount column from the hotel bookings DataFrame, then filter for premium bookings using nights, total amount, and logical operators, and use where to return USA results.
Create conditional columns in PySpark using when to label bookings as premium or standard, or to produce an is_premium boolean flag, based on nights and total_amount conditions.
Learn to use PySpark aggregate functions on a hotel bookings DataFrame, counting records and summing the total amount with aliasing, using both the select and groupBy approaches.
Learn to perform multi-aggregate analysis with Spark by grouping hotel bookings by country and computing total bookings, average nights, average booking value, and total revenue in a single dataframe.
Append additional data to a bookings DataFrame by turning a list into a DataFrame with the booking schema, then use union to add rows that include duplicates, nulls, and date and decimal values.
Identify and remove duplicates in PySpark dataframes by grouping by booking ID, counting records, filtering counts greater than one, then dropping duplicates to create a clean dataframe.
Learn to count unique values in PySpark with countDistinct and approx_count_distinct, compare their accuracy and speed, and explore the HyperLogLog algorithm for large data sets.
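To give a feel for how an approximate distinct count trades accuracy for memory, here is a minimal pure-Python sketch of a basic HyperLogLog. It is an illustration under stated assumptions, not Spark's implementation: the function name hll_estimate, the precision p=12, and the use of SHA-1 as the hash are all choices made for this sketch.

```python
import hashlib
import math


def hll_estimate(values, p=12):
    """Estimate the number of distinct values with a basic HyperLogLog."""
    m = 1 << p                       # number of registers (4096 for p=12)
    registers = [0] * m
    for v in values:
        # 64-bit hash taken from SHA-1 (a hash choice made for this sketch)
        digest = hashlib.sha1(str(v).encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        idx = h >> (64 - p)                  # first p bits pick a register
        rest = h & ((1 << (64 - p)) - 1)     # remaining 64 - p bits
        # rank = position of the leftmost 1-bit in the remaining bits
        rank = (64 - p) - rest.bit_length() + 1
        registers[idx] = max(registers[idx], rank)
    alpha = 0.7213 / (1 + 1.079 / m)         # bias constant for m >= 128
    raw = alpha * m * m / sum(2.0 ** -r for r in registers)
    zeros = registers.count(0)
    if raw <= 2.5 * m and zeros:
        # small-range correction (linear counting)
        return m * math.log(m / zeros)
    return raw


# 100,000 hypothetical booking IDs in which every ID appears twice
booking_ids = [i % 50_000 for i in range(100_000)]
exact = len(set(booking_ids))
estimate = hll_estimate(booking_ids)
relative_error = abs(estimate - exact) / exact
```

The exact count has to remember every distinct ID, while the sketch keeps only a fixed 4,096 small registers however many IDs it sees. With that many registers the typical relative error is on the order of a couple of percent.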
Count nulls per column in PySpark DataFrames, then drop rows or fill with unknown using dropna and fillna, applying them to columns like country, room type, and nights.
Snapshot the numeric statistics of a clean DataFrame with summary, including the mean, minimum, maximum, and the 25th/50th/75th percentiles, with optional 10th/90th percentiles, and apply it to selected columns.
Learn to cast and change column data types in PySpark after reading data, using withColumn and cast to double for price per night, then call printSchema to verify.
Explore PySpark date functions to extract year, month, day, quarter, and week of year, format dates, and build a date dimension DataFrame of unique dates for joining into analyses.
Demonstrate converting Unix epoch seconds into readable dates in Spark by using from_unixtime and to_date to create date columns, format dates, and compare epoch and date data types.
Learn to compute date differences in PySpark by using date_diff with current date and booking date, add a days since the last booking column, and preview results.
Join two DataFrames in PySpark by loading a hotels DataFrame and joining on hotel_id to enrich bookings with hotel details, using inner and left joins.
Explore how broadcast joins in PySpark optimize performance by broadcasting a small data frame to every executor, eliminating shuffle and reducing network IO during joins.
Master cross joins in PySpark by creating two DataFrames, room types and meal plans, and generating all combinations, using either the cross join type or the crossJoin method.
Concatenate PySpark columns with concat and withColumn to form a room meal package column, using lit for spaces. Build a hotel full column by concatenating hotel name and country name after left joins.
Develop a top hotels report by grouping by hotel full, calculating total bookings, average stay length, total revenue, and revenue per booking, ordering by revenue descending to identify top hotels.
Convert a Spark DataFrame to a Python list with collect, then transform rows to dictionaries using row.asDict for easy access to hotel names and ratings.
Create a PySpark user-defined function (UDF) with def to classify nights as short, medium, or long stays. Apply it to the hotel bookings DataFrame as a stateless UDF.
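As a rough sketch of the kind of stateless function such a UDF wraps, the plain Python function below labels a stay by its number of nights. The name classify_stay and the thresholds (up to 2 nights short, up to 6 medium) are assumptions made for illustration, not values taken from the course.

```python
from typing import Optional


def classify_stay(nights: Optional[int]) -> Optional[str]:
    """Label a stay by length. Thresholds are illustrative assumptions."""
    if nights is None:
        return None          # Spark nulls reach a Python UDF as None
    if nights <= 2:
        return "short"
    if nights <= 6:
        return "medium"
    return "long"


# Stateless: the result depends only on the input value.
labels = [classify_stay(n) for n in (1, 4, 10, None)]
```

In a notebook, a function like this could be wrapped with pyspark.sql.functions.udf using a StringType return type and applied with withColumn. That step is left out here so the sketch runs without a Spark installation.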
Learn to create an Azure account and deploy a Databricks workspace, configure a resource group and region, and use the 14-day DBU trial with hybrid compute.
Explore how Spark executes a DataFrame query, from unresolved to analyzed to optimized logical plans, through Catalyst optimization to a physical plan, with Databricks Photon Engine.
Explore how partitions, tasks, and resource usage shape Spark performance, and learn optimization factors like executor cores, memory, disk IO, shuffling, data skew, and garbage collection.
Create an all purpose compute cluster in Databricks via the Azure portal, configuring runtime, node type, termination settings, and dedicated access mode to view Spark UI partitions and shuffles.
Learn to check Spark partitions by converting a DataFrame to an RDD and counting its partitions, then verify that a Delta table write produces four files, one per partition.
Explore how Spark partitions drive parallelism, how initial and shuffle partitions differ, and how joins trigger shuffles, with a practical count by partition ID.
Learn how adaptive query execution optimizes Spark queries at runtime by dynamically adjusting shuffle partitions, selecting smarter join strategies such as broadcast joins, and addressing data skew for efficient workloads.
Learn to repartition the joined DataFrame to six partitions to reduce empty partitions and avoid creating 200 small files, and tune shuffle partitions with spark.conf.set for better performance.
Coalesce reduces the number of partitions in a data frame without a full shuffle, improving efficiency for many small partitions, but it does not rebalance data.
Explore how to use Spark’s explain to analyze query execution plans, compare efficient and inefficient queries, and understand physical, logical, and extended plans, including broadcast joins and column pruning.
Reduce Spark data shuffling by early filtering and pruning columns, optimize partitioning, and leverage storage levels like memory and disk to optimize cache performance.
Set up a streaming notebook and managed volume in Databricks, then read JSON files with a structured streaming data frame and monitor progress with checkpointing and a dashboard.
Demonstrates ingesting newly arriving data into a streaming table by generating files every five seconds in the webseals JSON folder, and deleting the checkpoint location to reset processing.
Explains exactly-once semantics and fault tolerance in Spark Structured Streaming, detailing deterministic micro batches, checkpointing, and transactional writes to prevent duplicates and data loss.
Write the streaming data frame to a Delta table using a Delta sink, configuring checkpoint and output volumes and a five-second trigger.
This course is designed to prepare students for success in the Databricks Certified Developer for Apache Spark exam by providing hands-on training in building, managing, and optimizing data pipelines on the Databricks Lakehouse Platform. Whether you're a data analyst, data engineer, or cloud analytics professional, this course will help you gain practical experience with PySpark, Spark SQL, DataFrames, structured streaming, and performance tuning, giving you the skills needed to pass the certification and work effectively with large-scale data workloads.
What You Will Learn:
Section 1: Spark Fundamentals and Databricks Essentials
Understand the core components of Apache Spark’s execution architecture.
Set up a Databricks account and explore notebooks for development.
Learn Spark Core, API architecture, and Python variables in a Spark context.
Work with DataFrames and understand transformations, actions, and lazy evaluation.
Section 2: Working with Data
Read and write data in CSV, JSON, ORC, Parquet, and Delta formats.
Define explicit DataFrame schemas, partition data, and optimize tables with Z-Ordering.
Apply column transformations, filters, conditional columns, and aggregate functions.
Section 3: Advanced Data Processing
Group, join, and combine data efficiently using Spark SQL and PySpark APIs.
Use window functions for running totals and row numbering.
Handle duplicates, nulls, and summary statistics for robust data pipelines.
Section 4: Performance Tuning and Optimization
Understand Spark query execution, partitions, caching, and storage levels.
Apply techniques like repartitioning, coalesce, and adaptive query execution (AQE).
Explain query plans and optimize performance for large-scale workloads.
Section 5: Structured Streaming
Build real-time data pipelines with structured streaming.
Perform windowed aggregations and stream-to-stream and stream-to-static joins.
Explore exactly-once semantics, fault tolerance, and output sinks in Databricks.
Section 6: Spark Connect, Deployment, and Pandas API
Understand local, client, and cluster deployment modes for Spark applications.
Leverage the Pandas API and Pandas UDFs for scalable data processing.
Who Should Take This Course:
Ideal for beginners, data analysts, cloud professionals, and aspiring Spark developers looking to gain hands-on experience with Apache Spark on Databricks and prepare for the certification exam. No prior Spark or Databricks experience is required.
By the end of this course, you'll be able to:
Confidently work with Spark DataFrames, SQL, and structured streaming.
Build, optimize, and troubleshoot data pipelines at scale.
Apply performance tuning techniques to improve Spark workloads.
Prepare thoroughly and confidently for the Databricks Certified Developer for Apache Spark exam.