
The driver submits Spark jobs, which are split into stages and tasks that run in parallel on executors. Transformations are evaluated lazily: nothing runs until an action such as count or save triggers a job.
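To make the idea of lazy execution concrete, here is a minimal pure-Python sketch. It is not Spark code: the LazyDataset class and its methods are hypothetical stand-ins that only mimic how transformations are recorded and then run when an action is called.

```python
class LazyDataset:
    """Toy stand-in for a Spark dataset (hypothetical, not a Spark API)."""

    def __init__(self, data, plan=()):
        self._data = data
        self._plan = plan  # recorded transformations, not yet executed

    def map(self, fn):
        # Transformation: only extends the plan, touches no data.
        return LazyDataset(self._data, self._plan + (("map", fn),))

    def filter(self, fn):
        # Transformation: only extends the plan, touches no data.
        return LazyDataset(self._data, self._plan + (("filter", fn),))

    def count(self):
        # Action: the recorded plan is finally applied to the data.
        rows = iter(self._data)
        for kind, fn in self._plan:
            rows = map(fn, rows) if kind == "map" else filter(fn, rows)
        return sum(1 for _ in rows)


calls = []  # records every element the map function actually sees


def double(x):
    calls.append(x)
    return x * 2


ds = LazyDataset(range(5)).map(double).filter(lambda x: x > 2)
evaluated_before = len(calls)  # still 0: nothing has run yet
n = ds.count()                 # the action triggers the work
print(evaluated_before, n, len(calls))  # prints: 0 3 5
```

In real Spark the plan is also what gets split into stages and tasks, which this toy version does not attempt to model.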
Understand how shuffle in Spark moves data across the cluster to enable aggregations and joins. Identify memory pressure, spill to disk, and network traffic as common bottlenecks.
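Switch from groupByKey to reduceByKey to minimize shuffle in Spark and boost performance. reduceByKey combines the values for each key within every partition before the shuffle, so only partial results cross the network, whereas groupByKey shuffles every record for each key.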
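The difference in shuffle volume can be sketched in plain Python. This is an illustration under simplifying assumptions, not Spark itself: the two helper functions and the sample partitions are hypothetical, and "records shuffled" is simply counted rather than measured.

```python
from collections import defaultdict


def shuffled_records_group_by(partitions):
    """groupByKey-style: every (key, value) record crosses the shuffle."""
    return sum(len(part) for part in partitions)


def shuffled_records_reduce_by(partitions):
    """reduceByKey-style: values are combined per key inside each partition
    first, so at most one record per key per partition is shuffled."""
    total = 0
    for part in partitions:
        local = defaultdict(int)
        for key, value in part:
            local[key] += value  # map-side combine
        total += len(local)
    return total


# Two hypothetical input partitions of (word, 1) pairs.
partitions = [
    [("a", 1), ("b", 1), ("a", 1), ("a", 1)],
    [("a", 1), ("b", 1), ("b", 1), ("c", 1)],
]

group_by_cost = shuffled_records_group_by(partitions)
reduce_by_cost = shuffled_records_reduce_by(partitions)
print(group_by_cost, reduce_by_cost)  # prints: 8 5
```

In PySpark, the corresponding calls on a pair RDD would be rdd.groupByKey().mapValues(sum) versus rdd.reduceByKey(lambda a, b: a + b). The saving grows with the number of repeated keys per partition.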
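Increase the number of shuffle partitions by setting spark.sql.shuffle.partitions to maximize parallelism and shrink the amount of data each task handles, which can ease skew and spill and speed up processing. Note the trade-off: more partitions mean more tasks to schedule and a larger number of small shuffle blocks sent over the network.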
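A small pure-Python sketch shows what changing the partition count does and does not change. It assumes evenly distributed integer keys and a simple key-modulo partitioner; both are illustrative assumptions, and partition_sizes is a hypothetical helper rather than a Spark function.

```python
from collections import Counter


def partition_sizes(keys, num_partitions):
    """Count records per shuffle partition, assigning each key
    to partition key % num_partitions (a simplified hash partitioner)."""
    counts = Counter(key % num_partitions for key in keys)
    return [counts.get(p, 0) for p in range(num_partitions)]


keys = list(range(1000))  # hypothetical, evenly distributed integer keys

few = partition_sizes(keys, 4)
many = partition_sizes(keys, 8)

# Largest partition shrinks, total records shuffled stays the same.
print(max(few), sum(few))    # prints: 250 1000
print(max(many), sum(many))  # prints: 125 1000

# In a PySpark session the setting itself would be changed with, e.g.:
#   spark.conf.set("spark.sql.shuffle.partitions", "400")
# (400 is an arbitrary example value, not a recommendation.)
```

Each task gets less data to hold in memory, but the total volume shuffled is unchanged, and a single very hot key would still land in one partition regardless of the partition count.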
Wrap up by reviewing the Spark architecture—driver, executor, cluster manager—and deployment modes, then cover jobs, stages, tasks, shuffle, skew, and spill to improve performance.
Spark is a powerful framework for processing large datasets in parallel. But with its complex architecture come frequent performance issues.
In my experience, it can be frustrating to search everywhere online for a resource that explains the inner workings of Spark clearly enough for you to fully understand them and address these issues. So, I created this course!
This is not a code-along course. This course assumes you already know how to code in Spark. Here, we're talking about how you resolve the performance issues that you encounter during your development journey! We will walk through all of the theory & you'll have actionable steps to take to resolve your performance issues.
In this course, we will cover:
The Apache Spark Architecture
The types of deployment modes in Apache Spark
The structure of jobs in Apache Spark
How to handle the three main performance concerns in Spark: shuffle, skew, and spill
If you don't yet know how to code in Spark, you can join my 60-minute crash course in PySpark, here on Udemy.
Let's get to work understanding why your scripts are not performing as you may hope and resolve your performance issues together. Shuffle, Skew and Spill will be concerns of the past after this course!