Overcoming Common Performance Issues in Apache Spark

Rating: 4.4 (43 reviews)

Speed up your Spark Scripts and overcome errors

Created by Kieran Keene

Last updated 4/2023

English

What you'll learn

The three main causes of performance issues in Apache Spark
How to overcome shuffle-induced performance issues in Apache Spark
How to overcome skew-induced performance issues in Apache Spark
How to overcome spill-induced performance issues in Apache Spark

Course content

1 section • 21 lectures • 39m total length

Introduction 1:01
Spark Architecture 2:33
Spark Performance & Config Changes Article 1:15
Deployment Modes in Spark 2:55
Reviewing Cluster vs Client Deployment Modes 0:56
Jobs, Stages & Tasks in Spark 3:56
Spark jobs are submitted by the driver and split into stages and tasks that run in parallel on executors, with lazy execution until actions like count or save trigger them.
Introduction to Performance Concerns in Spark 1:09
What is Shuffle? 1:47
Understand how shuffle in Spark moves data across the cluster to enable aggregations and joins. Identify memory pressure, spill to disk, and network traffic as common bottlenecks.
Further Insight into Shuffle 2:11
How do we identify Shuffle? 1:19
Resolve Shuffle: Broadcast Joins 1:19
Resolve Shuffle: ReduceBy() 2:58
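Switch from group by to reduce by (groupByKey and reduceByKey in the RDD API) to minimize shuffle in Spark and boost performance. Reduce by combines values within each partition first and shuffles only those partial summaries, whereas group by shuffles every record for each key.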
Resolve Shuffle: Config 1:35
Increase the number of shuffle partitions by setting spark.sql.shuffle.partitions to maximize parallelism and reduce the data each task handles, which can ease skew and speed up processing, while noting the higher memory and network usage.
What is Skew 1:55
More About Skew 1:26
How to Identify Skew 1:44
How to Resolve Skew 4:13
Coalesce Vs Repartitioning Article 1:46
What is Spill 1:29
How To Prevent Spill 1:29
Wrapping up! 1:01
Wrap up by reviewing the Spark architecture (driver, executor, cluster manager) and deployment modes, then cover jobs, stages, tasks, shuffle, skew, and spill to improve performance.
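
To make the group by versus reduce by point from the "Resolve Shuffle: ReduceBy()" lecture concrete, here is a minimal sketch in plain Python. It is a toy model, not Spark code and not taken from the course: the function names and the sample data are hypothetical, and a "shuffle" is modelled simply as the list of records that leave their original partition. It only illustrates why combining values per key before the shuffle sends fewer records across the cluster.

```python
from collections import defaultdict


def shuffle_group_by(partitions):
    """Toy model of a group-by shuffle: every (key, value) record is shuffled."""
    # All records leave their partition unchanged.
    shuffled = [record for part in partitions for record in part]
    grouped = defaultdict(list)
    for key, value in shuffled:
        grouped[key].append(value)
    # Aggregation happens only after the shuffle.
    totals = {key: sum(values) for key, values in grouped.items()}
    return totals, len(shuffled)


def shuffle_reduce_by(partitions):
    """Toy model of a reduce-by shuffle: each partition pre-sums its keys."""
    shuffled = []
    for part in partitions:
        local = defaultdict(int)
        for key, value in part:
            local[key] += value  # combine inside the partition first
        shuffled.extend(local.items())  # only one summary per key leaves
    totals = defaultdict(int)
    for key, value in shuffled:
        totals[key] += value  # merge the partial sums after the shuffle
    return dict(totals), len(shuffled)


# Hypothetical input: two partitions of (key, value) records.
partitions = [
    [("a", 1), ("b", 2), ("a", 3), ("a", 4)],
    [("b", 5), ("a", 6), ("b", 7), ("b", 8)],
]

group_totals, group_records = shuffle_group_by(partitions)
reduce_totals, reduce_records = shuffle_reduce_by(partitions)

print("group by :", group_totals, "records shuffled:", group_records)
print("reduce by:", reduce_totals, "records shuffled:", reduce_records)
```

Both approaches produce the same totals, but in this toy input the group-by model moves 8 records while the reduce-by model moves 4. In a real job the saving depends on how often each key repeats within a partition.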

Requirements

Apache Spark Programming

Description

Spark is a powerful framework for processing large datasets in parallel. But with its complex architecture come frequent performance issues.

In my experience, it can be frustrating to search everywhere online for a resource that explains the inner workings of Spark clearly enough for you to understand how to address these issues. So, I created this course!

This is not a code-along course. This course assumes you already know how to code in Spark. Here, we're talking about how you resolve the performance issues that you encounter during your development journey! We will walk through all of the theory & you'll have actionable steps to take to resolve your performance issues.

In this course, we will cover:

The Apache Spark Architecture
The types of deployment modes in Apache Spark
The structure of jobs in Apache Spark
How to handle the three main performance concerns in Spark

If you don't yet know how to code in Spark, you can join my 60-minute crash course in PySpark, here on Udemy.

Let's get to work understanding why your scripts are not performing as you hoped, and resolve your performance issues together. Shuffle, Skew and Spill will be concerns of the past after this course!

Who this course is for:

Spark developers looking to improve performance of their scripts