Spark Performance Tuning for Data Engineers: Part2 - Spill

Rating: 4.8 (27 reviews)

Data Engineering & Apache Spark Optimization Techniques on Databricks to Boost Speed, Reduce Cost & Handle Big Data

Created by S Santra

Last updated 3/2026

English

English [Auto]

What you'll learn

Hands-on demos based on different scenarios and use cases
Learn the nuances of Spark performance tuning
Get detailed insights into different operations in Spark
Get a clear understanding of how Spark configs work together and which combinations give optimal results
Learn to identify and solve bottlenecks and errors in your Spark application

Course content

4 sections • 15 lectures • 3h 9m total length

Introduction (4:37)
LinkedIn - www.linkedin.com/in/suprobho-santra
What is Optimization (5:41)
What is Benchmarking (8:27)
Benchmark by running baseline and optimized code in a controlled environment, measure metrics such as response time, throughput, and memory, and use a noop write to ensure repeatable, consistent results.
Suggest for Upcoming Courses (0:09)
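
The benchmarking approach described above (run baseline and optimized code the same way, several times, and compare timings) can be sketched with a small timing harness. This is a minimal sketch, not the course's own code: the `benchmark` and `summarize` helpers are hypothetical names, and the Spark call is shown only in a comment because it needs a live session. The `noop` format is Spark's built-in sink that executes the full query and discards the output, so the timing is not affected by storage writes.

```python
import statistics
import time
from typing import Callable, Dict, List


def benchmark(action: Callable[[], object], runs: int = 3) -> List[float]:
    """Run `action` `runs` times and return the wall-clock seconds of each run."""
    if runs < 1:
        raise ValueError("runs must be at least 1")
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        action()
        timings.append(time.perf_counter() - start)
    return timings


def summarize(timings: List[float]) -> Dict[str, float]:
    """Reduce a list of timings to a few comparable numbers."""
    return {
        "runs": len(timings),
        "min_s": min(timings),
        "median_s": statistics.median(timings),
        "max_s": max(timings),
    }


# With a live SparkSession and two DataFrames (baseline_df and optimized_df
# are hypothetical names), the same harness would time a noop write:
#
#   baseline = benchmark(
#       lambda: baseline_df.write.format("noop").mode("overwrite").save())
#   optimized = benchmark(
#       lambda: optimized_df.write.format("noop").mode("overwrite").save())
#   print(summarize(baseline), summarize(optimized))
```

Comparing the median of several runs, rather than a single run, reduces the influence of warm-up and cluster noise on the result.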

Reading Spark UI (21:16)
Explore Spark UI fundamentals using a simple notebook, including executors, drivers, memory, partitions, and how stages and jobs map to read, filter, and project operations.
Physical Plans & DAG - Part 1 (21:45)
Explore how Spark converts a submitted query into an unresolved logical plan, validates it against the catalog, and generates optimized logical and physical plans with Catalyst and adaptive query execution.
Physical Plans & DAG - Part 2 (17:33)

Introduction to Data Spill (10:11)
Understand data spill in Spark by seeing how large partitions exceed memory and spill to disk, and learn how maintaining moderate partitions prevents spill.
Spark Executor memory allocation (19:49)
Data Spill Scenario (22:50)
Spark Memory Fraction config (8:30)
Increase spark.memory.fraction to 0.8 for about 30 percent more memory to address the data spill problem, and benchmark on a Databricks cluster.
Large VM (13:59)
Tuning Shuffle Partitions (15:01)
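
Two of the levers named in the lectures above, raising spark.memory.fraction and keeping partitions moderate, come down to simple arithmetic. The sketch below is a rough estimate, not Spark's exact accounting. It assumes the unified (execution plus storage) region is roughly (executor heap minus 300 MiB reserved) multiplied by spark.memory.fraction, with Spark's default fraction of 0.6. It sizes shuffle partitions against a 128 MiB target, which is a common rule of thumb and not a figure from the course. The function names, the 8 GiB heap, and the 100 GiB shuffle are illustrative assumptions.

```python
import math

MIB = 1024 * 1024
RESERVED_MIB = 300  # fixed amount Spark sets aside before applying the fraction


def unified_memory_mib(executor_heap_mib: float, memory_fraction: float = 0.6) -> float:
    """Estimate the execution + storage region of one executor, in MiB."""
    if not 0.0 < memory_fraction < 1.0:
        raise ValueError("memory_fraction must be between 0 and 1")
    usable = executor_heap_mib - RESERVED_MIB
    if usable <= 0:
        raise ValueError("executor heap must exceed the reserved memory")
    return usable * memory_fraction


def partitions_needed(total_bytes: int, target_partition_bytes: int = 128 * MIB) -> int:
    """Number of partitions that keeps each one near the target size."""
    if total_bytes < 0 or target_partition_bytes <= 0:
        raise ValueError("sizes must be positive")
    return max(1, math.ceil(total_bytes / target_partition_bytes))


# Illustrative numbers only: an 8 GiB executor heap and a 100 GiB shuffle.
heap_mib = 8 * 1024
before = unified_memory_mib(heap_mib, 0.6)
after = unified_memory_mib(heap_mib, 0.8)
gain_pct = (after / before - 1) * 100
print(f"unified memory: {before:.0f} MiB -> {after:.0f} MiB (+{gain_pct:.0f}%)")

shuffle_bytes = 100 * 1024 * MIB
print(f"avg partition at 200 partitions: {shuffle_bytes / 200 / MIB:.0f} MiB")
print(f"partitions for ~128 MiB each: {partitions_needed(shuffle_bytes)}")
```

Under these assumptions, moving the fraction from 0.6 to 0.8 enlarges the unified region by about one third, in line with the "about 30 percent" quoted above. With the default of 200 shuffle partitions (spark.sql.shuffle.partitions), a 100 GiB shuffle averages 512 MiB per partition, while a 128 MiB target calls for 800 partitions.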

Requirements

Basic Spark Architecture & internals
Spark programming in PySpark or Scala
Databricks Cloud Platform

Description

Unlock the true potential of Apache Spark by mastering storage-related performance tuning techniques. This hands-on course is packed with real-world scenarios, guided demos, and practical use cases that will help you fine-tune Spark storage strategies for speed, efficiency, and scalability.

This course is ideal for intermediate data engineers and Spark developers, as well as aspiring architects, who want to optimize Spark jobs, reduce resource costs, and ensure fast, reliable performance for large-scale data applications.

What You’ll Learn

1. Understand how Apache Spark handles storage internally: memory vs disk

2. Learn when and how to use Spark caching and persistence effectively

3. Compare and choose the right storage levels: MEMORY_ONLY, MEMORY_AND_DISK, etc.

4. Use real-world examples and hands-on demos to benchmark storage decisions

5. Learn how to monitor storage metrics using the Spark UI

6. Handle memory spills, disk I/O bottlenecks, and storage tuning in cluster environments

7. Apply best practices for storage optimization in cloud and on-prem Spark clusters

Why Take This Course?

100% Hands-on: Focused on practical implementation, not just theory
Designed for Data Engineers, Spark Developers, and Big Data Practitioners
Covers both foundational concepts and advanced tuning techniques
Teaches how to measure performance gains using real metrics
Helps you make cost-efficient decisions for big data storage

Tools & Technologies Covered

Apache Spark (2.x and 3.x)
Databricks
Spark UI
HDFS, data lake (for storage scenarios)

Who this course is for:

Data engineers and Spark developers, as well as aspiring architects, curious about advanced performance tuning and optimization techniques

Course content

Introduction: 4 lectures • 19min

Important Concepts: 2 lectures • 20min

Learn to Diagnose: 3 lectures • 1hr 1min

Optimizing Data Spill: 6 lectures • 1hr 30min