Strategies for Parallelizing LLMs Masterclass

Rating: 4.3 (25 reviews)

Mastering LLM Parallelism: Scale Large Language Models with DeepSpeed & Multi-GPU Systems

Created by Paulo Dichone | Software Engineer, AWS Cloud Practitioner & Instructor

Last updated 3/2025

English

What you'll learn

Understand and Apply Parallelism Strategies for LLMs
Implement Distributed Training with DeepSpeed
Deploy and Manage LLMs on Multi-GPU Systems
Enhance Fault Tolerance and Scalability in LLM Training

Course content

17 sections • 99 lectures • 8h 41m total length

Introduction & What Is This Course About • 1:50
Learn strategies for parallelizing large language models and training massive LLMs at scale, using data, model, and pipeline parallelism techniques, plus practical PyTorch and ML library skills.
Course Structure • 1:03
Explore the course structure, which blends theory and hands-on practice, starting with the fundamental concepts and core lingo of strategies for parallelizing LLMs, then moving toward practical application.
DEMO - What You'll Build in This Course • 3:53
Explore practical parallelism strategies for training large language models, from transformer architecture basics to data and tensor parallelism and activation recomputation, with hands-on experience on single GPUs, CPUs, and multi-GPU systems.

What is Parallelism and Why it Matters • 4:28
Learn how parallelism powers large language model training by splitting tasks across multi-GPU clusters, reducing training time for massive models like GPT-3 with tens of terabytes of data.
Understanding the Single GPU Strategy • 4:40
Explore the single GPU strategy for LLM training, highlighting memory limits, sequential processing, bottlenecks, and how parallelism boosts efficiency and scalability.
Understanding the Parallel Strategy and Advantages • 5:12
Distribute workloads across GPUs in the multi-GPU parallel strategy, expanding memory and throughput while managing communication overhead via interconnects for scalable distributed training.
Parallelism vs Single GPU - Summary • 4:06
Show how single-GPU sequential processing contrasts with multi-GPU parallelism for LLMs, delivering faster training, better resource utilization, and a GPT-3 scale example with 175 billion parameters.

IT Fundamentals - Introduction • 1:05
Explore optional IT fundamentals, including CPUs, GPUs, memory units, and RAM, to understand how computation powers parallelizing LLMs.
What is a Computer - CPU and RAM Overview • 6:54
Define computers as input processors that convert input into output using hardware and software, with the CPU and RAM driving fast data processing to deliver smooth, usable results.
Data Storage and File Systems • 3:55
Explore how computers store data in RAM for fast processing and how file systems organize formats like PNG, JPEG, and PDF across hard drives, SSDs, and cloud storage.
OS File System Structure • 2:42
Explore a visual overview of a typical file system, detailing directories and a tree-like hierarchy. See how the system enforces access permissions for files and folders to store data.
LAN Introduction • 10:31
Understand how local area networks connect computers and devices via switches, routers, and modems, with wired and wireless links and WANs enabling internet access.
What is the Internet • 7:49
Explore how the internet operates as a global, distributed network of computers and servers that uses standardized protocols, enabling fast, resilient communication across unique IP addresses.
Internet Communication Deep Dive • 4:36
See how internet communication happens between a client and a server using IP addresses and the domain name system to translate names into addresses.
Understanding Servers and Clients • 6:37
Understand how servers provide resources to clients through a request-response model governed by the HTTP and FTP protocols, supporting web, file, and database servers across the internet.
GPUs - Overview • 1:54
Examine the hardware trinity of GPU, CPU, and RAM and their symbiotic role in parallel workloads for training large language models, with the GPU as the primary engine.

GPU Architecture for LLM Training • 4:46
Explain GPU architecture for large language model training, detailing thousands of compute cores, CUDA and tensor cores, memory hierarchy from VRAM to L1 cache, and attention and matrix operations.
Why this Architecture Excels • 11:37
Explore why this architecture excels for large language model training, highlighting the hardware trinity, parallelism, tensor cores, and memory bandwidth.

Machine and Deep Learning Introduction • 1:31
Explore the fundamentals of machine learning and deep learning, and how they underpin AI, as a basis for understanding the strategies for parallelizing LLMs.
Deep and Machine Learning - Overview and Breakdown • 9:26
Explore the fundamentals of deep learning and machine learning, including neural networks, data handling, feature extraction, and the model life cycle from training to deployment.
Deep Learning Key Aspects • 10:49
Understand deep learning's key aspects: automatic feature learning, hierarchical structure, and end-to-end learning from raw input to output. Contrast with traditional stepwise methods through driving analogies.
Deep Neural Networks - Deep Dive • 9:22
Explore how deep neural networks process data through input, hidden, and output layers, using activation functions and backpropagation to refine weights for accurate predictions.
The Single Neuron Computation - Deep Dive • 5:38
Explore how a single neuron computes output by summing inputs multiplied by weights, adding a bias, and applying an activation function across input, hidden, and output layers.
Weights • 2:58
Explore how weights function in neural networks through a restaurant analogy, adjusting recipe proportions from base stock to spices based on customer feedback to update the weights.
Activation Functions - Deep Dive • 6:14
Explore activation functions as quality checks in neural networks, comparing ReLU, sigmoid, and softmax, and map weights, biases, and activation to a restaurant workflow.
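
The single-neuron computation and the activation functions named in the lectures above can be sketched in a few lines. This is an illustrative sketch, not course material: the function names and the toy inputs, weights, and bias are assumptions chosen for the example.

```python
import math

def relu(z):
    # Passes positive values through, clamps negatives to zero.
    return max(0.0, z)

def sigmoid(z):
    # Squashes any real number into the open interval (0, 1).
    return 1.0 / (1.0 + math.exp(-z))

def softmax(zs):
    # Turns a list of scores into probabilities that sum to 1.
    # Subtracting the max keeps exp() numerically stable.
    m = max(zs)
    exps = [math.exp(z - m) for z in zs]
    total = sum(exps)
    return [e / total for e in exps]

def neuron(inputs, weights, bias, activation):
    # Weighted sum of inputs, plus bias, then the activation function.
    z = sum(x * w for x, w in zip(inputs, weights)) + bias
    return activation(z)

# Hypothetical toy values, chosen only for illustration.
inputs = [1.0, 2.0, 3.0]
weights = [0.5, -1.0, 0.25]
bias = 0.1

print("ReLU neuron:   ", neuron(inputs, weights, bias, relu))
print("Sigmoid neuron:", neuron(inputs, weights, bias, sigmoid))
print("Softmax scores:", softmax([2.0, 1.0, 0.1]))
```

Here the pre-activation value is negative, so the ReLU neuron outputs zero while the sigmoid neuron still outputs a small positive value.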
Deep Learning - Summary • 1:59
Summarizes deep learning fundamentals: neural networks with input, hidden, and output layers, bias and weights, activation functions, and how learning transforms data into outputs for pattern recognition, classification, and prediction.
Machine Learning Introduction - ML vs DL • 4:52
Explore how machine learning is a broad framework with deep learning as a subset, contrasting manual feature engineering and data needs with automatic learning and natural language processing applications.
Learning Types and Full ML & DL Analogy Example • 5:43
Explore supervised, unsupervised, and reinforcement learning with real-world tasks like fraud detection, recommendations, image recognition, natural language, and classification and predictions, plus DL vs ML contrasts.
DL and ML Comparative Capabilities - Summary • 4:28
Explore the three-layer framework of deep learning, machine learning, and artificial intelligence, comparing neural networks, pattern recognition, and data-based learning, with applications from image and language processing to autonomous systems.

Introduction • 0:34
Explore the transformer architecture and why it underpins large language models, offering a quick overview of how it works to equip you for strategies to parallelize large language models.
The Transformer Architecture Fundamentals • 7:33
Explore how the transformer architecture, a deep learning model for text and natural language processing, uses tokenization, embeddings, positional encoding, and self-attention to capture long-range dependencies.
The Self-Attention Mechanism - Analogy • 5:25
Explore how self-attention in transformer models uses query, key, and value to compute attention scores, apply weighted sums, and produce contextual representations.
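
The query, key, and value steps described above can be sketched as scaled dot-product attention in plain Python. This is a minimal single-head sketch under stated assumptions: the helper names, the three 2-dimensional toy token vectors, and the identity projection matrices are all hypothetical and chosen only to keep the example small.

```python
import math

def matmul(a, b):
    # Plain-Python matrix product of two lists of rows.
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(a):
    return [list(col) for col in zip(*a)]

def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(x, w_q, w_k, w_v):
    # Project the token vectors into queries, keys, and values.
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
    d_k = len(k[0])
    # Attention scores: how strongly each token's query matches each key.
    scores = matmul(q, transpose(k))
    # Scale by sqrt(d_k), then normalize each row into weights that sum to 1.
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    # Each output row is a weighted sum of the value vectors.
    return matmul(weights, v), weights

# Hypothetical toy input: 3 tokens, 2 dimensions each.
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]  # stand-in for learned projections

output, weights = self_attention(x, identity, identity, identity)
for row in weights:
    print([round(w, 3) for w in row])
```

Each printed row shows how much one token attends to every token in the sequence, including itself.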
The Transformer Architecture Animation • 4:16
Explore how the transformer architecture uses embeddings, self-attention, information exchange, and processing with context to generate output and translate English to Spanish, illustrated by an animation.
The Transformer Library - Deep Dive • 4:26
Explore the Hugging Face transformers library as a modular toolkit for accessing state-of-the-art transformer models, tokenizers, and pipelines for tasks like classification, question answering, and generation.

Parallel Computing Introduction - Key Concepts • 4:27
Master the fundamentals of parallel computing, including nodes, CPU and GPU memory, and how multiple nodes with GPUs train large language models, guided by Amdahl's and Gustafson's laws.
Parallel Computing Fundamentals and Scaling Laws - Deep Dive • 8:45
Learn parallel computing fundamentals with CPU, memory, and GPU architectures, shared memory, and data transfer, and apply Amdahl's law and Gustafson's law to scale training of large language models.
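
The two scaling laws named in this section can be compared directly. The sketch below uses the standard textbook forms of Amdahl's law (fixed problem size) and Gustafson's law (problem size grows with the number of workers); the 95% parallel fraction is a hypothetical figure, not one taken from the course.

```python
def amdahl_speedup(parallel_fraction, workers):
    # Fixed workload: the serial part limits the achievable speedup.
    serial = 1.0 - parallel_fraction
    return 1.0 / (serial + parallel_fraction / workers)

def gustafson_speedup(parallel_fraction, workers):
    # Scaled workload: the parallel part grows with the number of workers.
    serial = 1.0 - parallel_fraction
    return serial + parallel_fraction * workers

# Hypothetical assumption: 95% of the work can be parallelized.
p = 0.95
for n in (1, 8, 64, 512):
    print(f"{n:>4} workers  Amdahl: {amdahl_speedup(p, n):6.2f}x  "
          f"Gustafson: {gustafson_speedup(p, n):7.2f}x")

# Amdahl's upper bound as the worker count grows without limit.
print("Amdahl limit:", 1.0 / (1.0 - p))
```

Under Amdahl's law the speedup flattens toward 1 / (1 - p) no matter how many workers are added, while Gustafson's law keeps growing because the workload is assumed to grow with the hardware.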

Types of Parallelism in LLM Training • 1:58
Explore the types of parallelism in large language model training, including data, model, pipeline, and tensor parallelism, along with bottlenecks, communication models, and memory considerations for scalable training.
Data Parallelism - How It Works • 11:30
Explore data parallelism, distributing training data across multiple GPUs with identical model copies, processing different batches in parallel, and synchronizing gradients via all-reduce to update parameters.
Data Parallelism Advantages for LLM Training • 0:59
Data parallelism offers near-linear throughput scaling with more GPUs, enables processing trillions of tokens, provides memory efficiency by storing one model copy per GPU, and remains straightforward to implement.
Real-world Example - Data Parallelism in GPT-3 Training • 5:00
Leverage data parallelism to train GPT-3 across clusters of thousands of GPUs, processing trillions of tokens from millions of documents with high-speed interconnects and all-reduce gradient synchronization, so that training takes months instead of years.
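
The data-parallel loop described in these lectures (identical model copies, different data shards, gradients averaged with all-reduce) can be simulated on one machine without any GPU. This is a toy sketch, not the course's code: the one-parameter linear model, the dataset, and the `all_reduce_mean` helper are hypothetical stand-ins for a real model and a real collective operation.

```python
def grad_mse(w, xs, ys):
    # Gradient of mean((w * x - y)^2) with respect to w.
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)

def all_reduce_mean(values):
    # Stand-in for an all-reduce: every worker receives the same average.
    avg = sum(values) / len(values)
    return [avg] * len(values)

# Hypothetical toy dataset following y = 2x.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [2.0 * x for x in xs]

num_workers = 4
shard = len(xs) // num_workers
replicas = [0.0] * num_workers  # every worker holds an identical copy of w
lr = 0.01

for step in range(100):
    # Each worker computes a gradient on its own shard only.
    local_grads = [
        grad_mse(replicas[r], xs[r * shard:(r + 1) * shard], ys[r * shard:(r + 1) * shard])
        for r in range(num_workers)
    ]
    # Synchronize, then apply the same update on every replica.
    synced = all_reduce_mean(local_grads)
    replicas = [w - lr * g for w, g in zip(replicas, synced)]

print("Replica weights after training:", replicas)
```

Because the shards are equal in size, the averaged gradient matches the full-batch gradient, so all replicas stay identical and converge toward the true weight of 2.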
Model Parallelism and Tensor Parallelism and Layer Parallelism - Deep Dive • 8:27
Explore model parallelism for training large language models by distributing the model across multiple GPUs, using tensor parallelism with horizontal splitting and layer parallelism with vertical splitting.
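
One common way to split a single layer across devices is to partition its weight matrix by columns, let each device compute its slice of the output, and concatenate the results. The sketch below illustrates that idea with plain lists; the column-wise split, the helper names, and the two simulated devices are assumptions made for the example, and the course may present the split differently.

```python
def matmul(a, b):
    # Plain-Python matrix product of two lists of rows.
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def split_columns(w, parts):
    # Give each simulated device an equal block of the weight columns.
    step = len(w[0]) // parts
    return [[row[i * step:(i + 1) * step] for row in w] for i in range(parts)]

def concat_columns(blocks):
    # Stitch the per-device outputs back together, row by row.
    return [sum((block[r] for block in blocks), []) for r in range(len(blocks[0]))]

# Hypothetical toy layer: 2 inputs of width 3, weight matrix 3 x 4.
x = [[1, 2, 3], [4, 5, 6]]
w = [[1, 0, 2, 1],
     [0, 1, 1, 3],
     [2, 1, 0, 1]]

# Each "device" holds half the columns and computes its slice of the output.
shards = split_columns(w, parts=2)
partial_outputs = [matmul(x, shard) for shard in shards]
parallel_result = concat_columns(partial_outputs)

print("Sharded result:    ", parallel_result)
print("Single-device result:", matmul(x, w))
```

Each device stores only half of the weights, and the concatenated output is identical to the one a single device would produce.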
LLM Relevance and Implementation • 2:05
Master model parallelism for scaling language models with billions to trillions of parameters, addressing memory needs for parameters, optimizer states, and activations while highlighting graph partitioning and load balancing.
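
A rough back-of-the-envelope estimate shows why parameters and optimizer states alone can exceed a single GPU. The sketch assumes a commonly cited accounting for mixed-precision training with Adam of about 16 bytes per parameter; that breakdown, the 80 GB device size, and the function names are assumptions for illustration, and activations are deliberately left out.

```python
# Assumed per-parameter cost for mixed-precision Adam training (bytes).
BYTES_PER_PARAM = {
    "fp16_weights": 2,
    "fp16_gradients": 2,
    "fp32_master_weights": 4,
    "fp32_adam_momentum": 4,
    "fp32_adam_variance": 4,
}

def training_state_bytes(num_params):
    # Memory for weights, gradients, and optimizer states (no activations).
    return num_params * sum(BYTES_PER_PARAM.values())

def min_devices(num_params, device_bytes):
    # Ceiling division: devices needed just to hold the training state.
    total = training_state_bytes(num_params)
    return -(-total // device_bytes)

params = 175_000_000_000          # GPT-3 scale, as cited in the course outline
device_bytes = 80 * 10**9         # hypothetical 80 GB accelerator

print("Training state:", training_state_bytes(params) / 1e12, "TB")
print("Minimum devices:", min_devices(params, device_bytes))
```

Under these assumptions the training state alone is several terabytes, which is why the model has to be partitioned across many devices before activations are even considered.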
Model vs Data Parallelism • 8:45
Compare data and model parallelism to reveal their tradeoffs. Data parallelism boosts throughput with full model copies and infrequent gradient synchronization, while model parallelism enables large models with frequent communication.
Key Differences Highlighted - Data vs Model Parallelism • 2:31
Compare data parallelism and model parallelism using a chef and recipe analogy, highlighting independent parallel work versus sequential handoffs, specialist roles, and memory and scaling limits.
Data vs Model Parallelism • 1:47
Demonstrates data and model parallelism by visualizing how identical model copies on multiple GPUs process different data batches, and how a single model splits across GPUs for sequential layers.
Hybrid Parallelism - Animation • 4:03
Hybrid parallelism combines model and data parallelism to overcome size limits and enable multidimensional scaling, improving hardware utilization and memory efficiency via pipeline parallelism and micro-batching.
Hybrid Parallelism - What is It and Motivation • 2:06
Combine model and data parallelism to form hybrid parallelism, using a restaurant franchise analogy to show specialized stations processing batches of orders to maximize throughput.
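
In a hybrid setup, each GPU can be thought of as a point on a grid with one axis per parallelism type. The sketch below maps a flat rank to data, pipeline, and tensor coordinates. The ordering (tensor index varies fastest) and the 2 x 2 x 2 layout are assumptions for illustration; real frameworks choose their own layouts.

```python
from collections import defaultdict

def rank_to_coords(rank, dp_size, pp_size, tp_size):
    # Assumed layout: tensor index varies fastest, data index slowest.
    tp = rank % tp_size
    pp = (rank // tp_size) % pp_size
    dp = rank // (tp_size * pp_size)
    return dp, pp, tp

def tensor_groups(dp_size, pp_size, tp_size):
    # Ranks that share a data replica and pipeline stage split one layer together.
    groups = defaultdict(list)
    for rank in range(dp_size * pp_size * tp_size):
        dp, pp, _ = rank_to_coords(rank, dp_size, pp_size, tp_size)
        groups[(dp, pp)].append(rank)
    return dict(groups)

# Hypothetical cluster: 2 data replicas x 2 pipeline stages x 2 tensor shards.
dp_size, pp_size, tp_size = 2, 2, 2
for rank in range(dp_size * pp_size * tp_size):
    print(rank, rank_to_coords(rank, dp_size, pp_size, tp_size))

print(tensor_groups(dp_size, pp_size, tp_size))
```

The total number of GPUs is the product of the three sizes, which is what lets hybrid parallelism scale along several dimensions at once.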

Pipeline Parallelism Overview • 3:04
Learn pipeline parallelism, a distributed training technique that divides the model's layers across multiple GPUs or TPUs, enabling overlapped computation across mini-batches and reducing device idle time.
Pipeline Parallelism Key Concepts and How it Works - Step by Step • 6:42
Understand pipeline parallelism by partitioning the model into sequential stages across GPUs and using mini-batch processing with micro-batching to reduce bubbles and idle time.
Pipeline Bubbles Key Concepts • 3:03
Explore pipeline parallelism concepts, highlighting micro-batching and pipeline bubbles, including warm up and cool down phases that create idle times and impact efficiency across stages.
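
The effect of micro-batching on pipeline bubbles can be seen by drawing a simple timeline. The sketch below models only the forward pass and assumes every stage takes one time step per micro-batch; both are simplifications, and the stage and micro-batch counts are hypothetical.

```python
from fractions import Fraction

def forward_timeline(num_stages, num_micro_batches):
    # Stage s can start micro-batch m only after stage s - 1 has finished it,
    # so micro-batch m occupies stage s at time step s + m.
    total_steps = num_stages + num_micro_batches - 1
    grid = [[None] * total_steps for _ in range(num_stages)]
    for stage in range(num_stages):
        for mb in range(num_micro_batches):
            grid[stage][stage + mb] = mb
    return grid

def bubble_fraction(grid):
    # Share of (stage, time step) slots in which a stage sits idle.
    slots = sum(len(row) for row in grid)
    idle = sum(cell is None for row in grid for cell in row)
    return Fraction(idle, slots)

grid = forward_timeline(num_stages=4, num_micro_batches=8)
for stage, row in enumerate(grid):
    print(f"stage {stage}: " + " ".join("." if c is None else str(c) for c in row))

print("Idle fraction with 8 micro-batches:", bubble_fraction(grid))
print("Idle fraction with 1 micro-batch:  ", bubble_fraction(forward_timeline(4, 1)))
```

The dots at the start and end of each row are the warm-up and cool-down bubbles. In this model the idle fraction equals (stages - 1) / (micro-batches + stages - 1), so adding micro-batches shrinks it.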
Pipeline Schedules Key Concepts • 3:44
Explore pipeline schedules for parallelizing LLMs, comparing GPipe and 1F1B strategies. Learn how forward and backward passes, memory use, and micro-batch flow affect efficiency, especially for deep models.
Activation Recomputation - Overview and Introduction • 1:31
Explore activation recomputation, or gradient checkpointing, which trades computation for memory by discarding forward activations and recomputing them in the backward pass to train large language models.
Neural Network and Activation and Backward and Forward Passes - Full Dive • 7:18
Understand neural networks with neurons and trainable weights, where forward passes produce activations that feed layers and backward passes propagate gradients through backpropagation with memory of activations, including activation recomputation.
Understanding Activation Recomputation vs Standard Training - Deep Dive • 9:35
Explore activation recomputation to reduce memory by storing checkpoints and recomputing activations during the backward pass, and compare standard training with gradient checkpointing for larger language models.
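
The trade between memory and computation can be demonstrated on a toy chain of layers. The sketch below is a pure-Python simulation, not a framework API: each layer is a scalar tanh(w * x), and the layer count, weights, and checkpoint interval are hypothetical. It computes the same input gradient twice, once keeping every activation and once keeping only periodic checkpoints.

```python
import math

def forward_layer(w, x):
    return math.tanh(w * x)

def backward_layer(w, x, grad_out):
    # Derivative of tanh(w * x) with respect to x, times the incoming gradient.
    y = math.tanh(w * x)
    return grad_out * w * (1.0 - y * y)

def grad_standard(weights, x):
    # Standard training: keep the input of every layer for the backward pass.
    acts = [x]
    for w in weights:
        acts.append(forward_layer(w, acts[-1]))
    grad = 1.0
    for i in reversed(range(len(weights))):
        grad = backward_layer(weights[i], acts[i], grad)
    return grad, len(acts)

def grad_checkpointed(weights, x, every):
    # Forward pass: keep only the input of every `every`-th layer.
    checkpoints = {}
    h = x
    for i, w in enumerate(weights):
        if i % every == 0:
            checkpoints[i] = h
        h = forward_layer(w, h)
    grad, peak = 1.0, len(checkpoints)
    # Backward pass: rebuild one segment at a time from its checkpoint.
    for start in sorted(checkpoints, reverse=True):
        end = min(start + every, len(weights))
        segment = [checkpoints[start]]
        for i in range(start, end - 1):
            segment.append(forward_layer(weights[i], segment[-1]))
        peak = max(peak, len(checkpoints) + len(segment))
        for i in reversed(range(start, end)):
            grad = backward_layer(weights[i], segment[i - start], grad)
    return grad, peak

weights = [0.9 + 0.01 * i for i in range(16)]  # hypothetical 16-layer chain
g_std, stored_std = grad_standard(weights, 0.5)
g_ckpt, stored_ckpt = grad_checkpointed(weights, 0.5, every=4)

print("Gradients match:", math.isclose(g_std, g_ckpt))
print("Values held at peak:", stored_std, "vs", stored_ckpt)
```

The gradient is the same in both cases, while the checkpointed version holds fewer values at any one time at the cost of re-running part of the forward pass.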
Demo - Activation Recomputation Visualization • 2:56
This visualization compares standard training and activation recomputation, showing how recomputation lowers memory usage by storing checkpoints and recomputing activations during backprop, at the cost of extra computation.
Activation Recomputation vs Standard Approach • 3:43
Compare activation recomputation with the standard approach using a road trip analogy. Learn how storing only checkpoints instead of every activation reduces memory use at the cost of extra recomputation.
Benefits of Activation Recomputation and Implementation Strategies • 9:11
Explore how activation recomputation dramatically reduces memory during LLM training, enabling GPT-4-scale models and larger batch sizes, using optimal checkpointing, hierarchical checkpointing, and activation compression.
Pipeline Parallelism Implementation Frameworks and Key Takeaways • 4:11
Explore pipeline parallelism frameworks like GPipe, PipeDream, Megatron-LM, and the Zero Redundancy Optimizer (ZeRO) to understand combining data, tensor, and pipeline parallelism for scalable large-model training.

Requirements

Basic knowledge of Python programming and deep learning concepts.
Familiarity with PyTorch or similar frameworks is helpful but not required.
Access to a GPU-enabled environment (e.g., Colab) for hands-on sections. Don’t worry, we’ll guide you through setup!

Description

Mastering LLM Parallelism: Scale Large Language Models with DeepSpeed & Multi-GPU Systems

Are you ready to unlock the full potential of large language models (LLMs) and train them at scale?

In this comprehensive course, you’ll dive deep into the world of parallelism strategies, learning how to efficiently train massive LLMs using cutting-edge techniques like data, model, pipeline, and tensor parallelism.

Whether you’re a machine learning engineer, data scientist, or AI enthusiast, this course will equip you with the skills to harness multi-GPU systems and optimize LLM training with DeepSpeed.

What You’ll Learn

Foundational Knowledge: Start with the essentials of IT concepts, GPU architecture, deep learning, and LLMs (Sections 3-7). Understand the fundamentals of parallel computing and why parallelism is critical for training large-scale models (Section 8).
Types of Parallelism: Explore the core parallelism strategies for LLMs—data, model, pipeline, and tensor parallelism (Sections 9-11). Learn the theory and practical applications of each method to scale your models effectively.
Hands-On Implementation: Get hands-on with DeepSpeed, a leading framework for distributed training. Implement data parallelism on the WikiText dataset and master pipeline parallelism strategies (Sections 12-13). Deploy your models on RunPod, a multi-GPU cloud platform, and see parallelism in action (Section 14).
Fault Tolerance & Scalability: Discover strategies to ensure fault tolerance and scalability in distributed LLM training, including advanced checkpointing techniques (Section 15).
Advanced Topics & Trends: Stay ahead of the curve with emerging trends and advanced topics in LLM parallelism, preparing you for the future of AI (Section 16).
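
For the DeepSpeed-based hands-on sections mentioned in the list above, training is typically driven by a JSON configuration. The sketch below builds a small configuration in Python and prints it as JSON. It is an illustrative assumption, not the course's actual configuration: the GPU count, batch sizes, and ZeRO stage are hypothetical values, and only a handful of well-known configuration keys are used.

```python
import json

def build_config(world_size, micro_batch_per_gpu, grad_accum_steps):
    # DeepSpeed expects the global batch size to equal
    # micro-batch per GPU x gradient accumulation steps x number of GPUs.
    return {
        "train_batch_size": micro_batch_per_gpu * grad_accum_steps * world_size,
        "train_micro_batch_size_per_gpu": micro_batch_per_gpu,
        "gradient_accumulation_steps": grad_accum_steps,
        "fp16": {"enabled": True},
        "zero_optimization": {"stage": 2},
    }

# Hypothetical setup: 4 GPUs, 8 samples per GPU per step, 2 accumulation steps.
config = build_config(world_size=4, micro_batch_per_gpu=8, grad_accum_steps=2)
config_json = json.dumps(config, indent=2)
print(config_json)
```

The printed JSON could be saved to a file and passed to a DeepSpeed launch; the exact file name and launch command depend on the training script and are not shown here.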

Why Take This Course?

Practical, Hands-On Focus: Build real-world skills by implementing parallelism strategies with DeepSpeed and deploying on RunPod’s multi-GPU systems.
Comprehensive Deep Dives: Each section includes in-depth explanations and practical examples, ensuring you understand both the "why" and the "how" of LLM parallelism.
Scalable Solutions: Learn techniques to train LLMs efficiently, whether you’re working with a single GPU or a distributed cluster.

Who This Course Is For

Machine learning engineers and data scientists looking to scale LLM training.
AI researchers interested in distributed computing and parallelism strategies.
Developers and engineers working with multi-GPU systems who want to optimize LLM performance.
Anyone with a basic understanding of deep learning and Python who wants to master advanced LLM training techniques.

Prerequisites

Basic knowledge of Python programming and deep learning concepts.
Familiarity with PyTorch or similar frameworks is helpful but not required.
Access to a GPU-enabled environment (e.g., RunPod) for hands-on sections. Don’t worry, we’ll guide you through setup!

Course content

Introduction • 3 lectures • 7min

Course Source Code and Resources • 2 lectures • 1min

Strategies for Parallelizing LLMs - Deep Dive • 4 lectures • 18min

IT Fundamental Concepts • 9 lectures • 46min

GPU Architecture for LLM Training Deep Dive • 2 lectures • 16min

Deep and Machine Learning - Deep Dive • 11 lectures • 1hr 3min

Large Language Models - Fundamentals of AI and LLMs • 5 lectures • 22min

Parallel Computing Fundamentals & Parallelism in LLM Training • 2 lectures • 13min

Types of Parallelism in LLM Training - Data - Model and Hybrid Parallelism • 11 lectures • 49min

Types of Parallelism - Pipeline and Tensor Parallelism • 11 lectures • 55min
