Advanced Reinforcement Learning in Python: from DQN to SAC

Rating: 4.3 (192 reviews)

Build Artificial Intelligence (AI) agents using Deep Reinforcement Learning and PyTorch: DDPG, TD3, SAC, NAF, HER.

Created by Javier Ventajas

Last updated 5/2025

English

What you'll learn

Master some of the most advanced Reinforcement Learning algorithms.
Learn how to create AIs that can act in a complex environment to achieve their goals.
Create advanced Reinforcement Learning agents from scratch using Python's most popular tools (PyTorch Lightning, OpenAI Gym, Brax, Optuna).
Learn how to perform hyperparameter tuning (choosing the best experimental conditions for our AI to learn).
Fundamentally understand the learning process for each algorithm.
Debug and extend the algorithms presented.
Understand and implement new algorithms from research papers.

Course content

14 sections • 122 lectures • 8h 7m total length

Introduction4:51
The first course of our reinforcement learning series:

https://www.udemy.com/course/beginner-master-rl-1/?referralCode=376738F1E8AF47CAA6F1
Reinforcement Learning series0:13
Google Colab1:26
Explore Google Colab, an online programming environment with cloud-based notebooks that run code on GPUs, offering minimal setup and preinstalled Python libraries with Google Drive-based sharing.
Where to begin1:32
Complete code0:06
Connect with me on social media0:06

Module Overview0:47
Elements common to all control tasks5:44
Discover the five core elements of any control task in reinforcement learning: state, actions, rewards, agent, and environment; learn how they interact over time to guide decision making.
The Markov decision process (MDP)5:52
Define the Markov decision process as a discrete-time, stochastic, controlled framework where an agent acts in an environment, observing states, selecting actions, receiving rewards, and transitioning between states via probabilities.
Types of Markov decision process2:23
Classify Markov decision processes into finite and infinite types, then episodic and continuing, highlighting examples like a five by five maze and driving where state includes position and speed.
Trajectory vs episode1:13
Learn how a trajectory captures the sequence of states, actions, and rewards generated as the agent moves from one state to another; an episode is a trajectory that starts at the initial state and ends at a terminal state.
Reward vs Return1:39
Explore the distinction between reward and return and learn why maximizing the long-term expected return guides agent actions beyond immediate gains.
Discount factor4:19
Learn how the discount factor shapes an agent's incentives by discounting future rewards with gamma, guiding it to maximize the long-term discounted sum of rewards.
Policy2:16
State values v(s) and action values q(s,a)1:11
Define state values v(s) as the expected return from state s under a policy, and action values q(s,a) as the expected return after taking action a in state s and then following the policy.
Bellman equations3:17
Explore the Bellman equations for the value of a state and for the Q-value, revealing the recursive relationship between the expected return, the immediate rewards, and the values of successor states in reinforcement learning.
Solving a Markov decision process3:21
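
The reward, return and discount factor lectures above can be illustrated with a few lines of Python. This is only a minimal sketch, not the course's own code: the function name discounted_return and the sample rewards are made up for illustration.

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**t * rewards[t], accumulated from the last reward backwards."""
    g = 0.0
    for reward in reversed(rewards):
        g = reward + gamma * g
    return g


# Three rewards of 1.0 with gamma = 0.5: 1 + 0.5 + 0.25
example_return = discounted_return([1.0, 1.0, 1.0], gamma=0.5)
print(example_return)  # 1.75
```

With gamma close to 0 the agent is rewarded almost only for immediate gains; with gamma close to 1 later rewards count nearly as much as early ones.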

Module overview0:31
Temporal difference methods3:16
Temporal difference methods learn from experience by updating value estimates and policy online, blending Monte Carlo and dynamic programming without a model. They bootstrap estimates to guide policy.
Solving control tasks with temporal difference methods3:58
Q-Learning2:22
Advantages of temporal difference methods0:56
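
As a rough illustration of the temporal difference idea in this module, the sketch below applies one tabular Q-Learning update. It assumes a dictionary-based Q-table and hypothetical state names ("s0", "s1"); the course notebooks may organize this differently.

```python
from collections import defaultdict


def q_learning_update(q, state, action, reward, next_state, done,
                      actions, alpha=0.1, gamma=0.99):
    """Move q[(state, action)] toward the bootstrapped TD target."""
    # Terminal states have no future value, so the bootstrap term is zero.
    best_next = 0.0 if done else max(q[(next_state, a)] for a in actions)
    td_target = reward + gamma * best_next
    q[(state, action)] += alpha * (td_target - q[(state, action)])
    return q[(state, action)]


q = defaultdict(float)          # unseen (state, action) pairs start at 0.0
actions = [0, 1]
q[("s1", 1)] = 2.0              # pretend we already estimated this value
new_value = q_learning_update(q, "s0", 0, 1.0, "s1", False, actions,
                              alpha=0.5, gamma=0.9)
print(round(new_value, 6))      # 0.5 * (1.0 + 0.9 * 2.0) = 1.4
```

The update needs only one observed transition and no model of the environment, which is the main advantage the module attributes to temporal difference methods.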

Module overview0:36
Function approximators7:35
Learn how function approximators replace large value tables with parameterized models to estimate value functions in continuous state spaces, using linear and polynomial approximations and policy evaluation cycles.
Artificial Neural Networks3:32
Explore artificial neural networks, including feedforward architectures, layers, and fully connected networks, and learn how they approximate functions in reinforcement learning.
Artificial Neurons5:38
Explore artificial neurons inspired by biological neurons, showing how inputs are weighted, aggregated, and transformed by activation functions such as relu and sigmoid in hidden and output layers.
How to represent a Neural Network6:44
Represent neural networks in code to estimate action values for a three-layer network with 3 input, 6 hidden, and 2 outputs.
Stochastic Gradient Descent5:40
Explore how stochastic gradient descent minimizes the cost function by estimating gradients via backpropagation and updating neural network parameters with alpha to descend toward local and global minima.
Neural Network optimization4:01
Optimize the neural network by adjusting the W parameters to minimize mean squared error in estimating Q values, using the reward plus a discounted next-state Q estimate, acknowledging local minima.
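
To make the network described in this module concrete, here is a minimal pure-Python sketch of a 3-input, 6-hidden, 2-output network that estimates action values, together with the squared error against the "reward plus discounted next-state Q estimate" target. The helper names (init_layer, q_values, td_squared_error) and the random initialization are assumptions for illustration; the course itself builds its networks with PyTorch.

```python
import random


def relu(x):
    return max(0.0, x)


def linear(inputs, weights, biases):
    """Fully connected layer: one weighted sum plus bias per output neuron."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]


def init_layer(n_in, n_out, rng):
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


rng = random.Random(0)
w1, b1 = init_layer(3, 6, rng)   # input layer -> hidden layer
w2, b2 = init_layer(6, 2, rng)   # hidden layer -> one output per action


def q_values(state):
    hidden = [relu(h) for h in linear(state, w1, b1)]
    return linear(hidden, w2, b2)


def td_squared_error(state, action, reward, next_state, done, gamma=0.99):
    """Squared difference between the TD target and the current estimate."""
    bootstrap = 0.0 if done else gamma * max(q_values(next_state))
    target = reward + bootstrap
    return (target - q_values(state)[action]) ** 2


state = [0.1, -0.2, 0.3]
print(len(q_values(state)))  # 2, one estimate per action
```

Stochastic gradient descent would then adjust w1, b1, w2 and b2 to reduce this error averaged over sampled transitions.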

Module overview0:26
Deep Q-Learning3:02
Explore how deep q-learning combines temporal difference learning with neural nets, using epsilon-greedy exploration, a separate target policy, replay memory, and a target network for stable updates.
Experience Replay1:58
Target Network3:28
Explore how a target network stabilizes learning in deep reinforcement learning by using a replica to compute target values, avoiding moving targets during bootstrapping, and synchronizing it every few episodes.
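
The target network idea above can be sketched without any deep learning library. The example assumes parameters stored in a plain dictionary and a hypothetical TargetNetworkSync helper; it only shows the bookkeeping of "replica that is synchronized every few episodes".

```python
import copy


class TargetNetworkSync:
    """Keeps a frozen replica of the online parameters and refreshes it periodically."""

    def __init__(self, online_params, sync_every):
        self.online_params = online_params
        self.target_params = copy.deepcopy(online_params)
        self.sync_every = sync_every
        self.episodes = 0

    def end_episode(self):
        """Call once per episode; returns True when the replica was refreshed."""
        self.episodes += 1
        if self.episodes % self.sync_every == 0:
            self.target_params = copy.deepcopy(self.online_params)
            return True
        return False


online = {"w": [0.0, 0.0]}
sync = TargetNetworkSync(online, sync_every=3)
online["w"][0] = 1.0   # pretend a gradient step changed the online network
synced = [sync.end_episode() for _ in range(3)]
print(synced)          # [False, False, True]
```

Between synchronizations the targets are computed from target_params, so they do not move every time the online network is updated.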

PyTorch Lightning7:56
Link to the code notebook0:05
Introduction to PyTorch Lightning5:11
Create the Deep Q-Network4:48
Build a deep q network that estimates q-values for the available actions from a state's features, using a sequential neural network with linear layers and relu.
Create the policy4:51
Create an epsilon-greedy policy that maps environment states to actions using a neural network: with probability epsilon it samples a random action, and otherwise it selects the action with the highest estimated q-value, with states moved to the network's device.
Create the replay buffer5:33
Create the environment7:02
Create an environment with gym.make, demonstrate LunarLander-v2, examine its eight-value observations and four-action space, render episodes, and save them to a videos folder for later policy comparison.
Define the class for the Deep Q-Learning algorithm11:56
Define a deep q-learning class extending the LightningModule to train a DQN agent. Implement initialization, forward pass, optimizers, replay buffer integration, and epsilon-greedy data collection with a target network.
Define the play_episode() function4:59
Define play_episode to sample environment data, store experiences (state, action, reward, next_state, done) in a replay buffer, and drive epsilon-greedy action selection with random exploration and no gradient through actions.
Prepare the data loader and the optimizer4:51
Implement the forward method to process states, configure an AdamW optimizer with a learning rate, and build a dataset and data loader to supply training samples.
Define the train_step() method9:02
Define the train_epoch_end() method4:25
[Important] Lecture correction.0:12
Train the Deep Q-Learning algorithm6:11
Explore the resulting agent3:08
Watch a reinforcement learning agent train to land a rocket. A Q-network learns action values and a smarter policy improves decision making across episodes with hyperparameter tuning ahead.
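
Two of the building blocks from this section, the epsilon-greedy policy and the replay buffer, can be sketched in plain Python. This is an illustrative approximation rather than the notebook's code: it works on a list of q-values instead of a PyTorch network, and the names epsilon_greedy and ReplayBuffer are assumptions.

```python
import random
from collections import deque


def epsilon_greedy(q_values, epsilon, rng=random):
    """Random action with probability epsilon, otherwise the highest-valued one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])


class ReplayBuffer:
    """Fixed-capacity store of (state, action, reward, done, next_state) tuples."""

    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)  # oldest experiences are dropped first

    def __len__(self):
        return len(self.buffer)

    def append(self, experience):
        self.buffer.append(experience)

    def sample(self, batch_size, rng=random):
        return rng.sample(list(self.buffer), batch_size)


buffer = ReplayBuffer(capacity=2)
for step in range(3):
    action = epsilon_greedy([0.1, 0.9, 0.3], epsilon=0.0)
    buffer.append((step, action, 1.0, False, step + 1))
print(len(buffer), action)  # 2 1
```

Sampling uniformly from the buffer breaks the correlation between consecutive transitions, which is why the training step reads batches from it instead of using the latest experience directly.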

Hyperparameter tuning with Optuna8:37
Learn to optimize hyperparameters for deep reinforcement learning with automatic search using a study, trials, and objective functions, including Bayesian and evolutionary samplers, and pruning.
Link to the code notebook0:05
Log average return4:40
Tune hyperparameters for reinforcement learning agents using a library to optimize gamma and learning rate, and use the moving average of the last 100 episode returns.
Define the objective function5:28
Create and launch the hyperparameter tuning job2:55
Create and launch a hyperparameter tuning job by building a study to maximize the running average of returns, and use a pruner to drop underperforming trials, running 20 trials.
Explore the best trial2:40
The lecture shows hyperparameter trials in deep reinforcement learning, evaluating 20 runs; identifies the best run (gamma ~0.99, lr 0.001), extracts the best params, and retrains with 10,000 epochs.
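
The quantity that the tuning job maximizes, the moving average of the last 100 episode returns, can be tracked with a small helper. The sketch below is an assumption about how such a tracker might look (the ReturnTracker name is hypothetical) and deliberately does not use Optuna itself.

```python
from collections import deque


class ReturnTracker:
    """Running average over the most recent `window` episode returns."""

    def __init__(self, window=100):
        self.returns = deque(maxlen=window)

    def add(self, episode_return):
        self.returns.append(episode_return)
        return sum(self.returns) / len(self.returns)


tracker = ReturnTracker(window=3)   # small window so the effect is visible
averages = [tracker.add(r) for r in [1.0, 2.0, 3.0, 6.0]]
print(averages[2])                  # 2.0, the average of 1, 2 and 3
```

A smoothed metric like this is less noisy than single-episode returns, which makes it a more reliable signal for comparing trials and for pruning underperforming ones.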

Continuous action spaces6:01
The advantage function4:05
Discover the advantage function and how to adapt deep reinforcement learning to continuous action spaces by decomposing Q-values into state value and action advantage, enabling efficient action selection.
Normalized Advantage Function (NAF)2:49
Normalized Advantage Function pseudocode5:27
Learn normalized advantage function pseudocode for deep reinforcement learning, including network initialization, replay buffer, exploration noise, and target updates using state value in the Bellman equation.
Link to the code notebook0:05
Hyperbolic tangent1:29
Explore the hyperbolic tangent activation function, which maps inputs to the -1 to 1 range and bounds neural network outputs in the last layer, ensuring actions stay valid.
Creating the (NAF) Deep Q-Network 1 • 8:04
Creating the (NAF) Deep Q-Network 2 • 3:20
Define the mu function to pass a state through the network, compute action values, scale with tanh to the action range, and select the best action with no gradient flow.
Creating the (NAF) Deep Q-Network 3 • 1:08
Pass the state through the network’s common layers, apply the linear value layer to produce a single value estimate, return it with a no-grad annotation for future neural network updates.
Creating the (NAF) Deep Q-Network 4 • 10:21
Execute the forward pass of the NAF network to combine state value and advantage: build the lower-triangular matrix L, form P from it, and use P to compute the advantage and estimate the q-value.
Creating the policy5:31
Create a noisy policy by selecting the action with the network's highest q-value, adding noise via epsilon for exploration, and clipping the result to the environment's action bounds.
Create the environment4:46
Prepare the replay buffer and dataset, and import random and copy. Create a repeat-action wrapper that repeats each action in the environment, stabilizing learning for continuous actions in robotics.
Polyak averaging1:19
Implementing Polyak averaging2:14
Create the (NAF) Deep Q-Learning algorithm8:47
Implement the training step2:56
Implement the training step by unpacking a batch, computing action values with the Q network, next-state values with the target Q network, zero terminal values, then compute and log loss.
Implement the end-of-epoch logic2:38
Implement end-of-epoch logic by using training step outputs to collect new samples, update the target network, decay epsilon, run a policy-driven episode, and log the episode return.
Debugging and launching the algorithm3:19
Debug and launch the continuous lunar lander task by fixing code errors, aligning devices, and configuring the NAF model with a 1e-3 learning rate, then run the trainer.
Checking the resulting agent2:47
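
The way NAF combines the state value with a quadratic advantage can be written out in a few lines. This is a hedged sketch under stated assumptions: it uses plain lists instead of tensors, takes mu, L and the state value as given numbers rather than network outputs, and uses one common way of rescaling tanh to an action range. The function names are hypothetical.

```python
import math


def scale_action(raw, low, high):
    """Squash raw outputs with tanh to (-1, 1), then rescale to [low, high]."""
    return [lo + (math.tanh(r) + 1.0) * 0.5 * (hi - lo)
            for r, lo, hi in zip(raw, low, high)]


def naf_q_value(value, mu, l_matrix, action):
    """Q(s, a) = V(s) - 0.5 * (a - mu)^T P (a - mu), with P = L L^T."""
    n = len(mu)
    # P is positive semi-definite by construction, so the advantage is never positive.
    p = [[sum(l_matrix[i][k] * l_matrix[j][k] for k in range(n))
          for j in range(n)] for i in range(n)]
    diff = [a - m for a, m in zip(action, mu)]
    quad = sum(diff[i] * p[i][j] * diff[j] for i in range(n) for j in range(n))
    return value - 0.5 * quad


mu = [0.5, -0.5]                      # best action proposed for this state
l_matrix = [[1.0, 0.0], [2.0, 3.0]]   # lower-triangular L
print(naf_q_value(1.5, mu, l_matrix, mu))           # 1.5, the state value itself
print(naf_q_value(1.5, mu, l_matrix, [1.5, -0.5]))  # 1.0
```

Because the advantage is zero at a = mu and negative everywhere else, the greedy action is simply mu, which is what makes Q-Learning tractable in a continuous action space.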

Policy gradient methods4:16
Explore policy gradient methods where a neural network defines a stochastic policy by outputting action probabilities, enabling smoother learning and handling uncertain tasks.
Policy performance2:16
Representing policies using neural networks4:43
The policy gradient theorem3:20
Entropy Regularization5:39
Increase policy exploration through entropy regularization in policy gradient methods, keeping the entropy of action distributions high to improve robustness and fine-tune optimal policies.
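
Entropy regularization can be illustrated by computing the entropy of a discrete action distribution and adding it to the objective as a bonus. The sketch below is only illustrative: the coefficient name beta and its value are assumptions, and a real implementation would work on the network's output probabilities.

```python
import math


def entropy(probs):
    """Shannon entropy in nats; terms with zero probability contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def regularized_objective(expected_return, probs, beta=0.01):
    """Expected return plus an entropy bonus weighted by beta."""
    return expected_return + beta * entropy(probs)


uniform = [0.25, 0.25, 0.25, 0.25]   # maximally exploratory policy
peaked = [0.97, 0.01, 0.01, 0.01]    # nearly deterministic policy
print(entropy(uniform) > entropy(peaked))  # True
```

For the same expected return, the regularized objective prefers the distribution with higher entropy, which discourages the policy from collapsing onto a single action too early.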

The Brax Physics engine3:24
Explore the Brax physics engine to simulate rigid bodies, joints, and actuators for locomotion tasks, including parallel GPU-accelerated environments compatible with Gym, demonstrated with a spider-like robot.
Deep Deterministic Policy Gradient (DDPG)8:51
Explore deep deterministic policy gradient for continuous actions, using an actor-critic setup with policy and Q networks, differentiable Q functions, and noise-based exploration for stable learning.
DDPG pseudocode3:31
Learn how the deep deterministic policy gradient (DDPG) algorithm works in practice by building actor and critic networks, using replay buffers, target networks, and Polyak averaging.
Link to the code notebook0:11
Important - updated code0:18
Deep Deterministic Policy Gradient (DDPG)5:11
To complete this section, you will work with the code notebook that you will find at the following link:
https://colab.research.google.com/github/escape-velocity-labs/advanced_rl/blob/main/5_deep_deterministic_policy_gradient.ipynb

Happy coding! :)
Create the gradient policy9:40
Create the gradient policy - Correction0:33
Create the Deep Q-Network5:01
Create the DDPG class8:10
Define the play method2:22
Define the play method - Correction0:42
Setup the optimizers and dataloader3:37
Define the training step11:12
Define the training step - Correction0:24
Launch the training process5:35
Launch the training process by configuring the environment, creating the trainer with GPUs, and running 5000 steps, while logging every 10 steps and generating videos every 100 episodes.
Check the resulting agent2:13
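
Polyak averaging, which the DDPG pseudocode uses to update its target networks, replaces the periodic hard copy with a small step toward the online parameters after every update. The sketch below assumes parameters held in dictionaries of lists and the convention that tau is the weight given to the online network; some references define the coefficient the other way around.

```python
def polyak_update(online, target, tau=0.01):
    """In-place soft update: target <- tau * online + (1 - tau) * target."""
    for name, values in online.items():
        target[name] = [tau * o + (1.0 - tau) * t
                        for o, t in zip(values, target[name])]
    return target


online_params = {"w": [1.0, 2.0]}
target_params = {"w": [0.0, 0.0]}
polyak_update(online_params, target_params, tau=0.5)
print(target_params["w"])  # [0.5, 1.0]
```

With a small tau the target networks change slowly, which keeps the bootstrapped targets stable for both the actor and the critic.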

Requirements

Be comfortable programming in Python
Completing our course "Reinforcement Learning beginner to master" or being familiar with the basics of Reinforcement Learning (or watching the leveling sections included in this course).
Know basic statistics (mean, variance, normal distribution)

Description

This is the most complete Advanced Reinforcement Learning course on Udemy. In it, you will learn to implement some of the most powerful Deep Reinforcement Learning algorithms in Python using PyTorch and PyTorch Lightning. You will implement from scratch adaptive algorithms that solve control tasks based on experience. You will learn to combine these techniques with Neural Networks and Deep Learning methods to create adaptive Artificial Intelligence agents capable of solving decision-making tasks.

This course will introduce you to the state of the art in Reinforcement Learning techniques. It will also prepare you for the next courses in this series, where we will explore other advanced methods that excel in other types of task.

The course is focused on developing practical skills. Therefore, after learning the most important concepts of each family of methods, we will implement one or more of their algorithms from scratch in Jupyter notebooks.

Leveling modules:

- Refresher: The Markov decision process (MDP).

- Refresher: Q-Learning.

- Refresher: Brief introduction to Neural Networks.

- Refresher: Deep Q-Learning.

- Refresher: Policy gradient methods.

Advanced Reinforcement Learning:

- PyTorch Lightning.

- Hyperparameter tuning with Optuna.

- Deep Q-Learning for continuous action spaces (Normalized advantage function - NAF).

- Deep Deterministic Policy Gradient (DDPG).

- Twin Delayed DDPG (TD3).

- Soft Actor-Critic (SAC).

- Hindsight Experience Replay (HER).

Who this course is for:

Developers who want to get a job in Machine Learning.
Data scientists/analysts and ML practitioners seeking to expand their breadth of knowledge.
Robotics students and researchers.
Engineering students and researchers.

Course content

Introduction • 6 lectures • 8min

Refresher: The Markov Decision Process (MDP) • 11 lectures • 32min

Refresher: Q-Learning • 5 lectures • 11min

Refresher: Brief introduction to Neural Networks • 7 lectures • 34min

Refresher: Deep Q-Learning • 4 lectures • 9min

PyTorch Lightning • 15 lectures • 1hr 20min

Hyperparameter tuning with Optuna • 6 lectures • 24min

Deep Q-Learning for continuous action spaces (Normalized Advantage Function) • 19 lectures • 1hr 17min

Refresher: Policy gradient methods • 5 lectures • 20min

Deep Deterministic Policy Gradient (DDPG) • 17 lectures • 1hr 11min
