Advanced Reinforcement Learning: policy gradient methods

Name: Advanced Reinforcement Learning: policy gradient methods
Rating: 4.3 (135 reviews)

Build Artificial Intelligence (AI) agents using Deep Reinforcement Learning and PyTorch (REINFORCE, A2C, PPO, etc.)

Created by Javier Ventajas

Last updated 5/2025

English

What you'll learn

Master some of the most advanced Reinforcement Learning algorithms.
Learn how to create AIs that can act in a complex environment to achieve their goals.
Create advanced Reinforcement Learning agents from scratch using Python's most popular tools (PyTorch Lightning, OpenAI Gym, Optuna).
Learn how to perform hyperparameter tuning (choosing the best experimental conditions for our AI to learn).
Fundamentally understand the learning process for each algorithm.
Debug and extend the algorithms presented.
Understand and implement new algorithms from research papers.

Course content

15 sections • 97 lectures • 7h 34m total length

Introduction • 6:07
Reinforcement Learning series • 0:14
Google Colab • 1:26
Where to begin • 0:57
Complete code • 0:03
Connect with me on social media • 0:06

Elements common to all control tasks • 5:44
The Markov decision process (MDP) • 5:52
Types of Markov decision process • 2:23
Explore finite and infinite Markov decision processes, and episodic versus continuing tasks, with examples like a 5x5 maze and a car's continuous states.
Trajectory vs episode • 1:13
Reward vs Return • 1:39
Discount factor • 4:19
Explore how the discount factor gamma shapes reward planning in reinforcement learning, balancing immediate against long-term gains through discounted returns in a maze task.
Policy • 2:16
State values v(s) and action values q(s,a) • 1:11
Bellman equations • 3:17
Uncover the Bellman equations for state values and action values (Q-values), which express expected returns as the immediate reward plus the discounted value of successor states, enabling policy evaluation and the solution of control tasks in reinforcement learning.
Solving a Markov decision process • 3:21
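
As a rough illustration of the discount factor lecture above, the sketch below computes a discounted return for one episode in plain Python. The function name `discounted_return` and the maze rewards (-1 per step, +10 at the goal) are hypothetical and are not taken from the course notebooks.

```python
def discounted_return(rewards, gamma):
    """Return G_0 = r_0 + gamma * r_1 + gamma^2 * r_2 + ... for one episode."""
    g = 0.0
    # Work backward so each step is one multiply-add: G_t = r_t + gamma * G_{t+1}
    for r in reversed(rewards):
        g = r + gamma * g
    return g


# Hypothetical maze episode: -1 per step, +10 on reaching the goal.
rewards = [-1.0, -1.0, -1.0, 10.0]
far_sighted = discounted_return(rewards, gamma=1.0)   # values the goal fully
short_sighted = discounted_return(rewards, gamma=0.5)  # goal reward shrinks by 0.5**3
print(far_sighted, short_sighted)
```

With a smaller gamma the distant goal reward counts for less, so the same episode looks worse to a short-sighted agent.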

Monte Carlo methods • 3:09
Explore Monte Carlo methods, which learn from experience to estimate state-action values under a policy, using returns to update the estimates and relying on the law of large numbers rather than a model of the environment.
Solving control tasks with Monte Carlo methods • 6:56
On-policy Monte Carlo control • 4:33
Implement on-policy Monte Carlo control with an epsilon-greedy policy that occasionally selects random actions; update action-value estimates by averaging returns across episodes to derive a near-optimal policy.
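
The two ingredients described above, epsilon-greedy action selection and averaging returns per state-action pair, might look like the following sketch. It uses an every-visit update with a running average, which is one common variant and an assumption here; the helper names and the tiny two-step episode are hypothetical.

```python
import random
from collections import defaultdict


def epsilon_greedy(q, state, n_actions, epsilon, rng):
    """Pick a random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(n_actions)
    values = [q[(state, a)] for a in range(n_actions)]
    return values.index(max(values))


def update_from_episode(q, counts, episode, gamma):
    """Every-visit Monte Carlo update.

    episode is a list of (state, action, reward) tuples in time order.
    """
    g = 0.0
    for state, action, reward in reversed(episode):
        g = reward + gamma * g          # return observed from this step onward
        key = (state, action)
        counts[key] += 1
        # Running average of all returns seen so far for this pair.
        q[key] += (g - q[key]) / counts[key]


q = defaultdict(float)
counts = defaultdict(int)
update_from_episode(q, counts, [("s0", 1, 0.0), ("s1", 0, 1.0)], gamma=0.9)
greedy_action = epsilon_greedy(q, "s0", n_actions=2, epsilon=0.0, rng=random.Random(0))
print(q[("s0", 1)], greedy_action)
```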

Function approximators • 7:35
Explore how function approximators replace tabular value estimates with parameterized models for continuous state spaces. Compare linear and polynomial estimators, learning weights to achieve memory-efficient, adaptable value approximation.
Artificial Neural Networks • 3:32
Artificial Neurons • 5:38
Explore artificial neurons that aggregate weighted inputs, apply activation functions like ReLU and sigmoid, and propagate signals through hidden and output layers.
How to represent a Neural Network • 6:44
Stochastic Gradient Descent • 5:40
Learn how stochastic gradient descent minimizes the neural network cost by using environment rewards, computing the gradient via backpropagation, and updating parameters to move toward minima.
Neural Network optimization • 4:01
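
To make the neuron and gradient descent lectures above concrete, here is a minimal sketch of a single sigmoid neuron and one stochastic gradient descent step on a squared-error cost. It assumes a supervised target for simplicity, whereas the course derives its training signal from environment rewards; all names are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def neuron(weights, bias, inputs):
    """Weighted sum of the inputs plus a bias, passed through a sigmoid."""
    return sigmoid(sum(w * x for w, x in zip(weights, inputs)) + bias)


def sgd_step(weights, bias, inputs, target, lr):
    """One gradient step on the cost 0.5 * (y - target)**2 for a single sample."""
    y = neuron(weights, bias, inputs)
    # Chain rule: dCost/dz = (y - target) * sigmoid'(z), with sigmoid'(z) = y * (1 - y)
    delta = (y - target) * y * (1.0 - y)
    new_weights = [w - lr * delta * x for w, x in zip(weights, inputs)]
    new_bias = bias - lr * delta
    return new_weights, new_bias


weights, bias = [0.0, 0.0], 0.0
sample, target = [1.0, 0.0], 1.0
before = neuron(weights, bias, sample)
weights, bias = sgd_step(weights, bias, sample, target, lr=1.0)
after = neuron(weights, bias, sample)
print(before, after)  # the output moves toward the target after the step
```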

Policy gradient methods • 4:16
Learn policy gradient methods, where a neural network defines action probabilities, producing stochastic policies that offer smoother learning and address limits of value-based approaches.
Representing policies using neural networks • 4:43
Policy performance • 2:16
The policy gradient theorem • 3:20
Explore the policy gradient theorem, defining policy performance as the value of the initial state, and show how the gradient links returns to state frequencies and action probabilities for improvement.
REINFORCE • 3:38
Parallel learning • 3:06
Learn how parallel learning addresses the problem of highly similar consecutive states in policy gradient methods by running multiple environments in parallel, alongside experience replay, to give the neural network more diverse updates.
Entropy regularization • 5:39
REINFORCE 2 • 2:03
Initialize the policy neural network and parallel environments to collect trajectories across episodes. Compute discounted returns backward through each trajectory with gamma, apply entropy regularization, and update the policy via gradient ascent with learning rate alpha.
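
For reference, the quantities named in this section can be written out as follows. This is the standard textbook form of the policy gradient theorem and the REINFORCE update; the entropy coefficient \beta and the exact way the course combines the entropy term with the update are assumptions.

```latex
% Policy performance: value of the initial state under the policy \pi_\theta
J(\theta) = v_{\pi_\theta}(s_0)

% Policy gradient theorem: \mu(s) is the state visitation frequency under \pi_\theta
\nabla_\theta J(\theta) \propto \sum_{s} \mu(s) \sum_{a} q_{\pi_\theta}(s, a) \, \nabla_\theta \pi(a \mid s, \theta)

% REINFORCE: sample-based gradient ascent step using the return G_t
\theta \leftarrow \theta + \alpha \, \gamma^{t} G_t \, \nabla_\theta \ln \pi(A_t \mid S_t, \theta)

% Entropy of the policy at state s, added as a bonus weighted by \beta (assumed form)
H\big(\pi(\cdot \mid s, \theta)\big) = - \sum_{a} \pi(a \mid s, \theta) \ln \pi(a \mid s, \theta)
\theta \leftarrow \theta + \alpha \left[ \gamma^{t} G_t \, \nabla_\theta \ln \pi(A_t \mid S_t, \theta) + \beta \, \nabla_\theta H\big(\pi(\cdot \mid S_t, \theta)\big) \right]
```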

PyTorch Lightning • 7:56
Link to the code notebook • 0:02
Create the policy • 13:37
Create the environment • 9:31
Create the dataset • 14:02
Create a dataset class to collect and structure batched transitions from a policy interacting with an environment, compute discounted returns, and prepare data for training a policy gradient algorithm.
Create the REINFORCE algorithm - Part 1 • 6:46
Create the REINFORCE algorithm - Part 2 • 10:45
Implement the second part of the REINFORCE algorithm by computing log probabilities, applying entropy regularization, and updating the policy network via a batch training step using PyTorch Lightning.
Check the resulting agent • 5:57
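
The loss described in Part 2 above (log probabilities weighted by returns, plus an entropy term) can be sketched without any deep learning library. The version below is plain Python for a discrete softmax policy, not the course's PyTorch Lightning code; `reinforce_loss`, `entropy_coef`, and the batch layout are hypothetical, and in practice an autograd framework would differentiate this value.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of action logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reinforce_loss(batch, entropy_coef):
    """Mean REINFORCE loss over a batch of (logits, action, discounted_return).

    Minimizing -log_prob * return raises the probability of actions that led
    to high returns; subtracting the entropy term rewards exploration.
    """
    total = 0.0
    for logits, action, ret in batch:
        probs = softmax(logits)
        log_prob = math.log(probs[action])
        entropy = -sum(p * math.log(p) for p in probs if p > 0.0)
        total += -log_prob * ret - entropy_coef * entropy
    return total / len(batch)


# Hypothetical batch: two transitions with two possible actions each.
batch = [([0.0, 0.0], 0, 1.0), ([2.0, 0.0], 1, -1.0)]
print(reinforce_loss(batch, entropy_coef=0.01))
```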

REINFORCE for continuous action spaces • 4:55
Explore extending policy gradient methods to continuous action spaces using the REINFORCE algorithm with a normal distribution, computing mean, std, log probabilities, and entropy for policy updates.
Link to the code notebook • 0:02
Create the policy • 9:47
Create the inverted pendulum environment • 8:46
Create the dataset • 8:59
Creating the algorithm - Part 1 • 6:24
Define the REINFORCE policy gradient in a Lightning module, configuring environments, the neural policy, gamma, entropy, optimizer, and an RL dataset with train and test loaders.
Creating the algorithm - Part 2 • 6:45
Develop a continuous-action policy gradient algorithm by computing the mean and standard deviation from observations, forming a normal distribution, and optimizing policy loss with entropy regularization.
Check the resulting agent • 2:39
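
For the continuous case described above, the policy outputs a mean and a standard deviation, and the log probability and entropy come from the normal distribution. The sketch below writes those two formulas out in plain Python for a one-dimensional action; the mean and std values are made up, and in the course they would come from the policy network.

```python
import math
import random


def normal_log_prob(x, mean, std):
    """Log density of N(mean, std**2) evaluated at x."""
    return (-((x - mean) ** 2) / (2.0 * std * std)
            - math.log(std)
            - 0.5 * math.log(2.0 * math.pi))


def normal_entropy(std):
    """Differential entropy of a normal distribution: 0.5 * ln(2 * pi * e * std**2)."""
    return 0.5 + 0.5 * math.log(2.0 * math.pi) + math.log(std)


rng = random.Random(0)
mean, std = 0.3, 0.5              # hypothetical policy outputs for one observation
action = rng.gauss(mean, std)     # sample a continuous action
log_prob = normal_log_prob(action, mean, std)
entropy = normal_entropy(std)
print(action, log_prob, entropy)
```

A wider std gives higher entropy, so an entropy bonus discourages the policy from collapsing to a near-deterministic action too early.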

A2C • 10:49
Link to the code notebook • 0:02
Create the policy and value network • 4:19
Create the environment • 5:39
Create and render a pendulum environment with gym for the advantage actor-critic algorithm, run parallel copies, inspect observations and actions, scale actions via tanh, and wrap with statistics and normalization.
Create the dataset • 3:24
Implement A2C - Part 1 • 5:47
Implement the A2C algorithm by building the Advantage Actor-Critic class, configuring dual optimizers for policy and value networks, and setting up environments and data loaders.
Implement A2C - Part 2 • 10:40
Check the resulting agent • 2:39
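
Two ideas from this section can be sketched briefly: the advantage that the critic supplies to the actor, and squashing a raw action with tanh into the environment's bounds. The one-step advantage below is one common choice and an assumption about the course's estimator; the function names and the [-2, 2] action range are likewise illustrative.

```python
import math


def one_step_advantage(reward, value, next_value, done, gamma):
    """Return (advantage, target) using a one-step bootstrapped target.

    target    = r + gamma * V(s')   (no bootstrap if the episode ended)
    advantage = target - V(s)
    """
    bootstrap = 0.0 if done else next_value
    target = reward + gamma * bootstrap
    return target - value, target


def scale_action(raw, low, high):
    """Squash an unbounded network output into [low, high] with tanh."""
    squashed = math.tanh(raw)                      # in [-1, 1]
    return low + 0.5 * (squashed + 1.0) * (high - low)


advantage, target = one_step_advantage(
    reward=1.0, value=0.5, next_value=2.0, done=False, gamma=0.9
)
torque = scale_action(0.7, low=-2.0, high=2.0)     # assumed action bounds
print(advantage, target, torque)
```

The policy loss would then weight log probabilities by the advantage instead of the raw return, while the value network is trained toward the target.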

Requirements

Be comfortable programming in Python.
Have completed our course "Reinforcement Learning beginner to master", be familiar with the basics of Reinforcement Learning, or watch the leveling sections included in this course.
Know basic statistics (mean, variance, normal distribution).

Description

This is the most complete Reinforcement Learning course series on Udemy. In it, you will learn to implement some of the most powerful Deep Reinforcement Learning algorithms in Python using PyTorch and PyTorch Lightning. You will implement, from scratch, adaptive algorithms that solve control tasks based on experience. You will learn to combine these techniques with Neural Networks and Deep Learning methods to create adaptive Artificial Intelligence agents capable of solving decision-making tasks.

This course will introduce you to the state of the art in Reinforcement Learning techniques. It will also prepare you for the next courses in this series, where we will explore other advanced methods that excel in other types of task.

The course is focused on developing practical skills. Therefore, after learning the most important concepts of each family of methods, we will implement one or more of their algorithms from scratch in Jupyter notebooks.

Leveling modules:

- Refresher: The Markov decision process (MDP).

- Refresher: Monte Carlo methods.

- Refresher: Temporal difference methods.

- Refresher: N-step bootstrapping.

- Refresher: Brief introduction to Neural Networks.

- Refresher: Policy gradient methods.

Advanced Reinforcement Learning:

- REINFORCE

- REINFORCE for continuous action spaces

- Advantage actor-critic (A2C)

- Trust region methods

- Proximal policy optimization (PPO)

- Generalized advantage estimation (GAE)

- Trust region policy optimization (TRPO)

Who this course is for:

Developers who want to get a job in Machine Learning.
Data scientists/analysts and ML practitioners seeking to expand their breadth of knowledge.
Robotics students and researchers.
Engineering students and researchers.

Course content

Introduction • 6 lectures • 9min

Refresher: The Markov Decision Process (MDP) • 10 lectures • 31min

Refresher: Monte Carlo methods • 3 lectures • 15min

Refresher: Temporal difference methods • 6 lectures • 16min

Refresher: N-step bootstrapping • 3 lectures • 11min

Refresher: Brief introduction to Neural Networks • 6 lectures • 33min

Refresher: REINFORCE • 8 lectures • 29min

PyTorch Lightning • 8 lectures • 1hr 9min

REINFORCE for continuous control tasks • 8 lectures • 48min

Advantage Actor Critic (A2C) • 8 lectures • 43min