
The first course of our reinforcement learning series:
https://www.udemy.com/course/beginner-master-rl-1/?referralCode=376738F1E8AF47CAA6F1
Explore Google Colab, an online programming environment with cloud-based notebooks that run code on GPUs, offering minimal setup and preinstalled Python libraries with Google Drive-based sharing.
Discover the five core elements of any control task in reinforcement learning: state, actions, rewards, agent, and environment; learn how they interact over time to guide decision making.
Define the Markov decision process as a discrete-time, stochastic, controlled framework where an agent acts in an environment, observing states, selecting actions, receiving rewards, and transitioning between states via probabilities.
Classify Markov decision processes into finite and infinite types, then episodic and continuing, highlighting examples like a five by five maze and driving where state includes position and speed.
Learn how a trajectory captures the sequence of states, actions, and rewards an agent experiences; an episode is a trajectory that runs from the initial state to a terminal state.
Explore the distinction between reward and return and learn why maximizing the long-term expected return guides agent actions beyond immediate gains.
Learn how the discount factor shapes an agent's incentives by discounting future rewards with gamma, guiding it to maximize the long-term discounted sum of rewards.
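As a minimal sketch of the discounted sum of rewards described above (the function name `discounted_return` is illustrative and not taken from the course code), the return can be accumulated backwards over a list of rewards:

```python
def discounted_return(rewards, gamma):
    """Return sum over k of gamma**k * rewards[k]."""
    g = 0.0
    # Walk backwards so each step applies G_t = r_t + gamma * G_{t+1}.
    for r in reversed(rewards):
        g = r + gamma * g
    return g


# Three rewards of 1.0 with gamma = 0.5: 1 + 0.5 + 0.25
print(discounted_return([1.0, 1.0, 1.0], gamma=0.5))  # prints 1.75
```

A gamma close to 0 makes the agent favor immediate rewards, while a gamma close to 1 weights distant rewards almost as much as immediate ones.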
Define state values v(s) as the expected return from state s under a policy, and action values q(s,a) as the expected return after taking action a in state s and following the policy thereafter.
Explore the Bellman equations for state values and Q-values, revealing the recursive relationship between the expected return, immediate rewards, and the values of successor states in reinforcement learning.
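For reference, the Bellman equations for a policy can be written in their standard textbook form. The notation here is an assumption, since the lecture summary does not fix one: \pi(a \mid s) is the policy, p(s', r \mid s, a) the transition dynamics, and \gamma the discount factor.

```latex
\begin{aligned}
v_\pi(s)    &= \sum_{a} \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)\,\bigl[\, r + \gamma\, v_\pi(s') \,\bigr] \\
q_\pi(s, a) &= \sum_{s', r} p(s', r \mid s, a)\,\Bigl[\, r + \gamma \sum_{a'} \pi(a' \mid s')\, q_\pi(s', a') \,\Bigr]
\end{aligned}
```

Both equations express the value of the current state (or state-action pair) in terms of the immediate reward and the value of the successor state.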
Temporal difference methods learn from experience by updating value estimates and policy online, blending Monte Carlo and dynamic programming without a model. They bootstrap estimates to guide policy.
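A minimal sketch of a one-step temporal difference update for state values, assuming a tabular setting with values stored in a dictionary (the name `td0_update` and the defaults for `alpha` and `gamma` are illustrative):

```python
def td0_update(values, state, reward, next_state, done, alpha=0.1, gamma=0.99):
    """Apply one TD(0) update in place and return the TD error."""
    # Bootstrap: the target uses the current estimate of the next state's value.
    next_value = 0.0 if done else values.get(next_state, 0.0)
    target = reward + gamma * next_value
    td_error = target - values.get(state, 0.0)
    values[state] = values.get(state, 0.0) + alpha * td_error
    return td_error


values = {}
td0_update(values, "s0", reward=1.0, next_state="s1", done=False, alpha=0.5, gamma=0.9)
print(values)
```

Unlike a Monte Carlo update, this does not wait for the episode to end; unlike dynamic programming, it uses a sampled transition rather than a model of the environment.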
Learn how function approximators replace large value tables with parameterized models to estimate value functions in continuous state spaces, using linear and polynomial approximations and policy evaluation cycles.
Explore artificial neural networks, including feedforward architectures, layers, and fully connected networks, and learn how they approximate functions in reinforcement learning.
Explore artificial neurons inspired by biological neurons, showing how inputs are weighted, aggregated, and transformed by activation functions such as relu and sigmoid in hidden and output layers.
Represent a neural network in code that estimates action values: a three-layer network with 3 inputs, 6 hidden units, and 2 outputs.
Explore how stochastic gradient descent minimizes the cost function by estimating gradients via backpropagation and updating neural network parameters with a learning rate alpha to descend toward local and global minima.
Optimize the neural network by adjusting the W parameters to minimize mean squared error in estimating Q values, using the reward plus a discounted next-state Q estimate, acknowledging local minima.
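One common way to write this objective is shown below. It assumes a Q-learning style target that takes the maximum over next actions and treats the target as a constant during differentiation; the summary itself only says "a discounted next-state Q estimate", so the max is an assumption. Here N is the batch size and W the network parameters.

```latex
L(W) = \frac{1}{N} \sum_{i=1}^{N} \Bigl( r_i + \gamma \max_{a'} Q(s'_i, a'; W) - Q(s_i, a_i; W) \Bigr)^{2}
```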
Explore how deep q-learning combines temporal difference learning with neural nets, using epsilon-greedy exploration, a separate target policy, replay memory, and a target network for stable updates.
Explore how a target network stabilizes learning in deep reinforcement learning by using a replica to compute target values, avoiding moving targets during bootstrapping, and synchronizing it every few episodes.
Build a deep q network that estimates q-values for the available actions from a state's features, using a sequential neural network with linear layers and relu.
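A minimal sketch of such a network in PyTorch. The helper name `build_q_network` and the hidden size of 128 are assumptions, not values from the course; the 8 state features and 4 actions match the lunar lander environment described in this section.

```python
import torch
from torch import nn


def build_q_network(state_dim: int, hidden_size: int, n_actions: int) -> nn.Sequential:
    """Map a state's features to one Q-value estimate per action."""
    return nn.Sequential(
        nn.Linear(state_dim, hidden_size),
        nn.ReLU(),
        nn.Linear(hidden_size, hidden_size),
        nn.ReLU(),
        nn.Linear(hidden_size, n_actions),  # no activation: Q-values are unbounded
    )


q_net = build_q_network(state_dim=8, hidden_size=128, n_actions=4)
q_values = q_net(torch.zeros(1, 8))  # a batch containing one dummy state
print(q_values.shape)
```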
Create an epsilon-greedy policy that maps environment states to actions using a neural network, sampling a random action with probability epsilon and otherwise selecting the highest-valued action, with the state moved to the network's device.
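The selection rule itself can be sketched without a neural network, assuming the Q-values for the current state are already available as a list (the function name `epsilon_greedy` is illustrative):

```python
import random


def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick a random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    # Greedy choice: index of the largest Q-value.
    return max(range(len(q_values)), key=lambda a: q_values[a])


print(epsilon_greedy([0.1, 0.9, 0.3], epsilon=0.0))  # prints 1
```

With `epsilon=0.0` the policy is purely greedy; with `epsilon=1.0` it is uniformly random.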
Create an environment via gym.make, demonstrate LunarLander-v2, examine its eight-value observations and four-action space, render episodes, and save them to a videos folder for later policy comparison.
Define a deep q-learning class extending the LightningModule to train a DQN agent. Implement initialization, forward pass, optimizers, replay buffer integration, and epsilon-greedy data collection with a target network.
Define play_episode to sample environment data, store experiences (state, action, reward, next_state, done) in a replay buffer, and drive epsilon-greedy action selection with random exploration and no gradient through actions.
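A minimal sketch of a replay buffer that stores such experience tuples, assuming a fixed capacity and uniform sampling (the class and method names are illustrative and may differ from the course notebook):

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-size store of (state, action, reward, next_state, done) tuples."""

    def __init__(self, capacity):
        # A deque with maxlen drops the oldest experience once full.
        self.buffer = deque(maxlen=capacity)

    def __len__(self):
        return len(self.buffer)

    def append(self, experience):
        self.buffer.append(experience)

    def sample(self, batch_size):
        # Uniform sampling without replacement.
        return random.sample(list(self.buffer), batch_size)


buffer = ReplayBuffer(capacity=2)
for step in range(3):
    buffer.append((step, 0, 1.0, step + 1, False))
print(len(buffer))  # prints 2
```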
Implement the forward method to process states, configure an AdamW optimizer with a learning rate, and build a dataset and data loader to supply training samples.
Watch a reinforcement learning agent train to land a rocket. A Q-network learns action values and a smarter policy improves decision making across episodes with hyperparameter tuning ahead.
Learn to optimize hyperparameters for deep reinforcement learning with automatic search using a study, trials, and objective functions, including Bayesian and evolutionary samplers, and pruning.
Tune hyperparameters for reinforcement learning agents using a library to optimize gamma and the learning rate, scoring each trial by the moving average of the last 100 episode returns.
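The moving average of recent episode returns can be tracked with a bounded queue. This is a minimal sketch; the class name `RunningReturn` is hypothetical, and the small window in the usage lines is only for illustration.

```python
from collections import deque


class RunningReturn:
    """Track the mean of the most recent episode returns."""

    def __init__(self, window=100):
        self.returns = deque(maxlen=window)

    def add(self, episode_return):
        """Record a return and give back the current moving average."""
        self.returns.append(episode_return)
        return sum(self.returns) / len(self.returns)


tracker = RunningReturn(window=2)
tracker.add(1.0)
tracker.add(3.0)
print(tracker.add(5.0))  # prints 4.0, the mean of the last two returns
```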
Create and launch a hyperparameter tuning job by building a study to maximize the running average of returns, and use a pruner to drop underperforming trials, running 20 trials.
The lecture shows hyperparameter trials in deep reinforcement learning, evaluating 20 runs; it identifies the best run (gamma ~0.99, learning rate 0.001), extracts the best parameters, and retrains for 10,000 epochs.
Discover the advantage function and how to adapt deep reinforcement learning to continuous action spaces by decomposing Q-values into state value and action advantage, enabling efficient action selection.
Learn normalized advantage function pseudocode for deep reinforcement learning, including network initialization, replay buffer, exploration noise, and target updates using state value in the Bellman equation.
Explore the hyperbolic tangent activation function, which maps inputs to the -1 to 1 range and bounds neural network outputs in the last layer, ensuring actions stay valid.
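A minimal sketch of how a tanh output can be rescaled to an environment's action range, assuming scalar bounds `low` and `high` (the function name `squash_to_bounds` is illustrative):

```python
import math


def squash_to_bounds(raw_output, low, high):
    """Map an unbounded network output into the interval [low, high]."""
    unit = math.tanh(raw_output)  # always between -1 and 1
    # Shift and scale from [-1, 1] to [low, high].
    return low + 0.5 * (unit + 1.0) * (high - low)


print(squash_to_bounds(0.0, low=-2.0, high=2.0))  # prints 0.0
```

However large the raw output grows, the result never leaves the valid action range.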
Define the mu function to pass a state through the network, compute action values, scale with tanh to the action range, and select the best action with no gradient flow.
Pass the state through the network's common layers, apply the linear value layer to produce a single value estimate, and return it under a no-grad annotation for use in later neural network updates.
Execute the forward pass of the NAF network to combine state value and advantage, build a lower triangular L and B matrix, then form P to estimate the q-value.
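In the usual formulation of the normalized advantage function, the pieces named above combine as follows. The notation is an assumption: \mu(s) is the action the network proposes for state s, V(s) the state value, and L(s) the lower-triangular matrix.

```latex
\begin{aligned}
P(s)    &= L(s)\, L(s)^{\top} \\
A(s, a) &= -\tfrac{1}{2}\, \bigl(a - \mu(s)\bigr)^{\top} P(s)\, \bigl(a - \mu(s)\bigr) \\
Q(s, a) &= V(s) + A(s, a)
\end{aligned}
```

Because P(s) is positive semi-definite, the advantage is never positive and is zero at a = \mu(s), so the action that maximizes Q is available directly without searching the continuous action space.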
Create a noisy policy by selecting the action with the network's highest q-value, adding noise via epsilon for exploration, and clipping the result to the environment's action bounds.
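A minimal sketch of the noise-and-clip step, assuming Gaussian noise whose standard deviation is epsilon (the noise distribution and the name `noisy_action` are assumptions; the summary only says that noise is added via epsilon):

```python
import random


def noisy_action(greedy_action, epsilon, low, high, rng=random):
    """Perturb each action component, then clip it to its bounds."""
    noisy = [a + rng.gauss(0.0, epsilon) for a in greedy_action]
    return [min(max(a, lo), hi) for a, lo, hi in zip(noisy, low, high)]


rng = random.Random(0)  # seeded only to make the illustration repeatable
action = noisy_action([0.5, -0.9], epsilon=0.3, low=[-1.0, -1.0], high=[1.0, 1.0], rng=rng)
print(action)
```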
Prepare the replay buffer and dataset, import random, and copy data. Create a repeat-action wrapper to repeat actions in the environment, stabilizing learning for continuous actions in robotics.
Implement the training step by unpacking a batch, computing action values with the Q network, next-state values with the target Q network, zero terminal values, then compute and log loss.
Implement end-of-epoch logic by using training step outputs to collect new samples, update the target network, decay epsilon, run a policy-driven episode, and log the episode return.
Debug and launch the continuous lunar lander task by fixing code errors, aligning devices, and configuring the NAF model with a 1e-3 learning rate, then run the trainer.
Explore policy gradient methods where a neural network defines a stochastic policy by outputting action probabilities, enabling smoother learning and handling uncertain tasks.
Increase policy exploration through entropy regularization in policy gradient methods, keeping the entropy of action distributions high to improve robustness and fine-tune optimal policies.
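A minimal sketch of entropy regularization for a discrete action distribution. The coefficient name `beta` and the function names are assumptions; the idea is simply to subtract a scaled entropy bonus from the policy loss so that minimizing the loss keeps the entropy high.

```python
import math


def entropy(probs):
    """Shannon entropy of an action distribution, in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def entropy_regularized_loss(policy_loss, probs, beta=0.01):
    """Lower the loss for distributions that stay spread out."""
    return policy_loss - beta * entropy(probs)


uniform = [0.25, 0.25, 0.25, 0.25]
peaked = [0.97, 0.01, 0.01, 0.01]
print(entropy(uniform) > entropy(peaked))  # prints True
```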
Explore the Brax physics engine to simulate rigid bodies, joints, and actuators for locomotion tasks, including parallel GPU-accelerated environments compatible with gym, demonstrated with a spider-like robot.
Explore deep deterministic policy gradient for continuous actions, using an actor-critic setup with policy and Q networks, differentiable Q functions, and noise-based exploration for stable learning.
Learn how the deep deterministic policy gradient (ddpg) algorithm works in practice by building actor and critic networks, using replay buffers, target networks, and polyak averaging.
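Polyak averaging can be sketched with plain dictionaries standing in for network parameters. The name `polyak_update` and the default `tau` are assumptions, and some implementations define tau the other way around (as the weight on the old target).

```python
def polyak_update(online_params, target_params, tau=0.005):
    """Move each target parameter a small step toward the online one."""
    for name, value in online_params.items():
        target_params[name] = tau * value + (1.0 - tau) * target_params[name]


online = {"w": 1.0}
target = {"w": 0.0}
polyak_update(online, target, tau=0.5)
print(target["w"])  # prints 0.5
```

A small tau makes the target network trail the online network slowly, which keeps the bootstrapped targets from shifting abruptly between updates.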
To complete this section, you will work with the code notebook that you will find at the following link:
https://colab.research.google.com/github/escape-velocity-labs/advanced_rl/blob/main/5_deep_deterministic_policy_gradient.ipynb
Happy coding! :)
Launch the training process by configuring the environment, creating train updates with GPUs, and running 5000 steps, while logging every 10 steps and generating videos every 100 episodes.
This is the most complete Advanced Reinforcement Learning course on Udemy. In it, you will learn to implement some of the most powerful Deep Reinforcement Learning algorithms in Python using PyTorch and PyTorch Lightning. You will implement from scratch adaptive algorithms that solve control tasks based on experience. You will learn to combine these techniques with Neural Networks and Deep Learning methods to create adaptive Artificial Intelligence agents capable of solving decision-making tasks.
This course will introduce you to the state of the art in Reinforcement Learning techniques. It will also prepare you for the next courses in this series, where we will explore other advanced methods that excel in other types of tasks.
The course is focused on developing practical skills. Therefore, after learning the most important concepts of each family of methods, we will implement one or more of their algorithms in Jupyter notebooks, from scratch.
Leveling modules:
- Refresher: The Markov decision process (MDP).
- Refresher: Q-Learning.
- Refresher: Brief introduction to Neural Networks.
- Refresher: Deep Q-Learning.
- Refresher: Policy gradient methods.
Advanced Reinforcement Learning:
- PyTorch Lightning.
- Hyperparameter tuning with Optuna.
- Deep Q-Learning for continuous action spaces (Normalized advantage function - NAF).
- Deep Deterministic Policy Gradient (DDPG).
- Twin Delayed DDPG (TD3).
- Soft Actor-Critic (SAC).
- Hindsight Experience Replay (HER).