College-Level Reinforcement Learning : A Comprehensive Dive!

Rating: 4.5 (9 reviews)

Learn Deep Reinforcement Learning from the ground up, with a special case study on RLHF & RLVR for LLM tuning.

Created by Ahmed Fathy | Tech-Lead | MSc

Last updated 2/2026

English

What you'll learn

Understand reinforcement learning (RL) from the ground up (including relevant proofs and derivations)
Understand model-based & model-free RL techniques
Understand value-based and policy-gradient RL optimization techniques
Understand how to use deep learning in combination with reinforcement learning
Understand RL techniques for discrete and continuous action control
Understand Reinforcement Learning From Human Feedback (RLHF) & From Verifiable Rewards (RLVR)
Understand how LLMs learn to reason and provide chains of thought
Understand how LLMs get trained to call other tools and collaborate with other LLMs/Agents

Course content

8 sections • 170 lectures • 29h 10m total length

Introduction to the course! • 2:31
Success stories of RL • 5:53
Prerequisites & The reference book • 0:22
What is an RL agent? • 10:58
What is a state? • 14:57
What is a policy? • 7:13
Partial Observability • 13:18
Discrete & continuous states • 7:43
Continuous actions and continuous time • 3:51
The model of the environment • 15:10
The Markov property • 5:26
Formalization of the Markov Property • 6:24
MDPs, MRPs, and POMDPs • 6:52
Notation conventions • 14:38
Episodes • 6:05
The discounted cumulative return G • 11:27
Notes on the return G • 3:10
The state and action value functions • 13:15
Derivation of the Bellman expectation equation • 17:47
Derive the Bellman expectation equation, which relates a state's value to the expected values of its next states under a policy, using rewards, the discount factor gamma, and transition probabilities.
Converting an MDP into an MRP • 22:48
A closed-form solution to the MRP • 10:24
A numerical example for the closed-form solution of an MRP • 22:39
The Bellman expectation equation for action value functions • 14:39
Checkpoint! What have we covered so far? • 2:29
Terminology: Evaluation, Control, Planning, and Search • 12:17
Non-Stationary MDPs • 4:41
Exploration and Exploitation • 3:55
The optimal policy & The optimal value functions • 14:53
The Bellman optimality equations • 10:51
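
For orientation, the Bellman expectation equation derived in this section can be written as below. This is a sketch in one common notation, which is an assumption and may differ from the course's own conventions: \pi(a|s) is the policy, R(s,a) the expected immediate reward, P(s'|s,a) the transition probability, and \gamma the discount factor.

```latex
v_\pi(s) = \sum_{a} \pi(a \mid s) \left[ R(s, a) + \gamma \sum_{s'} P(s' \mid s, a) \, v_\pi(s') \right]
```

Read from left to right: the value of a state under a policy is the policy-weighted average, over actions, of the immediate reward plus the discounted expected value of the next state.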

The iterative solution to the Bellman expectation equation • 6:02
Code: Numeric Example for iterative policy evaluation • 16:31
Overview: The Banach fixed point theorem • 14:15
Proof: The Bellman expectation operator is a contraction mapping • 20:31
Proof: The Banach fixed point theorem proof • 26:01
Value Iteration: Introduction & Numeric example • 23:32
Proof: The Bellman optimality operator is a contraction mapping • 12:00
Policy Iteration: Introduction • 6:33
Policy iteration: Numeric example • 30:01
Proof: The policy improvement theorem • 24:46
Notes: Policy iteration vs Value iteration • 6:49
Generalized Policy Iteration • 12:07
Dynamic Programming • 15:54
Notes & Variants! • 11:43
Complexity analysis for PE, PI, and VI • 5:11
VI: Number of iterations needed for convergence • 8:09
Numeric Example: Policy Iteration using action value function equations • 13:57
Checkpoint! • 3:26

Trajectories! • 11:03
Monte-Carlo evaluation - Basic Algorithm • 10:36
Monte-Carlo evaluation - Iterative Algorithm • 7:48
Temporal Difference: 1-step TD • 13:17
A different mathematical derivation of the 1-step TD target • 8:10
Intuitive Comparison: MC vs TD • 9:45
MC vs TD: Same solution if we only have finite data? • 13:44
Code: Effect of the choice of alpha on MC and TD evaluation • 12:19
Comparison Summary: MC vs TD(0) • 6:15
The n-step return & The intuition of TD(lambda) • 13:56
TD(lambda): The mathematics • 13:41
Proof: TD(lambda) converges to TD(0) and MC at extreme values of lambda • 24:43
Intuition: The backward view of TD(lambda) • 7:30
Derivation: The backward view of TD(lambda) • 19:14
Eligibility Traces! Intuition & Derivation • 20:40
Checkpoint! Model-Free evaluation vs DP evaluation • 3:44
Model-Free evaluation of action value functions • 13:22

Terminology: On-Policy, Off-Policy, Online, and Offline learning • 17:02
Model-Free Control: General framework • 11:57
Numeric Example: Why is exploration a must for on-policy model-free control? • 23:10
Proof: An epsilon-greedy policy satisfies the policy improvement theorem • 11:27
The MC control algorithm • 6:06
MC control: Notes on alpha • 3:56
Explore a TD(0) version of Monte Carlo policy control and how the action-value function updates across multiple policies, with alpha shaping convergence by favoring newer episodes and decaying over time.
The SARSA & n-step SARSA control algorithms • 7:02
The SARSA(lambda) control algorithm • 6:02
GLIE & The Robbins-Monro conditions • 7:16
Off-Policy control algorithms: Introduction • 5:56
Q-Learning! • 17:00
Importance Sampling for MC evaluation • 15:35
Concrete Example: Importance Sampling for MC evaluation • 13:20
Importance Sampling: TD(0) and n-step TD • 18:43
Formal Derivation of the Importance Sampling Ratio • 11:09
Convergence guarantees of tabular model-free evaluation & control algorithms • 3:34
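
The note on alpha in MC control says that the step size favors newer episodes. A minimal sketch of that effect follows, assuming the standard incremental update q <- q + alpha * (G - q). The function name `running_estimate` and the sample returns are hypothetical and are not taken from the course.

```python
def running_estimate(returns, alpha=None):
    """Incrementally estimate a value from a stream of episode returns.

    alpha=None uses a 1/n step size, which reproduces the plain sample mean.
    A constant alpha weights recent returns more heavily than older ones.
    """
    q = 0.0
    for n, g in enumerate(returns, start=1):
        step = alpha if alpha is not None else 1.0 / n
        q += step * (g - q)  # move the estimate toward the newest return
    return q


# Hypothetical returns: early episodes (under a poor policy) score 0,
# later episodes (under an improved policy) score 10.
returns = [0.0, 0.0, 0.0, 10.0, 10.0]

sample_mean = running_estimate(returns)                  # treats all episodes equally
recency_weighted = running_estimate(returns, alpha=0.5)  # favors the newer episodes
```

In control, the policy changes between episodes, so older returns describe older policies. That is one reason a constant or slowly decaying alpha can be preferable to the plain average.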

RL with Function Approximators: Introduction and Prerequisites • 1:11
Feature vector representations for states • 13:19
Feature vector representations for actions • 10:57
Function approximators for state and action value functions • 10:57
Shared Weights! Pros & Cons • 8:40
Defining the objective function for policy evaluation • 6:23
MC evaluation with neural networks • 9:52
Value function graph vs Loss function graph • 5:15
MC returns with Linear function approximators • 3:33
The semi-gradient TD(0) evaluation with neural networks • 9:51
The n-step return & lambda return for policy evaluation with FA • 2:58
Intuition: Eligibility Traces in the world of function approximators • 16:27
Math Derivation: Eligibility Traces in the world of function approximators • 15:07
MC vs TD(0) with FA: The same evaluation with infinite data? - Part I - MC • 8:52
MC vs TD(0) with FA: The same evaluation with infinite data? - Part II - MC • 15:07
MC vs TD(0) with FA: The same evaluation with infinite data? - Part III - TD(0) • 18:37
Why do we care about linear function approximators? • 13:28
Visual intuition of the problem of moving targets and changing loss function • 15:27
Fixed Target Networks! • 7:21
Example: Off-Policy learning with linear FA can diverge! • 26:48
Baird's Counter Example: Problem definition • 8:00
Using Dynamic Programming with Function Approximators! • 1:56
Baird's Counter Example diverges & An extended definition for off-policy eval • 10:46
Comparison: Off-Policy: Model free vs Model Based & Tabular vs FA • 13:54
The Deadly Triad • 5:44
Q-Value function evaluation weight update equation • 7:02
Control with FA: Deep SARSA • 14:50
Deep Q-Learning DQN • 19:22
DQN: The famous Atari paper • 5:53
Notes on Experience Replay & Prioritized Experience Replay • 15:18
Double Q-Learning & The maximization bias - Tabular • 17:39
Double Deep Q-Learning (DDQN) • 2:49
Batch Methods with Linear Function Approximations: Introduction • 5:20
LSMC: Closed Form Derivation • 11:49
LSTD(0) & LSTD(lambda) Closed form derivation • 10:24
LSPI • 3:28
Convergence Guarantees with function approximators - control & prediction • 5:35

Introduction to Policy Gradient Methods • 4:18
Why would we estimate the policy directly? • 5:17
Parameterizing a policy neural network • 4:29
Modelling a stochastic multi-variate continuous action policy as a neural net • 6:50
Notes on Value-Based vs Policy-Based methods • 4:09
State Aliasing & Stochastic optimal policies • 8:39
The derivative of the Softmax function • 8:18
The policy gradient theorem for a multi-armed bandit (MAB) • 15:09
Concrete example & intuition of the PG update in the MAB case • 12:50
The Baseline, Advantage, and the Critic in the MAB case • 6:30
Proof: Subtracting a baseline doesn't change the gradient direction (PG, MAB) • 3:45
Contextual Bandits! Introduction • 4:28
The policy improvement theorem for contextual bandits • 9:37
Derivation I: The full MDP : Policy Improvement Theorem • 17:44
Derivation II: The full MDP: Introducing the Action Value Function • 17:37
Derivation III: The full MDP: Introducing the Advantage • 6:51
The REINFORCE algorithm • 5:58
A2C: Advantage Actor Critic • 7:39
A3C: Asynchronous Advantage Actor Critic • 6:42
Entropy Regularization & Hints about designing a loss function • 8:40
DDPG: Deep Deterministic Policy Gradient for continuous action control - Intro • 8:00
DDPG: Training the Actor • 4:07
DDPG: Training the Critic • 9:13
DDPG: Algorithm Overview • 3:07
SAC: Soft Actor Critic - Introduction • 6:35
SAC: Training the Actor • 11:09
Explore the actor objective in SAC, combining critic guidance with entropy to promote exploration, and derive its gradient via the total-derivative law accounting for two gradient paths.
SAC: Training the Critic • 8:30
SAC: Algorithm Overview • 3:20
TRPO: Trust Region Policy Optimization - Intuition • 8:55
TRPO: Optimization Objective • 8:26
The KL Divergence Equation • 5:15
Derivation: The generalized advantage estimation (GAE) • 17:24
Truncated Lambda in GAE • 5:21
PPO: Proximal Policy Optimization (Soft Penalty version) • 8:51
PPO: Proximal Policy Optimization (CLIP version) • 10:04
PPO: Visualizing the objective function of PPO-CLIP • 9:25
GRPO: Group Relative Policy Optimization • 17:03
Checkpoint! • 8:40
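
As a pointer for the SAC actor lecture, the objective and its two gradient paths can be sketched as below. The notation follows the form commonly used in the SAC literature and is an assumption, not necessarily the course's: \phi are the policy parameters, \theta the critic parameters, \alpha the entropy temperature, \mathcal{D} the replay buffer, and a_t = f_\phi(\epsilon_t; s_t) the reparameterized action with noise \epsilon_t.

```latex
J_\pi(\phi) = \mathbb{E}_{s_t \sim \mathcal{D},\, \epsilon_t \sim \mathcal{N}}
  \left[ \alpha \log \pi_\phi\big(f_\phi(\epsilon_t; s_t) \mid s_t\big)
         - Q_\theta\big(s_t, f_\phi(\epsilon_t; s_t)\big) \right]

\hat{\nabla}_\phi J_\pi(\phi) =
  \nabla_\phi \, \alpha \log \pi_\phi(a_t \mid s_t)
  + \Big( \nabla_{a_t} \alpha \log \pi_\phi(a_t \mid s_t) - \nabla_{a_t} Q_\theta(s_t, a_t) \Big)
    \nabla_\phi f_\phi(\epsilon_t; s_t)
```

The first term is the direct path, where \phi changes the density assigned to a fixed action. The second term is the indirect path, where \phi changes the sampled action itself, which in turn changes both the log-probability and the critic's value.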

Introduction: RL for LLM tuning - Case Study • 4:56
What are large language models (LLMs)? • 16:27
Modelling LLMs as RL agents • 5:45
How is LLM Pretraining done? • 9:19
Explore how large language models learn language structure through self-supervised next-token prediction during pre-training with vast data, using cross-entropy loss and teacher forcing.
SFT: Supervised Instruction FineTuning • 6:13
Prompt-Engineering vs RLHF - Introducing the reward model • 13:37
Training the reward model (RLHF/RLAIF) • 4:48
Reward Hacking! • 7:07
Using PPO for RLHF • 3:45
Using GRPO for RLHF • 2:30
DPO: Direct Preference Optimization - Introduction • 5:11
DPO: Direct Preference Optimization - The objective function • 19:15
Reasoning, Chains of thought, Math, and Function/Agent calling (RLVR) • 15:39
Summary: The typical procedure to train a reasoning LLM • 3:36
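
The pretraining lecture mentions next-token prediction with cross-entropy loss and teacher forcing. The sketch below illustrates that loss on a toy example in pure Python. The two "models" (`uniform_model`, `shift_biased_model`), the vocabulary size, and the token sequence are all hypothetical stand-ins for a real LLM and dataset.

```python
import math


def softmax(logits):
    """Convert raw scores into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def next_token_loss(tokens, logits_fn):
    """Average cross-entropy of predicting tokens[t + 1] from tokens[: t + 1]."""
    losses = []
    for t in range(len(tokens) - 1):
        # Teacher forcing: condition on the ground-truth prefix,
        # never on the model's own earlier predictions.
        prefix = tokens[: t + 1]
        probs = softmax(logits_fn(prefix))
        losses.append(-math.log(probs[tokens[t + 1]]))
    return sum(losses) / len(losses)


VOCAB_SIZE = 4  # hypothetical toy vocabulary with token ids 0..3


def uniform_model(prefix):
    """Toy model that knows nothing: equal score for every token."""
    return [0.0] * VOCAB_SIZE


def shift_biased_model(prefix):
    """Toy model that favors (last token + 1) mod VOCAB_SIZE."""
    logits = [0.0] * VOCAB_SIZE
    logits[(prefix[-1] + 1) % VOCAB_SIZE] = 3.0
    return logits


sequence = [0, 1, 2, 3, 0]
uniform_loss = next_token_loss(sequence, uniform_model)
biased_loss = next_token_loss(sequence, shift_biased_model)
```

The uniform model's loss equals log(VOCAB_SIZE), the cost of guessing blindly. The model that has picked up the sequence's pattern gets a lower loss, which is the signal pretraining minimizes.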

Requirements

Basic understanding of probability & statistics (e.g., distributions, mean, variance, expectation)
Basic linear algebra and calculus
Good knowledge of neural networks and deep learning (e.g., gradient descent, back-propagation)

Description

This course is a comprehensive, university-level deep dive into reinforcement learning.
The course starts from the very basics of RL in simple, constrained problems and increases in complexity step by step, ending with algorithms capable of solving complex real-world problems with discrete actions (e.g., LLMs) and continuous actions (e.g., robotics).
The course is also highly mathematical: it introduces many algorithms, proofs, and derivations. It remains intuitive, however, with examples provided to explain every concept or idea.
While there are some code examples, I don't view coding as the main goal of the course. The course focuses much more on concepts, intuitions, and derivations. Coding is used mainly for illustration.
The course covers a lot of traditional and SOTA algorithms in rich & satisfying detail. Some algorithms covered in this course are: Iterative Policy Evaluation (PE), Value Iteration (VI), Policy Iteration (PI), Monte-Carlo evaluation, TD(0), TD(lambda), Backward TD(lambda) with eligibility traces, SARSA, Q-Learning, Double Q-Learning, Expected SARSA, Deep SARSA, Deep Q-Learning, Deep Double Q-Learning, REINFORCE, A2C, A3C, DDPG, SAC, TRPO, PPO, GRPO, DPO.
Finally, the course has a sizeable case study section on RL with LLMs. It covers how large language models and chat agents are trained using reinforcement learning to align better with human preferences, produce chains of thought, and perform better at math & coding. Algorithms for RLHF & RLVR are covered in deep detail.

Who this course is for:

University students taking a serious reinforcement learning course
Machine learning engineers looking to get a deeper understanding of reinforcement learning
LLM engineers looking to understand the inner workings of RLHF and RLVR

Course content

Important Definitions In Reinforcement Learning • 29 lectures • 4hr 47min

Policy evaluation, value iteration, policy iteration, and dynamic programming • 18 lectures • 4hr 17min

Model Free Policy Evaluation • 17 lectures • 3hr 30min

Model-Free Control • 16 lectures • 2hr 59min

Deep RL: RL with function approximators • 37 lectures • 6hr 20min

Policy Gradient Methods • 38 lectures • 5hr 19min

Case Study: RL with LLMs - RLHF & RLVR for math, coding, and reasoning • 14 lectures • 1hr 58min

BONUS: Get all my discounted courses! • 1 lecture • 1min