Udemy
College-Level Reinforcement Learning : A Comprehensive Dive!
Rating: 4.5 out of 5 (9 ratings)
84 students

Learn deep reinforcement learning from the ground up, with a special case study on RLHF & RLVR for LLM tuning.
Last updated 2/2026
English

What you'll learn

  • Understand reinforcement learning (RL) from the ground up (including relevant proofs and derivations)
  • Understand model-based & model-free RL techniques
  • Understand value-based and policy-gradient RL optimization techniques
  • Understand how to use deep learning in combination with reinforcement learning
  • Understand RL techniques for discrete and continuous action control
  • Understand Reinforcement Learning From Human Feedback (RLHF) & From Verifiable Rewards (RLVR)
  • Understand how LLMs learn to reason and provide chains of thought
  • Understand how LLMs get trained to call other tools and collaborate with other LLMs/Agents

Course content

8 sections, 170 lectures, 29h 10m total length
  • Introduction to the course! (2:31)
  • Success stories of RL (5:53)
  • Prerequisites & The reference book (0:22)
  • What is an RL agent? (10:58)
  • What is a state? (14:57)
  • What is a policy? (7:13)
  • Partial Observability (13:18)
  • Discrete & continuous states (7:43)
  • Continuous actions and continuous time (3:51)
  • The model of the environment (15:10)
  • The Markov property (5:26)
  • Formalization of the Markov Property (6:24)
  • MDPs, MRPs, and POMDPs (6:52)
  • Notation conventions (14:38)
  • Episodes (6:05)
  • The discounted cumulative return G (11:27)
  • Notes on the return G (3:10)
  • The state and action value functions (13:15)
  • Derivation of the Bellman expectation equation (17:47)

    Derive the Bellman expectation equation that maps a state's value to expected values of next states under a policy, using rewards, gamma, and transition probabilities.

  • Converting an MDP into an MRP (22:48)
  • A closed-form solution to the MRP (10:24)
  • A numerical example for the closed-form solution of an MRP (22:39)
  • The Bellman expectation equation for action value functions (14:39)
  • Checkpoint! What have we covered so far? (2:29)
  • Terminology: Evaluation, Control, Planning, and Search (12:17)
  • Non-Stationary MDPs (4:41)
  • Exploration and Exploitation (3:55)
  • The optimal policy & The optimal value functions (14:53)
  • The Bellman optimality equations (10:51)
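
As a rough illustration of two lectures listed above (the Bellman expectation equation and the closed-form solution to an MRP), here is a minimal sketch in plain Python. It is not taken from the course material. It assumes the standard textbook form v = R + gamma * P v, which rearranges to (I - gamma * P) v = R. The 3-state MRP, its numbers, and the helper names `solve_linear`, `mrp_values`, and `bellman_residual` are all hypothetical and chosen only for illustration.

```python
def solve_linear(A, b):
    """Solve A x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    # Augmented matrix [A | b], copied so the inputs are not modified.
    M = [list(map(float, A[i])) + [float(b[i])] for i in range(n)]
    for col in range(n):
        # Pick the row with the largest pivot for numerical stability.
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular")
        M[col], M[pivot] = M[pivot], M[col]
        # Normalise the pivot row, then clear this column in all other rows.
        p = M[col][col]
        M[col] = [x / p for x in M[col]]
        for r in range(n):
            if r != col:
                f = M[r][col]
                M[r] = [x - f * y for x, y in zip(M[r], M[col])]
    return [M[i][n] for i in range(n)]


def mrp_values(P, R, gamma):
    """State values of an MRP from (I - gamma * P) v = R."""
    n = len(R)
    A = [[(1.0 if i == j else 0.0) - gamma * P[i][j] for j in range(n)]
         for i in range(n)]
    return solve_linear(A, R)


def bellman_residual(P, R, gamma, v):
    """Largest violation of v(s) = R(s) + gamma * sum_s' P(s, s') * v(s')."""
    n = len(R)
    return max(
        abs(v[i] - (R[i] + gamma * sum(P[i][j] * v[j] for j in range(n))))
        for i in range(n)
    )


# Hypothetical 3-state MRP (numbers invented for illustration).
# State 2 is absorbing and gives zero reward.
P = [[0.5, 0.5, 0.0],
     [0.0, 0.5, 0.5],
     [0.0, 0.0, 1.0]]
R = [1.0, 2.0, 0.0]
gamma = 0.9

v = mrp_values(P, R, gamma)
residual = bellman_residual(P, R, gamma, v)
```

The closed form needs the full transition matrix and a linear solve, so it only suits small state spaces. The residual check confirms that the solved values satisfy the Bellman expectation equation up to floating-point error.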

Requirements

  • Basic understanding of probability & statistics (e.g., distributions, mean, variance, expectation)
  • Basic linear algebra and calculus
  • Good knowledge of neural networks and deep learning (e.g., gradient descent, back-propagation)

Description

  • This course is a comprehensive deep dive into reinforcement learning, taught at university-level depth.

  • The course starts from the very basics of RL in simple, constrained problems and increases in complexity step by step, up to algorithms capable of solving complex real-world problems with discrete actions (e.g., LLMs) and continuous actions (e.g., robotics).

  • The course is also highly mathematical: it introduces many algorithms, proofs, and derivations. It remains highly intuitive, however, with examples provided to explain every concept or idea.

  • While there are some code examples, I don't view coding as the main goal of the course. The course focuses much more on concepts, intuitions, and derivations; code is used mainly for illustration.

  • The course covers many traditional and SOTA algorithms in rich detail. Algorithms covered include: Iterative Policy Evaluation (PE), Value Iteration (VI), Policy Iteration (PI), Monte-Carlo evaluation, TD(0), TD(lambda), Backward TD(lambda) with eligibility traces, SARSA, Q-Learning, Double Q-Learning, Expected SARSA, Deep SARSA, Deep Q-Learning, Deep Double Q-Learning, REINFORCE, A2C, A3C, DDPG, SAC, TRPO, PPO, GRPO, DPO.

  • Finally, the course has a sizeable case study section on RL with LLMs. It covers how large language models and chat agents are trained with reinforcement learning to align better with human preferences, produce chains of thought, and improve at math & coding. Algorithms for RLHF & RLVR are covered in detail.


Who this course is for:

  • University students taking a serious reinforcement learning course
  • Machine learning engineers looking to get a deeper understanding of reinforcement learning
  • LLM engineers looking to understand the inner workings of RLHF and RLVR