
Derive the Bellman expectation equation, which maps a state's value to the expected values of its next states under a policy, using rewards, the discount factor gamma, and transition probabilities.
Explore a TD(0) version of Monte Carlo policy control and how the action-value function is updated across multiple policies, with the step size alpha shaping convergence by favoring newer episodes and decaying over time.
Explore the actor objective in SAC, combining critic guidance with entropy to promote exploration, and derive its gradient via the total-derivative law accounting for two gradient paths.
Explore how large language models learn language structure through self-supervised next-token prediction during pre-training with vast data, using cross-entropy loss and teacher forcing.
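As a small illustration of the first topic above, the Bellman expectation equation can be applied repeatedly as a backup until the state values stop changing. The sketch below is not taken from the course: the two-state MDP, its state and action names, and all probabilities and rewards are made up for the example.

```python
# Minimal sketch of the Bellman expectation backup on a hypothetical 2-state MDP.
# All states, actions, probabilities, and rewards below are illustrative assumptions.
GAMMA = 0.9  # discount factor gamma

# policy[s][a] = probability of taking action a in state s
policy = {
    "A": {"stay": 0.5, "go": 0.5},
    "B": {"stay": 1.0},
}

# transitions[s][a] = list of (probability, next_state, reward)
transitions = {
    "A": {
        "stay": [(1.0, "A", 0.0)],
        "go": [(0.8, "B", 1.0), (0.2, "A", 0.0)],
    },
    "B": {"stay": [(1.0, "B", 2.0)]},
}


def bellman_expectation_backup(state, values):
    """v(s) = sum_a pi(a|s) * sum_s' p(s'|s,a) * (r + gamma * v(s'))."""
    total = 0.0
    for action, action_prob in policy[state].items():
        for prob, next_state, reward in transitions[state][action]:
            total += action_prob * prob * (reward + GAMMA * values[next_state])
    return total


def evaluate_policy(tol=1e-10, max_sweeps=10_000):
    """Sweep the backup over all states until the largest change is below tol."""
    values = {s: 0.0 for s in policy}
    for _ in range(max_sweeps):
        new_values = {s: bellman_expectation_backup(s, values) for s in policy}
        delta = max(abs(new_values[s] - values[s]) for s in policy)
        values = new_values
        if delta < tol:
            break
    return values


values = evaluate_policy()
print(values)
```

For state B the equation reduces to v(B) = 2 + 0.9 v(B), so the sweep should settle near v(B) = 20. This gives a quick sanity check on the backup.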
This course is a comprehensive, university-level deep dive into reinforcement learning.
The course starts from the very basics of RL in simple, constrained problems and increases in complexity step by step, up to algorithms capable of solving complex real-world problems with discrete actions (e.g., LLMs) and continuous actions (e.g., robotics).
The course is also highly mathematical: it introduces many algorithms, proofs, and derivations. It nevertheless remains highly intuitive, with plenty of examples provided to explain every concept or idea.
While there are some code examples, I don't view them as the main goal of the course. The course focuses much more on concepts, intuitions, and derivations. Coding is used mainly for illustration.
The course covers a lot of traditional and SOTA algorithms in rich & satisfying detail. Some algorithms covered in this course are: Iterative Policy Evaluation (PE), Value Iteration (VI), Policy Iteration (PI), Monte-Carlo evaluation, TD(0), TD(lambda), Backward TD(lambda) with eligibility traces, SARSA, Q-Learning, Double Q-Learning, Expected SARSA, Deep SARSA, Deep Q-Learning, Deep Double Q-Learning, REINFORCE, A2C, A3C, DDPG, SAC, TRPO, PPO, GRPO, DPO.
Finally, the course has a sizeable case study section on RL with LLMs. It covers how large language models and chat agents are trained using reinforcement learning to align better with human preferences, produce chains of thought, and improve at math & coding. Algorithms for RLHF & RLVR are covered in depth.