Artificial Intelligence: Reinforcement Learning in Python

Rating: 4.8 (10812 reviews)

Complete guide to Reinforcement Learning, with Stock Trading and Online Advertising Applications

Created by Lazy Programmer Team, Lazy Programmer Inc.

Last updated 2/2026

English

English [Auto]

What you'll learn

Apply gradient-based supervised machine learning methods to reinforcement learning
Understand reinforcement learning on a technical level
Understand the relationship between reinforcement learning and psychology
Implement 17 different reinforcement learning algorithms
Understand important foundations for OpenAI ChatGPT, GPT-4

Course content

15 sections • 112 lectures • 14h 42m total length

Introduction • 3:14
Discover reinforcement learning in AI, from self-playing AIs like AlphaGo to mastering video games and robot navigation, using simulation to accelerate learning and push toward real-world impact.
Course Outline and Big Picture • 7:55
Outline the course from multi-armed bandits to full reinforcement learning, highlighting three approaches and real-world applications in A/B testing and recommender systems. Build a stock trading bot with Q-learning.
Where to get the Code • 4:36
Access the course code from the GitHub repository and clone it with git, avoiding forks to stay up to date while following the theory-to-code pattern and practicing hands-on coding.
How to Succeed in this Course • 3:04
Follow three guidelines to succeed: use the Q&A, meet the prerequisites, and take handwritten notes in your own words while coding what you see.
Warmup • 15:36
Delve into probability basics (joint, marginal, and conditional distributions, expectations, and Bayes' rule), then practice linear regression with mean squared error and gradient descent in Python using synthetic data.

Section Introduction: The Explore-Exploit Dilemma • 10:17
Explore the explore-exploit dilemma with a two-armed bandit analogy and learn four adaptive algorithms: epsilon-greedy, optimistic initial values, UCB1, and Thompson sampling.
Applications of the Explore-Exploit Dilemma • 8:00
Apply the explore-exploit dilemma to real-world advertising and web design by comparing options with experiments, using Gaussian distributions to measure click-through and conversion rates.
Epsilon-Greedy Theory • 7:04
Learn how epsilon-greedy balances exploration and exploitation in reinforcement learning by occasionally choosing a random bandit, updating mean rewards, and adapting epsilon over time.
Calculating a Sample Mean (pt 1) • 5:56
Learn to calculate the sample mean for binary 0/1 rewards, show it is the maximum likelihood estimate of the Bernoulli parameter, and enable constant time and space updates for epsilon-greedy.
Epsilon-Greedy Beginner's Exercise Prompt • 5:05
Begin with a hands-on epsilon-greedy exercise in reinforcement learning, building a bandit simulator from scratch in Python, and learn exploration versus exploitation through a complete loop and updates.
Designing Your Bandit Program • 4:09
Explore bandit methods with a hands-on approach to explore-exploit in reinforcement learning, designing a slot-machine pull function and tracking average rewards.
Epsilon-Greedy in Code • 7:12
Implement an epsilon-greedy bandit in Python to balance exploration and exploitation with simulated trials. Analyze estimated versus true win rates and extend with decaying epsilon strategies.
Comparing Different Epsilons • 6:02
Compare epsilon-greedy performance with Gaussian rewards across three arms, revealing how different epsilons influence convergence, cumulative reward, and exploration-exploitation tradeoffs.
Optimistic Initial Values Theory • 5:40
Explore the optimistic initial values method for the explore-exploit dilemma, using large initial estimates with greedy selection to induce exploration and discuss initialization as a hyperparameter.
Optimistic Initial Values Beginner's Exercise Prompt • 2:26
Study optimistic initial values in a beginner’s exercise for multi-armed bandits, covering the bandit class setup, p estimates, and selecting the next arm, with a full exercise for advanced learners.
Optimistic Initial Values Code • 4:18
Explore the optimistic initial values method for bandits, including initializing P estimates and n_A, greedy selection, and updates, and compare its performance to epsilon-greedy.
UCB1 Theory • 14:32
Learn how the UCB1 algorithm uses the sample mean plus an upper confidence bound to balance exploration and exploitation in reinforcement learning.
UCB1 Beginner's Exercise Prompt • 2:14
Learn to implement the UCB1 algorithm in a beginner-friendly exercise, including bandit setup, experiment loop, and plotting results, with a focus on applying the UCB formula.
UCB1 Code • 3:28
Implement the UCB1 algorithm in Python, detailing the UCB formula, the initialization and main loops, and show convergence to the optimal bandit with accurate mean estimates and cumulative reward.
Bayesian Bandits / Thompson Sampling Theory (pt 1) • 12:43
Explore Bayesian bandits and Thompson sampling with a beta prior for the Bernoulli mean theta, updating to a beta posterior as data arrives, illustrating conjugacy and online learning.
Bayesian Bandits / Thompson Sampling Theory (pt 2) • 17:35
Explore Thompson sampling, drawing from the posterior distribution to select arms instead of using an upper bound. Update beliefs with data, visualize beta posteriors, and compare exploration-exploitation in Bayesian bandits.
Thompson Sampling Beginner's Exercise Prompt • 2:50
Explore implementing the Thompson sampling algorithm in code with a beginner-friendly fill-in-the-blanks exercise. Learn to use the beta distribution for updating bandit posteriors in a Python reinforcement learning context.
Thompson Sampling Code • 5:03
Implement Thompson sampling for a Bayesian bandit by initializing beta priors, sampling from posteriors, and updating parameters, showing convergence to the optimal bandit and fat posteriors for exploration.
Thompson Sampling With Gaussian Reward Theory • 11:24
Extend Thompson sampling to Gaussian rewards with a Gaussian likelihood and known precision. Derive the posterior parameters for the mean under conjugate priors and implement single-sample updates.
Thompson Sampling With Gaussian Reward Code • 6:18
Explore implementing the Bayesian bandit for normally distributed rewards using Gaussian posteriors and Thompson sampling, with code that initializes and updates the Gaussian parameters.
Exercise on Gaussian Rewards • 1:20
Explore the Bayesian bandit with Gaussian rewards, adjust it for true means 5, 10, and 20, and diagnose why simple replacements break convergence.
Why don't we just use a library? • 5:40
Libraries can't replace the essential math and coding skills; real-world practice hinges on backend engineering and data infrastructure, databases, batch jobs, and cross-language systems.
Nonstationary Bandits • 7:11
Explore nonstationary bandits by updating the mean with a constant learning rate via an exponentially weighted moving average, enabling the model to adapt as rewards change over time.
Bandit Summary, Real Data, and Online Learning • 6:29
Examine the explore-exploit dilemma with epsilon-greedy, upper confidence bound, optimistic initial values, and Bayesian bandit approaches, and why online learning with real data testing remains impractical for true click-through rates.
(Optional) Alternative Bandit Designs • 10:05
Explore why there is no single correct bandit design and how naming reflects mathematical concepts. Learn how many-to-many modeling, environment and agent roles, and real-world versus simulation considerations guide implementation.
Suggestion Box • 3:10
Explore a practical suggestion box for reinforcement learning students, collecting background, course difficulty, missed explanations, and future topic requests to tailor and improve the Python-based course.
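
As a rough illustration of the epsilon-greedy lectures in this section, the loop can be sketched in a few lines of plain Python. This is not the course's own code: the `BernoulliArm` class, the win rates, the seed, and the number of trials are all assumptions chosen for the example. It uses the constant-time incremental sample mean described under "Calculating a Sample Mean".

```python
import random


class BernoulliArm:
    """A slot machine that pays 1 with probability p, else 0 (hypothetical helper)."""

    def __init__(self, p):
        self.p = p            # true win rate, unknown to the agent
        self.estimate = 0.0   # running sample mean of observed rewards
        self.n = 0            # number of pulls so far

    def pull(self, rng):
        return 1 if rng.random() < self.p else 0

    def update(self, reward):
        # Incremental sample mean: constant time and space per update.
        self.n += 1
        self.estimate += (reward - self.estimate) / self.n


def run_epsilon_greedy(true_rates, epsilon=0.1, trials=10_000, seed=0):
    rng = random.Random(seed)
    arms = [BernoulliArm(p) for p in true_rates]
    total_reward = 0
    for _ in range(trials):
        if rng.random() < epsilon:
            arm = rng.choice(arms)                      # explore
        else:
            arm = max(arms, key=lambda a: a.estimate)   # exploit
        reward = arm.pull(rng)
        arm.update(reward)
        total_reward += reward
    return arms, total_reward


arms, total_reward = run_epsilon_greedy([0.2, 0.5, 0.75])
for arm in arms:
    print(f"true={arm.p:.2f} estimate={arm.estimate:.3f} pulls={arm.n}")
print("overall win rate:", total_reward / 10_000)
```

Swapping the `if`/`else` selection rule for a UCB1 score or a draw from a beta posterior would turn the same skeleton into the other bandit algorithms listed above.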

MDP Section Introduction • 6:19
Explore the fundamentals of Markov decision processes in reinforcement learning. Understand the return, value function, and Bellman equation, with grid world visuals and the roots of Q-learning.
Gridworld • 12:35
Explore grid world as a simple reinforcement learning environment, defining episodes, terminal states, state and action spaces, and learning policies—deterministic, probabilistic, and epsilon-greedy.
Choosing Rewards • 3:58
Design the reward structure to guide the agent toward the goal in grid worlds and mazes, using plus one for success and minus one to discourage wandering.
The Markov Property • 6:12
Learn how the Markov property limits dependencies to the previous state, explore the state transition matrix, and see how language modeling and deep Q-learning illustrate first-order assumptions in practice.
Markov Decision Processes (MDPs) • 14:42
Explore the Markov decision process as a discrete-time stochastic control framework with an agent and environment, where actions determine next states and rewards through p(s'|s,a) and a policy.
Future Rewards • 9:34
Maximize the return, the sum of future rewards (actions affect the future, not the past), using the discount factor gamma to emphasize near-term rewards and improve reinforcement learning training.
Value Functions • 5:07
Define the value function as the expected return from a state under a policy. Show how state, policy, and dynamics affect future rewards, with terminal states valued at zero.
The Bellman Equation (pt 1) • 8:46
View the Bellman equation as a recursive, one-step lookahead for the value function, using the law of total expectation to compute expected returns from next states, underpinning Q-learning.
The Bellman Equation (pt 2) • 6:42
Explore the Bellman equation by expanding its terms over state, action, reward, and next state, highlighting policy and environment dynamics and the move toward scalable solutions.
The Bellman Equation (pt 3) • 6:09
Examine the Bellman equation for value functions, detailing state value V and action value Q, and how policies, including stochastic ones, shape future rewards.
Bellman Examples • 22:24
Learn how to solve the Bellman equation for the value function using simple two- and three-state examples, including deterministic and random transitions, terminal states, rewards, and a gamma of 0.9.
Optimal Policy and Optimal Value Function (pt 1) • 9:17
Explore how to find the best policy and the best value function using the Bellman optimality equation, linking V* and Q* with policy evaluation.
Optimal Policy and Optimal Value Function (pt 2) • 4:36
Derive the optimal policy from V* or Q* using argmax, and contrast evaluation with control, outlining dynamic programming, Monte Carlo, and temporal difference methods.
MDP Summary • 2:58
Explore the foundations of Markov decision processes with a grid world, defining states, actions, rewards, policies, and values, and derive the Bellman equation to maximize the expected return.
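
To make the idea of a discounted return concrete, here is a minimal sketch, assuming an episode is given as a plain list of rewards and using the gamma of 0.9 mentioned in the Bellman examples. The function name `discounted_returns` is made up for this illustration. It relies on the same one-step recursion the Bellman equation expresses: the return at step t is the immediate reward plus gamma times the return at step t+1.

```python
def discounted_returns(rewards, gamma=0.9):
    """Return G_t for every step t, where G_t = r_t + gamma * G_{t+1}."""
    returns = [0.0] * len(rewards)
    g = 0.0  # the return after the terminal state is zero
    for t in reversed(range(len(rewards))):
        g = rewards[t] + gamma * g
        returns[t] = g
    return returns


# Three steps with no reward, then +1 on reaching the goal.
print(discounted_returns([0, 0, 0, 1], gamma=0.9))
```

States closer to the goal get a larger return, which is why a discount below 1 nudges the agent toward shorter paths.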

Dynamic Programming Section Introduction • 8:59
Introduces how to use the Markov decision process to solve real reinforcement learning problems, outlining prediction and control tasks and applying the Bellman equation through policy iteration and value iteration.
Iterative Policy Evaluation • 15:36
Explore iterative policy evaluation, a dynamic programming approach using the Bellman equation to compute the state-value function V^π for a policy, iterating until the largest change (delta) is small enough that the values have reached a fixed point.
Designing Your RL Program • 5:00
Design reinforcement learning programs with a consistent interface for prediction and control problems. Initialize value functions, run episodes, collect states, actions, rewards, and iteratively update policies and values.
Gridworld in Code • 11:37
Implement a grid world environment in code, enabling policy evaluation and policy iteration, by defining state management, moves, terminal states, rewards, and actions.
Iterative Policy Evaluation in Code • 12:17
Explore iterative policy evaluation in a deterministic grid world by building transition probabilities, a policy dictionary, and value updates using Bellman equations, with visualization helpers and convergence checks.
Windy Gridworld in Code • 7:47
Explore windy grid world with probabilistic transitions stored in a dictionary of state-action to next-state probabilities. Adopt test-driven development to implement the move function and deterministic rewards.
Iterative Policy Evaluation for Windy Gridworld in Code • 7:14
Implement iterative policy evaluation in Windy Grid World with probabilistic state transitions and a probabilistic policy, building transition and rewards dictionaries and applying the Bellman equation to observe convergence.
Policy Improvement • 11:23
Explore policy improvement in reinforcement learning: compare Q and V values, choose argmax actions, update policies across states, and approach the Bellman optimality equation for the optimal policy.
Policy Iteration • 7:57
Policy iteration refines a policy by alternating evaluation and improvement until stability. It uses value functions and Bellman updates to guide each step.
Policy Iteration in Code • 8:27
Develop and implement policy iteration in a deterministic grid world, performing policy evaluation and improvement, using transition probabilities, rewards, and the Bellman optimality equation to converge on an optimal policy.
Policy Iteration in Windy Gridworld • 8:50
Value Iteration • 7:40
Value iteration speeds up policy optimization by updating the value function with the Bellman max, skipping policy evaluation, and stopping when delta falls below a threshold, at which point the values approximate the optimal value function and policy.
Value Iteration in Code • 6:36
Implement value iteration in a grid world by computing transition probabilities and rewards, applying Bellman optimality, and extracting the optimal value function and policy.
Dynamic Programming Summary • 4:57
Review dynamic programming in reinforcement learning, covering policy evaluation, policy improvement, policy iteration, and value iteration; contrast model-based and model-free approaches and learning from experience.
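
For a feel of what value iteration looks like in code, here is a small sketch on a made-up environment rather than the course's grid world: a five-state corridor where state 4 is terminal and entering it pays +1. The environment, the constants, and the function names are assumptions for illustration only.

```python
GAMMA = 0.9
THRESHOLD = 1e-6
N_STATES = 5                    # states 0..4 in a corridor; state 4 is terminal
ACTIONS = {"L": -1, "R": +1}


def step(state, action):
    """Deterministic model of the toy corridor: returns (next_state, reward)."""
    next_state = min(max(state + ACTIONS[action], 0), N_STATES - 1)
    reward = 1.0 if next_state == N_STATES - 1 else 0.0
    return next_state, reward


def value_iteration():
    V = [0.0] * N_STATES        # the terminal state keeps value 0
    while True:
        delta = 0.0
        for s in range(N_STATES - 1):
            # Bellman optimality backup: max over actions of r + gamma * V(s')
            best = max(r + GAMMA * V[s2] for s2, r in (step(s, a) for a in ACTIONS))
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < THRESHOLD:
            break
    # Extract the greedy policy from the converged values.
    policy = {}
    for s in range(N_STATES - 1):
        policy[s] = max(ACTIONS, key=lambda a: step(s, a)[1] + GAMMA * V[step(s, a)[0]])
    return V, policy


V, policy = value_iteration()
print(V)
print(policy)
```

Because the model (`step`) is known in advance, no experience is needed here; the Monte Carlo and temporal difference sections drop that assumption.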

Monte Carlo Intro • 9:21
Explore Monte Carlo methods to estimate expected values from samples. Apply sample means to approximate value functions under a policy in reinforcement learning, addressing environment dynamics and the explore-exploit dilemma.
Monte Carlo Policy Evaluation • 10:52
Use Monte Carlo policy evaluation to estimate state values by averaging returns from episodes generated under a given policy, including first-visit and every-visit variants.
Monte Carlo Policy Evaluation in Code • 7:52
Implement Monte Carlo policy evaluation in a grid world to predict value functions, using first-visit Monte Carlo, episode sampling, and a 0.9 discount, with random starts and terminal states.
Monte Carlo Control • 9:00
Discover how Monte Carlo control solves the control problem by using policy evaluation and improvement to converge toward the optimal policy.
Monte Carlo Control in Code • 8:51
Implement Monte Carlo control in code by learning a policy over state and action pairs through first-visit updates, updating values and policy on a grid world.
Monte Carlo Control without Exploring Starts • 4:41
Discover Monte Carlo control without exploring starts by using epsilon-greedy policies, first-visit updates, and Q estimates to learn the optimal state-action choices without impractical exploration.
Monte Carlo Control without Exploring Starts in Code • 5:40
Implement Monte Carlo control with an epsilon-greedy policy, without exploring starts, in a grid world, updating Q via sample means and tracking state and state-action counts.
Monte Carlo Summary • 1:53
Apply Monte Carlo methods to reinforcement learning by estimating values from sample means as agents learn from experience. Highlight policy iteration in sample-based learning and epsilon-greedy exploration for Q-based control.
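
First-visit Monte Carlo evaluation can be sketched as follows, assuming each episode is a list of (state, reward) pairs where the reward is the one received on the transition out of that state. The two episodes are hand-written toy data, and the function name and data layout are assumptions, not the course's interface.

```python
from collections import defaultdict


def first_visit_mc(episodes, gamma=0.9):
    """Estimate V(s) as the average return following the first visit to s."""
    returns_sum = defaultdict(float)
    returns_count = defaultdict(int)
    for episode in episodes:
        # Work backwards so each return is built from the one after it.
        g = 0.0
        first_visit_return = {}
        for state, reward in reversed(episode):
            g = reward + gamma * g
            # Overwriting leaves the return from the earliest visit.
            first_visit_return[state] = g
        for state, g in first_visit_return.items():
            returns_sum[state] += g
            returns_count[state] += 1
    return {s: returns_sum[s] / returns_count[s] for s in returns_sum}


# Two hand-written episodes under some fixed policy (illustrative data only).
episodes = [
    [("A", 0), ("B", 0), ("C", 1)],
    [("B", 0), ("A", 0), ("B", 0), ("C", 1)],
]
V = first_visit_mc(episodes)
print(V)
```

No transition probabilities appear anywhere: the estimate comes purely from averaging sampled returns, which is what distinguishes this from the dynamic programming approach.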

Temporal Difference Introduction • 3:55
Learn temporal difference learning, a bootstrapping, sample-based method that blends dynamic programming and Monte Carlo ideas, and explore two variants, SARSA and Q-learning, for prediction and control.
TD(0) Prediction • 5:24
Explore TD(0) prediction and how bootstrapping updates the value function with the one-step target r plus gamma V(s'), avoiding full-episode reliance like Monte Carlo.
TD(0) Prediction in Code • 4:54
Demonstrate implementing TD(0) prediction in Python code using a grid environment, with epsilon-greedy action selection, a constant learning rate alpha, and updating the value function and deltas.
SARSA • 4:36
Apply temporal-difference learning to control with the SARSA algorithm, updating Q values from state, action, reward, next state, and next action in an epsilon-greedy on-policy framework.
SARSA in Code • 6:20
Implement a SARSA-based control loop in Python using epsilon-greedy action selection from Q, on a grid world with -0.1 step cost, and derive the optimal policy and value function.
Q Learning • 4:55
Explore Q-learning, which updates the Q-function using the max future Q-value, making it off-policy with a greedy target policy and an epsilon-greedy behavior policy.
Q Learning in Code • 5:02
Implement Q-learning in Python by building a Q-table, applying epsilon-greedy actions, updating values with argmax and max, and deriving the final policy for a grid environment.
TD Learning Section Summary • 2:27
Convert Monte Carlo into a one-step update via the Bellman equation, bootstrap value estimates with temporal difference learning, and compare on-policy methods like SARSA with off-policy Q-learning.
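
The difference between SARSA and Q-learning comes down to one term in the update target. The sketch below shows both tabular updates side by side, assuming a Q-table keyed by (state, action) tuples; the state names, action labels, and default learning rate are made up for the example.

```python
from collections import defaultdict

ACTIONS = ["U", "D", "L", "R"]


def sarsa_update(Q, s, a, r, s2, a2, alpha=0.1, gamma=0.9, done=False):
    # On-policy target: uses the action a2 actually chosen in s2.
    target = r if done else r + gamma * Q[(s2, a2)]
    Q[(s, a)] += alpha * (target - Q[(s, a)])


def q_learning_update(Q, s, a, r, s2, alpha=0.1, gamma=0.9, done=False):
    # Off-policy target: uses the greedy (max) action value in s2.
    target = r if done else r + gamma * max(Q[(s2, b)] for b in ACTIONS)
    Q[(s, a)] += alpha * (target - Q[(s, a)])


# Same transition, two different targets.
Q_sarsa = defaultdict(float, {("s2", "U"): 1.0, ("s2", "D"): 0.5})
Q_qlearn = defaultdict(float, {("s2", "U"): 1.0, ("s2", "D"): 0.5})
sarsa_update(Q_sarsa, "s1", "R", 0.0, "s2", "D")   # next action was "D"
q_learning_update(Q_qlearn, "s1", "R", 0.0, "s2")  # ignores the next action
print(Q_sarsa[("s1", "R")], Q_qlearn[("s1", "R")])
```

When the behavior policy happens to pick the greedy action, the two updates coincide; they differ only on exploratory steps.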

Approximation Methods Section Introduction • 4:19
Explore function approximation in reinforcement learning, moving beyond tabular methods with linear models, linear regression, stochastic gradient descent, and feature engineering for prediction and control, applied to grid world.
Linear Models for Reinforcement Learning • 8:32
Explore linear regression as a foundation for prediction and control in reinforcement learning, using function approximation with W^T X and squared error minimized by gradient descent and stochastic gradient descent.
Feature Engineering • 10:16
Apply feature engineering to transform inputs with basis expansions, such as polynomials and radial basis functions, enabling linear models to approximate nonlinear value functions in reinforcement learning.
Approximation Methods for Prediction • 9:55
Use function approximation to solve the prediction task with a feature vector X and weight vector W, comparing Monte Carlo prediction and TD learning with gradient descent on squared error.
Approximation Methods for Prediction Code • 8:26
Implement function approximation for prediction with temporal difference learning using an RBF feature expansion and epsilon-greedy control, updating weights via the TD error and tracking mean squared error per episode.
Approximation Methods for Control • 4:41
Apply function approximation for control in reinforcement learning by encoding state-action pairs with one-hot actions, using feature expansion such as an RBF kernel, and updating Q-values via gradient descent in Q-learning.
Approximation Methods for Control Code • 8:54
Learn function approximation for control with Q-learning, using action-to-integer mappings, one-hot encoding, and greedy action selection to derive value and policy.
CartPole • 5:34
Explore CartPole in OpenAI Gym with Q-learning and function approximation. Grasp the four-component state vector, left/right actions, and +1 rewards until termination at 200 steps, plus a random policy example.
CartPole Code • 6:00
Build an optimal CartPole controller in a gym environment with epsilon-greedy learning and gamma tuning. Train with a reusable model class and observe an average reward per episode of 200.
Approximation Methods Exercise • 4:07
Practice exercises reinforce function approximation in reinforcement learning: test feature expansions, target value forms, batch gradient descent, neural networks with deep libraries, optimization techniques, and state-action representations for Q-learning.
Approximation Methods Section Summary • 3:05
Learn function approximation for reinforcement learning with linear models, stochastic gradient descent, and feature engineering (polynomial expansions and RBF kernels) to predict values and guide control.
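
The core trick of this section, a linear model on top of nonlinear features, can be sketched without any RL machinery. The example below assumes a one-dimensional state and fits a fixed target function with stochastic gradient descent. In a real prediction task the target would instead be a Monte Carlo return or a TD target. The centers, width, learning rate, and target function are all arbitrary choices for illustration.

```python
import math


def rbf_features(x, centers, width=1.0):
    """Map a scalar state x to one Gaussian bump per center."""
    return [math.exp(-((x - c) ** 2) / (2 * width ** 2)) for c in centers]


def predict(weights, features):
    return sum(w * f for w, f in zip(weights, features))


def sgd_step(weights, features, target, lr=0.1):
    """One stochastic gradient step on the squared error (target - w.x)^2 / 2."""
    error = target - predict(weights, features)
    return [w + lr * error * f for w, f in zip(weights, features)]


centers = [0.0, 1.0, 2.0, 3.0, 4.0]
weights = [0.0] * len(centers)

# Fit a nonlinear stand-in target, V(x) = x^2, from repeated sweeps of samples.
samples = [(x / 2, (x / 2) ** 2) for x in range(9)]   # x in {0, 0.5, ..., 4}
for _ in range(2000):
    for x, target in samples:
        weights = sgd_step(weights, rbf_features(x, centers), target)

for x, target in samples:
    estimate = predict(weights, rbf_features(x, centers))
    print(f"x={x:.1f} target={target:.2f} prediction={estimate:.2f}")
```

The model stays linear in the weights, so the gradient is just the feature vector scaled by the error, yet the fitted curve is nonlinear in the state.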

Beginners, halt! Stop here if you skipped ahead • 14:09
Begin with the course prerequisites and complete prior exercises to assess readiness for this reinforcement learning section; avoid skipping ahead.
Stock Trading Project Section Introduction • 5:13
Explore reinforcement learning in stock trading as an agent that buys and sells in a market environment, guided by states and rewards.
Data and Environment • 12:22
Examine an OpenAI Gym-style stock trading environment that simulates three stocks (Apple, Motorola, and Starbucks), with the state given by shares, prices, and cash, the actions buy, sell, and hold, and the reward equal to the change in portfolio value.
How to Model Q for Q-Learning • 9:37
Describe a state-based linear regression Q-learning model with one output per action, updating only the chosen action using gradient descent with momentum in a deep reinforcement learning setting.
Design of the Program • 6:45
Design a reinforcement learning trading program with train and test modes, using Q-learning and epsilon-greedy updates to train on past data and test on future prices, tracking portfolio value.
Code pt 1 • 7:59
Explore reinforcement learning for stock trading with a linear regression model trained via stochastic gradient descent, using momentum, mean squared error, and a single step gradient update.
Code pt 2 • 9:40
Define a multi-stock reinforcement learning environment with state as shares, prices, and cash, and a 27-action space for buy, hold, or sell across three stocks, plus reset and step dynamics.
Code pt 3 • 4:28
Learn how an agent uses epsilon-greedy action selection, gamma discounting, and a neural model to learn from state, action, reward, next state, maximize future rewards, and save and load weights.
Code pt 4 • 7:17
Finish the reinforcement learning trading script by running episodes, updating the environment, and training or testing the agent to optimize the portfolio value on five years of time series data.
Stock Trading Project Discussion • 3:37
Discuss evaluating a stock trading reinforcement learning agent in Python, comparing its performance to a random agent, and exploring hyperparameters, data sets, and potential extensions like metadata and returns.
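
The 27-action space mentioned above follows from giving each of the three stocks one of three choices, so 3 × 3 × 3 = 27 joint actions. One plausible way to enumerate them is sketched below. The ordering of the choices and the `decode_action` helper are assumptions for illustration; the course's own encoding may differ.

```python
import itertools

STOCKS = ["Apple", "Motorola", "Starbucks"]
CHOICES = ["sell", "hold", "buy"]

# Every combination of one choice per stock: 3 ** 3 = 27 joint actions.
ACTION_LIST = list(itertools.product(CHOICES, repeat=len(STOCKS)))


def decode_action(index):
    """Translate an integer action (0..26) into a per-stock instruction."""
    return dict(zip(STOCKS, ACTION_LIST[index]))


print(len(ACTION_LIST))
print(decode_action(0))
print(decode_action(26))
```

Mapping joint actions to integers this way lets a Q-model with one output per action be indexed directly by the chosen action.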

Requirements

Calculus (derivatives)
Probability / Markov Models
Numpy, Matplotlib
Beneficial to have experience with at least a few supervised machine learning methods
Gradient descent
Good object-oriented programming skills

Description

Ever wondered how AI technologies like OpenAI ChatGPT and GPT-4 really work? In this course, you will learn the foundations of these groundbreaking applications.

When people talk about artificial intelligence, they usually don’t mean supervised and unsupervised machine learning.

These tasks are pretty trivial compared to what we think of AIs doing - playing chess and Go, driving cars, and beating video games at a superhuman level.

Reinforcement learning has recently become popular for doing all of that and more.

Much like deep learning, a lot of the theory was discovered in the 70s and 80s but it hasn’t been until recently that we’ve been able to observe first hand the amazing results that are possible.

In 2016 we saw Google’s AlphaGo beat the world champion in Go.

We saw AIs playing video games like Doom and Super Mario.

Self-driving cars have started driving on real roads with other drivers and even carrying passengers (Uber), all without human assistance.

If that sounds amazing, brace yourself for the future because the law of accelerating returns dictates that this progress is only going to continue to increase exponentially.

Learning about supervised and unsupervised machine learning is no small feat. To date I have over TWENTY FIVE (25!) courses just on those topics alone.

And yet reinforcement learning opens up a whole new world. As you’ll learn in this course, the reinforcement learning paradigm is very different from both supervised and unsupervised learning.

It’s led to new and amazing insights both in behavioral psychology and neuroscience. As you’ll learn in this course, there are many analogous processes when it comes to teaching an agent and teaching an animal or even a human. It’s the closest thing we have so far to a true artificial general intelligence.

What’s covered in this course?

The multi-armed bandit problem and the explore-exploit dilemma
Ways to calculate means and moving averages and their relationship to stochastic gradient descent
Markov Decision Processes (MDPs)
Dynamic Programming
Monte Carlo
Temporal Difference (TD) Learning (Q-Learning and SARSA)
Approximation Methods (i.e. how to plug in a deep neural network or other differentiable model into your RL algorithm)
How to use OpenAI Gym, with zero code changes
Project: Apply Q-Learning to build a stock trading bot
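
To give a taste of the second item in that list, the link between means, moving averages, and gradient descent fits in a few lines. Both functions below apply the same update, `mean += step * (x - mean)`, which is a gradient step on the squared error between `mean` and the sample. Only the step size differs. The function names and data are made up for this sketch.

```python
def running_mean(xs):
    """Sample mean via updates with a decaying step size 1/n."""
    mean = 0.0
    for n, x in enumerate(xs, start=1):
        mean += (1.0 / n) * (x - mean)
    return mean


def exponential_moving_average(xs, alpha=0.1):
    """Same update with a constant step size; recent samples weigh more."""
    mean = 0.0
    for x in xs:
        mean += alpha * (x - mean)
    return mean


data = [1.0] * 50 + [5.0] * 50   # the underlying mean shifts halfway through
print(running_mean(data))
print(exponential_moving_average(data))
```

The plain mean averages over the whole history, while the constant step size tracks the recent level, which is the behavior wanted when rewards are nonstationary.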

If you’re ready to take on a brand new challenge, and learn about AI techniques that you’ve never seen before in traditional supervised machine learning, unsupervised machine learning, or even deep learning, then this course is for you.

See you in class!

"If you can't implement it, you don't understand it"

Or as the great physicist Richard Feynman said: "What I cannot create, I do not understand".
My courses are the ONLY courses where you will learn how to implement machine learning algorithms from scratch.
Other courses will teach you how to plug in your data into a library, but do you really need help with 3 lines of code?
After doing the same thing with 10 datasets, you realize you didn't learn 10 things. You learned 1 thing, and just repeated the same 3 lines of code 10 times...

Suggested Prerequisites:

Calculus
Probability
Object-oriented programming
Python coding: if/else, loops, lists, dicts, sets
Numpy coding: matrix and vector operations
Linear regression
Gradient descent

WHAT ORDER SHOULD I TAKE YOUR COURSES IN?:

Check out the lecture "Machine Learning and AI Prerequisite Roadmap" (available in the FAQ of any of my courses, including the free Numpy course)

UNIQUE FEATURES

Every line of code explained in detail - email me any time if you disagree
No wasted time "typing" on the keyboard like other courses - let's be honest, nobody can really write code worth learning about in just 20 minutes from scratch
Not afraid of university-level math - get important details about algorithms that other courses leave out

Who this course is for:

Anyone who wants to learn about artificial intelligence, data science, machine learning, and deep learning
Both students and professionals

Course content

Welcome • 5 lectures • 34min

Return of the Multi-Armed Bandit • 26 lectures • 2hr 56min

High Level Overview of Reinforcement Learning • 2 lectures • 17min

Markov Decision Processes • 14 lectures • 1hr 59min

Dynamic Programming • 14 lectures • 2hr 4min

Monte Carlo • 8 lectures • 58min

Temporal Difference Learning • 8 lectures • 38min

Approximation Methods • 11 lectures • 1hr 14min

Interlude: Common Beginner Questions • 1 lecture • 7min

Stock Trading Project with Reinforcement Learning • 10 lectures • 1hr 21min
