Math Behind LLMs, Transformers and Modern Computer Vision

Name: Math Behind LLMs, Transformers and Modern Computer Vision
Rating: 4.7 (1163 reviews)

From Multi Head Attention and Embeddings to Transformers, Vision Transformers, Modern Image Segmentation + LLMs and SAM

Created by Patrik Szepesi

Last updated 6/2026

English

What you'll learn

Mathematics Behind Large Language Models
Modern Image Segmentation
Positional Encodings
Compare CNNs and Vision Transformers mathematically
Compute prompt self-attention and image–prompt cross-attention
Multi Head Attention
Query, Value and Key Matrix
Attention Masks
Masked Language Modeling
Dot Products and Vector Alignments
Nature of Sine and Cosine functions in Positional Encodings
How models like ChatGPT work under the hood
Bidirectional Models
Context aware word representations
Vision Transformers
Word Embeddings
How dot products work
Modern Computer Vision
Understand quadratic complexity in Vision Transformers
Matrix multiplication
Programmatically create tokens
Derive self-attention, multi-head attention, and cross-attention from scratch
Analyze the full Vision Transformer pipeline
Break down the mathematics of Meta’s Segment Anything Model (SAM)
Understand prompt encoders in modern segmentation models

Course content

7 sections • 87 lectures • 11h 34m total length

What we are going to Cover in Natural Language Processing • 2:50
What we are going to Cover in Computer Vision • 1:49
(Optional) Mathematics Behind Backpropagation Course • 2:58
Course Resources • 0:03

Introduction to Attention Mechanisms • 3:02
Query, Key, and Value Matrix • 18:20
Getting started with our Step by Step Attention Calculation • 7:09
Calculating Key Vectors • 20:44
Query Matrix Introduction • 10:34
Calculating Raw Attention Scores • 21:59
Understanding the Mathematics behind Dot products and Vector Alignment • 13:56
Visualising Raw Attention Scores in 2 Dimensions • 5:56
Note on Normalization Before Softmax • 0:19
This topic is not covered in detail in this course, but it is useful to know that many deep learning architectures, particularly attention-based ones such as Transformers, scale the attention scores before applying the softmax function.
The raw scores are the dot products of query and key vectors, and the scaling typically divides them by the square root of the key dimension. Keeping their magnitude under control before softmax avoids overly sharp probability distributions and promotes more stable training.
Converting Raw Attention Scores to Probability Distributions with Softmax • 9:32
Understanding the Value Matrix and Value Vector • 9:24
Calculating the Final Context Aware Rich Representation for the word "river" • 10:59
Understanding the Output • 1:57
Understanding Multi Head Attention • 12:13
Multi Head Attention Example, and Subsequent layers • 10:13
Masked Language Modeling • 2:33
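
The note on normalization before softmax in the list above can be made concrete with a small sketch. It assumes the standard scaled dot-product form, in which each query-key score is divided by the square root of the key dimension; the vectors are hypothetical values chosen for illustration and are not taken from the course.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_weights(query, keys, scale=True):
    """Softmax over query-key dot products, optionally divided by sqrt(d_k)."""
    d_k = len(query)
    raw = [sum(q * k for q, k in zip(query, key)) for key in keys]
    if scale:
        raw = [r / math.sqrt(d_k) for r in raw]
    return softmax(raw)

# Hypothetical 4-dimensional query and three key vectors (illustration only)
query = [2.0, 0.0, 2.0, 0.0]
keys = [
    [2.0, 0.0, 2.0, 0.0],
    [1.0, 1.0, 1.0, 1.0],
    [0.0, 2.0, 0.0, 2.0],
]

unscaled = attention_weights(query, keys, scale=False)
scaled = attention_weights(query, keys, scale=True)
print("unscaled:", unscaled)
print("scaled:  ", scaled)
```

With these values the unscaled weights put almost all of the probability on the first key, while the scaled weights are visibly flatter. That flattening is the effect the note describes.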

Introduction to Prompt Encoders for SAM • 5:06
SAM AutoPrompt Mode • 16:00
SAM Manual Click Mode • 8:05
Vision Transformer Embeddings Inside SAM • 5:08
Calculating Attention Score for Vision Transformers in SAM • 17:21
How SAM is Trained • 8:17
Calculating Prompt Self Attention for SAM • 4:17
Prompt Image Cross Attention • 7:44
Image to Prompt Cross Attention • 6:04
Finishing SAM Example Part 1 • 9:14
Finishing SAM Example Part 2 • 6:42
Finishing • 0:04

Introduction to Our Simple Neural Network • 6:52
Why We Use Computational Graphs • 6:18
Conducting the Forward Pass • 6:54
Roadmap to Understanding Backpropagation • 2:46
Derivatives Theory • 4:28
Numerical Example of Derivatives • 13:43
Understanding Partial Derivatives • 8:02
Understanding Gradients • 3:52
Understanding What Partial Derivatives Do (Example) • 10:18
Introduction to Backpropagation • 5:00
Understanding the Chain Rule (Optional) • 7:35
Gradient Derivation of the Mean Squared Error Loss Function • 7:35
Visualizing the Loss Function + Gradients • 11:43
Using the Chain Rule to Calculate the Gradient of w2 • 19:02
Using the Chain Rule to Calculate the Gradient of w1 • 4:38
Visualizing Gradient Descent • 10:14
Introduction to Gradient Descent • 6:14
Understanding the Learning Rate (Alpha) • 8:19
Moving in the Opposite Direction of the Gradient • 5:35
Calculating Gradient Descent by Hand • 8:47
Coding our Simple Neural Network Part 1 • 4:34
Coding our Simple Neural Network Part 2 • 7:20
Coding our Simple Neural Network Part 3 • 6:49
Coding our Simple Neural Network Part 4 • 5:09
Coding our Simple Neural Network Part 5 • 5:33
Introduction to Our Advanced Neural Network • 5:35
Conducting the Forward Pass • 4:32
Getting Started with Backpropagation • 4:54
Getting the Derivative of the Sigmoid Activation Function (Optional) • 7:42
Implementing Backpropagation with the Chain Rule • 4:55
Understanding How w3 Affects the Final Loss • 6:10
Calculating Gradients For Z1 • 7:42
Understanding How w1 & w2 Affect the Loss • 5:06
Implementing Gradient Descent By Hand • 8:34
Coding our Advanced Neural Network Part 1 (Implementing Forward Pass + Loss) • 7:05
Coding our Advanced Neural Network Part 2 (Implement Backpropagation) • 10:41
Coding our Advanced Neural Network Part 3 (Implement Gradient Descent) • 5:51
Coding our Advanced Neural Network Part 4 (Training our Neural Network) • 8:21

Requirements

Basic high school math (linear algebra)

Description

Welcome to Math Behind LLMs, Transformers and Modern Computer Vision, a rigorous deep dive into the mathematical foundations powering today’s most advanced AI systems.

This course is designed for learners who want more than intuition. We derive and analyze the core equations behind Large Language Models, Vision Transformers, and modern image segmentation systems.

You will begin with tokenization and embedding mathematics, understanding how raw text becomes high-dimensional vector representations through algorithms like WordPiece. From there, we mathematically unpack the heart of transformer architectures: query, key, and value matrices, attention score computation, scaling behavior, and multi-head attention.
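
As a rough illustration of the tokenization step mentioned above, the sketch below applies the greedy longest-match-first rule commonly used to split a word into WordPiece tokens at inference time. The vocabulary is a hypothetical toy set, and `wordpiece_tokenize` is an illustrative helper, not a function from the course or from any library.

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    """Split one word into sub-word pieces by greedy longest-match-first lookup."""
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink from the right
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a ## prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no piece matches, so the whole word is unknown
        tokens.append(piece)
        start = end
    return tokens

# Hypothetical toy vocabulary (illustration only)
vocab = {"river", "##bank", "##s", "un", "##break", "##able"}

print(wordpiece_tokenize("riverbanks", vocab))   # ['river', '##bank', '##s']
print(wordpiece_tokenize("unbreakable", vocab))  # ['un', '##break', '##able']
print(wordpiece_tokenize("xyz", vocab))          # ['[UNK]']
```

Each resulting token would then be mapped to an integer id and looked up in an embedding matrix to obtain its vector representation.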

We examine attention masks, contextual encoding, and positional encodings, including the sine and cosine formulations that preserve sequence structure. You’ll build strong geometric intuition around vectors, dot products, cosine similarity, and dense embeddings.
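
The sine and cosine formulation can be sketched as follows, assuming the widely used sinusoidal scheme in which even dimensions use sine and odd dimensions use cosine, with frequencies 1 / 10000^(2k / d_model). Whether the course uses exactly these constants is an assumption; `positional_encoding` and `dot` are illustrative helper names.

```python
import math

def positional_encoding(position, d_model, base=10000.0):
    """Return the d_model-dimensional sinusoidal encoding for one position."""
    encoding = []
    for i in range(d_model):
        # Dimensions 2k and 2k+1 share the frequency 1 / base^(2k / d_model)
        exponent = (2 * (i // 2)) / d_model
        angle = position / (base ** exponent)
        encoding.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return encoding

def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

pe0 = positional_encoding(0, 4)
pe1 = positional_encoding(1, 4)
print(pe0)  # [0.0, 1.0, 0.0, 1.0]
print(pe1)

# The dot product of two encodings depends only on the offset between positions
near = dot(positional_encoding(3, 8), positional_encoding(5, 8))
far = dot(positional_encoding(10, 8), positional_encoding(12, 8))
print(near, far)
```

The last two values agree up to floating-point error because each sine/cosine pair contributes cos(ω(p − q)) to the dot product, which is one way to see how these encodings carry relative position.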

The course then expands beyond language.

You’ll compare Convolutional Neural Networks with Vision Transformers, analyze quadratic attention operations, and walk through the complete Vision Transformer pipeline from patch embeddings to final predictions.
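
To make the quadratic cost concrete, the sketch below counts patches and attention-score entries for a square image. The 224 and 448 pixel sizes and the 16-pixel patch size are common illustrative values rather than figures from the course, and any extra class token is ignored.

```python
def num_patches(height, width, patch_size):
    """Number of non-overlapping patch_size x patch_size patches in an image."""
    if height % patch_size or width % patch_size:
        raise ValueError("image sides must be divisible by patch_size")
    return (height // patch_size) * (width // patch_size)

def attention_score_entries(n_tokens):
    """Self-attention compares every token with every token: n x n scores."""
    return n_tokens * n_tokens

small = num_patches(224, 224, 16)
large = num_patches(448, 448, 16)

print(small, attention_score_entries(small))  # 196 38416
print(large, attention_score_entries(large))  # 784 614656
print(attention_score_entries(large) // attention_score_entries(small))  # 16
```

Doubling the image side quadruples the number of patches, so the score matrix per head and per layer grows by a factor of 16.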

In an advanced section, we dissect the mathematics behind Meta’s Segment Anything Model (SAM). You will explore prompt encoders, self-attention, cross-attention between prompts and images, attention score computation in segmentation models, and how these systems are trained at scale.
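
Cross-attention between prompts and images differs from self-attention only in where the queries and the keys/values come from. The sketch below is a toy version with hypothetical 2-dimensional embeddings; the learned query, key, and value projections are omitted for brevity, and it is not SAM's actual implementation.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cross_attention(queries, keys, values):
    """Each query attends over all keys and returns a weighted sum of the values."""
    d_k = len(keys[0])
    outputs = []
    for q in queries:
        weights = softmax([dot(q, k) / math.sqrt(d_k) for k in keys])
        out = [
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ]
        outputs.append(out)
    return outputs

# Hypothetical toy embeddings (illustration only)
prompt_tokens = [[1.0, 0.0], [0.0, 1.0]]              # 2 prompt tokens
image_tokens = [[4.0, 0.0], [0.0, 4.0], [2.0, 2.0]]   # 3 image patch tokens

# Prompt-to-image: prompt tokens are the queries, image tokens supply keys and values
prompt_to_image = cross_attention(prompt_tokens, image_tokens, image_tokens)
# Image-to-prompt: the roles are swapped
image_to_prompt = cross_attention(image_tokens, prompt_tokens, prompt_tokens)

print(prompt_to_image)
print(image_to_prompt)
```

The output always has one vector per query token: two vectors in the prompt-to-image direction and three in the image-to-prompt direction.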

By the end of this course, you won’t just understand how transformers work; you will understand why they work at the equation level across language and vision.

If you aim to build deep technical mastery and develop the mathematical intuition required for cutting-edge AI research and engineering, this course will elevate your expertise.

Who this course is for:

Ambitious learners aiming to reach the upper echelon of the programming world: this content is designed for those who aspire to be within the top 1% of data scientists and machine learning engineers. It is particularly geared towards individuals who want a deep understanding of transformers, the technology behind large language models. The course aims to equip you with the foundational knowledge and technical skills needed to develop and implement cutting-edge AI applications.

Course content

Course Overview • 4 lectures • 8min

Tokenization and Multidimensional Word Embeddings • 5 lectures • 45min

Positional Encodings • 7 lectures • 1hr 4min

Attention Mechanism and Transformer Architecture • 16 lectures • 2hr 39min

Mathematics Behind Vision Transformers • 5 lectures • 50min

Mathematics Behind Meta's SAM (Segment Anything Model) • 12 lectures • 1hr 34min

(Optional) Math Behind Backpropagation - The algorithm behind all Machine Learning • 38 lectures • 4hr 34min
