
While this topic will not be covered in detail in this course, it's useful to know that many deep learning architectures, particularly those built on attention mechanisms such as Transformers, rescale their scores before applying the softmax function.
In Transformers, this means dividing the attention scores, which are the dot products of query and key vectors, by the square root of the key dimension. Controlling the magnitude of these scores before softmax keeps the resulting probability distribution from becoming overly sharp and promotes more stable training.
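As a rough illustration of the scaling described above, the sketch below compares softmax weights with and without dividing the scores by the square root of the key dimension. The helper names (softmax, attention_weights) and the toy vectors are made up for this example, and division by sqrt(d_k) is assumed as the scaling, as in standard scaled dot-product attention.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_weights(query, keys, scale=True):
    """Softmax over query-key dot products, optionally divided by sqrt(d_k)."""
    d_k = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    if scale:
        # Assumed scaling: divide each score by the square root of d_k.
        scores = [s / math.sqrt(d_k) for s in scores]
    return softmax(scores)


# Toy data (hypothetical): one query and three keys with d_k = 16.
query = [1.0] * 16
keys = [[1.0] * 16, [0.5] * 16, [0.0] * 16]

unscaled = attention_weights(query, keys, scale=False)
scaled = attention_weights(query, keys, scale=True)

print("unscaled:", [round(w, 4) for w in unscaled])
print("scaled:  ", [round(w, 4) for w in scaled])
```

With these toy vectors the raw scores are 16, 8, and 0, so the unscaled softmax puts nearly all of its weight (roughly 0.9997) on the first key. After scaling, the scores become 4, 2, and 0, and the largest weight drops to roughly 0.87, a visibly softer distribution.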
Welcome to Math Behind LLMs, Transformers and Modern Computer Vision, a rigorous deep dive into the mathematical foundations powering today’s most advanced AI systems.
This course is designed for learners who want more than intuition. We derive and analyze the core equations behind Large Language Models, Vision Transformers, and modern image segmentation systems.
You will begin with tokenization and embedding mathematics, understanding how algorithms like WordPiece split raw text into tokens and how those tokens become high-dimensional vector representations. From there, we mathematically unpack the heart of transformer architectures: query, key, and value matrices, attention score computation, scaling behavior, and multi-head attention.
We examine attention masks, contextual encoding, and positional encodings — including the sine and cosine formulations that preserve sequence structure. You’ll build strong geometric intuition around vectors, dot products, cosine similarity, and dense embeddings.
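To give a flavor of the sine and cosine formulation mentioned above, here is a minimal sketch of a sinusoidal positional encoding. It assumes the common convention in which even dimensions use sine, odd dimensions use cosine, and the frequency base is 10000; the function name positional_encoding and its parameters are illustrative, not taken from the course material.

```python
import math


def positional_encoding(num_positions, d_model, base=10000.0):
    """Return a num_positions x d_model table of sinusoidal encodings."""
    table = []
    for pos in range(num_positions):
        row = []
        for i in range(d_model):
            # Dimensions 2k and 2k+1 share one frequency (assumed convention).
            angle = pos / (base ** ((2 * (i // 2)) / d_model))
            # Even index: sine. Odd index: cosine.
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        table.append(row)
    return table


pe = positional_encoding(num_positions=4, d_model=8)
for pos, row in enumerate(pe):
    print(pos, [round(v, 3) for v in row])
```

Position 0 encodes as alternating 0 and 1 (sin 0 and cos 0), and each later position rotates every sine/cosine pair by a frequency-dependent angle, which is what lets the encoding distinguish positions in a sequence.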
The course then expands beyond language.
You’ll compare Convolutional Neural Networks with Vision Transformers, analyze why the cost of attention grows quadratically with the number of tokens, and walk through the complete Vision Transformer pipeline from patch embeddings to final predictions.
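The quadratic behavior of attention in a Vision Transformer can be sketched with simple arithmetic. The example below assumes a square, non-overlapping patch grid and counts one attention score per pair of patches within a single head; it ignores any extra tokens such as a class token. The function names and the 224-pixel, 16-pixel-patch setting are illustrative assumptions.

```python
def num_patches(height, width, patch_size):
    """Number of non-overlapping square patches in an image."""
    if height % patch_size or width % patch_size:
        raise ValueError("image size must be divisible by patch_size")
    return (height // patch_size) * (width // patch_size)


def attention_score_count(num_tokens):
    """Pairwise scores in one self-attention head: one per (query, key) pair."""
    return num_tokens * num_tokens


small = num_patches(224, 224, 16)   # 14 x 14 grid
large = num_patches(448, 448, 16)   # 28 x 28 grid

print(small, attention_score_count(small))
print(large, attention_score_count(large))
```

Under these assumptions a 224 x 224 image yields 196 patches and 38,416 pairwise scores, while doubling the side length to 448 yields 784 patches and 614,656 scores: four times the tokens, sixteen times the attention scores.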
In an advanced section, we dissect the mathematics behind Meta’s Segment Anything Model (SAM). You will explore prompt encoders, self-attention, cross-attention between prompts and images, attention score computation in segmentation models, and how these systems are trained at scale.
By the end of this course, you won’t just understand how transformers work — you will understand why they work at the equation level across language and vision.
If you aim to build deep technical mastery and develop the mathematical intuition required for cutting-edge AI research and engineering, this course will elevate your expertise.