A deep understanding of AI large language model mechanisms

Rating: 4.8 (1245 reviews)

Build and train LLM NLP transformers and attention mechanisms (PyTorch). Explore with mechanistic interpretability tools

Bestseller

Highest Rated

Created by Mike X Cohen

Last updated 6/2026

English

Bulgarian [Auto], Danish [Auto],

What you'll learn

Large language model (LLM) architectures, including GPT (OpenAI) and BERT
Transformer blocks
Attention algorithm
PyTorch
LLM pretraining
Explainable AI
Mechanistic interpretability
Machine learning
Deep learning
Principal components analysis
High-dimensional clustering
Dimension reduction
Advanced cosine similarity applications

Course content

40 sections • 329 lectures • 91h 3m total length

Introductions • 5 lectures • 44min

[IMPORTANT] Prerequisites and how to succeed in this course • 11:28
Covers the background knowledge you need versus what is merely helpful, and offers guidance on learning goals, handwritten note-taking, and flexible engagement with the course.
Using the Udemy platform • 7:57
Master Udemy navigation, playback controls, speed settings, and note-taking to maximize learning, using captions, quality settings, and the Q&A forum to engage with course content.
Getting the course code, and the detailed overview • 7:06
Access and download all course code files from the GitHub repository and Udemy resources, learn the folder structure, file naming, and the mapping spreadsheet.
Do you need a Colab Pro subscription? • 8:06
Determine if you need a Colab Pro subscription for this course and how free Colab GPUs meet most needs. Understand CPU versus GPU trade-offs.
About the "CodeChallenge" videos • 9:09
Explains what the CodeChallenge videos are and how to learn from them: pause the video, code the analyses and visualizations yourself in Python, and use the helper files and solutions to reinforce LLM concepts.

Words to tokens to numbers • 22 lectures • 5hr 11min

Why text needs to be numbered • 24:37
Discover how text becomes numbers through tokenization and embeddings, and how tokens, embeddings, subwords, and context windows drive modern language models.
Parsing text to numbered tokens • 13:35
Explore building a vocabulary for tokenization and the role of token ids. Implement encoder and decoder mappings with Python dictionaries to tokenize text and recover words.
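
The encoder and decoder mappings described above can be sketched in plain Python. This is a minimal illustration, not the course's own code: the function names, the regular expression, and the sample sentence are all assumptions made for the example.

```python
import re

def build_vocab(text):
    """Map each unique lowercase word to an integer id (sorted for determinism)."""
    words = sorted(set(re.findall(r"[a-z']+", text.lower())))
    return {word: idx for idx, word in enumerate(words)}

def encode(text, word2idx):
    """Convert text into a list of token ids using the vocabulary."""
    return [word2idx[w] for w in re.findall(r"[a-z']+", text.lower())]

def decode(ids, idx2word):
    """Convert a list of token ids back into a space-separated string."""
    return " ".join(idx2word[i] for i in ids)

text = "the cat sat on the mat"
word2idx = build_vocab(text)                     # encoder dictionary
idx2word = {i: w for w, i in word2idx.items()}   # decoder dictionary

ids = encode("the mat sat", word2idx)
print(ids, "->", decode(ids, idx2word))
```

Words missing from the vocabulary raise a `KeyError` here; a later lecture in this section handles them with an unknown-word token.
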
CodeChallenge: Create and visualize tokens (part 1) • 16:41
Explore code challenges that teach encoder and decoder functions, one-hot encoding, and tokenization by converting text to token indices and back, with vocabulary, context, and visualization.
CodeChallenge: Create and visualize tokens (part 2) • 9:01
Explore one-hot encoding and tokenization by building a sparse token matrix from a sentence and its vocabulary, then transpose and visualize the encoding as an image.
Preparing text for tokenization • 16:59
Import text from the web into a Python session, clean it with regular expressions, and parse it into words to build a word-based vocabulary for tokenization.
CodeChallenge: Tokenizing The Time Machine • 23:35
Learn tokenization skills by building vocabularies and encoder–decoder functions, handling unknown words with an unk token, and exploring Brownian noise and token length distributions.
Tokenizing characters vs. subwords vs. words • 4:52
Compare character, word, and subword tokenization for language models, with subword tokenization, especially byte pair encoding, balancing vocabulary size, context, and semantics.
Byte-pair encoding algorithm • 27:53
Explore how byte-pair encoding builds tokenization vocabularies by merging frequent character pairs into new tokens, implemented in Python and iterated to a target vocab size.
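
The merge loop described above might look roughly like the following. It is a character-level sketch under simplifying assumptions (no pre-splitting on whitespace, ties broken by first occurrence); the function names and the sample text are hypothetical, not taken from the course files.

```python
from collections import Counter

def most_frequent_pair(tokens):
    """Return the most common adjacent token pair, or None if there is none."""
    pairs = Counter(zip(tokens, tokens[1:]))
    return max(pairs, key=pairs.get) if pairs else None

def merge_pair(tokens, pair):
    """Replace every occurrence of `pair` with a single merged token."""
    merged, i = [], 0
    while i < len(tokens):
        if i < len(tokens) - 1 and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged

def bpe(text, target_vocab_size):
    """Start from characters and merge frequent pairs until the vocab is large enough."""
    tokens = list(text)
    vocab = sorted(set(tokens))
    while len(vocab) < target_vocab_size:
        pair = most_frequent_pair(tokens)
        if pair is None:          # nothing left to merge
            break
        tokens = merge_pair(tokens, pair)
        new_token = pair[0] + pair[1]
        if new_token not in vocab:
            vocab.append(new_token)
    return tokens, vocab

text = "low low lower lowest"
tokens, vocab = bpe(text, target_vocab_size=10)
print(vocab)
print(tokens)
```

The text starts with 8 distinct characters, so a target of 10 allows two merges, which is enough for the frequent substring "low" to become a single token.
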
CodeChallenge: Byte-pair encoding to a desired vocab size • 7:24
Tackle a code challenge to implement byte-pair encoding with a loop until a vocab size of 25 tokens, reusing functions from the previous video and exploring subword tokens and spaces.
Exploring ChatGPT4's tokenizer • 26:11
Explore how the GPT-4 tokenizer encodes text into tokens, including spaces and newline characters, and learn to encode and decode tokens in Python.
CodeChallenge: Token count by subword length (part 1) • 22:05
Analyze how tokens map to word length using the GPT-4 tokenizer on real text, splitting by whitespace and punctuation, and measuring token efficiency.
CodeChallenge: Token count by subword length (part 2) • 10:13
Explore how word frequency relates to token length through a tokenizer and GPT-4 encoding, and examine confounds like punctuation that affect word versus token interpretation.
How many "r"s in strawberry? • 5:43
Explore how models tokenize text: "strawberry" becomes three tokens rather than a sequence of individual letters. Decode the tokens in Python to count the r's, and recognize that embedding vectors govern model answers.
CodeChallenge: Create your algorithmic rapper name :) • 6:31
Join a code challenge to generate an algorithmic rapper name by tokenizing your name and color, sorting tokens, and decoding three numbers with a GPT-4 tokenizer in Python.
Tokenization in BERT • 18:32
Explore the BERT tokenizer and how it differs from GPT's, including uncased vs. cased vocabularies, vocab size, special tokens like CLS and SEP, and encode methods.
CodeChallenge: Character counts in BERT tokens • 5:41
Count character occurrences in BERT tokenizer tokens, excluding unused tokens, and visualize the results with a bar graph of digits and letters using Python and numpy with the uncased tokenizer.
Translating between tokenizers • 13:48
Learn how the GPT-4 and BERT tokenizers differ and how to translate between them using text as an intermediate, while analyzing encoding, decoding, special tokens, and whitespace effects on compression.
CodeChallenge: More on token translation • 7:25
Practice translating between BERT and GPT-4 token indices to illuminate tokenization, tokenizers, and how embeddings and LLM mechanisms work, highlighting encoding, decoding, and inverse relationships.
CodeChallenge: Tokenization compression ratios • 15:22
Explore how tokenization compresses text into fewer tokens to extend a model’s context window, and follow hands-on exercises measuring token-to-character compression ratios with the GPT-4 tokenizer.
Tokenization in different languages • 8:25
Explore how tokenization differs across languages, comparing the GPT-4 and BERT tokenizers, and learn why token counts can expand or compress text depending on language and training data.
CodeChallenge: Zipf's law in characters and tokens • 12:34
Explore Zipf's law, the log-log relationship between word frequency and rank, and apply it to characters and token IDs through a Python code challenge using numpy.
Word variations in Claude tokenizer • 13:34
Explore how the Claude tokenizer handles spaces, misspellings, punctuation, and multi-language code, illustrating subword tokenization, token IDs, attention masks, and a 65,000 token vocab.

Embeddings spaces • 18 lectures • 6hr 35min

Word2Vec vs. GloVe vs. GPT vs. BERT... oh my! • 10:28
Explore how embedding spaces and vectors power language models by comparing Word2Vec, GloVe, GPT, and BERT embeddings, and learn how static and contextual matrices are learned and used.
Exploring GloVe pretrained embeddings • 26:24
Explore GloVe pretrained embeddings by inspecting the 400,000-word vocabulary and 50-dimensional vectors, computing cosine similarity to find most similar words and nonmatches using Python.
CodeChallenge: Wikipedia vs. Twitter embeddings (part 1) • 18:35
Explore GloVe embeddings trained on Wikipedia versus Twitter text, visualize 50-dimensional vectors, compare embedding matrices, and preview representational similarity analysis using cosine similarity.
CodeChallenge: Wikipedia vs. Twitter embeddings (part 2) • 14:34
Explore representational similarity analysis to compare Wikipedia and Twitter word embeddings, compute inter-word cosine similarities, and visualize deviations from the line of unity to interpret word relationships.
Exploring GPT2 and BERT embeddings • 22:37
Explore GPT-2 and BERT embeddings by importing entire models, compare word embeddings and token lengths, and analyze embedding distributions to reveal model-specific vector characteristics.
CodeChallenge: Math with tokens and embeddings • 28:59
Explore how tokens and embeddings represent numbers and arithmetic, explain why token-based math is unreliable, and show how embeddings support meaningful arithmetic through unembedding.
Cosine similarity (and relation to correlation) • 26:12
Explore cosine similarity as a practical measure of relationships between vectors, including embeddings, and learn its formula, interpretation, and Python implementations, with a comparison to Pearson correlation.
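
The link between the two measures can be shown in a few lines: Pearson correlation is cosine similarity computed on mean-centered vectors. The sketch below uses plain Python lists and made-up data; the function names are illustrative.

```python
import math

def cosine_similarity(x, y):
    """Dot product divided by the product of the vector norms."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)

def pearson(x, y):
    """Pearson correlation: cosine similarity after subtracting each mean."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    return cosine_similarity([a - mean_x for a in x], [b - mean_y for b in y])

x = [1.0, 2.0, 3.0, 4.0]
y = [3.0, 5.0, 7.0, 9.0]   # y = 2x + 1, a perfect linear relationship

print("cosine :", cosine_similarity(x, y))
print("pearson:", pearson(x, y))
```

Because of the offset in `y`, the cosine similarity is slightly below 1 while the correlation is 1 (up to floating-point rounding): centering removes the offset, which is the only difference between the two measures.
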
CodeChallenge: GPT2 cosine similarities • 21:52
Explore cosine similarities in the GPT-2 embeddings matrix by tokenizing words (including multi-token words) and implementing both loop-based and vectorized calculations, including plotting the results.
CodeChallenge: Unembeddings (vectors to tokens) • 32:57
Explore how embeddings matrices convert token indices to vectors and back to text, and compare trained versus tied unembedding. Observe how real and random embeddings affect generated text coherence.
Position embeddings • 29:33
Explore how position embeddings provide temporal weighting to link current tokens with earlier context, using learned or sine-cosine predefined vectors, and examine cosine similarity matrices.
CodeChallenge: Exploring position embeddings • 22:43
Explore position embeddings through cosine similarity analyses, extract unique upper-triangle elements, and compare real versus shuffled embeddings to reveal structure and identify high-similarity vector pairs using numpy tricks.
Training embeddings from scratch • 3:11
Learn how embeddings matrices are trained from data using gradient descent, and how to initialize, train, and use an embeddings matrix inside a non-language model for educational purposes.
Create a data loader to train a model • 25:41
Create a PyTorch data loader for language model training by formatting text into input-target token sequences, with context length, stride, and batch size in a dataset class.
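
The input-target formatting can be illustrated without PyTorch. In the sketch below, each target sequence is the input sequence shifted one token to the right; `make_sequences` is a hypothetical helper, whereas the lecture wraps the same idea in a dataset class with batching.

```python
def make_sequences(token_ids, context_length, stride):
    """Slide a window over the tokens; targets are inputs shifted by one position."""
    inputs, targets = [], []
    for start in range(0, len(token_ids) - context_length, stride):
        inputs.append(token_ids[start:start + context_length])
        targets.append(token_ids[start + 1:start + context_length + 1])
    return inputs, targets

tokens = list(range(10))   # stand-in for real token ids
X, Y = make_sequences(tokens, context_length=4, stride=2)

for x, y in zip(X, Y):
    print(x, "->", y)
```

A stride smaller than the context length produces overlapping samples; a stride equal to the context length produces non-overlapping ones.
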
Build a model to learn the embeddings • 30:35
Build a helper model to train the embeddings matrix that maps tokens to dense vectors. Understand how sequence length and dimensionality shape these embeddings.
Loss function to train the embeddings • 29:18
Explore loss functions in training embeddings and language models, from mean squared error to negative log likelihood, and learn how log softmax and cross entropy drive gradient descent in PyTorch.
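
The relationship among these pieces is that cross entropy equals negative log likelihood applied to log-softmax outputs. A minimal single-example sketch in plain Python follows (the names are illustrative; PyTorch operates on batched tensors):

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax using the log-sum-exp trick."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - log_sum_exp for z in logits]

def nll_loss(log_probs, target):
    """Negative log likelihood of the target class."""
    return -log_probs[target]

def cross_entropy(logits, target):
    """Cross entropy on raw logits: log-softmax followed by NLL."""
    return nll_loss(log_softmax(logits), target)

logits = [2.0, 0.5, -1.0]
loss_likely = cross_entropy(logits, target=0)     # target has the largest logit
loss_unlikely = cross_entropy(logits, target=2)   # target has the smallest logit
print(loss_likely, loss_unlikely)
```

With uniform logits over four classes the loss equals log(4), the loss of a model that guesses at random.
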
Train and evaluate the model • 15:01
Train an embeddings model on the GPU with a batch size of 32 across epochs to minimize loss. Extract pre- and post-trained embeddings and analyze cosine similarity for upcoming code challenges.
CodeChallenge: How the embeddings change • 18:29
Explore how embeddings evolve with training by comparing pre- and post-training distributions via histograms. Analyze the "time" and "machine" embeddings and their cosine similarity, with direct visualizations of embeddings matrices.
CodeChallenge: How stable are embeddings? • 17:40
Train ten reinitialized embeddings for 16 epochs on identical data to test stability, compare losses and vectors, and analyze cosine similarities among time, machine, and she to reveal embedding relationships.

Build a GPT • 29 lectures • 8hr 24min

Why build when you can download? • 4:19
Building a model from scratch for real-world use is rarely worthwhile given the cost and complexity; leverage free pre-trained models instead, while still learning transformer architectures and the attention mechanism.
Model 1: Embedding (input) and unembedding (output) • 30:16
Build a small from-scratch language model, from embeddings to unembedding, matching GPT-2 small. Learn text generation using logits, softmax probabilities, and probabilistic token sampling.
Understanding nn.Embedding and nn.Linear • 12:45
Learn how nn.Embedding and nn.Linear act as wrappers around fundamental PyTorch weights, showing their shared under-the-hood structure, with differences in sizing, indexing, and initialization.
CodeChallenge: GELU vs. ReLU • 26:50
Compare the GELU and ReLU activation functions, exploring their non-linearities and exact versus approximate formulas, and their impact on gradient descent in deep learning.
Softmax (and temperature): math, numpy, and pytorch • 21:42
Explore softmax, its temperature parameter, and log-softmax with NumPy and PyTorch demos, and learn how these transformations create probability distributions for AI large language models.
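
As a rough illustration of the temperature parameter, the sketch below divides the logits by the temperature before exponentiating. It uses only the standard library; the example logits are made up.

```python
import math

def softmax(logits, temperature=1.0):
    """Softmax with temperature; subtracting the max keeps exp() from overflowing."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]
base = softmax(logits)                      # temperature = 1
sharp = softmax(logits, temperature=0.5)    # low temperature: more peaked
flat = softmax(logits, temperature=2.0)     # high temperature: closer to uniform

for name, probs in [("T=0.5", sharp), ("T=1.0", base), ("T=2.0", flat)]:
    print(name, [round(p, 3) for p in probs])
```

Lower temperatures concentrate probability on the largest logit, which makes sampled text more deterministic; higher temperatures flatten the distribution.
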
Randomly sampling words with torch.multinomial • 17:18
Explore how torch.multinomial samples from softmax probabilities, compare with NumPy's random choice, handle PyTorch errors, and understand replacement vs non-replacement in language model token selection.
Other token sampling methods: greedy, top-k, and top-p • 5:35
Explore greedy, top-k, and top-p token sampling methods and their trade-offs from deterministic output to probabilistic choices, clarifying when randomness enhances or hinders accuracy in various tasks.
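
The three strategies differ only in which tokens remain eligible before sampling. The following is a plain-Python sketch with hypothetical helper names and a made-up four-token distribution; real implementations operate on tensors of logits.

```python
import random

def greedy(probs):
    """Deterministic: always pick the most probable token."""
    return max(range(len(probs)), key=probs.__getitem__)

def top_k_filter(probs, k):
    """Keep the k most probable tokens and renormalize."""
    keep = sorted(range(len(probs)), key=probs.__getitem__, reverse=True)[:k]
    total = sum(probs[i] for i in keep)
    return {i: probs[i] / total for i in keep}

def top_p_filter(probs, p):
    """Keep the smallest set of top tokens whose cumulative probability reaches p."""
    order = sorted(range(len(probs)), key=probs.__getitem__, reverse=True)
    keep, cumulative = [], 0.0
    for i in order:
        keep.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    total = sum(probs[i] for i in keep)
    return {i: probs[i] / total for i in keep}

def sample(dist, rng=random):
    """Draw one token id from a {token_id: probability} dictionary."""
    ids = list(dist)
    return rng.choices(ids, weights=[dist[i] for i in ids], k=1)[0]

probs = [0.5, 0.3, 0.15, 0.05]
print("greedy:", greedy(probs))
print("top-k :", top_k_filter(probs, k=2))
print("top-p :", top_p_filter(probs, p=0.9))
print("sample:", sample(top_k_filter(probs, k=2)))
```
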
CodeChallenge: More softmax explorations • 25:13
Demonstrate how softmax sharpens probability distributions, explore iterative softmax and the role of data range and temperature, and explain why normalization prevents vanishing or exploding values.
What, why, when, and how to layernorm • 18:36
Explore how layer norm stabilizes deep learning by standardizing data to zero mean and unit variance, then scale with gamma and shift with beta, applied across columns or entire matrices.
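
A minimal sketch of that computation for a single vector: standardize, then scale by gamma and shift by beta. As a simplification, gamma and beta are scalars here, whereas a PyTorch layer learns one value per feature; the `eps` default is an assumption chosen for illustration.

```python
import math

def layer_norm(x, gamma=1.0, beta=0.0, eps=1e-5):
    """Standardize x to zero mean and unit variance, then scale and shift."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)   # population variance
    return [gamma * (v - mean) / math.sqrt(var + eps) + beta for v in x]

x = [1.0, 2.0, 3.0, 4.0]
y = layer_norm(x)
y_mean = sum(y) / len(y)
y_var = sum((v - y_mean) ** 2 for v in y) / len(y)
print("normalized:", [round(v, 3) for v in y])
print("mean:", round(y_mean, 6), "variance:", round(y_var, 6))

shifted = layer_norm(x, gamma=2.0, beta=1.0)   # mean moves to beta, spread scales with gamma
```

The small `eps` term keeps the division stable when the variance is close to zero, which is why the output variance is very slightly below 1.
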
Model 2: Position embedding, layernorm, tied output, temperature • 26:27
Explore how position embeddings are added to token embeddings in the GPT-2 model. Learn how layernorm and tied output reduce parameters, and how temperature controls generation in model two.
Temporal causality via linear algebra (theory) • 13:46
Explore causal attention and time-based masks, softmax dynamics, and the GPT-style decoder versus BERT encoder distinction for training and generation.
Averaging the past while ignoring the future (code) • 17:41
Explore causal masking in token prediction by implementing a lower-triangular softmax mask to ignore future tokens, and see how past activations influence future predictions through matrix operations.
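
One way to picture the lower-triangular mask: set every future position to negative infinity before the softmax so that it receives zero weight. In the sketch below the scores are all zero, so each row simply averages the current and past tokens. The helper names and the toy activations are assumptions for illustration.

```python
import math

def causal_softmax(scores):
    """Row-wise softmax in which position t may only attend to positions <= t."""
    T = len(scores)
    weights = []
    for t in range(T):
        row = [scores[t][s] if s <= t else float("-inf") for s in range(T)]
        m = max(row)
        exps = [math.exp(v - m) for v in row]   # exp(-inf) evaluates to 0.0
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights

def matmul(A, B):
    """Plain-Python matrix product for small nested lists."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

T = 3
W = causal_softmax([[0.0] * T for _ in range(T)])   # uniform weights over the past
X = [[1.0, 0.0],    # token 0 activations
     [3.0, 2.0],    # token 1
     [5.0, 4.0]]    # token 2
out = matmul(W, X)  # each row is the running average of rows 0..t of X

for w_row, o_row in zip(W, out):
    print([round(w, 3) for w in w_row], "->", [round(o, 3) for o in o_row])
```
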
The "attention" algorithm (theory) • 10:39
Explore the attention mechanism in transformers, including query, key, value, and softmax scoring. See how embeddings are adjusted in single-head attention, with the causal mask and scaling.
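
A compact sketch of single-head scaled dot-product attention with a causal mask is given below. For brevity Q, K, and V are supplied directly as small nested lists; in a transformer they are learned linear projections of the token embeddings. The function name and toy values are illustrative.

```python
import math

def attention(Q, K, V, causal=True):
    """softmax(Q K^T / sqrt(d)) V, optionally masking out future positions."""
    d = len(Q[0])
    output = []
    for t, q in enumerate(Q):
        # similarity of this query to every key, scaled by sqrt(d)
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in K]
        if causal:
            scores = [s if j <= t else float("-inf") for j, s in enumerate(scores)]
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        # weighted sum of the value vectors
        output.append([sum(w * v[c] for w, v in zip(weights, V))
                       for c in range(len(V[0]))])
    return output

Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[10.0, 0.0], [0.0, 10.0]]
out = attention(Q, K, V)
print(out)
```

The first token can only attend to itself, so its output equals its own value vector; the second token mixes both value vectors, weighted toward the key that best matches its query.
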
CodeChallenge: Code Attention manually and in PyTorch • 21:55
Implement the attention mechanism in PyTorch, including q, k, v and a pass from tokens to embeddings. Compare attention with PyTorch scaled dot-product attention and measure CPU versus GPU performance.
Model 3: One attention head • 22:52
Explore how a single attention head updates token and position embeddings within a transformer block, combining attention and an MLP to refine language model representations.
The Transformer block (theory) • 8:35
Explore how a transformer block uses attention and an MLP sublayer with residual connections and layer normalization. Learn how dimensionality expansion and contraction with a non-linear activation enable richer representations.
The Transformer block (code) • 16:34
Translate the transformer block into code by building modular classes for attention and the transformer, illustrating forward passes, layernorm, residual connections, and the MLP expansion to model language.
Model 4: Multiple Transformer blocks • 20:28
Expand to 12 transformer blocks in a GPT-style decoder by using a flexible sequential arrangement, reusing a single transformer block with multi-head attention, layer normalization, and feedforward sublayers.
Multihead attention: theory and implementation • 23:58
Explore multi-head attention by slicing Q/K/V into equal heads processed in parallel, and learn the math and implementation, including the per-head dimensionality d/h, scaling by its square root, and the final W0 mixing.
Working on the GPU • 21:00
Demonstrates why shifting from CPU to GPU speeds up matrix multiplications in deep learning, shows how to move data and models with PyTorch in Colab, and discusses transfer overhead.
Model 5: Complete GPT2 on the GPU • 18:25
Build and run a complete GPT-2 style language model on the GPU, using a fused query–key–value weight matrix for efficient attention and transformer blocks to reveal trainable parameters.
CodeChallenge: Time model5 on CPU and GPU • 12:15
Compare model five initialization, forward pass, and backprop times on CPU and GPU, revealing GPU speedups and discussing higher model demands and GPU safety regulation implications.
Inspecting OpenAI's GPT2 • 12:07
Explore a pre-trained GPT-2 small model from Hugging Face, using its tokenizer and AutoModelForCausalLM to inspect embeddings, transformer blocks, attention, and text generation in Python.
Summarizing GPT using equations • 18:32
Examine how token embeddings transform through the transformer's attention heads and the MLP block, using equations to map inputs to logits via embedding and projection matrices.
Visualizing nano-GPT • 6:17
Explore the nano-GPT architecture, from token embeddings and position embeddings through three transformer blocks, attention with a causal mask, and MLP blocks, to logits for next-token prediction.
CodeChallenge: How many parameters? (part 1) • 19:14
Count trainable parameters across GPT-2 variants by importing models from Hugging Face, organizing them in a dict, and looping through weights and embeddings to compare totals.
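
The totals can also be cross-checked by hand from the architecture. The sketch below assumes the standard GPT-2 layout (fused query-key-value projection, 4x MLP expansion, biases on all linear layers, and an output layer tied to the token embedding); the function name is hypothetical, and the challenge itself counts parameters from the downloaded models rather than from a formula.

```python
def gpt2_param_count(vocab=50257, n_ctx=1024, d_model=768, n_layers=12):
    """Count trainable parameters of a GPT-2 style model with tied output weights."""
    token_emb = vocab * d_model
    pos_emb = n_ctx * d_model
    layernorm = 2 * d_model                                   # gamma and beta
    attn = (d_model * 3 * d_model + 3 * d_model) \
         + (d_model * d_model + d_model)                      # fused QKV + output projection
    mlp = (d_model * 4 * d_model + 4 * d_model) \
        + (4 * d_model * d_model + d_model)                   # expansion + contraction
    block = 2 * layernorm + attn + mlp
    return token_emb + pos_emb + n_layers * block + layernorm  # plus final layernorm

small = gpt2_param_count()                               # GPT-2 small
medium = gpt2_param_count(d_model=1024, n_layers=24)     # GPT-2 medium
print(f"small : {small:,}")
print(f"medium: {medium:,}")
```
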
CodeChallenge: How many parameters? (part 2) • 10:18
Continue the code challenge by comparing attention and MLP parameters in GPT-2 models, visualizing a bar plot of the percentage of total weights, and exploring layernorm contributions toward mechanistic interpretability.
CodeChallenge: GPT2 trained weights distributions • 17:55
Investigate weight distributions in GPT-2 small pre-trained model by extracting token and position embeddings, creating histograms, and comparing attention and MLP weight ranges across transformer blocks using density plots.
CodeChallenge: Do we really need Q? • 22:03
Explore causal mechanistic interpretability by replacing the Q matrix with noise in a GPT-2 code challenge, comparing CPU and GPU behavior, and examining targeted, selective manipulations of transformer weights.

Pretrain LLMs • 20 lectures • 5hr 12min

What is "pretraining" and is it necessary? • 12:32
Learn how large language models transition from random weights to a pre-trained base model by learning textual statistics in unsupervised pre-training, and distinguish it from fine-tuning and instruction tuning.
Introducing huggingface.co • 6:29
Explore Hugging Face’s open source NLP resources, including models and datasets, and learn how to access free resources via Python. Discover their YouTube channel for tutorials and practical AI development.
The AdamW optimizer • 12:10
Explore how AdamW decouples weight decay from the gradient-based update, shrinking the weights directly alongside the Adam step instead of folding an L2 penalty into the gradient, to improve generalization in large models.
CodeChallenge: SGD vs. Adam vs. AdamW • 26:31
Compare SGD, Adam, and AdamW in a PyTorch mini-training loop that tunes a single weight toward pi, highlighting gradient accumulation and the impact of zeroing gradients.
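
The single-weight experiment can be mimicked without PyTorch by writing the update rules out by hand. The sketch below assumes a squared-error loss (w - pi)^2 and hand-picked hyperparameters; it is an illustration of the update equations, not the course's training loop.

```python
import math

TARGET = math.pi

def grad(w):
    """Gradient of the loss (w - TARGET)**2 with respect to w."""
    return 2.0 * (w - TARGET)

def train_sgd(steps=200, lr=0.1):
    w = 0.0
    for _ in range(steps):
        w -= lr * grad(w)
    return w

def train_adamw(steps=2000, lr=0.01, beta1=0.9, beta2=0.999,
                eps=1e-8, weight_decay=0.0):
    w, m, v = 0.0, 0.0, 0.0
    for t in range(1, steps + 1):
        g = grad(w)
        w -= lr * weight_decay * w            # decoupled weight decay: shrink w directly
        m = beta1 * m + (1 - beta1) * g       # running mean of gradients
        v = beta2 * v + (1 - beta2) * g * g   # running mean of squared gradients
        m_hat = m / (1 - beta1 ** t)          # bias correction
        v_hat = v / (1 - beta2 ** t)
        w -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return w

w_sgd = train_sgd()
w_adamw = train_adamw()
print("SGD  :", w_sgd)
print("AdamW:", w_adamw)
```

With `weight_decay=0.0` the AdamW update reduces to plain Adam; a nonzero value shrinks the weight directly at each step rather than through the gradient.
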
Train model 1 • 22:11
Train model 1, a tiny language model built from embeddings and a non-linearity, using negative log likelihood in PyTorch for 25 epochs to illustrate loss and text generation.
CodeChallenge: Add a test set • 20:24
Add a test set and train/test split to assess a language model's generalization, using a 90/10 split on the Time Machine book with 21,000 training sequences and 2,500 test sequences.
CodeChallenge: Train model 1 with GPT2's embeddings • 10:15
Copy GPT-2 embeddings (50,257 by 768) into a smaller model, freeze those weights during training, and compare frozen versus trainable embeddings for practical insights.
CodeChallenge: Train model 5 with modifications • 15:41
Train model five with modified sampling to pre-train on Gulliver's Travels using the GPT-2 small tokenizer, compare GPU versus CPU runtimes, and evaluate losses after 500 samples.
Create a custom loss function • 13:33
Learn to design and implement custom loss functions in PyTorch, compare L1 and L2 losses, and understand how loss guides parameter learning and model training.
CodeChallenge: Train a model to like "X" • 23:56
Train a simple language model on a GPU to prefer tokens containing the letter X, using a KL divergence loss, and examine the ethical implications for AI safety.
CodeChallenge: Numerical scaling issues in DL models • 22:12
Explore how softmax sensitivity to data scale affects attention, motivating the division of the QKᵀ matrix by the square root of the dimensionality, and examine GPT-2 layernorm parameters for stability and token selection.
Weight initializations • 10:30
Discover how weight initialization aids training stability in very large models by keeping weights small, with three methods: Kaiming, Xavier, and a code-based normal distribution example.
CodeChallenge: Train model 5 with weight inits • 23:15
Apply a language model init method across all modules, initialize nonlinear layers with a normal distribution and zero biases, use Xavier for embeddings, and track attention weight distributions during training.
Dropout in theory and in PyTorch • 21:19
Explore dropout as a regularization technique in deep learning and PyTorch, including class and function implementations and the training versus evaluation effects on activation and generalization.
Should you output logits or log-softmax(logits)? • 4:03
Explore when to output raw logits versus log-softmax in language models. Learn how temperature affects generation, training, and classification tasks.
The FineWeb dataset • 6:21
Explore the FineWeb dataset from Hugging Face, a vast deduplicated corpus of 15 trillion tokens for training large language models, and learn to import and tokenize portions in Python.
CodeChallenge: Fine dropout in model 5 (part 1) • 17:21
Learn to implement dropout regularization in a transformer language model during training, covering embedding, attention, and MLP dropout. Explore handling raw logits and temperature in the generation step.
CodeChallenge: Fine dropout in model 5 (part 2) • 13:33
Train model five with dropout on the final token loss. Use log-softmax inputs, switch to eval mode for testing, then return to train mode.
CodeChallenge: What happens to unused tokens? • 25:39
Explore how a language model forms internal representations for tokens across frequency categories. Analyze frequency, never-used tokens, and log-softmax learning dynamics using four exercises and Gulliver's Travels.
Optimization options • 4:32
Explore pre-training concepts and practical strategies to accelerate training of large models. Discover how batch size, precision, GPUs, gradient techniques, and fused attention algorithms reduce training time without sacrificing learning.

Fine-tune pretrained models • 24 lectures • 6hr 14min

What does "fine-tuning" mean? • 4:49
Fine-tuning large language models uses a pre-trained base model and a domain-specific dataset to tailor performance, with small learning rates and selective layer freezing.
Fine-tune a pretrained GPT2 • 22:54
Fine-tune a pretrained GPT-2 on Gulliver's Travels using PyTorch, explore tokenization and frequent tokens, adjust learning rate, monitor loss, and evaluate generation quality and overfitting.
CodeChallenge: Gulliver's learning rates • 14:56
Compare three learning rates for fine-tuning a GPT-2 model on Gulliver's Travels, analyzing train loss and token-based evaluations to balance specialization with preserving base knowledge.
On generating text from pretrained models • 14:21
Explore how tokenizers and pre-trained models like GPT-2 handle padding, end-of-sequence tokens, and attention masks in batched inputs. Learn to parameterize generate with input IDs, max length, top-k, and top-p.
CodeChallenge: Maximize the "X" factor • 23:00
Explore a pre-trained GPT-2 medium model, inspect its 24 transformer blocks and embeddings, and implement a KL-divergence loss by converting raw logits to log probabilities to maximize the selection of tokens containing "X".
Alice in Wonderland and Edgar Allan Poe (with GPT-neo) • 19:53
Fine-tune two GPT-Neo models on Alice in Wonderland and Edgar Allan Poe texts, compare architecture and tokenizer, and assess outputs through a qualitative prompt.
CodeChallenge: Quantify the Alice/Edgar fine-tuning • 20:55
Fine-tune two models to adopt Alice in Wonderland or Edgar Allan Poe style, then quantify and qualitatively evaluate their generated tokens before and after training, including a crossover analysis.
CodeChallenge: A chat between Alice and Edgar • 10:20
Train the Alice and Edgar models, set up two optimizers, and run a back-and-forth chat that expands the token context window with each exchange.
Partial fine-tuning by freezing attention weights • 15:15
Explore how to freeze weights in large language models using PyTorch, selectively freezing transformer blocks to reduce training costs, preserve syntax, and tailor fine-tuning.
CodeChallenge: Fine-tuning and targeted freezing (part 1) • 21:59
Explore precision freezing and targeted fine-tuning of large language models by training two GPT-Neo variants on Moby Dick, and measure training loss, token usage, and compute time.
CodeChallenge: Fine-tuning and targeted freezing (part 2) • 14:03
Explore fine-tuning and targeted freezing in language models through hands-on matplotlib visualizations comparing frozen versus fully trainable models, losses, token sampling from Moby Dick, and computation time.
Parameter-efficient fine-tuning (PEFT) • 9:51
Explore parameter-efficient fine-tuning (PEFT) techniques that freeze most parameters and train only a small set of adapters, prefix modules, or bias terms to adapt large pre-trained models to narrow tasks.
CodeGen for code completion • 13:37
Discover how CodeGen enables code completion using Salesforce's pre-trained base models, from 350 million to 16 billion parameters, and how to download, inspect, and fine-tune them with Hugging Face.
CodeChallenge: Fine-tune CodeGen for calculus • 7:00
Fine-tune the CodeGen model on calculus Python code and assess post-training outputs. Explore learning rate choices, training epochs, tokenizer setup, and qualitative evaluation, plus instruction tuning concepts.
Fine-tuning BERT for classification • 31:18
Fine-tune a pre-trained BERT for classification by adding a two-output head to predict movie reviews as positive or negative; cover IMDb data from Hugging Face and batching.
CodeChallenge: IMDB sentiment analysis using BERT • 16:08
Explore IMDb sentiment analysis with BERT in a code challenge, training a classifier on movie reviews and experimenting with freezing embeddings and attention while training the MLP and final classifier.
Gradient clipping and learning rate scheduler (part 1) • 18:55
Master gradient clipping and learning rate scheduling for training large models. See how gradient norms are clipped to a threshold and how warm-up with cosine or linear schedulers shapes learning.
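
Both ideas are small enough to write out by hand. The sketch below clips a list of gradients to a maximum L2 norm and computes a learning rate with linear warm-up followed by cosine decay. The function names, the step convention, and the hyperparameters are assumptions for illustration; library implementations differ in details such as small epsilon terms.

```python
import math

def clip_by_norm(grads, max_norm):
    """Rescale gradients so their overall L2 norm does not exceed max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm:
        return list(grads)
    scale = max_norm / norm
    return [g * scale for g in grads]

def warmup_cosine_lr(step, base_lr, warmup_steps, total_steps):
    """Linear warm-up to base_lr, then cosine decay to zero at total_steps."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

clipped = clip_by_norm([3.0, 4.0], max_norm=1.0)     # original norm is 5
untouched = clip_by_norm([0.3, 0.4], max_norm=1.0)   # norm 0.5 is already below the threshold
print("clipped:", clipped)

schedule = [warmup_cosine_lr(s, base_lr=1e-3, warmup_steps=10, total_steps=100)
            for s in range(101)]
print("start:", schedule[0], "peak:", max(schedule), "end:", schedule[-1])
```

Clipping preserves the direction of the gradient and changes only its length, while the warm-up phase avoids large updates before the optimizer's running statistics have settled.
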
Gradient clipping and learning rate scheduler (part 2) • 11:52
Explore how to pair an optimizer with gradient clipping and a learning rate scheduler. The demo covers cosine and linear schedulers, warm-up steps, and training cycles.
CodeChallenge: Clip, freeze, and schedule BERT • 18:11
Explore fine-tuning a pre-trained BERT model for sentiment classification, using gradient clipping and learning rate schedulers, with visualization of gradients, loss, and accuracy.
Saving and loading trained models • 13:45
Learn to save and load models across Hugging Face and PyTorch, using folder-based saves with config and safetensors files, or PyTorch state dictionaries, and how to download or upload them.
BERT decides: Alice or Edgar? • 14:38
Train a BERT classifier to distinguish Alice versus Edgar texts on GPU, compare mean smoothing vs raw loss, and save the model for the next code challenge.
CodeChallenge: Evolution of Alice and Edgar (part 1) • 19:38
Train and evaluate a BERT classifier with two EleutherAI GPT-Neo models in a single Python session, managing memory and token translation between the BERT and GPT-Neo tokenizers for 128-token sequences.
CodeChallenge: Evolution of Alice and Edgar (part 2) • 9:20
Fine-tune the Alice and Edgar models on their texts with 121 training samples and a 1e-5 learning rate, tracking BERT classification accuracy over the course of training.
Why fine-tune when you can use AGI? • 7:07
Fine-tuning your own model can offer greater control, ownership, and task-specific performance for narrow applications, even when state-of-the-art models exist; assess practicality, data access, and on-device speed.

Instruction tuning • 7 lectures • 1hr 35min

What is instruction tuning? • 2:56
Discover how instruction tuning shapes chatbots to interact with users like humans, using next token prediction, negative log likelihood, and gradient descent, while aligning with human scripts and constraints.
Some datasets for instruction tuning • 12:57
Explore three public datasets used for instruction tuning to shape model interactions. Learn how data format, references, and lightweight preprocessing influence model behavior.
Training a chatbot with system-user-assistant • 13:46
Learn how base language models are trained into chatbots using next token prediction and gradient descent, with thousands of curated, human-crafted examples built from system, user, and assistant prompts.
Instruction tuning with GPT2 • 18:18
Fine-tune a GPT-2 model on a question-answer dataset using tokenization, batching, and GPU training to build a chatbot, then examine its authoritative tone and AI safety implications.
CodeChallenge: Instruction tuning GPT2-large (part 1) • 23:10
Apply instruction tuning to GPT-2 large, analyze token distributions in Q&A data, and compare padding strategies with 256-token sequences, attention masks, and start-of-question formatting.
CodeChallenge: Instruction tuning GPT2-large (part 2) • 13:31
Continue the code challenge by fine-tuning GPT-2 large and evaluating results. Explore batch size, learning rate, and GPU RAM management, including intermediate variables and logits, to understand training dynamics.
Reinforcement learning from human feedback (RLHF) • 10:13
Reinforcement learning from human feedback uses human ratings of model outputs, via a separate reward model, to align language models with human values and desires beyond token-based losses.

Requirements

Motivation to learn about large language models and AI
Experience with coding is helpful but not necessary
Familiarity with machine learning is helpful but not necessary
Basic linear algebra is helpful
Deep learning, including gradient descent, is helpful but not necessary

Description

Deep Understanding of Large Language Models (LLMs): Architecture, Training, and Mechanisms

Large Language Models (LLMs) like ChatGPT, GPT-4, GPT-5, Claude, Gemini, and LLaMA are transforming artificial intelligence, natural language processing (NLP), and machine learning. But most courses only teach you how to use LLMs. This 90+ hour intensive course teaches you how they actually work, and how to dissect them using machine-learning and mechanistic interpretability methods.

This is a deep, end-to-end exploration of transformer architectures, self-attention mechanisms, embedding layers, training pipelines, and inference strategies, with hands-on Python and PyTorch code at every step.

Whether your goal is to build your own transformer from scratch, fine-tune existing models, or understand the mathematics and engineering behind state-of-the-art generative AI, this course will give you the foundation and tools you need.

What You’ll Learn

The complete architecture of LLMs — tokenization, embeddings, encoders, decoders, attention heads, feedforward networks, and layer normalization
Mathematics of attention mechanisms — dot-product attention, multi-head attention, positional encoding, causal masking, probabilistic token selection
Training LLMs — optimization (Adam, AdamW), loss functions, gradient accumulation, batch processing, learning-rate schedulers, regularization (L1, L2, decorrelation), gradient clipping
Fine-tuning and prompt engineering for downstream NLP tasks, system-tuning
Evaluation metrics — perplexity, accuracy, and benchmark datasets such as MAUVE, HellaSwag, SuperGLUE, and ways to assess bias and fairness
Practical PyTorch implementations of transformers, attention layers, and language model training loops, custom classes, custom loss functions
Inference techniques — greedy decoding, beam search, top-k sampling, temperature scaling
Scaling laws and trade-offs between model size, training data, and performance
Limitations and biases in LLMs — interpretability, ethical considerations, and responsible AI
Decoder-only transformers
Embeddings, including token embeddings and positional embeddings
Sampling techniques — methods for generating new text, including top-p, top-k, multinomial, and greedy

Why This Course Is Different

90+ hours of HD video lectures, blending theory, code, and practical application
Code challenges in every section — with full, downloadable solutions
Builds from first principles, starting from basic Python/NumPy implementations and progressing to full PyTorch LLMs
Suitable for researchers, engineers, and advanced learners who want to go beyond “black box” API usage
Clear explanations without dumbing down the content — intensive but approachable

Who Is This Course For?

Machine learning engineers and data scientists
AI researchers and NLP specialists
Software developers interested in deep learning and generative AI
Graduate students or self-learners with intermediate Python skills and basic ML knowledge

Technologies & Tools Covered

Python and PyTorch for deep learning
NumPy and Matplotlib for numerical computing and visualization
Google Colab for free GPU access
Hugging Face Transformers for working with pre-trained models
Tokenizers and text preprocessing tools
Implement Transformers in PyTorch, fine-tune LLMs, decode with attention mechanisms, and probe model internals

What if you have questions about the material?

This course has a Q&A (question and answer) section where you can post your questions about the course material (about the maths, statistics, coding, or machine learning aspects). I try to answer all questions within a day. You can also see all other questions and answers, which really improves how much you can learn! And you can contribute to the Q&A by posting to ongoing discussions.

By the end of this course, you won’t just know how to work with LLMs — you’ll understand why they work the way they do, and be able to design, train, evaluate, and deploy your own transformer-based language models.

Enroll now and start mastering Large Language Models from the ground up.

Who this course is for:

AI engineers
Scientists interested in modern autoregressive modeling
Natural language processing enthusiasts
Students in a machine-learning or data science course
Graduate students or self-learners
Undergraduates interested in large language models
Machine-learning or data science practitioners
Researchers in explainable AI

Course content

Introductions • 5 lectures • 44min

Part 1: Tokenizations and embeddings • 1 lecture • 1min

Words to tokens to numbers • 22 lectures • 5hr 11min

Embeddings spaces • 18 lectures • 6hr 35min

Part 2: Large language models • 1 lecture • 1min

Build a GPT • 29 lectures • 8hr 24min

Pretrain LLMs • 20 lectures • 5hr 12min

Fine-tune pretrained models • 24 lectures • 6hr 14min

Instruction tuning • 7 lectures • 1hr 35min

Part 3: Evaluating LLMs • 1 lecture • 1min
