Cut the gut out from llm: design LLM from zero

Dive deep and dissect the underlying multi-head attention algorithm of LLMs

Created by Tylor Chen

Last updated 1/2026

English

What you'll learn

How to preprocess a large data set
How to convert text into vector embeddings
How to design the transformer architecture and code the attention mechanism
How to train and tune a GPT model from scratch for text generation and classification

Course content

3 sections • 26 lectures • 9h 34m total length

Introduction • 6:56
In this video, we introduce the goals and structure of the course.
Small talk about designing an LLM • 11:33
In this video, we give a brief introduction to the architecture, key components, and training process of an LLM.
Code for preprocessing data • 34:00
In this video, we show how to use code to convert text to tokens and tokens to IDs.
Code implementation of the algorithm used by OpenAI for data preprocessing • 31:45
Introduction to the sliding window • 12:22
In this video, we introduce the concept of the sliding window, which we will use to implement our data preprocessing strategy.
Code implementation for data preprocessing • 28:15
Word embedding and position embedding • 26:16
In this video, we show how to convert a word token, together with its position in the sentence, into a vector.
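
As a rough illustration of the preprocessing steps in this section, the sketch below converts text to tokens, maps tokens to IDs, and uses a sliding window to build input/target pairs. It is written in plain Python, and everything in it is an assumption made for illustration: the regex tokenizer is not the algorithm used by OpenAI that the course covers, and the names tokenize, build_vocab, sliding_windows, context_size, and stride are hypothetical.

```python
import re

def tokenize(text):
    # Split into words and punctuation marks (simple regex tokenizer, assumed for illustration).
    return re.findall(r"\w+|[^\w\s]", text)

def build_vocab(tokens):
    # Map each unique token to an integer ID, assigned in sorted order.
    return {tok: i for i, tok in enumerate(sorted(set(tokens)))}

def sliding_windows(ids, context_size, stride=1):
    # Pair each input window with the same window shifted one position to the right,
    # so the target at every position is the next token.
    pairs = []
    for start in range(0, len(ids) - context_size, stride):
        inputs = ids[start:start + context_size]
        targets = ids[start + 1:start + context_size + 1]
        pairs.append((inputs, targets))
    return pairs

text = "the cat sat on the mat."
tokens = tokenize(text)
vocab = build_vocab(tokens)
ids = [vocab[t] for t in tokens]
pairs = sliding_windows(ids, context_size=4, stride=2)

for inputs, targets in pairs:
    print(inputs, "->", targets)
```

A larger stride produces fewer, less overlapping training pairs; a stride of 1 uses every possible window.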

Introduction to self-attention with code • 34:08
In this video, we use code to show the steps for computing the attention score between two given words.
Making self-attention trainable • 24:41
In this video, we show how to use matrices to turn fixed values into trainable ones.
Wrapping the attention process using matrices • 22:02
In this video, we show how to wrap the whole attention computation in a class.
Using linear operations to improve the attention process • 17:27
In this video, we show how to use nn.Linear to improve the computation.
Causal attention algorithm • 20:47
In this video, we talk about masked attention, which prevents future words from interfering with the prediction of the current word.
Improved causal attention algorithm • 17:32
In this video, we show how to improve the masked attention process.
Causal attention with dropout • 32:29
In this video, we show how to use dropout to improve training outcomes.
Multi-head attention • 21:25
In this video, we show how to design multi-head attention to extract more meaning from the same sentence.
Improved multi-head attention, part 1 • 30:19
In this video, we show how to convert the for loop into matrix operations.
Improved multi-head attention, part 2 • 27:33
In this video, we continue with the detailed implementation of replacing loops with matrix operations.
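
To make the ideas in this section concrete, here is a minimal sketch of causal (masked) scaled dot-product attention and a loop-based multi-head variant. It is a simplified illustration under stated assumptions, not the course's implementation: it uses plain Python lists instead of torch, omits the trainable query/key/value projections (the nn.Linear layers mentioned above) and dropout, and the function names are hypothetical.

```python
import math

def softmax(row):
    # Subtract the max before exponentiating for numerical stability.
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def causal_attention(queries, keys, values):
    # Returns (context_vectors, attention_weights) for a single head.
    d_k = len(keys[0])
    seq_len = len(queries)
    weights, context = [], []
    for i, q in enumerate(queries):
        # Score the current token against itself and earlier tokens only;
        # skipping positions j > i plays the role of the causal mask.
        scores = [sum(a * b for a, b in zip(q, keys[j])) / math.sqrt(d_k)
                  for j in range(i + 1)]
        row = softmax(scores) + [0.0] * (seq_len - i - 1)
        weights.append(row)
        context.append([sum(w * v[c] for w, v in zip(row, values))
                        for c in range(len(values[0]))])
    return context, weights

def multi_head_attention(x, num_heads):
    # Split each embedding into num_heads equal chunks, attend within each
    # chunk independently, then concatenate the per-head results.
    head_dim = len(x[0]) // num_heads
    outputs = [[] for _ in x]
    for h in range(num_heads):
        chunk = [vec[h * head_dim:(h + 1) * head_dim] for vec in x]
        head_context, _ = causal_attention(chunk, chunk, chunk)
        for out, ctx in zip(outputs, head_context):
            out.extend(ctx)
    return outputs

# Three tokens with 4-dimensional embeddings (made-up values).
x = [[1.0, 0.0, 0.5, 0.5],
     [0.0, 1.0, 0.5, -0.5],
     [1.0, 1.0, 0.0, 1.0]]
context, weights = causal_attention(x, x, x)
multi = multi_head_attention(x, num_heads=2)
```

The explicit loop over heads is the kind of code that the two "Improved multi-head attention" lectures replace with matrix operations.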

The overall view of the GPT model • 9:38
In this video, we give an overview of the GPT model and identify all the components used to build it; in later videos, we dive deep into each layer.
Building the skeleton of the GPT model • 33:19
In this video, we use code to build the skeleton of the GPT model, so that we can fill in the individual layers in later videos.
Algorithm details of layer normalization • 21:01
In this video, we go through the detailed algorithm steps of layer normalization.
Torch implementation of layer normalization • 9:43
In this video, we implement layer normalization using the torch framework.
Sandwich layer with GELU activation • 18:52
In this video, we see how to implement a sandwich-like structure with the GELU activation function.
Shortcut connection • 26:21
In this video, we talk about the shortcut connection trick, which is useful for speeding up the training of deep networks.
Implementation of the transformer block • 12:44
In this video, we show how to construct the transformer block from all the components we have built so far.
Constructing the final GPT model • 19:29
In this video, we see how to create the whole GPT model from the transformer blocks and layers covered earlier.
Generating text from the model • 24:22
In this video, we show how to use the model to predict words.
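
Three of the building blocks named in this section are small enough to sketch in a few lines: layer normalization, the GELU activation, and the shortcut connection. The code below is a dependency-free illustration rather than the course's torch code. It assumes the tanh approximation of GELU, leaves out the learnable scale and shift that a full layer normalization layer normally has, and uses the hypothetical helper name with_shortcut.

```python
import math

def layer_norm(x, eps=1e-5):
    # Normalize one vector to zero mean and (approximately) unit variance.
    # The learnable scale and shift parameters are omitted in this sketch.
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def gelu(v):
    # Tanh approximation of the GELU activation (an assumption of this sketch).
    return 0.5 * v * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (v + 0.044715 * v ** 3)))

def with_shortcut(x, sublayer):
    # Shortcut (residual) connection: add the sublayer's output to its input.
    return [a + b for a, b in zip(x, sublayer(x))]

x = [1.0, 2.0, 3.0, 4.0]
normalized = layer_norm(x)
activated = [gelu(v) for v in normalized]
# Normalize, apply GELU, then add the result back onto the original input.
block_out = with_shortcut(x, lambda v: [gelu(u) for u in layer_norm(v)])
```

Because the shortcut adds the input back unchanged, gradients have a direct path through the block, which is why it helps when training deep networks.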

Requirements

One year of Python programming experience and basic knowledge of deep learning

Description

In an age where large language models (LLMs) are at the forefront of AI, knowing how to call an API isn’t enough to set you apart. Mastery comes from understanding the core architecture, mechanics, and fine-tuning techniques behind these powerful models. This course is designed for those who want to go beyond the basics and learn to build a large language model (LLM) from scratch, gaining insight into the internal components that make them work.

Starting with an introduction to the transformer architecture, you’ll learn how models process language by dissecting the elements of word embeddings, token encoding, and position encoding. This course covers the complete process of creating token embeddings, understanding and coding attention mechanisms, and building a simple yet effective model based on the principles of GPT.

You’ll gain hands-on experience implementing code to preprocess unlabeled data, developing the model to generate coherent text, and even fine-tuning for specific tasks like classification and instruction-following. By constructing and refining an LLM from scratch, you’ll learn the skills needed to debug, optimize, and innovate in a way that goes far beyond what’s possible by simply calling an API.

This deep dive into LLMs will empower you to:

Truly understand the inner workings of language models, including the transformer architecture and attention mechanisms.
Build and customize models suited to your specific needs, with a hands-on approach to code implementation and optimization.
Fine-tune models for targeted tasks, giving you full control over performance and functionality.

With these skills, you’ll not only become proficient in using LLMs but also develop the expertise needed to be a leader in this cutting-edge field. This course is perfect for developers, data scientists, and AI enthusiasts ready to dive deeper into the transformative world of large language models.

Who this course is for:

Deep learning enthusiasts, engineers, and students
Practitioners who want to master the guts of LLMs
Ambitious software engineers who are not satisfied with only calling the OpenAI API and want to understand the internals of ChatGPT

Course content

Preprocessing text data for LLM training • 7 lectures • 2hr 31min

Self attention • 10 lectures • 4hr 8min

Set up the whole GPT model • 9 lectures • 2hr 55min
