Quantizing LLMs with PyTorch and Hugging Face

Rating: 4.6 (13 reviews)

Optimize Memory and Speed for Large Language Models with Advanced Quantization Techniques

Created by Tensor Teach

Last updated 5/2025

English

What you'll learn

Gain an intuitive understanding of linear quantization
Learn different linear quantization techniques
Learn at a high level how 2-bit and 4-bit quantization work
Learn how to quantize LLMs from Hugging Face

Course content

3 sections • 22 lectures • 2h 1m total length

Intro to Quantization • 3:09
Learn how quantization reduces model memory by lowering parameter precision from full 32-bit floats, enabling large models to fit in memory while maintaining performance.
Please Read: Instructor Note For The Following Lecture (#2) • 0:10
Data Types, Memory Requirements, and Bit Representations • 8:30
Linear Quantization: Building Intuition • 6:46
Linear Quantization Formula • 6:41
Explore how eight-bit quantization preserves relative distances by scaling and shifting values with a zero point, then quantize and dequantize using scale and zero point in PyTorch.
Quantizing an Array of Values • 11:42
Section Notebook • 0:02
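
As a rough illustration of the scale and zero-point idea described in the lectures above, here is a minimal sketch of asymmetric 8-bit linear quantization. It is written in plain Python rather than PyTorch so that it runs without dependencies; the function names and sample values are hypothetical and are not taken from the course notebooks.

```python
def get_scale_and_zero_point(values, q_min=-128, q_max=127):
    """Derive a scale and zero point that map [min, max] onto [q_min, q_max]."""
    r_min, r_max = min(values), max(values)
    if r_max == r_min:
        # Degenerate range: any scale works, so pick 1.0 to avoid dividing by zero.
        return 1.0, 0
    scale = (r_max - r_min) / (q_max - q_min)
    zero_point = round(q_min - r_min / scale)
    # Keep the zero point inside the representable integer range.
    zero_point = max(q_min, min(q_max, zero_point))
    return scale, zero_point


def quantize(values, scale, zero_point, q_min=-128, q_max=127):
    """q = clamp(round(r / scale) + zero_point)."""
    return [max(q_min, min(q_max, round(v / scale) + zero_point)) for v in values]


def dequantize(q_values, scale, zero_point):
    """r is approximately scale * (q - zero_point)."""
    return [scale * (q - zero_point) for q in q_values]


# Illustrative values only.
values = [-1.5, 0.0, 0.7, 2.3]
scale, zero_point = get_scale_and_zero_point(values)
q_values = quantize(values, scale, zero_point)
restored = dequantize(q_values, scale, zero_point)

for original, q, approx in zip(values, q_values, restored):
    print(f"{original:+.4f} -> {q:4d} -> {approx:+.4f}")
```

The restored values differ from the originals by at most about one quantization step (the scale), which is the error that the later lectures measure.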

Quantizing & Dequantizing Tensors • 8:12
Explore quantizing and dequantizing tensors in PyTorch, using eight-bit precision with scale and zero point. Learn how to compute qmin/qmax, map values, and evaluate quantization error and relative distances.
Computing Quantization Error • 2:25
Compute quantization error by comparing the original and dequantized tensors: square their differences and average them to get the mean squared error, a metric for evaluating quantization techniques.
Symmetric Quantization • 8:08
Implementing Symmetric Quantization Algorithm • 7:18
Quantization Per Channel • 7:12
Explore per-channel quantization, which quantizes each row with its own scale and zero point. This improves accuracy compared with using a single scale for the entire tensor, at the cost of storing more quantization parameters.
Quantization Per Group • 10:24
Inference w/ Quantized Weights • 7:18
Section Notebook • 0:01
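
To make the per-channel trade-off concrete, the sketch below compares one shared scale against one scale per row, using mean squared error as the metric. It assumes symmetric quantization (zero point fixed at 0) for brevity and uses plain Python lists instead of PyTorch tensors; the weight values and helper names are made up for illustration.

```python
def symmetric_scale(values, q_max=127):
    """Symmetric quantization: the scale maps the largest magnitude to q_max."""
    r_max = max(abs(v) for v in values)
    return r_max / q_max if r_max > 0 else 1.0


def quantize_dequantize(values, scale, q_max=127):
    """Round-trip values through the integer grid and back to floats."""
    restored = []
    for v in values:
        q = max(-q_max, min(q_max, round(v / scale)))
        restored.append(q * scale)
    return restored


def mse(original, restored):
    """Mean squared error between original and dequantized values."""
    return sum((a - b) ** 2 for a, b in zip(original, restored)) / len(original)


# Hypothetical weight matrix whose rows have very different magnitudes.
weights = [
    [0.02, -0.01, 0.03, 0.015],
    [8.0, -6.5, 7.2, -3.1],
]
flat = [w for row in weights for w in row]

# Per-tensor: a single scale shared by every value.
per_tensor = quantize_dequantize(flat, symmetric_scale(flat))
per_tensor_mse = mse(flat, per_tensor)

# Per-channel: each row gets its own scale (one extra float stored per row).
per_channel = []
for row in weights:
    per_channel.extend(quantize_dequantize(row, symmetric_scale(row)))
per_channel_mse = mse(flat, per_channel)

print(f"per-tensor MSE:  {per_tensor_mse:.8f}")
print(f"per-channel MSE: {per_channel_mse:.8f}")
```

With a shared scale, the small values in the first row all round to zero, whereas a per-row scale keeps them distinguishable. Per-group quantization applies the same idea to smaller blocks within a row.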

Requirements

Python programming experience
Experience working with Hugging Face Transformers
Intermediate math skills

Description

As large language models (LLMs) continue to transform industries, the challenge of deploying these computationally intensive models efficiently has become paramount. This course, Quantizing LLMs with PyTorch and Hugging Face, equips you with the tools and techniques to harness quantization, an essential optimization method, to reduce memory usage and improve inference speed without significant loss of model accuracy.

In this hands-on course, you’ll start by mastering the fundamentals of quantization. Through intuitive explanations, you will demystify concepts like linear quantization, different data types and their memory requirements, and how to manually quantize values for practical understanding.
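
As a back-of-the-envelope illustration of why data types matter, the sketch below estimates the memory needed just to store a model's weights at different precisions. The 7-billion-parameter count is a hypothetical example, and the figures ignore activations, the KV cache, and any per-group quantization metadata.

```python
# Bits used to store one parameter in each format.
BITS_PER_PARAM = {"float32": 32, "float16": 16, "int8": 8, "int4": 4, "int2": 2}


def weight_memory_gb(num_params, dtype):
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * BITS_PER_PARAM[dtype] / 8 / 1e9


num_params = 7_000_000_000  # hypothetical model size, for illustration only

for dtype in BITS_PER_PARAM:
    print(f"{dtype:>8}: {weight_memory_gb(num_params, dtype):6.2f} GB")
```

Halving the bit width halves the weight storage, which is the basic motivation for moving from 32-bit floats to 8-bit or lower representations.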

Next, delve into advanced quantization techniques, including symmetric and asymmetric quantization, and their applications. Gain practical experience with per-channel and per-group quantization methods, and learn how to compute and mitigate quantization errors. Through real-world examples, you'll see these methods come to life and understand their impact on model performance.

The final section focuses on lower-bit techniques such as 2-bit and 4-bit quantization. You’ll learn how bit packing and unpacking work, implement these techniques step-by-step, and apply them to real Hugging Face models. By the end of the course, you’ll be adept at using tools like PyTorch and bitsandbytes to quantize models to varying precisions, enabling you to optimize both small-scale and enterprise-level LLM deployments.
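
Bit packing is needed because the smallest common storage unit is a byte, so two 4-bit codes have to share one byte. The sketch below shows one possible packing scheme in plain Python, with the first code in the low nibble and the second in the high nibble. The layout and function names are assumptions for illustration; the course and libraries such as bitsandbytes may use a different layout.

```python
def pack_4bit(codes):
    """Pack pairs of 4-bit codes (0..15) into bytes, low nibble first."""
    if len(codes) % 2 != 0:
        raise ValueError("expected an even number of 4-bit codes")
    packed = bytearray()
    for i in range(0, len(codes), 2):
        low, high = codes[i], codes[i + 1]
        if not (0 <= low <= 15 and 0 <= high <= 15):
            raise ValueError("each code must fit in 4 bits (0..15)")
        packed.append(low | (high << 4))
    return bytes(packed)


def unpack_4bit(packed):
    """Recover the original 4-bit codes from packed bytes."""
    codes = []
    for byte in packed:
        codes.append(byte & 0x0F)  # low nibble holds the first code
        codes.append(byte >> 4)    # high nibble holds the second code
    return codes


# Illustrative quantized codes, as if produced by a 4-bit quantizer.
codes = [3, 15, 0, 9]
packed = pack_4bit(codes)
unpacked = unpack_4bit(packed)

print(f"{len(codes)} codes stored in {len(packed)} bytes")
print(unpacked == codes)
```

The same approach extends to 2-bit codes by placing four codes in each byte with shifts of 0, 2, 4, and 6 bits.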

Whether you are a machine learning practitioner, a data scientist exploring optimization techniques, or a systems engineer focused on efficient model deployment, this course provides a comprehensive guide to quantization. With a blend of theory and practical coding exercises, you’ll gain the expertise needed to reduce costs and improve computational efficiency in modern AI applications.

Who this course is for:

Advanced students looking to gain an in-depth understanding of quantization

Course content

Introduction to Quantization • 7 lectures • 37min

Quantization Techniques • 8 lectures • 51min

Lower Bit Quantization & Quantizing Models from Hugging Face • 7 lectures • 34min