Fine-Tune & Deploy LLMs with QLoRA on Sagemaker + Streamlit

Name: Fine-Tune & Deploy LLMs with QLoRA on Sagemaker + Streamlit
Rating: 4.6 (62 reviews)

Master QLoRA Math, Mixed Precision Training, Double Quantization, Lambda functions, API Gateway & Streamlit deployment

Bestseller

Highest Rated

Created by Patrik Szepesi

Last updated 8/2025

English

What you'll learn

Train and fine-tune LLMs in AWS SageMaker using QLoRA and advanced 4-bit quantization on your own dataset
Create an interactive Streamlit app to deploy your fine-tuned LLM with SageMaker, Lambda functions, and API Gateway
Master QLoRA fine-tuning, including adapter injection, memory optimization, parameter freezing, and the mathematics behind it
Leverage bfloat16 compute types for faster and more efficient training on modern GPUs
Understand mixed precision training with QLoRA in SageMaker
Use Parameter-Efficient Fine-Tuning (PEFT) to dynamically find and inject LoRA layers
Understand the entire low-level fine-tuning pipeline, from raw dataset to trained model
Use double quantization and NF4 precision to compress models with minimal loss in performance
Discover how gradient checkpointing drastically reduces VRAM usage during training
Fine-tune large models like Mixtral on Amazon SageMaker using state-of-the-art GPU acceleration
Understand custom chunking code for LLMs
Merge LoRA weights and unload adapters for final model export — ready for deployment
Deploy your trained model to SageMaker Endpoints using Amazon's production infrastructure
Build real-time LLM APIs using Lambda functions and API Gateway
Securely set up training jobs with IAM roles
AWS Budgeting, Server Management, and Pricing
Learn how to request AWS quota increases to access powerful GPUs

Course content

11 sections • 59 lectures • 7h 34m total length

What We Are Building • 5:23
Explore building and deploying a fine-tuned large language model on AWS SageMaker via a Streamlit app, using API Gateway and Lambda, with quantization, LoRA, mixed precision training, and monitoring.

Sign in to AWS • 4:42
Set up and sign in to your AWS account, learn when to use root vs IAM users, enable multi-factor authentication, and review costs with Cost Explorer while understanding charges.
Creating IAM User • 5:45
Create an IAM user, enable the management console and CLI access, enforce password changes at first login, and apply administrator or least-privilege permissions for secure AWS access.
Signing in with new AWS User • 3:23
Sign in as your new IAM user using the CSV credentials and sign-in URL, reset your password to meet the policy, and use the IAM role for access.
What to Do In case you Get Hacked • 1:32
Learn how to respond when an IAM user is hacked by signing back into the root account, deleting the compromised user, and revoking access to prevent further sign-ins.

Create a Sagemaker Domain • 2:33
Create a SageMaker domain to host a JupyterLab notebook on AWS, select your region, and set up a single-user domain for streamlined AI development.
Logging into Our Sagemaker Environment • 5:11
Set up and manage a SageMaker domain, create a user profile with an execution role, enable idle shutdown, and access JupyterLab, training jobs, models, and deployments.
Introduction to JupyterLab • 8:05
Launch and manage a JupyterLab notebook on AWS SageMaker, select the region and an ml.t3.medium instance, use a Python kernel, and monitor on-demand costs to avoid unexpected charges.

Updates • 0:06
Sagemaker Sessions, Regions, and IAM Roles • 8:04
Launch and operate SageMaker's JupyterLab, installing and upgrading libraries and starting kernels. Align the SageMaker session and IAM execution role with the region and S3 to ensure access and avoid charges.
2026 Sagemaker Update • 5:17
Examining Our Dataset from HuggingFace • 14:28
Load the Databricks Dolly 15k dataset from Hugging Face, and convert samples into instruction-context-response prompts for fine-tuning LLMs with diverse data. Handle context variations as needed.
Tokenization and Word Embeddings • 9:36
Tokenize text into token IDs, then map to word embeddings via an embedding table, using byte-pair encoding and subword tokens. Pad inputs with EOS tokens to ensure uniform lengths for transformers.
HuggingFace Authentication with Sagemaker • 4:25
Master Hugging Face authentication in SageMaker by creating and configuring an access token, granting model access, and loading the tokenizer to prepare prompts with proper padding.
Applying the Templating Function to our Dataset • 8:54
Apply the templating function to convert each dataset item into a single text input with an end-of-sequence token, using the Dolly format and map to streamline the dataset for model use.
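The templating step described above can be sketched roughly as follows. This is a minimal illustration, not the course's own code: the `### Instruction` / `### Context` / `### Answer` headers, the `</s>` end-of-sequence string, and the function names are assumptions chosen for the example.

```python
# Hypothetical EOS string; in practice you would use tokenizer.eos_token.
EOS_TOKEN = "</s>"


def format_dolly(sample: dict) -> str:
    """Build one prompt string from a Dolly-style record."""
    parts = [f"### Instruction\n{sample['instruction']}"]
    # Many records have an empty context, so add that section only when present.
    if sample.get("context"):
        parts.append(f"### Context\n{sample['context']}")
    parts.append(f"### Answer\n{sample['response']}")
    return "\n\n".join(parts)


def template_sample(sample: dict) -> dict:
    """Add a `text` field that ends with the end-of-sequence token."""
    sample["text"] = format_dolly(sample) + EOS_TOKEN
    return sample


example = {"instruction": "Name a primary color.", "context": "", "response": "Red."}
templated = template_sample(dict(example))
print(templated["text"])
```

With the Hugging Face `datasets` library, a function like `template_sample` is typically applied to every record with `dataset.map(template_sample)`.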
Attention Masks and Padding • 16:31
Learn to prepare model inputs using input IDs and attention masks, understand padding with fixed 2048-token chunks, and manage long prompts via a remainder dictionary in decoder-only LLMs.
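To make the relationship between input IDs, padding, and attention masks concrete, here is a small sketch with toy token IDs. The `pad_batch` helper, the right-side padding, and the pad ID of 2 are assumptions for illustration; a real tokenizer does this for you.

```python
def pad_batch(sequences, pad_token_id, max_length=None):
    """Right-pad token ID lists and build matching attention masks."""
    max_length = max_length or max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        seq = seq[:max_length]  # truncate anything longer than max_length
        n_pad = max_length - len(seq)
        input_ids.append(seq + [pad_token_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore.
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


batch = pad_batch([[5, 6, 7], [8, 9]], pad_token_id=2)
print(batch)
```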
Star Unpacking with Python • 4:17
Learn star unpacking in Python: how an iterable's elements are unpacked into individual arguments (one level only) using a list and a three-argument function, demonstrated in Google Colab.
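A short sketch of the idea, using hypothetical function names: the star spreads a list into separate positional arguments, and it only unpacks one level.

```python
def add_three(a, b, c):
    """A three-argument function to unpack a list into."""
    return a + b + c


def count_args(*args):
    """Report how many positional arguments were received."""
    return len(args)


values = [1, 2, 3]
total = add_three(*values)  # same as add_three(1, 2, 3)

nested = [[1, 2], [3], [4, 5]]
n_args = count_args(*nested)  # three arguments; the inner lists stay intact

print(total, n_args)
```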
Chain Iterator, List Constructor and Attention Mask example with Python • 10:34
Explore how to flatten and concatenate input IDs and attention masks using star unpacking, chain, and list constructors to prepare 2048-token chunks for LLM fine-tuning.
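The flattening step might look like the sketch below, with toy values standing in for real tokenizer output. `chain(*lists)` walks each inner list in turn, and `list(...)` materializes the result.

```python
from itertools import chain

# Toy batch: two tokenized samples of different lengths.
batch = {
    "input_ids": [[1, 2, 3], [4, 5]],
    "attention_mask": [[1, 1, 1], [1, 1]],
}

# Star unpacking passes each inner list to chain as its own argument.
flat = {key: list(chain(*lists)) for key, lists in batch.items()}
print(flat)
```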
Understanding Batching • 8:18
Explore batching and remainder handling to build 2048-token chunks for fine-tuning LLMs, using input IDs and attention masks, concatenation, and Pythonic slicing to create final streams.
Slicing and Chunking our Dataset • 7:41
Learn to slice the concatenated input IDs and attention masks into fixed-size chunks of five, producing 0-5, 5-10, and 10-15 segments for model fine-tuning.
Creating our Custom Chunking Function • 17:01
Create a custom chunking function that splits samples into 2048-token chunks, tracks remainders with a global dictionary, and aligns input IDs with attention masks.
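A chunking function along these lines could be sketched as follows. This is an assumed reconstruction, not the course's script: the names `chunk` and `remainder` are illustrative, and the demo uses a chunk length of 5 instead of 2048 so the result is easy to read.

```python
from itertools import chain

# Module-level carry-over: tokens that did not fill a whole chunk in the
# previous batch are prepended to the next one.
remainder = {"input_ids": [], "attention_mask": []}


def chunk(batch, chunk_length=2048):
    """Concatenate a batch of tokenized samples and cut fixed-size chunks."""
    # Flatten each column and prepend the leftover tokens from the last call.
    concatenated = {k: remainder[k] + list(chain(*batch[k])) for k in remainder}
    total = len(concatenated["input_ids"])
    # Largest multiple of chunk_length that fits into the available tokens.
    usable = (total // chunk_length) * chunk_length
    result = {
        k: [v[i:i + chunk_length] for i in range(0, usable, chunk_length)]
        for k, v in concatenated.items()
    }
    # Keep whatever is left for the next batch, using the same slice point for
    # every column so input IDs and attention masks stay aligned.
    for k in remainder:
        remainder[k] = concatenated[k][usable:]
    return result


first = chunk(
    {"input_ids": [[1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11, 12]],
     "attention_mask": [[1] * 3, [1] * 4, [1] * 5]},
    chunk_length=5,
)
carried = list(remainder["input_ids"])
second = chunk({"input_ids": [[13, 14, 15]], "attention_mask": [[1] * 3]}, chunk_length=5)
print(first["input_ids"], carried, second["input_ids"])
```

With `datasets`, a function like this is usually run through `map(..., batched=True)` so each call receives a batch of samples.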
Tokenizing our Dataset • 9:44
Tokenize the entire dataset using a map with a tokenizer to create tokenized inputs, then batch and chunk into 2048-token samples, removing raw text post-tokenization.
Running our Chunking Function • 4:37
Run and debug the chunking function, reload the dataset after edits, and fix errors from missing instruction, context, or category to produce 1528 chunks of 2048 tokens.
Understanding the Entire Chunking Process • 8:39
Demonstrates the chunking process by inspecting batch sizes and 2048-token chunks, showing how 101 chunks arise and 807 tokens carry over to the next batch.
Uploading the Training Data to AWS S3 • 6:19
Learn how to create an S3 bucket and upload the tokenized training data to S3. Configure the SageMaker execution role with S3 permissions and verify the data for model fine-tuning.

Setting Up Hyperparameters for the Training Job • 7:23
Configure a SageMaker fine-tuning training job with Hugging Face, setting hyperparameters such as model id, dataset path, and epochs, and enabling merge weights for efficient tuning.
Creating our HuggingFace Estimator in Sagemaker • 7:21
Create a Hugging Face estimator in SageMaker to orchestrate fine-tuning with QLoRA, configuring the entry point script, hyperparameters, and the training data path before launching a fit job.
Introduction to Low-rank adaptation (LoRA) • 8:35
Fine-tune large models efficiently with LoRA, freezing the original weight matrix and learning small A and B updates that form a low-rank change to W.
LoRA Numerical Example • 11:19
Demonstrates a numerical LoRA example using small trainable matrices A and B to compute the effective W, form the update ΔW, and train via gradient descent while freezing the large W.
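The numerical example can be sketched in plain Python. The specific numbers, the squared-error loss, and the helper names are assumptions made for illustration; the point is that only A and B change while W stays frozen, with W_eff = W + scale * (B @ A).

```python
def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(m):
    return [list(col) for col in zip(*m)]


def effective_weight(W, A, B, scale):
    """W_eff = W + scale * (B @ A); W itself is never modified."""
    delta = matmul(B, A)
    return [[W[i][j] + scale * delta[i][j] for j in range(len(W[0]))]
            for i in range(len(W))]


def loss(W, A, B, x, t, scale):
    """Squared error between W_eff @ x and the target t."""
    y = matmul(effective_weight(W, A, B, scale), x)
    return 0.5 * sum((y[i][0] - t[i][0]) ** 2 for i in range(len(y)))


def lora_step(W, A, B, x, t, scale, lr):
    """One gradient descent step on A and B only."""
    y = matmul(effective_weight(W, A, B, scale), x)
    err = [[y[i][0] - t[i][0]] for i in range(len(y))]
    grad_w_eff = matmul(err, transpose(x))      # dL/dW_eff
    grad_B = matmul(grad_w_eff, transpose(A))   # dL/dB, before scaling
    grad_A = matmul(transpose(B), grad_w_eff)   # dL/dA, before scaling
    new_A = [[A[i][j] - lr * scale * grad_A[i][j] for j in range(len(A[0]))]
             for i in range(len(A))]
    new_B = [[B[i][j] - lr * scale * grad_B[i][j] for j in range(len(B[0]))]
             for i in range(len(B))]
    return new_A, new_B


W = [[1.0, 2.0], [3.0, 4.0]]   # frozen 2x2 weight
A = [[1.0, 0.0]]               # trainable, rank r = 1
B = [[0.0], [0.0]]             # trainable, starts at zero so W_eff == W
x, t = [[1.0], [1.0]], [[4.0], [6.0]]

loss_before = loss(W, A, B, x, t, scale=1.0)
A, B = lora_step(W, A, B, x, t, scale=1.0, lr=0.1)
loss_after = loss(W, A, B, x, t, scale=1.0)
print(loss_before, loss_after)
```

Because B starts at zero, the first step only moves B; A begins to receive gradient once B is nonzero.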
LoRA Summarization and Cost Saving Calculation • 9:28
LoRA injects low-rank patches inside existing linear layers, keeping the original weights frozen. Train only the lightweight A and B matrices, about 0.1% of the parameters, for major memory savings.
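The savings for a single layer can be worked out with simple arithmetic. The layer size and rank below are assumed values for illustration; the fraction for a whole model depends on which layers receive adapters and on the rank chosen.

```python
def lora_param_counts(d_out, d_in, rank):
    """Compare a full weight matrix with its LoRA A and B matrices."""
    full = d_out * d_in            # parameters in the frozen W
    lora = rank * (d_in + d_out)   # A is rank x d_in, B is d_out x rank
    return full, lora, lora / full


full, lora, ratio = lora_param_counts(d_out=4096, d_in=4096, rank=8)
print(f"full={full:,} lora={lora:,} trainable fraction={ratio:.4%}")
```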
(Optional) Matrix Multiplication Refresher • 4:50
Explore an optional walkthrough of matrix multiplication, showing how input x shaped 1x2 multiplies a transposed weight matrix to yield a 1x1 result, with PyTorch's .T used for transposition.
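The shape bookkeeping can be checked without any framework. The values here are made up for the example: x is 1x2, W is 1x2, so x times the transpose of W is 1x1.

```python
def transpose(m):
    """Swap rows and columns of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*m)]


def matmul(a, b):
    """(n x k) times (k x m) gives (n x m)."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


x = [[2.0, 3.0]]   # shape 1x2
W = [[4.0, 5.0]]   # shape 1x2, so its transpose is 2x1
y = matmul(x, transpose(W))   # 2*4 + 3*5 = 23, shape 1x1
print(y)
```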
Understanding LoRA Programmatically Part 1 • 12:59
Demonstrate how LoRA speeds training by applying a low-rank update with two small matrices B and A to frozen weights, comparing full tuning and LoRA on multivariate regression in SageMaker.
Understanding LoRA Programmatically Part 2 • 6:04

Setting up Imports and Libraries for the Train Script • 7:44
Set up imports and libraries for the train script by configuring transformers components, tokenizers, training utilities, data loading from disk, and the bitsandbytes config in SageMaker.
Argument Parsing Function Part 1 • 8:35
Create a parse_arg function using Python's argparse to parse model_id and data_set_path. Include arguments for hugging_face_token and training hyperparameters like epochs, per_device_train_batch_size, lr, and seed for the train script.
Argument Parsing Function Part 2 • 11:40
Explore gradient checkpointing to save GPU memory during training by recomputing activations. Configure bf16 usage and merge weights with the base model's frozen weights for fine-tuning jobs.
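An argument parser of the kind described in the two parts above might look like this sketch. The flag names, defaults, and the `str2bool` helper are assumptions, not the course's script. SageMaker passes estimator hyperparameters to the entry point as command-line strings, which is why booleans are parsed from text here.

```python
import argparse


def str2bool(value: str) -> bool:
    """Parse 'True'/'False' style strings, since argparse's type=bool would not."""
    return str(value).lower() in {"1", "true", "yes"}


def parse_arg(argv=None):
    """Parse model, data, and training options for the train script."""
    parser = argparse.ArgumentParser(description="QLoRA fine-tuning script")
    parser.add_argument("--model_id", type=str, required=True)
    # Assumed default: the directory SageMaker mounts for a channel named "training".
    parser.add_argument("--dataset_path", type=str, default="/opt/ml/input/data/training")
    parser.add_argument("--hugging_face_token", type=str, default=None)
    parser.add_argument("--epochs", type=int, default=3)
    parser.add_argument("--per_device_train_batch_size", type=int, default=1)
    parser.add_argument("--lr", type=float, default=2e-4)
    parser.add_argument("--seed", type=int, default=42)
    parser.add_argument("--gradient_checkpointing", type=str2bool, default=True)
    parser.add_argument("--bf16", type=str2bool, default=True)
    parser.add_argument("--merge_weights", type=str2bool, default=True)
    # parse_known_args ignores extra flags the platform may append.
    args, _ = parser.parse_known_args(argv)
    return args


args = parse_arg(["--model_id", "my-org/my-model", "--epochs", "1", "--bf16", "False"])
print(args)
```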
Understanding Trainable Parameters Caveats • 15:40
Determine how to count trainable parameters in a model with a use_four_bit flag, noting bf16 mixed precision, quantization, and DeepSpeed ZeRO stage 3 memory optimization.
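The two caveats can be illustrated with a small counting function. This is a sketch under assumptions, not the course's implementation: it treats any parameter whose class is named `Params4bit` as packing two 4-bit values per stored element, and it falls back to a `ds_numel` attribute when DeepSpeed ZeRO stage 3 has partitioned a parameter and `numel()` reports zero. The stand-in classes exist only so the example runs without PyTorch.

```python
class FakeParam:
    """Stand-in for a torch parameter, used only for this demo."""

    def __init__(self, n, requires_grad):
        self._n = n
        self.requires_grad = requires_grad

    def numel(self):
        return self._n


class Params4bit(FakeParam):
    """Stand-in for a packed 4-bit parameter (two values per element)."""


def count_parameters(named_parameters):
    """Return (trainable, total) parameter counts with the caveats applied."""
    trainable, total = 0, 0
    for _, param in named_parameters:
        n = param.numel()
        # ZeRO stage 3 shards parameters, so numel() can be 0 locally.
        if n == 0 and hasattr(param, "ds_numel"):
            n = param.ds_numel
        # Packed 4-bit storage holds two weights per element.
        if param.__class__.__name__ == "Params4bit":
            n *= 2
        total += n
        if param.requires_grad:
            trainable += n
    return trainable, total


demo = [
    ("base.weight", Params4bit(500, requires_grad=False)),
    ("lora_A.weight", FakeParam(8, requires_grad=True)),
    ("lora_B.weight", FakeParam(8, requires_grad=True)),
]
trainable, total = count_parameters(demo)
print(f"trainable={trainable} total={total} ({100 * trainable / total:.2f}%)")
```

On a real model the call would be `count_parameters(model.named_parameters())`.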
Introduction to Quantization • 7:46
Identifying Trainable Layers for LoRA • 7:45
Identify linear layers for LoRA by building a find_all_linear_names function that scans model modules. Collect names of four-bit quantized linear layers to inject trainable adapters, enabling fine-tuning with smaller matrices.
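A function in this spirit can be sketched as below. The layer class is passed in as a parameter so the example runs without bitsandbytes; in a real script it would typically be the library's 4-bit linear class. Dropping `lm_head` is a common convention assumed here, and the stand-in model is purely for demonstration.

```python
def find_all_linear_names(model, linear_cls):
    """Collect the short names of all modules that are instances of linear_cls."""
    names = set()
    for full_name, module in model.named_modules():
        if isinstance(module, linear_cls):
            # "layers.0.attn.q_proj" -> "q_proj"
            names.add(full_name.split(".")[-1])
    # Assumed convention: leave the output head out of the adapter targets.
    names.discard("lm_head")
    return sorted(names)


class FakeLinear4bit:
    """Stand-in for a quantized linear layer."""


class FakeNorm:
    """Stand-in for a non-linear module that should be skipped."""


class FakeModel:
    """Stand-in exposing named_modules() like a torch.nn.Module."""

    def named_modules(self):
        return [
            ("layers.0.attn.q_proj", FakeLinear4bit()),
            ("layers.0.attn.v_proj", FakeLinear4bit()),
            ("layers.1.attn.q_proj", FakeLinear4bit()),
            ("layers.0.norm", FakeNorm()),
            ("lm_head", FakeLinear4bit()),
        ]


target_modules = find_all_linear_names(FakeModel(), FakeLinear4bit)
print(target_modules)
```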
Setting up Parameter Efficient Fine Tuning • 4:51
Set up a parameter efficient fine tuning workflow by injecting LoRA adapters into a quantized model, enabling gradient checkpointing, and preparing the model for k-bit training with bf16.
Implement LoRA Configuration and Mixed Precision Training • 11:01
Implement LoRA configuration and mixed precision training to fine-tune large language models, enabling gradient checkpointing, selecting targeted modules, applying alpha scaling, and bf16 for faster memory-efficient training.
Understanding Double Quantization • 4:23
Learn how double quantization compresses weights to four-bit integers and uses meta scales to reconstruct them, gaining extra memory savings for training, fine-tuning, and inference.
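The extra savings come from the scaling constants rather than the weights themselves. The sketch below works through the overhead per parameter, assuming the block sizes reported for QLoRA (one 32-bit constant per 64 weights, then 8-bit constants with one 32-bit second-level constant per 256 of them). These figures and the 7-billion parameter count are assumptions for the example.

```python
def single_quant_overhead(block_size=64, const_bits=32):
    """Bits per weight spent on one scaling constant per block."""
    return const_bits / block_size


def double_quant_overhead(block_size=64, const_bits=8,
                          second_block_size=256, second_const_bits=32):
    """Bits per weight when the constants are themselves quantized."""
    first_level = const_bits / block_size
    second_level = second_const_bits / (block_size * second_block_size)
    return first_level + second_level


single = single_quant_overhead()
double = double_quant_overhead()
saved_bits = single - double

n_params = 7_000_000_000  # hypothetical model size
saved_megabytes = saved_bits * n_params / 8 / 1e6
print(f"{single} -> {double} bits per weight, about {saved_megabytes:.0f} MB saved")
```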
Creating the Training Function Part 1 • 14:53
Create a reproducible training function by seeding, loading the tokenized dataset, enabling four-bit quantization with bitsandbytes NF4, and using gradient checkpointing to fine-tune a causal language model with the Hugging Face Trainer.
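For orientation, the quantization and adapter settings discussed in this section are usually expressed with `BitsAndBytesConfig` from `transformers` and `LoraConfig` from `peft`. The values below (rank, alpha, dropout, target modules) are placeholder assumptions rather than the course's settings, and the snippet needs `torch`, `transformers`, and `peft` installed.

```python
import torch
from peft import LoraConfig
from transformers import BitsAndBytesConfig

# 4-bit NF4 storage with double quantization; matrix multiplies run in bfloat16.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# Hypothetical adapter settings; target_modules would normally come from
# scanning the model for its quantized linear layers.
lora_config = LoraConfig(
    r=64,
    lora_alpha=16,
    lora_dropout=0.1,
    target_modules=["q_proj", "v_proj"],
    bias="none",
    task_type="CAUSAL_LM",
)

# Typical wiring, left as comments because it downloads a model and needs a GPU:
# model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=bnb_config)
# model = prepare_model_for_kbit_training(model)
# model = get_peft_model(model, lora_config)
```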
Creating the Training Function Part 2 • 7:44
Finish a training script for fine-tuning with LoRA on SageMaker, saving the merged model and tokenizer for streamlined inference. Learn memory-efficient loading, serialization, and model merging with PEFT.
Finishing our Sagemaker Script • 5:32
Finish a SageMaker script by saving int4 weights, reloading as float16, merging LoRA adapters into the base model via merge_and_unload for inference, and specifying a requirements.txt for training.
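What merging accomplishes can be shown with plain arithmetic: folding scale * (B @ A) into W gives a single matrix whose output matches the base-plus-adapter path, so inference no longer needs the adapter. The numbers and helper names below are assumptions for illustration; in practice the PEFT library performs this step.

```python
def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def adapter_forward(W, A, B, scale, x):
    """Unmerged path: base output plus the scaled low-rank correction."""
    base = matvec(W, x)
    low_rank = matvec(B, matvec(A, x))
    return [b + scale * l for b, l in zip(base, low_rank)]


def merge(W, A, B, scale):
    """Fold the adapter into the weights: W_merged = W + scale * (B @ A)."""
    rank = len(A)
    return [[W[i][j] + scale * sum(B[i][k] * A[k][j] for k in range(rank))
             for j in range(len(W[0]))] for i in range(len(W))]


W = [[1.0, 2.0], [3.0, 4.0]]
A = [[1.0, 0.0]]
B = [[0.1], [-0.1]]
scale = 2.0  # alpha / r with alpha = 2, r = 1
x = [1.0, 1.0]

unmerged = adapter_forward(W, A, B, scale, x)
merged = matvec(merge(W, A, B, scale), x)
print(unmerged, merged)
```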
Gaining Access to Powerful GPUs with AWS Quotas • 5:47
Learn how to access powerful GPUs on AWS by requesting SageMaker training quotas. Understand on-demand costs for ml.g5.24xlarge and estimate training expenses before you start.
Final Fixes Before Training • 4:06
Finalize pre-training cleanup by correcting the Hugging Face hub cache spelling, enabling force download for large models, and adding your Hugging Face token, using the course repo for guidance.

Starting our Training Job • 7:33
Start and monitor a SageMaker training job for fine-tuning LLMs with QLoRA, configure training input paths, epochs, and hyperparameters, and view logs in CloudWatch to manage costs.
Inspecting the Results of our Training Job and Monitoring with Cloudwatch • 12:08
Inspect and troubleshoot a SageMaker training job by viewing status and logs in CloudWatch, monitor GPU utilization, and locate the final 80GB model in S3 after a multi-part loading process.

Deploying our LLM to a Sagemaker Endpoint • 19:08
Deploy a trained LLM to a SageMaker endpoint by using a Hugging Face model, an ECR image, and the S3 URI for model data.
Testing our LLM in Sagemaker Locally • 9:06
Test and validate your fine-tuned LLM in SageMaker locally, using a payload and generation controls like do_sample, top_p, top_k, and temperature to tune results.
Creating the Lambda Function to Invoke our Endpoint • 9:35
Create a Python AWS Lambda function to invoke a SageMaker endpoint, configure timeout and permissions, pass a prompt payload, and return the model's generated text.
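A Lambda handler for this step might be shaped like the sketch below. Several details are assumptions: the request body carries a `prompt` field, the endpoint accepts an `inputs`/`parameters` JSON payload and returns a list with a `generated_text` entry, and the generation values are placeholders. The invoke function is injected so the example runs locally with a stand-in; in Lambda you would pass `boto3.client("sagemaker-runtime").invoke_endpoint` instead.

```python
import io
import json


def make_handler(invoke_endpoint, endpoint_name):
    """Build a Lambda handler bound to one SageMaker endpoint."""

    def handler(event, context):
        # API Gateway delivers the request body as a JSON string.
        body = json.loads(event.get("body") or "{}")
        prompt = body.get("prompt", "")
        if not prompt:
            return {"statusCode": 400, "body": json.dumps({"error": "missing prompt"})}
        payload = {
            "inputs": prompt,
            "parameters": {"do_sample": True, "top_p": 0.9, "top_k": 50,
                           "temperature": 0.7, "max_new_tokens": 256},
        }
        response = invoke_endpoint(
            EndpointName=endpoint_name,
            ContentType="application/json",
            Body=json.dumps(payload),
        )
        # The response body is a stream; read and decode it as JSON.
        result = json.loads(response["Body"].read())
        return {"statusCode": 200,
                "body": json.dumps({"generated_text": result[0]["generated_text"]})}

    return handler


def fake_invoke_endpoint(EndpointName, ContentType, Body):
    """Local stand-in that mimics the shape of the runtime response."""
    prompt = json.loads(Body)["inputs"]
    reply = [{"generated_text": prompt + " ... world"}]
    return {"Body": io.BytesIO(json.dumps(reply).encode("utf-8"))}


handler = make_handler(fake_invoke_endpoint, "my-endpoint")
ok = handler({"body": json.dumps({"prompt": "Hello"})}, None)
bad = handler({"body": "{}"}, None)
print(ok, bad)
```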
Creating API Gateway to Deploy the Model Through the Internet • 2:42
Create an HTTP API in API Gateway to expose the Lambda function to the internet, attach a POST route to invoke the LLM endpoint, and grant invocation permission.
Implementing our Streamlit App • 5:23
Update the Lambda function to read prompts from the event body, bind a Streamlit app to the API Gateway, and deploy a live interface for interacting with the fine-tuned LLM.
Streamlit App Correction • 3:42
Fixes the Streamlit app error and previews a 7-billion-parameter model fine-tuned with mixed precision. The video discusses epochs, data fine-tuning, quantization, and tearing down resources to avoid costs.

Requirements

Familiarity with Python
Basic linear algebra (matrix multiplication)

Description

Large Language Models (LLMs) are redefining what's possible with AI — from chatbots to code generation — but the barrier to training and deploying them is still high. Expensive hardware, massive memory requirements, and complex toolchains often block individual practitioners and small teams. This course is built to change that.

In this hands-on, code-first training, you’ll learn how to fine-tune models like Mixtral-8x7B using QLoRA, a state-of-the-art method that enables efficient training by combining 4-bit quantization, LoRA adapters, and double quantization. You’ll also gain a deep understanding of quantized arithmetic, numeric formats (like bfloat16 and INT8), and how they impact model size, memory bandwidth, and matrix multiplication operations.

You’ll write advanced Python code to preprocess datasets with custom token-aware chunking strategies, dynamically identify quantizable layers, and inject adapter modules using the PEFT (Parameter-Efficient Fine-Tuning) library. You’ll configure and launch distributed fine-tuning jobs on AWS SageMaker, leveraging powerful multi-GPU instances and optimizing them using gradient checkpointing, mixed-precision training, and bitsandbytes quantization.

After training, you’ll go all the way to deployment: merging adapter weights, saving your model for inference, and deploying it via SageMaker Endpoints. You’ll then expose your model through an AWS Lambda function and an API Gateway, and finally, build a Streamlit application to create a clean, responsive frontend interface.

Whether you’re a machine learning engineer, backend developer, or AI practitioner aiming to level up — this course will teach you how to move from academic toy models to real-world, scalable, production-ready LLMs using tools that today’s top companies rely on.

Who this course is for:

Machine Learning Engineers
Backend and MLOps Engineers
AI Researchers and Students
Anyone who wants to go beyond "prompt engineering" and start building, training, and deploying their own production-ready LLMs.

Course content

Course Overview • 1 lecture • 5min

Setting up Our AWS Account • 4 lectures • 15min

Setting Up AWS Sagemaker Environment • 3 lectures • 16min

Course Resources • 1 lecture • 1min

Gathering, Chunking, Tokenizing and Uploading our Dataset • 17 lectures • 2hr 25min

Understanding LoRA and Setting up HuggingFace Estimator • 8 lectures • 1hr 8min

Improving Training Speed with Bfloat 16 • 2 lectures • 15min

Setting up the QLoRA Training Script with Mixed Precision & Double Quantization • 14 lectures • 1hr 57min

Running our Fine Tuning Script for our LLM • 2 lectures • 20min

Deploying our Fine Tuned LLM • 6 lectures • 50min