
Explore building and deploying a fine-tuned large language model on AWS SageMaker via a Streamlit app, using API Gateway and Lambda, with quantization, LoRA, mixed precision training, and monitoring.
Set up and sign in to your AWS account, learn when to use root vs IAM users, enable multi-factor authentication, and review costs with Cost Explorer to understand your charges.
Create an IAM user, enable the management console and CLI access, enforce password changes at first login, and apply administrator or least-privilege permissions for secure AWS access.
Sign in to your new AWS account using the CSV credentials and sign-in URL, reset your password to meet the policy, and use the IAM role for access.
Learn how to respond when an IAM user is compromised by signing back in to the root account, deleting the compromised user, and revoking access to prevent further sign-ins.
Create a SageMaker domain to host a JupyterLab notebook on AWS, select your region, and set up a single-user domain for streamlined AI development.
Set up and manage a SageMaker domain, create a user profile with an execution role, enable idle shutdown, and access JupyterLab, training jobs, models, and deployments.
Launch and manage a JupyterLab notebook on AWS SageMaker, select the region and the ml.t3.medium instance, use a Python kernel, and monitor on-demand costs to avoid unexpected charges.
Launch and operate SageMaker's JupyterLab, installing and upgrading libraries and starting kernels. Align SageMaker session and IAM execution role with the region and S3 to ensure access and avoid charges.
Load the Databricks Dolly 15k dataset from Hugging Face, and convert samples into instruction-context-response prompts for fine-tuning LLMs with diverse data. Handle context variations as needed.
Tokenize text into token IDs, then map them to word embeddings via an embedding table, using byte-pair encoding and subword tokens. Pad inputs with EOS tokens to ensure uniform lengths for transformers.
Master Hugging Face authentication in SageMaker by creating and configuring an access token, granting model access, and loading the tokenizer to prepare prompts with proper padding.
Apply the templating function to convert each dataset item into a single text input with an end-of-sequence token, using the Dolly format and map to streamline the dataset for model use.
Learn to prepare model inputs using input IDs and attention masks, explain padding with fixed 2048 token chunks, and manage long prompts via a remainder dictionary in decoder-only LLMs.
Learn star unpacking in Python: how an iterable's elements are unpacked into individual arguments (one level only) using a list and a three-argument function, demonstrated in Google Colab.
Explore how to flatten and concatenate input IDs and attention masks using star unpacking, chain, and list constructors to prepare 2048-token chunks for LLM fine-tuning.
Explore batching and remainder handling to build 2048-token chunks for fine-tuning LLMs, using input IDs and attention masks, concatenation, and pythonic slicing to create final streams.
Learn to slice the concatenated input ids and attention masks into fixed-size chunks of five, producing 0-5, 5-10, and 10-15 segments for model fine-tuning.
Create a custom chunking function that splits samples into 2048-token chunks, tracks remainders with a global dictionary, and aligns input IDs with attention masks.
Tokenize the entire dataset using a map with a tokenizer to create tokenized inputs, then batch and chunk into 2048-token samples, removing raw text post-tokenization.
Run and debug the chunking function, reload the dataset after edits, and fix errors from missing instruction, context, or category to produce 1528 chunks of 2048 tokens.
Demonstrates the chunking process by inspecting batch sizes and 2048 token chunks, showing how 101 chunks arise and 807 tokens carry over to the next batch.
Learn how to create an S3 bucket and upload the tokenized training data to S3. Configure the SageMaker execution role with S3 permissions and verify the data for model fine-tuning.
Configure a SageMaker fine-tuning training job with Hugging Face, setting hyperparameters such as model id, dataset path, and epochs, and enabling merge weights for efficient tuning.
Create a Hugging Face estimator in SageMaker to orchestrate fine-tuning with QLoRA, configuring the entry point script, hyperparameters, and the training data path before launching a fit job.
Fine-tune large models efficiently with LoRA, freezing the original weight matrix and learning small A and B updates that form a low-rank change to W.
Demonstrates a numerical LoRA example using small trainable matrices A and B to compute the effective weight W, form the update delta W, and train via gradient descent while freezing the large W.
LoRA injects low-rank patches inside existing linear layers, keeping the original weights frozen. Train only lightweight A and B matrices, which amount to about 0.1% of the parameters and yield major memory savings.
Explore an optional walkthrough of matrix multiplication, showing how input x shaped 1x2 multiplies a transposed weight matrix to yield a 1x1 result, with PyTorch's .T used for transposition.
Demonstrate how LoRA speeds training by applying a low-rank update with two small matrices B and A to frozen weights, comparing full tuning and LoRA on multivariate regression in SageMaker.
Compare bfloat16 and float32 to explain how identical exponent ranges with different mantissa precision enable memory savings and faster training in mixed precision for machine learning.
Set up imports and libraries for the train script by configuring transformers components, tokenizers, training utilities, data loading from disk, and the bitsandbytes config in SageMaker.
Create a parse_arg function using Python's argparse to parse model_id and data_set_path. Include arguments for hugging_face_token and training hyperparameters like epochs, per_device_train_batch_size, lr, and seed for the train script.
Explore gradient checkpointing to save GPU memory during training by recomputing activations. Configure bf16 usage and merge weights with the base model's frozen weights for fine-tuning jobs.
Determine how to count trainable parameters in a model with a use_four_bit flag, noting bf16 mixed precision, quantization, and DeepSpeed ZeRO stage 3 memory optimization.
Identify linear layers for LoRA by building a find_all_linear_names function that scans model modules. Collect names of four-bit quantized linear layers to inject trainable adapters, enabling fine-tuning with smaller matrices.
Set up a parameter-efficient fine-tuning workflow by injecting LoRA adapters into a quantized model, enabling gradient checkpointing, and preparing the model for k-bit training with bf16.
Implement LoRA configuration and mixed precision training to fine-tune large language models, enabling gradient checkpointing, selecting targeted modules, applying alpha scaling, and bf16 for faster memory-efficient training.
Learn how double quantization compresses weights to four-bit integers and uses meta scales to reconstruct them, gaining extra memory savings for training, fine-tuning, and inference.
Create a reproducible training function by seeding, loading the tokenized dataset, enabling four-bit quantization with bitsandbytes NF4, and using gradient checkpointing to fine-tune a causal language model with the Hugging Face Trainer.
Finish a training script for fine-tuning with LoRA on SageMaker, saving the merged model and tokenizer for streamlined inference. Learn memory-efficient loading, serialization, and model merging with PEFT.
Finish a SageMaker script by saving int4 weights, reloading as float16, merging LoRA adapters into the base model via merge_and_unload for inference, and specifying a requirements.txt for training.
Learn how to access powerful GPUs on AWS by requesting SageMaker training quotas. Understand on-demand costs for ml.g5.24xlarge and estimate training expenses before you start.
Finalize pre-training cleanup by correcting the Hugging Face hub cache spelling, enabling force download for large models, and adding your Hugging Face token, using the course repo for guidance.
Start and monitor a SageMaker training job for fine-tuning LLMs with QLoRA, configure training input paths, epochs, and hyperparameters, and view logs in CloudWatch to manage costs.
Inspect and troubleshoot a SageMaker training job by viewing status and logs in CloudWatch, monitor GPU utilization, and locate the final 80GB model in S3 after a multi-part loading process.
Deploy a trained LLM to a SageMaker endpoint by using a Hugging Face model, an ECR image, and the S3 URI for model data.
Test and validate your fine-tuned LLM in SageMaker locally, using a payload and generation controls like do_sample, top_p, top_k, and temperature to tune results.
Create a Python AWS Lambda function to invoke a SageMaker endpoint, configure timeout and permissions, pass a prompt payload, and return the model's generated text.
Create an HTTP API in API Gateway to expose the Lambda function to the internet, attach a POST route to invoke the LLM endpoint, and grant invocation permission.
Update the Lambda function to read prompts from the event body, bind a Streamlit app to the API Gateway, and deploy a live interface for interacting with the fine-tuned LLM.
Fixes the Streamlit app error and previews a 7-billion-parameter model fine-tuned with mixed precision. The video discusses epochs, data fine-tuning, quantization, and tearing down resources to avoid costs.
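The LoRA lectures above describe a frozen weight matrix W plus a low-rank update built from two small trainable matrices, B and A. A minimal pure-Python sketch of that arithmetic follows. The matrix values and the helper names (matmul, lora_effective_weight, lora_param_counts) are illustrative assumptions, not the course's own code, and the alpha / r scaling is the commonly used convention rather than something the lecture summaries state.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_effective_weight(W, A, B, alpha, r):
    """Return W + (alpha / r) * (B @ A). W itself is never modified."""
    delta = matmul(B, A)          # low-rank update, same shape as W
    scale = alpha / r
    return [
        [w + scale * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(W, delta)
    ]


def lora_param_counts(d_out, d_in, r):
    """Compare parameters in a full d_out x d_in matrix with its rank-r adapter."""
    full = d_out * d_in
    adapter = r * (d_in + d_out)  # B is d_out x r, A is r x d_in
    return full, adapter


# Toy values chosen for easy hand-checking (hypothetical, rank r = 1).
W = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]  # frozen 2x3 weight
B = [[1.0], [2.0]]                       # trainable 2x1
A = [[0.5, 0.0, -1.0]]                   # trainable 1x3

W_eff = lora_effective_weight(W, A, B, alpha=2, r=1)
full, adapter = lora_param_counts(4096, 4096, 8)
print(W_eff)
print(f"adapter share of one 4096x4096 layer: {adapter / full:.2%}")
```

In real LoRA training B usually starts at zero, so the effective weight equals W before the first gradient step; the nonzero B here only makes the arithmetic visible.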
Large Language Models (LLMs) are redefining what's possible with AI — from chatbots to code generation — but the barrier to training and deploying them is still high. Expensive hardware, massive memory requirements, and complex toolchains often block individual practitioners and small teams. This course is built to change that.
In this hands-on, code-first training, you’ll learn how to fine-tune models like Mixtral-8x7B using QLoRA, a state-of-the-art method that enables efficient training by combining 4-bit quantization, LoRA adapters, and double quantization. You’ll also gain a deep understanding of quantized arithmetic, numeric formats (like the floating-point bfloat16 and the integer INT8), and how they impact model size, memory bandwidth, and matrix multiplication operations.
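To make the bfloat16 versus float32 trade-off concrete, here is a small standard-library sketch. It assumes a simplified conversion that truncates a float32 to its top 16 bits (sign, 8 exponent bits, 7 mantissa bits); real hardware and libraries typically round rather than truncate, and the function names are hypothetical.

```python
import struct


def float32_bits(x):
    """Return the 32-bit IEEE 754 pattern of x as an unsigned integer."""
    return struct.unpack(">I", struct.pack(">f", x))[0]


def to_bfloat16_truncate(x):
    """Simulate bfloat16 by zeroing the low 16 mantissa bits of a float32.

    Simplification: truncation instead of round-to-nearest.
    """
    bits = float32_bits(x) & 0xFFFF0000
    return struct.unpack(">f", struct.pack(">I", bits))[0]


def weight_memory_gb(n_params, bytes_per_param):
    """Memory for the weights alone, in decimal gigabytes."""
    return n_params * bytes_per_param / 1e9


# Precision drops: only 7 mantissa bits survive.
print(to_bfloat16_truncate(3.14159))      # 3.140625

# Range is preserved: the exponent field is the same 8 bits as float32.
print(to_bfloat16_truncate(1e38) > 9e37)  # True

# Storage halves: 4 bytes per float32 weight, 2 per bfloat16 weight.
print(weight_memory_gb(7_000_000_000, 4), weight_memory_gb(7_000_000_000, 2))
```

The 7-billion parameter count is only an example figure; the same function applies to any model size.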
You’ll write advanced Python code to preprocess datasets with custom token-aware chunking strategies, dynamically identify quantizable layers, and inject adapter modules using the PEFT (Parameter-Efficient Fine-Tuning) library. You’ll configure and launch distributed fine-tuning jobs on AWS SageMaker, leveraging powerful multi-GPU instances and optimizing them using gradient checkpointing, mixed-precision training, and bitsandbytes quantization.
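The token-aware chunking mentioned above can be sketched as follows. This is a minimal illustration, not the course's implementation: the function name chunk_batch is hypothetical, the remainder is passed in explicitly instead of living in a global dictionary, and a chunk length of 5 stands in for 2048 so the result is easy to check by hand.

```python
from itertools import chain


def chunk_batch(batch, remainder, chunk_length=2048):
    """Concatenate a batch of tokenized samples and cut fixed-size chunks.

    batch and remainder map column names (e.g. "input_ids",
    "attention_mask") to lists. Tokens that do not fill a whole chunk are
    stored back into remainder and prepended to the next batch.
    """
    # chain(*lists) unpacks one level of nesting, flattening the samples.
    merged = {key: remainder[key] + list(chain(*batch[key])) for key in batch}

    total = len(merged["input_ids"])
    usable = (total // chunk_length) * chunk_length  # largest multiple that fits

    chunks = {
        key: [values[i:i + chunk_length] for i in range(0, usable, chunk_length)]
        for key, values in merged.items()
    }
    # Carry the leftover tokens over to the next call.
    for key, values in merged.items():
        remainder[key] = values[usable:]
    return chunks


remainder = {"input_ids": [], "attention_mask": []}
batch1 = {
    "input_ids": [[1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11, 12]],
    "attention_mask": [[1] * 3, [1] * 4, [1] * 5],
}
first = chunk_batch(batch1, remainder, chunk_length=5)
carried = list(remainder["input_ids"])  # tokens left over after batch 1

batch2 = {"input_ids": [[13, 14, 15, 16]], "attention_mask": [[1] * 4]}
second = chunk_batch(batch2, remainder, chunk_length=5)
print(first["input_ids"], carried, second["input_ids"])
```

Because input IDs and attention masks are sliced with the same indices, the two columns stay aligned in every chunk.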
After training, you’ll go all the way to deployment: merging adapter weights, saving your model for inference, and deploying it via SageMaker Endpoints. You’ll then expose your model through an AWS Lambda function and an API Gateway, and finally, build a Streamlit application to create a clean, responsive frontend interface.
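The Lambda step in that deployment chain might look roughly like the sketch below. Several things are assumptions rather than details from the course: the ENDPOINT_NAME environment variable, the "prompt" field in the request body, the helper names, the default generation parameters, and the {"inputs": ..., "parameters": ...} payload shape commonly accepted by Hugging Face inference containers. The boto3 import sits inside the handler so the helper functions can be tried locally without AWS credentials.

```python
import json
import os


def extract_prompt(event):
    """Read the prompt from an API Gateway event whose body is a JSON string."""
    body = event.get("body") or "{}"
    if isinstance(body, str):
        body = json.loads(body)
    return body.get("prompt", "")


def build_payload(prompt, max_new_tokens=256, temperature=0.7, top_p=0.9):
    """Build a request body in the shape assumed for the endpoint."""
    return {
        "inputs": prompt,
        "parameters": {
            "do_sample": True,
            "max_new_tokens": max_new_tokens,
            "temperature": temperature,
            "top_p": top_p,
        },
    }


def lambda_handler(event, context):
    """Forward the prompt to a SageMaker endpoint and return its response."""
    import boto3  # available in the Lambda Python runtime

    runtime = boto3.client("sagemaker-runtime")
    response = runtime.invoke_endpoint(
        EndpointName=os.environ["ENDPOINT_NAME"],  # assumed env variable
        ContentType="application/json",
        Body=json.dumps(build_payload(extract_prompt(event))),
    )
    result = json.loads(response["Body"].read())
    return {"statusCode": 200, "body": json.dumps(result)}


# Local check of the helpers only; no AWS call is made here.
sample_event = {"body": json.dumps({"prompt": "What is LoRA?"})}
print(build_payload(extract_prompt(sample_event)))
```

The function's execution role would also need permission to invoke the endpoint, and its timeout should allow for generation latency.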
Whether you’re a machine learning engineer, backend developer, or AI practitioner aiming to level up — this course will teach you how to move from academic toy models to real-world, scalable, production-ready LLMs using tools that today’s top companies rely on.