AI Vision Systems for Self-Driving Cars in Production on AWS

Rating: 5.0 (5 reviews)

Computer Vision on AWS: SageMaker, Rekognition, ViTs and Meta's Segment Anything Model for Detection + Segmentation + Math

Created by Patrik Szepesi

Last updated 3/2026

English

English [Auto]

What you'll learn

Build an end-to-end auto-labeling pipeline using Segment Anything (SAM) for large-scale image datasets
Understand how Vision Transformers (ViTs) work internally, including patch embeddings and self-attention
Explain the core mathematics behind SAM, including mask decoding and prompt conditioning
Run GPU-accelerated segmentation workloads efficiently using modern deep-learning stacks
Compare SAM ViT-B, ViT-L, and ViT-H models and choose the right one for cost, speed, and accuracy
Integrate AWS Rekognition for high-level object detection and metadata extraction
Combine AWS Rekognition outputs with SAM masks to create precise, pixel-level labels
Visualize segmentation masks, bounding boxes, and confidence scores for model debugging
Analyze trade-offs between open-source CV models and managed cloud services
Image Segmentation
How to Use Open Source Models in AWS SageMaker
Optimize performance and memory usage when running SAM on large images
Use AWS-based pipelines to scale computer-vision workloads reliably
Bridge the gap between theory (math + models) and practical production pipelines
AWS Rekognition
Object Detection

Course content

8 sections • 50 lectures • 6h 18m total length

Final Product (4:33)
Launch a computer vision startup by building an AI labeling workflow that combines AWS Rekognition labels with segmentation masks from Meta's Segment Anything Model (SAM).
Course Resources (0:01)

Vision Transformers vs Convolutional Neural Networks (5:45)
Compare vision transformers and CNNs, highlighting when to use global-context self-attention versus local feature detection, with guidance for large datasets, segmentation, and fast mobile inference.
Quadratic Operations (10:19)
Examine how attention in vision transformers links every token to every other token, which makes the cost grow quadratically (n squared) with the number of tokens, and note how grouping pixels into patches keeps that count manageable, while CNNs remain more efficient.
Introduction to ViTs and Joint Training with Embeddings (11:17)
Explore how vision transformers turn image patches into embeddings via a linear projection, with CLS tokens and learned positional encodings, trained jointly with the model for image classification.
Understanding Attention Mechanisms, Brief Summary (5:29)
See how vision transformers use self-attention to compute patch-to-patch scores with query-key dot products, with gradients from the MLP head flowing back to the patch embeddings, including the CLS token.
Understanding the Full ViT Pipeline (17:36)
Learn how a vision transformer uses a CLS token to aggregate image patches through multi-head attention, computing queries, keys, and values and forming a global representation for classification.
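
To make the quadratic cost described in these lectures concrete, here is a minimal sketch in plain Python. The function names and the 224x224 input size are illustrative assumptions, not taken from the course notebooks. It counts the tokens a ViT sees for a given patch size and the number of attention scores each head computes.

```python
def vit_token_count(height: int, width: int, patch: int, cls_token: bool = True) -> int:
    """Number of tokens for an image split into patch x patch squares."""
    if height % patch or width % patch:
        raise ValueError("image sides must be divisible by the patch size")
    n = (height // patch) * (width // patch)
    # The optional CLS token is one extra token prepended to the patch tokens.
    return n + 1 if cls_token else n


def attention_score_entries(tokens: int) -> int:
    """Self-attention scores every token against every token: n * n per head."""
    return tokens * tokens


# Hypothetical 224x224 input; smaller patches mean more tokens.
for patch in (32, 16, 8):
    n = vit_token_count(224, 224, patch)
    print(f"patch={patch:>2}  tokens={n:>4}  scores per head={attention_score_entries(n):>7}")
```

Under these assumptions, halving the patch side from 16 to 8 pixels roughly quadruples the token count (197 to 785) and multiplies the scores per head by about sixteen (38,809 to 616,225).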

Introduction to Prompt Encoders for SAM (5:06)
Explain how SAM's image encoder uses a vision transformer to produce image tokens and a feature embedding, while the prompt encoder turns clicks or auto prompts into sparse tokens.
SAM AutoPrompt Mode (16:00)
Explore SAM auto prompt mode, where learned object finder queries guide segmentation without user input, using prompt self-attention and cross-attention with vision transformer features to output masks.
SAM Manual Click Mode (8:05)
Explore how prompt self-attention in the SAM architecture enables auto prompts and manual foreground clicks to communicate, updating query vectors to bias attention toward the clicked region in image embeddings.
ViT Embeddings inside SAM (5:08)
Create patch embeddings from an input image using a vision transformer, add learned positional encodings, and apply single-head self-attention within the transformer encoder.
Calculating Attention Score for Vision Transformers in SAM (17:21)
Project patch tokens to queries, keys, and values to compute attention scores for vision transformers in SAM, apply stabilized softmax, and blend with values to form context-aware, residual-connected representations.
How SAM is Trained (8:17)
SAM encodes images with a vision transformer into embeddings, then uses prompt tokens and two-way attention to generate mask candidates and IoU scores in manual or auto prompt modes.
Calculating Prompt Self Attention for SAM (4:17)
Explain prompt self-attention in auto prompt mode, covering query, key, and value vectors, attention scores, scaling, and how updated prompt embeddings enable communication for subsequent image cross-attention with the ViT encoder.
Prompt Image Cross Attention (7:44)
Explore prompt image cross attention, where updated prompt embeddings query image token edges and colors to attend to relevant image regions, producing richer features for segmentation.
Image to Prompt Cross Attention (6:04)
Explore image to prompt cross-attention, updating image tokens with prompt-conditioned evidence. See how queries from image tokens align with updated prompt keys and values to highlight objects and suppress background.
(Optional) Finishing SAM Example Part 1 (9:14)
Watch an optional video that demonstrates the SAM architecture using prompt self-attention and image-to-prompt cross attention to build mask maps from patch features, with a toy upsampling example.
(Optional) Finishing SAM Example Part 2 (6:42)
Describe how the IoU token and tiny head score and select the best mask, then deduplicate overlaps. Compare toy and real SAM architectures, including multi-head attention and upsampling differences.
Finishing (0:04)
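
The attention steps named in this section (prompt self-attention, prompt-to-image cross-attention, image-to-prompt cross-attention, stabilized softmax, residual updates) can be sketched as a toy in NumPy. This is not SAM's actual implementation: it is single-head, uses random weights, reuses one set of projection matrices for all three steps, and every name and size in it is an assumption chosen for illustration.

```python
import numpy as np


def stable_softmax(scores: np.ndarray) -> np.ndarray:
    """Row-wise softmax; subtracting the row max avoids overflow in exp."""
    shifted = scores - scores.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)


def attend(query_tokens, context_tokens, w_q, w_k, w_v):
    """Single-head scaled dot-product attention with a residual connection.

    query_tokens ask the questions; context_tokens supply keys and values.
    Returns the updated query tokens and the attention weights.
    """
    q = query_tokens @ w_q
    k = context_tokens @ w_k
    v = context_tokens @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])  # scale by sqrt of the head dimension
    weights = stable_softmax(scores)
    return query_tokens + weights @ v, weights


rng = np.random.default_rng(0)
dim = 8
prompt_tokens = rng.normal(size=(3, dim))   # toy: 3 prompt tokens
image_tokens = rng.normal(size=(16, dim))   # toy: 4x4 grid of patch tokens
# Simplification: one shared set of random projections for every step.
w_q, w_k, w_v = (rng.normal(size=(dim, dim)) for _ in range(3))

# 1) Prompt self-attention: prompt tokens exchange information.
prompts, _ = attend(prompt_tokens, prompt_tokens, w_q, w_k, w_v)
# 2) Prompt-to-image cross-attention: prompts query the image tokens.
prompts, prompt_to_image = attend(prompts, image_tokens, w_q, w_k, w_v)
# 3) Image-to-prompt cross-attention: image tokens query the updated prompts.
images, image_to_prompt = attend(image_tokens, prompts, w_q, w_k, w_v)

print(prompt_to_image.shape, image_to_prompt.shape)
```

Each row of an attention-weight matrix sums to 1, so every updated token is its original value plus a weighted average of the projected context values.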

Creating our SageMaker AI Domain (1:07)
Create a SageMaker AI domain in the London region using the quick setup; organizations may tailor roles, encryption, and MLflow, and the process may take some time.
Starting Domain and Understanding Pricing (3:27)
Open Studio to create a JupyterLab space in SageMaker Studio, choosing a distribution image and an ml.t3.medium instance, and learn pricing in the London region including the free tier.
Installing Libraries (4:13)
Install and configure libraries to run the segment anything model in SageMaker, including pip installs and loading a Facebook Research model from S3.
Stopping Instances and Servers (0:37)
Stop instances and servers to avoid charges: save your work, shut everything down, delete all workspaces, and close the window.

Downloading the SAM Model from Meta (3:27)
Download Meta's SAM model, upload the 2.4 GB ViT-H variant to an S3 bucket, and prepare to load it later in a Jupyter notebook via SageMaker AI.
Updating IAM Permissions (1:59)
Update the IAM role to include S3 access by attaching the AmazonS3FullAccess policy (keeping the least-privilege principle in mind), then rerun the cell to apply the changes.
Importing Libraries (5:51)
Import essential libraries for a computer vision workflow, including numpy, torch, matplotlib, PIL, cv2, and sam_model_registry and SamAutomaticMaskGenerator from segment_anything.
Understanding how we use Rekognition with SAM (13:20)
Learn how to integrate AWS Rekognition with the SAM model to produce labels and segmentation masks, using bounding boxes and overlap checks, with S3 bucket setup for inputs and outputs.
Defining Helper Functions (13:21)
Define helper functions to load models from S3 and manage temporary files for vision tasks. Implement image loading, RGB conversion, size checks, and Lanczos-based resizing with aspect ratio preservation.
Clarification Regarding Helper Functions (1:42)
Explain why we should avoid double RGB conversion in helper functions, keep the code in RGB throughout for consistency, and discuss productionizing by converting once to improve efficiency.
Rekognition Detection and Filtering (13:41)
Learn to implement a Rekognition-based detection pipeline with min confidence threshold and filtering by target labels, convert percentage bounding boxes to pixels, deduplicate overlaps, and return unique detections.
Initialise SAM Model from S3 (10:56)
Initialize the SAM model from S3, download heavy weights, and load it on CPU or GPU with memory-efficient settings, then configure the automatic mask generator parameters for robust segmentation.
Main Processing Function Part 1 (14:59)
Implement the main processing function to extract cars and pedestrians with segmentation masks from images, using a Rekognition step and SAM segmentation, with CPU and GPU handling.
Main Processing Function Part 2 (4:08)
Execute the main processing function to find the best Rekognition and SAM matches, merge labels with precise masks, and return the image, masks, and objects; then run the processing loop.
Running the Main Processing Cell (5:04)
Run the main processing cell to trigger the pipeline, download the image from S3, detect objects with AWS Rekognition, and generate SAM masks. Print results with confidence and mask area.
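
The filtering and box-conversion step described in this section might look roughly like the sketch below. It works on a dictionary shaped like a Rekognition DetectLabels response, where each bounding box is given as ratios of the image width and height. The function name, threshold, and sample data are hypothetical, and the course's own helper may differ.

```python
def extract_detections(response, img_w, img_h, target_labels, min_conf=70.0):
    """Keep target-label instances above min_conf and convert boxes to pixels.

    `response` is assumed to have the shape returned by Rekognition's
    DetectLabels: Labels -> Instances -> BoundingBox with Left/Top/Width/Height
    expressed as fractions of the image size.
    """
    detections = []
    for label in response.get("Labels", []):
        if label["Name"] not in target_labels:
            continue
        for instance in label.get("Instances", []):
            if instance["Confidence"] < min_conf:
                continue
            box = instance["BoundingBox"]
            detections.append({
                "label": label["Name"],
                "confidence": instance["Confidence"],
                # (x, y, w, h) in pixels
                "bbox": (
                    round(box["Left"] * img_w),
                    round(box["Top"] * img_h),
                    round(box["Width"] * img_w),
                    round(box["Height"] * img_h),
                ),
            })
    return detections


# Hand-written sample in the DetectLabels response shape (not real output).
sample_response = {
    "Labels": [
        {"Name": "Car", "Confidence": 98.0, "Instances": [
            {"Confidence": 97.5,
             "BoundingBox": {"Left": 0.1, "Top": 0.2, "Width": 0.3, "Height": 0.4}},
            {"Confidence": 55.0,
             "BoundingBox": {"Left": 0.6, "Top": 0.6, "Width": 0.1, "Height": 0.1}},
        ]},
        {"Name": "Person", "Confidence": 92.0, "Instances": [
            {"Confidence": 91.0,
             "BoundingBox": {"Left": 0.5, "Top": 0.5, "Width": 0.1, "Height": 0.2}},
        ]},
        {"Name": "Tree", "Confidence": 88.0, "Instances": []},
    ]
}

detections = extract_detections(sample_response, 1000, 500, {"Car", "Person"}, min_conf=80.0)
for det in detections:
    print(det["label"], det["confidence"], det["bbox"])
```

In a notebook, the response would come from a call such as boto3.client("rekognition").detect_labels(Image={"S3Object": {"Bucket": ..., "Name": ...}}, MinConfidence=...), and the image size from the loaded image.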

Visualizing Rekognition Detections (8:31)
Visualize the raw Rekognition detections with bounding boxes and labels using matplotlib, detailing how to plot, style, and annotate detected objects.
Visualize All SAM Masks (9:57)
Visualize all SAM masks by overlaying random color masks on the image, sorting by area, and applying alpha blending to reveal masks clearly.
Visualizing Match Quality IoU Scores Part 1 (10:59)
Visualize match quality by computing the intersection over union (IoU) between Rekognition and SAM bounding boxes, converting x, y, w, h to corner coordinates, and assessing overlap to gauge alignment.
Visualizing Match Quality IoU Scores Part 2 (11:16)
Visualize IoU scores for matched results by plotting a matplotlib bar chart that colors bars by score and labels each bar with the object and score, showing Rekognition vs SAM.
Visualizing Image Segmentations with Bounding Boxes (11:16)
Visualize segmented objects by drawing bounding boxes around detections and overlaying color masks on the original image to highlight cars and pedestrians with OpenCV.
Visualizing Image Segmentations with Bounding Boxes Part 2 (0:59)
Refine image visualizations by cleaning up plots, removing excess numbers and whitespace with tight layout, then save the image and prepare a new cell for future visualizations.
Visualizing Masks and Labels Without Bounding Boxes (6:59)
Visualize segmentation masks without bounding boxes by overlaying colored, transparent masks and bold labels on the original image to show final segmented objects.
Visualizing Segmentations in Black and White Masks (5:13)
Visualize segmentation results by rendering matched masks as white on a black canvas the same size as the original image, using numpy zeros to create the canvas and overlaying masks.
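
The IoU match-quality check covered in this section can be sketched in a few lines of plain Python. The function names are illustrative, and both boxes are assumed to be in (x, y, w, h) pixel format, which is the format SAM's automatic mask generator uses for its bbox field.

```python
def xywh_to_xyxy(box):
    """Convert (x, y, w, h) to corner coordinates (x1, y1, x2, y2)."""
    x, y, w, h = box
    return x, y, x + w, y + h


def iou(box_a, box_b):
    """Intersection over union of two (x, y, w, h) boxes, in the range 0..1."""
    ax1, ay1, ax2, ay2 = xywh_to_xyxy(box_a)
    bx1, by1, bx2, by2 = xywh_to_xyxy(box_b)
    # Width and height of the overlap rectangle, clamped at zero when disjoint.
    inter_w = max(0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0, min(ay2, by2) - max(ay1, by1))
    intersection = inter_w * inter_h
    union = box_a[2] * box_a[3] + box_b[2] * box_b[3] - intersection
    return intersection / union if union > 0 else 0.0


# Hypothetical boxes: one from Rekognition, one from a SAM mask.
rekognition_box = (0, 0, 10, 10)
sam_box = (5, 5, 10, 10)
print(f"IoU = {iou(rekognition_box, sam_box):.3f}")
```

Here the overlap is a 5x5 square (area 25) and the union is 100 + 100 - 25 = 175, so the IoU is 25/175, about 0.143. A pipeline would typically keep the SAM mask with the highest IoU for each detection, above some chosen threshold.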

Saving Metadata to S3 (9:46)
Save segmentation results to S3 by creating a save_results_s3 function, generating a timestamped file name, and uploading a metadata JSON with detections and labels.
Save Images to S3 (11:16)
Save a matplotlib generated image with segmentation masks, bounding boxes, and labels to a memory buffer, then upload the PNG to S3 using a timestamped, annotated key.
Saving Individual Masks to S3 (8:54)
Save each object's binary mask as a separate file on S3 by looping through results, converting boolean masks to grayscale, buffering, and uploading with image-specific names for AI model training.
Minor Correction (0:52)
Move the if block outside the function to enable invocation, and only run the next function when the matched results length is greater than zero.
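
As a rough illustration of the metadata-saving step, the sketch below builds a timestamped key and uploads a JSON document through an S3 client's put_object call. The helper names, key layout, and timestamp format are assumptions made for this example; the course's save_results_s3 function may be structured differently.

```python
import json
from datetime import datetime, timezone
from pathlib import PurePosixPath


def build_key(prefix, image_name, suffix, now=None):
    """Build an S3 key like '<prefix>/<image stem>_<UTC timestamp><suffix>'."""
    now = now or datetime.now(timezone.utc)
    stamp = now.strftime("%Y%m%dT%H%M%SZ")
    stem = PurePosixPath(image_name).stem  # drop folder and extension
    return f"{prefix}/{stem}_{stamp}{suffix}"


def save_metadata(s3_client, bucket, prefix, image_name, detections, now=None):
    """Upload detections as JSON and return the key that was written.

    `s3_client` is expected to behave like boto3.client("s3").
    """
    key = build_key(prefix, image_name, "_metadata.json", now)
    body = json.dumps({"image": image_name, "detections": detections}, indent=2)
    s3_client.put_object(
        Bucket=bucket,
        Key=key,
        Body=body.encode("utf-8"),
        ContentType="application/json",
    )
    return key
```

In a notebook this would be called as save_metadata(boto3.client("s3"), "my-bucket", "results", "scene_01.jpg", detections), where the bucket and prefix are placeholders. Passing the client in as a parameter keeps the function easy to test with a stand-in object.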

Adding a GPU Server to our Notebook and AWS Quotas (5:42)
Learn how to add a GPU-enabled notebook server on SageMaker, upload and test images from a bucket, and navigate AWS quotas to request higher notebook instance limits.
Testing Our Full Pipeline (8:43)
Test the full SAM + Vision Transformers pipeline by loading the model, running Rekognition and segmentation, visualizing masks and bounding boxes, and saving outputs to your bucket while tuning confidence thresholds.
Minor Corrections (13:40)
Optimize memory in a SAM + Vision Transformers workflow by clearing heavy variables, forcing garbage collection, and caching models, then download small images, monitor GPU memory, and save segmentation masks to S3.
Productionizing + Cleanup (7:06)
Separate SAM loading into its own cell, batch-infer over an S3 folder, and use manual prompt mode with bounding boxes to mask only targeted regions.

Requirements

Basic Python
High-school math

Description

Building a successful computer vision product—especially for self-driving car perception—starts with two things: strong foundations and real, scalable systems.

In this course, you’ll learn how to build your own autonomous driving–style vision pipeline using Meta’s Segment Anything Model (SAM), Vision Transformers (ViTs), and AWS Rekognition—while actually understanding the math and intuition behind how these models work.

We begin by exploring Vision Transformers from the ground up, focusing on clear, intuitive explanations of patch embeddings, attention mechanisms, and model representations. You’ll see the underlying mathematics of attention, embeddings, and similarity—and how these ideas translate into the perception capabilities modern self-driving stacks rely on. From there, we dive into Meta’s SAM architecture, explaining how prompts, embeddings, and mask decoding work together to produce high-quality segmentation results—again connecting the math to the behavior you observe, without treating the model as a black box.

You’ll then see how these open-source models fit into real-world self-driving perception workflows. We integrate AWS Rekognition for high-level detection and metadata extraction, and combine it with SAM to create automated, pixel-level labeling pipelines—the kind used to scale dataset creation for autonomous driving. Throughout, you’ll learn how model outputs (scores, embeddings, masks) relate to the underlying objectives and representations that make the pipeline reliable.

A strong emphasis is placed on visualization and practical understanding. You’ll inspect masks, bounding boxes, confidence signals, embeddings, and failure cases, and learn how mathematical concepts translate directly into model behavior you can observe, debug, and improve—critical when building perception systems for safety-sensitive applications like self-driving cars.

By the end of the course, you won’t just know how to run SAM or call an AWS API. You’ll understand why the models work, how to combine managed cloud services with open-source research, and how to think like someone building a real computer vision startup focused on scalable autonomous vehicle perception—not just a demo.

This course is ideal if you want to go beyond surface-level tutorials and gain a clear, intuitive understanding of modern computer vision systems—from the math behind Transformers and segmentation to production-grade perception pipelines used in autonomous driving.

Who this course is for:

Machine Learning Engineers who want to build real-world computer vision pipelines beyond toy examples
Computer Vision Engineers looking to apply SAM and Vision Transformers in production workflows
Data Scientists who want to automate image labeling and accelerate dataset creation
AI Engineers interested in combining open-source vision models with AWS services
Software Engineers transitioning into applied machine learning and computer vision

Course content

What We Are Building (2 lectures • 5min)

Mathematics behind Vision Transformers (5 lectures • 50min)

Mathematics Behind Meta's SAM (Segment Anything Model) (12 lectures • 1hr 34min)

Setting up Our AWS Environment (4 lectures • 9min)

Setting up Open Source Models Like Meta's SAM (11 lectures • 1hr 28min)

Visualizing our Outputs (8 lectures • 1hr 5min)

Saving Results to S3 (4 lectures • 31min)

Testing + Setup (4 lectures • 35min)
