Transformers in Computer Vision - English version

Rating: 4.0 (107 reviews)

Created by Coursat.ai Dr. Ahmad ElSallab

Last updated 1/2023

English

What you'll learn

What are transformer networks?
State of the Art architectures for CV Apps like Image Classification, Semantic Segmentation, Object Detection and Video Processing
Practical application of SoTA architectures like ViT, DETR, SWIN in Huggingface vision transformers
Attention mechanisms as a general Deep Learning idea
Inductive Bias and the landscape of DL models in terms of modeling assumptions
Transformers application in NLP and Machine Translation
Transformers in Computer Vision
Different types of attention in Computer Vision

Course content

10 sections • 38 lectures • 5h 32m total length

Introduction • 3:57

The Rise of Transformers • 10:19
Inductive Bias in Deep Neural Network Models • 23:53
Attention is a General DL idea • 7:46
Explore how attention mechanisms differ from fully connected layers by dynamically weighting input features via self-attention, enabling context-aware connections across image patches and words.
Attention in NLP • 20:47
Attention is ALL you need • 4:04
Self Attention Mechanisms • 12:45
Self Attention Matrix Equations • 6:36
Explore the self-attention mechanism as a matrix operation on word embeddings, using Q, K, V projections, dot products, and softmax to compute token-wise attention maps.
Multihead Attention • 9:49
Learn how multi-head attention in transformers encodes multiple feature views with separate Q, K, V projections, then concatenates heads to produce richer representations for computer vision.
Encoder-Decoder Attention • 6:15
Learn encoder-decoder attention in transformers, where the encoder uses self-attention and multi-head attention, and the decoder uses autoregressive decoding with encoder keys and values.
Transformers Pros and Cons • 9:53
Unsupervised Pre-training • 7:56
Unsupervised pre-training lets transformers learn from massive unlabeled data through generative, context-based, and cross-modal self-supervised tasks, boosting accuracy on downstream supervised tasks.
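
The "Self Attention Matrix Equations" lecture listed above describes self-attention as Q, K, V projections followed by dot products and a softmax. A minimal sketch of that computation is given here in plain Python, so it runs without extra libraries. The function names, the toy inputs, and the identity projection matrices are illustrative assumptions, not course material; the division by sqrt(d_k) follows the standard scaled dot-product formulation.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(x, w_q, w_k, w_v):
    """Scaled dot-product self-attention.

    x is (n_tokens x d_model); each w_* is (d_model x d_k).
    Returns the attended output and the (n_tokens x n_tokens) attention map.
    """
    q = matmul(x, w_q)                        # queries
    k = matmul(x, w_k)                        # keys
    v = matmul(x, w_v)                        # values
    d_k = len(k[0])
    k_t = [list(col) for col in zip(*k)]      # transpose of K
    scores = matmul(q, k_t)                   # pairwise dot products
    attn = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(attn, v), attn


# Toy example (assumed values): three 2-dimensional "token" embeddings,
# with identity projections so that Q = K = V = x.
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
out, attn = self_attention(x, identity, identity, identity)
```

Each row of `attn` sums to 1 and says how strongly one token attends to every other token. Multi-head attention repeats this with separate projection matrices per head and concatenates the outputs.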

Transformers in Object detection • 1:22
Object Detection methods review • 4:26
Compare traditional and deep learning object detection methods, from region proposals and one- and two-stage detectors to transformer-based approaches, using CNN feature maps and confidence-based filtering for multi-object classification.
Object Detection with ConvNet - YOLO • 25:13
DEtection TRansformers (DETR) • 19:46
DETR vs. YOLOv5 use case • 13:26

Module roadmap • 1:00
Explore transformers in image segmentation within computer vision, reviewing ConvNet-based segmentation and the U-Net architecture, and compare the Segmentation Transformer and TransUNet approaches for semantic and panoptic segmentation.
Image Segmentation using ConvNets • 11:29
Explore semantic, instance, and panoptic segmentation, and how ConvNets with encoder-decoder structures like U-Net and FPN produce pixelwise masks and labels.
Image Segmentation using Transformers • 12:30
Explore transformer-based image segmentation, replacing encoders with self-attention and using patch embeddings, for semantic, instance, and panoptic segmentation via Segmenter and Detection Transformer variants.
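
The lecture descriptions above mention patch embeddings. As a rough illustration, the first step is to cut an image into non-overlapping square patches and flatten each one; a ViT-style model then applies a learned linear projection to each flattened patch, which is not shown here. The function name, the single-channel list-of-lists image, and the 4x4 toy input are assumptions made for this sketch.

```python
def image_to_patches(image, patch_size):
    """Split a single-channel image (H x W list of lists) into flattened,
    non-overlapping patch_size x patch_size patches, in row-major order."""
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image dimensions must be divisible by patch_size")
    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            # Flatten one patch row by row into a single vector.
            patch = [
                image[r][c]
                for r in range(top, top + patch_size)
                for c in range(left, left + patch_size)
            ]
            patches.append(patch)
    return patches


# Toy 4x4 "image" whose pixel value encodes its position (assumed data).
image = [[r * 4 + c for c in range(4)] for r in range(4)]
patches = image_to_patches(image, 2)
# patches[0] is the top-left 2x2 block: [0, 1, 4, 5]
```

The resulting sequence of patch vectors plays the same role for the transformer as the sequence of word embeddings does in NLP.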

Module roadmap • 2:29
Explore practical usage of transformers with pre-trained models using the Hugging Face Transformers pipeline API. Apply these vision transformer architectures to image classification, segmentation, and object detection.
Huggingface Pipeline overview • 5:47
Operate the Hugging Face pipeline to connect tasks, pre-trained models, preprocessing, and post-processing, delivering outputs from any input, with model hub options for sentiment analysis, question answering, and summarization.
Huggingface vision transformers • 7:27
Huggingface Demo using Gradio • 5:36

Requirements

Practical Machine Learning course
Practical Computer Vision course (ConvNets)
Introduction to NLP course

Description

Transformer networks are the current trend in Deep Learning. Transformer models have taken the world of NLP by storm since 2017, and they have since become the mainstream model in almost ALL NLP tasks. Transformers in CV are still lagging, but they have started to take over since 2020.

We will start by introducing attention and the transformer networks. Since transformers were first introduced in NLP, they are easier to describe with an NLP example first. From there, we will understand the pros and cons of this architecture. We will also discuss the importance of unsupervised or semi-supervised pre-training for transformer architectures, briefly covering Large Scale Language Models (LLM) like BERT and GPT.

This will pave the way to introduce transformers in CV. Here we will try to extend the attention idea into the 2D spatial domain of the image. We will discuss how convolution can be generalized using self-attention, within the encoder-decoder meta-architecture. We will see how this generic architecture is almost the same for images as for text, which makes transformers a generic function approximator. We will discuss channel and spatial attention and local vs. global attention, among other topics.

In the next three modules, we will discuss the specific networks that solve the big problems in CV: classification, object detection and segmentation. We will discuss Vision Transformer (ViT) from Google, Shifted Window Transformer (SWIN) from Microsoft, Detection Transformer (DETR) from Facebook research, Segmentation Transformer (SETR) and many others. Then we will discuss the application of Transformers in video processing, through Spatio-Temporal Transformers with application to Moving Object Detection, along with a Multi-Task Learning setup.

Finally, we will show how those pre-trained architectures can be easily applied in practice using the Huggingface library through its Pipeline interface.
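
As a rough sketch of what the Pipeline interface looks like for a vision task, the snippet below wraps the `pipeline` function from the `transformers` package for image classification. It assumes `transformers`, a backend such as PyTorch, and Pillow are installed, and that the `google/vit-base-patch16-224` checkpoint can be downloaded from the model hub. The helper names and the image path are hypothetical and are not taken from the course.

```python
def classify_image(image_path, model_name="google/vit-base-patch16-224"):
    """Run a pre-trained ViT image classifier through the pipeline API.

    The import is done lazily so this module still loads when the
    transformers package is not installed (an assumption of this sketch).
    """
    from transformers import pipeline

    classifier = pipeline("image-classification", model=model_name)
    # The pipeline handles preprocessing, the forward pass, and
    # post-processing; it returns a list of {"label", "score"} dicts.
    return classifier(image_path)


def top_prediction(predictions):
    """Pick the highest-scoring entry from a list of {"label", "score"} dicts."""
    return max(predictions, key=lambda p: p["score"])


# Example usage (hypothetical local file; requires a model download):
# predictions = classify_image("cat.jpg")
# print(top_prediction(predictions)["label"])
```

Other task names, such as "object-detection" and "image-segmentation", follow the same pattern with a suitable checkpoint.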

Who this course is for:

Intermediate to Advanced CV Engineers
Intermediate to Advanced CV Researchers

Course content

Introduction • 1 lecture • 4min
Overview of Transformer Networks • 11 lectures • 2hr
Transformers in Computer Vision • 8 lectures • 51min
Transformers in Image Classification • 3 lectures • 27min
Transformers in Object Detection • 5 lectures • 1hr 4min
Transformers in Semantic Segmentation • 3 lectures • 25min
Spatio-Temporal Transformers • 1 lecture • 16min
Huggingface Vision Transformers • 4 lectures • 21min
Conclusion • 1 lecture • 4min
Material • 1 lecture • 1min