Deep Learning for NLP - Part 5

Rating: 4.8 (5 reviews)

Part 5: Efficient Transformer models

Created by Manish Gupta

Last updated 7/2021

English

What you'll learn

Deep Learning for Natural Language Processing
Efficient Transformer Models: Star Transformers, Sparse Transformers, Reformer, Longformer, Linformer, Synthesizer
Efficient Transformer Models: ETC (Extended Transformer Construction), Big Bird, Linear Transformer, Performer, Sparse Sinkhorn Transformer, Routing Transformers
Efficient Transformer benchmark: Long Range Arena
Comparison of various efficient Transformer methods
DL for NLP

Course content

2 sections • 18 lectures • 3h 31m total length

Introduction (3:52)
Explore efficient transformers that tackle the quadratic time and memory growth with sequence length, addressing long documents and multimodal inputs through models like Star Transformers, Reformer, Longformer, and Synthesizer.
Star Transformers (18:10)
Replace quadratic attention with a star-shaped topology using star transformers, introducing relay and satellite nodes with radical and ring connections to achieve linear complexity and faster attention for NLP tasks.
Sparse Transformers (18:01)
Explore sparse transformers that enable efficient self-attention over ultra-long sequences using fixed and strided attention patterns, memory-saving recomputation during the backward pass, and redesigned residual blocks for scaling to tens of thousands of tokens.
Reformer (20:13)
Reformer replaces dot-product attention with locality-sensitive hashing (LSH) attention, reducing attention cost from quadratic to O(L log L) in the sequence length L, while reversible residual layers and chunked feed-forward activations cut memory usage.
Longformer (11:39)
Longformer blends dilated sliding window attention with global tokens to capture local and global context, delivering linear memory and optimized sparse implementations for efficient long-sequence processing.
Linformer (11:02)
Linformer approximates self-attention by projecting keys and values to a shorter, low-rank representation, giving linear complexity and faster inference; projections can be shared across heads or layers to balance memory, latency, and accuracy.
Synthesizer (15:54)
Summary (2:13)
Explore efficient transformer architectures for long sequences, including Sparse Transformers, Longformer with dilated and global windows, Reformer with LSH attention, and projection-based linearization of Q, K, and V.
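As a rough illustration of the local-plus-global attention idea mentioned for Longformer, the sketch below enumerates which (query, key) pairs are allowed when every token sees a small window around itself plus a few global tokens. The function name `sliding_window_pairs` and its parameters are hypothetical, dilation is left out for brevity, and this is not the Longformer implementation itself.

```python
def sliding_window_pairs(seq_len, window, global_positions=()):
    """Return the set of (query, key) index pairs allowed by a sliding
    window of width `window` plus optional global tokens.

    Hypothetical helper for illustration only; dilation is omitted.
    """
    half = window // 2
    pairs = set()
    for q in range(seq_len):
        # Local attention: each query sees keys within `half` positions.
        for k in range(max(0, q - half), min(seq_len, q + half + 1)):
            pairs.add((q, k))
        for g in global_positions:
            pairs.add((q, g))  # every token attends to each global token
            pairs.add((g, q))  # each global token attends to every token
    return pairs


if __name__ == "__main__":
    for n in (8, 16, 32, 64):
        local = len(sliding_window_pairs(n, window=3))
        print(f"{n:3d} tokens: {local:4d} local pairs vs {n * n:5d} full pairs")
```

With a fixed window, the number of allowed pairs grows in proportion to the sequence length, while full attention grows with its square.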

Introduction (2:49)
Explore six efficient transformer methods to reduce quadratic attention complexity, including Star Transformers, Reformer, Longformer, Linformer, Synthesizer, and ETC (Extended Transformer Construction), with benchmark implications.
ETC (Extended Transformer Construction) (20:19)
Explore Extended Transformer Construction (ETC), which combines relative position encodings with global and local sparse attention, using global tokens and a contrastive predictive coding (CPC) loss for structured input.
Big Bird (14:43)
The Big Bird model blends global, local, and random attention to handle long sequences, achieving state-of-the-art results on masked language modeling (MLM), QA, summarization, and genomics benchmarks.
Linear attention Transformer (13:24)
Learn how linear attention converts quadratic complexity to linear for autoregressive, long-sequence modeling, using causal updates, kernel-based similarity, and fast, memory-efficient processing.
Performer (21:29)
Explore the Performer, a transformer that achieves linear space and time complexity by approximating softmax attention with kernel-based feature maps, including positive random features and orthogonal variants.
Sparse Sinkhorn Transformer (11:54)
Explore the Sparse Sinkhorn Transformer, where queries attend to key blocks selected by a trainable feed-forward sorting network over block embeddings, using Sinkhorn normalization to obtain the block permutation and optionally mixing the result with standard attention.
Routing Transformers (7:47)
Explore Routing Transformers, which replace locality-sensitive hashing with k-means clustering to route attention among queries and keys, achieving memory-efficient sparse attention with exponential-moving-average cluster updates and causal attention.
Efficient Transformer benchmark: Long Range Arena (8:07)
Explore the Long Range Arena benchmark, which evaluates six long-sequence tasks, from list operations to Pathfinder image tasks, and compares models like Big Bird and Linear Transformer on accuracy, latency, and memory tradeoffs.
Comparison of various efficient Transformer methods (8:00)
Compare efficient transformer methods across axes of memory and complexity, detailing fixed versus learnable patterns, local and global attention, external and internal memory, compression, and recurrence for long-range sequences.
Summary (2:14)
Explore efficient transformers for long-range NLP tasks, surveying global, local, sparse, and dynamic attention methods like Big Bird, Linear Transformers, and Routing Transformers.
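The kernel-based linearization covered in the linear attention lecture can be sketched as follows. The code compares the usual "score every query against every key" ordering with the reordered computation that first summarizes keys and values, and the two agree up to rounding. The feature map elu(x) + 1, the non-causal setting, and all function names are assumptions made for illustration; this is a plain-Python sketch, not the course's or any library's implementation.

```python
import math


def feature_map(x):
    """elu(x) + 1 applied element-wise (assumed positive feature map)."""
    return [v + 1.0 if v > 0 else math.exp(v) for v in x]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def quadratic_attention(Q, K, V):
    """Kernel attention computed pair by pair: O(n^2) similarity scores."""
    phi_K = [feature_map(k) for k in K]
    out = []
    for q in Q:
        weights = [dot(feature_map(q), fk) for fk in phi_K]
        total = sum(weights)
        out.append([sum(w * v[j] for w, v in zip(weights, V)) / total
                    for j in range(len(V[0]))])
    return out


def linear_attention(Q, K, V):
    """Same result, but keys and values are summarized once: O(n) in length."""
    d, m = len(K[0]), len(V[0])
    S = [[0.0] * m for _ in range(d)]  # running sum of phi(k) v^T
    z = [0.0] * d                      # running sum of phi(k)
    for k, v in zip(K, V):
        fk = feature_map(k)
        for i in range(d):
            z[i] += fk[i]
            for j in range(m):
                S[i][j] += fk[i] * v[j]
    out = []
    for q in Q:
        fq = feature_map(q)
        denom = dot(fq, z)
        out.append([sum(fq[i] * S[i][j] for i in range(d)) / denom
                    for j in range(m)])
    return out


if __name__ == "__main__":
    Q = [[0.5, -1.0], [1.5, 0.2], [-0.3, 0.8]]
    K = [[1.0, 0.0], [-0.5, 0.7], [0.2, -1.2]]
    V = [[1.0, 2.0], [0.0, -1.0], [3.0, 0.5]]
    print(quadratic_attention(Q, K, V))
    print(linear_attention(Q, K, V))
```

The causal variant mentioned in the lecture description would update `S` and `z` one position at a time instead of summing over all keys up front.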

Requirements

Basics of machine learning
Basic understanding of Transformer based models and word embeddings

Description

This course is part of the "Deep Learning for NLP" series. In this course, I will talk about various design schemes for efficient Transformer models. These techniques will come in handy for academic as well as industry participants. For industry use cases, Transformer models have been shown to achieve very high accuracy across many NLP tasks. But their memory and computational costs are quadratic in the sequence length, which makes them very difficult to ship. This course, which focuses on methods to make Transformers efficient, is therefore important for anyone who wants to ship Transformer models as part of their products.

Time and activation memory in Transformers grow quadratically with the sequence length. This is because in every layer, every attention head computes a transformed representation for every position by "paying attention" to tokens at every other position. Quadratic complexity means that, in practice, the maximum input size is rather limited. Thus, we cannot extract semantic representations for long documents by passing them whole as input to Transformers. Hence, in this module we will talk about methods to address this challenge.
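To make the quadratic growth concrete, the sketch below counts the query-key scores that full self-attention computes in one forward pass. The layer and head counts (12 each) and the 4-byte float32 score size are illustrative assumptions, not figures from the course, and the function names are hypothetical.

```python
def attention_score_entries(seq_len, num_heads=12, num_layers=12):
    """Count query-key scores computed by full self-attention in one pass.

    num_heads and num_layers are assumed, illustrative defaults.
    """
    # Every head in every layer scores each position against every position.
    return num_layers * num_heads * seq_len * seq_len


def score_memory_mib(seq_len, bytes_per_score=4, **kwargs):
    """Memory to keep all scores, assuming float32 (4 bytes) per score."""
    return attention_score_entries(seq_len, **kwargs) * bytes_per_score / 2**20


if __name__ == "__main__":
    for n in (512, 1024, 2048, 4096):
        print(f"{n:5d} tokens: {attention_score_entries(n):>13,d} scores, "
              f"{score_memory_mib(n):8.1f} MiB")
```

Doubling the sequence length multiplies both the score count and the memory estimate by four, which is why long documents quickly become impractical.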

The course consists of two main sections. Across the two sections, I will talk about efficient Transformer models, an efficient Transformer benchmark, and a comparison of various efficient Transformer methods.

In the first section, I will talk about methods like Star Transformers, Sparse Transformers, Reformer, Longformer, Linformer, Synthesizer.

In the second section, I will talk about methods like ETC (Extended Transformer Construction), Big Bird, Linear attention Transformer, Performer, Sparse Sinkhorn Transformer, and Routing Transformers. Long Range Arena is a recent benchmark for evaluating models on long-sequence tasks with respect to accuracy, memory usage, and inference time. We will discuss Long Range Arena in detail and finally wrap up with a philosophical categorization of various efficient Transformer methods.

For each method, we will discuss the specific optimization scheme, the architecture, and the results obtained for pretraining as well as downstream tasks.

Who this course is for:

Beginners in deep learning
Python developers interested in data science concepts
Masters or PhD students who wish to learn deep learning concepts quickly
Folks wanting to ship their products across regions and languages (internationalization of their learning/predictive/generative models)

Efficient Transformers: Part 1 • 8 lectures • 1hr 41min

Efficient Transformers: Part 2 • 10 lectures • 1hr 51min
