Deep Learning

RNN, Transformers & Attention

Recurrent networks, LSTM, transformer architecture, and attention mechanisms.

RNN & LSTM: Mastering Sequence Data

Why Recurrent Networks?

Feedforward networks assume independent inputs. For sequences (time series, text, audio), we need memory. RNNs maintain a hidden state that carries information across time steps.

hₜ = tanh(W·[hₜ₋₁, xₜ] + b) → yₜ = W_y·hₜ + b_y

Parameters are shared across time steps: the same W, b, W_y, and b_y are used at every step.
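
The recurrence above can be sketched in a few lines of NumPy. This is a minimal illustration rather than a reference implementation: the sizes, the random weights, and the helper name rnn_forward are assumptions made for the example.

```python
import numpy as np

def rnn_forward(x_seq, W, b, W_y, b_y, h0):
    """Run h_t = tanh(W·[h_{t-1}, x_t] + b) and y_t = W_y·h_t + b_y over a sequence."""
    h = h0
    hs, ys = [], []
    for x_t in x_seq:                      # the same W, b, W_y, b_y at every step
        h = np.tanh(W @ np.concatenate([h, x_t]) + b)
        hs.append(h)
        ys.append(W_y @ h + b_y)
    return np.stack(hs), np.stack(ys)

# Illustrative sizes (assumptions, not taken from the text)
rng = np.random.default_rng(0)
input_size, hidden_size, output_size, seq_len = 3, 4, 2, 5
W = rng.normal(scale=0.5, size=(hidden_size, hidden_size + input_size))
b = np.zeros(hidden_size)
W_y = rng.normal(scale=0.5, size=(output_size, hidden_size))
b_y = np.zeros(output_size)
x_seq = rng.normal(size=(seq_len, input_size))

hs, ys = rnn_forward(x_seq, W, b, W_y, b_y, h0=np.zeros(hidden_size))
print(hs.shape, ys.shape)  # (5, 4) (5, 2)
```

Note that the number of parameters does not depend on seq_len: only the loop gets longer.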

Vanilla RNN & Backpropagation Through Time

RNN Cell

hₜ = tanh(W_ih·xₜ + b_ih + W_hh·hₜ₋₁ + b_hh)

Hidden state combines current input and previous hidden state.

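# PyTorch RNN cell (single step)
import torch
import torch.nn as nn

rnn_cell = nn.RNNCell(input_size=10, hidden_size=20)

seq_len = 5
x = torch.randn(seq_len, 1, 10)  # example input: (seq_len, batch, input_size)
h = torch.zeros(1, 20)           # initial hidden state: (batch, hidden_size)
for t in range(seq_len):
    h = rnn_cell(x[t], h)        # one step: x[t] is (batch, input_size)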

Backprop Through Time

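Gradients flow backward through time steps. The chain rule multiplies across many tanh derivatives and copies of W_hh → vanishing/exploding gradients.

Problem: vanilla RNNs struggle to learn dependencies spanning more than about 10 time steps (a rule of thumb, not a hard limit).

# BPTT conceptually: the gradient at step t receives a contribution from step t+1
for t in reversed(range(seq_len - 1)):
    # h[t+1] = tanh(... + W_hh @ h[t]), so the local tanh derivative is (1 - h[t+1]**2)
    grad_h[t] += W_hh.T @ (grad_h[t+1] * (1 - h[t+1]**2))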

Truncated BPTT: Limit the number of time steps backpropagated (e.g., 20-50 steps). Common in training language models.

The Vanishing Gradient Problem

Why gradients vanish

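During BPTT, the gradient passed from a late step back to an early one is ∏(W_hhᵀ · diag(tanh')), with one factor per intervening step. Since tanh' ≤ 1, repeated multiplication drives the gradient → 0 for long-term dependencies unless W_hh has large singular values, in which case it can explode instead.

Effect: a vanilla RNN struggles to learn relationships between distant tokens.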
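
To make the shrinking product concrete, here is a small NumPy experiment. It is a sketch under stated assumptions: the weight scale of 0.1, the hidden size, and the random inputs are chosen for illustration so that the effect is easy to see.

```python
import numpy as np

rng = np.random.default_rng(0)
hidden_size, seq_len = 16, 50

# Small random weights (an assumption chosen to make vanishing visible)
W_hh = rng.normal(scale=0.1, size=(hidden_size, hidden_size))
W_ih = rng.normal(scale=0.1, size=(hidden_size, hidden_size))

# Forward pass, keeping every hidden state
h = np.zeros(hidden_size)
hs = []
for _ in range(seq_len):
    h = np.tanh(W_ih @ rng.normal(size=hidden_size) + W_hh @ h)
    hs.append(h)

# Backward pass: push a unit-norm gradient from the last step toward the first
grad = np.ones(hidden_size) / np.sqrt(hidden_size)
norms = []
for h_t in reversed(hs):
    grad = W_hh.T @ (grad * (1.0 - h_t ** 2))   # W_hh^T · diag(tanh') · grad
    norms.append(np.linalg.norm(grad))

print(f"after 1 step: {norms[0]:.2e}, after {seq_len} steps: {norms[-1]:.2e}")
```

With larger recurrent weights the same loop can show the opposite failure mode, exploding gradients.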

Solutions

LSTM/GRU – gating preserves gradients
ReLU + proper init (helps but not robust)
Gradient clipping (for explosion)
Residual connections
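
Of the remedies listed above, gradient clipping is the simplest to show in isolation. The sketch below rescales a set of gradients by their global L2 norm; the function name clip_by_global_norm and the toy gradients are assumptions for illustration (PyTorch offers the same idea as torch.nn.utils.clip_grad_norm_).

```python
import numpy as np

def clip_by_global_norm(grads, max_norm):
    """Rescale a list of gradient arrays so their combined L2 norm is at most max_norm."""
    total_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    scale = min(1.0, max_norm / (total_norm + 1e-6))   # never scale gradients up
    return [g * scale for g in grads], total_norm

grads = [np.full((2, 2), 3.0), np.full(4, 4.0)]        # toy "exploding" gradients
clipped, norm_before = clip_by_global_norm(grads, max_norm=1.0)
norm_after = np.sqrt(sum(np.sum(g ** 2) for g in clipped))
print(norm_before, norm_after)
```

Clipping changes the length of the update but not its direction, which is why it addresses explosion and does nothing for vanishing gradients.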

LSTM – The Gated Solution

LSTM Gates

fₜ = σ(W_f·[hₜ₋₁, xₜ] + b_f) // forget gate
iₜ = σ(W_i·[hₜ₋₁, xₜ] + b_i) // input gate
oₜ = σ(W_o·[hₜ₋₁, xₜ] + b_o) // output gate
c̃ₜ = tanh(W_c·[hₜ₋₁, xₜ] + b_c) // candidate
cₜ = fₜ ⊙ cₜ₋₁ + iₜ ⊙ c̃ₜ // cell state
hₜ = oₜ ⊙ tanh(cₜ) // hidden state

Cell state acts as a gradient highway. Forget gate controls what to keep/erase. Gradients flow through addition, not multiplication.

LSTM in 30 seconds

Forget – reset cell state
Input – write new info
Output – expose cell state
Cell – long-term memory
Hidden – short-term / output

PyTorch LSTM – from nn.LSTM to custom cell

import torch
import torch.nn as nn

# Built-in LSTM
lstm = nn.LSTM(input_size=100, hidden_size=256, num_layers=2,
               batch_first=True, bidirectional=True)
x = torch.randn(8, 50, 100)   # example batch: (batch, seq, feature)
output, (h_n, c_n) = lstm(x)  # output: (8, 50, 512); h_n, c_n: (4, 8, 256)

# Manual LSTM cell (for understanding)
class LSTMCell(nn.Module):
    def __init__(self, input_size, hidden_size):
        super().__init__()
        self.fc = nn.Linear(input_size + hidden_size, hidden_size * 4)
        
    def forward(self, x, h, c):
        gates = self.fc(torch.cat([x, h], dim=1))
        f, i, o, c_tilde = gates.chunk(4, dim=1)
        f, i, o = torch.sigmoid(f), torch.sigmoid(i), torch.sigmoid(o)
        c = f * c + i * torch.tanh(c_tilde)
        h = o * torch.tanh(c)
        return h, c

GRU – LSTM's Leaner Cousin

GRU Gates (only two)

zₜ = σ(W_z·[hₜ₋₁, xₜ]) // update gate
rₜ = σ(W_r·[hₜ₋₁, xₜ]) // reset gate
h̃ₜ = tanh(W·[rₜ⊙hₜ₋₁, xₜ]) // candidate
hₜ = (1-zₜ)⊙hₜ₋₁ + zₜ⊙h̃ₜ

The update gate combines the roles of the LSTM's forget and input gates. Fewer parameters, often similar performance.
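
The GRU equations above translate almost line for line into NumPy. This is a minimal sketch: biases are omitted as in the equations, and the sizes, random weights, and the name gru_step are illustrative assumptions.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_step(x_t, h_prev, W_z, W_r, W_h):
    """One GRU step (biases omitted)."""
    hx = np.concatenate([h_prev, x_t])
    z = sigmoid(W_z @ hx)                                        # update gate
    r = sigmoid(W_r @ hx)                                        # reset gate
    h_tilde = np.tanh(W_h @ np.concatenate([r * h_prev, x_t]))   # candidate state
    return (1.0 - z) * h_prev + z * h_tilde                      # blend old and new

rng = np.random.default_rng(0)
input_size, hidden_size = 3, 4                                   # illustrative sizes
W_z, W_r, W_h = (rng.normal(scale=0.5, size=(hidden_size, hidden_size + input_size))
                 for _ in range(3))

h = np.zeros(hidden_size)
for x_t in rng.normal(size=(6, input_size)):
    h = gru_step(x_t, h, W_z, W_r, W_h)
print(h)
```

When z is close to 0 the previous state is copied through almost unchanged, which is how the GRU carries information across many steps. Some libraries swap the roles of z and 1-z, so check the convention before comparing weights.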

LSTM vs GRU

LSTM	3 gates, cell state, hidden state
GRU	2 gates, only hidden state
Parameters	LSTM ≈ 4×, GRU ≈ 3× the parameters of a vanilla RNN of the same size
When GRU?	Smaller dataset, faster training
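
The 4× and 3× figures in the comparison above follow from counting weight blocks: an LSTM has three gates plus a candidate, a GRU has two gates plus a candidate. The helper below is a hypothetical sketch that counts parameters for one layer, assuming PyTorch's convention of two bias vectors per block.

```python
def recurrent_param_count(input_size, hidden_size, gate_blocks, biases_per_block=2):
    """Weights and biases for one recurrent layer.

    gate_blocks: 1 for a vanilla RNN, 3 for a GRU, 4 for an LSTM.
    biases_per_block=2 mirrors PyTorch, which keeps separate b_ih and b_hh vectors.
    """
    per_block = hidden_size * (input_size + hidden_size) + biases_per_block * hidden_size
    return gate_blocks * per_block

rnn = recurrent_param_count(100, 256, gate_blocks=1)
gru = recurrent_param_count(100, 256, gate_blocks=3)
lstm = recurrent_param_count(100, 256, gate_blocks=4)
print(rnn, gru, lstm)  # 91648 274944 366592
```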

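# PyTorch GRU (sizes are illustrative)
gru = nn.GRU(input_size=100, hidden_size=256, num_layers=2, batch_first=True)
output, h_n = gru(x)  # x: (batch, seq, 100) -> output: (batch, seq, 256)

# TensorFlow/Keras
import tensorflow as tf
gru_layer = tf.keras.layers.GRU(units=128, return_sequences=True)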

Stacked & Bidirectional RNNs

Stacked (Deep) RNNs

The hidden-state sequence of layer l becomes the input to layer l+1. Captures hierarchical features.

lstm = nn.LSTM(input_size, hidden_size, num_layers=3, dropout=0.3)

Dropout is applied to the outputs of each layer except the last.

Bidirectional RNNs

Two independent RNNs: left-to-right and right-to-left. Concatenate outputs. Context from both sides.

lstm = nn.LSTM(input_size, hidden_size, batch_first=True, bidirectional=True)
# output shape: (batch, seq, hidden*2)

Essential in NLP: BERT also uses bidirectional context, although it gets it from self-attention rather than recurrence.

Encoder-Decoder & Attention

Sequence-to-Sequence

Encoder compresses input sequence to context vector (final hidden). Decoder generates output from context.

Problem: Fixed context bottleneck for long sequences.

Attention Mechanism

Decoder looks at all encoder hidden states. Context = weighted sum of encoder outputs.

eᵢⱼ = score(h_decᵢ, h_encⱼ)
αᵢⱼ = softmaxⱼ(eᵢⱼ)
cᵢ = ∑ⱼ αᵢⱼ h_encⱼ

Attention scores: dot product, additive (Bahdanau), or multiplicative (Luong).

Attention is all you need: Transformers replace RNNs with self-attention. RNNs with attention are still used in speech and streaming models.

RNN/LSTM in PyTorch & TensorFlow

PyTorch – Sentiment LSTM

import torch
import torch.nn as nn

class SentimentLSTM(nn.Module):
    def __init__(self, vocab_size, embed_dim, hidden_dim, output_dim):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, 
                           batch_first=True, bidirectional=True)
        self.fc = nn.Linear(hidden_dim*2, output_dim)
        self.dropout = nn.Dropout(0.3)
    
    def forward(self, x):
        embedded = self.embedding(x)
        output, (h_n, c_n) = self.lstm(embedded)
        # Concatenate final forward and backward hidden
        h_n = torch.cat((h_n[-2,:,:], h_n[-1,:,:]), dim=1)
        h_n = self.dropout(h_n)
        return self.fc(h_n)

TensorFlow/Keras LSTM

import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(10000, 128, input_length=100),
    tf.keras.layers.Bidirectional(
        tf.keras.layers.LSTM(64, return_sequences=True)
    ),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(32)),
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(1, activation='sigmoid')
])

model.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['accuracy'])

Real-World Applications

NLP

Language modeling, NER, translation

Speech

ASR, synthesis, keyword spotting

Time Series

Stock, weather, anomaly detection

Bioinformatics

Protein sequence, gene expression

RNN Architecture Comparison Table

Model	Gates	State	Long-range	Parameters	When to use
Vanilla RNN	0	h	❌	Low	Short sequences, debugging
LSTM	3	h, c	✅✅	High	Default for complex sequences
GRU	2	h	✅	Medium	Small data, faster training
Bidirectional	-	-	✅	2×	NLP, complete context available
Stacked	-	-	✅	Depth×	Hierarchical features

Training RNNs/LSTMs – Best Practices

✅ Gradient clipping: Essential. Clip the gradient norm to a value such as 1.0 or 5.0.

✅ Initialize forget gate bias to 1 – helps the cell remember at the start of training.

✅ Layer normalization – stabilizes LSTM training.

⚠️ Don't use RNNs for very long sequences (>500 steps) – use a Transformer or CNN.

⚠️ Watch for overfitting – LSTMs have many parameters. Use dropout; note that nn.LSTM's dropout argument only applies between stacked layers, so variational (recurrent) dropout needs a custom cell in PyTorch.

Pro tip: For variable-length sequences, use pack_padded_sequence (with batch_first=True if your tensors are batch-major) so the RNN skips padded steps.

RNN/LSTM Cheatsheet

RNN	hₜ = tanh(W·[hₜ₋₁,xₜ])
LSTM	3 gates + cell
GRU	update + reset
BPTT	unroll then backprop
Bidirectional	past+future
Packing	variable length
Attention	weighted sum
Seq2Seq	encoder-decoder

Next Up: Transformers & Attention – Beyond RNNs.

Transformers: Attention Is All You Need

What is a Transformer?

A Transformer is a deep learning architecture that relies entirely on self-attention to model relationships in sequences. Introduced in 2017 by Vaswani et al., it abandoned recurrence (RNNs) and convolution (CNNs) in favor of parallelizable attention mechanisms. It's the foundation of BERT, GPT, T5, Vision Transformers, and virtually all large language models.

Input Sequence → Embedding → + Positional Encoding →
[Encoder Block × N]
┌─────────────────┐
│   Multi-Head    │
│ Self-Attention  │
└────────┬────────┘
         ↓ Add & Norm
┌─────────────────┐
│  Feed-Forward   │
└────────┬────────┘
         ↓ Add & Norm
→ Output Probabilities

Transformers process all tokens in parallel. Attention maps global dependencies.

Scaled Dot-Product Attention

Attention Formula

Attention(Q, K, V) = softmax(QKᵀ / √dₖ) V

Q: Queries, K: Keys, V: Values.
√dₖ: scaling factor to prevent dot products from growing large.

Each token attends to all tokens. Weighted sum of values.

Self-Attention vs Cross-Attention

Self-attention: Q, K, V come from the same sequence (encoder self-attention, decoder self-attention).

Cross-attention: Q comes from the decoder; K and V come from the encoder (encoder-decoder attention).

Masked Self-Attention

Prevents attending to future tokens. Used in autoregressive decoders (GPT). Set the attention scores of future positions to -∞ before the softmax.

Scaled Dot-Product Attention from scratch (NumPy)

import numpy as np

def scaled_dot_product_attention(Q, K, V, mask=None):
    """
    Q, K, V: (..., seq_len, d_k)
    mask: (..., seq_len, seq_len) optional
    """
    d_k = Q.shape[-1]
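    scores = np.matmul(Q, np.swapaxes(K, -1, -2)) / np.sqrt(d_k)  # (..., seq_len, seq_len)
    
    if mask is not None:
        scores = np.where(mask == 0, -1e9, scores)  # positions where mask == 0 are hidden
    
    # Softmax over keys; subtracting the row max avoids overflow in exp
    exp_scores = np.exp(scores - scores.max(axis=-1, keepdims=True))
    attention_weights = exp_scores / exp_scores.sum(axis=-1, keepdims=True)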
    output = np.matmul(attention_weights, V)
    return output, attention_weights

Multi-Head Attention

Instead of single attention, project Q, K, V h times with different learned linear projections, perform attention in parallel, concatenate, and project again.

Why multiple heads?

Each head learns different attention patterns: local, global, syntactic, semantic. Typical head counts: h = 8, 12, or 16, with 32 or more in large models.

MultiHead(Q,K,V) = Concat(head₁,...,headₕ)Wᴼ

headᵢ = Attention(QWᵢ^Q, KWᵢ^K, VWᵢ^V)

Intuition: Multi-head attention allows the model to jointly attend to information from different representation subspaces at different positions.
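
The project, split, attend, concatenate, project sequence can be sketched in NumPy as follows. This is an illustrative sketch, not a production layer: it handles a single unbatched sequence, packs all per-head projections into one d_model × d_model matrix per role, and uses assumed sizes and random weights.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)      # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, W_q, W_k, W_v, W_o, num_heads):
    """Self-attention over X of shape (seq_len, d_model) with num_heads heads."""
    seq_len, d_model = X.shape
    d_k = d_model // num_heads

    def split_heads(M):                          # (seq_len, d_model) -> (heads, seq_len, d_k)
        return M.reshape(seq_len, num_heads, d_k).transpose(1, 0, 2)

    Q, K, V = split_heads(X @ W_q), split_heads(X @ W_k), split_heads(X @ W_v)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_k)      # (heads, seq_len, seq_len)
    weights = softmax(scores, axis=-1)
    heads = weights @ V                                   # (heads, seq_len, d_k)
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ W_o, weights

rng = np.random.default_rng(0)
seq_len, d_model, num_heads = 5, 16, 4                    # illustrative sizes
X = rng.normal(size=(seq_len, d_model))
W_q, W_k, W_v, W_o = (rng.normal(scale=d_model ** -0.5, size=(d_model, d_model))
                      for _ in range(4))

out, weights = multi_head_attention(X, W_q, W_k, W_v, W_o, num_heads)
print(out.shape, weights.shape)  # (5, 16) (4, 5, 5)
```

Each head works in a d_model / num_heads dimensional subspace, so adding heads does not increase the total amount of computation.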

Positional Encoding: Injecting Order

Sinusoidal Encodings

PE(pos, 2i) = sin(pos / 10000^{2i/d_model})
PE(pos, 2i+1) = cos(pos / 10000^{2i/d_model})

Fixed, no learned parameters. In principle this allows extrapolation to sequence lengths not seen during training.

Learned Positional Embeddings

Trainable vector per position (BERT, GPT). Simpler, but limited to max length.

Modern variants: RoPE (Rotary), ALiBi (attention bias).

Sinusoidal Positional Encoding (PyTorch)

import torch
import math

def positional_encoding(seq_len, d_model):
    pe = torch.zeros(seq_len, d_model)
    position = torch.arange(0, seq_len, dtype=torch.float).unsqueeze(1)
    div_term = torch.exp(torch.arange(0, d_model, 2).float() * 
                        -(math.log(10000.0) / d_model))
    pe[:, 0::2] = torch.sin(position * div_term)
    pe[:, 1::2] = torch.cos(position * div_term)
    return pe.unsqueeze(0)  # (1, seq_len, d_model)
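
One useful property of the sinusoidal encoding is that the dot product between two positions depends only on their offset, not on where they sit in the sequence. The NumPy sketch below checks this numerically; sinusoidal_pe is a hypothetical re-implementation for the example and assumes an even d_model.

```python
import numpy as np

def sinusoidal_pe(seq_len, d_model):
    """NumPy version of the sinusoidal encoding (d_model assumed even)."""
    pos = np.arange(seq_len)[:, None]
    freqs = 1.0 / (10000 ** (np.arange(0, d_model, 2) / d_model))
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(pos * freqs)
    pe[:, 1::2] = np.cos(pos * freqs)
    return pe

pe = sinusoidal_pe(seq_len=64, d_model=32)

# Position 0 encodes as sin(0) = 0 in even slots and cos(0) = 1 in odd slots
print(pe[0, :4])

# Positions 3 and 8 are 5 apart, and so are positions 40 and 45
print(np.isclose(pe[3] @ pe[8], pe[40] @ pe[45]))   # True
```

This follows from sin(a)sin(b) + cos(a)cos(b) = cos(a - b), applied to each frequency pair.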

Encoder-Decoder & Variants

Encoder-Only

BERT, RoBERTa, DeBERTa. Bidirectional context. Best for understanding tasks: classification, NER, extraction.

Decoder-Only

GPT, Llama, Mistral, Gemini. Autoregressive. Best for generation. Causal masking.

Encoder-Decoder

T5, BART. Sequence-to-sequence. Best for translation, summarization.

Encoder: Self-attention + FFN. Decoder: Masked self-attention + cross-attention + FFN.

Iconic Transformer Models (2017–2025)

BERT (2018)

Bidirectional Encoder. Masked LM + Next Sentence Prediction. 110M–340M params.

GPT-3 (2020)

Autoregressive decoder. 175B params. In-context learning.

T5 (2019)

Text-to-Text Transfer Transformer. Unified framework.

Vision Transformer (ViT) (2020)

Split image into patches, treat as sequence. No convolutions.

Llama (2023)

Open-source, efficient. RMSNorm, SwiGLU, RoPE.

Mixture of Experts

Switch Transformer, Mixtral (Mistral AI). Sparse activation: each token is routed to only one or two experts.

Transformer Block in PyTorch

Complete Transformer Encoder Layer

import torch.nn as nn
import torch.nn.functional as F

class TransformerEncoderLayer(nn.Module):
    def __init__(self, d_model, nhead, dim_feedforward=2048, dropout=0.1):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, nhead, dropout=dropout)
        self.linear1 = nn.Linear(d_model, dim_feedforward)
        self.linear2 = nn.Linear(dim_feedforward, d_model)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout1 = nn.Dropout(dropout)
        self.dropout2 = nn.Dropout(dropout)
        self.activation = F.relu
        
    def forward(self, src, src_mask=None, src_key_padding_mask=None):
        # src: (seq_len, batch, d_model), the default layout of nn.MultiheadAttention
        # Self-attention block with residual + norm
        x = src
        attn_out, _ = self.self_attn(x, x, x, attn_mask=src_mask,
                                     key_padding_mask=src_key_padding_mask)
        x = x + self.dropout1(attn_out)
        x = self.norm1(x)
        
        # Feedforward block with residual + norm
        ff_out = self.linear2(self.dropout2(self.activation(self.linear1(x))))
        x = x + self.dropout2(ff_out)
        x = self.norm2(x)
        return x

Training Large Language Models

📚 Pretraining objectives:

MLM: BERT-style, mask 15% of tokens
Autoregressive (CLM): GPT-style, predict next token
Span corruption: T5-style

⚡ Fine-tuning strategies:

Full fine-tuning
LoRA: Low-rank adapters
Prefix tuning, Adapters

LoRA-style parameter-efficient fine-tuning (conceptual)

# LoRA: W = W_original + low-rank update; only lora_A and lora_B are trainable
import torch
import torch.nn as nn

class LoRALayer(nn.Linear):
    def __init__(self, in_features, out_features, rank=4):
        super().__init__(in_features, out_features)
        # Freeze the original weight and bias first; calling this after the
        # adapters are created would freeze them as well
        self.requires_grad_(False)
        self.lora_A = nn.Parameter(torch.randn(in_features, rank) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(rank, out_features))
    
    def forward(self, x):
        return super().forward(x) + x @ self.lora_A @ self.lora_B
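
The saving from the low-rank update is easy to quantify. The helper below is a hypothetical sketch that compares the frozen dense weight count with the trainable adapter count for one linear layer; the 4096 × 4096 size and rank 8 are illustrative assumptions.

```python
def lora_param_counts(in_features, out_features, rank):
    """Return (frozen dense weights, trainable LoRA weights) for one linear layer."""
    full = in_features * out_features
    lora = rank * (in_features + out_features)    # A: in x rank, B: rank x out
    return full, lora

full, lora = lora_param_counts(4096, 4096, rank=8)   # illustrative sizes
print(full, lora, f"{lora / full:.2%}")              # 16777216 65536 0.39%
```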

Transformers Beyond Text

Vision

ViT, Swin, DINOv2. Image classification, detection, segmentation.

Audio

Whisper, AudioMAE. Speech recognition, generation.

Biology

AlphaFold2, ESM. Protein folding, sequences.

Reinforcement Learning

Decision Transformer, GATO.

Multimodal

CLIP, Flamingo, LLaVA, GPT-4V.

Time Series

Informer, Autoformer.

Transformer Variants & Use Cases – Cheatsheet

BERT	Encoder (understanding)
GPT	Decoder (generation)
T5	Encoder-Decoder (seq2seq)
ViT	Vision
Whisper	Speech
RoPE	Positional encoding
SwiGLU	Activation
LoRA	Efficient tuning

Transformer Model Comparison

Model	Year	Architecture	Params	Key Innovation
Transformer	2017	Enc-Dec	65M	Self-attention, no RNN/CNN
BERT	2018	Encoder	110M-340M	Masked LM, bidirectional
GPT-3	2020	Decoder	175B	Few-shot, in-context learning
T5	2019	Enc-Dec	11B	Text-to-text unified
ViT	2020	Encoder	86M-632M	Image patches as tokens
Llama 2	2023	Decoder	7B-70B	Open weights, efficient, Grouped-Query Attention (larger variants)

Transformer Pitfalls & Debugging

⚠️ Quadratic complexity: Attention costs O(n²) in sequence length. For long inputs, use sparse attention or efficient variants such as Linformer or Longformer.

⚠️ Training instability: Use learning-rate warmup (5-10% of steps), gradient clipping, and Adam with betas=(0.9, 0.98).

✅ Positional encoding: For very long sequences, use RoPE or ALiBi.

✅ Debug attention: Visualize attention maps – healthy heads show structured patterns, neither uniform across all tokens nor collapsed onto a single one.
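
The warmup advice above can be illustrated with the schedule used in the original Transformer paper: a linear ramp followed by inverse-square-root decay. The function name transformer_lr is an assumption for this sketch, and the defaults d_model=512 and warmup_steps=4000 are that paper's base settings rather than a recommendation for every model.

```python
def transformer_lr(step, d_model=512, warmup_steps=4000):
    """Linear warmup followed by inverse-square-root decay."""
    step = max(step, 1)                           # avoid raising 0 to a negative power
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

for s in (1, 1000, 4000, 16000):
    print(s, f"{transformer_lr(s):.2e}")
```

The learning rate peaks exactly at warmup_steps and falls off afterwards, so early updates, when Adam's moment estimates are still noisy, stay small.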

Attention Mechanism: The Eyes of Neural Networks

What is Attention?

Attention is a neural component that dynamically computes a weighted sum of values, where the weights depend on the similarity between a query and the corresponding keys. It allows models to focus on specific parts of the input when producing each output element, loosely analogous to human visual attention.

Query (Q) → Similarity ← Keys (K)
↓
Attention Weights (softmax)
↓
Weights × Values (V) → Weighted Sum → Context Vector

Core idea: Not all input elements are equally important. Learn to assign importance dynamically.

The Alignment Problem: Why Attention?

Seq2Seq without Attention

Encoder compresses entire source into one fixed-size vector → information bottleneck. Translation quality degrades rapidly on long sentences.

"I love cats" → fixed vector (5-dim) → "Je ___ ?"

Seq2Seq with Attention

Decoder looks at all encoder states and weights them dynamically at every output step. This removes the bottleneck and improves translation of long sentences.

Alignment: "cats" ↔ "chats" at step 3

Breakthrough (2014): Bahdanau et al. introduced attention to neural machine translation. BLEU scores improved, most visibly on long sentences.

Bahdanau Attention (Additive)

Additive Attention Score

eᵢⱼ = vₐᵀ tanh(Wₐ [sᵢ₋₁; hⱼ])

or concat version: score(s, h) = vₐᵀ tanh(Wₐ[s; h])

αᵢⱼ = softmaxⱼ(eᵢⱼ), context vector cᵢ = Σⱼ αᵢⱼ hⱼ

Historical Significance

The first widely adopted attention mechanism in NLP, introduced for RNN encoder-decoders. Computationally expensive: every (decoder step, encoder state) pair passes through a small feed-forward layer.

Key ingredients: bidirectional RNN encoder, concatenation, tanh

Bahdanau Attention (PyTorch)

class BahdanauAttention(nn.Module):
    def __init__(self, hidden_dim):
        super().__init__()
        self.W_a = nn.Linear(hidden_dim * 2, hidden_dim)  # [s; h]
        self.v_a = nn.Linear(hidden_dim, 1, bias=False)
    
    def forward(self, query, encoder_outputs):
        # query: decoder hidden (batch, hidden)
        # encoder_outputs: (batch, seq_len, hidden)
        seq_len = encoder_outputs.size(1)
        query = query.unsqueeze(1).repeat(1, seq_len, 1)  # (batch, seq_len, hidden)
        
        # Combine query and encoder outputs
        energy = torch.tanh(self.W_a(torch.cat((query, encoder_outputs), dim=2)))  # (batch, seq_len, hidden)
        scores = self.v_a(energy).squeeze(2)  # (batch, seq_len)
        attn_weights = torch.softmax(scores, dim=1)
        
        context = torch.bmm(attn_weights.unsqueeze(1), encoder_outputs).squeeze(1)
        return context, attn_weights

Luong Attention (Multiplicative)

Scoring Functions

Dot: score = sᵀ h

General: score = sᵀ W h

Concat: score = vᵀ tanh(W[s; h])
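
The three scoring functions above differ only in how a decoder state s is compared with each encoder state h. The NumPy sketch below scores one decoder state against all encoder states at once; the function names, dimensions, and random weights are assumptions for illustration.

```python
import numpy as np

def score_dot(s, H):
    return H @ s                                  # s^T h_j for every encoder state h_j

def score_general(s, H, W):
    return H @ (W.T @ s)                          # s^T W h_j

def score_concat(s, H, W, v):
    S = np.broadcast_to(s, H.shape)               # repeat s alongside every h_j
    return np.tanh(np.concatenate([S, H], axis=1) @ W.T) @ v   # v^T tanh(W [s; h_j])

rng = np.random.default_rng(0)
d, n = 4, 6                                       # hidden size, source length
s = rng.normal(size=d)                            # decoder state
H = rng.normal(size=(n, d))                       # encoder states, one per row
W_g = rng.normal(size=(d, d))
W_c = rng.normal(size=(d, 2 * d))
v = rng.normal(size=d)

for name, scores in [("dot", score_dot(s, H)),
                     ("general", score_general(s, H, W_g)),
                     ("concat", score_concat(s, H, W_c, v))]:
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                      # softmax over source positions
    print(name, weights.round(3))
```

The dot score has no parameters but requires s and h to share a dimension; the general and concat forms add learned weights to relax that.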

Key Differences

Luong computes attention from the current decoder state, after the RNN step (Bahdanau uses the previous decoder state, before the step). Simpler and faster. Uses the top-layer state only.

Types: global (all source steps) vs local (window).

Luong Dot-Product Attention

def luong_dot_attention(query, encoder_outputs):
    # query: (batch, 1, hidden)
    # encoder_outputs: (batch, seq_len, hidden)
    scores = torch.bmm(query, encoder_outputs.transpose(1, 2))  # (batch, 1, seq_len)
    attn_weights = torch.softmax(scores, dim=2)
    context = torch.bmm(attn_weights, encoder_outputs)
    return context, attn_weights

Scaled Dot-Product Attention

The Transformer Formula

Attention(Q, K, V) = softmax(QKᵀ / √dₖ) V

Q, K, V: query, key, and value matrices.
√dₖ: scaling factor that prevents softmax saturation.

Why Scaling?

For large dₖ, dot products grow large in magnitude, pushing softmax into regions of vanishing gradients. Scaling fixes this.
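
The effect of the scaling factor can be checked numerically. The sketch below assumes queries and keys with independent unit-variance entries (an idealization) and a head dimension of 64 chosen for illustration.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(0)
d_k = 64                                          # illustrative head dimension

# Dot products of unit-variance vectors have variance d_k, so std is sqrt(d_k)
q = rng.normal(size=(10_000, d_k))
k = rng.normal(size=(10_000, d_k))
dots = (q * k).sum(axis=1)
print(f"std of q·k: {dots.std():.2f}, after scaling: {(dots / np.sqrt(d_k)).std():.2f}")

# One query against 8 keys: unscaled scores give a peakier softmax
raw = rng.normal(size=(8, d_k)) @ rng.normal(size=d_k)
w_raw, w_scaled = softmax(raw), softmax(raw / np.sqrt(d_k))
print(f"largest weight unscaled: {w_raw.max():.3f}, scaled: {w_scaled.max():.3f}")
```

Dividing by √dₖ brings the standard deviation of the scores back to about 1, which keeps the softmax away from the near one-hot regime where its gradients are tiny.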

Scaled Dot-Product Attention (PyTorch)

import math
import torch

def scaled_dot_product_attention(Q, K, V, mask=None):
    d_k = Q.size(-1)
    scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(d_k)
    
    if mask is not None:
        scores = scores.masked_fill(mask == 0, -1e9)
    
    attention_weights = torch.softmax(scores, dim=-1)
    output = torch.matmul(attention_weights, V)
    return output, attention_weights

Attention Matrix: Rows = queries, Cols = keys. Each row sums to 1.

Multi-Head Attention

Instead of one attention function, project Q, K, V h times with different linear projections, perform attention in parallel, concatenate, and project.

Diverse Representations

Each head learns different relationships: syntactic, semantic, coreference, positional.

MultiHead(Q,K,V) = Concat(head₁,...,headₕ)Wᴼ

headᵢ = Attention(QWᵢ^Q, KWᵢ^K, VWᵢ^V)

Typical Values

h = 8, 12, 16, 32. dₖ = d_v = d_model / h.

Attention Variants: Self, Cross, Causal

Self-Attention

Q, K, V from same sequence. Each token attends to all tokens in the same sequence. Captures intra-sequence dependencies.

Used in: encoders such as BERT

Cross-Attention

Q from decoder, K, V from encoder. Decoder attends to input sequence. Essential for seq2seq.

Used in: T5, BART

Causal (Masked) Attention

Prevents attending to future tokens. Scores in the upper triangle (future positions) are set to -∞. Used in autoregressive decoders.

Used in: GPT, Llama

Causal Attention Mask

def causal_mask(size):
    """Boolean mask: True strictly above the diagonal (future positions), False elsewhere."""
    mask = torch.triu(torch.ones(size, size), diagonal=1).bool()
    return mask  # apply with scores.masked_fill(mask, float('-inf'))
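
To see what a causal mask does to the attention weights, the NumPy sketch below applies one to a matrix of equal scores. The sequence length of 4 and the all-zero scores are assumptions chosen so the result is easy to read.

```python
import numpy as np

seq_len = 4
scores = np.zeros((seq_len, seq_len))                            # equal raw scores, for clarity
future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)   # True above the diagonal

masked = np.where(future, -np.inf, scores)                       # hide future positions
weights = np.exp(masked - masked.max(axis=-1, keepdims=True))    # exp(-inf) = 0
weights /= weights.sum(axis=-1, keepdims=True)
print(weights.round(2))
```

Row i spreads its weight evenly over positions 0 through i and gives exactly zero weight to later positions. Every row keeps its diagonal entry, so no row is fully masked and the softmax never divides by zero.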

Visualizing Attention Weights

Alignment Matrix

Plot attention weights as heatmap. Rows = decoder steps, Cols = encoder steps. Reveals word alignment.

[0.9, 0.05, 0.05]
[0.1, 0.8, 0.1]
[0.1, 0.1, 0.8]

Probing Attention Heads

Certain heads specialize: positional heads attend to the previous or next token, syntactic heads attend to syntactically dependent tokens, and some heads focus on rare words.

Tools: BertViz, exBERT, AttentionViz for interactive exploration.

Attention Beyond NLP

Vision

Spatial attention: Attend to relevant image regions. ViT uses self-attention on patches. Cross-attention in image captioning.

Audio

Speech recognition: Attend to acoustic frames. Listen, Attend and Spell (LAS).

Video

Temporal attention: Focus on relevant frames. Video transformers.

Multimodal

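Flamingo: cross-attention from text tokens to image features. CLIP and LLaVA combine image and text without cross-attention (contrastive dual encoders and projected image tokens, respectively).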

Graphs

Graph Attention Networks (GAT): attend to neighbor nodes.

Reinforcement Learning

Attend to relevant observations in memory.

Attention Types – Cheatsheet

Bahdanau	Additive, concat
Luong	Dot, general
Scaled Dot	QKᵀ/√d
Multi-Head	Parallel
Self	Intra-sequence
Cross	Encoder-decoder
Causal	Autoregressive
Spatial	Vision

Attention Mechanism Comparison

Attention Type	Score Function	Complexity	Typical Use
Bahdanau (Additive)	vᵀ tanh(W[s; h])	O(n·d²)	RNN seq2seq
Luong (Dot)	sᵀ h	O(n·d)	RNN, efficient
Scaled Dot-Product	QKᵀ/√d	O(n²·d)	Transformers
Multi-Head	h × scaled dot (head size d/h)	O(n²·d)	BERT, GPT
Graph Attention	LeakyReLU(aᵀ[Whᵢ; Whⱼ])	O(E·d)	Graph networks
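
The quadratic term in the table above is what limits context length in practice. The helper below is a rough, hypothetical estimate of self-attention cost for one layer and one sequence; it assumes the score matrices are fully materialized and stored at 2 bytes per value, which optimized kernels may avoid.

```python
def attention_cost(seq_len, d_model, num_heads, bytes_per_value=2):
    """Rough per-layer cost of self-attention for one sequence."""
    # QK^T and weights·V each take about seq_len^2 * d_model multiply-adds,
    # independent of the head count (each head has size d_model / num_heads)
    multiply_adds = 2 * seq_len ** 2 * d_model
    # One seq_len x seq_len weight matrix is stored per head
    score_bytes = num_heads * seq_len ** 2 * bytes_per_value
    return multiply_adds, score_bytes

for n in (512, 2048, 8192):
    macs, mem = attention_cost(n, d_model=768, num_heads=12)
    print(n, f"{macs:.2e} multiply-adds", f"{mem / 2**20:.0f} MiB of scores")
```

Doubling the sequence length quadruples both numbers, while doubling d_model only doubles the compute.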

Attention Pitfalls & Debugging

⚠️ Attention collapse: All weights equal. Causes: bad initialization, lack of training, model too small.

⚠️ Quadratic complexity: O(n²) for self-attention. Use sparse attention, Linformer, Longformer.

✅ Debug: Always visualize attention matrices. Entropy should be moderate (not 0, not uniform).

✅ Multi-head diversity: Check correlation between heads. Low correlation = diverse features.
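
The entropy check mentioned above is straightforward to compute per query row. The sketch below is illustrative: attention_entropy is a hypothetical helper, and the three example matrices are made up to show the two degenerate extremes and a healthy middle case.

```python
import numpy as np

def attention_entropy(weights, eps=1e-12):
    """Entropy (in nats) of each row of an attention matrix whose rows sum to 1."""
    return -(weights * np.log(weights + eps)).sum(axis=-1)

n = 4
uniform = np.full((n, n), 1.0 / n)                # all weights equal
one_hot = np.eye(n)                               # each query attends to a single key
peaked = np.array([[0.90, 0.05, 0.05, 0.00],
                   [0.10, 0.80, 0.10, 0.00],
                   [0.10, 0.10, 0.80, 0.00],
                   [0.00, 0.10, 0.10, 0.80]])

print("max possible:", np.log(n))                 # entropy of a uniform row
for name, w in [("uniform", uniform), ("one-hot", one_hot), ("peaked", peaked)]:
    print(name, attention_entropy(w).round(3))
```

Rows near log(n) indicate uniform attention, rows near 0 indicate attention locked onto one token, and values in between are what a trained head usually shows.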
