RNN, Transformers & Attention
Recurrent networks, LSTM, transformer architecture, and attention mechanisms.
RNN & LSTM: Mastering Sequence Data
Why Recurrent Networks?
Feedforward networks assume independent inputs. For sequences (time series, text, audio), we need memory. RNNs maintain a hidden state that carries information across time steps.
Parameters are shared across time steps: the same W and b are used at every step.
Vanilla RNN & Backpropagation Through Time
RNN Cell
hₜ = tanh(W_ih·xₜ + b_ih + W_hh·hₜ₋₁ + b_hh)
Hidden state combines current input and previous hidden state.
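# PyTorch RNN cell (single step)
import torch
import torch.nn as nn
rnn_cell = nn.RNNCell(input_size=10, hidden_size=20)
seq_len = 5
x = torch.randn(seq_len, 1, 10)  # example input: (seq_len, batch, input_size)
h = torch.zeros(1, 20)  # initial hidden state
for t in range(seq_len):
    h = rnn_cell(x[t], h)  # the same weights are reused at every step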
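To make the cell equation concrete, here is a minimal sketch that recomputes one step by hand from the weights stored in an `nn.RNNCell` and compares it with the module's own output. The sizes and the helper name `manual_rnn_step` are illustrative assumptions, not part of the original tutorial.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
cell = nn.RNNCell(input_size=10, hidden_size=20)  # tanh nonlinearity by default

def manual_rnn_step(x_t, h_prev, cell):
    """Hypothetical helper: h_t = tanh(W_ih x_t + b_ih + W_hh h_prev + b_hh)."""
    return torch.tanh(
        x_t @ cell.weight_ih.T + cell.bias_ih
        + h_prev @ cell.weight_hh.T + cell.bias_hh
    )

x_t = torch.randn(3, 10)     # (batch, input_size)
h_prev = torch.zeros(3, 20)  # (batch, hidden_size)

h_manual = manual_rnn_step(x_t, h_prev, cell)
h_module = cell(x_t, h_prev)
matches = torch.allclose(h_manual, h_module, atol=1e-6)
print("manual step matches nn.RNNCell:", matches)
```

The transposes appear because PyTorch stores `weight_ih` as (hidden_size, input_size) while the batch is laid out row-wise.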
Backprop Through Time
Gradients flow backward through time steps. Chain rule multiplies across many tanh derivatives → vanishing/exploding gradients.
Problem: vanilla RNNs struggle to learn dependencies across long sequences (in practice, often beyond roughly 10 steps).
# BPTT conceptually (pseudocode, assuming h[t+1] = tanh(W_hh @ h[t] + ...))
for t in reversed(range(seq_len - 1)):
    # gradient at time t depends on the gradient at t+1
    grad_h[t] += W_hh.T @ (grad_h[t+1] * (1 - h[t+1]**2))
The Vanishing Gradient Problem
Why gradients vanish
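During BPTT, the gradient between distant steps is a product with one factor per intervening step: gradient = ∏ (W_hhᵀ · diag(tanh′)). Since tanh′ ≤ 1, repeated multiplication drives the gradient toward 0 for long-term dependencies, unless W_hh has large singular values, in which case it can explode instead.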
Effect: RNN cannot learn relationships between distant tokens.
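The shrinking product can be observed directly with autograd. The sketch below (sizes and the helper name `grad_norm_wrt_first_input` are assumptions chosen for illustration) measures how strongly the final hidden state of a default-initialized tanh RNN depends on the very first input as the sequence gets longer.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
rnn = nn.RNN(input_size=8, hidden_size=32, batch_first=True)  # tanh RNN, default init

def grad_norm_wrt_first_input(seq_len):
    """Hypothetical helper: norm of d(sum of final hidden state) / d(first input)."""
    x = torch.randn(1, seq_len, 8, requires_grad=True)
    _, h_n = rnn(x)
    h_n.sum().backward()
    return x.grad[0, 0].norm().item()

norms = {T: grad_norm_wrt_first_input(T) for T in (5, 20, 80)}
for T, g in norms.items():
    print(f"seq_len={T:3d}  gradient norm at step 0 = {g:.3e}")
```

With PyTorch's default initialization the norm typically drops by many orders of magnitude between the shortest and longest sequence; a differently scaled W_hh could instead make it grow.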
Solutions
- LSTM/GRU – gating preserves gradients
- ReLU + proper init (helps but not robust)
- Gradient clipping (for explosion)
- Residual connections
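Of the remedies listed above, gradient clipping is the easiest to add to an existing training loop. The following is a minimal sketch using `torch.nn.utils.clip_grad_norm_`; the model size, the random stand-in data, and the deliberately inflated loss are assumptions made only to provoke large gradients.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.LSTM(input_size=8, hidden_size=16, batch_first=True)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

x = torch.randn(4, 50, 8)            # (batch, seq, feature), random stand-in data
output, _ = model(x)
loss = (output * 1e3).pow(2).mean()  # artificially large loss to provoke big gradients

optimizer.zero_grad()
loss.backward()

max_norm = 1.0
# Rescales all gradients in place so that their global L2 norm is at most max_norm;
# the return value is the norm measured before clipping.
norm_before = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=max_norm)
norm_after = torch.sqrt(sum(p.grad.pow(2).sum() for p in model.parameters()))
optimizer.step()

print(f"grad norm before: {norm_before.item():.2f}, after: {norm_after.item():.4f}")
```

Clipping only bounds exploding gradients; it does nothing for vanishing ones, which is why gating is listed separately.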
LSTM – The Gated Solution
LSTM Gates
fₜ = σ(W_f·[hₜ₋₁, xₜ] + b_f) // forget gate
iₜ = σ(W_i·[hₜ₋₁, xₜ] + b_i) // input gate
oₜ = σ(W_o·[hₜ₋₁, xₜ] + b_o) // output gate
c̃ₜ = tanh(W_c·[hₜ₋₁, xₜ] + b_c) // candidate
cₜ = fₜ ⊙ cₜ₋₁ + iₜ ⊙ c̃ₜ // cell state
hₜ = oₜ ⊙ tanh(cₜ)
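Cell state acts as a gradient highway. The forget gate controls what to keep or erase. Because the cell state is updated additively, gradients along it are scaled only by the forget gate instead of being repeatedly multiplied by W_hh and tanh′.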
LSTM in 30 seconds
- Forget – reset cell state
- Input – write new info
- Output – expose cell state
- Cell – long-term memory
- Hidden – short-term / output
import torch
import torch.nn as nn
# Built-in LSTM
lstm = nn.LSTM(input_size=100, hidden_size=256, num_layers=2,
               batch_first=True, bidirectional=True)
x = torch.randn(8, 35, 100)  # example input: (batch, seq, feature)
output, (h_n, c_n) = lstm(x)  # output: (batch, seq, 2 * hidden_size)
# Manual LSTM cell (for understanding)
class LSTMCell(nn.Module):
    def __init__(self, input_size, hidden_size):
        super().__init__()
        # One linear layer produces all four gate pre-activations at once
        self.fc = nn.Linear(input_size + hidden_size, hidden_size * 4)
    def forward(self, x, h, c):
        gates = self.fc(torch.cat([x, h], dim=1))
        f, i, o, c_tilde = gates.chunk(4, dim=1)
        f, i, o = torch.sigmoid(f), torch.sigmoid(i), torch.sigmoid(o)
        c = f * c + i * torch.tanh(c_tilde)  # additive cell-state update
        h = o * torch.tanh(c)
        return h, c
GRU – LSTM's Leaner Cousin
GRU Gates (only two)
zₜ = σ(W_z·[hₜ₋₁, xₜ]) // update gate
rₜ = σ(W_r·[hₜ₋₁, xₜ]) // reset gate
h̃ₜ = tanh(W·[rₜ⊙hₜ₋₁, xₜ])
hₜ = (1-zₜ)⊙hₜ₋₁ + zₜ⊙h̃ₜ
The update gate combines the roles of the LSTM forget and input gates. Fewer parameters, often similar performance.
LSTM vs GRU
| Aspect | Details |
|---|---|
| LSTM | 3 gates, cell state, hidden state |
| GRU | 2 gates, only hidden state |
| Parameters | LSTM ≈ 4×, GRU ≈ 3× a vanilla RNN of the same size |
| When GRU? | Smaller dataset, faster training |
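The parameter ratios in the table can be checked by counting the weights of PyTorch's single-layer modules. This is a small sketch; the sizes and the helper name `count_params` are assumptions for illustration.

```python
import torch.nn as nn

def count_params(module):
    """Hypothetical helper: total number of scalar parameters in a module."""
    return sum(p.numel() for p in module.parameters())

input_size, hidden_size = 100, 256  # illustrative sizes

n_rnn = count_params(nn.RNN(input_size, hidden_size))    # one weight block
n_gru = count_params(nn.GRU(input_size, hidden_size))    # three blocks of the same shape
n_lstm = count_params(nn.LSTM(input_size, hidden_size))  # four blocks of the same shape

print(f"RNN: {n_rnn}, GRU: {n_gru} ({n_gru / n_rnn:.0f}x), LSTM: {n_lstm} ({n_lstm / n_rnn:.0f}x)")
```

Each block holds hidden_size × (input_size + hidden_size) weights plus two bias vectors, so the LSTM's candidate transform and three gates account for the factor of four.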
# PyTorch GRU
gru = nn.GRU(input_size, hidden_size, num_layers, batch_first=True)
output, h_n = gru(x)
# TensorFlow/Keras
tf.keras.layers.GRU(units=128, return_sequences=True)
Stacked & Bidirectional RNNs
Stacked (Deep) RNNs
The hidden-state sequence of one layer becomes the input sequence of the next layer. Captures hierarchical features.
lstm = nn.LSTM(input_size, hidden_size, num_layers=3, dropout=0.3)
Dropout is applied to the outputs of every layer except the last.
Bidirectional RNNs
Two independent RNNs: left-to-right and right-to-left. Concatenate outputs. Context from both sides.
lstm = nn.LSTM(input_size, hidden_size, batch_first=True, bidirectional=True)
# output shape: (batch, seq, hidden*2); without batch_first=True it is (seq, batch, hidden*2)
Essential in NLP: BERT also relies on bidirectional context, though it obtains it through self-attention rather than recurrence.
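Stacking and bidirectionality change the shapes that `nn.LSTM` returns, which is a common source of confusion. The sketch below uses illustrative sizes (all of them assumptions) to show where the per-layer and per-direction states end up.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
batch, seq_len, input_size, hidden_size, num_layers = 4, 12, 10, 20, 3
lstm = nn.LSTM(input_size, hidden_size, num_layers=num_layers,
               dropout=0.3, batch_first=True, bidirectional=True)

x = torch.randn(batch, seq_len, input_size)
output, (h_n, c_n) = lstm(x)

# output: top layer only, forward and backward states concatenated at each step
# h_n:    final hidden state for every (layer, direction) pair
out_shape = tuple(output.shape)  # (batch, seq_len, 2 * hidden_size)
h_shape = tuple(h_n.shape)       # (num_layers * 2, batch, hidden_size)

# The forward direction finishes at the last step, the backward one at step 0
fwd_last = output[:, -1, :hidden_size]
bwd_first = output[:, 0, hidden_size:]
print(out_shape, h_shape)
```

`h_n[-2]` and `h_n[-1]` are the top layer's forward and backward final states, which is why classifiers often concatenate exactly those two slices.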
Encoder-Decoder & Attention
Sequence-to-Sequence
Encoder compresses input sequence to context vector (final hidden). Decoder generates output from context.
Problem: Fixed context bottleneck for long sequences.
Attention Mechanism
Decoder looks at all encoder hidden states. Context = weighted sum of encoder outputs.
eᵢⱼ = score(h_decᵢ, h_encⱼ)
αᵢⱼ = softmaxⱼ(eᵢⱼ) = exp(eᵢⱼ) / ∑ₖ exp(eᵢₖ)
cᵢ = ∑ⱼ αᵢⱼ h_encⱼ
Attention scores: dot product, additive (Bahdanau), or multiplicative (Luong).
RNN/LSTM in PyTorch & TensorFlow
import torch
import torch.nn as nn
class SentimentLSTM(nn.Module):
    def __init__(self, vocab_size, embed_dim, hidden_dim, output_dim):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim,
                            batch_first=True, bidirectional=True)
        self.fc = nn.Linear(hidden_dim*2, output_dim)
        self.dropout = nn.Dropout(0.3)
    def forward(self, x):
        embedded = self.embedding(x)  # x: (batch, seq) of token ids
        output, (h_n, c_n) = self.lstm(embedded)
        # Concatenate final forward and backward hidden
        h_n = torch.cat((h_n[-2,:,:], h_n[-1,:,:]), dim=1)
        h_n = self.dropout(h_n)
        return self.fc(h_n)
import tensorflow as tf
model = tf.keras.Sequential([
    # Vocabulary of 10000 tokens, 128-dim embeddings
    # (the input_length argument is deprecated in Keras 3, so it is omitted)
    tf.keras.layers.Embedding(10000, 128),
    tf.keras.layers.Bidirectional(
        tf.keras.layers.LSTM(64, return_sequences=True)
    ),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(32)),
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(1, activation='sigmoid')
])
model.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['accuracy'])
Real-World Applications
NLP
Language modeling, NER, translation
Speech
ASR, synthesis, keyword spotting
Time Series
Stock, weather, anomaly detection
Bioinformatics
Protein sequence, gene expression
RNN Architecture Comparison Table
| Model | Gates | State | Long-range | Parameters | When to use |
|---|---|---|---|---|---|
| Vanilla RNN | 0 | h | ❌ | Low | Short sequences, debugging |
| LSTM | 3 | h, c | ✅✅ | High | Default for complex sequences |
| GRU | 2 | h | ✅ | Medium | Small data, faster training |
| Bidirectional | - | - | ✅ | 2× | NLP, complete context available |
| Stacked | - | - | ✅ | Depth× | Hierarchical features |
Training RNNs/LSTMs – Best Practices
Pro tip: for variable-length sequences, pad the batch, set batch_first=True, and wrap it with pack_padded_sequence so the RNN skips the padding.
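A minimal sketch of that workflow follows, using the utilities in `torch.nn.utils.rnn`. The toy sequence lengths and feature sizes are assumptions chosen for illustration.

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pad_sequence, pack_padded_sequence, pad_packed_sequence

torch.manual_seed(0)
# Three toy sequences of different lengths, each step a 5-dim feature vector
seqs = [torch.randn(n, 5) for n in (7, 4, 2)]
lengths = torch.tensor([len(s) for s in seqs])  # must be a CPU tensor

padded = pad_sequence(seqs, batch_first=True)   # (3, 7, 5), zero-padded
packed = pack_padded_sequence(padded, lengths, batch_first=True,
                              enforce_sorted=False)

lstm = nn.LSTM(input_size=5, hidden_size=8, batch_first=True)
packed_out, (h_n, c_n) = lstm(packed)

# Back to a padded tensor; positions past each true length are filled with zeros
output, out_lengths = pad_packed_sequence(packed_out, batch_first=True)

# h_n holds the state at each sequence's true last step, not at the padding
last_valid = output[torch.arange(3), lengths - 1]
print(output.shape, out_lengths.tolist())
```

Without packing, the final hidden state of the shorter sequences would be computed after several steps of padding, which usually hurts accuracy.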
RNN/LSTM Cheatsheet
Transformers: Attention Is All You Need
What is a Transformer?
A Transformer is a deep learning architecture that relies entirely on self-attention to model relationships in sequences. Introduced in 2017 by Vaswani et al., it abandoned recurrence (RNNs) and convolution (CNNs) in favor of parallelizable attention mechanisms. It's the foundation of BERT, GPT, T5, Vision Transformers, and virtually all large language models.
[Encoder Block × N]
┌──────────────────┐
│ Multi-Head       │
│ Self-Attention   │
└────────┬─────────┘
         ↓ Add & Norm
┌──────────────────┐
│ Feed-Forward     │
└────────┬─────────┘
         ↓ Add & Norm
→ Output Probabilities
Transformers process all tokens in parallel. Attention maps global dependencies.
Scaled Dot-Product Attention
Attention Formula
Attention(Q, K, V) = softmax(QKᵀ / √dₖ) V
Q: Queries, K: Keys, V: Values.
√dₖ: scaling factor to prevent dot products from growing large.
Each token attends to all tokens. Weighted sum of values.
Self-Attention vs Cross-Attention
Self-attention: Q, K, V from the same sequence (encoder self-attention, decoder self-attention).
Cross-attention: Q from the decoder, K and V from the encoder (encoder-decoder attention).
Masked Self-Attention
Prevents attending to future tokens. Used in autoregressive decoders (GPT). Set attention scores to -∞ before softmax.
import numpy as np
def scaled_dot_product_attention(Q, K, V, mask=None):
    """
    Q, K, V: (..., seq_len, d_k)
    mask: (..., seq_len, seq_len) optional; 0 marks positions to hide
    """
    d_k = Q.shape[-1]
    # swapaxes works for any number of leading dimensions
    scores = np.matmul(Q, np.swapaxes(K, -1, -2)) / np.sqrt(d_k)  # (..., seq_len, seq_len)
    if mask is not None:
        scores = np.where(mask == 0, -1e9, scores)  # e.g. a causal or padding mask
    # Numerically stable softmax over the key axis
    scores = scores - scores.max(axis=-1, keepdims=True)
    exp_scores = np.exp(scores)
    attention_weights = exp_scores / exp_scores.sum(axis=-1, keepdims=True)
    output = np.matmul(attention_weights, V)
    return output, attention_weights
Multi-Head Attention
Instead of single attention, project Q, K, V h times with different learned linear projections, perform attention in parallel, concatenate, and project again.
Why multiple heads?
Each head learns different attention patterns: local, global, syntactic, semantic. Typical head counts: h = 8, 12, or 16, and 32 or more for large models.
MultiHead(Q, K, V) = Concat(head₁, ..., headₕ) Wᴼ
headᵢ = Attention(Q Wᵢ^Q, K Wᵢ^K, V Wᵢ^V)
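The project, attend, concatenate, project recipe can be written compactly by stacking all per-head projections into single linear layers. The following is a from-scratch sketch for self-attention only; the class name `MultiHeadSelfAttention` and its attribute names are assumptions, and production code would normally use `nn.MultiheadAttention` instead.

```python
import math
import torch
import torch.nn as nn

class MultiHeadSelfAttention(nn.Module):
    """Illustrative sketch; class and attribute names are assumptions."""
    def __init__(self, d_model, num_heads):
        super().__init__()
        assert d_model % num_heads == 0, "d_model must be divisible by num_heads"
        self.h, self.d_k = num_heads, d_model // num_heads
        self.w_q = nn.Linear(d_model, d_model)  # stacks the query projections of all heads
        self.w_k = nn.Linear(d_model, d_model)
        self.w_v = nn.Linear(d_model, d_model)
        self.w_o = nn.Linear(d_model, d_model)  # output projection W^O

    def _split(self, t):
        # (batch, seq, d_model) -> (batch, heads, seq, d_k)
        b, n, _ = t.shape
        return t.view(b, n, self.h, self.d_k).transpose(1, 2)

    def forward(self, x):
        q = self._split(self.w_q(x))
        k = self._split(self.w_k(x))
        v = self._split(self.w_v(x))
        scores = q @ k.transpose(-2, -1) / math.sqrt(self.d_k)
        weights = torch.softmax(scores, dim=-1)  # (batch, heads, seq, seq)
        heads = weights @ v                      # (batch, heads, seq, d_k)
        b, _, n, _ = heads.shape
        concat = heads.transpose(1, 2).reshape(b, n, self.h * self.d_k)
        return self.w_o(concat), weights

torch.manual_seed(0)
mha = MultiHeadSelfAttention(d_model=64, num_heads=8)
x = torch.randn(2, 10, 64)
out, weights = mha(x)
print(out.shape, weights.shape)
```

Because each head works in d_model / h dimensions, the total cost stays close to that of a single full-width attention.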
Positional Encoding: Injecting Order
Sinusoidal Encodings
PE(pos, 2i) = sin(pos / 10000^{2i/d_model})
PE(pos, 2i+1) = cos(pos / 10000^{2i/d_model})
Fixed, with no learned parameters. May extrapolate to sequence lengths longer than those seen in training.
Learned Positional Embeddings
Trainable vector per position (BERT, GPT). Simpler, but limited to max length.
Modern variants: RoPE (Rotary), ALiBi (attention bias).
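For contrast with the sinusoidal scheme, a learned positional embedding is just a lookup table indexed by position. The sketch below (the class name `LearnedPositionalEmbedding` and the sizes are assumptions) also shows the maximum-length limitation mentioned above.

```python
import torch
import torch.nn as nn

class LearnedPositionalEmbedding(nn.Module):
    """Illustrative sketch; names and sizes are assumptions."""
    def __init__(self, max_len, d_model):
        super().__init__()
        self.pos_emb = nn.Embedding(max_len, d_model)  # one trainable vector per position

    def forward(self, token_emb):
        seq_len = token_emb.size(1)
        positions = torch.arange(seq_len, device=token_emb.device)
        return token_emb + self.pos_emb(positions)     # broadcast over the batch

torch.manual_seed(0)
layer = LearnedPositionalEmbedding(max_len=128, d_model=32)
out = layer(torch.zeros(2, 50, 32))  # (batch, seq, d_model)

# A sequence longer than max_len has no embedding rows to look up
try:
    layer(torch.zeros(1, 129, 32))
    too_long_fails = False
except IndexError:
    too_long_fails = True
print(out.shape, "longer than max_len fails:", too_long_fails)
```

RoPE and ALiBi avoid this table entirely by encoding relative position inside the attention computation.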
import torch
import math
def positional_encoding(seq_len, d_model):
pe = torch.zeros(seq_len, d_model)
position = torch.arange(0, seq_len, dtype=torch.float).unsqueeze(1)
div_term = torch.exp(torch.arange(0, d_model, 2).float() *
-(math.log(10000.0) / d_model))
pe[:, 0::2] = torch.sin(position * div_term)
pe[:, 1::2] = torch.cos(position * div_term)
return pe.unsqueeze(0) # (1, seq_len, d_model)
Encoder-Decoder & Variants
Encoder-Only
BERT, RoBERTa, DeBERTa. Bidirectional context. Best for understanding tasks: classification, NER, extraction.
Decoder-Only
GPT, Llama, Mistral, Gemini. Autoregressive. Best for generation. Causal masking.
Encoder-Decoder
T5, BART. Sequence-to-sequence. Best for translation, summarization.
Encoder: Self-attention + FFN. Decoder: Masked self-attention + cross-attention + FFN.
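The encoder and decoder compositions described above map directly onto PyTorch's built-in layers. This sketch wires one `nn.TransformerEncoderLayer` to one `nn.TransformerDecoderLayer` and checks that the causal mask really hides future target tokens; the dimensions are illustrative assumptions.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d_model, nhead = 32, 4
enc_layer = nn.TransformerEncoderLayer(d_model, nhead, dim_feedforward=64,
                                       dropout=0.0, batch_first=True)
dec_layer = nn.TransformerDecoderLayer(d_model, nhead, dim_feedforward=64,
                                       dropout=0.0, batch_first=True)

src = torch.randn(2, 7, d_model)  # (batch, source_len, d_model)
tgt = torch.randn(2, 5, d_model)  # (batch, target_len, d_model)

# Float mask: -inf above the diagonal blocks attention to future target positions
T = tgt.size(1)
causal = torch.triu(torch.full((T, T), float("-inf")), diagonal=1)

with torch.no_grad():
    memory = enc_layer(src)                         # self-attention + FFN
    out = dec_layer(tgt, memory, tgt_mask=causal)   # masked self-attn + cross-attn + FFN

    # Perturbing the last target token must leave earlier outputs unchanged
    tgt2 = tgt.clone()
    tgt2[:, -1] += 1.0
    out2 = dec_layer(tgt2, memory, tgt_mask=causal)

print(out.shape, torch.allclose(out[:, :-1], out2[:, :-1], atol=1e-5))
```

Encoder-only and decoder-only models use just one of the two halves; a decoder-only model simply drops the cross-attention sublayer.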
Iconic Transformer Models (2017–2025)
BERT (2018)
Bidirectional Encoder. Masked LM + Next Sentence Prediction. 110M–340M params.
GPT-3 (2020)
Autoregressive decoder. 175B params. In-context learning.
T5 (2019)
Text-to-Text Transfer Transformer. Unified framework.
Vision Transformer (ViT) (2020)
Split image into patches, treat as sequence. No convolutions.
Llama (2023)
Open-source, efficient. RMSNorm, SwiGLU, RoPE.
Mixture of Experts
Switch Transformer, Mixtral. Sparse activation.
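The ViT entry above describes splitting an image into patches and treating them as a token sequence. A minimal sketch of that step is shown here; the helper name `patchify` is an assumption, and the 16×16 patch size and 768-dim projection are illustrative defaults.

```python
import torch
import torch.nn as nn

def patchify(images, patch_size):
    """Hypothetical helper: (B, C, H, W) -> (B, num_patches, C * patch_size**2)."""
    b, c, h, w = images.shape
    p = patch_size
    x = images.reshape(b, c, h // p, p, w // p, p)
    x = x.permute(0, 2, 4, 1, 3, 5)  # (B, H/p, W/p, C, p, p)
    return x.reshape(b, (h // p) * (w // p), c * p * p)

torch.manual_seed(0)
images = torch.randn(2, 3, 224, 224)
patches = patchify(images, patch_size=16)  # 14 x 14 = 196 patches per image

# A shared linear layer maps every flattened patch to a d_model-dim token
to_token = nn.Linear(3 * 16 * 16, 768)
tokens = to_token(patches)
print(patches.shape, tokens.shape)
```

A real ViT would additionally prepend a class token and add positional embeddings before the first encoder block.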
Transformer Block in PyTorch
import torch.nn as nn
import torch.nn.functional as F
class TransformerEncoderLayer(nn.Module):
    def __init__(self, d_model, nhead, dim_feedforward=2048, dropout=0.1):
        super().__init__()
        # Expects (seq_len, batch, d_model) inputs unless batch_first=True is passed here
        self.self_attn = nn.MultiheadAttention(d_model, nhead, dropout=dropout)
        self.linear1 = nn.Linear(d_model, dim_feedforward)
        self.linear2 = nn.Linear(dim_feedforward, d_model)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)   # inside the feed-forward block
        self.dropout1 = nn.Dropout(dropout)  # after self-attention
        self.dropout2 = nn.Dropout(dropout)  # after feed-forward
        self.activation = F.relu
    def forward(self, src, src_mask=None, src_key_padding_mask=None):
        # Self-attention block with residual + norm
        x = src
        attn_out, _ = self.self_attn(x, x, x, attn_mask=src_mask,
                                     key_padding_mask=src_key_padding_mask)
        x = x + self.dropout1(attn_out)
        x = self.norm1(x)
        # Feedforward block with residual + norm
        ff_out = self.linear2(self.dropout(self.activation(self.linear1(x))))
        x = x + self.dropout2(ff_out)
        x = self.norm2(x)
        return x
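PyTorch also ships this block as `nn.TransformerEncoderLayer`, which can be stacked with `nn.TransformerEncoder`. The sketch below uses illustrative sizes (assumptions) and a key padding mask to show that padded positions do not influence the real tokens.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
layer = nn.TransformerEncoderLayer(d_model=64, nhead=8, dim_feedforward=256,
                                   dropout=0.0, batch_first=True)
encoder = nn.TransformerEncoder(layer, num_layers=2, enable_nested_tensor=False)

x = torch.randn(3, 10, 64)  # (batch, seq, d_model)
# True marks padding positions that attention should ignore as keys
padding_mask = torch.zeros(3, 10, dtype=torch.bool)
padding_mask[0, 6:] = True

with torch.no_grad():
    out = encoder(x, src_key_padding_mask=padding_mask)

    # Changing the padded content must not change outputs at the real positions
    x2 = x.clone()
    x2[0, 6:] += 5.0
    out2 = encoder(x2, src_key_padding_mask=padding_mask)

print(out.shape, torch.allclose(out[0, :6], out2[0, :6], atol=1e-5))
```

Dropout is set to 0.0 here only so the two forward passes are comparable; a real model would keep a nonzero rate during training.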
Training Large Language Models
Pre-training objectives:
- MLM: BERT-style, mask 15% of tokens
- Autoregressive (CLM): GPT-style, predict the next token
- Span corruption: T5-style
Fine-tuning approaches:
- Full fine-tuning
- LoRA: Low-rank adapters
- Prefix tuning, Adapters
# LoRA: y = x·W_originalᵀ + x·A·B; only A and B are trainable
import torch
import torch.nn as nn
class LoRALayer(nn.Linear):
    def __init__(self, in_features, out_features, rank=4):
        super().__init__(in_features, out_features)
        self.requires_grad_(False)  # freeze the original weight and bias first
        # Created after freezing, so the adapters stay trainable
        self.lora_A = nn.Parameter(torch.randn(in_features, rank) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(rank, out_features))
    def forward(self, x):
        return super().forward(x) + x @ self.lora_A @ self.lora_B
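The savings from low-rank adapters are easy to quantify. The sketch below wraps an existing `nn.Linear` rather than subclassing it; the wrapper name `LoRALinear` and the layer sizes are assumptions for illustration, not an API from any library.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Hypothetical wrapper: frozen base layer plus a trainable low-rank update."""
    def __init__(self, base: nn.Linear, rank=4):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)  # freeze the original weight and bias
        self.lora_A = nn.Parameter(torch.randn(base.in_features, rank) * 0.01)
        # Zero init means the wrapped layer starts out identical to the base layer
        self.lora_B = nn.Parameter(torch.zeros(rank, base.out_features))

    def forward(self, x):
        return self.base(x) + x @ self.lora_A @ self.lora_B

torch.manual_seed(0)
base = nn.Linear(512, 512)
layer = LoRALinear(base, rank=4)
x = torch.randn(2, 512)

same_at_init = torch.allclose(layer(x), base(x))
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
frozen = sum(p.numel() for p in layer.parameters() if not p.requires_grad)
print(f"trainable: {trainable}, frozen: {frozen}, identical at init: {same_at_init}")
```

With rank 4 the adapters add 2 × 512 × 4 trainable values next to a frozen 512 × 512 weight matrix, which is where the memory savings during fine-tuning come from.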
Transformers Beyond Text
Vision
ViT, Swin, DINOv2. Image classification, detection, segmentation.
Audio
Whisper, AudioMAE. Speech recognition, generation.
Biology
AlphaFold2, ESM. Protein folding, sequences.
Reinforcement Learning
Decision Transformer, GATO.
Multimodal
CLIP, Flamingo, LLaVA, GPT-4V.
Time Series
Informer, Autoformer.
Transformer Variants & Use Cases – Cheatsheet
Transformer Model Comparison
| Model | Year | Architecture | Params | Key Innovation |
|---|---|---|---|---|
| Transformer | 2017 | Enc-Dec | 65M | Self-attention, no RNN/CNN |
| BERT | 2018 | Encoder | 110M-340M | Masked LM, bidirectional |
| GPT-3 | 2020 | Decoder | 175B | Few-shot, in-context learning |
| T5 | 2019 | Enc-Dec | 11B | Text-to-text unified |
| ViT | 2020 | Encoder | 86M-632M | Image patches as tokens |
| Llama 2 | 2023 | Decoder | 7B-70B | Open, efficient, Grouped-Query Attention |
Transformer Pitfalls & Debugging
Attention Mechanism: The Eyes of Neural Networks
What is Attention?
Attention is a neural component that dynamically computes a weighted sum of values, where the weights depend on the similarity between a query and the corresponding keys. It allows models to focus on specific parts of the input when producing each output element, loosely mimicking visual attention.
Similarity Scores (query vs. keys)
↓
Attention Weights (softmax)
↓
Weighted Sum of Values (V) → Context Vector
Core idea: Not all input elements are equally important. Learn to assign importance dynamically.
The Alignment Problem: Why Attention?
Seq2Seq without Attention
Encoder compresses entire source into one fixed-size vector → information bottleneck. Long sentences degrade rapidly.
"I love cats" → fixed vector (5-dim) → "Je ___ ?"
Seq2Seq with Attention
Decoder looks at all encoder states, weights them dynamically. Solves bottleneck, improves long-range translation.
Alignment: "cat" ↔ "chat" at step 3
Bahdanau Attention (Additive)
Additive Attention Score
eᵢⱼ = vₐᵀ tanh(Wₐ [sᵢ₋₁; hⱼ])
or concat version: score(s, h) = vₐᵀ tanh(Wₐ[s; h])
Context vector cᵢ = Σⱼ αᵢⱼ hⱼ
Historical Significance
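One of the first attention mechanisms in NLP, introduced for neural machine translation with RNN encoder-decoders. Computationally expensive (a fully connected layer is evaluated for every alignment score).
Key ingredients: bidirectional RNN encoder, concatenation, tanh.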
import torch
import torch.nn as nn
class BahdanauAttention(nn.Module):
    def __init__(self, hidden_dim):
        super().__init__()
        self.W_a = nn.Linear(hidden_dim * 2, hidden_dim)  # [s; h]
        self.v_a = nn.Linear(hidden_dim, 1, bias=False)
    def forward(self, query, encoder_outputs):
        # query: decoder hidden (batch, hidden)
        # encoder_outputs: (batch, seq_len, hidden)
        seq_len = encoder_outputs.size(1)
        query = query.unsqueeze(1).repeat(1, seq_len, 1)  # (batch, seq_len, hidden)
        # Combine query and encoder outputs
        energy = torch.tanh(self.W_a(torch.cat((query, encoder_outputs), dim=2)))  # (batch, seq_len, hidden)
        scores = self.v_a(energy).squeeze(2)  # (batch, seq_len)
        attn_weights = torch.softmax(scores, dim=1)
        context = torch.bmm(attn_weights.unsqueeze(1), encoder_outputs).squeeze(1)  # (batch, hidden)
        return context, attn_weights
Luong Attention (Multiplicative)
Scoring Functions
Dot: score = sᵀ h
General: score = sᵀ W h
Concat: score = vᵀ tanh(W[s; h])
Key Differences
Luong computes attention after decoder output (vs before in Bahdanau). Simpler, faster. Uses top-layer state only.
Types: global (all source steps) vs local (window).
import torch
def luong_dot_attention(query, encoder_outputs):
    # query: (batch, 1, hidden)
    # encoder_outputs: (batch, seq_len, hidden)
    scores = torch.bmm(query, encoder_outputs.transpose(1, 2))  # (batch, 1, seq_len)
    attn_weights = torch.softmax(scores, dim=2)
    context = torch.bmm(attn_weights, encoder_outputs)  # (batch, 1, hidden)
    return context, attn_weights
Scaled Dot-Product Attention
The Transformer Formula
Attention(Q, K, V) = softmax(QKᵀ / √dₖ) V
Q, K, V: queries, keys, values matrices.
√dₖ: scaling factor prevents softmax saturation.
Why Scaling?
For large dₖ, dot products grow large in magnitude, pushing softmax into regions of vanishing gradients. Scaling fixes this.
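The growth is easy to measure: for queries and keys with independent unit-variance components, the dot product has a standard deviation of about √dₖ. The sketch below estimates this empirically; the sample count and the helper name `score_std` are assumptions.

```python
import torch

torch.manual_seed(0)

def score_std(d_k, n=4096):
    """Hypothetical helper: std of n random dot products, raw and scaled by sqrt(d_k)."""
    q = torch.randn(n, d_k)
    k = torch.randn(n, d_k)
    raw = (q * k).sum(dim=-1)  # n independent dot products
    return raw.std().item(), (raw / d_k ** 0.5).std().item()

stats = {d_k: score_std(d_k) for d_k in (16, 64, 256)}
for d_k, (raw_std, scaled_std) in stats.items():
    print(f"d_k={d_k:3d}  raw std = {raw_std:6.2f}  scaled std = {scaled_std:.2f}")
```

After dividing by √dₖ the scores stay near unit scale regardless of dimension, so the softmax does not collapse onto a single key at initialization.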
import math
import torch
def scaled_dot_product_attention(Q, K, V, mask=None):
    d_k = Q.size(-1)
    scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(d_k)
    if mask is not None:
        scores = scores.masked_fill(mask == 0, -1e9)  # 0 in the mask means "hide"
    attention_weights = torch.softmax(scores, dim=-1)
    output = torch.matmul(attention_weights, V)
    return output, attention_weights
Multi-Head Attention
Instead of one attention function, project Q, K, V h times with different linear projections, perform attention in parallel, concatenate, and project.
Diverse Representations
Each head learns different relationships: syntactic, semantic, coreference, positional.
MultiHead(Q, K, V) = Concat(head₁, ..., headₕ) Wᴼ
headᵢ = Attention(Q Wᵢ^Q, K Wᵢ^K, V Wᵢ^V)
Typical Values
h = 8, 12, 16, 32. dₖ = d_v = d_model / h.
Attention Variants: Self, Cross, Causal
Self-Attention
Q, K, V from same sequence. Each token attends to all tokens in the same sequence. Captures intra-sequence dependencies.
Used in: encoders such as BERT.
Cross-Attention
Q from decoder, K, V from encoder. Decoder attends to input sequence. Essential for seq2seq.
Used in: T5, BART.
Causal (Masked) Attention
Prevents attending to future tokens. Upper triangular mask set to -∞. Used in autoregressive decoders.
Used in: GPT, Llama.
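import torch
def causal_mask(size):
    """Boolean mask: True above the diagonal (future tokens), False on and below it."""
    mask = torch.triu(torch.ones(size, size), diagonal=1).bool()
    # Use with scores.masked_fill(mask, float('-inf')); note this is the opposite
    # convention to a keep-mask in which 0 marks hidden positions
    return mask  # True where future tokens (to be masked)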
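To see the mask in action, the sketch below applies an upper-triangular boolean mask inside scaled dot-product attention and inspects the resulting weights. The tensor sizes are illustrative assumptions, and the mask is rebuilt inline so the snippet runs on its own.

```python
import math
import torch

torch.manual_seed(0)
seq_len, d_k = 5, 8
Q, K, V = (torch.randn(1, seq_len, d_k) for _ in range(3))

# True above the diagonal marks future positions that must be hidden
future = torch.triu(torch.ones(seq_len, seq_len), diagonal=1).bool()

scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)
scores = scores.masked_fill(future, float("-inf"))
weights = torch.softmax(scores, dim=-1)  # each row sums to 1 over the allowed positions
output = weights @ V

print(weights[0])
```

The printed matrix is lower triangular: the first token can attend only to itself, while the last token attends to the whole prefix.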
Visualizing Attention Weights
Alignment Matrix
Plot attention weights as heatmap. Rows = decoder steps, Cols = encoder steps. Reveals word alignment.
[0.1, 0.8, 0.1]
[0.1, 0.1, 0.8]
Probing Attention Heads
Certain heads specialize: positional heads attend to the previous or next token, syntactic heads attend to dependent tokens, and some heads attend to rare words.
Attention Beyond NLP
Vision
Spatial attention: Attend to relevant image regions. ViT uses self-attention on patches. Cross-attention in image captioning.
Audio
Speech recognition: Attend to acoustic frames. Listen, Attend and Spell (LAS).
Video
Temporal attention: Focus on relevant frames. Video transformers.
Multimodal
CLIP, Flamingo, LLaVA: cross-attention between image and text.
Graphs
Graph Attention Networks (GAT): attend to neighbor nodes.
Reinforcement Learning
Attend to relevant observations in memory.
Attention Types – Cheatsheet
Attention Mechanism Comparison
| Attention Type | Score Function | Complexity | Typical Use |
|---|---|---|---|
| Bahdanau (Additive) | vₐᵀ tanh(W[s; h]) | O(n·d²) | RNN seq2seq |
| Luong (Dot) | sᵀ h | O(n·d) | RNN, efficient |
| Scaled Dot-Product | QKᵀ/√d | O(n²·d) | Transformers |
| Multi-Head | h × scaled dot | O(n²·d) total (each head uses dₖ = d/h) | BERT, GPT |
| Graph Attention | LeakyReLU(aᵀ[Whᵢ; Whⱼ]) | O(E·d) | Graph networks |