Deep Learning

RNN, Transformers & Attention

Recurrent networks, LSTM, transformer architecture, and attention mechanisms.

RNN & LSTM: Mastering Sequence Data

Why Recurrent Networks?

Feedforward networks assume independent inputs. For sequences (time series, text, audio), we need memory. RNNs maintain a hidden state that carries information across time steps.

hₜ = tanh(W·[hₜ₋₁, xₜ] + b) → yₜ = W_y·hₜ + b_y

Parameters are shared across time steps: the same W and b are used at every step.
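The recurrence above can be sketched by hand. This is a minimal illustration, assuming the concatenated form [hₜ₋₁, xₜ] and made-up sizes (input 10, hidden 20, 5 steps) that are not from the original text. Note that a single W and b are reused inside the loop.

```python
import torch

torch.manual_seed(0)
input_size, hidden_size, seq_len = 10, 20, 5   # illustrative sizes (assumption)

# One weight matrix and one bias, shared by every time step
W = torch.randn(hidden_size, hidden_size + input_size) * 0.1
b = torch.zeros(hidden_size)

x = torch.randn(seq_len, input_size)   # a single sequence of seq_len inputs
h = torch.zeros(hidden_size)           # initial hidden state h_0

hidden_states = []
for t in range(seq_len):
    # h_t = tanh(W · [h_{t-1}, x_t] + b)
    h = torch.tanh(W @ torch.cat([h, x[t]]) + b)
    hidden_states.append(h)

hidden_states = torch.stack(hidden_states)   # (seq_len, hidden_size)
```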

Vanilla RNN & Backpropagation Through Time

RNN Cell

hₜ = tanh(W_ih·xₜ + b_ih + W_hh·hₜ₋₁ + b_hh)

Hidden state combines current input and previous hidden state.

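# PyTorch RNN cell (single step)
import torch
import torch.nn as nn

rnn_cell = nn.RNNCell(input_size=10, hidden_size=20)

seq_len = 5
x = torch.randn(seq_len, 1, 10)  # example input: (seq_len, batch, input_size)
h = torch.zeros(1, 20)  # initial hidden
for t in range(seq_len):
    h = rnn_cell(x[t], h)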
Backprop Through Time

Gradients flow backward through time steps. Chain rule multiplies across many tanh derivatives → vanishing/exploding gradients.

Problem: vanilla RNNs struggle to learn dependencies that span long gaps (often anything beyond roughly 10 steps).

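# BPTT conceptually (h[t+1] = tanh(W_ih @ x[t+1] + W_hh @ h[t] + b))
# grad_h[t] starts out as the direct loss gradient at step t
for t in reversed(range(seq_len - 1)):
    # gradient at time t also receives what flows back from t+1,
    # through the tanh derivative at t+1 and then through W_hh
    grad_h[t] += W_hh.T @ (grad_h[t + 1] * (1 - h[t + 1] ** 2))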
Truncated BPTT: Limit the number of time steps backpropagated (e.g., 20-50 steps). Common in training language models.
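A minimal sketch of truncated BPTT, assuming a toy regression task and illustrative sizes (sequence length 100, chunks of 25 steps) that are not from the original text. The hidden state is carried across chunks, but `detach()` cuts the graph so gradients stop at each chunk boundary.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
seq_len, chunk_len, batch, feat, hidden = 100, 25, 4, 8, 16   # illustrative sizes

rnn = nn.RNN(feat, hidden, batch_first=True)
head = nn.Linear(hidden, 1)
params = list(rnn.parameters()) + list(head.parameters())
optimizer = torch.optim.SGD(params, lr=0.01)

x = torch.randn(batch, seq_len, feat)   # toy inputs
y = torch.randn(batch, seq_len, 1)      # toy targets

h = None            # nn.RNN starts from a zero state when h is None
num_updates = 0
for start in range(0, seq_len, chunk_len):
    x_chunk = x[:, start:start + chunk_len]
    y_chunk = y[:, start:start + chunk_len]

    out, h = rnn(x_chunk, h)
    loss = nn.functional.mse_loss(head(out), y_chunk)

    optimizer.zero_grad()
    loss.backward()          # backprop only through the current chunk
    optimizer.step()

    # Keep the state's values but cut the graph, so the next chunk
    # does not backpropagate into earlier chunks
    h = h.detach()
    num_updates += 1
```

Without the `detach()` call, the second `backward()` would try to traverse the graph of the first chunk, which has already been freed.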

The Vanishing Gradient Problem

Why gradients vanish

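During BPTT, the gradient that reaches a state k steps back is a product of k factors W_hhᵀ · diag(tanh'). Because tanh' ≤ 1, this product shrinks toward 0 whenever the recurrent weights are small (largest singular value of W_hh below 1), so the signal from long-term dependencies vanishes. With large recurrent weights the same product can explode instead.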

Effect: RNN cannot learn relationships between distant tokens.
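The shrinking product can be observed numerically. The sketch below assumes an input-free tanh RNN with deliberately small random recurrent weights (sizes and scale are illustrative choices, not from the original text) and tracks the gradient norm as it is pushed back through 50 steps.

```python
import numpy as np

rng = np.random.default_rng(0)
hidden, steps = 32, 50   # illustrative sizes (assumption)

# Small recurrent weights, chosen so the largest singular value is well below 1
W_hh = rng.normal(scale=0.25 / np.sqrt(hidden), size=(hidden, hidden))

# Forward pass of an input-free RNN to obtain the hidden states
h = rng.normal(size=hidden)
states = []
for _ in range(steps):
    h = np.tanh(W_hh @ h)
    states.append(h)

# Backward pass: push a unit-norm gradient from the last step to the first
grad = np.ones(hidden) / np.sqrt(hidden)
norms = [np.linalg.norm(grad)]
for h_t in reversed(states):
    # one BPTT step: grad <- W_hh^T · diag(1 - h_t^2) · grad
    grad = W_hh.T @ ((1.0 - h_t ** 2) * grad)
    norms.append(np.linalg.norm(grad))
```

Plotting `norms` on a log scale shows a roughly straight downward line, which is the exponential decay described above.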

Solutions
  • LSTM/GRU – gating preserves gradients
  • ReLU + proper init (helps but not robust)
  • Gradient clipping (for explosion)
  • Residual connections

LSTM – The Gated Solution

LSTM Gates

fₜ = σ(W_f·[hₜ₋₁, xₜ] + b_f) // forget gate
iₜ = σ(W_i·[hₜ₋₁, xₜ] + b_i) // input gate
oₜ = σ(W_o·[hₜ₋₁, xₜ] + b_o) // output gate
c̃ₜ = tanh(W_c·[hₜ₋₁, xₜ] + b_c) // candidate
cₜ = fₜ ⊙ cₜ₋₁ + iₜ ⊙ c̃ₜ // cell state
hₜ = oₜ ⊙ tanh(cₜ)

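Cell state acts as a gradient highway. The forget gate controls what to keep or erase. Along the cell state the update is additive, so the gradient from cₜ to cₜ₋₁ is only scaled elementwise by fₜ instead of being multiplied by a weight matrix and a tanh derivative at every step. When fₜ stays close to 1, the gradient survives many steps.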

LSTM in 30 seconds
  • Forget – reset cell state
  • Input – write new info
  • Output – expose cell state
  • Cell – long-term memory
  • Hidden – short-term / output
PyTorch LSTM – from nn.LSTM to custom cell
import torch
import torch.nn as nn

# Built-in LSTM
lstm = nn.LSTM(input_size=100, hidden_size=256, num_layers=2,
               batch_first=True, bidirectional=True)
x = torch.randn(32, 50, 100)  # example batch: (batch, seq, feature)
output, (h_n, c_n) = lstm(x)  # output: (batch, seq, 2*hidden_size)

# Manual LSTM cell (for understanding)
class LSTMCell(nn.Module):
    def __init__(self, input_size, hidden_size):
        super().__init__()
        self.fc = nn.Linear(input_size + hidden_size, hidden_size * 4)
        
    def forward(self, x, h, c):
        gates = self.fc(torch.cat([x, h], dim=1))
        f, i, o, c_tilde = gates.chunk(4, dim=1)
        f, i, o = torch.sigmoid(f), torch.sigmoid(i), torch.sigmoid(o)
        c = f * c + i * torch.tanh(c_tilde)
        h = o * torch.tanh(c)
        return h, c

GRU – LSTM's Leaner Cousin

GRU Gates (only two)

zₜ = σ(W_z·[hₜ₋₁, xₜ]) // update gate
rₜ = σ(W_r·[hₜ₋₁, xₜ]) // reset gate
h̃ₜ = tanh(W·[rₜ⊙hₜ₋₁, xₜ])
hₜ = (1-zₜ)⊙hₜ₋₁ + zₜ⊙h̃ₜ

Combines forget and input gates. Fewer parameters, often similar performance.
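The four GRU equations above can be written as a small module. This is a sketch for understanding, not the way `nn.GRU` is implemented: the class name is hypothetical, biases are included although the equations omit them, and the sizes are illustrative. It follows the convention used above (zₜ near 1 overwrites the state); the `nn.GRU` documentation defines the update gate with the opposite roles.

```python
import torch
import torch.nn as nn

class GRUCellSketch(nn.Module):   # hypothetical name, written for illustration
    def __init__(self, input_size, hidden_size):
        super().__init__()
        # Update and reset gates are computed together from [h_{t-1}, x_t]
        self.gates = nn.Linear(hidden_size + input_size, 2 * hidden_size)
        # Candidate state is computed from [r_t ⊙ h_{t-1}, x_t]
        self.candidate = nn.Linear(hidden_size + input_size, hidden_size)

    def forward(self, x, h):
        z, r = torch.sigmoid(self.gates(torch.cat([h, x], dim=1))).chunk(2, dim=1)
        h_tilde = torch.tanh(self.candidate(torch.cat([r * h, x], dim=1)))
        # z_t close to 0 keeps the old state, z_t close to 1 overwrites it
        return (1 - z) * h + z * h_tilde

torch.manual_seed(0)
cell = GRUCellSketch(input_size=10, hidden_size=20)
x = torch.randn(3, 10)        # (batch, input_size)
h = torch.zeros(3, 20)        # (batch, hidden_size)
h_next = cell(x, h)           # (batch, hidden_size)
```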

LSTM vs GRU
LSTM: 3 gates, cell state, hidden state
GRU: 2 gates, only hidden state
Parameters: LSTM ≈ 4×, GRU ≈ 3× (relative to a vanilla RNN of the same size)
When GRU?: Smaller dataset, faster training
# PyTorch GRU
gru = nn.GRU(input_size, hidden_size, num_layers, batch_first=True)
output, h_n = gru(x)

# TensorFlow/Keras
tf.keras.layers.GRU(units=128, return_sequences=True)
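The "LSTM ≈ 4×, GRU ≈ 3×" parameter rule from the comparison above can be checked by counting parameters of single-layer PyTorch modules. The sizes are illustrative assumptions, and `count_params` is a helper defined here, not a library function.

```python
import torch.nn as nn

def count_params(module):
    # Total number of scalar parameters in a module
    return sum(p.numel() for p in module.parameters())

input_size, hidden_size = 100, 256   # illustrative sizes (assumption)

rnn_params = count_params(nn.RNN(input_size, hidden_size))
gru_params = count_params(nn.GRU(input_size, hidden_size))
lstm_params = count_params(nn.LSTM(input_size, hidden_size))

gru_ratio = gru_params / rnn_params    # reset, update, candidate
lstm_ratio = lstm_params / rnn_params  # forget, input, output, candidate
```

Each gate or candidate adds one copy of the vanilla RNN's weight and bias set, which is where the factors come from.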

Stacked & Bidirectional RNNs

Stacked (Deep) RNNs

The sequence of hidden states from layer l becomes the input sequence of layer l+1. Deeper layers can capture more abstract, hierarchical features.

lstm = nn.LSTM(input_size, hidden_size, num_layers=3, dropout=0.3)

The dropout argument applies dropout to the outputs of every layer except the last.

Bidirectional RNNs

Two independent RNNs: left-to-right and right-to-left. Concatenate outputs. Context from both sides.

lstm = nn.LSTM(input_size, hidden_size, batch_first=True, bidirectional=True)
# output shape: (batch, seq, hidden*2)

Essential in NLP: BERT also relies on bidirectional context, although it obtains it through self-attention rather than recurrence.

Encoder-Decoder & Attention

Sequence-to-Sequence

Encoder compresses input sequence to context vector (final hidden). Decoder generates output from context.

Problem: Fixed context bottleneck for long sequences.

Attention Mechanism

Decoder looks at all encoder hidden states. Context = weighted sum of encoder outputs.

eᵢⱼ = score(h_decᵢ, h_encⱼ)
αᵢⱼ = softmax(eᵢⱼ)
cᵢ = ∑ αᵢⱼ h_encⱼ

Attention scores: dot product, additive (Bahdanau), or multiplicative (Luong).

Attention is all you need: Transformers replace RNNs with self-attention. RNN+attention models are still used in speech and streaming settings.

RNN/LSTM in PyTorch & TensorFlow

PyTorch – Sentiment LSTM
import torch
import torch.nn as nn

class SentimentLSTM(nn.Module):
    def __init__(self, vocab_size, embed_dim, hidden_dim, output_dim):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, 
                           batch_first=True, bidirectional=True)
        self.fc = nn.Linear(hidden_dim*2, output_dim)
        self.dropout = nn.Dropout(0.3)
    
    def forward(self, x):
        embedded = self.embedding(x)
        output, (h_n, c_n) = self.lstm(embedded)
        # Concatenate final forward and backward hidden
        h_n = torch.cat((h_n[-2,:,:], h_n[-1,:,:]), dim=1)
        h_n = self.dropout(h_n)
        return self.fc(h_n)
TensorFlow/Keras LSTM
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(10000, 128, input_length=100),
    tf.keras.layers.Bidirectional(
        tf.keras.layers.LSTM(64, return_sequences=True)
    ),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(32)),
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(1, activation='sigmoid')
])

model.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['accuracy'])

Real-World Applications

NLP

Language modeling, NER, translation

Speech

ASR, synthesis, keyword spotting

Time Series

Stock, weather, anomaly detection

Bioinformatics

Protein sequence, gene expression

RNN Architecture Comparison Table

Model | Gates | State | Long-range | Parameters | When to use
Vanilla RNN | 0 | h | ❌ | Low | Short sequences, debugging
LSTM | 3 | h, c | ✅✅ | High | Default for complex sequences
GRU | 2 | h | ✅ | Medium | Small data, faster training
Bidirectional | - | - | ✅ | 2× | NLP, complete context available
Stacked | - | - | ✅ | Depth× | Hierarchical features

Training RNNs/LSTMs – Best Practices

✅ Gradient clipping: Essential. Clip norm to 1.0 or 5.0.
✅ Initialize forget gate bias to 1 – helps remember at start.
✅ Layer normalization – stabilizes LSTM training.
⚠️ Don't use RNN for very long sequences (>500) – use Transformer or CNN.
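⚠️ Watch for overfitting – LSTMs have many parameters. Use dropout (the dropout argument of nn.LSTM applies it between stacked layers; variational dropout on the recurrent connections needs a custom implementation).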
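Two of the practices above, forget-gate bias initialization and gradient clipping, can be sketched in a single training step. The model, data and sizes are toy assumptions. The slice used for the forget gate relies on PyTorch storing LSTM biases as four stacked blocks in the order input, forget, cell, output.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
hidden = 32
lstm = nn.LSTM(input_size=16, hidden_size=hidden, batch_first=True)
head = nn.Linear(hidden, 1)

# Forget-gate bias = 1: the forget block is the slice [hidden:2*hidden]
with torch.no_grad():
    lstm.bias_ih_l0[hidden:2 * hidden].fill_(1.0)
    lstm.bias_hh_l0[hidden:2 * hidden].fill_(0.0)   # avoid an effective bias of 2

params = list(lstm.parameters()) + list(head.parameters())
optimizer = torch.optim.Adam(params, lr=1e-3)

x = torch.randn(8, 50, 16)   # toy batch: (batch, seq, feature)
y = torch.randn(8, 1)

output, (h_n, c_n) = lstm(x)
loss = nn.functional.mse_loss(head(h_n[-1]), y)

optimizer.zero_grad()
loss.backward()
# Rescale all gradients so that their global L2 norm is at most 1.0;
# the return value is the norm measured before clipping
total_norm = torch.nn.utils.clip_grad_norm_(params, max_norm=1.0)
optimizer.step()
```

Logging `total_norm` over training is a cheap way to see how often clipping is actually active.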

Pro tip: For time series, try batch_first=True and pack_padded_sequence for variable-length sequences.
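A minimal sketch of the packing tip, assuming three toy sequences of lengths 5, 3 and 2 (sizes are illustrative). Packing lets the LSTM skip the padded steps, so the final hidden state corresponds to each sequence's last real step.

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

torch.manual_seed(0)
lstm = nn.LSTM(input_size=8, hidden_size=16, batch_first=True)

# Three sequences with true lengths 5, 3 and 2, zero-padded to length 5
lengths = torch.tensor([5, 3, 2])
x = torch.randn(3, 5, 8)
for i, n in enumerate(lengths.tolist()):
    x[i, n:] = 0.0

# enforce_sorted=False lets the batch be in any length order
packed = pack_padded_sequence(x, lengths, batch_first=True, enforce_sorted=False)
packed_out, (h_n, c_n) = lstm(packed)

# Back to a padded tensor; positions past each length are filled with zeros
output, out_lengths = pad_packed_sequence(packed_out, batch_first=True)

# h_n holds the state at each sequence's last real step, not at the padding
last_real = output[torch.arange(3), lengths - 1]
```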

RNN/LSTM Cheatsheet

RNN hₜ = tanh(W·[hₜ₋₁,xₜ])
LSTM 3 gates + cell
GRU update + reset
BPTT unroll then backprop
Bidirectional past+future
Packing variable length
Attention weighted sum
Seq2Seq encoder-decoder

Transformers: Attention Is All You Need

What is a Transformer?

A Transformer is a deep learning architecture that relies entirely on self-attention to model relationships in sequences. Introduced in 2017 by Vaswani et al., it abandoned recurrence (RNNs) and convolution (CNNs) in favor of parallelizable attention mechanisms. It's the foundation of BERT, GPT, T5, Vision Transformers, and virtually all large language models.

Input Sequence → Embedding → + Positional Encoding →
[Encoder Block × N]
┌───────────────────────────┐
│ Multi-Head Self-Attention │
└─────────────┬─────────────┘
              ↓ Add & Norm
┌───────────────────────────┐
│       Feed-Forward        │
└─────────────┬─────────────┘
              ↓ Add & Norm
→ Output Probabilities

Transformers process all tokens in parallel. Attention maps global dependencies.

Scaled Dot-Product Attention

Attention Formula

Attention(Q, K, V) = softmax(QKᵀ / √dₖ) V

Q: Queries, K: Keys, V: Values.
√dₖ: scaling factor to prevent dot products from growing large.

Each token attends to all tokens. Weighted sum of values.

Self-Attention vs Cross-Attention

Self-attention: Q, K, V come from the same sequence (encoder self-attention, decoder self-attention).

Cross-attention: Q comes from the decoder; K and V come from the encoder (encoder-decoder attention).

Masked Self-Attention

Prevents attending to future tokens. Used in autoregressive decoders (GPT). Set attention scores to -∞ before softmax.

Scaled Dot-Product Attention from scratch (NumPy)
import numpy as np

def scaled_dot_product_attention(Q, K, V, mask=None):
    """
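    Q, K, V: (..., seq_len, d_k)
    mask: (..., seq_len, seq_len) optional; positions where mask == 0 are blocked
    """
    d_k = Q.shape[-1]
    # Swap the last two axes of K so this works for any number of leading dims
    scores = np.matmul(Q, np.swapaxes(K, -1, -2)) / np.sqrt(d_k)  # (..., seq_len, seq_len)
    
    if mask is not None:
        scores = np.where(mask == 0, -1e9, scores)  # e.g. a causal or padding mask
    
    # Numerically stable softmax: subtract the row max before exponentiating
    exp_scores = np.exp(scores - scores.max(axis=-1, keepdims=True))
    attention_weights = exp_scores / exp_scores.sum(axis=-1, keepdims=True)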
    output = np.matmul(attention_weights, V)
    return output, attention_weights

Multi-Head Attention

Instead of single attention, project Q, K, V h times with different learned linear projections, perform attention in parallel, concatenate, and project again.

Why multiple heads?

Each head learns different attention patterns: local, global, syntactic, semantic. Standard: h=8, 12, 16, 32 for large models.

MultiHead(Q, K, V) = Concat(head₁, ..., headₕ) Wᴼ

headᵢ = Attention(Q Wᵢ^Q, K Wᵢ^K, V Wᵢ^V)

Intuition: Multi-head attention allows the model to jointly attend to information from different representation subspaces at different positions.
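The project, attend, concatenate, project recipe can be sketched from scratch as follows. This is an illustration under assumptions, not the `nn.MultiheadAttention` implementation: the class name is hypothetical, it covers self-attention only, it has no masking or dropout, and it uses dₖ = d_model / h.

```python
import math
import torch
import torch.nn as nn

class MultiHeadSelfAttention(nn.Module):   # hypothetical name, for illustration
    def __init__(self, d_model, num_heads):
        super().__init__()
        assert d_model % num_heads == 0, "d_model must be divisible by num_heads"
        self.num_heads = num_heads
        self.d_k = d_model // num_heads
        # One fused projection per role; each output is split into num_heads slices
        self.w_q = nn.Linear(d_model, d_model)
        self.w_k = nn.Linear(d_model, d_model)
        self.w_v = nn.Linear(d_model, d_model)
        self.w_o = nn.Linear(d_model, d_model)   # W^O, applied after concatenation

    def split_heads(self, t):
        # (batch, seq, d_model) -> (batch, heads, seq, d_k)
        batch, seq, _ = t.shape
        return t.view(batch, seq, self.num_heads, self.d_k).transpose(1, 2)

    def forward(self, x):
        q = self.split_heads(self.w_q(x))
        k = self.split_heads(self.w_k(x))
        v = self.split_heads(self.w_v(x))

        scores = q @ k.transpose(-2, -1) / math.sqrt(self.d_k)   # (batch, heads, seq, seq)
        weights = torch.softmax(scores, dim=-1)
        heads = weights @ v                                       # (batch, heads, seq, d_k)

        # Concatenate heads: (batch, heads, seq, d_k) -> (batch, seq, d_model)
        batch, _, seq, _ = heads.shape
        concat = heads.transpose(1, 2).reshape(batch, seq, -1)
        return self.w_o(concat), weights

torch.manual_seed(0)
mha = MultiHeadSelfAttention(d_model=64, num_heads=8)
out, weights = mha(torch.randn(2, 10, 64))
```

Fusing the h per-head projections into one d_model × d_model matrix and reshaping is a common design choice, since it replaces h small matrix multiplications with one large one.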

Positional Encoding: Injecting Order

Sinusoidal Encodings

PE(pos, 2i) = sin(pos / 10000^{2i/d_model})
PE(pos, 2i+1) = cos(pos / 10000^{2i/d_model})

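Fixed, with no learned parameters. Can be computed for any position, so in principle it extends to sequences longer than those seen in training.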

Learned Positional Embeddings

Trainable vector per position (BERT, GPT). Simpler, but limited to max length.

Modern variants: RoPE (Rotary), ALiBi (attention bias).
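A minimal sketch of learned positional embeddings, assuming a hypothetical module name and illustrative sizes. It makes the "limited to max length" point concrete: there is simply no embedding row for positions at or beyond `max_len`.

```python
import torch
import torch.nn as nn

class LearnedPositionalEmbedding(nn.Module):   # hypothetical name, for illustration
    def __init__(self, vocab_size, d_model, max_len):
        super().__init__()
        self.token = nn.Embedding(vocab_size, d_model)
        self.position = nn.Embedding(max_len, d_model)   # one trainable vector per position
        self.max_len = max_len

    def forward(self, token_ids):
        seq_len = token_ids.size(1)
        if seq_len > self.max_len:
            raise ValueError(f"sequence length {seq_len} exceeds max_len {self.max_len}")
        positions = torch.arange(seq_len, device=token_ids.device)   # 0 .. seq_len-1
        # Position vectors broadcast across the batch dimension
        return self.token(token_ids) + self.position(positions)

torch.manual_seed(0)
embed = LearnedPositionalEmbedding(vocab_size=1000, d_model=32, max_len=128)
tokens = torch.randint(0, 1000, (4, 20))   # (batch, seq)
x = embed(tokens)                          # (batch, seq, d_model)
```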

Sinusoidal Positional Encoding (PyTorch)
import torch
import math

def positional_encoding(seq_len, d_model):
    pe = torch.zeros(seq_len, d_model)
    position = torch.arange(0, seq_len, dtype=torch.float).unsqueeze(1)
    div_term = torch.exp(torch.arange(0, d_model, 2).float() * 
                        -(math.log(10000.0) / d_model))
    pe[:, 0::2] = torch.sin(position * div_term)
    pe[:, 1::2] = torch.cos(position * div_term)
    return pe.unsqueeze(0)  # (1, seq_len, d_model)

Encoder-Decoder & Variants

Encoder-Only

BERT, RoBERTa, DeBERTa. Bidirectional context. Best for understanding tasks: classification, NER, extraction.

Decoder-Only

GPT, Llama, Mistral, Gemini. Autoregressive. Best for generation. Causal masking.

Encoder-Decoder

T5, BART. Sequence-to-sequence. Best for translation, summarization.

Encoder: Self-attention + FFN. Decoder: Masked self-attention + cross-attention + FFN.

Iconic Transformer Models (2017–2025)

BERT (2018)

Bidirectional Encoder. Masked LM + Next Sentence Prediction. 110M–340M params.

GPT-3 (2020)

Autoregressive decoder. 175B params. In-context learning.

T5 (2019)

Text-to-Text Transfer Transformer. Unified framework.

Vision Transformer (ViT) (2020)

Split image into patches, treat as sequence. No convolutions.
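The patch-splitting step can be sketched with plain reshapes. `patchify` is a helper written for this example, and the 224×224 image with 16×16 patches is an assumed, commonly used configuration. In a ViT each flattened patch would then go through a learned linear projection, which is omitted here.

```python
import torch

def patchify(images, patch_size):
    """(batch, channels, H, W) -> (batch, num_patches, patch_size*patch_size*channels)"""
    b, c, h, w = images.shape
    assert h % patch_size == 0 and w % patch_size == 0, "image size must be divisible by patch size"
    gh, gw = h // patch_size, w // patch_size          # patch grid height and width
    x = images.reshape(b, c, gh, patch_size, gw, patch_size)
    x = x.permute(0, 2, 4, 3, 5, 1)                    # (b, gh, gw, p, p, c)
    return x.reshape(b, gh * gw, patch_size * patch_size * c)

torch.manual_seed(0)
images = torch.randn(2, 3, 224, 224)
patches = patchify(images, patch_size=16)              # (2, 196, 768)
```

The resulting tensor has the same (batch, seq, feature) layout as a batch of token embeddings, which is what lets a standard encoder process it.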

Llama (2023)

Open-source, efficient. RMSNorm, SwiGLU, RoPE.

Mixture of Experts

Switch Transformer, Mixtral (Mistral AI). Sparse activation.

Transformer Block in PyTorch

Complete Transformer Encoder Layer
import torch.nn as nn
import torch.nn.functional as F

class TransformerEncoderLayer(nn.Module):
    def __init__(self, d_model, nhead, dim_feedforward=2048, dropout=0.1):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, nhead, dropout=dropout)
        self.linear1 = nn.Linear(d_model, dim_feedforward)
        self.linear2 = nn.Linear(dim_feedforward, d_model)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout1 = nn.Dropout(dropout)
        self.dropout2 = nn.Dropout(dropout)
        self.activation = F.relu
        
    def forward(self, src, src_mask=None, src_key_padding_mask=None):
        # src: (seq_len, batch, d_model), since nn.MultiheadAttention defaults to batch_first=False
        # Self-attention block with residual + norm
        x = src
        attn_out, _ = self.self_attn(x, x, x, attn_mask=src_mask,
                                     key_padding_mask=src_key_padding_mask)
        x = x + self.dropout1(attn_out)
        x = self.norm1(x)
        
        # Feedforward block with residual + norm
        ff_out = self.linear2(self.dropout2(self.activation(self.linear1(x))))
        x = x + self.dropout2(ff_out)
        x = self.norm2(x)
        return x

Training Large Language Models

📚 Pretraining objectives:
  • MLM: BERT-style, mask 15% of tokens
  • Autoregressive (CLM): GPT-style, predict next token
  • Span corruption: T5-style
⚡ Fine-tuning strategies:
  • Full fine-tuning
  • LoRA: Low-rank adapters
  • Prefix tuning, Adapters
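The autoregressive (CLM) objective from the pretraining list above comes down to a one-position shift between predictions and targets. A minimal sketch, assuming random logits as a stand-in for a decoder's output and illustrative sizes:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
batch, seq_len, vocab = 2, 6, 50   # illustrative sizes (assumption)

token_ids = torch.randint(0, vocab, (batch, seq_len))
logits = torch.randn(batch, seq_len, vocab)   # stand-in for a decoder's output

# Position t predicts token t+1: drop the last logit and the first token
pred = logits[:, :-1, :]          # (batch, seq_len-1, vocab)
target = token_ids[:, 1:]         # (batch, seq_len-1)

# cross_entropy expects (N, vocab) logits and (N,) class indices
loss = F.cross_entropy(pred.reshape(-1, vocab), target.reshape(-1))
```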
LoRA-style parameter-efficient fine-tuning (conceptual)
# LoRA: W = W_original + B*A, only B, A trainable
import torch
import torch.nn as nn

class LoRALayer(nn.Linear):
    def __init__(self, in_features, out_features, rank=4):
        super().__init__(in_features, out_features)
        # Freeze the original weight and bias first, so that the adapters
        # created below stay trainable
        self.requires_grad_(False)
        self.lora_A = nn.Parameter(torch.randn(in_features, rank) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(rank, out_features))
    
    def forward(self, x):
        # lora_B starts at zero, so the layer initially matches the frozen original
        return super().forward(x) + x @ self.lora_A @ self.lora_B

Transformers Beyond Text

Vision

ViT, Swin, DINOv2. Image classification, detection, segmentation.

Audio

Whisper, AudioMAE. Speech recognition, generation.

Biology

AlphaFold2, ESM. Protein folding, sequences.

Reinforcement Learning

Decision Transformer, Gato.

Multimodal

CLIP, Flamingo, LLaVA, GPT-4V.

Time Series

Informer, Autoformer.

Transformer Variants & Use Cases – Cheatsheet

BERT Encoder (understanding)
GPT Decoder (generation)
T5 Encoder-Decoder (seq2seq)
ViT Vision
Whisper Speech
RoPE Positional encoding
SwiGLU Activation
LoRA Efficient tuning

Transformer Model Comparison

Model | Year | Architecture | Params | Key Innovation
Transformer | 2017 | Enc-Dec | 65M | Self-attention, no RNN/CNN
BERT | 2018 | Encoder | 110M-340M | Masked LM, bidirectional
GPT-3 | 2020 | Decoder | 175B | Few-shot, in-context learning
T5 | 2019 | Enc-Dec | 11B | Text-to-text unified
ViT | 2020 | Encoder | 86M-632M | Image patches as tokens
Llama 2 | 2023 | Decoder | 7B-70B | Open, efficient, Grouped-Query Attention

Transformer Pitfalls & Debugging

⚠️ Quadratic complexity: O(n²) for attention. Use sparse attention, Linformer, or another long-sequence variant such as Longformer.
⚠️ Training instability: Warmup (5-10% steps), gradient clipping, Adam betas=(0.9, 0.98).
✅ Positional encoding: For very long sequences, use RoPE or ALiBi.
✅ Debug attention: Visualize attention maps – they should show structured patterns, neither collapsed onto a single token nor spread uniformly.
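One common way to implement the warmup mentioned above is the schedule from the original Transformer paper: linear warmup followed by inverse-square-root decay. The sketch below uses that paper's defaults (d_model = 512, 4000 warmup steps); the function name is chosen for this example, and other schedules such as linear or cosine decay after warmup are equally valid.

```python
def transformer_lr(step, d_model=512, warmup_steps=4000):
    """Learning rate with linear warmup followed by inverse-square-root decay."""
    step = max(step, 1)   # avoid division by zero at step 0
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

peak = transformer_lr(4000)   # the rate peaks exactly at the end of warmup
schedule = [transformer_lr(s) for s in (1, 1000, 4000, 16000)]
```

After warmup the rate falls with the square root of the step count, so at four times the warmup length it is half of the peak.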

Attention Mechanism: The Eyes of Neural Networks

What is Attention?

Attention is a neural component that dynamically computes a weighted sum of values, where the weights depend on the similarity between a query and the corresponding keys. It allows models to focus on specific parts of the input when producing each output element, loosely mimicking visual attention.

Query (Q) → Similarity ← Keys (K)
↓
Attention Weights (softmax)
↓
Weighted Sum of Values (V) → Context Vector

Core idea: Not all input elements are equally important. Learn to assign importance dynamically.

The Alignment Problem: Why Attention?

Seq2Seq without Attention

Encoder compresses entire source into one fixed-size vector → information bottleneck. Long sentences degrade rapidly.

"I love cats" → fixed vector (5-dim) → "Je ___ ?"

Seq2Seq with Attention

Decoder looks at all encoder states, weights them dynamically. Solves bottleneck, improves long-range translation.

Alignment: "cat" ↔ "chat" at step 3

Breakthrough (2014): Bahdanau et al. introduced attention to neural machine translation. BLEU scores jumped, and long sentences became tractable.

Bahdanau Attention (Additive)

Additive Attention Score

eᵢⱼ = vₐᵀ tanh(Wₐ [sᵢ₋₁; hⱼ])

or concat version: score(s, h) = vₐᵀ tanh(Wₐ[s; h])

Context vector cᵢ = Σⱼ αᵢⱼ hⱼ

Historical Significance

The first widely adopted attention mechanism for neural machine translation. Used in RNN encoder-decoders. Computationally expensive (a fully connected layer is evaluated for every decoder-encoder pair).

Key ingredients: bidirectional RNN encoder, concatenation, tanh.

Bahdanau Attention (PyTorch)
class BahdanauAttention(nn.Module):
    def __init__(self, hidden_dim):
        super().__init__()
        self.W_a = nn.Linear(hidden_dim * 2, hidden_dim)  # [s; h]
        self.v_a = nn.Linear(hidden_dim, 1, bias=False)
    
    def forward(self, query, encoder_outputs):
        # query: decoder hidden (batch, hidden)
        # encoder_outputs: (batch, seq_len, hidden)
        seq_len = encoder_outputs.size(1)
        query = query.unsqueeze(1).repeat(1, seq_len, 1)  # (batch, seq_len, hidden)
        
        # Combine query and encoder outputs
        energy = torch.tanh(self.W_a(torch.cat((query, encoder_outputs), dim=2)))  # (batch, seq_len, hidden)
        scores = self.v_a(energy).squeeze(2)  # (batch, seq_len)
        attn_weights = torch.softmax(scores, dim=1)
        
        context = torch.bmm(attn_weights.unsqueeze(1), encoder_outputs).squeeze(1)
        return context, attn_weights

Luong Attention (Multiplicative)

Scoring Functions

Dot: score = sᵀ h

General: score = sᵀ W h

Concat: score = vᵀ tanh(W[s; h])

Key Differences

Luong computes attention after decoder output (vs before in Bahdanau). Simpler, faster. Uses top-layer state only.

Types: global (all source steps) vs local (window).

Luong Dot-Product Attention
def luong_dot_attention(query, encoder_outputs):
    # query: (batch, 1, hidden)
    # encoder_outputs: (batch, seq_len, hidden)
    scores = torch.bmm(query, encoder_outputs.transpose(1, 2))  # (batch, 1, seq_len)
    attn_weights = torch.softmax(scores, dim=2)
    context = torch.bmm(attn_weights, encoder_outputs)
    return context, attn_weights

Scaled Dot-Product Attention

The Transformer Formula

Attention(Q, K, V) = softmax(QKᵀ / √dₖ) V

Q, K, V: queries, keys, values matrices.
√dₖ: scaling factor prevents softmax saturation.

Why Scaling?

For large dₖ, dot products grow large in magnitude, pushing softmax into regions of vanishing gradients. Scaling fixes this.
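The effect can be checked numerically. Assuming query and key entries are independent with zero mean and unit variance (the assumption behind the √dₖ factor), the dot product has variance dₖ, and dividing by √dₖ brings it back to about 1. The sample sizes below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
num_pairs = 10_000

def softmax(z):
    z = z - z.max()          # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

variances = {}
for d_k in (4, 64, 512):
    q = rng.standard_normal((num_pairs, d_k))
    k = rng.standard_normal((num_pairs, d_k))
    dots = (q * k).sum(axis=1)                 # raw dot products q·k
    variances[d_k] = (dots.var(), (dots / np.sqrt(d_k)).var())

# Effect on softmax for one query against 10 keys with d_k = 512
q = rng.standard_normal(512)
keys = rng.standard_normal((10, 512))
raw = softmax(keys @ q)                        # typically close to one-hot
scaled = softmax(keys @ q / np.sqrt(512))      # flatter distribution
```

In `variances`, the first entry of each pair grows roughly in proportion to dₖ while the second stays near 1. A near one-hot softmax has gradients close to zero, which is the saturation that scaling avoids.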

Scaled Dot-Product Attention (PyTorch)
import math
import torch

def scaled_dot_product_attention(Q, K, V, mask=None):
    d_k = Q.size(-1)
    scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(d_k)
    
    if mask is not None:
        scores = scores.masked_fill(mask == 0, -1e9)
    
    attention_weights = torch.softmax(scores, dim=-1)
    output = torch.matmul(attention_weights, V)
    return output, attention_weights
Attention Matrix: Rows = queries, Cols = keys. Each row sums to 1.

Multi-Head Attention

Instead of one attention function, project Q, K, V h times with different linear projections, perform attention in parallel, concatenate, and project.

Diverse Representations

Each head learns different relationships: syntactic, semantic, coreference, positional.

MultiHead(Q,K,V)

Concat(head₁,...,headₕ)Wᴼ

headᵢ = Attention(Q Wᵢ^Q, K Wᵢ^K, V Wᵢ^V)

Typical Values

h = 8, 12, 16, 32. dₖ = d_v = d_model / h.

Attention Variants: Self, Cross, Causal

Self-Attention

Q, K, V from same sequence. Each token attends to all tokens in the same sequence. Captures intra-sequence dependencies.

Used in: encoders, BERT

Cross-Attention

Q from decoder, K, V from encoder. Decoder attends to input sequence. Essential for seq2seq.

Used in: T5, BART

Causal (Masked) Attention

Prevents attending to future tokens. Upper triangular mask set to -∞. Used in autoregressive decoders.

Used in: GPT, Llama

Causal Attention Mask
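def causal_mask(size):
    """Boolean mask: True strictly above the diagonal (future positions), False on and below it.
    For scaled_dot_product_attention above, which blocks positions where mask == 0,
    pass ~causal_mask(size)."""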
    mask = torch.triu(torch.ones(size, size), diagonal=1).bool()
    return mask  # True where future tokens (to be masked)
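A short sketch of how such a mask is applied, assuming the convention that True marks blocked (future) positions and using illustrative sizes. After the softmax, every weight above the diagonal is exactly zero.

```python
import math
import torch

torch.manual_seed(0)
seq_len, d_k = 5, 16   # illustrative sizes (assumption)
Q = torch.randn(1, seq_len, d_k)
K = torch.randn(1, seq_len, d_k)
V = torch.randn(1, seq_len, d_k)

# True strictly above the diagonal: position i must not see positions j > i
future = torch.triu(torch.ones(seq_len, seq_len), diagonal=1).bool()

scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)        # (1, seq_len, seq_len)
scores = scores.masked_fill(future, float("-inf"))       # block future positions
weights = torch.softmax(scores, dim=-1)                  # each row sums to 1
output = weights @ V
```

The diagonal is never masked, so every row keeps at least one finite score and the softmax stays well defined.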

Visualizing Attention Weights

Alignment Matrix

Plot attention weights as heatmap. Rows = decoder steps, Cols = encoder steps. Reveals word alignment.

[0.9, 0.05, 0.05]
[0.1, 0.8, 0.1]
[0.1, 0.1, 0.8]
Probing Attention Heads

Certain heads specialize: positional heads attend to the previous or next token, syntactic heads attend to dependent tokens, and some heads attend mostly to rare words.

Tools: BertViz, exBERT, AttentionViz for interactive exploration.

Attention Beyond NLP

Vision

Spatial attention: Attend to relevant image regions. ViT uses self-attention on patches. Cross-attention in image captioning.

Audio

Speech recognition: Attend to acoustic frames. Listen, Attend and Spell (LAS).

Video

Temporal attention: Focus on relevant frames. Video transformers.

Multimodal

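Flamingo: cross-attention between image and text features. CLIP instead aligns separate image and text encoders with a contrastive objective, and LLaVA feeds projected image tokens into the language model's self-attention.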

Graphs

Graph Attention Networks (GAT): attend to neighbor nodes.

Reinforcement Learning

Attend to relevant observations in memory.

Attention Types – Cheatsheet

Bahdanau Additive, concat
Luong Dot, general
Scaled Dot QKᵀ/√d
Multi-Head Parallel
Self Intra-sequence
Cross Encoder-decoder
Causal Autoregressive
Spatial Vision

Attention Mechanism Comparison

Attention Type | Score Function | Complexity | Typical Use
Bahdanau (Additive) | vᵀ tanh(W[s; h]) | O(n·d²) | RNN seq2seq
Luong (Dot) | sᵀ h | O(n·d) | RNN, efficient
Scaled Dot-Product | QKᵀ/√d | O(n²·d) | Transformers
Multi-Head | h × scaled dot | O(n²·dₖ·h) = O(n²·d_model) | BERT, GPT
Graph Attention | LeakyReLU(aᵀ[Whᵢ; Whⱼ]) | O(E·d) | Graph networks

Attention Pitfalls & Debugging

⚠️ Attention collapse: All weights equal. Causes: bad initialization, lack of training, model too small.
⚠️ Quadratic complexity: O(n²) for self-attention. Use sparse attention, Linformer, Longformer.
✅ Debug: Always visualize attention matrices. Entropy should be moderate (not 0, not uniform).
✅ Multi-head diversity: Check correlation between heads. Low correlation = diverse features.
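The two checks above can be sketched as small helpers. Both function names are chosen for this example, and the attention maps are synthetic: one uniform head, one one-hot head, and random heads standing in for learned ones.

```python
import torch

def attention_entropy(weights, eps=1e-12):
    """Mean row entropy (in nats) per head; weights: (heads, queries, keys)."""
    row_entropy = -(weights * torch.log(weights + eps)).sum(dim=-1)   # (heads, queries)
    return row_entropy.mean(dim=-1)                                  # (heads,)

def head_correlation(weights):
    """Pearson correlation between the flattened attention maps of every pair of heads."""
    flat = weights.flatten(start_dim=1)        # (heads, queries*keys)
    return torch.corrcoef(flat)                # (heads, heads)

torch.manual_seed(0)
num_keys = 8
uniform = torch.full((1, num_keys, num_keys), 1.0 / num_keys)     # collapsed: all weights equal
one_hot = torch.eye(num_keys).unsqueeze(0)                        # degenerate: one key per query
random_head = torch.softmax(torch.randn(1, num_keys, num_keys), dim=-1)
weights = torch.cat([uniform, one_hot, random_head])              # (3 heads, 8, 8)

entropy = attention_entropy(weights)
max_entropy = torch.log(torch.tensor(float(num_keys)))            # entropy of a uniform row

# Correlation needs non-constant maps: a perfectly uniform head has zero
# variance, so its correlation with other heads is undefined (NaN)
heads = torch.softmax(torch.randn(4, num_keys, num_keys), dim=-1)
corr = head_correlation(heads)                                    # (4, 4)
```

Entropy near `max_entropy` flags a collapsed head and entropy near 0 flags a degenerate one. Off-diagonal entries of `corr` close to 1 suggest redundant heads.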