Neural Networks

RNN & Attention

Recurrent neural networks and attention mechanisms for sequence modeling.

Recurrent Neural Networks

Vanilla RNN

Schematically: h_t = tanh(W_hh h_{t-1} + W_xh x_t + b). The output at each step can feed a loss (many-to-many), or only the final hidden state can classify the whole sequence (many-to-one). Bidirectional RNNs run one RNN forward and one backward, concatenating states so each position sees both past and future context. This is common in tagging, but not usable for causal autoregressive generation without masking tricks.
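To make the recurrence concrete, here is a minimal sketch of the update written as an explicit loop. The sizes, the random weights, and the variable names (`states`, `last_state`) are illustrative assumptions, not part of any library API.

```python
import torch

torch.manual_seed(0)
input_size, hidden_size, seq_len = 4, 3, 5  # illustrative sizes

# Shared weights, reused at every timestep.
W_xh = torch.randn(hidden_size, input_size) * 0.1
W_hh = torch.randn(hidden_size, hidden_size) * 0.1
b = torch.zeros(hidden_size)

x = torch.randn(seq_len, input_size)  # one sequence, no batch dimension
h = torch.zeros(hidden_size)          # h_0
states = []
for x_t in x:
    # h_t = tanh(W_hh h_{t-1} + W_xh x_t + b)
    h = torch.tanh(W_hh @ h + W_xh @ x_t + b)
    states.append(h)

states = torch.stack(states)  # (seq_len, hidden_size): per-step states, many-to-many
last_state = states[-1]       # final state only, many-to-one
```

A per-step head would consume `states`, while a sequence classifier would consume only `last_state`.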

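Training usually truncates backpropagation through time (BPTT) to a fixed window to limit memory. Even so, very long dependencies remain hard for plain RNNs, because gradients vanish or explode when propagated across many steps.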
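One common way to truncate BPTT is to process the sequence in fixed-size chunks and detach the hidden state between them. The sketch below assumes synthetic data, a window of 20 steps, and a small regression head; all of these are hypothetical choices for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
rnn = nn.RNN(input_size=8, hidden_size=16, batch_first=True)
head = nn.Linear(16, 1)
opt = torch.optim.SGD(list(rnn.parameters()) + list(head.parameters()), lr=0.01)

x = torch.randn(2, 100, 8)  # (batch, seq_len, input_size), synthetic inputs
y = torch.randn(2, 100, 1)  # synthetic per-step targets
window = 20                 # truncation length (hypothetical choice)

h = None                    # nn.RNN starts from a zero state when h is None
losses = []
for start in range(0, x.size(1), window):
    chunk_x = x[:, start:start + window]
    chunk_y = y[:, start:start + window]
    out, h = rnn(chunk_x, h)
    loss = nn.functional.mse_loss(head(out), chunk_y)
    opt.zero_grad()
    loss.backward()
    opt.step()
    # Keep the state's values but cut the graph, so the next backward
    # pass stops at this window boundary.
    h = h.detach()
    losses.append(loss.item())
```

The hidden state still carries information forward across chunks; only the gradient is cut, which is why dependencies longer than the window are learned poorly.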

LSTM and GRU

LSTM adds a cell state c_t and three gates: forget, input, and output. The cell state updates additively, giving gradients a "highway" that reduces vanishing compared to repeated tanh squashing alone. GRU merges these ideas into fewer gates (reset and update), often with similar quality and fewer parameters. In PyTorch, nn.GRU is a drop-in replacement for nn.RNN; nn.LSTM takes the same constructor arguments but returns its state as a tuple (h_n, c_n).
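The parameter difference can be checked directly. This sketch builds single-layer `nn.RNN`, `nn.GRU`, and `nn.LSTM` modules with the same sizes and counts their parameters; the helper `n_params` and the chosen sizes are assumptions for illustration. It also shows that `nn.LSTM` returns its state as an `(h_n, c_n)` tuple, while `nn.GRU` returns a single tensor.

```python
import torch
import torch.nn as nn

def n_params(module: nn.Module) -> int:
    """Total number of learnable scalars in a module."""
    return sum(p.numel() for p in module.parameters())

kwargs = dict(input_size=128, hidden_size=256, num_layers=1, batch_first=True)
rnn, gru, lstm = nn.RNN(**kwargs), nn.GRU(**kwargs), nn.LSTM(**kwargs)

base = n_params(rnn)  # one set of input and recurrent weights plus biases
print(base, n_params(gru), n_params(lstm))

x = torch.randn(2, 10, 128)
out_gru, h_gru = gru(x)                  # state is a single tensor
out_lstm, (h_lstm, c_lstm) = lstm(x)     # state is a (hidden, cell) tuple
```

GRU stacks three such weight sets and LSTM four, so with identical sizes a GRU has three quarters of the LSTM's parameters.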

For new sequence projects, try a Transformer or 1D CNN + attention first if data and compute allow; fall back to LSTM for tight latency or tiny footprints.

PyTorch: `nn.LSTM`

Batch-first LSTM

import torch
import torch.nn as nn

# x: (batch, seq_len, input_size)
lstm = nn.LSTM(input_size=128, hidden_size=256, num_layers=2, batch_first=True)
x = torch.randn(32, 50, 128)
out, (h_n, c_n) = lstm(x)
# out[:, -1, :] is the last timestep; use the full out for per-step heads
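As a follow-up sketch, the last timestep of `out` can feed a many-to-one classifier. The `classifier` layer and its 5 output classes are hypothetical. For a unidirectional LSTM on unpadded input, `out[:, -1, :]` matches the final hidden state of the top layer, `h_n[-1]`.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
lstm = nn.LSTM(input_size=128, hidden_size=256, num_layers=2, batch_first=True)
classifier = nn.Linear(256, 5)  # 5 classes is a hypothetical choice

x = torch.randn(32, 50, 128)
out, (h_n, c_n) = lstm(x)

# out: (batch, seq_len, hidden_size), top-layer state at every timestep
# h_n: (num_layers, batch, hidden_size), final state of each layer
last_step = out[:, -1, :]
same = torch.allclose(last_step, h_n[-1], atol=1e-6)

logits = classifier(last_step)  # (batch, num_classes)
```

With padded batches the last position may be padding, so this equivalence holds only when every sequence fills the full length.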

Summary

RNNs map sequences via a recurrent hidden state and shared weights across time.
Backpropagation through time suffers from vanishing and exploding gradients; LSTM/GRU gates mitigate vanishing over longer spans.
Bidirectional RNNs use future context; unidirectional suits online/decoding settings.
Next: Attention, a soft, content-based aggregation that powers Transformers.

Attention Mechanism

Scaled Dot-Product Attention

For queries Q, keys K, values V (as matrices of row-vectors), Attention(Q, K, V) = softmax(QK^T / √d_k) V. The dot product QK^T scores how much each query aligns with each key; dividing by √d_k (where d_k is the dimension of the key vectors) keeps softmax from saturating when d_k is large. The result is a mixture of value rows: each query's output is a convex combination of values.
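The formula translates almost line for line into code. This is a minimal sketch with random inputs; the function name `scaled_dot_product_attention` is chosen here for illustration, and the example does not call PyTorch's built-in function of the same name.

```python
import math
import torch

def scaled_dot_product_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V for row-vector matrices Q, K, V."""
    d_k = K.size(-1)
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)  # (n_queries, n_keys)
    weights = torch.softmax(scores, dim=-1)            # each row sums to 1
    return weights @ V, weights

torch.manual_seed(0)
Q = torch.randn(3, 16)  # 3 queries, d_k = 16
K = torch.randn(5, 16)  # 5 keys
V = torch.randn(5, 8)   # 5 values, d_v = 8
out, weights = scaled_dot_product_attention(Q, K, V)
```

Because each row of `weights` is non-negative and sums to 1, each row of `out` is a convex combination of the rows of `V`.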

Multi-head attention runs several attention operations in parallel with different learned linear projections of Q, K, V, then concatenates the results and projects again. Different heads can specialize in syntax, long-range, or local patterns.
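The project, split, attend, concatenate, project pipeline can be sketched as a small module. `TinyMultiHead` is a hypothetical teaching class, not PyTorch's `nn.MultiheadAttention`, and it omits masking and dropout.

```python
import math
import torch
import torch.nn as nn

class TinyMultiHead(nn.Module):
    """Self-attention with several heads (illustrative sketch)."""

    def __init__(self, embed_dim: int, num_heads: int):
        super().__init__()
        assert embed_dim % num_heads == 0
        self.num_heads = num_heads
        self.head_dim = embed_dim // num_heads
        self.q_proj = nn.Linear(embed_dim, embed_dim)
        self.k_proj = nn.Linear(embed_dim, embed_dim)
        self.v_proj = nn.Linear(embed_dim, embed_dim)
        self.out_proj = nn.Linear(embed_dim, embed_dim)

    def _split(self, t):
        # (batch, seq, embed) -> (batch, heads, seq, head_dim)
        b, n, _ = t.shape
        return t.view(b, n, self.num_heads, self.head_dim).transpose(1, 2)

    def forward(self, x):
        q = self._split(self.q_proj(x))
        k = self._split(self.k_proj(x))
        v = self._split(self.v_proj(x))
        scores = q @ k.transpose(-2, -1) / math.sqrt(self.head_dim)
        weights = torch.softmax(scores, dim=-1)  # (batch, heads, seq, seq)
        heads = weights @ v                      # (batch, heads, seq, head_dim)
        b, _, n, _ = heads.shape
        merged = heads.transpose(1, 2).reshape(b, n, -1)  # concatenate heads
        return self.out_proj(merged), weights

torch.manual_seed(0)
mh = TinyMultiHead(embed_dim=32, num_heads=4)
y, w = mh(torch.randn(2, 6, 32))
```

Each head works in a `head_dim`-sized subspace, so the total cost stays close to that of one full-width attention.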

Masking

For language modeling, positions must not attend to future tokens. A causal mask sets logits to −∞ above the diagonal before softmax so those weights are zero. Padding masks zero out attention to pad tokens in batched sequences. Vision Transformers apply attention over image patches with similar machinery.
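Both masks can be built as boolean tensors and applied to the logits before softmax. In this sketch the scores are random and the sequence lengths (5 and 3) are made up; `True` marks a position that must not be attended to.

```python
import torch

torch.manual_seed(0)
T = 5
scores = torch.randn(2, T, T)  # (batch, query, key) raw logits

# Causal mask: True strictly above the diagonal marks future keys.
causal = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)

# Padding mask: True marks pad positions. Sequence 1 has only 3 real tokens.
lengths = torch.tensor([5, 3])
pad = torch.arange(T)[None, :] >= lengths[:, None]  # (batch, key)

blocked = causal[None, :, :] | pad[:, None, :]      # (batch, query, key)
weights = torch.softmax(scores.masked_fill(blocked, float("-inf")), dim=-1)
```

Every query row here keeps at least one unblocked key, so softmax never sees a row of all −∞. Rows that belong to padded query positions still produce weights and are normally excluded from the loss.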

Attention is O(T²) in sequence length T for full self-attention, so long contexts need sparse, linear, or chunked approximations in production systems.
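A quick back-of-the-envelope calculation shows the quadratic growth. The helper `score_matrix_bytes` is hypothetical and counts only the T × T score matrix in float32, ignoring activations, gradients, and any fused-kernel savings.

```python
def score_matrix_bytes(seq_len: int, batch: int = 1, heads: int = 1,
                       bytes_per_elem: int = 4) -> int:
    """Bytes needed to store the full (seq_len x seq_len) attention scores."""
    return batch * heads * seq_len * seq_len * bytes_per_elem

for T in (1_024, 4_096, 16_384):
    mib = score_matrix_bytes(T) / 2**20
    print(f"T={T:>6}: {mib:,.0f} MiB per head per sequence")
```

Quadrupling T multiplies the footprint by 16: 4 MiB at T = 1,024, 64 MiB at T = 4,096, and 1,024 MiB at T = 16,384.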

PyTorch: `MultiheadAttention`

Self-attention layer (conceptual)

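import torch
import torch.nn as nn

# embed_dim must be divisible by num_heads
mha = nn.MultiheadAttention(embed_dim=256, num_heads=8, batch_first=True)
# x: (batch, seq_len, embed_dim)
x = torch.randn(4, 100, 256)
# Self-attention: query, key, and value are all x
out, attn_weights = mha(x, x, x)
# out: (4, 100, 256); attn_weights: (4, 100, 100), averaged over heads by default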
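The same layer accepts a causal mask through its `attn_mask` argument. The sketch below assumes a boolean mask, where `True` means the position may not be attended to.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
mha = nn.MultiheadAttention(embed_dim=256, num_heads=8, batch_first=True)
x = torch.randn(4, 100, 256)

# Boolean attn_mask of shape (seq_len, seq_len): True blocks attention.
T = x.size(1)
causal = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)

out, attn_weights = mha(x, x, x, attn_mask=causal)
# out: (4, 100, 256); attn_weights: (4, 100, 100), averaged over heads
```

Padding can be handled in the same call with `key_padding_mask`, a `(batch, seq_len)` boolean tensor where `True` marks pad tokens.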

Full Transformers stack MHA with feed-forward nets, residuals, and layer norm; see dedicated transformer tutorials for the complete block.
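For orientation only, one such block can be sketched in a few lines. `EncoderBlock` is a hypothetical class using post-attention layer norm and a ReLU feed-forward net with an assumed width of 1024; real implementations differ in norm placement, dropout, and activation.

```python
import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    """MHA + feed-forward, each wrapped in a residual and layer norm (sketch)."""

    def __init__(self, embed_dim: int = 256, num_heads: int = 8, ff_dim: int = 1024):
        super().__init__()
        self.mha = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        self.ff = nn.Sequential(
            nn.Linear(embed_dim, ff_dim),
            nn.ReLU(),
            nn.Linear(ff_dim, embed_dim),
        )
        self.norm1 = nn.LayerNorm(embed_dim)
        self.norm2 = nn.LayerNorm(embed_dim)

    def forward(self, x):
        attn_out, _ = self.mha(x, x, x, need_weights=False)
        x = self.norm1(x + attn_out)         # residual around attention
        return self.norm2(x + self.ff(x))    # residual around feed-forward

torch.manual_seed(0)
block = EncoderBlock()
y = block(torch.randn(4, 100, 256))
```

The output has the same shape as the input, which is what lets these blocks be stacked.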

Summary

Attention = softmax-normalized key–query similarity applied to values.
Scaling by √d_k stabilizes gradients; multi-head increases representational flexibility.
Encoder–decoder vs self-attention differ in where Q, K, V are drawn from.
Masks enforce causality and ignore padding; complexity scales quadratically with length.
