Neural Networks: Attention

Attention Mechanism

Attention is a differentiable way to build weighted sums: given a query, compare it to a set of keys, turn similarities into nonnegative weights with softmax, and sum corresponding values. Intuitively, the model learns what to look at. In encoder–decoder attention, queries come from the decoder and keys/values from the encoder so each output token can focus on relevant source positions. In self-attention, queries, keys, and values all come from the same sequence—every position attends to every position (subject to masking for autoregressive decoding).

Scaled Dot-Product Attention

For queries Q, keys K, and values V (as matrices of row vectors), Attention(Q, K, V) = softmax(QKᵀ / √dₖ) V. The dot product QKᵀ scores how much each query aligns with each key; dividing by √dₖ (the dimension of the key vectors) keeps the softmax from saturating when dₖ is large. The result is a mixture of value rows: each query’s output is a convex combination of the value vectors.
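
A minimal sketch of this formula in PyTorch (the function name and toy tensor shapes are illustrative assumptions, not a library API):

Scaled dot-product attention (sketch)
import math
import torch

def attention(Q, K, V):
    # Q, K, V: (batch, seq_len, d_k); rows of Q are queries, rows of K are keys
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)  # (batch, T_q, T_k) similarity logits
    weights = torch.softmax(scores, dim=-1)            # each row sums to 1
    return weights @ V                                 # convex combination of value rows

Q = K = V = torch.randn(2, 5, 64)       # toy self-attention input
out = attention(Q, K, V)                # (2, 5, 64)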

Multi-head attention runs several attention operations in parallel with different learned linear projections of Q, K, V, then concatenates the results and projects again—different heads can specialize in, for example, syntactic, long-range, or local patterns.
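
One way to picture the head split, as a hedged sketch (the class name and layer sizes are assumptions for illustration, not the internals of nn.MultiheadAttention):

Multi-head split and merge (sketch)
import torch
import torch.nn as nn

class TinyMultiHead(nn.Module):   # illustrative only
    def __init__(self, embed_dim=256, num_heads=8):
        super().__init__()
        assert embed_dim % num_heads == 0
        self.h, self.d = num_heads, embed_dim // num_heads
        self.q_proj = nn.Linear(embed_dim, embed_dim)
        self.k_proj = nn.Linear(embed_dim, embed_dim)
        self.v_proj = nn.Linear(embed_dim, embed_dim)
        self.out_proj = nn.Linear(embed_dim, embed_dim)

    def forward(self, x):
        B, T, E = x.shape
        # project, then reshape so each head attends independently
        q = self.q_proj(x).view(B, T, self.h, self.d).transpose(1, 2)  # (B, h, T, d)
        k = self.k_proj(x).view(B, T, self.h, self.d).transpose(1, 2)
        v = self.v_proj(x).view(B, T, self.h, self.d).transpose(1, 2)
        scores = q @ k.transpose(-2, -1) / self.d ** 0.5               # (B, h, T, T)
        out = torch.softmax(scores, dim=-1) @ v                        # (B, h, T, d)
        out = out.transpose(1, 2).reshape(B, T, E)                     # concatenate heads
        return self.out_proj(out)

y = TinyMultiHead()(torch.randn(2, 10, 256))   # (2, 10, 256)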

Masking

For language modeling, positions must not attend to future tokens. A causal mask sets logits to −∞ above the diagonal before softmax so those weights are zero. Padding masks zero out attention to pad tokens in batched sequences. Vision Transformers apply attention over image patches with similar machinery.
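
A small sketch of both mask types with toy shapes (variable names and sequence lengths are illustrative assumptions):

Causal and padding masks (sketch)
import torch

T = 5
# causal mask: True above the diagonal means "do not attend"
causal = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)

# apply to raw attention logits before softmax
scores = torch.randn(T, T)
scores = scores.masked_fill(causal, float('-inf'))
weights = torch.softmax(scores, dim=-1)   # zero weight on future positions

# padding mask: True where a batched sequence is padding
lengths = torch.tensor([5, 3])            # toy lengths for a batch of 2
pad_mask = torch.arange(T)[None, :] >= lengths[:, None]   # (batch, T)
# this is the shape nn.MultiheadAttention expects for key_padding_mask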

Attention is O(T²) in sequence length T for full self-attention—long contexts need sparse, linear, or chunked approximations in production systems.

PyTorch: MultiheadAttention

Self-attention layer (conceptual)
import torch
import torch.nn as nn

# embed_dim must be divisible by num_heads
mha = nn.MultiheadAttention(embed_dim=256, num_heads=8, batch_first=True)
# x: (batch, seq_len, embed_dim)
x = torch.randn(4, 100, 256)
out, attn_weights = mha(x, x, x)
# out: (4, 100, 256); attn_weights: (4, 100, 100), averaged over heads by default
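
The same layer also accepts masks; for instance, a boolean causal mask (True marks positions a query may not attend to) can be passed as attn_mask, continuing the example above:

# causal masking with the same layer (True entries are blocked)
causal = torch.triu(torch.ones(100, 100, dtype=torch.bool), diagonal=1)
out, attn_weights = mha(x, x, x, attn_mask=causal)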

Full Transformers stack MHA with feed-forward nets, residuals, and layer norm—see dedicated transformer tutorials for the complete block.
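
As one pointer, PyTorch packages a single encoder block as nn.TransformerEncoderLayer; a minimal sketch reusing the toy shapes above (dim_feedforward is an arbitrary choice for the example):

Single encoder block (sketch)
import torch
import torch.nn as nn

block = nn.TransformerEncoderLayer(d_model=256, nhead=8,
                                   dim_feedforward=1024, batch_first=True)
x = torch.randn(4, 100, 256)   # (batch, seq_len, embed_dim)
y = block(x)                   # same shape; MHA + feed-forward + residuals + layer norm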

Summary

  • Attention = softmax-normalized key–query similarity applied to values.
  • Scaling by √dₖ stabilizes gradients; multi-head attention increases representational flexibility.
  • Encoder–decoder vs self-attention differ in where Q, K, V are drawn from.
  • Masks enforce causality and ignore padding; complexity scales quadratically with length.

Next in the syllabus: transfer learning—reuse pretrained representations for new tasks.