Neural Networks: Attention

Attention Mechanism

Attention is a differentiable way to build weighted sums: given a query, compare it to a set of keys, turn similarities into nonnegative weights with softmax, and sum corresponding values. Intuitively, the model learns what to look at. In encoder–decoder attention, queries come from the decoder and keys/values from the encoder so each output token can focus on relevant source positions. In self-attention, queries, keys, and values all come from the same sequence—every position attends to every position (subject to masking for autoregressive decoding).

Scaled Dot-Product Attention

For queries Q, keys K, and values V (as matrices of row vectors), Attention(Q, K, V) = softmax(QKᵀ / √dₖ) V. The dot product QKᵀ scores how much each query aligns with each key; dividing by √dₖ (the dimension of the key vectors) keeps the softmax from saturating when dₖ is large. The result is a mixture of value rows: each query’s output is a convex combination of the value vectors.
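
A minimal sketch of this formula in PyTorch (the function name and toy tensor shapes are illustrative assumptions, not a library API):

Scaled dot-product attention (sketch)
import math
import torch

def attention(Q, K, V):
    # Q, K, V: (batch, seq_len, d_k); rows of Q are queries, rows of K are keys
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)  # (batch, T_q, T_k) similarity logits
    weights = torch.softmax(scores, dim=-1)            # each row sums to 1
    return weights @ V                                 # convex combination of value rows

Q = K = V = torch.randn(2, 5, 64)       # toy self-attention input
out = attention(Q, K, V)                # (2, 5, 64)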

Multi-head attention runs several attention operations in parallel with different learned linear projections of Q, K, V, then concatenates the results and projects again—different heads can specialize in, for example, syntactic, long-range, or local patterns.
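
One way to picture the head split, as a hedged sketch (the class name and layer sizes are assumptions for illustration, not the internals of nn.MultiheadAttention):

Multi-head split and merge (sketch)
import torch
import torch.nn as nn

class TinyMultiHead(nn.Module):   # illustrative only
    def __init__(self, embed_dim=256, num_heads=8):
        super().__init__()
        assert embed_dim % num_heads == 0
        self.h, self.d = num_heads, embed_dim // num_heads
        self.q_proj = nn.Linear(embed_dim, embed_dim)
        self.k_proj = nn.Linear(embed_dim, embed_dim)
        self.v_proj = nn.Linear(embed_dim, embed_dim)
        self.out_proj = nn.Linear(embed_dim, embed_dim)

    def forward(self, x):
        B, T, E = x.shape
        # project, then reshape so each head attends independently
        q = self.q_proj(x).view(B, T, self.h, self.d).transpose(1, 2)  # (B, h, T, d)
        k = self.k_proj(x).view(B, T, self.h, self.d).transpose(1, 2)
        v = self.v_proj(x).view(B, T, self.h, self.d).transpose(1, 2)
        scores = q @ k.transpose(-2, -1) / self.d ** 0.5               # (B, h, T, T)
        out = torch.softmax(scores, dim=-1) @ v                        # (B, h, T, d)
        out = out.transpose(1, 2).reshape(B, T, E)                     # concatenate heads
        return self.out_proj(out)

y = TinyMultiHead()(torch.randn(2, 10, 256))   # (2, 10, 256)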

Masking

For language modeling, positions must not attend to future tokens. A causal mask sets logits to −∞ above the diagonal before softmax so those weights are zero. Padding masks zero out attention to pad tokens in batched sequences. Vision Transformers apply attention over image patches with similar machinery.
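
A small sketch of both mask types with toy shapes (variable names and sequence lengths are illustrative assumptions):

Causal and padding masks (sketch)
import torch

T = 5
# causal mask: True above the diagonal means "do not attend"
causal = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)

# apply to raw attention logits before softmax
scores = torch.randn(T, T)
scores = scores.masked_fill(causal, float('-inf'))
weights = torch.softmax(scores, dim=-1)   # zero weight on future positions

# padding mask: True where a batched sequence is padding
lengths = torch.tensor([5, 3])            # toy lengths for a batch of 2
pad_mask = torch.arange(T)[None, :] >= lengths[:, None]   # (batch, T)
# this is the shape nn.MultiheadAttention expects for key_padding_mask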

Attention is O(T²) in sequence length T for full self-attention—long contexts need sparse, linear, or chunked approximations in production systems.

PyTorch: MultiheadAttention

Self-attention layer (conceptual)
import torch
import torch.nn as nn

# embed_dim must be divisible by num_heads
mha = nn.MultiheadAttention(embed_dim=256, num_heads=8, batch_first=True)
# x: (batch, seq_len, embed_dim)
x = torch.randn(4, 100, 256)
out, attn_weights = mha(x, x, x)
# out: (4, 100, 256); attn_weights: (4, 100, 100), averaged over heads by default
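
The same layer also accepts masks; for instance, a boolean causal mask (True marks positions a query may not attend to) can be passed as attn_mask, continuing the example above:

# causal masking with the same layer (True entries are blocked)
causal = torch.triu(torch.ones(100, 100, dtype=torch.bool), diagonal=1)
out, attn_weights = mha(x, x, x, attn_mask=causal)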

Full Transformers stack MHA with feed-forward nets, residuals, and layer norm—see dedicated transformer tutorials for the complete block.
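
As one pointer, PyTorch packages a single encoder block as nn.TransformerEncoderLayer; a minimal sketch reusing the toy shapes above (dim_feedforward is an arbitrary choice for the example):

Single encoder block (sketch)
import torch
import torch.nn as nn

block = nn.TransformerEncoderLayer(d_model=256, nhead=8,
                                   dim_feedforward=1024, batch_first=True)
x = torch.randn(4, 100, 256)   # (batch, seq_len, embed_dim)
y = block(x)                   # same shape; MHA + feed-forward + residuals + layer norm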

Summary

  • Attention = softmax-normalized key–query similarity applied to values.
  • Scaling by √dₖ stabilizes gradients; multi-head attention increases representational flexibility.
  • Encoder–decoder vs self-attention differ in where Q, K, V are drawn from.
  • Masks enforce causality and ignore padding; complexity scales quadratically with length.

Next in the syllabus: transfer learning—reuse pretrained representations for new tasks.