Attention Mechanism
Attention is a differentiable way to build weighted sums: given a query, compare it to a set of keys, turn similarities into nonnegative weights with softmax, and sum corresponding values. Intuitively, the model learns what to look at. In encoder–decoder attention, queries come from the decoder and keys/values from the encoder so each output token can focus on relevant source positions. In self-attention, queries, keys, and values all come from the same sequence—every position attends to every position (subject to masking for autoregressive decoding).
Scaled Dot-Product Attention
For queries Q, keys K, values V (as matrices of row-vectors), Attention(Q, K, V) = softmax(QKᵀ / √dk) V. The dot product QKᵀ scores how much each query aligns with each key; dividing by √dk (dk is the dimension of the key vectors) keeps softmax from saturating when dk is large. The result is a mixture of value rows—each query’s output is a convex combination of values.
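The formula above can be written out directly. This is a minimal sketch (the function name and shapes are our own illustration, not a library API); PyTorch also ships a built-in version of this operation.

```python
import math
import torch

def scaled_dot_product_attention(q, k, v, mask=None):
    # q, k, v: (..., seq_len, d_k); scores: (..., seq_len, seq_len)
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)
    if mask is not None:
        # True entries in the mask are blocked from attention
        scores = scores.masked_fill(mask, float("-inf"))
    weights = torch.softmax(scores, dim=-1)  # each row sums to 1
    return weights @ v, weights
```

Each output row is a convex combination of the rows of v, weighted by the softmax row.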
Multi-head attention runs several attention operations in parallel with different learned linear projections of Q, K, V, then concatenates and projects again—different heads can specialize in syntax, long-range, or local patterns.
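The project–split–attend–concatenate pattern can be sketched as follows. This is an illustrative module, not PyTorch's own implementation; the class and method names are our own, and it assumes embed_dim is divisible by num_heads.

```python
import math
import torch
import torch.nn as nn

class SimpleMultiHeadAttention(nn.Module):
    """Sketch of multi-head self-attention: project, split into heads,
    attend per head, concatenate, project back."""
    def __init__(self, embed_dim, num_heads):
        super().__init__()
        assert embed_dim % num_heads == 0
        self.num_heads = num_heads
        self.head_dim = embed_dim // num_heads
        self.q_proj = nn.Linear(embed_dim, embed_dim)
        self.k_proj = nn.Linear(embed_dim, embed_dim)
        self.v_proj = nn.Linear(embed_dim, embed_dim)
        self.out_proj = nn.Linear(embed_dim, embed_dim)

    def forward(self, x):
        b, t, e = x.shape
        # (b, t, e) -> (b, num_heads, t, head_dim)
        def split(z):
            return z.view(b, t, self.num_heads, self.head_dim).transpose(1, 2)
        q, k, v = split(self.q_proj(x)), split(self.k_proj(x)), split(self.v_proj(x))
        scores = q @ k.transpose(-2, -1) / math.sqrt(self.head_dim)
        out = torch.softmax(scores, dim=-1) @ v
        out = out.transpose(1, 2).reshape(b, t, e)  # concatenate heads
        return self.out_proj(out)
```

Because each head works in a head_dim-sized subspace, the total cost is comparable to a single full-dimension attention.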
Masking
For language modeling, positions must not attend to future tokens. A causal mask sets logits to −∞ above the diagonal before softmax so those weights are zero. Padding masks zero out attention to pad tokens in batched sequences. Vision Transformers apply attention over image patches with similar machinery.
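A causal mask is a small upper-triangular boolean matrix; a sketch of building and applying one:

```python
import torch

seq_len = 5
# True above the diagonal marks future positions to block
causal = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
scores = torch.randn(seq_len, seq_len)
# Setting blocked logits to -inf makes their softmax weight exactly zero
weights = torch.softmax(scores.masked_fill(causal, float("-inf")), dim=-1)
# Row i now has nonzero weight only on positions 0..i
```

A padding mask is built the same way, but from the positions of pad tokens rather than from the diagonal.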
PyTorch: MultiheadAttention
import torch
import torch.nn as nn
# embed_dim must be divisible by num_heads
mha = nn.MultiheadAttention(embed_dim=256, num_heads=8, batch_first=True)
# x: (batch, seq_len, embed_dim)
x = torch.randn(4, 100, 256)
out, attn_weights = mha(x, x, x)
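To make the same module causal, pass a boolean attn_mask, where True marks positions that may not be attended to:

```python
import torch
import torch.nn as nn

mha = nn.MultiheadAttention(embed_dim=256, num_heads=8, batch_first=True)
x = torch.randn(4, 100, 256)
# Boolean causal mask: True blocks attention to future positions
causal = torch.triu(torch.ones(100, 100, dtype=torch.bool), diagonal=1)
out, attn_weights = mha(x, x, x, attn_mask=causal)
# attn_weights are averaged over heads by default: (batch, tgt_len, src_len)
```

Padding can be handled the same way via the key_padding_mask argument, a (batch, seq_len) boolean tensor with True at pad positions.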
Full Transformers stack MHA with feed-forward nets, residuals, and layer norm—see dedicated transformer tutorials for the complete block.
Summary
- Attention = softmax-normalized key–query similarity applied to values.
- Scaling by √dk stabilizes gradients; multi-head increases representational flexibility.
- Encoder–decoder vs self-attention differ in where Q, K, V are drawn from.
- Masks enforce causality and ignore padding; complexity scales quadratically with length.
Next in the syllabus: transfer learning—reuse pretrained representations for new tasks.