Transformers MCQ · test your attention knowledge
From self‑attention to BERT & GPT – 15 questions covering multi‑head attention, positional encoding, masking, and core transformer concepts.
Transformers: the foundation of modern NLP
The Transformer architecture, introduced in "Attention Is All You Need", replaces recurrence with self‑attention. This MCQ tests your understanding of attention mechanisms, positional encoding, and the building blocks of models like BERT and GPT.
Why self‑attention?
Self‑attention allows each token to attend to all other tokens in the sequence, capturing long‑range dependencies and enabling parallelization – a major improvement over RNNs/LSTMs.
Transformer glossary – key concepts
Self‑Attention (Scaled Dot‑Product)
Attention(Q,K,V) = softmax(QKᵀ/√dₖ)V. Computes relevance between all pairs of positions.
Multi‑Head Attention
Runs multiple attention heads in parallel, allowing the model to focus on different subspaces.
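The head-splitting idea can be sketched in a few lines of NumPy. This is illustrative only: the learned projections W_Q, W_K, W_V and the final output projection of real multi-head attention are omitted, and the names `multi_head_attention` and `num_heads` are ours, not from any library.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(Q, K, V, num_heads):
    """Split d_model into num_heads subspaces, attend in each, concatenate."""
    seq_len, d_model = Q.shape
    d_k = d_model // num_heads        # per-head dimension (assumes even split)
    # (seq_len, d_model) -> (num_heads, seq_len, d_k)
    split = lambda X: X.reshape(seq_len, num_heads, d_k).transpose(1, 0, 2)
    Qh, Kh, Vh = split(Q), split(K), split(V)
    scores = Qh @ Kh.transpose(0, 2, 1) / np.sqrt(d_k)   # (heads, seq, seq)
    weights = softmax(scores)                            # per-head attention
    out = weights @ Vh                                   # (heads, seq, d_k)
    # concatenate heads back: (heads, seq, d_k) -> (seq, d_model)
    return out.transpose(1, 0, 2).reshape(seq_len, d_model)
```

Each head attends over the full sequence but within its own d_k-dimensional slice, which is what lets different heads specialize in different subspaces.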
Positional Encoding
Adds information about token order (since self‑attention is permutation‑invariant). Usually sine/cosine or learned.
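The sine/cosine variant can be sketched as follows (a minimal NumPy version; the function name is ours, and `d_model` is assumed even):

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    """Sinusoidal positional encodings in the style of "Attention Is All You Need"."""
    pos = np.arange(seq_len)[:, None]        # (seq_len, 1) token positions
    i = np.arange(d_model // 2)[None, :]     # (1, d_model/2) dimension pairs
    angles = pos / (10000 ** (2 * i / d_model))
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)             # even dimensions: sine
    pe[:, 1::2] = np.cos(angles)             # odd dimensions: cosine
    return pe
```

Each dimension pair oscillates at a different wavelength, giving every position a distinct pattern that the model can use to recover order.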
Encoder‑Decoder
Encoder maps input to representations; decoder generates output autoregressively (with cross‑attention).
Masked Attention
In decoders, prevents attending to future tokens (causal masking). Note that BERT's masked language modeling is a different mechanism: it masks input tokens (e.g. replacing them with [MASK]) rather than masking attention scores.
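A causal mask is just a lower-triangular matrix. Under the common convention that 0 marks blocked positions, a minimal sketch:

```python
import numpy as np

def causal_mask(seq_len):
    """Lower-triangular mask: position i may attend only to positions <= i."""
    return np.tril(np.ones((seq_len, seq_len), dtype=int))

# causal_mask(3):
# [[1 0 0]
#  [1 1 0]
#  [1 1 1]]
```

Positions where the mask is 0 are set to a large negative value before the softmax, so they receive (effectively) zero attention weight.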
BERT (Encoder‑only)
Bidirectional representations, pretrained with masked LM and next sentence prediction.
GPT (Decoder‑only)
Autoregressive language model, trained to predict next token. Uses causal masking.
# Scaled dot‑product attention (simplified, PyTorch)
import math
import torch
import torch.nn.functional as F

def attention(Q, K, V, mask=None):
    d_k = Q.size(-1)  # key/query dimension
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)
    if mask is not None:  # positions where mask == 0 are blocked
        scores = scores.masked_fill(mask == 0, -1e9)
    weights = F.softmax(scores, dim=-1)  # each row sums to 1
    return weights @ V
Common Transformer interview questions
- Why is self‑attention considered more parallelizable than RNNs?
- Explain the role of the scaling factor √dₖ in attention.
- What is the purpose of multi‑head attention?
- How does positional encoding work? Why sine/cosine functions?
- Compare encoder‑only (BERT) vs decoder‑only (GPT) architectures.
- What is masked attention and where is it used?
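As a quick numerical check on the √dₖ question above (a hedged NumPy experiment, not from the source): dot products of independent unit-variance vectors have variance roughly dₖ, so dividing by √dₖ keeps pre-softmax scores at unit scale and prevents the softmax from saturating.

```python
import numpy as np

rng = np.random.default_rng(0)
d_k = 512
# Dot products of independent standard-normal vectors grow with d_k.
dots = np.array([rng.standard_normal(d_k) @ rng.standard_normal(d_k)
                 for _ in range(2000)])
print(dots.var())                    # roughly d_k (~512)
print((dots / np.sqrt(d_k)).var())   # roughly 1 after scaling
```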