Attention Mechanism MCQ · test your transformer knowledge
From Bahdanau to BERT – 15 questions covering self‑attention, multi‑head, scaled dot‑product, and modern architectures (Transformer, GPT).
Attention Mechanism: the key to modern NLP
Attention allows models to focus on relevant parts of the input when producing each output. Introduced for machine translation (Bahdanau et al.), it became the foundation of the Transformer architecture (Vaswani et al.) and models like BERT, GPT, and T5. This MCQ covers the core concepts: self‑attention, multi‑head attention, positional encoding, and variants.
Why attention?
It solves the bottleneck of fixed‑length context vectors in RNNs by providing direct access to all hidden states. The Transformer takes this further, relying solely on attention.
Attention glossary – key concepts
Self‑Attention
Attention mechanism where queries, keys, and values come from the same sequence. Each position attends to all positions.
Multi‑Head Attention
Runs multiple attention operations (heads) in parallel, each learning different relationships. Outputs are concatenated and projected.
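The split-project-concatenate flow described above can be sketched in NumPy. This is a minimal illustration, not a full implementation: the function name and the single-matrix weight layout (`Wq`, `Wk`, `Wv`, `Wo`, each `d_model × d_model`) are my own choices, and `d_model` is assumed to be divisible by `num_heads`.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))  # subtract max for stability
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_self_attention(X, Wq, Wk, Wv, Wo, num_heads):
    """X: (seq_len, d_model); Wq/Wk/Wv/Wo: (d_model, d_model)."""
    seq_len, d_model = X.shape
    d_k = d_model // num_heads  # per-head dimension
    # Project, then split into heads: (num_heads, seq_len, d_k)
    def split(W):
        return (X @ W).reshape(seq_len, num_heads, d_k).transpose(1, 0, 2)
    Q, K, V = split(Wq), split(Wk), split(Wv)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_k)  # (num_heads, seq_len, seq_len)
    heads = softmax(scores) @ V                       # each head attends independently
    # Concatenate heads back to (seq_len, d_model), then apply output projection
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ Wo
```

Because each head works in a lower-dimensional subspace (`d_k = d_model / num_heads`), the total cost is comparable to a single full-width attention while letting heads specialize.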
Scaled Dot‑Product Attention
Attention(Q,K,V) = softmax(QK^T/√d_k)V. Dividing by √d_k keeps the dot products from growing with the key dimension; without it, large scores push the softmax into near-one-hot regions with vanishing gradients.
Positional Encoding
Since Transformers have no recurrence, positional encodings (sinusoidal or learned) inject order information.
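The sinusoidal variant from the original Transformer paper uses sin on even dimensions and cos on odd ones, with wavelengths forming a geometric progression. A short NumPy sketch (the function name is mine; `d_model` is assumed even):

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d_model)); PE[pos, 2i+1] = cos(same)."""
    pos = np.arange(seq_len)[:, None]          # (seq_len, 1)
    i = np.arange(0, d_model, 2)[None, :]      # even dimension indices
    angle = pos / np.power(10000.0, i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angle)                # even dims
    pe[:, 1::2] = np.cos(angle)                # odd dims
    return pe
```

The resulting matrix is simply added to the token embeddings, so every position gets a unique, smoothly varying signature that the model can use to infer relative order.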
Encoder‑Decoder Attention
In the Transformer, the decoder attends to the encoder's output: queries come from the decoder, while keys and values come from the encoder.
Masked Self‑Attention
Used in decoder to prevent attending to future tokens (causal mask).
BERT / GPT
BERT uses bidirectional self‑attention; GPT uses causal (masked) self‑attention for generation.
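The causal mask behind GPT-style decoding is just a lower-triangular matrix: position i may attend to positions 0..i and nothing later. A small NumPy demonstration (helper names are mine) shows how masked positions receive zero attention weight after the softmax:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))  # subtract max for stability
    return e / e.sum(axis=axis, keepdims=True)

def causal_mask(seq_len):
    # 1 where attention is allowed (key index <= query index), 0 for future tokens
    return np.tril(np.ones((seq_len, seq_len), dtype=int))

# Uniform scores make the effect easy to read off:
scores = np.zeros((4, 4))
masked = np.where(causal_mask(4) == 0, -1e9, scores)  # block future positions
weights = softmax(masked)
# Row 0 attends only to itself; row i spreads weight uniformly over 0..i
```

With all-zero scores, row 0 of `weights` is [1, 0, 0, 0] and row 1 is [0.5, 0.5, 0, 0]: the -1e9 entries vanish under the softmax, so future tokens contribute nothing to the output.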
# Scaled dot-product attention (NumPy)
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))  # subtract max for stability
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V, mask=None):
    d_k = Q.shape[-1]
    scores = Q @ np.swapaxes(K, -1, -2) / np.sqrt(d_k)  # (..., seq_q, seq_k)
    if mask is not None:
        scores = np.where(mask == 0, -1e9, scores)  # block masked positions
    attn_weights = softmax(scores)                  # rows sum to 1
    return attn_weights @ V
Common attention interview questions
- Why is the dot product scaled by 1/√d_k in Transformer attention?
- Explain the role of queries, keys, and values in attention.
- What is the difference between self‑attention and encoder‑decoder attention?
- How does multi‑head attention improve over single head?
- Why do Transformers need positional encodings?
- Describe masked self‑attention in autoregressive models (GPT).