
Attention Mechanism — 15 Interview Questions

Q/K/V, scaled dot-product, causal masking, multi-head attention, and how attention stacks into a Transformer block.


1. Intuition: what does attention compute? (Easy)
Answer: A weighted sum of values, where weights (attention scores) say how much each source position matters for the current query—soft lookup over a set of vectors.
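A minimal sketch of that idea (NumPy; the 2-D values and hand-picked weights are just for illustration):

import numpy as np

values = np.array([[1.0, 0.0],   # value vector at position 0
                   [0.0, 1.0],   # value vector at position 1
                   [1.0, 1.0]])  # value vector at position 2
weights = np.array([0.7, 0.2, 0.1])  # attention weights for one query; sum to 1

output = weights @ values  # soft lookup, dominated by position 0's value
print(output)              # [0.8 0.3]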
2. What are Query, Key, and Value? (Easy)
Answer: Three linear projections of the inputs (or of another sequence, in cross-attention). The query asks "what I need"; keys label slots; values carry the content mixed by attention weights.
3. Scaled dot-product attention formula. (Medium)
Answer: Attention(Q, K, V) = softmax(QKᵀ / √d_k) V. Scaling by √d_k keeps dot products from growing too large, so the softmax doesn't saturate.
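A minimal NumPy sketch of the formula; the shapes (n queries, m keys, dimensions d_k and d_v) are illustrative assumptions:

import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    # Q: (n, d_k), K: (m, d_k), V: (m, d_v)
    scores = Q @ K.T / np.sqrt(Q.shape[-1])  # (n, m) logits, scaled by √d_k
    weights = softmax(scores, axis=-1)       # each row sums to 1
    return weights @ V                       # (n, d_v) weighted sum of values

rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(4, 8)), rng.normal(size=(6, 8)), rng.normal(size=(6, 16))
print(attention(Q, K, V).shape)  # (4, 16)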
4. Self-attention vs cross-attention. (Medium)
Answer: Self: Q, K, V from same sequence (e.g. encoder). Cross: Q from one sequence (decoder), K,V from another (encoder output)—decoder attends to source.
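A sketch of the cross-attention wiring (NumPy; the shapes and the omission of the learned Q/K/V projections are simplifications): only where Q versus K and V come from differs from self-attention.

import numpy as np
rng = np.random.default_rng(0)

enc_out = rng.normal(size=(10, 8))  # encoder output: K and V come from here
dec_h = rng.normal(size=(3, 8))     # decoder states: Q comes from here

scores = dec_h @ enc_out.T / np.sqrt(8)                 # (3, 10)
weights = np.exp(scores - scores.max(-1, keepdims=True))
weights = weights / weights.sum(-1, keepdims=True)      # softmax over source positions
context = weights @ enc_out                             # (3, 8) source content per target position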
5. Causal (look-ahead) mask in decoders. (Medium)
Answer: Set attention logits to −∞ for future positions before softmax—position t cannot attend to t+1,…—preserves autoregressive generation.
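A minimal sketch of the mask (NumPy; a 4-token toy sequence):

import numpy as np

n = 4
scores = np.random.default_rng(0).normal(size=(n, n))  # raw attention logits
future = np.triu(np.ones((n, n), dtype=bool), k=1)     # True strictly above the diagonal
scores[future] = -np.inf                               # position t can't see t+1, ...

weights = np.exp(scores - scores.max(-1, keepdims=True))
weights = weights / weights.sum(-1, keepdims=True)
print(np.round(weights, 2))  # upper triangle is exactly 0; row 0 attends only to itself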
6. Multi-head attention: why multiple heads? (Medium)
Answer: Each head learns different subspaces of relationships in parallel; concatenating heads lets the model capture multiple dependency types (syntax, coreference, etc.).
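A sketch of the head split and merge (NumPy; d_model = 8 and 2 heads are arbitrary choices), omitting the attention itself and the learned projections:

import numpy as np

n, d_model, h = 5, 8, 2
d_head = d_model // h
x = np.random.default_rng(0).normal(size=(n, d_model))

heads = x.reshape(n, h, d_head).transpose(1, 0, 2)     # (h, n, d_head): per-head subspaces
# ... each head would run scaled dot-product attention independently here ...
merged = heads.transpose(1, 0, 2).reshape(n, d_model)  # concatenate heads back
assert np.allclose(merged, x)  # the split and merge are lossless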
7. Pre-LN vs Post-LN Transformer (brief). (Hard)
Answer: Post-LN: original “Attention → Add&Norm”. Pre-LN: norm before sublayers—often more stable training for very deep stacks; both used in literature.
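The two orderings as a sketch (plain Python; `sublayer` and `norm` stand in for attention/FFN and LayerNorm):

def post_ln(x, sublayer, norm):
    return norm(x + sublayer(x))  # original Transformer: Add & Norm after the sublayer

def pre_ln(x, sublayer, norm):
    return x + sublayer(norm(x))  # normalize first; the residual path stays untouched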
8. Why positional encoding? (Easy)
Answer: Attention is permutation-invariant without order info—add sinusoidal or learned positions so “cat bites dog” ≠ “dog bites cat.”
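A minimal sketch of the sinusoidal variant (NumPy), following the original Transformer formula; max_len and d_model here are arbitrary:

import numpy as np

def sinusoidal_pe(max_len, d_model):
    pos = np.arange(max_len)[:, None]      # (max_len, 1)
    i = np.arange(0, d_model, 2)[None, :]  # even feature indices
    angles = pos / (10000 ** (i / d_model))
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)           # even dims: sin
    pe[:, 1::2] = np.cos(angles)           # odd dims: cos
    return pe                              # added to token embeddings

print(sinusoidal_pe(50, 16).shape)  # (50, 16)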
9. Time complexity of self-attention in sequence length n. (Medium)
Answer: O(n² · d) for attention matrix over pairs—quadratic in length is the main bottleneck for long contexts; motivates sparse/linear attention variants.
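The quadratic term made concrete (NumPy; float32 assumed): the score matrix alone is n × n, so doubling the context length quadruples its memory.

import numpy as np

for n in (1_000, 2_000, 4_000):
    scores = np.empty((n, n), dtype=np.float32)  # one attention matrix, one head
    print(n, f"{scores.nbytes / 1e6:.0f} MB")    # 4 MB -> 16 MB -> 64 MB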
10. Bahdanau (additive) vs Luong (dot) attention. (Hard)
Answer: Older seq2seq: additive scores use a small MLP on [s_t; h_j]; multiplicative/dot uses direct similarity—scaled dot-product is the modern dot family at scale.
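A sketch of the two score families (NumPy; the weights W1, W2, v would be learned, random here):

import numpy as np
rng = np.random.default_rng(0)

d = 8
s = rng.normal(size=d)        # decoder state s_t
H = rng.normal(size=(10, d))  # encoder states h_1..h_10
W1, W2 = rng.normal(size=(d, d)), rng.normal(size=(d, d))
v = rng.normal(size=d)

scores_additive = np.tanh(s @ W1 + H @ W2) @ v  # Bahdanau: small MLP on s_t and h_j
scores_dot = H @ s                              # Luong: direct similarity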
11. What sits after attention in a Transformer block? (Easy)
Answer: Feed-forward network (MLP) applied per position—typically expand (4d) with GELU/ReLU then project back; residual + norm around each sublayer.
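A minimal sketch of the position-wise FFN (NumPy; ReLU and random weights for simplicity, biases omitted):

import numpy as np
rng = np.random.default_rng(0)

n, d = 5, 8
x = rng.normal(size=(n, d))
W1, W2 = rng.normal(size=(d, 4 * d)), rng.normal(size=(4 * d, d))

hidden = np.maximum(x @ W1, 0.0)  # expand to 4d, apply nonlinearity
out = hidden @ W2                 # project back to d; residual + norm wrap this
print(out.shape)                  # (5, 8): same MLP applied at every position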
12. Attention dropout: where? (Easy)
Answer: Dropout on attention weights (after softmax) or on scores in some implementations—regularizes attention patterns.
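A sketch of inverted dropout applied to the post-softmax weights (NumPy; p = 0.1 is an arbitrary rate):

import numpy as np
rng = np.random.default_rng(0)

weights = np.full((4, 4), 0.25)        # attention weights after softmax
p = 0.1
keep = rng.random(weights.shape) >= p  # randomly zero some weights
weights = weights * keep / (1.0 - p)   # rescale so the expected value is unchanged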
13. Vision Transformer: how is attention used? (Medium)
Answer: Split the image into patches, embed each as a token, and run Transformer encoder self-attention: global mixing of patch relationships without convolutional inductive bias (given enough data).
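A sketch of the image-to-patch-tokens step (NumPy; a 32×32 RGB image and 8×8 patches are arbitrary choices; the learned linear embedding is omitted):

import numpy as np

img = np.random.default_rng(0).normal(size=(32, 32, 3))  # H, W, C
p = 8
patches = img.reshape(32 // p, p, 32 // p, p, 3).transpose(0, 2, 1, 3, 4)
tokens = patches.reshape(-1, p * p * 3)  # flatten each patch
print(tokens.shape)                      # (16, 192): 16 tokens, 192 dims each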
14. FlashAttention (interview one-liner). (Hard)
Answer: IO-aware exact attention implementation that fuses ops and tiles to SRAM—same math, faster training on GPUs for long sequences.
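Not FlashAttention itself, but how such fused kernels are typically reached in practice: PyTorch 2.x routes this one call to a fused implementation (FlashAttention among them, hardware permitting), with the same math as the naive version.

import torch
import torch.nn.functional as F

q = torch.randn(1, 8, 1024, 64)  # (batch, heads, seq_len, head_dim)
k, v = torch.randn_like(q), torch.randn_like(q)
out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
print(out.shape)  # torch.Size([1, 8, 1024, 64])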
15. Encoder-only vs decoder-only vs encoder–decoder. (Medium)
Answer: Encoder-only (BERT): bidirectional context. Decoder-only (GPT): causal LM. Enc–Dec (T5, original Transformer): encoder sees source, decoder generates target with cross-attention.
Memorize softmax(QKᵀ / √d_k) V and be able to state each matrix's shape.

Quick review checklist

  • Q/K/V; scaled dot-product; self vs cross; causal mask.
  • Multi-head; positional encoding; O(n²) cost.
  • Transformer block; ViT; encoder/decoder roles.