
Attention Mechanism — 15 Interview Questions

Q/K/V, scaled dot-product, causal masking, multi-head attention, and how attention stacks into a Transformer block.


1. Intuition: what does attention compute? (Easy)
Answer: A weighted sum of values, where weights (attention scores) say how much each source position matters for the current query—soft lookup over a set of vectors.
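A minimal sketch of that idea (NumPy; the 2-D values and hand-picked weights are just for illustration):

import numpy as np

values = np.array([[1.0, 0.0],   # value vector at position 0
                   [0.0, 1.0],   # value vector at position 1
                   [1.0, 1.0]])  # value vector at position 2
weights = np.array([0.7, 0.2, 0.1])  # attention weights for one query; sum to 1

output = weights @ values  # soft lookup, dominated by position 0's value
print(output)              # [0.8 0.3]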
2. What are Query, Key, and Value? (Easy)
Answer: Three linear projections of the inputs (or of another sequence, in cross-attention). The query asks "what I need"; keys label slots; values carry the content mixed by attention weights.
3. Scaled dot-product attention formula. (Medium)
Answer: Attention(Q, K, V) = softmax(QKᵀ / √d_k) V. Scaling by √d_k keeps dot products from growing too large, so the softmax doesn't saturate.
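A minimal NumPy sketch of the formula; the shapes (n queries, m keys, dimensions d_k and d_v) are illustrative assumptions:

import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    # Q: (n, d_k), K: (m, d_k), V: (m, d_v)
    scores = Q @ K.T / np.sqrt(Q.shape[-1])  # (n, m) logits, scaled by √d_k
    weights = softmax(scores, axis=-1)       # each row sums to 1
    return weights @ V                       # (n, d_v) weighted sum of values

rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(4, 8)), rng.normal(size=(6, 8)), rng.normal(size=(6, 16))
print(attention(Q, K, V).shape)  # (4, 16)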
4. Self-attention vs cross-attention. (Medium)
Answer: Self: Q, K, V from same sequence (e.g. encoder). Cross: Q from one sequence (decoder), K,V from another (encoder output)—decoder attends to source.
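A sketch of the cross-attention wiring (NumPy; the shapes and the omission of the learned Q/K/V projections are simplifications): only where Q versus K and V come from differs from self-attention.

import numpy as np
rng = np.random.default_rng(0)

enc_out = rng.normal(size=(10, 8))  # encoder output: K and V come from here
dec_h = rng.normal(size=(3, 8))     # decoder states: Q comes from here

scores = dec_h @ enc_out.T / np.sqrt(8)                 # (3, 10)
weights = np.exp(scores - scores.max(-1, keepdims=True))
weights = weights / weights.sum(-1, keepdims=True)      # softmax over source positions
context = weights @ enc_out                             # (3, 8) source content per target position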
5. Causal (look-ahead) mask in decoders. (Medium)
Answer: Set attention logits to −∞ for future positions before softmax—position t cannot attend to t+1,…—preserves autoregressive generation.
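A minimal sketch of the mask (NumPy; a 4-token toy sequence):

import numpy as np

n = 4
scores = np.random.default_rng(0).normal(size=(n, n))  # raw attention logits
future = np.triu(np.ones((n, n), dtype=bool), k=1)     # True strictly above the diagonal
scores[future] = -np.inf                               # position t can't see t+1, ...

weights = np.exp(scores - scores.max(-1, keepdims=True))
weights = weights / weights.sum(-1, keepdims=True)
print(np.round(weights, 2))  # upper triangle is exactly 0; row 0 attends only to itself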
6. Multi-head attention: why multiple heads? (Medium)
Answer: Each head learns different subspaces of relationships in parallel; concatenating heads lets the model capture multiple dependency types (syntax, coreference, etc.).
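A sketch of the head split and merge (NumPy; d_model = 8 and 2 heads are arbitrary choices), omitting the attention itself and the learned projections:

import numpy as np

n, d_model, h = 5, 8, 2
d_head = d_model // h
x = np.random.default_rng(0).normal(size=(n, d_model))

heads = x.reshape(n, h, d_head).transpose(1, 0, 2)     # (h, n, d_head): per-head subspaces
# ... each head would run scaled dot-product attention independently here ...
merged = heads.transpose(1, 0, 2).reshape(n, d_model)  # concatenate heads back
assert np.allclose(merged, x)  # the split and merge are lossless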
7. Pre-LN vs Post-LN Transformer (brief). (Hard)
Answer: Post-LN: original “Attention → Add&Norm”. Pre-LN: norm before sublayers—often more stable training for very deep stacks; both used in literature.
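The two orderings as a sketch (plain Python; `sublayer` and `norm` stand in for attention/FFN and LayerNorm):

def post_ln(x, sublayer, norm):
    return norm(x + sublayer(x))  # original Transformer: Add & Norm after the sublayer

def pre_ln(x, sublayer, norm):
    return x + sublayer(norm(x))  # normalize first; the residual path stays untouched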
8. Why positional encoding? (Easy)
Answer: Attention is permutation-invariant without order info—add sinusoidal or learned positions so “cat bites dog” ≠ “dog bites cat.”
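A minimal sketch of the sinusoidal variant (NumPy), following the original Transformer formula; max_len and d_model here are arbitrary:

import numpy as np

def sinusoidal_pe(max_len, d_model):
    pos = np.arange(max_len)[:, None]      # (max_len, 1)
    i = np.arange(0, d_model, 2)[None, :]  # even feature indices
    angles = pos / (10000 ** (i / d_model))
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)           # even dims: sin
    pe[:, 1::2] = np.cos(angles)           # odd dims: cos
    return pe                              # added to token embeddings

print(sinusoidal_pe(50, 16).shape)  # (50, 16)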
9. Time complexity of self-attention in sequence length n. (Medium)
Answer: O(n² · d) for attention matrix over pairs—quadratic in length is the main bottleneck for long contexts; motivates sparse/linear attention variants.
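The quadratic term made concrete (NumPy; float32 assumed): the score matrix alone is n × n, so doubling the context length quadruples its memory.

import numpy as np

for n in (1_000, 2_000, 4_000):
    scores = np.empty((n, n), dtype=np.float32)  # one attention matrix, one head
    print(n, f"{scores.nbytes / 1e6:.0f} MB")    # 4 MB -> 16 MB -> 64 MB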
10. Bahdanau (additive) vs Luong (dot) attention. (Hard)
Answer: Older seq2seq: additive scores use a small MLP on [s_t; h_j]; multiplicative/dot uses direct similarity—scaled dot-product is the modern dot family at scale.
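A sketch of the two score families (NumPy; the weights W1, W2, v would be learned, random here):

import numpy as np
rng = np.random.default_rng(0)

d = 8
s = rng.normal(size=d)        # decoder state s_t
H = rng.normal(size=(10, d))  # encoder states h_1..h_10
W1, W2 = rng.normal(size=(d, d)), rng.normal(size=(d, d))
v = rng.normal(size=d)

scores_additive = np.tanh(s @ W1 + H @ W2) @ v  # Bahdanau: small MLP on s_t and h_j
scores_dot = H @ s                              # Luong: direct similarity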
11. What sits after attention in a Transformer block? (Easy)
Answer: Feed-forward network (MLP) applied per position—typically expand (4d) with GELU/ReLU then project back; residual + norm around each sublayer.
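A minimal sketch of the position-wise FFN (NumPy; ReLU and random weights for simplicity, biases omitted):

import numpy as np
rng = np.random.default_rng(0)

n, d = 5, 8
x = rng.normal(size=(n, d))
W1, W2 = rng.normal(size=(d, 4 * d)), rng.normal(size=(4 * d, d))

hidden = np.maximum(x @ W1, 0.0)  # expand to 4d, apply nonlinearity
out = hidden @ W2                 # project back to d; residual + norm wrap this
print(out.shape)                  # (5, 8): same MLP applied at every position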
12. Attention dropout: where? (Easy)
Answer: Dropout on attention weights (after softmax) or on scores in some implementations—regularizes attention patterns.
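A sketch of inverted dropout applied to the post-softmax weights (NumPy; p = 0.1 is an arbitrary rate):

import numpy as np
rng = np.random.default_rng(0)

weights = np.full((4, 4), 0.25)        # attention weights after softmax
p = 0.1
keep = rng.random(weights.shape) >= p  # randomly zero some weights
weights = weights * keep / (1.0 - p)   # rescale so the expected value is unchanged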
13. Vision Transformer: how is attention used? (Medium)
Answer: Split the image into patches, embed each as a token, and run Transformer encoder self-attention: global mixing of patch relationships without convolutional inductive bias (given enough data).
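A sketch of the image-to-patch-tokens step (NumPy; a 32×32 RGB image and 8×8 patches are arbitrary choices; the learned linear embedding is omitted):

import numpy as np

img = np.random.default_rng(0).normal(size=(32, 32, 3))  # H, W, C
p = 8
patches = img.reshape(32 // p, p, 32 // p, p, 3).transpose(0, 2, 1, 3, 4)
tokens = patches.reshape(-1, p * p * 3)  # flatten each patch
print(tokens.shape)                      # (16, 192): 16 tokens, 192 dims each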
14. FlashAttention (interview one-liner). (Hard)
Answer: IO-aware exact attention implementation that fuses ops and tiles to SRAM—same math, faster training on GPUs for long sequences.
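Not FlashAttention itself, but how such fused kernels are typically reached in practice: PyTorch 2.x routes this one call to a fused implementation (FlashAttention among them, hardware permitting), with the same math as the naive version.

import torch
import torch.nn.functional as F

q = torch.randn(1, 8, 1024, 64)  # (batch, heads, seq_len, head_dim)
k, v = torch.randn_like(q), torch.randn_like(q)
out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
print(out.shape)  # torch.Size([1, 8, 1024, 64])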
15. Encoder-only vs decoder-only vs encoder–decoder. (Medium)
Answer: Encoder-only (BERT): bidirectional context. Decoder-only (GPT): causal LM. Enc–Dec (T5, original Transformer): encoder sees source, decoder generates target with cross-attention.
Memorize softmax(QKᵀ / √d_k) V and be able to state each matrix's shape.

Quick review checklist

  • Q/K/V; scaled dot-product; self vs cross; causal mask.
  • Multi-head; positional encoding; O(n²) cost.
  • Transformer block; ViT; encoder/decoder roles.