Attention Mechanism: 20 Interview Questions
Master Attention from seq2seq to Transformers: self-attention, multi-head, scaled dot-product, QKV, positional encoding, BERT, ViT, and advanced variants. Concise, interview-ready answers with formulas.
Topics: Self-Attention · QKV · Multi-Head · Transformer · Masked Attention · Positional Encoding · ViT
1
What is the attention mechanism? Why was it introduced in deep learning?
⚡ Easy
Answer: Attention allows a model to dynamically weigh the importance of different input elements when producing output. Introduced initially in seq2seq models (Bahdanau et al., 2015) to overcome the bottleneck of a fixed-length context vector. It provides a shortcut for gradients and improves long-range dependency capture.
Context = Σ α_i · h_i where α_i = softmax(score(h_dec, h_enc))
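A minimal NumPy sketch of the context-vector computation above. The states are random toys and a plain dot product stands in for `score` (an illustrative assumption, not part of the original formula):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Toy encoder states (4 time steps, dim 3) and one decoder state
h_enc = np.random.randn(4, 3)
h_dec = np.random.randn(3)

scores = h_enc @ h_dec    # score(h_dec, h_i), dot-product assumed
alpha = softmax(scores)   # attention weights α_i, sum to 1
context = alpha @ h_enc   # Σ α_i · h_i
```

The context vector is a convex combination of encoder states, so its dimension matches them.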
2
Compare Bahdanau (additive) and Luong (multiplicative) attention.
📊 Medium
Answer:
- Bahdanau: Uses a small feedforward network to compute the alignment score. Score = v_a^T tanh(W_1 h_dec + W_2 h_enc). More expressive, but slower due to the extra matrix multiplies and tanh.
- Luong: Simpler dot-product (or general) score. Score = h_dec^T · h_enc (or h_dec^T W h_enc). Faster, often used with global/local attention.
Bahdanau: additive, more parameters, tanh. | Luong: multiplicative, simpler, dot or general.
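The two score functions side by side in NumPy; the learned matrices W_1, W_2 and vector v_a are random here purely for illustration:

```python
import numpy as np

d = 4
h_dec, h_enc = np.random.randn(d), np.random.randn(d)

# Luong (multiplicative): plain dot product, no extra parameters
luong_score = h_dec @ h_enc

# Bahdanau (additive): learned W_1, W_2, v_a (random stand-ins here)
W1, W2 = np.random.randn(d, d), np.random.randn(d, d)
v_a = np.random.randn(d)
bahdanau_score = v_a @ np.tanh(W1 @ h_dec + W2 @ h_enc)
```

Both produce a scalar alignment score per (decoder state, encoder state) pair; the additive form costs two matrix-vector products extra.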
3
Explain self-attention. How is it different from cross-attention?
🔥 Hard
Answer: Self-attention computes attention within the same sequence (Q, K, V from same source). Each token attends to all tokens in the sequence, including itself. Cross-attention: Q from one sequence (e.g., decoder), K,V from another (e.g., encoder). Core of Transformers.
4
What is scaled dot-product attention? Why scale by √d_k?
🔥 Hard
Answer: Attention(Q,K,V) = softmax(QK^T / √d_k) V. For unit-variance Q and K components, the dot products have variance d_k; dividing by √d_k keeps them at unit scale and prevents the softmax from saturating into regions of extremely small gradients. Stabilizes training, especially for high-dimensional keys.
Attention(Q,K,V) = softmax( QK^T / √d_k ) V
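A direct NumPy translation of the formula above (shapes are toy values):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)  # QK^T / √d_k
    # Row-wise softmax (shift by max for numerical stability)
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ V

Q, K, V = (np.random.randn(5, 8) for _ in range(3))
out = scaled_dot_product_attention(Q, K, V)  # shape (5, 8)
```

Each output row is a weighted average of the value rows, with weights given by query-key similarity.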
5
What is multi-head attention? Why is it beneficial?
📊 Medium
Answer: Multi-head attention projects Q, K, V into h subspaces, applies attention in parallel, then concatenates and projects. Each head can focus on different relationships (e.g., syntax, semantics, long-distance). Increases model capacity without quadratic parameter blowup.
MultiHead(Q,K,V) = Concat(head_1,...,head_h)W^O
head_i = Attention(QW_i^Q, KW_i^K, VW_i^V)
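A compact NumPy sketch of the formula: project into h subspaces, attend in each, concatenate, project with W^O. All projection matrices are random stand-ins for learned weights:

```python
import numpy as np

def multi_head_attention(X, h=2):
    """Self-attention with h heads; W^Q/W^K/W^V/W^O are random stand-ins."""
    n, d = X.shape
    d_k = d // h
    rng = np.random.default_rng(0)
    heads = []
    for _ in range(h):
        Wq, Wk, Wv = (rng.standard_normal((d, d_k)) for _ in range(3))
        Q, K, V = X @ Wq, X @ Wk, X @ Wv
        s = Q @ K.T / np.sqrt(d_k)
        a = np.exp(s - s.max(axis=-1, keepdims=True))
        a /= a.sum(axis=-1, keepdims=True)
        heads.append(a @ V)                      # head_i, shape (n, d_k)
    W_o = rng.standard_normal((h * d_k, d))
    return np.concatenate(heads, axis=-1) @ W_o  # Concat(head_1..head_h) W^O

out = multi_head_attention(np.random.randn(6, 8))  # shape (6, 8)
```

Note each head works in dimension d/h, so total parameter count stays comparable to a single full-width head.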
6
Sketch the high-level Transformer architecture (encoder-decoder).
🔥 Hard
Answer:
- Encoder: stack of N identical layers. Each layer: Multi-Head Self-Attention + FeedForward, with Add&Norm (residual + LayerNorm).
- Decoder: stack of N layers. Each layer: Masked Multi-Head Self-Attention (causal) + Cross-Attention (Q from decoder, K,V from encoder) + FeedForward, with Add&Norm.
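The encoder-layer wiring above (sublayer + Add&Norm, twice) can be sketched like this; identity and ReLU stand in for the real attention and feedforward sublayers just to show the residual/LayerNorm pattern:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def encoder_layer(x, self_attn, ffn):
    x = layer_norm(x + self_attn(x))  # Multi-Head Self-Attention + Add&Norm
    x = layer_norm(x + ffn(x))        # FeedForward + Add&Norm
    return x

# Stand-in sublayers just to show the wiring
out = encoder_layer(np.random.randn(4, 8),
                    self_attn=lambda x: x,
                    ffn=lambda x: np.maximum(x, 0))
```

The decoder layer adds a masked self-attention sublayer and a cross-attention sublayer with the same Add&Norm pattern around each.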
7
Why does Transformer need positional encoding? Describe sinusoidal encoding.
📊 Medium
Answer: Self-attention is permutation-invariant; it has no inherent notion of order. Positional encoding injects position information. Sinusoidal: PE(pos, 2i) = sin(pos/10000^(2i/d)), PE(pos, 2i+1) = cos(pos/10000^(2i/d)). Allows the model to attend to relative positions and extrapolate beyond training length.
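The sinusoidal table computed directly from the formula (d assumed even):

```python
import numpy as np

def sinusoidal_pe(max_len, d):
    """Sinusoidal positional encodings, shape (max_len, d); d assumed even."""
    pos = np.arange(max_len)[:, None]   # positions, column vector
    i = np.arange(0, d, 2)[None, :]     # even embedding indices 2i
    angle = pos / 10000 ** (i / d)
    pe = np.zeros((max_len, d))
    pe[:, 0::2] = np.sin(angle)         # PE(pos, 2i)
    pe[:, 1::2] = np.cos(angle)         # PE(pos, 2i+1)
    return pe

pe = sinusoidal_pe(50, 16)
```

Each dimension pair is a sinusoid of a different wavelength, which is what lets attention pick up relative offsets as linear functions of the encodings.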
8
What is masked self-attention? Why is it used?
📊 Medium
Answer: In decoder, masked attention prevents positions from attending to future positions (causal). Achieved by setting attention scores to -∞ (or large negative) for illegal connections before softmax. Ensures autoregressive property: prediction at step t depends only on previous tokens.
[✓ -∞ -∞; ✓ ✓ -∞; ✓ ✓ ✓] (example causal mask)
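Building and applying that causal mask in NumPy (a large negative number stands in for -∞, as the answer notes):

```python
import numpy as np

n = 4
scores = np.random.randn(n, n)                     # raw attention scores
mask = np.triu(np.ones((n, n), dtype=bool), k=1)   # True strictly above diagonal
scores = np.where(mask, -1e9, scores)              # "-∞" for future positions
w = np.exp(scores - scores.max(axis=-1, keepdims=True))
w /= w.sum(axis=-1, keepdims=True)
# Row t now places (near-)zero weight on positions > t
```

After the softmax, each row is a valid distribution over positions ≤ t only, preserving the autoregressive property.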
9
Why is attention (Transformer) more parallelizable than RNNs?
📊 Medium
Answer: RNNs process tokens sequentially (O(n) steps). Self-attention computes all pairwise interactions in O(1) sequential steps (parallel across sequence). However, compute cost is O(n² d) vs RNN O(n d²). Trade-off: parallelization vs quadratic complexity.
10
What is the time and memory complexity of vanilla self-attention?
🔥 Hard
Answer: Time complexity: O(n² · d) where n is sequence length, d is dimension. Memory complexity: O(n²) for attention matrix. This limits long sequences. Solutions: sparse attention (Longformer), linear attention (Performer), sliding window.
- Pro: global receptive field in one layer.
- Con: quadratic cost in sequence length, not O(n).
11
How is BERT's attention different from Transformer decoder?
🔥 Hard
Answer: BERT uses encoder-only Transformer with bidirectional self-attention (no masking). Transformer decoder uses causal (masked) self-attention + cross-attention. BERT is trained with MLM (masked language modeling), decoder with autoregressive LM.
12
How does Vision Transformer (ViT) apply attention to images?
🔥 Hard
Answer: ViT splits image into patches (e.g., 16x16), flattens and projects to embeddings, adds positional embeddings, and feeds to standard Transformer encoder. No convolutions. Competes with CNNs on image classification, scales well with data.
13
Where is cross-attention used besides encoder-decoder?
📊 Medium
Answer: Multi-modal models (e.g., Flamingo's gated cross-attention layers) fuse image and text representations. In object detection (DETR): cross-attention between object queries and image features. In Stable Diffusion: cross-attention between text embeddings and image latents.
14
Name and describe sparse attention variants.
🔥 Hard
Answer:
- Sliding window (Longformer): each token attends to w neighbors.
- Dilated sliding window: gaps in the window (like dilated convolutions) enlarge the receptive field at the same cost.
- Global + sliding: special tokens (CLS) attend to all.
- Block sparse (BigBird): random + window + global.
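The sliding-window pattern from the list above, expressed as a boolean mask (Longformer-style; window half-width `w` is the only parameter):

```python
import numpy as np

def sliding_window_mask(n, w):
    """True where attention is allowed: token i may attend to j iff |i - j| <= w."""
    idx = np.arange(n)
    return np.abs(idx[:, None] - idx[None, :]) <= w

m = sliding_window_mask(6, 1)
# Each token attends to itself and w neighbors on each side
```

Each row has at most 2w+1 allowed positions instead of n, so masked attention costs O(n·w) rather than O(n²); the other variants in the list modify this mask (dilation gaps, global rows/columns, random blocks).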
15
What problem does FlashAttention solve?
🔥 Hard
Answer: Standard attention materializes the O(n²) matrix, causing a memory bottleneck. FlashAttention computes exact attention with tiling: blocks are processed in fast on-chip SRAM without ever writing the full n×n matrix to GPU HBM. 2-4x speedup, O(n) extra memory (compute is still O(n²)), enables longer context.
16
Explain attention as a soft dictionary (key-value) retrieval.
📊 Medium
Answer: Query (Q) is what we're looking for. Keys (K) are indices. Values (V) are the actual content. Attention weights are similarity between Q and K; output is weighted sum of V. Differentiable, soft, end-to-end.
17
What is relative positional encoding? Why is it used?
🔥 Hard
Answer: Instead of adding absolute positions to embeddings, relative PE incorporates distance between tokens into attention scores (e.g., Transformer-XL, T5). Better generalization to longer sequences, captures pairwise relationships directly.
18
What are some alternative attention score functions?
📊 Medium
Answer: Dot product (transformer), additive (Bahdanau), cosine similarity, L1/L2 distance. Linear attention (Katharopoulos et al.): replace softmax with feature maps (elu(x)+1), reduces complexity from O(n²) to O(n). Used in efficient Transformers.
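A sketch of the Katharopoulos-style linear attention just described: replace softmax with the feature map φ(x) = elu(x)+1 and reassociate the matrix product so no n×n matrix appears:

```python
import numpy as np

def elu_plus_one(x):
    # elu(x) + 1: positive everywhere, so weights stay non-negative
    return np.where(x > 0, x + 1.0, np.exp(x))

def linear_attention(Q, K, V):
    """φ(Q) (φ(K)^T V) / normalizer — O(n·d²) instead of O(n²·d)."""
    phi_q, phi_k = elu_plus_one(Q), elu_plus_one(K)
    kv = phi_k.T @ V                  # (d, d) summary; no n×n matrix
    z = phi_q @ phi_k.sum(axis=0)     # per-query normalizer
    return (phi_q @ kv) / z[:, None]

Q, K, V = (np.random.randn(5, 4) for _ in range(3))
out = linear_attention(Q, K, V)       # shape (5, 4)
```

The key trick is associativity: φ(Q)(φ(K)^T V) needs only a d×d intermediate, whereas (φ(Q)φ(K)^T)V would rebuild the quadratic attention matrix.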
19
How is attention used in GNNs? (GAT)
🔥 Hard
Answer: Graph Attention Network (GAT) computes attention coefficients between connected nodes. Each node attends to its neighbors, learning importance weights. Multi-head GAT aggregates neighbor features. No spectral decomposition, inductive.
α_ij = softmax( LeakyReLU( a^T [W h_i || W h_j] ) )
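The GAT coefficient formula above, computed for one node's neighborhood in NumPy; W and a are random stand-ins for learned parameters, and the LeakyReLU slope of 0.2 follows the usual GAT setup:

```python
import numpy as np

def gat_coefficients(h, W, a, neighbors):
    """Attention weights α_0j of node 0 over its neighbors (single head)."""
    z = h @ W  # projected features W h_i
    raw = [a @ np.concatenate([z[0], z[j]]) for j in neighbors]  # a^T [Wh_i || Wh_j]
    scores = np.array([np.maximum(0.2 * s, s) for s in raw])     # LeakyReLU, slope 0.2
    e = np.exp(scores - scores.max())
    return e / e.sum()  # softmax over the neighborhood only

h = np.random.randn(4, 3)                       # 4 nodes, 3 features
W = np.random.randn(3, 2)                       # projection to dim 2
a = np.random.randn(4)                          # attention vector, dim 2·2
alpha = gat_coefficients(h, W, a, neighbors=[0, 1, 2])
```

Unlike standard self-attention, the softmax runs only over graph neighbors, which is what makes GAT inductive and sparsity-aware.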
20
What are alternatives to attention for long sequences?
🔥 Hard
Answer: State Space Models (S4, Mamba) offer linear-time sequence modeling. Use structured state spaces, selective mechanisms. Outperform Transformers on long-range tasks (Path-X, DNA) with faster inference. Potential replacement for attention in some domains.
Attention Mechanism – Interview Cheat Sheet
Core Attention
- QKV: Query, Key, Value
- Softmax: Σ α = 1
- √d_k: scale factor
Types
- Self: intra-sequence
- Cross: inter-sequence
- Masked: causal (decoder)
Transformer
- Multi-head: h subspaces
- Pos enc: sinusoidal
- Add&Norm: residual + LayerNorm
Complexity
- O(n²): full attention
- O(n): linear attention (Flash cuts memory, not FLOPs)
Verdict: "Attention = weighted average with dynamic weights. Transformer = attention + MLP + residuals + norms. The foundation of modern NLP/CV."