Attention Mechanism: 20 Interview Questions
Master Attention from seq2seq to Transformers: self-attention, multi-head, scaled dot-product, QKV, positional encoding, BERT, ViT, and advanced variants. Concise, interview-ready answers with formulas.
Topics: Self-Attention · QKV · Multi-Head · Transformer · Masked Attention · Positional Encoding · ViT
1
What is the attention mechanism? Why was it introduced in deep learning?
⚡ Easy
Answer: Attention allows a model to dynamically weigh the importance of different input elements when producing output. Introduced initially in seq2seq models (Bahdanau et al., 2015) to overcome the bottleneck of a fixed-length context vector. It provides a shortcut for gradients and improves long-range dependency capture.
Context = Σ α_i · h_i where α_i = softmax(score(h_dec, h_enc))
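A minimal NumPy sketch of the context-vector computation above. The states are random toys and a plain dot product stands in for `score` (an illustrative assumption, not part of the original formula):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Toy encoder states (4 time steps, dim 3) and one decoder state
h_enc = np.random.randn(4, 3)
h_dec = np.random.randn(3)

scores = h_enc @ h_dec    # score(h_dec, h_i), dot-product assumed
alpha = softmax(scores)   # attention weights α_i, sum to 1
context = alpha @ h_enc   # Σ α_i · h_i
```

The context vector is a convex combination of encoder states, so its dimension matches them.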
2
Compare Bahdanau (additive) and Luong (multiplicative) attention.
📊 Medium
Answer:
- Bahdanau: Uses a small feedforward network to compute the alignment score. Score = v_a^T tanh(W_1 h_dec + W_2 h_enc). More expressive, but slower due to the extra matrix multiplies and tanh.
- Luong: Simpler dot-product (or general) score. Score = h_dec^T · h_enc (or h_dec^T W h_enc). Faster, often used with global/local attention.
Bahdanau: additive, more parameters, tanh. | Luong: multiplicative, simpler, dot or general.
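The two score functions side by side in NumPy; the learned matrices W_1, W_2 and vector v_a are random here purely for illustration:

```python
import numpy as np

d = 4
h_dec, h_enc = np.random.randn(d), np.random.randn(d)

# Luong (multiplicative): plain dot product, no extra parameters
luong_score = h_dec @ h_enc

# Bahdanau (additive): learned W_1, W_2, v_a (random stand-ins here)
W1, W2 = np.random.randn(d, d), np.random.randn(d, d)
v_a = np.random.randn(d)
bahdanau_score = v_a @ np.tanh(W1 @ h_dec + W2 @ h_enc)
```

Both produce a scalar alignment score per (decoder state, encoder state) pair; the additive form costs two matrix-vector products extra.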
3
Explain self-attention. How is it different from cross-attention?
🔥 Hard
Answer: Self-attention computes attention within the same sequence (Q, K, V from same source). Each token attends to all tokens in the sequence, including itself. Cross-attention: Q from one sequence (e.g., decoder), K,V from another (e.g., encoder). Core of Transformers.
4
What is scaled dot-product attention? Why scale by √d_k?
🔥 Hard
Answer: Attention(Q,K,V) = softmax(QK^T / √d_k) V. For unit-variance Q and K components, the dot products have variance d_k; dividing by √d_k keeps them at unit scale and prevents the softmax from saturating into regions of extremely small gradients. Stabilizes training, especially for high-dimensional keys.
Attention(Q,K,V) = softmax( QK^T / √d_k ) V
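A direct NumPy translation of the formula above (shapes are toy values):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)  # QK^T / √d_k
    # Row-wise softmax (shift by max for numerical stability)
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ V

Q, K, V = (np.random.randn(5, 8) for _ in range(3))
out = scaled_dot_product_attention(Q, K, V)  # shape (5, 8)
```

Each output row is a weighted average of the value rows, with weights given by query-key similarity.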
5
What is multi-head attention? Why is it beneficial?
📊 Medium
Answer: Multi-head attention projects Q, K, V into h subspaces, applies attention in parallel, then concatenates and projects. Each head can focus on different relationships (e.g., syntax, semantics, long-distance). Increases model capacity without quadratic parameter blowup.
MultiHead(Q,K,V) = Concat(head_1,...,head_h)W^O
head_i = Attention(QW_i^Q, KW_i^K, VW_i^V)
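A compact NumPy sketch of the formula: project into h subspaces, attend in each, concatenate, project with W^O. All projection matrices are random stand-ins for learned weights:

```python
import numpy as np

def multi_head_attention(X, h=2):
    """Self-attention with h heads; W^Q/W^K/W^V/W^O are random stand-ins."""
    n, d = X.shape
    d_k = d // h
    rng = np.random.default_rng(0)
    heads = []
    for _ in range(h):
        Wq, Wk, Wv = (rng.standard_normal((d, d_k)) for _ in range(3))
        Q, K, V = X @ Wq, X @ Wk, X @ Wv
        s = Q @ K.T / np.sqrt(d_k)
        a = np.exp(s - s.max(axis=-1, keepdims=True))
        a /= a.sum(axis=-1, keepdims=True)
        heads.append(a @ V)                      # head_i, shape (n, d_k)
    W_o = rng.standard_normal((h * d_k, d))
    return np.concatenate(heads, axis=-1) @ W_o  # Concat(head_1..head_h) W^O

out = multi_head_attention(np.random.randn(6, 8))  # shape (6, 8)
```

Note each head works in dimension d/h, so total parameter count stays comparable to a single full-width head.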
6
Sketch the high-level Transformer architecture (encoder-decoder).
🔥 Hard
Answer:
- Encoder: stack of N identical layers. Each layer: Multi-Head Self-Attention + FeedForward, with Add&Norm (residual + LayerNorm).
- Decoder: stack of N layers. Each layer: Masked Multi-Head Self-Attention (causal) + Cross-Attention (Q from decoder, K,V from encoder) + FeedForward, with Add&Norm.
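The encoder-layer wiring above (sublayer + Add&Norm, twice) can be sketched like this; identity and ReLU stand in for the real attention and feedforward sublayers just to show the residual/LayerNorm pattern:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def encoder_layer(x, self_attn, ffn):
    x = layer_norm(x + self_attn(x))  # Multi-Head Self-Attention + Add&Norm
    x = layer_norm(x + ffn(x))        # FeedForward + Add&Norm
    return x

# Stand-in sublayers just to show the wiring
out = encoder_layer(np.random.randn(4, 8),
                    self_attn=lambda x: x,
                    ffn=lambda x: np.maximum(x, 0))
```

The decoder layer adds a masked self-attention sublayer and a cross-attention sublayer with the same Add&Norm pattern around each.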
7
Why does Transformer need positional encoding? Describe sinusoidal encoding.
📊 Medium
Answer: Self-attention is permutation-invariant; it has no inherent notion of order. Positional encoding injects position information. Sinusoidal: PE(pos, 2i) = sin(pos/10000^(2i/d)), PE(pos, 2i+1) = cos(pos/10000^(2i/d)). Allows the model to attend to relative positions and extrapolate beyond training length.
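The sinusoidal table computed directly from the formula (d assumed even):

```python
import numpy as np

def sinusoidal_pe(max_len, d):
    """Sinusoidal positional encodings, shape (max_len, d); d assumed even."""
    pos = np.arange(max_len)[:, None]   # positions, column vector
    i = np.arange(0, d, 2)[None, :]     # even embedding indices 2i
    angle = pos / 10000 ** (i / d)
    pe = np.zeros((max_len, d))
    pe[:, 0::2] = np.sin(angle)         # PE(pos, 2i)
    pe[:, 1::2] = np.cos(angle)         # PE(pos, 2i+1)
    return pe

pe = sinusoidal_pe(50, 16)
```

Each dimension pair is a sinusoid of a different wavelength, which is what lets attention pick up relative offsets as linear functions of the encodings.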
8
What is masked self-attention? Why is it used?
📊 Medium
Answer: In decoder, masked attention prevents positions from attending to future positions (causal). Achieved by setting attention scores to -∞ (or large negative) for illegal connections before softmax. Ensures autoregressive property: prediction at step t depends only on previous tokens.
[✓ -∞ -∞; ✓ ✓ -∞; ✓ ✓ ✓] (example causal mask)
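Building and applying that causal mask in NumPy (a large negative number stands in for -∞, as the answer notes):

```python
import numpy as np

n = 4
scores = np.random.randn(n, n)                     # raw attention scores
mask = np.triu(np.ones((n, n), dtype=bool), k=1)   # True strictly above diagonal
scores = np.where(mask, -1e9, scores)              # "-∞" for future positions
w = np.exp(scores - scores.max(axis=-1, keepdims=True))
w /= w.sum(axis=-1, keepdims=True)
# Row t now places (near-)zero weight on positions > t
```

After the softmax, each row is a valid distribution over positions ≤ t only, preserving the autoregressive property.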
9
Why is attention (Transformer) more parallelizable than RNNs?
📊 Medium
Answer: RNNs process tokens sequentially (O(n) steps). Self-attention computes all pairwise interactions in O(1) sequential steps (parallel across sequence). However, compute cost is O(n² d) vs RNN O(n d²). Trade-off: parallelization vs quadratic complexity.
10
What is the time and memory complexity of vanilla self-attention?
🔥 Hard
Answer: Time complexity: O(n² · d) where n is sequence length, d is dimension. Memory complexity: O(n²) for attention matrix. This limits long sequences. Solutions: sparse attention (Longformer), linear attention (Performer), sliding window.
- Pro: global receptive field in one layer.
- Con: quadratic cost in sequence length, not O(n).
11
How is BERT's attention different from Transformer decoder?
🔥 Hard
Answer: BERT uses encoder-only Transformer with bidirectional self-attention (no masking). Transformer decoder uses causal (masked) self-attention + cross-attention. BERT is trained with MLM (masked language modeling), decoder with autoregressive LM.
12
How does Vision Transformer (ViT) apply attention to images?
🔥 Hard
Answer: ViT splits image into patches (e.g., 16x16), flattens and projects to embeddings, adds positional embeddings, and feeds to standard Transformer encoder. No convolutions. Competes with CNNs on image classification, scales well with data.
13
Where is cross-attention used besides encoder-decoder?
📊 Medium
Answer: Multi-modal models (e.g., Flamingo's gated cross-attention layers) fuse image and text representations. In object detection (DETR): cross-attention between object queries and image features. In Stable Diffusion: cross-attention between text embeddings and image latents.
14
Name and describe sparse attention variants.
🔥 Hard
Answer:
- Sliding window (Longformer): each token attends to w neighbors.
- Dilated sliding window: gaps in the window (like dilated convolutions) enlarge the receptive field at the same cost.
- Global + sliding: special tokens (CLS) attend to all.
- Block sparse (BigBird): random + window + global.
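The sliding-window pattern from the list above, expressed as a boolean mask (Longformer-style; window half-width `w` is the only parameter):

```python
import numpy as np

def sliding_window_mask(n, w):
    """True where attention is allowed: token i may attend to j iff |i - j| <= w."""
    idx = np.arange(n)
    return np.abs(idx[:, None] - idx[None, :]) <= w

m = sliding_window_mask(6, 1)
# Each token attends to itself and w neighbors on each side
```

Each row has at most 2w+1 allowed positions instead of n, so masked attention costs O(n·w) rather than O(n²); the other variants in the list modify this mask (dilation gaps, global rows/columns, random blocks).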
15
What problem does FlashAttention solve?
🔥 Hard
Answer: Standard attention materializes the O(n²) matrix, causing a memory bottleneck. FlashAttention computes exact attention with tiling: blocks are processed in fast on-chip SRAM without ever writing the full n×n matrix to GPU HBM. 2-4x speedup, O(n) extra memory (compute is still O(n²)), enables longer context.
16
Explain attention as a soft dictionary (key-value) retrieval.
📊 Medium
Answer: Query (Q) is what we're looking for. Keys (K) are indices. Values (V) are the actual content. Attention weights are similarity between Q and K; output is weighted sum of V. Differentiable, soft, end-to-end.
17
What is relative positional encoding? Why is it used?
🔥 Hard
Answer: Instead of adding absolute positions to embeddings, relative PE incorporates distance between tokens into attention scores (e.g., Transformer-XL, T5). Better generalization to longer sequences, captures pairwise relationships directly.
18
What are some alternative attention score functions?
📊 Medium
Answer: Dot product (transformer), additive (Bahdanau), cosine similarity, L1/L2 distance. Linear attention (Katharopoulos et al.): replace softmax with feature maps (elu(x)+1), reduces complexity from O(n²) to O(n). Used in efficient Transformers.
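A sketch of the Katharopoulos-style linear attention just described: replace softmax with the feature map φ(x) = elu(x)+1 and reassociate the matrix product so no n×n matrix appears:

```python
import numpy as np

def elu_plus_one(x):
    # elu(x) + 1: positive everywhere, so weights stay non-negative
    return np.where(x > 0, x + 1.0, np.exp(x))

def linear_attention(Q, K, V):
    """φ(Q) (φ(K)^T V) / normalizer — O(n·d²) instead of O(n²·d)."""
    phi_q, phi_k = elu_plus_one(Q), elu_plus_one(K)
    kv = phi_k.T @ V                  # (d, d) summary; no n×n matrix
    z = phi_q @ phi_k.sum(axis=0)     # per-query normalizer
    return (phi_q @ kv) / z[:, None]

Q, K, V = (np.random.randn(5, 4) for _ in range(3))
out = linear_attention(Q, K, V)       # shape (5, 4)
```

The key trick is associativity: φ(Q)(φ(K)^T V) needs only a d×d intermediate, whereas (φ(Q)φ(K)^T)V would rebuild the quadratic attention matrix.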
19
How is attention used in GNNs? (GAT)
🔥 Hard
Answer: Graph Attention Network (GAT) computes attention coefficients between connected nodes. Each node attends to its neighbors, learning importance weights. Multi-head GAT aggregates neighbor features. No spectral decomposition, inductive.
α_ij = softmax( LeakyReLU( a^T [W h_i || W h_j] ) )
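The GAT coefficient formula above, computed for one node's neighborhood in NumPy; W and a are random stand-ins for learned parameters, and the LeakyReLU slope of 0.2 follows the usual GAT setup:

```python
import numpy as np

def gat_coefficients(h, W, a, neighbors):
    """Attention weights α_0j of node 0 over its neighbors (single head)."""
    z = h @ W  # projected features W h_i
    raw = [a @ np.concatenate([z[0], z[j]]) for j in neighbors]  # a^T [Wh_i || Wh_j]
    scores = np.array([np.maximum(0.2 * s, s) for s in raw])     # LeakyReLU, slope 0.2
    e = np.exp(scores - scores.max())
    return e / e.sum()  # softmax over the neighborhood only

h = np.random.randn(4, 3)                       # 4 nodes, 3 features
W = np.random.randn(3, 2)                       # projection to dim 2
a = np.random.randn(4)                          # attention vector, dim 2·2
alpha = gat_coefficients(h, W, a, neighbors=[0, 1, 2])
```

Unlike standard self-attention, the softmax runs only over graph neighbors, which is what makes GAT inductive and sparsity-aware.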
20
What are alternatives to attention for long sequences?
🔥 Hard
Answer: State Space Models (S4, Mamba) offer linear-time sequence modeling. Use structured state spaces, selective mechanisms. Outperform Transformers on long-range tasks (Path-X, DNA) with faster inference. Potential replacement for attention in some domains.
Attention Mechanism – Interview Cheat Sheet
Core Attention
- QKV: Query, Key, Value
- Softmax: Σ α = 1
- √d_k: scale factor
Types
- Self: intra-sequence
- Cross: inter-sequence
- Masked: causal (decoder)
Transformer
- Multi-head: h subspaces
- Pos enc: sinusoidal
- Add&Norm: residual + LayerNorm
Complexity
- O(n²): full attention
- O(n): linear attention (Flash cuts memory, not FLOPs)
Verdict: "Attention = weighted average with dynamic weights. Transformer = attention + MLP + residuals + norms. The foundation of modern NLP/CV."