Transformers
20 Essential Q/A
FAANG Interview Prep
Transformers: 20 Interview Questions
Master self-attention, multi-head, positional encoding, BERT, GPT, Vision Transformer (ViT), and core comparisons. Interview-ready concise answers with formulas and intuition.
Self-Attention
Multi-Head
Pos Encoding
BERT
GPT
ViT
1
What is the Transformer architecture? How is it different from RNN?
⚡ Easy
Answer: The Transformer is an attention-based architecture with no recurrence or convolution. It processes all tokens in parallel via self-attention and feed-forward layers. Unlike RNNs, it has no sequential dependency, which allows massive parallelization and better capture of long-range dependencies.
Attention(Q,K,V) = softmax(QK^T/√d_k) V
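The formula above can be sketched directly in NumPy; this is a minimal toy version with made-up shapes, omitting batching and masking:

```python
# Minimal sketch of scaled dot-product attention (toy shapes, no batching/masking).
import numpy as np

def attention(Q, K, V):
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # (n_q, n_k) similarity logits
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = e / e.sum(axis=-1, keepdims=True)     # row-wise softmax
    return weights @ V                              # weighted sum of values

rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))   # 4 query tokens, d_k = 8
K = rng.normal(size=(4, 8))
V = rng.normal(size=(4, 8))
out = attention(Q, K, V)
print(out.shape)              # one context vector per query token
```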
2
How does self-attention work? Intuition.
📊 Medium
Answer: Each token attends to all tokens (including itself) to compute a weighted sum of values. Weights are derived from scaled dot-product of query and key. This captures contextual relationships irrespective of distance.
Intuition: "Retrieve relevant values based on similarity between query and key."
3
Why multi-head attention? Benefits?
🔥 Hard
Answer: Multi-head attention runs multiple attention heads in parallel, each with different learned projections. This lets the model attend to different representation subspaces (e.g., syntactic vs. semantic roles). Head outputs are concatenated and projected, increasing representational power.
MultiHead(Q,K,V) = Concat(head_1,...,head_h)W^O
where head_i = Attention(QW_i^Q, KW_i^K, VW_i^V)
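The concat-and-project mechanics can be shown with reshapes; a hedged sketch with toy sizes (6 tokens, 16 model dims, 4 heads) and random weights standing in for learned ones:

```python
# Sketch: multi-head attention via reshaping projections into h heads,
# attending per head, then concatenating and applying W^O (toy sizes).
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

n, d_model, h = 6, 16, 4          # 6 tokens, model dim 16, 4 heads
d_k = d_model // h                # 4 dims per head
rng = np.random.default_rng(1)
X = rng.normal(size=(n, d_model))
Wq, Wk, Wv, Wo = (rng.normal(size=(d_model, d_model)) for _ in range(4))

def heads(W):
    # Project, split the last axis into (h, d_k), and move heads first.
    return (X @ W).reshape(n, h, d_k).transpose(1, 0, 2)   # (h, n, d_k)

Q, K, V = heads(Wq), heads(Wk), heads(Wv)
A = softmax(Q @ K.transpose(0, 2, 1) / np.sqrt(d_k))       # (h, n, n)
out = (A @ V).transpose(1, 0, 2).reshape(n, d_model) @ Wo  # concat + W^O
print(out.shape)                                           # (n, d_model)
```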
4
Why does Transformer need positional encoding? Describe sine/cosine version.
📊 Medium
Answer: Self-attention is permutation-invariant; it has no notion of order. Positional encodings (PE) inject sequence order. Sine/cosine functions at different frequencies: PE(pos, 2i) = sin(pos/10000^{2i/d}); PE(pos, 2i+1) = cos(pos/10000^{2i/d}). Linear combinations of these encode relative offsets, enabling relative position learning.
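The sine/cosine formulas above translate directly to code; a small sketch (assumes an even embedding dimension):

```python
# Sinusoidal positional encodings: even dims get sin, odd dims get cos,
# with wavelengths forming a geometric progression (d must be even here).
import numpy as np

def positional_encoding(max_len, d):
    pos = np.arange(max_len)[:, None]          # (max_len, 1)
    i = np.arange(d // 2)[None, :]             # (1, d/2) frequency index
    angles = pos / (10000 ** (2 * i / d))
    pe = np.zeros((max_len, d))
    pe[:, 0::2] = np.sin(angles)               # PE(pos, 2i)
    pe[:, 1::2] = np.cos(angles)               # PE(pos, 2i+1)
    return pe

pe = positional_encoding(50, 16)
print(pe.shape)        # (50, 16)
print(pe[0, :4])       # position 0: sin(0)=0, cos(0)=1 alternating
```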
5
Why scale dot-product by 1/√d_k in attention?
🔥 Hard
Answer: For large d_k, dot products grow in magnitude, pushing softmax into regions of extremely small gradients. Scaling by 1/√d_k counteracts variance growth, keeping gradients stable.
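The variance-growth claim is easy to verify numerically: for unit-variance entries, q·k has standard deviation √d_k, and dividing by √d_k restores it to ≈1 (toy Monte Carlo check):

```python
# Monte Carlo check: dot products of d_k-dim unit-variance vectors have
# std ≈ sqrt(d_k); scaling by 1/sqrt(d_k) brings it back to ≈ 1.
import numpy as np

rng = np.random.default_rng(0)
d_k = 512
q = rng.normal(size=(100_000, d_k))
k = rng.normal(size=(100_000, d_k))
dots = (q * k).sum(axis=1)

print(dots.std())                    # ≈ sqrt(512) ≈ 22.6
print((dots / np.sqrt(d_k)).std())   # ≈ 1.0
```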
6
Describe Transformer's encoder and decoder blocks.
📊 Medium
Answer: Encoder: self-attention + feed-forward + residual + layer norm. Decoder: masked self-attention (to prevent looking ahead), cross-attention over encoder output, then FFN. Both use residual connections.
7
What is masked self-attention in decoder?
📊 Medium
Answer: Prevents positions from attending to subsequent (future) positions. Achieved by setting attention logits to -∞ before softmax. Ensures autoregressive generation.
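A minimal sketch of the mask: -∞ logits above the diagonal become exactly zero attention weight after softmax (uniform toy logits for illustration):

```python
# Causal mask sketch: future positions (upper triangle) get -inf logits,
# so their softmax weights are exactly zero.
import numpy as np

n = 4
logits = np.zeros((n, n))                         # toy uniform logits
mask = np.triu(np.ones((n, n), dtype=bool), k=1)  # True strictly above diagonal
logits[mask] = -np.inf

e = np.exp(logits - logits.max(axis=-1, keepdims=True))
weights = e / e.sum(axis=-1, keepdims=True)
print(weights)   # row i attends uniformly over positions 0..i, zero elsewhere
```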
8
Why layer norm instead of batch norm in Transformers?
🔥 Hard
Answer: Layer norm is independent of batch size and works well with variable sequence lengths. Batch norm's statistics are unstable for varying sequence lengths and small batches, both common in NLP. Layer norm normalizes across the feature dimension for each token.
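Per-token normalization can be sketched in a few lines (learnable gain/bias omitted for brevity):

```python
# Layer norm: normalize each token over its feature axis, independent of
# the batch. Learnable gamma/beta are omitted in this sketch.
import numpy as np

def layer_norm(x, eps=1e-5):
    mu = x.mean(axis=-1, keepdims=True)        # per-token mean over features
    var = x.var(axis=-1, keepdims=True)        # per-token variance
    return (x - mu) / np.sqrt(var + eps)

x = np.random.default_rng(0).normal(loc=3.0, scale=2.0, size=(2, 5, 8))
y = layer_norm(x)
print(np.allclose(y.mean(-1), 0, atol=1e-6))   # each token re-centered
print(np.allclose(y.var(-1), 1, atol=1e-3))    # each token re-scaled
```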
9
How is BERT different from GPT? Objectives?
🔥 Hard
Answer: BERT is encoder-only, bidirectional; pretrained with masked LM (MLM) and next sentence prediction. GPT is decoder-only, unidirectional (causal LM); autoregressive. BERT excels at NLU; GPT at generation.
10
Explain pretraining and fine-tuning paradigm.
📊 Medium
Answer: Large-scale pretraining on unlabeled text (e.g., Wikipedia) learns general language representations. Fine-tuning then adapts the pretrained weights to downstream tasks with labeled data, which is both compute- and data-efficient.
11
How does Vision Transformer (ViT) work?
🔥 Hard
Answer: Split image into fixed-size patches, flatten and project linearly to embeddings, add positional embeddings, feed to standard Transformer encoder. Classification via [CLS] token. No convolutions.
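The patchify-and-project step can be sketched with reshapes; toy sizes (32×32 RGB image, 8×8 patches) and random weights standing in for the learned projection:

```python
# ViT patch embedding sketch: cut a 32x32x3 image into 8x8 patches,
# flatten each, and project linearly (random weights stand in for learned ones).
import numpy as np

H = W = 32; P = 8; C = 3; d_model = 64
img = np.random.default_rng(0).random((H, W, C))

# (H, W, C) -> (H/P, P, W/P, P, C) -> (num_patches, P*P*C)
patches = img.reshape(H // P, P, W // P, P, C).transpose(0, 2, 1, 3, 4)
patches = patches.reshape(-1, P * P * C)          # (16, 192)

E = np.random.default_rng(1).normal(size=(P * P * C, d_model))
tokens = patches @ E                              # (16, 64) patch embeddings
print(tokens.shape)   # prepend [CLS], add positional embeddings, then encode
```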
12
What is the main computational bottleneck of Transformers?
📊 Medium
Answer: Self-attention has O(n² d) complexity in sequence length n. Long sequences (e.g., documents, video) are expensive. Solutions: sparse attention, Linformer, Reformer, Longformer.
Pro: parallel, global receptive field
Con: O(n²) memory/compute
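A back-of-envelope sketch of why the quadratic term bites: the attention matrix alone holds n² entries per head, so doubling the sequence length quadruples its memory (illustrative numbers, fp32, 12 heads assumed):

```python
# Back-of-envelope: attention matrix memory grows as n^2 per head.
def attn_matrix_bytes(n, heads=12, bytes_per_el=4):
    # One n x n fp32 attention matrix per head (single layer, batch of 1).
    return n * n * heads * bytes_per_el

for n in (512, 1024, 2048, 8192):
    print(n, attn_matrix_bytes(n) / 2**20, "MiB")   # 512 tokens -> 12 MiB
```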
13
Differentiate self-attention and cross-attention.
⚡ Easy
Answer: Self-attention: Q,K,V from same sequence. Cross-attention: Q from one sequence (e.g., decoder), K,V from another (encoder output). Used to align different modalities/languages.
14
Why are residual connections critical in Transformers?
📊 Medium
Answer: Residual connections enable gradient flow through deep stacks (12+ layers), mitigating vanishing gradients. They also preserve the original token information as the attention and FFN sublayers transform it.
15
Why use learning rate warmup for Transformers?
🔥 Hard
Answer: Large gradients at start can destabilize training. Warmup (linear increase from 0) stabilizes optimization, especially for Adam with adaptive learning rates. Common in Transformer training.
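The original Transformer's schedule (linear warmup, then inverse-square-root decay) fits in one line; default hyperparameters from the paper shown:

```python
# "Noam" schedule: lr rises linearly until step == warmup_steps,
# then decays as 1/sqrt(step).
def transformer_lr(step, d_model=512, warmup_steps=4000):
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

peak = transformer_lr(4000)                       # maximum at the warmup boundary
print(transformer_lr(1) < transformer_lr(2000) < peak)   # warming up
print(transformer_lr(8000) < peak)                       # decaying
```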
16
Compare Transformers and CNNs for vision tasks.
📊 Medium
Answer: Transformers have a global receptive field from the first layer; CNNs are local with strong inductive biases (translation equivariance, locality). ViT needs more data to compete; hybrids (ConvNeXt, CvT) combine the benefits.
17
Do Transformers share weights like RNNs?
📊 Medium
Answer: Generally no. Each Transformer layer has its own parameters. RNNs reuse the same weight matrices across time steps; Transformer layers are stacked with distinct parameters, increasing capacity (ALBERT is a notable exception that shares weights across layers).
18
What are relative position encodings?
🔥 Hard
Answer: Instead of adding absolute position to embeddings, relative PE injects pairwise distance information into attention logits. Improves generalization for longer sequences. Used in Transformer-XL, T5.
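A T5-style variant can be sketched as a learned scalar bias per pairwise distance, added to the logits before softmax (random table stands in for learned values; clipping distance is an illustrative choice):

```python
# Sketch of a T5-style relative position bias: one learned scalar per
# (clipped) pairwise distance, added to the attention logits.
import numpy as np

n, max_dist = 5, 8
rng = np.random.default_rng(0)
bias_table = rng.normal(size=2 * max_dist + 1)   # one bias per distance in [-8, 8]

i = np.arange(n)[:, None]
j = np.arange(n)[None, :]
rel = np.clip(j - i, -max_dist, max_dist)        # pairwise offsets j - i
bias = bias_table[rel + max_dist]                # (n, n) additive bias

logits = rng.normal(size=(n, n)) + bias          # injected before softmax
print(bias.shape)   # every diagonal of `bias` shares one value
```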
19
How does GPT generate text?
📊 Medium
Answer: Causal LM: predicts next token given previous tokens. Uses masked self-attention. Decodes autoregressively (one token at a time). Can use greedy, sampling, beam search.
logits = model(input_ids); next_token = sample(softmax(logits[:, -1, :]))
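The one-liner above expands into a loop; a toy sketch where `toy_model` is a stand-in bigram logit table (not a real GPT), showing greedy decoding structure:

```python
# Toy autoregressive greedy decoding loop. `toy_model` is a fake bigram
# logit table, not a real GPT; only the loop structure matters here.
import numpy as np

vocab = ["<s>", "the", "cat", "sat", "."]
rng = np.random.default_rng(0)
table = rng.normal(size=(len(vocab), len(vocab)))  # fake next-token logits

def toy_model(input_ids):
    # Returns logits for every position; only the last row is used.
    return table[input_ids]

ids = [0]                                          # start token
for _ in range(4):
    logits = toy_model(ids)                        # (len(ids), vocab_size)
    next_id = int(np.argmax(logits[-1]))           # greedy: take the argmax
    ids.append(next_id)                            # feed back in

print(" ".join(vocab[i] for i in ids))
```

Swapping `np.argmax` for sampling from the softmax, or keeping the top-k beams, gives sampling and beam search respectively.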
20
What challenges arise when scaling Transformers to hundreds of billions of parameters?
🔥 Hard
Answer: Memory (activations, optimizer states), communication overhead, training instability, data quality. Solutions: model parallelism, pipeline parallelism, mixture-of-experts (MoE), activation checkpointing, fp16/bf16 mixed precision.
Transformers – Interview Cheat Sheet
Core Concepts
- Self-Attn Global interaction, O(n²)
- Multi-Head Different subspaces
- Pos Enc Sine/cosine or learned
- LayerNorm Pre/Post norm
Variants
- BERT Encoder-only, MLM + NSP
- GPT Decoder-only, causal LM
- ViT Patch embeddings
RNN vs Transformer
- RNN Sequential, O(n), vanishing gradient
- Transformer Parallel, O(n²), global
Efficiency
- Sparse Attn Longformer
- Linformer O(n) linear
- FlashAttn IO-aware
Verdict: "Attention is all you need – but mind the quadratic cost."