Transformers
20 Essential Q/A
FAANG Interview Prep
Transformers: 20 Interview Questions
Master self-attention, multi-head, positional encoding, BERT, GPT, Vision Transformer (ViT), and core comparisons. Interview-ready concise answers with formulas and intuition.
Self-Attention
Multi-Head
Pos Encoding
BERT
GPT
ViT
1
What is the Transformer architecture? How is it different from RNN?
⚡ Easy
Answer: The Transformer is an attention-based architecture with no recurrence or convolution. It processes all tokens in parallel via self-attention and feed-forward layers. Unlike RNNs, it has no sequential dependency, which allows massive parallelization and better capture of long-range dependencies.
Attention(Q,K,V) = softmax(QK^T/√d_k) V
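The formula above can be sketched directly in NumPy; this is a minimal toy version with made-up shapes, omitting batching and masking:

```python
# Minimal sketch of scaled dot-product attention (toy shapes, no batching/masking).
import numpy as np

def attention(Q, K, V):
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # (n_q, n_k) similarity logits
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = e / e.sum(axis=-1, keepdims=True)     # row-wise softmax
    return weights @ V                              # weighted sum of values

rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))   # 4 query tokens, d_k = 8
K = rng.normal(size=(4, 8))
V = rng.normal(size=(4, 8))
out = attention(Q, K, V)
print(out.shape)              # one context vector per query token
```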
2
How does self-attention work? Intuition.
📊 Medium
Answer: Each token attends to all tokens (including itself) to compute a weighted sum of values. Weights are derived from scaled dot-product of query and key. This captures contextual relationships irrespective of distance.
Intuition: "Retrieve relevant values based on similarity between query and key."
3
Why multi-head attention? Benefits?
🔥 Hard
Answer: Multi-head attention runs multiple attention heads in parallel, each with different learned projections. This lets the model attend to different representation subspaces (e.g., syntactic vs. semantic roles). Head outputs are concatenated and projected, increasing representational power.
MultiHead(Q,K,V) = Concat(head_1,...,head_h)W^O
where head_i = Attention(QW_i^Q, KW_i^K, VW_i^V)
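The concat-and-project mechanics can be shown with reshapes; a hedged sketch with toy sizes (6 tokens, 16 model dims, 4 heads) and random weights standing in for learned ones:

```python
# Sketch: multi-head attention via reshaping projections into h heads,
# attending per head, then concatenating and applying W^O (toy sizes).
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

n, d_model, h = 6, 16, 4          # 6 tokens, model dim 16, 4 heads
d_k = d_model // h                # 4 dims per head
rng = np.random.default_rng(1)
X = rng.normal(size=(n, d_model))
Wq, Wk, Wv, Wo = (rng.normal(size=(d_model, d_model)) for _ in range(4))

def heads(W):
    # Project, split the last axis into (h, d_k), and move heads first.
    return (X @ W).reshape(n, h, d_k).transpose(1, 0, 2)   # (h, n, d_k)

Q, K, V = heads(Wq), heads(Wk), heads(Wv)
A = softmax(Q @ K.transpose(0, 2, 1) / np.sqrt(d_k))       # (h, n, n)
out = (A @ V).transpose(1, 0, 2).reshape(n, d_model) @ Wo  # concat + W^O
print(out.shape)                                           # (n, d_model)
```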
4
Why does Transformer need positional encoding? Describe sine/cosine version.
📊 Medium
Answer: Self-attention is permutation-invariant; it has no notion of order. Positional encodings (PE) inject sequence order. Sine/cosine functions at different frequencies: PE(pos, 2i) = sin(pos/10000^{2i/d}); PE(pos, 2i+1) = cos(pos/10000^{2i/d}). Linear combinations of these encode relative offsets, enabling relative position learning.
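The sine/cosine formulas above translate directly to code; a small sketch (assumes an even embedding dimension):

```python
# Sinusoidal positional encodings: even dims get sin, odd dims get cos,
# with wavelengths forming a geometric progression (d must be even here).
import numpy as np

def positional_encoding(max_len, d):
    pos = np.arange(max_len)[:, None]          # (max_len, 1)
    i = np.arange(d // 2)[None, :]             # (1, d/2) frequency index
    angles = pos / (10000 ** (2 * i / d))
    pe = np.zeros((max_len, d))
    pe[:, 0::2] = np.sin(angles)               # PE(pos, 2i)
    pe[:, 1::2] = np.cos(angles)               # PE(pos, 2i+1)
    return pe

pe = positional_encoding(50, 16)
print(pe.shape)        # (50, 16)
print(pe[0, :4])       # position 0: sin(0)=0, cos(0)=1 alternating
```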
5
Why scale dot-product by 1/√d_k in attention?
🔥 Hard
Answer: For large d_k, dot products grow in magnitude, pushing softmax into regions of extremely small gradients. Scaling by 1/√d_k counteracts variance growth, keeping gradients stable.
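The variance-growth claim is easy to verify numerically: for unit-variance entries, q·k has standard deviation √d_k, and dividing by √d_k restores it to ≈1 (toy Monte Carlo check):

```python
# Monte Carlo check: dot products of d_k-dim unit-variance vectors have
# std ≈ sqrt(d_k); scaling by 1/sqrt(d_k) brings it back to ≈ 1.
import numpy as np

rng = np.random.default_rng(0)
d_k = 512
q = rng.normal(size=(100_000, d_k))
k = rng.normal(size=(100_000, d_k))
dots = (q * k).sum(axis=1)

print(dots.std())                    # ≈ sqrt(512) ≈ 22.6
print((dots / np.sqrt(d_k)).std())   # ≈ 1.0
```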
6
Describe Transformer's encoder and decoder blocks.
📊 Medium
Answer: Encoder: self-attention + feed-forward + residual + layer norm. Decoder: masked self-attention (to prevent looking ahead), cross-attention over encoder output, then FFN. Both use residual connections.
7
What is masked self-attention in decoder?
📊 Medium
Answer: Prevents positions from attending to subsequent (future) positions. Achieved by setting attention logits to -∞ before softmax. Ensures autoregressive generation.
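A minimal sketch of the mask: -∞ logits above the diagonal become exactly zero attention weight after softmax (uniform toy logits for illustration):

```python
# Causal mask sketch: future positions (upper triangle) get -inf logits,
# so their softmax weights are exactly zero.
import numpy as np

n = 4
logits = np.zeros((n, n))                         # toy uniform logits
mask = np.triu(np.ones((n, n), dtype=bool), k=1)  # True strictly above diagonal
logits[mask] = -np.inf

e = np.exp(logits - logits.max(axis=-1, keepdims=True))
weights = e / e.sum(axis=-1, keepdims=True)
print(weights)   # row i attends uniformly over positions 0..i, zero elsewhere
```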
8
Why layer norm instead of batch norm in Transformers?
🔥 Hard
Answer: Layer norm is independent of batch size and works well with variable sequence lengths. Batch norm's statistics are unstable for varying sequence lengths and small batches, both common in NLP. Layer norm normalizes across the feature dimension for each token.
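Per-token normalization can be sketched in a few lines (learnable gain/bias omitted for brevity):

```python
# Layer norm: normalize each token over its feature axis, independent of
# the batch. Learnable gamma/beta are omitted in this sketch.
import numpy as np

def layer_norm(x, eps=1e-5):
    mu = x.mean(axis=-1, keepdims=True)        # per-token mean over features
    var = x.var(axis=-1, keepdims=True)        # per-token variance
    return (x - mu) / np.sqrt(var + eps)

x = np.random.default_rng(0).normal(loc=3.0, scale=2.0, size=(2, 5, 8))
y = layer_norm(x)
print(np.allclose(y.mean(-1), 0, atol=1e-6))   # each token re-centered
print(np.allclose(y.var(-1), 1, atol=1e-3))    # each token re-scaled
```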
9
How is BERT different from GPT? Objectives?
🔥 Hard
Answer: BERT is encoder-only, bidirectional; pretrained with masked LM (MLM) and next sentence prediction. GPT is decoder-only, unidirectional (causal LM); autoregressive. BERT excels at NLU; GPT at generation.
10
Explain pretraining and fine-tuning paradigm.
📊 Medium
Answer: Large-scale pretraining on unlabeled text (e.g., Wikipedia) learns general language representations. Fine-tuning then adapts the pretrained weights to downstream tasks with labeled data, which is both compute- and data-efficient.
11
How does Vision Transformer (ViT) work?
🔥 Hard
Answer: Split image into fixed-size patches, flatten and project linearly to embeddings, add positional embeddings, feed to standard Transformer encoder. Classification via [CLS] token. No convolutions.
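The patchify-and-project step can be sketched with reshapes; toy sizes (32×32 RGB image, 8×8 patches) and random weights standing in for the learned projection:

```python
# ViT patch embedding sketch: cut a 32x32x3 image into 8x8 patches,
# flatten each, and project linearly (random weights stand in for learned ones).
import numpy as np

H = W = 32; P = 8; C = 3; d_model = 64
img = np.random.default_rng(0).random((H, W, C))

# (H, W, C) -> (H/P, P, W/P, P, C) -> (num_patches, P*P*C)
patches = img.reshape(H // P, P, W // P, P, C).transpose(0, 2, 1, 3, 4)
patches = patches.reshape(-1, P * P * C)          # (16, 192)

E = np.random.default_rng(1).normal(size=(P * P * C, d_model))
tokens = patches @ E                              # (16, 64) patch embeddings
print(tokens.shape)   # prepend [CLS], add positional embeddings, then encode
```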
12
What is the main computational bottleneck of Transformers?
📊 Medium
Answer: Self-attention has O(n² d) complexity in sequence length n. Long sequences (e.g., documents, video) are expensive. Solutions: sparse attention, Linformer, Reformer, Longformer.
Pro: parallel, global receptive field
Con: O(n²) memory/compute
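A back-of-envelope sketch of why the quadratic term bites: the attention matrix alone holds n² entries per head, so doubling the sequence length quadruples its memory (illustrative numbers, fp32, 12 heads assumed):

```python
# Back-of-envelope: attention matrix memory grows as n^2 per head.
def attn_matrix_bytes(n, heads=12, bytes_per_el=4):
    # One n x n fp32 attention matrix per head (single layer, batch of 1).
    return n * n * heads * bytes_per_el

for n in (512, 1024, 2048, 8192):
    print(n, attn_matrix_bytes(n) / 2**20, "MiB")   # 512 tokens -> 12 MiB
```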
13
Differentiate self-attention and cross-attention.
⚡ Easy
Answer: Self-attention: Q,K,V from same sequence. Cross-attention: Q from one sequence (e.g., decoder), K,V from another (encoder output). Used to align different modalities/languages.
14
Why are residual connections critical in Transformers?
📊 Medium
Answer: Residual connections enable gradient flow through deep stacks (12+ layers), mitigating vanishing gradients. They also preserve the original token information as the attention and FFN sublayers transform it.
15
Why use learning rate warmup for Transformers?
🔥 Hard
Answer: Large gradients at start can destabilize training. Warmup (linear increase from 0) stabilizes optimization, especially for Adam with adaptive learning rates. Common in Transformer training.
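The original Transformer's schedule (linear warmup, then inverse-square-root decay) fits in one line; default hyperparameters from the paper shown:

```python
# "Noam" schedule: lr rises linearly until step == warmup_steps,
# then decays as 1/sqrt(step).
def transformer_lr(step, d_model=512, warmup_steps=4000):
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

peak = transformer_lr(4000)                       # maximum at the warmup boundary
print(transformer_lr(1) < transformer_lr(2000) < peak)   # warming up
print(transformer_lr(8000) < peak)                       # decaying
```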
16
Compare Transformers and CNNs for vision tasks.
📊 Medium
Answer: Transformers have a global receptive field from the first layer; CNNs are local with strong inductive biases (translation equivariance, locality). ViT needs more data to compete; hybrids (ConvNeXt, CvT) combine the benefits.
17
Do Transformers share weights like RNNs?
📊 Medium
Answer: Generally no. Each Transformer layer has its own parameters. RNNs reuse the same weight matrices across time steps; Transformer layers are stacked with distinct parameters, increasing capacity (ALBERT is a notable exception that shares weights across layers).
18
What are relative position encodings?
🔥 Hard
Answer: Instead of adding absolute position to embeddings, relative PE injects pairwise distance information into attention logits. Improves generalization for longer sequences. Used in Transformer-XL, T5.
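A T5-style variant can be sketched as a learned scalar bias per pairwise distance, added to the logits before softmax (random table stands in for learned values; clipping distance is an illustrative choice):

```python
# Sketch of a T5-style relative position bias: one learned scalar per
# (clipped) pairwise distance, added to the attention logits.
import numpy as np

n, max_dist = 5, 8
rng = np.random.default_rng(0)
bias_table = rng.normal(size=2 * max_dist + 1)   # one bias per distance in [-8, 8]

i = np.arange(n)[:, None]
j = np.arange(n)[None, :]
rel = np.clip(j - i, -max_dist, max_dist)        # pairwise offsets j - i
bias = bias_table[rel + max_dist]                # (n, n) additive bias

logits = rng.normal(size=(n, n)) + bias          # injected before softmax
print(bias.shape)   # every diagonal of `bias` shares one value
```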
19
How does GPT generate text?
📊 Medium
Answer: Causal LM: predicts next token given previous tokens. Uses masked self-attention. Decodes autoregressively (one token at a time). Can use greedy, sampling, beam search.
logits = model(input_ids); next_token = sample(softmax(logits[:, -1, :]))
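The one-liner above expands into a loop; a toy sketch where `toy_model` is a stand-in bigram logit table (not a real GPT), showing greedy decoding structure:

```python
# Toy autoregressive greedy decoding loop. `toy_model` is a fake bigram
# logit table, not a real GPT; only the loop structure matters here.
import numpy as np

vocab = ["<s>", "the", "cat", "sat", "."]
rng = np.random.default_rng(0)
table = rng.normal(size=(len(vocab), len(vocab)))  # fake next-token logits

def toy_model(input_ids):
    # Returns logits for every position; only the last row is used.
    return table[input_ids]

ids = [0]                                          # start token
for _ in range(4):
    logits = toy_model(ids)                        # (len(ids), vocab_size)
    next_id = int(np.argmax(logits[-1]))           # greedy: take the argmax
    ids.append(next_id)                            # feed back in

print(" ".join(vocab[i] for i in ids))
```

Swapping `np.argmax` for sampling from the softmax, or keeping the top-k beams, gives sampling and beam search respectively.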
20
What challenges arise when scaling Transformers to hundreds of billions of parameters?
🔥 Hard
Answer: Memory (activations, optimizer states), communication overhead, training instability, data quality. Solutions: model parallelism, pipeline parallelism, mixture-of-experts (MoE), activation checkpointing, fp16/bf16 mixed precision.
Transformers – Interview Cheat Sheet
Core Concepts
- Self-Attn Global interaction, O(n²)
- Multi-Head Different subspaces
- Pos Enc Sine/cosine or learned
- LayerNorm Pre/Post norm
Variants
- BERT Encoder-only, MLM + NSP
- GPT Decoder-only, causal LM
- ViT Patch embeddings
RNN vs Transformer
- RNN Sequential, O(n), vanishing gradient
- Transformer Parallel, O(n²), global
Efficiency
- Sparse Attn Longformer
- Linformer O(n) linear
- FlashAttn IO-aware
Verdict: "Attention is all you need – but mind the quadratic cost."