RNN, Transformers & Attention — Interview Q&A

Question 1

1 What is a Recurrent Neural Network (RNN)? Typical applications? ⚡ Easy

Answer

Answer: RNNs process sequential data by maintaining a hidden state that captures information from previous time steps. They share weights across time. Applications: NLP (language modeling, translation), time series forecasting, speech recognition, video analysis.

Question 2

2 Why do vanilla RNNs suffer from vanishing/exploding gradient? 📊 Medium

Answer

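Answer: During BPTT, the gradient is multiplied at every time step by the Jacobian of the recurrence, which contains the same recurrent weight matrix W_hh (times the activation derivative). Roughly, if the largest singular value of W_hh is below 1, gradients shrink exponentially and vanish; if it is above 1, they can grow exponentially and explode. Either way, learning long-range dependencies becomes difficult.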
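
A rough numeric sketch of this effect, assuming a scalar linear recurrence h_t = w * h_{t-1}, so that each per-step Jacobian is just the hypothetical weight w (a real RNN has a matrix and an activation derivative instead):

```python
def gradient_through_time(w: float, steps: int) -> float:
    """Magnitude of d h_T / d h_0 for the scalar linear RNN h_t = w * h_{t-1}."""
    grad = 1.0
    for _ in range(steps):
        grad *= w  # one Jacobian factor per time step
    return abs(grad)


# Slightly below 1 vanishes, slightly above 1 explodes, over 100 steps.
for w in (0.9, 1.0, 1.1):
    print(w, gradient_through_time(w, steps=100))
```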

Question 3

3 Explain LSTM architecture. How does it solve vanishing gradient? 🔥 Hard

Answer

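Answer: LSTM introduces a cell state (C_t) that acts as a memory highway with additive updates: C_t = f_t * C_{t-1} + i_t * g_t, where g_t is the candidate content. Three gates control it: forget (f), input (i), output (o). Along the cell state, the gradient is scaled only by the forget gate rather than by a repeated weight-matrix product (the constant error carousel, CEC), so it is preserved when f_t is close to 1.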

Question 4

4 What is the purpose of each LSTM gate? 📊 Medium

Answer

Answer:

Forget gate: decides what to discard from cell state.
Input gate: decides what new info to store.
Output gate: decides what to output based on cell state.
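
The three gate roles can be sketched for a single LSTM unit as follows. This is a minimal illustration, not a full layer: the gate pre-activations (normally W x_t + U h_{t-1} + b per gate) are passed in directly, and the names f_pre, i_pre, o_pre, g_pre are assumptions of this sketch.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def lstm_cell_update(c_prev, f_pre, i_pre, o_pre, g_pre):
    """One LSTM step for a single unit, given the gate pre-activations."""
    f = sigmoid(f_pre)      # forget gate: fraction of the old cell state to keep
    i = sigmoid(i_pre)      # input gate: fraction of the candidate to write
    o = sigmoid(o_pre)      # output gate: fraction of the cell state to expose
    g = math.tanh(g_pre)    # candidate cell content
    c = f * c_prev + i * g  # additive cell-state update
    h = o * math.tanh(c)    # hidden state passed to the next step / layer
    return c, h


# Forget gate open, input gate closed: the cell state is carried over unchanged.
c, h = lstm_cell_update(c_prev=2.0, f_pre=50.0, i_pre=-50.0, o_pre=50.0, g_pre=0.3)
print(c, h)
```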

Question 5

5 How is GRU different from LSTM? When to prefer GRU? 📊 Medium

Answer

Answer: GRU has 2 gates (update, reset), no separate cell state; merges forget/input gates. Simpler, fewer parameters, less prone to overfitting on small data. LSTM is more expressive; GRU often matches performance with less compute.

Question 6

6 What is a bidirectional RNN? When to use? 📊 Medium

Answer

Answer: BRNN processes sequence forward and backward, concatenating hidden states. Captures context from both past and future. Used in NLP tasks (NER, POS tagging) where entire sequence is available. Not for real-time or streaming.

Question 7

7 Explain Backpropagation Through Time (BPTT). What is truncated BPTT? 🔥 Hard

Answer

Answer: BPTT unfolds RNN for all time steps, computes gradients over entire sequence (expensive). Truncated BPTT limits unfolding to k steps, approximates gradients, efficient for long sequences.

Question 8

8 Describe seq2seq model. Role of encoder and decoder? 📊 Medium

Answer

Answer: Seq2seq uses encoder RNN to compress input sequence into context vector (final hidden state). Decoder RNN generates output sequence from context. Used in machine translation, summarization.

Question 9

9 Why was attention introduced? How does it help RNNs? 🔥 Hard

Answer

Answer: Seq2seq with single context vector fails for long sentences (bottleneck). Attention allows decoder to look at all encoder hidden states, weighted by relevance. Provides shortcut to gradient flow and interpretability.

Question 10

10 What is peephole LSTM? 🔥 Hard

Answer

Answer: Peephole connections let the gates see the cell state in addition to h_{t-1} and x_t: typically the forget and input gates see C_{t-1}, and the output gate sees the updated C_t. This gives finer control over timing, but is not always beneficial.

Question 11

11 What are stacked RNNs? Benefits? 📊 Medium

Answer

Answer: Multiple RNN layers where hidden state of one layer is input to next. Increases capacity, learns hierarchical representations. Higher layers capture longer-term abstractions.

Question 12

12 Why do RNNs share weights across time? 📊 Medium

Answer

Answer: Weight sharing enables generalization across sequence lengths and positions. Model learns transition function independent of time step. Reduces parameters dramatically.

Question 13

13 How to handle exploding gradient in RNNs? 📊 Medium

Answer

Answer: Gradient clipping: rescale gradient if norm exceeds threshold. Also weight regularization, careful initialization (e.g., identity matrix for recurrent weights).
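
A minimal sketch of clipping by global norm, assuming the gradients have been flattened into one plain list of floats (the function name and return shape are choices made for this example):

```python
import math


def clip_by_global_norm(grads, max_norm):
    """Rescale gradient values so their joint L2 norm is at most max_norm.

    Returns the (possibly rescaled) gradients and the norm before clipping.
    """
    total_norm = math.sqrt(sum(g * g for g in grads))
    if total_norm <= max_norm:
        return list(grads), total_norm
    scale = max_norm / total_norm  # shrink every component by the same factor
    return [g * scale for g in grads], total_norm


clipped, norm_before = clip_by_global_norm([3.0, 4.0], max_norm=1.0)
print(norm_before, clipped)
```

Rescaling keeps the gradient direction and only limits the step size. In PyTorch the same operation is available as torch.nn.utils.clip_grad_norm_.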

Question 14

14 How do LSTM/GRU mitigate vanishing gradient specifically? 🔥 Hard

Answer

Answer: LSTM's cell state has additive (not multiplicative) gradient flow. Forget gate can be close to 1, preserving gradient. GRU similarly uses additive update via update gate. Both create gradient highways.

Question 15

15 How do RNNs handle variable-length sequences? ⚡ Easy

Answer

Answer: RNNs process tokens one by one; hidden state adapts. For batching, we pad sequences to same length and use masking to ignore padding.
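
A small sketch of padding plus masking for a batch of token-id lists. The pad id of 0 and the 1/0 mask convention are assumptions of this example; frameworks differ on both.

```python
def pad_batch(sequences, pad_id=0):
    """Pad token-id lists to a common length and build a 1/0 validity mask."""
    max_len = max(len(seq) for seq in sequences)
    padded, mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        padded.append(list(seq) + [pad_id] * n_pad)
        mask.append([1] * len(seq) + [0] * n_pad)  # 1 = real token, 0 = padding
    return padded, mask


padded, mask = pad_batch([[5, 6, 7], [8]])
print(padded)
print(mask)
```

The mask is then used to zero out the loss (or skip the state update) at padded positions.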

Question 16

16 What is teacher forcing in RNN training? 📊 Medium

Answer

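Answer: During decoder training, feed the ground-truth previous output instead of the model's own prediction. This speeds convergence, but creates exposure bias, because at inference the model must condition on its own (possibly wrong) outputs. Scheduled sampling gradually shifts from ground-truth inputs to the model's own predictions during training.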

Question 17

17 Compare RNNs and Transformers for sequence modeling. 🔥 Hard

Answer

Answer: RNNs are sequential (O(n) dependent steps); Transformers process positions in parallel, but self-attention costs O(n²) in sequence length. Transformers handle long-range dependencies better but need positional encoding. RNNs are lighter for short sequences and use less memory.

Question 18

18 When would you still choose RNN/LSTM today? 📊 Medium

Answer

Answer: Low-latency streaming tasks (speech recognition), small datasets, mobile/edge devices (lightweight), or when interpretability of hidden states is useful.

Question 19

19 What is beam search in RNN decoders? 🔥 Hard

Answer

Answer: Instead of greedy decoding, beam search keeps k most probable sequences at each step. Finds higher-likelihood outputs at cost of computation.
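
A compact sketch of beam search over a toy model. Here step_log_probs is a hypothetical hook standing in for the decoder (it maps a token prefix to next-token log-probabilities), and toy_model is made-up data chosen so that greedy decoding misses the best sequence. Length normalization, which practical decoders often add, is left out.

```python
import math


def beam_search(step_log_probs, beam_width, max_len, start_token, end_token):
    """Return the best (sequence, log-prob) found with a beam of size beam_width."""
    beams = [((start_token,), 0.0)]  # (sequence, cumulative log-probability)
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            if seq[-1] == end_token:  # finished hypotheses carry over unchanged
                candidates.append((seq, score))
                continue
            for token, logp in step_log_probs(seq).items():
                candidates.append((seq + (token,), score + logp))
        # Keep only the k most probable hypotheses.
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
        if all(seq[-1] == end_token for seq, _ in beams):
            break
    return beams[0]


def toy_model(prefix):
    """Hypothetical next-token distribution; unknown prefixes end the sequence."""
    table = {
        ("<s>",): {"a": 0.6, "b": 0.4},
        ("<s>", "a"): {"x": 0.5, "y": 0.5},
        ("<s>", "b"): {"z": 0.9, "w": 0.1},
    }
    probs = table.get(prefix, {"</s>": 1.0})
    return {tok: math.log(p) for tok, p in probs.items()}


greedy = beam_search(toy_model, 1, 5, "<s>", "</s>")  # beam width 1 is greedy
beam = beam_search(toy_model, 2, 5, "<s>", "</s>")
print(greedy[0], math.exp(greedy[1]))
print(beam[0], math.exp(beam[1]))
```

With a beam of 1 the search commits to "a" (probability 0.6) and ends with total probability 0.3; with a beam of 2 it keeps "b" alive and finds the sequence with probability 0.36.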

Question 20

20 Why initialize LSTM forget gate bias to 1? 🔥 Hard

Answer

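Answer: Setting the forget gate bias to 1 (or another large positive value) at initialization keeps the forget gate mostly open, so the cell state and its gradient are retained early in training. Keras does this by default (unit_forget_bias=True); PyTorch's nn.LSTM does not, so there it has to be set manually.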

Question 21

21 What is the Transformer architecture? How is it different from RNN? ⚡ Easy

Answer

Answer: Transformer is an attention-based architecture without recurrence or convolution. It processes all tokens in parallel via self-attention and feed-forward layers. Unlike RNNs, it has no sequential dependency between positions during training, allowing massive parallelization and better capture of long-range dependencies (autoregressive decoding at inference is still token by token).

Question 22

22 How does self-attention work? Intuition. 📊 Medium

Answer

Answer: Each token attends to all tokens (including itself) to compute a weighted sum of values. Weights are derived from scaled dot-product of query and key. This captures contextual relationships irrespective of distance.

Question 23

23 Why multi-head attention? Benefits? 🔥 Hard

Answer

Answer: Multi-head runs multiple attention heads in parallel, each with different learned projections. Allows model to focus on different subspaces (e.g., syntactic, semantic roles). Outputs are concatenated and projected. Increases representational power.

Question 24

24 Why does Transformer need positional encoding? Describe sine/cosine version. 📊 Medium

Answer

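Answer: Without positional information, self-attention is permutation-equivariant: shuffling the input tokens just shuffles the outputs, so the model has no notion of order. Positional encodings (PE) inject sequence order. The sinusoidal version uses sine/cosine functions at different frequencies: PE(pos, 2i) = sin(pos/10000^{2i/d}); PE(pos, 2i+1) = cos(pos/10000^{2i/d}). Because PE(pos+k) is a linear function of PE(pos) for a fixed offset k, the model can learn to attend by relative position.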
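
The sine/cosine formulas can be sketched in plain Python as below. The function name is an assumption of this example, and d_model is assumed to be even.

```python
import math


def sinusoidal_positional_encoding(num_positions, d_model):
    """Return a num_positions x d_model table of sinusoidal encodings."""
    table = []
    for pos in range(num_positions):
        row = [0.0] * d_model
        for i in range(0, d_model, 2):  # i here plays the role of 2i in the formula
            angle = pos / (10000 ** (i / d_model))
            row[i] = math.sin(angle)      # even dimensions use sine
            row[i + 1] = math.cos(angle)  # odd dimensions use cosine
        table.append(row)
    return table


pe = sinusoidal_positional_encoding(num_positions=4, d_model=6)
print(pe[0])  # position 0: sin(0) = 0 and cos(0) = 1 in every pair
```

Each row is added to the token embedding at that position before the first layer.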

Question 25

25 Why scale dot-product by 1/√d_k in attention? 🔥 Hard

Answer

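Answer: For large d_k, dot products grow in magnitude (if the components of q and k are independent with zero mean and unit variance, q·k has variance d_k), pushing softmax into regions of extremely small gradients. Scaling by 1/√d_k brings the variance back to about 1, keeping gradients stable.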

Question 26

26 Describe Transformer's encoder and decoder blocks. 📊 Medium

Answer

Answer: Encoder: self-attention + feed-forward + residual + layer norm. Decoder: masked self-attention (to prevent looking ahead), cross-attention over encoder output, then FFN. Both use residual connections.

Question 27

27 What is masked self-attention in decoder? 📊 Medium

Answer

Answer: Prevents positions from attending to subsequent (future) positions. Achieved by setting the corresponding attention logits to -∞ before softmax. Ensures autoregressive generation.
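
A minimal sketch of the masking step, assuming the raw attention scores are already available as a square list of lists (row i holds the scores of query position i against every key position):

```python
import math


def causal_softmax(scores):
    """Row-wise softmax over a square score matrix with future positions masked."""
    n = len(scores)
    weights = []
    for i in range(n):
        # Position i may attend to positions 0..i only; the rest get -inf.
        masked = [scores[i][j] if j <= i else -math.inf for j in range(n)]
        row_max = max(masked)
        exps = [math.exp(s - row_max) for s in masked]  # exp(-inf) evaluates to 0.0
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights


for row in causal_softmax([[0.0, 0.0, 0.0]] * 3):
    print(row)
```

With equal scores, position 0 puts all its weight on itself, position 1 splits evenly over the first two positions, and so on; masked entries receive exactly zero weight.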

Question 28

28 Why layer norm instead of batch norm in Transformers? 🔥 Hard

Answer

Answer: Layer norm is independent of batch size and works well with variable sequence lengths. Batch norm's statistics are unstable for varying T and small batch, common in NLP. Layer norm normalizes across features for each token.

Question 29

29 How is BERT different from GPT? Objectives? 🔥 Hard

Answer

Answer: BERT is encoder-only, bidirectional; pretrained with masked LM (MLM) and next sentence prediction. GPT is decoder-only, unidirectional (causal LM); autoregressive. BERT excels at NLU; GPT at generation.

Question 30

30 Explain pretraining and fine-tuning paradigm. 📊 Medium

Answer

Answer: Large-scale pretraining on unlabeled text (e.g., Wikipedia) learns general language representations. Fine-tuning then adapts the pretrained weights to a downstream task with labeled data, which needs far less data and compute than training from scratch.

Question 31

31 How does Vision Transformer (ViT) work? 🔥 Hard

Answer

Answer: Split image into fixed-size patches, flatten and project linearly to embeddings, add positional embeddings, feed to standard Transformer encoder. Classification via [CLS] token. No convolutions.

Question 32

32 What is the main computational bottleneck of Transformers? 📊 Medium

Answer

Answer: Self-attention has O(n² d) time complexity in sequence length n. Long sequences (e.g., documents, video) are expensive. Solutions: sparse attention, Linformer, Reformer, Longformer.

Question 33

33 Differentiate self-attention and cross-attention. ⚡ Easy

Answer

Answer: Self-attention: Q,K,V from same sequence. Cross-attention: Q from one sequence (e.g., decoder), K,V from another (encoder output). Used to align different modalities/languages.

Question 34

34 Why are residual connections critical in Transformers? 📊 Medium

Answer

Answer: Enable gradient flow through deep stacks (12+ layers). Mitigate vanishing gradient. Also preserve original token information through attention modifications.

Question 35

35 Why use learning rate warmup for Transformers? 🔥 Hard

Answer

Answer: Large gradients at start can destabilize training. Warmup (linear increase from 0) stabilizes optimization, especially for Adam with adaptive learning rates. Common in Transformer training.
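
One concrete warmup schedule is the one from the original Transformer paper: linear warmup followed by inverse-square-root decay. The sketch below uses that formula; the defaults d_model=512 and warmup_steps=4000 are illustrative, and other schedules (for example linear warmup with cosine decay) are also common.

```python
def transformer_lr(step, d_model=512, warmup_steps=4000):
    """Linear warmup, then inverse-square-root decay. Steps count from 1."""
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)


for step in (1, 1000, 4000, 16000):
    print(step, transformer_lr(step))
```

The rate rises linearly until warmup_steps, peaks there, and then decays proportionally to 1/sqrt(step).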

Question 36

36 Compare Transformers and CNNs for vision tasks. 📊 Medium

Answer

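Answer: Transformers have a global receptive field from the first layer; CNNs are local and carry built-in inductive biases (locality, translation equivariance). ViT therefore tends to need more data. Hybrids such as CvT combine convolution and attention, while ConvNeXt is a pure CNN redesigned with Transformer-era design choices.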

Question 37

37 Do Transformers share weights like RNNs? 📊 Medium

Answer

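Answer: Not across layers, but yes across positions. Within a layer, the same attention and feed-forward weights are applied at every position, much as an RNN reuses one weight matrix at every time step. Across depth, each Transformer layer normally has its own parameters, which increases capacity (weight-tied variants such as ALBERT and the Universal Transformer are exceptions).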

Question 38

38 What are relative position encodings? 🔥 Hard

Answer

Answer: Instead of adding absolute position to embeddings, relative PE injects pairwise distance information into attention logits. Improves generalization for longer sequences. Used in Transformer-XL, T5.

Question 39

39 How does GPT generate text? 📊 Medium

Answer

Answer: Causal LM: predicts next token given previous tokens. Uses masked self-attention. Decodes autoregressively (one token at a time). Can use greedy, sampling, beam search.

Question 40

40 What challenges arise when scaling Transformers to hundreds of billions of parameters? 🔥 Hard

Answer

Answer: Memory (activations, optimizer states), communication overhead, training instability, data quality. Solutions: model parallelism, pipeline parallelism, mixture-of-experts (MoE), activation checkpointing, fp16/bf16 mixed precision.

Question 41

41 What is the attention mechanism? Why was it introduced in deep learning? ⚡ Easy

Answer

Answer: Attention allows a model to dynamically weigh the importance of different input elements when producing output. Introduced initially in seq2seq models (Bahdanau et al., 2015) to overcome the bottleneck of a fixed-length context vector. It provides a shortcut for gradients and improves long-range dependency capture.

Question 42

42 Compare Bahdanau (additive) and Luong (multiplicative) attention. 📊 Medium

Answer

Answer:

Bahdanau: Uses a small feedforward network to compute the alignment score. Score = v_a^T tanh(W_1 h_dec + W_2 h_enc), where h_dec is the previous decoder state. More expressive but slower.
Luong: Simpler dot-product (or general) score. Score = h_dec^T h_enc (or h_dec^T W h_enc), using the current decoder state. Faster, often used with global/local attention.

Question 43

43 Explain self-attention. How is it different from cross-attention? 🔥 Hard

Answer

Answer: Self-attention computes attention within the same sequence (Q, K, V from same source). Each token attends to all tokens in the sequence, including itself. Cross-attention: Q from one sequence (e.g., decoder), K,V from another (e.g., encoder). Core of Transformers.

Question 44

44 What is scaled dot-product attention? Why scale by √d_k? 🔥 Hard

Answer

Answer: Attention(Q,K,V) = softmax(QK^T / √d_k) V. Dividing by √d_k keeps the dot products from growing large in magnitude, which would otherwise push softmax into regions of extremely small gradients. This stabilizes training, especially for high-dimensional keys.
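
The formula can be sketched directly in plain Python. This version works on lists of row vectors rather than tensors, purely to stay dependency-free; a real implementation would use batched matrix multiplies.

```python
import math


def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_dot_product_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V, with Q, K, V given as lists of row vectors."""
    d_k = len(K[0])
    outputs = []
    for q in Q:
        # Similarity of this query to every key, scaled by 1/sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)
        # Output is the attention-weighted sum of the value vectors.
        out = [sum(w * v[j] for w, v in zip(weights, V)) for j in range(len(V[0]))]
        outputs.append(out)
    return outputs


Q = [[1.0, 0.0]]
K = [[1.0, 0.0], [1.0, 0.0]]
V = [[2.0], [4.0]]
print(scaled_dot_product_attention(Q, K, V))  # equal scores, so the values are averaged
```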

Question 45

45 What is multi-head attention? Why is it beneficial? 📊 Medium

Answer

Answer: Multi-head attention projects Q, K, V into h subspaces, applies attention in parallel, then concatenates and projects. Each head can focus on different relationships (e.g., syntax, semantics, long-distance). Because each head works in d_model/h dimensions, this adds representational diversity at roughly the cost of one full-width attention head.

Question 46

46 Sketch the high-level Transformer architecture (encoder-decoder). 🔥 Hard

Answer

Answer:

Encoder: stack of N identical layers. Each layer: Multi-Head Self-Attention + FeedForward, with Add&Norm (residual + LayerNorm).
Decoder: stack of N layers. Each layer: Masked Multi-Head Self-Attention (causal) + Cross-Attention (Q from decoder, K,V from encoder) + FeedForward, with Add&Norm.

Question 47

47 Why does Transformer need positional encoding? Describe sinusoidal encoding. 📊 Medium

Answer

Answer: Without positional information, self-attention is permutation-equivariant; it has no inherent notion of order. Positional encoding adds information about position. Sinusoidal: PE(pos, 2i) = sin(pos/10000^(2i/d)), PE(pos, 2i+1) = cos(pos/10000^(2i/d)). This lets the model attend by relative position and, in principle, handle sequences longer than those seen in training, although extrapolation is limited in practice.

Question 48

48 What is masked self-attention? Why is it used? 📊 Medium

Answer

Answer: In the decoder, masked attention prevents positions from attending to future positions (causal). Achieved by setting attention scores to -∞ (or a large negative value) for illegal connections before softmax. Ensures the autoregressive property: the prediction at step t depends only on previous tokens.

Question 49

49 Why is attention (Transformer) more parallelizable than RNNs? 📊 Medium

Answer

Answer: RNNs process tokens sequentially (O(n) steps). Self-attention computes all pairwise interactions in O(1) sequential steps (parallel across the sequence). However, compute cost per layer is O(n² d) vs O(n d²) for an RNN. Trade-off: parallelization vs quadratic complexity.

Question 50

50 What is the time and memory complexity of vanilla self-attention? 🔥 Hard

Answer

Answer: Time complexity: O(n² · d) where n is sequence length, d is dimension. Memory complexity: O(n²) for the attention matrix. This limits long sequences. Solutions: sparse attention (Longformer), linear attention (Performer), sliding window.

Question 51

51 How is BERT's attention different from Transformer decoder? 🔥 Hard

Answer

Answer: BERT uses an encoder-only Transformer with bidirectional self-attention (no causal masking; only padding is masked). A Transformer decoder uses causal (masked) self-attention + cross-attention. BERT is trained with MLM (masked language modeling), the decoder with autoregressive LM.

Question 52

52 How does Vision Transformer (ViT) apply attention to images? 🔥 Hard

Answer

Answer: ViT splits image into patches (e.g., 16x16), flattens and projects to embeddings, adds positional embeddings, and feeds to standard Transformer encoder. No convolutions. Competes with CNNs on image classification, scales well with data.

Question 53

53 Where is cross-attention used besides encoder-decoder? 📊 Medium

Answer

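Answer: Multi-modal models such as Flamingo: cross-attention layers let a language model attend to visual features. (CLIP and the original DALL-E are often listed here, but CLIP aligns image and text with two separate encoders and a contrastive loss, and the original DALL-E models text and image tokens in one autoregressive sequence.) In object detection (DETR): cross-attention between object queries and image features. In Stable Diffusion: cross-attention from image latents (queries) to text embeddings (keys/values).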

Question 54

54 Name and describe sparse attention variants. 🔥 Hard

Answer

Answer:

Sliding window (Longformer): each token attends to w neighbors.
Dilated sliding window: like conv, larger receptive field.
Global + sliding: special tokens (CLS) attend to all.
Block sparse (BigBird): random + window + global.
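
The sliding-window and global patterns above can be sketched as a boolean attention mask. In this example window is taken to be the one-sided width and global_positions is a hypothetical parameter naming the global tokens; the random blocks of BigBird are omitted.

```python
def sliding_window_mask(n, window, global_positions=()):
    """Return an n x n mask; mask[i][j] is True if position i may attend to j.

    Tokens listed in global_positions attend to, and are attended by, everyone.
    """
    g = set(global_positions)
    return [
        [abs(i - j) <= window or i in g or j in g for j in range(n)]
        for i in range(n)
    ]


for row in sliding_window_mask(5, window=1, global_positions=(0,)):
    print("".join("x" if allowed else "." for allowed in row))
```

The number of allowed pairs grows as O(n · w) rather than O(n²), which is where the savings come from.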

Question 55

55 What problem does FlashAttention solve? 🔥 Hard

Answer

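Answer: Standard attention materializes the O(n²) score matrix in GPU HBM, making memory traffic the bottleneck. FlashAttention tiles the computation and uses an online softmax to compute exact attention block by block in on-chip SRAM, without writing the full matrix to HBM. Reported speedups are around 2-4x; memory grows linearly in n (compute is still O(n²)), which enables longer context.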
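
The block-by-block idea rests on an online softmax: a running maximum and running denominator are updated as each block of keys arrives, and earlier partial sums are rescaled. The sketch below shows only that numerical trick for a single query in plain Python; it is not the GPU kernel, and the function name and block_size parameter are assumptions of this example.

```python
import math


def attention_row_blockwise(q, K, V, block_size):
    """Exact attention output for one query, streaming over K/V in blocks.

    Only a running max, a running denominator and a running weighted sum are
    kept, so the full row of scores is never stored at once.
    """
    scale = 1.0 / math.sqrt(len(q))
    running_max = -math.inf
    denom = 0.0
    acc = [0.0] * len(V[0])
    for start in range(0, len(K), block_size):
        k_blk = K[start:start + block_size]
        v_blk = V[start:start + block_size]
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in k_blk]
        new_max = max(running_max, max(scores))
        correction = math.exp(running_max - new_max)  # rescale earlier partial sums
        denom *= correction
        acc = [a * correction for a in acc]
        for s, v in zip(scores, v_blk):
            w = math.exp(s - new_max)
            denom += w
            acc = [a + w * vj for a, vj in zip(acc, v)]
        running_max = new_max
    return [a / denom for a in acc]


q = [1.0, 2.0]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.5], [0.3, 0.3]]
V = [[1.0], [2.0], [3.0], [4.0], [5.0]]
print(attention_row_blockwise(q, K, V, block_size=2))
```

The result matches ordinary softmax attention up to floating-point rounding, regardless of the block size.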

Question 56

56 Explain attention as a soft dictionary (key-value) retrieval. 📊 Medium

Answer

Answer: Query (Q) is what we're looking for. Keys (K) are indices. Values (V) are the actual content. Attention weights are similarity between Q and K; output is weighted sum of V. Differentiable, soft, end-to-end.

Question 57

57 What is relative positional encoding? Why is it used? 🔥 Hard

Answer

Answer: Instead of adding absolute positions to embeddings, relative PE incorporates distance between tokens into attention scores (e.g., Transformer-XL, T5). Better generalization to longer sequences, captures pairwise relationships directly.

Question 58

58 What are some alternative attention score functions? 📊 Medium

Answer

Answer: Dot product (Transformer), additive (Bahdanau), cosine similarity, L1/L2 distance. Linear attention (Katharopoulos et al.): replaces softmax with a feature map such as elu(x)+1, reducing complexity from O(n²) to O(n) in sequence length. Used in efficient Transformers.
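
A sketch of non-causal linear attention with the elu(x)+1 feature map, in plain Python. The key point is that the sums over keys (S and z below, names chosen for this example) are computed once and reused for every query, so the cost is linear in the number of tokens.

```python
import math


def feature_map(x):
    """elu(x) + 1 applied element-wise; always positive."""
    return [xi + 1.0 if xi > 0 else math.exp(xi) for xi in x]


def linear_attention(Q, K, V):
    """Attention with similarity phi(q) . phi(k), normalized over keys."""
    d, d_v = len(K[0]), len(V[0])
    # Key/value summaries, built in one pass over the sequence:
    # S = sum_j phi(k_j) v_j^T  and  z = sum_j phi(k_j)
    S = [[0.0] * d_v for _ in range(d)]
    z = [0.0] * d
    for k, v in zip(K, V):
        phi_k = feature_map(k)
        for a in range(d):
            z[a] += phi_k[a]
            for b in range(d_v):
                S[a][b] += phi_k[a] * v[b]
    outputs = []
    for q in Q:
        phi_q = feature_map(q)
        norm = sum(phi_q[a] * z[a] for a in range(d))
        outputs.append(
            [sum(phi_q[a] * S[a][b] for a in range(d)) / norm for b in range(d_v)]
        )
    return outputs


print(linear_attention([[0.5, -1.0]], [[1.0, 0.0], [-0.5, 2.0]], [[1.0, 0.0], [0.0, 1.0]]))
```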

Question 59

59 How is attention used in GNNs? (GAT) 🔥 Hard

Answer

Answer: Graph Attention Network (GAT) computes attention coefficients between connected nodes. Each node attends to its neighbors, learning importance weights. Multi-head GAT aggregates neighbor features. No spectral decomposition, inductive.

Question 60

60 What are alternatives to attention for long sequences? 🔥 Hard

Answer

Answer: State Space Models (S4, Mamba) offer linear-time sequence modeling using structured state spaces and, in Mamba, input-dependent (selective) mechanisms. They have been reported to beat Transformer baselines on some long-range tasks (e.g., Path-X, DNA modeling) with faster inference. A potential replacement for attention in some domains.
