Interview Q&A: 60 Questions
RNN, Transformers & Attention — Interview Q&A
Recurrent networks, LSTM, transformer architecture, and attention mechanisms.
RNN & LSTM: 20 Interview Questions
1
What is a Recurrent Neural Network (RNN)? Typical applications?
⚡ Easy
Answer: RNNs process sequential data by maintaining a hidden state that captures information from previous time steps. They share weights across time. Applications: NLP (language modeling, translation), time series forecasting, speech recognition, video analysis.
h_t = tanh(W_hh·h_{t-1} + W_xh·x_t + b_h) ; y_t = W_hy·h_t + b_y
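A minimal sketch of the recurrence above in plain Python, assuming tiny hand-picked weights (the sizes and values are illustrative, not from any trained model). The same matrices are reused at every time step.

```python
import math

def matvec(W, v):
    """Multiply matrix W (list of rows) by vector v."""
    return [sum(w * a for w, a in zip(row, v)) for row in W]

def rnn_step(x_t, h_prev, W_xh, W_hh, W_hy, b_h, b_y):
    """h_t = tanh(W_hh h_{t-1} + W_xh x_t + b_h); y_t = W_hy h_t + b_y."""
    pre = [a + b + c for a, b, c in zip(matvec(W_hh, h_prev), matvec(W_xh, x_t), b_h)]
    h_t = [math.tanh(p) for p in pre]
    y_t = [a + b for a, b in zip(matvec(W_hy, h_t), b_y)]
    return h_t, y_t

# Toy sizes (assumed): input dim 2, hidden dim 2, output dim 1.
W_xh = [[0.5, -0.3], [0.1, 0.2]]
W_hh = [[0.4, 0.0], [0.0, 0.4]]
W_hy = [[1.0, -1.0]]
b_h, b_y = [0.0, 0.0], [0.0]

h = [0.0, 0.0]  # initial hidden state
outputs = []
for x in [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]:
    h, y = rnn_step(x, h, W_xh, W_hh, W_hy, b_h, b_y)  # same weights every step
    outputs.append(y[0])
```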
2
Why do vanilla RNNs suffer from vanishing/exploding gradient?
📊 Medium
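Answer: During BPTT, the gradient is multiplied at every time step by the recurrent Jacobian, which contains the same weight matrix W_hh (times the tanh derivative, which is at most 1). If the largest eigenvalue magnitude (spectral radius) of W_hh is below 1, gradients shrink exponentially and vanish; if it is above 1, they can grow exponentially and explode. Either way, learning long-range dependencies becomes difficult.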
LSTM/GRU mitigate via gating
Vanilla RNN fails for long sequences
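The effect is easiest to see in the scalar case. The sketch below assumes a linear recurrence h_t = w * h_{t-1} (no nonlinearity, a simplification), where the gradient of h_T with respect to h_0 is just w multiplied by itself T times.

```python
def gradient_scale(w, steps):
    """|d h_T / d h_0| for the scalar linear recurrence h_t = w * h_{t-1}."""
    g = 1.0
    for _ in range(steps):
        g *= w  # chain rule contributes one factor of w per time step
    return abs(g)

vanishing = gradient_scale(0.9, 100)  # shrinks toward 0
exploding = gradient_scale(1.1, 100)  # grows without bound
```

With |w| slightly below 1 the gradient after 100 steps is tiny; slightly above 1 it is huge.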
3
Explain LSTM architecture. How does it solve vanishing gradient?
🔥 Hard
Answer: LSTM introduces cell state (C_t) as a memory highway with additive updates. Three gates: forget (f), input (i), output (o). Gradient flows through cell state with constant error carousel (CEC) – addition rather than multiplication, preserving gradient.
f_t = σ(W_f·[h_{t-1}, x_t] + b_f)
i_t = σ(W_i·[h_{t-1}, x_t] + b_i)
C̃_t = tanh(W_c·[h_{t-1}, x_t] + b_c)
C_t = f_t * C_{t-1} + i_t * C̃_t
o_t = σ(W_o·[h_{t-1}, x_t] + b_o); h_t = o_t * tanh(C_t)
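The gate equations can be sketched as a single step function. This is an illustrative plain-Python version; the `params` dictionary layout and the toy values are assumptions made for the example.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def affine(W, b, v):
    """Compute W v + b for a matrix given as a list of rows."""
    return [sum(w * a for w, a in zip(row, v)) + bi for row, bi in zip(W, b)]

def lstm_step(x_t, h_prev, c_prev, params):
    z = h_prev + x_t  # list concatenation: [h_{t-1}, x_t]
    f = [sigmoid(a) for a in affine(params["W_f"], params["b_f"], z)]          # forget gate
    i = [sigmoid(a) for a in affine(params["W_i"], params["b_i"], z)]          # input gate
    c_tilde = [math.tanh(a) for a in affine(params["W_c"], params["b_c"], z)]  # candidate
    o = [sigmoid(a) for a in affine(params["W_o"], params["b_o"], z)]          # output gate
    c_t = [fj * cj + ij * gj for fj, cj, ij, gj in zip(f, c_prev, i, c_tilde)]  # additive update
    h_t = [oj * math.tanh(cj) for oj, cj in zip(o, c_t)]
    return h_t, c_t

# Toy setup (assumed): hidden size 1, input size 1, forget gate pinned open, input gate shut.
params = {
    "W_f": [[0.0, 0.0]], "b_f": [20.0],
    "W_i": [[0.0, 0.0]], "b_i": [-20.0],
    "W_c": [[0.0, 0.0]], "b_c": [0.0],
    "W_o": [[0.0, 0.0]], "b_o": [0.0],
}
h_t, c_t = lstm_step([1.0], [0.0], [0.7], params)
```

With f_t close to 1 and i_t close to 0, the cell state passes through almost unchanged, which is the "memory highway" behavior described in the answer.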
4
What is the purpose of each LSTM gate?
📊 Medium
Answer:
- Forget gate: decides what to discard from cell state.
- Input gate: decides what new info to store.
- Output gate: decides what to output based on cell state.
5
How is GRU different from LSTM? When to prefer GRU?
📊 Medium
Answer: GRU has 2 gates (update, reset), no separate cell state; merges forget/input gates. Simpler, fewer parameters, less prone to overfitting on small data. LSTM is more expressive; GRU often matches performance with less compute.
LSTM: 3 gates + cell state; GRU: 2 gates, hidden state only.
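A rough parameter count makes the size difference concrete. The sketch assumes one weight matrix over [h_{t-1}, x_t] plus one bias per block, with 4 blocks for LSTM (forget, input, candidate, output) and 3 for GRU (update, reset, candidate); real libraries may count biases differently.

```python
def gated_rnn_params(input_size, hidden_size, num_blocks):
    """Parameters for num_blocks blocks, each with W over [h, x] and one bias vector."""
    per_block = hidden_size * (hidden_size + input_size) + hidden_size
    return num_blocks * per_block

# Illustrative sizes (assumed): 128-dim input, 256-dim hidden state.
lstm_params = gated_rnn_params(128, 256, num_blocks=4)
gru_params = gated_rnn_params(128, 256, num_blocks=3)
```

Under these assumptions a GRU layer has three quarters of the parameters of an LSTM layer of the same width.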
6
What is a bidirectional RNN? When to use?
📊 Medium
Answer: BRNN processes sequence forward and backward, concatenating hidden states. Captures context from both past and future. Used in NLP tasks (NER, POS tagging) where entire sequence is available. Not for real-time or streaming.
7
Explain Backpropagation Through Time (BPTT). What is truncated BPTT?
🔥 Hard
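Answer: BPTT unrolls the RNN over all time steps and backpropagates through the entire sequence, which is expensive in compute and memory. Truncated BPTT limits backpropagation to k steps: the gradient becomes an approximation (dependencies longer than k steps are ignored), but training on long sequences becomes efficient.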
8
Describe seq2seq model. Role of encoder and decoder?
📊 Medium
Answer: Seq2seq uses encoder RNN to compress input sequence into context vector (final hidden state). Decoder RNN generates output sequence from context. Used in machine translation, summarization.
9
Why was attention introduced? How does it help RNNs?
🔥 Hard
Answer: Seq2seq with single context vector fails for long sentences (bottleneck). Attention allows decoder to look at all encoder hidden states, weighted by relevance. Provides shortcut to gradient flow and interpretability.
e_{t,i} = score(h_t^dec, h_i^enc); α_{t,i} = softmax_i(e_{t,i}); c_t = Σ_i α_{t,i} h_i^enc
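A minimal sketch of one decoder step of attention, assuming a plain dot product as the score function (other score functions are possible) and toy encoder states.

```python
import math

def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(h_dec, enc_states):
    """Return attention weights over encoder states and the context vector."""
    scores = [sum(a * b for a, b in zip(h_dec, h_enc)) for h_enc in enc_states]
    alphas = softmax(scores)
    dim = len(enc_states[0])
    context = [sum(al * h[d] for al, h in zip(alphas, enc_states)) for d in range(dim)]
    return alphas, context

enc_states = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]  # toy encoder hidden states
alphas, context = attend([0.0, 2.0], enc_states)    # decoder state aligned with state 1
```

The weights sum to 1, and the encoder state most similar to the decoder state receives the largest weight.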
10
What is peephole LSTM?
🔥 Hard
Answer: Peephole connections allow gates to see the cell state (C_{t-1}) in addition to h_{t-1} and x_t. Provides finer temporal control, but not always beneficial.
11
What are stacked RNNs? Benefits?
📊 Medium
Answer: Multiple RNN layers where hidden state of one layer is input to next. Increases capacity, learns hierarchical representations. Higher layers capture longer-term abstractions.
12
Why do RNNs share weights across time?
📊 Medium
Answer: Weight sharing enables generalization across sequence lengths and positions. Model learns transition function independent of time step. Reduces parameters dramatically.
13
How to handle exploding gradient in RNNs?
📊 Medium
Answer: Gradient clipping: rescale gradient if norm exceeds threshold. Also weight regularization, careful initialization (e.g., identity matrix for recurrent weights).
if grad_norm > threshold: grad = grad * (threshold / grad_norm)
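The rule above, written out as a small function. This sketch treats the gradient as one flat list of floats (an assumption; frameworks apply the same idea across all parameter tensors).

```python
import math

def clip_by_global_norm(grads, threshold):
    """Rescale grads so their L2 norm is at most threshold; direction is unchanged."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm > threshold:
        scale = threshold / norm
        return [g * scale for g in grads], norm
    return list(grads), norm

clipped, norm_before = clip_by_global_norm([3.0, 4.0], threshold=1.0)
```

Here the original norm is 5, so every component is scaled by 1/5.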
14
How do LSTM/GRU mitigate vanishing gradient specifically?
🔥 Hard
Answer: LSTM's cell state has additive (not multiplicative) gradient flow. Forget gate can be close to 1, preserving gradient. GRU similarly uses additive update via update gate. Both create gradient highways.
15
How do RNNs handle variable-length sequences?
⚡ Easy
Answer: RNNs process tokens one by one; hidden state adapts. For batching, we pad sequences to same length and use masking to ignore padding.
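Padding and masking can be sketched as follows. The helper names and the choice of 0 as the pad id are assumptions for illustration.

```python
def pad_batch(sequences, pad_id=0):
    """Right-pad sequences to a common length; mask is 1 for real tokens, 0 for padding."""
    max_len = max(len(s) for s in sequences)
    padded = [s + [pad_id] * (max_len - len(s)) for s in sequences]
    mask = [[1] * len(s) + [0] * (max_len - len(s)) for s in sequences]
    return padded, mask

def masked_mean(values, mask_row):
    """Average only the positions where the mask is 1."""
    kept = [v for v, m in zip(values, mask_row) if m]
    return sum(kept) / len(kept)

padded, mask = pad_batch([[5, 6, 7], [8, 9]])
mean_second = masked_mean(padded[1], mask[1])  # ignores the padded position
```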
16
What is teacher forcing in RNN training?
📊 Medium
Answer: During decoder training, use ground truth previous output instead of model's own prediction. Speeds convergence, but creates exposure bias. Scheduled sampling gradually shifts to self-generated.
17
Compare RNNs and Transformers for sequence modeling.
🔥 Hard
Answer: RNNs sequential (O(n) steps), Transformers parallel (self-attention O(n²)). Transformers better long-range, but need position encoding. RNNs lighter for short sequences, lower memory.
18
When would you still choose RNN/LSTM today?
📊 Medium
Answer: Low-latency streaming tasks (speech recognition), small datasets, mobile/edge devices (lightweight), or when interpretability of hidden states is useful.
19
What is beam search in RNN decoders?
🔥 Hard
Answer: Instead of greedy decoding, beam search keeps k most probable sequences at each step. Finds higher-likelihood outputs at cost of computation.
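A small sketch of beam search over a hand-made next-token table (the table and token names are hypothetical). It is built so that greedy decoding, which is beam search with width 1, picks a locally best first token and ends with a lower-probability sequence.

```python
import math

def beam_search(step_log_probs, start, steps, beam_width):
    """Keep the beam_width highest-scoring partial sequences at every step."""
    beams = [([start], 0.0)]  # (tokens, cumulative log-probability)
    for _ in range(steps):
        candidates = []
        for tokens, score in beams:
            for tok, lp in step_log_probs(tokens).items():
                candidates.append((tokens + [tok], score + lp))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_width]
    return beams

# Hypothetical next-token probabilities, conditioned only on the last token.
TABLE = {
    "<s>": {"A": 0.6, "B": 0.4},
    "A": {"x": 0.5, "y": 0.5},
    "B": {"x": 0.9, "y": 0.1},
}

def toy_model(tokens):
    return {tok: math.log(p) for tok, p in TABLE[tokens[-1]].items()}

greedy = beam_search(toy_model, "<s>", steps=2, beam_width=1)[0]
best = beam_search(toy_model, "<s>", steps=2, beam_width=2)[0]
```

Greedy commits to "A" (0.6) and finishes with probability 0.30, while a beam of width 2 keeps "B" alive and finds "B x" with probability 0.36.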
20
Why initialize LSTM forget gate bias to 1?
🔥 Hard
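Answer: Setting the forget gate bias to 1 (or another large positive value) at initialization keeps the forget gate open early in training, so the cell state and its gradient are preserved across time steps. It is a common practice: Keras does it by default (unit_forget_bias=True), while PyTorch's nn.LSTM does not, so there it has to be set manually.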
Transformers: 20 Interview Questions
21
What is the Transformer architecture? How is it different from RNN?
⚡ Easy
Answer: Transformer is an attention-based architecture without recurrence or convolution. It processes all tokens in parallel via self-attention and feed-forward layers. Unlike RNNs, it has no sequential dependency, allowing massive parallelization and better long-range dependency capture.
Attention(Q,K,V) = softmax(QK^T/√d_k) V
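The formula can be sketched in plain Python with matrices as lists of rows. This is a readable reference version under toy inputs, not an efficient implementation.

```python
import math

def softmax(row):
    m = max(row)
    exps = [math.exp(s - m) for s in row]
    total = sum(exps)
    return [e / total for e in exps]

def scaled_dot_product_attention(Q, K, V):
    """Return (outputs, weights) for softmax(Q K^T / sqrt(d_k)) V."""
    d_k = len(K[0])
    outputs, weights = [], []
    for q in Q:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        w = softmax(scores)  # one weight per key
        weights.append(w)
        outputs.append([sum(wj * v[c] for wj, v in zip(w, V)) for c in range(len(V[0]))])
    return outputs, weights

# A zero query scores every key equally, so the output is the mean of the values.
out, attn = scaled_dot_product_attention(
    Q=[[0.0, 0.0]], K=[[1.0, 0.0], [0.0, 1.0]], V=[[1.0, 2.0], [3.0, 4.0]]
)
```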
22
How does self-attention work? Intuition.
📊 Medium
Answer: Each token attends to all tokens (including itself) to compute a weighted sum of values. Weights are derived from scaled dot-product of query and key. This captures contextual relationships irrespective of distance.
Intuition: "Retrieve relevant values based on similarity between query and key."
23
Why multi-head attention? Benefits?
🔥 Hard
Answer: Multi-head runs multiple attention heads in parallel, each with different learned projections. Allows model to focus on different subspaces (e.g., syntactic, semantic roles). Outputs are concatenated and projected. Increases representational power.
MultiHead(Q,K,V) = Concat(head_1,...,head_h)W^O
where head_i = Attention(QW_i^Q, KW_i^K, VW_i^V)
24
Why does Transformer need positional encoding? Describe sine/cosine version.
📊 Medium
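Answer: Self-attention has no notion of order: permuting the input tokens simply permutes the outputs (it is permutation-equivariant). Positional encodings (PE) inject sequence order. Sine/cosine functions at different frequencies: PE(pos, 2i) = sin(pos/10000^{2i/d}); PE(pos, 2i+1) = cos(pos/10000^{2i/d}). For a fixed offset k, PE(pos+k) is a linear function of PE(pos), which makes relative positions easy to learn.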
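A short sketch of the sinusoidal table. The loop variable `i` below is the even dimension index, so it plays the role of 2i in the formula.

```python
import math

def sinusoidal_encoding(num_positions, d_model):
    """Build a num_positions x d_model table of sine/cosine positional encodings."""
    pe = [[0.0] * d_model for _ in range(num_positions)]
    for pos in range(num_positions):
        for i in range(0, d_model, 2):  # i is the even index (2i in the formula)
            angle = pos / (10000 ** (i / d_model))
            pe[pos][i] = math.sin(angle)
            if i + 1 < d_model:
                pe[pos][i + 1] = math.cos(angle)
    return pe

pe = sinusoidal_encoding(num_positions=3, d_model=4)
```

Position 0 encodes as alternating 0 and 1 (sin 0 and cos 0); later dimensions oscillate more slowly.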
25
Why scale dot-product by 1/√d_k in attention?
🔥 Hard
Answer: For large d_k, dot products grow in magnitude, pushing softmax into regions of extremely small gradients. Scaling by 1/√d_k counteracts variance growth, keeping gradients stable.
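The variance argument can be checked empirically. The sketch assumes query and key components are independent standard normal draws, in which case the dot product has variance close to d_k and the scaled version has variance close to 1.

```python
import random
import statistics

def dot_product_variance(d_k, trials=2000, seed=0):
    """Estimate the variance of q.k before and after dividing by sqrt(d_k)."""
    rng = random.Random(seed)
    dots = []
    for _ in range(trials):
        q = [rng.gauss(0.0, 1.0) for _ in range(d_k)]
        k = [rng.gauss(0.0, 1.0) for _ in range(d_k)]
        dots.append(sum(a * b for a, b in zip(q, k)))
    raw = statistics.pvariance(dots)
    scaled = statistics.pvariance([x / d_k ** 0.5 for x in dots])
    return raw, scaled

raw_var, scaled_var = dot_product_variance(d_k=64)
```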
26
Describe Transformer's encoder and decoder blocks.
📊 Medium
Answer: Encoder: self-attention + feed-forward + residual + layer norm. Decoder: masked self-attention (to prevent looking ahead), cross-attention over encoder output, then FFN. Both use residual connections.
27
What is masked self-attention in decoder?
📊 Medium
Answer: Prevents positions from attending to subsequent (future) positions. Achieved by setting attention logits to -∞ before softmax. Ensures autoregressive generation.
28
Why layer norm instead of batch norm in Transformers?
🔥 Hard
Answer: Layer norm is independent of batch size and works well with variable sequence lengths. Batch norm's statistics are unstable for varying T and small batch, common in NLP. Layer norm normalizes across features for each token.
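A minimal sketch of layer normalization for a single token vector. The learnable gain and bias are omitted here for brevity (an intentional simplification).

```python
import math

def layer_norm(token, eps=1e-5):
    """Normalize one token's features to zero mean and (near) unit variance."""
    mean = sum(token) / len(token)
    var = sum((x - mean) ** 2 for x in token) / len(token)
    return [(x - mean) / math.sqrt(var + eps) for x in token]

normed = layer_norm([1.0, 2.0, 3.0, 4.0])
```

The statistics come from the token's own features, so the result does not depend on batch size or on the other tokens in the sequence.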
29
How is BERT different from GPT? Objectives?
🔥 Hard
Answer: BERT is encoder-only, bidirectional; pretrained with masked LM (MLM) and next sentence prediction. GPT is decoder-only, unidirectional (causal LM); autoregressive. BERT excels at NLU; GPT at generation.
30
Explain pretraining and fine-tuning paradigm.
📊 Medium
Answer: Large-scale pretraining on unlabeled text (e.g., Wikipedia) learns general language representations. Fine-tuning adapts the pretrained weights to downstream tasks with labeled data, which is cheaper and needs far less labeled data than training from scratch.
31
How does Vision Transformer (ViT) work?
🔥 Hard
Answer: Split image into fixed-size patches, flatten and project linearly to embeddings, add positional embeddings, feed to standard Transformer encoder. Classification via [CLS] token. No convolutions.
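The patch-splitting step can be sketched on a tiny single-channel image. The 224x224 image with 16x16 patches used for the count is a common configuration, assumed here for illustration.

```python
def patchify(image, patch):
    """Split a 2D image (list of rows) into flattened, non-overlapping patch vectors."""
    height, width = len(image), len(image[0])
    assert height % patch == 0 and width % patch == 0
    patches = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            patches.append([image[r][c]
                            for r in range(top, top + patch)
                            for c in range(left, left + patch)])
    return patches

image = [[r * 4 + c for c in range(4)] for r in range(4)]  # toy 4x4 image
patches = patchify(image, patch=2)                         # 4 patches of 4 pixels each

num_patches_224 = (224 // 16) ** 2  # sequence length before adding [CLS]
```

Each flattened patch would then go through the linear projection and receive a positional embedding.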
32
What is the main computational bottleneck of Transformers?
📊 Medium
Answer: Self-attention has O(n² d) complexity in sequence length n. Long sequences (e.g., documents, video) are expensive. Solutions: sparse attention, Linformer, Reformer, Longformer.
Parallel, global receptive field
O(n²) memory/compute
33
Differentiate self-attention and cross-attention.
⚡ Easy
Answer: Self-attention: Q,K,V from same sequence. Cross-attention: Q from one sequence (e.g., decoder), K,V from another (encoder output). Used to align different modalities/languages.
34
Why are residual connections critical in Transformers?
📊 Medium
Answer: Enable gradient flow through deep stacks (12+ layers). Mitigate vanishing gradient. Also preserve original token information through attention modifications.
35
Why use learning rate warmup for Transformers?
🔥 Hard
Answer: Large, noisy updates at the start can destabilize training. Warmup (linear increase from 0) stabilizes optimization, especially for Adam, whose adaptive second-moment estimates are unreliable during the first steps. Common in Transformer training.
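One possible warmup schedule, sketched under the assumption of a linear ramp followed by inverse-square-root decay. The peak rate and step counts are illustrative, and other decay shapes are equally common.

```python
def lr_at(step, peak_lr, warmup_steps):
    """Linear warmup from 0 to peak_lr, then decay proportional to 1/sqrt(step)."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    return peak_lr * (warmup_steps / step) ** 0.5

schedule = [lr_at(s, peak_lr=1e-3, warmup_steps=4000) for s in (0, 2000, 4000, 16000)]
```

The rate starts at 0, reaches its peak at the end of warmup, and has halved by four times the warmup length.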
36
Compare Transformers and CNNs for vision tasks.
📊 Medium
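Answer: Transformers have a global receptive field from the first layer; CNNs are local and carry strong inductive biases (locality, translation equivariance). ViT needs more data; hybrids such as CvT combine convolution with attention, while ConvNeXt is a pure CNN that adopts Transformer-era design choices.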
37
Do Transformers share weights like RNNs?
📊 Medium
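Answer: Not across depth. Each layer has its own weights, whereas an RNN reuses the same weight matrices at every time step. Within a layer, however, the same attention and feed-forward weights are applied at every position, so weights are shared across the sequence. Some variants (e.g., ALBERT, Universal Transformer) also share weights across layers.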
38
What are relative position encodings?
🔥 Hard
Answer: Instead of adding absolute position to embeddings, relative PE injects pairwise distance information into attention logits. Improves generalization for longer sequences. Used in Transformer-XL, T5.
39
How does GPT generate text?
📊 Medium
Answer: Causal LM: predicts next token given previous tokens. Uses masked self-attention. Decodes autoregressively (one token at a time). Can use greedy, sampling, beam search.
logits = model(input_ids); next_token = sample(softmax(logits[:, -1, :]))
40
What challenges arise when scaling Transformers to hundreds of billions of parameters?
🔥 Hard
Answer: Memory (activations, optimizer states), communication overhead, training instability, data quality. Solutions: model parallelism, pipeline parallelism, mixture-of-experts (MoE), activation checkpointing, fp16/bf16 mixed precision.
Attention Mechanism: 20 Interview Questions
41
What is the attention mechanism? Why was it introduced in deep learning?
⚡ Easy
Answer: Attention allows a model to dynamically weigh the importance of different input elements when producing output. Introduced initially in seq2seq models (Bahdanau et al., 2015) to overcome the bottleneck of a fixed-length context vector. It provides a shortcut for gradients and improves long-range dependency capture.
Context = Σ_i α_i · h_i where α_i = softmax_i(score(h_dec, h_i))
42
Compare Bahdanau (additive) and Luong (multiplicative) attention.
📊 Medium
Answer:
- Bahdanau: Uses a feedforward network to compute alignment score. Score = v_a^T tanh(W_1 h_dec + W_2 h_enc). Computes attention for each decoder step, more expressive but slower.
- Luong: Simpler dot-product (or general) score. Score = h_dec^T · h_enc (or h_dec^T W h_enc). Faster, often used with global/local attention.
Bahdanau: additive, more parameters, tanh. | Luong: multiplicative, simpler, dot or general.
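The score functions compared above, sketched side by side. The weight matrices and vectors are toy values chosen for the example.

```python
import math

def matvec(W, v):
    return [sum(w * a for w, a in zip(row, v)) for row in W]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def luong_dot(h_dec, h_enc):
    """Multiplicative score: h_dec^T h_enc."""
    return dot(h_dec, h_enc)

def luong_general(h_dec, h_enc, W):
    """Multiplicative score with a learned matrix: h_dec^T W h_enc."""
    return dot(h_dec, matvec(W, h_enc))

def bahdanau(h_dec, h_enc, W1, W2, v_a):
    """Additive score: v_a^T tanh(W1 h_dec + W2 h_enc)."""
    hidden = [math.tanh(a + b) for a, b in zip(matvec(W1, h_dec), matvec(W2, h_enc))]
    return dot(v_a, hidden)

identity = [[1.0, 0.0], [0.0, 1.0]]
h_dec, h_enc = [1.0, 0.0], [0.0, 1.0]
s_dot = luong_dot(h_dec, h_enc)
s_general = luong_general(h_dec, h_enc, identity)
s_additive = bahdanau(h_dec, h_enc, identity, identity, [1.0, 1.0])
```

With an identity matrix the general score reduces to the dot score; the additive score needs two matrices and a vector, which is where its extra parameters come from.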
43
Explain self-attention. How is it different from cross-attention?
🔥 Hard
Answer: Self-attention computes attention within the same sequence (Q, K, V from same source). Each token attends to all tokens in the sequence, including itself. Cross-attention: Q from one sequence (e.g., decoder), K,V from another (e.g., encoder). Core of Transformers.
44
What is scaled dot-product attention? Why scale by √d_k?
🔥 Hard
Answer: Attention(Q,K,V) = softmax(QK^T / √d_k) V. Without scaling, dot products grow large in magnitude as d_k increases, pushing softmax into regions of extremely small gradients. Dividing by √d_k prevents this and stabilizes training, especially for high-dimensional keys.
Attention(Q,K,V) = softmax( QK^T / √d_k ) V
45
What is multi-head attention? Why is it beneficial?
📊 Medium
Answer: Multi-head attention projects Q, K, V into h subspaces, applies attention in parallel, then concatenates and projects. Each head can focus on different relationships (e.g., syntax, semantics, long-distance). Increases model capacity without quadratic parameter blowup.
MultiHead(Q,K,V) = Concat(head_1,...,head_h)W^O
head_i = Attention(QW_i^Q, KW_i^K, VW_i^V)
46
Sketch the high-level Transformer architecture (encoder-decoder).
🔥 Hard
Answer:
- Encoder: stack of N identical layers. Each layer: Multi-Head Self-Attention + FeedForward, with Add&Norm (residual + LayerNorm).
- Decoder: stack of N layers. Each layer: Masked Multi-Head Self-Attention (causal) + Cross-Attention (Q from decoder, K,V from encoder) + FeedForward, with Add&Norm.
47
Why does Transformer need positional encoding? Describe sinusoidal encoding.
📊 Medium
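Answer: Self-attention has no inherent notion of order: permuting the input tokens simply permutes the outputs (it is permutation-equivariant). Positional encoding adds information about position. Sinusoidal: PE(pos, 2i) = sin(pos/10000^(2i/d)), PE(pos, 2i+1) = cos(pos/10000^(2i/d)). Makes relative positions easy to attend to and, being a fixed function, is defined for positions beyond the training length, although extrapolation quality in practice is not guaranteed.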
48
What is masked self-attention? Why is it used?
📊 Medium
Answer: In decoder, masked attention prevents positions from attending to future positions (causal). Achieved by setting attention scores to -∞ (or large negative) for illegal connections before softmax. Ensures autoregressive property: prediction at step t depends only on previous tokens.
[✓ -∞ -∞; ✓ ✓ -∞; ✓ ✓ ✓] (example causal mask)
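The mask in the example can be built and applied as follows. This sketch uses an additive mask (0 for allowed positions, -inf for future positions) on one row of toy scores.

```python
import math

def causal_mask(n):
    """Additive mask: 0.0 where key j <= query i, -inf for future positions."""
    return [[0.0 if j <= i else -math.inf for j in range(n)] for i in range(n)]

def masked_softmax(scores, mask_row):
    masked = [s + m for s, m in zip(scores, mask_row)]
    mx = max(masked)
    exps = [math.exp(x - mx) for x in masked]  # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]

mask = causal_mask(3)
row0 = masked_softmax([0.5, 2.0, 1.0], mask[0])  # first token sees only itself
row1 = masked_softmax([0.5, 2.0, 1.0], mask[1])  # second token sees tokens 0 and 1
```

Masked positions receive exactly zero weight, so the remaining weights still sum to 1.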
49
Why is attention (Transformer) more parallelizable than RNNs?
📊 Medium
Answer: RNNs process tokens sequentially (O(n) steps). Self-attention computes all pairwise interactions in O(1) sequential steps (parallel across sequence). However, compute cost is O(n² d) vs RNN O(n d²). Trade-off: parallelization vs quadratic complexity.
50
What is the time and memory complexity of vanilla self-attention?
🔥 Hard
Answer: Time complexity: O(n² · d) where n is sequence length, d is dimension. Memory complexity: O(n²) for attention matrix. This limits long sequences. Solutions: sparse attention (Longformer), linear attention (Performer), sliding window.
Global receptive field
Quadratic cost, not O(n)
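A quick back-of-the-envelope calculation for the attention matrix alone. The head count and fp16 storage (2 bytes per element) are assumptions for the example, and the figure is per layer and per sequence.

```python
def attention_matrix_bytes(seq_len, num_heads, bytes_per_element=2):
    """Memory for the n x n attention matrices of one layer, one sequence."""
    return seq_len * seq_len * num_heads * bytes_per_element

mem_4k = attention_matrix_bytes(4096, num_heads=16)
mem_8k = attention_matrix_bytes(8192, num_heads=16)
mib_4k = mem_4k / 2 ** 20  # convert bytes to MiB
```

Under these assumptions a 4096-token sequence needs 512 MiB for the attention matrices, and doubling the length quadruples that.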
51
How is BERT's attention different from Transformer decoder?
🔥 Hard
Answer: BERT uses encoder-only Transformer with bidirectional self-attention (no masking). Transformer decoder uses causal (masked) self-attention + cross-attention. BERT is trained with MLM (masked language modeling), decoder with autoregressive LM.
52
How does Vision Transformer (ViT) apply attention to images?
🔥 Hard
Answer: ViT splits image into patches (e.g., 16x16), flattens and projects to embeddings, adds positional embeddings, and feeds to standard Transformer encoder. No convolutions. Competes with CNNs on image classification, scales well with data.
53
Where is cross-attention used besides encoder-decoder?
📊 Medium
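Answer: Multi-modal models such as Flamingo inject visual features into a language model through cross-attention layers. (CLIP, by contrast, aligns image and text with a contrastive objective over two separate encoders rather than cross-attention.) In object detection (DETR): cross-attention between object queries and image features. In Stable Diffusion: cross-attention between text embeddings and image latents.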
54
Name and describe sparse attention variants.
🔥 Hard
Answer:
- Sliding window (Longformer): each token attends to w neighbors.
- Dilated sliding window: like conv, larger receptive field.
- Global + sliding: special tokens (CLS) attend to all.
- Block sparse (BigBird): random + window + global.
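The sliding-window and global patterns above can be sketched as a boolean mask, where True means the pair may attend. The function name and the symmetric treatment of global tokens are assumptions for illustration.

```python
def sliding_window_mask(n, window, global_positions=()):
    """True where query i may attend to key j: |i - j| <= window, plus global tokens."""
    mask = [[abs(i - j) <= window for j in range(n)] for i in range(n)]
    for g in global_positions:
        for j in range(n):
            mask[g][j] = True  # the global token attends to every position
            mask[j][g] = True  # every position attends to the global token
    return mask

local_only = sliding_window_mask(6, window=1)
with_global = sliding_window_mask(6, window=1, global_positions=(0,))
allowed_local = sum(sum(row) for row in local_only)
allowed_global = sum(sum(row) for row in with_global)
```

The number of allowed pairs grows roughly as n times the window size instead of n squared.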
55
What problem does FlashAttention solve?
🔥 Hard
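Answer: Standard attention materializes the full O(n²) score matrix in GPU HBM, so memory traffic becomes the bottleneck. FlashAttention tiles the computation and computes exact attention block by block in on-chip SRAM, without ever writing the full matrix to HBM. Reported speedups are around 2-4x; memory grows linearly with sequence length, which enables longer context.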
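The numerical trick that makes block-wise computation exact is an online softmax: keep a running maximum, denominator, and weighted sum, and rescale them whenever a new block raises the maximum. The sketch below shows only that idea for a single query in plain Python; it is not the GPU kernel.

```python
import math

def attention_full(q, keys, values):
    """Reference: softmax over all scores at once."""
    scores = [sum(a * b for a, b in zip(q, k)) for k in keys]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [sum(e * v[c] for e, v in zip(exps, values)) / total for c in range(len(values[0]))]

def attention_blockwise(q, keys, values, block_size):
    """Same result, processing keys/values one block at a time."""
    running_max, denom = -math.inf, 0.0
    numer = [0.0] * len(values[0])
    for start in range(0, len(keys), block_size):
        ks, vs = keys[start:start + block_size], values[start:start + block_size]
        scores = [sum(a * b for a, b in zip(q, k)) for k in ks]
        new_max = max(running_max, max(scores))
        correction = math.exp(running_max - new_max)  # rescale earlier partial sums
        exps = [math.exp(s - new_max) for s in scores]
        denom = denom * correction + sum(exps)
        numer = [n * correction + sum(e * v[c] for e, v in zip(exps, vs))
                 for c, n in enumerate(numer)]
        running_max = new_max
    return [n / denom for n in numer]

q = [1.0, -0.5]
keys = [[0.2, 0.1], [1.5, -1.0], [-0.3, 0.8], [2.0, 0.5], [0.0, 0.0]]
values = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [-1.0, 3.0], [0.5, 0.5]]
full = attention_full(q, keys, values)
tiled = attention_blockwise(q, keys, values, block_size=2)
```

Only one block of scores exists at a time, yet the output matches the full computation up to floating-point error.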
56
Explain attention as a soft dictionary (key-value) retrieval.
📊 Medium
Answer: Query (Q) is what we're looking for. Keys (K) are indices. Values (V) are the actual content. Attention weights are similarity between Q and K; output is weighted sum of V. Differentiable, soft, end-to-end.
57
What is relative positional encoding? Why is it used?
🔥 Hard
Answer: Instead of adding absolute positions to embeddings, relative PE incorporates distance between tokens into attention scores (e.g., Transformer-XL, T5). Better generalization to longer sequences, captures pairwise relationships directly.
58
What are some alternative attention score functions?
📊 Medium
Answer: Dot product (transformer), additive (Bahdanau), cosine similarity, L1/L2 distance. Linear attention (Katharopoulos et al.): replace softmax with feature maps (elu(x)+1), reduces complexity from O(n²) to O(n). Used in efficient Transformers.
59
How is attention used in GNNs? (GAT)
🔥 Hard
Answer: Graph Attention Network (GAT) computes attention coefficients between connected nodes. Each node attends to its neighbors, learning importance weights. Multi-head GAT aggregates neighbor features. No spectral decomposition, inductive.
α_ij = softmax( LeakyReLU( a^T [W h_i || W h_j] ) )
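A sketch of the coefficient formula for one node, with the softmax taken over that node's neighbors. The graph, features, and the vector `a` are toy values chosen for the example.

```python
import math

def matvec(W, v):
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def leaky_relu(x, slope=0.2):
    return x if x > 0 else slope * x

def gat_attention(features, neighbors, W, a, i):
    """Attention coefficients alpha_ij of node i over its neighbors j."""
    Wh_i = matvec(W, features[i])
    logits = []
    for j in neighbors:
        concat = Wh_i + matvec(W, features[j])  # [W h_i || W h_j]
        logits.append(leaky_relu(sum(ak * ck for ak, ck in zip(a, concat))))
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return {j: e / total for j, e in zip(neighbors, exps)}

features = {0: [1.0, 0.0], 1: [1.0, 0.0], 2: [0.0, 1.0]}  # toy node features
W = [[1.0, 0.0], [0.0, 1.0]]                                # identity projection
a = [0.0, 0.0, 1.0, 0.0]                                    # scores the neighbor's first feature
alpha = gat_attention(features, neighbors=[1, 2], W=W, a=a, i=0)
```

Node 0 puts more weight on neighbor 1, whose projected feature aligns with `a`.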
60
What are alternatives to attention for long sequences?
🔥 Hard
Answer: State Space Models (S4, Mamba) offer linear-time sequence modeling using structured state spaces and, in Mamba, selective (input-dependent) mechanisms. They have reported strong results on long-range tasks (e.g., Path-X, DNA modeling) with faster inference. Potential replacement for attention in some domains.