Interview Q&A: 30 Questions
RNN & Attention — Interview Q&A
Recurrent neural networks and attention mechanisms for sequence modeling.
Recurrent Neural Networks — 15 Interview Questions
1. What is a recurrent neural network? [Easy]
Answer: A model with a hidden state h_t updated each step: h_t = f(h_{t−1}, x_t)—same weights applied across time, suited to sequences.
2. Vanilla RNN update (simple form). [Easy]
Answer: Often h_t = tanh(W_h h_{t−1} + W_x x_t + b)—nonlinearity and affine combine previous state with current input.
h_t = tanh(W_h h_{t−1} + W_x x_t + b)
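A minimal NumPy sketch of this update, assuming toy sizes and random weights; the names rnn_step and rnn_forward are illustrative and not taken from any library. It shows the same W_h, W_x, and b being reused at every time step.

```python
import numpy as np

def rnn_step(h_prev, x_t, W_h, W_x, b):
    """One vanilla RNN update: h_t = tanh(W_h h_{t-1} + W_x x_t + b)."""
    return np.tanh(W_h @ h_prev + W_x @ x_t + b)

def rnn_forward(xs, h0, W_h, W_x, b):
    """Apply the same weights at every time step and collect all hidden states."""
    h, hs = h0, []
    for x_t in xs:
        h = rnn_step(h, x_t, W_h, W_x, b)
        hs.append(h)
    return np.stack(hs)

# Toy sizes chosen only for illustration.
rng = np.random.default_rng(0)
hidden, inp, T = 4, 3, 5
W_h = rng.normal(scale=0.5, size=(hidden, hidden))
W_x = rng.normal(scale=0.5, size=(hidden, inp))
b = np.zeros(hidden)
xs = rng.normal(size=(T, inp))

hs = rnn_forward(xs, np.zeros(hidden), W_h, W_x, b)
print(hs.shape)  # (5, 4): one hidden state per time step
```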
3. What is backpropagation through time (BPTT)? [Medium]
Answer: Unroll the network over T steps into a DAG, run backprop—gradient flows through every time link. Memory and compute grow with T.
4. Truncated BPTT: why? [Medium]
Answer: Limit backprop in time to a fixed window of steps. This is cheaper in memory and compute and often more stable, at the cost of long-range credit assignment.
5. Why do vanilla RNNs struggle with long sequences? [Medium]
Answer: Repeated Jacobian products over steps cause vanishing or exploding gradients—hard to learn long-range dependencies.
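The effect of repeated Jacobian products can be checked numerically. The sketch below is an illustration under simplifying assumptions: the input term is dropped, the recurrent matrix is a scaled orthogonal matrix so its gain is exact, and the expanding case uses a linear recurrence so tanh saturation does not mask the growth. The function name is hypothetical.

```python
import numpy as np

def jacobian_product_norm(W_h, T, use_tanh=True, seed=0):
    """Spectral norm of d h_T / d h_0 for h_t = act(W_h h_{t-1}) after T steps."""
    rng = np.random.default_rng(seed)
    n = W_h.shape[0]
    h = rng.normal(size=n)
    J = np.eye(n)
    for _ in range(T):
        pre = W_h @ h
        if use_tanh:
            h = np.tanh(pre)
            step_jac = np.diag(1.0 - h**2) @ W_h  # d h_t / d h_{t-1}
        else:
            h = pre
            step_jac = W_h
        J = step_jac @ J  # chain rule: multiply one Jacobian per time step
    return np.linalg.norm(J, 2)

rng = np.random.default_rng(0)
Q, _ = np.linalg.qr(rng.normal(size=(8, 8)))  # orthogonal: all singular values are 1

small = jacobian_product_norm(0.5 * Q, T=30)                  # contracting recurrence
large = jacobian_product_norm(1.5 * Q, T=30, use_tanh=False)  # linear, expanding recurrence
print(f"gain 0.5, tanh:   {small:.3e}")
print(f"gain 1.5, linear: {large:.3e}")
```

With gain 0.5 the norm is bounded by 0.5^30, and with gain 1.5 in the linear case it equals 1.5^30, which is the vanishing and exploding behavior in miniature.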
6. LSTM gates: names and roles. [Medium]
Answer: Forget (what to erase from cell), input (what to write), output (what to expose from cell). Cell state carries information additively—better gradient paths.
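One LSTM step can be sketched in NumPy as follows. This is a minimal illustration, assuming a single stacked weight matrix and an arbitrary gate ordering (forget, input, output, candidate); real frameworks lay out their weights differently.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM step. W has shape (4*H, H + D), b has shape (4*H,)."""
    H = h_prev.shape[0]
    z = W @ np.concatenate([h_prev, x_t]) + b
    f = sigmoid(z[0:H])            # forget gate: what to erase from the cell
    i = sigmoid(z[H:2 * H])        # input gate: what to write
    o = sigmoid(z[2 * H:3 * H])    # output gate: what to expose
    g = np.tanh(z[3 * H:4 * H])    # candidate values to write
    c_t = f * c_prev + i * g       # additive cell update
    h_t = o * np.tanh(c_t)
    return h_t, c_t

# Toy sizes chosen only for illustration.
rng = np.random.default_rng(0)
D, H = 3, 4
W = rng.normal(scale=0.1, size=(4 * H, H + D))
b = np.zeros(4 * H)
h, c = np.zeros(H), np.zeros(H)
for x_t in rng.normal(size=(5, D)):
    h, c = lstm_step(x_t, h, c, W, b)
print(h.shape, c.shape)  # (4,) (4,)
```

When the forget gate is near 1 and the input gate near 0, c_t is almost a copy of c_prev, which is the additive path that helps gradients survive many steps.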
7. GRU vs LSTM: interview contrast. [Easy]
Answer: GRU merges the forget and input gates into a single update gate and has fewer parameters, so it often reaches similar quality with less compute; LSTM remains the more common historical default.
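The parameter difference is simple arithmetic. The sketch below assumes one weight block per gate acting on the concatenated [h; x] vector with a single bias vector each (4 blocks for LSTM, 3 for GRU); some frameworks keep two bias vectors per block, which shifts the totals slightly. The helper name and the sizes are illustrative.

```python
def rnn_param_count(input_size, hidden_size, n_blocks):
    """Parameters for n_blocks gate blocks, each with weights on [h; x] plus one bias."""
    per_block = hidden_size * (hidden_size + input_size) + hidden_size
    return n_blocks * per_block

lstm_params = rnn_param_count(input_size=128, hidden_size=256, n_blocks=4)
gru_params = rnn_param_count(input_size=128, hidden_size=256, n_blocks=3)
print(lstm_params, gru_params)  # 394240 295680
```

Under these assumptions a GRU layer has three quarters of the parameters of an LSTM layer of the same width.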
8. Bidirectional RNN. [Easy]
Answer: Two RNNs: one forward, one backward; concatenate hidden states—uses future context; good for tagging/NLP, not for causal online prediction.
9. Encoder–decoder (seq2seq) idea. [Medium]
Answer: Encoder RNN compresses input sequence to context vector; decoder RNN generates output sequence—basis of early NMT before attention dominated.
10. Teacher forcing. [Medium]
Answer: During training, the decoder gets the ground-truth previous token as input instead of its own prediction. This speeds convergence but causes exposure bias (at inference the decoder sees its own predictions), which scheduled sampling and similar methods mitigate.
11. Padding and pack_padded_sequence: why? [Hard]
Answer: Batched variable-length sequences are padded; pack avoids wasted compute on pad tokens and keeps hidden state meaningful in frameworks like PyTorch.
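The "keeps hidden state meaningful" part can be illustrated in plain NumPy by freezing each sequence's state once its real tokens run out. This is only a sketch of the idea, not PyTorch's packing implementation: it still computes on pad positions, whereas packing also skips that work. The function name and shapes are assumptions.

```python
import numpy as np

def rnn_last_hidden(x_padded, lengths, W_h, W_x, b):
    """x_padded: (B, T, D) zero-padded; lengths: (B,) true lengths. Returns (B, H)."""
    B, T, _ = x_padded.shape
    h = np.zeros((B, W_h.shape[0]))
    for t in range(T):
        h_new = np.tanh(h @ W_h.T + x_padded[:, t] @ W_x.T + b)
        active = (t < lengths)[:, None]  # (B, 1): True while inside the real sequence
        h = np.where(active, h_new, h)   # freeze the state on pad steps
    return h

# Toy batch with three sequences of different lengths.
rng = np.random.default_rng(0)
B, T, D, H = 3, 6, 2, 4
lengths = np.array([6, 3, 1])
W_h = rng.normal(scale=0.5, size=(H, H))
W_x = rng.normal(scale=0.5, size=(H, D))
b = np.zeros(H)
x = rng.normal(size=(B, T, D))
for i, n in enumerate(lengths):
    x[i, n:] = 0.0  # pad positions

last = rnn_last_hidden(x, lengths, W_h, W_x, b)
print(last.shape)  # (3, 4)
```

Each row of the result matches what you would get by running that sequence alone without padding.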
12. Many-to-one vs many-to-many examples. [Easy]
Answer: Many-to-one: sentiment from a sentence. Many-to-many: POS tagging per token; seq2seq: translation.
13. When do Transformers replace RNNs? [Medium]
Answer: When you have the data and compute for self-attention. It parallelizes over sequence length and connects any two positions within a single layer (O(1) path length), whereas RNNs are sequential and slower on GPUs for long sequences.
14. 1D CNN for sequences vs RNN. [Medium]
Answer: A 1D conv stack captures local n-grams and widens its context with depth; it is fast and parallel. RNNs or attention tend to be better for very long or flexible dependencies, depending on the design.
15. State one advantage of RNN family today. [Easy]
Answer: A fixed-size state and small memory per step, which suits streaming or tiny devices; some tasks still use LSTM baselines, though LLMs are Transformer-first.
Draw unrolled RNN for BPTT—classic whiteboard question.
Attention Mechanism — 15 Interview Questions
16. Intuition: what does attention compute? [Easy]
Answer: A weighted sum of values, where weights (attention scores) say how much each source position matters for the current query—soft lookup over a set of vectors.
17. What are Query, Key, and Value? [Easy]
Answer: Three linear projections of inputs (or cross-modal sources). Query asks "what I need"; keys label slots; values carry content mixed by attention weights.
18. Scaled dot-product attention formula. [Medium]
Answer: Attention(Q,K,V) = softmax(QKᵀ / √d_k) V. Dividing by √d_k keeps the dot products from growing with d_k, so the softmax doesn't saturate.
softmax(QK^T / √d_k) V
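The formula maps almost line for line onto NumPy. A minimal single-head sketch without batching, with an optional boolean mask added as an assumption for convenience; the shapes are noted in the docstring.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract the row max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V, mask=None):
    """Q: (n_q, d_k), K: (n_k, d_k), V: (n_k, d_v). mask: (n_q, n_k) bool, True = may attend."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)            # (n_q, n_k)
    if mask is not None:
        scores = np.where(mask, scores, -np.inf)
    weights = softmax(scores, axis=-1)         # each row sums to 1
    return weights @ V, weights                # output: (n_q, d_v)

# Toy shapes chosen only for illustration.
rng = np.random.default_rng(0)
Q = rng.normal(size=(2, 4))
K = rng.normal(size=(3, 4))
V = rng.normal(size=(3, 5))
out, weights = scaled_dot_product_attention(Q, K, V)
print(out.shape, weights.shape)  # (2, 5) (2, 3)
```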
19. Self-attention vs cross-attention. [Medium]
Answer: Self: Q, K, V from same sequence (e.g. encoder). Cross: Q from one sequence (decoder), K,V from another (encoder output)—decoder attends to source.
20. Causal (look-ahead) mask in decoders. [Medium]
Answer: Set attention logits to −∞ for future positions before softmax—position t cannot attend to t+1,…—preserves autoregressive generation.
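A small NumPy sketch of the masking step, assuming single-head self-attention where queries and keys come from the same sequence; the function name is illustrative.

```python
import numpy as np

def causal_attention_weights(Q, K):
    """Attention weights where position t may attend only to positions <= t."""
    n = Q.shape[0]
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    future = np.triu(np.ones((n, n), dtype=bool), k=1)  # True strictly above the diagonal
    scores = np.where(future, -np.inf, scores)          # block future positions before softmax
    scores = scores - scores.max(axis=-1, keepdims=True)
    w = np.exp(scores)                                  # exp(-inf) = 0 for masked entries
    return w / w.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))
w = causal_attention_weights(X, X)
print(np.round(w, 2))  # lower-triangular matrix, each row sums to 1
```

The first row puts all of its weight on position 0, since that is the only position it is allowed to see.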
21. Multi-head attention: why multiple heads? [Medium]
Answer: Each head learns different subspaces of relationships in parallel; concatenating heads lets the model capture multiple dependency types (syntax, coreference, etc.).
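The split, attend, and concatenate pattern can be sketched in NumPy as below. This is a minimal illustration with random projection matrices, no biases, no masking, and no batching; the names W_q, W_k, W_v, and W_o are conventional but assumed here.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_self_attention(X, W_q, W_k, W_v, W_o, n_heads):
    """X: (n, d_model). Each W_*: (d_model, d_model). Returns (n, d_model) and per-head weights."""
    n, d_model = X.shape
    d_head = d_model // n_heads

    def split(M):  # (n, d_model) -> (n_heads, n, d_head)
        return M.reshape(n, n_heads, d_head).transpose(1, 0, 2)

    Q, K, V = split(X @ W_q), split(X @ W_k), split(X @ W_v)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)    # (n_heads, n, n)
    weights = softmax(scores, axis=-1)                      # one attention map per head
    heads = weights @ V                                     # (n_heads, n, d_head)
    concat = heads.transpose(1, 0, 2).reshape(n, d_model)   # concatenate heads per position
    return concat @ W_o, weights

# Toy sizes chosen only for illustration.
rng = np.random.default_rng(0)
n, d_model, n_heads = 6, 16, 4
X = rng.normal(size=(n, d_model))
W_q, W_k, W_v, W_o = (rng.normal(scale=0.1, size=(d_model, d_model)) for _ in range(4))
out, weights = multi_head_self_attention(X, W_q, W_k, W_v, W_o, n_heads)
print(out.shape, weights.shape)  # (6, 16) (4, 6, 6)
```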
22. Pre-LN vs Post-LN Transformer (brief). [Hard]
Answer: Post-LN: the original "Attention → Add & Norm". Pre-LN: norm before each sublayer, often more stable when training very deep stacks; both are used in the literature.
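The two orderings differ only in where the normalization sits relative to the residual connection. A minimal sketch, assuming a layer norm without learned gain and bias and an arbitrary callable standing in for the attention or feed-forward sublayer:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize each row to zero mean and unit variance (learned gain/bias omitted)."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def post_ln(x, sublayer):
    """Post-LN: sublayer, then add the residual, then normalize."""
    return layer_norm(x + sublayer(x))

def pre_ln(x, sublayer):
    """Pre-LN: normalize first, run the sublayer, then add the untouched residual."""
    return x + sublayer(layer_norm(x))

rng = np.random.default_rng(0)
x = rng.normal(size=(3, 8))
sublayer = lambda z: 0.5 * z  # stand-in for attention or an MLP

print(post_ln(x, sublayer).shape, pre_ln(x, sublayer).shape)  # (3, 8) (3, 8)
```

In the Pre-LN form the residual path carries x through unchanged, which is one common explanation for its easier optimization in deep stacks.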
23. Why positional encoding? [Easy]
Answer: Self-attention by itself has no notion of order (permuting the input tokens just permutes the outputs), so add sinusoidal or learned positions so that "cat bites dog" ≠ "dog bites cat."
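Both halves of this answer can be checked with a short NumPy sketch: a sinusoidal position table, and a comparison of attention outputs on a reversed sequence with and without positions. The unparameterized self_attention (Q = K = V = X) is a simplification chosen for the demonstration.

```python
import numpy as np

def sinusoidal_positions(n_pos, d_model):
    """Fixed sinusoidal position table of shape (n_pos, d_model); d_model must be even."""
    pos = np.arange(n_pos)[:, None]
    two_i = np.arange(0, d_model, 2)[None, :]
    angles = pos / np.power(10000.0, two_i / d_model)
    pe = np.zeros((n_pos, d_model))
    pe[:, 0::2] = np.sin(angles)  # even dimensions
    pe[:, 1::2] = np.cos(angles)  # odd dimensions
    return pe

def self_attention(X):
    """Unparameterized self-attention (Q = K = V = X), enough to show order effects."""
    scores = X @ X.T / np.sqrt(X.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)
    w = np.exp(scores)
    return (w / w.sum(axis=-1, keepdims=True)) @ X

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))
perm = np.arange(5)[::-1]  # reverse the token order
pe = sinusoidal_positions(5, 8)

# Without positions, reordering the tokens only reorders the outputs.
no_pos_same = np.allclose(self_attention(X[perm]), self_attention(X)[perm])
# With positions added after reordering, the outputs genuinely change.
with_pos_same = np.allclose(self_attention(X[perm] + pe), self_attention(X + pe)[perm])
print(no_pos_same, with_pos_same)
```

The first comparison should hold and the second generally should not, which is the order sensitivity that positional encodings provide.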
24. Time complexity of self-attention in sequence length n. [Medium]
Answer: O(n² · d) for attention matrix over pairs—quadratic in length is the main bottleneck for long contexts; motivates sparse/linear attention variants.
25. Bahdanau (additive) vs Luong (dot) attention. [Hard]
Answer: Older seq2seq: additive scores use a small MLP on [s_t; h_j]; multiplicative/dot uses direct similarity—scaled dot-product is the modern dot family at scale.
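The two scoring functions can be contrasted directly. A sketch under assumed shapes: s is the decoder state, H holds the encoder states row by row, and W_s, W_h, v are the additive scorer's parameters (splitting the MLP weight into W_s and W_h is equivalent to applying one matrix to the concatenation [s; h_j]).

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def additive_scores(s, H, W_s, W_h, v):
    """Additive (Bahdanau-style): score_j = v . tanh(W_s s + W_h h_j). Returns (n,)."""
    return np.tanh(s @ W_s.T + H @ W_h.T) @ v

def dot_scores(s, H):
    """Dot (Luong-style): score_j = s . h_j. Requires matching dimensions. Returns (n,)."""
    return H @ s

# Toy sizes chosen only for illustration.
rng = np.random.default_rng(0)
n, d, d_a = 5, 8, 6
s = rng.normal(size=d)        # current decoder state
H = rng.normal(size=(n, d))   # encoder states h_1..h_n
W_s = rng.normal(scale=0.3, size=(d_a, d))
W_h = rng.normal(scale=0.3, size=(d_a, d))
v = rng.normal(size=d_a)

w_add = softmax(additive_scores(s, H, W_s, W_h, v))
w_dot = softmax(dot_scores(s, H))
context = w_dot @ H           # weighted sum of encoder states
print(w_add.shape, w_dot.shape, context.shape)  # (5,) (5,) (8,)
```

The dot form has no extra parameters and reduces to a matrix product, which is part of why it scales well.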
26. What sits after attention in a Transformer block? [Easy]
Answer: Feed-forward network (MLP) applied per position—typically expand (4d) with GELU/ReLU then project back; residual + norm around each sublayer.
27. Attention dropout: where? [Easy]
Answer: Dropout on attention weights (after softmax) or on scores in some implementations—regularizes attention patterns.
28. Vision Transformer: how is attention used? [Medium]
Answer: Split image into patches, embed as tokens, run Transformer encoder self-attention—global mixing of patch relationships without conv inductive bias (with data scale).
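The patch-splitting step is a pure reshape. A minimal sketch, assuming a channels-last image whose height and width are divisible by the patch size; the learned linear embedding that would follow is omitted, and the 224×224 image with 16×16 patches is just a common example size.

```python
import numpy as np

def patchify(image, patch):
    """image: (H, W, C) -> tokens: (num_patches, patch * patch * C), in row-major patch order."""
    H, W, C = image.shape
    assert H % patch == 0 and W % patch == 0, "image size must be divisible by patch size"
    gh, gw = H // patch, W // patch
    x = image.reshape(gh, patch, gw, patch, C)
    x = x.transpose(0, 2, 1, 3, 4)                 # (gh, gw, patch, patch, C)
    return x.reshape(gh * gw, patch * patch * C)   # flatten each patch into one token

rng = np.random.default_rng(0)
image = rng.normal(size=(224, 224, 3))
tokens = patchify(image, patch=16)
print(tokens.shape)  # (196, 768)
```

Each row of tokens would then be linearly projected and fed to the Transformer encoder as one sequence element.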
29. FlashAttention (interview one-liner). [Hard]
Answer: IO-aware exact attention implementation that fuses ops and tiles to SRAM—same math, faster training on GPUs for long sequences.
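The "same math" claim rests on an online softmax that processes keys and values in tiles while keeping a running maximum and running denominator. The NumPy sketch below illustrates only that numerical idea; it is not the actual FlashAttention kernel, which is a fused GPU implementation that also tiles the queries. Function names and the block size are assumptions.

```python
import numpy as np

def reference_attention(Q, K, V):
    """Standard attention that materializes the full score matrix."""
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)
    w = np.exp(scores)
    return (w / w.sum(axis=-1, keepdims=True)) @ V

def tiled_attention(Q, K, V, block=4):
    """Same result, computed one key/value tile at a time with an online softmax."""
    n_q, d_k = Q.shape
    out = np.zeros((n_q, V.shape[1]))
    running_max = np.full((n_q, 1), -np.inf)
    running_sum = np.zeros((n_q, 1))
    for start in range(0, K.shape[0], block):
        K_blk, V_blk = K[start:start + block], V[start:start + block]
        scores = Q @ K_blk.T / np.sqrt(d_k)                  # only (n_q, block) at a time
        new_max = np.maximum(running_max, scores.max(axis=-1, keepdims=True))
        rescale = np.exp(running_max - new_max)              # correct the old accumulators
        p = np.exp(scores - new_max)
        running_sum = running_sum * rescale + p.sum(axis=-1, keepdims=True)
        out = out * rescale + p @ V_blk
        running_max = new_max
    return out / running_sum

rng = np.random.default_rng(0)
Q = rng.normal(size=(6, 8))
K = rng.normal(size=(10, 8))
V = rng.normal(size=(10, 5))
print(np.allclose(tiled_attention(Q, K, V), reference_attention(Q, K, V)))
```

The full n_q × n_k score matrix is never held at once, which is the property the real kernel exploits to stay in fast on-chip memory.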
30. Encoder-only vs decoder-only vs encoder–decoder. [Medium]
Answer: Encoder-only (BERT): bidirectional context. Decoder-only (GPT): causal LM. Enc–Dec (T5, original Transformer): encoder sees source, decoder generates target with cross-attention.
Memorize the scaled form softmax(QKᵀ / √d_k) V and be able to point to each matrix's shape.