
RNN & Attention — Interview Q&A

Recurrent neural networks and attention mechanisms for sequence modeling.

Recurrent Neural Networks: 15 Interview Questions

1. What is a recurrent neural network? [Easy]

Answer: A model with a hidden state h_t that is updated at each step: h_t = f(h_{t−1}, x_t). The same weights are applied across time, which suits sequence data.

2. Vanilla RNN update (simple form). [Easy]

Answer: Often h_t = tanh(W_h h_{t−1} + W_x x_t + b): an affine map combines the previous state with the current input, followed by a tanh nonlinearity.

h_t = tanh(W_h h_{t−1} + W_x x_t + b)
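The update above can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation: the function names, sizes, and random initialization are assumptions chosen for the example.

```python
import numpy as np

def rnn_step(h_prev, x_t, W_h, W_x, b):
    """One vanilla RNN update: h_t = tanh(W_h h_{t-1} + W_x x_t + b)."""
    return np.tanh(W_h @ h_prev + W_x @ x_t + b)

def rnn_forward(xs, h0, W_h, W_x, b):
    """Apply the same weights at every time step and collect all hidden states."""
    h = h0
    states = []
    for x_t in xs:
        h = rnn_step(h, x_t, W_h, W_x, b)
        states.append(h)
    return np.stack(states)

# Hypothetical sizes and weights, for illustration only.
rng = np.random.default_rng(0)
hidden_size, input_size, T = 4, 3, 5
W_h = rng.normal(scale=0.5, size=(hidden_size, hidden_size))
W_x = rng.normal(scale=0.5, size=(hidden_size, input_size))
b = np.zeros(hidden_size)
xs = rng.normal(size=(T, input_size))

states = rnn_forward(xs, np.zeros(hidden_size), W_h, W_x, b)
print(states.shape)  # (5, 4): one hidden state per time step
```

Because of the tanh, every entry of every hidden state lies in [−1, 1].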

3. What is backpropagation through time (BPTT)? [Medium]

Answer: Unroll the network over T steps into a DAG and run ordinary backprop on it; the gradient flows through every time link. Memory and compute grow linearly with T.

4. Truncated BPTT: why? [Medium]

Answer: Limit backprop in time to a fixed window of steps. This is cheaper and stabilizes training, at the cost of credit assignment beyond the window.

5. Why do vanilla RNNs struggle with long sequences? [Medium]

Answer: The gradient across T steps is a product of T per-step Jacobians, so it tends to vanish or explode, which makes long-range dependencies hard to learn.
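The effect of repeated Jacobian products can be checked numerically. The sketch below makes a simplifying assumption: a linear recurrence h_t = W h_{t−1} with W equal to a scalar times a random orthogonal matrix, so the norm of the T-step Jacobian is exactly that scalar raised to the power T. A real tanh RNN adds a derivative factor of at most 1 per step, which only shrinks the product further.

```python
import numpy as np

def jacobian_product_norm(scale, T, size=8, seed=0):
    """Spectral norm of d h_T / d h_0 for h_t = W h_{t-1}, with W = scale * Q (Q orthogonal)."""
    rng = np.random.default_rng(seed)
    Q, _ = np.linalg.qr(rng.normal(size=(size, size)))  # random orthogonal matrix
    W = scale * Q
    J = np.eye(size)
    for _ in range(T):
        J = W @ J  # chain rule: multiply by one more per-step Jacobian
    return np.linalg.norm(J, 2)

vanishing = jacobian_product_norm(scale=0.9, T=100)
exploding = jacobian_product_norm(scale=1.1, T=100)
print(f"scale 0.9 over 100 steps: {vanishing:.3e}")
print(f"scale 1.1 over 100 steps: {exploding:.3e}")
```

A per-step factor only 10% away from 1 changes the gradient norm by several orders of magnitude over 100 steps, in either direction.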

6. LSTM gates: names and roles. [Medium]

Answer: Forget (what to erase from the cell), input (what to write), output (what to expose from the cell). The cell state carries information additively, which gives gradients a better path.
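A single LSTM step makes the three gates and the additive cell update concrete. This is a sketch under assumptions: the stacked weight layout (forget, input, candidate, output) and the names are choices made for this example, and real libraries order and split the weights differently.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM step. W has shape (4*H, H+D), rows stacked as [forget, input, candidate, output]."""
    H = h_prev.shape[0]
    z = W @ np.concatenate([h_prev, x_t]) + b
    f = sigmoid(z[0:H])          # forget gate: what to erase from the cell
    i = sigmoid(z[H:2 * H])      # input gate: how much of the candidate to write
    g = np.tanh(z[2 * H:3 * H])  # candidate values to write
    o = sigmoid(z[3 * H:4 * H])  # output gate: what to expose from the cell
    c_t = f * c_prev + i * g     # additive cell update
    h_t = o * np.tanh(c_t)
    return h_t, c_t

# Hypothetical sizes and weights, for illustration only.
rng = np.random.default_rng(0)
D, H = 3, 4
W = rng.normal(scale=0.5, size=(4 * H, H + D))
b = np.zeros(4 * H)
x_t = rng.normal(size=D)
h_prev = np.zeros(H)
c_prev = rng.normal(size=H)

h_t, c_t = lstm_step(x_t, h_prev, c_prev, W, b)
print(h_t.shape, c_t.shape)  # (4,) (4,)
```

When the forget gate is near 1 and the input gate near 0, c_t is almost a copy of c_prev, which is the path that lets gradients survive many steps.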

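7. GRU vs LSTM: interview contrast. [Easy]

Answer: GRU merges the forget and input gates into a single update gate and has no separate cell state, so it has fewer parameters and often reaches similar quality with less compute. LSTM is still the more common choice, largely for historical reasons.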
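The parameter difference is easy to quantify. The count below assumes one weight matrix over [h; x] and one bias vector per block; an LSTM has four such blocks (three gates plus the candidate) and a GRU has three (update, reset, candidate). Some libraries keep two bias vectors per block, which changes the totals slightly but not the ratio.

```python
def gated_rnn_params(input_size, hidden_size, num_blocks):
    """Parameters of a gated RNN layer: num_blocks x (weights over [h; x] + one bias vector)."""
    per_block = hidden_size * (input_size + hidden_size) + hidden_size
    return num_blocks * per_block

# Hypothetical layer sizes, for illustration only.
lstm_params = gated_rnn_params(input_size=128, hidden_size=256, num_blocks=4)
gru_params = gated_rnn_params(input_size=128, hidden_size=256, num_blocks=3)
print(lstm_params, gru_params, gru_params / lstm_params)  # 394240 295680 0.75
```

Under these assumptions a GRU layer has three quarters of the parameters of an LSTM layer of the same size.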

8. Bidirectional RNN. [Easy]

Answer: Two RNNs, one reading forward and one backward, with their hidden states concatenated at each position. This uses future context, so it is good for tagging and other offline NLP tasks but not for causal online prediction.

9. Encoder-decoder (seq2seq) idea. [Medium]

Answer: An encoder RNN compresses the input sequence into a context vector, and a decoder RNN generates the output sequence from it. This was the basis of early NMT before attention dominated.

10. Teacher forcing. [Medium]

Answer: During training, the decoder receives the ground-truth previous token as input instead of its own prediction, which speeds convergence. The resulting mismatch between training and inference (exposure bias) can be reduced with techniques such as scheduled sampling.

11. Padding and pack_padded_sequence: why? [Hard]

Answer: Batched variable-length sequences are padded to a common length. In frameworks like PyTorch, packing avoids wasted compute on pad tokens and keeps the final hidden state from being updated by padding.
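The idea behind packing can be illustrated without PyTorch. The sketch below is not `pack_padded_sequence` itself; it is a NumPy stand-in, with hypothetical helper names, that pads a batch and then freezes each row's hidden state once its real tokens run out, so padding never changes the final state.

```python
import numpy as np

def pad_batch(seqs, pad_value=0.0):
    """Pad a list of (length_i, D) arrays to shape (B, T_max, D); also return the true lengths."""
    lengths = np.array([len(s) for s in seqs])
    batch = np.full((len(seqs), lengths.max(), seqs[0].shape[1]), pad_value)
    for i, s in enumerate(seqs):
        batch[i, :len(s)] = s
    return batch, lengths

def last_valid_state(batch, lengths, W_h, W_x, b):
    """Run a vanilla RNN over the batch, keeping each row's state fixed past its true length."""
    h = np.zeros((batch.shape[0], W_h.shape[0]))
    for t in range(batch.shape[1]):
        h_new = np.tanh(h @ W_h.T + batch[:, t] @ W_x.T + b)
        still_active = (t < lengths)[:, None]  # rows that still have real tokens at step t
        h = np.where(still_active, h_new, h)
    return h

# Hypothetical data and weights, for illustration only.
rng = np.random.default_rng(0)
D, H = 3, 4
seqs = [rng.normal(size=(n, D)) for n in (5, 2, 3)]
W_h = rng.normal(scale=0.5, size=(H, H))
W_x = rng.normal(scale=0.5, size=(H, D))
b = rng.normal(size=H)  # non-zero bias, so unmasked pad steps would change the state

batch, lengths = pad_batch(seqs)
final = last_valid_state(batch, lengths, W_h, W_x, b)
print(batch.shape, lengths, final.shape)
```

Each row of `final` matches what the RNN would produce on that sequence alone. This version still spends compute on the pad steps; skipping that work is the other half of what packing provides.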

12. Many-to-one vs many-to-many examples. [Easy]

Answer: Many-to-one: sentiment from a sentence. Many-to-many: POS tagging per token; seq2seq: translation.

13. When do Transformers replace RNNs? [Medium]

Answer: When you have the data and compute for self-attention. It runs in parallel over the sequence length and connects any two positions in O(1) steps within a single layer, whereas RNNs are sequential and slower on GPUs for long sequences.

14. 1D CNN for sequences vs RNN. [Medium]

Answer: A 1D conv stack captures local n-grams and gains wider context with depth; it is fast and parallel. RNNs or attention are usually better for very long, flexible dependencies, depending on the design.

15. State one advantage of the RNN family today. [Easy]

Answer: Small memory per step, which suits streaming or tiny devices. Some tasks still use LSTM baselines, though LLMs are Transformer-first.

Draw the unrolled RNN for BPTT: it is a classic whiteboard question.

Attention Mechanism: 15 Interview Questions

16. Intuition: what does attention compute? [Easy]

Answer: A weighted sum of values, where the weights (normalized attention scores) say how much each source position matters for the current query. It is a soft lookup over a set of vectors.

17. What are Query, Key, and Value? [Easy]

Answer: Three linear projections of the inputs (or of cross-modal sources). The query asks "what I need"; keys label the slots; values carry the content that the attention weights mix.

18. Scaled dot-product attention formula. [Medium]

Answer: Attention(Q, K, V) = softmax(QKᵀ / √d_k) V. Dividing by √d_k keeps the dot products from growing too large, so the softmax doesn't saturate.

softmax(QK^T / √d_k) V
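The formula maps almost directly onto NumPy. The sketch below is a single-head, unbatched version with illustrative names and shapes: Q is (n_q, d_k), K is (n_k, d_k), and V is (n_k, d_v).

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax: subtract the row maximum before exponentiating."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V, returning the output and the attention weights."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)     # (n_q, n_k): one score per query/key pair
    weights = softmax(scores, axis=-1)  # each row sums to 1
    return weights @ V, weights         # output is (n_q, d_v)

# Hypothetical shapes, for illustration only.
rng = np.random.default_rng(0)
n_q, n_k, d_k, d_v = 3, 5, 8, 6
Q = rng.normal(size=(n_q, d_k))
K = rng.normal(size=(n_k, d_k))
V = rng.normal(size=(n_k, d_v))

out, weights = scaled_dot_product_attention(Q, K, V)
print(out.shape, weights.shape)  # (3, 6) (3, 5)
```

Each output row is a convex combination of the rows of V, which is the "weighted sum of values" in shape form.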

19. Self-attention vs cross-attention. [Medium]

Answer: Self: Q, K, V come from the same sequence (e.g. the encoder). Cross: Q comes from one sequence (the decoder) and K, V from another (the encoder output), so the decoder attends to the source.

20. Causal (look-ahead) mask in decoders. [Medium]

Answer: Set the attention logits for future positions to −∞ before the softmax, so that position t cannot attend to positions t+1, …, which preserves autoregressive generation.
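The masking step can be sketched with an upper-triangular boolean mask. This is a minimal single-head example with illustrative names; it assumes queries and keys come from the same length-n sequence.

```python
import numpy as np

def causal_attention_weights(Q, K):
    """Attention weights where position t can only attend to positions <= t."""
    n = Q.shape[0]
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    future = np.triu(np.ones((n, n), dtype=bool), k=1)    # True strictly above the diagonal
    scores = np.where(future, -np.inf, scores)            # block future positions
    scores = scores - scores.max(axis=-1, keepdims=True)  # stable softmax
    w = np.exp(scores)                                    # exp(-inf) = 0 for masked entries
    return w / w.sum(axis=-1, keepdims=True)

# Hypothetical shapes, for illustration only.
rng = np.random.default_rng(0)
n, d = 4, 8
Q = rng.normal(size=(n, d))
K = rng.normal(size=(n, d))

w = causal_attention_weights(Q, K)
print(np.round(w, 3))  # lower-triangular: zeros above the diagonal
```

The first row puts all of its weight on position 0, since that is the only position it is allowed to see.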

21. Multi-head attention: why multiple heads? [Medium]

Answer: Each head learns different subspaces of relationships in parallel; concatenating heads lets the model capture multiple dependency types (syntax, coreference, etc.).
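The split-attend-concatenate pattern can be sketched as follows. The output projection W_o is not mentioned in the answer above; it is included here as an assumption, following the usual formulation, and all names and sizes are illustrative.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_self_attention(X, W_q, W_k, W_v, W_o, num_heads):
    """Project, split into heads, attend per head in parallel, concatenate, project back."""
    n, d_model = X.shape
    d_head = d_model // num_heads

    def split(M):
        # (n, d_model) -> (num_heads, n, d_head)
        return M.reshape(n, num_heads, d_head).transpose(1, 0, 2)

    Q, K, V = split(X @ W_q), split(X @ W_k), split(X @ W_v)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)  # (num_heads, n, n)
    weights = softmax(scores, axis=-1)
    heads = weights @ V                                   # (num_heads, n, d_head)
    concat = heads.transpose(1, 0, 2).reshape(n, d_model)
    return concat @ W_o, weights

# Hypothetical sizes and weights, for illustration only.
rng = np.random.default_rng(0)
n, d_model, num_heads = 5, 8, 2
X = rng.normal(size=(n, d_model))
W_q, W_k, W_v, W_o = (rng.normal(scale=0.3, size=(d_model, d_model)) for _ in range(4))

out, weights = multi_head_self_attention(X, W_q, W_k, W_v, W_o, num_heads)
print(out.shape, weights.shape)  # (5, 8) (2, 5, 5)
```

Each head has its own (n, n) weight matrix, so different heads are free to focus on different positions for the same query.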

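22. Pre-LN vs Post-LN Transformer (brief). [Hard]

Answer: Post-LN: the original "Attention → Add & Norm" ordering, with the norm applied after the residual add. Pre-LN: the norm is applied before each sublayer, inside the residual branch, which is often more stable when training very deep stacks. Both are used in the literature.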
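The two orderings differ only in where the normalization sits relative to the residual add. The sketch below uses a simplified layer norm without the learnable gain and bias, and a toy linear map as a stand-in for attention or the MLP; both are assumptions made to keep the example short.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize each row to zero mean and unit variance (learnable gain and bias omitted)."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def post_ln(x, sublayer):
    """Post-LN: sublayer, residual add, then normalize the sum."""
    return layer_norm(x + sublayer(x))

def pre_ln(x, sublayer):
    """Pre-LN: normalize first, apply the sublayer, then add to the untouched residual."""
    return x + sublayer(layer_norm(x))

# Toy sublayer standing in for attention or the feed-forward network.
rng = np.random.default_rng(0)
W = rng.normal(scale=0.3, size=(8, 8))
sublayer = lambda z: z @ W

x = rng.normal(size=(4, 8))
post = post_ln(x, sublayer)
pre = pre_ln(x, sublayer)
print(post.shape, pre.shape)  # (4, 8) (4, 8)
```

In the Pre-LN form the residual stream x passes through unchanged, giving gradients an identity path through the block.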

23. Why positional encoding? [Easy]

Answer: Self-attention has no built-in notion of order: permuting the input tokens just permutes the outputs. Adding sinusoidal or learned positions lets the model distinguish "cat bites dog" from "dog bites cat."
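The sinusoidal variant can be generated in a few lines. This sketch follows the common sine/cosine formulation with base 10000 and assumes an even d_model; the function name is illustrative.

```python
import numpy as np

def sinusoidal_positions(num_positions, d_model):
    """Sinusoidal position table of shape (num_positions, d_model); d_model must be even."""
    positions = np.arange(num_positions)[:, None]     # (num_positions, 1)
    dims = np.arange(0, d_model, 2)[None, :]          # even dimension indices
    angles = positions / np.power(10000.0, dims / d_model)
    pe = np.zeros((num_positions, d_model))
    pe[:, 0::2] = np.sin(angles)                      # even dimensions
    pe[:, 1::2] = np.cos(angles)                      # odd dimensions
    return pe

pe = sinusoidal_positions(num_positions=50, d_model=16)
print(pe.shape)  # (50, 16)
```

The table is added to the token embeddings before the first layer, so identical tokens at different positions get different input vectors.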

24. Time complexity of self-attention in sequence length n. [Medium]

Answer: O(n² · d) for the attention matrix over all pairs of positions. Being quadratic in length is the main bottleneck for long contexts and motivates sparse and linear attention variants.
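A quick calculation shows how fast the n × n score matrix grows. The helper below is a back-of-the-envelope sketch that assumes one score matrix per head stored in 2-byte (fp16) values; actual memory use depends on the implementation.

```python
def attention_matrix_bytes(n, num_heads=1, bytes_per_value=2):
    """Bytes needed to store one n x n score matrix per head (2 bytes per value by default)."""
    return n * n * num_heads * bytes_per_value

for n in (1024, 2048, 4096):
    print(n, attention_matrix_bytes(n) / 2**20, "MiB")
```

Doubling the sequence length quadruples the size of the score matrix.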

25. Bahdanau (additive) vs Luong (dot) attention. [Hard]

Answer: In older seq2seq models, additive scores use a small MLP on [s_t; h_j], while multiplicative (dot) scores use direct similarity. Scaled dot-product attention is the modern member of the dot family, used at scale.
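The two scoring functions can be put side by side. In this sketch s_t is the decoder state and h_j an encoder state; the additive form is written as v · tanh(W [s_t; h_j]), and the names W and v are illustrative parameters rather than notation from the text.

```python
import numpy as np

def additive_score(s_t, h_j, W, v):
    """Additive (Bahdanau-style) score: a one-hidden-layer MLP on the concatenation [s_t; h_j]."""
    return v @ np.tanh(W @ np.concatenate([s_t, h_j]))

def dot_score(s_t, h_j):
    """Multiplicative (Luong-style dot) score: direct similarity, no extra parameters."""
    return s_t @ h_j

# Hypothetical sizes and parameters, for illustration only.
rng = np.random.default_rng(0)
d = 4
s_t = rng.normal(size=d)
hs = rng.normal(size=(5, d))  # five encoder states
W = rng.normal(size=(d, 2 * d))
v = rng.normal(size=d)

add_scores = np.array([additive_score(s_t, h, W, v) for h in hs])
dot_scores = np.array([dot_score(s_t, h) for h in hs])
print(add_scores.shape, dot_scores.shape)  # (5,) (5,)
```

Either set of scores is then passed through a softmax over the encoder positions to get the attention weights.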

26. What sits after attention in a Transformer block? [Easy]

Answer: A feed-forward network (MLP) applied independently at each position. It typically expands to 4d with GELU or ReLU and then projects back to d, with a residual connection and norm around each sublayer.
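A position-wise feed-forward sublayer is short enough to write out. The sketch uses ReLU and random weights as assumptions, and omits the normalization step to keep the focus on the expand-and-project pattern.

```python
import numpy as np

def feed_forward(x, W1, b1, W2, b2):
    """Position-wise MLP: expand d -> 4d, apply ReLU, project 4d -> d."""
    hidden = np.maximum(0.0, x @ W1 + b1)
    return hidden @ W2 + b2

# Hypothetical sizes and weights, for illustration only.
rng = np.random.default_rng(0)
n, d = 6, 8
W1, b1 = rng.normal(scale=0.1, size=(d, 4 * d)), np.zeros(4 * d)
W2, b2 = rng.normal(scale=0.1, size=(4 * d, d)), np.zeros(d)
x = rng.normal(size=(n, d))

y = x + feed_forward(x, W1, b1, W2, b2)  # residual connection; norm omitted here
print(y.shape)  # (6, 8)
```

Because the same weights are applied to every row, running the function on one position alone gives the same result as slicing that position out of the full output.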

27. Attention dropout: where? [Easy]

Answer: Dropout on the attention weights (after the softmax), or on the scores in some implementations. It regularizes attention patterns.

28. Vision Transformer: how is attention used? [Medium]

Answer: Split the image into patches, embed them as tokens, and run Transformer encoder self-attention. This gives global mixing of patch relationships without the convolutional inductive bias, which works well given enough data.
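The patch-splitting step is a pure reshape. The sketch below assumes a channels-last (H, W, C) image whose sides are divisible by the patch size; the learned linear embedding that would follow is left out.

```python
import numpy as np

def patchify(image, patch):
    """Split an (H, W, C) image into non-overlapping patches, one flattened patch per row."""
    H, W, C = image.shape
    assert H % patch == 0 and W % patch == 0, "image sides must be divisible by the patch size"
    x = image.reshape(H // patch, patch, W // patch, patch, C)
    x = x.transpose(0, 2, 1, 3, 4)  # (rows, cols, patch, patch, C)
    return x.reshape(-1, patch * patch * C)

# Hypothetical input size, for illustration only.
rng = np.random.default_rng(0)
image = rng.normal(size=(224, 224, 3))
tokens = patchify(image, patch=16)
print(tokens.shape)  # (196, 768): a 14 x 14 grid of patches, each 16 * 16 * 3 values
```

Each row of `tokens` then plays the role a word embedding plays in a text Transformer.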

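29. FlashAttention (interview one-liner). [Hard]

Answer: An IO-aware, exact attention implementation that fuses operations and tiles the computation into on-chip SRAM, so the full n × n matrix is never written to GPU main memory. The math is the same, but training on GPUs is faster for long sequences.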
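The "same math" claim rests on an online softmax that processes keys and values block by block. The NumPy sketch below illustrates only that rescaling trick; it is not FlashAttention itself, which is a fused GPU kernel that also tiles the queries. The function names and block size are assumptions for the example.

```python
import numpy as np

def naive_attention(Q, K, V):
    """Reference version: builds the full (n_q, n_k) score matrix."""
    s = Q @ K.T / np.sqrt(Q.shape[-1])
    p = np.exp(s - s.max(axis=-1, keepdims=True))
    return (p / p.sum(axis=-1, keepdims=True)) @ V

def tiled_attention(Q, K, V, block=4):
    """Exact attention computed one key/value block at a time with a running softmax."""
    n, d = Q.shape
    out = np.zeros((n, V.shape[-1]))
    row_max = np.full(n, -np.inf)  # running maximum score per query
    row_sum = np.zeros(n)          # running softmax denominator per query
    for start in range(0, K.shape[0], block):
        K_blk = K[start:start + block]
        V_blk = V[start:start + block]
        s = Q @ K_blk.T / np.sqrt(d)             # scores for this block only
        new_max = np.maximum(row_max, s.max(axis=-1))
        correction = np.exp(row_max - new_max)   # rescale earlier partial results
        p = np.exp(s - new_max[:, None])
        row_sum = row_sum * correction + p.sum(axis=-1)
        out = out * correction[:, None] + p @ V_blk
        row_max = new_max
    return out / row_sum[:, None]

# Hypothetical shapes, for illustration only.
rng = np.random.default_rng(0)
Q = rng.normal(size=(10, 8))
K = rng.normal(size=(10, 8))
V = rng.normal(size=(10, 6))
print(np.allclose(tiled_attention(Q, K, V), naive_attention(Q, K, V)))
```

Only one block of scores exists at any time, yet the result matches the full computation up to floating-point error.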

30. Encoder-only vs decoder-only vs encoder-decoder. [Medium]

Answer: Encoder-only (BERT): bidirectional context. Decoder-only (GPT): causal LM. Encoder-decoder (T5, original Transformer): the encoder sees the source, and the decoder generates the target with cross-attention.

Memorize softmax(QKᵀ / √d_k) V and be able to point to each matrix's shape.
