Neural Networks: 15 Essential Q&A
Interview Prep

Vanishing & Exploding Gradients — 15 Interview Questions

Products of Jacobians through depth and time, sigmoid saturation, ResNets, LSTM gates, and gradient clipping.


1. What is the vanishing gradient problem? (Easy)
Answer: In backprop, the gradient reaching an early layer is a product of per-layer derivative terms. If many factors are < 1 (e.g. saturated sigmoid derivatives), early-layer gradients → 0 and those weights barely update.
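A minimal numeric sketch of that product picture, in plain Python, using the sigmoid's best-case slope of 0.25 as the per-layer factor:

import math

def sigmoid_deriv(x):
    s = 1.0 / (1.0 + math.exp(-x))
    return s * (1.0 - s)

# Backprop multiplies one local derivative per layer; even at the
# sigmoid's maximum slope (0.25 at x = 0) the product collapses fast.
grad = 1.0
for depth in range(1, 21):
    grad *= sigmoid_deriv(0.0)          # 0.25 per layer, the best case
    if depth in (5, 10, 20):
        print(f"depth {depth:2d}: gradient factor ~ {grad:.2e}")
# depth 20 is roughly 1e-12: the earliest layers see essentially no signal.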
2. What is the exploding gradient problem? (Easy)
Answer: Same product picture: repeated factors > 1 give huge gradients, unstable updates, and NaNs. Common in RNNs unrolled over long sequences.
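The mirror-image sketch: a recurrent factor slightly above 1 (a stand-in for a step Jacobian whose largest singular value exceeds 1) compounds just as quickly in the other direction.

# Each unrolled time step multiplies by the same factor; anything > 1 compounds.
factor = 1.5
grad = 1.0
for t in range(1, 101):
    grad *= factor
    if t in (10, 50, 100):
        print(f"step {t:3d}: gradient factor ~ {grad:.2e}")
# ~5.8e+01 at t=10, ~6.4e+08 at t=50, ~4.1e+17 at t=100 -- NaN territory
# once this multiplies real weight gradients.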
3. Why do sigmoid/tanh worsen vanishing? (Medium)
Answer: Derivatives are ≤ 0.25 (sigmoid) or small in saturation—each layer shrinks the backward signal when stacked deeply.
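A quick check of those derivative bounds in plain Python:

import math

def dsigmoid(x):
    s = 1.0 / (1.0 + math.exp(-x))
    return s * (1.0 - s)

def dtanh(x):
    return 1.0 - math.tanh(x) ** 2

for x in (0.0, 2.0, 5.0):
    print(f"x={x:>4}: sigmoid' = {dsigmoid(x):.4f}, tanh' = {dtanh(x):.4f}")
# x=0 gives the maxima (0.25 and 1.0); by x=5 sigmoid' is ~0.007 and
# tanh' is ~2e-4 -- the saturation regime that kills deep products.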
4. How does ReLU help (and one caveat)? (Medium)
Answer: The derivative is 1 for active neurons, so there is far less shrinkage than with sigmoid. Caveat: dead ReLUs (units whose pre-activation stays negative) pass zero gradient and stop learning.
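A tiny sketch of that caveat: once a unit's pre-activation goes (and stays) negative, ReLU's local derivative is 0 and nothing learns through that unit.

def relu_deriv(x):
    return 1.0 if x > 0.0 else 0.0   # exactly 1 when active, exactly 0 when dead

for pre_activation in (2.0, -3.0):
    print(f"pre-activation {pre_activation:+.1f}: local gradient = {relu_deriv(pre_activation)}")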
5. How do residual connections improve gradient flow? (Medium)
Answer: y = F(x)+x adds an identity path; gradient can bypass stacked layers via +1 shortcut—eases training of very deep nets.
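A minimal residual block sketch in PyTorch (the module and sizes are illustrative; the point is the "+ x" identity path):

import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """y = F(x) + x, so dL/dx always gets an identity term that bypasses F."""
    def __init__(self, dim: int):
        super().__init__()
        self.f = nn.Sequential(
            nn.Linear(dim, dim),
            nn.ReLU(),
            nn.Linear(dim, dim),
        )

    def forward(self, x):
        return self.f(x) + x   # skip connection: gradient flows through "+ x" untouched

block = ResidualBlock(64)
print(block(torch.randn(8, 64)).shape)   # torch.Size([8, 64])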
6. LSTM vs vanilla RNN for vanishing gradients. (Medium)
Answer: LSTM’s cell state and additive updates with gates allow better long-range gradient flow than simple tanh recurrence where each step multiplies Jacobians.
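A sketch of the difference in update form, with a hypothetical constant forget gate (a real LSTM computes its gates from x_t and h_{t-1}):

# Vanilla RNN:  h_t = tanh(W h_{t-1} + U x_t)   -> each step multiplies by W^T diag(tanh')
# LSTM cell:    c_t = f_t * c_{t-1} + i_t * g_t -> the cell-path gradient is scaled only by f_t
f_gate = 0.95                     # hypothetical constant forget gate
grad_through_cell = 1.0
for t in range(100):
    grad_through_cell *= f_gate
print(f"gradient factor after 100 steps on the cell path: {grad_through_cell:.3f}")
# ~0.006: still decays, but far more slowly than tanh-recurrence products,
# and the forget gate can learn to sit near 1 when long memory is needed.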
7. GRU vs LSTM: the gradient angle. (Easy)
Answer: GRU has fewer gates but similar idea: gating and update blending to mitigate vanishing in sequences—often comparable performance with less compute.
8. Gradient clipping: norm vs value. (Medium)
Answer: Clip by norm: if ||g|| > threshold, scale g down—common in RNNs. Value clipping caps each element—less common. Stops one batch from destroying weights.
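A hedged PyTorch sketch of both flavors (the model, data, and thresholds are placeholders; clip_grad_norm_ / clip_grad_value_ are the standard utilities):

import torch
import torch.nn as nn

model = nn.Linear(10, 1)                                  # placeholder model
opt = torch.optim.SGD(model.parameters(), lr=0.1)

x, y = torch.randn(32, 10), torch.randn(32, 1)
loss = nn.functional.mse_loss(model(x), y)
loss.backward()

# Norm clipping: rescale the whole gradient vector if its L2 norm exceeds 1.0.
total_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)

# Value clipping (less common): cap each gradient element at +/- 0.5 instead.
# torch.nn.utils.clip_grad_value_(model.parameters(), clip_value=0.5)

opt.step()
opt.zero_grad()
print(f"pre-clip gradient norm: {total_norm:.3f}")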
9. Does batch norm fix vanishing gradients? (Medium)
Answer: It stabilizes activations and can help optimization indirectly—not a guarantee; deep nets still benefit from good init, ReLU, residuals.
10. Highway networks vs ResNets (brief). (Hard)
Answer: Highway networks use a learned gate to blend the transform and carry (skip) paths; ResNet uses an ungated identity skip plus a simpler F(x). ResNet won out for simplicity and performance in vision.
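A highway-layer sketch for contrast (illustrative module; T(x) is the learned transform gate):

import torch
import torch.nn as nn

class HighwayLayer(nn.Module):
    """y = T(x) * H(x) + (1 - T(x)) * x, with a learned sigmoid gate T."""
    def __init__(self, dim: int):
        super().__init__()
        self.transform = nn.Linear(dim, dim)   # H(x)
        self.gate = nn.Linear(dim, dim)        # produces T(x) via sigmoid

    def forward(self, x):
        h = torch.relu(self.transform(x))
        t = torch.sigmoid(self.gate(x))
        return t * h + (1.0 - t) * x           # a ResNet block would just return h + x

layer = HighwayLayer(64)
print(layer(torch.randn(4, 64)).shape)          # torch.Size([4, 64])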
11. Do Transformers vanish like RNNs? (Medium)
Answer: Depth is finite and paths include residual + LN; attention mixes tokens in O(1) depth per layer—not the same T-step product as unrolled RNN, but very deep stacks still need design care.
12. Link to weight initialization. (Easy)
Answer: Good init keeps activations in a range where derivatives aren’t tiny everywhere—reduces extreme products at start of training.
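A short PyTorch sketch of scale-aware initialization (He/Kaiming for ReLU stacks; Xavier/Glorot would be the analogue for tanh/sigmoid):

import torch.nn as nn

def init_weights(module: nn.Module) -> None:
    # Keep pre-activations at a scale where derivatives are not uniformly tiny.
    if isinstance(module, nn.Linear):
        nn.init.kaiming_normal_(module.weight, nonlinearity="relu")   # He init for ReLU
        nn.init.zeros_(module.bias)

net = nn.Sequential(nn.Linear(256, 256), nn.ReLU(), nn.Linear(256, 10))
net.apply(init_weights)    # runs init_weights on every submodule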
13. How might you detect exploding gradients in logs? (Easy)
Answer: Sudden NaN loss, gradient norm spikes, weights blow up—watch global norm per step.
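A monitoring sketch, assuming a standard PyTorch training step (the 10.0 alert threshold is arbitrary):

import torch

def global_grad_norm(model: torch.nn.Module) -> float:
    """L2 norm over all parameter gradients -- the number to watch each step."""
    total = 0.0
    for p in model.parameters():
        if p.grad is not None:
            total += p.grad.detach().pow(2).sum().item()
    return total ** 0.5

# After loss.backward():
#   norm = global_grad_norm(model)
#   if norm > 10.0 or norm != norm:      # norm != norm catches NaN
#       print(f"warning: gradient norm {norm:.1f} at this step")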
14. Mixed precision and loss scaling. (Hard)
Answer: Small gradients can underflow in fp16; loss scaling multiplies loss before backward so gradients stay in representable range, then unscales for the optimizer.
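A hedged sketch of PyTorch's built-in loss scaling (GradScaler), assuming a CUDA device and a placeholder model/optimizer:

import torch
import torch.nn as nn

model = nn.Linear(10, 1).cuda()                         # placeholder model
opt = torch.optim.SGD(model.parameters(), lr=0.1)
scaler = torch.cuda.amp.GradScaler()                    # manages the loss scale

x = torch.randn(32, 10, device="cuda")
y = torch.randn(32, 1, device="cuda")

with torch.cuda.amp.autocast():                         # fp16 forward where safe
    loss = nn.functional.mse_loss(model(x), y)

scaler.scale(loss).backward()    # scale loss so small grads don't underflow in fp16
scaler.unscale_(opt)             # recover true grads (e.g. if clipping before the step)
scaler.step(opt)                 # skips the update if inf/NaN gradients were found
scaler.update()                  # adjusts the scale factor for the next iteration
opt.zero_grad()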
15. One-sentence fix menu for interviews. (Easy)
Answer: Use ReLU, good init, BN/LN, residuals, gated RNNs or attention, and gradient clipping when training recurrent or unstable nets.
Always mention products along the graph—the core math story.

Quick review checklist

  • Vanish vs explode; sigmoid vs ReLU; ResNet shortcut.
  • LSTM/GRU; gradient clipping; BN’s indirect role.
  • Transformers vs RNN depth; loss scaling mention.