Neural Networks: 15 Essential Q&A
Interview Prep

Vanishing & Exploding Gradients — 15 Interview Questions

Products of Jacobians through depth and time, sigmoid saturation, ResNets, LSTM gates, and gradient clipping.


1. What is the vanishing gradient problem? (Easy)
Answer: In backprop, the gradient reaching an early layer is a product of per-layer derivative terms. If many factors are < 1 (e.g. saturated sigmoid derivatives), early-layer gradients → 0 and those weights barely update.
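A minimal numeric sketch of that product picture, in plain Python, using the sigmoid's best-case slope of 0.25 as the per-layer factor:

import math

def sigmoid_deriv(x):
    s = 1.0 / (1.0 + math.exp(-x))
    return s * (1.0 - s)

# Backprop multiplies one local derivative per layer; even at the
# sigmoid's maximum slope (0.25 at x = 0) the product collapses fast.
grad = 1.0
for depth in range(1, 21):
    grad *= sigmoid_deriv(0.0)          # 0.25 per layer, the best case
    if depth in (5, 10, 20):
        print(f"depth {depth:2d}: gradient factor ~ {grad:.2e}")
# depth 20 is roughly 1e-12: the earliest layers see essentially no signal.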
2. What is the exploding gradient problem? (Easy)
Answer: Same product picture: repeated factors > 1 give huge gradients, unstable updates, and NaNs. Common in RNNs unrolled over long sequences.
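The mirror-image sketch: a recurrent factor slightly above 1 (a stand-in for a step Jacobian whose largest singular value exceeds 1) compounds just as quickly in the other direction.

# Each unrolled time step multiplies by the same factor; anything > 1 compounds.
factor = 1.5
grad = 1.0
for t in range(1, 101):
    grad *= factor
    if t in (10, 50, 100):
        print(f"step {t:3d}: gradient factor ~ {grad:.2e}")
# ~5.8e+01 at t=10, ~6.4e+08 at t=50, ~4.1e+17 at t=100 -- NaN territory
# once this multiplies real weight gradients.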
3. Why do sigmoid/tanh worsen vanishing? (Medium)
Answer: Derivatives are ≤ 0.25 (sigmoid) or small in saturation—each layer shrinks the backward signal when stacked deeply.
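A quick check of those derivative bounds in plain Python:

import math

def dsigmoid(x):
    s = 1.0 / (1.0 + math.exp(-x))
    return s * (1.0 - s)

def dtanh(x):
    return 1.0 - math.tanh(x) ** 2

for x in (0.0, 2.0, 5.0):
    print(f"x={x:>4}: sigmoid' = {dsigmoid(x):.4f}, tanh' = {dtanh(x):.4f}")
# x=0 gives the maxima (0.25 and 1.0); by x=5 sigmoid' is ~0.007 and
# tanh' is ~2e-4 -- the saturation regime that kills deep products.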
4. How does ReLU help (and one caveat)? (Medium)
Answer: The derivative is 1 for active neurons, so there is far less shrinkage than with sigmoid. Caveat: dead ReLUs (units whose pre-activation stays negative) pass zero gradient and stop learning.
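A tiny sketch of that caveat: once a unit's pre-activation goes (and stays) negative, ReLU's local derivative is 0 and nothing learns through that unit.

def relu_deriv(x):
    return 1.0 if x > 0.0 else 0.0   # exactly 1 when active, exactly 0 when dead

for pre_activation in (2.0, -3.0):
    print(f"pre-activation {pre_activation:+.1f}: local gradient = {relu_deriv(pre_activation)}")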
5. How do residual connections improve gradient flow? (Medium)
Answer: y = F(x)+x adds an identity path; gradient can bypass stacked layers via +1 shortcut—eases training of very deep nets.
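A minimal residual block sketch in PyTorch (the module and sizes are illustrative; the point is the "+ x" identity path):

import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """y = F(x) + x, so dL/dx always gets an identity term that bypasses F."""
    def __init__(self, dim: int):
        super().__init__()
        self.f = nn.Sequential(
            nn.Linear(dim, dim),
            nn.ReLU(),
            nn.Linear(dim, dim),
        )

    def forward(self, x):
        return self.f(x) + x   # skip connection: gradient flows through "+ x" untouched

block = ResidualBlock(64)
print(block(torch.randn(8, 64)).shape)   # torch.Size([8, 64])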
6. LSTM vs vanilla RNN for vanishing gradients. (Medium)
Answer: LSTM’s cell state and additive updates with gates allow better long-range gradient flow than simple tanh recurrence where each step multiplies Jacobians.
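A sketch of the difference in update form, with a hypothetical constant forget gate (a real LSTM computes its gates from x_t and h_{t-1}):

# Vanilla RNN:  h_t = tanh(W h_{t-1} + U x_t)   -> each step multiplies by W^T diag(tanh')
# LSTM cell:    c_t = f_t * c_{t-1} + i_t * g_t -> the cell-path gradient is scaled only by f_t
f_gate = 0.95                     # hypothetical constant forget gate
grad_through_cell = 1.0
for t in range(100):
    grad_through_cell *= f_gate
print(f"gradient factor after 100 steps on the cell path: {grad_through_cell:.3f}")
# ~0.006: still decays, but far more slowly than tanh-recurrence products,
# and the forget gate can learn to sit near 1 when long memory is needed.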
7. GRU vs LSTM: the gradient angle. (Easy)
Answer: GRU has fewer gates but similar idea: gating and update blending to mitigate vanishing in sequences—often comparable performance with less compute.
8. Gradient clipping: norm vs value. (Medium)
Answer: Clip by norm: if ||g|| > threshold, scale g down—common in RNNs. Value clipping caps each element—less common. Stops one batch from destroying weights.
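A hedged PyTorch sketch of both flavors (the model, data, and thresholds are placeholders; clip_grad_norm_ / clip_grad_value_ are the standard utilities):

import torch
import torch.nn as nn

model = nn.Linear(10, 1)                                  # placeholder model
opt = torch.optim.SGD(model.parameters(), lr=0.1)

x, y = torch.randn(32, 10), torch.randn(32, 1)
loss = nn.functional.mse_loss(model(x), y)
loss.backward()

# Norm clipping: rescale the whole gradient vector if its L2 norm exceeds 1.0.
total_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)

# Value clipping (less common): cap each gradient element at +/- 0.5 instead.
# torch.nn.utils.clip_grad_value_(model.parameters(), clip_value=0.5)

opt.step()
opt.zero_grad()
print(f"pre-clip gradient norm: {total_norm:.3f}")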
9. Does batch norm fix vanishing gradients? (Medium)
Answer: It stabilizes activations and can help optimization indirectly—not a guarantee; deep nets still benefit from good init, ReLU, residuals.
10. Highway networks vs ResNets (brief). (Hard)
Answer: Highway networks use a learned gate to blend the transform and carry (skip) paths; ResNet uses an ungated identity skip plus a simpler F(x). ResNet won out for simplicity and performance in vision.
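A highway-layer sketch for contrast (illustrative module; T(x) is the learned transform gate):

import torch
import torch.nn as nn

class HighwayLayer(nn.Module):
    """y = T(x) * H(x) + (1 - T(x)) * x, with a learned sigmoid gate T."""
    def __init__(self, dim: int):
        super().__init__()
        self.transform = nn.Linear(dim, dim)   # H(x)
        self.gate = nn.Linear(dim, dim)        # produces T(x) via sigmoid

    def forward(self, x):
        h = torch.relu(self.transform(x))
        t = torch.sigmoid(self.gate(x))
        return t * h + (1.0 - t) * x           # a ResNet block would just return h + x

layer = HighwayLayer(64)
print(layer(torch.randn(4, 64)).shape)          # torch.Size([4, 64])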
11. Do Transformers vanish like RNNs? (Medium)
Answer: Depth is finite and paths include residual + LN; attention mixes tokens in O(1) depth per layer—not the same T-step product as unrolled RNN, but very deep stacks still need design care.
12. Link to weight initialization. (Easy)
Answer: Good init keeps activations in a range where derivatives aren’t tiny everywhere—reduces extreme products at start of training.
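A short PyTorch sketch of scale-aware initialization (He/Kaiming for ReLU stacks; Xavier/Glorot would be the analogue for tanh/sigmoid):

import torch.nn as nn

def init_weights(module: nn.Module) -> None:
    # Keep pre-activations at a scale where derivatives are not uniformly tiny.
    if isinstance(module, nn.Linear):
        nn.init.kaiming_normal_(module.weight, nonlinearity="relu")   # He init for ReLU
        nn.init.zeros_(module.bias)

net = nn.Sequential(nn.Linear(256, 256), nn.ReLU(), nn.Linear(256, 10))
net.apply(init_weights)    # runs init_weights on every submodule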
13. How might you detect exploding gradients in logs? (Easy)
Answer: Sudden NaN loss, gradient norm spikes, weights blow up—watch global norm per step.
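A monitoring sketch, assuming a standard PyTorch training step (the 10.0 alert threshold is arbitrary):

import torch

def global_grad_norm(model: torch.nn.Module) -> float:
    """L2 norm over all parameter gradients -- the number to watch each step."""
    total = 0.0
    for p in model.parameters():
        if p.grad is not None:
            total += p.grad.detach().pow(2).sum().item()
    return total ** 0.5

# After loss.backward():
#   norm = global_grad_norm(model)
#   if norm > 10.0 or norm != norm:      # norm != norm catches NaN
#       print(f"warning: gradient norm {norm:.1f} at this step")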
14. Mixed precision and loss scaling. (Hard)
Answer: Small gradients can underflow in fp16; loss scaling multiplies loss before backward so gradients stay in representable range, then unscales for the optimizer.
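A hedged sketch of PyTorch's built-in loss scaling (GradScaler), assuming a CUDA device and a placeholder model/optimizer:

import torch
import torch.nn as nn

model = nn.Linear(10, 1).cuda()                         # placeholder model
opt = torch.optim.SGD(model.parameters(), lr=0.1)
scaler = torch.cuda.amp.GradScaler()                    # manages the loss scale

x = torch.randn(32, 10, device="cuda")
y = torch.randn(32, 1, device="cuda")

with torch.cuda.amp.autocast():                         # fp16 forward where safe
    loss = nn.functional.mse_loss(model(x), y)

scaler.scale(loss).backward()    # scale loss so small grads don't underflow in fp16
scaler.unscale_(opt)             # recover true grads (e.g. if clipping before the step)
scaler.step(opt)                 # skips the update if inf/NaN gradients were found
scaler.update()                  # adjusts the scale factor for the next iteration
opt.zero_grad()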
15. One-sentence fix menu for interviews. (Easy)
Answer: Use ReLU, good init, BN/LN, residuals, gated RNNs or attention, and gradient clipping when training recurrent or unstable nets.
Always mention products along the graph—the core math story.

Quick review checklist

  • Vanish vs explode; sigmoid vs ReLU; ResNet shortcut.
  • LSTM/GRU; gradient clipping; BN’s indirect role.
  • Transformers vs RNN depth; loss scaling mention.