Vanishing & Exploding Gradients
Backpropagation multiplies partial derivatives layer by layer. In a deep stack, that product can shrink toward zero (vanishing gradients), so early layers barely learn, or grow without bound (exploding gradients), so updates overflow to NaN. Saturating activations like sigmoid and tanh, whose derivatives are below 1, historically made very deep plain MLPs hard to train. Modern fixes include ReLU families, better initialization, batch normalization, residual shortcuts, and gradient clipping.
Key terms: chain rule, ResNet, LSTM, clip_grad_norm_
Why the Product Matters
The gradient of the loss with respect to an early-layer weight is a product of Jacobian factors, one per layer and per activation. If typical factors have magnitude below 1, the product decays exponentially with depth; if above 1, it can explode. Unscaled random initialization can push a deep, roughly linear stack toward either extreme.
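As a quick numerical illustration (plain Python; the per-layer factor magnitudes 0.9 and 1.1 are made up for this example), multiplying fifty such factors shows how fast the product collapses or blows up:

depth = 50
shrinking = 0.9 ** depth   # about 0.005: the gradient has effectively vanished
growing = 1.1 ** depth     # about 117: the gradient has exploded
print(f"0.9^{depth} = {shrinking:.4f}")
print(f"1.1^{depth} = {growing:.1f}")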
Vanishing gradients starve early layers of useful signal, so they update too slowly; exploding gradients cause huge weight swings and numerical overflow. RNNs unrolled over long sequences face the same problem through time, which is why LSTM and GRU gating creates additive paths along which gradients flow more gently.
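A minimal PyTorch sketch makes the vanishing case visible; the depth (30), width (64), and squared-output loss are arbitrary choices for illustration, not part of the tutorial's recipe. Stacking sigmoid layers and running one backward pass shows the first layer's gradient norm collapsing relative to the last layer's:

import torch
import torch.nn as nn

torch.manual_seed(0)

depth, width = 30, 64                      # arbitrary toy sizes
layers = []
for _ in range(depth):
    layers += [nn.Linear(width, width), nn.Sigmoid()]
model = nn.Sequential(*layers)

x = torch.randn(8, width)                  # dummy input batch
loss = model(x).pow(2).mean()              # dummy loss
loss.backward()

print(f"first layer grad norm: {model[0].weight.grad.norm().item():.2e}")   # tiny
print(f"last layer grad norm:  {model[-2].weight.grad.norm().item():.2e}")  # much larger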
Mitigations
- ReLU (and variants): the derivative is exactly 1 for positive pre-activations, so active units add no shrinking factor, unlike sigmoid, whose derivative never exceeds 0.25.
- Residual connections (ResNet): add identity shortcuts so layers learn residuals and the gradient has a direct path backward (see the residual-block sketch with He initialization after this list).
- Batch / layer normalization: stabilizes scale of activations and often improves gradient behavior.
- Initialization (He, Xavier): keeps forward variance and backward gradient scale in a reasonable range at layer boundaries.
- Gradient clipping: cap the global norm of the gradients before optimizer.step(); standard in NLP and RNN training.
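A minimal sketch combining two of these mitigations; the fully connected block below, its width, and its activation are illustrative assumptions rather than a specific ResNet variant. The identity shortcut gives the backward pass a direct additive path, and He (Kaiming) initialization keeps the pre-activation scale reasonable for ReLU:

import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    # Illustrative fully connected residual block: out = x + F(x).
    def __init__(self, width):
        super().__init__()
        self.fc1 = nn.Linear(width, width)
        self.fc2 = nn.Linear(width, width)
        for fc in (self.fc1, self.fc2):
            nn.init.kaiming_normal_(fc.weight, nonlinearity="relu")  # He init
            nn.init.zeros_(fc.bias)

    def forward(self, x):
        # The "+ x" shortcut passes gradients backward unchanged.
        return x + self.fc2(torch.relu(self.fc1(x)))

block = ResidualBlock(64)
out = block(torch.randn(8, 64))
print(out.shape)   # torch.Size([8, 64])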
PyTorch: Gradient Clipping
import torch

# Assumes model, loss, and optimizer are already defined for one training step.
loss.backward()                                                     # compute gradients
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)    # cap the global gradient norm
optimizer.step()                                                    # apply the clipped update
optimizer.zero_grad()                                               # reset gradients for the next step
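The max_norm value of 1.0 above is only a common starting point, not a rule; the appropriate threshold depends on the model and is usually chosen by watching typical gradient norms during training.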
Summary
- Deep chains multiply Jacobians; products can vanish or explode.
- ReLU, good init, BN, and residuals are standard stabilizers for feedforward CNNs/MLPs.
- RNNs need gating (LSTM/GRU) or truncation; clipping helps exploding cases.
- Next: CNNs—structure built for images and local patterns.
Convolutional neural networks, covered next in the series, exploit spatial structure through shared filters and pooling.