Vanishing & Exploding Gradients
Backpropagation multiplies partial derivatives layer by layer. In a deep stack, that product can shrink toward zero (vanishing gradients), so early layers barely learn, or grow without bound (exploding gradients), so updates overflow to NaN. Saturating activations like sigmoid and tanh, whose derivatives are below 1, historically made very deep plain MLPs hard to train. Modern fixes include ReLU families, better initialization, batch normalization, residual shortcuts, and gradient clipping.
Key terms: chain rule, ResNet, LSTM, clip_grad_norm_
Why the Product Matters
The gradient of the loss with respect to an early-layer weight is a product of Jacobian factors, one per layer and per activation. If typical factors have magnitude below 1, the product decays exponentially with depth; if above 1, it can explode. Unscaled random initialization can push a deep, roughly linear stack toward either extreme.
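As a quick numerical illustration (plain Python; the per-layer factor magnitudes 0.9 and 1.1 are made up for this example), multiplying fifty such factors shows how fast the product collapses or blows up:

depth = 50
shrinking = 0.9 ** depth   # about 0.005: the gradient has effectively vanished
growing = 1.1 ** depth     # about 117: the gradient has exploded
print(f"0.9^{depth} = {shrinking:.4f}")
print(f"1.1^{depth} = {growing:.1f}")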
Vanishing gradients starve early layers of useful signal, so they update too slowly; exploding gradients cause huge weight swings and numerical overflow. RNNs unrolled over long sequences face the same problem through time, which is why LSTM and GRU gating creates additive paths along which gradients flow more gently.
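A minimal PyTorch sketch makes the vanishing case visible; the depth (30), width (64), and squared-output loss are arbitrary choices for illustration, not part of the tutorial's recipe. Stacking sigmoid layers and running one backward pass shows the first layer's gradient norm collapsing relative to the last layer's:

import torch
import torch.nn as nn

torch.manual_seed(0)

depth, width = 30, 64                      # arbitrary toy sizes
layers = []
for _ in range(depth):
    layers += [nn.Linear(width, width), nn.Sigmoid()]
model = nn.Sequential(*layers)

x = torch.randn(8, width)                  # dummy input batch
loss = model(x).pow(2).mean()              # dummy loss
loss.backward()

print(f"first layer grad norm: {model[0].weight.grad.norm().item():.2e}")   # tiny
print(f"last layer grad norm:  {model[-2].weight.grad.norm().item():.2e}")  # much larger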
Mitigations
- ReLU (and variants): the derivative is exactly 1 for positive pre-activations, so active units add no shrinking factor, unlike sigmoid, whose derivative never exceeds 0.25.
- Residual connections (ResNet): add identity shortcuts so layers learn residuals and the gradient has a direct path backward (see the residual-block sketch with He initialization after this list).
- Batch / layer normalization: stabilizes scale of activations and often improves gradient behavior.
- Initialization (He, Xavier): keeps forward variance and backward gradient scale in a reasonable range at layer boundaries.
- Gradient clipping: cap the global norm of the gradients before optimizer.step(); standard in NLP and RNN training.
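A minimal sketch combining two of these mitigations; the fully connected block below, its width, and its activation are illustrative assumptions rather than a specific ResNet variant. The identity shortcut gives the backward pass a direct additive path, and He (Kaiming) initialization keeps the pre-activation scale reasonable for ReLU:

import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    # Illustrative fully connected residual block: out = x + F(x).
    def __init__(self, width):
        super().__init__()
        self.fc1 = nn.Linear(width, width)
        self.fc2 = nn.Linear(width, width)
        for fc in (self.fc1, self.fc2):
            nn.init.kaiming_normal_(fc.weight, nonlinearity="relu")  # He init
            nn.init.zeros_(fc.bias)

    def forward(self, x):
        # The "+ x" shortcut passes gradients backward unchanged.
        return x + self.fc2(torch.relu(self.fc1(x)))

block = ResidualBlock(64)
out = block(torch.randn(8, 64))
print(out.shape)   # torch.Size([8, 64])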
PyTorch: Gradient Clipping
import torch

# Assumes model, loss, and optimizer are already defined for one training step.
loss.backward()                                                     # compute gradients
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)    # cap the global gradient norm
optimizer.step()                                                    # apply the clipped update
optimizer.zero_grad()                                               # reset gradients for the next step
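The max_norm value of 1.0 above is only a common starting point, not a rule; the appropriate threshold depends on the model and is usually chosen by watching typical gradient norms during training.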
Summary
- Deep chains multiply Jacobians; products can vanish or explode.
- ReLU, good init, BN, and residuals are standard stabilizers for feedforward CNNs/MLPs.
- RNNs need gating (LSTM/GRU) or truncation; clipping helps exploding cases.
- Next: CNNs—structure built for images and local patterns.
Convolutional neural networks, covered next in the series, exploit spatial structure through shared filters and pooling.