Backpropagation
Backpropagation is not a separate learning rule—it is an efficient algorithm to compute ∂L/∂θ for every parameter θ in a feedforward network, given the loss L. It applies the chain rule of calculus repeatedly, propagating error signals backward from the output layer toward the input, reusing intermediate results so the cost grows roughly linearly with network size instead of exploding exponentially.
From the Chain Rule to “Error Signals”
If L depends on z, and z depends on w, then ∂L/∂w = (∂L/∂z)(∂z/∂w). In a network, L depends on the output, which depends on the last layer’s activations, which depend on the last layer’s pre-activations, which depend on weights and the previous layer’s activations—and so on. Each link in this chain contributes a factor; backprop multiplies those factors along paths from L back to each w.
Implementation-wise, we store forward values (inputs to each op) needed to compute local derivatives on the way back. The backward pass visits operations in reverse topological order: each op receives an upstream gradient (“how much L changes per unit change in this op’s output”) and returns gradients w.r.t. its inputs.
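As a minimal sketch of that bookkeeping (the scalar chain and all variable names below are illustrative, not from the text above), consider L = (σ(wx) − t)²: the backward pass reuses the stored forward values to assemble ∂L/∂w from local factors, and the hand-applied chain rule matches PyTorch’s autograd.
import torch

# Scalar chain: z = w*x, a = sigmoid(z), L = (a - t)**2
w = torch.tensor(0.5, requires_grad=True)
x, t = torch.tensor(2.0), torch.tensor(1.0)
z = w * x                      # forward values are kept for the backward pass
a = torch.sigmoid(z)
L = (a - t) ** 2
L.backward()                   # reverse-mode: visit ops from L back toward w

# Hand-applied chain rule: dL/dw = dL/da * da/dz * dz/dw
dL_da = 2 * (a - t)
da_dz = a * (1 - a)            # sigmoid'(z) = sigma(z) * (1 - sigma(z))
dz_dw = x
print(w.grad.item(), (dL_da * da_dz * dz_dw).item())   # the two values agree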
What Happens in One Dense Layer?
Suppose z = W a + b and a' = σ(z). During backward, we receive ∂L/∂a' (same shape as a'). First, ∂L/∂z = ∂L/∂a' ⊙ σ'(z) (element-wise product for pointwise σ). Then ∂L/∂a = Wᵀ (∂L/∂z) to pass the signal to the previous layer, and ∂L/∂W = (∂L/∂z) aᵀ in the appropriate batch layout (or outer products summed over examples). The bias gradient is ∂L/∂z summed over the batch dimension.
You do not need to memorize every transpose if you use a framework; you do need the picture: matmul backward swaps the roles of activations and upstream gradients when flowing through weights, which is why shape discipline matters for custom layers.
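Here is a sketch of those formulas for a single example (layer sizes and the stand-in loss are arbitrary choices for illustration), with each manual gradient checked against autograd:
import torch

torch.manual_seed(0)
n_in, n_out = 4, 3
W = torch.randn(n_out, n_in, requires_grad=True)
b = torch.randn(n_out, requires_grad=True)
a = torch.randn(n_in, requires_grad=True)                  # previous layer's activation

z = W @ a + b                                              # pre-activation
a_out = torch.sigmoid(z)                                   # a' = sigma(z)
L = a_out.sum()                                            # stand-in loss, so dL/da' is all ones
L.backward()

dL_da_out = torch.ones(n_out)                              # upstream gradient dL/da'
dL_dz = dL_da_out * a_out.detach() * (1 - a_out.detach())  # elementwise multiply by sigma'(z)
dL_dW = dL_dz.unsqueeze(1) @ a.detach().unsqueeze(0)       # (dL/dz) a^T outer product
dL_db = dL_dz                                              # bias gradient
dL_da = W.detach().T @ dL_dz                               # signal passed to previous layer

print(torch.allclose(dL_dW, W.grad),
      torch.allclose(dL_db, b.grad),
      torch.allclose(dL_da, a.grad))                       # True True True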
ReLU, Sigmoid, and Softmax Layers
ReLU: derivative 1 where z>0, 0 where z<0. Dead neurons correspond to regions where the gradient is always zero—another reason Leaky ReLU or other activations appear.
Sigmoid/tanh: derivatives involve σ(1−σ) or 1−tanh²; in deep stacks these can shrink gradients (historical “vanishing gradient” story with saturating activations).
Softmax + cross-entropy: the combined backward pass toward logits simplifies beautifully for stable implementations; frameworks fuse them so you never form huge Jacobian matrices explicitly. Conceptually, the gradient on logits pushes probability mass toward the correct class and away from wrong ones.
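To make that fused gradient concrete, here is a small sketch (batch size and class count are arbitrary): the gradient on the logits is softmax(logits) minus the one-hot target, divided by the batch size because cross_entropy averages over the batch, and it matches what autograd produces.
import torch
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(2, 5, requires_grad=True)      # 2 examples, 5 classes
targets = torch.tensor([3, 1])

loss = F.cross_entropy(logits, targets)             # fused log-softmax + negative log-likelihood
loss.backward()

p = torch.softmax(logits.detach(), dim=1)
one_hot = F.one_hot(targets, num_classes=5).float()
manual = (p - one_hot) / logits.shape[0]            # push mass toward the correct class
print(torch.allclose(manual, logits.grad, atol=1e-6))   # True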
Memory and Compute
Training typically stores activations from the forward pass for backward use. That is why activation checkpointing trades extra forward recomputation for less GPU memory on very large models. The backward pass has similar order-of-magnitude cost to the forward pass for many common layers.
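A minimal sketch of that trade-off with torch.utils.checkpoint (the block and sizes below are made up): activations inside the checkpointed segment are not stored during forward and are recomputed during backward.
import torch
from torch.utils.checkpoint import checkpoint

block = torch.nn.Sequential(
    torch.nn.Linear(256, 256), torch.nn.ReLU(),
    torch.nn.Linear(256, 256), torch.nn.ReLU(),
)
x = torch.randn(32, 256, requires_grad=True)

y = checkpoint(block, x, use_reentrant=False)   # intermediate activations recomputed in backward
y.sum().backward()
print(x.grad.shape)                             # gradients still reach the input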
If you implement a custom layer as an autograd.Function, verify shapes and compare your analytic backward to torch.autograd.gradcheck on tiny random inputs (finite differences): slow but trustworthy.
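As a sketch of that workflow (the CubeFn name is invented for illustration), a custom autograd.Function whose hand-written backward is validated by gradcheck on small double-precision inputs:
import torch

class CubeFn(torch.autograd.Function):
    # forward saves what backward needs; backward returns one gradient per input
    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)
        return x ** 3

    @staticmethod
    def backward(ctx, grad_out):
        (x,) = ctx.saved_tensors
        return grad_out * 3 * x ** 2

x = torch.randn(5, dtype=torch.double, requires_grad=True)   # doubles: finite differences need precision
print(torch.autograd.gradcheck(CubeFn.apply, (x,)))           # True when analytic and numeric grads agree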
PyTorch: backward() and the Graph
PyTorch builds a dynamic graph each forward (eager mode). Tensors with requires_grad=True track operations. Calling loss.backward() runs backprop; optimizer.step() then uses stored .grad tensors. torch.no_grad() disables graph building for inference.
import torch
x = torch.tensor([2.0], requires_grad=True)
y = x ** 3 + 5 * x
y.backward() # dy/dx = 3x^2 + 5
print(x.grad) # tensor([17.])
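To round out the loop described above (the model, data, and learning rate are placeholders), a sketch of one training step: backward() fills each parameter's .grad, optimizer.step() applies the update, and torch.no_grad() wraps inference.
import torch

model = torch.nn.Linear(10, 1)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
xb, yb = torch.randn(16, 10), torch.randn(16, 1)   # dummy mini-batch

pred = model(xb)                                   # forward builds the dynamic graph
loss = torch.nn.functional.mse_loss(pred, yb)
opt.zero_grad()                                    # clear stale .grad from the previous step
loss.backward()                                    # backprop fills p.grad for every parameter
opt.step()                                         # update parameters using .grad

with torch.no_grad():                              # inference: no graph, no stored activations
    print(model(xb[:2]))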
Summary
- Backprop = reverse-mode differentiation applying the chain rule on the network’s computation graph.
- Each layer maps an upstream gradient into gradients w.r.t. its inputs and parameters.
- Modern frameworks hide Jacobian algebra but the structure explains vanishing/exploding behavior.
- Computational graphs (next page) make this bookkeeping explicit.
Next: see how treating operations as nodes and data as edges formalizes the forward and backward passes.