Backpropagation
Backpropagation is not a separate learning rule—it is an efficient algorithm to compute ∂L/∂θ for every parameter θ in a feedforward network, given the loss L. It applies the chain rule of calculus repeatedly, propagating error signals backward from the output layer toward the input, reusing intermediate results so the cost grows roughly linearly with network size instead of exploding exponentially.
From the Chain Rule to “Error Signals”
If L depends on z, and z depends on w, then ∂L/∂w = (∂L/∂z)(∂z/∂w). In a network, L depends on the output, which depends on the last layer’s activations, which depend on the last layer’s pre-activations, which depend on weights and the previous layer’s activations—and so on. Each link in this chain contributes a factor; backprop multiplies those factors along paths from L back to each w.
Implementation-wise, we store forward values (inputs to each op) needed to compute local derivatives on the way back. The backward pass visits operations in reverse topological order: each op receives an upstream gradient (“how much L changes per unit change in this op’s output”) and returns gradients w.r.t. its inputs.
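As a minimal sketch of that bookkeeping (the scalar chain and all variable names below are illustrative, not from the text above), consider L = (σ(wx) − t)²: the backward pass reuses the stored forward values to assemble ∂L/∂w from local factors, and the hand-applied chain rule matches PyTorch’s autograd.
import torch

# Scalar chain: z = w*x, a = sigmoid(z), L = (a - t)**2
w = torch.tensor(0.5, requires_grad=True)
x, t = torch.tensor(2.0), torch.tensor(1.0)
z = w * x                      # forward values are kept for the backward pass
a = torch.sigmoid(z)
L = (a - t) ** 2
L.backward()                   # reverse-mode: visit ops from L back toward w

# Hand-applied chain rule: dL/dw = dL/da * da/dz * dz/dw
dL_da = 2 * (a - t)
da_dz = a * (1 - a)            # sigmoid'(z) = sigma(z) * (1 - sigma(z))
dz_dw = x
print(w.grad.item(), (dL_da * da_dz * dz_dw).item())   # the two values agree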
What Happens in One Dense Layer?
Suppose z = W a + b and a' = σ(z). During backward, we receive ∂L/∂a' (same shape as a'). First, ∂L/∂z = ∂L/∂a' ⊙ σ'(z) (element-wise product for pointwise σ). Then ∂L/∂a = Wᵀ (∂L/∂z) to pass the signal to the previous layer, and ∂L/∂W = (∂L/∂z) aᵀ in the appropriate batch layout (or outer products summed over examples). The bias gradient is ∂L/∂z summed over the batch dimension.
You do not need to memorize every transpose if you use a framework; you do need the picture: matmul backward swaps the roles of activations and upstream gradients when flowing through weights, which is why shape discipline matters for custom layers.
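Here is a sketch of those formulas for a single example (layer sizes and the stand-in loss are arbitrary choices for illustration), with each manual gradient checked against autograd:
import torch

torch.manual_seed(0)
n_in, n_out = 4, 3
W = torch.randn(n_out, n_in, requires_grad=True)
b = torch.randn(n_out, requires_grad=True)
a = torch.randn(n_in, requires_grad=True)                  # previous layer's activation

z = W @ a + b                                              # pre-activation
a_out = torch.sigmoid(z)                                   # a' = sigma(z)
L = a_out.sum()                                            # stand-in loss, so dL/da' is all ones
L.backward()

dL_da_out = torch.ones(n_out)                              # upstream gradient dL/da'
dL_dz = dL_da_out * a_out.detach() * (1 - a_out.detach())  # elementwise multiply by sigma'(z)
dL_dW = dL_dz.unsqueeze(1) @ a.detach().unsqueeze(0)       # (dL/dz) a^T outer product
dL_db = dL_dz                                              # bias gradient
dL_da = W.detach().T @ dL_dz                               # signal passed to previous layer

print(torch.allclose(dL_dW, W.grad),
      torch.allclose(dL_db, b.grad),
      torch.allclose(dL_da, a.grad))                       # True True True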
ReLU, Sigmoid, and Softmax Layers
ReLU: derivative 1 where z>0, 0 where z<0. Dead neurons correspond to regions where the gradient is always zero—another reason Leaky ReLU or other activations appear.
Sigmoid/tanh: derivatives involve σ(1−σ) or 1−tanh²; in deep stacks these can shrink gradients (historical “vanishing gradient” story with saturating activations).
Softmax + cross-entropy: the combined backward pass toward logits simplifies beautifully for stable implementations; frameworks fuse them so you never form huge Jacobian matrices explicitly. Conceptually, the gradient on logits pushes probability mass toward the correct class and away from wrong ones.
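To make that fused gradient concrete, here is a small sketch (batch size and class count are arbitrary): the gradient on the logits is softmax(logits) minus the one-hot target, divided by the batch size because cross_entropy averages over the batch, and it matches what autograd produces.
import torch
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(2, 5, requires_grad=True)      # 2 examples, 5 classes
targets = torch.tensor([3, 1])

loss = F.cross_entropy(logits, targets)             # fused log-softmax + negative log-likelihood
loss.backward()

p = torch.softmax(logits.detach(), dim=1)
one_hot = F.one_hot(targets, num_classes=5).float()
manual = (p - one_hot) / logits.shape[0]            # push mass toward the correct class
print(torch.allclose(manual, logits.grad, atol=1e-6))   # True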
Memory and Compute
Training typically stores activations from the forward pass for backward use. That is why activation checkpointing trades extra forward recomputation for less GPU memory on very large models. The backward pass has similar order-of-magnitude cost to the forward pass for many common layers.
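A minimal sketch of that trade-off with torch.utils.checkpoint (the block and sizes below are made up): activations inside the checkpointed segment are not stored during forward and are recomputed during backward.
import torch
from torch.utils.checkpoint import checkpoint

block = torch.nn.Sequential(
    torch.nn.Linear(256, 256), torch.nn.ReLU(),
    torch.nn.Linear(256, 256), torch.nn.ReLU(),
)
x = torch.randn(32, 256, requires_grad=True)

y = checkpoint(block, x, use_reentrant=False)   # intermediate activations recomputed in backward
y.sum().backward()
print(x.grad.shape)                             # gradients still reach the input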
If you implement a custom layer as an autograd.Function, verify shapes and compare your analytic backward to torch.autograd.gradcheck on tiny random inputs (finite differences): slow but trustworthy.
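As a sketch of that workflow (the CubeFn name is invented for illustration), a custom autograd.Function whose hand-written backward is validated by gradcheck on small double-precision inputs:
import torch

class CubeFn(torch.autograd.Function):
    # forward saves what backward needs; backward returns one gradient per input
    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)
        return x ** 3

    @staticmethod
    def backward(ctx, grad_out):
        (x,) = ctx.saved_tensors
        return grad_out * 3 * x ** 2

x = torch.randn(5, dtype=torch.double, requires_grad=True)   # doubles: finite differences need precision
print(torch.autograd.gradcheck(CubeFn.apply, (x,)))           # True when analytic and numeric grads agree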
PyTorch: backward() and the Graph
PyTorch builds a dynamic graph each forward (eager mode). Tensors with requires_grad=True track operations. Calling loss.backward() runs backprop; optimizer.step() then uses stored .grad tensors. torch.no_grad() disables graph building for inference.
import torch
x = torch.tensor([2.0], requires_grad=True)
y = x ** 3 + 5 * x
y.backward() # dy/dx = 3x^2 + 5
print(x.grad) # tensor([17.])
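To round out the loop described above (the model, data, and learning rate are placeholders), a sketch of one training step: backward() fills each parameter's .grad, optimizer.step() applies the update, and torch.no_grad() wraps inference.
import torch

model = torch.nn.Linear(10, 1)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
xb, yb = torch.randn(16, 10), torch.randn(16, 1)   # dummy mini-batch

pred = model(xb)                                   # forward builds the dynamic graph
loss = torch.nn.functional.mse_loss(pred, yb)
opt.zero_grad()                                    # clear stale .grad from the previous step
loss.backward()                                    # backprop fills p.grad for every parameter
opt.step()                                         # update parameters using .grad

with torch.no_grad():                              # inference: no graph, no stored activations
    print(model(xb[:2]))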
Summary
- Backprop = reverse-mode differentiation applying the chain rule on the network’s computation graph.
- Each layer maps an upstream gradient into gradients w.r.t. its inputs and parameters.
- Modern frameworks hide Jacobian algebra but the structure explains vanishing/exploding behavior.
- Computational graphs (next page) make this bookkeeping explicit.
Next: see how treating operations as nodes and data as edges formalizes the forward and backward passes.