Backpropagation: The Engine of Deep Learning
Master the algorithm that trains neural networks. Chain rule, computational graphs, gradient flow, and pure NumPy implementation — no black boxes.
Topics covered:
- Chain Rule: multivariate calculus
- Computational Graph: forward & backward passes
- Gradient Flow: weight updates
- NumPy: implementation from scratch
Why Backpropagation? The Credit Assignment Problem
In a multi-layer network, how does a small change in an early weight affect the final loss? Backpropagation (Rumelhart, Hinton & Williams, 1986) elegantly solves this credit assignment problem by recursively applying the chain rule.
Historical Breakthrough
- 1986: Backpropagation popularized
- 1989: Universal approximation theorem proven
- 2012: AlexNet (backprop + GPUs) wins ImageNet
Intuition
Backpropagation = forward pass computes predictions, backward pass propagates error gradients from output to each weight. "How much did each weight contribute to the error?"
Chain Rule: From Calculus to Computation
Backpropagation is the chain rule — applied efficiently to millions of parameters.
Scalar Chain Rule
If y = f(g(x)), then dy/dx = (dy/dg) * (dg/dx)
Multivariate: For L = f(z), z = Wx + b:
∂L/∂W = (∂L/∂z) · (∂z/∂W)
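As a concrete check, the matrix form of this rule can be verified numerically for a tiny linear layer. A minimal sketch, where the loss L = sum(z) is chosen only so the analytic gradient stays simple (here ∂L/∂z is a vector of ones, so ∂L/∂W is the outer product of ones with x):

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((3, 2))
x = rng.standard_normal(2)
b = rng.standard_normal(3)

def loss(W):
    # L = f(z) with z = Wx + b; here f is simply the sum
    return np.sum(W @ x + b)

# Analytic: dL/dz = 1 for every entry, so dL/dW_ij = x_j
dW_analytic = np.outer(np.ones(3), x)

# Numerical: central finite differences, one entry at a time
eps = 1e-6
dW_numeric = np.zeros_like(W)
for i in range(3):
    for j in range(2):
        W_plus = W.copy();  W_plus[i, j] += eps
        W_minus = W.copy(); W_minus[i, j] -= eps
        dW_numeric[i, j] = (loss(W_plus) - loss(W_minus)) / (2 * eps)

print(np.max(np.abs(dW_analytic - dW_numeric)))  # tiny, ~1e-10
```

The same finite-difference idea reappears below as gradient checking.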
Computational Graphs: Visualizing Backprop
Modern frameworks (TensorFlow, PyTorch) build a computational graph during forward pass, then traverse it in reverse to compute gradients.
Forward:
x → *3 → +5 → z
Backward:
dz = 1              (start: dz/dz = 1)
da = dz * 1         (through the +5 node; a = x * 3)
dx = da * 3         (through the *3 node)
Automatic Differentiation
- Forward mode: compute derivatives alongside values
- Reverse mode (backprop): one forward pass, one backward pass → all gradients
- Efficient for many parameters (typical deep learning)
# Forward pass: z = (x * 3) + 5
x = 2.0
a = x * 3 # a = 6
z = a + 5 # z = 11
# Backward pass (dz/dz = 1)
dz = 1
da = dz * 1 # dz/da = 1
dx = da * 3 # da/dx = 3
print(dx) # Gradient = 3
Backpropagation Through a 2‑Layer MLP
dW1 ← dz1 ← da1 ← dz2 ← da2 ← dL/da2   (dW2 and db2 branch off dz2; dW1 and db1 off dz1)
Forward Pass (Caching)
z1 = X @ W1 + b1
a1 = sigmoid(z1)
z2 = a1 @ W2 + b2
a2 = sigmoid(z2)
loss = binary_crossentropy(y, a2)
Backward Pass (Gradients)
da2 = -(y/a2 - (1-y)/(1-a2)) # BCE derivative
dz2 = da2 * sigmoid_prime(z2) # (a2*(1-a2))
dW2 = a1.T @ dz2
db2 = np.sum(dz2, axis=0)
da1 = dz2 @ W2.T
dz1 = da1 * sigmoid_prime(z1)
dW1 = X.T @ dz1
db1 = np.sum(dz1, axis=0)
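A sanity check worth running after writing these equations: every gradient must have the same shape as the parameter (or activation) it corresponds to. A minimal sketch with arbitrary random data and dimensions:

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, h = 8, 3, 4                      # batch size, input dim, hidden dim
X  = rng.standard_normal((n, d))
W1 = rng.standard_normal((d, h)); b1 = np.zeros((1, h))
W2 = rng.standard_normal((h, 1)); b2 = np.zeros((1, 1))
y  = rng.integers(0, 2, size=(n, 1)).astype(float)

sigmoid = lambda x: 1 / (1 + np.exp(-x))

# Forward pass (caching z1, a1, z2, a2)
z1 = X @ W1 + b1; a1 = sigmoid(z1)
z2 = a1 @ W2 + b2; a2 = sigmoid(z2)

# Backward pass; sigmoid'(z) is written via the cached activation a
dz2 = a2 - y                           # combined BCE + sigmoid derivative
dW2 = a1.T @ dz2
db2 = np.sum(dz2, axis=0, keepdims=True)
da1 = dz2 @ W2.T
dz1 = da1 * a1 * (1 - a1)
dW1 = X.T @ dz1
db1 = np.sum(dz1, axis=0, keepdims=True)

assert dW1.shape == W1.shape and db1.shape == b1.shape
assert dW2.shape == W2.shape and db2.shape == b2.shape
print("all gradient shapes match their parameters")
```

Most backprop bugs announce themselves as shape mismatches in exactly these lines.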
Pure NumPy Backprop – Full Training Loop
Every line explained. No frameworks, just math and NumPy.
import numpy as np

class NeuralNet:
    def __init__(self, input_size, hidden_size, output_size, lr=0.5):
        self.lr = lr
        self.W1 = np.random.randn(input_size, hidden_size) * 0.5
        self.b1 = np.zeros((1, hidden_size))
        self.W2 = np.random.randn(hidden_size, output_size) * 0.5
        self.b2 = np.zeros((1, output_size))

    def sigmoid(self, x):
        return 1 / (1 + np.exp(-x))

    def sigmoid_deriv(self, a):
        # Expects the activation a = sigmoid(z), not z itself
        return a * (1 - a)

    def forward(self, X):
        self.z1 = X @ self.W1 + self.b1
        self.a1 = self.sigmoid(self.z1)
        self.z2 = self.a1 @ self.W2 + self.b2
        self.a2 = self.sigmoid(self.z2)
        return self.a2

    def backward(self, X, y, output):
        m = X.shape[0]
        # Output layer gradients
        self.dz2 = output - y              # combined BCE + sigmoid derivative
        self.dW2 = (1/m) * self.a1.T @ self.dz2
        self.db2 = (1/m) * np.sum(self.dz2, axis=0, keepdims=True)
        # Hidden layer gradients
        self.da1 = self.dz2 @ self.W2.T
        self.dz1 = self.da1 * self.sigmoid_deriv(self.a1)
        self.dW1 = (1/m) * X.T @ self.dz1
        self.db1 = (1/m) * np.sum(self.dz1, axis=0, keepdims=True)

    def update(self):
        self.W1 -= self.lr * self.dW1
        self.b1 -= self.lr * self.db1
        self.W2 -= self.lr * self.dW2
        self.b2 -= self.lr * self.db2

    def train(self, X, y, epochs=5000):
        for i in range(epochs):
            output = self.forward(X)
            self.backward(X, y, output)
            self.update()
            if i % 1000 == 0:
                # Binary cross-entropy, matching the gradient used in backward()
                loss = -np.mean(y * np.log(output) + (1 - y) * np.log(1 - output))
                print(f'Epoch {i}, Loss: {loss:.6f}')

# XOR dataset
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([[0], [1], [1], [0]])

nn = NeuralNet(2, 4, 1, lr=0.7)
nn.train(X, y, epochs=6000)
print("Predictions:\n", nn.forward(X))
This implementation converges for XOR — the classic non-linear problem that a single perceptron cannot solve.
Gradient Checking: Verify Your Backprop
Numerical approximation of gradients to ensure analytical backprop is correct.
def numerical_gradient(f, params, epsilon=1e-7):
    """Finite difference approximation"""
    grads = []
    for param in params:
        grad = np.zeros_like(param)
        it = np.nditer(param, flags=['multi_index'], op_flags=['readwrite'])
        while not it.finished:
            idx = it.multi_index
            old_val = param[idx]
            param[idx] = old_val + epsilon
            f_plus = f()
            param[idx] = old_val - epsilon
            f_minus = f()
            grad[idx] = (f_plus - f_minus) / (2 * epsilon)
            param[idx] = old_val
            it.iternext()
        grads.append(grad)
    return grads

# Use: compare with backprop gradients (difference < 1e-6 is good)
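A self-contained usage sketch (the helper is repeated here so the snippet runs on its own): check a toy loss whose gradient is known in closed form, L(w) = sum(w²) with analytic gradient 2w:

```python
import numpy as np

def numerical_gradient(f, params, epsilon=1e-7):
    """Central finite-difference approximation of df/dparam."""
    grads = []
    for param in params:
        grad = np.zeros_like(param)
        it = np.nditer(param, flags=['multi_index'])
        while not it.finished:
            idx = it.multi_index
            old_val = param[idx]
            param[idx] = old_val + epsilon
            f_plus = f()
            param[idx] = old_val - epsilon
            f_minus = f()
            grad[idx] = (f_plus - f_minus) / (2 * epsilon)
            param[idx] = old_val            # restore the parameter
            it.iternext()
        grads.append(grad)
    return grads

# Toy check: L(w) = sum(w^2) has analytic gradient 2w
w = np.array([1.0, -2.0, 3.0])
analytic = 2 * w
numeric = numerical_gradient(lambda: np.sum(w ** 2), [w])[0]
print(np.max(np.abs(analytic - numeric)))   # well below 1e-6
```

In practice you would pass the network's loss as `f` and its weight matrices as `params`, then compare against the gradients from `backward()`.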
Vanishing / Exploding Gradients
Deep networks suffer from unstable gradients. Why?
Vanishing
Sigmoid/tanh saturate → gradients → 0. Early layers learn extremely slowly.
# Solution: ReLU, residual connections, batch norm
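The saturation effect is easy to see numerically: sigmoid'(z) = σ(z)(1−σ(z)) peaks at 0.25 (at z = 0), so a chain of n sigmoid layers scales the upstream gradient by at most 0.25ⁿ. A minimal sketch of the best case:

```python
import numpy as np

sigmoid = lambda z: 1 / (1 + np.exp(-z))

grad = 1.0
z = 0.0                       # best case: sigmoid'(0) = 0.25 is the maximum
for layer in range(10):
    s = sigmoid(z)
    grad *= s * (1 - s)       # multiply by sigmoid'(z) at each layer
print(grad)                   # 0.25**10 ≈ 9.5e-07
```

After only 10 layers the gradient has shrunk by six orders of magnitude, even under the most favorable inputs; saturated units (large |z|) make it far worse.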
Exploding
Large weights → gradients multiply exponentially → NaN.
# Solution: Gradient clipping, proper initialization
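Gradient clipping by global norm can be sketched in a few lines of NumPy; this mirrors what framework utilities such as PyTorch's `torch.nn.utils.clip_grad_norm_` do:

```python
import numpy as np

def clip_by_global_norm(grads, max_norm=1.0):
    # Rescale all gradients jointly if their combined L2 norm exceeds max_norm
    total_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    if total_norm > max_norm:
        scale = max_norm / total_norm
        grads = [g * scale for g in grads]
    return grads

grads = [np.array([3.0, 4.0]), np.array([12.0])]      # global norm = 13
clipped = clip_by_global_norm(grads, max_norm=1.0)
print(np.sqrt(sum(np.sum(g ** 2) for g in clipped)))  # 1.0
```

Clipping jointly (rather than per tensor) preserves the direction of the update while bounding its magnitude.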
Modern mitigations
- ReLU/Leaky ReLU activations
- Xavier/He initialization
- Batch Normalization
- Residual connections (ResNet)
- Gradient clipping
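Two items from the list above, Xavier and He initialization, amount to choosing the weight variance from the layer's fan-in so that signal magnitudes stay roughly constant across layers. A minimal sketch:

```python
import numpy as np

rng = np.random.default_rng(0)

def xavier_init(fan_in, fan_out):
    # Glorot/Xavier: variance 2 / (fan_in + fan_out), suited to tanh/sigmoid
    std = np.sqrt(2.0 / (fan_in + fan_out))
    return rng.standard_normal((fan_in, fan_out)) * std

def he_init(fan_in, fan_out):
    # He: variance 2 / fan_in, suited to ReLU (half the units output zero)
    std = np.sqrt(2.0 / fan_in)
    return rng.standard_normal((fan_in, fan_out)) * std

W = he_init(512, 256)
print(W.std())   # close to sqrt(2/512) ≈ 0.0625
```

Compare this with the fixed `* 0.5` scaling in the NumPy network above, which works for a tiny XOR net but would misbehave at depth.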
Backprop in TensorFlow & PyTorch
Autograd computes gradients automatically — but understanding backprop helps you debug and design architectures.
TensorFlow
with tf.GradientTape() as tape:
    y_pred = model(X)
    loss = tf.keras.losses.MSE(y, y_pred)
grads = tape.gradient(loss, model.trainable_variables)
optimizer.apply_gradients(zip(grads, model.trainable_variables))
PyTorch
y_pred = model(X)
loss = nn.MSELoss()(y_pred, y)
loss.backward() # <-- one line backprop!
optimizer.step()
Backpropagation = The Learning Algorithm
Every neural network — from 3-layer MLPs to GPT-4 — is trained with backpropagation (or a variant of it). Mastering backprop gives you superpowers: you can implement new architectures, fix vanishing gradients, and truly understand deep learning.
When You Need Deep Knowledge of Backprop
Custom Layers
Implement your own forward/backward in frameworks.
Debugging
Why are gradients NaN? Why isn't this layer learning?
Research
Modify gradient flow (e.g., reversible nets, synthetic gradients).