
Backpropagation: The Engine of Deep Learning

Master the algorithm that trains neural networks. Chain rule, computational graphs, gradient flow, and pure NumPy implementation — no black boxes.

Topics covered:
  • Chain Rule: multivariate calculus
  • Computational Graph: forward & backward passes
  • Gradient Flow: weight updates
  • NumPy: implementation from scratch

Why Backpropagation? The Credit Assignment Problem

In a multi-layer network, how does a small change in an early weight affect the final loss? Backpropagation (Rumelhart, Hinton & Williams, 1986) elegantly solves this credit assignment problem by recursively applying the chain rule.

Historical Breakthrough
  • 1986 Backpropagation popularized
  • 1989 Universal approximation proven
  • 2012 AlexNet (backprop + GPU) wins ImageNet
Intuition

Backpropagation = forward pass computes predictions, backward pass propagates error gradients from output to each weight. "How much did each weight contribute to the error?"

Prerequisites: Partial derivatives, chain rule, gradient descent. We'll derive everything step by step.

Chain Rule: From Calculus to Computation

Backpropagation is the chain rule — applied efficiently to millions of parameters.

Scalar Chain Rule

If y = f(g(x)), then dy/dx = (dy/dg) · (dg/dx)

Multivariate: for L = f(z) with z = Wx + b:

∂L/∂W = (∂L/∂z) · (∂z/∂W)

Stacking layers simply extends the chain. For the first-layer weights W₁ of a 3-layer network (zₗ are pre-activations, aₗ activations):

∂L/∂W₁ = ∂L/∂a₃ · ∂a₃/∂z₃ · ∂z₃/∂a₂ · ∂a₂/∂z₂ · ∂z₂/∂a₁ · ∂a₁/∂z₁ · ∂z₁/∂W₁

The gradient flows backward through every intermediate function.
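The scalar rule is easy to verify numerically. A minimal sketch (the functions f(u) = sin(u) and g(x) = x² are illustrative choices, not taken from the derivation above):

```python
import numpy as np

# Illustrative composition: y = f(g(x)) with g(x) = x**2 and f(u) = sin(u)
x = 1.5
g = x ** 2
dy_dg = np.cos(g)          # f'(g)
dg_dx = 2 * x              # g'(x)
analytic = dy_dg * dg_dx   # chain rule: dy/dx = f'(g(x)) * g'(x)

# Central finite difference on the composed function
eps = 1e-6
numeric = (np.sin((x + eps) ** 2) - np.sin((x - eps) ** 2)) / (2 * eps)
print(analytic, numeric)   # the two values should agree to several decimals
```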

Computational Graphs: Visualizing Backprop

Modern frameworks (TensorFlow, PyTorch) build a computational graph during forward pass, then traverse it in reverse to compute gradients.

Forward:

x → *3 → +5 → z

Backward:

dz = 1         # start: ∂z/∂z = 1
da = dz * 1    # the "+5" node passes the gradient through unchanged
dx = da * 3    # the "*3" node scales the gradient by 3
Automatic Differentiation
  • Forward mode: compute derivatives alongside values
  • Reverse mode (backprop): one forward pass, one backward pass → all gradients
  • Efficient for many parameters (typical deep learning)
🧮 Manual backprop through simple graph (NumPy)
# Forward pass: z = (x * 3) + 5
x = 2.0
a = x * 3      # a = 6
z = a + 5      # z = 11

# Backward pass (dz/dz = 1)
dz = 1
da = dz * 1    # dz/da = 1
dx = da * 3    # da/dx = 3
print(dx)      # Gradient = 3

Backpropagation Through a 2‑Layer MLP

X → (W1) → z1 → σ → a1 → (W2) → z2 → σ → a2 → Loss L(a2,y)
← dW1 ← dz1 ← da1 ← dW2 ← dz2 ← dL/da2
Forward Pass (Caching)
z1 = X @ W1 + b1
a1 = sigmoid(z1)
z2 = a1 @ W2 + b2
a2 = sigmoid(z2)
loss = binary_crossentropy(y, a2)
Backward Pass (Gradients)
da2 = -(y/a2 - (1-y)/(1-a2))  # BCE derivative
dz2 = da2 * sigmoid_prime(z2) # (a2*(1-a2))
dW2 = a1.T @ dz2
db2 = np.sum(dz2, axis=0)

da1 = dz2 @ W2.T
dz1 = da1 * sigmoid_prime(z1)
dW1 = X.T @ dz1
db1 = np.sum(dz1, axis=0)
Key pattern: gradient w.r.t. a weight matrix = activation_in.T @ gradient_out. The same pattern holds for every fully connected layer.
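The pattern can be sanity-checked with shapes alone; a short sketch with illustrative dimensions:

```python
import numpy as np

batch, d_in, d_out = 32, 4, 3
a_in = np.random.randn(batch, d_in)   # activations entering the layer
dz = np.random.randn(batch, d_out)    # gradient arriving from the layer above

dW = a_in.T @ dz          # (d_in, batch) @ (batch, d_out) -> same shape as W
db = np.sum(dz, axis=0)   # bias gradient: sum over the batch dimension

print(dW.shape, db.shape)  # (4, 3) (3,)
```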

Pure NumPy Backprop – Full Training Loop

Every line explained. No frameworks, just math and NumPy.

⚙️ NeuralNetwork with backprop (XOR example)
import numpy as np

class NeuralNet:
    def __init__(self, input_size, hidden_size, output_size, lr=0.5):
        self.lr = lr
        self.W1 = np.random.randn(input_size, hidden_size) * 0.5
        self.b1 = np.zeros((1, hidden_size))
        self.W2 = np.random.randn(hidden_size, output_size) * 0.5
        self.b2 = np.zeros((1, output_size))
    
    def sigmoid(self, x):
        return 1 / (1 + np.exp(-x))
    
    def sigmoid_deriv(self, a):
        # Expects the sigmoid *output* a, not the pre-activation: σ'(z) = a·(1-a)
        return a * (1 - a)
    
    def forward(self, X):
        self.z1 = X @ self.W1 + self.b1
        self.a1 = self.sigmoid(self.z1)
        self.z2 = self.a1 @ self.W2 + self.b2
        self.a2 = self.sigmoid(self.z2)
        return self.a2
    
    def backward(self, X, y, output):
        m = X.shape[0]
        # Output layer gradients
        self.dz2 = output - y.reshape(-1,1)     # BCE derivative simplification
        self.dW2 = (1/m) * self.a1.T @ self.dz2
        self.db2 = (1/m) * np.sum(self.dz2, axis=0, keepdims=True)
        # Hidden layer gradients
        self.da1 = self.dz2 @ self.W2.T
        self.dz1 = self.da1 * self.sigmoid_deriv(self.a1)
        self.dW1 = (1/m) * X.T @ self.dz1
        self.db1 = (1/m) * np.sum(self.dz1, axis=0, keepdims=True)
    
    def update(self):
        self.W1 -= self.lr * self.dW1
        self.b1 -= self.lr * self.db1
        self.W2 -= self.lr * self.dW2
        self.b2 -= self.lr * self.db2
    
    def train(self, X, y, epochs=5000):
        for i in range(epochs):
            output = self.forward(X)
            self.backward(X, y, output)
            self.update()
            if i % 1000 == 0:
                # MSE is printed only for monitoring; the gradients in backward()
                # use the BCE-with-sigmoid simplification (dz2 = output - y)
                loss = np.mean((output - y)**2)
                print(f'Epoch {i}, Loss: {loss:.6f}')

# XOR dataset
X = np.array([[0,0],[0,1],[1,0],[1,1]])
y = np.array([[0],[1],[1],[0]])

nn = NeuralNet(2, 4, 1, lr=0.7)
nn.train(X, y, epochs=6000)
print("Predictions:\n", nn.forward(X))

This implementation converges for XOR — the classic non-linear problem that a single perceptron cannot solve.

Gradient Checking: Verify Your Backprop

Numerical approximation of gradients to ensure analytical backprop is correct.

🔬 Numerical gradient vs backprop
def numerical_gradient(f, params, epsilon=1e-7):
    """Finite difference approximation"""
    grads = []
    for param in params:
        grad = np.zeros_like(param)
        it = np.nditer(param, flags=['multi_index'], op_flags=['readwrite'])
        while not it.finished:
            idx = it.multi_index
            old_val = param[idx]
            param[idx] = old_val + epsilon
            f_plus = f()
            param[idx] = old_val - epsilon
            f_minus = f()
            grad[idx] = (f_plus - f_minus) / (2 * epsilon)
            param[idx] = old_val
            it.iternext()
        grads.append(grad)
    return grads

# Use: compare with backprop gradients (difference < 1e-6 is good)
Always gradient-check when implementing backprop from scratch!
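As a concrete usage sketch, the same central-difference idea applied to a tiny model (the linear-regression loss here is an illustrative stand-in for a full network):

```python
import numpy as np

# Illustrative model: linear regression, loss = mean((X @ w - y)**2)
rng = np.random.default_rng(0)
X = rng.normal(size=(8, 3))
y = rng.normal(size=(8, 1))
w = rng.normal(size=(3, 1))

def loss():
    return np.mean((X @ w - y) ** 2)

# Analytic gradient: dL/dw = (2/m) * X.T @ (X @ w - y)
analytic = (2 / X.shape[0]) * X.T @ (X @ w - y)

# Numerical gradient: central differences, one weight at a time
eps = 1e-7
numeric = np.zeros_like(w)
for i in range(w.shape[0]):
    old = w[i, 0]
    w[i, 0] = old + eps
    f_plus = loss()
    w[i, 0] = old - eps
    f_minus = loss()
    w[i, 0] = old
    numeric[i, 0] = (f_plus - f_minus) / (2 * eps)

print(np.max(np.abs(analytic - numeric)))  # should be well below 1e-6
```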

Vanishing / Exploding Gradients

Deep networks suffer from unstable gradients. Why?

Vanishing

Sigmoid/tanh saturate → gradients → 0. Early layers learn extremely slowly.

# Solution: ReLU, residual connections, batch norm
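The shrinkage is easy to see numerically: sigmoid′(z) = σ(z)·(1 − σ(z)) never exceeds 0.25, so a chain of sigmoid layers multiplies the gradient by at most 0.25 per layer. A minimal sketch (depths are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Each sigmoid layer multiplies the gradient by sigmoid'(z) = s*(1-s) <= 0.25
s = sigmoid(0.0)        # z = 0 is the *best* case: the derivative is maximal there
factor = s * (1 - s)    # 0.25

for depth in (5, 10, 20):
    print(depth, factor ** depth)
# Even in the best case the signal shrinks 4x per layer: 0.25**20 ≈ 9.1e-13
```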
Exploding

Large weights → gradients multiply exponentially → NaN.

# Solution: Gradient clipping, proper initialization
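Global-norm gradient clipping might be sketched in NumPy like this (the threshold 5.0 is an illustrative choice):

```python
import numpy as np

def clip_by_global_norm(grads, max_norm=5.0):
    """Rescale all gradients so their combined L2 norm is at most max_norm."""
    total_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    scale = min(1.0, max_norm / (total_norm + 1e-12))
    return [g * scale for g in grads]

# Exploding gradients (huge values) are rescaled; their direction is preserved
grads = [np.full((2, 2), 100.0), np.full((2,), 50.0)]
clipped = clip_by_global_norm(grads, max_norm=5.0)
print(np.sqrt(sum(np.sum(g ** 2) for g in clipped)))  # ≈ 5.0
```

PyTorch's `torch.nn.utils.clip_grad_norm_` implements the same idea.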
Modern mitigations
  • ReLU/Leaky ReLU activations
  • Xavier/He initialization
  • Batch Normalization
  • Residual connections (ResNet)
  • Gradient clipping

Backprop in TensorFlow & PyTorch

Autograd computes gradients automatically — but understanding backprop helps you debug and design architectures.

TensorFlow
with tf.GradientTape() as tape:
    y_pred = model(X)
    loss = tf.reduce_mean(tf.keras.losses.MSE(y, y_pred))
grads = tape.gradient(loss, model.trainable_variables)
optimizer.apply_gradients(zip(grads, model.trainable_variables))
PyTorch
y_pred = model(X)
loss = nn.MSELoss()(y_pred, y)
loss.backward()  # <-- one line backprop!
optimizer.step()

Both frameworks rely on autograd over a computational graph: PyTorch builds the graph dynamically on every forward pass, while classic TensorFlow 1.x compiled a static graph (TF2's GradientTape also records operations dynamically).

Backpropagation = The Learning Algorithm

Every neural network — from 3-layer MLPs to GPT-4 — is trained using backpropagation (or a variant of it). Mastering backprop gives you superpowers: you can implement new architectures, fix vanishing gradients, and truly understand deep learning.

When You Need Deep Backprop Knowledge

Custom Layers

Implement your own forward/backward in frameworks.

Debugging

Why are gradients NaN? Why isn't this layer learning?

Research

Modify gradient flow (e.g., reversible nets, synthetic gradients).

Ready for advanced architectures? Next, learn how optimizers (SGD, Adam) use these gradients to update weights.