Deep Learning

Neural Network Fundamentals

Neural networks, activation functions, loss functions, backpropagation, and optimizers for deep learning.

Neural Networks Basics

The Perceptron — First Neural Model

Invented by Frank Rosenblatt in 1958, the perceptron is the simplest neural network: a single neuron that classifies linearly separable patterns.

How it works
  • 1 Weighted sum: z = w·x + b
  • 2 Step activation: 1 if z ≥ 0 else 0
  • 3 Update: w = w + lr*(y - ŷ)*x
Limitation

Only linearly separable functions (AND, OR) can be learned; a single perceptron cannot learn XOR. This limitation, highlighted by Minsky and Papert in 1969, contributed to the first AI winter and motivated multi-layer networks.

key insight: depth matters
📁 Perceptron from scratch – NumPy
import numpy as np

class Perceptron:
    def __init__(self, lr=0.01, epochs=15):
        self.lr = lr
        self.epochs = epochs
        self.weights = None
        self.bias = None

    def activation(self, z):
        return 1 if z >= 0 else 0

    def fit(self, X, y):
        n_samples, n_features = X.shape
        self.weights = np.zeros(n_features)
        self.bias = 0

        for _ in range(self.epochs):
            for idx, x_i in enumerate(X):
                linear = np.dot(x_i, self.weights) + self.bias
                y_pred = self.activation(linear)
                update = self.lr * (y[idx] - y_pred)
                self.weights += update * x_i
                self.bias += update

    def predict(self, X):
        linear = np.dot(X, self.weights) + self.bias
        return np.array([self.activation(z) for z in linear])

Try it on the AND gate: training converges in fewer than 10 epochs.
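A minimal sketch of the AND-gate experiment, assuming the update rule shown above. The helper name `train_perceptron`, the per-epoch mistake counter, and the choice lr=1.0 (which keeps the arithmetic exact) are illustrative assumptions, not part of the class above.

```python
import numpy as np

def train_perceptron(X, y, lr=1.0, epochs=10):
    """Perceptron learning rule; returns weights, bias, and mistakes per epoch."""
    w = np.zeros(X.shape[1])
    b = 0.0
    mistakes_per_epoch = []
    for _ in range(epochs):
        mistakes = 0
        for x_i, y_i in zip(X, y):
            y_pred = 1 if np.dot(x_i, w) + b >= 0 else 0   # step activation
            update = lr * (y_i - y_pred)                   # zero when the prediction is right
            w += update * x_i
            b += update
            mistakes += int(update != 0)
        mistakes_per_epoch.append(mistakes)
    return w, b, mistakes_per_epoch

# AND gate truth table
X_and = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y_and = np.array([0, 0, 0, 1])

w, b, mistakes = train_perceptron(X_and, y_and)
preds = (X_and @ w + b >= 0).astype(int)
print("predictions:", preds)
print("mistakes per epoch:", mistakes)
```

An epoch with zero mistakes means the weights no longer change, which is the stopping condition the perceptron convergence theorem guarantees for linearly separable data.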

Activation Functions: Non-linearity is key

Without activation functions, stacked linear layers collapse into one linear transformation. Non-linear activations are what let deep networks approximate a very broad class of functions.

Sigmoid
def sigmoid(x):
    return 1/(1+np.exp(-x))

Range (0,1), well suited to binary outputs, but prone to vanishing gradients when it saturates.

Tanh
def tanh(x):
    return np.tanh(x)

Range (-1,1), zero-centered, stronger gradients.

ReLU
def relu(x):
    return np.maximum(0,x)

No saturation, sparse; dead neurons risk.

Leaky ReLU
def leaky_relu(x, alpha=0.01):
    return np.where(x>0, x, alpha*x)
Softmax
def softmax(x):
    ex = np.exp(x - np.max(x))
    return ex / ex.sum()

Multi-class probability.

Selection rule: ReLU for hidden layers, sigmoid for binary output, softmax for multi-class.

Forward Propagation & Backpropagation

Input X → [W1,b1] → z1 → a1 = σ(z1) → [W2,b2] → z2 → a2 = σ(z2) → Loss L(ŷ,y)
↻ Backward: dL/dW2 ← dL/da2 * da2/dz2 * dz2/dW2 ... chain rule
Forward pass

Compute activations layer by layer, cache intermediate values for gradient.

Backward pass (chain rule)

∂L/∂W = (∂L/∂a) * (∂a/∂z) * (∂z/∂W)

🔁 Backpropagation in 2-layer net (NumPy)
# assumes sigmoid activations and binary cross-entropy loss,
# for which the output-layer gradient simplifies to dL/dz2 = a2 - y
def backward(self, X, y, a1, a2):
    m = X.shape[0]
    # output layer gradient
    dz2 = a2 - y.reshape(-1,1)          # dL/dz2
    dW2 = (1/m) * a1.T @ dz2
    db2 = (1/m) * np.sum(dz2, axis=0, keepdims=True)
    # hidden layer gradient
    da1 = dz2 @ self.W2.T
    dz1 = da1 * (a1 * (1 - a1))         # sigmoid derivative
    dW1 = (1/m) * X.T @ dz1
    db1 = (1/m) * np.sum(dz1, axis=0, keepdims=True)
    return dW1, db1, dW2, db2

Multi-Layer Perceptron (MLP) from Scratch

Complete implementation of a flexible neural network with one hidden layer using only NumPy. Foundation for modern deep learning.

🧠 MLP class – forward, backward, train
import numpy as np

class MLP:
    def __init__(self, input_size, hidden_size, output_size, lr=0.1):
        self.lr = lr
        self.W1 = np.random.randn(input_size, hidden_size) * 0.5
        self.b1 = np.zeros((1, hidden_size))
        self.W2 = np.random.randn(hidden_size, output_size) * 0.5
        self.b2 = np.zeros((1, output_size))

    def sigmoid(self, x):
        return 1 / (1 + np.exp(-x))

    def sigmoid_deriv(self, x):
        return x * (1 - x)

    def forward(self, X):
        self.z1 = X @ self.W1 + self.b1
        self.a1 = self.sigmoid(self.z1)
        self.z2 = self.a1 @ self.W2 + self.b2
        self.a2 = self.sigmoid(self.z2)
        return self.a2

    def backward(self, X, y, output):
        m = X.shape[0]
        self.dz2 = output - y.reshape(-1,1)
        self.dW2 = (1/m) * self.a1.T @ self.dz2
        self.db2 = (1/m) * np.sum(self.dz2, axis=0, keepdims=True)
        self.da1 = self.dz2 @ self.W2.T
        self.dz1 = self.da1 * self.sigmoid_deriv(self.a1)
        self.dW1 = (1/m) * X.T @ self.dz1
        self.db1 = (1/m) * np.sum(self.dz1, axis=0, keepdims=True)

    def update(self):
        self.W1 -= self.lr * self.dW1
        self.b1 -= self.lr * self.db1
        self.W2 -= self.lr * self.dW2
        self.b2 -= self.lr * self.db2

    def fit(self, X, y, epochs=1000):
        for i in range(epochs):
            output = self.forward(X)
            self.backward(X, y, output)
            self.update()
            if i % 200 == 0:
                # MSE is logged for monitoring only; the gradients in backward() are those of BCE
                loss = np.mean((output - y.reshape(-1, 1))**2)
                print(f"epoch {i}, loss: {loss:.6f}")
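Test on XOR: a 2→4→1 sigmoid network can drive the MSE below 0.005, though the number of epochs needed depends on the random initialization and the learning rate (a larger lr than the default 0.1, or more than 2000 epochs, may be required).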

Neural Nets in Keras & PyTorch

TensorFlow/Keras
import tensorflow as tf
model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation='relu', input_shape=(4,)),
    tf.keras.layers.Dense(32, activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid')
])
model.compile(optimizer='adam', loss='binary_crossentropy')  # sigmoid output, so pair it with BCE
PyTorch
import torch
import torch.nn as nn
class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(4, 8)
        self.fc2 = nn.Linear(8, 4)
        self.out = nn.Linear(4, 1)
    def forward(self, x):
        x = torch.relu(self.fc1(x))
        x = torch.relu(self.fc2(x))
        return torch.sigmoid(self.out(x))


Weight Initialization & Optimizers

Initialization
  • Zero init → symmetry, no learning
  • Small random (0.01) – ok for shallow
  • Xavier/Glorot for sigmoid/tanh
  • He init for ReLU
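The Xavier/Glorot and He schemes listed above differ only in the variance they assign to the weights. A minimal sketch, assuming the normal-distribution variants and a (fan_in, fan_out) weight layout; the function names are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def xavier_init(fan_in, fan_out):
    """Glorot normal: Var(W) = 2 / (fan_in + fan_out), suited to sigmoid/tanh."""
    std = np.sqrt(2.0 / (fan_in + fan_out))
    return rng.normal(0.0, std, size=(fan_in, fan_out))

def he_init(fan_in, fan_out):
    """He normal: Var(W) = 2 / fan_in, suited to ReLU."""
    std = np.sqrt(2.0 / fan_in)
    return rng.normal(0.0, std, size=(fan_in, fan_out))

W_xavier = xavier_init(512, 256)
W_he = he_init(512, 256)
print("Xavier std:", W_xavier.std())
print("He std:    ", W_he.std())
```

He initialization uses a larger variance because ReLU zeroes roughly half of the pre-activations, which would otherwise shrink the signal layer by layer.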
Optimizers

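Batch GD, SGD, and mini-batch GD differ in how much data each update uses. Momentum accelerates updates along consistent directions; RMSprop and Adam adapt per-parameter learning rates.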

optimizer = tf.keras.optimizers.Adam(learning_rate=0.001)

Why do neural networks work?

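Universal Approximation Theorem: A feedforward network with a single hidden layer can approximate any continuous function on a compact domain to arbitrary accuracy, given enough hidden neurons and a suitable non-linear activation. The theorem guarantees that such a network exists, not that training will find it.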

Real‑world usage

Regression & Forecasting

Housing prices, stock trends, energy load.

Classification

Spam detection, credit risk, medical diagnosis.

Feature learning

Autoencoders, embeddings, representation learning.

Activation Functions: The Non-Linear Gatekeepers

Why do we need activation functions?

Without activation functions, neural networks would just be linear transformations. No matter how many layers, a linear combination of linear functions is still linear. Activation functions introduce non-linearity, allowing the network to learn complex patterns, decision boundaries, and hierarchical representations.

Weighted sum (z = w·x + b) Activation f(z) Output (non-linear)

Every neuron applies an activation function to its weighted input.
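The claim that stacked linear layers stay linear can be checked numerically. A small sketch with arbitrary random weights (the shapes and seed are assumptions made for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 3))                      # 5 samples, 3 features
W1, b1 = rng.normal(size=(3, 4)), rng.normal(size=4)
W2, b2 = rng.normal(size=(4, 2)), rng.normal(size=2)

# Two stacked linear layers with no activation in between
two_layers = (x @ W1 + b1) @ W2 + b2

# The same mapping written as a single linear layer
W_merged = W1 @ W2
b_merged = b1 @ W2 + b2
one_layer = x @ W_merged + b_merged
print("linear stack equals one layer:", np.allclose(two_layers, one_layer))

# Inserting a non-linearity (ReLU) breaks the equivalence
with_relu = np.maximum(0, x @ W1 + b1) @ W2 + b2
print("ReLU stack equals one layer:  ", np.allclose(with_relu, one_layer))
```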

Classic Activation Functions: Sigmoid & Tanh

Sigmoid (σ)

Formula: σ(x) = 1 / (1 + e^(-x))

Derivative: σ(x)(1-σ(x))

import numpy as np
def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def sigmoid_derivative(x):
    s = sigmoid(x)
    return s * (1 - s)

Output in (0,1). Prone to vanishing gradients. Used in the output layer for binary classification.

Tanh (Hyperbolic Tangent)

Formula: tanh(x) = (e^x - e^(-x)) / (e^x + e^(-x))

Derivative: 1 - tanh²(x)

def tanh(x):
    return np.tanh(x)  # or manual

def tanh_derivative(x):
    return 1 - np.tanh(x)**2

Zero-centered, output in (-1,1). Stronger gradients than sigmoid, but it still saturates.

Vanishing Gradient: Both Sigmoid and Tanh squash large inputs, making gradients near zero. Deep networks struggle to learn.

ReLU & Family: Solving Vanishing Gradient

ReLU (Rectified Linear Unit)

f(x) = max(0, x)

def relu(x):
    return np.maximum(0, x)
# derivative: 1 if x>0 else 0

Pros: Computationally cheap, sparse, no saturation for x>0. Cons: Dying ReLU (neurons stuck at 0).

Leaky ReLU

f(x) = x if x>0 else αx (α small, e.g., 0.01)

def leaky_relu(x, alpha=0.01):
    return np.where(x > 0, x, alpha * x)

Fixes dying ReLU; allows gradient flow for negative values.

ELU (Exponential Linear Unit)

f(x)= x if x>0 else α(e^x -1)

Smooth, negative values push mean closer to zero. Faster learning.
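The ELU formula above translates directly to NumPy. A minimal sketch, assuming α = 1.0 as the default (a common choice, not stated in the text):

```python
import numpy as np

def elu(x, alpha=1.0):
    x = np.asarray(x, dtype=float)
    # np.minimum keeps exp() from overflowing on large positive inputs,
    # since np.where evaluates both branches
    return np.where(x > 0, x, alpha * (np.exp(np.minimum(x, 0.0)) - 1.0))

def elu_derivative(x, alpha=1.0):
    x = np.asarray(x, dtype=float)
    return np.where(x > 0, 1.0, alpha * np.exp(np.minimum(x, 0.0)))

xs = np.array([-3.0, -1.0, 0.0, 1.0, 3.0])
print(elu(xs))
print(elu_derivative(xs))
```

For very negative inputs the output approaches -α, so the negative side saturates smoothly instead of being cut to zero as in ReLU.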

PReLU (Parametric ReLU)

α is learned during training.

# TensorFlow: tf.keras.layers.PReLU()

Softmax: From Logits to Probabilities

Softmax is used in the output layer for multi-class classification. It converts a vector of raw scores (logits) into a probability distribution over classes.

Softmax + CrossEntropy
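def softmax(logits):
    # subtract the row-wise max for numerical stability (works for 1-D and batched input)
    exp_shifted = np.exp(logits - np.max(logits, axis=-1, keepdims=True))
    return exp_shifted / np.sum(exp_shifted, axis=-1, keepdims=True)

# Example: logits = [2.0, 1.0, 0.1] -> probabilities ≈ [0.659, 0.242, 0.099], sum = 1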
Key property

All outputs ∈ (0,1) and sum to 1.
Ideal for mutually exclusive classes.
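A short worked example of the two properties above, plus the reason for the max shift. This is a self-contained sketch; the large logits are an arbitrary choice to trigger the overflow case:

```python
import numpy as np

def softmax(logits):
    logits = np.asarray(logits, dtype=float)
    shifted = logits - np.max(logits, axis=-1, keepdims=True)
    exp_shifted = np.exp(shifted)
    return exp_shifted / np.sum(exp_shifted, axis=-1, keepdims=True)

probs = softmax([2.0, 1.0, 0.1])
print(np.round(probs, 3))          # approximately [0.659 0.242 0.099]
print(probs.sum())

# Without the shift, np.exp(1000.0) overflows to inf; with it the result is finite.
# Softmax is shift-invariant, so this equals softmax([0, 1, 2]).
big = softmax([1000.0, 1001.0, 1002.0])
print(np.round(big, 3))
```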

Modern Activation Functions (Transformers, CNNs)

Swish (Google, 2017)

f(x) = x * sigmoid(βx) (β learnable or constant =1)

Smooth, non-monotonic. Reported to match or outperform ReLU on a number of deep models.

# TF: tf.keras.activations.swish
GELU (Gaussian Error Linear Unit)

GELU(x) = x * Φ(x) (Φ is CDF of Gaussian).

Used in BERT, GPT, ViT. Smooth ReLU variant.

# from transformers, torch.nn.GELU
Mish (2019)

f(x) = x * tanh(softplus(x)).

Self-regularized, slightly better than Swish on some benchmarks.

Trend: Smooth, non-monotonic, often with no saturation. Swish & GELU are default in many modern architectures.
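The three formulas above can be sketched in NumPy as follows. GELU is written with the widely used tanh approximation rather than the exact Gaussian CDF; that substitution and the function names are assumptions made to keep the example dependency-free.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def swish(x, beta=1.0):
    return x * sigmoid(beta * x)

def gelu_tanh(x):
    """Tanh approximation of GELU(x) = x * Phi(x)."""
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def mish(x):
    softplus = np.logaddexp(0.0, x)      # log(1 + e^x), overflow-safe
    return x * np.tanh(softplus)

x = np.linspace(-4.0, 4.0, 9)
for name, fn in [("swish", swish), ("gelu", gelu_tanh), ("mish", mish)]:
    print(name, np.round(fn(x), 3))
```

All three dip slightly below zero for moderately negative inputs and then return toward zero, which is the non-monotonic behavior mentioned above.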

Activation Function Selection Guide

Function | Range | When to use | Common in
Sigmoid | (0,1) | Binary output, probabilistic gate | Logistic regression, some attention
Tanh | (-1,1) | Zero-centered hidden layers (older RNNs) | LSTM candidate state
ReLU | [0,∞) | Default for hidden layers (CNNs, MLPs) | ResNet, VGG, YOLO
Leaky ReLU | (-∞,∞) | Avoid dead neurons | GANs, some detection models
Softmax | (0,1), sum=1 | Multi-class classification output | Classification heads
Swish / SiLU | ≈[-0.28,∞) | Deep transformer-style models | EfficientNet, RL
GELU | ≈[-0.17,∞) | NLP Transformers (BERT, GPT) | Hugging Face models

Rule of thumb:

  • Start with ReLU for hidden layers.
  • For output: Sigmoid (binary), Softmax (multi-class), Linear (regression).
  • If ReLU causes dead neurons → Leaky ReLU / ELU.
  • For Transformers: GELU.
  • For very deep nets: consider Swish.

Activation Functions in TensorFlow & PyTorch

TensorFlow / Keras
import tensorflow as tf
# As activation string or layer
model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation='relu'),
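    # the 'leaky_relu' string requires a recent Keras release;
    # a separate tf.keras.layers.LeakyReLU() layer is the portable alternative
    tf.keras.layers.Dense(32, activation='leaky_relu'),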
    tf.keras.layers.Dense(10, activation='softmax')
])
# Advanced: tf.keras.activations.gelu, tf.nn.swish
PyTorch
import torch.nn as nn
class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(784, 128)
        self.act1 = nn.ReLU()
        self.fc2 = nn.Linear(128, 64)
        self.act2 = nn.LeakyReLU(0.02)
        self.out = nn.Linear(64, 10)
    def forward(self, x):
        x = self.act1(self.fc1(x))
        x = self.act2(self.fc2(x))
        return self.out(x)  # with CrossEntropyLoss, no softmax needed

Activation shapes at a glance

    Sigmoid:    S-shaped curve, squashes to (0,1)
    Tanh:       S-shaped curve, squashes to (-1,1)
    ReLU:       flat at 0 for x<0, then linear: max(0,x)
    Leaky ReLU: like ReLU, with a small negative slope for x<0
    Softmax:    vector of probabilities, e.g. [0.2, 0.7, 0.1]

Activation Pitfalls & Best Practices

⚠️ Vanishing gradient: Avoid Sigmoid/Tanh in deep hidden layers. Use ReLU or variants.
🎯 Dead ReLU: Use Leaky ReLU or ELU if many neurons output zero forever.
💡 Numerical stability: For softmax, always subtract max(logits) before exponentiation.
🧠 Output layer: Match activation to task: linear (regression), sigmoid (binary), softmax (multi-class).

Activation Function Cheatsheet

Sigmoid 0-1
Tanh -1 to 1
ReLU max(0,x)
Leaky ReLU α=0.01
ELU α(e^x-1)
Swish x·sigmoid
GELU x·Φ(x)
Softmax ∑=1
Next Up: Loss Functions – How neural networks learn.

Loss Functions: The Compass of Neural Networks

What is a Loss Function?

A loss function (also called cost/objective function) maps the model's predictions and ground truth to a scalar value. Lower loss = better predictions. During training, backpropagation computes gradients of the loss w.r.t. weights, and optimizers update weights to minimize this loss.

Predictions (ŷ) + True Targets (y) → Loss = L(ŷ, y) → Gradient ∇L

Loss functions define the learning objective.

Regression Losses: Predicting Continuous Values

MSE (L2 Loss)

MSE = 1/n Σ(y - ŷ)²

import numpy as np
def mse(y_true, y_pred):
    return np.mean((y_true - y_pred)**2)

Differentiable. Sensitive to outliers. The most common regression loss.

MAE (L1 Loss)

MAE = 1/n Σ|y - ŷ|

def mae(y_true, y_pred):
    return np.mean(np.abs(y_true - y_pred))

Robust to outliers. Not differentiable at 0.

Huber Loss

Lδ = { ½(y-ŷ)² for |y-ŷ| ≤ δ, else δ|y-ŷ| - ½δ² }

def huber(y_true, y_pred, delta=1.0):
    error = y_true - y_pred
    abs_error = np.abs(error)
    quadratic = 0.5 * error**2                     # used where |error| <= delta
    linear = delta * abs_error - 0.5 * delta**2    # used where |error| > delta
    return np.mean(np.where(abs_error <= delta, quadratic, linear))

Combines MSE and MAE. Smooth, robust.

Log-Cosh Loss

L = log(cosh(ŷ - y))

Smooth approximation to MAE, twice differentiable.

Quantile Loss

L = Σ max(q(y-ŷ), (q-1)(y-ŷ))

Used for estimating quantiles and building prediction intervals.
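Neither log-cosh nor quantile loss has code above, so here is a minimal NumPy sketch of both. It assumes the losses are averaged over samples (the quantile formula above is written as a sum) and uses an overflow-safe rewrite of log(cosh(·)); the function names are illustrative.

```python
import numpy as np

def log_cosh(y_true, y_pred):
    """Mean log(cosh(error)), computed without overflowing cosh."""
    e = np.abs(np.asarray(y_pred, dtype=float) - np.asarray(y_true, dtype=float))
    # identity: log(cosh(e)) = e + log(1 + exp(-2e)) - log(2)
    return np.mean(e + np.log1p(np.exp(-2.0 * e)) - np.log(2.0))

def quantile_loss(y_true, y_pred, q=0.5):
    """Pinball loss for quantile q, averaged over samples."""
    diff = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return np.mean(np.maximum(q * diff, (q - 1.0) * diff))

y_true = np.array([1.0, 2.0, 3.0])
y_pred = np.array([1.5, 2.0, 2.0])
print(log_cosh(y_true, y_pred))
print(quantile_loss(y_true, y_pred, q=0.9))
```

With q = 0.9, under-predictions cost nine times as much as over-predictions of the same size, which pushes the model toward the 90th percentile; q = 0.5 reduces to half the MAE.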

Classification Losses: Probability & Decision Boundaries

Binary Cross-Entropy (BCE)

BCE = -[y log(ŷ) + (1-y) log(1-ŷ)]

def binary_crossentropy(y, y_pred):
    y_pred = np.clip(y_pred, 1e-7, 1-1e-7)  # stability
    return -np.mean(y * np.log(y_pred) + 
                    (1 - y) * np.log(1 - y_pred))

Use: Binary classification, sigmoid output.

Categorical Cross-Entropy (CCE)

CCE = -Σ y_i log(ŷ_i)

def categorical_crossentropy(y, y_pred):
    y_pred = np.clip(y_pred, 1e-7, 1.0)
    return -np.sum(y * np.log(y_pred)) / y.shape[0]

Use: Multi-class, softmax output.

Sparse CCE

Same as CCE but targets are integers (not one-hot). Memory efficient.

tf.keras.losses.SparseCategoricalCrossentropy()
Hinge Loss

L = max(0, 1 - y·ŷ) (y ∈ {-1,1})

Used in SVMs, also with CNNs.

tf.keras.losses.Hinge()
Squared Hinge

L = max(0, 1 - y·ŷ)²

Differentiable, penalizes errors more.

Numerical stability: Always use framework implementations (e.g., tf.keras.losses.BinaryCrossentropy(from_logits=True)) which combine log and softmax/sigmoid in a numerically stable way.
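To show what "combining log and sigmoid" buys, here is a NumPy sketch of binary cross-entropy computed directly from logits, next to the naive two-step version. The function names are illustrative; framework implementations may differ in detail.

```python
import numpy as np

def bce_with_logits(y_true, logits):
    """Mean BCE from raw logits z: max(z, 0) - z*y + log(1 + exp(-|z|))."""
    y_true = np.asarray(y_true, dtype=float)
    z = np.asarray(logits, dtype=float)
    per_sample = np.maximum(z, 0.0) - z * y_true + np.log1p(np.exp(-np.abs(z)))
    return per_sample.mean()

def bce_naive(y_true, logits):
    """Sigmoid first, then log: breaks down when the sigmoid saturates."""
    y_true = np.asarray(y_true, dtype=float)
    p = 1.0 / (1.0 + np.exp(-np.asarray(logits, dtype=float)))
    return -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

y = np.array([1.0, 0.0, 1.0, 0.0])
z = np.array([2.0, -1.0, 0.5, 3.0])
print(bce_with_logits(y, z), bce_naive(y, z))    # the two agree on moderate logits

# Confidently wrong predictions: the stable form stays finite
extreme = bce_with_logits(np.array([0.0, 1.0]), np.array([800.0, -800.0]))
print(extreme)
```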

Probabilistic Losses: Distributions & Divergence

KL Divergence

D_KL(P||Q) = Σ P(i) log(P(i)/Q(i))

Measures how one probability distribution diverges from another. Asymmetric.

def kl_divergence(p, q):
    p = np.clip(p, 1e-7, 1)
    q = np.clip(q, 1e-7, 1)
    return np.sum(p * np.log(p / q))

Used in VAEs, variational inference.

JS Divergence

Jensen-Shannon divergence. Symmetric, smoothed version of KL.

Used in GANs, domain adaptation.

Cross-Entropy vs KL

Cross-Entropy = H(P,Q) = H(P) + D_KL(P||Q). Minimizing cross-entropy is equivalent to minimizing KL when P is fixed (ground truth).
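The identity H(P,Q) = H(P) + D_KL(P||Q) is easy to confirm numerically. The two distributions below are arbitrary illustrative values:

```python
import numpy as np

p = np.array([0.7, 0.2, 0.1])   # "true" distribution P
q = np.array([0.5, 0.3, 0.2])   # model distribution Q

entropy = -np.sum(p * np.log(p))            # H(P)
cross_entropy = -np.sum(p * np.log(q))      # H(P, Q)
kl_pq = np.sum(p * np.log(p / q))           # D_KL(P || Q)
kl_qp = np.sum(q * np.log(q / p))           # D_KL(Q || P), differs because KL is asymmetric

print(cross_entropy, entropy + kl_pq)
print(kl_pq, kl_qp)
```

Because H(P) does not depend on the model, the gradients of cross-entropy and KL divergence with respect to Q are identical.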

Advanced & Specialized Loss Functions

CTC Loss

Connectionist Temporal Classification. Used in speech recognition, handwriting recognition. Aligns sequences without alignment labels.

tf.nn.ctc_loss
Contrastive Loss

L = y*d² + (1-y)*max(margin-d,0)²

Used in Siamese networks, similarity learning.

Triplet Loss

max(d(a,p)-d(a,n)+margin, 0)

Face recognition (FaceNet), embeddings.
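The contrastive and triplet formulas above can be sketched in NumPy as follows. This assumes Euclidean distance for d and batch-averaged losses; some implementations use squared distances instead, and the margins here are arbitrary defaults.

```python
import numpy as np

def contrastive_loss(d, y, margin=1.0):
    """d: pair distances; y = 1 for similar pairs, 0 for dissimilar pairs."""
    d = np.asarray(d, dtype=float)
    y = np.asarray(y, dtype=float)
    return np.mean(y * d**2 + (1 - y) * np.maximum(margin - d, 0.0) ** 2)

def triplet_loss(anchor, positive, negative, margin=0.2):
    """max(d(a,p) - d(a,n) + margin, 0), averaged over the batch."""
    d_ap = np.linalg.norm(anchor - positive, axis=-1)
    d_an = np.linalg.norm(anchor - negative, axis=-1)
    return np.mean(np.maximum(d_ap - d_an + margin, 0.0))

anchor = np.array([[0.0, 0.0]])
near = np.array([[0.0, 0.1]])
far = np.array([[3.0, 4.0]])

print(triplet_loss(anchor, near, far))   # positive is closer by more than the margin
print(triplet_loss(anchor, far, near))   # positive is farther away, so the loss is large
```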

Dice Loss / F1 Score

1 - (2|X∩Y|)/(|X|+|Y|). For imbalanced segmentation, medical imaging.

Perceptual Loss

Loss based on feature maps of pre-trained networks (VGG). For style transfer, super-resolution.

Loss Function Selection Guide

Task Type | Recommended Loss | Output Activation | Comments
Regression (normal) | MSE | Linear | Sensitive to outliers
Regression (robust) | Huber / MAE | Linear | Less sensitive to outliers
Binary Classification | Binary Cross-Entropy | Sigmoid | Use from_logits for stability
Multi-class Classification | Categorical Cross-Entropy | Softmax | Use sparse CE for integer labels
Multi-label Classification | Binary Cross-Entropy | Sigmoid (per class) | Independent probabilities
Imbalanced Data | Weighted CE / Focal Loss | Sigmoid/Softmax | Focuses on hard samples
Similarity Learning | Contrastive / Triplet | L2 normalized | Embedding space
Generative Models | BCE (GANs), KL (VAEs) | Varies | Task specific

Quick Selection Rules:

  • Regression: Start with MSE. If outliers are problematic, try MAE or Huber.
  • Binary classification: Binary cross-entropy.
  • Multi-class: Categorical cross-entropy.
  • Probabilistic outputs: KL Divergence.
  • Sequence alignment: CTC Loss.

Loss Functions in TensorFlow & PyTorch

TensorFlow / Keras
import tensorflow as tf

# Common losses (assumes `model` is an existing tf.keras model)
model.compile(loss='mse', optimizer='adam')  # regression
model.compile(loss='binary_crossentropy', optimizer='adam')
model.compile(loss='categorical_crossentropy', optimizer='adam')
model.compile(loss=tf.keras.losses.Huber(delta=1.5), optimizer='adam')

# Custom loss function
def custom_mse(y_true, y_pred):
    return tf.reduce_mean(tf.square(y_true - y_pred))
PyTorch
import torch
import torch.nn as nn

criterion = nn.MSELoss()  # regression
criterion = nn.BCELoss()  # requires sigmoid
criterion = nn.BCEWithLogitsLoss()  # stable, from_logits
criterion = nn.CrossEntropyLoss()  # includes softmax
criterion = nn.KLDivLoss(reduction='batchmean')  # KL divergence; expects log-probabilities as input

# Custom
class CustomLoss(nn.Module):
    def forward(self, y_pred, y_true):
        return torch.mean((y_true - y_pred)**2)

Designing Custom Loss Functions

Sometimes you need a task-specific loss. Any differentiable function that maps (y_true, y_pred) to a scalar can be a loss.

Custom Loss in TensorFlow
def weighted_mse(y_true, y_pred):
    weights = tf.where(y_true > 0.5, 2.0, 1.0)
    return tf.reduce_mean(weights * (y_true - y_pred)**2)

model.compile(loss=weighted_mse, optimizer='adam')
Custom Loss in PyTorch
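import torch
import torch.nn as nn

class FocalLoss(nn.Module):
    def __init__(self, alpha=1, gamma=2):
        super().__init__()
        self.alpha = alpha
        self.gamma = gamma

    def forward(self, y_pred, y_true):
        # y_pred: raw logits; y_true: float targets in {0, 1}
        bce = nn.functional.binary_cross_entropy_with_logits(y_pred, y_true, reduction='none')
        p = torch.sigmoid(y_pred)
        p_t = p * y_true + (1 - p) * (1 - y_true)   # probability assigned to the true class
        focal = self.alpha * (1 - p_t)**self.gamma * bce
        return focal.mean()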
Tip: Ensure your custom loss is differentiable and numerically stable. Test with small inputs.

Loss Function Pitfalls & Best Practices

⚠️ Wrong loss for task: Using MSE for classification leads to poor convergence and probability estimates.
⚠️ Ignoring class imbalance: Use weighted cross-entropy or focal loss.
✅ Numerical stability: Use `from_logits=True` or `BCEWithLogitsLoss` to avoid log(0).
✅ Monitor loss curves: Loss decreasing? Plateau? NaN? Helps debug.

Loss Landscape: The shape of the loss surface affects optimization. MSE and cross-entropy are convex in the parameters of a linear model, but the loss of a neural network with hidden layers is generally non-convex.

Loss Functions Cheatsheet

MSE Regression
MAE Robust reg.
Huber Smooth robust
BCE Binary cls
CCE Multi-class
KL Divergence
Hinge Max-margin
CTC Sequence

Backpropagation: The Engine of Deep Learning

Why Backpropagation? The Credit Assignment Problem

In a multi-layer network, how does a small change in an early weight affect the final loss? Backpropagation (popularized by Rumelhart, Hinton and Williams, 1986) solves this credit assignment problem by recursively applying the chain rule.

Historical Breakthrough
  • 1986 Backpropagation popularized
  • 1989 Universal approximation proven
  • 2012 AlexNet (backprop + GPU) wins ImageNet
Intuition

Backpropagation = forward pass computes predictions, backward pass propagates error gradients from output to each weight. "How much did each weight contribute to the error?"

Prerequisites: Partial derivatives, chain rule, gradient descent. We'll derive everything step by step.

Chain Rule: From Calculus to Computation

Backpropagation is the chain rule — applied efficiently to millions of parameters.

Scalar Chain Rule

If y = f(g(x)), then dy/dx = (dy/dg) * (dg/dx)

Multivariate: For L = f(z), z = Wx + b:

∂L/∂W = (∂L/∂z) · (∂z/∂W)
∂L/∂W₁ = ∂L/∂a₃ · ∂a₃/∂z₃ · ∂z₃/∂a₂ · ∂a₂/∂z₂ · ∂z₂/∂a₁ · ∂a₁/∂z₁ · ∂z₁/∂W₁
Gradient flows backward through every intermediate function

Computational Graphs: Visualizing Backprop

Modern frameworks (TensorFlow, PyTorch) build a computational graph during forward pass, then traverse it in reverse to compute gradients.

Forward:

x → *3 → +5 → z

Backward:

dz/dz = 1
dz/da = 1          (z = a + 5, with a = 3x)
da/dx = 3
dz/dx = dz/da · da/dx = 1 · 3 = 3
Automatic Differentiation
  • Forward mode: compute derivatives alongside values
  • Reverse mode (backprop): one forward pass, one backward pass → all gradients
  • Efficient for many parameters (typical deep learning)
🧮 Manual backprop through simple graph (NumPy)
# Forward pass: z = (x * 3) + 5
x = 2.0
a = x * 3      # a = 6
z = a + 5      # z = 11

# Backward pass (dz/dz = 1)
dz = 1
da = dz * 1    # dz/da = 1
dx = da * 3    # da/dx = 3
print(dx)      # Gradient = 3

Backpropagation Through a 2‑Layer MLP

X → (W1) → z1 → σ → a1 → (W2) → z2 → σ → a2 → Loss L(a2,y)
← dW1 ← dz1 ← da1 ← dW2 ← dz2 ← dL/da2
Forward Pass (Caching)
z1 = X @ W1 + b1
a1 = sigmoid(z1)
z2 = a1 @ W2 + b2
a2 = sigmoid(z2)
loss = binary_crossentropy(y, a2)
Backward Pass (Gradients)
m = X.shape[0]
da2 = -(y/a2 - (1-y)/(1-a2)) / m  # derivative of mean BCE w.r.t. a2
dz2 = da2 * sigmoid_prime(z2)     # sigmoid_prime(z2) = a2*(1-a2), so dz2 = (a2 - y)/m
dW2 = a1.T @ dz2
db2 = np.sum(dz2, axis=0, keepdims=True)

da1 = dz2 @ W2.T
dz1 = da1 * sigmoid_prime(z1)
dW1 = X.T @ dz1
db1 = np.sum(dz1, axis=0, keepdims=True)
Key pattern: Gradient w.r.t weight = activation_in.T @ gradient_out. This is consistent for all fully connected layers.

Pure NumPy Backprop – Full Training Loop

Every line explained. No frameworks, just math and NumPy.

⚙️ NeuralNetwork with backprop (XOR example)
import numpy as np

class NeuralNet:
    def __init__(self, input_size, hidden_size, output_size, lr=0.5):
        self.lr = lr
        self.W1 = np.random.randn(input_size, hidden_size) * 0.5
        self.b1 = np.zeros((1, hidden_size))
        self.W2 = np.random.randn(hidden_size, output_size) * 0.5
        self.b2 = np.zeros((1, output_size))
    
    def sigmoid(self, x):
        return 1 / (1 + np.exp(-x))
    
    def sigmoid_deriv(self, x):
        return x * (1 - x)
    
    def forward(self, X):
        self.z1 = X @ self.W1 + self.b1
        self.a1 = self.sigmoid(self.z1)
        self.z2 = self.a1 @ self.W2 + self.b2
        self.a2 = self.sigmoid(self.z2)
        return self.a2
    
    def backward(self, X, y, output):
        m = X.shape[0]
        # Output layer gradients
        self.dz2 = output - y.reshape(-1,1)     # BCE derivative simplification
        self.dW2 = (1/m) * self.a1.T @ self.dz2
        self.db2 = (1/m) * np.sum(self.dz2, axis=0, keepdims=True)
        # Hidden layer gradients
        self.da1 = self.dz2 @ self.W2.T
        self.dz1 = self.da1 * self.sigmoid_deriv(self.a1)
        self.dW1 = (1/m) * X.T @ self.dz1
        self.db1 = (1/m) * np.sum(self.dz1, axis=0, keepdims=True)
    
    def update(self):
        self.W1 -= self.lr * self.dW1
        self.b1 -= self.lr * self.db1
        self.W2 -= self.lr * self.dW2
        self.b2 -= self.lr * self.db2
    
    def train(self, X, y, epochs=5000):
        for i in range(epochs):
            output = self.forward(X)
            self.backward(X, y, output)
            self.update()
            if i % 1000 == 0:
                # MSE is logged for monitoring only; the gradients in backward() are those of BCE
                loss = np.mean((output - y.reshape(-1, 1))**2)
                print(f'Epoch {i}, Loss: {loss:.6f}')

# XOR dataset
X = np.array([[0,0],[0,1],[1,0],[1,1]])
y = np.array([[0],[1],[1],[0]])

net = NeuralNet(2, 4, 1, lr=0.7)   # named `net` to avoid clashing with the usual torch.nn alias
net.train(X, y, epochs=6000)
print("Predictions:\n", net.forward(X))

This implementation typically converges on XOR, the classic non-linear problem that a single perceptron cannot solve. The number of epochs needed depends on the random initialization.

Gradient Checking: Verify Your Backprop

Numerical approximation of gradients to ensure analytical backprop is correct.

🔬 Numerical gradient vs backprop
def numerical_gradient(f, params, epsilon=1e-7):
    """Finite difference approximation"""
    grads = []
    for param in params:
        grad = np.zeros_like(param)
        it = np.nditer(param, flags=['multi_index'], op_flags=['readwrite'])
        while not it.finished:
            idx = it.multi_index
            old_val = param[idx]
            param[idx] = old_val + epsilon
            f_plus = f()
            param[idx] = old_val - epsilon
            f_minus = f()
            grad[idx] = (f_plus - f_minus) / (2 * epsilon)
            param[idx] = old_val
            it.iternext()
        grads.append(grad)
    return grads

# Use: compare with backprop gradients (a relative difference below about 1e-6 is good)
Always gradient-check when implementing backprop from scratch!
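A compact end-to-end gradient check, sketched on a logistic-regression loss so that it fits in a few lines. The model, data, seed, and helper names are illustrative assumptions; the same comparison applies to the MLP gradients above.

```python
import numpy as np

def loss_and_grad(w, X, y):
    """Mean BCE of a logistic model and its analytic gradient."""
    p = 1.0 / (1.0 + np.exp(-(X @ w)))
    loss = -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))
    grad = X.T @ (p - y) / len(y)
    return loss, grad

def numerical_grad(w, X, y, eps=1e-5):
    """Central finite differences, one coordinate at a time."""
    grad = np.zeros_like(w)
    for i in range(w.size):
        step = np.zeros_like(w)
        step[i] = eps
        f_plus, _ = loss_and_grad(w + step, X, y)
        f_minus, _ = loss_and_grad(w - step, X, y)
        grad[i] = (f_plus - f_minus) / (2 * eps)
    return grad

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
y = rng.integers(0, 2, size=20).astype(float)
w = rng.normal(size=3)

_, analytic = loss_and_grad(w, X, y)
numeric = numerical_grad(w, X, y)
rel_error = np.linalg.norm(analytic - numeric) / (
    np.linalg.norm(analytic) + np.linalg.norm(numeric))
print("relative error:", rel_error)
```

Using a relative rather than absolute error keeps the threshold meaningful when gradients are very large or very small.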

Vanishing / Exploding Gradients

Deep networks suffer from unstable gradients. Why?

Vanishing

Sigmoid/tanh saturate → gradients → 0. Early layers learn extremely slowly.

# Solution: ReLU, residual connections, batch norm
Exploding

Large weights → gradients multiply exponentially → NaN.

# Solution: Gradient clipping, proper initialization
Modern mitigations
  • ReLU/Leaky ReLU activations
  • Xavier/He initialization
  • Batch Normalization
  • Residual connections (ResNet)
  • Gradient clipping
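A back-of-the-envelope illustration of why depth makes gradients unstable. It assumes the gradient is scaled by one factor per layer and ignores the weight matrices; the gain of 1.5 is an arbitrary illustrative value.

```python
import numpy as np

def sigmoid_grad(z):
    s = 1.0 / (1.0 + np.exp(-z))
    return s * (1.0 - s)

depths = [1, 5, 10, 20]

# Best case for sigmoid: every pre-activation is 0, where the derivative peaks at 0.25
vanishing = [sigmoid_grad(0.0) ** d for d in depths]

# A per-layer gain above 1 compounds in the other direction
exploding = [1.5 ** d for d in depths]

for d, v, e in zip(depths, vanishing, exploding):
    print(f"depth {d:2d}: sigmoid factor {v:.2e}, gain-1.5 factor {e:.2e}")
```

Even in the best case, twenty sigmoid layers shrink the gradient by a factor of 0.25^20, which is why the mitigations listed above matter.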

Backprop in TensorFlow & PyTorch

Autograd computes gradients automatically — but understanding backprop helps you debug and design architectures.

TensorFlow
# assumes `model`, `optimizer`, `X`, and `y` are already defined
with tf.GradientTape() as tape:
    y_pred = model(X)
    loss = tf.reduce_mean(tf.keras.losses.MSE(y, y_pred))  # reduce to a scalar
grads = tape.gradient(loss, model.trainable_variables)
optimizer.apply_gradients(zip(grads, model.trainable_variables))
PyTorch
optimizer.zero_grad()            # clear gradients left over from the previous step
y_pred = model(X)
loss = nn.MSELoss()(y_pred, y)
loss.backward()  # <-- one line backprop!
optimizer.step()


Backpropagation = The Learning Algorithm

Nearly every modern neural network, from 3-layer MLPs to GPT-4, is trained with backpropagation or a variant of it. Mastering backprop pays off: you can implement new architectures, diagnose vanishing gradients, and understand what training is actually doing.

When You Need Backprop Deep‑Knowledge

Custom Layers

Implement your own forward/backward in frameworks.

Debugging

Why are gradients NaN? Why isn't this layer learning?

Research

Modify gradient flow (e.g., reversible nets, synthetic gradients).

Ready for advanced architectures? Next, learn how optimizers (SGD, Adam) use these gradients to update weights.

Optimizers: Driving Neural Network Training

What is an Optimizer?

Optimizers are algorithms that update model parameters (weights) to minimize the loss function. They determine how to move in the gradient direction — how fast, with what momentum, and with what adaptive scaling. The choice of optimizer critically affects training speed, stability, and final performance.

θₜ → ∇L(θₜ) → Optimizer Update Rule → θₜ₊₁ = θₜ - lr · f(∇L, history)

Optimizers incorporate gradient history, adaptive learning rates, and momentum.

Gradient Descent Variants

Batch GD

Uses entire dataset to compute gradient. θ = θ - lr · ∇L(θ; all data)

Slow but stable. Not feasible for large datasets.

Stochastic GD (SGD)

θ = θ - lr · ∇L(θ; xᵢ, yᵢ)

Update per sample. High variance; suited to online learning.

Mini-batch GD

θ = θ - lr · ∇L(θ; batch)

A balanced compromise and the most common choice. Typical batch size: 32-512.

Mini-batch SGD from scratch
import numpy as np

def sgd_update(params, grads, lr=0.01):
    """Simple SGD update"""
    for param, grad in zip(params, grads):
        param -= lr * grad
    return params
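The update function above is only the last step of mini-batch training. A sketch of the full loop on a toy linear-regression problem follows; the data, learning rate, and batch size are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(256, 2))
true_w = np.array([2.0, -3.0])
y = X @ true_w + 0.5                     # noiseless targets with bias 0.5

w = np.zeros(2)
b = 0.0
lr, batch_size = 0.1, 32

for epoch in range(50):
    order = rng.permutation(len(X))      # reshuffle every epoch
    for start in range(0, len(X), batch_size):
        idx = order[start:start + batch_size]
        err = X[idx] @ w + b - y[idx]
        grad_w = 2 * X[idx].T @ err / len(idx)   # d(MSE)/dw on this batch
        grad_b = 2 * err.mean()                  # d(MSE)/db on this batch
        w -= lr * grad_w
        b -= lr * grad_b

print("learned w:", w, "learned b:", b)
```

Each epoch performs 8 parameter updates here instead of the single update batch GD would make, while each gradient is still averaged over 32 samples to limit noise.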

Momentum & Nesterov Accelerated Gradient

SGD with Momentum

vₜ = βvₜ₋₁ + (1-β)∇L(θₜ)
θₜ₊₁ = θₜ - lr · vₜ

Accumulates velocity to overcome ravines and accelerate convergence. β typically 0.9.

def momentum_update(params, grads, v, lr=0.01, beta=0.9):
    for i, (p, g) in enumerate(zip(params, grads)):
        v[i] = beta * v[i] + (1 - beta) * g
        p -= lr * v[i]
    return params, v
Nesterov Accelerated Gradient (NAG)

vₜ = βvₜ₋₁ + (1-β)∇L(θₜ - lr·βvₜ₋₁)
θₜ₊₁ = θₜ - lr · vₜ

Looks ahead at the approximate future position. Often faster and more stable than standard momentum.

Intuition: Momentum is like a ball rolling downhill – it accumulates speed. Nesterov is like a smart ball that looks ahead before updating.
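The Nesterov formulas above differ from plain momentum only in where the gradient is evaluated. A minimal sketch on an ill-conditioned quadratic, where the matrix `A`, the starting point, and the step count are illustrative assumptions:

```python
import numpy as np

def nag_update(theta, v, grad_fn, lr=0.1, beta=0.9):
    lookahead = theta - lr * beta * v                  # approximate future position
    v = beta * v + (1 - beta) * grad_fn(lookahead)     # gradient taken at the look-ahead point
    theta = theta - lr * v
    return theta, v

# f(theta) = 0.5 * theta^T A theta has gradient A @ theta and its minimum at the origin
A = np.diag([1.0, 10.0])

def grad_fn(theta):
    return A @ theta

theta, v = np.array([5.0, 5.0]), np.zeros(2)
for _ in range(300):
    theta, v = nag_update(theta, v, grad_fn)

print("distance to minimum:", np.linalg.norm(theta))
```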

Adaptive Learning Rate Methods

AdaGrad

Gₜ = Gₜ₋₁ + (∇L(θₜ))²
θₜ₊₁ = θₜ - lr/√(Gₜ + ε) · ∇L(θₜ)

Adapts per-parameter learning rates. Good for sparse data. Learning rate decays monotonically.

Weakness: LR becomes infinitesimally small.
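A small sketch of the AdaGrad rule above that makes the shrinking step size visible. The constant gradient is an artificial assumption chosen so the effect is easy to read off.

```python
import numpy as np

def adagrad_update(param, grad, G, lr=0.1, eps=1e-8):
    G = G + grad**2                              # accumulated squared gradients
    param = param - lr * grad / np.sqrt(G + eps)
    return param, G

param, G = np.array([1.0, 1.0]), np.zeros(2)
effective_lr = []
for _ in range(5):
    grad = np.array([1.0, 0.1])                  # constant gradient, for illustration only
    param, G = adagrad_update(param, grad, G)
    effective_lr.append(0.1 / np.sqrt(G + 1e-8))

effective_lr = np.array(effective_lr)
print(np.round(effective_lr, 4))
```

The coordinate with the smaller gradient keeps a larger effective learning rate, which is why AdaGrad suits sparse features, but both columns decrease at every step and never recover.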

RMSprop

E[g²]ₜ = βE[g²]ₜ₋₁ + (1-β)(∇L)²
θₜ₊₁ = θₜ - lr/√(E[g²]ₜ + ε) · ∇L

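Proposed by Geoffrey Hinton in lecture notes rather than a formal paper, but widely used. Fixes AdaGrad's decaying LR problem by replacing the running sum with an exponential moving average. β typically 0.9.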

RMSprop implementation
def rmsprop_update(params, grads, cache, lr=0.001, beta=0.9, eps=1e-8):
    for i, (p, g) in enumerate(zip(params, grads)):
        cache[i] = beta * cache[i] + (1 - beta) * g**2
        p -= lr * g / (np.sqrt(cache[i]) + eps)
    return params, cache

Adam & The Adaptive Moment Family

Adam (Adaptive Moment Estimation)

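mₜ = β₁mₜ₋₁ + (1-β₁)∇L
vₜ = β₂vₜ₋₁ + (1-β₂)(∇L)²
m̂ₜ = mₜ/(1-β₁ᵗ),  v̂ₜ = vₜ/(1-β₂ᵗ)
θₜ₊₁ = θₜ - lr · m̂ₜ/(√v̂ₜ + ε)

Combines momentum (first moment) and RMSprop (second moment). The bias-corrected estimates m̂ₜ and v̂ₜ compensate for initializing m and v at zero. Typical defaults: β₁=0.9, β₂=0.999, ε=1e-8 (Keras uses 1e-7).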

Default optimizer for most tasks

AdamW

θₜ₊₁ = θₜ - lr · (m̂ₜ/(√v̂ₜ+ε) + λθₜ)

Decoupled weight decay: the decay term is applied directly to the weights instead of being added to the gradient. Often generalizes better than Adam with L2 regularization.

# PyTorch: torch.optim.AdamW
# TensorFlow: tf.keras.optimizers.AdamW
Nadam

Adam + Nesterov momentum. Slightly faster convergence.

AMSGrad

Variant that uses maximum of past squared gradients. Addresses convergence issues.

AdaBelief

Stepsize scaled by belief in observed gradient. More stable.

Adam implementation intuition
# Simplified Adam update (conceptual)
def adam_step(param, grad, m, v, t, lr=0.001, b1=0.9, b2=0.999):
    m = b1 * m + (1 - b1) * grad
    v = b2 * v + (1 - b2) * grad**2
    m_hat = m / (1 - b1**t)
    v_hat = v / (1 - b2**t)
    param -= lr * m_hat / (np.sqrt(v_hat) + 1e-7)
    return param, m, v
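A usage sketch for the Adam step: minimizing a simple quadratic. This version returns a new array instead of updating in place and exposes `eps` as a parameter; the target, learning rate, and step count are illustrative assumptions.

```python
import numpy as np

def adam_step(param, grad, m, v, t, lr=0.001, b1=0.9, b2=0.999, eps=1e-8):
    m = b1 * m + (1 - b1) * grad
    v = b2 * v + (1 - b2) * grad**2
    m_hat = m / (1 - b1**t)              # bias correction; t starts at 1
    v_hat = v / (1 - b2**t)
    param = param - lr * m_hat / (np.sqrt(v_hat) + eps)
    return param, m, v

target = np.array([3.0, -2.0])
param = np.zeros(2)
m, v = np.zeros(2), np.zeros(2)

for t in range(1, 2001):
    grad = 2 * (param - target)          # gradient of ||param - target||^2
    param, m, v = adam_step(param, grad, m, v, t, lr=0.05)

print("final parameters:", param)
```

Early on, each step has magnitude close to lr regardless of the gradient scale, because m̂/√v̂ is roughly ±1 while the gradient sign is stable.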

Modern & Emerging Optimizers

Lion (EvoLved Sign Momentum)

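cₜ = β₁mₜ₋₁ + (1-β₁)∇L
θₜ₊₁ = θₜ - lr · (sign(cₜ) + λθₜ)
mₜ = β₂mₜ₋₁ + (1-β₂)∇L

Discovered by symbolic program search. Keeps only a momentum buffer (no second-moment estimate), so it needs less optimizer memory than Adam; the sign makes every coordinate's update the same magnitude.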

Adafactor

Memory-efficient Adam for large models. Factorizes second moment estimates. Used in T5.

LAMB & LARS

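LARS (Layer-wise Adaptive Rate Scaling) and LAMB (its Adam-based counterpart) scale each layer's update by the ratio of the weight norm to the update norm. For large-batch training (BERT, ResNet on TPUs).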

Learning Rate Scheduling

Even with adaptive optimizers, scheduling the learning rate improves convergence.

Step Decay

Drop LR by factor every few epochs.

# TF: tf.keras.optimizers.schedules.ExponentialDecay with staircase=True
# PyTorch: torch.optim.lr_scheduler.StepLR
Cosine Annealing

Smooth decay that follows half a cosine wave. Often combined with warm restarts, which make it cyclic.

tf.keras.optimizers.schedules.CosineDecay
Warmup

Linear increase from 0 to initial LR. Stabilizes large model training.

ReduceLROnPlateau

Reduce LR when validation loss plateaus.

Best practice: Use learning rate warmup for Transformers and very deep networks. Cosine decay often outperforms step decay.
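Warmup followed by cosine decay can be written as a single function of the step index. A framework-free sketch; the function name and default values are illustrative assumptions.

```python
import math

def warmup_cosine_lr(step, base_lr=1e-3, warmup_steps=100, total_steps=1000, min_lr=0.0):
    """Linear warmup to base_lr, then cosine decay down to min_lr."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)        # hold at min_lr after total_steps
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

schedule = [warmup_cosine_lr(s) for s in range(1000)]
print(schedule[0], schedule[99], schedule[500], schedule[-1])
```

In a training loop the returned value would be assigned to the optimizer's learning rate before each update.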

Optimizer Selection Guide

Optimizer | Adaptive | Momentum | When to use | Memory
SGD | ❌ | ❌ | Simple models, CV (with momentum) | Low
SGD+Momentum | ❌ | ✅ | Classic CNNs, needs LR tuning | Low
RMSprop | ✅ | ❌ | RNNs, online learning | Medium
Adam | ✅ | ✅ | Default for most tasks | Medium
AdamW | ✅ | ✅ | Transformers, NLP, better generalization | Medium
Nadam | ✅ | ✅ (Nesterov) | Slightly faster Adam | Medium
Lion | ✅ | ✅ | Memory efficient, vision tasks | Low
Adafactor | ✅ | ✅ | Giant models (LLMs) | Very low

Quick Selection Rules:

  • Start with AdamW – works well out-of-the-box.
  • For NLP / Transformers: AdamW with cosine decay + warmup.
  • For Computer Vision: SGD with momentum can outperform Adam (requires tuning).
  • For large models (>1B params): Adafactor or Lion to save memory.
  • For sparse data: AdaGrad or Adam.

Optimizers in TensorFlow & PyTorch

TensorFlow / Keras
import tensorflow as tf

# Common optimizers (assumes `model` exists; 'mse' is a placeholder loss)
model.compile(optimizer='sgd', loss='mse')
model.compile(optimizer='adam', loss='mse')
model.compile(optimizer=tf.keras.optimizers.AdamW(learning_rate=1e-4, weight_decay=1e-4))

# Learning rate schedule
lr_schedule = tf.keras.optimizers.schedules.CosineDecay(1e-3, decay_steps=10000)
optimizer = tf.keras.optimizers.Adam(lr_schedule)

# Custom optimizer loop
@tf.function
def train_step(x, y):
    with tf.GradientTape() as tape:
        y_pred = model(x)
        loss = loss_fn(y, y_pred)
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
PyTorch
import torch.optim as optim

# Optimizers
optimizer = optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
optimizer = optim.Adam(model.parameters(), lr=0.001)
optimizer = optim.AdamW(model.parameters(), lr=0.001, weight_decay=0.01)

# Learning rate schedulers (choose one)
scheduler = optim.lr_scheduler.StepLR(optimizer, step_size=30, gamma=0.1)
# scheduler = optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=50)
# scheduler = optim.lr_scheduler.ReduceLROnPlateau(optimizer, patience=5)

# Training loop (assumes `model`, `loss_fn`, `dataloader`, and `epochs` are defined)
for epoch in range(epochs):
    for x, y in dataloader:
        optimizer.zero_grad()
        y_pred = model(x)
        loss = loss_fn(y_pred, y)
        loss.backward()
        optimizer.step()
    # ReduceLROnPlateau needs the monitored metric instead, e.g. scheduler.step(val_loss)
    scheduler.step()

Optimizer Hyperparameter Tuning

Learning Rate: Most critical. Defaults: Adam 1e-3, SGD 1e-2. Use LR range test.
Batch Size: Affects gradient noise. Tune together with LR.
Weight Decay: AdamW: 0.01-0.1, SGD: 1e-4. Prevents overfitting.

LR Range Test: Increase LR exponentially each batch, plot loss. Optimal LR is just before loss explodes.

Optimizer Pitfalls & Solutions

⚠️ Adam generalizes worse than SGD? Partly a myth. AdamW's decoupled weight decay narrows the gap, though well-tuned SGD with momentum can still come out ahead on some vision tasks.
⚠️ Loss not decreasing: LR too high/low, gradient clipping needed, or bug in model.
✅ Gradient clipping: Often important for RNNs and Transformers. Clipping the global norm to 1.0 is a common starting point.
✅ Debug: Monitor gradient norms per layer. Vanishing/exploding?
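Clipping by global norm and monitoring gradient norms use the same quantity. A NumPy sketch, where the helper name and the example gradients are illustrative assumptions:

```python
import numpy as np

def clip_by_global_norm(grads, max_norm=1.0):
    """Rescale all gradients together so their joint L2 norm is at most max_norm."""
    global_norm = np.sqrt(sum(np.sum(g**2) for g in grads))
    scale = min(1.0, max_norm / (global_norm + 1e-12))
    return [g * scale for g in grads], global_norm

# Two parameter tensors; squared entries sum to 9 + 16 + 144 = 169, so the global norm is 13
grads = [np.array([3.0, 4.0]), np.array([[12.0]])]
clipped, norm_before = clip_by_global_norm(grads, max_norm=1.0)
norm_after = np.sqrt(sum(np.sum(g**2) for g in clipped))
print(norm_before, norm_after)

# Gradients already within the limit pass through unchanged
small, _ = clip_by_global_norm([np.array([0.3, 0.4])], max_norm=1.0)
```

Scaling every tensor by the same factor preserves the direction of the overall update, unlike clipping each value independently.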

Optimizer Cheatsheet

SGD+M CV
Adam Default
AdamW Best overall
RMSprop RNN
Lion Efficient
Adafactor LLMs
Nadam Slightly faster
AdaGrad Sparse