Deep Learning

Neural Network Fundamentals

Neural networks, activation functions, loss functions, backpropagation, and optimizers for deep learning.

Neural Networks Basics

The Perceptron: First Neural Model

Invented by Frank Rosenblatt in 1958, the perceptron is the simplest neural network: a single neuron that classifies linearly separable patterns.

How it works

1 Weighted sum: z = w·x + b
2 Step activation: ŷ = 1 if z ≥ 0 else 0
3 Update: w = w + lr*(y - ŷ)*x, b = b + lr*(y - ŷ)

Limitation

Only linearly separable functions (AND, OR) can be learned; XOR cannot. Minsky and Papert's 1969 analysis of this limitation contributed to the first AI winter, and overcoming it later motivated multi-layer networks.

Key insight: depth matters.

Perceptron from scratch (NumPy)

import numpy as np

class Perceptron:
    def __init__(self, lr=0.01, epochs=15):
        self.lr = lr
        self.epochs = epochs
        self.weights = None
        self.bias = None

    def activation(self, z):
        return 1 if z >= 0 else 0

    def fit(self, X, y):
        n_samples, n_features = X.shape
        self.weights = np.zeros(n_features)
        self.bias = 0

        for _ in range(self.epochs):
            for idx, x_i in enumerate(X):
                linear = np.dot(x_i, self.weights) + self.bias
                y_pred = self.activation(linear)
                update = self.lr * (y[idx] - y_pred)
                self.weights += update * x_i
                self.bias += update

    def predict(self, X):
        linear = np.dot(X, self.weights) + self.bias
        return np.array([self.activation(z) for z in linear])

Try it on the AND gate: it typically converges in fewer than 10 epochs.
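The AND-versus-XOR contrast can be checked with a compact version of the same learning rule. This is a minimal sketch, not the class above: the helper names `train_perceptron` and `predict` are illustrative, and lr=1.0 is assumed so that the arithmetic stays exact.

```python
import numpy as np

def train_perceptron(X, y, lr=1.0, epochs=20):
    """Perceptron learning rule with a step activation (1 if z >= 0 else 0)."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        for x_i, y_i in zip(X, y):
            y_pred = 1 if np.dot(x_i, w) + b >= 0 else 0
            update = lr * (y_i - y_pred)   # zero when the prediction is correct
            w += update * x_i
            b += update
    return w, b

def predict(X, w, b):
    return (X @ w + b >= 0).astype(int)

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y_and = np.array([0, 0, 0, 1])
y_xor = np.array([0, 1, 1, 0])

w, b = train_perceptron(X, y_and)
and_pred = predict(X, w, b)   # matches y_and once training has converged

w, b = train_perceptron(X, y_xor)
xor_pred = predict(X, w, b)   # no line separates XOR, so at least one input is wrong
```

However many epochs are used, the XOR predictions never match all four targets, because no single linear boundary separates the two classes.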

Activation Functions: Non-linearity is key

Without activation functions, stacked linear layers collapse into one linear transformation. Non-linear activations are what allow deep networks to approximate complex, non-linear functions.
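The collapse of stacked linear layers can be verified numerically. The sketch below uses arbitrary random matrices (the shapes and the seed are assumptions chosen only for illustration) and compares two linear layers against the single equivalent layer.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 3))
W1, b1 = rng.normal(size=(3, 4)), rng.normal(size=4)
W2, b2 = rng.normal(size=(4, 2)), rng.normal(size=2)

# Two linear layers without an activation in between
two_layers = (x @ W1 + b1) @ W2 + b2

# The same mapping expressed as one linear layer
W = W1 @ W2
b = b1 @ W2 + b2
one_layer = x @ W + b

collapsed = np.allclose(two_layers, one_layer)

# With a ReLU in between, the mapping is in general no longer a single linear layer
with_relu = np.maximum(0, x @ W1 + b1) @ W2 + b2
```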

Sigmoid

def sigmoid(x):
    return 1/(1+np.exp(-x))

Range (0,1). Well suited to binary outputs, but it saturates for large |x|, which causes vanishing gradients.

Tanh

def tanh(x):
    return np.tanh(x)

Range (-1,1), zero-centered, stronger gradients.

ReLU

def relu(x):
    return np.maximum(0,x)

No saturation for x > 0 and sparse activations, but neurons can die (output zero for every input).

Leaky ReLU

def leaky_relu(x, alpha=0.01):
    return np.where(x>0, x, alpha*x)

Small negative slope keeps gradients flowing for x < 0.

Softmax

def softmax(x):
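    # subtract the row-wise max for numerical stability; works for 1-D and batched input
    ex = np.exp(x - np.max(x, axis=-1, keepdims=True))
    return ex / ex.sum(axis=-1, keepdims=True)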

Multi-class probability.

Selection rule: ReLU for hidden layers, sigmoid for binary output, softmax for multi-class.

Forward Propagation & Backpropagation

Input X → [W1,b1] → z1 → a1 = σ(z1) → [W2,b2] → z2 → a2 = σ(z2) → Loss L(ŷ,y)

↻ Backward: dL/dW2 ← dL/da2 * da2/dz2 * dz2/dW2 ... chain rule

Forward pass

Compute activations layer by layer, cache intermediate values for gradient.

Backward pass (chain rule)

∂L/∂W = (∂L/∂a) * (∂a/∂z) * (∂z/∂W)

Backpropagation in a 2-layer net (NumPy)

# assumes sigmoid activations and binary cross-entropy loss,
# for which dL/dz2 simplifies to (a2 - y)
def backward(self, X, y, a1, a2):
    m = X.shape[0]
    # output layer gradient
    dz2 = a2 - y.reshape(-1, 1)         # dL/dz2
    dW2 = (1/m) * a1.T @ dz2
    db2 = (1/m) * np.sum(dz2, axis=0, keepdims=True)
    # hidden layer gradient
    da1 = dz2 @ self.W2.T
    dz1 = da1 * (a1 * (1 - a1))         # sigmoid derivative
    dW1 = (1/m) * X.T @ dz1
    db1 = (1/m) * np.sum(dz1, axis=0, keepdims=True)
    return dW1, db1, dW2, db2

Multi-Layer Perceptron (MLP) from Scratch

A complete implementation of a neural network with one hidden layer, using only NumPy. It is the foundation for modern deep learning.

🧠 MLP class: forward, backward, fit

import numpy as np

class MLP:
    def __init__(self, input_size, hidden_size, output_size, lr=0.1):
        self.lr = lr
        self.W1 = np.random.randn(input_size, hidden_size) * 0.5
        self.b1 = np.zeros((1, hidden_size))
        self.W2 = np.random.randn(hidden_size, output_size) * 0.5
        self.b2 = np.zeros((1, output_size))

    def sigmoid(self, x):
        return 1 / (1 + np.exp(-x))

    def sigmoid_deriv(self, a):
        # expects the sigmoid output a = sigmoid(z), not the pre-activation z
        return a * (1 - a)

    def forward(self, X):
        self.z1 = X @ self.W1 + self.b1
        self.a1 = self.sigmoid(self.z1)
        self.z2 = self.a1 @ self.W2 + self.b2
        self.a2 = self.sigmoid(self.z2)
        return self.a2

    def backward(self, X, y, output):
        m = X.shape[0]
        self.dz2 = output - y.reshape(-1,1)
        self.dW2 = (1/m) * self.a1.T @ self.dz2
        self.db2 = (1/m) * np.sum(self.dz2, axis=0, keepdims=True)
        self.da1 = self.dz2 @ self.W2.T
        self.dz1 = self.da1 * self.sigmoid_deriv(self.a1)
        self.dW1 = (1/m) * X.T @ self.dz1
        self.db1 = (1/m) * np.sum(self.dz1, axis=0, keepdims=True)

    def update(self):
        self.W1 -= self.lr * self.dW1
        self.b1 -= self.lr * self.db1
        self.W2 -= self.lr * self.dW2
        self.b2 -= self.lr * self.db2

    def fit(self, X, y, epochs=1000):
        for i in range(epochs):
            output = self.forward(X)
            self.backward(X, y, output)
            self.update()
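            if i % 200 == 0:
                # MSE is reported for monitoring; reshape y so (n,) targets do not broadcast to (n, n)
                loss = np.mean((output - y.reshape(-1, 1))**2)
                print(f"epoch {i}, loss: {loss:.6f}")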

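Test on XOR: a 2→4→1 sigmoid network trained for 2000 epochs can reach an MSE below 0.005, though this depends on the learning rate and the random initialization; the default lr=0.1 may need more epochs.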

Neural Nets in Keras & PyTorch

TensorFlow/Keras

import tensorflow as tf
model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation='relu', input_shape=(4,)),
    tf.keras.layers.Dense(32, activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid')
])
model.compile(optimizer='adam', loss='binary_crossentropy')  # sigmoid output pairs with BCE

PyTorch

import torch
import torch.nn as nn
class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(4, 8)
        self.fc2 = nn.Linear(8, 4)
        self.out = nn.Linear(4, 1)
    def forward(self, x):
        x = torch.relu(self.fc1(x))
        x = torch.relu(self.fc2(x))
        return torch.sigmoid(self.out(x))

Key features: transfer learning, autodiff, GPU support.

Weight Initialization & Optimizers

Initialization

Zero init → symmetry, no learning
Small random (e.g., scale 0.01): OK for shallow networks
Xavier/Glorot for sigmoid/tanh
He init for ReLU
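The two scaled schemes in the list differ only in the standard deviation of the random draw. The sketch below assumes the normal-distribution variants (Glorot normal and He normal); the function names and layer sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def xavier_normal(fan_in, fan_out):
    # Glorot/Xavier: variance 2 / (fan_in + fan_out), suited to sigmoid/tanh
    std = np.sqrt(2.0 / (fan_in + fan_out))
    return rng.normal(0.0, std, size=(fan_in, fan_out))

def he_normal(fan_in, fan_out):
    # He: variance 2 / fan_in, suited to ReLU
    std = np.sqrt(2.0 / fan_in)
    return rng.normal(0.0, std, size=(fan_in, fan_out))

W_xavier = xavier_normal(512, 256)
W_he = he_normal(512, 256)
```

The empirical standard deviations of `W_xavier` and `W_he` should land close to sqrt(2/768) and sqrt(2/512) respectively.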

Optimizers

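Batch GD, SGD, and mini-batch GD differ in how much data each update uses. Momentum accelerates updates with a running velocity, while Adam and RMSprop adapt per-parameter learning rates.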

optimizer = tf.keras.optimizers.Adam(learning_rate=0.001)

Why do neural networks work?

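Universal Approximation Theorem: A feedforward network with a single hidden layer can approximate any continuous function on a compact domain to arbitrary accuracy, given sufficient neurons and a suitable non-linear activation. The theorem guarantees that such a network exists, not that training will find it.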

Real-world usage

Regression & Forecasting

Housing prices, stock trends, energy load.

Classification

Spam detection, credit risk, medical diagnosis.

Feature learning

Autoencoders, embeddings, representation learning.

Activation Functions: The Non-Linear Gatekeepers

Why do we need activation functions?

Without activation functions, neural networks would just be linear transformations. No matter how many layers, a linear combination of linear functions is still linear. Activation functions introduce non-linearity, allowing the network to learn complex patterns, decision boundaries, and hierarchical representations.

Weighted sum (z = w·x + b) → Activation f(z) → Output (non-linear)

Every neuron applies an activation function to its weighted input.

Classic Activation Functions: Sigmoid & Tanh

Sigmoid (σ)

Formula: σ(x) = 1 / (1 + e^-x)

Derivative: σ(x)(1-σ(x))

import numpy as np
def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def sigmoid_derivative(x):
    s = sigmoid(x)
    return s * (1 - s)

Output in (0,1). Prone to vanishing gradients. Used in the output layer for binary classification.

Tanh (Hyperbolic Tangent)

Formula: tanh(x) = (e^x - e^-x) / (e^x + e^-x)

Derivative: 1 - tanh²(x)

def tanh(x):
    return np.tanh(x)  # or manual

def tanh_derivative(x):
    return 1 - np.tanh(x)**2

Zero-centered, range (-1,1). Stronger gradients than sigmoid, but it still saturates.

Vanishing Gradient: Both Sigmoid and Tanh squash large inputs, making gradients near zero. Deep networks struggle to learn.
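A quick calculation shows how fast the effect compounds. The sketch below evaluates the sigmoid derivative at its peak and at a saturated input, then multiplies the peak value across 10 layers (the depth of 10 is an arbitrary choice for illustration).

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def sigmoid_derivative(x):
    s = sigmoid(x)
    return s * (1 - s)

peak = sigmoid_derivative(0.0)         # 0.25, the largest value the derivative can take
saturated = sigmoid_derivative(10.0)   # about 4.5e-5
upper_bound_10_layers = peak ** 10     # about 9.5e-7, even in the best case
```

Even when every unit sits at the most favorable input, the activation derivatives alone shrink the gradient by a factor of roughly a million over 10 sigmoid layers.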

ReLU & Family: Solving Vanishing Gradient

ReLU (Rectified Linear Unit)

f(x) = max(0, x)

def relu(x):
    return np.maximum(0, x)
# derivative: 1 if x>0 else 0

Pros: Computationally cheap, sparse, no saturation for x>0. Cons: Dying ReLU (neurons stuck at 0).

Leaky ReLU

f(x) = x if x>0 else αx (α small, e.g., 0.01)

def leaky_relu(x, alpha=0.01):
    return np.where(x > 0, x, alpha * x)

Fixes dying ReLU; allows gradient flow for negative values.

ELU (Exponential Linear Unit)

f(x)= x if x>0 else α(e^x -1)

Smooth, negative values push mean closer to zero. Faster learning.
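A NumPy sketch of ELU and its derivative, in the same style as the other activations. The default α=1.0 is an assumption (a common choice), and `np.minimum` is used only to avoid overflow warnings in the branch that is discarded.

```python
import numpy as np

def elu(x, alpha=1.0):
    x = np.asarray(x, dtype=float)
    # expm1(x) = e^x - 1, computed accurately for small x
    return np.where(x > 0, x, alpha * np.expm1(np.minimum(x, 0.0)))

def elu_derivative(x, alpha=1.0):
    x = np.asarray(x, dtype=float)
    return np.where(x > 0, 1.0, alpha * np.exp(np.minimum(x, 0.0)))

out = elu(np.array([-100.0, -1.0, 0.0, 2.0]))   # approx. [-1.0, -0.632, 0.0, 2.0]
```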

PReLU (Parametric ReLU)

α is learned during training.

# TensorFlow: tf.keras.layers.PReLU()

Softmax: From Logits to Probabilities

Softmax is used in the output layer for multi-class classification. It converts a vector of raw scores (logits) into a probability distribution over classes.

Softmax + CrossEntropy

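def softmax(logits):
    # subtract the row-wise max for numerical stability (also correct for batched input)
    exp_shifted = np.exp(logits - np.max(logits, axis=-1, keepdims=True))
    return exp_shifted / np.sum(exp_shifted, axis=-1, keepdims=True)

# Example: logits = [2.0, 1.0, 0.1] -> probabilities ≈ [0.659, 0.242, 0.099], sum = 1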

Key property

All outputs ∈ (0,1) and sum to 1.
Ideal for mutually exclusive classes.
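The properties above can be checked on a small example. This sketch re-declares a row-wise softmax so it runs on its own; the extreme logits are an assumed test case showing why the max is subtracted.

```python
import numpy as np

def softmax(logits):
    logits = np.asarray(logits, dtype=float)
    shifted = logits - np.max(logits, axis=-1, keepdims=True)
    exp_shifted = np.exp(shifted)
    return exp_shifted / np.sum(exp_shifted, axis=-1, keepdims=True)

probs = softmax([2.0, 1.0, 0.1])            # approx. [0.659, 0.242, 0.099]
big = softmax([1000.0, 1001.0, 1002.0])     # stays finite thanks to the shift
```

Softmax depends only on differences between logits, so `big` equals `softmax([0.0, 1.0, 2.0])`.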

Modern Activation Functions (Transformers, CNNs)

Swish (Google, 2017)

f(x) = x * sigmoid(βx) (β learnable or constant =1)

Smooth, non-monotonic. Often matches or outperforms ReLU in deep nets.

# TF: tf.keras.activations.swish

GELU (Gaussian Error Linear Unit)

GELU(x) = x * Φ(x) (Φ is CDF of Gaussian).

Used in BERT, GPT, ViT. Smooth ReLU variant.

# PyTorch: torch.nn.GELU, TF: tf.keras.activations.gelu

Mish (2019)

f(x) = x * tanh(softplus(x)).

Self-regularized, slightly better than Swish on some benchmarks.
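The three formulas translate directly to NumPy. This is a sketch under two assumptions: GELU uses the exact Gaussian CDF via `math.erf` (frameworks often offer a tanh approximation as well), and softplus is computed with `np.logaddexp` for stability.

```python
import math
import numpy as np

def swish(x, beta=1.0):
    return x / (1.0 + np.exp(-beta * x))          # x * sigmoid(beta * x)

_erf = np.vectorize(math.erf)

def gelu(x):
    return 0.5 * x * (1.0 + _erf(x / math.sqrt(2.0)))   # x * Phi(x)

def mish(x):
    return x * np.tanh(np.logaddexp(0.0, x))      # softplus(x) = log(1 + e^x)

x = np.array([-2.0, 0.0, 1.0])
swish_out, gelu_out, mish_out = swish(x), gelu(x), mish(x)
```

All three return 0 at x = 0 and small negative values for moderately negative inputs, which is the non-monotonic dip the text describes.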

Trend: Smooth, non-monotonic, often with no saturation. Swish & GELU are default in many modern architectures.

Activation Function Selection Guide

Function	Range	When to use	Common in
Sigmoid	(0,1)	Binary output, probabilistic gate	Logistic regression, some attention
Tanh	(-1,1)	Zero-centered hidden layers (older RNNs)	LSTM candidate gates
ReLU	[0,∞)	Default for hidden layers (CNNs, MLPs)	ResNet, VGG, YOLO
Leaky ReLU	(-∞,∞)	Avoid dead neurons	GANs, some detection models
Softmax	(0,1) sum=1	Multi-class classification output	Classification heads
Swish / SiLU	≈[-0.28,∞)	Deep CNNs and some Transformer variants	EfficientNet, RL
GELU	≈[-0.17,∞)	NLP Transformers (BERT, GPT)	Hugging Face models

Rule of thumb:
Start with ReLU for hidden layers.
For output: Sigmoid (binary), Softmax (multi-class), Linear (regression).
If ReLU causes dead neurons → Leaky ReLU / ELU.
For Transformers: GELU.
For very deep nets: consider Swish.

Activation Functions in TensorFlow & PyTorch

TensorFlow / Keras

import tensorflow as tf
# As activation string or layer
model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(32),
    tf.keras.layers.LeakyReLU(),  # layer form is available across Keras versions
    tf.keras.layers.Dense(10, activation='softmax')
])
# Advanced: tf.keras.activations.gelu, tf.nn.swish

PyTorch

import torch.nn as nn
class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(784, 128)
        self.act1 = nn.ReLU()
        self.fc2 = nn.Linear(128, 64)
        self.act2 = nn.LeakyReLU(0.02)
        self.out = nn.Linear(64, 10)
    def forward(self, x):
        x = self.act1(self.fc1(x))
        x = self.act2(self.fc2(x))
        return self.out(x)  # with CrossEntropyLoss, no softmax needed

Activation shapes at a glance

    Sigmoid:   ──▄▄▄▄▄▄▄▄▄▄▄▄▄▄──  squashes to [0,1]
    Tanh:      ─▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄─  [-1,1]
    ReLU:      ──────────────────▄▄▄▄▄▄▄▄▄▄   max(0,x)
    Leaky ReLU:─╱╱╱╱─────────────▄▄▄▄▄▄▄▄▄   small negative slope
    Softmax:   [0.2, 0.7, 0.1] probabilities

* ASCII illustration of activation function curves

Activation Pitfalls & Best Practices

⚠️ Vanishing gradient: Avoid Sigmoid/Tanh in deep hidden layers. Use ReLU or variants.

🎯 Dead ReLU: Use Leaky ReLU or ELU if many neurons output zero forever.

💡 Numerical stability: For softmax, always subtract max(logits) before exponentiation.

🧠 Output layer: Match activation to task: linear (regression), sigmoid (binary), softmax (multi-class).

Activation Function Cheatsheet
Sigmoid 0-1
Tanh -1 to 1
ReLU max(0,x)
Leaky ReLU α=0.01
ELU α(e^x-1)
Swish x·sigmoid
GELU x·Φ(x)
Softmax ∑=1

Next Up: Loss Functions – How neural networks learn.

Loss Functions: The Compass of Neural Networks

What is a Loss Function?

A loss function (also called cost/objective function) maps the model's predictions and ground truth to a scalar value. Lower loss = better predictions. During training, backpropagation computes gradients of the loss w.r.t. weights, and optimizers update weights to minimize this loss.

Predictions (ŷ) + True Targets (y) → Loss = L(ŷ, y) → Gradient ∇L

Loss functions define the learning objective.

Regression Losses: Predicting Continuous Values

MSE (L2 Loss)

MSE = (1/n) Σ(y - ŷ)²

import numpy as np
def mse(y_true, y_pred):
    return np.mean((y_true - y_pred)**2)

Differentiable. Sensitive to outliers. The most common regression loss.

MAE (L1 Loss)

MAE = (1/n) Σ|y - ŷ|

def mae(y_true, y_pred):
    return np.mean(np.abs(y_true - y_pred))

Robust to outliers. Not differentiable at 0.

Huber Loss

L_δ = ½(y-ŷ)² if |y-ŷ| ≤ δ, else δ|y-ŷ| - ½δ²

def huber(y_true, y_pred, delta=1.0):
    error = y_true - y_pred
    abs_error = np.abs(error)
    quadratic = 0.5 * error**2                     # used where |error| <= delta
    linear = delta * abs_error - 0.5 * delta**2    # used elsewhere
    return np.mean(np.where(abs_error <= delta, quadratic, linear))

Combines MSE and MAE. Smooth, robust.

Log-Cosh Loss

L = log(cosh(ŷ - y))

Smooth approximation to MAE, twice differentiable.

Quantile Loss

L = Σ max(q(y-ŷ), (q-1)(y-ŷ))

Used to estimate quantiles and build prediction intervals.
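A NumPy sketch of the quantile (pinball) loss. It averages over samples instead of summing, an assumption made here to match the other losses in this section; the sample values are made up for illustration.

```python
import numpy as np

def quantile_loss(y_true, y_pred, q=0.5):
    error = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    # under-prediction (error > 0) costs q per unit, over-prediction costs (1 - q)
    return np.mean(np.maximum(q * error, (q - 1) * error))

y_true = np.array([1.0, 2.0, 4.0])
y_pred = np.array([2.0, 2.0, 2.0])

median_loss = quantile_loss(y_true, y_pred, q=0.5)   # 0.5, half the MAE
high_q_loss = quantile_loss(y_true, y_pred, q=0.9)   # penalizes under-prediction more
low_q_loss = quantile_loss(y_true, y_pred, q=0.1)    # penalizes over-prediction more
```

With q = 0.5 the loss is half the MAE, so minimizing it estimates the median; other values of q shift the target toward the corresponding quantile.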

Classification Losses: Probability & Decision Boundaries

Binary Cross-Entropy (BCE)

BCE = -[y log(ŷ) + (1-y) log(1-ŷ)]

def binary_crossentropy(y, y_pred):
    y_pred = np.clip(y_pred, 1e-7, 1-1e-7)  # stability
    return -np.mean(y * np.log(y_pred) + 
                    (1 - y) * np.log(1 - y_pred))

Use: Binary classification, sigmoid output.

Categorical Cross-Entropy (CCE)

CCE = -Σ y_i log(ŷ_i)

def categorical_crossentropy(y, y_pred):
    y_pred = np.clip(y_pred, 1e-7, 1.0)
    return -np.sum(y * np.log(y_pred)) / y.shape[0]

Use: Multi-class, softmax output.

Sparse CCE

Same as CCE but targets are integers (not one-hot). Memory efficient.

tf.keras.losses.SparseCategoricalCrossentropy()
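The equivalence between the two forms is easy to see in NumPy. In this sketch the sparse version indexes the predicted probability of the true class directly instead of multiplying by a one-hot matrix; the function names and sample probabilities are illustrative.

```python
import numpy as np

def categorical_crossentropy(y_onehot, y_pred):
    y_pred = np.clip(y_pred, 1e-7, 1.0)
    return -np.sum(y_onehot * np.log(y_pred)) / y_onehot.shape[0]

def sparse_categorical_crossentropy(labels, y_pred):
    y_pred = np.clip(y_pred, 1e-7, 1.0)
    picked = y_pred[np.arange(len(labels)), labels]   # probability of the true class
    return -np.mean(np.log(picked))

y_pred = np.array([[0.7, 0.2, 0.1],
                   [0.1, 0.8, 0.1]])
labels = np.array([0, 1])          # integer targets
y_onehot = np.eye(3)[labels]       # the same targets, one-hot encoded

dense_loss = categorical_crossentropy(y_onehot, y_pred)
sparse_loss = sparse_categorical_crossentropy(labels, y_pred)
```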

Hinge Loss

L = max(0, 1 - y·ŷ) (y ∈ {-1,1})

Used in SVMs; sometimes also used as the output loss of CNN classifiers.

tf.keras.losses.Hinge()

Squared Hinge

L = max(0, 1 - y·ŷ)²

Differentiable everywhere; penalizes large margin violations more heavily.
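Both hinge variants take labels in {-1, 1} and raw scores. A minimal NumPy sketch, averaging over samples (the scores below are invented for illustration):

```python
import numpy as np

def hinge(y_true, scores):
    return np.mean(np.maximum(0.0, 1.0 - y_true * scores))

def squared_hinge(y_true, scores):
    return np.mean(np.maximum(0.0, 1.0 - y_true * scores) ** 2)

y_true = np.array([1.0, -1.0, 1.0])
scores = np.array([0.5, -2.0, -1.0])   # inside margin, confidently correct, wrong side

hinge_value = hinge(y_true, scores)             # (0.5 + 0 + 2) / 3
squared_value = squared_hinge(y_true, scores)   # (0.25 + 0 + 4) / 3
```

The confidently correct sample contributes nothing, while the misclassified one dominates, especially in the squared variant.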

Numerical stability: Prefer framework implementations (e.g., tf.keras.losses.BinaryCrossentropy(from_logits=True)), which combine the log with the softmax/sigmoid in a numerically stable way.
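What "combining log and sigmoid" means can be sketched in NumPy. The function below uses the standard identity BCE(y, z) = max(z, 0) - z·y + log(1 + e^(-|z|)) on raw logits z; it illustrates the idea and is not a copy of any framework's internals.

```python
import numpy as np

def bce_with_logits(y, z):
    # never evaluates log(0), even for very large |z|
    return np.mean(np.maximum(z, 0) - z * y + np.log1p(np.exp(-np.abs(z))))

def naive_bce(y, z):
    p = 1 / (1 + np.exp(-z))
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

y = np.array([1.0, 0.0, 1.0])
z = np.array([2.0, -1.0, 0.5])
stable = bce_with_logits(y, z)
naive = naive_bce(y, z)            # agrees with the stable form for moderate logits

# Extreme logits: the naive form would hit log(0); the stable form stays finite
y_extreme = np.array([0.0, 1.0, 1.0])
z_extreme = np.array([1000.0, -1000.0, 1000.0])
stable_extreme = bce_with_logits(y_extreme, z_extreme)
```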

Probabilistic Losses: Distributions & Divergence

KL Divergence

D_KL(P||Q) = Σ P(i) log(P(i)/Q(i))

Measures how one probability distribution diverges from another. Asymmetric.

def kl_divergence(p, q):
    p = np.clip(p, 1e-7, 1)
    q = np.clip(q, 1e-7, 1)
    return np.sum(p * np.log(p / q))

Used in VAEs, variational inference.

JS Divergence

Jensen-Shannon divergence. Symmetric, smoothed version of KL.

Used in GANs, domain adaptation.

Cross-Entropy vs KL

Cross-Entropy = H(P,Q) = H(P) + D_KL(P||Q). Minimizing cross-entropy is equivalent to minimizing KL when P is fixed (ground truth).
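The identity can be confirmed numerically on any pair of distributions. The values of `p` and `q` below are arbitrary examples.

```python
import numpy as np

p = np.array([0.7, 0.2, 0.1])   # "true" distribution P
q = np.array([0.5, 0.3, 0.2])   # model distribution Q

entropy = -np.sum(p * np.log(p))          # H(P)
cross_entropy = -np.sum(p * np.log(q))    # H(P, Q)
kl = np.sum(p * np.log(p / q))            # D_KL(P || Q)

identity_holds = np.isclose(cross_entropy, entropy + kl)
```

Because H(P) does not depend on the model, the gradients of cross-entropy and KL divergence with respect to Q are identical.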

Advanced & Specialized Loss Functions

CTC Loss

Connectionist Temporal Classification. Used in speech recognition, handwriting recognition. Aligns sequences without alignment labels.

tf.nn.ctc_loss

Contrastive Loss

L = y*d² + (1-y)*max(margin-d,0)² (y = 1 for similar pairs, 0 for dissimilar; d = distance between embeddings)

Used in Siamese networks, similarity learning.

Triplet Loss

max(d(a,p)-d(a,n)+margin, 0)

Face recognition (FaceNet), embeddings.
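Both similarity losses reduce to a few lines of NumPy. The sketch assumes Euclidean distance between embeddings and averages over the batch; the function names and toy embeddings are illustrative.

```python
import numpy as np

def euclidean(a, b):
    return np.linalg.norm(a - b, axis=-1)

def contrastive_loss(x1, x2, y, margin=1.0):
    # y = 1 for similar pairs, 0 for dissimilar pairs
    d = euclidean(x1, x2)
    return np.mean(y * d**2 + (1 - y) * np.maximum(margin - d, 0.0)**2)

def triplet_loss(anchor, positive, negative, margin=1.0):
    gap = euclidean(anchor, positive) - euclidean(anchor, negative) + margin
    return np.mean(np.maximum(gap, 0.0))

anchor = np.array([[0.0, 0.0]])
positive = np.array([[0.0, 1.0]])   # distance 1 from the anchor
negative = np.array([[3.0, 4.0]])   # distance 5 from the anchor

easy = triplet_loss(anchor, positive, negative)   # 0.0: negative is already far enough
hard = triplet_loss(anchor, negative, positive)   # 5.0: roles swapped
similar_pair = contrastive_loss(anchor, positive, np.array([1.0]))   # d**2 = 1.0
```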

Dice Loss / F1 Score

1 - (2|X∩Y|)/(|X|+|Y|). For imbalanced segmentation, medical imaging.
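A NumPy sketch of the Dice loss for binary masks. The small `eps` term is an assumption added to avoid division by zero on empty masks; with soft predictions in [0, 1] the same formula gives a differentiable loss.

```python
import numpy as np

def dice_loss(y_true, y_pred, eps=1e-7):
    y_true = np.asarray(y_true, dtype=float).ravel()
    y_pred = np.asarray(y_pred, dtype=float).ravel()
    intersection = np.sum(y_true * y_pred)
    return 1.0 - (2.0 * intersection + eps) / (np.sum(y_true) + np.sum(y_pred) + eps)

mask = np.array([[1, 1, 0],
                 [0, 1, 0]])

perfect = dice_loss(mask, mask)                     # about 0
disjoint = dice_loss(mask, 1 - mask)                # about 1
partial = dice_loss(mask, [[1, 0, 0], [0, 1, 1]])   # 1 - 4/6
```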

Perceptual Loss

Loss based on feature maps of pre-trained networks (VGG). For style transfer, super-resolution.

Loss Function Selection Guide

Task Type	Recommended Loss	Output Activation	Comments
Regression (normal)	MSE	Linear	Sensitive to outliers
Regression (robust)	Huber / MAE	Linear	Less sensitive to outliers
Binary Classification	Binary Cross-Entropy	Sigmoid	Use from_logits for stability
Multi-class Classification	Categorical Cross-Entropy	Softmax	Use sparse CE for integer labels
Multi-label Classification	Binary Cross-Entropy	Sigmoid (per class)	Independent probabilities
Imbalanced Data	Weighted CE / Focal Loss	Sigmoid/Softmax	Focuses on hard samples
Similarity Learning	Contrastive / Triplet	L2 normalized	Embedding space
Generative Models	BCE (GANs), KL (VAEs)	Varies	Task specific

Quick Selection Rules:
Regression: Start with MSE. If outliers are problematic, try MAE or Huber.
Binary classification: Binary cross-entropy.
Multi-class: Categorical cross-entropy.
Matching probability distributions: KL divergence.
Sequence alignment: CTC Loss.

Loss Functions in TensorFlow & PyTorch

TensorFlow / Keras

import tensorflow as tf

# Common losses (assumes `model` is an existing tf.keras.Model)
model.compile(loss='mse', optimizer='adam')  # regression
model.compile(loss='binary_crossentropy', optimizer='adam')
model.compile(loss='categorical_crossentropy', optimizer='adam')
model.compile(loss=tf.keras.losses.Huber(delta=1.5), optimizer='adam')

# Custom loss function
def custom_mse(y_true, y_pred):
    return tf.reduce_mean(tf.square(y_true - y_pred))

PyTorch

import torch
import torch.nn as nn

criterion = nn.MSELoss()  # regression
criterion = nn.BCELoss()  # expects probabilities (apply sigmoid first)
criterion = nn.BCEWithLogitsLoss()  # stable, takes raw logits
criterion = nn.CrossEntropyLoss()  # takes raw logits, applies log-softmax internally
criterion = nn.KLDivLoss(reduction='batchmean')  # input must be log-probabilities

# Custom
class CustomLoss(nn.Module):
    def forward(self, y_pred, y_true):
        return torch.mean((y_true - y_pred)**2)

Designing Custom Loss Functions

Sometimes you need a task-specific loss. Any differentiable function that maps (y_true, y_pred) to a scalar can be a loss.

Custom Loss in TensorFlow

def weighted_mse(y_true, y_pred):
    weights = tf.where(y_true > 0.5, 2.0, 1.0)
    return tf.reduce_mean(weights * (y_true - y_pred)**2)

model.compile(loss=weighted_mse, optimizer='adam')

Custom Loss in PyTorch

import torch
import torch.nn as nn

class FocalLoss(nn.Module):
    def __init__(self, alpha=1, gamma=2):
        super().__init__()
        self.alpha = alpha
        self.gamma = gamma

    def forward(self, y_pred, y_true):
        # y_pred: raw logits, y_true: float targets in {0, 1}
        bce = nn.functional.binary_cross_entropy_with_logits(y_pred, y_true, reduction='none')
        p_t = torch.exp(-bce)  # probability assigned to the true class
        focal = self.alpha * (1 - p_t)**self.gamma * bce
        return focal.mean()

Tip: Ensure your custom loss is differentiable and numerically stable. Test with small inputs.

Loss Function Pitfalls & Best Practices

⚠️ Wrong loss for task: Using MSE for classification leads to poor convergence and poorly calibrated probability estimates.

⚠️ Ignoring class imbalance: Use weighted cross-entropy or focal loss.

✅ Numerical stability: Use `from_logits=True` or `BCEWithLogitsLoss` to avoid log(0).

✅ Monitor loss curves: Is the loss decreasing, plateauing, or NaN? The curve helps you debug.

Loss Landscape: The shape of the loss function affects optimization. MSE and cross-entropy are convex in the weights of a linear model, but the loss of a neural network is non-convex in its weights.

Loss Functions Cheatsheet
MSE Regression
MAE Robust reg.
Huber Smooth robust
BCE Binary cls
CCE Multi-class
KL Divergence
Hinge Max-margin
CTC Sequence

Backpropagation: The Engine of Deep Learning

Why Backpropagation? The Credit Assignment Problem

In a multi-layer network, how does a small change in an early weight affect the final loss? Backpropagation (Rumelhart, Hinton, and Williams, 1986) solves this credit assignment problem by recursively applying the chain rule.

Historical Breakthrough

1986 Backpropagation popularized
1989 Universal approximation proven
2012 AlexNet (backprop + GPU) wins ImageNet

Intuition

Backpropagation = forward pass computes predictions, backward pass propagates error gradients from output to each weight. "How much did each weight contribute to the error?"

Prerequisites: Partial derivatives, chain rule, gradient descent. We'll derive everything step by step.

Chain Rule: From Calculus to Computation

Backpropagation is the chain rule, applied efficiently to millions of parameters.

Scalar Chain Rule

If y = f(g(x)), then dy/dx = (dy/dg) * (dg/dx)

Multivariate: For L = f(z), z = Wx + b:

∂L/∂W = (∂L/∂z) · (∂z/∂W)

∂L/∂W₁ = ∂L/∂a₃ · ∂a₃/∂z₃ · ∂z₃/∂a₂ · ∂a₂/∂z₂ · ∂z₂/∂a₁ · ∂a₁/∂z₁ · ∂z₁/∂W₁

The gradient flows backward through every intermediate function.

Computational Graphs: Visualizing Backprop

Modern frameworks (TensorFlow, PyTorch) build a computational graph during forward pass, then traverse it in reverse to compute gradients.

Forward:

x → *3 → +5 → z

Backward:

dz/dz = 1
dz/da = dz/dz * 1 = 1 (through the +5 node, where a = x*3)
dz/dx = dz/da * 3 = 3 (through the *3 node)

Automatic Differentiation

Forward mode: compute derivatives alongside values
Reverse mode (backprop): one forward pass, one backward pass → all gradients
Efficient for many parameters (typical deep learning)

🧮 Manual backprop through a simple graph (plain Python)

# Forward pass: z = (x * 3) + 5
x = 2.0
a = x * 3      # a = 6
z = a + 5      # z = 11

# Backward pass (dz/dz = 1)
dz = 1
da = dz * 1    # dz/da = 1
dx = da * 3    # da/dx = 3
print(dx)      # Gradient = 3

Backpropagation Through a 2-Layer MLP

X → (W1) → z1 → σ → a1 → (W2) → z2 → σ → a2 → Loss L(a2,y)
← dW1 ← dz1 ← da1 ← dW2 ← dz2 ← dL/da2

Forward Pass (Caching)

z1 = X @ W1 + b1
a1 = sigmoid(z1)
z2 = a1 @ W2 + b2
a2 = sigmoid(z2)
loss = binary_crossentropy(y, a2)

Backward Pass (Gradients)

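da2 = -(y/a2 - (1-y)/(1-a2))  # BCE derivative dL/da2
dz2 = da2 * a2 * (1 - a2)     # sigmoid'(z2) = a2*(1-a2); the product simplifies to a2 - y
dW2 = a1.T @ dz2
db2 = np.sum(dz2, axis=0)

da1 = dz2 @ W2.T
dz1 = da1 * a1 * (1 - a1)     # sigmoid'(z1) = a1*(1-a1)
dW1 = X.T @ dz1
db1 = np.sum(dz1, axis=0)
# divide the gradients by the batch size m if the loss is averaged over samples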

Key pattern: Gradient w.r.t weight = activation_in.T @ gradient_out. This is consistent for all fully connected layers.

Pure NumPy Backprop: Full Training Loop

No frameworks, just math and NumPy.

⚙️ NeuralNetwork with backprop (XOR example)

import numpy as np

class NeuralNet:
    def __init__(self, input_size, hidden_size, output_size, lr=0.5):
        self.lr = lr
        self.W1 = np.random.randn(input_size, hidden_size) * 0.5
        self.b1 = np.zeros((1, hidden_size))
        self.W2 = np.random.randn(hidden_size, output_size) * 0.5
        self.b2 = np.zeros((1, output_size))
    
    def sigmoid(self, x):
        return 1 / (1 + np.exp(-x))
    
    def sigmoid_deriv(self, a):
        # expects the sigmoid output a = sigmoid(z), not the pre-activation z
        return a * (1 - a)
    
    def forward(self, X):
        self.z1 = X @ self.W1 + self.b1
        self.a1 = self.sigmoid(self.z1)
        self.z2 = self.a1 @ self.W2 + self.b2
        self.a2 = self.sigmoid(self.z2)
        return self.a2
    
    def backward(self, X, y, output):
        m = X.shape[0]
        # Output layer gradients
        self.dz2 = output - y.reshape(-1,1)     # BCE derivative simplification
        self.dW2 = (1/m) * self.a1.T @ self.dz2
        self.db2 = (1/m) * np.sum(self.dz2, axis=0, keepdims=True)
        # Hidden layer gradients
        self.da1 = self.dz2 @ self.W2.T
        self.dz1 = self.da1 * self.sigmoid_deriv(self.a1)
        self.dW1 = (1/m) * X.T @ self.dz1
        self.db1 = (1/m) * np.sum(self.dz1, axis=0, keepdims=True)
    
    def update(self):
        self.W1 -= self.lr * self.dW1
        self.b1 -= self.lr * self.db1
        self.W2 -= self.lr * self.dW2
        self.b2 -= self.lr * self.db2
    
    def train(self, X, y, epochs=5000):
        for i in range(epochs):
            output = self.forward(X)
            self.backward(X, y, output)
            self.update()
            if i % 1000 == 0:
                loss = np.mean((output - y)**2)
                print(f'Epoch {i}, Loss: {loss:.6f}')

# XOR dataset
X = np.array([[0,0],[0,1],[1,0],[1,1]])
y = np.array([[0],[1],[1],[0]])

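net = NeuralNet(2, 4, 1, lr=0.7)  # named `net` to avoid shadowing the common `torch.nn as nn` alias
net.train(X, y, epochs=6000)
print("Predictions:\n", net.forward(X))

This implementation typically converges for XOR, the classic non-linear problem that a single perceptron cannot solve. An unlucky random initialization can occasionally stall, in which case re-running usually helps.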

Gradient Checking: Verify Your Backprop

Use a numerical approximation of the gradients to confirm that the analytical backprop is correct.

🔬 Numerical gradient vs backprop

def numerical_gradient(f, params, epsilon=1e-7):
    """Finite difference approximation"""
    grads = []
    for param in params:
        grad = np.zeros_like(param)
        it = np.nditer(param, flags=['multi_index'], op_flags=['readwrite'])
        while not it.finished:
            idx = it.multi_index
            old_val = param[idx]
            param[idx] = old_val + epsilon
            f_plus = f()
            param[idx] = old_val - epsilon
            f_minus = f()
            grad[idx] = (f_plus - f_minus) / (2 * epsilon)
            param[idx] = old_val
            it.iternext()
        grads.append(grad)
    return grads

# Use: compare with backprop gradients; a relative error
# ||g_num - g_bp|| / (||g_num|| + ||g_bp||) below about 1e-6 is good

Always gradient-check when implementing backprop from scratch!
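A worked gradient check on a problem small enough to verify by hand. This sketch assumes a least-squares loss, whose analytic gradient is X.T @ (XW - y), and compares it with central differences using a relative error; the data is random and the step size 1e-5 is an assumed choice.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 3))
y = rng.normal(size=(6, 1))
W = rng.normal(size=(3, 1))

def loss():
    return 0.5 * np.sum((X @ W - y) ** 2)

# Analytic gradient of the loss with respect to W
analytic = X.T @ (X @ W - y)

# Central-difference approximation, one entry of W at a time
numeric = np.zeros_like(W)
eps = 1e-5
for idx in np.ndindex(*W.shape):
    old = W[idx]
    W[idx] = old + eps
    f_plus = loss()
    W[idx] = old - eps
    f_minus = loss()
    W[idx] = old                      # restore the original value
    numeric[idx] = (f_plus - f_minus) / (2 * eps)

rel_error = np.linalg.norm(analytic - numeric) / (
    np.linalg.norm(analytic) + np.linalg.norm(numeric))
```

A relative error this small indicates matching gradients; values near 1e-2 or larger usually point to a bug in the analytic derivation.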

Vanishing / Exploding Gradients

Deep networks suffer from unstable gradients. Why?

Vanishing

Sigmoid/tanh saturate → gradients → 0. Early layers learn extremely slowly.

# Solution: ReLU, residual connections, batch norm

Exploding

Large weights → gradients multiply exponentially → NaN.

# Solution: Gradient clipping, proper initialization
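Gradient clipping by global norm can be sketched in a few lines of NumPy. The function name and the small constant guarding against division by zero are assumptions for illustration; frameworks ship their own versions (for example `torch.nn.utils.clip_grad_norm_`).

```python
import numpy as np

def clip_by_global_norm(grads, max_norm=1.0):
    # norm over all gradient arrays taken together
    total_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    scale = min(1.0, max_norm / (total_norm + 1e-12))
    return [g * scale for g in grads], total_norm

grads = [np.array([3.0]), np.array([4.0])]          # global norm = 5
clipped, norm_before = clip_by_global_norm(grads, max_norm=1.0)

small = [np.array([0.3]), np.array([0.4])]          # global norm = 0.5, left unchanged
unclipped, _ = clip_by_global_norm(small, max_norm=1.0)
```

Scaling every gradient by the same factor keeps the update direction intact and only limits its length.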

Modern mitigations
ReLU/Leaky ReLU activations
Xavier/He initialization
Batch Normalization
Residual connections (ResNet)
Gradient clipping

Backprop in TensorFlow & PyTorch

Autograd computes gradients automatically, but understanding backprop helps you debug and design architectures.

TensorFlow

with tf.GradientTape() as tape:
    y_pred = model(X)
    loss = tf.reduce_mean(tf.keras.losses.MSE(y, y_pred))  # reduce per-sample losses to a scalar
grads = tape.gradient(loss, model.trainable_variables)
optimizer.apply_gradients(zip(grads, model.trainable_variables))

PyTorch

optimizer.zero_grad()  # clear gradients accumulated by the previous step
y_pred = model(X)
loss = nn.MSELoss()(y_pred, y)
loss.backward()  # <-- one line backprop!
optimizer.step()

Key concepts: autograd, computational graphs, dynamic vs. static graphs.

Backpropagation = The Learning Algorithm

Nearly every neural network, from 3-layer MLPs to GPT-4, is trained using backpropagation or a variant of it. Mastering backprop lets you implement new architectures, diagnose vanishing gradients, and understand how deep learning really works.

When You Need Deep Backprop Knowledge

Custom Layers

Implement your own forward/backward in frameworks.

Debugging

Why are gradients NaN? Why isn't this layer learning?

Research

Modify gradient flow (e.g., reversible nets, synthetic gradients).

Ready for advanced architectures? Next, learn how optimizers (SGD, Adam) use these gradients to update weights.

Optimizers: Driving Neural Network Training

What is an Optimizer?

Optimizers are algorithms that update model parameters (weights) to minimize the loss function. They determine how to move along the gradient direction: how fast, with what momentum, and with what adaptive scaling. The choice of optimizer strongly affects training speed, stability, and final performance.

θₜ → ∇L(θₜ) → Optimizer Update Rule → θₜ₊₁ = θₜ - lr · f(∇L, history)

Optimizers incorporate gradient history, adaptive learning rates, and momentum.

Gradient Descent Variants

Batch GD

Uses the entire dataset to compute each gradient. θ = θ - lr · ∇L(θ; all data)

Slow but stable. Not feasible for large datasets.

Stochastic GD (SGD)

θ = θ - lr · ∇L(θ; xᵢ, yᵢ)

One update per sample. High variance; suits online learning.

Mini-batch GD

θ = θ - lr · ∇L(θ; batch)

Balanced trade-off and the most common choice. Typical batch sizes are 32-512.
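A full mini-batch loop looks like this in NumPy. The sketch assumes a noiseless linear-regression problem with made-up true weights so convergence is easy to check; the batch size of 16 and lr of 0.1 are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(128, 3))
w_true = np.array([1.5, -2.0, 0.5])   # hypothetical ground-truth weights
y = X @ w_true

w = np.zeros(3)
lr, batch_size = 0.1, 16

for epoch in range(100):
    order = rng.permutation(len(X))              # reshuffle every epoch
    for start in range(0, len(X), batch_size):
        idx = order[start:start + batch_size]
        Xb, yb = X[idx], y[idx]
        grad = 2.0 / len(idx) * Xb.T @ (Xb @ w - yb)   # gradient of the batch MSE
        w -= lr * grad
```

Setting `batch_size = len(X)` turns this into batch GD, and `batch_size = 1` into per-sample SGD.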

Mini-batch SGD from scratch

import numpy as np

def sgd_update(params, grads, lr=0.01):
    """Simple SGD update"""
    for param, grad in zip(params, grads):
        param -= lr * grad
    return params

Momentum & Nesterov Accelerated Gradient

SGD with Momentum

vₜ = βvₜ₋₁ + (1-β)∇L(θₜ)
θₜ₊₁ = θₜ - lr · vₜ

Accumulates velocity to move through ravines and accelerate convergence. β is typically 0.9.

def momentum_update(params, grads, v, lr=0.01, beta=0.9):
    for i, (p, g) in enumerate(zip(params, grads)):
        v[i] = beta * v[i] + (1 - beta) * g
        p -= lr * v[i]
    return params, v

Nesterov Accelerated Gradient (NAG)

vₜ = βvₜ₋₁ + (1-β)∇L(θₜ - lr·βvₜ₋₁)
θₜ₊₁ = θₜ - lr · vₜ

Looks ahead at the approximate future position. Often faster and more stable than standard momentum.

Intuition: Momentum is like a ball rolling downhill: it accumulates speed. Nesterov is like a smart ball that looks ahead before updating.
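The look-ahead step can be sketched on a one-dimensional quadratic. The code evaluates the gradient at the look-ahead point θ - lr·β·v before updating the velocity; the objective 0.5·θ² and the name `grad_fn` are assumptions chosen to keep the example tiny.

```python
def grad_fn(theta):
    return theta            # gradient of the toy objective 0.5 * theta**2

def nesterov_step(theta, v, lr=0.1, beta=0.9):
    lookahead = theta - lr * beta * v              # where momentum alone would take us
    v = beta * v + (1 - beta) * grad_fn(lookahead)
    theta = theta - lr * v
    return theta, v

theta, v = 5.0, 0.0
for _ in range(500):
    theta, v = nesterov_step(theta, v)
```

After 500 steps `theta` is very close to the minimum at 0. Standard momentum is the same loop with `grad_fn(theta)` in place of `grad_fn(lookahead)`.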

Adaptive Learning Rate Methods

AdaGrad

Gₜ = Gₜ₋₁ + (∇L(θₜ))²
θₜ₊₁ = θₜ - lr/√(Gₜ + ε) · ∇L(θₜ)

Adapts per-parameter learning rates. Good for sparse data. The effective learning rate decays monotonically.

Weakness: the effective LR can become vanishingly small, which stalls training.
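A NumPy sketch of the AdaGrad update, again on the toy objective 0.5·θ². As in many implementations, ε is added outside the square root here, which is a slight variation of the formula and an assumption of this sketch.

```python
import numpy as np

def adagrad_step(param, grad, G, lr=0.5, eps=1e-8):
    G = G + grad ** 2                                  # accumulated squared gradients
    param = param - lr * grad / (np.sqrt(G) + eps)     # per-parameter scaled step
    return param, G

theta = np.array([5.0])
G = np.zeros(1)
history = []
for _ in range(50):
    grad = theta.copy()          # gradient of 0.5 * theta**2
    theta, G = adagrad_step(theta, grad, G)
    history.append(theta[0])
```

The first step moves θ from 5.0 to about 4.5; later steps shrink because `G` only grows, which is the decaying-LR weakness noted above.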

RMSprop

E[g²]ₜ = βE[g²]ₜ₋₁ + (1-β)(∇L)²
θₜ₊₁ = θₜ - lr/√(E[g²]ₜ + ε) · ∇L

Proposed by Geoffrey Hinton in lecture notes rather than a formal paper, but widely used. Fixes AdaGrad's decaying LR problem by replacing the running sum with an exponential moving average. β is typically 0.9.

RMSprop implementation

def rmsprop_update(params, grads, cache, lr=0.001, beta=0.9, eps=1e-8):
    for i, (p, g) in enumerate(zip(params, grads)):
        cache[i] = beta * cache[i] + (1 - beta) * g**2
        p -= lr * g / (np.sqrt(cache[i]) + eps)
    return params, cache

Adam & The Adaptive Moment Family

Adam (Adaptive Moment Estimation)

mₜ = β₁mₜ₋₁ + (1-β₁)∇L
vₜ = β₂vₜ₋₁ + (1-β₂)(∇L)²
m̂ₜ = mₜ/(1-β₁ᵗ), v̂ₜ = vₜ/(1-β₂ᵗ)
θₜ₊₁ = θₜ - lr · m̂ₜ/(√v̂ₜ + ε)

Combines momentum (first moment) and RMSprop (second moment) with bias-corrected estimates. Typical values: β₁=0.9, β₂=0.999, ε=1e-8 (Keras defaults to 1e-7).

Default optimizer for most tasks

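AdamW

θₜ₊₁ = θₜ - lr · (m̂ₜ/(√v̂ₜ+ε) + λθₜ)

Decoupled weight decay: the decay term λθₜ is applied directly to the weights instead of being added to the gradient. This often generalizes better than Adam with L2 regularization and is generally preferred when weight decay is used.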

# PyTorch: torch.optim.AdamW
# TensorFlow: tf.keras.optimizers.AdamW
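The decoupling is easiest to see in code. This is a conceptual NumPy sketch, not the PyTorch or TensorFlow implementation; the default hyperparameters are assumptions.

```python
import numpy as np

def adamw_step(param, grad, m, v, t, lr=1e-3, b1=0.9, b2=0.999,
               eps=1e-8, weight_decay=0.01):
    m = b1 * m + (1 - b1) * grad
    v = b2 * v + (1 - b2) * grad ** 2
    m_hat = m / (1 - b1 ** t)           # bias correction, t is the 1-based step count
    v_hat = v / (1 - b2 ** t)
    # weight decay acts on the parameter directly and is not rescaled by v_hat
    param = param - lr * (m_hat / (np.sqrt(v_hat) + eps) + weight_decay * param)
    return param, m, v

param = np.array([1.0, -2.0])
zeros = np.zeros(2)

# With a zero gradient only the decay term acts: param shrinks by a factor (1 - lr * wd)
decayed, _, _ = adamw_step(param, zeros, zeros, zeros, t=1)

# First step with a real gradient: the Adam part is close to lr * sign(grad)
stepped, _, _ = adamw_step(param, np.array([0.5, -0.5]), zeros, zeros, t=1)
```

In Adam with L2 regularization, the term `weight_decay * param` would instead be added to `grad` and then divided by `sqrt(v_hat)`, so heavily updated weights would be decayed less.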

Nadam

Adam + Nesterov momentum. Slightly faster convergence.

AMSGrad

Variant that uses maximum of past squared gradients. Addresses convergence issues.

AdaBelief

Stepsize scaled by belief in observed gradient. More stable.

Adam implementation intuition

# Simplified Adam update (conceptual)
def adam_step(param, grad, m, v, t, lr=0.001, b1=0.9, b2=0.999):
    m = b1 * m + (1 - b1) * grad
    v = b2 * v + (1 - b2) * grad**2
    m_hat = m / (1 - b1**t)
    v_hat = v / (1 - b2**t)
    param -= lr * m_hat / (np.sqrt(v_hat) + 1e-7)
    return param, m, v

Modern & Emerging Optimizers

Lion (EvoLved Sign Momentum)

cₜ = β₁mₜ₋₁ + (1-β₁)∇L
θₜ₊₁ = θₜ - lr · (sign(cₜ) + λθₜ)
mₜ = β₂mₜ₋₁ + (1-β₂)∇L

Discovered through symbolic program search (Google, 2023). It tracks only the momentum, so it needs less memory than Adam, and every parameter update has the same magnitude because of the sign operation.
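A NumPy sketch of one Lion step, assuming the formulation from the original Lion paper: the sign is taken of an interpolation between the previous momentum and the gradient (weight β₁), and the momentum is then updated separately (weight β₂). The defaults β₁=0.9 and β₂=0.99 follow that paper; the learning rate and the sample values are illustrative.

```python
import numpy as np

def lion_step(param, grad, m, lr=1e-4, b1=0.9, b2=0.99, weight_decay=0.0):
    update = np.sign(b1 * m + (1 - b1) * grad)         # every entry is -1, 0, or +1
    param = param - lr * (update + weight_decay * param)
    m = b2 * m + (1 - b2) * grad                       # momentum is the only state kept
    return param, m

param = np.array([1.0, -2.0])
grad = np.array([0.3, -0.7])
m = np.zeros(2)

new_param, new_m = lion_step(param, grad, m)
```

Each parameter moves by exactly `lr` regardless of the gradient's size, which is why Lion is usually run with a smaller learning rate than Adam.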

Adafactor

Memory-efficient Adam for large models. Factorizes second moment estimates. Used in T5.

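LAMB & LARS

LARS (Layer-wise Adaptive Rate Scaling) and LAMB (Layer-wise Adaptive Moments for Batch training) scale the update separately for each layer. Designed for large-batch training (BERT, ResNet on TPUs).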

Learning Rate Scheduling

Even with adaptive optimizers, scheduling the learning rate improves convergence.

Step Decay

Drop LR by factor every few epochs.

# TF: tf.keras.optimizers.schedules.ExponentialDecay with staircase=True
# PyTorch: torch.optim.lr_scheduler.StepLR

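Cosine Annealing

Smooth decay along a cosine curve. Often combined with warm restarts, which make the schedule cyclic.

# TF: tf.keras.optimizers.schedules.CosineDecay
# PyTorch: torch.optim.lr_scheduler.CosineAnnealingLR

Warmup

Linear increase from near zero to the peak LR over the first steps. Stabilizes large model training.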

ReduceLROnPlateau

Reduce LR when validation loss plateaus.

Best practice: Use learning rate warmup for Transformers and very deep networks. Cosine decay often outperforms step decay.
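Warmup followed by cosine decay fits in one small function. The sketch below is framework-independent; the function name `lr_at` and the default step counts are hypothetical values for illustration.

```python
import math

def lr_at(step, peak_lr=1e-3, warmup_steps=100, total_steps=1000, min_lr=0.0):
    if step < warmup_steps:
        # linear warmup from 0 to peak_lr
        return peak_lr * step / warmup_steps
    # cosine decay from peak_lr down to min_lr over the remaining steps
    progress = min(1.0, (step - warmup_steps) / (total_steps - warmup_steps))
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))

schedule = [lr_at(s) for s in range(1001)]
```

The schedule starts at 0, reaches `peak_lr` at step 100, passes through half of `peak_lr` at the midpoint of the decay (step 550), and ends at `min_lr`.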

Optimizer Selection Guide

Optimizer	Adaptive	Momentum	When to use	Memory
SGD	❌	❌	Simple models; CV when combined with momentum	Low
SGD+Momentum	❌	✅	Classic CNNs, needs LR tuning	Low
RMSprop	✅	❌	RNNs, online learning	Medium
Adam	✅	✅	Default for most tasks	Medium
AdamW	✅	✅	Transformers, NLP, better generalization	Medium
Nadam	✅	✅ (Nesterov)	Slightly faster Adam	Medium
Lion	❌ (sign-based)	✅	Memory efficient, vision tasks	Low
Adafactor	✅	Optional	Giant models (LLMs)	Very low

Quick Selection Rules:
Start with AdamW: it works well out of the box.
For NLP / Transformers: AdamW with cosine decay + warmup.
For Computer Vision: SGD with momentum can outperform Adam (requires tuning).
For large models (>1B params): Adafactor or Lion to save memory.
For sparse data: AdaGrad or Adam.

Optimizers in TensorFlow & PyTorch

TensorFlow / Keras

import tensorflow as tf

# Common optimizers (assumes `model` and `loss_fn` are defined elsewhere)
model.compile(optimizer='sgd', loss=loss_fn)
model.compile(optimizer='adam', loss=loss_fn)
model.compile(optimizer=tf.keras.optimizers.AdamW(learning_rate=1e-4, weight_decay=1e-4), loss=loss_fn)

# Learning rate schedule
lr_schedule = tf.keras.optimizers.schedules.CosineDecay(1e-3, decay_steps=10000)
optimizer = tf.keras.optimizers.Adam(lr_schedule)

# Custom training step
@tf.function
def train_step(x, y):
    with tf.GradientTape() as tape:
        y_pred = model(x)
        loss = loss_fn(y, y_pred)
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    return loss

PyTorch

import torch.optim as optim

# Optimizers
optimizer = optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
optimizer = optim.Adam(model.parameters(), lr=0.001)
optimizer = optim.AdamW(model.parameters(), lr=0.001, weight_decay=0.01)

# Learning rate schedulers (pick one)
scheduler = optim.lr_scheduler.StepLR(optimizer, step_size=30, gamma=0.1)
# scheduler = optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=50)
# scheduler = optim.lr_scheduler.ReduceLROnPlateau(optimizer, patience=5)  # needs scheduler.step(val_loss)

# Training loop
for epoch in range(epochs):
    for x, y in dataloader:
        optimizer.zero_grad()
        y_pred = model(x)
        loss = loss_fn(y_pred, y)
        loss.backward()
        optimizer.step()
    scheduler.step()  # once per epoch

Optimizer Hyperparameter Tuning

Learning Rate: Most critical. Defaults: Adam 1e-3, SGD 1e-2. Use LR range test.

Batch Size: Affects gradient noise. Tune together with LR.

Weight Decay: AdamW: 0.01-0.1, SGD: 1e-4. Prevents overfitting.

LR Range Test: Increase the LR exponentially each batch and plot the loss. A good LR usually lies somewhat below the point where the loss starts to diverge, in the region where it falls fastest.

Optimizer Pitfalls & Solutions

⚠️ Adam generalizes worse than SGD? Partly a myth. AdamW closes much of the gap, though well-tuned SGD can still come out ahead.

⚠️ Loss not decreasing: the LR is too high or too low, gradient clipping is needed, or there is a bug in the model.

✅ Gradient clipping: Essential for RNNs and Transformers. A common choice is clipping the global norm to 1.0.

✅ Debug: Monitor gradient norms per layer to spot vanishing or exploding gradients.

Optimizer Cheatsheet
SGD+M CV
Adam Default
AdamW Best overall
RMSprop RNN
Lion Efficient
Adafactor LLMs
Nadam Slightly faster
AdaGrad Sparse
