Neural Network Fundamentals
Neural networks, activation functions, loss functions, backpropagation, and optimizers for deep learning.
Neural Networks Basics
The Perceptron — First Neural Model
Invented by Frank Rosenblatt in 1958, the perceptron is the simplest neural network: a single neuron that classifies linearly separable patterns.
How it works
- 1 Weighted sum: z = w·x + b
- 2 Step activation: 1 if z ≥ 0 else 0
- 3 Update: w = w + lr*(y - ŷ)*x
Limitation
Only linearly separable functions (AND, OR) can be learned; XOR cannot. This limitation, highlighted by Minsky and Papert in 1969, contributed to the first AI winter and later motivated multi-layer networks.
key insight: depth matters
import numpy as np

class Perceptron:
    def __init__(self, lr=0.01, epochs=15):
        self.lr = lr
        self.epochs = epochs
        self.weights = None
        self.bias = None
    def activation(self, z):
        return 1 if z >= 0 else 0
    def fit(self, X, y):
        n_samples, n_features = X.shape
        self.weights = np.zeros(n_features)
        self.bias = 0
        for _ in range(self.epochs):
            for idx, x_i in enumerate(X):
                linear = np.dot(x_i, self.weights) + self.bias
                y_pred = self.activation(linear)
                update = self.lr * (y[idx] - y_pred)
                self.weights += update * x_i
                self.bias += update
    def predict(self, X):
        linear = np.dot(X, self.weights) + self.bias
        return np.array([self.activation(z) for z in linear])
Try it on the AND gate: starting from zero weights, it typically converges in fewer than 10 epochs.
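The perceptron rule can be tried on the AND and XOR truth tables with a compact, function-based sketch. The helper names `train_perceptron` and `predict` are illustrative and not part of the class above, and `lr=1.0` is an assumption chosen so that all arithmetic stays in exact integers.

```python
import numpy as np

def train_perceptron(X, y, lr=1.0, epochs=15):
    """Train a single step-activation neuron with the perceptron rule."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        for x_i, y_i in zip(X, y):
            y_pred = 1 if np.dot(x_i, w) + b >= 0 else 0
            update = lr * (y_i - y_pred)   # zero when the prediction is correct
            w += update * x_i
            b += update
    return w, b

def predict(X, w, b):
    """Apply the learned step function to every row of X."""
    return (X @ w + b >= 0).astype(int)

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y_and = np.array([0, 0, 0, 1])
y_xor = np.array([0, 1, 1, 0])

w, b = train_perceptron(X, y_and)
and_pred = predict(X, w, b)
print("AND predictions:", and_pred)

w_x, b_x = train_perceptron(X, y_xor)
xor_pred = predict(X, w_x, b_x)
print("XOR predictions:", xor_pred)  # cannot match y_xor: no separating line exists
```

The AND targets are reproduced, while the XOR run never settles on a correct set of weights, whatever the number of epochs.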
Activation Functions: Non-linearity is key
Without activation functions, stacked linear layers collapse into a single linear transformation. Non-linear activations let deep networks approximate a very broad class of functions.
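The collapse of stacked linear layers can be checked numerically. The following is a small sketch with randomly chosen shapes and a fixed seed (both assumptions made for illustration): two linear layers are compared against one equivalent layer, and then against the same stack with a ReLU in between.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))
W1, b1 = rng.normal(size=(3, 4)), rng.normal(size=(1, 4))
W2, b2 = rng.normal(size=(4, 2)), rng.normal(size=(1, 2))

# Two linear layers without an activation in between
two_layers = (X @ W1 + b1) @ W2 + b2

# The same mapping written as a single linear layer
W, b = W1 @ W2, b1 @ W2 + b2
one_layer = X @ W + b
collapsed = np.allclose(two_layers, one_layer)

# Inserting a ReLU breaks the equivalence
with_relu = np.maximum(0, X @ W1 + b1) @ W2 + b2
relu_differs = not np.allclose(with_relu, one_layer)

print(collapsed, relu_differs)
```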
Sigmoid
def sigmoid(x):
    return 1/(1+np.exp(-x))
Range (0,1), great for binary output, but prone to vanishing gradients.
Tanh
def tanh(x):
    return np.tanh(x)
Range (-1,1), zero-centered, stronger gradients than sigmoid.
ReLU
def relu(x):
    return np.maximum(0,x)
No saturation for positive inputs, sparse activations; risk of dead neurons.
Leaky ReLU
def leaky_relu(x, alpha=0.1):
    return np.where(x>0, x, alpha*x)
Softmax
def softmax(x):
    ex = np.exp(x - np.max(x))
    return ex / ex.sum()
Multi-class probability.
Forward Propagation & Backpropagation
Forward pass
Compute activations layer by layer, cache intermediate values for gradient.
Backward pass (chain rule)
∂L/∂W = (∂L/∂a) * (∂a/∂z) * (∂z/∂W)
# assume sigmoid activations and binary cross-entropy loss,
# for which the output-layer gradient simplifies to dL/dz2 = a2 - y
def backward(self, X, y, a1, a2):
    m = X.shape[0]
    # output layer gradient
    dz2 = a2 - y.reshape(-1,1) # dL/dz2
    dW2 = (1/m) * a1.T @ dz2
    db2 = (1/m) * np.sum(dz2, axis=0, keepdims=True)
    # hidden layer gradient
    da1 = dz2 @ self.W2.T
    dz1 = da1 * (a1 * (1 - a1)) # sigmoid derivative
    dW1 = (1/m) * X.T @ dz1
    db1 = (1/m) * np.sum(dz1, axis=0, keepdims=True)
    return dW1, db1, dW2, db2
Multi-Layer Perceptron (MLP) from Scratch
Complete implementation of a flexible neural network with one hidden layer using only NumPy. Foundation for modern deep learning.
import numpy as np

class MLP:
    def __init__(self, input_size, hidden_size, output_size, lr=0.1):
        self.lr = lr
        self.W1 = np.random.randn(input_size, hidden_size) * 0.5
        self.b1 = np.zeros((1, hidden_size))
        self.W2 = np.random.randn(hidden_size, output_size) * 0.5
        self.b2 = np.zeros((1, output_size))
    def sigmoid(self, x):
        return 1 / (1 + np.exp(-x))
    def sigmoid_deriv(self, a):
        # expects the sigmoid output a = sigmoid(z), not z itself
        return a * (1 - a)
    def forward(self, X):
        self.z1 = X @ self.W1 + self.b1
        self.a1 = self.sigmoid(self.z1)
        self.z2 = self.a1 @ self.W2 + self.b2
        self.a2 = self.sigmoid(self.z2)
        return self.a2
    def backward(self, X, y, output):
        m = X.shape[0]
        self.dz2 = output - y.reshape(-1,1)
        self.dW2 = (1/m) * self.a1.T @ self.dz2
        self.db2 = (1/m) * np.sum(self.dz2, axis=0, keepdims=True)
        self.da1 = self.dz2 @ self.W2.T
        self.dz1 = self.da1 * self.sigmoid_deriv(self.a1)
        self.dW1 = (1/m) * X.T @ self.dz1
        self.db1 = (1/m) * np.sum(self.dz1, axis=0, keepdims=True)
    def update(self):
        self.W1 -= self.lr * self.dW1
        self.b1 -= self.lr * self.db1
        self.W2 -= self.lr * self.dW2
        self.b2 -= self.lr * self.db2
    def fit(self, X, y, epochs=1000):
        y = y.reshape(-1, 1)  # match the (m, 1) output shape to avoid accidental broadcasting
        for i in range(epochs):
            output = self.forward(X)
            self.backward(X, y, output)
            self.update()
            if i % 200 == 0:
                loss = np.mean((output - y)**2)  # MSE, reported for monitoring only
                print(f"epoch {i}, loss: {loss:.6f}")
Neural Nets in Keras & PyTorch
TensorFlow/Keras
import tensorflow as tf
model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation='relu', input_shape=(4,)),
    tf.keras.layers.Dense(32, activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid')
])
model.compile(optimizer='adam', loss='mse')
PyTorch
import torch
import torch.nn as nn

class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(4, 8)
        self.fc2 = nn.Linear(8, 4)
        self.out = nn.Linear(4, 1)
    def forward(self, x):
        x = torch.relu(self.fc1(x))
        x = torch.relu(self.fc2(x))
        return torch.sigmoid(self.out(x))
Weight Initialization & Optimizers
Initialization
- Zero init → symmetry, no learning
- Small random (0.01) – ok for shallow
- Xavier/Glorot for sigmoid/tanh
- He init for ReLU
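The initialization schemes listed above can be sketched in NumPy. The normal-distribution variants are assumed here (uniform variants also exist), and the function names and layer sizes are illustrative only.

```python
import numpy as np

def xavier_init(fan_in, fan_out, rng):
    """Glorot/Xavier normal: variance 2 / (fan_in + fan_out)."""
    std = np.sqrt(2.0 / (fan_in + fan_out))
    return rng.normal(0.0, std, size=(fan_in, fan_out))

def he_init(fan_in, fan_out, rng):
    """He normal: variance 2 / fan_in, intended for ReLU layers."""
    std = np.sqrt(2.0 / fan_in)
    return rng.normal(0.0, std, size=(fan_in, fan_out))

rng = np.random.default_rng(0)
W_xavier = xavier_init(256, 128, rng)
W_he = he_init(256, 128, rng)
print(W_xavier.std(), W_he.std())  # empirical standard deviations of the samples
```

He initialization draws from a wider distribution than Xavier for the same layer, which compensates for ReLU zeroing roughly half of the activations.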
Optimizers
Batch GD, SGD, Mini-batch. Momentum accelerates updates along consistent gradient directions; Adam and RMSprop adapt per-parameter learning rates.
optimizer = tf.keras.optimizers.Adam(learning_rate=0.001)
Why do neural networks work?
Universal Approximation Theorem: A feedforward network with a single hidden layer can approximate any continuous function on a compact domain to arbitrary accuracy, given sufficiently many neurons and a suitable non-linear activation.
Real‑world usage
Regression & Forecasting
Housing prices, stock trends, energy load.
Classification
Spam detection, credit risk, medical diagnosis.
Feature learning
Autoencoders, embeddings, representation learning.
Activation Functions: The Non-Linear Gatekeepers
Why do we need activation functions?
Without activation functions, neural networks would just be linear transformations. No matter how many layers, a linear combination of linear functions is still linear. Activation functions introduce non-linearity, allowing the network to learn complex patterns, decision boundaries, and hierarchical representations.
Every neuron applies an activation function to its weighted input.
Classic Activation Functions: Sigmoid & Tanh
Sigmoid (σ)
Formula: σ(x) = 1 / (1 + e^(-x))
Derivative: σ(x)(1-σ(x))
import numpy as np
def sigmoid(x):
    return 1 / (1 + np.exp(-x))
def sigmoid_derivative(x):
    s = sigmoid(x)
    return s * (1 - s)
Output in (0,1). Suffers from vanishing gradients. Used in the output layer for binary classification.
Tanh (Hyperbolic Tangent)
Formula: tanh(x) = (e^x - e^(-x)) / (e^x + e^(-x))
Derivative: 1 - tanh²(x)
def tanh(x):
    return np.tanh(x) # or manual
def tanh_derivative(x):
    return 1 - np.tanh(x)**2
Zero-centered, range (-1,1). Stronger gradients than sigmoid, but still saturates.
ReLU & Family: Solving Vanishing Gradient
ReLU (Rectified Linear Unit)
f(x) = max(0, x)
def relu(x):
    return np.maximum(0, x)
# derivative: 1 if x>0 else 0
Pros: Computationally cheap, sparse, no saturation for x>0. Cons: Dying ReLU (neurons stuck at 0).
Leaky ReLU
f(x) = x if x>0 else αx (α small, e.g., 0.01)
def leaky_relu(x, alpha=0.01):
    return np.where(x > 0, x, alpha * x)
Fixes dying ReLU; allows gradient flow for negative values.
ELU (Exponential Linear Unit)
f(x)= x if x>0 else α(e^x -1)
Smooth, negative values push mean closer to zero. Faster learning.
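A minimal NumPy sketch of ELU and its derivative, following the formula above. The default `alpha=1.0` is an assumption, and the `np.minimum` guard is an implementation detail added here to avoid overflow warnings.

```python
import numpy as np

def elu(x, alpha=1.0):
    x = np.asarray(x, dtype=float)
    # np.minimum keeps exp() from overflowing on large positive inputs,
    # since np.where evaluates both branches
    return np.where(x > 0, x, alpha * (np.exp(np.minimum(x, 0.0)) - 1.0))

def elu_derivative(x, alpha=1.0):
    x = np.asarray(x, dtype=float)
    return np.where(x > 0, 1.0, alpha * np.exp(np.minimum(x, 0.0)))

x = np.array([-2.0, 0.0, 3.0])
out = elu(x)
grad = elu_derivative(x)
print(out, grad)
```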
PReLU (Parametric ReLU)
α is learned during training.
# TensorFlow: tf.keras.layers.PReLU()
Softmax: From Logits to Probabilities
Softmax is used in the output layer for multi-class classification. It converts a vector of raw scores (logits) into a probability distribution over classes.
def softmax(logits):
    # subtract the per-row maximum for numerical stability (also correct for batched input)
    exp_shifted = np.exp(logits - np.max(logits, axis=-1, keepdims=True))
    return exp_shifted / np.sum(exp_shifted, axis=-1, keepdims=True)
# Example: logits = [2.0, 1.0, 0.1] -> probabilities sum=1
Key property
All outputs ∈ (0,1) and sum to 1.
Ideal for mutually exclusive classes.
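As a worked example, the sketch below applies a self-contained softmax to the logits [2.0, 1.0, 0.1] and to a deliberately large input, which shows why the maximum is subtracted before exponentiating. The variable names are illustrative.

```python
import numpy as np

def softmax(logits):
    logits = np.asarray(logits, dtype=float)
    shifted = logits - np.max(logits, axis=-1, keepdims=True)
    exp_shifted = np.exp(shifted)
    return exp_shifted / np.sum(exp_shifted, axis=-1, keepdims=True)

probs = softmax([2.0, 1.0, 0.1])
print(np.round(probs, 3))  # approximately [0.659 0.242 0.099]

# Without the shift, exp(1000) would overflow to inf
big = softmax([1000.0, 1001.0, 1002.0])
print(big, big.sum())
```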
Modern Activation Functions (Transformers, CNNs)
Swish (Google, 2017)
f(x) = x * sigmoid(βx) (β learnable or constant =1)
Smooth, non-monotonic. Reported to outperform ReLU in some deep networks.
# TF: tf.keras.activations.swish
GELU (Gaussian Error Linear Unit)
GELU(x) = x * Φ(x) (Φ is CDF of Gaussian).
Used in BERT, GPT, ViT. Smooth ReLU variant.
# PyTorch: torch.nn.GELU
Mish (2019)
f(x) = x * tanh(softplus(x)).
Self-regularized, slightly better than Swish on some benchmarks.
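The three formulas above translate directly into NumPy. This is a sketch rather than a framework implementation: GELU uses the exact Gaussian CDF through `math.erf` (frameworks often offer a tanh approximation as well), and the test points are arbitrary.

```python
import math
import numpy as np

def swish(x, beta=1.0):
    x = np.asarray(x, dtype=float)
    return x / (1.0 + np.exp(-beta * x))   # x * sigmoid(beta * x)

def gelu(x):
    """Exact GELU: x * Phi(x), with Phi the standard normal CDF."""
    x = np.asarray(x, dtype=float)
    erf = np.vectorize(math.erf)
    return 0.5 * x * (1.0 + erf(x / math.sqrt(2.0)))

def mish(x):
    x = np.asarray(x, dtype=float)
    softplus = np.logaddexp(0.0, x)        # log(1 + e^x), computed stably
    return x * np.tanh(softplus)

x = np.array([-3.0, -1.0, 0.0, 1.0, 3.0])
s, g, m = swish(x), gelu(x), mish(x)
print(np.round(s, 4))
print(np.round(g, 4))
print(np.round(m, 4))
```

All three dip slightly below zero for negative inputs and approach the identity for large positive inputs. The non-monotonic region is visible in the Swish output, where the value at -1 is lower than the value at -3.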
Activation Function Selection Guide
| Function | Range | When to use | Common in |
|---|---|---|---|
| Sigmoid | (0,1) | Binary output, probabilistic gate | Logistic regression, some attention |
| Tanh | (-1,1) | Zero-centered hidden layers (older RNNs) | LSTM candidate gates |
| ReLU | [0,∞) | Default for hidden layers (CNNs, MLPs) | ResNet, VGG, YOLO |
| Leaky ReLU | (-∞,∞) | Avoid dead neurons | GANs, some detection models |
| Softmax | (0,1) sum=1 | Multi-class classification output | Classification heads |
| Swish / SiLU | ≈[-0.28,∞) | Deep transformer-style models | EfficientNet, RL |
| GELU | ≈[-0.17,∞) | NLP Transformers (BERT, GPT) | Hugging Face models |
Rule of thumb:
- Start with ReLU for hidden layers.
- For output: Sigmoid (binary), Softmax (multi-class), Linear (regression).
- If ReLU causes dead neurons → Leaky ReLU / ELU.
- For Transformers: GELU.
- For very deep nets: consider Swish.
Activation Functions in TensorFlow & PyTorch
TensorFlow / Keras
import tensorflow as tf
# As activation string or layer
model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(32),
    tf.keras.layers.LeakyReLU(),  # separate layer; the 'leaky_relu' string is not accepted by every Keras version
    tf.keras.layers.Dense(10, activation='softmax')
])
# Advanced: tf.keras.activations.gelu, tf.nn.swish
PyTorch
import torch.nn as nn

class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(784, 128)
        self.act1 = nn.ReLU()
        self.fc2 = nn.Linear(128, 64)
        self.act2 = nn.LeakyReLU(0.02)
        self.out = nn.Linear(64, 10)
    def forward(self, x):
        x = self.act1(self.fc1(x))
        x = self.act2(self.fc2(x))
        return self.out(x) # with CrossEntropyLoss, no softmax needed
Activation shapes at a glance
Sigmoid: ──▄▄▄▄▄▄▄▄▄▄▄▄▄▄── squashes to [0,1]
Tanh: ─▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄─ [-1,1]
ReLU: ──────────────────▄▄▄▄▄▄▄▄▄▄ max(0,x)
Leaky ReLU:─╱╱╱╱─────────────▄▄▄▄▄▄▄▄▄ small negative slope
Softmax: [0.2, 0.7, 0.1] probabilities
* ASCII illustration of activation function curves
Activation Pitfalls & Best Practices
Activation Function Cheatsheet
Loss Functions: The Compass of Neural Networks
What is a Loss Function?
A loss function (also called cost/objective function) maps the model's predictions and ground truth to a scalar value. Lower loss = better predictions. During training, backpropagation computes gradients of the loss w.r.t. weights, and optimizers update weights to minimize this loss.
Loss functions define the learning objective.
Regression Losses: Predicting Continuous Values
MSE (L2 Loss)
MSE = 1/n Σ(y - ŷ)²
import numpy as np
def mse(y_true, y_pred):
    return np.mean((y_true - y_pred)**2)
Differentiable. Sensitive to outliers. Most common regression loss.
MAE (L1 Loss)
MAE = 1/n Σ|y - ŷ|
def mae(y_true, y_pred):
    return np.mean(np.abs(y_true - y_pred))
Robust to outliers. Not differentiable at 0.
Huber Loss
Lδ = ½(y-ŷ)² if |y-ŷ| ≤ δ, else δ|y-ŷ| - ½δ²
def huber(y_true, y_pred, delta=1.0):
    error = y_true - y_pred
    is_small = np.abs(error) <= delta
    return np.mean(is_small * 0.5 * error**2 +
                   ~is_small * (delta*np.abs(error) - 0.5*delta**2))
Combines MSE and MAE. Smooth, robust.
Log-Cosh Loss
L = log(cosh(ŷ - y))
Smooth approximation to MAE, twice differentiable.
Quantile Loss
L = Σ max(q(y-ŷ), (q-1)(y-ŷ))
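Used for estimating prediction intervals: q sets the target quantile (e.g., q = 0.9 penalizes under-prediction nine times more than over-prediction).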
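Neither log-cosh nor quantile loss comes with code above, so here is a hedged NumPy sketch of both. It averages over samples where the quantile formula sums, and the log-cosh version uses an algebraically equivalent form to avoid overflow in `cosh`; both choices are assumptions made for the example.

```python
import numpy as np

def log_cosh(y_true, y_pred):
    abs_e = np.abs(y_pred - y_true)
    # log(cosh(e)) = |e| + log(1 + exp(-2|e|)) - log(2), stable for large errors
    return np.mean(abs_e + np.log1p(np.exp(-2.0 * abs_e)) - np.log(2.0))

def quantile_loss(y_true, y_pred, q=0.9):
    error = y_true - y_pred
    return np.mean(np.maximum(q * error, (q - 1.0) * error))

y_true = np.array([1.0, 2.0, 3.0])
under = quantile_loss(y_true, y_true - 1.0, q=0.9)  # predictions too low
over = quantile_loss(y_true, y_true + 1.0, q=0.9)   # predictions too high
print(under, over)
print(log_cosh(y_true, y_true + 0.5))
```

With q = 0.9, predicting one unit too low costs 0.9 while predicting one unit too high costs 0.1, which pushes the model toward the 90th percentile.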
Classification Losses: Probability & Decision Boundaries
Binary Cross-Entropy (BCE)
BCE = -[y log(ŷ) + (1-y) log(1-ŷ)]
def binary_crossentropy(y, y_pred):
    y_pred = np.clip(y_pred, 1e-7, 1-1e-7) # stability
    return -np.mean(y * np.log(y_pred) +
                    (1 - y) * np.log(1 - y_pred))
Use: Binary classification, sigmoid output.
Categorical Cross-Entropy (CCE)
CCE = -Σ y_i log(ŷ_i)
def categorical_crossentropy(y, y_pred):
    y_pred = np.clip(y_pred, 1e-7, 1.0)
    return -np.sum(y * np.log(y_pred)) / y.shape[0]
Use: Multi-class, softmax output.
Sparse CCE
Same as CCE but targets are integers (not one-hot). Memory efficient.
tf.keras.losses.SparseCategoricalCrossentropy()
Hinge Loss
L = max(0, 1 - y·ŷ) (y ∈ {-1,1})
Used in SVMs, also with CNNs.
tf.keras.losses.Hinge()
Squared Hinge
L = max(0, 1 - y·ŷ)²
Differentiable, penalizes errors more.
Tip: prefer loss variants that take raw logits (e.g., tf.keras.losses.BinaryCrossentropy(from_logits=True)), which combine the log and softmax/sigmoid in a numerically stable way.
Probabilistic Losses: Distributions & Divergence
KL Divergence
D_KL(P||Q) = Σ P(i) log(P(i)/Q(i))
Measures how one probability distribution diverges from another. Asymmetric.
def kl_divergence(p, q):
    p = np.clip(p, 1e-7, 1)
    q = np.clip(q, 1e-7, 1)
    return np.sum(p * np.log(p / q))
Used in VAEs, variational inference.
JS Divergence
Jensen-Shannon divergence. Symmetric, smoothed version of KL.
Used in GANs, domain adaptation.
Cross-Entropy vs KL
Cross-Entropy = H(P,Q) = H(P) + D_KL(P||Q). Minimizing cross-entropy is equivalent to minimizing KL when P is fixed (ground truth).
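The identity H(P,Q) = H(P) + D_KL(P||Q) can be verified numerically. The two distributions below are arbitrary values chosen for illustration.

```python
import numpy as np

p = np.array([0.7, 0.2, 0.1])   # "true" distribution P
q = np.array([0.5, 0.3, 0.2])   # model distribution Q

entropy = -np.sum(p * np.log(p))         # H(P)
cross_entropy = -np.sum(p * np.log(q))   # H(P, Q)
kl_pq = np.sum(p * np.log(p / q))        # D_KL(P || Q)
kl_qp = np.sum(q * np.log(q / p))        # D_KL(Q || P), generally different

print(cross_entropy, entropy + kl_pq)
print(kl_pq, kl_qp)
```

Because H(P) does not depend on the model, the gradient of the cross-entropy with respect to Q equals the gradient of the KL divergence.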
Advanced & Specialized Loss Functions
CTC Loss
Connectionist Temporal Classification. Used in speech recognition, handwriting recognition. Aligns sequences without alignment labels.
tf.nn.ctc_loss
Contrastive Loss
L = y*d² + (1-y)*max(margin-d,0)²
Used in Siamese networks, similarity learning.
Triplet Loss
max(d(a,p)-d(a,n)+margin, 0)
Face recognition (FaceNet), embeddings.
Dice Loss / F1 Score
1 - (2|X∩Y|)/(|X|+|Y|). For imbalanced segmentation, medical imaging.
Perceptual Loss
Loss based on feature maps of pre-trained networks (VGG). For style transfer, super-resolution.
Loss Function Selection Guide
| Task Type | Recommended Loss | Output Activation | Comments |
|---|---|---|---|
| Regression (normal) | MSE | Linear | Sensitive to outliers |
| Regression (robust) | Huber / MAE | Linear | Less sensitive to outliers |
| Binary Classification | Binary Cross-Entropy | Sigmoid | Use from_logits for stability |
| Multi-class Classification | Categorical Cross-Entropy | Softmax | Use sparse CE for integer labels |
| Multi-label Classification | Binary Cross-Entropy | Sigmoid (per class) | Independent probabilities |
| Imbalanced Data | Weighted CE / Focal Loss | Sigmoid/Softmax | Focuses on hard samples |
| Similarity Learning | Contrastive / Triplet | L2 normalized | Embedding space |
| Generative Models | BCE (GANs), KL (VAEs) | Varies | Task specific |
Quick Selection Rules:
- Regression: Start with MSE. If outliers are problematic, try MAE or Huber.
- Binary classification: Binary cross-entropy.
- Multi-class: Categorical cross-entropy.
- Probabilistic outputs: KL Divergence.
- Sequence alignment: CTC Loss.
Loss Functions in TensorFlow & PyTorch
TensorFlow / Keras
import tensorflow as tf
# Common losses
model.compile(loss='mse', optimizer='adam') # regression
model.compile(loss='binary_crossentropy', ...)
model.compile(loss='categorical_crossentropy', ...)
model.compile(loss=tf.keras.losses.Huber(delta=1.5), ...)
# Custom loss function
def custom_mse(y_true, y_pred):
    return tf.reduce_mean(tf.square(y_true - y_pred))
PyTorch
import torch
import torch.nn as nn
criterion = nn.MSELoss() # regression
criterion = nn.BCELoss() # requires sigmoid
criterion = nn.BCEWithLogitsLoss() # stable, from_logits
criterion = nn.CrossEntropyLoss() # includes softmax
criterion = nn.KLDivLoss() # KL divergence; expects log-probabilities as input
# Custom
class CustomLoss(nn.Module):
    def forward(self, y_pred, y_true):
        return torch.mean((y_true - y_pred)**2)
Designing Custom Loss Functions
Sometimes you need a task-specific loss. Any differentiable function that maps (y_true, y_pred) to a scalar can be a loss.
# TensorFlow / Keras: weighted MSE
def weighted_mse(y_true, y_pred):
    weights = tf.where(y_true > 0.5, 2.0, 1.0)
    return tf.reduce_mean(weights * (y_true - y_pred)**2)
model.compile(loss=weighted_mse, optimizer='adam')
# PyTorch: focal loss for binary classification (y_pred are logits)
import torch
import torch.nn as nn
class FocalLoss(nn.Module):
    def __init__(self, alpha=1, gamma=2):
        super().__init__()
        self.alpha = alpha
        self.gamma = gamma
    def forward(self, y_pred, y_true):
        bce = nn.functional.binary_cross_entropy_with_logits(y_pred, y_true, reduction='none')
        p = torch.sigmoid(y_pred)
        p_t = p * y_true + (1 - p) * (1 - y_true)  # probability assigned to the true class
        focal = self.alpha * (1 - p_t)**self.gamma * bce
        return focal.mean()
Loss Function Pitfalls & Best Practices
Loss Landscape: The shape of the loss function affects optimization. MSE and cross-entropy are convex in the parameters of a linear model, but the loss of a neural network with hidden layers is generally non-convex.
Loss Functions Cheatsheet
Backpropagation: The Engine of Deep Learning
Why Backpropagation? The Credit Assignment Problem
In a multi-layer network, how does a small change in an early weight affect the final loss? Backpropagation (popularized by Rumelhart, Hinton, and Williams in 1986) solves this credit assignment problem by recursively applying the chain rule.
Historical Breakthrough
- 1986 Backpropagation popularized
- 1989 Universal approximation proven
- 2012 AlexNet (backprop + GPU) wins ImageNet
Intuition
Backpropagation = forward pass computes predictions, backward pass propagates error gradients from output to each weight. "How much did each weight contribute to the error?"
Chain Rule: From Calculus to Computation
Backpropagation is the chain rule — applied efficiently to millions of parameters.
Scalar Chain Rule
If y = f(g(x)), then dy/dx = (dy/dg) * (dg/dx)
Multivariate: For L = f(z), z = Wx + b:
∂L/∂W = (∂L/∂z) · (∂z/∂W)
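For a linear layer, the chain rule above reduces to a single matrix product. The sketch below assumes the row-vector convention z = xW + b used by the NumPy code elsewhere on this page, picks an illustrative loss L = ½·Σz², and compares the analytic gradient with a finite-difference estimate.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(1, 3))        # one input row
W = rng.normal(size=(3, 2))
b = rng.normal(size=(1, 2))

def loss_fn(W):
    z = x @ W + b
    return 0.5 * np.sum(z ** 2)    # illustrative choice of L = f(z)

# Analytic gradient via the chain rule
z = x @ W + b
dL_dz = z                          # derivative of 0.5 * sum(z^2) w.r.t. z
dL_dW = x.T @ dL_dz                # dz/dW contributes the input x

# Central finite differences, one weight at a time
eps = 1e-6
numeric = np.zeros_like(W)
for i in range(W.shape[0]):
    for j in range(W.shape[1]):
        W_plus, W_minus = W.copy(), W.copy()
        W_plus[i, j] += eps
        W_minus[i, j] -= eps
        numeric[i, j] = (loss_fn(W_plus) - loss_fn(W_minus)) / (2 * eps)

max_diff = np.max(np.abs(numeric - dL_dW))
print(max_diff)
```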
Computational Graphs: Visualizing Backprop
Modern frameworks (TensorFlow, PyTorch) build a computational graph during forward pass, then traverse it in reverse to compute gradients.
Forward:
x → *3 → +5 → z
Backward:
dz/dz = 1 → through +5: multiply by 1 → through *3: multiply by 3 → dz/dx = 3
Automatic Differentiation
- Forward mode: compute derivatives alongside values
- Reverse mode (backprop): one forward pass, one backward pass → all gradients
- Efficient for many parameters (typical deep learning)
# Forward pass: z = (x * 3) + 5
x = 2.0
a = x * 3 # a = 6
z = a + 5 # z = 11
# Backward pass (dz/dz = 1)
dz = 1
da = dz * 1 # dz/da = 1
dx = da * 3 # da/dx = 3
print(dx) # Gradient = 3
Backpropagation Through a 2‑Layer MLP
↠dW1 ↠dz1 ↠da1 ↠dW2 ↠dz2 ↠dL/da2
Forward Pass (Caching)
z1 = X @ W1 + b1
a1 = sigmoid(z1)
z2 = a1 @ W2 + b2
a2 = sigmoid(z2)
loss = binary_crossentropy(y, a2)
Backward Pass (Gradients)
m = X.shape[0]
da2 = -(y/a2 - (1-y)/(1-a2)) / m # BCE derivative; 1/m because the loss is a mean over m samples
dz2 = da2 * sigmoid_prime(z2) # sigmoid'(z2) = a2*(1-a2)
dW2 = a1.T @ dz2
db2 = np.sum(dz2, axis=0)
da1 = dz2 @ W2.T
dz1 = da1 * sigmoid_prime(z1)
dW1 = X.T @ dz1
db1 = np.sum(dz1, axis=0)
Pure NumPy Backprop – Full Training Loop
Every line explained. No frameworks, just math and NumPy.
import numpy as np

class NeuralNet:
    def __init__(self, input_size, hidden_size, output_size, lr=0.5):
        self.lr = lr
        self.W1 = np.random.randn(input_size, hidden_size) * 0.5
        self.b1 = np.zeros((1, hidden_size))
        self.W2 = np.random.randn(hidden_size, output_size) * 0.5
        self.b2 = np.zeros((1, output_size))
    def sigmoid(self, x):
        return 1 / (1 + np.exp(-x))
    def sigmoid_deriv(self, a):
        # expects the sigmoid output a = sigmoid(z), not z itself
        return a * (1 - a)
    def forward(self, X):
        self.z1 = X @ self.W1 + self.b1
        self.a1 = self.sigmoid(self.z1)
        self.z2 = self.a1 @ self.W2 + self.b2
        self.a2 = self.sigmoid(self.z2)
        return self.a2
    def backward(self, X, y, output):
        m = X.shape[0]
        # Output layer gradients
        self.dz2 = output - y.reshape(-1,1) # BCE derivative simplification
        self.dW2 = (1/m) * self.a1.T @ self.dz2
        self.db2 = (1/m) * np.sum(self.dz2, axis=0, keepdims=True)
        # Hidden layer gradients
        self.da1 = self.dz2 @ self.W2.T
        self.dz1 = self.da1 * self.sigmoid_deriv(self.a1)
        self.dW1 = (1/m) * X.T @ self.dz1
        self.db1 = (1/m) * np.sum(self.dz1, axis=0, keepdims=True)
    def update(self):
        self.W1 -= self.lr * self.dW1
        self.b1 -= self.lr * self.db1
        self.W2 -= self.lr * self.dW2
        self.b2 -= self.lr * self.db2
    def train(self, X, y, epochs=5000):
        for i in range(epochs):
            output = self.forward(X)
            self.backward(X, y, output)
            self.update()
            if i % 1000 == 0:
                # MSE is reported for monitoring; the gradients above correspond to BCE
                loss = np.mean((output - y.reshape(-1,1))**2)
                print(f'Epoch {i}, Loss: {loss:.6f}')

# XOR dataset
X = np.array([[0,0],[0,1],[1,0],[1,1]])
y = np.array([[0],[1],[1],[0]])
net = NeuralNet(2, 4, 1, lr=0.7)  # named net to avoid clashing with the common torch.nn alias
net.train(X, y, epochs=6000)
print("Predictions:\n", net.forward(X))
This implementation typically converges on XOR, the classic non-linear problem that a single perceptron cannot solve, although the result depends on the random initialization.
Gradient Checking: Verify Your Backprop
Numerical approximation of gradients to ensure analytical backprop is correct.
def numerical_gradient(f, params, epsilon=1e-7):
    """Finite difference approximation"""
    grads = []
    for param in params:
        grad = np.zeros_like(param)
        it = np.nditer(param, flags=['multi_index'], op_flags=['readwrite'])
        while not it.finished:
            idx = it.multi_index
            old_val = param[idx]
            param[idx] = old_val + epsilon
            f_plus = f()
            param[idx] = old_val - epsilon
            f_minus = f()
            grad[idx] = (f_plus - f_minus) / (2 * epsilon)
            param[idx] = old_val
            it.iternext()
        grads.append(grad)
    return grads
# Use: compare with backprop gradients (difference < 1e-6 is good)
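A usage sketch for gradient checking on a single sigmoid layer trained with binary cross-entropy. The helper `numerical_grad` is a compact stand-in written for this example rather than the function above, and the data, seed, and 1e-6 tolerance are assumptions.

```python
import numpy as np

def numerical_grad(f, param, eps=1e-6):
    """Central differences over every entry of param (modified in place, then restored)."""
    grad = np.zeros_like(param)
    for idx in np.ndindex(param.shape):
        old = param[idx]
        param[idx] = old + eps
        f_plus = f()
        param[idx] = old - eps
        f_minus = f()
        param[idx] = old
        grad[idx] = (f_plus - f_minus) / (2 * eps)
    return grad

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 3))
y = rng.integers(0, 2, size=(8, 1)).astype(float)
W = rng.normal(size=(3, 1)) * 0.5
b = np.zeros((1, 1))

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bce_loss():
    a = np.clip(sigmoid(X @ W + b), 1e-12, 1 - 1e-12)
    return -np.mean(y * np.log(a) + (1 - y) * np.log(1 - a))

# Analytic gradients: for sigmoid + BCE, dL/dz = (a - y) / m
a = sigmoid(X @ W + b)
dz = (a - y) / X.shape[0]
dW_analytic = X.T @ dz
db_analytic = np.sum(dz, axis=0, keepdims=True)

dW_numeric = numerical_grad(bce_loss, W)
db_numeric = numerical_grad(bce_loss, b)

rel_err = np.linalg.norm(dW_analytic - dW_numeric) / (
    np.linalg.norm(dW_analytic) + np.linalg.norm(dW_numeric))
print(rel_err)
```

A relative error is used because it stays meaningful whether the gradients themselves are large or small.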
Vanishing / Exploding Gradients
Deep networks suffer from unstable gradients. Why?
Vanishing
Sigmoid/tanh saturate → gradients → 0. Early layers learn extremely slowly.
# Solution: ReLU, residual connections, batch norm
Exploding
Large weights → gradients multiply exponentially → NaN.
# Solution: Gradient clipping, proper initialization
Modern mitigations
- ReLU/Leaky ReLU activations
- Xavier/He initialization
- Batch Normalization
- Residual connections (ResNet)
- Gradient clipping
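Of the mitigations listed above, gradient clipping is the simplest to write by hand. This sketch clips by global norm, which rescales all gradients together so their direction is preserved; the function name and the example values are illustrative.

```python
import numpy as np

def clip_by_global_norm(grads, max_norm):
    """Rescale a list of gradient arrays so their joint L2 norm is at most max_norm."""
    total_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    scale = min(1.0, max_norm / (total_norm + 1e-12))
    return [g * scale for g in grads], total_norm

grads = [np.array([3.0, 4.0]), np.array([[12.0]])]
clipped, norm_before = clip_by_global_norm(grads, max_norm=1.0)
norm_after = np.sqrt(sum(np.sum(g ** 2) for g in clipped))
print(norm_before, norm_after)
```

Gradients whose norm is already below the threshold pass through unchanged, because the scale factor is capped at 1.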
Backprop in TensorFlow & PyTorch
Autograd computes gradients automatically — but understanding backprop helps you debug and design architectures.
TensorFlow
with tf.GradientTape() as tape:
    y_pred = model(X)
    loss = tf.keras.losses.MSE(y, y_pred)
grads = tape.gradient(loss, model.trainable_variables)
optimizer.apply_gradients(zip(grads, model.trainable_variables))
PyTorch
optimizer.zero_grad() # clear gradients accumulated by the previous step
y_pred = model(X)
loss = nn.MSELoss()(y_pred, y)
loss.backward() # <-- one line backprop!
optimizer.step()
Backpropagation = The Learning Algorithm
Nearly every neural network, from 3-layer MLPs to GPT-4, is trained using backpropagation or a variant of it. Understanding backprop lets you implement new architectures, diagnose vanishing gradients, and reason about how deep learning actually works.
When You Need Backprop Deep‑Knowledge
Custom Layers
Implement your own forward/backward in frameworks.
Debugging
Why are gradients NaN? Why isn't this layer learning?
Research
Modify gradient flow (e.g., reversible nets, synthetic gradients).
Optimizers: Driving Neural Network Training
What is an Optimizer?
Optimizers are algorithms that update model parameters (weights) to minimize the loss function. They determine how to move in the gradient direction — how fast, with what momentum, and with what adaptive scaling. The choice of optimizer critically affects training speed, stability, and final performance.
Optimizers incorporate gradient history, adaptive learning rates, and momentum.
Gradient Descent Variants
Batch GD
Uses entire dataset to compute gradient. θ = θ - lr · ∇L(θ; all data)
Slow but stable. Not feasible for large datasets.
Stochastic GD (SGD)
θ = θ - lr · ∇L(θ; xᵢ, yᵢ)
Update per sample. High variance. Suited to online learning.
Mini-batch GD
θ = θ - lr · ∇L(θ; batch)
Balanced and most common. Typical batch size 32-512.
import numpy as np
def sgd_update(params, grads, lr=0.01):
    """Simple SGD update"""
    for param, grad in zip(params, grads):
        param -= lr * grad
    return params
Momentum & Nesterov Accelerated Gradient
SGD with Momentum
vₜ = βvₜ₋₁ + (1-β)∇L(θₜ)
θₜ₊₁ = θₜ - lr · vₜ
Accumulates velocity to overcome ravines and accelerate convergence. β typically 0.9.
def momentum_update(params, grads, v, lr=0.01, beta=0.9):
    for i, (p, g) in enumerate(zip(params, grads)):
        v[i] = beta * v[i] + (1 - beta) * g
        p -= lr * v[i]
    return params, v
Nesterov Accelerated Gradient (NAG)
vₜ = βvₜ₋₁ + (1-β)∇L(θₜ - lr·βvₜ₋₁)
θₜ₊₁ = θₜ - lr · vₜ
Looks ahead at the approximate future position. Often faster and more stable than standard momentum.
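The look-ahead step can be sketched in a few lines. The function below follows the NAG formula above; the quadratic objective, the learning rate of 0.1, and the 200 iterations are assumptions chosen so the example converges quickly.

```python
import numpy as np

def nag_step(theta, v, grad_fn, lr=0.1, beta=0.9):
    """One Nesterov step: evaluate the gradient at the look-ahead point."""
    lookahead = theta - lr * beta * v
    g = grad_fn(lookahead)
    v = beta * v + (1 - beta) * g
    theta = theta - lr * v
    return theta, v

def grad_fn(theta):
    return 2.0 * theta               # gradient of f(theta) = sum(theta^2)

theta = np.array([5.0, -3.0])
v = np.zeros_like(theta)
for _ in range(200):
    theta, v = nag_step(theta, v, grad_fn)

final_norm = np.linalg.norm(theta)
print(final_norm)                    # close to the minimum at the origin
```

Unlike plain momentum, the gradient function is called on a shifted copy of the parameters, so the optimizer needs access to the loss itself and not just a precomputed gradient.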
Adaptive Learning Rate Methods
AdaGrad
Gₜ = Gₜ₋₁ + (∇L(θₜ))²
θₜ₊₁ = θₜ - lr/√(Gₜ + ε) · ∇L(θₜ)
Adapts per-parameter learning rates. Good for sparse data. Learning rate decays monotonically.
Weakness: LR becomes infinitesimally small.
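The shrinking step size is easy to observe in a sketch of the AdaGrad update. A constant gradient of 1 is fed in on purpose (an artificial assumption) so that the decay of the effective learning rate is the only thing that changes between steps.

```python
import numpy as np

def adagrad_step(theta, G, grad, lr=0.5, eps=1e-8):
    """One AdaGrad update following the formula above."""
    G = G + grad ** 2
    theta = theta - lr / np.sqrt(G + eps) * grad
    return theta, G

theta = np.array([1.0])
G = np.zeros_like(theta)
step_sizes = []
for _ in range(5):
    grad = np.ones_like(theta)       # constant gradient to expose the decay
    new_theta, G = adagrad_step(theta, G, grad)
    step_sizes.append(float(abs(new_theta - theta)[0]))
    theta = new_theta

print(step_sizes)                    # shrinks roughly like lr / sqrt(t)
```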
RMSprop
E[g²]ₜ = βE[g²]ₜ₋₁ + (1-β)(∇L)²
θₜ₊₁ = θₜ - lr/√(E[g²]ₜ + ε) · ∇L
Never formally published (introduced in Geoffrey Hinton's Coursera lectures), but widely used. Fixes AdaGrad's decaying LR problem. β typically 0.9.
def rmsprop_update(params, grads, cache, lr=0.001, beta=0.9, eps=1e-8):
    for i, (p, g) in enumerate(zip(params, grads)):
        cache[i] = beta * cache[i] + (1 - beta) * g**2
        p -= lr * g / (np.sqrt(cache[i]) + eps)
    return params, cache
Adam & The Adaptive Moment Family
Adam (Adaptive Moment Estimation)
mₜ = β₁mₜ₋₁ + (1-β₁)∇L
vₜ = β₂vₜ₋₁ + (1-β₂)(∇L)²
m̂ₜ = mₜ/(1-β₁ᵗ), v̂ₜ = vₜ/(1-β₂ᵗ)
θₜ₊₁ = θₜ - lr · m̂ₜ/(√v̂ₜ + ε)
Combines momentum (first moment) and RMSprop (second moment), using the bias-corrected estimates m̂ₜ and v̂ₜ. β₁=0.9, β₂=0.999, ε=1e-7.
Default optimizer for most tasks
AdamW
θₜ₊₁ = θₜ - lr · (m̂ₜ/(√v̂ₜ+ε) + λθₜ)
Decoupled weight decay. Often generalizes better than Adam with L2 regularization and is commonly preferred over it.
# PyTorch: torch.optim.AdamW
# TensorFlow: tf.keras.optimizers.AdamW
Nadam
Adam + Nesterov momentum. Slightly faster convergence.
AMSGrad
Variant that uses maximum of past squared gradients. Addresses convergence issues.
AdaBelief
Stepsize scaled by belief in observed gradient. More stable.
# Simplified Adam update (conceptual)
def adam_step(param, grad, m, v, t, lr=0.001, b1=0.9, b2=0.999):
    m = b1 * m + (1 - b1) * grad
    v = b2 * v + (1 - b2) * grad**2
    m_hat = m / (1 - b1**t)
    v_hat = v / (1 - b2**t)
    param -= lr * m_hat / (np.sqrt(v_hat) + 1e-7)
    return param, m, v
Modern & Emerging Optimizers
Lion (EvoLved Sign Momentum)
cₜ = β₁mₜ₋₁ + (1-β₁)∇L
θₜ₊₁ = θₜ - lr · sign(cₜ)
mₜ = β₂mₜ₋₁ + (1-β₂)∇L
Discovered by symbolic program search by researchers at Google. Stores only the momentum, so it is more memory-efficient than Adam.
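A hedged NumPy sketch of a single Lion step. It follows the update order as I understand it from the Lion paper (β₁ interpolation for the sign, β₂ for the stored momentum, with optional decoupled weight decay); the defaults β₁=0.9 and β₂=0.99 and the example values are assumptions.

```python
import numpy as np

def lion_step(theta, m, grad, lr=1e-4, beta1=0.9, beta2=0.99, weight_decay=0.0):
    """One Lion update: sign of an interpolated direction, then momentum update."""
    update = np.sign(beta1 * m + (1 - beta1) * grad)
    theta = theta - lr * (update + weight_decay * theta)
    m = beta2 * m + (1 - beta2) * grad
    return theta, m

theta = np.array([1.0, -2.0, 0.5])
m = np.zeros_like(theta)
grad = np.array([0.3, -10.0, 0.001])   # very different magnitudes

new_theta, m = lion_step(theta, m, grad, lr=0.01)
print(new_theta)   # every coordinate moves by exactly lr, regardless of gradient size
print(m)
```

Because of the sign operation, every parameter moves by the same amount per step, which is why Lion is usually run with a smaller learning rate than Adam.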
Adafactor
Memory-efficient Adam for large models. Factorizes second moment estimates. Used in T5.
LAMB & LARS
LARS (Layer-wise Adaptive Rate Scaling) and LAMB, its Adam-style counterpart, scale updates per layer. For large-batch training (BERT, ResNet on TPUs).
Learning Rate Scheduling
Even with adaptive optimizers, scheduling the learning rate improves convergence.
Step Decay
Drop LR by factor every few epochs.
# TF: tf.keras.optimizers.schedules.ExponentialDecay
# PyTorch: torch.optim.lr_scheduler.StepLR
Cosine Annealing
Smooth cyclic decay. Often with warm restarts.
tf.keras.optimizers.schedules.CosineDecay
Warmup
Linear increase from a value near 0 up to the initial LR over the first steps. Stabilizes large model training.
ReduceLROnPlateau
Reduce LR when validation loss plateaus.
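The schedules above are plain functions of the step or epoch counter. The sketch below writes step decay and linear warmup followed by cosine decay in pure Python; the function names, the decay factor, and the step counts are illustrative assumptions, not framework defaults.

```python
import math

def step_decay(base_lr, epoch, drop=0.5, every=10):
    """Multiply the LR by `drop` once every `every` epochs."""
    return base_lr * (drop ** (epoch // every))

def warmup_cosine(base_lr, step, warmup_steps, total_steps, min_lr=0.0):
    """Linear warmup to base_lr, then cosine decay down to min_lr."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

schedule = [warmup_cosine(1e-3, s, warmup_steps=10, total_steps=100) for s in range(100)]
print(schedule[0], schedule[9], schedule[50], schedule[-1])
print(step_decay(0.1, 25))
```

In a training loop the returned value would be assigned to the optimizer's learning rate before each step.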
Optimizer Selection Guide
| Optimizer | Adaptive | Momentum | When to use | Memory |
|---|---|---|---|---|
| SGD | ❌ | ❌ | Simple models, CV (with momentum) | Low |
| SGD+Momentum | ❌ | ✅ | Classic CNNs, needs LR tuning | Low |
| RMSprop | ✅ | ❌ | RNNs, online learning | Medium |
| Adam | ✅ | ✅ | Default for most tasks | Medium |
| AdamW | ✅ | ✅ | Transformers, NLP, better generalization | Medium |
| Nadam | ✅ | ✅ (Nesterov) | Slightly faster Adam | Medium |
| Lion | ✅ | ✅ | Memory efficient, vision tasks | Low |
| Adafactor | ✅ | ✅ | Giant models (LLMs) | Very low |
Quick Selection Rules:
- Start with AdamW – works well out-of-the-box.
- For NLP / Transformers: AdamW with cosine decay + warmup.
- For Computer Vision: SGD with momentum can outperform Adam (requires tuning).
- For large models (>1B params): Adafactor or Lion to save memory.
- For sparse data: AdaGrad or Adam.
Optimizers in TensorFlow & PyTorch
TensorFlow / Keras
import tensorflow as tf
# Common optimizers
model.compile(optimizer='sgd', ...)
model.compile(optimizer='adam', ...)
model.compile(optimizer=tf.keras.optimizers.AdamW(learning_rate=1e-4, weight_decay=1e-4))
# Learning rate schedule
lr_schedule = tf.keras.optimizers.schedules.CosineDecay(1e-3, decay_steps=10000)
optimizer = tf.keras.optimizers.Adam(lr_schedule)
# Custom optimizer loop
@tf.function
def train_step(x, y):
    with tf.GradientTape() as tape:
        y_pred = model(x)
        loss = loss_fn(y, y_pred)
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
PyTorch
import torch.optim as optim
# Optimizers (pick one)
optimizer = optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
optimizer = optim.Adam(model.parameters(), lr=0.001)
optimizer = optim.AdamW(model.parameters(), lr=0.001, weight_decay=0.01)
# Learning rate schedulers (pick one)
scheduler = optim.lr_scheduler.StepLR(optimizer, step_size=30, gamma=0.1)
scheduler = optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=50)
scheduler = optim.lr_scheduler.ReduceLROnPlateau(optimizer, patience=5)
# Training loop
for epoch in range(epochs):
    for x, y in dataloader:
        optimizer.zero_grad()
        y_pred = model(x)
        loss = loss_fn(y_pred, y)
        loss.backward()
        optimizer.step()
    # once per epoch; ReduceLROnPlateau instead needs scheduler.step(val_loss)
    scheduler.step()
Optimizer Hyperparameter Tuning
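LR Range Test: Increase the LR exponentially each batch and plot the loss. A good LR usually lies somewhat below the point where the loss starts to explode, in the region where the loss is falling fastest.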