Neural Networks

Training & Optimization

Gradient descent, backpropagation, optimizers, learning rate, and gradient issues.

Gradient Descent

The Core Update Rule

Let L(θ) be the loss as a function of all parameters θ (weights and biases). The gradient ∇L points in the direction of steepest ascent of L. To reduce the loss, we step the opposite way:

θ ← θ − η ∇L(θ)

The positive scalar η is the learning rate. Too large: overshoot, oscillation, divergence. Too small: painfully slow progress and a risk of lingering in flat regions for a long time. In deep learning, η is one of the first hyperparameters people tune, often alongside batch size and schedule (warmup, decay).
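The three regimes can be seen on a one-dimensional toy problem. The sketch below assumes the loss L(θ) = θ², chosen only for illustration; the function name `run_gd` and the specific rates are not from the original text.

```python
def run_gd(eta, steps=20, theta=1.0):
    """Gradient descent on L(theta) = theta**2, whose gradient is 2*theta."""
    for _ in range(steps):
        grad = 2.0 * theta
        theta = theta - eta * grad
    return theta

too_small = run_gd(eta=0.01)   # shrinks by a factor 0.98 per step: slow
reasonable = run_gd(eta=0.4)   # shrinks by a factor 0.2 per step: fast
too_large = run_gd(eta=1.1)    # multiplies by -1.2 per step: oscillates and diverges

print(too_small, reasonable, too_large)
```

Each step multiplies θ by (1 − 2η), so the behavior is decided entirely by whether that factor has magnitude below or above 1.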

For a smooth convex function in low dimensions, with a well-chosen rate, gradient descent can converge to the global minimum. Neural loss surfaces are not convex in general; in practice we still use the same local linear model ("linearize the loss around the current θ") because it works remarkably well at scale.

Batch, Mini-Batch, and Stochastic

Batch gradient descent uses the gradient of the loss averaged over all training examples each step. That gives a faithful direction but is expensive when data are huge and must be scanned every update.

Stochastic gradient descent (SGD) originally meant using one example per step: noisy, but cheap per iteration, and the noise can help escape shallow local minima. In modern usage, mini-batch SGD is standard: each step uses a mini-batch of B examples (e.g. 32–256). The gradient is averaged over the batch, trading noise against compute efficiency and hardware utilization (GPUs are most efficient on large, contiguous matrix operations).

Noise from small batches acts like mild regularization; very large batches can require larger learning rates or special tricks (learning rate scaling rules, warmup) to retain generalization quality. The field has many refinements (momentum, RMSprop, Adam) that adapt how past gradients influence the step, but they all build on the same idea: use derivatives of the loss with respect to the parameters.

One training step: sample batch → forward → loss → backward → θ ← θ − (step computed from ∇L and optimizer state)
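The training step above can be sketched in plain NumPy. This is a minimal illustration on synthetic linear-regression data; the dataset, `true_w`, the batch size, and the learning rate are all assumptions made for the example, and the gradient is written by hand rather than computed by a framework.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic regression data (illustrative): y = X @ true_w + small noise
n, d = 256, 3
true_w = np.array([1.5, -2.0, 0.5])
X = rng.normal(size=(n, d))
y = X @ true_w + 0.01 * rng.normal(size=n)

w = np.zeros(d)
eta, batch_size = 0.1, 32

for step in range(200):
    idx = rng.choice(n, size=batch_size, replace=False)  # sample batch
    Xb, yb = X[idx], y[idx]
    pred = Xb @ w                                         # forward
    loss = np.mean((pred - yb) ** 2)                      # loss (MSE)
    grad = 2.0 / batch_size * Xb.T @ (pred - yb)          # backward (analytic gradient)
    w = w - eta * grad                                    # update

print("estimated w:", w)
```

Each iteration sees only 32 of the 256 examples, so individual gradients are noisy, yet the estimate still settles close to `true_w`.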

Local Minima, Saddles, and Plateaus

High-dimensional loss landscapes are hard to visualize. Local minima (points where all directions go uphill) were once feared as show-stoppers; empirically, many deep nets find solutions that generalize even though the surface is non-convex. Saddle points (some directions down, some up) are more problematic in theory because the gradient can be very small even far from a good solution; optimization research studies how noise and curvature help escape them.

Plateaus are regions where the gradient is tiny; training can crawl. Good initialization, activation choices (e.g. ReLU vs saturated sigmoids), batch normalization, and adaptive optimizers all affect how often you hit flat or pathological regions. This is why the full training story ties together architecture, loss, and optimization, not any single trick in isolation.

Practical note. If loss is flat, check the learning rate, the gradient norms, whether the model is in train() mode, and whether the loss is wired correctly (e.g. softmax outputs fed into a loss such as nn.CrossEntropyLoss that expects raw logits).

Toy Example: 2D Quadratic in NumPy

Minimize L(w) = w₀² + 4w₁² (an elongated bowl). The gradient is (2w₀, 8w₁). Plain gradient descent with a fixed η multiplies w₀ by (1 − 2η) and w₁ by (1 − 8η) at every step, so both components shrink toward zero as long as η < 0.25.

Vanilla GD on a quadratic

import numpy as np

w = np.array([3.0, 2.0])
eta = 0.15
for t in range(30):
    grad = np.array([2 * w[0], 8 * w[1]])
    w = w - eta * grad
print("w after steps:", w)

PyTorch: `optimizer.step()`

Frameworks compute ∇L via automatic differentiation (next tutorial: backpropagation). The optimizer holds the learning rate and state such as momentum buffers; after loss.backward(), optimizer.step() applies the update.

One step pattern

import torch
import torch.nn as nn

model = nn.Linear(5, 1)
opt = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
x = torch.randn(8, 5)
y = torch.randn(8, 1)

loss_fn = nn.MSELoss()
opt.zero_grad()
loss = loss_fn(model(x), y)
loss.backward()
opt.step()

Summary

Gradient descent moves parameters opposite the loss gradient, scaled by learning rate.
Mini-batch SGD balances noise, speed, and hardware efficiency.
Non-convex landscapes still train well in practice with good habits and modern optimizers.
Backprop computes gradients; optimizers consume them. The next pages unpack both in depth.

Backpropagation

From the Chain Rule to "Error Signals"

If L depends on z, and z depends on w, then ∂L/∂w = (∂L/∂z)(∂z/∂w). In a network, L depends on the output, which depends on the last layer's activations, which depend on the last layer's pre-activations, which depend on weights and the previous layer's activations, and so on. Each link in this chain contributes a factor; backprop multiplies those factors along each path from L back to w and sums the contributions when several paths reach the same w.

Implementation-wise, we store forward values (inputs to each op) needed to compute local derivatives on the way back. The backward pass visits operations in reverse topological order: each op receives an upstream gradient ("how much L changes per unit change in this op's output") and returns gradients w.r.t. its inputs.

Forward: x → … → ŷ → L
Backward: ∂L/∂ŷ → … → ∂L/∂x (plus ∂L/∂W, ∂L/∂b at each layer)

What Happens in One Dense Layer?

Suppose z = W a + b and a' = σ(z). During backward, we receive ∂L/∂a' (same shape as a'). First, ∂L/∂z = ∂L/∂a' ⊙ σ'(z) (element-wise product for pointwise σ). Then ∂L/∂a = Wᵀ (∂L/∂z) passes the signal to the previous layer, and ∂L/∂W = (∂L/∂z) aᵀ in the appropriate batch layout (or outer products summed over examples). The bias gradient ∂L/∂b is ∂L/∂z summed over the batch dimension.

You do not need to memorize every transpose if you use a framework; you do need the picture: matmul backward swaps the roles of activations and upstream gradients when flowing through weights, which is why shape discipline matters for custom layers.
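The dense-layer formulas can be checked numerically. The sketch below assumes a row-per-example batch layout (so the forward pass is Z = A Wᵀ + b), a sigmoid activation, and a toy loss L = ½ Σ a'² chosen only so that the upstream gradient is easy to write down; none of these choices come from the original text.

```python
import numpy as np

rng = np.random.default_rng(0)
B, n_in, n_out = 4, 3, 2           # batch size and layer widths (arbitrary)
A = rng.normal(size=(B, n_in))     # previous-layer activations, one row per example
W = rng.normal(size=(n_out, n_in))
b = rng.normal(size=n_out)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss_fn(W):
    """Toy loss: L = 0.5 * sum(a'^2), so dL/da' = a'."""
    return 0.5 * np.sum(sigmoid(A @ W.T + b) ** 2)

# Forward pass, keeping the values the backward pass needs
Z = A @ W.T + b
A_next = sigmoid(Z)

# Backward pass
dA_next = A_next                         # upstream gradient dL/da'
dZ = dA_next * A_next * (1.0 - A_next)   # dL/dz = dL/da' * sigma'(z), element-wise
dA = dZ @ W                              # dL/da, passed to the previous layer
dW = dZ.T @ A                            # dL/dW, outer products summed over the batch
db = dZ.sum(axis=0)                      # dL/db, summed over the batch

# Finite-difference check on a single weight
eps = 1e-6
W_plus, W_minus = W.copy(), W.copy()
W_plus[0, 1] += eps
W_minus[0, 1] -= eps
numeric = (loss_fn(W_plus) - loss_fn(W_minus)) / (2 * eps)
print(dW[0, 1], numeric)
```

Note how the transposes fall out of the shapes: `dA` must match `A`, and `dW` must match `W`, which leaves only one way to arrange each product.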

ReLU, Sigmoid, and Softmax Layers

ReLU: derivative 1 where z>0, 0 where z<0. Dead neurons are units whose pre-activation is negative for every input, so their gradient is always zero; this is one reason Leaky ReLU and other activations appear.

Sigmoid/tanh: derivatives involve σ(1−σ) or 1−tanh²; in deep stacks these can shrink gradients (the historical "vanishing gradient" story with saturating activations).

Softmax + cross-entropy: the combined gradient with respect to the logits simplifies to p − y (predicted probabilities minus the one-hot target). Frameworks fuse the two operations for numerical stability, so you never form large Jacobian matrices explicitly. Conceptually, the gradient on logits pushes probability mass toward the correct class and away from wrong ones.
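For softmax followed by cross-entropy, the gradient with respect to the logits has the well-known closed form p − y, where p is the softmax output and y the one-hot target. The sketch below checks that form against finite differences for a single example; the logits and target index are arbitrary illustrative values.

```python
import numpy as np

def softmax(logits):
    shifted = logits - np.max(logits)      # subtract the max for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum()

def cross_entropy(logits, target):
    return -np.log(softmax(logits)[target])

logits = np.array([2.0, 0.5, -1.0])
target = 0                                  # index of the correct class

p = softmax(logits)
one_hot = np.zeros_like(logits)
one_hot[target] = 1.0
grad = p - one_hot                          # gradient of the loss w.r.t. the logits

# Finite-difference check of the closed form
eps = 1e-6
numeric = np.zeros_like(logits)
for i in range(len(logits)):
    bump = np.zeros_like(logits)
    bump[i] = eps
    numeric[i] = (cross_entropy(logits + bump, target)
                  - cross_entropy(logits - bump, target)) / (2 * eps)

print(grad, numeric)
```

The correct class gets a negative gradient (a descent step raises its logit) and every other class a positive one, and the entries sum to zero.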

Memory and Compute

Training typically stores activations from the forward pass for backward use. That is why activation checkpointing trades extra forward recomputation for less GPU memory on very large models. The backward pass has similar order-of-magnitude cost to the forward pass for many common layers.

Debugging. If you implement a custom autograd.Function, verify shapes and run torch.autograd.gradcheck on tiny random double-precision inputs; it compares your backward against finite differences, which is slow but trustworthy.
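A minimal sketch of that check, assuming a hypothetical custom op `Cube` (y = x³) written purely for illustration. The tolerances passed to `gradcheck` are example values, not recommendations.

```python
import torch

class Cube(torch.autograd.Function):
    """Hypothetical custom op y = x**3 with a hand-written backward."""

    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)          # store what backward will need
        return x ** 3

    @staticmethod
    def backward(ctx, grad_output):
        (x,) = ctx.saved_tensors
        return grad_output * 3 * x ** 2   # upstream gradient times local derivative

# gradcheck compares the analytic backward to finite differences;
# double precision keeps the numerical estimate accurate
x = torch.randn(5, dtype=torch.double, requires_grad=True)
ok = torch.autograd.gradcheck(Cube.apply, (x,), eps=1e-6, atol=1e-4)
print(ok)
```

If the backward were wrong (say, `2 * x ** 2`), `gradcheck` would raise an error describing the mismatch instead of returning.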

PyTorch: `backward()` and the Graph

PyTorch builds a dynamic graph each forward (eager mode). Tensors with requires_grad=True track operations. Calling loss.backward() runs backprop; optimizer.step() then uses stored .grad tensors. torch.no_grad() disables graph building for inference.

Minimal autograd trail

import torch

x = torch.tensor([2.0], requires_grad=True)
y = x ** 3 + 5 * x
y.backward()   # dy/dx = 3x^2 + 5
print(x.grad)  # tensor([17.])

Summary

Backprop = reverse-mode differentiation applying the chain rule on the network's computation graph.
Each layer maps an upstream gradient into gradients w.r.t. its inputs and parameters.
Modern frameworks hide Jacobian algebra but the structure explains vanishing/exploding behavior.
Computational graphs (next page) make this bookkeeping explicit.

Optimizers (SGD, Adam, …)

SGD and Momentum

Stochastic gradient descent: θ ← θ − η ∇L. Mini-batches approximate the full-dataset gradient cheaply. Without momentum, updates zig-zag in ill-conditioned valleys.

Momentum maintains v ← βv + ∇L and updates θ ← θ − ηv. Nesterov momentum evaluates the gradient at a "look-ahead" point, often improving responsiveness. These methods use one global learning rate (per parameter group) and the same update scale for all coordinates, so tuning η and batch size matters a lot.
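The momentum rule can be compared with plain gradient descent on an elongated quadratic, L(w) = w₀² + 4w₁². The sketch assumes a deliberately small η = 0.01 and β = 0.9; with other settings the comparison can come out differently.

```python
import numpy as np

def grad(w):
    """Gradient of L(w) = w0^2 + 4*w1^2."""
    return np.array([2.0 * w[0], 8.0 * w[1]])

def loss(w):
    return w[0] ** 2 + 4.0 * w[1] ** 2

eta, beta, steps = 0.01, 0.9, 100

# Plain gradient descent
w_plain = np.array([3.0, 2.0])
for _ in range(steps):
    w_plain = w_plain - eta * grad(w_plain)

# Momentum: v <- beta*v + grad, w <- w - eta*v
w_mom = np.array([3.0, 2.0])
v = np.zeros(2)
for _ in range(steps):
    v = beta * v + grad(w_mom)
    w_mom = w_mom - eta * v

print("plain GD loss:", loss(w_plain))
print("momentum loss:", loss(w_mom))
```

With the same η and step budget, the accumulated velocity lets momentum make much faster progress along the shallow w₀ direction, where plain gradient descent only shrinks the coordinate by 2% per step.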

RMSprop and Adam

RMSprop keeps a moving average of squared gradients and divides the update by its root mean square, giving per-parameter scaling. Adam combines momentum (first moment) with RMSprop-like variance normalization (second moment), with bias correction for early steps. Default hyperparameters (β₁=0.9, β₂=0.999, ε small) often work well out of the box.
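The Adam update can be written out in a few lines. This is a sketch of the standard update rule for illustration, not PyTorch's implementation; the function name `adam_step` and the test gradient are assumptions for the example.

```python
import numpy as np

def adam_step(w, g, m, v, t, eta=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; t is the 1-based step count."""
    m = beta1 * m + (1 - beta1) * g          # first moment (momentum-like average)
    v = beta2 * v + (1 - beta2) * g ** 2     # second moment (average of squared gradients)
    m_hat = m / (1 - beta1 ** t)             # bias correction for early steps
    v_hat = v / (1 - beta2 ** t)
    w = w - eta * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v

w = np.array([3.0, 2.0])
m = np.zeros_like(w)
v = np.zeros_like(w)

g = np.array([2.0 * w[0], 8.0 * w[1]])       # gradient of w0^2 + 4*w1^2
w_new, m, v = adam_step(w, g, m, v, t=1)
print(w - w_new)   # per-coordinate step sizes after the first update
```

Although the two gradient components differ (6 and 16), both coordinates move by almost exactly η on the first step: the normalization by the root of the second moment removes the gradient's scale, which is the per-parameter scaling described above.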

AdamW fixes how weight decay interacts with Adam's adaptive scaling: the decay is applied directly to the weights rather than being mixed into the gradient that Adam rescales. This matches modern transformer and vision training recipes better than the older "Adam + L2" form.
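The difference shows up even in a single scalar step. The sketch below assumes the data gradient is zero so that only the decay term acts, and uses the fact that on the very first step Adam's bias-corrected moments reduce to g and g²; the helper `adam_direction` is hypothetical and exists only for this illustration.

```python
import math

def adam_direction(g, eps=1e-8):
    """Adam's first-step direction: after bias correction m_hat = g and v_hat = g**2."""
    return g / (math.sqrt(g * g) + eps)

theta, eta, weight_decay = 10.0, 0.1, 0.01
data_grad = 0.0   # assume the loss itself has zero gradient here, to isolate the decay term

# "Adam + L2": the penalty gradient weight_decay * theta is added to the gradient,
# then rescaled by Adam's normalization
g_l2 = data_grad + weight_decay * theta
theta_l2 = theta - eta * adam_direction(g_l2)

# AdamW-style (decoupled): Adam sees only the data gradient;
# the decay shrinks the weight directly, scaled by the learning rate
theta_adamw = theta - eta * adam_direction(data_grad) - eta * weight_decay * theta

print(theta_l2, theta_adamw)
```

In the L2 form the weight moves by roughly η whatever value `weight_decay` has, because the normalization cancels the penalty's magnitude. In the decoupled form the weight shrinks by η · weight_decay · θ, so the decay strength keeps its intended meaning.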

Generalization. Adaptive methods sometimes find sharper minima than carefully tuned SGD; if your train/validation gap looks unusual, try SGD + momentum with a cosine schedule as an ablation, not because Adam is "wrong," but because it interacts with the loss landscape differently.

Practical Notes

Use parameter groups for different learning rates (e.g. backbone vs head in transfer learning).
Gradient clipping caps the gradient norm before the optimizer step; it stabilizes RNNs and large language models.
Optimizer choice pairs with learning rate schedules (warmup, cosine decay); see the next tutorial in the series.
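Parameter groups, mentioned in the first note above, are passed to the optimizer as a list of dicts. The sketch assumes a hypothetical two-part model (`backbone` and `head` are stand-in layers) and illustrative learning rates.

```python
import torch.nn as nn
import torch.optim as optim

# Hypothetical two-part model: a "backbone" and a task "head"
backbone = nn.Linear(16, 8)
head = nn.Linear(8, 2)

optimizer = optim.SGD(
    [
        {"params": backbone.parameters(), "lr": 1e-3},  # small LR for pretrained weights
        {"params": head.parameters()},                   # falls back to the default lr below
    ],
    lr=1e-2,
    momentum=0.9,
)

for group in optimizer.param_groups:
    print(group["lr"])
```

Options given inside a group override the keyword defaults; anything a group omits (here `momentum`, and `lr` for the head) is filled in from the optimizer-level arguments.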

PyTorch: `AdamW` and `SGD`

Typical setup

import torch.nn as nn
import torch.optim as optim

model = nn.Linear(10, 2)  # stand-in model; substitute your own nn.Module

# Default for many deep models
opt_adamw = optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.01)

# ImageNet-style (example hyperparameters; tune for your setup)
opt_sgd = optim.SGD(
    model.parameters(),
    lr=0.1,
    momentum=0.9,
    weight_decay=1e-4,
    nesterov=True,
)

Summary

SGD is simple; momentum helps in ill-conditioned valleys; Adam/AdamW adapt per-parameter step sizes.
AdamW is preferred when using meaningful weight decay with Adam.
Tune learning rate (and schedule) with your optimizer and batch size together.
Next: learning rate schedules and warmup.

Learning Rate Schedules

Why Change the Learning Rate?

Early in training, weights are far from a good basin; a moderate-to-large η helps escape flat or unhelpful regions. Later, aggressive updates oscillate around a minimum. Step decay multiplies η by a constant every N epochs (e.g. ×0.1 at epochs 30, 60). Exponential or inverse-time decay shrinks η smoothly. These rules are simple and still widely used with SGD + momentum.
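Both decay rules are one-liners. The sketch below uses the milestones from the text (×0.1 at epochs 30 and 60); the base rate, the exponential `decay_rate`, and the function names are illustrative assumptions.

```python
import math

def step_decay(epoch, base_lr=0.1, milestones=(30, 60), gamma=0.1):
    """Multiply the LR by gamma once for each milestone already reached."""
    passed = sum(1 for m in milestones if epoch >= m)
    return base_lr * gamma ** passed

def exponential_decay(epoch, base_lr=0.1, decay_rate=0.05):
    """Smooth decay: base_lr * exp(-decay_rate * epoch)."""
    return base_lr * math.exp(-decay_rate * epoch)

for epoch in (0, 29, 30, 60, 90):
    print(epoch, step_decay(epoch), exponential_decay(epoch))
```

Step decay holds η constant between milestones and drops it abruptly; exponential decay lowers it a little every epoch.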

Cosine annealing decreases η along a cosine curve from a maximum to a near-zero minimum over T steps, sometimes with periodic restarts (SGDR) to escape local minima. OneCycleLR warms up then anneals in one cycle, often paired with momentum adjustment, and is popular for fast-convergence experiments.
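The cosine curve without restarts is commonly written as η(t) = η_min + ½ (η_max − η_min)(1 + cos(π t / T)). A minimal sketch, with illustrative default rates and a hypothetical function name:

```python
import math

def cosine_lr(step, total_steps, lr_max=3e-4, lr_min=1e-6):
    """Cosine annealing from lr_max at step 0 down to lr_min at total_steps."""
    progress = step / total_steps
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * progress))

for step in (0, 25, 50, 75, 100):
    print(step, cosine_lr(step, total_steps=100))
```

The rate changes slowly near the start and end and fastest in the middle, passing through the midpoint of `lr_max` and `lr_min` halfway through.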

Warmup

In very deep models or with large batches, early updates can be unstable. Linear warmup increases η from 0 (or a small value) to the base LR over the first W steps or epochs. After warmup, a cosine or constant-then-decay schedule often follows. Transformer training recipes (e.g. "Attention Is All You Need") popularized this pattern.
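Warmup followed by cosine decay can be expressed as a single multiplier on the base learning rate. The sketch below is one possible formulation, with assumed step counts and a hypothetical function name; it starts from a small value (1/W) rather than exactly 0.

```python
import math

def warmup_cosine_factor(step, warmup_steps=100, total_steps=1000):
    """Multiplier on the base LR: linear warmup, then cosine decay to zero."""
    if step < warmup_steps:
        return (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * (1.0 + math.cos(math.pi * min(1.0, progress)))

for step in (0, 50, 99, 100, 550, 1000):
    print(step, warmup_cosine_factor(step))
```

A function of this shape (step in, multiplier out) is what `torch.optim.lr_scheduler.LambdaLR` accepts as its `lr_lambda` argument, provided the scheduler is stepped once per batch rather than once per epoch.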

If loss spikes at step 1, try lower peak LR, warmup, or gradient clipping before chasing a fancier architecture.

PyTorch Schedulers

PyTorch separates the optimizer (stores params and the step rule) from the scheduler (updates the optimizer's LR each epoch or step). Call scheduler.step() after optimizer.step(), either once per epoch or once per batch depending on the scheduler.

Cosine annealing per epoch

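import torch.nn as nn
import torch.optim as optim
from torch.optim.lr_scheduler import CosineAnnealingLR

model = nn.Linear(10, 2)  # stand-in model; substitute your own nn.Module
optimizer = optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.01)
scheduler = CosineAnnealingLR(optimizer, T_max=100, eta_min=1e-6)

def train_one_epoch():
    # Placeholder for your real loop: forward, loss.backward(), optimizer.step() per batch
    optimizer.step()

for epoch in range(100):
    train_one_epoch()
    scheduler.step()  # once per epoch, after that epoch's optimizer steps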

For step-based warmup + cosine, many codebases use LambdaLR, SequentialLR, or libraries like Hugging Face get_cosine_schedule_with_warmup.

Summary

Decay η over training to stabilize late-stage optimization; warmup helps at scale.
Cosine and step decay are common; OneCycle is a strong alternative when tuned.
Re-tune LR when you change batch size or optimizer family.
Next: vanishing and exploding gradients in very deep stacks.

Vanishing & Exploding Gradients

Why the Product Matters

The gradient of the loss with respect to an early-layer weight involves a chain of Jacobian factors, one per layer (and per activation). If typical factors have magnitude < 1, the product decays exponentially with depth; if > 1, it can explode. Random init without scaling can push you toward either extreme in deep linear-like stacks.
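The exponential effect of depth is easy to reproduce with random linear layers. The sketch below pushes a unit-norm gradient backward through 50 random weight matrices whose entries have standard deviation gain/√width; the depth, width, and gains are illustrative assumptions, and real networks add nonlinearities on top.

```python
import numpy as np

def backprop_norm(gain, depth=50, width=64, seed=0):
    """Push a unit gradient back through `depth` random linear layers and return its norm."""
    rng = np.random.default_rng(seed)
    grad = np.ones(width) / np.sqrt(width)           # unit-norm upstream gradient
    for _ in range(depth):
        W = rng.normal(scale=gain / np.sqrt(width), size=(width, width))
        grad = W.T @ grad                            # one Jacobian factor per layer
    return np.linalg.norm(grad)

print("gain 0.5:", backprop_norm(0.5))   # typical factor < 1: vanishes
print("gain 1.0:", backprop_norm(1.0))   # typical factor ~ 1: stays moderate
print("gain 1.5:", backprop_norm(1.5))   # typical factor > 1: explodes
```

Each layer scales the gradient norm by roughly `gain`, so after 50 layers the norm behaves like gain⁵⁰. Initialization schemes are designed to keep this per-layer factor close to 1.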

Vanishing gradients starve the early layers (those closest to the input) of useful signal, so they update too slowly. Exploding gradients cause huge weight swings and numerical overflow. RNNs over long sequences face the same issue through time, hence LSTM and GRU gating to create additive paths where gradients flow more gently.

Mitigations

ReLU (and variants): derivative 1 for positive pre-activations avoids the universal shrinkage of sigmoid in the active region.
Residual connections (ResNet): add identity shortcuts so layers learn residuals; gradient has a direct path backward.
Batch / layer normalization: stabilizes scale of activations and often improves gradient behavior.
Initialization (He, Xavier): keeps forward variance and backward gradient scale in a reasonable range at layer boundaries.
Gradient clipping: cap the global norm of gradients before optimizer.step(); standard in NLP and RNN training.
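Of the mitigations above, global-norm clipping is the simplest to write out by hand. The sketch below shows the idea in NumPy; it is an illustration of the technique rather than PyTorch's implementation, and the function name and the small `eps` guard are assumptions.

```python
import numpy as np

def clip_by_global_norm(grads, max_norm, eps=1e-6):
    """Rescale a list of gradient arrays so their combined L2 norm is at most max_norm."""
    total_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    scale = min(1.0, max_norm / (total_norm + eps))
    return [g * scale for g in grads], total_norm

grads = [np.array([3.0, 4.0]), np.array([[12.0]])]   # global norm = sqrt(9 + 16 + 144) = 13
clipped, norm_before = clip_by_global_norm(grads, max_norm=1.0)
norm_after = np.sqrt(sum(np.sum(g ** 2) for g in clipped))
print(norm_before, norm_after)
```

All arrays are scaled by the same factor, so the direction of the overall update is preserved and only its length is capped. Gradients already within the limit pass through unchanged.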

If you see NaN loss, check learning rate, loss scaling (mixed precision), and clip gradients before rewriting the model.

PyTorch: Gradient Clipping

Clip global norm before optimizer step

import torch

loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()
optimizer.zero_grad()

Summary

Deep chains multiply Jacobians; products can vanish or explode.
ReLU, good init, BN, and residuals are standard stabilizers for feedforward CNNs/MLPs.
RNNs need gating (LSTM/GRU) or truncation; clipping helps exploding cases.
Next: CNNs, a structure built for images and local patterns.
