Neural Networks

Training & Optimization

Gradient descent, backpropagation, optimizers, learning rate, and gradient issues.

Gradient Descent

The Core Update Rule

Let L(θ) be the loss as a function of all parameters θ (weights and biases). The gradient ∇L points in the direction of steepest ascent of L. To reduce loss, we step the opposite way:

θ ← θ − η ∇L(θ)

The positive scalar η is the learning rate. Too large: overshoot, oscillation, divergence. Too small: painfully slow progress, risk of getting stuck in flat regions for a long time. In deep learning, η is one of the first hyperparameters people tune, often alongside batch size and schedule (warmup, decay).
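
To make the three regimes concrete, here is a minimal sketch on the one-dimensional loss L(w) = w². The function name run_gd and the three rates are illustrative choices, not values taken from any recipe.

```python
def run_gd(eta, steps=20, w0=1.0):
    """Gradient descent on L(w) = w**2, whose gradient is 2*w."""
    w = w0
    for _ in range(steps):
        w = w - eta * 2.0 * w
    return w

too_small = run_gd(eta=0.001)   # barely moves away from the start
reasonable = run_gd(eta=0.1)    # shrinks steadily toward 0
too_large = run_gd(eta=1.1)     # every step overshoots and the iterate grows

print(too_small, reasonable, too_large)
```

Each step multiplies w by (1 − 2η), so the iterate shrinks only when that factor has magnitude below 1; for this particular loss that means 0 < η < 1.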

For a smooth convex function, gradient descent with a small enough learning rate converges to a global minimum, whatever the dimension. Neural loss surfaces are not convex in general; in practice we still use the same local linear model (“linearize the loss around the current θ”) because it works remarkably well at scale.

Batch, Mini-Batch, and Stochastic

Batch gradient descent uses the gradient of the loss averaged over all training examples each step. That gives a faithful direction but is expensive when data are huge and must be scanned every update.

Stochastic gradient descent (SGD) originally meant using one example per step: noisy but fast per iteration, and the noise can help escape shallow local minima. In modern usage, mini-batch SGD is standard: each step uses a mini-batch of B examples (e.g. 32–256). The gradient is averaged over the batch, trading noise against compute efficiency and hardware utilization (GPUs are most efficient on large, contiguous matrix operations).

Noise from small batches acts like mild regularization; very large batches can require larger learning rates or special tricks (learning rate scaling rules, warmup) to retain generalization quality. The field has many refinements—momentum, RMSprop, Adam—that adapt how past gradients influence the step, but they still sit on top of the same idea: use derivatives of the loss w.r.t. parameters.

One training step: sample batch → forward → loss → backward → θ ← θ − (step computed from ∇L and optimizer state)
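
The step pattern above can be sketched without any framework. The example below fits a line with mini-batch SGD in plain Python; the synthetic data, batch size, and learning rate are all assumptions chosen only so the loop converges quickly.

```python
import random

random.seed(0)

# Synthetic data for y = 2x + 1 plus a little noise (illustrative only).
xs = [random.uniform(-1.0, 1.0) for _ in range(200)]
data = [(x, 2.0 * x + 1.0 + random.gauss(0.0, 0.05)) for x in xs]

w, b = 0.0, 0.0            # parameters theta = (w, b)
eta, batch_size = 0.1, 16

for step in range(500):
    batch = random.sample(data, batch_size)      # sample batch
    grad_w = grad_b = 0.0
    for x, y in batch:
        err = (w * x + b) - y                    # forward pass and residual
        grad_w += 2.0 * err * x / batch_size     # d(mean squared error)/dw
        grad_b += 2.0 * err / batch_size         # d(mean squared error)/db
    w -= eta * grad_w                            # theta <- theta - eta * grad
    b -= eta * grad_b

print(round(w, 2), round(b, 2))
```

The gradients are written out by hand here; in a framework the inner loop is replaced by a forward pass, a loss, and a backward call.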

Local Minima, Saddles, and Plateaus

High-dimensional loss landscapes are hard to visualize. Local minima (points where all directions go uphill) were once feared as show-stoppers; empirically, many deep nets find solutions that generalize even though the surface is non-convex. Saddle points (some directions down, some up) are more problematic in theory because the gradient can be very small even far from a good solution—optimization research studies how noise and curvature help escape.
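
A toy saddle shows why a small amount of noise matters. The sketch below uses f(x, y) = x² − y², which is unbounded below and so is not a realistic loss; the tiny offset in y is a stand-in for mini-batch noise, and the function name descend is hypothetical.

```python
def descend(x, y, eta=0.1, steps=100):
    """Gradient descent on f(x, y) = x**2 - y**2, which has a saddle at (0, 0)."""
    for _ in range(steps):
        grad_x, grad_y = 2.0 * x, -2.0 * y
        x, y = x - eta * grad_x, y - eta * grad_y
    return x, y

# Starting exactly on the x-axis: the iterate slides into the saddle and stays.
stuck = descend(1.0, 0.0)

# A tiny perturbation in y grows by a constant factor every step.
escaped = descend(1.0, 1e-6)

print(stuck, escaped)
```

Along x the iterate contracts toward zero, while along y any nonzero component is amplified, so the perturbed run leaves the saddle and the exact one does not.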

Plateaus are regions where the gradient is tiny; training can crawl. Good initialization, activation choices (e.g. ReLU vs saturated sigmoids), batch normalization, and adaptive optimizers all affect how often you hit flat or pathological regions. This is why the full training story ties together architecture, loss, and optimization—not any single trick in isolation.

Practical note. If loss is flat, check learning rate, gradient norms, whether the model is in train() mode, and whether loss is wired correctly (e.g. logits vs softmax).

Toy Example: 2D Quadratic in NumPy

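Minimize L(w) = w₀² + 4w₁² (elongated bowl). The gradient is (2w₀, 8w₁). Plain gradient descent with a fixed η below 0.25 shrinks both components toward zero; with η = 0.15 the w₁ component is multiplied by 1 − 8η = −0.2 each step, so it flips sign while shrinking.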

Vanilla GD on a quadratic
import numpy as np

w = np.array([3.0, 2.0])
eta = 0.15
for t in range(30):
    grad = np.array([2 * w[0], 8 * w[1]])
    w = w - eta * grad
print("w after steps:", w)

PyTorch: optimizer.step()

Frameworks compute ∇L via automatic differentiation (next tutorial: backpropagation). The optimizer holds learning rate and momentum buffers; after loss.backward(), optimizer.step() applies the update.

One step pattern
import torch
import torch.nn as nn

model = nn.Linear(5, 1)
opt = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
x = torch.randn(8, 5)
y = torch.randn(8, 1)

loss_fn = nn.MSELoss()
opt.zero_grad()
loss = loss_fn(model(x), y)
loss.backward()
opt.step()

Summary

  • Gradient descent moves parameters opposite the loss gradient, scaled by learning rate.
  • Mini-batch SGD balances noise, speed, and hardware efficiency.
  • Non-convex landscapes still train well in practice with good habits and modern optimizers.
  • Backprop computes gradients; optimizers consume them—next pages unpack both in depth.

Backpropagation

From the Chain Rule to “Error Signals”

If L depends on z, and z depends on w, then ∂L/∂w = (∂L/∂z)(∂z/∂w). In a network, L depends on the output, which depends on the last layer’s activations, which depend on the last layer’s pre-activations, which depend on weights and the previous layer’s activations—and so on. Each link in this chain contributes a factor; backprop multiplies those factors along paths from L back to each w.
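
As a small worked instance of that product of factors, the sketch below takes a single sigmoid unit with a squared-error loss and compares the chain-rule gradient against a finite-difference estimate. The input x = 1.5, the target, and the function names are illustrative assumptions.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def loss(w, x=1.5, target=0.0):
    """L = (a - target)**2 with a = sigmoid(z) and z = w * x."""
    return (sigmoid(w * x) - target) ** 2

def grad_chain_rule(w, x=1.5, target=0.0):
    z = w * x
    a = sigmoid(z)
    dL_da = 2.0 * (a - target)      # loss w.r.t. activation
    da_dz = a * (1.0 - a)           # sigmoid derivative
    dz_dw = x                       # pre-activation w.r.t. weight
    return dL_da * da_dz * dz_dw    # product of the local derivatives

w = 0.3
eps = 1e-6
numeric = (loss(w + eps) - loss(w - eps)) / (2.0 * eps)
analytic = grad_chain_rule(w)
print(analytic, numeric)
```

The two printed numbers should agree to several decimal places; a deeper network only adds more factors to the same product.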

Implementation-wise, we store forward values (inputs to each op) needed to compute local derivatives on the way back. The backward pass visits operations in reverse topological order: each op receives an upstream gradient (“how much L changes per unit change in this op’s output”) and returns gradients w.r.t. its inputs.

Forward: x → … → ŷ → L
Backward: ∂L/∂ŷ → … → ∂L/∂x (plus ∂L/∂W, ∂L/∂b at each layer)
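
The reverse-topological traversal can be sketched in a few dozen lines of plain Python. The Value class below is a hypothetical teaching toy supporting only addition and multiplication on scalars; it is not how PyTorch is implemented, only the same idea at small scale.

```python
class Value:
    """A scalar that records how it was computed so gradients can flow back."""

    def __init__(self, data, parents=()):
        self.data = data
        self.grad = 0.0
        self._parents = parents          # Values this one was computed from
        self._backward = lambda: None    # pushes self.grad to the parents

    def __add__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data + other.data, (self, other))
        def _backward():
            self.grad += out.grad
            other.grad += out.grad
        out._backward = _backward
        return out

    def __mul__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data * other.data, (self, other))
        def _backward():
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad
        out._backward = _backward
        return out

    __rmul__ = __mul__

    def backward(self):
        # Build a topological order of the graph, then visit it in reverse.
        order, seen = [], set()
        def visit(v):
            if id(v) not in seen:
                seen.add(id(v))
                for p in v._parents:
                    visit(p)
                order.append(v)
        visit(self)
        self.grad = 1.0                  # dL/dL = 1 seeds the backward pass
        for v in reversed(order):
            v._backward()

x = Value(2.0)
y = x * x * x + 5 * x      # x**3 + 5x written with the two supported ops
y.backward()
print(x.grad)              # 17.0, i.e. 3 * 2**2 + 5
```

Each operation stores the forward values it needs and a closure that turns its upstream gradient into gradients for its inputs; gradients accumulate with += because x feeds several operations.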

What Happens in One Dense Layer?

Suppose z = W a + b and a' = σ(z). During backward, we receive ∂L/∂a' (same shape as a'). First, ∂L/∂z = ∂L/∂a' ⊙ σ'(z) (element-wise product for pointwise σ). Then ∂L/∂a = Wᵀ (∂L/∂z) to pass the signal to the previous layer, and ∂L/∂W = (∂L/∂z) aᵀ in the appropriate batch layout (or outer products summed over examples). Biases sum the gradient across batch dimension.

You do not need to memorize every transpose if you use a framework; you do need the picture: matmul backward swaps the roles of activations and upstream gradients when flowing through weights, which is why shape discipline matters for custom layers.
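
For readers who do want to see the transposes once, here is a NumPy sketch of the backward pass for a single example through one sigmoid layer, checked against finite differences. The layer sizes, the toy loss L = sum(a'), and the function names are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(W, b, a):
    """a has shape (n_in,); returns a' = sigmoid(W a + b) of shape (n_out,)."""
    return sigmoid(W @ a + b)

def backward(W, b, a, upstream):
    """upstream is dL/da'; returns (dL/dW, dL/db, dL/da)."""
    a_out = forward(W, b, a)
    dz = upstream * a_out * (1.0 - a_out)   # dL/dz = dL/da' * sigma'(z)
    dW = np.outer(dz, a)                    # dL/dW = dz a^T
    db = dz                                 # single example, so no batch sum
    da = W.T @ dz                           # dL/da = W^T dz
    return dW, db, da

W = rng.normal(size=(3, 4))
b = rng.normal(size=3)
a = rng.normal(size=4)

# Toy loss L = sum(a'), so the upstream gradient dL/da' is all ones.
dW, db, da = backward(W, b, a, np.ones(3))

# Finite-difference check on one weight entry.
eps = 1e-6
W_plus, W_minus = W.copy(), W.copy()
W_plus[1, 2] += eps
W_minus[1, 2] -= eps
numeric = (forward(W_plus, b, a).sum() - forward(W_minus, b, a).sum()) / (2 * eps)
print(dW[1, 2], numeric)
```

Note that dW has the shape of W and da has the shape of a, which is a quick sanity check worth keeping in any custom layer.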

ReLU, Sigmoid, and Softmax Layers

ReLU: derivative 1 where z>0, 0 where z<0 (frameworks conventionally use 0 at z=0). A “dead” neuron is one whose pre-activation is negative for every input, so its gradient is always zero; this is another reason Leaky ReLU and other activations appear.

Sigmoid/tanh: derivatives involve σ(1−σ) or 1−tanh²; in deep stacks these can shrink gradients (historical “vanishing gradient” story with saturating activations).

Softmax + cross-entropy: the combined backward pass toward the logits simplifies to softmax(z) − y for a one-hot target y, so frameworks fuse the two operations for numerical stability and never form the full softmax Jacobian explicitly. Conceptually, the gradient on logits pushes probability mass toward the correct class and away from wrong ones.
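
The simplification can be checked numerically. For a single example with a one-hot target, the gradient of cross-entropy with respect to the logits is the softmax output minus the one-hot vector; the sketch below compares that closed form to finite differences. The logits and helper names are made up for the example.

```python
import math

def softmax(logits):
    m = max(logits)                              # subtract the max for stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, target):
    """Negative log-probability of the target class index."""
    return -math.log(softmax(logits)[target])

def grad_logits(logits, target):
    """Closed form: softmax(logits) minus the one-hot target."""
    probs = softmax(logits)
    return [p - (1.0 if i == target else 0.0) for i, p in enumerate(probs)]

logits, target = [2.0, -1.0, 0.5], 0
analytic = grad_logits(logits, target)

eps = 1e-6
numeric = []
for i in range(len(logits)):
    up = list(logits)
    down = list(logits)
    up[i] += eps
    down[i] -= eps
    numeric.append((cross_entropy(up, target) - cross_entropy(down, target)) / (2 * eps))

print(analytic)
print(numeric)
```

The entry for the correct class is negative and the others are positive, so a descent step raises the correct logit and lowers the rest.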

Memory and Compute

Training typically stores activations from the forward pass for backward use. That is why activation checkpointing trades extra forward recomputation for less GPU memory on very large models. The backward pass has similar order-of-magnitude cost to the forward pass for many common layers.

Debugging. If you implement a custom autograd.Function, verify shapes and run torch.autograd.gradcheck on tiny random double-precision inputs; it compares your backward against finite differences, which is slow but trustworthy.

PyTorch: backward() and the Graph

PyTorch builds a dynamic graph each forward (eager mode). Tensors with requires_grad=True track operations. Calling loss.backward() runs backprop; optimizer.step() then uses stored .grad tensors. torch.no_grad() disables graph building for inference.

Minimal autograd trail
import torch

x = torch.tensor([2.0], requires_grad=True)
y = x ** 3 + 5 * x
y.backward()   # dy/dx = 3x^2 + 5
print(x.grad)  # tensor([17.])

Summary

  • Backprop = reverse-mode differentiation applying the chain rule on the network’s computation graph.
  • Each layer maps an upstream gradient into gradients w.r.t. its inputs and parameters.
  • Modern frameworks hide Jacobian algebra but the structure explains vanishing/exploding behavior.
  • Computational graphs (next page) make this bookkeeping explicit.

Optimizers (SGD, Adam, …)

SGD and Momentum

Stochastic gradient descent: θ ← θ − η ∇L. Mini-batches approximate the full-dataset gradient cheaply. Without momentum, updates zig-zag in ill-conditioned valleys.

Momentum maintains v ← βv + ∇L and updates θ ← θ − ηv. Nesterov momentum evaluates the gradient at a “look-ahead” point, often improving responsiveness. These methods use one global learning rate (per parameter group) and the same update scale for all coordinates—tuning η and batch size matters a lot.
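
A minimal sketch of the momentum rule as written above, on an ill-conditioned quadratic. The curvatures (1 and 100), the learning rate, and the step count are assumptions chosen to make the contrast visible; setting β = 0 recovers plain gradient descent.

```python
def loss(w):
    """Ill-conditioned bowl: curvature 1 along w[0], 100 along w[1]."""
    return 0.5 * (w[0] ** 2 + 100.0 * w[1] ** 2)

def grad(w):
    return [w[0], 100.0 * w[1]]

def train(beta, eta=0.019, steps=200):
    w = [1.0, 1.0]
    v = [0.0, 0.0]
    for _ in range(steps):
        g = grad(w)
        v = [beta * vi + gi for vi, gi in zip(v, g)]     # v <- beta*v + grad
        w = [wi - eta * vi for wi, vi in zip(w, v)]      # theta <- theta - eta*v
    return loss(w)

plain = train(beta=0.0)       # beta = 0 reduces to plain gradient descent
momentum = train(beta=0.9)
print(plain, momentum)
```

With the same η, the plain run is limited by the slow progress along the flat direction, while the velocity term lets the momentum run build up speed there.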

RMSprop and Adam

RMSprop keeps a moving average of squared gradients and divides the update by its root mean square, giving per-parameter scaling. Adam combines momentum (first moment) with RMSprop-like variance normalization (second moment), with bias correction for early steps. Default hyperparameters (β₁=0.9, β₂=0.999, ε small) work surprisingly often out of the box.
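
The update described above fits in a short function. This is a plain-Python sketch of Adam with bias correction, run on the quadratic from the earlier toy example; the learning rate 0.1 and the step count are assumptions for this toy problem, not recommended defaults.

```python
import math

def adam_minimize(grad_fn, theta, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8, steps=500):
    m = [0.0] * len(theta)   # first-moment (mean) estimate
    v = [0.0] * len(theta)   # second-moment (uncentered variance) estimate
    for t in range(1, steps + 1):
        g = grad_fn(theta)
        m = [beta1 * mi + (1 - beta1) * gi for mi, gi in zip(m, g)]
        v = [beta2 * vi + (1 - beta2) * gi * gi for vi, gi in zip(v, g)]
        m_hat = [mi / (1 - beta1 ** t) for mi in m]   # bias correction
        v_hat = [vi / (1 - beta2 ** t) for vi in v]
        theta = [th - lr * mh / (math.sqrt(vh) + eps)
                 for th, mh, vh in zip(theta, m_hat, v_hat)]
    return theta

def grad_fn(w):
    """Gradient of L(w) = w0**2 + 4*w1**2."""
    return [2.0 * w[0], 8.0 * w[1]]

theta = adam_minimize(grad_fn, [3.0, 2.0])
print(theta)
```

Because both moments start at zero, the raw averages are biased toward zero in early steps; dividing by (1 − β^t) removes that bias.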

AdamW fixes how weight decay interacts with Adam’s adaptive scaling: the decay is applied directly to the weights (decoupled weight decay) rather than added to the gradient as an L2 penalty, which Adam would then rescale. This matches modern transformer and vision training recipes better than the older “Adam + L2” form.
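
The difference is easiest to see on the very first update, where bias correction makes the moment estimates equal to g and g². The sketch below is an illustration under that simplification (a single step with a zero loss gradient), and the function names are hypothetical.

```python
import math

def adam_l2_step(w, g, lr, wd, eps=1e-8):
    """First Adam step with an L2 term folded into the gradient (t = 1)."""
    g = g + wd * w                         # penalty mixed into the gradient
    m_hat, v_hat = g, g * g                # bias-corrected moments at t = 1
    return w - lr * m_hat / (math.sqrt(v_hat) + eps)

def adamw_step(w, g, lr, wd, eps=1e-8):
    """First AdamW step: decay applied to the weight, outside the rescaling."""
    w = w * (1.0 - lr * wd)                # decoupled weight decay
    m_hat, v_hat = g, g * g
    return w - lr * m_hat / (math.sqrt(v_hat) + eps)

lr, wd = 1e-3, 0.01
for w in (10.0, 0.1):
    shrink_l2 = w - adam_l2_step(w, 0.0, lr, wd)
    shrink_w = w - adamw_step(w, 0.0, lr, wd)
    print(w, shrink_l2, shrink_w)
```

With the L2 form, Adam's normalization turns the penalty into a step of roughly lr for both weights regardless of their size; with the decoupled form, the large weight shrinks 100 times more than the small one, which is what weight decay is meant to do.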

Generalization. Adaptive methods sometimes generalize worse than carefully tuned SGD, with sharper minima being one proposed explanation; if your validation gap is unexpectedly large, try SGD+Momentum with a cosine schedule as an ablation. The point is not that Adam is “wrong,” but that its interaction with the loss landscape differs.

Practical Notes

  • Use parameter groups for different learning rates (e.g. backbone vs head in transfer learning).
  • Gradient clipping caps norm before the optimizer step—stabilizes RNNs and large language models.
  • Optimizer choice pairs with learning rate schedules (warmup, cosine decay)—see the next tutorial in the series.

PyTorch: AdamW and SGD

Typical setup
import torch.nn as nn
import torch.optim as optim

model = nn.Linear(5, 1)  # placeholder model so the snippet runs on its own

# Default for many deep models
opt_adamw = optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.01)

# ImageNet-style (example hyperparameters — tune for your setup)
opt_sgd = optim.SGD(
    model.parameters(),
    lr=0.1,
    momentum=0.9,
    weight_decay=1e-4,
    nesterov=True,
)

Summary

  • SGD is simple; momentum helps curvature; Adam/AdamW adapt per-parameter step sizes.
  • AdamW is preferred when using meaningful weight decay with Adam.
  • Tune learning rate (and schedule) with your optimizer and batch size together.
  • Next: learning rate schedules and warmup.

Learning Rate Schedules

Why Change the Learning Rate?

Early in training, weights are far from a good basin; a moderate-to-large η helps escape flat or unhelpful regions. Later, aggressive updates oscillate around a minimum. Step decay multiplies η by a constant every N epochs (e.g. ×0.1 at epochs 30, 60). Exponential or inverse-time decay shrinks η smoothly. These rules are simple and still widely used with SGD + momentum.
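
These decay rules are each a one-line function of the epoch. The sketch below writes them out in plain Python; the milestone epochs follow the example in the text, while the decay constants gamma and k are arbitrary illustrative values.

```python
def step_decay(base_lr, epoch, milestones=(30, 60), gamma=0.1):
    """Multiply the rate by gamma once for each milestone already reached."""
    passed = sum(1 for m in milestones if epoch >= m)
    return base_lr * gamma ** passed

def exponential_decay(base_lr, epoch, gamma=0.95):
    """Shrink the rate by a constant factor every epoch."""
    return base_lr * gamma ** epoch

def inverse_time_decay(base_lr, epoch, k=0.1):
    """Shrink the rate in proportion to 1 / (1 + k * epoch)."""
    return base_lr / (1.0 + k * epoch)

for epoch in (0, 29, 30, 60, 90):
    print(epoch,
          step_decay(0.1, epoch),
          exponential_decay(0.1, epoch),
          inverse_time_decay(0.1, epoch))
```

Step decay holds the rate constant between milestones, whereas the other two change it a little every epoch.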

Cosine annealing decreases η along a cosine curve from a maximum to a near-zero minimum over T steps, sometimes with periodic restarts (SGDR) to escape local minima. OneCycleLR warms up then anneals in one cycle, often paired with momentum adjustment—popular for fast convergence experiments.
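
The cosine curve itself is short to write down. The sketch below assumes a half-cosine from lr_max to lr_min and, for restarts, cycles of fixed length; SGDR also allows cycle lengths that grow over time, which is left out here.

```python
import math

def cosine_lr(step, total_steps, lr_max, lr_min=0.0):
    """Cosine curve from lr_max at step 0 down to lr_min at total_steps."""
    progress = min(step, total_steps) / total_steps
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * progress))

def cosine_lr_with_restarts(step, cycle_steps, lr_max, lr_min=0.0):
    """Restart the same curve every cycle_steps steps (fixed-length cycles)."""
    return cosine_lr(step % cycle_steps, cycle_steps, lr_max, lr_min)

for step in (0, 25, 50, 75, 100):
    print(step, cosine_lr(step, 100, lr_max=1.0))
```

The curve is flat near both ends and steepest in the middle, so the rate lingers near its maximum early and near its minimum late.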

Warmup

In very deep models or large batches, early updates can be unstable. Linear warmup increases η from 0 (or a small value) to the base LR over the first W steps or epochs. After warmup, a cosine or constant-then-decay schedule often follows. Transformer training recipes popularized this pattern; “Attention Is All You Need” paired linear warmup with inverse-square-root decay.
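
One common way to express warmup is as a multiplier on the base learning rate. The sketch below is one plausible shape, a linear ramp followed by cosine decay to zero; the function name, the warmup length of 100 steps, and the total of 1000 steps are assumptions for illustration.

```python
import math

def warmup_cosine_factor(step, warmup_steps, total_steps):
    """Multiplier on the base LR: linear ramp, then cosine decay to zero."""
    if step < warmup_steps:
        return (step + 1) / warmup_steps          # reaches 1.0 on the last warmup step
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * (1.0 + math.cos(math.pi * min(1.0, progress)))

base_lr = 3e-4
lrs = [base_lr * warmup_cosine_factor(s, 100, 1000) for s in range(1000)]
print(lrs[0], lrs[99], lrs[-1])
```

A function of this shape, mapping a step index to a multiplier, is the kind of callable that PyTorch's LambdaLR scheduler accepts.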

If loss spikes at step 1, try lower peak LR, warmup, or gradient clipping before chasing a fancier architecture.

PyTorch Schedulers

PyTorch separates optimizer (stores params and step rule) from scheduler (updates optimizer’s LR each epoch or step). Call scheduler.step() after optimizer.step(): once per epoch for epoch-based schedulers, or once per batch for step-based ones such as OneCycleLR.

Cosine annealing per epoch
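import torch
import torch.optim as optim
from torch.optim.lr_scheduler import CosineAnnealingLR

model = torch.nn.Linear(5, 1)  # placeholder model so the snippet runs on its own
optimizer = optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.01)
scheduler = CosineAnnealingLR(optimizer, T_max=100, eta_min=1e-6)

def train_one_epoch():
    # Placeholder for a real loop over batches: one dummy update
    optimizer.zero_grad()
    model(torch.randn(8, 5)).sum().backward()
    optimizer.step()

for epoch in range(100):
    train_one_epoch()
    scheduler.step()  # once per epoch, after that epoch's optimizer steps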

For step-based warmup + cosine, many codebases use LambdaLR, SequentialLR, or libraries like Hugging Face get_cosine_schedule_with_warmup.

Summary

  • Decay η over training to stabilize late-stage optimization; warmup helps at scale.
  • Cosine and step decay are common; OneCycle is a strong alternative when tuned.
  • Re-tune LR when you change batch size or optimizer family.
  • Next: vanishing and exploding gradients in very deep stacks.

Vanishing & Exploding Gradients

Why the Product Matters

The gradient of the loss with respect to an early-layer weight involves a chain of Jacobian factors—one per layer (and per activation). If typical factors have magnitude < 1, the product decays exponentially with depth; if > 1, it can explode. Random init without scaling can push you toward either extreme in deep linear-like stacks.
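
The exponential effect of depth is easy to underestimate, so here is the arithmetic. The sketch treats every layer as contributing the same scalar factor, which is a deliberate simplification of a product of Jacobians.

```python
def backprop_scale(factor, depth):
    """Magnitude of a gradient after passing through `depth` identical factors."""
    return factor ** depth

for factor in (0.9, 1.0, 1.1):
    print(factor, [backprop_scale(factor, d) for d in (10, 50, 100)])

# The sigmoid derivative is at most 0.25, so with unit weights a 20-layer
# sigmoid stack scales the signal by at most 0.25**20.
print(0.25 ** 20)
```

Factors only 10% away from 1 already change the gradient by four orders of magnitude over 100 layers, in either direction.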

Vanishing gradients starve the early layers (those closest to the input) of useful signal, so they update too slowly. Exploding gradients cause huge weight swings and numerical overflow. RNNs over long sequences face the same issue through time, which is why LSTM and GRU use gating to create additive paths where gradients flow more gently.

Mitigations

  • ReLU (and variants): derivative 1 for positive pre-activations avoids the universal shrinkage of sigmoid in the active region.
  • Residual connections (ResNet): add identity shortcuts so layers learn residuals; gradient has a direct path backward.
  • Batch / layer normalization: stabilizes scale of activations and often improves gradient behavior.
  • Initialization (He, Xavier): keeps forward variance and backward gradient scale in a reasonable range at layer boundaries.
  • Gradient clipping: cap the global norm of gradients before optimizer.step(); this is standard in NLP and RNN training.

If you see NaN loss, check learning rate, loss scaling (mixed precision), and clip gradients before rewriting the model.

PyTorch: Gradient Clipping

Clip global norm before optimizer step
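import torch
import torch.nn as nn

model = nn.Linear(5, 1)  # placeholder model, optimizer, and loss
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss = nn.functional.mse_loss(model(torch.randn(8, 5)), torch.randn(8, 1))

loss.backward()
# Rescale all gradients in place so their global L2 norm is at most max_norm
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()
optimizer.zero_grad()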
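
What global-norm clipping computes can be sketched in plain Python. The function below is an illustration of the idea on nested lists, not PyTorch's implementation; the name clip_by_global_norm and the small eps in the denominator are assumptions.

```python
import math

def clip_by_global_norm(grads, max_norm, eps=1e-6):
    """grads: list of lists of floats, one inner list per parameter tensor."""
    total = math.sqrt(sum(g * g for tensor in grads for g in tensor))
    scale = min(1.0, max_norm / (total + eps))     # never scale gradients up
    return [[g * scale for g in tensor] for tensor in grads], total

grads = [[3.0, 4.0], [12.0]]       # global norm = sqrt(9 + 16 + 144) = 13
clipped, norm_before = clip_by_global_norm(grads, max_norm=1.0)
norm_after = math.sqrt(sum(g * g for t in clipped for g in t))
print(norm_before, norm_after)
```

All gradients are scaled by the same factor, so the update direction is preserved and only its length is capped.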

Summary

  • Deep chains multiply Jacobians; products can vanish or explode.
  • ReLU, good init, BN, and residuals are standard stabilizers for feedforward CNNs/MLPs.
  • RNNs need gating (LSTM/GRU) or truncation; clipping helps exploding cases.
  • Next: CNNs—structure built for images and local patterns.