Gradient Descent
Gradient descent is the workhorse of neural network training: repeatedly measure how the loss changes with respect to each parameter (the gradient), then nudge the parameters in the direction that decreases the loss fastest (the negative gradient). The picture is “walking downhill” on a high-dimensional surface defined by the loss, except the surface is noisy, non-convex, and we only see stochastic estimates of the slope.
The Core Update Rule
Let L(θ) be the loss as a function of all parameters θ (weights and biases). The gradient ∇L points in the direction of steepest ascent of L. To reduce loss, we step the opposite way:
θ ← θ − η ∇L(θ)
The positive scalar η is the learning rate. Too large: overshoot, oscillation, divergence. Too small: painfully slow progress, risk of getting stuck in flat regions for a long time. In deep learning, η is one of the first hyperparameters people tune, often alongside batch size and schedule (warmup, decay).
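The learning-rate trade-off can be seen concretely on the one-dimensional quadratic L(w) = w² (gradient 2w), where each update multiplies w by (1 − 2η). The helper function and the specific η values below are illustrative choices, not part of this tutorial:

```python
def descend(eta, steps=20, w0=1.0):
    """Fixed-step gradient descent on L(w) = w**2, whose gradient is 2*w."""
    w = w0
    for _ in range(steps):
        w = w - eta * 2 * w  # each step multiplies w by (1 - 2*eta)
    return w

print(descend(0.05))  # small eta: steady but slow shrink toward 0
print(descend(0.45))  # well-chosen eta: converges quickly
print(descend(1.10))  # too large: |1 - 2*eta| > 1, iterates grow without bound
```

With η = 1.10 the multiplier is −1.2, so the iterate overshoots past the minimum and grows in magnitude every step, which is exactly the divergence regime described above.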
For a smooth convex function in low dimensions, with a well-chosen rate, gradient descent can converge to the global minimum. Neural loss surfaces are not convex in general; in practice we still use the same local linear model (“linearize the loss around the current θ”) because it works remarkably well at scale.
Batch, Mini-Batch, and Stochastic
Batch gradient descent uses the gradient of the loss averaged over all training examples at each step. That gives a faithful direction but is expensive when the dataset is huge, since every example must be scanned for every update.
Stochastic gradient descent (SGD) originally meant using one example per step: noisy but fast per iteration and can help escape shallow local features. In modern usage, mini-batch SGD is standard: each step uses a mini-batch of B examples (e.g. 32–256). The gradient is averaged over the batch, trading noise against compute efficiency and hardware utilization (GPUs like contiguous matrix work).
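A minimal NumPy sketch of mini-batch SGD on linear regression makes the loop structure concrete; the synthetic data, η, and B = 32 here are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic regression data: y = X @ w_true + small noise.
X = rng.normal(size=(1000, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + 0.01 * rng.normal(size=1000)

w = np.zeros(3)
eta, B = 0.1, 32
for epoch in range(20):
    idx = rng.permutation(len(X))       # reshuffle examples each epoch
    for start in range(0, len(X), B):
        batch = idx[start:start + B]
        Xb, yb = X[batch], y[batch]
        # Gradient of the mean squared error over this mini-batch only.
        grad = 2 * Xb.T @ (Xb @ w - yb) / len(batch)
        w -= eta * grad
print(w)  # close to w_true despite each step seeing only 32 examples
```

Each update uses a noisy but cheap gradient estimate; averaging over the batch (dividing by `len(batch)`) is what keeps the step size meaningful as B changes.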
Noise from small batches acts like mild regularization; very large batches can require larger learning rates or special tricks (learning rate scaling rules, warmup) to retain generalization quality. The field has many refinements—momentum, RMSprop, Adam—that adapt how past gradients influence the step, but they still sit on top of the same idea: use derivatives of the loss w.r.t. parameters.
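Of those refinements, momentum is the simplest to sketch: a velocity buffer accumulates an exponentially decaying sum of past gradients, so persistent directions build up while oscillating ones cancel. This follows the heavy-ball convention PyTorch's SGD uses (v ← μv + ∇L, θ ← θ − ηv); the quadratic and the values of mu and eta are illustrative:

```python
import numpy as np

# Heavy-ball momentum on the quadratic bowl L(w) = w0**2 + 4*w1**2.
w = np.array([3.0, 2.0])
v = np.zeros(2)
mu, eta = 0.9, 0.05
for _ in range(200):
    grad = np.array([2 * w[0], 8 * w[1]])
    v = mu * v + grad   # decaying sum of past gradients
    w = w - eta * v
print(w)  # both components have shrunk essentially to zero
```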
Local Minima, Saddles, and Plateaus
High-dimensional loss landscapes are hard to visualize. Local minima (points where all directions go uphill) were once feared as show-stoppers; empirically, many deep nets find solutions that generalize even though the surface is non-convex. Saddle points (some directions down, some up) are more problematic in theory because the gradient can be very small even far from a good solution—optimization research studies how noise and curvature help escape.
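A small sketch shows why saddles slow things down (the function and starting point are illustrative): on f(x, y) = x² − y² the origin is a saddle, and gradient descent started near the ridge collapses x quickly but escapes along y only slowly, because the downhill gradient there starts out tiny:

```python
import numpy as np

# f(x, y) = x**2 - y**2: curves up along x, down along y, saddle at origin.
p = np.array([1.0, 1e-6])  # start almost exactly on the ridge
eta = 0.1
for _ in range(60):
    grad = np.array([2 * p[0], -2 * p[1]])  # gradient of f at p
    p = p - eta * grad
print(p)  # x has shrunk to ~0; y has barely begun to escape downhill
```

Gradient noise helps here: a stochastic gradient would perturb y off the ridge much faster than the deterministic update does.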
Plateaus are regions where the gradient is tiny; training can crawl. Good initialization, activation choices (e.g. ReLU vs saturated sigmoids), batch normalization, and adaptive optimizers all affect how often you hit flat or pathological regions. This is why the full training story ties together architecture, loss, and optimization—not any single trick in isolation.
Toy Example: 2D Quadratic in NumPy
Minimize L(w) = w₀² + 4w₁² (an elongated bowl). The gradient is (2w₀, 8w₁). Plain gradient descent with fixed η shrinks both components toward zero.
import numpy as np

w = np.array([3.0, 2.0])
eta = 0.15
for t in range(30):
    grad = np.array([2 * w[0], 8 * w[1]])  # gradient of L(w) = w0**2 + 4*w1**2
    w = w - eta * grad
print("w after 30 steps:", w)
PyTorch: optimizer.step()
Frameworks compute ∇L via automatic differentiation (next tutorial: backpropagation). The optimizer holds learning rate and momentum buffers; after loss.backward(), optimizer.step() applies the update.
import torch
import torch.nn as nn

model = nn.Linear(5, 1)
opt = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)

x = torch.randn(8, 5)
y = torch.randn(8, 1)
loss_fn = nn.MSELoss()

opt.zero_grad()              # clear gradients left over from any previous step
loss = loss_fn(model(x), y)
loss.backward()              # autograd fills p.grad for every parameter
opt.step()                   # apply theta <- theta - lr * grad (with momentum)
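Extending that single step into a training loop with a decaying learning rate is straightforward. This sketch uses `torch.optim.lr_scheduler.StepLR`; the seed, synthetic target, and schedule values are arbitrary choices for illustration:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(5, 1)
opt = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
# StepLR multiplies the learning rate by gamma every step_size epochs;
# warmup and cosine decay are other common schedules.
sched = torch.optim.lr_scheduler.StepLR(opt, step_size=20, gamma=0.5)
loss_fn = nn.MSELoss()

x = torch.randn(64, 5)
y = x.sum(dim=1, keepdim=True)  # learnable target: all weights one, bias zero

for epoch in range(60):
    opt.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    opt.step()    # parameter update at the current learning rate
    sched.step()  # advance the schedule once per epoch
print(loss.item())  # far below the initial loss
```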
Summary
- Gradient descent moves parameters opposite the loss gradient, scaled by learning rate.
- Mini-batch SGD balances noise, speed, and hardware efficiency.
- Non-convex landscapes still train well in practice with good habits and modern optimizers.
- Backprop computes gradients; optimizers consume them—next pages unpack both in depth.
Next: Backpropagation explains how ∂L/∂w is obtained efficiently through the network.