Gradient Descent
Gradient descent is the workhorse of neural network training: repeatedly measure how the loss changes with respect to each parameter (the gradient), then nudge the parameters in the direction that decreases the loss fastest (the negative gradient). The picture is “walking downhill” on a high-dimensional surface defined by the loss, except the surface is noisy, non-convex, and we only see stochastic estimates of the slope.
The Core Update Rule
Let L(θ) be the loss as a function of all parameters θ (weights and biases). The gradient ∇L points in the direction of steepest ascent of L. To reduce loss, we step the opposite way:
θ ← θ − η ∇L(θ)
The positive scalar η is the learning rate. Too large: overshoot, oscillation, divergence. Too small: painfully slow progress, risk of getting stuck in flat regions for a long time. In deep learning, η is one of the first hyperparameters people tune, often alongside batch size and schedule (warmup, decay).
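The learning-rate trade-off can be seen concretely on the one-dimensional quadratic L(w) = w² (gradient 2w), where each update multiplies w by (1 − 2η). The helper function and the specific η values below are illustrative choices, not part of this tutorial:

```python
def descend(eta, steps=20, w0=1.0):
    """Fixed-step gradient descent on L(w) = w**2, whose gradient is 2*w."""
    w = w0
    for _ in range(steps):
        w = w - eta * 2 * w  # each step multiplies w by (1 - 2*eta)
    return w

print(descend(0.05))  # small eta: steady but slow shrink toward 0
print(descend(0.45))  # well-chosen eta: converges quickly
print(descend(1.10))  # too large: |1 - 2*eta| > 1, iterates grow without bound
```

With η = 1.10 the multiplier is −1.2, so the iterate overshoots past the minimum and grows in magnitude every step, which is exactly the divergence regime described above.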
For a smooth convex function in low dimensions, with a well-chosen rate, gradient descent can converge to the global minimum. Neural loss surfaces are not convex in general; in practice we still use the same local linear model (“linearize the loss around the current θ”) because it works remarkably well at scale.
Batch, Mini-Batch, and Stochastic
Batch gradient descent uses the gradient of the loss averaged over all training examples at each step. That gives a faithful direction but is expensive when the dataset is huge, since every example must be scanned for every update.
Stochastic gradient descent (SGD) originally meant using one example per step: noisy but fast per iteration and can help escape shallow local features. In modern usage, mini-batch SGD is standard: each step uses a mini-batch of B examples (e.g. 32–256). The gradient is averaged over the batch, trading noise against compute efficiency and hardware utilization (GPUs like contiguous matrix work).
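A minimal NumPy sketch of mini-batch SGD on linear regression makes the loop structure concrete; the synthetic data, η, and B = 32 here are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic regression data: y = X @ w_true + small noise.
X = rng.normal(size=(1000, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + 0.01 * rng.normal(size=1000)

w = np.zeros(3)
eta, B = 0.1, 32
for epoch in range(20):
    idx = rng.permutation(len(X))       # reshuffle examples each epoch
    for start in range(0, len(X), B):
        batch = idx[start:start + B]
        Xb, yb = X[batch], y[batch]
        # Gradient of the mean squared error over this mini-batch only.
        grad = 2 * Xb.T @ (Xb @ w - yb) / len(batch)
        w -= eta * grad
print(w)  # close to w_true despite each step seeing only 32 examples
```

Each update uses a noisy but cheap gradient estimate; averaging over the batch (dividing by `len(batch)`) is what keeps the step size meaningful as B changes.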
Noise from small batches acts like mild regularization; very large batches can require larger learning rates or special tricks (learning rate scaling rules, warmup) to retain generalization quality. The field has many refinements—momentum, RMSprop, Adam—that adapt how past gradients influence the step, but they still sit on top of the same idea: use derivatives of the loss w.r.t. parameters.
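Of those refinements, momentum is the simplest to sketch: a velocity buffer accumulates an exponentially decaying sum of past gradients, so persistent directions build up while oscillating ones cancel. This follows the heavy-ball convention PyTorch's SGD uses (v ← μv + ∇L, θ ← θ − ηv); the quadratic and the values of mu and eta are illustrative:

```python
import numpy as np

# Heavy-ball momentum on the quadratic bowl L(w) = w0**2 + 4*w1**2.
w = np.array([3.0, 2.0])
v = np.zeros(2)
mu, eta = 0.9, 0.05
for _ in range(200):
    grad = np.array([2 * w[0], 8 * w[1]])
    v = mu * v + grad   # decaying sum of past gradients
    w = w - eta * v
print(w)  # both components have shrunk essentially to zero
```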
Local Minima, Saddles, and Plateaus
High-dimensional loss landscapes are hard to visualize. Local minima (points where all directions go uphill) were once feared as show-stoppers; empirically, many deep nets find solutions that generalize even though the surface is non-convex. Saddle points (some directions down, some up) are more problematic in theory because the gradient can be very small even far from a good solution—optimization research studies how noise and curvature help escape.
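A small sketch shows why saddles slow things down (the function and starting point are illustrative): on f(x, y) = x² − y² the origin is a saddle, and gradient descent started near the ridge collapses x quickly but escapes along y only slowly, because the downhill gradient there starts out tiny:

```python
import numpy as np

# f(x, y) = x**2 - y**2: curves up along x, down along y, saddle at origin.
p = np.array([1.0, 1e-6])  # start almost exactly on the ridge
eta = 0.1
for _ in range(60):
    grad = np.array([2 * p[0], -2 * p[1]])  # gradient of f at p
    p = p - eta * grad
print(p)  # x has shrunk to ~0; y has barely begun to escape downhill
```

Gradient noise helps here: a stochastic gradient would perturb y off the ridge much faster than the deterministic update does.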
Plateaus are regions where the gradient is tiny; training can crawl. Good initialization, activation choices (e.g. ReLU vs saturated sigmoids), batch normalization, and adaptive optimizers all affect how often you hit flat or pathological regions. This is why the full training story ties together architecture, loss, and optimization—not any single trick in isolation.
Toy Example: 2D Quadratic in NumPy
Minimize L(w) = w₀² + 4w₁² (an elongated bowl). The gradient is (2w₀, 8w₁). Plain gradient descent with fixed η shrinks both components toward zero.
import numpy as np

w = np.array([3.0, 2.0])
eta = 0.15
for t in range(30):
    grad = np.array([2 * w[0], 8 * w[1]])  # gradient of L(w) = w0**2 + 4*w1**2
    w = w - eta * grad
print("w after 30 steps:", w)
PyTorch: optimizer.step()
Frameworks compute ∇L via automatic differentiation (next tutorial: backpropagation). The optimizer holds learning rate and momentum buffers; after loss.backward(), optimizer.step() applies the update.
import torch
import torch.nn as nn

model = nn.Linear(5, 1)
opt = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)

x = torch.randn(8, 5)
y = torch.randn(8, 1)
loss_fn = nn.MSELoss()

opt.zero_grad()              # clear gradients left over from any previous step
loss = loss_fn(model(x), y)
loss.backward()              # autograd fills p.grad for every parameter
opt.step()                   # apply theta <- theta - lr * grad (with momentum)
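Extending that single step into a training loop with a decaying learning rate is straightforward. This sketch uses `torch.optim.lr_scheduler.StepLR`; the seed, synthetic target, and schedule values are arbitrary choices for illustration:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(5, 1)
opt = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
# StepLR multiplies the learning rate by gamma every step_size epochs;
# warmup and cosine decay are other common schedules.
sched = torch.optim.lr_scheduler.StepLR(opt, step_size=20, gamma=0.5)
loss_fn = nn.MSELoss()

x = torch.randn(64, 5)
y = x.sum(dim=1, keepdim=True)  # learnable target: all weights one, bias zero

for epoch in range(60):
    opt.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    opt.step()    # parameter update at the current learning rate
    sched.step()  # advance the schedule once per epoch
print(loss.item())  # far below the initial loss
```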
Summary
- Gradient descent moves parameters opposite the loss gradient, scaled by learning rate.
- Mini-batch SGD balances noise, speed, and hardware efficiency.
- Non-convex landscapes still train well in practice with good habits and modern optimizers.
- Backprop computes gradients; optimizers consume them—next pages unpack both in depth.
Next: Backpropagation explains how ∂L/∂w is obtained efficiently through the network.