
Optimizers: Driving Neural Network Training

Optimizers implement the parameter update rules that minimize the loss function. This page is a mathematical and practical reference covering everything from vanilla SGD through adaptive methods like Adam to recent designs like Lion.

SGD: Vanilla, Momentum, NAG
Adaptive: AdaGrad, RMSprop
Adam family: AdamW, Nadam, AMSGrad
Modern: Lion, Adafactor

What is an Optimizer?

Optimizers are algorithms that update model parameters (weights) to minimize the loss function. They determine how to move in the gradient direction — how fast, with what momentum, and with what adaptive scaling. The choice of optimizer critically affects training speed, stability, and final performance.

General update rule: θₜ₊₁ = θₜ - lr · f(∇L(θₜ), history)

Optimizers incorporate gradient history, adaptive learning rates, and momentum.

Gradient Descent Variants

Batch GD

Uses the entire dataset to compute each gradient: θ = θ - lr · ∇L(θ; all data)

Slow but stable. Not feasible for large datasets.

Stochastic GD (SGD)

θ = θ - lr · ∇L(θ; xᵢ, yᵢ)

Updates on a single sample at a time. High variance, but enables online learning.

Mini-batch GD

θ = θ - lr · ∇L(θ; batch)

Balanced and the most common choice. Typical batch sizes: 32-512.

Mini-batch SGD from scratch
import numpy as np

def sgd_update(params, grads, lr=0.01):
    """Simple SGD update"""
    for param, grad in zip(params, grads):
        param -= lr * grad
    return params

Momentum & Nesterov Accelerated Gradient

SGD with Momentum

vₜ = βvₜ₋₁ + (1-β)∇L(θₜ)
θₜ₊₁ = θₜ - lr · vₜ

Accumulates velocity to overcome ravines and accelerate convergence. β typically 0.9.

def momentum_update(params, grads, v, lr=0.01, beta=0.9):
    for i, (p, g) in enumerate(zip(params, grads)):
        v[i] = beta * v[i] + (1 - beta) * g
        p -= lr * v[i]
    return params, v
Nesterov Accelerated Gradient (NAG)

vₜ = βvₜ₋₁ + (1-β)∇L(θₜ - lr·βvₜ₋₁)
θₜ₊₁ = θₜ - lr · vₜ

Looks ahead at the approximate future position. Often faster and more stable than standard momentum.

Intuition: Momentum is like a ball rolling downhill – it accumulates speed. Nesterov is like a smart ball that looks ahead before updating.
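A minimal NumPy sketch of NAG in the look-ahead form above; `grad_fn(i, theta)` is an assumed callable that returns the gradient of the loss with respect to parameter `i`, evaluated at `theta`:

```python
import numpy as np

def nag_update(params, grad_fn, v, lr=0.01, beta=0.9):
    """Nesterov update: evaluate the gradient at the look-ahead point."""
    for i, p in enumerate(params):
        lookahead = p - lr * beta * v[i]   # peek ahead along the velocity
        g = grad_fn(i, lookahead)          # gradient at the look-ahead point
        v[i] = beta * v[i] + (1 - beta) * g
        p -= lr * v[i]
    return params, v
```

Compared with the momentum code above, the only change is where the gradient is evaluated.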

Adaptive Learning Rate Methods

AdaGrad

Gₜ = Gₜ₋₁ + (∇L(θₜ))²
θₜ₊₁ = θₜ - lr/√(Gₜ + ε) · ∇L(θₜ)

Adapts per-parameter learning rates. Good for sparse data. Learning rate decays monotonically.

Weakness: the effective learning rate shrinks monotonically and eventually becomes vanishingly small.
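A minimal NumPy sketch of the AdaGrad update, in the same style as the other update functions in this section:

```python
import numpy as np

def adagrad_update(params, grads, G, lr=0.01, eps=1e-8):
    """AdaGrad: accumulate squared gradients, scale each parameter's step."""
    for i, (p, g) in enumerate(zip(params, grads)):
        G[i] += g ** 2                       # monotonically growing accumulator
        p -= lr * g / (np.sqrt(G[i]) + eps)  # per-parameter effective LR shrinks
    return params, G
```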

RMSprop

E[g²]ₜ = βE[g²]ₜ₋₁ + (1-β)(∇L)²
θₜ₊₁ = θₜ - lr/√(E[g²]ₜ + ε) · ∇L

Unpublished, but widely used. Fixes AdaGrad's decaying LR problem. β typically 0.9.

RMSprop implementation
def rmsprop_update(params, grads, cache, lr=0.001, beta=0.9, eps=1e-8):
    for i, (p, g) in enumerate(zip(params, grads)):
        cache[i] = beta * cache[i] + (1 - beta) * g**2
        p -= lr * g / (np.sqrt(cache[i]) + eps)
    return params, cache

Adam & The Adaptive Moment Family

Adam (Adaptive Moment Estimation)

mₜ = β₁mₜ₋₁ + (1-β₁)∇L
vₜ = β₂vₜ₋₁ + (1-β₂)(∇L)²
m̂ₜ = mₜ/(1-β₁ᵗ),  v̂ₜ = vₜ/(1-β₂ᵗ)
θₜ₊₁ = θₜ - lr · m̂ₜ/(√v̂ₜ + ε)

Combines momentum (first moment) and RMSprop-style scaling (second moment). The bias-corrected estimates m̂ₜ and v̂ₜ offset the zero initialization of m and v. Typical defaults: β₁=0.9, β₂=0.999, ε=1e-8 (1e-7 in Keras).

Default optimizer for most tasks

AdamW

θₜ₊₁ = θₜ - lr · (m̂ₜ/(√v̂ₜ+ε) + λθₜ)

Decoupled weight decay. Improves generalization over Adam. Recommended over Adam.

# PyTorch: torch.optim.AdamW
# TensorFlow: tf.keras.optimizers.AdamW
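For illustration, a NumPy sketch of one AdamW step: it is the Adam update with the decay term λθ added outside the adaptive scaling. The `wd=0.01` default is an assumption matching common practice, not a universal value:

```python
import numpy as np

def adamw_step(param, grad, m, v, t, lr=1e-3, b1=0.9, b2=0.999,
               eps=1e-8, wd=0.01):
    """One AdamW step: the Adam update plus decoupled weight decay."""
    m = b1 * m + (1 - b1) * grad
    v = b2 * v + (1 - b2) * grad ** 2
    m_hat = m / (1 - b1 ** t)            # bias correction
    v_hat = v / (1 - b2 ** t)
    # decay is applied directly to the weights, not mixed into the gradient
    param -= lr * (m_hat / (np.sqrt(v_hat) + eps) + wd * param)
    return param, m, v
```

Folding λθ into the gradient instead (plain L2 regularization) would let the adaptive denominator rescale the decay, which is exactly what AdamW avoids.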
Nadam

Adam + Nesterov momentum. Slightly faster convergence.

AMSGrad

Variant that uses maximum of past squared gradients. Addresses convergence issues.

AdaBelief

Stepsize scaled by belief in observed gradient. More stable.

Adam implementation intuition
# Simplified Adam update (conceptual)
def adam_step(param, grad, m, v, t, lr=0.001, b1=0.9, b2=0.999):
    m = b1 * m + (1 - b1) * grad
    v = b2 * v + (1 - b2) * grad**2
    m_hat = m / (1 - b1**t)
    v_hat = v / (1 - b2**t)
    param -= lr * m_hat / (np.sqrt(v_hat) + 1e-7)
    return param, m, v

Modern & Emerging Optimizers

Lion (EvoLved Sign Momentum)

cₜ = β₁mₜ₋₁ + (1-β₁)∇L
θₜ₊₁ = θₜ - lr · sign(cₜ)
mₜ = β₂mₜ₋₁ + (1-β₂)∇L

Discovered by symbolic program search. Stores only a single momentum buffer, so it is more memory-efficient than Adam. Used in several recent Google models.
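A NumPy sketch of one Lion step under the update rule above (β₁=0.9, β₂=0.99 are the paper's reported defaults):

```python
import numpy as np

def lion_step(param, grad, m, lr=1e-4, b1=0.9, b2=0.99):
    """One Lion step: sign of an interpolated momentum, then an EMA update."""
    c = b1 * m + (1 - b1) * grad   # interpolate momentum and current gradient
    param -= lr * np.sign(c)       # only the sign is used: uniform step size
    m = b2 * m + (1 - b2) * grad   # momentum EMA with a second beta
    return param, m
```

Because every coordinate moves by exactly ±lr, Lion is typically run with a smaller learning rate than Adam.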

Adafactor

Memory-efficient Adam for large models. Factorizes second moment estimates. Used in T5.
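The core memory trick can be sketched in NumPy: the n×m matrix of squared-gradient statistics is replaced by its row and column sums, a rank-1 approximation that stores n + m numbers instead of n·m and is exact for rank-1 inputs:

```python
import numpy as np

def factored_second_moment(V):
    """Adafactor-style rank-1 approximation of a second-moment matrix V.

    In practice only R and C would be stored; the full reconstruction
    here is just for illustration."""
    R = V.sum(axis=1, keepdims=True)   # row sums, shape (n, 1)
    C = V.sum(axis=0, keepdims=True)   # column sums, shape (1, m)
    return R @ C / V.sum()             # rank-1 reconstruction
```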

LARS & LAMB

LARS (Layer-wise Adaptive Rate Scaling) normalizes each layer's update by the ratio of its weight norm to its gradient norm; LAMB applies the same idea on top of Adam. Both target large-batch training (ResNet with LARS, BERT with LAMB on TPUs).

Learning Rate Scheduling

Even with adaptive optimizers, scheduling the learning rate improves convergence.

Step Decay

Drop LR by factor every few epochs.

# TF: tf.keras.optimizers.schedules.ExponentialDecay (staircase=True for discrete drops)
# PyTorch: torch.optim.lr_scheduler.StepLR
Cosine Annealing

Smooth cyclic decay. Often with warm restarts.

tf.keras.optimizers.schedules.CosineDecay
Warmup

Linear increase from 0 to initial LR. Stabilizes large model training.

ReduceLROnPlateau

Reduce LR when validation loss plateaus.

Best practice: Use learning rate warmup for Transformers and very deep networks. Cosine decay often outperforms step decay.
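The warmup-plus-cosine recipe can be sketched as a plain schedule function; the step counts below are illustrative defaults, not recommendations:

```python
import math

def warmup_cosine_lr(step, base_lr=1e-3, warmup_steps=1000, total_steps=10000):
    """Linear warmup from 0 to base_lr, then cosine decay to 0."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return base_lr * 0.5 * (1 + math.cos(math.pi * progress))
```

Frameworks provide equivalents (e.g. `CosineDecay` with `warmup_target` in Keras), but a hand-rolled function like this is easy to plot and debug.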

Optimizer Selection Guide

Optimizer    | Adaptive | Momentum       | When to use                              | Memory
SGD          | no       | no             | Simple models, CV (with momentum)        | Low
SGD+Momentum | no       | yes            | Classic CNNs, needs LR tuning            | Low
RMSprop      | yes      | no             | RNNs, online learning                    | Medium
Adam         | yes      | yes            | Default for most tasks                   | Medium
AdamW        | yes      | yes            | Transformers, NLP, better generalization | Medium
Nadam        | yes      | yes (Nesterov) | Slightly faster Adam                     | Medium
Lion         | no       | yes            | Memory-efficient, vision tasks           | Low
Adafactor    | yes      | no             | Giant models (LLMs)                      | Very low

Quick Selection Rules:

  • Start with AdamW – works well out-of-the-box.
  • For NLP / Transformers: AdamW with cosine decay + warmup.
  • For Computer Vision: SGD with momentum can outperform Adam (requires tuning).
  • For large models (>1B params): Adafactor or Lion to save memory.
  • For sparse data: AdaGrad or Adam.

Optimizers in TensorFlow & PyTorch

TensorFlow / Keras
import tensorflow as tf

# Common optimizers
model.compile(optimizer='sgd', ...)
model.compile(optimizer='adam', ...)
model.compile(optimizer=tf.keras.optimizers.AdamW(learning_rate=1e-4, weight_decay=1e-4))

# Learning rate schedule
lr_schedule = tf.keras.optimizers.schedules.CosineDecay(1e-3, decay_steps=10000)
optimizer = tf.keras.optimizers.Adam(lr_schedule)

# Custom optimizer loop
@tf.function
def train_step(x, y):
    with tf.GradientTape() as tape:
        y_pred = model(x)
        loss = loss_fn(y, y_pred)
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
PyTorch
import torch.optim as optim

# Optimizers
optimizer = optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
optimizer = optim.Adam(model.parameters(), lr=0.001)
optimizer = optim.AdamW(model.parameters(), lr=0.001, weight_decay=0.01)

# Learning rate schedulers
scheduler = optim.lr_scheduler.StepLR(optimizer, step_size=30, gamma=0.1)
scheduler = optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=50)
scheduler = optim.lr_scheduler.ReduceLROnPlateau(optimizer, patience=5)  # step with scheduler.step(val_loss)

# Training loop
for epoch in range(epochs):
    for x, y in dataloader:
        optimizer.zero_grad()
        y_pred = model(x)
        loss = loss_fn(y_pred, y)
        loss.backward()
        optimizer.step()
    scheduler.step()

Optimizer Hyperparameter Tuning

Learning Rate: Most critical. Defaults: Adam 1e-3, SGD 1e-2. Use LR range test.
Batch Size: Affects gradient noise. Tune together with LR.
Weight Decay: AdamW: 0.01-0.1, SGD: 1e-4. Prevents overfitting.

LR Range Test: Increase LR exponentially each batch, plot loss. Optimal LR is just before loss explodes.
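The LR range test can be sketched generically; `train_step(lr)` is an assumed callable that runs one mini-batch at the given LR and returns its loss, and the 4x divergence threshold is an arbitrary stopping choice:

```python
import numpy as np

def lr_range_test(train_step, lr_min=1e-6, lr_max=1.0, num_steps=100):
    """Sweep the LR exponentially; return (lrs, losses) for plotting."""
    lrs = np.geomspace(lr_min, lr_max, num_steps)
    losses = []
    for lr in lrs:
        loss = train_step(lr)
        losses.append(loss)
        if loss > 4 * min(losses):   # stop once the loss explodes
            break
    return lrs[:len(losses)], losses
```

Plot `losses` against `lrs` on a log axis and pick an LR just below the point where the curve turns upward.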

Optimizer Pitfalls & Solutions

⚠️ "Adam generalizes worse than SGD"? Partly a myth: AdamW closes most of the gap, though a well-tuned SGD with momentum can still win on some vision benchmarks.
⚠️ Loss not decreasing: the LR is too high or too low, gradients need clipping, or there is a bug in the model.
✅ Gradient clipping: essential for RNNs and Transformers; clipping the global norm to 1.0 is a common default.
✅ Debug: monitor per-layer gradient norms to catch vanishing or exploding gradients early.
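Global-norm clipping can be sketched in NumPy (frameworks provide this as `torch.nn.utils.clip_grad_norm_` and `tf.clip_by_global_norm`):

```python
import numpy as np

def clip_by_global_norm(grads, max_norm=1.0):
    """Rescale all gradients jointly so their global L2 norm <= max_norm."""
    total = np.sqrt(sum(float(np.sum(g ** 2)) for g in grads))
    if total > max_norm:
        scale = max_norm / total        # one shared scale preserves direction
        grads = [g * scale for g in grads]
    return grads, total
```

Logging the returned `total` per step doubles as the gradient-norm monitoring suggested above.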

Optimizer Cheatsheet

SGD+M: CV
Adam: Default
AdamW: Best overall
RMSprop: RNNs
Lion: Memory-efficient
Adafactor: LLMs
Nadam: Slightly faster Adam
AdaGrad: Sparse data