
Optimizers: Driving Neural Network Training

Optimizers implement the parameter update rules that minimize the loss function. This page is a mathematical and practical reference covering everything from vanilla SGD through adaptive methods like Adam to recent designs like Lion.

SGD: Vanilla, Momentum, NAG
Adaptive: AdaGrad, RMSprop
Adam family: AdamW, Nadam, AMSGrad
Modern: Lion, Adafactor

What is an Optimizer?

Optimizers are algorithms that update model parameters (weights) to minimize the loss function. They determine how to move in the gradient direction — how fast, with what momentum, and with what adaptive scaling. The choice of optimizer critically affects training speed, stability, and final performance.

General update rule: θₜ₊₁ = θₜ - lr · f(∇L(θₜ), history)

Optimizers incorporate gradient history, adaptive learning rates, and momentum.

Gradient Descent Variants

Batch GD

Uses the entire dataset to compute each gradient: θ = θ - lr · ∇L(θ; all data)

Slow but stable. Not feasible for large datasets.

Stochastic GD (SGD)

θ = θ - lr · ∇L(θ; xᵢ, yᵢ)

Updates on a single sample at a time. High variance, but enables online learning.

Mini-batch GD

θ = θ - lr · ∇L(θ; batch)

Balanced and the most common choice. Typical batch sizes: 32-512.

Mini-batch SGD from scratch
import numpy as np

def sgd_update(params, grads, lr=0.01):
    """Simple SGD update"""
    for param, grad in zip(params, grads):
        param -= lr * grad
    return params

Momentum & Nesterov Accelerated Gradient

SGD with Momentum

vₜ = βvₜ₋₁ + (1-β)∇L(θₜ)
θₜ₊₁ = θₜ - lr · vₜ

Accumulates velocity to overcome ravines and accelerate convergence. β typically 0.9.

def momentum_update(params, grads, v, lr=0.01, beta=0.9):
    for i, (p, g) in enumerate(zip(params, grads)):
        v[i] = beta * v[i] + (1 - beta) * g
        p -= lr * v[i]
    return params, v
Nesterov Accelerated Gradient (NAG)

vₜ = βvₜ₋₁ + (1-β)∇L(θₜ - lr·βvₜ₋₁)
θₜ₊₁ = θₜ - lr · vₜ

Looks ahead at the approximate future position. Often faster and more stable than standard momentum.

Intuition: Momentum is like a ball rolling downhill – it accumulates speed. Nesterov is like a smart ball that looks ahead before updating.
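A minimal NumPy sketch of NAG in the look-ahead form above; `grad_fn(i, theta)` is an assumed callable that returns the gradient of the loss with respect to parameter `i`, evaluated at `theta`:

```python
import numpy as np

def nag_update(params, grad_fn, v, lr=0.01, beta=0.9):
    """Nesterov update: evaluate the gradient at the look-ahead point."""
    for i, p in enumerate(params):
        lookahead = p - lr * beta * v[i]   # peek ahead along the velocity
        g = grad_fn(i, lookahead)          # gradient at the look-ahead point
        v[i] = beta * v[i] + (1 - beta) * g
        p -= lr * v[i]
    return params, v
```

Compared with the momentum code above, the only change is where the gradient is evaluated.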

Adaptive Learning Rate Methods

AdaGrad

Gₜ = Gₜ₋₁ + (∇L(θₜ))²
θₜ₊₁ = θₜ - lr/√(Gₜ + ε) · ∇L(θₜ)

Adapts per-parameter learning rates. Good for sparse data. Learning rate decays monotonically.

Weakness: the effective learning rate shrinks monotonically and eventually becomes vanishingly small.
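A minimal NumPy sketch of the AdaGrad update, in the same style as the other update functions in this section:

```python
import numpy as np

def adagrad_update(params, grads, G, lr=0.01, eps=1e-8):
    """AdaGrad: accumulate squared gradients, scale each parameter's step."""
    for i, (p, g) in enumerate(zip(params, grads)):
        G[i] += g ** 2                       # monotonically growing accumulator
        p -= lr * g / (np.sqrt(G[i]) + eps)  # per-parameter effective LR shrinks
    return params, G
```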

RMSprop

E[g²]ₜ = βE[g²]ₜ₋₁ + (1-β)(∇L)²
θₜ₊₁ = θₜ - lr/√(E[g²]ₜ + ε) · ∇L

Unpublished, but widely used. Fixes AdaGrad's decaying LR problem. β typically 0.9.

RMSprop implementation
def rmsprop_update(params, grads, cache, lr=0.001, beta=0.9, eps=1e-8):
    for i, (p, g) in enumerate(zip(params, grads)):
        cache[i] = beta * cache[i] + (1 - beta) * g**2
        p -= lr * g / (np.sqrt(cache[i]) + eps)
    return params, cache

Adam & The Adaptive Moment Family

Adam (Adaptive Moment Estimation)

mₜ = β₁mₜ₋₁ + (1-β₁)∇L
vₜ = β₂vₜ₋₁ + (1-β₂)(∇L)²
m̂ₜ = mₜ/(1-β₁ᵗ),  v̂ₜ = vₜ/(1-β₂ᵗ)
θₜ₊₁ = θₜ - lr · m̂ₜ/(√v̂ₜ + ε)

Combines momentum (first moment) and RMSprop-style scaling (second moment). The bias-corrected estimates m̂ₜ and v̂ₜ offset the zero initialization of m and v. Typical defaults: β₁=0.9, β₂=0.999, ε=1e-8 (1e-7 in Keras).

Default optimizer for most tasks

AdamW

θₜ₊₁ = θₜ - lr · (m̂ₜ/(√v̂ₜ+ε) + λθₜ)

Decoupled weight decay. Improves generalization over Adam. Recommended over Adam.

# PyTorch: torch.optim.AdamW
# TensorFlow: tf.keras.optimizers.AdamW
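For illustration, a NumPy sketch of one AdamW step: it is the Adam update with the decay term λθ added outside the adaptive scaling. The `wd=0.01` default is an assumption matching common practice, not a universal value:

```python
import numpy as np

def adamw_step(param, grad, m, v, t, lr=1e-3, b1=0.9, b2=0.999,
               eps=1e-8, wd=0.01):
    """One AdamW step: the Adam update plus decoupled weight decay."""
    m = b1 * m + (1 - b1) * grad
    v = b2 * v + (1 - b2) * grad ** 2
    m_hat = m / (1 - b1 ** t)            # bias correction
    v_hat = v / (1 - b2 ** t)
    # decay is applied directly to the weights, not mixed into the gradient
    param -= lr * (m_hat / (np.sqrt(v_hat) + eps) + wd * param)
    return param, m, v
```

Folding λθ into the gradient instead (plain L2 regularization) would let the adaptive denominator rescale the decay, which is exactly what AdamW avoids.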
Nadam

Adam + Nesterov momentum. Slightly faster convergence.

AMSGrad

Variant that uses maximum of past squared gradients. Addresses convergence issues.

AdaBelief

Stepsize scaled by belief in observed gradient. More stable.

Adam implementation intuition
# Simplified Adam update (conceptual)
def adam_step(param, grad, m, v, t, lr=0.001, b1=0.9, b2=0.999):
    m = b1 * m + (1 - b1) * grad
    v = b2 * v + (1 - b2) * grad**2
    m_hat = m / (1 - b1**t)
    v_hat = v / (1 - b2**t)
    param -= lr * m_hat / (np.sqrt(v_hat) + 1e-7)
    return param, m, v

Modern & Emerging Optimizers

Lion (EvoLved Sign Momentum)

cₜ = β₁mₜ₋₁ + (1-β₁)∇L
θₜ₊₁ = θₜ - lr · sign(cₜ)
mₜ = β₂mₜ₋₁ + (1-β₂)∇L

Discovered by symbolic program search. Stores only a single momentum buffer, so it is more memory-efficient than Adam. Used in several recent Google models.
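A NumPy sketch of one Lion step under the update rule above (β₁=0.9, β₂=0.99 are the paper's reported defaults):

```python
import numpy as np

def lion_step(param, grad, m, lr=1e-4, b1=0.9, b2=0.99):
    """One Lion step: sign of an interpolated momentum, then an EMA update."""
    c = b1 * m + (1 - b1) * grad   # interpolate momentum and current gradient
    param -= lr * np.sign(c)       # only the sign is used: uniform step size
    m = b2 * m + (1 - b2) * grad   # momentum EMA with a second beta
    return param, m
```

Because every coordinate moves by exactly ±lr, Lion is typically run with a smaller learning rate than Adam.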

Adafactor

Memory-efficient Adam for large models. Factorizes second moment estimates. Used in T5.
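The core memory trick can be sketched in NumPy: the n×m matrix of squared-gradient statistics is replaced by its row and column sums, a rank-1 approximation that stores n + m numbers instead of n·m and is exact for rank-1 inputs:

```python
import numpy as np

def factored_second_moment(V):
    """Adafactor-style rank-1 approximation of a second-moment matrix V.

    In practice only R and C would be stored; the full reconstruction
    here is just for illustration."""
    R = V.sum(axis=1, keepdims=True)   # row sums, shape (n, 1)
    C = V.sum(axis=0, keepdims=True)   # column sums, shape (1, m)
    return R @ C / V.sum()             # rank-1 reconstruction
```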

LARS & LAMB

LARS (Layer-wise Adaptive Rate Scaling) normalizes each layer's update by the ratio of its weight norm to its gradient norm; LAMB applies the same idea on top of Adam. Both target large-batch training (ResNet with LARS, BERT with LAMB on TPUs).

Learning Rate Scheduling

Even with adaptive optimizers, scheduling the learning rate improves convergence.

Step Decay

Drop LR by factor every few epochs.

# TF: tf.keras.optimizers.schedules.ExponentialDecay (staircase=True for discrete drops)
# PyTorch: torch.optim.lr_scheduler.StepLR
Cosine Annealing

Smooth cyclic decay. Often with warm restarts.

tf.keras.optimizers.schedules.CosineDecay
Warmup

Linear increase from 0 to initial LR. Stabilizes large model training.

ReduceLROnPlateau

Reduce LR when validation loss plateaus.

Best practice: Use learning rate warmup for Transformers and very deep networks. Cosine decay often outperforms step decay.
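The warmup-plus-cosine recipe can be sketched as a plain schedule function; the step counts below are illustrative defaults, not recommendations:

```python
import math

def warmup_cosine_lr(step, base_lr=1e-3, warmup_steps=1000, total_steps=10000):
    """Linear warmup from 0 to base_lr, then cosine decay to 0."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return base_lr * 0.5 * (1 + math.cos(math.pi * progress))
```

Frameworks provide equivalents (e.g. `CosineDecay` with `warmup_target` in Keras), but a hand-rolled function like this is easy to plot and debug.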

Optimizer Selection Guide

Optimizer    | Adaptive | Momentum       | When to use                              | Memory
SGD          | no       | no             | Simple models, CV (with momentum)        | Low
SGD+Momentum | no       | yes            | Classic CNNs, needs LR tuning            | Low
RMSprop      | yes      | no             | RNNs, online learning                    | Medium
Adam         | yes      | yes            | Default for most tasks                   | Medium
AdamW        | yes      | yes            | Transformers, NLP, better generalization | Medium
Nadam        | yes      | yes (Nesterov) | Slightly faster Adam                     | Medium
Lion         | no       | yes            | Memory-efficient, vision tasks           | Low
Adafactor    | yes      | no             | Giant models (LLMs)                      | Very low

Quick Selection Rules:

  • Start with AdamW – works well out-of-the-box.
  • For NLP / Transformers: AdamW with cosine decay + warmup.
  • For Computer Vision: SGD with momentum can outperform Adam (requires tuning).
  • For large models (>1B params): Adafactor or Lion to save memory.
  • For sparse data: AdaGrad or Adam.

Optimizers in TensorFlow & PyTorch

TensorFlow / Keras
import tensorflow as tf

# Common optimizers
model.compile(optimizer='sgd', ...)
model.compile(optimizer='adam', ...)
model.compile(optimizer=tf.keras.optimizers.AdamW(learning_rate=1e-4, weight_decay=1e-4))

# Learning rate schedule
lr_schedule = tf.keras.optimizers.schedules.CosineDecay(1e-3, decay_steps=10000)
optimizer = tf.keras.optimizers.Adam(lr_schedule)

# Custom optimizer loop
@tf.function
def train_step(x, y):
    with tf.GradientTape() as tape:
        y_pred = model(x)
        loss = loss_fn(y, y_pred)
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
PyTorch
import torch.optim as optim

# Optimizers
optimizer = optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
optimizer = optim.Adam(model.parameters(), lr=0.001)
optimizer = optim.AdamW(model.parameters(), lr=0.001, weight_decay=0.01)

# Learning rate schedulers
scheduler = optim.lr_scheduler.StepLR(optimizer, step_size=30, gamma=0.1)
scheduler = optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=50)
scheduler = optim.lr_scheduler.ReduceLROnPlateau(optimizer, patience=5)  # step with scheduler.step(val_loss)

# Training loop
for epoch in range(epochs):
    for x, y in dataloader:
        optimizer.zero_grad()
        y_pred = model(x)
        loss = loss_fn(y_pred, y)
        loss.backward()
        optimizer.step()
    scheduler.step()

Optimizer Hyperparameter Tuning

Learning Rate: Most critical. Defaults: Adam 1e-3, SGD 1e-2. Use LR range test.
Batch Size: Affects gradient noise. Tune together with LR.
Weight Decay: AdamW: 0.01-0.1, SGD: 1e-4. Prevents overfitting.

LR Range Test: Increase LR exponentially each batch, plot loss. Optimal LR is just before loss explodes.
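The LR range test can be sketched generically; `train_step(lr)` is an assumed callable that runs one mini-batch at the given LR and returns its loss, and the 4x divergence threshold is an arbitrary stopping choice:

```python
import numpy as np

def lr_range_test(train_step, lr_min=1e-6, lr_max=1.0, num_steps=100):
    """Sweep the LR exponentially; return (lrs, losses) for plotting."""
    lrs = np.geomspace(lr_min, lr_max, num_steps)
    losses = []
    for lr in lrs:
        loss = train_step(lr)
        losses.append(loss)
        if loss > 4 * min(losses):   # stop once the loss explodes
            break
    return lrs[:len(losses)], losses
```

Plot `losses` against `lrs` on a log axis and pick an LR just below the point where the curve turns upward.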

Optimizer Pitfalls & Solutions

⚠️ "Adam generalizes worse than SGD"? Partly a myth: AdamW closes most of the gap, though a well-tuned SGD with momentum can still win on some vision benchmarks.
⚠️ Loss not decreasing: the LR is too high or too low, gradients need clipping, or there is a bug in the model.
✅ Gradient clipping: essential for RNNs and Transformers; clipping the global norm to 1.0 is a common default.
✅ Debug: monitor per-layer gradient norms to catch vanishing or exploding gradients early.
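Global-norm clipping can be sketched in NumPy (frameworks provide this as `torch.nn.utils.clip_grad_norm_` and `tf.clip_by_global_norm`):

```python
import numpy as np

def clip_by_global_norm(grads, max_norm=1.0):
    """Rescale all gradients jointly so their global L2 norm <= max_norm."""
    total = np.sqrt(sum(float(np.sum(g ** 2)) for g in grads))
    if total > max_norm:
        scale = max_norm / total        # one shared scale preserves direction
        grads = [g * scale for g in grads]
    return grads, total
```

Logging the returned `total` per step doubles as the gradient-norm monitoring suggested above.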

Optimizer Cheatsheet

SGD+M: CV
Adam: Default
AdamW: Best overall
RMSprop: RNNs
Lion: Memory-efficient
Adafactor: LLMs
Nadam: Slightly faster Adam
AdaGrad: Sparse data