
Learning Rate Schedules

A fixed learning rate for all epochs is often suboptimal. Schedules reduce η over time so the optimizer takes large steps early (fast progress) and small steps late (fine convergence). Warmup ramps η up at the start of training—common in transformers and large-batch vision—to avoid instability when gradients are noisy. The right schedule interacts with batch size, optimizer, and dataset size; treat reported hyperparameters as a starting point, not a law.


Why Change the Learning Rate?

Early in training, weights are far from a good basin; a moderate-to-large η helps escape flat or unhelpful regions. Later, aggressive updates make the weights oscillate around a minimum instead of settling into it. Step decay multiplies η by a constant factor at fixed milestones (e.g. ×0.1 at epochs 30 and 60). Exponential or inverse-time decay shrinks η smoothly instead. These rules are simple and still widely used with SGD + momentum.
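PyTorch's MultiStepLR implements this milestone-style decay directly. A minimal sketch (the stand-in model, SGD settings, and milestone epochs here are illustrative, not a recipe):

Step decay with MultiStepLR
import torch
import torch.optim as optim
from torch.optim.lr_scheduler import MultiStepLR

model = torch.nn.Linear(16, 4)                  # stand-in model
optimizer = optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
# Multiply the LR by gamma at each milestone: 0.1 -> 0.01 -> 0.001
scheduler = MultiStepLR(optimizer, milestones=[30, 60], gamma=0.1)

for epoch in range(90):
    # ... one epoch of forward/backward + optimizer.step() goes here ...
    scheduler.step()                            # check for a milestone once per epoch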

Cosine annealing decreases η along a cosine curve from a maximum to a near-zero minimum over T steps, sometimes with periodic restarts (SGDR) to escape local minima. OneCycleLR warms up then anneals in one cycle, often paired with momentum adjustment—popular for fast convergence experiments.
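Unlike the per-epoch schedulers above, OneCycleLR is stepped once per batch. A minimal runnable sketch (the synthetic data, max_lr, and epoch count are illustrative):

OneCycleLR stepped per batch
import torch
import torch.optim as optim
from torch.optim.lr_scheduler import OneCycleLR
from torch.utils.data import DataLoader, TensorDataset

model = torch.nn.Linear(16, 1)
optimizer = optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
loader = DataLoader(TensorDataset(torch.randn(256, 16), torch.randn(256, 1)), batch_size=32)
# Warm up to max_lr, then anneal toward zero; also cycles momentum by default
scheduler = OneCycleLR(optimizer, max_lr=0.1, epochs=10, steps_per_epoch=len(loader))

for epoch in range(10):
    for x, y in loader:
        loss = torch.nn.functional.mse_loss(model(x), y)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        scheduler.step()        # per batch, not per epoch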

Warmup

With very deep models or large batch sizes, early updates can be unstable. Linear warmup increases η from 0 (or a small value) to the base LR over the first W steps or epochs. After warmup, a cosine or constant-then-decay schedule often follows. Transformer training recipes (e.g. “Attention Is All You Need”) normalized this pattern.
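One minimal way to express linear warmup is LambdaLR, which multiplies the optimizer's base LR by a factor you compute per step. A sketch (warmup_steps and the stand-in model are illustrative):

Linear warmup with LambdaLR
import torch
import torch.optim as optim
from torch.optim.lr_scheduler import LambdaLR

model = torch.nn.Linear(16, 4)                 # stand-in model
optimizer = optim.AdamW(model.parameters(), lr=3e-4)
warmup_steps = 1000                            # illustrative; tune per setup

def warmup_factor(step):
    # Ramp the LR multiplier from ~0 to 1 over warmup_steps, then hold at 1;
    # in practice you would hand off to a decay schedule afterward.
    return min(1.0, (step + 1) / warmup_steps)

scheduler = LambdaLR(optimizer, lr_lambda=warmup_factor)
# call scheduler.step() once per optimizer update (per batch)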

If the loss spikes in the first few steps, try a lower peak LR, warmup, or gradient clipping before chasing a fancier architecture.
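When the culprit is a few outsized gradients, clipping the global gradient norm before each optimizer step is the cheapest of those fixes. A sketch with synthetic data (max_norm=1.0 is a common but arbitrary default):

Gradient clipping before the optimizer step
import torch

model = torch.nn.Linear(16, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

x, y = torch.randn(32, 16), torch.randn(32, 1)
loss = torch.nn.functional.mse_loss(model(x), y)
optimizer.zero_grad()
loss.backward()
# Rescale all gradients so their global L2 norm is at most max_norm,
# taming outlier batches without touching typical updates
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()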

PyTorch Schedulers

PyTorch separates the optimizer (which stores parameters and the update rule) from the scheduler (which adjusts the optimizer’s LR over time). Call scheduler.step() after optimizer.step(); whether that happens once per epoch or once per batch depends on the scheduler.

Cosine annealing per epoch
import torch.optim as optim
from torch.optim.lr_scheduler import CosineAnnealingLR

# model and train_one_epoch are placeholders for your own training code
optimizer = optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.01)
# Anneal from the base LR (3e-4) down to eta_min over T_max epochs
scheduler = CosineAnnealingLR(optimizer, T_max=100, eta_min=1e-6)

for epoch in range(100):
    train_one_epoch()   # forward/backward + optimizer.step() per batch
    scheduler.step()    # then update the LR once per epoch

For step-based warmup + cosine, many codebases use LambdaLR, SequentialLR, or helpers like Hugging Face’s get_cosine_schedule_with_warmup.
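In plain PyTorch, SequentialLR can chain a LinearLR warmup into a cosine decay. A sketch (warmup_steps, total_steps, and the stand-in model are illustrative; this variant is stepped once per batch):

Warmup + cosine with SequentialLR
import torch
import torch.optim as optim
from torch.optim.lr_scheduler import LinearLR, CosineAnnealingLR, SequentialLR

model = torch.nn.Linear(16, 4)                 # stand-in model
optimizer = optim.AdamW(model.parameters(), lr=3e-4)
warmup_steps, total_steps = 500, 10_000        # illustrative step counts

scheduler = SequentialLR(
    optimizer,
    schedulers=[
        LinearLR(optimizer, start_factor=0.01, total_iters=warmup_steps),
        CosineAnnealingLR(optimizer, T_max=total_steps - warmup_steps, eta_min=1e-6),
    ],
    milestones=[warmup_steps],                 # hand off from warmup to cosine here
)
# call scheduler.step() once per optimizer update (per batch)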

Summary

  • Decay η over training to stabilize late-stage optimization; warmup helps at scale.
  • Cosine and step decay are common; OneCycle is a strong alternative when tuned.
  • Re-tune LR when you change batch size or optimizer family.
  • Next: vanishing and exploding gradients in very deep stacks.

Very deep networks strain backpropagation—see how gradients shrink or blow up and what architectures do about it.