Neural Networks: 15 Essential Q&A
Interview Prep

Learning Rate Schedules — 15 Interview Questions

Warmup, piecewise decay, cosine annealing, restarts, and how η interacts with batch size and adaptive optimizers.


1 What is the learning rate in SGD? (Easy)
Answer: Scalar η scaling the gradient step: controls speed vs stability—too large diverges, too small is slow. Same symbol in Adam as base step size (per-parameter scaling aside).
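A minimal sketch of how η scales the update in vanilla SGD; the parameter and gradient lists are illustrative, not a framework API.

    def sgd_step(params, grads, eta=0.1):
        # One vanilla SGD update: each parameter moves opposite its gradient, scaled by eta.
        return [p - eta * g for p, g in zip(params, grads)]

    sgd_step([1.0, -2.0], [0.5, 0.3], eta=0.1)  # -> [0.95, -2.03]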
2 What is a learning rate schedule? (Easy)
Answer: A function of step or epoch that changes η during training—e.g. decay after plateaus, cosine to zero, warmup then constant.
3 Step decay (piecewise constant). (Easy)
Answer: Multiply η by constant γ every N epochs or when metric plateaus—classic CV recipe; simple and interpretable.
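In PyTorch this maps to torch.optim.lr_scheduler.StepLR; a short sketch, with step_size and gamma as illustrative values rather than a recipe.

    import torch

    model = torch.nn.Linear(10, 1)                       # placeholder model
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=30, gamma=0.1)

    for epoch in range(90):
        # ... train one epoch ...
        scheduler.step()  # lr: 0.1 -> 0.01 at epoch 30 -> 0.001 at epoch 60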
4 Exponential decay formula (conceptual). (Medium)
Answer: η_t = η_0 · γ^t (discrete) or η_t = η_0 e^(−kt) (continuous): a smooth decrease, but the decay rate needs tuning.
η_t = η_0 · γ^t   (per step or epoch)
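A one-line sketch of the discrete form above; γ = 0.95 is just an example value.

    def exp_decay(eta0, gamma, t):
        # eta_t = eta_0 * gamma**t; gamma close to 1 decays slowly, smaller gamma decays fast.
        return eta0 * gamma ** t

    exp_decay(0.1, 0.95, 10)  # ~0.0599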
5 Cosine annealing: the idea. (Medium)
Answer: η follows a cosine curve from max to min over T steps—smooth decay used in many modern trainers (SGDR extends with restarts).
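A minimal sketch of the cosine curve from η_max to η_min over T steps (no restarts); the η values are illustrative.

    import math

    def cosine_lr(step, T, eta_max=1e-3, eta_min=1e-5):
        # Smoothly anneal from eta_max (step 0) down to eta_min (step T).
        return eta_min + 0.5 * (eta_max - eta_min) * (1 + math.cos(math.pi * step / T))

    [cosine_lr(s, 100) for s in (0, 50, 100)]  # [1e-3, ~5.05e-4, 1e-5]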
6 Why warmup for Transformers / large-batch Adam? (Medium)
Answer: Early steps have unreliable moment estimates; large η can destabilize. Linear warmup ramps η so early updates are smaller—standard in BERT/GPT recipes.
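A sketch of the linear ramp, assuming a peak lr of 3e-4 and 1,000 warmup steps purely for illustration; real recipes follow it with a decay phase.

    def warmup_lr(step, peak=3e-4, warmup_steps=1_000):
        # Ramp linearly from ~0 to peak over warmup_steps, then hold (decay would follow in practice).
        return peak * min(1.0, (step + 1) / warmup_steps)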
7 One-cycle policy (high level). (Hard)
Answer: Increase η up to a maximum, then decrease it again within a single cycle (Leslie Smith's one-cycle policy): enables fast training and can aid generalization; closely related to cyclical learning rates.
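PyTorch ships this as torch.optim.lr_scheduler.OneCycleLR; a sketch with placeholder max_lr and total_steps (OneCycleLR expects one step() call per batch).

    import torch

    model = torch.nn.Linear(10, 1)                        # placeholder model
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    scheduler = torch.optim.lr_scheduler.OneCycleLR(optimizer, max_lr=0.1, total_steps=10_000)

    for step in range(10_000):
        # ... forward / backward / optimizer.step() ...
        scheduler.step()  # lr rises to max_lr, then anneals back down within the single cycle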
8 LR range test / finder: purpose. (Medium)
Answer: Increase η each batch, plot loss—find region where loss drops fastest before explosion; heuristic to pick max LR for one-cycle or training.
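A sketch of the finder loop; train_batch is a hypothetical callable that runs one training batch at the given lr and returns its loss as a float.

    def lr_range_test(train_batch, lr=1e-7, factor=1.1, max_lr=10.0):
        history = []
        while lr <= max_lr:
            loss = train_batch(lr)              # hypothetical: one training batch at this lr
            history.append((lr, loss))
            if loss != loss or loss > 4 * min(l for _, l in history):  # NaN or blow-up
                break
            lr *= factor                        # grow lr geometrically each batch
        return history                          # plot loss vs lr; pick just below the steepest drop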
9 Linear scaling rule (batch size vs learning rate). (Hard)
Answer: When batch size ×k, some recipes scale η ×k to keep gradient noise similar—works as heuristic for SGD in some regimes; not universal (warmup, BN, Adam complicate).
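The arithmetic, assuming a reference recipe of lr 0.1 at batch size 256 for illustration.

    base_lr, base_batch = 0.1, 256                 # reference recipe (illustrative)
    new_batch = 1024
    scaled_lr = base_lr * new_batch / base_batch   # 0.4; pair with warmup and verify empirically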
10 ReduceLROnPlateau scheduler. (Easy)
Answer: When validation metric stops improving, multiply η by factor < 1—adaptive to training dynamics, no fixed epoch list.
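PyTorch sketch; note that step() takes the validation metric, and factor / patience here are illustrative.

    import torch

    model = torch.nn.Linear(10, 1)                        # placeholder model
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
        optimizer, mode="min", factor=0.1, patience=5)

    for epoch in range(100):
        val_loss = 0.0  # placeholder; compute the real validation loss here
        scheduler.step(val_loss)  # cuts lr by 10x after 5 epochs without improvement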
11 Minimum learning rate / η floor. (Easy)
Answer: Schedulers often clamp η ≥ η_min so training doesn’t stall completely; cosine schedules use explicit minimum.
12 Does Adam remove the need for schedules? (Medium)
Answer: No—Adam adapts per-parameter scale but global η still matters; Transformers use warmup + decay with AdamW routinely.
13 SGDR: stochastic gradient descent with warm restarts. (Hard)
Answer: Periodically reset η to high value (cosine cycles)—helps escape flat regions; related to ensemble of snapshots near restarts.
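PyTorch's CosineAnnealingWarmRestarts implements this; T_0 and T_mult below are illustrative cycle settings.

    import torch

    model = torch.nn.Linear(10, 1)                        # placeholder model
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
    scheduler = torch.optim.lr_scheduler.CosineAnnealingWarmRestarts(
        optimizer, T_0=10, T_mult=2, eta_min=1e-4)

    for epoch in range(70):
        # ... train one epoch ...
        scheduler.step()  # lr restarts at 0.1 at epochs 10 and 30 (cycle lengths 10, 20, 40)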
14 Grid search vs random search for η? (Medium)
Answer: Random search on log-uniform η often more efficient than grid—covers orders of magnitude; use validation score.
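A two-line sketch of log-uniform sampling; the 1e-5 to 1e-1 range is an assumption, adjust it to the problem.

    import random

    candidate_lrs = [10 ** random.uniform(-5, -1) for _ in range(20)]  # log-uniform in [1e-5, 1e-1]
    # train briefly with each candidate and keep the best validation score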
15 Practical trio you’d mention for a Transformer baseline. (Easy)
Answer: Peak learning rate, warmup steps, total steps / decay shape (cosine or linear)—cite paper recipe then tune.
Tie schedules to validation loss curves, not only train loss.
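A hedged sketch of the trio wired together with AdamW and LambdaLR: linear warmup to the peak lr, then cosine decay over the remaining steps; peak_lr, warmup_steps, and total_steps are placeholders to take from the paper recipe and tune.

    import math
    import torch

    peak_lr, warmup_steps, total_steps = 3e-4, 2_000, 100_000   # placeholders to tune

    def lr_lambda(step):
        # Multiplier on the base lr: linear warmup, then cosine decay to ~0.
        if step < warmup_steps:
            return (step + 1) / warmup_steps
        progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
        return 0.5 * (1 + math.cos(math.pi * progress))

    model = torch.nn.Linear(10, 1)                               # placeholder model
    optimizer = torch.optim.AdamW(model.parameters(), lr=peak_lr, weight_decay=0.01)
    scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)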

Quick review checklist

  • Constant vs schedule; step, exponential, cosine; why warmup helps.
  • Plateau scheduler; SGDR; LR finder one-liner.
  • Adam still needs global η; batch-size scaling caveat.