Neural Networks: 15 Essential Q&A
Interview Prep

Learning Rate Schedules — 15 Interview Questions

Warmup, piecewise decay, cosine annealing, restarts, and how η interacts with batch size and adaptive optimizers.


1 What is the learning rate in SGD? (Easy)
Answer: Scalar η scaling the gradient step: controls speed vs stability—too large diverges, too small is slow. Same symbol in Adam as base step size (per-parameter scaling aside).
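A minimal sketch of how η scales the update in vanilla SGD; the parameter and gradient lists are illustrative, not a framework API.

    def sgd_step(params, grads, eta=0.1):
        # One vanilla SGD update: each parameter moves opposite its gradient, scaled by eta.
        return [p - eta * g for p, g in zip(params, grads)]

    sgd_step([1.0, -2.0], [0.5, 0.3], eta=0.1)  # -> [0.95, -2.03]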
2 What is a learning rate schedule? (Easy)
Answer: A function of step or epoch that changes η during training—e.g. decay after plateaus, cosine to zero, warmup then constant.
3 Step decay (piecewise constant). (Easy)
Answer: Multiply η by constant γ every N epochs or when metric plateaus—classic CV recipe; simple and interpretable.
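In PyTorch this maps to torch.optim.lr_scheduler.StepLR; a short sketch, with step_size and gamma as illustrative values rather than a recipe.

    import torch

    model = torch.nn.Linear(10, 1)                       # placeholder model
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=30, gamma=0.1)

    for epoch in range(90):
        # ... train one epoch ...
        scheduler.step()  # lr: 0.1 -> 0.01 at epoch 30 -> 0.001 at epoch 60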
4 Exponential decay formula (conceptual). (Medium)
Answer: η_t = η_0 · γ^t (discrete) or η_t = η_0 e^(−kt) (continuous): a smooth decrease, but the decay rate needs tuning.
η_t = η_0 · γ^t   (per step or epoch)
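A one-line sketch of the discrete form above; γ = 0.95 is just an example value.

    def exp_decay(eta0, gamma, t):
        # eta_t = eta_0 * gamma**t; gamma close to 1 decays slowly, smaller gamma decays fast.
        return eta0 * gamma ** t

    exp_decay(0.1, 0.95, 10)  # ~0.0599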
5 Cosine annealing: the idea. (Medium)
Answer: η follows a cosine curve from max to min over T steps—smooth decay used in many modern trainers (SGDR extends with restarts).
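A minimal sketch of the cosine curve from η_max to η_min over T steps (no restarts); the η values are illustrative.

    import math

    def cosine_lr(step, T, eta_max=1e-3, eta_min=1e-5):
        # Smoothly anneal from eta_max (step 0) down to eta_min (step T).
        return eta_min + 0.5 * (eta_max - eta_min) * (1 + math.cos(math.pi * step / T))

    [cosine_lr(s, 100) for s in (0, 50, 100)]  # [1e-3, ~5.05e-4, 1e-5]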
6 Why warmup for Transformers / large-batch Adam? (Medium)
Answer: Early steps have unreliable moment estimates; large η can destabilize. Linear warmup ramps η so early updates are smaller—standard in BERT/GPT recipes.
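A sketch of the linear ramp, assuming a peak lr of 3e-4 and 1,000 warmup steps purely for illustration; real recipes follow it with a decay phase.

    def warmup_lr(step, peak=3e-4, warmup_steps=1_000):
        # Ramp linearly from ~0 to peak over warmup_steps, then hold (decay would follow in practice).
        return peak * min(1.0, (step + 1) / warmup_steps)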
7 One-cycle policy (high level). (Hard)
Answer: Increase η up to a maximum, then decrease it again within a single cycle (Leslie Smith's one-cycle policy): enables fast training and can aid generalization; closely related to cyclical learning rates.
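PyTorch ships this as torch.optim.lr_scheduler.OneCycleLR; a sketch with placeholder max_lr and total_steps (OneCycleLR expects one step() call per batch).

    import torch

    model = torch.nn.Linear(10, 1)                        # placeholder model
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    scheduler = torch.optim.lr_scheduler.OneCycleLR(optimizer, max_lr=0.1, total_steps=10_000)

    for step in range(10_000):
        # ... forward / backward / optimizer.step() ...
        scheduler.step()  # lr rises to max_lr, then anneals back down within the single cycle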
8 LR range test / finder: purpose. (Medium)
Answer: Increase η each batch, plot loss—find region where loss drops fastest before explosion; heuristic to pick max LR for one-cycle or training.
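A sketch of the finder loop; train_batch is a hypothetical callable that runs one training batch at the given lr and returns its loss as a float.

    def lr_range_test(train_batch, lr=1e-7, factor=1.1, max_lr=10.0):
        history = []
        while lr <= max_lr:
            loss = train_batch(lr)              # hypothetical: one training batch at this lr
            history.append((lr, loss))
            if loss != loss or loss > 4 * min(l for _, l in history):  # NaN or blow-up
                break
            lr *= factor                        # grow lr geometrically each batch
        return history                          # plot loss vs lr; pick just below the steepest drop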
9 Linear scaling rule (batch size vs learning rate). (Hard)
Answer: When batch size ×k, some recipes scale η ×k to keep gradient noise similar—works as heuristic for SGD in some regimes; not universal (warmup, BN, Adam complicate).
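The arithmetic, assuming a reference recipe of lr 0.1 at batch size 256 for illustration.

    base_lr, base_batch = 0.1, 256                 # reference recipe (illustrative)
    new_batch = 1024
    scaled_lr = base_lr * new_batch / base_batch   # 0.4; pair with warmup and verify empirically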
10 ReduceLROnPlateau scheduler. (Easy)
Answer: When validation metric stops improving, multiply η by factor < 1—adaptive to training dynamics, no fixed epoch list.
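PyTorch sketch; note that step() takes the validation metric, and factor / patience here are illustrative.

    import torch

    model = torch.nn.Linear(10, 1)                        # placeholder model
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
        optimizer, mode="min", factor=0.1, patience=5)

    for epoch in range(100):
        val_loss = 0.0  # placeholder; compute the real validation loss here
        scheduler.step(val_loss)  # cuts lr by 10x after 5 epochs without improvement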
11 Minimum learning rate / η floor. (Easy)
Answer: Schedulers often clamp η ≥ η_min so training doesn’t stall completely; cosine schedules use explicit minimum.
12 Does Adam remove the need for schedules? (Medium)
Answer: No—Adam adapts per-parameter scale but global η still matters; Transformers use warmup + decay with AdamW routinely.
13 SGDR: stochastic gradient descent with warm restarts. (Hard)
Answer: Periodically reset η to high value (cosine cycles)—helps escape flat regions; related to ensemble of snapshots near restarts.
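PyTorch's CosineAnnealingWarmRestarts implements this; T_0 and T_mult below are illustrative cycle settings.

    import torch

    model = torch.nn.Linear(10, 1)                        # placeholder model
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
    scheduler = torch.optim.lr_scheduler.CosineAnnealingWarmRestarts(
        optimizer, T_0=10, T_mult=2, eta_min=1e-4)

    for epoch in range(70):
        # ... train one epoch ...
        scheduler.step()  # lr restarts at 0.1 at epochs 10 and 30 (cycle lengths 10, 20, 40)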
14 Grid search vs random search for η? (Medium)
Answer: Random search on log-uniform η often more efficient than grid—covers orders of magnitude; use validation score.
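A two-line sketch of log-uniform sampling; the 1e-5 to 1e-1 range is an assumption, adjust it to the problem.

    import random

    candidate_lrs = [10 ** random.uniform(-5, -1) for _ in range(20)]  # log-uniform in [1e-5, 1e-1]
    # train briefly with each candidate and keep the best validation score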
15 Practical trio you’d mention for a Transformer baseline. (Easy)
Answer: Peak learning rate, warmup steps, total steps / decay shape (cosine or linear)—cite paper recipe then tune.
Tie schedules to validation loss curves, not only train loss.
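A hedged sketch of the trio wired together with AdamW and LambdaLR: linear warmup to the peak lr, then cosine decay over the remaining steps; peak_lr, warmup_steps, and total_steps are placeholders to take from the paper recipe and tune.

    import math
    import torch

    peak_lr, warmup_steps, total_steps = 3e-4, 2_000, 100_000   # placeholders to tune

    def lr_lambda(step):
        # Multiplier on the base lr: linear warmup, then cosine decay to ~0.
        if step < warmup_steps:
            return (step + 1) / warmup_steps
        progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
        return 0.5 * (1 + math.cos(math.pi * progress))

    model = torch.nn.Linear(10, 1)                               # placeholder model
    optimizer = torch.optim.AdamW(model.parameters(), lr=peak_lr, weight_decay=0.01)
    scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)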

Quick review checklist

  • Constant vs schedule; step, exponential, cosine; why warmup helps.
  • Plateau scheduler; SGDR; LR finder one-liner.
  • Adam still needs global η; batch-size scaling caveat.