Learning Rate Schedules — 15 Interview Questions
Warmup, piecewise decay, cosine annealing, restarts, and how η interacts with batch size and adaptive optimizers.
Topics: η · warmup · decay · restarts
1. What is the learning rate in SGD? (Easy)
Answer: The scalar η that scales the gradient step; it controls speed vs. stability: too large diverges, too small trains slowly. The same symbol is used in Adam as the base step size (setting per-parameter scaling aside).
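In symbols, the vanilla SGD update is:
θ_{t+1} = θ_t − η · ∇L(θ_t)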
2. What is a learning rate schedule? (Easy)
Answer: A function of step or epoch that changes η during training—e.g. decay after plateaus, cosine to zero, warmup then constant.
3. Step decay (piecewise constant). (Easy)
Answer: Multiply η by a constant factor γ every N epochs, or when a metric plateaus; a classic computer-vision recipe, simple and interpretable.
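Written out, with decay factor γ applied every N epochs:
η_t = η_0 · γ^⌊t/N⌋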
4. Exponential decay formula (conceptual). (Medium)
Answer: η_t = η_0 · γ^t, or in continuous form η_t = η_0 · e^(−kt); a smooth decrease, but the decay rate (γ or k) needs tuning.
η_t = η_0 · γ^t (per step or epoch)
5. Cosine annealing: the idea. (Medium)
Answer: η follows a cosine curve from max to min over T steps—smooth decay used in many modern trainers (SGDR extends with restarts).
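A common form of the schedule, decaying from η_max to η_min over T steps, is:
η_t = η_min + 0.5 · (η_max − η_min) · (1 + cos(π · t / T))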
6. Why warmup for Transformers / large-batch Adam? (Medium)
Answer: Early steps have unreliable moment estimates; large η can destabilize. Linear warmup ramps η so early updates are smaller—standard in BERT/GPT recipes.
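A typical linear warmup over the first t_warmup steps (writing η_peak for the post-warmup value) is:
η_t = η_peak · t / t_warmup for t ≤ t_warmup, after which the main decay schedule takes over.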
7. One-cycle policy (high level). (Hard)
Answer: Ramp η up to a maximum and then anneal it back down over a single cycle. Proposed by Leslie Smith; it enables fast training and can aid generalization, and it builds on cyclical learning-rate ideas.
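A minimal sketch using PyTorch's built-in OneCycleLR (the toy model, random batches, max_lr, and step count are illustrative stand-ins for a real training loop):

    import torch
    from torch import nn
    from torch.optim.lr_scheduler import OneCycleLR

    model = nn.Linear(10, 1)                            # toy model for illustration
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
    num_steps = 100
    scheduler = OneCycleLR(optimizer, max_lr=0.1, total_steps=num_steps)

    for step in range(num_steps):
        x, y = torch.randn(32, 10), torch.randn(32, 1)  # random stand-in batch
        loss = nn.functional.mse_loss(model(x), y)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        scheduler.step()    # stepped per batch: η ramps up, then anneals down over the cycle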
8. LR range test / finder: purpose. (Medium)
Answer: Increase η every batch and plot the loss; look for the region where the loss drops fastest before it explodes. A heuristic for picking the maximum LR for one-cycle or plain training.
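A minimal sketch of the range-test loop itself (not a library API; the toy model, random batches, and LR range are illustrative):

    import torch
    from torch import nn

    model = nn.Linear(10, 1)
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-7)
    lr, lr_max, num_iters = 1e-7, 1.0, 200
    mult = (lr_max / lr) ** (1 / num_iters)            # exponential LR ramp per batch
    history = []

    for _ in range(num_iters):
        x, y = torch.randn(32, 10), torch.randn(32, 1)  # random stand-in batch
        loss = nn.functional.mse_loss(model(x), y)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        history.append((lr, loss.item()))               # plot loss vs. lr afterwards
        lr *= mult
        for g in optimizer.param_groups:                # manually raise the LR each batch
            g["lr"] = lr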
9. Linear scaling rule (batch size vs. learning rate). (Hard)
Answer: When the batch size is multiplied by k, some recipes also scale η by k to keep the gradient-noise scale roughly similar. It works as a heuristic for SGD in some regimes but is not universal (warmup, batch norm, and Adam complicate it).
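For example, if a recipe uses η = 0.1 at batch size 256, moving to batch size 1024 (k = 4) would suggest η = 0.4 under this rule, usually combined with warmup.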
10. ReduceLROnPlateau scheduler. (Easy)
Answer: When a validation metric stops improving, multiply η by a factor < 1; adapts to training dynamics with no fixed epoch list.
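A minimal PyTorch usage sketch (evaluate() is a hypothetical placeholder for computing the validation loss):

    import torch
    from torch import nn
    from torch.optim.lr_scheduler import ReduceLROnPlateau

    model = nn.Linear(10, 1)
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    scheduler = ReduceLROnPlateau(optimizer, mode="min", factor=0.1, patience=10)

    for epoch in range(100):
        # ... train one epoch ...
        val_loss = evaluate(model)        # placeholder: compute validation loss
        scheduler.step(val_loss)          # cuts LR by `factor` after `patience` bad epochs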
11. Minimum learning rate / η floor. (Easy)
Answer: Schedulers often clamp η ≥ η_min so training doesn't stall completely; cosine schedules use an explicit minimum.
12. Does Adam remove the need for schedules? (Medium)
Answer: No: Adam adapts the per-parameter scale, but the global η still matters; Transformer recipes routinely pair warmup + decay with AdamW.
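A minimal sketch of linear warmup followed by cosine decay for AdamW, built with LambdaLR (step counts and peak LR are illustrative; libraries such as transformers ship similar ready-made helpers):

    import math
    import torch
    from torch import nn
    from torch.optim.lr_scheduler import LambdaLR

    model = nn.Linear(10, 1)
    optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.01)
    warmup_steps, total_steps = 1000, 10000

    def lr_lambda(step):
        if step < warmup_steps:                       # linear warmup to the peak LR
            return step / max(1, warmup_steps)
        progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
        return 0.5 * (1 + math.cos(math.pi * progress))   # cosine decay toward 0

    scheduler = LambdaLR(optimizer, lr_lambda)        # multiplies the base lr of 3e-4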
13. SGDR: stochastic gradient descent with warm restarts. (Hard)
Answer: Periodically reset η to a high value and anneal it down again in cosine cycles; this can help escape poor regions, and snapshots taken near each restart can be ensembled.
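A minimal sketch with PyTorch's CosineAnnealingWarmRestarts (cycle lengths and η_min are illustrative):

    import torch
    from torch import nn
    from torch.optim.lr_scheduler import CosineAnnealingWarmRestarts

    model = nn.Linear(10, 1)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
    # First cycle lasts T_0 epochs; each subsequent cycle is T_mult times longer.
    scheduler = CosineAnnealingWarmRestarts(optimizer, T_0=10, T_mult=2, eta_min=1e-5)

    for epoch in range(70):
        # ... train one epoch ...
        scheduler.step()          # LR jumps back up at each restart, then re-anneals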
14. Grid search vs. random search for η? (Medium)
Answer: Random search over a log-uniform range of η is often more efficient than grid search because it covers several orders of magnitude; select by validation score.
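A one-line NumPy sketch of log-uniform sampling (the range 1e-5 to 1e-1 is illustrative):

    import numpy as np
    lrs = 10 ** np.random.uniform(-5, -1, size=20)   # 20 candidate LRs across four orders of magnitude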
15. Practical trio you'd mention for a Transformer baseline. (Easy)
Answer: Peak learning rate, number of warmup steps, and total steps / decay shape (cosine or linear); start from a published recipe, then tune.
Tie schedules to validation loss curves, not only train loss.
Quick review checklist
- Constant vs. scheduled η; step, exponential, and cosine decay; why warmup is used.
- Plateau scheduler; SGDR; LR finder one-liner.
- Adam still needs global η; batch-size scaling caveat.