Neural Networks: 15 Essential Q&A
Interview Prep

Optimizers (SGD, Adam, …) — 15 Interview Questions

Vanilla SGD, momentum, adaptive learning rates, Adam vs AdamW, and practical defaults interviewers expect you to name.


Topics: SGD · Momentum · Adam · AdamW
1. What does an optimizer do? (Easy)
Answer: Uses gradients (from backprop) to update parameters each step—choosing direction and step size (and possibly per-parameter scaling).
2. Vanilla (mini-batch) SGD update. (Easy)
Answer: θ ← θ − η g where g is minibatch gradient and η is learning rate. Simple, well-understood baseline.
θ_{t+1} = θ_t − η ∇L(θ_t)
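A minimal NumPy sketch of this update; names like `sgd_step` and the learning rate value are illustrative, not from any particular framework:

```python
import numpy as np

def sgd_step(theta, grad, lr=0.1):
    """One vanilla SGD step: theta <- theta - lr * grad."""
    return theta - lr * grad

theta = np.zeros(3)
grad = np.array([0.5, -1.0, 2.0])   # toy minibatch gradient
theta = sgd_step(theta, grad)
```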
3. SGD + momentum: update form. (Medium)
Answer: Maintain velocity v: v ← β v + g, then θ ← θ − η v. Accelerates along consistent gradient directions, dampens oscillations in ravines.
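A sketch of the same update in NumPy, assuming the velocity is carried between steps by the caller:

```python
import numpy as np

def momentum_step(theta, grad, velocity, lr=0.1, beta=0.9):
    """SGD with momentum: v <- beta*v + g, then theta <- theta - lr*v."""
    velocity = beta * velocity + grad
    theta = theta - lr * velocity
    return theta, velocity
```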
4. Nesterov momentum: difference from classical momentum? (Hard)
Answer: Looks ahead: gradient evaluated at θ − β v (approx.)—more responsive near curvature changes. Often “NAG” in frameworks.
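One common formulation as a sketch; frameworks differ slightly in where the learning rate enters, and `grad_fn` here is a placeholder for any function returning the gradient at a point:

```python
def nesterov_step(theta, grad_fn, velocity, lr=0.1, beta=0.9):
    """Nesterov momentum: take the gradient at the lookahead point theta - beta*velocity."""
    g = grad_fn(theta - beta * velocity)   # "look ahead" before measuring the gradient
    velocity = beta * velocity + g
    theta = theta - lr * velocity
    return theta, velocity
```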
5. Adagrad: idea and downside. (Medium)
Answer: Accumulate sum of squared gradients per parameter; divide update by growing denominator—adaptive smaller steps for frequent features. Learning rate can shrink to zero too aggressively.
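A minimal NumPy sketch showing why the effective step shrinks: the accumulator only ever grows.

```python
import numpy as np

def adagrad_step(theta, grad, accum, lr=0.01, eps=1e-8):
    """Adagrad: per-parameter step divided by sqrt of the running sum of squared gradients."""
    accum = accum + grad ** 2                         # monotonically increasing
    theta = theta - lr * grad / (np.sqrt(accum) + eps)
    return theta, accum
```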
6. RMSprop: fix to Adagrad. (Medium)
Answer: Use exponential moving average of squared gradients (EMA) instead of full sum—prevents denominator from exploding, works better for non-convex deep nets.
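The same sketch with the one-line change that defines RMSprop; the decay constant `rho=0.9` is a typical default, not a prescribed value:

```python
import numpy as np

def rmsprop_step(theta, grad, sq_avg, lr=1e-3, rho=0.9, eps=1e-8):
    """RMSprop: EMA of squared gradients replaces Adagrad's ever-growing sum."""
    sq_avg = rho * sq_avg + (1 - rho) * grad ** 2
    theta = theta - lr * grad / (np.sqrt(sq_avg) + eps)
    return theta, sq_avg
```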
7. Adam: what two moments does it track? (Medium)
Answer: First moment m: EMA of the gradient (like momentum). Second moment v: EMA of the squared gradient (like RMSprop). Bias-correct both early in training; the step is m̂ / (√v̂ + ε), scaled by the learning rate.
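A compact NumPy sketch of one Adam step, with `t` as the 1-based step count used for bias correction:

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam step with bias correction."""
    m = beta1 * m + (1 - beta1) * grad          # first moment: EMA of gradient
    v = beta2 * v + (1 - beta2) * grad ** 2     # second moment: EMA of squared gradient
    m_hat = m / (1 - beta1 ** t)                # bias correction (matters early in training)
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v
```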
8. Adam vs AdamW. (Medium)
Answer: AdamW decouples weight decay: L2 penalty applied directly to weights, not mixed into the adaptive gradient—often better generalization and standard in Transformers.
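A sketch of the decoupling, building on the `adam_step` idea above; the `weight_decay` value is illustrative. The key point is that the decay term never enters the moment estimates.

```python
import numpy as np

def adamw_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=0.01):
    """AdamW: the usual Adam update plus decay applied directly to the weights."""
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)   # adaptive part: no decay mixed in
    theta = theta - lr * weight_decay * theta             # decoupled weight decay
    return theta, m, v
```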
9. Default β₁, β₂ in Adam: what do they mean? (Easy)
Answer: Typical β₁=0.9 (gradient EMA decay), β₂=0.999 (squared-gradient EMA). Control memory horizon of first and second moment estimates.
10. Why ε in the Adam denominator? (Easy)
Answer: Numerical stability—avoid division by zero when second moment is tiny. Usually 1e-8 scale.
11. When might SGD + momentum beat Adam? (Hard)
Answer: Some vision tasks generalize slightly better with tuned SGD+momentum + schedule; Adam can find sharper minima in some studies—not universal, dataset and tuning matter.
12. Do Adam-class methods use true second derivatives? (Medium)
Answer: No full Hessian—only diagonal curvature estimates from squared-gradient EMA. Much cheaper than Newton methods.
13. Sparse gradients: name one optimizer variant. (Hard)
Answer: Lazy Adam / sparse Adam updates only touched parameters; useful in embedding-heavy models with huge vocabularies.
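A usage sketch assuming PyTorch, where `sparse=True` makes the embedding emit sparse gradients so SparseAdam only updates rows that appeared in the batch; the vocabulary size and batch shape are illustrative:

```python
import torch
import torch.nn as nn

embedding = nn.Embedding(num_embeddings=1_000_000, embedding_dim=128, sparse=True)
optimizer = torch.optim.SparseAdam(embedding.parameters(), lr=1e-3)

tokens = torch.randint(0, 1_000_000, (32, 64))   # toy batch of token ids
loss = embedding(tokens).sum()                   # stand-in for a real loss
loss.backward()                                  # sparse gradient: only touched rows
optimizer.step()
optimizer.zero_grad()
```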
14. Memory: SGD vs Adam. (Easy)
Answer: Adam stores two moment buffers per parameter (m and v)—roughly 2× optimizer state vs SGD with momentum (one velocity). Matters for large models.
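A back-of-the-envelope calculation; the 7B parameter count is illustrative and assumes fp32 optimizer state:

```python
n_params = 7e9            # e.g. a ~7B-parameter model (illustrative)
bytes_per_value = 4       # fp32

sgd_momentum_state = n_params * bytes_per_value * 1   # one velocity buffer
adam_state         = n_params * bytes_per_value * 2   # m and v buffers

print(f"SGD+momentum: {sgd_momentum_state / 1e9:.0f} GB, Adam: {adam_state / 1e9:.0f} GB")
# -> SGD+momentum: 28 GB, Adam: 56 GB
```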
15. Default optimizer you'd pick for a Transformer? (Easy)
Answer: AdamW with weight decay, warmup + decay schedule, β defaults above—standard in BERT/GPT-style training.
Tip: pair optimizer answers with a learning rate schedule when discussing Transformers; a minimal sketch follows.
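An illustrative PyTorch configuration: AdamW plus linear warmup then linear decay. The model, step counts, and hyperparameter values are placeholders, not prescribed settings.

```python
import torch

model = torch.nn.Linear(512, 512)            # stand-in for a Transformer
total_steps, warmup_steps = 10_000, 1_000    # illustrative schedule lengths

optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4,
                              betas=(0.9, 0.999), weight_decay=0.01)

def lr_lambda(step):
    if step < warmup_steps:
        return step / max(1, warmup_steps)                                    # linear warmup
    return max(0.0, (total_steps - step) / (total_steps - warmup_steps))      # linear decay

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
# In the training loop: loss.backward(); optimizer.step(); scheduler.step(); optimizer.zero_grad()
```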

Quick review checklist

  • SGD; momentum; Adagrad vs RMSprop intuition.
  • Adam moments; bias correction; AdamW decoupled decay.
  • When SGD competes; memory; Transformer default.