Neural Networks: 15 Essential Q&A
Interview Prep

Optimizers (SGD, Adam, …) — 15 Interview Questions

Vanilla SGD, momentum, adaptive learning rates, Adam vs AdamW, and practical defaults interviewers expect you to name.


Topics: SGD · Momentum · Adam · AdamW
1. What does an optimizer do? (Easy)
Answer: Uses gradients (from backprop) to update parameters each step—choosing direction and step size (and possibly per-parameter scaling).
2. Vanilla (mini-batch) SGD update. (Easy)
Answer: θ ← θ − η g where g is minibatch gradient and η is learning rate. Simple, well-understood baseline.
θ_{t+1} = θ_t − η ∇L(θ_t)
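A minimal NumPy sketch of this update; names like `sgd_step` and the learning rate value are illustrative, not from any particular framework:

```python
import numpy as np

def sgd_step(theta, grad, lr=0.1):
    """One vanilla SGD step: theta <- theta - lr * grad."""
    return theta - lr * grad

theta = np.zeros(3)
grad = np.array([0.5, -1.0, 2.0])   # toy minibatch gradient
theta = sgd_step(theta, grad)
```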
3. SGD + momentum: update form. (Medium)
Answer: Maintain velocity v: v ← β v + g, then θ ← θ − η v. Accelerates along consistent gradient directions, dampens oscillations in ravines.
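A sketch of the same update in NumPy, assuming the velocity is carried between steps by the caller:

```python
import numpy as np

def momentum_step(theta, grad, velocity, lr=0.1, beta=0.9):
    """SGD with momentum: v <- beta*v + g, then theta <- theta - lr*v."""
    velocity = beta * velocity + grad
    theta = theta - lr * velocity
    return theta, velocity
```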
4. Nesterov momentum: difference from classical momentum? (Hard)
Answer: Looks ahead: gradient evaluated at θ − β v (approx.)—more responsive near curvature changes. Often “NAG” in frameworks.
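One common formulation as a sketch; frameworks differ slightly in where the learning rate enters, and `grad_fn` here is a placeholder for any function returning the gradient at a point:

```python
def nesterov_step(theta, grad_fn, velocity, lr=0.1, beta=0.9):
    """Nesterov momentum: take the gradient at the lookahead point theta - beta*velocity."""
    g = grad_fn(theta - beta * velocity)   # "look ahead" before measuring the gradient
    velocity = beta * velocity + g
    theta = theta - lr * velocity
    return theta, velocity
```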
5. Adagrad: idea and downside. (Medium)
Answer: Accumulate sum of squared gradients per parameter; divide update by growing denominator—adaptive smaller steps for frequent features. Learning rate can shrink to zero too aggressively.
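A minimal NumPy sketch showing why the effective step shrinks: the accumulator only ever grows.

```python
import numpy as np

def adagrad_step(theta, grad, accum, lr=0.01, eps=1e-8):
    """Adagrad: per-parameter step divided by sqrt of the running sum of squared gradients."""
    accum = accum + grad ** 2                         # monotonically increasing
    theta = theta - lr * grad / (np.sqrt(accum) + eps)
    return theta, accum
```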
6. RMSprop: fix to Adagrad. (Medium)
Answer: Use exponential moving average of squared gradients (EMA) instead of full sum—prevents denominator from exploding, works better for non-convex deep nets.
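The same sketch with the one-line change that defines RMSprop; the decay constant `rho=0.9` is a typical default, not a prescribed value:

```python
import numpy as np

def rmsprop_step(theta, grad, sq_avg, lr=1e-3, rho=0.9, eps=1e-8):
    """RMSprop: EMA of squared gradients replaces Adagrad's ever-growing sum."""
    sq_avg = rho * sq_avg + (1 - rho) * grad ** 2
    theta = theta - lr * grad / (np.sqrt(sq_avg) + eps)
    return theta, sq_avg
```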
7. Adam: what two moments does it track? (Medium)
Answer: First moment m: EMA of the gradient (like momentum). Second moment v: EMA of the squared gradient (like RMSprop). Bias-correct both early in training; the step is m̂ / (√v̂ + ε), scaled by the learning rate.
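A compact NumPy sketch of one Adam step, with `t` as the 1-based step count used for bias correction:

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam step with bias correction."""
    m = beta1 * m + (1 - beta1) * grad          # first moment: EMA of gradient
    v = beta2 * v + (1 - beta2) * grad ** 2     # second moment: EMA of squared gradient
    m_hat = m / (1 - beta1 ** t)                # bias correction (matters early in training)
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v
```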
8. Adam vs AdamW. (Medium)
Answer: AdamW decouples weight decay: L2 penalty applied directly to weights, not mixed into the adaptive gradient—often better generalization and standard in Transformers.
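A sketch of the decoupling, building on the `adam_step` idea above; the `weight_decay` value is illustrative. The key point is that the decay term never enters the moment estimates.

```python
import numpy as np

def adamw_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=0.01):
    """AdamW: the usual Adam update plus decay applied directly to the weights."""
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)   # adaptive part: no decay mixed in
    theta = theta - lr * weight_decay * theta             # decoupled weight decay
    return theta, m, v
```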
9. Default β₁, β₂ in Adam: what do they mean? (Easy)
Answer: Typical β₁=0.9 (gradient EMA decay), β₂=0.999 (squared-gradient EMA). Control memory horizon of first and second moment estimates.
10. Why ε in the Adam denominator? (Easy)
Answer: Numerical stability—avoid division by zero when second moment is tiny. Usually 1e-8 scale.
11. When might SGD + momentum beat Adam? (Hard)
Answer: Some vision tasks generalize slightly better with tuned SGD+momentum + schedule; Adam can find sharper minima in some studies—not universal, dataset and tuning matter.
12. Do Adam-class methods use true second derivatives? (Medium)
Answer: No full Hessian—only diagonal curvature estimates from squared-gradient EMA. Much cheaper than Newton methods.
13. Sparse gradients: name one optimizer variant. (Hard)
Answer: Lazy Adam / sparse Adam updates only touched parameters; useful in embedding-heavy models with huge vocabularies.
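A usage sketch assuming PyTorch, where `sparse=True` makes the embedding emit sparse gradients so SparseAdam only updates rows that appeared in the batch; the vocabulary size and batch shape are illustrative:

```python
import torch
import torch.nn as nn

embedding = nn.Embedding(num_embeddings=1_000_000, embedding_dim=128, sparse=True)
optimizer = torch.optim.SparseAdam(embedding.parameters(), lr=1e-3)

tokens = torch.randint(0, 1_000_000, (32, 64))   # toy batch of token ids
loss = embedding(tokens).sum()                   # stand-in for a real loss
loss.backward()                                  # sparse gradient: only touched rows
optimizer.step()
optimizer.zero_grad()
```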
14. Memory: SGD vs Adam. (Easy)
Answer: Adam stores two moment buffers per parameter (m and v)—roughly 2× optimizer state vs SGD with momentum (one velocity). Matters for large models.
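A back-of-the-envelope calculation; the 7B parameter count is illustrative and assumes fp32 optimizer state:

```python
n_params = 7e9            # e.g. a ~7B-parameter model (illustrative)
bytes_per_value = 4       # fp32

sgd_momentum_state = n_params * bytes_per_value * 1   # one velocity buffer
adam_state         = n_params * bytes_per_value * 2   # m and v buffers

print(f"SGD+momentum: {sgd_momentum_state / 1e9:.0f} GB, Adam: {adam_state / 1e9:.0f} GB")
# -> SGD+momentum: 28 GB, Adam: 56 GB
```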
15. Default optimizer you'd pick for a Transformer? (Easy)
Answer: AdamW with weight decay, warmup + decay schedule, β defaults above—standard in BERT/GPT-style training.
Tip: pair optimizer answers with a learning rate schedule when discussing Transformers; a minimal sketch follows.
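An illustrative PyTorch configuration: AdamW plus linear warmup then linear decay. The model, step counts, and hyperparameter values are placeholders, not prescribed settings.

```python
import torch

model = torch.nn.Linear(512, 512)            # stand-in for a Transformer
total_steps, warmup_steps = 10_000, 1_000    # illustrative schedule lengths

optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4,
                              betas=(0.9, 0.999), weight_decay=0.01)

def lr_lambda(step):
    if step < warmup_steps:
        return step / max(1, warmup_steps)                                    # linear warmup
    return max(0.0, (total_steps - step) / (total_steps - warmup_steps))      # linear decay

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
# In the training loop: loss.backward(); optimizer.step(); scheduler.step(); optimizer.zero_grad()
```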

Quick review checklist

  • SGD; momentum; Adagrad vs RMSprop intuition.
  • Adam moments; bias correction; AdamW decoupled decay.
  • When SGD competes; memory; Transformer default.