Optimizers (SGD, Adam, …) — 15 Interview Questions
Vanilla SGD, momentum, adaptive learning rates, Adam vs AdamW, and practical defaults interviewers expect you to name.
Tags: SGD, Momentum, Adam, AdamW
1. What does an optimizer do? (Easy)
Answer: Uses gradients (from backprop) to update parameters each step—choosing direction and step size (and possibly per-parameter scaling).
2. Vanilla (mini-batch) SGD update. (Easy)
Answer: θ ← θ − η g, where g is the minibatch gradient and η is the learning rate. Simple, well-understood baseline.
θ_{t+1} = θ_t − η ∇L(θ_t)
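A minimal NumPy sketch of this update (function and variable names are illustrative, not from any library):

```python
import numpy as np

def sgd_step(theta, grad, lr=0.1):
    """One vanilla SGD step: theta <- theta - lr * grad."""
    return theta - lr * grad

theta = np.array([1.0, -2.0])
grad = np.array([0.5, 0.1])      # minibatch gradient of the loss at theta
theta = sgd_step(theta, grad)    # -> [0.95, -2.01]
```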
3. SGD + momentum—update form. (Medium)
Answer: Maintain velocity v: v ← β v + g, then θ ← θ − η v. Accelerates along consistent gradient directions, dampens oscillations in ravines.
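The same update as a sketch (beta is the momentum coefficient; names are illustrative):

```python
import numpy as np

def momentum_step(theta, v, grad, lr=0.1, beta=0.9):
    """Classical momentum: v <- beta*v + grad, then theta <- theta - lr*v."""
    v = beta * v + grad
    theta = theta - lr * v
    return theta, v

theta, v = np.zeros(3), np.zeros(3)
for grad in [np.ones(3)] * 5:    # constant gradient: velocity keeps building
    theta, v = momentum_step(theta, v, grad)
```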
4. Nesterov momentum—difference from classical momentum? (Hard)
Answer: Looks ahead: gradient evaluated at θ − β v (approx.)—more responsive near curvature changes. Often "NAG" in frameworks.
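One common formulation as a sketch (several equivalent variants exist; grad_fn and the exact lookahead scaling are illustrative assumptions):

```python
import numpy as np

def nesterov_step(theta, v, grad_fn, lr=0.1, beta=0.9):
    """Nesterov momentum: the gradient is taken at a lookahead point, not at theta."""
    lookahead = theta - lr * beta * v    # where momentum alone would carry us
    g = grad_fn(lookahead)
    v = beta * v + g
    theta = theta - lr * v
    return theta, v

# Toy quadratic loss L(theta) = 0.5 * ||theta||^2, whose gradient is theta itself
theta, v = np.array([5.0]), np.array([0.0])
theta, v = nesterov_step(theta, v, grad_fn=lambda t: t)
```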
5. Adagrad—idea and downside. (Medium)
Answer: Accumulate sum of squared gradients per parameter; divide update by growing denominator—adaptive smaller steps for frequent features. Learning rate can shrink to zero too aggressively.
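Sketch of the per-parameter scaling (names and defaults illustrative):

```python
import numpy as np

def adagrad_step(theta, g_sq_sum, grad, lr=0.01, eps=1e-8):
    """Adagrad: divide by the sqrt of the running *sum* of squared gradients."""
    g_sq_sum = g_sq_sum + grad ** 2    # only ever grows, so step sizes only shrink
    theta = theta - lr * grad / (np.sqrt(g_sq_sum) + eps)
    return theta, g_sq_sum
```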
6. RMSprop—fix to Adagrad. (Medium)
Answer: Use exponential moving average of squared gradients (EMA) instead of full sum—prevents denominator from exploding, works better for non-convex deep nets.
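Same sketch with the sum replaced by an EMA (rho is the decay rate; defaults are illustrative):

```python
import numpy as np

def rmsprop_step(theta, sq_avg, grad, lr=0.001, rho=0.9, eps=1e-8):
    """RMSprop: an EMA of squared gradients keeps the denominator bounded."""
    sq_avg = rho * sq_avg + (1 - rho) * grad ** 2
    theta = theta - lr * grad / (np.sqrt(sq_avg) + eps)
    return theta, sq_avg
```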
7. Adam—what two moments does it track? (Medium)
Answer: First moment m: EMA of gradient (like momentum). Second moment v: EMA of squared gradient (like RMSprop). Bias-correct both early in training; divide step by √(v)+ε.
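Both moments and the bias correction in one sketch (t is the 1-based step count; names illustrative):

```python
import numpy as np

def adam_step(theta, m, v, grad, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam step: momentum-like m, RMSprop-like v, bias-corrected early on."""
    m = beta1 * m + (1 - beta1) * grad           # first moment (EMA of gradient)
    v = beta2 * v + (1 - beta2) * grad ** 2      # second moment (EMA of squared gradient)
    m_hat = m / (1 - beta1 ** t)                 # bias correction while the EMAs warm up
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v
```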
8. Adam vs AdamW. (Medium)
Answer: AdamW decouples weight decay: L2 penalty applied directly to weights, not mixed into the adaptive gradient—often better generalization and standard in Transformers.
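The difference in a sketch (wd is the weight-decay coefficient; everything else as in the Adam sketch above):

```python
import numpy as np

def adamw_step(theta, m, v, grad, t, lr=1e-3, beta1=0.9, beta2=0.999,
               eps=1e-8, wd=0.01):
    """AdamW: decay the weights directly instead of folding it into the gradient."""
    # Classic "Adam + L2" would instead start with: grad = grad + wd * theta
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps) - lr * wd * theta
    return theta, m, v
```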
9. Default β₁, β₂ in Adam—what do they mean? (Easy)
Answer: Typical β₁=0.9 (gradient EMA decay), β₂=0.999 (squared-gradient EMA). Control memory horizon of first and second moment estimates.
10. Why ε in Adam denominator? (Easy)
Answer: Numerical stability—avoid division by zero when second moment is tiny. Usually 1e-8 scale.
11. When might SGD+Momentum beat Adam? (Hard)
Answer: Some vision tasks generalize slightly better with well-tuned SGD+momentum and a learning-rate schedule; some studies find Adam converging to sharper minima—not universal; dataset and tuning matter.
12. Do Adam-class methods use true second derivatives? (Medium)
Answer: No full Hessian—only diagonal curvature estimates from squared-gradient EMA. Much cheaper than Newton methods.
13. Sparse gradients—mention one optimizer variant. (Hard)
Answer: Lazy Adam / sparse Adam updates only touched parameters; useful in embedding-heavy models with huge vocabularies.
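An illustrative PyTorch snippet, assuming an embedding table with sparse gradients (sizes and learning rate are placeholder values):

```python
import torch
import torch.nn as nn

emb = nn.Embedding(num_embeddings=1_000_000, embedding_dim=64, sparse=True)
opt = torch.optim.SparseAdam(emb.parameters(), lr=1e-3)

ids = torch.tensor([3, 17, 42])   # only these rows are touched this step
loss = emb(ids).sum()
loss.backward()                   # produces a sparse gradient for emb.weight
opt.step()
opt.zero_grad()
```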
14. Memory: SGD vs Adam. (Easy)
Answer: Adam stores two moment buffers per parameter (m and v)—roughly 2× optimizer state vs SGD with momentum (one velocity). Matters for large models.
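Rough back-of-the-envelope numbers (fp32 optimizer state, illustrative 1B-parameter model):

```python
params = 1_000_000_000                             # 1B parameters
bytes_per = 4                                      # fp32
adam_state_gb = 2 * params * bytes_per / 1e9       # m + v -> ~8 GB
momentum_state_gb = 1 * params * bytes_per / 1e9   # velocity only -> ~4 GB
```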
15. Default optimizer you’d pick for a Transformer? (Easy)
Answer: AdamW with weight decay, warmup + decay schedule, β defaults above—standard in BERT/GPT-style training.
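A minimal PyTorch sketch of that default (the model, learning rate, and step counts are placeholder values):

```python
import torch

model = torch.nn.Linear(768, 768)            # stand-in for a Transformer block
opt = torch.optim.AdamW(model.parameters(), lr=3e-4,
                        betas=(0.9, 0.999), weight_decay=0.01)

warmup, total = 1_000, 100_000
def lr_lambda(step):
    # linear warmup, then linear decay (one common Transformer schedule)
    if step < warmup:
        return step / max(1, warmup)
    return max(0.0, (total - step) / max(1, total - warmup))

sched = torch.optim.lr_scheduler.LambdaLR(opt, lr_lambda)
# in the training loop: opt.step(); sched.step() each iteration
```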
Pair optimizer answers with a learning-rate schedule when discussing Transformers.
Quick review checklist
- SGD; momentum; Adagrad vs RMSprop intuition.
- Adam moments; bias correction; AdamW decoupled decay.
- When SGD competes; memory; Transformer default.