Optimizers MCQ · test your deep learning optimization knowledge
From SGD to AdamW – 15 questions covering gradient descent variants, adaptive methods, and learning rate schedules.
Optimizers: the engines of deep learning
Optimizers update neural network weights to minimize the loss function. This MCQ covers classical (SGD, Momentum, NAG), adaptive (AdaGrad, RMSprop), and hybrid (Adam, AdamW, Nadam) methods, as well as learning rate schedules and convergence properties.
Why optimizer choice matters
The right optimizer can speed up convergence, help escape saddle points and poor local minima, and improve final performance. Adaptive methods like Adam are often good defaults, but SGD with momentum often generalizes better.
Optimizers glossary – key concepts
SGD (Stochastic Gradient Descent)
Updates weights using the gradient of a mini‑batch: θ = θ - η·∇L(θ). Simple and memory-cheap, but convergence can be slow and noisy.
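A minimal sketch of the update in NumPy (the function name and toy loss are illustrative, not from a specific library):

```python
import numpy as np

def sgd_step(theta, grad, lr):
    # theta = theta - eta * grad(L)
    return theta - lr * grad

# toy example: minimize L(theta) = theta^2, so grad = 2*theta
theta = np.array([1.0])
for _ in range(50):
    theta = sgd_step(theta, 2 * theta, lr=0.1)
# theta has moved close to the minimum at 0
```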
Momentum
Accumulates past gradients to smooth updates and accelerate convergence. v = βv + ∇L; θ = θ - ηv.
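A sketch of the same convention in NumPy (the velocity buffer `v` persists across steps; names are illustrative):

```python
import numpy as np

def momentum_step(theta, v, grad, lr=0.1, beta=0.9):
    # v = beta*v + grad ; theta = theta - lr*v
    v = beta * v + grad
    return theta - lr * v, v

# same toy quadratic: grad of theta^2 is 2*theta
theta, v = np.array([1.0]), np.array([0.0])
for _ in range(200):
    theta, v = momentum_step(theta, v, 2 * theta)
```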
AdaGrad
Adapts the learning rate per parameter using the running sum of squared gradients. Good for sparse features, but the effective learning rate decays toward zero, which can stall training.
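A sketch showing the decay in action: with a constant gradient, the effective step shrinks like lr/√t (names and constants are illustrative):

```python
import numpy as np

def adagrad_step(theta, g_sum, grad, lr=0.5, eps=1e-8):
    g_sum = g_sum + grad ** 2          # lifetime sum of squared gradients
    return theta - lr * grad / (np.sqrt(g_sum) + eps), g_sum

theta, g_sum = 0.0, 0.0
steps = []
for _ in range(100):
    new_theta, g_sum = adagrad_step(theta, g_sum, 1.0)
    steps.append(theta - new_theta)    # record the step size taken
    theta = new_theta
# steps[0] ~ 0.5 but steps[99] ~ 0.05: the step size keeps shrinking
```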
RMSprop
Uses an exponential moving average of squared gradients to normalize updates, mitigating AdaGrad's vanishing learning rate.
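A sketch contrasting with AdaGrad: because `s` is a moving average rather than a sum, the step size stabilizes near lr instead of decaying to zero (names are illustrative):

```python
import numpy as np

def rmsprop_step(theta, s, grad, lr=0.01, rho=0.9, eps=1e-8):
    s = rho * s + (1 - rho) * grad ** 2    # moving average, not a lifetime sum
    return theta - lr * grad / (np.sqrt(s) + eps), s

# with a constant gradient, s approaches grad^2, so the step approaches lr
theta, s = 0.0, 0.0
for _ in range(200):
    prev = theta
    theta, s = rmsprop_step(theta, s, 1.0)
```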
Adam (Adaptive Moment Estimation)
Combines momentum (first moment) and RMSprop (second moment) with bias correction. Popular default.
AdamW
Adam with decoupled weight decay: the decay is applied directly to the weights rather than added to the gradient, fixing the interaction between L2 regularization and Adam's adaptive scaling. Often yields better generalization.
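A sketch of the decoupling (names are illustrative; note the decay term sits outside the adaptive normalization):

```python
import numpy as np

def adamw_step(theta, m, v, grad, t, lr=1e-3, b1=0.9, b2=0.999,
               eps=1e-8, wd=0.01):
    m = b1 * m + (1 - b1) * grad
    v = b2 * v + (1 - b2) * grad ** 2
    m_hat = m / (1 - b1 ** t)              # bias correction
    v_hat = v / (1 - b2 ** t)
    # weight decay applied directly to theta, outside the adaptive
    # normalization -- unlike L2 regularization added to the gradient
    theta = theta - lr * (m_hat / (np.sqrt(v_hat) + eps) + wd * theta)
    return theta, m, v

# with a zero gradient, only the decay term moves the weight
theta, m, v = 1.0, 0.0, 0.0
theta, m, v = adamw_step(theta, m, v, grad=0.0, t=1)
```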
Nesterov Accelerated Gradient
"Look ahead" version of momentum: computes gradient at approximate future position.
Learning rate schedules
Step decay, exponential decay, cosine annealing – adjust learning rate during training.
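Two of these schedules as minimal sketches (function names and constants are illustrative):

```python
import math

def step_decay(lr0, step, drop=0.5, every=30):
    # multiply the learning rate by `drop` every `every` steps
    return lr0 * (drop ** (step // every))

def cosine_annealing(lr0, step, total, lr_min=0.0):
    # smoothly anneal from lr0 down to lr_min over `total` steps
    return lr_min + 0.5 * (lr0 - lr_min) * (1 + math.cos(math.pi * step / total))
```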
# Adam update rule (simplified)
m = β1*m + (1-β1)*∇L            # first moment
v = β2*v + (1-β2)*(∇L)²         # second moment
m_hat = m/(1-β1^t)              # bias correction
v_hat = v/(1-β2^t)
θ = θ - η * m_hat/(√v_hat + ε)
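A runnable NumPy version of the Adam update rule (names are illustrative):

```python
import numpy as np

def adam_step(theta, m, v, grad, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    m = b1 * m + (1 - b1) * grad           # first moment
    v = b2 * v + (1 - b2) * grad ** 2      # second moment
    m_hat = m / (1 - b1 ** t)              # bias correction
    v_hat = v / (1 - b2 ** t)
    return theta - lr * m_hat / (np.sqrt(v_hat) + eps), m, v

# thanks to bias correction, the very first step has size ~lr
# regardless of the gradient's scale
theta, m, v = 1.0, 0.0, 0.0
theta, m, v = adam_step(theta, m, v, grad=2.0, t=1)
```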
Common optimizer interview questions
- What is the difference between SGD with momentum and Adam?
- Why does AdaGrad's learning rate decrease over time?
- Explain the purpose of bias correction in Adam.
- How does weight decay differ in Adam vs AdamW?
- When might you prefer SGD over Adam?
- What is the role of the ε term in Adam/RMSprop?