Optimizers MCQ · test your deep learning optimization knowledge
From SGD to AdamW – 15 questions covering gradient descent variants, adaptive methods, and learning rate schedules.
Optimizers: the engines of deep learning
Optimizers update neural network weights to minimize the loss function. This MCQ covers classical (SGD, Momentum, NAG), adaptive (AdaGrad, RMSprop), and hybrid (Adam, AdamW, Nadam) methods, as well as learning rate schedules and convergence properties.
Why optimizer choice matters
The right optimizer can speed up convergence, help escape saddle points and poor local minima, and improve final performance. Adaptive methods like Adam are often good defaults, but SGD with momentum often generalizes better.
Optimizers glossary – key concepts
SGD (Stochastic Gradient Descent)
Updates weights using the gradient of a mini‑batch: θ = θ - η·∇L(θ). Simple and memory-cheap, but convergence can be slow and noisy.
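A minimal sketch of the update in NumPy (the function name and toy loss are illustrative, not from a specific library):

```python
import numpy as np

def sgd_step(theta, grad, lr):
    # theta = theta - eta * grad(L)
    return theta - lr * grad

# toy example: minimize L(theta) = theta^2, so grad = 2*theta
theta = np.array([1.0])
for _ in range(50):
    theta = sgd_step(theta, 2 * theta, lr=0.1)
# theta has moved close to the minimum at 0
```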
Momentum
Accumulates past gradients to smooth updates and accelerate convergence. v = βv + ∇L; θ = θ - ηv.
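A sketch of the same convention in NumPy (the velocity buffer `v` persists across steps; names are illustrative):

```python
import numpy as np

def momentum_step(theta, v, grad, lr=0.1, beta=0.9):
    # v = beta*v + grad ; theta = theta - lr*v
    v = beta * v + grad
    return theta - lr * v, v

# same toy quadratic: grad of theta^2 is 2*theta
theta, v = np.array([1.0]), np.array([0.0])
for _ in range(200):
    theta, v = momentum_step(theta, v, 2 * theta)
```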
AdaGrad
Adapts the learning rate per parameter using the running sum of squared gradients. Good for sparse features, but the effective learning rate decays toward zero, which can stall training.
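A sketch showing the decay in action: with a constant gradient, the effective step shrinks like lr/√t (names and constants are illustrative):

```python
import numpy as np

def adagrad_step(theta, g_sum, grad, lr=0.5, eps=1e-8):
    g_sum = g_sum + grad ** 2          # lifetime sum of squared gradients
    return theta - lr * grad / (np.sqrt(g_sum) + eps), g_sum

theta, g_sum = 0.0, 0.0
steps = []
for _ in range(100):
    new_theta, g_sum = adagrad_step(theta, g_sum, 1.0)
    steps.append(theta - new_theta)    # record the step size taken
    theta = new_theta
# steps[0] ~ 0.5 but steps[99] ~ 0.05: the step size keeps shrinking
```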
RMSprop
Uses an exponential moving average of squared gradients to normalize updates, mitigating AdaGrad's vanishing learning rate.
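A sketch contrasting with AdaGrad: because `s` is a moving average rather than a sum, the step size stabilizes near lr instead of decaying to zero (names are illustrative):

```python
import numpy as np

def rmsprop_step(theta, s, grad, lr=0.01, rho=0.9, eps=1e-8):
    s = rho * s + (1 - rho) * grad ** 2    # moving average, not a lifetime sum
    return theta - lr * grad / (np.sqrt(s) + eps), s

# with a constant gradient, s approaches grad^2, so the step approaches lr
theta, s = 0.0, 0.0
for _ in range(200):
    prev = theta
    theta, s = rmsprop_step(theta, s, 1.0)
```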
Adam (Adaptive Moment Estimation)
Combines momentum (first moment) and RMSprop (second moment) with bias correction. Popular default.
AdamW
Adam with decoupled weight decay: the decay is applied directly to the weights rather than added to the gradient, fixing the interaction between L2 regularization and Adam's adaptive scaling. Often yields better generalization.
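A sketch of the decoupling (names are illustrative; note the decay term sits outside the adaptive normalization):

```python
import numpy as np

def adamw_step(theta, m, v, grad, t, lr=1e-3, b1=0.9, b2=0.999,
               eps=1e-8, wd=0.01):
    m = b1 * m + (1 - b1) * grad
    v = b2 * v + (1 - b2) * grad ** 2
    m_hat = m / (1 - b1 ** t)              # bias correction
    v_hat = v / (1 - b2 ** t)
    # weight decay applied directly to theta, outside the adaptive
    # normalization -- unlike L2 regularization added to the gradient
    theta = theta - lr * (m_hat / (np.sqrt(v_hat) + eps) + wd * theta)
    return theta, m, v

# with a zero gradient, only the decay term moves the weight
theta, m, v = 1.0, 0.0, 0.0
theta, m, v = adamw_step(theta, m, v, grad=0.0, t=1)
```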
Nesterov Accelerated Gradient
"Look ahead" version of momentum: computes gradient at approximate future position.
Learning rate schedules
Step decay, exponential decay, cosine annealing – adjust learning rate during training.
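Two of these schedules as minimal sketches (function names and constants are illustrative):

```python
import math

def step_decay(lr0, step, drop=0.5, every=30):
    # multiply the learning rate by `drop` every `every` steps
    return lr0 * (drop ** (step // every))

def cosine_annealing(lr0, step, total, lr_min=0.0):
    # smoothly anneal from lr0 down to lr_min over `total` steps
    return lr_min + 0.5 * (lr0 - lr_min) * (1 + math.cos(math.pi * step / total))
```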
# Adam update rule (simplified)
m = β1*m + (1-β1)*∇L            # first moment
v = β2*v + (1-β2)*(∇L)²         # second moment
m_hat = m/(1-β1^t)              # bias correction
v_hat = v/(1-β2^t)
θ = θ - η * m_hat/(√v_hat + ε)
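A runnable NumPy version of the Adam update rule (names are illustrative):

```python
import numpy as np

def adam_step(theta, m, v, grad, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    m = b1 * m + (1 - b1) * grad           # first moment
    v = b2 * v + (1 - b2) * grad ** 2      # second moment
    m_hat = m / (1 - b1 ** t)              # bias correction
    v_hat = v / (1 - b2 ** t)
    return theta - lr * m_hat / (np.sqrt(v_hat) + eps), m, v

# thanks to bias correction, the very first step has size ~lr
# regardless of the gradient's scale
theta, m, v = 1.0, 0.0, 0.0
theta, m, v = adam_step(theta, m, v, grad=2.0, t=1)
```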
Common optimizer interview questions
- What is the difference between SGD with momentum and Adam?
- Why does AdaGrad's learning rate decrease over time?
- Explain the purpose of bias correction in Adam.
- How does weight decay differ in Adam vs AdamW?
- When might you prefer SGD over Adam?
- What is the role of the ε term in Adam/RMSprop?