Interview Q&A: 100 Questions
Neural Network Fundamentals — Interview Q&A
Neural networks, activation functions, loss functions, backpropagation, and optimizers for deep learning.
Neural Networks: 20 Interview Questions
1
What is a perceptron? How is it different from a neuron in deep learning?
⚡ Easy
Answer: A perceptron is the simplest artificial neuron, invented by Frank Rosenblatt. It computes a weighted sum of its inputs plus a bias and passes the result through a step activation function, producing a binary output (0 or 1). Modern deep learning neurons use differentiable activation functions (ReLU, Sigmoid, Tanh) so they can be trained with backpropagation, and they are stacked in multiple layers.
Perceptron: Step function, binary output, linear separator
Modern Neuron: Differentiable, continuous output, stackable
2
Why do we need non-linear activation functions?
⚡ Easy
Answer: Without non-linearity, stacking multiple linear layers collapses into a single linear transformation. Non-linear activations (ReLU, sigmoid, tanh) allow neural networks to approximate any complex, non-linear function (universal approximation theorem).
# Linear composition: W2(W1*x) = (W2*W1)*x → Still linear!
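The collapse can be checked numerically. This is a minimal sketch assuming NumPy; the layer shapes and random weights are arbitrary illustration values, not part of the original answer.

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(size=(4, 3))   # first linear layer: 3 -> 4
W2 = rng.normal(size=(2, 4))   # second linear layer: 4 -> 2
x = rng.normal(size=3)

two_layers = W2 @ (W1 @ x)     # two stacked linear layers, no activation
one_layer = (W2 @ W1) @ x      # the single equivalent linear layer

# With a ReLU in between, the network is no longer a single matrix product.
with_relu = W2 @ np.maximum(0, W1 @ x)

collapsed = np.allclose(two_layers, one_layer)
print("stacked linear layers collapse:", collapsed)
```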
3
Compare ReLU, Sigmoid, and Tanh activations. When to use each?
📊 Medium
Answer:
- ReLU (max(0,x)): Default for hidden layers. Fast, sparse, mitigates vanishing gradient. Dead neurons issue.
- Sigmoid (0 to 1): Output layer for binary classification. Prone to vanishing gradient.
- Tanh (-1 to 1): Zero-centered, often used in RNNs/classical nets. Still suffers saturation.
✅ ReLU: Most common for CNNs/Transformers
⚠️ Sigmoid/Tanh: Used in specific gates (LSTM) or binary output
4
Explain backpropagation in simple terms.
📊 Medium
Answer: Backpropagation computes gradients of the loss with respect to each weight using the chain rule. It propagates error backward from output to input, layer by layer. These gradients are used by optimizers (SGD, Adam) to update weights and minimize loss.
∂L/∂w = ∂L/∂ŷ · ∂ŷ/∂z · ∂z/∂w
5
What is vanishing gradient? How do you fix it?
🔥 Hard
Answer: Vanishing gradient occurs when gradients become extremely small in early layers, preventing learning. Causes: deep networks with sigmoid/tanh. Fixes: Use ReLU, residual connections (ResNet), batch normalization, proper weight initialization (Xavier/He), LSTM gates.
Helps: ReLU, ResNet, BatchNorm
Hurts: deep sigmoid stacks
6
What is the difference between batch gradient descent, SGD, and mini-batch?
⚡ Easy
Answer:
- Batch GD: Full dataset – accurate but slow, memory heavy.
- SGD: One sample at a time – fast updates, high variance.
- Mini-batch: Subset (e.g., 32, 64) – balance between speed and stability. Most common.
7
Explain Dropout. Why does it work?
📊 Medium
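Answer: Dropout randomly deactivates neurons during training with probability p. It prevents co-adaptation of neurons, forces redundant representations, and acts like an implicit ensemble of sub-networks. At inference, all neurons are used: classic dropout scales activations by the keep probability (1 - p), while the inverted dropout used by most frameworks scales by 1/(1 - p) during training so inference needs no rescaling.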
model.add(Dropout(0.5)) # 50% neurons dropped each step
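A minimal sketch of inverted dropout, the variant most frameworks implement, assuming NumPy. The function name `dropout_forward` is illustrative; `p` is the drop probability, and surviving activations are scaled by 1/(1 - p) during training so that inference is a plain identity.

```python
import numpy as np

def dropout_forward(x, p, training, rng):
    """Inverted dropout: p is the probability of dropping a unit."""
    if not training or p == 0.0:
        return x                        # inference: identity, no rescaling
    keep = 1.0 - p
    mask = rng.random(x.shape) < keep   # True for units that survive
    return x * mask / keep              # rescale so the expected activation is unchanged

rng = np.random.default_rng(0)
x = np.ones(100_000)
train_out = dropout_forward(x, p=0.5, training=True, rng=rng)
eval_out = dropout_forward(x, p=0.5, training=False, rng=rng)
print(train_out.mean(), eval_out.mean())
```

The training-time mean stays close to the input mean, which is why no correction is needed at inference.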
8
What is Batch Normalization? How does it help?
📊 Medium
Answer: BatchNorm normalizes layer outputs to zero mean and unit variance within each mini-batch. It stabilizes training, allows higher learning rates, reduces internal covariate shift, and acts as a regularizer.
9
What is weight initialization? Why is it important?
📊 Medium
Answer: Weight initialization sets initial values of weights. Poor init causes vanishing/exploding gradients. Xavier init for tanh/sigmoid, He init for ReLU. Proper init speeds convergence.
Xavier (Glorot): Var = 2/(n_in + n_out), often simplified to 1/n_in
He: Var = 2/n_in
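The two schemes can be sketched as follows, assuming NumPy and Gaussian draws. The function names are illustrative. Xavier is written in its Glorot form 2/(n_in + n_out), which reduces to 1/n_in when n_in = n_out.

```python
import numpy as np

def he_init(n_in, n_out, rng):
    # He: Var(w) = 2 / n_in, suited to ReLU layers
    return rng.normal(0.0, np.sqrt(2.0 / n_in), size=(n_in, n_out))

def xavier_init(n_in, n_out, rng):
    # Xavier/Glorot: Var(w) = 2 / (n_in + n_out), suited to tanh/sigmoid layers
    return rng.normal(0.0, np.sqrt(2.0 / (n_in + n_out)), size=(n_in, n_out))

rng = np.random.default_rng(0)
W_he = he_init(512, 256, rng)
W_xavier = xavier_init(512, 256, rng)
print(W_he.var(), W_xavier.var())
```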
10
What is the Universal Approximation Theorem?
🔥 Hard
Answer: A feedforward network with a single hidden layer and non-linear activation can approximate any continuous function on a compact domain, given enough neurons. Depth improves parameter efficiency, not just theoretical capacity.
11
What is the difference between epoch, batch, and iteration?
⚡ Easy
Answer:
- Epoch: One full pass of entire training data.
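- Batch: The subset of samples processed before one parameter update; the batch size is the number of samples in it.
- Iteration: One parameter update on one batch. Iterations per epoch = ceil(dataset size / batch size).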
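A small worked example of how the three terms relate. The dataset size, batch size, and epoch count are hypothetical values chosen for illustration.

```python
import math

num_samples = 50_000   # hypothetical dataset size
batch_size = 64        # samples per parameter update
epochs = 10            # full passes over the data

# The last batch of each epoch is smaller, hence the ceiling.
iterations_per_epoch = math.ceil(num_samples / batch_size)
total_iterations = iterations_per_epoch * epochs
print(iterations_per_epoch, total_iterations)
```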
12
What is cross-entropy loss? When do you use it?
⚡ Easy
Answer: Cross-entropy measures difference between predicted probability distribution and true labels. Used for classification (binary: binary cross-entropy, multi-class: categorical cross-entropy). Preferred over MSE for classification.
13
Explain underfitting and overfitting in neural networks.
⚡ Easy
Answer:
- Underfitting: Model too simple, high bias – fails on training data. Fix: increase capacity, train longer.
- Overfitting: Model memorizes noise, high variance – low train error, high test error. Fix: dropout, regularization, more data.
14
What is the role of the learning rate?
⚡ Easy
Answer: Learning rate controls step size during gradient descent. Too high: overshoot, divergence. Too low: slow convergence, gets stuck. Use learning rate schedules or adaptive optimizers (Adam).
15
Compare Adam vs SGD optimizer.
📊 Medium
Answer:
- SGD: Simple, requires manual LR tuning, may need momentum.
- Adam: Adaptive LR + momentum, works well out-of-box, less sensitive to hyperparameters. Tends to generalize slightly worse than tuned SGD.
16
What is gradient clipping? When is it needed?
📊 Medium
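Answer: Gradient clipping limits gradient magnitude, either by capping each element (clip by value) or by rescaling the whole gradient when its norm exceeds a threshold (clip by norm). It is applied after backprop and before the optimizer step. Prevents exploding gradients, common in RNNs and Transformers. Maintains stable training.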
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
17
What is the difference between a neural network and a deep neural network?
⚡ Easy
Answer: "Neural network" is a broad term. Deep neural network (DNN) typically has more than 2-3 hidden layers. Depth allows hierarchical feature learning. Shallow nets may suffice for simple tasks.
18
What are skip connections? Why are they useful?
🔥 Hard
Answer: Skip connections (ResNet) add input of a layer to its output (F(x) + x). They alleviate vanishing gradient, enable training of very deep networks (>100 layers), and act as gradient superhighways.
19
What is the F1 score? When is it better than accuracy?
📊 Medium
Answer: F1 is harmonic mean of precision and recall. Better than accuracy for imbalanced datasets. For example, fraud detection (99.9% negative) – accuracy high but model useless; F1 reflects minority class performance.
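The gap between accuracy and F1 can be illustrated with a toy confusion matrix. The counts below are hypothetical, and `precision_recall_f1` is an illustrative helper, not a library function.

```python
def precision_recall_f1(tp, fp, fn):
    """Return (precision, recall, F1); each is 0.0 when undefined."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

# Hypothetical fraud dataset: 1000 samples, 10 of them positive.
# A model that predicts "negative" for every sample:
tp, fp, fn, tn = 0, 0, 10, 990
accuracy = (tp + tn) / (tp + fp + fn + tn)
_, _, f1 = precision_recall_f1(tp, fp, fn)
print(f"accuracy={accuracy:.2f}, f1={f1:.2f}")
```

Accuracy looks excellent while F1 is zero, because the model never finds a positive.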
20
How do you decide the number of layers and neurons?
🔥 Hard
Answer: No fixed rule. Start with architecture proven for similar tasks. Use validation error: increase capacity until overfitting, then add regularization. Heuristic: more data → deeper/wider. Automated via hyperparameter search (Grid/Random/Bayesian).
Start simple, scale up
Avoid guessing randomly
Deep Learning Activation Functions: 20 Interview Questions
21
What is an activation function? Why is it non-linear?
⚡ Easy
Answer: An activation function decides whether a neuron should fire. It introduces non-linearity – without it, stacked linear layers collapse into a single linear transformation. Non-linearity enables neural networks to approximate arbitrary complex functions (universal approximation theorem).
y = activation(W·x + b) ; Without activation: y = W₂(W₁x + b₁) + b₂ = W'x + b' (linear)
22
Explain Sigmoid activation. Where is it used? Main drawback?
⚡ Easy
Answer: Sigmoid: σ(x) = 1/(1+e⁻ˣ), output in (0,1). Used in binary classification output layer (probability). Also in gates of LSTM. Drawbacks: Vanishing gradient (saturated regions), not zero-centered, exp() computation.
Pros: smooth, probabilistic
Cons: vanishing gradient, not zero-centered
23
How is Tanh different from Sigmoid? Why is Tanh preferred in hidden layers?
📊 Medium
Answer: Tanh = (eˣ - e⁻ˣ)/(eˣ + e⁻ˣ), output range (-1,1). It is zero-centered, so the inputs to the next layer are not all the same sign, which helps gradient flow. It still suffers from vanishing gradient when saturated. Preferred over sigmoid in hidden units because zero-centered outputs avoid the same-sign gradient bias.
sigmoid: (0,1) ; tanh: (-1,1) (zero-centered)
24
What is ReLU? Why is it so widely used?
⚡ Easy
Answer: ReLU = max(0, x). Advantages: Non-saturating, cheap computation, sparse activation, mitigates vanishing gradient. It is the default for most CNNs and deep architectures.
import numpy as np
def relu(x): return np.maximum(0, x) # derivative: 1 if x>0 else 0
25
What is the "dying ReLU" problem? Solutions?
📊 Medium
Answer: When many neurons get stuck in negative region and output 0 for all inputs – gradient zero, never recover. Solutions: Leaky ReLU, PReLU, ELU, or use smaller learning rate.
Problem: dead neurons
Fix: Leaky ReLU
26
Differentiate Leaky ReLU, PReLU, and RReLU.
🔥 Hard
Answer: Leaky ReLU: f(x)=max(αx, x) with α fixed (0.01). PReLU: α learned. RReLU: α randomly sampled during training. All fix dying ReLU.
27
What are ELU and SELU? When to use SELU?
🔥 Hard
Answer: ELU: f(x)= x if x>0 else α(eˣ-1). Smooth, negative saturation, robust to noise. SELU is self-normalizing; with proper initialization, outputs automatically tend to zero mean/unit variance – works for deep FCNs.
28
Explain Softmax. Why use exponentials?
📊 Medium
Answer: Softmax converts logits to a probability distribution: exp(z_i) / Σⱼ exp(z_j). Exponentials amplify differences and ensure positivity. Used in multi-class output layer.
P(y=i) = exp(z_i) / Σⱼ exp(z_j)
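In practice the formula is usually computed after subtracting the largest logit, which leaves the result unchanged but avoids overflow in exp. A minimal sketch assuming NumPy, with illustrative logits:

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1-D array of logits."""
    z = np.asarray(z, dtype=float)
    shifted = z - z.max()      # softmax(z) == softmax(z - c) for any constant c
    exp = np.exp(shifted)
    return exp / exp.sum()

probs = softmax([2.0, 1.0, 0.1])
big = softmax([1000.0, 1001.0, 1002.0])   # would overflow without the shift
print(probs, big)
```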
29
What are Swish and GeLU? Why do they outperform ReLU in Transformers?
🔥 Hard
Answer: Swish = x·sigmoid(x) (smooth, non-monotonic). GeLU = x·Φ(x) (Gaussian CDF). Both allow negative values with a smooth bump. Used in BERT, GPT, ViT; provide better gradient flow and often higher accuracy.
30
Which activation functions are prone to vanishing gradient? Why?
📊 Medium
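Answer: Sigmoid and Tanh: their gradients approach 0 for large |x| (saturation). ReLU does not saturate for positive inputs, and leaky variants keep a non-zero gradient for negative inputs. However, ReLU can have dead neurons; ELU/Swish have small negative-side gradients but not exactly zero.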
31
What activation function for regression output?
⚡ Easy
Answer: Linear (identity). No activation or f(x)=x. Allows unbounded values. For positive-only regression, use ReLU or Softplus.
32
What is Softplus? Relation to ReLU?
⚡ Easy
Answer: Softplus = ln(1+eˣ). Smooth approximation of ReLU. Used in some variational autoencoders (variance). Never zero, but computationally heavier.
33
Output activation for binary classification?
⚡ Easy
Answer: Sigmoid (single unit). Gives probability P(y=1). Loss: binary cross-entropy.
34
Activation for multi-label classification?
📊 Medium
Answer: Sigmoid per output (independent probabilities). Not softmax (which forces one class). Use sigmoid + binary cross-entropy.
35
Why not use step function as activation?
📊 Medium
Answer: Step function derivative is 0 everywhere (except discontinuity). Gradient descent can't learn. Need differentiable (or subgradient) functions.
36
Why is zero-centered activation desirable?
🔥 Hard
Answer: If activations are always positive (sigmoid, ReLU), gradients of weights for a given neuron are all same sign. This can cause zig-zagging updates and slower convergence. Zero-centered (tanh) avoids this.
37
Write derivative of ReLU and Leaky ReLU.
📊 Medium
ReLU': 1 if x>0 else 0. Leaky ReLU': 1 if x>0 else α (e.g., 0.01)
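These derivatives can be written directly in NumPy. This is a sketch; the function names are illustrative, and the value at exactly x = 0 follows the common subgradient convention of using the negative-side slope.

```python
import numpy as np

def relu_grad(x):
    # 1 where x > 0, else 0 (subgradient 0 chosen at x == 0)
    return (x > 0).astype(float)

def leaky_relu(x, alpha=0.01):
    return np.where(x > 0, x, alpha * x)

def leaky_relu_grad(x, alpha=0.01):
    # 1 where x > 0, else alpha
    return np.where(x > 0, 1.0, alpha)

x = np.array([-2.0, -0.5, 0.5, 3.0])
print(relu_grad(x), leaky_relu_grad(x))
```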
38
Which activations are used in RNNs/LSTMs? Why?
📊 Medium
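Answer: LSTM: sigmoid for gates (0-1 control), tanh for cell/hidden updates (bounded, zero-centered). The gated, additive cell-state path mitigates vanishing gradients; exploding gradients are usually handled with gradient clipping. Modern RNNs sometimes use ReLU with careful init.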
39
What is Maxout activation? Pros/cons?
🔥 Hard
Answer: Maxout: takes max of k linear combinations. Can approximate any convex function. Does not saturate, so it avoids vanishing gradients, but multiplies the parameter count by k (doubles it for k=2). Rarely used today.
40
Heuristic: which activation for hidden layers?
⚡ Easy
Answer: Default: ReLU. If dying ReLU, try Leaky ReLU/ELU. For very deep nets: Swish/GeLU. For regression head: linear. For self-normalizing nets: SELU.
ReLU → Leaky ReLU → ELU → Swish (increasing complexity)
Deep Learning Loss Functions: 20 Interview Questions
41
What is a loss function in deep learning?
⚡ Easy
Answer: A loss function (cost/objective) quantifies the error between model predictions and true targets. Training minimizes this loss via gradient descent. Choice of loss depends on task: regression (L1, L2), classification (cross-entropy), ranking (hinge), etc.
ℒ(ŷ, y) : measure of "how wrong" the model is.
42
Compare MSE and MAE. When to use each?
📊 Medium
Answer: MSE = mean( (y-ŷ)² ), MAE = mean( |y-ŷ| ). MSE penalizes large errors more (squared), sensitive to outliers. MAE is robust to outliers. Use MSE when outliers are rare/need to be emphasized; MAE when robustness is needed. MSE gradient magnitude ∝ error, MAE gradient constant (±1).
MSE pros: smooth, convex
MSE cons: sensitive to outliers
43
Why use cross-entropy for classification, not MSE?
🔥 Hard
Answer: Cross-entropy with softmax/sigmoid gives stronger gradients when prediction is wrong. MSE + sigmoid saturates quickly – vanishing gradient. CE is also probabilistic (minimizes KL divergence), directly optimizes log-likelihood. CE is convex in parameters for linear models.
Binary CE: -[y log(p) + (1-y) log(1-p)] vs MSE: (y-p)²
44
Binary vs Categorical Cross-Entropy: difference?
⚡ Easy
Answer: Binary CE for 2 classes (single sigmoid output). Categorical CE for ≥3 classes (softmax output). For multi-label (multiple binary tasks), use binary CE per output.
45
What is Hinge loss? Where is it used?
📊 Medium
Answer: Hinge: max(0, 1 - y·ŷ) for y ∈ {-1,1}. Used in SVMs and max-margin classifiers. Encourages correct classification with a margin. Not differentiable at margin; subgradient used. Less common in deep nets but used in Siamese nets (contrastive hinge).
L = Σ max(0, 1 - y_i * f(x_i))
46
Explain Huber loss. When is it useful?
🔥 Hard
Answer: Huber loss = MSE for small error, MAE for large error (quadratic near zero, linear otherwise). Smooth, less sensitive to outliers than MSE, differentiable. Used in robust regression (e.g., object detection bounding boxes – Smooth L1 is similar).
# Smooth L1 (similar to Huber)
if |x| < 1: 0.5 * x² else |x| - 0.5
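A runnable version of the piecewise rule, generalized with a threshold `delta`. This is a sketch assuming NumPy; with `delta=1.0` it matches the Smooth L1 form above. The parameter name `delta` is an assumption introduced for illustration.

```python
import numpy as np

def huber(error, delta=1.0):
    """Element-wise Huber loss: quadratic for |error| <= delta, linear beyond."""
    abs_err = np.abs(error)
    quadratic = 0.5 * error ** 2
    linear = delta * (abs_err - 0.5 * delta)   # matches the quadratic at |error| == delta
    return np.where(abs_err <= delta, quadratic, linear)

errors = np.array([-3.0, -0.5, 0.0, 0.5, 3.0])
losses = huber(errors)
print(losses)
```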
47
KL Divergence vs Cross-Entropy: relation?
🔥 Hard
Answer: Cross-Entropy = H(p,q) = H(p) + KL(p||q). Minimizing cross-entropy is equivalent to minimizing KL divergence if p is fixed (target distribution). In VAEs, we minimize KL(q(z|x) || p(z)) to regularize latent space.
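The identity H(p,q) = H(p) + KL(p||q) can be verified numerically. A minimal sketch using only the standard library; the two distributions are illustrative values.

```python
import math

def entropy(p):
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)

def kl_divergence(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.7, 0.2, 0.1]   # target distribution (illustrative)
q = [0.5, 0.3, 0.2]   # model distribution (illustrative)

# Difference between the two sides of the identity; should be ~0.
gap = cross_entropy(p, q) - (entropy(p) + kl_divergence(p, q))
print(gap)
```

Because H(p) does not depend on the model, any change in q that lowers the cross-entropy lowers the KL divergence by the same amount.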
48
What are Contrastive and Triplet losses?
🔥 Hard
Answer: Contrastive: pulls positive pairs together, pushes negative apart (margin). Triplet: anchor, positive, negative; loss = max(0, d(a,p) - d(a,n) + margin). Used in face recognition (FaceNet), siamese networks, self-supervised learning (SimCLR).
49
What is Focal Loss? Where is it used?
🔥 Hard
Answer: Focal loss = -(1-p_t)^γ * log(p_t). Modifies cross-entropy to down-weight easy examples, focus on hard misclassified. Solves class imbalance in object detection (RetinaNet). γ=2 common.
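A sketch of the binary form of this loss, assuming NumPy. It omits the optional class-balancing weight that is sometimes added, and the function name and `eps` clipping are assumptions made for numerical safety.

```python
import numpy as np

def focal_loss(p, y, gamma=2.0, eps=1e-12):
    """Binary focal loss. p: predicted P(y=1); y: labels in {0, 1}."""
    p = np.clip(np.asarray(p, dtype=float), eps, 1.0 - eps)
    y = np.asarray(y)
    p_t = np.where(y == 1, p, 1.0 - p)          # probability given to the true class
    return -((1.0 - p_t) ** gamma) * np.log(p_t)

easy = focal_loss(0.95, 1)                 # confident and correct: heavily down-weighted
hard = focal_loss(0.10, 1)                 # confident and wrong: stays large
ce_easy = focal_loss(0.95, 1, gamma=0.0)   # gamma = 0 recovers plain cross-entropy
print(float(easy), float(hard), float(ce_easy))
```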
50
What is CTC loss? Why is it useful?
🔥 Hard
Answer: Connectionist Temporal Classification (CTC) aligns input sequences to output sequences without pre-alignment. Used in speech recognition, OCR. It sums probabilities over all possible alignments via dynamic programming.
51
Heuristics: choose L1, L2, or Huber for regression?
📊 Medium
Answer: L2 (MSE): default, but outlier-sensitive. L1 (MAE): robust, but slower convergence. Huber: best of both – quadratic for small errors, linear for large. Smooth L1 used in detectors.
52
Why is cross-entropy always ≥ 0?
📊 Medium
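Answer: Cross-entropy = -Σ p(x) log q(x). Since p(x) ≥ 0 and log q(x) ≤ 0 (because q(x) ≤ 1), every term p(x) log q(x) is non-positive; the minus sign makes the sum non-negative. With one-hot targets it is zero only if the model assigns probability 1 to the correct class; in general H(p,q) ≥ H(p), with equality when q = p.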
53
Relation between perplexity and cross-entropy?
📊 Medium
Answer: Perplexity = 2^{H(p,q)} where H is cross-entropy (if using log base 2). It measures how "surprised" the model is. Lower perplexity = better language model.
54
NLL vs Cross-Entropy – same?
⚡ Easy
Answer: For classification with one-hot targets, categorical cross-entropy = negative log-likelihood. NLL is just -log(p(y|x)). In PyTorch, `CrossEntropyLoss` = LogSoftmax + NLLLoss.
55
What is Dice loss? Where is it used?
🔥 Hard
Answer: Dice loss = 1 - (2|X∩Y|)/(|X|+|Y|), i.e., one minus the Dice coefficient. It is a differentiable overlap measure closely related to IoU. Used in medical image segmentation, imbalanced data. Handles pixel-wise class imbalance well.
56
Why use log in cross-entropy loss?
📊 Medium
Answer: Log converts multiplicative probabilities to additive; numerically stable. Also, maximizing likelihood = minimizing negative log-likelihood. Log loss heavily penalizes very wrong confident predictions.
57
Compare gradients of MSE and MAE.
📊 Medium
∂MSE/∂ŷ = 2(ŷ - y) ; ∂MAE/∂ŷ = sign(ŷ - y)
MSE gradient scales with error; MAE gradient magnitude constant ±1. MSE converges faster but outlier-sensitive.
58
Loss function for ordinal regression?
🔥 Hard
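Answer: CORAL loss (Consistent Rank Logits), which reduces the problem to K-1 binary "is the rank greater than k?" tasks with shared weights, or a cumulative link (ordered threshold) model. Alternatively, treat as regression with rounding, or use MSE/MAE if the scale is meaningful.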
59
What is energy-based loss?
🔥 Hard
Answer: Energy-based models (EBM) assign scalar energy to configurations. Loss designed to push down energy of correct answer, pull up incorrect. Example: contrastive loss, hinge loss for EBM.
60
Designing a custom loss: key requirements?
🔥 Hard
Answer: Must be differentiable (almost everywhere), should correlate with evaluation metric, numerically stable, efficient. Also consider convexity (not strictly required) and gradient behavior.
Example: custom IoU loss, focal loss, Huber.
Backpropagation: 20 Interview Questions & Intuition
61
What is backpropagation? Explain the intuition.
⚡ Easy
Answer: Backpropagation computes the gradient of the loss function with respect to every weight using the chain rule. It propagates error signals backward from output to input. Intuition: each neuron's contribution to the final error is measured, then weights are adjusted to reduce loss.
∂L/∂w = ∂L/∂out · ∂out/∂z · ∂z/∂w (chain rule)
62
How does chain rule work in backpropagation?
📊 Medium
Answer: Chain rule multiplies local gradients along the path from loss to weight. For a composition f(g(x)), derivative = f'(g(x))·g'(x). In neural nets, gradients are multiplied backward layer by layer.
# Example: z = wx + b, a = σ(z), L = (a-y)²
dL/da = 2(a-y); da/dz = σ'(z); dz/dw = x → dL/dw = dL/da * da/dz * dz/dw
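The same example can be run end to end and checked against a finite-difference estimate. A minimal sketch using only the standard library; the numeric values of w, b, x, y are illustrative.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def loss(w, b, x, y):
    return (sigmoid(w * x + b) - y) ** 2

def grad_w(w, b, x, y):
    """dL/dw via the chain rule: dL/da * da/dz * dz/dw."""
    a = sigmoid(w * x + b)
    dL_da = 2.0 * (a - y)
    da_dz = a * (1.0 - a)     # sigma'(z) = sigma(z) * (1 - sigma(z))
    dz_dw = x
    return dL_da * da_dz * dz_dw

w, b, x, y = 0.5, -0.2, 1.5, 1.0   # illustrative values
analytic = grad_w(w, b, x, y)

eps = 1e-6
numeric = (loss(w + eps, b, x, y) - loss(w - eps, b, x, y)) / (2 * eps)
print(analytic, numeric)
```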
63
Difference between forward pass and backward pass?
⚡ Easy
Answer: Forward pass computes predictions and caches intermediate activations. Backward pass computes gradients using cached values and chain rule. Forward is inference; backward is learning.
64
What causes vanishing gradient in backpropagation?
📊 Medium
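Answer: Backprop multiplies local derivatives layer by layer. The sigmoid derivative is at most 0.25 and the tanh derivative at most 1 (both near 0 when saturated), so the repeated product shrinks exponentially toward the early layers. The deeper the network, the more factors in the product.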
Fix: ReLU, residual connections, batch norm
Cause: sigmoid/tanh in hidden layers
65
Explain exploding gradient. How to mitigate?
📊 Medium
Answer: Gradients become exponentially large due to large weights >1 or poor initialization. Causes unstable updates. Solutions: gradient clipping, weight regularization, careful initialization (Xavier/He).
Clip: if ||g|| > threshold, g = threshold * g / ||g||
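The clipping rule above, applied to the global norm over several parameter gradients, might look like this. A sketch assuming NumPy; `clip_by_global_norm` is an illustrative name, not a framework API.

```python
import numpy as np

def clip_by_global_norm(grads, threshold):
    """Rescale a list of gradient arrays so their joint L2 norm is <= threshold."""
    total_norm = np.sqrt(sum(float(np.sum(g ** 2)) for g in grads))
    if total_norm <= threshold:
        return grads, total_norm
    scale = threshold / total_norm          # same factor for every array: direction preserved
    return [g * scale for g in grads], total_norm

grads = [np.array([3.0, 4.0]), np.array([12.0])]   # global norm = 13
clipped, norm_before = clip_by_global_norm(grads, threshold=1.0)
norm_after = np.sqrt(sum(float(np.sum(g ** 2)) for g in clipped))
print(norm_before, norm_after)
```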
66
What is a computational graph? How is it used in backprop?
🔥 Hard
Answer: Directed acyclic graph where nodes are operations/ variables, edges define dependencies. Backprop traverses graph in reverse topological order, multiplying gradients via chain rule. Frameworks (TF, PyTorch) build autograd on this.
67
How does automatic differentiation (autograd) relate to backprop?
🔥 Hard
Answer: Backprop is a special case of reverse-mode autodiff. It efficiently computes gradients of scalar loss w.r.t many parameters in one forward+backward pass. Autograd builds the graph dynamically (PyTorch) or statically (TF1).
68
Why does backprop multiply gradients? Why not add?
🔥 Hard
Answer: Chain rule for composite functions is multiplicative. Each layer's effect compounds; if one layer has zero gradient, whole branch dies. Multiplication reflects dependency strength. Addition would be for parallel paths (e.g., skip connections).
69
Why can't we initialize all weights to zero? Role of backprop?
📊 Medium
Answer: Zero init makes neurons symmetric – same gradient, same updates, no feature diversity. Backprop would compute identical gradients for all neurons in a layer, preventing learning. Random init breaks symmetry.
70
What is gradient checking? How to implement?
🔥 Hard
Answer: Numerically approximate gradient: (L(θ+ε)-L(θ-ε))/(2ε) and compare with analytical backprop gradient. Used for debugging. Must be disabled in training (expensive).
eps = 1e-7; numeric_grad = (loss(w+eps) - loss(w-eps)) / (2*eps)
71
How does backprop work through max pooling?
🔥 Hard
Answer: Gradient only passes to the neuron that achieved the max (argmax). Others get zero gradient. It's like a switch: route error to winner.
72
Derive gradient of softmax + cross-entropy loss.
🔥 Hard
Answer: Combined gradient simplifies to (p - y) where p is softmax output, y is one-hot target. Very elegant and numerically stable.
∂L/∂z_i = p_i - y_i
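The result p - y can be confirmed with a finite-difference check. A sketch assuming NumPy; the logits and target are illustrative.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def ce_loss(z, y):
    return -np.sum(y * np.log(softmax(z)))

z = np.array([1.0, -0.5, 2.0])   # logits (illustrative)
y = np.array([0.0, 0.0, 1.0])    # one-hot target
analytic = softmax(z) - y        # the closed-form gradient

# Central finite differences, one logit at a time.
eps = 1e-6
numeric = np.zeros_like(z)
for i in range(len(z)):
    step = np.zeros_like(z)
    step[i] = eps
    numeric[i] = (ce_loss(z + step, y) - ce_loss(z - step, y)) / (2 * eps)
print(analytic, numeric)
```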
73
Why are in-place operations (e.g., .relu_()) problematic for backprop?
📊 Medium
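Answer: Backprop needs tensors saved during the forward pass to compute gradients. An in-place op overwrites its input, so if that tensor was saved for another operation's backward, the gradient would be wrong. PyTorch tracks tensor versions and raises a RuntimeError when a saved tensor has been modified in place; in-place ReLU is commonly used because the ReLU gradient can be computed from its output.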
74
Can backprop compute second-order gradients? How?
🔥 Hard
Answer: Yes, via automatic differentiation on the gradient graph (e.g., PyTorch `torch.autograd.grad`). Used in meta-learning, Hessian-free optimization, etc.
75
Is backprop the same as reverse-mode autodiff?
📊 Medium
Answer: Backprop is the algorithm applied to neural nets; reverse-mode autodiff is the general technique. Backprop = reverse-mode AD applied to a scalar loss with caching.
76
Role of Jacobian matrix in backpropagation?
🔥 Hard
Answer: For vector functions, the local gradient is a Jacobian matrix (∂output/∂input). Backprop multiplies Jacobians along the path. In practice, frameworks use vector-Jacobian products (VJPs) for efficiency.
v^T · J (VJP) instead of full J
77
Define the error signal (δ) in backprop.
📊 Medium
Answer: δ_i^l = ∂L / ∂z_i^l (pre-activation at layer l). It represents how much the total loss changes when the pre-activation changes. Propagated backward: δ^l = ((W^{l+1})^T δ^{l+1}) ⊙ σ'(z^l), where W^{l+1} is the weight matrix of layer l+1.
78
What is Backpropagation Through Time (BPTT)?
📊 Medium
Answer: BPTT unfolds RNN through time steps, treats it as a deep network with shared weights. Gradients sum over time. Suffers from vanishing/exploding due to repeated multiplications. Truncated BPTT limits steps.
79
Why did greedy layerwise pretraining help backprop in early deep learning?
🔥 Hard
Answer: Initialized weights in a sensible region, avoiding vanishing gradients. Backprop then fine-tuned. Modern techniques (ReLU, batch norm, good init) made pretraining less critical.
80
How do skip connections (ResNet) help backpropagation?
📊 Medium
Answer: Skip connections create an alternative gradient highway – identity mapping. Gradient can flow directly through skip path, mitigating vanishing gradient and enabling very deep networks (>100 layers).
Gradient shortcut: for y = F(x) + x, ∂L/∂x = ∂L/∂y · (∂F/∂x + I); the identity term passes the gradient through unchanged
Deep Learning Optimizers: 20 Interview Questions
81
What is an optimizer in deep learning?
⚡ Easy
Answer: An optimizer is an algorithm that updates model parameters (weights) to minimize the loss function. It implements a variant of gradient descent, controlling learning rate, momentum, and adaptive per-parameter updates.
θ_{t+1} = θ_t - η · ∇L(θ_t) (SGD)
82
Vanilla SGD: pros and cons?
📊 Medium
Answer: Pros: Simple, memory efficient, generalizes well. Cons: Slow convergence, sensitive to learning rate, oscillations in ravines, struggles with sparse data.
Pros: generalizes, low memory
Cons: slow, plateau, oscillates
83
How does Momentum optimizer work?
📊 Medium
Answer: Accumulates a velocity vector in the direction of persistent gradients. Accelerates convergence, dampens oscillations. v_t = γ v_{t-1} + η∇L; θ -= v_t.
v_t = β·v_{t-1} + (1-β)·∇L; θ = θ - η·v_t (common formulation)
84
NAG vs standard Momentum: difference?
🔥 Hard
Answer: NAG computes the gradient at the "lookahead" position (θ - β·v_prev) instead of at the current θ. This gives a more accurate update, reducing oscillations. Often faster convergence.
# NAG pseudo-code
v_prev = v
v = β*v + η·∇L(θ - β*v_prev)
θ = θ - v
85
Explain AdaGrad. When is it useful?
📊 Medium
Answer: AdaGrad adapts learning rate per parameter: scales inversely with sqrt(sum of squared gradients). Good for sparse data (e.g., embeddings, NLP). Major drawback: learning rate monotonically decays to zero.
G_t = G_{t-1} + (∇L)²; θ_{t+1} = θ_t - η/(√G_t+ε) · ∇L
86
How does RMSprop improve AdaGrad?
📊 Medium
Answer: RMSprop uses exponentially moving average of squared gradients, not cumulative sum. Prevents learning rate vanishing. E[g²]_t = β·E[g²]_{t-1} + (1-β)·(∇L)². Step = η/√(E[g²]+ε)·∇L.
87
Describe Adam optimizer. Key components?
🔥 Hard
Answer: Adam = RMSprop + Momentum. Maintains first moment (mean) and second moment (uncentered variance) of gradients. Bias correction for initial steps. Default β1=0.9, β2=0.999, ε=1e-8. Popular due to fast convergence and robustness.
m_t = β1·m_{t-1} + (1-β1)·∇L; v_t = β2·v_{t-1} + (1-β2)·(∇L)²
m̂ = m/(1-β1^t); v̂ = v/(1-β2^t); θ = θ - η·m̂/(√v̂+ε)
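The update equations translate directly into code. A sketch assuming NumPy, run on a toy quadratic; `adam_step` is an illustrative name and the learning rate of 0.01 is chosen only so the toy problem converges quickly.

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; t is the 1-based step count used for bias correction."""
    m = beta1 * m + (1 - beta1) * grad            # first moment (mean of gradients)
    v = beta2 * v + (1 - beta2) * grad ** 2       # second moment (uncentered variance)
    m_hat = m / (1 - beta1 ** t)                  # bias-corrected estimates
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# Toy problem: minimize L(theta) = ||theta||^2, whose gradient is 2 * theta.
theta = np.array([1.0, -2.0])
m = np.zeros_like(theta)
v = np.zeros_like(theta)
start_loss = float(np.sum(theta ** 2))
for t in range(1, 501):
    theta, m, v = adam_step(theta, 2 * theta, m, v, t, lr=0.01)
end_loss = float(np.sum(theta ** 2))
print(start_loss, end_loss)
```

On the first step the bias-corrected ratio m̂/√v̂ is close to the sign of the gradient, so each parameter moves by roughly the learning rate regardless of gradient scale.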
88
AdamW vs Adam: what's the difference?
🔥 Hard
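Answer: AdamW decouples weight decay from the gradient-based update. In Adam with L2 regularization, the penalty term λθ is added to the gradient, so it passes through the moment estimates and is rescaled by the adaptive learning rate; AdamW instead subtracts η·λθ directly from the parameters. Leads to better generalization, widely used in Transformers (BERT, ViT).
Adam + L2: g = ∇L + λθ, then the usual Adam update on g | AdamW: θ = θ - η·m̂/(√v̂+ε) - η·λθ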
89
What is Nadam? Advantage?
🔥 Hard
Answer: Nadam = Adam + Nesterov momentum. It applies Nesterov lookahead on top of Adam's momentum. Sometimes converges slightly faster than Adam.
90
AdaBelief – how is it different from Adam?
🔥 Hard
Answer: AdaBelief modifies Adam: second moment v_t = β2·v_{t-1} + (1-β2)·(∇L - m_t)². Stepsize is η/(√v̂+ε)·m̂. Intuition: adapts to "belief" in observed gradient direction. More stable, often better generalization.
91
What is Lion optimizer? Key idea?
🔥 Hard
Answer: Lion (Evolved Sign Momentum) uses sign of momentum and gradient combination. Update: θ = θ - η·sign(β1·m + (1-β1)·∇L). Memory efficient, outperforms AdamW in some large-scale tasks.
92
Common learning rate schedules? When to use?
📊 Medium
Answer: Step decay (reduce by factor every few epochs), exponential decay, cosine annealing, linear warmup. Warmup helps Adam in early training (prevents large variance). Cosine decay popular in Transformers.
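Linear warmup followed by cosine decay can be sketched as a single function of the step index. This uses only the standard library; `lr_at` and its parameters are illustrative names, and the step counts are hypothetical.

```python
import math

def lr_at(step, base_lr, warmup_steps, total_steps, min_lr=0.0):
    """Learning rate at a 0-based step: linear warmup, then cosine decay to min_lr."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * progress))

schedule = [lr_at(s, base_lr=1e-3, warmup_steps=100, total_steps=1000)
            for s in range(1000)]
print(schedule[0], schedule[99], schedule[-1])
```

The rate rises to `base_lr` at the end of warmup and then falls smoothly toward `min_lr` by the last step.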
93
Why does SGD generalize better than Adam?
🔥 Hard
Answer: Hypothesis: Adam may converge to sharper minima, while SGD finds flatter minima (better generalization). Also, adaptive methods have implicit regularization differences. However, AdamW with decoupled weight decay narrows the gap.
94
What is gradient clipping? Which optimizers need it?
📊 Medium
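Answer: Clipping limits gradient magnitude to avoid exploding gradients (RNNs, Transformers). Applied element-wise (clip by value) or to the global norm across all parameters (clip by norm). The need comes from the model and data rather than from the optimizer: essential for LSTMs, and also routinely used with Adam/AdamW in large Transformers.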
95
Why use learning rate warmup with Adam?
🔥 Hard
Answer: In early steps, Adam's second-moment estimate (v) is based on very few gradients and is therefore noisy, which can produce erratic, overly large effective step sizes. Warmup gradually increases LR, stabilizing training. Critical for large-scale Transformer training (BERT, GPT).
96
What is AdaMax? Relation to Adam?
📊 Medium
Answer: AdaMax replaces L2 norm in Adam with L-infinity norm. v_t = max(β2·v_{t-1}, |∇L|). More stable for some problems, less common.
97
AMSGrad – what problem does it solve?
🔥 Hard
Answer: Adam can sometimes increase learning rate (when v decreases). AMSGrad ensures v_t is monotonic: v_hat = max(v_hat, v_t). Guarantees non-increasing step size. Marginal improvement in practice.
98
Best optimizer for sparse features (embeddings)?
📊 Medium
Answer: AdaGrad, RMSprop, or Adam with sparse updates (lazy Adam). Sparse gradients benefit from per-parameter adaptive LR.
99
Why not use second-order optimizers (L-BFGS) in deep learning?
🔥 Hard
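Answer: The Hessian is huge (n² entries for n parameters, with n in the millions or billions). Quasi-Newton approximations (L-BFGS) avoid storing it but need large or full batches, since noisy mini-batch gradients break their curvature estimates. Mostly used in full-batch, small-scale or convex problems; K-FAC is a rarely used deep learning alternative.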
100
Heuristic: which optimizer to choose?
⚡ Easy
Answer: Default: AdamW with cosine decay + warmup (Transformers, CNNs). For NLP/Transformers: AdamW. For CV: SGD with momentum (generalizes well) or AdamW. For sparse embeddings: Adam/AdaGrad. For memory-limited: SGD or Lion.
AdamW (SOTA), SGD (strong baseline), Lion (emerging)