Interview Q&A: 75 Questions

Training & Optimization — Interview Q&A

Gradient descent, backpropagation, optimizers, learning rate, and gradient issues.

Gradient Descent — 15 Interview Questions

1 What is (vanilla) gradient descent?Easy
Answer: Repeatedly update parameters θ by stepping opposite the gradient of the loss L. The negative gradient is locally the direction of steepest descent, so a small enough step reduces the loss. Requires a differentiable objective (or subgradients).
θ ← θ − η ∇_θ L(θ)
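A minimal sketch of this update rule on a toy problem. The quadratic loss, the step size, and the function names are assumptions chosen for illustration; they are not part of the original answer.

```python
def grad(theta):
    # Gradient of the assumed toy loss L = (t0 - 3)^2 + (t1 + 1)^2.
    return [2.0 * (theta[0] - 3.0), 2.0 * (theta[1] + 1.0)]


def gradient_descent(theta, lr=0.1, steps=200):
    for _ in range(steps):
        g = grad(theta)
        # theta <- theta - lr * grad L(theta)
        theta = [t - lr * gi for t, gi in zip(theta, g)]
    return theta


theta_final = gradient_descent([0.0, 0.0])
print(theta_final)  # approaches the minimizer (3, -1)
```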
2 Full-batch vs mini-batch vs stochastic GD.Easy
Answer: Full-batch: gradient over all data—accurate but expensive. Mini-batch: noisy estimate, GPU-friendly. SGD often means mini-batch in practice; strict one-sample SGD is very noisy.
3 What happens if the learning rate is too large or too small?Easy
Answer: Too large: oscillation or divergence. Too small: painfully slow progress and risk of stalling in flat regions. Schedules and adaptive methods address this.
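The three regimes can be seen on f(x) = x², where each step multiplies x by (1 − 2η). This is a sketch under that assumed objective; the specific learning rates are illustrative only.

```python
def run_gd(lr, x0=1.0, steps=50):
    """Gradient descent on the assumed toy objective f(x) = x^2 (gradient 2x)."""
    x = x0
    for _ in range(steps):
        x = x - lr * 2.0 * x  # each step multiplies x by (1 - 2 * lr)
    return x


too_small = run_gd(0.001)  # factor 0.998 per step: barely moves from 1.0
reasonable = run_gd(0.4)   # factor 0.2 per step: shrinks quickly toward 0
too_large = run_gd(1.1)    # factor -1.2 per step: sign flips and magnitude grows

print(too_small, reasonable, too_large)
```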
4 Local minima vs global minimum—what do we say for deep nets?Medium
Answer: Non-convex losses have many local minima and saddle points. In high dimensions saddles are common; “bad” local minima are less universal than folklore suggests, but optimization is still hard.
5 What is a saddle point?Medium
Answer: A point where the gradient is zero but it is neither minimum nor maximum—curvature positive in some directions, negative in others. GD can slow down near saddles; noise (minibatches) or momentum helps escape.
6 Momentum—intuition.Medium
Answer: Accumulate a velocity vector damped by friction so updates continue through noisy or ill-conditioned directions—like a ball rolling in a valley. Helps damp oscillations in narrow valleys.
7 When does GD find the global minimum (classically)?Hard
Answer: For convex smooth functions with appropriate step sizes, GD converges to a global minimizer. Deep neural network losses are generally not convex—this theorem does not apply directly.
8 How does batch size affect the gradient estimate?Medium
Answer: Larger batch → lower variance gradient, more stable steps, more memory. Smaller batch → noisier updates that can act like regularization and help generalization (with caveats).
9 Epoch vs iteration in training loops.Easy
Answer: One epoch is a full pass over the training set (possibly shuffled). One iteration is one parameter update (often one mini-batch). Multiple iterations per epoch.
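The arithmetic behind that relationship, as a small sketch. The helper name and the drop_last flag (named after the common data-loader option) are hypothetical, as are the dataset and batch sizes.

```python
import math


def iterations_per_epoch(num_examples, batch_size, drop_last=False):
    # With drop_last, a final partial batch is skipped; otherwise it counts as one update.
    if drop_last:
        return num_examples // batch_size
    return math.ceil(num_examples / batch_size)


per_epoch = iterations_per_epoch(50_000, 128)
total_updates = per_epoch * 10  # 10 epochs

print(per_epoch, total_updates)
```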
10 Online learning—relation to SGD.Medium
Answer: Data arrives as a stream, and each example (or small group of examples) triggers an update, which is a natural fit for stochastic-style updates. It is the same GD template, but the data distribution may be non-stationary.
11 What is a plateau for optimization?Easy
Answer: A nearly flat region where gradients are tiny—progress slows. Learning-rate warmup, restarts, or second-order hints (in advanced optimizers) can help.
12 Can gradient noise help generalization?Hard
Answer: Small-batch stochasticity can help escape sharp minima and explore the landscape—linked to flat minima hypotheses. Not a guarantee; interaction with batch norm and scale matters.
14 What is preconditioning at a high level?Hard
Answer: Rescaling coordinates so the landscape is more isotropic—Newton-like methods use inverse Hessian; Adam/RMSprop use diagonal adaptive scaling as a cheap approximation.
15 GD update vs what backprop computes.Easy
Answer: Backpropagation computes ∇L; gradient descent uses that vector to update weights. They answer different questions: “which direction?” vs “how to step?”
Keep GD answers separate from chain rule mechanics—interviewers often ask both in sequence.

Backpropagation — 15 Interview Questions

16 What is backpropagation?Easy
Answer: The standard method to compute ∂L/∂θ for all parameters by applying the chain rule backward from the loss through each layer. It is reverse-mode automatic differentiation on the network graph.
17 State the chain rule for nested functions.Easy
Answer: If y = f(g(x)), then dy/dx = (df/dg)(dg/dx). In many dimensions, derivatives become Jacobians and products become appropriate matrix-vector multiplies.
18 Why reverse mode instead of forward mode for NNs?Medium
Answer: Loss is a scalar; we need one gradient vector w.r.t. millions of parameters. Reverse mode gives all partials in roughly one forward + one backward pass cost. Forward mode would repeat for each parameter.
19 Order of operations in the backward pass.Medium
Answer: Traverse layers from output toward input, propagating adjoint (gradient w.r.t. downstream activations). At each node, multiply local Jacobians into the incoming upstream gradient.
20 Backward through z = Wx + b—gradients w.r.t. W, x, b.Medium
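Answer: Given upstream g = ∂L/∂z with column vectors: ∂L/∂W = g xᵀ, ∂L/∂x = Wᵀ g, and ∂L/∂b = g (summed over the batch). In the row-batch layout Z = XW + b, the same results read ∂L/∂W = Xᵀ G and ∂L/∂X = G Wᵀ. Interviewers check that you know the dimensions line up: ∂L/∂W must have the same shape as W.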
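A hedged sketch for a single example in the column-vector convention, where ∂L/∂W = g xᵀ, ∂L/∂x = Wᵀ g, and ∂L/∂b = g (a row-batch layout gives the transposed forms). The loss L = ½‖z‖², the numeric values, and the helper names are assumptions; the finite-difference check only confirms one entry of ∂L/∂W.

```python
def linear_forward(W, x, b):
    # z = W x + b for one example; W is a list of rows.
    return [sum(w * xj for w, xj in zip(row, x)) + bi for row, bi in zip(W, b)]


def linear_backward(W, x, g):
    dW = [[gi * xj for xj in x] for gi in g]  # outer product g x^T, same shape as W
    dx = [sum(W[i][j] * g[i] for i in range(len(g))) for j in range(len(x))]  # W^T g
    db = list(g)
    return dW, dx, db


def loss(W, x, b):
    z = linear_forward(W, x, b)
    return 0.5 * sum(zi * zi for zi in z)


W = [[0.5, -1.0, 2.0], [1.5, 0.0, -0.5]]
x = [1.0, 2.0, -1.0]
b = [0.1, -0.2]

g = linear_forward(W, x, b)  # for L = 0.5 * ||z||^2, the upstream gradient dL/dz equals z
dW, dx, db = linear_backward(W, x, g)

# Central finite difference on W[0][1] as a sanity check.
eps = 1e-6
W_plus = [row[:] for row in W]
W_minus = [row[:] for row in W]
W_plus[0][1] += eps
W_minus[0][1] -= eps
fd = (loss(W_plus, x, b) - loss(W_minus, x, b)) / (2 * eps)
print(dW[0][1], fd)
```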
21 Backward through ReLU.Easy
Answer: Pass gradient through if pre-activation > 0, else zero. At exactly zero, use convention 0 or 1 (subgradient). Elementwise mask.
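The elementwise mask can be sketched as follows. This version assumes the convention of zero gradient at exactly zero; the function name is illustrative.

```python
def relu_backward(pre_activation, upstream):
    # Pass the upstream gradient where the pre-activation was positive, zero elsewhere.
    # Assumed convention: gradient 0 at exactly z == 0.
    return [g if z > 0 else 0.0 for z, g in zip(pre_activation, upstream)]


grads = relu_backward([-1.0, 0.0, 2.0], [0.3, 0.3, 0.3])
print(grads)
```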
22 Jacobian–vector product (JVP) vs vector–Jacobian product (VJP).Hard
Answer: JVP pushes perturbations forward (forward mode). VJP pulls loss gradient backward—what each layer implements in backprop. Efficiency: we want VJPs for scalar loss.
23 Why does backprop need memory?Medium
Answer: To compute local derivatives at each layer you need forward activations (and sometimes intermediate tensors). Memory scales with network width, depth, and batch size.
24 Gradient checkpointing—trade-off?Hard
Answer: Don’t store every activation; recompute some during backward. Less memory, more compute—used for large models (Transformers).
25 Parameter shared across layers—gradient behavior?Medium
Answer: Gradients from all paths add (multivariate chain rule). Same weight used twice → two contribution terms to ∂L/∂w.
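A small scalar sketch of the "contributions add" rule. The two-use network y = w · (w · x), the loss L = y, and the helper names are assumptions made for illustration.

```python
def forward(w, x):
    h = w * x  # first use of w
    y = w * h  # second use of w
    return h, y


def grad_w(w, x):
    h, _ = forward(w, x)
    upstream = 1.0                     # dL/dy for the assumed loss L = y
    from_second_use = upstream * h     # dy/dw with h held fixed
    from_first_use = upstream * w * x  # dy/dh * dh/dw
    return from_first_use + from_second_use


w, x = 3.0, 2.0
analytic = grad_w(w, x)

# Central finite difference on the whole function y(w) = w^2 * x.
eps = 1e-6
numeric = (forward(w + eps, x)[1] - forward(w - eps, x)[1]) / (2 * eps)
print(analytic, numeric)
```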
26 How does backprop relate to vanishing gradients?Medium
Answer: Backprop multiplies Jacobians layer by layer; if many factors have norm < 1 (e.g. saturating activations), the signal shrinks toward the early layers. The math is the same chain rule; the fixes are architectural (ReLU, ResNet-style skips, gates).
27 Relationship to the computational graph.Easy
Answer: The network is a DAG of ops; forward evaluates nodes, backward applies chain rule along edges. Frameworks build this graph dynamically (eager with autograd) or statically.
28 Does standard training use second derivatives?Medium
Answer: SGD + backprop uses first-order gradients. Second-order (Hessian) methods exist but are expensive; some approximations (K-FAC, etc.) are niche.
29 loss.backward() in PyTorch—what happens?Easy
Answer: Traverses the autograd graph backward from loss, accumulating gradients into .grad on leaf tensors that require grad. Because .grad accumulates, call optimizer.zero_grad() between iterations unless you are accumulating gradients intentionally.
30 Time complexity of forward vs backward (typical claim).Medium
Answer: For many networks, backward is roughly 2× the multiply-add cost of forward (same order—rule of thumb). Constant factors depend on fusion and framework.
Practice one small two-layer network by hand once—it locks in the chain rule story.

Optimizers (SGD, Adam, …) — 15 Interview Questions

31 What does an optimizer do?Easy
Answer: Uses gradients (from backprop) to update parameters each step—choosing direction and step size (and possibly per-parameter scaling).
32 Vanilla (mini-batch) SGD update.Easy
Answer: θ ← θ − η g where g is minibatch gradient and η is learning rate. Simple, well-understood baseline.
θ_{t+1} = θ_t − η ∇L(θ_t)
33 SGD + momentum—update form.Medium
Answer: Maintain velocity v: v ← β v + g, then θ ← θ − η v. Accelerates along consistent gradient directions, dampens oscillations in ravines.
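A minimal sketch of this update form on a scalar. The objective f(θ) = θ² and the hyperparameter values are assumptions for illustration.

```python
def momentum_step(theta, v, g, lr=0.1, beta=0.9):
    v = beta * v + g        # accumulate velocity
    theta = theta - lr * v  # step along the velocity
    return theta, v


# One step from theta = 1 with gradient 2 * theta (assumed objective f = theta^2).
theta_one, v_one = momentum_step(1.0, 0.0, 2.0 * 1.0)

# Many steps: the iterate oscillates but decays toward the minimizer at 0.
theta, v = 1.0, 0.0
for _ in range(500):
    theta, v = momentum_step(theta, v, 2.0 * theta)
print(theta_one, v_one, theta)
```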
34 Nesterov momentum—difference from classical momentum?Hard
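Answer: Looks ahead: the gradient is evaluated approximately at the look-ahead point θ − η β v (in the convention of the previous answer) rather than at θ, which makes the update more responsive near curvature changes. Abbreviated NAG; frameworks typically expose it as a flag on SGD (e.g. nesterov=True in PyTorch).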
35 Adagrad—idea and downside.Medium
Answer: Accumulate sum of squared gradients per parameter; divide update by growing denominator—adaptive smaller steps for frequent features. Learning rate can shrink to zero too aggressively.
36 RMSprop—fix to Adagrad.Medium
Answer: Use exponential moving average of squared gradients (EMA) instead of full sum—prevents denominator from exploding, works better for non-convex deep nets.
37 Adam—what two moments does it track?Medium
Answer: First moment m: EMA of the gradient (like momentum). Second moment v: EMA of the squared gradient (like RMSprop). Both are bias-corrected, which matters most early in training, giving m̂ and v̂; the step is η · m̂ / (√v̂ + ε).
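One Adam step on a scalar, as a sketch of the standard update with bias correction. The function name and the input values are illustrative assumptions; the defaults mirror the commonly quoted β₁ = 0.9, β₂ = 0.999, ε = 1e-8.

```python
import math


def adam_step(theta, m, v, g, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1.0 - beta1) * g      # first moment: EMA of the gradient
    v = beta2 * v + (1.0 - beta2) * g * g  # second moment: EMA of the squared gradient
    m_hat = m / (1.0 - beta1 ** t)         # bias correction (t starts at 1)
    v_hat = v / (1.0 - beta2 ** t)
    theta = theta - lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta, m, v


# On the first step the bias-corrected ratio is about sign(g), so the move is about lr.
theta, m, v = adam_step(theta=1.0, m=0.0, v=0.0, g=5.0, t=1)
print(theta)
```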
38 Adam vs AdamW.Medium
Answer: AdamW decouples weight decay: with decay coefficient λ, the weights are shrunk directly (θ ← θ − η λ θ) instead of adding an L2 penalty to the gradient, where Adam's adaptive scaling would distort it. It often generalizes better and is standard in Transformers.
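The difference shows up even on a single scalar step. In this sketch the loss gradient is set to zero so only the decay acts; the helper names, the decay coefficient wd, and the values are assumptions for illustration.

```python
import math


def adam_direction(g, m, v, t, beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1.0 - beta1) * g
    v = beta2 * v + (1.0 - beta2) * g * g
    m_hat = m / (1.0 - beta1 ** t)
    v_hat = v / (1.0 - beta2 ** t)
    return m_hat / (math.sqrt(v_hat) + eps), m, v


def adam_l2_step(w, g, m, v, t, lr, wd):
    # L2 penalty folded into the gradient, then rescaled by the adaptive denominator.
    d, m, v = adam_direction(g + wd * w, m, v, t)
    return w - lr * d, m, v


def adamw_step(w, g, m, v, t, lr, wd):
    # Decoupled decay: shrink the weight directly, outside the adaptive scaling.
    d, m, v = adam_direction(g, m, v, t)
    return w - lr * wd * w - lr * d, m, v


w_l2, _, _ = adam_l2_step(1.0, 0.0, 0.0, 0.0, 1, lr=0.01, wd=0.1)
w_adamw, _, _ = adamw_step(1.0, 0.0, 0.0, 0.0, 1, lr=0.01, wd=0.1)
print(w_l2, w_adamw)
```

With L2 folded into the gradient, the first step moves by about lr regardless of how small wd is, because the normalization cancels the penalty's scale. The decoupled form moves by lr · wd · w, as intended.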
39 Default β₁, β₂ in Adam—what do they mean?Easy
Answer: Typical β₁=0.9 (gradient EMA decay), β₂=0.999 (squared-gradient EMA). Control memory horizon of first and second moment estimates.
40 Why ε in Adam denominator?Easy
Answer: Numerical stability—avoid division by zero when second moment is tiny. Usually 1e-8 scale.
41 When might SGD+Momentum beat Adam?Hard
Answer: Some vision tasks generalize slightly better with tuned SGD+momentum + schedule; Adam can find sharper minima in some studies—not universal, dataset and tuning matter.
42 Do Adam-class methods use true second derivatives?Medium
Answer: No. There is no Hessian; the squared-gradient EMA is a first-order statistic that gives a diagonal per-parameter scaling, at best a rough curvature proxy. Much cheaper than Newton methods.
43 Sparse gradients—mention one optimizer variant.Hard
Answer: Lazy Adam / sparse Adam updates only touched parameters; useful in embedding-heavy models with huge vocabularies.
44 Memory: SGD vs Adam.Easy
Answer: Adam stores two moment buffers per parameter (m and v)—roughly 2× optimizer state vs SGD with momentum (one velocity). Matters for large models.
45 Default optimizer you’d pick for a Transformer?Easy
Answer: AdamW with weight decay, warmup + decay schedule, β defaults above—standard in BERT/GPT-style training.
Pair optimizer answers with learning rate schedule when discussing Transformers.

Learning Rate Schedules — 15 Interview Questions

46 What is the learning rate in SGD?Easy
Answer: Scalar η scaling the gradient step: controls speed vs stability—too large diverges, too small is slow. Same symbol in Adam as base step size (per-parameter scaling aside).
47 What is a learning rate schedule?Easy
Answer: A function of step or epoch that changes η during training—e.g. decay after plateaus, cosine to zero, warmup then constant.
48 Step decay (piecewise constant).Easy
Answer: Multiply η by constant γ every N epochs or when metric plateaus—classic CV recipe; simple and interpretable.
49 Exponential decay formula (conceptual).Medium
Answer: η_t = η_0 · γ^t, or in continuous form η_0 e^(−kt): a smooth decrease whose decay rate needs tuning.
η_t = η_0 · γ^t   (per step or epoch)
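A short sketch of both forms. The two agree when k = −ln γ; the function names and values are illustrative assumptions.

```python
import math


def exp_decay(lr0, gamma, t):
    # Discrete form: multiply by gamma once per step (or epoch).
    return lr0 * gamma ** t


def exp_decay_continuous(lr0, k, t):
    # Continuous form: lr0 * exp(-k * t).
    return lr0 * math.exp(-k * t)


lr_discrete = exp_decay(0.1, 0.5, 3)
lr_continuous = exp_decay_continuous(0.1, -math.log(0.5), 3)
print(lr_discrete, lr_continuous)
```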
50 Cosine annealing—idea.Medium
Answer: η follows a cosine curve from max to min over T steps—smooth decay used in many modern trainers (SGDR extends with restarts).
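One common way to write the cosine curve is η_t = η_min + ½(η_max − η_min)(1 + cos(π t / T)). The sketch below assumes that form; the function and parameter names are illustrative.

```python
import math


def cosine_lr(t, total_steps, lr_max, lr_min=0.0):
    # Half a cosine period: lr_max at t = 0, lr_min at t = total_steps.
    cos_term = 1.0 + math.cos(math.pi * t / total_steps)
    return lr_min + 0.5 * (lr_max - lr_min) * cos_term


lr_start = cosine_lr(0, 1000, lr_max=1e-3, lr_min=1e-5)
lr_mid = cosine_lr(500, 1000, lr_max=1e-3, lr_min=1e-5)
lr_end = cosine_lr(1000, 1000, lr_max=1e-3, lr_min=1e-5)
print(lr_start, lr_mid, lr_end)
```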
51 Why warmup for Transformers / large-batch Adam?Medium
Answer: Early steps have unreliable moment estimates; large η can destabilize. Linear warmup ramps η so early updates are smaller—standard in BERT/GPT recipes.
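A linear warmup can be sketched as below. The constant rate after warmup is a simplification (real recipes usually follow warmup with a decay), and the names and values are assumptions.

```python
def warmup_lr(step, peak_lr, warmup_steps):
    # step is 0-indexed; the rate ramps linearly and reaches peak_lr on the last warmup step.
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    return peak_lr


first = warmup_lr(0, peak_lr=1e-3, warmup_steps=100)
last_warmup = warmup_lr(99, peak_lr=1e-3, warmup_steps=100)
after = warmup_lr(500, peak_lr=1e-3, warmup_steps=100)
print(first, last_warmup, after)
```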
52 One-cycle policy (high level).Hard
Answer: Increase η to a maximum then decrease in one cycle—idea from Smith: helps fast training and can aid generalization; used with cyclical LR ideas.
53 LR range test / finder—purpose.Medium
Answer: Increase η each batch, plot loss—find region where loss drops fastest before explosion; heuristic to pick max LR for one-cycle or training.
54 Linear scaling rule (batch size vs learning rate).Hard
Answer: When batch size ×k, some recipes scale η ×k to keep gradient noise similar—works as heuristic for SGD in some regimes; not universal (warmup, BN, Adam complicate).
55 ReduceLROnPlateau scheduler.Easy
Answer: When validation metric stops improving, multiply η by factor < 1—adaptive to training dynamics, no fixed epoch list.
56 Minimum learning rate / η floor.Easy
Answer: Schedulers often clamp η ≥ η_min so training doesn’t stall completely; cosine schedules use explicit minimum.
57 Does Adam remove the need for schedules?Medium
Answer: No—Adam adapts per-parameter scale but global η still matters; Transformers use warmup + decay with AdamW routinely.
58 SGDR: stochastic gradient descent with warm restarts.Hard
Answer: Periodically reset η to high value (cosine cycles)—helps escape flat regions; related to ensemble of snapshots near restarts.
59 Grid search vs random search for η?Medium
Answer: Random search on log-uniform η often more efficient than grid—covers orders of magnitude; use validation score.
60 Practical trio you’d mention for a Transformer baseline.Easy
Answer: Peak learning rate, warmup steps, total steps / decay shape (cosine or linear)—cite paper recipe then tune.
Tie schedules to validation loss curves, not only train loss.

Vanishing & Exploding Gradients — 15 Interview Questions

61 What is the vanishing gradient problem?Easy
Answer: In backprop, gradients are products of terms through layers. If many factors are < 1 (e.g. saturated sigmoid derivatives), early-layer gradients → 0 and those weights barely update.
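The product picture in numbers, as a rough sketch. It assumes a chain of scalar sigmoid units and ignores the weight factors, so it only illustrates the activation-derivative part of the product.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)  # peaks at 0.25 when z == 0


def backward_scale(pre_activations):
    # Product of local sigmoid derivatives along the chain (weights ignored by assumption).
    scale = 1.0
    for z in pre_activations:
        scale *= sigmoid_grad(z)
    return scale


best_case = backward_scale([0.0] * 20)  # every unit at its maximum derivative
saturated = backward_scale([4.0] * 20)  # every unit in saturation
print(best_case, saturated)
```

Even in the best case the factor after 20 layers is 0.25²⁰, on the order of 1e-12.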
62 What is exploding gradients?Easy
Answer: Same product picture: factors > 1 repeatedly → huge gradients, unstable updates, NaNs. Common in RNNs over long sequences if unrolled.
63 Why do sigmoid/tanh worsen vanishing?Medium
Answer: The sigmoid derivative is at most 0.25; the tanh derivative is at most 1 but, like sigmoid's, approaches 0 in saturation. Each layer therefore shrinks the backward signal when many are stacked.
64 How does ReLU help (and one caveat)?Medium
Answer: Derivative is 1 for active neurons—less shrinkage than sigmoid. Caveat: dead ReLUs still pass zero gradient.
65 How do residual connections improve gradient flow?Medium
Answer: y = F(x)+x adds an identity path; gradient can bypass stacked layers via +1 shortcut—eases training of very deep nets.
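For scalar blocks, y = F(x) + x has derivative F'(x) + 1, so the chain multiplies factors near 1 instead of factors near 0. The sketch below assumes every block has the same small local derivative; names and values are illustrative.

```python
def plain_chain_grad(local_grads):
    # Plain stack: the gradient is the product of the F'(x) factors.
    out = 1.0
    for d in local_grads:
        out *= d
    return out


def residual_chain_grad(local_grads):
    # Residual stack: each block contributes (1 + F'(x)) because of the identity path.
    out = 1.0
    for d in local_grads:
        out *= 1.0 + d
    return out


local = [0.01] * 30  # assumed small local derivative in each of 30 blocks
plain = plain_chain_grad(local)
residual = residual_chain_grad(local)
print(plain, residual)
```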
66 LSTM vs vanilla RNN for vanishing gradients.Medium
Answer: LSTM’s cell state and additive updates with gates allow better long-range gradient flow than simple tanh recurrence where each step multiplies Jacobians.
67 GRU vs LSTM—gradient angle.Easy
Answer: GRU has fewer gates but similar idea: gating and update blending to mitigate vanishing in sequences—often comparable performance with less compute.
68 Gradient clipping—norm vs value.Medium
Answer: Clip by norm: if ||g|| > threshold, scale g down—common in RNNs. Value clipping caps each element—less common. Stops one batch from destroying weights.
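Both variants on a flat list of gradient values, as a minimal sketch. The function names and thresholds are illustrative; frameworks provide their own utilities for this.

```python
import math


def clip_by_norm(grads, max_norm):
    # Rescale the whole vector if its L2 norm exceeds max_norm; direction is preserved.
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm:
        return list(grads)
    scale = max_norm / norm
    return [g * scale for g in grads]


def clip_by_value(grads, limit):
    # Cap each element independently to [-limit, limit]; direction can change.
    return [max(-limit, min(limit, g)) for g in grads]


by_norm = clip_by_norm([3.0, 4.0], max_norm=1.0)
by_value = clip_by_value([3.0, -4.0, 0.5], limit=1.0)
print(by_norm, by_value)
```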
69 Does batch norm fix vanishing gradients?Medium
Answer: It stabilizes activations and can help optimization indirectly—not a guarantee; deep nets still benefit from good init, ReLU, residuals.
70 Highway networks vs ResNets (brief).Hard
Answer: Highways use learned gates on skip vs transform; ResNet uses identity skip + simpler F(x). ResNet won for simplicity and performance in vision.
71 Do Transformers vanish like RNNs?Medium
Answer: Depth is finite and paths include residual + LN; attention mixes tokens in O(1) depth per layer—not the same T-step product as unrolled RNN, but very deep stacks still need design care.
72 Link to weight initialization.Easy
Answer: Good init keeps activations in a range where derivatives aren’t tiny everywhere—reduces extreme products at start of training.
73 How might you detect exploding gradients in logs?Easy
Answer: Sudden NaN loss, gradient norm spikes, weights blow up—watch global norm per step.
74 Mixed precision and loss scaling.Hard
Answer: Small gradients can underflow in fp16; loss scaling multiplies loss before backward so gradients stay in representable range, then unscales for the optimizer.
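The underflow and its fix can be simulated with the standard library's half-precision packing. This sketch scales a single gradient value directly (scaling the loss has the same effect by linearity); the gradient value, the scale factor, and the helper name are assumptions.

```python
import struct


def to_fp16(x):
    # Round-trip through IEEE half precision using struct's "e" format.
    return struct.unpack("e", struct.pack("e", x))[0]


tiny_grad = 1e-8  # below the smallest positive fp16 value (about 6e-8)
scale = 1024.0    # assumed loss-scale factor

unscaled_fp16 = to_fp16(tiny_grad)        # underflows to zero
scaled_fp16 = to_fp16(tiny_grad * scale)  # representable after scaling
recovered = scaled_fp16 / scale           # unscale in higher precision before the update
print(unscaled_fp16, recovered)
```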
75 One-sentence fix menu for interviews.Easy
Answer: Use ReLU, good init, BN/LN, residuals, gated RNNs or attention, and gradient clipping when training recurrent or unstable nets.
Always mention products along the graph—the core math story.