Interview Q&A: 75 Questions

Training & Optimization — Interview Q&A

Gradient descent, backpropagation, optimizers, learning rate, and gradient issues.

Gradient Descent: 15 Interview Questions

1 What is (vanilla) gradient descent? (Easy)

Answer: Repeatedly update parameters θ by stepping opposite the gradient of the loss L, which is locally the direction of steepest descent, with step size (learning rate) η. Requires a differentiable objective (or subgradients).

θ ← θ − η ∇_θ L(θ)
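A minimal sketch of the update rule on a one-dimensional toy problem. The objective L(θ) = (θ − 3)² and the function name `gradient_descent` are assumptions chosen for illustration, not part of any library.

```python
def gradient_descent(grad, theta0, lr=0.1, steps=100):
    """Plain gradient descent: theta <- theta - lr * grad(theta)."""
    theta = theta0
    for _ in range(steps):
        theta = theta - lr * grad(theta)
    return theta


# Hypothetical objective L(theta) = (theta - 3)^2, so grad L = 2 * (theta - 3).
def toy_grad(theta):
    return 2.0 * (theta - 3.0)


theta_star = gradient_descent(toy_grad, theta0=0.0)               # moves toward 3
theta_bad = gradient_descent(toy_grad, theta0=0.0, lr=1.1, steps=50)  # step too large
```

With `lr=0.1` the distance to the minimizer shrinks by a factor of 0.8 per step; with `lr=1.1` it grows by a factor of 1.2 per step, which is the divergence case.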

2 Full-batch vs mini-batch vs stochastic GD. (Easy)

Answer: Full-batch: gradient over all data; accurate but expensive. Mini-batch: noisy estimate from a subset of the data, GPU-friendly. SGD often means mini-batch in practice; strict one-sample SGD is very noisy.

3 What happens if the learning rate is too large or too small? (Easy)

Answer: Too large: oscillation or divergence. Too small: painfully slow progress and risk of stalling in flat regions. Schedules and adaptive methods address this.

4 Local minima vs global minimum: what do we say for deep nets? (Medium)

Answer: Non-convex losses have many local minima and saddle points. In high dimensions saddles are common; "bad" local minima are rarer than folklore suggests, but optimization is still hard.

5 What is a saddle point? (Medium)

Answer: A point where the gradient is zero but which is neither a minimum nor a maximum: curvature is positive in some directions and negative in others. GD can slow down near saddles; mini-batch noise or momentum helps escape.

6 Momentum: intuition. (Medium)

Answer: Accumulate a velocity vector damped by friction so updates keep moving through noisy or ill-conditioned directions, like a ball rolling in a valley. Helps damp oscillations in narrow valleys.

7 When does GD find the global minimum (classically)? (Hard)

Answer: For convex smooth functions with appropriate step sizes, GD converges to a global minimizer. Deep neural network losses are generally not convex, so this theorem does not apply directly.

8 How does batch size affect the gradient estimate? (Medium)

Answer: Larger batch → lower-variance gradient, more stable steps, more memory. Smaller batch → noisier updates that can act like regularization and help generalization (with caveats).

9 Epoch vs iteration in training loops. (Easy)

Answer: One epoch is a full pass over the training set (possibly shuffled). One iteration is one parameter update (usually one mini-batch), so an epoch consists of many iterations.
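The arithmetic behind "many iterations per epoch" can be sketched as below. The helper name `iterations_per_epoch` and the `drop_last` flag (whether a final partial batch is skipped) are illustrative assumptions.

```python
import math


def iterations_per_epoch(num_examples, batch_size, drop_last=False):
    """Number of parameter updates in one full pass over the data."""
    if drop_last:
        # Skip the final, smaller batch.
        return num_examples // batch_size
    # Keep the final partial batch as its own iteration.
    return math.ceil(num_examples / batch_size)


# Example numbers (assumed): 50,000 training examples, batch size 128.
with_partial = iterations_per_epoch(50_000, 128)
without_partial = iterations_per_epoch(50_000, 128, drop_last=True)
```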

10 Online learning: relation to SGD. (Medium)

Answer: Data arrives as a stream; each example (or small group of examples) triggers an update, a natural fit for stochastic-style updates. Same GD template, but the data distribution may be non-stationary.

11 What is a plateau for optimization? (Easy)

Answer: A nearly flat region where gradients are tiny, so progress slows. Learning-rate warmup, restarts, or second-order hints (in advanced optimizers) can help.

12 Can gradient noise help generalization? (Hard)

Answer: Small-batch stochasticity can help escape sharp minima and explore the landscape, which is linked to flat-minima hypotheses. Not a guarantee; interaction with batch norm and scale matters.

13 Line search vs fixed learning rate. (Medium)

Answer: Line search picks the step size along the descent direction by evaluating the objective. It is common in classical optimization but rare in large-scale deep learning (too expensive), which favors a hand-tuned or scheduled η.

14 What is preconditioning at a high level? (Hard)

Answer: Rescaling coordinates so the landscape is more isotropic. Newton-like methods use the inverse Hessian; Adam/RMSprop use diagonal adaptive scaling as a cheap approximation.

15 GD update vs what backprop computes. (Easy)

Answer: Backpropagation computes ∇L; gradient descent uses that vector to update weights. They answer different questions: "which direction?" vs "how to step?"

Keep GD answers separate from chain rule mechanics; interviewers often ask both in sequence.

Backpropagation: 15 Interview Questions

16 What is backpropagation? (Easy)

Answer: The standard method to compute ∂L/∂θ for all parameters by applying the chain rule backward from the loss through each layer. It is reverse-mode automatic differentiation on the network graph.

17 State the chain rule for nested functions. (Easy)

Answer: If y = f(g(x)), then dy/dx = (df/dg)(dg/dx). In many dimensions, derivatives become Jacobians and products become appropriate matrix-vector multiplies.

18 Why reverse mode instead of forward mode for NNs? (Medium)

Answer: Loss is a scalar; we need one gradient vector w.r.t. millions of parameters. Reverse mode gives all partials in roughly one forward + one backward pass cost. Forward mode would repeat for each parameter.

19 Order of operations in the backward pass. (Medium)

Answer: Traverse layers from output toward input, propagating adjoint (gradient w.r.t. downstream activations). At each node, multiply local Jacobians into the incoming upstream gradient.

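20 Backward through z = Wx + b: gradients w.r.t. W, x, b. (Medium)

Answer: Given upstream g = ∂L/∂z, the formulas depend on layout. Batch-first (rows are examples, z = xW + b): ∂L/∂b = sum of g over the batch, ∂L/∂W = xᵀg, ∂L/∂x = gWᵀ. Column-vector (single example, z = Wx + b): ∂L/∂b = g, ∂L/∂W = g xᵀ, ∂L/∂x = Wᵀg. Interviewers check that you know the dimensions line up.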
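A small sketch of the linear-layer backward pass, checked against a finite difference. It assumes the column-vector layout for a single example (z = Wx + b), where the formulas read ∂L/∂W = g xᵀ and ∂L/∂x = Wᵀg; the batch-first layout is the transpose of this. The loss L = ½‖z‖², the numbers, and all function names are illustrative assumptions.

```python
def linear_forward(W, x, b):
    """z = W x + b for one example (W is a list of rows)."""
    return [sum(w * xj for w, xj in zip(row, x)) + bi for row, bi in zip(W, b)]


def linear_backward(W, x, g):
    """Given upstream g = dL/dz, return (dL/dW, dL/dx, dL/db)."""
    dW = [[gi * xj for xj in x] for gi in g]                      # outer product g x^T
    dx = [sum(W[i][j] * g[i] for i in range(len(g)))              # W^T g
          for j in range(len(x))]
    db = list(g)                                                  # one example: no batch sum
    return dW, dx, db


def loss(W, x, b):
    # Assumed toy loss: L = 0.5 * ||z||^2, so dL/dz = z.
    return 0.5 * sum(zi ** 2 for zi in linear_forward(W, x, b))


W = [[1.0, -2.0, 0.5], [0.0, 3.0, 1.0]]   # shape (2, 3)
x = [0.5, -1.0, 2.0]                      # shape (3,)
b = [0.1, -0.2]                           # shape (2,)

g = linear_forward(W, x, b)               # upstream gradient for this toy loss
dW, dx, db = linear_backward(W, x, g)

# Finite-difference check on a single weight entry W[0][1].
eps = 1e-6
W_plus = [row[:] for row in W]
W_plus[0][1] += eps
numeric = (loss(W_plus, x, b) - loss(W, x, b)) / eps
```

Note how the shapes line up: `dW` matches `W`, `dx` matches `x`, and `db` matches `b`.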

21 Backward through ReLU. (Easy)

Answer: Pass gradient through if pre-activation > 0, else zero. At exactly zero, use convention 0 or 1 (subgradient). Elementwise mask.
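The elementwise mask can be sketched in a few lines. This version assumes the convention that the gradient at exactly zero is 0; the function name is illustrative.

```python
def relu_backward(pre_activation, upstream):
    """Pass the upstream gradient where the pre-activation is > 0, else zero."""
    return [g if z > 0 else 0.0 for z, g in zip(pre_activation, upstream)]


# Negative and exactly-zero inputs block the gradient; positive inputs pass it.
masked = relu_backward([-1.0, 0.0, 2.0], [5.0, 5.0, 5.0])
```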

22 Jacobian–vector product (JVP) vs vector–Jacobian product (VJP). (Hard)

Answer: JVP pushes perturbations forward (forward mode). VJP pulls the loss gradient backward, which is what each layer implements in backprop. Efficiency: we want VJPs for a scalar loss.

23 Why does backprop need memory? (Medium)

Answer: To compute local derivatives at each layer you need forward activations (and sometimes intermediate tensors). Memory scales with network width, depth, and batch size.

24 Gradient checkpointing: trade-off? (Hard)

Answer: Don't store every activation; recompute some during backward. Less memory, more compute. Used for large models (Transformers).

25 Parameter shared across layers: gradient behavior? (Medium)

Answer: Gradients from all paths add (multivariate chain rule). Same weight used twice → two contribution terms to ∂L/∂w.
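A toy illustration of the two contribution terms, assuming a scalar "network" y = w · (w · x) in which the same weight w is used twice. The function name and the numbers are made up for the example.

```python
def shared_weight_grad(w, x):
    """Return y = w * (w * x) and dy/dw as the sum of two path contributions."""
    h = w * x                       # first use of w
    y = w * h                       # second use of w
    grad_from_second_use = h        # dy/dw with h held fixed
    grad_from_first_use = w * x     # (dy/dh) * (dh/dw) = w * x
    return y, grad_from_first_use + grad_from_second_use


y, dy_dw = shared_weight_grad(3.0, 2.0)
```

The summed result agrees with differentiating y = w²x directly, which gives 2wx.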

26 How does backprop relate to vanishing gradients? (Medium)

Answer: Backprop multiplies Jacobians layer by layer; if many factors have magnitude < 1 (saturating activations), the signal shrinks toward early layers. Same math, architectural fixes (ReLU, ResNet, gates).

27 Relationship to the computational graph. (Easy)

Answer: The network is a DAG of ops; forward evaluates nodes, backward applies the chain rule along edges. Frameworks build this graph dynamically (eager with autograd) or statically.

28 Does standard training use second derivatives? (Medium)

Answer: SGD + backprop uses first-order gradients. Second-order (Hessian) methods exist but are expensive; some approximations (K-FAC, etc.) are niche.

29 loss.backward() in PyTorch: what happens? (Easy)

Answer: Traverses the autograd graph from loss, accumulating into .grad on leaf tensors that require grad. Call zero_grad between iterations unless you intend gradients to accumulate.

30 Time complexity of forward vs backward (typical claim). (Medium)

Answer: For many networks, backward costs roughly 2× the multiply-adds of forward (same order of magnitude; a rule of thumb). Constant factors depend on fusion and framework.

Practice one small two-layer network by hand once; it locks in the chain rule story.

Optimizers (SGD, Adam, …): 15 Interview Questions

31 What does an optimizer do? (Easy)

Answer: Uses gradients (from backprop) to update parameters each step, choosing direction and step size (and possibly per-parameter scaling).

32 Vanilla (mini-batch) SGD update. (Easy)

Answer: θ ← θ − η g where g is the mini-batch gradient and η is the learning rate. Simple, well-understood baseline.

θ_{t+1} = θ_t − η ∇L(θ_t)

33 SGD + momentum: update form. (Medium)

Answer: Maintain velocity v: v ← β v + g, then θ ← θ − η v. Accelerates along consistent gradient directions, dampens oscillations in ravines.
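The two-line update above can be sketched on a scalar toy problem. The quadratic objective, the hyperparameter values, and the function name `sgd_momentum` are assumptions for illustration; setting `beta=0.0` recovers plain SGD.

```python
def sgd_momentum(grad, theta0, lr=0.01, beta=0.9, steps=200):
    """v <- beta * v + g, then theta <- theta - lr * v (scalar parameter)."""
    theta, v = theta0, 0.0
    for _ in range(steps):
        g = grad(theta)
        v = beta * v + g
        theta = theta - lr * v
    return theta


# Hypothetical objective L(theta) = (theta - 3)^2, so grad L = 2 * (theta - 3).
def toy_grad(theta):
    return 2.0 * (theta - 3.0)


with_momentum = sgd_momentum(toy_grad, theta0=0.0)
plain_sgd = sgd_momentum(toy_grad, theta0=0.0, beta=0.0)
```

With the same small learning rate and step budget, the momentum run ends much closer to the minimizer at 3 on this toy problem, because the velocity accumulates along the consistent gradient direction.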

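34 Nesterov momentum: difference from classical momentum? (Hard)

Answer: Looks ahead: the gradient is evaluated (approximately) at the point the momentum step is about to reach, θ − η β v in the notation of the previous answer, which makes it more responsive near curvature changes. Often abbreviated "NAG".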

35 Adagrad: idea and downside. (Medium)

Answer: Accumulate the sum of squared gradients per parameter and divide the update by the square root of that growing sum, giving adaptively smaller steps for frequently updated features. Downside: the effective learning rate can shrink toward zero too aggressively.

36 RMSprop: fix to Adagrad. (Medium)

Answer: Use an exponential moving average (EMA) of squared gradients instead of the full sum. This keeps the denominator from growing without bound and works better for non-convex deep nets.

37 Adam: what two moments does it track? (Medium)

Answer: First moment m: EMA of the gradient (like momentum). Second moment v: EMA of the squared gradient (like RMSprop). Both are bias-corrected (m̂, v̂), which matters early in training; the step is η · m̂ / (√v̂ + ε).
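A minimal sketch of one Adam update for a scalar parameter, following the moment and bias-correction description above. The function name `adam_step` and the example gradients are illustrative assumptions; real implementations operate on tensors.

```python
import math


def adam_step(theta, g, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; t is the 1-based step count."""
    m = beta1 * m + (1 - beta1) * g          # first moment: EMA of gradient
    v = beta2 * v + (1 - beta2) * g * g      # second moment: EMA of squared gradient
    m_hat = m / (1 - beta1 ** t)             # bias correction for zero-initialized m
    v_hat = v / (1 - beta2 ** t)             # bias correction for zero-initialized v
    theta = theta - lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta, m, v


# First step from zero state with a huge gradient and with a tiny gradient.
big, _, _ = adam_step(0.0, 1000.0, 0.0, 0.0, t=1, lr=0.01)
small, _, _ = adam_step(0.0, 0.001, 0.0, 0.0, t=1, lr=0.01)
```

On the first step the bias-corrected ratio is close to the sign of the gradient, so both runs move by roughly `lr` despite a million-fold difference in gradient size. That is the per-parameter scaling in action.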

38 Adam vs AdamW. (Medium)

Answer: AdamW decouples weight decay: the decay is applied directly to the weights rather than added to the gradient as an L2 penalty, so it is not rescaled by the adaptive denominator. Often better generalization and standard in Transformers.

39 Default β₁, β₂ in Adam: what do they mean? (Easy)

Answer: Typical β₁ = 0.9 (gradient EMA decay), β₂ = 0.999 (squared-gradient EMA decay). They control the memory horizon of the first and second moment estimates.

40 Why ε in Adam denominator? (Easy)

Answer: Numerical stability: avoids division by zero when the second moment is tiny. Usually on the order of 1e-8.

41 When might SGD+Momentum beat Adam? (Hard)

Answer: Some vision tasks generalize slightly better with tuned SGD+momentum and a schedule; in some studies Adam finds sharper minima. Not universal; dataset and tuning matter.

42 Do Adam-class methods use true second derivatives? (Medium)

Answer: No full Hessian: only a diagonal scaling from the squared-gradient EMA, which acts as a rough curvature proxy. Much cheaper than Newton methods.

43 Sparse gradients: mention one optimizer variant. (Hard)

Answer: Lazy Adam / sparse Adam updates only touched parameters; useful in embedding-heavy models with huge vocabularies.

44 Memory: SGD vs Adam. (Easy)

Answer: Adam stores two moment buffers per parameter (m and v), roughly 2× the optimizer state of SGD with momentum (one velocity buffer). Matters for large models.

45 Default optimizer you'd pick for a Transformer? (Easy)

Answer: AdamW with weight decay, a warmup + decay schedule, and the β defaults above; standard in BERT/GPT-style training.

Pair optimizer answers with learning rate schedule when discussing Transformers.

Learning Rate Schedules: 15 Interview Questions

46 What is the learning rate in SGD? (Easy)

Answer: Scalar η scaling the gradient step. It controls speed vs stability: too large diverges, too small is slow. The same symbol is used in Adam for the base step size (per-parameter scaling aside).

47 What is a learning rate schedule? (Easy)

Answer: A function of step or epoch that changes η during training, e.g. decay after plateaus, cosine to zero, warmup then constant.

48 Step decay (piecewise constant). (Easy)

Answer: Multiply η by a constant γ < 1 every N epochs or when a metric plateaus. Classic CV recipe; simple and interpretable.

49 Exponential decay formula (conceptual). (Medium)

Answer: η_t = η_0 · γ^t, or in continuous form η_0 e^(−kt). Smooth decrease; the decay rate needs tuning.

η_t = η_0 · γ^t (per step or epoch)
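The exponential formula, and the piecewise-constant step decay from the previous question, can be written as two small functions. The names and the example values are assumptions for illustration.

```python
def exponential_decay(lr0, gamma, t):
    """eta_t = eta_0 * gamma^t, with t counted in steps or epochs."""
    return lr0 * gamma ** t


def step_decay(lr0, gamma, epoch, every):
    """Multiply the rate by gamma once per `every` completed epochs."""
    return lr0 * gamma ** (epoch // every)


lr_exp = exponential_decay(0.1, 0.5, 3)      # halved three times
lr_step = step_decay(0.1, 0.1, 65, 30)       # two drops by epoch 65
```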

50 Cosine annealing: idea. (Medium)

Answer: η follows a cosine curve from max to min over T steps. Smooth decay used in many modern trainers (SGDR extends it with restarts).
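One common way to write the cosine curve is η_t = η_min + ½(η_max − η_min)(1 + cos(π t / T)). The sketch below assumes that form; the function name and values are illustrative.

```python
import math


def cosine_lr(t, total_steps, lr_max, lr_min=0.0):
    """Cosine annealing from lr_max at t=0 down to lr_min at t=total_steps."""
    progress = min(t, total_steps) / total_steps    # clamp so the rate stays at lr_min afterwards
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * progress))


start = cosine_lr(0, 100, lr_max=0.1, lr_min=0.001)
middle = cosine_lr(50, 100, lr_max=0.1, lr_min=0.001)
end = cosine_lr(100, 100, lr_max=0.1, lr_min=0.001)
```

The rate starts at the maximum, passes through the midpoint of the two bounds halfway through, and ends at the minimum.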

51 Why warmup for Transformers / large-batch Adam? (Medium)

Answer: Early steps have unreliable moment estimates; a large η can destabilize training. Linear warmup ramps η up so early updates are smaller. Standard in BERT/GPT recipes.
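A sketch of linear warmup to a peak rate, held constant afterwards. In practice a decay phase usually follows; it is omitted here to keep the example small. The function name and the numbers are assumptions.

```python
def linear_warmup_lr(step, peak_lr, warmup_steps):
    """Ramp linearly to peak_lr over warmup_steps, then stay at peak_lr (0-based step)."""
    return peak_lr * min(1.0, (step + 1) / warmup_steps)


first = linear_warmup_lr(0, peak_lr=1e-3, warmup_steps=1000)        # tiny first update
last_warmup = linear_warmup_lr(999, peak_lr=1e-3, warmup_steps=1000)
after = linear_warmup_lr(5000, peak_lr=1e-3, warmup_steps=1000)
```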

52 One-cycle policy (high level). (Hard)

Answer: Increase η to a maximum, then decrease it, all within a single cycle. The idea is from Smith: it helps fast training and can aid generalization; related to cyclical LR ideas.

53 LR range test / finder: purpose. (Medium)

Answer: Increase η each batch and plot the loss to find the region where loss drops fastest before it explodes. A heuristic to pick the max LR for one-cycle or regular training.

54 Linear scaling rule (batch size vs learning rate). (Hard)

Answer: When batch size is multiplied by k, some recipes scale η by k to keep gradient noise similar. Works as a heuristic for SGD in some regimes; not universal (warmup, BN, Adam complicate it).

55 ReduceLROnPlateau scheduler. (Easy)

Answer: When the validation metric stops improving, multiply η by a factor < 1. Adaptive to training dynamics; no fixed epoch list.

56 Minimum learning rate / η floor. (Easy)

Answer: Schedulers often clamp η ≥ η_min so training doesn't stall completely; cosine schedules use an explicit minimum.

57 Does Adam remove the need for schedules? (Medium)

Answer: No. Adam adapts per-parameter scale but the global η still matters; Transformers routinely use warmup + decay with AdamW.

58 SGDR: stochastic gradient descent with warm restarts. (Hard)

Answer: Periodically reset η to a high value (cosine cycles). Helps escape flat regions; related to ensembling snapshots taken near restarts.

59 Grid search vs random search for η? (Medium)

Answer: Random search over log-uniform η is often more efficient than a grid because it covers orders of magnitude; select using the validation score.

60 Practical trio you'd mention for a Transformer baseline. (Easy)

Answer: Peak learning rate, warmup steps, and total steps / decay shape (cosine or linear). Cite a paper recipe, then tune.

Tie schedules to validation loss curves, not only train loss.

Vanishing & Exploding Gradients: 15 Interview Questions

61 What is the vanishing gradient problem? (Easy)

Answer: In backprop, gradients are products of terms through layers. If many factors are < 1 (e.g. saturated sigmoid derivatives), early-layer gradients → 0 and those weights barely update.

62 What are exploding gradients? (Easy)

Answer: Same product picture: factors > 1 repeatedly → huge gradients, unstable updates, NaNs. Common in RNNs unrolled over long sequences.

63 Why do sigmoid/tanh worsen vanishing? (Medium)

Answer: The sigmoid derivative is at most 0.25, and both sigmoid and tanh derivatives approach zero in saturation, so each layer shrinks the backward signal when stacked deeply.
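A rough numeric illustration of the shrinkage, assuming a chain of scalar sigmoid units and ignoring the weight factors that a real network would also multiply in. All names are illustrative.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def sigmoid_grad(z):
    """Derivative of the sigmoid; its maximum is 0.25 at z = 0."""
    s = sigmoid(z)
    return s * (1.0 - s)


def backward_scale(pre_activations):
    """Product of sigmoid derivatives along a chain of layers (weights ignored)."""
    scale = 1.0
    for z in pre_activations:
        scale *= sigmoid_grad(z)
    return scale


best_case = backward_scale([0.0] * 10)    # every unit at its maximum derivative
saturated = backward_scale([5.0] * 10)    # every unit in saturation
```

Even in the best case, ten layers scale the signal by 0.25¹⁰, which is below one in a million; saturated units shrink it far more.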

64 How does ReLU help (and one caveat)? (Medium)

Answer: Derivative is 1 for active neurons, so less shrinkage than sigmoid. Caveat: dead ReLUs still pass zero gradient.

65 How do residual connections improve gradient flow? (Medium)

Answer: y = F(x) + x adds an identity path; the gradient can bypass stacked layers via the +1 shortcut, which eases training of very deep nets.
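A scalar toy comparison of the two product structures. It assumes every block has the same small local derivative F′(x) = 0.1, which is an illustrative simplification rather than a model of a real network.

```python
def plain_chain_grad(local_derivs):
    """Backward scale through stacked layers y = F(x): product of F'(x)."""
    grad = 1.0
    for d in local_derivs:
        grad *= d
    return grad


def residual_chain_grad(local_derivs):
    """Backward scale through residual blocks y = F(x) + x: product of (F'(x) + 1)."""
    grad = 1.0
    for d in local_derivs:
        grad *= (d + 1.0)
    return grad


derivs = [0.1] * 20                       # assumed local derivatives, 20 layers deep
plain = plain_chain_grad(derivs)          # vanishes
residual = residual_chain_grad(derivs)    # the +1 term keeps the signal alive
```

The same +1 term means the product can also grow, which is one reason normalization and careful initialization are still used alongside residual connections.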

66 LSTM vs vanilla RNN for vanishing gradients. (Medium)

Answer: LSTM's cell state and additive updates with gates allow better long-range gradient flow than a simple tanh recurrence, where each step multiplies Jacobians.

67 GRU vs LSTM: gradient angle. (Easy)

Answer: GRU has fewer gates but a similar idea: gating and update blending to mitigate vanishing in sequences. Often comparable performance with less compute.

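68 Gradient clipping: norm vs value. (Medium)

Answer: Clip by norm: if ||g|| > threshold, scale g down so its norm equals the threshold; common in RNNs. Value clipping caps each element independently (which can change the gradient direction); less common. Either way, it stops one batch from destroying the weights.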
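Both clipping styles can be sketched on a flat list of gradient values. The function names are illustrative; frameworks provide their own utilities that operate on parameter tensors.

```python
import math


def clip_by_norm(grads, max_norm):
    """Rescale the whole gradient vector if its L2 norm exceeds max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm:
        return list(grads)
    scale = max_norm / norm
    return [g * scale for g in grads]       # direction preserved, length capped


def clip_by_value(grads, limit):
    """Clamp each element to [-limit, limit] independently."""
    return [max(-limit, min(limit, g)) for g in grads]


by_norm = clip_by_norm([3.0, 4.0], max_norm=1.0)      # norm 5 scaled down to 1
by_value = clip_by_value([3.0, -4.0, 0.5], limit=1.0)
```

Norm clipping keeps the 3:4 ratio between components, while value clipping flattens both large entries to the same magnitude.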

69 Does batch norm fix vanishing gradients? (Medium)

Answer: It stabilizes activations and can help optimization indirectly, but it is not a guarantee; deep nets still benefit from good init, ReLU, residuals.

70 Highway networks vs ResNets (brief). (Hard)

Answer: Highways use learned gates on skip vs transform; ResNet uses identity skip + simpler F(x). ResNet won for simplicity and performance in vision.

71 Do Transformers vanish like RNNs? (Medium)

Answer: Depth is finite and paths include residual + LN; attention mixes tokens in O(1) depth per layer. This is not the same T-step product as an unrolled RNN, but very deep stacks still need design care.

72 Link to weight initialization. (Easy)

Answer: Good init keeps activations in a range where derivatives aren't tiny everywhere, which reduces extreme products at the start of training.

73 How might you detect exploding gradients in logs? (Easy)

Answer: Sudden NaN loss, gradient norm spikes, weights blowing up. Watch the global gradient norm per step.

74 Mixed precision and loss scaling. (Hard)

Answer: Small gradients can underflow in fp16; loss scaling multiplies loss before backward so gradients stay in representable range, then unscales for the optimizer.
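The underflow and its fix can be demonstrated with the standard library's half-precision format in `struct`. The gradient value 1e-8 and the static scale 1024 are assumed numbers for illustration; real mixed-precision training typically adjusts the scale dynamically.

```python
import struct


def to_fp16(x):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


tiny_grad = 1e-8            # hypothetical true gradient, below fp16's smallest subnormal
loss_scale = 1024.0         # hypothetical static loss scale

unscaled = to_fp16(tiny_grad)                  # underflows to zero in fp16
scaled = to_fp16(tiny_grad * loss_scale)       # survives as a small fp16 value
recovered = scaled / loss_scale                # unscale in higher precision
```

Scaling the loss scales every gradient by the same factor (by linearity of the backward pass), so dividing by the scale before the optimizer step recovers the intended update up to fp16 rounding.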

75 One-sentence fix menu for interviews. (Easy)

Answer: Use ReLU, good init, BN/LN, residuals, gated RNNs or attention, and gradient clipping when training recurrent or unstable nets.

Always mention products along the graph; that is the core math story.
