Neural Networks: 15 Essential Q&A
Interview Prep

Backpropagation — 15 Interview Questions

Chain rule from loss to weights, reverse-mode efficiency, Jacobian–vector products, and why training memory grows with depth.


Tags: Chain rule, Reverse mode, Activations, Gradients
1 What is backpropagation? (Easy)
Answer: The standard method to compute ∂L/∂θ for all parameters by applying the chain rule backward from the loss through each layer. It is reverse-mode automatic differentiation on the network graph.
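A minimal sketch in PyTorch (the framework question 14 already assumes); the scalar values are illustrative. One scalar loss, one backward() call, and .grad holds the hand-derived partials:

```python
import torch

# Tiny model: loss = (w*x + b - y)^2 with scalar parameters.
w = torch.tensor(2.0, requires_grad=True)
b = torch.tensor(0.5, requires_grad=True)
x, y = torch.tensor(3.0), torch.tensor(1.0)

loss = (w * x + b - y) ** 2
loss.backward()                 # reverse-mode AD over the small graph

# Hand-derived: dL/dw = 2*(w*x + b - y)*x = 33, dL/db = 2*(w*x + b - y) = 11
print(w.grad, b.grad)           # tensor(33.) tensor(11.)
```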
2 State the chain rule for nested functions. (Easy)
Answer: If y = f(g(x)), then dy/dx = (df/dg)(dg/dx), evaluated at g(x) and x respectively. In higher dimensions the derivatives become Jacobians and the product becomes a matrix product applied in the same order.
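A quick numeric check of the same identity, assuming PyTorch and an arbitrary test point:

```python
import torch

x = torch.tensor(1.3, requires_grad=True)
y = torch.sin(x ** 2)            # y = f(g(x)) with f = sin, g = (.)**2
y.backward()

manual = torch.cos(x.detach() ** 2) * 2 * x.detach()   # (df/dg) * (dg/dx)
assert torch.allclose(x.grad, manual)
```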
3 Why reverse mode instead of forward mode for NNs? (Medium)
Answer: The loss is a scalar and we need its gradient with respect to millions of parameters. Reverse mode delivers all of those partials at roughly the cost of one forward plus one backward pass; forward mode would need one pass per parameter (input direction).
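A small illustration of that asymmetry, assuming PyTorch and illustrative layer sizes: one backward pass populates .grad for every parameter at once.

```python
import torch

model = torch.nn.Sequential(
    torch.nn.Linear(512, 512), torch.nn.ReLU(), torch.nn.Linear(512, 10))
n_params = sum(p.numel() for p in model.parameters())   # ~268k parameters

loss = model(torch.randn(64, 512)).sum()
loss.backward()      # one reverse sweep: every parameter now has its .grad filled

# Forward mode would need on the order of n_params directional (JVP) passes
# to reconstruct the same gradient vector.
```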
4 Order of operations in the backward pass. (Medium)
Answer: Traverse the layers from the output toward the input, propagating the adjoint (the gradient of the loss w.r.t. each node's output). At each node, multiply the local Jacobian into the incoming upstream gradient and pass the result on to the node's inputs.
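One way to watch that order, sketched with PyTorch module hooks (layer shapes are made up for the example):

```python
import torch

net = torch.nn.Sequential(
    torch.nn.Linear(4, 8), torch.nn.ReLU(), torch.nn.Linear(8, 1))

# Each submodule announces itself when its part of the backward pass runs.
for name, m in net.named_children():
    m.register_full_backward_hook(
        lambda mod, gin, gout, name=name: print("backward through layer", name))

x = torch.randn(2, 4, requires_grad=True)
net(x).sum().backward()      # prints layers 2, 1, 0: output layer first, input layer last
```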
5 Backward through z = Wx + b: gradients w.r.t. W, x, b. (Medium)
Answer: Given the upstream gradient g = ∂L/∂z, in the batched row layout z = xW + b (x is batch × in): ∂L/∂b = sum over the batch of g, ∂L/∂W = xᵀg, ∂L/∂x = gWᵀ. With column vectors (z = Wx + b) the transposes flip: ∂L/∂W = gxᵀ, ∂L/∂x = Wᵀg. Interviewers mainly check that your dimensions line up.
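A sketch that checks those formulas against autograd, assuming the row layout and arbitrary shapes; the (z * g).sum() trick just forces ∂L/∂z to equal g:

```python
import torch

x = torch.randn(8, 4, requires_grad=True)   # batch of 8, input dim 4
W = torch.randn(4, 3, requires_grad=True)
b = torch.randn(3, requires_grad=True)

z = x @ W + b
g = torch.randn_like(z)                     # stand-in upstream gradient dL/dz
(z * g).sum().backward()                    # loss whose dL/dz is exactly g

assert torch.allclose(W.grad, x.detach().T @ g, atol=1e-5)   # dL/dW = x^T g
assert torch.allclose(x.grad, g @ W.detach().T, atol=1e-5)   # dL/dx = g W^T
assert torch.allclose(b.grad, g.sum(0), atol=1e-5)           # dL/db = sum over batch
```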
6 Backward through ReLU. (Easy)
Answer: Pass the gradient through where the pre-activation is > 0, zero it elsewhere; it is an elementwise mask. At exactly zero any subgradient in [0, 1] is valid; most frameworks use 0.
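The mask in a few lines (PyTorch; the input values are illustrative, and the 0 at exactly zero reflects PyTorch's convention):

```python
import torch

x = torch.tensor([-2.0, -0.5, 0.0, 1.5, 3.0], requires_grad=True)
y = torch.relu(x)
y.backward(torch.ones_like(y))      # upstream gradient of all ones
print(x.grad)                       # tensor([0., 0., 0., 1., 1.])
```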
7 Jacobian–vector product (JVP) vs vector–Jacobian product (VJP). (Hard)
Answer: A JVP (Jv) pushes an input perturbation forward through the Jacobian (forward mode). A VJP (vᵀJ) pulls an output gradient backward, and it is what each layer implements in backprop. For a scalar loss, one chain of VJPs yields the entire gradient, which is why reverse mode is the efficient choice.
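Both products are exposed in torch.autograd.functional; a sketch on a small elementwise map (the function and shapes are arbitrary examples):

```python
import torch
from torch.autograd.functional import jvp, vjp

def f(x):
    return torch.sin(x) * x          # R^3 -> R^3

x = torch.randn(3)
v = torch.randn(3)

_, pushforward = jvp(f, x, v)        # J(x) @ v      (forward-mode flavor)
_, pullback    = vjp(f, x, v)        # v^T @ J(x)    (what backprop composes)
```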
8 Why does backprop need memory? (Medium)
Answer: To compute local derivatives at each layer you need forward activations (and sometimes intermediate tensors). Memory scales with network width, depth, and batch size.
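The storage requirement is visible in a custom autograd op: whatever the local Jacobian needs must be saved during forward and held until backward. A sketch of a hand-written ReLU in PyTorch:

```python
import torch

class MyReLU(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)        # forward activation kept alive until backward
        return x.clamp(min=0)

    @staticmethod
    def backward(ctx, grad_out):
        (x,) = ctx.saved_tensors        # needed to rebuild the local Jacobian (the mask)
        return grad_out * (x > 0)
```

Every layer does the moral equivalent of save_for_backward, which is why activation memory grows with depth, width, and batch size.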
9 Gradient checkpointing: what is the trade-off? (Hard)
Answer: Store only a subset of activations and recompute the rest during the backward pass. Less memory at the cost of extra compute (roughly one extra forward pass for the checkpointed segments); widely used for large models such as Transformers.
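A sketch with torch.utils.checkpoint (the block sizes are illustrative; the use_reentrant flag is the keyword recent PyTorch versions expect):

```python
import torch
from torch.utils.checkpoint import checkpoint

block = torch.nn.Sequential(
    torch.nn.Linear(1024, 1024), torch.nn.ReLU(),
    torch.nn.Linear(1024, 1024), torch.nn.ReLU())
x = torch.randn(32, 1024, requires_grad=True)

# Activations inside `block` are not stored; they are recomputed during backward.
y = checkpoint(block, x, use_reentrant=False)
y.sum().backward()
```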
10 A parameter is shared across layers: how do its gradients behave? (Medium)
Answer: The contributions from every path that uses the parameter add up (multivariate chain rule). A weight used twice contributes two terms to ∂L/∂w.
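A quick check of the summing behavior (PyTorch, toy scalars):

```python
import torch

w = torch.tensor(2.0, requires_grad=True)
x = torch.tensor(3.0)

y = w * x        # first use of w
z = w * y        # second use of w (sharing): z = w^2 * x
z.backward()

# Path 1 (direct use):  dz/dw holding y fixed = y = 6
# Path 2 (through y):   (dz/dy)(dy/dw) = w * x = 6
print(w.grad)    # tensor(12.) = 6 + 6
```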
11 How does backprop relate to vanishing gradients? (Medium)
Answer: Backprop multiplies Jacobians layer by layer; if many of those factors have norm below 1 (saturating activations, poorly scaled weights), the signal shrinks exponentially toward the early layers. The math is the same; the fixes are architectural (ReLU, residual connections, gating).
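A toy demonstration, assuming PyTorch and arbitrary width/depth: stacking saturating layers without skip connections drives the input gradient toward zero.

```python
import torch

x = torch.randn(1, 16, requires_grad=True)
h = x
for _ in range(50):                                   # 50 sigmoid layers, no skips
    h = torch.sigmoid(h @ (torch.randn(16, 16) * 0.5))
h.sum().backward()

print(x.grad.abs().max())   # typically vanishingly small: a product of 50 shrinking Jacobians
```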
12 Relationship to the computational graph. (Easy)
Answer: The network is a DAG of ops; the forward pass evaluates the nodes, and the backward pass applies the chain rule along the edges in reverse. Frameworks build this graph dynamically (eager execution with autograd) or statically.
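In PyTorch the recorded graph is visible through grad_fn (the exact node names are version-dependent):

```python
import torch

x = torch.randn(3, requires_grad=True)
y = (x * 2).sum()
print(y.grad_fn)                  # e.g. <SumBackward0 ...>, the node recorded for the sum
print(y.grad_fn.next_functions)   # edges back to the multiply's backward node
```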
13 Does standard training use second derivatives? (Medium)
Answer: No; SGD plus backprop uses first-order gradients only. Second-order (Hessian) methods exist but are expensive; approximations such as K-FAC exist and remain niche.
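Autograd can still produce second-order quantities on demand by differentiating through the first backward; a Hessian-vector product sketch on a toy function:

```python
import torch

x = torch.randn(5, requires_grad=True)
loss = (x ** 3).sum()

(g,) = torch.autograd.grad(loss, x, create_graph=True)   # first-order grad, kept in the graph
v = torch.randn(5)
(hv,) = torch.autograd.grad(g @ v, x)                     # H @ v via a second backward

assert torch.allclose(hv, 6 * x.detach() * v)             # Hessian of sum(x^3) is diag(6x)
```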
14 loss.backward() in PyTorch: what happens? (Easy)
Answer: It traverses the autograd graph backward from the loss, accumulating .grad on leaf tensors that require grad. Call zero_grad() between iterations unless you want the gradients to accumulate intentionally.
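The standard loop shape (model, data, and hyperparameters are placeholders):

```python
import torch

model = torch.nn.Linear(10, 1)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
x, y = torch.randn(32, 10), torch.randn(32, 1)

for _ in range(3):
    opt.zero_grad()                                       # otherwise .grad keeps accumulating
    loss = torch.nn.functional.mse_loss(model(x), y)
    loss.backward()                                       # fill .grad on all leaf parameters
    opt.step()
```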
15 Time complexity of forward vs backward (typical claim). (Medium)
Answer: For many networks the backward pass costs roughly 2× the multiply-adds of the forward pass (same asymptotic order; each linear layer needs two matrix products in the backward, one for the input gradient and one for the weight gradient). Constant factors depend on kernels, fusion, and the framework.
Practice backpropagating through one small two-layer network by hand once; it locks in the chain rule story. A sketch of that exercise follows.
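A worked version of that exercise, assuming PyTorch and arbitrary small shapes: run the forward pass, derive each gradient by hand in reverse order, and check against autograd.

```python
import torch

torch.manual_seed(0)
x = torch.randn(4, 3)
y = torch.randn(4, 2)
W1 = torch.randn(3, 5, requires_grad=True)
W2 = torch.randn(5, 2, requires_grad=True)

# Forward: linear -> ReLU -> linear -> mean squared error
h_pre = x @ W1
h = torch.relu(h_pre)
pred = h @ W2
loss = ((pred - y) ** 2).mean()
loss.backward()

# Manual backward, mirroring the reverse traversal layer by layer
g_pred = 2 * (pred.detach() - y) / pred.numel()   # dL/dpred for the mean over all elements
g_W2   = h.detach().T @ g_pred                    # dL/dW2
g_h    = g_pred @ W2.detach().T                   # dL/dh
g_hpre = g_h * (h_pre.detach() > 0)               # ReLU mask
g_W1   = x.T @ g_hpre                             # dL/dW1

assert torch.allclose(W2.grad, g_W2, atol=1e-6)
assert torch.allclose(W1.grad, g_W1, atol=1e-6)
```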

Quick review checklist

  • Define backprop; chain rule; why reverse mode.
  • Linear + ReLU local grads; memory and checkpointing.
  • Vanishing gradients as product of Jacobians; autograd API basics.