Neural Networks: 15 Essential Q&A
Interview Prep

Backpropagation — 15 Interview Questions

Chain rule from loss to weights, reverse-mode efficiency, Jacobian–vector products, and why training memory grows with depth.


Tags: Chain rule, Reverse mode, Activations, Gradients
1 What is backpropagation? (Easy)
Answer: The standard method to compute ∂L/∂θ for all parameters by applying the chain rule backward from the loss through each layer. It is reverse-mode automatic differentiation on the network graph.
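A minimal sketch in PyTorch (the framework question 14 already assumes); the scalar values are illustrative. One scalar loss, one backward() call, and .grad holds the hand-derived partials:

```python
import torch

# Tiny model: loss = (w*x + b - y)^2 with scalar parameters.
w = torch.tensor(2.0, requires_grad=True)
b = torch.tensor(0.5, requires_grad=True)
x, y = torch.tensor(3.0), torch.tensor(1.0)

loss = (w * x + b - y) ** 2
loss.backward()                 # reverse-mode AD over the small graph

# Hand-derived: dL/dw = 2*(w*x + b - y)*x = 33, dL/db = 2*(w*x + b - y) = 11
print(w.grad, b.grad)           # tensor(33.) tensor(11.)
```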
2 State the chain rule for nested functions. (Easy)
Answer: If y = f(g(x)), then dy/dx = (df/dg)(dg/dx), evaluated at g(x) and x respectively. In higher dimensions the derivatives become Jacobians and the product becomes a matrix product applied in the same order.
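A quick numeric check of the same identity, assuming PyTorch and an arbitrary test point:

```python
import torch

x = torch.tensor(1.3, requires_grad=True)
y = torch.sin(x ** 2)            # y = f(g(x)) with f = sin, g = (.)**2
y.backward()

manual = torch.cos(x.detach() ** 2) * 2 * x.detach()   # (df/dg) * (dg/dx)
assert torch.allclose(x.grad, manual)
```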
3 Why reverse mode instead of forward mode for NNs? (Medium)
Answer: The loss is a scalar and we need its gradient with respect to millions of parameters. Reverse mode delivers all of those partials at roughly the cost of one forward plus one backward pass; forward mode would need one pass per parameter (input direction).
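A small illustration of that asymmetry, assuming PyTorch and illustrative layer sizes: one backward pass populates .grad for every parameter at once.

```python
import torch

model = torch.nn.Sequential(
    torch.nn.Linear(512, 512), torch.nn.ReLU(), torch.nn.Linear(512, 10))
n_params = sum(p.numel() for p in model.parameters())   # ~268k parameters

loss = model(torch.randn(64, 512)).sum()
loss.backward()      # one reverse sweep: every parameter now has its .grad filled

# Forward mode would need on the order of n_params directional (JVP) passes
# to reconstruct the same gradient vector.
```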
4 Order of operations in the backward pass. (Medium)
Answer: Traverse the layers from the output toward the input, propagating the adjoint (the gradient of the loss w.r.t. each node's output). At each node, multiply the local Jacobian into the incoming upstream gradient and pass the result on to the node's inputs.
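One way to watch that order, sketched with PyTorch module hooks (layer shapes are made up for the example):

```python
import torch

net = torch.nn.Sequential(
    torch.nn.Linear(4, 8), torch.nn.ReLU(), torch.nn.Linear(8, 1))

# Each submodule announces itself when its part of the backward pass runs.
for name, m in net.named_children():
    m.register_full_backward_hook(
        lambda mod, gin, gout, name=name: print("backward through layer", name))

x = torch.randn(2, 4, requires_grad=True)
net(x).sum().backward()      # prints layers 2, 1, 0: output layer first, input layer last
```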
5 Backward through z = Wx + b: gradients w.r.t. W, x, b. (Medium)
Answer: Given the upstream gradient g = ∂L/∂z, in the batched row layout z = xW + b (x is batch × in): ∂L/∂b = sum over the batch of g, ∂L/∂W = xᵀg, ∂L/∂x = gWᵀ. With column vectors (z = Wx + b) the transposes flip: ∂L/∂W = gxᵀ, ∂L/∂x = Wᵀg. Interviewers mainly check that your dimensions line up.
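A sketch that checks those formulas against autograd, assuming the row layout and arbitrary shapes; the (z * g).sum() trick just forces ∂L/∂z to equal g:

```python
import torch

x = torch.randn(8, 4, requires_grad=True)   # batch of 8, input dim 4
W = torch.randn(4, 3, requires_grad=True)
b = torch.randn(3, requires_grad=True)

z = x @ W + b
g = torch.randn_like(z)                     # stand-in upstream gradient dL/dz
(z * g).sum().backward()                    # loss whose dL/dz is exactly g

assert torch.allclose(W.grad, x.detach().T @ g, atol=1e-5)   # dL/dW = x^T g
assert torch.allclose(x.grad, g @ W.detach().T, atol=1e-5)   # dL/dx = g W^T
assert torch.allclose(b.grad, g.sum(0), atol=1e-5)           # dL/db = sum over batch
```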
6 Backward through ReLU. (Easy)
Answer: Pass the gradient through where the pre-activation is > 0, zero it elsewhere; it is an elementwise mask. At exactly zero any subgradient in [0, 1] is valid; most frameworks use 0.
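The mask in a few lines (PyTorch; the input values are illustrative, and the 0 at exactly zero reflects PyTorch's convention):

```python
import torch

x = torch.tensor([-2.0, -0.5, 0.0, 1.5, 3.0], requires_grad=True)
y = torch.relu(x)
y.backward(torch.ones_like(y))      # upstream gradient of all ones
print(x.grad)                       # tensor([0., 0., 0., 1., 1.])
```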
7 Jacobian–vector product (JVP) vs vector–Jacobian product (VJP). (Hard)
Answer: A JVP (Jv) pushes an input perturbation forward through the Jacobian (forward mode). A VJP (vᵀJ) pulls an output gradient backward, and it is what each layer implements in backprop. For a scalar loss, one chain of VJPs yields the entire gradient, which is why reverse mode is the efficient choice.
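Both products are exposed in torch.autograd.functional; a sketch on a small elementwise map (the function and shapes are arbitrary examples):

```python
import torch
from torch.autograd.functional import jvp, vjp

def f(x):
    return torch.sin(x) * x          # R^3 -> R^3

x = torch.randn(3)
v = torch.randn(3)

_, pushforward = jvp(f, x, v)        # J(x) @ v      (forward-mode flavor)
_, pullback    = vjp(f, x, v)        # v^T @ J(x)    (what backprop composes)
```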
8 Why does backprop need memory? (Medium)
Answer: To compute local derivatives at each layer you need forward activations (and sometimes intermediate tensors). Memory scales with network width, depth, and batch size.
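The storage requirement is visible in a custom autograd op: whatever the local Jacobian needs must be saved during forward and held until backward. A sketch of a hand-written ReLU in PyTorch:

```python
import torch

class MyReLU(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)        # forward activation kept alive until backward
        return x.clamp(min=0)

    @staticmethod
    def backward(ctx, grad_out):
        (x,) = ctx.saved_tensors        # needed to rebuild the local Jacobian (the mask)
        return grad_out * (x > 0)
```

Every layer does the moral equivalent of save_for_backward, which is why activation memory grows with depth, width, and batch size.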
9 Gradient checkpointing: what is the trade-off? (Hard)
Answer: Store only a subset of activations and recompute the rest during the backward pass. Less memory at the cost of extra compute (roughly one extra forward pass for the checkpointed segments); widely used for large models such as Transformers.
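A sketch with torch.utils.checkpoint (the block sizes are illustrative; the use_reentrant flag is the keyword recent PyTorch versions expect):

```python
import torch
from torch.utils.checkpoint import checkpoint

block = torch.nn.Sequential(
    torch.nn.Linear(1024, 1024), torch.nn.ReLU(),
    torch.nn.Linear(1024, 1024), torch.nn.ReLU())
x = torch.randn(32, 1024, requires_grad=True)

# Activations inside `block` are not stored; they are recomputed during backward.
y = checkpoint(block, x, use_reentrant=False)
y.sum().backward()
```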
10 A parameter is shared across layers: how do its gradients behave? (Medium)
Answer: The contributions from every path that uses the parameter add up (multivariate chain rule). A weight used twice contributes two terms to ∂L/∂w.
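A quick check of the summing behavior (PyTorch, toy scalars):

```python
import torch

w = torch.tensor(2.0, requires_grad=True)
x = torch.tensor(3.0)

y = w * x        # first use of w
z = w * y        # second use of w (sharing): z = w^2 * x
z.backward()

# Path 1 (direct use):  dz/dw holding y fixed = y = 6
# Path 2 (through y):   (dz/dy)(dy/dw) = w * x = 6
print(w.grad)    # tensor(12.) = 6 + 6
```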
11 How does backprop relate to vanishing gradients? (Medium)
Answer: Backprop multiplies Jacobians layer by layer; if many of those factors have norm below 1 (saturating activations, poorly scaled weights), the signal shrinks exponentially toward the early layers. The math is the same; the fixes are architectural (ReLU, residual connections, gating).
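A toy demonstration, assuming PyTorch and arbitrary width/depth: stacking saturating layers without skip connections drives the input gradient toward zero.

```python
import torch

x = torch.randn(1, 16, requires_grad=True)
h = x
for _ in range(50):                                   # 50 sigmoid layers, no skips
    h = torch.sigmoid(h @ (torch.randn(16, 16) * 0.5))
h.sum().backward()

print(x.grad.abs().max())   # typically vanishingly small: a product of 50 shrinking Jacobians
```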
12 Relationship to the computational graph. (Easy)
Answer: The network is a DAG of ops; the forward pass evaluates the nodes, and the backward pass applies the chain rule along the edges in reverse. Frameworks build this graph dynamically (eager execution with autograd) or statically.
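In PyTorch the recorded graph is visible through grad_fn (the exact node names are version-dependent):

```python
import torch

x = torch.randn(3, requires_grad=True)
y = (x * 2).sum()
print(y.grad_fn)                  # e.g. <SumBackward0 ...>, the node recorded for the sum
print(y.grad_fn.next_functions)   # edges back to the multiply's backward node
```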
13 Does standard training use second derivatives? (Medium)
Answer: No; SGD plus backprop uses first-order gradients only. Second-order (Hessian) methods exist but are expensive; approximations such as K-FAC exist and remain niche.
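Autograd can still produce second-order quantities on demand by differentiating through the first backward; a Hessian-vector product sketch on a toy function:

```python
import torch

x = torch.randn(5, requires_grad=True)
loss = (x ** 3).sum()

(g,) = torch.autograd.grad(loss, x, create_graph=True)   # first-order grad, kept in the graph
v = torch.randn(5)
(hv,) = torch.autograd.grad(g @ v, x)                     # H @ v via a second backward

assert torch.allclose(hv, 6 * x.detach() * v)             # Hessian of sum(x^3) is diag(6x)
```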
14 loss.backward() in PyTorch: what happens? (Easy)
Answer: It traverses the autograd graph backward from the loss, accumulating .grad on leaf tensors that require grad. Call zero_grad() between iterations unless you want the gradients to accumulate intentionally.
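The standard loop shape (model, data, and hyperparameters are placeholders):

```python
import torch

model = torch.nn.Linear(10, 1)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
x, y = torch.randn(32, 10), torch.randn(32, 1)

for _ in range(3):
    opt.zero_grad()                                       # otherwise .grad keeps accumulating
    loss = torch.nn.functional.mse_loss(model(x), y)
    loss.backward()                                       # fill .grad on all leaf parameters
    opt.step()
```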
15 Time complexity of forward vs backward (typical claim). (Medium)
Answer: For many networks the backward pass costs roughly 2× the multiply-adds of the forward pass (same asymptotic order; each linear layer needs two matrix products in the backward, one for the input gradient and one for the weight gradient). Constant factors depend on kernels, fusion, and the framework.
Practice backpropagating through one small two-layer network by hand once; it locks in the chain rule story. A sketch of that exercise follows.
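A worked version of that exercise, assuming PyTorch and arbitrary small shapes: run the forward pass, derive each gradient by hand in reverse order, and check against autograd.

```python
import torch

torch.manual_seed(0)
x = torch.randn(4, 3)
y = torch.randn(4, 2)
W1 = torch.randn(3, 5, requires_grad=True)
W2 = torch.randn(5, 2, requires_grad=True)

# Forward: linear -> ReLU -> linear -> mean squared error
h_pre = x @ W1
h = torch.relu(h_pre)
pred = h @ W2
loss = ((pred - y) ** 2).mean()
loss.backward()

# Manual backward, mirroring the reverse traversal layer by layer
g_pred = 2 * (pred.detach() - y) / pred.numel()   # dL/dpred for the mean over all elements
g_W2   = h.detach().T @ g_pred                    # dL/dW2
g_h    = g_pred @ W2.detach().T                   # dL/dh
g_hpre = g_h * (h_pre.detach() > 0)               # ReLU mask
g_W1   = x.T @ g_hpre                             # dL/dW1

assert torch.allclose(W2.grad, g_W2, atol=1e-6)
assert torch.allclose(W1.grad, g_W1, atol=1e-6)
```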

Quick review checklist

  • Define backprop; chain rule; why reverse mode.
  • Linear + ReLU local grads; memory and checkpointing.
  • Vanishing gradients as product of Jacobians; autograd API basics.