Neural Networks: 15 Essential Q&A
Interview Prep

Forward Propagation — 15 Interview Questions

From input to logits: layer order, tensor shapes, batching, inference vs training mode, and how interviewers test your mental model of the forward pass.


Topics: Inference, Shapes, FLOPs, Activations
1. What is forward propagation? (Easy)
Answer: Computing the network’s output from the input by applying each layer in order: affine transforms, biases, activations, pooling, etc., with no weight updates. During training the result feeds the loss; at deploy time it is pure inference.
2. Forward vs backward pass in one sentence each. (Easy)
Answer: Forward: compute outputs and (usually) cache intermediates for loss. Backward: apply chain rule to get gradients for learning. Forward does not change weights; backward supplies the update signal.
3. One step of an MLP layer in forward form. (Easy)
Answer: z = Wx + b, then a = f(z) for activation f. For a batch, X is stacked rows and the same W applies to each.
z = Wx + b,  a = f(z)
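As a concrete reference, a minimal NumPy sketch of that single step is below; ReLU stands in for the generic activation f, and the names x, W, b follow the column-vector convention above (all purely illustrative).

    import numpy as np

    def layer_forward(x, W, b):
        """One MLP layer: affine transform z = Wx + b, then activation a = f(z)."""
        z = W @ x + b             # pre-activation, W has shape (d_out, d_in)
        a = np.maximum(z, 0)      # post-activation with ReLU as the example f
        return a

    x = np.random.randn(4)        # d_in = 4
    W = np.random.randn(3, 4)     # d_out = 3
    b = np.zeros(3)
    print(layer_forward(x, W, b).shape)   # (3,)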
4. Shape of X, W, and the output for a batched linear layer. (Medium)
Answer: X: B × d_in, W: d_in × d_out, bias b: d_out (broadcast). Output Y: B × d_out with Y = XW + b (row-wise).
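A quick shape check in NumPy (the sizes 32/128/64 are arbitrary; the point is the broadcast bias and the B × d_out result):

    import numpy as np

    B, d_in, d_out = 32, 128, 64
    X = np.random.randn(B, d_in)     # batch of row vectors
    W = np.random.randn(d_in, d_out)
    b = np.zeros(d_out)              # broadcast across the B rows

    Y = X @ W + b                    # row-wise: every sample hits the same W
    assert Y.shape == (B, d_out)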
5. Rough FLOPs for a matrix multiply A (m×k) · B (k×n)? (Medium)
Answer: Dominant term is O(m·k·n) multiply-adds (often quoted as ~2mkn FLOPs if counting mul+add separately). Used to reason about layer cost in forward pass.
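A back-of-envelope helper makes this concrete (the example sizes are made up):

    def matmul_flops(m, k, n):
        """Approximate FLOPs for an (m x k) @ (k x n) product, counting mul + add."""
        return 2 * m * k * n

    # e.g. a batched linear layer with B=32, d_in=1024, d_out=4096
    print(matmul_flops(32, 1024, 4096))   # 268,435,456, roughly 0.27 GFLOPs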
6. Why must layers be applied in a fixed order? (Easy)
Answer: Each layer’s input is the previous layer’s output. Reordering changes the composed function entirely unless the architecture is specially designed (e.g. parallel branches with merge).
7. What activations are often cached during the forward pass in training? (Medium)
Answer: Pre-activations z and post-activations a (plus the inputs BatchNorm needs for its statistics), so backprop can compute local gradients without recomputing everything. Frameworks handle this in autograd.
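To make the caching explicit, here is a hand-rolled sketch of what autograd does behind the scenes; returning the cache alongside the output is purely illustrative:

    import numpy as np

    def linear_relu_forward(x, W, b):
        """Forward for one layer, keeping the values the backward pass will need."""
        z = x @ W + b              # pre-activation (cached)
        a = np.maximum(z, 0)       # post-activation (cached)
        cache = (x, W, z)          # local gradients only need these, no recomputation
        return a, cache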
8. How does eval() / inference mode change forward behavior? (Medium)
Answer: Dropout disabled (or scaled). BatchNorm uses running mean/var not batch stats. No gradient tracking needed—saves memory and compute.
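A minimal PyTorch sketch of the inference-mode pattern (the toy layer sizes are arbitrary):

    import torch
    from torch import nn

    model = nn.Sequential(nn.Linear(8, 16), nn.BatchNorm1d(16), nn.ReLU(),
                          nn.Dropout(p=0.5), nn.Linear(16, 4))
    batch = torch.randn(32, 8)

    model.eval()                   # dropout disabled, BatchNorm uses running stats
    with torch.no_grad():          # no autograd graph: saves memory and compute
        preds = model(batch)
    print(preds.shape)             # torch.Size([32, 4])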
9. Why subtract the max before softmax in practice? (Hard)
Answer: Logits can be large, so e^z overflows. z' = z − max(z) shifts the logits without changing the softmax output (softmax is shift-invariant) while keeping the exponentials bounded: numerical stability.
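A stable softmax in NumPy, showing the shift leaves the output unchanged while a naive version would overflow on these logits:

    import numpy as np

    def stable_softmax(z):
        z = z - np.max(z, axis=-1, keepdims=True)   # largest exponent is now exp(0) = 1
        e = np.exp(z)
        return e / e.sum(axis=-1, keepdims=True)

    print(stable_softmax(np.array([1000.0, 1001.0, 1002.0])))   # finite, sums to 1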
10. Forward pass for batch size 1 vs a large B: same code path? (Easy)
Answer: Usually yes—B=1 is a degenerate batch; matrix ops still work. Some ops (e.g. BN) behave differently with tiny batch size; that’s a practical caveat.
11. What drives memory during the forward pass (training)? (Medium)
Answer: Storing activations for backprop, plus optimizer state if updating. Wider/deeper nets and larger batch increase activation memory—often the bottleneck before weights.
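A back-of-envelope estimate for one cached activation tensor (float32, illustrative sizes); the total grows with depth because every layer's output is kept for backprop:

    B, d_out = 256, 4096
    bytes_per_value = 4                                   # float32
    activation_mb = B * d_out * bytes_per_value / 1e6
    print(f"{activation_mb:.1f} MB per cached tensor")    # ~4.2 MB for this one layer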
12. “Functional” forward: what does it mean in frameworks? (Medium)
Answer: Applying ops with explicit weight tensors passed in (e.g. F.linear(x, W, b)) instead of nn.Module parameters—same math, useful for meta-learning or custom graphs.
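A small PyTorch example of the functional form; note that F.linear expects the weight as (d_out, d_in):

    import torch
    import torch.nn.functional as F

    x = torch.randn(32, 128)
    W = torch.randn(64, 128, requires_grad=True)   # (d_out, d_in)
    b = torch.zeros(64, requires_grad=True)

    y = F.linear(x, W, b)    # same math as nn.Linear(128, 64), weights passed explicitly
    print(y.shape)           # torch.Size([32, 64])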
13. Mixed-precision forward: what changes? (Hard)
Answer: Many ops run in float16/bfloat16 for speed; sensitive reductions (loss, BN) may stay in float32. Loss scaling can help with small gradients in low precision.
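A minimal autocast sketch; on GPU the low-precision dtype is typically float16, here bfloat16 on CPU so the snippet runs anywhere (loss scaling with GradScaler belongs to the backward pass and is not shown):

    import torch
    from torch import nn

    model = nn.Sequential(nn.Linear(128, 256), nn.ReLU(), nn.Linear(256, 10))
    x = torch.randn(32, 128)

    with torch.autocast(device_type="cpu", dtype=torch.bfloat16):
        logits = model(x)    # matmul-heavy ops run in low precision inside this block
    print(logits.dtype)      # torch.bfloat16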
14. Exported model “inference graph”: relation to the forward pass? (Medium)
Answer: It is a frozen forward computation graph (no backward), optimized for deployment—same layer order as training forward, possibly fused ops.
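One export route as a sketch: TorchScript tracing records a single forward pass and freezes it (ONNX export is an alternative; the file name is illustrative):

    import torch
    from torch import nn

    model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 4)).eval()
    example = torch.randn(1, 8)

    traced = torch.jit.trace(model, example)   # frozen forward graph, no backward
    traced.save("mlp_inference.pt")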
15. Walk through a 3-layer MLP forward from x to class probabilities. (Easy)
Answer: x → h1 = f(W1x+b1) → h2 = f(W2h1+b2) → logits = W3h2+b3 → probs = softmax(logits). Note that the last layer outputs raw logits: no hidden activation sits between it and the softmax.
Draw arrows on a whiteboard—interviewers check you separate linear blocks from f and softmax.
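A compact NumPy version of the full walkthrough, using the row-batch convention (x @ W) and ReLU as f; all shapes are toy values:

    import numpy as np

    def softmax(z):
        z = z - z.max(axis=-1, keepdims=True)
        e = np.exp(z)
        return e / e.sum(axis=-1, keepdims=True)

    def mlp_forward(x, params, f=lambda z: np.maximum(z, 0)):
        """Two nonlinear hidden layers, a final linear layer, then softmax."""
        W1, b1, W2, b2, W3, b3 = params
        h1 = f(x @ W1 + b1)
        h2 = f(h1 @ W2 + b2)
        logits = h2 @ W3 + b3            # no activation on the last linear layer
        return softmax(logits)

    rng = np.random.default_rng(0)       # toy shapes: 4 -> 8 -> 8 -> 3 classes
    params = (rng.normal(size=(4, 8)), np.zeros(8),
              rng.normal(size=(8, 8)), np.zeros(8),
              rng.normal(size=(8, 3)), np.zeros(3))
    probs = mlp_forward(rng.normal(size=(2, 4)), params)
    print(probs.sum(axis=-1))            # each row sums to 1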

Quick review checklist

  • Define forward vs backward; one MLP layer: z, a, batch shapes.
  • Training caches; eval mode: dropout off, BN running stats.
  • Softmax stability; FLOPs order for matmul; memory = activations.