Interview Q&A: 60 Questions

Neural Network Fundamentals — Interview Q&A

Perceptron, MLP, forward propagation, and computational graphs for neural networks.

Perceptron: 15 Interview Questions

1 What is the Rosenblatt perceptron? Easy

Answer: An early binary linear classifier: it forms a weighted sum of inputs plus a bias, then applies a threshold (step) to decide between two classes. It is historically important as a simple trainable "neuron" and the starting point for multi-layer networks.

2 Write the perceptron decision rule with labels in {−1, +1}. Easy

Answer: Compute the pre-activation (margin) s = w·x + b. Predict ŷ = sign(s) (with a convention for s = 0, e.g. treat as +1 or define a tie rule). Training adjusts w, b only when ŷ ≠ y.

ŷ = sign(w·x + b) with y ∈ {−1, +1}

3 What does "linearly separable" mean? Easy

Answer: Two classes are linearly separable if there exists a hyperplane w·x + b = 0 that puts all examples of one class strictly on one side and all examples of the other class strictly on the other. The perceptron can learn such a separator when the data are separable.

4 Why can a single perceptron not represent XOR? Medium

Answer: XOR in 2D is not linearly separable: no single line separates (0,0)/(1,1) from (0,1)/(1,0). A perceptron is exactly one linear decision boundary, so it cannot fit XOR without adding features or hidden layers (e.g. MLP).

Interview tip: Mention the Minsky/Papert context briefly; it motivates multi-layer networks.

5 State the perceptron learning rule for misclassified points. Medium

Answer: For labels y ∈ {−1, +1}, when (x, y) is misclassified, update w ← w + η y x and b ← b + η y (learning rate η > 0; often η = 1 in the classic algorithm). Correct points receive no update.

w := w + η y x , b := b + η y (on mistake only)
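The update rule above can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation: the function names, the tie rule (s = 0 counts as +1), the zero initialization, and the AND dataset are all assumptions chosen for the example.

```python
def predict(w, b, x):
    """Return +1 or -1; ties (s == 0) go to +1 by convention (an assumption)."""
    s = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if s >= 0 else -1


def train_perceptron(data, eta=1.0, max_epochs=100):
    """Mistake-driven updates; stops after the first epoch with zero mistakes."""
    w = [0.0] * len(data[0][0])
    b = 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for x, y in data:
            if predict(w, b, x) != y:
                # w := w + eta * y * x, b := b + eta * y (on mistake only)
                w = [wi + eta * y * xi for wi, xi in zip(w, x)]
                b += eta * y
                mistakes += 1
        if mistakes == 0:
            return w, b, True
    return w, b, False


# Logical AND with labels in {-1, +1}: a linearly separable toy problem.
AND_DATA = [((0, 0), -1), ((0, 1), -1), ((1, 0), -1), ((1, 1), 1)]
w, b, converged = train_perceptron(AND_DATA)
```

Because AND is linearly separable, the loop reaches an epoch with no mistakes and returns early; on XOR the same loop would run until max_epochs.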

6 When does the perceptron algorithm converge? Hard

Answer: If the data are linearly separable, the perceptron rule makes only a finite number of mistakes and then stops updating (Novikoff-style bounds). If the data are not separable, the updates never settle and can cycle indefinitely; use the pocket algorithm, an averaged perceptron, or a different model or loss.
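For reference, the usual statement of the Novikoff bound can be written as follows. The symbols R, γ, and w* are notation introduced here (they do not appear in the answer above), and the bias is assumed to be absorbed into w as an extra constant input.

```latex
\text{If } \|x_i\| \le R \text{ for all } i
\text{ and there exists } w^{*} \text{ with } \|w^{*}\| = 1
\text{ and } y_i \,(w^{*} \cdot x_i) \ge \gamma > 0 \text{ for all } i,
\quad\text{then}\quad
\#\text{mistakes} \;\le\; \left(\frac{R}{\gamma}\right)^{2}.
```

The bound depends only on the data radius and the margin, not on the number of examples or the input dimension.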

7 Why is the bias term important? Easy

Answer: Without b, every separating hyperplane must pass through the origin in feature space. The bias shifts the decision boundary so it can separate offset clouds of points. Often implemented as an extra input fixed at 1 with a weight w₀.

8 Step activation vs sigmoid for a "perceptron": what changes? Medium

Answer: The step gives a hard decision and zero gradient almost everywhere, so the classic perceptron uses discrete updates, not backprop through the step. Sigmoid is smooth, yields probabilities, and supports gradient-based training (logistic regression / neural nets with continuous loss).

9 How does a perceptron relate to logistic regression? Medium

Answer: Both use a linear score w·x + b. Logistic regression outputs sigmoid(score) as a probability and minimizes log loss with gradients. The perceptron uses a hard threshold and mistake-driven updates; same geometry, different output and training objective.

10 What is the margin of a correctly classified point? Medium

Answer: With y ∈ {−1, +1}, the (signed) margin is often written y (w·x + b). It is positive when the prediction is correct; for a fixed w, larger values mean the point is farther from the decision boundary (the geometric distance is y (w·x + b) / ‖w‖). Convergence proofs bound the number of mistakes using the margin and the norm of a separating vector.

11 Does feature scaling matter for the perceptron? Easy

Answer: The decision boundary is still linear, but scale differences across features can slow convergence or make updates dominated by large-magnitude inputs. Standardizing or scaling features often helps iterative algorithms behave more evenly in practice.

12 How can perceptrons be used for multi-class problems? Medium

Answer: Common reductions: one-vs-rest (one perceptron per class vs all others) or one-vs-one (pairwise classifiers). At prediction time, combine votes or scores. This is not softmax; mention softmax as the smooth multi-class alternative in neural nets.

13 What is the "pocket" algorithm idea? Hard

Answer: On noisy or non-separable data, standard perceptron updates may not stabilize. The pocket variant keeps the weight vector that achieved the lowest training error so far ("in your pocket") while continuing updates, returning the best snapshot instead of the last iterate.
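A minimal sketch of the pocket idea, assuming the standard mistake-driven update and a full pass over the training set to score each candidate. The helper names, the epoch count, and the XOR dataset are illustrative choices, not part of any library.

```python
def predict(w, b, x):
    s = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if s >= 0 else -1


def count_errors(w, b, data):
    """Number of training points the current (w, b) misclassifies."""
    return sum(1 for x, y in data if predict(w, b, x) != y)


def pocket_perceptron(data, eta=1.0, epochs=20):
    w, b = [0.0] * len(data[0][0]), 0.0
    # The "pocket" holds the best (w, b, errors) snapshot seen so far.
    best = (list(w), b, count_errors(w, b, data))
    for _ in range(epochs):
        for x, y in data:
            if predict(w, b, x) != y:
                w = [wi + eta * y * xi for wi, xi in zip(w, x)]
                b += eta * y
                errors = count_errors(w, b, data)
                if errors < best[2]:
                    best = (list(w), b, errors)
    return best


# XOR with labels in {-1, +1}: not linearly separable, so errors never reach 0.
XOR_DATA = [((0, 0), -1), ((0, 1), 1), ((1, 0), 1), ((1, 1), -1)]
best_w, best_b, best_errors = pocket_perceptron(XOR_DATA)
```

The plain perceptron would keep changing its weights on this data; the pocket version still returns the best snapshot it saw rather than whatever the last update produced.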

14 Perceptron vs linear SVM: one-minute comparison. Hard

Answer: Both learn linear separators. A linear SVM maximizes the margin (with slack variables in the soft-margin form) and, being a convex problem, has a unique optimal hyperplane. The perceptron stops at any separating hyperplane if one exists, so many solutions are possible and the result depends on initialization and data order. The max-margin solution usually generalizes better and extends to non-linear boundaries through kernels; the perceptron is simple and historically foundational.

15 How does stacking perceptrons lead to multi-layer networks? Medium

Answer: One perceptron = one linear boundary. Hidden layers of non-linear units compose boundaries: early layers can fold or combine half-spaces so later layers separate XOR-like patterns. That is the core idea of an MLP: depth + non-linearity overcomes single-layer limits.

MLP: h = σ(W₁x + b₁) → ŷ = σ(W₂h + b₂) (non-linear σ)

Multi-Layer Perceptron: 15 Interview Questions

16 What is a multi-layer perceptron (MLP)? Easy

Answer: An MLP is a feedforward neural network: layers of neurons where each layer is typically fully connected to the next, with non-linear activations between affine transforms. Information flows input → hidden(s) → output without cycles.

17 What does "feedforward" mean? Easy

Answer: Activations are computed in one direction only, from input to output. There are no recurrent edges that feed a layer's output back into earlier layers in the same forward pass (that would be an RNN or similar).

18 Why do we use hidden layers? Easy

Answer: Hidden layers let the network build intermediate representations. Stacking non-linear layers composes functions so the model can approximate non-linear boundaries (e.g. XOR) that a single linear layer cannot.

19 Depth vs width: how do interviewers expect you to compare them? Medium

Answer: Depth (more layers) increases compositional power and hierarchical features; can help sample efficiency for some tasks but risks optimization issues. Width (more units per layer) increases capacity per layer; very wide shallow nets can also approximate functions. Trade-offs: data, compute, vanishing gradients, and inductive bias.

20 How many parameters in a linear layer from d_in to d_out? Easy

Answer: Weights: d_in × d_out. Bias: d_out. Total d_in × d_out + d_out. Mention this scales quickly for large fully connected layers.

params = d_in × d_out + d_out
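The formula is easy to sanity-check in code. The sketch below assumes a plain stack of fully connected layers with biases; the layer sizes (784, 128, 10) are just an illustrative choice.

```python
def linear_params(d_in, d_out):
    """Weights (d_in * d_out) plus one bias per output unit."""
    return d_in * d_out + d_out


def mlp_params(layer_sizes):
    """Sum the per-layer counts over consecutive size pairs."""
    return sum(linear_params(a, b) for a, b in zip(layer_sizes, layer_sizes[1:]))


# 784 -> 128 -> 10: (784*128 + 128) + (128*10 + 10) = 100480 + 1290
total = mlp_params([784, 128, 10])
```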

21 What happens if you remove activations and only stack linear layers? Easy

Answer: A composition of linear (affine) maps is still affine. The entire deep stack collapses to a single affine transform, so it has no extra expressive power vs one linear layer.
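The collapse can be checked numerically: W₂(W₁x + b₁) + b₂ equals (W₂W₁)x + (W₂b₁ + b₂). The sketch below uses small integer matrices made up for the example and the column-vector convention z = Wx + b.

```python
def matvec(W, x):
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in W]


def matmul(A, B):
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]


def add(u, v):
    return [a + b for a, b in zip(u, v)]


W1, b1 = [[1, 2], [0, -1], [3, 1]], [1, 0, -2]   # layer 1: 2 -> 3
W2, b2 = [[1, 0, 2], [-1, 1, 0]], [0, 5]         # layer 2: 3 -> 2
x = [2, -3]

# Two stacked layers with no activation in between.
stacked = add(matvec(W2, add(matvec(W1, x), b1)), b2)

# One equivalent layer: W = W2 W1, b = W2 b1 + b2.
W = matmul(W2, W1)
b = add(matvec(W2, b1), b2)
collapsed = add(matvec(W, x), b)
```

Both paths give the same output for every x, which is why depth without a non-linearity adds nothing.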

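22 One hidden layer MLP: what can it approximate? Hard

Answer: With a suitable non-linearity (e.g. sigmoid or ReLU; any non-polynomial activation works) and enough hidden units, a single-hidden-layer MLP can approximate any continuous function on a compact domain to arbitrary accuracy (the universal approximation theorem). The theorem says nothing about how many units are needed or whether training finds them, so in practice depth, data, and optimization matter, not only width.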

23 How does a small MLP solve XOR? Medium

Answer: A hidden layer can form new features (e.g. AND-like combinations) so the output layer becomes linearly separable in that feature space. Classic example: 2 inputs â†’ small hidden layer with non-linearity â†’ output.

Sketch two hidden units as combining half-spaces; interviewers reward the intuition, not memorized exact weights.
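One concrete way to see the two-half-space intuition is a hand-wired network with step activations. The weights below are picked by hand for illustration (one hidden unit acts like OR, the other like AND); they are one of many valid choices, not learned values.

```python
def step(z):
    return 1 if z > 0 else 0


def xor_mlp(x1, x2):
    # Hidden layer: two half-spaces over the inputs.
    h_or = step(x1 + x2 - 0.5)    # fires if at least one input is 1
    h_and = step(x1 + x2 - 1.5)   # fires only if both inputs are 1
    # Output layer: "OR but not AND" is linearly separable in (h_or, h_and).
    return step(h_or - 2 * h_and - 0.5)


truth_table = {(a, b): xor_mlp(a, b) for a in (0, 1) for b in (0, 1)}
```

In the hidden feature space the four inputs map to (0,0), (1,0), (1,0), and (1,1), and a single line separates (1,0) from the other two points.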

24 How does a batch dimension change MLP math? Medium

Answer: For batch size B, input is B × d_in. Linear layer: Y = XW + b (broadcast bias). The same weights apply to all batch rows, which is why matrix multiply is efficient on GPUs.

25 When would you prefer a CNN over an MLP for images? Medium

Answer: Images have local structure and translation patterns. CNNs use shared local filters, giving far fewer parameters and a better inductive bias. A flat MLP on pixels ignores locality and scales poorly with resolution.

26 Why do large MLPs overfit easily? Medium

Answer: High parameter count vs data lets the network memorize noise. Mitigate with regularization (L2, dropout), more data, early stopping, or architecture better matched to the problem.

27 Why does weight initialization matter in deep MLPs? Hard

Answer: Poor scaling can make activations explode or vanish layer-to-layer, giving useless gradients. Schemes like Xavier/He set variance based on fan-in/fan-out to keep signal scale stable at initialization.
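As a rough sketch, the two schemes differ only in the standard deviation used to draw weights. The formulas below are the commonly quoted normal-distribution variants (Xavier/Glorot: sqrt(2 / (fan_in + fan_out)); He: sqrt(2 / fan_in)); the function names and the seeded generator are assumptions for the example.

```python
import math
import random


def xavier_std(fan_in, fan_out):
    """Glorot/Xavier normal: balances forward and backward signal variance."""
    return math.sqrt(2.0 / (fan_in + fan_out))


def he_std(fan_in):
    """He normal: compensates for ReLU zeroing roughly half the units."""
    return math.sqrt(2.0 / fan_in)


def init_weights(fan_in, fan_out, std, rng):
    """Draw a fan_in x fan_out weight matrix from N(0, std^2)."""
    return [[rng.gauss(0.0, std) for _ in range(fan_out)] for _ in range(fan_in)]


rng = random.Random(0)
W = init_weights(512, 256, he_std(512), rng)
```

Frameworks ship their own versions of these initializers; the point here is only how the scale depends on fan-in and fan-out.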

28 Typical output layer for multi-class classification? Easy

Answer: Linear logits followed by softmax (often applied inside loss for numerical stability). Training uses cross-entropy on probabilities or logits with log-softmax.

29 MLP for regression: output and loss? Easy

Answer: Often a linear output (no squashing) with MSE or Huber loss for real-valued targets. For bounded outputs you might use sigmoid scaling or tanh to a range.

30 When is an MLP still a reasonable first choice? Medium

Answer: Tabular or fixed-length feature vectors without strong spatial or sequential structure, baselines, or as a component inside larger models. For sequences use RNN/Transformer; for grids use CNN.

Forward Propagation: 15 Interview Questions

31 What is forward propagation? Easy

Answer: Computing the network's output from the input by applying each layer in order (affine transforms, biases, activations, pooling, etc.) with no weight updates. During training it produces the predictions that feed the loss; at deploy time it is pure inference.

32 Forward vs backward pass in one sentence each. Easy

Answer: Forward: compute outputs and the loss, usually caching intermediates for the backward pass. Backward: apply the chain rule to get gradients for learning. Forward does not change weights; backward supplies the update signal.

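33 One step of an MLP layer in forward form. Easy

Answer: z = Wx + b, then a = f(z) for activation f. Here x is a single column vector and W has shape d_out × d_in. For a batch, examples are stacked as rows of X and the same weights apply to every row (written Y = XW + b when W is stored as d_in × d_out).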

z = Wx + b, a = f(z)

34 Shape of X, W, and output for a batched linear layer. Medium

Answer: X: B × d_in, W: d_in × d_out, bias b: d_out (broadcast). Output Y: B × d_out with Y = XW + b (row-wise).
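The shapes can be traced with a tiny pure-Python layer. This is a sketch using nested lists rather than a tensor library, with made-up numbers for B = 2, d_in = 3, d_out = 2.

```python
def linear_forward(X, W, b):
    """X: B x d_in, W: d_in x d_out, b: d_out  ->  Y: B x d_out."""
    cols = list(zip(*W))  # columns of W, one per output unit
    return [
        [sum(x * w for x, w in zip(row, col)) + bj for col, bj in zip(cols, b)]
        for row in X
    ]


X = [[1.0, 2.0, 3.0], [0.0, -1.0, 1.0]]      # B = 2, d_in = 3
W = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]     # d_in = 3, d_out = 2
b = [0.5, -0.5]                              # broadcast across both rows

Y = linear_forward(X, W, b)
```

Each row of Y depends only on the matching row of X, which is what makes the batch dimension trivially parallel.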

35 Rough FLOPs for matrix multiply A (m×k) · B (k×n)? Medium

Answer: Dominant term is O(m·k·n) multiply-adds (often quoted as ~2mkn FLOPs if counting mul+add separately). Used to reason about layer cost in forward pass.
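A back-of-the-envelope helper, assuming the ~2mkn convention and ignoring the bias add and activation cost. The example sizes (batch 32, 784 inputs, 128 outputs) are illustrative.

```python
def matmul_flops(m, k, n):
    """Approximate FLOPs for (m x k) @ (k x n), counting mul and add separately."""
    return 2 * m * k * n


# Forward cost of one linear layer: B = 32 rows, d_in = 784, d_out = 128.
flops = matmul_flops(32, 784, 128)
```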

36 Why must layers be applied in a fixed order? Easy

Answer: Each layer's input is the previous layer's output. Reordering changes the composed function entirely unless the architecture is specially designed (e.g. parallel branches with merge).

37 What activations are often cached during forward pass in training? Medium

Answer: Pre-activations z and post-activations a (plus the batch statistics used by BatchNorm) so backprop can compute local gradients without recomputing everything. Frameworks handle this in autograd.

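38 How does eval() / inference mode change forward behavior? Medium

Answer: Dropout is disabled (with inverted dropout the layer becomes the identity; the older formulation scales activations at test time instead). BatchNorm uses running mean/var instead of batch statistics. Gradient tracking is not needed for inference, but in PyTorch eval() does not turn it off; wrap the call in torch.no_grad() or torch.inference_mode() to save memory and compute.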
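The train/eval difference for dropout can be sketched without a framework. The version below assumes inverted dropout (survivors are scaled by 1/(1 − p) during training so that eval is the identity); the function signature is hypothetical, not a library API.

```python
import random


def dropout(x, p, training, rng):
    """Inverted dropout: drop each unit with probability p while training."""
    if not training or p == 0.0:
        return list(x)  # eval mode: identity, no rescaling needed
    keep = 1.0 - p
    return [xi / keep if rng.random() < keep else 0.0 for xi in x]


rng = random.Random(0)
x = [1.0, 2.0, 3.0, 4.0]
train_out = dropout(x, p=0.5, training=True, rng=rng)
eval_out = dropout(x, p=0.5, training=False, rng=rng)
```

In training mode each element is either zeroed or doubled (for p = 0.5), so the expected value matches the eval output.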

39 Why subtract max before softmax in practice? Hard

Answer: Logits can be large, so e^z overflows. z' = z − max(z) shifts the logits without changing the softmax output (softmax is shift-invariant) but keeps every exponent at most 0, so the exponentials stay in (0, 1]: numerical stability.
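A minimal sketch of the max-subtraction trick using only the standard library. With logits around 1000, a naive math.exp(z) raises OverflowError in Python, while the shifted version works and gives the same result as the small-logit case.

```python
import math


def softmax(logits):
    """Numerically stable softmax: shift by the max before exponentiating."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]  # every exponent is <= 0
    total = sum(exps)
    return [e / total for e in exps]


big = softmax([1000.0, 1001.0, 1002.0])
small = softmax([0.0, 1.0, 2.0])
```

Both inputs differ by a constant shift of 1000, so the two outputs are identical.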

40 Forward pass for batch size 1 vs large B: same code path? Easy

Answer: Usually yes: B=1 is a degenerate batch and matrix ops still work. Some ops (e.g. BN) behave differently with a tiny batch size; that's a practical caveat.

41 What drives memory during forward (training)? Medium

Answer: Storing activations for backprop, plus optimizer state if updating. Wider/deeper nets and larger batches increase activation memory, which is often the bottleneck before weights.

42 "Functional" forward: what does it mean in frameworks? Medium

Answer: Applying ops with explicit weight tensors passed in (e.g. F.linear(x, W, b)) instead of nn.Module parameters: same math, useful for meta-learning or custom graphs.

43 Mixed precision forward: what changes? Hard

Answer: Many ops run in float16/bfloat16 for speed; sensitive reductions (loss, BN) may stay in float32. Loss scaling can help with small gradients that would underflow in low precision (mainly float16, whose exponent range is narrow; bfloat16 usually does not need it).

44 Exported model "inference graph": relation to forward pass? Medium

Answer: It is a frozen forward computation graph (no backward), optimized for deployment: same layer order as the training forward pass, possibly with fused ops.

45 Walk through a 3-layer MLP forward from x to class probs. Easy

Answer: x → h1 = f(W1x+b1) → h2 = f(W2h1+b2) → logits = W3h2+b3 → probs = softmax(logits). Mention that the hidden activation f stops before the logits: the last layer is purely affine, and softmax only normalizes the logits into probabilities.

Draw arrows on a whiteboard; interviewers check that you separate the linear blocks from f and softmax.
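The walk-through maps directly onto code. The sketch below assumes ReLU for f, the column-vector convention z = Wx + b, and small hand-picked weights; none of these specifics come from the answer above.

```python
import math


def affine(W, x, b):
    return [sum(wij * xj for wij, xj in zip(row, x)) + bi for row, bi in zip(W, b)]


def relu(v):
    return [max(0.0, a) for a in v]


def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def mlp_forward(x, params):
    (W1, b1), (W2, b2), (W3, b3) = params
    h1 = relu(affine(W1, x, b1))      # hidden layer 1: affine + f
    h2 = relu(affine(W2, h1, b2))     # hidden layer 2: affine + f
    logits = affine(W3, h2, b3)       # output layer: affine only, no f
    return logits, softmax(logits)


params = [
    ([[1.0, -1.0], [0.5, 0.5]], [0.0, 0.0]),
    ([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0]),
    ([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]], [0.0, 0.0, 0.0]),
]
logits, probs = mlp_forward([1.0, 2.0], params)
```

With these weights the first hidden unit is clipped to zero by ReLU, so the three logits come out as 0, 1.5, and 1.5 before softmax normalizes them.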

Computational Graphs & Autodiff: 15 Interview Questions

46 What is a computational graph? Easy

Answer: A directed acyclic graph (DAG) representing a function: nodes are variables or operations, edges show data flow. Used to evaluate the function and (with autodiff) derivatives systematically.

47 Nodes vs edges: typical assignment. Easy

Answer: Nodes: tensors after an op, or the op itself (depends on framework representation). Edges: which outputs feed which inputs. The graph encodes dependencies for topological order.

48 What is automatic differentiation? Easy

Answer: Computes exact derivatives (up to floating point) by applying the chain rule to each elementary operation along the graph. It is neither numerical finite differencing nor symbolic manipulation of the whole expression.

49 Forward-mode autodiff: when useful? Medium

Answer: Pushes directional derivatives forward; costs scale with number of inputs. Useful when few inputs and many outputs (rare for standard NN training vs reverse mode).
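Forward mode is often explained with dual numbers: each value carries its derivative along one input direction. The sketch below is a toy class supporting only + and *, with an example function f(x, y) = x·y + x chosen for illustration.

```python
class Dual:
    """A value plus its tangent (derivative along the seeded direction)."""

    def __init__(self, value, tangent=0.0):
        self.value = value
        self.tangent = tangent

    def __add__(self, other):
        return Dual(self.value + other.value, self.tangent + other.tangent)

    def __mul__(self, other):
        # Product rule: (uv)' = u'v + uv'
        return Dual(
            self.value * other.value,
            self.tangent * other.value + self.value * other.tangent,
        )


def f(x, y):
    return x * y + x


# One forward pass per input direction: seed tangent 1 on the input of interest.
out_x = f(Dual(3.0, 1.0), Dual(4.0, 0.0))   # df/dx = y + 1
out_y = f(Dual(3.0, 0.0), Dual(4.0, 1.0))   # df/dy = x
```

Two inputs needed two passes here, which is the cost scaling the answer refers to: with millions of parameters, forward mode would need millions of passes for a full gradient.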

50 Reverse-mode autodiff: why dominant in ML? Medium

Answer: One scalar loss, millions of parameters: reverse mode gets the full gradient in O(graph size) time, roughly a small constant multiple of one forward pass. This is backpropagation.

51 Why must the graph be acyclic? Medium

Answer: Standard autodiff needs a clear topological order to schedule the forward and backward passes. RNNs are unrolled in time, which yields a DAG over steps (backpropagation through time); true cycles need special handling such as implicit differentiation.

52 Eager execution vs define-then-run. Medium

Answer: Eager: ops execute immediately and the autograd graph is built as Python runs (PyTorch default). Static: trace or compile a full graph first (older TF graphs, torch.compile, XLA), which enables fusion and deployment optimizations.

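53 Leaf vs non-leaf tensors (PyTorch mental model). Medium

Answer: Leaf tensors are created directly by the user (parameters, inputs) rather than by a tracked operation; intermediates produced by ops are non-leaf. After backward(), .grad is populated by default only on leaves with requires_grad=True (call retain_grad() on a non-leaf to keep its gradient). retain_graph=True keeps the graph alive for multiple backward calls.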

54 What does detach() do? Medium

Answer: Returns a tensor that shares the same data but is cut off from the graph, so no gradient flows back through it. Used to freeze parts of the model or treat values as constants.

55 stop_gradient in TensorFlow: same idea? Easy

Answer: Yes, it blocks gradients through that path; common in GANs, reinforcement learning tricks, or fixed targets.

56 Custom autograd Function: what must you implement? Hard

Answer: Forward computes outputs; backward receives the gradient w.r.t. outputs and returns gradients w.r.t. each differentiable input. It must be mathematically consistent with the forward op.

57 Higher-order derivatives: does the graph recurse? Hard

Answer: Frameworks can build a graph over gradient computations (create_graph=True in PyTorch) for Hessian-vector products; memory and cost grow quickly.

58 Why are in-place ops dangerous with autograd? Medium

Answer: They can overwrite values still needed for backward. Frameworks error or warn when tensor versions mismatch; prefer out-of-place ops when tensors require grad.

59 Autodiff vs symbolic differentiation vs numeric finite differences. Medium

Answer: Symbolic: algebra rules, expression swell. Finite diff: cheap to code, inaccurate and slow for high-D. Autodiff: exact, efficient for ML-scale graphs.
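The accuracy and cost trade-off of finite differences is easy to show on a one-dimensional example. The function f(x) = x³, the step size h, and the tolerance are illustrative assumptions; the exact derivative stands in for what autodiff would return.

```python
def f(x):
    return x ** 3


def df_exact(x):
    """Closed-form derivative, i.e. the value autodiff would produce."""
    return 3 * x ** 2


def central_diff(fn, x, h=1e-5):
    """Central finite difference: two function evaluations per input dimension."""
    return (fn(x + h) - fn(x - h)) / (2 * h)


x = 2.0
approx = central_diff(f, x)
exact = df_exact(x)
error = abs(approx - exact)
```

The estimate is close but not exact, and for a model with n parameters it would need about 2n forward passes, which is why finite differences are mostly used to spot-check gradients rather than to train.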

60 Inference graph vs training graph. Easy

Answer: Inference drops backward nodes and anything only needed for gradients, so the graph is smaller and faster. Export formats (ONNX, TorchScript) target forward-only execution.

Draw a tiny graph (mul, add) and label ∂L/∂x on paper; it is a classic interview sanity check.
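The same tiny graph can be run as a toy reverse-mode autodiff. The Value class below is a hypothetical minimal sketch (only mul and add, scalar values), loosely in the style of small teaching autograd engines, and is not any framework's API. The example function is L = x·y + x.

```python
class Value:
    """A scalar node that remembers its parents and how to backpropagate."""

    def __init__(self, data, parents=()):
        self.data = data
        self.grad = 0.0
        self.parents = parents
        self.backward_fn = None  # leaves have nothing to propagate

    def __mul__(self, other):
        out = Value(self.data * other.data, (self, other))

        def backward_fn():
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad

        out.backward_fn = backward_fn
        return out

    def __add__(self, other):
        out = Value(self.data + other.data, (self, other))

        def backward_fn():
            self.grad += out.grad
            other.grad += out.grad

        out.backward_fn = backward_fn
        return out

    def backward(self):
        # Topological order via depth-first search, then walk it in reverse.
        order, seen = [], set()

        def visit(node):
            if id(node) not in seen:
                seen.add(id(node))
                for parent in node.parents:
                    visit(parent)
                order.append(node)

        visit(self)
        self.grad = 1.0  # dL/dL
        for node in reversed(order):
            if node.backward_fn is not None:
                node.backward_fn()


x, y = Value(3.0), Value(4.0)
L = x * y + x
L.backward()
```

Note that x feeds two nodes (the mul and the add), so its gradient accumulates from both paths: ∂L/∂x = y + 1, while ∂L/∂y = x.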
