Interview Q&A: 60 Questions

Neural Network Fundamentals — Interview Q&A

Perceptron, MLP, forward propagation, and computational graphs for neural networks.

Perceptron — 15 Interview Questions

1 What is the Rosenblatt perceptron? Easy
Answer: An early binary linear classifier: it forms a weighted sum of inputs plus a bias, then applies a threshold (step) to decide between two classes. It is historically important as a simple trainable “neuron” and the starting point for multi-layer networks.
2 Write the perceptron decision rule with labels in {−1, +1}. Easy
Answer: Compute the pre-activation (margin) s = w·x + b. Predict ŷ = sign(s) (with a convention for s = 0, e.g. treat as +1 or define a tie rule). Training adjusts w, b only when ŷ ≠ y.
ŷ = sign(w·x + b) with y ∈ {−1, +1}
3 What does “linearly separable” mean? Easy
Answer: Two classes are linearly separable if there exists a hyperplane w·x + b = 0 that puts all examples of one class strictly on one side and the other class on the other. The perceptron can learn such a separator when data are separable.
4 Why can a single perceptron not represent XOR? Medium
Answer: XOR in 2D is not linearly separable: no single line separates (0,0)/(1,1) from (0,1)/(1,0). A perceptron is exactly one linear decision boundary, so it cannot fit XOR without adding features or hidden layers (e.g. MLP).
Interview tip: Mention Minsky/Papert context briefly—motivates multi-layer networks.
5 State the perceptron learning rule for misclassified points. Medium
Answer: For labels y ∈ {−1, +1}, when (x, y) is misclassified, update w ← w + η y x and b ← b + η y (learning rate η > 0; often η = 1 in the classic algorithm). Correct points receive no update.
w := w + η y x , b := b + η y (on mistake only)
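The update rule above can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation: the function names, the toy AND dataset, the epoch cap, and the tie rule (s = 0 predicted as +1) are assumptions made for the example.

```python
import numpy as np

def train_perceptron(X, y, eta=1.0, max_epochs=100):
    """Mistake-driven perceptron; labels must be in {-1, +1}."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for x_i, y_i in zip(X, y):
            # Treat a signed margin <= 0 as a mistake (includes the boundary).
            if y_i * (np.dot(w, x_i) + b) <= 0:
                w = w + eta * y_i * x_i
                b = b + eta * y_i
                mistakes += 1
        if mistakes == 0:  # a full pass without updates: stop
            break
    return w, b

def predict(X, w, b):
    # Tie rule (an assumption): s == 0 is predicted as +1.
    return np.where(X @ w + b >= 0, 1, -1)

# Toy data: logical AND, which is linearly separable.
X_and = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y_and = np.array([-1, -1, -1, 1])
w_and, b_and = train_perceptron(X_and, y_and)
```

Because AND is separable, the loop stops after a finite number of updates; on XOR the same loop would simply run until max_epochs.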
6 When does the perceptron algorithm converge? Hard
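Answer: If the data are linearly separable, the perceptron rule converges after a finite number of mistakes (Novikoff's bound: at most (R/γ)² updates, where R bounds the input norms and γ is the margin of some unit-norm separating vector). If the data are not separable, the updates never settle and can cycle indefinitely; use the pocket algorithm, the averaged perceptron, or a different model/loss.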
7 Why is the bias term important? Easy
Answer: Without b, every separating hyperplane must pass through the origin in feature space. The bias shifts the decision boundary so it can separate offset clouds of points. Often implemented as an extra input fixed at 1 with a weight w₀.
8 Step activation vs sigmoid for a “perceptron”—what changes? Medium
Answer: The step gives a hard decision and zero gradient almost everywhere—classic perceptron uses discrete updates, not backprop through the step. Sigmoid is smooth, yields probabilities, and supports gradient-based training (logistic regression / neural nets with continuous loss).
9 How does a perceptron relate to logistic regression? Medium
Answer: Both use a linear score w·x + b. Logistic regression outputs sigmoid(score) as probability and minimizes log loss with gradients. The perceptron uses a hard threshold and mistake-driven updates; same geometry, different output and training objective.
10 What is the margin of a correctly classified point? Medium
Answer: With y ∈ {−1, +1}, the (signed) margin is often written y (w·x + b). It is positive when the prediction is correct; larger values mean the point is farther from the decision boundary. Convergence proofs bound the number of mistakes using margin and norm of a separating vector.
11 Does feature scaling matter for the perceptron? Easy
Answer: The decision boundary is still linear, but scale differences across features can slow convergence or make updates dominated by large-magnitude inputs. Standardizing or scaling features often helps iterative algorithms behave more evenly in practice.
12 How can perceptrons be used for multi-class problems? Medium
Answer: Common reductions: one-vs-rest (one perceptron per class vs all others) or one-vs-one (pairwise classifiers). At prediction time, combine votes or scores. This is not softmax—mention softmax as the smooth multi-class alternative in neural nets.
13 What is the “pocket” algorithm idea? Hard
Answer: On noisy or non-separable data, standard perceptron updates may not stabilize. The pocket variant keeps the weight vector that achieved the lowest training error so far (“in your pocket”) while continuing updates, returning the best snapshot instead of the last iterate.
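A rough sketch of the pocket idea, assuming the same mistake-driven update as the classic perceptron. The helper names, the XOR toy data, and the choice to re-score the whole training set after every update are illustrative assumptions; practical variants often score less frequently.

```python
import numpy as np

def count_errors(X, y, w, b):
    # Points with signed margin <= 0 count as errors.
    return int(np.sum(y * (X @ w + b) <= 0))

def pocket_perceptron(X, y, eta=1.0, max_epochs=50):
    w, b = np.zeros(X.shape[1]), 0.0
    pocket_w, pocket_b = w.copy(), b
    pocket_err = count_errors(X, y, w, b)
    for _ in range(max_epochs):
        for x_i, y_i in zip(X, y):
            if y_i * (w @ x_i + b) <= 0:
                w = w + eta * y_i * x_i
                b = b + eta * y_i
                err = count_errors(X, y, w, b)
                if err < pocket_err:  # keep the best snapshot seen so far
                    pocket_w, pocket_b, pocket_err = w.copy(), b, err
    return pocket_w, pocket_b, pocket_err

# XOR is not linearly separable, so the plain perceptron never settles.
X_xor = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y_xor = np.array([-1, 1, 1, -1])
w_p, b_p, err_p = pocket_perceptron(X_xor, y_xor)
```

The returned weights are the best snapshot, not the last iterate, so the reported error count always matches the weights handed back.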
14 Perceptron vs linear SVM—one-minute comparison. Hard
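Answer: Both learn linear separators. An SVM maximizes the margin (with slack variables in the soft-margin case) and, being a convex problem, has a unique optimal hyperplane. The perceptron stops at the first separating hyperplane it finds, if one exists, so many solutions are possible and the result depends on initialization and data order. Margin maximization tends to generalize better, and SVMs extend to non-linear boundaries through kernels; the perceptron is simple and historically foundational.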
15 How does stacking perceptrons lead to multi-layer networks? Medium
Answer: One perceptron = one linear boundary. Hidden layers of non-linear units compose boundaries: early layers can fold or combine half-spaces so later layers separate XOR-like patterns. That is the core idea of an MLP: depth + non-linearity overcomes single-layer limits.
MLP: h = σ(W₁x + b₁) → ŷ = σ(W₂h + b₂) (non-linear σ)

Multi-Layer Perceptron — 15 Interview Questions

16 What is a multi-layer perceptron (MLP)? Easy
Answer: An MLP is a feedforward neural network: layers of neurons where each layer is typically fully connected to the next, with non-linear activations between affine transforms. Information flows input → hidden(s) → output without cycles.
17 What does “feedforward” mean? Easy
Answer: Activations are computed in one direction only—from input to output. There are no recurrent edges that feed a layer’s output back into earlier layers in the same forward pass (that would be an RNN or similar).
18 Why do we use hidden layers? Easy
Answer: Hidden layers let the network build intermediate representations. Stacking non-linear layers composes functions so the model can approximate non-linear boundaries (e.g. XOR) that a single linear layer cannot.
19 Depth vs width: how do interviewers expect you to compare them? Medium
Answer: Depth (more layers) increases compositional power and hierarchical features; can help sample efficiency for some tasks but risks optimization issues. Width (more units per layer) increases capacity per layer; very wide shallow nets can also approximate functions. Trade-offs: data, compute, vanishing gradients, and inductive bias.
20 How many parameters in a linear layer from d_in to d_out? Easy
Answer: Weights: d_in × d_out. Bias: d_out. Total d_in × d_out + d_out. Mention this scales quickly for large fully connected layers.
params = d_in × d_out + d_out
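A small helper makes the "scales quickly" point concrete. The function names and the 784 → 256 → 10 layer sizes are assumptions chosen for illustration.

```python
def linear_params(d_in, d_out, bias=True):
    """Parameter count of one fully connected layer."""
    return d_in * d_out + (d_out if bias else 0)

def mlp_params(layer_sizes):
    """Total parameters of an MLP given sizes [d_0, d_1, ..., d_L]."""
    return sum(linear_params(a, b)
               for a, b in zip(layer_sizes[:-1], layer_sizes[1:]))

first_layer = linear_params(784, 256)   # 784*256 + 256 = 200960
total = mlp_params([784, 256, 10])      # 200960 + 2570 = 203530
```

Almost all of the parameters sit in the first layer, which is why flattening high-resolution inputs into a fully connected layer gets expensive.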
21 What happens if you remove activations and only stack linear layers? Easy
Answer: A composition of linear maps is still linear. The entire deep stack collapses to a single affine transform—no extra expressive power vs one linear layer.
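The collapse can be checked numerically. In this sketch the shapes and random values are arbitrary assumptions; the point is only that W₂(W₁x + b₁) + b₂ equals a single affine map with weight W₂W₁ and bias W₂b₁ + b₂.

```python
import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)
W2, b2 = rng.normal(size=(2, 4)), rng.normal(size=2)
x = rng.normal(size=3)

# Two stacked linear layers with no activation in between.
two_layer = W2 @ (W1 @ x + b1) + b2

# The equivalent single affine transform.
W_eff = W2 @ W1
b_eff = W2 @ b1 + b2
one_layer = W_eff @ x + b_eff
```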
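22 One hidden layer MLP: what can it approximate? Hard
Answer: With a suitable non-linearity (e.g. sigmoid, or more generally any non-polynomial activation) and enough hidden units, a single-hidden-layer MLP can approximate any continuous function on a compact domain to arbitrary accuracy (universal approximation theorem). The theorem does not say how many units are needed or whether training will find them; in practice depth, data, and optimization matter, not only width.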
23 How does a small MLP solve XOR? Medium
Answer: A hidden layer can form new features (e.g. AND-like combinations) so the output layer becomes linearly separable in that feature space. Classic example: 2 inputs → small hidden layer with non-linearity → output.
Interview tip: Sketch two hidden units as combining half-spaces; interviewers reward the intuition, not memorizing exact weights.
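One hand-built solution, as a sketch: the weights below are picked by hand for illustration (they are not learned, and many other choices work). The hidden units use a step non-linearity; one acts like OR, the other like AND, and the output fires for "OR but not AND".

```python
import numpy as np

def step(z):
    return (z > 0).astype(float)

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)

# Hidden layer: column 0 is OR-like (x1 + x2 > 0.5),
# column 1 is AND-like (x1 + x2 > 1.5).
W1 = np.array([[1.0, 1.0],
               [1.0, 1.0]])
b1 = np.array([-0.5, -1.5])

# Output: OR minus AND, thresholded at 0.5.
W2 = np.array([1.0, -1.0])
b2 = -0.5

H = step(X @ W1 + b1)        # new features: [OR, AND] for each input
y_hat = step(H @ W2 + b2)    # XOR of the two inputs
```

In the hidden feature space the four points become (0,0), (1,0), (1,0), (1,1), which a single line can separate.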
24 How does a batch dimension change MLP math? Medium
Answer: For batch size B, input is B × d_in. Linear layer: Y = XW + b (broadcast bias). Same weights for all batch rows—this is why matrix multiply is efficient on GPUs.
25 When to prefer a CNN over an MLP for images? Medium
Answer: Images have local structure and translation patterns. CNNs use shared local filters—far fewer parameters and better inductive bias. A flat MLP on pixels ignores locality and scales poorly with resolution.
26 Why do large MLPs overfit easily? Medium
Answer: High parameter count vs data lets the network memorize noise. Mitigate with regularization (L2, dropout), more data, early stopping, or architecture better matched to the problem.
27 Why does weight initialization matter in deep MLPs? Hard
Answer: Poor scaling can make activations explode or vanish layer-to-layer, giving useless gradients. Schemes like Xavier/He set variance based on fan-in/fan-out to keep signal scale stable at initialization.
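A quick experiment shows the effect. This is a sketch under assumptions chosen for the demo (ReLU layers, width 512, depth 10, Gaussian weights, no biases): it compares unit-variance weights with He initialization, which uses standard deviation sqrt(2 / fan_in).

```python
import numpy as np

def final_rms(weight_std_fn, width=512, depth=10, batch=64, seed=0):
    """RMS of the last layer's pre-activations for a random ReLU MLP."""
    rng = np.random.default_rng(seed)
    a = rng.normal(size=(batch, width))
    for _ in range(depth):
        W = rng.normal(scale=weight_std_fn(width), size=(width, width))
        z = a @ W
        a = np.maximum(z, 0.0)
    return float(np.sqrt(np.mean(z ** 2)))

rms_naive = final_rms(lambda fan_in: 1.0)                   # std 1 weights
rms_he = final_rms(lambda fan_in: np.sqrt(2.0 / fan_in))    # He init
```

With unit-variance weights the signal scale grows by a large factor at every layer, while the He-scaled network keeps pre-activations at order one.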
28 Typical output layer for multi-class classification? Easy
Answer: Linear logits followed by softmax (often applied inside loss for numerical stability). Training uses cross-entropy on probabilities or logits with log-softmax.
29 MLP for regression: output and loss? Easy
Answer: Often a linear output (no squashing) with MSE or Huber loss for real-valued targets. For bounded outputs you might use sigmoid scaling or tanh to a range.
30 When is an MLP still a reasonable first choice? Medium
Answer: Tabular or fixed-length feature vectors without strong spatial or sequential structure, baselines, or as a component inside larger models. For sequences use RNN/Transformer; for grids use CNN.

Forward Propagation — 15 Interview Questions

31 What is forward propagation? Easy
Answer: Computing the network’s output from the input by applying each layer in order (affine transforms, biases, activations, pooling, etc.) with no weight updates. During training it produces the predictions that feed the loss; at deployment it is pure inference.
32 Forward vs backward pass in one sentence each. Easy
Answer: Forward: compute outputs and the loss, (usually) caching intermediates for the backward pass. Backward: apply the chain rule to get gradients for learning. Forward does not change weights; backward supplies the update signal.
33 One step of an MLP layer in forward form. Easy
Answer: z = Wx + b, then a = f(z) for activation f. For a batch, X is stacked rows and the same W applies to each.
z = Wx + b,  a = f(z)
34 Shape of X, W, and output for a batched linear layer. Medium
Answer: X: B × d_in, W: d_in × d_out, bias b: d_out (broadcast). Output Y: B × d_out with Y = XW + b (row-wise).
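The shapes can be verified with a tiny NumPy example; the sizes and values here are arbitrary assumptions. Note that this follows the Y = XW + b convention used above, whereas PyTorch's nn.Linear stores its weight as d_out × d_in and computes x Wᵀ + b.

```python
import numpy as np

B, d_in, d_out = 4, 3, 2
X = np.arange(B * d_in, dtype=float).reshape(B, d_in)   # (B, d_in)
W = np.ones((d_in, d_out))                              # (d_in, d_out)
b = np.array([0.5, -0.5])                               # (d_out,)

Y = X @ W + b   # b is broadcast across the B rows; Y has shape (B, d_out)

# Each output row is the same layer applied to the matching input row.
row0 = X[0] @ W + b
```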
35 Rough FLOPs for matrix multiply A (m×k) · B (k×n)? Medium
Answer: Dominant term is O(m·k·n) multiply-adds (often quoted as ~2mkn FLOPs if counting mul+add separately). Used to reason about layer cost in forward pass.
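As a back-of-the-envelope sketch, the 2mkn estimate can be applied layer by layer. The helper names and layer sizes are assumptions, and the count ignores bias adds and activations, which are lower-order terms.

```python
def matmul_flops(m, k, n):
    """Approximate FLOPs for (m x k) @ (k x n), counting mul and add separately."""
    return 2 * m * k * n

def mlp_forward_flops(batch, layer_sizes):
    """Matmul FLOPs for one forward pass through fully connected layers."""
    return sum(matmul_flops(batch, d_in, d_out)
               for d_in, d_out in zip(layer_sizes[:-1], layer_sizes[1:]))

first = matmul_flops(32, 784, 256)              # 12845056
total = mlp_forward_flops(32, [784, 256, 10])   # 13008896
```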
36 Why must layers be applied in a fixed order? Easy
Answer: Each layer’s input is the previous layer’s output. Reordering changes the composed function entirely unless the architecture is specially designed (e.g. parallel branches with merge).
37 What activations are often cached during forward pass in training? Medium
Answer: Pre-activations z and post-activations a (plus BatchNorm inputs and batch statistics) so backprop can compute local gradients without recomputing everything. Frameworks handle this automatically in autograd.
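38 How does eval() / inference mode change forward behavior? Medium
Answer: Dropout is disabled (or, in the non-inverted formulation, activations are scaled instead). BatchNorm uses running mean/variance, not batch statistics. In PyTorch, eval() only switches these layer behaviors; gradient tracking is turned off separately (e.g. with torch.no_grad()), which saves memory and compute.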
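39 Why subtract max before softmax in practice? Hard
Answer: Logits can be large, so e^z overflows. Shifting by z' = z − max(z) leaves the softmax output unchanged (the common factor e^(−max(z)) cancels between numerator and denominator) while keeping every exponent ≤ 0, so the exponentials stay in (0, 1]. That is the numerical-stability argument.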
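A minimal NumPy sketch of the max-shift trick; the function name and the test logits are assumptions for illustration.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over the last axis."""
    z = np.asarray(z, dtype=float)
    shifted = z - np.max(z, axis=-1, keepdims=True)  # largest entry becomes 0
    e = np.exp(shifted)                              # every value is in (0, 1]
    return e / np.sum(e, axis=-1, keepdims=True)

# Naively, np.exp(1000.0) overflows to inf; the shifted version is fine.
p_large = softmax([1000.0, 1001.0, 1002.0])
p_small = softmax([0.0, 1.0, 2.0])   # same logits up to a constant shift
```

Both calls give the same probabilities, which illustrates the shift invariance.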
40 Forward pass for batch size 1 vs large B: same code path? Easy
Answer: Usually yes—B=1 is a degenerate batch; matrix ops still work. Some ops (e.g. BN) behave differently with tiny batch size; that’s a practical caveat.
41 What drives memory during forward (training)? Medium
Answer: Storing activations for backprop, plus optimizer state if updating. Wider/deeper nets and larger batch increase activation memory—often the bottleneck before weights.
42 “Functional” forward: what does it mean in frameworks? Medium
Answer: Applying ops with explicit weight tensors passed in (e.g. F.linear(x, W, b)) instead of nn.Module parameters—same math, useful for meta-learning or custom graphs.
43 Mixed precision forward: what changes? Hard
Answer: Many ops run in float16/bfloat16 for speed; sensitive reductions (loss, BN) may stay in float32. Loss scaling can help with small gradients in low precision.
44 Exported model “inference graph”: relation to forward pass? Medium
Answer: It is a frozen forward computation graph (no backward), optimized for deployment—same layer order as training forward, possibly fused ops.
45 Walk through a 3-layer MLP forward from x to class probs. Easy
Answer: x → h₁ = f(W₁x + b₁) → h₂ = f(W₂h₁ + b₂) → logits = W₃h₂ + b₃ → probs = softmax(logits). Mention where the non-linearity f stops (before softmax).
Interview tip: Draw arrows on a whiteboard; interviewers check that you separate linear blocks from f and softmax.
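The same walk-through as a NumPy sketch. The layer sizes, the choice of ReLU for f, and the random weights are assumptions; weights use the d_out × d_in layout so that W @ x matches the Wx + b notation.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def softmax(z):
    e = np.exp(z - np.max(z))   # max-shift for numerical stability
    return e / e.sum()

def mlp_forward(x, params):
    (W1, b1), (W2, b2), (W3, b3) = params
    h1 = relu(W1 @ x + b1)
    h2 = relu(W2 @ h1 + b2)
    logits = W3 @ h2 + b3       # last layer stays linear (no f here)
    return softmax(logits)

rng = np.random.default_rng(0)
sizes = [4, 8, 8, 3]            # input dim 4, two hidden layers, 3 classes
params = [(rng.normal(scale=0.5, size=(n_out, n_in)), np.zeros(n_out))
          for n_in, n_out in zip(sizes[:-1], sizes[1:])]
probs = mlp_forward(rng.normal(size=4), params)
```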

Computational Graphs & Autodiff — 15 Interview Questions

46 What is a computational graph? Easy
Answer: A directed acyclic graph (DAG) representing a function: nodes are variables or operations, edges show data flow. Used to evaluate the function and (with autodiff) derivatives systematically.
47 Nodes vs edges: typical assignment. Easy
Answer: Nodes: tensors after an op, or the op itself (depends on framework representation). Edges: which outputs feed which inputs. The graph encodes dependencies for topological order.
48 What is automatic differentiation? Easy
Answer: Computes exact derivatives (up to floating point) by applying chain rule along the graph—not numerical finite differences, not full symbolic algebra on the whole expression tree by hand.
49 Forward-mode autodiff: when useful? Medium
Answer: Pushes directional derivatives forward; costs scale with number of inputs. Useful when few inputs and many outputs (rare for standard NN training vs reverse mode).
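Forward mode is often explained with dual numbers, which carry a value and a tangent together. The toy class below is a sketch supporting only + and *; the class name and the test function f are assumptions for illustration.

```python
class Dual:
    """Value plus tangent (directional derivative) for forward-mode autodiff."""
    def __init__(self, val, tan=0.0):
        self.val, self.tan = val, tan

    def __add__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.val + other.val, self.tan + other.tan)
    __radd__ = __add__

    def __mul__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        # Product rule applied to the tangents.
        return Dual(self.val * other.val,
                    self.tan * other.val + self.val * other.tan)
    __rmul__ = __mul__

def f(x, y):
    return x * y + x * x

# One forward pass per input direction: seed the tangent of that input with 1.
df_dx = f(Dual(3.0, 1.0), Dual(2.0, 0.0)).tan   # y + 2x = 8
df_dy = f(Dual(3.0, 0.0), Dual(2.0, 1.0)).tan   # x = 3
```

Two inputs needed two passes, which is the "cost scales with the number of inputs" point in miniature.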
50 Reverse-mode autodiff: why dominant in ML? Medium
Answer: One scalar loss, millions of parameters—reverse mode gets the full gradient in O(graph size) time, same order as one forward pass (roughly). This is backpropagation.
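A toy reverse-mode implementation shows why one backward sweep yields every gradient. This is a sketch, not how any particular framework is built: the Var class, its two supported ops, and the example L = x·y + x are assumptions.

```python
class Var:
    """Scalar node that records its parents and local gradients."""
    def __init__(self, value, parents=()):
        self.value = value
        self.parents = parents   # tuples of (parent Var, d(self)/d(parent))
        self.grad = 0.0

    def __add__(self, other):    # other must also be a Var in this sketch
        return Var(self.value + other.value, ((self, 1.0), (other, 1.0)))

    def __mul__(self, other):
        return Var(self.value * other.value,
                   ((self, other.value), (other, self.value)))

def backward(out):
    # Depth-first post-order gives a topological order of the DAG.
    order, seen = [], set()
    def visit(v):
        if id(v) in seen:
            return
        seen.add(id(v))
        for parent, _ in v.parents:
            visit(parent)
        order.append(v)
    visit(out)
    out.grad = 1.0               # dL/dL
    for v in reversed(order):    # every consumer is processed before its inputs
        for parent, local in v.parents:
            parent.grad += local * v.grad   # chain rule, summed over paths

x, y = Var(3.0), Var(4.0)
L = x * y + x
backward(L)   # x.grad = y + 1 = 5, y.grad = x = 3
```

A single call to backward fills in gradients for both inputs, regardless of how many inputs there are.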
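51 Why must the graph be acyclic? Medium
Answer: Standard autodiff needs a clear topological order to evaluate nodes and propagate gradients. RNNs are unrolled in time, which gives a DAG over steps (backpropagation through time differentiates this unrolled graph); true cycles, such as fixed-point layers, need special handling like implicit differentiation.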
52 Eager execution vs define-then-run. Medium
Answer: Eager: build graph as Python runs (PyTorch default). Static: trace or compile a full graph first (older TF graphs, torch.compile, XLA)—enables fusion and deployment optimizations.
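53 Leaf vs non-leaf tensors (PyTorch mental model). Medium
Answer: Leaves are tensors created directly by the user rather than by a tracked operation, typically parameters or inputs with requires_grad=True; intermediates produced by ops are non-leaf. By default .grad is populated only on leaves that require grad (call retain_grad() on an intermediate to keep its gradient). retain_graph=True keeps the graph's saved buffers so backward can be called more than once.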
54 What does detach() do? Medium
Answer: Returns a tensor that shares the same data but is cut off from the graph, so no gradient flows back through it. Used to freeze parts of the model or treat values as constants.
55 stop_gradient in TensorFlow: same idea? Easy
Answer: Yes—block gradients through that path; common in GANs, reinforcement learning tricks, or fixed targets.
56 Custom autograd Function: what must you implement? Hard
Answer: Forward computes outputs; backward receives gradient w.r.t. outputs and returns gradients w.r.t. each differentiable input—must be mathematically consistent with the forward op.
57 Higher-order derivatives: does the graph recurse? Hard
Answer: Frameworks can build a graph over gradient computations (create_graph=True in PyTorch) for Hessian-vector products; memory and cost grow quickly.
58 Why are in-place ops dangerous with autograd? Medium
Answer: They can overwrite values still needed for backward. Frameworks error or warn when versions mismatch—prefer out-of-place when tensors require grad.
59 Autodiff vs symbolic differentiation vs numeric finite differences. Medium
Answer: Symbolic: algebra rules, expression swell. Finite diff: cheap to code, inaccurate and slow for high-D. Autodiff: exact, efficient for ML-scale graphs.
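Finite differences remain useful as a gradient check. In the sketch below a hand-derived gradient of a least-squares loss stands in for what autodiff would return; the loss, the data, and the step size eps are assumptions for the example.

```python
import numpy as np

def loss(w, X, y):
    r = X @ w - y
    return float(r @ r)                  # sum of squared residuals

def analytic_grad(w, X, y):
    return 2.0 * X.T @ (X @ w - y)       # exact gradient of the loss above

def numeric_grad(f, w, eps=1e-6):
    """Central differences: two function evaluations per parameter."""
    g = np.zeros_like(w)
    for i in range(w.size):
        step = np.zeros_like(w)
        step[i] = eps
        g[i] = (f(w + step) - f(w - step)) / (2 * eps)
    return g

rng = np.random.default_rng(0)
X, y, w = rng.normal(size=(5, 3)), rng.normal(size=5), rng.normal(size=3)
g_exact = analytic_grad(w, X, y)
g_fd = numeric_grad(lambda v: loss(v, X, y), w)
```

The numeric version needs 2 × (number of parameters) loss evaluations, which is why it is used for spot checks rather than training.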
60 Inference graph vs training graph. Easy
Answer: Inference drops backward nodes and anything only needed for gradients—smaller, faster. Export formats (ONNX, TorchScript) target forward-only execution.
Interview tip: Draw a tiny graph (mul, add) and label ∂L/∂x on paper; it is a classic interview sanity check.