Perceptron — 15 Interview Questions
Linear classifiers, threshold decisions, the classic learning rule, separability, XOR, and convergence.
Topics: Linear boundary · Learning rule · XOR / limits · Convergence
1
What is the Rosenblatt perceptron?
Easy
Answer: An early binary linear classifier: it forms a weighted sum of inputs plus a bias, then applies a threshold (step) to decide between two classes. It is historically important as a simple trainable “neuron” and the starting point for multi-layer networks.
2
Write the perceptron decision rule with labels in {−1, +1}.
Easy
Answer: Compute the pre-activation (margin)
s = w·x + b. Predict ŷ = sign(s) (with a convention for s = 0, e.g. treat as +1 or define a tie rule). Training adjusts w, b only when ŷ ≠ y.
ŷ = sign(w·x + b) with y ∈ {−1, +1}
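Code sketch (illustrative NumPy, names made up for this page; the tie at s = 0 is resolved to +1 here):

    import numpy as np

    def perceptron_predict(w, b, x):
        # Pre-activation: s = w·x + b
        s = np.dot(w, x) + b
        # Threshold: +1 if s >= 0, else -1 (ties at 0 go to +1 by convention)
        return 1 if s >= 0 else -1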
3
What does “linearly separable” mean?
Easy
Answer: Two classes are linearly separable if there exists a hyperplane
w·x + b = 0 that puts all examples of one class strictly on one side and the other class on the other. The perceptron can learn such a separator when data are separable.
4
Why can a single perceptron not represent XOR?
Medium
Answer: XOR in 2D is not linearly separable: no single line separates (0,0)/(1,1) from (0,1)/(1,0). A perceptron is exactly one linear decision boundary, so it cannot fit XOR without adding features or hidden layers (e.g. MLP).
Interview tip: Mention Minsky/Papert context briefly—motivates multi-layer networks.
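Code sketch (a numerical illustration, not a proof): brute-forcing a coarse grid of candidate lines shows that none classifies all four XOR points correctly; the best any line achieves is 3 of 4.

    import numpy as np
    from itertools import product

    X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
    y = np.array([-1, 1, 1, -1])               # XOR with labels in {-1, +1}

    best = 0
    grid = np.linspace(-2, 2, 21)              # coarse grid over w1, w2, b
    for w1, w2, b in product(grid, repeat=3):
        pred = np.where(X @ np.array([w1, w2]) + b >= 0, 1, -1)
        best = max(best, int((pred == y).sum()))

    print(best)                                # prints 3: no line gets all four right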
5
State the perceptron learning rule for misclassified points.
Medium
Answer: For labels
y ∈ {−1, +1}, when (x, y) is misclassified, update w ← w + η y x and b ← b + η y (learning rate η > 0; often η = 1 in the classic algorithm). Correct points receive no update.
w := w + η y x , b := b + η y (on mistake only)
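Code sketch of the full training loop (illustrative NumPy with η = 1 and a fixed number of passes; a teaching sketch, not a production implementation):

    import numpy as np

    def train_perceptron(X, y, epochs=100, eta=1.0):
        """Classic perceptron: update w, b only on misclassified points (y in {-1, +1})."""
        w = np.zeros(X.shape[1])
        b = 0.0
        for _ in range(epochs):
            mistakes = 0
            for xi, yi in zip(X, y):
                if yi * (np.dot(w, xi) + b) <= 0:   # misclassified (or on the boundary)
                    w += eta * yi * xi
                    b += eta * yi
                    mistakes += 1
            if mistakes == 0:                       # converged on separable data
                break
        return w, b

On a separable toy set (e.g. AND with labels in {−1, +1}), the inner loop typically stops making mistakes after a few passes and the function returns early.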
6
When does the perceptron algorithm converge?
Hard
Answer: If the data are linearly separable, the perceptron rule converges in a finite number of mistakes (Novikoff-style bounds). If the data are not separable, updates can cycle indefinitely, so you need the pocket algorithm, averaged weights, or a different model/loss.
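Key bound (Novikoff, stated informally, with the bias folded into w via a constant input): if every example satisfies ‖xᵢ‖ ≤ R and some unit-norm separator w* achieves margin yᵢ (w*·xᵢ) ≥ γ > 0 for all i, then the number of updates is bounded.

mistakes ≤ (R / γ)²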
7
Why is the bias term important?
Easy
Answer: Without b, every separating hyperplane must pass through the origin in feature space. The bias shifts the decision boundary so it can separate offset clouds of points. Often implemented as an extra input fixed at 1 with a weight
w₀.
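Code sketch of that trick (illustrative values): append a constant 1 to each input so the bias becomes just another weight.

    import numpy as np

    X = np.array([[0.5, 1.2], [2.0, -0.3]])             # original inputs
    X_aug = np.hstack([np.ones((X.shape[0], 1)), X])    # prepend the constant-1 input
    w_aug = np.array([0.1, -0.4, 0.7])                  # w_aug[0] plays the role of w0 (the bias)
    scores = X_aug @ w_aug                               # equals X @ w_aug[1:] + w_aug[0]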
8
Step activation vs sigmoid for a “perceptron”—what changes?
Medium
Answer: The step gives a hard decision and zero gradient almost everywhere—classic perceptron uses discrete updates, not backprop through the step. Sigmoid is smooth, yields probabilities, and supports gradient-based training (logistic regression / neural nets with continuous loss).
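Code sketch of the gradient difference (illustrative only): the step's derivative is zero everywhere except at the jump, while the sigmoid has a usable gradient σ(s)(1 − σ(s)).

    import numpy as np

    def step(s):
        return np.where(s >= 0, 1.0, 0.0)         # hard decision, zero gradient almost everywhere

    def sigmoid(s):
        return 1.0 / (1.0 + np.exp(-s))           # smooth, output in (0, 1)

    s = np.array([-2.0, 0.5, 3.0])
    print(step(s))                                 # [0. 1. 1.]
    print(sigmoid(s) * (1 - sigmoid(s)))           # non-zero gradients usable by gradient descent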
9
How does a perceptron relate to logistic regression?
Medium
Answer: Both use a linear score
w·x + b. Logistic regression outputs sigmoid(score) as probability and minimizes log loss with gradients. The perceptron uses a hard threshold and mistake-driven updates; same geometry, different output and training objective.
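Code sketch of the contrast (illustrative, with labels mapped to {0, 1} for the logistic case): the perceptron updates only on mistakes, while logistic regression takes a gradient step proportional to the probability error on every example.

    import numpy as np

    def sigmoid(s):
        return 1.0 / (1.0 + np.exp(-s))

    # Perceptron (y in {-1, +1}): update only if the point is misclassified
    def perceptron_update(w, b, x, y, eta=1.0):
        if y * (np.dot(w, x) + b) <= 0:
            w = w + eta * y * x
            b = b + eta * y
        return w, b

    # Logistic regression (y in {0, 1}): gradient step on the log loss for every point
    def logistic_update(w, b, x, y, eta=0.1):
        p = sigmoid(np.dot(w, x) + b)
        w = w - eta * (p - y) * x
        b = b - eta * (p - y)
        return w, b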
10
What is the margin of a correctly classified point?
Medium
Answer: With
y ∈ {−1, +1}, the (signed) margin is often written y (w·x + b). It is positive when the prediction is correct; larger values mean the point is farther from the decision boundary. Convergence proofs bound the number of mistakes using margin and norm of a separating vector.
11
Does feature scaling matter for the perceptron?
Easy
Answer: The decision boundary is still linear, but scale differences across features can slow convergence or make updates dominated by large-magnitude inputs. Standardizing or scaling features often helps iterative algorithms behave more evenly in practice.
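Code sketch of plain standardization (library helpers such as scikit-learn's StandardScaler do the same zero-mean, unit-variance transform):

    import numpy as np

    def standardize(X):
        mu = X.mean(axis=0)
        sigma = X.std(axis=0)
        sigma[sigma == 0] = 1.0        # avoid division by zero for constant features
        return (X - mu) / sigma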
12
How can perceptrons be used for multi-class problems?
Medium
Answer: Common reductions: one-vs-rest (one perceptron per class vs all others) or one-vs-one (pairwise classifiers). At prediction time, combine votes or scores. This is not softmax—mention softmax as the smooth multi-class alternative in neural nets.
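Code sketch of the one-vs-rest reduction (illustrative; it takes any binary trainer that returns (w, b), e.g. the loop from Q5):

    import numpy as np

    def train_one_vs_rest(X, y, classes, train_binary):
        """train_binary(X, y_pm) -> (w, b), with y_pm in {-1, +1}."""
        models = {}
        for c in classes:
            y_pm = np.where(y == c, 1, -1)        # this class vs. all the others
            models[c] = train_binary(X, y_pm)
        return models

    def predict_one_vs_rest(models, x):
        # Pick the class whose linear score w·x + b is largest
        return max(models, key=lambda c: np.dot(models[c][0], x) + models[c][1])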
13
What is the “pocket” algorithm idea?
Hard
Answer: On noisy or non-separable data, standard perceptron updates may not stabilize. The pocket variant keeps the weight vector that achieved the lowest training error so far (“in your pocket”) while continuing updates, returning the best snapshot instead of the last iterate.
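Code sketch of the pocket variant (illustrative; training error is checked once per pass here for brevity, whereas the classic version checks after every update):

    import numpy as np

    def pocket_perceptron(X, y, epochs=100, eta=1.0):
        w, b = np.zeros(X.shape[1]), 0.0
        best_w, best_b, best_errors = w.copy(), b, np.inf
        for _ in range(epochs):
            for xi, yi in zip(X, y):
                if yi * (np.dot(w, xi) + b) <= 0:      # usual mistake-driven update
                    w += eta * yi * xi
                    b += eta * yi
            preds = np.where(X @ w + b >= 0, 1, -1)
            errors = int(np.sum(preds != y))           # training errors of current w, b
            if errors < best_errors:                   # put the best snapshot "in the pocket"
                best_errors, best_w, best_b = errors, w.copy(), b
        return best_w, best_b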
14
Perceptron vs linear SVM—one-minute comparison.
Hard
Answer: Both learn linear separators. SVM maximizes margin (often with slack for soft-margin) and yields a unique solution under convex optimization. Perceptron finds any separating hyperplane if one exists; many solutions possible. SVM generalizes better with kernels; perceptron is simple and historically foundational.
15
How does stacking perceptrons lead to multi-layer networks?
Medium
Answer: One perceptron = one linear boundary. Hidden layers of non-linear units compose boundaries: early layers can fold or combine half-spaces so later layers separate XOR-like patterns. That is the core idea of an MLP: depth + non-linearity overcomes single-layer limits.
MLP: h = σ(W₁x + b₁) → ŷ = σ(W₂h + b₂) (non-linear σ)
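Code sketch with hand-set weights (threshold units; the values are chosen by hand for illustration, not learned): the hidden layer computes OR and AND of the inputs, and the output fires for OR-and-not-AND, which is exactly XOR.

    import numpy as np

    def step(s):
        return (s >= 0).astype(int)

    def xor_mlp(x1, x2):
        x = np.array([x1, x2])
        h1 = step(np.dot([1, 1], x) - 0.5)      # OR gate:  fires if x1 + x2 >= 0.5
        h2 = step(np.dot([1, 1], x) - 1.5)      # AND gate: fires if x1 + x2 >= 1.5
        return step(1 * h1 - 1 * h2 - 0.5)      # OR and not AND  ->  XOR

    for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]:
        print(a, b, xor_mlp(a, b))               # 0, 1, 1, 0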
Quick review checklist
- Write the decision rule and the mistake-driven update with y ∈ {−1, +1}.
- Explain linear separability and draw XOR as the canonical counterexample.
- State finite convergence for separable data; say what breaks on noisy data.
- Contrast step vs sigmoid and perceptron vs logistic regression.
- Close with how hidden layers fix what one perceptron cannot do.