Neural Network Fundamentals
Perceptron, MLP, forward propagation, and computational graphs for neural networks.
The Perceptron
What Is a Perceptron?
In 1957, Frank Rosenblatt described the perceptron as an algorithm and a simple device that could learn from examples. In modern terms, a (single-layer) perceptron for binary classification computes a linear score
z = wâ‚xâ‚ + wâ‚‚xâ‚‚ + … + wdxd + b = w·x + b
and outputs a class label using a threshold. With labels in {-1, +1}, a common convention is ŷ = sign(z) (with some tie-breaking rule at z = 0). With labels in {0, 1}, people often use a step: ŷ = 1 if z ≥ 0 else 0. The weights w and bias b are what we learn from data.
Geometry: The Decision Boundary
The equation w·x + b = 0 defines a hyperplane in input space. Points on one side are classified as one class, points on the other as the second class. The vector w is normal (perpendicular) to that hyperplane; the bias b shifts the plane away from the origin.
In two dimensions, the boundary is a line. For example, if w = [1, -1] and b = 0, the line is xâ‚ - xâ‚‚ = 0 (i.e. xâ‚ = xâ‚‚). Points above/below that line get different labels depending on the sign of z.
Tiny numeric check
Let w = [2, -1], b = -1, and x = [1, 1]. Then z = 2(1) + (-1)(1) - 1 = 0. If we use ŷ = 1 when z ≥ 0, this point lies on the boundary. For x = [2, 0], z = 4 - 0 - 1 = 3 > 0 → one side of the boundary; for x = [0, 2], z = 0 - 2 - 1 = -3 → the other side.
Perceptron Learning Rule
Assume labels y ∈ {-1, +1} and prediction ŷ = sign(z) with z = w·x + b. The classic perceptron update runs only when the example is misclassified (y ≠ŷ):
- w ↠w + η · y · x
- b ↠b + η · y
Here η > 0 is the learning rate. Intuition: if the true label is +1 but ŷ = -1, the score z was too low; adding a positive multiple of x to w tilts the hyperplane to increase z on that kind of input. If y = -1 and ŷ = +1, the update subtracts a multiple of x.
NumPy: Train on AND and OR
Logical AND and OR on two binary inputs are linearly separable. The snippet below uses labels -1 and +1, prediction sign(z), and the perceptron update on mistakes. After enough epochs, weights should separate the points.
import numpy as np
def sign(z):
return np.where(z >= 0, 1, -1)
# AND: (+1 only when both inputs are +1)
# Bias input is implemented as constant 1 and weight w0 (bias b)
X = np.array([
[-1, -1],
[-1, 1],
[ 1, -1],
[ 1, 1],
], dtype=float)
y = np.array([-1, -1, -1, 1], dtype=float) # AND with {-1, +1}
rng = np.random.default_rng(0)
w = rng.normal(0, 0.1, 2)
b = 0.0
eta = 0.5
for epoch in range(20):
err = 0
for xi, target in zip(X, y):
z = np.dot(w, xi) + b
pred = sign(z)
if pred != target:
err += 1
w = w + eta * target * xi
b = b + eta * target
if err == 0:
print(f"Converged epoch {epoch}")
break
print("w =", w, "b =", b)
print("predictions:", [sign(np.dot(w, xi) + b) for xi in X])
Try OR yourself
For OR with {-1,+1}, targets should be [-1, 1, 1, 1] for the same input order. Swap y and rerun; the algorithm should again converge.
y_or = np.array([-1, 1, 1, 1], dtype=float)
Limitation: XOR Is Not Linearly Separable
The XOR function (exclusive or) outputs +1 when inputs differ and -1 when they are equal. In the 2D plane with corners at (-1,-1), (-1,1), (1,-1), (1,1), no single straight line separates the two classes. The perceptron cannot represent XOR with one linear threshold unit.
This is the famous limitation discussed by Minsky and Papert (1969): one layer of linear threshold units is weak unless you add hidden layers or nonlinear features. A multi-layer perceptron (MLP) with hidden units and nonlinear activations can learn XOR—our next tutorial topic extends the story from one neuron to a stack of layers.
PyTorch: Single Linear + Threshold (Illustration)
Modern frameworks rarely train with the discrete perceptron rule; they use continuous losses and autograd. For comparison, a single nn.Linear with two inputs and one output is exactly the z = w·x + b part; you would still need a step for a literal perceptron. The snippet shows only the linear part—training it with BCEWithLogitsLoss is closer to logistic regression than to the classical perceptron algorithm.
import torch
import torch.nn as nn
# One linear neuron: 2 features -> 1 logit
model = nn.Linear(2, 1, bias=True)
x = torch.tensor([[1.0, -1.0], [-1.0, 1.0]])
logits = model(x)
print("logits shape:", logits.shape)
print("weights:", model.weight.data)
print("bias:", model.bias.data)
Summary
- The perceptron computes z = w·x + b and applies a hard threshold for binary labels.
- Its decision boundary is a hyperplane; the algorithm moves that hyperplane when it makes mistakes.
- It converges for linearly separable data; XOR motivates hidden layers (MLPs).
- Logistic regression keeps linear geometry but uses smooth sigmoid + log loss—different training, similar inductive bias.
Multi-Layer Perceptron (MLP)
What Is an MLP?
A multi-layer perceptron is a feedforward network: data flows from input → hidden layer(s) → output with no cycles. Each layer applies an affine map (linear transform + bias) followed by an activation σ (applied element-wise), except sometimes the output layer uses a different activation or none at all (e.g. raw logits for softmax cross-entropy).
The perceptron is a single linear threshold unit. An MLP adds hidden neurons so the model can represent non-convex decision regions built from piecewise-linear (ReLU) or smooth (sigmoid/tanh) building blocks.
Layer Equations
Let aâ½â°â¾ = x be the input. For layer l = 1 … L:
- zâ½Ë¡â¾ = Wâ½Ë¡â¾ aâ½Ë¡â»Â¹â¾ + bâ½Ë¡â¾ (matrix form: weights multiply activations from the previous layer)
- aâ½Ë¡â¾ = σâ½Ë¡â¾(zâ½Ë¡â¾) (element-wise; σ may differ per layer)
If σ is ReLU and shapes are Wâ½Ë¡â¾ ∈ â„dl×dl−1, then zâ½Ë¡â¾ has length dl. For a mini-batch of N examples, you stack rows so Aâ½Ë¡â»Â¹â¾ is N × dl−1 and use the same formulas with matrix multiplication.
Why Hidden Layers? XOR in One Diagram
No single line separates XOR in the original 2D input space. A small MLP can map inputs into a new space where XOR is linearly separable. A classic minimal pattern is 2 inputs → 2 hidden units (with nonlinearity) → 1 output.
Intuition: hidden units act as feature detectors (e.g. “both onâ€, “both offâ€, “exclusiveâ€). The last layer combines them with a linear decision.
NumPy: Forward Pass Through a 2-Layer MLP
Toy architecture: 2 → 4 → 1 with ReLU in the hidden layer. We use random weights only to show shape flow; training would adjust W, b to fit data.
import numpy as np
def relu(z): return np.maximum(0, z)
rng = np.random.default_rng(7)
# batch N=3, input dim 2
X = rng.normal(0, 1, (3, 2))
W1 = rng.normal(0, 0.3, (2, 4))
b1 = np.zeros((1, 4))
W2 = rng.normal(0, 0.3, (4, 1))
b2 = np.zeros((1, 1))
z1 = X @ W1 + b1 # (3, 2) @ (2, 4) → (3, 4)
a1 = relu(z1)
z2 = a1 @ W2 + b2 # (3, 4) @ (4, 1) → (3, 1)
# For binary logits, z2 is enough; for probabilities add sigmoid in training setup
print("hidden shape:", a1.shape, "output shape:", z2.shape)
PyTorch: nn.Sequential
Frameworks package affine + activation as modules. A compact MLP for 10-dimensional inputs and 3 classes might look like this:
import torch
import torch.nn as nn
model = nn.Sequential(
nn.Linear(10, 32),
nn.ReLU(),
nn.Linear(32, 32),
nn.ReLU(),
nn.Linear(32, 3), # logits for 3 classes → CrossEntropyLoss
)
x = torch.randn(16, 10) # batch 16
logits = model(x)
print(logits.shape) # torch.Size([16, 3])
Forward Propagation
What Forward Propagation Does
Given fixed weights, forward propagation answers: “What output does this network produce for this input?†During training, you also forward-propagate to compute the loss, then backpropagate gradients. During deployment, you often only need the forward pass (sometimes with quantization or smaller models for speed).
Here Aâ½â°â¾ is the input matrix X with shape (N, dâ‚€). Weight matrix Wâ½Ë¡â¾ has shape (dl−1, dl) so that Aâ½Ë¡â»Â¹â¾Wâ½Ë¡â¾ is (N, dl). Bias bâ½Ë¡â¾ broadcasts across rows.
Shape Rules (Sanity Checks)
For dense layer: out = in @ W + b
- in: (N, d_in)
- W: (d_in, d_out)
- out: (N, d_out)
- b: (1, d_out) or (d_out,) with broadcasting
x.shape, w.shape, and expected matmul rule: inner dimensions must match (d_in).
NumPy: Full Forward Through a Small MLP
Same pattern as the MLP lesson, but explicit loop over layers for clarity: 784 → 128 → 64 → 10 (like a toy MNIST-style head).
import numpy as np
def relu(z): return np.maximum(0, z)
rng = np.random.default_rng(42)
N, d0, d1, d2, d3 = 32, 784, 128, 64, 10
X = rng.standard_normal((N, d0))
W1 = rng.normal(0, 0.05, (d0, d1))
b1 = np.zeros((1, d1))
W2 = rng.normal(0, 0.05, (d1, d2))
b2 = np.zeros((1, d2))
W3 = rng.normal(0, 0.05, (d2, d3))
b3 = np.zeros((1, d3))
a = X
a = relu(a @ W1 + b1)
a = relu(a @ W2 + b2)
logits = a @ W3 + b3 # (N, 10) — pass to softmax + CE in training
print("logits shape:", logits.shape)
PyTorch: eval(), torch.no_grad()
For inference you disable gradient tracking to save memory and compute. Also set model.eval() so layers like dropout and batch norm use inference behavior.
import torch
import torch.nn as nn
model = nn.Sequential(
nn.Linear(784, 128),
nn.ReLU(),
nn.Linear(128, 64),
nn.ReLU(),
nn.Linear(64, 10),
)
model.eval()
x = torch.randn(64, 784)
with torch.no_grad():
logits = model(x)
probs = torch.softmax(logits, dim=1)
print(probs.shape, probs[0].sum())
Complexity (Rough Intuition)
A dense layer’s matmul (N×d_in) @ (d_in×d_out) dominates cost: on the order of N × d_in × d_out multiply-adds. Deeper/wider nets multiply this per layer. Convolutions reuse weights over spatial positions and scale differently. For large models, mixed precision (FP16/BF16) and hardware (GPU/TPU) matter as much as algorithm choice.
Summary
- Forward propagation = apply layers in order with fixed parameters.
- Batches stack as rows; check matmul shapes at every layer.
- Use
eval()+torch.no_grad()for standard PyTorch inference. - Next in the track: define loss functions on top of these logits.
Computational Graphs
Nodes, Edges, and Evaluation Order
Each node knows how to compute its outputs from its inputs (forward) and how to propagate gradients from outputs back to inputs (backward). The graph must be acyclic so there is a clear topological order: you can run forward from inputs to loss, then backward from loss to parameters.
Shared subgraphs (the same tensor feeding multiple consumers) require gradient accumulation: several backward paths add their contributions to the same tensor’s gradient. Frameworks handle this with reference counting or explicit “backward hooks.â€
Dynamic vs Static Graphs
Dynamic (define-by-run) graphs, typical of eager PyTorch, are built as Python executes each line. That makes debugging and control flow (loops, if statements) natural—the graph can change every iteration.
Static (define-then-run) graphs, historically common in TensorFlow 1-style sessions, describe the whole program once, then feed data repeatedly. Modern TensorFlow 2 and JAX blur the line with tracing and compilation (e.g. tf.function, XLA) that specialize a graph for performance while keeping flexible front ends.
Neither style changes the underlying math: both need correct forward and backward rules per primitive op. Static compilation can fuse kernels and optimize memory; dynamic execution favors research velocity.
Memory, Checkpointing, and Recomputation
Backward needs whatever forward saved: typically input tensors to each op (or enough to recompute them). Long sequences (RNNs) and huge vision models motivated gradient checkpointing: do not store every activation; re-run forward segments during backward. You trade compute for RAM—essential for large-batch or long-context training.
retain_graph=True is accidentally keeping graphs alive.
Higher-Order and Custom Ops
Hessian-vector products and meta-learning sometimes need derivatives of derivatives. Frameworks can extend autodiff to second order, but cost grows quickly. Custom CUDA kernels or fused ops still participate in the graph if wrapped with autograd rules.
When you implement a new layer, you provide forward and backward (or use autograd.Function in PyTorch). The graph stays consistent as long as local gradients are correct.
Seeing the Idea in PyTorch
Every tensor operation that touches a leaf with requires_grad=True records a grad_fn linking to the graph. Inspecting loss.grad_fn after a forward pass reveals the backward function chain—useful for education, rarely needed day-to-day.
import torch
w = torch.tensor(1.0, requires_grad=True)
x = torch.tensor(2.0, requires_grad=False)
y = w * x + 3
loss = y ** 2
print(loss.grad_fn) # PowBackward0
print(loss.grad_fn.next_functions)
Summary
- Networks are DAGs of differentiable ops; forward evaluates values, backward propagates gradients.
- Reverse-mode autodiff (backprop) is efficient when there are many inputs and one scalar loss.
- Graph style (dynamic vs compiled static) affects tooling and performance, not the chain rule.
- Memory management (checkpointing) is part of scaling real graphs.