Neural Networks

Neural Network Fundamentals

Perceptron, MLP, forward propagation, and computational graphs for neural networks.

The Perceptron

What Is a Perceptron?

In 1957, Frank Rosenblatt described the perceptron as an algorithm and a simple device that could learn from examples. In modern terms, a (single-layer) perceptron for binary classification computes a linear score

z = w₁x₁ + w₂x₂ + … + wdxd + b = w·x + b

and outputs a class label using a threshold. With labels in {-1, +1}, a common convention is ŷ = sign(z) (with some tie-breaking rule at z = 0). With labels in {0, 1}, people often use a step: ŷ = 1 if z ≥ 0 else 0. The weights w and bias b are what we learn from data.

x₁ ──w₁──┐ x₂ ──w₂──┼──► Σ + b ──► activation (step/sign) ──► ŷ … │ xd ──wd──┘
Connection. If you replace the step with a sigmoid and train with log loss, you get logistic regression—still a linear model, but with smooth probabilities and gradients everywhere. The perceptron instead uses a hard threshold and a rule that only updates on mistakes.

Geometry: The Decision Boundary

The equation w·x + b = 0 defines a hyperplane in input space. Points on one side are classified as one class, points on the other as the second class. The vector w is normal (perpendicular) to that hyperplane; the bias b shifts the plane away from the origin.

In two dimensions, the boundary is a line. For example, if w = [1, -1] and b = 0, the line is x₁ - x₂ = 0 (i.e. x₁ = x₂). Points above/below that line get different labels depending on the sign of z.

Tiny numeric check

Let w = [2, -1], b = -1, and x = [1, 1]. Then z = 2(1) + (-1)(1) - 1 = 0. If we use ŷ = 1 when z ≥ 0, this point lies on the boundary. For x = [2, 0], z = 4 - 0 - 1 = 3 > 0 → one side of the boundary; for x = [0, 2], z = 0 - 2 - 1 = -3 → the other side.

Perceptron Learning Rule

Assume labels y ∈ {-1, +1} and prediction ŷ = sign(z) with z = w·x + b. The classic perceptron update runs only when the example is misclassified (y ≠ ŷ):

  • w ← w + η · y · x
  • b ← b + η · y

Here η > 0 is the learning rate. Intuition: if the true label is +1 but ŷ = -1, the score z was too low; adding a positive multiple of x to w tilts the hyperplane to increase z on that kind of input. If y = -1 and ŷ = +1, the update subtracts a multiple of x.

NumPy: Train on AND and OR

Logical AND and OR on two binary inputs are linearly separable. The snippet below uses labels -1 and +1, prediction sign(z), and the perceptron update on mistakes. After enough epochs, weights should separate the points.

Perceptron training loop (AND)
import numpy as np

def sign(z):
    return np.where(z >= 0, 1, -1)

# AND: (+1 only when both inputs are +1)
# Bias input is implemented as constant 1 and weight w0 (bias b)
X = np.array([
    [-1, -1],
    [-1,  1],
    [ 1, -1],
    [ 1,  1],
], dtype=float)
y = np.array([-1, -1, -1, 1], dtype=float)  # AND with {-1, +1}

rng = np.random.default_rng(0)
w = rng.normal(0, 0.1, 2)
b = 0.0
eta = 0.5

for epoch in range(20):
    err = 0
    for xi, target in zip(X, y):
        z = np.dot(w, xi) + b
        pred = sign(z)
        if pred != target:
            err += 1
            w = w + eta * target * xi
            b = b + eta * target
    if err == 0:
        print(f"Converged epoch {epoch}")
        break

print("w =", w, "b =", b)
print("predictions:", [sign(np.dot(w, xi) + b) for xi in X])
Try OR yourself

For OR with {-1,+1}, targets should be [-1, 1, 1, 1] for the same input order. Swap y and rerun; the algorithm should again converge.

y_or = np.array([-1, 1, 1, 1], dtype=float)

Limitation: XOR Is Not Linearly Separable

The XOR function (exclusive or) outputs +1 when inputs differ and -1 when they are equal. In the 2D plane with corners at (-1,-1), (-1,1), (1,-1), (1,1), no single straight line separates the two classes. The perceptron cannot represent XOR with one linear threshold unit.

This is the famous limitation discussed by Minsky and Papert (1969): one layer of linear threshold units is weak unless you add hidden layers or nonlinear features. A multi-layer perceptron (MLP) with hidden units and nonlinear activations can learn XOR—our next tutorial topic extends the story from one neuron to a stack of layers.

Feature trick. You could map (x₁, x₂) to features like (x₁, x₂, x₁x₂) and then use a linear classifier in that lifted space—conceptually similar to what hidden layers do automatically.

PyTorch: Single Linear + Threshold (Illustration)

Modern frameworks rarely train with the discrete perceptron rule; they use continuous losses and autograd. For comparison, a single nn.Linear with two inputs and one output is exactly the z = w·x + b part; you would still need a step for a literal perceptron. The snippet shows only the linear part—training it with BCEWithLogitsLoss is closer to logistic regression than to the classical perceptron algorithm.

Linear layer = perceptron pre-activation
import torch
import torch.nn as nn

# One linear neuron: 2 features -> 1 logit
model = nn.Linear(2, 1, bias=True)
x = torch.tensor([[1.0, -1.0], [-1.0, 1.0]])
logits = model(x)
print("logits shape:", logits.shape)
print("weights:", model.weight.data)
print("bias:", model.bias.data)

Summary

  • The perceptron computes z = w·x + b and applies a hard threshold for binary labels.
  • Its decision boundary is a hyperplane; the algorithm moves that hyperplane when it makes mistakes.
  • It converges for linearly separable data; XOR motivates hidden layers (MLPs).
  • Logistic regression keeps linear geometry but uses smooth sigmoid + log loss—different training, similar inductive bias.

Multi-Layer Perceptron (MLP)

What Is an MLP?

A multi-layer perceptron is a feedforward network: data flows from input → hidden layer(s) → output with no cycles. Each layer applies an affine map (linear transform + bias) followed by an activation σ (applied element-wise), except sometimes the output layer uses a different activation or none at all (e.g. raw logits for softmax cross-entropy).

The perceptron is a single linear threshold unit. An MLP adds hidden neurons so the model can represent non-convex decision regions built from piecewise-linear (ReLU) or smooth (sigmoid/tanh) building blocks.

input x (d₀) → [W¹,b¹] → σ → h¹ (d₁) → [W²,b²] → σ → … → output (dout)

Layer Equations

Let a⁽⁰⁾ = x be the input. For layer l = 1 … L:

  • z⁽ˡ⁾ = W⁽ˡ⁾ a⁽ˡ⁻¹⁾ + b⁽ˡ⁾ (matrix form: weights multiply activations from the previous layer)
  • a⁽ˡ⁾ = σ⁽ˡ⁾(z⁽ˡ⁾) (element-wise; σ may differ per layer)

If σ is ReLU and shapes are W⁽ˡ⁾ ∈ ℝdl×dl−1, then z⁽ˡ⁾ has length dl. For a mini-batch of N examples, you stack rows so A⁽ˡ⁻¹⁾ is N × dl−1 and use the same formulas with matrix multiplication.

Parameter count. One dense layer has dl × dl−1 weights plus dl biases. Deeper/wider nets have more capacity but need more data and regularization to avoid overfitting.

Why Hidden Layers? XOR in One Diagram

No single line separates XOR in the original 2D input space. A small MLP can map inputs into a new space where XOR is linearly separable. A classic minimal pattern is 2 inputs → 2 hidden units (with nonlinearity) → 1 output.

Intuition: hidden units act as feature detectors (e.g. “both on”, “both off”, “exclusive”). The last layer combines them with a linear decision.

NumPy: Forward Pass Through a 2-Layer MLP

Toy architecture: 2 → 4 → 1 with ReLU in the hidden layer. We use random weights only to show shape flow; training would adjust W, b to fit data.

Forward: 2 → 4 → 1
import numpy as np

def relu(z): return np.maximum(0, z)

rng = np.random.default_rng(7)
# batch N=3, input dim 2
X = rng.normal(0, 1, (3, 2))
W1 = rng.normal(0, 0.3, (2, 4))
b1 = np.zeros((1, 4))
W2 = rng.normal(0, 0.3, (4, 1))
b2 = np.zeros((1, 1))

z1 = X @ W1 + b1      # (3, 2) @ (2, 4) → (3, 4)
a1 = relu(z1)
z2 = a1 @ W2 + b2     # (3, 4) @ (4, 1) → (3, 1)
# For binary logits, z2 is enough; for probabilities add sigmoid in training setup
print("hidden shape:", a1.shape, "output shape:", z2.shape)

PyTorch: nn.Sequential

Frameworks package affine + activation as modules. A compact MLP for 10-dimensional inputs and 3 classes might look like this:

MLP classifier skeleton
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(10, 32),
    nn.ReLU(),
    nn.Linear(32, 32),
    nn.ReLU(),
    nn.Linear(32, 3),   # logits for 3 classes → CrossEntropyLoss
)
x = torch.randn(16, 10)   # batch 16
logits = model(x)
print(logits.shape)       # torch.Size([16, 3])

Forward Propagation

What Forward Propagation Does

Given fixed weights, forward propagation answers: “What output does this network produce for this input?” During training, you also forward-propagate to compute the loss, then backpropagate gradients. During deployment, you often only need the forward pass (sometimes with quantization or smaller models for speed).

For each layer l: Z(l) = A(l−1) W(l) + b(l) (batch as rows: A is N×d) A(l) = σ(l)( Z(l) )

Here A⁽⁰⁾ is the input matrix X with shape (N, d₀). Weight matrix W⁽ˡ⁾ has shape (dl−1, dl) so that A⁽ˡ⁻¹⁾W⁽ˡ⁾ is (N, dl). Bias b⁽ˡ⁾ broadcasts across rows.

Shape Rules (Sanity Checks)

For dense layer: out = in @ W + b

  • in: (N, d_in)
  • W: (d_in, d_out)
  • out: (N, d_out)
  • b: (1, d_out) or (d_out,) with broadcasting
Debug tip. When something breaks, print x.shape, w.shape, and expected matmul rule: inner dimensions must match (d_in).

NumPy: Full Forward Through a Small MLP

Same pattern as the MLP lesson, but explicit loop over layers for clarity: 784 → 128 → 64 → 10 (like a toy MNIST-style head).

Batched forward (3-layer MLP)
import numpy as np

def relu(z): return np.maximum(0, z)

rng = np.random.default_rng(42)
N, d0, d1, d2, d3 = 32, 784, 128, 64, 10
X = rng.standard_normal((N, d0))
W1 = rng.normal(0, 0.05, (d0, d1))
b1 = np.zeros((1, d1))
W2 = rng.normal(0, 0.05, (d1, d2))
b2 = np.zeros((1, d2))
W3 = rng.normal(0, 0.05, (d2, d3))
b3 = np.zeros((1, d3))

a = X
a = relu(a @ W1 + b1)
a = relu(a @ W2 + b2)
logits = a @ W3 + b3          # (N, 10) — pass to softmax + CE in training
print("logits shape:", logits.shape)

PyTorch: eval(), torch.no_grad()

For inference you disable gradient tracking to save memory and compute. Also set model.eval() so layers like dropout and batch norm use inference behavior.

Inference snippet
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(784, 128),
    nn.ReLU(),
    nn.Linear(128, 64),
    nn.ReLU(),
    nn.Linear(64, 10),
)
model.eval()
x = torch.randn(64, 784)
with torch.no_grad():
    logits = model(x)
probs = torch.softmax(logits, dim=1)
print(probs.shape, probs[0].sum())

Complexity (Rough Intuition)

A dense layer’s matmul (N×d_in) @ (d_in×d_out) dominates cost: on the order of N × d_in × d_out multiply-adds. Deeper/wider nets multiply this per layer. Convolutions reuse weights over spatial positions and scale differently. For large models, mixed precision (FP16/BF16) and hardware (GPU/TPU) matter as much as algorithm choice.

Summary

  • Forward propagation = apply layers in order with fixed parameters.
  • Batches stack as rows; check matmul shapes at every layer.
  • Use eval() + torch.no_grad() for standard PyTorch inference.
  • Next in the track: define loss functions on top of these logits.

Computational Graphs

Nodes, Edges, and Evaluation Order

Each node knows how to compute its outputs from its inputs (forward) and how to propagate gradients from outputs back to inputs (backward). The graph must be acyclic so there is a clear topological order: you can run forward from inputs to loss, then backward from loss to parameters.

Shared subgraphs (the same tensor feeding multiple consumers) require gradient accumulation: several backward paths add their contributions to the same tensor’s gradient. Frameworks handle this with reference counting or explicit “backward hooks.”

Example: a = x * y; b = relu(a); L = (b - t)**2 Nodes: mul, relu, sub, pow/square Forward: bottom-up in topo order Backward: start at ∂L/∂L = 1, visit nodes in reverse order

Dynamic vs Static Graphs

Dynamic (define-by-run) graphs, typical of eager PyTorch, are built as Python executes each line. That makes debugging and control flow (loops, if statements) natural—the graph can change every iteration.

Static (define-then-run) graphs, historically common in TensorFlow 1-style sessions, describe the whole program once, then feed data repeatedly. Modern TensorFlow 2 and JAX blur the line with tracing and compilation (e.g. tf.function, XLA) that specialize a graph for performance while keeping flexible front ends.

Neither style changes the underlying math: both need correct forward and backward rules per primitive op. Static compilation can fuse kernels and optimize memory; dynamic execution favors research velocity.

Memory, Checkpointing, and Recomputation

Backward needs whatever forward saved: typically input tensors to each op (or enough to recompute them). Long sequences (RNNs) and huge vision models motivated gradient checkpointing: do not store every activation; re-run forward segments during backward. You trade compute for RAM—essential for large-batch or long-context training.

Profiling. OOM errors often mean activations dominate memory. Check batch size, image resolution, sequence length, and whether retain_graph=True is accidentally keeping graphs alive.

Higher-Order and Custom Ops

Hessian-vector products and meta-learning sometimes need derivatives of derivatives. Frameworks can extend autodiff to second order, but cost grows quickly. Custom CUDA kernels or fused ops still participate in the graph if wrapped with autograd rules.

When you implement a new layer, you provide forward and backward (or use autograd.Function in PyTorch). The graph stays consistent as long as local gradients are correct.

Seeing the Idea in PyTorch

Every tensor operation that touches a leaf with requires_grad=True records a grad_fn linking to the graph. Inspecting loss.grad_fn after a forward pass reveals the backward function chain—useful for education, rarely needed day-to-day.

Graph leaves and grad_fn
import torch
w = torch.tensor(1.0, requires_grad=True)
x = torch.tensor(2.0, requires_grad=False)
y = w * x + 3
loss = y ** 2
print(loss.grad_fn)          # PowBackward0
print(loss.grad_fn.next_functions)

Summary

  • Networks are DAGs of differentiable ops; forward evaluates values, backward propagates gradients.
  • Reverse-mode autodiff (backprop) is efficient when there are many inputs and one scalar loss.
  • Graph style (dynamic vs compiled static) affects tooling and performance, not the chain rule.
  • Memory management (checkpointing) is part of scaling real graphs.