
Multi-Layer Perceptron (MLP)

An MLP stacks fully connected layers with nonlinear activations between them. A single hidden layer is enough to solve problems such as XOR that a single perceptron cannot solve. This page covers the structure and notation, with small forward-pass examples you can run.


What Is an MLP?

A multi-layer perceptron is a feedforward network: data flows from input → hidden layer(s) → output with no cycles. Each layer applies an affine map (a linear transform plus a bias) followed by an element-wise activation σ; the output layer often uses a different activation, or none at all (e.g. raw logits fed to softmax cross-entropy).

The perceptron is a single linear threshold unit. An MLP adds hidden neurons so the model can represent non-convex decision regions built from piecewise-linear (ReLU) or smooth (sigmoid/tanh) building blocks.

input x (d₀) → [W⁽¹⁾, b⁽¹⁾] → σ → h⁽¹⁾ (d₁) → [W⁽²⁾, b⁽²⁾] → σ → … → output (d_L)

Layer Equations

Let a⁽⁰⁾ = x be the input. For layer l = 1 … L:

  • z⁽ˡ⁾ = W⁽ˡ⁾ a⁽ˡ⁻¹⁾ + b⁽ˡ⁾ (matrix form: weights multiply activations from the previous layer)
  • a⁽ˡ⁾ = σ⁽ˡ⁾(z⁽ˡ⁾) (element-wise; σ may differ per layer)

If σ is ReLU and W⁽ˡ⁾ ∈ ℝ^(dₗ × dₗ₋₁), then z⁽ˡ⁾ has length dₗ. For a mini-batch of N examples, stack the examples as rows so A⁽ˡ⁻¹⁾ is N × dₗ₋₁ and apply the same formulas with matrix multiplication, either transposing the weights (Z⁽ˡ⁾ = A⁽ˡ⁻¹⁾ W⁽ˡ⁾ᵀ + b⁽ˡ⁾) or storing them as dₗ₋₁ × dₗ, as the NumPy example below does.
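
The two equations above are just a loop over layers. Here is a minimal sketch of that loop, assuming weights are stored as dₗ₋₁ × dₗ matrices (the row-oriented convention the NumPy example below also uses) and ReLU everywhere except the final layer; the layer sizes are illustrative.

Generic forward loop (sketch)
import numpy as np

def forward(X, params, act=lambda z: np.maximum(0, z)):
    # params: list of (W, b) pairs, W of shape (d_prev, d_next)
    A = X
    for i, (W, b) in enumerate(params):
        Z = A @ W + b                                # affine map per layer
        A = act(Z) if i < len(params) - 1 else Z     # final layer: raw outputs (logits)
    return A

rng = np.random.default_rng(0)
sizes = [2, 4, 1]                                    # illustrative widths
params = [(rng.normal(0, 0.3, (m, n)), np.zeros((1, n)))
          for m, n in zip(sizes[:-1], sizes[1:])]
print(forward(rng.normal(0, 1, (3, 2)), params).shape)  # (3, 1)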

Parameter count. One dense layer has dₗ × dₗ₋₁ weights plus dₗ biases. Deeper/wider nets have more capacity but need more data and regularization to avoid overfitting.
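
For concreteness, here is that arithmetic for the two architectures used later on this page; the helper simply sums dₗ × dₗ₋₁ + dₗ over consecutive layer pairs.

Counting parameters
def mlp_param_count(sizes):
    # weights d_prev * d_next plus biases d_next, summed over consecutive layer pairs
    return sum(m * n + n for m, n in zip(sizes[:-1], sizes[1:]))

print(mlp_param_count([2, 4, 1]))        # 12 + 5 = 17
print(mlp_param_count([10, 32, 32, 3]))  # 352 + 1056 + 99 = 1507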

Why Hidden Layers? XOR in One Diagram

No single line separates XOR in the original 2D input space. A small MLP can map inputs into a new space where XOR is linearly separable. A classic minimal pattern is 2 inputs → 2 hidden units (with nonlinearity) → 1 output.

Intuition: hidden units act as feature detectors (e.g. “both on”, “both off”, “exclusive”). The last layer combines them with a linear decision.

You do not need to memorize XOR weights; modern training (gradient descent) discovers useful hidden representations from data. The important lesson: depth + nonlinearity unlocks functions a single linear boundary cannot express.
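
That said, a hand-constructed solution makes the idea concrete. One well-known set of weights for a 2 → 2 → 1 ReLU network solves XOR exactly (these specific values are an illustration, not something training is guaranteed to find):

XOR by hand: 2 → 2 → 1
import numpy as np

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])   # all four XOR inputs
W1 = np.array([[1, 1], [1, 1]]); b1 = np.array([0, -1])
w2 = np.array([1, -2]);          b2 = 0

h = np.maximum(0, X @ W1 + b1)   # unit 1: how many inputs are on; unit 2: on only when both are on
y = h @ w2 + b2                  # subtract twice the "both on" unit
print(y)                         # [0 1 1 0], which is XOR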

NumPy: Forward Pass Through a 2-Layer MLP

Toy architecture: 2 → 4 → 1 with ReLU in the hidden layer. We use random weights only to show shape flow; training would adjust W, b to fit data.

Forward: 2 → 4 → 1
import numpy as np

def relu(z): return np.maximum(0, z)

rng = np.random.default_rng(7)
# batch N=3, input dim 2
X = rng.normal(0, 1, (3, 2))
W1 = rng.normal(0, 0.3, (2, 4))
b1 = np.zeros((1, 4))
W2 = rng.normal(0, 0.3, (4, 1))
b2 = np.zeros((1, 1))

z1 = X @ W1 + b1      # (3, 2) @ (2, 4) → (3, 4)
a1 = relu(z1)
z2 = a1 @ W2 + b2     # (3, 4) @ (4, 1) → (3, 1)
# For binary logits, z2 is enough; for probabilities add sigmoid in training setup
print("hidden shape:", a1.shape, "output shape:", z2.shape)

PyTorch: nn.Sequential

Frameworks package affine + activation as modules. A compact MLP for 10-dimensional inputs and 3 classes might look like this:

MLP classifier skeleton
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(10, 32),
    nn.ReLU(),
    nn.Linear(32, 32),
    nn.ReLU(),
    nn.Linear(32, 3),   # logits for 3 classes → CrossEntropyLoss
)
x = torch.randn(16, 10)   # batch 16
logits = model(x)
print(logits.shape)       # torch.Size([16, 3])
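
The comment on the last Linear layer hints at the usual pairing with nn.CrossEntropyLoss, which takes raw logits. A minimal, illustrative training step on this batch might look like the following (the random labels and the optimizer/learning rate are placeholders, not a recipe):

One training step (sketch)
criterion = nn.CrossEntropyLoss()                         # applies log-softmax internally
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)   # illustrative optimizer/learning rate

y = torch.randint(0, 3, (16,))    # fake class labels for the batch above
loss = criterion(model(x), y)     # compare logits against labels
optimizer.zero_grad()
loss.backward()
optimizer.step()
print(loss.item())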

Summary

  • MLP = stacked dense layers with nonlinear activations between them (the output layer may have none).
  • Notation: z = Wa + b, then a' = σ(z), repeated.
  • Hidden layers fix limitations of a single perceptron (e.g. XOR).
  • Batching stacks examples as rows; matrix multiply handles all at once.