
Multi-Layer Perceptron (MLP)

An MLP stacks fully connected layers with nonlinear activations between them. A single hidden layer is enough to solve problems such as XOR that a single perceptron cannot solve. This page covers the structure and notation, with small forward-pass examples you can run.


What Is an MLP?

A multi-layer perceptron is a feedforward network: data flows from input → hidden layer(s) → output with no cycles. Each layer applies an affine map (a linear transform plus a bias) followed by an element-wise activation σ; the output layer often uses a different activation, or none at all (e.g. raw logits fed to softmax cross-entropy).

The perceptron is a single linear threshold unit. An MLP adds hidden neurons so the model can represent non-convex decision regions built from piecewise-linear (ReLU) or smooth (sigmoid/tanh) building blocks.

input x (d₀) → [W⁽¹⁾, b⁽¹⁾] → σ → h⁽¹⁾ (d₁) → [W⁽²⁾, b⁽²⁾] → σ → … → output (d_L)

Layer Equations

Let a⁽⁰⁾ = x be the input. For layer l = 1 … L:

  • z⁽ˡ⁾ = W⁽ˡ⁾ a⁽ˡ⁻¹⁾ + b⁽ˡ⁾ (matrix form: weights multiply activations from the previous layer)
  • a⁽ˡ⁾ = σ⁽ˡ⁾(z⁽ˡ⁾) (element-wise; σ may differ per layer)

If σ is ReLU and W⁽ˡ⁾ ∈ ℝ^(dₗ × dₗ₋₁), then z⁽ˡ⁾ has length dₗ. For a mini-batch of N examples, stack the examples as rows so A⁽ˡ⁻¹⁾ is N × dₗ₋₁ and apply the same formulas with matrix multiplication, either transposing the weights (Z⁽ˡ⁾ = A⁽ˡ⁻¹⁾ W⁽ˡ⁾ᵀ + b⁽ˡ⁾) or storing them as dₗ₋₁ × dₗ, as the NumPy example below does.
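
The two equations above are just a loop over layers. Here is a minimal sketch of that loop, assuming weights are stored as dₗ₋₁ × dₗ matrices (the row-oriented convention the NumPy example below also uses) and ReLU everywhere except the final layer; the layer sizes are illustrative.

Generic forward loop (sketch)
import numpy as np

def forward(X, params, act=lambda z: np.maximum(0, z)):
    # params: list of (W, b) pairs, W of shape (d_prev, d_next)
    A = X
    for i, (W, b) in enumerate(params):
        Z = A @ W + b                                # affine map per layer
        A = act(Z) if i < len(params) - 1 else Z     # final layer: raw outputs (logits)
    return A

rng = np.random.default_rng(0)
sizes = [2, 4, 1]                                    # illustrative widths
params = [(rng.normal(0, 0.3, (m, n)), np.zeros((1, n)))
          for m, n in zip(sizes[:-1], sizes[1:])]
print(forward(rng.normal(0, 1, (3, 2)), params).shape)  # (3, 1)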

Parameter count. One dense layer has dₗ × dₗ₋₁ weights plus dₗ biases. Deeper/wider nets have more capacity but need more data and regularization to avoid overfitting.
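
For concreteness, here is that arithmetic for the two architectures used later on this page; the helper simply sums dₗ × dₗ₋₁ + dₗ over consecutive layer pairs.

Counting parameters
def mlp_param_count(sizes):
    # weights d_prev * d_next plus biases d_next, summed over consecutive layer pairs
    return sum(m * n + n for m, n in zip(sizes[:-1], sizes[1:]))

print(mlp_param_count([2, 4, 1]))        # 12 + 5 = 17
print(mlp_param_count([10, 32, 32, 3]))  # 352 + 1056 + 99 = 1507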

Why Hidden Layers? XOR in One Diagram

No single line separates XOR in the original 2D input space. A small MLP can map inputs into a new space where XOR is linearly separable. A classic minimal pattern is 2 inputs → 2 hidden units (with nonlinearity) → 1 output.

Intuition: hidden units act as feature detectors (e.g. “both on”, “both off”, “exclusive”). The last layer combines them with a linear decision.

You do not need to memorize XOR weights; modern training (gradient descent) discovers useful hidden representations from data. The important lesson: depth + nonlinearity unlocks functions a single linear boundary cannot express.
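
That said, a hand-constructed solution makes the idea concrete. One well-known set of weights for a 2 → 2 → 1 ReLU network solves XOR exactly (these specific values are an illustration, not something training is guaranteed to find):

XOR by hand: 2 → 2 → 1
import numpy as np

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])   # all four XOR inputs
W1 = np.array([[1, 1], [1, 1]]); b1 = np.array([0, -1])
w2 = np.array([1, -2]);          b2 = 0

h = np.maximum(0, X @ W1 + b1)   # unit 1: how many inputs are on; unit 2: on only when both are on
y = h @ w2 + b2                  # subtract twice the "both on" unit
print(y)                         # [0 1 1 0], which is XOR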

NumPy: Forward Pass Through a 2-Layer MLP

Toy architecture: 2 → 4 → 1 with ReLU in the hidden layer. We use random weights only to show shape flow; training would adjust W, b to fit data.

Forward: 2 → 4 → 1
import numpy as np

def relu(z): return np.maximum(0, z)

rng = np.random.default_rng(7)
# batch N=3, input dim 2
X = rng.normal(0, 1, (3, 2))
W1 = rng.normal(0, 0.3, (2, 4))
b1 = np.zeros((1, 4))
W2 = rng.normal(0, 0.3, (4, 1))
b2 = np.zeros((1, 1))

z1 = X @ W1 + b1      # (3, 2) @ (2, 4) → (3, 4)
a1 = relu(z1)
z2 = a1 @ W2 + b2     # (3, 4) @ (4, 1) → (3, 1)
# For binary logits, z2 is enough; for probabilities add sigmoid in training setup
print("hidden shape:", a1.shape, "output shape:", z2.shape)

PyTorch: nn.Sequential

Frameworks package affine + activation as modules. A compact MLP for 10-dimensional inputs and 3 classes might look like this:

MLP classifier skeleton
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(10, 32),
    nn.ReLU(),
    nn.Linear(32, 32),
    nn.ReLU(),
    nn.Linear(32, 3),   # logits for 3 classes → CrossEntropyLoss
)
x = torch.randn(16, 10)   # batch 16
logits = model(x)
print(logits.shape)       # torch.Size([16, 3])
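
The comment on the last Linear layer hints at the usual pairing with nn.CrossEntropyLoss, which takes raw logits. A minimal, illustrative training step on this batch might look like the following (the random labels and the optimizer/learning rate are placeholders, not a recipe):

One training step (sketch)
criterion = nn.CrossEntropyLoss()                         # applies log-softmax internally
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)   # illustrative optimizer/learning rate

y = torch.randint(0, 3, (16,))    # fake class labels for the batch above
loss = criterion(model(x), y)     # compare logits against labels
optimizer.zero_grad()
loss.backward()
optimizer.step()
print(loss.item())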

Summary

  • MLP = stacked dense layers with nonlinear activations between them (the output layer may have none).
  • Notation: z = Wa + b, then a' = σ(z), repeated.
  • Hidden layers fix limitations of a single perceptron (e.g. XOR).
  • Batching stacks examples as rows; matrix multiply handles all at once.