
Forward Propagation

Forward propagation (inference) is the process of passing input data through every layer—linear maps, biases, activations—to obtain predictions. There is no weight update. This page stresses the batched matrix view, shape checking, and how to run inference efficiently in PyTorch.


What Forward Propagation Does

Given fixed weights, forward propagation answers: “What output does this network produce for this input?” During training, you also forward-propagate to compute the loss, then backpropagate gradients. During deployment, you often only need the forward pass (sometimes with quantization or smaller models for speed).

For each layer l, with the batch stacked as rows (A⁽ˡ⁻¹⁾ is N × dₗ₋₁):

Z⁽ˡ⁾ = A⁽ˡ⁻¹⁾ W⁽ˡ⁾ + b⁽ˡ⁾
A⁽ˡ⁾ = σ⁽ˡ⁾(Z⁽ˡ⁾)

Here A⁽⁰⁾ is the input matrix X with shape (N, d₀). Weight matrix W⁽ˡ⁾ has shape (dₗ₋₁, dₗ) so that A⁽ˡ⁻¹⁾W⁽ˡ⁾ is (N, dₗ). Bias b⁽ˡ⁾ broadcasts across the N rows.
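To make these shapes concrete, here is a minimal single-layer sketch in NumPy; the dimensions 4, 3, and 5 are arbitrary placeholders, not values from the lesson.

Single dense layer (shape walkthrough)
import numpy as np

rng = np.random.default_rng(0)
N, d_prev, d_curr = 4, 3, 5                  # batch size, previous width, current width
A_prev = rng.standard_normal((N, d_prev))    # activations from the previous layer
W = rng.standard_normal((d_prev, d_curr))    # weight matrix: (d_prev, d_curr)
b = np.zeros((1, d_curr))                    # bias row, broadcast over all N rows

Z = A_prev @ W + b                           # (4, 3) @ (3, 5) -> (4, 5)
A = np.maximum(0, Z)                         # ReLU leaves the shape unchanged
print(Z.shape, A.shape)                      # (4, 5) (4, 5)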

Shape Rules (Sanity Checks)

For a dense layer: out = in @ W + b

  • in: (N, d_in)
  • W: (d_in, d_out)
  • out: (N, d_out)
  • b: (1, d_out) or (d_out,) with broadcasting
Debug tip. When something breaks, print x.shape and w.shape and check the matmul rule: the inner dimensions (d_in) must match.
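That tip can be packaged as a tiny helper. The sketch below is illustrative (the name checked_dense is made up); it raises a readable error instead of a cryptic matmul failure.

Shape-checked dense layer (sketch)
import numpy as np

def checked_dense(x, W, b):
    """Compute x @ W + b after verifying the shape rules above."""
    if x.shape[1] != W.shape[0]:
        raise ValueError(f"inner dims must match: x is {x.shape}, W is {W.shape}")
    if b.shape[-1] != W.shape[1]:
        raise ValueError(f"bias width {b.shape[-1]} != d_out {W.shape[1]}")
    return x @ W + b

out = checked_dense(np.ones((8, 3)), np.ones((3, 5)), np.zeros((1, 5)))
print(out.shape)   # (8, 5)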

NumPy: Full Forward Through a Small MLP

Same pattern as the MLP lesson, but with each layer written out explicitly for clarity: 784 → 128 → 64 → 10 (like a toy MNIST-style head). A loop-based version follows the snippet.

Batched forward (3-layer MLP)
import numpy as np

def relu(z): return np.maximum(0, z)

rng = np.random.default_rng(42)
N, d0, d1, d2, d3 = 32, 784, 128, 64, 10
X = rng.standard_normal((N, d0))     # input batch: (32, 784)
W1 = rng.normal(0, 0.05, (d0, d1))   # (784, 128)
b1 = np.zeros((1, d1))
W2 = rng.normal(0, 0.05, (d1, d2))   # (128, 64)
b2 = np.zeros((1, d2))
W3 = rng.normal(0, 0.05, (d2, d3))   # (64, 10)
b3 = np.zeros((1, d3))

a = X                         # start from the input batch: (N, 784)
a = relu(a @ W1 + b1)         # (N, 784) @ (784, 128) -> (N, 128)
a = relu(a @ W2 + b2)         # (N, 128) @ (128, 64)  -> (N, 64)
logits = a @ W3 + b3          # (N, 10) — pass to softmax + CE in training
print("logits shape:", logits.shape)

PyTorch: eval() and torch.no_grad()

For inference, disable gradient tracking to save memory and compute, and call model.eval() so layers like dropout and batch norm switch to their inference behavior.

Inference snippet
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(784, 128),
    nn.ReLU(),
    nn.Linear(128, 64),
    nn.ReLU(),
    nn.Linear(64, 10),
)
model.eval()
x = torch.randn(64, 784)
with torch.no_grad():
    logits = model(x)
probs = torch.softmax(logits, dim=1)
print(probs.shape, probs[0].sum())
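The Sequential model above contains no dropout or batch-norm layers, so eval() does not change its outputs; the following sketch (with arbitrary layer sizes) shows why the call matters when such layers are present:

Why eval() matters with dropout
import torch
import torch.nn as nn

torch.manual_seed(0)
net = nn.Sequential(nn.Linear(8, 8), nn.ReLU(), nn.Dropout(p=0.5), nn.Linear(8, 2))
x = torch.randn(4, 8)

net.train()                       # dropout active: random masks change each call
with torch.no_grad():
    t1, t2 = net(x), net(x)

net.eval()                        # dropout disabled: forward pass is deterministic
with torch.no_grad():
    e1, e2 = net(x), net(x)

print(torch.allclose(t1, t2))     # almost always False
print(torch.allclose(e1, e2))     # True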

Complexity (Rough Intuition)

A dense layer's matmul (N×d_in) @ (d_in×d_out) dominates its cost: on the order of N × d_in × d_out multiply-adds. Deeper and wider networks pay this cost at every layer. Convolutions reuse weights across spatial positions and scale differently. For large models, mixed precision (FP16/BF16) and hardware (GPU/TPU) matter as much as the algorithm choice.
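As a rough check of that estimate, the multiply-add count for the 784 → 128 → 64 → 10 network above with a batch of 32 (ignoring biases and activations) works out as follows:

Multiply-add count for the toy MLP
N = 32
dims = [784, 128, 64, 10]
macs = sum(N * d_in * d_out for d_in, d_out in zip(dims[:-1], dims[1:]))
print(macs)   # 32 * (784*128 + 128*64 + 64*10) = 3,493,888 multiply-adds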

Summary

  • Forward propagation = apply layers in order with fixed parameters.
  • Batches stack as rows; check matmul shapes at every layer.
  • Use eval() + torch.no_grad() for standard PyTorch inference.
  • Next in the track: define loss functions on top of these logits.

Next: Loss functions compare predictions to targets and drive learning.