
Forward Propagation

Forward propagation (inference) is the process of passing input data through every layer—linear maps, biases, activations—to obtain predictions. There is no weight update. This page stresses the batched matrix view, shape checking, and how to run inference efficiently in PyTorch.


What Forward Propagation Does

Given fixed weights, forward propagation answers: “What output does this network produce for this input?” During training, you also forward-propagate to compute the loss, then backpropagate gradients. During deployment, you often only need the forward pass (sometimes with quantization or smaller models for speed).

For each layer l, with the batch stacked as rows (A⁽ˡ⁻¹⁾ is N × dₗ₋₁):

Z⁽ˡ⁾ = A⁽ˡ⁻¹⁾ W⁽ˡ⁾ + b⁽ˡ⁾
A⁽ˡ⁾ = σ⁽ˡ⁾(Z⁽ˡ⁾)

Here A⁽⁰⁾ is the input matrix X with shape (N, d₀). Weight matrix W⁽ˡ⁾ has shape (dₗ₋₁, dₗ) so that A⁽ˡ⁻¹⁾W⁽ˡ⁾ is (N, dₗ). Bias b⁽ˡ⁾ broadcasts across the N rows.
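To make these shapes concrete, here is a minimal single-layer sketch in NumPy; the dimensions 4, 3, and 5 are arbitrary placeholders, not values from the lesson.

Single dense layer (shape walkthrough)
import numpy as np

rng = np.random.default_rng(0)
N, d_prev, d_curr = 4, 3, 5                  # batch size, previous width, current width
A_prev = rng.standard_normal((N, d_prev))    # activations from the previous layer
W = rng.standard_normal((d_prev, d_curr))    # weight matrix: (d_prev, d_curr)
b = np.zeros((1, d_curr))                    # bias row, broadcast over all N rows

Z = A_prev @ W + b                           # (4, 3) @ (3, 5) -> (4, 5)
A = np.maximum(0, Z)                         # ReLU leaves the shape unchanged
print(Z.shape, A.shape)                      # (4, 5) (4, 5)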

Shape Rules (Sanity Checks)

For a dense layer: out = in @ W + b

  • in: (N, d_in)
  • W: (d_in, d_out)
  • out: (N, d_out)
  • b: (1, d_out) or (d_out,) with broadcasting
Debug tip. When something breaks, print x.shape and w.shape and check the matmul rule: the inner dimensions (d_in) must match.
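That tip can be packaged as a tiny helper. The sketch below is illustrative (the name checked_dense is made up); it raises a readable error instead of a cryptic matmul failure.

Shape-checked dense layer (sketch)
import numpy as np

def checked_dense(x, W, b):
    """Compute x @ W + b after verifying the shape rules above."""
    if x.shape[1] != W.shape[0]:
        raise ValueError(f"inner dims must match: x is {x.shape}, W is {W.shape}")
    if b.shape[-1] != W.shape[1]:
        raise ValueError(f"bias width {b.shape[-1]} != d_out {W.shape[1]}")
    return x @ W + b

out = checked_dense(np.ones((8, 3)), np.ones((3, 5)), np.zeros((1, 5)))
print(out.shape)   # (8, 5)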

NumPy: Full Forward Through a Small MLP

Same pattern as the MLP lesson, but with each layer written out explicitly for clarity: 784 → 128 → 64 → 10 (like a toy MNIST-style head). A loop-based version follows the snippet.

Batched forward (3-layer MLP)
import numpy as np

def relu(z): return np.maximum(0, z)

rng = np.random.default_rng(42)
N, d0, d1, d2, d3 = 32, 784, 128, 64, 10
X = rng.standard_normal((N, d0))     # input batch: (32, 784)
W1 = rng.normal(0, 0.05, (d0, d1))   # (784, 128)
b1 = np.zeros((1, d1))
W2 = rng.normal(0, 0.05, (d1, d2))   # (128, 64)
b2 = np.zeros((1, d2))
W3 = rng.normal(0, 0.05, (d2, d3))   # (64, 10)
b3 = np.zeros((1, d3))

a = X                         # start from the input batch: (N, 784)
a = relu(a @ W1 + b1)         # (N, 784) @ (784, 128) -> (N, 128)
a = relu(a @ W2 + b2)         # (N, 128) @ (128, 64)  -> (N, 64)
logits = a @ W3 + b3          # (N, 10) — pass to softmax + CE in training
print("logits shape:", logits.shape)

PyTorch: eval() and torch.no_grad()

For inference, disable gradient tracking to save memory and compute, and call model.eval() so layers like dropout and batch norm switch to their inference behavior.

Inference snippet
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(784, 128),
    nn.ReLU(),
    nn.Linear(128, 64),
    nn.ReLU(),
    nn.Linear(64, 10),
)
model.eval()
x = torch.randn(64, 784)
with torch.no_grad():
    logits = model(x)
probs = torch.softmax(logits, dim=1)
print(probs.shape, probs[0].sum())
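The Sequential model above contains no dropout or batch-norm layers, so eval() does not change its outputs; the following sketch (with arbitrary layer sizes) shows why the call matters when such layers are present:

Why eval() matters with dropout
import torch
import torch.nn as nn

torch.manual_seed(0)
net = nn.Sequential(nn.Linear(8, 8), nn.ReLU(), nn.Dropout(p=0.5), nn.Linear(8, 2))
x = torch.randn(4, 8)

net.train()                       # dropout active: random masks change each call
with torch.no_grad():
    t1, t2 = net(x), net(x)

net.eval()                        # dropout disabled: forward pass is deterministic
with torch.no_grad():
    e1, e2 = net(x), net(x)

print(torch.allclose(t1, t2))     # almost always False
print(torch.allclose(e1, e2))     # True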

Complexity (Rough Intuition)

A dense layer's matmul (N×d_in) @ (d_in×d_out) dominates its cost: on the order of N × d_in × d_out multiply-adds. Deeper and wider networks pay this cost at every layer. Convolutions reuse weights across spatial positions and scale differently. For large models, mixed precision (FP16/BF16) and hardware (GPU/TPU) matter as much as the algorithm choice.
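As a rough check of that estimate, the multiply-add count for the 784 → 128 → 64 → 10 network above with a batch of 32 (ignoring biases and activations) works out as follows:

Multiply-add count for the toy MLP
N = 32
dims = [784, 128, 64, 10]
macs = sum(N * d_in * d_out for d_in, d_out in zip(dims[:-1], dims[1:]))
print(macs)   # 32 * (784*128 + 128*64 + 64*10) = 3,493,888 multiply-adds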

Summary

  • Forward propagation = apply layers in order with fixed parameters.
  • Batches stack as rows; check matmul shapes at every layer.
  • Use eval() + torch.no_grad() for standard PyTorch inference.
  • Next in the track: define loss functions on top of these logits.

Next: Loss functions compare predictions to targets and drive learning.