Batch Normalization

Batch normalization (BN) standardizes the inputs to each layer using statistics computed over the mini-batch during training, then applies a learnable affine transform (scale γ and shift β) so the network keeps its representational power and can even recover the identity mapping if that is optimal. It tends to stabilize gradients, allow higher learning rates, and act as a mild regularizer, because each example’s normalization depends on the other examples in the batch.

What Batch Norm Does

For a tensor of activations, BN computes mean and variance across the normalization axes (for fully connected layers, often over the batch dimension; for conv layers, over batch and spatial dims per channel). It then transforms x̂ = (x − μ) / √(σ² + ε) and outputs y = γ x̂ + β. The small ε avoids division by zero.
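
As a concrete check, the transform can be reproduced by hand in PyTorch; the sketch below (illustrative only, not how you would implement BN in practice) normalizes a conv-style tensor per channel over batch and spatial dims and compares against nn.BatchNorm2d in training mode:

import torch
import torch.nn as nn

x = torch.randn(8, 4, 16, 16)                 # (N, C, H, W)
bn = nn.BatchNorm2d(4)                        # γ initialized to 1, β to 0
bn.train()

# Per-channel μ and σ² over batch and spatial dims (biased variance, as BN uses for normalization)
mu = x.mean(dim=(0, 2, 3), keepdim=True)
var = x.var(dim=(0, 2, 3), unbiased=False, keepdim=True)
x_hat = (x - mu) / torch.sqrt(var + bn.eps)
y_manual = bn.weight.view(1, -1, 1, 1) * x_hat + bn.bias.view(1, -1, 1, 1)

print(torch.allclose(bn(x), y_manual, atol=1e-5))   # True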

The original paper motivated BN as reducing internal covariate shift—the change in input distribution to layers as parameters update. Whether that story is the full explanation remains debated; empirically BN often smooths the loss landscape and improves optimization in CNNs.

Training vs Inference

During training, μ and σ² come from the current batch. During inference, batch statistics would be unreliable (and meaningless for a batch of one), so frameworks maintain exponential moving averages of the mean and variance during training and use those frozen values at test time.
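
A rough sketch of that bookkeeping, assuming PyTorch's convention (momentum defaults to 0.1 and the running variance uses the unbiased batch estimate):

import torch
import torch.nn as nn

bn = nn.BatchNorm2d(64)                  # running_mean starts at 0, running_var at 1
x = torch.randn(32, 64, 8, 8)

bn.train()
_ = bn(x)                                # a train-mode forward pass updates the buffers

# Roughly what happened inside:
# running_mean <- (1 - momentum) * running_mean + momentum * batch_mean
# running_var  <- (1 - momentum) * running_var  + momentum * batch_var (unbiased)
print(bn.running_mean[:4], bn.running_var[:4])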

In PyTorch, call model.eval() before validation or deployment so BatchNorm and Dropout switch behavior. Forgetting this is a classic source of “validation accuracy much worse than training” even when the model is fine.
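
A minimal sketch of the toggle (the model and shapes here are arbitrary):

import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Conv2d(3, 16, 3, padding=1),
    nn.BatchNorm2d(16),
    nn.ReLU(),
)

model.train()                            # BN uses batch stats and updates running buffers
# ... training loop ...

model.eval()                             # BN switches to the frozen running mean/variance
with torch.no_grad():
    out = model(torch.randn(1, 3, 32, 32))   # safe even for a single example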

Small batches. With very small batch sizes, batch statistics are high-variance; consider GroupNorm or LayerNorm in those regimes, or accumulate statistics across devices (e.g., SyncBatchNorm) so normalization sees a larger effective batch.
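
For example, a drop-in conv block whose normalization does not depend on batch statistics at all (the group count of 8 is an arbitrary illustrative choice):

import torch.nn as nn

# GroupNorm normalizes within each example over groups of channels,
# so its statistics are independent of batch size.
small_batch_block = nn.Sequential(
    nn.Conv2d(3, 64, 3, padding=1),
    nn.GroupNorm(8, 64),                 # 8 groups of 8 channels each
    nn.ReLU(),
)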

Where to Place BN

Common pattern in CNNs: Conv → BatchNorm → ReLU. For linear layers, Linear → BatchNorm → activation. Some architectures use BN before activation; others after—consistency within a model matters more than dogma, but follow the reference implementation when reproducing papers.

BN interacts with weight decay: a common practice is to exclude BN’s γ, β (and biases) from L2 regularization; in PyTorch, optimizer parameter groups (for example with AdamW) let you exclude them from decay if desired.
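
One common way to do this is with optimizer parameter groups; the split below (by parameter dimensionality, since biases and BN γ/β are 1-D) is a heuristic sketch rather than a universal rule:

import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Conv2d(3, 64, 3, padding=1),
    nn.BatchNorm2d(64),
    nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
    nn.Linear(64, 10),
)

decay, no_decay = [], []
for p in model.parameters():
    # 1-D parameters are biases and BN γ/β; keep them out of weight decay
    (no_decay if p.ndim == 1 else decay).append(p)

optimizer = torch.optim.AdamW(
    [
        {"params": decay, "weight_decay": 1e-2},
        {"params": no_decay, "weight_decay": 0.0},
    ],
    lr=1e-3,
)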

PyTorch: BatchNorm1d / BatchNorm2d

MLP and conv blocks
import torch.nn as nn

mlp_block = nn.Sequential(
    nn.Linear(512, 256),
    nn.BatchNorm1d(256),   # per-feature stats over the batch; expects (N, C) or (N, C, L)
    nn.ReLU(),
)

conv_block = nn.Sequential(
    nn.Conv2d(3, 64, 3, padding=1),
    nn.BatchNorm2d(64),    # per-channel stats over batch and spatial dims; expects (N, C, H, W)
    nn.ReLU(),
)
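
A quick shape check reusing the two blocks above (illustrative only):

import torch

x_fc  = torch.randn(32, 512)             # (N, features) for BatchNorm1d
x_img = torch.randn(32, 3, 32, 32)       # (N, C, H, W) for BatchNorm2d

print(mlp_block(x_fc).shape)             # torch.Size([32, 256])
print(conv_block(x_img).shape)           # torch.Size([32, 64, 32, 32])

Note that in train mode BatchNorm needs more than one value per channel, so a single-example batch through mlp_block would raise an error.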

Summary

  • BN normalizes activations per batch (per channel in conv) then applies learnable γ, β.
  • Training uses batch stats; inference uses running averages—toggle with train()/eval().
  • Often improves optimization for CNNs; small-batch settings may prefer GroupNorm/LayerNorm.
  • Placement and weight-decay handling should match your baseline architecture.

Next: when the model fits too well—overfitting—and how to recognize it on learning curves.