Neural Networks

Network Design & Regularization

Network design, weight initialization, batch norm, overfitting, and dropout.

Network Design

Capacity: What Can the Model Represent?

Roughly, more parameters and more nonlinear layers increase the set of functions you can approximate—up to limits imposed by architecture (e.g. a linear model stays linear). But bigger capacity without enough data or regularization invites overfitting: excellent training loss, poor generalization. Too little capacity yields underfitting: the model cannot reduce training error enough.

Validation curves guide you: if train and validation errors are both high, increase capacity or train longer; if train is low but validation is high, add data, regularization, or reduce capacity. Modern practice often starts with an established baseline architecture for the domain (ResNet-style CNNs, Transformer blocks) and adjusts width/depth to match GPU memory and dataset size.

Depth vs Width

Width (many neurons per layer) increases representational power in a single “slice” of computation; very wide shallow nets can approximate many functions. Depth (many layers) enables hierarchical composition: early layers can build simple features, later layers combine them. Empirically, depth helps in vision and language when paired with skip connections and normalization.

There is no universal formula. Rules of thumb for MLPs on tabular data might start with one or two hidden layers of moderate width and grow until validation metrics plateau. For images, convolutional locality and weight sharing are usually more parameter-efficient than gigantic fully connected stacks.

Universal Approximation (What It Does and Does Not Say)

The universal approximation theorem (informally) says that an MLP with a single hidden layer and enough nonlinear units can approximate continuous functions on compact domains arbitrarily well. This is an existence result: it does not tell you how many units you need, how to train them, or that one hidden layer is optimal in practice.

Deep networks often achieve the same accuracy with fewer parameters than a single enormous hidden layer would require. So UAT motivates why neural nets are reasonable function approximators—not why your specific 3-layer net will converge on Tuesday.

Inductive Bias: Architecture Encodes Assumptions

CNNs assume translation equivariance and local structure; they share weights across space. RNNs assume sequential structure with a hidden state. Transformers assume flexible pairwise interactions mediated by attention. An MLP assumes almost nothing about input structure: its first layer connects every input dimension to every hidden unit, with no notion of locality or order. That is fine for some tabular data but wasteful for images, where nearby pixels correlate.

Choosing the right inductive bias reduces sample complexity: the model searches a smaller, more relevant hypothesis space. When in doubt, copy a proven blueprint for your modality and modify incrementally.

Practical Heuristics

  • Match output dimension and activation to the loss (e.g. K logits + cross-entropy).
  • Increase width before extreme depth if optimization is unstable; add normalization (BatchNorm, LayerNorm) for deep stacks.
  • Use skip connections (ResNet) when adding depth to help gradient flow.
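  • Profile memory: weights, gradients, and optimizer states each take parameter count × bytes per value (Adam keeps two extra states per parameter), and activations come on top. This matters for large models.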
  • Prefer reproducible baselines and ablations over one-off heroic architectures.

Parameter count sketch. Dense layer: d_in * d_out + d_out (weights plus biases). Conv2d layer: k_h * k_w * c_in * c_out + c_out. Sum across layers for a ballpark.
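
The counting rule above can be written as a few small helpers. This is a minimal sketch in plain Python; the function names (dense_params, conv2d_params, mlp_params) are illustrative and not part of any library.

```python
def dense_params(d_in, d_out, bias=True):
    """Weights (d_in * d_out) plus one bias per output unit."""
    return d_in * d_out + (d_out if bias else 0)


def conv2d_params(c_in, c_out, k_h, k_w, bias=True):
    """One k_h x k_w kernel per (input channel, output channel) pair, plus biases."""
    return k_h * k_w * c_in * c_out + (c_out if bias else 0)


def mlp_params(sizes):
    """Total for a stack of dense layers; sizes = [d_in, hidden..., d_out]."""
    return sum(dense_params(a, b) for a, b in zip(sizes, sizes[1:]))


print(mlp_params([100, 256, 10]))       # 28426
print(mlp_params([100, 128, 128, 10]))  # 30730
print(conv2d_params(3, 64, 3, 3))       # 1792
```

Note that the deeper, narrower stack ends up with slightly more parameters than the shallow, wide one, so depth and parameter count are separate knobs.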

Example: Compare Two MLP Heads

Shallow wide vs narrower deep
import torch.nn as nn

# Shallow & wide
shallow = nn.Sequential(
    nn.Linear(100, 256), nn.ReLU(),
    nn.Linear(256, 10),
)

# Deeper & narrower
deep = nn.Sequential(
    nn.Linear(100, 128), nn.ReLU(),
    nn.Linear(128, 128), nn.ReLU(),
    nn.Linear(128, 10),
)

def nparams(m):
    return sum(p.numel() for p in m.parameters())

print("shallow params:", nparams(shallow))
print("deep params:  ", nparams(deep))

Weight Initialization

Why Initialization Scale Matters

Each layer applies a weighted sum of inputs. If weights are too large, linear pre-activations grow with depth; sigmoids saturate, ReLUs fire aggressively, and gradients can explode or vanish depending on the activation’s derivative. If weights are too small, signals decay layer by layer and the network barely learns—gradients vanish because downstream units see tiny variations.

Initialization schemes aim for unit variance (order 1) of pre-activations at the beginning of training, under simplifying assumptions about input distribution and linearity. They do not replace training; they put parameters in a reasonable part of parameter space so SGD/Adam can work without extreme learning-rate hacks.
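
To make the effect of scale concrete, here is a small simulation in plain Python. It is only a sketch under simplifying assumptions: bias-free dense layers, ReLU activations, a single random input, and arbitrary choices of width, depth, and seed. The helper final_rms is hypothetical and exists only for this illustration.

```python
import math
import random


def final_rms(weight_std_fn, width=64, depth=10, seed=0):
    """Push one random input through `depth` ReLU layers; return the RMS of the last activations."""
    rng = random.Random(seed)
    h = [rng.gauss(0.0, 1.0) for _ in range(width)]
    std = weight_std_fn(width)
    for _ in range(depth):
        # One dense layer without bias: z_j = sum_i w_ji * h_i, then ReLU.
        z = [sum(rng.gauss(0.0, std) * h_i for h_i in h) for _ in range(width)]
        h = [max(0.0, z_j) for z_j in z]
    return math.sqrt(sum(v * v for v in h) / width)


schemes = {
    "too small (std=0.01)": lambda fan_in: 0.01,
    "unscaled (std=1)": lambda fan_in: 1.0,
    "He (std=sqrt(2/fan_in))": lambda fan_in: math.sqrt(2.0 / fan_in),
}
for name, fn in schemes.items():
    print(f"{name:26s} final RMS = {final_rms(fn):.3e}")
```

With these settings the small-weight signal should shrink by many orders of magnitude, the unscaled one should grow by many orders of magnitude, and the width-scaled one should stay near order 1.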

Biases. Often initialized to zero (or small constants). The main subtlety is weights; biases shift decision thresholds and are usually less sensitive.

Xavier / Glorot Initialization

Glorot initialization (often called Xavier) was designed for linear layers followed by symmetric activations such as tanh (and, historically, sigmoid). It sets the weight variance to 2/(fan_in + fan_out), a compromise between 1/fan_in, which preserves activation variance in the forward pass, and 1/fan_out, which preserves gradient variance in the backward pass. The normal variant uses standard deviation √(2/(fan_in + fan_out)); the uniform variant draws from [−a, a] with a = √(6/(fan_in + fan_out)), which has the same variance. Both results hold under a linear approximation of the activation.

PyTorch’s torch.nn.init.xavier_uniform_ and xavier_normal_ implement these rules. For sigmoid/tanh MLPs without batch norm, Xavier remains a standard teaching reference—though ReLU-dominated vision models more often use He.

He Initialization (ReLU)

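He (Kaiming) initialization accounts for ReLU zeroing roughly half of the pre-activations: it sets the weight variance to 2/fan_in so that the second moment of activations after ReLU stays roughly constant from layer to layer. PyTorch exposes it as kaiming_uniform_ and kaiming_normal_. The default reset for Linear and Conv2d modules also calls kaiming_uniform_, but with a=√5, which works out to a uniform range of ±1/√fan_in rather than the ReLU gain of √2, so call the ReLU variant explicitly if you want it.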

If you stack many ReLU layers without normalization, He init plus reasonable learning rate is a better starting point than Xavier, which assumed different activation statistics.
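
The scales involved are easy to compute by hand. The sketch below uses the standard Glorot and He formulas (variance 2/(fan_in + fan_out) and 2/fan_in respectively); the helper names are illustrative, not PyTorch functions.

```python
import math


def glorot_normal_std(fan_in, fan_out):
    return math.sqrt(2.0 / (fan_in + fan_out))


def glorot_uniform_bound(fan_in, fan_out):
    # Uniform(-a, a) has variance a^2 / 3, so a = sqrt(3) * std.
    return math.sqrt(6.0 / (fan_in + fan_out))


def he_normal_std(fan_in):
    return math.sqrt(2.0 / fan_in)


def he_uniform_bound(fan_in):
    return math.sqrt(6.0 / fan_in)


fan_in, fan_out = 256, 128  # e.g. the shape of nn.Linear(256, 128)
print(glorot_uniform_bound(fan_in, fan_out))         # 0.125
print(round(glorot_normal_std(fan_in, fan_out), 4))  # 0.0722
print(round(he_normal_std(fan_in), 4))               # 0.0884
```

For the same layer, He gives a somewhat larger standard deviation than Glorot, which compensates for ReLU discarding the negative half of the pre-activations.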

What Usually Fails

All zeros for weights: symmetry breaking disappears; hidden units in a layer compute identical outputs and receive identical gradients, so they never differentiate and learning stalls. The same constant everywhere has the same symmetry problem, and a large constant also saturates sigmoid or tanh units. Unscaled random Normal(0,1) in a 4096-wide layer: pre-activations with standard deviation around √4096 = 64 for unit-scale inputs. Always tie scale to layer width.
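
The symmetry problem can be checked directly. The sketch below fills a tiny network with one constant (0.5, an arbitrary choice) and inspects the gradient of the first layer; the network shape and the random data are made up for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
net = nn.Sequential(nn.Linear(4, 3), nn.Tanh(), nn.Linear(3, 1))

# Give every weight and bias the same constant value (deliberately bad).
for p in net.parameters():
    nn.init.constant_(p, 0.5)

x = torch.randn(8, 4)
y = torch.randn(8, 1)
loss = ((net(x) - y) ** 2).mean()
loss.backward()

grad = net[0].weight.grad  # shape (3, 4): one row per hidden unit
print(grad)
rows_identical = torch.allclose(grad[0], grad[1]) and torch.allclose(grad[1], grad[2])
print("all hidden units get the same gradient:", rows_identical)
```

Because every hidden unit starts identical and receives the same update, the three units stay copies of each other after any number of gradient steps.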

For output layers, sometimes small random or zero-final-layer tricks are used in residual networks or special heads; follow established recipes for the architecture you copy.

PyTorch: Defaults and Overrides

Explicit Kaiming & Xavier
import torch
import torch.nn as nn

m = nn.Linear(256, 128)
# nn.Linear already initializes itself (a Kaiming-uniform variant); override with He normal for ReLU
nn.init.kaiming_normal_(m.weight, nonlinearity='relu')
nn.init.zeros_(m.bias)

# Xavier for tanh-style stack
m2 = nn.Linear(256, 128)
nn.init.xavier_uniform_(m2.weight)

When you define custom modules, call init after creating parameters or register a reset method. Transfer learning skips init for loaded weights—only new heads need it.
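
One common way to do this is a small function passed to Module.apply. The following is a sketch, with init_weights as a hypothetical helper name and an arbitrary two-layer model.

```python
import torch.nn as nn


def init_weights(module):
    """He-normal weights and zero biases for every Linear layer; other modules untouched."""
    if isinstance(module, nn.Linear):
        nn.init.kaiming_normal_(module.weight, nonlinearity="relu")
        if module.bias is not None:
            nn.init.zeros_(module.bias)


model = nn.Sequential(
    nn.Linear(100, 128), nn.ReLU(),
    nn.Linear(128, 10),
)
model.apply(init_weights)  # apply() visits every submodule recursively
```

In a transfer-learning setup you would call apply only on the newly added head, so the loaded backbone weights are left alone.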

Batch Normalization

What Batch Norm Does

For a tensor of activations, BN computes mean and variance across the normalization axes (for fully connected layers, often over the batch dimension; for conv layers, over batch and spatial dims per channel). It then transforms x̂ = (x − μ) / √(σ² + ε) and outputs y = γ x̂ + β. The small ε avoids division by zero.
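
The transform is short enough to write out for a single feature. This is a plain-Python sketch of the formula above, not how a framework implements it; batch_norm_1d is an illustrative name, and γ and β are fixed rather than learned.

```python
import math


def batch_norm_1d(xs, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize one feature across a batch of scalars, then scale and shift."""
    n = len(xs)
    mu = sum(xs) / n
    var = sum((x - mu) ** 2 for x in xs) / n  # biased (divide by n) batch variance
    x_hat = [(x - mu) / math.sqrt(var + eps) for x in xs]
    return [gamma * v + beta for v in x_hat]


batch = [2.0, 4.0, 6.0, 8.0]  # one feature, batch of four examples
out = batch_norm_1d(batch)
print([round(v, 3) for v in out])  # [-1.342, -0.447, 0.447, 1.342]
```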

The original paper motivated BN as reducing internal covariate shift—the change in input distribution to layers as parameters update. Whether that story is the full explanation remains debated; empirically BN often smooths the loss landscape and improves optimization in CNNs.

Training vs Inference

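During training, μ and σ² come from the current batch. During inference, batch statistics would make each prediction depend on the other examples in the batch, and they are noisy or degenerate for small batches (a single example has no batch variance in a fully connected layer). Frameworks therefore maintain exponential moving averages of mean and variance, updated during training, and use those frozen values at test time.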

In PyTorch, call model.eval() before validation or deployment so BatchNorm and Dropout switch behavior. Forgetting this is a classic source of “validation accuracy much worse than training” even when the model is fine.
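
The mode switch can be observed on a single BatchNorm1d layer. The sketch below relies on PyTorch's default momentum of 0.1; the batch shape and the shift and scale applied to the input are arbitrary.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
bn = nn.BatchNorm1d(4)           # running_mean starts at 0, running_var at 1
x = torch.randn(32, 4) * 3 + 5   # batch with mean near 5 and std near 3

bn.train()
y_train = bn(x)                  # normalized with this batch's statistics
print(y_train.mean(dim=0))       # close to 0 per feature
print(bn.running_mean)           # moved 10% of the way toward the batch mean

bn.eval()
running_before = bn.running_mean.clone()
y_eval = bn(x)                   # normalized with the running statistics
print(torch.equal(bn.running_mean, running_before))  # True: eval does not update them
print(y_eval.mean(dim=0))        # far from 0 after a single training step
```

After only one training step the running averages are still far from the data statistics, which is why the eval output is not centered here; over a full training run they converge and the two modes agree much more closely.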

Small batches. With very small batch size, batch statistics are high-variance; consider GroupNorm or LayerNorm in those regimes, or accumulate statistics carefully.

Where to Place BN

Common pattern in CNNs: Conv → BatchNorm → ReLU. For linear layers, Linear → BatchNorm → activation. Some architectures use BN before activation; others after—consistency within a model matters more than dogma, but follow the reference implementation when reproducing papers.

BN interacts with weight decay: some practices decouple BN’s γ, β from L2; PyTorch’s AdamW and parameter groups help you exclude biases and BN affine from decay if desired.
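
A sketch of the parameter-group approach follows. It uses a common heuristic, not an official rule: any parameter with one dimension (biases, BN γ and β) is exempt from decay. The model and hyperparameters are placeholders.

```python
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Conv2d(3, 16, 3, padding=1),
    nn.BatchNorm2d(16),
    nn.ReLU(),
    nn.Flatten(),
    nn.Linear(16 * 8 * 8, 10),
)

decay, no_decay = [], []
for p in model.parameters():
    if not p.requires_grad:
        continue
    # Heuristic: 1-D tensors are biases and BatchNorm gamma/beta; matrices and kernels are weights.
    (no_decay if p.ndim <= 1 else decay).append(p)

optimizer = torch.optim.AdamW(
    [
        {"params": decay, "weight_decay": 1e-2},
        {"params": no_decay, "weight_decay": 0.0},
    ],
    lr=1e-3,
)
print(len(decay), len(no_decay))  # 2 4
```

If your reference implementation applies decay to everything, keep it that way when reproducing its results; the split is a choice, not a requirement.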

PyTorch: BatchNorm1d / BatchNorm2d

MLP and conv blocks
import torch.nn as nn

mlp_block = nn.Sequential(
    nn.Linear(512, 256),
    nn.BatchNorm1d(256),
    nn.ReLU(),
)

conv_block = nn.Sequential(
    nn.Conv2d(3, 64, 3, padding=1),
    nn.BatchNorm2d(64),
    nn.ReLU(),
)

Summary

  • BN normalizes activations per batch (per channel in conv) then applies learnable γ, β.
  • Training uses batch stats; inference uses running averages—toggle with train()/eval().
  • Often improves optimization for CNNs; small-batch settings may prefer GroupNorm/LayerNorm.
  • Placement and weight-decay handling should match your baseline architecture.

Overfitting & Underfitting

Reading the Train–Validation Gap

During training, plot (or log) loss and metrics on a held-out validation set that is not used for gradient updates. If training loss decreases smoothly but validation loss eventually increases, you are likely overfitting: capacity or training time exceeds what the data support without extra regularization. If both curves plateau high, you may need more model capacity, better features, longer training, or a tuned learning rate.

The bias–variance tradeoff is a related story: high bias (underfitting) means the model class cannot fit the signal; high variance (overfitting) means the model is sensitive to training sample noise. Deep nets are flexible enough that variance often dominates unless you use data, regularization, or ensembling.

Common Causes of Overfitting

  • Too few examples for the number of parameters.
  • Noisy or mislabeled training labels.
  • Training too long without early stopping or regularization.
  • Leaking validation into architecture search repeatedly (implicit overfitting to the validation set—use a test set or nested CV for final claims).

Mitigation is rarely one lever: combine more diverse data, augmentation, weight decay, dropout, early stopping, smaller networks, label smoothing, or better priors (architecture suited to the domain).

Practical Mitigations (Overview)

Early stopping halts training when validation metric stops improving—cheap and effective. Data augmentation (flips, crops, noise) artificially expands the training distribution. L2 weight decay penalizes large weights; dropout randomly drops activations during training. Batch norm and larger batches change optimization dynamics and can act like mild regularizers. For classification, label smoothing softens one-hot targets to discourage overconfident logits.
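
Early stopping is usually implemented with a patience counter. The sketch below works on a list of per-epoch validation losses; the function name, the patience value, and the loss history are all made up for illustration.

```python
def early_stopping_epoch(val_losses, patience=3, min_delta=0.0):
    """Return (best_epoch, stop_epoch) for a sequence of per-epoch validation losses."""
    best_loss = float("inf")
    best_epoch = -1
    bad_epochs = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss - min_delta:
            best_loss, best_epoch, bad_epochs = loss, epoch, 0  # improvement: reset the counter
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                return best_epoch, epoch  # stop; restore the checkpoint from best_epoch
    return best_epoch, len(val_losses) - 1


# Made-up validation losses: improve, then drift upward as the model overfits.
history = [0.90, 0.70, 0.55, 0.50, 0.52, 0.51, 0.56, 0.60]
print(early_stopping_epoch(history))  # (3, 6)
```

In a real training loop you would save a checkpoint whenever the best loss improves and reload it after stopping.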

Always monitor a validation curve before trusting leaderboard scores. A model with 99% train accuracy and 70% val accuracy is telling you something explicit.

Summary

  • Overfitting = great train, worse generalization; underfitting = poor train and val.
  • Use a proper validation split and watch the gap over epochs.
  • Fix with data (more, cleaner, augmented), capacity control, and regularization.
  • Next pages dive into dropout and optimizers as part of the toolkit.

Dropout & Regularization

How Dropout Works

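During training, dropout zeroes each activation independently with probability p, so the network cannot rely on any single unit. With inverted dropout, each kept unit is multiplied by 1/(1−p) during training so that the expected activation is unchanged and, at test time, you simply disable dropout without rescaling. Frameworks hide this detail: in PyTorch, nn.Dropout(p) applies inverted dropout in training mode and is the identity in eval mode.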
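
Inverted dropout fits in a few lines of plain Python. This is a sketch of the idea, not PyTorch's implementation: each value is zeroed with probability p and survivors are scaled by 1/(1−p). The function name and the rng argument are illustrative.

```python
import random


def inverted_dropout(xs, p, training, rng):
    """Zero each value with probability p; scale survivors by 1/(1-p). Identity when not training."""
    if not training or p == 0.0:
        return list(xs)
    keep_scale = 1.0 / (1.0 - p)
    return [0.0 if rng.random() < p else x * keep_scale for x in xs]


rng = random.Random(0)
xs = [1.0] * 10_000
dropped = inverted_dropout(xs, p=0.3, training=True, rng=rng)

print(sorted(set(dropped)))         # two values: 0.0 and 1/(1-0.3)
print(sum(dropped) / len(dropped))  # close to 1.0: the expected activation is preserved
print(inverted_dropout([1.0, 2.0], p=0.3, training=False, rng=rng))  # [1.0, 2.0]
```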

Typical values: p = 0.2–0.5 on hidden layers of MLPs; CNNs sometimes use lower rates on conv features or dropout only on fully connected heads. Too much dropout can underfit; too little may not curb overfitting.

Always use model.eval() for inference; otherwise dropout stays on, predictions vary from call to call, and accuracy usually drops.

Placement: MLP vs CNN

In MLPs, dropout after activations (e.g. Linear → ReLU → Dropout) is standard. In CNNs, neighboring activations within a feature map are strongly correlated, so zeroing individual elements removes little information. Spatial dropout (Dropout2d) instead drops entire feature maps (channels) and is often preferable to elementwise dropout on conv layers.
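
The channel-wise behavior of Dropout2d is easy to verify on a tensor of ones. The shapes and the rate p = 0.5 below are arbitrary choices for the sketch.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.ones(2, 8, 4, 4)      # (batch, channels, height, width)

drop2d = nn.Dropout2d(p=0.5)
drop2d.train()
y = drop2d(x)

# Each channel of each example is either entirely 0 or entirely 1/(1-p) = 2.
per_channel_min = y.amin(dim=(2, 3))
per_channel_max = y.amax(dim=(2, 3))
print(torch.equal(per_channel_min, per_channel_max))  # True: no channel is partially dropped
print(per_channel_max)                                # entries are 0.0 or 2.0

drop2d.eval()
print(torch.equal(drop2d(x), x))                      # True: identity at inference
```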

L2 and L1 (Weight Decay)

L2 regularization adds λ‖w‖² to the loss, encouraging smaller weights and smoother functions. In SGD this is equivalent to weight decay on the update (with subtle differences for adaptive optimizers like Adam—AdamW decouples decay properly). L1 adds λ‖w‖₁ and can drive some weights exactly to zero, promoting sparsity; it is less dominant in standard deep CNN training than L2 but appears in structured pruning and interpretability settings.
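
For plain SGD the equivalence is a one-line rearrangement: with penalty λw², the step w − η(g + 2λw) equals (1 − 2ηλ)w − ηg. The scalar sketch below checks this numerically; the values of w, the gradient, the learning rate, and λ are arbitrary.

```python
def sgd_step_l2(w, grad, lr, lam):
    """SGD on loss + lam * w**2: the penalty contributes 2 * lam * w to the gradient."""
    return w - lr * (grad + 2.0 * lam * w)


def sgd_step_weight_decay(w, grad, lr, lam):
    """Shrink the weight first, then take the plain gradient step."""
    return (1.0 - 2.0 * lr * lam) * w - lr * grad


w, grad, lr, lam = 0.8, 0.25, 0.1, 0.01
a = sgd_step_l2(w, grad, lr, lam)
b = sgd_step_weight_decay(w, grad, lr, lam)
print(a, b)  # both approximately 0.7734
```

With Adam the two are no longer the same, because the penalty gradient is rescaled by the adaptive step size; AdamW applies the shrink step separately instead.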

PyTorch Example

Dropout in a small MLP
import torch.nn as nn

class SmallMLP(nn.Module):
    def __init__(self, d_in, d_hidden, d_out, p_drop=0.3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_in, d_hidden),
            nn.ReLU(),
            nn.Dropout(p_drop),
            nn.Linear(d_hidden, d_out),
        )

    def forward(self, x):
        return self.net(x)

Optimizer with decoupled weight decay: torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-2)

Summary

  • Dropout = random zeroing during training, with kept units scaled by 1/(1−p); disabled at inference with no rescaling needed.
  • Use train()/eval() consistently with BatchNorm and Dropout.
  • Conv nets often prefer Dropout2d on feature maps.
  • L2 weight decay (and AdamW) complements dropout for generalization.