Network Design & Regularization

Network Design

Capacity: What Can the Model Represent?

Roughly, more parameters and more nonlinear layers increase the set of functions you can approximate, up to limits imposed by architecture (e.g. a linear model stays linear). But bigger capacity without enough data or regularization invites overfitting: excellent training loss, poor generalization. Too little capacity yields underfitting: the model cannot reduce training error enough.

Validation curves guide you: if train and validation errors are both high, increase capacity or train longer; if train is low but validation is high, add data, regularization, or reduce capacity. Modern practice often starts with an established baseline architecture for the domain (ResNet-style CNNs, Transformer blocks) and adjusts width/depth to match GPU memory and dataset size.

Depth vs Width

Width (many neurons per layer) increases representational power in a single "slice" of computation; very wide shallow nets can approximate many functions. Depth (many layers) enables hierarchical composition: early layers can build simple features, later layers combine them. Empirically, depth helps in vision and language when paired with skip connections and normalization.

There is no universal formula. Rules of thumb for MLPs on tabular data might start with one or two hidden layers of moderate width and grow until validation metrics plateau. For images, convolutional locality and weight sharing are usually more parameter-efficient than gigantic fully connected stacks.

Universal Approximation (What It Does and Does Not Say)

The universal approximation theorem (informally) says that an MLP with a single hidden layer and enough nonlinear units can approximate continuous functions on compact domains arbitrarily well. This is an existence result: it does not tell you how many units you need, how to train them, or that one hidden layer is optimal in practice.

Deep networks often achieve the same accuracy with fewer parameters than a single enormous hidden layer would require. So UAT motivates why neural nets are reasonable function approximators, not why your specific 3-layer net will converge on Tuesday.

Inductive Bias: Architecture Encodes Assumptions

CNNs assume translation equivariance and local structure; they share weights across space. RNNs assume sequential structure with a hidden state. Transformers assume flexible pairwise interactions mediated by attention. An MLP assumes no structure among its inputs: the first layer mixes every input dimension with every other, which is fine for some tabular data but wasteful for images, where nearby pixels correlate.

Choosing the right inductive bias reduces sample complexity: the model searches a smaller, more relevant hypothesis space. When in doubt, copy a proven blueprint for your modality and modify incrementally.

Practical Heuristics

Match output dimension and activation to the loss (e.g. K logits + cross-entropy).
Increase width before extreme depth if optimization is unstable; add normalization (BatchNorm, LayerNorm) for deep stacks.
Use skip connections (ResNet) when adding depth to help gradient flow.
Profile memory: parameter count × bytes per value × number of stored copies (weights, gradients, optimizer states) matters for large models.
Prefer reproducible baselines and ablations over one-off heroic architectures.

Parameter count sketch. Dense layer: d_in * d_out + d_out. Sum across layers for a ballpark; a conv layer has k_h * k_w * c_in * c_out weights plus c_out biases.
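The counting rules above can be checked with a few lines of arithmetic. This is a minimal sketch: the helper names and layer sizes are illustrative, and the memory estimate assumes fp32 values and an Adam-like optimizer with two state tensors per parameter, ignoring activations.

```python
def dense_params(d_in, d_out, bias=True):
    """Weights plus optional biases of a fully connected layer."""
    return d_in * d_out + (d_out if bias else 0)


def conv2d_params(c_in, c_out, k_h, k_w, bias=True):
    """One k_h x k_w kernel per (input channel, output channel) pair."""
    return k_h * k_w * c_in * c_out + (c_out if bias else 0)


def training_memory_mib(n_params, bytes_per_value=4, optimizer_states=2):
    """Rough memory for weights, gradients, and optimizer state copies.

    Assumes fp32 values and an Adam-like optimizer (two state tensors per
    parameter); activations are ignored.
    """
    copies = 1 + 1 + optimizer_states  # weights + gradients + optimizer states
    return n_params * bytes_per_value * copies / 2**20


shallow_total = dense_params(100, 256) + dense_params(256, 10)
deep_total = dense_params(100, 128) + dense_params(128, 128) + dense_params(128, 10)
conv_total = conv2d_params(3, 64, 3, 3)

print(shallow_total, deep_total, conv_total)
print(round(training_memory_mib(shallow_total), 3), "MiB")
```

For real models the activation memory usually dominates at large batch sizes, so treat this as a lower bound rather than a budget.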

Example: Compare Two MLP Heads

Shallow wide vs narrower deep

import torch.nn as nn

# Shallow & wide
shallow = nn.Sequential(
    nn.Linear(100, 256), nn.ReLU(),
    nn.Linear(256, 10),
)

# Deeper & narrower
deep = nn.Sequential(
    nn.Linear(100, 128), nn.ReLU(),
    nn.Linear(128, 128), nn.ReLU(),
    nn.Linear(128, 10),
)

def nparams(m):
    return sum(p.numel() for p in m.parameters())

print("shallow params:", nparams(shallow))
print("deep params:  ", nparams(deep))

Weight Initialization

Why Initialization Scale Matters

Each layer applies a weighted sum of inputs. If weights are too large, linear pre-activations grow with depth; sigmoids saturate, ReLUs fire aggressively, and gradients can explode or vanish depending on the activation's derivative. If weights are too small, signals decay layer by layer and the network barely learns: gradients vanish because downstream units see tiny variations.

Initialization schemes aim for unit variance (order 1) of pre-activations at the beginning of training, under simplifying assumptions about input distribution and linearity. They do not replace training; they put parameters in a reasonable part of parameter space so SGD/Adam can work without extreme learning-rate hacks.

Biases. Often initialized to zero (or small constants). The main subtlety is weights; biases shift decision thresholds and are usually less sensitive.

Xavier / Glorot Initialization

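Glorot initialization (often called Xavier) was designed for linear layers followed by zero-centered saturating activations such as tanh (and, historically, sigmoid). Weights are drawn with variance 2/(fan_in + fan_out), i.e. the reciprocal of the average of fan-in and fan-out: the normal variant uses standard deviation √(2/(fan_in + fan_out)), and the uniform variant samples from [−a, a] with a = √(6/(fan_in + fan_out)). When fan-in and fan-out are equal this reduces to the familiar 1/√n scale. The goal is to preserve the variance of activations in the forward pass and the variance of gradients in the backward pass under a linear approximation.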
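As a worked example, the Glorot scale for a 256-to-128 layer can be computed directly. The helper names are illustrative; the formulas are the standard ones, with variance 2/(fan_in + fan_out) for both variants.

```python
import math


def glorot_uniform_bound(fan_in, fan_out, gain=1.0):
    """Half-width a of the Uniform(-a, a) range."""
    return gain * math.sqrt(6.0 / (fan_in + fan_out))


def glorot_normal_std(fan_in, fan_out, gain=1.0):
    """Standard deviation of the normal variant."""
    return gain * math.sqrt(2.0 / (fan_in + fan_out))


bound = glorot_uniform_bound(256, 128)
std = glorot_normal_std(256, 128)
# Uniform(-a, a) has variance a^2 / 3, so both variants share one variance.
print(bound, bound**2 / 3, std**2)
```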

PyTorch's torch.nn.init.xavier_uniform_ and xavier_normal_ implement these rules. For sigmoid/tanh MLPs without batch norm, Xavier remains a standard teaching reference, though ReLU-dominated vision models more often use He.

He Initialization (ReLU)

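He initialization accounts for ReLU zeroing roughly half of its inputs: weight variance is set to 2/fan_in (for both the normal and uniform variants) so that the expected variance of activations after ReLU stays in a sensible range. PyTorch exposes it as kaiming_uniform_ and kaiming_normal_. The default reset for Linear and Conv2d also calls kaiming_uniform_, but with a = √5, which works out to a uniform range of ±1/√fan_in, a smaller scale than the ReLU-gain variant described here; set it explicitly if you want true He scaling.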

If you stack many ReLU layers without normalization, He init plus reasonable learning rate is a better starting point than Xavier, which assumed different activation statistics.
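The difference is easy to see numerically. The sketch below pushes random inputs through a stack of bias-free Linear+ReLU layers in NumPy and reports the mean squared activation at the end. Width, depth, and batch size are arbitrary choices, and the Glorot case assumes fan_in equals fan_out, so its weight variance is 1/fan_in.

```python
import numpy as np


def final_mean_square(weight_var_scale, width=1024, depth=20, batch=256, seed=0):
    """Push random inputs through `depth` Linear+ReLU layers (no biases).

    Weights are drawn from Normal(0, weight_var_scale / width).
    Returns the mean squared activation after the last layer.
    """
    rng = np.random.default_rng(seed)
    h = rng.standard_normal((batch, width))
    for _ in range(depth):
        w = rng.standard_normal((width, width)) * np.sqrt(weight_var_scale / width)
        h = np.maximum(h @ w, 0.0)  # ReLU
    return float(np.mean(h**2))


he = final_mean_square(2.0)      # He: variance 2 / fan_in
glorot = final_mean_square(1.0)  # Glorot with fan_in == fan_out: variance 1 / fan_in
print(f"He: {he:.3g}   Glorot: {glorot:.3g}")
```

With He scaling the signal stays near its input magnitude, while the 1/fan_in scale loses roughly a factor of two per ReLU layer, which compounds quickly with depth.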

What Usually Fails

All-zero weights: symmetry is never broken, so hidden units in a layer behave identically, their gradients are tied, and learning stalls. The same large constant everywhere: the same symmetry problem plus saturation. Unscaled random Normal(0,1) in a 4096-wide layer: enormous pre-activations (standard deviation around √4096 = 64 for unit-variance inputs). Always tie scale to layer width.
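The symmetry problem can be verified by hand on a tiny network. This sketch uses NumPy with a manual backward pass for a 4-3-1 tanh network and a squared-error loss. The sizes and the constant 0.5 are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(4)   # one input example
target = 1.0

W1 = np.full((3, 4), 0.5)    # hidden layer: every weight is the same constant
w2 = np.full(3, 0.5)         # output layer: same constant again

h = np.tanh(W1 @ x)          # hidden activations, identical across units
y = w2 @ h                   # scalar prediction
err = y - target             # d(loss)/dy for loss = 0.5 * (y - target)^2

# Backpropagate by hand through the output layer and the tanh.
grad_h = err * w2
grad_W1 = (grad_h * (1.0 - h**2))[:, None] * x[None, :]

print(grad_W1)
# All rows match, so a gradient step keeps the hidden units identical.
print(np.allclose(grad_W1, grad_W1[0]))
```

Because every hidden unit receives the same update, the layer keeps behaving like a single unit no matter how long it trains.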

For output layers, sometimes small random or zero-final-layer tricks are used in residual networks or special heads; follow established recipes for the architecture you copy.

PyTorch: Defaults and Overrides

Explicit Kaiming & Xavier

import torch
import torch.nn as nn

m = nn.Linear(256, 128)
# nn.Linear's own default is kaiming_uniform_ with a=sqrt(5); override it for a ReLU stack
nn.init.kaiming_normal_(m.weight, nonlinearity='relu')
nn.init.zeros_(m.bias)

# Xavier for tanh-style stack
m2 = nn.Linear(256, 128)
nn.init.xavier_uniform_(m2.weight)

When you define custom modules, call init after creating parameters or register a reset method. Transfer learning skips init for loaded weights; only new heads need it.
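One common way to apply an init rule to a whole model is nn.Module.apply, which calls a function on every submodule. The function name init_weights and the layer sizes below are illustrative, and the choice of He-normal weights with zero biases is an assumption that suits ReLU stacks.

```python
import torch.nn as nn


def init_weights(module):
    """He-normal weights and zero biases for every Linear layer."""
    if isinstance(module, nn.Linear):
        nn.init.kaiming_normal_(module.weight, nonlinearity="relu")
        if module.bias is not None:
            nn.init.zeros_(module.bias)


model = nn.Sequential(
    nn.Linear(64, 32), nn.ReLU(),
    nn.Linear(32, 4),
)
model.apply(init_weights)  # apply() visits every submodule, then the model itself
```

For transfer learning, the same function can be applied to just the new head (for example `head.apply(init_weights)`, where `head` is whatever module you added) so that loaded weights stay untouched.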

Batch Normalization

What Batch Norm Does

For a tensor of activations, BN computes mean and variance across the normalization axes (for fully connected layers, often over the batch dimension; for conv layers, over batch and spatial dims per channel). It then transforms x̂ = (x − μ) / √(σ² + ε) and outputs y = γ x̂ + β. The small ε avoids division by zero.
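The training-mode transform for a fully connected layer fits in a few lines of NumPy. This is a sketch of the formula only: the function name is illustrative, and running statistics and gradients are left out.

```python
import numpy as np


def batch_norm_train(x, gamma, beta, eps=1e-5):
    """Normalize a (batch, features) array per feature, then scale and shift."""
    mu = x.mean(axis=0)                     # per-feature batch mean
    var = x.var(axis=0)                     # per-feature (biased) batch variance
    x_hat = (x - mu) / np.sqrt(var + eps)
    return gamma * x_hat + beta


rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=5.0, size=(32, 4))
y = batch_norm_train(x, gamma=np.ones(4), beta=np.zeros(4))
print(y.mean(axis=0).round(6), y.std(axis=0).round(3))
```

With γ = 1 and β = 0 each output feature has mean close to 0 and standard deviation close to 1, regardless of the input's offset and scale.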

The original paper motivated BN as reducing internal covariate shift, the change in input distribution to layers as parameters update. Whether that story is the full explanation remains debated; empirically BN often smooths the loss landscape and improves optimization in CNNs.

Training vs Inference

During training, μ and σ² come from the current batch. During inference, batch statistics would be noisy for batch size 1; frameworks maintain exponential moving averages of mean and variance updated during training and use those frozen values at test time.
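The two modes can be sketched as a small NumPy class. The class name is hypothetical and γ, β are omitted for brevity. The update rule (momentum 0.1, unbiased variance for the running estimate) is modeled on the convention PyTorch documents for its BatchNorm layers, not taken from the text above.

```python
import numpy as np


class RunningBatchNorm:
    """Batch norm for (batch, features) inputs with tracked running statistics."""

    def __init__(self, num_features, momentum=0.1, eps=1e-5):
        self.momentum, self.eps = momentum, eps
        self.running_mean = np.zeros(num_features)
        self.running_var = np.ones(num_features)
        self.training = True

    def __call__(self, x):
        if self.training:
            mu, var = x.mean(axis=0), x.var(axis=0)
            m = self.momentum
            # Exponential moving average of the batch statistics.
            self.running_mean = (1 - m) * self.running_mean + m * mu
            self.running_var = (1 - m) * self.running_var + m * x.var(axis=0, ddof=1)
        else:
            mu, var = self.running_mean, self.running_var  # frozen at inference
        return (x - mu) / np.sqrt(var + self.eps)


rng = np.random.default_rng(0)
bn = RunningBatchNorm(3)
for _ in range(200):                       # "training": statistics are updated
    bn(rng.normal(2.0, 3.0, size=(64, 3)))

bn.training = False                        # "eval": statistics are only read
single = bn(np.full((1, 3), 2.0))          # batch of one works fine
print(bn.running_mean.round(2), bn.running_var.round(2), single.round(2))
```

The running estimates settle near the data's true mean and variance, which is why a single example can be normalized sensibly at test time.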

In PyTorch, call model.eval() before validation or deployment so BatchNorm and Dropout switch behavior. Forgetting this is a classic source of "validation accuracy much worse than training" even when the model is fine.

Small batches. With very small batch size, batch statistics are high-variance; consider GroupNorm or LayerNorm in those regimes, or accumulate statistics carefully.

Where to Place BN

Common pattern in CNNs: Conv → BatchNorm → ReLU. For linear layers, Linear → BatchNorm → activation. Some architectures place BN before the activation, others after; consistency within a model matters more than dogma, but follow the reference implementation when reproducing papers.

BN interacts with weight decay: some practices decouple BN's γ, β from L2; PyTorch's AdamW and parameter groups help you exclude biases and BN affine from decay if desired.
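A minimal sketch of that exclusion with parameter groups follows. The toy model is illustrative, and the rule "every 1-D tensor is a bias or a normalization parameter" is a common heuristic assumed here, not something PyTorch enforces.

```python
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Conv2d(3, 16, 3, padding=1),
    nn.BatchNorm2d(16),
    nn.ReLU(),
    nn.Flatten(),
    nn.Linear(16 * 8 * 8, 10),
)

decay, no_decay = [], []
for name, param in model.named_parameters():
    # Heuristic (an assumption, not a PyTorch rule): 1-D tensors are biases
    # or normalization scales/shifts, so they are excluded from decay.
    (no_decay if param.ndim <= 1 else decay).append(param)

optimizer = torch.optim.AdamW(
    [
        {"params": decay, "weight_decay": 1e-2},
        {"params": no_decay, "weight_decay": 0.0},
    ],
    lr=1e-3,
)
print(len(decay), "decayed tensors;", len(no_decay), "exempt tensors")
```

If your architecture has 1-D parameters that should be decayed, filter by module type or parameter name instead of by dimensionality.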

PyTorch: `BatchNorm1d` / `BatchNorm2d`

MLP and conv blocks

import torch.nn as nn

mlp_block = nn.Sequential(
    nn.Linear(512, 256),
    nn.BatchNorm1d(256),
    nn.ReLU(),
)

conv_block = nn.Sequential(
    nn.Conv2d(3, 64, 3, padding=1),
    nn.BatchNorm2d(64),
    nn.ReLU(),
)

Summary

BN normalizes activations per batch (per channel in conv) then applies learnable γ, β.
Training uses batch stats; inference uses running averages. Toggle with train()/eval().
Often improves optimization for CNNs; small-batch settings may prefer GroupNorm/LayerNorm.
Placement and weight-decay handling should match your baseline architecture.

Overfitting & Underfitting

Reading the Train–Validation Gap

During training, plot (or log) loss and metrics on a held-out validation set that is not used for gradient updates. If training loss decreases smoothly but validation loss eventually increases, you are likely overfitting: capacity or training time exceeds what the data support without extra regularization. If both curves plateau high, you may need more model capacity, better features, longer training, or a tuned learning rate.

The bias–variance tradeoff is a related story: high bias (underfitting) means the model class cannot fit the signal; high variance (overfitting) means the model is sensitive to training sample noise. Deep nets are flexible enough that variance often dominates unless you use data, regularization, or ensembling.

Common Causes of Overfitting

Too few examples for the number of parameters.
Noisy or mislabeled training labels.
Training too long without early stopping or regularization.
Leaking validation into architecture search repeatedly (implicit overfitting to the validation set; use a test set or nested CV for final claims).

Mitigation is rarely one lever: combine more diverse data, augmentation, weight decay, dropout, early stopping, smaller networks, label smoothing, or better priors (architecture suited to the domain).

Practical Mitigations (Overview)

Early stopping halts training when the validation metric stops improving; it is cheap and effective. Data augmentation (flips, crops, noise) artificially expands the training distribution. L2 weight decay penalizes large weights; dropout randomly drops activations during training. Batch norm and batch size also change optimization dynamics: the noise from batch statistics and from smaller batches can act like a mild regularizer. For classification, label smoothing softens one-hot targets to discourage overconfident logits.
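Early stopping needs very little code. The sketch below is framework-independent: the class name, the patience of 3 epochs, and the validation losses are all made up for illustration.

```python
class EarlyStopping:
    """Signal a stop once the validation loss has not improved for `patience` epochs."""

    def __init__(self, patience=3, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss):
        """Record one epoch's validation loss; return True when training should stop."""
        if val_loss < self.best - self.min_delta:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


# Made-up validation losses: improvement, then a steady rise.
val_losses = [0.90, 0.70, 0.60, 0.58, 0.59, 0.61, 0.64, 0.70]
stopper = EarlyStopping(patience=3)
stopped_at = None
for epoch, loss in enumerate(val_losses):
    if stopper.step(loss):
        stopped_at = epoch
        break

print(stopped_at, stopper.best)
```

In a real training loop you would typically also save a checkpoint whenever the best loss improves and restore it after stopping.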

Always monitor a validation curve before trusting leaderboard scores. A model with 99% train accuracy and 70% val accuracy is telling you something explicit.

Summary

Overfitting = great train, worse generalization; underfitting = poor train and val.
Use a proper validation split and watch the gap over epochs.
Fix with data (more, cleaner, augmented), capacity control, and regularization.
Next pages dive into dropout and optimizers as part of the toolkit.

Dropout & Regularization

How Dropout Works

With dropout probability p, each unit is zeroed independently with probability p during training. Each kept unit is then usually multiplied by 1/(1−p) (inverted dropout) so that the expected activation is unchanged and at test time you simply disable dropout without rescaling. Frameworks hide this detail: in PyTorch, nn.Dropout(p) applies inverted dropout in training mode.
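Inverted dropout is short enough to write out in NumPy. This is a sketch of the idea, not PyTorch's implementation; the function name and the explicit `training` flag are illustrative.

```python
import numpy as np


def inverted_dropout(x, p, training, rng):
    """Zero each element with probability p and rescale survivors by 1 / (1 - p)."""
    if not training or p == 0.0:
        return x                      # inference: identity, no rescaling needed
    keep = rng.random(x.shape) >= p   # True with probability 1 - p
    return x * keep / (1.0 - p)


rng = np.random.default_rng(0)
x = np.ones((1000, 100))
train_out = inverted_dropout(x, p=0.3, training=True, rng=rng)
eval_out = inverted_dropout(x, p=0.3, training=False, rng=rng)

print(train_out.mean(), eval_out.mean())
```

The training-mode mean stays close to the input mean because the survivors are scaled up, which is what lets inference skip any correction.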

Typical values: p = 0.2–0.5 on hidden layers of MLPs; CNNs sometimes use lower rates on conv features or dropout only on fully connected heads. Too much dropout can underfit; too little may not curb overfitting.

Always use model.eval() for inference; otherwise dropout stays on and predictions become stochastic and usually less accurate.

Placement: MLP vs CNN

In MLPs, dropout after activations (e.g. Linear → ReLU → Dropout) is standard. In CNNs, spatial dropout (Dropout2d) drops entire feature maps; because neighboring pixels are strongly correlated, elementwise masks leave most of the information intact, so dropping whole channels is often preferable on conv layers.
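The idea behind spatial dropout is a mask with one entry per (sample, channel) that broadcasts over height and width. The NumPy sketch below illustrates that idea with a hypothetical function name. It is not the Dropout2d implementation and covers training mode only.

```python
import numpy as np


def spatial_dropout(x, p, rng):
    """Drop whole channels of an (N, C, H, W) array; training-mode behavior only."""
    n, c = x.shape[:2]
    keep = rng.random((n, c, 1, 1)) >= p   # one decision per (sample, channel)
    return x * keep / (1.0 - p)            # mask broadcasts over H and W


rng = np.random.default_rng(0)
x = np.ones((2, 8, 4, 4))
out = spatial_dropout(x, p=0.5, rng=rng)

# Each feature map is either entirely zero or entirely rescaled.
per_map = out.reshape(2, 8, -1)
print((per_map.min(axis=-1) == per_map.max(axis=-1)).all())
```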

L2 and L1 (Weight Decay)

L2 regularization adds λ‖w‖² to the loss, encouraging smaller weights and smoother functions. In plain SGD this is equivalent to weight decay on the update; for adaptive optimizers like Adam the two differ, and AdamW applies the decay in decoupled form. L1 adds λ‖w‖₁ and can drive some weights exactly to zero, promoting sparsity; it is less dominant in standard deep CNN training than L2 but appears in structured pruning and interpretability settings.
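The SGD equivalence is a one-line identity that can be checked numerically. In this sketch the gradient is a random stand-in for a real data-loss gradient, and the learning rate and λ are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.standard_normal(5)
grad = rng.standard_normal(5)   # stand-in for the data-loss gradient at w
lr, lam = 0.1, 0.01

# (a) L2 penalty: loss + lam * ||w||^2 adds 2 * lam * w to the gradient.
w_l2 = w - lr * (grad + 2 * lam * w)

# (b) Weight decay: shrink w by a constant factor, then take the plain step.
w_decay = (1 - 2 * lr * lam) * w - lr * grad

print(np.allclose(w_l2, w_decay))
```

With Adam, the penalty term in (a) would be rescaled by the adaptive per-parameter step size, so the two forms no longer match; AdamW applies the shrink in (b) directly.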

PyTorch Example

Dropout in a small MLP

import torch.nn as nn

class SmallMLP(nn.Module):
    def __init__(self, d_in, d_hidden, d_out, p_drop=0.3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_in, d_hidden),
            nn.ReLU(),
            nn.Dropout(p_drop),
            nn.Linear(d_hidden, d_out),
        )

    def forward(self, x):
        return self.net(x)

Optimizer with decoupled weight decay: torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-2)

Summary

Dropout = random zeroing during training with 1/(1−p) rescaling of kept units; disabled at inference with no further scaling.
Use train()/eval() consistently with BatchNorm and Dropout.
Conv nets often prefer Dropout2d on feature maps.
L2 weight decay (and AdamW) complements dropout for generalization.
