
Dropout & Regularization

Dropout randomly sets a fraction of activations to zero during each forward pass in training. That prevents neurons from co-adapting too tightly and approximates averaging many “thinned” sub-networks. At inference, dropout is turned off and activations are typically scaled (or equivalently weights are scaled at train time) so expected magnitudes match training. Alongside dropout, L2 weight decay (ridge) and L1 (lasso, sparsity) penalize weight norms directly in the loss.


How Dropout Works

With dropout probability p, each kept unit is often multiplied by 1/(1−p) during training (inverted dropout) so that at test time you simply disable dropout without rescaling. Frameworks hide this detail: in PyTorch, nn.Dropout(p) applies inverted dropout in training mode.
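
The scaling logic above can be sketched by hand. Below is a minimal, illustrative implementation of inverted dropout (the function name and shapes are ours, not from any library): survivors are scaled by 1/(1−p) during training so the expected activation matches, and inference is a plain identity.

```python
import torch

def inverted_dropout(x, p=0.5, training=True):
    # Zero each unit with probability p and scale survivors by 1/(1-p),
    # so E[output] == input during training; at test time it is the identity.
    if not training or p == 0.0:
        return x
    mask = (torch.rand_like(x) > p).float()
    return x * mask / (1.0 - p)

torch.manual_seed(0)
x = torch.ones(10000)
y = inverted_dropout(x, p=0.5)  # mean of y stays close to 1.0
```

In practice you would never write this yourself—nn.Dropout(p) does exactly this—but it makes clear why no rescaling is needed at test time.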

Typical values: p = 0.2–0.5 on hidden layers of MLPs; CNNs sometimes use lower rates on conv features or dropout only on fully connected heads. Too much dropout can underfit; too little may not curb overfitting.

Always call model.eval() before inference—otherwise dropout stays active and predictions become stochastic, varying from call to call on the same input.
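
A quick sketch of the mode switch, using only the standard nn.Dropout module (the tensor values here are just for illustration):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
drop = nn.Dropout(p=0.5)
x = torch.ones(8)

drop.train()        # training mode: random mask, survivors scaled to 1/(1-p) = 2
a = drop(x)         # each entry is either 0.0 or 2.0

drop.eval()         # inference mode: dropout becomes the identity
c = drop(x)         # equal to x
```

The same train()/eval() toggle also switches BatchNorm between batch statistics and running statistics, so it matters even in models without dropout.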

Placement: MLP vs CNN

In MLPs, dropout after activations (e.g. Linear → ReLU → Dropout) is standard. In CNNs, spatial dropout (Dropout2d) drops entire feature maps so neighboring pixels do not leak information through the mask—often preferable to elementwise dropout on conv layers.
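A small demonstration of the difference: nn.Dropout2d zeroes whole channels rather than individual pixels (the batch and channel sizes below are arbitrary):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
sdrop = nn.Dropout2d(p=0.5)     # spatial dropout: drops entire feature maps
sdrop.train()
x = torch.ones(1, 16, 8, 8)     # (batch, channels, height, width)
y = sdrop(x)

# Every channel is either entirely zero or entirely scaled by 1/(1-p) = 2,
# so no spatial position within a kept map is masked independently.
per_channel = y.view(16, -1)
```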

L2 and L1 (Weight Decay)

L2 regularization adds λ‖w‖² to the loss, encouraging smaller weights and smoother functions. In SGD this is equivalent to weight decay on the update (with subtle differences for adaptive optimizers like Adam—AdamW decouples decay properly). L1 adds λ‖w‖₁ and can drive some weights exactly to zero, promoting sparsity; it is less dominant in standard deep CNN training than L2 but appears in structured pruning and interpretability settings.
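To make the loss term concrete, here is an explicit L2 penalty added by hand (the penalty strength lam and the model shapes are illustrative; in practice you would pass weight_decay to the optimizer instead):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(4, 2)
x, target = torch.randn(8, 4), torch.randn(8, 2)

lam = 1e-2                      # illustrative penalty strength (lambda)
mse = nn.functional.mse_loss(model(x), target)
l2 = sum((p ** 2).sum() for p in model.parameters())  # ||w||^2 over all params
loss = mse + lam * l2           # total loss = data term + L2 penalty
```

Swapping the squared sum for `p.abs().sum()` gives the L1 penalty instead; note that the optimizer's weight_decay argument implements L2-style decay only.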

PyTorch Example

Dropout in a small MLP
import torch.nn as nn

class SmallMLP(nn.Module):
    def __init__(self, d_in, d_hidden, d_out, p_drop=0.3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_in, d_hidden),
            nn.ReLU(),
            nn.Dropout(p_drop),  # active in train() mode, identity in eval() mode
            nn.Linear(d_hidden, d_out),
        )

    def forward(self, x):
        return self.net(x)

Optimizer with decoupled weight decay: torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-2)
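
Putting the pieces together, a minimal training step with AdamW and the correct mode switches might look like this (the toy model and data shapes are ours, for illustration only):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(
    nn.Linear(4, 8), nn.ReLU(), nn.Dropout(0.3), nn.Linear(8, 1)
)
opt = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-2)
x, y = torch.randn(32, 4), torch.randn(32, 1)

model.train()                   # dropout active during the update
loss = nn.functional.mse_loss(model(x), y)
opt.zero_grad()
loss.backward()
opt.step()

model.eval()                    # dropout off: repeated predictions agree
with torch.no_grad():
    p1, p2 = model(x), model(x)
```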

Summary

  • Dropout = random zeroing during training; disabled at inference with correct scaling.
  • Use train()/eval() consistently with BatchNorm and Dropout.
  • Conv nets often prefer Dropout2d on feature maps.
  • L2 weight decay (and AdamW) complements dropout for generalization.

Next: choosing how to update weights—SGD, momentum, Adam, and friends.