Dropout & Regularization
Dropout randomly sets a fraction of activations to zero during each forward pass in training. That prevents neurons from co-adapting too tightly and approximates averaging many "thinned" sub-networks. At inference, dropout is turned off and activations are typically scaled (or equivalently weights are scaled at train time) so expected magnitudes match training. Alongside dropout, L2 weight decay (ridge) and L1 (lasso, sparsity) penalize weight norms directly in the loss.
How Dropout Works
With dropout probability p, each kept unit is often multiplied by 1/(1−p) during training (inverted dropout) so that at test time you simply disable dropout without rescaling. Frameworks hide this detail: in PyTorch, nn.Dropout(p) applies inverted dropout in training mode.
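To make the scaling concrete, here is a minimal sketch of inverted dropout applied to a raw tensor; the function name and the training flag are illustrative, and nn.Dropout does the equivalent internally.

import torch

def inverted_dropout(x, p=0.5, training=True):
    # Training: zero each element with probability p, then scale survivors
    # by 1/(1-p) so the expected activation matches the inference-time value.
    if not training or p == 0.0:
        return x
    mask = (torch.rand_like(x) > p).float()
    return x * mask / (1.0 - p)

x = torch.ones(4, 5)
print(inverted_dropout(x, p=0.5, training=True))   # mix of zeros and 2.0s
print(inverted_dropout(x, p=0.5, training=False))  # returned unchanged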
Typical values: p = 0.2–0.5 on hidden layers of MLPs; CNNs sometimes use lower rates on conv features or dropout only on fully connected heads. Too much dropout can underfit; too little may not curb overfitting.
Remember to call model.eval() before inference; otherwise dropout stays on and predictions remain stochastic, which typically hurts accuracy.
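A quick way to see the difference; the model below is a throwaway example:

import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(10, 10), nn.ReLU(), nn.Dropout(0.5))
x = torch.randn(1, 10)

model.train()               # dropout active: repeated calls give different outputs
print(model(x) - model(x))  # generally nonzero

model.eval()                # dropout disabled: outputs are deterministic
print(model(x) - model(x))  # all zeros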
Placement: MLP vs CNN
In MLPs, dropout after activations (e.g. Linear → ReLU → Dropout) is standard. In CNNs, spatial dropout (Dropout2d) drops entire feature maps so neighboring pixels do not leak information through the mask—often preferable to elementwise dropout on conv layers.
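A sketch contrasting the two placements; the layer sizes, input resolution (32x32 assumed), and dropout rates below are arbitrary choices, not recommendations.

import torch.nn as nn

# Spatial dropout on conv features: randomly zeroes entire channels (feature maps).
conv_block = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.Dropout2d(p=0.1),
)

# Elementwise dropout on the fully connected head (assumes 32x32 inputs).
head = nn.Sequential(
    nn.Flatten(),
    nn.Linear(16 * 32 * 32, 128),
    nn.ReLU(),
    nn.Dropout(p=0.5),
    nn.Linear(128, 10),
)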
L2 and L1 (Weight Decay)
L2 regularization adds λ‖w‖² to the loss, encouraging smaller weights and smoother functions. In SGD this is equivalent to weight decay on the update (with subtle differences for adaptive optimizers like Adam; AdamW decouples the decay properly). L1 adds λ‖w‖₁ and can drive some weights exactly to zero, promoting sparsity; it is less dominant in standard deep CNN training than L2 but appears in structured pruning and interpretability settings.
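In practice, L2 is usually passed to the optimizer as weight_decay, while L1 is simplest to add to the loss by hand. A rough sketch, assuming a generic model and criterion; l1_lambda is an illustrative value, not a recommendation.

import torch

l1_lambda = 1e-5  # illustrative strength, tune per task

def loss_with_l1(model, criterion, outputs, targets):
    # Base loss plus an L1 penalty over all parameters
    # (biases included here for simplicity).
    loss = criterion(outputs, targets)
    l1 = sum(p.abs().sum() for p in model.parameters())
    return loss + l1_lambda * l1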
PyTorch Example
import torch.nn as nn

class SmallMLP(nn.Module):
    def __init__(self, d_in, d_hidden, d_out, p_drop=0.3):
        super().__init__()
        # Dropout is placed after the hidden-layer activation.
        self.net = nn.Sequential(
            nn.Linear(d_in, d_hidden),
            nn.ReLU(),
            nn.Dropout(p_drop),
            nn.Linear(d_hidden, d_out),
        )

    def forward(self, x):
        return self.net(x)
Optimizer with decoupled weight decay: torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-2)
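Putting the pieces together, a minimal single training step with the SmallMLP above; the batch is synthetic and all shapes and hyperparameters are only illustrative.

import torch
import torch.nn as nn

# Synthetic batch, purely for illustration.
xb = torch.randn(32, 20)
yb = torch.randint(0, 2, (32,))

model = SmallMLP(d_in=20, d_hidden=64, d_out=2, p_drop=0.3)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-2)
criterion = nn.CrossEntropyLoss()

model.train()                       # dropout active during the update
optimizer.zero_grad()
loss = criterion(model(xb), yb)
loss.backward()
optimizer.step()

model.eval()                        # dropout off for evaluation
with torch.no_grad():
    preds = model(xb).argmax(dim=1)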
Summary
- Dropout = random zeroing during training; disabled at inference with correct scaling.
- Use train()/eval() consistently with BatchNorm and Dropout.
- Conv nets often prefer Dropout2d on feature maps.
- L2 weight decay (and AdamW) complements dropout for generalization.
Next: choosing how to update weights—SGD, momentum, Adam, and friends.