Neural Network Generalization
The train–validation gap and how to mitigate it

Overfitting & Underfitting

Generalization means your model performs well on new data drawn from the same underlying process—not just on the examples it memorized during training. Overfitting is the classic failure mode where training error keeps dropping while validation error worsens: the model learns idiosyncrasies and noise. Underfitting is the opposite: both training and validation errors stay high because the model is too simple or training is inadequate.

Keywords: validation set, bias–variance, early stopping, more data

Reading the Train–Validation Gap

During training, plot (or log) loss and metrics on a held-out validation set that is not used for gradient updates. If training loss decreases smoothly but validation loss eventually increases, you are likely overfitting: capacity or training time exceeds what the data support without extra regularization. If both curves plateau high, you may need more model capacity, better features, longer training, or a tuned learning rate.
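
As a concrete sketch, the loop below logs train and validation loss once per epoch. PyTorch is assumed here, and model, train_loader, and val_loader are placeholders you would supply.

    import torch

    def fit(model, train_loader, val_loader, epochs=20, lr=1e-3):
        opt = torch.optim.Adam(model.parameters(), lr=lr)
        loss_fn = torch.nn.CrossEntropyLoss()
        history = {"train": [], "val": []}
        for epoch in range(epochs):
            # Training pass: gradients come only from the training set.
            model.train()
            total = 0.0
            for x, y in train_loader:
                opt.zero_grad()
                loss = loss_fn(model(x), y)
                loss.backward()
                opt.step()
                total += loss.item() * len(y)
            history["train"].append(total / len(train_loader.dataset))

            # Validation pass: evaluation only, no gradient updates.
            model.eval()
            total = 0.0
            with torch.no_grad():
                for x, y in val_loader:
                    total += loss_fn(model(x), y).item() * len(y)
            history["val"].append(total / len(val_loader.dataset))
            print(f"epoch {epoch}: train={history['train'][-1]:.4f} "
                  f"val={history['val'][-1]:.4f}")
        return history

Plotting history["train"] against history["val"] over epochs makes the gap, and the point where it starts widening, easy to see.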

The bias–variance tradeoff tells a related story: high bias (underfitting) means the model class cannot capture the signal; high variance (overfitting) means the model is overly sensitive to noise in the training sample. Deep nets are flexible enough that variance often dominates unless you counter it with more data, regularization, or ensembling.

Common Causes of Overfitting

  • Too few examples for the number of parameters.
  • Noisy or mislabeled training data.
  • Training too long without early stopping or regularization.
  • Repeatedly leaking the validation set into architecture and hyperparameter search (implicit overfitting to the validation set; use a held-out test set or nested cross-validation for final claims, as in the split sketch after this list).
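
To guard against that last failure mode, a common pattern is a three-way split: tune against the validation set, and touch the test set only once for the final report. A minimal sketch using scikit-learn's train_test_split; the data here are random placeholders.

    import numpy as np
    from sklearn.model_selection import train_test_split

    X = np.random.randn(1000, 20)             # placeholder features
    y = np.random.randint(0, 2, size=1000)    # placeholder labels

    # Hold out 15% as a final test set, then carve a validation set from the rest.
    X_trainval, X_test, y_trainval, y_test = train_test_split(
        X, y, test_size=0.15, random_state=0)
    X_train, X_val, y_train, y_val = train_test_split(
        X_trainval, y_trainval, test_size=0.15, random_state=0)

    # Tune architectures and hyperparameters against (X_val, y_val);
    # report final numbers exactly once on (X_test, y_test).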

Mitigation is rarely one lever: combine more diverse data, augmentation, weight decay, dropout, early stopping, smaller networks, label smoothing, or better priors (architecture suited to the domain).

Practical Mitigations (Overview)

Early stopping halts training when the validation metric stops improving, which is cheap and effective. Data augmentation (flips, crops, noise) artificially expands the training distribution. L2 weight decay penalizes large weights; dropout randomly drops activations during training. Batch normalization changes optimization dynamics and can act as a mild regularizer, and the gradient noise from smaller batches can have a similar effect. For classification, label smoothing softens one-hot targets to discourage overconfident logits.
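
A sketch of what several of these knobs look like together in PyTorch; the layer sizes and hyperparameter values are illustrative assumptions, not recommendations.

    import torch
    import torch.nn as nn

    model = nn.Sequential(
        nn.Linear(784, 256),
        nn.ReLU(),
        nn.Dropout(p=0.5),      # dropout: randomly zero activations during training
        nn.Linear(256, 10),
    )
    # Label smoothing softens the one-hot targets.
    loss_fn = nn.CrossEntropyLoss(label_smoothing=0.1)
    # weight_decay applies an L2-style penalty (decoupled, in AdamW's case).
    opt = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-2)

Remember that dropout and batch norm behave differently in training and evaluation, so call model.train() and model.eval() at the appropriate times.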

Always monitor the validation curve before trusting leaderboard scores. A model with 99% train accuracy and 70% validation accuracy is telling you plainly that it has overfit.
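
Early stopping itself needs nothing more than a counter over that validation curve. A minimal sketch follows; the class and parameter names are illustrative, not a library API.

    class EarlyStopping:
        """Stop when validation loss has not improved for `patience` epochs."""

        def __init__(self, patience=5, min_delta=0.0):
            self.patience = patience
            self.min_delta = min_delta
            self.best = float("inf")
            self.bad_epochs = 0

        def step(self, val_loss):
            """Return True when training should stop."""
            if val_loss < self.best - self.min_delta:
                self.best = val_loss
                self.bad_epochs = 0
            else:
                self.bad_epochs += 1
            return self.bad_epochs >= self.patience

Each epoch, call step(val_loss) after evaluating on the validation set, break out of the training loop when it returns True, and keep the checkpoint from the best epoch.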

Summary

  • Overfitting = great train, worse generalization; underfitting = poor train and val.
  • Use a proper validation split and watch the gap over epochs.
  • Fix with data (more, cleaner, augmented), capacity control, and regularization.
  • Next pages dive into dropout and optimizers as part of the toolkit.

Next. Dropout and explicit L2/L1 penalties add regularization directly in the forward pass and objective.