Loss Functions
A loss function (or cost function) scores how far your network’s predictions are from the targets you care about. Supervised learning almost always reduces to: pick a model, pick a loss that reflects the task and is convenient to optimize, then adjust parameters so that the average loss on training data goes down—while hoping that validation loss follows.
Why Loss Functions Matter
In theory you might want to minimize expected loss under the real data distribution—but we only see a finite training set. So we minimize the empirical average (possibly with regularization terms added to the objective). The loss is the bridge between “what we want” (correct labels, small errors) and “what gradient descent can use” (a scalar that is smooth enough or subdifferentiable enough to backprop through).
Not every natural metric is a good training loss. Classification accuracy is piecewise constant in the weights: tiny changes rarely flip a discrete decision, so gradients are zero almost everywhere. That is why we train with surrogate losses like cross-entropy that reward moving logits in the right direction even when the predicted class is already correct.
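To make the zero-gradient point concrete, here is a minimal sketch (values are made up for illustration): accuracy computed through an argmax cannot be backpropagated, while cross-entropy on the same logits still yields a useful gradient.
import torch
import torch.nn.functional as F
logits = torch.tensor([[2.0, 0.5, -1.0]], requires_grad=True)
target = torch.tensor([0])
# Accuracy is piecewise constant in the logits: argmax breaks the gradient chain
acc = (logits.argmax(dim=1) == target).float().mean()
print(acc.requires_grad)   # False, so there is nothing to backprop
# Cross-entropy is a smooth surrogate: it still pushes the correct logit up
loss = F.cross_entropy(logits, target)
loss.backward()
print(logits.grad)         # nonzero even though the prediction is already correct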
The right loss encodes assumptions and costs: squared error penalizes large mistakes heavily; absolute error treats outliers more gently; cross-entropy aligns with probabilistic models for class labels. Mismatch between loss and deployment metric (e.g. training with log loss but reporting F1) is normal—you often tune thresholds or auxiliary losses later.
Big picture. Forward pass produces predictions → loss compares them to targets → backward pass computes ∂loss/∂parameters → optimizer updates weights. The next pages cover gradient descent and backpropagation; here we focus on the middle term: choosing which scalar to minimize.
Regression Losses: MSE, MAE, and Robust Variants
For real-valued targets y and predictions ŷ, the mean squared error (MSE) averages squared residuals: (1/N) Σ (yᵢ − ŷᵢ)². Squaring magnifies large errors, so a few bad outliers can dominate the gradient. That is often desirable when noise is Gaussian and large errors are genuinely worse—but it can destabilize training if labels are noisy or heavy-tailed.
Mean absolute error (MAE), the L1 loss, uses |y − ŷ|. It is more robust to outliers and corresponds to the median in simple settings, but its derivative is discontinuous at zero and its gradient has constant magnitude, unlike MSE’s, which scales with the size of the error. Huber loss blends the two: quadratic near zero (smooth optimization) and linear far away (robust tails). Many depth-estimation and detection heads use variants of these ideas.
When outputs are bounded (e.g. probabilities in [0,1]), you might still use MSE on raw outputs, but watch saturation: if the last layer uses sigmoid and targets are probabilities, MSE can work; if targets are arbitrary reals, a linear output head is standard with MSE.
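As a rough illustration of the outlier behaviour, here is a small sketch with made-up numbers (nn.HuberLoss assumes PyTorch 1.9 or newer; older versions offer the similar nn.SmoothL1Loss):
import torch
import torch.nn as nn
# Four predictions vs. targets; the last target is a gross outlier
pred = torch.tensor([2.1, 0.9, 4.2, 3.0])
target = torch.tensor([2.0, 1.0, 4.0, 9.0])
mse = nn.MSELoss()               # squares residuals, so the outlier dominates
mae = nn.L1Loss()                # absolute residuals, outlier counts linearly
huber = nn.HuberLoss(delta=1.0)  # quadratic below delta, linear beyond it
print("MSE:  ", mse(pred, target).item())
print("MAE:  ", mae(pred, target).item())
print("Huber:", huber(pred, target).item())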
Classification: Cross-Entropy and Binary Cross-Entropy
For K mutually exclusive classes, a probabilistic model outputs a distribution p over classes. The cross-entropy between the true distribution q (often one-hot) and the prediction p is −Σₖ qₖ log pₖ. With a one-hot label for class y this reduces to −log p(y), the negative log of the probability assigned to the correct class—heavily penalizing confident wrong answers.
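A quick numeric check of that reduction (the probabilities below are invented for illustration):
import torch
# Predicted distribution over 3 classes; the true class is index 1 (one-hot q)
p = torch.tensor([0.10, 0.70, 0.20])
q = torch.tensor([0.0, 1.0, 0.0])
full = -(q * p.log()).sum()   # −Σₖ qₖ log pₖ over all classes
short = -p[1].log()           # collapses to −log p of the correct class
print(full.item(), short.item())   # both ≈ 0.357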
In practice the network usually emits logits (unnormalized scores) z; softmax turns logits into p. The combination log-softmax + NLL is numerically stable and equivalent to softmax followed by log and cross-entropy. PyTorch’s CrossEntropyLoss expects raw logits of shape (N, K) and integer class indices (N)—it applies log-softmax internally, so you must not apply softmax yourself before this loss.
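The equivalence is easy to verify (a small sketch; shapes and values are arbitrary):
import torch
import torch.nn.functional as F
logits = torch.randn(4, 3)             # (N, K) raw scores, no softmax applied
targets = torch.tensor([2, 0, 1, 1])   # (N,) integer class indices
# CrossEntropyLoss / F.cross_entropy applies log-softmax internally...
loss_a = F.cross_entropy(logits, targets)
# ...which matches log-softmax followed by negative log-likelihood
loss_b = F.nll_loss(F.log_softmax(logits, dim=1), targets)
print(torch.allclose(loss_a, loss_b))  # True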
For binary problems you can use a single logit with sigmoid and binary cross-entropy, or prefer BCEWithLogitsLoss which fuses sigmoid and BCE in a stable way. Multi-label classification (several independent yes/no dimensions) also uses BCE-style losses per label.
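A minimal multi-label sketch (4 samples with 3 independent yes/no labels; the numbers are arbitrary) comparing the fused loss with the two-step version:
import torch
import torch.nn as nn
logits = torch.randn(4, 3)                    # one raw logit per label
targets = torch.tensor([[1., 0., 1.],
                        [0., 0., 1.],
                        [1., 1., 0.],
                        [0., 1., 0.]])        # float targets in {0, 1}
stable = nn.BCEWithLogitsLoss()(logits, targets)        # fused sigmoid + BCE
naive = nn.BCELoss()(torch.sigmoid(logits), targets)    # same math, less stable
print(stable.item(), naive.item())            # nearly identical values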
A common bug is feeding softmax probabilities into CrossEntropyLoss, or applying softmax twice. Read the docstring: logits in, integer targets in, scalar loss out.
PyTorch Examples
import torch
import torch.nn as nn
# Multi-class: logits (N, K), targets (N,) with class indices
ce = nn.CrossEntropyLoss()
logits = torch.randn(8, 5)
targets = torch.tensor([0, 2, 1, 4, 3, 2, 1, 0])
print("CE:", ce(logits, targets).item())
# Regression: predictions and targets same shape
mse = nn.MSELoss()
pred = torch.randn(8, 3)
y = torch.randn(8, 3)
print("MSE:", mse(pred, y).item())
Regularization in the Objective
Weight decay (L2) is often implemented inside the optimizer rather than added explicitly to the loss, but conceptually it is the same as penalizing large weights. That term is not a “loss on labels” but part of the total objective you minimize.
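To make the equivalence concrete, here is a sketch under plain SGD (the 0.5 factor matches the convention of adding weight_decay · w to each gradient; decoupled variants such as AdamW behave differently):
import torch
import torch.nn as nn
model = nn.Linear(10, 1)
x, y = torch.randn(16, 10), torch.randn(16, 1)
wd = 1e-2
# Option A: let the optimizer apply L2 shrinkage via weight_decay
opt = torch.optim.SGD(model.parameters(), lr=0.1, weight_decay=wd)
# Option B: add the penalty to the loss yourself (same gradients under plain SGD)
l2 = sum((p ** 2).sum() for p in model.parameters())
loss = nn.MSELoss()(model(x), y) + 0.5 * wd * l2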
Summary
- Loss = scalar measure of prediction quality; training minimizes its average over data (plus regularization).
- MSE / MAE / Huber dominate regression; choice depends on outlier sensitivity.
- Cross-entropy is the standard classification surrogate; use logits with the appropriate PyTorch loss.
- Align loss with your probabilistic story and double-check tensor shapes and whether softmax is already inside the criterion.