Loss Functions
A loss function (or cost function) scores how far your network’s predictions are from the targets you care about. Supervised learning almost always reduces to: pick a model, pick a loss that reflects the task and is convenient to optimize, then adjust parameters so that the average loss on training data goes down—while hoping that validation loss follows.
Why Loss Functions Matter
In theory you might want to minimize expected loss under the real data distribution—but we only see a finite training set. So we minimize the empirical average (possibly with regularization terms added to the objective). The loss is the bridge between “what we want” (correct labels, small errors) and “what gradient descent can use” (a scalar that is smooth enough or subdifferentiable enough to backprop through).
Not every natural metric is a good training loss. Classification accuracy is piecewise constant in the weights: tiny changes rarely flip a discrete decision, so gradients are zero almost everywhere. That is why we train with surrogate losses like cross-entropy that reward moving logits in the right direction even when the predicted class is already correct.
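To make the zero-gradient point concrete, here is a minimal sketch (values are made up for illustration): accuracy computed through an argmax cannot be backpropagated, while cross-entropy on the same logits still yields a useful gradient.
import torch
import torch.nn.functional as F
logits = torch.tensor([[2.0, 0.5, -1.0]], requires_grad=True)
target = torch.tensor([0])
# Accuracy is piecewise constant in the logits: argmax breaks the gradient chain
acc = (logits.argmax(dim=1) == target).float().mean()
print(acc.requires_grad)   # False, so there is nothing to backprop
# Cross-entropy is a smooth surrogate: it still pushes the correct logit up
loss = F.cross_entropy(logits, target)
loss.backward()
print(logits.grad)         # nonzero even though the prediction is already correct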
The right loss encodes assumptions and costs: squared error penalizes large mistakes heavily; absolute error treats outliers more gently; cross-entropy aligns with probabilistic models for class labels. Mismatch between loss and deployment metric (e.g. training with log loss but reporting F1) is normal—you often tune thresholds or auxiliary losses later.
Big picture. Forward pass produces predictions → loss compares them to targets → backward pass computes ∂loss/∂parameters → optimizer updates weights. The next pages cover gradient descent and backpropagation; here we focus on the middle term: choosing which scalar to minimize.
Regression Losses: MSE, MAE, and Robust Variants
For real-valued targets y and predictions ŷ, the mean squared error (MSE) averages squared residuals: (1/N) Σ (yᵢ − ŷᵢ)². Squaring magnifies large errors, so a few bad outliers can dominate the gradient. That is often desirable when noise is Gaussian and large errors are genuinely worse—but it can destabilize training if labels are noisy or heavy-tailed.
Mean absolute error (MAE), the L1 loss, uses |y − ŷ|. It is more robust to outliers and corresponds to the median in simple settings, but its derivative is discontinuous at zero and its gradient has constant magnitude, unlike MSE’s, which scales with the size of the error. Huber loss blends the two: quadratic near zero (smooth optimization) and linear far away (robust tails). Many depth-estimation and detection heads use variants of these ideas.
When outputs are bounded (e.g. probabilities in [0,1]), you might still use MSE on raw outputs, but watch saturation: if the last layer uses sigmoid and targets are probabilities, MSE can work; if targets are arbitrary reals, a linear output head is standard with MSE.
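As a rough illustration of the outlier behaviour, here is a small sketch with made-up numbers (nn.HuberLoss assumes PyTorch 1.9 or newer; older versions offer the similar nn.SmoothL1Loss):
import torch
import torch.nn as nn
# Four predictions vs. targets; the last target is a gross outlier
pred = torch.tensor([2.1, 0.9, 4.2, 3.0])
target = torch.tensor([2.0, 1.0, 4.0, 9.0])
mse = nn.MSELoss()               # squares residuals, so the outlier dominates
mae = nn.L1Loss()                # absolute residuals, outlier counts linearly
huber = nn.HuberLoss(delta=1.0)  # quadratic below delta, linear beyond it
print("MSE:  ", mse(pred, target).item())
print("MAE:  ", mae(pred, target).item())
print("Huber:", huber(pred, target).item())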
Classification: Cross-Entropy and Binary Cross-Entropy
For K mutually exclusive classes, a probabilistic model outputs a distribution p over classes. The cross-entropy between the true distribution q (often one-hot) and the prediction p is −Σₖ qₖ log pₖ. With a one-hot label for class y this reduces to −log p(y), the negative log of the probability assigned to the correct class—heavily penalizing confident wrong answers.
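A quick numeric check of that reduction (the probabilities below are invented for illustration):
import torch
# Predicted distribution over 3 classes; the true class is index 1 (one-hot q)
p = torch.tensor([0.10, 0.70, 0.20])
q = torch.tensor([0.0, 1.0, 0.0])
full = -(q * p.log()).sum()   # −Σₖ qₖ log pₖ over all classes
short = -p[1].log()           # collapses to −log p of the correct class
print(full.item(), short.item())   # both ≈ 0.357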
In practice the network usually emits logits (unnormalized scores) z; softmax turns logits into p. The combination log-softmax + NLL is numerically stable and equivalent to softmax followed by log and cross-entropy. PyTorch’s CrossEntropyLoss expects raw logits of shape (N, K) and integer class indices (N)—it applies log-softmax internally, so you must not apply softmax yourself before this loss.
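The equivalence is easy to verify (a small sketch; shapes and values are arbitrary):
import torch
import torch.nn.functional as F
logits = torch.randn(4, 3)             # (N, K) raw scores, no softmax applied
targets = torch.tensor([2, 0, 1, 1])   # (N,) integer class indices
# CrossEntropyLoss / F.cross_entropy applies log-softmax internally...
loss_a = F.cross_entropy(logits, targets)
# ...which matches log-softmax followed by negative log-likelihood
loss_b = F.nll_loss(F.log_softmax(logits, dim=1), targets)
print(torch.allclose(loss_a, loss_b))  # True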
For binary problems you can use a single logit with sigmoid and binary cross-entropy, or prefer BCEWithLogitsLoss which fuses sigmoid and BCE in a stable way. Multi-label classification (several independent yes/no dimensions) also uses BCE-style losses per label.
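A minimal multi-label sketch (4 samples with 3 independent yes/no labels; the numbers are arbitrary) comparing the fused loss with the two-step version:
import torch
import torch.nn as nn
logits = torch.randn(4, 3)                    # one raw logit per label
targets = torch.tensor([[1., 0., 1.],
                        [0., 0., 1.],
                        [1., 1., 0.],
                        [0., 1., 0.]])        # float targets in {0, 1}
stable = nn.BCEWithLogitsLoss()(logits, targets)        # fused sigmoid + BCE
naive = nn.BCELoss()(torch.sigmoid(logits), targets)    # same math, less stable
print(stable.item(), naive.item())            # nearly identical values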
A common bug is feeding softmax probabilities into CrossEntropyLoss, or applying softmax twice. Read the docstring: logits in, integer targets in, scalar loss out.
PyTorch Examples
import torch
import torch.nn as nn
# Multi-class: logits (N, K), targets (N,) with class indices
ce = nn.CrossEntropyLoss()
logits = torch.randn(8, 5)
targets = torch.tensor([0, 2, 1, 4, 3, 2, 1, 0])
print("CE:", ce(logits, targets).item())
# Regression: predictions and targets same shape
mse = nn.MSELoss()
pred = torch.randn(8, 3)
y = torch.randn(8, 3)
print("MSE:", mse(pred, y).item())
Regularization in the Objective
Weight decay (L2) is often implemented inside the optimizer rather than added explicitly to the loss, but conceptually it is the same as penalizing large weights. That term is not a “loss on labels” but part of the total objective you minimize.
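To make the equivalence concrete, here is a sketch under plain SGD (the 0.5 factor matches the convention of adding weight_decay · w to each gradient; decoupled variants such as AdamW behave differently):
import torch
import torch.nn as nn
model = nn.Linear(10, 1)
x, y = torch.randn(16, 10), torch.randn(16, 1)
wd = 1e-2
# Option A: let the optimizer apply L2 shrinkage via weight_decay
opt = torch.optim.SGD(model.parameters(), lr=0.1, weight_decay=wd)
# Option B: add the penalty to the loss yourself (same gradients under plain SGD)
l2 = sum((p ** 2).sum() for p in model.parameters())
loss = nn.MSELoss()(model(x), y) + 0.5 * wd * l2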
Summary
- Loss = scalar measure of prediction quality; training minimizes its average over data (plus regularization).
- MSE / MAE / Huber dominate regression; choice depends on outlier sensitivity.
- Cross-entropy is the standard classification surrogate; use logits with the appropriate PyTorch loss.
- Align loss with your probabilistic story and double-check tensor shapes and whether softmax is already inside the criterion.