Activation & Loss Functions
Activation functions and loss functions for training neural networks.
Activation Functions
Why Nonlinear Activations?
If you only multiply by matrices and add biases, composing layers still yields a single affine map: ignoring biases, W₃(W₂(W₁x)) = Wx with W = W₃W₂W₁. You cannot bend the decision boundary into curves or disjoint regions. Inserting a nonlinear σ between layers breaks this equivalence: W₃ σ(W₂ σ(W₁x)) can approximate far richer functions (subject to width/depth).
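The collapse argument above can be checked numerically. The following is a minimal sketch: the tiny weight matrices W1, W2, W3 and the input x are hypothetical values chosen for illustration, and biases are omitted.

```python
import numpy as np

def relu(z):
    return np.maximum(0, z)

# Hypothetical weights for a tiny 2 -> 2 -> 1 -> 1 network (biases omitted)
W1 = np.array([[1.0, -1.0],
               [-1.0, 1.0]])
W2 = np.array([[1.0, 1.0]])
W3 = np.array([[2.0]])

x = np.array([1.0, 0.0])

# Purely linear stack: identical to multiplying by the single matrix W
W = W3 @ W2 @ W1
linear_out = W3 @ (W2 @ (W1 @ x))
print("collapsed W:", W)  # all zeros for these particular weights
print("linear stack == W @ x:", np.allclose(linear_out, W @ x))

# Same weights with ReLU in between: computes 2 * |x1 - x2| instead
nonlinear_out = W3 @ relu(W2 @ relu(W1 @ x))
print("with ReLU:", nonlinear_out)
```

With these weights the linear stack maps every input to zero, while the ReLU version computes 2·|x₁ − x₂|, a function no single matrix can represent.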
ReLU and Variants
ReLU(z) = max(0, z) is the default hidden activation in many vision MLPs/CNNs: cheap, sparse (many zeros), and gradients do not shrink for positive activations.
Leaky ReLU uses a small slope α for z < 0: max(αz, z), reducing “dead neurons” that never activate. GELU and Swish are smooth alternatives popular in Transformers and some modern CNNs.
import numpy as np
def relu(z): return np.maximum(0, z)
def leaky_relu(z, alpha=0.01):
    return np.where(z > 0, z, alpha * z)
z = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print("ReLU: ", relu(z))
print("Leaky:", leaky_relu(z))
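For comparison, the smooth alternatives mentioned above (GELU and Swish) might be sketched as follows. This is an illustrative NumPy version, not a framework's implementation: `gelu_tanh` uses the common tanh approximation rather than the exact erf-based form, and `swish` assumes β = 1.

```python
import numpy as np

def swish(z):
    # Swish / SiLU with beta = 1: z * sigmoid(z)
    return z / (1.0 + np.exp(-z))

def gelu_tanh(z):
    # Tanh approximation of GELU (assumption: the exact erf form is not used here)
    return 0.5 * z * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (z + 0.044715 * z**3)))

z = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print("Swish:", swish(z))
print("GELU: ", gelu_tanh(z))
```

Unlike ReLU, both functions are slightly negative for small negative inputs and approach the identity for large positive ones.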
Sigmoid and Tanh
σ(z) = 1 / (1 + e⁻ᶻ) maps ℝ → (0, 1). Good for binary probabilities at the output (sometimes with BCE loss). tanh(z) maps to (-1, 1) and is zero-centered, which can help compared to sigmoid in some setups.
Both saturate for large |z|: derivatives ≈ 0, which can contribute to vanishing gradients in very deep sigmoid/tanh networks. That is one reason ReLU became standard for hidden layers.
Sigmoid & tanh
import numpy as np
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))
z = np.array([-3.0, 0.0, 3.0])
print("sigmoid:", sigmoid(z))
print("tanh: ", np.tanh(z))
Derivatives (concept)
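σ' = σ(1−σ); tanh' = 1−tanh². ReLU' is 1 for z>0 and 0 for z<0 (undefined at z = 0, where frameworks typically use 0). Backpropagation multiplies these local derivatives along the chain rule; the optimizer then uses the resulting gradients to update the weights.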
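The derivative formulas above can be written out and sanity-checked against finite differences. This is a minimal sketch; the helper `numeric_grad` and the sample points are assumptions made for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def d_sigmoid(z):
    s = sigmoid(z)
    return s * (1.0 - s)

def d_tanh(z):
    return 1.0 - np.tanh(z) ** 2

def d_relu(z):
    # Convention assumed here: the derivative at exactly z = 0 is taken as 0
    return (z > 0).astype(float)

def numeric_grad(f, z, eps=1e-6):
    # Central finite difference, used only as a sanity check
    return (f(z + eps) - f(z - eps)) / (2 * eps)

z = np.array([-6.0, -1.0, 0.5, 6.0])  # avoids z = 0, where ReLU has a kink
print("sigmoid':", d_sigmoid(z))
print("tanh':   ", d_tanh(z))
print("ReLU':   ", d_relu(z))
print("sigmoid' matches finite differences:",
      np.allclose(d_sigmoid(z), numeric_grad(sigmoid, z), atol=1e-6))
```

At z = ±6 the sigmoid and tanh derivatives are already close to zero, which is the saturation effect described earlier, while the ReLU derivative stays at 1 for any positive input.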
Softmax (Multi-Class Output)
Given a vector of logits (one per class), softmax produces probabilities that sum to 1:
softmax(z)ᵢ = exp(zᵢ) / Σⱼ exp(zⱼ)
Subtract max(z) before exp for numerical stability. In PyTorch, CrossEntropyLoss expects raw logits and applies log-softmax internally—do not apply softmax twice.
import numpy as np
def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()
logits = np.array([2.0, 1.0, 0.1])
p = softmax(logits)
print("probs:", p, "sum:", p.sum())
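To see why subtracting max(z) matters, here is a small sketch comparing a naive softmax with the stable version on large logits. The function names and the example logits are illustrative assumptions.

```python
import numpy as np

def softmax_naive(z):
    e = np.exp(z)  # overflows to inf for large logits
    return e / e.sum()

def softmax_stable(z):
    e = np.exp(z - np.max(z))  # largest exponent is exp(0) = 1
    return e / e.sum()

big = np.array([1000.0, 1001.0, 1002.0])

# Suppress NumPy's overflow/invalid warnings so the nan result is visible
with np.errstate(over="ignore", invalid="ignore"):
    print("naive: ", softmax_naive(big))
print("stable:", softmax_stable(big))
```

Softmax is invariant to adding a constant to every logit, so the stable version returns the same probabilities as it would for the logits [0, 1, 2].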
Quick Reference: Where to Use What
| Activation | Typical role | Notes |
|---|---|---|
| ReLU / Leaky ReLU | Hidden layers (MLP, CNN) | Fast; watch dead ReLUs / use leaky/GELU if needed |
| Sigmoid | Binary output probability | Or use BCEWithLogitsLoss without explicit sigmoid |
| Softmax | Multi-class probabilities | Often paired with cross-entropy on logits |
| Tanh | Some RNNs / legacy nets | Zero-centered; saturates like sigmoid |
| Linear (none) | Regression output | Raw scalar or vector prediction |
PyTorch: F.relu, nn.Sigmoid, …
import torch
import torch.nn as nn
import torch.nn.functional as F
x = torch.randn(4, 8)
# Functional
y1 = F.relu(x)
y2 = torch.sigmoid(x)
# Module form (nn.ReLU has no learnable parameters; modules such as nn.PReLU do)
relu_m = nn.ReLU()
y3 = relu_m(x)
Summary
- Nonlinear activations let deep nets represent non-linear functions.
- ReLU family dominates hidden layers in many domains.
- Sigmoid/softmax relate logits to probabilities at outputs.
- Saturating activations can slow learning in deep stacks; match loss to whether logits or probabilities go into the criterion.
Loss Functions
Why Loss Functions Matter
In theory you might want to minimize expected loss under the real data distribution, but we only see a finite training set. So we minimize the empirical average (possibly with regularization terms added to the objective). The loss is the bridge between “what we want” (correct labels, small errors) and “what gradient descent can use” (a scalar that is smooth enough or subdifferentiable enough to backprop through).
Not every natural metric is a good training loss. Classification accuracy is piecewise constant in the weights: tiny changes rarely flip a discrete decision, so gradients are zero almost everywhere. That is why we train with surrogate losses like cross-entropy that reward moving logits in the right direction even when the predicted class is already correct.
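A small sketch of this point, under simplifying assumptions: instead of perturbing weights, it nudges a hypothetical batch of logits directly, which stands in for the effect of a tiny weight update. Accuracy does not move, while cross-entropy does.

```python
import numpy as np

def log_softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=1, keepdims=True))

def accuracy(logits, y):
    return np.mean(logits.argmax(axis=1) == y)

def cross_entropy(logits, y):
    return -log_softmax(logits)[np.arange(len(y)), y].mean()

# Hypothetical logits for 3 examples and 2 classes, with integer labels
logits = np.array([[2.0, 0.0], [0.5, 1.5], [1.0, 0.0]])
y = np.array([0, 1, 0])

# Nudge each correct-class logit up slightly
nudged = logits.copy()
nudged[np.arange(3), y] += 0.01

print("accuracy:", accuracy(logits, y), "->", accuracy(nudged, y))
print("CE:      ", cross_entropy(logits, y), "->", cross_entropy(nudged, y))
```

The accuracy gives no signal about the improvement, whereas the cross-entropy decreases, so its gradient can still guide training.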
The right loss encodes assumptions and costs: squared error penalizes large mistakes heavily; absolute error treats outliers more gently; cross-entropy aligns with probabilistic models for class labels. Mismatch between loss and deployment metric (e.g. training with log loss but reporting F1) is normal—you often tune thresholds or auxiliary losses later.
Regression Losses: MSE, MAE, and Robust Variants
For real-valued targets y and predictions ŷ, the mean squared error (MSE) averages squared residuals: (1/N) Σ (yᵢ − ŷᵢ)². Squaring magnifies large errors, so a few bad outliers can dominate the gradient. That is often desirable when noise is Gaussian and large errors are genuinely worse—but it can destabilize training if labels are noisy or heavy-tailed.
Mean absolute error (MAE), the L1 loss, uses |y − ŷ|. It is more robust to outliers and, for a constant prediction, is minimized by the median of the targets (MSE by the mean). Its derivative jumps from −1 to +1 at zero, though, and its gradient magnitude stays constant rather than shrinking with the error as it does for MSE. Huber loss blends the two: quadratic near zero (smooth optimization) and linear far away (robust tails). Many depth-estimation and detection heads use variants of these ideas.
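The three regression losses can be compared on a toy example with one corrupted label. This is a minimal NumPy sketch; the data values and the Huber threshold `delta=1.0` are assumptions chosen for illustration.

```python
import numpy as np

def mse(y, y_hat):
    return np.mean((y - y_hat) ** 2)

def mae(y, y_hat):
    return np.mean(np.abs(y - y_hat))

def huber(y, y_hat, delta=1.0):
    # Quadratic for |r| <= delta, linear beyond; delta = 1.0 is an assumed default
    r = np.abs(y - y_hat)
    return np.mean(np.where(r <= delta, 0.5 * r**2, delta * (r - 0.5 * delta)))

y = np.array([1.0, 2.0, 3.0, 4.0])
y_hat = np.array([1.1, 1.9, 3.2, 4.0])
y_outlier = np.array([1.0, 2.0, 3.0, 40.0])  # last label corrupted

for name, fn in [("MSE", mse), ("MAE", mae), ("Huber", huber)]:
    print(f"{name:5s} clean={fn(y, y_hat):.4f}  with outlier={fn(y_outlier, y_hat):.4f}")
```

The single outlier inflates MSE far more than MAE or Huber, because its residual enters squared.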
When outputs are bounded (e.g. probabilities in [0,1]), you might still use MSE on raw outputs, but watch saturation: if the last layer uses sigmoid and targets are probabilities, MSE can work; if targets are arbitrary reals, a linear output head is standard with MSE.
Classification: Cross-Entropy and Binary Cross-Entropy
For K mutually exclusive classes, a probabilistic model outputs a distribution p over classes. The cross-entropy between the true distribution q (often one-hot) and prediction p is −Σ_k q_k log p_k. With one-hot labels this reduces to −log p_y for the correct class y, heavily penalizing confident wrong answers.
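As a worked example of the formula above, the sketch below evaluates the cross-entropy for three hypothetical predictions against the same one-hot label. The probability vectors are made up for illustration.

```python
import numpy as np

def cross_entropy(q, p):
    # -sum_k q_k log p_k for a single example; assumes p has no exact zeros
    return -np.sum(q * np.log(p))

q = np.array([0.0, 1.0, 0.0])  # one-hot: true class is index 1

p_good = np.array([0.05, 0.90, 0.05])    # confident and correct
p_unsure = np.array([0.30, 0.40, 0.30])  # correct but hesitant
p_bad = np.array([0.90, 0.05, 0.05])     # confident and wrong

for name, p in [("good", p_good), ("unsure", p_unsure), ("bad", p_bad)]:
    print(f"{name:7s} CE = {cross_entropy(q, p):.3f}  (-log p_y = {-np.log(p[1]):.3f})")
```

In each row the two printed numbers agree, since the one-hot q selects only the term for the correct class, and the confident wrong prediction receives the largest loss.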
In practice the network usually emits logits (unnormalized scores) z; softmax turns logits into p. Computing log-softmax followed by NLL is mathematically equivalent to applying softmax, taking the log, and evaluating cross-entropy, but it is numerically more stable. PyTorch's CrossEntropyLoss expects raw logits of shape (N, K) and integer class indices of shape (N,). It applies log-softmax internally, so you must not apply softmax yourself before this loss.
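The equivalence, and the effect of applying softmax by mistake, can be checked directly. This is a minimal sketch with random logits; the seed and tensor sizes are arbitrary assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)                 # arbitrary seed for reproducibility
logits = torch.randn(8, 5)           # (N, K) raw scores
targets = torch.randint(0, 5, (8,))  # (N,) integer class indices

ce = nn.CrossEntropyLoss()(logits, targets)
nll = nn.NLLLoss()(F.log_softmax(logits, dim=1), targets)
print("CrossEntropyLoss:     ", ce.item())
print("log_softmax + NLLLoss:", nll.item())  # same value up to rounding

# Mistake: softmax applied before CrossEntropyLoss, so it is applied twice
wrong = nn.CrossEntropyLoss()(F.softmax(logits, dim=1), targets)
print("softmax then CE (bug):", wrong.item())
```

The buggy version still runs and returns a plausible-looking number, which is why this mistake is easy to miss: the loss is squashed into a narrow range and gradients become weaker.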
For binary problems you can use a single logit with sigmoid and binary cross-entropy, or prefer BCEWithLogitsLoss which fuses sigmoid and BCE in a stable way. Multi-label classification (several independent yes/no dimensions) also uses BCE-style losses per label.
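A short sketch of the two binary options. The logits and targets are made-up values, and the comment about clamping reflects the behavior described in the PyTorch documentation for BCELoss rather than anything stated in this text.

```python
import torch
import torch.nn as nn

logits = torch.tensor([-2.0, 0.0, 3.0])  # one raw score per example
targets = torch.tensor([0.0, 1.0, 1.0])  # float targets in {0, 1}

fused = nn.BCEWithLogitsLoss()(logits, targets)
two_step = nn.BCELoss()(torch.sigmoid(logits), targets)
print("BCEWithLogitsLoss:", fused.item())
print("sigmoid + BCELoss:", two_step.item())  # matches for moderate logits

# For an extreme logit, sigmoid saturates to exactly 1.0 in float32, so the
# two-step version loses information (BCELoss clamps its log terms), while
# the fused loss keeps growing with the logit.
extreme = torch.tensor([200.0])
wrong_label = torch.tensor([0.0])
fused_extreme = nn.BCEWithLogitsLoss()(extreme, wrong_label)
two_step_extreme = nn.BCELoss()(torch.sigmoid(extreme), wrong_label)
print("fused, extreme:   ", fused_extreme.item())
print("two-step, extreme:", two_step_extreme.item())
```

For multi-label problems the same losses apply elementwise to a (N, L) tensor of logits and a matching tensor of 0/1 targets.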
Common mistakes: feeding the wrong kind of input to CrossEntropyLoss, or applying softmax twice. Read the docstring: logits in, integer targets in, scalar loss out.
PyTorch Examples
import torch
import torch.nn as nn
# Multi-class: logits (N, K), targets (N,) with class indices
ce = nn.CrossEntropyLoss()
logits = torch.randn(8, 5)
targets = torch.tensor([0, 2, 1, 4, 3, 2, 1, 0])
print("CE:", ce(logits, targets).item())
# Regression: predictions and targets same shape
mse = nn.MSELoss()
pred = torch.randn(8, 3)
y = torch.randn(8, 3)
print("MSE:", mse(pred, y).item())
Regularization in the objective
Weight decay (L2) is often implemented inside the optimizer rather than added explicitly to the loss, but conceptually it is the same as penalizing large weights (exactly so for plain SGD; decoupled variants such as AdamW differ in detail). That term is not a “loss on labels” but part of the total objective you minimize.
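For plain SGD, the correspondence between optimizer-side weight decay and an explicit L2 penalty can be verified in one step. This is a sketch under stated assumptions: the data, `lam`, and `lr` are illustrative, and the explicit penalty is written as (lam / 2)·‖w‖² so that its gradient equals lam·w. The match does not carry over to adaptive optimizers in general.

```python
import torch

torch.manual_seed(0)  # arbitrary seed
x, y = torch.randn(16, 3), torch.randn(16, 1)
lam, lr = 0.1, 0.05   # illustrative hyperparameters

# Two copies of the same initial weights
w_a = torch.randn(3, 1, requires_grad=True)
w_b = w_a.detach().clone().requires_grad_(True)

# (a) weight decay handled by the optimizer
opt_a = torch.optim.SGD([w_a], lr=lr, weight_decay=lam)
loss_a = ((x @ w_a - y) ** 2).mean()
loss_a.backward()
opt_a.step()

# (b) explicit penalty (lam / 2) * ||w||^2 added to the objective
opt_b = torch.optim.SGD([w_b], lr=lr)
loss_b = ((x @ w_b - y) ** 2).mean() + 0.5 * lam * (w_b ** 2).sum()
loss_b.backward()
opt_b.step()

print("max difference after one step:", (w_a - w_b).abs().max().item())
```

Note that only version (b) includes the penalty in the reported loss value; with version (a) the logged loss reflects the data term alone.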