Neural Networks

Activation & Loss Functions

Activation functions and loss functions for training neural networks.

Activation Functions

Why Nonlinear Activations?

If you only multiply by matrices and add biases, composing layers still yields a single linear (affine) map; ignoring biases, W₃(W₂(W₁x)) = Wx with W = W₃W₂W₁. You cannot bend the decision boundary into curves or disjoint regions. Inserting a nonlinear σ between layers breaks this equivalence: W₃ σ(W₂ σ(W₁x)) can approximate far richer functions (subject to width/depth).
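
The collapse described above can be checked numerically. The following is a minimal sketch with small hand-picked matrices (the values are illustrative, not taken from any real model) and biases left out for brevity.

```python
import numpy as np

# Three tiny bias-free layers with hand-picked, illustrative weights.
W1 = np.array([[1.0, -1.0],
               [-1.0, 1.0]])
W2 = np.array([[1.0, 2.0],
               [3.0, 4.0]])
W3 = np.array([[1.0, 1.0]])
x = np.array([1.0, 2.0])

# Without a nonlinearity the stack equals one matrix W = W3 W2 W1.
W = W3 @ W2 @ W1
stacked_linear = W3 @ (W2 @ (W1 @ x))

# With ReLU between layers, that single matrix no longer reproduces the output.
def relu(z):
    return np.maximum(0.0, z)

stacked_relu = W3 @ relu(W2 @ relu(W1 @ x))

print("W x:          ", W @ x)
print("linear stack: ", stacked_linear)
print("ReLU stack:   ", stacked_relu)
```

The linear stack and `W @ x` agree, while the ReLU stack gives a different value because the first layer's negative pre-activation is zeroed out.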

Training needs derivatives. For gradient-based learning, activations should be (piecewise) differentiable. ReLU is not differentiable at 0 in the strict sense; in practice subgradients work fine.

ReLU and Variants

ReLU(z) = max(0, z) is the default hidden activation in many vision MLPs/CNNs: cheap, sparse (many zeros), and gradients do not shrink for positive activations.

Leaky ReLU uses a small slope α for z < 0: max(αz, z) for 0 < α < 1, reducing “dead neurons” that never activate. GELU and Swish are smooth alternatives popular in Transformers and some modern CNNs.

ReLU & Leaky ReLU (NumPy)

import numpy as np

def relu(z): return np.maximum(0, z)

def leaky_relu(z, alpha=0.01):
    return np.where(z > 0, z, alpha * z)

z = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print("ReLU: ", relu(z))
print("Leaky:", leaky_relu(z))
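
For comparison with the ReLU variants, here is a hedged NumPy sketch of the two smooth alternatives mentioned earlier. The function names `swish` and `gelu_tanh` are chosen for this example; the GELU shown is the common tanh approximation, not the exact Gaussian-CDF form.

```python
import numpy as np

def swish(z):
    # Swish (also called SiLU): z * sigmoid(z)
    return z / (1.0 + np.exp(-z))

def gelu_tanh(z):
    # Tanh approximation of GELU; the exact definition uses the Gaussian CDF.
    return 0.5 * z * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (z + 0.044715 * z**3)))

z = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print("Swish:", swish(z))
print("GELU: ", gelu_tanh(z))
```

Unlike ReLU, both functions return small negative values for moderately negative inputs and approach the identity for large positive inputs.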

Sigmoid and Tanh

σ(z) = 1 / (1 + e⁻ᶻ) maps ℝ → (0, 1). Good for binary probabilities at the output (usually paired with BCE loss). tanh(z) maps to (-1, 1) and is zero-centered, which can help compared to sigmoid in some setups.

Both saturate for large |z|: derivatives ≈ 0, which can contribute to vanishing gradients in very deep sigmoid/tanh networks. That is one reason ReLU became standard for hidden layers.
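
To make the saturation argument concrete, the sketch below evaluates the sigmoid derivative, using the standard identity σ' = σ(1−σ), at a few inputs and then multiplies the best-case value across several layers. Ignoring the weight matrices is a simplifying assumption of this example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)

# The sigmoid derivative peaks at 0.25 (z = 0) and collapses once |z| is large.
for z in (0.0, 2.0, 5.0, 10.0):
    print(f"z = {z:4.1f}  sigmoid'(z) = {sigmoid_grad(z):.2e}")

# Best case for a chain of sigmoids: every layer contributes the maximum 0.25.
# Weight matrices are ignored here, which is a simplification.
depths = [1, 5, 10, 20]
best_case = [0.25 ** d for d in depths]
for d, g in zip(depths, best_case):
    print(f"depth {d:2d}: activation-derivative product <= {g:.2e}")
```

Even in the best case the product of activation derivatives shrinks geometrically with depth, which is the core of the vanishing-gradient concern.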

Sigmoid & tanh

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = np.array([-3.0, 0.0, 3.0])
print("sigmoid:", sigmoid(z))
print("tanh:  ", np.tanh(z))

Derivatives (concept)

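σ' = σ(1−σ); tanh' = 1−tanh². ReLU' is 1 for z>0, 0 for z<0 (at z=0, frameworks pick a subgradient, commonly 0). Backpropagation multiplies these local derivatives along the chain rule; the optimizer then uses the resulting gradients to update the weights.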
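
These closed forms can be sanity-checked against finite differences. The helper names below (`d_sigmoid`, `d_tanh`, `d_relu`, `central_diff`) are made up for this sketch, and the choice of 0 as the ReLU derivative at z = 0 is an assumed convention.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def d_sigmoid(z):
    s = sigmoid(z)
    return s * (1.0 - s)

def d_tanh(z):
    return 1.0 - np.tanh(z) ** 2

def d_relu(z):
    # Assumed convention: use 0 as the subgradient at z == 0.
    return (z > 0).astype(float)

def relu(z):
    return np.maximum(0.0, z)

def central_diff(f, z, h=1e-5):
    # Numerical derivative, used only as a sanity check.
    return (f(z + h) - f(z - h)) / (2.0 * h)

z = np.array([-2.0, -0.5, 0.5, 2.0])  # avoids the ReLU kink at 0

sigmoid_ok = np.allclose(d_sigmoid(z), central_diff(sigmoid, z), atol=1e-6)
tanh_ok = np.allclose(d_tanh(z), central_diff(np.tanh, z), atol=1e-6)
relu_ok = np.allclose(d_relu(z), central_diff(relu, z), atol=1e-6)
print("sigmoid:", sigmoid_ok, "tanh:", tanh_ok, "relu:", relu_ok)
```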

Softmax (Multi-Class Output)

Given a vector of logits (one per class), softmax produces probabilities that sum to 1:

softmax(z)_i = e^{z_i} / Σ_j e^{z_j}

Subtract max(z) before exp for numerical stability. In PyTorch, CrossEntropyLoss expects raw logits and applies log-softmax internally, so do not apply softmax yourself beforehand.

Stable softmax

import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

logits = np.array([2.0, 1.0, 0.1])
p = softmax(logits)
print("probs:", p, "sum:", p.sum())
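
To show why the max subtraction matters, this sketch compares a naive softmax with a stable one on a batch that includes deliberately large logits. The batched `axis=-1` handling and the function names are additions for illustration.

```python
import numpy as np

def softmax_naive(z):
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def softmax_stable(z):
    # Subtracting the row-wise max leaves the result unchanged mathematically
    # but keeps every exponent <= 0, so exp cannot overflow.
    shifted = z - np.max(z, axis=-1, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=-1, keepdims=True)

# Batch of two logit vectors; the second has large values on purpose.
logits = np.array([[2.0, 1.0, 0.1],
                   [1000.0, 1001.0, 1002.0]])

with np.errstate(over="ignore", invalid="ignore"):
    naive = softmax_naive(logits)   # second row becomes inf / inf = nan
stable = softmax_stable(logits)

print("naive has NaN:", np.isnan(naive).any())
print("stable rows sum to 1:", np.allclose(stable.sum(axis=-1), 1.0))
```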

Quick Reference: Where to Use What

Activation	Typical role	Notes
ReLU / Leaky ReLU	Hidden layers (MLP, CNN)	Fast; watch dead ReLUs / use leaky/GELU if needed
Sigmoid	Binary output probability	Or use BCEWithLogitsLoss without explicit sigmoid
Softmax	Multi-class probabilities	Often paired with cross-entropy on logits
Tanh	Some RNNs / legacy nets	Zero-centered; saturates like sigmoid
Linear (none)	Regression output	Raw scalar or vector prediction

PyTorch: `F.relu`, `nn.Sigmoid`, …

Module vs functional

import torch
import torch.nn as nn
import torch.nn.functional as F

x = torch.randn(4, 8)
# Functional
y1 = F.relu(x)
y2 = torch.sigmoid(x)
# Module (stateful only if it has learned params; plain ReLU has none)
relu_m = nn.ReLU()
y3 = relu_m(x)

Summary

Nonlinear activations let deep nets represent nonlinear functions.
ReLU family dominates hidden layers in many domains.
Sigmoid/softmax relate logits to probabilities at outputs.
Saturating activations can slow learning in deep stacks; match loss to whether logits or probabilities go into the criterion.

Loss Functions

Why Loss Functions Matter

In theory you might want to minimize expected loss under the real data distribution, but we only see a finite training set. So we minimize the empirical average (possibly with regularization terms added to the objective). The loss is the bridge between “what we want” (correct labels, small errors) and “what gradient descent can use” (a scalar that is smooth enough or subdifferentiable enough to backprop through).

Not every natural metric is a good training loss. Classification accuracy is piecewise constant in the weights: tiny changes rarely flip a discrete decision, so gradients are zero almost everywhere. That is why we train with surrogate losses like cross-entropy that reward moving logits in the right direction even when the predicted class is already correct.
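
The contrast between accuracy and a surrogate loss can be seen on a toy batch. In this sketch (logit values and helper names are illustrative), nudging the correct-class logits leaves accuracy unchanged while cross-entropy still decreases.

```python
import numpy as np

def log_softmax(z):
    shifted = z - z.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

def accuracy(logits, y):
    return float((logits.argmax(axis=-1) == y).mean())

def cross_entropy(logits, y):
    return float(-log_softmax(logits)[np.arange(len(y)), y].mean())

y = np.array([0, 1])
logits = np.array([[1.0, 0.5],    # correct, but not very confident
                   [0.2, 0.4]])   # correct, but not very confident

# Nudge the correct-class logits up slightly.
nudged = logits.copy()
nudged[np.arange(len(y)), y] += 0.1

print("accuracy:     ", accuracy(logits, y), "->", accuracy(nudged, y))
print("cross-entropy:", cross_entropy(logits, y), "->", cross_entropy(nudged, y))
```

Accuracy gives the optimizer no signal here, whereas the cross-entropy value still rewards the move toward higher confidence.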

The right loss encodes assumptions and costs: squared error penalizes large mistakes heavily; absolute error treats outliers more gently; cross-entropy aligns with probabilistic models for class labels. Mismatch between loss and deployment metric (e.g. training with log loss but reporting F1) is normal; you often tune thresholds or auxiliary losses later.

Regression Losses: MSE, MAE, and Robust Variants

For real-valued targets y and predictions ŷ, the mean squared error (MSE) averages squared residuals: (1/N) Σᵢ (yᵢ − ŷᵢ)². Squaring magnifies large errors, so a few bad outliers can dominate the gradient. That is often desirable when noise is Gaussian and large errors are genuinely worse, but it can destabilize training if labels are noisy or heavy-tailed.
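
A small NumPy sketch of MSE and its per-prediction gradient, with one deliberately bad prediction, illustrates how a single outlier dominates. The data values are made up for the example.

```python
import numpy as np

def mse(y, y_hat):
    return np.mean((y - y_hat) ** 2)

def mse_grad(y, y_hat):
    # Derivative of the mean squared error with respect to each prediction.
    return 2.0 * (y_hat - y) / y.size

y     = np.array([1.0, 2.0, 3.0, 4.0])
y_hat = np.array([1.1, 1.9, 3.2, 14.0])   # last prediction is far off

grad = mse_grad(y, y_hat)
print("MSE:", mse(y, y_hat))
print("per-prediction gradient:", grad)
```

The gradient entry for the outlier is roughly fifty times larger than any other entry, so it largely determines the update direction.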

Mean absolute error (MAE), the L1 loss, uses |y − ŷ|. It is more robust to outliers and corresponds to the median in simple settings, but its derivative is discontinuous at zero and its gradient has constant magnitude regardless of error size, unlike MSE, whose gradient shrinks as the error shrinks. Huber loss blends the two: quadratic near zero (smooth optimization) and linear far away (robust tails). Many depth-estimation and detection heads use variants of these ideas.
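
The following sketch implements MAE and a Huber loss in NumPy and compares them with MSE on data containing one outlier. The threshold `delta=1.0` and the data are illustrative assumptions; libraries may scale or parameterize Huber differently.

```python
import numpy as np

def mae(y, y_hat):
    return np.mean(np.abs(y - y_hat))

def huber(y, y_hat, delta=1.0):
    # Quadratic for |r| <= delta, linear beyond it; delta is a tunable threshold.
    r = np.abs(y - y_hat)
    quadratic = 0.5 * r ** 2
    linear = delta * (r - 0.5 * delta)
    return np.mean(np.where(r <= delta, quadratic, linear))

y     = np.array([1.0, 2.0, 3.0, 4.0])
y_hat = np.array([1.1, 1.9, 3.2, 14.0])   # one outlier

mse_value = np.mean((y - y_hat) ** 2)
print("MAE:  ", mae(y, y_hat))
print("Huber:", huber(y, y_hat))
print("MSE:  ", mse_value)
```

With the outlier present, MSE is about ten times larger than MAE or Huber, which reflects how much more weight squaring gives to the single large residual.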

When targets are bounded (e.g. probabilities in [0,1]), a sigmoid output with MSE can work, but watch saturation: once the sigmoid saturates, its small derivative also shrinks the MSE gradient. If targets are arbitrary reals, a linear output head with MSE is standard.

Classification: Cross-Entropy and Binary Cross-Entropy

For K mutually exclusive classes, a probabilistic model outputs a distribution p over classes. The cross-entropy between the true distribution q (often one-hot) and prediction p is −Σ_k q_k log p_k. With one-hot labels this reduces to −log p_y for the correct class y, heavily penalizing confident wrong answers.
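
The one-hot reduction can be verified directly. In this sketch the predicted distribution `p` is an illustrative choice, and natural logarithms are assumed.

```python
import numpy as np

def cross_entropy(q, p):
    # General form: -sum_k q_k log p_k
    return -np.sum(q * np.log(p))

p = np.array([0.7, 0.2, 0.1])   # predicted distribution (illustrative values)
y = 0                            # index of the true class
q = np.eye(len(p))[y]            # one-hot target

print("full sum:", cross_entropy(q, p))
print("-log p_y:", -np.log(p[y]))

# A confident wrong prediction is penalized far more than a hesitant one.
hesitant = -np.log(0.4)     # true class was given probability 0.4
confident_wrong = -np.log(0.01)   # true class was given probability 0.01
print("loss at p_y = 0.4: ", hesitant)
print("loss at p_y = 0.01:", confident_wrong)
```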

In practice the network usually emits logits (unnormalized scores) z; softmax turns logits into p. The combination log-softmax + NLL is numerically stable and mathematically equivalent to applying softmax, taking the log, and then computing cross-entropy. PyTorch's CrossEntropyLoss expects raw logits of shape (N, K) and integer class indices of shape (N,); it applies log-softmax internally, so you must not apply softmax yourself before this loss.
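
As a rough NumPy illustration of the log-softmax + NLL combination (not PyTorch's actual implementation), the sketch below computes cross-entropy straight from logits with the log-sum-exp shift and compares it with the softmax-then-log path. Function names and values are assumptions of this example.

```python
import numpy as np

def log_softmax(z):
    # Log-sum-exp trick: shift by the row max before exponentiating.
    shifted = z - z.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

def cross_entropy_from_logits(logits, targets):
    # logits: (N, K) raw scores; targets: (N,) integer class indices.
    logp = log_softmax(logits)
    return -logp[np.arange(len(targets)), targets].mean()

logits = np.array([[2.0, 1.0, 0.1],
                   [0.5, 2.5, 0.3]])
targets = np.array([0, 1])

# Reference path: softmax, then log, then pick the true class.
probs = np.exp(logits) / np.exp(logits).sum(axis=-1, keepdims=True)
reference = -np.log(probs[np.arange(len(targets)), targets]).mean()

print("log-softmax + NLL:", cross_entropy_from_logits(logits, targets))
print("softmax then log: ", reference)
```

For moderate logits the two paths agree; the shifted version additionally stays finite for very large logits where the plain exponential would overflow.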

For binary problems you can use a single logit with sigmoid and binary cross-entropy, or prefer BCEWithLogitsLoss which fuses sigmoid and BCE in a stable way. Multi-label classification (several independent yes/no dimensions) also uses BCE-style losses per label.
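
The idea behind fusing sigmoid and BCE can be sketched in NumPy. This is a commonly used stable rewrite of the loss, shown here as an illustration rather than as the code PyTorch runs; the function names are hypothetical.

```python
import numpy as np

def bce_with_logits(z, y):
    # Stable form of -[y log s(z) + (1 - y) log(1 - s(z))], with s = sigmoid.
    return np.mean(np.maximum(z, 0.0) - z * y + np.log1p(np.exp(-np.abs(z))))

def bce_naive(z, y):
    p = 1.0 / (1.0 + np.exp(-z))
    return np.mean(-(y * np.log(p) + (1.0 - y) * np.log(1.0 - p)))

z = np.array([-1.5, 0.3, 2.0])
y = np.array([0.0, 1.0, 1.0])
print("fused:", bce_with_logits(z, y))
print("naive:", bce_naive(z, y))

# With an extreme logit the naive version hits log(0); the fused one stays finite.
z_big, y_big = np.array([40.0]), np.array([0.0])
with np.errstate(divide="ignore", invalid="ignore"):
    print("naive, z=40:", bce_naive(z_big, y_big))
print("fused, z=40:", bce_with_logits(z_big, y_big))
```

The two versions match on ordinary inputs, but the naive one returns infinity once the sigmoid rounds to exactly 1 in floating point.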

Common mistake. Passing probabilities into CrossEntropyLoss, or applying softmax twice. Read the docstring: logits in, integer targets in, scalar loss out.
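
The mistake is easy to reproduce. The sketch below, with illustrative logits, compares the correct call against one that passes softmax output into `CrossEntropyLoss`, and checks the correct value against `F.log_softmax` followed by `F.nll_loss`.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

ce = nn.CrossEntropyLoss()
logits = torch.tensor([[2.0, 1.0, 0.1]])   # illustrative values, shape (N=1, K=3)
target = torch.tensor([0])                  # integer class index, shape (N,)

correct = ce(logits, target)                      # raw logits in
wrong = ce(F.softmax(logits, dim=1), target)      # softmax applied twice in effect

# The correct value matches log-softmax followed by negative log-likelihood.
manual = F.nll_loss(F.log_softmax(logits, dim=1), target)

print("correct:", correct.item())
print("wrong:  ", wrong.item())
print("manual: ", manual.item())
```

The wrong call runs without any error or warning, which is why this bug tends to surface only as slow or stalled training.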

PyTorch Examples

Multi-class & regression

import torch
import torch.nn as nn

# Multi-class: logits (N, K), targets (N,) with class indices
ce = nn.CrossEntropyLoss()
logits = torch.randn(8, 5)
targets = torch.tensor([0, 2, 1, 4, 3, 2, 1, 0])
print("CE:", ce(logits, targets).item())

# Regression: predictions and targets same shape
mse = nn.MSELoss()
pred = torch.randn(8, 3)
y = torch.randn(8, 3)
print("MSE:", mse(pred, y).item())

Regularization in the objective

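Weight decay (L2) is often implemented inside the optimizer rather than added explicitly to the loss. For plain SGD this is equivalent to adding an L2 penalty on the weights; with adaptive optimizers such as Adam the two differ, which is why decoupled variants like AdamW exist. Either way, that term is not a “loss on labels” but part of the total objective you minimize.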
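
For a single plain SGD step, the two views can be compared directly. In this sketch the learning rate, decay strength, and the stand-in data gradient are all made-up values, and the penalty is assumed to be (λ/2)·‖w‖² so that its gradient is λw.

```python
import numpy as np

lr, lam = 0.1, 0.01                      # illustrative learning rate and decay strength
w = np.array([1.0, -2.0, 0.5])
grad_data = np.array([0.3, -0.1, 0.2])   # stand-in for d(data loss)/dw

# Option A: add (lam / 2) * ||w||^2 to the loss, so its gradient lam * w
# joins the data gradient.
w_l2 = w - lr * (grad_data + lam * w)

# Option B: take the plain gradient step, then shrink the weights directly.
w_decay = w - lr * grad_data - lr * lam * w

print("L2 penalty step:  ", w_l2)
print("weight decay step:", w_decay)
```

The two updates coincide here only because the step is plain SGD; once an optimizer rescales gradients per parameter, the penalty gradient gets rescaled too and the results generally diverge.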
