Neural Networks: Activations

Activation Functions

Activations sit between linear layers and decide how each neuron “fires.” Without them, a deep stack of matrices would collapse to a single linear map. This page covers the usual suspects—ReLU, sigmoid, tanh, softmax—with formulas, intuition, and small code examples.

Why Nonlinear Activations?

If you only multiply by matrices and add biases, composing layers still yields one matrix: W₃(W₂(W₁x)) = Wx. You cannot bend the decision boundary into curves or disjoint regions. Inserting a nonlinear σ between layers breaks this equivalence: W₃ σ(W₂ σ(W₁x)) can approximate far richer functions (subject to width/depth).
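This collapse is easy to verify numerically. A minimal sketch (the matrix shapes are arbitrary): composing three linear maps gives exactly the same result as applying their single product matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(size=(4, 3))
W2 = rng.normal(size=(4, 4))
W3 = rng.normal(size=(2, 4))
x = rng.normal(size=3)

# Applying three linear layers in sequence...
deep = W3 @ (W2 @ (W1 @ x))
# ...is identical to one layer with the product matrix W = W3 W2 W1.
W = W3 @ W2 @ W1
print(np.allclose(deep, W @ x))  # True
```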

Training needs derivatives. For gradient-based learning, activations should be (piecewise) differentiable. ReLU is not differentiable at 0 in the strict sense; in practice subgradients work fine.

ReLU and Variants

ReLU(z) = max(0, z) is the default hidden activation in many vision MLPs/CNNs: cheap, sparse (many zeros), and gradients do not shrink for positive activations.

Leaky ReLU uses a small slope α for z < 0: max(αz, z), reducing “dead neurons” that never activate. GELU and Swish are smooth alternatives popular in Transformers and some modern CNNs.

ReLU & Leaky ReLU (NumPy)
import numpy as np

def relu(z): return np.maximum(0, z)

def leaky_relu(z, alpha=0.01):
    return np.where(z > 0, z, alpha * z)

z = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print("ReLU: ", relu(z))
print("Leaky:", leaky_relu(z))
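The smooth alternatives mentioned above can be sketched in NumPy too. Note the hedges: this GELU uses the common tanh approximation (frameworks may use the exact Gaussian CDF), and this Swish fixes β = 1, which is also known as SiLU.

```python
import numpy as np

def gelu(z):
    # tanh approximation: 0.5 * z * (1 + tanh(sqrt(2/pi) * (z + 0.044715 z^3)))
    return 0.5 * z * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (z + 0.044715 * z**3)))

def swish(z):
    # Swish with beta = 1 (SiLU): z * sigmoid(z)
    return z / (1.0 + np.exp(-z))

z = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print("GELU: ", gelu(z))
print("Swish:", swish(z))
```

Both are zero at the origin and approach the identity for large positive inputs, but unlike ReLU they pass small (nonzero) gradients for negative inputs.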

Sigmoid and Tanh

σ(z) = 1 / (1 + e⁻ᶻ) maps ℝ → (0, 1). Good for binary probabilities at the output (sometimes with BCE loss). tanh(z) maps to (-1, 1) and is zero-centered, which can help compared to sigmoid in some setups.

Both saturate for large |z|: derivatives ≈ 0, which can contribute to vanishing gradients in very deep sigmoid/tanh networks. That is one reason ReLU became standard for hidden layers.

Sigmoid & tanh
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = np.array([-3.0, 0.0, 3.0])
print("sigmoid:", sigmoid(z))
print("tanh:  ", np.tanh(z))

Derivatives (concept)

σ′(z) = σ(z)(1 − σ(z)); tanh′(z) = 1 − tanh²(z); ReLU′(z) is 1 for z > 0 and 0 for z < 0 (the value at z = 0 is a convention). Backpropagation multiplies these factors along the chain rule, so repeatedly small derivatives shrink the gradients reaching early layers.
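These formulas are easy to check numerically, and they make the saturation problem concrete: the sigmoid derivative peaks at 0.25 and is already below 0.01 at |z| = 5.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = np.array([-5.0, 0.0, 5.0])
s = sigmoid(z)
t = np.tanh(z)

print("sigmoid':", s * (1 - s))           # peak 0.25 at z = 0; ~0.0066 at |z| = 5
print("tanh':   ", 1 - t**2)              # peak 1.0 at z = 0
print("relu':   ", (z > 0).astype(float)) # 1 for positive inputs, else 0
```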

Softmax (Multi-Class Output)

Given a vector of logits (one per class), softmax produces probabilities that sum to 1:

softmax(z)ᵢ = exp(zᵢ) / Σⱼ exp(zⱼ)

Subtract max(z) before exp for numerical stability. In PyTorch, CrossEntropyLoss expects raw logits and applies log-softmax internally—do not apply softmax twice.

Stable softmax
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

logits = np.array([2.0, 1.0, 0.1])
p = softmax(logits)
print("probs:", p, "sum:", p.sum())
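The max-subtraction trick is not cosmetic. Softmax is invariant to adding a constant to every logit, so the shift changes nothing mathematically, but with large logits the naive version overflows to NaN. A minimal sketch contrasting the two:

```python
import numpy as np

def softmax_naive(z):
    e = np.exp(z)  # overflows for large z
    return e / e.sum()

def softmax_stable(z):
    e = np.exp(z - np.max(z))  # shift is invisible to softmax
    return e / e.sum()

big = np.array([1000.0, 1001.0, 1002.0])
with np.errstate(over="ignore", invalid="ignore"):
    print("naive: ", softmax_naive(big))  # overflow -> nan
print("stable:", softmax_stable(big))     # well-defined probabilities
```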

Quick Reference: Where to Use What

Activation        | Typical role               | Notes
ReLU / Leaky ReLU | Hidden layers (MLP, CNN)   | Fast; watch dead ReLUs, use leaky/GELU if needed
Sigmoid           | Binary output probability  | Or use BCEWithLogitsLoss without explicit sigmoid
Softmax           | Multi-class probabilities  | Often paired with cross-entropy on logits
Tanh              | Some RNNs / legacy nets    | Zero-centered; saturates like sigmoid
Linear (none)     | Regression output          | Raw scalar or vector prediction

PyTorch: F.relu, nn.Sigmoid, …

Module vs functional
import torch
import torch.nn as nn
import torch.nn.functional as F

x = torch.randn(4, 8)
# Functional
y1 = F.relu(x)
y2 = torch.sigmoid(x)
# Module form (stateful only if it has learnable params; plain ReLU has none)
relu_m = nn.ReLU()
y3 = relu_m(x)

Summary

  • Nonlinear activations let deep nets represent nonlinear functions.
  • ReLU family dominates hidden layers in many domains.
  • Sigmoid/softmax relate logits to probabilities at outputs.
  • Saturating activations can slow learning in deep stacks; match loss to whether logits or probabilities go into the criterion.

Next: chain those layers together in a full forward pass (batched matrix view).