Neural Networks: Activations

Activation Functions

Activations sit between linear layers and decide how each neuron “fires.” Without them, a deep stack of matrices would collapse to a single linear map. This page covers the usual suspects—ReLU, sigmoid, tanh, softmax—with formulas, intuition, and small code examples.

Why Nonlinear Activations?

If you only multiply by matrices and add biases, composing layers still yields one matrix: W₃(W₂(W₁x)) = Wx. You cannot bend the decision boundary into curves or disjoint regions. Inserting a nonlinear σ between layers breaks this equivalence: W₃ σ(W₂ σ(W₁x)) can approximate far richer functions (subject to width/depth).
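This collapse is easy to verify numerically. A minimal sketch (the matrix shapes are arbitrary): composing three linear maps gives exactly the same result as applying their single product matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(size=(4, 3))
W2 = rng.normal(size=(4, 4))
W3 = rng.normal(size=(2, 4))
x = rng.normal(size=3)

# Applying three linear layers in sequence...
deep = W3 @ (W2 @ (W1 @ x))
# ...is identical to one layer with the product matrix W = W3 W2 W1.
W = W3 @ W2 @ W1
print(np.allclose(deep, W @ x))  # True
```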

Training needs derivatives. For gradient-based learning, activations should be (piecewise) differentiable. ReLU is not differentiable at 0 in the strict sense; in practice subgradients work fine.

ReLU and Variants

ReLU(z) = max(0, z) is the default hidden activation in many vision MLPs/CNNs: cheap, sparse (many zeros), and gradients do not shrink for positive activations.

Leaky ReLU uses a small slope α for z < 0: max(αz, z), reducing “dead neurons” that never activate. GELU and Swish are smooth alternatives popular in Transformers and some modern CNNs.

ReLU & Leaky ReLU (NumPy)
import numpy as np

def relu(z): return np.maximum(0, z)

def leaky_relu(z, alpha=0.01):
    return np.where(z > 0, z, alpha * z)

z = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print("ReLU: ", relu(z))
print("Leaky:", leaky_relu(z))
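The smooth alternatives mentioned above can be sketched in NumPy too. Note the hedges: this GELU uses the common tanh approximation (frameworks may use the exact Gaussian CDF), and this Swish fixes β = 1, which is also known as SiLU.

```python
import numpy as np

def gelu(z):
    # tanh approximation: 0.5 * z * (1 + tanh(sqrt(2/pi) * (z + 0.044715 z^3)))
    return 0.5 * z * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (z + 0.044715 * z**3)))

def swish(z):
    # Swish with beta = 1 (SiLU): z * sigmoid(z)
    return z / (1.0 + np.exp(-z))

z = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print("GELU: ", gelu(z))
print("Swish:", swish(z))
```

Both are zero at the origin and approach the identity for large positive inputs, but unlike ReLU they pass small (nonzero) gradients for negative inputs.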

Sigmoid and Tanh

σ(z) = 1 / (1 + e⁻ᶻ) maps ℝ → (0, 1). Good for binary probabilities at the output (sometimes with BCE loss). tanh(z) maps to (-1, 1) and is zero-centered, which can help compared to sigmoid in some setups.

Both saturate for large |z|: derivatives ≈ 0, which can contribute to vanishing gradients in very deep sigmoid/tanh networks. That is one reason ReLU became standard for hidden layers.

Sigmoid & tanh
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = np.array([-3.0, 0.0, 3.0])
print("sigmoid:", sigmoid(z))
print("tanh:  ", np.tanh(z))

Derivatives (concept)

σ′(z) = σ(z)(1 − σ(z)); tanh′(z) = 1 − tanh²(z); ReLU′(z) is 1 for z > 0 and 0 for z < 0 (the value at z = 0 is a convention). Backpropagation multiplies these factors along the chain rule, so repeatedly small derivatives shrink the gradients reaching early layers.
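These formulas are easy to check numerically, and they make the saturation problem concrete: the sigmoid derivative peaks at 0.25 and is already below 0.01 at |z| = 5.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = np.array([-5.0, 0.0, 5.0])
s = sigmoid(z)
t = np.tanh(z)

print("sigmoid':", s * (1 - s))           # peak 0.25 at z = 0; ~0.0066 at |z| = 5
print("tanh':   ", 1 - t**2)              # peak 1.0 at z = 0
print("relu':   ", (z > 0).astype(float)) # 1 for positive inputs, else 0
```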

Softmax (Multi-Class Output)

Given a vector of logits (one per class), softmax produces probabilities that sum to 1:

softmax(z)ᵢ = exp(zᵢ) / Σⱼ exp(zⱼ)

Subtract max(z) before exp for numerical stability. In PyTorch, CrossEntropyLoss expects raw logits and applies log-softmax internally—do not apply softmax twice.

Stable softmax
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

logits = np.array([2.0, 1.0, 0.1])
p = softmax(logits)
print("probs:", p, "sum:", p.sum())
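The max-subtraction trick is not cosmetic. Softmax is invariant to adding a constant to every logit, so the shift changes nothing mathematically, but with large logits the naive version overflows to NaN. A minimal sketch contrasting the two:

```python
import numpy as np

def softmax_naive(z):
    e = np.exp(z)  # overflows for large z
    return e / e.sum()

def softmax_stable(z):
    e = np.exp(z - np.max(z))  # shift is invisible to softmax
    return e / e.sum()

big = np.array([1000.0, 1001.0, 1002.0])
with np.errstate(over="ignore", invalid="ignore"):
    print("naive: ", softmax_naive(big))  # overflow -> nan
print("stable:", softmax_stable(big))     # well-defined probabilities
```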

Quick Reference: Where to Use What

Activation        | Typical role               | Notes
ReLU / Leaky ReLU | Hidden layers (MLP, CNN)   | Fast; watch dead ReLUs, use leaky/GELU if needed
Sigmoid           | Binary output probability  | Or use BCEWithLogitsLoss without explicit sigmoid
Softmax           | Multi-class probabilities  | Often paired with cross-entropy on logits
Tanh              | Some RNNs / legacy nets    | Zero-centered; saturates like sigmoid
Linear (none)     | Regression output          | Raw scalar or vector prediction

PyTorch: F.relu, nn.Sigmoid, …

Module vs functional
import torch
import torch.nn as nn
import torch.nn.functional as F

x = torch.randn(4, 8)
# Functional
y1 = F.relu(x)
y2 = torch.sigmoid(x)
# Module form (stateful only if it has learnable params; plain ReLU has none)
relu_m = nn.ReLU()
y3 = relu_m(x)

Summary

  • Nonlinear activations let deep nets represent nonlinear functions.
  • ReLU family dominates hidden layers in many domains.
  • Sigmoid/softmax relate logits to probabilities at outputs.
  • Saturating activations can slow learning in deep stacks; match loss to whether logits or probabilities go into the criterion.

Next: chain those layers together in a full forward pass (batched matrix view).