Activation Functions
Activations sit between linear layers and decide how each neuron "fires." Without them, a deep stack of matrices would collapse to a single linear map. This page covers the usual suspects (ReLU, sigmoid, tanh, softmax) with formulas, intuition, and small code examples.
Why Nonlinear Activations?
If you only multiply by matrices and add biases, composing layers still yields one matrix: W₃(W₂(W₁x)) = Wx for a single combined W = W₃W₂W₁. You cannot bend the decision boundary into curves or disjoint regions. Inserting a nonlinearity σ between layers breaks this equivalence: W₃ σ(W₂ σ(W₁x)) can approximate far richer functions (subject to width/depth).
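A quick numerical check (a minimal NumPy sketch; the matrix shapes are arbitrary): two linear layers with no activation produce exactly the same outputs as the single collapsed matrix W₂W₁.
import numpy as np
rng = np.random.default_rng(0)
W1 = rng.normal(size=(5, 3))
W2 = rng.normal(size=(2, 5))
x = rng.normal(size=(3,))
two_layers = W2 @ (W1 @ x)     # "deep" but purely linear
one_matrix = (W2 @ W1) @ x     # the collapsed equivalent
print(np.allclose(two_layers, one_matrix))  # True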
ReLU and Variants
ReLU(z) = max(0, z) is the default hidden activation in many vision MLPs/CNNs: cheap, sparse (many zeros), and gradients do not shrink for positive activations.
Leaky ReLU uses a small slope α for z < 0: max(αz, z), reducing "dead neurons" that never activate. GELU and Swish are smooth alternatives popular in Transformers and some modern CNNs.
import numpy as np
def relu(z): return np.maximum(0, z)
def leaky_relu(z, alpha=0.01):
    # small negative slope instead of a hard zero for z < 0
    return np.where(z > 0, z, alpha * z)
z = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print("ReLU: ", relu(z))
print("Leaky:", leaky_relu(z))
Sigmoid and Tanh
σ(z) = 1 / (1 + e⁻ᶻ) maps ℝ → (0, 1). Good for binary probabilities at the output (often with BCE loss). tanh(z) maps to (-1, 1) and is zero-centered, which can help compared to sigmoid in some setups.
Both saturate for large |z|: derivatives ≈ 0, which can contribute to vanishing gradients in very deep sigmoid/tanh networks. That is one reason ReLU became standard for hidden layers.
Sigmoid & tanh
import numpy as np
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))
z = np.array([-3.0, 0.0, 3.0])
print("sigmoid:", sigmoid(z))
print("tanh:   ", np.tanh(z))
Derivatives (concept)
σ′(z) = σ(z)(1 − σ(z)); tanh′(z) = 1 − tanh²(z); ReLU′(z) is 1 for z > 0 and 0 for z < 0 (undefined at 0, usually taken as 0). Backpropagation multiplies these local derivatives along the chain rule, so near-zero derivatives shrink the gradient signal.
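A quick finite-difference check of those formulas, plus a look at saturation (a minimal NumPy sketch; the step size h is arbitrary):
import numpy as np
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))
z = np.array([-3.0, 0.0, 3.0])
h = 1e-6
d_sig = sigmoid(z) * (1 - sigmoid(z))                   # analytic derivative
num_sig = (sigmoid(z + h) - sigmoid(z - h)) / (2 * h)   # central difference
d_tanh = 1 - np.tanh(z) ** 2
num_tanh = (np.tanh(z + h) - np.tanh(z - h)) / (2 * h)
print(np.allclose(d_sig, num_sig), np.allclose(d_tanh, num_tanh))  # True True
print("sigmoid':", d_sig)  # ~0.045 at |z| = 3: already close to saturation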
Softmax (Multi-Class Output)
Given a vector of logits (one per class), softmax produces probabilities that sum to 1:
softmax(z)ᵢ = e^zᵢ / Σⱼ e^zⱼ
Subtract max(z) before exp for numerical stability. In PyTorch, CrossEntropyLoss expects raw logits and applies log-softmax internally—do not apply softmax twice.
import numpy as np
def softmax(z):
    e = np.exp(z - np.max(z))  # subtract max for numerical stability
    return e / e.sum()
logits = np.array([2.0, 1.0, 0.1])
p = softmax(logits)
print("probs:", p, "sum:", p.sum())
Quick Reference: Where to Use What
| Activation | Typical role | Notes |
|---|---|---|
| ReLU / Leaky ReLU | Hidden layers (MLP, CNN) | Fast; watch dead ReLUs / use leaky/GELU if needed |
| Sigmoid | Binary output probability | Or use BCEWithLogitsLoss without explicit sigmoid |
| Softmax | Multi-class probabilities | Often paired with cross-entropy on logits |
| Tanh | Some RNNs / legacy nets | Zero-centered; saturates like sigmoid |
| Linear (none) | Regression output | Raw scalar or vector prediction |
PyTorch: F.relu, nn.Sigmoid, …
import torch
import torch.nn as nn
import torch.nn.functional as F
x = torch.randn(4, 8)
# Functional
y1 = F.relu(x)
y2 = torch.sigmoid(x)
# Module form (useful when composing layers; plain ReLU has no state or learned params)
relu_m = nn.ReLU()
y3 = relu_m(x)
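The module form is handy when stacking layers, e.g. inside nn.Sequential (a minimal sketch; the layer sizes are arbitrary):
import torch
import torch.nn as nn
mlp = nn.Sequential(
    nn.Linear(8, 16),
    nn.ReLU(),          # hidden nonlinearity
    nn.Linear(16, 3),   # outputs raw logits; pair with CrossEntropyLoss
)
x = torch.randn(4, 8)
print(mlp(x).shape)  # torch.Size([4, 3])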
Summary
- Nonlinear activations let deep nets represent non-linear functions.
- ReLU family dominates hidden layers in many domains.
- Sigmoid/softmax relate logits to probabilities at outputs.
- Saturating activations can slow learning in deep stacks; make sure the criterion matches what you feed it (raw logits vs. probabilities).
Next: chain those layers together in a full forward pass (batched matrix view).