Deep Learning Activation Functions: 20 Interview Questions
Master sigmoid, tanh, ReLU, Leaky ReLU, ELU, Swish, GeLU, Softmax and more. Vanishing gradients, dying neurons, output layers, mathematical derivatives – all with concise, interview-ready answers.
Topics: Sigmoid · ReLU · Tanh · Leaky ReLU · Softmax · Swish/GeLU
1
What is an activation function? Why must it be non-linear?
⚡ Easy
Answer: An activation function decides whether a neuron should fire. It introduces non-linearity – without it, stacked linear layers collapse into a single linear transformation. Non-linearity enables neural networks to approximate arbitrary complex functions (universal approximation theorem).
y = activation(W·x + b) ; Without activation: y = W₂(W₁x + b₁)+b₂ = W'x + b' (linear)
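A minimal NumPy check of this collapse (illustrative sketch, not part of the original answer): two stacked bias-free-or-not linear layers reduce exactly to a single linear map.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 3))                     # batch of 4 inputs, 3 features
W1, b1 = rng.normal(size=(3, 5)), rng.normal(size=5)
W2, b2 = rng.normal(size=(5, 2)), rng.normal(size=2)

# Two stacked linear layers with no activation in between...
y_stacked = (x @ W1 + b1) @ W2 + b2

# ...equal one linear layer with W' = W1·W2 and b' = b1·W2 + b2
W_prime, b_prime = W1 @ W2, b1 @ W2 + b2
y_single = x @ W_prime + b_prime

print(np.allclose(y_stacked, y_single))         # True – depth added nothing
```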
2
Explain Sigmoid activation. Where is it used? Main drawback?
⚡ Easy
Answer: Sigmoid: σ(x) = 1/(1+e⁻ˣ), output in (0,1). Used in the binary classification output layer (outputs a probability) and in LSTM gates. Drawbacks: vanishing gradient in the saturated regions, not zero-centered, relatively costly exp() computation.
Pros: smooth, probabilistic output | Cons: vanishing gradient, not zero-centered
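A quick numerical sketch of the saturation problem (my own illustration): the sigmoid gradient peaks at 0.25 and collapses toward zero away from the origin.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)              # maximum 0.25, reached at x = 0

print(sigmoid(0.0))                   # 0.5
print(sigmoid_grad(0.0))              # 0.25
print(sigmoid_grad(10.0))             # ~4.5e-05 – saturated, gradient vanishes
```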
3
How is Tanh different from Sigmoid? Why is Tanh preferred in hidden layers?
📊 Medium
Answer: Tanh = (eˣ - e⁻ˣ)/(eˣ + e⁻ˣ), output range (-1,1). It is zero-centered, which helps gradient flow. Still suffers from vanishing gradient. Preferred over sigmoid in hidden units because zero-centered outputs prevent biased gradients.
sigmoid: [0,1] ; tanh: [-1,1] (zero-centered)
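The zero-centering claim is easy to verify numerically (illustrative sketch); it also uses the standard identity tanh(x) = 2·sigmoid(2x) − 1.

```python
import numpy as np

x = np.linspace(-3, 3, 1001)               # symmetric grid around 0
print(np.tanh(x).mean())                   # ~0   – zero-centered
print((1 / (1 + np.exp(-x))).mean())       # ~0.5 – sigmoid is biased positive

# tanh is a scaled, shifted sigmoid
assert np.allclose(np.tanh(x), 2 / (1 + np.exp(-2 * x)) - 1)
```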
4
What is ReLU? Why is it so widely used?
⚡ Easy
Answer: ReLU = max(0, x). Advantages: Non-saturating, cheap computation, sparse activation, mitigates vanishing gradient. It is the default for most CNNs and deep architectures.
import numpy as np
def relu(x): return np.maximum(0, x) # derivative: 1 if x>0 else 0
5
What is the “dying ReLU” problem? Solutions?
📊 Medium
Answer: When neurons get stuck in the negative region and output 0 for all inputs, their gradient is zero and they never recover. Solutions: Leaky ReLU, PReLU, ELU, or a smaller learning rate.
Symptom: dead neurons → Fix: Leaky ReLU
6
Differentiate Leaky ReLU, PReLU, and RReLU.
🔥 Hard
Answer: Leaky ReLU: f(x) = max(αx, x) with α fixed (typically 0.01). PReLU: α learned by backprop. RReLU: α randomly sampled during training, fixed to its mean at test time. All three keep a non-zero gradient for x < 0, fixing dying ReLU.
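A sketch of the three variants (the RReLU bounds 1/8 and 1/3 are the commonly cited defaults, assumed here; PReLU is the same formula as Leaky ReLU with α treated as a trainable parameter):

```python
import numpy as np

def leaky_relu(x, alpha=0.01):                 # alpha fixed
    return np.where(x > 0, x, alpha * x)

# PReLU uses the same formula, but alpha is a learnable parameter
# (often one per channel) updated by backprop.

def rrelu(x, lower=1/8, upper=1/3, training=True, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    # random alpha per forward pass in training; fixed to the mean at test time
    alpha = rng.uniform(lower, upper) if training else (lower + upper) / 2
    return np.where(x > 0, x, alpha * x)

print(leaky_relu(np.array([-2.0, 3.0])))       # [-0.02  3.  ]
```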
7
What are ELU and SELU? When to use SELU?
🔥 Hard
Answer: ELU: f(x) = x if x > 0 else α(eˣ − 1). Smooth, saturates to −α for large negative inputs, robust to noise. SELU is a scaled ELU that is self-normalizing: with LeCun-normal initialization, activations tend toward zero mean and unit variance layer after layer – well suited to deep fully connected networks.
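A minimal sketch of both (the SELU constants are the fixed values from the self-normalizing networks paper):

```python
import numpy as np

def elu(x, alpha=1.0):
    return np.where(x > 0, x, alpha * (np.exp(x) - 1))

# SELU = scale * ELU(x, alpha) with fixed constants (Klambauer et al.)
SELU_ALPHA = 1.6732632423543772
SELU_SCALE = 1.0507009873554805

def selu(x):
    return SELU_SCALE * np.where(x > 0, x, SELU_ALPHA * (np.exp(x) - 1))

print(elu(np.array([-1.0, 2.0])))     # negative side saturates toward -alpha
```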
8
Explain Softmax. Why use exponentials?
📊 Medium
Answer: Softmax converts logits to a probability distribution: exp(z_i)/Σⱼ exp(z_j). Exponentials amplify differences and guarantee positivity. Used in the multi-class output layer.
P(y=i) = exp(z_i) / Σⱼ exp(z_j)
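In practice softmax is implemented with a max-shift for numerical stability (softmax is invariant to adding a constant to all logits, so this changes nothing mathematically) – a sketch:

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)   # shift to avoid exp() overflow
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

p = softmax(np.array([2.0, 1.0, 0.1]))
print(p, p.sum())                            # probabilities summing to 1
```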
9
What are Swish and GeLU? Why do they outperform ReLU in Transformers?
🔥 Hard
Answer: Swish = x·sigmoid(x) (smooth, non-monotonic; this β = 1 case is also called SiLU). GeLU = x·Φ(x), where Φ is the standard Gaussian CDF. Both pass small negative values with a smooth dip instead of clipping at zero. Used in BERT, GPT, and ViT; they give better gradient flow and often higher accuracy than ReLU.
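A sketch of both, using the tanh approximation of GeLU popularized by BERT-style implementations (the exact form uses the Gaussian CDF via erf):

```python
import numpy as np

def swish(x):                        # x · sigmoid(x), a.k.a. SiLU
    return x / (1.0 + np.exp(-x))

def gelu(x):                         # tanh approximation of x · Φ(x)
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

# Both are ~0 for large negative x, ~x for large positive x,
# and dip slightly below zero for small negative inputs.
print(swish(np.array([-0.5, 0.0, 3.0])))
print(gelu(np.array([-0.5, 0.0, 3.0])))
```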
10
Which activation functions are prone to vanishing gradient? Why?
📊 Medium
Answer: Sigmoid and Tanh – their gradients approach 0 for large |x|, so deep stacks multiply many small numbers. ReLU keeps a gradient of exactly 1 for x > 0 but can produce dead neurons for x < 0; Leaky ReLU/ELU/Swish keep small but non-zero gradients in the negative region.
11
What activation function for regression output?
⚡ Easy
Answer: Linear (identity) – i.e., no activation, f(x) = x – so the output is unbounded. For positive-only targets, use ReLU or Softplus.
12
What is Softplus? Relation to ReLU?
⚡ Easy
Answer: Softplus = ln(1 + eˣ). A smooth approximation of ReLU whose derivative is the sigmoid. Used where strictly positive outputs are needed (e.g., predicting a variance in variational autoencoders). Never exactly zero, and computationally heavier than ReLU.
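A sketch using the overflow-safe formulation (ln(1+eˣ) overflows for large x if computed naively):

```python
import numpy as np

def softplus(x):
    # log1p(exp(-|x|)) + max(x, 0) equals ln(1 + e^x) without overflow
    return np.log1p(np.exp(-np.abs(x))) + np.maximum(x, 0)

x = np.array([-10.0, 0.0, 10.0])
print(softplus(x))   # ≈ [4.54e-05, 0.6931, 10.0000454] – approaches ReLU
```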
13
Output activation for binary classification?
⚡ Easy
Answer: Sigmoid (single unit). Gives probability P(y=1). Loss: binary cross-entropy.
14
Activation for multi-label classification?
📊 Medium
Answer: Sigmoid per output unit (independent probabilities – several labels can be active at once). Not softmax, which forces the probabilities to compete and sum to 1. Pair sigmoid with binary cross-entropy.
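A sketch for a hypothetical 3-label task (logit and label values are made up for illustration):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

logits = np.array([2.0, -1.0, 0.5])     # one logit per label
labels = np.array([1.0, 0.0, 1.0])      # several labels may be 1 at once

p = sigmoid(logits)                     # independent probabilities
bce = -(labels * np.log(p) + (1 - labels) * np.log(1 - p)).mean()
print(p, bce)
```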
15
Why not use step function as activation?
📊 Medium
Answer: Step function derivative is 0 everywhere (except discontinuity). Gradient descent can't learn. Need differentiable (or subgradient) functions.
16
Why is zero-centered activation desirable?
🔥 Hard
Answer: If a neuron's inputs (the previous layer's activations) are all positive – as with sigmoid or ReLU – then the gradients of that neuron's weights all share the same sign on a given example. This forces zig-zagging updates and slows convergence. Zero-centered activations (tanh) avoid this.
17
Write derivative of ReLU and Leaky ReLU.
📊 Medium
Answer: ReLU′(x) = 1 if x > 0 else 0 (undefined at x = 0; in practice a subgradient of 0 or 1 is used). Leaky ReLU′(x) = 1 if x > 0 else α (e.g., 0.01).
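The derivatives in code (sketch; the x = 0 case is resolved by convention to 0 here):

```python
import numpy as np

def relu_grad(x):
    return (x > 0).astype(float)            # subgradient convention: 0 at x = 0

def leaky_relu_grad(x, alpha=0.01):
    return np.where(x > 0, 1.0, alpha)

print(relu_grad(np.array([-2.0, 3.0])))         # [0. 1.]
print(leaky_relu_grad(np.array([-2.0, 3.0])))   # [0.01 1.  ]
```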
18
Which activations are used in RNNs/LSTMs? Why?
📊 Medium
Answer: LSTMs use sigmoid for the gates (outputs in (0,1) act as soft switches on information flow) and tanh for the cell/hidden-state updates (bounded in (−1,1), which keeps values stable). The gating mechanism mitigates vanishing gradients over long sequences. Modern RNNs sometimes use ReLU with careful initialization.
19
What is Maxout activation? Pros/cons?
🔥 Hard
Answer: Maxout takes the max of k learned linear combinations of the input. It can approximate any convex function and has no saturation, so no vanishing gradient – but it multiplies the parameter count by k (doubles it for k = 2). Rarely used today.
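A sketch of a maxout layer with k = 2 pieces (shapes and values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 3))                 # batch of 4, 3 input features
k, d_out = 2, 5                             # k linear pieces per output unit
W = rng.normal(size=(k, 3, d_out))
b = rng.normal(size=(k, d_out))

# k parallel affine maps, then an elementwise max over the k pieces
z = np.einsum('bi,kio->bko', x, W) + b      # shape (4, k, 5)
y = z.max(axis=1)                           # shape (4, 5)
print(y.shape)                              # (4, 5)
```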
20
Heuristic: which activation for hidden layers?
⚡ Easy
Answer: Default: ReLU. If dying ReLU, try Leaky ReLU/ELU. For very deep nets: Swish/GeLU. For regression head: linear. For self-normalizing nets: SELU.
ReLU → Leaky ReLU → ELU → Swish (increasing complexity)
Activation Functions – Interview Cheat Sheet
Output layers
- Binary → Sigmoid
- Multi-class → Softmax
- Multi-label → Sigmoid (per unit)
- Regression → Linear
Vanishing gradient
- Sigmoid, Tanh (avoid in hidden layers)
Hidden units
- 1st choice: ReLU (fast, sparse)
- 2nd choice: Leaky ReLU / ELU
- 3rd choice: Swish / GeLU (SOTA)
Dying ReLU fix
- Leaky ReLU, PReLU, ELU
Verdict: "ReLU first, then tune. Know your gradients!"