
Deep Learning Activation Functions: 20 Interview Questions

Master sigmoid, tanh, ReLU, Leaky ReLU, ELU, Swish, GeLU, Softmax and more. Vanishing gradients, dying neurons, output layers, mathematical derivatives – all with concise, interview-ready answers.

Topics: Sigmoid · ReLU · Tanh · Leaky ReLU · Softmax · Swish/GeLU
1 What is an activation function? Why is it non-linear? ⚡ Easy
Answer: An activation function decides whether a neuron should fire. It introduces non-linearity – without it, stacked linear layers collapse into a single linear transformation. Non-linearity enables neural networks to approximate arbitrary complex functions (universal approximation theorem).
y = activation(W·x + b) ; Without activation: y = W₂(W₁x + b₁)+b₂ = W'x + b' (linear)
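The collapse of stacked linear layers can be checked numerically. A minimal sketch (layer shapes chosen purely for illustration):

```python
import numpy as np

# Two linear layers with no activation in between collapse into a
# single linear map W'x + b'.
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)
W2, b2 = rng.normal(size=(2, 4)), rng.normal(size=2)

x = rng.normal(size=3)
two_layer = W2 @ (W1 @ x + b1) + b2       # stacked "deep" linear network
W_c, b_c = W2 @ W1, W2 @ b1 + b2          # equivalent single layer
assert np.allclose(two_layer, W_c @ x + b_c)
```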
2 Explain Sigmoid activation. Where is it used? Main drawback? ⚡ Easy
Answer: Sigmoid: σ(x) = 1/(1+e⁻ˣ), output in (0,1). Used in the binary-classification output layer (outputs a probability) and in LSTM gates. Drawbacks: vanishing gradient in the saturated regions, not zero-centered, and a relatively costly exp() computation.
Pros: smooth, probabilistic output. Cons: vanishing gradient, not zero-centered.
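A minimal NumPy sketch of sigmoid and its derivative; σ′ = σ(1−σ) peaks at only 0.25, which is the root of the vanishing-gradient drawback:

```python
import numpy as np

def sigmoid(x):
    # Stable form: exp is only ever applied to non-positive values.
    z = np.exp(-np.abs(x))
    return np.where(x >= 0, 1.0 / (1.0 + z), z / (1.0 + z))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)   # maximal value 0.25 at x = 0
```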
3 How is Tanh different from Sigmoid? Why is Tanh preferred in hidden layers? 📊 Medium
Answer: Tanh = (eˣ - e⁻ˣ)/(eˣ + e⁻ˣ), output range (-1,1). It is zero-centered, which helps gradient flow. Still suffers from vanishing gradient. Preferred over sigmoid in hidden units because zero-centered outputs prevent biased gradients.
sigmoid: (0,1) ; tanh: (-1,1) (zero-centered)
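The relationship can be verified numerically: tanh is just a rescaled sigmoid, tanh(x) = 2σ(2x) − 1, which is why it saturates the same way but is zero-centered.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = np.linspace(-3.0, 3.0, 7)
# tanh(x) = 2*sigmoid(2x) - 1: same S-shape, stretched to (-1, 1).
assert np.allclose(np.tanh(x), 2.0 * sigmoid(2.0 * x) - 1.0)
```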
4 What is ReLU? Why is it so widely used? ⚡ Easy
Answer: ReLU = max(0, x). Advantages: Non-saturating, cheap computation, sparse activation, mitigates vanishing gradient. It is the default for most CNNs and deep architectures.
import numpy as np
def relu(x): return np.maximum(0, x)  # derivative: 1 if x>0 else 0
5 What is the “dying ReLU” problem? Solutions? 📊 Medium
Answer: A neuron whose pre-activation is negative for all inputs outputs 0 everywhere; its gradient is exactly zero, so it can never recover. Solutions: Leaky ReLU, PReLU, ELU, or a smaller learning rate.
dead neurons → fix: Leaky ReLU
6 Differentiate Leaky ReLU, PReLU, and RReLU. 🔥 Hard
Answer: Leaky ReLU: f(x) = max(αx, x) with α fixed (typically 0.01). PReLU: α is learned. RReLU: α is randomly sampled during training (and fixed to its mean at test time). All three fix dying ReLU.
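A sketch of the three variants; the RReLU sampling bounds below are the commonly used defaults, chosen here for illustration:

```python
import numpy as np

def leaky_relu(x, alpha=0.01):
    # Fixed small negative slope keeps gradients nonzero for x < 0.
    return np.where(x > 0, x, alpha * x)

def prelu(x, alpha):
    # Same form, but alpha is a learned parameter (per channel in practice).
    return np.where(x > 0, x, alpha * x)

def rrelu(x, lo=1/8, hi=1/3, training=True, rng=None):
    # alpha sampled uniformly during training, fixed to its mean at test time.
    rng = rng or np.random.default_rng()
    alpha = rng.uniform(lo, hi) if training else (lo + hi) / 2
    return np.where(x > 0, x, alpha * x)
```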
7 What are ELU and SELU? When to use SELU? 🔥 Hard
Answer: ELU: f(x) = x if x > 0 else α(eˣ − 1). Smooth, saturates to −α for large negative inputs, robust to noise. SELU = λ·ELU with fixed constants (λ ≈ 1.0507, α ≈ 1.6733); it is self-normalizing – with LeCun-normal initialization, activations tend toward zero mean / unit variance, which works well for deep fully connected nets.
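A sketch using the standard published SELU constants from the self-normalizing networks paper:

```python
import numpy as np

def elu(x, alpha=1.0):
    # min(x, 0) keeps exp from overflowing on large positive inputs.
    return np.where(x > 0, x, alpha * (np.exp(np.minimum(x, 0.0)) - 1.0))

# Fixed constants derived so activations drift toward mean 0 / variance 1.
SELU_LAMBDA, SELU_ALPHA = 1.0507009873554805, 1.6732632423543772

def selu(x):
    return SELU_LAMBDA * elu(x, SELU_ALPHA)
```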
8 Explain Softmax. Why use exponentials? 📊 Medium
Answer: Softmax converts logits to probability distribution: eᶻⁱ/Σ eᶻʲ. Exponentials amplify differences and ensure positivity. Used in multi-class output layer.
P(y=i) = exp(z_i) / Σⱼ exp(z_j)
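A numerically stable sketch; subtracting the row max cancels in the ratio, so it changes nothing mathematically but prevents exp overflow:

```python
import numpy as np

def softmax(z):
    # Shift by the max logit before exponentiating (stability trick).
    e = np.exp(z - np.max(z, axis=-1, keepdims=True))
    return e / np.sum(e, axis=-1, keepdims=True)
```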
9 What are Swish and GeLU? Why do they outperform ReLU in Transformers? 🔥 Hard
Answer: Swish = x·sigmoid(x) (smooth, non-monotonic). GeLU = x·Φ(x) (Gaussian CDF). Both allow negative values with a smooth bump. Used in BERT, GPT, ViT; provide better gradient flow and often higher accuracy.
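A sketch of both; the GeLU here uses the common tanh approximation of x·Φ(x) rather than the exact Gaussian CDF:

```python
import numpy as np

def swish(x, beta=1.0):
    # x * sigmoid(beta * x); beta = 1 is the SiLU variant used in practice.
    return x / (1.0 + np.exp(-beta * x))

def gelu(x):
    # Tanh approximation of x * Phi(x) (as in the original BERT code).
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))
```

Both are ≈0 for large negative x, ≈x for large positive x, and dip slightly below zero in between – the smooth negative bump that distinguishes them from ReLU.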
10 Which activation functions are prone to vanishing gradient? Why? 📊 Medium
Answer: Sigmoid and Tanh – their gradients approach 0 for large |x|. ReLU does not saturate for positive inputs, but can produce dead neurons on the negative side; Leaky ReLU, ELU, and Swish keep small but nonzero negative-side gradients.
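The shrinkage compounds multiplicatively through depth, which a quick sketch makes concrete – σ′ ≤ 0.25, so n sigmoid layers scale the backpropagated gradient by at most 0.25ⁿ:

```python
import numpy as np

def sigmoid_grad(x):
    s = 1.0 / (1.0 + np.exp(-x))
    return s * (1.0 - s)

# Best case: every layer sits at x = 0, where sigma' is maximal (0.25).
grad = 1.0
for _ in range(10):
    grad *= sigmoid_grad(0.0)
# After only 10 layers the gradient has shrunk by a factor of ~1e6.
```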
11 What activation function for regression output? ⚡ Easy
Answer: Linear (identity). No activation or f(x)=x. Allows unbounded values. For positive-only regression, use ReLU or Softplus.
12 What is Softplus? Relation to ReLU? ⚡ Easy
Answer: Softplus = ln(1+eˣ). Smooth approximation of ReLU. Used in some variational autoencoders to keep predicted variances positive. Its output is never exactly zero, but it is computationally heavier than ReLU.
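A stable one-liner sketch using NumPy's log-sum-exp helper:

```python
import numpy as np

def softplus(x):
    # log(1 + e^x) = logaddexp(0, x); stable for large positive or negative x.
    return np.logaddexp(0.0, x)
```

softplus(x) → x for large x and → 0⁺ (never exactly 0) for very negative x, matching ReLU asymptotically.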
13 Output activation for binary classification? ⚡ Easy
Answer: Sigmoid (single unit). Gives probability P(y=1). Loss: binary cross-entropy.
14 Activation for multi-label classification? 📊 Medium
Answer: Sigmoid per output (independent probabilities). Not softmax (which forces one class). Use sigmoid + binary cross-entropy.
15 Why not use step function as activation? 📊 Medium
Answer: Step function derivative is 0 everywhere (except discontinuity). Gradient descent can't learn. Need differentiable (or subgradient) functions.
16 Why is zero-centered activation desirable? 🔥 Hard
Answer: If a layer's inputs are always positive (as with sigmoid or ReLU outputs), the gradients of all weights into a given neuron share the same sign. This causes zig-zagging updates and slower convergence. Zero-centered activations (tanh) avoid this.
17 Write derivative of ReLU and Leaky ReLU. 📊 Medium
Answer: ReLU′(x) = 1 if x > 0 else 0. Leaky ReLU′(x) = 1 if x > 0 else α (e.g., 0.01). At x = 0 the derivative is undefined; a subgradient convention is used.
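As code, using the common convention that the derivative at exactly 0 is taken from the negative side:

```python
import numpy as np

def relu_grad(x):
    # Convention: derivative at x = 0 taken as 0 (any value in [0,1] is valid).
    return (np.asarray(x) > 0).astype(float)

def leaky_relu_grad(x, alpha=0.01):
    return np.where(np.asarray(x) > 0, 1.0, alpha)
```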
18 Which activations are used in RNNs/LSTMs? Why? 📊 Medium
Answer: LSTM: sigmoid for the gates (values in (0,1) act as soft switches), tanh for cell/hidden updates. The gated cell state mitigates vanishing gradients; exploding gradients are typically handled by gradient clipping. Modern RNNs sometimes use ReLU with careful initialization.
19 What is Maxout activation? Pros/cons? 🔥 Hard
Answer: Maxout takes the max of k linear combinations (pieces). It can approximate any convex function and has no saturation, but it multiplies the parameter count by k. Rarely used today.
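A sketch of a maxout unit (shapes here are illustrative); with k = 2 and the pieces (x, 0) it reduces exactly to ReLU, showing how it generalizes piecewise-linear activations:

```python
import numpy as np

def maxout(x, W, b):
    # W: (k, out, in), b: (k, out) -> elementwise max over k affine pieces.
    return np.max(np.einsum('koi,i->ko', W, x) + b, axis=0)

# ReLU as a special case: pieces are (1*x + 0) and (0*x + 0).
W = np.array([[[1.0]], [[0.0]]])   # k=2, out=1, in=1
b = np.zeros((2, 1))
```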
20 Heuristic: which activation for hidden layers? ⚡ Easy
Answer: Default: ReLU. If dying ReLU, try Leaky ReLU/ELU. For very deep nets: Swish/GeLU. For regression head: linear. For self-normalizing nets: SELU.
ReLU → Leaky ReLU → ELU → Swish (increasing complexity)

Activation Functions – Interview Cheat Sheet

Output layers
  • Binary → Sigmoid
  • Multi-class → Softmax
  • Multi-label → Sigmoid (per unit)
  • Regression → Linear
Vanishing gradient
  • Sigmoid, Tanh (avoid hidden layers)
Hidden units
  • 1st: ReLU (fast, sparse)
  • 2nd: Leaky ReLU / ELU
  • 3rd: Swish / GeLU (SOTA)
Dying ReLU fix
  • Leaky ReLU, PReLU, ELU

Verdict: "ReLU first, then tune. Know your gradients!"