Deep Learning Activation Functions: 20 Interview Questions
Master sigmoid, tanh, ReLU, Leaky ReLU, ELU, Swish, GeLU, Softmax and more. Vanishing gradients, dying neurons, output layers, mathematical derivatives – all with concise, interview-ready answers.
Topics: Sigmoid · ReLU · Tanh · Leaky ReLU · Softmax · Swish/GeLU
1
What is an activation function? Why must it be non-linear?
⚡ Easy
Answer: An activation function decides whether a neuron should fire. It introduces non-linearity – without it, stacked linear layers collapse into a single linear transformation. Non-linearity enables neural networks to approximate arbitrary complex functions (universal approximation theorem).
y = activation(W·x + b) ; Without activation: y = W₂(W₁x + b₁)+b₂ = W'x + b' (linear)
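A minimal NumPy check of this collapse (illustrative sketch, not part of the original answer): two stacked bias-free-or-not linear layers reduce exactly to a single linear map.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 3))                     # batch of 4 inputs, 3 features
W1, b1 = rng.normal(size=(3, 5)), rng.normal(size=5)
W2, b2 = rng.normal(size=(5, 2)), rng.normal(size=2)

# Two stacked linear layers with no activation in between...
y_stacked = (x @ W1 + b1) @ W2 + b2

# ...equal one linear layer with W' = W1·W2 and b' = b1·W2 + b2
W_prime, b_prime = W1 @ W2, b1 @ W2 + b2
y_single = x @ W_prime + b_prime

print(np.allclose(y_stacked, y_single))         # True – depth added nothing
```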
2
Explain Sigmoid activation. Where is it used? Main drawback?
⚡ Easy
Answer: Sigmoid: σ(x) = 1/(1+e⁻ˣ), output in (0,1). Used in the binary classification output layer (outputs a probability) and in LSTM gates. Drawbacks: vanishing gradient in the saturated regions, not zero-centered, relatively costly exp() computation.
Pros: smooth, probabilistic output | Cons: vanishing gradient, not zero-centered
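A quick numerical sketch of the saturation problem (my own illustration): the sigmoid gradient peaks at 0.25 and collapses toward zero away from the origin.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)              # maximum 0.25, reached at x = 0

print(sigmoid(0.0))                   # 0.5
print(sigmoid_grad(0.0))              # 0.25
print(sigmoid_grad(10.0))             # ~4.5e-05 – saturated, gradient vanishes
```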
3
How is Tanh different from Sigmoid? Why is Tanh preferred in hidden layers?
📊 Medium
Answer: Tanh = (eˣ - e⁻ˣ)/(eˣ + e⁻ˣ), output range (-1,1). It is zero-centered, which helps gradient flow. Still suffers from vanishing gradient. Preferred over sigmoid in hidden units because zero-centered outputs prevent biased gradients.
sigmoid: [0,1] ; tanh: [-1,1] (zero-centered)
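The zero-centering claim is easy to verify numerically (illustrative sketch); it also uses the standard identity tanh(x) = 2·sigmoid(2x) − 1.

```python
import numpy as np

x = np.linspace(-3, 3, 1001)               # symmetric grid around 0
print(np.tanh(x).mean())                   # ~0   – zero-centered
print((1 / (1 + np.exp(-x))).mean())       # ~0.5 – sigmoid is biased positive

# tanh is a scaled, shifted sigmoid
assert np.allclose(np.tanh(x), 2 / (1 + np.exp(-2 * x)) - 1)
```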
4
What is ReLU? Why is it so widely used?
⚡ Easy
Answer: ReLU = max(0, x). Advantages: Non-saturating, cheap computation, sparse activation, mitigates vanishing gradient. It is the default for most CNNs and deep architectures.
import numpy as np
def relu(x): return np.maximum(0, x) # derivative: 1 if x>0 else 0
5
What is the “dying ReLU” problem? Solutions?
📊 Medium
Answer: When neurons get stuck in the negative region and output 0 for all inputs, their gradient is zero and they never recover. Solutions: Leaky ReLU, PReLU, ELU, or a smaller learning rate.
Symptom: dead neurons → Fix: Leaky ReLU
6
Differentiate Leaky ReLU, PReLU, and RReLU.
🔥 Hard
Answer: Leaky ReLU: f(x) = max(αx, x) with α fixed (typically 0.01). PReLU: α learned by backprop. RReLU: α randomly sampled during training, fixed to its mean at test time. All three keep a non-zero gradient for x < 0, fixing dying ReLU.
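A sketch of the three variants (the RReLU bounds 1/8 and 1/3 are the commonly cited defaults, assumed here; PReLU is the same formula as Leaky ReLU with α treated as a trainable parameter):

```python
import numpy as np

def leaky_relu(x, alpha=0.01):                 # alpha fixed
    return np.where(x > 0, x, alpha * x)

# PReLU uses the same formula, but alpha is a learnable parameter
# (often one per channel) updated by backprop.

def rrelu(x, lower=1/8, upper=1/3, training=True, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    # random alpha per forward pass in training; fixed to the mean at test time
    alpha = rng.uniform(lower, upper) if training else (lower + upper) / 2
    return np.where(x > 0, x, alpha * x)

print(leaky_relu(np.array([-2.0, 3.0])))       # [-0.02  3.  ]
```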
7
What are ELU and SELU? When to use SELU?
🔥 Hard
Answer: ELU: f(x) = x if x > 0 else α(eˣ − 1). Smooth, saturates to −α for large negative inputs, robust to noise. SELU is a scaled ELU that is self-normalizing: with LeCun-normal initialization, activations tend toward zero mean and unit variance layer after layer – well suited to deep fully connected networks.
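A minimal sketch of both (the SELU constants are the fixed values from the self-normalizing networks paper):

```python
import numpy as np

def elu(x, alpha=1.0):
    return np.where(x > 0, x, alpha * (np.exp(x) - 1))

# SELU = scale * ELU(x, alpha) with fixed constants (Klambauer et al.)
SELU_ALPHA = 1.6732632423543772
SELU_SCALE = 1.0507009873554805

def selu(x):
    return SELU_SCALE * np.where(x > 0, x, SELU_ALPHA * (np.exp(x) - 1))

print(elu(np.array([-1.0, 2.0])))     # negative side saturates toward -alpha
```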
8
Explain Softmax. Why use exponentials?
📊 Medium
Answer: Softmax converts logits to a probability distribution: exp(z_i)/Σⱼ exp(z_j). Exponentials amplify differences and guarantee positivity. Used in the multi-class output layer.
P(y=i) = exp(z_i) / Σⱼ exp(z_j)
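In practice softmax is implemented with a max-shift for numerical stability (softmax is invariant to adding a constant to all logits, so this changes nothing mathematically) – a sketch:

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)   # shift to avoid exp() overflow
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

p = softmax(np.array([2.0, 1.0, 0.1]))
print(p, p.sum())                            # probabilities summing to 1
```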
9
What are Swish and GeLU? Why do they outperform ReLU in Transformers?
🔥 Hard
Answer: Swish = x·sigmoid(x) (smooth, non-monotonic; this β = 1 case is also called SiLU). GeLU = x·Φ(x), where Φ is the standard Gaussian CDF. Both pass small negative values with a smooth dip instead of clipping at zero. Used in BERT, GPT, and ViT; they give better gradient flow and often higher accuracy than ReLU.
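A sketch of both, using the tanh approximation of GeLU popularized by BERT-style implementations (the exact form uses the Gaussian CDF via erf):

```python
import numpy as np

def swish(x):                        # x · sigmoid(x), a.k.a. SiLU
    return x / (1.0 + np.exp(-x))

def gelu(x):                         # tanh approximation of x · Φ(x)
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

# Both are ~0 for large negative x, ~x for large positive x,
# and dip slightly below zero for small negative inputs.
print(swish(np.array([-0.5, 0.0, 3.0])))
print(gelu(np.array([-0.5, 0.0, 3.0])))
```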
10
Which activation functions are prone to vanishing gradient? Why?
📊 Medium
Answer: Sigmoid and Tanh – their gradients approach 0 for large |x|, so deep stacks multiply many small numbers. ReLU keeps a gradient of exactly 1 for x > 0 but can produce dead neurons for x < 0; Leaky ReLU/ELU/Swish keep small but non-zero gradients in the negative region.
11
What activation function for regression output?
⚡ Easy
Answer: Linear (identity) – i.e., no activation, f(x) = x – so the output is unbounded. For positive-only targets, use ReLU or Softplus.
12
What is Softplus? Relation to ReLU?
⚡ Easy
Answer: Softplus = ln(1 + eˣ). A smooth approximation of ReLU whose derivative is the sigmoid. Used where strictly positive outputs are needed (e.g., predicting a variance in variational autoencoders). Never exactly zero, and computationally heavier than ReLU.
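A sketch using the overflow-safe formulation (ln(1+eˣ) overflows for large x if computed naively):

```python
import numpy as np

def softplus(x):
    # log1p(exp(-|x|)) + max(x, 0) equals ln(1 + e^x) without overflow
    return np.log1p(np.exp(-np.abs(x))) + np.maximum(x, 0)

x = np.array([-10.0, 0.0, 10.0])
print(softplus(x))   # ≈ [4.54e-05, 0.6931, 10.0000454] – approaches ReLU
```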
13
Output activation for binary classification?
⚡ Easy
Answer: Sigmoid (single unit). Gives probability P(y=1). Loss: binary cross-entropy.
14
Activation for multi-label classification?
📊 Medium
Answer: Sigmoid per output unit (independent probabilities – several labels can be active at once). Not softmax, which forces the probabilities to compete and sum to 1. Pair sigmoid with binary cross-entropy.
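A sketch for a hypothetical 3-label task (logit and label values are made up for illustration):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

logits = np.array([2.0, -1.0, 0.5])     # one logit per label
labels = np.array([1.0, 0.0, 1.0])      # several labels may be 1 at once

p = sigmoid(logits)                     # independent probabilities
bce = -(labels * np.log(p) + (1 - labels) * np.log(1 - p)).mean()
print(p, bce)
```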
15
Why not use step function as activation?
📊 Medium
Answer: Step function derivative is 0 everywhere (except discontinuity). Gradient descent can't learn. Need differentiable (or subgradient) functions.
16
Why is zero-centered activation desirable?
🔥 Hard
Answer: If a neuron's inputs (the previous layer's activations) are all positive – as with sigmoid or ReLU – then the gradients of that neuron's weights all share the same sign on a given example. This forces zig-zagging updates and slows convergence. Zero-centered activations (tanh) avoid this.
17
Write derivative of ReLU and Leaky ReLU.
📊 Medium
Answer: ReLU′(x) = 1 if x > 0 else 0 (undefined at x = 0; in practice a subgradient of 0 or 1 is used). Leaky ReLU′(x) = 1 if x > 0 else α (e.g., 0.01).
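The derivatives in code (sketch; the x = 0 case is resolved by convention to 0 here):

```python
import numpy as np

def relu_grad(x):
    return (x > 0).astype(float)            # subgradient convention: 0 at x = 0

def leaky_relu_grad(x, alpha=0.01):
    return np.where(x > 0, 1.0, alpha)

print(relu_grad(np.array([-2.0, 3.0])))         # [0. 1.]
print(leaky_relu_grad(np.array([-2.0, 3.0])))   # [0.01 1.  ]
```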
18
Which activations are used in RNNs/LSTMs? Why?
📊 Medium
Answer: LSTMs use sigmoid for the gates (outputs in (0,1) act as soft switches on information flow) and tanh for the cell/hidden-state updates (bounded in (−1,1), which keeps values stable). The gating mechanism mitigates vanishing gradients over long sequences. Modern RNNs sometimes use ReLU with careful initialization.
19
What is Maxout activation? Pros/cons?
🔥 Hard
Answer: Maxout takes the max of k learned linear combinations of the input. It can approximate any convex function and has no saturation, so no vanishing gradient – but it multiplies the parameter count by k (doubles it for k = 2). Rarely used today.
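A sketch of a maxout layer with k = 2 pieces (shapes and values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 3))                 # batch of 4, 3 input features
k, d_out = 2, 5                             # k linear pieces per output unit
W = rng.normal(size=(k, 3, d_out))
b = rng.normal(size=(k, d_out))

# k parallel affine maps, then an elementwise max over the k pieces
z = np.einsum('bi,kio->bko', x, W) + b      # shape (4, k, 5)
y = z.max(axis=1)                           # shape (4, 5)
print(y.shape)                              # (4, 5)
```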
20
Heuristic: which activation for hidden layers?
⚡ Easy
Answer: Default: ReLU. If dying ReLU, try Leaky ReLU/ELU. For very deep nets: Swish/GeLU. For regression head: linear. For self-normalizing nets: SELU.
ReLU → Leaky ReLU → ELU → Swish (increasing complexity)
Activation Functions – Interview Cheat Sheet
Output layers
- Binary → Sigmoid
- Multi-class → Softmax
- Multi-label → Sigmoid (per unit)
- Regression → Linear
Vanishing gradient
- Sigmoid, Tanh (avoid in hidden layers)
Hidden units
- 1st choice: ReLU (fast, sparse)
- 2nd choice: Leaky ReLU / ELU
- 3rd choice: Swish / GeLU (SOTA)
Dying ReLU fix
- Leaky ReLU, PReLU, ELU
Verdict: "ReLU first, then tune. Know your gradients!"