Activation Functions — 15 Interview Questions
ReLU family, sigmoid/tanh trade-offs, softmax, smooth alternatives, and what interviewers probe on gradients and dead neurons.
Topics: Nonlinearity, Gradients, ReLU, Softmax
1. Why do neural networks need nonlinear activation functions? (Easy)
Answer: A stack of linear layers without nonlinearities is still one affine map. Nonlinear activations let the network compose functions and approximate curved decision boundaries.
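A minimal NumPy sketch (shapes chosen arbitrarily) that makes the collapse concrete:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(4,))      # input vector
W1 = rng.normal(size=(5, 4))   # first linear "layer"
W2 = rng.normal(size=(3, 5))   # second linear "layer"

two_layers = W2 @ (W1 @ x)     # no nonlinearity in between
one_layer = (W2 @ W1) @ x      # a single equivalent linear map

print(np.allclose(two_layers, one_layer))  # True: the extra depth added nothing
```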
2. Define ReLU and why it is popular. (Easy)
Answer: ReLU(x) = max(0, x). It is cheap to compute, often yields sparse activations, and does not saturate for positive inputs, so it avoids the vanishing gradients that saturating sigmoids cause in deep hidden layers.
ReLU(x) = max(0, x)
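A tiny illustrative sketch (relu here is our own helper, not a library call):

```python
import numpy as np

def relu(x):
    # elementwise max(0, x)
    return np.maximum(0.0, x)

z = np.array([-2.0, -0.5, 0.0, 1.5, 3.0])
print(relu(z))                # [0.  0.  0.  1.5 3. ]
print(np.mean(relu(z) == 0))  # 0.6 -- the "sparse activations" in practice
```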
3. What is the “dead ReLU” problem? (Medium)
Answer: If a neuron’s weights and bias keep its pre-activation ≤ 0 for every input, ReLU outputs 0 and the gradient w.r.t. that unit is 0, so it may stop updating permanently. Leaky ReLU, PReLU, or better initialization/learning rates mitigate this.
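A toy demonstration of the trap, using the common subgradient convention (relu_grad is an illustrative helper):

```python
import numpy as np

def relu_grad(z):
    # subgradient convention: 1 for z > 0, else 0
    return (z > 0).astype(float)

# a unit whose pre-activation is negative for every input in the batch
z = np.array([-3.1, -0.2, -1.7, -0.9])
print(relu_grad(z))  # [0. 0. 0. 0.] -> no gradient signal; the unit stops learning
```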
4. Sigmoid: formula and typical use today. (Easy)
Answer: σ(x) = 1/(1+e^(−x)), outputs (0,1). Saturates for large |x| (small gradients). Still used for binary probabilities at output or gates; less common in deep hidden stacks than ReLU.
σ(x) = 1 / (1 + e^(-x))
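A sketch of a numerically stable implementation; the piecewise form is a common trick to avoid overflow in exp (sigmoid is our own helper):

```python
import numpy as np

def sigmoid(x):
    # evaluate exp only on non-positive arguments so it never overflows
    out = np.empty_like(x, dtype=float)
    pos = x >= 0
    out[pos] = 1.0 / (1.0 + np.exp(-x[pos]))
    ex = np.exp(x[~pos])
    out[~pos] = ex / (1.0 + ex)
    return out

print(sigmoid(np.array([-500.0, 0.0, 500.0])))  # [0.  0.5 1. ], no overflow warning
```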
5. Tanh vs. sigmoid for hidden layers (historically). (Medium)
Answer: Tanh is zero-centered with outputs in (−1, 1), which avoids the positive bias shift that sigmoid’s (0, 1) outputs impose on the next layer. Both still saturate; modern deep nets favor ReLU-like activations for hidden layers.
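The identity tanh(x) = 2σ(2x) − 1 makes the relationship precise; a quick NumPy check:

```python
import numpy as np

x = np.linspace(-4, 4, 9)
sigma = 1.0 / (1.0 + np.exp(-2 * x))           # sigma(2x)
print(np.allclose(np.tanh(x), 2 * sigma - 1))  # True: tanh is a rescaled sigmoid
```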
6. What does softmax do and where is it used? (Easy)
Answer: Maps a vector of logits to a probability distribution (non-negative, sum to 1). Standard for multi-class mutually exclusive classification outputs; often paired with cross-entropy loss.
softmax(z_i) = e^{z_i} / Σ_j e^{z_j}
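In practice the max is subtracted first so exp never sees a large positive argument; a sketch (softmax is our own helper):

```python
import numpy as np

def softmax(z):
    # shifting by max(z) leaves the result unchanged but prevents overflow
    e = np.exp(z - np.max(z))
    return e / e.sum()

logits = np.array([1000.0, 1001.0, 1002.0])
print(softmax(logits))        # well-defined; naive exp(1000.0) would overflow
print(softmax(logits).sum())  # 1.0
```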
7. Leaky ReLU and Parametric ReLU (PReLU): what is the idea? (Medium)
Answer: For x < 0, use a small slope α instead of 0 so gradients can flow. PReLU learns α per channel/unit. Goal: reduce dead neurons while keeping ReLU-like behavior for positives.
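A minimal sketch (leaky_relu is an illustrative helper; in PReLU, alpha would be a learned parameter rather than a constant):

```python
import numpy as np

def leaky_relu(x, alpha=0.01):
    # small fixed slope for x <= 0 instead of a hard zero
    return np.where(x > 0, x, alpha * x)

z = np.array([-3.0, -0.5, 0.0, 2.0])
print(leaky_relu(z))               # [-0.03  -0.005  0.     2.   ]
print(np.where(z > 0, 1.0, 0.01))  # gradient is never exactly zero for z < 0
```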
8. What is GELU and why do Transformers use it? (Hard)
Answer: GELU is a smooth, probabilistic gating nonlinearity: GELU(x) = x·Φ(x), where Φ is the standard Gaussian CDF. It performs well in Transformers and avoids ReLU’s hard kink at zero; implementations often use fast tanh or sigmoid approximations for speed.
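A sketch comparing the exact erf form with the widely used tanh approximation (both helpers are illustrative):

```python
import math
import numpy as np

def gelu_exact(x):
    # x * Phi(x), with Phi the standard Gaussian CDF written via erf
    return np.array([v * 0.5 * (1.0 + math.erf(v / math.sqrt(2.0))) for v in x])

def gelu_tanh(x):
    # common fast tanh approximation
    return 0.5 * x * (1.0 + np.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * x**3)))

x = np.linspace(-3, 3, 7)
print(np.max(np.abs(gelu_exact(x) - gelu_tanh(x))))  # tiny: the forms nearly coincide
```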
9. ReLU derivative at x = 0? (Medium)
Answer: Mathematically undefined; in practice frameworks pick 0 or 1 (subgradient). This rarely breaks training at scale because x = 0 is a measure-zero event.
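A quick look at the two conventions, which differ only at the single point z = 0:

```python
import numpy as np

z = np.array([-1.0, 0.0, 1.0])
grad_zero = (z > 0).astype(float)   # convention: derivative 0 at z == 0
grad_one = (z >= 0).astype(float)   # convention: derivative 1 at z == 0
print(grad_zero, grad_one)          # disagree only on the measure-zero point
```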
10. How do sigmoid/tanh contribute to vanishing gradients? (Medium)
Answer: In saturation regions, derivatives are near zero. Backprop multiplies many small terms through deep stacks, shrinking updates to early layers.
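A back-of-envelope illustration: even at sigmoid’s best-case derivative of 0.25, the product across layers collapses quickly:

```python
import numpy as np

def sigmoid_prime(x):
    s = 1.0 / (1.0 + np.exp(-x))
    return s * (1.0 - s)  # peaks at 0.25 when x == 0

depth = 20
print(0.25 ** depth)  # ~9.1e-13: almost no gradient reaches the early layers
print(sigmoid_prime(np.array([0.0, 2.0, 5.0])))  # saturated inputs are far worse
```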
11. Can the last layer before softmax be linear? (Easy)
Answer: Yes—logits are usually an affine map; softmax (or log-softmax in loss) provides nonlinearity for probabilities. No extra activation needed before softmax for standard classifiers.
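A sketch of the log-sum-exp form of log-softmax applied straight to raw affine logits (log_softmax is our own helper):

```python
import numpy as np

def log_softmax(z):
    # log softmax(z)_i = z_i - logsumexp(z); shift by max(z) for stability
    shifted = z - np.max(z)
    return shifted - np.log(np.sum(np.exp(shifted)))

logits = np.array([2.0, -1.0, 0.5])       # raw affine output, no activation applied
print(log_softmax(logits))
print(np.exp(log_softmax(logits)).sum())  # 1.0: still a valid distribution
```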
12. Swish / SiLU: one-line description. (Medium)
Answer: Swish(x) = x · σ(x); SiLU is the same function. Smooth and non-monotonic on the negative side; sometimes improves accuracy over ReLU in some architectures, at extra compute cost.
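A one-line sketch (silu is an illustrative helper):

```python
import numpy as np

def silu(x):
    # x * sigmoid(x), written as x / (1 + exp(-x))
    return x / (1.0 + np.exp(-x))

x = np.array([-4.0, -1.0, 0.0, 1.0, 4.0])
print(silu(x))  # note the dip below zero near x = -1: the non-monotonic part
```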
13. Default hidden activation you’d name in an interview? (Easy)
Answer: ReLU or ReLU variant (Leaky ReLU, GELU) depending on architecture—ReLU for classic CNN/MLP baselines; GELU common in Transformers.
14. Regression output: which activation on the final unit? (Easy)
Answer: Often none: a linear output suits unbounded targets. For targets in [0, 1] you might use sigmoid; for a general bounded range, rescale tanh to the target interval.
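A sketch of rescaling tanh to an arbitrary interval (bounded_output is a hypothetical helper, not a library function):

```python
import numpy as np

def bounded_output(x, low, high):
    # affine rescale of tanh's (-1, 1) range to (low, high)
    return low + (high - low) * (np.tanh(x) + 1.0) / 2.0

print(bounded_output(np.array([-10.0, 0.0, 10.0]), low=5.0, high=15.0))
# ~[ 5. 10. 15.]: predictions stay inside the target range
```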
15. Why does ReLU help training speed vs. sigmoid? (Medium)
Answer: Cheaper ops (compare/threshold vs exp), and less saturation in the active region so gradients stay larger for active neurons—often faster convergence in deep nets.
Tie answers to gradients, saturation, and compute—three axes interviewers like.
Quick review checklist
- Why nonlinearity; ReLU + dead ReLU; softmax + cross-entropy story.
- Sigmoid/tanh saturation and vanishing gradients.
- Leaky/PReLU, Swish/SiLU, GELU at high level.
- Output layer: logits + softmax vs linear regression head.