Activation Functions — 15 Interview Questions
ReLU family, sigmoid/tanh trade-offs, softmax, smooth alternatives, and what interviewers probe on gradients and dead neurons.
Topics: Nonlinearity, Gradients, ReLU, Softmax
1. Why do neural networks need nonlinear activation functions? (Easy)
Answer: A stack of linear layers without nonlinearities is still one affine map. Nonlinear activations let the network compose functions and approximate curved decision boundaries.
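A minimal NumPy sketch (shapes chosen arbitrarily) that makes the collapse concrete:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(4,))      # input vector
W1 = rng.normal(size=(5, 4))   # first linear "layer"
W2 = rng.normal(size=(3, 5))   # second linear "layer"

two_layers = W2 @ (W1 @ x)     # no nonlinearity in between
one_layer = (W2 @ W1) @ x      # a single equivalent linear map

print(np.allclose(two_layers, one_layer))  # True: the extra depth added nothing
```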
2. Define ReLU and why it is popular. (Easy)
Answer: ReLU(x) = max(0, x). It is cheap to compute, often yields sparse activations, and does not saturate for positive inputs, so it avoids the vanishing gradients that saturating sigmoids cause in deep hidden layers.
ReLU(x) = max(0, x)
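A tiny illustrative sketch (relu here is our own helper, not a library call):

```python
import numpy as np

def relu(x):
    # elementwise max(0, x)
    return np.maximum(0.0, x)

z = np.array([-2.0, -0.5, 0.0, 1.5, 3.0])
print(relu(z))                # [0.  0.  0.  1.5 3. ]
print(np.mean(relu(z) == 0))  # 0.6 -- the "sparse activations" in practice
```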
3. What is the “dead ReLU” problem? (Medium)
Answer: If a neuron’s weights and bias keep its pre-activation ≤ 0 for every input, ReLU outputs 0 and the gradient w.r.t. that unit is 0, so it may stop updating permanently. Leaky ReLU, PReLU, or better initialization/learning rates mitigate this.
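A toy demonstration of the trap, using the common subgradient convention (relu_grad is an illustrative helper):

```python
import numpy as np

def relu_grad(z):
    # subgradient convention: 1 for z > 0, else 0
    return (z > 0).astype(float)

# a unit whose pre-activation is negative for every input in the batch
z = np.array([-3.1, -0.2, -1.7, -0.9])
print(relu_grad(z))  # [0. 0. 0. 0.] -> no gradient signal; the unit stops learning
```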
4. Sigmoid: formula and typical use today. (Easy)
Answer: σ(x) = 1/(1+e^(−x)), outputs (0,1). Saturates for large |x| (small gradients). Still used for binary probabilities at output or gates; less common in deep hidden stacks than ReLU.
σ(x) = 1 / (1 + e^(-x))
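A sketch of a numerically stable implementation; the piecewise form is a common trick to avoid overflow in exp (sigmoid is our own helper):

```python
import numpy as np

def sigmoid(x):
    # evaluate exp only on non-positive arguments so it never overflows
    out = np.empty_like(x, dtype=float)
    pos = x >= 0
    out[pos] = 1.0 / (1.0 + np.exp(-x[pos]))
    ex = np.exp(x[~pos])
    out[~pos] = ex / (1.0 + ex)
    return out

print(sigmoid(np.array([-500.0, 0.0, 500.0])))  # [0.  0.5 1. ], no overflow warning
```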
5. Tanh vs. sigmoid for hidden layers (historically). (Medium)
Answer: Tanh is zero-centered with outputs in (−1, 1), which avoids the positive bias shift that sigmoid’s (0, 1) outputs impose on the next layer. Both still saturate; modern deep nets favor ReLU-like activations for hidden layers.
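The identity tanh(x) = 2σ(2x) − 1 makes the relationship precise; a quick NumPy check:

```python
import numpy as np

x = np.linspace(-4, 4, 9)
sigma = 1.0 / (1.0 + np.exp(-2 * x))           # sigma(2x)
print(np.allclose(np.tanh(x), 2 * sigma - 1))  # True: tanh is a rescaled sigmoid
```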
6. What does softmax do and where is it used? (Easy)
Answer: Maps a vector of logits to a probability distribution (non-negative, sum to 1). Standard for multi-class mutually exclusive classification outputs; often paired with cross-entropy loss.
softmax(z_i) = e^{z_i} / Σ_j e^{z_j}
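In practice the max is subtracted first so exp never sees a large positive argument; a sketch (softmax is our own helper):

```python
import numpy as np

def softmax(z):
    # shifting by max(z) leaves the result unchanged but prevents overflow
    e = np.exp(z - np.max(z))
    return e / e.sum()

logits = np.array([1000.0, 1001.0, 1002.0])
print(softmax(logits))        # well-defined; naive exp(1000.0) would overflow
print(softmax(logits).sum())  # 1.0
```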
7. Leaky ReLU and Parametric ReLU (PReLU): what is the idea? (Medium)
Answer: For x < 0, use a small slope α instead of 0 so gradients can flow. PReLU learns α per channel/unit. Goal: reduce dead neurons while keeping ReLU-like behavior for positives.
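A minimal sketch (leaky_relu is an illustrative helper; in PReLU, alpha would be a learned parameter rather than a constant):

```python
import numpy as np

def leaky_relu(x, alpha=0.01):
    # small fixed slope for x <= 0 instead of a hard zero
    return np.where(x > 0, x, alpha * x)

z = np.array([-3.0, -0.5, 0.0, 2.0])
print(leaky_relu(z))               # [-0.03  -0.005  0.     2.   ]
print(np.where(z > 0, 1.0, 0.01))  # gradient is never exactly zero for z < 0
```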
8. What is GELU and why do Transformers use it? (Hard)
Answer: GELU is a smooth, probabilistic gating nonlinearity: GELU(x) = x·Φ(x), where Φ is the standard Gaussian CDF. It performs well in Transformers and avoids ReLU’s hard kink at zero; implementations often use fast tanh or sigmoid approximations for speed.
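A sketch comparing the exact erf form with the widely used tanh approximation (both helpers are illustrative):

```python
import math
import numpy as np

def gelu_exact(x):
    # x * Phi(x), with Phi the standard Gaussian CDF written via erf
    return np.array([v * 0.5 * (1.0 + math.erf(v / math.sqrt(2.0))) for v in x])

def gelu_tanh(x):
    # common fast tanh approximation
    return 0.5 * x * (1.0 + np.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * x**3)))

x = np.linspace(-3, 3, 7)
print(np.max(np.abs(gelu_exact(x) - gelu_tanh(x))))  # tiny: the forms nearly coincide
```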
9. ReLU derivative at x = 0? (Medium)
Answer: Mathematically undefined; in practice frameworks pick 0 or 1 (subgradient). This rarely breaks training at scale because x = 0 is a measure-zero event.
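A quick look at the two conventions, which differ only at the single point z = 0:

```python
import numpy as np

z = np.array([-1.0, 0.0, 1.0])
grad_zero = (z > 0).astype(float)   # convention: derivative 0 at z == 0
grad_one = (z >= 0).astype(float)   # convention: derivative 1 at z == 0
print(grad_zero, grad_one)          # disagree only on the measure-zero point
```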
10. How do sigmoid/tanh contribute to vanishing gradients? (Medium)
Answer: In saturation regions, derivatives are near zero. Backprop multiplies many small terms through deep stacks, shrinking updates to early layers.
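A back-of-envelope illustration: even at sigmoid’s best-case derivative of 0.25, the product across layers collapses quickly:

```python
import numpy as np

def sigmoid_prime(x):
    s = 1.0 / (1.0 + np.exp(-x))
    return s * (1.0 - s)  # peaks at 0.25 when x == 0

depth = 20
print(0.25 ** depth)  # ~9.1e-13: almost no gradient reaches the early layers
print(sigmoid_prime(np.array([0.0, 2.0, 5.0])))  # saturated inputs are far worse
```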
11. Can the last layer before softmax be linear? (Easy)
Answer: Yes—logits are usually an affine map; softmax (or log-softmax in loss) provides nonlinearity for probabilities. No extra activation needed before softmax for standard classifiers.
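A sketch of the log-sum-exp form of log-softmax applied straight to raw affine logits (log_softmax is our own helper):

```python
import numpy as np

def log_softmax(z):
    # log softmax(z)_i = z_i - logsumexp(z); shift by max(z) for stability
    shifted = z - np.max(z)
    return shifted - np.log(np.sum(np.exp(shifted)))

logits = np.array([2.0, -1.0, 0.5])       # raw affine output, no activation applied
print(log_softmax(logits))
print(np.exp(log_softmax(logits)).sum())  # 1.0: still a valid distribution
```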
12. Swish / SiLU: one-line description. (Medium)
Answer: Swish(x) = x · σ(x); SiLU is the same function. Smooth and non-monotonic on the negative side; sometimes improves accuracy over ReLU in some architectures, at extra compute cost.
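A one-line sketch (silu is an illustrative helper):

```python
import numpy as np

def silu(x):
    # x * sigmoid(x), written as x / (1 + exp(-x))
    return x / (1.0 + np.exp(-x))

x = np.array([-4.0, -1.0, 0.0, 1.0, 4.0])
print(silu(x))  # note the dip below zero near x = -1: the non-monotonic part
```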
13. Default hidden activation you’d name in an interview? (Easy)
Answer: ReLU or ReLU variant (Leaky ReLU, GELU) depending on architecture—ReLU for classic CNN/MLP baselines; GELU common in Transformers.
14. Regression output: which activation on the final unit? (Easy)
Answer: Often none: a linear output suits unbounded targets. For targets in [0, 1] you might use sigmoid; for a general bounded range, rescale tanh to the target interval.
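A sketch of rescaling tanh to an arbitrary interval (bounded_output is a hypothetical helper, not a library function):

```python
import numpy as np

def bounded_output(x, low, high):
    # affine rescale of tanh's (-1, 1) range to (low, high)
    return low + (high - low) * (np.tanh(x) + 1.0) / 2.0

print(bounded_output(np.array([-10.0, 0.0, 10.0]), low=5.0, high=15.0))
# ~[ 5. 10. 15.]: predictions stay inside the target range
```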
15. Why does ReLU help training speed vs. sigmoid? (Medium)
Answer: Cheaper ops (compare/threshold vs exp), and less saturation in the active region so gradients stay larger for active neurons—often faster convergence in deep nets.
Tie answers to gradients, saturation, and compute—three axes interviewers like.
Quick review checklist
- Why nonlinearity; ReLU + dead ReLU; softmax + cross-entropy story.
- Sigmoid/tanh saturation and vanishing gradients.
- Leaky/PReLU, Swish/SiLU, GELU at high level.
- Output layer: logits + softmax vs linear regression head.