Interview Q&A: 30 Questions

Activation & Loss Functions — Interview Q&A

Activation functions and loss functions for training neural networks.

Activation Functions: 15 Interview Questions

1. Why do neural networks need nonlinear activation functions? (Easy)

Answer: A stack of linear layers with no nonlinearity between them collapses to a single affine map, so extra depth adds no expressive power. Nonlinear activations let the network compose functions and approximate curved decision boundaries.
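The collapse can be checked by hand. The sketch below uses small made-up weights (W1, b1, W2, b2 are illustrative values, not from any real model) and shows that two affine layers give the same output as one layer with W = W2 W1 and b = W2 b1 + b2.

```python
def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def matmul(A, B):
    """Multiply matrices A and B, both given as lists of rows."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]

def affine(W, b, x):
    """Compute W x + b."""
    return [v + bi for v, bi in zip(matvec(W, x), b)]

# Hypothetical two-layer "network" with no activation in between.
W1, b1 = [[1.0, 2.0], [0.0, -1.0]], [0.5, 1.0]
W2, b2 = [[3.0, 1.0]], [-2.0]

# Equivalent single layer: W = W2 W1, b = W2 b1 + b2.
W = matmul(W2, W1)
b = [v + bi for v, bi in zip(matvec(W2, b1), b2)]

x = [0.7, -1.3]
two_layer = affine(W2, b2, affine(W1, b1, x))
one_layer = affine(W, b, x)
```

Inserting any nonlinearity between the two layers breaks this identity, which is the point of the answer.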

2. Define ReLU and explain why it is popular. (Easy)

Answer: ReLU(x) = max(0, x). It is cheap to compute, tends to produce sparse activations, and has gradient exactly 1 for positive inputs, so it does not saturate there the way sigmoid does in deep hidden layers.

ReLU(x) = max(0, x)

3. What is the "dead ReLU" problem? (Medium)

Answer: If a neuron's weights and bias leave its pre-activation ≤ 0 for every input, ReLU outputs 0 and the gradient through that unit is 0, so it may stop updating. Leaky ReLU, PReLU, better initialization, or a lower learning rate mitigate this.
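A minimal sketch of a dead unit, assuming a single neuron with hand-picked weights and inputs (all values are hypothetical). With a very negative bias the pre-activation is never positive, so every weight gradient is zero.

```python
def relu(x):
    return max(0.0, x)

def relu_grad(x):
    # Convention assumed here: derivative 0 at x == 0.
    return 1.0 if x > 0 else 0.0

def neuron_weight_grads(w, b, x):
    """Gradient of relu(w . x + b) with respect to each weight, for one input x."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return [relu_grad(z) * xi for xi in x]

inputs = [[0.2, 0.9], [1.0, 0.1], [0.5, 0.5]]
w = [0.3, -0.2]
dead_b = -5.0   # bias so negative that z <= 0 on every input
live_b = 0.1    # bias that leaves the unit active on some inputs

dead_grads = [neuron_weight_grads(w, dead_b, x) for x in inputs]
live_grads = [neuron_weight_grads(w, live_b, x) for x in inputs]
```

Because the gradient with respect to the bias is also zero in the dead case, gradient descent alone cannot move the unit back into the active region.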

4. Sigmoid: formula and typical use today. (Easy)

Answer: σ(x) = 1/(1 + e^(-x)), with outputs in (0, 1). It saturates for large |x|, where gradients are small. Still used for binary probabilities at the output and for gates; less common than ReLU in deep hidden stacks.

σ(x) = 1 / (1 + e^(-x))

5. Tanh vs sigmoid for hidden layers (historically). (Medium)

Answer: Tanh is zero-centered with range (−1, 1), which can help optimization compared with sigmoid's (0, 1) outputs, whose all-positive values shift the mean of the next layer's inputs. Both still saturate; modern deep nets favor ReLU-like activations for hidden layers.

6. What does softmax do and where is it used? (Easy)

Answer: It maps a vector of logits to a probability distribution (non-negative entries that sum to 1). Standard for the output of multi-class classifiers with mutually exclusive classes; often paired with cross-entropy loss.

softmax(z)_i = e^{z_i} / Σ_j e^{z_j}
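A direct translation of the formula overflows for large logits, so implementations usually subtract the maximum logit first. This does not change the result because softmax is invariant to adding a constant to every logit. A minimal sketch in plain Python:

```python
import math

def softmax(logits):
    """Numerically stable softmax: shift by the max before exponentiating."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])
# math.exp(1000.0) alone would raise OverflowError; the shifted version is fine.
big = softmax([1000.0, 1001.0, 1002.0])
small = softmax([0.0, 1.0, 2.0])
```

Here `big` and `small` agree, since the two inputs differ only by a constant shift of 1000.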

7. Leaky ReLU and Parametric ReLU (PReLU): what is the idea? (Medium)

Answer: For x < 0, use a small slope α instead of 0 so gradients can still flow. Leaky ReLU fixes α (for example 0.01); PReLU learns α per channel or unit. The goal is to reduce dead neurons while keeping ReLU-like behavior for positive inputs.
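As a small illustration, with the slope `alpha=0.01` chosen here as an assumed default rather than a prescribed value:

```python
def leaky_relu(x, alpha=0.01):
    """Identity for positive inputs, slope alpha for negative ones."""
    return x if x > 0 else alpha * x

def leaky_relu_grad(x, alpha=0.01):
    """Derivative: 1 on the positive side, alpha (never 0) on the negative side."""
    return 1.0 if x > 0 else alpha
```

In PReLU the same function is used, but `alpha` becomes a trainable parameter updated by backprop along with the weights.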

8. What is GELU and why do Transformers use it? (Hard)

Answer: GELU(x) = x · Φ(x), where Φ is the standard Gaussian CDF, so each input is smoothly gated by a probability. It performs well in Transformers and has no hard kink at 0; implementations often use a fast tanh- or sigmoid-based approximation.
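The sketch below compares the exact form, written with the error function, against the commonly used tanh approximation. The sample points in `xs` are arbitrary and only meant to show that the two stay close.

```python
import math

def gelu_exact(x):
    """x * Phi(x), with Phi the standard normal CDF expressed via erf."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def gelu_tanh(x):
    """Widely used tanh-based approximation of GELU."""
    c = math.sqrt(2.0 / math.pi)
    return 0.5 * x * (1.0 + math.tanh(c * (x + 0.044715 * x ** 3)))

xs = [-3.0, -1.0, 0.0, 0.5, 2.0]
max_gap = max(abs(gelu_exact(x) - gelu_tanh(x)) for x in xs)
```

For large positive x both versions approach x, and for large negative x both approach 0, which is the ReLU-like behavior the answer refers to.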

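9. ReLU derivative at x = 0? (Medium)

Answer: Mathematically undefined, since the left derivative is 0 and the right derivative is 1. In practice frameworks pick a subgradient, usually 0. This rarely affects training because an exact zero pre-activation is a measure-zero event for continuous inputs.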

10. How do sigmoid/tanh contribute to vanishing gradients? (Medium)

Answer: In saturation regions their derivatives are near zero, and even at its steepest point the sigmoid derivative is only 0.25. Backprop multiplies one such factor per layer, so the product shrinks quickly and early layers receive tiny updates.
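A rough sketch of the effect, assuming a 10-layer chain and ignoring the weight factors that also appear in the real backprop product (so this isolates the activation's contribution only):

```python
import math

def sigmoid(x):
    """Sigmoid in a two-branch form that avoids overflow for large |x|."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)

def chain_factor(pre_activations):
    """Product of sigmoid derivatives along one path through the layers."""
    prod = 1.0
    for z in pre_activations:
        prod *= sigmoid_grad(z)
    return prod

best_case = chain_factor([0.0] * 10)   # every unit at its steepest point
saturated = chain_factor([4.0] * 10)   # every unit in the saturated region
```

Even the best case equals 0.25 to the 10th power, which is below one in a million; saturated units make it far smaller.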

11. Can the last layer before softmax be linear? (Easy)

Answer: Yes. Logits are usually an affine map of the last hidden layer; softmax (or the log-softmax inside the loss) supplies the nonlinearity that turns them into probabilities. No extra activation is needed before softmax in standard classifiers.

12. Swish / SiLU: one-line description. (Medium)

Answer: x · σ(x). SiLU is exactly this form; Swish generalizes it to x · σ(βx), and β = 1 recovers SiLU. It is smooth and non-monotonic on the negative side, and sometimes improves accuracy over ReLU at extra compute cost.

13. Default hidden activation you'd name in an interview? (Easy)

Answer: ReLU or a ReLU variant (Leaky ReLU, GELU), depending on architecture: ReLU for classic CNN/MLP baselines, GELU for Transformers.

14. Regression output: which activation on the final unit? (Easy)

Answer: Often none (linear output) for unbounded targets. For targets in [0, 1] you might use sigmoid; for other bounded ranges, a scaled and shifted tanh or sigmoid, or custom bounds.

15. Why does ReLU help training speed vs sigmoid? (Medium)

Answer: Cheaper ops (a compare/threshold instead of an exp), and no saturation in the active region, so gradients stay larger for active neurons. The result is often faster convergence in deep nets.

Tie answers to gradients, saturation, and compute: three axes interviewers like.

Loss Functions: 15 Interview Questions

16. What is a loss function in supervised learning? (Easy)

Answer: A scalar that scores how far model outputs are from targets for one example (or batch). Training minimizes the average loss over the dataset, known as the empirical risk.

17. Mean squared error (MSE): definition and typical use. (Easy)

Answer: The average of squared differences between prediction and target. Common for regression; penalizes large errors heavily. Under a Gaussian noise assumption, minimizing MSE is equivalent to maximum likelihood.

MSE = (1/n) Σ_i (ŷ_i − y_i)²

18. Binary cross-entropy in one line. (Easy)

Answer: For label y ∈ {0, 1} and predicted probability p̂, the loss is −[y log p̂ + (1 − y) log(1 − p̂)], which pushes p̂ → y. It is the negative log-likelihood of a Bernoulli model and gives strong gradients when the model is confidently wrong.
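A minimal sketch of the per-example loss. The clipping constant `eps` is an assumption added here to keep log(0) from occurring; it is not part of the mathematical definition.

```python
import math

def binary_cross_entropy(y, p_hat, eps=1e-12):
    """Negative Bernoulli log-likelihood for one example, with y in {0, 1}."""
    p = min(max(p_hat, eps), 1.0 - eps)   # keep log() away from 0
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))

confident_right = binary_cross_entropy(1, 0.99)
unsure = binary_cross_entropy(1, 0.5)
confident_wrong = binary_cross_entropy(1, 0.01)
```

The loss grows without bound as the predicted probability of the true label approaches 0, which is why confident mistakes dominate the gradient.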

19. Multi-class cross-entropy with one-hot targets. (Medium)

Answer: −Σ_k y_k log p̂_k. With one-hot y, the sum picks out the negative log-probability of the true class. With softmax outputs, this is the standard classification training loss.

20. Why softmax + cross-entropy together? (Medium)

Answer: Softmax turns logits into a distribution; CE matches it to labels. The combined gradient w.r.t. the logits is simply prediction minus target (p̂ − y), and fusing the two (e.g. log-softmax + NLL) is numerically stable and efficient to implement.
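The "prediction minus target" gradient can be checked against finite differences. This is a sketch for a single example with made-up logits; `numeric_grad` is a helper introduced here only for the check.

```python
import math

def log_softmax(logits):
    """Stable log-softmax via the log-sum-exp trick."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - lse for z in logits]

def cross_entropy(logits, target):
    """CE for one example; target is the index of the true class."""
    return -log_softmax(logits)[target]

def ce_grad(logits, target):
    """Analytic gradient w.r.t. logits: softmax(logits) - one_hot(target)."""
    probs = [math.exp(lp) for lp in log_softmax(logits)]
    return [p - (1.0 if k == target else 0.0) for k, p in enumerate(probs)]

def numeric_grad(logits, target, h=1e-6):
    """Central finite differences, used only to verify ce_grad."""
    grads = []
    for k in range(len(logits)):
        up, down = list(logits), list(logits)
        up[k] += h
        down[k] -= h
        grads.append((cross_entropy(up, target) - cross_entropy(down, target)) / (2 * h))
    return grads

logits, target = [2.0, -1.0, 0.5], 0
analytic = ce_grad(logits, target)
numeric = numeric_grad(logits, target)
```

The entries of the analytic gradient sum to zero, because the probabilities and the one-hot target each sum to 1.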

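21. Hinge loss: when does it appear? (Medium)

Answer: Classic for SVMs: with labels y ∈ {−1, +1} and score f(x), the loss max(0, 1 − y·f(x)) penalizes margin violations and is zero once the margin is at least 1. Less common in standard deep classifiers than CE, but it shows up in contrastive and max-margin formulations.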
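For reference, the standard binary hinge loss is max(0, 1 − y·score) with labels in {−1, +1}. A short sketch:

```python
def hinge_loss(y, score):
    """Binary hinge loss; y is -1 or +1, score is the raw model output."""
    return max(0.0, 1.0 - y * score)

beyond_margin = hinge_loss(+1, 2.0)   # correct with margin >= 1: no penalty
inside_margin = hinge_loss(+1, 0.5)   # correct but inside the margin
wrong_side = hinge_loss(-1, 0.5)      # misclassified
```

Unlike cross-entropy, the loss and its gradient are exactly zero for examples that clear the margin, so those examples stop influencing training.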

22. Huber loss vs MSE for regression. (Medium)

Answer: Quadratic like MSE for small residuals (|r| ≤ δ) and linear like L1 beyond that, so it is less sensitive to outliers than pure MSE. Unlike L1, it is differentiable everywhere: the two pieces meet with matching slope at |r| = δ.
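A sketch of the usual piecewise definition, with threshold `delta=1.0` chosen here as an assumed default. The gradient function shows that the slope is continuous at the transition.

```python
def huber(residual, delta=1.0):
    """Quadratic for |residual| <= delta, linear beyond it."""
    a = abs(residual)
    if a <= delta:
        return 0.5 * residual ** 2
    return delta * (a - 0.5 * delta)

def huber_grad(residual, delta=1.0):
    """Derivative w.r.t. the residual: the residual itself, clipped to [-delta, delta]."""
    if abs(residual) <= delta:
        return residual
    return delta if residual > 0 else -delta

small = huber(0.5)       # quadratic branch
outlier = huber(10.0)    # linear branch, far below 0.5 * 10**2
```

Because the gradient is clipped at ±delta, a single outlier cannot contribute an arbitrarily large update, which is the robustness property the answer describes.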

23. Where does L2 regularization appear in the loss? (Easy)

Answer: Add λ||w||² (or similar) to the empirical loss so optimization shrinks weights, which often improves generalization. With plain SGD this is equivalent to weight decay; with adaptive optimizers the two differ, which is why AdamW applies the decay separately from the gradient step.

24. Why not train directly on classification accuracy? (Medium)

Answer: Accuracy is piecewise constant in the logits, so its gradient is zero almost everywhere. Differentiable surrogates such as CE provide a learning signal.

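25. Focal loss: purpose in one sentence. (Hard)

Answer: It scales cross-entropy by (1 − p_t)^γ, where p_t is the predicted probability of the true class, which down-weights easy examples so training focuses on hard ones; useful with extreme class imbalance, as in detection settings.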
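A sketch of the binary form without the optional class-balancing weight, assuming the commonly used focusing parameter `gamma=2.0`. The clipping constant `eps` is an added safeguard, not part of the definition.

```python
import math

def focal_loss(y, p_hat, gamma=2.0, eps=1e-12):
    """Binary focal loss: cross-entropy scaled by (1 - p_t) ** gamma."""
    p = min(max(p_hat, eps), 1.0 - eps)
    p_t = p if y == 1 else 1.0 - p       # probability assigned to the true class
    return -((1.0 - p_t) ** gamma) * math.log(p_t)

def plain_ce(y, p_hat):
    """Same function with gamma = 0, which reduces to ordinary cross-entropy."""
    return focal_loss(y, p_hat, gamma=0.0)

easy_ratio = focal_loss(1, 0.9) / plain_ce(1, 0.9)   # easy example, heavily down-weighted
hard_ratio = focal_loss(1, 0.1) / plain_ce(1, 0.1)   # hard example, mostly preserved
```

With gamma = 2, an example predicted at 0.9 keeps about 1% of its cross-entropy loss, while one predicted at 0.1 keeps about 81%.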

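26. Class imbalance: common loss-side fixes? (Medium)

Answer: Class weights in CE, focal loss, or resampling the data; also evaluate with a metric that reflects the imbalance rather than raw accuracy. Mention that rebalancing shifts predicted probabilities, so it affects calibration.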

27. Label smoothing: what does it change? (Hard)

Answer: It replaces the hard one-hot target with a mixture of the one-hot vector and a uniform (or other) distribution, so the model is not pushed toward infinite confidence. It often improves calibration and acts as a regularizer.
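One common convention, assumed here, mixes the one-hot vector with a uniform distribution over all K classes: (1 − ε)·one_hot + ε/K. Other variants spread ε only over the non-target classes.

```python
def smooth_labels(target, num_classes, eps=0.1):
    """Return (1 - eps) * one_hot(target) + eps / num_classes for every class."""
    uniform = eps / num_classes
    return [(1.0 - eps if k == target else 0.0) + uniform
            for k in range(num_classes)]

smoothed = smooth_labels(target=1, num_classes=4)
```

With ε = 0.1 and four classes, the true class gets 0.925 and each other class gets 0.025, so the cross-entropy minimum is reached at finite logits.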

28. KL divergence as a loss component: when? (Hard)

Answer: When matching two distributions, e.g. knowledge distillation (student vs teacher softmax), variational objectives, or probabilistic models. KL(p || q) is the expected number of extra bits (or nats) needed to encode samples from p with a code optimized for q; it is not symmetric in p and q.
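A sketch for discrete distributions, using natural logs (so the result is in nats). The two example distributions are made up for illustration.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) in nats; assumes q[k] > 0 wherever p[k] > 0."""
    return sum(pk * math.log(pk / qk) for pk, qk in zip(p, q) if pk > 0)

p = [0.7, 0.2, 0.1]
q = [0.5, 0.3, 0.2]

forward = kl_divergence(p, q)
reverse = kl_divergence(q, p)
```

The two directions give different values, which matters in practice: distillation and variational objectives each fix a particular direction.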

29. Multi-label classification: typical loss? (Medium)

Answer: An independent sigmoid + binary CE per label (not softmax), because multiple labels can be active at once.
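A sketch that computes the per-label binary CE directly from logits, using the identity BCE(z, y) = max(z, 0) − z·y + log(1 + e^(−|z|)) for numerical stability. Averaging over labels is an assumed reduction; summing is equally common.

```python
import math

def bce_with_logit(z, y):
    """Binary CE for one label from its logit z, with y in {0, 1}."""
    return max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))

def multilabel_bce(logits, targets):
    """Mean of independent per-label binary CE terms."""
    terms = [bce_with_logit(z, y) for z, y in zip(logits, targets)]
    return sum(terms) / len(terms)

# Two labels active at once, which a softmax head could not represent.
good = multilabel_bce([5.0, 5.0, -5.0], [1, 1, 0])
bad = multilabel_bce([-5.0, 5.0, 5.0], [1, 1, 0])
```

Each label is scored on its own, so raising the probability of one label does not force the others down as it would under softmax.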

30. How do you pick a loss for a new task? (Medium)

Answer: Match the output head and the probabilistic story: regression → MSE/Huber; exclusive classes → softmax + CE; multi-label → sigmoid + BCE; ranking → pairwise/ranking losses. Align with the business metric when possible.

Tie every loss answer to gradients and what is being optimized.
