Interview Q&A: 30 Questions

Activation & Loss Functions — Interview Q&A

Activation functions and loss functions for training neural networks.

Activation Functions — 15 Interview Questions

1 Why do neural networks need nonlinear activation functions? (Easy)
Answer: A stack of linear layers without nonlinearities is still one affine map. Nonlinear activations let the network compose functions and approximate curved decision boundaries.
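A minimal sketch of the collapse described above, in plain Python: two affine layers with no activation between them give the same output as a single affine layer with W = W2·W1 and b = W2·b1 + b2. The weights and the helper names (`matvec`, `matmul`, `affine`) are illustrative assumptions, not part of any particular framework.

```python
# Two affine layers with no nonlinearity in between collapse into one affine map.
# All numbers below are arbitrary illustrative values.

def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def matmul(A, B):
    """Multiply matrix A by matrix B (both lists of rows)."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]

def affine(W, b, x):
    """Compute W x + b."""
    return [v + bi for v, bi in zip(matvec(W, x), b)]

W1, b1 = [[1.0, 2.0], [0.5, -1.0]], [0.1, -0.2]
W2, b2 = [[2.0, 0.0], [1.0, 3.0]], [0.3, 0.4]

# Equivalent single layer: W = W2 W1, b = W2 b1 + b2
W = matmul(W2, W1)
b = [v + bi for v, bi in zip(matvec(W2, b1), b2)]

x = [0.7, -1.3]
stacked = affine(W2, b2, affine(W1, b1, x))   # layer 1 then layer 2
collapsed = affine(W, b, x)                   # one merged layer
print(stacked, collapsed)
```

Inserting any nonlinearity between the two layers breaks this identity, which is the point of the answer.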
2 Define ReLU and explain why it is popular. (Easy)
Answer: ReLU(x) = max(0, x). It is cheap to compute, tends to produce sparse activations, and has gradient exactly 1 for x > 0, so it does not saturate on the positive side the way sigmoid and tanh do in deep hidden layers.
ReLU(x) = max(0, x)
3 What is the “dead ReLU” problem? (Medium)
Answer: If a neuron’s weights and bias push its pre-activation to ≤ 0 for every input, ReLU outputs 0 and the gradient through that unit is 0, so its weights may stop updating. Leaky ReLU, PReLU, better initialization, or a lower learning rate mitigate this.
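The dead-unit situation can be sketched with a single hypothetical neuron whose bias is strongly negative. The weight, bias, inputs, and the slope α = 0.01 are assumptions chosen for illustration; the gradient convention at exactly 0 is also an assumption.

```python
def relu_grad(z):
    """Derivative of ReLU; uses 0 at z == 0 by convention."""
    return 1.0 if z > 0 else 0.0

def leaky_relu_grad(z, alpha=0.01):
    """Derivative of Leaky ReLU with negative-side slope alpha."""
    return 1.0 if z > 0 else alpha

# Hypothetical unit: the large negative bias keeps every pre-activation below 0.
w, b = 0.5, -10.0
inputs = [-2.0, 0.0, 1.5, 3.0]
pre_activations = [w * x + b for x in inputs]

relu_grads = [relu_grad(z) for z in pre_activations]         # all zero: no update signal
leaky_grads = [leaky_relu_grad(z) for z in pre_activations]  # small but nonzero
print(relu_grads, leaky_grads)
```

With ReLU every gradient is 0, so the unit cannot recover; with the leaky variant a small gradient still flows.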
4 Sigmoid: formula and typical use today. (Easy)
Answer: σ(x) = 1/(1 + e^(−x)), with outputs in (0, 1). It saturates for large |x|, where gradients become small. Still used for binary probabilities at the output and for gates; less common in deep hidden stacks than ReLU.
σ(x) = 1 / (1 + e^(-x))
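A small sketch of the sigmoid and its derivative σ'(x) = σ(x)(1 − σ(x)), using only the standard library. Branching on the sign of x is one common way to avoid overflow in `exp`; the function names are illustrative.

```python
import math

def sigmoid(x):
    """Sigmoid that never passes a large positive argument to exp()."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)          # x < 0, so e is in (0, 1)
    return e / (1.0 + e)

def sigmoid_grad(x):
    """Derivative of sigmoid: s * (1 - s), which peaks at 0.25 when x == 0."""
    s = sigmoid(x)
    return s * (1.0 - s)

for x in (-10.0, 0.0, 10.0):
    print(x, sigmoid(x), sigmoid_grad(x))
```

The derivative is largest at x = 0 and shrinks quickly as |x| grows, which is the saturation the answer refers to.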
5 Tanh vs sigmoid for hidden layers (historically). (Medium)
Answer: Tanh outputs lie in (−1, 1) and are zero-centered, which avoids the positive bias shift that sigmoid’s (0, 1) outputs pass to the next layer. Both still saturate; modern deep nets favor ReLU-like activations for hidden layers.
6 What does softmax do and where is it used? (Easy)
Answer: Maps a vector of logits to a probability distribution (non-negative, sum to 1). Standard for multi-class mutually exclusive classification outputs; often paired with cross-entropy loss.
softmax(z)_i = e^{z_i} / Σ_j e^{z_j}
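A minimal softmax sketch in plain Python. Subtracting the maximum logit before exponentiating is a standard stability trick: it leaves the result unchanged because softmax is invariant to adding a constant to every logit. The function name is illustrative.

```python
import math

def softmax(logits):
    """Map a list of logits to probabilities that are non-negative and sum to 1."""
    m = max(logits)                               # shift so the largest exponent is 0
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

small = softmax([0.0, 1.0, 2.0])
large = softmax([1000.0, 1001.0, 1002.0])         # would overflow without the shift
print(small, large)
```

Both calls return the same distribution, since the second input is the first shifted by a constant.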
7 Leaky ReLU and Parametric ReLU (PReLU): idea? (Medium)
Answer: For x < 0, use a small slope α instead of 0 so gradients can flow. PReLU learns α per channel/unit. Goal: reduce dead neurons while keeping ReLU-like behavior for positives.
8 What is GELU and why do Transformers use it? (Hard)
Answer: GELU(x) = x · Φ(x), where Φ is the standard Gaussian CDF, so it acts as a smooth, probabilistic gate on the input. It performs well in Transformers and avoids ReLU’s hard kink at 0; implementations often use a fast approximation (for example, a tanh-based one) for speed.
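As a sketch, GELU can be written as x · Φ(x) with Φ the standard normal CDF, computed here via `math.erf`. The second function is one widely used tanh-based approximation; treating it as "the" fast approximation a given framework uses would be an assumption.

```python
import math

def gelu(x):
    """Exact GELU: x * Phi(x), with Phi the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def gelu_tanh(x):
    """A common tanh-based approximation of GELU."""
    c = math.sqrt(2.0 / math.pi)
    return 0.5 * x * (1.0 + math.tanh(c * (x + 0.044715 * x ** 3)))

for x in (-3.0, -1.0, 0.0, 1.0, 3.0):
    print(x, gelu(x), gelu_tanh(x))
```

The two versions track each other closely over this range, which is why the approximation is acceptable when speed matters.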
9 ReLU derivative at x = 0? (Medium)
Answer: Mathematically undefined; any value in [0, 1] is a valid subgradient, and frameworks pick a fixed one (commonly 0). This rarely affects training because, for continuous-valued pre-activations, landing exactly on x = 0 is a measure-zero event.
10 How do sigmoid/tanh contribute to vanishing gradients? (Medium)
Answer: In saturation regions, derivatives are near zero. Backprop multiplies many small terms through deep stacks, shrinking updates to early layers.
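To make the multiplication of small terms concrete, this sketch multiplies the sigmoid derivative across a hypothetical 10-layer stack. It deliberately ignores the weight matrices that also appear in backprop, so it illustrates only the activation's contribution; the depths and pre-activation values are assumptions.

```python
import math

def sigmoid_grad(z):
    """Derivative of sigmoid at pre-activation z (moderate |z| only)."""
    s = 1.0 / (1.0 + math.exp(-z))
    return s * (1.0 - s)

def chain_factor(pre_activations):
    """Product of sigmoid'(z) over layers; weight terms are ignored for simplicity."""
    product = 1.0
    for z in pre_activations:
        product *= sigmoid_grad(z)
    return product

best_case = chain_factor([0.0] * 10)   # every layer at the peak slope of 0.25
saturated = chain_factor([4.0] * 10)   # every layer in a saturated region
print(best_case, saturated)
```

Even in the best case the factor is 0.25^10, roughly one millionth, and saturation makes it far smaller.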
11 Can the last layer before softmax be linear? (Easy)
Answer: Yes—logits are usually an affine map; softmax (or log-softmax in loss) provides nonlinearity for probabilities. No extra activation needed before softmax for standard classifiers.
12 Swish / SiLU: one-line description. (Medium)
Answer: SiLU(x) = x · σ(x); Swish generalizes it to x · σ(βx), and β = 1 recovers SiLU. Smooth and non-monotonic on the negative side; it can improve accuracy over ReLU in some architectures at extra compute cost.
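A short sketch of x · σ(x) that shows the non-monotonic dip on the negative side. The sample points are arbitrary, and the sign branch is only there to keep `exp` from overflowing.

```python
import math

def silu(x):
    """SiLU: x * sigmoid(x), computed without overflowing exp()."""
    if x >= 0:
        return x / (1.0 + math.exp(-x))
    e = math.exp(x)
    return x * e / (1.0 + e)

dip = silu(-1.0)        # near the bottom of the negative-side dip
far_left = silu(-5.0)   # closer to 0 again, further to the left
print(dip, far_left, silu(0.0), silu(2.0))
```

The output at −1 is lower than at both −5 and 0, so the function first decreases and then increases as x moves from left to right.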
13 Default hidden activation you’d name in an interview? (Easy)
Answer: ReLU or a ReLU-like variant (Leaky ReLU, GELU), depending on architecture: ReLU for classic CNN/MLP baselines, GELU in Transformers.
14 Regression output: which activation on the final unit? (Easy)
Answer: Often none (linear output) for unbounded targets. For targets bounded in [0, 1], a sigmoid works; for a general range [a, b], rescale a sigmoid or tanh output to that range.
15 Why does ReLU help training speed vs sigmoid? (Medium)
Answer: Cheaper ops (compare/threshold vs exp), and less saturation in the active region so gradients stay larger for active neurons—often faster convergence in deep nets.
Tie answers to gradients, saturation, and compute—three axes interviewers like.

Loss Functions — 15 Interview Questions

16 What is a loss function in supervised learning? (Easy)
Answer: A scalar that scores how far model outputs are from targets for one example (or batch). Training minimizes the average loss over the dataset—the empirical risk.
17 Mean squared error (MSE): definition and typical use. (Easy)
Answer: The average of squared differences between prediction and target. Common for regression; the squaring penalizes large errors heavily. Under i.i.d. Gaussian noise with fixed variance, minimizing MSE is equivalent to maximum likelihood estimation.
MSE = (1/n) Σ (ŷ_i − y_i)²
18 Binary cross-entropy in one line. (Easy)
Answer: For label y ∈ {0,1} and predicted probability p̂, BCE = −[y log p̂ + (1 − y) log(1 − p̂)], which pushes p̂ → y. It is the negative log-likelihood of a Bernoulli model and gives strong gradients when the model is confidently wrong.
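A minimal per-example sketch of binary cross-entropy in plain Python. Clipping the probability away from 0 and 1 is an added safeguard against log(0), and the `eps` value is an assumption rather than a standard.

```python
import math

def binary_cross_entropy(y, p, eps=1e-12):
    """Negative Bernoulli log-likelihood: -[y log p + (1 - y) log(1 - p)]."""
    p = min(max(p, eps), 1.0 - eps)   # clip to keep log() finite
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))

print(binary_cross_entropy(1, 0.9))    # confident and right: small loss
print(binary_cross_entropy(1, 0.01))   # confident and wrong: large loss
```

The loss grows without bound as the predicted probability of the true label approaches 0, which is where the strong gradients come from.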
19 Multi-class cross-entropy with one-hot targets. (Medium)
Answer: −Σ_k y_k log p̂_k with one-hot y picks the log-probability of the true class. With softmax outputs, this is standard classification training.
20 Why softmax + cross-entropy together? (Medium)
Answer: Softmax turns logits into a distribution; CE matches it to the labels. The combined gradient w.r.t. the logits is simply p̂ − y (prediction minus target), and fusing the two as log-softmax + NLL lets the implementation use the log-sum-exp trick for numerical stability.
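The "prediction minus target" gradient can be checked numerically. This sketch computes cross-entropy through a log-sum-exp based log-softmax and compares the analytic gradient with central finite differences; the logits, target index, and step size `h` are illustrative assumptions.

```python
import math

def log_softmax(logits):
    """Log-probabilities via the log-sum-exp trick."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - lse for z in logits]

def cross_entropy(logits, target):
    """Negative log-probability of the true class index."""
    return -log_softmax(logits)[target]

def ce_grad(logits, target):
    """Analytic gradient w.r.t. logits: softmax(logits) - one_hot(target)."""
    probs = [math.exp(lp) for lp in log_softmax(logits)]
    return [p - (1.0 if k == target else 0.0) for k, p in enumerate(probs)]

def numeric_grad(logits, target, h=1e-6):
    """Central finite-difference estimate of the same gradient."""
    grads = []
    for k in range(len(logits)):
        up = logits[:k] + [logits[k] + h] + logits[k + 1:]
        down = logits[:k] + [logits[k] - h] + logits[k + 1:]
        grads.append((cross_entropy(up, target) - cross_entropy(down, target)) / (2 * h))
    return grads

logits, target = [2.0, -1.0, 0.5], 0
analytic = ce_grad(logits, target)
numeric = numeric_grad(logits, target)
print(analytic, numeric)
```

The gradient entries sum to zero, and the entry for the true class is negative, so a gradient step raises that logit relative to the others.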
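21 Hinge loss: when does it appear? (Medium)
Answer: Classic for SVMs: with label y ∈ {−1, +1} and score s, the loss max(0, 1 − y·s) penalizes margin violations and is zero once the margin is at least 1. Less common in standard deep classifiers than CE, but it shows up in contrastive and max-margin formulations.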
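A one-function sketch of the binary hinge loss, assuming labels in {−1, +1} and a raw score (not a probability) from the model. The margin of 1 is the conventional choice.

```python
def hinge_loss(y, score):
    """Hinge loss max(0, 1 - y * score) for a label y in {-1, +1}."""
    return max(0.0, 1.0 - y * score)

print(hinge_loss(1, 2.0))    # correct with margin >= 1: no penalty
print(hinge_loss(1, 0.5))    # correct but inside the margin: small penalty
print(hinge_loss(-1, 0.5))   # wrong side of the boundary: larger penalty
```

Unlike cross-entropy, the loss is exactly zero once an example clears the margin, so such examples contribute no gradient.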
22 Huber loss vs MSE for regression. (Medium)
Answer: Quadratic (like MSE) for residuals with |r| ≤ δ and linear (like L1) beyond that, so it is less sensitive to outliers than pure MSE. Unlike L1, it is differentiable everywhere: the two pieces meet with matching slope at |r| = δ.
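A sketch of the Huber loss and its gradient for a single residual r = ŷ − y, with threshold δ (here `delta`, defaulting to 1.0 as an assumption). The quadratic piece uses the conventional ½r² scaling.

```python
def huber(residual, delta=1.0):
    """Quadratic for |r| <= delta, linear beyond it."""
    a = abs(residual)
    if a <= delta:
        return 0.5 * residual * residual
    return delta * (a - 0.5 * delta)

def huber_grad(residual, delta=1.0):
    """Gradient w.r.t. the residual; both branches give +/- delta at |r| == delta."""
    if abs(residual) <= delta:
        return residual
    return delta if residual > 0 else -delta

for r in (0.5, 1.0, 3.0, 10.0):
    print(r, huber(r), 0.5 * r * r, huber_grad(r))
```

For the outlier at r = 10 the Huber value is 9.5 against 50 for the squared term, and its gradient is capped at δ instead of growing with r.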
23 Where does L2 regularization appear in the loss? (Easy)
Answer: Add λ||w||² (or similar) to the empirical loss so optimization shrinks the weights, which often improves generalization. With plain SGD this is equivalent to weight decay; with adaptive optimizers such as Adam the two differ, which is why AdamW applies weight decay as a separate, decoupled step.
24 Why not train directly on classification accuracy? (Medium)
Answer: Accuracy is piecewise constant in logits—gradient is zero almost everywhere. Differentiable surrogates (CE) provide learning signal.
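25 Focal loss: purpose in one sentence. (Hard)
Answer: Scales cross-entropy by (1 − p_t)^γ, where p_t is the predicted probability of the true class, so easy examples are down-weighted and training focuses on hard ones; useful under extreme class imbalance, as in dense object detection.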
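A minimal binary focal loss sketch: cross-entropy scaled by (1 − p_t)^γ, where p_t is the predicted probability of the true class. The optional class-balancing weight α is omitted here for brevity, and γ = 2 and `eps` are assumed defaults.

```python
import math

def focal_loss(y, p, gamma=2.0, eps=1e-12):
    """Binary focal loss without the alpha weight: -(1 - p_t)^gamma * log(p_t)."""
    p_t = p if y == 1 else 1.0 - p        # probability assigned to the true class
    p_t = min(max(p_t, eps), 1.0)         # keep log() finite
    return -((1.0 - p_t) ** gamma) * math.log(p_t)

easy = focal_loss(1, 0.9)                 # well-classified example
hard = focal_loss(1, 0.1)                 # badly misclassified example
plain_ce = focal_loss(1, 0.9, gamma=0.0)  # gamma = 0 recovers cross-entropy
print(easy, hard, plain_ce)
```

With γ = 2, the easy example's loss is cut to about 1% of its cross-entropy value, while the hard example keeps most of its loss.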
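26 Class imbalance: common loss-side fixes? (Medium)
Answer: Class weights in CE or focal loss on the loss side; resampling and an imbalance-aware evaluation metric as complementary fixes. Mention that reweighting and resampling shift predicted probabilities, so calibration may need rechecking.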
27 Label smoothing: what does it change? (Hard)
Answer: It replaces the hard one-hot target with a mixture of the one-hot vector and a uniform (or other) distribution, so the model is not pushed toward infinite confidence. It often improves calibration and acts as a regularizer.
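A sketch of the uniform-mixture version of label smoothing: the target becomes (1 − ε) · one_hot + ε / K for K classes. The value ε = 0.1 is an assumed, commonly seen setting, and the function name is illustrative.

```python
def smooth_labels(target, num_classes, epsilon=0.1):
    """Mix a one-hot target with the uniform distribution over num_classes."""
    off_value = epsilon / num_classes
    return [
        (1.0 - epsilon) + off_value if k == target else off_value
        for k in range(num_classes)
    ]

smoothed = smooth_labels(target=2, num_classes=4)
print(smoothed)
```

The true class keeps most of the mass and every other class gets a small share, so the cross-entropy optimum sits at finite logits instead of at infinity.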
28 KL divergence as a loss component: when? (Hard)
Answer: When matching two distributions, e.g. knowledge distillation (student vs teacher softmax), variational objectives, or probabilistic models. KL(p‖q) measures the expected extra bits needed to encode samples from p with a code optimized for q; it is asymmetric, so the direction matters.
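A sketch of discrete KL divergence, KL(p‖q) = Σ p_i log(p_i / q_i), in plain Python. It uses the natural log, so the result is in nats rather than bits; the two example distributions and the `eps` floor on q are assumptions.

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) in nats; terms with p_i == 0 contribute nothing."""
    return sum(
        pi * math.log(pi / max(qi, eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )

p = [0.9, 0.1]
q = [0.5, 0.5]
forward = kl_divergence(p, q)
reverse = kl_divergence(q, p)
print(forward, reverse, kl_divergence(p, p))
```

The two directions give different values, so which distribution plays the role of p (for example, teacher versus student in distillation) changes the objective.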
29 Multi-label classification: typical loss? (Medium)
Answer: Independent sigmoid + binary CE per label (not softmax), because multiple labels can be active at once.
30 How do you pick a loss for a new task? (Medium)
Answer: Match the output head and probabilistic story: regression → MSE/Huber; exclusive classes → softmax+CE; multi-label → sigmoid+BCE; ranking → pairwise/ranking losses. Align with business metric when possible.
Tie every loss answer to gradients and what is being optimized.