Loss Functions — 15 Interview Questions
Empirical risk, MSE vs cross-entropy, softmax pairing, robust losses, and how regularizers enter the objective—what interviewers expect you to connect to gradients.
Topics: Objective · Cross-entropy · MSE · Regularization
1. What is a loss function in supervised learning? (Easy)
Answer: A scalar that scores how far model outputs are from targets for one example (or batch). Training minimizes the average loss over the dataset—the empirical risk.
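Written out in the same plain notation used for MSE below, with L the per-example loss and f(·; θ) the model:
R(θ) = (1/n) Σ_i L(f(x_i; θ), y_i), optionally plus a regularization term.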
2. Mean squared error (MSE): definition and typical use. (Easy)
Answer: The average of squared differences between predictions and targets. Common for regression; penalizes large errors heavily. Under a Gaussian noise assumption, minimizing MSE is equivalent to maximum-likelihood estimation.
MSE = (1/n) Σ (ŷ_i − y_i)²
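A minimal NumPy sketch of that formula (the function name and array shapes are illustrative, not from the original):

import numpy as np

def mse(y_pred, y_true):
    # mean of squared residuals over the batch
    return np.mean((y_pred - y_true) ** 2)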
3. Binary cross-entropy in one line. (Easy)
Answer: For label y ∈ {0,1} and predicted probability p̂, the loss is −[y log p̂ + (1 − y) log(1 − p̂)], the negative log-likelihood of a Bernoulli model; it pushes p̂ → y and gives strong gradients when the model is confidently wrong.
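A NumPy sketch of that line; the clipping constant is an assumption for numerical safety, not part of the definition:

import numpy as np

def binary_cross_entropy(p_hat, y, eps=1e-12):
    # clip to avoid log(0); y in {0,1}, p_hat in (0,1)
    p_hat = np.clip(p_hat, eps, 1 - eps)
    return -np.mean(y * np.log(p_hat) + (1 - y) * np.log(1 - p_hat))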
4. Multi-class cross-entropy with one-hot targets. (Medium)
Answer: −Σ_k y_k log p̂_k with one-hot y picks the log-probability of the true class. With softmax outputs, this is standard classification training.
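A sketch with one-hot targets, assuming each row of p_hat is already a valid probability vector (e.g. softmax output):

import numpy as np

def categorical_cross_entropy(p_hat, y_onehot, eps=1e-12):
    # -sum_k y_k log p_k per row; with one-hot y this picks the true-class log-probability
    return -np.mean(np.sum(y_onehot * np.log(p_hat + eps), axis=1))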
5. Why softmax + cross-entropy together? (Medium)
Answer: Softmax turns logits into a probability distribution; CE matches that distribution to the labels. The combined gradient w.r.t. the logits is simply softmax(z) − y (prediction minus target), and the fused log-softmax + NLL computation is numerically stable and cheap to implement.
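A sketch of the fused computation and its gradient w.r.t. the logits; the max-logit shift is the usual stability trick and is assumed here:

import numpy as np

def softmax_ce_with_grad(logits, y_onehot):
    # stable log-softmax: shift by the row max before exponentiating
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    loss = -np.mean(np.sum(y_onehot * log_probs, axis=1))
    grad = (np.exp(log_probs) - y_onehot) / logits.shape[0]  # softmax minus target
    return loss, grad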
6. Hinge loss: when does it appear? (Medium)
Answer: Classic for SVMs: penalizes margin violations. Less common in standard deep classifiers than CE but shows up in contrastive / max-margin formulations.
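A sketch for binary labels y ∈ {−1, +1} and raw scores s; the margin of 1 is the standard SVM choice:

import numpy as np

def hinge_loss(scores, y):
    # penalize examples whose margin y * s falls below 1
    return np.mean(np.maximum(0.0, 1.0 - y * scores))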
7. Huber loss vs MSE for regression. (Medium)
Answer: Quadratic (MSE-like) for small residuals and linear (L1-like) beyond a threshold δ, so it is less sensitive to outliers than pure MSE while remaining differentiable everywhere, including at the join point.
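A sketch with threshold delta; the default value 1.0 is illustrative:

import numpy as np

def huber(y_pred, y_true, delta=1.0):
    r = y_pred - y_true
    quad = 0.5 * r ** 2                       # MSE-like inside |r| <= delta
    lin = delta * (np.abs(r) - 0.5 * delta)   # L1-like outside
    return np.mean(np.where(np.abs(r) <= delta, quad, lin))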
8. Where does L2 regularization appear in the loss? (Easy)
Answer: Add λ||w||² (or a similar penalty) to the empirical loss so optimization shrinks weights, which helps generalization. This is weight decay expressed in the objective; note that decoupled weight decay (as in AdamW) applies the shrinkage outside the loss, so the two are not identical for adaptive optimizers.
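A sketch of where the penalty enters the objective; lam is a hypothetical hyperparameter, and biases are typically excluded from the penalty:

import numpy as np

def objective(data_loss, weights, lam=1e-4):
    # empirical loss plus an L2 penalty on the weight matrices only
    l2 = sum(np.sum(w ** 2) for w in weights)
    return data_loss + lam * l2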
9. Why not train directly on classification accuracy? (Medium)
Answer: Accuracy is piecewise constant in logits—gradient is zero almost everywhere. Differentiable surrogates (CE) provide learning signal.
10. Focal loss: purpose in one sentence. (Hard)
Answer: Down-weights easy examples so training focuses on hard ones—useful with extreme class imbalance in detection settings.
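A binary-case sketch; gamma = 2.0 is a commonly cited but illustrative value, and the optional per-class alpha weight is omitted here:

import numpy as np

def focal_loss(p_hat, y, gamma=2.0, eps=1e-12):
    # p_t is the predicted probability assigned to the true class
    p_t = np.where(y == 1, p_hat, 1 - p_hat)
    p_t = np.clip(p_t, eps, 1 - eps)
    # (1 - p_t)^gamma down-weights easy (high-confidence, correct) examples
    return -np.mean((1 - p_t) ** gamma * np.log(p_t))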
11. Class imbalance: common loss-side fixes? (Medium)
Answer: Class weights in CE, resampling, focal loss, or changing the evaluation metric. Mention that rebalancing affects calibration.
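A sketch of class-weighted cross-entropy with integer labels; the weight vector is illustrative (e.g. inverse class frequencies):

import numpy as np

def weighted_ce(p_hat, y_idx, class_weights, eps=1e-12):
    # weight each example's negative log-likelihood by its class weight
    n = p_hat.shape[0]
    nll = -np.log(p_hat[np.arange(n), y_idx] + eps)
    return np.mean(class_weights[y_idx] * nll)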
12. Label smoothing: what does it change? (Hard)
Answer: Replace the hard one-hot target with a mixture of the one-hot and a uniform (or other) distribution, so the model is not pushed toward infinite logit confidence. Often improves calibration and acts as a mild regularizer.
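A sketch of smoothing one-hot targets toward uniform; eps = 0.1 is a commonly used but illustrative value:

import numpy as np

def smooth_targets(y_onehot, eps=0.1):
    # mix the one-hot target with a uniform distribution over K classes
    k = y_onehot.shape[1]
    return (1 - eps) * y_onehot + eps / k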
13. KL divergence as a loss component: when? (Hard)
Answer: When matching two distributions, e.g. knowledge distillation (student vs. teacher softmax), variational objectives, or probabilistic models. KL(p || q) measures the extra bits (or nats) needed to encode samples from p using a code optimized for q.
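A sketch of KL(p || q) for discrete distributions; rows are assumed to be valid probability vectors:

import numpy as np

def kl_divergence(p, q, eps=1e-12):
    # extra nats incurred by coding samples from p with a code built for q
    return np.sum(p * (np.log(p + eps) - np.log(q + eps)), axis=-1)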
14. Multi-label classification: typical loss? (Medium)
Answer: Independent sigmoid + binary CE per label (not softmax), because multiple labels can be active at once.
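A sketch of the sigmoid + per-label binary CE pattern; a logits array of shape (batch, num_labels) is assumed:

import numpy as np

def multilabel_bce(logits, y, eps=1e-12):
    # independent sigmoid per label, then binary CE averaged over labels and batch
    p = 1.0 / (1.0 + np.exp(-logits))
    p = np.clip(p, eps, 1 - eps)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))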
15. How do you pick a loss for a new task? (Medium)
Answer: Match the output head and probabilistic story: regression → MSE/Huber; exclusive classes → softmax+CE; multi-label → sigmoid+BCE; ranking → pairwise/ranking losses. Align with business metric when possible.
Tie every loss answer to gradients and what is being optimized.
Quick review checklist
- Empirical risk; MSE vs CE; softmax+CE gradient story.
- Why not accuracy; multi-label vs multi-class losses.
- Regularization in objective; label smoothing / focal at high level.