Loss Functions 20 Essential Q/A
DL Interview Prep

Deep Learning Loss Functions: 20 Interview Questions

Master MSE, MAE, Binary/Categorical Cross-Entropy, Hinge, Huber, Contrastive, Triplet, KL Divergence, CTC, and more: when to use each, gradient behavior, robustness – concise, interview-ready answers.

MSE · Cross-Entropy · MAE · Hinge · KL Div · Huber
1 What is a loss function in deep learning? ⚡ Easy
Answer: A loss function (cost/objective) quantifies the error between model predictions and true targets. Training minimizes this loss via gradient descent. Choice of loss depends on task: regression (L1, L2), classification (cross-entropy), ranking (hinge), etc.
ℒ(ŷ, y) : measure of "how wrong" the model is.
2 Compare MSE and MAE. When to use each? 📊 Medium
Answer: MSE = mean((y − ŷ)²), MAE = mean(|y − ŷ|). MSE penalizes large errors more (squaring), so it is sensitive to outliers; MAE is robust to them. Use MSE when outliers are rare or large errors should be emphasized; use MAE when robustness is needed. MSE gradient magnitude ∝ error; MAE gradient is constant (±1).
MSE: smooth, convex, but sensitive to outliers.
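A minimal NumPy sketch (illustrative, not from the original) showing how a single outlier dominates MSE while MAE grows only linearly:

```python
import numpy as np

def mse(y, y_hat):
    return np.mean((y - y_hat) ** 2)

def mae(y, y_hat):
    return np.mean(np.abs(y - y_hat))

y     = np.array([1.0, 2.0, 3.0, 100.0])  # last target is an outlier
y_hat = np.array([1.1, 2.1, 2.9, 3.0])

# MSE is dominated by the single squared outlier error (~97^2);
# MAE counts it only once, linearly.
print(mse(y, y_hat))  # ~2352.26
print(mae(y, y_hat))  # ~24.33
```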
3 Why use cross-entropy for classification, not MSE? 🔥 Hard
Answer: Cross-entropy with softmax/sigmoid gives stronger gradients when prediction is wrong. MSE + sigmoid saturates quickly – vanishing gradient. CE is also probabilistic (minimizes KL divergence), directly optimizes log-likelihood. CE is convex in parameters for linear models.
Binary CE: -[y log(p) + (1-y) log(1-p)] vs MSE: (y-p)²
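The saturation argument can be checked numerically. For a sigmoid output p = σ(z), the gradient of binary CE w.r.t. the logit is p − y, while the gradient of MSE picks up an extra p(1 − p) factor that vanishes when the sigmoid saturates (a small illustrative sketch):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

y, z = 1.0, -8.0            # true label 1, but the logit is very negative
p = sigmoid(z)              # p ~ 0.0003: confidently wrong

grad_ce  = p - y                      # dBCE/dz: stays near -1
grad_mse = 2 * (p - y) * p * (1 - p)  # dMSE/dz: crushed by p(1-p)

print(grad_ce)   # ~ -1: strong learning signal
print(grad_mse)  # ~ 0: vanishing gradient despite a wrong prediction
```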
4 Binary vs Categorical Cross-Entropy: difference? ⚡ Easy
Answer: Binary CE for 2 classes (single sigmoid output). Categorical CE for ≥3 classes (softmax output). For multi-label (multiple binary tasks), use binary CE per output.
5 What is Hinge loss? Where is it used? 📊 Medium
Answer: Hinge: max(0, 1 - y·ŷ) for y ∈ {-1,1}. Used in SVMs and max-margin classifiers. Encourages correct classification with a margin. Not differentiable at margin; subgradient used. Less common in deep nets but used in Siamese nets (contrastive hinge).
L = Σ max(0, 1 - y_i * f(x_i))
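The formula above in a runnable NumPy sketch (assuming labels in {−1, +1} and raw scores f(x)):

```python
import numpy as np

def hinge(y, scores):
    # Loss is zero once the margin y * f(x) >= 1 is satisfied;
    # otherwise it grows linearly with the margin violation.
    return np.mean(np.maximum(0.0, 1.0 - y * scores))

y      = np.array([ 1,   -1,   1,   -1])
scores = np.array([2.0, -1.5, 0.3,  0.5])  # last two violate the margin

print(hinge(y, scores))  # (0 + 0 + 0.7 + 1.5) / 4 = 0.55
```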
6 Explain Huber loss. When is it useful? 🔥 Hard
Answer: Huber loss = MSE for small error, MAE for large error (quadratic near zero, linear otherwise). Smooth, less sensitive to outliers than MSE, differentiable. Used in robust regression (e.g., object detection bounding boxes – Smooth L1 is similar).
# Smooth L1 = Huber with δ = 1
if |x| ≤ δ: 0.5·x²  else δ·(|x| − 0.5·δ)
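A vectorized NumPy sketch of the general Huber loss (δ = 1 recovers Smooth L1; illustrative only):

```python
import numpy as np

def huber(err, delta=1.0):
    # Quadratic inside |err| <= delta, linear outside;
    # the two pieces meet with matching value and slope at |err| = delta.
    quad = 0.5 * err ** 2
    lin  = delta * (np.abs(err) - 0.5 * delta)
    return np.where(np.abs(err) <= delta, quad, lin)

errs = np.array([0.5, 1.0, 5.0])
print(huber(errs))  # [0.125, 0.5, 4.5]: quadratic, boundary, linear regimes
```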
7 KL Divergence vs Cross-Entropy: relation? 🔥 Hard
Answer: Cross-Entropy = H(p,q) = H(p) + KL(p||q). Minimizing cross-entropy is equivalent to minimizing KL divergence if p is fixed (target distribution). In VAEs, we minimize KL(q(z|x) || p(z)) to regularize latent space.
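The identity H(p, q) = H(p) + KL(p‖q) can be verified directly on a toy distribution (illustrative sketch):

```python
import numpy as np

p = np.array([0.7, 0.2, 0.1])   # target distribution (fixed)
q = np.array([0.5, 0.3, 0.2])   # model distribution

entropy = -np.sum(p * np.log(p))      # H(p)
cross_e = -np.sum(p * np.log(q))      # H(p, q)
kl      =  np.sum(p * np.log(p / q))  # KL(p || q) >= 0

# H(p, q) = H(p) + KL(p || q); since H(p) is constant,
# minimizing cross-entropy over q minimizes the KL term.
print(cross_e, entropy + kl)
```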
8 What are Contrastive and Triplet losses? 🔥 Hard
Answer: Contrastive: pulls positive pairs together, pushes negative apart (margin). Triplet: anchor, positive, negative; loss = max(0, d(a,p) - d(a,n) + margin). Used in face recognition (FaceNet), siamese networks, self-supervised learning (SimCLR).
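The triplet formula max(0, d(a,p) − d(a,n) + margin) as a minimal sketch on toy 2-D embeddings (the margin value is an arbitrary choice for illustration):

```python
import numpy as np

def triplet_loss(a, p, n, margin=0.2):
    # a: anchor, p: positive, n: negative embeddings (Euclidean distance).
    d_ap = np.linalg.norm(a - p)
    d_an = np.linalg.norm(a - n)
    return max(0.0, d_ap - d_an + margin)

a = np.array([0.0, 0.0])
p = np.array([0.1, 0.0])   # close to the anchor
n = np.array([1.0, 1.0])   # far from the anchor

print(triplet_loss(a, p, n))  # 0.0: negative already beyond the margin
```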
9 What is Focal Loss? Where is it used? 🔥 Hard
Answer: Focal loss = −α_t·(1 − p_t)^γ · log(p_t). Modifies cross-entropy to down-weight easy examples and focus on hard, misclassified ones. Addresses extreme class imbalance in object detection (RetinaNet). γ = 2 is common; α_t is an optional class-balancing weight.
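A quick numerical sketch of the down-weighting effect (unweighted variant, α_t = 1):

```python
import numpy as np

def focal_loss(p_t, gamma=2.0):
    # p_t: predicted probability of the true class.
    # gamma = 0 recovers plain cross-entropy.
    return -((1.0 - p_t) ** gamma) * np.log(p_t)

easy = focal_loss(0.9)   # well-classified: (1-0.9)^2 shrinks the loss ~100x
hard = focal_loss(0.1)   # misclassified: keeps most of the CE penalty
print(easy, hard)
```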
10 What is CTC loss? Why is it useful? 🔥 Hard
Answer: Connectionist Temporal Classification (CTC) aligns input sequences to output sequences without pre-alignment. Used in speech recognition, OCR. It sums probabilities over all possible alignments via dynamic programming.
11 Heuristics: choose L1, L2, or Huber for regression? 📊 Medium
Answer: L2 (MSE): default, but outlier-sensitive. L1 (MAE): robust, but slower convergence. Huber: best of both – quadratic for small errors, linear for large. Smooth L1 used in detectors.
12 Why is cross-entropy always ≥ 0? 📊 Medium
Answer: Cross-entropy = -Σ p(x) log q(x). Each term p(x)·log q(x) ≤ 0, because p(x) ≥ 0 and log q(x) ≤ 0 (since q(x) ≤ 1); with the minus sign the sum is therefore non-negative. It reaches zero only when q assigns probability 1 to every outcome p puts mass on – e.g., one-hot targets predicted with full confidence.
13 Relation between perplexity and cross-entropy? 📊 Medium
Answer: Perplexity = 2^{H(p,q)} when cross-entropy H is measured in bits (base-2 log), or e^{H} in nats. It measures how "surprised" the model is by the data. Lower perplexity = better language model.
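A toy sketch: perplexity is the exponentiated per-token cross-entropy, i.e. the geometric mean of 1/p over the true tokens:

```python
import numpy as np

# Model probability assigned to each true token of a toy 3-token sequence:
probs = np.array([0.25, 0.5, 0.125])

ce_nats    = -np.mean(np.log(probs))  # per-token cross-entropy in nats
perplexity = np.exp(ce_nats)          # e^H in nats (equivalently 2^H in bits)

# Geometric mean of [1/0.25, 1/0.5, 1/0.125] = (4*2*8)^(1/3) = 4
print(perplexity)  # 4.0
```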
14 NLL vs Cross-Entropy – same? ⚡ Easy
Answer: For classification with one-hot targets, categorical cross-entropy = negative log-likelihood. NLL is just -log(p(y|x)). In PyTorch, `CrossEntropyLoss` = LogSoftmax + NLLLoss.
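The LogSoftmax + NLL decomposition that PyTorch's `CrossEntropyLoss` performs, mirrored in plain NumPy (illustrative sketch):

```python
import numpy as np

def log_softmax(z):
    z = z - z.max()                        # shift for numerical stability
    return z - np.log(np.sum(np.exp(z)))   # log of the softmax probabilities

def cross_entropy(z, target):
    # NLL of the true class under softmax = categorical CE with a one-hot target.
    return -log_softmax(z)[target]

z = np.array([2.0, 0.5, -1.0])  # raw logits
print(cross_entropy(z, target=0))
```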
15 What is Dice loss? Where is it used? 🔥 Hard
Answer: Dice = 1 - (2|X∩Y|)/(|X|+|Y|). Differentiable approximation of IoU. Used in medical image segmentation, imbalanced data. Handles pixel-wise class imbalance well.
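A soft (differentiable) Dice loss sketch on probability maps, with an epsilon term (an implementation convention, not from the original) to avoid division by zero:

```python
import numpy as np

def dice_loss(pred, target, eps=1e-6):
    # Soft Dice: replaces set intersection with an elementwise product,
    # so predicted probabilities can be used directly.
    inter = np.sum(pred * target)
    return 1.0 - (2.0 * inter + eps) / (np.sum(pred) + np.sum(target) + eps)

pred   = np.array([0.9, 0.8, 0.1, 0.2])  # predicted foreground probabilities
target = np.array([1.0, 1.0, 0.0, 0.0])  # ground-truth mask

print(dice_loss(pred, target))  # ~0.15
```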
16 Why use log in cross-entropy loss? 📊 Medium
Answer: Log converts multiplicative probabilities to additive; numerically stable. Also, maximizing likelihood = minimizing negative log-likelihood. Log loss heavily penalizes very wrong confident predictions.
17 Compare gradients of MSE and MAE. 📊 Medium
Answer: ∂MSE/∂ŷ = 2(ŷ − y); ∂MAE/∂ŷ = sign(ŷ − y). MSE gradient scales with the error; MAE gradient magnitude is constant (±1). MSE converges faster but is outlier-sensitive.
18 Loss function for ordinal regression? 🔥 Hard
Answer: CORAL (Consistent Rank Logits), a cumulative-link approach: decompose a K-class ordinal target into K−1 binary "y > k" classifiers with shared weights and sum their binary CE losses. Alternatively, treat it as regression with rounding, or use MSE/MAE if the scale is meaningful.
19 What is energy-based loss? 🔥 Hard
Answer: Energy-based models (EBM) assign scalar energy to configurations. Loss designed to push down energy of correct answer, pull up incorrect. Example: contrastive loss, hinge loss for EBM.
20 Designing a custom loss: key requirements? 🔥 Hard
Answer: Must be differentiable (almost everywhere), should correlate with evaluation metric, numerically stable, efficient. Also consider convexity (not strictly required) and gradient behavior.
Example: custom IoU loss, focal loss, Huber.

Loss Functions – Interview Cheat Sheet

Regression
  • L2 (MSE) – sensitive to outliers
  • L1 (MAE) – robust, constant gradient
  • Huber – robust + smooth
  • Smooth L1 – Huber with δ = 1 (detectors)
Classification
  • CE – binary / categorical
  • Hinge – max-margin
  • Focal – class imbalance
Advanced
  • KL – VAEs, distribution matching
  • Contrastive – Siamese nets
  • Triplet – face recognition
  • CTC – sequence alignment
  • Dice – segmentation
Outlier robust
  • MAE, Huber, Smooth L1

Verdict: "Task dictates loss – regression, classification, ranking, alignment."

20 loss Q/A covered.