Loss Functions
20 Essential Q/A
DL Interview Prep
Deep Learning Loss Functions: 20 Interview Questions
Master MSE, MAE, Binary/Categorical Cross-Entropy, Hinge, Huber, Contrastive, Triplet, KL Divergence, CTC, and more. When to use each, gradient behavior, robustness – concise, interview-ready answers.
Topics: MSE · Cross-Entropy · MAE · Hinge · KL Div · Huber
1
What is a loss function in deep learning?
⚡ Easy
Answer: A loss function (cost/objective) quantifies the error between model predictions and true targets. Training minimizes this loss via gradient descent. Choice of loss depends on task: regression (L1, L2), classification (cross-entropy), ranking (hinge), etc.
ℒ(ŷ, y) : measure of "how wrong" the model is.
2
Compare MSE and MAE. When to use each?
📊 Medium
Answer: MSE = mean( (y-ŷ)² ), MAE = mean( |y-ŷ| ). MSE penalizes large errors more (squared), sensitive to outliers. MAE is robust to outliers. Use MSE when outliers are rare/need to be emphasized; MAE when robustness is needed. MSE gradient magnitude ∝ error, MAE gradient constant (±1).
MSE: smooth and convex, but sensitive to outliers.
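As an illustrative NumPy sketch (not from the source), a single outlier dominates MSE while barely moving MAE:

```python
import numpy as np

def mse(y, yhat):
    # Mean squared error: squaring magnifies large residuals
    return float(np.mean((y - yhat) ** 2))

def mae(y, yhat):
    # Mean absolute error: every unit of residual counts equally
    return float(np.mean(np.abs(y - yhat)))

y = np.array([1.0, 2.0, 3.0, 100.0])    # last target is an outlier
yhat = np.array([1.0, 2.0, 3.0, 3.0])   # model misses only the outlier

# MSE is dominated by the single 97-unit residual; MAE is not.
```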
3
Why use cross-entropy for classification, not MSE?
🔥 Hard
Answer: Cross-entropy with softmax/sigmoid gives strong gradients even when the prediction is confidently wrong. MSE + sigmoid saturates: the gradient carries an extra σ'(z) = p(1-p) factor that vanishes as p approaches 0 or 1 – vanishing gradient. CE is also probabilistic (minimizing it minimizes KL divergence), directly optimizing log-likelihood, and is convex in the parameters for linear models.
Binary CE: -[y log(p) + (1-y) log(1-p)] vs MSE: (y-p)²
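A small sketch (illustrative, using the standard closed-form gradients with respect to the logit z) shows the saturation: for a confidently wrong prediction, the CE gradient stays near 1 while the MSE gradient collapses:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def grad_ce_wrt_logit(z, y):
    # d/dz of -[y log p + (1-y) log(1-p)] with p = sigmoid(z) is simply p - y
    return sigmoid(z) - y

def grad_mse_wrt_logit(z, y):
    # d/dz of (p - y)^2 is 2 (p - y) * p * (1 - p): the p(1-p) factor saturates
    p = sigmoid(z)
    return 2.0 * (p - y) * p * (1.0 - p)

# Confidently wrong prediction: logit z = -10 while the true label is 1
z, y = -10.0, 1.0
```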
4
Binary vs Categorical Cross-Entropy: difference?
⚡ Easy
Answer: Binary CE for 2 classes (single sigmoid output). Categorical CE for ≥3 classes (softmax output). For multi-label (multiple binary tasks), use binary CE per output.
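A quick NumPy sketch (illustrative) confirms that with two classes the two formulations give the same loss:

```python
import numpy as np

def binary_ce(y, p):
    # Single sigmoid probability p for the positive class
    return float(-(y * np.log(p) + (1 - y) * np.log(1 - p)))

def categorical_ce(y_onehot, probs):
    # Softmax probabilities over K classes, one-hot target
    return float(-np.sum(y_onehot * np.log(probs)))

# Same 2-class prediction expressed both ways
p_pos = 0.8
bce = binary_ce(1, p_pos)
cce = categorical_ce(np.array([1.0, 0.0]), np.array([p_pos, 1 - p_pos]))
```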
5
What is Hinge loss? Where is it used?
📊 Medium
Answer: Hinge: max(0, 1 - y·ŷ) for y ∈ {-1,1}. Used in SVMs and max-margin classifiers. Encourages correct classification with a margin. Not differentiable at margin; subgradient used. Less common in deep nets but used in Siamese nets (contrastive hinge).
L = Σ max(0, 1 - y_i * f(x_i))
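A minimal NumPy sketch of this formula (illustrative): samples beyond the margin contribute zero loss, samples inside it contribute linearly:

```python
import numpy as np

def hinge_loss(y, scores):
    # y in {-1, +1}; zero loss (zero subgradient) once y * f(x) >= 1
    return float(np.mean(np.maximum(0.0, 1.0 - y * scores)))

y = np.array([1.0, -1.0, 1.0])
scores = np.array([2.0, -0.5, 0.3])   # first sample is beyond the margin
```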
6
Explain Huber loss. When is it useful?
🔥 Hard
Answer: Huber loss = MSE for small error, MAE for large error (quadratic near zero, linear otherwise). Smooth, less sensitive to outliers than MSE, differentiable. Used in robust regression (e.g., object detection bounding boxes – Smooth L1 is similar).
# Smooth L1 (Huber with delta = 1)
def smooth_l1(x):
    return 0.5 * x**2 if abs(x) < 1 else abs(x) - 0.5
7
KL Divergence vs Cross-Entropy: relation?
🔥 Hard
Answer: Cross-Entropy = H(p,q) = H(p) + KL(p||q). Minimizing cross-entropy is equivalent to minimizing KL divergence if p is fixed (target distribution). In VAEs, we minimize KL(q(z|x) || p(z)) to regularize latent space.
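The identity H(p,q) = H(p) + KL(p||q) can be checked numerically with a small NumPy sketch (illustrative):

```python
import numpy as np

def entropy(p):
    return float(-np.sum(p * np.log(p)))

def cross_entropy(p, q):
    return float(-np.sum(p * np.log(q)))

def kl_div(p, q):
    return float(np.sum(p * np.log(p / q)))

p = np.array([0.7, 0.2, 0.1])   # fixed target distribution
q = np.array([0.5, 0.3, 0.2])   # model distribution

# H(p, q) = H(p) + KL(p || q); since H(p) is constant,
# minimizing cross-entropy in q minimizes KL(p || q)
```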
8
What are Contrastive and Triplet losses?
🔥 Hard
Answer: Contrastive: pulls positive pairs together, pushes negative apart (margin). Triplet: anchor, positive, negative; loss = max(0, d(a,p) - d(a,n) + margin). Used in face recognition (FaceNet), siamese networks, self-supervised learning (SimCLR).
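A minimal sketch of the triplet formula with Euclidean distances (illustrative; real systems like FaceNet apply it to L2-normalized embeddings):

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    # loss = max(0, d(a, p) - d(a, n) + margin)
    d_ap = np.linalg.norm(anchor - positive)
    d_an = np.linalg.norm(anchor - negative)
    return float(max(0.0, d_ap - d_an + margin))

anchor = np.array([0.0, 0.0])
positive = np.array([0.1, 0.0])
easy_neg = np.array([1.0, 0.0])    # already far: zero loss
hard_neg = np.array([0.15, 0.0])   # violates the margin: positive loss
```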
9
What is Focal Loss? Where is it used?
🔥 Hard
Answer: Focal loss = -(1-p_t)^γ * log(p_t). Modifies cross-entropy to down-weight easy examples, focus on hard misclassified. Solves class imbalance in object detection (RetinaNet). γ=2 common.
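A one-function sketch of this formula (illustrative): with γ=2, an easy example (p_t = 0.9) is down-weighted by a factor of 100 relative to plain CE, while a hard example (p_t = 0.1) keeps most of its loss:

```python
import numpy as np

def focal_loss(p_t, gamma=2.0):
    # p_t: probability the model assigns to the true class
    # (1 - p_t)^gamma down-weights well-classified (easy) examples
    return float(-(1.0 - p_t) ** gamma * np.log(p_t))
```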
10
What is CTC loss? Why is it useful?
🔥 Hard
Answer: Connectionist Temporal Classification (CTC) aligns input sequences to output sequences without pre-alignment. Used in speech recognition, OCR. It sums probabilities over all possible alignments via dynamic programming.
11
Heuristics: choose L1, L2, or Huber for regression?
📊 Medium
Answer: L2 (MSE): default, but outlier-sensitive. L1 (MAE): robust, but slower convergence. Huber: best of both – quadratic for small errors, linear for large. Smooth L1 used in detectors.
12
Why is cross-entropy always ≥ 0?
📊 Medium
Answer: Cross-entropy = -Σ p(x) log q(x). Since 0 ≤ q(x) ≤ 1, log q(x) ≤ 0, so every term p(x) log q(x) ≤ 0; negating the sum makes it non-negative. It equals zero only when p is deterministic (one-hot) and q matches it exactly.
13
Relation between perplexity and cross-entropy?
📊 Medium
Answer: Perplexity = b^{H(p,q)}, where b is the base of the logarithm used for cross-entropy H (e^H in nats, 2^H in bits). It measures how "surprised" the model is by the data. Lower perplexity = better language model.
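A small NumPy sketch (illustrative): a model that assigns probability 1/4 to every true token has the perplexity of a uniform choice among 4 options:

```python
import numpy as np

def perplexity(probs_of_true_tokens):
    # Cross-entropy in nats, so perplexity = exp(H); with log2 it would be 2**H
    h = -np.mean(np.log(probs_of_true_tokens))
    return float(np.exp(h))
```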
14
NLL vs Cross-Entropy – same?
⚡ Easy
Answer: For classification with one-hot targets, categorical cross-entropy = negative log-likelihood. NLL is just -log(p(y|x)). In PyTorch, `CrossEntropyLoss` = LogSoftmax + NLLLoss.
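The PyTorch decomposition can be reproduced in plain NumPy (illustrative sketch): cross-entropy for a class is the NLL of the log-softmax output:

```python
import numpy as np

def log_softmax(logits):
    # Subtract the max before exponentiating for numerical stability
    z = logits - np.max(logits)
    return z - np.log(np.sum(np.exp(z)))

def nll(log_probs, target):
    # Negative log-likelihood of the target class
    return float(-log_probs[target])

logits = np.array([2.0, 1.0, -1.0])
# Equivalent to CrossEntropyLoss(logits, target=0) in PyTorch
loss = nll(log_softmax(logits), 0)
```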
15
What is Dice loss? Where is it used?
🔥 Hard
Answer: Dice = 1 - (2|X∩Y|)/(|X|+|Y|). Differentiable approximation of IoU. Used in medical image segmentation, imbalanced data. Handles pixel-wise class imbalance well.
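A minimal soft-Dice sketch (illustrative; the small epsilon avoiding division by zero is a common convention, not from the source):

```python
import numpy as np

def dice_loss(pred, target, eps=1e-7):
    # pred: soft probabilities in [0, 1]; target: binary mask (flattened)
    intersection = np.sum(pred * target)
    dice = (2.0 * intersection + eps) / (np.sum(pred) + np.sum(target) + eps)
    return float(1.0 - dice)

pred = np.array([1.0, 1.0, 0.0, 0.0])     # one false positive
target = np.array([1.0, 0.0, 0.0, 0.0])
```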
16
Why use log in cross-entropy loss?
📊 Medium
Answer: Log converts multiplicative probabilities to additive; numerically stable. Also, maximizing likelihood = minimizing negative log-likelihood. Log loss heavily penalizes very wrong confident predictions.
17
Compare gradients of MSE and MAE.
📊 Medium
Answer: ∂MSE/∂ŷ = 2(ŷ - y); ∂MAE/∂ŷ = sign(ŷ - y). The MSE gradient scales with the error, while the MAE gradient has constant magnitude (±1). MSE converges faster but is outlier-sensitive.
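A quick numeric check of these gradients (illustrative NumPy sketch):

```python
import numpy as np

def grad_mse(yhat, y):
    # Per-example MSE gradient: grows linearly with the residual
    return 2.0 * (yhat - y)

def grad_mae(yhat, y):
    # Per-example MAE (sub)gradient: constant magnitude +/- 1
    return np.sign(yhat - y)

residuals = np.array([0.1, 1.0, 10.0])
```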
18
Loss function for ordinal regression?
🔥 Hard
Answer: CORAL (COnsistent RAnk Logits), which decomposes K ordered classes into K-1 cumulative binary classification tasks with shared weights, or a cumulative-link (ordinal logistic) model. Alternatively, treat the problem as regression with rounding, or use MSE/MAE if the scale is meaningful.
19
What is energy-based loss?
🔥 Hard
Answer: Energy-based models (EBMs) assign a scalar energy to each configuration. The loss is designed to push down the energy of correct answers and pull up the energy of incorrect ones. Examples: contrastive loss and hinge loss in the EBM framework.
20
Designing a custom loss: key requirements?
🔥 Hard
Answer: Must be differentiable (almost everywhere), should correlate with evaluation metric, numerically stable, efficient. Also consider convexity (not strictly required) and gradient behavior.
Example: custom IoU loss, focal loss, Huber.
Loss Functions – Interview Cheat Sheet
Regression
- L2 / MSE – sensitive to outliers
- L1 / MAE – robust, constant gradient
- Huber – robust + smooth
- Smooth L1 – Huber with δ = 1; standard in detectors
Classification
- CE – binary / categorical
- Hinge – max-margin
- Focal – class imbalance
Advanced
- KL – VAEs, distribution matching
- Contrastive – Siamese nets
- Triplet – face recognition
- CTC – sequence alignment
- Dice – segmentation
Outlier-robust
- MAE, Huber, Smooth L1
Verdict: "Task dictates loss – regression, classification, ranking, alignment."
20 loss Q/A covered
Backpropagation