Neural Network Fundamentals — Interview Q&A

Question 1

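1 What is a perceptron? How is it different from a neuron in deep learning? ⚡ Easy

Answer

Answer: A perceptron is the simplest artificial neuron, introduced by Frank Rosenblatt in the late 1950s. It multiplies its inputs by weights, sums them with a bias, and passes the result through a step activation that outputs 0 or 1. Because the step function has no useful gradient, a single perceptron is trained with the perceptron learning rule and can only separate linearly separable classes. Modern deep learning neurons use differentiable activation functions (ReLU, Sigmoid, Tanh), are arranged in multiple layers, and are trained with backpropagation.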

Question 2

2 Why do we need non-linear activation functions? ⚡ Easy

Answer

Answer: Without non-linearity, stacking multiple linear layers collapses into a single linear transformation, since W2(W1x) = (W2W1)x. Non-linear activations (ReLU, sigmoid, tanh) let neural networks approximate complex non-linear functions; the universal approximation theorem makes this precise for continuous functions on a compact domain.
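A minimal sketch of this collapse, assuming two hypothetical 2x2 weight matrices with no biases (the values of `W1`, `W2`, and `x` are illustrative only):

```python
def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def matmul(A, B):
    """Multiply matrices A and B, both given as lists of rows."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]

def relu(v):
    return [max(0.0, a) for a in v]

# Hypothetical weights for two stacked layers (no biases).
W1 = [[1.0, 2.0], [0.0, -1.0]]
W2 = [[0.5, 1.0], [2.0, 0.0]]
x = [3.0, 4.0]

two_linear_layers = matvec(W2, matvec(W1, x))   # W2 (W1 x)
single_layer = matvec(matmul(W2, W1), x)        # (W2 W1) x, one merged layer
with_relu = matvec(W2, relu(matvec(W1, x)))     # non-linearity between layers

print(two_linear_layers, single_layer, with_relu)
```

The first two results coincide, so the second linear layer adds no expressive power; inserting ReLU between the layers changes the output, which no single merged matrix reproduces in general.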

Question 3

3 Compare ReLU, Sigmoid, and Tanh activations. When to use each? 📊 Medium

Answer

Answer:

ReLU (max(0,x)): Default for hidden layers. Fast, sparse, mitigates vanishing gradient. Dead neurons issue.
Sigmoid (0 to 1): Output layer for binary classification. Prone to vanishing gradient.
Tanh (-1 to 1): Zero-centered, often used in RNNs/classical nets. Still suffers saturation.

Question 4

4 Explain backpropagation in simple terms. 📊 Medium

Answer

Answer: Backpropagation computes gradients of the loss with respect to each weight using the chain rule. It propagates error backward from output to input, layer by layer. These gradients are used by optimizers (SGD, Adam) to update weights and minimize loss.

Question 5

5 What is vanishing gradient? How do you fix it? 🔥 Hard

Answer

Answer: Vanishing gradient occurs when gradients become extremely small in early layers, preventing learning. Causes: deep networks with sigmoid/tanh. Fixes: Use ReLU, residual connections (ResNet), batch normalization, proper weight initialization (Xavier/He), LSTM gates.
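A rough numerical illustration of why depth plus sigmoid shrinks gradients. It assumes a hypothetical 20-layer chain in which every layer contributes the best-case sigmoid derivative and all weights equal 1; real networks differ, so treat it as a sketch only:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)

def chain_gradient(local_grad, depth):
    """Product of `depth` identical local derivatives (weights assumed to be 1)."""
    return local_grad ** depth

best_case_sigmoid = sigmoid_grad(0.0)                    # 0.25, the largest sigmoid derivative
through_sigmoid = chain_gradient(best_case_sigmoid, 20)  # shrinks exponentially with depth
through_relu = chain_gradient(1.0, 20)                   # ReLU derivative is 1 for active units

print(through_sigmoid, through_relu)
```

Even in the best case the sigmoid chain leaves almost no gradient for the earliest layer, while the active-ReLU chain passes it through unchanged.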

Question 6

6 What is the difference between batch gradient descent, SGD, and mini-batch? ⚡ Easy

Answer

Answer:

Batch GD: Uses the full dataset for each update. Accurate gradient, but slow and memory heavy.
SGD: One sample at a time. Fast updates, high variance.
Mini-batch: A subset (e.g., 32 or 64 samples) per update. Balances speed and stability. Most common.

Question 7

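7 Explain Dropout. Why does it work? 📊 Medium

Answer

Answer: Dropout randomly deactivates each neuron during training with probability p. It prevents co-adaptation of neurons, forces redundant representations, and acts like an implicit ensemble of thinned networks. At inference, all neurons are used and activations are scaled by the keep probability 1 - p; most frameworks instead use inverted dropout, which scales the surviving activations by 1/(1 - p) during training so that inference needs no rescaling.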
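A minimal sketch of the "inverted" dropout variant, which moves the rescaling into training so inference is a plain pass-through. The function name `dropout` and its parameters are hypothetical, not a library API:

```python
import random

def dropout(activations, p_drop, training, rng=random):
    """Inverted dropout: zero each unit with probability p_drop during training
    and rescale survivors by 1 / (1 - p_drop); identity at inference."""
    if not training or p_drop == 0.0:
        return list(activations)
    keep = 1.0 - p_drop
    return [a / keep if rng.random() < keep else 0.0 for a in activations]

rng = random.Random(0)            # fixed seed so the sketch is repeatable
acts = [1.0] * 10000
train_out = dropout(acts, p_drop=0.5, training=True, rng=rng)
eval_out = dropout(acts, p_drop=0.5, training=False)

# The rescaling keeps the expected activation roughly unchanged.
print(sum(train_out) / len(train_out))
```

Because survivors are scaled up during training, the mean activation stays close to its inference-time value, which is why no extra scaling is needed at test time.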

Question 8

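8 What is Batch Normalization? How does it help? 📊 Medium

Answer

Answer: BatchNorm normalizes each feature of a layer's output to zero mean and unit variance within each mini-batch, then applies a learnable scale (γ) and shift (β). At inference it uses running averages of the batch statistics. It stabilizes training, allows higher learning rates, and acts as a mild regularizer. It was originally motivated as reducing internal covariate shift, although later work attributes much of the benefit to a smoother loss landscape.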
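A sketch of the training-time forward computation for a single feature, assuming a hypothetical helper `batch_norm` with scale `gamma`, shift `beta`, and a small `eps` for numerical safety (running statistics for inference are left out):

```python
import math

def batch_norm(values, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize one feature across a mini-batch, then scale and shift."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n   # biased (population) variance
    return [gamma * (v - mean) / math.sqrt(var + eps) + beta for v in values]

batch = [2.0, 4.0, 6.0, 8.0]       # illustrative activations of one feature
normalized = batch_norm(batch)

out_mean = sum(normalized) / len(normalized)
out_var = sum((v - out_mean) ** 2 for v in normalized) / len(normalized)
print(out_mean, out_var)
```

The output has mean close to 0 and variance close to 1; `gamma` and `beta` then let the network undo the normalization if that is useful.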

Question 9

9 What is weight initialization? Why is it important? 📊 Medium

Answer

Answer: Weight initialization sets the starting values of the weights before training. Poor initialization causes vanishing or exploding gradients. Use Xavier (Glorot) initialization for tanh/sigmoid and He initialization for ReLU. Proper initialization speeds up convergence.

Question 10

10 What is the Universal Approximation Theorem? 🔥 Hard

Answer

Answer: A feedforward network with a single hidden layer and non-linear activation can approximate any continuous function on a compact domain, given enough neurons. Depth improves parameter efficiency, not just theoretical capacity.

Question 11

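11 What is the difference between epoch, batch, and iteration? ⚡ Easy

Answer

Answer:

Epoch: One full pass over the entire training set.
Batch: The subset of samples processed before one parameter update; its size is the batch size.
Iteration: One parameter update on one batch. Iterations per epoch = number of samples / batch size, rounded up.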
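A small worked example of how the three terms relate, assuming a hypothetical dataset of 1000 samples, a batch size of 32, and 10 epochs:

```python
import math

def iterations_per_epoch(num_samples, batch_size):
    """Number of parameter updates needed to see every sample once."""
    return math.ceil(num_samples / batch_size)

steps = iterations_per_epoch(num_samples=1000, batch_size=32)
last_batch_size = 1000 - (steps - 1) * 32   # the final batch is smaller
total_updates = steps * 10                  # iterations over 10 epochs

print(steps, last_batch_size, total_updates)
```

Here one epoch takes 32 iterations (31 full batches plus one batch of 8 samples), so 10 epochs amount to 320 parameter updates.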

Question 12

12 What is cross-entropy loss? When do you use it? ⚡ Easy

Answer

Answer: Cross-entropy measures difference between predicted probability distribution and true labels. Used for classification (binary: binary cross-entropy, multi-class: categorical cross-entropy). Preferred over MSE for classification.

Question 13

13 Explain underfitting and overfitting in neural networks. ⚡ Easy

Answer

Answer:

Underfitting: Model too simple, high bias. Error is high even on the training data. Fix: increase capacity, train longer.
Overfitting: Model memorizes noise, high variance. Low train error, high test error. Fix: dropout, regularization, more data.

Question 14

14 What is the role of the learning rate? ⚡ Easy

Answer

Answer: Learning rate controls step size during gradient descent. Too high: overshoot, divergence. Too low: slow convergence, gets stuck. Use learning rate schedules or adaptive optimizers (Adam).

Question 15

15 Compare Adam vs SGD optimizer. 📊 Medium

Answer

Answer:

SGD: Simple, requires manual LR tuning, may need momentum.
Adam: Adaptive LR + momentum, works well out-of-box, less sensitive to hyperparameters. Tends to generalize slightly worse than tuned SGD.

Question 16

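16 What is gradient clipping? When is it needed? 📊 Medium

Answer

Answer: Gradient clipping caps gradients after backprop and before the optimizer step, either element-wise to a threshold value or by rescaling the whole gradient when its norm exceeds a threshold. It prevents exploding gradients, which are common in RNNs and Transformers, and keeps training stable.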
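One common form is clipping by global norm. A minimal sketch, assuming the gradients are flattened into a single list and using a hypothetical helper name `clip_by_global_norm`:

```python
import math

def clip_by_global_norm(grads, max_norm):
    """Rescale the whole gradient vector if its L2 norm exceeds max_norm."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    if total_norm <= max_norm:
        return list(grads)
    scale = max_norm / total_norm
    return [g * scale for g in grads]

exploding = [3.0, 4.0]                              # L2 norm 5
clipped = clip_by_global_norm(exploding, max_norm=1.0)
untouched = clip_by_global_norm([0.3, 0.4], max_norm=1.0)

print(clipped, untouched)
```

Rescaling the whole vector preserves the gradient direction, whereas element-wise value clipping can change it.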

Question 17

17 What is the difference between a neural network and a deep neural network? ⚡ Easy

Answer

Answer: "Neural network" is a broad term. Deep neural network (DNN) typically has more than 2-3 hidden layers. Depth allows hierarchical feature learning. Shallow nets may suffice for simple tasks.

Question 18

18 What are skip connections? Why are they useful? 🔥 Hard

Answer

Answer: Skip connections (ResNet) add input of a layer to its output (F(x) + x). They alleviate vanishing gradient, enable training of very deep networks (>100 layers), and act as gradient superhighways.

Question 19

19 What is the F1 score? When is it better than accuracy? 📊 Medium

Answer

Answer: F1 is the harmonic mean of precision and recall: F1 = 2PR/(P + R). It is better than accuracy for imbalanced datasets. For example, in fraud detection with 99.9% negatives, a model that always predicts "not fraud" has high accuracy but is useless; F1 reflects minority class performance.
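The fraud example can be worked through numerically. This sketch assumes a hypothetical set of 1000 transactions containing 1 fraud case and a model that always predicts "not fraud"; the zero-division handling is a convention chosen for the sketch:

```python
def precision_recall_f1(tp, fp, fn):
    """Return (precision, recall, F1); undefined ratios are reported as 0.0."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

# Always-negative model: the single fraud case is missed.
tp, fp, fn, tn = 0, 0, 1, 999
accuracy = (tp + tn) / (tp + fp + fn + tn)
precision, recall, f1 = precision_recall_f1(tp, fp, fn)

print(accuracy, f1)
```

Accuracy comes out at 99.9% while F1 is 0, which exposes that the model never detects the minority class.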

Question 20

20 How do you decide the number of layers and neurons? 🔥 Hard

Answer

Answer: There is no fixed rule. Start with an architecture proven on similar tasks. Use validation error: increase capacity until the model overfits, then add regularization. Heuristic: more data → deeper/wider. The choice can be automated with hyperparameter search (grid, random, Bayesian).

Question 21

21 What is an activation function? Why is it non-linear? ⚡ Easy

Answer

Answer: An activation function maps a neuron's weighted input sum to its output, loosely deciding how strongly the neuron "fires". It introduces non-linearity: without it, stacked linear layers collapse into a single linear transformation. Non-linearity enables neural networks to approximate arbitrary continuous functions (universal approximation theorem).

Question 22

22 Explain Sigmoid activation. Where is it used? Main drawback? ⚡ Easy

Answer

Answer: Sigmoid: σ(x) = 1/(1 + e^{-x}), output in (0,1). Used in the output layer for binary classification (probability) and in LSTM gates. Drawbacks: vanishing gradient in saturated regions, not zero-centered, and the cost of computing exp().

Question 23

23 How is Tanh different from Sigmoid? Why is Tanh preferred in hidden layers? 📊 Medium

Answer

Answer: Tanh(x) = (e^{x} - e^{-x})/(e^{x} + e^{-x}), output range (-1,1). It is zero-centered, which helps gradient flow, but it still saturates and suffers from vanishing gradients. It is preferred over sigmoid in hidden units because zero-centered outputs avoid the same-sign weight gradients that sigmoid produces.

Question 24

24 What is ReLU? Why is it so widely used? ⚡ Easy

Answer

Answer: ReLU(x) = max(0, x). Advantages: non-saturating for positive inputs, cheap to compute, sparse activations, mitigates vanishing gradient. It is the default for most CNNs and deep architectures.

Question 25

25 What is the "dying ReLU" problem? Solutions? 📊 Medium

Answer

Answer: A neuron "dies" when its pre-activation is negative for all inputs, so it always outputs 0. Its gradient is then zero and it never recovers. Solutions: Leaky ReLU, PReLU, ELU, or a smaller learning rate.

Question 26

26 Differentiate Leaky ReLU, PReLU, and RReLU. 🔥 Hard

Answer

Answer: Leaky ReLU: f(x) = max(αx, x) with a small fixed α (e.g., 0.01). PReLU: α is learned. RReLU: α is randomly sampled during training. All address dying ReLU.

Question 27

27 What are ELU and SELU? When to use SELU? 🔥 Hard

Answer

Answer: ELU: f(x) = x if x > 0 else α(e^{x} - 1). Smooth, saturates to -α for large negative inputs, robust to noise. SELU is a scaled ELU that is self-normalizing: with proper initialization (LeCun normal), activations tend toward zero mean and unit variance. Use it for deep fully connected networks.

Question 28

28 Explain Softmax. Why use exponentials? 📊 Medium

Answer

Answer: Softmax converts logits to a probability distribution: softmax(z)_i = e^{z_i} / Σ_j e^{z_j}. Exponentials amplify differences and ensure positivity. Used in the multi-class output layer.
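A minimal sketch of softmax with the usual max-subtraction trick, which leaves the result unchanged but avoids overflow in `exp`. The example logits are hypothetical:

```python
import math

def softmax(logits):
    """Numerically stable softmax: subtracting the max does not change the result."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])
large = softmax([1000.0, 1001.0])   # math.exp(1000.0) alone would overflow

print(probs, large)
```

The outputs are positive, sum to 1, and preserve the ordering of the logits; the large-logit case still works because only differences between logits matter.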

Question 29

29 What are Swish and GeLU? Why do they outperform ReLU in Transformers? 🔥 Hard

Answer

Answer: Swish(x) = x·sigmoid(x) (smooth, non-monotonic). GeLU(x) = x·Φ(x), where Φ is the standard Gaussian CDF. Both allow small negative outputs with a smooth bump near zero. GeLU is used in BERT, GPT, and ViT; both provide smoother gradient flow and often higher accuracy than ReLU.

Question 30

30 Which activation functions are prone to vanishing gradient? Why? 📊 Medium

Answer

Answer: Sigmoid and Tanh: their gradients approach 0 for large |x|. ReLU does not saturate for positive inputs, and leaky variants also avoid saturation on the negative side. However, ReLU can have dead neurons; ELU and Swish keep small but nonzero gradients for moderately negative inputs.

Question 31

31 What activation function for regression output? ⚡ Easy

Answer

Answer: Linear (identity). No activation or f(x)=x. Allows unbounded values. For positive-only regression, use ReLU or Softplus.

Question 32

32 What is Softplus? Relation to ReLU? ⚡ Easy

Answer

Answer: Softplus(x) = ln(1 + e^{x}). Smooth approximation of ReLU. Used in some variational autoencoders to keep the predicted variance positive. Its output is never exactly zero, and it is computationally heavier than ReLU.

Question 33

33 Output activation for binary classification? ⚡ Easy

Answer

Answer: Sigmoid (single unit). Gives probability P(y=1). Loss: binary cross-entropy.

Question 34

34 Activation for multi-label classification? 📊 Medium

Answer

Answer: Sigmoid per output (independent probabilities). Not softmax (which forces one class). Use sigmoid + binary cross-entropy.

Question 35

35 Why not use step function as activation? 📊 Medium

Answer

Answer: Step function derivative is 0 everywhere (except discontinuity). Gradient descent can't learn. Need differentiable (or subgradient) functions.

Question 36

36 Why is zero-centered activation desirable? 🔥 Hard

Answer

Answer: If activations are always positive (sigmoid, ReLU), gradients of weights for a given neuron are all same sign. This can cause zig-zagging updates and slower convergence. Zero-centered (tanh) avoids this.

Question 37

37 Write derivative of ReLU and Leaky ReLU. 📊 Medium
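The standard derivatives can be sketched as follows: ReLU'(x) = 1 for x > 0 and 0 for x < 0; Leaky ReLU'(x) = 1 for x > 0 and α for x < 0. Neither function is differentiable at x = 0; this sketch assumes the common convention of using the negative-side value there, and α = 0.01 is an illustrative default rather than a required value.

```python
def relu_grad(x):
    """Derivative of max(0, x); the value at x == 0 is a convention (0 here)."""
    return 1.0 if x > 0 else 0.0

def leaky_relu_grad(x, alpha=0.01):
    """Derivative of max(alpha * x, x) for 0 < alpha < 1; alpha is used at x == 0."""
    return 1.0 if x > 0 else alpha

for x in (-2.0, 0.0, 3.0):
    print(x, relu_grad(x), leaky_relu_grad(x))
```

The nonzero slope on the negative side is what lets a Leaky ReLU unit keep receiving gradient when a plain ReLU unit would be dead.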

Question 38

38 Which activations are used in RNNs/LSTMs? Why? 📊 Medium

Answer

Answer: LSTM: sigmoid for gates (0-1 control), tanh for cell and hidden state updates. Gating and the additive cell-state update mitigate vanishing gradients; exploding gradients still need clipping. Modern RNNs sometimes use ReLU with careful initialization.

Question 39

39 What is Maxout activation? Pros/cons? 🔥 Hard

Answer

Answer: Maxout takes the max of k linear combinations of the input. It can approximate any convex function and, being piecewise linear, does not saturate. The cost is k times as many parameters per unit (double for k = 2). Rarely used today.

Question 40

40 Heuristic: which activation for hidden layers? ⚡ Easy

Answer

Answer: Default: ReLU. If dying ReLU, try Leaky ReLU/ELU. For very deep nets: Swish/GeLU. For regression head: linear. For self-normalizing nets: SELU.

Question 41

41 What is a loss function in deep learning? ⚡ Easy

Answer

Answer: A loss function (cost/objective) quantifies the error between model predictions and true targets. Training minimizes this loss via gradient descent. Choice of loss depends on task: regression (L1, L2), classification (cross-entropy), ranking (hinge), etc.

Question 42

42 Compare MSE and MAE. When to use each? 📊 Medium

Answer

Answer: MSE = mean((y - ŷ)²), MAE = mean(|y - ŷ|). MSE penalizes large errors more (squared), so it is sensitive to outliers. MAE is robust to outliers. Use MSE when outliers are rare or large errors should be emphasized; use MAE when robustness is needed. The MSE gradient magnitude is proportional to the error, while the MAE gradient is constant (±1).

Question 43

43 Why use cross-entropy for classification, not MSE? 🔥 Hard

Answer

Answer: Cross-entropy with softmax/sigmoid gives stronger gradients when the prediction is wrong. MSE with sigmoid saturates when the prediction is confidently wrong, which causes vanishing gradients. CE is also probabilistic: it directly optimizes the log-likelihood, which is equivalent to minimizing the KL divergence to the target distribution. CE is convex in the parameters for linear models.

Question 44

44 Binary vs Categorical Cross-Entropy: difference? ⚡ Easy

Answer

Answer: Binary CE for 2 classes (single sigmoid output). Categorical CE for ≥ 3 classes (softmax output). For multi-label problems (multiple binary tasks), use binary CE per output.

Question 45

45 What is Hinge loss? Where is it used? 📊 Medium

Answer

Answer: Hinge: max(0, 1 - y·ŷ) for y ∈ {-1, 1}. Used in SVMs and max-margin classifiers. Encourages correct classification with a margin. Not differentiable at the margin, so a subgradient is used. Less common in deep nets but used in Siamese nets (contrastive hinge).

Question 46

46 Explain Huber loss. When is it useful? 🔥 Hard

Answer

Answer: Huber loss behaves like MSE for small errors and like MAE for large errors: it is quadratic when |error| ≤ δ and linear otherwise. It is less sensitive to outliers than MSE and differentiable everywhere. Used in robust regression, e.g., bounding box regression in object detection, where the similar Smooth L1 loss is common.
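A sketch of the per-sample Huber loss under its usual definition, with a threshold `delta` (assumed to be 1.0 here) separating the quadratic and linear regions:

```python
def huber(error, delta=1.0):
    """Quadratic for |error| <= delta, linear beyond it; continuous at the threshold."""
    abs_err = abs(error)
    if abs_err <= delta:
        return 0.5 * error * error
    return delta * (abs_err - 0.5 * delta)

def squared(error):
    return 0.5 * error * error

small, outlier = 0.5, 10.0
print(huber(small), squared(small))       # identical in the quadratic region
print(huber(outlier), squared(outlier))   # Huber grows linearly, squared error quadratically
```

For the outlier the Huber penalty is 9.5 against 50.0 for the halved squared error, which is why a single bad target pulls the fit much less.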

Question 47

47 KL Divergence vs Cross-Entropy: relation? 🔥 Hard

Answer

Answer: Cross-Entropy = H(p,q) = H(p) + KL(p||q). Minimizing cross-entropy is equivalent to minimizing KL divergence if p is fixed (target distribution). In VAEs, we minimize KL(q(z|x) || p(z)) to regularize latent space.

Question 48

48 What are Contrastive and Triplet losses? 🔥 Hard

Answer

Answer: Contrastive: pulls positive pairs together, pushes negative apart (margin). Triplet: anchor, positive, negative; loss = max(0, d(a,p) - d(a,n) + margin). Used in face recognition (FaceNet), siamese networks, self-supervised learning (SimCLR).

Question 49

49 What is Focal Loss? Where is it used? 🔥 Hard

Answer

Answer: Focal loss = -(1 - p_t)^γ * log(p_t). It modifies cross-entropy to down-weight easy examples and focus on hard, misclassified ones. It addresses class imbalance in object detection (RetinaNet). γ = 2 is common.
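A small numerical sketch of the down-weighting effect, using the formula above without the optional class-balancing weight. The probabilities 0.9 ("easy") and 0.1 ("hard") are illustrative values:

```python
import math

def cross_entropy(p_t):
    """Standard cross-entropy for the true-class probability p_t."""
    return -math.log(p_t)

def focal_loss(p_t, gamma=2.0):
    """Cross-entropy scaled by the modulating factor (1 - p_t) ** gamma."""
    return ((1.0 - p_t) ** gamma) * cross_entropy(p_t)

easy, hard = 0.9, 0.1
easy_ratio = focal_loss(easy) / cross_entropy(easy)   # (1 - 0.9)^2 = 0.01
hard_ratio = focal_loss(hard) / cross_entropy(hard)   # (1 - 0.1)^2 = 0.81

print(easy_ratio, hard_ratio)
```

With γ = 2 the easy example keeps about 1% of its cross-entropy loss while the hard one keeps about 81%, so the many easy negatives no longer dominate the total loss.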

Question 50

50 What is CTC loss? Why is it useful? 🔥 Hard

Answer

Answer: Connectionist Temporal Classification (CTC) aligns input sequences to output sequences without pre-alignment. Used in speech recognition, OCR. It sums probabilities over all possible alignments via dynamic programming.

Question 51

51 Heuristics: choose L1, L2, or Huber for regression? 📊 Medium

Answer

Answer: L2 (MSE): default, but outlier-sensitive. L1 (MAE): robust, but slower convergence. Huber: a compromise, quadratic for small errors and linear for large ones. Smooth L1 is used in detectors.

Question 52

52 Why is cross-entropy always ≥ 0? 📊 Medium

Answer

Answer: Cross-entropy = -Σ p(x) log q(x). Since p(x) ≥ 0 and log q(x) ≤ 0 (because q(x) ≤ 1), every term p(x) log q(x) is non-positive; the minus sign makes the sum non-negative. With one-hot targets it is zero only if the prediction exactly matches the target; in general its minimum is the entropy H(p).

Question 53

53 Relation between perplexity and cross-entropy? 📊 Medium

Answer

Answer: Perplexity = 2^{H(p,q)} where H is cross-entropy (if using log base 2). It measures how "surprised" the model is. Lower perplexity = better language model.
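A minimal sketch of this relation, assuming the cross-entropy is estimated as the average negative log2 probability the model assigns to the true tokens (the probability lists are hypothetical):

```python
import math

def cross_entropy_bits(probs_of_true_tokens):
    """Average negative log2 probability assigned to the observed tokens."""
    n = len(probs_of_true_tokens)
    return -sum(math.log2(p) for p in probs_of_true_tokens) / n

def perplexity(probs_of_true_tokens):
    return 2 ** cross_entropy_bits(probs_of_true_tokens)

uniform_over_four = perplexity([0.25] * 8)   # model guesses uniformly among 4 tokens
confident = perplexity([0.5] * 8)            # model gives each true token probability 0.5

print(uniform_over_four, confident)
```

A perplexity of 4 means the model is as uncertain as a uniform choice among 4 tokens; assigning higher probability to the true tokens lowers it toward 1.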

Question 54

54 NLL vs Cross-Entropy: same? ⚡ Easy

Answer

Answer: For classification with one-hot targets, categorical cross-entropy = negative log-likelihood. NLL is just -log(p(y|x)). In PyTorch, `CrossEntropyLoss` = LogSoftmax + NLLLoss.

Question 55

55 What is Dice loss? Where is it used? 🔥 Hard

Answer

Answer: Dice loss = 1 - (2|X∩Y|)/(|X| + |Y|). The soft version, computed on predicted probabilities, is a differentiable overlap measure closely related to IoU. Used in medical image segmentation and other imbalanced settings. Handles pixel-wise class imbalance well.

Question 56

56 Why use log in cross-entropy loss? 📊 Medium

Answer

Answer: Log converts multiplicative probabilities to additive; numerically stable. Also, maximizing likelihood = minimizing negative log-likelihood. Log loss heavily penalizes very wrong confident predictions.

Question 57

57 Compare gradients of MSE and MAE. 📊 Medium

Answer

Answer: The MSE gradient scales with the error; the MAE gradient has constant magnitude (±1). MSE typically converges faster near the optimum but is outlier-sensitive.

Question 58

58 Loss function for ordinal regression? 🔥 Hard

Answer

Answer: Use an ordinal-aware loss such as CORAL (Consistent Rank Logits) or a cumulative link model, both of which learn ordered thresholds. Alternatively, treat the task as regression with rounding, or use MSE/MAE if the scale is meaningful.

Question 59

59 What is energy-based loss? 🔥 Hard

Answer

Answer: Energy-based models (EBM) assign scalar energy to configurations. Loss designed to push down energy of correct answer, pull up incorrect. Example: contrastive loss, hinge loss for EBM.

Question 60

60 Designing a custom loss: key requirements? 🔥 Hard

Answer

Answer: Must be differentiable (almost everywhere), should correlate with evaluation metric, numerically stable, efficient. Also consider convexity (not strictly required) and gradient behavior.

Question 61

61 What is backpropagation? Explain the intuition. ⚡ Easy

Answer

Answer: Backpropagation computes the gradient of the loss function with respect to every weight using the chain rule. It propagates error signals backward from output to input. Intuition: each neuron's contribution to the final error is measured, then weights are adjusted to reduce loss.

Question 62

62 How does chain rule work in backpropagation? 📊 Medium

Answer

Answer: The chain rule multiplies local gradients along the path from loss to weight. For a composition f(g(x)), the derivative is f'(g(x))·g'(x). In neural nets, gradients are multiplied backward layer by layer.

Question 63

63 Difference between forward pass and backward pass? ⚡ Easy

Answer

Answer: Forward pass computes predictions and caches intermediate activations. Backward pass computes gradients using cached values and chain rule. Forward is inference; backward is learning.

Question 64

64 What causes vanishing gradient in backpropagation? 📊 Medium

Answer

Answer: The derivatives of sigmoid (at most 0.25) and tanh (at most 1, near 0 when saturated) are small; repeated multiplication across layers makes gradients exponentially small in early layers. The deeper the network, the more multiplications and the stronger the effect.

Answer 64

Answer: Gradients become exponentially large due to large weights >1 or poor initialization. Causes unstable updates. Solutions: gradient clipping, weight regularization, careful initialization (Xavier/He).

Answer 65

Answer: Directed acyclic graph where nodes are operations/variables and edges define dependencies. Backprop traverses the graph in reverse topological order, multiplying gradients via the chain rule. Frameworks (TF, PyTorch) build autograd on this.

Answer 66

Answer: Backprop is a special case of reverse-mode autodiff. It efficiently computes gradients of scalar loss w.r.t many parameters in one forward+backward pass. Autograd builds the graph dynamically (PyTorch) or statically (TF1).

Answer 67

Answer: Chain rule for composite functions is multiplicative. Each layer's effect compounds; if one layer has zero gradient, whole branch dies. Multiplication reflects dependency strength. Addition would be for parallel paths (e.g., skip connections).

Answer 68

Answer: Zero init makes neurons symmetric: same gradient, same updates, no feature diversity. Backprop would compute identical gradients for all neurons in a layer, preventing learning. Random init breaks symmetry.

Answer 69

Answer: Numerically approximate the gradient as (L(θ+ε) - L(θ-ε))/(2ε) and compare it with the analytical backprop gradient. Used for debugging. It must be disabled during training because it is expensive.
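A minimal sketch of such a check on a scalar problem. The loss `theta ** 3` and its hand-written derivative stand in for a real model and its backprop gradient; both are illustrative assumptions:

```python
def loss(theta):
    """Hypothetical scalar loss used only for the check."""
    return theta ** 3

def analytic_grad(theta):
    """Hand-derived gradient, standing in for the backprop result."""
    return 3 * theta ** 2

def numeric_grad(f, theta, eps=1e-5):
    """Central difference: (f(theta + eps) - f(theta - eps)) / (2 * eps)."""
    return (f(theta + eps) - f(theta - eps)) / (2 * eps)

theta = 2.0
numeric = numeric_grad(loss, theta)
analytic = analytic_grad(theta)
rel_error = abs(numeric - analytic) / max(abs(numeric), abs(analytic))

print(numeric, analytic, rel_error)
```

A very small relative error suggests the analytical gradient is right; a large one usually points to a bug in the backward pass.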

Answer 70

Answer: Gradient only passes to the neuron that achieved the max (argmax). Others get zero gradient. It's like a switch: route error to winner.

Answer 71

Answer: Combined gradient simplifies to (p - y) where p is softmax output, y is one-hot target. Very elegant and numerically stable.
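A quick numerical check of the (p - y) result, assuming hypothetical logits `z`, a one-hot target at index 1, and a finite-difference step of 1e-6:

```python
import math

def softmax(z):
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def ce_loss(z, target):
    """Cross-entropy of softmax(z) against a one-hot target at index `target`."""
    return -math.log(softmax(z)[target])

z = [1.0, 2.0, 0.5]
target = 1

# Analytical gradient with respect to the logits: p - y.
p = softmax(z)
analytic = [p_i - (1.0 if i == target else 0.0) for i, p_i in enumerate(p)]

# Central finite differences, one logit at a time.
eps = 1e-6
numeric = []
for i in range(len(z)):
    up = z[:i] + [z[i] + eps] + z[i + 1:]
    down = z[:i] + [z[i] - eps] + z[i + 1:]
    numeric.append((ce_loss(up, target) - ce_loss(down, target)) / (2 * eps))

print(analytic, numeric)
```

The two gradients agree to within the finite-difference error, and the entries of p - y sum to zero because both p and y sum to 1.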

Answer 72

Answer: Backprop requires intermediate activations (input to ReLU) to compute gradient. In-place overwrites them, breaking the graph. PyTorch/TF usually avoid or handle carefully.

Answer 73

Answer: Yes, via automatic differentiation on the gradient graph (e.g., PyTorch `torch.autograd.grad`). Used in meta-learning, Hessian-free optimization, etc.

Answer 74

Answer: Backprop is the algorithm applied to neural nets; reverse-mode autodiff is the general technique. Backprop = reverse-mode AD applied to a scalar loss with caching.

Answer 75

Answer: For vector functions, the local gradient is a Jacobian matrix (∂output/∂input). Backprop multiplies Jacobians along the path. In practice, frameworks use vector-Jacobian products (VJPs) for efficiency.

Answer 76

Answer: δ_i^l = ∂L/∂z_i^l, where z_i^l is the pre-activation of unit i at layer l. It represents how much the total loss changes when that pre-activation changes. Propagated backward: δ^l = ((θ^{l+1})^T δ^{l+1}) ⊙ σ'(z^l).

Answer 77

Answer: BPTT unfolds RNN through time steps, treats it as a deep network with shared weights. Gradients sum over time. Suffers from vanishing/exploding due to repeated multiplications. Truncated BPTT limits steps.

Answer 78

Answer: Initialized weights in a sensible region, avoiding vanishing gradients. Backprop then fine-tuned. Modern techniques (ReLU, batch norm, good init) made pretraining less critical.

Answer 79

Answer: Skip connections create an alternative gradient highway: the identity mapping. Gradient can flow directly through the skip path, mitigating vanishing gradient and enabling very deep networks (>100 layers).

Answer 80

Answer: An optimizer is an algorithm that updates model parameters (weights) to minimize the loss function. It implements a variant of gradient descent, controlling learning rate, momentum, and adaptive per-parameter updates.

Answer 81

Answer: Pros: Simple, memory efficient, generalizes well. Cons: Slow convergence, sensitive to learning rate, oscillations in ravines, struggles with sparse data.

Answer 82

Answer: Accumulates a velocity vector in the direction of persistent gradients. Accelerates convergence, dampens oscillations. v_t = γ v_{t-1} + η∇L; θ -= v_t.
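The update rule can be sketched on a one-dimensional quadratic. The loss L(θ) = θ², the learning rate 0.1, and the momentum coefficient 0.9 are illustrative assumptions:

```python
def sgd_momentum_step(theta, velocity, grad, lr=0.1, gamma=0.9):
    """One momentum update: v = gamma * v + lr * grad, then theta = theta - v."""
    velocity = gamma * velocity + lr * grad
    return theta - velocity, velocity

first_theta, first_v = sgd_momentum_step(5.0, 0.0, grad=2 * 5.0)

theta, v = 5.0, 0.0
for _ in range(200):
    grad = 2 * theta          # gradient of the hypothetical loss theta ** 2
    theta, v = sgd_momentum_step(theta, v, grad)

print(first_theta, first_v, theta)
```

The velocity carries over between steps, so the iterate overshoots and oscillates around the minimum at 0 before settling, which is the typical momentum behavior on a simple bowl.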

Answer 83

Answer: NAG computes the gradient at the "lookahead" position (θ - γ·v_prev). This gives a more accurate update, reducing oscillations. Often faster convergence.

Answer 84

Answer: AdaGrad adapts learning rate per parameter: scales inversely with sqrt(sum of squared gradients). Good for sparse data (e.g., embeddings, NLP). Major drawback: learning rate monotonically decays to zero.

Answer 85

Answer: RMSprop uses an exponential moving average of squared gradients instead of a cumulative sum, which prevents the learning rate from vanishing. E[g²]_t = β·E[g²]_{t-1} + (1-β)·(∇L)². Step = η/√(E[g²]_t + ε)·∇L.

Answer 86

Answer: Adam = RMSprop + Momentum. Maintains first moment (mean) and second moment (uncentered variance) of gradients. Bias correction for initial steps. Default β1=0.9, β2=0.999, ε=1e-8. Popular due to fast convergence and robustness.
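A single-parameter sketch of one Adam step with bias correction, using the default hyperparameters listed above. The function name `adam_step`, the learning rate 0.001, and the starting values are assumptions for illustration:

```python
import math

def adam_step(theta, m, v, grad, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a scalar parameter; t is the 1-based step count."""
    m = beta1 * m + (1 - beta1) * grad          # first moment (mean of gradients)
    v = beta2 * v + (1 - beta2) * grad * grad   # second moment (uncentered variance)
    m_hat = m / (1 - beta1 ** t)                # bias correction for zero init
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta, m, v

theta, m, v = 1.0, 0.0, 0.0
theta, m, v = adam_step(theta, m, v, grad=4.0, t=1)

print(theta, m, v)
```

After bias correction the first step has magnitude close to the learning rate regardless of the gradient's scale, which is part of why Adam needs little tuning.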

Answer 87

Answer: AdamW decouples weight decay from gradient updates. With plain Adam, L2 regularization is added to the loss, so the penalty gradient is rescaled by the adaptive step; AdamW instead subtracts the weight decay term directly from the parameters. Leads to better generalization, widely used in Transformers (BERT, ViT).

Answer 88

Answer: Nadam = Adam + Nesterov momentum. It applies Nesterov lookahead on top of Adam's momentum. Sometimes converges slightly faster than Adam.

Answer 89

Answer: AdaBelief modifies Adam: second moment v_t = β2·v_{t-1} + (1-β2)·(∇L - m_t)². Step is η/(√v_hat + ε)·m_hat. Intuition: adapts to "belief" in the observed gradient direction. More stable, often better generalization.

Answer 90

Answer: Lion (Evolved Sign Momentum) uses the sign of a combination of momentum and gradient. Update: θ = θ - η·sign(β1·m + (1-β1)·∇L). Memory efficient, outperforms AdamW in some large-scale tasks.

Answer 91

Answer: Step decay (reduce by factor every few epochs), exponential decay, cosine annealing, linear warmup. Warmup helps Adam in early training (prevents large variance). Cosine decay popular in Transformers.

Answer 92

Answer: Hypothesis: Adam may converge to sharper minima, while SGD finds flatter minima (better generalization). Also, adaptive methods have implicit regularization differences. However, AdamW with decoupled weight decay narrows the gap.

Answer 93

Answer: Clipping limits gradient magnitude to avoid exploding gradients (RNNs, Transformers). Applied per-sample or globally. Essential for LSTM, but also used with Adam in large Transformers.

Answer 94

Answer: In early steps, Adam's second moment (v) is small, causing large effective LR. Warmup gradually increases LR, stabilizing training. Critical for large-scale Transformer training (BERT, GPT).

Answer 95

Answer: AdaMax replaces the L2 norm in Adam with the L-infinity norm. v_t = max(β2·v_{t-1}, |∇L|). More stable for some problems, less common.

Answer 96

Answer: Adam can sometimes increase learning rate (when v decreases). AMSGrad ensures v_t is monotonic: v_hat = max(v_hat, v_t). Guarantees non-increasing step size. Marginal improvement in practice.

Answer 97

Answer: AdaGrad, RMSprop, or Adam with sparse updates (lazy Adam). Sparse gradients benefit from per-parameter adaptive LR.

Answer 98

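Answer: The Hessian is huge (billions of params). Quasi-Newton approximations such as L-BFGS are expensive, need large batches, and cope poorly with noisy gradients. They are mostly used on small-scale, full-batch problems; curvature approximations such as K-FAC exist for deep nets but are rare.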

Answer 99

Answer: Default: AdamW with cosine decay + warmup (Transformers, CNNs). For NLP/Transformers: AdamW. For CV: SGD with momentum (generalizes well) or AdamW. For sparse embeddings: Adam/AdaGrad. For memory-limited: SGD or Lion.
