Interview Q&A: 75 Questions

Network Design & Regularization — Interview Q&A

Network design, weight initialization, batch norm, overfitting, and dropout.

Network Design & Depth: 15 Interview Questions

1 What does “network design” mean in an interview? (Easy)

Answer: Choosing depth, width, connectivity patterns (residual, dense), input/output heads, and regularization hooks so the model has enough capacity for the task while still fitting the available data and compute budget.

2 Depth vs width: trade-offs. (Medium)

Answer: Depth composes features hierarchically; can improve sample efficiency for structured tasks. Width increases representational power per layer. Very deep nets need care (residuals, normalization); very wide nets cost parameters and memory.

3 What is model capacity? (Easy)

Answer: Roughly the family of functions the architecture can represent (VC-style intuition or parameter count as proxy). High capacity can overfit small data; too low underfits.

4 What is a bottleneck layer? (Medium)

Answer: A layer with fewer units than its neighbors, forcing compression of the representation; used in autoencoders, Inception modules, and some efficient conv blocks (1×1 convs).

5 Why do skip (residual) connections help very deep nets? (Medium)

Answer: They provide gradient highways and make it easier to learn near-identity refinements (“residual mapping”). This mitigates degradation and vanishing signal in deep stacks.

6 What is inductive bias? (Medium)

Answer: Prior assumptions baked into the architecture, e.g. CNNs assume locality and translation equivariance; RNNs assume sequential dependence. A good bias improves data efficiency.

7 Receptive field: why does it matter for CNN design? (Medium)

Answer: The region of input affecting one output neuron. It must grow large enough to capture context (objects, text n-grams in 1D CNNs); deeper stacks, dilated convs, or pooling all increase the effective RF.
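As a rough illustration of how depth, pooling, and dilation grow the receptive field, the sketch below applies the standard per-axis recurrence to a chain of conv/pool layers. The function name and the `(kernel, stride, dilation)` tuple layout are illustrative choices, not taken from any library.

```python
def receptive_field(layers):
    """Receptive field (in input pixels, along one axis) of a layer chain.

    layers: list of (kernel_size, stride, dilation) tuples, input to output.
    """
    rf = 1    # a single input pixel sees itself
    jump = 1  # spacing, in input pixels, between adjacent units of the current layer
    for kernel, stride, dilation in layers:
        rf += (kernel - 1) * dilation * jump
        jump *= stride
    return rf


three_convs = [(3, 1, 1)] * 3                        # three stacked 3-wide convs
with_pooling = [(3, 1, 1), (2, 2, 1), (3, 1, 1)]     # conv, 2x pool, conv
dilated = [(3, 1, 1), (3, 1, 2), (3, 1, 4)]          # dilations 1, 2, 4

rf_plain = receptive_field(three_convs)
rf_pool = receptive_field(with_pooling)
rf_dilated = receptive_field(dilated)
```

Under these assumptions, the dilated stack reaches roughly twice the receptive field of the plain stack with the same number of weights.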

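8 Parameters vs FLOPs: are both needed? (Easy)

Answer: Yes. Parameters drive memory and overfitting risk; FLOPs drive latency and training cost. A layer can be compute-heavy but parameter-light (a small conv kernel applied over a large feature map) or the opposite (a large fully connected layer, which performs only one multiply-add per weight).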
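To make the distinction concrete, here is a small counting sketch. It assumes bias-free layers and counts one multiply-accumulate (MAC) per weight per output position; the helper names and the example layer sizes are hypothetical.

```python
def conv_cost(c_in, c_out, k, h_out, w_out):
    """(params, MACs) of a standard k x k convolution, ignoring bias."""
    params = k * k * c_in * c_out
    return params, params * h_out * w_out  # every weight is reused at each output position


def depthwise_separable_cost(c_in, c_out, k, h_out, w_out):
    """(params, MACs) of a depthwise k x k conv followed by a 1 x 1 pointwise conv."""
    params = k * k * c_in + c_in * c_out
    return params, params * h_out * w_out


def linear_cost(d_in, d_out):
    """(params, MACs) of a fully connected layer: each weight is used once per example."""
    params = d_in * d_out
    return params, params


conv = conv_cost(64, 128, 3, 56, 56)
separable = depthwise_separable_cost(64, 128, 3, 56, 56)
dense = linear_cost(4096, 4096)
```

In this example the dense layer holds far more parameters than the standard conv yet needs far fewer MACs, while the separable conv reduces both counts relative to the standard conv.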

9 Signs your network is too small (underfitting). (Easy)

Answer: Training loss stays high; both train and validation error poor. Fix: more layers/units, better features, or longer training if optimization was the issue.

10 Signs your network is too large (overfitting). (Easy)

Answer: Training loss is low but validation loss is much worse. Fix: regularization, more data, a smaller model, early stopping; the answer is not always “more parameters.”

11 “Scaling laws” in one interview sentence. (Hard)

Answer: Empirically, loss often falls predictably, roughly as a power law, as parameters, data, and compute grow together; this guides large-model training budgets but doesn’t replace task-specific design.

12 Multi-branch architectures (e.g. Inception idea). (Hard)

Answer: Parallel paths with different kernel sizes or operations capture multi-scale features; concatenation or addition fuses them, giving richer features than a single tower at the cost of complexity.

13 How does input resolution affect design? (Medium)

Answer: Higher resolution means more spatial positions (tokens) and more compute: cost grows quadratically in token count for self-attention and linearly in pixel count for convolution. You may need deeper nets for a large enough receptive field, or early downsampling to control cost.

14 When should you start from a pretrained architecture? (Medium)

Answer: When data is small or the domain is similar to pretraining: reuse the backbone and replace the classifier head. Random init can be the better choice when data is huge or domain mismatch is extreme (with caveats).

15 Practical order for picking depth and width. (Medium)

Answer: Start from a known baseline (ResNet-18, small Transformer), match the parameter budget to GPU memory and dataset size, measure train vs val curves, then adjust depth/width/regularization, rather than guessing a huge model first.

Mention train/val gap and compute budget: it signals that you design empirically, not only from theory.

Weight Initialization: 15 Interview Questions

16 Why can’t we set all weights to zero? (Easy)

Answer: Neurons in a layer stay identical: same outputs, same gradients, same updates, so symmetry never breaks. You need random (or otherwise asymmetric) init so units specialize.
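The symmetry argument can be checked numerically. The sketch below assumes a tiny one-hidden-layer tanh network with a squared-error loss and no biases; `hidden_weight_grad` and `rows_identical` are hypothetical helpers written for this illustration.

```python
import numpy as np


def hidden_weight_grad(W1, W2, x, y):
    """Gradient of 0.5 * (pred - y)**2 w.r.t. W1, where pred = W2 @ tanh(W1 @ x)."""
    h = np.tanh(W1 @ x)            # hidden activations, shape (hidden,)
    pred = W2 @ h                  # scalar prediction
    dh = (pred - y) * W2           # backprop into hidden activations
    dz = dh * (1.0 - h ** 2)       # through tanh
    return np.outer(dz, x)         # one row of gradients per hidden unit


def rows_identical(grad):
    """True if every hidden unit receives exactly the same gradient row."""
    return bool(np.allclose(grad, grad[0]))


rng = np.random.default_rng(0)
x, y = rng.normal(size=3), 1.0

zero_grad = hidden_weight_grad(np.zeros((4, 3)), np.zeros(4), x, y)
const_grad = hidden_weight_grad(np.full((4, 3), 0.5), np.full(4, 0.5), x, y)
rand_grad = hidden_weight_grad(rng.normal(size=(4, 3)), rng.normal(size=4), x, y)
```

With zero init the hidden-layer gradient vanishes entirely; with any constant init every hidden unit gets the same gradient row, so the units can never diverge. Only the random init gives distinct rows.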

17 How are biases usually initialized? (Easy)

Answer: Zeros are usually fine for biases, since symmetry breaking comes from the weights. Sometimes a small positive bias is used with ReLU to avoid dead neurons at the start.

18 What are fan-in and fan-out? (Easy)

Answer: For one neuron, fan-in = number of incoming connections (the layer’s input dimension); fan-out = number of outgoing connections (the units in the next layer that it feeds). Init schemes scale the weight variance using these.

19 Xavier / Glorot initialization: the idea. (Medium)

Answer: Choose the weight variance so activation variance stays roughly stable in the forward pass and gradient variance in the backward pass, typically Uniform or Normal with variance ∝ 1/fan_avg, where fan_avg = (fan_in + fan_out)/2. Suited to tanh/sigmoid (linear-ish near 0).

Var(W) ≈ 2 / (fan_in + fan_out) (common form)
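A minimal NumPy sketch of the uniform variant, assuming the usual bound a = sqrt(6 / (fan_in + fan_out)), which follows from Var(Uniform(-a, a)) = a²/3. The function name and the layer sizes are illustrative.

```python
import numpy as np


def xavier_uniform(fan_in, fan_out, rng):
    """Sample a (fan_out, fan_in) weight matrix with Var(W) = 2 / (fan_in + fan_out)."""
    # Uniform(-a, a) has variance a**2 / 3, so a = sqrt(6 / (fan_in + fan_out)).
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-limit, limit, size=(fan_out, fan_in))


rng = np.random.default_rng(0)
W = xavier_uniform(300, 500, rng)
target_var = 2.0 / (300 + 500)
empirical_var = float(W.var())
```

For a matrix this large, `empirical_var` should land close to `target_var`.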

20 He initialization: for which activation? (Medium)

Answer: Designed for ReLU: roughly half of the pre-activations are zeroed, which halves the signal’s second moment, so the variance is doubled and scaled by fan_in only (Var(W) ≈ 2/fan_in). This keeps the signal from dying or exploding early in deep ReLU nets.
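The effect is easy to see by pushing random inputs through a deep ReLU stack. This sketch assumes square, bias-free fully connected layers; `final_activation_rms` is a hypothetical helper, and the depth, width, and the "too small" std of 0.01 are arbitrary choices.

```python
import numpy as np


def final_activation_rms(std_fn, depth=20, width=256, seed=0):
    """RMS of activations after `depth` ReLU layers whose weights have std std_fn(fan_in)."""
    rng = np.random.default_rng(seed)
    h = rng.normal(size=(512, width))            # unit-variance inputs
    for _ in range(depth):
        W = rng.normal(scale=std_fn(width), size=(width, width))
        h = np.maximum(h @ W, 0.0)               # linear layer followed by ReLU
    return float(np.sqrt(np.mean(h ** 2)))


he_rms = final_activation_rms(lambda fan_in: np.sqrt(2.0 / fan_in))  # He: Var = 2 / fan_in
small_rms = final_activation_rms(lambda fan_in: 0.01)                # fixed, too-small std
```

With the He scale the activation magnitude should stay on the order of its input scale, while the small fixed std makes the signal shrink by a constant factor per layer until it is numerically negligible.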

21 What goes wrong with too large or too small random init? (Medium)

Answer: Too large: activations/gradients explode. Too small: activations vanish and gradients become tiny, so learning is slow. Good init keeps the scale in a reasonable band across layers.

22 LeCun normal / uniform in one line. (Medium)

Answer: Another fan-in-based scaling (e.g. std = 1/√fan_in) to preserve variance; same family as Xavier/He, with different constants for different assumptions.

23 Orthogonal initialization: when is it mentioned? (Hard)

Answer: Start with orthogonal weight matrices so all singular values start at 1, which helps gradient flow in very deep nets or RNNs. Less of a default than Xavier/He for vanilla CNN/MLP.
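One common way to build such a matrix is a QR decomposition of a Gaussian matrix. The sketch below assumes a square weight matrix; the function name is illustrative and the sign correction is a standard trick to avoid biasing the result.

```python
import numpy as np


def orthogonal_init(n, rng):
    """Return a random n x n orthogonal matrix via QR of a Gaussian matrix."""
    A = rng.normal(size=(n, n))
    Q, R = np.linalg.qr(A)
    # Flip column signs so the result does not depend on QR's sign convention.
    Q = Q * np.sign(np.diag(R))
    return Q


rng = np.random.default_rng(0)
W = orthogonal_init(64, rng)
singular_values = np.linalg.svd(W, compute_uv=False)
```

Because every singular value equals 1, multiplying by `W` (or its transpose in the backward pass) preserves vector norms.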

24 Transfer learning: how does “initialization” change? (Easy)

Answer: Load pretrained weights instead of random ones; only new head layers need fresh init. Fine-tuning uses a small LR so the pretrained init isn’t destroyed immediately.

25 Does batch norm make initialization less critical? (Medium)

Answer: Partly. BN stabilizes activations, so training is less sensitive to the exact scale. You should still avoid pathological init; bad init can still hurt before BN statistics stabilize.

26 Residual blocks: why is the last conv layer sometimes initialized to zero? (Hard)

Answer: Some designs initialize the last conv in a block (or the scale γ of its final BN) at or near zero so the block starts as a near-identity (the skip path dominates), which improves optimization of very deep nets.

27 What is a “gain” or activation-specific multiplier? (Medium)

Answer: Frameworks multiply the init std by a constant that depends on the nonlinearity (e.g. on the Leaky ReLU slope) to adjust the variance for it; He/Xavier formulas include these gains.

28 fan_in for a conv layer? (Medium)

Answer: Typically k_h × k_w × in_channels per filter: the number of multiply-add inputs contributing to one output activation before bias.

29 Why set a random seed in experiments? (Easy)

Answer: Reproducibility: same init and shuffling for fair comparisons. It does not remove variance across seeds; best practice in papers is to report results over multiple seeds.

30 Default init you’d name in an interview? (Easy)

Answer: He for ReLU-based CNN/MLP; Xavier for tanh/sigmoid-heavy nets; use the framework’s initializers (e.g. PyTorch’s kaiming_uniform_, xavier_normal_) and match them to the activation.

Say fan_in, activation, and symmetry: the three anchors interviewers expect.

Batch Normalization: 15 Interview Questions

31 What does batch normalization do per feature map? (Easy)

Answer: For each channel (or neuron), subtract the batch mean and divide by the batch std (with ε for numerical stability), then apply a learnable scale γ and shift β so the network can undo the normalization if useful.

ŷ = γ · (x − μ_B) / √(σ²_B + ε) + β
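The formula maps directly onto a few lines of NumPy. This is a training-mode sketch only, assuming a 2D input of shape (batch, features) and the biased batch variance; the function name and the example values for γ and β are illustrative.

```python
import numpy as np


def batch_norm_train(x, gamma, beta, eps=1e-5):
    """Training-mode batch norm for x of shape (batch, features)."""
    mu = x.mean(axis=0)                      # per-feature batch mean
    var = x.var(axis=0)                      # per-feature (biased) batch variance
    x_hat = (x - mu) / np.sqrt(var + eps)    # normalize
    return gamma * x_hat + beta              # learnable scale and shift


rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=3.0, size=(64, 8))
y = batch_norm_train(x, gamma=np.full(8, 2.0), beta=np.full(8, 0.5))
```

Whatever the input mean and scale, each output feature ends up with mean β and standard deviation close to γ.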

32 What is “internal covariate shift”? (Medium)

Answer: The original paper’s term: the changing distribution of layer inputs as earlier layers update. BN aims to stabilize those distributions. The modern view also stresses that BN smooths the loss landscape and allows higher learning rates.

33 Training vs inference (eval) in batch norm. (Easy)

Answer: Train: use the current mini-batch μ, σ²; update running_mean and running_var with momentum. Eval: freeze γ, β and use the running statistics, not batch stats, so one sample or any batch size behaves consistently.
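A compact sketch of the two modes, assuming 2D inputs and a momentum convention in which `momentum` weights the old running value (frameworks differ on this). The class name `BatchNorm1D` and its `training` flag are hypothetical and only mimic the usual framework behavior.

```python
import numpy as np


class BatchNorm1D:
    """Minimal batch norm with separate train and eval behavior (illustrative only)."""

    def __init__(self, num_features, momentum=0.9, eps=1e-5):
        self.gamma = np.ones(num_features)
        self.beta = np.zeros(num_features)
        self.running_mean = np.zeros(num_features)
        self.running_var = np.ones(num_features)
        self.momentum, self.eps = momentum, eps
        self.training = True

    def __call__(self, x):
        if self.training:
            mean, var = x.mean(axis=0), x.var(axis=0)
            m = self.momentum  # assumption: m weights the OLD running value
            self.running_mean = m * self.running_mean + (1 - m) * mean
            self.running_var = m * self.running_var + (1 - m) * var
        else:
            mean, var = self.running_mean, self.running_var  # fixed statistics
        return self.gamma * (x - mean) / np.sqrt(var + self.eps) + self.beta


rng = np.random.default_rng(0)
bn = BatchNorm1D(4)
for _ in range(200):                        # "training": accumulate running stats
    bn(rng.normal(3.0, 2.0, size=(32, 4)))

bn.training = False                         # switch to eval mode
x = rng.normal(3.0, 2.0, size=(5, 4))
single = bn(x[:1])                          # one sample on its own
in_batch = bn(x)[:1]                        # the same sample inside a batch of 5
```

In eval mode the output for a sample no longer depends on which other samples share its batch, which is the property the answer above describes.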

34 Why is batch norm problematic with batch size 1 or very small batches? (Medium)

Answer: Batch statistics are noisy or undefined; running estimates misalign. Common fixes: Sync BN across GPUs, larger batch, or switch to Layer/Group Norm.

35 Where is BN typically placed: before or after activation? (Medium)

Answer: Original paper: before the nonlinearity (conv → BN → ReLU), which is still the most common order, including in ResNet-style blocks. Some practitioners instead place BN after the activation (conv → ReLU → BN). Be consistent within an architecture, and know that both camps exist.

36 Are γ and β learned? (Easy)

Answer: Yes, they are trainable parameters like weights. If it is optimal, the network can learn γ and β that undo the normalization, so BN doesn’t reduce representational power.

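37 What does momentum on running mean/variance mean? (Medium)

Answer: Exponential moving average: running = m·running + (1−m)·batch, which smooths estimates across iterations so eval stats aren’t dominated by the last batch. Conventions differ: in this form (used by Keras) m is close to 1, while PyTorch’s momentum weights the new batch value instead, so its default is small (0.1).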

38 Batch norm in CNNs: which dimensions are normalized? (Medium)

Answer: Per channel: aggregate over batch, height, and width (N×H×W) to get one μ and σ per channel. Every spatial position in a feature map shares the same statistics, which keeps the spatial structure within each map.

39 Layer Norm vs Batch Norm: key difference. (Medium)

Answer: LN normalizes across features for each example (independent of batch size). BN normalizes across the batch dimension, so LN fits RNNs/Transformers and small batches better.

40 Group Norm in one sentence. (Hard)

Answer: Split channels into groups and, for each sample, normalize over the channels and spatial positions within each group; it needs no batch statistics, which makes it useful for small batches in vision.
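A minimal NumPy sketch of group normalization for inputs of shape (N, C, H, W). Statistics are computed per sample over each group's channels and all spatial positions; the learnable per-channel scale and shift are omitted for brevity, and the function name is illustrative.

```python
import numpy as np


def group_norm(x, num_groups, eps=1e-5):
    """Group normalization without the affine step; x has shape (N, C, H, W)."""
    n, c, h, w = x.shape
    if c % num_groups != 0:
        raise ValueError("channels must be divisible by num_groups")
    g = x.reshape(n, num_groups, c // num_groups, h, w)
    # One mean/variance per (sample, group), taken over channels-in-group, H, W.
    mean = g.mean(axis=(2, 3, 4), keepdims=True)
    var = g.var(axis=(2, 3, 4), keepdims=True)
    g = (g - mean) / np.sqrt(var + eps)
    return g.reshape(n, c, h, w)


rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=3.0, size=(4, 6, 5, 5))
out = group_norm(x, num_groups=3)
alone = group_norm(x[:1], num_groups=3)     # first sample normalized by itself
```

Since no statistic crosses the batch axis, a sample's output is the same whether it is processed alone or in a batch.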

41 Does BN act like regularization? (Medium)

Answer: Noisy batch statistics add mild regularization, similar to jitter; don’t rely on it instead of dropout/weight decay. The effect shrinks with large batches.

42 Inference batch size different from training: OK? (Easy)

Answer: Yes. Eval uses fixed running stats, so any batch size (including 1) is valid if the running stats were estimated well during training.

43 Fine-tuning: when to freeze BN layers? (Hard)

Answer: With a small new dataset, sometimes freeze BN (use pretrained running stats) to avoid bad estimates; otherwise keep BN trainable with a small LR. It depends on domain shift and batch size.

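44 Interaction with weight decay (L2). (Hard)

Answer: A layer followed by BN is invariant to the scale of its weights, so L2 on those weights mainly changes the effective learning rate rather than limiting capacity. Implementation details are debated too: decoupled decay (AdamW) behaves differently from an L2 penalty under Adam, and γ, β are often excluded from decay. State which optimizer pairing you are assuming.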

45 When would you avoid batch norm? (Medium)

Answer: Very small batches, non-batch settings (online), some GAN setups, or when you need batch-independent normalization; prefer LN, GN, or modern alternatives (RMSNorm, etc.).

Always mention train vs eval behavior and running statistics: the core BN interview answer.

Overfitting & Underfitting: 15 Interview Questions

46 Define overfitting. (Easy)

Answer: The model learns training noise and idiosyncrasies, so training error is low but validation/test error is much worse: poor generalization.

47 Define underfitting. (Easy)

Answer: The model is too simple or insufficiently trained: both training and validation errors remain high, so it misses real signal.

48 Bias vs variance (classic interview version). (Medium)

Answer: High bias: systematically wrong (underfitting). High variance: sensitive to training sample (overfitting). Ideal model balances both for lowest expected test error.

49 What is the generalization gap? (Easy)

Answer: Difference between train performance and held-out performance. Large gap often signals overfitting; small gap with poor absolute score suggests underfitting or hard task.

50 How do learning curves diagnose overfitting? (Medium)

Answer: Plot loss vs epoch: train ↓ but val ↑ (or plateauing at a poor value) → overfit. Both high and stuck → underfit, or you need better features/architecture.

51 List fixes for overfitting. (Easy)

Answer: More/better data, augmentation, regularization (L2, dropout), smaller model, early stopping, label noise cleanup, cross-validation for honest estimates.

52 List fixes for underfitting. (Easy)

Answer: Bigger / deeper model, train longer, lower regularization, richer features, check learning rate and optimization bugs, ensure data quality.

53 Can a neural net memorize random labels? (Medium)

Answer: Yes. Large enough nets can fit random labels on the training set (a classic experiment). This shows capacity without generalization, and motivates regularization and correct targets.

54 Why use a validation set? (Easy)

Answer: To tune hyperparameters and early-stop without peeking at the test set. The test set should be used once, for the final generalization estimate, to avoid optimistic bias.

55 Classical U-shaped risk vs modern “double descent”: worth mentioning? (Hard)

Answer: Classical: bias–variance U-shape in model complexity. Some regimes show double descent, where risk drops again past the interpolation threshold. It is an interview bonus topic, not required for basics.

56 k-fold cross-validation: purpose. (Medium)

Answer: Rotate train/val splits to estimate performance with less variance when data is small; a better basis for hyperparameter comparison than one random split.
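The rotation can be written in a few lines of plain Python. This sketch assumes the data has already been shuffled and splits indices into contiguous folds; the generator name is illustrative.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, val_indices); each sample is in validation exactly once."""
    indices = list(range(n_samples))
    # Spread the remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, val
        start += size


folds = list(k_fold_indices(10, 3))
val_sizes = [len(val) for _, val in folds]
```

A model would be trained once per fold on `train` and scored on `val`, and the k scores averaged.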

57 Label noise and overfitting. (Medium)

Answer: Wrong labels are noise; the model may memorize them. Clean data, robust loss, or regularization helps; audit labels in production ML.

58 More data vs smaller model for overfitting? (Medium)

Answer: Often more diverse data is the best fix if feasible. A smaller model is a lever when data is fixed: trade off capacity against the available signal.

59 Early stopping: how does it reduce overfitting? (Easy)

Answer: Stop training when validation loss stops improving (typically after a patience window) and keep the best checkpoint; this prevents continued fitting of training noise. It acts as implicit regularization by limiting training time and how far the weights move.
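A patience-based stopping rule is one common way to implement this. The sketch below works on a precomputed list of per-epoch validation losses; the function name, the `patience` and `min_delta` parameters, and the example curve are all illustrative assumptions.

```python
def early_stopping_epoch(val_losses, patience=3, min_delta=0.0):
    """Return (best_epoch, stop_epoch) for a sequence of per-epoch validation losses."""
    best_loss = float("inf")
    best_epoch = -1
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss - min_delta:
            # New best: remember it (a real loop would save a checkpoint here).
            best_loss, best_epoch = loss, epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return best_epoch, epoch
    return best_epoch, len(val_losses) - 1   # never triggered: ran to the end


val_curve = [1.00, 0.80, 0.65, 0.60, 0.62, 0.61, 0.66, 0.70]
best_epoch, stop_epoch = early_stopping_epoch(val_curve, patience=3)
```

Training halts at `stop_epoch`, and the weights from `best_epoch` are the ones to restore.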

60 One diagram you’d draw in an interview. (Easy)

Answer: Two curves vs epochs: train loss going down, val loss going down then up; point at the turning point of the val curve as the onset of overfitting. Pair with train vs val accuracy for classification.

Tie symptoms to train and val numbers: interviewers want concrete diagnostics.

Dropout & Regularization: 15 Interview Questions

61 What is dropout? (Easy)

Answer: During training, each activation is kept with probability 1−p and set to zero otherwise, with a different random mask each step. This reduces co-adaptation of neurons.

62 Dropout at training vs inference. (Easy)

Answer: Training: apply the stochastic mask. Inference: no dropout; use the full network. The expected output must match between the two, which is handled by scaling (inverted dropout during training, or multiplying by 1−p at test time).

63 What is inverted dropout? (Medium)

Answer: Scale kept activations by 1/(1−p) during training so inference needs no extra scaling. This is common in frameworks and gives a cleaner eval path.
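A minimal NumPy sketch of inverted dropout, where `p` is the drop probability as in the answers above. The function name and the explicit `training` flag are illustrative.

```python
import numpy as np


def inverted_dropout(x, p, rng, training=True):
    """Zero each element with probability p and rescale the rest by 1 / (1 - p)."""
    if not training or p == 0.0:
        return x                          # eval path: identity, no scaling needed
    keep = rng.random(x.shape) >= p       # True with probability 1 - p
    return x * keep / (1.0 - p)


rng = np.random.default_rng(0)
x = np.ones(200_000)
train_out = inverted_dropout(x, p=0.3, rng=rng)
eval_out = inverted_dropout(x, p=0.3, rng=rng, training=False)
```

The training output has the same expected value as the input, so the eval path can leave activations untouched.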

64 Ensemble interpretation of dropout. (Medium)

Answer: Training samples many thinned subnets that share weights; inference approximates averaging over exponentially many such nets by using the full net with scaled weights. This explains the regularizing effect.

65 Where is dropout usually applied? (Easy)

Answer: After fully connected or sometimes conv layers (less common in modern CNNs); often not on output layer. Transformers use attention dropout on weights/probs.

66 Typical dropout probability p? (Easy)

Answer: Hidden layers often use 0.2–0.5; too high hurts capacity. Tune on validation; some architectures (BN-heavy nets) use less dropout.

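67 L2 regularization (weight decay): effect. (Medium)

Answer: The penalty λ||w||² encourages smaller weights, smoother functions, and less overfitting. With plain SGD it is equivalent to shrinking the weights by a constant factor each step; with adaptive optimizers such as Adam the two differ, which is why AdamW applies the decay decoupled from the gradient.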
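The SGD equivalence follows from one line of algebra and can be verified numerically. This sketch assumes the penalty is written exactly as λ||w||², whose gradient is 2λw; both function names are illustrative.

```python
import numpy as np


def sgd_step_l2_penalty(w, grad, lr, lam):
    """SGD on loss + lam * ||w||^2: the penalty adds 2 * lam * w to the gradient."""
    return w - lr * (grad + 2.0 * lam * w)


def sgd_step_weight_decay(w, grad, lr, lam):
    """Shrink the weights by a constant factor, then take the plain gradient step."""
    return (1.0 - 2.0 * lr * lam) * w - lr * grad


rng = np.random.default_rng(0)
w = rng.normal(size=10)
grad = rng.normal(size=10)                 # stand-in for the data-loss gradient

via_penalty = sgd_step_l2_penalty(w, grad, lr=0.1, lam=0.01)
via_decay = sgd_step_weight_decay(w, grad, lr=0.1, lam=0.01)
```

The two updates coincide for plain SGD. With Adam the penalty gradient would be rescaled per parameter, so the identity no longer holds.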

68 L1 vs L2 for neural nets. (Medium)

Answer: L1 encourages sparsity (exact zeros when paired with proximal/soft-thresholding updates; plain subgradient steps only hover near zero). L2 shrinks all weights smoothly. L2 is the default; use L1 for feature selection or sparse models.

69 Monte Carlo dropout at test time: why? (Hard)

Answer: Leave dropout on during inference and average multiple forward passes; the spread across passes approximates predictive uncertainty (a Bayesian NN heuristic).
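A toy sketch of the procedure, assuming a one-hidden-layer ReLU network with random (untrained) weights and inverted dropout on the hidden layer. The function name, layer sizes, and number of passes are arbitrary illustrative choices.

```python
import numpy as np


def mc_dropout_predict(x, W1, W2, p, rng, n_passes=200):
    """Mean and std of predictions over stochastic passes with dropout left on."""
    preds = []
    for _ in range(n_passes):
        h = np.maximum(x @ W1, 0.0)             # hidden ReLU layer
        keep = rng.random(h.shape) >= p         # fresh dropout mask on every pass
        h = h * keep / (1.0 - p)                # inverted-dropout scaling
        preds.append(h @ W2)
    preds = np.stack(preds)                     # (n_passes, batch, outputs)
    return preds.mean(axis=0), preds.std(axis=0)


rng = np.random.default_rng(0)
W1 = rng.normal(size=(5, 32)) * np.sqrt(2.0 / 5)
W2 = rng.normal(size=(32, 1)) * np.sqrt(1.0 / 32)
x = rng.normal(size=(3, 5))

mean_pred, std_pred = mc_dropout_predict(x, W1, W2, p=0.5, rng=rng)
```

`mean_pred` serves as the prediction and `std_pred` as a rough per-input uncertainty score.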

70 Dropout with batch normalization: interaction? (Hard)

Answer: Order and strength matter; dropout before BN changes activation variance between training and eval, so BN’s running statistics no longer match. Many modern vision models rely more on BN + data augmentation than heavy dropout; know that it’s architecture-dependent.

71 Spatial dropout in CNNs. (Medium)

Answer: Drop entire feature maps (channels) instead of individual pixels. Neighboring activations in a map are strongly correlated, so per-pixel dropout removes little information; dropping whole channels is a stronger, structural regularizer.

72 Is label smoothing regularization? (Medium)

Answer: Yes. It softens targets so the model doesn’t become overconfident; it acts on the loss, not on the weights directly.
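A short sketch of the usual formulation, assuming the smoothing mass ε is spread uniformly over all classes (including the true one). The helper names and the example logits are illustrative.

```python
import numpy as np


def smooth_labels(labels, num_classes, eps=0.1):
    """Turn integer labels into smoothed targets: (1 - eps) * one_hot + eps / K."""
    one_hot = np.eye(num_classes)[labels]
    return one_hot * (1.0 - eps) + eps / num_classes


def cross_entropy(logits, targets):
    """Mean cross-entropy between softmax(logits) and (possibly soft) targets."""
    shifted = logits - logits.max(axis=1, keepdims=True)       # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return float(-(targets * log_probs).sum(axis=1).mean())


labels = np.array([0])
hard = np.eye(4)[labels]
soft = smooth_labels(labels, num_classes=4, eps=0.1)

confident_logits = np.array([[20.0, 0.0, 0.0, 0.0]])            # extremely confident
hard_loss = cross_entropy(confident_logits, hard)
soft_loss = cross_entropy(confident_logits, soft)
```

Against hard targets the overconfident logits are nearly free, whereas the smoothed targets penalize them, which is what discourages extreme logits during training.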

73 Gaussian noise on inputs as regularization. (Easy)

Answer: Adds robustness to small input perturbations; related to data augmentation and to Tikhonov-style effects in linear models.

74 Stochastic depth / drop path (high level). (Hard)

Answer: Randomly skip whole residual branches during training; this regularizes very deep networks in the same spirit as dropout, but on the graph structure.

75 When to prefer dropout vs weight decay? (Medium)

Answer: Often use both lightly. Dropout targets co-adaptation of activations; weight decay shrinks parameters. Large data + BN may need little dropout; small data FC nets benefit more.

State clearly: dropout off at eval unless doing MC dropout.
