Network Design & Regularization — Interview Q&A
Network design, weight initialization, batch norm, overfitting, and dropout.
Network Design & Depth — 15 Interview Questions
1 What does “network design” mean in an interview?Easy
Answer: Choosing depth, width, connectivity patterns (residual, dense), input/output heads, and regularization hooks so the model has enough capacity but fits data and compute.
2 Depth vs width—trade-offs.Medium
Answer: Depth composes features hierarchically; can improve sample efficiency for structured tasks. Width increases representational power per layer. Very deep nets need care (residuals, normalization); very wide nets cost parameters and memory.
3 What is model capacity?Easy
Answer: Roughly the family of functions the architecture can represent (VC-style intuition or parameter count as proxy). High capacity can overfit small data; too low underfits.
4 What is a bottleneck layer?Medium
Answer: A layer with fewer units than neighbors, forcing compression of the representation—used in autoencoders, Inception modules, some efficient conv blocks (1×1 convs).
5 Why do skip (residual) connections help very deep nets?Medium
Answer: They provide gradient highways and make it easier to learn near-identity refinements (“residual mapping”). Mitigates degradation and vanishing signal in deep stacks.
6 What is inductive bias?Medium
Answer: Prior assumptions baked into the architecture—e.g. CNNs assume locality and translation structure; RNNs assume sequential dependence. Good bias improves data efficiency.
7 Receptive field—why does it matter for CNN design?Medium
Answer: The region of input affecting one output neuron. Must grow large enough to capture context (objects, text n-grams in 1D CNNs)—deeper stacks, dilated convs, or pooling increase effective RF.
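As a rough illustration of how depth, stride, and dilation grow the receptive field, the sketch below applies the standard recurrence for a stack of 1D (or square 2D) layers. The function name and the `(kernel, stride, dilation)` tuple layout are assumptions made for this example.

```python
def receptive_field(layers):
    """Receptive field of the last layer's output, in input positions.

    layers: list of (kernel_size, stride, dilation) tuples, input to output.
    """
    rf = 1    # a single input position sees itself
    jump = 1  # distance in input positions between adjacent outputs
    for kernel, stride, dilation in layers:
        rf += (kernel - 1) * dilation * jump
        jump *= stride
    return rf


three_convs = [(3, 1, 1)] * 3                        # three 3x3 convs
with_pooling = [(3, 1, 1), (2, 2, 1), (3, 1, 1)]     # conv, 2x2 pool, conv
dilated = [(3, 1, 1), (3, 1, 2), (3, 1, 4)]          # dilations 1, 2, 4

print(receptive_field(three_convs))   # 7
print(receptive_field(with_pooling))  # 8
print(receptive_field(dilated))       # 15
```

With the same three 3×3 layers, dilation roughly doubles the receptive field without adding parameters.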
8 Parameters vs FLOPs—both needed?Easy
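Answer: Parameters drive memory and overfitting risk; FLOPs drive latency and training cost. A conv layer on a large feature map is compute-heavy but parameter-light because its weights are reused at every position; a large fully connected layer is the opposite. Depthwise separable convs cut both relative to a standard conv.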
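A back-of-the-envelope count makes the distinction concrete. This is a minimal sketch that ignores biases and counts one multiply-accumulate (MAC) per weight use; the helper names and layer sizes are illustrative assumptions.

```python
def conv_cost(kernel, c_in, c_out, h_out, w_out, groups=1):
    """Return (parameters, MACs) for a 2D conv layer, bias ignored."""
    params = kernel * kernel * (c_in // groups) * c_out
    macs = params * h_out * w_out  # each weight is reused at every output position
    return params, macs


def linear_cost(d_in, d_out):
    """Return (parameters, MACs per example) for a fully connected layer."""
    params = d_in * d_out
    return params, params  # each weight is used exactly once


def depthwise_separable_cost(kernel, c_in, c_out, h_out, w_out):
    """Depthwise conv (groups = c_in) followed by a 1x1 pointwise conv."""
    dw = conv_cost(kernel, c_in, c_in, h_out, w_out, groups=c_in)
    pw = conv_cost(1, c_in, c_out, h_out, w_out)
    return dw[0] + pw[0], dw[1] + pw[1]


print(conv_cost(3, 64, 64, 112, 112))                 # (36864, 462422016)
print(linear_cost(4096, 4096))                        # (16777216, 16777216)
print(conv_cost(3, 64, 128, 56, 56))                  # (73728, 231211008)
print(depthwise_separable_cost(3, 64, 128, 56, 56))   # (8768, 27496448)
```

The first conv has about 450 times fewer parameters than the linear layer but needs about 28 times more MACs.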
9 Signs your network is too small (underfitting).Easy
Answer: Training loss stays high; both train and validation error poor. Fix: more layers/units, better features, or longer training if optimization was the issue.
10 Signs your network is too large (overfitting).Easy
Answer: Training loss is low but validation loss is much worse. Fix: regularization, more data, a smaller model, or early stopping; the answer is not always “more parameters.”
11 “Scaling laws” in one interview sentence.Hard
Answer: Empirically, loss often falls predictably (roughly as a power law) with more parameters, data, and compute; this guides large-model training budgets but doesn’t replace task-specific design.
12 Multi-branch architectures (e.g. Inception idea).Hard
Answer: Parallel paths with different kernel sizes or operations capture multi-scale features; concatenation or addition fuses them—richer than a single tower at cost of complexity.
13 How does input resolution affect design?Medium
Answer: Higher resolution increases the number of spatial positions (tokens) and therefore compute: quadratically in token count for self-attention, linearly in pixel count for convolution. You may need earlier downsampling or a deeper net to control cost and keep the receptive field large enough.
14 When should you start from a pretrained architecture?Medium
Answer: Small data or similar domain to pretraining—reuse backbone, replace classifier head. Random init better when data is huge or domain mismatch is extreme (with caveats).
15 Practical order for picking depth and width.Medium
Answer: Start from a known baseline (ResNet-18, small Transformer), match the parameter budget to GPU memory and dataset size, measure train vs val curves, then adjust depth, width, and regularization. Don’t start by guessing a huge model.
Mention train/val gap and compute budget—signals you design empirically, not only from theory.
Weight Initialization — 15 Interview Questions
16 Why can’t we set all weights to zero?Easy
Answer: Neurons in a layer stay identical: same outputs, same gradients, same updates—symmetry never breaks. Need random (or other asymmetric) init so units specialize.
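The symmetry argument can be checked numerically. The sketch below hand-computes gradients for a tiny one-hidden-layer tanh network (an architecture chosen only for illustration) and shows that with constant initialization every hidden unit receives exactly the same gradient.

```python
import math


def grads(w1, w2, x, target):
    """Gradients of 0.5 * (y - target)^2 where y = w2 . tanh(w1 @ x)."""
    h = [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in w1]
    y = sum(w * hj for w, hj in zip(w2, h))
    err = y - target
    g_w2 = [err * hj for hj in h]
    g_w1 = [[err * w2j * (1.0 - hj * hj) * xi for xi in x]
            for w2j, hj in zip(w2, h)]
    return g_w1, g_w2


x, target = [1.0, -2.0, 0.5], 1.0

# Constant init: 4 hidden units, all weights equal.
const_w1 = [[0.3] * 3 for _ in range(4)]
const_w2 = [0.3] * 4
g_w1, g_w2 = grads(const_w1, const_w2, x, target)
rows_identical = all(row == g_w1[0] for row in g_w1)
print(rows_identical)  # True: every hidden unit gets the same update

# All-zero init: in this network the gradients are also all zero.
z_w1, z_w2 = grads([[0.0] * 3 for _ in range(4)], [0.0] * 4, x, target)
print(all(g == 0.0 for row in z_w1 for g in row), all(g == 0.0 for g in z_w2))
```

Because the updates are identical, the units stay identical after every step, which is why a random (asymmetric) init is needed.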
17 How are biases usually initialized?Easy
Answer: Often zeros is fine for biases—symmetry breaking comes from weights. Sometimes small positive bias for ReLU to avoid dead neurons at start.
18 What are fan-in and fan-out?Easy
Answer: For one neuron, fan-in = number of incoming connections (the input dimension); fan-out = number of outgoing connections (how many next-layer units or filter positions it feeds). Init schemes scale the weight variance using these.
19 Xavier / Glorot initialization—idea.Medium
Answer: Choose weight variance so activation variance stays roughly stable in the forward pass and gradient variance in the backward pass. Weights are drawn from a Uniform or Normal distribution with variance ∝ 1/fan_avg, where fan_avg = (fan_in + fan_out)/2. Suited to tanh/sigmoid (linear-ish near 0).
Var(W) ≈ 2 / (fan_in + fan_out) (common form)
20 He initialization—for which activation?Medium
Answer: Designed for ReLU: roughly half of the pre-activations are zeroed, which halves the signal passed forward, so the weight variance is doubled to compensate and scaled by fan_in only (Var(W) ≈ 2/fan_in). Prevents the signal from dying or exploding early in deep ReLU nets.
21 What goes wrong with too large or too small random init?Medium
Answer: Too large: activations/gradients explode. Too small: activations vanish, gradients tiny—slow learning. Good init keeps scale in a reasonable band across layers.
22 LeCun normal / uniform—one line.Medium
Answer: Another fan-in-based scaling (e.g. std = 1/√fan_in) to preserve variance; similar family to Xavier/He with different constants for different assumptions.
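The three schemes differ only in the constant and in which fan they use. A minimal sketch of the Normal-distribution variants, using the variance formulas quoted above; the function names and layer sizes are assumptions for illustration, not a framework API.

```python
import math
import random


def init_std(scheme, fan_in, fan_out):
    """Standard deviation for Normal-distributed weights under each scheme."""
    if scheme == "xavier":   # Var = 2 / (fan_in + fan_out)
        return math.sqrt(2.0 / (fan_in + fan_out))
    if scheme == "he":       # Var = 2 / fan_in
        return math.sqrt(2.0 / fan_in)
    if scheme == "lecun":    # Var = 1 / fan_in
        return math.sqrt(1.0 / fan_in)
    raise ValueError(f"unknown scheme: {scheme}")


def sample_weights(scheme, fan_in, fan_out, rng):
    """Draw a fan_out x fan_in weight matrix as nested lists."""
    std = init_std(scheme, fan_in, fan_out)
    return [[rng.gauss(0.0, std) for _ in range(fan_in)] for _ in range(fan_out)]


for scheme in ("xavier", "he", "lecun"):
    print(scheme, round(init_std(scheme, 512, 256), 4))

weights = sample_weights("he", 512, 256, random.Random(0))
print(len(weights), len(weights[0]))  # 256 512
```

For a 512 → 256 layer, He gives std = 0.0625, which is √2 times the LeCun value.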
23 Orthogonal initialization—when mentioned?Hard
Answer: Start with orthogonal weight matrices so singular values start near 1—helps very deep nets or RNNs with gradient flow. Less default than Xavier/He for vanilla CNN/MLP.
24 Transfer learning: how does “initialization” change?Easy
Answer: Load pretrained weights instead of random—only new head layers need fresh init. Fine-tuning uses small LR so pretrained init isn’t destroyed immediately.
25 Does batch norm make initialization less critical?Medium
Answer: Partly—BN stabilizes activations so training is less sensitive to exact scale. You still avoid pathological init; bad init can still hurt before BN statistics stabilize.
26 Residual blocks—init of last conv layer sometimes zero—why?Hard
Answer: Some designs initialize the last conv in a block near zero so the block starts as near-identity (skip path dominates), improving optimization of very deep nets.
27 What is a “gain” or activation-specific multiplier?Medium
Answer: Frameworks apply a constant (e.g. for Leaky ReLU slope) to adjust variance for that nonlinearity—He/Xavier formulas include these gains.
28 fan_in for a conv layer?Medium
Answer: Typically k_h × k_w × in_channels per filter—number of multiply-add inputs contributing to one output activation before bias.
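As a quick worked example, the fans for a conv layer can be computed directly from its shape. The fan_out formula here (kernel area × out_channels) is an assumed convention that some frameworks use; the source only defines fan_in.

```python
import math


def conv_fans(k_h, k_w, in_channels, out_channels):
    """Return (fan_in, fan_out) for a 2D conv layer."""
    kernel_area = k_h * k_w
    fan_in = kernel_area * in_channels     # inputs feeding one output activation
    fan_out = kernel_area * out_channels   # assumed convention, see lead-in
    return fan_in, fan_out


fan_in, fan_out = conv_fans(3, 3, 64, 128)
he_std = math.sqrt(2.0 / fan_in)  # He init uses fan_in only
print(fan_in, fan_out)  # 576 1152
```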
29 Why set a random seed in experiments?Easy
Answer: Reproducibility: same init and shuffling for fair comparisons. It does not remove variance across seeds; best practice is to report results over multiple seeds.
30 Default init you’d name in an interview?Easy
Answer: He for ReLU-based CNN/MLP; Xavier for tanh/sigmoid-heavy nets; use framework defaults (kaiming_uniform, xavier_normal) and match to activation.
Say fan_in, activation, and symmetry: the three anchors interviewers expect.
Batch Normalization — 15 Interview Questions
31 What does batch normalization do per feature map?Easy
Answer: For each channel (or neuron), subtract batch mean and divide by batch std (with ε), then apply learnable scale γ and shift β so the network can undo normalization if useful.
ŷ = γ · (x − μ_B) / √(σ²_B + ε) + β
32 What is “internal covariate shift”?Medium
Answer: The original paper’s term: changing distribution of layer inputs as earlier layers update. BN aims to stabilize those distributions. Modern view also stresses smoothing loss landscape and allowing higher learning rates.
33 Training vs inference (eval) in batch norm.Easy
Answer: Train: use current mini-batch μ, σ²; update running_mean and running_var with momentum. Eval: freeze γ, β and use running statistics—not batch stats—so one sample or any batch size behaves consistently.
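The train/eval split is easiest to see for a single feature. The sketch below is a toy, dependency-free version for one channel; the class name is hypothetical, and the running-average update assumes that `momentum` is the weight on the old running value (frameworks differ on this convention).

```python
import math


class BatchNormFeature:
    """Batch norm for one feature: a batch is a list of scalars."""

    def __init__(self, momentum=0.9, eps=1e-5):
        self.gamma, self.beta = 1.0, 0.0          # learnable scale and shift
        self.running_mean, self.running_var = 0.0, 1.0
        self.momentum, self.eps = momentum, eps   # momentum weights the OLD value (assumption)
        self.training = True

    def __call__(self, xs):
        if self.training:
            mean = sum(xs) / len(xs)
            var = sum((x - mean) ** 2 for x in xs) / len(xs)  # biased batch variance
            m = self.momentum
            self.running_mean = m * self.running_mean + (1 - m) * mean
            self.running_var = m * self.running_var + (1 - m) * var
        else:
            mean, var = self.running_mean, self.running_var   # frozen statistics
        scale = self.gamma / math.sqrt(var + self.eps)
        return [scale * (x - mean) + self.beta for x in xs]


bn = BatchNormFeature()
for _ in range(200):
    train_out = bn([2.0, 4.0, 6.0, 8.0])   # batch mean 5, variance 5

bn.training = False
alone = bn([5.0])[0]
with_outlier = bn([5.0, 100.0])[0]
print(alone == with_outlier)  # True: eval output ignores batch-mates
```

In eval mode a sample's output no longer depends on what else is in the batch, which is why batch size 1 works at inference.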
34 Why is batch norm problematic with batch size 1 or very small batches?Medium
Answer: Batch statistics are noisy or undefined; running estimates misalign. Common fixes: Sync BN across GPUs, larger batch, or switch to Layer/Group Norm.
35 Where is BN typically placed: before or after activation?Medium
Answer: Original paper: before the nonlinearity (conv → BN → ReLU), and this is still the common ordering, e.g. ResNet-style conv–BN–ReLU. Some practitioners place BN after the activation instead. Be consistent within an architecture; know both camps exist.
36 Are γ and β learned?Easy
Answer: Yes—trainable parameters like weights. If optimal, the network can learn identity-like scaling so BN doesn’t hurt representational power.
37 What does momentum on running mean/variance mean?Medium
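Answer: Exponential moving average: running = (1−m)·batch + m·running, which smooths estimates across iterations so eval stats aren’t dominated by the last batch. Here m weights the old value (Keras-style, m close to 1); PyTorch’s momentum argument instead weights the new batch statistic, so its default 0.1 corresponds to m = 0.9.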
38 Batch norm in CNNs—which dimensions are normalized?Medium
Answer: Per channel, aggregate over batch, height, width (N×H×W) to get one μ and σ per channel. Keeps spatial structure within each feature map.
39 Layer Norm vs Batch Norm—key difference.Medium
Answer: LN normalizes across features for each example (independent of batch size). BN uses batch dimension—LN fits RNNs/Transformers and small batches better.
40 Group Norm—in one sentence.Hard
Answer: Split channels into groups and normalize within each group per example, over that group’s channels and all spatial positions; statistics never cross the batch dimension, so it is useful for small batches in vision.
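One way to keep BN, LN, and GN apart is to count how many (mean, variance) pairs each computes on an N×C×H×W tensor and how many elements go into each. This is a counting sketch only; the function name is hypothetical, and layer norm is assumed to reduce over (C, H, W) per example, as is common for image tensors.

```python
def norm_groups(kind, n, c, h, w, num_groups=1):
    """Return (number of mean/variance pairs, elements averaged per pair)."""
    if kind == "batch":   # one pair per channel, pooled over batch and space
        return c, n * h * w
    if kind == "layer":   # one pair per example, pooled over all features
        return n, c * h * w
    if kind == "group":   # one pair per (example, channel group)
        if c % num_groups != 0:
            raise ValueError("channels must be divisible by num_groups")
        return n * num_groups, (c // num_groups) * h * w
    raise ValueError(f"unknown kind: {kind}")


for batch_size in (32, 1):
    print(batch_size,
          norm_groups("batch", batch_size, 32, 8, 8),
          norm_groups("layer", batch_size, 32, 8, 8),
          norm_groups("group", batch_size, 32, 8, 8, num_groups=8))
```

Only the batch-norm sample count shrinks with the batch size (2048 elements per statistic at N=32, 64 at N=1); the layer and group counts per statistic stay fixed.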
41 Does BN act like regularization?Medium
Answer: Noisy batch statistics add mild regularization similar to jitter; don’t rely on it instead of dropout/weight decay. Effect shrinks with large batch.
42 Inference batch size different from training—OK?Easy
Answer: Yes—eval uses fixed running stats, so any batch size (including 1) is valid if running stats were estimated well during training.
43 Fine-tuning: freeze BN layers—when?Hard
Answer: Small new dataset: sometimes freeze BN (use pretrained running stats) to avoid bad estimates; or keep BN trainable with small LR. Depends on domain shift and batch size.
44 Interaction with weight decay (L2).Hard
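Answer: BN makes a layer’s output invariant to the scale of the preceding weights, so L2 on those weights mostly changes the effective learning rate rather than limiting capacity. Implementation details are debated (decoupled decay in AdamW, whether to exclude γ and β from decay); follow framework defaults and the literature for the optimizer pairing you cite.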
45 When would you avoid batch norm?Medium
Answer: Very small batches, non-batch settings (online), some GAN setups, or when you need batch-independent norm—prefer LN, GN, or modern alternatives (RMSNorm, etc.).
Always say train vs eval and running statistics—the core BN interview answer.
Overfitting & Underfitting — 15 Interview Questions
46 Define overfitting.Easy
Answer: The model learns training noise and idiosyncrasies so training error is low but validation/test error is much worse—poor generalization.
47 Define underfitting.Easy
Answer: The model is too simple or insufficiently trained: both training and validation errors remain high—it misses real signal.
48 Bias vs variance (classic interview version).Medium
Answer: High bias: systematically wrong (underfitting). High variance: sensitive to training sample (overfitting). Ideal model balances both for lowest expected test error.
49 What is the generalization gap?Easy
Answer: Difference between train performance and held-out performance. Large gap often signals overfitting; small gap with poor absolute score suggests underfitting or hard task.
50 How do learning curves diagnose overfitting?Medium
Answer: Plot loss vs epoch. Train ↓ while val ↑ (or val plateaus well above train) → overfit. Both high and stuck → underfit, or the model needs better features/architecture.
51 List fixes for overfitting.Easy
Answer: More/better data, augmentation, regularization (L2, dropout), smaller model, early stopping, label noise cleanup, cross-validation for honest estimates.
52 List fixes for underfitting.Easy
Answer: Bigger / deeper model, train longer, lower regularization, richer features, check learning rate and optimization bugs, ensure data quality.
53 Can a neural net memorize random labels?Medium
Answer: Yes—large enough nets can fit random noise on training set (classic experiment). Shows capacity without generalization; motivates regularization and correct targets.
54 Why use a validation set?Easy
Answer: Tune hyperparameters and early stop without peeking at test set. Test set should estimate final generalization once to avoid optimistic bias.
55 Classical U-shaped risk vs modern “double descent”: mention?Hard
Answer: Classical: bias–variance U-shape in model complexity. Some regimes show double descent where risk drops again past interpolation threshold—interview bonus topic, not required for basics.
56 k-fold cross-validation—purpose.Medium
Answer: Rotate train/val splits to estimate performance with less variance when data is small—better hyperparameter comparison than one random split.
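The rotation can be sketched as an index generator. This minimal version uses contiguous folds and assumes the data has already been shuffled; the function name is illustrative.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, val_idx) pairs for k contiguous folds."""
    if not 2 <= k <= n_samples:
        raise ValueError("need 2 <= k <= n_samples")
    # Spread the remainder over the first folds so sizes differ by at most 1.
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        val_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, val_idx
        start += size


folds = list(k_fold_indices(10, 3))
for train_idx, val_idx in folds:
    print(val_idx)
# [0, 1, 2, 3]
# [4, 5, 6]
# [7, 8, 9]
```

Each sample is used for validation exactly once; the k validation scores are then averaged.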
57 Label noise and overfitting.Medium
Answer: Wrong labels are noise; the model may memorize them. Clean data, robust loss, or regularization helps; audit labels in production ML.
58 More data vs smaller model for overfitting?Medium
Answer: Often more diverse data is the best fix if feasible. Smaller model is a lever when data is fixed—trade off capacity vs available signal.
59 Early stopping—how does it reduce overfitting?Easy
Answer: Stop training when validation loss worsens—prevents continued fitting of training noise. Acts as implicit regularization on training time/weight trajectory.
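In practice "when validation loss worsens" is usually implemented with a patience counter so one noisy epoch does not end training. A minimal sketch over a precomputed list of validation losses; `patience` and `min_delta` are common but assumed hyperparameter names.

```python
def early_stopping_epoch(val_losses, patience=2, min_delta=0.0):
    """Return (best_epoch, stop_epoch) for a sequence of per-epoch val losses."""
    best_loss, best_epoch, waited = float("inf"), -1, 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss - min_delta:
            best_loss, best_epoch, waited = loss, epoch, 0  # improvement: reset counter
        else:
            waited += 1
            if waited >= patience:
                return best_epoch, epoch                     # stop; restore best checkpoint
    return best_epoch, len(val_losses) - 1                   # never triggered


val_losses = [0.90, 0.70, 0.60, 0.58, 0.61, 0.63, 0.70]
print(early_stopping_epoch(val_losses))  # (3, 5)
```

Training stops at epoch 5, and the weights saved at epoch 3 (the lowest validation loss) are the ones to keep.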
60 One diagram you’d draw in an interview.Easy
Answer: Two curves vs epochs: train loss down, val loss down then up—point at the elbow as overfitting onset. Pair with train vs val accuracy if classification.
Tie symptoms to train and val numbers—interviewers want concrete diagnostics.
Dropout & Regularization — 15 Interview Questions
61 What is dropout?Easy
Answer: During training, each activation is kept with probability 1−p and set to zero otherwise—different random mask each step. Reduces co-adaptation of neurons.
62 Dropout at training vs inference.Easy
Answer: Training: apply stochastic mask. Inference: no dropout—use full network. Expectation of output must match; handled by scaling (see inverted dropout or test-time multiply by 1−p).
63 What is inverted dropout?Medium
Answer: Scale kept activations by 1/(1−p) during training so inference needs no extra scaling. Common in frameworks—cleaner eval path.
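A minimal sketch of inverted dropout on a flat list of activations, with p as the drop probability as in the answers above. The function name and the explicit `rng` argument are choices made for this example.

```python
import random


def inverted_dropout(activations, p, training, rng):
    """Zero each activation with probability p; scale survivors by 1/(1-p)."""
    if not 0.0 <= p < 1.0:
        raise ValueError("p must be in [0, 1)")
    if not training or p == 0.0:
        return list(activations)       # inference: identity, no rescaling needed
    keep = 1.0 - p
    return [a / keep if rng.random() < keep else 0.0 for a in activations]


rng = random.Random(0)
ones = [1.0] * 100_000
train_out = inverted_dropout(ones, p=0.5, training=True, rng=rng)
eval_out = inverted_dropout(ones, p=0.5, training=False, rng=rng)

print(sorted(set(train_out)))              # [0.0, 2.0]
print(sum(train_out) / len(train_out))     # close to 1.0
print(eval_out == ones)                    # True
```

The training-time mean stays close to the input mean, so the next layer sees the same expected scale in both modes.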
64 Ensemble interpretation of dropout.Medium
Answer: Training samples many thinned subnets; inference averages over exponentially many such nets—approximated by using the full net with scaled weights. Explains regularizing effect.
65 Where is dropout usually applied?Easy
Answer: After fully connected or sometimes conv layers (less common in modern CNNs); often not on output layer. Transformers use attention dropout on weights/probs.
66 Typical dropout probability p?Easy
Answer: Hidden layers often 0.2–0.5; too high hurts capacity. Tune on validation; some architectures (BN-heavy nets) use less dropout.
67 L2 regularization (weight decay)—effect.Medium
Answer: Penalty λ||w||² encourages smaller weights, smoother functions, less overfitting. With SGD equivalent to shrinking weights each step; AdamW decouples decay properly.
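The equivalence for plain SGD is one line of algebra, checked numerically below. The sketch assumes the penalty is written as (λ/2)·w² so that its gradient is λ·w; with λ·w² the same holds with 2λ in place of λ.

```python
def sgd_step_l2(w, grad, lr, lam):
    """SGD step on loss + (lam / 2) * w**2 (penalty added to the gradient)."""
    return w - lr * (grad + lam * w)


def sgd_step_weight_decay(w, grad, lr, lam):
    """Shrink the weight by (1 - lr * lam), then take a plain SGD step."""
    return (1.0 - lr * lam) * w - lr * grad


w, grad, lr, lam = 0.8, -0.25, 0.1, 0.01
a = sgd_step_l2(w, grad, lr, lam)
b = sgd_step_weight_decay(w, grad, lr, lam)
print(abs(a - b) < 1e-12)  # True
```

With adaptive optimizers such as Adam the two forms are no longer the same, because the penalty gradient gets rescaled per parameter; that is the gap AdamW's decoupled decay addresses.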
68 L1 vs L2 for neural nets.Medium
Answer: L1 encourages sparsity: exact zeros with proximal (soft-thresholding) updates, while plain subgradient steps only push weights close to zero. L2 shrinks all weights smoothly. L2 is the default; L1 for feature selection or sparse models.
69 Monte Carlo dropout at test time—why?Hard
Answer: Leave dropout on during inference, average multiple forward passes—approximate predictive uncertainty (Bayesian NN heuristic).
70 Dropout with batch normalization—interaction?Hard
Answer: Order and strength matter; dropout before BN can shift batch statistics. Many modern vision models rely more on BN + data aug than heavy dropout—know it’s architecture-dependent.
71 Spatial dropout in CNNs.Medium
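Answer: Drop entire feature maps (channels) instead of individual activations. Adjacent pixels in a feature map are strongly correlated, so per-element dropout regularizes conv features only weakly; dropping whole channels gives stronger structural regularization.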
72 Is label smoothing regularization?Medium
Answer: Yes—softens targets so the model doesn’t become overconfident; acts on the loss, not weights directly.
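A small sketch of what "softens targets" means in numbers. It assumes the common convention of mixing the one-hot target with a uniform distribution, (1−ε)·one_hot + ε/K; the function names are illustrative.

```python
import math


def smooth_labels(true_class, num_classes, eps=0.1):
    """Mix a one-hot target with the uniform distribution over classes."""
    off = eps / num_classes
    return [1.0 - eps + off if k == true_class else off
            for k in range(num_classes)]


def cross_entropy(target_probs, logits):
    """Cross-entropy between a target distribution and softmax(logits)."""
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(z - m) for z in logits))
    return -sum(t * (z - log_z) for t, z in zip(target_probs, logits))


hard = [1.0, 0.0, 0.0, 0.0]
soft = smooth_labels(0, 4, eps=0.1)
confident_logits = [10.0, 0.0, 0.0, 0.0]

print(soft)
print(cross_entropy(hard, confident_logits) < cross_entropy(soft, confident_logits))  # True
```

With smoothed targets, a very confident prediction is penalized even when it is correct, which discourages pushing logits apart without bound.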
73 Gaussian noise on inputs as regularization.Easy
Answer: Adds robustness to small input perturbations—related to data augmentation and Tikhonov-style effects in linear models.
74 Stochastic depth / drop path (high level).Hard
Answer: Randomly skip whole residual branches during training—regularizes very deep networks similarly in spirit to dropout but on graph structure.
75 When prefer dropout vs weight decay?Medium
Answer: Often use both lightly. Dropout targets co-adaptation of activations; weight decay shrinks parameters. Large data + BN may need little dropout; small data FC nets benefit more.
State clearly: dropout off at eval unless doing MC dropout.