Neural Networks: 15 Essential Q&A
Interview Prep

Weight Initialization — 15 Interview Questions

Xavier/Glorot, He, fan-in and fan-out, why symmetry must break, and how init interacts with activation and depth.

1 Why can’t we set all weights to zero? (Easy)
Answer: Every neuron in a layer stays identical: same outputs, same gradients, same updates, so symmetry never breaks. Random (or otherwise asymmetric) init is needed so units can specialize.
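A minimal sketch (assuming PyTorch) of the symmetry problem: with any constant init, every row of the first layer's weight gradient comes out identical, so the units can never diverge.

    import torch
    import torch.nn as nn

    torch.manual_seed(0)
    net = nn.Sequential(nn.Linear(4, 3), nn.ReLU(), nn.Linear(3, 1))
    for p in net.parameters():
        nn.init.constant_(p, 0.5)  # all-zero init shows the same effect, with all-zero grads

    x = torch.randn(8, 4)
    net(x).sum().backward()
    print(net[0].weight.grad)  # all three rows identical: symmetry never breaks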
2 How are biases usually initialized? (Easy)
Answer: Zeros are usually fine for biases, since symmetry breaking comes from the weights. A small positive bias is sometimes used with ReLU to avoid dead neurons at the start of training.
3 What are fan-in and fan-out? (Easy)
Answer: For one neuron, fan-in is the number of incoming connections (the input dimension) and fan-out is the number of outgoing connections (how many next-layer units it feeds). Init schemes scale the weight variance using these counts.
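Reading the two counts off an actual layer (PyTorch assumed; the weight shape is the point):

    import torch.nn as nn

    lin = nn.Linear(in_features=128, out_features=64)  # weight shape: (64, 128)
    fan_in = lin.weight.shape[1]   # 128 incoming connections per neuron
    fan_out = lin.weight.shape[0]  # 64 units this layer feeds forward
    print(fan_in, fan_out)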
4 Xavier / Glorot initialization—idea. (Medium)
Answer: Choose the weight variance so activation variance stays roughly stable in the forward pass and gradient variance stays stable in the backward pass. Implemented as uniform or normal draws with variance ∝ 1/fan_avg. Suited to tanh/sigmoid, which are roughly linear near 0.
Var(W) ≈ 2 / (fan_in + fan_out)  (common form)
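A hedged PyTorch sketch, with an empirical check against the formula above:

    import torch.nn as nn

    lin = nn.Linear(256, 128)
    nn.init.xavier_normal_(lin.weight)  # N(0, 2 / (fan_in + fan_out))
    nn.init.zeros_(lin.bias)

    print(lin.weight.var().item())  # ≈ 0.0052
    print(2 / (256 + 128))          # 0.00520...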
5 He initialization—for which activation? (Medium)
Answer: Designed for ReLU: roughly half of all activations are zero, so the variance is scaled by fan_in only (Var(W) ≈ 2/fan_in). This prevents the signal from dying or exploding early in deep ReLU nets.
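PyTorch spells it “kaiming”; a minimal sketch for a ReLU conv layer:

    import torch.nn as nn

    conv = nn.Conv2d(64, 128, kernel_size=3)
    nn.init.kaiming_normal_(conv.weight, mode='fan_in', nonlinearity='relu')
    # std = sqrt(2 / fan_in), with fan_in = 3 * 3 * 64 here
    nn.init.zeros_(conv.bias)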
6 What goes wrong with too large or too small random init? (Medium)
Answer: Too large: activations and gradients explode. Too small: activations shrink layer by layer and gradients vanish, so learning stalls. Good init keeps the scale in a reasonable band across layers.
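A toy depth experiment (a sketch with a made-up helper, depth_std, and a tanh MLP) that shows the drift:

    import torch

    def depth_std(scale, depth=50, width=256):
        torch.manual_seed(0)
        x = torch.randn(1024, width)
        for _ in range(depth):
            w = torch.randn(width, width) * scale
            x = torch.tanh(x @ w)
        return x.std().item()

    print(depth_std(0.5))             # ~1.0: pre-activations explode, tanh saturates
    print(depth_std(0.001))           # ~0.0: activations vanish layer by layer
    print(depth_std((1/256) ** 0.5))  # Xavier-ish scale stays in a healthy band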
7 LeCun normal / uniform—one line. (Medium)
Answer: Another fan-in-based scaling (std = 1/√fan_in) to preserve variance; the same family as Xavier/He, with different constants for different activation assumptions.
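PyTorch has no dedicated LeCun helper as far as I know, so a manual sketch:

    import math
    import torch.nn as nn

    lin = nn.Linear(512, 256)
    nn.init.normal_(lin.weight, mean=0.0, std=1.0 / math.sqrt(512))  # std = 1/√fan_in
    nn.init.zeros_(lin.bias)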
8 Orthogonal initialization—when mentioned? (Hard)
Answer: Start from orthogonal weight matrices so all singular values are 1 at init, which helps gradient flow in very deep nets and RNNs. Less common as a default than Xavier/He for vanilla CNNs/MLPs.
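A quick check (PyTorch assumed) that orthogonal init really puts every singular value at 1:

    import torch
    import torch.nn as nn

    lin = nn.Linear(256, 256)
    nn.init.orthogonal_(lin.weight)
    print(torch.linalg.svdvals(lin.weight))  # all ≈ 1.0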
9 Transfer learning—how does “initialization” change? (Easy)
Answer: Load pretrained weights instead of random init; only the new head layers need a fresh init. Fine-tune with a small learning rate so the pretrained weights aren’t destroyed immediately.
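A common pattern, sketched with torchvision (the weights argument spelling varies across versions):

    import torch.nn as nn
    import torchvision.models as models

    model = models.resnet18(weights='IMAGENET1K_V1')  # pretrained weights are the "init"
    model.fc = nn.Linear(model.fc.in_features, 10)    # new head: the only fresh random init
    # fine-tune with a small LR (e.g. 1e-4) so early updates don't wreck the pretrained init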
10 Does batch norm make initialization less critical? (Medium)
Answer: Partly. BN stabilizes activation statistics, so training is less sensitive to the exact weight scale. You should still avoid pathological init, since a bad one can hurt before the BN statistics stabilize.
11 Residual blocks—init of last conv layer sometimes zero—why? (Hard)
Answer: Some designs initialize the last conv in a block near zero so the block starts as a near-identity (the skip path dominates), improving optimization of very deep nets.
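A sketch of the near-identity trick with a toy block of my own (not any specific paper's code): zeroing the last conv makes out = x + block(x) start at out ≈ x.

    import torch
    import torch.nn as nn

    class ResBlock(nn.Module):
        def __init__(self, ch):
            super().__init__()
            self.conv1 = nn.Conv2d(ch, ch, 3, padding=1)
            self.conv2 = nn.Conv2d(ch, ch, 3, padding=1)
            nn.init.zeros_(self.conv2.weight)  # block contributes ~0 at the start
            nn.init.zeros_(self.conv2.bias)

        def forward(self, x):
            return x + self.conv2(torch.relu(self.conv1(x)))  # starts as identity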
12 What is a “gain” or activation-specific multiplier? (Medium)
Answer: Frameworks multiply the base init scale by an activation-specific constant (e.g. one derived from the Leaky ReLU slope) so the variance matches that nonlinearity; the He/Xavier formulas already include these gains.
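In PyTorch the multiplier is exposed as calculate_gain; a short sketch:

    import torch.nn as nn

    gain = nn.init.calculate_gain('leaky_relu', param=0.2)  # sqrt(2 / (1 + 0.2**2))
    lin = nn.Linear(128, 128)
    nn.init.xavier_uniform_(lin.weight, gain=gain)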
13 fan_in for a conv layer? (Medium)
Answer: Typically k_h × k_w × in_channels per filter: the number of multiply-add inputs contributing to one output activation before the bias.
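Reading it off a real layer (PyTorch assumed):

    import torch.nn as nn

    conv = nn.Conv2d(in_channels=64, out_channels=128, kernel_size=3)
    k_h, k_w = conv.kernel_size
    fan_in = k_h * k_w * conv.in_channels  # 3 * 3 * 64 = 576
    print(fan_in)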
14 Why set a random seed in experiments? (Easy)
Answer: Reproducibility: the same init and data shuffling make comparisons fair. Seeding does not remove variance across runs; best practice is to report results over multiple seeds in papers.
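One common seeding helper (a sketch; full determinism needs more, e.g. torch.use_deterministic_algorithms):

    import random

    import numpy as np
    import torch

    def set_seed(seed: int) -> None:
        random.seed(seed)                 # Python RNG (e.g. shuffling)
        np.random.seed(seed)              # NumPy RNG
        torch.manual_seed(seed)           # CPU init and dropout masks
        torch.cuda.manual_seed_all(seed)  # all GPU devices

    set_seed(42)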
15 Default init you’d name in an interview? (Easy)
Answer: He for ReLU-based CNNs/MLPs; Xavier for tanh/sigmoid-heavy nets. Use framework defaults (kaiming_uniform, xavier_normal) and match the init to the activation.
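A compact match-init-to-activation helper you could sketch on a whiteboard (the function name is my own):

    import torch.nn as nn

    def init_weights(m: nn.Module, nonlinearity: str = 'relu') -> None:
        if isinstance(m, (nn.Linear, nn.Conv2d)):
            if nonlinearity == 'relu':
                nn.init.kaiming_normal_(m.weight, nonlinearity='relu')  # He
            else:
                nn.init.xavier_uniform_(m.weight)  # Xavier for tanh/sigmoid
            if m.bias is not None:
                nn.init.zeros_(m.bias)

    model = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 10))
    model.apply(init_weights)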
Say fan_in, activation, and symmetry—three anchors interviewers expect.

Quick review checklist

  • Zero weights; fan-in/fan-out; Xavier vs He.
  • Too large/small random; BN interaction; pretrained as init.
  • Conv fan_in; seed for reproducibility.