Neural Networks: 15 Essential Q&A
Interview Prep

Weight Initialization — 15 Interview Questions

Xavier/Glorot, He, fan-in and fan-out, why symmetry must break, and how init interacts with activation and depth.

1 Why can’t we set all weights to zero? (Easy)
Answer: Every neuron in a layer stays identical: same outputs, same gradients, same updates, so symmetry never breaks. Random (or otherwise asymmetric) init is needed so units can specialize.
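A minimal sketch (assuming PyTorch) of the symmetry problem: with any constant init, every row of the first layer's weight gradient comes out identical, so the units can never diverge.

    import torch
    import torch.nn as nn

    torch.manual_seed(0)
    net = nn.Sequential(nn.Linear(4, 3), nn.ReLU(), nn.Linear(3, 1))
    for p in net.parameters():
        nn.init.constant_(p, 0.5)  # all-zero init shows the same effect, with all-zero grads

    x = torch.randn(8, 4)
    net(x).sum().backward()
    print(net[0].weight.grad)  # all three rows identical: symmetry never breaks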
2 How are biases usually initialized? (Easy)
Answer: Zeros are usually fine for biases, since symmetry breaking comes from the weights. A small positive bias is sometimes used with ReLU to avoid dead neurons at the start of training.
3 What are fan-in and fan-out? (Easy)
Answer: For one neuron, fan-in is the number of incoming connections (the input dimension) and fan-out is the number of outgoing connections (how many next-layer units it feeds). Init schemes scale the weight variance using these counts.
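Reading the two counts off an actual layer (PyTorch assumed; the weight shape is the point):

    import torch.nn as nn

    lin = nn.Linear(in_features=128, out_features=64)  # weight shape: (64, 128)
    fan_in = lin.weight.shape[1]   # 128 incoming connections per neuron
    fan_out = lin.weight.shape[0]  # 64 units this layer feeds forward
    print(fan_in, fan_out)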
4 Xavier / Glorot initialization—idea. (Medium)
Answer: Choose the weight variance so activation variance stays roughly stable in the forward pass and gradient variance stays stable in the backward pass. Implemented as uniform or normal draws with variance ∝ 1/fan_avg. Suited to tanh/sigmoid, which are roughly linear near 0.
Var(W) ≈ 2 / (fan_in + fan_out)  (common form)
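A hedged PyTorch sketch, with an empirical check against the formula above:

    import torch.nn as nn

    lin = nn.Linear(256, 128)
    nn.init.xavier_normal_(lin.weight)  # N(0, 2 / (fan_in + fan_out))
    nn.init.zeros_(lin.bias)

    print(lin.weight.var().item())  # ≈ 0.0052
    print(2 / (256 + 128))          # 0.00520...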
5 He initialization—for which activation? (Medium)
Answer: Designed for ReLU: roughly half of all activations are zero, so the variance is scaled by fan_in only (Var(W) ≈ 2/fan_in). This prevents the signal from dying or exploding early in deep ReLU nets.
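PyTorch spells it “kaiming”; a minimal sketch for a ReLU conv layer:

    import torch.nn as nn

    conv = nn.Conv2d(64, 128, kernel_size=3)
    nn.init.kaiming_normal_(conv.weight, mode='fan_in', nonlinearity='relu')
    # std = sqrt(2 / fan_in), with fan_in = 3 * 3 * 64 here
    nn.init.zeros_(conv.bias)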
6 What goes wrong with too large or too small random init? (Medium)
Answer: Too large: activations and gradients explode. Too small: activations shrink layer by layer and gradients vanish, so learning stalls. Good init keeps the scale in a reasonable band across layers.
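A toy depth experiment (a sketch with a made-up helper, depth_std, and a tanh MLP) that shows the drift:

    import torch

    def depth_std(scale, depth=50, width=256):
        torch.manual_seed(0)
        x = torch.randn(1024, width)
        for _ in range(depth):
            w = torch.randn(width, width) * scale
            x = torch.tanh(x @ w)
        return x.std().item()

    print(depth_std(0.5))             # ~1.0: pre-activations explode, tanh saturates
    print(depth_std(0.001))           # ~0.0: activations vanish layer by layer
    print(depth_std((1/256) ** 0.5))  # Xavier-ish scale stays in a healthy band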
7 LeCun normal / uniform—one line. (Medium)
Answer: Another fan-in-based scaling (std = 1/√fan_in) to preserve variance; the same family as Xavier/He, with different constants for different activation assumptions.
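PyTorch has no dedicated LeCun helper as far as I know, so a manual sketch:

    import math
    import torch.nn as nn

    lin = nn.Linear(512, 256)
    nn.init.normal_(lin.weight, mean=0.0, std=1.0 / math.sqrt(512))  # std = 1/√fan_in
    nn.init.zeros_(lin.bias)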
8 Orthogonal initialization—when mentioned? (Hard)
Answer: Start from orthogonal weight matrices so all singular values are 1 at init, which helps gradient flow in very deep nets and RNNs. Less common as a default than Xavier/He for vanilla CNNs/MLPs.
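A quick check (PyTorch assumed) that orthogonal init really puts every singular value at 1:

    import torch
    import torch.nn as nn

    lin = nn.Linear(256, 256)
    nn.init.orthogonal_(lin.weight)
    print(torch.linalg.svdvals(lin.weight))  # all ≈ 1.0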
9 Transfer learning—how does “initialization” change? (Easy)
Answer: Load pretrained weights instead of random init; only the new head layers need a fresh init. Fine-tune with a small learning rate so the pretrained weights aren’t destroyed immediately.
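A common pattern, sketched with torchvision (the weights argument spelling varies across versions):

    import torch.nn as nn
    import torchvision.models as models

    model = models.resnet18(weights='IMAGENET1K_V1')  # pretrained weights are the "init"
    model.fc = nn.Linear(model.fc.in_features, 10)    # new head: the only fresh random init
    # fine-tune with a small LR (e.g. 1e-4) so early updates don't wreck the pretrained init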
10 Does batch norm make initialization less critical? (Medium)
Answer: Partly. BN stabilizes activation statistics, so training is less sensitive to the exact weight scale. You should still avoid pathological init, since a bad one can hurt before the BN statistics stabilize.
11 Residual blocks—init of last conv layer sometimes zero—why? (Hard)
Answer: Some designs initialize the last conv in a block near zero so the block starts as a near-identity (the skip path dominates), improving optimization of very deep nets.
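A sketch of the near-identity trick with a toy block of my own (not any specific paper's code): zeroing the last conv makes out = x + block(x) start at out ≈ x.

    import torch
    import torch.nn as nn

    class ResBlock(nn.Module):
        def __init__(self, ch):
            super().__init__()
            self.conv1 = nn.Conv2d(ch, ch, 3, padding=1)
            self.conv2 = nn.Conv2d(ch, ch, 3, padding=1)
            nn.init.zeros_(self.conv2.weight)  # block contributes ~0 at the start
            nn.init.zeros_(self.conv2.bias)

        def forward(self, x):
            return x + self.conv2(torch.relu(self.conv1(x)))  # starts as identity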
12 What is a “gain” or activation-specific multiplier? (Medium)
Answer: Frameworks multiply the base init scale by an activation-specific constant (e.g. one derived from the Leaky ReLU slope) so the variance matches that nonlinearity; the He/Xavier formulas already include these gains.
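In PyTorch the multiplier is exposed as calculate_gain; a short sketch:

    import torch.nn as nn

    gain = nn.init.calculate_gain('leaky_relu', param=0.2)  # sqrt(2 / (1 + 0.2**2))
    lin = nn.Linear(128, 128)
    nn.init.xavier_uniform_(lin.weight, gain=gain)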
13 fan_in for a conv layer? (Medium)
Answer: Typically k_h × k_w × in_channels per filter: the number of multiply-add inputs contributing to one output activation before the bias.
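Reading it off a real layer (PyTorch assumed):

    import torch.nn as nn

    conv = nn.Conv2d(in_channels=64, out_channels=128, kernel_size=3)
    k_h, k_w = conv.kernel_size
    fan_in = k_h * k_w * conv.in_channels  # 3 * 3 * 64 = 576
    print(fan_in)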
14 Why set a random seed in experiments? (Easy)
Answer: Reproducibility: the same init and data shuffling make comparisons fair. Seeding does not remove variance across runs; best practice is to report results over multiple seeds in papers.
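One common seeding helper (a sketch; full determinism needs more, e.g. torch.use_deterministic_algorithms):

    import random

    import numpy as np
    import torch

    def set_seed(seed: int) -> None:
        random.seed(seed)                 # Python RNG (e.g. shuffling)
        np.random.seed(seed)              # NumPy RNG
        torch.manual_seed(seed)           # CPU init and dropout masks
        torch.cuda.manual_seed_all(seed)  # all GPU devices

    set_seed(42)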
15 Default init you’d name in an interview? (Easy)
Answer: He for ReLU-based CNNs/MLPs; Xavier for tanh/sigmoid-heavy nets. Use framework defaults (kaiming_uniform, xavier_normal) and match the init to the activation.
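A compact match-init-to-activation helper you could sketch on a whiteboard (the function name is my own):

    import torch.nn as nn

    def init_weights(m: nn.Module, nonlinearity: str = 'relu') -> None:
        if isinstance(m, (nn.Linear, nn.Conv2d)):
            if nonlinearity == 'relu':
                nn.init.kaiming_normal_(m.weight, nonlinearity='relu')  # He
            else:
                nn.init.xavier_uniform_(m.weight)  # Xavier for tanh/sigmoid
            if m.bias is not None:
                nn.init.zeros_(m.bias)

    model = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 10))
    model.apply(init_weights)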
Say fan_in, activation, and symmetry—three anchors interviewers expect.

Quick review checklist

  • Zero weights; fan-in/fan-out; Xavier vs He.
  • Too large/small random; BN interaction; pretrained as init.
  • Conv fan_in; seed for reproducibility.