Advanced CNN Architectures — Interview Q&A

Question 1

1 What is ResNet? ⚡ easy

Answer

Answer: Deep CNN where layers learn residual functions F(x) with skip connections adding input x—enables training very deep networks (50–1000+ layers).

Question 2

2 What is the degradation problem? 🔥 hard

Answer

Answer: As depth increases, training error can get worse even without overfitting—not vanishing gradients alone; harder optimization for plain deep stacks.

Question 3

3 Basic residual block? 📊 medium

Answer

Answer: y = F(x, {W_i}) + x where F is usually two 3×3 convs + BN + ReLU—output same spatial size as x for identity add.
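
A minimal sketch of the additive skip, assuming a toy F built from two small dense layers rather than 3×3 convs with BN. The helper names (`dense`, `residual_block`) and the weights are illustrative, not taken from any library.

```python
def relu(v):
    return [max(0.0, a) for a in v]

def dense(weights, v):
    # weights is a list of rows; returns the matrix-vector product
    return [sum(w * a for w, a in zip(row, v)) for row in weights]

def residual_block(x, w1, w2):
    # F(x): dense -> ReLU -> dense, a toy stand-in for conv-BN-ReLU-conv-BN
    f = dense(w2, relu(dense(w1, x)))
    # identity shortcut: F(x) and x must have the same shape to be added
    return relu([fi + xi for fi, xi in zip(f, x)])

x = [1.0, 2.0, 3.0]
zeros = [[0.0] * 3 for _ in range(3)]
# With F's weights at zero, F(x) = 0 and the block passes x through unchanged
y_identity = residual_block(x, zeros, zeros)
```

With all-zero weights the block reduces to the identity (for non-negative x), which is why stacking more such blocks should not, in principle, make the network worse.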

Question 4

4 Identity shortcut? ⚡ easy

Answer

Answer: Skip connection adds x directly when dimensions match—if channels/stride differ, use 1×1 conv projection on shortcut to match shape.

Question 5

5 Why learn residual F? 🔥 hard

Answer

Answer: If optimal mapping is close to identity, easier to learn small perturbation F than full mapping; empirically eases optimization of deep nets.

Question 6

6 When projection shortcut? 📊 medium

Answer

Answer: When block changes spatial size (stride 2) or channel count—1×1 conv on x with stride aligns dimensions for addition.
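
A small shape check, assuming a basic block with two 3×3 convs (padding 1) and a 1×1 projection with no padding. The function names are hypothetical; only the standard conv output-size formula is relied on.

```python
def conv_out_size(size, kernel, stride, padding):
    # standard convolution output-size formula (floor division)
    return (size + 2 * padding - kernel) // stride + 1

def shortcut_plan(h, c_in, c_out, stride):
    # a projection is needed whenever the add would otherwise see mismatched shapes
    needs_projection = stride != 1 or c_in != c_out
    # main branch: 3x3 conv (given stride, padding 1), then 3x3 conv (stride 1, padding 1)
    h_main = conv_out_size(conv_out_size(h, 3, stride, 1), 3, 1, 1)
    if needs_projection:
        # 1x1 conv on x with the same stride, no padding
        short = (c_out, conv_out_size(h, 1, stride, 0))
    else:
        short = (c_in, h)
    return needs_projection, (c_out, h_main), short

# 56x56 map, 64 -> 128 channels, stride 2: both branches end at (128, 28)
plan = shortcut_plan(h=56, c_in=64, c_out=128, stride=2)
```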

Question 7

7 Bottleneck block? 📊 medium

Answer

Answer: 1×1 reduce channels → 3×3 spatial conv → 1×1 expand—cuts FLOPs for deep models (ResNet-50+).
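
To make the saving concrete, here is a rough weight count for a 256-channel block, assuming a 4× reduction (256 → 64 → 64 → 256) and ignoring BN parameters and biases. The helper names are illustrative.

```python
def conv_params(c_in, c_out, k):
    # weights only; biases are typically dropped when BN follows
    return k * k * c_in * c_out

def basic_block_params(c):
    # two 3x3 convs at full width
    return 2 * conv_params(c, c, 3)

def bottleneck_params(c, reduced):
    # 1x1 reduce -> 3x3 at reduced width -> 1x1 expand
    return (conv_params(c, reduced, 1)
            + conv_params(reduced, reduced, 3)
            + conv_params(reduced, c, 1))

plain = basic_block_params(256)          # 1179648
bottleneck = bottleneck_params(256, 64)  # 69632
```

At this width the bottleneck uses roughly 17× fewer weights than two full-width 3×3 convs; FLOPs scale the same way at a fixed spatial size.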

Question 8

8 Common depths? ⚡ easy

Answer

Answer: ResNet-18/34 use basic blocks; 50/101/152 use bottleneck—standard backbones for detection/segmentation.

Question 9

9 Relation to vanishing gradients? 📊 medium

Answer

Answer: Shortcuts provide gradient highways—identity path carries gradients deeper; complements BN and good init.

Question 10

10 BN ordering? 📊 medium

Answer

Answer: Original (v1): conv → BN → ReLU inside F, with a final ReLU after the addition. ResNet v2 uses pre-activation (BN → ReLU → conv) so the shortcut path stays a clean identity. Interviews often accept the conv-BN-ReLU block.

Question 11

11 Initialization? ⚡ easy

Answer

Answer: He init for conv layers suited to ReLU—standard with ResNet training recipes.
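
A sketch of the He (Kaiming) normal rule, std = sqrt(2 / fan_in), assuming fan-in mode and a plain Gaussian sampler. The function names and the seed are illustrative.

```python
import math
import random

def he_std(c_in, k):
    # fan_in of a k x k conv is the number of inputs feeding one output unit
    fan_in = c_in * k * k
    return math.sqrt(2.0 / fan_in)

def he_normal_weights(c_in, k, n, seed=0):
    # draw n weights from N(0, std^2); seed fixed only for reproducibility
    rng = random.Random(seed)
    std = he_std(c_in, k)
    return [rng.gauss(0.0, std) for _ in range(n)]

std = he_std(64, 3)                    # 3x3 conv with 64 input channels
w = he_normal_weights(64, 3, 10000)
```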

Question 12

12 ResNeXt? 🔥 hard

Answer

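Answer: Replaces the bottleneck's 3×3 conv with a grouped conv; the number of groups is the cardinality (e.g. 32), a third design dimension besides depth and width. Raising cardinality improves accuracy at similar FLOPs and parameter count.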
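
A rough parameter comparison, assuming the commonly cited 32×4d configuration (32 groups, 4 channels per group, so a 128-wide grouped 3×3) against a 64-wide ResNet bottleneck. The helper name is illustrative; BN and biases are ignored.

```python
def grouped_conv_params(c_in, c_out, k, groups=1):
    # each output channel only sees c_in / groups input channels
    assert c_in % groups == 0 and c_out % groups == 0
    return k * k * (c_in // groups) * c_out

resnet_bottleneck = (grouped_conv_params(256, 64, 1)
                     + grouped_conv_params(64, 64, 3)
                     + grouped_conv_params(64, 256, 1))

resnext_32x4d = (grouped_conv_params(256, 128, 1)
                 + grouped_conv_params(128, 128, 3, groups=32)
                 + grouped_conv_params(128, 256, 1))
```

Both blocks land near 70k weights, so the grouped block is twice as wide in its 3×3 stage at essentially the same budget.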

Question 13

13 ResNet in detection? 📊 medium

Answer

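Answer: Common backbone in Faster R-CNN and RetinaNet. Without FPN, heads read the C4 or C5 stage output; with FPN, multi-scale stage outputs (C2 to C5 in Faster R-CNN, C3 to C5 in RetinaNet) feed a top-down pyramid.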

Question 14

14 In segmentation? 📊 medium

Answer

Answer: Encoder backbone (e.g. ResNet-50) + decoder (U-Net style, ASPP)—pretrained ImageNet weights standard.

Question 15

15 vs VGG? 📊 medium

Answer

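Answer: ResNet reaches higher ImageNet accuracy at lower cost: ResNet-152 (about 11.3 billion multiply-adds) is still cheaper than VGG-16/19 (about 15.3/19.6 billion). Savings come from bottleneck blocks, early downsampling, and global average pooling in place of VGG's large FC layers.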

Question 16

16 Training recipe? 📊 medium

Answer

Answer: SGD + momentum, step LR decay, weight decay, long epochs on ImageNet—augmentation similar to prior CNNs.
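
A sketch of step LR decay. The specific values (base LR 0.1, divide by 10 every 30 epochs) are a common convention assumed here for illustration; the answer above does not fix them.

```python
def step_lr(epoch, base_lr=0.1, step_size=30, gamma=0.1):
    # multiply the learning rate by gamma once every step_size epochs
    return base_lr * gamma ** (epoch // step_size)

# learning rate at the start of each decay stage in a 90-epoch run
schedule = [step_lr(e) for e in (0, 30, 60)]
```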

Question 17

17 vs DenseNet? 🔥 hard

Answer

Answer: DenseNet concatenates all previous features—different memory/compute tradeoff; ResNet adds single skip per block.
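
The tradeoff shows up directly in channel counts. A small sketch, assuming 64 input channels and a growth rate of 32 (illustrative values), with hypothetical helper names:

```python
def resnet_channels(c0, num_blocks):
    # addition keeps the channel count fixed from block to block
    return [c0] * (num_blocks + 1)

def densenet_channels(c0, growth_rate, num_layers):
    # concatenation adds growth_rate channels to the input of every later layer
    return [c0 + i * growth_rate for i in range(num_layers + 1)]

res = resnet_channels(64, 6)          # [64, 64, 64, 64, 64, 64, 64]
dense = densenet_channels(64, 32, 6)  # [64, 96, 128, 160, 192, 224, 256]
```

The growing input width is why dense blocks tend to hold more activations in memory, while each layer itself stays narrow.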

Question 18

18 Write the equation. ⚡ easy

Answer

Answer: Original ResNet: y = σ(F(x) + x), with σ = ReLU applied after the add. Pre-activation variants (v2) drop the post-add activation: y = F(x) + x. Core idea is the additive skip.

Question 19

19 Zero-init last BN? 🔥 hard

Answer

Answer: Some training refinements initialize last BN in residual branch to zero so block starts near identity—stabilizes early training.
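
A toy illustration of why a zero BN scale makes the block start as an identity. It assumes the branch activations are already normalized and models only BN's affine step; the names are illustrative.

```python
def scale_shift(v, gamma, beta):
    # affine part of BN applied to already-normalized activations
    return [gamma * a + beta for a in v]

def residual_output(x, branch_normalized, gamma, beta=0.0):
    # last BN of the residual branch, then add the shortcut and apply ReLU
    f = scale_shift(branch_normalized, gamma, beta)
    return [max(0.0, fi + xi) for fi, xi in zip(f, x)]

x = [0.5, 1.5, 0.0]         # non-negative, as after a previous ReLU
branch = [2.0, -1.0, 0.7]   # arbitrary normalized branch activations
y = residual_output(x, branch, gamma=0.0)
```

With gamma = 0 the branch contributes nothing, so y equals x regardless of what the branch computes; training then grows gamma away from zero as the branch becomes useful.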

Question 20

20 Still used? ⚡ easy

Answer

Answer: Yes—strong baseline; ConvNeXt, ViT compete on benchmarks but ResNet remains default for robustness and tooling.

Question 21

21 What is MobileNet? 📊 medium

Answer

Answer: Efficient CNN family for mobile/edge using depthwise separable convolutions to cut FLOPs and parameters vs standard convs.

Question 22

22 What is depthwise convolution? 📊 medium

Answer

Answer: Each input channel gets its own k×k spatial filter, with no mixing across channels. Parameters drop from k²·C_in·C_out (standard conv) to k²·C_in.

Question 23

23 What is pointwise convolution? 📊 medium

Answer

Answer: 1×1 conv after depthwise—mixes channels at each spatial location, like per-pixel linear layer across depth.

Question 24

24 Complexity vs standard conv? 🔥 hard

Answer

Answer: Depthwise separable cost relative to a standard k×k conv is roughly 1/C_out + 1/k². For 3×3 kernels that is about 8 to 9× fewer MACs; savings grow with kernel size and output channels.
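
The ratio can be checked by counting multiply-accumulates, assuming stride 1 and an output map of the same h×w as the input. The function names and layer sizes are illustrative.

```python
def standard_conv_macs(h, w, c_in, c_out, k):
    # every output position mixes all input channels through a k x k window
    return h * w * c_in * c_out * k * k

def depthwise_separable_macs(h, w, c_in, c_out, k):
    depthwise = h * w * c_in * k * k   # one k x k filter per input channel
    pointwise = h * w * c_in * c_out   # 1x1 conv mixes channels
    return depthwise + pointwise

std = standard_conv_macs(14, 14, 512, 512, 3)
sep = depthwise_separable_macs(14, 14, 512, 512, 3)
ratio = sep / std                      # equals 1/c_out + 1/k**2
```

For this layer the ratio is about 0.113, i.e. close to 9× fewer MACs, dominated by the 1/k² term.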

Question 25

25 Width multiplier α? ⚡ easy

Answer

Answer: Uniformly thin every layer’s channels by α ∈ (0,1]. Compute and parameters shrink roughly as α², giving a simple accuracy–latency knob for deployment targets.

Question 26

26 Resolution multiplier? 📊 medium

Answer

Answer: Scale the input resolution by ρ ∈ (0,1] for training and inference. FLOPs drop roughly as ρ² (parameters unchanged), with a predictable accuracy drop.
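
A sketch of how the width multiplier α and resolution multiplier ρ enter the cost of one depthwise separable layer. The function name, the layer sizes, and truncating with `int` are assumptions for illustration.

```python
def separable_macs(size, c_in, c_out, k=3, alpha=1.0, rho=1.0):
    s = int(rho * size)     # feature-map side after resolution scaling
    m = int(alpha * c_in)   # thinned input channels
    n = int(alpha * c_out)  # thinned output channels
    # depthwise term + pointwise term, at every spatial position
    return s * s * (m * k * k + m * n)

base = separable_macs(14, 512, 512)
half_width = separable_macs(14, 512, 512, alpha=0.5)
half_res = separable_macs(14, 512, 512, rho=0.5)
```

Halving resolution gives exactly a quarter of the MACs here. Halving width gives slightly more than a quarter, because the depthwise term scales with α rather than α².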

Question 27

27 MobileNetV2 inverted residual? 🔥 hard

Answer

Answer: Expand low-dim bottleneck → depthwise → project back. The shortcut connects the thin bottlenecks (memory efficient), the reverse of the classical ResNet bottleneck, whose shortcut connects wide layers around a narrow middle (wide → narrow → wide).

Question 28

28 Why expansion t? 📊 medium

Answer

Answer: Depthwise needs rich features—expand channels by factor t before DW conv, then linear 1×1 compress to avoid ReLU killing info in low-dim subspace.
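
A sketch of the block's bookkeeping, assuming expansion factor t = 6 and a 3×3 depthwise conv. The helper names are hypothetical; BN parameters are ignored.

```python
def inverted_residual_params(c_in, c_out, t=6, k=3):
    hidden = c_in * t
    expand = c_in * hidden       # 1x1 expansion, followed by ReLU6
    depthwise = hidden * k * k   # k x k depthwise, followed by ReLU6
    project = hidden * c_out     # 1x1 linear projection, no activation
    return expand + depthwise + project

def uses_shortcut(c_in, c_out, stride):
    # the skip is only added when input and output shapes match
    return stride == 1 and c_in == c_out

params = inverted_residual_params(24, 24)  # 24 -> 144 -> 144 -> 24
```

Nearly all weights sit in the two 1×1 convs; the depthwise conv that works in the expanded 144-channel space is comparatively cheap.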

Question 29

29 ReLU6? 📊 medium

Answer

Answer: Clip ReLU at 6: ReLU6(x) = min(max(x, 0), 6). Originally motivated by robustness under low-precision computation; still seen in some mobile architectures.

Question 30

30 MobileNet + SSD? 📊 medium

Answer

Answer: Lightweight object detectors attach SSD heads to MobileNet stages—real-time on phones with acceptable mAP on constrained devices.

Question 31

31 vs ShuffleNet? 🔥 hard

Answer

Answer: ShuffleNet uses channel shuffle after grouped convs—different structural trick; both target efficient inference.

Question 32

32 vs EfficientNet? 📊 medium

Answer

Answer: EfficientNet scales depth/width/resolution together (compound scaling)—often better Pareto frontier; MobileNet simpler family widely supported in runtimes.

Question 33

33 MobileNetV3? 🔥 hard

Answer

Answer: Uses platform-aware NAS + NetAdapt for layer choices, h-swish activations in some layers, and squeeze-and-excitation modules in some blocks. Improved accuracy per FLOP.
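
For reference, h-swish is x · ReLU6(x + 3) / 6, a piecewise-linear stand-in for swish. A scalar sketch, with function names chosen here for illustration:

```python
def relu6(x):
    return min(max(x, 0.0), 6.0)

def hard_sigmoid(x):
    # piecewise-linear approximation of the sigmoid, built from ReLU6
    return relu6(x + 3.0) / 6.0

def hard_swish(x):
    # x * hard_sigmoid(x): zero for x <= -3, identity for x >= 3
    return x * hard_sigmoid(x)

values = [hard_swish(v) for v in (-4.0, 0.0, 3.0, 5.0)]
```

Because it needs only a clamp, an add, and multiplies, it avoids the exponential in a true sigmoid, which is the usual argument for it on mobile hardware.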

Question 34

34 Squeeze-and-excitation? 📊 medium

Answer

Answer: Global pool → small FC → channel gates—recalibrates channel importance; appears in MobileNetV3 and many efficient nets.
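
A pure-Python sketch of the squeeze, excite, and scale steps on a tiny feature map. It assumes a plain sigmoid gate (MobileNetV3 substitutes a hard sigmoid) and zero-initialized FC weights chosen only to keep the result easy to check; all names are illustrative.

```python
import math

def squeeze_excite(feature_map, w_reduce, w_expand):
    # feature_map: list of channels, each a flat list of spatial activations
    # squeeze: global average pool per channel
    pooled = [sum(ch) / len(ch) for ch in feature_map]
    # excite: FC reduce -> ReLU -> FC expand -> sigmoid
    hidden = [max(0.0, sum(w * p for w, p in zip(row, pooled)))
              for row in w_reduce]
    logits = [sum(w * h for w, h in zip(row, hidden)) for row in w_expand]
    gates = [1.0 / (1.0 + math.exp(-z)) for z in logits]
    # scale: multiply each channel by its gate
    scaled = [[g * a for a in ch] for g, ch in zip(gates, feature_map)]
    return scaled, gates

fmap = [[1.0, 3.0], [2.0, 2.0], [0.0, 4.0], [5.0, 5.0]]  # 4 channels, 2 positions
w_reduce = [[0.0] * 4 for _ in range(2)]                  # 4 -> 2 (reduction ratio 2)
w_expand = [[0.0] * 2 for _ in range(4)]                  # 2 -> 4
scaled, gates = squeeze_excite(fmap, w_reduce, w_expand)
```

With zero weights every gate is 0.5, so each channel is simply halved; trained weights would push gates toward 0 or 1 per channel.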

Question 35

35 Quantization? ⚡ easy

Answer

Answer: Depthwise-heavy nets are often deployed as INT8: cheaper integer MACs and about 4× less weight memory than FP32. Validate accuracy after PTQ/QAT.
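
A minimal sketch of affine INT8 quantization, real ≈ scale · (q − zero_point), as a generic scheme rather than any particular runtime's API. The function names and the [0, 6] calibration range are assumptions for illustration.

```python
def quant_params(x_min, x_max, qmin=-128, qmax=127):
    # widen the range so that real 0.0 is exactly representable
    x_min, x_max = min(x_min, 0.0), max(x_max, 0.0)
    scale = (x_max - x_min) / (qmax - qmin)
    zero_point = int(round(qmin - x_min / scale))
    return scale, max(qmin, min(qmax, zero_point))

def quantize(x, scale, zero_point, qmin=-128, qmax=127):
    # round to the nearest integer step, then clamp to the int8 range
    q = int(round(x / scale)) + zero_point
    return max(qmin, min(qmax, q))

def dequantize(q, scale, zero_point):
    return scale * (q - zero_point)

scale, zp = quant_params(0.0, 6.0)  # e.g. the output range of ReLU6
roundtrip = dequantize(quantize(2.0, scale, zp), scale, zp)
```

The round-trip error is bounded by half a quantization step inside the calibrated range; values outside it are clamped, which is one source of post-quantization accuracy loss.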

Question 36

36 Strided depthwise? 📊 medium

Answer

Answer: Depthwise conv with stride 2 downsamples spatially—paired with pointwise for channel mix; replaces pooling in many blocks.

Question 37

37 Pointwise = ? ⚡ easy

Answer

Answer: Standard conv with 1×1 kernel—channel mixing only, no spatial context.

Question 38

38 Transfer to tasks? 📊 medium

Answer

Answer: ImageNet-pretrained MobileNet backbones fine-tune for classification, detection, segmentation with small heads—standard on edge.

Question 39

39 Accuracy ceiling? 📊 medium

Answer

Answer: Extreme width/resolution cuts hurt top-1 on hard datasets—need larger efficient families or distillation from big teacher.

Question 40

40 Deployment? ⚡ easy

Answer

Answer: Use vendor runtimes (CoreML, NNAPI, TensorRT) with fused DW+PW kernels; profile latency not just FLOPs.
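
A generic wall-clock timing sketch for the "profile latency" advice. The `fn` argument stands in for a model's inference call and is hypothetical; real runtimes usually ship their own profilers, which should be preferred on device.

```python
import statistics
import time

def measure_latency_ms(fn, warmup=5, runs=20):
    # warm-up runs let caches, lazy initialization, and clocks settle
    for _ in range(warmup):
        fn()
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000.0)
    # the median is less sensitive to scheduling spikes than the mean
    return statistics.median(samples)

# placeholder workload standing in for a model forward pass
latency = measure_latency_ms(lambda: sum(range(1000)))
```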
