Computer Vision Interview 40 Q&A Chapter 15

Advanced CNN Architectures — Interview Q&A

ResNet skip connections, MobileNet depthwise separable convolutions, and EfficientNet scaling.

ResNet: 20 Essential Q&A

1 What is ResNet? ⚡ easy
Answer: Deep CNN where layers learn residual functions F(x) with skip connections adding input x—enables training very deep networks (50–1000+ layers).
2 What is the degradation problem? 🔥 hard
Answer: As depth increases, training error can get worse even without overfitting—not vanishing gradients alone; harder optimization for plain deep stacks.
3 Basic residual block? 📊 medium
Answer: y = F(x, {W_i}) + x where F is usually two 3×3 convs + BN + ReLU—output same spatial size as x for identity add.
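A minimal sketch of the residual form above in plain Python. It assumes a flat list of activations and a hypothetical toy branch (`toy_branch`, a per-element scale and shift) standing in for the real conv + BN + ReLU stack; it only illustrates the additive skip and the shape requirement.

```python
def residual_block(x, branch):
    """Return y = F(x) + x for a branch F that preserves the length of x."""
    fx = branch(x)
    if len(fx) != len(x):
        raise ValueError("identity add needs F(x) and x to have the same shape")
    return [f + xi for f, xi in zip(fx, x)]


def toy_branch(x, weight=0.5, bias=0.25):
    # Hypothetical stand-in for conv -> BN -> ReLU -> conv -> BN.
    return [weight * xi + bias for xi in x]


x = [1.0, -2.0, 3.0]
y = residual_block(x, toy_branch)                          # F(x) + x
y_identity = residual_block(x, lambda v: [0.0] * len(v))   # F = 0 gives back x
```

When the branch outputs zeros the block reduces to the identity, which is why a deeper residual net can in principle match a shallower one.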
4 Identity shortcut? ⚡ easy
Answer: Skip connection adds x directly when dimensions match—if channels/stride differ, use 1×1 conv projection on shortcut to match shape.
5 Why learn residual F? 🔥 hard
Answer: If optimal mapping is close to identity, easier to learn small perturbation F than full mapping; empirically eases optimization of deep nets.
6 When projection shortcut? 📊 medium
Answer: When block changes spatial size (stride 2) or channel count—1×1 conv on x with stride aligns dimensions for addition.
7 Bottleneck block? 📊 medium
Answer: 1×1 reduce channels → 3×3 spatial conv → 1×1 expand—cuts FLOPs for deep models (ResNet-50+).
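To make the saving concrete, here is a rough weight count for a basic block versus a bottleneck block. The 256 → 64 → 256 widths are an illustrative choice resembling an early ResNet-50 stage; biases and BN parameters are ignored.

```python
def conv_params(c_in, c_out, k):
    """Weights in a k x k conv layer, ignoring bias and BN parameters."""
    return k * k * c_in * c_out


def basic_block_params(channels):
    # Two 3x3 convs at full width.
    return 2 * conv_params(channels, channels, 3)


def bottleneck_params(channels, reduced):
    # 1x1 reduce -> 3x3 at reduced width -> 1x1 expand.
    return (conv_params(channels, reduced, 1)
            + conv_params(reduced, reduced, 3)
            + conv_params(reduced, channels, 1))


wide = basic_block_params(256)        # 1_179_648 weights
narrow = bottleneck_params(256, 64)   # 69_632 weights
```

At this width the bottleneck needs about 17× fewer weights than two full-width 3×3 convs, and FLOPs shrink by the same factor at a fixed feature-map size.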
8 Common depths? ⚡ easy
Answer: ResNet-18/34 use basic blocks; 50/101/152 use bottleneck—standard backbones for detection/segmentation.
9 Relation to vanishing gradients? 📊 medium
Answer: Shortcuts provide gradient highways—identity path carries gradients deeper; complements BN and good init.
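A toy scalar illustration of the gradient highway, assuming each layer has the same small local derivative (a hypothetical 0.1). In a plain stack the backpropagated factor is a product of derivatives; with y = x + f(x) each factor becomes 1 + f'(x).

```python
def plain_gradient(local_derivs):
    """d(output)/d(input) for a plain stack y = f_n(...f_1(x))."""
    grad = 1.0
    for d in local_derivs:
        grad *= d
    return grad


def residual_gradient(local_derivs):
    """Same quantity when each layer is y = x + f(x): factor is 1 + f'(x)."""
    grad = 1.0
    for d in local_derivs:
        grad *= 1.0 + d
    return grad


derivs = [0.1] * 50                # hypothetical small branch derivative per layer
plain = plain_gradient(derivs)     # about 1e-50: the signal vanishes
skip = residual_gradient(derivs)   # 1.1 ** 50, roughly 117: the signal survives
```

Real networks involve Jacobians rather than scalars, so this is only an intuition pump; it also hints at why BN and careful init are still needed to keep the product from growing.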
10 BN ordering? 📊 medium
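Answer: Original (v1): conv → BN → ReLU inside F, with the final ReLU applied after the addition. The pre-activation variant (ResNet v2) reorders to BN → ReLU → conv and leaves the shortcut path clean. Interviewers usually accept the conv-BN-ReLU block.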
11 Initialization? ⚡ easy
Answer: He init for conv layers suited to ReLU—standard with ResNet training recipes.
12 ResNeXt? 🔥 hard
Answer: Replaces the bottleneck's 3×3 conv with a grouped conv; the number of groups is the cardinality, a third dimension alongside depth and width. Improves accuracy at similar FLOPs.
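A sketch of why grouping lets the block get wider at similar cost. The widths below (a 64-wide ungrouped bottleneck versus a 128-wide bottleneck with 32 groups) are illustrative assumptions; biases and BN parameters are ignored.

```python
def grouped_conv_params(c_in, c_out, k, groups=1):
    """Weights in a k x k conv split into `groups` independent channel groups."""
    if c_in % groups or c_out % groups:
        raise ValueError("channels must be divisible by groups")
    return k * k * (c_in // groups) * (c_out // groups) * groups


# ResNet-style bottleneck: 256 -> 64 -> 64 -> 256, no grouping.
resnet_block = (grouped_conv_params(256, 64, 1)
                + grouped_conv_params(64, 64, 3)
                + grouped_conv_params(64, 256, 1))

# ResNeXt-style bottleneck: 256 -> 128 -> 128 -> 256 with 32 groups in the 3x3.
resnext_block = (grouped_conv_params(256, 128, 1)
                 + grouped_conv_params(128, 128, 3, groups=32)
                 + grouped_conv_params(128, 256, 1))
```

Both blocks land near 70k weights, but the grouped version runs its 3×3 conv over twice as many channels.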
13 ResNet in detection? 📊 medium
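Answer: Common backbone in Faster R-CNN and RetinaNet, usually with FPN. Stage outputs feed the heads: C2–C5 for FPN-based Faster R-CNN, C3–C5 for RetinaNet, and C4 in the original non-FPN Faster R-CNN.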
14 In segmentation? 📊 medium
Answer: Encoder backbone (e.g. ResNet-50) + decoder (U-Net style, ASPP)—pretrained ImageNet weights standard.
15 vs VGG? 📊 medium
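Answer: ResNet reaches better accuracy with fewer FLOPs than VGG: ResNet-152 (about 11.3 GFLOPs) is still cheaper than VGG-16 (about 15.3 GFLOPs). The savings come from bottleneck blocks, early downsampling, and global average pooling in place of VGG's large fully connected layers.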
16 Training recipe? 📊 medium
Answer: SGD + momentum, step LR decay, weight decay, long epochs on ImageNet—augmentation similar to prior CNNs.
17 vs DenseNet? 🔥 hard
Answer: DenseNet concatenates all previous features—different memory/compute tradeoff; ResNet adds single skip per block.
18 Write the equation. ⚡ easy
Answer: Original ResNet: y = ReLU(F(x) + x). Pre-activation (v2) variant: y = F(x) + x with no activation after the add. The core idea is the additive skip.
19 Zero-init last BN? 🔥 hard
Answer: Some training refinements initialize the scale γ of the last BN in each residual branch to zero, so F(x) = 0 at the start and every block begins as an identity mapping. This stabilizes early training.
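A small sketch of the effect, assuming the block computes y = ReLU(γ · branch + x), where γ is the scale of the last BN in the residual branch. The activation values are made up for illustration.

```python
def batchnorm_affine(normalized, gamma, beta=0.0):
    """The learnable scale-and-shift applied after BN's normalization step."""
    return [gamma * v + beta for v in normalized]


def block_output(x, branch_normalized, gamma):
    # y = ReLU(gamma * branch + x), with gamma belonging to the last BN in F.
    branch = batchnorm_affine(branch_normalized, gamma)
    return [max(0.0, b + xi) for b, xi in zip(branch, x)]


x = [0.3, 1.2, 0.0]                 # non-negative, as after a previous ReLU
branch = [2.0, -1.5, 0.7]           # arbitrary normalized branch activations
at_init = block_output(x, branch, gamma=0.0)   # equals x: block is an identity
default = block_output(x, branch, gamma=1.0)   # branch already perturbs x
```

With γ = 0 the network initially behaves like a much shallower one, and the branches switch on gradually as γ is learned.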
20 Still used? ⚡ easy
Answer: Yes—strong baseline; ConvNeXt, ViT compete on benchmarks but ResNet remains default for robustness and tooling.

MobileNet: 20 Essential Q&A

21 What is MobileNet? 📊 medium
Answer: Efficient CNN family for mobile/edge using depthwise separable convolutions to cut FLOPs and parameters vs standard convs.
22 What is depthwise convolution? 📊 medium
Answer: Each input channel has its own spatial filter—no mixing across channels; drastically fewer params than full conv per output channel.
23 What is pointwise convolution? 📊 medium
Answer: 1×1 conv after depthwise—mixes channels at each spatial location, like per-pixel linear layer across depth.
In framework terms: depthwise = grouped conv with groups = in_channels; pointwise = 1×1 conv.
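The two steps can be sketched in plain Python on nested lists (channels × height × width). This is an illustrative, unoptimized sketch with valid padding and stride 1; the input and the filter values are made up, and real frameworks implement the same idea with a grouped conv followed by a 1×1 conv.

```python
def depthwise_conv(x, filters):
    """x: C x H x W nested lists; filters: C x k x k, one filter per channel.

    Valid padding, stride 1. Channel c of the output depends only on channel c
    of the input, which is what groups == in_channels means.
    """
    out = []
    for channel, kernel in zip(x, filters):
        k = len(kernel)
        h_out = len(channel) - k + 1
        w_out = len(channel[0]) - k + 1
        out.append([
            [sum(channel[i + di][j + dj] * kernel[di][dj]
                 for di in range(k) for dj in range(k))
             for j in range(w_out)]
            for i in range(h_out)
        ])
    return out


def pointwise_conv(x, weights):
    """x: C_in x H x W; weights: C_out x C_in. A 1x1 conv mixes channels per pixel."""
    h, w = len(x[0]), len(x[0][0])
    return [
        [[sum(wt * x[c][i][j] for c, wt in enumerate(row)) for j in range(w)]
         for i in range(h)]
        for row in weights
    ]


# Two 3x3 input channels, 2x2 filters, then mix into three output channels.
x = [[[1, 2, 3], [4, 5, 6], [7, 8, 9]],
     [[1, 0, 1], [0, 1, 0], [1, 0, 1]]]
dw_filters = [[[1, 0], [0, 1]],      # channel 0: sum of the main diagonal
              [[1, 1], [1, 1]]]      # channel 1: sum of the 2x2 window
pw_weights = [[1, 0], [0, 1], [1, 1]]

dw = depthwise_conv(x, dw_filters)   # still 2 channels, each filtered alone
out = pointwise_conv(dw, pw_weights) # 3 channels, mixed per pixel
```

The depthwise step handles spatial context and the pointwise step handles channel mixing; a standard conv does both at once, which is where its extra cost comes from.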
24 Complexity vs standard conv? 🔥 hard
Answer: A depthwise separable conv costs roughly 1/C_out + 1/k² of a standard k×k conv. With 3×3 kernels that is about 8–9× fewer multiply-adds, and the saving grows with kernel size and output channel count.
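The ratio can be checked by counting multiply-accumulates directly. The layer shape below is a hypothetical example, not taken from any specific model.

```python
def standard_conv_macs(k, c_in, c_out, h, w):
    """Multiply-accumulates for a k x k conv producing an h x w output map."""
    return k * k * c_in * c_out * h * w


def separable_conv_macs(k, c_in, c_out, h, w):
    depthwise = k * k * c_in * h * w     # one k x k filter per input channel
    pointwise = c_in * c_out * h * w     # 1x1 conv mixing channels
    return depthwise + pointwise


k, c_in, c_out, h, w = 3, 256, 256, 14, 14   # hypothetical layer shape
ratio = (separable_conv_macs(k, c_in, c_out, h, w)
         / standard_conv_macs(k, c_in, c_out, h, w))
predicted = 1 / c_out + 1 / k ** 2           # about 0.115, roughly 8.7x cheaper
```

For wide layers the 1/C_out term is negligible, so the saving is governed almost entirely by the kernel area k².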
25 Width multiplier α? ⚡ easy
Answer: Uniformly thin every layer's channels by α ∈ (0,1]. Compute and parameters shrink by roughly α², giving a simple accuracy–latency knob for deployment targets.
26 Resolution multiplier? 📊 medium
Answer: Train/infer on smaller input resolution ρ—quadratic FLOP savings with predictable accuracy drop.
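A sketch of how the width and resolution multipliers change the cost of a single depthwise separable layer. The function name and the layer shape are illustrative assumptions; channel and size rounding is simplified to `int`.

```python
def mobilenet_layer_macs(k, c_in, c_out, size, alpha=1.0, rho=1.0):
    """MACs of one depthwise separable layer under width alpha and resolution rho.

    size is the side of the square output feature map at alpha = rho = 1.
    """
    m = int(alpha * c_in)
    n = int(alpha * c_out)
    s = int(rho * size)
    return k * k * m * s * s + m * n * s * s


base = mobilenet_layer_macs(3, 512, 512, 14)
half_width = mobilenet_layer_macs(3, 512, 512, 14, alpha=0.5)
half_res = mobilenet_layer_macs(3, 512, 512, 14, rho=0.5)
```

Halving the resolution cuts the cost to exactly a quarter here, and halving the width cuts it to a little over a quarter, because the dominant pointwise term scales with α² while the depthwise term scales only with α.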
27 MobileNetV2 inverted residual? 🔥 hard
Answer: Expand the low-dim bottleneck → depthwise → project back. The shortcut connects the thin bottlenecks (memory efficient), the opposite of a classical residual bottleneck (wide → narrow → wide), where the shortcut joins the wide ends.
28 Why expansion t? 📊 medium
Answer: Depthwise needs rich features—expand channels by factor t before DW conv, then linear 1×1 compress to avoid ReLU killing info in low-dim subspace.
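The channel flow of an inverted residual block can be sketched as a small planning function. The function name, the returned dictionary layout, and the example widths are illustrative assumptions; biases and BN parameters are ignored.

```python
def inverted_residual_plan(c_in, c_out, stride, t):
    """Describe one MobileNetV2-style block: channel flow, weights, shortcut use."""
    hidden = c_in * t
    params = (c_in * hidden        # 1x1 expansion (followed by ReLU6)
              + 3 * 3 * hidden     # 3x3 depthwise on the expanded tensor
              + hidden * c_out)    # 1x1 linear projection (no activation)
    return {
        "channels": [c_in, hidden, hidden, c_out],
        "params": params,
        # The shortcut joins the two thin ends, so shapes must match exactly.
        "use_shortcut": stride == 1 and c_in == c_out,
    }


same = inverted_residual_plan(c_in=24, c_out=24, stride=1, t=6)
down = inverted_residual_plan(c_in=24, c_out=32, stride=2, t=6)
```

Note that the depthwise conv, despite running on the widest tensor, contributes the fewest weights; most of the cost sits in the two 1×1 convs.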
29 ReLU6? 📊 medium
Answer: Clip ReLU at 6—originally claimed helpful for quantized deployment; still seen in some mobile architectures.
30 MobileNet + SSD? 📊 medium
Answer: Lightweight object detectors attach SSD heads to MobileNet stages—real-time on phones with acceptable mAP on constrained devices.
31 vs ShuffleNet? 🔥 hard
Answer: ShuffleNet uses channel shuffle after grouped convs—different structural trick; both target efficient inference.
32 vs EfficientNet? 📊 medium
Answer: EfficientNet scales depth/width/resolution together (compound scaling)—often better Pareto frontier; MobileNet simpler family widely supported in runtimes.
33 MobileNetV3? 🔥 hard
Answer: Uses NAS plus NetAdapt for layer choices, h-swish activations in some layers, and squeeze-and-excitation modules. Improves accuracy per FLOP.
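For reference, h-swish is built from ReLU6, so it needs only a clamp, an add, and a multiply. A scalar sketch:

```python
def relu6(x):
    return min(max(x, 0.0), 6.0)


def hard_sigmoid(x):
    # Piecewise-linear stand-in for sigmoid: ReLU6(x + 3) / 6.
    return relu6(x + 3.0) / 6.0


def hard_swish(x):
    # h-swish(x) = x * ReLU6(x + 3) / 6.
    return x * hard_sigmoid(x)


samples = [-4.0, -3.0, 0.0, 3.0, 5.0]
values = [hard_swish(v) for v in samples]
```

The function is zero for x ≤ -3, equals x for x ≥ 3, and blends smoothly in between, which avoids the exponential in the original swish.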
34 Squeeze-and-excitation? 📊 medium
Answer: Global pool → small FC → channel gates—recalibrates channel importance; appears in MobileNetV3 and many efficient nets.
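The squeeze, excite, and scale steps can be sketched on nested lists. This sketch assumes no biases and a plain sigmoid gate; the weights and input are made-up values chosen to make the effect visible.

```python
import math


def squeeze_excite(x, w_reduce, w_expand):
    """x: C x H x W nested lists; w_reduce: R x C; w_expand: C x R (no biases)."""
    # Squeeze: global average pool gives one number per channel.
    pooled = [sum(map(sum, ch)) / (len(ch) * len(ch[0])) for ch in x]
    # Excite: FC -> ReLU -> FC -> sigmoid produces one gate per channel.
    hidden = [max(0.0, sum(w * p for w, p in zip(row, pooled))) for row in w_reduce]
    logits = [sum(w * h for w, h in zip(row, hidden)) for row in w_expand]
    gates = [1.0 / (1.0 + math.exp(-z)) for z in logits]
    # Scale: multiply every pixel of channel c by its gate.
    scaled = [[[g * v for v in row] for row in ch] for g, ch in zip(gates, x)]
    return scaled, gates


x = [[[1.0, 1.0], [1.0, 1.0]],      # channel 0, mean 1
     [[4.0, 4.0], [4.0, 4.0]]]      # channel 1, mean 4
w_reduce = [[1.0, 1.0]]             # hypothetical weights, reduction to 1 unit
w_expand = [[0.0], [2.0]]
scaled, gates = squeeze_excite(x, w_reduce, w_expand)
```

Here channel 0 is halved (gate 0.5) while channel 1 passes almost unchanged (gate close to 1), which is the recalibration the answer describes.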
35 Quantization? ⚡ easy
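Answer: Depthwise-heavy nets are often deployed as INT8: cheaper integer MACs and roughly 4× smaller weights and activations than FP32. Validate accuracy after PTQ/QAT, since depthwise layers can be sensitive to quantization.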
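A minimal sketch of affine INT8 quantization for a single value, assuming a known activation range. The helper names are illustrative and do not correspond to any particular runtime's API; a zero-width range is not handled.

```python
def quant_params(x_min, x_max, q_min=-128, q_max=127):
    """Affine INT8 parameters so that [x_min, x_max] maps onto [q_min, q_max]."""
    x_min, x_max = min(x_min, 0.0), max(x_max, 0.0)   # keep 0.0 exactly representable
    scale = (x_max - x_min) / (q_max - q_min)
    zero_point = round(q_min - x_min / scale)
    return scale, zero_point


def quantize(x, scale, zero_point, q_min=-128, q_max=127):
    # Round to the nearest integer step, then clamp to the INT8 range.
    return max(q_min, min(q_max, round(x / scale) + zero_point))


def dequantize(q, scale, zero_point):
    return scale * (q - zero_point)


scale, zp = quant_params(0.0, 6.0)           # e.g. the output range of ReLU6
q = quantize(2.5, scale, zp)
error = abs(dequantize(q, scale, zp) - 2.5)  # bounded by about scale / 2
```

Values outside the calibrated range are clamped, which is one reason a bounded activation such as ReLU6 pairs well with fixed-point deployment.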
36 Strided depthwise? 📊 medium
Answer: Depthwise conv with stride 2 downsamples spatially—paired with pointwise for channel mix; replaces pooling in many blocks.
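The downsampling follows the usual conv output-size formula, floor((size + 2·padding - kernel) / stride) + 1. A quick sketch with illustrative sizes:

```python
def conv_output_size(size, kernel, stride=1, padding=0):
    """Spatial output size of a conv along one dimension (floor convention)."""
    return (size + 2 * padding - kernel) // stride + 1


same_size = conv_output_size(56, kernel=3, stride=1, padding=1)   # 56
halved = conv_output_size(56, kernel=3, stride=2, padding=1)      # 28
odd_input = conv_output_size(7, kernel=3, stride=2, padding=1)    # 4
```

A 3×3 depthwise conv with stride 2 and padding 1 halves even-sized maps exactly and rounds up for odd sizes.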
37 Pointwise = ? ⚡ easy
Answer: Standard conv with 1×1 kernel—channel mixing only, no spatial context.
38 Transfer to tasks? 📊 medium
Answer: ImageNet-pretrained MobileNet backbones fine-tune for classification, detection, segmentation with small heads—standard on edge.
39 Accuracy ceiling? 📊 medium
Answer: Extreme width/resolution cuts hurt top-1 on hard datasets—need larger efficient families or distillation from big teacher.
40 Deployment? ⚡ easy
Answer: Use vendor runtimes (Core ML, NNAPI, TensorRT) with fused DW+PW kernels; profile latency, not just FLOPs.