Computer Vision Interview
40 Q&A
Chapter 15
Advanced CNN Architectures — Interview Q&A
ResNet skip connections, MobileNet depthwise separable convolutions, and EfficientNet scaling.
ResNet: 20 Essential Q&A
1
What is ResNet?
⚡ easy
Answer: Deep CNN where layers learn residual functions F(x) with skip connections adding input x—enables training very deep networks (50–1000+ layers).
2
What is the degradation problem?
🔥 hard
Answer: As depth increases, training error can get worse even without overfitting—not vanishing gradients alone; harder optimization for plain deep stacks.
3
Basic residual block?
📊 medium
Answer: y = F(x, {W_i}) + x where F is usually two 3×3 convs + BN + ReLU—output same spatial size as x for identity add.
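The additive form can be sketched in a few lines of plain Python. This is only a toy illustration on a flat list of activations: `residual_block` and `small_perturbation` are hypothetical names, and the branch here is a stand-in for the conv + BN + ReLU stack, not a real convolution.

```python
def residual_block(x, residual_fn):
    """Return y = F(x) + x for a flat list of activations x."""
    fx = residual_fn(x)
    if len(fx) != len(x):
        # The identity add only works when F(x) has the same shape as x.
        raise ValueError("F(x) must match the shape of x for an identity add")
    return [f + xi for f, xi in zip(fx, x)]


def small_perturbation(x):
    """Hypothetical residual branch F: a small elementwise correction."""
    return [0.1 * v for v in x]


x = [1.0, -2.0, 3.0]
y = residual_block(x, small_perturbation)  # each output is x_i + 0.1 * x_i
```

The shape check mirrors the constraint in the answer: without matching shapes, the shortcut needs a projection instead of a plain add.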
4
Identity shortcut?
⚡ easy
Answer: Skip connection adds x directly when dimensions match—if channels/stride differ, use 1×1 conv projection on shortcut to match shape.
5
Why learn residual F?
🔥 hard
Answer: If optimal mapping is close to identity, easier to learn small perturbation F than full mapping; empirically eases optimization of deep nets.
6
When projection shortcut?
📊 medium
Answer: When block changes spatial size (stride 2) or channel count—1×1 conv on x with stride aligns dimensions for addition.
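A minimal sketch of that decision rule, assuming a (C, H, W) shape convention and a 1×1 projection with no padding. The function names are illustrative, not from any library.

```python
def shortcut_kind(in_channels, out_channels, stride):
    """Pick the shortcut type for a residual block."""
    if stride == 1 and in_channels == out_channels:
        return "identity"
    # Shape changes: use a 1x1 conv with the same stride on the shortcut.
    return "projection"


def projection_output_shape(in_shape, out_channels, stride):
    """Output shape of a 1x1 conv (no padding) applied to a (C, H, W) input."""
    _, h, w = in_shape
    return (out_channels, (h - 1) // stride + 1, (w - 1) // stride + 1)


kind = shortcut_kind(64, 128, stride=2)
shape = projection_output_shape((64, 56, 56), out_channels=128, stride=2)
```

With a 64-channel 56×56 input, 128 output channels, and stride 2, the projection produces a 128×28×28 tensor that can be added to the block output.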
7
Bottleneck block?
📊 medium
Answer: 1×1 reduce channels → 3×3 spatial conv → 1×1 expand—cuts FLOPs for deep models (ResNet-50+).
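To see where the saving comes from, the sketch below counts conv weights for a 256-channel block built two ways. It ignores biases and BN parameters, and the helper names are made up for this example.

```python
def conv_params(k, c_in, c_out):
    """Weights in a k x k conv, ignoring bias and BN."""
    return k * k * c_in * c_out


def basic_block_params(channels):
    """Two 3x3 convs at full width."""
    return 2 * conv_params(3, channels, channels)


def bottleneck_params(channels, reduced):
    """1x1 reduce, 3x3 at reduced width, 1x1 expand."""
    return (conv_params(1, channels, reduced)
            + conv_params(3, reduced, reduced)
            + conv_params(1, reduced, channels))


basic = basic_block_params(256)          # 1,179,648 weights
bottleneck = bottleneck_params(256, 64)  # 69,632 weights
```

At 256 channels the bottleneck uses roughly 17 times fewer weights than two full-width 3×3 convs, which is why it is preferred for the deeper variants.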
8
Common depths?
⚡ easy
Answer: ResNet-18/34 use basic blocks; 50/101/152 use bottleneck—standard backbones for detection/segmentation.
9
Relation to vanishing gradients?
📊 medium
Answer: Shortcuts provide gradient highways—identity path carries gradients deeper; complements BN and good init.
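A scalar toy model makes the gradient-highway idea concrete. Assume each layer is linear with a small local slope s: a plain layer contributes a factor s to the end-to-end gradient, while a residual layer y = x + s·x contributes 1 + s. This is a simplification for intuition only, not a model of a real network.

```python
def end_to_end_gradient(local_slopes, residual):
    """Product of per-layer derivatives along a chain of scalar layers."""
    grad = 1.0
    for s in local_slopes:
        # Plain layer: dy/dx = s. Residual layer: dy/dx = 1 + s.
        grad *= (1.0 + s) if residual else s
    return grad


slopes = [0.1] * 50
plain_grad = end_to_end_gradient(slopes, residual=False)
residual_grad = end_to_end_gradient(slopes, residual=True)
```

In the plain stack the gradient collapses toward zero after 50 layers, while the identity term keeps the residual gradient from vanishing. The same toy also shows the product can grow, which is one reason BN and careful initialization still matter.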
10
BN ordering?
📊 medium
Answer: Original (v1): conv → BN → ReLU inside F, with the final ReLU applied after the addition; the pre-activation variant (ResNet v2) reorders to BN → ReLU → conv. Interviewers usually accept the conv-BN-ReLU block.
11
Initialization?
⚡ easy
Answer: He init for conv layers suited to ReLU—standard with ResNet training recipes.
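He initialization draws weights with standard deviation sqrt(2 / fan_in). The sketch below uses only the standard library and the fan-in convention k·k·C_in; `he_std` and `he_normal_weights` are illustrative names, and the seed is there only to make the example repeatable.

```python
import math
import random


def he_std(kernel_size, in_channels):
    """Standard deviation for He-normal init with fan_in = k * k * C_in."""
    fan_in = kernel_size * kernel_size * in_channels
    return math.sqrt(2.0 / fan_in)


def he_normal_weights(kernel_size, in_channels, out_channels, seed=0):
    """Flat list of conv weights drawn from N(0, he_std^2)."""
    rng = random.Random(seed)
    std = he_std(kernel_size, in_channels)
    count = kernel_size * kernel_size * in_channels * out_channels
    return [rng.gauss(0.0, std) for _ in range(count)]


std_3x3_64 = he_std(3, 64)
weights = he_normal_weights(3, 64, 8)
```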
12
ResNeXt?
🔥 hard
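Answer: Replaces the bottleneck's 3×3 conv with a grouped conv split into parallel paths; the number of groups is the cardinality, a third dimension alongside depth and width. Raising cardinality improves accuracy at similar FLOPs.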
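A grouped conv with g groups needs k·k·(C_in / g)·C_out weights, which is what lets ResNeXt widen the middle of the block at roughly constant cost. The comparison below assumes the commonly cited 32×4d configuration (256 → 128 → 128 → 256 with 32 groups) against a 256 → 64 → 64 → 256 ResNet bottleneck; helper names are hypothetical, and biases and BN are ignored.

```python
def grouped_conv_params(k, c_in, c_out, groups=1):
    """Weights in a k x k conv split into `groups` independent groups."""
    if c_in % groups or c_out % groups:
        raise ValueError("channels must be divisible by groups")
    return k * k * (c_in // groups) * c_out


def block_params(channels, mid, groups):
    """1x1 reduce, 3x3 (possibly grouped), 1x1 expand."""
    return (grouped_conv_params(1, channels, mid)
            + grouped_conv_params(3, mid, mid, groups)
            + grouped_conv_params(1, mid, channels))


resnet_block = block_params(256, mid=64, groups=1)     # 69,632 weights
resnext_block = block_params(256, mid=128, groups=32)  # 70,144 weights
```

The two blocks land within about 1% of each other in weight count even though the ResNeXt middle layer is twice as wide.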
13
ResNet in detection?
📊 medium
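Answer: Common backbone in Faster R-CNN and RetinaNet, usually with FPN: stage outputs C2 to C5 (C3 to C5 in RetinaNet) feed the pyramid, while non-FPN Faster R-CNN variants run the heads on C4 or C5.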
14
In segmentation?
📊 medium
Answer: Encoder backbone (e.g. ResNet-50) + decoder (U-Net style, ASPP)—pretrained ImageNet weights standard.
15
vs VGG?
📊 medium
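Answer: ResNet reaches higher ImageNet accuracy at lower cost than VGG: the ResNet paper reports 11.3 billion FLOPs for ResNet-152 versus 15.3 and 19.6 billion for VGG-16 and VGG-19. Bottleneck blocks keep compute down, and global average pooling replaces VGG's large fully connected layers, which cuts parameters.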
16
Training recipe?
📊 medium
Answer: SGD with momentum, step LR decay (divide by 10 at fixed milestones or on plateau), weight decay, and many epochs on ImageNet; augmentation is similar to prior CNNs.
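Step decay is easy to write down. The values used here (base rate 0.1, a drop by 10× every 30 epochs) are assumptions chosen for illustration; actual recipes vary.

```python
def step_lr(base_lr, epoch, step_size=30, gamma=0.1):
    """Learning rate after `epoch` epochs under step decay."""
    return base_lr * gamma ** (epoch // step_size)


# Rates at a few assumed checkpoints of a 90-epoch run.
schedule = {epoch: step_lr(0.1, epoch) for epoch in (0, 29, 30, 60, 89)}
```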
17
vs DenseNet?
🔥 hard
Answer: DenseNet concatenates all previous features—different memory/compute tradeoff; ResNet adds single skip per block.
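The memory tradeoff shows up in how channel counts evolve. In this sketch, addition keeps the width constant, while concatenation grows it by a fixed growth rate per layer; the 64-channel start and growth rate of 32 are assumed example values.

```python
def resnet_channels(c0, num_blocks):
    """Additive skips keep the channel count fixed within a stage."""
    return [c0] * (num_blocks + 1)


def densenet_channels(c0, growth_rate, num_layers):
    """Each dense layer concatenates `growth_rate` new channels."""
    return [c0 + i * growth_rate for i in range(num_layers + 1)]


res = resnet_channels(64, 6)          # stays at 64 throughout
dense = densenet_channels(64, 32, 6)  # grows from 64 to 256
```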
18
Write the equation.
⚡ easy
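Answer: y = ReLU(F(x) + x) in the original ResNet, where the activation follows the addition; pre-activation ResNet v2 uses y = x + F(x) with no activation after the add. The core idea is the additive skip.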
19
Zero-init last BN?
🔥 hard
Answer: Some training refinements initialize last BN in residual branch to zero so block starts near identity—stabilizes early training.
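The effect of zeroing the last BN scale can be checked on a toy vector. This sketch assumes a simplified batch norm over a flat list and omits the ReLU after the add; `batch_norm` and `residual_output` are illustrative helpers, not a framework API.

```python
import math


def batch_norm(values, gamma, beta, eps=1e-5):
    """Normalize a flat list, then scale by gamma and shift by beta."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return [gamma * (v - mean) / math.sqrt(var + eps) + beta for v in values]


def residual_output(x, branch_pre_bn, gamma):
    """y = x + BN(branch), where BN is the last layer of the residual branch."""
    fx = batch_norm(branch_pre_bn, gamma=gamma, beta=0.0)
    return [xi + fi for xi, fi in zip(x, fx)]


x = [0.5, -1.0, 2.0]
branch = [3.0, 1.0, -4.0]  # arbitrary pre-BN activations of the branch
y_zero_init = residual_output(x, branch, gamma=0.0)  # block acts as identity
y_default = residual_output(x, branch, gamma=1.0)    # block perturbs x
```

With gamma set to 0 the branch contributes nothing at the start of training, so every block begins as an identity mapping regardless of what the convs produce.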
20
Still used?
⚡ easy
Answer: Yes. It is a strong baseline; ConvNeXt and ViT compete on benchmarks, but ResNet remains a common default thanks to predictable training behavior and broad tooling support.
MobileNet: 20 Essential Q&A
21
What is MobileNet?
📊 medium
Answer: Efficient CNN family for mobile/edge using depthwise separable convolutions to cut FLOPs and parameters vs standard convs.
22
What is depthwise convolution?
📊 medium
Answer: Each input channel gets its own k×k spatial filter with no mixing across channels, so the layer needs k²·C weights instead of the k²·C_in·C_out of a standard conv.
23
What is pointwise convolution?
📊 medium
Answer: 1×1 conv after depthwise—mixes channels at each spatial location, like per-pixel linear layer across depth.
# Depthwise: groups=in_channels; Pointwise: 1x1 conv
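The two stages can be written out by hand on nested lists to show exactly what each one mixes. This is a dependency-free sketch with stride 1 and no padding, not a framework implementation; the function names and the tiny input are made up for the example.

```python
def depthwise_conv(x, kernels):
    """x: C x H x W nested lists; kernels: one k x k filter per channel."""
    out = []
    for channel, kernel in zip(x, kernels):
        k = len(kernel)
        h_out = len(channel) - k + 1
        w_out = len(channel[0]) - k + 1
        # Each channel is filtered on its own; no cross-channel mixing.
        out.append([
            [sum(channel[i + di][j + dj] * kernel[di][dj]
                 for di in range(k) for dj in range(k))
             for j in range(w_out)]
            for i in range(h_out)
        ])
    return out


def pointwise_conv(x, weights):
    """weights: C_out x C_in; mixes channels independently at each pixel."""
    h, w = len(x[0]), len(x[0][0])
    return [
        [[sum(wt * x[c][i][j] for c, wt in enumerate(row)) for j in range(w)]
         for i in range(h)]
        for row in weights
    ]


x = [[[1, 2, 3], [4, 5, 6], [7, 8, 9]],
     [[1, 1, 1], [1, 1, 1], [1, 1, 1]]]
dw_kernels = [[[1, 0], [0, 1]],  # channel 0: sum along the window diagonal
              [[1, 1], [1, 1]]]  # channel 1: sum over the whole window
pw_weights = [[1, -1]]           # one output channel: channel 0 minus channel 1

dw = depthwise_conv(x, dw_kernels)
y = pointwise_conv(dw, pw_weights)
```

The depthwise step changes the spatial size but keeps two separate channels; only the pointwise step combines them into a single output map.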
24
Complexity vs standard conv?
🔥 hard
Answer: The cost ratio of a depthwise separable conv to a standard k×k conv is roughly 1/C_out + 1/k², so a 3×3 layer costs about 8 to 9 times less; the saving grows with kernel size and output channel count.
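The ratio follows directly from counting multiply-accumulates. The sketch below checks it exactly with rational arithmetic for one assumed layer size (3×3 kernel, 256 input and output channels, 14×14 feature map); the function names are illustrative.

```python
from fractions import Fraction


def standard_conv_macs(k, c_in, c_out, h, w):
    """MACs for a standard k x k conv producing an h x w output."""
    return k * k * c_in * c_out * h * w


def separable_conv_macs(k, c_in, c_out, h, w):
    """Depthwise k x k pass plus a pointwise 1x1 pass."""
    depthwise = k * k * c_in * h * w
    pointwise = c_in * c_out * h * w
    return depthwise + pointwise


k, c_in, c_out, h, w = 3, 256, 256, 14, 14
ratio = Fraction(separable_conv_macs(k, c_in, c_out, h, w),
                 standard_conv_macs(k, c_in, c_out, h, w))
predicted = Fraction(1, c_out) + Fraction(1, k * k)
```

For this layer the measured ratio and the formula agree exactly, and the standard conv costs a little under 9 times as much as the separable one.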
25
Width multiplier α?
⚡ easy
Answer: Uniformly thins every layer's channels by α ∈ (0,1]; compute and parameters shrink roughly as α², giving a simple accuracy-latency knob for deployment targets.
26
Resolution multiplier?
📊 medium
Answer: Scale the input resolution (and hence every feature map) by ρ ∈ (0,1]; FLOPs fall roughly as ρ² while parameters stay the same, with a predictable accuracy drop.
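Both multipliers can be folded into one cost function for a depthwise separable layer. The layer size below (3×3 kernel, 512 channels, 14×14 feature map) is an assumed example, and truncating scaled sizes with `int` is a simplification.

```python
def mobilenet_layer_macs(k, c_in, c_out, feat, alpha=1.0, rho=1.0):
    """MACs for one depthwise separable layer under width/resolution scaling."""
    m = int(alpha * c_in)   # thinned input channels
    n = int(alpha * c_out)  # thinned output channels
    d = int(rho * feat)     # scaled feature-map side
    return k * k * m * d * d + m * n * d * d


base = mobilenet_layer_macs(3, 512, 512, 14)
half_width = mobilenet_layer_macs(3, 512, 512, 14, alpha=0.5)
half_res = mobilenet_layer_macs(3, 512, 512, 14, rho=0.5)
```

Halving the width cuts the cost to just over a quarter (the pointwise term scales as α², the smaller depthwise term only as α), and halving the resolution cuts it to exactly a quarter here.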
27
MobileNetV2 inverted residual?
🔥 hard
Answer: Expand a low-dim bottleneck → depthwise → project back; the shortcut connects the thin bottlenecks (memory efficient), the opposite of the classical bottleneck block, whose shortcut connects wide layers around a narrow middle (wide → narrow → wide).
28
Why expansion t?
📊 medium
Answer: Depthwise needs rich features—expand channels by factor t before DW conv, then linear 1×1 compress to avoid ReLU killing info in low-dim subspace.
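The channel flow through an inverted residual block can be laid out as a small plan. This sketch assumes the usual rule that the shortcut is used only when stride is 1 and input and output channels match; the dictionary keys and function names are invented for illustration, and biases and BN are ignored in the weight count.

```python
def inverted_residual_plan(c_in, c_out, stride, t):
    """Describe the (in, out) channels of each stage for expansion factor t."""
    hidden = c_in * t
    return {
        "expand_1x1": (c_in, hidden),       # followed by ReLU6
        "depthwise_3x3": (hidden, hidden),  # stride applied here, then ReLU6
        "project_1x1": (hidden, c_out),     # linear: no activation
        "use_shortcut": stride == 1 and c_in == c_out,
    }


def inverted_residual_params(c_in, c_out, t, k=3):
    """Weights for expand + depthwise + project."""
    hidden = c_in * t
    return c_in * hidden + k * k * hidden + hidden * c_out


plan = inverted_residual_plan(24, 24, stride=1, t=6)
params = inverted_residual_params(24, 24, t=6)  # 8,208 weights
```

Even with a 6× expansion, the depthwise stage contributes only 1,296 of the 8,208 weights, which is why widening before the depthwise conv is affordable.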
29
ReLU6?
📊 medium
Answer: ReLU clipped at 6, i.e. min(max(x, 0), 6); the bounded range was motivated by robustness in low-precision (quantized) arithmetic and is still seen in some mobile architectures.
30
MobileNet + SSD?
📊 medium
Answer: Lightweight detectors attach SSD heads to MobileNet stages, giving real-time inference on phones with acceptable mAP for constrained devices.
31
vs ShuffleNet?
🔥 hard
Answer: ShuffleNet uses channel shuffle after grouped convs—different structural trick; both target efficient inference.
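Channel shuffle is a reshape, transpose, and flatten over the channel axis. The sketch below applies it to a list of channel labels rather than tensors; `channel_shuffle` is an illustrative name.

```python
def channel_shuffle(channels, groups):
    """Interleave channels so each group feeds every group in the next layer."""
    n = len(channels)
    if n % groups:
        raise ValueError("channel count must be divisible by groups")
    per_group = n // groups
    # View as (groups, per_group), transpose to (per_group, groups), flatten.
    return [channels[g * per_group + i]
            for i in range(per_group)
            for g in range(groups)]


shuffled = channel_shuffle([0, 1, 2, 3, 4, 5], groups=2)  # [0, 3, 1, 4, 2, 5]
```

After the shuffle, each consecutive pair contains one channel from each original group, so the next grouped conv sees information from both.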
32
vs EfficientNet?
📊 medium
Answer: EfficientNet scales depth/width/resolution together (compound scaling)—often better Pareto frontier; MobileNet simpler family widely supported in runtimes.
33
MobileNetV3?
🔥 hard
Answer: Uses NAS plus NetAdapt for layer choices, h-swish activations in some layers, and squeeze-and-excitation modules in some blocks; improved accuracy per FLOP.
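The h-swish activation is defined as x · ReLU6(x + 3) / 6, a piecewise-linear stand-in for swish that avoids computing a sigmoid. A scalar sketch:

```python
def relu6(x):
    """ReLU clipped at 6."""
    return min(max(x, 0.0), 6.0)


def h_swish(x):
    """Hard swish: x * ReLU6(x + 3) / 6."""
    return x * relu6(x + 3.0) / 6.0


samples = [h_swish(v) for v in (-4.0, 0.0, 1.0, 4.0)]
```

Inputs at or below -3 map to zero and inputs at or above 3 pass through unchanged, with a smooth-looking transition in between.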
34
Squeeze-and-excitation?
📊 medium
Answer: Global pool → small FC → channel gates—recalibrates channel importance; appears in MobileNetV3 and many efficient nets.
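The three steps map onto a few list comprehensions. This sketch assumes a C×H×W nested-list input, omits biases, and uses a plain sigmoid for the gate (MobileNetV3 swaps in a hard sigmoid); `squeeze_excite` and the weight layout are illustrative choices.

```python
import math


def squeeze_excite(x, w_reduce, w_expand):
    """x: C x H x W; w_reduce: R x C; w_expand: C x R. Returns gated x."""
    # Squeeze: global average pool per channel.
    squeezed = [sum(sum(row) for row in ch) / (len(ch) * len(ch[0]))
                for ch in x]
    # Excite: small FC + ReLU, then FC + sigmoid to get one gate per channel.
    hidden = [max(0.0, sum(w * s for w, s in zip(row, squeezed)))
              for row in w_reduce]
    gates = [1.0 / (1.0 + math.exp(-sum(w * h for w, h in zip(row, hidden))))
             for row in w_expand]
    # Scale: multiply every value in a channel by that channel's gate.
    return [[[g * v for v in row] for row in ch] for ch, g in zip(x, gates)]


x = [[[2.0, 4.0]], [[6.0, 8.0]]]
w_reduce = [[0.0, 0.0]]    # reduce 2 channels to 1 hidden unit
w_expand = [[0.0], [0.0]]  # expand back to 2 gates
y = squeeze_excite(x, w_reduce, w_expand)
```

With all-zero weights every gate is sigmoid(0) = 0.5, so the output is the input scaled by one half; trained weights would give each channel its own gate.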
35
Quantization?
⚡ easy
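Answer: Depthwise-heavy nets are often deployed as INT8, which shrinks model size and memory traffic and enables faster integer arithmetic (the MAC count itself is unchanged); depthwise layers can be sensitive to quantization, so validate accuracy after PTQ/QAT.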
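A round trip through symmetric per-tensor INT8 quantization shows where the error comes from. This is one simple scheme among several (real toolchains often use per-channel scales and zero points), and the function names are hypothetical.

```python
def quantize_int8(values):
    """Symmetric per-tensor quantization to integers in [-127, 127]."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(v / scale))) for v in values]
    return q, scale


def dequantize(q, scale):
    """Map integers back to approximate float values."""
    return [v * scale for v in q]


weights = [-1.0, 0.0, 0.3, 1.0]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
```

The round-trip error per value is bounded by about half the scale, so a tensor with a few large outliers (common in depthwise layers) gets a coarse scale and loses precision on its small weights.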
36
Strided depthwise?
📊 medium
Answer: Depthwise conv with stride 2 downsamples spatially—paired with pointwise for channel mix; replaces pooling in many blocks.
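The downsampling follows the standard conv output-size formula, floor((size + 2·padding − kernel) / stride) + 1. A short check, assuming a 3×3 kernel with padding 1 on a 112×112 feature map:

```python
def conv_output_size(size, kernel, stride, padding):
    """Spatial output size of a conv along one dimension."""
    return (size + 2 * padding - kernel) // stride + 1


same_size = conv_output_size(112, kernel=3, stride=1, padding=1)  # 112
halved = conv_output_size(112, kernel=3, stride=2, padding=1)     # 56
```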
37
Pointwise = ?
⚡ easy
Answer: Standard conv with 1×1 kernel—channel mixing only, no spatial context.
38
Transfer to tasks?
📊 medium
Answer: ImageNet-pretrained MobileNet backbones fine-tune for classification, detection, segmentation with small heads—standard on edge.
39
Accuracy ceiling?
📊 medium
Answer: Extreme width/resolution cuts hurt top-1 on hard datasets—need larger efficient families or distillation from big teacher.
40
Deployment?
⚡ easy
Answer: Use vendor runtimes (Core ML, NNAPI, TensorRT) with fused DW+PW kernels; profile on-device latency, not just FLOPs.