Computer Vision Interview 20 essential Q&A Updated 2026

R-CNN Family: 20 Essential Q&A

From selective search to RPN and feature pyramids—the two-stage detector story.

~12 min read · 20 questions · Advanced
Topics: RPN, RoI, FPN, Cascade
1 Original R-CNN steps? 📊 medium
Answer: Propose ~2k regions per image (selective search) → warp each to a fixed size → extract CNN features per region → per-class SVMs + bbox regressor. No convolution is shared across regions, so it is very slow.
2 Main bottleneck? ⚡ easy
Answer: Running CNN thousands of times per image on warped crops; also disk caching of features in early work.
3 What did Fast R-CNN fix? 📊 medium
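Answer: Run the CNN once on the full image; project RoIs onto the feature map → RoI pool to a fixed size → classification and bbox heads. Big speedup, and the network trains end to end with a multi-task loss (proposals still come from selective search).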
4 How RoI pooling works? 📊 medium
Answer: Divide each RoI on feature map into H×W bins; max-pool each bin to fixed output—quantization loses subpixel alignment.
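The binning described above can be sketched in plain Python. This is a minimal single-channel illustration, not any library's implementation; the function name `roi_max_pool` and the rounding scheme are assumptions, and the RoI is assumed to lie inside the feature map.

```python
import math

def roi_max_pool(feature, roi, out_h, out_w):
    """Max-pool one RoI of a 2D feature map into an out_h x out_w grid.

    feature: list of rows (single channel, for brevity).
    roi: (x1, y1, x2, y2) in feature-map coordinates.
    """
    # Quantization 1: snap the RoI corners to integer cells.
    x1, y1, x2, y2 = (int(round(v)) for v in roi)
    roi_h = max(y2 - y1 + 1, 1)
    roi_w = max(x2 - x1 + 1, 1)
    pooled = []
    for i in range(out_h):
        # Quantization 2: snap bin edges to integer cells.
        ys = y1 + math.floor(i * roi_h / out_h)
        ye = y1 + math.ceil((i + 1) * roi_h / out_h)
        row = []
        for j in range(out_w):
            xs = x1 + math.floor(j * roi_w / out_w)
            xe = x1 + math.ceil((j + 1) * roi_w / out_w)
            row.append(max(feature[y][x]
                           for y in range(ys, ye)
                           for x in range(xs, xe)))
        pooled.append(row)
    return pooled

# 4x4 ramp feature map: value = 4 * row + col.
feature = [[r * 4 + c for c in range(4)] for r in range(4)]
pooled = roi_max_pool(feature, (0.2, 0.4, 3.1, 2.9), 2, 2)
```

Both rounding steps move the pooled region away from the true RoI; that shift is the alignment loss the answer refers to.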
5 What is Faster R-CNN? 🔥 hard
Answer: Replaces selective search with RPN that shares full-image conv features—learned proposals, joint training with detector.
6 What does the RPN output? 🔥 hard
Answer: For each of the k anchors at every feature-map location: an objectness score and 4 box deltas that refine the anchor. The top-scoring refined boxes (after NMS) are passed to the RoI head as proposals.
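To make "box deltas" concrete, here is a small sketch of the usual R-CNN-style decoding, where the deltas shift the anchor centre in units of anchor size and scale width and height through an exponential. The function name `decode_box` is hypothetical, and real implementations typically also clamp the deltas and clip boxes to the image.

```python
import math

def decode_box(anchor, deltas):
    """Apply (dx, dy, dw, dh) to an (x1, y1, x2, y2) anchor."""
    ax1, ay1, ax2, ay2 = anchor
    aw, ah = ax2 - ax1, ay2 - ay1
    acx, acy = ax1 + 0.5 * aw, ay1 + 0.5 * ah
    dx, dy, dw, dh = deltas
    # Centre shifts are relative to anchor size; sizes scale via exp().
    cx, cy = acx + dx * aw, acy + dy * ah
    w, h = aw * math.exp(dw), ah * math.exp(dh)
    return (cx - 0.5 * w, cy - 0.5 * h, cx + 0.5 * w, cy + 0.5 * h)

anchor = (0.0, 0.0, 10.0, 10.0)
unchanged = decode_box(anchor, (0.0, 0.0, 0.0, 0.0))
shifted = decode_box(anchor, (0.1, 0.0, 0.0, 0.0))     # moves right by 10% of width
doubled = decode_box(anchor, (0.0, 0.0, math.log(2.0), 0.0))  # twice as wide
```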
7 Anchor scales/aspect ratios? 📊 medium
Answer: Multiple templates per location cover different object shapes; k anchors per cell → many candidate boxes before filtering by score + NMS.
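A rough sketch of anchor generation follows, assuming anchors are centred on each feature-map cell and that each template keeps area scale² while the ratio (height / width) varies. The function name `make_anchors`, the stride of 16, and the 3 scales × 3 ratios setup are illustrative choices, not fixed by the answer above.

```python
import math

def make_anchors(feat_h, feat_w, stride, scales, ratios):
    """Return (x1, y1, x2, y2) anchors in image coordinates."""
    anchors = []
    for gy in range(feat_h):
        for gx in range(feat_w):
            # Centre of this feature-map cell in image pixels.
            cx, cy = (gx + 0.5) * stride, (gy + 0.5) * stride
            for s in scales:
                for r in ratios:  # r = height / width
                    w = s / math.sqrt(r)
                    h = s * math.sqrt(r)
                    anchors.append((cx - w / 2, cy - h / 2,
                                    cx + w / 2, cy + h / 2))
    return anchors

anchors = make_anchors(feat_h=2, feat_w=3, stride=16,
                       scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0))
```

With k = 9 templates per cell, even this tiny 2×3 grid yields 54 candidates, which is why score filtering and NMS are needed before the RoI head.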
8 Losses in Faster R-CNN? 🔥 hard
Answer: RPN: binary CE for objectness + smooth L1 for box deltas on assigned anchors; detector head: multi-class CE + bbox regression on positive RoIs.
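The RPN half of that loss can be sketched as below. This is a simplified illustration: the helper names, the `reg_weight` parameter, and the normalization (mean over sampled anchors for classification, mean over positives for regression) are assumptions, and anchors marked "ignore" are assumed to be filtered out beforehand.

```python
import math

def smooth_l1(x, beta=1.0):
    """Quadratic near zero, linear beyond beta."""
    ax = abs(x)
    return 0.5 * ax * ax / beta if ax < beta else ax - 0.5 * beta

def bce_with_logits(logit, label):
    """Numerically stable binary cross-entropy on a raw logit."""
    return max(logit, 0.0) - logit * label + math.log1p(math.exp(-abs(logit)))

def rpn_loss(logits, labels, pred_deltas, target_deltas, reg_weight=1.0):
    """labels: 1 = positive anchor, 0 = negative anchor."""
    cls = sum(bce_with_logits(z, y) for z, y in zip(logits, labels)) / len(logits)
    # Box regression only counts on positive anchors.
    pos = [i for i, y in enumerate(labels) if y == 1]
    reg = sum(smooth_l1(p - t)
              for i in pos
              for p, t in zip(pred_deltas[i], target_deltas[i]))
    reg /= max(len(pos), 1)
    return cls + reg_weight * reg

loss = rpn_loss(
    logits=[0.0, 0.0],
    labels=[1, 0],
    pred_deltas=[(0.5, 0.0, 0.0, 0.0), (9.0, 9.0, 9.0, 9.0)],
    target_deltas=[(0.0, 0.0, 0.0, 0.0), (0.0, 0.0, 0.0, 0.0)],
)
```

The second anchor's large delta error does not contribute, because it is a negative. The detector head follows the same pattern with multi-class cross-entropy in place of the binary term.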
9 Why FPN? 🔥 hard
Answer: A single high-level feature map is semantically strong but too coarse for small objects. FPN builds a top-down pyramid with lateral connections, so RoI features at every scale are semantically strong.
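One practical consequence is that each RoI must be routed to a pyramid level. The sketch below uses the heuristic from the FPN paper, k = floor(k0 + log2(sqrt(w·h) / 224)) with k0 = 4, clipped to levels P2 to P5. The function name `fpn_level` is hypothetical, and frameworks may use slightly different constants.

```python
import math

def fpn_level(box, k0=4, canonical=224.0, k_min=2, k_max=5):
    """Pick the pyramid level for an (x1, y1, x2, y2) RoI in image pixels."""
    w = max(box[2] - box[0], 1e-6)
    h = max(box[3] - box[1], 1e-6)
    k = math.floor(k0 + math.log2(math.sqrt(w * h) / canonical))
    # Small RoIs go to fine levels, large RoIs to coarse levels.
    return min(max(k, k_min), k_max)

levels = [fpn_level((0, 0, s, s)) for s in (32, 112, 224, 448, 1000)]
```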
10 RoIAlign role? 📊 medium
Answer: Replaces RoI pooling's rounding with bilinear sampling at exact (continuous) RoI and bin locations. Introduced in Mask R-CNN, where mask prediction is sensitive to misalignment.
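A minimal single-channel sketch of the idea: no coordinate is rounded, each bin is sampled at a few regularly spaced points, and the samples are averaged. The names `bilinear` and `roi_align`, the `samples` parameter, and the convention that cell centres sit at integer coordinates are assumptions for illustration.

```python
import math

def bilinear(feature, y, x):
    """Bilinearly interpolate a 2D list at continuous (y, x)."""
    h, w = len(feature), len(feature[0])
    y = min(max(y, 0.0), h - 1.0)
    x = min(max(x, 0.0), w - 1.0)
    y0, x0 = int(math.floor(y)), int(math.floor(x))
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
    ly, lx = y - y0, x - x0
    top = feature[y0][x0] * (1 - lx) + feature[y0][x1] * lx
    bot = feature[y1][x0] * (1 - lx) + feature[y1][x1] * lx
    return top * (1 - ly) + bot * ly

def roi_align(feature, roi, out_h, out_w, samples=2):
    """Average samples x samples bilinear reads per output bin."""
    x1, y1, x2, y2 = roi  # kept continuous: no rounding anywhere
    bin_h = (y2 - y1) / out_h
    bin_w = (x2 - x1) / out_w
    out = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            total = 0.0
            for sy in range(samples):
                for sx in range(samples):
                    y = y1 + (i + (sy + 0.5) / samples) * bin_h
                    x = x1 + (j + (sx + 0.5) / samples) * bin_w
                    total += bilinear(feature, y, x)
            row.append(total / (samples * samples))
        out.append(row)
    return out

centre = bilinear([[0.0, 1.0], [2.0, 3.0]], 0.5, 0.5)
ramp = [[float(c) for c in range(4)] for _ in range(4)]  # value = column index
aligned = roi_align(ramp, (0.5, 0.5, 2.5, 2.5), 1, 1)
```

On the ramp, the output equals the RoI's true horizontal centre (1.5), which a rounded RoI could not reproduce exactly.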
11 What is Cascade R-CNN? 🔥 hard
Answer: Sequence of detector stages with increasing IoU thresholds for positives—reduces overfitting to low-quality proposals and improves AP.
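The effect of rising thresholds can be illustrated with a small labelling sketch. It assumes thresholds of 0.5, 0.6, and 0.7 for three stages and only shows how positives are assigned; in the actual cascade each stage also refines the boxes and feeds them to the next stage. The names `iou` and `stage_labels` are hypothetical.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

# Assumed per-stage IoU thresholds for calling a proposal "positive".
STAGE_IOU_THRESHOLDS = (0.5, 0.6, 0.7)

def stage_labels(proposal, gt_box):
    """True where the proposal counts as a positive for that stage."""
    overlap = iou(proposal, gt_box)
    return [overlap >= t for t in STAGE_IOU_THRESHOLDS]

gt = (0.0, 0.0, 10.0, 10.0)
loose = (0.0, 0.0, 10.0, 5.5)  # IoU 0.55 with gt
tight = (0.0, 0.0, 10.0, 8.0)  # IoU 0.80 with gt
loose_labels = stage_labels(loose, gt)
tight_labels = stage_labels(tight, gt)
```

A loose proposal trains only the first stage as a positive; once a stage has tightened it, later stages can treat it as a positive at their stricter threshold.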
12 NMS placement? ⚡ easy
Answer: After RPN (proposal NMS) and usually after final class-specific boxes—removes duplicate detections.
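Greedy NMS itself is short enough to sketch in full. This is a plain O(n²) version for one class; the function names and the 0.5 threshold are illustrative, and production code would use a vectorized library routine instead.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def nms(boxes, scores, iou_thresh=0.5):
    """Return indices of kept boxes, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep a box only if it does not overlap a higher-scoring kept box.
        if all(iou(boxes[i], boxes[k]) <= iou_thresh for k in keep):
            keep.append(i)
    return keep

boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
kept = nms(boxes, scores)
```

For class-specific NMS after the detector head, the same routine would be run once per class on that class's boxes.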
13 Approximate joint training? 📊 medium
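Answer: Train the RPN and detector with one combined loss on a shared backbone, treating proposals as fixed inputs; the gradient w.r.t. proposal coordinates is ignored, hence "approximate". Historically the alternative was alternating (4-step) training; modern implementations use the single joint loss with careful sampling.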
14 Two-stage strength? ⚡ easy
Answer: Typically higher mAP especially on challenging datasets vs comparable-era one-stage; slower inference.
15 Mask R-CNN? 📊 medium
Answer: Adds mask branch to Faster R-CNN with RoIAlign—instance segmentation with modest overhead.
16 Keypoint R-CNN? 📊 medium
Answer: Same framework as Mask R-CNN, with a head that predicts one one-hot mask (heatmap) per keypoint; used for human pose estimation.
17 Deformable conv in detectors? 🔥 hard
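Answer: Learns offsets for the conv sampling grid, giving better geometric modeling of deformable objects. Introduced in Deformable ConvNets (DCN), evaluated with Faster R-CNN and R-FCN, and commonly used as a DCN backbone.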
18 What is HTC? 🔥 hard
Answer: Hybrid Task Cascade—interleaves detection and segmentation stages with feature fusion—strong COCO instance segmentation.
19 DETR vs R-CNN? 📊 medium
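Answer: DETR treats detection as set prediction with a transformer and bipartite matching, removing anchors and NMS. The pipeline is simpler, but training dynamics differ (the original DETR converges slowly) and the compute profile is different.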
20 When choose two-stage today? ⚡ easy
Answer: When max accuracy matters and latency budget allows, or when building on mature frameworks (Detectron2) with many pretrained configs.

R-CNN Family Cheat Sheet

Evolution
  • R-CNN → Fast
  • Faster + RPN
Add-ons
  • FPN
  • RoIAlign
Accuracy
  • Cascade
  • HTC

💡 Pro tip: Faster R-CNN = shared backbone + RPN proposals + RoI head.
