Computer Vision Interview
20 essential Q&A
Updated 2026
R-CNN Family: 20 Essential Q&A
From selective search to RPN and feature pyramids—the two-stage detector story.
~12 min read
20 questions
Advanced
RPN · RoI · FPN · Cascade
1
Original R-CNN steps?
📊 medium
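Answer: Propose ~2k regions (selective search) → warp each to a fixed size → extract CNN features per region → classify with per-class SVMs and refine with a bbox regressor. No convolutional computation is shared across regions, so it is very slow.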
2
Main bottleneck?
⚡ easy
Answer: Running CNN thousands of times per image on warped crops; also disk caching of features in early work.
3
What did Fast R-CNN fix?
📊 medium
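Answer: Run the CNN once on the full image; project RoIs onto the feature map → RoI pool to a fixed size → classification and box heads. Big speedup, and the heads and backbone train together with one multi-task loss (proposals still come from selective search).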
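The projection step can be sketched as a division by the backbone stride. The function name `project_roi`, the stride of 16, and the floor/ceil rounding are illustrative assumptions; implementations differ in how they round.

```python
import math

def project_roi(box, stride=16):
    """Map an image-space box (x1, y1, x2, y2) to feature-map cells.

    Assumption: the box is expanded outward (floor the top-left corner,
    ceil the bottom-right corner) so that it still covers the object.
    """
    x1, y1, x2, y2 = box
    return (math.floor(x1 / stride), math.floor(y1 / stride),
            math.ceil(x2 / stride), math.ceil(y2 / stride))

# A 165 x 130 pixel box becomes an 11 x 9 cell region at stride 16.
roi_on_features = project_roi((35, 50, 200, 180))
```

Because every RoI is just a window into the same feature map, the expensive convolution is paid once per image instead of once per proposal.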
4
How does RoI pooling work?
📊 medium
Answer: Divide each RoI on feature map into H×W bins; max-pool each bin to fixed output—quantization loses subpixel alignment.
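A minimal single-channel sketch of the binning above, in plain Python. The name `roi_max_pool`, the list-of-lists feature map, and the exact floor/ceil bin edges are assumptions for illustration, not the layer from any particular library.

```python
import math

def roi_max_pool(feature, roi, out_h, out_w):
    """Max-pool one RoI of a 2-D feature map into an out_h x out_w grid.

    feature: list of rows (one channel).
    roi: (x1, y1, x2, y2), inclusive integer feature-map coordinates.
    """
    x1, y1, x2, y2 = roi
    roi_h, roi_w = y2 - y1 + 1, x2 - x1 + 1
    pooled = []
    for i in range(out_h):
        # Bin edges are snapped to integers: this is the quantization
        # that discards subpixel alignment.
        ys = y1 + math.floor(i * roi_h / out_h)
        ye = y1 + math.ceil((i + 1) * roi_h / out_h)
        row = []
        for j in range(out_w):
            xs = x1 + math.floor(j * roi_w / out_w)
            xe = x1 + math.ceil((j + 1) * roi_w / out_w)
            row.append(max(feature[y][x]
                           for y in range(ys, ye)
                           for x in range(xs, xe)))
        pooled.append(row)
    return pooled

feature = [[r * 4 + c for c in range(4)] for r in range(4)]
pooled = roi_max_pool(feature, (0, 0, 3, 3), 2, 2)
```

Whatever the RoI size, the output is always `out_h` by `out_w`, which is what lets fully connected heads follow.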
5
What is Faster R-CNN?
🔥 hard
Answer: Replaces selective search with RPN that shares full-image conv features—learned proposals, joint training with detector.
6
What does the RPN output?
🔥 hard
Answer: At each anchor location: objectness logits and box deltas to refine anchors—proposals passed to RoI head.
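As a sketch of how box deltas refine an anchor, the following uses the common center/size parameterization (shift the center by a fraction of the anchor size, scale width and height exponentially). The function name `decode_anchor` is hypothetical, and real implementations usually also clip the deltas and the resulting boxes.

```python
import math

def decode_anchor(anchor, deltas):
    """Apply (dx, dy, dw, dh) to an anchor given as (x1, y1, x2, y2)."""
    xa1, ya1, xa2, ya2 = anchor
    wa, ha = xa2 - xa1, ya2 - ya1
    cxa, cya = xa1 + 0.5 * wa, ya1 + 0.5 * ha
    dx, dy, dw, dh = deltas
    # Center moves in units of anchor size; size changes multiplicatively.
    cx, cy = cxa + dx * wa, cya + dy * ha
    w, h = wa * math.exp(dw), ha * math.exp(dh)
    return (cx - 0.5 * w, cy - 0.5 * h, cx + 0.5 * w, cy + 0.5 * h)

def objectness(logit):
    """Sigmoid of the objectness logit, used to rank proposals."""
    return 1.0 / (1.0 + math.exp(-logit))

proposal = decode_anchor((0, 0, 10, 20), (0.1, 0.0, 0.0, 0.0))
```

Zero deltas return the anchor unchanged, so the network only has to learn corrections relative to each template.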
7
Anchor scales/aspect ratios?
📊 medium
Answer: Multiple templates per location cover different object shapes; k anchors per cell → many candidate boxes before filtering by score + NMS.
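A rough sketch of anchor generation, assuming 3 scales × 3 aspect ratios (k = 9) and a stride-16 feature map. The function names and default values are illustrative choices, not taken from a specific codebase.

```python
import math

def anchors_at(cx, cy, scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Return k = len(scales) * len(ratios) boxes centered at (cx, cy).

    Each ratio is height / width; every anchor of a given scale keeps
    roughly the same area (scale squared).
    """
    boxes = []
    for s in scales:
        for r in ratios:
            w = s / math.sqrt(r)
            h = s * math.sqrt(r)
            boxes.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return boxes

def all_anchors(feat_h, feat_w, stride=16):
    """Tile the anchor templates over every feature-map cell."""
    out = []
    for i in range(feat_h):
        for j in range(feat_w):
            out.extend(anchors_at((j + 0.5) * stride, (i + 0.5) * stride))
    return out

# A 38 x 50 feature map already yields 38 * 50 * 9 candidate boxes.
n_candidates = len(all_anchors(38, 50))
```

The large candidate count is why score thresholding and NMS are applied before the RoI head sees any proposals.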
8
Losses in Faster R-CNN?
🔥 hard
Answer: RPN: binary CE for objectness + smooth L1 for box deltas on assigned anchors; detector head: multi-class CE + bbox regression on positive RoIs.
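The RPN half of this loss can be sketched in plain Python. The sample layout, the function names, and the normalization (mean over sampled anchors for classification, mean over positives for regression) are assumptions made for the example; published implementations weight and normalize the two terms differently.

```python
import math

def smooth_l1(x, beta=1.0):
    """Quadratic near zero, linear for large errors."""
    ax = abs(x)
    return 0.5 * ax * ax / beta if ax < beta else ax - 0.5 * beta

def bce_with_logit(z, y):
    """Numerically stable binary cross-entropy for logit z, label y."""
    return max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))

def rpn_loss(samples, lam=1.0):
    """samples: list of (logit, label, pred_deltas, target_deltas)."""
    cls = sum(bce_with_logit(z, y) for z, y, _, _ in samples) / len(samples)
    # Box regression is only computed for positive anchors.
    pos = [(p, t) for _, y, p, t in samples if y == 1]
    reg = sum(smooth_l1(pi - ti) for p, t in pos for pi, ti in zip(p, t))
    reg /= max(len(pos), 1)
    return cls + lam * reg
```

The detector head follows the same pattern, with multi-class cross-entropy replacing the binary term.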
9
Why FPN?
🔥 hard
Answer: A single high-level feature map is semantically strong but too coarse for small objects. FPN builds a top-down pyramid with lateral connections, so RoI features can be taken from the level that matches each object's scale.
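One way to pick the pyramid level for an RoI is the heuristic from the FPN paper: k = floor(k0 + log2(sqrt(w·h) / 224)), clamped to the available levels. The function name and the defaults below (k0 = 4, levels 2 to 5) are stated as assumptions for this sketch.

```python
import math

def fpn_level(box, k0=4, canonical=224, k_min=2, k_max=5):
    """Assign an RoI (x1, y1, x2, y2) in image pixels to a pyramid level.

    A box of canonical size maps to level k0; halving the side length
    moves one level down (finer resolution).
    """
    x1, y1, x2, y2 = box
    w, h = x2 - x1, y2 - y1
    k = math.floor(k0 + math.log2(math.sqrt(w * h) / canonical))
    return min(max(k, k_min), k_max)

levels = [fpn_level((0, 0, s, s)) for s in (32, 112, 224, 800)]
```

Small boxes land on high-resolution levels and large boxes on coarse ones, so each RoI is pooled from features at roughly its own scale.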
10
RoIAlign role?
📊 medium
Answer: Bilinear sample features at exact RoI locations—used in Mask R-CNN for alignment-sensitive mask prediction.
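A simplified single-channel sketch of RoIAlign: the RoI keeps its fractional coordinates, each bin is sampled at a small grid of points, and the bilinear samples are averaged. The function names, the 2 × 2 sample grid, and the convention that integer coordinates sit on cell centers are assumptions for this example.

```python
import math

def bilinear(feature, y, x):
    """Bilinearly interpolate a 2-D list at continuous (y, x)."""
    h, w = len(feature), len(feature[0])
    y = min(max(y, 0.0), h - 1)
    x = min(max(x, 0.0), w - 1)
    y0, x0 = int(math.floor(y)), int(math.floor(x))
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
    ly, lx = y - y0, x - x0
    top = feature[y0][x0] * (1 - lx) + feature[y0][x1] * lx
    bot = feature[y1][x0] * (1 - lx) + feature[y1][x1] * lx
    return top * (1 - ly) + bot * ly

def roi_align(feature, roi, out_h, out_w, samples=2):
    """Average samples x samples bilinear reads per bin; no rounding."""
    x1, y1, x2, y2 = roi
    bin_h, bin_w = (y2 - y1) / out_h, (x2 - x1) / out_w
    out = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            total = 0.0
            for a in range(samples):
                for b in range(samples):
                    y = y1 + (i + (a + 0.5) / samples) * bin_h
                    x = x1 + (j + (b + 0.5) / samples) * bin_w
                    total += bilinear(feature, y, x)
            row.append(total / (samples * samples))
        out.append(row)
    return out
```

Unlike RoI pooling, nothing here is snapped to integer cells, which is what keeps mask predictions aligned with the pixels they describe.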
11
What is Cascade R-CNN?
🔥 hard
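Answer: Sequence of detector stages trained with increasing IoU thresholds for positives. Each stage refines the boxes it passes on, so later stages still see enough high-quality positives; this avoids the overfitting a single head suffers when trained directly at a high threshold and improves AP, especially at strict IoU.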
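The effect of raising the threshold can be illustrated with a small count. The thresholds 0.5, 0.6, 0.7 are the commonly cited Cascade R-CNN setting; the function name and the IoU values are made up for the example.

```python
def positives_per_stage(ious, thresholds=(0.5, 0.6, 0.7)):
    """Count proposals that qualify as positives at each stage threshold."""
    return [sum(1 for v in ious if v >= t) for t in thresholds]

# Hypothetical IoUs between five proposals and their matched ground truth.
raw_ious = [0.45, 0.55, 0.62, 0.71, 0.90]
counts = positives_per_stage(raw_ious)
```

With unrefined proposals the positive set shrinks at every stage, which is why each cascade stage is fed the previous stage's regressed boxes rather than the original proposals.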
12
NMS placement?
⚡ easy
Answer: After RPN (proposal NMS) and usually after final class-specific boxes—removes duplicate detections.
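Greedy NMS can be sketched in a few lines of plain Python. The function names and the 0.5 threshold are illustrative; for class-specific NMS the same routine would be run separately on each class's boxes.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def nms(boxes, scores, iou_thr=0.5):
    """Return indices of kept boxes, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep a box only if it does not overlap any already-kept box.
        if all(iou(boxes[i], boxes[k]) <= iou_thr for k in keep):
            keep.append(i)
    return keep

boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
kept = nms(boxes, [0.9, 0.8, 0.7])
```

Here the second box overlaps the first with IoU of about 0.68 and is suppressed, while the distant third box survives.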
13
Approximate joint training?
📊 medium
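Answer: The original paper used 4-step alternating training of RPN and detector. Approximate joint training instead sums the RPN and detector losses in one backward pass over the shared backbone, ignoring gradients with respect to the proposal coordinates; modern implementations use this single-loss setup with careful sampling.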
14
Two-stage strength?
⚡ easy
Answer: Typically higher mAP especially on challenging datasets vs comparable-era one-stage; slower inference.
15
Mask R-CNN?
📊 medium
Answer: Adds mask branch to Faster R-CNN with RoIAlign—instance segmentation with modest overhead.
16
Keypoint R-CNN?
📊 medium
Answer: Same framework with a keypoint head that predicts one heatmap per keypoint (Mask R-CNN treats each keypoint as a one-hot mask); used for human pose estimation.
17
Deformable conv in detectors?
🔥 hard
Answer: Learns offsets for the conv sampling grid, giving better geometric modeling of deformable objects; typically added to the backbone (DCN) of Faster R-CNN or R-FCN style detectors.
18
What is HTC?
🔥 hard
Answer: Hybrid Task Cascade—interleaves detection and segmentation stages with feature fusion—strong COCO instance segmentation.
19
DETR vs R-CNN?
📊 medium
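Answer: DETR removes anchors and NMS by predicting a set of boxes with a transformer and training with one-to-one bipartite matching. The pipeline is simpler, but training converges more slowly and the compute profile differs.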
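DETR can drop NMS because training assigns each ground-truth box to exactly one prediction. A brute-force sketch of that one-to-one matching follows; the function name and cost values are hypothetical, and real implementations use the Hungarian algorithm (for example `scipy.optimize.linear_sum_assignment`) rather than enumerating permutations.

```python
from itertools import permutations

def match(cost):
    """Minimum-cost one-to-one assignment of ground truths to predictions.

    cost[i][j] is the cost of giving ground truth i to prediction j.
    Assumes at least as many predictions as ground truths.
    """
    n_gt, n_pred = len(cost), len(cost[0])
    best, best_total = None, float("inf")
    for perm in permutations(range(n_pred), n_gt):
        total = sum(cost[i][j] for i, j in enumerate(perm))
        if total < best_total:
            best, best_total = perm, total
    return list(best), best_total

# Two ground truths, three predictions; the third prediction stays unmatched.
assignment, total = match([[1, 5, 9], [4, 2, 9]])
```

Unmatched predictions are trained toward a "no object" class, so duplicates are penalized during training instead of being filtered afterwards.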
20
When choose two-stage today?
⚡ easy
Answer: When max accuracy matters and latency budget allows, or when building on mature frameworks (Detectron2) with many pretrained configs.
R-CNN Family Cheat Sheet
Evolution
- R-CNN → Fast
- Faster + RPN
Add-ons
- FPN
- RoIAlign
Accuracy
- Cascade
- HTC
💡 Pro tip: Faster R-CNN = shared backbone + RPN proposals + RoI head.