Computer Vision Interview
40 Q&A
Chapter 8
Two-Stage Object Detection — Interview Q&A
Detection fundamentals and the R-CNN family—region proposals, Fast R-CNN, and Faster R-CNN.
Object Detection Intro: 20 Essential Q&A
1
What is object detection?
⚡ easy
Answer: Predict where objects are (bounding boxes or points) and what they are (class labels)—often multiple objects per image.
2
Common bbox formats?
📊 medium
Answer: (x_min,y_min,x_max,y_max), (cx,cy,w,h), or normalized variants—be consistent when converting and computing IoU.
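A minimal sketch of converting between the formats named above. The function names are illustrative rather than taken from any library, and normalization assumes coordinates are divided by the image width and height.

```python
def xyxy_to_cxcywh(box):
    """Convert (x_min, y_min, x_max, y_max) to (cx, cy, w, h)."""
    x_min, y_min, x_max, y_max = box
    w, h = x_max - x_min, y_max - y_min
    return (x_min + w / 2, y_min + h / 2, w, h)


def cxcywh_to_xyxy(box):
    """Convert (cx, cy, w, h) back to (x_min, y_min, x_max, y_max)."""
    cx, cy, w, h = box
    return (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)


def normalize_xyxy(box, img_w, img_h):
    """Scale pixel corner coordinates into the [0, 1] range."""
    x_min, y_min, x_max, y_max = box
    return (x_min / img_w, y_min / img_h, x_max / img_w, y_max / img_h)
```

Converting to one canonical format at the data-loading boundary, and only there, is usually the easiest way to stay consistent.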
3
Define IoU.
📊 medium
Answer: Intersection area / union area of two boxes (or masks)—0 no overlap, 1 perfect match; used for matching preds to ground truth.
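# Boxes are (x_min, y_min, w, h); the epsilon guards against division by zero.
def iou(a, b):
    xa, ya, wa, ha = a
    xb, yb, wb, hb = b
    inter_w = max(0, min(xa + wa, xb + wb) - max(xa, xb))
    inter_h = max(0, min(ya + ha, yb + hb) - max(ya, yb))
    inter = inter_w * inter_h
    return inter / (wa * ha + wb * hb - inter + 1e-6)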
4
TP vs FP for a detection?
📊 medium
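Answer: Match each prediction to a same-class GT box by IoU ≥ threshold, highest confidence first. A prediction matched to a previously unmatched GT is a TP; a prediction with no match, or a duplicate hit on an already-matched GT, is an FP; a GT left unmatched is an FN.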
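The matching rule can be sketched for one class in one image as below. This is a simplified greedy variant that assumes corner-format boxes (x_min, y_min, x_max, y_max); the helper names are illustrative, and real benchmarks add details such as "difficult" or crowd annotations.

```python
def iou_xyxy(a, b):
    """IoU of two (x_min, y_min, x_max, y_max) boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def match_detections(preds, gts, iou_thr=0.5):
    """preds: list of (box, score); gts: list of boxes (same class, same image).

    Returns (tp_flags, num_fn); tp_flags follows descending score order.
    """
    ordered = sorted(preds, key=lambda p: p[1], reverse=True)
    used = [False] * len(gts)
    tp_flags = []
    for box, _score in ordered:
        best_iou, best_j = 0.0, -1
        for j, gt in enumerate(gts):
            if used[j]:
                continue  # each GT can be claimed by at most one prediction
            overlap = iou_xyxy(box, gt)
            if overlap > best_iou:
                best_iou, best_j = overlap, j
        if best_j >= 0 and best_iou >= iou_thr:
            used[best_j] = True
            tp_flags.append(True)   # TP: first sufficiently overlapping hit
        else:
            tp_flags.append(False)  # FP: no free GT overlaps enough
    return tp_flags, used.count(False)  # unmatched GTs are FNs
```

Note that the second-best box on an object becomes an FP here, which is why duplicate suppression matters for precision.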
5
What is mAP?
🔥 hard
Answer: Mean Average Precision over classes: AP integrates precision–recall curve (often at IoU 0.5 or 0.5:0.95 on COCO).
6
PR curve from detections?
📊 medium
Answer: Sort predictions by confidence; vary threshold to trace precision vs recall—AP is area under interpolated PR curve.
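As a rough sketch, AP for one class can be computed from TP/FP flags that are already sorted by descending confidence. This version uses all-point interpolation (area under the monotone precision envelope); benchmarks differ in the exact interpolation, for example COCO samples 101 recall points, so treat it as illustrative.

```python
def average_precision(tp_flags, num_gt):
    """tp_flags: True/False per prediction, sorted by descending confidence."""
    if num_gt == 0:
        return 0.0
    precisions, recalls = [], []
    tp = fp = 0
    for is_tp in tp_flags:
        if is_tp:
            tp += 1
        else:
            fp += 1
        precisions.append(tp / (tp + fp))
        recalls.append(tp / num_gt)
    # Interpolate: precision at recall r becomes the max precision at recall >= r.
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])
    # Sum rectangle areas wherever recall increases.
    ap, prev_recall = 0.0, 0.0
    for p, r in zip(precisions, recalls):
        ap += (r - prev_recall) * p
        prev_recall = r
    return ap
```

mAP is then the mean of this value over classes.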
7
Why NMS?
📊 medium
Answer: Many windows fire on same object—suppress lower-scoring boxes with high IoU to same higher-scoring box; variants: soft-NMS, class-aware NMS.
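A minimal greedy NMS sketch for a single class, assuming corner-format boxes (x_min, y_min, x_max, y_max). The function names are illustrative; production code typically uses a vectorized library routine instead.

```python
def iou_xyxy(a, b):
    """IoU of two (x_min, y_min, x_max, y_max) boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_thr=0.5):
    """Return indices of kept boxes, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep a box only if it does not overlap any already-kept box too much.
        if all(iou_xyxy(boxes[i], boxes[k]) <= iou_thr for k in keep):
            keep.append(i)
    return keep
```

Class-aware NMS would run this once per class; soft-NMS would decay the scores of overlapping boxes instead of dropping them.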
8
What is an anchor?
📊 medium
Answer: Predefined box prior at a feature map location; network predicts offsets + class—speeds convergence vs pure coordinate regression.
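One common way to express the offsets is the R-CNN-style parameterization, sketched below with anchors and boxes in (cx, cy, w, h) form. The function names are illustrative, and real implementations often also scale or clamp the deltas.

```python
import math


def encode_box(anchor, gt):
    """Regression targets that move `anchor` onto `gt`; both are (cx, cy, w, h)."""
    ax, ay, aw, ah = anchor
    gx, gy, gw, gh = gt
    return ((gx - ax) / aw, (gy - ay) / ah, math.log(gw / aw), math.log(gh / ah))


def decode_box(anchor, deltas):
    """Apply predicted (dx, dy, dw, dh) to an anchor to get a (cx, cy, w, h) box."""
    ax, ay, aw, ah = anchor
    dx, dy, dw, dh = deltas
    return (ax + dx * aw, ay + dy * ah, aw * math.exp(dw), ah * math.exp(dh))
```

Because shifts are relative to the anchor size and scales are in log space, the targets stay in a small numeric range regardless of object size.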
9
Two-stage vs one-stage?
📊 medium
Answer: Two-stage: propose regions then classify (R-CNN family). One-stage: dense predictions in one pass (YOLO, SSD, RetinaNet)—usually faster, different error profile.
10
Historical sliding window?
⚡ easy
Answer: Exhaustive windows + classifier—prohibitively slow; modern detectors replace with region proposals or dense anchors on feature maps.
11
Multi-scale detection?
📊 medium
Answer: Image/feature pyramids, multi-scale anchors, or FPN so small and large objects are both seen at appropriate resolution.
12
Why are small objects hard?
📊 medium
Answer: Few pixels on feature maps, weak signal—higher-res inputs, specialized heads, and data augmentation (copy-paste) help.
13
COCO AP@[.5:.95]?
🔥 hard
Answer: Average AP over IoU thresholds 0.50 to 0.95 step 0.05—rewards localization quality, not just 0.5 IoU hits.
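The averaging can be sketched as follows. `ap_at` is a hypothetical callable that returns AP for a given IoU threshold; it stands in for a full evaluation pass and is not part of any library.

```python
def coco_iou_thresholds():
    """The ten IoU thresholds 0.50, 0.55, ..., 0.95."""
    return [round(0.5 + 0.05 * i, 2) for i in range(10)]


def ap_over_iou_range(ap_at):
    """Average a per-threshold AP function over the ten thresholds."""
    thresholds = coco_iou_thresholds()
    return sum(ap_at(t) for t in thresholds) / len(thresholds)
```

A detector whose boxes only just clear IoU 0.5 scores well at the first threshold and poorly at the rest, which is how the averaged metric rewards tighter localization.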
14
Role of confidence score?
⚡ easy
Answer: Estimated probability of class (and sometimes objectness)—used to sort preds for PR curve and NMS thresholding.
15
Multi-class boxes?
📊 medium
Answer: Each prediction has C-way softmax (or sigmoid per class for multi-label); match only within same predicted class for mAP.
16
Assigning training targets?
📊 medium
Answer: Match anchors/points to GT by IoU or center rules; positives get box regression targets and class; negatives contribute to objectness / background loss.
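An IoU-based assignment can be sketched like this. The 0.7 / 0.3 thresholds follow the RPN setup in the Faster R-CNN paper, the label convention (1 positive, 0 negative, -1 ignored) is an assumption for illustration, and the paper's extra rule that the best anchor for each GT is always positive is omitted for brevity.

```python
def iou_xyxy(a, b):
    """IoU of two (x_min, y_min, x_max, y_max) boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def assign_anchors(anchors, gts, pos_thr=0.7, neg_thr=0.3):
    """Return (labels, matched_gt_index) per anchor."""
    labels, matched = [], []
    for anchor in anchors:
        ious = [iou_xyxy(anchor, gt) for gt in gts]
        best_j = max(range(len(gts)), key=lambda j: ious[j]) if gts else -1
        best_iou = ious[best_j] if gts else 0.0
        if best_iou >= pos_thr:
            labels.append(1)        # positive: gets class + box regression targets
            matched.append(best_j)
        elif best_iou < neg_thr:
            labels.append(0)        # negative: background / objectness loss only
            matched.append(-1)
        else:
            labels.append(-1)       # ambiguous overlap: excluded from the loss
            matched.append(-1)
    return labels, matched
```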
17
Many background anchors?
📊 medium
Answer: Extreme imbalance—addressed by sampling (hard negative mining), focal loss, or balanced loss weighting.
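As an illustration of the focal-loss option, here is a scalar sketch of the binary form, FL = -alpha_t * (1 - p_t)^gamma * log(p_t). The defaults gamma=2 and alpha=0.25 are the values reported in the RetinaNet paper; the function name and the clamping epsilon are assumptions.

```python
import math


def binary_focal_loss(p, y, gamma=2.0, alpha=0.25):
    """p: predicted foreground probability; y: 1 for foreground, 0 for background."""
    p = min(max(p, 1e-7), 1 - 1e-7)          # avoid log(0)
    p_t = p if y == 1 else 1 - p             # probability assigned to the true label
    alpha_t = alpha if y == 1 else 1 - alpha
    return -alpha_t * (1 - p_t) ** gamma * math.log(p_t)
```

The (1 - p_t)^gamma factor shrinks the loss of easy, already well-classified background anchors, so the many easy negatives no longer dominate the gradient.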
18
Latency drivers?
⚡ easy
Answer: Backbone depth, input resolution, number of proposals, NMS cost, batch size—profile end-to-end for deployment.
19
Detection vs segmentation?
⚡ easy
Answer: Boxes are coarse; segmentation gives pixel masks—detection often first stage in two-stage instance segmentation.
20
Per-image vs global AP?
📊 medium
Answer: Standard benchmarks aggregate over dataset; understand whether metric averages over images or pools all detections (COCO pools).
R-CNN Family: 20 Essential Q&A
21
Original R-CNN steps?
📊 medium
Answer: Propose ~2k regions (selective search) → warp each to a fixed size → extract CNN features → classify with one SVM per class and refine with a bbox regressor. No convolutional computation is shared across regions, so it is very slow.
22
Main bottleneck?
⚡ easy
Answer: Running CNN thousands of times per image on warped crops; also disk caching of features in early work.
23
What did Fast R-CNN fix?
📊 medium
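Answer: Run the CNN once on the full image; project RoIs onto the feature map → RoI pool to a fixed size → classification and box-regression heads. Big speedup, and the network trains in a single stage with a multi-task loss, although proposals still come from an external method such as selective search.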
24
How RoI pooling works?
📊 medium
Answer: Divide each RoI on feature map into H×W bins; max-pool each bin to fixed output—quantization loses subpixel alignment.
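A toy sketch of RoI max pooling on a single-channel feature map stored as a list of rows. It assumes the RoI has already been rounded to inclusive integer feature-map coordinates (x0, y0, x1, y1), which is exactly the quantization step mentioned above; the function name is illustrative.

```python
import math


def roi_max_pool(feat, roi, out_h, out_w):
    """Max-pool the RoI region of `feat` into an out_h x out_w grid."""
    x0, y0, x1, y1 = roi
    roi_h, roi_w = y1 - y0 + 1, x1 - x0 + 1
    pooled = []
    for i in range(out_h):
        # Bin edges are snapped to integers: floor for the start, ceil for the end.
        ys = y0 + math.floor(i * roi_h / out_h)
        ye = y0 + math.ceil((i + 1) * roi_h / out_h)
        row = []
        for j in range(out_w):
            xs = x0 + math.floor(j * roi_w / out_w)
            xe = x0 + math.ceil((j + 1) * roi_w / out_w)
            row.append(max(feat[y][x] for y in range(ys, ye) for x in range(xs, xe)))
        pooled.append(row)
    return pooled
```

When the RoI size is not divisible by the output size, neighboring bins overlap or differ in size, which is one visible symptom of the lost subpixel alignment.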
25
What is Faster R-CNN?
🔥 hard
Answer: Replaces selective search with RPN that shares full-image conv features—learned proposals, joint training with detector.
26
What does the RPN output?
🔥 hard
Answer: At each anchor location: objectness logits and box deltas to refine anchors—proposals passed to RoI head.
27
Anchor scales/aspect ratios?
📊 medium
Answer: Multiple templates per location cover different object shapes; k anchors per cell → many candidate boxes before filtering by score + NMS.
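Anchor generation can be sketched as below. The scales (128, 256, 512) and ratios (0.5, 1, 2) match the Faster R-CNN paper's default of k = 9 anchors per location; the stride of 16, the half-cell center offset, and reading the ratio as height/width are assumptions that vary between implementations.

```python
import math


def generate_anchors(feat_h, feat_w, stride=16,
                     scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Return (x_min, y_min, x_max, y_max) anchors in image coordinates."""
    anchors = []
    for iy in range(feat_h):
        for ix in range(feat_w):
            # Center of this feature-map cell in image coordinates.
            cx, cy = (ix + 0.5) * stride, (iy + 0.5) * stride
            for scale in scales:
                for ratio in ratios:
                    # Keep area = scale^2 while setting h / w = ratio.
                    w = scale / math.sqrt(ratio)
                    h = scale * math.sqrt(ratio)
                    anchors.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return anchors
```

Even a modest 38 x 50 feature map yields 38 * 50 * 9 = 17,100 anchors, which is why score filtering and NMS follow.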
28
Losses in Faster R-CNN?
🔥 hard
Answer: RPN: binary CE for objectness + smooth L1 for box deltas on assigned anchors; detector head: multi-class CE + bbox regression on positive RoIs.
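The smooth L1 term can be sketched per coordinate as follows. `beta` is the point where the loss switches from quadratic to linear; the default of 1.0 and the helper names are assumptions for illustration.

```python
def smooth_l1(x, beta=1.0):
    """Quadratic near zero, linear for |x| >= beta."""
    ax = abs(x)
    if ax < beta:
        return 0.5 * ax * ax / beta
    return ax - 0.5 * beta


def box_regression_loss(pred_deltas, target_deltas, beta=1.0):
    """Sum of smooth L1 over the four box-delta coordinates."""
    return sum(smooth_l1(p - t, beta) for p, t in zip(pred_deltas, target_deltas))
```

Compared with plain L2, the linear tail keeps gradients bounded for badly mislocalized boxes, which makes training less sensitive to outliers.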
29
Why FPN?
🔥 hard
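Answer: A single high-level feature map is semantically strong but spatially coarse, so it is weak for small objects. FPN adds a top-down pathway with lateral connections so that every pyramid level carries strong semantics, giving multi-scale RoI features.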
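To show how RoIs are routed to pyramid levels, the FPN paper assigns a w x h RoI to level k = floor(k0 + log2(sqrt(w * h) / 224)) with k0 = 4. The sketch below clamps the result to levels 2 through 5; the function name and the clamp bounds as parameters are illustrative.

```python
import math


def fpn_level(w, h, k0=4, canonical=224, k_min=2, k_max=5):
    """Pick the pyramid level whose resolution suits an RoI of size w x h."""
    k = math.floor(k0 + math.log2(math.sqrt(w * h) / canonical))
    return max(k_min, min(k_max, k))
```

Small RoIs land on the finer, higher-resolution levels and large RoIs on the coarser ones.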
30
RoIAlign role?
📊 medium
Answer: Bilinearly samples features at exact (unquantized) RoI locations instead of rounding bin boundaries as RoI pooling does; introduced in Mask R-CNN for alignment-sensitive mask prediction.
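A toy single-channel sketch of the idea: each output bin averages a few bilinearly interpolated samples at real-valued positions. It assumes feature values sit at integer coordinates and that the RoI is given in continuous feature-map coordinates; library implementations differ in such conventions, and the names here are illustrative.

```python
import math


def bilinear_sample(feat, x, y):
    """Interpolate feat (list of rows) at real-valued (x, y)."""
    h, w = len(feat), len(feat[0])
    x = min(max(x, 0.0), w - 1)
    y = min(max(y, 0.0), h - 1)
    x0, y0 = int(math.floor(x)), int(math.floor(y))
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    dx, dy = x - x0, y - y0
    top = feat[y0][x0] * (1 - dx) + feat[y0][x1] * dx
    bottom = feat[y1][x0] * (1 - dx) + feat[y1][x1] * dx
    return top * (1 - dy) + bottom * dy


def roi_align(feat, roi, out_h, out_w, samples=2):
    """Average samples x samples interpolated points inside each output bin."""
    x0, y0, x1, y1 = roi  # no rounding of the RoI or of the bin edges
    bin_w, bin_h = (x1 - x0) / out_w, (y1 - y0) / out_h
    out = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            total = 0.0
            for sy in range(samples):
                for sx in range(samples):
                    y = y0 + (i + (sy + 0.5) / samples) * bin_h
                    x = x0 + (j + (sx + 0.5) / samples) * bin_w
                    total += bilinear_sample(feat, x, y)
            row.append(total / (samples * samples))
        out.append(row)
    return out
```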
31
What is Cascade R-CNN?
🔥 hard
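Answer: Sequence of detector stages trained with increasing IoU thresholds for positives, each stage refining the boxes passed to the next. This avoids the overfitting that occurs when a single head is trained at a high IoU threshold with few positives, and improves AP, especially at strict IoU.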
32
NMS placement?
⚡ easy
Answer: After RPN (proposal NMS) and usually after final class-specific boxes—removes duplicate detections.
33
Approximate joint training?
📊 medium
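Answer: The original Faster R-CNN paper used alternating (4-step) training. Approximate joint training instead sums the RPN and detector losses over a shared backbone in one pass, ignoring the gradient with respect to the proposal coordinates; modern implementations use this single-loss setup with careful sampling.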
34
Two-stage strength?
⚡ easy
Answer: Typically higher mAP especially on challenging datasets vs comparable-era one-stage; slower inference.
35
Mask R-CNN?
📊 medium
Answer: Adds mask branch to Faster R-CNN with RoIAlign—instance segmentation with modest overhead.
36
Keypoint R-CNN?
📊 medium
Answer: Same framework with a keypoint head that predicts one heatmap per keypoint, trained as a one-hot mask over locations; used for pose estimation.
37
Deformable conv in detectors?
🔥 hard
Answer: Learns per-location offsets for the convolution's sampling grid, giving better geometric modeling of deformable objects; used in DCN backbones for detectors such as Faster R-CNN and R-FCN.
38
What is HTC?
🔥 hard
Answer: Hybrid Task Cascade—interleaves detection and segmentation stages with feature fusion—strong COCO instance segmentation.
39
DETR vs R-CNN?
📊 medium
Answer: DETR drops anchors and NMS by predicting a set of boxes with a transformer and training with one-to-one bipartite matching. The pipeline is simpler, but training dynamics and compute differ.
40
When choose two-stage today?
⚡ easy
Answer: When max accuracy matters and latency budget allows, or when building on mature frameworks (Detectron2) with many pretrained configs.