Two-Stage Object Detection — Interview Q&A

Question 1

1 What is object detection? ⚡ easy

Answer

Answer: Predict where objects are (bounding boxes or points) and what they are (class labels)—often multiple objects per image.

Question 2

2 Common bbox formats? 📊 medium

Answer

Answer: (x_min,y_min,x_max,y_max), (cx,cy,w,h), or normalized variants—be consistent when converting and computing IoU.
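A minimal sketch of converting between these formats, assuming boxes are plain tuples of numbers in pixel coordinates. The function names are illustrative and not taken from any particular library.

```python
def xyxy_to_cxcywh(box):
    """Convert (x_min, y_min, x_max, y_max) to (cx, cy, w, h)."""
    x_min, y_min, x_max, y_max = box
    w = x_max - x_min
    h = y_max - y_min
    return (x_min + w / 2.0, y_min + h / 2.0, w, h)


def cxcywh_to_xyxy(box):
    """Convert (cx, cy, w, h) back to (x_min, y_min, x_max, y_max)."""
    cx, cy, w, h = box
    return (cx - w / 2.0, cy - h / 2.0, cx + w / 2.0, cy + h / 2.0)


def normalize_xyxy(box, img_w, img_h):
    """Scale pixel corner coordinates into the [0, 1] range."""
    x_min, y_min, x_max, y_max = box
    return (x_min / img_w, y_min / img_h, x_max / img_w, y_max / img_h)
```

Converting everything to one corner format before computing IoU avoids the most common mistake here, which is mixing formats between predictions and labels.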

Question 3

3 Define IoU. 📊 medium

Answer

Answer: Intersection area / union area of two boxes (or masks)—0 no overlap, 1 perfect match; used for matching preds to ground truth.
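The definition can be sketched for axis-aligned boxes as follows, assuming (x_min, y_min, x_max, y_max) tuples with continuous coordinates (no +1 pixel convention). The function name is illustrative.

```python
def iou(a, b):
    """Intersection over union of two (x_min, y_min, x_max, y_max) boxes."""
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b
    # Width and height of the overlap rectangle, clamped at zero when disjoint.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    # Degenerate (zero-area) boxes give a zero union; report no overlap.
    return inter / union if union > 0 else 0.0
```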

Question 4

4 TP vs FP for a detection? 📊 medium

Answer

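Answer: Match each prediction to ground truth (GT) of the same class by IoU ≥ threshold, processing predictions in descending confidence: a prediction matched to a not-yet-matched GT is a TP; a prediction with no matching GT, or a duplicate hit on an already matched GT, is an FP; a GT left unmatched is an FN.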
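A simplified sketch of this matching for one class in one image, assuming corner-format boxes and greedy matching in descending confidence. Benchmarks differ in details (for example, how an already matched GT or a "difficult" GT is treated), so this is one reasonable convention and not the exact rule of any specific benchmark. All names are illustrative.

```python
def iou(a, b):
    """IoU of two (x_min, y_min, x_max, y_max) boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def match_detections(preds, gts, iou_thr=0.5):
    """preds: list of (score, box); gts: list of boxes.

    Returns (flags, num_fn): flags[i] is True (TP) or False (FP) for the
    i-th prediction in descending-score order; num_fn counts unmatched GTs.
    """
    ordered = sorted(preds, key=lambda p: p[0], reverse=True)
    used = [False] * len(gts)
    flags = []
    for _, box in ordered:
        best_iou, best_j = 0.0, -1
        for j, gt in enumerate(gts):
            if used[j]:
                continue  # each GT can be claimed by at most one prediction
            overlap = iou(box, gt)
            if overlap > best_iou:
                best_iou, best_j = overlap, j
        if best_j >= 0 and best_iou >= iou_thr:
            used[best_j] = True
            flags.append(True)
        else:
            flags.append(False)  # no free GT overlaps enough: false positive
    return flags, used.count(False)
```

The second, lower-scoring box on an object is counted as an FP, which is why duplicate suppression matters for precision.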

Question 5

5 What is mAP? 🔥 hard

Answer

Answer: Mean Average Precision: the mean over classes of AP, where AP is the area under one class's precision-recall curve at a given IoU threshold (0.5 on PASCAL VOC; averaged over 0.50:0.95 on COCO).

Question 6

6 PR curve from detections? 📊 medium

Answer

Answer: Sort predictions by confidence; vary threshold to trace precision vs recall—AP is area under interpolated PR curve.
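A minimal sketch of the computation, assuming each detection has already been labeled TP or FP and sorted by descending confidence. It uses all-point interpolation; COCO instead samples the interpolated curve at 101 recall points, so values can differ slightly. The function name is illustrative.

```python
def average_precision(flags, num_gt):
    """flags: True (TP) / False (FP) per detection, highest confidence first.
    num_gt: number of ground-truth objects of this class."""
    if num_gt == 0:
        return 0.0
    tp = fp = 0
    recalls, precisions = [], []
    # Lowering the confidence threshold admits one more detection per step.
    for is_tp in flags:
        if is_tp:
            tp += 1
        else:
            fp += 1
        recalls.append(tp / num_gt)
        precisions.append(tp / (tp + fp))
    # Interpolation: precision at recall r becomes the max precision at any recall >= r.
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])
    # Area under the interpolated step curve.
    ap, prev_recall = 0.0, 0.0
    for r, p in zip(recalls, precisions):
        ap += (r - prev_recall) * p
        prev_recall = r
    return ap
```

For example, flags [True, False, True] with two ground-truth objects give recall steps at 0.5 and 1.0 with interpolated precisions 1 and 2/3, so AP = 0.5 * 1 + 0.5 * 2/3 = 5/6.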

Question 7

7 Why NMS? 📊 medium

Answer

Answer: Many windows fire on same object—suppress lower-scoring boxes with high IoU to same higher-scoring box; variants: soft-NMS, class-aware NMS.
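Greedy NMS can be sketched as follows, assuming corner-format boxes that all belong to one class. For class-aware NMS this would be run per class. Names are illustrative.

```python
def iou(a, b):
    """IoU of two (x_min, y_min, x_max, y_max) boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_thr=0.5):
    """Return indices of kept boxes, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep a box only if it does not overlap too much with any
        # higher-scoring box that was already kept.
        if all(iou(boxes[i], boxes[k]) <= iou_thr for k in keep):
            keep.append(i)
    return keep
```

Soft-NMS differs in that it decays the scores of overlapping boxes instead of discarding them outright.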

Question 8

8 What is an anchor? 📊 medium

Answer

Answer: Predefined box prior at a feature map location; network predicts offsets + class—speeds convergence vs pure coordinate regression.
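A sketch of the offset parameterization used in the R-CNN family, assuming anchors and boxes in (cx, cy, w, h) format. Implementations often also scale or clip these deltas; that is omitted here, and the function names are illustrative.

```python
import math


def encode_deltas(anchor, gt):
    """Regression targets that move `anchor` onto `gt` (both cx, cy, w, h)."""
    ax, ay, aw, ah = anchor
    gx, gy, gw, gh = gt
    # Center shifts are normalized by anchor size; sizes use a log ratio.
    return ((gx - ax) / aw, (gy - ay) / ah, math.log(gw / aw), math.log(gh / ah))


def decode_deltas(anchor, deltas):
    """Apply predicted deltas to an anchor to recover a box (cx, cy, w, h)."""
    ax, ay, aw, ah = anchor
    tx, ty, tw, th = deltas
    return (ax + tx * aw, ay + ty * ah, aw * math.exp(tw), ah * math.exp(th))
```

Because the targets are relative to the anchor's position and size, they stay small and comparable across object scales, which is one reason anchors make regression easier to train.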

Question 9

9 Two-stage vs one-stage? 📊 medium

Answer

Answer: Two-stage: propose regions then classify (R-CNN family). One-stage: dense predictions in one pass (YOLO, SSD, RetinaNet)—usually faster, different error profile.

Question 10

10 Historical sliding window? ⚡ easy

Answer

Answer: Exhaustive windows + classifier—prohibitively slow; modern detectors replace with region proposals or dense anchors on feature maps.

Question 11

11 Multi-scale detection? 📊 medium

Answer

Answer: Image/feature pyramids, multi-scale anchors, or FPN so small and large objects are both seen at appropriate resolution.

Question 12

12 Why are small objects hard? 📊 medium

Answer

Answer: Few pixels on feature maps, weak signal—higher-res inputs, specialized heads, and data augmentation (copy-paste) help.

Question 13

13 COCO AP@[.5:.95]? 🔥 hard

Answer

Answer: Average AP over IoU thresholds 0.50 to 0.95 step 0.05—rewards localization quality, not just 0.5 IoU hits.

Question 14

14 Role of confidence score? ⚡ easy

Answer

Answer: Estimated probability of class (and sometimes objectness); used to rank predictions for the PR curve, to order boxes in NMS, and to drop low-score boxes by threshold.

Question 15

15 Multi-class boxes? 📊 medium

Answer

Answer: Each prediction has C-way softmax (or sigmoid per class for multi-label); match only within same predicted class for mAP.

Question 16

16 Assigning training targets? 📊 medium

Answer

Answer: Match anchors/points to GT by IoU or center rules; positives get box regression targets and class; negatives contribute to objectness / background loss.
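A sketch of IoU-based assignment, assuming the RPN thresholds from the Faster R-CNN paper (positive at IoU ≥ 0.7, negative below 0.3, ignored in between). The paper's extra rule that the best anchor for each GT is also positive is left out for brevity. Names are illustrative.

```python
def iou(a, b):
    """IoU of two (x_min, y_min, x_max, y_max) boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def assign_labels(anchors, gts, pos_thr=0.7, neg_thr=0.3):
    """Label each anchor: 1 = positive, 0 = negative, -1 = ignored in the loss."""
    labels = []
    for anchor in anchors:
        best = max((iou(anchor, gt) for gt in gts), default=0.0)
        if best >= pos_thr:
            labels.append(1)
        elif best < neg_thr:
            labels.append(0)
        else:
            labels.append(-1)  # ambiguous overlap: contributes to no loss term
    return labels
```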

Question 17

17 Many background anchors? 📊 medium

Answer

Answer: Extreme imbalance—addressed by sampling (hard negative mining), focal loss, or balanced loss weighting.
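Focal loss can be sketched for a single binary prediction as follows, assuming the defaults alpha = 0.25 and gamma = 2 reported in the RetinaNet paper. The function name and the clamping constant are illustrative.

```python
import math


def focal_loss(p, y, alpha=0.25, gamma=2.0):
    """Binary focal loss for one prediction.

    p: predicted probability of the positive class; y: label, 0 or 1.
    """
    p_t = p if y == 1 else 1.0 - p            # probability given to the true class
    alpha_t = alpha if y == 1 else 1.0 - alpha
    p_t = min(max(p_t, 1e-12), 1.0)           # avoid log(0)
    # (1 - p_t) ** gamma shrinks the loss of well-classified (easy) examples.
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)
```

With gamma = 0 this reduces to alpha-weighted cross-entropy. With gamma = 2, an easy background anchor scored at p = 0.01 contributes orders of magnitude less loss than a hard one scored at p = 0.9.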

Question 18

18 Latency drivers? ⚡ easy

Answer

Answer: Backbone depth, input resolution, number of proposals, NMS cost, batch size—profile end-to-end for deployment.

Question 19

19 Detection vs segmentation? ⚡ easy

Answer

Answer: Boxes are coarse; segmentation gives pixel masks—detection often first stage in two-stage instance segmentation.

Question 20

20 Per-image vs global AP? 📊 medium

Answer

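Answer: Standard benchmarks aggregate over the whole dataset: for each class, COCO and PASCAL VOC pool detections from all images into one ranked list before computing AP, instead of averaging a per-image AP. Check which convention a custom metric uses.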

Question 21

21 Original R-CNN steps? 📊 medium

Answer

Answer: Propose ~2k regions (selective search) → warp each to a fixed size → CNN features → per-class SVMs + bbox regressor. The CNN runs separately on every region with no shared computation, so it is very slow.

Question 22

22 Main bottleneck? ⚡ easy

Answer

Answer: Running CNN thousands of times per image on warped crops; also disk caching of features in early work.

Question 23

23 What did Fast R-CNN fix? 📊 medium

Answer

Answer: Run CNN once on full image; project RoIs onto feature map → RoI pool to fixed size → heads—big speedup + end-to-end backprop.

Question 24

24 How RoI pooling works? 📊 medium

Answer

Answer: Divide each RoI on feature map into H×W bins; max-pool each bin to fixed output—quantization loses subpixel alignment.
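A pure-Python sketch of RoI max pooling on a single-channel feature map, assuming the RoI is already in feature-map coordinates and lies inside the map. Real implementations work on batched tensors and also apply a spatial scale; the function name is illustrative.

```python
import math


def roi_max_pool(feat, roi, out_h, out_w):
    """feat: 2D list indexed [row][col]; roi: (x1, y1, x2, y2), floats allowed."""
    # Quantization step 1: snap fractional RoI corners to integer cells.
    x1, y1, x2, y2 = (int(math.floor(v + 0.5)) for v in roi)
    roi_w = max(x2 - x1 + 1, 1)
    roi_h = max(y2 - y1 + 1, 1)
    pooled = []
    for i in range(out_h):
        # Quantization step 2: bin edges are floored / ceiled to whole cells.
        ys = y1 + math.floor(i * roi_h / out_h)
        ye = y1 + math.ceil((i + 1) * roi_h / out_h)
        row = []
        for j in range(out_w):
            xs = x1 + math.floor(j * roi_w / out_w)
            xe = x1 + math.ceil((j + 1) * roi_w / out_w)
            row.append(max(feat[y][x] for y in range(ys, ye) for x in range(xs, xe)))
        pooled.append(row)
    return pooled
```

Note that the fractional RoI (0.2, 0.3, 2.6, 3.4) and the integer RoI (0, 0, 3, 3) pool exactly the same cells here, which is the loss of subpixel alignment described above.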

Question 25

25 What is Faster R-CNN? 🔥 hard

Answer

Answer: Replaces selective search with RPN that shares full-image conv features—learned proposals, joint training with detector.

Question 26

26 What does the RPN output? 🔥 hard

Answer

Answer: At each anchor location: objectness logits and box deltas to refine anchors—proposals passed to RoI head.

Question 27

27 Anchor scales/aspect ratios? 📊 medium

Answer

Answer: Multiple templates per location cover different object shapes; k anchors per cell → many candidate boxes before filtering by score + NMS.
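A sketch of generating the k anchor templates at one location, assuming the three scales and three aspect ratios from the Faster R-CNN paper (k = 9) and defining ratio as height / width. The function name is illustrative.

```python
import math


def make_anchors(cx, cy, scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Return len(scales) * len(ratios) anchors as (x_min, y_min, x_max, y_max).

    Each anchor is centered at (cx, cy), has area scale ** 2, and h / w = ratio.
    """
    anchors = []
    for s in scales:
        for r in ratios:
            w = s / math.sqrt(r)
            h = s * math.sqrt(r)
            anchors.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return anchors
```

Repeating this at every cell of an H×W feature map gives H * W * k candidates, which is why score filtering and NMS are needed afterwards.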

Question 28

28 Losses in Faster R-CNN? 🔥 hard

Answer

Answer: RPN: binary CE for objectness + smooth L1 for box deltas on assigned anchors; detector head: multi-class CE + bbox regression on positive RoIs.
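The smooth L1 term can be sketched as follows, assuming the common form with a transition point beta (1.0 here) and a plain sum over the four box deltas. Names are illustrative.

```python
def smooth_l1(x, beta=1.0):
    """Quadratic near zero, linear for |x| >= beta (less sensitive to outliers)."""
    ax = abs(x)
    if ax < beta:
        return 0.5 * ax * ax / beta
    return ax - 0.5 * beta


def box_regression_loss(pred_deltas, target_deltas, beta=1.0):
    """Sum of smooth L1 over (tx, ty, tw, th) for one positive anchor or RoI."""
    return sum(smooth_l1(p - t, beta) for p, t in zip(pred_deltas, target_deltas))
```

This loss is only evaluated on positives; negatives contribute to the classification terms alone.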

Question 29

29 Why FPN? 🔥 hard

Answer

Answer: A single high-level feature map is semantically strong but too coarse for small objects. FPN builds a top-down pyramid with lateral connections so every scale gets semantically rich, multi-scale RoI features.

Question 30

30 RoIAlign role? 📊 medium

Answer

Answer: Removes RoI pooling's quantization: bilinearly sample features at exact (fractional) RoI locations. Introduced in Mask R-CNN, where mask prediction is alignment-sensitive.
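The core operation, bilinear sampling at a fractional location, can be sketched on a single-channel feature map as below. This is only the sampling step: RoIAlign takes several such samples inside each output bin and pools them. The function name is illustrative.

```python
import math


def bilinear_sample(feat, x, y):
    """Sample feat (2D list indexed [row][col]) at fractional position (x, y)."""
    h, w = len(feat), len(feat[0])
    # Clamp to the valid range so border samples reuse the edge values.
    x = min(max(x, 0.0), w - 1)
    y = min(max(y, 0.0), h - 1)
    x0, y0 = int(math.floor(x)), int(math.floor(y))
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    dx, dy = x - x0, y - y0
    # Interpolate along x on the two neighboring rows, then along y.
    top = feat[y0][x0] * (1 - dx) + feat[y0][x1] * dx
    bottom = feat[y1][x0] * (1 - dx) + feat[y1][x1] * dx
    return top * (1 - dy) + bottom * dy
```

Because no coordinate is rounded, shifting the RoI by a fraction of a cell changes the sampled values smoothly.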

Question 31

31 What is Cascade R-CNN? 🔥 hard

Answer

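Answer: Sequence of detector stages trained with increasing IoU thresholds for positives (e.g., 0.5, 0.6, 0.7), each refining the previous stage's boxes. This avoids the overfitting that occurs when a single head is trained at a high IoU threshold with few positives, and improves AP at strict IoU.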

Question 32

32 NMS placement? ⚡ easy

Answer

Answer: After RPN (proposal NMS) and usually after final class-specific boxes—removes duplicate detections.

Question 33

33 Approximate joint training? 📊 medium

Answer

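Answer: The original paper used 4-step alternating training. Approximate joint training instead sums the RPN and detector losses in one network with a shared backbone, treating proposal coordinates as fixed (no gradient flows through them); modern implementations use this single-loss setup with careful sampling.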

Question 34

34 Two-stage strength? ⚡ easy

Answer

Answer: Typically higher mAP especially on challenging datasets vs comparable-era one-stage; slower inference.

Question 35

35 Mask R-CNN? 📊 medium

Answer

Answer: Adds mask branch to Faster R-CNN with RoIAlign—instance segmentation with modest overhead.

Question 36

36 Keypoint R-CNN? 📊 medium

Answer

Answer: Same framework with one-hot masks per keypoint or heatmap head—used for pose.

Question 37

37 Deformable conv in detectors? 🔥 hard

Answer

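Answer: Learns per-location offsets for the conv sampling grid, giving better geometric modeling of deformable objects; introduced in Deformable ConvNets (DCN) and commonly used in the later stages of ResNet backbones for Faster R-CNN-style detectors.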

Question 38

38 What is HTC? 🔥 hard

Answer

Answer: Hybrid Task Cascade—interleaves detection and segmentation stages with feature fusion—strong COCO instance segmentation.

Question 39

39 DETR vs R-CNN? 📊 medium

Answer

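Answer: DETR removes anchors and NMS by predicting a set of boxes with a transformer and training with one-to-one Hungarian matching; simpler pipeline, but the original model converges slowly and has a different compute profile.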

Question 40

40 When choose two-stage today? ⚡ easy

Answer

Answer: When max accuracy matters and latency budget allows, or when building on mature frameworks (Detectron2) with many pretrained configs.
