Object Detection 20 Essential Q/A


Object Detection: 20 Interview Questions

Master YOLO, SSD, Faster R-CNN, RetinaNet, DETR, anchor boxes, NMS, mAP, FPN, and one-stage vs two-stage trade-offs. Interview-ready answers with formulas.


1 What is object detection? How is it different from classification and segmentation? ⚡ Easy

Answer: Object detection = localization (bounding box) + classification. Classification assigns one label to the whole image; segmentation labels every pixel. Detection outputs a variable number of boxes, each with a class label and a confidence score.

Task: "Where are objects and what are they?"

2 What is IoU? How is it calculated? ⚡ Easy

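Answer: IoU = Area of Overlap / Area of Union between a predicted box and a ground-truth box. Range [0,1]. A detection counts as a true positive when its IoU with a matching ground-truth box meets a threshold (typically 0.5); otherwise it is a false positive.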

IoU = |A ∩ B| / |A ∪ B|
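The formula can be checked with a few lines of Python. This is a minimal sketch that assumes boxes are given as (x1, y1, x2, y2) corner coordinates; the function name `iou` is illustrative and not taken from any library.

```python
def iou(box_a, box_b):
    """IoU of two axis-aligned boxes in (x1, y1, x2, y2) format (assumed)."""
    # Corners of the intersection rectangle.
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])
    # Clamp at zero so disjoint boxes give an empty intersection.
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


# Two 2x2 boxes overlapping in a 1x1 square: IoU = 1 / (4 + 4 - 1) = 1/7.
print(iou((0, 0, 2, 2), (1, 1, 3, 3)))
```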

3 Explain mAP (mean Average Precision) for object detection. 🔥 Hard

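Answer: For each class, rank detections by confidence, mark each as TP or FP at a chosen IoU threshold, and compute Average Precision (area under the Precision-Recall curve). mAP is the mean of AP over classes. PASCAL VOC uses IoU 0.5; COCO mAP additionally averages over 10 IoU thresholds (0.50:0.05:0.95). Standard metric for detection.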
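A small sketch of AP for a single class at a single IoU threshold may help. It assumes the TP/FP decision for each detection has already been made, and it uses all-point interpolation of the precision-recall curve; COCO's official evaluation samples the curve at 101 recall points instead, so numbers can differ slightly. The function name and arguments are illustrative.

```python
def average_precision(scores, is_tp, num_gt):
    """AP for one class: area under the interpolated precision-recall curve.

    scores: confidence per detection; is_tp: True if that detection matched
    a ground-truth box; num_gt: number of ground-truth boxes for the class.
    """
    if num_gt == 0:
        return 0.0
    # Walk through detections from most to least confident.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    tp = fp = 0
    recalls, precisions = [], []
    for i in order:
        if is_tp[i]:
            tp += 1
        else:
            fp += 1
        recalls.append(tp / num_gt)
        precisions.append(tp / (tp + fp))
    # Precision envelope: make precision non-increasing as recall grows.
    for k in range(len(precisions) - 2, -1, -1):
        precisions[k] = max(precisions[k], precisions[k + 1])
    # Sum rectangle areas between consecutive recall values.
    ap, prev_recall = 0.0, 0.0
    for r, p in zip(recalls, precisions):
        ap += (r - prev_recall) * p
        prev_recall = r
    return ap


# Three detections (TP, FP, TP) against two ground-truth boxes.
print(average_precision([0.9, 0.8, 0.7], [True, False, True], num_gt=2))
```

mAP would then be the mean of this value over classes, and for COCO also over the ten IoU thresholds.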

4 One-stage vs two-stage detectors – trade-offs? 📊 Medium

Answer: Two-stage (Faster R-CNN): region proposal, then per-region classification and box refinement. Typically higher accuracy, slower. One-stage (YOLO, SSD): dense prediction in a single pass, faster, historically less accurate because of foreground-background imbalance. RetinaNet narrows the gap with Focal Loss.

Two-stage: accurate, R-CNN family

One-stage: fast, YOLO/SSD

5 What are anchor boxes? How are they designed? 🔥 Hard

Answer: Predefined bounding boxes of various scales/aspect ratios placed at each spatial location of a feature map. The model predicts offsets that refine each anchor into a final box. Designed via k-means on training-set box sizes (YOLOv2) or hand-picked (Faster R-CNN: 3 scales × 3 ratios = 9 anchors per location).
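The hand-picked 3 scales × 3 ratios layout can be sketched as follows. The scale and ratio values are illustrative defaults, and `make_anchors` is a hypothetical helper; the ratio is assumed to mean height / width, with each anchor keeping an area of scale².

```python
import math


def make_anchors(cx, cy, scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Return 9 anchors (x1, y1, x2, y2) centered at (cx, cy).

    Each anchor has area scale**2 and aspect ratio h / w = ratio (assumed).
    """
    anchors = []
    for s in scales:
        for r in ratios:
            w = s / math.sqrt(r)
            h = s * math.sqrt(r)
            anchors.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return anchors


anchors = make_anchors(0, 0)
print(len(anchors))   # 9 anchors at this location
print(anchors[1])     # the square anchor at the smallest scale
```

In a detector the same set is repeated at every location of the feature map, with centers spaced by the feature stride.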

6 How does Non-Maximum Suppression (NMS) work? 📊 Medium

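Answer: Removes duplicate detections of the same object. Sort boxes by confidence, keep the highest, suppress remaining boxes whose IoU with it exceeds a threshold, and repeat on what is left (usually per class). Soft-NMS decays scores instead of removing boxes. Standard NMS is non-differentiable, so it runs as post-processing.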
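The greedy procedure is short enough to write out. This is a plain-Python sketch for one class, assuming (x1, y1, x2, y2) boxes; `nms` and `_iou` are illustrative names, and production code would normally use a vectorized library routine instead.

```python
def _iou(a, b):
    """IoU of two (x1, y1, x2, y2) boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: return indices of kept boxes, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)          # most confident remaining box
        keep.append(best)
        # Drop every remaining box that overlaps the kept one too much.
        order = [i for i in order if _iou(boxes[best], boxes[i]) <= iou_threshold]
    return keep


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (50, 50, 60, 60)]
scores = [0.9, 0.8, 0.7]
print(nms(boxes, scores))  # box 1 overlaps box 0 (IoU about 0.68) and is suppressed
```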

7 Explain Faster R-CNN. Role of RPN? 🔥 Hard

Answer: Backbone → feature maps → Region Proposal Network (RPN) generates candidate boxes (objectness score + box regression per anchor). RoI Pooling crops a fixed-size feature for each proposal → classifier and bbox regressor. End-to-end trainable. RPN replaces selective search and shares the backbone features.
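The box regression output is not a box but a set of offsets relative to an anchor or proposal. A minimal sketch of the decoding step, assuming the common (tx, ty, tw, th) parameterization used by the R-CNN family (center shifts scaled by anchor size, log-scale width and height); `decode_box` is a hypothetical helper name.

```python
import math


def decode_box(anchor, deltas):
    """Apply regression offsets (tx, ty, tw, th) to an (x1, y1, x2, y2) anchor."""
    ax1, ay1, ax2, ay2 = anchor
    aw, ah = ax2 - ax1, ay2 - ay1
    acx, acy = ax1 + aw / 2, ay1 + ah / 2
    tx, ty, tw, th = deltas
    # Center moves by a fraction of the anchor size; size scales exponentially.
    cx = acx + tx * aw
    cy = acy + ty * ah
    w = aw * math.exp(tw)
    h = ah * math.exp(th)
    return (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)


# Zero offsets return the anchor itself; tx = 0.1 shifts it by 10% of its width.
print(decode_box((0, 0, 10, 20), (0, 0, 0, 0)))
print(decode_box((0, 0, 10, 20), (0.1, 0, 0, 0)))
```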

8 How does YOLO work? Loss function components? 🔥 Hard

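Answer: In YOLOv1, detection is a single regression: divide image into S×S grid; each cell predicts B boxes (x, y, w, h, confidence) and C class probabilities, and the cell containing an object's center is responsible for it. Loss: localization (xywh), confidence (objectness), classification, all as sum-squared error. Later versions add anchors and change the loss terms.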
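To make the grid idea concrete, here is a sketch of how a ground-truth box could be mapped to its responsible cell. It assumes box coordinates are normalized to [0, 1] by image size and that x, y targets are offsets within the cell; `yolo_cell_target` is an illustrative name, not a function from any YOLO codebase.

```python
def yolo_cell_target(cx, cy, w, h, S=7):
    """Map a normalized box (center cx, cy; size w, h) to its grid cell.

    Returns (row, col, (x_offset, y_offset, w, h)), where the offsets are the
    position of the box center inside the responsible cell, in [0, 1).
    """
    # The cell that contains the box center is responsible for the object.
    col = min(int(cx * S), S - 1)
    row = min(int(cy * S), S - 1)
    x_off = cx * S - col
    y_off = cy * S - row
    return row, col, (x_off, y_off, w, h)


# A box centered in the image lands in the middle cell of a 7x7 grid.
print(yolo_cell_target(0.5, 0.5, 0.2, 0.4))
```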

9 What makes SSD fast? Multi-scale feature maps? 📊 Medium

Answer: SSD predicts anchors on multiple feature maps at different resolutions (detects objects of various sizes). No RPN, fully convolutional. Faster than Faster R-CNN, competitive accuracy.

10 What is FPN? Why important? 🔥 Hard

Answer: Feature Pyramid Network: top-down pathway + lateral connections. Combines semantic-rich high-level features with high-resolution low-level features. Improves detection of small objects. Used in RetinaNet, Mask R-CNN.
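One top-down merge step can be sketched on tiny single-channel maps. This is a simplification under stated assumptions: nearest-neighbor 2× upsampling followed by element-wise addition, with the learned 1×1 lateral convolution and the 3×3 smoothing convolution of a real FPN left out. Function names are illustrative.

```python
def upsample2x(fm):
    """Nearest-neighbor 2x upsampling of a 2D list [H][W]."""
    out = []
    for row in fm:
        wide = [v for v in row for _ in range(2)]  # repeat each column twice
        out.append(wide)
        out.append(list(wide))                     # repeat each row twice
    return out


def top_down_merge(coarse, fine):
    """Add the upsampled coarse (semantic) map to the fine (high-res) map."""
    up = upsample2x(coarse)
    return [[u + f for u, f in zip(up_row, fine_row)]
            for up_row, fine_row in zip(up, fine)]


coarse = [[1, 2]]                      # 1x2 map from a deeper layer
fine = [[0, 0, 0, 0], [1, 1, 1, 1]]    # 2x4 map from a shallower layer
print(top_down_merge(coarse, fine))
```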

11 What problem does Focal Loss solve? How? 🔥 Hard

Answer: One-stage detectors suffer from extreme class imbalance (many easy negatives). Focal Loss down-weights easy examples so training focuses on hard, misclassified ones. Basic form: FL(p_t) = -(1-p_t)^γ log(p_t); γ = 0 recovers cross-entropy. The α-balanced form adds a class weight α_t. With it, RetinaNet matches two-stage accuracy.

FL(p_t) = -α_t (1-p_t)^γ log(p_t)
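A scalar version of the α-balanced formula shows the down-weighting effect. This is a sketch for one binary prediction; the defaults α = 0.25 and γ = 2 are the values commonly quoted for RetinaNet and are treated here as assumptions.

```python
import math


def focal_loss(p, y, alpha=0.25, gamma=2.0):
    """Alpha-balanced focal loss for one binary prediction.

    p: predicted probability of the positive class; y: label, 1 or 0.
    """
    p_t = p if y == 1 else 1.0 - p              # probability of the true class
    alpha_t = alpha if y == 1 else 1.0 - alpha  # class-balancing weight
    p_t = min(max(p_t, 1e-12), 1.0)             # avoid log(0)
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


# An easy negative (p = 0.01) contributes far less than a hard one (p = 0.6).
print(focal_loss(0.01, 0), focal_loss(0.6, 0))
```

With γ = 0 and α_t = 1 the expression reduces to ordinary cross-entropy, which is a quick sanity check for any implementation.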

12 How does DETR (Detection Transformer) work? 🔥 Hard

Answer: CNN backbone + Transformer encoder-decoder. Treats detection as set prediction: fixed N learnable object queries, bipartite matching loss (Hungarian). No anchors, no NMS. Slow convergence, but elegant.
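The bipartite matching step can be illustrated on a toy cost matrix. Note the swap: this sketch uses a brute-force search over assignments as a stand-in for the Hungarian algorithm, which is only practical for tiny inputs but returns the same optimal matching. The cost values are made up, and in DETR the cost combines class probability and box terms.

```python
from itertools import permutations


def match_queries(cost):
    """Optimal one-to-one matching of queries to ground-truth objects.

    cost[q][g] is the matching cost of query q and ground truth g, with at
    least as many queries as ground truths. Returns ({query: gt}, unmatched
    queries, total cost). Brute force, not the Hungarian algorithm itself.
    """
    n, m = len(cost), len(cost[0])
    best, best_total = None, float("inf")
    for perm in permutations(range(n), m):   # perm[g] = query assigned to gt g
        total = sum(cost[perm[g]][g] for g in range(m))
        if total < best_total:
            best, best_total = perm, total
    matched = {q: g for g, q in enumerate(best)}
    unmatched = [q for q in range(n) if q not in matched]
    return matched, unmatched, best_total


cost = [[0.9, 0.1],   # query 0
        [0.2, 0.8],   # query 1
        [0.5, 0.5]]   # query 2
matched, unmatched, total = match_queries(cost)
print(matched, unmatched)  # unmatched queries are trained to predict "no object"
```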

13 RoI Pooling vs RoI Align? Why Align better? 🔥 Hard

Answer: RoI Pooling quantizes (floor/ceil) causing misalignment. RoI Align uses bilinear interpolation at sample points, no quantization. Essential for pixel-precise tasks (Mask R-CNN). Improves detection accuracy.
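The difference shows up when sampling a feature map at a fractional coordinate. Below is a sketch of the bilinear interpolation that RoI Align relies on, compared with the floor-style quantization of RoI Pooling. It assumes a single-channel map stored as a 2D list and coordinates inside the map; `bilinear_sample` is an illustrative name.

```python
import math


def bilinear_sample(feat, y, x):
    """Bilinearly interpolate feat[H][W] at fractional (y, x), 0 <= y <= H-1."""
    h, w = len(feat), len(feat[0])
    y0, x0 = int(math.floor(y)), int(math.floor(x))
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
    dy, dx = y - y0, x - x0
    # Interpolate along x on the two neighboring rows, then along y.
    top = feat[y0][x0] * (1 - dx) + feat[y0][x1] * dx
    bottom = feat[y1][x0] * (1 - dx) + feat[y1][x1] * dx
    return top * (1 - dy) + bottom * dy


feat = [[0, 10],
        [20, 30]]
print(bilinear_sample(feat, 0.5, 0.5))  # blends all four neighbors
print(feat[int(0.5)][int(0.5)])         # quantized lookup snaps to feat[0][0]
```

RoI Align averages (or max-pools) several such samples per output bin, so the pooled feature stays aligned with the original box.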

14 What is Online Hard Example Mining? 🔥 Hard

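Answer: During training, compute the loss for all RoIs, then backpropagate only the top-k highest-loss RoIs (hard examples). Easy examples, mostly background, contribute no gradient. Proposed for Fast R-CNN and used in Faster R-CNN variants; SSD uses a related hard negative mining step that keeps the highest-loss negatives.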
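The selection step amounts to a top-k over per-RoI losses. A minimal sketch, assuming the losses have already been computed in a forward pass; `select_hard_examples` is a hypothetical helper, and averaging over the selected RoIs is one possible normalization.

```python
def select_hard_examples(losses, k):
    """Pick the k highest-loss RoIs and average the loss over them only."""
    order = sorted(range(len(losses)), key=lambda i: losses[i], reverse=True)
    hard = set(order[:k])
    # Mask out everything else so easy examples add nothing to the batch loss.
    mask = [1.0 if i in hard else 0.0 for i in range(len(losses))]
    batch_loss = sum(l * m for l, m in zip(losses, mask)) / max(len(hard), 1)
    return sorted(hard), batch_loss


losses = [0.1, 2.0, 0.05, 1.0]
print(select_hard_examples(losses, k=2))  # RoIs 1 and 3 are the hard examples
```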

15 How are positive/negative anchors assigned? 📊 Medium

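Answer: Faster R-CNN RPN: positive if IoU > 0.7 with any ground-truth box (or the anchor has the highest IoU for that box); negative if IoU < 0.3; anchors in between are ignored. YOLO assigns an object to the grid cell containing its center. RetinaNet uses a positive threshold of 0.5 and a negative threshold of 0.4. Adaptive strategies (ATSS) select thresholds per object from IoU statistics.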
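The threshold rule can be sketched given a precomputed anchor-to-ground-truth IoU matrix. The encoding 1 / 0 / -1 for positive / negative / ignored is a convention chosen for this example, and `assign_anchor_labels` is an illustrative name.

```python
def assign_anchor_labels(iou_matrix, pos_thr=0.7, neg_thr=0.3):
    """Label anchors from iou_matrix[anchor][gt].

    Returns a list with 1 (positive), 0 (negative) or -1 (ignored).
    """
    labels = []
    for row in iou_matrix:
        best = max(row) if row else 0.0
        if best > pos_thr:
            labels.append(1)
        elif best < neg_thr:
            labels.append(0)
        else:
            labels.append(-1)
    # Every ground-truth box also claims its highest-IoU anchor as positive,
    # so no object is left without a training target.
    num_gt = len(iou_matrix[0]) if iou_matrix else 0
    for g in range(num_gt):
        best_a = max(range(len(iou_matrix)), key=lambda a: iou_matrix[a][g])
        if iou_matrix[best_a][g] > 0:
            labels[best_a] = 1
    return labels


iou_matrix = [[0.8, 0.0],
              [0.5, 0.1],
              [0.1, 0.4],
              [0.0, 0.2]]
print(assign_anchor_labels(iou_matrix))
```

In this toy case the third anchor has IoU 0.4, below the positive threshold, but it becomes positive because it is the best match for the second ground-truth box.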

16 Common data augmentation strategies? 📊 Medium

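Answer: Random crop, flip, rotation, color jitter. Mosaic (YOLOv4): combine 4 images into one. MixUp, Cutout, GridMask. Every geometric transform must be applied to the bounding boxes as well, and boxes cropped out of view must be clipped or dropped.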
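Box consistency is easiest to see for a horizontal flip. A minimal sketch, assuming pixel-coordinate boxes in (x1, y1, x2, y2) format; the image itself would be flipped separately, and `hflip_boxes` is an illustrative name.

```python
def hflip_boxes(boxes, image_width):
    """Mirror (x1, y1, x2, y2) boxes to match a horizontally flipped image."""
    # x1 and x2 swap roles after mirroring so that x1 < x2 still holds.
    return [(image_width - x2, y1, image_width - x1, y2)
            for (x1, y1, x2, y2) in boxes]


boxes = [(10, 20, 30, 40)]
flipped = hflip_boxes(boxes, image_width=100)
print(flipped)                    # [(70, 20, 90, 40)]
print(hflip_boxes(flipped, 100))  # flipping twice restores the original box
```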

17 Instance segmentation vs panoptic segmentation? 📊 Medium

Answer: Instance: detect and segment each object instance (Mask R-CNN); background is left unlabeled. Panoptic: unifies instance segmentation (things) with semantic segmentation (stuff). Every pixel gets a class label, and pixels of things also get an instance ID.

18 Why are small objects hard to detect? Solutions? 🔥 Hard

Answer: Low resolution, few pixels, anchor mismatch, feature map downsampling. Solutions: FPN, high-resolution input, copy-paste augmentation, better anchor design, context modeling, GAN-based super-resolution.

19 Other metrics for object detection? 📊 Medium

Answer: FPS (speed), FLOPs, model size, AP@0.5, AP@0.75, AP_small/medium/large, recall, precision, inference latency. COCO also reports AR (average recall).

20 What problem does Deformable DETR solve? 🔥 Hard

Answer: DETR converges slowly and its global attention is expensive on high-resolution feature maps. Deformable DETR: each query attends to a small set of learned sampling points around a reference point (deformable attention). Enables multi-scale features, faster training, better small-object performance.

Object Detection – Interview Cheat Sheet

Detector Families

Two-stage: Faster R-CNN, Mask R-CNN, Cascade R-CNN
One-stage: YOLO, SSD, RetinaNet
Transformer: DETR, Deformable DETR

Key Components

Anchor boxes / object queries
NMS / Soft-NMS
FPN / BiFPN
RoI Align / RoI Pooling

Loss Functions

Focal Loss: class imbalance
GIoU/DIoU/CIoU: better box regression
Hungarian: DETR set loss

Speed vs Accuracy

YOLOv7: real-time
RetinaNet: balanced
Cascade R-CNN: high accuracy

Verdict: "Anchor-based dominates, but transformer detectors are rising. Know your IoU, NMS, and FPN."
