Object Detection: 20 Interview Questions
Master YOLO, SSD, Faster R-CNN, RetinaNet, DETR, anchor boxes, NMS, mAP, FPN, and one-stage vs two-stage trade-offs. Interview-ready answers with formulas.
1
What is object detection? How is it different from classification and segmentation?
⚡ Easy
Answer: Object detection = localization (bounding box) + classification. Classification assigns one label to the whole image; segmentation labels every pixel. Detection outputs a variable number of boxes, each with a class label and a confidence score.
Task: "Where are objects and what are they?"
2
What is IoU? How is it calculated?
⚡ Easy
Answer: IoU = Area of Overlap / Area of Union between predicted and ground truth boxes. Range [0,1]. Threshold (typically 0.5) determines true/false positive.
IoU = |A ∩ B| / |A ∪ B|
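A minimal sketch of the formula above, assuming axis-aligned boxes given as (x1, y1, x2, y2) corner coordinates. The function name `iou` and the box layout are illustrative choices, not part of the original answer.

```python
def iou(box_a, box_b):
    """IoU of two axis-aligned boxes in (x1, y1, x2, y2) format (assumed layout)."""
    # Corners of the intersection rectangle
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])
    # Width/height clamp to zero when the boxes do not overlap
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0
```

Two 2×2 boxes offset by one unit in each direction overlap in a 1×1 square, so their IoU is 1 / (4 + 4 - 1) = 1/7, well below the usual 0.5 threshold.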
3
Explain mAP (mean Average Precision) for object detection.
🔥 Hard
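Answer: For each class, rank detections by confidence, mark each as TP or FP at a fixed IoU threshold, and compute Average Precision (area under the Precision-Recall curve). mAP is the mean of AP over classes. COCO mAP additionally averages over IoU thresholds from 0.50 to 0.95 in steps of 0.05 (0.5:0.05:0.95). Standard metric for detection.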
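One way to sketch the per-class AP computation, assuming detections have already been matched to ground truth at a single IoU threshold. This uses all-point interpolation (PASCAL VOC 2010+ style); COCO's evaluator samples 101 recall points instead, so its numbers differ slightly. The function names and argument layout are hypothetical.

```python
def average_precision(scores, is_tp, num_gt):
    """All-point interpolated AP for one class.

    scores: confidence per detection
    is_tp:  True if that detection matched a ground-truth box (assumed precomputed)
    num_gt: number of ground-truth boxes for this class
    """
    if num_gt == 0:
        return 0.0
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    tp = fp = 0
    recalls, precisions = [], []
    for i in order:
        if is_tp[i]:
            tp += 1
        else:
            fp += 1
        recalls.append(tp / num_gt)
        precisions.append(tp / (tp + fp))
    # Precision envelope: make precision non-increasing as recall grows
    for k in range(len(precisions) - 2, -1, -1):
        precisions[k] = max(precisions[k], precisions[k + 1])
    # Sum rectangle areas under the interpolated curve
    ap, prev_recall = 0.0, 0.0
    for r, p in zip(recalls, precisions):
        ap += (r - prev_recall) * p
        prev_recall = r
    return ap


def mean_average_precision(per_class_ap):
    """mAP is the plain mean of the per-class AP values."""
    return sum(per_class_ap) / len(per_class_ap) if per_class_ap else 0.0
```

With three detections ranked TP, FP, TP against two ground-truth boxes, the interpolated curve gives AP = 0.5 × 1 + 0.5 × 2/3 = 5/6.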
4
One-stage vs two-stage detectors – trade-offs?
📊 Medium
Answer: Two-stage (Faster R-CNN): region proposal, then per-region classification and box refinement. Higher accuracy, slower. One-stage (YOLO, SSD): dense prediction in a single pass, faster, often the better speed-accuracy trade-off. RetinaNet (itself one-stage) closes much of the accuracy gap with Focal Loss.
Two-stage: accurate, R-CNN family
One-stage: fast, YOLO/SSD
5
What are anchor boxes? How are they designed?
🔥 Hard
Answer: Predefined bounding boxes of various scales/aspect ratios placed at each spatial location. Model predicts offsets to refine anchors. Designed via k-means on training set (YOLOv2) or hand-picked (Faster R-CNN: 3 scales × 3 ratios).
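A small sketch of the hand-picked 3 scales × 3 ratios design, generating the nine anchors for one spatial location. The scales (128, 256, 512) and ratios (0.5, 1, 2) are the values commonly cited for Faster R-CNN; the function name and the convention ratio = height / width are assumptions for illustration.

```python
import math


def make_anchors(cx, cy, scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Return anchors as (x1, y1, x2, y2), all centered at (cx, cy).

    Each anchor keeps area scale**2 while its aspect ratio (h / w, assumed) varies.
    """
    anchors = []
    for s in scales:
        for r in ratios:
            w = s / math.sqrt(r)
            h = s * math.sqrt(r)
            anchors.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return anchors
```

Sliding this over every feature-map location (with centers spaced by the backbone stride) produces the full anchor set; the network then predicts an offset and an objectness or class score per anchor.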
6
How does Non-Maximum Suppression (NMS) work?
📊 Medium
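Answer: Removes duplicate detections. Sort boxes by confidence, keep the highest, suppress all remaining boxes whose IoU with it exceeds the threshold, then repeat on what is left. Usually applied per class. Soft-NMS decays the scores of overlapping boxes instead of removing them. NMS is non-differentiable.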
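The greedy procedure can be sketched in plain Python as below, assuming (x1, y1, x2, y2) boxes for a single class. The helper names are illustrative, and production code would normally use a vectorized library routine instead.

```python
def iou(a, b):
    """IoU of two (x1, y1, x2, y2) boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.5):
    """Greedy hard NMS; returns indices of kept boxes, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)          # highest-scoring remaining box
        keep.append(best)
        # Drop every remaining box that overlaps the kept one too much
        order = [i for i in order if iou(boxes[best], boxes[i]) <= iou_threshold]
    return keep
```

For two heavily overlapping boxes and one distant box, only the higher-scoring box of the overlapping pair and the distant box survive.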
7
Explain Faster R-CNN. Role of RPN?
🔥 Hard
Answer: Backbone → Feature maps → Region Proposal Network (RPN) generates candidate boxes (objectness + regression). RoI Pooling crops features → classifier/bbox regressor. End-to-end trainable. RPN replaces selective search.
8
How does YOLO work? Loss function components?
🔥 Hard
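Answer (YOLOv1): Detection as a single regression. Divide the image into an S×S grid; each cell predicts B boxes (x, y, w, h, confidence) plus one set of C class probabilities, giving an S×S×(5B + C) output. The cell containing an object's center is responsible for it. Loss: localization (xywh), confidence (objectness), classification, all as sum-squared error.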
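To make the grid formulation concrete, here is a short sketch of the output size and of which cell is responsible for an object. The function names are hypothetical; the example values S = 7, B = 2, C = 20 correspond to the original YOLO configuration on PASCAL VOC.

```python
def yolo_output_size(S, B, C):
    """Number of values predicted per image: S*S cells, each with B*5 + C outputs."""
    return S * S * (B * 5 + C)


def responsible_cell(cx, cy, img_w, img_h, S):
    """(row, col) of the grid cell that contains the object's center."""
    col = min(int(cx / img_w * S), S - 1)   # clamp centers on the right edge
    row = min(int(cy / img_h * S), S - 1)   # clamp centers on the bottom edge
    return row, col
```

With S = 7, B = 2, C = 20 the network emits 7 × 7 × 30 = 1470 values, and an object centered at (224, 100) in a 448×448 image falls in row 1, column 3.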
9
What makes SSD fast? Multi-scale feature maps?
📊 Medium
Answer: SSD predicts boxes from anchors (default boxes) on multiple feature maps at different resolutions, so each map handles objects of a different size range. No RPN, fully convolutional, single pass. Faster than Faster R-CNN, competitive accuracy.
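A quick arithmetic sketch of multi-scale prediction, assuming the SSD300 layout as it is commonly described: six feature maps with 4 or 6 default boxes per location. The constant and function names are illustrative.

```python
# (feature map side, default boxes per location); assumed SSD300 configuration
SSD300_LAYERS = [(38, 4), (19, 6), (10, 6), (5, 6), (3, 4), (1, 4)]


def total_default_boxes(layers):
    """Total boxes scored in one forward pass across all feature maps."""
    return sum(side * side * k for side, k in layers)
```

Under this configuration the fine 38×38 map contributes most of the boxes (5776 of 8732) and covers small objects, while the 1×1 map contributes only 4 boxes for image-sized objects.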
10
What is FPN? Why important?
🔥 Hard
Answer: Feature Pyramid Network: top-down pathway + lateral connections. Combines semantic-rich high-level features with high-resolution low-level features. Improves detection of small objects. Used in RetinaNet, Mask R-CNN.
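The top-down pathway with a lateral connection can be illustrated on single-channel maps stored as nested lists. This is a deliberately stripped-down sketch: a real FPN applies a 1×1 convolution on the lateral branch and a 3×3 convolution after the merge, both omitted here, and the function names are hypothetical.

```python
def upsample2x(fmap):
    """Nearest-neighbor 2x upsampling of a 2D list."""
    out = []
    for row in fmap:
        wide = [v for v in row for _ in range(2)]  # repeat each column twice
        out.append(wide)
        out.append(list(wide))                     # repeat each row twice
    return out


def top_down_merge(coarse, fine):
    """Upsample the coarse (semantic) map and add it to the fine (high-res) map."""
    up = upsample2x(coarse)
    return [[u + f for u, f in zip(up_row, fine_row)]
            for up_row, fine_row in zip(up, fine)]
```

Repeating this merge from the deepest level down yields one feature map per scale, each carrying high-level semantics at its own resolution.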
11
What problem does Focal Loss solve? How?
🔥 Hard
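Answer: One-stage detectors suffer from extreme class imbalance (many easy negatives). Focal Loss down-weights easy examples so training focuses on hard, misclassified ones: the factor (1-p_t)^γ shrinks the loss as p_t → 1, and α_t balances positives against negatives. With γ = 0 and α_t = 1 it reduces to cross-entropy. RetinaNet, trained with it, matches two-stage accuracy.
FL(p_t) = -α_t (1-p_t)^γ log(p_t)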
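A scalar sketch of the formula for one binary prediction. The defaults α = 0.25 and γ = 2 are the values reported for RetinaNet; the function name and the clamping constant are illustrative assumptions.

```python
import math


def focal_loss(p, y, alpha=0.25, gamma=2.0):
    """Binary focal loss for predicted foreground probability p and label y in {0, 1}."""
    p_t = p if y == 1 else 1.0 - p               # probability of the true class
    alpha_t = alpha if y == 1 else 1.0 - alpha   # class-balancing weight
    p_t = min(max(p_t, 1e-12), 1.0)              # avoid log(0); assumed epsilon
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)
```

For an easy example with p_t = 0.9 and γ = 2, the modulating factor is (0.1)^2 = 0.01, so its loss is cut to about 1% of the cross-entropy value, while a hard example with p_t = 0.1 keeps 81% of it.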
12
How does DETR (Detection Transformer) work?
🔥 Hard
Answer: CNN backbone + Transformer encoder-decoder. Treats detection as set prediction: fixed N learnable object queries, bipartite matching loss (Hungarian). No anchors, no NMS. Slow convergence, but elegant.
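The bipartite matching step can be illustrated with a tiny cost matrix. Note the swap: this sketch finds the optimal one-to-one assignment by brute force over permutations, which is only feasible for a handful of boxes; the Hungarian algorithm (for example `scipy.optimize.linear_sum_assignment`) gives the same result efficiently. The function name and cost values are hypothetical.

```python
from itertools import permutations


def match_predictions(cost):
    """cost[i][j] = cost of assigning prediction i to ground-truth box j.

    Returns a list of (prediction_index, gt_index) pairs with minimal total cost.
    Assumes at least as many predictions as ground-truth boxes, as with DETR's
    fixed N queries; unmatched predictions are treated as "no object".
    """
    n_pred = len(cost)
    n_gt = len(cost[0]) if cost else 0
    best_total, best_perm = float("inf"), ()
    # perm[j] is the prediction assigned to ground-truth box j
    for perm in permutations(range(n_pred), n_gt):
        total = sum(cost[perm[j]][j] for j in range(n_gt))
        if total < best_total:
            best_total, best_perm = total, perm
    return [(best_perm[j], j) for j in range(n_gt)]
```

Because each ground-truth box is matched to exactly one query, duplicate predictions are penalized during training, which is why no NMS is needed at inference.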
13
RoI Pooling vs RoI Align? Why Align better?
🔥 Hard
Answer: RoI Pooling quantizes (floor/ceil) causing misalignment. RoI Align uses bilinear interpolation at sample points, no quantization. Essential for pixel-precise tasks (Mask R-CNN). Improves detection accuracy.
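The difference shows up even at a single sample point. The sketch below contrasts a quantized (floor) lookup with bilinear interpolation on a single-channel map stored as nested lists; a full RoI Align would average several such samples per output bin. Function names are illustrative, and coordinates are assumed to lie inside the map.

```python
import math


def quantized_sample(fmap, x, y):
    """RoI Pooling style: snap the coordinate to the cell at its floor."""
    return fmap[math.floor(y)][math.floor(x)]


def bilinear_sample(fmap, x, y):
    """RoI Align style: interpolate between the four neighboring cells."""
    h, w = len(fmap), len(fmap[0])
    x0, y0 = math.floor(x), math.floor(y)
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)  # clamp at the border
    dx, dy = x - x0, y - y0
    top = fmap[y0][x0] * (1 - dx) + fmap[y0][x1] * dx
    bottom = fmap[y1][x0] * (1 - dx) + fmap[y1][x1] * dx
    return top * (1 - dy) + bottom * dy
```

At the fractional location (0.5, 0.5) on the map [[0, 10], [20, 30]], the quantized lookup returns 0 while bilinear sampling returns 15, the value actually centered on that point.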
14
What is Online Hard Example Mining?
🔥 Hard
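Answer: During training, run a forward pass on all RoIs, then backpropagate only through the top-k with the highest loss (hard examples). Easy negatives are ignored, which removes the need for hand-tuned sampling ratios and improves robustness. Introduced for Fast R-CNN-style detectors; SSD uses a related hard negative mining step that keeps the highest-loss negatives at about a 3:1 negative-to-positive ratio.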
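The selection step itself is simple, as this sketch shows for a list of per-RoI losses. The function names are hypothetical, and in a real training loop the losses would be tensors with gradients flowing only through the selected entries.

```python
def select_hard_examples(losses, k):
    """Indices of the k RoIs with the highest loss, returned in ascending index order."""
    order = sorted(range(len(losses)), key=lambda i: losses[i], reverse=True)
    return sorted(order[:k])


def ohem_loss(losses, k):
    """Mean loss over the selected hard examples only."""
    hard = select_hard_examples(losses, k)
    return sum(losses[i] for i in hard) / len(hard) if hard else 0.0
```

For losses [0.1, 2.0, 0.05, 1.5, 0.3] and k = 2, only RoIs 1 and 3 contribute, giving a training loss of 1.75 instead of the plain mean of 0.79.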
15
How are positive/negative anchors assigned?
📊 Medium
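Answer: Faster R-CNN RPN: positive if IoU > 0.7 with a ground-truth box (or if it is the highest-IoU anchor for that box); negative if IoU < 0.3 with every box; anchors in between are ignored. YOLO assigns each object to the grid cell containing its center. RetinaNet uses IoU ≥ 0.5 for positives and IoU < 0.4 for negatives. Adaptive strategies (ATSS) set the threshold per object from the IoU statistics of candidate anchors and improve results.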
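An RPN-style assignment can be sketched as follows, using the 0.7 / 0.3 thresholds from the answer. The label encoding (1 positive, 0 negative, -1 ignored) and function names are assumptions for illustration.

```python
def iou(a, b):
    """IoU of two (x1, y1, x2, y2) boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def assign_anchors(anchors, gt_boxes, pos_thr=0.7, neg_thr=0.3):
    """Label each anchor: 1 = positive, 0 = negative, -1 = ignored."""
    labels = []
    for a in anchors:
        best = max((iou(a, g) for g in gt_boxes), default=0.0)
        if best > pos_thr:
            labels.append(1)
        elif best < neg_thr:
            labels.append(0)
        else:
            labels.append(-1)
    # Guarantee every ground-truth box at least one positive anchor
    for g in gt_boxes:
        ious = [iou(a, g) for a in anchors]
        if ious and max(ious) > 0:
            labels[ious.index(max(ious))] = 1
    return labels
```

The second loop is the "or highest IoU" rule: a ground-truth box whose best anchor only reaches IoU 0.5 would otherwise have no positive anchor at all.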
16
Common data augmentation strategies?
📊 Medium
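Answer: Random crop, flip, rotation, color jitter. Mosaic (YOLOv4): combine 4 images into one. MixUp, Cutout, GridMask. Keep boxes consistent: every geometric transform applied to the image must also be applied to its bounding boxes, and boxes cropped out of view must be clipped or dropped.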
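Box consistency is easiest to see with a horizontal flip. The sketch below assumes (x1, y1, x2, y2) pixel boxes and a hypothetical helper name; the image itself would be flipped separately.

```python
def hflip_boxes(boxes, img_w):
    """Mirror (x1, y1, x2, y2) boxes across the vertical center line of the image.

    The old right edge becomes the new left edge, so x1 and x2 swap roles.
    """
    return [(img_w - x2, y1, img_w - x1, y2) for (x1, y1, x2, y2) in boxes]
```

A box spanning x = 10 to 50 in a 100-pixel-wide image becomes x = 50 to 90 after the flip, and flipping twice returns the original box.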
17
Instance segmentation vs panoptic segmentation?
📊 Medium
Answer: Instance: detect and segment each object instance (Mask R-CNN); background is left unlabeled. Panoptic: unify instance segmentation (things) with semantic segmentation (stuff). Every pixel gets a semantic label, and pixels belonging to things also get a unique instance ID.
18
Why are small objects hard to detect? Solutions?
🔥 Hard
Answer: Low resolution, few pixels, anchor mismatch, feature map downsampling. Solutions: FPN, high-resolution input, copy-paste augmentation, better anchor design, context modeling, GAN-based super-resolution.
19
Other metrics for object detection?
📊 Medium
Answer: FPS (speed), FLOPs, model size, AP@0.5, AP@0.75, AP_small/medium/large, recall, precision, inference latency. COCO also reports AR (average recall).
20
What problem does Deformable DETR solve?
🔥 Hard
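Answer: DETR converges slowly, and its global attention scales quadratically with the number of feature-map pixels, which rules out high-resolution maps. Deformable DETR: each query attends only to a small set of learned key sampling points around a reference point (deformable attention). Enables multi-scale features, faster training, better small object performance.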
Object Detection – Interview Cheat Sheet
Detector Families
- Two-stage: Faster R-CNN, Mask R-CNN, Cascade R-CNN
- One-stage: YOLO, SSD, RetinaNet
- Transformer: DETR, Deformable DETR
Key Components
- Anchor boxes / object queries
- NMS / Soft-NMS
- FPN / BiFPN
- RoI Align / RoI Pooling
Loss Functions
- Focal Loss: class imbalance
- GIoU/DIoU/CIoU: better box regression
- Hungarian: DETR set loss
Speed vs Accuracy
- YOLOv7: real-time
- RetinaNet: balanced
- Cascade R-CNN: high accuracy
Verdict: "Anchor-based dominates, but transformer detectors are rising. Know your IoU, NMS, and FPN."