One-Stage Object Detection — Interview Q&A

Question 1

1 What does YOLO mean? 📊 medium

Answer

Answer: You Only Look Once. A single forward pass predicts boxes and classes, treating detection as a regression problem over a grid of cells.

Question 2

2 YOLOv1 grid idea? 📊 medium

Answer

Answer: The image is split into an S×S grid; the cell containing an object's center is responsible for that object. Each cell predicts B boxes (each with a confidence score) plus one class distribution shared by the cell.
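
A minimal sketch of the "responsible cell" rule, assuming pixel-space box centers and the S = 7 grid on a 448×448 input used in the original paper. The function name and return layout are illustrative, not taken from any YOLO codebase.

```python
def responsible_cell(cx, cy, img_w, img_h, S=7):
    """Return (row, col, dx, dy): the grid cell holding the box center
    and the center's offset inside that cell (normally in [0, 1))."""
    gx = cx / img_w * S          # center in grid units along x
    gy = cy / img_h * S          # center in grid units along y
    col = min(int(gx), S - 1)    # clamp centers that sit on the right/bottom edge
    row = min(int(gy), S - 1)
    return row, col, gx - col, gy - row

# A box centered at (300, 150) in a 448x448 image lands in row 2, col 4.
print(responsible_cell(300, 150, 448, 448))
```

The offsets (dx, dy) are what the cell regresses for the box center; width and height are predicted relative to the whole image.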

Question 3

3 YOLOv1 loss components? 🔥 hard

Answer

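Answer: Coordinate regression (on √w and √h, so the same absolute error costs more on small boxes), confidence regressed toward the IoU with the matched ground truth, and classification. The original paper uses sum-squared error for every term, classification included, not cross-entropy. λ_coord = 5 up-weights localization and λ_noobj = 0.5 down-weights the confidence loss from cells without objects.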
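
To see why the square-root trick matters, the sketch below compares the same 10-pixel width error on a small and a large box. The helper name is hypothetical; it only evaluates the (√pred − √target)² term from the size loss.

```python
import math

def sqrt_size_error(pred, target):
    """Squared error between square roots, as used for w and h in YOLOv1."""
    return (math.sqrt(pred) - math.sqrt(target)) ** 2

small = sqrt_size_error(30.0, 20.0)    # 10 px error on a 20 px wide box
large = sqrt_size_error(210.0, 200.0)  # 10 px error on a 200 px wide box

# Plain squared error would score both as 100; the sqrt version penalizes
# the small box several times more than the large one.
print(round(small, 3), round(large, 3))
```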

Question 4

4 When did anchors appear? 📊 medium

Answer

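Answer: YOLOv2 introduced them. Anchor priors come from k-means clustering of the dataset's box sizes (with 1 − IoU as the distance), and the network predicts offsets relative to those priors instead of raw sizes, which stabilizes training.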
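
A rough sketch of anchor clustering, assuming boxes are given as (width, height) pairs and that "nearest" means highest IoU when both boxes share a corner (equivalent to the 1 − IoU distance from the YOLOv2 paper). Function names, the iteration count, and the seed are illustrative choices.

```python
import numpy as np

def wh_iou(boxes, anchors):
    """IoU between (N, 2) box sizes and (K, 2) anchor sizes aligned at one corner."""
    inter = (np.minimum(boxes[:, None, 0], anchors[None, :, 0])
             * np.minimum(boxes[:, None, 1], anchors[None, :, 1]))
    area_b = boxes[:, 0] * boxes[:, 1]
    area_a = anchors[:, 0] * anchors[:, 1]
    return inter / (area_b[:, None] + area_a[None, :] - inter)

def kmeans_anchors(boxes, k, iters=50, seed=0):
    """Cluster box sizes into k anchors; returns anchors sorted by area."""
    rng = np.random.default_rng(seed)
    anchors = boxes[rng.choice(len(boxes), size=k, replace=False)].astype(float)
    for _ in range(iters):
        assign = wh_iou(boxes, anchors).argmax(axis=1)  # highest IoU = closest
        for j in range(k):
            members = boxes[assign == j]
            if len(members):                            # leave empty clusters unchanged
                anchors[j] = members.mean(axis=0)
    return anchors[np.argsort(anchors[:, 0] * anchors[:, 1])]

sizes = np.array([[10, 10], [12, 11], [11, 12],
                  [100, 200], [110, 190], [105, 205]], dtype=float)
print(kmeans_anchors(sizes, k=2))
```

Using IoU rather than Euclidean distance keeps large boxes from dominating the clustering just because their absolute size errors are bigger.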

Question 5

5 IoU in training? 📊 medium

Answer

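Answer: Anchors or cells are assigned to ground-truth boxes by best IoU. Some versions (YOLOv3, for example) ignore predictions that overlap a ground truth above an IoU threshold without being its best match, excluding them from the objectness loss so near-misses are not penalized as background.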
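
One way to picture IoU-based assignment, as a sketch only: it assumes corner-format boxes and a YOLOv3-style rule where the best anchor per ground truth is positive and other anchors overlapping above `ignore_thresh` are ignored. The label encoding (1, −1, 0) and function names are made up for illustration.

```python
import numpy as np

def iou_matrix(a, b):
    """Pairwise IoU between (N, 4) and (M, 4) boxes in (x1, y1, x2, y2) format."""
    x1 = np.maximum(a[:, None, 0], b[None, :, 0])
    y1 = np.maximum(a[:, None, 1], b[None, :, 1])
    x2 = np.minimum(a[:, None, 2], b[None, :, 2])
    y2 = np.minimum(a[:, None, 3], b[None, :, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area_a = (a[:, 2] - a[:, 0]) * (a[:, 3] - a[:, 1])
    area_b = (b[:, 2] - b[:, 0]) * (b[:, 3] - b[:, 1])
    return inter / (area_a[:, None] + area_b[None, :] - inter)

def assign(anchors, gt, ignore_thresh=0.5):
    """Label anchors: 1 = positive, -1 = ignored in the loss, 0 = negative."""
    iou = iou_matrix(anchors, gt)
    labels = np.zeros(len(anchors), dtype=int)
    labels[iou.max(axis=1) > ignore_thresh] = -1  # high overlap but maybe not the best
    labels[iou.argmax(axis=0)] = 1                # best anchor for each ground truth
    return labels

anchors = np.array([[0, 0, 10, 10], [1, 1, 11, 11], [50, 50, 60, 60]], dtype=float)
gt = np.array([[0, 0, 10, 10]], dtype=float)
print(assign(anchors, gt))
```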

Question 6

6 Post-processing? ⚡ easy

Answer

Answer: As with other detectors: decode the boxes, then run class-wise NMS on the scores. Some variants use DIoU-NMS or Soft-NMS.
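
A compact sketch of greedy, class-wise NMS in NumPy, assuming boxes are already decoded to (x1, y1, x2, y2). Production code would normally call a library routine; the functions here are illustrative.

```python
import numpy as np

def nms(boxes, scores, iou_thresh=0.5):
    """Greedy NMS; returns indices of kept boxes, highest score first."""
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    order = np.argsort(-scores)
    keep = []
    while order.size > 0:
        i, rest = order[0], order[1:]
        keep.append(int(i))
        w = np.minimum(boxes[i, 2], boxes[rest, 2]) - np.maximum(boxes[i, 0], boxes[rest, 0])
        h = np.minimum(boxes[i, 3], boxes[rest, 3]) - np.maximum(boxes[i, 1], boxes[rest, 1])
        inter = np.clip(w, 0, None) * np.clip(h, 0, None)
        iou = inter / (areas[i] + areas[rest] - inter)
        order = rest[iou <= iou_thresh]   # drop boxes overlapping the kept one
    return keep

def classwise_nms(boxes, scores, classes, iou_thresh=0.5):
    """Run NMS separately per class so different classes never suppress each other."""
    keep = []
    for c in np.unique(classes):
        idx = np.flatnonzero(classes == c)
        keep += [int(idx[k]) for k in nms(boxes[idx], scores[idx], iou_thresh)]
    return sorted(keep)

boxes = np.array([[0, 0, 10, 10], [1, 1, 11, 11], [50, 50, 60, 60]], dtype=float)
scores = np.array([0.9, 0.8, 0.7])
print(nms(boxes, scores))
```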

Question 7

7 Objectness vs class? ⚡ easy

Answer

Answer: Objectness answers whether the anchor contains any object; the class output says which one. Many heads predict the two separately and multiply them for the final score (objectness × class probability).
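
The score combination can be sketched in a couple of lines, assuming per-anchor objectness and per-class probabilities are already passed through their activations. The function name and the sample numbers are illustrative.

```python
import numpy as np

def final_scores(objectness, class_probs):
    """Combine (N,) objectness with (N, C) class probabilities into (N, C) scores."""
    return objectness[:, None] * class_probs

obj = np.array([0.9, 0.2])
cls = np.array([[0.1, 0.8, 0.1],
                [0.6, 0.3, 0.1]])
scores = final_scores(obj, cls)
# The second anchor is confident about its class but has low objectness,
# so its best score ends up far below the first anchor's.
print(scores.argmax(axis=1), scores.max(axis=1))
```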

Question 8

8 Multi-scale YOLO? 📊 medium

Answer

Answer: Later versions predict at multiple feature map scales (e.g. large/small stride) to catch objects of different sizes—similar spirit to FPN.

Question 9

9 Path aggregation? 📊 medium

Answer

Answer: Models like YOLOv4 use PANet-style bottom-up path after top-down FPN for richer multi-scale features.

Question 10

10 YOLOv5/v8 / Ultralytics? ⚡ easy

Answer

Answer: Popular PyTorch implementations from Ultralytics with a model zoo, export, and deployment tooling. In interviews, “practical YOLO” often refers to this ecosystem.

Question 11

11 Deploy on edge? 📊 medium

Answer

Answer: Export to ONNX, TensorRT, CoreML—quantize INT8 for speed; validate mAP drop after conversion.

Question 12

12 Small objects? 📊 medium

Answer

Answer: Higher-res input, smaller stride heads, copy-paste aug, or tiling—same fundamentals as other detectors.

Question 13

13 Crowded objects? 🔥 hard

Answer

Answer: Grid responsibility and NMS can struggle—improved assignment (e.g. ATSS-style ideas in some detectors) and better NMS help.

Question 14

14 Common augmentations? 📊 medium

Answer

Answer: Mosaic, mixup, HSV jitter, random scale—strong aug standard in modern YOLO training recipes.

Question 15

15 mAP vs FPS tradeoff? ⚡ easy

Answer

Answer: Larger model and image size ↑ mAP, ↓ FPS—choose for product SLA (latency vs accuracy).

Question 16

16 YOLO vs SSD? 📊 medium

Answer

Answer: Both one-stage; SSD uses multi-scale default boxes on VGG features; YOLO family evolved different heads and assignment—both real-time capable.

Question 17

17 YOLO vs RetinaNet? 📊 medium

Answer

Answer: RetinaNet introduced focal loss for dense classification imbalance; YOLO uses different obj loss weighting—both dense predictors.

Question 18

18 Tiling satellite / huge images? 📊 medium

Answer

Answer: Split image, run YOLO per tile with overlap, merge + NMS—handle boundary duplicates.
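
A small sketch of the tiling bookkeeping, assuming square tiles and a fixed pixel overlap smaller than the tile size. The detector call itself is left out; the helper names are hypothetical. After shifting every tile's boxes to global coordinates, a single NMS pass over the merged set removes the duplicates from overlapping regions.

```python
def tile_origins(length, tile, overlap):
    """Start offsets along one axis so tiles cover [0, length) and
    neighbouring tiles share at least `overlap` pixels."""
    if length <= tile:
        return [0]
    stride = tile - overlap
    starts = list(range(0, length - tile, stride))
    starts.append(length - tile)   # last tile sits flush with the border
    return starts

def to_global(boxes, x0, y0):
    """Shift tile-local (x1, y1, x2, y2) boxes by the tile origin."""
    return [(x1 + x0, y1 + y0, x2 + x0, y2 + y0) for x1, y1, x2, y2 in boxes]

print(tile_origins(2000, tile=640, overlap=128))
print(to_global([(10, 20, 30, 40)], x0=360, y0=0))
```

Choosing an overlap at least as large as the biggest expected object means every object appears whole in at least one tile.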

Question 19

19 Rotated boxes? 🔥 hard

Answer

Answer: Variants predict angle θ or use rotated IoU—needed for aerial/text detection.

Question 20

20 Real-time on CPU? 📊 medium

Answer

Answer: Choose nano/tiny backbones, reduce input size, INT8—expect large accuracy gap vs GPU server models.

Question 21

21 What is RetinaNet? 📊 medium

Answer

Answer: One-stage detector with FPN backbone and focal loss on dense classification—closes accuracy gap to two-stage without proposals.

Question 22

22 Focal loss intuition? 🔥 hard

Answer

Answer: Down-weights easy negatives (well-classified background) so training focuses on hard examples. Without it, the summed CE loss of the many easy negatives dominates the gradient.

Question 23

23 Role of γ (gamma)? 🔥 hard

Answer

Answer: Focusing parameter: the factor (1 − p_t)^γ shrinks the loss for high-confidence correct predictions. γ = 0 recovers (α-weighted) cross-entropy; the paper's default is γ = 2 with α = 0.25.
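
A minimal NumPy sketch of binary focal loss, FL(p_t) = −α_t (1 − p_t)^γ log(p_t), assuming `p` holds sigmoid outputs and `y` holds 0/1 labels. The defaults γ = 2 and α = 0.25 follow the RetinaNet paper; the clipping constant is an arbitrary choice for numerical safety.

```python
import numpy as np

def focal_loss(p, y, gamma=2.0, alpha=0.25):
    """Element-wise binary focal loss for probabilities p and labels y in {0, 1}."""
    p = np.clip(p, 1e-7, 1 - 1e-7)
    p_t = np.where(y == 1, p, 1 - p)              # probability of the true label
    alpha_t = np.where(y == 1, alpha, 1 - alpha)  # class-balancing weight
    return -alpha_t * (1 - p_t) ** gamma * np.log(p_t)

p = np.array([0.1, 0.1])   # predicted foreground probability
y = np.array([0, 1])       # easy negative, hard positive
print(focal_loss(p, y, gamma=0.0))  # alpha-weighted cross-entropy
print(focal_loss(p, y, gamma=2.0))  # easy negative is scaled down 100x
```

With γ = 2, the easy negative (p_t = 0.9) keeps only (0.1)² = 1% of its cross-entropy loss, while the hard positive (p_t = 0.1) keeps 81%.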

Question 24

24 Why imbalance in one-stage? 📊 medium

Answer

Answer: ~100k anchors per image with few positives—vanilla CE is dominated by easy background classifications.
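
The order of magnitude can be checked with quick arithmetic. This sketch assumes pyramid strides of 8 to 128 (P3 to P7) and 9 anchors per location as in the original RetinaNet paper, and approximates feature-map sizes with ceiling division; real implementations may round slightly differently.

```python
import math

def anchor_count(height, width, strides=(8, 16, 32, 64, 128), anchors_per_loc=9):
    """Approximate number of anchors over all pyramid levels for one image."""
    total = 0
    for s in strides:
        total += math.ceil(height / s) * math.ceil(width / s) * anchors_per_loc
    return total

# 640x640 input: (6400 + 1600 + 400 + 100 + 25) locations x 9 anchors
print(anchor_count(640, 640))  # 76725
```

With only a handful of objects per image, nearly all of these anchors are background, which is the imbalance focal loss targets.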

Question 25

25 How does FPN help RetinaNet? 📊 medium

Answer

Answer: Predicts at multiple pyramid levels P3–P7 with shared heads—each level responsible for objects in a scale range.

Question 26

26 Subnet design? 📊 medium

Answer

Answer: Separate small conv classification and box regression subnets applied per level—4 conv layers each in original paper.

Question 27

27 Anchors? ⚡ easy

Answer

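Answer: Similar to RPN: multiple scales and aspect ratios per location (9 anchors in the original paper, 3 scales × 3 ratios). The classification head predicts the class with one sigmoid per class, and the regression head predicts box deltas relative to the anchor.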

Question 28

28 Box regression loss? ⚡ easy

Answer

Answer: Smooth L1 on positive anchors only—standard in Faster R-CNN lineage.
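
For reference, a short NumPy sketch of smooth L1: quadratic for small errors, linear for large ones. The transition point `beta = 1.0` is an assumed default; frameworks expose it under different names and defaults.

```python
import numpy as np

def smooth_l1(pred, target, beta=1.0):
    """Element-wise smooth L1: 0.5*d^2/beta if |d| < beta, else |d| - 0.5*beta."""
    diff = np.abs(pred - target)
    return np.where(diff < beta, 0.5 * diff ** 2 / beta, diff - 0.5 * beta)

# Small error is penalized quadratically, large error only linearly.
print(smooth_l1(np.array([0.5, 3.0]), np.array([0.0, 0.0])))  # [0.125 2.5]
```

In a detector this would be summed over the four box deltas of positive anchors only; negatives contribute nothing to the regression term.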

Question 29

29 RetinaNet vs SSD? 📊 medium

Answer

Answer: Both are multi-scale one-stage detectors. RetinaNet’s focal loss directly addresses the training imbalance that SSD tackled only partly with hard negative mining.

Question 30

30 RetinaNet vs two-stage? 📊 medium

Answer

Answer: No separate proposal stage—simpler pipeline; historically competitive mAP on COCO with proper FPN + focal loss.

Question 31

31 Training tips? 📊 medium

Answer

Answer: Longer schedules help; careful anchor matching; synchronized BN on multi-GPU for large batch stability.

Question 32

32 Inference cost? ⚡ easy

Answer

Answer: Single backbone forward + per-level heads + NMS—faster than two-stage but still heavier than tiny YOLO variants.

Question 33

33 Anchor-free successors? 🔥 hard

Answer

Answer: FCOS, CenterNet, and DETR remove hand-designed anchors. Focal loss ideas still influence the classification loss in some of these heads.

Question 34

34 Why sigmoid per class? 📊 medium

Answer

Answer: It treats the K classes as K independent binary classifiers, which handles the rare multi-label case and avoids the mutual exclusivity that softmax imposes.

Question 35

35 Unified loss? ⚡ easy

Answer

Answer: Focal classification loss summed over all anchors plus smooth L1 regression loss on the positive (assigned) anchors only, normalized by the number of positive anchors.

Question 36

36 Variants of focal loss? 🔥 hard

Answer

Answer: Quality focal loss, balanced loss, GHM—adjust weighting scheme for hard/easy examples differently.

Question 37

37 IoU-aware classification? 🔥 hard

Answer

Answer: Some heads predict joint IoU quality with class to better rank detections—post-RetinaNet refinement.

Question 38

38 Historical COCO note? ⚡ easy

Answer

Answer: RetinaNet showed one-stage could match two-stage mAP around 2017—important milestone before transformer detectors.

Question 39

39 Limitations? 📊 medium

Answer

Answer: Many hyperparameters (α, γ, anchor design); dense preds still need NMS; superseded in some tracks by newer architectures.

Question 40

40 When reuse focal loss? ⚡ easy

Answer

Answer: Any extreme class imbalance in dense prediction—segmentation, keypoint heatmaps, or custom detectors.
