Evaluation & Benchmarks — Interview Q&A

Question 1

1 Why care about metrics? ⚡ easy

Answer

Answer: They define success criteria, let you compare models, and expose tradeoffs (precision vs recall); optimizing the wrong metric optimizes the wrong behavior.

Question 2

2 When is accuracy misleading? 📊 medium

Answer

Answer: Under class imbalance: with 99% negatives, always predicting negative gives 99% accuracy while finding no positives; use per-class and balanced metrics.

Question 3

3 Define precision and recall. 📊 medium

Answer

Answer: Precision = TP/(TP+FP); Recall = TP/(TP+FN)—tension controlled by decision threshold.
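
A minimal sketch of how the threshold trades precision against recall, assuming binary labels (1 = positive) and model scores. The helper name `precision_recall` and the sample data are illustrative, not from any library.

```python
def precision_recall(y_true, scores, threshold):
    """Return (precision, recall) for binary labels at one score threshold."""
    tp = fp = fn = 0
    for label, score in zip(y_true, scores):
        predicted_positive = score >= threshold
        if predicted_positive and label == 1:
            tp += 1
        elif predicted_positive and label == 0:
            fp += 1
        elif not predicted_positive and label == 1:
            fn += 1
    # Guard against empty denominators (no predicted or no actual positives).
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


y_true = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.6, 0.4, 0.7, 0.2, 0.1]
low = precision_recall(y_true, scores, 0.3)   # precision 0.75, recall 1.0
high = precision_recall(y_true, scores, 0.8)  # precision 1.0, recall 1/3
```

Raising the threshold from 0.3 to 0.8 removes the false positive but also drops two true positives.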

Question 4

4 F1? ⚡ easy

Answer

Answer: Harmonic mean of precision and recall, F1 = 2PR/(P+R), so a low value of either drags it down; common single-number summary for binary tasks, and as macro-F1 for multiclass.

Question 5

5 Confusion matrix? 📊 medium

Answer

Answer: Counts predictions vs truth for all classes—shows confusion pairs and supports per-class recall.
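
A small sketch of building the matrix and reading per-class recall off its rows, assuming integer class ids and the rows-are-truth convention. Function names and data are illustrative.

```python
def confusion_matrix(y_true, y_pred, num_classes):
    """matrix[t][p] counts examples with true class t predicted as class p."""
    matrix = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(y_true, y_pred):
        matrix[t][p] += 1
    return matrix


def per_class_recall(matrix):
    """Diagonal entry divided by the row sum (all examples of that true class)."""
    recalls = []
    for i, row in enumerate(matrix):
        total = sum(row)
        recalls.append(row[i] / total if total else 0.0)
    return recalls


y_true = [0, 0, 0, 1, 1, 2, 2, 2]
y_pred = [0, 0, 1, 1, 1, 2, 0, 2]
cm = confusion_matrix(y_true, y_pred, 3)  # [[2, 1, 0], [0, 2, 0], [1, 0, 2]]
recalls = per_class_recall(cm)
```

The off-diagonal entries show which pairs get confused: here class 0 leaks into class 1 and class 2 into class 0.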

Question 6

6 ROC / AUC? 🔥 hard

Answer

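Answer: TPR vs FPR curve as the threshold sweeps; AUC summarizes ranking quality (the probability that a random positive scores above a random negative) and does not depend on the class prior, which makes it useful for comparing rankers.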
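
One way to see AUC as a ranking statistic is to count positive-negative pairs directly. This is an O(n²) sketch for intuition, not how libraries compute it; `roc_auc` is an illustrative name.

```python
def roc_auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half a win.
    """
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


auc = roc_auc([1, 1, 1, 0, 0, 0], [0.9, 0.6, 0.4, 0.7, 0.2, 0.1])  # 7 of 9 pairs
```

Duplicating every negative example leaves this ratio unchanged, which is the sense in which AUC ignores the class prior.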

Question 7

7 What is IoU? 📊 medium

Answer

Answer: Intersection over union of predicted vs ground-truth boxes/masks—range [0,1]; standard match criterion in detection.
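
A minimal box-IoU sketch, assuming boxes are given as (x1, y1, x2, y2) corner coordinates; `box_iou` is an illustrative helper.

```python
def box_iou(a, b):
    """IoU of two axis-aligned boxes in (x1, y1, x2, y2) format."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


iou = box_iou((0, 0, 10, 10), (5, 5, 15, 15))  # 25 / 175
```

A detection typically counts as a match when this value meets the benchmark's threshold, for example 0.5.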

Question 8

8 What is mAP in detection? 🔥 hard

Answer

Answer: Mean AP across classes—AP is area under precision–recall curve after IoU-thresholded matches; COCO averages multiple IoU thresholds.
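
A sketch of AP for one class, assuming the IoU matching has already marked each detection as true or false positive. It uses all-point interpolation (area under the monotone precision envelope); COCO's official code samples precision at 101 recall points instead, so numbers can differ slightly. Names are illustrative.

```python
def average_precision(scores, is_tp, num_gt):
    """Area under the precision-recall curve for one class.

    scores: detection confidences; is_tp: whether each detection matched a
    ground-truth object; num_gt: number of ground-truth objects.
    """
    if num_gt == 0 or not scores:
        return 0.0
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    tp = fp = 0
    recalls, precisions = [], []
    for i in order:
        if is_tp[i]:
            tp += 1
        else:
            fp += 1
        recalls.append(tp / num_gt)
        precisions.append(tp / (tp + fp))
    # Make precision non-increasing in recall (the interpolation envelope).
    for k in range(len(precisions) - 2, -1, -1):
        precisions[k] = max(precisions[k], precisions[k + 1])
    ap, prev_recall = 0.0, 0.0
    for r, p in zip(recalls, precisions):
        ap += (r - prev_recall) * p
        prev_recall = r
    return ap


ap = average_precision([0.9, 0.8, 0.7, 0.6], [True, False, True, False], num_gt=2)
```

mAP is then the mean of this value over classes, and for COCO also over IoU thresholds.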

Question 9

9 AP vs mAP? 📊 medium

Answer

Answer: AP per class; mAP averages classes—report AP50 vs AP75 to show coarse vs tight localization skill.

Question 10

10 NMS effect on metrics? 📊 medium

Answer

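Answer: NMS removes lower-scoring boxes that overlap a higher-scoring one before evaluation, so its IoU threshold shifts both precision and recall; keep post-processing consistent with the benchmark's rules when comparing numbers (soft-NMS decays scores instead of deleting boxes and gives different results).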
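
A greedy NMS sketch, assuming (x1, y1, x2, y2) boxes and a single class. Real detectors run this per class and vectorized; `nms` and `box_iou` here are illustrative helpers.

```python
def box_iou(a, b):
    """IoU of two boxes in (x1, y1, x2, y2) format."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.5):
    """Return indices of kept boxes, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep a box only if it does not overlap any already-kept box too much.
        if all(box_iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
kept = nms(boxes, [0.9, 0.8, 0.7])  # [0, 2]: box 1 overlaps box 0 with IoU ~0.68
```

Raising `iou_threshold` keeps more near-duplicates, which can help recall in crowded scenes but adds false positives elsewhere.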

Question 11

11 Segmentation IoU? 📊 medium

Answer

Answer: Per-class IoU on pixels; mean IoU (mIoU) across classes—ignore void label per dataset protocol.
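
A sketch of mIoU over flattened label maps. The void label value 255 is an assumption (a common convention, but check the dataset's protocol), and classes absent from both prediction and target are skipped in the mean, which is also a protocol choice.

```python
def mean_iou(pred, target, num_classes, ignore_index=255):
    """Mean of per-class pixel IoU; pixels labeled ignore_index are skipped."""
    inter = [0] * num_classes
    union = [0] * num_classes
    for p, t in zip(pred, target):
        if t == ignore_index:
            continue
        if p == t:
            inter[t] += 1
            union[t] += 1
        else:
            # A wrong pixel enlarges the union of both classes involved.
            union[p] += 1
            union[t] += 1
    ious = [i / u for i, u in zip(inter, union) if u > 0]
    return sum(ious) / len(ious) if ious else 0.0


target = [0, 0, 1, 1, 255, 1]
pred = [0, 1, 1, 1, 0, 0]
miou = mean_iou(pred, target, num_classes=2)  # class IoUs: 1/3 and 1/2
```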

Question 12

12 Dice coefficient? 📊 medium

Answer

Answer: 2|A∩B|/(|A|+|B|); equal to F1 computed over mask pixels and related to IoU by Dice = 2·IoU/(1+IoU); common in medical segmentation with class imbalance.
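
A quick check of the Dice formula on binary masks, alongside IoU on the same masks. The masks are flattened lists of 0/1 and the function names are illustrative; returning 1.0 for two empty masks is a convention chosen here, not a standard.

```python
def dice(pred, target):
    """Dice coefficient of two binary masks given as flat 0/1 lists."""
    inter = sum(1 for p, t in zip(pred, target) if p and t)
    total = sum(pred) + sum(target)
    return 2 * inter / total if total else 1.0


def mask_iou(pred, target):
    """IoU of the same two masks, for comparison."""
    inter = sum(1 for p, t in zip(pred, target) if p and t)
    union = sum(1 for p, t in zip(pred, target) if p or t)
    return inter / union if union else 1.0


pred = [1, 1, 0, 0, 1]
target = [1, 0, 0, 1, 1]
d = dice(pred, target)        # 2*2 / (3+3) = 2/3
iou = mask_iou(pred, target)  # 2 / 4 = 0.5
```

Here 2·IoU/(1+IoU) = 1.0/1.5 = 2/3, matching the Dice value, so the two metrics rank a single prediction identically but Dice is always the larger number.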

Question 13

13 Threshold tuning? 🔥 hard

Answer

Answer: Pick operating point on validation to meet product constraint (min recall)—don’t tune on test set.
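
A sketch of picking an operating point on validation data: the highest threshold that still meets a minimum-recall constraint, which keeps the number of predicted positives as small as possible. `pick_threshold` and the data are illustrative.

```python
def pick_threshold(y_true, scores, min_recall):
    """Highest score threshold whose recall on (y_true, scores) >= min_recall.

    Returns None when there are no positives in the validation data.
    """
    num_pos = sum(y_true)
    if num_pos == 0:
        return None
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(y_true, scores) if y == 1 and s >= threshold)
        if tp / num_pos >= min_recall:
            return threshold
    return None


val_labels = [1, 1, 1, 0, 0, 0]
val_scores = [0.9, 0.6, 0.4, 0.7, 0.2, 0.1]
threshold = pick_threshold(val_labels, val_scores, min_recall=0.66)  # 0.6
```

The chosen threshold is then frozen and applied unchanged to the test set.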

Question 14

14 Micro vs macro averaging? 🔥 hard

Answer

Answer: Micro pools all examples; macro averages per-class stats—macro highlights rare class performance.
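
A small sketch contrasting the two averages for F1, assuming per-class (TP, FP, FN) counts are already available. The counts are made up: one frequent class the model handles well and one rare class it handles badly.

```python
def f1(tp, fp, fn):
    """F1 from raw counts; equals 2PR/(P+R)."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def micro_macro_f1(per_class_counts):
    """per_class_counts: list of (tp, fp, fn) tuples, one per class."""
    # Macro: compute F1 per class, then average, so each class counts equally.
    macro = sum(f1(*c) for c in per_class_counts) / len(per_class_counts)
    # Micro: pool the counts first, so frequent classes dominate.
    tp = sum(c[0] for c in per_class_counts)
    fp = sum(c[1] for c in per_class_counts)
    fn = sum(c[2] for c in per_class_counts)
    return f1(tp, fp, fn), macro


micro, macro = micro_macro_f1([(90, 5, 5), (1, 5, 9)])
# micro is about 0.88, macro about 0.54
```

The gap between the two numbers is the rare class's poor F1 (0.125) showing through in the macro average.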

Question 15

15 OKS in pose? 🔥 hard

Answer

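Answer: Object keypoint similarity passes each keypoint's distance error through a Gaussian whose width scales with object area and a per-keypoint constant (tighter for eyes than hips), then averages over labeled keypoints; COCO pose AP thresholds OKS the way detection AP thresholds IoU.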
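
A sketch of the OKS formula, exp(-d² / (2·s²·k²)) averaged over labeled keypoints, where s² is the object area and k is a per-keypoint constant. The `k` values in the example are placeholders, not COCO's published constants.

```python
import math


def oks(pred, gt, visibility, area, k):
    """Object keypoint similarity for one person instance.

    pred, gt: lists of (x, y); visibility: > 0 means the keypoint is labeled;
    area: ground-truth object area (s squared); k: per-keypoint constants.
    """
    total, count = 0.0, 0
    for (px, py), (gx, gy), v, ki in zip(pred, gt, visibility, k):
        if v <= 0:
            continue  # unlabeled keypoints do not contribute
        d2 = (px - gx) ** 2 + (py - gy) ** 2
        total += math.exp(-d2 / (2 * area * ki ** 2))
        count += 1
    return total / count if count else 0.0


gt = [(10, 10), (20, 20)]
pred = [(10, 10), (23, 24)]  # second keypoint is off by 5 pixels
score = oks(pred, gt, visibility=[2, 2], area=100.0, k=[0.5, 0.5])
```

The same 5-pixel error on an object with four times the area would be penalized less, which is what lets a single OKS threshold apply to people at different scales.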

Question 16

16 Calibration? 📊 medium

Answer

Answer: Predicted probabilities match empirical frequencies—ECE, reliability diagrams; miscalibration hurts downstream decisions.
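
A sketch of expected calibration error with equal-width confidence bins. The bin count of 10 is a common default rather than a fixed standard, and the function name is illustrative.

```python
def expected_calibration_error(confidences, correct, num_bins=10):
    """Weighted average of |mean confidence - accuracy| over confidence bins."""
    n = len(confidences)
    if n == 0:
        return 0.0
    bins = [[] for _ in range(num_bins)]
    for conf, ok in zip(confidences, correct):
        # Confidence 1.0 goes into the last bin.
        idx = min(int(conf * num_bins), num_bins - 1)
        bins[idx].append((conf, ok))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += len(members) / n * abs(avg_conf - accuracy)
    return ece


# Four predictions made at 95% confidence, only half of them right.
ece = expected_calibration_error([0.95] * 4, [True, True, False, False])
```

The per-bin (mean confidence, accuracy) pairs are exactly the points plotted in a reliability diagram.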

Question 17

17 Sampling bias? 📊 medium

Answer

Answer: Geographic, demographic, or capture bias inflates benchmark scores—report subgroup metrics.

Question 18

18 Benchmark leakage? ⚡ easy

Answer

Answer: Test images in pretraining data or duplicate near-neighbors—contaminates leaderboard comparisons.

Question 19

19 Human baseline? ⚡ easy

Answer

Answer: Annotator agreement sets ceiling—if model beats humans, check task ambiguity or evaluation bugs.

Question 20

20 What to report? 📊 medium

Answer

Answer: Primary metric + confidence intervals or multiple seeds, compute budget, and failure cases—not leaderboard cherry-picking.
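
One way to attach a confidence interval to an accuracy number is a percentile bootstrap over test examples. This sketch assumes per-example 0/1 correctness; the resample count and seed are arbitrary choices, and it captures test-set sampling noise only, not variation across training seeds.

```python
import random


def bootstrap_ci(per_example_correct, num_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for mean accuracy."""
    rng = random.Random(seed)
    n = len(per_example_correct)
    means = []
    for _ in range(num_resamples):
        # Resample the test set with replacement and recompute accuracy.
        sample = [per_example_correct[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    lower = means[int((alpha / 2) * num_resamples)]
    upper = means[min(num_resamples - 1, int((1 - alpha / 2) * num_resamples))]
    return lower, upper


results = [1] * 80 + [0] * 20  # 80% accuracy on 100 examples
lower, upper = bootstrap_ci(results)
```

With only 100 test examples the interval is several points wide, which is a reminder that small leaderboard gaps on small test sets may not be meaningful.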

Question 21

21 What is ImageNet? ⚡ easy

Answer

Answer: Large-scale image dataset organized by WordNet synsets—millions of labeled images driving classification pretraining.

Question 22

22 ILSVRC? 📊 medium

Answer

Answer: Annual challenge subset (~1.2M train, 50k val, 1000 classes) used historically for ImageNet-1K classification benchmarks.

Question 23

23 What is a synset? 📊 medium

Answer

Answer: A WordNet synonym set: a group of words sharing one sense (e.g. a specific dog breed); each ImageNet class is one synset, which disambiguates polysemous words.

Question 24

24 Scale? ⚡ easy

Answer

Answer: Roughly 1.28M training images for 1K ILSVRC classes—enough diversity to learn general visual features.

Question 25

25 Why report top-5 error? 📊 medium

Answer

Answer: Images often contain several objects but carry one label, and fine-grained classes make a single exact match harsh; top-5 was the standard headline metric in the AlexNet era.
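
A sketch of top-k accuracy from raw class scores; top-k error is one minus this value. The function name and the three-class toy data are illustrative.

```python
def top_k_accuracy(logits, labels, k=5):
    """Fraction of examples whose true label is among the k highest scores."""
    hits = 0
    for row, label in zip(logits, labels):
        ranked = sorted(range(len(row)), key=lambda i: row[i], reverse=True)
        if label in ranked[:k]:
            hits += 1
    return hits / len(labels)


logits = [[0.1, 0.7, 0.2], [0.5, 0.3, 0.2], [0.2, 0.3, 0.5]]
labels = [1, 1, 0]
top1 = top_k_accuracy(logits, labels, k=1)  # 1/3
top2 = top_k_accuracy(logits, labels, k=2)  # 2/3
```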

Question 26

26 Val vs test? 📊 medium

Answer

Answer: Public val for development; test held out for leaderboard—reproducible papers compare on val with fixed split.

Question 27

27 Canonical preprocessing? 🔥 hard

Answer

Answer: Short side resize 256, center crop 224, mean/std normalization—must match weights (different for Inception vs ResNet sometimes).
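
A sketch of the geometry and normalization only, without an image library. The mean/std values are the ones commonly shipped with ImageNet-pretrained weights, but they are an assumption here: check what a given checkpoint expects. Rounding conventions also vary between frameworks.

```python
# Commonly used ImageNet RGB statistics on a 0-1 scale (assumed, verify per model).
IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)


def resize_and_center_crop_box(width, height, short_side=256, crop=224):
    """Return the resized (w, h) and the center-crop box (x1, y1, x2, y2)."""
    scale = short_side / min(width, height)
    new_w, new_h = round(width * scale), round(height * scale)
    left = (new_w - crop) // 2
    top = (new_h - crop) // 2
    return (new_w, new_h), (left, top, left + crop, top + crop)


def normalize_pixel(rgb):
    """Map one 0-255 RGB pixel to mean/std-normalized floats."""
    return tuple((c / 255.0 - m) / s
                 for c, m, s in zip(rgb, IMAGENET_MEAN, IMAGENET_STD))


size, box = resize_and_center_crop_box(500, 375)  # (341, 256), (58, 16, 282, 240)
black = normalize_pixel((0, 0, 0))
```

Evaluating with a different resize/crop ratio or different statistics than the weights were trained with can cost accuracy without raising any error.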

Question 28

28 Transfer learning role? 📊 medium

Answer

Answer: A backbone trained on ImageNet learns reusable features (edges, textures, object parts); finetune it on small domain datasets with a smaller LR.

Question 29

29 Freeze backbone? 📊 medium

Answer

Answer: When data is tiny, train only the head at first and unfreeze the backbone later; BN layers need care during finetuning, since small batches give noisy batch statistics.

Question 30

30 Noise / bias? 🔥 hard

Answer

Answer: Crowdsourced labels contain errors; geographic and demographic skew—ImageNet audit projects documented issues.

Question 31

31 Hierarchical labels? 📊 medium

Answer

Answer: WordNet tree enables hierarchical metrics and zero-shot transfer—not all models exploit hierarchy in loss.

Question 32

32 Classic augmentations? ⚡ easy

Answer

Answer: RandomResizedCrop, flip, color jitter—standard on ImageNet training recipes (RRC is critical for ResNet).

Question 33

33 Tiny ImageNet? ⚡ easy

Answer

Answer: Teaching subset (200 classes, 64×64)—useful for coursework; not same distribution as full IN.

Question 34

34 Relation to Open Images? 📊 medium

Answer

Answer: Different project (multi-label, boxes)—don’t confuse with ImageNet-1K single-label classification.

Question 35

35 Licensing? ⚡ easy

Answer

Answer: Images scraped from web with varying rights—research use common; commercial redeployment needs legal review.

Question 36

36 ObjectNet lesson? 🔥 hard

Answer

Answer: Controls viewpoint/background—shows ImageNet-trained models rely on spurious cues; stresses robust evaluation.

Question 37

37 Beyond single-label? 📊 medium

Answer

Answer: Web-scale image-text (CLIP) reduces reliance on pure ImageNet classification for pretraining—still often finetuned with IN-like data.

Question 38

38 EfficientNet story? 📊 medium

Answer

Answer: Compound scaling depth/width/resolution on ImageNet—Pareto frontier influenced mobile deployment targets.

Question 39

39 ViT on ImageNet? 🔥 hard

Answer

Answer: Transformers need large data or strong augmentation plus pretraining; ImageNet-1K alone is much smaller than JFT, and hybrid CNN-transformer models mattered early.

Question 40

40 Still pretrain on IN? ⚡ easy

Answer

Answer: Yes, it is still a common baseline even as larger multimodal corpora grow; ImageNet remains the reference for architecture comparisons.

Question 41

41 What is MS COCO? ⚡ easy

Answer

Answer: Common Objects in Context—benchmark for detection, segmentation, captions, and keypoints with rich scene images.

Question 42

42 Which tasks? 📊 medium

Answer

Answer: Object detection (bbox), instance seg, panoptic (thing+stuff), image captioning, person keypoints—each has metrics.

Question 43

43 JSON annotations? 📊 medium

Answer

Answer: COCO format lists images, categories, annotations with bbox [x,y,w,h], segmentation polygons/RLE, area, iscrowd flag.

Question 44

44 Bbox format? ⚡ easy

Answer

Answer: Top-left x,y plus width,height in pixels—convert carefully vs xyxy conventions in codebases.
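
A sketch of the two conversions, since mixing them up is a frequent silent bug. The helper names are illustrative.

```python
def xywh_to_xyxy(box):
    """COCO-style (x, y, width, height) to corner format (x1, y1, x2, y2)."""
    x, y, w, h = box
    return (x, y, x + w, y + h)


def xyxy_to_xywh(box):
    """Corner format (x1, y1, x2, y2) back to COCO-style (x, y, width, height)."""
    x1, y1, x2, y2 = box
    return (x1, y1, x2 - x1, y2 - y1)


coco_box = (10, 20, 30, 40)
corners = xywh_to_xyxy(coco_box)  # (10, 20, 40, 60)
```

A box passed in the wrong convention still looks like four plausible numbers, so a round-trip check like `xyxy_to_xywh(xywh_to_xyxy(b)) == b` in data-loading tests is cheap insurance.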

Question 45

45 Instance masks? 📊 medium

Answer

Answer: Stored as polygons for individual objects and as RLE for crowd regions; pycocotools converts both to RLE, and Mask R-CNN-style training decodes them to binary masks per instance.
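
A sketch of decoding COCO's uncompressed RLE, where `counts` alternates run lengths of 0s and 1s (starting with 0s) over pixels in column-major order. In practice pycocotools handles this, including the compressed string form; this pure-Python version is only for intuition.

```python
def decode_uncompressed_rle(rle):
    """Decode {"size": [height, width], "counts": [...]} into a 2D 0/1 mask."""
    height, width = rle["size"]
    flat = []
    value = 0  # runs alternate 0, 1, 0, 1, ... starting with zeros
    for run in rle["counts"]:
        flat.extend([value] * run)
        value = 1 - value
    if len(flat) != height * width:
        raise ValueError("run lengths do not cover the mask")
    # Pixels are stored column by column, so index as column * height + row.
    return [[flat[col * height + row] for col in range(width)]
            for row in range(height)]


mask = decode_uncompressed_rle({"size": [2, 3], "counts": [1, 2, 3]})
# [[0, 1, 0],
#  [1, 0, 0]]
```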

Question 46

46 Panoptic on COCO? 🔥 hard

Answer

Answer: Unifies semantic “stuff” and instance “things” with PQ metric—requires non-overlapping label assignment per pixel.
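
A sketch of the PQ formula for one class, assuming segments have already been matched (a predicted and a ground-truth segment match when their IoU exceeds 0.5). The function name and numbers are illustrative.

```python
def panoptic_quality(matched_ious, num_fp, num_fn):
    """PQ = sum of matched IoUs / (TP + 0.5 * FP + 0.5 * FN).

    matched_ious: IoU of each matched (predicted, ground-truth) segment pair;
    num_fp: unmatched predicted segments; num_fn: unmatched ground-truth segments.
    """
    tp = len(matched_ious)
    denom = tp + 0.5 * num_fp + 0.5 * num_fn
    return sum(matched_ious) / denom if denom else 0.0


pq = panoptic_quality([0.8, 0.6], num_fp=1, num_fn=1)  # 1.4 / 3
```

PQ factors into segmentation quality (mean IoU of matches, 0.7 here) times recognition quality (an F1-style term, 2/3 here), which helps tell poor masks apart from missed segments.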

Question 47

47 Captions? 📊 medium

Answer

Answer: Multiple human captions per image—evaluation with BLEU/CIDEr/SPICE; encourages descriptive models.

Question 48

48 Keypoints? 📊 medium

Answer

Answer: 17 body joints for person instances—AP computed with OKS instead of IoU for matching.

Question 49

49 Detection mAP on COCO? 🔥 hard

Answer

Answer: Primary AP averaged over IoU thresholds 0.5:0.05:0.95 (AP@[.5:.95]) plus AP50, AP75—stricter than VOC AP50 only.
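
A sketch of the averaging step only: ten IoU thresholds from 0.50 to 0.95, with one AP value per threshold. The per-threshold AP numbers below are invented for illustration, and the official pycocotools code additionally averages over classes.

```python
def coco_iou_thresholds():
    """The ten thresholds 0.50, 0.55, ..., 0.95."""
    return [round(0.5 + 0.05 * i, 2) for i in range(10)]


def average_over_thresholds(ap_by_threshold):
    """Mean AP over the COCO IoU thresholds (AP@[.5:.95])."""
    thresholds = coco_iou_thresholds()
    return sum(ap_by_threshold[t] for t in thresholds) / len(thresholds)


# Hypothetical per-threshold APs that fall as the IoU requirement tightens.
ap_by_threshold = {t: 1.0 - t for t in coco_iou_thresholds()}
ap = average_over_thresholds(ap_by_threshold)
ap50 = ap_by_threshold[0.5]
ap75 = ap_by_threshold[0.75]
```

A model with a high AP50 but a much lower averaged AP finds objects but localizes them loosely.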

Question 50

50 Mask AP? 📊 medium

Answer

Answer: AP computed on mask IoU instead of box IoU—segmentation quality can differ from bbox AP ranking.

Question 51

51 80 classes? ⚡ easy

Answer

Answer: The 80 are thing categories for detection; stuff classes are added in the panoptic/stuff annotations. Don’t confuse them with the legacy 91-category list that still appears in some code.

Question 52

52 train/val/test? 📊 medium

Answer

Answer: train2017, val2017 public; test-dev hidden for leaderboard—papers report val metrics for fair comparison.

Question 53

53 pycocotools? 📊 medium

Answer

Answer: Official eval code for mAP, mask IoU, RLE decode—implementations should match to reproduce leaderboard numbers.

Question 54

54 Small objects? 📊 medium

Answer

Answer: COCO reports AP_S/M/L by area—models struggle on small; anchors/FPN designs target scale variance.
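
A sketch of the size buckets, using COCO's area boundaries of 32² and 96² pixels. How values exactly on a boundary are treated is simplified here and may differ from the official evaluation code.

```python
def coco_size_bucket(area):
    """Classify an object by pixel area into 'small', 'medium', or 'large'."""
    if area < 32 ** 2:
        return "small"
    if area < 96 ** 2:
        return "medium"
    return "large"


buckets = [coco_size_bucket(w * h) for w, h in [(20, 20), (50, 50), (100, 100)]]
# ['small', 'medium', 'large']
```

Reporting AP per bucket shows whether an overall gain comes from small objects or only from large, easy ones.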

Question 55

55 iscrowd flag? 🔥 hard

Answer

Answer: Marks regions covering groups where individual instances cannot be separated; in COCO evaluation, detections that match a crowd region are ignored rather than counted as false positives.

Question 56

56 Eval servers? ⚡ easy

Answer

Answer: Upload predictions for test sets—prevents test overfitting; val is for iteration.

Question 57

57 Relation to LVIS? 📊 medium

Answer

Answer: Large-vocabulary, long-tail instance segmentation dataset built on COCO images; similar tooling, but a far more skewed category-frequency spectrum, so rare-class performance is reported separately.

Question 58

58 Image source? ⚡ easy

Answer

Answer: Photos of everyday scenes collected from Flickr; more contextual clutter than ImageNet's object-centric photos.

Question 59

59 Why default benchmark? 📊 medium

Answer

Answer: Challenging scale, multi-task labels, standardized API—dominant for comparing detectors and instance seg models.

Question 60

60 Common pitfalls? 📊 medium

Answer

Answer: Wrong bbox convention, ignoring iscrowd, different NMS thresholds, or not using official eval—numbers won’t match papers.
