Computer Vision Interview 60 Q&A Chapter 20

Evaluation & Benchmarks — Interview Q&A

Detection and segmentation metrics (IoU, mAP), plus ImageNet and COCO benchmark datasets.


CV Evaluation Metrics: 20 Essential Q&A

1 Why care about metrics? ⚡ easy
Answer: They define success criteria, compare models, and expose tradeoffs (precision vs recall)—wrong metric optimizes the wrong behavior.
2 When is accuracy misleading? 📊 medium
Answer: With imbalanced classes: if 99% of samples are negative, always predicting negative scores 99% accuracy while finding no positives; use per-class and balanced metrics instead.
3 Define precision and recall. 📊 medium
Answer: Precision = TP/(TP+FP); Recall = TP/(TP+FN)—tension controlled by decision threshold.
4 F1? ⚡ easy
Answer: Harmonic mean of precision and recall, F1 = 2PR/(P+R); it is low whenever either one is low. Common single-number summary for binary tasks, and macro-averaged (macro-F1) for multiclass.
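The accuracy, precision, recall, and F1 definitions above can be sketched from raw confusion counts. This is a minimal illustration; the function name `binary_metrics` and the example counts are made up for the demo.

```python
def binary_metrics(tp, fp, fn, tn):
    """Return accuracy, precision, recall and F1 from confusion counts.

    Undefined ratios (zero denominator) are reported as 0.0 by convention.
    """
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical imbalanced set: 10 positives, 990 negatives,
# and a model that predicts "negative" for every sample.
always_negative = binary_metrics(tp=0, fp=0, fn=10, tn=990)

# Hypothetical detector with a real precision/recall tradeoff.
detector = binary_metrics(tp=8, fp=2, fn=8, tn=982)
```

The first case has high accuracy but zero recall and zero F1, which is the failure mode described in question 2.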
5 Confusion matrix? 📊 medium
Answer: Counts predictions vs truth for all classes—shows confusion pairs and supports per-class recall.
6 ROC / AUC? 🔥 hard
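Answer: TPR vs FPR curve as the threshold sweeps; AUC summarizes ranking quality as the probability that a random positive scores above a random negative. It does not depend on the class prior, so it can look good even when precision is poor under heavy imbalance.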
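One way to see what AUC measures is the pairwise form: the fraction of (positive, negative) pairs that the scores rank correctly, with ties counted as half. The sketch below is quadratic in the number of samples and is meant for illustration only; `roc_auc` is a hypothetical helper name.

```python
def roc_auc(scores, labels):
    """AUC as the fraction of positive/negative pairs ranked correctly."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    # A correctly ordered pair counts 1, a tie counts 0.5.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


auc = roc_auc([0.9, 0.8, 0.3, 0.1], [1, 0, 1, 0])
```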
7 What is IoU? 📊 medium
Answer: Intersection over union of predicted vs ground-truth boxes/masks—range [0,1]; standard match criterion in detection.
iou = inter_area / (area_a + area_b - inter_area)
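A fuller sketch of the one-line formula above, assuming boxes are given as (x1, y1, x2, y2) corner coordinates. The function name `box_iou` is illustrative.

```python
def box_iou(a, b):
    """IoU of two axis-aligned boxes in (x1, y1, x2, y2) format."""
    # Corners of the intersection rectangle.
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    # Clamp at zero so disjoint boxes get no intersection.
    inter_area = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter_area
    return inter_area / union if union > 0 else 0.0
```

For example, two 2×2 boxes offset by one pixel in each direction overlap in a 1×1 square, giving IoU = 1 / (4 + 4 - 1) = 1/7.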
8 What is mAP in detection? 🔥 hard
Answer: Mean AP across classes—AP is area under precision–recall curve after IoU-thresholded matches; COCO averages multiple IoU thresholds.
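A minimal sketch of per-class AP, assuming the matching step is already done: each detection arrives as a (score, is_true_positive) pair for one IoU threshold. It integrates the full precision envelope (all-point interpolation); official COCO tooling samples the curve at fixed recall points, so numbers will differ slightly. The names `average_precision` and `mean_ap` are illustrative.

```python
def average_precision(detections, num_gt):
    """AP for one class from (score, is_true_positive) pairs."""
    if num_gt == 0:
        return 0.0
    ranked = sorted(detections, key=lambda d: d[0], reverse=True)
    precisions, recalls = [], []
    tp = fp = 0
    for _, is_tp in ranked:
        if is_tp:
            tp += 1
        else:
            fp += 1
        precisions.append(tp / (tp + fp))
        recalls.append(tp / num_gt)
    # Precision envelope: make precision non-increasing as recall grows.
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])
    # Sum rectangle areas wherever recall increases.
    ap, prev_recall = 0.0, 0.0
    for p, r in zip(precisions, recalls):
        ap += (r - prev_recall) * p
        prev_recall = r
    return ap


def mean_ap(ap_per_class):
    """mAP is the plain mean of per-class AP values."""
    return sum(ap_per_class) / len(ap_per_class)


ap = average_precision([(0.9, True), (0.8, False), (0.7, True)], num_gt=2)
```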
9 AP vs mAP? 📊 medium
Answer: AP per class; mAP averages classes—report AP50 vs AP75 to show coarse vs tight localization skill.
10 NMS effect on metrics? 📊 medium
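Answer: NMS removes overlapping duplicate boxes before evaluation, so its IoU threshold and variant change the reported scores; settings must match the benchmark rules (soft-NMS decays scores instead of deleting boxes, so results differ).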
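A sketch of classic greedy NMS, to make clear what is being tuned. It assumes (x1, y1, x2, y2) boxes and a single class; the helper names and the 0.5 default are illustrative, not taken from any particular benchmark.

```python
def _iou(a, b):
    """IoU of two (x1, y1, x2, y2) boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def greedy_nms(boxes, scores, iou_threshold=0.5):
    """Return indices of kept boxes, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep a box only if it does not overlap any already-kept box too much.
        if all(_iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


kept = greedy_nms(
    boxes=[(0, 0, 10, 10), (1, 1, 11, 11), (50, 50, 60, 60)],
    scores=[0.9, 0.8, 0.7],
)
```

Here the second box overlaps the first with IoU 81/119 (about 0.68) and is suppressed; raising `iou_threshold` above that value would keep it and change the precision of the final output.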
11 Segmentation IoU? 📊 medium
Answer: Per-class IoU on pixels; mean IoU (mIoU) across classes—ignore void label per dataset protocol.
12 Dice coefficient? 📊 medium
Answer: 2|A∩B|/(|A|+|B|); identical to F1 computed over the pixels of a binary mask, and related to IoU by Dice = 2·IoU/(1+IoU). Common in medical segmentation with class imbalance.
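A small sketch comparing Dice and IoU on the same pair of masks, representing each mask as a set of foreground (row, col) pixels for simplicity. The function name and the convention of returning 1.0 for two empty masks are assumptions for the demo.

```python
def dice_and_iou(pred, target):
    """Dice and IoU for two masks given as sets of foreground pixels."""
    inter = len(pred & target)
    total = len(pred) + len(target)
    union = total - inter
    # Two empty masks agree perfectly under this (assumed) convention.
    dice = 2 * inter / total if total else 1.0
    iou = inter / union if union else 1.0
    return dice, iou


pred = {(0, 0), (0, 1), (1, 0)}
target = {(0, 1), (1, 0), (1, 1)}
dice, iou = dice_and_iou(pred, target)
```

With 2 shared pixels out of 3 + 3, Dice is 2/3 and IoU is 2/4, which matches Dice = 2·IoU/(1+IoU).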
13 Threshold tuning? 🔥 hard
Answer: Pick operating point on validation to meet product constraint (min recall)—don’t tune on test set.
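As a sketch of picking an operating point, the helper below returns the highest score threshold whose recall on a validation set still meets a minimum. The name `pick_threshold` and the toy scores are hypothetical; a real pipeline would also check precision at the chosen point.

```python
def pick_threshold(scores, labels, min_recall):
    """Highest threshold t such that recall of (score >= t) >= min_recall.

    Returns None if there are no positives or no threshold qualifies.
    """
    positives = sum(labels)
    if positives == 0:
        return None
    # Try candidate thresholds from strictest to most permissive.
    for t in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= t)
        if tp / positives >= min_recall:
            return t
    return None


val_scores = [0.9, 0.8, 0.6, 0.4, 0.2]
val_labels = [1, 0, 1, 1, 0]
threshold = pick_threshold(val_scores, val_labels, min_recall=0.6)
```

The threshold is chosen on validation data only and then frozen before the test set is touched.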
14 Micro vs macro averaging? 🔥 hard
Answer: Micro pools all examples; macro averages per-class stats—macro highlights rare class performance.
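The difference is easiest to see with one common and one rare class. The sketch below computes micro and macro recall from hypothetical per-class (TP, FN) counts.

```python
def micro_macro_recall(per_class_counts):
    """per_class_counts maps class name -> (true positives, false negatives)."""
    total_tp = sum(tp for tp, _ in per_class_counts.values())
    total_fn = sum(fn for _, fn in per_class_counts.values())
    # Micro: pool every example, so large classes dominate.
    micro = total_tp / (total_tp + total_fn)
    # Macro: average per-class recall, so every class counts equally.
    per_class = [tp / (tp + fn) for tp, fn in per_class_counts.values()]
    macro = sum(per_class) / len(per_class)
    return micro, macro


counts = {"common": (950, 50), "rare": (2, 8)}  # made-up numbers
micro, macro = micro_macro_recall(counts)
```

Micro recall is about 0.94 because the common class dominates, while macro recall is 0.575 because the rare class (recall 0.2) counts as much as the common one.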
15 OKS in pose? 🔥 hard
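Answer: Object keypoint similarity turns keypoint distance into a [0,1] score, normalized by object scale and a per-keypoint constant (tighter for eyes than for hips); COCO pose AP thresholds OKS the way detection AP thresholds IoU.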
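A sketch of OKS for one predicted/ground-truth person, following the commonly cited form exp(-d² / (2·s²·k²)) averaged over labeled keypoints, with s² taken as the object area and k = 2σ per keypoint. The sigma value in the example is made up; real evaluations use the per-joint constants shipped with the official tooling.

```python
import math


def oks(pred, gt, visible, area, sigmas):
    """Object keypoint similarity for one instance.

    pred, gt: lists of (x, y); visible: list of bools for labeled keypoints;
    area: ground-truth object area in pixels^2; sigmas: per-keypoint constants.
    """
    total, count = 0.0, 0
    for (px, py), (gx, gy), v, sigma in zip(pred, gt, visible, sigmas):
        if not v:
            continue  # unlabeled keypoints do not contribute
        d2 = (px - gx) ** 2 + (py - gy) ** 2
        k = 2 * sigma
        total += math.exp(-d2 / (2 * area * k ** 2))
        count += 1
    return total / count if count else 0.0


# One keypoint, 10 px off, on an object of area 100x100, sigma assumed 0.05.
score = oks(pred=[(10, 0)], gt=[(0, 0)], visible=[True],
            area=10000, sigmas=[0.05])
```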
16 Calibration? 📊 medium
Answer: Predicted probabilities match empirical frequencies—ECE, reliability diagrams; miscalibration hurts downstream decisions.
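A minimal sketch of expected calibration error with equal-width confidence bins: per bin, compare mean confidence with accuracy and weight the gap by bin size. The bin count of 10 is a common choice, not a requirement, and the function name is illustrative.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE over equal-width bins of predicted confidence in [0, 1]."""
    bins = [[] for _ in range(n_bins)]
    for c, ok in zip(confidences, correct):
        idx = min(int(c * n_bins), n_bins - 1)  # put c == 1.0 in the last bin
        bins[idx].append((c, ok))
    n = len(confidences)
    ece = 0.0
    for b in bins:
        if not b:
            continue
        avg_conf = sum(c for c, _ in b) / len(b)
        accuracy = sum(ok for _, ok in b) / len(b)
        ece += len(b) / n * abs(avg_conf - accuracy)
    return ece


# Four predictions at 95% confidence, only three correct: overconfident.
ece = expected_calibration_error([0.95] * 4, [1, 1, 1, 0])
```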
17 Sampling bias? 📊 medium
Answer: Geographic, demographic, or capture bias inflates benchmark scores—report subgroup metrics.
18 Benchmark leakage? ⚡ easy
Answer: Test images in pretraining data or duplicate near-neighbors—contaminates leaderboard comparisons.
19 Human baseline? ⚡ easy
Answer: Annotator agreement sets ceiling—if model beats humans, check task ambiguity or evaluation bugs.
20 What to report? 📊 medium
Answer: Primary metric + confidence intervals or multiple seeds, compute budget, and failure cases—not leaderboard cherry-picking.

ImageNet: 20 Essential Q&A

21 What is ImageNet? ⚡ easy
Answer: Large-scale image dataset organized by WordNet synsets—millions of labeled images driving classification pretraining.
22 ILSVRC? 📊 medium
Answer: Annual challenge subset (~1.2M train, 50k val, 1000 classes) used historically for ImageNet-1K classification benchmarks.
23 What is a synset? 📊 medium
Answer: A WordNet "synonym set": a group of words sharing one sense (e.g. a specific dog breed). Each ImageNet class is one synset, which disambiguates the label and reduces polysemy.
24 Scale? ⚡ easy
Answer: Roughly 1.28M training images for 1K ILSVRC classes—enough diversity to learn general visual features.
25 Why report top-5 error? 📊 medium
Answer: Fine-grained classes make single exact label harsh—top-5 was standard headline metric during AlexNet era.
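Top-k error counts an image as wrong only if the true label is missing from the k highest-scoring classes. A small sketch with made-up three-class scores (so k is 1 or 2 here rather than 5):

```python
def top_k_error(scores_per_image, labels, k=5):
    """Fraction of images whose true label is not among the top-k scores."""
    errors = 0
    for scores, label in zip(scores_per_image, labels):
        ranked = sorted(range(len(scores)), key=lambda i: scores[i],
                        reverse=True)
        if label not in ranked[:k]:
            errors += 1
    return errors / len(labels)


scores = [[0.1, 0.5, 0.4], [0.7, 0.2, 0.1]]
labels = [2, 1]
top1 = top_k_error(scores, labels, k=1)
top2 = top_k_error(scores, labels, k=2)
```

Both images are top-1 errors but top-2 hits, the pattern that fine-grained or ambiguous classes tend to produce.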
26 Val vs test? 📊 medium
Answer: Public val for development; test held out for leaderboard—reproducible papers compare on val with fixed split.
27 Canonical preprocessing? 🔥 hard
Answer: Short side resize 256, center crop 224, mean/std normalization—must match weights (different for Inception vs ResNet sometimes).
# ImageNet norm: mean=[0.485,0.456,0.406], std=[0.229,0.224,0.225]
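The geometry and normalization of that recipe can be sketched without any imaging library. The helper names are hypothetical, and rounding conventions here may differ by a pixel from what a given framework does, so treat the numbers as illustrative.

```python
MEAN = (0.485, 0.456, 0.406)
STD = (0.229, 0.224, 0.225)


def resize_then_center_crop_box(width, height, short_side=256, crop=224):
    """Return the resized (w, h) and the center crop box (l, t, r, b)."""
    scale = short_side / min(width, height)
    new_w, new_h = round(width * scale), round(height * scale)
    left = (new_w - crop) // 2
    top = (new_h - crop) // 2
    return (new_w, new_h), (left, top, left + crop, top + crop)


def normalize_pixel(rgb):
    """Map an 8-bit (r, g, b) pixel to mean/std-normalized floats."""
    return tuple((c / 255.0 - m) / s for c, m, s in zip(rgb, MEAN, STD))


size, box = resize_then_center_crop_box(640, 480)
white = normalize_pixel((255, 255, 255))
```

A 640×480 image becomes 341×256 and the crop keeps a 224×224 window around the center.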
28 Transfer learning role? 📊 medium
Answer: A backbone trained on ImageNet learns reusable features for edges, textures, and objects; finetune it on small domain datasets with a smaller learning rate.
29 Freeze backbone? 📊 medium
Answer: When data is tiny, train only the head at first and unfreeze the backbone later; BN layers need care during finetuning because their batch statistics shift.
30 Noise / bias? 🔥 hard
Answer: Crowdsourced labels contain errors; geographic and demographic skew—ImageNet audit projects documented issues.
31 Hierarchical labels? 📊 medium
Answer: WordNet tree enables hierarchical metrics and zero-shot transfer—not all models exploit hierarchy in loss.
32 Classic augmentations? ⚡ easy
Answer: RandomResizedCrop, flip, color jitter—standard on ImageNet training recipes (RRC is critical for ResNet).
33 Tiny ImageNet? ⚡ easy
Answer: Teaching subset (200 classes, 64×64)—useful for coursework; not same distribution as full IN.
34 Relation to Open Images? 📊 medium
Answer: Different project (multi-label, boxes)—don’t confuse with ImageNet-1K single-label classification.
35 Licensing? ⚡ easy
Answer: Images scraped from web with varying rights—research use common; commercial redeployment needs legal review.
36 ObjectNet lesson? 🔥 hard
Answer: Controls viewpoint/background—shows ImageNet-trained models rely on spurious cues; stresses robust evaluation.
37 Beyond single-label? 📊 medium
Answer: Web-scale image-text (CLIP) reduces reliance on pure ImageNet classification for pretraining—still often finetuned with IN-like data.
38 EfficientNet story? 📊 medium
Answer: Compound scaling depth/width/resolution on ImageNet—Pareto frontier influenced mobile deployment targets.
39 ViT on ImageNet? 🔥 hard
Answer: Transformers need large data or strong augmentation + pretrain—ImageNet-1K alone smaller than JFT; hybrids mattered early.
40 Still pretrain on IN? ⚡ easy
Answer: Common baseline though larger multimodal corpora grow—ImageNet remains reference for architecture comparisons.

MS COCO: 20 Essential Q&A

41 What is MS COCO? ⚡ easy
Answer: Common Objects in Context—benchmark for detection, segmentation, captions, and keypoints with rich scene images.
42 Which tasks? 📊 medium
Answer: Object detection (bbox), instance seg, panoptic (thing+stuff), image captioning, person keypoints—each has metrics.
43 JSON annotations? 📊 medium
Answer: COCO format lists images, categories, annotations with bbox [x,y,w,h], segmentation polygons/RLE, area, iscrowd flag.
44 Bbox format? ⚡ easy
Answer: Top-left x,y plus width,height in pixels—convert carefully vs xyxy conventions in codebases.
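The conversion itself is two lines, but mixing the two conventions is a frequent silent bug, so it helps to keep it in named helpers. The function names are illustrative.

```python
def xywh_to_xyxy(box):
    """COCO-style [x, y, width, height] -> [x1, y1, x2, y2]."""
    x, y, w, h = box
    return [x, y, x + w, y + h]


def xyxy_to_xywh(box):
    """[x1, y1, x2, y2] -> COCO-style [x, y, width, height]."""
    x1, y1, x2, y2 = box
    return [x1, y1, x2 - x1, y2 - y1]


corners = xywh_to_xyxy([10, 20, 30, 40])
```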
45 Instance masks? 📊 medium
Answer: Often stored as RLE compression per object—Mask R-CNN training decodes to binary masks per instance.
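A sketch of decoding the uncompressed RLE form, where `counts` is a list of run lengths that alternate between background and foreground, start with background, and walk the mask in column-major order. The compressed string form of `counts` needs the official tooling and is not handled here.

```python
def decode_uncompressed_rle(rle):
    """Decode {'size': [h, w], 'counts': [runs...]} into a list of rows."""
    height, width = rle["size"]
    flat = []
    value = 0  # runs start with background (0), then alternate
    for run in rle["counts"]:
        flat.extend([value] * run)
        value = 1 - value
    # flat is column-major: element (row, col) sits at col * height + row.
    return [[flat[c * height + r] for c in range(width)]
            for r in range(height)]


mask = decode_uncompressed_rle({"size": [2, 3], "counts": [1, 2, 3]})
```

For this 2×3 example the runs are one 0, two 1s, three 0s, so the foreground pixels are at (row 1, col 0) and (row 0, col 1).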
46 Panoptic on COCO? 🔥 hard
Answer: Unifies semantic “stuff” and instance “things” with PQ metric—requires non-overlapping label assignment per pixel.
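PQ for one class can be sketched from already-matched segments: predicted and ground-truth segments count as a match when their IoU exceeds 0.5, and PQ divides the summed IoU of matches by |TP| + ½|FP| + ½|FN|. The matching step is assumed done; the function name and example numbers are illustrative.

```python
def panoptic_quality(tp_ious, num_fp, num_fn):
    """PQ from IoUs of matched segment pairs and unmatched counts."""
    denom = len(tp_ious) + 0.5 * num_fp + 0.5 * num_fn
    return sum(tp_ious) / denom if denom else 0.0


# Two matched segments, one spurious prediction, one missed ground truth.
pq = panoptic_quality(tp_ious=[0.8, 0.6], num_fp=1, num_fn=1)
```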
47 Captions? 📊 medium
Answer: Multiple human captions per image—evaluation with BLEU/CIDEr/SPICE; encourages descriptive models.
48 Keypoints? 📊 medium
Answer: 17 body joints for person instances—AP computed with OKS instead of IoU for matching.
49 Detection mAP on COCO? 🔥 hard
Answer: Primary AP averaged over IoU thresholds 0.5:0.05:0.95 (AP@[.5:.95]) plus AP50, AP75—stricter than VOC AP50 only.
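The averaging step can be sketched as follows, assuming AP has already been computed at each IoU threshold (the per-threshold values below are made up). AP50 and AP75 are simply the entries at 0.50 and 0.75.

```python
def coco_iou_thresholds():
    """The ten thresholds 0.50, 0.55, ..., 0.95."""
    return [round(0.5 + 0.05 * i, 2) for i in range(10)]


def coco_style_ap(ap_at_threshold):
    """Mean of AP over the ten thresholds (AP@[.5:.95])."""
    thresholds = coco_iou_thresholds()
    return sum(ap_at_threshold[t] for t in thresholds) / len(thresholds)


thresholds = coco_iou_thresholds()
# Hypothetical detector whose AP drops linearly as the IoU bar rises.
ap_table = {t: 0.7 - 0.05 * i for i, t in enumerate(thresholds)}
primary_ap = coco_style_ap(ap_table)
ap50, ap75 = ap_table[0.5], ap_table[0.75]
```

The primary number is well below AP50 because it also rewards tight localization at the stricter thresholds.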
50 Mask AP? 📊 medium
Answer: AP computed on mask IoU instead of box IoU—segmentation quality can differ from bbox AP ranking.
51 80 classes? ⚡ easy
Answer: Thing categories for detection—plus stuff classes in panoptic/stuff annotations; don’t confuse with 91 legacy lists in some code.
52 train/val/test? 📊 medium
Answer: train2017, val2017 public; test-dev hidden for leaderboard—papers report val metrics for fair comparison.
53 pycocotools? 📊 medium
Answer: Official eval code for mAP, mask IoU, RLE decode—implementations should match to reproduce leaderboard numbers.
from pycocotools.coco import COCO
coco = COCO("annotations.json")  # loads and indexes the annotation file
54 Small objects? 📊 medium
Answer: COCO reports AP_S/M/L by area—models struggle on small; anchors/FPN designs target scale variance.
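The size buckets are defined by object area in pixels, with cut points at 32² and 96². A sketch of the bucketing follows; how exact boundary values are assigned is an assumption here and may differ from the official evaluator.

```python
def coco_size_bucket(area):
    """Bucket an object by pixel area into small / medium / large."""
    if area < 32 ** 2:
        return "small"
    if area < 96 ** 2:
        return "medium"
    return "large"


buckets = [coco_size_bucket(a) for a in (500, 5000, 20000)]
```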
55 iscrowd flag? 🔥 hard
Answer: Marks annotations that cover groups or crowds where separating instances is ambiguous; in COCO evaluation, detections that match a crowd region are ignored rather than counted as false positives.
56 Eval servers? ⚡ easy
Answer: Upload predictions for test sets—prevents test overfitting; val is for iteration.
57 Relation to LVIS? 📊 medium
Answer: Large-vocabulary, long-tail instance segmentation dataset annotated on COCO images; similar tooling, but category frequencies range from frequent to rare and AP is also reported per frequency group.
58 Image source? ⚡ easy
Answer: Photos of everyday scenes collected from Flickr; more contextual clutter than ImageNet's object-centric photos.
59 Why default benchmark? 📊 medium
Answer: Challenging scale, multi-task labels, standardized API—dominant for comparing detectors and instance seg models.
60 Common pitfalls? 📊 medium
Answer: Wrong bbox convention, ignoring iscrowd, different NMS thresholds, or not using official eval—numbers won’t match papers.
