Computer Vision Interview
60 Q&A
Chapter 20
Evaluation & Benchmarks — Interview Q&A
Detection and segmentation metrics (IoU, mAP), plus ImageNet and COCO benchmark datasets.
CV Evaluation Metrics: 20 Essential Q&A
1
Why care about metrics?
⚡ easy
Answer: They define success criteria, compare models, and expose tradeoffs (precision vs recall)—wrong metric optimizes the wrong behavior.
2
When is accuracy misleading?
📊 medium
Answer: With imbalanced classes. If 99% of samples are negative, a model that always predicts negative reaches 99% accuracy while detecting nothing; use per-class or balanced metrics instead.
3
Define precision and recall.
📊 medium
Answer: Precision = TP/(TP+FP); Recall = TP/(TP+FN)—tension controlled by decision threshold.
4
F1?
⚡ easy
Answer: Harmonic mean of precision and recall, F1 = 2PR/(P+R). It penalizes neglecting either one and is a common single-number summary for binary tasks; for multiclass problems, per-class F1 is averaged as macro-F1.
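The precision, recall, and F1 definitions from the last two answers can be checked with a few lines. This is a minimal sketch; the function name and the choice to return 0.0 when a ratio is undefined are assumptions, not a standard.

```python
def precision_recall_f1(tp, fp, fn):
    """Return (precision, recall, F1) from raw counts; 0.0 when a ratio is undefined."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# 8 correct detections, 2 false alarms, 8 missed positives
p, r, f1 = precision_recall_f1(tp=8, fp=2, fn=8)
print(p, r, f1)
```

Precision is high here while recall is only one half, and F1 lands closer to the weaker of the two.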
5
Confusion matrix?
📊 medium
Answer: Counts predictions vs truth for all classes—shows confusion pairs and supports per-class recall.
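A small sketch of building a confusion matrix and reading per-class recall off its rows. It assumes integer class labels in the range 0..num_classes-1 and the convention that rows are true classes; both function names are illustrative.

```python
def confusion_matrix(y_true, y_pred, num_classes):
    """cm[t][p] counts samples whose true class is t and predicted class is p."""
    cm = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(y_true, y_pred):
        cm[t][p] += 1
    return cm


def per_class_recall(cm):
    """Diagonal entry divided by the row sum (all samples of that true class)."""
    return [row[i] / sum(row) if sum(row) else 0.0 for i, row in enumerate(cm)]


y_true = [0, 0, 0, 1, 1, 2]
y_pred = [0, 0, 1, 1, 2, 2]
cm = confusion_matrix(y_true, y_pred, num_classes=3)
recalls = per_class_recall(cm)
print(cm, recalls)
```

Off-diagonal entries such as cm[0][1] show which class pairs get confused.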
6
ROC / AUC?
🔥 hard
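Answer: The ROC curve plots TPR against FPR as the threshold sweeps; AUC summarizes ranking quality as the probability that a random positive outscores a random negative. It does not depend on the class prior, which also means it can look optimistic when positives are very rare.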
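AUC can be computed directly as a pairwise ranking statistic, without drawing the curve. The sketch below is a quadratic-time illustration for small inputs, assuming binary labels in {0, 1} and counting ties as half a win; the function name is hypothetical.

```python
def roc_auc(scores, labels):
    """Fraction of (positive, negative) pairs where the positive scores higher."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half
    return wins / (len(pos) * len(neg))


auc = roc_auc([0.9, 0.8, 0.4, 0.3], [1, 0, 1, 0])
print(auc)
```

Three of the four positive/negative pairs are ordered correctly in this example.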
7
What is IoU?
📊 medium
Answer: Intersection over union of predicted vs ground-truth boxes/masks—range [0,1]; standard match criterion in detection.
iou = inter_area / (area_a + area_b - inter_area)
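The one-line formula above expands into a full box IoU function. A minimal sketch, assuming boxes are given as (x1, y1, x2, y2) corners with x2 > x1 and y2 > y1; the function name is illustrative.

```python
def box_iou(a, b):
    """IoU of two axis-aligned boxes in (x1, y1, x2, y2) format."""
    # Corners of the intersection rectangle
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    # Clamp to zero when the boxes do not overlap
    inter_area = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter_area
    return inter_area / union if union > 0 else 0.0


print(box_iou((0, 0, 2, 2), (1, 1, 3, 3)))
```

The clamp on the intersection width and height is the step most often forgotten; without it, disjoint boxes can produce a positive "intersection".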
8
What is mAP in detection?
🔥 hard
Answer: Mean AP across classes—AP is area under precision–recall curve after IoU-thresholded matches; COCO averages multiple IoU thresholds.
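To make the AP definition concrete, here is a sketch for a single class at a single IoU threshold. It assumes each detection has already been matched and flagged as true or false positive, and it integrates the precision envelope over all recall points; COCO's official evaluation instead samples 101 recall points, so numbers can differ slightly. The function name is hypothetical.

```python
def average_precision(scores, is_tp, num_gt):
    """Area under the precision-recall curve (all-point interpolation)."""
    if num_gt == 0:
        return 0.0
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    tp = fp = 0
    recalls, precisions = [], []
    for i in order:  # walk detections from most to least confident
        if is_tp[i]:
            tp += 1
        else:
            fp += 1
        recalls.append(tp / num_gt)
        precisions.append(tp / (tp + fp))
    # Make precision monotonically non-increasing (the "envelope")
    for k in range(len(precisions) - 2, -1, -1):
        precisions[k] = max(precisions[k], precisions[k + 1])
    ap, prev_recall = 0.0, 0.0
    for r, p in zip(recalls, precisions):
        ap += (r - prev_recall) * p
        prev_recall = r
    return ap


ap = average_precision([0.9, 0.8, 0.7, 0.6], [True, False, True, False], num_gt=2)
print(ap)
```

mAP would then be the mean of this value over classes (and, for COCO, over IoU thresholds).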
9
AP vs mAP?
📊 medium
Answer: AP per class; mAP averages classes—report AP50 vs AP75 to show coarse vs tight localization skill.
10
NMS effect on metrics?
📊 medium
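Answer: NMS removes duplicate overlapping boxes before evaluation; any duplicate that survives counts as a false positive. The NMS threshold and variant (e.g. Soft-NMS, which decays scores instead of deleting boxes) change AP, so settings must match the rules of the benchmark you compare against.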
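A sketch of classic greedy NMS, to show where the threshold enters. It assumes (x1, y1, x2, y2) boxes and a single class; `iou_thresh=0.5` is an illustrative default, not a value prescribed by any benchmark.

```python
def _iou(a, b):
    """IoU of two (x1, y1, x2, y2) boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_thresh=0.5):
    """Return indices of kept boxes, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep a box only if it does not overlap too much with any kept box
        if all(_iou(boxes[i], boxes[j]) <= iou_thresh for j in keep):
            keep.append(i)
    return keep


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
kept = nms(boxes, scores=[0.9, 0.8, 0.7])
print(kept)
```

The second box overlaps the first with IoU of about 0.68, so it is suppressed at a 0.5 threshold but would survive at 0.7.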
11
Segmentation IoU?
📊 medium
Answer: Per-class IoU on pixels; mean IoU (mIoU) across classes—ignore void label per dataset protocol.
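A minimal mIoU sketch over flattened label maps. It assumes predictions always fall in 0..num_classes-1 and that void pixels carry the label 255 in the ground truth; 255 is a common convention but is an assumption here, so check the dataset protocol.

```python
def mean_iou(pred, target, num_classes, ignore_label=255):
    """Mean of per-class pixel IoU, skipping pixels labeled ignore_label."""
    inter = [0] * num_classes
    union = [0] * num_classes
    for p, t in zip(pred, target):
        if t == ignore_label:
            continue  # void pixels affect neither intersection nor union
        if p == t:
            inter[t] += 1
            union[t] += 1
        else:
            union[p] += 1  # false positive for class p
            union[t] += 1  # false negative for class t
    # Average only over classes that actually appear
    ious = [i / u for i, u in zip(inter, union) if u > 0]
    return sum(ious) / len(ious) if ious else 0.0


miou = mean_iou(pred=[0, 0, 1, 1, 1], target=[0, 1, 1, 1, 255], num_classes=2)
print(miou)
```

Class 0 scores 1/2 and class 1 scores 2/3 in this example; the last pixel is ignored.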
12
Dice coefficient?
📊 medium
Answer: 2|A∩B|/(|A|+|B|). Identical to F1 on binary masks and monotonically related to IoU (Dice = 2·IoU/(1+IoU)); common in medical segmentation with class imbalance.
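Dice and IoU can be computed from the same counts, which makes their relationship easy to verify. This sketch assumes flattened binary masks of equal length; returning 1.0 when both masks are empty is a convention chosen here, not a universal rule.

```python
def dice_and_iou(mask_a, mask_b):
    """Return (Dice, IoU) for two flattened binary masks."""
    inter = sum(1 for a, b in zip(mask_a, mask_b) if a and b)
    size_a = sum(1 for a in mask_a if a)
    size_b = sum(1 for b in mask_b if b)
    union = size_a + size_b - inter
    dice = 2 * inter / (size_a + size_b) if (size_a + size_b) else 1.0
    iou = inter / union if union else 1.0
    return dice, iou


dice, iou = dice_and_iou([1, 1, 1, 0], [0, 1, 1, 1])
print(dice, iou)
```

Dice is never lower than IoU on the same pair of masks, so the two should not be compared across papers without conversion.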
13
Threshold tuning?
🔥 hard
Answer: Pick operating point on validation to meet product constraint (min recall)—don’t tune on test set.
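One way to pick an operating point under a minimum-recall constraint is to scan candidate thresholds from high to low on validation scores. This is a sketch under the assumption that a sample is predicted positive when its score is at or above the threshold; the function name is illustrative.

```python
def pick_threshold(val_scores, val_labels, min_recall):
    """Highest threshold whose validation recall is at least min_recall."""
    num_pos = sum(val_labels)
    if num_pos == 0:
        raise ValueError("validation set has no positives")
    for thresh in sorted(set(val_scores), reverse=True):
        tp = sum(1 for s, y in zip(val_scores, val_labels) if y == 1 and s >= thresh)
        if tp / num_pos >= min_recall:
            return thresh
    raise ValueError("min_recall cannot be met")


thresh = pick_threshold(
    val_scores=[0.9, 0.7, 0.6, 0.4, 0.2],
    val_labels=[1, 0, 1, 1, 0],
    min_recall=0.66,
)
print(thresh)
```

Choosing the highest qualifying threshold keeps precision as high as the recall constraint allows; the chosen value is then frozen before touching the test set.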
14
Micro vs macro averaging?
🔥 hard
Answer: Micro pools all examples; macro averages per-class stats—macro highlights rare class performance.
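The gap between micro and macro averaging shows up clearly on an imbalanced two-class example. The sketch uses recall and assumes per-class (TP, FN) counts with at least one sample per class; the function name is hypothetical.

```python
def micro_macro_recall(per_class_counts):
    """per_class_counts: list of (tp, fn) tuples, one per class."""
    total_tp = sum(tp for tp, _ in per_class_counts)
    total = sum(tp + fn for tp, fn in per_class_counts)
    micro = total_tp / total  # pool all samples first
    macro = sum(tp / (tp + fn) for tp, fn in per_class_counts) / len(per_class_counts)
    return micro, macro


# Frequent class: 95 of 100 found. Rare class: 1 of 10 found.
micro, macro = micro_macro_recall([(95, 5), (1, 9)])
print(micro, macro)
```

Micro recall stays high because the frequent class dominates the pool, while macro recall drops sharply because the rare class counts equally.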
15
OKS in pose?
🔥 hard
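Answer: Object keypoint similarity (OKS) scores keypoint distance error normalized by object scale and a per-keypoint constant, so the same pixel error costs more on small objects and on joints that can be localized precisely. COCO pose AP thresholds OKS the way detection AP thresholds IoU.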
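A sketch of the OKS formula as commonly stated for COCO: each labeled keypoint contributes exp(-d² / (2·s²·k²)), where d is the pixel distance, s² the object area, and k = 2σ a per-keypoint constant. The sigma values passed in below are made up for illustration; real evaluations use the dataset's published constants.

```python
import math


def oks(pred, gt, visibility, area, sigmas):
    """Object keypoint similarity averaged over labeled keypoints (visibility > 0)."""
    total, count = 0.0, 0
    for (px, py), (gx, gy), v, sigma in zip(pred, gt, visibility, sigmas):
        if v <= 0:
            continue  # unlabeled keypoints are skipped
        d2 = (px - gx) ** 2 + (py - gy) ** 2
        k = 2 * sigma
        total += math.exp(-d2 / (2 * area * k ** 2))
        count += 1
    return total / count if count else 0.0


# One keypoint, off by (1, 1) pixels on an object of area 100 px^2
score = oks(pred=[(0, 0)], gt=[(1, 1)], visibility=[2], area=100.0, sigmas=[0.05])
print(score)
```

Doubling the object area with the same pixel error raises the score, which is the scale normalization at work.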
16
Calibration?
📊 medium
Answer: Predicted probabilities match empirical frequencies—ECE, reliability diagrams; miscalibration hurts downstream decisions.
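Expected calibration error (ECE) bins predictions by confidence and averages the gap between confidence and accuracy per bin. A minimal sketch, assuming equal-width bins over [0, 1] and top-1 confidences; `num_bins=10` is an illustrative default.

```python
def expected_calibration_error(confidences, correct, num_bins=10):
    """Weighted mean of |avg confidence - accuracy| over confidence bins."""
    bins = [[] for _ in range(num_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * num_bins), num_bins - 1)  # conf == 1.0 goes in the last bin
        bins[idx].append((conf, ok))
    n = len(confidences)
    ece = 0.0
    for b in bins:
        if not b:
            continue
        avg_conf = sum(c for c, _ in b) / len(b)
        accuracy = sum(1 for _, ok in b if ok) / len(b)
        ece += len(b) / n * abs(avg_conf - accuracy)
    return ece


ece = expected_calibration_error([0.95, 0.95, 0.55, 0.55], [True, True, True, False])
print(ece)
```

A reliability diagram plots the same per-bin accuracy against per-bin confidence instead of collapsing it to one number.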
17
Sampling bias?
📊 medium
Answer: Geographic, demographic, or capture bias inflates benchmark scores—report subgroup metrics.
18
Benchmark leakage?
⚡ easy
Answer: Test images in pretraining data or duplicate near-neighbors—contaminates leaderboard comparisons.
19
Human baseline?
⚡ easy
Answer: Annotator agreement sets ceiling—if model beats humans, check task ambiguity or evaluation bugs.
20
What to report?
📊 medium
Answer: Primary metric + confidence intervals or multiple seeds, compute budget, and failure cases—not leaderboard cherry-picking.
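For the confidence-interval part, a percentile bootstrap over per-sample results is one simple option. This sketch assumes the metric is a mean of per-sample values (such as 0/1 correctness); the resample count, alpha, and seed are illustrative defaults.

```python
import random


def bootstrap_ci(values, num_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the mean of per-sample values."""
    rng = random.Random(seed)
    n = len(values)
    means = sorted(
        sum(rng.choices(values, k=n)) / n  # resample with replacement
        for _ in range(num_resamples)
    )
    lo = means[int((alpha / 2) * num_resamples)]
    hi = means[int((1 - alpha / 2) * num_resamples) - 1]
    return lo, hi


per_sample_correct = [1] * 80 + [0] * 20  # 80% accuracy on 100 test images
lo, hi = bootstrap_ci(per_sample_correct)
print(lo, hi)
```

With only 100 test images the interval spans several accuracy points, which is often wider than the gap between two competing models.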
ImageNet: 20 Essential Q&A
21
What is ImageNet?
⚡ easy
Answer: Large-scale image dataset organized by WordNet synsets—millions of labeled images driving classification pretraining.
22
ILSVRC?
📊 medium
Answer: Annual challenge subset (~1.2M train, 50k val, 1000 classes) used historically for ImageNet-1K classification benchmarks.
23
What is a synset?
📊 medium
Answer: A WordNet synonym set: a group of words sharing one meaning (e.g. a specific dog breed). Each ImageNet class maps to one synset, which disambiguates words that have several senses.
24
Scale?
⚡ easy
Answer: Roughly 1.28M training images for 1K ILSVRC classes—enough diversity to learn general visual features.
25
Why report top-5 error?
📊 medium
Answer: Many classes are fine-grained and an image can contain several objects, so demanding the single exact label is harsh. Top-5 error was the standard headline metric during the AlexNet era.
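Top-k accuracy counts a prediction as correct when the true label is among the k highest-scoring classes. A minimal sketch, assuming one list of class scores per image; the function name is illustrative. Top-k error is one minus this value.

```python
def top_k_accuracy(score_rows, labels, k=5):
    """Fraction of samples whose true label is among the k highest scores."""
    hits = 0
    for scores, label in zip(score_rows, labels):
        ranked = sorted(range(len(scores)), key=lambda c: scores[c], reverse=True)
        if label in ranked[:k]:
            hits += 1
    return hits / len(labels)


score_rows = [[0.1, 0.5, 0.4], [0.7, 0.2, 0.1]]
labels = [2, 1]
top1 = top_k_accuracy(score_rows, labels, k=1)
top2 = top_k_accuracy(score_rows, labels, k=2)
print(top1, top2)
```

Both samples here are wrong at top-1 but correct at top-2, the pattern that motivates reporting both numbers.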
26
Val vs test?
📊 medium
Answer: Public val for development; test held out for leaderboard—reproducible papers compare on val with fixed split.
27
Canonical preprocessing?
🔥 hard
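Answer: Resize the short side to 256, center-crop 224×224, then normalize with per-channel mean/std. The pipeline must match what the pretrained weights were trained with; for example, Inception-family models commonly use 299×299 inputs and a different normalization than ResNet.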
# ImageNet norm: mean=[0.485,0.456,0.406], std=[0.229,0.224,0.225]
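The geometry and normalization of that pipeline can be sketched without any image library. The helper names below are hypothetical, and the rounding of the resized dimensions is an assumption; libraries differ slightly in how they round.

```python
MEAN = (0.485, 0.456, 0.406)
STD = (0.229, 0.224, 0.225)


def resize_short_side(width, height, target=256):
    """New (width, height) after scaling so the short side equals target."""
    scale = target / min(width, height)
    return round(width * scale), round(height * scale)


def center_crop_box(width, height, crop=224):
    """(left, top, right, bottom) of a centered crop x crop window."""
    left = (width - crop) // 2
    top = (height - crop) // 2
    return left, top, left + crop, top + crop


def normalize_pixel(rgb):
    """Map an 8-bit RGB pixel to normalized floats: (value/255 - mean) / std."""
    return tuple((c / 255.0 - m) / s for c, m, s in zip(rgb, MEAN, STD))


new_w, new_h = resize_short_side(640, 480)
box = center_crop_box(new_w, new_h)
print((new_w, new_h), box, normalize_pixel((124, 116, 104)))
```

A pixel close to the dataset mean color maps to values near zero, which is a quick sanity check that the mean and std are applied in the right channel order and scale.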
28
Transfer learning role?
📊 medium
Answer: A backbone pretrained on ImageNet learns reusable features (edges, textures, object parts); finetune it on a small domain dataset with a lower learning rate.
29
Freeze backbone?
📊 medium
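Answer: When data is tiny, train only the head at first and unfreeze the backbone later. BatchNorm layers need care during finetuning: small batches give noisy batch statistics, so these layers are often kept in eval mode.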
30
Noise / bias?
🔥 hard
Answer: Crowdsourced labels contain errors; geographic and demographic skew—ImageNet audit projects documented issues.
31
Hierarchical labels?
📊 medium
Answer: WordNet tree enables hierarchical metrics and zero-shot transfer—not all models exploit hierarchy in loss.
32
Classic augmentations?
⚡ easy
Answer: RandomResizedCrop, flip, color jitter—standard on ImageNet training recipes (RRC is critical for ResNet).
33
Tiny ImageNet?
⚡ easy
Answer: Teaching subset (200 classes, 64×64)—useful for coursework; not same distribution as full IN.
34
Relation to Open Images?
📊 medium
Answer: Different project (multi-label, boxes)—don’t confuse with ImageNet-1K single-label classification.
35
Licensing?
⚡ easy
Answer: Images scraped from web with varying rights—research use common; commercial redeployment needs legal review.
36
ObjectNet lesson?
🔥 hard
Answer: Controls viewpoint/background—shows ImageNet-trained models rely on spurious cues; stresses robust evaluation.
37
Beyond single-label?
📊 medium
Answer: Web-scale image-text (CLIP) reduces reliance on pure ImageNet classification for pretraining—still often finetuned with IN-like data.
38
EfficientNet story?
📊 medium
Answer: Compound scaling depth/width/resolution on ImageNet—Pareto frontier influenced mobile deployment targets.
39
ViT on ImageNet?
🔥 hard
Answer: Transformers need large data or strong augmentation + pretrain—ImageNet-1K alone smaller than JFT; hybrids mattered early.
40
Still pretrain on IN?
⚡ easy
Answer: Yes, as a common baseline. Larger multimodal corpora are increasingly used for pretraining, but ImageNet remains the reference for architecture comparisons.
MS COCO: 20 Essential Q&A
41
What is MS COCO?
⚡ easy
Answer: Common Objects in Context—benchmark for detection, segmentation, captions, and keypoints with rich scene images.
42
Which tasks?
📊 medium
Answer: Object detection (bbox), instance seg, panoptic (thing+stuff), image captioning, person keypoints—each has metrics.
43
JSON annotations?
📊 medium
Answer: COCO format lists images, categories, annotations with bbox [x,y,w,h], segmentation polygons/RLE, area, iscrowd flag.
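A minimal COCO-style annotation file, written as a Python dict, shows how the three top-level lists reference each other by ID. All values are invented for illustration; in particular, `area` is set to the polygon's area, which here equals the box area because the polygon is a rectangle.

```python
import json

coco_dict = {
    "images": [{"id": 1, "file_name": "example.jpg", "width": 640, "height": 480}],
    "categories": [{"id": 1, "name": "person", "supercategory": "person"}],
    "annotations": [
        {
            "id": 10,
            "image_id": 1,       # links to images[].id
            "category_id": 1,    # links to categories[].id
            "bbox": [100.0, 120.0, 50.0, 80.0],  # [x, y, width, height]
            "area": 4000.0,
            "iscrowd": 0,
            # One polygon as a flat [x1, y1, x2, y2, ...] list
            "segmentation": [[100, 120, 150, 120, 150, 200, 100, 200]],
        }
    ],
}

# Index annotations by image, the lookup most data loaders need
anns_by_image = {}
for ann in coco_dict["annotations"]:
    anns_by_image.setdefault(ann["image_id"], []).append(ann)

print(json.dumps(coco_dict)[:60], len(anns_by_image[1]))
```

Annotations are a flat list rather than nested under images, so building an index like `anns_by_image` is usually the first step after loading.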
44
Bbox format?
⚡ easy
Answer: Top-left x,y plus width,height in pixels—convert carefully vs xyxy conventions in codebases.
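The conversion between COCO's [x, y, w, h] and the corner-based [x1, y1, x2, y2] convention is short but easy to get wrong. A sketch with illustrative function names:

```python
def xywh_to_xyxy(box):
    """COCO [x, y, width, height] -> [x1, y1, x2, y2]."""
    x, y, w, h = box
    return [x, y, x + w, y + h]


def xyxy_to_xywh(box):
    """[x1, y1, x2, y2] -> COCO [x, y, width, height]."""
    x1, y1, x2, y2 = box
    return [x1, y1, x2 - x1, y2 - y1]


corners = xywh_to_xyxy([100, 120, 50, 80])
print(corners, xyxy_to_xywh(corners))
```

Feeding xywh boxes into code that expects xyxy usually does not crash; it silently produces wrong IoU values, which is why a round-trip check like the one above is worth keeping in tests.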
45
Instance masks?
📊 medium
Answer: Often stored as RLE compression per object—Mask R-CNN training decodes to binary masks per instance.
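To illustrate the run-length idea, here is a decoder for the uncompressed RLE form, where `counts` is a list of run lengths that alternate between 0s and 1s (starting with 0s) over the mask in column-major order and `size` is [height, width]. The compressed string form is normally decoded with pycocotools rather than by hand; this function is a teaching sketch only.

```python
def decode_uncompressed_rle(rle):
    """Decode {'size': [h, w], 'counts': [int, ...]} into a list of mask rows."""
    height, width = rle["size"]
    flat = []
    value = 0  # runs alternate 0s, 1s, 0s, ... starting with 0s
    for run in rle["counts"]:
        flat.extend([value] * run)
        value = 1 - value
    if len(flat) != height * width:
        raise ValueError("run lengths do not cover the mask")
    # flat is column-major: pixel (row, col) lives at index col * height + row
    return [[flat[col * height + row] for col in range(width)] for row in range(height)]


mask = decode_uncompressed_rle({"size": [2, 3], "counts": [1, 2, 3]})
print(mask)
```

Reading the runs in row-major order instead would produce a transposed-looking mask, a common source of bugs.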
46
Panoptic on COCO?
🔥 hard
Answer: Unifies semantic “stuff” and instance “things” with PQ metric—requires non-overlapping label assignment per pixel.
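Panoptic quality for one class is the summed IoU of matched segments divided by TP + FP/2 + FN/2, where a predicted and a ground-truth segment match when their IoU exceeds 0.5. The sketch assumes matching has already been done and only the matched IoUs and segment counts are passed in; the function name is hypothetical.

```python
def panoptic_quality(matched_ious, num_pred, num_gt):
    """PQ for one class given IoUs of matched (IoU > 0.5) segment pairs."""
    tp = len(matched_ious)
    fp = num_pred - tp  # predicted segments with no match
    fn = num_gt - tp    # ground-truth segments with no match
    denom = tp + 0.5 * fp + 0.5 * fn
    return sum(matched_ious) / denom if denom else 0.0


pq = panoptic_quality(matched_ious=[0.8, 0.6], num_pred=3, num_gt=4)
print(pq)
```

The overall PQ is then the mean over classes. Because segments cannot overlap, the IoU > 0.5 rule guarantees each ground-truth segment has at most one match.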
47
Captions?
📊 medium
Answer: Multiple human captions per image—evaluation with BLEU/CIDEr/SPICE; encourages descriptive models.
48
Keypoints?
📊 medium
Answer: 17 body joints for person instances—AP computed with OKS instead of IoU for matching.
49
Detection mAP on COCO?
🔥 hard
Answer: Primary AP averaged over IoU thresholds 0.5:0.05:0.95 (AP@[.5:.95]) plus AP50, AP75—stricter than VOC AP50 only.
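The 0.5:0.05:0.95 notation expands to ten IoU thresholds, and the headline number is the mean AP over them. A sketch, assuming per-threshold AP values are already computed and stored in a dict keyed by threshold; both function names and the AP values are illustrative.

```python
def coco_iou_thresholds():
    """The ten IoU thresholds 0.50, 0.55, ..., 0.95."""
    return [round(0.5 + 0.05 * i, 2) for i in range(10)]


def coco_style_ap(ap_at_threshold):
    """Mean of AP values over the ten COCO IoU thresholds."""
    thresholds = coco_iou_thresholds()
    return sum(ap_at_threshold[t] for t in thresholds) / len(thresholds)


thresholds = coco_iou_thresholds()
# Made-up AP values that fall as the IoU requirement tightens
fake_ap = {t: 1.0 - t for t in thresholds}
print(thresholds, coco_style_ap(fake_ap))
```

AP50 and AP75 are simply the entries at thresholds 0.5 and 0.75, so a detector with loose boxes can have a strong AP50 and a weak averaged AP.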
50
Mask AP?
📊 medium
Answer: AP computed on mask IoU instead of box IoU—segmentation quality can differ from bbox AP ranking.
51
80 classes?
⚡ easy
Answer: 80 thing categories for detection, plus stuff classes in the panoptic/stuff annotations. The original category list has 91 IDs, so the 80 category IDs are non-contiguous and some codebases remap them.
52
train/val/test?
📊 medium
Answer: train2017, val2017 public; test-dev hidden for leaderboard—papers report val metrics for fair comparison.
53
pycocotools?
📊 medium
Answer: Official eval code for mAP, mask IoU, RLE decode—implementations should match to reproduce leaderboard numbers.
from pycocotools.coco import COCO
coco = COCO("annotations.json")  # loads and indexes the annotation file
54
Small objects?
📊 medium
Answer: COCO reports AP_S/AP_M/AP_L by object area (small: under 32² pixels, medium: 32² to 96², large: over 96²). Models score lowest on small objects; anchors and FPN designs target this scale variance.
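COCO's size buckets are defined on object area in pixels, with cut points at 32² and 96². The sketch below is a simplification with a hypothetical function name; how areas exactly on a boundary are handled is an assumption and may differ from the official evaluation.

```python
def coco_size_bucket(area):
    """Assign an object area in pixels to COCO's small/medium/large bucket."""
    if area < 32 ** 2:
        return "small"
    if area < 96 ** 2:
        return "medium"
    return "large"


for area in (500, 5000, 20000):
    print(area, coco_size_bucket(area))
```

Counting ground-truth objects per bucket before training shows how much of the dataset falls in the small range, which helps when interpreting a low AP_S.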
55
iscrowd flag?
🔥 hard
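Answer: Marks a region containing a group of objects that annotators could not separate into instances. In COCO evaluation, crowd regions are never counted as missed ground truth, and detections that match a crowd region are not penalized as false positives.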
56
Eval servers?
⚡ easy
Answer: Upload predictions for test sets—prevents test overfitting; val is for iteration.
57
Relation to LVIS?
📊 medium
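Answer: Long-tail instance segmentation benchmark built on COCO images with over 1,000 categories. Tooling is similar, but category frequencies are heavily skewed, so AP is also reported separately for rare, common, and frequent classes.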
58
Image source?
⚡ easy
Answer: Photos collected from Flickr showing everyday scenes, with more contextual clutter than ImageNet's object-centric photos.
59
Why default benchmark?
📊 medium
Answer: Challenging scale, multi-task labels, standardized API—dominant for comparing detectors and instance seg models.
60
Common pitfalls?
📊 medium
Answer: Wrong bbox convention, ignoring iscrowd, different NMS thresholds, or not using official eval—numbers won’t match papers.