CV evaluation metrics
IoU for axis-aligned boxes
import numpy as np
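def iou_box(a, b):
    # a, b = [x1, y1, x2, y2] with x2 >= x1 and y2 >= y1
    # Corners of the intersection rectangle
    x1 = max(a[0], b[0])
    y1 = max(a[1], b[1])
    x2 = min(a[2], b[2])
    y2 = min(a[3], b[3])
    # Clamp at 0 so disjoint boxes give zero overlap
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = max(0, a[2] - a[0]) * max(0, a[3] - a[1])
    area_b = max(0, b[2] - b[0]) * max(0, b[3] - b[1])
    # Small epsilon avoids division by zero for degenerate boxes
    union = area_a + area_b - inter + 1e-6
    return inter / union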
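A prediction counts as a “true positive” for a class if its IoU with an unmatched ground-truth box of that class is at least the threshold (e.g. 0.5). Each ground-truth box can be matched at most once, so additional detections of the same object count as false positives.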
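The matching rule above can be sketched as a greedy assignment for one class in one image. This is a minimal sketch under stated assumptions, not the exact PASCAL VOC or COCO procedure: the helper name `match_detections` and its arguments are hypothetical, and `iou_box` is repeated so the block runs on its own.

```python
def iou_box(a, b):
    # Boxes are [x1, y1, x2, y2]; same formula as the helper above.
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    area_a = max(0.0, a[2] - a[0]) * max(0.0, a[3] - a[1])
    area_b = max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-6)


def match_detections(preds, scores, gts, thr=0.5):
    """Greedy matching for one class in one image (hypothetical helper).

    Returns one flag per prediction, in descending score order:
    True = true positive, False = false positive.
    """
    order = sorted(range(len(preds)), key=lambda i: scores[i], reverse=True)
    matched = set()  # indices of ground-truth boxes already claimed
    flags = []
    for i in order:
        best_iou, best_j = 0.0, -1
        for j, gt in enumerate(gts):
            if j in matched:
                continue  # each ground truth can be used only once
            v = iou_box(preds[i], gt)
            if v > best_iou:
                best_iou, best_j = v, j
        if best_j >= 0 and best_iou >= thr:
            matched.add(best_j)
            flags.append(True)
        else:
            flags.append(False)
    return flags


gts = [[0, 0, 10, 10]]
preds = [[0, 0, 10, 10], [1, 1, 10, 10], [50, 50, 60, 60]]
scores = [0.9, 0.8, 0.7]
flags = match_detections(preds, scores, gts)
```

Here the second prediction overlaps the object well but is still a false positive, because the higher-scoring prediction already claimed the only ground-truth box. Benchmarks differ in details such as tie handling and "difficult" or crowd boxes, so treat this as an illustration only.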
Precision, recall, F1
from sklearn.metrics import precision_recall_fscore_support
# y_true, y_pred: per-sample class indices
p, r, f1, _ = precision_recall_fscore_support(y_true, y_pred, average="macro")
# average="weighted" weights by support; "binary" for two-class
Segmentation IoU / Dice
For label masks A, B ∈ {0,…,K-1}^(H×W), the IoU for class k is |{A=k} ∩ {B=k}| / |{A=k} ∪ {B=k}|. Dice = 2|A∩B| / (|A|+|B|), computed on a binary mask or per class. Report the mean over classes, excluding the void (ignore) label if the protocol requires it.
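The per-class definitions above might be implemented along these lines. This is a sketch, assuming integer label maps of equal shape; the function name `per_class_iou_dice` and the `ignore_index` parameter are illustrative choices, and classes absent from both masks are reported as NaN so they drop out of the mean.

```python
import numpy as np


def per_class_iou_dice(pred, target, num_classes, ignore_index=None):
    """Per-class IoU and Dice for integer label maps of identical shape."""
    pred = np.asarray(pred)
    target = np.asarray(target)
    valid = np.ones(target.shape, dtype=bool)
    if ignore_index is not None:
        valid = target != ignore_index  # drop void pixels from all counts
    ious, dices = [], []
    for k in range(num_classes):
        p = (pred == k) & valid
        t = (target == k) & valid
        inter = np.logical_and(p, t).sum()
        union = np.logical_or(p, t).sum()
        total = p.sum() + t.sum()
        ious.append(inter / union if union > 0 else np.nan)
        dices.append(2 * inter / total if total > 0 else np.nan)
    return np.array(ious), np.array(dices)


pred = np.array([[0, 0], [1, 1]])
target = np.array([[0, 1], [1, 1]])
ious, dices = per_class_iou_dice(pred, target, num_classes=2)
miou = np.nanmean(ious)  # mean over classes that actually occur
```

Note that this averages per image. Many segmentation benchmarks instead accumulate intersections and unions over the whole dataset before dividing, which can give a different number.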
mAP (detection)
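Sort predictions by descending score and mark each one as a true or false positive; sweeping the score threshold down the sorted list yields a precision–recall curve. AP is the area under that curve (or an interpolated variant), and mAP averages AP over classes. COCO-style evaluation additionally averages over IoU thresholds from 0.5 to 0.95 in steps of 0.05, reports results by object area (small, medium, large), and caps the number of detections per image. Use pycocotools or framework builtins for exact parity.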
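As a rough illustration of the AP computation for a single class, the sketch below uses all-point interpolation (area under the monotone precision envelope). It assumes each detection has already been marked as a true or false positive at one IoU threshold; the function name `average_precision` is hypothetical, and this is not a replacement for pycocotools.

```python
import numpy as np


def average_precision(scores, is_tp, num_gt):
    """All-point interpolated AP for one class (sketch).

    scores: confidence per detection
    is_tp:  1 if the detection matched a ground truth, else 0
    num_gt: number of ground-truth objects of this class
    """
    order = np.argsort(-np.asarray(scores, dtype=float))
    tp = np.asarray(is_tp, dtype=float)[order]
    fp = 1.0 - tp
    cum_tp = np.cumsum(tp)
    cum_fp = np.cumsum(fp)
    recall = cum_tp / max(num_gt, 1)
    precision = cum_tp / np.maximum(cum_tp + cum_fp, 1e-12)

    # Add sentinels, then make precision non-increasing from left to right.
    r = np.concatenate(([0.0], recall, [1.0]))
    p = np.concatenate(([0.0], precision, [0.0]))
    p = np.maximum.accumulate(p[::-1])[::-1]

    # Sum rectangle areas at the points where recall changes.
    idx = np.where(r[1:] != r[:-1])[0]
    return float(np.sum((r[idx + 1] - r[idx]) * p[idx + 1]))


ap = average_precision(scores=[0.9, 0.8, 0.7], is_tp=[1, 0, 1], num_gt=2)
```

mAP would then be the mean of this value over classes, and a COCO-style number would further average over the IoU thresholds used to produce `is_tp`.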
Takeaways
- Always document splits, preprocessing, and class definitions when reporting numbers.
- For deployment, also measure latency, memory, and failure cases.
- Statistical tests or confidence intervals help when differences are small.
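One common way to get a confidence interval for a metric is a percentile bootstrap over test samples. The sketch below applies it to accuracy; the function name `bootstrap_ci`, the number of resamples, and the toy data are all assumptions made for illustration.

```python
import numpy as np


def bootstrap_ci(correct, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for mean accuracy.

    correct: 1/0 per test sample (1 = prediction was right).
    """
    correct = np.asarray(correct, dtype=float)
    rng = np.random.default_rng(seed)
    n = len(correct)
    # Resample test indices with replacement, one row per bootstrap replicate.
    idx = rng.integers(0, n, size=(n_boot, n))
    means = correct[idx].mean(axis=1)
    lo, hi = np.quantile(means, [alpha / 2, 1 - alpha / 2])
    return correct.mean(), lo, hi


# Toy data: 80 correct predictions out of 100.
acc, lo, hi = bootstrap_ci([1] * 80 + [0] * 20)
```

If two models' intervals overlap heavily, the difference between them may not be meaningful. When both models are scored on the same samples, a paired comparison is usually more sensitive.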
ImageNet
WordNet and synsets
Each class corresponds to a WordNet synset ID (e.g. n01440764, “tench”). Hierarchical relations in WordNet are not always reflected in flat 1-of-K training; using a hierarchical loss is an optional research direction.
ILSVRC tasks
Beyond single-label classification, historical challenges included localization (bbox) and detection. Today, COCO is more common for detection benchmarks, while ImageNet-pretrained backbones remain the default initialization.
Using label names in code
from torchvision.models import resnet50, ResNet50_Weights
w = ResNet50_Weights.IMAGENET1K_V2
names = w.meta["categories"] # list of 1000 strings
# idx = logits.argmax(dim=1).item()
# print(names[idx])
Different weight versions may share the same 1k label order—confirm in the weights metadata you load.
Transfer learning
Replace the classifier head, freeze early layers optionally, train on your domain. Features are biased toward ImageNet objects; medical or industrial imagery may need more adaptation or different pretraining (SimCLR, CLIP, domain-specific data).
Takeaways
- ImageNet scale enabled modern CNNs; data governance and consent practices have evolved since early collection.
- Top-1 / top-5 error is reported on the fixed ImageNet val split; do not compare it blindly to numbers from your own test set.
- Consider newer pretraining (web-scale contrastive) when label semantics differ.
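For reference, the top-1 / top-5 error mentioned in the takeaways can be computed from a matrix of logits as sketched below. The function name `topk_error` and the toy inputs are illustrative assumptions; `k` must not exceed the number of classes.

```python
import numpy as np


def topk_error(logits, labels, k=5):
    """Fraction of samples whose true label is not among the k highest logits."""
    logits = np.asarray(logits)
    labels = np.asarray(labels)
    # Indices of the k largest logits per row (order within the k is arbitrary).
    topk = np.argpartition(-logits, k - 1, axis=1)[:, :k]
    hit = (topk == labels[:, None]).any(axis=1)
    return 1.0 - hit.mean()


logits = [[0.1, 0.7, 0.2],
          [0.5, 0.3, 0.2]]
labels = [2, 0]
top1 = topk_error(logits, labels, k=1)
top2 = topk_error(logits, labels, k=2)
```

In this toy case the first sample is wrong at top-1 but correct at top-2, so the error drops from 0.5 to 0.0.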
COCO dataset
Annotation structure (instances)
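Top-level keys include images (id, file_name, height, width), annotations (image_id, category_id, bbox [x,y,w,h], area, segmentation, iscrowd), and categories (id, name, supercategory). The bbox is given as top-left corner plus width and height in pixels. iscrowd=1 marks regions covering groups of objects, whose segmentation is stored as RLE rather than polygons; evaluation treats them differently from single-object polygons.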
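To make the structure concrete, here is a tiny made-up annotation dictionary with the keys listed above, plus two hypothetical helpers: one groups annotations by image, the other converts a COCO [x, y, w, h] box to the [x1, y1, x2, y2] form used by the IoU function earlier. All values are invented for illustration.

```python
# Minimal, made-up content of an instances file (values are illustrative only).
coco_like = {
    "images": [
        {"id": 1, "file_name": "example.jpg", "height": 480, "width": 640},
    ],
    "annotations": [
        {
            "id": 10,
            "image_id": 1,
            "category_id": 1,
            "bbox": [100.0, 50.0, 40.0, 80.0],  # [x, y, w, h]
            "area": 3200.0,
            "segmentation": [],
            "iscrowd": 0,
        },
    ],
    "categories": [{"id": 1, "name": "person", "supercategory": "person"}],
}


def xywh_to_xyxy(bbox):
    # COCO stores top-left corner plus size; convert to two corners.
    x, y, w, h = bbox
    return [x, y, x + w, y + h]


def annotations_by_image(dataset):
    # Build an image_id -> list of annotations lookup.
    index = {}
    for ann in dataset["annotations"]:
        index.setdefault(ann["image_id"], []).append(ann)
    return index


by_image = annotations_by_image(coco_like)
box = xywh_to_xyxy(by_image[1][0]["bbox"])
```

Mixing up the two box conventions is a frequent source of silently wrong IoU values, so it helps to convert once at the data-loading boundary.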
pycocotools
# pip install pycocotools
from pycocotools.coco import COCO
ann_file = "annotations/instances_val2017.json"
coco = COCO(ann_file)
ids = coco.getImgIds(catIds=coco.getCatIds(catNms=["person"]))
img = coco.loadImgs(ids[0])[0]
Use coco.loadAnns / coco.showAnns for visualization; detection evaluation uses the COCOeval class (pycocotools.cocoeval) with predictions stored as JSON in the COCO results format.
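The results file for box detection is a JSON list with one entry per detection. The sketch below builds such a list from corner-format boxes; `to_coco_results` and the file name in the comments are hypothetical, and the evaluation calls are shown only as comments because they need pycocotools and real annotation files.

```python
import json


def to_coco_results(image_id, boxes_xyxy, scores, category_ids):
    """Convert [x1, y1, x2, y2] detections for one image to COCO result dicts."""
    results = []
    for (x1, y1, x2, y2), s, c in zip(boxes_xyxy, scores, category_ids):
        results.append({
            "image_id": int(image_id),
            "category_id": int(c),
            "bbox": [float(x1), float(y1), float(x2 - x1), float(y2 - y1)],
            "score": float(s),
        })
    return results


results = to_coco_results(42, [[10, 20, 50, 80]], [0.9], [1])
payload = json.dumps(results)  # write this string to a results file

# With pycocotools and real files, evaluation typically follows this pattern
# (not executed here; "detections.json" is a placeholder path):
#   from pycocotools.cocoeval import COCOeval
#   dt = coco.loadRes("detections.json")
#   ev = COCOeval(coco, dt, iouType="bbox")
#   ev.evaluate(); ev.accumulate(); ev.summarize()
```

The `category_id` values must be the dataset's own category IDs, which are not contiguous in COCO, rather than the model's 0-based class indices.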
Related tracks
- Captions — image ↔ sentence pairs; metrics include BLEU, CIDEr, SPICE.
- Keypoints — 17 body joints per person instance.
- Panoptic — joint stuff + thing segmentation (separate challenge materials).
torchvision built-ins
from torchvision.datasets import CocoDetection
# root = image folder, annFile = instances_*.json
# ds = CocoDetection(root, annFile, transform=your_transform)
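By default, CocoDetection yields the image and a list of raw annotation dicts. Pair it with the detection transforms (v2 APIs in recent torchvision), for example through torchvision.datasets.wrap_dataset_for_transforms_v2, to get an image plus a target dict.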
Takeaways
- Always match evaluation protocol (IoU range, max detections, area buckets) when comparing papers.
- The 2017 split is the common modern reference; the older 2014 split still appears in legacy code.
- Respect the COCO license and attribution when redistributing derived sets.