Evaluation & Benchmarks

CV evaluation metrics

IoU for axis-aligned boxes

import numpy as np

def iou_box(a, b):
    # a,b = [x1,y1,x2,y2]
    x1 = max(a[0], b[0])
    y1 = max(a[1], b[1])
    x2 = min(a[2], b[2])
    y2 = min(a[3], b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = max(0, a[2] - a[0]) * max(0, a[3] - a[1])
    area_b = max(0, b[2] - b[0]) * max(0, b[3] - b[1])
    union = area_a + area_b - inter + 1e-6
    return inter / union

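A prediction counts as a true positive for a class if its IoU with a still-unmatched ground-truth box of that class is at least the threshold (e.g. 0.5). Each ground-truth box can be matched at most once, so duplicate detections of the same object count as false positives.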
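
The matching rule above can be sketched as a greedy loop over predictions in descending score order. This is a minimal illustration for one class in one image, not the exact procedure of any particular benchmark; the names `iou_xyxy` and `match_detections` are made up for this example, and skipping already-matched ground truths is one common variant.

```python
def iou_xyxy(a, b):
    """IoU of two [x1, y1, x2, y2] boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def match_detections(preds, gts, iou_thr=0.5):
    """Label each prediction as TP or FP (hypothetical helper).

    preds: list of (score, box); gts: list of boxes, all for ONE class in ONE image.
    Returns a list of (score, is_tp) in descending score order.
    """
    matched = set()  # indices of ground-truth boxes already claimed
    labelled = []
    for score, box in sorted(preds, key=lambda p: p[0], reverse=True):
        best_iou, best_j = 0.0, -1
        for j, gt in enumerate(gts):
            if j in matched:
                continue  # each ground truth may be matched only once
            overlap = iou_xyxy(box, gt)
            if overlap > best_iou:
                best_iou, best_j = overlap, j
        if best_j >= 0 and best_iou >= iou_thr:
            matched.add(best_j)
            labelled.append((score, True))
        else:
            labelled.append((score, False))
    return labelled


gts = [[0, 0, 10, 10]]
preds = [(0.8, [1, 1, 10, 10]), (0.9, [0, 0, 10, 10]), (0.3, [50, 50, 60, 60])]
labelled = match_detections(preds, gts)
```

Here the second-best detection overlaps the object well but arrives after the ground truth is already claimed, so it is labelled a false positive.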

Precision, recall, F1

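from sklearn.metrics import precision_recall_fscore_support

# y_true, y_pred: per-sample class indices (toy values for illustration)
y_true = [0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 1, 1, 1]
p, r, f1, _ = precision_recall_fscore_support(y_true, y_pred, average="macro")
# average="weighted" weights each class by its support;
# average="binary" scores only the positive class in a two-class problem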
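
To make the difference between macro and weighted averaging concrete, here is a small NumPy-only sketch that computes per-class precision, recall, and F1 from raw counts. The function name `per_class_prf` and the toy labels are assumptions for illustration; it is meant to show the arithmetic, not to replace the scikit-learn call.

```python
import numpy as np


def per_class_prf(y_true, y_pred, num_classes):
    """Per-class precision, recall, F1 and support (hypothetical helper)."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    precision = np.zeros(num_classes)
    recall = np.zeros(num_classes)
    f1 = np.zeros(num_classes)
    support = np.zeros(num_classes, dtype=int)
    for k in range(num_classes):
        tp = int(np.sum((y_pred == k) & (y_true == k)))
        fp = int(np.sum((y_pred == k) & (y_true != k)))
        fn = int(np.sum((y_pred != k) & (y_true == k)))
        precision[k] = tp / (tp + fp) if tp + fp else 0.0
        recall[k] = tp / (tp + fn) if tp + fn else 0.0
        denom = precision[k] + recall[k]
        f1[k] = 2 * precision[k] * recall[k] / denom if denom else 0.0
        support[k] = tp + fn  # number of true samples of class k
    return precision, recall, f1, support


y_true = [0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 1, 1, 1]
precision, recall, f1, support = per_class_prf(y_true, y_pred, num_classes=2)

macro_f1 = f1.mean()                           # every class counts equally
weighted_f1 = np.average(f1, weights=support)  # classes weighted by support
```

Macro averaging treats the rare class as equal to the frequent one, which is usually what you want when minority classes matter.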

Segmentation IoU / Dice

For label masks A, B ∈ {0,…,K-1}^(H×W), the IoU for class k is |{A=k} ∩ {B=k}| / |{A=k} ∪ {B=k}|. Dice is 2|A∩B| / (|A|+|B|), computed on binary masks or per class. Report the mean over classes, excluding void (ignore) pixels if the protocol requires it.
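
A minimal NumPy sketch of these definitions follows. The function name `per_class_iou_dice`, the `ignore_index` parameter, and the value 255 for void pixels are assumptions for this example; classes absent from both masks get NaN so they can be skipped when averaging.

```python
import numpy as np


def per_class_iou_dice(pred, target, num_classes, ignore_index=None):
    """Per-class IoU and Dice for integer label masks (hypothetical helper)."""
    pred = np.asarray(pred)
    target = np.asarray(target)
    if ignore_index is None:
        valid = np.ones(target.shape, dtype=bool)
    else:
        valid = target != ignore_index  # drop void pixels from all counts
    ious, dices = [], []
    for k in range(num_classes):
        p = (pred == k) & valid
        t = (target == k) & valid
        inter = np.logical_and(p, t).sum()
        union = np.logical_or(p, t).sum()
        total = p.sum() + t.sum()
        ious.append(inter / union if union else np.nan)
        dices.append(2 * inter / total if total else np.nan)
    return np.array(ious), np.array(dices)


target = np.array([[0, 0, 1], [1, 1, 255]])  # 255 = void in this toy example
pred = np.array([[0, 1, 1], [1, 1, 0]])
ious, dices = per_class_iou_dice(pred, target, num_classes=2, ignore_index=255)
miou = np.nanmean(ious)  # mean over classes that actually occur
```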

mAP (detection)

Sort predictions by descending score and sweep the score threshold to trace a precision–recall curve; AP is the area under that curve (or an interpolated variant). mAP averages AP over classes. COCO-style evaluation additionally averages over IoU thresholds 0.50:0.95 (step 0.05), reports results by object area (small, medium, large), and caps the number of detections per image. Use pycocotools or framework built-ins for exact parity.
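
As a rough sketch of the AP computation for a single class at a single IoU threshold, the function below uses all-point interpolation (the area under the monotone precision envelope). It assumes each prediction has already been labelled TP or FP by a matching step; the name `average_precision` is chosen for this example, and COCO's own evaluator samples the curve at fixed recall points instead, so numbers will not match pycocotools exactly.

```python
import numpy as np


def average_precision(scores, is_tp, num_gt):
    """AP with all-point interpolation for one class (illustrative sketch)."""
    order = np.argsort(-np.asarray(scores, dtype=float), kind="stable")
    tp = np.asarray(is_tp, dtype=float)[order]
    cum_tp = np.cumsum(tp)
    cum_fp = np.cumsum(1.0 - tp)
    recall = cum_tp / max(num_gt, 1)
    precision = cum_tp / (cum_tp + cum_fp)

    # Add sentinels, then make precision monotonically non-increasing.
    mrec = np.concatenate(([0.0], recall, [1.0]))
    mpre = np.concatenate(([0.0], precision, [0.0]))
    mpre = np.maximum.accumulate(mpre[::-1])[::-1]

    # Sum rectangle areas wherever recall changes.
    idx = np.where(mrec[1:] != mrec[:-1])[0]
    return float(np.sum((mrec[idx + 1] - mrec[idx]) * mpre[idx + 1]))


ap = average_precision(scores=[0.9, 0.8, 0.7, 0.6],
                       is_tp=[True, False, True, False],
                       num_gt=2)
```

mAP would then be the mean of this value over classes (and, for COCO-style reporting, over IoU thresholds as well).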

Takeaways

Always document splits, preprocessing, and class definitions when reporting numbers.
For deployment, also measure latency, memory, and failure cases.
Statistical tests or confidence intervals help when differences are small.
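
One simple way to attach a confidence interval to a small difference between two models is a paired bootstrap over test samples. The sketch below assumes you have a per-sample 0/1 correctness vector for each model on the same test set; the function name, the number of resamples, and the 95% level are choices made for this example.

```python
import numpy as np


def bootstrap_diff_ci(correct_a, correct_b, n_boot=2000, alpha=0.05, seed=0):
    """Paired bootstrap CI for accuracy(A) - accuracy(B) (illustrative sketch)."""
    a = np.asarray(correct_a, dtype=float)
    b = np.asarray(correct_b, dtype=float)
    if a.shape != b.shape:
        raise ValueError("both models must be scored on the same samples")
    rng = np.random.default_rng(seed)
    n = a.shape[0]
    # Resample test-set indices with replacement; the same indices are used
    # for both models so the comparison stays paired.
    idx = rng.integers(0, n, size=(n_boot, n))
    diffs = a[idx].mean(axis=1) - b[idx].mean(axis=1)
    lo, hi = np.quantile(diffs, [alpha / 2, 1 - alpha / 2])
    return float(a.mean() - b.mean()), float(lo), float(hi)


# Toy correctness vectors (1 = correct prediction) for two models.
model_a = [1, 1, 0, 1, 1, 0, 1, 1, 1, 0]
model_b = [1, 0, 0, 1, 1, 0, 1, 0, 1, 0]
diff, lo, hi = bootstrap_diff_ci(model_a, model_b)
```

If the interval comfortably includes zero, the observed gap may be noise from the particular test set.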

                

Quick FAQ

Why not accuracy on detection?

Images contain a variable number of objects, so there is no fixed set of samples over which to compute accuracy; mAP summarizes localization and classification jointly at multiple score cutoffs.

Test-time augmentation?

Averaging predictions over flipped/scaled views can boost metrics; declare TTA in papers and watch inference cost.
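
For classification, horizontal-flip TTA can be sketched in a few lines. `predict_logits` below is a hypothetical callable standing in for your model (it takes a batch of shape (N, H, W, C) and returns (N, K) logits); for detection or segmentation the flipped outputs would also need to be mapped back to the original image coordinates before averaging.

```python
import numpy as np


def softmax(z, axis=-1):
    """Numerically stable softmax."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def predict_with_hflip_tta(predict_logits, images):
    """Average class probabilities over the original and mirrored views.

    predict_logits: hypothetical model function, (N, H, W, C) -> (N, K).
    """
    probs = softmax(predict_logits(images))
    probs_flipped = softmax(predict_logits(images[:, :, ::-1, :]))  # flip width axis
    return 0.5 * (probs + probs_flipped)
```

Note that this doubles inference cost, which is the trade-off the answer above refers to.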

ImageNet

WordNet and synsets

Each class corresponds to a WordNet synset ID (e.g. n01440764, "tench"). The hierarchical relations in WordNet are not reflected in standard flat 1-of-K training; a hierarchy-aware loss is an optional research direction.

ILSVRC tasks

Beyond single-label classification, historical challenges included localization (bbox) and detection. Today, COCO is more common for detection benchmarks, while ImageNet-pretrained backbones remain the default initialization.

Using label names in code

from torchvision.models import resnet50, ResNet50_Weights

w = ResNet50_Weights.IMAGENET1K_V2
names = w.meta["categories"]  # list of 1000 strings
# idx = logits.argmax(dim=1).item()
# print(names[idx])

Different weight versions may share the same 1k label order, but confirm this against the metadata of the weights you actually load.

Transfer learning

Replace the classifier head, optionally freeze the early layers, and train on your domain. Features are biased toward ImageNet objects; medical or industrial imagery may need more adaptation or different pretraining (SimCLR, CLIP, domain-specific data).

Takeaways

ImageNet scale enabled modern CNNs; data governance and consent practices have evolved since early collection.
Top-1 / top-5 error is reported on a fixed validation split; do not compare it blindly to numbers from your own test set.
Consider newer pretraining (web-scale contrastive) when label semantics differ.
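
For reference, top-k error is the fraction of samples whose true label is not among the k highest-scoring classes. A small NumPy sketch, with the function name and toy logits chosen for this example:

```python
import numpy as np


def topk_error(logits, labels, k=5):
    """Fraction of samples whose label is missing from the top-k predictions."""
    logits = np.asarray(logits, dtype=float)
    labels = np.asarray(labels)
    topk = np.argsort(-logits, axis=1)[:, :k]       # indices of the k largest logits
    hit = (topk == labels[:, None]).any(axis=1)
    return 1.0 - hit.mean()


logits = [[0.1, 0.7, 0.2],
          [0.5, 0.3, 0.2],
          [0.2, 0.3, 0.5]]
labels = [1, 1, 0]
top1 = topk_error(logits, labels, k=1)
top2 = topk_error(logits, labels, k=2)
```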

                

Quick FAQ

ImageNet-21k?

A larger superset with roughly 21k classes, used to pretrain some models (ViT, EfficientNet variants); preprocessing and head shapes differ from 1k checkpoints.

Tiny ImageNet?

A common teaching subset (64×64 images, 200 classes), not the official full ImageNet benchmark.

COCO dataset

Annotation structure (instances)

Top-level keys include images (id, file_name, height, width), annotations (image_id, category_id, bbox [x,y,w,h], area, segmentation, iscrowd), and categories (id, name, supercategory). iscrowd=1 marks RLE regions for groups; evaluation rules differ from single-object polygons.
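
The structure described above can be illustrated with a tiny hand-written dictionary. All values below (IDs, file name, coordinates) are made up for the example, and the helpers `xywh_to_xyxy` and `boxes_by_image` are hypothetical names; the point is the [x, y, w, h] box convention and how annotations link to images and categories by ID.

```python
coco_like = {
    "images": [
        {"id": 1, "file_name": "000001.jpg", "height": 480, "width": 640},
    ],
    "annotations": [
        {
            "id": 10, "image_id": 1, "category_id": 1,
            "bbox": [100.0, 50.0, 40.0, 80.0],  # [x, y, width, height]
            "area": 3200.0,
            "segmentation": [[100, 50, 140, 50, 140, 130, 100, 130]],
            "iscrowd": 0,
        },
    ],
    "categories": [{"id": 1, "name": "person", "supercategory": "person"}],
}


def xywh_to_xyxy(bbox):
    """Convert COCO [x, y, w, h] to corner format [x1, y1, x2, y2]."""
    x, y, w, h = bbox
    return [x, y, x + w, y + h]


def boxes_by_image(dataset, skip_crowd=True):
    """Group (category name, xyxy box) pairs by image_id."""
    cat_names = {c["id"]: c["name"] for c in dataset["categories"]}
    grouped = {}
    for ann in dataset["annotations"]:
        if skip_crowd and ann.get("iscrowd", 0) == 1:
            continue  # crowd regions are usually handled separately
        entry = (cat_names[ann["category_id"]], xywh_to_xyxy(ann["bbox"]))
        grouped.setdefault(ann["image_id"], []).append(entry)
    return grouped


grouped = boxes_by_image(coco_like)
```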

pycocotools

# pip install pycocotools
from pycocotools.coco import COCO

ann_file = "annotations/instances_val2017.json"
coco = COCO(ann_file)
ids = coco.getImgIds(catIds=coco.getCatIds(catNms=["person"]))
img = coco.loadImgs(ids[0])[0]

Use coco.loadAnns / coco.showAnns for visualization; detection evaluation uses COCOeval (from pycocotools.cocoeval) with predictions stored as JSON in the COCO result format.
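
The result format for box detection is a flat JSON list with one entry per detection. The converter below is a sketch that assumes your detector emits corner-format boxes; `to_coco_results` and the file name `results.json` are hypothetical. The pycocotools calls are shown as comments because they need the real annotation file on disk.

```python
import json


def to_coco_results(detections):
    """Convert (image_id, category_id, [x1, y1, x2, y2], score) tuples
    into COCO detection-result dicts with [x, y, w, h] boxes."""
    results = []
    for image_id, category_id, (x1, y1, x2, y2), score in detections:
        results.append({
            "image_id": int(image_id),
            "category_id": int(category_id),  # COCO category id, not a 0..K-1 index
            "bbox": [float(x1), float(y1), float(x2 - x1), float(y2 - y1)],
            "score": float(score),
        })
    return results


results = to_coco_results([(1, 1, [100, 50, 140, 130], 0.9)])
payload = json.dumps(results)  # write this string to e.g. results.json

# With pycocotools installed and real files available, evaluation typically runs as:
#   from pycocotools.coco import COCO
#   from pycocotools.cocoeval import COCOeval
#   coco_gt = COCO("annotations/instances_val2017.json")
#   coco_dt = coco_gt.loadRes("results.json")
#   evaluator = COCOeval(coco_gt, coco_dt, iouType="bbox")
#   evaluator.evaluate(); evaluator.accumulate(); evaluator.summarize()
```

A frequent source of silently wrong scores is writing the model's contiguous class index instead of the dataset's category id, so it is worth checking that mapping first.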

Related tracks

Captions — image ↔ sentence pairs; metrics include BLEU, CIDEr, SPICE.
Keypoints — 17 body joints per person instance.
Panoptic — joint stuff + thing segmentation (separate challenge materials).

torchvision built-ins

from torchvision.datasets import CocoDetection

# root = image folder, annFile = instances_*.json
# ds = CocoDetection(root, annFile, transform=your_transform)

Pair with detection transforms (v2 APIs in recent torchvision) to return image + target dict.

Takeaways

Always match evaluation protocol (IoU range, max detections, area buckets) when comparing papers.
The 2017 split is the common modern reference; the older 2014 split still appears in legacy code.
Respect the COCO license and attribution when redistributing derived sets.

Quick FAQ

minival?

Historically a subset of val held out for faster iteration; definitions vary, so state exactly which image IDs you use.

COCO vs Open Images?

Open Images is larger and multi-label; evaluation tooling differs. Choose based on class vocabulary and annotation type.
