Computer Vision Chapter 46

CV evaluation metrics

Choosing metrics that match the task avoids misleading comparisons. Classification uses accuracy, precision/recall, F1, ROC-AUC, and confusion matrices (watch class imbalance). Object detection pairs predicted and ground-truth boxes with IoU thresholds, then averages precision–recall into mAP (COCO uses multiple IoUs and max detections). Segmentation often reports mean IoU or Dice per class. Below: IoU in NumPy and a classification F1 sketch.

IoU for axis-aligned boxes

import numpy as np

def iou_box(a, b):
    # a, b = [x1, y1, x2, y2], assuming x2 > x1 and y2 > y1
    x1 = np.maximum(a[0], b[0])
    y1 = np.maximum(a[1], b[1])
    x2 = np.minimum(a[2], b[2])
    y2 = np.minimum(a[3], b[3])
    # Clamp to 0 so disjoint boxes give zero intersection, not a negative one.
    inter = np.maximum(0.0, x2 - x1) * np.maximum(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter + 1e-6  # epsilon guards against divide-by-zero
    return inter / union

A prediction is a “true positive” for a class if its IoU with an as-yet unmatched ground-truth box of that class meets the threshold (e.g. 0.5); each ground truth can be matched at most once.
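The matching rule above can be sketched as a greedy matcher: process predictions in descending score order and let each claim the best still-unmatched ground truth. The function name and return format are illustrative assumptions, not a standard API; the IoU helper is repeated so the snippet runs standalone.

```python
def iou_box(a, b):
    # Same scalar IoU as above, repeated so this snippet is self-contained.
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def match_predictions(preds, gts, thr=0.5):
    """preds: list of (score, box); gts: list of boxes (one class).
    Returns [(score, is_true_positive)] in descending score order."""
    used = [False] * len(gts)
    out = []
    for score, box in sorted(preds, key=lambda p: -p[0]):
        best, best_iou = -1, thr
        for j, g in enumerate(gts):
            if not used[j]:
                v = iou_box(box, g)
                if v >= best_iou:
                    best, best_iou = j, v
        if best >= 0:
            used[best] = True   # each ground truth matches at most once
            out.append((score, True))
        else:
            out.append((score, False))
    return out
```

A duplicate detection of an already-matched object counts as a false positive, which is exactly what penalizes double predictions in mAP.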

Precision, recall, F1

from sklearn.metrics import precision_recall_fscore_support

# y_true, y_pred: per-sample class indices
p, r, f1, _ = precision_recall_fscore_support(y_true, y_pred, average="macro")
# average="weighted" weights by support; "binary" for two-class
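To make the macro average concrete, here is a small pure-Python version of the same computation (a sketch of what the sklearn call does, not a replacement for it):

```python
def macro_f1(y_true, y_pred):
    # Macro F1: compute F1 per class, then average unweighted,
    # so rare classes count as much as common ones.
    classes = sorted(set(y_true) | set(y_pred))
    f1s = []
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(f1s) / len(f1s)
```

For example, `macro_f1([0, 0, 1, 1], [0, 1, 1, 1])` averages F1 of 2/3 (class 0) and 0.8 (class 1).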

Segmentation IoU / Dice

For label masks A, B ∈ {0,…,K−1}^(H×W), per-class IoU is |{A=k} ∩ {B=k}| / |{A=k} ∪ {B=k}|. Dice is 2|A∩B| / (|A| + |B|), computed on binary masks or per class. Report the mean over classes, excluding a void/ignore class when the evaluation protocol requires it.
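The per-class formulas translate directly to NumPy; a minimal sketch (function name is an assumption):

```python
import numpy as np

def per_class_iou_dice(a, b, num_classes):
    """a, b: integer label masks of shape (H, W).
    Returns (ious, dices), one value per class; NaN for absent classes."""
    ious, dices = [], []
    for k in range(num_classes):
        ak, bk = (a == k), (b == k)
        inter = np.logical_and(ak, bk).sum()
        union = np.logical_or(ak, bk).sum()
        ious.append(inter / union if union else float("nan"))
        total = ak.sum() + bk.sum()
        dices.append(2 * inter / total if total else float("nan"))
    return ious, dices
```

Mean IoU is then the nan-aware average over the returned list (e.g. `np.nanmean`), skipping classes absent from both masks.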

mAP (detection)

Sort predictions by score; traverse thresholds to build precision–recall curve; AP is area under that curve (or interpolated variant). mAP averages AP over classes. COCO-style evaluation adds IoU 0.5:0.95, area splits, and caps on detections per image—use pycocotools or framework builtins for exact parity.
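Assuming predictions for one class have already been labeled true/false positive as in the matching step, AP can be sketched as follows (names are illustrative; use pycocotools for benchmark-exact numbers):

```python
def average_precision(scored_tps, num_gt):
    """scored_tps: [(score, is_tp)] for one class; num_gt: total ground truths.
    Returns AP as the area under the precision-recall curve
    (all-points interpolation)."""
    scored_tps = sorted(scored_tps, key=lambda x: -x[0])
    tps = fps = 0
    precisions, recalls = [], []
    for _, is_tp in scored_tps:
        tps += is_tp
        fps += not is_tp
        precisions.append(tps / (tps + fps))
        recalls.append(tps / num_gt)
    # Make the precision envelope non-increasing from the right
    # before integrating, as standard AP definitions do.
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])
    ap, prev_r = 0.0, 0.0
    for p, r in zip(precisions, recalls):
        ap += p * (r - prev_r)
        prev_r = r
    return ap
```

mAP is then the mean of this value over classes (and, COCO-style, over IoU thresholds).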

Takeaways

  • Always document splits, preprocessing, and class definitions when reporting numbers.
  • For deployment, also measure latency, memory, and failure cases.
  • Statistical tests or confidence intervals help when differences are small.
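For the last point, a percentile bootstrap is a common way to attach a confidence interval to any per-sample metric; a minimal sketch under the assumption of i.i.d. samples (function names are illustrative):

```python
import random

def bootstrap_ci(metric_fn, y_true, y_pred, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI: resample (y_true, y_pred) pairs with
    replacement, recompute the metric, take the alpha/2 quantiles."""
    rng = random.Random(seed)
    n = len(y_true)
    stats = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        stats.append(metric_fn([y_true[i] for i in idx],
                               [y_pred[i] for i in idx]))
    stats.sort()
    lo = stats[int(alpha / 2 * n_boot)]
    hi = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi

# Example metric: plain accuracy.
acc = lambda t, p: sum(a == b for a, b in zip(t, p)) / len(t)
```

If two models' intervals overlap heavily on the same test set, the observed difference may not be meaningful.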

Quick FAQ

Why report mAP for detection rather than plain accuracy? Images contain variable numbers of objects; mAP summarizes localization and classification jointly across multiple score cutoffs.

Does test-time augmentation (TTA) affect reported numbers? Averaging predictions over flipped or scaled views can boost metrics; declare TTA in papers and watch the inference cost.
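The flip-averaging form of TTA can be sketched in a few lines; `model_fn` is an assumed callable mapping an image array to a probability vector, not a specific framework API:

```python
import numpy as np

def tta_predict(model_fn, image):
    """Average class probabilities over the original image and its
    horizontal flip (a minimal TTA sketch; more views scale cost linearly)."""
    views = [image, image[:, ::-1]]          # original + horizontal flip
    probs = [model_fn(v) for v in views]
    return np.mean(probs, axis=0)
```

For tasks with spatial outputs (segmentation, detection), remember to un-flip each view's prediction before averaging.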