Computer Vision Chapter 48

COCO dataset

MS COCO (Common Objects in Context) provides everyday-scene images with rich annotations: object detection (bounding boxes), instance segmentation (polygon or RLE masks), person keypoints, and captions. It is a standard benchmark for detection/segmentation mAP. Official files are JSON; the pycocotools package loads annotations and implements COCO-style evaluation. Train/val/test splits and download scripts are documented on the COCO website.

Annotation structure (instances)

Top-level keys include images (id, file_name, height, width), annotations (image_id, category_id, bbox in [x, y, width, height] format with the origin at the image's top-left, area, segmentation, iscrowd), and categories (id, name, supercategory). iscrowd=1 marks RLE-encoded masks for groups of objects; during evaluation, detections matched to crowd regions are ignored rather than counted as false positives, unlike single-object polygon annotations.
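A minimal instances-style file can be sketched as plain Python before serializing to JSON. All ids, coordinates, and the file name below are made up for illustration; only the key names and the [x, y, width, height] bbox convention follow the real format:

```python
import json

# Hypothetical minimal instances annotation file (ids and values invented).
instances = {
    "images": [
        {"id": 1, "file_name": "000000000001.jpg", "height": 480, "width": 640}
    ],
    "annotations": [
        {
            "id": 10,
            "image_id": 1,
            "category_id": 1,                    # e.g. "person" in the real file
            "bbox": [100.0, 120.0, 50.0, 80.0],  # [x, y, width, height]
            "area": 4000.0,                      # mask area; bbox w*h here for simplicity
            "segmentation": [[100.0, 120.0, 150.0, 120.0,
                              150.0, 200.0, 100.0, 200.0]],  # polygon, iscrowd=0
            "iscrowd": 0,
        }
    ],
    "categories": [{"id": 1, "name": "person", "supercategory": "person"}],
}

text = json.dumps(instances)  # this is what an annotations/*.json file contains
print(sorted(instances.keys()))
```

The same three top-level lists appear in every instances file; only their contents change between splits.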

pycocotools

# pip install pycocotools
from pycocotools.coco import COCO

ann_file = "annotations/instances_val2017.json"
coco = COCO(ann_file)  # builds indices over images, annotations, categories

# all image ids containing at least one "person" annotation
ids = coco.getImgIds(catIds=coco.getCatIds(catNms=["person"]))
img = coco.loadImgs(ids[0])[0]  # dict with file_name, height, width, ...

Use coco.loadAnns / coco.showAnns for visualization; detection evaluation uses the COCOeval class, with predictions supplied as JSON in the COCO result format and loaded via coco.loadRes.
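The result format is just a flat list of per-detection dicts. A hedged sketch of building one (image_id, category_id, boxes, and scores below are invented; real values must match ids in the ground-truth file), with the evaluation calls shown as comments:

```python
import json

# Hypothetical detections for one image; each entry is one predicted box.
results = [
    {"image_id": 1, "category_id": 1,
     "bbox": [100.0, 120.0, 50.0, 80.0], "score": 0.92},
    {"image_id": 1, "category_id": 18,
     "bbox": [300.0, 200.0, 40.0, 40.0], "score": 0.55},
]

with open("detections.json", "w") as f:
    json.dump(results, f)

# With ground truth loaded as `coco`, evaluation would then look like:
#   from pycocotools.cocoeval import COCOeval
#   coco_dt = coco.loadRes("detections.json")
#   ev = COCOeval(coco, coco_dt, iouType="bbox")
#   ev.evaluate(); ev.accumulate(); ev.summarize()
```

COCOeval then matches predictions to ground truth per category and reports AP across IoU thresholds.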

Related tracks

  • Captions — image ↔ sentence pairs; metrics include BLEU, CIDEr, SPICE.
  • Keypoints — 17 body joints per person instance.
  • Panoptic — joint stuff + thing segmentation (separate challenge materials).
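Keypoint annotations store the 17 joints as a flat list of [x, y, v] triplets, where v=0 means not labeled, v=1 labeled but occluded, and v=2 visible. A small sketch with made-up coordinates that unpacks the flat list:

```python
# Made-up keypoints for one person: 17 triplets flattened into 51 numbers.
# Only the first two joints are labeled here; the rest are zeroed out.
keypoints = [320, 100, 2, 325, 95, 1] + [0, 0, 0] * 15

# Unpack into (x, y, visibility) triplets, one per joint.
joints = [tuple(keypoints[i:i + 3]) for i in range(0, len(keypoints), 3)]
num_labeled = sum(1 for _, _, v in joints if v > 0)

print(len(joints), num_labeled)  # 17 joints, 2 labeled
```

The annotation's num_keypoints field in the real files equals this labeled count.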

torchvision built-ins

from torchvision.datasets import CocoDetection

# root = image folder, annFile = instances_*.json
# ds = CocoDetection(root, annFile, transform=your_transform)

Pair it with detection transforms (the transforms.v2 API in recent torchvision) so the dataset returns an image plus a target dict of boxes and labels.
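Because CocoDetection yields a variable number of annotations per image, the default DataLoader collation fails; the usual fix is a collate function that keeps samples as lists instead of stacking. A minimal sketch in pure Python (the dummy batch stands in for real dataset output, so it runs without any annotation files):

```python
def detection_collate(batch):
    """Collate (image, target) pairs without stacking: targets differ in length."""
    images, targets = zip(*batch)
    return list(images), list(targets)

# Usage with a DataLoader (ds being a CocoDetection instance):
#   loader = DataLoader(ds, batch_size=2, collate_fn=detection_collate)

# Dummy batch mimicking dataset output: (image, list-of-annotation-dicts).
fake_batch = [("img0", [{"bbox": [0, 0, 10, 10]}]), ("img1", [])]
images, targets = detection_collate(fake_batch)
print(len(images), len(targets[0]), len(targets[1]))  # 2 1 0
```

Note the second image legitimately has zero annotations; per-sample lists handle that case where tensor stacking would not.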

Takeaways

  • Always match evaluation protocol (IoU range, max detections, area buckets) when comparing papers.
  • 2017 split is the common modern reference; older 2014 still appears in legacy code.
  • Respect the COCO license and attribution when redistributing derived sets.
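COCO mAP averages over IoU thresholds 0.50:0.95 in steps of 0.05, so the matching criterion is worth making concrete. A quick sketch of IoU for two boxes in COCO's [x, y, w, h] format:

```python
def iou_xywh(a, b):
    """Intersection-over-union of two boxes in COCO [x, y, w, h] format."""
    ax1, ay1, ax2, ay2 = a[0], a[1], a[0] + a[2], a[1] + a[3]
    bx1, by1, bx2, by2 = b[0], b[1], b[0] + b[2], b[1] + b[3]
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))  # overlap width
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))  # overlap height
    inter = iw * ih
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0

print(iou_xywh([0, 0, 10, 10], [5, 0, 10, 10]))  # half-overlapping boxes -> IoU = 1/3
```

Two boxes shifted by half their width overlap on a third of their union, which is why they count as a match at IoU 0.3 but a miss at the standard 0.5 threshold.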

Quick FAQ

What is "minival"? Historically a subset of val used for faster iteration; definitions vary across codebases, so state exactly which image IDs you use.

How does COCO compare to Open Images? Open Images is larger and multi-label, and its evaluation tooling differs. Choose based on class vocabulary and annotation type.