Outputs per detection
For each instance, models typically emit: bounding box (x1,y1,x2,y2), class id, confidence score, and a mask (often 28×28 logits upsampled to the ROI and thresholded). Overlapping instances can occlude each other—ordering (painter’s algorithm) or alpha blending matters for visualization.
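The occlusion-ordering point above can be sketched in plain NumPy: a minimal alpha-blend overlay that paints lower-scoring instances first (painter's algorithm), so higher-confidence masks end up on top where instances overlap. The function names and the color/score conventions here are illustrative, not from any particular library.

```python
import numpy as np

def overlay_masks(image, masks, colors, scores, alpha=0.5):
    """Alpha-blend binary instance masks onto an RGB image.

    image:  (H, W, 3) float array in [0, 1]
    masks:  list of (H, W) boolean arrays
    colors: list of (3,) float arrays in [0, 1]
    scores: per-instance confidences; lower scores are painted first
            so higher-confidence instances win where masks overlap.
    """
    out = image.copy()
    order = np.argsort(scores)  # ascending: low-score instances first
    for i in order:
        m = masks[i]
        out[m] = (1 - alpha) * out[m] + alpha * colors[i]
    return out

# Two overlapping instances on a 4x4 gray image.
img = np.full((4, 4, 3), 0.5)
m1 = np.zeros((4, 4), bool); m1[:2, :] = True   # top half, score 0.9
m2 = np.zeros((4, 4), bool); m2[1:3, :] = True  # middle rows, score 0.6
blended = overlay_masks(img, [m1, m2],
                        [np.array([1.0, 0.0, 0.0]),   # red
                         np.array([0.0, 0.0, 1.0])],  # blue
                        scores=[0.9, 0.6])
# Row 1 (the overlap) ends up red-tinted: the 0.9-score mask is painted last.
```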
Mask IoU
Intersection over union of predicted vs. ground-truth binary masks; on benchmarks like COCO, detections are matched by mask IoU and averaged across IoU thresholds (0.50–0.95) into mask AP.
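The metric itself is a one-liner over boolean arrays; a minimal sketch (the empty-mask convention is an assumption, evaluators differ on that edge case):

```python
import numpy as np

def mask_iou(pred, gt):
    """IoU of two binary masks: |A ∩ B| / |A ∪ B|."""
    pred = pred.astype(bool)
    gt = gt.astype(bool)
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return inter / union if union else 1.0  # both empty: treat as match

a = np.zeros((4, 4), bool); a[:, :2] = True   # left two columns (8 px)
b = np.zeros((4, 4), bool); b[:, 1:3] = True  # middle two columns (8 px)
iou = mask_iou(a, b)  # intersection = 4 px, union = 12 px -> 1/3
```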
Panoptic
Unifies semantic “stuff” labels with instance “things” in one image—each pixel has class + optional instance id.
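One common trick for storing "class + optional instance id" in a single map is an integer encoding, similar in spirit to COCO panoptic's segment ids. The offset of 1000 and the specific class ids below are assumptions for illustration:

```python
import numpy as np

OFFSET = 1000  # assumes fewer than 1000 instances per class per image

def encode_panoptic(class_map, instance_map):
    """Pack class + instance into one id per pixel.
    'Stuff' pixels carry instance id 0; 'things' a positive instance id."""
    return class_map.astype(np.int64) * OFFSET + instance_map

def decode_panoptic(pan):
    return pan // OFFSET, pan % OFFSET

cls = np.array([[21, 21], [1, 1]])  # e.g. 21 = grass (stuff), 1 = person
ins = np.array([[0, 0], [1, 2]])    # two distinct person instances
pan = encode_panoptic(cls, ins)
c, i = decode_panoptic(pan)         # round-trips back to class + instance
```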
Mask R-CNN building blocks
- Backbone + FPN — multi-scale feature pyramid.
- Region Proposal Network (RPN) — objectness boxes in one forward pass.
- ROIAlign — bilinear sampling of features at proposal locations (fixes quantization error vs ROI Pool).
- Class + box head — refines category and box.
- Mask head — parallel branch predicting K binary masks per ROI (one channel per class); trained with a per-pixel sigmoid and average binary cross-entropy, computed only on the channel of the ground-truth class.
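The ROIAlign step above can be sketched in NumPy to show what "bilinear sampling, no quantization" means: each output bin samples the feature map at a continuous coordinate (here one sample at the bin center; the real op averages several samples per bin). This is a simplified single-channel sketch, not the production kernel.

```python
import numpy as np

def bilinear(feat, y, x):
    """Bilinearly interpolate feat (H, W) at continuous coords (y, x)."""
    H, W = feat.shape
    y0, x0 = int(np.floor(y)), int(np.floor(x))
    y1, x1 = min(y0 + 1, H - 1), min(x0 + 1, W - 1)
    dy, dx = y - y0, x - x0
    return (feat[y0, x0] * (1 - dy) * (1 - dx)
            + feat[y0, x1] * (1 - dy) * dx
            + feat[y1, x0] * dy * (1 - dx)
            + feat[y1, x1] * dy * dx)

def roi_align_1sample(feat, box, out_size):
    """Sample an out_size x out_size grid inside box = (y1, x1, y2, x2),
    one sample per bin at the bin center. Coordinates stay fractional,
    unlike ROI Pool's integer rounding of bin edges."""
    y1b, x1b, y2b, x2b = box
    bh = (y2b - y1b) / out_size
    bw = (x2b - x1b) / out_size
    out = np.empty((out_size, out_size))
    for i in range(out_size):
        for j in range(out_size):
            out[i, j] = bilinear(feat,
                                 y1b + (i + 0.5) * bh,
                                 x1b + (j + 0.5) * bw)
    return out

feat = np.arange(16, dtype=float).reshape(4, 4)
pooled = roi_align_1sample(feat, (0.5, 0.5, 2.5, 2.5), 2)
```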
Inference: Mask R-CNN (torchvision)
import torch
import torchvision.transforms as T
from torchvision.models.detection import maskrcnn_resnet50_fpn
from PIL import Image
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = maskrcnn_resnet50_fpn(weights="DEFAULT").to(device).eval()
img = Image.open("party.jpg").convert("RGB")
tensor = T.functional.to_tensor(img).to(device)
with torch.no_grad():
    out = model([tensor])[0]

scores = out["scores"]
labels = out["labels"]
boxes = out["boxes"]
masks = out["masks"]  # (N, 1, H, W), values in [0, 1]

for i in range(len(scores)):
    if scores[i] < 0.7:
        continue
    mask = (masks[i, 0] > 0.5).cpu().numpy()
    # overlay mask on image with OpenCV or PIL
COCO class ids: map `labels` through the COCO 91-category list, e.g. `MaskRCNN_ResNet50_FPN_Weights.DEFAULT.meta["categories"]` from torchvision's weights API.
Other families (names to search)
YOLACT — real-time instance masks via prototype masks + linear coefficients. SOLO / SOLOv2 — grid-based instance categories. Detectron2 (Meta) and Segment Anything (SAM) — strong interactive or promptable masks. Choice depends on latency, accuracy, and training budget.
Annotations
Instance datasets store polygon or RLE (run-length encoded) masks per object. COCO JSON is the de facto format; tools like LabelMe, CVAT, or Roboflow export compatible labels.
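To make the RLE format concrete, here is a minimal decoder for COCO-style uncompressed RLE, under COCO's conventions: `counts` alternates run lengths of background (0) and foreground (1), starting with background, laid out in column-major (Fortran) order. In practice you would use `pycocotools.mask.decode`; this is an illustrative sketch.

```python
import numpy as np

def decode_rle(counts, size):
    """Decode COCO-style uncompressed RLE into a binary (H, W) mask."""
    h, w = size
    flat = np.zeros(h * w, dtype=np.uint8)
    pos, val = 0, 0  # runs start with background (0)
    for run in counts:
        flat[pos:pos + run] = val
        pos += run
        val = 1 - val  # alternate 0 / 1
    return flat.reshape((h, w), order="F")  # column-major, as COCO stores it

# 3x3 mask: 3 background pixels (all of column 0), then 4 foreground, then 2 background
mask = decode_rle([3, 4, 2], (3, 3))
```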
Takeaways
- Instance = detect objects and separate pixel ownership per object.
- Mask R-CNN = Faster R-CNN + mask branch + ROIAlign.
- Evaluate with mask AP, not only box AP.