Computer Vision Chapter 17

Instance segmentation

Instance segmentation answers: “Which pixels belong to this specific object?” Two people in a scene get two separate binary masks, not one merged “person” region. It combines object detection (where + what class) with a per-instance mask. The influential Mask R-CNN adds a small fully convolutional mask head on top of Faster R-CNN, predicting a fixed-resolution mask per region proposal.

Outputs per detection

For each instance, models typically emit: bounding box (x1,y1,x2,y2), class id, confidence score, and a mask (often 28×28 logits upsampled to the ROI and thresholded). Overlapping instances can occlude each other—ordering (painter’s algorithm) or alpha blending matters for visualization.

Mask IoU

Intersection over union of the predicted and ground-truth binary masks. On benchmarks like COCO it is averaged over IoU thresholds (0.50:0.95), categories, and images into mask AP (often written APmask).
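The metric is a few lines of NumPy; a minimal sketch (the function name is illustrative):

```python
import numpy as np

def mask_iou(pred: np.ndarray, gt: np.ndarray) -> float:
    """IoU between two binary masks of the same shape."""
    pred = pred.astype(bool)
    gt = gt.astype(bool)
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return float(inter / union) if union > 0 else 0.0

# Two overlapping 4x4 squares on a 6x6 grid:
# intersection = 4 pixels, union = 16 + 16 - 4 = 28.
a = np.zeros((6, 6)); a[0:4, 0:4] = 1
b = np.zeros((6, 6)); b[2:6, 2:6] = 1
print(mask_iou(a, b))  # 4/28 ≈ 0.143
```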

Panoptic

Unifies semantic “stuff” labels with instance “things” in one image—each pixel has class + optional instance id.
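One common (though not universal) way to store this is to pack class and instance into a single per-pixel id, e.g. class_id * divisor + instance_id, with instance_id = 0 for "stuff"; a toy sketch with an assumed divisor of 1000:

```python
import numpy as np

DIVISOR = 1000  # assumed label divisor; conventions vary by dataset

semantic = np.array([[1, 1], [1, 7]])  # class per pixel (ids are made up)
instance = np.array([[1, 2], [2, 0]])  # instance id per pixel; 0 = stuff

# Encode both into one panoptic id map, then decode it back.
panoptic = semantic * DIVISOR + instance
print(panoptic)
print(panoptic // DIVISOR)  # recovers the semantic map
print(panoptic % DIVISOR)   # recovers the instance map
```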

Mask R-CNN building blocks

  1. Backbone + FPN — multi-scale feature pyramid.
  2. Region Proposal Network (RPN) — objectness boxes in one forward pass.
  3. ROIAlign — bilinear sampling of features at proposal locations (fixes quantization error vs ROI Pool).
  4. Class + box head — refines category and box.
  5. Mask head — parallel fully convolutional branch predicting K binary masks per ROI (one per class); only the channel for the ground-truth class is trained, with a per-pixel sigmoid and average binary cross-entropy loss, so classes do not compete within a mask.

Inference: Mask R-CNN (torchvision)

import torch
import torchvision.transforms as T
from torchvision.models.detection import maskrcnn_resnet50_fpn
from PIL import Image

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = maskrcnn_resnet50_fpn(weights="DEFAULT").to(device).eval()

img = Image.open("party.jpg").convert("RGB")
tensor = T.functional.to_tensor(img).to(device)

with torch.no_grad():
    out = model([tensor])[0]

scores = out["scores"]
labels = out["labels"]
boxes = out["boxes"]
masks = out["masks"]  # (N, 1, H, W), values in [0,1]

for i in range(len(scores)):
    if scores[i] < 0.7:
        continue
    mask = (masks[i, 0] > 0.5).cpu().numpy()
    # overlay mask on image with OpenCV or PIL

COCO class ids: map labels to names via the pretrained weights' metadata, e.g. MaskRCNN_ResNet50_FPN_Weights.DEFAULT.meta["categories"] (the 91-category COCO list).
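The overlay step in the loop above can be sketched with PIL and NumPy; the helper name, color, and alpha are illustrative, and the demo uses a synthetic image and mask so it runs stand-alone:

```python
import numpy as np
from PIL import Image

def overlay_mask(img: Image.Image, mask: np.ndarray,
                 color=(255, 0, 0), alpha=0.5) -> Image.Image:
    """Alpha-blend a boolean mask onto a PIL image."""
    out = np.array(img).astype(np.float32)
    # Blend only the masked pixels toward the overlay color.
    out[mask] = (1 - alpha) * out[mask] + alpha * np.array(color, np.float32)
    return Image.fromarray(out.astype(np.uint8))

# Synthetic stand-ins for the image and a thresholded instance mask.
img = Image.new("RGB", (8, 8), (0, 255, 0))
mask = np.zeros((8, 8), dtype=bool); mask[2:6, 2:6] = True
blended = overlay_mask(img, mask)
print(blended.getpixel((3, 3)))  # (127, 127, 0): half green, half red
```

To show several instances at once, either paint them back-to-front by score (painter's algorithm) or blend each with its own color.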

Other families (names to search)

YOLACT — real-time instance masks via prototype masks + linear coefficients. SOLO / SOLOv2 — grid-based instance categories. Detectron2 (Meta) and Segment Anything (SAM) — strong interactive or promptable masks. Choice depends on latency, accuracy, and training budget.

Annotations

Instance datasets store polygon or RLE (run-length encoded) masks per object. COCO JSON is the de facto format; tools like LabelMe, CVAT, or Roboflow export compatible labels.
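To make the RLE idea concrete, here is a toy uncompressed encoder/decoder following COCO's convention (column-major flattening, counts alternating 0-runs then 1-runs, starting with zeros). Real pipelines use pycocotools, which also supports a compressed string form:

```python
import numpy as np

def rle_encode(mask: np.ndarray) -> list:
    """Uncompressed COCO-style run lengths over the column-major mask."""
    flat = mask.flatten(order="F").astype(np.uint8)
    counts, prev, run = [], 0, 0
    for v in flat:
        if v == prev:
            run += 1
        else:
            counts.append(run)  # first count may be 0 if the mask starts with 1
            prev, run = v, 1
    counts.append(run)
    return counts

def rle_decode(counts: list, shape: tuple) -> np.ndarray:
    flat = np.zeros(int(np.prod(shape)), dtype=np.uint8)
    pos, val = 0, 0
    for c in counts:
        flat[pos:pos + c] = val
        pos += c
        val = 1 - val
    return flat.reshape(shape, order="F")

m = np.array([[0, 1], [1, 1]], dtype=np.uint8)
print(rle_encode(m))  # column-major flat is [0,1,1,1] -> [1, 3]
```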

Takeaways

  • Instance = detect objects and separate pixel ownership per object.
  • Mask R-CNN = Faster R-CNN + mask branch + ROIAlign.
  • Evaluate with mask AP, not only box AP.

Quick FAQ

Q: How can I improve masks for small objects?
Increase input resolution, use FPN with finer levels, or switch to anchor-free / transformer detectors. Data augmentation with copy-paste of small instances helps.
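The copy-paste idea fits in a few lines of NumPy; a minimal sketch (a full implementation would also transplant the pasted instance's box and mask annotations into the target's labels):

```python
import numpy as np

def copy_paste(dst_img: np.ndarray, src_img: np.ndarray,
               src_mask: np.ndarray, offset: tuple) -> np.ndarray:
    """Paste the masked pixels of src_img into dst_img at (row, col) offset."""
    out = dst_img.copy()
    rows, cols = np.nonzero(src_mask)
    out[rows + offset[0], cols + offset[1]] = src_img[rows, cols]
    return out

# Toy data: paste a small bright instance into a dark scene.
dst = np.zeros((8, 8, 3), dtype=np.uint8)
src = np.full((4, 4, 3), 200, dtype=np.uint8)
mask = np.zeros((4, 4), dtype=bool); mask[1:3, 1:3] = True
aug = copy_paste(dst, src, mask, offset=(2, 2))
print(aug[3, 3])  # [200 200 200]: pasted pixel
```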

Q: Can I fine-tune Mask R-CNN on a single custom class?
Yes—set num_classes=2 in a custom model (background + your class) and fine-tune from COCO weights, replacing the prediction heads.