Computer Vision Chapter 7

Deep Segmentation

Segmentation overview, semantic pixel-wise labeling, and instance segmentation with masks.

Segmentation overview

Classical vs learning-based

Classical / interactive

Fast, no training data; needs hand-tuned assumptions (color clusters, user scribbles, smooth regions). Great for controlled capture or preprocessing.

Deep segmentation

Learns appearance and context from labeled images; handles diverse scenes. Heavier compute and dataset requirements.

Threshold + morphology + contours

The simplest “segmentation” is a binary mask: foreground vs background. Clean it with morphology, then extract outer boundaries or filled regions.

import cv2

gray = cv2.imread("blob.png", cv2.IMREAD_GRAYSCALE)
blur = cv2.GaussianBlur(gray, (5, 5), 0)
_, bw = cv2.threshold(blur, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)

k = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (5, 5))
bw = cv2.morphologyEx(bw, cv2.MORPH_OPEN, k, iterations=1)

cnts, _ = cv2.findContours(bw, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
vis = cv2.cvtColor(gray, cv2.COLOR_GRAY2BGR)
cv2.drawContours(vis, cnts, -1, (0, 255, 0), 2)
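
To see what the cv2.THRESH_OTSU flag is choosing, here is a minimal NumPy sketch of Otsu's criterion: pick the threshold that maximizes the between-class variance of the histogram. The helper name otsu_threshold and the toy image are illustrative assumptions, and OpenCV's own implementation may differ in tie-breaking and numerical details.

```python
import numpy as np

def otsu_threshold(gray):
    """Return t such that pixels > t form the brighter class (illustrative helper)."""
    hist = np.bincount(gray.ravel(), minlength=256).astype(np.float64)
    prob = hist / hist.sum()
    omega = np.cumsum(prob)                  # P(pixel <= t) for each candidate t
    mu = np.cumsum(prob * np.arange(256))    # cumulative first moment up to t
    mu_total = mu[-1]
    denom = omega * (1.0 - omega)
    sigma_b2 = np.zeros(256)
    valid = denom > 0                        # skip thresholds that leave one class empty
    sigma_b2[valid] = (mu_total * omega[valid] - mu[valid]) ** 2 / denom[valid]
    return int(np.argmax(sigma_b2))

# Toy bimodal image: left half dark (50), right half bright (200).
toy = np.full((8, 8), 50, dtype=np.uint8)
toy[:, 4:] = 200
t = otsu_threshold(toy)
mask = toy > t
```

On real images the histogram is rarely this clean, which is why the blur before thresholding and the morphological opening afterwards matter.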

Connected components

Label each 4- or 8-connected blob; filter by area for counting objects or removing speckles.

import cv2

n, labels, stats, centroids = cv2.connectedComponentsWithStats(bw, connectivity=8)
# stats has one row per label; columns are [x, y, width, height, area]
h, w = bw.shape
out = cv2.cvtColor(bw, cv2.COLOR_GRAY2BGR)
for i in range(1, n):
    if stats[i, cv2.CC_STAT_AREA] < 100:
        continue
    x, y, ww, hh = stats[i, 0], stats[i, 1], stats[i, 2], stats[i, 3]
    cv2.rectangle(out, (x, y), (x + ww, y + hh), (255, 0, 0), 1)
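
The difference between 4- and 8-connectivity is easiest to see on a tiny mask. The following is a sketch of breadth-first labeling in plain Python and NumPy; label_components is a hypothetical helper written for illustration, not how OpenCV implements the call above.

```python
from collections import deque
import numpy as np

def label_components(binary, connectivity=8):
    """Label nonzero blobs; returns (label image, areas indexed by label)."""
    if connectivity == 4:
        offsets = [(-1, 0), (1, 0), (0, -1), (0, 1)]
    else:
        offsets = [(dy, dx) for dy in (-1, 0, 1) for dx in (-1, 0, 1)
                   if (dy, dx) != (0, 0)]
    h, w = binary.shape
    labels = np.zeros((h, w), dtype=np.int32)
    areas = [0]  # index 0 stands for background; its area is not tracked
    for y in range(h):
        for x in range(w):
            if binary[y, x] == 0 or labels[y, x] != 0:
                continue
            current = len(areas)          # next unused label
            labels[y, x] = current
            queue = deque([(y, x)])
            area = 0
            while queue:
                cy, cx = queue.popleft()
                area += 1
                for dy, dx in offsets:
                    ny, nx = cy + dy, cx + dx
                    if (0 <= ny < h and 0 <= nx < w
                            and binary[ny, nx] and labels[ny, nx] == 0):
                        labels[ny, nx] = current
                        queue.append((ny, nx))
            areas.append(area)
    return labels, areas

# A 2x2 blob touching a single pixel only at a corner.
toy = np.array([[1, 1, 0, 0],
                [1, 1, 0, 0],
                [0, 0, 1, 0],
                [0, 0, 0, 0]], dtype=np.uint8)
_, areas8 = label_components(toy, connectivity=8)
_, areas4 = label_components(toy, connectivity=4)
big = [i for i, a in enumerate(areas4) if i > 0 and a >= 2]  # area filter
```

With 8-connectivity the diagonal pixel joins the blob; with 4-connectivity it becomes a separate one-pixel component that the area filter removes.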

Watershed with markers

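Watershed treats an image as a height map and floods it from seed regions; seeded from every local minimum, it oversegments. Provide markers for sure foreground, sure background, and an unknown band, and let the flooding assign the unknown pixels. In the recipe below, the distance transform supplies the sure-foreground seeds near blob centers.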

import cv2
import numpy as np

img = cv2.imread("coins.jpg")
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
_, th = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)

# noise removal
kernel = np.ones((3, 3), np.uint8)
opening = cv2.morphologyEx(th, cv2.MORPH_OPEN, kernel, iterations=2)
sure_bg = cv2.dilate(opening, kernel, iterations=3)

dist = cv2.distanceTransform(opening, cv2.DIST_L2, 5)
_, sure_fg = cv2.threshold(dist, 0.5 * dist.max(), 255, 0)
sure_fg = np.uint8(sure_fg)
unknown = cv2.subtract(sure_bg, sure_fg)

_, markers = cv2.connectedComponents(sure_fg)
markers = markers + 1
markers[unknown == 255] = 0

markers = cv2.watershed(img, markers)
img[markers == -1] = [0, 0, 255]

Boundaries are marked as -1 in markers; here drawn in red on the original BGR image.
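
To make the sure-foreground step concrete, here is a brute-force sketch of what a Euclidean distance transform computes: for every foreground pixel, the distance to the nearest zero pixel. The helper distance_to_background is an illustrative assumption; cv2.distanceTransform with a 5x5 mask uses a fast approximation rather than this exhaustive search.

```python
import numpy as np

def distance_to_background(binary):
    """Distance from each nonzero pixel to the nearest zero pixel (exhaustive toy version)."""
    fg = np.argwhere(binary > 0)
    bg = np.argwhere(binary == 0)
    dist = np.zeros(binary.shape, dtype=np.float64)
    if len(fg) == 0 or len(bg) == 0:
        return dist
    diff = fg[:, None, :] - bg[None, :, :]           # all foreground/background pairs
    nearest = np.sqrt((diff ** 2).sum(axis=2)).min(axis=1)
    dist[fg[:, 0], fg[:, 1]] = nearest
    return dist

# A 5x5 square inside a 7x7 image.
toy = np.zeros((7, 7), dtype=np.uint8)
toy[1:6, 1:6] = 255
dist = distance_to_background(toy)
sure_fg = dist > 0.5 * dist.max()   # same 0.5 * max rule as the recipe above
```

Thresholding the distance map keeps only pixels deep inside a blob, so two touching objects usually yield two separate seeds even though their thresholded masks are merged.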

GrabCut (interactive box)

GrabCut iteratively refines a Gaussian mixture model of foreground/background color. Start from a rectangle or a user mask.

import cv2
import numpy as np

img = cv2.imread("portrait.jpg")
mask = np.zeros(img.shape[:2], np.uint8)
bgd = np.zeros((1, 65), np.float64)
fgd = np.zeros((1, 65), np.float64)
h, w = img.shape[:2]
rect = (int(0.1 * w), int(0.05 * h), int(0.8 * w), int(0.9 * h))

cv2.grabCut(img, mask, rect, bgd, fgd, 5, cv2.GC_INIT_WITH_RECT)
binmask = np.where((mask == 2) | (mask == 0), 0, 1).astype("uint8")
fg = img * binmask[:, :, np.newaxis]

Refine with sure-foreground strokes (concept)

# mask values: GC_BGD, GC_FGD, GC_PR_BGD, GC_PR_FGD — set user scribbles then:
# cv2.grabCut(img, mask, None, bgd, fgd, 5, cv2.GC_INIT_WITH_MASK)
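
A sketch of how such a mask might be assembled before the GC_INIT_WITH_MASK call. The integer codes mirror the OpenCV constants; the stroke positions and image size are hypothetical stand-ins for real user input.

```python
import numpy as np

# Numeric values mirror cv2.GC_BGD, cv2.GC_FGD, cv2.GC_PR_BGD, cv2.GC_PR_FGD.
GC_BGD, GC_FGD, GC_PR_BGD, GC_PR_FGD = 0, 1, 2, 3

h, w = 6, 8
mask = np.full((h, w), GC_PR_BGD, dtype=np.uint8)  # default: probably background
mask[1:5, 2:7] = GC_PR_FGD    # inside a rough box: probably foreground
mask[2:4, 3:5] = GC_FGD       # hypothetical user stroke: definitely foreground
mask[0, :] = GC_BGD           # hypothetical user stroke: definitely background

# After grabCut has updated the mask in place, keep both foreground codes.
binmask = np.isin(mask, (GC_FGD, GC_PR_FGD)).astype(np.uint8)
```

Only the "probably" labels are revised by the algorithm; pixels marked as definite foreground or background act as hard constraints.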

k-means clustering in LAB

Flatten pixels to feature vectors (e.g. L,a,b), run cv2.kmeans, map labels back to an image—simple color segmentation.

import cv2
import numpy as np

bgr = cv2.imread("fruit.jpg")
lab = cv2.cvtColor(bgr, cv2.COLOR_BGR2LAB)
h, w, c = lab.shape
Z = lab.reshape(-1, 3).astype(np.float32)

criteria = (cv2.TERM_CRITERIA_EPS + cv2.TERM_CRITERIA_MAX_ITER, 20, 0.5)
K = 4
_, labels, centers = cv2.kmeans(Z, K, None, criteria, 10, cv2.KMEANS_PP_CENTERS)

centers_u8 = np.uint8(centers)
seg = centers_u8[labels.flatten()].reshape(h, w, 3)
seg_bgr = cv2.cvtColor(seg, cv2.COLOR_LAB2BGR)
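
For intuition about what cv2.kmeans iterates on, here is a minimal NumPy sketch of Lloyd's algorithm. The deterministic farthest-point initialization is a simplified stand-in for KMEANS_PP_CENTERS, and the synthetic "pixels" are an assumption made so the example runs without an image file.

```python
import numpy as np

def kmeans(points, k, iters=20):
    """Plain Lloyd iterations with farthest-point initialization (illustrative)."""
    centers = [points[0]]
    for _ in range(1, k):
        d = np.min([((points - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(points[np.argmax(d)])   # farthest point from chosen centers
    centers = np.array(centers, dtype=np.float64)
    for _ in range(iters):
        # Assignment step: nearest center for every point.
        d2 = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # Update step: each center moves to the mean of its members.
        for j in range(k):
            members = points[labels == j]
            if len(members):                   # keep the old center if a cluster empties
                centers[j] = members.mean(axis=0)
    return labels, centers

rng = np.random.default_rng(0)
dark = rng.normal(30.0, 2.0, size=(50, 3))      # 50 dark "pixels"
bright = rng.normal(200.0, 2.0, size=(50, 3))   # 50 bright "pixels"
Z = np.vstack([dark, bright]).astype(np.float32)
labels, centers = kmeans(Z, k=2)
```

Because the features are colors only, two objects with the same color end up in the same cluster regardless of where they sit in the image.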

What’s next in this series

Semantic segmentation assigns a class label to every pixel (road, sky, person) with networks like FCN, U-Net, DeepLab. Instance segmentation additionally separates individual object masks (Mask R-CNN). The sections that follow cover those topics, moving from classical pipelines to trained models.

Takeaways

  • Always combine raw thresholds with morphology and area filters for robust masks.
  • Watershed needs markers—distance transform + connected components is a standard recipe.
  • GrabCut and k-means leverage color statistics; deep models add semantic understanding.

Quick FAQ

Improve the sure-foreground mask (higher distance threshold, better thresholding) or incorporate gradient-based markers. Markers are the main control lever.

Try elbow method on within-cluster error, or pick K from the number of dominant colors you expect. For objects with shared colors, pure k-means will merge them—use learning-based segmentation instead.

Semantic segmentation

Problem setup

Input: RGB image H×W×3. Output: label map H×W with integer class ids (0 … C−1), often plus a special ignore index for unlabeled pixels. Training pairs are images plus pixel-wise masks. Unlike object detection, there are no boxes: the network makes one classification decision at every pixel.
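
To make the shapes concrete, here is a NumPy sketch of per-pixel cross-entropy over a (C, H, W) score tensor and an (H, W) label map. The ignore value 255 is an assumed convention (it is common, but datasets differ), and pixel_cross_entropy is a hypothetical helper rather than a library function.

```python
import numpy as np

def pixel_cross_entropy(logits, target, ignore_index=255):
    """logits: (C, H, W) scores; target: (H, W) class ids. Mean over non-ignored pixels."""
    shifted = logits - logits.max(axis=0, keepdims=True)           # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=0, keepdims=True))
    valid = target != ignore_index
    safe_target = np.where(valid, target, 0)                       # placeholder id for ignored pixels
    picked = np.take_along_axis(log_probs, safe_target[None, :, :], axis=0)[0]
    return float(-(picked * valid).sum() / max(valid.sum(), 1))

C, H, W = 3, 2, 2
logits = np.zeros((C, H, W))                 # uniform scores: every class equally likely
target = np.array([[0, 1], [2, 255]])        # bottom-right pixel is unlabeled
loss = pixel_cross_entropy(logits, target)   # equals log(C) for uniform scores
pred = logits.argmax(axis=0)                 # (H, W) label map at inference time
```

The ignored pixel contributes nothing to the sum or to the denominator, so unlabeled regions neither reward nor penalize the model.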

vs instance segmentation

Semantic: all “person” pixels share one label. Instance: each person gets a separate object id and mask.

vs panoptic

Panoptic merges “stuff” (sky, road) and “things” (countable objects) into one unified labeling—beyond pure semantic.

Representative architectures

  • FCN: Fully Convolutional Networks replace dense layers with convolutions; skip connections fuse coarse semantic and fine spatial detail.
  • U-Net: Symmetric encoder–decoder with skip concatenations; very common in medical imaging and small datasets.
  • DeepLab: Uses atrous (dilated) convolutions to enlarge the receptive field without the resolution loss that pooling or striding causes; ASPP (atrous spatial pyramid pooling) combines multiple dilation rates in parallel.
  • SegFormer / Mask2Former: Transformer-based designs for strong context (covered in advanced courses).
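
How dilation enlarges the receptive field can be checked with a little arithmetic: a k-tap kernel with dilation d spans k + (k − 1)(d − 1) input positions while keeping k weights. The sketch below uses a 1-D toy convolution; the dilation rates are example values, not settings taken from any specific model configuration.

```python
import numpy as np

def effective_kernel(k, dilation):
    """Span covered by a k-tap kernel whose taps are `dilation` positions apart."""
    return k + (k - 1) * (dilation - 1)

def dilated_conv1d(x, weights, dilation):
    """Zero-padded 1-D dilated correlation; output length equals input length (odd k)."""
    k = len(weights)
    pad = (effective_kernel(k, dilation) - 1) // 2
    xp = np.pad(x, pad)
    return np.array([
        sum(weights[j] * xp[i + j * dilation] for j in range(k))
        for i in range(len(x))
    ])

spans = [effective_kernel(3, d) for d in (1, 6, 12, 18)]   # spans of a 3-tap kernel
x = np.arange(10, dtype=np.float64)
y = dilated_conv1d(x, np.array([1.0, 0.0, -1.0]), dilation=2)
```

The output keeps the input length, which is the point: context grows without a downsampling step that a decoder would later have to undo.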

Losses and metrics

Cross-entropy per pixel (with optional class weights) is the standard. For imbalanced classes (rare “pole” vs common “road”), use weighted CE, focal loss, or Dice/Lovász variants. IoU (Jaccard) per class and mIoU (mean over classes) are standard quality measures; pixel accuracy can be misleading when background dominates.

# Conceptual IoU for one class; both masks must be boolean NumPy arrays of the same shape
def iou(pred_mask, gt_mask):
    inter = (pred_mask & gt_mask).sum()
    union = (pred_mask | gt_mask).sum()
    return inter / (union + 1e-6)
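
For multi-class label maps, mIoU is usually accumulated through a confusion matrix so that every class is scored in one pass. The sketch below assumes an ignore value of 255 and skips classes that appear in neither prediction nor ground truth; both choices are conventions that vary between benchmarks.

```python
import numpy as np

def confusion_matrix(pred, gt, num_classes, ignore_index=255):
    """Rows are ground-truth classes, columns are predicted classes."""
    valid = gt != ignore_index
    idx = gt[valid].astype(np.int64) * num_classes + pred[valid].astype(np.int64)
    return np.bincount(idx, minlength=num_classes ** 2).reshape(num_classes, num_classes)

def mean_iou(cm):
    inter = np.diag(cm).astype(np.float64)
    union = cm.sum(axis=0) + cm.sum(axis=1) - inter
    present = union > 0
    per_class = np.full(len(cm), np.nan)       # NaN marks classes absent from both maps
    per_class[present] = inter[present] / union[present]
    return per_class, float(np.nanmean(per_class))

gt = np.array([[0, 0, 1], [1, 2, 255]])
pred = np.array([[0, 1, 1], [1, 2, 2]])
cm = confusion_matrix(pred, gt, num_classes=3)
per_class, miou = mean_iou(cm)
```

Here the per-class IoUs are 1/2, 2/3, and 1, so the mean is about 0.72 even though 4 of the 5 labeled pixels are correct; that gap is why pixel accuracy and mIoU are reported separately.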

Inference: DeepLabV3 (torchvision)

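Requires torch and torchvision. The default torchvision weights (COCO_WITH_VOC_LABELS_V1) are trained on a subset of COCO restricted to the 20 Pascal VOC categories plus background, so the model predicts 21 classes. Map labels to your task or fine-tune.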

import torch
import torchvision.transforms as T
from torchvision.models.segmentation import deeplabv3_resnet50

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = deeplabv3_resnet50(weights="DEFAULT").to(device).eval()

preprocess = T.Compose([
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

from PIL import Image
img = Image.open("street.jpg").convert("RGB")
inp = preprocess(img).unsqueeze(0).to(device)

with torch.no_grad():
    out = model(inp)["out"]  # (1, num_classes, h, w)
pred = out.argmax(dim=1).squeeze(0).cpu().numpy()

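torchvision's DeepLabV3 upsamples its output to the size of the input tensor, so pred already matches the image passed to the model. If you downscaled the image before inference, resize pred back to the original size with cv2.resize(..., interpolation=cv2.INTER_NEAREST) to avoid mixing class ids.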
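
Why nearest-neighbor matters for label maps: any interpolation that averages would invent ids that sit between two classes. A NumPy sketch of nearest-neighbor resizing by index lookup follows; the pixel-center rounding used here is an assumption and may differ by a pixel from OpenCV's convention.

```python
import numpy as np

def resize_nearest(label_map, out_h, out_w):
    """Nearest-neighbor resize by index lookup; never creates new class ids."""
    in_h, in_w = label_map.shape
    rows = np.minimum((np.arange(out_h) + 0.5) * in_h / out_h, in_h - 1).astype(np.int64)
    cols = np.minimum((np.arange(out_w) + 0.5) * in_w / out_w, in_w - 1).astype(np.int64)
    return label_map[rows[:, None], cols[None, :]]

small = np.array([[0, 15],
                  [7, 15]], dtype=np.uint8)
big = resize_nearest(small, 4, 6)
```

A bilinear resize of the same map would produce values such as 11 between classes 7 and 15, which would then be read as a different class entirely.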

Colorize label map for debugging

import numpy as np

# pseudo-color: map class id to RGB (toy palette of length num_classes)
palette = np.random.default_rng(0).integers(0, 255, size=(21, 3), dtype=np.uint8)
vis = palette[pred]

Training tips (brief)

Apply the same geometric transform to image and mask (flip, scale, crop). Use sync batch norm or group norm on small batch sizes. Start from ImageNet-pretrained backbones; freeze backbone briefly then unfreeze. Validate with mIoU on a held-out set, not just loss.
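
The first tip is the one most often broken in practice, so here is a sketch of a joint flip-and-crop in NumPy. The function name and the toy data are illustrative; the important detail is that the random decisions are drawn once and applied to both arrays.

```python
import numpy as np

def random_flip_crop(image, mask, crop_h, crop_w, rng):
    """Apply one random horizontal flip and one random crop to image and mask together."""
    if rng.random() < 0.5:
        image = image[:, ::-1]
        mask = mask[:, ::-1]
    h, w = mask.shape
    top = int(rng.integers(0, h - crop_h + 1))
    left = int(rng.integers(0, w - crop_w + 1))
    window = (slice(top, top + crop_h), slice(left, left + crop_w))
    return image[window], mask[window]

rng = np.random.default_rng(0)
image = np.zeros((8, 8, 3), dtype=np.uint8)
mask = np.zeros((8, 8), dtype=np.uint8)
mask[2:6, 1:4] = 1
image[mask == 1] = (255, 0, 0)   # toy image whose red pixels coincide with the mask
aug_img, aug_mask = random_flip_crop(image, mask, 5, 5, rng)
```

Photometric changes (brightness, blur, noise) are the exception: they apply to the image only, since the labels do not change with appearance.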

Takeaways

  • Semantic = one class label per pixel; instances of the same class are not separated.
  • Encoder–decoder designs recover full-resolution labels; dilated convs avoid losing resolution in the first place.
  • Report mIoU; handle class imbalance in the loss.

Quick FAQ

Strong downsampling without good skip connections smooths edges. Try U-Net-style skips, higher-resolution training crops, or boundary-aware losses.

Replace the final classifier conv layer to output C channels; fine-tune on your dataset. Reinitialize that layer; lower learning rate on the backbone.

Instance segmentation

Outputs per detection

For each instance, models typically emit: bounding box (x1,y1,x2,y2), class id, confidence score, and a mask (often 28×28 logits upsampled to the ROI and thresholded). Overlapping instances can occlude each other—ordering (painter’s algorithm) or alpha blending matters for visualization.
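
The step from a small per-ROI mask to a full-image mask can be sketched as follows. This simplified version assumes the box lies inside the image and uses nearest-neighbor lookup, whereas real implementations typically interpolate bilinearly; paste_mask is a hypothetical helper.

```python
import numpy as np

def paste_mask(mask_logits, box, image_h, image_w, threshold=0.5):
    """Resize a square logit map to its box, apply a sigmoid, and binarize."""
    x1, y1, x2, y2 = [int(round(v)) for v in box]
    bh, bw = y2 - y1, x2 - x1
    m = mask_logits.shape[0]
    rows = np.minimum((np.arange(bh) + 0.5) * m / bh, m - 1).astype(np.int64)
    cols = np.minimum((np.arange(bw) + 0.5) * m / bw, m - 1).astype(np.int64)
    probs = 1.0 / (1.0 + np.exp(-mask_logits[rows[:, None], cols[None, :]]))
    full = np.zeros((image_h, image_w), dtype=bool)
    full[y1:y2, x1:x2] = probs > threshold
    return full

# Toy 28x28 logits: left half confident foreground, right half confident background.
logits = np.full((28, 28), -4.0)
logits[:, :14] = 4.0
full = paste_mask(logits, box=(10, 20, 50, 60), image_h=80, image_w=100)
```

Each instance gets its own full-size boolean mask, so overlaps between instances are resolved only when the masks are drawn or merged.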

Mask IoU

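Intersection over union of predicted vs ground-truth binary masks. It decides which predictions count as matches when computing mask AP on benchmarks like COCO, where AP is averaged over IoU thresholds from 0.50 to 0.95.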

Panoptic

Unifies semantic “stuff” labels with instance “things” in one image—each pixel has class + optional instance id.

Mask R-CNN building blocks

  1. Backbone + FPN: multi-scale feature pyramid.
  2. Region Proposal Network (RPN): objectness boxes in one forward pass.
  3. ROIAlign: bilinear sampling of features at proposal locations (fixes quantization error vs ROI Pool).
  4. Class + box head: refines category and box.
  5. Mask head: parallel branch that predicts K binary masks per ROI (one channel per class), trained with a per-pixel sigmoid and binary cross-entropy averaged over the mask, applied only to the ground-truth class channel.
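
The core of the ROIAlign step is bilinear sampling at fractional coordinates instead of rounding to the nearest feature cell. Below is a deliberately simplified sketch with one sample at the center of each output bin; production implementations average several samples per bin and handle pixel-offset conventions that are left out here.

```python
import numpy as np

def bilinear_sample(feature, y, x):
    """Sample a 2-D feature map at a fractional (y, x) location."""
    h, w = feature.shape
    y = min(max(y, 0.0), h - 1.0)
    x = min(max(x, 0.0), w - 1.0)
    y0, x0 = int(np.floor(y)), int(np.floor(x))
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
    wy, wx = y - y0, x - x0
    top = (1 - wx) * feature[y0, x0] + wx * feature[y0, x1]
    bottom = (1 - wx) * feature[y1, x0] + wx * feature[y1, x1]
    return (1 - wy) * top + wy * bottom

def roi_align_simple(feature, box, out_size=2):
    """box = (x1, y1, x2, y2) in feature-map coordinates; one sample per bin center."""
    x1, y1, x2, y2 = box
    bin_h, bin_w = (y2 - y1) / out_size, (x2 - x1) / out_size
    out = np.zeros((out_size, out_size))
    for i in range(out_size):
        for j in range(out_size):
            out[i, j] = bilinear_sample(
                feature, y1 + (i + 0.5) * bin_h, x1 + (j + 0.5) * bin_w
            )
    return out

feature = np.arange(16, dtype=np.float64).reshape(4, 4)   # value = 4 * row + col
pooled = roi_align_simple(feature, box=(0.25, 0.25, 2.25, 2.25))
```

Because the toy feature map is linear in row and column, the sampled values equal 4y + x at each bin center, which makes it easy to confirm that no coordinate was rounded.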

Inference: Mask R-CNN (torchvision)

import torch
import torchvision.transforms as T
from torchvision.models.detection import maskrcnn_resnet50_fpn
from PIL import Image

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = maskrcnn_resnet50_fpn(weights="DEFAULT").to(device).eval()

img = Image.open("party.jpg").convert("RGB")
tensor = T.functional.to_tensor(img).to(device)

with torch.no_grad():
    out = model([tensor])[0]

scores = out["scores"]
labels = out["labels"]
boxes = out["boxes"]
masks = out["masks"]  # (N, 1, H, W), values in [0,1]

for i in range(len(scores)):
    if scores[i] < 0.7:
        continue
    mask = (masks[i, 0] > 0.5).cpu().numpy()
    # overlay mask on image with OpenCV or PIL

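COCO class names: read torchvision.models.detection.MaskRCNN_ResNet50_FPN_Weights.DEFAULT.meta["categories"] and index it with each label id. The list follows the COCO 91-category numbering, with background at index 0 and placeholder entries for unused ids.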

Other families (names to search)

YOLACT: real-time instance masks via prototype masks + linear coefficients. SOLO / SOLOv2: grid-based instance categories. Segment Anything (SAM): promptable masks from points or boxes. Detectron2 (Meta): a library with reference implementations of Mask R-CNN and related models. Choice depends on latency, accuracy, and training budget.

Annotations

Instance datasets store polygon or RLE (run-length encoded) masks per object. COCO JSON is the de facto format; tools like LabelMe, CVAT, or Roboflow export compatible labels.
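
Run-length encoding is simple enough to sketch directly. The version below follows the uncompressed COCO layout as commonly described: the mask is flattened in column-major order and the counts alternate between runs of 0 and runs of 1, starting with zeros. COCO files usually store a further compressed string form, which this sketch does not attempt.

```python
import numpy as np

def rle_encode(mask):
    """Run lengths over the column-major flattening, starting with a run of zeros."""
    flat = mask.astype(np.uint8).flatten(order="F")
    counts, current, run = [], 0, 0
    for value in flat:
        if value == current:
            run += 1
        else:
            counts.append(run)            # close the current run (may be 0 at the start)
            current, run = int(value), 1
    counts.append(run)
    return {"size": list(mask.shape), "counts": counts}

def rle_decode(rle):
    h, w = rle["size"]
    flat = np.zeros(h * w, dtype=np.uint8)
    pos, value = 0, 0
    for run in rle["counts"]:
        flat[pos:pos + run] = value
        pos += run
        value = 1 - value                 # runs alternate between 0 and 1
    return flat.reshape((h, w), order="F")

mask = np.array([[0, 1, 1],
                 [0, 1, 0]], dtype=np.uint8)
rle = rle_encode(mask)
restored = rle_decode(rle)
```

RLE is lossless and compact for large solid masks, while polygons are easier to edit by hand; that is why annotation tools tend to offer both.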

Takeaways

  • Instance = detect objects and separate pixel ownership per object.
  • Mask R-CNN = Faster R-CNN + mask branch + ROIAlign.
  • Evaluate with mask AP, not only box AP.

Quick FAQ

Increase input resolution, use FPN with finer levels, or switch to anchor-free / transformer detectors. Data augmentation with copy-paste of small instances helps.

Yes—set num_classes=2 in a custom model (background + your class) and fine-tune from COCO weights, replacing the prediction heads.
