Segmentation overview
Classical vs learning-based
Classical / interactive
Fast, no training data; needs hand-tuned assumptions (color clusters, user scribbles, smooth regions). Great for controlled capture or preprocessing.
Deep segmentation
Learns appearance and context from labeled images; handles diverse scenes. Heavier compute and dataset requirements.
Threshold + morphology + contours
The simplest “segmentation” is a binary mask: foreground vs background. Clean it with morphology, then extract outer boundaries or filled regions.
import cv2
gray = cv2.imread("blob.png", cv2.IMREAD_GRAYSCALE)
blur = cv2.GaussianBlur(gray, (5, 5), 0)
_, bw = cv2.threshold(blur, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)
k = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (5, 5))
bw = cv2.morphologyEx(bw, cv2.MORPH_OPEN, k, iterations=1)
cnts, _ = cv2.findContours(bw, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
vis = cv2.cvtColor(gray, cv2.COLOR_GRAY2BGR)
cv2.drawContours(vis, cnts, -1, (0, 255, 0), 2)
Connected components
Label each 4- or 8-connected blob; filter by area for counting objects or removing speckles.
import cv2
# bw: binary mask from the thresholding step above
n, labels, stats, centroids = cv2.connectedComponentsWithStats(bw, connectivity=8)
# stats[i] = [x, y, width, height, area]; label 0 is the background
out = cv2.cvtColor(bw, cv2.COLOR_GRAY2BGR)
for i in range(1, n):
    if stats[i, cv2.CC_STAT_AREA] < 100:
        continue  # drop speckles
    x, y, ww, hh = (int(v) for v in stats[i, :4])
    cv2.rectangle(out, (x, y), (x + ww, y + hh), (255, 0, 0), 1)
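To make the 4- versus 8-connectivity distinction concrete, here is a minimal breadth-first labeling sketch in NumPy. It is not how OpenCV implements labeling; the function name `label_components` and the toy mask are illustrative assumptions.

```python
from collections import deque
import numpy as np

def label_components(mask, connectivity=8):
    """Label nonzero blobs; returns (label count including background, label image)."""
    if connectivity == 4:
        offsets = [(-1, 0), (1, 0), (0, -1), (0, 1)]
    else:
        offsets = [(dy, dx) for dy in (-1, 0, 1) for dx in (-1, 0, 1) if (dy, dx) != (0, 0)]
    h, w = mask.shape
    labels = np.zeros((h, w), dtype=np.int32)
    next_label = 1
    for y in range(h):
        for x in range(w):
            if mask[y, x] == 0 or labels[y, x] != 0:
                continue
            # Flood-fill a new blob starting from this unlabeled foreground pixel
            labels[y, x] = next_label
            queue = deque([(y, x)])
            while queue:
                cy, cx = queue.popleft()
                for dy, dx in offsets:
                    ny, nx = cy + dy, cx + dx
                    if 0 <= ny < h and 0 <= nx < w and mask[ny, nx] and labels[ny, nx] == 0:
                        labels[ny, nx] = next_label
                        queue.append((ny, nx))
            next_label += 1
    return next_label, labels

# Two pixels that touch only at a corner
toy = np.array([[1, 0],
                [0, 1]], dtype=np.uint8)
n4, _ = label_components(toy, connectivity=4)  # corner contact does not join blobs
n8, _ = label_components(toy, connectivity=8)  # corner contact joins them
```

With 4-connectivity the two pixels count as separate blobs; with 8-connectivity they merge into one, which matters when counting thin or diagonal structures.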
Watershed with markers
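Watershed treats the image as a height map and floods it from seed points; without markers, every local minimum becomes its own basin and the result is oversegmented. Supply sure foreground (peaks of the distance transform), sure background, and an unknown band in between so the flooding starts from stable seeds.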
import cv2
import numpy as np
img = cv2.imread("coins.jpg")
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
_, th = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)
# noise removal
kernel = np.ones((3, 3), np.uint8)
opening = cv2.morphologyEx(th, cv2.MORPH_OPEN, kernel, iterations=2)
sure_bg = cv2.dilate(opening, kernel, iterations=3)
dist = cv2.distanceTransform(opening, cv2.DIST_L2, 5)
_, sure_fg = cv2.threshold(dist, 0.5 * dist.max(), 255, cv2.THRESH_BINARY)
sure_fg = np.uint8(sure_fg)
unknown = cv2.subtract(sure_bg, sure_fg)
_, markers = cv2.connectedComponents(sure_fg)
markers = markers + 1
markers[unknown == 255] = 0
markers = cv2.watershed(img, markers)
img[markers == -1] = [0, 0, 255]
Boundaries are marked as -1 in markers; here drawn in red on the original BGR image.
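Once the marker array comes back, it can be read like any label image. The sketch below uses a small hand-written array as a stand-in for the output of cv2.watershed (the values are made up), following the convention of the recipe above: -1 for boundaries, 1 for background, 2 and up for objects.

```python
import numpy as np

# Toy stand-in for the array returned by cv2.watershed (values are illustrative)
markers = np.array([
    [1,  1,  1,  1,  1,  1],
    [1, -1, -1, -1, -1,  1],
    [1, -1,  2, -1,  3, -1],
    [1, -1,  2, -1,  3, -1],
    [1, -1, -1, -1, -1,  1],
], dtype=np.int32)

# Skip -1 (boundary) and 1 (background); everything else is an object basin
object_ids = [int(v) for v in np.unique(markers) if v > 1]
areas = {i: int((markers == i).sum()) for i in object_ids}
boundary_pixels = int((markers == -1).sum())
```

Filtering `areas` by a minimum size is a simple way to discard spurious basins.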
GrabCut (interactive box)
GrabCut iteratively refines a Gaussian mixture model of foreground/background color. Start from a rectangle or a user mask.
import cv2
import numpy as np
img = cv2.imread("portrait.jpg")
mask = np.zeros(img.shape[:2], np.uint8)
bgd = np.zeros((1, 65), np.float64)
fgd = np.zeros((1, 65), np.float64)
h, w = img.shape[:2]
rect = (int(0.1 * w), int(0.05 * h), int(0.8 * w), int(0.9 * h))
cv2.grabCut(img, mask, rect, bgd, fgd, 5, cv2.GC_INIT_WITH_RECT)
binmask = np.where((mask == 2) | (mask == 0), 0, 1).astype("uint8")
fg = img * binmask[:, :, np.newaxis]
Refine with sure-foreground strokes (concept)
# mask values: GC_BGD, GC_FGD, GC_PR_BGD, GC_PR_FGD — set user scribbles then:
# cv2.grabCut(img, mask, None, bgd, fgd, 5, cv2.GC_INIT_WITH_MASK)
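A rough sketch of how such a mask might be prepared and later collapsed to binary. To keep it runnable without OpenCV, the integer values of the cv2.GC_* constants are copied in by hand, and the "strokes" are hypothetical hard-coded regions rather than real user input.

```python
import numpy as np

# Integer values of OpenCV's cv2.GC_* constants, copied so the sketch runs without OpenCV
GC_BGD, GC_FGD, GC_PR_BGD, GC_PR_FGD = 0, 1, 2, 3

h, w = 6, 8
mask = np.full((h, w), GC_PR_BGD, dtype=np.uint8)  # start: everything "probably background"
mask[1:5, 2:6] = GC_PR_FGD                          # rough object region: "probably foreground"
mask[2:4, 3:5] = GC_FGD                             # hypothetical user stroke: sure foreground
mask[0, :] = GC_BGD                                 # hypothetical user stroke: sure background

# In a real pipeline, cv2.grabCut(..., cv2.GC_INIT_WITH_MASK) would refine `mask` here.

# Collapse the four labels into a binary foreground mask
binmask = np.isin(mask, (GC_FGD, GC_PR_FGD)).astype(np.uint8)
```

Pixels marked as sure foreground or sure background act as hard constraints during refinement; only the "probable" labels are allowed to change.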
k-means clustering in LAB
Flatten pixels to feature vectors (e.g. L,a,b), run cv2.kmeans, map labels back to an image—simple color segmentation.
import cv2
import numpy as np
bgr = cv2.imread("fruit.jpg")
lab = cv2.cvtColor(bgr, cv2.COLOR_BGR2LAB)
h, w, c = lab.shape
Z = lab.reshape(-1, 3).astype(np.float32)
criteria = (cv2.TERM_CRITERIA_EPS + cv2.TERM_CRITERIA_MAX_ITER, 20, 0.5)
K = 4
_, labels, centers = cv2.kmeans(Z, K, None, criteria, 10, cv2.KMEANS_PP_CENTERS)
centers_u8 = np.uint8(centers)
seg = centers_u8[labels.flatten()].reshape(h, w, 3)
seg_bgr = cv2.cvtColor(seg, cv2.COLOR_LAB2BGR)
What’s next in this series
Semantic segmentation assigns a class label to every pixel (road, sky, person) with networks like FCN, U-Net, DeepLab. Instance segmentation additionally separates individual object masks (Mask R-CNN). Follow the next hub pages for those topics when you are ready to move from classical pipelines to trained models.
Takeaways
- Always combine raw thresholds with morphology and area filters for robust masks.
- Watershed needs markers—distance transform + connected components is a standard recipe.
- GrabCut and k-means leverage color statistics; deep models add semantic understanding.
Semantic segmentation
Problem setup
Input: RGB image H×W×3. Output: label map H×W with integer class ids (0 … C−1), often plus a special ignore index for unlabeled pixels. Training pairs are images plus pixel-wise masks. Unlike object detection, there are no boxes: the model classifies every pixel.
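As a small illustration of the ignore index, the sketch below computes pixel accuracy while skipping unlabeled pixels. The value 255 is an assumed convention; datasets differ, so check the one you use.

```python
import numpy as np

IGNORE_INDEX = 255  # assumed convention for "unlabeled"; not universal

gt = np.array([[0, 0, 1],
               [2, 255, 1]], dtype=np.uint8)
pred = np.array([[0, 1, 1],
                 [2, 0, 1]], dtype=np.uint8)

valid = gt != IGNORE_INDEX                       # unlabeled pixels do not count
pixel_acc = float((pred[valid] == gt[valid]).mean())
```

The same `valid` mask is what a loss function would use to exclude unlabeled pixels from the gradient.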
vs instance segmentation
Semantic: all “person” pixels share one label. Instance: each person gets a separate object id and mask.
vs panoptic
Panoptic merges “stuff” (sky, road) and “things” (countable objects) into one unified labeling—beyond pure semantic.
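The three labelings can be compared on a toy image. The encoding `class_id * 1000 + instance_id` used below is one possible convention, chosen here for illustration; the class ids and array contents are made up.

```python
import numpy as np

PERSON, SKY = 1, 2  # hypothetical class ids; 0 = unlabeled

# Semantic map: both people share the PERSON label
semantic = np.array([[2, 2, 2, 2],
                     [1, 1, 0, 1],
                     [1, 1, 0, 1]])

# Instance map: each person gets its own id; 0 = no instance (stuff or unlabeled)
instance = np.array([[0, 0, 0, 0],
                     [1, 1, 0, 2],
                     [1, 1, 0, 2]])

# One possible panoptic encoding (an assumption): class_id * 1000 + instance_id
panoptic = semantic * 1000 + instance
segments = np.unique(panoptic).tolist()

# The semantic map can be recovered from the panoptic one, but not the reverse
recovered_semantic = panoptic // 1000
```

The sky appears as a single segment with no instance id, while the two people become separate segments that still share a class.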
Representative architectures
- FCN — Fully Convolutional Networks replace dense layers with convolutions; skip connections fuse coarse semantic and fine spatial detail.
- U-Net — Symmetric encoder–decoder with skip concatenations; very common in medical imaging and small datasets.
- DeepLab — Uses atrous (dilated) convolutions to enlarge receptive field without losing resolution as fast as pooling; ASPP combines multiple dilation rates.
- SegFormer / Mask2Former — Transformer-based designs for strong context (covered in advanced courses).
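To see why dilation enlarges the receptive field without adding weights, consider the span of a single dilated kernel: a k×k kernel with dilation d covers k + (k − 1)(d − 1) pixels per side. The rates below are illustrative values of the kind an ASPP module might combine, not a prescription.

```python
def effective_kernel(k, dilation):
    """Side length covered by a k x k kernel whose taps are `dilation` pixels apart."""
    return k + (k - 1) * (dilation - 1)

rates = [1, 6, 12, 18]  # illustrative dilation rates
sizes = {d: effective_kernel(3, d) for d in rates}
# Each branch still has 3 * 3 = 9 weights per channel pair, whatever the rate.
```

A 3×3 kernel at rate 18 spans 37 pixels per side, which is how parallel branches at different rates see context at several scales.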
Losses and metrics
Cross-entropy per pixel (with optional class weights) is the standard. For imbalanced classes (rare “pole” vs common “road”), use weighted CE, focal loss, or Dice/Lovász variants. IoU (Jaccard) per class and mIoU (mean over classes) are standard quality measures; pixel accuracy can be misleading when background dominates.
# Conceptual IoU for one class (NumPy-style)
def iou(pred_mask, gt_mask):
    inter = (pred_mask & gt_mask).sum()
    union = (pred_mask | gt_mask).sum()
    return inter / (union + 1e-6)
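For mIoU over a whole validation set, a common approach is to accumulate a confusion matrix and derive per-class IoU from it. The following is a minimal sketch under assumptions: the helper names, the ignore index of 255, and the choice to average only over classes that actually occur are all illustrative.

```python
import numpy as np

def confusion_matrix(pred, gt, num_classes, ignore_index=255):
    """Rows = ground-truth class, columns = predicted class."""
    valid = gt != ignore_index
    idx = gt[valid].astype(np.int64) * num_classes + pred[valid].astype(np.int64)
    return np.bincount(idx, minlength=num_classes ** 2).reshape(num_classes, num_classes)

def mean_iou(cm):
    inter = np.diag(cm).astype(np.float64)
    union = cm.sum(axis=0) + cm.sum(axis=1) - inter
    present = union > 0  # skip classes absent from both prediction and ground truth
    return float((inter[present] / union[present]).mean())

gt = np.array([[0, 0, 1], [1, 2, 255]])
pred = np.array([[0, 1, 1], [1, 2, 0]])
cm = confusion_matrix(pred, gt, num_classes=3)
miou = mean_iou(cm)
```

Summing confusion matrices across images before calling `mean_iou` gives a dataset-level score, which generally differs from averaging per-image IoU.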
Inference: DeepLabV3 (torchvision)
Requires torch and torchvision. The default weights were trained on a subset of COCO restricted to the 20 Pascal VOC categories plus background, so the head predicts 21 classes. Map the labels to your task or fine-tune.
import torch
import torchvision.transforms as T
from PIL import Image
from torchvision.models.segmentation import deeplabv3_resnet50
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = deeplabv3_resnet50(weights="DEFAULT").to(device).eval()
preprocess = T.Compose([
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])
img = Image.open("street.jpg").convert("RGB")
inp = preprocess(img).unsqueeze(0).to(device)
with torch.no_grad():
    out = model(inp)["out"]  # (1, num_classes, h, w)
pred = out.argmax(dim=1).squeeze(0).cpu().numpy()
Here pred already matches the input size. If you downscale the image before inference, resize pred back to the original size with cv2.resize(..., interpolation=cv2.INTER_NEAREST) so that class ids are not blended.
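The reason for nearest-neighbor interpolation is that class ids are categories, not quantities: averaging ids 0 and 15 would produce a meaningless id 7. A NumPy-only sketch of nearest-neighbor resizing makes this visible. The floor-based index mapping is a simplification and may not pick exactly the same source pixels as cv2.resize.

```python
import numpy as np

def resize_nearest(label_map, out_h, out_w):
    """Nearest-neighbor resize by index lookup; never invents new label values."""
    in_h, in_w = label_map.shape
    rows = np.arange(out_h) * in_h // out_h   # source row for each output row
    cols = np.arange(out_w) * in_w // out_w   # source column for each output column
    return label_map[rows[:, None], cols[None, :]]

pred_small = np.array([[0, 15],
                       [7, 15]], dtype=np.uint8)
pred_full = resize_nearest(pred_small, 4, 6)
labels_after = set(np.unique(pred_full).tolist())
```

The set of labels after resizing is identical to the set before, which bilinear or bicubic interpolation cannot guarantee.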
Colorize label map for debugging
import numpy as np
# pseudo-color: map class id to RGB (toy palette of length num_classes)
palette = np.random.default_rng(0).integers(0, 255, size=(21, 3), dtype=np.uint8)
vis = palette[pred]
Training tips (brief)
Apply the same geometric transform to image and mask (flip, scale, crop). Use sync batch norm or group norm on small batch sizes. Start from ImageNet-pretrained backbones; freeze backbone briefly then unfreeze. Validate with mIoU on a held-out set, not just loss.
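The first tip is the one most often broken in practice, so here is a minimal sketch of a paired flip-and-crop in NumPy. The function name `random_flip_crop` and the toy data are assumptions; real pipelines usually rely on an augmentation library, but the principle is the same: draw the random parameters once and apply them to both arrays.

```python
import numpy as np

def random_flip_crop(image, mask, crop_hw, rng):
    """Apply one random horizontal flip and one random crop to image and mask together."""
    if rng.random() < 0.5:
        image = image[:, ::-1]
        mask = mask[:, ::-1]
    ch, cw = crop_hw
    h, w = mask.shape
    top = int(rng.integers(0, h - ch + 1))    # same offsets for both arrays
    left = int(rng.integers(0, w - cw + 1))
    return image[top:top + ch, left:left + cw], mask[top:top + ch, left:left + cw]

rng = np.random.default_rng(0)
image = np.zeros((8, 10, 3), dtype=np.uint8)
mask = np.zeros((8, 10), dtype=np.uint8)
mask[2:6, 3:7] = 1
image[mask == 1] = 200  # make the image agree with the mask so alignment can be checked

img_aug, mask_aug = random_flip_crop(image, mask, (6, 6), rng)
still_aligned = bool(((img_aug[..., 0] == 200) == (mask_aug == 1)).all())
```

If the image is also rescaled or rotated, interpolate the image bilinearly but the mask with nearest-neighbor sampling.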
Takeaways
- Semantic = one class label per pixel, shared by all objects of that class, not one per object instance.
- Encoder–decoder designs recover full-resolution labels; dilated convs avoid losing resolution in the first place.
- Report mIoU; handle class imbalance in the loss.
Instance segmentation
Outputs per detection
For each instance, models typically emit: bounding box (x1,y1,x2,y2), class id, confidence score, and a mask (often 28×28 logits upsampled to the ROI and thresholded). Overlapping instances can occlude each other—ordering (painter’s algorithm) or alpha blending matters for visualization.
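One simple way to resolve overlaps for visualization is to paint instances in ascending score order so that the most confident mask ends up on top. This is a sketch of that heuristic, not a standard API; `compose_instances` and the 0.5 threshold are illustrative, and sorting by area is an equally reasonable alternative.

```python
import numpy as np

def compose_instances(masks, scores, score_thresh=0.5):
    """masks: (N, H, W) bool. Returns an (H, W) id map: 0 = background, i + 1 = instance i."""
    canvas = np.zeros(masks.shape[1:], dtype=np.int32)
    order = np.argsort(scores)  # ascending: low-confidence first, high-confidence painted last
    for i in order:
        if scores[i] < score_thresh:
            continue
        canvas[masks[i]] = i + 1
    return canvas

masks = np.zeros((2, 4, 4), dtype=bool)
masks[0, 0:3, 0:3] = True   # instance 0, top-left square
masks[1, 1:4, 1:4] = True   # instance 1, bottom-right square, overlapping instance 0
scores = np.array([0.9, 0.6])
id_map = compose_instances(masks, scores)
```

In the overlap region the higher-scoring instance 0 wins, so instance 1 keeps only the pixels it does not share.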
Mask IoU
Intersection over union of predicted vs ground-truth binary masks; averaged into APmask on benchmarks like COCO.
Panoptic
Unifies semantic “stuff” labels with instance “things” in one image—each pixel has class + optional instance id.
Mask R-CNN building blocks
- Backbone + FPN — multi-scale feature pyramid.
- Region Proposal Network (RPN) — objectness boxes in one forward pass.
- ROIAlign — bilinear sampling of features at proposal locations (fixes quantization error vs ROI Pool).
- Class + box head — refines category and box.
- Mask head — parallel branch that predicts K binary masks per ROI, one channel per class, trained with a per-pixel sigmoid and average binary cross-entropy on the ground-truth class channel.
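The mask-head loss in the last bullet can be sketched in NumPy as follows. Only the channel of the ground-truth class contributes, so classes do not compete with each other. The shapes and the function name `mask_loss` are assumptions for illustration; a real implementation would operate on batched tensors in the training framework.

```python
import numpy as np

def mask_loss(mask_logits, gt_mask, gt_class):
    """mask_logits: (K, M, M) per-class logits for one ROI; gt_mask: (M, M) with values in {0, 1}."""
    z = mask_logits[gt_class]  # only the ground-truth class channel is penalized
    # Numerically stable binary cross-entropy with logits, averaged over pixels
    per_pixel = np.maximum(z, 0) - z * gt_mask + np.log1p(np.exp(-np.abs(z)))
    return float(per_pixel.mean())

K, M = 3, 4
gt = np.zeros((M, M))
gt[1:3, 1:3] = 1

# Uninformative logits: every pixel is predicted at probability 0.5
loss_uninformed = mask_loss(np.zeros((K, M, M)), gt, gt_class=1)

# Confident, correct logits on the ground-truth channel
good = np.zeros((K, M, M))
good[1] = np.where(gt == 1, 10.0, -10.0)
loss_confident = mask_loss(good, gt, gt_class=1)
```

At zero logits the loss equals ln 2 per pixel, and it drops toward zero as the ground-truth channel becomes confident and correct.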
Inference: Mask R-CNN (torchvision)
import torch
import torchvision.transforms as T
from torchvision.models.detection import maskrcnn_resnet50_fpn
from PIL import Image
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = maskrcnn_resnet50_fpn(weights="DEFAULT").to(device).eval()
img = Image.open("party.jpg").convert("RGB")
tensor = T.functional.to_tensor(img).to(device)
with torch.no_grad():
    out = model([tensor])[0]
scores = out["scores"]
labels = out["labels"]
boxes = out["boxes"]
masks = out["masks"]  # (N, 1, H, W), values in [0,1]
for i in range(len(scores)):
    if scores[i] < 0.7:
        continue
    mask = (masks[i, 0] > 0.5).cpu().numpy()
    # overlay mask on image with OpenCV or PIL
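COCO class ids: labels index into the 91-entry COCO category list, which includes a background entry and unused "N/A" slots. In recent torchvision versions the names are available as MaskRCNN_ResNet50_FPN_Weights.DEFAULT.meta["categories"].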
Other families (names to search)
YOLACT: real-time instance masks from prototype masks combined with per-instance linear coefficients. SOLO / SOLOv2: grid-based instance categories. Segment Anything (SAM): promptable masks from points or boxes. Detectron2 (Meta) is a library with reference implementations of Mask R-CNN and related models rather than a model itself. The choice depends on latency, accuracy, and training budget.
Annotations
Instance datasets store polygon or RLE (run-length encoded) masks per object. COCO JSON is the de facto format; tools like LabelMe, CVAT, or Roboflow export compatible labels.
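To show what run-length encoding does to a mask, here is a minimal sketch. It follows what I understand to be the uncompressed COCO convention (column-major scan, counts alternating between runs of 0s and 1s and starting with a run of 0s); treat that as an assumption and verify against the official tools before relying on it. The compressed string form found in many annotation files is not covered.

```python
import numpy as np

def rle_encode(mask):
    """Run lengths over the column-major flattening, starting with a run of zeros."""
    flat = np.asarray(mask, dtype=np.uint8).flatten(order="F")
    counts, prev, run = [], 0, 0
    for v in flat:
        if v == prev:
            run += 1
        else:
            counts.append(run)   # close the current run (may be 0 if the mask starts with 1)
            prev, run = v, 1
    counts.append(run)
    return {"size": list(mask.shape), "counts": counts}

def rle_decode(rle):
    h, w = rle["size"]
    flat = np.zeros(h * w, dtype=np.uint8)
    pos, val = 0, 0
    for c in rle["counts"]:
        flat[pos:pos + c] = val
        pos += c
        val = 1 - val            # runs alternate between 0 and 1
    return flat.reshape((h, w), order="F")

mask = np.array([[0, 1, 1],
                 [0, 1, 0]], dtype=np.uint8)
rle = rle_encode(mask)
roundtrip = rle_decode(rle)
```

RLE is compact for large blobs with few transitions, while polygons are easier to edit by hand, which is why datasets often carry both.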
Takeaways
- Instance = detect objects and separate pixel ownership per object.
- Mask R-CNN = Faster R-CNN + mask branch + ROIAlign.
- Evaluate with mask AP, not only box AP.