Deep Segmentation

Segmentation overview

Classical vs learning-based

Classical / interactive

Fast, no training data; needs hand-tuned assumptions (color clusters, user scribbles, smooth regions). Great for controlled capture or preprocessing.

Deep segmentation

Learns appearance and context from labeled images; handles diverse scenes. Heavier compute and dataset requirements.

Threshold + morphology + contours

The simplest “segmentation” is a binary mask: foreground vs background. Clean it with morphology, then extract outer boundaries or filled regions.

import cv2

gray = cv2.imread("blob.png", cv2.IMREAD_GRAYSCALE)
blur = cv2.GaussianBlur(gray, (5, 5), 0)
_, bw = cv2.threshold(blur, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)

k = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (5, 5))
bw = cv2.morphologyEx(bw, cv2.MORPH_OPEN, k, iterations=1)

cnts, _ = cv2.findContours(bw, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
vis = cv2.cvtColor(gray, cv2.COLOR_GRAY2BGR)
cv2.drawContours(vis, cnts, -1, (0, 255, 0), 2)

Connected components

Label each 4- or 8-connected blob; filter by area for counting objects or removing speckles.

import cv2

# bw: binary mask from the threshold example above
n, labels, stats, centroids = cv2.connectedComponentsWithStats(bw, connectivity=8)
# stats[i] columns: [x, y, width, height, area]; label 0 is the background
out = cv2.cvtColor(bw, cv2.COLOR_GRAY2BGR)
for i in range(1, n):
    if stats[i, cv2.CC_STAT_AREA] < 100:
        continue  # skip speckles under 100 px
    x, y, ww, hh = (int(v) for v in stats[i, :4])
    cv2.rectangle(out, (x, y), (x + ww, y + hh), (255, 0, 0), 1)

Watershed with markers

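cv2.watershed floods the image outward from labeled seed markers and draws boundaries where neighboring basins meet; with poor or missing markers it oversegments. The distance transform of the cleaned mask supplies sure-foreground seeds; add a sure-background region and leave the band between them as unknown so the basins are stable.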

import cv2
import numpy as np

img = cv2.imread("coins.jpg")
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
_, th = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)

# noise removal
kernel = np.ones((3, 3), np.uint8)
opening = cv2.morphologyEx(th, cv2.MORPH_OPEN, kernel, iterations=2)
sure_bg = cv2.dilate(opening, kernel, iterations=3)

dist = cv2.distanceTransform(opening, cv2.DIST_L2, 5)
_, sure_fg = cv2.threshold(dist, 0.5 * dist.max(), 255, cv2.THRESH_BINARY)
sure_fg = np.uint8(sure_fg)
unknown = cv2.subtract(sure_bg, sure_fg)

_, markers = cv2.connectedComponents(sure_fg)
markers = markers + 1
markers[unknown == 255] = 0

markers = cv2.watershed(img, markers)
img[markers == -1] = [0, 0, 255]

Boundaries are marked as -1 in markers; here drawn in red on the original BGR image.

GrabCut (interactive box)

GrabCut iteratively refines a Gaussian mixture model of foreground/background color. Start from a rectangle or a user mask.

import cv2
import numpy as np

img = cv2.imread("portrait.jpg")
mask = np.zeros(img.shape[:2], np.uint8)
bgd = np.zeros((1, 65), np.float64)
fgd = np.zeros((1, 65), np.float64)
h, w = img.shape[:2]
rect = (int(0.1 * w), int(0.05 * h), int(0.8 * w), int(0.9 * h))

cv2.grabCut(img, mask, rect, bgd, fgd, 5, cv2.GC_INIT_WITH_RECT)
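# keep definite and probable foreground; zero out definite and probable background
binmask = np.where((mask == cv2.GC_BGD) | (mask == cv2.GC_PR_BGD), 0, 1).astype("uint8")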
fg = img * binmask[:, :, np.newaxis]

Refine with sure-foreground strokes (concept)

# mask values: GC_BGD, GC_FGD, GC_PR_BGD, GC_PR_FGD — set user scribbles then:
# cv2.grabCut(img, mask, None, bgd, fgd, 5, cv2.GC_INIT_WITH_MASK)

k-means clustering in LAB

Flatten pixels to feature vectors (e.g. L,a,b), run cv2.kmeans, map labels back to an image—simple color segmentation.

import cv2
import numpy as np

bgr = cv2.imread("fruit.jpg")
lab = cv2.cvtColor(bgr, cv2.COLOR_BGR2LAB)
h, w, c = lab.shape
Z = lab.reshape(-1, 3).astype(np.float32)

criteria = (cv2.TERM_CRITERIA_EPS + cv2.TERM_CRITERIA_MAX_ITER, 20, 0.5)
K = 4
_, labels, centers = cv2.kmeans(Z, K, None, criteria, 10, cv2.KMEANS_PP_CENTERS)

centers_u8 = np.uint8(centers)
seg = centers_u8[labels.flatten()].reshape(h, w, 3)
seg_bgr = cv2.cvtColor(seg, cv2.COLOR_LAB2BGR)

What’s next in this series

Semantic segmentation assigns a class label to every pixel (road, sky, person) with networks like FCN, U-Net, DeepLab. Instance segmentation additionally separates individual object masks (Mask R-CNN). Follow the next hub pages for those topics when you are ready to move from classical pipelines to trained models.

Takeaways

Always combine raw thresholds with morphology and area filters for robust masks.
Watershed needs markers; distance transform + connected components is a standard recipe.
GrabCut and k-means leverage color statistics; deep models add semantic understanding.

Quick FAQ

Watershed leaks across weak edges?

Improve the sure-foreground mask (higher distance threshold, better thresholding) or incorporate gradient-based markers. Markers are the main control lever.

k-means K?

Try the elbow method on within-cluster error, or pick K from the number of dominant colors you expect. For objects with shared colors, pure k-means will merge them; use learning-based segmentation instead.

Semantic segmentation

Problem setup

Input: RGB image H×W×3. Output: label map H×W with integer class ids (0 … C−1), often plus a special ignore index for unlabeled pixels. Training pairs are images plus pixel-wise masks. Unlike object detection, there are no boxes; every pixel location gets its own class prediction.

vs instance segmentation

Semantic: all “person” pixels share one label. Instance: each person gets a separate object id and mask.

vs panoptic

Panoptic merges “stuff” (sky, road) and “things” (countable objects) into one unified labeling—beyond pure semantic.
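The three labelings can be compared on a toy array. In this sketch the class ids and the packing scheme (class_id * 1000 + instance_id) are assumptions chosen for illustration; real datasets define their own encodings.

```python
import numpy as np

# Toy 3x4 scene. Assumed class ids: 0 = sky (stuff), 1 = road (stuff), 2 = person (thing).
semantic = np.array([
    [0, 0, 0, 0],
    [2, 2, 0, 2],
    [1, 1, 1, 1],
])

# Instance ids: 0 for stuff pixels, 1..N for individual things.
instance = np.array([
    [0, 0, 0, 0],
    [1, 1, 0, 2],
    [0, 0, 0, 0],
])

# Panoptic map: one integer per pixel carrying both class and instance.
panoptic = semantic * 1000 + instance

# Unpack again and list the distinct segments.
class_ids, instance_ids = np.divmod(panoptic, 1000)
segments = np.unique(panoptic)
print(segments)  # [   0 1000 2001 2002]
```

The semantic map cannot tell the two people apart (both are class 2); the panoptic map keeps them as segments 2001 and 2002 while sky and road stay single segments.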

Representative architectures

FCN — Fully Convolutional Networks replace dense layers with convolutions; skip connections fuse coarse semantic and fine spatial detail.
U-Net — Symmetric encoder–decoder with skip concatenations; very common in medical imaging and small datasets.
DeepLab — Uses atrous (dilated) convolutions to enlarge receptive field without losing resolution as fast as pooling; ASPP combines multiple dilation rates.
SegFormer / Mask2Former — Transformer-based designs for strong context (covered in advanced courses).
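To make the dilated-convolution point concrete, the span covered by a dilated kernel can be computed directly. This is a small arithmetic sketch; the dilation rates are chosen for illustration, and actual ASPP rates depend on the model configuration.

```python
def effective_kernel(k, dilation):
    """Span in pixels covered by a k-tap kernel with the given dilation rate."""
    return k + (k - 1) * (dilation - 1)

# Illustrative rates for a 3x3 kernel.
rates = [1, 6, 12, 18]
spans = {d: effective_kernel(3, d) for d in rates}
print(spans)  # {1: 3, 6: 13, 12: 25, 18: 37}
```

Each of these kernels still has only 3×3 = 9 weights; dilation spreads the taps apart, so context grows without extra parameters and without downsampling.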

Losses and metrics

Cross-entropy per pixel (with optional class weights) is the standard. For imbalanced classes (rare “pole” vs common “road”), use weighted CE, focal loss, or Dice/Lovász variants. IoU (Jaccard) per class and mIoU (mean over classes) are standard quality measures; pixel accuracy can be misleading when background dominates.

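import numpy as np

# Conceptual IoU for one class; masks are boolean arrays of the same shape
def iou(pred_mask, gt_mask):
    pred_mask = np.asarray(pred_mask, dtype=bool)
    gt_mask = np.asarray(gt_mask, dtype=bool)
    inter = (pred_mask & gt_mask).sum()
    union = (pred_mask | gt_mask).sum()
    return inter / (union + 1e-6)  # epsilon avoids 0/0 when both masks are empty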
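For mIoU over all classes, a confusion matrix is a convenient intermediate. The following is a minimal NumPy sketch, assuming predictions are already integer class ids in 0 … C−1 and that 255 marks ignored pixels (the ignore value is an assumption; use whatever your dataset defines).

```python
import numpy as np

def confusion_matrix(pred, gt, num_classes, ignore_index=255):
    """Rows are ground-truth classes, columns are predicted classes."""
    valid = gt != ignore_index
    idx = gt[valid].astype(np.int64) * num_classes + pred[valid].astype(np.int64)
    return np.bincount(idx, minlength=num_classes ** 2).reshape(num_classes, num_classes)

def per_class_iou(cm):
    inter = np.diag(cm).astype(np.float64)
    union = cm.sum(axis=0) + cm.sum(axis=1) - inter
    iou = np.full(cm.shape[0], np.nan)   # NaN for classes absent from both pred and gt
    present = union > 0
    iou[present] = inter[present] / union[present]
    return iou

gt = np.array([[0, 0, 1], [1, 2, 255]])
pred = np.array([[0, 1, 1], [1, 2, 2]])
cm = confusion_matrix(pred, gt, num_classes=3)
ious = per_class_iou(cm)
miou = np.nanmean(ious)
print(ious, miou)  # per-class IoU 0.5, 2/3, 1.0; mIoU = 13/18
```

When evaluating a whole dataset, accumulate the confusion matrix over all images first and compute IoU once at the end, rather than averaging per-image scores.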

Inference: DeepLabV3 (torchvision)

Requires torch and torchvision. The default weights were trained on a subset of COCO restricted to the 20 Pascal VOC categories, so the model predicts 21 classes (background + 20); map labels to your task or fine-tune.

import torch
import torchvision.transforms as T
from PIL import Image
from torchvision.models.segmentation import deeplabv3_resnet50

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = deeplabv3_resnet50(weights="DEFAULT").to(device).eval()

# ImageNet mean/std normalization expected by the pretrained backbone
preprocess = T.Compose([
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

img = Image.open("street.jpg").convert("RGB")
inp = preprocess(img).unsqueeze(0).to(device)

with torch.no_grad():
    out = model(inp)["out"]  # (1, num_classes, h, w)
pred = out.argmax(dim=1).squeeze(0).cpu().numpy()

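The torchvision model upsamples its output to the size of the input tensor, so pred already matches the image you fed in. If you downscaled the image before inference, resize pred back to the original size with cv2.resize(..., interpolation=cv2.INTER_NEAREST) to avoid mixing class ids; cast pred to uint8 first, since cv2.resize may reject int64 arrays.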

Colorize label map for debugging

import numpy as np

# pseudo-color: map class id to RGB (toy palette of length num_classes)
palette = np.random.default_rng(0).integers(0, 255, size=(21, 3), dtype=np.uint8)
vis = palette[pred]

Training tips (brief)

Apply the same geometric transform to image and mask (flip, scale, crop). Use sync batch norm or group norm on small batch sizes. Start from ImageNet-pretrained backbones; freeze backbone briefly then unfreeze. Validate with mIoU on a held-out set, not just loss.
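The first tip, identical geometric transforms for image and mask, can be sketched in plain NumPy. The helper name paired_flip_crop and the toy data are assumptions for illustration; augmentation libraries provide the same idea with more transforms.

```python
import numpy as np

def paired_flip_crop(image, mask, crop_hw, rng):
    """Apply one random horizontal flip and one random crop to both arrays."""
    h, w = mask.shape
    ch, cw = crop_hw
    if rng.random() < 0.5:
        image, mask = image[:, ::-1], mask[:, ::-1]
    top = int(rng.integers(0, h - ch + 1))
    left = int(rng.integers(0, w - cw + 1))
    window = (slice(top, top + ch), slice(left, left + cw))
    return np.ascontiguousarray(image[window]), np.ascontiguousarray(mask[window])

rng = np.random.default_rng(0)
mask = rng.integers(0, 3, size=(64, 64)).astype(np.uint8)     # toy labels 0..2
image = np.stack([mask * 80, mask * 40, mask * 20], axis=-1)  # toy image derived from the labels
img_aug, mask_aug = paired_flip_crop(image, mask, (32, 32), rng)
```

The random decisions (flip or not, crop offset) are drawn once and reused for both arrays. Drawing them separately for image and mask is a common bug that silently misaligns labels.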

Takeaways

Semantic = one class label per pixel, shared by all objects of that class (no per-instance ids).
Encoder–decoder skips and dilated convs help produce sharp, full-resolution labels.
Report mIoU; handle class imbalance in the loss.

Quick FAQ

Blurry object boundaries?

Strong downsampling without good skip connections smooths edges. Try U-Net-style skips, higher-resolution training crops, or boundary-aware losses.

Custom number of classes?

Replace the final classifier conv layer to output C channels; fine-tune on your dataset. Reinitialize that layer; lower learning rate on the backbone.

Instance segmentation

Outputs per detection

For each instance, models typically emit: bounding box (x1,y1,x2,y2), class id, confidence score, and a mask (often 28×28 logits upsampled to the ROI and thresholded). Overlapping instances can occlude each other—ordering (painter’s algorithm) or alpha blending matters for visualization.
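The ordering issue can be illustrated with a small compositing routine. This is a sketch under assumptions: masks are boolean arrays at image resolution, and lower-scoring instances are blended first so that the most confident detection ends up on top. The helper name composite_instances is made up for this example.

```python
import numpy as np

def composite_instances(image, masks, scores, colors, alpha=0.5):
    """Alpha-blend boolean masks onto an RGB image, lowest score first."""
    out = image.astype(np.float32)
    for i in np.argsort(scores):  # ascending: the best detection is painted last
        m = masks[i]
        out[m] = (1 - alpha) * out[m] + alpha * np.asarray(colors[i], np.float32)
    return out.round().astype(np.uint8)

image = np.zeros((4, 4, 3), np.uint8)
masks = np.zeros((2, 4, 4), bool)
masks[0, :, :3] = True   # instance 0 covers columns 0..2
masks[1, :, 1:] = True   # instance 1 covers columns 1..3 and overlaps instance 0
scores = np.array([0.9, 0.6])
colors = [(255, 0, 0), (0, 0, 255)]
vis = composite_instances(image, masks, scores, colors)
```

In the overlap (columns 1 and 2) the red of the higher-scoring instance dominates because it is blended last; reversing the order would tint the overlap toward blue instead.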

Mask IoU

Intersection over union of predicted vs ground-truth binary masks; averaged into AP^mask on benchmarks like COCO.

Panoptic

Unifies semantic “stuff” labels with instance “things” in one image—each pixel has class + optional instance id.

Mask R-CNN building blocks

Backbone + FPN — multi-scale feature pyramid.
Region Proposal Network (RPN) — objectness boxes in one forward pass.
ROIAlign — bilinear sampling of features at proposal locations (fixes quantization error vs ROI Pool).
Class + box head — refines category and box.
Mask head — parallel branch that predicts K binary masks per ROI, one channel per class; trained with a per-pixel sigmoid and average binary cross-entropy loss, applied only to the channel of the ground-truth class.
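The quantization error that ROIAlign avoids can be shown with a single sample point. This is a simplified sketch: it samples one fractional location on a toy single-channel feature map, whereas the real operator averages several such samples per output bin.

```python
import numpy as np

def bilinear_sample(feat, y, x):
    """Bilinearly interpolate a 2-D feature map at fractional (y, x)."""
    h, w = feat.shape
    y0, x0 = int(np.floor(y)), int(np.floor(x))
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
    dy, dx = y - y0, x - x0
    top = (1 - dx) * feat[y0, x0] + dx * feat[y0, x1]
    bottom = (1 - dx) * feat[y1, x0] + dx * feat[y1, x1]
    return (1 - dy) * top + dy * bottom

feat = np.arange(16, dtype=np.float64).reshape(4, 4)  # feat[y, x] = 4*y + x
aligned = bilinear_sample(feat, 1.5, 2.25)   # ROIAlign-style fractional sample
quantized = feat[int(1.5), int(2.25)]        # ROI Pool-style: snap to the grid first
print(aligned, quantized)  # 8.25 6.0
```

Snapping coordinates to the grid shifts the sample by up to a full feature cell, which is many image pixels at coarse pyramid levels; that misalignment hurts mask quality far more than box quality.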

Inference: Mask R-CNN (torchvision)

import torch
import torchvision.transforms as T
from torchvision.models.detection import maskrcnn_resnet50_fpn
from PIL import Image

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = maskrcnn_resnet50_fpn(weights="DEFAULT").to(device).eval()

img = Image.open("party.jpg").convert("RGB")
tensor = T.functional.to_tensor(img).to(device)

with torch.no_grad():
    out = model([tensor])[0]

scores = out["scores"]
labels = out["labels"]
boxes = out["boxes"]
masks = out["masks"]  # (N, 1, H, W), values in [0,1]

for i in range(len(scores)):
    if scores[i] < 0.7:
        continue
    mask = (masks[i, 0] > 0.5).cpu().numpy()
    # overlay mask on image with OpenCV or PIL

COCO class ids: labels index into the COCO 91-category list, which has gaps for unused ids. With torchvision weights, MaskRCNN_ResNet50_FPN_Weights.DEFAULT.meta["categories"] gives the id-to-name list.

Other families (names to search)

YOLACT: real-time instance masks via prototype masks + linear coefficients. SOLO / SOLOv2: grid-based instance categories. Segment Anything (SAM): promptable masks from points or boxes, useful for interactive labeling. Detectron2 (Meta) is a library with reference implementations of many such models rather than a model itself. Choice depends on latency, accuracy, and training budget.

Annotations

Instance datasets store polygon or RLE (run-length encoded) masks per object. COCO JSON is the de facto format; tools like LabelMe, CVAT, or Roboflow export compatible labels.
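Run-length encoding can be sketched in a few lines. The version below follows the COCO convention for uncompressed RLE as far as I know it (column-major pixel order, counts alternating between runs of 0s and 1s and starting with 0s); it does not produce the compressed string form, so treat it as an illustration rather than a drop-in encoder.

```python
import numpy as np

def rle_encode(mask):
    """Encode a binary mask as alternating run lengths, starting with zeros."""
    mask = np.asarray(mask, dtype=np.uint8)
    flat = mask.flatten(order="F")  # column-major order
    change = np.flatnonzero(flat[1:] != flat[:-1]) + 1
    bounds = np.concatenate(([0], change, [flat.size]))
    counts = np.diff(bounds).tolist()
    if flat.size and flat[0] == 1:
        counts = [0] + counts  # the first run must describe zeros
    return {"size": list(mask.shape), "counts": counts}

def rle_decode(rle):
    h, w = rle["size"]
    flat = np.zeros(h * w, dtype=np.uint8)
    pos, val = 0, 0
    for c in rle["counts"]:
        flat[pos:pos + c] = val
        pos += c
        val = 1 - val
    return flat.reshape((h, w), order="F")

mask = np.array([[0, 1, 1],
                 [0, 1, 0]], dtype=np.uint8)
rle = rle_encode(mask)
print(rle)  # {'size': [2, 3], 'counts': [2, 3, 1]}
restored = rle_decode(rle)
```

RLE stays compact for large blobs with few transitions, which is why it is preferred over polygons for crowd regions and masks with holes.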

Takeaways

Instance = detect objects and separate pixel ownership per object.
Mask R-CNN = Faster R-CNN + mask branch + ROIAlign.
Evaluate with mask AP, not only box AP.

Quick FAQ

Small objects missing?

Increase input resolution, use FPN with finer levels, or switch to anchor-free / transformer detectors. Data augmentation with copy-paste of small instances helps.

Can I train on one class only?

Yes. Set num_classes=2 (background + your class) in a custom model, replace the prediction heads, and fine-tune from COCO weights.
