Problem setup
Input: an RGB image of shape H×W×3. Output: a label map of shape H×W with integer class ids (0 … C−1), often plus a special ignore index for unlabeled pixels. Training pairs are images plus pixel-wise masks. Unlike object detection, there are no boxes; the model performs a classification at every pixel.
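A minimal sketch of such a training pair (toy sizes; using 255 as the ignore index, a common convention, e.g. in Cityscapes):

```python
import numpy as np

H, W = 4, 4  # toy sizes; real images are much larger
image = np.zeros((H, W, 3), dtype=np.uint8)   # RGB input
mask = np.full((H, W), 255, dtype=np.uint8)   # 255 = ignore index (unlabeled)
mask[:2, :] = 0                               # class 0 (e.g. background)
mask[2:, :2] = 1                              # class 1

# The loss is computed only where the mask is not the ignore index
valid = mask != 255
```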
vs instance segmentation
Semantic: all “person” pixels share one label. Instance: each person gets a separate object id and mask.
vs panoptic
Panoptic merges “stuff” (sky, road) and “things” (countable objects) into one unified labeling—beyond pure semantic.
Representative architectures
- FCN — Fully Convolutional Networks replace dense layers with convolutions; skip connections fuse coarse semantic and fine spatial detail.
- U-Net — Symmetric encoder–decoder with skip concatenations; very common in medical imaging and small datasets.
- DeepLab — Uses atrous (dilated) convolutions to enlarge receptive field without losing resolution as fast as pooling; ASPP combines multiple dilation rates.
- SegFormer / Mask2Former — Transformer-based designs for strong context (covered in advanced courses).
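The dilated-convolution idea behind DeepLab can be seen in a few lines: with matching padding, a dilated 3×3 kernel keeps the output resolution but covers a much wider context (a sketch, not DeepLab's exact configuration):

```python
import torch
import torch.nn as nn

x = torch.randn(1, 8, 32, 32)

# Two 3x3 convolutions that both preserve spatial size:
conv_d1 = nn.Conv2d(8, 8, kernel_size=3, padding=1, dilation=1)  # 3x3 receptive field
conv_d4 = nn.Conv2d(8, 8, kernel_size=3, padding=4, dilation=4)  # effectively 9x9

# Same output shape, larger context for the dilated kernel
print(conv_d1(x).shape, conv_d4(x).shape)  # both torch.Size([1, 8, 32, 32])
```

ASPP applies several such dilation rates in parallel and concatenates the results.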
Losses and metrics
Cross-entropy per pixel (with optional class weights) is the standard. For imbalanced classes (rare “pole” vs common “road”), use weighted CE, focal loss, or Dice/Lovász variants. IoU (Jaccard) per class and mIoU (mean over classes) are standard quality measures; pixel accuracy can be misleading when background dominates.
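A short sketch of the two loss options mentioned above; the class weights here are illustrative, not tuned values:

```python
import torch
import torch.nn.functional as F

logits = torch.randn(2, 5, 8, 8)          # (batch, classes, h, w)
target = torch.randint(0, 5, (2, 8, 8))   # integer class ids per pixel

# Weighted cross-entropy: up-weight rare classes, skip ignore-index pixels
weights = torch.tensor([0.5, 1.0, 1.0, 2.0, 2.0])
ce = F.cross_entropy(logits, target, weight=weights, ignore_index=255)

# Soft Dice loss for a single class: approaches 0 as prediction matches target
def dice_loss(prob, tgt, eps=1e-6):
    inter = (prob * tgt).sum()
    return 1 - (2 * inter + eps) / (prob.sum() + tgt.sum() + eps)
```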
# Conceptual IoU for one class (boolean masks, NumPy-style)
def iou(pred_mask, gt_mask):
    inter = (pred_mask & gt_mask).sum()  # pixels both predicted and labeled as the class
    union = (pred_mask | gt_mask).sum()  # pixels predicted or labeled as the class
    return inter / (union + 1e-6)        # epsilon avoids division by zero
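Extending this to mIoU means averaging per-class IoU over all classes. A sketch, assuming the common convention of skipping classes absent from both prediction and ground truth rather than counting them as 0:

```python
import numpy as np

def miou(pred, gt, num_classes, ignore_index=255):
    valid = gt != ignore_index
    ious = []
    for c in range(num_classes):
        p, g = (pred == c) & valid, (gt == c) & valid
        union = (p | g).sum()
        if union == 0:  # class absent everywhere: skip it
            continue
        ious.append((p & g).sum() / union)
    return float(np.mean(ious))
```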
Inference: DeepLabV3 (torchvision)
Requires torch and torchvision. The default weights are trained on a COCO subset using the 21 Pascal VOC classes (20 foreground classes plus background); map those labels to your task or fine-tune.
import torch
import torchvision.transforms as T
from torchvision.models.segmentation import deeplabv3_resnet50
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = deeplabv3_resnet50(weights="DEFAULT").to(device).eval()
preprocess = T.Compose([
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])
from PIL import Image
img = Image.open("street.jpg").convert("RGB")
inp = preprocess(img).unsqueeze(0).to(device)
with torch.no_grad():
    out = model(inp)["out"]  # (1, num_classes, h, w)
pred = out.argmax(dim=1).squeeze(0).cpu().numpy()
Resize pred back to the original image size with cv2.resize(..., interpolation=cv2.INTER_NEAREST); bilinear or bicubic interpolation would blend neighboring class ids into meaningless values.
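The same nearest-neighbor mapping can be written in plain NumPy, which makes the "no id mixing" point explicit (a sketch; `resize_labels_nearest` is a hypothetical helper, not a library function):

```python
import numpy as np

def resize_labels_nearest(labels, out_h, out_w):
    """Nearest-neighbor resize for integer label maps: every output pixel
    copies one input pixel, so no new class ids are invented."""
    in_h, in_w = labels.shape
    rows = np.arange(out_h) * in_h // out_h
    cols = np.arange(out_w) * in_w // out_w
    return labels[rows[:, None], cols[None, :]]
```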
Colorize label map for debugging
import numpy as np
# Pseudo-color: map each class id to an RGB triple (toy palette, one row per class)
palette = np.random.default_rng(0).integers(0, 256, size=(21, 3), dtype=np.uint8)
vis = palette[pred]  # (h, w, 3) uint8 image for visual inspection
Training tips (brief)
Apply the same geometric transform to the image and its mask (flip, scale, crop). Use synchronized batch norm or group norm when batch sizes are small. Start from ImageNet-pretrained backbones; freeze the backbone briefly, then unfreeze. Validate with mIoU on a held-out set, not just the loss.
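The "same geometric transform" point can be sketched with a paired horizontal flip (a minimal example; real pipelines typically use an augmentation library that handles image–mask pairs):

```python
import numpy as np

rng = np.random.default_rng(0)

def paired_hflip(image, mask, p=0.5):
    """Flip image (H, W, 3) and mask (H, W) together so pixels stay aligned."""
    if rng.random() < p:
        return image[:, ::-1].copy(), mask[:, ::-1].copy()
    return image, mask
```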
Takeaways
- Semantic = one label per pixel per class, not per object instance.
- Encoder–decoder and dilated convs recover full-resolution labels.
- Report mIoU; handle class imbalance in the loss.