Problem setup
Input: an RGB image of shape H×W×3. Output: a label map of shape H×W with integer class ids (0 … C−1), often plus a special ignore index for unlabeled pixels. Training pairs are images plus pixel-wise masks. Unlike object detection, there are no boxes; the model performs a classification at every pixel.
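A minimal sketch of such a training pair (toy sizes; using 255 as the ignore index, a common convention, e.g. in Cityscapes):

```python
import numpy as np

H, W = 4, 4  # toy sizes; real images are much larger
image = np.zeros((H, W, 3), dtype=np.uint8)   # RGB input
mask = np.full((H, W), 255, dtype=np.uint8)   # 255 = ignore index (unlabeled)
mask[:2, :] = 0                               # class 0 (e.g. background)
mask[2:, :2] = 1                              # class 1

# The loss is computed only where the mask is not the ignore index
valid = mask != 255
```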
vs instance segmentation
Semantic: all “person” pixels share one label. Instance: each person gets a separate object id and mask.
vs panoptic
Panoptic merges “stuff” (sky, road) and “things” (countable objects) into one unified labeling—beyond pure semantic.
Representative architectures
- FCN — Fully Convolutional Networks replace dense layers with convolutions; skip connections fuse coarse semantic and fine spatial detail.
- U-Net — Symmetric encoder–decoder with skip concatenations; very common in medical imaging and small datasets.
- DeepLab — Uses atrous (dilated) convolutions to enlarge receptive field without losing resolution as fast as pooling; ASPP combines multiple dilation rates.
- SegFormer / Mask2Former — Transformer-based designs for strong context (covered in advanced courses).
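The dilated-convolution idea behind DeepLab can be seen in a few lines: with matching padding, a dilated 3×3 kernel keeps the output resolution but covers a much wider context (a sketch, not DeepLab's exact configuration):

```python
import torch
import torch.nn as nn

x = torch.randn(1, 8, 32, 32)

# Two 3x3 convolutions that both preserve spatial size:
conv_d1 = nn.Conv2d(8, 8, kernel_size=3, padding=1, dilation=1)  # 3x3 receptive field
conv_d4 = nn.Conv2d(8, 8, kernel_size=3, padding=4, dilation=4)  # effectively 9x9

# Same output shape, larger context for the dilated kernel
print(conv_d1(x).shape, conv_d4(x).shape)  # both torch.Size([1, 8, 32, 32])
```

ASPP applies several such dilation rates in parallel and concatenates the results.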
Losses and metrics
Cross-entropy per pixel (with optional class weights) is the standard. For imbalanced classes (rare “pole” vs common “road”), use weighted CE, focal loss, or Dice/Lovász variants. IoU (Jaccard) per class and mIoU (mean over classes) are standard quality measures; pixel accuracy can be misleading when background dominates.
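A short sketch of the two loss options mentioned above; the class weights here are illustrative, not tuned values:

```python
import torch
import torch.nn.functional as F

logits = torch.randn(2, 5, 8, 8)          # (batch, classes, h, w)
target = torch.randint(0, 5, (2, 8, 8))   # integer class ids per pixel

# Weighted cross-entropy: up-weight rare classes, skip ignore-index pixels
weights = torch.tensor([0.5, 1.0, 1.0, 2.0, 2.0])
ce = F.cross_entropy(logits, target, weight=weights, ignore_index=255)

# Soft Dice loss for a single class: approaches 0 as prediction matches target
def dice_loss(prob, tgt, eps=1e-6):
    inter = (prob * tgt).sum()
    return 1 - (2 * inter + eps) / (prob.sum() + tgt.sum() + eps)
```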
# Conceptual IoU for one class (boolean masks, NumPy-style)
def iou(pred_mask, gt_mask):
    inter = (pred_mask & gt_mask).sum()  # pixels both predicted and labeled as the class
    union = (pred_mask | gt_mask).sum()  # pixels predicted or labeled as the class
    return inter / (union + 1e-6)        # epsilon avoids division by zero
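Extending this to mIoU means averaging per-class IoU over all classes. A sketch, assuming the common convention of skipping classes absent from both prediction and ground truth rather than counting them as 0:

```python
import numpy as np

def miou(pred, gt, num_classes, ignore_index=255):
    valid = gt != ignore_index
    ious = []
    for c in range(num_classes):
        p, g = (pred == c) & valid, (gt == c) & valid
        union = (p | g).sum()
        if union == 0:  # class absent everywhere: skip it
            continue
        ious.append((p & g).sum() / union)
    return float(np.mean(ious))
```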
Inference: DeepLabV3 (torchvision)
Requires torch and torchvision. The default weights are trained on a COCO subset using the 21 Pascal VOC classes (20 foreground classes plus background); map those labels to your task or fine-tune.
import torch
import torchvision.transforms as T
from torchvision.models.segmentation import deeplabv3_resnet50
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = deeplabv3_resnet50(weights="DEFAULT").to(device).eval()
preprocess = T.Compose([
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])
from PIL import Image
img = Image.open("street.jpg").convert("RGB")
inp = preprocess(img).unsqueeze(0).to(device)
with torch.no_grad():
    out = model(inp)["out"]  # (1, num_classes, h, w)
pred = out.argmax(dim=1).squeeze(0).cpu().numpy()
Resize pred back to the original image size with cv2.resize(..., interpolation=cv2.INTER_NEAREST); bilinear or bicubic interpolation would blend neighboring class ids into meaningless values.
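The same nearest-neighbor mapping can be written in plain NumPy, which makes the "no id mixing" point explicit (a sketch; `resize_labels_nearest` is a hypothetical helper, not a library function):

```python
import numpy as np

def resize_labels_nearest(labels, out_h, out_w):
    """Nearest-neighbor resize for integer label maps: every output pixel
    copies one input pixel, so no new class ids are invented."""
    in_h, in_w = labels.shape
    rows = np.arange(out_h) * in_h // out_h
    cols = np.arange(out_w) * in_w // out_w
    return labels[rows[:, None], cols[None, :]]
```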
Colorize label map for debugging
import numpy as np
# Pseudo-color: map each class id to an RGB triple (toy palette, one row per class)
palette = np.random.default_rng(0).integers(0, 256, size=(21, 3), dtype=np.uint8)
vis = palette[pred]  # (h, w, 3) uint8 image for visual inspection
Training tips (brief)
Apply the same geometric transform to the image and its mask (flip, scale, crop). Use synchronized batch norm or group norm when batch sizes are small. Start from ImageNet-pretrained backbones; freeze the backbone briefly, then unfreeze. Validate with mIoU on a held-out set, not just the loss.
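The "same geometric transform" point can be sketched with a paired horizontal flip (a minimal example; real pipelines typically use an augmentation library that handles image–mask pairs):

```python
import numpy as np

rng = np.random.default_rng(0)

def paired_hflip(image, mask, p=0.5):
    """Flip image (H, W, 3) and mask (H, W) together so pixels stay aligned."""
    if rng.random() < p:
        return image[:, ::-1].copy(), mask[:, ::-1].copy()
    return image, mask
```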
Takeaways
- Semantic = one label per pixel per class, not per object instance.
- Encoder–decoder and dilated convs recover full-resolution labels.
- Report mIoU; handle class imbalance in the loss.