Computer Vision Interview · 20 essential Q&A · Updated 2026

Semantic Segmentation: 20 Essential Q&A

Pixel-wise class labels, encoder–decoder designs, and how we score dense prediction.

~12 min read · 20 questions · Advanced
FCN · U-Net · mIoU · Dice
1 What is semantic segmentation? ⚡ easy
Answer: Assigning a class label to every pixel (road, sky, person)—no distinction between different instances of the same class.
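A minimal sketch of what "a label for every pixel" means in practice: a network emits one score per class per pixel, and the label map is the per-pixel argmax. The class list and the score values below are made up for illustration.

```python
def label_map(scores):
    """scores[c][y][x] -> labels[y][x], the index of the best class per pixel."""
    num_classes = len(scores)
    height, width = len(scores[0]), len(scores[0][0])
    return [
        [max(range(num_classes), key=lambda c: scores[c][y][x]) for x in range(width)]
        for y in range(height)
    ]

CLASSES = ["road", "sky", "person"]  # hypothetical class list
scores = [
    [[0.90, 0.20], [0.80, 0.10]],  # road scores for a 2x2 image
    [[0.05, 0.10], [0.10, 0.20]],  # sky scores
    [[0.05, 0.70], [0.10, 0.70]],  # person scores
]
labels = label_map(scores)
names = [[CLASSES[c] for c in row] for row in labels]
```

Both right-hand pixels come out as "person" with nothing marking whether they belong to one person or two, which is the instance-blindness described in the answer.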
2 How does it differ from classification? ⚡ easy
Answer: Classification: one label per image. Semantic segmentation: dense spatial map of labels—requires localization and context.
3 What did FCN change? 📊 medium
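Answer: Recast the fully connected classifier layers as convolutions (the later ones become 1×1 convolutions), so arbitrary input sizes work and the output is a spatial score map; learnable upsampling (transposed convolution, often called deconvolution) recovers resolution.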
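A rough sketch of why a 1×1 convolution handles arbitrary input sizes: it applies the same linear classifier at every spatial position, so nothing in it depends on height or width. The function name, weights, and feature maps are illustrative assumptions, not FCN's actual parameters.

```python
def conv1x1(features, weights, bias):
    """features[c_in][y][x], weights[c_out][c_in], bias[c_out] -> out[c_out][y][x]."""
    c_in = len(features)
    h, w = len(features[0]), len(features[0][0])
    return [
        [[bias[o] + sum(weights[o][i] * features[i][y][x] for i in range(c_in))
          for x in range(w)]
         for y in range(h)]
        for o in range(len(weights))
    ]

weights = [[1.0, -1.0], [0.5, 0.5]]  # 2 output classes from 2 input channels
bias = [0.0, 1.0]

small = [[[2.0]], [[1.0]]]                       # 2 channels, 1x1 spatial
large = [[[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]],     # 2 channels, 2x3 spatial
         [[0.0, 0.0, 0.0], [1.0, 1.0, 1.0]]]

out_small = conv1x1(small, weights, bias)
out_large = conv1x1(large, weights, bias)  # same weights, different input size
```

A fully connected layer would need a fixed flattened input length; here the same `weights` serve both inputs and the output keeps the input's spatial layout.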
4 Why does U-Net use skip connections? 📊 medium
Answer: Encoder downsamples for context; decoder upsamples; skip connections fuse fine detail from shallow layers with semantic deep features, which gives sharper boundaries.
5 Common upsampling methods? 📊 medium
Answer: Transposed convolution, bilinear upsample + conv, sub-pixel shuffle—each trades artifacts, parameters, and speed differently.
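Of the three methods, sub-pixel shuffle is the easiest to show without a framework: it rearranges groups of r×r channels into an r-times larger spatial grid, with no arithmetic at all. This is a sketch on nested lists; the channel ordering is an assumption chosen to follow the usual sub-pixel convolution layout.

```python
def pixel_shuffle(x, r):
    """x[C*r*r][H][W] -> out[C][H*r][W*r] by moving channels into space."""
    c_out = len(x) // (r * r)
    h, w = len(x[0]), len(x[0][0])
    out = [[[0.0] * (w * r) for _ in range(h * r)] for _ in range(c_out)]
    for c in range(c_out):
        for y in range(h * r):
            for col in range(w * r):
                # Offset inside the r x r block selects the source channel.
                src = c * r * r + (y % r) * r + (col % r)
                out[c][y][col] = x[src][y // r][col // r]
    return out

# Four 1x1 channels become one 2x2 map when r = 2.
low_res = [[[1.0]], [[2.0]], [[3.0]], [[4.0]]]
high_res = pixel_shuffle(low_res, 2)
```

In a real layer a convolution first produces the C·r² channels, so the learning happens at low resolution and the shuffle itself is free.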
6 What is mIoU? 📊 medium
Answer: Mean Intersection over Union: compute IoU = |A∩B| / |A∪B| between predicted and ground-truth masks for each class, then average over classes. The standard benchmark metric.
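A small sketch of the computation, assuming integer label maps as nested lists and accumulating a confusion matrix first. Skipping classes that appear in neither mask is one common convention, not the only one.

```python
def mean_iou(pred, target, num_classes):
    """Return (mIoU, per-class IoUs) for two label maps of equal shape."""
    conf = [[0] * num_classes for _ in range(num_classes)]  # conf[true][pred]
    for p_row, t_row in zip(pred, target):
        for p, t in zip(p_row, t_row):
            conf[t][p] += 1
    ious = []
    for c in range(num_classes):
        tp = conf[c][c]
        fn = sum(conf[c]) - tp
        fp = sum(conf[r][c] for r in range(num_classes)) - tp
        union = tp + fp + fn
        if union > 0:  # ignore classes absent from both masks
            ious.append(tp / union)
    miou = sum(ious) / len(ious) if ious else 0.0
    return miou, ious

target = [[0, 0, 1],
          [0, 1, 1]]
pred = [[0, 1, 1],
        [0, 1, 1]]
miou, per_class = mean_iou(pred, target, num_classes=2)
# Class 0: 2 / 3, class 1: 3 / 4, so mIoU = 17 / 24.
```

On a benchmark the confusion matrix is normally accumulated over the whole dataset before the per-class IoUs are taken, rather than averaging per-image scores.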
7 What is Dice coefficient? 📊 medium
Answer: 2|A∩B|/(|A|+|B|); identical to the F1 score for binary masks. A common loss for medical segmentation when the foreground is tiny.
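To make the F1 connection concrete, the sketch below computes Dice and F1 on the same flattened binary masks. The small `eps` smoothing term is an assumption commonly added so that empty masks do not divide by zero.

```python
def dice(pred, target, eps=1e-7):
    """Dice coefficient for flattened binary masks (lists of 0/1)."""
    intersection = sum(p * t for p, t in zip(pred, target))
    return (2 * intersection + eps) / (sum(pred) + sum(target) + eps)

def f1(pred, target):
    """F1 from precision and recall on the same masks."""
    tp = sum(1 for p, t in zip(pred, target) if p == 1 and t == 1)
    fp = sum(1 for p, t in zip(pred, target) if p == 1 and t == 0)
    fn = sum(1 for p, t in zip(pred, target) if p == 0 and t == 1)
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

pred = [1, 1, 0, 0, 1, 0]
target = [1, 0, 0, 0, 1, 1]
dice_score = dice(pred, target)  # 2*2 / (3 + 3), about 0.667
f1_score = f1(pred, target)      # same value up to the eps term
```

As a loss, 1 − Dice is typically computed on soft probabilities instead of hard 0/1 values so that it stays differentiable.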
8 Standard loss? ⚡ easy
Answer: Per-pixel cross-entropy (softmax over classes); can weight rare classes or use focal variants for hard pixels.
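A plain-Python sketch of per-pixel softmax cross-entropy with optional class weights. Normalising by the summed weights is one common choice and is an assumption here; the logits and weights are illustrative.

```python
import math

def pixel_cross_entropy(logits, target, class_weights=None):
    """logits[c][y][x], target[y][x] -> weighted mean cross-entropy."""
    num_classes = len(logits)
    weights = class_weights or [1.0] * num_classes
    total, weight_sum = 0.0, 0.0
    for y, row in enumerate(target):
        for x, t in enumerate(row):
            z = [logits[c][y][x] for c in range(num_classes)]
            m = max(z)  # subtract the max for a numerically stable log-sum-exp
            log_sum_exp = m + math.log(sum(math.exp(v - m) for v in z))
            total += weights[t] * (log_sum_exp - z[t])  # -log softmax(z)[t]
            weight_sum += weights[t]
    return total / weight_sum

logits = [[[2.0, 0.0]],   # class 0 scores for a 1x2 image
          [[0.0, 0.0]]]   # class 1 scores
target = [[0, 1]]
plain = pixel_cross_entropy(logits, target)
weighted = pixel_cross_entropy(logits, target, class_weights=[1.0, 3.0])
```

With the rare class weighted 3×, the uncertain pixel of class 1 dominates the average, so `weighted` is larger than `plain`.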
9 Why are boundaries hard? 🔥 hard
Answer: Edges are ambiguous even in the labels, and thin structures disappear at low resolution. Fixes: deep supervision, boundary-aware loss, high-res branches, or larger input crops.
10 Handle class imbalance? 📊 medium
Answer: Weighted CE, oversampling rare classes, focal loss, dice loss, or balanced sampling in batches.
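Two of these remedies fit in a few lines. The sketch below derives inverse-frequency class weights from a label map and evaluates the focal-loss term for a single pixel. The normalisation of the weights and the default `gamma=2.0` are assumptions for illustration.

```python
import math
from collections import Counter

def inverse_frequency_weights(target, num_classes):
    """Weight each class by total / (num_classes * count); absent classes get 0."""
    counts = Counter(t for row in target for t in row)
    total = sum(counts.values())
    return [total / (num_classes * counts[c]) if counts[c] else 0.0
            for c in range(num_classes)]

def focal_loss(p_true, gamma=2.0):
    """Focal term for one pixel, given the predicted probability of its true class."""
    return -((1.0 - p_true) ** gamma) * math.log(p_true)

target = [[0, 0, 0],
          [0, 0, 1]]  # class 1 covers 1 of 6 pixels
weights = inverse_frequency_weights(target, num_classes=2)

easy_pixel = focal_loss(0.9)  # confident and correct: heavily down-weighted
hard_pixel = focal_loss(0.1)  # confident and wrong: nearly full penalty
```

Setting `gamma=0` recovers ordinary cross-entropy, which is a quick sanity check when wiring the loss into training.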
11 What is ASPP? 🔥 hard
Answer: Atrous spatial pyramid pooling—parallel dilated convs at multiple rates capture multi-scale context without losing resolution (DeepLab family).
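A 1-D sketch of the core idea: the same 3-tap kernel applied at several dilation rates reaches further with each rate while the output keeps the input length. It omits the extra branches and the fusion layer that real ASPP modules include, and the rates are illustrative.

```python
def dilated_conv1d(signal, kernel, rate):
    """Zero-padded 'same' 1-D cross-correlation with taps spaced `rate` apart."""
    k, n = len(kernel), len(signal)
    out = []
    for i in range(n):
        acc = 0.0
        for j, w in enumerate(kernel):
            idx = i + (j - k // 2) * rate  # tap position for this dilation rate
            if 0 <= idx < n:
                acc += w * signal[idx]
        out.append(acc)
    return out

def aspp_1d(signal, kernel, rates=(1, 2, 4)):
    """Run the kernel at every rate in parallel; one output branch per rate."""
    return [dilated_conv1d(signal, kernel, r) for r in rates]

impulse = [0, 0, 0, 0, 1, 0, 0, 0, 0]
branches = aspp_1d(impulse, kernel=[1, 1, 1])
# Positions that "see" the impulse: rate 1 -> 3,4,5; rate 2 -> 2,4,6; rate 4 -> 0,4,8.
```

Every branch has nine outputs for nine inputs, which is the "without losing resolution" part; only the spacing of the taps changes.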
12 What is the PSPNet idea? 📊 medium
Answer: Pyramid pooling at several scales then upsample and concatenate—rich global scene context for each pixel.
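A toy single-channel sketch of pyramid pooling: average-pool the feature map into a few grid sizes, upsample each back to full size, and stack the results with the original. Nearest-neighbour upsampling and dimensions divisible by the bin count are simplifying assumptions, and the bin sizes are illustrative.

```python
def pool_and_upsample(fmap, bins):
    """Average-pool fmap[y][x] into bins x bins cells, then repeat back to full size."""
    h, w = len(fmap), len(fmap[0])
    bh, bw = h // bins, w // bins  # assumes h and w are divisible by bins
    pooled = [
        [sum(fmap[y][x]
             for y in range(by * bh, (by + 1) * bh)
             for x in range(bx * bw, (bx + 1) * bw)) / (bh * bw)
         for bx in range(bins)]
        for by in range(bins)
    ]
    return [[pooled[y // bh][x // bw] for x in range(w)] for y in range(h)]

def pyramid_pooling(fmap, bin_sizes=(1, 2)):
    """Return the original map plus one context map per pyramid level."""
    return [fmap] + [pool_and_upsample(fmap, b) for b in bin_sizes]

fmap = [[1.0, 2.0],
        [3.0, 4.0]]
stacked = pyramid_pooling(fmap)
# Level with 1 bin: every pixel now carries the global mean 2.5.
```

After concatenation, each pixel holds its own feature next to scene-level averages, which is how global context reaches the per-pixel classifier.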
13 Multi-scale inference? 📊 medium
Answer: Run network on several scales / flipped inputs and average logits—boosts mIoU at inference cost.
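The flip half of this can be sketched without a framework: predict on the image and on its mirror, un-flip the second result, and average the logits. `toy_model` is a hypothetical stand-in with a position-dependent bias; extra scales would follow the same pattern with a resize before and after the model call.

```python
def hflip(tensor):
    """Mirror tensor[c][y][x] left to right."""
    return [[row[::-1] for row in channel] for channel in tensor]

def predict_with_flip(model, image):
    """Average logits from the original and the horizontally flipped input."""
    logits = model(image)
    logits_from_flip = hflip(model(hflip(image)))  # flip back to align pixels
    return [
        [[(a + b) / 2 for a, b in zip(r1, r2)] for r1, r2 in zip(c1, c2)]
        for c1, c2 in zip(logits, logits_from_flip)
    ]

def toy_model(image):
    """Hypothetical model: copies channel 0 and adds a bias that grows with x."""
    return [[[v + 0.1 * x for x, v in enumerate(row)] for row in image[0]]]

image = [[[1.0, 3.0]]]  # 1 channel, 1x2 pixels
averaged = predict_with_flip(toy_model, image)
# Single pass gives [1.0, 3.1]; the flipped pass gives [1.1, 3.0] after un-flipping.
```

The cost is one extra forward pass per augmentation, which is the inference overhead the answer mentions.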
14 Weakly supervised segmentation? 🔥 hard
Answer: Train from image tags, scribbles, or bounding boxes using constraints (e.g. MIL, GrabCut-style seeds), so fewer pixel-level labels are needed.
15 Link to panoptic? 📊 medium
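Answer: Panoptic segmentation labels every pixel with a class and also gives each “thing” (countable object) an instance ID, while “stuff” (sky, road) keeps class labels only. Semantic segmentation is one component of this full scene parsing.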
16 Use CRF post-processing? 📊 medium
Answer: Historically, dense CRFs refined CNN outputs with pairwise smoothness terms; they are less dominant now that architectures are stronger, but still a common interview topic.
17 Can semantic segmentation separate two people? ⚡ easy
Answer: No. Both get the label “person”; instance segmentation is needed for separate masks.
18 Why is data expensive? ⚡ easy
Answer: Every image needs a pixel-accurate mask, which takes far longer to annotate than bounding boxes or image tags; semi-automatic labeling tools and synthetic data help.
19 Transformers for segmentation? 🔥 hard
Answer: SegFormer, Mask2Former, Segmenter—global attention and mask queries compete with CNN encoders on benchmarks.
20 Real-time models? 📊 medium
Answer: Lightweight backbones (MobileNet), BiSeNet, Fast-SCNN—trade mIoU for FPS on edge devices.

Semantic Segmentation Cheat Sheet

Architecture
  • Encoder–decoder
  • Skips (U-Net)
Metric
  • mIoU
  • Dice (medical)
Context
  • ASPP / PSP
  • Multi-scale test

💡 Pro tip: Dense per-pixel labels; same class shares one semantic mask.
