Computer Vision Interview 60 Q&A Chapter 7

Deep Segmentation — Interview Q&A

Segmentation overview, semantic pixel-wise labeling, and instance segmentation with masks.

Segmentation Overview: 20 Essential Q&A

1 What is image segmentation? ⚡ easy

Answer: Partitioning an image into regions or labeling each pixel with a category—bridges low-level pixels and high-level objects.

2 What is semantic segmentation? 📊 medium

Answer: Each pixel gets a class label (road, sky, person) without distinguishing different instances of the same class.

3 What is instance segmentation? 📊 medium

Answer: Separates individual objects even within the same class; each instance has its own mask and ID.

4 What is panoptic segmentation? 🔥 hard

Answer: Unifies stuff (amorphous regions like sky) and things (countable objects) with non-overlapping masks covering the whole image.

5 Is thresholding segmentation? ⚡ easy

Answer: Yes for binary foreground/background—simplest form; limited for complex scenes without additional cues.
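
As a rough illustration of thresholding as binary segmentation, the sketch below picks the threshold automatically with Otsu's method (maximizing between-class variance). Otsu's method is swapped in here as one common choice; the answer above does not name it. The toy image and the function name `otsu_threshold` are assumptions made for the example.

```python
import numpy as np

def otsu_threshold(gray):
    """Return the gray level t that maximizes between-class variance."""
    hist = np.bincount(gray.ravel(), minlength=256).astype(np.float64)
    prob = hist / hist.sum()
    levels = np.arange(256)
    w0 = np.cumsum(prob)                # weight of the class "value <= t"
    mu_cum = np.cumsum(prob * levels)   # cumulative first moment
    mu_total = mu_cum[-1]
    w1 = 1.0 - w0                       # weight of the class "value > t"
    valid = (w0 > 0) & (w1 > 0)         # skip thresholds with an empty class
    between = np.zeros(256)
    between[valid] = (mu_total * w0[valid] - mu_cum[valid]) ** 2 / (
        w0[valid] * w1[valid]
    )
    return int(np.argmax(between))

# Toy image (assumed for illustration): dark background, bright 4x4 square.
image = np.full((8, 8), 20, dtype=np.uint8)
image[2:6, 2:6] = 200

t = otsu_threshold(image)
mask = image > t   # True = foreground
```

On real images the histogram is rarely this cleanly bimodal, which is where the "limited for complex scenes" caveat applies.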

6 What is region growing? 📊 medium

Answer: Start from seeds, merge neighboring pixels similar under a criterion (intensity, texture)—sensitive to seed placement and noise.
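
A minimal sketch of region growing, assuming a single seed, 4-connectivity, and a fixed intensity tolerance relative to the seed value. These choices, the name `region_grow`, and the toy image are illustrative assumptions; other similarity criteria (running region mean, texture) are equally valid.

```python
from collections import deque

import numpy as np

def region_grow(image, seed, tol):
    """Grow from `seed`, adding 4-connected pixels within `tol` of the seed value."""
    h, w = image.shape
    seed_value = float(image[seed])
    region = np.zeros((h, w), dtype=bool)
    region[seed] = True
    queue = deque([seed])
    while queue:
        r, c = queue.popleft()
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if 0 <= nr < h and 0 <= nc < w and not region[nr, nc]:
                if abs(float(image[nr, nc]) - seed_value) <= tol:
                    region[nr, nc] = True
                    queue.append((nr, nc))
    return region

image = np.array(
    [
        [10, 12, 11, 90],
        [11, 10, 95, 92],
        [12, 11, 10, 91],
        [80, 82, 11, 10],
    ],
    dtype=np.uint8,
)
region = region_grow(image, seed=(0, 0), tol=5)
```

Moving the seed onto a bright pixel, or adding noise larger than `tol`, changes the result completely, which is the sensitivity the answer refers to.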

7 Split and merge? 📊 medium

Answer: Recursively split non-uniform regions, then merge adjacent similar regions—quadtree-style classical approach.

8 Segment with k-means in color space? 📊 medium

Answer: Cluster pixel colors (RGB or LAB); each cluster is a segment—produces patchy results without spatial smoothness unless augmented.
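
The idea can be sketched with a plain Lloyd's k-means over RGB values. The deterministic initialization (evenly spaced pixels), the function name `kmeans_colors`, and the two-color toy image are assumptions for the example; real pipelines usually use k-means++ initialization and often cluster in LAB.

```python
import numpy as np

def kmeans_colors(pixels, k, iters=10):
    """Lloyd's k-means on an (N, 3) color array; returns labels and centers."""
    # Assumed simple init: take k evenly spaced pixels as starting centers.
    idx = np.linspace(0, len(pixels) - 1, k).astype(int)
    centers = pixels[idx].astype(np.float64)
    for _ in range(iters):
        # Distance from every pixel to every center: shape (N, k).
        dists = np.linalg.norm(pixels[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = pixels[labels == j].mean(axis=0)
    return labels, centers

# Toy image: left half reddish, right half bluish.
image = np.zeros((4, 6, 3), dtype=np.uint8)
image[:, :3] = (200, 30, 30)
image[:, 3:] = (30, 30, 200)

labels, centers = kmeans_colors(image.reshape(-1, 3), k=2)
label_map = labels.reshape(image.shape[:2])
```

Because pixel position never enters the distance, two distant pixels of the same color always share a label. Appending scaled (x, y) coordinates to each feature vector is one way to add the spatial smoothness mentioned above.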

9 What is mean shift segmentation? 🔥 hard

Answer: Mode-seeking in joint color-spatial feature space—clusters pixels to local density peaks; smooths labels but can be slow.

10 What is the watershed transform? 📊 medium

Answer: Treat the gradient magnitude as a height map and flood it from markers. Without markers, every local minimum becomes its own basin, which causes oversegmentation, so marker-controlled watershed is the common variant.

11 What are graph cuts? 🔥 hard

Answer: Pixels as nodes, with unary data costs and pairwise smoothness costs; a min-cut gives the globally optimal binary partition for this energy (given submodular pairwise terms). Used in GrabCut-style energy minimization.

12 What is GrabCut? 📊 medium

Answer: Iterative graph-cut segmentation with Gaussian mixture models on RGB—user provides box or strokes; refines foreground/background.

13 What are active contours (snakes)? 📊 medium

Answer: Deform curve minimizing internal smoothness + external edge attraction—classic for medical boundaries; level-set extensions handle topology changes.

14 Oversegmentation? ⚡ easy

Answer: Too many small regions, as with watershed without markers; fix with region merging, markers, or superpixel methods such as SLIC.

15 Evaluate masks with IoU? 📊 medium

Answer: Intersection over union per class or instance; mean IoU (mIoU) standard for semantic segmentation benchmarks.

16 Boundary F-score? 🔥 hard

Answer: Measures alignment of predicted vs GT contours—complements IoU for thin structures.

17 Deep learning for segmentation? 📊 medium

Answer: FCN replaces FC layers with convs; U-Net encoder-decoder with skip connections—dominant paradigm now with transformers emerging.

18 Video segmentation? 📊 medium

Answer: Temporal consistency, optical flow warping, or memory networks—object masks tracked across frames (VOS).

19 Interactive segmentation? ⚡ easy

Answer: User clicks/scribbles guide model (GrabCut, deep interactive)—few-shot refinement for editing.

20 Pick classical vs deep? 📊 medium

Answer: Classical: fast, little data, controlled scenes. Deep: cluttered natural images, but needs labels and compute. Industrial systems often use a hybrid: a classical pipeline with deep-learning refinement.

Semantic Segmentation: 20 Essential Q&A

21 What is semantic segmentation? ⚡ easy

Answer: Assigning a class label to every pixel (road, sky, person)—no distinction between different instances of the same class.

22 How does it differ from classification? ⚡ easy

Answer: Classification: one label per image. Semantic segmentation: dense spatial map of labels—requires localization and context.

23 What did FCN change? 📊 medium

Answer: Replaced fully connected layers with convolutional layers (the classifier head becomes a stack of convolutions ending in 1×1 convs), so arbitrary input sizes work; added learnable upsampling (deconv/transposed conv) to recover resolution.

24 Why U-Net skips? 📊 medium

Answer: Encoder downsamples for context; decoder upsamples; skip connections fuse fine detail from shallow layers with semantic deep features—sharp boundaries.

25 Common upsampling methods? 📊 medium

Answer: Transposed convolution, bilinear upsample + conv, sub-pixel shuffle—each trades artifacts, parameters, and speed differently.
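
Of the three, sub-pixel shuffle is the least obvious, so here is a NumPy sketch of the rearrangement alone. It assumes the channel layout used by PyTorch's `PixelShuffle` (input channel `c*r*r + i*r + j` maps to output pixel offset `(i, j)`); the function name and the tiny input are illustrative.

```python
import numpy as np

def pixel_shuffle(x, r):
    """Rearrange (C*r*r, H, W) into (C, H*r, W*r) without any arithmetic."""
    c_rr, h, w = x.shape
    c = c_rr // (r * r)
    x = x.reshape(c, r, r, h, w)      # axes: (c, i, j, h, w)
    x = x.transpose(0, 3, 1, 4, 2)    # axes: (c, h, i, w, j)
    return x.reshape(c, h * r, w * r)

# Four channels holding one pixel each become a single 2x2 output patch.
x = np.arange(4, dtype=np.float64).reshape(4, 1, 1)
up = pixel_shuffle(x, r=2)

# Eight channels at 3x3 become two channels at 6x6.
bigger = pixel_shuffle(np.zeros((8, 3, 3)), r=2)
```

The preceding convolution does the learning at low resolution; the shuffle itself has no parameters, which is part of its speed advantage.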

26 What is mIoU? 📊 medium

Answer: Mean Intersection over Union: compute IoU (overlap of predicted vs ground-truth mask) for each class, then average over classes. It is the standard benchmark metric.
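
A small sketch of the computation via a confusion matrix. Skipping classes that appear in neither prediction nor ground truth is an assumption here; benchmarks differ on how they treat absent classes. The name `mean_iou` and the toy label maps are illustrative.

```python
import numpy as np

def mean_iou(pred, target, num_classes):
    """Return (mIoU, per-class IoU); classes absent from both maps get NaN."""
    # Confusion matrix: rows = ground-truth class, columns = predicted class.
    conf = np.bincount(
        num_classes * target.ravel() + pred.ravel(),
        minlength=num_classes ** 2,
    ).reshape(num_classes, num_classes)
    inter = np.diag(conf).astype(np.float64)
    union = conf.sum(axis=0) + conf.sum(axis=1) - inter
    iou = np.full(num_classes, np.nan)
    present = union > 0
    iou[present] = inter[present] / union[present]
    return float(np.nanmean(iou)), iou

target = np.array([[0, 0, 1, 1],
                   [0, 0, 1, 1]])
pred = np.array([[0, 0, 0, 1],
                 [0, 0, 1, 1]])

miou, per_class = mean_iou(pred, target, num_classes=3)
# Class 0: 4/5, class 1: 3/4, class 2 absent -> mIoU = 0.775
```

Benchmark implementations typically accumulate the confusion matrix over the whole dataset before dividing, rather than averaging per-image scores.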

27 What is Dice coefficient? 📊 medium

Answer: 2|A∩B|/(|A|+|B|); identical to the F1 score for binary masks. Common as a loss for medical segmentation when the foreground is tiny.
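
The formula and its soft (differentiable) loss variant can be sketched as follows. The `eps` smoothing term and the function names are assumptions; the example also checks the identity Dice = 2·IoU / (1 + IoU), which follows directly from the two definitions.

```python
import numpy as np

def dice_score(a, b, eps=1e-7):
    """Dice coefficient between two binary masks."""
    a = a.astype(bool)
    b = b.astype(bool)
    inter = np.logical_and(a, b).sum()
    return (2.0 * inter + eps) / (a.sum() + b.sum() + eps)

def soft_dice_loss(probs, target, eps=1e-7):
    """1 - Dice computed on probabilities, usable as a training loss."""
    inter = (probs * target).sum()
    return 1.0 - (2.0 * inter + eps) / (probs.sum() + target.sum() + eps)

gt = np.array([[0, 1, 1],
               [0, 1, 1]])
pred = np.array([[0, 1, 1],
                 [0, 1, 0]])

d = dice_score(pred, gt)                     # 2*3 / (3 + 4) = 6/7
iou = 3 / 4                                  # intersection 3, union 4
loss = soft_dice_loss(gt.astype(float), gt)  # perfect prediction -> ~0
```

Because the score is normalized by the mask sizes, a tiny foreground still contributes a full-scale gradient signal, unlike a plain per-pixel average.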

28 Standard loss? ⚡ easy

Answer: Per-pixel cross-entropy (softmax over classes); can weight rare classes or use focal variants for hard pixels.
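
A NumPy sketch of the loss, with optional per-class weights. The weighted-mean normalization (dividing by the sum of weights) is one common convention and is an assumption here, as are the function name and the tensor layout `(C, H, W)`.

```python
import numpy as np

def pixel_cross_entropy(logits, target, class_weights=None):
    """logits: (C, H, W) scores; target: (H, W) integer labels."""
    # Numerically stable log-softmax over the class axis.
    shifted = logits - logits.max(axis=0, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=0, keepdims=True))
    rows, cols = np.indices(target.shape)
    nll = -log_probs[target, rows, cols]     # (H, W) loss per pixel
    if class_weights is None:
        return float(nll.mean())
    weights = np.asarray(class_weights, dtype=np.float64)[target]
    return float((weights * nll).sum() / weights.sum())

target = np.array([[0, 1],
                   [2, 1]])
uniform_logits = np.zeros((3, 2, 2))
loss_uniform = pixel_cross_entropy(uniform_logits, target)   # log(3)

# Confident, correct logits drive the loss toward zero.
confident = np.full((3, 2, 2), -10.0)
rows, cols = np.indices(target.shape)
confident[target, rows, cols] = 10.0
loss_confident = pixel_cross_entropy(confident, target, class_weights=[1.0, 2.0, 5.0])
```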

29 Why are boundaries hard? 🔥 hard

Answer: Ambiguous edges, thin structures disappear at low res—fixes: deep supervision, boundary-aware loss, high-res branches, or larger input crops.

30 Handle class imbalance? 📊 medium

Answer: Weighted CE, oversampling rare classes, focal loss, dice loss, or balanced sampling in batches.

31 What is ASPP? 🔥 hard

Answer: Atrous spatial pyramid pooling—parallel dilated convs at multiple rates capture multi-scale context without losing resolution (DeepLab family).
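
To make "multiple rates without losing resolution" concrete, here is a 1D toy sketch of parallel dilated branches only. It is not the DeepLab module: the real ASPP works on 2D feature maps with learned kernels and additional branches. The function name, the all-ones kernel, and the impulse input are assumptions.

```python
import numpy as np

def dilated_conv1d(signal, kernel, rate):
    """Same-padded 1D cross-correlation with `rate - 1` holes between taps."""
    k = len(kernel)
    pad = rate * (k - 1) // 2
    padded = np.pad(signal, pad)
    out = np.zeros(len(signal))
    for i in range(len(signal)):
        taps = padded[i : i + rate * (k - 1) + 1 : rate]
        out[i] = np.dot(taps, kernel)
    return out

signal = np.zeros(17)
signal[8] = 1.0                      # single impulse in the middle
kernel = np.ones(3)

# Parallel branches at several rates, stacked like ASPP's concatenation.
rates = (1, 2, 4)
branches = np.stack([dilated_conv1d(signal, kernel, r) for r in rates])
```

Each branch keeps the input length (17 samples), but the positions that respond to the impulse spread from ±1 to ±4 as the rate grows, so the same three weights see progressively wider context.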

32 What is PSPNet idea? 📊 medium

Answer: Pyramid pooling at several scales then upsample and concatenate—rich global scene context for each pixel.

33 Multi-scale inference? 📊 medium

Answer: Run network on several scales / flipped inputs and average logits—boosts mIoU at inference cost.
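
The flip half of this can be sketched in a few lines; multi-scale averaging works the same way but needs resizing, which is omitted. `toy_model` is a hypothetical stand-in for a segmentation network, deliberately made left/right asymmetric so the averaging has a visible effect.

```python
import numpy as np

def toy_model(image):
    """Hypothetical stand-in network: (H, W) image -> (2, H, W) logits."""
    left_neighbor = np.roll(image, 1, axis=1)   # asymmetric on purpose
    score = image + 0.5 * left_neighbor
    return np.stack([-score, score])

def predict_with_flip(model, image):
    """Average logits from the image and its horizontal mirror."""
    logits = model(image)
    # Predict on the mirrored image, then mirror the logits back.
    mirrored = model(image[:, ::-1])[..., ::-1]
    return 0.5 * (logits + mirrored)

rng = np.random.default_rng(0)
image = rng.random((4, 6))

avg_logits = predict_with_flip(toy_model, image)
labels = avg_logits.argmax(axis=0)
```

The cost is one extra forward pass per flip or scale, which is why this is normally reserved for benchmark submissions rather than latency-sensitive deployment.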

34 Weakly supervised segmentation? 🔥 hard

Answer: Train from image tags, scribbles, or bounding boxes using constraints (e.g. MIL, GrabCut-style seeds), so fewer pixel-level labels are needed.

35 Link to panoptic? 📊 medium

Answer: Panoptic adds instance IDs for “things” while semantic handles “stuff”—semantic is a component of full scene parsing.

36 Use CRF post-processing? 📊 medium

Answer: Historically used to refine CNN outputs with pairwise smoothness terms; less dominant now that architectures are stronger, but still a common interview topic.

37 Can semantic separate two people? ⚡ easy

Answer: No—both get label “person”; need instance segmentation for separate masks.

38 Why is data expensive? ⚡ easy

Answer: Every image needs pixel-accurate masks, which take far longer to draw than bounding boxes; semi-automatic labeling tools and synthetic data help.

39 Transformers for segmentation? 🔥 hard

Answer: SegFormer, Mask2Former, Segmenter—global attention and mask queries compete with CNN encoders on benchmarks.

40 Real-time models? 📊 medium

Answer: Lightweight backbones (MobileNet), BiSeNet, Fast-SCNN—trade mIoU for FPS on edge devices.

Instance Segmentation: 20 Essential Q&A

41 What is instance segmentation? ⚡ easy

Answer: Each object instance gets its own binary mask and class label; two “person” pixels belong to different instances if they lie on different people.

42 Semantic vs instance? 📊 medium

Answer: Semantic: one mask per class. Instance: N masks for N objects, possibly same class—handles overlap with distinct IDs.

43 How does Mask R-CNN extend Faster R-CNN? 📊 medium

Answer: Adds a parallel mask head: a small FCN on each RoI predicts an m×m binary mask for each of the K classes, trained jointly with the box and class heads.

44 Why RoIAlign? 🔥 hard

Answer: RoIPool quantizes coordinates → misalignment for masks. RoIAlign uses bilinear sampling at exact float locations—critical for pixel-accurate masks.
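
The difference can be seen on a single sample point. This is a simplified sketch: real RoIAlign averages several bilinear samples per output bin, and rounding is used here only as a stand-in for RoIPool's coordinate quantization. The function name and the ramp-shaped feature map are assumptions.

```python
import numpy as np

def bilinear_sample(feat, y, x):
    """Bilinearly interpolate a 2D feature map at a float location (y, x)."""
    h, w = feat.shape
    y0, x0 = int(np.floor(y)), int(np.floor(x))
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
    dy, dx = y - y0, x - x0
    return (
        feat[y0, x0] * (1 - dy) * (1 - dx)
        + feat[y0, x1] * (1 - dy) * dx
        + feat[y1, x0] * dy * (1 - dx)
        + feat[y1, x1] * dy * dx
    )

# Feature map whose value equals the x coordinate, so the exact answer is known.
feat = np.tile(np.arange(6, dtype=np.float64), (6, 1))
y, x = 1.3, 2.6

aligned = bilinear_sample(feat, y, x)             # close to 2.6
quantized = feat[int(round(y)), int(round(x))]    # snaps to x = 3
```

A 0.4-pixel error on the feature map is multiplied by the backbone stride (often 16 or 32) in image space, which is large compared with the width of a mask boundary.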

45 Mask branch output? 📊 medium

Answer: Typically 28×28 mask logits per RoI, resized to the RoI size and thresholded (0.5 in the original Mask R-CNN paper) to give a binary mask; the head is a lightweight per-region FCN.

46 Loss on masks? 📊 medium

Answer: Per-pixel sigmoid + BCE on the target class mask only (not softmax over all classes per pixel in the classic formulation).
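
A sketch of that loss for one RoI, assuming mask logits of shape `(K, m, m)` with one channel per class. The function name and the random inputs are illustrative. The key point is that only the ground-truth class channel is indexed, so other channels receive no gradient.

```python
import numpy as np

def mask_bce_loss(mask_logits, gt_class, gt_mask):
    """Mean sigmoid BCE on the ground-truth class channel only."""
    z = mask_logits[gt_class]                     # (m, m) logits for one class
    # Stable BCE-with-logits: max(z, 0) - z*t + log(1 + exp(-|z|))
    loss = np.maximum(z, 0) - z * gt_mask + np.log1p(np.exp(-np.abs(z)))
    return float(loss.mean())

rng = np.random.default_rng(0)
mask_logits = rng.normal(size=(3, 4, 4))          # K = 3 classes, m = 4
gt_mask = np.zeros((4, 4))
gt_mask[1:3, 1:3] = 1.0

loss = mask_bce_loss(mask_logits, gt_class=1, gt_mask=gt_mask)

# Changing another class's channel leaves the loss untouched.
altered = mask_logits.copy()
altered[0] += 10.0
loss_altered = mask_bce_loss(altered, gt_class=1, gt_mask=gt_mask)
```

With a per-pixel softmax the classes would compete for each pixel; decoupling them leaves classification to the box head and lets the mask head focus on shape.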

47 Can two instance masks overlap in GT? ⚡ easy

Answer: Yes—foreground object in front of another; model must predict ordering or independent masks per instance.

48 Panoptic segmentation? 📊 medium

Answer: Unifies semantic “stuff” and instance “things” with non-overlapping full-scene labeling—each pixel has one label + optional instance id.

49 What is YOLACT? 📊 medium

Answer: One-stage: combines prototype masks with per-instance coefficients for fast instance segmentation—speed-quality tradeoff.
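
The mask assembly step amounts to a matrix product followed by a sigmoid. The sketch below assumes prototypes of shape `(H, W, k)` and coefficients of shape `(n, k)`, and leaves out the box cropping that follows in the full method. The hand-made prototypes and the function name are illustrative assumptions.

```python
import numpy as np

def assemble_masks(prototypes, coeffs, threshold=0.5):
    """Linear combination of prototypes per instance, sigmoid, then threshold."""
    logits = prototypes @ coeffs.T            # (H, W, k) @ (k, n) -> (H, W, n)
    probs = 1.0 / (1.0 + np.exp(-logits))
    return probs > threshold

h = w = 4
rows, cols = np.indices((h, w))
prototypes = np.stack(
    [
        np.where(cols < 2, 5.0, -5.0),   # prototype 0 fires on the left half
        np.where(rows < 2, 5.0, -5.0),   # prototype 1 fires on the top half
    ],
    axis=-1,
)
coeffs = np.array(
    [
        [1.0, 0.0],    # instance 0 uses prototype 0 only
        [0.0, 1.0],    # instance 1 uses prototype 1 only
    ]
)

masks = assemble_masks(prototypes, coeffs)   # (4, 4, 2) boolean masks
```

Since the prototypes are computed once per image and each instance adds only `k` numbers, the per-instance cost is far below that of running a mask head on every RoI.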

50 SOLO / SOLOv2 idea? 🔥 hard

Answer: Define instance by grid location and scale—predict category and mask for each grid cell without anchors in the traditional sense.

51 DETR for masks? 🔥 hard

Answer: Set prediction with mask head or panoptic head—queries attend to image features to produce instance masks end-to-end.

52 What is mask AP? 📊 medium

Answer: AP computed on mask IoU instead of box IoU—COCO primary metric for instance segmentation quality.

53 Polygon vs raster? ⚡ easy

Answer: Datasets may store COCO RLE or polygons; training often rasterizes to fixed resolution masks for loss.
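
For intuition about the RLE side, here is a sketch of an uncompressed run-length encoding in the COCO style: the mask is flattened in column-major order and the counts alternate between runs of 0 and runs of 1, starting with 0. The compressed string form that COCO tooling also uses is not covered, and the function names are assumptions.

```python
import numpy as np

def rle_encode(mask):
    """Run lengths of a binary mask in column-major order, zeros first."""
    flat = mask.astype(np.uint8).ravel(order="F")
    counts, prev, run = [], 0, 0
    for value in flat:
        value = int(value)
        if value == prev:
            run += 1
        else:
            counts.append(run)
            prev, run = value, 1
    counts.append(run)
    return counts

def rle_decode(counts, shape):
    """Inverse of rle_encode for a mask of the given (H, W) shape."""
    flat = np.zeros(shape[0] * shape[1], dtype=np.uint8)
    pos, value = 0, 0
    for count in counts:
        flat[pos : pos + count] = value
        pos += count
        value = 1 - value
    return flat.reshape(shape, order="F")

mask = np.array([[0, 1, 1],
                 [0, 1, 1],
                 [0, 0, 0]], dtype=np.uint8)

counts = rle_encode(mask)             # [3, 2, 1, 2, 1]
restored = rle_decode(counts, mask.shape)
```

A mask whose first pixel is foreground starts with a zero-length run, which keeps the "zeros first" convention unambiguous.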

54 COCO stuff vs things? 📊 medium

Answer: Things are countable instances; stuff is amorphous (grass, sky)—panoptic benchmark merges both.

55 Small instances? 📊 medium

Answer: High-res FPN levels, copy-paste augmentation, and specialized heads help—same challenges as object detection.

56 Why slower than detection? ⚡ easy

Answer: Extra per-RoI mask computation and higher memory—one-stage mask methods aim to close the gap.

57 Role of FPN? 📊 medium

Answer: Multi-scale object proposals and features so small and large instances both get good mask features.

58 HTC / Cascade? 🔥 hard

Answer: Iteratively refine boxes and masks with cascaded stages and inter-task fusion; state of the art on COCO leaderboards of their era.

59 Refine boundaries? 🔥 hard

Answer: Methods like PointRend adaptively sample points on uncertain boundaries for fine mask prediction—better edges.

60 Annotation? ⚡ easy

Answer: Instance masks are most expensive—interactive tools, synthetic data, and weak supervision are active research areas.
