Computer Vision Interview 60 Q&A Chapter 7

Deep Segmentation — Interview Q&A

Segmentation overview, semantic pixel-wise labeling, and instance segmentation with masks.


Segmentation Overview: 20 Essential Q&A

1 What is image segmentation? ⚡ easy
Answer: Partitioning an image into regions or labeling each pixel with a category—bridges low-level pixels and high-level objects.
2 What is semantic segmentation? 📊 medium
Answer: Each pixel gets a class label (road, sky, person) without distinguishing different instances of the same class.
3 What is instance segmentation? 📊 medium
Answer: Separates individual objects, even within the same class; each instance gets its own mask and ID.
4 What is panoptic segmentation? 🔥 hard
Answer: Unifies stuff (amorphous regions like sky) and things (countable objects) with non-overlapping masks covering the whole image.
5 Is thresholding segmentation? ⚡ easy
Answer: Yes for binary foreground/background—simplest form; limited for complex scenes without additional cues.
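As a rough illustration of threshold-based segmentation, the sketch below picks the threshold automatically with Otsu's method. Otsu's method is an assumption added here and is not named in the answer above; the function name `otsu_threshold` and the flat list of 8-bit intensities are also illustrative choices.

```python
def otsu_threshold(pixels, levels=256):
    """Return the threshold t that maximizes between-class variance.

    pixels: flat list of integer intensities in [0, levels).
    Pixels <= t are treated as background, pixels > t as foreground.
    """
    hist = [0] * levels
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(i * h for i, h in enumerate(hist))

    best_t, best_var = 0, -1.0
    w_bg, sum_bg = 0, 0
    for t in range(levels):
        w_bg += hist[t]            # number of background pixels so far
        sum_bg += t * hist[t]      # intensity mass of the background
        if w_bg == 0:
            continue
        w_fg = total - w_bg
        if w_fg == 0:
            break
        mean_bg = sum_bg / w_bg
        mean_fg = (sum_all - sum_bg) / w_fg
        var_between = w_bg * w_fg * (mean_bg - mean_fg) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t


pixels = [10, 12, 11, 200, 210, 205]
t = otsu_threshold(pixels)
mask = [p > t for p in pixels]   # binary foreground/background segmentation
```

This works only when intensity alone separates the two classes, which is the limitation the answer points out.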
6 What is region growing? 📊 medium
Answer: Start from seeds, merge neighboring pixels similar under a criterion (intensity, texture)—sensitive to seed placement and noise.
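A minimal sketch of region growing, assuming a grayscale image stored as a list of rows, 4-connectivity, and a similarity criterion of "within `tol` of the seed intensity". The names `region_grow` and `tol` are illustrative, and other criteria (running region mean, texture) are equally valid.

```python
from collections import deque


def region_grow(img, seed, tol):
    """Grow a region from `seed` (row, col) over 4-connected neighbors.

    A neighbor joins the region if its intensity differs from the
    seed intensity by at most `tol`. Returns a boolean mask.
    """
    h, w = len(img), len(img[0])
    seed_val = img[seed[0]][seed[1]]
    mask = [[False] * w for _ in range(h)]
    mask[seed[0]][seed[1]] = True
    queue = deque([seed])
    while queue:
        r, c = queue.popleft()
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if 0 <= nr < h and 0 <= nc < w and not mask[nr][nc]:
                if abs(img[nr][nc] - seed_val) <= tol:
                    mask[nr][nc] = True
                    queue.append((nr, nc))
    return mask


img = [[10, 11, 50],
       [12, 10, 52],
       [51, 50, 53]]
mask = region_grow(img, seed=(0, 0), tol=5)
```

Moving the seed into the bright area selects the other region, which shows the sensitivity to seed placement mentioned above.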
7 Split and merge? 📊 medium
Answer: Recursively split non-uniform regions, then merge adjacent similar regions—quadtree-style classical approach.
8 Segment with k-means in color space? 📊 medium
Answer: Cluster pixel colors (RGB or LAB); each cluster is a segment—produces patchy results without spatial smoothness unless augmented.
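The clustering step can be sketched in plain Python as below. This is an illustrative Lloyd-style k-means over RGB tuples with caller-supplied initial centers (an assumption made to keep the example deterministic); the name `kmeans_colors` is hypothetical and no spatial term is included, which is why results can look patchy.

```python
def kmeans_colors(pixels, centers, iters=10):
    """Cluster color tuples; returns (labels, centers).

    pixels:  list of (r, g, b) tuples
    centers: initial cluster centers, one tuple per cluster
    """
    centers = [tuple(c) for c in centers]
    labels = [0] * len(pixels)
    for _ in range(iters):
        # Assignment step: nearest center by squared Euclidean distance.
        for i, p in enumerate(pixels):
            labels[i] = min(
                range(len(centers)),
                key=lambda k: sum((a - b) ** 2 for a, b in zip(p, centers[k])),
            )
        # Update step: move each center to the mean of its members.
        for k in range(len(centers)):
            members = [p for p, lab in zip(pixels, labels) if lab == k]
            if members:
                centers[k] = tuple(sum(ch) / len(members) for ch in zip(*members))
    return labels, centers


pixels = [(250, 10, 10), (240, 20, 15), (10, 10, 245), (20, 5, 250)]
labels, centers = kmeans_colors(pixels, centers=[(255, 0, 0), (0, 0, 255)])
```

Appending scaled (x, y) coordinates to each feature vector is one simple way to add the spatial smoothness the answer refers to.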
9 What is mean shift segmentation? 🔥 hard
Answer: Mode-seeking in joint color-spatial feature space—clusters pixels to local density peaks; smooths labels but can be slow.
10 What is the watershed transform? 📊 medium
Answer: Treat gradient magnitude as a height map and flood from markers. Without markers, every local minimum seeds its own basin, which causes oversegmentation; marker-controlled watershed is the common fix.
11 What are graph cuts? 🔥 hard
Answer: Pixels are graph nodes; the energy sums unary data costs and pairwise smoothness costs. A min-cut (max-flow) yields the globally optimal binary partition for submodular energies; GrabCut builds on this energy minimization.
12 What is GrabCut? 📊 medium
Answer: Iterative graph-cut segmentation with Gaussian mixture models on RGB—user provides box or strokes; refines foreground/background.
13 What are active contours (snakes)? 📊 medium
Answer: Deform curve minimizing internal smoothness + external edge attraction—classic for medical boundaries; level-set extensions handle topology changes.
14 Oversegmentation? ⚡ easy
Answer: Too many small regions, as with watershed without markers. Fix with region merging or markers; superpixel methods such as SLIC oversegment deliberately, in a controlled way, as a preprocessing step.
15 Evaluate masks with IoU? 📊 medium
Answer: Intersection over union per class or instance; mean IoU (mIoU) standard for semantic segmentation benchmarks.
16 Boundary F-score? 🔥 hard
Answer: Measures alignment of predicted vs GT contours—complements IoU for thin structures.
17 Deep learning for segmentation? 📊 medium
Answer: FCN replaces FC layers with convolutions; U-Net uses an encoder-decoder with skip connections. These remain the dominant paradigm, with transformer-based models increasingly common.
18 Video segmentation? 📊 medium
Answer: Temporal consistency, optical flow warping, or memory networks—object masks tracked across frames (VOS).
19 Interactive segmentation? ⚡ easy
Answer: User clicks or scribbles guide the model (GrabCut, deep interactive methods); each additional interaction refines the mask, which suits editing and annotation.
20 Pick classical vs deep? 📊 medium
Answer: Classical: fast, needs little data, suits controlled scenes. Deep: handles cluttered natural images but needs labels and compute. Hybrids are common in industrial settings, pairing classical steps with deep-learning refinement.

Semantic Segmentation: 20 Essential Q&A

21 What is semantic segmentation? ⚡ easy
Answer: Assigning a class label to every pixel (road, sky, person)—no distinction between different instances of the same class.
22 How does it differ from classification? ⚡ easy
Answer: Classification: one label per image. Semantic segmentation: dense spatial map of labels—requires localization and context.
23 What did FCN change? 📊 medium
Answer: Recast fully connected layers as convolutions (the later ones become 1×1), so arbitrary input sizes work and the output is a spatial score map; learnable upsampling (transposed convolution, often called deconvolution) recovers resolution.
24 Why U-Net skips? 📊 medium
Answer: Encoder downsamples for context; decoder upsamples; skip connections fuse fine detail from shallow layers with semantic deep features—sharp boundaries.
25 Common upsampling methods? 📊 medium
Answer: Transposed convolution, bilinear upsample + conv, sub-pixel shuffle—each trades artifacts, parameters, and speed differently.
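Of the three options, sub-pixel shuffle is the least obvious, so here is a small sketch of the rearrangement it performs. The channel ordering below follows the convention used by PyTorch's `PixelShuffle` to the best of my knowledge; the nested-list layout and the name `pixel_shuffle` are illustrative assumptions.

```python
def pixel_shuffle(x, r):
    """Rearrange [C*r*r][H][W] nested lists into [C][H*r][W*r].

    Each group of r*r input channels fills one r-by-r block of
    output pixels, trading channel depth for spatial resolution.
    """
    c_in, h, w = len(x), len(x[0]), len(x[0][0])
    assert c_in % (r * r) == 0, "channel count must be divisible by r*r"
    c_out = c_in // (r * r)
    out = [[[0] * (w * r) for _ in range(h * r)] for _ in range(c_out)]
    for c in range(c_out):
        for i in range(r):
            for j in range(r):
                src = x[c * r * r + i * r + j]
                for row in range(h):
                    for col in range(w):
                        out[c][row * r + i][col * r + j] = src[row][col]
    return out


# Four 1x1 channels become one 2x2 channel.
low_res = [[[1]], [[2]], [[3]], [[4]]]
high_res = pixel_shuffle(low_res, r=2)
```

The shuffle itself has no parameters; the learnable part is the convolution that produces the C*r*r channels beforehand.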
26 What is mIoU? 📊 medium
Answer: Mean Intersection over Union: compute IoU (overlap of predicted vs ground-truth pixels) separately for each class, then average across classes. It is the standard benchmark metric.
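A minimal sketch of the computation, assuming predictions and ground truth are flattened lists of class indices and that classes absent from both are skipped when averaging (benchmarks differ on that detail). The name `mean_iou` is illustrative.

```python
def mean_iou(pred, gt, num_classes):
    """Per-class IoU averaged over classes present in pred or gt."""
    inter = [0] * num_classes
    union = [0] * num_classes
    for p, g in zip(pred, gt):
        if p == g:
            inter[p] += 1
            union[p] += 1
        else:
            # A mismatch adds one pixel to the union of both classes.
            union[p] += 1
            union[g] += 1
    ious = [inter[c] / union[c] for c in range(num_classes) if union[c] > 0]
    return sum(ious) / len(ious) if ious else 0.0


pred = [0, 0, 1, 1]
gt = [0, 1, 1, 1]
score = mean_iou(pred, gt, num_classes=2)  # class 0: 1/2, class 1: 2/3
```

In practice the counts are usually accumulated over the whole dataset before dividing, rather than averaged per image.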
27 What is Dice coefficient? 📊 medium
Answer: 2|A∩B|/(|A|+|B|); identical to the F1 score for binary masks. Its soft (differentiable) form is a common loss in medical segmentation, where the foreground is often tiny.
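The formula can be sketched as follows, assuming flattened masks. With 0/1 values it gives the Dice coefficient; with probabilities in [0, 1] the same expression gives the soft Dice used in losses. The smoothing term `eps` and the function names are illustrative choices.

```python
def dice(a, b, eps=1e-7):
    """Dice coefficient 2|A∩B| / (|A| + |B|) for flattened masks."""
    inter = sum(x * y for x, y in zip(a, b))
    return (2 * inter + eps) / (sum(a) + sum(b) + eps)


def dice_loss(probs, target, eps=1e-7):
    """Soft Dice loss: 1 - Dice, usable with predicted probabilities."""
    return 1.0 - dice(probs, target, eps)


pred = [1, 1, 0, 0]
gt = [1, 0, 0, 0]
score = dice(pred, gt)           # about 2/3
loss = dice_loss([0.9, 0.2, 0.1, 0.0], gt)
```

Because the score is a ratio of overlaps, large background areas do not dominate it the way they dominate per-pixel accuracy.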
28 Standard loss? ⚡ easy
Answer: Per-pixel cross-entropy (softmax over classes); can weight rare classes or use focal variants for hard pixels.
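A rough sketch of the loss, assuming each pixel carries a list of raw class scores and that the weighted mean is normalized by the total weight (one common convention, not the only one). The name `pixel_cross_entropy` is illustrative.

```python
import math


def pixel_cross_entropy(logits, targets, class_weights=None):
    """Mean softmax cross-entropy over pixels, optionally class-weighted.

    logits:  list of per-pixel score lists, one score per class
    targets: list of ground-truth class indices, one per pixel
    """
    total, norm = 0.0, 0.0
    for scores, t in zip(logits, targets):
        m = max(scores)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(s - m) for s in scores))
        w = 1.0 if class_weights is None else class_weights[t]
        total += w * (log_z - scores[t])   # -log softmax(scores)[t]
        norm += w
    return total / norm


logits = [[2.0, 0.5], [0.1, 1.5], [3.0, -1.0]]
targets = [0, 1, 0]
plain = pixel_cross_entropy(logits, targets)
weighted = pixel_cross_entropy(logits, targets, class_weights=[1.0, 3.0])
```

Raising the weight of a rare class makes its pixels contribute more to the average, which is the reweighting the answer mentions.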
29 Why are boundaries hard? 🔥 hard
Answer: Ambiguous edges, thin structures disappear at low res—fixes: deep supervision, boundary-aware loss, high-res branches, or larger input crops.
30 Handle class imbalance? 📊 medium
Answer: Weighted CE, oversampling rare classes, focal loss, dice loss, or balanced sampling in batches.
31 What is ASPP? 🔥 hard
Answer: Atrous spatial pyramid pooling—parallel dilated convs at multiple rates capture multi-scale context without losing resolution (DeepLab family).
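To make "parallel dilated convs at multiple rates" concrete, here is a simplified 1D sketch. It is an assumption-laden toy: real ASPP works on 2D feature maps, uses a separate learned kernel per branch, and typically adds a 1×1 branch and image-level pooling before fusing. The names `dilated_conv1d` and `aspp_1d` are hypothetical.

```python
def dilated_conv1d(signal, kernel, rate):
    """Same-size 1D convolution with zero padding and dilation `rate`.

    Kernel taps are spaced `rate` samples apart, so the receptive
    field grows without downsampling the signal.
    """
    n, center = len(signal), len(kernel) // 2
    out = []
    for i in range(n):
        acc = 0.0
        for j, weight in enumerate(kernel):
            idx = i + (j - center) * rate
            if 0 <= idx < n:
                acc += weight * signal[idx]
        out.append(acc)
    return out


def aspp_1d(signal, kernel, rates=(1, 2, 4)):
    """Run the same kernel at several dilation rates (one row per rate)."""
    return [dilated_conv1d(signal, kernel, r) for r in rates]


impulse = [0, 0, 0, 1, 0, 0, 0]
branches = aspp_1d(impulse, kernel=[1, 1, 1])
```

Each row has the same length as the input, which is the "without losing resolution" property; the rows would normally be concatenated along the channel axis.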
32 What is PSPNet idea? 📊 medium
Answer: Pyramid pooling at several scales then upsample and concatenate—rich global scene context for each pixel.
33 Multi-scale inference? 📊 medium
Answer: Run network on several scales / flipped inputs and average logits—boosts mIoU at inference cost.
34 Weakly supervised segmentation? 🔥 hard
Answer: Train from image tags, scribbles, or bounding boxes using constraints (e.g. MIL, GrabCut-style seeds), so fewer pixel-level labels are needed.
35 Link to panoptic? 📊 medium
Answer: Panoptic adds instance IDs for “things” while semantic handles “stuff”—semantic is a component of full scene parsing.
36 Use CRF post-processing? 📊 medium
Answer: Historically, CRFs refined CNN outputs with pairwise smoothness terms; they are less common now that architectures are stronger, but still come up in interviews.
37 Can semantic separate two people? ⚡ easy
Answer: No—both get label “person”; need instance segmentation for separate masks.
38 Why is data expensive? ⚡ easy
Answer: Pixel-accurate masks per image vs bounding boxes—tools like semi-auto labeling and synthetic data help.
39 Transformers for segmentation? 🔥 hard
Answer: SegFormer, Mask2Former, Segmenter—global attention and mask queries compete with CNN encoders on benchmarks.
40 Real-time models? 📊 medium
Answer: Lightweight backbones (MobileNet), BiSeNet, Fast-SCNN—trade mIoU for FPS on edge devices.

Instance Segmentation: 20 Essential Q&A

41 What is instance segmentation? ⚡ easy
Answer: Each object instance gets its own binary mask and class label; two pixels both labeled “person” belong to different instances if they lie on different people.
42 Semantic vs instance? 📊 medium
Answer: Semantic: one mask per class. Instance: N masks for N objects, possibly same class—handles overlap with distinct IDs.
43 How does Mask R-CNN extend Faster R-CNN? 📊 medium
Answer: Adds a parallel mask head: a small FCN on each RoI predicts an m×m binary mask for each of the K classes, trained jointly with the box and class heads.
44 Why RoIAlign? 🔥 hard
Answer: RoIPool quantizes coordinates → misalignment for masks. RoIAlign uses bilinear sampling at exact float locations—critical for pixel-accurate masks.
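The difference comes down to sampling the feature map at fractional coordinates instead of rounding them. The sketch below assumes a single-channel feature map as a list of rows, box coordinates already in feature-map units with integer coordinates at pixel centers, and a regular grid of sample points averaged per bin. Production implementations differ in coordinate offsets and options, so treat `bilinear_sample` and `roi_align` as illustrative names only.

```python
import math


def bilinear_sample(feat, y, x):
    """Bilinearly interpolate `feat` at float location (y, x)."""
    h, w = len(feat), len(feat[0])
    y = min(max(y, 0.0), h - 1.0)   # clamp to the valid range
    x = min(max(x, 0.0), w - 1.0)
    y0, x0 = int(math.floor(y)), int(math.floor(x))
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
    dy, dx = y - y0, x - x0
    top = feat[y0][x0] * (1 - dx) + feat[y0][x1] * dx
    bottom = feat[y1][x0] * (1 - dx) + feat[y1][x1] * dx
    return top * (1 - dy) + bottom * dy


def roi_align(feat, box, out_size, samples=2):
    """Pool box = (y_min, x_min, y_max, x_max) into out_size x out_size bins.

    No coordinate is rounded: each bin averages samples*samples
    bilinear samples taken at exact float positions.
    """
    y_min, x_min, y_max, x_max = box
    bin_h = (y_max - y_min) / out_size
    bin_w = (x_max - x_min) / out_size
    out = []
    for i in range(out_size):
        row = []
        for j in range(out_size):
            acc = 0.0
            for sy in range(samples):
                for sx in range(samples):
                    y = y_min + (i + (sy + 0.5) / samples) * bin_h
                    x = x_min + (j + (sx + 0.5) / samples) * bin_w
                    acc += bilinear_sample(feat, y, x)
            row.append(acc / (samples * samples))
        out.append(row)
    return out


feat = [[0.0, 1.0],
        [2.0, 3.0]]
pooled = roi_align(feat, box=(0.2, 0.2, 0.8, 0.8), out_size=1)
```

RoIPool would snap a box like (0.2, 0.2, 0.8, 0.8) to whole cells, shifting the pooled features by a fraction of a cell, which is a large error once masks are projected back to image pixels.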
45 Mask branch output? 📊 medium
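Answer: Typically 28×28 per-class logits from a lightweight per-region FCN; at inference the predicted class's mask is passed through a sigmoid, resized to the RoI box, and thresholded (commonly at 0.5).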
46 Loss on masks? 📊 medium
Answer: Per-pixel sigmoid + BCE on the target class mask only (not softmax over all classes per pixel in the classic formulation).
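A minimal sketch of that loss for one RoI, assuming the mask head output is a list of per-class flattened logit maps and the target is a flattened 0/1 mask. Only the ground-truth class channel enters the loss, so classes do not compete per pixel. The name `mask_bce_loss` is illustrative.

```python
import math


def mask_bce_loss(mask_logits, gt_mask, gt_class):
    """Mean binary cross-entropy on the ground-truth class channel only.

    mask_logits: list of K flattened logit maps, one per class
    gt_mask:     flattened 0/1 target mask for this RoI
    gt_class:    index of the RoI's ground-truth class
    """
    logits = mask_logits[gt_class]   # other channels are ignored
    total = 0.0
    for z, y in zip(logits, gt_mask):
        # Stable form of -[y*log(sigmoid(z)) + (1-y)*log(1-sigmoid(z))]
        total += max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))
    return total / len(logits)


mask_logits = [[5.0, -5.0, 5.0, -5.0],    # class 0 channel
               [0.0, 0.0, 0.0, 0.0]]      # class 1 channel
gt_mask = [1, 0, 1, 0]
loss = mask_bce_loss(mask_logits, gt_mask, gt_class=0)
```

Changing the class 1 channel leaves this loss untouched, which is the decoupling of mask and class prediction the answer describes.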
47 Can two instance masks overlap in GT? ⚡ easy
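Answer: Yes, in datasets that allow it: where one object occludes another, annotated masks can overlap, so the model predicts independent masks per instance (or an ordering). Panoptic ground truth, by contrast, forbids overlap.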
48 Panoptic segmentation? 📊 medium
Answer: Unifies semantic “stuff” and instance “things” with non-overlapping full-scene labeling—each pixel has one label + optional instance id.
49 What is YOLACT? 📊 medium
Answer: One-stage: combines prototype masks with per-instance coefficients for fast instance segmentation—speed-quality tradeoff.
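The combination step can be sketched as a per-pixel linear combination followed by a sigmoid. This assumes flattened prototype masks and omits the box cropping that follows assembly; `assemble_mask` and the example values are hypothetical.

```python
import math


def assemble_mask(prototypes, coeffs, threshold=0.5):
    """Combine prototype masks with one instance's coefficients.

    prototypes: list of k flattened prototype maps (shared by all instances)
    coeffs:     list of k coefficients predicted for this instance
    Returns a flattened boolean mask.
    """
    n = len(prototypes[0])
    mask = []
    for i in range(n):
        z = sum(c * p[i] for c, p in zip(coeffs, prototypes))
        prob = 1.0 / (1.0 + math.exp(-z))   # sigmoid
        mask.append(prob > threshold)
    return mask


prototypes = [[1, 1, 0, 0],
              [0, 1, 1, 0]]
# Positive weight on prototype 0, negative on prototype 1.
mask = assemble_mask(prototypes, coeffs=[4.0, -4.0])
```

Since the prototypes are computed once per image, each extra instance costs only a small weighted sum, which is where the speed advantage comes from.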
50 SOLO / SOLOv2 idea? 🔥 hard
Answer: Assign each instance to a grid cell by its center location, with object size handled across FPN levels; predict a category and a mask for each grid cell, with no anchor boxes or box proposals.
51 DETR for masks? 🔥 hard
Answer: Set prediction with mask head or panoptic head—queries attend to image features to produce instance masks end-to-end.
52 What is mask AP? 📊 medium
Answer: Average precision computed with mask IoU instead of box IoU; COCO's primary instance segmentation metric averages it over IoU thresholds from 0.50 to 0.95.
53 Polygon vs raster? ⚡ easy
Answer: Datasets may store COCO RLE or polygons; training often rasterizes to fixed resolution masks for loss.
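Run-length encoding can be sketched as below, assuming a flattened 0/1 mask and the convention that counts alternate between runs of zeros and ones, starting with zeros. COCO's uncompressed RLE follows this idea but flattens the mask in column-major order, so this is a simplified illustration rather than a drop-in replacement; the function names are hypothetical.

```python
def rle_encode(flat):
    """Encode a flat 0/1 list as run lengths, starting with a run of zeros."""
    counts, prev, run = [], 0, 0
    for v in flat:
        if v == prev:
            run += 1
        else:
            counts.append(run)   # close the current run
            prev, run = v, 1
    counts.append(run)
    return counts


def rle_decode(counts):
    """Invert rle_encode back to a flat 0/1 list."""
    flat, val = [], 0
    for n in counts:
        flat.extend([val] * n)
        val = 1 - val
    return flat


mask = [0, 0, 1, 1, 1, 0, 1]
counts = rle_encode(mask)        # [2, 3, 1, 1]
restored = rle_decode(counts)
```

A mask that starts with foreground gets a leading zero count, which keeps the alternation unambiguous.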
54 COCO stuff vs things? 📊 medium
Answer: Things are countable instances; stuff is amorphous (grass, sky)—panoptic benchmark merges both.
55 Small instances? 📊 medium
Answer: High-res FPN levels, copy-paste augmentation, and specialized heads help—same challenges as object detection.
56 Why slower than detection? ⚡ easy
Answer: Extra per-RoI mask computation and higher memory—one-stage mask methods aim to close the gap.
57 Role of FPN? 📊 medium
Answer: Multi-scale object proposals and features so small and large instances both get good mask features.
58 HTC / Cascade? 🔥 hard
Answer: Iteratively refine boxes and masks through cascaded stages with information flow between the box and mask branches; Hybrid Task Cascade (HTC) was among the top entries on the COCO leaderboards of its time.
59 Refine boundaries? 🔥 hard
Answer: Methods like PointRend adaptively sample points on uncertain boundaries for fine mask prediction—better edges.
60 Annotation? ⚡ easy
Answer: Instance masks are most expensive—interactive tools, synthetic data, and weak supervision are active research areas.