Computer Vision Interview 40 Q&A Chapter 17

Face & Pose Estimation — Interview Q&A

Face recognition pipelines and human pose keypoint estimation.

Face Recognition: 20 Essential Q&A

1 Typical pipeline? ⚡ easy
Answer: Detect face → align to canonical pose → CNN embedding → compare embeddings by cosine similarity or L2 distance.
2 Face detection? 📊 medium
Answer: Find boxes/scales (MTCNN, RetinaFace, YuNet)—must handle profile, small faces, and clutter before recognition.
3 Alignment? 📊 medium
Answer: Use 5 or more landmarks to fit a similarity transform (rotation, uniform scale, translation) that maps the face onto a fixed template; this removes in-plane rotation, scale, and position variance before embedding.
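The landmark-to-template fit can be sketched with plain complex arithmetic. This is a minimal illustration, not any library's alignment routine: the template coordinates and the helper names `fit_similarity` and `apply_similarity` are hypothetical, and a real pipeline would warp the image with the fitted transform rather than only the points.

```python
def fit_similarity(src, dst):
    """Least-squares similarity transform mapping 2D points src onto dst.

    Points are treated as complex numbers, so the model is q ≈ a * p + b,
    where a encodes rotation and uniform scale and b encodes translation.
    """
    p = [complex(x, y) for x, y in src]
    q = [complex(x, y) for x, y in dst]
    p_mean = sum(p) / len(p)
    q_mean = sum(q) / len(q)
    num = sum((qi - q_mean) * (pi - p_mean).conjugate() for pi, qi in zip(p, q))
    den = sum(abs(pi - p_mean) ** 2 for pi in p)
    a = num / den
    b = q_mean - a * p_mean
    return a, b


def apply_similarity(a, b, points):
    """Apply q = a * p + b to a list of (x, y) points."""
    out = []
    for x, y in points:
        z = a * complex(x, y) + b
        out.append((z.real, z.imag))
    return out


# Hypothetical 5-point template (eyes, nose, mouth corners) for a 112x112 crop.
TEMPLATE = [(38.0, 52.0), (74.0, 52.0), (56.0, 72.0), (42.0, 92.0), (70.0, 92.0)]

# Simulated detector output: the template rotated, scaled, and shifted.
detected = apply_similarity(complex(1.5, 0.8), complex(10.0, -5.0), TEMPLATE)
a, b = fit_similarity(detected, TEMPLATE)
aligned = apply_similarity(a, b, detected)  # lands back on TEMPLATE
```

The fit has only four degrees of freedom, so it cannot correct out-of-plane pose or lighting; those are left to the embedding network.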
4 What is an embedding? 📊 medium
Answer: L2-normalized vector (e.g. 512-D) such that same identity is close, different identities far—learned with metric objectives.
import torch.nn.functional as F
sim = F.cosine_similarity(emb_a, emb_b, dim=-1)  # face verification: same identity if sim >= tuned threshold
5 Verification vs identification? ⚡ easy
Answer: Verification: same person or not (1:1), decided by a similarity threshold. Identification: match probe to gallery (1:N), evaluated with rank metrics (rank-1, CMC) and, in open-set use, a rejection threshold.
6 Open-set identification? 📊 medium
Answer: Probe may be unknown—need rejection option based on similarity threshold to avoid false accepts.
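A minimal sketch of 1:N identification with a rejection option, assuming embeddings are plain lists of floats and the gallery is a dict from identity to embedding. The function names and the threshold value 0.5 are illustrative assumptions; a real threshold comes from validation data.

```python
import math


def cosine(a, b):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def identify(probe, gallery, threshold=0.5):
    """Return (identity, score); identity is None when the probe is rejected."""
    best_id, best_score = None, -1.0
    for identity, emb in gallery.items():
        score = cosine(probe, emb)
        if score > best_score:
            best_id, best_score = identity, score
    if best_score < threshold:
        return None, best_score  # open-set rejection: nobody is similar enough
    return best_id, best_score


gallery = {"alice": [1.0, 0.0, 0.0], "bob": [0.0, 1.0, 0.0]}
known = identify([0.9, 0.1, 0.0], gallery)    # matches "alice"
unknown = identify([0.0, 0.0, 1.0], gallery)  # rejected, identity is None
```

Without the threshold check this is closed-set identification, which always returns some gallery identity and so cannot avoid false accepts on unknown probes.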
7 Triplet loss? 🔥 hard
Answer: Pull the anchor closer to the positive than to the negative by at least a margin; triplet selection is critical for convergence, and FaceNet mines semi-hard negatives rather than only the very hardest ones.
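The objective for one triplet can be written out as a sketch, assuming squared L2 distance on embeddings and a margin of 0.2 as used in the FaceNet paper. The helper names are illustrative, and the semi-hard test shown is the per-triplet condition only, not a full batch-mining strategy.

```python
def squared_l2(a, b):
    """Squared Euclidean distance between two embeddings."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge on d(a, p) - d(a, n) + margin; zero once the margin is satisfied."""
    d_ap = squared_l2(anchor, positive)
    d_an = squared_l2(anchor, negative)
    return max(0.0, d_ap - d_an + margin)


def is_semi_hard(anchor, positive, negative, margin=0.2):
    """Negative is farther than the positive but still inside the margin."""
    d_ap = squared_l2(anchor, positive)
    d_an = squared_l2(anchor, negative)
    return d_ap < d_an < d_ap + margin


anchor, positive = [1.0, 0.0], [0.9, 0.1]
easy_negative = [0.0, 1.0]       # far away: contributes zero loss
semi_hard_negative = [0.8, 0.2]  # inside the margin: contributes a gradient
```

Easy negatives give zero loss and therefore no gradient, which is why the choice of triplets matters so much for training speed.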
8 ArcFace? 🔥 hard
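Answer: Adds an additive angular margin m to the angle θ between the L2-normalized embedding and its target class weight before softmax, so the target logit becomes s·cos(θ + m). This tightens intra-class clusters and widens inter-class angular separation on the hypersphere; it is a widely used margin-based softmax loss.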
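A sketch of how the margin changes the logits for one sample, assuming the cosines between the normalized embedding and each normalized class weight are already computed. The defaults s = 64 and m = 0.5 follow the ArcFace paper; the function names are illustrative, and the special handling that real implementations apply when θ + m exceeds π is omitted.

```python
import math


def arcface_logits(cosines, target, s=64.0, m=0.5):
    """Scale all cosines by s; add angular margin m to the target class only."""
    logits = []
    for j, c in enumerate(cosines):
        if j == target:
            theta = math.acos(max(-1.0, min(1.0, c)))  # clamp for numerical safety
            c = math.cos(theta + m)
        logits.append(s * c)
    return logits


def cross_entropy(logits, target):
    """Softmax cross-entropy for one sample, computed stably."""
    mx = max(logits)
    log_sum = mx + math.log(sum(math.exp(z - mx) for z in logits))
    return log_sum - logits[target]


cosines = [0.8, 0.1, -0.2]  # toy values; class 0 is the true identity
plain = arcface_logits(cosines, target=0, m=0.0)
with_margin = arcface_logits(cosines, target=0)
```

The margin lowers only the target logit, so the same embedding incurs a larger loss and must move closer to its class center to compensate.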
9 FaceNet? 📊 medium
Answer: End-to-end CNN with triplet loss producing compact embeddings—popularized deep face recognition at scale.
10 Benchmarks? ⚡ easy
Answer: LFW, CFP-FP, IJB-C, MegaFace; they vary in pose, protocol (1:1 verification vs 1:N identification), and difficulty. Report TAR@FAR for verification.
11 Threshold tuning? 📊 medium
Answer: Set operating point on validation to balance FAR vs FRR for the deployment constraint (access control vs convenience).
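One way to pick the operating point is sketched below, assuming validation scores are already split into genuine (same identity) and impostor (different identity) pairs and that higher scores mean more similar. The function names and toy scores are illustrative.

```python
def far_frr(genuine, impostor, threshold):
    """False accept rate and false reject rate at a given threshold."""
    far = sum(s >= threshold for s in impostor) / len(impostor)
    frr = sum(s < threshold for s in genuine) / len(genuine)
    return far, frr


def pick_threshold(genuine, impostor, max_far):
    """Lowest observed score whose FAR stays within max_far.

    FAR never increases as the threshold rises and FRR never decreases,
    so the first qualifying threshold also gives the lowest FRR.
    Returns None if no observed score meets the constraint.
    """
    for t in sorted(set(genuine) | set(impostor)):
        far, frr = far_frr(genuine, impostor, t)
        if far <= max_far:
            return t, far, frr
    return None


genuine = [0.9, 0.8, 0.7, 0.4]
impostor = [0.1, 0.2, 0.3, 0.6]
strict = pick_threshold(genuine, impostor, max_far=0.0)     # access control
relaxed = pick_threshold(genuine, impostor, max_far=0.25)   # convenience
```

TAR at the chosen point is 1 - FRR, which is the number usually reported as TAR@FAR.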
12 Anti-spoofing? 📊 medium
Answer: Detect print/screen/replay attacks with texture, depth, or rPPG—required for liveness in banking kiosks.
13 Masks / COVID era? 📊 medium
Answer: Periocular focus, synthetic mask augmentation, or dedicated training—lower accuracy if model not adapted.
14 Demographic bias? 🔥 hard
Answer: Unequal error rates across groups—audit datasets, balanced training, and fairness constraints in deployment.
15 Privacy? ⚡ easy
Answer: Biometric data is sensitive—encrypt templates, consent, retention limits, on-device processing where possible.
16 3D morphable models? 📊 medium
Answer: Fit 3DMM for pose-invariant recognition or generate synthetic views—helps extreme pose.
17 On-device? ⚡ easy
Answer: Quantized MobileFaceNet-style backbones, NNAPI/CoreML—latency and power constrained.
18 Quality assessment? 📊 medium
Answer: Blur, exposure, resolution gates before embedding—reject low-quality captures to reduce false matches.
19 Synthetic faces? 📊 medium
Answer: GAN-generated diversity for training—watch for domain gap and identity leakage in synthetic sets.
20 Presentation attacks? 📊 medium
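Answer: ISO/IEC 30107 defines presentation attack detection (PAD) terminology, attack instrument categories, and error metrics such as APCER and BPCER; multimodal liveness (depth, IR) mitigates many attacks.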

Pose Estimation: 20 Essential Q&A

21 What is pose estimation? ⚡ easy
Answer: Predict joint locations (shoulders, elbows, etc.) for people in an image/video—2D pixel coords or 3D body config.
22 Keypoint formats? 📊 medium
Answer: xy coordinates, confidence, sometimes visibility flags—datasets define fixed skeleton topology (COCO 17 joints).
23 Heatmap regression? 📊 medium
Answer: Per-joint Gaussian maps; argmax or soft-argmax for coordinate—preserves spatial uncertainty vs direct regression.
# heatmap argmax → (x,y) joint; soft-argmax differentiable
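The decoding step can be sketched in pure Python, assuming a heatmap stored as a list of rows. The names `gaussian_heatmap`, `argmax_2d`, and `soft_argmax_2d` and the temperature `beta` are illustrative; a training pipeline would do the same with tensors so that soft-argmax stays differentiable.

```python
import math


def gaussian_heatmap(width, height, cx, cy, sigma=1.0):
    """Training target: a 2D Gaussian centred on the joint location."""
    return [
        [math.exp(-((x - cx) ** 2 + (y - cy) ** 2) / (2 * sigma ** 2)) for x in range(width)]
        for y in range(height)
    ]


def argmax_2d(heatmap):
    """Integer (x, y) of the highest response; not differentiable."""
    best_value, best_xy = float("-inf"), (0, 0)
    for y, row in enumerate(heatmap):
        for x, value in enumerate(row):
            if value > best_value:
                best_value, best_xy = value, (x, y)
    return best_xy


def soft_argmax_2d(heatmap, beta=10.0):
    """Softmax-weighted mean of pixel coordinates; gives sub-pixel (x, y)."""
    mx = max(max(row) for row in heatmap)
    weights = [[math.exp(beta * (v - mx)) for v in row] for row in heatmap]
    total = sum(sum(row) for row in weights)
    x = sum(w * xi for row in weights for xi, w in enumerate(row)) / total
    y = sum(w * yi for yi, row in enumerate(weights) for w in row) / total
    return x, y


heatmap = gaussian_heatmap(width=7, height=5, cx=3, cy=2)
hard_xy = argmax_2d(heatmap)
soft_xy = soft_argmax_2d(heatmap)
```

A larger `beta` makes soft-argmax behave more like argmax; a smaller one averages over more of the map and is pulled toward the map centre.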
24 COCO pose? ⚡ easy
Answer: 17 body keypoints per person—standard for detection+pose benchmarks and pretrained models.
25 Top-down approach? 📊 medium
Answer: Person detector first, then single-person pose inside each ROI—accurate when detector is good, slower with many people.
26 Bottom-up? 📊 medium
Answer: Predict all joints then group into people (OpenPose PAFs, Associative Embedding)—better scaling in crowds.
27 OpenPose PAFs? 🔥 hard
Answer: Part affinity fields encode limb orientation to connect candidate joints—enables real-time multi-person 2D pose.
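A simplified sketch of how a part affinity field scores one candidate limb: sample points along the segment between two joint candidates and average the dot product of the field with the limb direction. The two-grid layout (`paf_x`, `paf_y`), nearest-pixel sampling, and the function name are assumptions for illustration, not OpenPose's actual code.

```python
import math


def paf_score(paf_x, paf_y, joint_a, joint_b, num_samples=10):
    """Average alignment between the PAF and the a→b limb direction.

    paf_x and paf_y are 2D grids (lists of rows) holding the x and y
    components of the field; num_samples must be at least 2.
    """
    (ax, ay), (bx, by) = joint_a, joint_b
    dx, dy = bx - ax, by - ay
    length = math.hypot(dx, dy)
    if length == 0:
        return 0.0
    ux, uy = dx / length, dy / length  # unit vector along the candidate limb
    total = 0.0
    for i in range(num_samples):
        t = i / (num_samples - 1)
        x = int(round(ax + t * dx))  # nearest-pixel sampling
        y = int(round(ay + t * dy))
        total += paf_x[y][x] * ux + paf_y[y][x] * uy
    return total / num_samples


# Toy field pointing in +x everywhere on a 5x5 grid.
paf_x = [[1.0] * 5 for _ in range(5)]
paf_y = [[0.0] * 5 for _ in range(5)]
along = paf_score(paf_x, paf_y, (0, 2), (4, 2))    # aligned with the field
across = paf_score(paf_x, paf_y, (2, 0), (2, 4))   # perpendicular to it
```

High scores mark joint pairs that plausibly belong to the same limb; the grouping step then picks a consistent matching between candidates using these scores.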
28 HRNet? 🔥 hard
Answer: Maintains high-resolution streams parallel to low-res with repeated fusions—sharp heatmaps, strong 2D accuracy.
29 Loss functions? 📊 medium
Answer: MSE on heatmaps; or L1 on coords; auxiliary intermediate supervision in hourglass nets aids deep training.
30 Occlusion? 📊 medium
Answer: Visibility flags mark occluded joints, torso context constrains plausible positions, and temporal smoothing helps in video; heavy overlap between people is still hard.
31 Multi-person overlap? 📊 medium
Answer: NMS on detections; association graph solvers; transformer decoders predicting sets of poses (PETR-style ideas).
32 3D pose? 🔥 hard
Answer: Direct regression of camera-space joints or volumetric representations—needs depth, multi-view, or weak 3D supervision.
33 Lifting 2D→3D? 📊 medium
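Answer: Use skeleton constraints plus a camera model, or a learned prior over monocular sequences. VideoPose3D lifts 2D keypoint sequences with temporal convolutions; VIBE is related but regresses SMPL body parameters from video features rather than lifting 2D joints.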
34 MediaPipe / BlazePose? 📊 medium
Answer: MediaPipe builds on-device pipelines as lightweight graphs; its BlazePose model uses a 33-landmark topology and runs in real time on phone GPUs for mobile AR.
35 Real-time? ⚡ easy
Answer: Light backbones, lower input res, single-person mode—30+ FPS on GPU for fitness apps.
36 Graph models? 🔥 hard
Answer: GCN over joints exploits kinematic structure—complements conv heatmap methods especially for 3D.
37 OKS mAP? 📊 medium
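Answer: Object keypoint similarity normalizes each keypoint's distance error by the object scale (area) and a per-keypoint constant, then averages over labeled joints; COCO pose AP averages over OKS thresholds from 0.50 to 0.95 in steps of 0.05.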
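OKS for a single person can be sketched as below, following the COCO formulation exp(-d² / (2·s²·k²)) with s² taken as the object area and k = 2σ per keypoint. The three-joint skeleton and the sigma values are illustrative stand-ins in the style of COCO's per-keypoint constants.

```python
import math


def oks(pred, gt, visibility, area, sigmas):
    """Object keypoint similarity for one person.

    pred, gt: lists of (x, y); visibility: COCO-style flags (0 = unlabeled);
    area: ground-truth object area in pixels; sigmas: per-keypoint constants.
    """
    total, count = 0.0, 0
    for (px, py), (gx, gy), v, sigma in zip(pred, gt, visibility, sigmas):
        if v <= 0:
            continue  # unlabeled joints do not count for or against
        d2 = (px - gx) ** 2 + (py - gy) ** 2
        k = 2.0 * sigma
        total += math.exp(-d2 / (2.0 * area * k * k))
        count += 1
    return total / count if count else 0.0


sigmas = [0.026, 0.079, 0.107]                 # illustrative: nose, shoulder, hip
gt = [(50.0, 20.0), (40.0, 40.0), (45.0, 80.0)]
visibility = [2, 2, 0]                         # hip is unlabeled
pred = [(52.0, 20.0), (40.0, 40.0), (0.0, 0.0)]
score = oks(pred, gt, visibility, area=5000.0, sigmas=sigmas)
```

The same pixel error costs less on a larger person or on a joint with a larger sigma, which is what makes OKS comparable across scales in the way IoU is for boxes.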
38 Augmentation? ⚡ easy
Answer: Random rotation/scale, flip with joint swap, cutout—preserve skeleton validity after transform.
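Flip with joint swap is a common source of silent bugs, so a small sketch helps. It assumes the standard COCO 17-keypoint order (nose, then left/right eye, ear, shoulder, elbow, wrist, hip, knee, ankle) and keypoints stored as (x, y, visibility) tuples; the function name is illustrative.

```python
# Left/right index pairs in the COCO 17-keypoint order.
COCO_FLIP_PAIRS = [(1, 2), (3, 4), (5, 6), (7, 8), (9, 10), (11, 12), (13, 14), (15, 16)]


def hflip_keypoints(keypoints, image_width, flip_pairs=COCO_FLIP_PAIRS):
    """Mirror x coordinates, then swap left/right joints to keep labels valid."""
    flipped = [(image_width - 1 - x, y, v) for x, y, v in keypoints]
    for left, right in flip_pairs:
        flipped[left], flipped[right] = flipped[right], flipped[left]
    return flipped


# Toy skeleton: joint i sits at x = i so the swap is easy to see.
keypoints = [(float(i), 0.0, 2) for i in range(17)]
flipped = hflip_keypoints(keypoints, image_width=100)
```

Mirroring the pixels without the swap would label the subject's right wrist as "left wrist", which teaches the model a contradictory left/right convention.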
39 Mobile deployment? 📊 medium
Answer: INT8 quant, smaller input, ROI cropping—trade accuracy for thermal/power on edge.
40 Limitations? ⚡ easy
Answer: Rare poses are underrepresented, clothing hides joints, and monocular 3D has inherent depth ambiguity; combine sensors or multi-view when possible.