Computer Vision Interview
40 Q&A
Chapter 17
Face & Pose Estimation — Interview Q&A
Face recognition pipelines and human pose keypoint estimation.
Face Recognition: 20 Essential Q&A
1
Typical pipeline?
⚡ easy
Answer: Detect face → align to canonical pose → CNN embedding → compare cosine/L2 distance.
2
Face detection?
📊 medium
Answer: Find boxes/scales (MTCNN, RetinaFace, YuNet)—must handle profile, small faces, and clutter before recognition.
3
Alignment?
📊 medium
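Answer: Use 5 or more landmarks to fit a similarity transform (rotation, uniform scale, translation) that maps the face onto a fixed template. This removes in-plane rotation, scale, and position variance before embedding; it does not normalize lighting.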
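The landmark-to-template fit can be sketched as a least-squares similarity transform. This is a minimal illustration in plain Python, not any particular library's alignment routine; the `TEMPLATE` coordinates and the function names are hypothetical and chosen only for the example.

```python
# Least-squares 2D similarity transform (rotation, uniform scale, translation).
# Points are treated as complex numbers x + iy, so the transform is w = a*z + b.

def fit_similarity(src, dst):
    """Return complex (a, b) minimizing sum |a*z + b - w|^2 over point pairs."""
    zs = [complex(x, y) for x, y in src]
    ws = [complex(x, y) for x, y in dst]
    z_mean = sum(zs) / len(zs)
    w_mean = sum(ws) / len(ws)
    num = sum((w - w_mean) * (z - z_mean).conjugate() for z, w in zip(zs, ws))
    den = sum(abs(z - z_mean) ** 2 for z in zs)
    a = num / den            # encodes rotation and scale
    b = w_mean - a * z_mean  # translation
    return a, b

def apply_similarity(a, b, points):
    """Apply w = a*z + b to a list of (x, y) points."""
    out = []
    for x, y in points:
        w = a * complex(x, y) + b
        out.append((w.real, w.imag))
    return out

# Hypothetical 5-point template (eyes, nose, mouth corners) for a 112x112 crop.
TEMPLATE = [(38.0, 52.0), (74.0, 52.0), (56.0, 72.0), (42.0, 92.0), (70.0, 92.0)]
```

In a real pipeline the fitted transform would be used to warp the image itself (for example with an affine warp) so that the crop fed to the embedding network always has the landmarks near the template positions.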
4
What is an embedding?
📊 medium
Answer: L2-normalized vector (e.g. 512-D) such that same identity is close, different identities far—learned with metric objectives.
import torch.nn.functional as F
sim = F.cosine_similarity(emb_a, emb_b)  # face verification: emb_* are (N, D) tensors; higher means more similar
5
Verification vs identification?
⚡ easy
Answer: Verification: same person or not (1:1). Identification: match probe to gallery (1:N)—needs threshold and rank metrics.
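The difference between the two tasks is easy to see in code. The sketch below uses plain Python lists as stand-ins for embeddings; `verify`, `identify`, and the gallery layout are illustrative names, not part of any specific library.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def verify(emb_a, emb_b, threshold):
    """1:1 decision: accept as the same identity if similarity reaches the threshold."""
    return cosine(emb_a, emb_b) >= threshold

def identify(probe, gallery, top_k=1):
    """1:N search: rank gallery identities by similarity to the probe.

    gallery is a dict mapping identity name -> enrolled embedding.
    """
    ranked = sorted(((cosine(probe, emb), name) for name, emb in gallery.items()),
                    reverse=True)
    return [(name, score) for score, name in ranked[:top_k]]
```

Verification is scored with threshold-based rates, while identification is usually scored by whether the correct identity appears in the top-k results.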
6
Open-set identification?
📊 medium
Answer: Probe may be unknown—need rejection option based on similarity threshold to avoid false accepts.
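A rejection option can be as simple as refusing to return the best match when its score is too low. A minimal sketch, assuming the similarities between the probe and each gallery identity have already been computed; the function name and the threshold value in the usage note are hypothetical.

```python
def identify_open_set(scores, threshold):
    """Return the best-matching identity, or None if the probe looks unknown.

    scores maps gallery identity -> similarity between the probe and that identity.
    """
    if not scores:
        return None
    best = max(scores, key=scores.get)
    return best if scores[best] >= threshold else None
```

For example, `identify_open_set({"alice": 0.31, "bob": 0.28}, threshold=0.5)` returns `None`, because even the best gallery match is too weak to accept.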
7
Triplet loss?
🔥 hard
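Answer: Pull the anchor closer to the positive than to the negative by at least a margin. Triplet selection matters: most random triplets already satisfy the margin and give zero gradient, so FaceNet mines semi-hard negatives within each batch.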
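A small sketch of the loss and of semi-hard selection, using squared L2 distances as in FaceNet. The default margin of 0.2 and the helper names are assumptions for illustration.

```python
def sq_dist(u, v):
    """Squared Euclidean distance between two vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge on d(a, p) - d(a, n) + margin; zero once the negative is far enough."""
    return max(0.0, sq_dist(anchor, positive) - sq_dist(anchor, negative) + margin)

def semi_hard_negatives(anchor, positive, candidates, margin=0.2):
    """Negatives farther than the positive but still inside the margin."""
    d_ap = sq_dist(anchor, positive)
    return [n for n in candidates if d_ap < sq_dist(anchor, n) < d_ap + margin]
```

Negatives that are already far away contribute zero loss, which is why mining is needed to keep the training signal alive.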
8
ArcFace?
🔥 hard
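Answer: Additive angular margin softmax. Features and class weights are L2-normalized, a margin m is added to the angle between an embedding and its own class center, and the logits are scaled by s. This tightens each class and widens inter-class angular separation on the hypersphere; it is a widely used margin-based loss for face embeddings.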
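The margin only changes the target-class logit. The sketch below shows that step in isolation, assuming the cosines between the normalized embedding and each normalized class weight are already available; `s=64` and `m=0.5` are commonly quoted defaults and are treated here as assumptions. Production implementations also guard the case where theta + m exceeds pi, which this sketch omits.

```python
import math

def arcface_logits(cosines, target, s=64.0, m=0.5):
    """Turn cos(theta_j) values into ArcFace logits for one sample.

    cosines[j] is the cosine between the embedding and the class-j weight.
    Only the target class gets the additive angular margin m.
    """
    logits = []
    for j, c in enumerate(cosines):
        if j == target:
            theta = math.acos(max(-1.0, min(1.0, c)))  # clamp for numerical safety
            c = math.cos(theta + m)
        logits.append(s * c)
    return logits
```

The resulting logits go into an ordinary softmax cross-entropy; the margin makes the target class harder to satisfy, so the network has to pull embeddings closer to their class center.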
9
FaceNet?
📊 medium
Answer: End-to-end CNN with triplet loss producing compact embeddings—popularized deep face recognition at scale.
10
Benchmarks?
⚡ easy
Answer: LFW, CFP-FP, IJB-C, MegaFace. They differ in pose variation, protocol (1:1 verification vs 1:N identification), and difficulty; report TAR@FAR for verification.
11
Threshold tuning?
📊 medium
Answer: Set operating point on validation to balance FAR vs FRR for the deployment constraint (access control vs convenience).
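Choosing the operating point can be sketched as: fix a target FAR on impostor pairs from a validation set, derive the threshold, then read off TAR (FRR is 1 - TAR). The function names and the strict `score > threshold` accept rule are assumptions made for this example.

```python
def threshold_at_far(impostor_scores, target_far):
    """Smallest threshold whose false accept rate on impostor pairs is <= target_far."""
    ranked = sorted(impostor_scores, reverse=True)
    allowed = int(target_far * len(ranked))  # number of impostor pairs we may accept
    if allowed >= len(ranked):
        return float("-inf")
    # The accept rule is score > threshold, so sit on the next impostor score.
    return ranked[allowed]

def rates(genuine_scores, impostor_scores, threshold):
    """Return (TAR, FAR) at the given threshold."""
    tar = sum(s > threshold for s in genuine_scores) / len(genuine_scores)
    far = sum(s > threshold for s in impostor_scores) / len(impostor_scores)
    return tar, far
```

An access-control deployment would pick a very low target FAR and accept the higher FRR; a convenience feature might do the opposite.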
12
Anti-spoofing?
📊 medium
Answer: Detect print/screen/replay attacks with texture, depth, or rPPG—required for liveness in banking kiosks.
13
Masks / COVID era?
📊 medium
Answer: Periocular focus, synthetic mask augmentation, or dedicated training—lower accuracy if model not adapted.
14
Demographic bias?
🔥 hard
Answer: Unequal error rates across groups—audit datasets, balanced training, and fairness constraints in deployment.
15
Privacy?
⚡ easy
Answer: Biometric data is sensitive—encrypt templates, consent, retention limits, on-device processing where possible.
16
3D morphable models?
📊 medium
Answer: Fit 3DMM for pose-invariant recognition or generate synthetic views—helps extreme pose.
17
On-device?
⚡ easy
Answer: Quantized MobileFaceNet-style backbones, NNAPI/CoreML—latency and power constrained.
18
Quality assessment?
📊 medium
Answer: Blur, exposure, resolution gates before embedding—reject low-quality captures to reduce false matches.
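One common blur gate is the variance of a Laplacian response: sharp images have strong edges and therefore a high variance. The sketch below works on a grayscale image given as a list of rows and is only an illustration; the function name is hypothetical, and any rejection threshold would have to be tuned on real captures.

```python
def laplacian_variance(gray):
    """Variance of the 4-neighbour Laplacian over interior pixels of a 2D image."""
    h, w = len(gray), len(gray[0])
    vals = [
        gray[y - 1][x] + gray[y + 1][x] + gray[y][x - 1] + gray[y][x + 1] - 4 * gray[y][x]
        for y in range(1, h - 1)
        for x in range(1, w - 1)
    ]
    mean = sum(vals) / len(vals)
    return sum((v - mean) ** 2 for v in vals) / len(vals)
```

A capture whose score falls below the tuned threshold would be rejected or re-captured before an embedding is computed.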
19
Synthetic faces?
📊 medium
Answer: GAN-generated diversity for training—watch for domain gap and identity leakage in synthetic sets.
20
Presentation attacks?
📊 medium
Answer: The ISO/IEC 30107 series defines presentation attack detection terminology and testing, including categories of attack instruments; multimodal liveness (depth, IR) mitigates many attacks.
Pose Estimation: 20 Essential Q&A
21
What is pose estimation?
⚡ easy
Answer: Predict joint locations (shoulders, elbows, etc.) for people in an image/video—2D pixel coords or 3D body config.
22
Keypoint formats?
📊 medium
Answer: xy coordinates, confidence, sometimes visibility flags—datasets define fixed skeleton topology (COCO 17 joints).
23
Heatmap regression?
📊 medium
Answer: Per-joint Gaussian maps; argmax or soft-argmax for coordinate—preserves spatial uncertainty vs direct regression.
# heatmap argmax → (x,y) joint; soft-argmax differentiable
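Both decoders can be sketched in a few lines of plain Python on a heatmap stored as a list of rows. The `beta` temperature in the soft version is an assumed parameter: larger values make the result approach the hard argmax.

```python
import math

def argmax_2d(heatmap):
    """Return (x, y, score) of the highest heatmap value."""
    score, y, x = max((v, y, x) for y, row in enumerate(heatmap) for x, v in enumerate(row))
    return x, y, score

def soft_argmax_2d(heatmap, beta=10.0):
    """Expected (x, y) under softmax(beta * heatmap); differentiable in a DL framework."""
    peak = max(max(row) for row in heatmap)
    # Subtract the peak before exponentiating for numerical stability.
    weights = [[math.exp(beta * (v - peak)) for v in row] for row in heatmap]
    total = sum(sum(row) for row in weights)
    x = sum(w * xi for row in weights for xi, w in enumerate(row)) / total
    y = sum(w * yi for yi, row in enumerate(weights) for w in row) / total
    return x, y
```

The hard argmax is limited to integer pixel positions, while the soft version returns sub-pixel coordinates and lets a coordinate loss backpropagate through the heatmap.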
24
COCO pose?
⚡ easy
Answer: 17 body keypoints per person—standard for detection+pose benchmarks and pretrained models.
25
Top-down approach?
📊 medium
Answer: Person detector first, then single-person pose inside each ROI—accurate when detector is good, slower with many people.
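One detail of the top-down approach that is easy to get wrong is mapping keypoints predicted inside the resized crop back to full-image coordinates. A minimal sketch, assuming an axis-aligned box given as (x0, y0, width, height) and a crop resized to `crop_size`; the names are illustrative.

```python
def roi_to_image(keypoints, box, crop_size):
    """Map (x, y) keypoints from crop pixels back to image pixels.

    box is (x0, y0, w, h) in image pixels; crop_size is (crop_w, crop_h).
    """
    x0, y0, w, h = box
    crop_w, crop_h = crop_size
    return [(x0 + x * w / crop_w, y0 + y * h / crop_h) for x, y in keypoints]
```

For instance, the centre of a 192x256 crop taken from box (100, 50, 200, 400) maps to image point (200, 250).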
26
Bottom-up?
📊 medium
Answer: Predict all joints then group into people (OpenPose PAFs, Associative Embedding)—better scaling in crowds.
27
OpenPose PAFs?
🔥 hard
Answer: Part affinity fields encode limb orientation to connect candidate joints—enables real-time multi-person 2D pose.
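The association step scores a candidate limb by sampling the field along the segment between two joints and measuring how well it aligns with the segment direction. The sketch below is a simplified version of that line integral, with the field stored as two 2D lists; the function name and the nearest-pixel sampling are assumptions for the example.

```python
import math

def paf_score(paf_x, paf_y, joint_a, joint_b, samples=10):
    """Average dot product between the PAF and the unit vector from a to b.

    paf_x / paf_y hold the field's x and y components, indexed as [row][col].
    samples must be at least 2.
    """
    (xa, ya), (xb, yb) = joint_a, joint_b
    length = math.hypot(xb - xa, yb - ya)
    if length == 0:
        return 0.0
    ux, uy = (xb - xa) / length, (yb - ya) / length
    total = 0.0
    for i in range(samples):
        t = i / (samples - 1)
        x = int(round(xa + t * (xb - xa)))  # nearest-pixel sampling
        y = int(round(ya + t * (yb - ya)))
        total += paf_x[y][x] * ux + paf_y[y][x] * uy
    return total / samples
```

Candidate pairs with high scores are kept as limbs, and a matching step then assembles limbs into per-person skeletons.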
28
HRNet?
🔥 hard
Answer: Maintains high-resolution streams parallel to low-res with repeated fusions—sharp heatmaps, strong 2D accuracy.
29
Loss functions?
📊 medium
Answer: MSE on heatmaps; or L1 on coords; auxiliary intermediate supervision in hourglass nets aids deep training.
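The heatmap loss compares the predicted map with a Gaussian target centred on the ground-truth joint. A minimal sketch in plain Python; `sigma` and the map size are assumed hyperparameters.

```python
import math

def gaussian_heatmap(width, height, cx, cy, sigma=1.0):
    """Target map with an unnormalized Gaussian peak of 1.0 at (cx, cy)."""
    return [
        [math.exp(-((x - cx) ** 2 + (y - cy) ** 2) / (2 * sigma ** 2)) for x in range(width)]
        for y in range(height)
    ]

def mse(pred, target):
    """Mean squared error between two equally sized 2D maps."""
    n = len(target) * len(target[0])
    return sum((p - t) ** 2 for pr, tr in zip(pred, target) for p, t in zip(pr, tr)) / n
```

With intermediate supervision, the same target is applied to the output of every stage and the per-stage losses are summed.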
30
Occlusion?
📊 medium
Answer: Visibility flags mark occluded joints, context from the torso constrains hidden limbs, and temporal smoothing helps in video; heavy overlap is still hard.
31
Multi-person overlap?
📊 medium
Answer: NMS on detections; association graph solvers; transformer decoders predicting sets of poses (PETR-style ideas).
32
3D pose?
🔥 hard
Answer: Direct regression of camera-space joints or volumetric representations—needs depth, multi-view, or weak 3D supervision.
33
Lifting 2D→3D?
📊 medium
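Answer: Map 2D keypoints to 3D using skeleton constraints plus a camera model, or a learned prior over monocular sequences (VideoPose3D lifts 2D keypoint sequences; VIBE regresses body-model parameters from video).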
34
MediaPipe / BlazePose?
📊 medium
Answer: Lightweight graphs for mobile AR—33-point topology, real-time on phone GPUs.
35
Real-time?
⚡ easy
Answer: Light backbones, lower input res, single-person mode—30+ FPS on GPU for fitness apps.
36
Graph models?
🔥 hard
Answer: GCN over joints exploits kinematic structure—complements conv heatmap methods especially for 3D.
37
OKS mAP?
📊 medium
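Answer: Object keypoint similarity passes each keypoint's distance error through a Gaussian normalized by the object's scale (area) and a per-keypoint constant, then averages over labeled keypoints. COCO pose AP averages precision over OKS thresholds from 0.50 to 0.95.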
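A sketch of OKS for one predicted pose against one ground-truth pose, following the COCO definition in which the per-keypoint constant is k = 2 * sigma. The sigma values used in any call are up to the dataset; the uniform values in the test are hypothetical.

```python
import math

def oks(pred, gt, visibility, area, sigmas):
    """Object keypoint similarity in [0, 1].

    pred, gt: lists of (x, y); visibility: COCO flags (0 means not labeled);
    area: ground-truth object area in pixels; sigmas: per-keypoint constants.
    """
    total, count = 0.0, 0
    for (px, py), (gx, gy), v, sigma in zip(pred, gt, visibility, sigmas):
        if v == 0:
            continue  # unlabeled keypoints do not count
        d2 = (px - gx) ** 2 + (py - gy) ** 2
        k = 2 * sigma
        total += math.exp(-d2 / (2 * area * k ** 2))
        count += 1
    return total / count if count else 0.0
```

OKS plays the role that IoU plays in box detection: a prediction counts as a true positive at a given threshold if its OKS with a ground-truth pose exceeds it.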
38
Augmentation?
⚡ easy
Answer: Random rotation/scale, flip with joint swap, cutout—preserve skeleton validity after transform.
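Flipping is the augmentation most often implemented incorrectly, because mirroring the image turns the left wrist into the right wrist. A minimal sketch for the COCO 17-keypoint order; the function name is illustrative and keypoints are assumed to be (x, y) pixel coordinates.

```python
# COCO left/right index pairs: eyes, ears, shoulders, elbows, wrists, hips, knees, ankles.
FLIP_PAIRS = [(1, 2), (3, 4), (5, 6), (7, 8), (9, 10), (11, 12), (13, 14), (15, 16)]

def hflip_keypoints(keypoints, image_width, pairs=FLIP_PAIRS):
    """Mirror x coordinates and swap left/right joint indices."""
    flipped = [(image_width - 1 - x, y) for x, y in keypoints]
    for left, right in pairs:
        flipped[left], flipped[right] = flipped[right], flipped[left]
    return flipped
```

Applying the function twice returns the original keypoints, which makes a convenient sanity check in a data pipeline.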
39
Mobile deployment?
📊 medium
Answer: INT8 quant, smaller input, ROI cropping—trade accuracy for thermal/power on edge.
40
Limitations?
⚡ easy
Answer: Rare poses are underrepresented in training data, clothing hides joints, and monocular 3D has inherent depth ambiguity from a single view; combine sensors or use multi-view when possible.