Computer Vision Interview: 20 Essential Q&A (Updated 2026)

3D Vision Introduction: 20 Essential Q&A

From 2D images to depth and geometry—stereo, monocular cues, and 3D representations.

~11 min read · 20 questions · Advanced
Tags: depth · stereo · point cloud · pinhole
1 What is 3D computer vision? ⚡ easy
Answer: Reasoning about the geometry of a scene—depth, shape, pose, and 3D structure—from images, video, or range sensors.
2 Depth from stereo? 📊 medium
Answer: Triangulate corresponding points in two calibrated views; the baseline provides parallax, and disparity is inversely proportional to depth (Z = fB/d).
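The disparity-to-depth relation can be sketched in a few lines; the focal length and baseline below are made-up illustrative values, not from any particular rig:

```python
import numpy as np

# Hypothetical rig parameters for illustration.
f = 700.0  # focal length in pixels
B = 0.12   # baseline in meters

disparity = np.array([70.0, 35.0, 14.0])  # matched disparities in pixels
depth = f * B / disparity                 # Z = f * B / d, in meters

print(depth)  # [1.2 2.4 6. ] -- larger disparity means closer object
```

Note the inverse relation: halving the disparity doubles the estimated depth, which is why stereo depth error grows with range.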
3 Define disparity. 📊 medium
Answer: Horizontal shift between conjugate pixels in a rectified stereo pair; larger disparity means a closer object (for a standard forward-facing rig).
4 What is the epipolar constraint? 🔥 hard
Answer: The corresponding point in the second image lies on a line (the epipolar line), reducing matching from a 2D search to a 1D search along that line after rectification.
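The constraint x'ᵀ F x = 0 is easy to check numerically. As a sketch, for a rectified pair (pure horizontal translation) the fundamental matrix is the skew-symmetric form of the epipole direction (1, 0, 0), so corresponding points must share a row; the pixel coordinates below are arbitrary:

```python
import numpy as np

# Fundamental matrix for a rectified pair: skew-symmetric form of e = (1, 0, 0).
F = np.array([[0.0, 0.0,  0.0],
              [0.0, 0.0, -1.0],
              [0.0, 1.0,  0.0]])

x  = np.array([120.0, 45.0, 1.0])  # point in left image (homogeneous)
x2 = np.array([100.0, 45.0, 1.0])  # same row: satisfies the constraint
x3 = np.array([100.0, 60.0, 1.0])  # different row: violates it

print(x2 @ F @ x)  # 0.0  (on the epipolar line)
print(x3 @ F @ x)  # -15.0 (off the line)
```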
5 Monocular depth? 📊 medium
Answer: Uses cues (perspective, texture, learned priors) or supervised/self-supervised CNNs—scale ambiguous without extra info.
6 What is a point cloud? ⚡ easy
Answer: Set of 3D points (x,y,z), often with color/normal—raw output of LiDAR/stereo fusion or depth cameras.
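A depth map plus intrinsics is enough to build a point cloud by back-projecting each pixel along its ray. A minimal sketch, assuming a hypothetical 640×480 camera and a flat toy depth map:

```python
import numpy as np

# Hypothetical intrinsics for illustration.
K = np.array([[500.0,   0.0, 320.0],
              [  0.0, 500.0, 240.0],
              [  0.0,   0.0,   1.0]])

depth = np.full((480, 640), 2.0)  # toy depth map: 2 m everywhere

v, u = np.indices(depth.shape)                    # pixel row/column grid
pix = np.stack([u, v, np.ones_like(u)], axis=-1)  # homogeneous pixel coords
rays = pix @ np.linalg.inv(K).T                   # back-projected rays (K^-1 x)
points = rays * depth[..., None]                  # scale each ray by its depth

cloud = points.reshape(-1, 3)                     # (N, 3) point cloud
print(cloud.shape)  # (307200, 3)
```

The principal-point pixel back-projects straight down the optical axis, so `points[240, 320]` is (0, 0, 2).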
7 Voxel vs mesh? 📊 medium
Answer: Voxel grid discretizes 3D space—good for conv nets; mesh stores vertices+faces—compact for graphics and surface reasoning.
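Voxelization is just quantization of coordinates; a minimal sketch with a made-up 0.1 m voxel size and three toy points:

```python
import numpy as np

def voxelize(points, voxel_size=0.1):
    """Quantize points to integer voxel indices and keep unique occupied voxels."""
    idx = np.floor(points / voxel_size).astype(np.int64)
    return np.unique(idx, axis=0)

pts = np.array([[0.01, 0.02, 0.03],
                [0.04, 0.05, 0.06],   # lands in the same voxel as the first point
                [0.95, 0.00, 0.00]])
occupied = voxelize(pts)
print(len(occupied))  # 2 occupied voxels
```

This many-points-to-one-cell collapse is what makes voxel grids convenient for 3D convolutions but lossy compared to a mesh surface.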
8 Pinhole camera model? 📊 medium
Answer: Projects 3D X to image x via similar triangles: x = K [R|t] X (homogeneous)—basis for calibration and triangulation.
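The projection x = K [R|t] X can be written directly; the intrinsics and the test point below are illustrative assumptions, with the camera placed at the world origin:

```python
import numpy as np

def project(K, R, t, X):
    """Project a 3D world point X to pixel coordinates via x = K [R|t] X."""
    Xc = R @ X + t        # world frame -> camera frame
    x = K @ Xc            # camera frame -> homogeneous image coordinates
    return x[:2] / x[2]   # perspective divide

K = np.array([[500.0,   0.0, 320.0],
              [  0.0, 500.0, 240.0],
              [  0.0,   0.0,   1.0]])
R = np.eye(3)             # camera axes aligned with world axes
t = np.zeros(3)           # camera at the world origin

X = np.array([0.5, 0.25, 2.0])  # point 2 m in front of the camera
print(project(K, R, t, X))      # [445.  302.5]
```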
9 Intrinsic matrix K? 📊 medium
Answer: Maps camera coordinates to pixels: focal lengths f_x,f_y and principal point c_x,c_y; may include skew in general form.
10 Extrinsics? 📊 medium
Answer: Rotation R and translation t from world to camera frame—pose of camera in scene.
11 RGB-D cameras? ⚡ easy
Answer: Structured light or time-of-flight provides registered depth + color (Kinect, RealSense); no stereo baseline needed, but limited in range and prone to depth artifacts.
12 LiDAR? 📊 medium
Answer: Active ranging by laser pulses—sparse accurate 3D, widely used in autonomy; different noise profile than passive stereo.
13 Structure from motion? 📊 medium
Answer: Estimate sparse 3D points and camera poses from many images—basis of photogrammetry pipelines.
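The core geometric step behind SfM pipelines is triangulating a 3D point from its projections in two (or more) posed views. A minimal linear (DLT) sketch, using hypothetical intrinsics and a 0.2 m baseline:

```python
import numpy as np

def triangulate(P1, P2, x1, x2):
    """Linear (DLT) triangulation of one point from two views."""
    A = np.stack([x1[0] * P1[2] - P1[0],
                  x1[1] * P1[2] - P1[1],
                  x2[0] * P2[2] - P2[0],
                  x2[1] * P2[2] - P2[1]])
    _, _, Vt = np.linalg.svd(A)      # null space of A is the homogeneous point
    X = Vt[-1]
    return X[:3] / X[3]

K = np.array([[500.0,   0.0, 320.0],
              [  0.0, 500.0, 240.0],
              [  0.0,   0.0,   1.0]])
# Two cameras separated by a 0.2 m baseline along x.
P1 = K @ np.hstack([np.eye(3), np.zeros((3, 1))])
P2 = K @ np.hstack([np.eye(3), np.array([[-0.2], [0.0], [0.0]])])

X_true = np.array([0.3, 0.1, 2.0])
x1 = P1 @ np.append(X_true, 1.0); x1 = x1[:2] / x1[2]
x2 = P2 @ np.append(X_true, 1.0); x2 = x2[:2] / x2[2]

print(triangulate(P1, P2, x1, x2))  # recovers [0.3 0.1 2. ]
```

Real pipelines add feature matching, RANSAC for outliers, and bundle adjustment on top of this step.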
14 SLAM in one line? 📊 medium
Answer: Simultaneously localize sensor and build map of environment—needs data association and loop closure.
15 What is NeRF? 🔥 hard
Answer: A neural radiance field represents a scene as an MLP mapping a 5D input (position x,y,z and viewing direction θ,φ) to density and color, rendered by volume integration for novel view synthesis; an active research direction.
16 Scale ambiguity? 📊 medium
Answer: Monocular SfM/SLAM recovers geometry up to similarity transform without metric scale—IMU or known object fixes scale.
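The ambiguity can be demonstrated directly: scaling the whole scene and the camera translation by the same factor leaves every projected pixel unchanged, so images alone cannot pin down metric scale. The intrinsics, pose, and point below are illustrative assumptions:

```python
import numpy as np

K = np.array([[500.0,   0.0, 320.0],
              [  0.0, 500.0, 240.0],
              [  0.0,   0.0,   1.0]])
R = np.eye(3)
t = np.array([0.1, 0.0, 0.0])

def project(X, t):
    x = K @ (R @ X + t)
    return x[:2] / x[2]

X = np.array([0.4, 0.2, 3.0])
s = 10.0  # arbitrary global scale factor

# Scaling the scene AND the translation by s cancels in the perspective divide:
print(project(X, t))          # original projection
print(project(s * X, s * t))  # identical pixel coordinates
```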
17 What is ICP? 📊 medium
Answer: Iterative Closest Point aligns two point clouds by minimizing distances between correspondences—registration and tracking.
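The closed-form inner step of ICP, given correspondences, is the SVD-based (Kabsch) rigid alignment; full ICP alternates this with nearest-neighbor matching. A sketch on a synthetic cloud with a made-up ground-truth motion:

```python
import numpy as np

def best_rigid_transform(src, dst):
    """Least-squares rotation R and translation t mapping src onto dst
    (the closed-form inner step of ICP, given correspondences)."""
    cs, cd = src.mean(axis=0), dst.mean(axis=0)
    H = (src - cs).T @ (dst - cd)               # cross-covariance
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))      # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = cd - R @ cs
    return R, t

rng = np.random.default_rng(0)
src = rng.normal(size=(50, 3))
# Ground-truth motion: 30-degree yaw plus a translation.
a = np.deg2rad(30.0)
R_gt = np.array([[np.cos(a), -np.sin(a), 0.0],
                 [np.sin(a),  np.cos(a), 0.0],
                 [0.0,        0.0,       1.0]])
t_gt = np.array([1.0, -2.0, 0.5])
dst = src @ R_gt.T + t_gt                       # transformed copy of src

R, t = best_rigid_transform(src, dst)
print(np.allclose(R, R_gt), np.allclose(t, t_gt))  # True True
```

With exact correspondences one step recovers the motion; real ICP iterates because correspondences from nearest neighbors start out wrong.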
18 BEV representation? 📊 medium
Answer: Top-down grid of scene used in driving—fuses multi-view or LiDAR into 2D bird’s-eye feature maps for detection/planning.
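The simplest BEV representation is an occupancy grid: drop the z coordinate and rasterize points into top-down cells. A sketch with made-up range and resolution parameters and three toy LiDAR points:

```python
import numpy as np

def points_to_bev(points, x_range=(0.0, 40.0), y_range=(-20.0, 20.0), res=0.5):
    """Rasterize (x, y, z) points into a top-down occupancy grid."""
    nx = int((x_range[1] - x_range[0]) / res)
    ny = int((y_range[1] - y_range[0]) / res)
    grid = np.zeros((nx, ny), dtype=np.uint8)
    ix = ((points[:, 0] - x_range[0]) / res).astype(int)
    iy = ((points[:, 1] - y_range[0]) / res).astype(int)
    keep = (ix >= 0) & (ix < nx) & (iy >= 0) & (iy < ny)  # drop out-of-range points
    grid[ix[keep], iy[keep]] = 1
    return grid

pts = np.array([[10.0, 0.0, -1.5],   # point ahead of the ego vehicle
                [10.2, 0.1, -1.4],   # falls in the same 0.5 m cell
                [30.0, 5.0,  0.2]])
bev = points_to_bev(pts)
print(bev.sum())  # 2 occupied cells
```

Learned BEV methods replace the binary cell with pooled per-cell features, but the projection idea is the same.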
19 Applications? ⚡ easy
Answer: AR overlay needs 6-DoF pose; robotics needs grasp planning and collision checking—both need reliable 3D perception.
20 Example datasets? ⚡ easy
Answer: KITTI, nuScenes, ScanNet, ShapeNet—emphasizing driving, multi-sensor autonomy, indoor scans, and CAD models respectively.

3D Vision Cheat Sheet

Stereo
  • Disparity → depth
  • Epipolar line
Models
  • K, [R|t]
  • Pinhole
Data
  • Point cloud
  • RGB-D / LiDAR

💡 Pro tip: Stereo needs calibration + rectification for 1D disparity search.

Full tutorial track

Go deeper with the matching tutorial chapter and code examples.