3D Vision & Depth — Interview Q&A

Question 1

1 What is 3D computer vision? ⚡ easy

Answer

Answer: Reasoning about the geometry of a scene (depth, shape, pose, and 3D structure) from images, video, or range sensors.

Question 2

2 Depth from stereo? 📊 medium

Answer

Answer: Triangulate corresponding points in two calibrated views—baseline provides parallax; disparity inversely related to depth.

Question 3

3 Define disparity. 📊 medium

Answer

Answer: Horizontal shift between conjugate pixels in rectified stereo pair—larger disparity means closer object (for standard forward stereo).

Question 4

4 What is the epipolar constraint? 🔥 hard

Answer

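Answer: For a point in one image, the corresponding point in the second image must lie on a line (the epipolar line), so matching is a 1D search instead of a 2D one; after rectification these lines become horizontal scanlines. Algebraically, x'^T F x = 0 for the fundamental matrix F.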
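
The constraint can be checked numerically. The sketch below is a minimal illustration, not a reference implementation: it works in normalized camera coordinates with the essential matrix E = [t]x R (for pixel coordinates the fundamental matrix F = K2^-T E K1^-1 plays the same role). The relative pose, the test point, and the helper names `cross_matrix`, `matmul`, and `matvec` are all made up for the example, and the pose convention (camera-2 coordinates = R * camera-1 coordinates + t) is an assumption.

```python
import math


def cross_matrix(t):
    """Skew-symmetric matrix [t]x such that [t]x v equals the cross product t x v."""
    tx, ty, tz = t
    return [[0.0, -tz, ty],
            [tz, 0.0, -tx],
            [-ty, tx, 0.0]]


def matmul(A, B):
    """3x3 matrix product."""
    return [[sum(A[i][k] * B[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


def matvec(A, x):
    """3x3 matrix times 3-vector."""
    return [sum(A[i][j] * x[j] for j in range(3)) for i in range(3)]


# Hypothetical relative pose: a 5 degree rotation about Y plus a mostly sideways shift.
angle = math.radians(5.0)
R = [[math.cos(angle), 0.0, math.sin(angle)],
     [0.0, 1.0, 0.0],
     [-math.sin(angle), 0.0, math.cos(angle)]]
t = [-0.2, 0.0, 0.01]
E = matmul(cross_matrix(t), R)

# One 3D point expressed in each camera frame, then normalized (divide by Z).
X1 = [0.3, -0.1, 2.5]
X2 = [a + b for a, b in zip(matvec(R, X1), t)]
x1 = [X1[0] / X1[2], X1[1] / X1[2], 1.0]
x2 = [X2[0] / X2[2], X2[1] / X2[2], 1.0]

# Epipolar line in image 2 as coefficients (a, b, c) of a*x + b*y + c = 0.
line2 = matvec(E, x1)
# x2 lies on that line, so the residual is zero up to rounding.
residual = sum(x2[i] * line2[i] for i in range(3))
print(abs(residual) < 1e-9)
```

A matcher only has to search along `line2` for the match of `x1`, which is the 2D-to-1D reduction described in the answer.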

Question 5

5 Monocular depth? 📊 medium

Answer

Answer: Uses cues (perspective, texture, learned priors) or supervised/self-supervised CNNs—scale ambiguous without extra info.

Question 6

6 What is a point cloud? ⚡ easy

Answer

Answer: Set of 3D points (x,y,z), often with per-point color or normals; the raw output of LiDAR, stereo, or depth cameras.
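
As an illustration of how a depth camera's output becomes a point cloud, the sketch below back-projects each valid depth pixel through the pinhole intrinsics. The function name `depth_to_points`, the tiny 2x2 depth map, and the intrinsic values are assumptions made for the example; plain lists are used to keep it dependency-free.

```python
def depth_to_points(depth, fx, fy, cx, cy):
    """Back-project a depth map (rows of metric Z values) into camera-frame 3D points.

    Pixels with depth <= 0 are treated as missing and skipped.
    """
    points = []
    for v, row in enumerate(depth):
        for u, z in enumerate(row):
            if z <= 0:
                continue
            # Invert u = fx * X / Z + cx and v = fy * Y / Z + cy.
            x = (u - cx) * z / fx
            y = (v - cy) * z / fy
            points.append((x, y, z))
    return points


# Hypothetical 2x2 depth map in metres; 0.0 marks a missing measurement.
depth = [[2.0, 2.0],
         [0.0, 4.0]]
cloud = depth_to_points(depth, fx=2.0, fy=2.0, cx=0.5, cy=0.5)
print(cloud)  # [(-0.5, -0.5, 2.0), (0.5, -0.5, 2.0), (1.0, 1.0, 4.0)]
```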

Question 7

7 Voxel vs mesh? 📊 medium

Answer

Answer: Voxel grid discretizes 3D space—good for conv nets; mesh stores vertices+faces—compact for graphics and surface reasoning.
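
A minimal sketch of the voxel side of this comparison: quantizing a point cloud into a sparse set of occupied voxel indices. The function name `voxelize`, the 0.5 m voxel size, and the sample points are assumptions for illustration; a dense grid for a conv net would typically be built from such indices.

```python
import math


def voxelize(points, voxel_size):
    """Return the set of integer (i, j, k) voxel indices containing at least one point."""
    occupied = set()
    for x, y, z in points:
        occupied.add((math.floor(x / voxel_size),
                      math.floor(y / voxel_size),
                      math.floor(z / voxel_size)))
    return occupied


# Hypothetical points in metres; the first two fall into the same voxel.
points = [(0.1, 0.2, 0.3), (0.4, 0.1, 0.2), (1.2, 0.0, -0.1)]
voxels = voxelize(points, voxel_size=0.5)
print(sorted(voxels))  # [(0, 0, 0), (2, 0, -1)]
```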

Question 8

8 Pinhole camera model? 📊 medium

Answer

Answer: Projects a 3D point X to an image point x by perspective (similar triangles): x ~ K [R|t] X in homogeneous coordinates, with equality up to scale; basis for calibration and triangulation.
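
The projection can be written out step by step. This is a minimal sketch using plain lists; the function name `project` and all numeric values for K, R, t, and the point are assumptions chosen to make the arithmetic easy to follow.

```python
def project(K, R, t, X):
    """Project world point X (3-vector) to pixel coordinates using x ~ K (R X + t)."""
    # World frame -> camera frame.
    Xc = [sum(R[i][j] * X[j] for j in range(3)) + t[i] for i in range(3)]
    if Xc[2] <= 0:
        raise ValueError("point is behind the camera")
    # Camera frame -> homogeneous pixel coordinates.
    p = [sum(K[i][j] * Xc[j] for j in range(3)) for i in range(3)]
    # Divide by the third component to leave homogeneous coordinates.
    return p[0] / p[2], p[1] / p[2]


# Hypothetical camera: 800 px focal length, principal point (320, 240), identity pose.
K = [[800.0, 0.0, 320.0],
     [0.0, 800.0, 240.0],
     [0.0, 0.0, 1.0]]
R = [[1.0, 0.0, 0.0],
     [0.0, 1.0, 0.0],
     [0.0, 0.0, 1.0]]
t = [0.0, 0.0, 0.0]

u, v = project(K, R, t, [0.5, -0.25, 2.0])
print(u, v)  # 520.0 140.0
```

The division by the third component is what "up to scale" means in practice: any multiple of the homogeneous vector maps to the same pixel.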

Question 9

9 Intrinsic matrix K? 📊 medium

Answer

Answer: Maps camera coordinates to pixels: focal lengths f_x,f_y and principal point c_x,c_y; may include skew in general form.

Question 10

10 Extrinsics? 📊 medium

Answer

Answer: Rotation R and translation t that map world coordinates into the camera frame; together they encode the camera's pose in the scene (camera center at C = -R^T t).

Question 11

11 RGB-D cameras? ⚡ easy

Answer

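Answer: Structured light, time-of-flight, or active stereo provides registered depth + color (Kinect, RealSense); depth is obtained without passive stereo matching, but working range is limited and artifacts appear on reflective, dark, or sunlit surfaces.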

Question 12

12 LiDAR? 📊 medium

Answer

Answer: Active ranging by laser pulses—sparse accurate 3D, widely used in autonomy; different noise profile than passive stereo.

Question 13

13 Structure from motion? 📊 medium

Answer

Answer: Estimate sparse 3D points and camera poses from many images—basis of photogrammetry pipelines.

Question 14

14 SLAM in one line? 📊 medium

Answer

Answer: Simultaneously localize sensor and build map of environment—needs data association and loop closure.

Question 15

15 What is NeRF? 🔥 hard

Answer

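Answer: A neural radiance field represents a scene as an MLP that maps a 5D input (position x,y,z and viewing direction θ,φ) to density and view-dependent color; images are produced by volume rendering along camera rays, which enables novel view synthesis.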

Question 16

16 Scale ambiguity? 📊 medium

Answer

Answer: Monocular SfM/SLAM recovers geometry up to similarity transform without metric scale—IMU or known object fixes scale.

Question 17

17 What is ICP? 📊 medium

Answer

Answer: Iterative Closest Point aligns two point clouds by minimizing distances between correspondences—registration and tracking.

Question 18

18 BEV representation? 📊 medium

Answer

Answer: Top-down grid of scene used in driving—fuses multi-view or LiDAR into 2D bird’s-eye feature maps for detection/planning.

Question 19

19 Applications? ⚡ easy

Answer

Answer: AR overlay needs 6-DoF camera pose; robotics needs grasp planning and collision checking; both depend on reliable 3D perception.

Question 20

20 Example datasets? ⚡ easy

Answer

Answer: KITTI, nuScenes, ScanNet, ShapeNet: these emphasize driving, multi-sensor driving, indoor RGB-D scans, and CAD models respectively.

Question 21

21 What is camera calibration? ⚡ easy

Answer

Answer: Estimating intrinsic (focal length, principal point, distortion) and often extrinsic (pose) parameters so pixel measurements map correctly to 3D rays.

Question 22

22 What are intrinsics? 📊 medium

Answer

Answer: Properties of the camera/lens fixed w.r.t. the sensor—encoded in K and distortion coeffs—independent of where the camera sits in the world.

Question 23

23 What are extrinsics? 📊 medium

Answer

Answer: Rigid transform [R|t] from world (or calibration object) frame to camera frame—changes when the camera moves.

Question 24

24 What is the intrinsic matrix K? 📊 medium

Answer

Answer: 3×3 upper-triangular mapping normalized camera coordinates to pixels: focal lengths f_x,f_y, principal point c_x,c_y, optional skew γ.

Question 25

25 What is radial distortion? 📊 medium

Answer

Answer: The lens bends rays so that straight lines bow outward (barrel) or inward (pincushion); modeled as an r-dependent scaling of each point's distance from the distortion center, e.g. 1 + k1 r^2 + k2 r^4 + k3 r^6.

Question 26

26 What is tangential distortion? 📊 medium

Answer

Answer: Lens not perfectly parallel to the sensor; modeled with extra parameters (p1,p2) that shift points tangentially; part of OpenCV's 5-coefficient model (k1, k2, p1, p2, k3).

Question 27

27 Brown–Conrady model? 🔥 hard

Answer

Answer: Classic polynomial radial + tangential distortion model used by OpenCV calibrateCamera, with coefficients k1–k3 and p1, p2; fisheye lenses use a different high-FOV model.
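
A minimal sketch of the forward distortion model, applied to normalized image coordinates (before multiplying by K). The equations follow the radial + tangential form documented for OpenCV; the function name `distort` and the coefficient values in the usage lines are assumptions for illustration.

```python
def distort(x, y, k1=0.0, k2=0.0, p1=0.0, p2=0.0, k3=0.0):
    """Apply radial (k1, k2, k3) and tangential (p1, p2) distortion to a normalized point."""
    r2 = x * x + y * y
    radial = 1.0 + k1 * r2 + k2 * r2 ** 2 + k3 * r2 ** 3
    xd = x * radial + 2.0 * p1 * x * y + p2 * (r2 + 2.0 * x * x)
    yd = y * radial + p1 * (r2 + 2.0 * y * y) + 2.0 * p2 * x * y
    return xd, yd


# Hypothetical barrel distortion: a negative k1 pulls points toward the center.
xd, yd = distort(0.5, 0.0, k1=-0.2)
print(xd, yd)

# With all coefficients zero the model reduces to the ideal pinhole.
print(distort(0.3, -0.4))  # (0.3, -0.4)
```

Undistortion inverts this mapping, which has no closed form in general and is usually solved iteratively.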

Question 28

28 How does Zhang’s method work? 🔥 hard

Answer

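Answer: Uses multiple views of a planar calibration pattern; each view gives a homography that supplies two constraints on the intrinsics, so at least three views in general position are needed (two if skew is assumed zero); a closed-form initialization is followed by non-linear refinement minimizing reprojection error.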
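
For reference, the two constraints per view can be written out as follows. This is a sketch of the standard derivation; the symbols h_1, h_2 (the first two columns of the homography H), r_1, r_2 (the first two columns of R), λ, and B are notation introduced here, not taken from the text above.

```latex
% Board plane at Z = 0, so a board point (X, Y, 0) maps through
%   H = [h_1 \; h_2 \; h_3] = \lambda \, K \, [r_1 \; r_2 \; t].
% Because r_1 and r_2 are orthonormal, with B = K^{-\top} K^{-1}:
\begin{align}
  h_1^{\top} B \, h_2 &= 0, \\
  h_1^{\top} B \, h_1 &= h_2^{\top} B \, h_2 .
\end{align}
% B is symmetric (6 entries, defined up to scale), so each view adds
% 2 linear equations in its entries; K is then recovered from B.
```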

Question 29

29 Why checkerboards? 📊 medium

Answer

Answer: Corner intersections are easy to detect and to localize with sub-pixel accuracy; the known layout of the planar board gives 2D–3D correspondences in every image.

Question 30

30 What is reprojection error? 📊 medium

Answer

Answer: Distance between detected image points and projection of 3D model points with estimated parameters—lower is better; report RMS in pixels.
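
A minimal sketch of the RMS computation, assuming the detected and reprojected points are already paired up as (u, v) tuples in pixels. The function name `rms_reprojection_error` and the sample coordinates are made up for the example.

```python
import math


def rms_reprojection_error(detected, projected):
    """Root-mean-square Euclidean distance (pixels) between paired 2D points."""
    if len(detected) != len(projected) or not detected:
        raise ValueError("need two non-empty point lists of equal length")
    squared = sum((u - up) ** 2 + (v - vp) ** 2
                  for (u, v), (up, vp) in zip(detected, projected))
    return math.sqrt(squared / len(detected))


# Hypothetical detections and reprojections; each pair is off by a 3-4-5 triangle.
detected = [(100.0, 50.0), (200.0, 80.0)]
projected = [(103.0, 54.0), (197.0, 76.0)]
rms = rms_reprojection_error(detected, projected)
print(rms)  # 5.0
```

An RMS this large would indicate a poor calibration; the values here are exaggerated so the arithmetic is easy to check by hand.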

Question 31

31 OpenCV pipeline? ⚡ easy

Answer

Answer: findChessboardCorners → cornerSubPix (sub-pixel refinement) → calibrateCamera → get K, distCoeffs; optional stereoCalibrate for two cameras.

Question 32

32 Stereo calibration? 🔥 hard

Answer

Answer: Estimate intrinsics per camera plus relative pose (R,T) between cameras and often rectify so epipolar lines align—needed for triangulation.

Question 33

33 When use fisheye module? 📊 medium

Answer

Answer: Very wide FOV lenses where the polynomial model breaks down; OpenCV's cv::fisheye namespace (cv2.fisheye in Python) uses a different distortion and projection model.

Question 34

34 Principal point? ⚡ easy

Answer

Answer: Optical axis intersection with image plane (c_x,c_y)—often near image center but not exactly; important for undistortion and 3D.

Question 35

35 Skew γ? 🔥 hard

Answer

Answer: Non-orthogonal pixel axes—often assumed 0 for modern sensors; included in full K for completeness.

Question 36

36 World frame choice? 📊 medium

Answer

Answer: Usually attach to calibration board plane (Z=0 on pattern)—extrinsics are board-to-camera per capture.

Question 37

37 Calibrate from homography only? 📊 medium

Answer

Answer: A single view of a plane gives one homography and therefore only two constraints on K; several views with different board orientations are needed to fix the intrinsics uniquely (Zhang’s multi-view idea).

Question 38

38 Why calibrate for AR? ⚡ easy

Answer

Answer: Overlay virtual objects requires accurate projection and undistortion—wrong K causes “swimming” augmentations.

Question 39

39 When recalibrate? ⚡ easy

Answer

Answer: Zoom/focus change, different camera, temperature extremes, or new lens—intrinsics are not universal across devices.

Question 40

40 Link to bundle adjustment? 🔥 hard

Answer

Answer: Joint non-linear refinement of many cameras and 3D points—structure-from-motion and SLAM extend calibration ideas to large scenes.

Question 41

41 What is stereo vision? ⚡ easy

Answer

Answer: Using two (or more) calibrated views with known baseline to recover depth via triangulation of corresponding points.

Question 42

42 Define disparity. 📊 medium

Answer

Answer: Horizontal shift between corresponding pixels in a rectified stereo pair—larger disparity means closer surface (inverse relation to depth).

Question 43

43 Depth from disparity? 📊 medium

Answer

Answer: Z = f × B / d, with focal length f in pixels, baseline B in metric units, and disparity d in pixels, so Z comes out in the units of B; assumes rectified parallel cameras and the pinhole model.
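
A worked instance of the formula, as a minimal sketch. The function name `depth_from_disparity` and the numbers (700 px focal length, 12 cm baseline) are assumptions; zero or negative disparity is mapped to infinity here, which is one possible convention for points with no measurable parallax.

```python
def depth_from_disparity(f_px, baseline_m, disparity_px):
    """Depth in metres from focal length (pixels), baseline (metres), disparity (pixels)."""
    if disparity_px <= 0:
        # No measurable parallax: treat the point as infinitely far away.
        return float("inf")
    return f_px * baseline_m / disparity_px


# Hypothetical rig: f = 700 px, B = 0.12 m.
for d in (84.0, 42.0, 21.0):
    print(d, depth_from_disparity(700.0, 0.12, d))
```

Halving the disparity doubles the depth, which is the inverse relation the answer describes. When the disparities come from OpenCV's StereoBM or StereoSGBM, note that `compute` returns fixed-point values scaled by 16, so they need to be divided by 16 first.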

Question 44

44 Baseline tradeoff? 📊 medium

Answer

Answer: Larger B increases depth precision (more parallax) but worsens occlusions and matching in narrow scenes; small B reduces measurable disparity range.
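
The precision side of this tradeoff can be quantified. Differentiating Z = f B / d gives |ΔZ| ≈ Z² Δd / (f B), so depth error grows with the square of distance and shrinks with a longer baseline. The sketch below evaluates that first-order estimate; the function name `depth_uncertainty` and the default matching error of 0.25 px are assumptions, not measured values.

```python
def depth_uncertainty(z_m, f_px, baseline_m, disparity_err_px=0.25):
    """First-order depth error (metres) caused by a given disparity error (pixels)."""
    return z_m ** 2 * disparity_err_px / (f_px * baseline_m)


# Hypothetical camera with f = 1000 px, looking at a target 10 m away.
for baseline in (0.1, 0.5):
    print(baseline, depth_uncertainty(10.0, 1000.0, baseline))
```

With these numbers the estimated error drops from about 0.25 m to about 0.05 m when the baseline grows from 10 cm to 50 cm.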

Question 45

45 What is rectification? 🔥 hard

Answer

Answer: Warp both images so epipolar lines are horizontal scanlines—reduces correspondence search to 1D and simplifies disparity.

Question 46

46 Epipolar constraint? 📊 medium

Answer

Answer: Without rectification, match for a point lies on a line in the other image—comes from epipolar geometry of two views.

Question 47

47 What is stereo matching? 📊 medium

Answer

Answer: For each pixel (or patch), find best match along epipolar line using photometric cost (SAD, census, CNN features).

Question 48

48 Cost volume? 🔥 hard

Answer

Answer: 3D array H×W×D of matching costs over disparity levels; disparities are picked by winner-take-all or by semi-global/global optimization (SGM, graph cuts, belief propagation).
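
A toy version of the idea, assuming a rectified pair where a left pixel at column x matches the right pixel at column x − d. The cost is a per-pixel absolute difference with no aggregation, which is far simpler than what real matchers use; the function names and the one-row images are made up for the example.

```python
def cost_volume(left, right, max_disp):
    """cost[y][x][d] = |left[y][x] - right[y][x - d]|; off-image candidates cost infinity."""
    height, width = len(left), len(left[0])
    inf = float("inf")
    return [[[abs(left[y][x] - right[y][x - d]) if x - d >= 0 else inf
              for d in range(max_disp + 1)]
             for x in range(width)]
            for y in range(height)]


def winner_take_all(volume):
    """Pick, for every pixel, the disparity index with the lowest cost."""
    return [[min(range(len(costs)), key=costs.__getitem__) for costs in row]
            for row in volume]


# Toy rectified pair: the scene is shifted by 2 pixels between the views.
left = [[10, 20, 30, 40, 50, 60]]
right = [[30, 40, 50, 60, 70, 80]]

volume = cost_volume(left, right, max_disp=3)
disparity = winner_take_all(volume)
print(disparity)  # [[0, 1, 2, 2, 2, 2]]
```

The first two columns come out wrong because their true match lies outside the right image; SGM and the global methods add smoothness terms over this same volume to suppress such errors.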

Question 49

49 What is SGM? 🔥 hard

Answer

Answer: Semi-Global Matching aggregates costs along many paths with smoothness penalties—good quality/speed tradeoff in OpenCV StereoSGBM.

Question 50

50 Occlusion regions? 📊 medium

Answer

Answer: Pixels visible in only one view have no valid disparity; they are detected with a left-right consistency check (disparities computed from each view must agree).
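
A minimal sketch of a left-right check. It assumes both disparity maps store non-negative values and that a left pixel at column x with disparity d matches the right pixel at column x − d; the function name `left_right_check`, the tolerance of 1 px, and the sample maps are assumptions for illustration.

```python
def left_right_check(disp_left, disp_right, tol=1):
    """Return a boolean mask: True where left and right disparities agree within tol."""
    mask = []
    for row_l, row_r in zip(disp_left, disp_right):
        row_mask = []
        for x, d in enumerate(row_l):
            xr = x - d  # column of the matched pixel in the right image
            ok = 0 <= xr < len(row_r) and abs(d - row_r[xr]) <= tol
            row_mask.append(ok)
        mask.append(row_mask)
    return mask


# Hypothetical one-row disparity maps; column 4 holds an outlier in the left map.
disp_left = [[0, 1, 2, 2, 5, 2]]
disp_right = [[2, 2, 2, 2, 0, 0]]
mask = left_right_check(disp_left, disp_right)
print(mask)  # [[False, True, True, True, False, True]]
```

Pixels that fail the check are typically marked invalid or filled from neighboring disparities.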

Question 51

51 Sub-pixel disparity? 📊 medium

Answer

Answer: Parabolic fit around discrete minimum or phase-based methods—needed for smooth surfaces and accurate 3D.
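
The parabolic fit has a closed form: with costs c(d−1), c(d), c(d+1) around the discrete minimum d, the refined disparity is d + (c(d−1) − c(d+1)) / (2 (c(d−1) − 2 c(d) + c(d+1))). The sketch below applies it to one pixel's cost curve; the function name `subpixel_disparity` and the synthetic costs are assumptions for the example.

```python
def subpixel_disparity(costs, d):
    """Refine integer disparity d by fitting a parabola through its two neighbors."""
    if d <= 0 or d >= len(costs) - 1:
        return float(d)  # no neighbor on one side: keep the integer value
    c_prev, c_min, c_next = costs[d - 1], costs[d], costs[d + 1]
    denom = c_prev - 2.0 * c_min + c_next
    if denom <= 0:
        return float(d)  # flat or non-convex neighborhood: the fit is unreliable
    return d + (c_prev - c_next) / (2.0 * denom)


# Synthetic cost curve with its true minimum at disparity 2.3.
costs = [(x - 2.3) ** 2 for x in range(5)]
d_int = min(range(len(costs)), key=costs.__getitem__)
d_sub = subpixel_disparity(costs, d_int)
print(d_int, round(d_sub, 6))  # 2 2.3
```

Because the synthetic curve is itself a parabola the fit is exact here; on real cost curves it is only an approximation.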

Question 52

52 Common errors? 📊 medium

Answer

Answer: Calibration errors, textureless regions, repetitive patterns, specular highlights, and motion if scene moves between exposures.

Question 53

53 StereoBM vs SGBM? ⚡ easy

Answer

Answer: BM: fixed small block, fast, blocky. SGBM: semi-global, slower, smoother—preferred when quality matters.

Question 54

54 Monocular depth? 📊 medium

Answer

Answer: Single image lacks scale without priors—learned networks predict relative depth; stereo gives metric depth with calibration.

Question 55

55 vs RGB-D? ⚡ easy

Answer

Answer: Structured light / ToF gives depth directly—no correspondence problem but range/resolution limits; stereo passive but needs texture.

Question 56

56 Multi-view stereo? 🔥 hard

Answer

Answer: Fuse many images (MVS) for dense point clouds—used in photogrammetry beyond two-camera stereo.

Question 57

57 Stereo in driving? 📊 medium

Answer

Answer: Wide-baseline camera pairs on vehicles for obstacle depth; often fused with radar/LiDAR and learned refinement.

Question 58

58 Fuse with LiDAR? 🔥 hard

Answer

Answer: Sparse accurate LiDAR anchors depth map from stereo; learning-based fusion common in autonomy stacks.

Question 59

59 Learned stereo? 📊 medium

Answer

Answer: CNNs build cost volumes or regress disparity directly (e.g. PSMNet); strong on benchmarks when enough training data is available.

Question 60

60 Need calibration? ⚡ easy

Answer

Answer: Yes for metric depth—need K, distortion, and stereo extrinsics; rectification matrices derived from them.
