Computer Vision Chapter 28

SLAM

SLAM (Simultaneous Localization and Mapping) estimates the sensor trajectory and a map of the environment at the same time—classic in robotics and AR. Visual SLAM uses cameras (monocular, stereo, RGB-D). Building blocks include feature tracking, two-view / multi-view geometry, bundle adjustment, loop closure to correct drift, and pose graphs. Full systems (ORB-SLAM, VINS, LSD-SLAM) are large; this chapter explains the idea and shows OpenCV two-frame motion from matched points—one step inside visual odometry.

Localization vs mapping

Localization: given a map, where am I? Mapping: given poses, what does the world look like? SLAM couples both because each helps the other: a better map improves pose; better poses improve the map. Without global constraints, integrating only relative motion drifts—loop closure detects revisits and distributes error.
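The effect of drift is easy to see in a toy dead-reckoning simulation: integrating only noisy relative motions, the pose error grows with path length (the step size and noise level below are illustrative, not taken from any real odometer):

```python
import numpy as np

rng = np.random.default_rng(0)
steps = 200
true_step = np.array([1.0, 0.0])   # robot moves 1 m along x each step

true_pose = np.zeros(2)
est_pose = np.zeros(2)
errors = []
for _ in range(steps):
    true_pose = true_pose + true_step
    # each relative measurement carries a small error ...
    noisy_step = true_step + rng.normal(0.0, 0.05, size=2)
    est_pose = est_pose + noisy_step
    # ... and the errors accumulate because nothing anchors the estimate
    errors.append(np.linalg.norm(est_pose - true_pose))

print(f"error after 10 steps:  {errors[9]:.3f} m")
print(f"error after 200 steps: {errors[-1]:.3f} m")
```

A loop closure would add a constraint tying the current pose back to an earlier one, letting an optimizer redistribute exactly this accumulated error.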

Visual odometry (VO)

Visual odometry estimates frame-to-frame motion incrementally. It is SLAM minus global optimization and loop closure, and it often serves as the front end of a full SLAM system.
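At its core, VO is the composition of per-frame relative transforms. A minimal SE(2) sketch with synthetic increments (drive 1 m forward, turn 90 degrees, four times) shows the chaining; real VO would plug estimated transforms into the same loop:

```python
import numpy as np

def se2(theta, x, y):
    """Homogeneous 2-D rigid transform: rotation theta, translation (x, y)."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, x], [s, c, y], [0, 0, 1]])

# Four forward-then-turn increments trace a square and return to the start
pose = np.eye(3)
for _ in range(4):
    pose = pose @ se2(0.0, 1.0, 0.0) @ se2(np.pi / 2, 0.0, 0.0)

print(np.round(pose, 6))  # identity: back at the origin, facing the same way
```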

Sensor fusion

Common pairings are IMU + camera (visual-inertial odometry, VIO) and LiDAR + camera. A filter (e.g., an EKF) or a factor graph fuses the sensors' predictions and measurements into one state estimate.
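The fusion idea can be sketched with a 1-D linear Kalman filter: a constant-velocity motion model plays the role of the IMU-style prediction, and noisy position fixes play the role of camera measurements. All matrices and numbers are illustrative, not a real VIO configuration:

```python
import numpy as np

dt = 0.1
F = np.array([[1, dt], [0, 1]])   # motion model on state [position, velocity]
H = np.array([[1.0, 0.0]])        # measurement observes position only
Q = np.diag([1e-4, 1e-3])         # process noise (trust in the prediction)
Rm = np.array([[0.05]])           # measurement noise (trust in the fix)

x = np.zeros(2)
P = np.eye(2)
for z in [0.1, 0.22, 0.29, 0.41, 0.52]:   # synthetic noisy position fixes
    # predict: propagate state and uncertainty through the motion model
    x = F @ x
    P = F @ P @ F.T + Q
    # update: blend in the measurement, weighted by the Kalman gain
    S = H @ P @ H.T + Rm
    K = P @ H.T @ np.linalg.inv(S)
    x = x + (K @ (np.array([z]) - H @ x)).ravel()
    P = (np.eye(2) - K @ H) @ P

print(f"fused position {x[0]:.3f}, velocity {x[1]:.3f}")
```

Factor-graph systems replace this recursive filter with a batch optimization over many poses, but the weighting-by-uncertainty principle is the same.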

Typical visual SLAM pipeline

  1. Extract and match features (ORB, FAST+BRIEF, …).
  2. Estimate motion between frames (essential/fundamental matrix, PnP with map points).
  3. Triangulate new 3D landmarks; local bundle adjustment.
  4. Detect loops; optimize pose graph or global BA.
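Step 3's triangulation can be sketched with a linear (DLT) solve on synthetic, noise-free data. The intrinsics, baseline, and landmark below are made up; with perfect correspondences the null space of the stacked constraints recovers the 3-D point exactly:

```python
import numpy as np

K = np.array([[500.0, 0, 320], [0, 500.0, 240], [0, 0, 1]])
P1 = K @ np.hstack([np.eye(3), np.zeros((3, 1))])   # first camera at origin
t = np.array([[-1.0], [0.0], [0.0]])                # synthetic 1 m baseline
P2 = K @ np.hstack([np.eye(3), t])

X = np.array([0.3, -0.2, 4.0, 1.0])                 # ground-truth landmark

def project(P, X):
    x = P @ X
    return x[:2] / x[2]

x1, x2 = project(P1, X), project(P2, X)             # its two observations

# DLT: each view contributes two linear constraints on X; the solution is
# the right singular vector for the smallest singular value
A = np.vstack([
    x1[0] * P1[2] - P1[0],
    x1[1] * P1[2] - P1[1],
    x2[0] * P2[2] - P2[0],
    x2[1] * P2[2] - P2[1],
])
Xh = np.linalg.svd(A)[2][-1]
Xh = Xh / Xh[3]
print(np.round(Xh[:3], 6))   # recovers [0.3, -0.2, 4.0]
```

Local bundle adjustment then refines these triangulated points jointly with the recent camera poses by minimizing reprojection error.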

Example: match two frames

import cv2
import numpy as np

# Load two consecutive frames in grayscale
g1 = cv2.imread("frame0.jpg", cv2.IMREAD_GRAYSCALE)
g2 = cv2.imread("frame1.jpg", cv2.IMREAD_GRAYSCALE)

# Detect up to 1000 ORB keypoints and compute their binary descriptors
orb = cv2.ORB_create(1000)
k1, d1 = orb.detectAndCompute(g1, None)
k2, d2 = orb.detectAndCompute(g2, None)

# Brute-force Hamming matching with cross-check; keep the 200 best matches
bf = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
matches = bf.match(d1, d2)
matches = sorted(matches, key=lambda m: m.distance)[:200]

# Pixel coordinates of the matched keypoints in each frame
pts1 = np.float32([k1[m.queryIdx].pt for m in matches])
pts2 = np.float32([k2[m.trainIdx].pt for m in matches])

Essential matrix and recoverPose

With calibrated intrinsics K, the essential matrix E relates normalized correspondences via the epipolar constraint x2'ᵀ E x1' = 0. recoverPose decomposes E into the four candidate (R, t) pairs and selects the one that places triangulated points in front of both cameras; for monocular input, t is known only up to scale.

# Intrinsics from calibration (fx, fy, cx, cy are placeholders;
# substitute your calibrated values)
K = np.array([[fx, 0, cx], [0, fy, cy], [0, 0, 1]], dtype=np.float64)

# RANSAC essential-matrix estimation; mask flags inlier correspondences
E, mask = cv2.findEssentialMat(pts1, pts2, K, method=cv2.RANSAC, prob=0.999, threshold=1.0)
in1 = pts1[mask.ravel() == 1]
in2 = pts2[mask.ravel() == 1]

# Decompose E and keep the cheirality-consistent R, t
_, R, t, _ = cv2.recoverPose(E, in1, in2, K)
# R, t: motion from frame-1 to frame-2 coordinates; t has unit norm for
# monocular input (convention: verify against the OpenCV docs for your version)
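Chaining such per-frame estimates into a trajectory means composing the inverse motions, since under the points-from-frame-1-to-frame-2 convention (x2 = R x1 + t) the camera itself moves by the inverse transform. A sketch with a hypothetical identity-rotation, unit-translation sequence standing in for recoverPose output:

```python
import numpy as np

# Hypothetical per-frame recoverPose outputs: no rotation, unit-norm step.
# With x2 = R @ x1 + t, the camera's world pose advances by the inverse motion.
R_rel = np.eye(3)
t_rel = np.array([0.0, 0.0, 1.0])

R_w, t_w = np.eye(3), np.zeros(3)        # world pose of the first camera
for _ in range(5):
    R_inv, t_inv = R_rel.T, -R_rel.T @ t_rel
    t_w = R_w @ t_inv + t_w
    R_w = R_w @ R_inv

# Camera centre ends at z = -5: under this convention t points opposite
# the camera's own motion, which is a common source of sign confusion
print(np.round(t_w, 6))
```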

Fundamental matrix (uncalibrated)

# Without K, estimate F directly from pixel coordinates; mask flags inliers
F, mask = cv2.findFundamentalMat(pts1, pts2, cv2.FM_RANSAC, 1.0, 0.99)
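The constraint both matrices encode can be checked numerically. Here F is assembled from a synthetic calibrated setup (made-up K, R, t) via F = K⁻ᵀ [t]ₓ R K⁻¹, and a corresponding pixel pair satisfies x2ᵀ F x1 = 0:

```python
import numpy as np

def skew(v):
    """Cross-product matrix [v]x, so that skew(v) @ w == np.cross(v, w)."""
    return np.array([[0, -v[2], v[1]], [v[2], 0, -v[0]], [-v[1], v[0], 0]])

# Synthetic calibrated two-view setup (all values illustrative)
K = np.array([[500.0, 0, 320], [0, 500.0, 240], [0, 0, 1]])
R = np.eye(3)
t = np.array([1.0, 0.0, 0.0])
E = skew(t) @ R                              # essential matrix
F = np.linalg.inv(K).T @ E @ np.linalg.inv(K)  # fundamental matrix

X = np.array([0.5, 0.2, 5.0])                # a 3-D scene point
x1 = K @ X;           x1 = x1 / x1[2]        # its pixel in view 1
x2 = K @ (R @ X + t); x2 = x2 / x2[2]        # its pixel in view 2

print(abs(x2 @ F @ x1))                      # ~0: epipolar constraint holds
```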

Frameworks to explore

  • ORB-SLAM3 — feature-based, mono/stereo/RGB-D, loop closing.
  • OpenVINS, VINS-Mono — tightly coupled VIO.
  • RTAB-Map — RGB-D mapping with ROS integration.

Takeaways

  • SLAM = online pose + map; drift needs loop closure or global sensors.
  • Two-view geometry (E, recoverPose) is a VO building block.
  • Production SLAM uses factor graphs / BA—not just pairwise estimates.

Quick FAQ

Why is monocular translation only known up to scale? Translation from two views is known only up to scale without a metric reference such as a stereo baseline, an IMU, or an object of known size.
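A quick numeric check of the ambiguity: scaling the scene and the baseline by the same factor leaves every pixel unchanged, so the images alone cannot pin down the scale (K, baseline, and point are synthetic):

```python
import numpy as np

K = np.array([[500.0, 0, 320], [0, 500.0, 240], [0, 0, 1]])
t = np.array([1.0, 0.0, 0.0])     # baseline of the second camera
X = np.array([0.5, 0.2, 5.0])     # a scene point

pixels = []
for s in (1.0, 3.0):              # scale scene and baseline together
    x2 = K @ (s * X + s * t)
    pixels.append(x2[:2] / x2[2])

print(pixels[0], pixels[1])       # identical projections for both scales
```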

How does SLAM differ from Structure-from-Motion? Structure-from-Motion typically processes an unordered photo collection offline; SLAM stresses real-time operation and sequential state estimation.