Computer Vision Chapter 28

SLAM

SLAM (Simultaneous Localization and Mapping) estimates the sensor trajectory and a map of the environment at the same time—classic in robotics and AR. Visual SLAM uses cameras (monocular, stereo, RGB-D). Building blocks include feature tracking, two-view / multi-view geometry, bundle adjustment, loop closure to correct drift, and pose graphs. Full systems (ORB-SLAM, VINS, LSD-SLAM) are large; this chapter explains the idea and shows OpenCV two-frame motion from matched points—one step inside visual odometry.

Localization vs mapping

Localization: given a map, where am I? Mapping: given poses, what does the world look like? SLAM couples both because each helps the other: a better map improves pose; better poses improve the map. Without global constraints, integrating only relative motion drifts—loop closure detects revisits and distributes error.
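The effect of drift is easy to see in a toy dead-reckoning simulation: integrating only noisy relative motions, the pose error grows with path length (the step size and noise level below are illustrative, not taken from any real odometer):

```python
import numpy as np

rng = np.random.default_rng(0)
steps = 200
true_step = np.array([1.0, 0.0])   # robot moves 1 m along x each step

true_pose = np.zeros(2)
est_pose = np.zeros(2)
errors = []
for _ in range(steps):
    true_pose = true_pose + true_step
    # each relative measurement carries a small error ...
    noisy_step = true_step + rng.normal(0.0, 0.05, size=2)
    est_pose = est_pose + noisy_step
    # ... and the errors accumulate because nothing anchors the estimate
    errors.append(np.linalg.norm(est_pose - true_pose))

print(f"error after 10 steps:  {errors[9]:.3f} m")
print(f"error after 200 steps: {errors[-1]:.3f} m")
```

A loop closure would add a constraint tying the current pose back to an earlier one, letting an optimizer redistribute exactly this accumulated error.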

Visual odometry (VO)

Visual odometry estimates frame-to-frame motion incrementally. It is SLAM minus global optimization and loop closure, and it often serves as the front end of a full SLAM system.
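At its core, VO is the composition of per-frame relative transforms. A minimal SE(2) sketch with synthetic increments (drive 1 m forward, turn 90 degrees, four times) shows the chaining; real VO would plug estimated transforms into the same loop:

```python
import numpy as np

def se2(theta, x, y):
    """Homogeneous 2-D rigid transform: rotation theta, translation (x, y)."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, x], [s, c, y], [0, 0, 1]])

# Four forward-then-turn increments trace a square and return to the start
pose = np.eye(3)
for _ in range(4):
    pose = pose @ se2(0.0, 1.0, 0.0) @ se2(np.pi / 2, 0.0, 0.0)

print(np.round(pose, 6))  # identity: back at the origin, facing the same way
```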

Sensor fusion

Common pairings are IMU + camera (visual-inertial odometry, VIO) and LiDAR + camera. A filter (e.g., an EKF) or a factor graph fuses the sensors' predictions and measurements into one state estimate.
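The fusion idea can be sketched with a 1-D linear Kalman filter: a constant-velocity motion model plays the role of the IMU-style prediction, and noisy position fixes play the role of camera measurements. All matrices and numbers are illustrative, not a real VIO configuration:

```python
import numpy as np

dt = 0.1
F = np.array([[1, dt], [0, 1]])   # motion model on state [position, velocity]
H = np.array([[1.0, 0.0]])        # measurement observes position only
Q = np.diag([1e-4, 1e-3])         # process noise (trust in the prediction)
Rm = np.array([[0.05]])           # measurement noise (trust in the fix)

x = np.zeros(2)
P = np.eye(2)
for z in [0.1, 0.22, 0.29, 0.41, 0.52]:   # synthetic noisy position fixes
    # predict: propagate state and uncertainty through the motion model
    x = F @ x
    P = F @ P @ F.T + Q
    # update: blend in the measurement, weighted by the Kalman gain
    S = H @ P @ H.T + Rm
    K = P @ H.T @ np.linalg.inv(S)
    x = x + (K @ (np.array([z]) - H @ x)).ravel()
    P = (np.eye(2) - K @ H) @ P

print(f"fused position {x[0]:.3f}, velocity {x[1]:.3f}")
```

Factor-graph systems replace this recursive filter with a batch optimization over many poses, but the weighting-by-uncertainty principle is the same.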

Typical visual SLAM pipeline

  1. Extract and match features (ORB, FAST+BRIEF, …).
  2. Estimate motion between frames (essential/fundamental matrix, PnP with map points).
  3. Triangulate new 3D landmarks; local bundle adjustment.
  4. Detect loops; optimize pose graph or global BA.
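Step 3's triangulation can be sketched with a linear (DLT) solve on synthetic, noise-free data. The intrinsics, baseline, and landmark below are made up; with perfect correspondences the null space of the stacked constraints recovers the 3-D point exactly:

```python
import numpy as np

K = np.array([[500.0, 0, 320], [0, 500.0, 240], [0, 0, 1]])
P1 = K @ np.hstack([np.eye(3), np.zeros((3, 1))])   # first camera at origin
t = np.array([[-1.0], [0.0], [0.0]])                # synthetic 1 m baseline
P2 = K @ np.hstack([np.eye(3), t])

X = np.array([0.3, -0.2, 4.0, 1.0])                 # ground-truth landmark

def project(P, X):
    x = P @ X
    return x[:2] / x[2]

x1, x2 = project(P1, X), project(P2, X)             # its two observations

# DLT: each view contributes two linear constraints on X; the solution is
# the right singular vector for the smallest singular value
A = np.vstack([
    x1[0] * P1[2] - P1[0],
    x1[1] * P1[2] - P1[1],
    x2[0] * P2[2] - P2[0],
    x2[1] * P2[2] - P2[1],
])
Xh = np.linalg.svd(A)[2][-1]
Xh = Xh / Xh[3]
print(np.round(Xh[:3], 6))   # recovers [0.3, -0.2, 4.0]
```

Local bundle adjustment then refines these triangulated points jointly with the recent camera poses by minimizing reprojection error.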

Example: match two frames

import cv2
import numpy as np

# Load two consecutive frames in grayscale
g1 = cv2.imread("frame0.jpg", cv2.IMREAD_GRAYSCALE)
g2 = cv2.imread("frame1.jpg", cv2.IMREAD_GRAYSCALE)

# Detect up to 1000 ORB keypoints and compute their binary descriptors
orb = cv2.ORB_create(1000)
k1, d1 = orb.detectAndCompute(g1, None)
k2, d2 = orb.detectAndCompute(g2, None)

# Brute-force Hamming matching with cross-check; keep the 200 best matches
bf = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
matches = bf.match(d1, d2)
matches = sorted(matches, key=lambda m: m.distance)[:200]

# Pixel coordinates of the matched keypoints in each frame
pts1 = np.float32([k1[m.queryIdx].pt for m in matches])
pts2 = np.float32([k2[m.trainIdx].pt for m in matches])

Essential matrix and recoverPose

With calibrated intrinsics K, the essential matrix E relates normalized correspondences via the epipolar constraint x2'ᵀ E x1' = 0. recoverPose decomposes E into the four candidate (R, t) pairs and selects the one that places triangulated points in front of both cameras; for monocular input, t is known only up to scale.

# Intrinsics from calibration (fx, fy, cx, cy are placeholders;
# substitute your calibrated values)
K = np.array([[fx, 0, cx], [0, fy, cy], [0, 0, 1]], dtype=np.float64)

# RANSAC essential-matrix estimation; mask flags inlier correspondences
E, mask = cv2.findEssentialMat(pts1, pts2, K, method=cv2.RANSAC, prob=0.999, threshold=1.0)
in1 = pts1[mask.ravel() == 1]
in2 = pts2[mask.ravel() == 1]

# Decompose E and keep the cheirality-consistent R, t
_, R, t, _ = cv2.recoverPose(E, in1, in2, K)
# R, t: motion from frame-1 to frame-2 coordinates; t has unit norm for
# monocular input (convention: verify against the OpenCV docs for your version)
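Chaining such per-frame estimates into a trajectory means composing the inverse motions, since under the points-from-frame-1-to-frame-2 convention (x2 = R x1 + t) the camera itself moves by the inverse transform. A sketch with a hypothetical identity-rotation, unit-translation sequence standing in for recoverPose output:

```python
import numpy as np

# Hypothetical per-frame recoverPose outputs: no rotation, unit-norm step.
# With x2 = R @ x1 + t, the camera's world pose advances by the inverse motion.
R_rel = np.eye(3)
t_rel = np.array([0.0, 0.0, 1.0])

R_w, t_w = np.eye(3), np.zeros(3)        # world pose of the first camera
for _ in range(5):
    R_inv, t_inv = R_rel.T, -R_rel.T @ t_rel
    t_w = R_w @ t_inv + t_w
    R_w = R_w @ R_inv

# Camera centre ends at z = -5: under this convention t points opposite
# the camera's own motion, which is a common source of sign confusion
print(np.round(t_w, 6))
```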

Fundamental matrix (uncalibrated)

# Without K, estimate F directly from pixel coordinates; mask flags inliers
F, mask = cv2.findFundamentalMat(pts1, pts2, cv2.FM_RANSAC, 1.0, 0.99)
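The constraint both matrices encode can be checked numerically. Here F is assembled from a synthetic calibrated setup (made-up K, R, t) via F = K⁻ᵀ [t]ₓ R K⁻¹, and a corresponding pixel pair satisfies x2ᵀ F x1 = 0:

```python
import numpy as np

def skew(v):
    """Cross-product matrix [v]x, so that skew(v) @ w == np.cross(v, w)."""
    return np.array([[0, -v[2], v[1]], [v[2], 0, -v[0]], [-v[1], v[0], 0]])

# Synthetic calibrated two-view setup (all values illustrative)
K = np.array([[500.0, 0, 320], [0, 500.0, 240], [0, 0, 1]])
R = np.eye(3)
t = np.array([1.0, 0.0, 0.0])
E = skew(t) @ R                              # essential matrix
F = np.linalg.inv(K).T @ E @ np.linalg.inv(K)  # fundamental matrix

X = np.array([0.5, 0.2, 5.0])                # a 3-D scene point
x1 = K @ X;           x1 = x1 / x1[2]        # its pixel in view 1
x2 = K @ (R @ X + t); x2 = x2 / x2[2]        # its pixel in view 2

print(abs(x2 @ F @ x1))                      # ~0: epipolar constraint holds
```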

Frameworks to explore

  • ORB-SLAM3 — feature-based, mono/stereo/RGB-D, loop closing.
  • OpenVINS, VINS-Mono — tightly coupled VIO.
  • RTAB-Map — RGB-D mapping with ROS integration.

Takeaways

  • SLAM = online pose + map; drift needs loop closure or global sensors.
  • Two-view geometry (E, recoverPose) is a VO building block.
  • Production SLAM uses factor graphs / BA—not just pairwise estimates.

Quick FAQ

Why is monocular translation only known up to scale? Translation from two views is known only up to scale without a metric reference such as a stereo baseline, an IMU, or an object of known size.
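A quick numeric check of the ambiguity: scaling the scene and the baseline by the same factor leaves every pixel unchanged, so the images alone cannot pin down the scale (K, baseline, and point are synthetic):

```python
import numpy as np

K = np.array([[500.0, 0, 320], [0, 500.0, 240], [0, 0, 1]])
t = np.array([1.0, 0.0, 0.0])     # baseline of the second camera
X = np.array([0.5, 0.2, 5.0])     # a scene point

pixels = []
for s in (1.0, 3.0):              # scale scene and baseline together
    x2 = K @ (s * X + s * t)
    pixels.append(x2[:2] / x2[2])

print(pixels[0], pixels[1])       # identical projections for both scales
```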

How does SLAM differ from Structure-from-Motion? Structure-from-Motion typically processes an unordered photo collection offline; SLAM stresses real-time operation and sequential state estimation.