Object tracking basics
Detection vs tracking
Detection classifies and localizes all objects each frame—independent of history. Tracking exploits temporal continuity: prediction from the previous frame reduces search cost and stabilizes identity. Hybrid pipelines run a detector every N frames and a cheap tracker in between, or fuse detections with Kalman prediction and Hungarian matching.
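The "detector every N frames, cheap tracker in between" pattern can be sketched as follows. This is only an illustration under stated assumptions: `detect` and `make_tracker` are hypothetical callables (not OpenCV functions) standing in for a real detector and for a factory that returns an object with `init` / `update` methods like the OpenCV trackers.

```python
def run_hybrid(frames, detect, make_tracker, every_n=10):
    """Run a (hypothetical) detector every `every_n` frames, a tracker in between.

    detect(frame)   -> (x, y, w, h) or None
    make_tracker()  -> object with init(frame, bbox) and update(frame) -> (ok, bbox)
    Yields one bbox (or None) per frame.
    """
    tracker = None
    for i, frame in enumerate(frames):
        bbox = None
        # Detect on schedule, or whenever there is no live tracker to rely on.
        if i % every_n == 0 or tracker is None:
            bbox = detect(frame)
            if bbox is not None:
                tracker = make_tracker()
                tracker.init(frame, bbox)
            else:
                tracker = None
        else:
            ok, bbox = tracker.update(frame)
            if not ok:
                # Tracker lost the target: drop it so the next frame re-detects.
                tracker, bbox = None, None
        yield bbox
```

Re-detecting as soon as the tracker reports failure keeps the cheap path from coasting on a lost target; the value of `every_n` is a placeholder to tune against your FPS budget.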
Single-object
One initialized box; tracker updates each frame—OpenCV Tracker* API.
Multi-object (MOT)
Many IDs; needs association to match detections to trajectories across frames.
OpenCV: CSRT tracker (example)
CSRT (a discriminative correlation filter with Channel and Spatial Reliability) is more accurate but slower than KCF. On recent OpenCV 4.x builds, the original tracker API often lives under cv2.legacy.
import cv2

cap = cv2.VideoCapture("clip.mp4")
ok, frame = cap.read()
if not ok:
    raise SystemExit("Could not read the first frame of clip.mp4")

# Draw the initial box by hand; selectROI returns (x, y, w, h).
bbox = cv2.selectROI("ROI", frame, showCrosshair=True, fromCenter=False)
cv2.destroyWindow("ROI")

tracker = cv2.legacy.TrackerCSRT_create()
tracker.init(frame, bbox)

while True:
    ok, frame = cap.read()
    if not ok:
        break  # end of video
    ok, bbox = tracker.update(frame)
    if ok:
        x, y, w, h = [int(v) for v in bbox]
        cv2.rectangle(frame, (x, y), (x + w, y + h), (0, 255, 0), 2)
    cv2.imshow("track", frame)
    if cv2.waitKey(1) == 27:  # Esc quits
        break

cap.release()
cv2.destroyAllWindows()
If cv2.legacy is missing, try cv2.TrackerCSRT_create() on older builds, or install opencv-contrib-python.
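The fallback just described can be wrapped in a small helper. This is a sketch, not an OpenCV API: `create_tracker` is a hypothetical name, and it takes the `cv2` module as an argument so the lookup logic can be exercised without OpenCV installed. Note that the legacy and top-level tracker classes may differ slightly in what `init` and `update` return.

```python
def create_tracker(cv2_module, name="CSRT"):
    """Return a tracker from cv2.legacy if present, else from the top-level namespace."""
    factory = f"Tracker{name}_create"
    for namespace in (getattr(cv2_module, "legacy", None), cv2_module):
        if namespace is not None and hasattr(namespace, factory):
            return getattr(namespace, factory)()
    raise RuntimeError(f"{factory} not found; is opencv-contrib-python installed?")

# Usage with the real module (assumes `import cv2` succeeded):
#   tracker = create_tracker(cv2, "CSRT")
```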
Faster option: KCF
tracker = cv2.legacy.TrackerKCF_create()
tracker.init(frame, bbox)
KCF is faster; CSRT handles deformation and occlusion slightly better. MOSSE is older and very fast but brittle on scale change.
When trackers fail
- Drift — model updates on wrong pixels; use conservative learning rates or stop updating on low confidence.
- Occlusion / motion blur — switch to detection-based re-id (DeepSORT) or manual re-init.
- Scale / out-of-plane rotation — use scale-pyramid extensions or bounding-box regression from a detector.
MIL tracker (brief)
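MIL (Multiple Instance Learning) trains its classifier on "bags" of patches sampled around the box instead of a single positive patch, so a slightly misaligned box still contributes a useful positive example. That makes it more robust to slight misalignment than naive correlation trackers. Create with cv2.legacy.TrackerMIL_create() where available.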
Takeaways
- Classic OpenCV trackers = single-object, short-term, init once.
- For many objects + IDs, combine a detector with SORT / DeepSORT.
- Profile CSRT vs KCF on your resolution and FPS budget.
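One rough way to do that profiling is to time `update` over a fixed set of frames. The helper below is a sketch with a hypothetical name (`mean_update_ms`); it assumes the tracker has already been initialized and only needs an `update(frame)` method.

```python
import time

def mean_update_ms(tracker, frames):
    """Average wall-clock time of tracker.update(frame), in milliseconds."""
    frames = list(frames)
    if not frames:
        raise ValueError("need at least one frame")
    start = time.perf_counter()
    for frame in frames:
        tracker.update(frame)
    return (time.perf_counter() - start) * 1000.0 / len(frames)
```

Compare the result with the per-frame budget, 1000 / FPS milliseconds, remembering that decoding and drawing also consume part of that budget.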
Quick FAQ
- […] init with its ROI, and call update per frame. For consistent IDs across occlusions, prefer detection + association (next chapters).
- […] (cv2.meanShift, CamShift) work on controlled color distributions; modern pipelines usually prefer learned trackers or detectors.
Kalman filter
State, model, noise
Discrete-time model: x_k = F x_{k-1} + w_k with process noise w_k (covariance Q); measurement z_k = H x_k + v_k with measurement noise v_k (covariance R). The filter maintains the mean and covariance of the state estimate. F encodes constant-velocity or constant-acceleration assumptions; H picks which state components we observe (e.g. only position).
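For reference, the standard predict and correct steps for this model can be written out as below. The notation is an assumption layered on the text: Q and R denote the covariances of the process noise w and the measurement noise v, P is the state covariance, and K is the Kalman gain.

```latex
\begin{aligned}
\text{Predict:}\quad
  \hat{x}_{k|k-1} &= F\,\hat{x}_{k-1|k-1} \\
  P_{k|k-1}       &= F\,P_{k-1|k-1}\,F^{\top} + Q \\[4pt]
\text{Correct:}\quad
  K_k             &= P_{k|k-1}\,H^{\top}\left(H\,P_{k|k-1}\,H^{\top} + R\right)^{-1} \\
  \hat{x}_{k|k}   &= \hat{x}_{k|k-1} + K_k\left(z_k - H\,\hat{x}_{k|k-1}\right) \\
  P_{k|k}         &= \left(I - K_k H\right)P_{k|k-1}
\end{aligned}
```

The gain K_k is what the Q and R settings ultimately control: it grows when the predicted covariance is large relative to R, and shrinks when R dominates.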
Process noise Q
Larger Q → trust motion model less, follow measurements more (jittery but responsive).
Measurement noise R
Larger R → smoother state, slower to react to true maneuvers.
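The Q/R trade-off is easiest to see in one dimension. The sketch below is a scalar Kalman filter with a static-state model (F = H = 1), written in plain Python for illustration; the function name and the specific q and r values are arbitrary choices, not recommendations.

```python
def scalar_kalman(measurements, q, r, x0=0.0, p0=1.0):
    """1D Kalman filter with a static-state model (F = H = 1)."""
    x, p = x0, p0
    estimates = []
    for z in measurements:
        p = p + q              # predict: uncertainty grows by Q
        k = p / (p + r)        # gain: near 1 trusts z, near 0 trusts the model
        x = x + k * (z - x)    # correct the mean
        p = (1.0 - k) * p      # correct the variance
        estimates.append(x)
    return estimates

step = [0.0] * 5 + [10.0] * 5  # a sudden jump in the true value
responsive = scalar_kalman(step, q=1.0, r=0.1)   # large Q, small R
smooth = scalar_kalman(step, q=0.01, r=10.0)     # small Q, large R
```

With large Q and small R the estimate follows the jump almost immediately; with the opposite setting it moves only a small fraction of the way per step.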
2D centroid with constant velocity
State [cx, cy, vx, vy]ᵀ; measurement [cx, cy]ᵀ. Time step assumes unit frame interval (scale F if you know Δt).
import cv2
import numpy as np

kf = cv2.KalmanFilter(4, 2)  # 4 state dims, 2 measurement dims
kf.transitionMatrix = np.array([
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
    [0, 0, 0, 1],
], dtype=np.float32)
kf.measurementMatrix = np.array([
    [1, 0, 0, 0],
    [0, 1, 0, 0],
], dtype=np.float32)
kf.processNoiseCov = np.eye(4, dtype=np.float32) * 1e-2
kf.measurementNoiseCov = np.eye(2, dtype=np.float32) * 1e-1
kf.errorCovPost = np.eye(4, dtype=np.float32)

# First observation (e.g. detector center)
z = np.array([[120.0], [80.0]], dtype=np.float32)
kf.statePost = np.array([[z[0, 0]], [z[1, 0]], [0], [0]], dtype=np.float32)

pred = kf.predict()
z2 = np.array([[125.0], [82.0]], dtype=np.float32)
est = kf.correct(z2)
# est[0:2] = smoothed center
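The transition matrix above assumes Δt = 1 frame. If the real time step is known, the velocity terms scale with it. The helper below is a hypothetical sketch in plain Python (`cv_transition` and `apply` are illustrative names); it returns nested lists so the model can be checked by hand.

```python
def cv_transition(dt):
    """Constant-velocity transition for [cx, cy, vx, vy]: position += velocity * dt."""
    return [
        [1.0, 0.0, dt, 0.0],
        [0.0, 1.0, 0.0, dt],
        [0.0, 0.0, 1.0, 0.0],
        [0.0, 0.0, 0.0, 1.0],
    ]

def apply(F, x):
    """Plain matrix-vector product, used here only to check the model by hand."""
    return [sum(f * v for f, v in zip(row, x)) for row in F]

# Example: 30 fps video with velocities expressed in pixels per second.
F = cv_transition(1.0 / 30.0)
moved = apply(F, [120.0, 80.0, 60.0, -30.0])
```

With OpenCV, the result can be assigned as `kf.transitionMatrix = np.array(cv_transition(dt), dtype=np.float32)`. The process noise usually needs retuning when the time step changes.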
Per-frame loop pattern
# Each video frame:
prediction = kf.predict()
if detection_available:
    # (mx, my) is the measured center from the detector
    z = np.array([[mx], [my]], dtype=np.float32)
    state = kf.correct(z)
else:
    state = prediction  # coast on prediction (occlusion)
Long occlusions need gating or a separate re-acquisition module: without measurements, the predicted state keeps drifting and its uncertainty grows every frame.
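A minimal guard against coasting too long might look like the sketch below. Both names (`CoastLimiter`, `inside_gate`) and the default limit of 15 frames are assumptions for illustration. The gate uses a plain Euclidean radius, a simplification of the covariance-aware gating a full tracker would use.

```python
import math

class CoastLimiter:
    """Counts consecutive frames without a measurement and flags a track as lost."""

    def __init__(self, max_coast=15):
        self.max_coast = max_coast
        self.missed = 0

    def step(self, detection_available):
        """Call once per frame; returns True while the track is still trustworthy."""
        self.missed = 0 if detection_available else self.missed + 1
        return self.missed <= self.max_coast

def inside_gate(predicted_xy, measured_xy, radius):
    """Accept a measurement only if it lies within `radius` pixels of the prediction."""
    dx = measured_xy[0] - predicted_xy[0]
    dy = measured_xy[1] - predicted_xy[1]
    return math.hypot(dx, dy) <= radius
```

When `step` returns False, the loop would stop trusting the prediction and hand over to re-acquisition (for example, a full-frame detector pass).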
Beyond linear Kalman
Nonlinear motion or sensors use Extended Kalman Filter (EKF) or Unscented Kalman Filter (UKF). Particle filters handle multi-modal uncertainty. Libraries like filterpy offer EKF/UKF in Python; OpenCV focuses on the linear case.
Takeaways
- Predict then correct each step; tune Q and R for your noise and frame rate.
- Great for smoothing box centers before association (SORT).
- Use EKF/UKF when models are nonlinear (camera projection, IMU fusion).
Quick FAQ
- […] measurementMatrix matches measurement shape. Numerical issues: try smaller process noise or double precision.
- […] [cx, cy, w, h, vx, vy, …] with block-diagonal F or separate filters per quantity; measurements become box from detector.
SORT & DeepSORT
SORT pipeline
- Detect: bounding boxes + scores from YOLO, Faster R-CNN, etc.
- Predict: each active track advances its Kalman state (e.g. box center and scale velocity).
- Associate: build a cost matrix (often 1 − IoU between predicted boxes and detections); solve assignment with Hungarian / linear sum assignment.
- Update: matched tracks get Kalman correction; unmatched detections spawn new tracks; unmatched tracks age out after T frames.
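The update step in the list above (correct matched tracks, spawn new ones, age out stale ones) can be sketched independently of the Kalman and assignment details. Everything here is illustrative: `Track`, `update_tracks`, and `max_age` (the T in the text) are hypothetical names, and copying the detection box stands in for the Kalman correction.

```python
from dataclasses import dataclass
from itertools import count

@dataclass
class Track:
    track_id: int
    box: tuple
    misses: int = 0

_next_id = count(1)  # simple global ID source for this sketch

def update_tracks(tracks, detections, matches, max_age=3):
    """Apply one SORT-style update given (track_index, detection_index) matches."""
    matched_t = {t for t, _ in matches}
    matched_d = {d for _, d in matches}
    for t, d in matches:
        tracks[t].box = detections[d]   # stand-in for the Kalman correction
        tracks[t].misses = 0
    for t, track in enumerate(tracks):
        if t not in matched_t:
            track.misses += 1           # unmatched track ages by one frame
    survivors = [tr for tr in tracks if tr.misses <= max_age]
    for d, box in enumerate(detections):
        if d not in matched_d:
            survivors.append(Track(next(_next_id), box))  # spawn a new track
    return survivors
```

Real implementations typically also require a few consecutive hits before a new track is reported, which suppresses one-frame false positives.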
IoU cost and assignment (sketch)
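import numpy as np
from scipy.optimize import linear_sum_assignment

IOU_MIN = 0.1   # pairs at or below this overlap are gated out
GATED = 99.0    # large cost so the solver avoids gated pairs

def iou(a, b):
    """IoU of two boxes given as [x1, y1, x2, y2]."""
    x1 = max(a[0], b[0]); y1 = max(a[1], b[1])
    x2 = min(a[2], b[2]); y2 = min(a[3], b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    ua = (a[2]-a[0])*(a[3]-a[1]) + (b[2]-b[0])*(b[3]-b[1]) - inter
    return inter / (ua + 1e-6)

# pred_boxes[i], det_boxes[j] as [x1,y1,x2,y2] (illustrative values only)
pred_boxes = [[10, 10, 50, 50], [100, 100, 150, 150]]
det_boxes = [[102, 98, 152, 149], [12, 11, 52, 49]]

# cost[i,j] = 1 - iou(pred_i, det_j), or GATED when iou <= IOU_MIN
cost = np.full((len(pred_boxes), len(det_boxes)), GATED)
for i, p in enumerate(pred_boxes):
    for j, d in enumerate(det_boxes):
        v = iou(p, d)
        if v > IOU_MIN:
            cost[i, j] = 1.0 - v
row_ind, col_ind = linear_sum_assignment(cost)
# The solver still pairs gated entries; discard them afterwards.
matches = [(i, j) for i, j in zip(row_ind, col_ind) if cost[i, j] < GATED]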
Production code vectorizes IoU and applies min-cost thresholds; this shows the idea.
DeepSORT: motion + appearance
DeepSORT keeps a Kalman state in measurement space (including box aspect ratio and height) and gates candidate matches with the Mahalanobis distance between the predicted distribution and each detection. The pedestrian crop for each bbox is passed through a small CNN that outputs an L2-normalized embedding; the cosine distance between a track's gallery of stored embeddings and the detection provides a second association cue. The final cost blends motion and appearance (with tuned weights and gates).
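The blend of motion and appearance can be sketched in plain Python as below. Function names, the weight `lam`, and the appearance gate of 0.2 are placeholders rather than values from any particular implementation. The motion gate 9.4877 is the 95% chi-square quantile for four degrees of freedom, a common choice when gating a squared Mahalanobis distance on a 4D measurement.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]

def cosine_distance(a, b):
    """1 - cosine similarity; inputs are assumed to be L2-normalized."""
    return 1.0 - sum(x * y for x, y in zip(a, b))

def appearance_cost(gallery, embedding):
    """Smallest cosine distance between a detection and any stored track embedding."""
    return min(cosine_distance(g, embedding) for g in gallery)

def blended_cost(motion_cost, app_cost, lam=0.5, motion_gate=9.4877, app_gate=0.2):
    """Weighted sum of both cues; infinite if either cue fails its gate."""
    if motion_cost > motion_gate or app_cost > app_gate:
        return math.inf
    return lam * motion_cost + (1.0 - lam) * app_cost
```

Taking the minimum over the gallery lets a track match on any of its recent appearances, which helps when pose or lighting changed between frames.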
Why embeddings?
IoU fails when people cross paths—appearance helps disambiguate after split seconds of overlap.
Modern variants
StrongSORT, BoT-SORT, OC-SORT refine association and camera motion; check MOTChallenge leaderboards.
Libraries
Reference repos implement full SORT/DeepSORT (track lifecycle, cascade matching). Ultralytics and BoxMOT-style packages integrate trackers with YOLO outputs. For learning, read the original papers then trace one clean Python implementation.
MOT metrics (names)
MOTA combines false positives, false negatives, and ID switches. IDF1 emphasizes identity consistency. HOTA unifies detection and association quality. Benchmarks: MOT17, MOT20, DanceTrack.
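For orientation, the two simplest of these metrics reduce to short formulas once the counts are known. The sketch below assumes the counts (summed over all frames) have already been produced by an evaluation tool; it does not perform the matching that defines them.

```python
def mota(false_negatives, false_positives, id_switches, num_gt):
    """MOTA = 1 - (FN + FP + IDSW) / GT, with counts summed over all frames."""
    if num_gt <= 0:
        raise ValueError("num_gt must be positive")
    return 1.0 - (false_negatives + false_positives + id_switches) / num_gt

def idf1(idtp, idfp, idfn):
    """IDF1 = 2*IDTP / (2*IDTP + IDFP + IDFN), the F1 score over identity matches."""
    denom = 2 * idtp + idfp + idfn
    return 2 * idtp / denom if denom else 0.0
```

MOTA can go negative when errors outnumber ground-truth objects, which is one reason IDF1 and HOTA are usually reported alongside it.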
Takeaways
- SORT = detector + Kalman + IoU + Hungarian + track management.
- DeepSORT adds ReID embeddings for harder association.
- Detector quality dominates overall MOT performance.