Computer Vision Chapter 10

Object Tracking

Video tracking basics, Kalman filtering, SORT/DeepSORT, and modern multi-object tracking ideas.

Object tracking basics

Detection vs tracking

Detection classifies and localizes all objects each frame—independent of history. Tracking exploits temporal continuity: prediction from the previous frame reduces search cost and stabilizes identity. Hybrid pipelines run a detector every N frames and a cheap tracker in between, or fuse detections with Kalman prediction and Hungarian matching.
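
The detect-every-N pattern can be sketched as follows. This is only a scheduling skeleton: `detect` and `track` are hypothetical callables standing in for a real detector and a cheap tracker, and `detect_every` is an assumed parameter name.

```python
from typing import Callable, Iterable, List, Tuple

Box = Tuple[float, float, float, float]  # (x, y, w, h)


def hybrid_track(
    frames: Iterable[object],
    detect: Callable[[object], List[Box]],
    track: Callable[[object, List[Box]], List[Box]],
    detect_every: int = 5,
) -> List[Tuple[str, List[Box]]]:
    """Run `detect` on every `detect_every`-th frame and `track` otherwise.

    `detect(frame)` returns fresh boxes; `track(frame, boxes)` propagates the
    previous boxes to the current frame. Both are placeholders, not a library API.
    """
    results = []
    boxes: List[Box] = []
    for idx, frame in enumerate(frames):
        if idx % detect_every == 0:
            boxes = detect(frame)        # expensive, but resets accumulated drift
            results.append(("detect", boxes))
        else:
            boxes = track(frame, boxes)  # cheap, relies on temporal continuity
            results.append(("track", boxes))
    return results


# Stub detector and tracker so the sketch runs without a video file.
fake_detect = lambda frame: [(float(frame), 0.0, 10.0, 10.0)]
fake_track = lambda frame, boxes: [(x + 1.0, y, w, h) for x, y, w, h in boxes]

for source, boxes in hybrid_track(range(6), fake_detect, fake_track, detect_every=3):
    print(source, boxes)
```

A larger `detect_every` saves compute but lets tracker drift accumulate for longer between detector runs.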

Single-object

One initialized box; tracker updates each frame—OpenCV Tracker* API.

Multi-object (MOT)

Many IDs; needs association to match detections to trajectories across frames.

OpenCV: CSRT tracker (example)

CSRT (Channel and Spatial Reliability) is more accurate than KCF but slower. On recent OpenCV 4.x builds (4.5.1 and later), the classic tracker API lives under cv2.legacy, which requires opencv-contrib-python.

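import cv2

cap = cv2.VideoCapture("clip.mp4")
ok, frame = cap.read()
if not ok:
    raise SystemExit("could not read the first frame of clip.mp4")

# Draw the initial box by hand; returns (x, y, w, h)
bbox = cv2.selectROI("ROI", frame, showCrosshair=True, fromCenter=False)
cv2.destroyWindow("ROI")

tracker = cv2.legacy.TrackerCSRT_create()
tracker.init(frame, bbox)

while True:
    ok, frame = cap.read()
    if not ok:
        break  # end of video
    ok, bbox = tracker.update(frame)
    if ok:  # False means the tracker lost the target on this frame
        x, y, w, h = [int(v) for v in bbox]
        cv2.rectangle(frame, (x, y), (x + w, y + h), (0, 255, 0), 2)
    cv2.imshow("track", frame)
    if cv2.waitKey(1) & 0xFF == 27:  # Esc quits
        break

cap.release()
cv2.destroyAllWindows()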

If cv2.legacy is missing, try cv2.TrackerCSRT_create() on older builds, or install opencv-contrib-python.

Faster option: KCF

tracker = cv2.legacy.TrackerKCF_create()
tracker.init(frame, bbox)

KCF is faster; CSRT handles deformation and occlusion slightly better. MOSSE is older and very fast but brittle on scale change.

When trackers fail

  • Drift — model updates on wrong pixels; use conservative learning rates or stop updating on low confidence.
  • Occlusion / motion blur — switch to detection-based re-id (DeepSORT) or manual re-init.
  • Scale / out-of-plane rotation — use scale-pyramid extensions or bounding-box regression from a detector.

MIL tracker (brief)

MIL (Multiple Instance Learning) trains its classifier on bags of patches sampled around the box, where a positive bag only needs to contain one well-aligned patch. This makes it more robust to slight misalignment than naive correlation trackers. Create with cv2.legacy.TrackerMIL_create() where available.

Takeaways

  • Classic OpenCV trackers = single-object, short-term, init once.
  • For many objects + IDs, combine a detector with SORT / DeepSORT.
  • Profile CSRT vs KCF on your resolution and FPS budget.

Quick FAQ

To track several objects with the classic trackers, keep a list of tracker instances, initialize each with its own ROI, and call update on every tracker per frame. For consistent IDs across occlusions, prefer detection + association (see SORT & DeepSORT below).

Color-histogram methods in OpenCV (cv2.meanShift, cv2.CamShift) work when the target has a distinctive, stable color distribution; modern pipelines usually prefer learned trackers or detectors.

Kalman filter

State, model, noise

Discrete-time model: x_k = F x_{k-1} + w_k with process noise w_k ~ N(0, Q); measurement z_k = H x_k + v_k with measurement noise v_k ~ N(0, R). The filter maintains the mean and covariance of the state estimate. F encodes constant-velocity or constant-acceleration assumptions; H picks which state components we observe (e.g. only position).
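
The predict and correct steps implied by this model can be written out in NumPy. This is a textbook sketch, not OpenCV's implementation; the function names and the 1D example values are illustrative.

```python
import numpy as np


def kf_predict(x, P, F, Q):
    """Time update: propagate mean and covariance through the motion model."""
    x_pred = F @ x
    P_pred = F @ P @ F.T + Q
    return x_pred, P_pred


def kf_correct(x_pred, P_pred, z, H, R):
    """Measurement update: blend the prediction with measurement z."""
    y = z - H @ x_pred                     # innovation
    S = H @ P_pred @ H.T + R               # innovation covariance
    K = P_pred @ H.T @ np.linalg.inv(S)    # Kalman gain
    x_new = x_pred + K @ y
    P_new = (np.eye(len(x_pred)) - K @ H) @ P_pred
    return x_new, P_new


# 1D position + velocity, observing position only (illustrative numbers)
F = np.array([[1.0, 1.0], [0.0, 1.0]])
H = np.array([[1.0, 0.0]])
Q = 0.01 * np.eye(2)
R = np.array([[0.1]])

x, P = np.array([0.0, 0.0]), np.eye(2)
for z in [1.0, 2.1, 2.9, 4.2]:
    x, P = kf_predict(x, P, F, Q)
    x, P = kf_correct(x, P, np.array([z]), H, R)
print(x)  # [position, velocity] after four measurements
```

Although only position is measured, the velocity estimate settles near one unit per step because F couples the two.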

Process noise Q

Larger Q → trust motion model less, follow measurements more (jittery but responsive).

Measurement noise R

Larger R → smoother state, slower to react to true maneuvers.
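
The Q/R trade-off is easiest to see in the scalar case (F = H = 1), where the Kalman gain is the fraction of each innovation the filter accepts. The helper name and the sample values below are illustrative.

```python
def steady_state_gain(q: float, r: float, steps: int = 200) -> float:
    """Iterate the scalar Kalman recursion (F = H = 1) and return the final gain."""
    p = 1.0
    k = 0.0
    for _ in range(steps):
        p = p + q           # predict: uncertainty grows by Q
        k = p / (p + r)     # gain: share of the innovation we accept
        p = (1.0 - k) * p   # correct: uncertainty shrinks
    return k


for q, r in [(0.01, 1.0), (1.0, 1.0), (1.0, 0.01)]:
    print(f"Q={q:<5} R={r:<5} gain={steady_state_gain(q, r):.3f}")
```

A gain near 1 means the estimate follows measurements closely (responsive, jittery); a gain near 0 means it mostly trusts the model (smooth, slow to react).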

2D centroid with constant velocity

State [cx, cy, vx, vy]ᵀ; measurement [cx, cy]ᵀ. The transition matrix assumes a time step of one frame; if you know Δt, replace the two 1s that couple position to velocity with Δt.

import cv2
import numpy as np

kf = cv2.KalmanFilter(4, 2)
kf.transitionMatrix = np.array([
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
    [0, 0, 0, 1],
], dtype=np.float32)
kf.measurementMatrix = np.array([
    [1, 0, 0, 0],
    [0, 1, 0, 0],
], dtype=np.float32)

kf.processNoiseCov = np.eye(4, dtype=np.float32) * 1e-2
kf.measurementNoiseCov = np.eye(2, dtype=np.float32) * 1e-1
kf.errorCovPost = np.eye(4, dtype=np.float32)

# First observation (e.g. detector center)
z = np.array([[120.0], [80.0]], dtype=np.float32)
kf.statePost = np.array([[z[0,0]], [z[1,0]], [0], [0]], dtype=np.float32)

pred = kf.predict()
z2 = np.array([[125.0], [82.0]], dtype=np.float32)
est = kf.correct(z2)
# est[0:2] = smoothed center
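
If velocities should be in pixels per second rather than pixels per frame, the transition matrix can be built from the frame interval. A small sketch, assuming a hypothetical helper name and a 30 fps source:

```python
import numpy as np


def constant_velocity_F(dt: float) -> np.ndarray:
    """Transition matrix for state [cx, cy, vx, vy] with time step dt (seconds)."""
    return np.array([
        [1, 0, dt, 0],
        [0, 1, 0, dt],
        [0, 0, 1, 0],
        [0, 0, 0, 1],
    ], dtype=np.float32)


F = constant_velocity_F(1.0 / 30.0)  # 30 fps video, velocities in px/s
state = np.array([100.0, 50.0, 60.0, -30.0], dtype=np.float32)
print(F @ state)  # center moves by velocity * dt, velocity unchanged
```

The result can be assigned to `kf.transitionMatrix`; Q usually needs rescaling as well when the time unit changes.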

Per-frame loop pattern

# Call once per video frame; `detection` is (mx, my), or None on a miss
def kalman_step(kf, detection):
    prediction = kf.predict()
    if detection is not None:
        mx, my = detection
        z = np.array([[mx], [my]], dtype=np.float32)
        return kf.correct(z)
    return prediction  # coast on prediction (occlusion)

Long occlusions need gating or a separate re-acquisition module: without measurements the filter only predicts, so its covariance, and with it any drift from the true position, keeps growing.
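
The growth is easy to demonstrate by running predict-only steps on the covariance. The starting covariance and Q below are illustrative values, not tuned settings.

```python
import numpy as np

F = np.array([[1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0],
              [0, 0, 0, 1]], dtype=float)
Q = np.eye(4) * 1e-2
P = np.eye(4) * 0.1  # covariance right after a correction (illustrative)

position_std = []
for _ in range(10):            # ten frames with no detection
    P = F @ P @ F.T + Q        # predict-only step: no correction shrinks P
    position_std.append(float(np.sqrt(P[0, 0])))

print([round(s, 2) for s in position_std])
```

The standard deviation of cx grows every frame, which is why a gate sized from the predicted covariance becomes too permissive to be useful after a long gap.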

Beyond linear Kalman

Nonlinear motion or sensors use Extended Kalman Filter (EKF) or Unscented Kalman Filter (UKF). Particle filters handle multi-modal uncertainty. Libraries like filterpy offer EKF/UKF in Python; OpenCV focuses on the linear case.

Takeaways

  • Predict then correct each step; tune Q and R for your noise and frame rate.
  • Great for smoothing box centers before association (SORT).
  • Use EKF/UKF when models are nonlinear (camera projection, IMU fusion).

Quick FAQ

If the filter diverges or returns NaNs, check matrix dimensions, the symmetry of the covariances, and that measurementMatrix matches the measurement shape. For numerical issues, try smaller process noise or double precision.

To track a full box rather than a center, extend the state to [cx, cy, w, h, vx, vy, …] with a block-diagonal F, or run separate filters per quantity; the measurement becomes the detector's box.
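
One way to build such a box model, as a sketch. The state ordering [cx, cy, w, h, vx, vy, vw, vh] and the helper name are assumptions made for this example; with this ordering F is written in 4x4 blocks.

```python
import numpy as np


def box_cv_model(dt: float = 1.0):
    """F and H for state [cx, cy, w, h, vx, vy, vw, vh], measuring [cx, cy, w, h]."""
    I4 = np.eye(4)
    Z4 = np.zeros((4, 4))
    F = np.block([[I4, dt * I4],
                  [Z4, I4]])      # each quantity advances by its own velocity
    H = np.hstack([I4, Z4])       # the detector observes the box, not the velocities
    return F, H


F, H = box_cv_model()
x = np.array([120, 80, 40, 60, 5, 2, 0, 1], dtype=float)
print(H @ (F @ x))  # predicted box one frame ahead: cx=125, cy=82, w=40, h=61
```

For cv2.KalmanFilter(8, 4), cast both matrices to np.float32 before assigning them.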

SORT & DeepSORT

SORT pipeline

  1. Detect — bounding boxes + scores from YOLO, Faster R-CNN, etc.
  2. Predict — each active track advances its Kalman state by one frame (in the original SORT: box center, scale, and aspect ratio, plus velocities for center and scale).
  3. Associate — build a cost matrix (often 1 − IoU between predicted boxes and detections); solve assignment with Hungarian / linear sum assignment.
  4. Update — matched tracks get Kalman correction; unmatched detections spawn new tracks; unmatched tracks age out after T frames.
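
The bookkeeping in step 4 can be sketched on its own, leaving out the Kalman and association parts. `TrackManager`, `Track`, and `max_age` are hypothetical names for this example; `max_age` plays the role of T.

```python
from dataclasses import dataclass
from typing import Dict, List, Sequence, Tuple

Box = Tuple[float, float, float, float]


@dataclass
class Track:
    track_id: int
    box: Box
    misses: int = 0  # consecutive frames without a matched detection


class TrackManager:
    """Track lifecycle only; Kalman prediction and correction are omitted."""

    def __init__(self, max_age: int = 3):
        self.max_age = max_age
        self.tracks: Dict[int, Track] = {}
        self._next_id = 0

    def step(self, matches: Sequence[Tuple[int, int]],
             detections: Sequence[Box]) -> List[int]:
        """`matches` holds (track_id, detection_index) pairs from association."""
        matched_tracks = {t for t, _ in matches}
        matched_dets = {d for _, d in matches}

        for track_id, det_idx in matches:           # matched: refresh
            self.tracks[track_id].box = detections[det_idx]
            self.tracks[track_id].misses = 0

        for track_id in list(self.tracks):          # unmatched tracks: age out
            if track_id not in matched_tracks:
                self.tracks[track_id].misses += 1
                if self.tracks[track_id].misses > self.max_age:
                    del self.tracks[track_id]

        for det_idx, box in enumerate(detections):  # unmatched detections: spawn
            if det_idx not in matched_dets:
                self.tracks[self._next_id] = Track(self._next_id, box)
                self._next_id += 1

        return sorted(self.tracks)


mgr = TrackManager(max_age=1)
print(mgr.step([], [(0, 0, 10, 10)]))        # detection spawns track 0
print(mgr.step([(0, 0)], [(1, 0, 11, 10)]))  # track 0 matched and refreshed
print(mgr.step([], []))                      # first miss, still alive
print(mgr.step([], []))                      # second miss exceeds max_age
```

Real implementations usually also require a few consecutive hits before reporting a new track, which suppresses one-frame false positives.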

IoU cost and assignment (sketch)

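import numpy as np
from scipy.optimize import linear_sum_assignment

def iou(a, b):
    """IoU of two boxes given as [x1, y1, x2, y2]."""
    x1 = max(a[0], b[0]); y1 = max(a[1], b[1])
    x2 = min(a[2], b[2]); y2 = min(a[3], b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    ua = (a[2]-a[0])*(a[3]-a[1]) + (b[2]-b[0])*(b[3]-b[1]) - inter
    return inter / (ua + 1e-6)

# pred_boxes[i], det_boxes[j] as [x1,y1,x2,y2] (illustrative values)
pred_boxes = [[10, 10, 50, 50], [100, 100, 140, 160]]
det_boxes = [[102, 98, 141, 158], [12, 11, 52, 49]]

IOU_MIN = 0.1  # pairs at or below this overlap are gated out
GATED = 99.0   # large cost so gated pairs are picked only as a last resort

# cost[i,j] = 1 - iou(pred_i, det_j), or GATED if iou <= IOU_MIN
cost = np.full((len(pred_boxes), len(det_boxes)), GATED)
for i, p in enumerate(pred_boxes):
    for j, d in enumerate(det_boxes):
        v = iou(p, d)
        if v > IOU_MIN:
            cost[i, j] = 1.0 - v

row_ind, col_ind = linear_sum_assignment(cost)
# The solver can still pair gated entries; drop them afterwards
matches = [(i, j) for i, j in zip(row_ind, col_ind) if cost[i, j] < GATED]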

Production code vectorizes IoU and applies min-cost thresholds; this shows the idea.
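
A vectorized IoU along those lines, using NumPy broadcasting in place of the double loop. The function name and sample boxes are illustrative.

```python
import numpy as np


def iou_matrix(a, b) -> np.ndarray:
    """Pairwise IoU between boxes a (N, 4) and b (M, 4), each [x1, y1, x2, y2]."""
    a = np.asarray(a, dtype=float)[:, None, :]  # (N, 1, 4)
    b = np.asarray(b, dtype=float)[None, :, :]  # (1, M, 4)
    # Width and height of each pairwise intersection, clipped at zero
    wh = np.clip(np.minimum(a[..., 2:], b[..., 2:]) -
                 np.maximum(a[..., :2], b[..., :2]), 0, None)
    inter = wh[..., 0] * wh[..., 1]
    area_a = (a[..., 2] - a[..., 0]) * (a[..., 3] - a[..., 1])
    area_b = (b[..., 2] - b[..., 0]) * (b[..., 3] - b[..., 1])
    return inter / (area_a + area_b - inter + 1e-6)


preds = np.array([[0, 0, 10, 10], [20, 20, 30, 30]])
dets = np.array([[0, 0, 10, 10], [5, 0, 15, 10], [100, 100, 110, 110]])
print(np.round(iou_matrix(preds, dets), 3))
```

The cost matrix is then `1.0 - iou_matrix(preds, dets)`, with gating applied as a mask before or after assignment.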

DeepSORT: motion + appearance

DeepSORT keeps a Kalman state over box center, aspect ratio, and height (plus their velocities) and gates candidate matches with the Mahalanobis distance between the predicted measurement distribution and each detection. Each detected pedestrian crop is passed through a small CNN that outputs an L2-normalized embedding; the cosine distance between a track's gallery of past embeddings and the detection provides a second association cue. The final cost blends motion and appearance (with tuned weights and gates).
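
A sketch of how the two cues can be combined for a single track and detection pair. The function names, the blend weight `lam`, and the `max_cos` threshold are assumptions for illustration; the chi-square constant is the 95% quantile for four degrees of freedom.

```python
import numpy as np

CHI2_95_4DOF = 9.4877  # 95% chi-square quantile for a 4-D measurement


def mahalanobis_sq(z, mean, cov):
    """Squared Mahalanobis distance of measurement z from N(mean, cov)."""
    d = z - mean
    return float(d @ np.linalg.solve(cov, d))


def min_cosine_distance(gallery, feature):
    """Smallest cosine distance between a detection feature and a track's gallery.

    All vectors are assumed L2-normalized, so cosine similarity is a dot product.
    """
    return float(np.min(1.0 - gallery @ feature))


def association_cost(z, mean, cov, gallery, feature, lam=0.0, max_cos=0.2):
    """Blend motion and appearance; return inf when either gate rejects the pair."""
    d_motion = mahalanobis_sq(z, mean, cov)
    d_app = min_cosine_distance(gallery, feature)
    if d_motion > CHI2_95_4DOF or d_app > max_cos:
        return float("inf")
    return lam * d_motion + (1.0 - lam) * d_app


mean = np.array([100.0, 50.0, 0.5, 80.0])     # predicted (cx, cy, aspect, height)
cov = np.diag([25.0, 25.0, 0.01, 16.0])
gallery = np.array([[1.0, 0.0], [0.8, 0.6]])  # stored unit-length embeddings
feat = np.array([0.8, 0.6])                   # detection embedding

near = np.array([103.0, 52.0, 0.5, 82.0])
far = np.array([160.0, 50.0, 0.5, 80.0])
print(association_cost(near, mean, cov, gallery, feat))  # passes both gates
print(association_cost(far, mean, cov, gallery, feat))   # rejected by the motion gate
```

With `lam=0.0` the motion term acts purely as a gate and appearance alone ranks the surviving pairs; raising `lam` mixes the Mahalanobis distance into the ranking.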

Why embeddings?

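IoU alone fails when people cross paths: the boxes overlap for a fraction of a second and either assignment looks equally good. Appearance embeddings help restore the right identities once the targets separate.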

Modern variants

StrongSORT, BoT-SORT, and OC-SORT refine association and, in some cases, add camera-motion compensation; check the MOTChallenge leaderboards for current results.

Libraries

Reference repos implement full SORT/DeepSORT (track lifecycle, cascade matching). Ultralytics and BoxMOT-style packages integrate trackers with YOLO outputs. For learning, read the original papers then trace one clean Python implementation.

MOT metrics (names)

MOTA combines false positives, false negatives, and ID switches. IDF1 emphasizes identity consistency. HOTA unifies detection and association quality. Benchmarks: MOT17, MOT20, DanceTrack.
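
MOTA is commonly written as 1 − (FN + FP + IDSW) / GT, with every count summed over all frames. A small helper makes the arithmetic concrete; the counts in the example are made up.

```python
def mota(false_negatives: int, false_positives: int,
         id_switches: int, num_gt: int) -> float:
    """MOTA = 1 - (FN + FP + IDSW) / GT, with counts summed over all frames."""
    if num_gt <= 0:
        raise ValueError("num_gt must be positive")
    return 1.0 - (false_negatives + false_positives + id_switches) / num_gt


print(mota(false_negatives=120, false_positives=60, id_switches=20, num_gt=1000))
```

Because the error counts can exceed the number of ground-truth boxes, MOTA can be negative. ID switches are usually a small part of the numerator, which is one reason IDF1 and HOTA are reported alongside it.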

Takeaways

  • SORT = detector + Kalman + IoU + Hungarian + track management.
  • DeepSORT adds ReID embeddings for harder association.
  • Detector quality dominates overall MOT performance.

Quick FAQ

If tracks fragment or swap IDs, improve detector recall, retune the Kalman noise (Q and R), adjust the IoU threshold, or upgrade to appearance-based association. Very fast motion needs a higher frame rate or a better motion model.

To save compute, run the detector sparsely and propagate with Kalman or correlation trackers between runs. This trades accuracy for speed; re-sync the tracks whenever the detector fires.
