Object Tracking

Object tracking basics

Detection vs tracking

Detection classifies and localizes all objects each frame—independent of history. Tracking exploits temporal continuity: prediction from the previous frame reduces search cost and stabilizes identity. Hybrid pipelines run a detector every N frames and a cheap tracker in between, or fuse detections with Kalman prediction and Hungarian matching.

Single-object

You initialize the tracker with one bounding box, and it updates that box every frame. OpenCV exposes this through its Tracker* classes.

Multi-object (MOT)

Many objects, each with its own ID. An association step is needed to match detections to trajectories across frames.

OpenCV: CSRT tracker (example)

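CSRT (a discriminative correlation filter with Channel and Spatial Reliability) is accurate but slower than KCF. In recent OpenCV 4.x releases (roughly 4.5.1 onward), the older tracker API lives under cv2.legacy, which ships with opencv-contrib-python.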

import cv2

cap = cv2.VideoCapture("clip.mp4")
ok, frame = cap.read()
if not ok:
    raise SystemExit("Could not read the first frame of clip.mp4")

# Draw the initial box by hand; selectROI returns (x, y, w, h)
bbox = cv2.selectROI("ROI", frame, showCrosshair=True, fromCenter=False)
cv2.destroyWindow("ROI")

tracker = cv2.legacy.TrackerCSRT_create()
tracker.init(frame, bbox)

while True:
    ok, frame = cap.read()
    if not ok:
        break  # end of video
    ok, bbox = tracker.update(frame)
    if ok:
        x, y, w, h = [int(v) for v in bbox]
        cv2.rectangle(frame, (x, y), (x + w, y + h), (0, 255, 0), 2)
    cv2.imshow("track", frame)
    if cv2.waitKey(1) & 0xFF == 27:  # Esc quits
        break

cap.release()
cv2.destroyAllWindows()

If cv2.legacy is missing, the build is either older (try cv2.TrackerCSRT_create()) or lacks the contrib modules (install opencv-contrib-python).

Faster option: KCF

tracker = cv2.legacy.TrackerKCF_create()
tracker.init(frame, bbox)

KCF is faster; CSRT handles deformation and occlusion slightly better. MOSSE is older and very fast but brittle on scale change.

When trackers fail

Drift: the model updates on the wrong pixels; use conservative learning rates or stop updating when confidence is low.
Occlusion / motion blur: switch to detection-based tracking with re-identification (e.g. DeepSORT) or re-initialize manually.
Scale / out-of-plane rotation: use scale-pyramid extensions or bounding-box regression from a detector.

MIL tracker (brief)

MIL (Multiple Instance Learning) trains on bags of patches sampled around the box, where a positive bag only needs to contain one good patch. This makes it more robust to slight misalignment than naive correlation trackers. Create it with cv2.legacy.TrackerMIL_create() where available.

Takeaways

Classic OpenCV trackers = single-object, short-term, init once.
For many objects + IDs, combine a detector with SORT / DeepSORT.
Profile CSRT vs KCF on your resolution and FPS budget.

Quick FAQ

Track multiple objects with OpenCV?

Maintain a list of tracker instances, initialize each with its own ROI, and call update on every one per frame. For consistent IDs across occlusions, prefer detection + association (covered in the later sections).
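
The list-of-trackers idea can be sketched as a small wrapper. This is a minimal sketch, not an OpenCV API: the class name MultiTracker and its methods are hypothetical, and make_tracker is assumed to be any zero-argument factory (for example cv2.legacy.TrackerCSRT_create) whose objects offer init(frame, bbox) and update(frame).

```python
class MultiTracker:
    """Hypothetical helper: one single-object tracker per target, keyed by ID."""

    def __init__(self, make_tracker):
        # make_tracker: zero-argument factory, e.g. cv2.legacy.TrackerCSRT_create
        self.make_tracker = make_tracker
        self.trackers = {}
        self.next_id = 0

    def add(self, frame, bbox):
        """Start tracking bbox = (x, y, w, h) and return its new ID."""
        tracker = self.make_tracker()
        tracker.init(frame, bbox)
        track_id = self.next_id
        self.trackers[track_id] = tracker
        self.next_id += 1
        return track_id

    def update(self, frame):
        """Return {id: bbox} for trackers that succeeded; drop the ones that failed."""
        boxes = {}
        for track_id, tracker in list(self.trackers.items()):
            ok, bbox = tracker.update(frame)
            if ok:
                boxes[track_id] = tuple(int(v) for v in bbox)
            else:
                del self.trackers[track_id]
        return boxes
```

Dropping a tracker on its first failure is a deliberately simple policy; IDs are not recovered after an occlusion, which is the limitation the answer above points at.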

Mean shift / CamShift?

The color-histogram methods in OpenCV (cv2.meanShift, cv2.CamShift) work when the target has a distinctive, stable color distribution; modern pipelines usually prefer learned trackers or detectors.

Kalman filter

State, model, noise

Discrete-time model: x_k = F x_{k-1} + w_k with process noise w_k ~ N(0, Q); measurement z_k = H x_k + v_k with measurement noise v_k ~ N(0, R). The filter maintains the mean and covariance of the state estimate. F encodes constant-velocity or constant-acceleration assumptions; H picks which state components we observe (e.g. only position).
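
To make the model concrete, here is a minimal NumPy sketch of the two standard linear Kalman steps. The function names kf_predict and kf_correct are illustrative, not part of OpenCV; Q and R denote the covariances of the process noise w and measurement noise v.

```python
import numpy as np

def kf_predict(x, P, F, Q):
    """Propagate mean x and covariance P one step through the motion model."""
    x_pred = F @ x
    P_pred = F @ P @ F.T + Q
    return x_pred, P_pred

def kf_correct(x_pred, P_pred, z, H, R):
    """Fold measurement z into the predicted state."""
    y = z - H @ x_pred                       # innovation
    S = H @ P_pred @ H.T + R                 # innovation covariance
    K = P_pred @ H.T @ np.linalg.inv(S)      # Kalman gain
    x = x_pred + K @ y
    P = (np.eye(len(x_pred)) - K @ H) @ P_pred
    return x, P

# 1D position + velocity, observing position only
F = np.array([[1.0, 1.0], [0.0, 1.0]])
H = np.array([[1.0, 0.0]])
Q = np.eye(2) * 1e-2
R = np.array([[1e-1]])

x, P = np.array([0.0, 1.0]), np.eye(2)
x_pred, P_pred = kf_predict(x, P, F, Q)
x, P = kf_correct(x_pred, P_pred, np.array([2.0]), H, R)
```

In the usage lines, the corrected position lands between the prediction (1.0) and the measurement (2.0), weighted by the gain.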

Process noise `Q`

Larger Q → trust motion model less, follow measurements more (jittery but responsive).

Measurement noise `R`

Larger R → smoother state, slower to react to true maneuvers.
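
The Q/R trade-off is easiest to see in a scalar filter. The sketch below assumes a one-dimensional random-walk model (x_k = x_{k-1} + w, z_k = x_k + v) and iterates the covariance recursion until the Kalman gain settles; the function name steady_state_gain is illustrative.

```python
def steady_state_gain(q, r, steps=200):
    """Scalar random-walk filter: return the gain after `steps` iterations.

    q is the process-noise variance, r the measurement-noise variance.
    A gain near 1 follows measurements; a gain near 0 trusts the model.
    """
    p = 1.0   # initial state variance
    k = 0.0
    for _ in range(steps):
        p_pred = p + q                 # predict: variance grows by q
        k = p_pred / (p_pred + r)      # gain
        p = (1.0 - k) * p_pred         # correct: variance shrinks
    return k

responsive = steady_state_gain(q=1.0, r=0.1)    # large Q relative to R
smooth = steady_state_gain(q=0.01, r=1.0)       # large R relative to Q
```

Here responsive ends up much closer to 1 than smooth, which matches the two rules of thumb above: only the ratio between Q and R matters for how strongly each measurement pulls the estimate.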

2D centroid with constant velocity

State [cx, cy, vx, vy]ᵀ; measurement [cx, cy]ᵀ. Time step assumes unit frame interval (scale F if you know Δt).

import cv2
import numpy as np

kf = cv2.KalmanFilter(4, 2)
kf.transitionMatrix = np.array([
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
    [0, 0, 0, 1],
], dtype=np.float32)
kf.measurementMatrix = np.array([
    [1, 0, 0, 0],
    [0, 1, 0, 0],
], dtype=np.float32)

kf.processNoiseCov = np.eye(4, dtype=np.float32) * 1e-2
kf.measurementNoiseCov = np.eye(2, dtype=np.float32) * 1e-1
kf.errorCovPost = np.eye(4, dtype=np.float32)

# First observation (e.g. detector center)
z = np.array([[120.0], [80.0]], dtype=np.float32)
kf.statePost = np.array([[z[0,0]], [z[1,0]], [0], [0]], dtype=np.float32)

pred = kf.predict()
z2 = np.array([[125.0], [82.0]], dtype=np.float32)
est = kf.correct(z2)
# est[0:2] = smoothed center
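
If the frame interval is not one time unit, the velocity terms in the transition matrix scale with Δt. A minimal sketch, assuming a hypothetical helper named constant_velocity_F and the same state order [cx, cy, vx, vy]:

```python
import numpy as np

def constant_velocity_F(dt):
    """Transition matrix for state [cx, cy, vx, vy] with time step dt."""
    return np.array([
        [1, 0, dt, 0],
        [0, 1, 0, dt],
        [0, 0, 1, 0],
        [0, 0, 0, 1],
    ], dtype=np.float32)

# Example: 30 fps video with velocities expressed in pixels per second
F_30fps = constant_velocity_F(1.0 / 30.0)
```

The result can be assigned to kf.transitionMatrix; the process noise usually needs retuning when Δt changes.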

Per-frame loop pattern

# Each video frame (detection_available, mx, my come from your detector):
prediction = kf.predict()
if detection_available:
    z = np.array([[mx], [my]], dtype=np.float32)  # measured center
    state = kf.correct(z)
else:
    state = prediction  # coast on prediction (occlusion)

Long occlusions need gating or a separate re-acquisition module: without measurements, the filter keeps extrapolating and its uncertainty grows every frame.
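
The growth of uncertainty while coasting can be checked directly from the covariance recursion. This sketch reuses the constant-velocity F from the example above with an assumed Q, and the variance limit is a hypothetical threshold, not a recommended value.

```python
import numpy as np

F = np.array([[1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0],
              [0, 0, 0, 1]], dtype=float)
Q = np.eye(4) * 1e-2

def coast_variances(P, steps):
    """Apply only the prediction step and record the variance of cx."""
    variances = []
    for _ in range(steps):
        P = F @ P @ F.T + Q
        variances.append(P[0, 0])
    return variances

def frames_until_lost(P, limit=25.0, max_steps=1000):
    """Count predict-only frames until the variance of cx exceeds `limit`.

    limit=25.0 (a 5-pixel standard deviation) is an illustrative choice.
    """
    for n in range(1, max_steps + 1):
        P = F @ P @ F.T + Q
        if P[0, 0] > limit:
            return n
    return None

history = coast_variances(np.eye(4), steps=10)
```

Because the velocity uncertainty feeds into position, the variance grows roughly quadratically with the number of missed frames, so a track is usually dropped or handed to re-acquisition after a small number of them.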

Beyond linear Kalman

Nonlinear motion or sensors use Extended Kalman Filter (EKF) or Unscented Kalman Filter (UKF). Particle filters handle multi-modal uncertainty. Libraries like filterpy offer EKF/UKF in Python; OpenCV focuses on the linear case.

Takeaways

Predict then correct each step; tune Q and R for your noise and frame rate.
Great for smoothing box centers before association (SORT).
Use EKF/UKF when models are nonlinear (camera projection, IMU fusion).

Quick FAQ

State blows up?

Check the matrix dimensions, the symmetry of the covariance matrices, and that measurementMatrix matches the measurement shape. If the cause is numerical, try smaller process noise or double precision.

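Track width/height too?

Extend the state to [cx, cy, w, h, vx, vy, …] with an F in which each quantity evolves independently (block-diagonal when the state is grouped per quantity), or run separate filters per quantity; the measurement then becomes the full box from the detector.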
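
One way to lay out such a model is sketched below. It assumes an 8-dimensional state [cx, cy, w, h, vx, vy, vw, vh], where the names vw and vh for the size velocities and the helper box_cv_model are choices made for this example.

```python
import numpy as np

def box_cv_model(dt=1.0):
    """Constant-velocity model for a box: returns (F, H).

    State:       [cx, cy, w, h, vx, vy, vw, vh]  (assumed ordering)
    Measurement: [cx, cy, w, h] from the detector
    """
    n = 4
    F = np.eye(2 * n, dtype=np.float32)
    F[:n, n:] = dt * np.eye(n, dtype=np.float32)   # value += dt * velocity
    H = np.hstack([np.eye(n, dtype=np.float32),
                   np.zeros((n, n), dtype=np.float32)])
    return F, H

F, H = box_cv_model()
```

With OpenCV this would pair with cv2.KalmanFilter(8, 4), assigning F to transitionMatrix and H to measurementMatrix.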

SORT & DeepSORT

SORT pipeline

Detect: bounding boxes + scores from YOLO, Faster R-CNN, etc.
Predict: each active track advances its Kalman state (in SORT: box center, area, and aspect ratio, with velocities for center and area).
Associate: build a cost matrix (often 1 − IoU between predicted boxes and detections); solve the assignment with the Hungarian algorithm (linear sum assignment).
Update: matched tracks get a Kalman correction; unmatched detections spawn new tracks; unmatched tracks age out after T frames.
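
The update step is mostly bookkeeping, which the following sketch isolates. It is a simplification under stated assumptions: the Track class and update_tracks function are hypothetical, matched tracks simply take the detection's box instead of a Kalman correction, and max_age plays the role of T.

```python
from dataclasses import dataclass

@dataclass
class Track:
    track_id: int
    box: tuple
    misses: int = 0   # consecutive frames without a matched detection

def update_tracks(tracks, detections, matches, next_id, max_age=3):
    """Apply one frame of track management.

    matches is a list of (track_index, detection_index) pairs from the
    association step. Returns (surviving_tracks, next_id).
    """
    matched_tracks = {i for i, _ in matches}
    matched_dets = {j for _, j in matches}

    for i, j in matches:
        tracks[i].box = detections[j]   # real SORT: Kalman correction here
        tracks[i].misses = 0

    for i, track in enumerate(tracks):
        if i not in matched_tracks:
            track.misses += 1           # unmatched track ages

    survivors = [t for t in tracks if t.misses <= max_age]

    for j, det in enumerate(detections):
        if j not in matched_dets:       # unmatched detection spawns a track
            survivors.append(Track(next_id, det))
            next_id += 1

    return survivors, next_id
```

Reference implementations also require a few consecutive hits before a new track is reported, which suppresses one-frame false positives; that is omitted here.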

IoU cost and assignment (sketch)

import numpy as np
from scipy.optimize import linear_sum_assignment

def iou(a, b):
    x1 = max(a[0], b[0]); y1 = max(a[1], b[1])
    x2 = min(a[2], b[2]); y2 = min(a[3], b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    ua = (a[2]-a[0])*(a[3]-a[1]) + (b[2]-b[0])*(b[3]-b[1]) - inter
    return inter / (ua + 1e-6)

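# pred_boxes[i], det_boxes[j] as [x1, y1, x2, y2]; sample values for illustration
pred_boxes = [[10, 10, 50, 50], [100, 100, 140, 160]]
det_boxes = [[102, 98, 141, 158], [12, 11, 52, 49]]

MIN_IOU = 0.1   # pairs at or below this overlap are gated out
GATED = 99.0    # large cost so gated pairs are chosen only as a last resort

# cost[i, j] = 1 - iou(pred_i, det_j), or GATED if the overlap is too small
cost = np.full((len(pred_boxes), len(det_boxes)), GATED)
for i, p in enumerate(pred_boxes):
    for j, d in enumerate(det_boxes):
        v = iou(p, d)
        if v > MIN_IOU:
            cost[i, j] = 1.0 - v

row_ind, col_ind = linear_sum_assignment(cost)
# The solver may still return gated pairs; keep only real matches
matches = [(i, j) for i, j in zip(row_ind, col_ind) if cost[i, j] < GATED]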

Production code vectorizes IoU and applies min-cost thresholds; this shows the idea.
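
A vectorized version might look like the sketch below. The function names iou_matrix and filter_matches and the min_iou default are illustrative; boxes use the same [x1, y1, x2, y2] convention as above.

```python
import numpy as np

def iou_matrix(boxes_a, boxes_b):
    """Pairwise IoU between N and M boxes; returns an (N, M) array."""
    a = np.asarray(boxes_a, dtype=float).reshape(-1, 4)[:, None, :]  # (N, 1, 4)
    b = np.asarray(boxes_b, dtype=float).reshape(-1, 4)[None, :, :]  # (1, M, 4)

    x1 = np.maximum(a[..., 0], b[..., 0])
    y1 = np.maximum(a[..., 1], b[..., 1])
    x2 = np.minimum(a[..., 2], b[..., 2])
    y2 = np.minimum(a[..., 3], b[..., 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)

    area_a = (a[..., 2] - a[..., 0]) * (a[..., 3] - a[..., 1])
    area_b = (b[..., 2] - b[..., 0]) * (b[..., 3] - b[..., 1])
    return inter / (area_a + area_b - inter + 1e-6)

def filter_matches(rows, cols, iou, min_iou=0.3):
    """Drop assigned pairs whose IoU is below min_iou."""
    return [(int(i), int(j)) for i, j in zip(rows, cols) if iou[i, j] >= min_iou]
```

The assignment itself can then run on 1 - iou_matrix(pred_boxes, det_boxes), with filter_matches applied to the returned row and column indices.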

DeepSORT: motion + appearance

DeepSORT keeps a Kalman state in measurement space (box center, aspect ratio, and height, plus their velocities) and gates candidate matches with the Mahalanobis distance between the predicted distribution and each detection. Each detected box (a pedestrian, in the original setup) is cropped and passed through a small CNN that outputs an L2-normalized embedding; the cosine distance between a track's gallery of past embeddings and the detection's embedding provides a second association cue. The final cost blends motion and appearance (with tuned weights and gates).
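
The two cues and their combination can be sketched in NumPy. This is a simplified illustration, not the reference implementation: it assumes one embedding per track instead of a gallery, and the weight lam, the cosine gate max_cos, and all function names are illustrative choices. The motion gate is the 95% chi-square quantile for 4 degrees of freedom.

```python
import numpy as np

CHI2_95_4DOF = 9.4877   # 95% chi-square quantile, 4 degrees of freedom
INFEASIBLE = 1e5        # cost assigned to gated-out pairs

def cosine_distance(track_feats, det_feats):
    """(N, D) and (M, D) L2-normalized embeddings -> (N, M) distances."""
    return 1.0 - track_feats @ det_feats.T

def squared_mahalanobis(mean, cov, measurements):
    """Distance of each measurement (M, k) from one track's predicted
    measurement distribution N(mean, cov); returns shape (M,)."""
    d = measurements - mean
    solved = np.linalg.solve(cov, d.T)      # cov^-1 d, shape (k, M)
    return np.sum(d.T * solved, axis=0)

def blended_cost(maha_sq, cos_dist, lam=0.5, max_cos=0.2):
    """Weighted sum of both cues; pairs failing either gate become infeasible."""
    cost = lam * maha_sq + (1.0 - lam) * cos_dist
    gated = (maha_sq > CHI2_95_4DOF) | (cos_dist > max_cos)
    return np.where(gated, INFEASIBLE, cost)
```

The resulting matrix feeds the same assignment solver as the IoU cost; pairs left at INFEASIBLE are discarded after solving.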

Why embeddings?

IoU fails when people cross paths, because the overlapping boxes are ambiguous; appearance helps assign the right IDs again after a brief overlap.

Modern variants

StrongSORT, BoT-SORT, OC-SORT refine association and camera motion; check MOTChallenge leaderboards.

Libraries

Reference repos implement full SORT/DeepSORT (track lifecycle, cascade matching). Ultralytics and BoxMOT-style packages integrate trackers with YOLO outputs. For learning, read the original papers then trace one clean Python implementation.

MOT metrics (names)

MOTA combines false positives, false negatives, and ID switches. IDF1 emphasizes identity consistency. HOTA unifies detection and association quality. Benchmarks: MOT17, MOT20, DanceTrack.
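
For reference, the standard definitions of the first two metrics are short enough to write out. The counts are assumed to be summed over all frames of a sequence; the function and argument names are chosen for this example.

```python
def mota(fn, fp, idsw, num_gt):
    """MOTA = 1 - (FN + FP + IDSW) / GT.

    fn, fp, idsw: false negatives, false positives, identity switches.
    num_gt: total number of ground-truth boxes. Can be negative.
    """
    return 1.0 - (fn + fp + idsw) / num_gt

def idf1(idtp, idfp, idfn):
    """IDF1 = 2 * IDTP / (2 * IDTP + IDFP + IDFN), using identity-level
    true positives, false positives, and false negatives."""
    return 2 * idtp / (2 * idtp + idfp + idfn)
```

MOTA is dominated by detection errors when ID switches are rare compared with misses and false alarms, which is one reason IDF1 and HOTA are reported alongside it.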

Takeaways

SORT = detector + Kalman + IoU + Hungarian + track management.
DeepSORT adds ReID embeddings for harder association.
Detector quality dominates overall MOT performance.

Quick FAQ

ID switches still high?

Improve detector recall, retune the Kalman noise, tune the IoU threshold, or upgrade to appearance-based association. Very fast motion needs a higher frame rate or a better motion model.

Track without detector every frame?

Run the detector sparsely and propagate with Kalman or correlation trackers between runs. This trades accuracy for speed; re-sync the tracks whenever the detector fires.
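
A single-object version of this schedule is sketched below. Everything here is an assumption for illustration: hybrid_track is a hypothetical function, detect(frame) stands for any detector that returns one box or None, and make_tracker is a factory for objects with init(frame, box) and update(frame), such as the OpenCV trackers from the first section.

```python
def hybrid_track(frames, detect, make_tracker, every_n=10):
    """Run the detector every `every_n` frames and a cheap tracker in between.

    Returns one box (or None) per frame.
    """
    tracker = None
    results = []
    for idx, frame in enumerate(frames):
        if idx % every_n == 0 or tracker is None:
            box = detect(frame)              # expensive path: re-sync
            if box is not None:
                tracker = make_tracker()
                tracker.init(frame, box)
            else:
                tracker = None
        else:
            ok, box = tracker.update(frame)  # cheap path: propagate
            if not ok:
                tracker = None               # forces detection next frame
                box = None
        results.append(box)
    return results
```

When the tracker reports a failure, the detector runs again on the next frame instead of waiting for the schedule, which limits how long a lost target goes unnoticed.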
