Computer Vision Chapter 11

Video Analysis

Video processing fundamentals, optical flow, and action recognition in video.

Video processing

OpenCV: read a file

import cv2

cap = cv2.VideoCapture("clip.mp4")
if not cap.isOpened():
    raise RuntimeError("cannot open video")

fps = cap.get(cv2.CAP_PROP_FPS)
w = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
h = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))
n = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
print(fps, w, h, n)

while True:
    ok, frame = cap.read()
    if not ok:
        break
    # frame: BGR uint8, shape (h, w, 3)

cap.release()

Always call release() when you are done, for example from a try/finally block, so file handles and camera devices are freed even if the loop raises.
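One way to make the release step hard to forget is a small wrapper, sketched below. The helper name releasing is hypothetical and not part of OpenCV; it only assumes the wrapped object has a release() method, which holds for cv2.VideoCapture and cv2.VideoWriter.

```python
from contextlib import contextmanager


@contextmanager
def releasing(resource):
    """Yield `resource`, then call its release() even if the body raises.

    `releasing` is a hypothetical helper, not an OpenCV function.
    """
    try:
        yield resource
    finally:
        resource.release()


# Intended use with OpenCV (not executed here):
#   with releasing(cv2.VideoCapture("clip.mp4")) as cap:
#       ok, frame = cap.read()
```

The try/finally inside the helper is the same pattern you would write by hand around the read loop.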

Frame index and seek

cap.set(cv2.CAP_PROP_POS_FRAMES, 120)  # jump to frame 120
ok, frame = cap.read()
ms = cap.get(cv2.CAP_PROP_POS_MSEC)     # position in ms (if available)

Seek accuracy depends on the codec, container, and backend; where only keyframe seeking is available, the read can land on a nearby keyframe rather than the requested frame.
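When an exact frame matters more than speed, one fallback is to decode forward from the start instead of seeking. The sketch below assumes a freshly opened capture positioned at frame 0; the function names are hypothetical, and only grab() and read() are used from the capture object.

```python
def frame_index_for_time(t_sec, fps):
    """Nearest frame index for a timestamp, assuming a constant frame rate."""
    return int(round(t_sec * fps))


def read_frame_sequential(cap, index):
    """Decode forward from frame 0 of a freshly opened capture to `index`.

    Slower than setting CAP_PROP_POS_FRAMES, but it does not depend on
    where the keyframes are.
    """
    for _ in range(index):
        if not cap.grab():  # advance one frame without retrieving the image
            return False, None
    return cap.read()
```

For example, read_frame_sequential(cv2.VideoCapture("clip.mp4"), frame_index_for_time(4.0, fps)) would return the frame at roughly four seconds, if the file has that many frames.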

Webcam and backend

cap = cv2.VideoCapture(0, cv2.CAP_DSHOW)  # Windows: DirectShow backend (optional)
if not cap.isOpened():
    raise RuntimeError("camera not available")
# The requested size is a hint; the driver may choose the closest supported mode.
cap.set(cv2.CAP_PROP_FRAME_WIDTH, 1280)
cap.set(cv2.CAP_PROP_FRAME_HEIGHT, 720)

Write MP4 (example)

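fourcc = cv2.VideoWriter_fourcc(*"mp4v")
out = cv2.VideoWriter("out.mp4", fourcc, 30.0, (640, 480))  # size is (width, height)
if not out.isOpened():
    raise RuntimeError("cannot open writer; check codec support")
# ... for each BGR frame of shape (480, 640, 3):
# out.write(frame)
out.release()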

The fourcc code must name a codec that your OpenCV build and backend support; on some systems avc1 or H264 works better than mp4v.

torchvision: read_video

from torchvision.io import read_video

video, audio, info = read_video("clip.mp4", start_pts=0, end_pts=4, pts_unit="sec")
# video: (T, H, W, C) uint8 in RGB order
print(video.shape, info)

read_video loads the whole requested range into memory, which suits short training clips; for long files prefer decoders that stream frames to limit RAM.

Takeaways

  • OpenCV returns BGR while many deep models expect RGB; convert with cv2.cvtColor(frame, cv2.COLOR_BGR2RGB) when needed.
  • Temporal methods need consistent FPS or explicit timestamps.
  • Next: optical flow ties neighboring frames through motion fields.
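The BGR versus RGB point can be illustrated without OpenCV at all, since a frame is just a NumPy array. The sketch below assumes a uint8 frame of shape (H, W, 3) in BGR order and a model that wants channels-first float input in [0, 1]; the function name is hypothetical, and model-specific normalization is left out.

```python
import numpy as np


def bgr_to_model_input(frame_bgr):
    """Convert an (H, W, 3) uint8 BGR frame to (3, H, W) float32 RGB in [0, 1]."""
    rgb = frame_bgr[..., ::-1]           # reverse the channel axis: BGR -> RGB
    chw = np.transpose(rgb, (2, 0, 1))   # (H, W, 3) -> (3, H, W)
    return np.ascontiguousarray(chw, dtype=np.float32) / 255.0


# A pure-blue frame in BGR has 255 in channel 0.
blue_bgr = np.zeros((2, 2, 3), dtype=np.uint8)
blue_bgr[..., 0] = 255
x = bgr_to_model_input(blue_bgr)
```

After the conversion, blue sits in the last channel (index 2), which is where an RGB model expects it.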

Quick FAQ

Slow video pipelines: decoding and disk I/O usually dominate the cost. Use a smaller resolution, skip frames, use a GPU decoder, or extract frames offline.
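Frame skipping can be expressed as a list of source indices to keep. The sketch below assumes a constant source frame rate; the function name keep_indices is hypothetical.

```python
def keep_indices(n_frames, src_fps, dst_fps):
    """Indices of source frames to keep when reducing src_fps to dst_fps."""
    if dst_fps >= src_fps:
        return list(range(n_frames))     # nothing to skip
    step = src_fps / dst_fps             # source frames per kept frame
    indices = []
    i = 0
    while True:
        idx = int(i * step)
        if idx >= n_frames:
            break
        indices.append(idx)
        i += 1
    return indices


kept = keep_indices(300, 30.0, 10.0)     # keeps every third frame
```

In a read loop, frames whose index is not in the list can be advanced with cap.grab() instead of cap.read(), which advances the stream without returning the image.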

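Grayscale frames: convert after reading with cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY). Where the backend supports it, cap.set(cv2.CAP_PROP_CONVERT_RGB, 0) turns off the automatic conversion to BGR, but frames then arrive in the backend's native format, which is not necessarily grayscale.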

Optical flow

Sparse: Lucas–Kanade + pyramid

import cv2
import numpy as np

cap = cv2.VideoCapture("clip.mp4")
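ret, old = cap.read()
if not ret:
    raise RuntimeError("cannot read first frame")
old_gray = cv2.cvtColor(old, cv2.COLOR_BGR2GRAY)
feature_params = dict(maxCorners=200, qualityLevel=0.01, minDistance=7, blockSize=7)
pts = cv2.goodFeaturesToTrack(old_gray, **feature_params)  # (N, 1, 2) float32, or None

lk_params = dict(
    winSize=(21, 21),
    maxLevel=3,
    criteria=(cv2.TERM_CRITERIA_EPS | cv2.TERM_CRITERIA_COUNT, 30, 0.01),
)

while True:
    ret, frame = cap.read()
    if not ret:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    if pts is None or len(pts) == 0:
        # nothing left to track: re-seed on the current frame and move on
        pts = cv2.goodFeaturesToTrack(gray, **feature_params)
        old_gray = gray
        continue
    new_pts, st, err = cv2.calcOpticalFlowPyrLK(old_gray, gray, pts, None, **lk_params)
    good_new = new_pts[st == 1]  # (K, 2): points tracked successfully
    good_old = pts[st == 1]
    # draw line from old to new for visualization
    old_gray = gray
    pts = good_new.reshape(-1, 1, 2)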

cap.release()

Re-seed goodFeaturesToTrack periodically if tracks drift or disappear.
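A re-seeding rule can be as simple as a count and an age threshold. The sketch below is one possible policy, not an OpenCV feature; the function name, the 50% survival fraction, and the 30-frame maximum age are all assumptions to tune per application.

```python
def should_reseed(n_alive, n_target, frames_since_seed,
                  min_fraction=0.5, max_age=30):
    """Return True when too few tracks survive or the seed is too old."""
    too_few = n_alive < min_fraction * n_target
    too_old = frames_since_seed >= max_age
    return too_few or too_old
```

Inside the tracking loop, n_alive would be len(good_new) and n_target the maxCorners value passed to goodFeaturesToTrack.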

Dense: Farneback

# prev_gray, next_gray: consecutive single-channel uint8 frames of the same size
flow = cv2.calcOpticalFlowFarneback(
    prev_gray, next_gray, None,
    pyr_scale=0.5, levels=3, winsize=15, iterations=3,
    poly_n=5, poly_sigma=1.2, flags=0,
)
# flow: float32, shape (H, W, 2); flow[..., 0] = dx, flow[..., 1] = dy, in pixels
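A dense flow array is easy to summarize with NumPy alone. The sketch below assumes a flow field of shape (H, W, 2) as described above and uses a synthetic field instead of a real Farneback result; the function name and the 1-pixel threshold are assumptions.

```python
import numpy as np


def motion_summary(flow, thresh=1.0):
    """Return mean speed in pixels/frame and a boolean mask of moving pixels."""
    dx, dy = flow[..., 0], flow[..., 1]
    mag = np.hypot(dx, dy)               # per-pixel displacement length
    return float(mag.mean()), mag > thresh


# Synthetic field: a 2x2 patch moving by (3, 4) pixels on a static background.
flow = np.zeros((4, 4, 2), dtype=np.float32)
flow[1:3, 1:3] = (3.0, 4.0)
mean_speed, moving = motion_summary(flow)
```

The mask can be used to restrict later processing, such as detection, to regions that actually move.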

Visualize flow as HSV

mag, ang = cv2.cartToPolar(flow[..., 0], flow[..., 1])  # magnitude, angle in radians
hsv = np.zeros((*flow.shape[:2], 3), dtype=np.uint8)
hsv[..., 0] = ang * 180 / np.pi / 2  # hue = direction; degrees halved to fit OpenCV's 8-bit hue
hsv[..., 1] = 255                    # full saturation
hsv[..., 2] = cv2.normalize(mag, None, 0, 255, cv2.NORM_MINMAX)  # value = speed
bgr = cv2.cvtColor(hsv, cv2.COLOR_HSV2BGR)

Horn–Schunck (idea)

Minimizes E(u, v) = ∬ [(I_x u + I_y v + I_t)² + λ(|∇u|² + |∇v|²)] dx dy, which yields a smooth, dense flow field. It is the classic global method. It is not in core OpenCV Python as a single call, so implement it with an iterative scheme or use a specialized library.
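As an illustration of the iterative scheme, here is a minimal NumPy sketch of the classic Jacobi-style Horn–Schunck update. It makes several assumptions: gradients come from np.gradient on the average of the two frames, the local flow average uses the 4 direct neighbours, and the parameter lam plays the role of λ only up to a constant factor from the discrete Laplacian. The function names are hypothetical, and the test image is synthetic.

```python
import numpy as np


def neighbor_mean(f):
    """Average of the 4 direct neighbours, replicating values at the border."""
    p = np.pad(f, 1, mode="edge")
    return (p[:-2, 1:-1] + p[2:, 1:-1] + p[1:-1, :-2] + p[1:-1, 2:]) / 4.0


def horn_schunck(im1, im2, lam=0.01, n_iter=500):
    """Estimate dense flow (u, v) from im1 to im2 with Horn-Schunck iterations."""
    im1 = np.asarray(im1, dtype=np.float64)
    im2 = np.asarray(im2, dtype=np.float64)
    Iy, Ix = np.gradient(0.5 * (im1 + im2))  # axis 0 is y (rows), axis 1 is x
    It = im2 - im1                           # temporal derivative
    u = np.zeros_like(im1)
    v = np.zeros_like(im1)
    denom = lam + Ix ** 2 + Iy ** 2
    for _ in range(n_iter):
        u_bar = neighbor_mean(u)
        v_bar = neighbor_mean(v)
        # Brightness-constancy residual of the smoothed flow, scaled per pixel
        resid = (Ix * u_bar + Iy * v_bar + It) / denom
        u = u_bar - Ix * resid
        v = v_bar - Iy * resid
    return u, v


# Synthetic check: a smooth pattern shifted one pixel to the right (+x).
yy, xx = np.mgrid[0:32, 0:32]
im1 = np.sin(2 * np.pi * xx / 16) + np.sin(2 * np.pi * yy / 16)
im2 = np.roll(im1, 1, axis=1)
u, v = horn_schunck(im1, im2)
```

On this pattern the recovered u should sit near 1 pixel and v near 0 over most of the image. Real footage with larger motion would need a coarse-to-fine wrapper around the same update.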

Takeaways

  • Sparse LK: fast, needs texture; pyramids extend range.
  • Farneback: dense field; heavier per frame.
  • Deep learning flow (RAFT, PWC-Net) often wins on accuracy for hard motion.

Quick FAQ

Aperture problem: along an edge, only the motion component normal to the edge is observable locally; corners and texture reduce the ambiguity.
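The ambiguity can be made concrete with the brightness-constancy equation I_x u + I_y v + I_t = 0. From a single pixel, only the flow component along the gradient can be solved for. The sketch below uses hypothetical names and a hand-made example of a vertical edge moving diagonally.

```python
import numpy as np


def normal_flow(Ix, Iy, It, eps=1e-8):
    """Flow component along the image gradient: -It * grad(I) / |grad(I)|^2."""
    scale = -It / (Ix ** 2 + Iy ** 2 + eps)
    return scale * Ix, scale * Iy


# Vertical edge: the gradient points along x only. True motion is (1, 1).
Ix, Iy = np.array(1.0), np.array(0.0)
true_u, true_v = 1.0, 1.0
It = -(Ix * true_u + Iy * true_v)    # what brightness constancy predicts
un, vn = normal_flow(Ix, Iy, It)
```

The recovered vector is close to (1, 0): the motion along the edge, here the y component, leaves no trace in the local measurements.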

Large displacements: increase the number of pyramid levels, reduce the interval between frames, or use coarse-to-fine or deep optical flow.
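A rough way to pick the number of pyramid levels: each level halves the apparent displacement, so choose enough levels that the motion at the coarsest level fits inside the search window. This is a rule of thumb assumed here for illustration, not a rule taken from OpenCV, and the function name is hypothetical.

```python
import math


def pyramid_levels_for(max_disp_px, win_size=21):
    """Smallest level count L with max_disp_px / 2**L <= win_size // 2."""
    radius = win_size // 2
    if max_disp_px <= radius:
        return 0                          # the window already covers the motion
    return math.ceil(math.log2(max_disp_px / radius))


levels = pyramid_levels_for(60, win_size=21)
```

With a 21-pixel window and motion up to 60 pixels, this gives 3 levels, since 60 / 8 = 7.5 fits within the 10-pixel radius.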

Action recognition

Clip tensor shape

Many 3D CNNs expect input of shape [N, C, T, H, W]: batch, channels (usually 3 for RGB), T frames, height, width. Sample frames uniformly or with a fixed stride across the segment, then resize and crop to the model’s training resolution (often 112×112 or 224×224).

import torch

# Example: 16 frames, 3 channels, 112x112
clip = torch.randn(1, 3, 16, 112, 112)
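To show how a decoded video maps onto that layout, here is a NumPy sketch of uniform temporal sampling. It assumes frames arrive as (num_frames, H, W, 3) uint8 RGB, as read_video returns them, and it leaves out resizing, cropping, and normalization. The function name is hypothetical.

```python
import numpy as np


def sample_clip(frames, T=16):
    """Uniformly sample T frames and return an array of shape (1, 3, T, H, W)."""
    n = frames.shape[0]
    idx = np.linspace(0, n - 1, num=T).round().astype(int)  # evenly spaced indices
    clip = frames[idx].astype(np.float32) / 255.0           # (T, H, W, 3) in [0, 1]
    clip = clip.transpose(3, 0, 1, 2)                       # (3, T, H, W)
    return clip[None]                                       # add batch axis N=1


frames = np.zeros((40, 8, 8, 3), dtype=np.uint8)            # stand-in for a video
clip_np = sample_clip(frames, T=16)
```

torch.from_numpy(clip_np) would then give a tensor in the [N, C, T, H, W] layout. If the video has fewer than T frames, some indices repeat.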

torchvision: R3D-18 (Kinetics-400)

from torchvision.models.video import r3d_18, R3D_18_Weights

weights = R3D_18_Weights.KINETICS400_V1
model = r3d_18(weights=weights).eval()
preprocess = weights.transforms()

# clip: [1, 3, T, H, W] after preprocess
with torch.no_grad():
    logits = model(clip)
probs = logits.softmax(1)
top = int(probs.argmax(1))
print(weights.meta["categories"][top])

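weights.transforms() returns the preprocessing preset: it accepts frames as (T, C, H, W) or (B, T, C, H, W), resizes, center-crops, and normalizes them, and returns (C, T, H, W) or (B, C, T, H, W). Check the weights documentation for the crop size and clip length used in training for your torchvision version.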

Other torchvision video models

from torchvision.models.video import mc3_18, s3d, MC3_18_Weights, S3D_Weights

# Mixed convolution (MC3), Separable 3D (S3D) — same pattern as R3D: weights + transforms()
# Newer torchvision builds may also expose MViT-style video transformers; check the docs.

API names vary slightly between torchvision releases; run dir(torchvision.models.video) to see what your install exposes.

Two-stream idea

One stream takes RGB frames for appearance; the other takes stacks of optical-flow fields for motion. Late fusion averages the two streams' class scores or learns to combine them. It remains a useful conceptual baseline, even though end-to-end 3D networks came to dominate many benchmarks.
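Late fusion by averaging can be sketched in a few lines. The example assumes each stream has already produced a logit vector over the same classes; the function names, the equal weighting, and the toy numbers are assumptions.

```python
import numpy as np


def softmax(x):
    """Numerically stable softmax over the last axis."""
    z = x - x.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def late_fusion(rgb_logits, flow_logits, w_rgb=0.5):
    """Weighted average of the per-stream class probabilities."""
    return w_rgb * softmax(rgb_logits) + (1.0 - w_rgb) * softmax(flow_logits)


rgb_logits = np.array([2.0, 0.0, 0.0])    # appearance stream favours class 0
flow_logits = np.array([0.0, 3.0, 0.0])   # motion stream favours class 1
fused = late_fusion(rgb_logits, flow_logits)
```

Here the more confident motion stream wins and the fused prediction is class 1. A learned fusion would replace the fixed weight with a small trainable layer.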

Takeaways

  • Temporal receptive field matters: short clips may miss context.
  • Pretrained Kinetics weights transfer to smaller datasets via fine-tuning.
  • Consider compute: 3D convs and transformers are heavier than 2D per-frame models.

Quick FAQ

Variable-length videos: sample a fixed T per clip, run several clips and average their logits, or use a model with temporal pooling or attention over many frames.
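The multi-clip option amounts to choosing evenly spaced clip start positions and averaging the per-clip logits. The sketch below uses hypothetical function names and random stand-in logits instead of a real model.

```python
import numpy as np


def clip_starts(n_frames, T, num_clips):
    """Evenly spaced start indices for num_clips windows of T frames."""
    last = max(n_frames - T, 0)            # last valid start position
    return np.linspace(0, last, num=num_clips).round().astype(int).tolist()


def average_logits(per_clip_logits):
    """Mean of the logit vectors predicted for each clip."""
    return np.mean(np.stack(per_clip_logits), axis=0)


starts = clip_starts(n_frames=100, T=16, num_clips=4)
rng = np.random.default_rng(0)
fake_logits = [rng.normal(size=400) for _ in starts]   # stand-in for model(clip)
video_logits = average_logits(fake_logits)
```

For a 100-frame video with 16-frame clips, the four windows start at frames 0, 28, 56, and 84.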

I3D inflates 2D Inception weights to 3D; R(2+1)D factorizes a 3D kernel into 2D spatial + 1D temporal for efficiency—both are standard 3D CNN families.
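The R(2+1)D factorization can be checked with parameter arithmetic. A full t×d×d 3D convolution is replaced by a 1×d×d spatial convolution into M intermediate channels followed by a t×1×1 temporal convolution, and the R(2+1)D paper picks M so that the parameter count roughly matches the full 3D layer. The sketch ignores biases, and the function names are hypothetical.

```python
def conv3d_params(c_in, c_out, t=3, d=3):
    """Weights in a full t x d x d 3D convolution (no bias)."""
    return t * d * d * c_in * c_out


def r2plus1d_mid_channels(c_in, c_out, t=3, d=3):
    """Intermediate width M that approximately matches the 3D parameter count."""
    return (t * d * d * c_in * c_out) // (d * d * c_in + t * c_out)


def r2plus1d_params(c_in, c_out, t=3, d=3):
    """Weights in the 1 x d x d spatial conv plus the t x 1 x 1 temporal conv."""
    m = r2plus1d_mid_channels(c_in, c_out, t, d)
    return d * d * c_in * m + t * m * c_out


full = conv3d_params(64, 64)
mid = r2plus1d_mid_channels(64, 64)
factored = r2plus1d_params(64, 64)
```

For 64 input and 64 output channels this gives M = 144 and 110,592 weights in both variants. The factorized block also gains an extra nonlinearity between the two convolutions.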
