Video processing
OpenCV: read a file
import cv2
cap = cv2.VideoCapture("clip.mp4")
if not cap.isOpened():
    raise RuntimeError("cannot open video")
fps = cap.get(cv2.CAP_PROP_FPS)
w = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
h = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))
n = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
print(fps, w, h, n)
while True:
    ok, frame = cap.read()
    if not ok:
        break
    # frame: BGR uint8, shape (h, w, 3)
cap.release()
Always call release() (or use a context-style pattern) so file handles and camera devices are freed.
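One way to realise the context-style pattern is a small wrapper built on contextlib. This is a minimal sketch rather than anything OpenCV ships: opened_capture is a hypothetical helper name, and it assumes only that the capture object exposes isOpened() and release(), as cv2.VideoCapture does.

```python
from contextlib import contextmanager


@contextmanager
def opened_capture(cap):
    """Yield an opened capture and always release it afterwards.

    `cap` is any object with isOpened() and release(), for example
    cv2.VideoCapture("clip.mp4"). The helper name is hypothetical.
    """
    try:
        if not cap.isOpened():
            raise RuntimeError("cannot open video")
        yield cap
    finally:
        # Runs on normal exit, on break, and when the body raises.
        cap.release()


# Usage (assumes cv2 is imported):
# with opened_capture(cv2.VideoCapture("clip.mp4")) as cap:
#     ok, frame = cap.read()
```

A plain try/finally around the read loop achieves the same thing; the wrapper only saves repeating it.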
Frame index and seek
cap.set(cv2.CAP_PROP_POS_FRAMES, 120) # jump to frame 120
ok, frame = cap.read()
ms = cap.get(cv2.CAP_PROP_POS_MSEC) # position in ms (if available)
Seek accuracy depends on codec and container; with keyframe-only seeking, the read after a seek can return the nearest keyframe instead of the requested frame.
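When an exact frame matters more than speed, one fallback is to skip seeking and step through the file from the start. The sketch below assumes a freshly opened capture positioned at frame 0; read_frame_sequential is a hypothetical helper name, and it uses only grab() and read(), which cv2.VideoCapture provides.

```python
def read_frame_sequential(cap, index):
    """Return frame number `index` (0-based) by stepping from the start.

    Assumes `cap` was just opened and has not been read from yet.
    grab() advances by one frame without returning the image, so only
    the target frame is handed back by read(). Returns None if the
    video ends before `index` is reached.
    """
    if index < 0:
        raise ValueError("index must be non-negative")
    for _ in range(index):
        if not cap.grab():
            return None
    ok, frame = cap.read()
    return frame if ok else None


# Usage (assumes cv2 is imported):
# frame = read_frame_sequential(cv2.VideoCapture("clip.mp4"), 120)
```

The cost grows linearly with the frame index, so this suits spot checks and short clips more than random access over long files.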
Webcam and backend
cap = cv2.VideoCapture(0, cv2.CAP_DSHOW) # Windows: DirectShow optional
cap.set(cv2.CAP_PROP_FRAME_WIDTH, 1280)
cap.set(cv2.CAP_PROP_FRAME_HEIGHT, 720)
if not cap.isOpened():
    raise RuntimeError("camera not available")
Write MP4 (example)
fourcc = cv2.VideoWriter_fourcc(*"mp4v")
out = cv2.VideoWriter("out.mp4", fourcc, 30.0, (640, 480))
# ... for each frame BGR:
# out.write(frame)
out.release()
Codec fourcc must match what your OpenCV build supports; on some systems avc1 or H264 works better than mp4v.
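A fourcc is just four ASCII characters packed into one integer, lowest byte first. The sketch below reproduces that packing in plain Python and also reverses it, which can help when inspecting the value returned by cap.get(cv2.CAP_PROP_FOURCC) to see which codec a source file uses. The helper names are hypothetical; to my understanding cv2.VideoWriter_fourcc uses the same byte layout.

```python
def fourcc_to_int(code):
    """Pack a four-character code into an integer, lowest byte first."""
    if len(code) != 4:
        raise ValueError("fourcc must be exactly four characters")
    return sum((ord(ch) & 0xFF) << (8 * i) for i, ch in enumerate(code))


def int_to_fourcc(value):
    """Unpack an integer (e.g. from CAP_PROP_FOURCC) into four characters."""
    value = int(value)  # cap.get() returns a float
    return "".join(chr((value >> (8 * i)) & 0xFF) for i in range(4))


print(hex(fourcc_to_int("mp4v")))            # 0x7634706d
print(int_to_fourcc(fourcc_to_int("avc1")))  # avc1
```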
torchvision: read_video
from torchvision.io import read_video
video, audio, info = read_video("clip.mp4", start_pts=0, end_pts=4, pts_unit="sec")
# video: (T, H, W, C) uint8 in RGB order
print(video.shape, info)
Useful for training clips; for long files prefer decoders that stream frames to limit RAM.
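For long files, a streaming alternative is to pull frames one at a time and hand them on in small batches, so that only one batch is in memory at once. The sketch below is decoder-agnostic: iter_capture and iter_chunks are hypothetical helper names, and iter_capture assumes only a read() method that returns (ok, frame), as cv2.VideoCapture does.

```python
from itertools import islice


def iter_capture(cap):
    """Yield frames from a cv2.VideoCapture-like object until read() fails."""
    while True:
        ok, frame = cap.read()
        if not ok:
            return
        yield frame


def iter_chunks(frames, chunk_size):
    """Yield lists of at most `chunk_size` frames from any frame iterable."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    it = iter(frames)
    while True:
        chunk = list(islice(it, chunk_size))
        if not chunk:
            return
        yield chunk


print(list(iter_chunks(range(7), 3)))  # [[0, 1, 2], [3, 4, 5], [6]]
```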
Takeaways
- BGR in OpenCV vs RGB in many deep models: convert with cv2.cvtColor when needed.
- Temporal methods need consistent FPS or explicit timestamps.
- Next: optical flow ties neighboring frames through motion fields.
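To make the point about consistent FPS concrete, the sketch below picks which source frames to keep so that clips recorded at different rates end up on a common rate. It is nearest-earlier-frame selection under the assumption of a constant source frame rate, with no interpolation; resample_indices is a hypothetical helper name.

```python
def resample_indices(n_frames, src_fps, dst_fps):
    """Pick source frame indices that approximate `dst_fps` playback.

    Output frame k is shown at time k / dst_fps, and the source frame
    on screen at that time is floor(k * src_fps / dst_fps).
    """
    if n_frames < 0 or src_fps <= 0 or dst_fps <= 0:
        raise ValueError("need n_frames >= 0 and positive frame rates")
    n_out = int(n_frames * dst_fps / src_fps)
    return [min(int(k * src_fps / dst_fps), n_frames - 1) for k in range(n_out)]


print(resample_indices(10, 30, 15))  # [0, 2, 4, 6, 8]
print(resample_indices(4, 15, 30))   # [0, 0, 1, 1, 2, 2, 3, 3]
```

Variable-frame-rate sources break the constant-rate assumption; for those, per-frame timestamps are the safer basis.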
Quick FAQ
cv2.CAP_PROP_CONVERT_RGB, 0 where supported, or cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY) after read.
Optical flow
Sparse: Lucas–Kanade + pyramid
import cv2
import numpy as np
cap = cv2.VideoCapture("clip.mp4")
ret, old = cap.read()
old_gray = cv2.cvtColor(old, cv2.COLOR_BGR2GRAY)
pts = cv2.goodFeaturesToTrack(old_gray, maxCorners=200, qualityLevel=0.01, minDistance=7, blockSize=7)
lk_params = dict(
    winSize=(21, 21),
    maxLevel=3,
    criteria=(cv2.TERM_CRITERIA_EPS | cv2.TERM_CRITERIA_COUNT, 30, 0.01),
)
while True:
    ret, frame = cap.read()
    if not ret:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    new_pts, st, err = cv2.calcOpticalFlowPyrLK(old_gray, gray, pts, None, **lk_params)
    # st is 1 where a point was tracked successfully
    good_new = new_pts[st == 1]
    good_old = pts[st == 1]
    if len(good_new) == 0:
        break  # all tracks lost; re-detect features before continuing
    # draw line from old to new for visualization
    old_gray = gray.copy()
    pts = good_new.reshape(-1, 1, 2)
cap.release()
Re-seed goodFeaturesToTrack periodically if tracks drift or disappear.
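A simple re-seeding policy combines two triggers: too few surviving points, and a fixed frame interval to limit drift. The sketch below is one possible rule, not a recommendation from OpenCV; should_reseed is a hypothetical helper name and both thresholds are illustrative values that would need tuning per video.

```python
def should_reseed(n_tracked, frame_idx, min_tracks=50, interval=30):
    """Decide whether to call goodFeaturesToTrack again.

    Re-seed when too few points survive, or every `interval` frames to
    limit drift. Both thresholds are illustrative, not tuned values.
    """
    too_few = n_tracked < min_tracks
    periodic = frame_idx > 0 and frame_idx % interval == 0
    return too_few or periodic


# Usage inside the tracking loop (frame_idx counts frames read so far):
# if should_reseed(len(good_new), frame_idx):
#     pts = cv2.goodFeaturesToTrack(gray, maxCorners=200, qualityLevel=0.01,
#                                   minDistance=7, blockSize=7)
```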
Dense: Farneback
flow = cv2.calcOpticalFlowFarneback(
    prev_gray, next_gray, None,
    pyr_scale=0.5, levels=3, winsize=15, iterations=3,
    poly_n=5, poly_sigma=1.2, flags=0,
)
# flow shape (H, W, 2): flow[...,0] = dx, flow[...,1] = dy
Visualize flow as HSV
mag, ang = cv2.cartToPolar(flow[..., 0], flow[..., 1])
hsv = np.zeros((*flow.shape[:2], 3), dtype=np.uint8)
hsv[..., 0] = ang * 180 / np.pi / 2
hsv[..., 1] = 255
hsv[..., 2] = cv2.normalize(mag, None, 0, 255, cv2.NORM_MINMAX)
bgr = cv2.cvtColor(hsv, cv2.COLOR_HSV2BGR)
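The division by 2 in the hue line exists because 8-bit HSV images in OpenCV store hue in the range 0 to 179, so a full 360 degree turn has to be halved. The scalar sketch below mirrors that mapping for a single flow vector using only the math module; flow_to_hue is a hypothetical helper name, and it rounds where the array assignment above would truncate.

```python
import math


def flow_to_hue(dx, dy):
    """Map a flow vector's direction to OpenCV's 8-bit hue range [0, 180)."""
    # atan2 gives -pi..pi; wrap to 0..360 degrees like cartToPolar's angle
    angle_deg = math.degrees(math.atan2(dy, dx)) % 360.0
    return int(round(angle_deg / 2)) % 180


# Image coordinates: x grows to the right, y grows downwards.
directions = {"right": (1, 0), "down": (0, 1), "left": (-1, 0), "up": (0, -1)}
for name, (dx, dy) in directions.items():
    print(name, flow_to_hue(dx, dy))
# right 0, down 45, left 90, up 135
```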
Horn–Schunck (idea)
Minimizes E(u, v) = ∬ [(I_x u + I_y v + I_t)² + λ(|∇u|² + |∇v|²)] dx dy for smooth, dense flow. It is the classic global method; core OpenCV Python does not expose it as a single call, so implement it with an iterative scheme or use a specialized library.
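The iterative scheme can be sketched in NumPy. Setting the derivative of the energy to zero and approximating the Laplacian by "neighbour average minus centre value" gives the update u = ū − I_x (I_x ū + I_y v̄ + I_t) / (λ + I_x² + I_y²), and likewise for v with I_y. The code below is a simplified illustration under stated assumptions: central-difference gradients and a 4-neighbour average stand in for the stencils of the original paper, constant factors are absorbed into lam, and the function names and default values are hypothetical.

```python
import numpy as np


def neighbor_mean(f):
    """Mean of the four direct neighbours, with edge values replicated."""
    p = np.pad(f, 1, mode="edge")
    return (p[:-2, 1:-1] + p[2:, 1:-1] + p[1:-1, :-2] + p[1:-1, 2:]) / 4.0


def horn_schunck(prev_gray, next_gray, lam=1.0, n_iter=100):
    """Estimate dense flow (u, v) between two grayscale frames."""
    a = prev_gray.astype(np.float64)
    b = next_gray.astype(np.float64)
    # np.gradient returns derivatives along axis 0 (y) then axis 1 (x)
    Iy, Ix = np.gradient((a + b) / 2.0)
    It = b - a
    u = np.zeros_like(a)
    v = np.zeros_like(a)
    denom = lam + Ix**2 + Iy**2
    for _ in range(n_iter):
        u_bar, v_bar = neighbor_mean(u), neighbor_mean(v)
        # Brightness-constancy residual evaluated at the averaged flow
        resid = (Ix * u_bar + Iy * v_bar + It) / denom
        u = u_bar - Ix * resid
        v = v_bar - Iy * resid
    return u, v
```

Larger lam gives smoother fields at the cost of blurring motion boundaries; the scheme also assumes small displacements, so large motions would need a coarse-to-fine pyramid around it.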
Takeaways
- Sparse LK: fast, needs texture; pyramids extend range.
- Farneback: dense field; heavier per frame.
- Deep learning flow (RAFT, PWC-Net) often wins on accuracy for hard motion.
Action recognition
Clip tensor shape
Many 3D CNNs expect input [N, C, T, H, W]: batch, channels (usually 3), T RGB frames, height, width. Uniformly sample or stride frames across the segment, resize/crop to the model’s training resolution (often 112×112 or 224×224).
import torch
# Example: 16 frames, 3 channels, 112x112
clip = torch.randn(1, 3, 16, 112, 112)
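Uniform sampling, mentioned above, can be reduced to choosing T frame indices before any decoding happens. The sketch below takes one index from the centre of each of T equal segments; sample_indices is a hypothetical helper name, and repeating indices for videos shorter than T frames is just one of several reasonable padding choices.

```python
def sample_indices(num_frames, T):
    """Return T frame indices spread evenly over num_frames frames.

    Each index comes from the centre of one of T equal segments; when
    the video has fewer than T frames, indices repeat.
    """
    if num_frames < 1 or T < 1:
        raise ValueError("num_frames and T must be positive")
    return [int((i + 0.5) * num_frames / T) for i in range(T)]


print(sample_indices(100, 5))  # [10, 30, 50, 70, 90]
print(sample_indices(3, 6))    # [0, 0, 1, 1, 2, 2]
```

The selected frames would then be converted to RGB, resized, stacked, and permuted into the [N, C, T, H, W] layout.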
torchvision: R3D-18 (Kinetics-400)
from torchvision.models.video import r3d_18, R3D_18_Weights
weights = R3D_18_Weights.KINETICS400_V1
model = r3d_18(weights=weights).eval()
preprocess = weights.transforms()
# clip: [1, 3, T, H, W] after preprocess
with torch.no_grad():
    logits = model(clip)
probs = logits.softmax(1)
top = int(probs.argmax(1))
print(weights.meta["categories"][top])
Check weights.transforms() for required T and spatial size for your torchvision version.
Other torchvision video models
from torchvision.models.video import mc3_18, s3d, MC3_18_Weights, S3D_Weights
# Mixed convolution (MC3), Separable 3D (S3D) — same pattern as R3D: weights + transforms()
# Newer torchvision builds may also expose MViT-style video transformers; check the docs.
API names vary slightly between torchvision releases; run dir(torchvision.models.video) locally to see what your install exposes.
Two-stream idea
One stream ingests RGB frames for appearance; another ingests stacked optical flow volumes for motion. Late fusion averages the two streams' logits or learns how to combine them. It remains a strong conceptual baseline, even though end-to-end 3D nets have since come to dominate many benchmarks.
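Late fusion itself is only a few lines once each stream has produced per-class scores. The sketch below blends softmax probabilities rather than raw logits, a variant that sidesteps the two streams having different logit scales; the function names, the example scores, and the equal default weighting are all illustrative assumptions.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def late_fusion(rgb_logits, flow_logits, rgb_weight=0.5):
    """Blend per-class probabilities from the RGB and flow streams."""
    if len(rgb_logits) != len(flow_logits):
        raise ValueError("both streams must score the same classes")
    p_rgb, p_flow = softmax(rgb_logits), softmax(flow_logits)
    return [rgb_weight * a + (1.0 - rgb_weight) * b for a, b in zip(p_rgb, p_flow)]


# Made-up scores for three classes
fused = late_fusion([2.0, 0.5, 0.1], [0.2, 1.5, 0.1])
predicted = max(range(len(fused)), key=fused.__getitem__)
print(predicted, fused)
```

A learned combination would replace rgb_weight with parameters fitted on held-out data.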
Takeaways
- Temporal receptive field matters: short clips may miss context.
- Pretrained Kinetics weights transfer to smaller datasets via fine-tuning.
- Consider compute: 3D convs and transformers are heavier than 2D per-frame models.
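One way to work around a short temporal receptive field without changing the model is to score several clips from the same video and average the results. The sketch below covers only the bookkeeping, under the assumption that each clip is scored separately by some model; clip_starts and average_scores are hypothetical helper names.

```python
def clip_starts(num_frames, clip_len, stride):
    """Start indices of sliding windows of clip_len frames.

    A video no longer than clip_len yields a single window starting at
    0; the caller is expected to pad or repeat frames in that case.
    """
    if clip_len < 1 or stride < 1:
        raise ValueError("clip_len and stride must be positive")
    if num_frames <= clip_len:
        return [0]
    return list(range(0, num_frames - clip_len + 1, stride))


def average_scores(per_clip_scores):
    """Average a list of equal-length score vectors element-wise."""
    n = len(per_clip_scores)
    return [sum(col) / n for col in zip(*per_clip_scores)]


print(clip_starts(64, 16, 16))                   # [0, 16, 32, 48]
print(average_scores([[1.0, 3.0], [3.0, 5.0]]))  # [2.0, 4.0]
```

A stride smaller than clip_len gives overlapping clips, which costs more compute but smooths predictions near clip boundaries.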
Quick FAQ
T, use multiple clips + average logits, or use models with temporal pooling / attention over many frames.