Video Analysis MCQ
Video processing fundamentals, optical flow, and action recognition in video.
Video Processing MCQ
Vision meets time
Video is a sequence of frames, which can also be treated as a single space-time volume. Pipelines sample clips at a chosen frame rate and may then compute optical flow, apply 3D convolutions, or run transformers over space-time tokens. Tasks include action recognition, detection in video, and video generation.
Temporal context
Single frames may be ambiguous; neighboring frames disambiguate motion and actions.
Key ideas
Frame sampling
Frame rate (FPS), sampling stride, and clip length trade compute cost against how much motion information the model sees.
3D convolution
Extends convolution kernels across time as well as space, so a single operation filters both.
Two-stream
An RGB stream and an optical-flow stream whose outputs are fused to predict the action.
Memory
RNNs or attention layers aggregate per-frame CNN features over time.
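To make the 3D convolution idea concrete, here is a minimal sketch of a naive single-channel version in NumPy. The function name `conv3d_valid` is hypothetical, and the sketch assumes no padding, stride 1, and the cross-correlation convention commonly used by deep learning libraries. Real models use optimized multi-channel implementations.

```python
import numpy as np


def conv3d_valid(clip, kernel):
    """Naive single-channel 3D cross-correlation (no padding, stride 1).

    clip:   array of shape (T, H, W)
    kernel: array of shape (kt, kh, kw)
    returns array of shape (T - kt + 1, H - kh + 1, W - kw + 1)
    """
    T, H, W = clip.shape
    kt, kh, kw = kernel.shape
    out = np.zeros((T - kt + 1, H - kh + 1, W - kw + 1))
    for t in range(out.shape[0]):
        for y in range(out.shape[1]):
            for x in range(out.shape[2]):
                # One space-time window, weighted by the kernel and summed.
                window = clip[t:t + kt, y:y + kh, x:x + kw]
                out[t, y, x] = np.sum(window * kernel)
    return out
```

A kernel of shape (2, 1, 1) with values -1 and 1 acts as a frame difference, so it returns zeros on a clip whose frames are all identical.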
Simple pipeline
decode clip → preprocess → temporal model → task head
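The frame-sampling choice at the start of such a pipeline can be sketched in a few lines of plain Python. The helper names `sample_clip_indices` and `temporal_span_seconds` are hypothetical, and clamping to the last frame is just one assumed way to handle videos shorter than the requested clip.

```python
def sample_clip_indices(num_frames, clip_len, stride, start=0):
    """Return `clip_len` frame indices spaced `stride` apart, beginning at `start`.

    Indices past the end of the video are clamped to the last frame
    (an assumed padding choice for short videos).
    """
    if num_frames <= 0 or clip_len <= 0 or stride <= 0:
        raise ValueError("num_frames, clip_len and stride must be positive")
    last = num_frames - 1
    return [min(start + i * stride, last) for i in range(clip_len)]


def temporal_span_seconds(clip_len, stride, fps):
    """Seconds of video between the first and last sampled frame."""
    return (clip_len - 1) * stride / fps
```

For example, 8 frames at stride 4 from a 30 FPS video span 28/30 of a second. A larger stride covers more motion for the same number of frames, at the cost of skipping fast changes.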
Optical Flow MCQ
Optical flow basics
Optical flow assigns a 2D displacement to each pixel between frames. Brightness constancy assumes pixel intensity is preserved along motion; combined with spatial smoothness (Horn–Schunck) or local linearization (Lucas–Kanade), it yields estimators. Deep networks now predict dense flow end-to-end (FlowNet, RAFT).
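The linearized form of brightness constancy follows from a first-order Taylor expansion. A short derivation, assuming a unit time step between frames and writing I_x, I_y, I_t for the partial derivatives of the intensity I and (u, v) for the flow at a pixel:

```latex
\begin{align}
I(x + u,\, y + v,\, t + 1) &= I(x, y, t)
  && \text{(brightness constancy)} \\
I(x + u,\, y + v,\, t + 1) &\approx I(x, y, t) + I_x u + I_y v + I_t
  && \text{(first-order Taylor expansion)} \\
I_x u + I_y v + I_t &\approx 0
  && \text{(combining the two)}
\end{align}
```

This gives one equation in two unknowns per pixel, which is why a smoothness or local-constancy assumption is needed to recover both u and v.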
Aperture problem
Along a 1D edge, only the flow component normal to the edge is observable locally; additional constraints are needed to resolve the ambiguity.
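The observable component can be written down directly from the first-order constraint I_x u + I_y v + I_t ≈ 0. In the sketch below, the vector symbols w and w_n are notation introduced here for illustration:

```latex
\[
\nabla I \cdot \mathbf{w} = -I_t,
\qquad \nabla I = (I_x, I_y)^\top,
\quad \mathbf{w} = (u, v)^\top
\]
\[
\mathbf{w}_n = -\frac{I_t}{\lVert \nabla I \rVert^{2}}\, \nabla I,
\qquad \lVert \nabla I \rVert \neq 0
\]
```

Any flow of the form w_n plus a vector perpendicular to the gradient satisfies the same constraint, so the component along the edge is left undetermined by local measurements.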
Key ideas
Brightness constancy
I_x u + I_y v + I_t ≈ 0 (first-order approximation; I_x, I_y, I_t are image derivatives and (u, v) is the flow).
Lucas–Kanade
Least-squares solve over a patch, assuming (u, v) is constant within it.
Horn–Schunck
Minimizes a brightness-constancy data term plus a global smoothness regularizer on the flow field.
Deep flow
CNNs regress flow from image pairs directly.
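As an illustration of the Lucas–Kanade idea listed above, here is a minimal single-patch sketch in NumPy. The function name `lucas_kanade_patch` is hypothetical. The sketch assumes small motion and uses simple finite differences for the derivatives, with no image pyramid, windowing weights, or iterative refinement.

```python
import numpy as np


def lucas_kanade_patch(frame1, frame2):
    """Estimate a single (u, v) for a whole patch by least squares.

    frame1, frame2: 2D grayscale arrays of the same shape.
    Returns (u, v) in pixels per frame: u along x (columns), v along y (rows).
    """
    f1 = np.asarray(frame1, dtype=float)
    f2 = np.asarray(frame2, dtype=float)
    # np.gradient returns derivatives along axis 0 (y) first, then axis 1 (x).
    Iy, Ix = np.gradient(f1)
    It = f2 - f1
    # One row of I_x u + I_y v = -I_t per pixel, solved in the least-squares sense.
    A = np.stack([Ix.ravel(), Iy.ravel()], axis=1)
    b = -It.ravel()
    (u, v), *_ = np.linalg.lstsq(A, b, rcond=None)
    return u, v
```

If the patch contains only a straight edge, the two columns of A are nearly proportional and the system is ill-conditioned, which is the aperture problem showing up in the linear algebra.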
From two frames
warp frame2 toward frame1 using estimated flow; minimize photometric error
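The warp-and-compare step can be sketched as follows. The names `warp_backward` and `photometric_error` are hypothetical, and the choices of bilinear interpolation, border clamping, and mean absolute error are assumptions made for illustration. Other interpolation schemes and robust losses are also used in practice.

```python
import numpy as np


def warp_backward(frame2, flow):
    """Sample frame2 at (x + u, y + v) using bilinear interpolation.

    frame2: (H, W) float array.
    flow:   (H, W, 2) array on frame1's pixel grid;
            flow[..., 0] is u (x displacement), flow[..., 1] is v (y displacement).
    Coordinates outside the image are clamped to the border (assumed choice).
    """
    H, W = frame2.shape
    ys, xs = np.mgrid[0:H, 0:W].astype(float)
    x = np.clip(xs + flow[..., 0], 0, W - 1)
    y = np.clip(ys + flow[..., 1], 0, H - 1)
    x0 = np.floor(x).astype(int)
    y0 = np.floor(y).astype(int)
    x1 = np.minimum(x0 + 1, W - 1)
    y1 = np.minimum(y0 + 1, H - 1)
    wx = x - x0
    wy = y - y0
    # Blend the four neighboring pixels.
    top = (1 - wx) * frame2[y0, x0] + wx * frame2[y0, x1]
    bottom = (1 - wx) * frame2[y1, x0] + wx * frame2[y1, x1]
    return (1 - wy) * top + wy * bottom


def photometric_error(frame1, frame2, flow):
    """Mean absolute difference between frame1 and the warped frame2."""
    return float(np.mean(np.abs(frame1 - warp_backward(frame2, flow))))
```

A good flow estimate makes the warped frame2 resemble frame1, so the error drops. Learning-based and variational methods minimize a quantity of this kind, usually together with a smoothness term.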
Action Recognition MCQ
Actions in video
Action recognition assigns a label (e.g. diving, waving) to short clips. Methods range from frame CNNs + temporal pooling to two-stream RGB and optical-flow fusion, 3D convolutions (C3D, I3D), and transformers over space-time tokens. Large datasets (Kinetics) drive supervised pretraining.
Why motion matters
Static frames can be ambiguous; temporal patterns distinguish many action classes.
Key ideas
Clip input
A fixed-length segment sampled from a longer video.
Two-stream
Separate networks for appearance (RGB) and motion (optical flow), with their outputs fused afterward.
3D CNN
Spatiotemporal filters learn motion templates.
Kinetics
Large-scale labeled clips for pretraining.
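One simple way to fuse the two streams is late fusion: each stream produces per-class scores and the scores are averaged. The sketch below assumes this weighted-average scheme. The function name `fuse_two_stream` and the `rgb_weight` parameter are hypothetical, and other fusion strategies exist.

```python
import numpy as np


def fuse_two_stream(rgb_scores, flow_scores, rgb_weight=0.5):
    """Weighted average of per-class scores from the RGB and optical-flow streams."""
    rgb_scores = np.asarray(rgb_scores, dtype=float)
    flow_scores = np.asarray(flow_scores, dtype=float)
    if rgb_scores.shape != flow_scores.shape:
        raise ValueError("both streams must score the same set of classes")
    return rgb_weight * rgb_scores + (1.0 - rgb_weight) * flow_scores
```

When the appearance stream is unsure but the motion stream is confident, the fused scores follow the motion stream, which is the point of keeping the two paths separate.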
Typical head
backbone features → temporal aggregate → softmax over action classes
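A head of this shape can be sketched in NumPy as below. The names `softmax` and `classify_clip` are hypothetical. Average pooling over time and a single linear layer are assumptions chosen for simplicity, and the backbone that produces the per-frame features is not shown.

```python
import numpy as np


def softmax(logits):
    """Numerically stable softmax over a 1D array of logits."""
    z = logits - np.max(logits)
    e = np.exp(z)
    return e / e.sum()


def classify_clip(frame_features, weights, bias):
    """Turn per-frame features into class probabilities for one clip.

    frame_features: (T, D) array, one feature vector per sampled frame.
    weights:        (D, C) linear classifier weights.
    bias:           (C,) classifier bias.
    Returns a (C,) array of probabilities that sums to 1.
    """
    clip_feature = frame_features.mean(axis=0)  # temporal average pooling
    logits = clip_feature @ weights + bias
    return softmax(logits)
```

Average pooling ignores frame order, so a head like this relies on the backbone (for example a 3D CNN) to have encoded motion in the features already.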