CV MCQ: Chapter 11
Video Analysis

Video Analysis MCQ

Video processing fundamentals, optical flow, and action recognition in video.

Video Processing MCQ

Vision meets time

Video is a sequence of frames, which can also be treated as a single space-time volume. Pipelines sample clips at a chosen frame rate and model time by optionally computing optical flow, or by applying 3D convolutions or transformers over space-time tokens. Tasks include action recognition, detection in video, and video generation.

Temporal context

Single frames may be ambiguous; neighboring frames disambiguate motion and actions.

Key ideas

Frame sampling

FPS, stride, and clip length trade compute against motion cues: denser sampling and longer clips capture more motion but cost more to process.

3D convolution

Extends kernels over time and space in one op.

Two-stream

RGB path + optical-flow path fused for actions.

Memory

RNNs or attention layers aggregate per-frame CNN features over time.
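
The frame sampling idea listed above can be made concrete with a small index calculation. This is a minimal sketch, not a reference implementation: the function names and the choice to clamp out-of-range indices to the last frame are assumptions made for illustration.

```python
def sample_clip_indices(num_frames, clip_len, stride, start=0):
    """Return frame indices for one clip: clip_len frames, `stride` apart.

    Indices past the end of the video are clamped to the last frame.
    This is one possible convention (looping the video is another).
    """
    if num_frames <= 0 or clip_len <= 0 or stride <= 0:
        raise ValueError("num_frames, clip_len and stride must be positive")
    return [min(start + i * stride, num_frames - 1) for i in range(clip_len)]


def clip_duration_seconds(clip_len, stride, fps):
    """Time span from the first to the last sampled frame of a clip."""
    return (clip_len - 1) * stride / fps


indices = sample_clip_indices(num_frames=300, clip_len=8, stride=4)
print(indices)  # [0, 4, 8, 12, 16, 20, 24, 28]
print(clip_duration_seconds(clip_len=8, stride=4, fps=30))
```

A larger stride covers more time with the same number of frames (less compute per second of video) but can skip over fast motion.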

Simple pipeline

decode clip → preprocess → temporal model → task head
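
The stages of this pipeline can be sketched as plain functions. Everything here is a toy stand-in chosen for illustration: decoding is assumed to have already produced grayscale frames as nested lists, the "temporal model" is just a mean frame difference, and the `threshold` value is an arbitrary assumption.

```python
def preprocess(clip):
    """Scale 8-bit pixel values to the range [0, 1]."""
    return [[[p / 255.0 for p in row] for row in frame] for frame in clip]


def temporal_model(clip):
    """Toy temporal model: mean absolute difference of consecutive frames."""
    if len(clip) < 2:
        raise ValueError("need at least two frames to measure change")
    feats = []
    for prev, cur in zip(clip, clip[1:]):
        diffs = [abs(a - b)
                 for row_p, row_c in zip(prev, cur)
                 for a, b in zip(row_p, row_c)]
        feats.append(sum(diffs) / len(diffs))
    return feats


def task_head(features, threshold=0.05):
    """Toy head: label the clip by its average frame-to-frame change."""
    score = sum(features) / len(features)
    return "moving" if score > threshold else "static"


def run_pipeline(decoded_clip):
    """preprocess -> temporal model -> task head, on an already decoded clip."""
    return task_head(temporal_model(preprocess(decoded_clip)))


static_clip = [[[10, 10], [10, 10]] for _ in range(3)]
moving_clip = [[[255, 0], [0, 0]], [[0, 255], [0, 0]], [[0, 0], [255, 0]]]
print(run_pipeline(static_clip), run_pipeline(moving_clip))
```

In a real system each stage would be replaced by a video decoder, a normalization and resizing step, a learned temporal network, and a trained classifier; only the order of the stages is taken from the text.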

Pro tip: For heavy videos, decode on the fly in parallel data-loading workers; for long-form video, cache sparsely sampled clips.

Optical Flow MCQ

Optical flow basics

Optical flow assigns a 2D displacement to each pixel between two frames. Brightness constancy assumes that a pixel's intensity is preserved as it moves. On its own this gives one equation for two unknowns per pixel, so it is combined with a global smoothness prior (Horn–Schunck) or a locally constant flow assumption (Lucas–Kanade) to obtain an estimator. Deep networks now predict dense flow end-to-end (FlowNet, RAFT).
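
The step from brightness constancy to a usable constraint can be written out as follows. This is a sketch under the usual assumptions of small displacement and a unit time step; the patch symbol P and the smoothness weight \alpha are notation introduced here, not taken from the text.

```latex
\begin{align*}
% Brightness constancy: the pixel at (x, y) moves by (u, v) in one time step.
I(x + u,\, y + v,\, t + 1) &= I(x, y, t) \\
% First-order Taylor expansion of the left-hand side around (x, y, t).
I(x + u,\, y + v,\, t + 1) &\approx I(x, y, t) + I_x u + I_y v + I_t \\
% Subtracting I(x, y, t) gives the optical flow constraint.
I_x u + I_y v + I_t &\approx 0 \\
% Lucas-Kanade: least squares over a patch P with constant (u, v).
(u, v)_{\mathrm{LK}} &= \arg\min_{u, v} \sum_{(x, y) \in P}
    \left( I_x u + I_y v + I_t \right)^2 \\
% Horn-Schunck: data term plus global smoothness weighted by \alpha^2.
E_{\mathrm{HS}}(u, v) &= \iint \left( I_x u + I_y v + I_t \right)^2
    + \alpha^2 \left( \lVert \nabla u \rVert^2 + \lVert \nabla v \rVert^2 \right)
    \, dx \, dy
\end{align*}
```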

Aperture problem

Along a 1D edge, only the flow component normal to the edge (parallel to the image gradient) is observable locally; additional constraints, such as smoothness or a larger neighborhood, resolve the ambiguity.
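
A small numeric sketch can illustrate the aperture problem. It assumes the first-order constraint I_x u + I_y v + I_t = 0 at a single pixel; the function names are hypothetical. The normal flow is the shortest flow vector that satisfies the constraint, and any motion along the edge can be added to it without changing the residual.

```python
def normal_flow(ix, iy, it, eps=1e-12):
    """Flow component along the image gradient: -I_t * grad(I) / |grad(I)|^2."""
    mag2 = ix * ix + iy * iy
    if mag2 < eps:
        return (0.0, 0.0)  # no gradient, so nothing is observable here
    scale = -it / mag2
    return (scale * ix, scale * iy)


def constraint_residual(ix, iy, it, u, v):
    """Left-hand side of the flow constraint; zero means (u, v) is consistent."""
    return ix * u + iy * v + it


# A vertical edge: intensity changes only along x, so I_y = 0.
ix, iy, it = 2.0, 0.0, -2.0
print(normal_flow(ix, iy, it))
# Two very different motions explain the same local measurements.
print(constraint_residual(ix, iy, it, 1.0, 0.0))
print(constraint_residual(ix, iy, it, 1.0, 5.0))
```

Both candidate flows give a zero residual, which is why a single pixel on an edge cannot reveal motion along the edge.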

Key ideas

Brightness constancy

First-order form: I_x u + I_y v + I_t ≈ 0, where I_x, I_y, I_t are the partial derivatives of image intensity and (u, v) is the flow.

Lucas–Kanade

Least squares on a patch assuming constant (u,v).

Horn–Schunck

Global smoothness regularizer + data term.

Deep flow

CNNs regress flow from image pairs directly.
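
The Lucas–Kanade idea listed above reduces to a 2x2 linear system per patch. The sketch below assumes the image derivatives have already been computed and are passed as (I_x, I_y, I_t) triples, one per pixel in the patch; the function name and the `min_det` threshold are assumptions for illustration.

```python
def lucas_kanade_patch(gradients, min_det=1e-9):
    """Solve for one (u, v) per patch by least squares on I_x u + I_y v = -I_t.

    gradients: iterable of (ix, iy, it) tuples, one per pixel in the patch.
    Returns None when the normal equations are (near) singular, which is
    what happens when the patch only contains a single edge direction.
    """
    sxx = sxy = syy = sxt = syt = 0.0
    for ix, iy, it in gradients:
        sxx += ix * ix
        sxy += ix * iy
        syy += iy * iy
        sxt += ix * it
        syt += iy * it
    det = sxx * syy - sxy * sxy
    if abs(det) < min_det:
        return None
    # Normal equations: [[sxx, sxy], [sxy, syy]] @ [u, v] = [-sxt, -syt]
    u = (-sxt * syy + sxy * syt) / det
    v = (-syt * sxx + sxy * sxt) / det
    return (u, v)


# Synthetic patch: derivatives consistent with a true flow of (0.5, -0.25).
true_u, true_v = 0.5, -0.25
spatial = [(1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (2.0, -1.0)]
patch = [(ix, iy, -(ix * true_u + iy * true_v)) for ix, iy in spatial]
print(lucas_kanade_patch(patch))

# A patch with gradients in one direction only cannot be solved.
edge_only = [(1.0, 0.0, -0.5), (2.0, 0.0, -1.0)]
print(lucas_kanade_patch(edge_only))  # None
```

The singular case is the aperture problem showing up in the algebra: with a single gradient direction the 2x2 matrix has rank one.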

From two frames

warp frame2 toward frame1 using estimated flow; minimize photometric error
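
The warp-and-compare step can be sketched in a few lines. This version assumes grayscale frames stored as lists of rows, a flow field stored as (u, v) per pixel, bilinear sampling with border clamping, and mean absolute difference as the photometric error; all names are illustrative.

```python
import math


def bilinear_sample(img, x, y):
    """Sample img at real-valued (x, y), clamping coordinates to the border."""
    h, w = len(img), len(img[0])
    x = min(max(x, 0.0), w - 1.0)
    y = min(max(y, 0.0), h - 1.0)
    x0, y0 = int(math.floor(x)), int(math.floor(y))
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    ax, ay = x - x0, y - y0
    top = (1 - ax) * img[y0][x0] + ax * img[y0][x1]
    bottom = (1 - ax) * img[y1][x0] + ax * img[y1][x1]
    return (1 - ay) * top + ay * bottom


def backward_warp(frame2, flow):
    """Warp frame2 toward frame1: out[y][x] = frame2 at (x + u, y + v)."""
    return [[bilinear_sample(frame2, x + flow[y][x][0], y + flow[y][x][1])
             for x in range(len(frame2[0]))]
            for y in range(len(frame2))]


def photometric_error(frame1, warped):
    """Mean absolute intensity difference between two frames."""
    diffs = [abs(a - b) for r1, r2 in zip(frame1, warped) for a, b in zip(r1, r2)]
    return sum(diffs) / len(diffs)


# A bright pixel moves one step to the right between the frames.
frame1 = [[0.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0]]
frame2 = [[0.0, 0.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0], [0.0, 0.0, 0.0, 0.0]]
zero_flow = [[(0.0, 0.0)] * 4 for _ in range(3)]
right_flow = [[(1.0, 0.0)] * 4 for _ in range(3)]

print(photometric_error(frame1, backward_warp(frame2, zero_flow)))
print(photometric_error(frame1, backward_warp(frame2, right_flow)))
```

The correct flow drives the error to zero on this toy pair, while the zero flow leaves a mismatch at the old and new pixel positions. An estimator searches for the flow field that makes this error small.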

Pro tip: Large displacements need pyramids (coarse-to-fine) or iterative refinement (e.g. RAFT).

Action Recognition MCQ

Actions in video

Action recognition assigns a label (e.g. diving, waving) to short clips. Methods range from frame CNNs + temporal pooling to two-stream RGB and optical-flow fusion, 3D convolutions (C3D, I3D), and transformers over space-time tokens. Large datasets (Kinetics) drive supervised pretraining.

Why motion matters

Static frames can be ambiguous; temporal patterns distinguish many action classes.

Key ideas

Clip input

Fixed-length segment sampled from longer video.

Two-stream

Separate networks for appearance (RGB) and motion (optical flow), with their predictions or features fused afterwards.

3D CNN

Spatiotemporal filters learn motion templates.

Kinetics

Large-scale labeled clips for pretraining.
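
To see how a spatiotemporal filter can act as a motion template, the 3D CNN idea above can be tried on a toy video. This is a naive sketch in pure Python: the kernel is set by hand rather than learned, and the operation is the cross-correlation that deep learning libraries call convolution.

```python
def conv3d_valid(video, kernel):
    """Naive 'valid' 3D cross-correlation over (time, height, width)."""
    T, H, W = len(video), len(video[0]), len(video[0][0])
    kt, kh, kw = len(kernel), len(kernel[0]), len(kernel[0][0])
    out = []
    for t in range(T - kt + 1):
        plane = []
        for y in range(H - kh + 1):
            row = []
            for x in range(W - kw + 1):
                acc = 0.0
                for dt in range(kt):
                    for dy in range(kh):
                        for dx in range(kw):
                            acc += kernel[dt][dy][dx] * video[t + dt][y + dy][x + dx]
                row.append(acc)
            plane.append(row)
        out.append(plane)
    return out


def max_response(volume):
    """Largest activation anywhere in the output volume."""
    return max(v for plane in volume for row in plane for v in row)


# Hand-set 2x1x2 kernel: fires when a pixel is bright at x, then at x + 1.
rightward_kernel = [[[1.0, 0.0]], [[0.0, 1.0]]]

# Videos of shape (time=3, height=1, width=3) with a single moving dot.
moving_right = [[[1.0, 0.0, 0.0]], [[0.0, 1.0, 0.0]], [[0.0, 0.0, 1.0]]]
moving_left = [[[0.0, 0.0, 1.0]], [[0.0, 1.0, 0.0]], [[1.0, 0.0, 0.0]]]

print(max_response(conv3d_valid(moving_right, rightward_kernel)))
print(max_response(conv3d_valid(moving_left, rightward_kernel)))
```

The same kernel responds more strongly to the rightward-moving dot than to the leftward one, which is the sense in which a 3D filter encodes a motion direction. A trained 3D CNN learns many such filters from data.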

Typical head

backbone features → temporal aggregate → softmax over action classes
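
The head described above can be sketched with average pooling over time, a linear layer, and a softmax. The feature values, weights, and class count below are made up for illustration; in practice they would come from a trained backbone and classifier.

```python
import math


def temporal_average(frame_features):
    """Average per-frame feature vectors over the time axis."""
    n = len(frame_features)
    return [sum(col) / n for col in zip(*frame_features)]


def linear(x, weights, bias):
    """One logit per class: dot(weights[c], x) + bias[c]."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def softmax(logits):
    """Numerically stable softmax (subtract the max before exponentiating)."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Two frames, each with a 2-dimensional feature vector (hypothetical values).
frame_features = [[1.0, 0.0], [3.0, 2.0]]
weights = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]  # 3 classes, 2 features
bias = [0.0, 0.0, 0.0]

pooled = temporal_average(frame_features)
probs = softmax(linear(pooled, weights, bias))
predicted_class = max(range(len(probs)), key=probs.__getitem__)
print(pooled, predicted_class)
```

Average pooling ignores frame order, so it suits actions that are recognizable from appearance alone. An RNN or attention layer in place of `temporal_average` would be one way to keep ordering information.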

Pro tip: Test-time augmentation, which averages predictions over multiple crops, scales, and clips of the same video, improves robustness.
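
A minimal sketch of the averaging step, assuming each view (a crop, a scale, or a temporal clip) has already been scored by the model and produced class probabilities. The helper for spacing clips evenly along the video is one common choice, not the only one, and the probability values are invented for the example.

```python
def uniform_clip_starts(num_frames, clip_span, num_clips):
    """Evenly spaced start indices for num_clips clips of clip_span frames."""
    last_start = max(num_frames - clip_span, 0)
    if num_clips == 1:
        return [last_start // 2]  # a single clip is taken from the middle
    return [round(i * last_start / (num_clips - 1)) for i in range(num_clips)]


def average_predictions(view_probs):
    """Average class probabilities across views of the same video."""
    n = len(view_probs)
    return [sum(col) / n for col in zip(*view_probs)]


starts = uniform_clip_starts(num_frames=100, clip_span=20, num_clips=5)
print(starts)  # [0, 20, 40, 60, 80]

# Hypothetical per-view probabilities for a 2-class problem.
view_probs = [[0.6, 0.4], [0.3, 0.7], [0.7, 0.3]]
fused = average_predictions(view_probs)
final_class = max(range(len(fused)), key=fused.__getitem__)
print(final_class)  # 0
```

One view disagrees with the other two here, and averaging lets the majority evidence decide. The cost is one forward pass per view.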