Computer Vision Interview
60 Q&A
Chapter 11
Video Analysis — Interview Q&A
Video processing fundamentals, optical flow, and action recognition in video.
Video Processing: 20 Essential Q&A
1
How is video different from images?
⚡ easy
Answer: Adds a time axis—motion, temporal context, and much higher data rate; models must aggregate or align frames.
2
What is FPS?
⚡ easy
Answer: Frames per second—higher FPS preserves motion detail but increases compute; downsample for recognition if action is slow.
3
Temporal redundancy?
📊 medium
Answer: Adjacent frames are highly correlated—exploited by codecs (motion compensation) and by sparse sampling for recognition.
4
Frame sampling strategies?
📊 medium
Answer: Uniform stride, random start offsets (temporal jitter), cropping fixed-length clips, or dense sampling of every frame; the trade-off is temporal coverage versus GPU memory.
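A minimal sketch of uniform-stride sampling with optional temporal jitter. The function name, its parameters, and the choice to clamp short videos to their last frame are illustrative assumptions, not taken from a specific library.

```python
import random


def uniform_stride_indices(num_frames, clip_len, stride, jitter=False, rng=None):
    """Return clip_len frame indices spaced by `stride` (hypothetical helper)."""
    rng = rng or random.Random()
    span = (clip_len - 1) * stride + 1      # frames covered by one clip
    max_start = max(num_frames - span, 0)
    # Random start gives temporal jitter (training); centered start is deterministic (eval).
    start = rng.randint(0, max_start) if jitter else max_start // 2
    # Clamp so a short video repeats its last frame instead of indexing out of range.
    return [min(start + i * stride, num_frames - 1) for i in range(clip_len)]
```

A larger stride covers more of the video with the same number of frames, so memory stays fixed while temporal resolution drops.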
5
3D convolutions?
📊 medium
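Answer: Kernel spans a local height × width × time window (for example 3×3×3), so it learns motion patterns directly; it costs more parameters and FLOPs than 2D convolutions plus a separate temporal module.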
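To make the cost comparison concrete, here is a rough weight count for a full 3D kernel versus one possible "2D + temporal" factorization (a spatial convolution followed by a 1D temporal convolution). The function names and the default middle channel count are assumptions for illustration; biases are ignored.

```python
def conv3d_params(c_in, c_out, k_t, k_s):
    """Weights in a full 3D kernel of size k_t x k_s x k_s (bias ignored)."""
    return c_in * c_out * k_t * k_s * k_s


def factorized_params(c_in, c_out, k_t, k_s, c_mid=None):
    """Weights in a k_s x k_s spatial conv followed by a k_t temporal conv (bias ignored)."""
    c_mid = c_out if c_mid is None else c_mid   # assumed default: keep the width unchanged
    return c_in * c_mid * k_s * k_s + c_mid * c_out * k_t
```

With 64 input and output channels and a 3×3×3 kernel, the full 3D layer has 110,592 weights and this factorized variant has 49,152.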
6
Two-stream architecture?
📊 medium
Answer: One stream on RGB (appearance), one on stacked optical flow (motion); their predictions are combined by late fusion. A classic action-recognition baseline.
7
CNN + LSTM?
📊 medium
Answer: Per-frame CNN embeddings fed to LSTM/GRU to model sequence—flexible length, order-sensitive.
8
Codec vs container?
⚡ easy
Answer: Codec (H.264/HEVC/AV1) compresses pixels; container (MP4/MKV) wraps streams and metadata—decode to raw frames for training.
9
Decode pipeline?
📊 medium
Answer: VideoCapture → seek/keyframe → decode to tensors—hardware decoders (NVDEC) critical for throughput at scale.
10
Background subtraction?
📊 medium
Answer: MOG2, KNN—foreground mask for surveillance; sensitive to lighting and camera jitter.
import cv2
cap = cv2.VideoCapture("clip.mp4")
ret, frame = cap.read()  # ret is False if no frame was decoded; frame is a BGR array
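For the background-subtraction question, a sketch of the simplest possible variant: a running-average background model in NumPy. This is a stand-in for illustration, not MOG2 or KNN; the function name, `alpha`, and `threshold` are assumed values. It also shows why lighting changes hurt: any global brightness shift larger than the threshold marks the whole frame as foreground.

```python
import numpy as np


def running_average_masks(frames, alpha=0.05, threshold=25.0):
    """Return one boolean foreground mask per grayscale frame after the first."""
    background = np.asarray(frames[0], dtype=np.float32)
    masks = []
    for frame in frames[1:]:
        frame = np.asarray(frame, dtype=np.float32)
        mask = np.abs(frame - background) > threshold
        # Blend the new frame into the background only where the scene looks static,
        # so moving objects are not absorbed into the model.
        blended = (1.0 - alpha) * background + alpha * frame
        background = np.where(mask, background, blended)
        masks.append(mask)
    return masks
```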
11
Stabilization?
🔥 hard
Answer: Estimate global motion, warp frames to smooth path—used in consumer video; different from optical flow per pixel.
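A sketch of the "smooth path" step, assuming the global motion per frame has already been estimated as (dx, dy, rotation) triples. The moving-average filter, the function names, and the default radius are illustrative choices; production stabilizers often use more elaborate path optimization.

```python
import numpy as np


def moving_average_path(trajectory, radius=15):
    """Smooth each column of an (N, D) camera trajectory with a box filter."""
    trajectory = np.asarray(trajectory, dtype=np.float64)
    kernel = np.ones(2 * radius + 1) / (2 * radius + 1)
    # Edge padding keeps the output the same length as the input.
    padded = np.pad(trajectory, ((radius, radius), (0, 0)), mode="edge")
    cols = [np.convolve(padded[:, i], kernel, mode="valid")
            for i in range(trajectory.shape[1])]
    return np.stack(cols, axis=1)


def stabilizing_motions(frame_motions, radius=15):
    """Adjust per-frame motions so the accumulated path follows the smoothed one."""
    motions = np.asarray(frame_motions, dtype=np.float64)
    trajectory = np.cumsum(motions, axis=0)        # accumulated camera path
    smoothed = moving_average_path(trajectory, radius)
    return motions + (smoothed - trajectory)       # correction applied per frame
```

Each frame would then be warped with its adjusted motion; the warp itself is omitted here.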
12
Tracking in video?
📊 medium
Answer: Detection per frame + association (SORT, DeepSORT)—links identities across time for analytics.
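A simplified sketch of the association step using greedy IoU matching. SORT itself pairs a Kalman-filter motion model with Hungarian assignment; the greedy matcher, the function names, and the `min_iou` threshold below are simplifying assumptions to show the idea.

```python
def iou(a, b):
    """Intersection over union of two boxes in (x1, y1, x2, y2) format."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def greedy_associate(tracks, detections, min_iou=0.3):
    """Match track ids to detection indices, best IoU first.

    tracks: dict of track_id -> last known box; detections: list of boxes.
    """
    pairs = sorted(
        ((iou(box, det), tid, j)
         for tid, box in tracks.items()
         for j, det in enumerate(detections)),
        key=lambda p: p[0],
        reverse=True,
    )
    matches, used = {}, set()
    for score, tid, j in pairs:
        if score < min_iou:
            break                      # remaining pairs overlap too little
        if tid in matches or j in used:
            continue                   # each track and detection is used once
        matches[tid] = j
        used.add(j)
    return matches
```

Unmatched detections would start new tracks and unmatched tracks would age out; that bookkeeping is left out.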
13
Long videos?
🔥 hard
Answer: Clip sampling, hierarchical pooling, memory transformers, or sparse attention—full-length movies don’t fit one forward pass.
14
Augmentation?
📊 medium
Answer: Temporal jitter, random clip length, mixup on clips, cutout—must preserve label semantics (temporal order for actions).
15
Example datasets?
⚡ easy
Answer: Kinetics, Something-Something, UCF101, AVA—vary in duration, labels, and multi-label structure.
16
Real-time?
📊 medium
Answer: Batch size 1, TensorRT, smaller backbone, ROI processing—latency budget drives architecture choice.
17
SlowFast pathways?
🔥 hard
Answer: Slow pathway: low frame rate, high channel capacity (semantics). Fast pathway: high frame rate, few channels (motion). Lateral connections feed Fast features into Slow, so the pair captures semantics and motion efficiently.
18
Video ViT?
🔥 hard
Answer: Patchify across space-time or factorized space/time attention—scalable with large data (TimeSformer, Video Swin).
19
Anomaly detection?
📊 medium
Answer: Reconstruction or prediction error in latent space across frames—surveillance and manufacturing QC.
20
Deployment?
⚡ easy
Answer: Stream ingestion (RTSP), frame sync, GPU decode, model serving with batching policies for variable FPS.
Optical Flow: 20 Essential Q&A
21
What is optical flow?
⚡ easy
Answer: Per-pixel 2D motion field (u,v) mapping points in frame t to frame t+1—under brightness and smoothness assumptions.
22
Brightness constancy?
📊 medium
Answer: Assumes I(x,y,t) ≈ I(x+u,y+v,t+1)—linearize for small motion; breaks with lighting change or specularities.
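The linearization step can be written out. With I_x, I_y, I_t denoting the partial derivatives of the image (notation assumed here), a first-order Taylor expansion of the right-hand side gives the optical flow constraint:

```latex
\begin{aligned}
I(x+u,\, y+v,\, t+1) &\approx I(x, y, t) + I_x u + I_y v + I_t \\
\Rightarrow \quad I_x u + I_y v + I_t &= 0
\end{aligned}
```

This is one equation in two unknowns per pixel, so an additional constraint is needed to pin down (u, v).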
23
Smoothness prior?
📊 medium
Answer: Neighboring pixels should have similar flow—regularizes ill-posed problem except at motion boundaries.
24
Lucas–Kanade?
📊 medium
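Answer: Assume constant flow within a patch and solve a least-squares system built from spatial and temporal image gradients; usually run sparsely on corners, and fails on uniform regions and along straight edges where the system is ill-conditioned.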
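A single-window sketch of the least-squares solve in NumPy, assuming float grayscale images and a small displacement. The function name and window size are illustrative; OpenCV's pyramidal implementation adds coarse-to-fine refinement and iteration that this omits.

```python
import numpy as np


def lucas_kanade_patch(prev, curr, y, x, half=7):
    """Estimate (u, v) for the window centered at row y, column x."""
    Iy, Ix = np.gradient(prev)         # spatial gradients (axis 0 is y, axis 1 is x)
    It = curr - prev                   # temporal gradient
    win = (slice(y - half, y + half + 1), slice(x - half, x + half + 1))
    # One row per pixel: Ix * u + Iy * v = -It
    A = np.stack([Ix[win].ravel(), Iy[win].ravel()], axis=1)
    b = -It[win].ravel()
    flow, *_ = np.linalg.lstsq(A, b, rcond=None)
    return flow                        # array([u, v])
```

On a textureless window both gradient columns are near zero, so the system has no stable solution, which is the failure case named in the answer.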
25
Horn–Schunck?
🔥 hard
Answer: Global energy balancing data term and smoothness—produces dense flow; iterative Gauss–Seidel / modern convex solvers.
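Written out, the energy has the following form, where I_x, I_y, I_t are image derivatives and α weights the smoothness term (symbol names assumed here):

```latex
E(u, v) = \iint \left( I_x u + I_y v + I_t \right)^2
        + \alpha^2 \left( \lVert \nabla u \rVert^2 + \lVert \nabla v \rVert^2 \right) \, dx \, dy
```

A larger α gives smoother flow but blurs motion boundaries.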
26
Dense vs sparse?
⚡ easy
Answer: Dense: vector per pixel. Sparse: features only (LK on Harris corners)—dense needed for warping, segmentation, depth hints.
27
Gaussian pyramid?
📊 medium
Answer: Estimate flow coarse-to-fine to handle large displacement—warp and refine each level.
28
Occlusions?
🔥 hard
Answer: Pixels visible in one frame but not the next—forward-backward consistency checks and learned occlusion masks help.
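A sketch of the forward-backward consistency check, assuming both flows are given as (H, W, 2) arrays of (u, v). The nearest-neighbor lookup and the fixed tolerance are simplifications; practical versions interpolate the backward flow and often scale the threshold with flow magnitude.

```python
import numpy as np


def fb_occlusion_mask(flow_fw, flow_bw, tol=1.0):
    """Return True where forward and backward flow disagree by more than tol pixels."""
    h, w = flow_fw.shape[:2]
    ys, xs = np.mgrid[0:h, 0:w]
    # Where each pixel of frame t lands in frame t+1 (rounded, clamped to the image).
    xt = np.clip(np.rint(xs + flow_fw[..., 0]).astype(int), 0, w - 1)
    yt = np.clip(np.rint(ys + flow_fw[..., 1]).astype(int), 0, h - 1)
    bw_at_target = flow_bw[yt, xt]
    # For a visible pixel, forward then backward motion should cancel out.
    round_trip = flow_fw + bw_at_target
    return np.linalg.norm(round_trip, axis=-1) > tol
```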
29
Farneback?
📊 medium
Answer: Approximate each neighborhood with a quadratic polynomial and solve for the displacement between the two expansions; a dense classical method available in OpenCV.
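# prev_gray, next_gray: consecutive frames as single-channel 8-bit images; args after None are pyr_scale, levels, winsize, iterations, poly_n, poly_sigma, flags
flow = cv2.calcOpticalFlowFarneback(prev_gray, next_gray, None, 0.5, 3, 15, 3, 5, 1.2, 0)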
30
TV-L1?
🔥 hard
Answer: Total variation regularization with L1 data term—robust to outliers, good for preserving discontinuities.
31
Warping in deep nets?
📊 medium
Answer: Differentiable bilinear sampling to align frames by predicted flow—core building block in iterative refinement networks.
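The arithmetic of bilinear backward warping, sketched in NumPy for a single-channel image. In PyTorch the differentiable counterpart is `torch.nn.functional.grid_sample`; the function below is an illustrative assumption that only shows the sampling math and clamps out-of-range coordinates to the border.

```python
import numpy as np


def warp_bilinear(image, flow):
    """Backward warp: out[y, x] = image sampled at (y + v, x + u).

    image: (H, W) array; flow: (H, W, 2) array of (u, v).
    """
    h, w = image.shape
    ys, xs = np.mgrid[0:h, 0:w].astype(np.float64)
    x = np.clip(xs + flow[..., 0], 0, w - 1)
    y = np.clip(ys + flow[..., 1], 0, h - 1)
    x0 = np.floor(x).astype(int)
    y0 = np.floor(y).astype(int)
    x1 = np.minimum(x0 + 1, w - 1)
    y1 = np.minimum(y0 + 1, h - 1)
    wx, wy = x - x0, y - y0            # fractional offsets act as blend weights
    top = (1 - wx) * image[y0, x0] + wx * image[y0, x1]
    bottom = (1 - wx) * image[y1, x0] + wx * image[y1, x1]
    return (1 - wy) * top + wy * bottom
```

Because the output is a weighted sum of four neighbors, gradients flow to both the image values and, through the weights, the predicted flow.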
32
Deep learning flow?
📊 medium
Answer: CNNs predict flow end-to-end (FlowNet, PWC-Net, RAFT); trained with supervision on synthetic datasets (FlyingChairs, Sintel), then finetuned on the target benchmark.
33
PWC-Net?
🔥 hard
Answer: Pyramid, warping, cost volume—correlate features at multiple scales efficiently.
34
RAFT?
🔥 hard
Answer: Build an all-pairs 4D correlation volume, pool it to multiple scales, and refine the flow with recurrent GRU updates; state-of-the-art accuracy on Sintel and KITTI when it was published (2020).
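A sketch of the correlation-volume idea in NumPy, assuming two (H, W, D) feature maps. The function names are hypothetical, and details such as feature normalization, the lookup operator, and the GRU update are left out.

```python
import numpy as np


def all_pairs_correlation(f1, f2):
    """Dot product between every pixel of f1 and every pixel of f2.

    f1, f2: (H, W, D) feature maps. Returns an (H, W, H, W) volume.
    """
    return np.einsum("ijd,kld->ijkl", f1, f2)


def pool_last_two(corr):
    """Average-pool the last two (target image) dimensions by a factor of 2."""
    h, w, h2, w2 = corr.shape
    cropped = corr[..., : h2 // 2 * 2, : w2 // 2 * 2]
    return cropped.reshape(h, w, h2 // 2, 2, w2 // 2, 2).mean(axis=(3, 5))
```

Pooling only the target dimensions keeps full resolution for the source pixel while giving it a coarser, wider view of candidate matches.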
35
Flow vs stereo?
📊 medium
Answer: Stereo: displacement along epipolar line (1D) after rectification. Flow: general 2D motion—stereo is constrained flow.
36
Large motion?
📊 medium
Answer: Pyramids, feature matching init, or patch-based methods—pure local LK insufficient without hierarchy.
37
Metrics?
⚡ easy
Answer: EPE (end-point error) on Middlebury/Sintel/KITTI—different datasets stress occlusion and realism.
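End-point error is the mean Euclidean distance between predicted and ground-truth flow vectors. A small NumPy sketch, where the optional `valid` mask (an assumed parameter name) restricts the average to pixels that have ground truth:

```python
import numpy as np


def end_point_error(flow_pred, flow_gt, valid=None):
    """Mean Euclidean distance between (H, W, 2) flow fields."""
    err = np.linalg.norm(flow_pred - flow_gt, axis=-1)
    if valid is not None:
        err = err[valid]               # keep only pixels with ground truth
    return float(err.mean())
```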
38
Use in stabilization?
⚡ easy
Answer: Estimate global dominant motion from flow field or parametric model—smooth camera path.
39
Failure modes?
📊 medium
Answer: Textureless regions, repetitive patterns, fast motion, transparency—classical and learned methods both struggle.
40
Real-time?
⚡ easy
Answer: DIS, Farneback GPU, lite deep models—trade accuracy vs FPS for robotics and AR.
Action Recognition: 20 Essential Q&A
41
What is action recognition?
⚡ easy
Answer: Predict human action label(s) from a video clip—may include temporal localization and multi-person context.
42
Two-stream networks?
📊 medium
Answer: RGB CNN for appearance + flow CNN for motion; fuse scores—strong baseline before full 3D models.
43
C3D?
📊 medium
Answer: VGG-style 3×3×3 conv stacks on short clips—learns spatiotemporal filters end-to-end.
44
I3D?
🔥 hard
Answer: Inflate 2D ImageNet filters to 3D for initialization—bootstrap video nets from image pretraining, big jump on Kinetics.
# I3D: inflate each 2D k×k filter to k×k×k by repeating it k times along time and dividing by k; bootstraps from ImageNet weights
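The inflation step sketched in NumPy, assuming weights stored as (out_channels, in_channels, kH, kW). The function name is illustrative. Dividing by the temporal size means a clip made of one repeated frame produces the same response as the original 2D filter on that frame.

```python
import numpy as np


def inflate_2d_kernel(w2d, t):
    """Inflate (out, in, kH, kW) weights to (out, in, t, kH, kW)."""
    # Repeat along a new time axis, then rescale so the temporal sum equals w2d.
    return np.repeat(w2d[:, :, None, :, :], t, axis=2) / t
```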
45
TSN?
📊 medium
Answer: Sparse sampling of frames/segments across video; aggregate (avg/max) after shared CNN—covers long videos cheaply.
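A sketch of segment-based sparse sampling: split the video into equal segments and take one frame from each, random during training and centered at test time. The function name and signature are assumptions for illustration.

```python
import random


def tsn_segment_indices(num_frames, num_segments, train=True, rng=None):
    """Pick one frame index from each of num_segments equal-length segments."""
    rng = rng or random.Random()
    seg_len = num_frames / num_segments
    indices = []
    for k in range(num_segments):
        start = int(k * seg_len)
        end = max(int((k + 1) * seg_len), start + 1)   # exclusive; at least one frame
        idx = rng.randrange(start, end) if train else (start + end - 1) // 2
        indices.append(min(idx, num_frames - 1))
    return indices
```

The per-frame scores from the shared CNN would then be averaged (or max-pooled) into one video-level prediction, so the cost depends on the number of segments rather than the video length.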
46
SlowFast?
🔥 hard
Answer: Dual pathways with different frame rates and widths—efficient motion + semantics; widely used backbone.
47
X3D?
📊 medium
Answer: Start from a tiny network and expand one axis at a time (frames, frame rate, resolution, width, depth), keeping the step that gives the best accuracy-compute trade-off.
48
Kinetics?
⚡ easy
Answer: Large-scale YouTube action dataset (400/600/700 classes)—ImageNet moment for video recognition.
49
Something-Something?
📊 medium
Answer: Emphasizes fine-grained motion and object interactions—tests temporal reasoning more than appearance.
50
Early vs late fusion?
🔥 hard
Answer: Early: stack frames/channels in first layers. Late: per-frame features then aggregate—middle fusion via 3D conv is common compromise.
51
LSTM over CNN features?
📊 medium
Answer: Encodes temporal order for variable-length clips; lighter than 3D convolutions but can underfit complex motion compared with Transformers.
52
Space-time attention?
🔥 hard
Answer: Factorized or joint attention over patches and time (TimeSformer, Video Swin)—scalable with data.
53
Skeleton-based?
📊 medium
Answer: Graph convolutions on body joints—privacy-friendly, works with pose estimators; less appearance detail.
54
Long videos?
🔥 hard
Answer: Segment clips, hierarchical models, or memory—full movies need different pipelines than 10s Kinetics clips.
55
Temporal localization?
📊 medium
Answer: Detect action start/end (SSN, BMN) or per-frame labels—needed for untrimmed video.
56
Multi-label?
📊 medium
Answer: Sigmoid + BCE when several actions co-occur (cooking + talking)—different from single softmax.
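A small NumPy sketch of the difference. With sigmoid, each class gets an independent probability, so two co-occurring actions can both score near 1; softmax forces the classes to compete. The function names are illustrative, and the loss uses the standard numerically stable form of binary cross-entropy on logits.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-np.asarray(z, dtype=np.float64)))


def softmax(z):
    z = np.asarray(z, dtype=np.float64)
    e = np.exp(z - z.max())            # subtract max for numerical stability
    return e / e.sum()


def bce_with_logits(logits, targets):
    """Mean binary cross-entropy over classes, computed from raw logits."""
    z = np.asarray(logits, dtype=np.float64)
    y = np.asarray(targets, dtype=np.float64)
    # Stable form of -[y*log(s) + (1-y)*log(1-s)] with s = sigmoid(z)
    return float(np.mean(np.maximum(z, 0) - z * y + np.log1p(np.exp(-np.abs(z)))))
```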
57
Weak supervision?
🔥 hard
Answer: Train with video tags only (MIL) or narration—reduces frame-level annotation cost.
58
Real-time?
⚡ easy
Answer: Smaller X3D, MobileNet-3D, or keyframe-only inference—latency budgets for edge cameras.
59
Transfer learning?
📊 medium
Answer: ImageNet → inflate → Kinetics finetune → downstream (surgery, sports)—standard recipe.
60
Metrics?
⚡ easy
Answer: Top-1/5 accuracy; mAP for detection/localization; sometimes calibrated confidence for safety apps.
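Top-k accuracy counts a clip as correct when its true label is among the k highest-scoring classes. A NumPy sketch, with an assumed function name and a (num_clips, num_classes) score matrix:

```python
import numpy as np


def topk_accuracy(scores, labels, k=1):
    """Fraction of rows whose true label is among the k largest scores."""
    scores = np.asarray(scores)
    labels = np.asarray(labels)
    topk = np.argsort(-scores, axis=1)[:, :k]      # indices of the k best classes
    return float((topk == labels[:, None]).any(axis=1).mean())
```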