Computer Vision Interview 60 Q&A Chapter 11

Video Analysis — Interview Q&A

Video processing fundamentals, optical flow, and action recognition in video.


Video Processing: 20 Essential Q&A

1 How is video different from images? ⚡ easy
Answer: Adds a time axis—motion, temporal context, and much higher data rate; models must aggregate or align frames.
2 What is FPS? ⚡ easy
Answer: Frames per second—higher FPS preserves motion detail but increases compute; downsample for recognition if action is slow.
3 Temporal redundancy? 📊 medium
Answer: Adjacent frames are highly correlated—exploited by codecs (motion compensation) and by sparse sampling for recognition.
4 Frame sampling strategies? 📊 medium
Answer: Uniform stride, random temporal jitter of the clip start, segment-based sparse sampling, or dense sampling of consecutive frames; the trade-off is temporal coverage vs GPU memory.
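The two simplest strategies above, uniform stride and random temporal jitter, come down to choosing frame indices. A minimal sketch, assuming frames are addressed by integer index; the function names are hypothetical and not from any library:

```python
import random

def uniform_stride_indices(num_frames, clip_len, stride):
    """Indices of clip_len frames taken every `stride` frames, starting at frame 0."""
    return [min(i * stride, num_frames - 1) for i in range(clip_len)]

def jittered_clip_indices(num_frames, clip_len, stride, rng=random):
    """Same clip shape, but with a random start offset (temporal jitter)."""
    span = (clip_len - 1) * stride + 1          # frames covered by one clip
    max_start = max(num_frames - span, 0)
    start = rng.randint(0, max_start)           # inclusive on both ends
    return [min(start + i * stride, num_frames - 1) for i in range(clip_len)]

print(uniform_stride_indices(300, 8, 4))
print(jittered_clip_indices(300, 8, 4, random.Random(0)))
```

A larger stride covers more of the video with the same number of frames, at the cost of missing fast motion between samples.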
5 3D convolutions? 📊 medium
Answer: Kernel spans H×W×T—learns motion patterns directly; expensive vs 2D+temporal modules.
6 Two-stream architecture? 📊 medium
Answer: One stream on RGB (appearance), one on stacked optical flow (motion), combined by late fusion; the classic action-recognition baseline.
7 CNN + LSTM? 📊 medium
Answer: Per-frame CNN embeddings fed to LSTM/GRU to model sequence—flexible length, order-sensitive.
8 Codec vs container? ⚡ easy
Answer: Codec (H.264/HEVC/AV1) compresses pixels; container (MP4/MKV) wraps streams and metadata—decode to raw frames for training.
9 Decode pipeline? 📊 medium
Answer: VideoCapture → seek/keyframe → decode to tensors—hardware decoders (NVDEC) critical for throughput at scale.
10 Background subtraction? 📊 medium
Answer: MOG2, KNN—foreground mask for surveillance; sensitive to lighting and camera jitter.
import cv2
cap = cv2.VideoCapture("clip.mp4"); ret, frame = cap.read()  # ret is False if no frame could be read
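The foreground-mask idea behind background subtraction can be sketched with a running-average background model. This is a deliberately simpler technique than MOG2 or KNN (one mean per pixel instead of a mixture or sample set); the class name and the `alpha` and `threshold` parameters are assumptions for illustration only.

```python
import numpy as np

class RunningAverageBackground:
    """Per-pixel running mean as background; pixels far from it are foreground."""

    def __init__(self, alpha=0.05, threshold=25.0):
        self.alpha = alpha            # how fast the background adapts
        self.threshold = threshold    # intensity difference that counts as foreground
        self.background = None

    def apply(self, frame):
        frame = frame.astype(np.float64)
        if self.background is None:
            self.background = frame.copy()      # bootstrap from the first frame
        mask = np.abs(frame - self.background) > self.threshold
        # Blend the new frame into the background estimate.
        self.background = (1 - self.alpha) * self.background + self.alpha * frame
        return mask
```

A single mean per pixel makes the sensitivity to lighting changes and camera jitter easy to see: any global brightness shift larger than `threshold` turns the whole frame into foreground until the model adapts.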
11 Stabilization? 🔥 hard
Answer: Estimate global motion, warp frames to smooth path—used in consumer video; different from optical flow per pixel.
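The "smooth path" step can be sketched on its own, assuming the per-frame global motion has already been estimated as (dx, dy, dangle) triples. The function name, the moving-average filter, and the `radius` parameter are illustrative assumptions; real stabilizers may use other smoothers.

```python
import numpy as np

def smooth_camera_path(frame_motions, radius=2):
    """frame_motions: (N, 3) array of per-frame global motion (dx, dy, dangle).

    Returns (N, 3) corrections that move each frame onto a smoothed trajectory.
    """
    trajectory = np.cumsum(frame_motions, axis=0)        # accumulated camera path
    kernel = np.ones(2 * radius + 1) / (2 * radius + 1)  # moving-average window
    padded = np.pad(trajectory, ((radius, radius), (0, 0)), mode="edge")
    smoothed = np.stack(
        [np.convolve(padded[:, k], kernel, mode="valid")
         for k in range(trajectory.shape[1])],
        axis=1,
    )
    return smoothed - trajectory                         # add this to each frame's transform
```

Each correction would then be turned into a warp (for example an affine transform) applied to the corresponding frame.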
12 Tracking in video? 📊 medium
Answer: Detection per frame + association (SORT, DeepSORT)—links identities across time for analytics.
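The association step can be sketched as IoU matching between existing track boxes and new detections. This is a simplification: SORT combines Kalman-filter prediction with Hungarian assignment, while the sketch below uses greedy matching on raw boxes. Function names and the threshold value are hypothetical.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def greedy_associate(tracks, detections, iou_threshold=0.3):
    """Match track boxes to detection boxes, highest IoU first.

    Returns (matches, unmatched_track_ids, unmatched_detection_ids).
    """
    pairs = sorted(
        ((iou(t, d), ti, di)
         for ti, t in enumerate(tracks)
         for di, d in enumerate(detections)),
        reverse=True,
    )
    matches, used_t, used_d = [], set(), set()
    for score, ti, di in pairs:
        if score < iou_threshold:
            break                      # remaining pairs overlap even less
        if ti in used_t or di in used_d:
            continue                   # each track and detection is used once
        matches.append((ti, di))
        used_t.add(ti)
        used_d.add(di)
    unmatched_t = [i for i in range(len(tracks)) if i not in used_t]
    unmatched_d = [i for i in range(len(detections)) if i not in used_d]
    return matches, unmatched_t, unmatched_d
```

Unmatched detections would typically start new tracks, and tracks that stay unmatched for several frames would be dropped.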
13 Long videos? 🔥 hard
Answer: Clip sampling, hierarchical pooling, memory transformers, or sparse attention—full-length movies don’t fit one forward pass.
14 Augmentation? 📊 medium
Answer: Temporal jitter, random clip length, mixup on clips, cutout—must preserve label semantics (temporal order for actions).
15 Example datasets? ⚡ easy
Answer: Kinetics, Something-Something, UCF101, AVA—vary in duration, labels, and multi-label structure.
16 Real-time? 📊 medium
Answer: Batch size 1, TensorRT, smaller backbone, ROI processing—latency budget drives architecture choice.
17 SlowFast pathways? 🔥 hard
Answer: The Slow pathway runs at a low frame rate with high channel capacity to capture semantics; the Fast pathway runs at a high frame rate with few channels to capture motion cheaply.
18 Video ViT? 🔥 hard
Answer: Patchify across space-time or factorized space/time attention—scalable with large data (TimeSformer, Video Swin).
19 Anomaly detection? 📊 medium
Answer: Reconstruction or prediction error in latent space across frames—surveillance and manufacturing QC.
20 Deployment? ⚡ easy
Answer: Stream ingestion (RTSP), frame sync, GPU decode, model serving with batching policies for variable FPS.

Optical Flow: 20 Essential Q&A

21 What is optical flow? ⚡ easy
Answer: Per-pixel 2D motion field (u,v) mapping points in frame t to frame t+1—under brightness and smoothness assumptions.
22 Brightness constancy? 📊 medium
Answer: Assumes I(x,y,t) ≈ I(x+u,y+v,t+1)—linearize for small motion; breaks with lighting change or specularities.
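The linearization step can be written out as a first-order Taylor expansion. Here I_x, I_y, I_t denote the partial derivatives of I (notation assumed for this note), and the time step is 1:

```latex
\begin{aligned}
I(x+u,\,y+v,\,t+1) &\approx I(x,y,t) + I_x u + I_y v + I_t \\
\Rightarrow\quad I_x u + I_y v + I_t &= 0
\end{aligned}
```

This is one equation in two unknowns (u, v) per pixel, which is why an extra constraint such as a constant-flow patch or a smoothness prior is needed.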
23 Smoothness prior? 📊 medium
Answer: Neighboring pixels should have similar flow—regularizes ill-posed problem except at motion boundaries.
24 Lucas–Kanade? 📊 medium
Answer: Assume constant flow in patch—solve least squares on spatial gradients—sparse, good for corners, fails on uniform regions.
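A minimal NumPy sketch of the single-patch least-squares solve, assuming float grayscale patches and small motion. The function name and the `min_eig` threshold are illustrative assumptions, and there is no pyramid or iteration here.

```python
import numpy as np

def lucas_kanade_patch(prev, nxt, min_eig=1e-6):
    """Least-squares (u, v) for one patch, assuming constant flow inside it.

    prev, nxt: 2D float arrays (grayscale patches at t and t+1).
    Returns None when the patch lacks texture in two directions.
    """
    Iy, Ix = np.gradient(prev)          # derivatives along rows (y) and columns (x)
    It = nxt - prev                     # temporal derivative
    A = np.stack([Ix.ravel(), Iy.ravel()], axis=1)
    ATA = A.T @ A                       # 2x2 structure tensor summed over the patch
    if np.linalg.eigvalsh(ATA)[0] < min_eig:
        return None                     # uniform region or single edge: ill-conditioned
    u, v = np.linalg.solve(ATA, -A.T @ It.ravel())
    return float(u), float(v)

# Synthetic check: a smooth pattern shifted by a known sub-pixel amount.
X, Y = np.meshgrid(np.arange(64.0), np.arange(64.0))
pattern = lambda x, y: np.sin(0.2 * x) + np.cos(0.15 * y)
true_u, true_v = 0.3, -0.2
prev = pattern(X, Y)
nxt = pattern(X - true_u, Y - true_v)
estimate = lucas_kanade_patch(prev, nxt)
print(estimate)                         # should be close to (0.3, -0.2)
```

The eigenvalue test is what makes the method "good for corners": only patches with gradients in two directions give a well-conditioned 2x2 system.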
25 Horn–Schunck? 🔥 hard
Answer: Global energy balancing data term and smoothness—produces dense flow; iterative Gauss–Seidel / modern convex solvers.
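The energy can be written in its usual form, with I_x, I_y, I_t the image derivatives and α the smoothness weight (symbols assumed here, not taken from the notes):

```latex
E(u, v) = \iint \Big[ (I_x u + I_y v + I_t)^2
        + \alpha^2 \big( \lVert \nabla u \rVert^2 + \lVert \nabla v \rVert^2 \big) \Big] \, dx \, dy
```

The first term is the data term (brightness constancy); the second penalizes flow gradients, so a larger α gives smoother flow and more blurring across motion boundaries.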
26 Dense vs sparse? ⚡ easy
Answer: Dense: vector per pixel. Sparse: features only (LK on Harris corners)—dense needed for warping, segmentation, depth hints.
27 Gaussian pyramid? 📊 medium
Answer: Estimate flow coarse-to-fine to handle large displacement—warp and refine each level.
28 Occlusions? 🔥 hard
Answer: Pixels visible in one frame but not the next—forward-backward consistency checks and learned occlusion masks help.
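A forward-backward consistency check can be sketched as follows, assuming both flows are given as (H, W, 2) arrays of (u, v). The function name, the nearest-neighbour lookup, and the `tol` value are simplifying assumptions.

```python
import numpy as np

def fb_occlusion_mask(flow_fw, flow_bw, tol=1.0):
    """Flag pixels whose forward flow is not undone by the backward flow.

    flow_fw: flow from frame t to t+1; flow_bw: flow from t+1 to t.
    Returns a boolean (H, W) mask, True where the pixel is likely occluded.
    """
    h, w = flow_fw.shape[:2]
    ys, xs = np.mgrid[0:h, 0:w]
    # Where each pixel lands in frame t+1 (nearest-neighbour lookup for simplicity).
    xt = np.rint(xs + flow_fw[..., 0]).astype(int)
    yt = np.rint(ys + flow_fw[..., 1]).astype(int)
    inside = (xt >= 0) & (xt < w) & (yt >= 0) & (yt < h)
    bw_at_target = flow_bw[np.clip(yt, 0, h - 1), np.clip(xt, 0, w - 1)]
    # For a visible, consistently tracked pixel the round trip cancels out.
    err = np.linalg.norm(flow_fw + bw_at_target, axis=-1)
    return (err > tol) | ~inside
```

Pixels that leave the frame are flagged as well, since they have no counterpart in the next frame.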
29 Farneback? 📊 medium
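Answer: Approximates each neighborhood with a quadratic polynomial, then solves for the displacement between the two expansions; a dense-flow method available in OpenCV.
# prev_gray, next_gray: single-channel 8-bit frames; args after None: pyr_scale, levels, winsize, iterations, poly_n, poly_sigma, flags
flow = cv2.calcOpticalFlowFarneback(prev_gray, next_gray, None, 0.5, 3, 15, 3, 5, 1.2, 0)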
30 TV-L1? 🔥 hard
Answer: Total variation regularization with L1 data term—robust to outliers, good for preserving discontinuities.
31 Warping in deep nets? 📊 medium
Answer: Differentiable bilinear sampling to align frames by predicted flow—core building block in iterative refinement networks.
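Bilinear sampling by a flow field can be sketched in NumPy to show the arithmetic. This is only an illustration for a single-channel image; in a deep network the same formula would be written with an autodiff framework so gradients reach the predicted flow. The function name and the border clamping are assumptions.

```python
import numpy as np

def backward_warp(image, flow):
    """Sample image (H, W) at (x + u, y + v) using bilinear interpolation.

    With flow from frame t to t+1 and image = frame t+1, the output is
    frame t+1 aligned to frame t. Coordinates are clamped at the border.
    """
    h, w = image.shape
    ys, xs = np.mgrid[0:h, 0:w].astype(np.float64)
    x = np.clip(xs + flow[..., 0], 0, w - 1)
    y = np.clip(ys + flow[..., 1], 0, h - 1)
    x0 = np.floor(x).astype(int)
    y0 = np.floor(y).astype(int)
    x1 = np.minimum(x0 + 1, w - 1)
    y1 = np.minimum(y0 + 1, h - 1)
    wx, wy = x - x0, y - y0                     # fractional offsets in [0, 1)
    top = (1 - wx) * image[y0, x0] + wx * image[y0, x1]
    bottom = (1 - wx) * image[y1, x0] + wx * image[y1, x1]
    return (1 - wy) * top + wy * bottom
```

The output is piecewise linear in the sampling coordinates, which is what lets a loss on the warped image push gradients back into the flow.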
32 Deep learning flow? 📊 medium
Answer: CNNs predict flow end-to-end (FlowNet, PWC, RAFT)—supervised on synthetic datasets (FlyingChairs, Sintel) + finetune.
33 PWC-Net? 🔥 hard
Answer: Pyramid, warping, cost volume—correlate features at multiple scales efficiently.
34 RAFT? 🔥 hard
Answer: Build an all-pairs 4D correlation volume, pool it to multiple scales, and refine a single flow field with recurrent GRU updates; state of the art on Sintel and KITTI when published (2020).
35 Flow vs stereo? 📊 medium
Answer: Stereo: displacement along epipolar line (1D) after rectification. Flow: general 2D motion—stereo is constrained flow.
36 Large motion? 📊 medium
Answer: Pyramids, feature matching init, or patch-based methods—pure local LK insufficient without hierarchy.
37 Metrics? ⚡ easy
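Answer: EPE (end-point error), the mean Euclidean distance between predicted and ground-truth flow vectors, reported on Middlebury/Sintel/KITTI; KITTI also reports an outlier percentage (Fl). The datasets stress different conditions such as occlusion and realism.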
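End-point error is the mean Euclidean distance between predicted and ground-truth flow vectors. A small sketch, assuming (H, W, 2) flow arrays; the optional `valid` mask is an assumption for datasets with sparse ground truth:

```python
import numpy as np

def end_point_error(flow_pred, flow_gt, valid=None):
    """Mean Euclidean distance between predicted and ground-truth flow vectors."""
    err = np.linalg.norm(flow_pred - flow_gt, axis=-1)   # per-pixel error, shape (H, W)
    if valid is not None:
        err = err[valid]                                 # keep only pixels with ground truth
    return float(err.mean())

gt = np.zeros((4, 4, 2))
pred = np.zeros((4, 4, 2))
pred[..., 0], pred[..., 1] = 3.0, 4.0
print(end_point_error(pred, gt))   # every vector is off by (3, 4), so the EPE is 5.0
```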
38 Use in stabilization? ⚡ easy
Answer: Estimate global dominant motion from flow field or parametric model—smooth camera path.
39 Failure modes? 📊 medium
Answer: Textureless regions, repetitive patterns, fast motion, transparency—classical and learned methods both struggle.
40 Real-time? ⚡ easy
Answer: DIS, Farneback GPU, lite deep models—trade accuracy vs FPS for robotics and AR.

Action Recognition: 20 Essential Q&A

41 What is action recognition? ⚡ easy
Answer: Predict human action label(s) from a video clip—may include temporal localization and multi-person context.
42 Two-stream networks? 📊 medium
Answer: RGB CNN for appearance + flow CNN for motion; fuse scores—strong baseline before full 3D models.
43 C3D? 📊 medium
Answer: VGG-style 3×3×3 conv stacks on short clips—learns spatiotemporal filters end-to-end.
44 I3D? 🔥 hard
Answer: Inflate 2D ImageNet filters to 3D for initialization—bootstrap video nets from image pretraining, big jump on Kinetics.
# I3D: inflate 2D k×k filters to k×k×k by repeating the weights k times along time and dividing by k; initialize from ImageNet
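The inflation step can be sketched directly on a weight tensor. Assumptions: weights are laid out as (out_channels, in_channels, kh, kw), and the function name is hypothetical. Dividing by the temporal size keeps the response on a static clip equal to the 2D response.

```python
import numpy as np

def inflate_2d_kernel(w2d, t):
    """(out_c, in_c, kh, kw) -> (out_c, in_c, t, kh, kw).

    The 2D weights are repeated t times along the new time axis and divided
    by t, so a clip of identical frames gives the same activation as the 2D filter.
    """
    return np.repeat(w2d[:, :, None, :, :], t, axis=2) / t

rng = np.random.default_rng(0)
w2d = rng.normal(size=(4, 3, 3, 3))            # e.g. 4 filters over RGB, 3x3
frame = rng.normal(size=(3, 3, 3))             # one 3x3 RGB patch
clip = np.repeat(frame[:, None], 3, axis=1)    # the same frame repeated 3 times

w3d = inflate_2d_kernel(w2d, t=3)
resp_2d = np.einsum("oihw,ihw->o", w2d, frame)
resp_3d = np.einsum("oithw,ithw->o", w3d, clip)
print(np.allclose(resp_2d, resp_3d))
```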
45 TSN? 📊 medium
Answer: Sparse sampling of frames/segments across video; aggregate (avg/max) after shared CNN—covers long videos cheaply.
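Segment sampling and average consensus can be sketched in a few lines. Assumptions: one frame per equal-length segment, the segment centre at test time and a random frame within the segment at training time; the function names are hypothetical and the shared CNN is left out.

```python
import numpy as np

def tsn_segment_indices(num_frames, num_segments, rng=None):
    """One frame index per equal-length segment of the video.

    rng=None picks the centre of each segment (deterministic);
    passing a numpy Generator picks a random frame inside each segment.
    """
    edges = np.linspace(0, num_frames, num_segments + 1)
    if rng is None:
        idx = (edges[:-1] + edges[1:]) / 2
    else:
        idx = rng.uniform(edges[:-1], edges[1:])
    return np.minimum(idx.astype(int), num_frames - 1)

def average_consensus(per_frame_scores):
    """Average class scores over the sampled frames: (segments, classes) -> (classes,)."""
    return np.mean(per_frame_scores, axis=0)

print(tsn_segment_indices(300, 3))
```

The cost depends on the number of segments, not on the video length, which is why this covers long videos cheaply.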
46 SlowFast? 🔥 hard
Answer: Dual pathways with different frame rates and widths—efficient motion + semantics; widely used backbone.
47 X3D? 📊 medium
Answer: Expand dimensions (depth, width, resolution, frames) systematically for accuracy–compute Pareto.
48 Kinetics? ⚡ easy
Answer: Large-scale YouTube action dataset (400/600/700 classes)—ImageNet moment for video recognition.
49 Something-Something? 📊 medium
Answer: Emphasizes fine-grained motion and object interactions—tests temporal reasoning more than appearance.
50 Early vs late fusion? 🔥 hard
Answer: Early: stack frames/channels in first layers. Late: per-frame features then aggregate. Middle fusion via 3D conv is a common compromise.
51 LSTM over CNN features? 📊 medium
Answer: Encodes temporal order for variable-length clips; lighter than 3D convolutions but can underfit complex motion compared with Transformers.
52 Space-time attention? 🔥 hard
Answer: Factorized or joint attention over patches and time (TimeSformer, Video Swin)—scalable with data.
53 Skeleton-based? 📊 medium
Answer: Graph convolutions on body joints—privacy-friendly, works with pose estimators; less appearance detail.
54 Long videos? 🔥 hard
Answer: Segment clips, hierarchical models, or memory—full movies need different pipelines than 10s Kinetics clips.
55 Temporal localization? 📊 medium
Answer: Detect action start/end (SSN, BMN) or per-frame labels—needed for untrimmed video.
56 Multi-label? 📊 medium
Answer: Sigmoid + BCE when several actions co-occur (cooking + talking)—different from single softmax.
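The difference from softmax is that each class gets its own independent probability. A small NumPy sketch of sigmoid plus binary cross-entropy on logits, using the numerically stable form; the function names are hypothetical:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def multilabel_bce(logits, targets):
    """Mean binary cross-entropy over independent per-class sigmoids.

    Uses the stable form max(z, 0) - z*y + log(1 + exp(-|z|)).
    """
    loss = np.maximum(logits, 0) - logits * targets + np.log1p(np.exp(-np.abs(logits)))
    return float(loss.mean())

logits = np.array([[5.0, 5.0, -5.0]])      # e.g. cooking, talking, running
targets = np.array([[1.0, 1.0, 0.0]])      # two actions co-occur
print(sigmoid(logits))                      # two classes can both be near 1
print(multilabel_bce(logits, targets))
```

With softmax the two active classes would have to share probability mass, so neither could approach 1.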
57 Weak supervision? 🔥 hard
Answer: Train with video tags only (MIL) or narration—reduces frame-level annotation cost.
58 Real-time? ⚡ easy
Answer: Smaller X3D variants, 3D MobileNet-style backbones, or keyframe-only inference; the latency budget of edge cameras drives the choice.
59 Transfer learning? 📊 medium
Answer: ImageNet → inflate → Kinetics finetune → downstream (surgery, sports)—standard recipe.
60 Metrics? ⚡ easy
Answer: Top-1/5 accuracy; mAP for detection/localization; sometimes calibrated confidence for safety apps.
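Top-k accuracy counts a clip as correct when the true label is among the k highest-scoring classes. A minimal sketch, assuming a (num_clips, num_classes) score matrix and integer labels; the function name is hypothetical:

```python
import numpy as np

def top_k_accuracy(scores, labels, k=1):
    """Fraction of rows whose true label is among the k highest scores."""
    topk = np.argsort(-scores, axis=1)[:, :k]          # indices of the k best classes
    hits = (topk == labels[:, None]).any(axis=1)
    return float(hits.mean())

scores = np.array([[0.1, 0.7, 0.2],
                   [0.5, 0.3, 0.2],
                   [0.2, 0.3, 0.5]])
labels = np.array([1, 1, 2])
print(top_k_accuracy(scores, labels, k=1))   # 2 of 3 clips correct
print(top_k_accuracy(scores, labels, k=2))   # all 3 correct within the top 2
```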