Video Analysis — Interview Q&A

Question 1

1 How is video different from images? ⚡ easy

Answer

Answer: Adds a time axis—motion, temporal context, and much higher data rate; models must aggregate or align frames.

Question 2

2 What is FPS? ⚡ easy

Answer

Answer: Frames per second—higher FPS preserves motion detail but increases compute; downsample for recognition if action is slow.

Question 3

3 Temporal redundancy? 📊 medium

Answer

Answer: Adjacent frames are highly correlated—exploited by codecs (motion compensation) and by sparse sampling for recognition.

Question 4

4 Frame sampling strategies? 📊 medium

Answer

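Answer: Uniform stride, uniform stride with random jitter, cropping a short contiguous clip, or dense sampling (optionally with optical flow as an extra motion input). The trade-off is temporal coverage versus GPU memory.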
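
As a rough illustration of the first two strategies, the sketch below picks frame indices either at the center of equal-length segments (uniform) or at a random position inside each segment (jitter). The function names and the segment-based formulation are assumptions made for this example, not taken from a specific library.

```python
import random

def uniform_indices(num_frames, num_samples):
    """Pick one index at the center of each of num_samples equal segments."""
    seg = num_frames / num_samples
    return [int(seg * i + seg / 2) for i in range(num_samples)]

def jittered_indices(num_frames, num_samples, rng):
    """Pick one random index inside each of num_samples equal segments."""
    seg = num_frames / num_samples
    out = []
    for i in range(num_samples):
        lo = int(seg * i)
        hi = max(lo, int(seg * (i + 1)) - 1)
        out.append(rng.randint(lo, hi))  # randint is inclusive on both ends
    return out

print(uniform_indices(32, 4))  # [4, 12, 20, 28]
print(jittered_indices(32, 4, random.Random(0)))
```

Uniform sampling is the usual choice at test time because it is deterministic; jitter is typically used during training as a cheap temporal augmentation.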

Question 5

5 3D convolutions? 📊 medium

Answer

Answer: Kernel spans H×W×T—learns motion patterns directly; expensive vs 2D+temporal modules.
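
To make the cost difference concrete, the sketch below counts weights (ignoring biases) for a full 3×3×3 convolution and for a factorized alternative: a 1×3×3 spatial convolution followed by a 3×1×1 temporal one. Keeping the channel count at 64 in both steps is an assumption chosen for simplicity; published factorized designs may pick a different intermediate width.

```python
def conv3d_params(c_in, c_out, k_t, k_h, k_w):
    """Weight count of a 3D convolution, biases ignored."""
    return c_in * c_out * k_t * k_h * k_w

c = 64  # assumed channel width for the comparison
full_3d = conv3d_params(c, c, 3, 3, 3)      # joint space-time kernel
spatial = conv3d_params(c, c, 1, 3, 3)      # 2D kernel applied per frame
temporal = conv3d_params(c, c, 3, 1, 1)     # 1D kernel along time
factorized = spatial + temporal

print(full_3d, factorized)  # 110592 49152
```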

Question 6

6 Two-stream architecture? 📊 medium

Answer

Answer: One stream on RGB (appearance), one on stacked optical flow (motion), combined by late fusion; a classic baseline for action recognition.

Question 7

7 CNN + LSTM? 📊 medium

Answer

Answer: Per-frame CNN embeddings fed to LSTM/GRU to model sequence—flexible length, order-sensitive.

Question 8

8 Codec vs container? ⚡ easy

Answer

Answer: Codec (H.264/HEVC/AV1) compresses pixels; container (MP4/MKV) wraps streams and metadata—decode to raw frames for training.

Question 9

9 Decode pipeline? 📊 medium

Answer

Answer: Open the stream (e.g. OpenCV VideoCapture) → seek to the nearest keyframe → decode frames → convert to tensors. Hardware decoders (NVDEC) are critical for throughput at scale.

Question 10

10 Background subtraction? 📊 medium

Answer

Answer: MOG2, KNN—foreground mask for surveillance; sensitive to lighting and camera jitter.

Question 11

11 Stabilization? 🔥 hard

Answer

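Answer: Estimate global camera motion, smooth the resulting trajectory, and warp each frame onto the smoothed path. Common in consumer video. Unlike optical flow, it fits one global transform per frame rather than a vector per pixel.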

Question 12

12 Tracking in video? 📊 medium

Answer

Answer: Detection per frame + association (SORT, DeepSORT)—links identities across time for analytics.
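
The association step can be sketched as greedy matching on box overlap. This is a deliberate simplification: SORT additionally predicts track positions with a Kalman filter and solves the assignment with the Hungarian algorithm, and DeepSORT adds appearance embeddings. The function names and the 0.3 threshold below are illustrative assumptions.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def associate(tracks, detections, iou_threshold=0.3):
    """Greedily match tracks to detections in order of descending IoU.

    tracks: dict of integer track id -> last known box.
    detections: list of boxes from the current frame.
    Returns (matches, unmatched) where matches maps track id -> detection index.
    """
    pairs = sorted(
        ((iou(box, det), tid, j)
         for tid, box in tracks.items()
         for j, det in enumerate(detections)),
        reverse=True,
    )
    matches, used = {}, set()
    for score, tid, j in pairs:
        if score < iou_threshold:
            break  # remaining pairs overlap even less
        if tid in matches or j in used:
            continue
        matches[tid] = j
        used.add(j)
    unmatched = [j for j in range(len(detections)) if j not in used]
    return matches, unmatched

tracks = {1: (0, 0, 10, 10), 2: (20, 20, 30, 30)}
detections = [(21, 21, 31, 31), (1, 0, 11, 10), (50, 50, 60, 60)]
matches, unmatched = associate(tracks, detections)
print(matches, unmatched)  # {1: 1, 2: 0} [2]
```

Unmatched detections would normally start new tracks, and tracks that stay unmatched for several frames would be dropped.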

Question 13

13 Long videos? 🔥 hard

Answer

Answer: Clip sampling, hierarchical pooling, memory transformers, or sparse attention—full-length movies don’t fit one forward pass.

Question 14

14 Augmentation? 📊 medium

Answer

Answer: Temporal jitter, random clip length, mixup on clips, cutout. Augmentations must preserve label semantics, in particular temporal order: reversing a clip can turn "opening a door" into "closing a door".

Question 15

15 Example datasets? ⚡ easy

Answer

Answer: Kinetics, Something-Something, UCF101, AVA—vary in duration, labels, and multi-label structure.

Question 16

16 Real-time? 📊 medium

Answer

Answer: Batch size 1, TensorRT, smaller backbone, ROI processing—latency budget drives architecture choice.

Question 17

17 SlowFast pathways? 🔥 hard

Answer

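Answer: The Slow pathway runs at a low frame rate with high channel capacity and captures spatial semantics. The Fast pathway runs at a high frame rate with few channels and captures motion. Lateral connections fuse Fast features into the Slow pathway, so both are captured efficiently.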
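
A minimal sketch of how the two pathways might sample the same raw clip: the Fast pathway uses a temporal stride that is `alpha` times smaller than the Slow one. The values below (64-frame clip, Slow stride 16, alpha 8) are example settings chosen for illustration, and the helper name is hypothetical.

```python
def slowfast_indices(clip_len, slow_stride, alpha):
    """Return (slow, fast) frame indices for one raw clip of clip_len frames."""
    if slow_stride % alpha != 0:
        raise ValueError("slow_stride must be a multiple of alpha")
    fast_stride = slow_stride // alpha
    slow = list(range(0, clip_len, slow_stride))
    fast = list(range(0, clip_len, fast_stride))
    return slow, fast

slow, fast = slowfast_indices(clip_len=64, slow_stride=16, alpha=8)
print(len(slow), len(fast))  # 4 32
```

The Fast pathway sees eight times more frames here, which is why it is kept narrow (few channels) to stay cheap.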

Question 18

18 Video ViT? 🔥 hard

Answer

Answer: Patchify across space-time or factorized space/time attention—scalable with large data (TimeSformer, Video Swin).

Question 19

19 Anomaly detection? 📊 medium

Answer

Answer: Score frames by reconstruction error or future-frame prediction error, measured in pixel or latent space; a high error flags an anomaly. Used in surveillance and manufacturing QC.

Question 20

20 Deployment? ⚡ easy

Answer

Answer: Stream ingestion (RTSP), frame sync, GPU decode, model serving with batching policies for variable FPS.

Question 21

21 What is optical flow? ⚡ easy

Answer

Answer: Per-pixel 2D motion field (u,v) mapping points in frame t to frame t+1—under brightness and smoothness assumptions.

Question 22

22 Brightness constancy? 📊 medium

Answer

Answer: Assumes I(x,y,t) ≈ I(x+u,y+v,t+1)—linearize for small motion; breaks with lighting change or specularities.
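
The linearization mentioned above is a first-order Taylor expansion. Writing I_x, I_y, I_t for the partial derivatives of the image (notation assumed here), it gives the optical flow constraint equation:

```latex
\begin{aligned}
I(x+u,\; y+v,\; t+1) &\approx I(x,y,t) + I_x u + I_y v + I_t \\
\Rightarrow\quad I_x u + I_y v + I_t &= 0
\end{aligned}
```

This is one equation in two unknowns (u, v) per pixel, so only the flow component along the image gradient is determined. That ambiguity is the aperture problem, and it is why an extra assumption such as smoothness or a constant-flow patch is needed.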

Question 23

23 Smoothness prior? 📊 medium

Answer

Answer: Neighboring pixels should have similar flow. This regularizes the otherwise ill-posed problem, but the assumption is violated at motion boundaries, where it over-smooths the flow.

Question 24

24 Lucas–Kanade? 📊 medium

Answer

Answer: Assume constant flow within a patch and solve a least-squares system built from spatial and temporal image gradients. Usually run sparsely; works well at corners, fails in uniform regions.
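
A single-patch, single-scale sketch of the idea in NumPy: stack the spatial gradients of every pixel in the window into a matrix, put the negated temporal differences on the right-hand side, and solve by least squares. The function name, window size, and synthetic test pattern are assumptions for illustration; production code would normally use a pyramidal implementation.

```python
import numpy as np

def lucas_kanade_patch(frame0, frame1, cy, cx, half=7):
    """Estimate one (u, v) flow vector for the window centered at (cy, cx)."""
    f0 = frame0.astype(np.float64)
    f1 = frame1.astype(np.float64)
    # np.gradient returns the derivative along axis 0 (y) first, then axis 1 (x).
    iy, ix = np.gradient(f0)
    it = f1 - f0  # temporal derivative for a unit time step
    win = (slice(cy - half, cy + half + 1), slice(cx - half, cx + half + 1))
    a = np.stack([ix[win].ravel(), iy[win].ravel()], axis=1)
    b = -it[win].ravel()
    # Least-squares solution of a @ [u, v] = b over all pixels in the window.
    flow, *_ = np.linalg.lstsq(a, b, rcond=None)
    return flow

# Synthetic smooth pattern shifted by a known sub-pixel amount (u=0.3, v=-0.2).
y, x = np.mgrid[0:64, 0:64].astype(np.float64)
f0 = np.sin(0.2 * x) + np.sin(0.2 * y)
f1 = np.sin(0.2 * (x - 0.3)) + np.sin(0.2 * (y + 0.2))
flow = lucas_kanade_patch(f0, f1, 32, 32)
print(flow)  # close to (0.3, -0.2), up to linearization error
```

In a uniform region both gradient columns are near zero, the system becomes degenerate, and the estimate is unreliable, which is the failure case named in the answer.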

Question 25

25 Horn–Schunck? 🔥 hard

Answer

Answer: Global energy balancing data term and smoothness—produces dense flow; iterative Gauss–Seidel / modern convex solvers.
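
Written out, the energy has the following standard form, where I_x, I_y, I_t are image derivatives and α is the smoothness weight (notation assumed here):

```latex
E(u, v) = \iint \Big[ \underbrace{(I_x u + I_y v + I_t)^2}_{\text{data term}}
        + \alpha^2 \underbrace{\big( \lVert \nabla u \rVert^2 + \lVert \nabla v \rVert^2 \big)}_{\text{smoothness term}} \Big] \, dx \, dy
```

A larger α yields smoother flow, at the cost of blurring motion boundaries.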

Question 26

26 Dense vs sparse? ⚡ easy

Answer

Answer: Dense: vector per pixel. Sparse: features only (LK on Harris corners)—dense needed for warping, segmentation, depth hints.

Question 27

27 Gaussian pyramid? 📊 medium

Answer

Answer: Estimate flow coarse-to-fine to handle large displacement—warp and refine each level.

Question 28

28 Occlusions? 🔥 hard

Answer

Answer: Pixels visible in one frame but not the next—forward-backward consistency checks and learned occlusion masks help.
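
A forward-backward check can be sketched as follows: follow the forward flow to the next frame, read the backward flow there, and flag pixels where the round trip does not return close to the start. The nearest-neighbour lookup, the fixed tolerance, and the function name are simplifying assumptions; real pipelines often use bilinear lookup and a motion-dependent threshold.

```python
import numpy as np

def occlusion_mask(flow_fw, flow_bw, tol=1.0):
    """flow_fw, flow_bw: arrays of shape (H, W, 2) holding (u, v) per pixel.

    Returns a boolean (H, W) mask that is True where the pixel is likely
    occluded or leaves the frame.
    """
    h, w = flow_fw.shape[:2]
    ys, xs = np.mgrid[0:h, 0:w]
    # Where each pixel lands in the next frame (rounded to the nearest pixel).
    xt = np.rint(xs + flow_fw[..., 0]).astype(int)
    yt = np.rint(ys + flow_fw[..., 1]).astype(int)
    inside = (xt >= 0) & (xt < w) & (yt >= 0) & (yt < h)
    xt_c = np.clip(xt, 0, w - 1)
    yt_c = np.clip(yt, 0, h - 1)
    # For a consistent pixel, forward flow plus backward flow at the target is ~0.
    round_trip = flow_fw + flow_bw[yt_c, xt_c]
    err = np.linalg.norm(round_trip, axis=-1)
    return (~inside) | (err > tol)

fw = np.zeros((4, 5, 2)); fw[..., 0] = 1.0    # everything moves 1 px right
bw = np.zeros((4, 5, 2)); bw[..., 0] = -1.0   # consistent backward flow
mask = occlusion_mask(fw, bw)
print(mask[:, -1].all(), mask[:, :-1].any())  # True False
```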

Question 29

29 Farneback? 📊 medium

Answer

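Answer: Approximates each pixel neighborhood with a quadratic polynomial (polynomial expansion), then solves for the displacement between the expansions of the two frames. A dense classical method available in OpenCV (calcOpticalFlowFarneback).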

Question 30

30 TV-L1? 🔥 hard

Answer

Answer: Total variation regularization with L1 data term—robust to outliers, good for preserving discontinuities.

Question 31

31 Warping in deep nets? 📊 medium

Answer

Answer: Differentiable bilinear sampling to align frames by predicted flow—core building block in iterative refinement networks.
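
The sampling operation itself can be illustrated in plain NumPy. This sketch backward-warps a single-channel image, so that out[y, x] reads the input at (y + v, x + u) with bilinear interpolation and border clamping. In a deep network the same operation would be written with differentiable tensor ops (for example `torch.nn.functional.grid_sample` in PyTorch); the NumPy version and its function name are for illustration only.

```python
import numpy as np

def warp_bilinear(image, flow):
    """Backward-warp image (H, W) by flow (H, W, 2) holding (u, v) per pixel."""
    h, w = image.shape
    ys, xs = np.mgrid[0:h, 0:w].astype(np.float64)
    # Sampling positions, clamped to the image border.
    x = np.clip(xs + flow[..., 0], 0, w - 1)
    y = np.clip(ys + flow[..., 1], 0, h - 1)
    x0 = np.floor(x).astype(int)
    y0 = np.floor(y).astype(int)
    x1 = np.minimum(x0 + 1, w - 1)
    y1 = np.minimum(y0 + 1, h - 1)
    wx = x - x0  # fractional offsets act as interpolation weights
    wy = y - y0
    top = (1 - wx) * image[y0, x0] + wx * image[y0, x1]
    bottom = (1 - wx) * image[y1, x0] + wx * image[y1, x1]
    return (1 - wy) * top + wy * bottom

image = np.arange(20, dtype=np.float64).reshape(4, 5)  # image[y, x] = 5*y + x
flow = np.zeros((4, 5, 2)); flow[..., 0] = 0.5         # sample half a pixel to the right
out = warp_bilinear(image, flow)
print(out[0, :4])  # [0.5 1.5 2.5 3.5]
```

Because the output is a weighted sum of four neighbours with weights that depend smoothly on the flow, gradients can pass through both the image and the flow, which is what makes the operation usable inside iterative refinement networks.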

Question 32

32 Deep learning flow? 📊 medium

Answer

Answer: CNNs predict flow end-to-end (FlowNet, PWC-Net, RAFT). They are typically trained with supervision on synthetic datasets (FlyingChairs, Sintel) and then fine-tuned on the target benchmark.

Question 33

33 PWC-Net? 🔥 hard

Answer

Answer: Pyramid, warping, cost volume—correlate features at multiple scales efficiently.

Question 34

34 RAFT? 🔥 hard

Answer

Answer: Builds a multi-scale 4D correlation volume over all pixel pairs and iteratively refines the flow with recurrent GRU updates. It set the state of the art on Sintel and KITTI when published (2020).

Question 35

35 Flow vs stereo? 📊 medium

Answer

Answer: Stereo: displacement along epipolar line (1D) after rectification. Flow: general 2D motion—stereo is constrained flow.

Question 36

36 Large motion? 📊 medium

Answer

Answer: Pyramids, feature matching init, or patch-based methods—pure local LK insufficient without hierarchy.

Question 37

37 Metrics? ⚡ easy

Answer

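Answer: EPE (average end-point error: the mean Euclidean distance between predicted and ground-truth flow vectors), reported on Middlebury, Sintel, and KITTI. KITTI also reports an outlier percentage (Fl-all). The datasets stress different things, such as occlusion and realism.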
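
As a concrete reference, EPE can be computed in a few lines of NumPy. The optional `valid` mask is an assumption added here to mimic benchmarks with sparse ground truth; the function name is illustrative.

```python
import numpy as np

def end_point_error(flow_pred, flow_gt, valid=None):
    """Mean Euclidean distance between predicted and true (u, v) vectors.

    flow_pred, flow_gt: arrays of shape (H, W, 2).
    valid: optional boolean (H, W) mask selecting pixels with ground truth.
    """
    err = np.linalg.norm(flow_pred - flow_gt, axis=-1)
    if valid is not None:
        err = err[valid]
    return float(err.mean())

gt = np.zeros((2, 2, 2))
pred = np.zeros((2, 2, 2))
pred[..., 0] = 3.0
pred[..., 1] = 4.0
print(end_point_error(pred, gt))  # 5.0
```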

Question 38

38 Use in stabilization? ⚡ easy

Answer

Answer: Estimate global dominant motion from flow field or parametric model—smooth camera path.

Question 39

39 Failure modes? 📊 medium

Answer

Answer: Textureless regions, repetitive patterns, fast motion, transparency—classical and learned methods both struggle.

Question 40

40 Real-time? ⚡ easy

Answer

Answer: DIS (Dense Inverse Search), GPU Farneback, or lightweight deep models; the trade-off is accuracy versus FPS for robotics and AR.

Question 41

41 What is action recognition? ⚡ easy

Answer

Answer: Predict human action label(s) from a video clip—may include temporal localization and multi-person context.

Question 42

42 Two-stream networks? 📊 medium

Answer

Answer: RGB CNN for appearance + flow CNN for motion; fuse scores—strong baseline before full 3D models.

Question 43

43 C3D? 📊 medium

Answer

Answer: VGG-style 3×3×3 conv stacks on short clips—learns spatiotemporal filters end-to-end.

Question 44

44 I3D? 🔥 hard

Answer

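Answer: Inflate 2D ImageNet filters to 3D for initialization by repeating them along the time axis and rescaling by 1/T. This bootstraps video networks from image pretraining and gave a big accuracy jump together with Kinetics pretraining.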
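
The inflation step can be sketched in NumPy: copy each 2D kernel T times along a new time axis and divide by T. With that scaling, a video made of one repeated frame produces the same response as the original 2D filter did on that frame. The weight layout (C_out, C_in, kH, kW) and the function name are assumptions for this example.

```python
import numpy as np

def inflate_2d_kernel(w2d, t):
    """Inflate weights (C_out, C_in, kH, kW) to (C_out, C_in, t, kH, kW)."""
    w3d = np.repeat(w2d[:, :, None, :, :], t, axis=2)
    return w3d / t  # rescale so the sum over time equals the 2D kernel

rng = np.random.default_rng(0)
w2d = rng.standard_normal((8, 3, 3, 3))
w3d = inflate_2d_kernel(w2d, t=5)
print(w3d.shape)  # (8, 3, 5, 3, 3)
```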

Question 45

45 TSN? 📊 medium

Answer

Answer: Sparse sampling of frames/segments across video; aggregate (avg/max) after shared CNN—covers long videos cheaply.
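
The aggregation step can be as simple as averaging the class scores that the shared CNN produced for each sampled segment. The sketch below assumes the per-segment scores are already available as plain lists; the scores are made-up illustrative values and the function name is hypothetical.

```python
def segment_consensus(segment_scores):
    """Average per-segment class scores into one video-level score vector."""
    n = len(segment_scores)
    return [sum(col) / n for col in zip(*segment_scores)]

# Three segments, two classes (illustrative values only).
scores = [[0.2, 0.8], [0.6, 0.4], [0.1, 0.9]]
video_scores = segment_consensus(scores)
predicted = max(range(len(video_scores)), key=video_scores.__getitem__)
print(predicted)  # 1
```

Replacing the average with an element-wise maximum gives the max-pooling variant mentioned in the answer.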

Question 46

46 SlowFast? 🔥 hard

Answer

Answer: Dual pathways with different frame rates and widths—efficient motion + semantics; widely used backbone.

Question 47

47 X3D? 📊 medium

Answer

Answer: Starts from a tiny network and expands it one axis at a time (depth, width, spatial resolution, number of frames, frame rate) to reach a good accuracy-compute trade-off.

Question 48

48 Kinetics? ⚡ easy

Answer

Answer: Large-scale YouTube action dataset (400/600/700 classes)—ImageNet moment for video recognition.

Question 49

49 Something-Something? 📊 medium

Answer

Answer: Emphasizes fine-grained motion and object interactions—tests temporal reasoning more than appearance.

Question 50

50 Early vs late fusion? 🔥 hard

Answer

Answer: Early: stack frames/channels in the first layers. Late: compute per-frame features, then aggregate. Middle fusion via 3D convolutions is a common compromise.

Question 51

51 LSTM over CNN features? 📊 medium

Answer

Answer: Encodes temporal order for variable-length clips. Lighter than 3D convolutions, but can underfit complex motion compared with Transformers.

Question 52

52 Space-time attention? 🔥 hard

Answer

Answer: Factorized or joint attention over patches and time (TimeSformer, Video Swin)—scalable with data.

Question 53

53 Skeleton-based? 📊 medium

Answer

Answer: Graph convolutions on body joints—privacy-friendly, works with pose estimators; less appearance detail.

Question 54

54 Long videos? 🔥 hard

Answer

Answer: Segment clips, hierarchical models, or memory—full movies need different pipelines than 10s Kinetics clips.

Question 55

55 Temporal localization? 📊 medium

Answer

Answer: Detect action start/end (SSN, BMN) or per-frame labels—needed for untrimmed video.

Question 56

56 Multi-label? 📊 medium

Answer

Answer: Sigmoid + BCE when several actions co-occur (cooking + talking)—different from single softmax.
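
The difference from softmax is easy to see numerically: independent sigmoids can assign high probability to several classes at once, while softmax forces the probabilities to compete. The sketch below uses only the standard library; the logits are made-up example values and the function names are illustrative.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

def bce_with_logits(logits, targets):
    """Mean binary cross-entropy over independent per-class sigmoids."""
    total = 0.0
    for z, y in zip(logits, targets):
        # Stable form of -[y*log(s) + (1-y)*log(1-s)] with s = sigmoid(z).
        total += max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))
    return total / len(logits)

logits = [3.0, 3.0, -3.0]   # e.g. cooking, talking, running
targets = [1, 1, 0]
print([round(sigmoid(z), 3) for z in logits])  # two classes can both be near 1
print([round(p, 3) for p in softmax(logits)])  # probabilities must sum to 1
print(bce_with_logits(logits, targets))
```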

Question 57

57 Weak supervision? 🔥 hard

Answer

Answer: Train with video tags only (MIL) or narration—reduces frame-level annotation cost.

Question 58

58 Real-time? ⚡ easy

Answer

Answer: Smaller X3D, MobileNet-3D, or keyframe-only inference—latency budgets for edge cameras.

Question 59

59 Transfer learning? 📊 medium

Answer

Answer: ImageNet → inflate → Kinetics finetune → downstream (surgery, sports)—standard recipe.

Question 60

60 Metrics? ⚡ easy

Answer

Answer: Top-1 and top-5 accuracy for classification; mAP for detection and temporal localization; sometimes calibration of confidence scores for safety-critical applications.
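
Top-k accuracy counts a prediction as correct when the true label is among the k highest-scoring classes. A minimal sketch with made-up scores (the function name is illustrative):

```python
def top_k_accuracy(scores, labels, k):
    """Fraction of samples whose true label is among the k top-scoring classes."""
    hits = 0
    for row, label in zip(scores, labels):
        top = sorted(range(len(row)), key=lambda c: row[c], reverse=True)[:k]
        hits += label in top
    return hits / len(labels)

scores = [[0.1, 0.7, 0.2], [0.5, 0.3, 0.2], [0.2, 0.3, 0.5]]
labels = [1, 1, 0]
print(top_k_accuracy(scores, labels, 1))  # 1 of 3 correct
print(top_k_accuracy(scores, labels, 2))  # 2 of 3 correct
```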
