Computer Vision Interview: Introduction & Basics
20 essential Q&A
Updated 2026
Short, interview-ready answers on what computer vision is, core tasks, classical vs deep learning, and how real CV systems are built.
~11 min read · 20 questions · Beginner
1. What is Computer Vision?
⚡ easy
Answer: Computer Vision (CV) is a field that builds algorithms and systems to extract meaningful information from images or video—so machines can see, interpret scenes, measure geometry, recognize objects, track motion, or reconstruct 3D structure. It overlaps with image processing, machine learning, robotics, and graphics.
2. How does Computer Vision differ from Image Processing?
⚡ easy
Answer: Image processing focuses on transforming pixel data—filtering, enhancement, resizing, color conversion—often without explicit “understanding” of scene content. Computer vision uses those signals (and learning) to infer semantics (what/where), geometry (depth, pose), or actions. Image processing is often a preprocessing step inside a CV pipeline.
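A minimal illustration of image processing as a preprocessing step: a 3x3 box blur in plain NumPy. This is a sketch for intuition; a real pipeline would use something like cv2.blur or scipy.ndimage instead of explicit loops.

```python
import numpy as np

def box_blur(img: np.ndarray) -> np.ndarray:
    """3x3 mean filter with edge replication; img is a 2-D grayscale array."""
    padded = np.pad(img.astype(np.float32), 1, mode="edge")
    out = np.zeros_like(img, dtype=np.float32)
    for dy in (-1, 0, 1):
        for dx in (-1, 0, 1):
            out += padded[1 + dy : 1 + dy + img.shape[0],
                          1 + dx : 1 + dx + img.shape[1]]
    return out / 9.0

# A single bright "noise" spike gets spread over its neighborhood.
noisy = np.array([[0, 0, 0], [0, 9, 0], [0, 0, 0]], dtype=np.float32)
smooth = box_blur(noisy)
```

Note the blur "understands" nothing about the scene; it only transforms pixels, which is exactly the distinction the answer draws.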
3. How does Computer Vision relate to Machine Learning and AI?
⚡ easy
Answer: AI is the broad goal of intelligent behavior. ML learns patterns from data; modern CV heavily uses ML (especially deep learning) for classification, detection, and segmentation. CV is an application domain: you still need domain choices—camera models, geometry, evaluation for vision tasks—not just generic tabular ML.
4. How is a digital image represented?
⚡ easy
Answer: Typically as a grid of pixels; each pixel stores intensity (grayscale) or multiple channels (e.g. RGB). Values are discrete after sampling (spatial) and quantization (brightness levels). In code this is often a tensor or NumPy array with shape (H, W) or (H, W, C).
# Grayscale: H x W; RGB: H x W x 3
import numpy as np
img = np.zeros((480, 640, 3), dtype=np.uint8)
5. What are common Computer Vision tasks?
⚡ easy
Answer:
- Classification: what object/scene is in the image?
- Detection: what objects and where (bounding boxes)?
- Segmentation: pixel-level regions (semantic/instance)
- Keypoints / pose: landmarks, human pose, faces
- Tracking: same object across video frames
- 3D: depth, stereo, reconstruction, SLAM
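Interviewers often probe whether you know each task's input/output contract. The array names and sizes below are illustrative, not from any specific library; they just show the typical shape of each task's output:

```python
import numpy as np

H, W = 480, 640          # image size
num_classes = 10         # classification labels
num_boxes = 5            # detected objects
num_joints = 17          # pose landmarks (e.g. a common human-pose convention)

class_logits = np.zeros(num_classes)             # classification: one score per class
boxes = np.zeros((num_boxes, 4))                 # detection: (x1, y1, x2, y2) per object
box_scores = np.zeros(num_boxes)                 # plus a confidence per box
sem_mask = np.zeros((H, W), dtype=np.int64)      # semantic segmentation: class id per pixel
keypoints = np.zeros((num_joints, 2))            # keypoints/pose: (x, y) per landmark
track_ids = np.zeros(num_boxes, dtype=np.int64)  # tracking: identity per box across frames
```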
6. What is low-level vs high-level vision?
📊 medium
Answer: Low-level vision works on raw pixels and local structure—edges, textures, optical flow, filtering. Mid-level groups structure into parts or regions. High-level vision reasons about objects, relationships, and scene understanding. Many pipelines stack these stages; deep networks can learn hierarchical features that blur the boundaries.
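A classic low-level operation is edge detection. Here is a sketch of a horizontal Sobel response in plain NumPy (real code would call cv2.Sobel or scipy.ndimage.sobel; the explicit loop is only for clarity):

```python
import numpy as np

def sobel_x(img: np.ndarray) -> np.ndarray:
    """Horizontal-gradient Sobel response on a 2-D grayscale array (valid region only)."""
    kx = np.array([[-1, 0, 1],
                   [-2, 0, 2],
                   [-1, 0, 1]], dtype=np.float32)
    H, W = img.shape
    out = np.zeros((H - 2, W - 2), dtype=np.float32)
    for y in range(H - 2):
        for x in range(W - 2):
            out[y, x] = np.sum(img[y:y + 3, x:x + 3] * kx)
    return out

# A vertical step edge: dark left half, bright right half.
step = np.zeros((5, 6), dtype=np.float32)
step[:, 3:] = 1.0
resp = sobel_x(step)  # strong response near the edge, zero in flat regions
```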
7. What makes Computer Vision hard in the real world?
📊 medium
Answer: Lighting changes, occlusions, clutter, viewpoint and scale variation, motion blur, sensor noise, class imbalance, domain shift (train vs deploy), labeling cost, latency and memory on edge devices, and safety/privacy constraints. Robust systems combine data, augmentation, architecture, and careful evaluation.
8. Why are convolutions central to modern Computer Vision?
📊 medium
Answer: Convolutional layers enforce local connectivity and weight sharing, which matches the spatial structure of images, reduces parameters vs fully connected layers, and builds translation-aware features. Deep CNNs stack convolutions to capture edges → textures → parts → objects.
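The parameter savings from weight sharing are easy to quantify. A back-of-envelope comparison for a single layer on a 224x224 RGB input (the layer sizes are illustrative):

```python
# One 3x3 conv, 3 -> 64 channels: weights + biases.
# Parameter count is independent of image resolution.
conv_params = 3 * 3 * 3 * 64 + 64  # 1,792 parameters

# A fully connected layer mapping the flattened image to a same-resolution
# 64-channel feature map needs one weight per input-output pair.
in_features = 224 * 224 * 3    # 150,528
out_features = 224 * 224 * 64  # 3,211,264
fc_params = in_features * out_features + out_features  # hundreds of billions

print(conv_params, fc_params)
```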
9. What are handcrafted features vs learned features?
📊 medium
Answer: Handcrafted features (SIFT, HOG, Harris, etc.) are engineered descriptors of local image structure, often paired with classical classifiers. Learned features come from neural network weights trained end-to-end on data. Deep learning dominates when large labeled data or pre-trained models exist; classical features remain useful for small data, interpretability, or embedded baselines.
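To make "engineered descriptor" concrete, here is a heavily simplified HOG-style sketch: a magnitude-weighted histogram of gradient orientations over one patch. Real HOG adds cells, block normalization, and other details; this only shows the core idea.

```python
import numpy as np

def orientation_histogram(patch: np.ndarray, bins: int = 9) -> np.ndarray:
    """Magnitude-weighted histogram of unsigned gradient orientations (0-180 deg)."""
    gy, gx = np.gradient(patch.astype(np.float32))
    mag = np.hypot(gx, gy)
    ang = np.rad2deg(np.arctan2(gy, gx)) % 180.0  # fold sign: unsigned orientation
    hist, _ = np.histogram(ang, bins=bins, range=(0.0, 180.0), weights=mag)
    return hist / (np.linalg.norm(hist) + 1e-6)   # L2-normalize the descriptor

# A horizontal intensity ramp: all gradient energy points along 0 degrees.
patch = np.tile(np.arange(8, dtype=np.float32), (8, 1))
desc = orientation_histogram(patch)
```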
10. Classical Computer Vision vs Deep Learning—when to use which?
📊 medium
Answer: Use classical methods for well-defined geometry (calibration, stereo with known models), lightweight pipelines, or when data is scarce. Use deep learning for complex appearance-based tasks (detection, segmentation) when you can afford data/compute or can use transfer learning. Production systems often mix both (DL for perception, classical for geometry or post-processing).
11. What does “real-time” mean in Computer Vision?
⚡ easy
Answer: It usually means processing each frame (or batch) within a latency budget—for example 30+ FPS for video, or sub-100 ms for robotics. It depends on the product: mobile AR, autonomous driving, and industrial inspection have different throughput and accuracy tradeoffs. Techniques include smaller models, quantization, TensorRT/ONNX, and ROI cropping.
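Checking a latency budget is straightforward to sketch. Here the "model" is a dummy NumPy op and the 30 FPS budget (~33 ms/frame) is just one illustrative target:

```python
import time
import numpy as np

def run_inference(frame: np.ndarray) -> np.ndarray:
    """Stand-in for a real model call."""
    return frame.mean(axis=2)

budget_ms = 1000.0 / 30.0  # ~33.3 ms per frame for 30 FPS
frame = np.random.randint(0, 256, (480, 640, 3), dtype=np.uint8)

times = []
for _ in range(20):
    t0 = time.perf_counter()
    run_inference(frame)
    times.append((time.perf_counter() - t0) * 1000.0)

avg_ms = sum(times) / len(times)
print(f"avg latency {avg_ms:.2f} ms, budget {budget_ms:.1f} ms, "
      f"{'OK' if avg_ms <= budget_ms else 'over budget'}")
```

In practice you would also report a high percentile (p95/p99), since occasional slow frames matter more than the average for real-time systems.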
12. What is supervised learning in Computer Vision?
⚡ easy
Answer: Training with input-output pairs: images with labels (class), boxes, masks, or keypoints. The model minimizes a loss (e.g. cross-entropy, IoU-based losses) on labeled data. Most detection and segmentation benchmarks are supervised; getting quality labels is often the bottleneck.
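The cross-entropy loss mentioned above, sketched in NumPy for a single classification example:

```python
import numpy as np

def cross_entropy(logits: np.ndarray, label: int) -> float:
    """Softmax cross-entropy for one example, computed in a numerically stable way."""
    shifted = logits - logits.max()                      # avoid exp overflow
    log_probs = shifted - np.log(np.exp(shifted).sum())  # log-softmax
    return float(-log_probs[label])

logits = np.array([2.0, 0.5, 0.1])             # raw scores for 3 classes
loss_correct = cross_entropy(logits, label=0)  # low: the model favors class 0
loss_wrong = cross_entropy(logits, label=2)    # higher: the true class scored low
```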
13. How does overfitting show up in vision models?
📊 medium
Answer: Great training accuracy but poor validation/test performance—memorizing backgrounds, watermarks, or spurious correlations. Mitigations: more diverse data, augmentation, regularization (dropout, weight decay), early stopping, pre-training and fine-tuning, and stronger evaluation (different lighting/scenes).
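Early stopping, one of the mitigations above, is simple to sketch: stop when validation loss has not improved for `patience` epochs. The loss curve here is synthetic, shaped like a typical overfitting run:

```python
def early_stop_epoch(val_losses, patience=3):
    """Return the epoch at which training would stop, or the final epoch."""
    best, best_epoch = float("inf"), 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            return epoch  # no improvement for `patience` epochs
    return len(val_losses) - 1

# Validation loss improves, then creeps up as the model starts overfitting.
curve = [1.0, 0.7, 0.5, 0.45, 0.46, 0.48, 0.52, 0.60]
stop = early_stop_epoch(curve, patience=3)  # stops shortly after the minimum
```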
14. What is data augmentation in CV?
📊 medium
Answer: Applying label-preserving transforms during training: flips, crops, color jitter, rotation, blur, cutout/mixup variants, etc. It artificially expands diversity and reduces overfitting. Augmentations should match deployment conditions (e.g. don't horizontally flip OCR/text images, since mirrored text invalidates the labels).
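Two label-preserving augmentations sketched in NumPy: a random horizontal flip and brightness jitter. Real pipelines usually use a library such as torchvision.transforms or albumentations; this only shows the mechanics.

```python
import numpy as np

rng = np.random.default_rng(0)

def augment(img: np.ndarray) -> np.ndarray:
    """Random horizontal flip + brightness jitter for an H x W x C uint8 image."""
    out = img.astype(np.float32)
    if rng.random() < 0.5:
        out = out[:, ::-1, :]             # flip left-right; a class label is unchanged
    out = out * rng.uniform(0.8, 1.2)     # brightness jitter
    return np.clip(out, 0, 255).astype(np.uint8)

img = rng.integers(0, 256, (4, 4, 3), dtype=np.uint8)
aug = augment(img)  # same shape and same label as the input
```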
15. What is transfer learning in Computer Vision?
📊 medium
Answer: Starting from weights trained on a large source dataset (e.g. ImageNet) and fine-tuning on your smaller target task. Lower layers learn generic edges/textures; upper layers adapt to your classes. It cuts data and training time and is standard for classification and often backbone initialization for detection/segmentation.
16. What is Intersection over Union (IoU)?
📊 medium
Answer: IoU measures overlap between two regions (often bounding boxes or masks): |A ∩ B| / |A ∪ B|. Values range from 0 (no overlap) to 1 (perfect match). It is used to score detections, train some losses, and define “positive” matches (e.g. IoU ≥ 0.5) in mAP evaluation.
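A quick sketch of IoU for axis-aligned boxes given in (x1, y1, x2, y2) form:

```python
def iou(a, b):
    """IoU of two axis-aligned boxes, each given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])   # intersection corners
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

print(iou((0, 0, 2, 2), (1, 1, 3, 3)))  # 1 / 7, about 0.143
print(iou((0, 0, 1, 1), (2, 2, 3, 3)))  # 0.0, no overlap
```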
17. How are precision and recall used in object detection?
🔥 hard
Answer: After matching predictions to ground truth by IoU threshold, precision is TP / (TP + FP)—how many predicted boxes are correct. Recall is TP / (TP + FN)—what fraction of real objects were found. Changing the confidence threshold trades precision vs recall; AP/mAP summarizes this curve across thresholds and classes.
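A minimal precision/recall computation at one confidence threshold, assuming predictions have already been matched against ground truth (each carries a correct/incorrect flag; the IoU matching step itself is elided):

```python
def precision_recall(predictions, num_gt, threshold):
    """predictions: list of (confidence, is_correct); num_gt: ground-truth object count."""
    kept = [correct for conf, correct in predictions if conf >= threshold]
    tp = sum(kept)                 # kept predictions that matched a real object
    fp = len(kept) - tp            # kept predictions that matched nothing
    precision = tp / (tp + fp) if kept else 0.0
    recall = tp / num_gt if num_gt else 0.0
    return precision, recall

preds = [(0.9, True), (0.8, True), (0.6, False), (0.4, True), (0.2, False)]
p_hi, r_hi = precision_recall(preds, num_gt=4, threshold=0.5)  # strict threshold
p_lo, r_lo = precision_recall(preds, num_gt=4, threshold=0.1)  # lenient threshold
```

Lowering the threshold here raises recall (3/4 vs 2/4) while precision drops (0.6 vs 2/3), which is exactly the tradeoff the AP curve summarizes.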
18. When would you use grayscale instead of RGB?
⚡ easy
Answer: When color is irrelevant or noisy—some industrial inspection, edge detection, or text/document pipelines. Grayscale reduces channels (speed, memory). When color carries signal (segmentation by material, medical imaging, traffic signs), keep RGB or an appropriate color space (HSV/LAB) for invariance or interpretability.
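RGB-to-grayscale is just a weighted channel sum; a sketch using the common BT.601 luma weights (the same weighting commonly used by library RGB-to-gray conversions):

```python
import numpy as np

def rgb_to_gray(img: np.ndarray) -> np.ndarray:
    """H x W x 3 RGB uint8 -> H x W grayscale via BT.601 luma weights."""
    weights = np.array([0.299, 0.587, 0.114], dtype=np.float32)
    return np.round(img.astype(np.float32) @ weights).astype(np.uint8)

white = np.full((2, 2, 3), 255, dtype=np.uint8)
gray = rgb_to_gray(white)  # pure white maps to 255 in grayscale
```

Green gets the largest weight because human vision is most sensitive to it; a plain channel mean would work too but matches perception less well.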
19. What is OpenCV used for?
⚡ easy
Answer: OpenCV is a widely used library for image/video I/O, geometric transforms, filtering, feature detection, camera calibration, and some DNN inference helpers. Interviews often expect familiarity with reading images, color conversion, resizing, and basic drawing—for both prototyping and deployment glue code.
import cv2
img = cv2.imread("photo.jpg")  # BGR channel order; returns None if the file can't be read
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
20. Why are GPUs important for Computer Vision?
⚡ easy
Answer: Deep vision models perform massive parallel matrix operations (convolutions, batch training). GPUs offer high throughput for these workloads versus CPUs. For deployment, you might also use specialized accelerators (TPU, NPU) or optimized runtimes; edge devices may use smaller models instead of large GPUs.
CV Basics Interview Cheat Sheet
Pipeline
- Acquire → preprocess
- Model inference
- Post-process & NMS
- Metrics & monitoring
Deep learning
- CNNs & transfer learning
- Augmentation
- IoU / mAP intuition
- Real-time tradeoffs
Tools
- OpenCV / Pillow
- PyTorch & torchvision
- TensorFlow / Keras
- NumPy tensors
💡 Pro tip: For each task, know input/output format, a standard metric, and one real failure mode (lighting, occlusion, domain shift).
Full tutorial track
Pair these interview notes with the step-by-step CV chapters for deeper intuition and code.