Computer Vision Interview: Introduction & Basics
20 essential Q&A
Updated 2026
Short, interview-ready answers on what computer vision is, core tasks, classical vs deep learning, and how real CV systems are built.
~11 min read · 20 questions · Beginner
1. What is Computer Vision?
⚡ easy
Answer: Computer Vision (CV) is a field that builds algorithms and systems to extract meaningful information from images or video—so machines can see, interpret scenes, measure geometry, recognize objects, track motion, or reconstruct 3D structure. It overlaps with image processing, machine learning, robotics, and graphics.
2. How does Computer Vision differ from Image Processing?
⚡ easy
Answer: Image processing focuses on transforming pixel data—filtering, enhancement, resizing, color conversion—often without explicit “understanding” of scene content. Computer vision uses those signals (and learning) to infer semantics (what/where), geometry (depth, pose), or actions. Image processing is often a preprocessing step inside a CV pipeline.
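A minimal illustration of image processing as a preprocessing step: a 3x3 box blur in plain NumPy. This is a sketch for intuition; a real pipeline would use something like cv2.blur or scipy.ndimage instead of explicit loops.

```python
import numpy as np

def box_blur(img: np.ndarray) -> np.ndarray:
    """3x3 mean filter with edge replication; img is a 2-D grayscale array."""
    padded = np.pad(img.astype(np.float32), 1, mode="edge")
    out = np.zeros_like(img, dtype=np.float32)
    for dy in (-1, 0, 1):
        for dx in (-1, 0, 1):
            out += padded[1 + dy : 1 + dy + img.shape[0],
                          1 + dx : 1 + dx + img.shape[1]]
    return out / 9.0

# A single bright "noise" spike gets spread over its neighborhood.
noisy = np.array([[0, 0, 0], [0, 9, 0], [0, 0, 0]], dtype=np.float32)
smooth = box_blur(noisy)
```

Note the blur "understands" nothing about the scene; it only transforms pixels, which is exactly the distinction the answer draws.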
3. How does Computer Vision relate to Machine Learning and AI?
⚡ easy
Answer: AI is the broad goal of intelligent behavior. ML learns patterns from data; modern CV heavily uses ML (especially deep learning) for classification, detection, and segmentation. CV is an application domain: you still need domain choices—camera models, geometry, evaluation for vision tasks—not just generic tabular ML.
4. How is a digital image represented?
⚡ easy
Answer: Typically as a grid of pixels; each pixel stores intensity (grayscale) or multiple channels (e.g. RGB). Values are discrete after sampling (spatial) and quantization (brightness levels). In code this is often a tensor or NumPy array with shape (H, W) or (H, W, C).
# Grayscale: H x W; RGB: H x W x 3
import numpy as np
img = np.zeros((480, 640, 3), dtype=np.uint8)
5. What are common Computer Vision tasks?
⚡ easy
Answer:
- Classification: what object/scene is in the image?
- Detection: what objects and where (bounding boxes)?
- Segmentation: pixel-level regions (semantic/instance)
- Keypoints / pose: landmarks, human pose, faces
- Tracking: same object across video frames
- 3D: depth, stereo, reconstruction, SLAM
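Interviewers often probe whether you know each task's input/output contract. The array names and sizes below are illustrative, not from any specific library; they just show the typical shape of each task's output:

```python
import numpy as np

H, W = 480, 640          # image size
num_classes = 10         # classification labels
num_boxes = 5            # detected objects
num_joints = 17          # pose landmarks (e.g. a common human-pose convention)

class_logits = np.zeros(num_classes)             # classification: one score per class
boxes = np.zeros((num_boxes, 4))                 # detection: (x1, y1, x2, y2) per object
box_scores = np.zeros(num_boxes)                 # plus a confidence per box
sem_mask = np.zeros((H, W), dtype=np.int64)      # semantic segmentation: class id per pixel
keypoints = np.zeros((num_joints, 2))            # keypoints/pose: (x, y) per landmark
track_ids = np.zeros(num_boxes, dtype=np.int64)  # tracking: identity per box across frames
```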
6. What is low-level vs high-level vision?
📊 medium
Answer: Low-level vision works on raw pixels and local structure—edges, textures, optical flow, filtering. Mid-level groups structure into parts or regions. High-level vision reasons about objects, relationships, and scene understanding. Many pipelines stack these stages; deep networks can learn hierarchical features that blur the boundaries.
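A classic low-level operation is edge detection. Here is a sketch of a horizontal Sobel response in plain NumPy (real code would call cv2.Sobel or scipy.ndimage.sobel; the explicit loop is only for clarity):

```python
import numpy as np

def sobel_x(img: np.ndarray) -> np.ndarray:
    """Horizontal-gradient Sobel response on a 2-D grayscale array (valid region only)."""
    kx = np.array([[-1, 0, 1],
                   [-2, 0, 2],
                   [-1, 0, 1]], dtype=np.float32)
    H, W = img.shape
    out = np.zeros((H - 2, W - 2), dtype=np.float32)
    for y in range(H - 2):
        for x in range(W - 2):
            out[y, x] = np.sum(img[y:y + 3, x:x + 3] * kx)
    return out

# A vertical step edge: dark left half, bright right half.
step = np.zeros((5, 6), dtype=np.float32)
step[:, 3:] = 1.0
resp = sobel_x(step)  # strong response near the edge, zero in flat regions
```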
7. What makes Computer Vision hard in the real world?
📊 medium
Answer: Lighting changes, occlusions, clutter, viewpoint and scale variation, motion blur, sensor noise, class imbalance, domain shift (train vs deploy), labeling cost, latency and memory on edge devices, and safety/privacy constraints. Robust systems combine data, augmentation, architecture, and careful evaluation.
8. Why are convolutions central to modern Computer Vision?
📊 medium
Answer: Convolutional layers enforce local connectivity and weight sharing, which matches the spatial structure of images, reduces parameters vs fully connected layers, and builds translation-aware features. Deep CNNs stack convolutions to capture edges → textures → parts → objects.
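The parameter savings from weight sharing are easy to quantify. A back-of-envelope comparison for a single layer on a 224x224 RGB input (the layer sizes are illustrative):

```python
# One 3x3 conv, 3 -> 64 channels: weights + biases.
# Parameter count is independent of image resolution.
conv_params = 3 * 3 * 3 * 64 + 64  # 1,792 parameters

# A fully connected layer mapping the flattened image to a same-resolution
# 64-channel feature map needs one weight per input-output pair.
in_features = 224 * 224 * 3    # 150,528
out_features = 224 * 224 * 64  # 3,211,264
fc_params = in_features * out_features + out_features  # hundreds of billions

print(conv_params, fc_params)
```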
9. What are handcrafted features vs learned features?
📊 medium
Answer: Handcrafted features (SIFT, HOG, Harris, etc.) are engineered descriptors of local image structure, often paired with classical classifiers. Learned features come from neural network weights trained end-to-end on data. Deep learning dominates when large labeled data or pre-trained models exist; classical features remain useful for small data, interpretability, or embedded baselines.
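To make "engineered descriptor" concrete, here is a heavily simplified HOG-style sketch: a magnitude-weighted histogram of gradient orientations over one patch. Real HOG adds cells, block normalization, and other details; this only shows the core idea.

```python
import numpy as np

def orientation_histogram(patch: np.ndarray, bins: int = 9) -> np.ndarray:
    """Magnitude-weighted histogram of unsigned gradient orientations (0-180 deg)."""
    gy, gx = np.gradient(patch.astype(np.float32))
    mag = np.hypot(gx, gy)
    ang = np.rad2deg(np.arctan2(gy, gx)) % 180.0  # fold sign: unsigned orientation
    hist, _ = np.histogram(ang, bins=bins, range=(0.0, 180.0), weights=mag)
    return hist / (np.linalg.norm(hist) + 1e-6)   # L2-normalize the descriptor

# A horizontal intensity ramp: all gradient energy points along 0 degrees.
patch = np.tile(np.arange(8, dtype=np.float32), (8, 1))
desc = orientation_histogram(patch)
```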
10. Classical Computer Vision vs Deep Learning—when to use which?
📊 medium
Answer: Use classical methods for well-defined geometry (calibration, stereo with known models), lightweight pipelines, or when data is scarce. Use deep learning for complex appearance-based tasks (detection, segmentation) when you can afford data/compute or can use transfer learning. Production systems often mix both (DL for perception, classical for geometry or post-processing).
11. What does “real-time” mean in Computer Vision?
⚡ easy
Answer: It usually means processing each frame (or batch) within a latency budget—for example 30+ FPS for video, or sub-100 ms for robotics. It depends on the product: mobile AR, autonomous driving, and industrial inspection have different throughput and accuracy tradeoffs. Techniques include smaller models, quantization, TensorRT/ONNX, and ROI cropping.
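Checking a latency budget is straightforward to sketch. Here the "model" is a dummy NumPy op and the 30 FPS budget (~33 ms/frame) is just one illustrative target:

```python
import time
import numpy as np

def run_inference(frame: np.ndarray) -> np.ndarray:
    """Stand-in for a real model call."""
    return frame.mean(axis=2)

budget_ms = 1000.0 / 30.0  # ~33.3 ms per frame for 30 FPS
frame = np.random.randint(0, 256, (480, 640, 3), dtype=np.uint8)

times = []
for _ in range(20):
    t0 = time.perf_counter()
    run_inference(frame)
    times.append((time.perf_counter() - t0) * 1000.0)

avg_ms = sum(times) / len(times)
print(f"avg latency {avg_ms:.2f} ms, budget {budget_ms:.1f} ms, "
      f"{'OK' if avg_ms <= budget_ms else 'over budget'}")
```

In practice you would also report a high percentile (p95/p99), since occasional slow frames matter more than the average for real-time systems.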
12. What is supervised learning in Computer Vision?
⚡ easy
Answer: Training with input-output pairs: images with labels (class), boxes, masks, or keypoints. The model minimizes a loss (e.g. cross-entropy, IoU-based losses) on labeled data. Most detection and segmentation benchmarks are supervised; getting quality labels is often the bottleneck.
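The cross-entropy loss mentioned above, sketched in NumPy for a single classification example:

```python
import numpy as np

def cross_entropy(logits: np.ndarray, label: int) -> float:
    """Softmax cross-entropy for one example, computed in a numerically stable way."""
    shifted = logits - logits.max()                      # avoid exp overflow
    log_probs = shifted - np.log(np.exp(shifted).sum())  # log-softmax
    return float(-log_probs[label])

logits = np.array([2.0, 0.5, 0.1])             # raw scores for 3 classes
loss_correct = cross_entropy(logits, label=0)  # low: the model favors class 0
loss_wrong = cross_entropy(logits, label=2)    # higher: the true class scored low
```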
13. How does overfitting show up in vision models?
📊 medium
Answer: Great training accuracy but poor validation/test performance—memorizing backgrounds, watermarks, or spurious correlations. Mitigations: more diverse data, augmentation, regularization (dropout, weight decay), early stopping, pre-training and fine-tuning, and stronger evaluation (different lighting/scenes).
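Early stopping, one of the mitigations above, is simple to sketch: stop when validation loss has not improved for `patience` epochs. The loss curve here is synthetic, shaped like a typical overfitting run:

```python
def early_stop_epoch(val_losses, patience=3):
    """Return the epoch at which training would stop, or the final epoch."""
    best, best_epoch = float("inf"), 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            return epoch  # no improvement for `patience` epochs
    return len(val_losses) - 1

# Validation loss improves, then creeps up as the model starts overfitting.
curve = [1.0, 0.7, 0.5, 0.45, 0.46, 0.48, 0.52, 0.60]
stop = early_stop_epoch(curve, patience=3)  # stops shortly after the minimum
```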
14. What is data augmentation in CV?
📊 medium
Answer: Applying label-preserving transforms during training: flips, crops, color jitter, rotation, blur, cutout/mixup variants, etc. It artificially expands diversity and reduces overfitting. Augmentations should match deployment conditions (e.g. don't horizontally flip OCR/text images, since mirrored text invalidates the labels).
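Two label-preserving augmentations sketched in NumPy: a random horizontal flip and brightness jitter. Real pipelines usually use a library such as torchvision.transforms or albumentations; this only shows the mechanics.

```python
import numpy as np

rng = np.random.default_rng(0)

def augment(img: np.ndarray) -> np.ndarray:
    """Random horizontal flip + brightness jitter for an H x W x C uint8 image."""
    out = img.astype(np.float32)
    if rng.random() < 0.5:
        out = out[:, ::-1, :]             # flip left-right; a class label is unchanged
    out = out * rng.uniform(0.8, 1.2)     # brightness jitter
    return np.clip(out, 0, 255).astype(np.uint8)

img = rng.integers(0, 256, (4, 4, 3), dtype=np.uint8)
aug = augment(img)  # same shape and same label as the input
```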
15. What is transfer learning in Computer Vision?
📊 medium
Answer: Starting from weights trained on a large source dataset (e.g. ImageNet) and fine-tuning on your smaller target task. Lower layers learn generic edges/textures; upper layers adapt to your classes. It cuts data and training time and is standard for classification and often backbone initialization for detection/segmentation.
16. What is Intersection over Union (IoU)?
📊 medium
Answer: IoU measures overlap between two regions (often bounding boxes or masks): |A ∩ B| / |A ∪ B|. Values range from 0 (no overlap) to 1 (perfect match). It is used to score detections, train some losses, and define “positive” matches (e.g. IoU ≥ 0.5) in mAP evaluation.
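A quick sketch of IoU for axis-aligned boxes given in (x1, y1, x2, y2) form:

```python
def iou(a, b):
    """IoU of two axis-aligned boxes, each given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])   # intersection corners
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

print(iou((0, 0, 2, 2), (1, 1, 3, 3)))  # 1 / 7, about 0.143
print(iou((0, 0, 1, 1), (2, 2, 3, 3)))  # 0.0, no overlap
```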
17. How are precision and recall used in object detection?
🔥 hard
Answer: After matching predictions to ground truth by IoU threshold, precision is TP / (TP + FP)—how many predicted boxes are correct. Recall is TP / (TP + FN)—what fraction of real objects were found. Changing the confidence threshold trades precision vs recall; AP/mAP summarizes this curve across thresholds and classes.
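A minimal precision/recall computation at one confidence threshold, assuming predictions have already been matched against ground truth (each carries a correct/incorrect flag; the IoU matching step itself is elided):

```python
def precision_recall(predictions, num_gt, threshold):
    """predictions: list of (confidence, is_correct); num_gt: ground-truth object count."""
    kept = [correct for conf, correct in predictions if conf >= threshold]
    tp = sum(kept)                 # kept predictions that matched a real object
    fp = len(kept) - tp            # kept predictions that matched nothing
    precision = tp / (tp + fp) if kept else 0.0
    recall = tp / num_gt if num_gt else 0.0
    return precision, recall

preds = [(0.9, True), (0.8, True), (0.6, False), (0.4, True), (0.2, False)]
p_hi, r_hi = precision_recall(preds, num_gt=4, threshold=0.5)  # strict threshold
p_lo, r_lo = precision_recall(preds, num_gt=4, threshold=0.1)  # lenient threshold
```

Lowering the threshold here raises recall (3/4 vs 2/4) while precision drops (0.6 vs 2/3), which is exactly the tradeoff the AP curve summarizes.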
18. When would you use grayscale instead of RGB?
⚡ easy
Answer: When color is irrelevant or noisy—some industrial inspection, edge detection, or text/document pipelines. Grayscale reduces channels (speed, memory). When color carries signal (segmentation by material, medical imaging, traffic signs), keep RGB or an appropriate color space (HSV/LAB) for invariance or interpretability.
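RGB-to-grayscale is just a weighted channel sum; a sketch using the common BT.601 luma weights (the same weighting commonly used by library RGB-to-gray conversions):

```python
import numpy as np

def rgb_to_gray(img: np.ndarray) -> np.ndarray:
    """H x W x 3 RGB uint8 -> H x W grayscale via BT.601 luma weights."""
    weights = np.array([0.299, 0.587, 0.114], dtype=np.float32)
    return np.round(img.astype(np.float32) @ weights).astype(np.uint8)

white = np.full((2, 2, 3), 255, dtype=np.uint8)
gray = rgb_to_gray(white)  # pure white maps to 255 in grayscale
```

Green gets the largest weight because human vision is most sensitive to it; a plain channel mean would work too but matches perception less well.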
19. What is OpenCV used for?
⚡ easy
Answer: OpenCV is a widely used library for image/video I/O, geometric transforms, filtering, feature detection, camera calibration, and some DNN inference helpers. Interviews often expect familiarity with reading images, color conversion, resizing, and basic drawing—for both prototyping and deployment glue code.
import cv2
img = cv2.imread("photo.jpg")  # BGR channel order; returns None if the file can't be read
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
20. Why are GPUs important for Computer Vision?
⚡ easy
Answer: Deep vision models perform massive parallel matrix operations (convolutions, batch training). GPUs offer high throughput for these workloads versus CPUs. For deployment, you might also use specialized accelerators (TPU, NPU) or optimized runtimes; edge devices may use smaller models instead of large GPUs.
CV Basics Interview Cheat Sheet
Pipeline
- Acquire → preprocess
- Model inference
- Post-process & NMS
- Metrics & monitoring
Deep learning
- CNNs & transfer learning
- Augmentation
- IoU / mAP intuition
- Real-time tradeoffs
Tools
- OpenCV / Pillow
- PyTorch & torchvision
- TensorFlow / Keras
- NumPy tensors
💡 Pro tip: For each task, know input/output format, a standard metric, and one real failure mode (lighting, occlusion, domain shift).
Full tutorial track
Pair these interview notes with the step-by-step CV chapters for deeper intuition and code.