Computer Vision Interview
40 Q&A
Chapter 1
Introduction to Computer Vision — Interview Q&A
What computer vision is, how it differs from image processing, history, applications, and digital image basics—pixels, channels, resolution, and Python loading examples.
Computer Vision Interview: Introduction & Basics
1
What is Computer Vision?
⚡ easy
Answer: Computer Vision (CV) is a field that builds algorithms and systems to extract meaningful information from images or video—so machines can see, interpret scenes, measure geometry, recognize objects, track motion, or reconstruct 3D structure. It overlaps with image processing, machine learning, robotics, and graphics.
2
How does Computer Vision differ from Image Processing?
⚡ easy
Answer: Image processing focuses on transforming pixel data—filtering, enhancement, resizing, color conversion—often without explicit “understanding” of scene content. Computer vision uses those signals (and learning) to infer semantics (what/where), geometry (depth, pose), or actions. Image processing is often a preprocessing step inside a CV pipeline.
3
How does Computer Vision relate to Machine Learning and AI?
⚡ easy
Answer: AI is the broad goal of intelligent behavior. ML learns patterns from data; modern CV heavily uses ML (especially deep learning) for classification, detection, and segmentation. CV is an application domain: you still need domain choices—camera models, geometry, evaluation for vision tasks—not just generic tabular ML.
4
How is a digital image represented?
⚡ easy
Answer: Typically as a grid of pixels; each pixel stores intensity (grayscale) or multiple channels (e.g. RGB). Values are discrete after sampling (spatial) and quantization (brightness levels). In code this is often a tensor or NumPy array with shape (H, W) or (H, W, C).
# Grayscale: H x W; RGB: H x W x 3
import numpy as np
img = np.zeros((480, 640, 3), dtype=np.uint8)
5
What are common Computer Vision tasks?
⚡ easy
Answer:
- Classification: what object/scene is in the image?
- Detection: what objects and where (bounding boxes)?
- Segmentation: pixel-level regions (semantic/instance)
- Keypoints / pose: landmarks, human pose, faces
- Tracking: same object across video frames
- 3D: depth, stereo, reconstruction, SLAM
6
What is low-level vs high-level vision?
📊 medium
Answer: Low-level vision works on raw pixels and local structure—edges, textures, optical flow, filtering. Mid-level groups structure into parts or regions. High-level vision reasons about objects, relationships, and scene understanding. Many pipelines stack these stages; deep networks can learn hierarchical features that blur the boundaries.
7
What makes Computer Vision hard in the real world?
📊 medium
Answer: Lighting changes, occlusions, clutter, viewpoint and scale variation, motion blur, sensor noise, class imbalance, domain shift (train vs deploy), labeling cost, latency and memory on edge devices, and safety/privacy constraints. Robust systems combine data, augmentation, architecture, and careful evaluation.
8
Why are convolutions central to modern Computer Vision?
📊 medium
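Answer: Convolutional layers enforce local connectivity and weight sharing, which matches the spatial structure of images and needs far fewer parameters than fully connected layers. Because the same kernel is applied at every position, the features are translation-equivariant (up to stride and border effects): shifting the input shifts the feature map by the same amount. Deep CNNs stack convolutions to capture edges → textures → parts → objects.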
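A rough parameter count makes the weight-sharing point concrete. The sketch below assumes a 224×224 RGB input, a 3×3 convolution with 64 filters, and a hypothetical fully connected layer that produces a feature map of the same size; the numbers are illustrative and not taken from any particular model.

```python
# Illustrative sizes (assumptions, not from a specific architecture)
H, W, C_in = 224, 224, 3
C_out, K = 64, 3

# Conv layer: one K x K x C_in kernel per output channel, plus one bias each
conv_params = C_out * (K * K * C_in + 1)

# Fully connected layer connecting every input value to every value of a
# same-sized output feature map (H x W x C_out), plus one bias per output
fc_inputs = H * W * C_in
fc_outputs = H * W * C_out
fc_params = fc_inputs * fc_outputs + fc_outputs

print("conv parameters:", conv_params)
print("fc / conv ratio:", fc_params // conv_params)
```

The convolution reuses the same small kernel at every position, so its parameter count does not depend on the image size at all.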
9
What are handcrafted features vs learned features?
📊 medium
Answer: Handcrafted features (SIFT, HOG, Harris, etc.) are engineered descriptors of local image structure, often paired with classical classifiers. Learned features come from neural network weights trained end-to-end on data. Deep learning dominates when large labeled data or pre-trained models exist; classical features remain useful for small data, interpretability, or embedded baselines.
10
Classical Computer Vision vs Deep Learning—when to use which?
📊 medium
Answer: Use classical methods for well-defined geometry (calibration, stereo with known models), lightweight pipelines, or when data is scarce. Use deep learning for complex appearance-based tasks (detection, segmentation) when you can afford data/compute or can use transfer learning. Production systems often mix both (DL for perception, classical for geometry or post-processing).
11
What does “real-time” mean in Computer Vision?
⚡ easy
Answer: It usually means processing each frame (or batch) within a latency budget, for example 30+ FPS for video (about 33 ms per frame) or sub-100 ms for robotics. It depends on the product: mobile AR, autonomous driving, and industrial inspection have different throughput and accuracy tradeoffs. Techniques include smaller models, quantization, optimized runtimes such as TensorRT or ONNX Runtime, and ROI cropping.
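One way to reason about a real-time target is to turn the frame rate into a per-frame time budget and compare it with a measured processing time. This is a minimal sketch; `process_frame` is a hypothetical stand-in for a real model or pipeline step, and the helper names are made up for illustration.

```python
import time

def frame_budget_ms(fps):
    """Per-frame time budget in milliseconds for a target frame rate."""
    return 1000.0 / fps

def measure_ms(fn, frame, runs=20):
    """Average wall-clock time of fn(frame) in milliseconds over several runs."""
    start = time.perf_counter()
    for _ in range(runs):
        fn(frame)
    return (time.perf_counter() - start) * 1000.0 / runs

def process_frame(frame):
    # Hypothetical placeholder for real per-frame work
    return sum(frame)

budget = frame_budget_ms(30)
elapsed = measure_ms(process_frame, list(range(1000)))
print(f"budget {budget:.1f} ms, measured {elapsed:.3f} ms, ok={elapsed <= budget}")
```

In practice you would measure the whole pipeline (decode, preprocessing, inference, post-processing) and look at worst-case latency, not only the average.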
12
What is supervised learning in Computer Vision?
⚡ easy
Answer: Training with input-output pairs: images with labels (class), boxes, masks, or keypoints. The model minimizes a loss (e.g. cross-entropy, IoU-based losses) on labeled data. Most detection and segmentation benchmarks are supervised; getting quality labels is often the bottleneck.
13
How does overfitting show up in vision models?
📊 medium
Answer: Great training accuracy but poor validation/test performance—memorizing backgrounds, watermarks, or spurious correlations. Mitigations: more diverse data, augmentation, regularization (dropout, weight decay), early stopping, pre-training and fine-tuning, and stronger evaluation (different lighting/scenes).
14
What is data augmentation in CV?
📊 medium
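Answer: Applying random transforms during training that keep the label valid or update it consistently: flips, crops, color jitter, rotation, blur, cutout, and so on. Geometric transforms must also be applied to boxes, masks, and keypoints, and mixup-style methods blend the labels along with the images. Augmentation artificially expands diversity and reduces overfitting. Augmentations should match deployment conditions (e.g. don’t horizontally flip OCR text images, because mirrored characters no longer match their labels).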
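For detection data, a geometric augmentation has to move the annotations together with the pixels. The sketch below shows a horizontal flip in NumPy; it assumes boxes are stored as (x_min, y_min, x_max, y_max) in pixel units with x_max as the exclusive right edge, and the function name is made up for illustration.

```python
import numpy as np

def hflip_with_boxes(img, boxes):
    """Flip an (H, W, C) image left-right and mirror its boxes.

    boxes: array-like of (x_min, y_min, x_max, y_max) in pixel units.
    """
    w = img.shape[1]
    flipped = img[:, ::-1].copy()
    boxes = np.asarray(boxes, dtype=np.float32)
    out = boxes.copy()
    out[:, 0] = w - boxes[:, 2]  # new x_min comes from the old x_max
    out[:, 2] = w - boxes[:, 0]  # new x_max comes from the old x_min
    return flipped, out

img = np.zeros((4, 10, 3), dtype=np.uint8)
img[:, 0] = 255  # bright column on the left edge
flipped, new_boxes = hflip_with_boxes(img, [[0, 0, 3, 4]])
print(new_boxes)  # the box now hugs the right edge
```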
15
What is transfer learning in Computer Vision?
📊 medium
Answer: Starting from weights trained on a large source dataset (e.g. ImageNet) and fine-tuning on your smaller target task. Lower layers learn generic edges/textures; upper layers adapt to your classes. It cuts data and training time and is standard for classification and often backbone initialization for detection/segmentation.
16
What is Intersection over Union (IoU)?
📊 medium
Answer: IoU measures overlap between two regions (often bounding boxes or masks): |A ∩ B| / |A ∪ B|. Values range from 0 (no overlap) to 1 (perfect match). It is used to score detections, train some losses, and define “positive” matches (e.g. IoU ≥ 0.5) in mAP evaluation.
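A minimal sketch of the formula for axis-aligned boxes, assuming each box is given as (x_min, y_min, x_max, y_max) in continuous coordinates; the function name is illustrative.

```python
def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x_min, y_min, x_max, y_max)."""
    # Corners of the intersection rectangle
    ix_min, iy_min = max(a[0], b[0]), max(a[1], b[1])
    ix_max, iy_max = min(a[2], b[2]), min(a[3], b[3])
    # Width and height are clamped at zero when the boxes do not overlap
    inter = max(0.0, ix_max - ix_min) * max(0.0, iy_max - iy_min)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

print(box_iou((0, 0, 10, 10), (5, 5, 15, 15)))  # 25 / 175, about 0.143
```

Note how quickly IoU drops: two equal boxes that overlap by a quarter of their area score only about 0.14.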
17
How are precision and recall used in object detection?
🔥 hard
Answer: After matching predictions to ground truth by IoU threshold, precision is TP / (TP + FP)—how many predicted boxes are correct. Recall is TP / (TP + FN)—what fraction of real objects were found. Changing the confidence threshold trades precision vs recall; AP/mAP summarizes this curve across thresholds and classes.
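The tradeoff can be sketched with a toy example. It assumes IoU matching has already been done, so each detection carries a confidence score and a true-positive flag; the data and the function name are made up for illustration, and returning precision 1.0 when nothing is predicted is just one possible convention.

```python
def precision_recall(detections, num_gt, score_thresh):
    """detections: list of (score, is_true_positive) after IoU matching."""
    kept = [is_tp for score, is_tp in detections if score >= score_thresh]
    tp = sum(kept)
    fp = len(kept) - tp
    fn = num_gt - tp
    precision = tp / (tp + fp) if kept else 1.0  # convention for "no predictions"
    recall = tp / (tp + fn) if num_gt else 0.0
    return precision, recall

# Toy detections for an image set with 4 ground-truth objects
dets = [(0.9, True), (0.8, True), (0.6, False), (0.4, True), (0.3, False)]
for t in (0.7, 0.5, 0.2):
    p, r = precision_recall(dets, num_gt=4, score_thresh=t)
    print(f"threshold {t}: precision {p:.2f}, recall {r:.2f}")
```

Lowering the threshold keeps more boxes, so recall rises while precision tends to fall; AP is the area under the resulting precision-recall curve.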
18
When would you use grayscale instead of RGB?
⚡ easy
Answer: When color is irrelevant or noisy, as in some industrial inspection, edge detection, or text/document pipelines. Grayscale cuts the channel count from three to one, which saves memory and compute. When color carries signal (segmentation by material, medical imaging, traffic signs), keep RGB or convert to a color space such as HSV or LAB that separates brightness from color, which helps with lighting invariance and interpretability.
19
What is OpenCV used for?
⚡ easy
Answer: OpenCV is a widely used library for image/video I/O, geometric transforms, filtering, feature detection, camera calibration, and some DNN inference helpers. Interviews often expect familiarity with reading images, color conversion, resizing, and basic drawing—for both prototyping and deployment glue code.
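import cv2
img = cv2.imread("photo.jpg")  # BGR channel order; returns None if the file cannot be read
if img is None:
    raise FileNotFoundError("photo.jpg")
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)  # single-channel (H, W) image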
20
Why are GPUs important for Computer Vision?
⚡ easy
Answer: Deep vision models perform massive parallel matrix operations (convolutions, batch training). GPUs offer high throughput for these workloads versus CPUs. For deployment, you might also use specialized accelerators (TPU, NPU) or optimized runtimes; edge devices may use smaller models instead of large GPUs.
Image Processing Basics: 20 Essential Q&A
21
What is a digital image in computer vision?
⚡ easy
Answer: A 2D (or 2D+channels) grid of samples where each cell is a pixel storing numeric intensity or color. It is a discrete approximation of a continuous scene after capture by a sensor and analog-to-digital conversion.
22
What is a pixel?
⚡ easy
Answer: The smallest addressable element of a raster image. Each pixel holds one or more values (e.g. gray level or R,G,B). Spatially, pixels sit on a regular grid; physically, they correspond to sensor photosites plus processing (demosaicing for color cameras).
23
Explain sampling and quantization.
📊 medium
Answer: Sampling chooses discrete spatial locations (grid resolution). Quantization maps continuous intensity to finite levels (bit depth). Together they convert a continuous image to digital form and introduce spatial and intensity approximation error.
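Both steps can be imitated on an array that is already digital. The sketch below assumes a float image in [0, 1]; `quantize` and `subsample` are illustrative helper names, and plain striding is used only to show a coarser sampling grid (it applies no anti-alias filter).

```python
import numpy as np

def quantize(img01, bits):
    """Quantize a float image in [0, 1] to 2**bits uniformly spaced levels."""
    levels = 2 ** bits
    return np.round(img01 * (levels - 1)) / (levels - 1)

def subsample(img, step):
    """Keep every step-th pixel in both directions (coarser sampling grid)."""
    return img[::step, ::step]

ramp = np.linspace(0.0, 1.0, 256).reshape(1, -1)  # smooth gradient
coarse = quantize(ramp, bits=2)                   # only 4 gray levels remain

print("levels:", np.unique(coarse))
print("max quantization error:", np.abs(coarse - ramp).max())
print("subsampled shape:", subsample(np.zeros((8, 8)), 2).shape)
```

With uniform levels, the worst-case quantization error is half the spacing between adjacent levels.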
24
What is image resolution?
⚡ easy
Answer: Usually the grid size width × height in pixels (e.g. 1920×1080). Higher resolution preserves finer detail but costs memory and compute. Aspect ratio is width/height; changing resolution without preserving ratio stretches content.
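To resize without stretching, scale both sides by the same factor. A minimal sketch, assuming the goal is to fit an image inside a maximum box; `fit_within` is a hypothetical helper name.

```python
def fit_within(width, height, max_w, max_h):
    """Largest (width, height) that fits in max_w x max_h at the same aspect ratio."""
    scale = min(max_w / width, max_h / height)
    return max(1, round(width * scale)), max(1, round(height * scale))

print(fit_within(1920, 1080, 640, 640))  # (640, 360): still 16:9
```

The remaining area is usually padded (letterboxing) when a model expects a fixed square input.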
25
What are color channels?
⚡ easy
Answer: Separate 2D arrays (or stacked planes) per color component—commonly R, G, B for display. Grayscale has one channel. Multispectral/hyperspectral images have many bands beyond visible RGB.
26
How is grayscale often computed from RGB?
⚡ easy
Answer: A weighted sum approximating luminance, e.g. 0.299R + 0.587G + 0.114B (ITU-R BT.601) or simpler averages for rough work. Weights reflect human sensitivity to green; the exact formula depends on standard and use case.
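The BT.601 weighted sum can be written directly in NumPy. This sketch assumes a uint8 image in R, G, B channel order (an image loaded with OpenCV would be in B, G, R order and need the weights reversed); the function name is illustrative.

```python
import numpy as np

# BT.601 luma weights for R, G, B
BT601 = np.array([0.299, 0.587, 0.114], dtype=np.float32)

def rgb_to_gray(rgb):
    """rgb: (H, W, 3) uint8 in R, G, B order -> (H, W) uint8 grayscale."""
    gray = rgb.astype(np.float32) @ BT601  # weighted sum over the channel axis
    return np.clip(np.round(gray), 0, 255).astype(np.uint8)

pixels = np.array([[[255, 0, 0], [0, 255, 0], [0, 0, 255], [255, 255, 255]]],
                  dtype=np.uint8)
print(rgb_to_gray(pixels))  # green maps to the brightest gray of the three primaries
```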
27
What is bit depth? Why does it matter?
📊 medium
Answer: Bits per channel (e.g. 8-bit → 256 levels). Higher depth reduces banding and helps medical/raw workflows; 8-bit uint is standard for web and many CV datasets. HDR may use 16/32-bit float linear pipelines before tone mapping.
28
How are pixel coordinates usually indexed?
⚡ easy
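Answer: Often (row, col) or (y, x) with the origin at the top-left and rows increasing downward, matching matrix indexing in NumPy (img[y, x]). Note that OpenCV functions that take points or sizes, such as drawing calls and cv2.resize, expect (x, y) or (width, height) order. Be careful when converting to math coordinates, where y may increase upward.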
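A small NumPy sketch of the indexing order; the sizes and the point are arbitrary values chosen for illustration.

```python
import numpy as np

img = np.zeros((480, 640), dtype=np.uint8)  # H=480 rows, W=640 columns
x, y = 100, 50                              # a point given as (x, y)
img[y, x] = 255                             # NumPy index is [row, col] = [y, x]

rows, cols = np.nonzero(img)
print(rows[0], cols[0])                     # row (y) first, then column (x)

# Converting to a "math" convention with the origin at the bottom-left
y_up = img.shape[0] - 1 - y
print(y_up)
```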
29
What does tensor shape (H, W, C) mean?
📊 medium
Answer: Height (rows), width (columns), channels—typical for NumPy/OpenCV images. PyTorch often uses (N, C, H, W) for batches. Interviews check you can transpose between layouts without mixing H/W.
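The layout change is a transpose of axes, not a reshape. A minimal NumPy sketch, assuming a single 480×640 RGB image; the same axis order applies when converting to a PyTorch tensor.

```python
import numpy as np

hwc = np.arange(480 * 640 * 3, dtype=np.int64).reshape(480, 640, 3)  # (H, W, C)

chw = np.transpose(hwc, (2, 0, 1))    # (C, H, W): channel axis moved to the front
nchw = chw[np.newaxis]                # (N, C, H, W) with a batch of one

back = np.transpose(nchw[0], (1, 2, 0))  # back to (H, W, C)

print(chw.shape, nchw.shape, back.shape)
```

Using `reshape` instead of `transpose` here would keep the same shape numbers but scramble the pixels, which is the mistake this question is probing for.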
30
Raster vs vector graphics?
⚡ easy
Answer: Raster: pixel grid (photos, textures). Vector: curves/paths (SVG, fonts) that are resolution-independent until rasterized. CV pipelines usually consume raster tensors; vector assets are rasterized for learning.
31
When choose JPEG vs PNG?
⚡ easy
Answer: JPEG: photos, smaller files, lossy, poor for sharp edges/text. PNG: lossless, supports transparency, good for screenshots and graphics. In ML pipelines, avoid re-saving images as JPEG repeatedly: each save recompresses, and the accumulated artifacts distort edges and noise statistics.
32
What problems can lossy compression cause for CV?
📊 medium
Answer: Blocking, ringing, color bleeding—especially around edges. Models may overfit artifact patterns. For training data, prefer lossless or high-quality JPEG; for deployment, know your camera/codec pipeline.
33
What is aliasing when downsampling?
📊 medium
Answer: When you shrink an image without low-pass filtering, detail above the new Nyquist limit folds into low frequencies and shows up as moiré or jaggies. Fix: blur then downsample, or use a resampling method that averages source pixels (cv2.INTER_AREA for downscaling in OpenCV).
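An extreme case shows the effect in a few lines of NumPy. The sketch assumes the finest possible vertical stripes and compares plain striding with a 2×2 block mean, which serves here as a simple stand-in for low-pass filtering followed by downsampling.

```python
import numpy as np

# Finest possible vertical stripes: columns alternate 0, 255, 0, 255, ...
img = np.tile(np.array([0, 255], dtype=np.float32), (8, 4))  # shape (8, 8)

# Naive downsampling keeps only the even columns, so the stripes vanish
naive = img[::2, ::2]

# Averaging each 2x2 block before keeping one value preserves mean brightness
h, w = img.shape
area = img.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

print(naive[0])  # every sample landed on a dark column
print(area[0])   # mid-gray, the true average of the stripes
```

The naive result claims the image is black, which is an alias of the original pattern rather than a faithful low-resolution version of it.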
34
Nearest-neighbor vs bilinear interpolation?
📊 medium
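Answer: Nearest: fast, blocky, preserves original values (useful for label masks). Bilinear: blends the 4 nearest neighbors, better for resizing/rotation but blurs fine detail. Bicubic uses a 4×4 neighborhood and usually keeps edges sharper than bilinear, at higher cost and with possible overshoot near strong edges. The choice affects augmentation and geometric transforms.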
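The difference is easiest to see when sampling a single fractional location. This is a minimal sketch for a 2D grayscale array with coordinates assumed to lie inside the image; the function names are illustrative, and library resizers add details such as pixel-center alignment that are ignored here.

```python
import numpy as np

def sample_nearest(img, y, x):
    """Value of the closest pixel to the fractional location (y, x)."""
    return img[int(round(y)), int(round(x))]

def sample_bilinear(img, y, x):
    """Distance-weighted blend of the 4 pixels surrounding (y, x)."""
    y0, x0 = int(np.floor(y)), int(np.floor(x))
    y1 = min(y0 + 1, img.shape[0] - 1)
    x1 = min(x0 + 1, img.shape[1] - 1)
    dy, dx = y - y0, x - x0
    top = (1 - dx) * img[y0, x0] + dx * img[y0, x1]
    bottom = (1 - dx) * img[y1, x0] + dx * img[y1, x1]
    return (1 - dy) * top + dy * bottom

img = np.array([[0.0, 100.0],
                [200.0, 300.0]])
print(sample_nearest(img, 0.4, 0.6))   # picks an existing pixel value
print(sample_bilinear(img, 0.5, 0.5))  # average of all four pixels
```

Nearest-neighbor never invents new values, which is why it is the usual choice for resizing class-index masks.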
35
Typical dtypes for images in NumPy?
⚡ easy
Answer: uint8 [0,255] most common. Float images may be [0,1] or [0,255] depending on library—always normalize consistently before math or neural nets.
import numpy as np
img = np.zeros((480, 640, 3), dtype=np.uint8) # H,W,C
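One reason to convert before doing math is that uint8 arithmetic wraps around silently. A short sketch with made-up pixel values:

```python
import numpy as np

a = np.array([200, 250], dtype=np.uint8)
b = np.array([100, 10], dtype=np.uint8)

wrapped = a + b  # uint8 arithmetic wraps modulo 256 instead of saturating

# Widen the type first, then clip back into the valid range
safe = np.clip(a.astype(np.int16) + b, 0, 255).astype(np.uint8)

# Common float convention for neural networks: scale to [0, 1]
as_float = a.astype(np.float32) / 255.0

print(wrapped, safe, as_float)
```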
36
Why does OpenCV use BGR?
⚡ easy
Answer: Historical reasons; imread returns BGR order. Convert to RGB for matplotlib or PIL-centric code: cv2.cvtColor(img, cv2.COLOR_BGR2RGB). Mixing orders is a common interview “debugging” trap.
import cv2
bgr = cv2.imread('x.jpg')
rgb = cv2.cvtColor(bgr, cv2.COLOR_BGR2RGB)
37
What is the alpha channel?
⚡ easy
Answer: Per-pixel opacity for compositing (RGBA). Not always present. When loading to 3-channel models, you often drop alpha or premultiply RGB depending on graphics pipeline.
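Both options can be sketched in NumPy. The example assumes a uint8 RGBA array with straight (non-premultiplied) alpha; the function names are made up for illustration.

```python
import numpy as np

def drop_alpha(rgba):
    """Keep only the color channels of an (H, W, 4) image."""
    return rgba[..., :3]

def premultiply(rgba):
    """Scale RGB by opacity so fully transparent pixels become black."""
    rgb = rgba[..., :3].astype(np.float32)
    alpha = rgba[..., 3:4].astype(np.float32) / 255.0
    return np.round(rgb * alpha).astype(np.uint8)

rgba = np.array([[[255, 0, 0, 255],    # opaque red
                  [255, 0, 0, 128],    # half-transparent red
                  [255, 0, 0, 0]]],    # fully transparent red
                dtype=np.uint8)

print(drop_alpha(rgba)[0])   # all three pixels look identical
print(premultiply(rgba)[0])  # opacity is now reflected in the color values
```

Dropping alpha can expose whatever color an editor left behind in transparent regions, so it is worth checking a few samples before choosing.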
38
What does an image histogram show?
📊 medium
Answer: The distribution of pixel intensities (per channel or gray). Useful for exposure diagnosis, thresholding intuition, and contrast enhancement—foundation for histogram equalization (covered in later chapters).
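For an 8-bit grayscale image the histogram is just a count per intensity value. The sketch below uses a synthetic, deliberately dark image (an assumption for illustration) and also computes the cumulative distribution that histogram equalization builds on.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic underexposed image: all values fall in the darkest quarter of the range
gray = rng.integers(0, 64, size=(100, 100), dtype=np.uint8)

hist = np.bincount(gray.ravel(), minlength=256)  # one bin per intensity 0..255
cdf = hist.cumsum() / hist.sum()                 # fraction of pixels <= each level

print("pixels counted:", hist.sum())
print("pixels at or above level 64:", hist[64:].sum())
print("cdf at level 63:", cdf[63])
```

A histogram squeezed into one end of the range, as here, is the typical signature of under- or overexposure.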
39
How does a video relate to images?
⚡ easy
Answer: A sequence of frames (2D images) sampled in time with a frame rate (FPS). Temporal redundancy enables compression and tracking; many CV models treat frames independently at first.
40
What is EXIF metadata?
⚡ easy
Answer: Embedded tags in JPEG/TIFF: orientation, camera settings, timestamp, GPS. The orientation tag can rotate images—some loaders ignore it, causing inconsistent training data; preprocess to canonical orientation.