Introduction to Computer Vision — Interview Q&A

Question 1

1 What is Computer Vision? ⚡ easy

Answer

Answer: Computer Vision (CV) is a field that builds algorithms and systems to extract meaningful information from images or video—so machines can see, interpret scenes, measure geometry, recognize objects, track motion, or reconstruct 3D structure. It overlaps with image processing, machine learning, robotics, and graphics.

Question 2

2 How does Computer Vision differ from Image Processing? ⚡ easy

Answer

Answer: Image processing focuses on transforming pixel data—filtering, enhancement, resizing, color conversion—often without explicit “understanding” of scene content. Computer vision uses those signals (and learning) to infer semantics (what/where), geometry (depth, pose), or actions. Image processing is often a preprocessing step inside a CV pipeline.

Question 3

3 How does Computer Vision relate to Machine Learning and AI? ⚡ easy

Answer

Answer: AI is the broad goal of intelligent behavior. ML learns patterns from data; modern CV relies heavily on ML (especially deep learning) for classification, detection, and segmentation. CV is an application domain of both: beyond generic ML, it requires vision-specific choices such as camera models, geometry, and task-appropriate evaluation.

Question 4

4 How is a digital image represented? ⚡ easy

Answer

Answer: Typically as a grid of pixels; each pixel stores intensity (grayscale) or multiple channels (e.g. RGB). Values are discrete after sampling (spatial) and quantization (brightness levels). In code this is often a tensor or NumPy array with shape (H, W) or (H, W, C).

Question 5

5 What are common Computer Vision tasks? ⚡ easy

Answer

Answer:

Classification: what object/scene is in the image?
Detection: what objects and where (bounding boxes)?
Segmentation: pixel-level regions (semantic/instance)
Keypoints / pose: landmarks, human pose, faces
Tracking: same object across video frames
3D: depth, stereo, reconstruction, SLAM

Question 6

6 What is low-level vs high-level vision? 📊 medium

Answer

Answer: Low-level vision works on raw pixels and local structure—edges, textures, optical flow, filtering. Mid-level groups structure into parts or regions. High-level vision reasons about objects, relationships, and scene understanding. Many pipelines stack these stages; deep networks can learn hierarchical features that blur the boundaries.

Question 7

7 What makes Computer Vision hard in the real world? 📊 medium

Answer

Answer: Lighting changes, occlusions, clutter, viewpoint and scale variation, motion blur, sensor noise, class imbalance, domain shift (train vs deploy), labeling cost, latency and memory on edge devices, and safety/privacy constraints. Robust systems combine data, augmentation, architecture, and careful evaluation.

Question 8

8 Why are convolutions central to modern Computer Vision? 📊 medium

Answer

Answer: Convolutional layers enforce local connectivity and weight sharing, which matches the spatial structure of images, reduces parameters compared with fully connected layers, and yields translation-equivariant features (a shifted input produces a correspondingly shifted feature map). Deep CNNs stack convolutions to capture edges → textures → parts → objects.
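
To make the parameter saving concrete, here is a rough count. It assumes a 224×224 RGB input, 64 output units or filters, and a 3×3 kernel; these sizes and the helper names are illustrative choices, not taken from any specific model.

```python
def dense_params(h, w, c_in, units):
    """Weights + biases of a fully connected layer on a flattened H x W x C image."""
    return h * w * c_in * units + units


def conv_params(k, c_in, c_out):
    """Weights + biases of a k x k convolution; independent of image size."""
    return k * k * c_in * c_out + c_out


# Illustrative sizes (assumptions): 224x224 RGB input, 64 outputs, 3x3 kernel.
fc = dense_params(224, 224, 3, 64)
conv = conv_params(3, 3, 64)

print(f"fully connected: {fc:,} parameters")
print(f"3x3 convolution: {conv:,} parameters")
```

Because the same small kernel is reused at every location, the convolution's parameter count does not grow with image resolution, while the dense layer's does.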

Question 9

9 What are handcrafted features vs learned features? 📊 medium

Answer

Answer: Handcrafted features (SIFT, HOG, Harris corners, etc.) are engineered detectors and descriptors of local image structure, often paired with classical classifiers. Learned features come from neural network weights trained end-to-end on data. Deep learning dominates when large labeled data or pre-trained models exist; classical features remain useful for small data, interpretability, or embedded baselines.

Question 10

10 Classical Computer Vision vs Deep Learning—when to use which? 📊 medium

Answer

Answer: Use classical methods for well-defined geometry (calibration, stereo with known models), lightweight pipelines, or when data is scarce. Use deep learning for complex appearance-based tasks (detection, segmentation) when you can afford data/compute or can use transfer learning. Production systems often mix both (DL for perception, classical for geometry or post-processing).

Question 11

11 What does “real-time” mean in Computer Vision? ⚡ easy

Answer

Answer: It usually means processing each frame (or batch) within a latency budget, for example at least 30 FPS for video (about 33 ms per frame) or under 100 ms end to end for robotics. The budget depends on the product: mobile AR, autonomous driving, and industrial inspection have different throughput and accuracy tradeoffs. Techniques include smaller models, quantization, optimized runtimes (TensorRT, ONNX-based runtimes), and ROI cropping.

Question 12

12 What is supervised learning in Computer Vision? ⚡ easy

Answer

Answer: Training with input-output pairs: images paired with class labels, boxes, masks, or keypoints. The model minimizes a loss (e.g. cross-entropy, IoU-based losses) on labeled data. Most detection and segmentation benchmarks are supervised; getting quality labels is often the bottleneck.

Question 13

13 How does overfitting show up in vision models? 📊 medium

Answer

Answer: High training accuracy but poor validation/test performance, often because the model memorizes backgrounds, watermarks, or other spurious correlations. Mitigations: more diverse data, augmentation, regularization (dropout, weight decay), early stopping, pre-training and fine-tuning, and stronger evaluation (held-out lighting conditions and scenes).

Question 14

14 What is data augmentation in CV? 📊 medium

Answer

Answer: Applying transforms during training that keep labels valid: flips, crops, color jitter, rotation, blur, cutout, and so on (mixup-style variants also blend the labels). It artificially expands diversity and reduces overfitting. Augmentations should match deployment conditions and must not invalidate labels (e.g. do not mirror OCR text images, since flipped characters no longer match their transcriptions).
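
A minimal sketch of keeping labels consistent under augmentation: a horizontal flip that also updates a bounding box. The function name `hflip_with_box` and the box convention (x1, y1, x2, y2, with edges measured in pixels from the left and top) are assumptions for illustration.

```python
import numpy as np


def hflip_with_box(img, box):
    """Flip an (H, W) or (H, W, C) image left-right and mirror its box.

    Assumes box = (x1, y1, x2, y2) with x measured from the left edge,
    so the mirrored x-range is (W - x2, W - x1).
    """
    w = img.shape[1]
    x1, y1, x2, y2 = box
    flipped = img[:, ::-1]  # reverse the column (width) axis
    return flipped, (w - x2, y1, w - x1, y2)


img = np.arange(12).reshape(3, 4)  # tiny 3x4 "image"
flipped, new_box = hflip_with_box(img, (0, 0, 1, 3))
print(flipped[0], new_box)
```

A box hugging the left edge ends up hugging the right edge, which is the kind of label update a geometric augmentation has to carry along.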

Question 15

15 What is transfer learning in Computer Vision? 📊 medium

Answer

Answer: Starting from weights trained on a large source dataset (e.g. ImageNet) and fine-tuning on your smaller target task. Lower layers learn generic edges/textures; upper layers adapt to your classes. It cuts data and training time and is standard for classification and often backbone initialization for detection/segmentation.

Question 16

16 What is Intersection over Union (IoU)? 📊 medium

Answer

Answer: IoU measures overlap between two regions (often bounding boxes or masks): |A ∩ B| / |A ∪ B|. Values range from 0 (no overlap) to 1 (perfect match). It is used to score detections, as the basis of some training losses, and to define “positive” matches (e.g. IoU ≥ 0.5) in mAP evaluation.
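
As a sketch of the formula for axis-aligned boxes, assuming each box is given as (x1, y1, x2, y2) with x1 < x2 and y1 < y2 (the function name `box_iou` is just illustrative):

```python
def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b

    # Width and height of the overlap rectangle (zero if the boxes are disjoint).
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h

    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


print(box_iou((0, 0, 2, 2), (1, 1, 3, 3)))  # overlap 1, union 7
```

Two 2×2 boxes offset by one pixel in each direction share an area of 1 out of a union of 7, so even a visually "close" prediction can fall well below a 0.5 threshold.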

Question 17

17 How are precision and recall used in object detection? 🔥 hard

Answer

Answer: After matching predictions to ground truth by IoU threshold, precision is TP / (TP + FP)—how many predicted boxes are correct. Recall is TP / (TP + FN)—what fraction of real objects were found. Changing the confidence threshold trades precision vs recall; AP/mAP summarizes this curve across thresholds and classes.
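
The matching step can be sketched as follows. This is a simplified single-class, single-image version using greedy matching in descending score order; the function names, the (score, box) prediction format, and returning 0.0 for precision when nothing is predicted are assumptions for illustration, not a benchmark's official protocol.

```python
def iou(a, b):
    """IoU of two (x1, y1, x2, y2) boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def precision_recall(preds, gts, score_thr=0.5, iou_thr=0.5):
    """preds: list of (score, box); gts: list of boxes. Returns (precision, recall)."""
    kept = sorted((p for p in preds if p[0] >= score_thr),
                  key=lambda p: p[0], reverse=True)
    matched = set()  # indices of ground-truth boxes already claimed
    tp = 0
    for _, box in kept:
        best_j, best_iou = -1, 0.0
        for j, gt in enumerate(gts):
            if j in matched:
                continue
            v = iou(box, gt)
            if v > best_iou:
                best_j, best_iou = j, v
        if best_j >= 0 and best_iou >= iou_thr:
            matched.add(best_j)  # each ground truth matches at most one prediction
            tp += 1
    fp = len(kept) - tp
    fn = len(gts) - tp
    precision = tp / (tp + fp) if kept else 0.0
    recall = tp / (tp + fn) if gts else 0.0
    return precision, recall


gts = [(0, 0, 10, 10), (20, 20, 30, 30)]
preds = [(0.9, (0, 0, 10, 10)), (0.8, (50, 50, 60, 60)), (0.3, (20, 20, 30, 30))]
strict = precision_recall(preds, gts, score_thr=0.5)
loose = precision_recall(preds, gts, score_thr=0.2)
print(strict, loose)
```

Lowering the confidence threshold admits the low-scoring but correct box, which raises recall here; sweeping the threshold traces out the precision-recall curve that AP summarizes.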

Question 18

18 When would you use grayscale instead of RGB? ⚡ easy

Answer

Answer: When color is irrelevant or noisy, as in some industrial inspection, edge detection, or text/document pipelines. Grayscale reduces channels (saving compute and memory). When color carries signal (segmentation by material, stained medical slides, traffic signs), keep RGB or convert to a color space such as HSV or LAB that separates brightness from color.

Question 19

19 What is OpenCV used for? ⚡ easy

Answer

Answer: OpenCV is a widely used library for image/video I/O, geometric transforms, filtering, feature detection, camera calibration, and some DNN inference helpers. Interviews often expect familiarity with reading images, color conversion, resizing, and basic drawing—for both prototyping and deployment glue code.

Question 20

20 Why are GPUs important for Computer Vision? ⚡ easy

Answer

Answer: Deep vision models are dominated by massively parallel matrix operations (convolutions, batched training). GPUs offer much higher throughput than CPUs for these workloads. For deployment, you might also use specialized accelerators (TPU, NPU) or optimized runtimes; edge devices may use smaller models instead of large GPUs.

Question 21

21 What is a digital image in computer vision? ⚡ easy

Answer

Answer: A 2D (or 2D+channels) grid of samples where each cell is a pixel storing numeric intensity or color. It is a discrete approximation of a continuous scene after capture by a sensor and analog-to-digital conversion.

Question 22

22 What is a pixel? ⚡ easy

Answer

Answer: The smallest addressable element of a raster image. Each pixel holds one or more values (e.g. gray level or R,G,B). Spatially, pixels sit on a regular grid; physically, they correspond to sensor photosites plus processing (demosaicing for color cameras).

Question 23

23 Explain sampling and quantization. 📊 medium

Answer

Answer: Sampling chooses discrete spatial locations (grid resolution). Quantization maps continuous intensity to finite levels (bit depth). Together they convert a continuous image to digital form and introduce spatial and intensity approximation error.
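
A toy 1D illustration of both steps, assuming the "continuous scene" is a linear brightness ramp on [0, 1), sampled at 8 pixel centers and quantized to 2 bits (all of these choices are illustrative):

```python
import numpy as np


def scene(x):
    """Stand-in for a continuous signal: brightness rises linearly with position."""
    return x


# Sampling: evaluate the signal at the centers of 8 equally spaced pixels.
num_pixels = 8
centers = (np.arange(num_pixels) + 0.5) / num_pixels
samples = scene(centers)

# Quantization: map continuous values in [0, 1) to 2**bits integer levels.
bits = 2
levels = 2 ** bits
quantized = np.floor(samples * levels).clip(0, levels - 1).astype(np.uint8)

print(quantized)
```

Neighboring samples collapse onto the same level, which is the intensity approximation error (visible as banding in real images); using fewer pixels would add the spatial error.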

Question 24

24 What is image resolution? ⚡ easy

Answer

Answer: Usually the grid size width × height in pixels (e.g. 1920×1080). Higher resolution preserves finer detail but costs memory and compute. Aspect ratio is width/height; changing resolution without preserving ratio stretches content.

Question 25

25 What are color channels? ⚡ easy

Answer

Answer: Separate 2D arrays (or stacked planes) per color component—commonly R, G, B for display. Grayscale has one channel. Multispectral/hyperspectral images have many bands beyond visible RGB.

Question 26

26 How is grayscale often computed from RGB? ⚡ easy

Answer

Answer: A weighted sum approximating luminance, e.g. 0.299R + 0.587G + 0.114B (ITU-R BT.601) or simpler averages for rough work. Weights reflect human sensitivity to green; the exact formula depends on standard and use case.
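
The BT.601 weighting can be sketched in NumPy as below, assuming an (H, W, 3) uint8 array in RGB channel order; the function name `rgb_to_gray` is illustrative.

```python
import numpy as np


def rgb_to_gray(rgb):
    """Weighted luminance sum for an (H, W, 3) RGB uint8 image."""
    weights = np.array([0.299, 0.587, 0.114])  # R, G, B weights from the text
    gray = rgb.astype(np.float64) @ weights    # contracts the channel axis
    return np.clip(np.rint(gray), 0, 255).astype(np.uint8)


# One row of pure red, green, blue, and white pixels.
pixels = np.array([[[255, 0, 0], [0, 255, 0], [0, 0, 255], [255, 255, 255]]],
                  dtype=np.uint8)
gray = rgb_to_gray(pixels)
print(gray)
```

Pure green maps to a much brighter gray than pure blue, reflecting the larger green weight.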

Question 27

27 What is bit depth? Why does it matter? 📊 medium

Answer

Answer: Bits per channel (e.g. 8-bit → 256 levels). Higher depth reduces banding and helps medical/raw workflows; 8-bit uint is standard for web and many CV datasets. HDR may use 16/32-bit float linear pipelines before tone mapping.

Question 28

28 How are pixel coordinates usually indexed? ⚡ easy

Answer

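Answer: Array indexing is usually (row, col), i.e. (y, x), with the origin at the top-left and rows increasing downward, matching matrix indexing in NumPy and OpenCV arrays. Note that OpenCV point arguments (e.g. in drawing functions) are given as (x, y). Be careful when converting to math coordinates, where y may increase upward.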
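
A small NumPy sketch of the row-first convention, using an arbitrary 4×6 image and an arbitrary pixel position:

```python
import numpy as np

height, width = 4, 6
img = np.zeros((height, width), dtype=np.uint8)  # shape is (rows, cols) = (H, W)

x, y = 5, 1       # column 5, row 1 (origin at top-left)
img[y, x] = 255   # index as [row, col], i.e. [y, x]

# Converting to a math-style frame where y increases upward from the bottom row.
y_up = height - 1 - y

print(np.argwhere(img == 255), y_up)
```

Writing `img[x, y]` here would raise an IndexError because x exceeds the number of rows; on square images the same mistake silently transposes the result instead.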

Question 29

29 What does tensor shape (H, W, C) mean? 📊 medium

Answer

Answer: Height (rows), width (columns), channels—typical for NumPy/OpenCV images. PyTorch often uses (N, C, H, W) for batches. Interviews check you can transpose between layouts without mixing H/W.
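
The layout conversion can be sketched with plain NumPy; the 480×640 size is an arbitrary example.

```python
import numpy as np

hwc = np.zeros((480, 640, 3), dtype=np.uint8)  # (H, W, C), NumPy/OpenCV style

chw = np.transpose(hwc, (2, 0, 1))             # move channels first: (C, H, W)
nchw = chw[np.newaxis, ...]                    # add a batch axis: (N, C, H, W)

back = np.transpose(nchw[0], (1, 2, 0))        # undo: (H, W, C) again

print(chw.shape, nchw.shape, back.shape)
```

Using a non-square example makes a swapped H/W show up immediately in the printed shapes.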

Question 30

30 Raster vs vector graphics? ⚡ easy

Answer

Answer: Raster: pixel grid (photos, textures). Vector: curves/paths (SVG, fonts), resolution-independent until rasterized. CV pipelines usually consume raster tensors; vector assets are rasterized before being used for learning.

Question 31

31 When to choose JPEG vs PNG? ⚡ easy

Answer

Answer: JPEG: photos, smaller files, lossy, poor for sharp edges/text. PNG: lossless, supports transparency, good for screenshots and graphics. In ML pipelines, avoid repeatedly re-saving as JPEG: each lossy save adds compression artifacts that distort edges and noise statistics.

Question 32

32 What problems can lossy compression cause for CV? 📊 medium

Answer

Answer: Blocking, ringing, color bleeding—especially around edges. Models may overfit artifact patterns. For training data, prefer lossless or high-quality JPEG; for deployment, know your camera/codec pipeline.

Question 33

33 What is aliasing when downsampling? 📊 medium

Answer

Answer: If you shrink an image without low-pass filtering, detail too fine for the new grid folds into lower frequencies and appears as moiré or jaggies. Fix: blur then downsample, or use a resampling mode that averages source pixels (e.g. cv2.INTER_AREA when downscaling in OpenCV).
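
A toy NumPy demonstration, assuming an 8×8 image of one-pixel-wide vertical stripes and a 2× downscale. The block averaging below is a simple stand-in for area interpolation, not OpenCV's implementation.

```python
import numpy as np

h, w = 8, 8
# Columns alternate 0, 255: the finest detail the grid can hold.
stripes = np.tile(np.array([0.0, 255.0]), (h, w // 2))

# Naive 2x downsample: keep every second pixel, no filtering.
naive = stripes[::2, ::2]

# Area-style 2x downsample: average each 2x2 block (acts as a low-pass filter).
area = stripes.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

print(naive[0], area[0])
```

The naive version keeps only the dark columns and reports a solid black image, a false low-frequency result; the averaged version returns uniform mid-gray, which is the faithful answer at the lower resolution.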

Question 34

34 Nearest-neighbor vs bilinear interpolation? 📊 medium

Answer

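Answer: Nearest: fast, blocky, preserves original values (which is why it is used for label masks). Bilinear: weights the 4 nearest neighbors, smoother for resizing/rotation but blurs fine detail. Bicubic uses a 4×4 neighborhood and usually keeps edges sharper than bilinear at higher cost, though it can overshoot near strong edges. The choice affects augmentation and geometric transforms.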
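
A sketch of sampling one fractional position with each method, assuming a single-channel float image and coordinates that lie inside the image; the helper names are illustrative.

```python
import numpy as np


def sample_nearest(img, y, x):
    """Return the value of the closest pixel (round half up)."""
    return img[int(np.floor(y + 0.5)), int(np.floor(x + 0.5))]


def sample_bilinear(img, y, x):
    """Blend the 4 surrounding pixels, weighted by distance."""
    y0, x0 = int(np.floor(y)), int(np.floor(x))
    y1 = min(y0 + 1, img.shape[0] - 1)  # clamp at the bottom/right border
    x1 = min(x0 + 1, img.shape[1] - 1)
    dy, dx = y - y0, x - x0
    top = (1 - dx) * img[y0, x0] + dx * img[y0, x1]
    bottom = (1 - dx) * img[y1, x0] + dx * img[y1, x1]
    return (1 - dy) * top + dy * bottom


img = np.array([[0.0, 100.0],
                [200.0, 300.0]])

nearest = sample_nearest(img, 0.25, 0.75)
bilinear = sample_bilinear(img, 0.25, 0.75)
print(nearest, bilinear)
```

Nearest returns a value that already exists in the image, while bilinear produces a new in-between value, which is why bilinear should not be applied to class-index masks.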

Question 35

35 Typical dtypes for images in NumPy? ⚡ easy

Answer

Answer: uint8 [0,255] most common. Float images may be [0,1] or [0,255] depending on library—always normalize consistently before math or neural nets.
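
One reason dtype matters before doing math: uint8 array arithmetic wraps around instead of saturating. A small sketch with made-up pixel values:

```python
import numpy as np

a = np.array([200, 250], dtype=np.uint8)
b = np.array([100, 10], dtype=np.uint8)

# uint8 + uint8 stays uint8 and wraps modulo 256.
wrapped = a + b

# Widen first, then clip back into the valid range.
safe = np.clip(a.astype(np.int16) + b.astype(np.int16), 0, 255).astype(np.uint8)

# Typical normalization to float in [0, 1] before feeding a network.
normalized = a.astype(np.float32) / 255.0

print(wrapped, safe, normalized)
```

The wrapped result turns bright pixels dark with no error raised, so it is worth converting to a wider or float type before brightness or blending operations.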

Question 36

36 Why does OpenCV use BGR? ⚡ easy

Answer

Answer: Historical reasons; imread returns BGR order. Convert to RGB for matplotlib or PIL-centric code: cv2.cvtColor(img, cv2.COLOR_BGR2RGB). Mixing orders is a common interview “debugging” trap.
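
For a 3-channel image the conversion amounts to reversing the channel axis, which can be sketched in plain NumPy without OpenCV installed. The single-pixel array below is a stand-in for what an image loader would return in BGR order.

```python
import numpy as np

# A 1x1 image holding a pure blue pixel in BGR order: (B=255, G=0, R=0).
bgr = np.array([[[255, 0, 0]]], dtype=np.uint8)

# Reverse the last axis to get RGB. This is a view with a negative stride;
# np.ascontiguousarray makes a regular copy for libraries that need one.
rgb = np.ascontiguousarray(bgr[..., ::-1])

print(rgb[0, 0])
```

If the swap is forgotten, blue objects are displayed as red and vice versa, which is usually the first visible symptom of a channel-order bug.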

Question 37

37 What is the alpha channel? ⚡ easy

Answer

Answer: Per-pixel opacity used for compositing (the A in RGBA). Not always present. When feeding a 3-channel model, you typically either drop alpha or composite the image over a background color first; check whether the source uses straight or premultiplied alpha.
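
The two options for getting from RGBA to three channels can be sketched as below. This assumes uint8 RGBA with straight (non-premultiplied) alpha and a white background; the function names are illustrative.

```python
import numpy as np


def drop_alpha(rgba):
    """Keep only the color channels and ignore opacity."""
    return rgba[..., :3]


def composite_over(rgba, background=(255, 255, 255)):
    """Blend straight-alpha RGBA over a solid background color."""
    rgb = rgba[..., :3].astype(np.float32)
    alpha = rgba[..., 3:4].astype(np.float32) / 255.0  # keep axis for broadcasting
    bg = np.array(background, dtype=np.float32)
    out = alpha * rgb + (1.0 - alpha) * bg
    return np.rint(out).astype(np.uint8)


# Two pixels: fully transparent red, fully opaque blue.
rgba = np.array([[[255, 0, 0, 0], [0, 0, 255, 255]]], dtype=np.uint8)

dropped = drop_alpha(rgba)
flattened = composite_over(rgba)
print(dropped[0].tolist(), flattened[0].tolist())
```

Dropping alpha exposes the hidden red of the transparent pixel, while compositing shows what a viewer would actually see (the white background), so the two choices can give a model quite different inputs.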

Question 38

38 What does an image histogram show? 📊 medium

Answer

Answer: The distribution of pixel intensities (per channel or gray). Useful for exposure diagnosis, thresholding intuition, and contrast enhancement—foundation for histogram equalization (covered in later chapters).
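
For an 8-bit grayscale image, the histogram is just a count per intensity value. A minimal NumPy sketch on a made-up 2×3 image:

```python
import numpy as np

gray = np.array([[0, 0, 255],
                 [128, 255, 255]], dtype=np.uint8)

# One bin per possible uint8 value; minlength pads bins that never occur.
hist = np.bincount(gray.ravel(), minlength=256)

print(hist[0], hist[128], hist[255], hist.sum())
```

The counts always sum to the number of pixels; a histogram piled up at one end is a quick sign of under- or over-exposure.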

Question 39

39 How does a video relate to images? ⚡ easy

Answer

Answer: A sequence of frames (2D images) sampled in time with a frame rate (FPS). Temporal redundancy enables compression and tracking; many CV models treat frames independently at first.

Question 40

40 What is EXIF metadata? ⚡ easy

Answer

Answer: Tags embedded in JPEG/TIFF files: orientation, camera settings, timestamp, GPS. The orientation tag tells software how to rotate or flip the stored pixels for display; loaders that ignore it return unrotated data, causing inconsistent training data. Preprocess to a canonical orientation.
