Updated 2026
CNNs for Vision: 20 Essential Q&A
Why convolutions beat dense layers on images—and how pooling, padding, and depth build representations.
~11 min read
20 questions
Intermediate
conv · pool · ReLU · parameter sharing
1
Why CNNs for vision?
⚡ easy
Answer: Images have spatial structure; conv layers exploit local correlations and share weights—far fewer parameters and better generalization than huge FC layers on pixels.
# PyTorch: nn.Conv2d(in_c, out_c, kernel_size=3, padding=1)  # 3x3 conv; padding=1 preserves H×W
2
What does a conv layer do?
📊 medium
Answer: Slides learnable filters over the input; each output location is dot product of filter with local patch—detects patterns like edges/textures at many positions.
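A minimal sketch of this in PyTorch (filter count and image size are illustrative):

```python
import torch
import torch.nn as nn

# 16 learnable 3x3 filters slide over a 3-channel image (sizes are illustrative)
conv = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, padding=1)
x = torch.randn(1, 3, 32, 32)   # batch of one 32x32 RGB image
y = conv(x)                     # each filter produces one 32x32 feature map
print(y.shape)                  # torch.Size([1, 16, 32, 32])
```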
4
Local receptive field?
⚡ easy
Answer: Each neuron sees only a small neighborhood—deeper layers indirectly see larger context via stacked convs.
5
Stride and padding?
📊 medium
Answer: Stride subsamples spatial size; same padding keeps H×W with zero border; valid shrinks without padding.
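The standard output-size formula behind these choices, as a small sketch:

```python
def conv_out_size(h, k, s=1, p=0):
    """Output size along one spatial dimension: floor((h + 2p - k) / s) + 1."""
    return (h + 2 * p - k) // s + 1

print(conv_out_size(32, k=3, p=1))       # 32: 'same' padding keeps H×W
print(conv_out_size(32, k=3))            # 30: 'valid' shrinks by k-1
print(conv_out_size(32, k=3, s=2, p=1))  # 16: stride 2 halves the size
```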
6
Purpose of pooling?
📊 medium
Answer: Reduces spatial resolution, adds slight translation tolerance, and lowers compute—max pool keeps strongest activations in each window.
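A max pool keeps only the largest value in each window; a tiny sketch:

```python
import torch
import torch.nn as nn

pool = nn.MaxPool2d(kernel_size=2, stride=2)   # non-overlapping 2x2 windows
x = torch.tensor([[[[1., 2.],
                    [3., 4.]]]])               # one 2x2 single-channel map
print(pool(x))                                 # tensor([[[[4.]]]]): strongest activation survives
```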
7
Output depth?
⚡ easy
Answer: Number of filters = number of output channels—each filter produces one feature map.
8
Receptive field size?
🔥 hard
Answer: Grows with kernel size, stride, and depth; for L stacked 3×3 stride-1 convs the receptive field is (1+2L)×(1+2L), so after L layers the network "sees" a region of that size in the input image.
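The growth can be computed with the standard receptive-field recurrence (layer specs here are illustrative):

```python
def receptive_field(layers):
    """layers: list of (kernel, stride) pairs; returns RF size in input pixels."""
    rf, jump = 1, 1
    for k, s in layers:
        rf += (k - 1) * jump  # each layer widens the RF by (k-1) input-space steps
        jump *= s             # strides compound the step size for later layers
    return rf

print(receptive_field([(3, 1)] * 3))              # 7: three 3x3 convs match one 7x7
print(receptive_field([(3, 1), (2, 2), (3, 1)]))  # 8: the stride doubles later growth
```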
9
Why 1×1 conv?
📊 medium
Answer: Mixes channels at each spatial location—cheap way to change depth (bottleneck), add nonlinearity, or implement MLP per pixel.
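Parameter counts make the bottleneck argument concrete (channel sizes follow the classic ResNet-style example):

```python
def conv_params(c_in, c_out, k):
    return c_in * c_out * k * k  # weights only, ignoring bias

direct = conv_params(256, 256, 3)        # plain 3x3 at full depth
bottleneck = (conv_params(256, 64, 1)    # 1x1 squeeze to 64 channels
              + conv_params(64, 64, 3)   # cheap 3x3 at reduced depth
              + conv_params(64, 256, 1)) # 1x1 expand back to 256
print(direct, bottleneck)                # 589824 69632: ~8.5x fewer
```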
10
CNN vs fully connected?
📊 medium
Answer: FC connects all inputs to each output—no locality; used at end (or as 1×1 conv) after spatial reduction for classification.
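A back-of-envelope comparison (layer sizes chosen for illustration):

```python
fc_weights = 224 * 224 * 3 * 1000  # dense layer from raw pixels to 1000 units
conv_weights = 3 * 64 * 3 * 3      # 64 shared 3x3 filters over the same image
print(fc_weights, conv_weights)    # 150528000 vs 1728: sharing wins by ~87000x
```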
11
Translation equivariance?
🔥 hard
Answer: Shift input → shifted feature maps (before pooling)—CNN respects spatial structure; pooling adds limited invariance.
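A quick numeric check of equivariance, comparing a conv of the shifted input against a shift of the conv output (interior only, since zero padding breaks the equality at the border):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
conv = nn.Conv2d(1, 1, kernel_size=3, padding=1, bias=False)
x = torch.randn(1, 1, 8, 8)
a = conv(torch.roll(x, shifts=1, dims=3))  # conv of the shifted input
b = torch.roll(conv(x), shifts=1, dims=3)  # shift of the conv output
print(torch.allclose(a[..., 2:-2], b[..., 2:-2]))  # True away from the border
```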
12
RGB input?
⚡ easy
Answer: First conv has 3 input channels per filter—depth matches image channels (or more for hyperspectral).
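The filter shape makes this visible (16 output filters chosen arbitrarily):

```python
import torch.nn as nn

conv = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3)
print(conv.weight.shape)  # torch.Size([16, 3, 3, 3]): 16 filters, each 3x3 across all 3 channels
```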
13
Role of batch norm?
📊 medium
Answer: Normalize activations per channel for stable training and higher learning rates—slight regularization effect.
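A sketch of the per-channel effect (the inflated scale and offset here are artificial):

```python
import torch
import torch.nn as nn

bn = nn.BatchNorm2d(16)                  # one learnable (gamma, beta) pair per channel
x = torch.randn(8, 16, 32, 32) * 5 + 3   # activations with inflated scale and offset
y = bn(x)
print(round(y.mean().item(), 2), round(y.std().item(), 2))  # ~0.0 and ~1.0
```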
14
Dropout in CNNs?
📊 medium
Answer: More common in FC heads; sometimes spatial dropout drops whole feature maps—less standard than in MLPs.
15
Global average pooling?
📊 medium
Answer: Average each channel to one value—reduces params vs large FC layers before softmax (Network in Network / ResNet style).
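For example, collapsing a typical 7×7 backbone output:

```python
import torch
import torch.nn as nn

gap = nn.AdaptiveAvgPool2d(1)   # average each channel down to a single value
x = torch.randn(1, 512, 7, 7)   # typical backbone output before the head
y = gap(x).flatten(1)           # [1, 512]: a tiny FC head suffices after this
print(y.shape)
```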
16
Classification loss?
⚡ easy
Answer: Cross-entropy with softmax over classes—multi-label uses sigmoid + BCE per class.
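A minimal sketch with made-up logits:

```python
import torch
import torch.nn as nn

logits = torch.tensor([[2.0, 0.5, -1.0]])    # raw class scores (illustrative)
target = torch.tensor([0])                   # true class index
loss = nn.CrossEntropyLoss()(logits, target) # applies softmax + negative log-likelihood
print(loss.item())                           # small, since class 0 already dominates
```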
17
Typical augmentation?
📊 medium
Answer: Random crop/flip, color jitter, mixup/cutmix—improves generalization and simulates viewpoint/light changes.
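Crops and flips come from torchvision transforms; mixup is simple enough to sketch directly (alpha and batch shapes here are illustrative):

```python
import torch

def mixup(x, y_onehot, alpha=0.2):
    """Blend each example with a randomly paired one; labels blend the same way."""
    lam = torch.distributions.Beta(alpha, alpha).sample()
    idx = torch.randperm(x.size(0))
    return lam * x + (1 - lam) * x[idx], lam * y_onehot + (1 - lam) * y_onehot[idx]

xm, ym = mixup(torch.randn(8, 3, 32, 32), torch.eye(10)[torch.randint(10, (8,))])
print(xm.shape, ym.shape)  # shapes unchanged; labels become soft targets
```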
18
Transfer learning?
📊 medium
Answer: Initialize backbone from ImageNet pretrain, replace head, fine-tune—standard when labeled data is limited.
19
Estimate complexity?
🔥 hard
Answer: Conv: roughly O(H_out×W_out×C_in×C_out×k²)—depthwise separable reduces this (MobileNet).
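Plugging in numbers shows why depthwise separable convs help (layer sizes are illustrative):

```python
def conv_macs(h, w, c_in, c_out, k):
    """Multiply-accumulates for a stride-1 conv layer."""
    return h * w * c_in * c_out * k * k

std = conv_macs(56, 56, 128, 128, 3)      # standard 3x3 conv
sep = (56 * 56 * 128 * 3 * 3              # depthwise: one 3x3 filter per channel
       + conv_macs(56, 56, 128, 128, 1))  # pointwise 1x1 to mix channels
print(f"{std:,} vs {sep:,}: {std / sep:.1f}x cheaper")  # ~8.4x
```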
20
CNN vs Vision Transformer?
🔥 hard
Answer: CNN: local inductive bias and efficiency. ViT: global attention, needs more data—hybrids (ConvNeXt, Swin) blend ideas.
CNN Cheat Sheet
Conv
- Local + shared
- Stride/pad
Pool
- Downsample
- Invariance
Head
- GAP + FC
- Softmax CE
💡 Pro tip: Sharing weights is the core efficiency vs FC on pixels.