CNN Basics for Vision — Interview Q&A

Question 1

1 Why CNNs for vision? ⚡ easy

Answer

Answer: Images have spatial structure; conv layers exploit local correlations and share weights—far fewer parameters and better generalization than huge FC layers on pixels.

Question 2

2 What does a conv layer do? 📊 medium

Answer

Answer: Slides learnable filters over the input; each output value is the dot product of a filter with a local patch (plus a bias), so the same detector for patterns like edges or textures is applied at many positions.
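The sliding dot product can be sketched in a few lines of NumPy. This is a minimal illustration, not an efficient implementation: it assumes a single-channel image, one filter, stride 1, and no padding, and the edge filter is hand-set rather than learned.

```python
import numpy as np

def conv2d_single(image, kernel):
    """Valid cross-correlation of a 2-D image with one 2-D filter (stride 1, no padding)."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i:i + kh, j:j + kw]
            out[i, j] = np.sum(patch * kernel)  # dot product of filter and local patch
    return out

# Toy image: dark left half, bright right half, so there is one vertical edge.
image = np.zeros((4, 6))
image[:, 3:] = 1.0

# Hypothetical hand-set filter that responds to a left-to-right intensity increase.
edge_filter = np.array([[-1.0, 1.0]])

response = conv2d_single(image, edge_filter)
print(response)  # nonzero only in the column where the edge sits
```

The response is nonzero only where the patch straddles the edge, and the same two weights produce it in every row.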

Question 3

3 What is parameter sharing? 📊 medium

Answer

Answer: Same filter weights used at every spatial location—if a feature is useful in one place, it can appear anywhere; drastically cuts parameters vs FC.
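To put rough numbers on the saving, the sketch below counts weights and biases for a conv layer and for a fully connected layer producing the same number of outputs per location. The sizes (224×224 RGB input, 64 filters or units, 3×3 kernels) are assumptions chosen only for illustration.

```python
def conv_params(c_in, c_out, k):
    """Weights plus one bias per filter; independent of image height and width."""
    return c_out * (c_in * k * k + 1)

def fc_params(n_in, n_out):
    """Weights plus one bias per unit; every unit connects to every input value."""
    return n_out * (n_in + 1)

h, w, c_in = 224, 224, 3                     # assumed input size
conv_count = conv_params(c_in, 64, 3)        # 64 shared 3x3x3 filters
fc_count = fc_params(h * w * c_in, 64)       # 64 units wired to all pixels

print(conv_count, fc_count, fc_count // conv_count)
```

The conv count does not change if the image gets larger, while the FC count grows with every added pixel.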

Question 4

4 Local receptive field? ⚡ easy

Answer

Answer: Each neuron sees only a small neighborhood—deeper layers indirectly see larger context via stacked convs.

Question 5

5 Stride and padding? 📊 medium

Answer

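Answer: Stride s subsamples the output: for input size n, kernel k, and padding p per side, the output size is floor((n + 2p - k) / s) + 1. "Same" padding adds a zero border so that H×W is preserved at stride 1; "valid" uses no padding, so each dimension shrinks by k - 1.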
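The standard output-size formula, floor((n + 2p - k) / s) + 1, is easy to check with a small helper. The function name and the example sizes are illustrative assumptions.

```python
def conv_output_size(n, k, stride=1, padding=0):
    """Output length along one spatial dimension for input n, kernel k."""
    return (n + 2 * padding - k) // stride + 1

same = conv_output_size(32, 3, stride=1, padding=1)     # "same": size preserved at stride 1
valid = conv_output_size(32, 3, stride=1, padding=0)    # "valid": shrinks by k - 1
strided = conv_output_size(32, 3, stride=2, padding=1)  # stride 2 roughly halves the size

print(same, valid, strided)
```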

Question 6

6 Purpose of pooling? 📊 medium

Answer

Answer: Reduces spatial resolution, adds slight translation tolerance, and lowers compute—max pool keeps strongest activations in each window.
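A minimal NumPy sketch of max pooling on a single feature map follows. It assumes a 2-D input and no padding; the input values are made up.

```python
import numpy as np

def max_pool2d(x, size=2, stride=2):
    """Max pooling over a 2-D feature map, no padding."""
    out_h = (x.shape[0] - size) // stride + 1
    out_w = (x.shape[1] - size) // stride + 1
    out = np.empty((out_h, out_w), dtype=x.dtype)
    for i in range(out_h):
        for j in range(out_w):
            window = x[i * stride:i * stride + size, j * stride:j * stride + size]
            out[i, j] = window.max()  # keep the strongest activation in the window
    return out

x = np.array([[1, 2, 0, 0],
              [3, 4, 0, 5],
              [0, 0, 7, 0],
              [0, 6, 0, 8]])
pooled = max_pool2d(x)
print(pooled)
```

Moving the 5 or the 6 by one pixel inside its window would leave the output unchanged, which is the "slight translation tolerance" in the answer.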

Question 7

7 Output depth? ⚡ easy

Answer

Answer: Number of filters = number of output channels—each filter produces one feature map.

Question 8

8 Receptive field size? 🔥 hard

Answer

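Answer: Grows with kernel sizes, strides, and depth: a layer with kernel k adds (k - 1) × (product of all earlier strides) input pixels, so after L layers one output unit “sees” a region of that accumulated size in the input image. For example, three stacked 3×3 stride-1 convs see a 7×7 region.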
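The usual recurrence for receptive field size can be written as a short loop. This is a sketch that assumes a plain chain of layers (no dilation, no branches), each described by a hypothetical (kernel, stride) pair.

```python
def receptive_field(layers):
    """layers: list of (kernel_size, stride) pairs, ordered from input to output."""
    rf = 1     # receptive field of one input pixel
    jump = 1   # distance in input pixels between neighboring units at this depth
    for k, s in layers:
        rf += (k - 1) * jump
        jump *= s
    return rf

two_convs = receptive_field([(3, 1), (3, 1)])            # same as one 5x5 conv
three_convs = receptive_field([(3, 1), (3, 1), (3, 1)])  # same as one 7x7 conv
with_pool = receptive_field([(3, 1), (2, 2), (3, 1)])    # conv, 2x2 pool, conv

print(two_convs, three_convs, with_pool)
```

Once a stride-2 layer appears, every later kernel covers twice as many input pixels per step, which is why downsampling grows the receptive field quickly.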

Question 9

9 Why 1×1 conv? 📊 medium

Answer

Answer: Mixes channels at each spatial location—cheap way to change depth (bottleneck), add nonlinearity, or implement MLP per pixel.
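The "MLP per pixel" view can be checked directly: a 1×1 conv is one matrix multiply over the channel axis, applied at every location. The shapes below (8 input channels, 4 output channels, 5×5 map) are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 5, 5))   # (C_in, H, W)
w = rng.standard_normal((4, 8))      # (C_out, C_in): four filters of size 1x1x8
b = rng.standard_normal(4)

# 1x1 convolution: mix channels independently at each spatial location.
y = np.einsum("oc,chw->ohw", w, x) + b[:, None, None]

# The same result at one pixel, computed as an ordinary linear layer.
pixel_out = w @ x[:, 2, 3] + b

print(y.shape, np.allclose(y[:, 2, 3], pixel_out))
```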

Question 10

10 CNN vs fully connected? 📊 medium

Answer

Answer: FC connects all inputs to each output—no locality; used at end (or as 1×1 conv) after spatial reduction for classification.

Question 11

11 Translation equivariance? 🔥 hard

Answer

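Answer: Shift the input and the feature maps shift by the same amount (exactly so for stride-1 convs, ignoring border effects), so the CNN respects spatial structure. Strided layers make this only approximate; pooling adds limited invariance to small shifts.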
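Equivariance can be verified numerically on a 1-D signal. The sketch below assumes circular (wrap-around) boundaries so the property holds exactly; with zero padding it holds only away from the borders. The signal and filter values are made up.

```python
import numpy as np

def circular_correlate(x, k):
    """Stride-1 cross-correlation of a 1-D signal with wrap-around boundaries."""
    n = len(x)
    return np.array([sum(k[j] * x[(i + j) % n] for j in range(len(k)))
                     for i in range(n)])

x = np.array([0.0, 0.0, 1.0, 3.0, 1.0, 0.0, 0.0, 0.0])
k = np.array([1.0, -1.0, 2.0])
shift = 2

shift_then_filter = circular_correlate(np.roll(x, shift), k)
filter_then_shift = np.roll(circular_correlate(x, k), shift)

# Equivariance: both orders give the same feature map.
print(np.allclose(shift_then_filter, filter_then_shift))

# Invariance after global max pooling: the pooled value ignores the shift.
print(shift_then_filter.max() == circular_correlate(x, k).max())
```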

Question 12

12 RGB input? ⚡ easy

Answer

Answer: First conv has 3 input channels per filter—depth matches image channels (or more for hyperspectral).

Question 13

13 Role of batch norm? 📊 medium

Answer

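Answer: Normalizes each channel using the mean and variance computed over the batch and spatial dimensions, then applies a learned scale and shift. This stabilizes training, permits higher learning rates, and has a slight regularization effect; running statistics are used at inference.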
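A minimal sketch of the training-time forward pass for conv features follows, assuming an (N, C, H, W) layout. It leaves out the running statistics used at inference, and the input data is random.

```python
import numpy as np

def batch_norm_train(x, gamma, beta, eps=1e-5):
    """x: (N, C, H, W). Normalizes each channel over batch and spatial axes."""
    mean = x.mean(axis=(0, 2, 3), keepdims=True)
    var = x.var(axis=(0, 2, 3), keepdims=True)
    x_hat = (x - mean) / np.sqrt(var + eps)
    # Learned per-channel scale (gamma) and shift (beta).
    return gamma.reshape(1, -1, 1, 1) * x_hat + beta.reshape(1, -1, 1, 1)

rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=3.0, size=(16, 4, 8, 8))
y = batch_norm_train(x, gamma=np.ones(4), beta=np.zeros(4))

print(y.mean(axis=(0, 2, 3)))  # close to 0 for every channel
print(y.var(axis=(0, 2, 3)))   # close to 1 for every channel
```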

Question 14

14 Dropout in CNNs? 📊 medium

Answer

Answer: More common in FC heads; sometimes spatial dropout drops whole feature maps—less standard than in MLPs.

Question 15

15 Global average pooling? 📊 medium

Answer

Answer: Average each channel to one value—reduces params vs large FC layers before softmax (Network in Network / ResNet style).
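Global average pooling is a mean over the spatial axes. The sketch below also compares classifier-head parameter counts under assumed sizes (512 channels, 7×7 final map, 1000 classes); these numbers are illustrative and not taken from a specific model.

```python
import numpy as np

# Global average pooling on a toy batch: (N, C, H, W) -> (N, C).
features = np.arange(2 * 3 * 4 * 4, dtype=float).reshape(2, 3, 4, 4)
pooled = features.mean(axis=(2, 3))
print(pooled.shape)

# Head parameters (weights + biases) under the assumed sizes.
c, h, w, classes = 512, 7, 7, 1000
flatten_fc_params = (c * h * w + 1) * classes  # flatten the map, then one FC layer
gap_fc_params = (c + 1) * classes              # GAP first, then one FC layer
print(flatten_fc_params, gap_fc_params)
```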

Question 16

16 Classification loss? ⚡ easy

Answer

Answer: Cross-entropy with softmax over classes—multi-label uses sigmoid + BCE per class.
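Both losses can be written in a numerically stable form in a few lines. This sketch handles one example at a time and averages the BCE over classes; the function names are illustrative.

```python
import numpy as np

def softmax_cross_entropy(logits, label):
    """Single-label loss: negative log-probability of the true class."""
    shifted = logits - logits.max()                 # subtract max for stability
    log_probs = shifted - np.log(np.exp(shifted).sum())
    return -log_probs[label]

def sigmoid_bce(logits, targets):
    """Multi-label loss: independent binary cross-entropy per class, averaged."""
    # Stable form of -[t*log(sigmoid(z)) + (1-t)*log(1-sigmoid(z))].
    per_class = np.maximum(logits, 0) - logits * targets + np.log1p(np.exp(-np.abs(logits)))
    return per_class.mean()

ce = softmax_cross_entropy(np.zeros(4), label=2)              # uniform over 4 classes
bce = sigmoid_bce(np.zeros(3), np.array([1.0, 0.0, 1.0]))     # every class at p = 0.5
print(ce, bce)
```

With all-zero logits the softmax loss equals log(4) and the per-class BCE equals log(2), which makes a quick sanity check.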

Question 17

17 Typical augmentation? 📊 medium

Answer

Answer: Random crop/flip, color jitter, mixup/cutmix—improves generalization and simulates viewpoint/light changes.
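Random crop and horizontal flip are simple array slices. The sketch below assumes an (H, W, C) image and uses 256 and 224 as example sizes; color jitter, mixup, and cutmix are left out.

```python
import numpy as np

def random_crop_flip(image, crop, rng):
    """image: (H, W, C). Returns a random crop x crop window, mirrored half the time."""
    h, w, _ = image.shape
    top = rng.integers(0, h - crop + 1)    # upper bound is exclusive
    left = rng.integers(0, w - crop + 1)
    out = image[top:top + crop, left:left + crop]
    if rng.random() < 0.5:
        out = out[:, ::-1]                 # horizontal flip
    return out

rng = np.random.default_rng(0)
image = rng.random((256, 256, 3))          # stand-in for a resized training image
patch = random_crop_flip(image, 224, rng)
print(patch.shape)
```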

Question 18

18 Transfer learning? 📊 medium

Answer

Answer: Initialize backbone from ImageNet pretrain, replace head, fine-tune—standard when labeled data is limited.

Question 19

19 Estimate complexity? 🔥 hard

Answer

Answer: Conv: roughly O(H_out×W_out×C_in×C_out×k²)—depthwise separable reduces this (MobileNet).
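The multiply-accumulate counts for a standard and a depthwise separable conv can be compared directly. The layer sizes (56×56 output, 128 channels in and out, 3×3 kernel) are assumptions for illustration; biases are ignored.

```python
def conv_macs(h_out, w_out, c_in, c_out, k):
    """Multiply-accumulates for a standard conv layer."""
    return h_out * w_out * c_in * c_out * k * k

def depthwise_separable_macs(h_out, w_out, c_in, c_out, k):
    """Depthwise k x k conv per channel, followed by a 1x1 pointwise conv."""
    depthwise = h_out * w_out * c_in * k * k
    pointwise = h_out * w_out * c_in * c_out
    return depthwise + pointwise

standard = conv_macs(56, 56, 128, 128, 3)
separable = depthwise_separable_macs(56, 56, 128, 128, 3)
ratio = separable / standard   # algebraically 1/c_out + 1/k**2

print(standard, separable, round(ratio, 3))
```

For 3×3 kernels and wide layers the ratio approaches 1/9, so the separable form needs roughly an eighth to a ninth of the compute.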

Question 20

20 CNN vs Vision Transformer? 🔥 hard

Answer

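Answer: CNN: local inductive bias and efficiency. ViT: global attention, typically needs more data or stronger augmentation to match. Later designs blend ideas from both: Swin restricts attention to local windows in a hierarchical layout, and ConvNeXt is a pure CNN modernized with Transformer-era design choices.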

AlexNet: 20 Essential Q&A

Question 21

21 Why is AlexNet important? ⚡ easy

Answer

Answer: Won ImageNet 2012 by a large margin—showed deep CNNs + GPU + data could beat hand-crafted features, sparking the deep learning boom in vision.

Question 22

22 What was ImageNet 2012? 📊 medium

Answer

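Answer: ILSVRC-2012: about 1.2M training images, 1000 classes. AlexNet reached 15.3% top-5 error versus 26.2% for the runner-up, which relied on hand-engineered features and shallow classifiers; a breakthrough result.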

Question 23

23 Rough architecture? 📊 medium

Answer

Answer: Five conv layers (some grouped across 2 GPUs) + max pooling + three large FC layers + softmax—deeper than prior CNNs for this task.

Question 24

24 Why ReLU? 📊 medium

Answer

Answer: Faster training than saturating tanh/sigmoid; mitigates vanishing gradient in deep stacks; sparse activations.
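The vanishing-gradient point can be made concrete by comparing derivatives. The sketch below uses made-up pre-activation values; the key fact is that the sigmoid derivative never exceeds 0.25, while the ReLU derivative is exactly 1 for positive inputs.

```python
import numpy as np

def sigmoid_grad(z):
    s = 1.0 / (1.0 + np.exp(-z))
    return s * (1.0 - s)          # peaks at 0.25 when z = 0

def relu_grad(z):
    return (z > 0).astype(float)  # 1 for active units, 0 otherwise

z = np.array([-6.0, -1.0, 0.5, 6.0])
print(sigmoid_grad(z))            # tiny at both saturated ends
print(relu_grad(z))               # no shrinkage where the unit is active

# Upper bound on the activation-derivative product through 10 sigmoid layers.
bound_10_layers = 0.25 ** 10
print(bound_10_layers)
```

The zeros in the ReLU derivative are also where the "sparse activations" come from: inactive units pass neither signal nor gradient.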

Question 25

25 Use of dropout? 📊 medium

Answer

Answer: Regularize huge FC layers by randomly zeroing neurons—reduces co-adaptation on training set.
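A minimal sketch of dropout follows. It uses the "inverted" variant common in current frameworks, which rescales at training time; this is an assumption of the sketch, since the original AlexNet recipe instead dropped units with probability 0.5 and multiplied outputs by 0.5 at test time.

```python
import numpy as np

def dropout(x, p, rng, training=True):
    """Inverted dropout: zero each unit with probability p, rescale the survivors."""
    if not training or p == 0.0:
        return x                       # identity at inference
    mask = rng.random(x.shape) >= p    # keep with probability 1 - p
    return x * mask / (1.0 - p)        # rescale so the expected activation is unchanged

rng = np.random.default_rng(0)
x = np.ones(10_000)
y = dropout(x, 0.5, rng)

print(np.unique(y))   # survivors are scaled to 2.0, dropped units are 0.0
print(y.mean())       # close to 1.0
```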

Question 26

26 What was LRN? 🔥 hard

Answer

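Answer: Local response normalization: a form of lateral inhibition that divides each activation by a term built from the squared activations of neighboring channels at the same position. Later largely replaced by batch norm; its effect looks minor in hindsight.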
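The cross-channel normalization can be sketched in NumPy as below. The defaults (n = 5, k = 2, alpha = 1e-4, beta = 0.75) follow the values reported in the AlexNet paper; the (C, H, W) layout and the function name are assumptions of this sketch, and library implementations may scale alpha differently.

```python
import numpy as np

def local_response_norm(a, n=5, k=2.0, alpha=1e-4, beta=0.75):
    """a: (C, H, W) float array. Normalizes each channel by its n nearest channels."""
    c = a.shape[0]
    squared = a ** 2
    out = np.empty_like(a)
    half = n // 2
    for i in range(c):
        lo, hi = max(0, i - half), min(c - 1, i + half)
        # Sum of squared activations over neighboring channels at each position.
        denom = (k + alpha * squared[lo:hi + 1].sum(axis=0)) ** beta
        out[i] = a[i] / denom
    return out

a = np.ones((8, 2, 2))
b = local_response_norm(a)
print(b[0, 0, 0], b[4, 0, 0])   # edge channel has 3 neighbors in range, middle has 5
```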

Question 27

27 Overlapping pooling? 📊 medium

Answer

Answer: Stride smaller than pool window—slightly richer downsampling vs non-overlapping; less common in newer nets.

Question 28

28 Two GPUs? ⚡ easy

Answer

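Answer: The model was split across two GTX 580 GPUs (3 GB each) because it did not fit in one GPU's memory; the two halves exchange activations only at certain layers (the third conv layer and the FC layers), an engineering constraint of the time.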

Question 29

29 Augmentation? 📊 medium

Answer

Answer: Random crops/flips from 256×256, PCA color jitter—reduces overfitting and increases effective data.

Question 30

30 Parameters? ⚡ easy

Answer

Answer: On order of 60M—mostly FC layers; later architectures reduce FC params with GAP.
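The 60M figure can be reproduced by counting weights and biases layer by layer. The sketch below assumes the original two-GPU layout, where grouped conv layers see only half of the previous layer's channels, and a 6×6×256 feature map entering the first FC layer.

```python
def conv_params(c_out, c_in_per_filter, k):
    """Weights plus one bias per filter."""
    return c_out * c_in_per_filter * k * k + c_out

def fc_params(n_in, n_out):
    return n_in * n_out + n_out

conv = [
    conv_params(96, 3, 11),     # conv1
    conv_params(256, 48, 5),    # conv2: grouped, each filter sees 48 of 96 channels
    conv_params(384, 256, 3),   # conv3: sees channels from both GPUs
    conv_params(384, 192, 3),   # conv4: grouped
    conv_params(256, 192, 3),   # conv5: grouped
]
fc = [
    fc_params(6 * 6 * 256, 4096),   # fc6
    fc_params(4096, 4096),          # fc7
    fc_params(4096, 1000),          # fc8, one output per ImageNet class
]

total = sum(conv) + sum(fc)
fc_share = sum(fc) / total
print(total, round(fc_share, 3))
```

Under these assumptions the FC layers hold about 96% of the parameters, which is why replacing them with global average pooling shrinks later models so much.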

Question 31

31 Training details? 📊 medium

Answer

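Answer: SGD with momentum 0.9, weight decay 0.0005, and batch size 128; the learning rate started at 0.01 and was divided by 10 by hand when validation error plateaued. Roughly 90 epochs, taking five to six days on two GPUs.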
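One update step of SGD with momentum and weight decay can be sketched as follows. The form mirrors the update rule given in the AlexNet paper, with momentum 0.9 and weight decay 0.0005 as defaults; the toy weight and gradient values are made up.

```python
import numpy as np

def sgd_momentum_step(w, v, grad, lr, momentum=0.9, weight_decay=0.0005):
    """Returns updated (weights, velocity) for one mini-batch gradient."""
    v = momentum * v - weight_decay * lr * w - lr * grad
    return w + v, v

w = np.array([1.0])
v = np.zeros(1)
grad = np.array([2.0])   # hypothetical mini-batch gradient

w1, v1 = sgd_momentum_step(w, v, grad, lr=0.01)
w2, v2 = sgd_momentum_step(w1, v1, grad, lr=0.01)
print(w1, v1)
print(w2, v2)            # the second step is larger because velocity accumulates
```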

Question 32

32 Overfitting risk? 📊 medium

Answer

Answer: Large capacity relative to the data; addressed by dropout, augmentation, and weight decay. Still a concern when fine-tuning on smaller datasets.

Question 33

33 vs VGG? 📊 medium

Answer

Answer: VGG stacks uniform 3×3 convs in a deeper, more systematic design, giving higher accuracy at more compute; AlexNet is shallower, with an irregular mix of kernel sizes (11×11, 5×5, 3×3).

Question 34

34 vs ResNet? 📊 medium

Answer

Answer: ResNet adds residual (skip) connections that make much deeper nets trainable; AlexNet's depth is modest by today’s standards.

Question 35

35 Use AlexNet now? ⚡ easy

Answer

Answer: Mostly for teaching and history; ResNet/EfficientNet backbones dominate transfer learning. AlexNet is far less accurate and spends most of its ~60M parameters in FC layers, so it compares poorly with modern alternatives.

Question 36

36 Typical input? 📊 medium

Answer

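Answer: 224×224 crops from a 256×256 resized image, a pipeline referenced in many papers. Many implementations use 227×227 instead, because the first layer's 11×11 stride-4 conv without padding then yields a 55×55 map.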

Question 37

37 Output layer? ⚡ easy

Answer

Answer: 1000-way softmax for ImageNet classes—cross-entropy loss during training.

Question 38

38 Obsolete? ⚡ easy

Answer

Answer: For production accuracy, yes; for pedagogy and history, still the canonical “first big win” story.

Question 39

39 Impact beyond vision? ⚡ easy

Answer

Answer: Validated deep learning at scale; its recipe of GPUs + data + depth carried over to later work in speech and NLP.

Question 40

40 Modern small nets? 📊 medium

Answer

Answer: MobileNet and EfficientNet achieve better accuracy per FLOP; mobile and edge models rarely use AlexNet-sized FC heads.
