CNN & Convolutional Layers: 20 Interview Questions
Master convolution arithmetic, filters, stride, padding, pooling, receptive field, 1x1 conv, depthwise separable, dilated/atrous conv, transposed conv, backprop, and modern CNN designs. Interview-ready answers.
Topics: Filters/Kernels · Stride & Padding · Pooling · 1x1 Conv · Depthwise Separable · Dilated Conv
1
What is a convolution layer? How does it work?
⚡ Easy
Answer: A convolution layer applies learnable filters (kernels) that slide across the input volume. At each position, the filter performs element-wise multiplication and summation, producing a feature map. It preserves spatial structure, uses parameter sharing, and is translation equivariant: the same pattern is detected wherever it appears.
Output feature map: O(i,j) = Σ_m Σ_n I(i+m, j+n) · K(m,n) + bias
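The formula above can be sketched in pure Python as a naive single-channel "valid" convolution (really cross-correlation, as used in deep learning). `conv2d_valid` is a hypothetical helper, not a framework API:

```python
# Minimal single-channel "valid" convolution (cross-correlation), a sketch.
def conv2d_valid(image, kernel, bias=0.0):
    H, W = len(image), len(image[0])
    k = len(kernel)
    out = []
    for i in range(H - k + 1):          # slide over valid positions only
        row = []
        for j in range(W - k + 1):
            s = bias
            for m in range(k):          # element-wise multiply and sum
                for n in range(k):
                    s += image[i + m][j + n] * kernel[m][n]
            row.append(s)
        out.append(row)
    return out

img = [[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]]
edge = [[1, 0],
        [0, -1]]                        # simple 2x2 difference kernel
print(conv2d_valid(img, edge))          # -> [[-4.0, -4.0], [-4.0, -4.0]]
```

The same 2x2 kernel is reused at every spatial position, which is exactly the parameter sharing discussed in Q4.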
2
A filter's depth must match input depth. Explain.
📊 Medium
Answer: Each filter has the same depth as the input volume (e.g., RGB image depth=3). The filter convolves across all channels, summing results to produce a single channel output. Number of filters = output depth.
Input: H×W×C | Filter: k×k×C | Output per filter: H'×W'×1
3
Formula: output size after convolution?
⚡ Easy
Answer: O = ⌊(I - K + 2P)/S⌋ + 1. Where I=input size, K=kernel, P=padding, S=stride. Same padding: O = I (if S=1, P=(K-1)/2). Valid padding: P=0.
Example: I=32, K=5, P=2, S=1 → O = (32-5+4)/1 +1 = 32
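The arithmetic above can be checked with a one-line helper (a sketch; note the floor via integer division):

```python
# Output-size formula: O = floor((I - K + 2P)/S) + 1
def conv_out(i, k, p=0, s=1):
    return (i - k + 2 * p) // s + 1

print(conv_out(32, 5, p=2, s=1))   # same padding at stride 1 -> 32
print(conv_out(224, 7, p=3, s=2))  # ResNet stem conv -> 112
```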
4
What is parameter sharing in CNN? Benefit?
⚡ Easy
Answer: The same filter weights are reused at every spatial position. This drastically reduces parameters, provides translation equivariance, and improves efficiency. Contrast with a fully connected layer, where every input position gets its own weights.
Pro: memory efficient, translation equivariant
Con: less adaptive per location
5
What is receptive field? How to compute?
🔥 Hard
Answer: Receptive field is the region of input space that affects a particular neuron. Formula (iterative): RF_l = RF_{l-1} + (K_l -1) * ∏_{i=1}^{l-1} stride_i. Important for understanding context in segmentation/detection.
# Example: VGG16 receptive field is 196 after conv5_3 (212 after pool5)
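The iterative formula can be sketched directly; the running `jump` holds the product of strides so far (a minimal sketch, layers given input-to-output):

```python
# Iterative receptive-field computation: RF += (K-1) * product of earlier strides.
def receptive_field(layers):
    """layers: list of (kernel, stride) tuples, input to output order."""
    rf, jump = 1, 1
    for k, s in layers:
        rf += (k - 1) * jump
        jump *= s
    return rf

# Two stacked 3x3 convs (stride 1) match one 5x5 -- see Q12:
print(receptive_field([(3, 1), (3, 1)]))  # -> 5
```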
6
Why use pooling? Max vs Average vs Global?
📊 Medium
Answer: Pooling downsamples feature maps, reduces dimensions, increases the receptive field, and provides some translation invariance. Max retains the strongest activation; average smooths. Global average pooling replaces FC layers and reduces overfitting.
Variants: MaxPool · AvgPool · GlobalAvgPool
7
What is 1x1 convolution? Why useful?
📊 Medium
Answer: 1x1 conv (pointwise) mixes channels, changes depth. Used for dimensionality reduction (bottleneck), increasing non-linearity, and channel-wise pooling. Key in Inception, MobileNet, ResNet.
Input: H×W×C_in | Filter: 1×1×C_in | Output: H×W×C_out
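A 1x1 convolution is just a per-pixel linear map over channels. A pure-Python sketch (`pointwise_conv` is a hypothetical helper):

```python
# 1x1 (pointwise) conv: apply the same C_out x C_in matrix at every pixel.
def pointwise_conv(feature, weights):
    """feature: H x W x C_in nested lists; weights: C_out x C_in."""
    return [[[sum(w_c * px[c] for c, w_c in enumerate(w_row))
              for w_row in weights]
             for px in row]
            for row in feature]

feat = [[[1.0, 2.0]]]             # 1x1 spatial, C_in = 2
w = [[0.5, 0.5], [1.0, -1.0]]     # C_out = 2: mixes the two input channels
print(pointwise_conv(feat, w))    # -> [[[1.5, -1.0]]]
```

Spatial size is untouched; only the channel dimension changes, which is why it works as a bottleneck.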
8
Explain depthwise separable convolution.
🔥 Hard
Answer: Factorizes standard conv into depthwise (spatial, per channel) + pointwise (1x1, channel mixing). Drastically reduces parameters and FLOPs. Used in MobileNet, Xception.
Standard: k²·C_in·C_out | Depthwise: k²·C_in + C_in·C_out (pointwise)
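Plugging numbers into the two cost formulas above makes the savings concrete (a sketch with hypothetical helper names):

```python
# Parameter counts: standard conv vs depthwise + pointwise factorization.
def standard_params(k, c_in, c_out):
    return k * k * c_in * c_out

def depthwise_separable_params(k, c_in, c_out):
    return k * k * c_in + c_in * c_out  # depthwise + pointwise

std = standard_params(3, 128, 128)             # 147,456
sep = depthwise_separable_params(3, 128, 128)  # 17,536
print(std, sep, round(std / sep, 1))           # roughly 8.4x fewer params
```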
9
What is dilated convolution? Why use it?
🔥 Hard
Answer: Inserts holes (zeros) between kernel elements. Increases receptive field without adding parameters or downsampling. Used in DeepLab, WaveNet, segmentation tasks.
Effective kernel size = K + (K-1)·(dilation-1)
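The effective-kernel formula above, checked for a few dilation rates (a sketch):

```python
# Effective kernel size of a dilated conv: K + (K-1) * (dilation - 1).
def effective_kernel(k, dilation):
    return k + (k - 1) * (dilation - 1)

print(effective_kernel(3, 1))  # 3 (ordinary conv)
print(effective_kernel(3, 2))  # 5
print(effective_kernel(3, 4))  # 9 -- same params as a 3x3
```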
10
Transposed convolution: what is it? Misconception?
🔥 Hard
Answer: Transposed convolution upsamples feature maps (learnable). Not true deconvolution (inverse of conv). Performs convolution with fractional stride. Used in GANs, segmentation (FCN, U-Net).
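The output size of a transposed conv inverts the forward size formula. A sketch (ignoring `output_padding`):

```python
# Transposed-conv output size: O = (I - 1)*S - 2P + K.
def conv_transpose_out(i, k, p=0, s=1):
    return (i - 1) * s - 2 * p + k

# Upsampling 16x16 -> 32x32 with k=4, s=2, p=1 (a common GAN setting):
print(conv_transpose_out(16, 4, p=1, s=2))  # -> 32
```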
11
Calculate parameters in a conv layer?
📊 Medium
Answer: Params = (K_h · K_w · C_in) · C_out + C_out (biases). No dependence on input spatial size.
Example: 3x3x64 filters, 128 filters → 3·3·64·128 + 128 = 73,856 params
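The example above as a small helper (a sketch):

```python
# Conv layer params: K_h * K_w * C_in * C_out + C_out biases.
def conv_params(k_h, k_w, c_in, c_out, bias=True):
    return k_h * k_w * c_in * c_out + (c_out if bias else 0)

print(conv_params(3, 3, 64, 128))              # -> 73856
print(conv_params(3, 3, 64, 128, bias=False))  # -> 73728 (e.g. conv before BN)
```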
12
Why use two 3x3 conv instead of one 5x5?
📊 Medium
Answer: Two 3x3 have same receptive field as 5x5 (2 layers), but with fewer parameters (2·9·C² vs 25·C²) and more non-linearity (ReLU in between). Used in VGG.
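A concrete check of the comparison, with C channels in and out (a sketch at C = 64):

```python
# Two 3x3 convs vs one 5x5 conv, same receptive field (5), same width C.
C = 64
two_3x3 = 2 * 3 * 3 * C * C   # 73,728 params, plus an extra ReLU in between
one_5x5 = 5 * 5 * C * C       # 102,400 params
print(two_3x3, one_5x5)
```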
13
How is backpropagation computed in convolution?
🔥 Hard
Answer: Gradient w.r.t. weights: convolution of input with output gradient. Gradient w.r.t. input: full convolution (with rotated kernel) of output gradient. Implemented via im2col + GEMM.
14
What is grouped convolution? Why?
📊 Medium
Answer: Split input channels into groups, each group convolved independently. Reduces parameters; used in AlexNet (GPU memory), ResNeXt, ShuffleNet. Depthwise conv is the extreme case (groups = channels).
15
What is deformable convolution?
🔥 Hard
Answer: Adds learnable 2D offsets to kernel sampling positions. Adapts to geometric variations (scale, rotation). Improves object detection/segmentation. Used in recent SOTA models.
16
Computational cost: conv vs fully connected?
📊 Medium
Answer: Conv: O(K²·C_in·C_out·H_out·W_out). FC: O(H_in·W_in·C_in · H_out·W_out·C_out). Conv is spatially local, far more efficient for images.
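Plugging a concrete shape into the two cost expressions shows the gap (a rough multiply count, ignoring biases; a sketch):

```python
# Multiply count for a 32x32x64 -> 32x32x64 mapping: 3x3 conv vs dense layer.
H = W = 32
C_in = C_out = 64
k = 3
conv_mults = k * k * C_in * C_out * H * W    # ~37.7M multiplies
fc_mults = (H * W * C_in) * (H * W * C_out)  # ~4.3B multiplies (flattened dense)
print(conv_mults, fc_mults, fc_mults // conv_mults)
```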
17
Why do some modern CNNs (ResNet) avoid pooling?
📊 Medium
Answer: Use strided convolution for downsampling. Learns spatial reduction, avoids losing information. Pooling is heuristic; strided conv is trainable.
18
Should we use bias in conv after BatchNorm?
🔥 Hard
Answer: No. BatchNorm subtracts the mean, so a preceding bias is redundant; BN's own learnable shift β plays that role. Standard practice: conv without bias when followed by BN.
19
Explain bottleneck design in ResNet.
📊 Medium
Answer: 1x1 conv (reduce channels) → 3x3 conv → 1x1 conv (restore). Reduces computation, allows deeper networks. Example: 256→64→64→256.
Input 256: 1x1,64 → 3x3,64 → 1x1,256
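Parameter arithmetic for the 256→64→64→256 block above, versus two plain 3x3 convs at width 256 (a sketch, biases ignored):

```python
# ResNet bottleneck params: 1x1 reduce + 3x3 + 1x1 restore.
bottleneck = 1*1*256*64 + 3*3*64*64 + 1*1*64*256  # 16,384 + 36,864 + 16,384
plain = 2 * (3*3*256*256)                          # two plain 3x3 at width 256
print(bottleneck, plain)  # bottleneck is ~17x cheaper
```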
20
Depthwise separable conv: accuracy vs efficiency?
🔥 Hard
Answer: Drastically reduces FLOPs/params (≈1/9 for 3x3). Slight accuracy drop if naive; modern versions (MobileNetV2/V3) with inverted residuals and linear bottlenecks close the gap.
Pro: efficient, mobile-friendly
Con: may underfit if made too thin
CNN & Convolutional Layers – Interview Cheat Sheet
Conv Arithmetic
- O = ⌊(I - K + 2P)/S⌋ + 1
- Params = (K²·C_in)·C_out + C_out
- RF: RF_l = RF_{l-1} + (K_l - 1)·∏ stride_i
Filter types
- 1x1: channel mixing, bottleneck
- Depthwise: spatial, per channel
- Dilated: larger RF, no extra params
- Transposed: upsampling
Pooling
- Max: sharp features
- Average: smooth
- Global: replaces FC
Modern concepts
- Bottleneck: 1x1 → 3x3 → 1x1
- Grouped: channel groups
- Deformable: learnable offsets
- Strided conv: replaces pooling
Efficiency
- Depthwise sep: k²C_in + C_in·C_out
Verdict: "Conv layers are local, parameter-efficient, and learn hierarchies – know your arithmetic and modern variants!"
20 CNN Q/A covered