CNN & Convolutional Layers: 20 Interview Questions
Master convolution arithmetic, filters, stride, padding, pooling, receptive field, 1x1 conv, depthwise separable, dilated/atrous conv, transposed conv, backprop, and modern CNN designs. Interview-ready answers.
Topics: Filters/Kernels · Stride & Padding · Pooling · 1x1 Conv · Depthwise Separable · Dilated Conv
1
What is a convolution layer? How does it work?
⚡ Easy
Answer: A convolution layer applies learnable filters (kernels) that slide across the input volume. At each position, the filter performs element-wise multiplication and summation, producing a feature map. It preserves spatial structure, uses parameter sharing, and is translation equivariant: the same pattern is detected wherever it appears.
Output feature map: O(i,j) = Σ_m Σ_n I(i+m, j+n) · K(m,n) + bias
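The formula above can be sketched in pure Python as a naive single-channel "valid" convolution (really cross-correlation, as used in deep learning). `conv2d_valid` is a hypothetical helper, not a framework API:

```python
# Minimal single-channel "valid" convolution (cross-correlation), a sketch.
def conv2d_valid(image, kernel, bias=0.0):
    H, W = len(image), len(image[0])
    k = len(kernel)
    out = []
    for i in range(H - k + 1):          # slide over valid positions only
        row = []
        for j in range(W - k + 1):
            s = bias
            for m in range(k):          # element-wise multiply and sum
                for n in range(k):
                    s += image[i + m][j + n] * kernel[m][n]
            row.append(s)
        out.append(row)
    return out

img = [[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]]
edge = [[1, 0],
        [0, -1]]                        # simple 2x2 difference kernel
print(conv2d_valid(img, edge))          # -> [[-4.0, -4.0], [-4.0, -4.0]]
```

The same 2x2 kernel is reused at every spatial position, which is exactly the parameter sharing discussed in Q4.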
2
A filter's depth must match input depth. Explain.
📊 Medium
Answer: Each filter has the same depth as the input volume (e.g., RGB image depth=3). The filter convolves across all channels, summing results to produce a single channel output. Number of filters = output depth.
Input: H×W×C | Filter: k×k×C | Output per filter: H'×W'×1
3
Formula: output size after convolution?
⚡ Easy
Answer: O = ⌊(I - K + 2P)/S⌋ + 1. Where I=input size, K=kernel, P=padding, S=stride. Same padding: O = I (if S=1, P=(K-1)/2). Valid padding: P=0.
Example: I=32, K=5, P=2, S=1 → O = (32-5+4)/1 +1 = 32
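The arithmetic above can be checked with a one-line helper (a sketch; note the floor via integer division):

```python
# Output-size formula: O = floor((I - K + 2P)/S) + 1
def conv_out(i, k, p=0, s=1):
    return (i - k + 2 * p) // s + 1

print(conv_out(32, 5, p=2, s=1))   # same padding at stride 1 -> 32
print(conv_out(224, 7, p=3, s=2))  # ResNet stem conv -> 112
```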
4
What is parameter sharing in CNN? Benefit?
⚡ Easy
Answer: The same filter weights are reused at every spatial position. This drastically reduces parameters, provides translation equivariance, and improves efficiency. Contrast with a fully connected layer, where every input position gets its own weights.
Pro: memory efficient, translation equivariant
Con: less adaptive per location
5
What is receptive field? How to compute?
🔥 Hard
Answer: Receptive field is the region of input space that affects a particular neuron. Formula (iterative): RF_l = RF_{l-1} + (K_l -1) * ∏_{i=1}^{l-1} stride_i. Important for understanding context in segmentation/detection.
# Example: VGG16 receptive field is 196 after conv5_3 (212 after pool5)
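The iterative formula can be sketched directly; the running `jump` holds the product of strides so far (a minimal sketch, layers given input-to-output):

```python
# Iterative receptive-field computation: RF += (K-1) * product of earlier strides.
def receptive_field(layers):
    """layers: list of (kernel, stride) tuples, input to output order."""
    rf, jump = 1, 1
    for k, s in layers:
        rf += (k - 1) * jump
        jump *= s
    return rf

# Two stacked 3x3 convs (stride 1) match one 5x5 -- see Q12:
print(receptive_field([(3, 1), (3, 1)]))  # -> 5
```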
6
Why use pooling? Max vs Average vs Global?
📊 Medium
Answer: Pooling downsamples feature maps, reduces dimensions, increases the receptive field, and provides some translation invariance. Max retains the strongest activation; average smooths. Global average pooling replaces FC layers and reduces overfitting.
Variants: MaxPool · AvgPool · GlobalAvgPool
7
What is 1x1 convolution? Why useful?
📊 Medium
Answer: 1x1 conv (pointwise) mixes channels, changes depth. Used for dimensionality reduction (bottleneck), increasing non-linearity, and channel-wise pooling. Key in Inception, MobileNet, ResNet.
Input: H×W×C_in | Filter: 1×1×C_in | Output: H×W×C_out
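A 1x1 convolution is just a per-pixel linear map over channels. A pure-Python sketch (`pointwise_conv` is a hypothetical helper):

```python
# 1x1 (pointwise) conv: apply the same C_out x C_in matrix at every pixel.
def pointwise_conv(feature, weights):
    """feature: H x W x C_in nested lists; weights: C_out x C_in."""
    return [[[sum(w_c * px[c] for c, w_c in enumerate(w_row))
              for w_row in weights]
             for px in row]
            for row in feature]

feat = [[[1.0, 2.0]]]             # 1x1 spatial, C_in = 2
w = [[0.5, 0.5], [1.0, -1.0]]     # C_out = 2: mixes the two input channels
print(pointwise_conv(feat, w))    # -> [[[1.5, -1.0]]]
```

Spatial size is untouched; only the channel dimension changes, which is why it works as a bottleneck.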
8
Explain depthwise separable convolution.
🔥 Hard
Answer: Factorizes standard conv into depthwise (spatial, per channel) + pointwise (1x1, channel mixing). Drastically reduces parameters and FLOPs. Used in MobileNet, Xception.
Standard: k²·C_in·C_out | Depthwise: k²·C_in + C_in·C_out (pointwise)
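Plugging numbers into the two cost formulas above makes the savings concrete (a sketch with hypothetical helper names):

```python
# Parameter counts: standard conv vs depthwise + pointwise factorization.
def standard_params(k, c_in, c_out):
    return k * k * c_in * c_out

def depthwise_separable_params(k, c_in, c_out):
    return k * k * c_in + c_in * c_out  # depthwise + pointwise

std = standard_params(3, 128, 128)             # 147,456
sep = depthwise_separable_params(3, 128, 128)  # 17,536
print(std, sep, round(std / sep, 1))           # roughly 8.4x fewer params
```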
9
What is dilated convolution? Why use it?
🔥 Hard
Answer: Inserts holes (zeros) between kernel elements. Increases receptive field without adding parameters or downsampling. Used in DeepLab, WaveNet, segmentation tasks.
Effective kernel size = K + (K-1)·(dilation-1)
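The effective-kernel formula above, checked for a few dilation rates (a sketch):

```python
# Effective kernel size of a dilated conv: K + (K-1) * (dilation - 1).
def effective_kernel(k, dilation):
    return k + (k - 1) * (dilation - 1)

print(effective_kernel(3, 1))  # 3 (ordinary conv)
print(effective_kernel(3, 2))  # 5
print(effective_kernel(3, 4))  # 9 -- same params as a 3x3
```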
10
Transposed convolution: what is it? Misconception?
🔥 Hard
Answer: Transposed convolution upsamples feature maps (learnable). Not true deconvolution (inverse of conv). Performs convolution with fractional stride. Used in GANs, segmentation (FCN, U-Net).
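The output size of a transposed conv inverts the forward size formula. A sketch (ignoring `output_padding`):

```python
# Transposed-conv output size: O = (I - 1)*S - 2P + K.
def conv_transpose_out(i, k, p=0, s=1):
    return (i - 1) * s - 2 * p + k

# Upsampling 16x16 -> 32x32 with k=4, s=2, p=1 (a common GAN setting):
print(conv_transpose_out(16, 4, p=1, s=2))  # -> 32
```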
11
Calculate parameters in a conv layer?
📊 Medium
Answer: Params = (K_h · K_w · C_in) · C_out + C_out (biases). No dependence on input spatial size.
Example: 3x3x64 filters, 128 filters → 3·3·64·128 + 128 = 73,856 params
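The example above as a small helper (a sketch):

```python
# Conv layer params: K_h * K_w * C_in * C_out + C_out biases.
def conv_params(k_h, k_w, c_in, c_out, bias=True):
    return k_h * k_w * c_in * c_out + (c_out if bias else 0)

print(conv_params(3, 3, 64, 128))              # -> 73856
print(conv_params(3, 3, 64, 128, bias=False))  # -> 73728 (e.g. conv before BN)
```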
12
Why use two 3x3 conv instead of one 5x5?
📊 Medium
Answer: Two 3x3 have same receptive field as 5x5 (2 layers), but with fewer parameters (2·9·C² vs 25·C²) and more non-linearity (ReLU in between). Used in VGG.
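A concrete check of the comparison, with C channels in and out (a sketch at C = 64):

```python
# Two 3x3 convs vs one 5x5 conv, same receptive field (5), same width C.
C = 64
two_3x3 = 2 * 3 * 3 * C * C   # 73,728 params, plus an extra ReLU in between
one_5x5 = 5 * 5 * C * C       # 102,400 params
print(two_3x3, one_5x5)
```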
13
How is backpropagation computed in convolution?
🔥 Hard
Answer: Gradient w.r.t. weights: convolution of input with output gradient. Gradient w.r.t. input: full convolution (with rotated kernel) of output gradient. Implemented via im2col + GEMM.
14
What is grouped convolution? Why?
📊 Medium
Answer: Split input channels into groups, each group convolved independently. Reduces parameters; used in AlexNet (GPU memory), ResNeXt, ShuffleNet. Depthwise conv is the extreme case (groups = channels).
15
What is deformable convolution?
🔥 Hard
Answer: Adds learnable 2D offsets to kernel sampling positions. Adapts to geometric variations (scale, rotation). Improves object detection/segmentation. Used in recent SOTA models.
16
Computational cost: conv vs fully connected?
📊 Medium
Answer: Conv: O(K²·C_in·C_out·H_out·W_out). FC: O(H_in·W_in·C_in · H_out·W_out·C_out). Conv is spatially local, far more efficient for images.
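Plugging a concrete shape into the two cost expressions shows the gap (a rough multiply count, ignoring biases; a sketch):

```python
# Multiply count for a 32x32x64 -> 32x32x64 mapping: 3x3 conv vs dense layer.
H = W = 32
C_in = C_out = 64
k = 3
conv_mults = k * k * C_in * C_out * H * W    # ~37.7M multiplies
fc_mults = (H * W * C_in) * (H * W * C_out)  # ~4.3B multiplies (flattened dense)
print(conv_mults, fc_mults, fc_mults // conv_mults)
```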
17
Why do some modern CNNs (ResNet) avoid pooling?
📊 Medium
Answer: Use strided convolution for downsampling. Learns spatial reduction, avoids losing information. Pooling is heuristic; strided conv is trainable.
18
Should we use bias in conv after BatchNorm?
🔥 Hard
Answer: No. BatchNorm subtracts the mean, so a preceding bias is redundant; BN's own learnable shift β plays that role. Standard practice: conv without bias when followed by BN.
19
Explain bottleneck design in ResNet.
📊 Medium
Answer: 1x1 conv (reduce channels) → 3x3 conv → 1x1 conv (restore). Reduces computation, allows deeper networks. Example: 256→64→64→256.
Input 256: 1x1,64 → 3x3,64 → 1x1,256
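Parameter arithmetic for the 256→64→64→256 block above, versus two plain 3x3 convs at width 256 (a sketch, biases ignored):

```python
# ResNet bottleneck params: 1x1 reduce + 3x3 + 1x1 restore.
bottleneck = 1*1*256*64 + 3*3*64*64 + 1*1*64*256  # 16,384 + 36,864 + 16,384
plain = 2 * (3*3*256*256)                          # two plain 3x3 at width 256
print(bottleneck, plain)  # bottleneck is ~17x cheaper
```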
20
Depthwise separable conv: accuracy vs efficiency?
🔥 Hard
Answer: Drastically reduces FLOPs/params (≈1/9 for 3x3). Slight accuracy drop if naive; modern versions (MobileNetV2/V3) with inverted residuals and linear bottlenecks close the gap.
Pro: efficient, mobile-friendly
Con: may underfit if made too thin
CNN & Convolutional Layers – Interview Cheat Sheet
Conv Arithmetic
- O = ⌊(I - K + 2P)/S⌋ + 1
- Params = (K²·C_in)·C_out + C_out
- RF: RF_l = RF_{l-1} + (K_l - 1)·∏ stride_i
Filter types
- 1x1: channel mixing, bottleneck
- Depthwise: spatial, per channel
- Dilated: larger RF, no extra params
- Transposed: upsampling
Pooling
- Max: sharp features
- Average: smooth
- Global: replaces FC
Modern concepts
- Bottleneck: 1x1 → 3x3 → 1x1
- Grouped: channel groups
- Deformable: learnable offsets
- Strided conv: replaces pooling
Efficiency
- Depthwise sep: k²C_in + C_in·C_out
Verdict: "Conv layers are local, parameter-efficient, and learn hierarchies – know your arithmetic and modern variants!"
20 CNN Q/A covered