Why convolution?
A dense layer on a 224×224 RGB image needs 224·224·3 ≈ 150,000 weights per neuron — over 150 million for a 1,000-unit layer. A conv2d layer instead slides small kernels (e.g. 3×3) over the input—the same filter bank applied everywhere, drastically reducing parameters and encoding prior knowledge about local structure.
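A quick sketch of that parameter gap (the 1,000-unit dense layer and 64-filter conv are illustrative sizes, not from any particular model):

```python
import torch.nn as nn

# Dense: every output unit sees every input pixel.
dense = nn.Linear(224 * 224 * 3, 1000)
# Conv: 64 filters of shape 3x3x3, shared across all spatial positions.
conv = nn.Conv2d(3, 64, kernel_size=3, padding=1)

dense_params = sum(p.numel() for p in dense.parameters())
conv_params = sum(p.numel() for p in conv.parameters())
print(dense_params)  # 150529000 (150528*1000 weights + 1000 biases)
print(conv_params)   # 1792 (64*3*3*3 weights + 64 biases)
```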
Output height/width
For one spatial dimension: out = floor((in + 2·pad − dilation·(k−1) − 1) / stride + 1). Default dilation=1.
def conv_out(in_sz, k, stride=1, pad=0, dilation=1):
    return (in_sz + 2 * pad - dilation * (k - 1) - 1) // stride + 1
# Example: 32 in, kernel 3, padding 1, stride 1 → 32
print(conv_out(32, 3, 1, 1))
# stride 2 same pad → 16
print(conv_out(32, 3, 2, 1))
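The dilation term matters once you move past the default: a 3×3 kernel with dilation 2 spans 5 input pixels per axis, so it needs padding 2 (not 1) to preserve size at stride 1. A sketch, with conv_out repeated so the snippet stands alone:

```python
def conv_out(in_sz, k, stride=1, pad=0, dilation=1):
    return (in_sz + 2 * pad - dilation * (k - 1) - 1) // stride + 1

# dilation=2 spreads the 3x3 taps over a 5-pixel span per axis
print(conv_out(32, 3, pad=2, dilation=2))  # 32 — size preserved
print(conv_out(32, 3, pad=1, dilation=2))  # 30 — pad 1 is no longer enough
```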
nn.Conv2d examples
import torch
import torch.nn as nn
x = torch.randn(4, 3, 64, 64) # N, C, H, W
conv1 = nn.Conv2d(3, 16, kernel_size=3, padding=1, bias=False)
y1 = conv1(x)
print(y1.shape) # [4, 16, 64, 64]
conv2 = nn.Conv2d(3, 32, kernel_size=5, stride=2, padding=2)
y2 = conv2(x)
print(y2.shape) # [4, 32, 32, 32]
conv3 = nn.Conv2d(3, 8, kernel_size=1) # pointwise mixing of channels
y3 = conv3(x)
print(y3.shape) # [4, 8, 64, 64] — 1×1 conv preserves H,W
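As a sanity check, the shapes printed above agree with the output-size formula from earlier — a self-contained cross-check on the strided case:

```python
import torch
import torch.nn as nn

def conv_out(in_sz, k, stride=1, pad=0, dilation=1):
    return (in_sz + 2 * pad - dilation * (k - 1) - 1) // stride + 1

x = torch.randn(4, 3, 64, 64)
conv = nn.Conv2d(3, 32, kernel_size=5, stride=2, padding=2)
y = conv(x)
print(y.shape)                           # torch.Size([4, 32, 32, 32])
print(conv_out(64, 5, stride=2, pad=2))  # 32 — matches y.shape[-1]
```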
Pooling and activation
pool = nn.MaxPool2d(2, stride=2)
avg = nn.AvgPool2d(2, stride=2)
act = nn.ReLU(inplace=True)
h = act(conv1(x))
p = pool(h)
print(p.shape) # [4, 16, 32, 32]
Global average pooling (GAP)
gap = nn.AdaptiveAvgPool2d(1)
v = gap(p).flatten(1)
print(v.shape) # [4, 16] — one value per channel
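Because AdaptiveAvgPool2d takes a target output size rather than a kernel size, the same head handles any input resolution — a sketch with a few arbitrary sizes:

```python
import torch
import torch.nn as nn

gap = nn.AdaptiveAvgPool2d(1)
for hw in (32, 57, 224):  # arbitrary sizes, including a non-power-of-two
    feat = torch.randn(4, 16, hw, hw)
    print(gap(feat).flatten(1).shape)  # torch.Size([4, 16]) regardless of hw
```

This is why GAP-headed networks need no fixed input size, unlike a flatten + linear head whose in_features bakes in one spatial resolution.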
Tiny CNN for CIFAR-scale input
class TinyCNN(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        self.f = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(2),
        )
        self.fc = nn.Linear(64 * 8 * 8, num_classes)

    def forward(self, x):
        x = self.f(x)
        x = x.flatten(1)
        return self.fc(x)
m = TinyCNN()
out = m(torch.randn(2, 3, 32, 32))
print(out.shape) # [2, 10]
After two 2×2 pools on 32×32 → 8×8 spatial size for the linear layer.
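That 8×8 follows from the pooling arithmetic alone, since the padded 3×3 convs at stride 1 preserve H and W:

```python
hw = 32
for _ in range(2):  # two conv -> ReLU -> 2x2 max-pool stages
    hw //= 2        # each pool halves the spatial size; convs (k=3, pad=1) preserve it
print(hw, 64 * hw * hw)  # 8 4096 -> hence nn.Linear(64 * 8 * 8, num_classes)
```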
BatchNorm and Dropout (typical use)
block = nn.Sequential(
    nn.Conv2d(32, 64, 3, padding=1, bias=False),  # bias is redundant: BatchNorm's shift absorbs it
    nn.BatchNorm2d(64),
    nn.ReLU(inplace=True),
    nn.Dropout2d(0.1),
)
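Both BatchNorm and Dropout change behavior between training and inference, so remember to toggle modes — a sketch showing that eval mode is deterministic:

```python
import torch
import torch.nn as nn

block = nn.Sequential(
    nn.Conv2d(32, 64, 3, padding=1, bias=False),
    nn.BatchNorm2d(64),
    nn.ReLU(inplace=True),
    nn.Dropout2d(0.1),
)
x = torch.randn(4, 32, 16, 16)

block.train()  # BN uses batch stats; Dropout2d zeroes whole channels
y_train = block(x)

block.eval()   # BN uses running stats; Dropout becomes the identity
with torch.no_grad():
    y1 = block(x)
    y2 = block(x)
print(torch.equal(y1, y2))  # True — no randomness in eval mode
```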
Takeaways
- Track tensors as N×C×H×W in PyTorch conv nets.
- Padding k // 2 preserves H and W at stride 1 for odd kernels (pad 1 for kernel 3, pad 2 for kernel 5).
- Stack conv → nonlinearity → pool; end with GAP or flatten + linear.