Computer Vision Chapter 29

CNNs for vision

Convolutional neural networks (CNNs) build hierarchical representations of images: early layers respond to edges and textures; deeper layers encode parts and objects. Convolution shares weights across space (translation equivariance); pooling adds local translation tolerance and downsampling. This page uses PyTorch to show tensor shapes, multiple conv/pool configurations, a tiny classifier, and the output-size formula—with several runnable snippets.

Why convolution?

A dense layer on a 224×224 RGB image needs 224·224·3 ≈ 150,000 weights per neuron, so a full layer quickly reaches hundreds of millions of parameters. A conv2d layer instead slides small kernels (e.g. 3×3) over the input, applying the same filter bank everywhere, which drastically reduces parameters and encodes prior knowledge about local structure.
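To make the gap concrete, a quick back-of-envelope count (plain arithmetic, no framework required):

```python
# Fully connected: every output neuron sees every input value.
dense_weights_per_neuron = 224 * 224 * 3            # 150528
dense_layer_weights = dense_weights_per_neuron * 1000  # ~150M for 1000 neurons

# Convolution: one 3x3 kernel per (in_channel, out_channel) pair.
conv_weights = 3 * 3 * 3 * 64                       # 1728 for 64 filters on RGB

print(dense_weights_per_neuron)  # 150528
print(conv_weights)              # 1728
```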

Output height/width

For one spatial dimension: out = floor((in + 2·pad − dilation·(k−1) − 1) / stride + 1). Default dilation=1.

def conv_out(in_sz, k, stride=1, pad=0, dilation=1):
    return (in_sz + 2 * pad - dilation * (k - 1) - 1) // stride + 1

# Example: 32 in, kernel 3, padding 1, stride 1 → 32
print(conv_out(32, 3, 1, 1))
# stride 2 same pad → 16
print(conv_out(32, 3, 2, 1))
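The helper also handles dilation (repeated here so the snippet runs on its own); with a 3×3 kernel, dilation d needs pad d to preserve size at stride 1:

```python
def conv_out(in_sz, k, stride=1, pad=0, dilation=1):
    return (in_sz + 2 * pad - dilation * (k - 1) - 1) // stride + 1

# A dilated 3x3 kernel (dilation 2) spans a 5x5 window; pad 2 keeps H,W.
print(conv_out(32, 3, stride=1, pad=2, dilation=2))  # 32
# With only pad 1 the map shrinks.
print(conv_out(32, 3, stride=1, pad=1, dilation=2))  # 30
```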

nn.Conv2d examples

import torch
import torch.nn as nn

x = torch.randn(4, 3, 64, 64)  # N, C, H, W

conv1 = nn.Conv2d(3, 16, kernel_size=3, padding=1, bias=False)
y1 = conv1(x)
print(y1.shape)   # [4, 16, 64, 64]

conv2 = nn.Conv2d(3, 32, kernel_size=5, stride=2, padding=2)
y2 = conv2(x)
print(y2.shape)   # [4, 32, 32, 32]

conv3 = nn.Conv2d(3, 8, kernel_size=1)  # pointwise mixing of channels
y3 = conv3(x)
print(y3.shape)   # [4, 8, 64, 64] — 1×1 conv preserves H,W
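A quick cross-check of the shapes above against the output-size formula (helper repeated so the snippet is self-contained); note that Conv2d weights are laid out as [out_channels, in_channels, kH, kW]:

```python
import torch
import torch.nn as nn

def conv_out(in_sz, k, stride=1, pad=0, dilation=1):
    return (in_sz + 2 * pad - dilation * (k - 1) - 1) // stride + 1

x = torch.randn(4, 3, 64, 64)
conv2 = nn.Conv2d(3, 32, kernel_size=5, stride=2, padding=2)
y2 = conv2(x)

print(conv2.weight.shape)  # torch.Size([32, 3, 5, 5])
# Measured spatial size agrees with the formula.
assert y2.shape[-1] == conv_out(64, 5, stride=2, pad=2)  # 32
```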

Pooling and activation

pool = nn.MaxPool2d(2, stride=2)
avg = nn.AvgPool2d(2, stride=2)
act = nn.ReLU(inplace=True)

h = act(conv1(x))
p = pool(h)
print(p.shape)  # [4, 16, 32, 32]
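On a tiny hand-built tensor the difference between max and average pooling is easy to see:

```python
import torch
import torch.nn as nn

t = torch.tensor([[[[1., 2.],
                    [3., 4.]]]])   # N=1, C=1, H=2, W=2

print(nn.MaxPool2d(2)(t))  # keeps the largest value in each window: 4.0
print(nn.AvgPool2d(2)(t))  # averages the window: 2.5
```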

Global average pooling (GAP)

gap = nn.AdaptiveAvgPool2d(1)
v = gap(p).flatten(1)
print(v.shape)  # [4, 16] — one value per channel
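GAP is nothing more than the per-channel spatial mean, which a direct .mean over H and W confirms:

```python
import torch
import torch.nn as nn

p = torch.randn(4, 16, 32, 32)
gap = nn.AdaptiveAvgPool2d(1)
v = gap(p).flatten(1)          # [4, 16]

# Same result as averaging over the spatial dimensions.
assert torch.allclose(v, p.mean(dim=(2, 3)))
```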

Tiny CNN for CIFAR-scale input

class TinyCNN(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        self.f = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(2),
        )
        self.fc = nn.Linear(64 * 8 * 8, num_classes)

    def forward(self, x):
        x = self.f(x)
        x = x.flatten(1)
        return self.fc(x)

m = TinyCNN()
out = m(torch.randn(2, 3, 32, 32))
print(out.shape)  # [2, 10]

Two 2×2 pools halve 32×32 twice, down to 8×8 — which is where the 64 * 8 * 8 input size of the linear layer comes from.
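If you would rather not hand-compute 64 * 8 * 8, nn.LazyLinear (available in recent PyTorch versions) infers in_features on the first forward pass; a sketch of the same model:

```python
import torch
import torch.nn as nn

class TinyCNNLazy(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        self.f = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(inplace=True), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(inplace=True), nn.MaxPool2d(2),
        )
        self.fc = nn.LazyLinear(num_classes)  # in_features filled in at first call

    def forward(self, x):
        return self.fc(self.f(x).flatten(1))

m = TinyCNNLazy()
out = m(torch.randn(2, 3, 32, 32))
print(out.shape)         # torch.Size([2, 10])
print(m.fc.in_features)  # 4096 == 64 * 8 * 8
```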

BatchNorm and Dropout (typical use)

block = nn.Sequential(
    nn.Conv2d(32, 64, 3, padding=1),
    nn.BatchNorm2d(64),
    nn.ReLU(inplace=True),
    nn.Dropout2d(0.1),
)
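Both layers change behavior between training and evaluation: BatchNorm switches from batch statistics to running statistics, and Dropout2d becomes the identity. A small demonstration:

```python
import torch
import torch.nn as nn

bn = nn.BatchNorm2d(64)
drop = nn.Dropout2d(0.5)
x = torch.randn(8, 64, 16, 16)

bn.train()
y = bn(x)  # normalizes each channel with the batch's own statistics
print(y.mean(dim=(0, 2, 3)).abs().max() < 1e-4)  # per-channel means ~0

bn.eval()    # now uses the accumulated running statistics
drop.eval()  # dropout does nothing at inference time
assert torch.equal(drop(x), x)
```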

Takeaways

  • Track tensors as N×C×H×W in PyTorch conv nets.
  • Padding preserves size with stride 1 and kernel 3 (pad 1).
  • Stack conv → nonlinearity → pool; end with GAP or flatten + linear.

Quick FAQ

Q: Is a 1×1 conv just a linear layer? Per pixel, yes: it applies the same linear mix of channels at every spatial location, so it changes channel count without touching H and W. Global classification heads are usually flatten + FC, or GAP followed by FC.

Q: How does one output neuron come to "see" more of the image? Deeper stacks and dilated convolutions grow the receptive field, the input region that influences a single output; large receptive fields supply the context needed for segmentation.
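That growth can be tracked layer by layer with a small helper (receptive_field is an illustrative name, implementing the standard recurrence rf += dilation·(k−1)·jump, jump *= stride):

```python
def receptive_field(layers):
    """layers: list of (kernel, stride, dilation) tuples, input to output."""
    rf, jump = 1, 1
    for k, s, d in layers:
        rf += d * (k - 1) * jump  # extent this layer adds, in input pixels
        jump *= s                 # spacing between outputs, in input pixels
    return rf

# Three stride-1 3x3 convs: receptive field grows 3 -> 5 -> 7.
print(receptive_field([(3, 1, 1)] * 3))                    # 7
# Same depth with dilations 1, 2, 4 grows much faster.
print(receptive_field([(3, 1, 1), (3, 1, 2), (3, 1, 4)]))  # 15
```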