
Convolutional Neural Networks

Images are grids of pixels with strong local structure: edges combine into textures, textures into parts, parts into objects. A convolutional layer slides small learnable filters (kernels) over the input, producing feature maps that respond to local patterns. Parameter sharing—the same filter applied at every spatial location—cuts parameters versus fully connected layers and encodes translation equivariance (shift the input, the activation map shifts). Pooling downsamples spatial resolution and adds local translation tolerance.
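To make the parameter savings concrete, here is a back-of-the-envelope count for illustrative sizes not taken from the text (a 32×32 RGB input mapped to 64 output maps):

```python
# Illustrative sizes: 32x32 RGB input, 64 output feature maps, 3x3 kernels.
H = W = 32
in_ch, out_ch, k = 3, 64, 3

# Convolutional layer: each of the 64 filters has in_ch * k * k weights
# plus 1 bias, shared across every spatial location.
conv_params = out_ch * (in_ch * k * k + 1)

# Fully connected layer producing the same number of outputs (64 maps of
# 32x32) from the flattened input: one weight per input-output pair.
fc_params = (in_ch * H * W) * (out_ch * H * W) + out_ch * H * W

print(conv_params)  # 1792
print(fc_params)    # 201392128 (~200 million)
```

At these sizes the fully connected layer needs roughly 100,000× more parameters, which is the efficiency argument behind weight sharing.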


Convolution in One Minute

At each output location, the kernel’s weights multiply a patch of the input (across input channels) and sum into one value. Stride controls how far the window steps; larger stride shrinks output size. Padding (often “same” padding) preserves spatial size when desired. Stacking conv layers grows the receptive field—how much of the original image influences a deep pixel—so later layers see context.
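The size rule and receptive-field growth can be sketched in a few lines (sizes here are illustrative):

```python
def conv_out_size(n, k, s=1, p=0):
    """Spatial output size of a convolution: floor((n + 2p - k) / s) + 1."""
    return (n + 2 * p - k) // s + 1

# 32x32 input, 3x3 kernel:
print(conv_out_size(32, 3))            # 30: no padding shrinks the map
print(conv_out_size(32, 3, p=1))       # 32: "same" padding, stride 1
print(conv_out_size(32, 3, s=2, p=1))  # 16: stride 2 halves the size

# Receptive field of stacked 3x3, stride-1 convs grows by k - 1 per layer:
rf = 1
for _ in range(3):  # three stacked 3x3 convs
    rf += 3 - 1
print(rf)           # 7: a unit three layers deep "sees" a 7x7 input patch
```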

Output channels equal the number of learned filters; each filter specializes (e.g. vertical edges, color blobs). Deep CNNs interleave Conv → BN → ReLU blocks, sometimes with residual skips.
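As an illustration of one filter specializing, here is a naive single-channel convolution with a hand-written vertical-edge kernel (a Prewitt-style filter, chosen for this sketch); it responds wherever its window straddles a dark-to-bright edge:

```python
def conv2d_valid(img, kernel):
    """Naive single-channel 2D cross-correlation, no padding, stride 1."""
    H, W = len(img), len(img[0])
    k = len(kernel)
    out = []
    for i in range(H - k + 1):
        row = []
        for j in range(W - k + 1):
            row.append(sum(img[i + a][j + b] * kernel[a][b]
                           for a in range(k) for b in range(k)))
        out.append(row)
    return out

# A vertical step edge: dark (0) on the left, bright (1) on the right.
img = [[0, 0, 1, 1]] * 4

# Vertical-edge filter: negative on the left column, positive on the right.
kernel = [[-1, 0, 1],
          [-1, 0, 1],
          [-1, 0, 1]]

print(conv2d_valid(img, kernel))  # [[3, 3], [3, 3]]: every window spans the edge
```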

Pooling

Max pooling takes the maximum over each k×k window; the common 2×2 window with stride 2 halves height and width. It builds a degree of local invariance (small shifts inside a window do not change the output). Average pooling smooths; global average pooling at the end of many classifiers replaces large fully connected layers.
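A minimal pure-Python sketch of 2×2, stride-2 max pooling, including the local-invariance property (the feature map values are made up for illustration):

```python
def max_pool_2x2(x):
    """2x2 max pooling with stride 2: halves height and width."""
    return [[max(x[i][j], x[i][j + 1], x[i + 1][j], x[i + 1][j + 1])
             for j in range(0, len(x[0]), 2)]
            for i in range(0, len(x), 2)]

fmap = [[1, 3, 2, 0],
        [4, 2, 0, 1],
        [5, 1, 9, 2],
        [0, 2, 3, 8]]
print(max_pool_2x2(fmap))  # [[4, 2], [5, 9]]

# Small shifts inside a window do not change the output:
fmap[2][2], fmap[3][3] = 2, 9  # move the 9 within the same 2x2 window
print(max_pool_2x2(fmap))      # still [[4, 2], [5, 9]]
```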

Modern designs sometimes use strided convolutions instead of pooling for downsampling—fewer hand-picked operations, learned downsampling.
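Both routes halve the spatial size; a quick shape check (channel counts here are illustrative):

```python
import torch
import torch.nn as nn

x = torch.randn(1, 64, 32, 32)  # batch of one 64-channel feature map

# Fixed pooling vs. a learned strided convolution for downsampling:
pooled = nn.MaxPool2d(kernel_size=2, stride=2)(x)
strided = nn.Conv2d(64, 64, kernel_size=3, stride=2, padding=1)(x)

print(pooled.shape)   # torch.Size([1, 64, 16, 16])
print(strided.shape)  # torch.Size([1, 64, 16, 16])
```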

PyTorch: Conv2d

Simple block
import torch
import torch.nn as nn

# Conv -> BN -> ReLU -> pool: a standard downsampling block.
block = nn.Sequential(
    nn.Conv2d(in_channels=3, out_channels=64, kernel_size=3, padding=1),  # "same" padding
    nn.BatchNorm2d(64),
    nn.ReLU(inplace=True),
    nn.MaxPool2d(kernel_size=2, stride=2),  # halves H and W
)

x = torch.randn(1, 3, 32, 32)  # e.g. a batch of one 32x32 RGB image
print(block(x).shape)          # torch.Size([1, 64, 16, 16])

Summary

  • CNNs use local, shared filters—efficient and suited to images, video frames, and spectrograms.
  • Depth increases receptive field; pooling or strided conv reduces resolution.
  • Classic ideas (VGG-style stacks, ResNet skips, depthwise separable convs) trade accuracy vs compute.
  • Next: RNNs for sequences in time or text.

For ordered data—speech, language, sensor streams—recurrent models maintain a hidden state across steps.