
Convolutional Neural Networks

Images are grids of pixels with strong local structure: edges combine into textures, textures into parts, parts into objects. A convolutional layer slides small learnable filters (kernels) over the input, producing feature maps that respond to local patterns. Parameter sharing—the same filter applied at every spatial location—cuts parameters versus fully connected layers and encodes translation equivariance (shift the input, the activation map shifts). Pooling downsamples spatial resolution and adds local translation tolerance.
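To make the parameter savings concrete, here is a back-of-the-envelope count for illustrative sizes not taken from the text (a 32×32 RGB input mapped to 64 output maps):

```python
# Illustrative sizes: 32x32 RGB input, 64 output feature maps, 3x3 kernels.
H = W = 32
in_ch, out_ch, k = 3, 64, 3

# Convolutional layer: each of the 64 filters has in_ch * k * k weights
# plus 1 bias, shared across every spatial location.
conv_params = out_ch * (in_ch * k * k + 1)

# Fully connected layer producing the same number of outputs (64 maps of
# 32x32) from the flattened input: one weight per input-output pair.
fc_params = (in_ch * H * W) * (out_ch * H * W) + out_ch * H * W

print(conv_params)  # 1792
print(fc_params)    # 201392128 (~200 million)
```

At these sizes the fully connected layer needs roughly 100,000× more parameters, which is the efficiency argument behind weight sharing.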


Convolution in One Minute

At each output location, the kernel’s weights multiply a patch of the input (across input channels) and sum into one value. Stride controls how far the window steps; larger stride shrinks output size. Padding (often “same” padding) preserves spatial size when desired. Stacking conv layers grows the receptive field—how much of the original image influences a deep pixel—so later layers see context.
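The size rule and receptive-field growth can be sketched in a few lines (sizes here are illustrative):

```python
def conv_out_size(n, k, s=1, p=0):
    """Spatial output size of a convolution: floor((n + 2p - k) / s) + 1."""
    return (n + 2 * p - k) // s + 1

# 32x32 input, 3x3 kernel:
print(conv_out_size(32, 3))            # 30: no padding shrinks the map
print(conv_out_size(32, 3, p=1))       # 32: "same" padding, stride 1
print(conv_out_size(32, 3, s=2, p=1))  # 16: stride 2 halves the size

# Receptive field of stacked 3x3, stride-1 convs grows by k - 1 per layer:
rf = 1
for _ in range(3):  # three stacked 3x3 convs
    rf += 3 - 1
print(rf)           # 7: a unit three layers deep "sees" a 7x7 input patch
```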

Output channels equal the number of learned filters; each filter specializes (e.g. vertical edges, color blobs). Deep CNNs interleave Conv → BN → ReLU blocks, sometimes with residual skips.
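As an illustration of one filter specializing, here is a naive single-channel convolution with a hand-written vertical-edge kernel (a Prewitt-style filter, chosen for this sketch); it responds wherever its window straddles a dark-to-bright edge:

```python
def conv2d_valid(img, kernel):
    """Naive single-channel 2D cross-correlation, no padding, stride 1."""
    H, W = len(img), len(img[0])
    k = len(kernel)
    out = []
    for i in range(H - k + 1):
        row = []
        for j in range(W - k + 1):
            row.append(sum(img[i + a][j + b] * kernel[a][b]
                           for a in range(k) for b in range(k)))
        out.append(row)
    return out

# A vertical step edge: dark (0) on the left, bright (1) on the right.
img = [[0, 0, 1, 1]] * 4

# Vertical-edge filter: negative on the left column, positive on the right.
kernel = [[-1, 0, 1],
          [-1, 0, 1],
          [-1, 0, 1]]

print(conv2d_valid(img, kernel))  # [[3, 3], [3, 3]]: every window spans the edge
```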

Pooling

Max pooling takes the maximum over each k×k window; the common 2×2 window with stride 2 halves height and width. It builds a degree of local invariance (small shifts inside a window do not change the output). Average pooling smooths; global average pooling at the end of many classifiers replaces large fully connected layers.
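A minimal pure-Python sketch of 2×2, stride-2 max pooling, including the local-invariance property (the feature map values are made up for illustration):

```python
def max_pool_2x2(x):
    """2x2 max pooling with stride 2: halves height and width."""
    return [[max(x[i][j], x[i][j + 1], x[i + 1][j], x[i + 1][j + 1])
             for j in range(0, len(x[0]), 2)]
            for i in range(0, len(x), 2)]

fmap = [[1, 3, 2, 0],
        [4, 2, 0, 1],
        [5, 1, 9, 2],
        [0, 2, 3, 8]]
print(max_pool_2x2(fmap))  # [[4, 2], [5, 9]]

# Small shifts inside a window do not change the output:
fmap[2][2], fmap[3][3] = 2, 9  # move the 9 within the same 2x2 window
print(max_pool_2x2(fmap))      # still [[4, 2], [5, 9]]
```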

Modern designs sometimes use strided convolutions instead of pooling for downsampling—fewer hand-picked operations, learned downsampling.
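Both routes halve the spatial size; a quick shape check (channel counts here are illustrative):

```python
import torch
import torch.nn as nn

x = torch.randn(1, 64, 32, 32)  # batch of one 64-channel feature map

# Fixed pooling vs. a learned strided convolution for downsampling:
pooled = nn.MaxPool2d(kernel_size=2, stride=2)(x)
strided = nn.Conv2d(64, 64, kernel_size=3, stride=2, padding=1)(x)

print(pooled.shape)   # torch.Size([1, 64, 16, 16])
print(strided.shape)  # torch.Size([1, 64, 16, 16])
```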

PyTorch: Conv2d

Simple block
import torch
import torch.nn as nn

# Conv -> BN -> ReLU -> pool: a standard downsampling block.
block = nn.Sequential(
    nn.Conv2d(in_channels=3, out_channels=64, kernel_size=3, padding=1),  # "same" padding
    nn.BatchNorm2d(64),
    nn.ReLU(inplace=True),
    nn.MaxPool2d(kernel_size=2, stride=2),  # halves H and W
)

x = torch.randn(1, 3, 32, 32)  # e.g. a batch of one 32x32 RGB image
print(block(x).shape)          # torch.Size([1, 64, 16, 16])

Summary

  • CNNs use local, shared filters—efficient and suited to images, video frames, and spectrograms.
  • Depth increases receptive field; pooling or strided conv reduces resolution.
  • Classic ideas (VGG-style stacks, ResNet skips, depthwise separable convs) trade accuracy vs compute.
  • Next: RNNs for sequences in time or text.

For ordered data—speech, language, sensor streams—recurrent models maintain a hidden state across steps.