CNN · Computer Vision · SOTA Architectures · Feature Extraction

Convolutional Neural Networks: The Vision Architecture

CNNs revolutionized computer vision by learning hierarchical feature representations. From edge detection to semantic understanding — complete guide covering convolution math, layer design, modern architectures, and implementation.

Conv2D

Kernels, stride, padding

Pooling

Downsampling

Residual

Skip connections

EfficientNet

Compound scaling

What is a Convolutional Neural Network?

A Convolutional Neural Network (CNN) is a specialized neural architecture designed for processing grid-structured data such as images. Instead of fully connected layers, CNNs use convolutional layers that learn spatial hierarchies of patterns — from edges and textures to object parts and complete objects.

Input Image (H×W×C) → [Conv2D + ReLU] → Pooling → ... → Flatten → FC → Output

CNNs automatically learn spatial feature hierarchies via backpropagation.

Convolution Arithmetic: Kernels, Stride, Padding

2D Convolution

(I * K)(i,j) = Σₘ Σₙ I(i+m, j+n) · K(m,n)

Input I, kernel/filter K. Output size: H_out = ⌊(H − Kₕ + 2P)/S⌋ + 1, W_out = ⌊(W − K_w + 2P)/S⌋ + 1

Kernel: 3x3, 5x5, 7x7 · Stride (S): 1 or 2 · Padding (P): same/valid
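The output-size formula can be checked in a few lines (a minimal sketch; `conv_out_size` is an illustrative helper name):

```python
def conv_out_size(n, k, p=0, s=1):
    """Spatial output size along one dimension: (n - k + 2p) // s + 1."""
    return (n - k + 2 * p) // s + 1

# 32x32 input, 3x3 kernel, 'same'-style padding p=1, stride 1 -> stays 32
print(conv_out_size(32, 3, p=1, s=1))  # 32
# same setup with stride 2 halves the feature map -> 16
print(conv_out_size(32, 3, p=1, s=2))  # 16
# 'valid' (p=0) with a 5x5 kernel shrinks the map -> 28
print(conv_out_size(32, 5))  # 28
```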

Receptive Field

Each neuron in deeper layers sees a larger region of the input. Stacking 3x3 convs: 3 layers → receptive field 7x7.

RFₗ = RFₗ₋₁ + (kₗ − 1) · ∏_{i=1}^{l−1} sᵢ
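That recurrence is easy to tabulate layer by layer; here is a short sketch (the function name is illustrative):

```python
def receptive_field(layers):
    """Receptive field after a stack of (kernel_size, stride) layers.

    Applies RF_l = RF_{l-1} + (k_l - 1) * prod(s_1..s_{l-1}) iteratively.
    """
    rf, jump = 1, 1  # jump = product of strides of all earlier layers
    for k, s in layers:
        rf += (k - 1) * jump
        jump *= s
    return rf

print(receptive_field([(3, 1), (3, 1), (3, 1)]))  # three 3x3 convs -> 7
print(receptive_field([(7, 2), (3, 2), (3, 2)]))  # strided layers grow RF fast
```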

Conv2D from scratch (strictly cross-correlation: the kernel is not flipped, matching deep learning convention)
import numpy as np

def conv2d(image, kernel, stride=1, padding=0):
    """Single-channel 2D convolution. image: (H, W), kernel: (kH, kW)."""
    H, W = image.shape
    kH, kW = kernel.shape
    # Output size follows (N - K + 2P) // S + 1 per spatial dimension
    out_h = (H - kH + 2*padding) // stride + 1
    out_w = (W - kW + 2*padding) // stride + 1

    if padding > 0:
        image = np.pad(image, pad_width=padding, mode='constant')

    output = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            h_start = i * stride
            w_start = j * stride
            # Elementwise product of the window with the kernel, then sum
            output[i, j] = np.sum(image[h_start:h_start+kH, w_start:w_start+kW] * kernel)
    return output
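The nested loops are slow for real images; NumPy's `sliding_window_view` gives an equivalent vectorized version (a sketch for a single channel, no padding; like the loop version, this is cross-correlation):

```python
import numpy as np

def conv2d_vectorized(image, kernel, stride=1):
    # Extract every kH x kW window, then contract each window with the kernel.
    windows = np.lib.stride_tricks.sliding_window_view(image, kernel.shape)
    windows = windows[::stride, ::stride]
    return np.einsum('ijkl,kl->ij', windows, kernel)

image = np.arange(16, dtype=float).reshape(4, 4)
kernel = np.ones((2, 2))
print(conv2d_vectorized(image, kernel))  # 3x3 map of 2x2 window sums
```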

Pooling and Activation Functions

Max Pooling

Selects maximum value in window. Reduces spatial size, provides translation invariance.

2x2 window, stride 2 → halves each spatial dimension (75% fewer activations)

Average Pooling

Average of values in window. Smoother, but preserves edges less well. Global average pooling replaces FC layers in many modern architectures.

ReLU

max(0, x). Standard activation. Variants: LeakyReLU, ELU, GELU.

Insight: Pooling reduces spatial dimension, increases receptive field and provides local translation invariance. Many modern CNNs replace pooling with strided convolution.
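As a concrete sketch, max pooling over non-overlapping 2x2 windows can be written in a few NumPy lines (`max_pool2d` is an illustrative helper):

```python
import numpy as np

def max_pool2d(x, size=2, stride=2):
    """Max pooling over a single-channel feature map x of shape (H, W)."""
    out_h = (x.shape[0] - size) // stride + 1
    out_w = (x.shape[1] - size) // stride + 1
    out = np.empty((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # Keep only the strongest activation in each window
            out[i, j] = x[i*stride:i*stride+size, j*stride:j*stride+size].max()
    return out

x = np.array([[1., 3., 2., 0.],
              [4., 2., 1., 5.],
              [0., 1., 7., 2.],
              [3., 2., 4., 6.]])
print(max_pool2d(x))  # [[4. 5.] [3. 7.]]
```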

Feature Hierarchy: From Edges to Objects

Layer 1: Low-level features

Gabor-like filters: edges, corners, color blobs. Generally 3x3 or 5x5 kernels.

Layer 2: Mid-level features

Textures, patterns, part of objects (eyes, wheels).

Layer 3-4: High-level features

Object parts, semantic concepts (faces, animals, cars).

Fully Connected

Global reasoning, classification scores.

Feature visualization: Early layers → edges / Later layers → entire objects. Learned via backpropagation.

Iconic CNN Architectures (2012–2024)

LeNet-5 (1998)

Origin. 2 conv + pooling, 3 FC. Digit recognition.

AlexNet (2012)

Breakthrough. ReLU, Dropout, GPU training. 5 conv + 3 FC.

VGGNet (2014)

Simplicity: 3x3 conv, deeper (16-19 layers).

ResNet (2015)

Skip connections → train up to 152 layers. Residual learning. Solves degradation.

Output = F(x) + x

DenseNet (2017)

Concatenate all previous feature maps. Feature reuse.

EfficientNet (2019)

Compound scaling: depth, width, resolution. Neural Architecture Search.

Inception / GoogLeNet (2014)

Parallel conv 1x1, 3x3, 5x5, pooling → concatenate. Efficient.

MobileNet (2017)

Depthwise separable convolution. Lightweight for edge.

Evolution: Deeper → Residual connections → Automated architecture search. ResNet is the most influential.
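The ResNet identity Output = F(x) + x maps directly to code; below is a minimal PyTorch basic block (a sketch that keeps the channel count fixed and omits the downsampling projection used in full ResNets):

```python
import torch
import torch.nn as nn

class BasicBlock(nn.Module):
    """Residual block: out = ReLU(F(x) + x), with F = conv-BN-ReLU-conv-BN."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)

    def forward(self, x):
        out = torch.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return torch.relu(out + x)  # the skip connection

block = BasicBlock(64)
y = block(torch.randn(1, 64, 32, 32))
print(y.shape)  # torch.Size([1, 64, 32, 32])
```

Because the skip path is an identity, gradients always have a direct route back to earlier layers, which is what lets very deep stacks train.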

Transfer Learning with CNNs

In practice you rarely train a CNN from scratch; instead, use models pretrained on ImageNet (1.2M images).

Feature Extractor

Freeze conv base, train new classifier on top. Fast, small dataset.

Fine-tuning

Unfreeze some layers, train with lower LR. Adapt to domain.

PyTorch Transfer Learning (feature extraction)
import torch
import torchvision.models as models

# Load pretrained ResNet
resnet = models.resnet50(weights='IMAGENET1K_V2')

# Freeze parameters
for param in resnet.parameters():
    param.requires_grad = False

# Replace classifier head
num_features = resnet.fc.in_features
resnet.fc = torch.nn.Linear(num_features, num_classes)  # e.g., 10 classes

# Train only the new head
optimizer = torch.optim.Adam(resnet.fc.parameters(), lr=0.001)

CNN in TensorFlow & PyTorch

TensorFlow / Keras
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(32, 3, activation='relu', padding='same', input_shape=(32,32,3)),
    tf.keras.layers.MaxPool2D(2,2),
    tf.keras.layers.Conv2D(64, 3, activation='relu', padding='same'),
    tf.keras.layers.MaxPool2D(2,2),
    tf.keras.layers.Conv2D(128, 3, activation='relu', padding='same'),
    tf.keras.layers.GlobalAvgPool2D(),
    tf.keras.layers.Dense(10, activation='softmax')
])

model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])

PyTorch
import torch.nn as nn

class SimpleCNN(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2,2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2,2),
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(), nn.AdaptiveAvgPool2d(1)
        )
        self.classifier = nn.Linear(128, num_classes)
    def forward(self, x):
        x = self.features(x)
        x = x.view(x.size(0), -1)
        return self.classifier(x)

model = SimpleCNN()
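A single training step for a model like this follows the standard zero-grad / forward / backward / step loop; a self-contained sketch with a tiny stand-in network and random data:

```python
import torch
import torch.nn as nn

model = nn.Sequential(  # tiny stand-in CNN
    nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(8, 10))
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

x = torch.randn(4, 3, 32, 32)    # batch of 4 RGB 32x32 images
y = torch.randint(0, 10, (4,))   # random class labels

optimizer.zero_grad()
loss = criterion(model(x), y)    # forward pass
loss.backward()                  # backpropagate gradients
optimizer.step()                 # update weights
print(loss.item())
```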

CNN Training: Hyperparameters & Best Practices

🧠 Kernel size: Three stacked 3x3 convs cover the same 7x7 receptive field as a single 7x7 with fewer parameters and more non-linearity. Use 1x1 conv for channel reduction.
📏 Stride & Padding: 'same' padding preserves spatial size; 'valid' shrinks. Use stride=2 for downsampling.
✅ Data Augmentation: Random crop, flip, rotation, color jitter. Essential for generalization.
⚡ Batch Norm: After conv, before ReLU. Stabilizes training, allows higher LR.
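The BatchNorm placement above (conv → BN → ReLU) is commonly packaged as a small helper; a minimal sketch (`conv_bn_relu` is an illustrative name):

```python
import torch
import torch.nn as nn

def conv_bn_relu(in_ch, out_ch, stride=1):
    """3x3 conv -> BatchNorm -> ReLU; conv bias is redundant before BN."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, 3, stride=stride, padding=1, bias=False),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
    )

# stride=2 downsamples in the conv itself, replacing a pooling layer
block = conv_bn_relu(3, 32, stride=2)
out = block(torch.randn(1, 3, 32, 32))
print(out.shape)  # torch.Size([1, 32, 16, 16])
```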

CNNs Beyond Image Classification

Object Detection

YOLO, Faster R-CNN, SSD. CNNs for localization + classification.

Semantic Segmentation

U-Net, DeepLab, FCN. Pixel-wise classification.

Video Analysis

3D CNNs, I3D, C3D. Spatiotemporal filters.

Generative Models

DCGAN, StyleGAN. CNN generators/discriminators.

Medical Imaging

MRI, CT, histopathology. Pretrained CNNs as feature extractors.

Self-driving

Lane detection, obstacle recognition.

CNN Architectures & Use Cases – Cheatsheet

ResNet Default backbone
EfficientNet SOTA efficiency
MobileNet Edge/CPU
ViT Transformer rival
U-Net Segmentation
YOLO Real-time detection
3D CNN Video
Siamese Face verification

Architecture Comparison

Architecture    | Year | Depth | Key Innovation         | Top-1 ImageNet
AlexNet         | 2012 | 8     | ReLU, Dropout, GPU     | ~57%
VGG-16          | 2014 | 16    | 3x3 only               | ~71%
ResNet-50       | 2015 | 50    | Skip connections       | ~76%
DenseNet-121    | 2017 | 121   | Dense blocks           | ~75%
EfficientNet-B7 | 2019 | ~80   | Compound scaling, NAS  | ~84%

CNN Pitfalls & Debugging

⚠️ Overfitting: Use augmentation, dropout, weight decay. Small dataset → transfer learning.
⚠️ Vanishing gradients: Use BatchNorm, residual connections, proper initialization.
✅ Debug shapes: Print the output size after each layer; mismatches usually surface at the classifier head.
✅ Monitor filters: Visualize kernels and feature maps; early-layer filters should look like smooth edge and color detectors.
Next Up: Recurrent Neural Networks & Transformers – Sequence modeling for NLP and time series.