CNN Basics for Vision

CNNs for vision

Why convolution?

A dense layer on a 224×224 RGB image needs 224 · 224 · 3 = 150,528 weights per neuron, so a hidden layer of about a thousand units already has over 150 million weights. A conv2d layer instead slides small kernels (e.g. 3×3) over the input: the same filter bank is applied at every position, which drastically reduces the parameter count and encodes prior knowledge about local structure.
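
To make the parameter savings concrete, the short sketch below counts weights for one dense neuron and for a whole 3×3 conv layer. The choice of 64 output filters is an illustrative assumption, not something fixed by the text.

```python
# Weights needed by ONE dense neuron that sees the whole 224x224 RGB image.
height, width, channels = 224, 224, 3
dense_weights_per_neuron = height * width * channels

# Weights in a whole 3x3 conv layer; 64 filters is an assumed, illustrative width.
kernel, out_channels = 3, 64
conv_weights = kernel * kernel * channels * out_channels
conv_biases = out_channels

print(dense_weights_per_neuron)    # 150528
print(conv_weights + conv_biases)  # 1792
```

The conv layer's count does not depend on the image size at all, which is the point of weight sharing.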

Output height/width

For one spatial dimension: out = floor((in + 2·pad − dilation·(k−1) − 1) / stride + 1). Default dilation=1.

def conv_out(in_sz, k, stride=1, pad=0, dilation=1):
    return (in_sz + 2 * pad - dilation * (k - 1) - 1) // stride + 1

# Example: 32 in, kernel 3, padding 1, stride 1 → 32
print(conv_out(32, 3, 1, 1))
# stride 2 same pad → 16
print(conv_out(32, 3, 2, 1))
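
The formula also covers dilation, which the two calls above leave at its default. As a sketch, the hypothetical helper same_pad below (not part of PyTorch) picks the padding that keeps the size unchanged at stride 1 for odd kernels; conv_out is repeated so the block runs on its own.

```python
def conv_out(in_sz, k, stride=1, pad=0, dilation=1):
    # Same formula as above, repeated so this block is self-contained.
    return (in_sz + 2 * pad - dilation * (k - 1) - 1) // stride + 1

def same_pad(k, dilation=1):
    # Hypothetical helper: size-preserving padding for odd k at stride 1.
    return dilation * (k - 1) // 2

# A 3x3 kernel with dilation 2 spans 5 input positions, so it needs pad 2.
print(same_pad(3, dilation=2))                          # 2
print(conv_out(32, 3, pad=same_pad(3, 2), dilation=2))  # 32
# Without padding the same layer shrinks the map by 4.
print(conv_out(32, 3, pad=0, dilation=2))               # 28
```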

nn.Conv2d examples

import torch
import torch.nn as nn

x = torch.randn(4, 3, 64, 64)  # N, C, H, W

conv1 = nn.Conv2d(3, 16, kernel_size=3, padding=1, bias=False)
y1 = conv1(x)
print(y1.shape)   # [4, 16, 64, 64]

conv2 = nn.Conv2d(3, 32, kernel_size=5, stride=2, padding=2)
y2 = conv2(x)
print(y2.shape)   # [4, 32, 32, 32]

conv3 = nn.Conv2d(3, 8, kernel_size=1)  # pointwise mixing of channels
y3 = conv3(x)
print(y3.shape)   # [4, 8, 64, 64] — 1×1 conv preserves H,W

Pooling and activation

pool = nn.MaxPool2d(2, stride=2)
avg = nn.AvgPool2d(2, stride=2)
act = nn.ReLU(inplace=True)

h = act(conv1(x))
p = pool(h)
print(p.shape)  # [4, 16, 32, 32]

Global average pooling (GAP)

gap = nn.AdaptiveAvgPool2d(1)
v = gap(p).flatten(1)
print(v.shape)  # [4, 16] — one value per channel

Tiny CNN for CIFAR-scale input

class TinyCNN(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        self.f = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(2),
        )
        self.fc = nn.Linear(64 * 8 * 8, num_classes)

    def forward(self, x):
        x = self.f(x)
        x = x.flatten(1)
        return self.fc(x)

m = TinyCNN()
out = m(torch.randn(2, 3, 32, 32))
print(out.shape)  # [2, 10]

Two 2×2 max pools halve the 32×32 input twice, to 16×16 and then 8×8, so the linear layer receives 64 · 8 · 8 = 4096 features.
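
The same bookkeeping gives the Linear layer's in_features for other input sizes. The helper name flatten_features is hypothetical, and the sketch assumes size-preserving convs followed by 2×2 pools with stride 2, as in TinyCNN.

```python
def flatten_features(in_hw, out_channels, num_pools):
    # Each MaxPool2d(2) halves height and width (floor division);
    # the padded 3x3 convs in TinyCNN leave the spatial size unchanged.
    hw = in_hw
    for _ in range(num_pools):
        hw //= 2
    return out_channels * hw * hw

print(flatten_features(32, 64, 2))  # 4096 = 64 * 8 * 8, matches TinyCNN
print(flatten_features(28, 64, 2))  # 3136 = 64 * 7 * 7, e.g. for 28x28 input
```

If the input size is not fixed, ending with global average pooling instead of flatten avoids this dependency altogether.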

BatchNorm and Dropout (typical use)

block = nn.Sequential(
    nn.Conv2d(32, 64, 3, padding=1),
    nn.BatchNorm2d(64),
    nn.ReLU(inplace=True),
    nn.Dropout2d(0.1),
)

Takeaways

Track tensors as N×C×H×W in PyTorch conv nets.
Padding preserves size with stride 1 and kernel 3 (pad 1).
Stack conv → nonlinearity → pool; end with GAP or flatten + linear.

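Quick FAQ

Conv vs fully connected?

A 1×1 convolution is a per-pixel linear mix of channels, and a convolution whose kernel matches the input's spatial size computes the same thing as a fully connected layer. Global classification layers are usually FC or GAP + FC.

Receptive field?

The receptive field is the image region that influences one output neuron. Deeper stacks and dilated convolutions grow it, which matters for context in segmentation.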
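
How fast that region grows can be estimated with the standard recurrence r ← r + (k − 1) · dilation · jump, jump ← jump · stride. The sketch below applies it to a list of (kernel, stride, dilation) tuples; the function name and the example stacks are illustrative assumptions.

```python
def receptive_field(layers):
    # layers: iterable of (kernel, stride, dilation) applied in order.
    rf, jump = 1, 1  # jump = distance in input pixels between adjacent outputs
    for k, stride, dilation in layers:
        rf += (k - 1) * dilation * jump
        jump *= stride
    return rf

# Three plain 3x3 convs: each adds 2 pixels.
print(receptive_field([(3, 1, 1)] * 3))                    # 7
# Same depth with dilations 1, 2, 4: the field grows much faster.
print(receptive_field([(3, 1, 1), (3, 1, 2), (3, 1, 4)]))  # 15
# TinyCNN's feature stack: conv3, pool2, conv3, pool2.
print(receptive_field([(3, 1, 1), (2, 2, 1), (3, 1, 1), (2, 2, 1)]))  # 10
```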

AlexNet

Architecture (conceptual)

The input is traditionally 224×224 (after cropping). Five convolutional stages use ReLU and max pooling (the original paper used overlapping pooling in places), followed by three large fully connected layers (4096, 4096, 1000) with dropout. Local Response Normalization (LRN) appeared in the original paper; torchvision's implementation leaves it out, so check the version you use if LRN matters to you.
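
The spatial sizes through the conv stages can be traced with the usual output-size formula. The layer hyperparameters below are assumed to follow torchvision's AlexNet features stack (they differ slightly from the original paper), so verify them with print(model.features) for your version.

```python
def out_size(in_sz, k, stride=1, pad=0):
    # Standard conv/pool output-size formula (dilation 1).
    return (in_sz + 2 * pad - k) // stride + 1

# (name, kernel, stride, pad); assumed to mirror torchvision's AlexNet.
# Pools use kernel 3 with stride 2, i.e. overlapping windows.
ALEXNET_FEATURES = [
    ("conv1", 11, 4, 2), ("pool1", 3, 2, 0),
    ("conv2", 5, 1, 2),  ("pool2", 3, 2, 0),
    ("conv3", 3, 1, 1),  ("conv4", 3, 1, 1),
    ("conv5", 3, 1, 1),  ("pool5", 3, 2, 0),
]

def trace(size):
    # Return [(layer name, spatial size after that layer), ...].
    sizes = []
    for name, k, stride, pad in ALEXNET_FEATURES:
        size = out_size(size, k, stride, pad)
        sizes.append((name, size))
    return sizes

for name, size in trace(224):
    print(f"{name}: {size}x{size}")
# The last map is 6x6 with 256 channels, so the first FC layer sees 256*6*6 = 9216 inputs.
```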

ReLU

ReLU does not saturate for positive inputs, so deep nets of the time trained faster with it than with saturating tanh/sigmoid units.

Dropout

Dropout regularizes the huge fully connected layers, which hold most of the parameters, by reducing co-adaptation between units.
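
A rough count shows why the FC layers are the part that needs regularizing. The layer shapes are assumed to match torchvision's AlexNet (for example, 64 filters in the first conv), which is not identical to the original two-GPU network.

```python
def conv_params(c_in, c_out, k):
    return c_in * c_out * k * k + c_out  # weights + biases

def fc_params(n_in, n_out):
    return n_in * n_out + n_out          # weights + biases

# Shapes assumed from torchvision's AlexNet.
conv_total = sum(conv_params(*cfg) for cfg in [
    (3, 64, 11), (64, 192, 5), (192, 384, 3), (384, 256, 3), (256, 256, 3),
])
fc_total = sum(fc_params(*cfg) for cfg in [
    (256 * 6 * 6, 4096), (4096, 4096), (4096, 1000),
])

print(conv_total)                                   # 2469696
print(fc_total)                                     # 58631144
print(f"{fc_total / (conv_total + fc_total):.1%}")  # 96.0%
```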

Load pretrained weights

import torch
from torchvision.models import alexnet, AlexNet_Weights

weights = AlexNet_Weights.IMAGENET1K_V1
model = alexnet(weights=weights).eval()

preprocess = weights.transforms()
print(preprocess)

Random init (train from scratch)

model_scratch = alexnet(weights=None)

Single image → class logits

from PIL import Image

img = Image.open("cat.jpg").convert("RGB")
batch = preprocess(img).unsqueeze(0)

with torch.no_grad():
    logits = model(batch)
probs = logits.softmax(dim=1)
top5 = probs.topk(5, dim=1)

# Map indices to labels
categories = weights.meta["categories"]
for score, idx in zip(top5.values[0], top5.indices[0]):
    print(f"{categories[idx]}: {float(score):.4f}")

4096-D embedding (before classifier)

# torchvision alexnet: features → avgpool → classifier (fc layers)
with torch.no_grad():
    x = model.features(batch)
    x = model.avgpool(x)
    x = torch.flatten(x, 1)
    # Default torchvision alexnet: classifier[6] is Linear(4096, 1000)
    vec = model.classifier[:6](x)   # through second FC + ReLU → 4096-D
print(vec.shape)

Confirm with print(model.classifier); the slice indices change if the head was replaced for fine-tuning.

Alternative: hook after second ReLU

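activation = {}

def get(name):
    # Build a forward hook that stores the module's output under `name`.
    def hook(m, i, o):
        activation[name] = o.detach()
    return hook

# classifier[5] is the ReLU after the second 4096-unit Linear layer
h = model.classifier[5].register_forward_hook(get("fc4096_relu"))
with torch.no_grad():
    _ = model(batch)
h.remove()  # remove the hook so later forward passes are unaffected
feat = activation["fc4096_relu"]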

Mini-batch

from PIL import Image

paths = ["a.jpg", "b.jpg", "c.jpg"]
tensors = [preprocess(Image.open(p).convert("RGB")) for p in paths]
xb = torch.stack(tensors, dim=0)

with torch.no_grad():
    out = model(xb)
print(out.shape)  # [3, 1000]

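weights.transforms() bundles the preprocessing these weights were trained with: resize, center crop, conversion to a float tensor, and ImageNet normalization. It accepts PIL images or tensors; the exact behavior depends on your torchvision version.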
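
The normalization step is simple per-channel arithmetic. As a sketch, the snippet below applies the ImageNet mean and standard deviation commonly used by torchvision's classification presets to a single RGB pixel; the helper name is hypothetical, and the authoritative values for your weights are whatever print(preprocess) reports.

```python
# Per-channel ImageNet statistics used by torchvision's classification presets
# (assumed here; confirm against print(preprocess)).
MEAN = (0.485, 0.456, 0.406)
STD = (0.229, 0.224, 0.225)

def normalize_pixel(rgb):
    # rgb: 8-bit values in 0..255 -> scale to [0, 1], then standardize per channel.
    return tuple((v / 255.0 - m) / s for v, m, s in zip(rgb, MEAN, STD))

print(normalize_pixel((124, 116, 104)))  # close to (0, 0, 0): near the dataset mean
print(normalize_pixel((255, 255, 255)))  # roughly (2.25, 2.43, 2.64)
```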

Fine-tune last layer (sketch)

import torch.nn as nn

num_classes = 10
model_ft = alexnet(weights=weights)
model_ft.classifier[6] = nn.Linear(4096, num_classes)
# freeze earlier layers optionally, then train with your dataloader

Takeaways

AlexNet = deep conv stacks + large FC + ReLU/dropout; it was the ImageNet 2012 breakthrough.
Use AlexNet_Weights transforms for correct normalization.
For transfer learning, replace the final Linear(4096, 1000) with your class count.

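Quick FAQ

Still use AlexNet in production?

Rarely for accuracy: ResNet/EfficientNet/ViT families dominate. AlexNet remains useful for teaching and lightweight baselines on small data with heavy regularization.

Input not 224?

The pretrained FC layers expect a fixed flattened size from avgpool (256 · 6 · 6 = 9216). torchvision's adaptive avgpool produces that size for a range of resolutions, but inputs that are too small fail inside the conv stack, and the pretrained weights were tuned for 224 crops. Keep 224 or redesign the head.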

