CNNs for vision
Why convolution?
A dense layer on a 224×224 RGB image needs 224·224·3 ≈ 150,000 weights per neuron, so a layer with a few thousand neurons runs to hundreds of millions of weights. A conv2d layer uses small kernels (e.g. 3×3) slid over the input: the same filter bank is applied everywhere, which drastically reduces parameters and encodes prior knowledge about local structure.
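To make the parameter gap concrete, here is a minimal arithmetic sketch. The layer widths (4096 dense units, 64 filters) are illustrative assumptions, not values from the text.

```python
# Weight counts for one dense layer vs. one 3x3 conv layer on a 224x224 RGB input.
# The widths below (4096 units, 64 filters) are assumed for illustration only.
h, w, c_in = 224, 224, 3

dense_per_neuron = h * w * c_in          # every neuron sees every input value
dense_units = 4096                       # assumed layer width
dense_total = dense_per_neuron * dense_units

k, c_out = 3, 64                         # assumed kernel size and filter count
conv_total = k * k * c_in * c_out        # weights are shared across all positions

print(dense_per_neuron)  # 150528
print(dense_total)       # 616562688
print(conv_total)        # 1728
```

Biases are left out on both sides; they do not change the order of magnitude.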
Output height/width
For one spatial dimension: out = floor((in + 2·pad − dilation·(k−1) − 1) / stride + 1). Default dilation=1.
def conv_out(in_sz, k, stride=1, pad=0, dilation=1):
    return (in_sz + 2 * pad - dilation * (k - 1) - 1) // stride + 1
# Example: 32 in, kernel 3, padding 1, stride 1 → 32
print(conv_out(32, 3, 1, 1))
# stride 2 same pad → 16
print(conv_out(32, 3, 2, 1))
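The dilation term in the formula can be read as an "effective kernel size". The sketch below works that out and derives the padding that keeps the size unchanged at stride 1. The helper names effective_kernel and same_pad are hypothetical, introduced only for this example.

```python
def conv_out(in_sz, k, stride=1, pad=0, dilation=1):
    # Same formula as above, repeated so this block runs on its own.
    return (in_sz + 2 * pad - dilation * (k - 1) - 1) // stride + 1

def effective_kernel(k, dilation=1):
    # A dilated kernel spans dilation*(k-1)+1 input positions.
    return dilation * (k - 1) + 1

def same_pad(k, dilation=1):
    # Hypothetical helper: symmetric padding that preserves size at stride 1.
    # Only defined here for an odd effective kernel size.
    k_eff = effective_kernel(k, dilation)
    if k_eff % 2 == 0:
        raise ValueError("even effective kernel: symmetric padding cannot preserve size")
    return (k_eff - 1) // 2

print(effective_kernel(3, dilation=2))                # 5
print(same_pad(3, dilation=2))                        # 2
print(conv_out(32, 3, stride=1, pad=2, dilation=2))   # 32
```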
nn.Conv2d examples
import torch
import torch.nn as nn
x = torch.randn(4, 3, 64, 64) # N, C, H, W
conv1 = nn.Conv2d(3, 16, kernel_size=3, padding=1, bias=False)
y1 = conv1(x)
print(y1.shape) # [4, 16, 64, 64]
conv2 = nn.Conv2d(3, 32, kernel_size=5, stride=2, padding=2)
y2 = conv2(x)
print(y2.shape) # [4, 32, 32, 32]
conv3 = nn.Conv2d(3, 8, kernel_size=1) # pointwise mixing of channels
y3 = conv3(x)
print(y3.shape) # [4, 8, 64, 64] — 1×1 conv preserves H,W
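A 1×1 convolution is the same small linear map applied independently at every pixel. A minimal sketch in plain Python lists (not tensors) for a single pixel, with made-up weights:

```python
def pointwise(pixel, weight, bias):
    # pixel: C_in values; weight: C_out rows of C_in values; bias: C_out values.
    return [sum(w * v for w, v in zip(row, pixel)) + b
            for row, b in zip(weight, bias)]

pixel = [1.0, 2.0, 3.0]            # one RGB pixel (illustrative values)
weight = [[1.0, 0.0, 0.0],         # output channel 0 copies the first input channel
          [0.5, 0.5, 0.5]]         # output channel 1 mixes all three
bias = [0.0, 1.0]

print(pointwise(pixel, weight, bias))  # [1.0, 4.0]
```

Because no neighboring pixels are involved, H and W cannot change; only the channel count does.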
Pooling and activation
pool = nn.MaxPool2d(2, stride=2)
avg = nn.AvgPool2d(2, stride=2)
act = nn.ReLU(inplace=True)
h = act(conv1(x))
p = pool(h)
print(p.shape) # [4, 16, 32, 32]
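What a 2×2 max pool with stride 2 does to a single channel can be sketched in plain Python. This is an illustration on a nested list with even height and width, not how PyTorch implements it.

```python
def max_pool_2x2(grid):
    # grid: H x W nested list with even H and W; window 2, stride 2.
    return [
        [max(grid[i][j], grid[i][j + 1], grid[i + 1][j], grid[i + 1][j + 1])
         for j in range(0, len(grid[0]), 2)]
        for i in range(0, len(grid), 2)
    ]

grid = [
    [1, 3, 2, 0],
    [4, 2, 1, 1],
    [0, 0, 5, 6],
    [1, 2, 7, 8],
]
print(max_pool_2x2(grid))  # [[4, 2], [2, 8]]
```

Each non-overlapping 2×2 window collapses to its largest value, which is why H and W halve while the channel count stays the same.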
Global average pooling (GAP)
gap = nn.AdaptiveAvgPool2d(1)
v = gap(p).flatten(1)
print(v.shape) # [4, 16] — one value per channel
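GAP reduces each channel to its mean, so the output length depends only on the channel count, not on H or W. A minimal sketch on nested lists with made-up values:

```python
def global_avg_pool(fmap):
    # fmap: C x H x W nested list -> list of C channel means.
    out = []
    for channel in fmap:
        values = [v for row in channel for v in row]
        out.append(sum(values) / len(values))
    return out

small = [[[1.0, 3.0], [5.0, 7.0]], [[2.0, 2.0], [2.0, 2.0]]]   # 2 channels, 2x2
wide = [[[1.0, 3.0, 5.0, 7.0]], [[2.0, 2.0, 2.0, 2.0]]]         # 2 channels, 1x4

print(global_avg_pool(small))  # [4.0, 2.0]
print(global_avg_pool(wide))   # [4.0, 2.0]
```

Both inputs give a length-2 vector, which is what lets a GAP head accept different spatial sizes.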
Tiny CNN for CIFAR-scale input
class TinyCNN(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        self.f = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(2),
        )
        self.fc = nn.Linear(64 * 8 * 8, num_classes)

    def forward(self, x):
        x = self.f(x)
        x = x.flatten(1)
        return self.fc(x)
m = TinyCNN()
out = m(torch.randn(2, 3, 32, 32))
print(out.shape) # [2, 10]
After two 2×2 pools on 32×32 → 8×8 spatial size for the linear layer.
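The 64 * 8 * 8 input size of the linear layer can be checked by running the output-size formula through each layer. A short sketch, treating max pooling as a kernel-2, stride-2 window under the same formula:

```python
def conv_out(in_sz, k, stride=1, pad=0, dilation=1):
    return (in_sz + 2 * pad - dilation * (k - 1) - 1) // stride + 1

size = 32
size = conv_out(size, 3, pad=1)      # conv 3x3, pad 1 -> 32
size = conv_out(size, 2, stride=2)   # max pool 2x2    -> 16
size = conv_out(size, 3, pad=1)      # conv 3x3, pad 1 -> 16
size = conv_out(size, 2, stride=2)   # max pool 2x2    -> 8

flat = 64 * size * size              # 64 channels after the second conv
print(size, flat)  # 8 4096
```

If the input were not 32×32, flat would change and the hard-coded nn.Linear input size would no longer match.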
BatchNorm and Dropout (typical use)
block = nn.Sequential(
nn.Conv2d(32, 64, 3, padding=1),
nn.BatchNorm2d(64),
nn.ReLU(inplace=True),
nn.Dropout2d(0.1),
)
Takeaways
- Track tensors as N×C×H×W in PyTorch conv nets.
- Padding preserves size with stride 1 and kernel 3 (pad 1).
- Stack conv → nonlinearity → pool; end with GAP or flatten + linear.
AlexNet
Architecture (conceptual)
Input is traditionally a 224×224 crop. Five convolutional stages with ReLU and max pooling (the original paper used overlapping pooling in places). Three large fully connected layers (4096, 4096, 1000) with dropout. Local Response Normalization (LRN) appeared in the original paper; torchvision's implementation omits it. Check print(model) for the version you use.
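A sketch of how the spatial size shrinks through the convolutional stages. The kernel, stride, padding, and channel settings below are assumed to match torchvision's AlexNet definition; verify them against print(model.features) before relying on them.

```python
def conv_out(in_sz, k, stride=1, pad=0, dilation=1):
    return (in_sz + 2 * pad - dilation * (k - 1) - 1) // stride + 1

# (name, kernel, stride, pad, out_channels); out_channels is None for pooling.
# Assumed layer settings, see the note above.
layers = [
    ("conv1", 11, 4, 2, 64),
    ("pool1", 3, 2, 0, None),
    ("conv2", 5, 1, 2, 192),
    ("pool2", 3, 2, 0, None),
    ("conv3", 3, 1, 1, 384),
    ("conv4", 3, 1, 1, 256),
    ("conv5", 3, 1, 1, 256),
    ("pool5", 3, 2, 0, None),
]

size, channels = 224, 3
trace = []
for name, k, s, p, c_out in layers:
    size = conv_out(size, k, s, p)
    if c_out is not None:
        channels = c_out
    trace.append((name, channels, size))
    print(f"{name}: {channels} x {size} x {size}")

flat = channels * size * size
print(flat)  # 9216
```

The 3×3 windows with stride 2 are the overlapping pooling mentioned above, and 9216 is the input width of the first fully connected layer.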
ReLU
Faster training than saturating tanh/sigmoid for deep nets at the time.
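One way to see the saturation problem is to compare derivatives at a few positive inputs. This is a small numeric illustration, not a training experiment:

```python
import math

def sigmoid_grad(x):
    s = 1.0 / (1.0 + math.exp(-x))
    return s * (1.0 - s)

def tanh_grad(x):
    return 1.0 - math.tanh(x) ** 2

def relu_grad(x):
    return 1.0 if x > 0 else 0.0

# Gradients of the saturating functions shrink as the input grows; ReLU's stays at 1.
for x in (0.5, 2.0, 5.0):
    print(x, round(sigmoid_grad(x), 4), round(tanh_grad(x), 4), relu_grad(x))
```

When many such small factors are multiplied across layers during backpropagation, the signal reaching early layers becomes very weak, which slows training.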
Dropout
Regularizes the huge FC parameters to reduce co-adaptation.
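To see how much of the model sits in the fully connected layers, the sketch below counts parameters by hand. Layer sizes are assumed to match torchvision's AlexNet (five conv layers, then 9216 → 4096 → 4096 → 1000); confirm against your installed version.

```python
def linear_params(n_in, n_out):
    return n_in * n_out + n_out              # weights + biases

def conv_params(c_in, c_out, k):
    return c_in * c_out * k * k + c_out      # weights + biases

# Assumed layer sizes, see the note above.
fc = (linear_params(256 * 6 * 6, 4096)
      + linear_params(4096, 4096)
      + linear_params(4096, 1000))
conv = sum(conv_params(*cfg) for cfg in [
    (3, 64, 11), (64, 192, 5), (192, 384, 3), (384, 256, 3), (256, 256, 3),
])

print(fc, conv)                       # 58631144 2469696
print(round(fc / (fc + conv), 3))     # 0.96
```

Under these assumptions roughly 96% of the parameters are in the three FC layers, which is where dropout is applied.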
Load pretrained weights
import torch
from torchvision.models import alexnet, AlexNet_Weights
weights = AlexNet_Weights.IMAGENET1K_V1
model = alexnet(weights=weights).eval()
preprocess = weights.transforms()
print(preprocess)
Random init (train from scratch)
model_scratch = alexnet(weights=None)
Single image → class logits
from PIL import Image
img = Image.open("cat.jpg").convert("RGB")
batch = preprocess(img).unsqueeze(0)
with torch.no_grad():
    logits = model(batch)
probs = logits.softmax(dim=1)
top5 = probs.topk(5, dim=1)
# Map indices to labels
categories = weights.meta["categories"]
for score, idx in zip(top5.values[0], top5.indices[0]):
    print(f"{categories[idx]}: {float(score):.4f}")
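For readers who want to see what softmax and topk compute, here is a plain-Python sketch on four made-up logits. It mirrors the idea only; the tensor methods above are what you would use in practice.

```python
import math

def softmax(logits):
    m = max(logits)                              # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def topk(values, k):
    # Returns (value, index) pairs, largest value first.
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)[:k]
    return [(values[i], i) for i in order]

logits = [2.0, 0.5, 3.0, -1.0]                   # made-up logits for four classes
probs = softmax(logits)
for score, idx in topk(probs, 2):
    print(idx, round(score, 4))
```

Softmax does not change the ranking, so the top-k indices are the same whether you rank logits or probabilities.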
4096-D embedding (before classifier)
# torchvision alexnet: features → avgpool → classifier (fc layers)
with torch.no_grad():
    x = model.features(batch)
    x = model.avgpool(x)
    x = torch.flatten(x, 1)
    # Default torchvision alexnet: classifier[6] is Linear(4096, 1000)
    vec = model.classifier[:6](x)  # through second FC + ReLU → 4096-D
print(vec.shape)
Confirm with print(model.classifier)—slices change if the head was replaced for fine-tuning.
Alternative: hook after second ReLU
activation = {}
def get(name):
    def hook(m, i, o):
        activation[name] = o.detach()
    return hook
h = model.classifier[5].register_forward_hook(get("fc4096_relu"))
_ = model(batch)
h.remove()
feat = activation["fc4096_relu"]
Mini-batch
from PIL import Image
paths = ["a.jpg", "b.jpg", "c.jpg"]
tensors = [preprocess(Image.open(p).convert("RGB")) for p in paths]
xb = torch.stack(tensors, dim=0)
with torch.no_grad():
    out = model(xb)
print(out.shape) # [3, 1000]
weights.transforms() handles resizing, center cropping, conversion to a float tensor, and ImageNet normalization; it accepts PIL images or tensors. Exact details depend on the torchvision version, so print it to check.
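The normalization step is per-channel arithmetic. A minimal sketch for one pixel, assuming the commonly used ImageNet mean and std; the values printed by preprocess are the authority for your version.

```python
# Assumed ImageNet statistics (RGB order); compare with print(preprocess).
IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)

def normalize_pixel(rgb_uint8):
    # Scale 0..255 to 0..1, then subtract the mean and divide by the std per channel.
    scaled = [v / 255.0 for v in rgb_uint8]
    return [(s - m) / sd for s, m, sd in zip(scaled, IMAGENET_MEAN, IMAGENET_STD)]

white = normalize_pixel((255, 255, 255))
black = normalize_pixel((0, 0, 0))
print([round(v, 3) for v in white])
print([round(v, 3) for v in black])
```

Skipping this step, or using different statistics, feeds the pretrained network inputs on a scale it was not trained on.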
Fine-tune last layer (sketch)
import torch.nn as nn
num_classes = 10
model_ft = alexnet(weights=weights)
model_ft.classifier[6] = nn.Linear(4096, num_classes)
# freeze earlier layers optionally, then train with your dataloader
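The freezing step can be sketched as follows. To keep the example runnable without downloading weights, it uses a tiny hypothetical stand-in (TinyHeadNet) with the same features / avgpool / classifier layout; with torchvision you would apply the same pattern to model_ft and classifier[6]. The optimizer settings are assumptions.

```python
import torch
import torch.nn as nn

class TinyHeadNet(nn.Module):
    # Hypothetical stand-in for an AlexNet-style model, small enough to run instantly.
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.ReLU())
        self.avgpool = nn.AdaptiveAvgPool2d(1)
        self.classifier = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 1000))

    def forward(self, x):
        x = self.avgpool(self.features(x))
        return self.classifier(torch.flatten(x, 1))

num_classes = 10
model_ft = TinyHeadNet()

# Freeze everything first ...
for p in model_ft.parameters():
    p.requires_grad = False

# ... then swap the head; a freshly created layer is trainable by default.
model_ft.classifier[2] = nn.Linear(16, num_classes)

trainable = [p for p in model_ft.parameters() if p.requires_grad]
optimizer = torch.optim.SGD(trainable, lr=1e-3, momentum=0.9)  # assumed hyperparameters

# One training step on random data, standing in for a real dataloader batch.
x = torch.randn(4, 3, 32, 32)
y = torch.randint(0, num_classes, (4,))
loss = nn.functional.cross_entropy(model_ft(x), y)
optimizer.zero_grad()
loss.backward()
optimizer.step()

print(sum(p.numel() for p in trainable))  # 170
```

Passing only the trainable parameters to the optimizer keeps the frozen layers untouched and avoids storing optimizer state for them.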
Takeaways
- AlexNet = deep conv stacks + large FC + ReLU/dropout—ImageNet 2012 breakthrough.
- Use AlexNet_Weights transforms for correct normalization.
- For transfer learning, replace the final Linear(4096, 1000) with your class count.
Quick FAQ
avgpool. Changing input resolution may break shape; keep 224 or redesign the head.