Computer Vision Chapter 31

ResNet

ResNet (He et al., 2015) made very deep CNNs trainable by learning residual mappings with identity skip connections. Instead of forcing a stack of layers to approximate H(x) directly, a block learns F(x) and outputs F(x) + x (when shapes match). This eases optimization and addresses the degradation problem, where deeper plain networks exhibit higher training error than shallower ones. Deeper variants use bottleneck blocks (1×1–3×3–1×1) for efficiency. Below: the idea in code, torchvision loading, backbone features, and fine-tuning—with several examples.

Residual block (concept)

import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, c):
        super().__init__()
        self.conv1 = nn.Conv2d(c, c, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(c)
        self.conv2 = nn.Conv2d(c, c, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(c)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        identity = x
        out = self.act(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.act(out + identity)

When channel or stride changes, the skip uses a 1×1 conv projection—see torchvision’s Bottleneck / BasicBlock.
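A minimal sketch of that projection variant, in the same plain-PyTorch style as the block above (the class name DownsampleBlock is illustrative, not torchvision's):

```python
import torch
import torch.nn as nn

class DownsampleBlock(nn.Module):
    """Residual block whose skip path uses a 1x1 conv projection
    to match a channel increase and a stride-2 downsample."""
    def __init__(self, c_in, c_out, stride=2):
        super().__init__()
        self.conv1 = nn.Conv2d(c_in, c_out, 3, stride=stride, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(c_out)
        self.conv2 = nn.Conv2d(c_out, c_out, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(c_out)
        # Projection shortcut: 1x1 conv + BN so shapes match for the addition
        self.proj = nn.Sequential(
            nn.Conv2d(c_in, c_out, 1, stride=stride, bias=False),
            nn.BatchNorm2d(c_out),
        )
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.act(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.act(out + self.proj(x))

x = torch.randn(1, 64, 56, 56)
y = DownsampleBlock(64, 128)(x)
print(y.shape)  # torch.Size([1, 128, 28, 28])
```

Spatial size halves and channels double, mirroring the transitions between torchvision's layer1…layer4 stages.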

torchvision: ResNet-18 and ResNet-50

from torchvision.models import resnet18, resnet50, ResNet18_Weights, ResNet50_Weights

r18 = resnet18(weights=ResNet18_Weights.IMAGENET1K_V1).eval()
r50 = resnet50(weights=ResNet50_Weights.IMAGENET1K_V2).eval()

tf18 = ResNet18_Weights.IMAGENET1K_V1.transforms()
tf50 = ResNet50_Weights.IMAGENET1K_V2.transforms()

Inference logits

from PIL import Image
import torch

img = tf50(Image.open("dog.jpg").convert("RGB")).unsqueeze(0)
with torch.no_grad():
    logits = r50(img)
probs = logits.softmax(1).squeeze()
i = int(probs.argmax())
print(ResNet50_Weights.IMAGENET1K_V2.meta["categories"][i])

Backbone embedding (before FC)

# Remove classifier: avgpool + flatten → vector
backbone = nn.Sequential(*list(r50.children())[:-1])  # drop fc
with torch.no_grad():
    feat = backbone(img).flatten(1)
print(feat.shape)  # [1, 2048] for ResNet-50

Hooks (optional)

Register a forward hook on r50.layer4 if you need intermediate maps without rewriting the full forward. For a single embedding vector, the Sequential backbone above is usually enough.

Fine-tune last layer

num_classes = 5
r50.fc = nn.Linear(r50.fc.in_features, num_classes)

for p in r50.parameters():
    p.requires_grad = False
for p in r50.fc.parameters():
    p.requires_grad = True

Train all layers with lower LR on backbone (concept)

import torch.optim as optim
opt = optim.AdamW([
    {"params": r50.fc.parameters(), "lr": 1e-3},
    {"params": [p for n, p in r50.named_parameters() if "fc" not in n], "lr": 1e-5},
])

Names to know

  • ResNeXt — grouped convolutions in blocks.
  • Wide ResNet — wider channels with fewer layers; often competitive with much deeper nets.
  • EfficientNet / ConvNeXt — later architectures exploring efficiency–accuracy tradeoffs (different model families).

Takeaways

  • Residual: y = F(x) + x stabilizes deep training.
  • ResNet-50 uses bottleneck blocks; ResNet-18 uses basic blocks of two 3×3 convs.
  • Standard transfer: replace fc, freeze or differential LR.

Quick FAQ

Call model.eval() before inference so BN uses running stats and dropout is off.

Global average pooling before fc lets many setups accept variable H/W; still use consistent preprocessing and validate shapes through the backbone.