Computer Vision Chapter 15

Advanced CNN Architectures

ResNet skip connections, MobileNet depthwise separable convolutions, and EfficientNet scaling.

ResNet

Residual block (concept)

import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, c):
        super().__init__()
        self.conv1 = nn.Conv2d(c, c, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(c)
        self.conv2 = nn.Conv2d(c, c, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(c)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        identity = x
        out = self.act(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.act(out + identity)

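When the channel count or stride changes, the identity no longer matches the shape of the main path, so the skip uses a 1×1 conv projection. See torchvision's BasicBlock and Bottleneck for the reference implementations.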
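The projection case can be sketched as below. This is a minimal illustration rather than torchvision's actual code: the class name ProjectionResidualBlock and its arguments are hypothetical, and it follows the basic (two 3×3 conv) layout only.

```python
import torch
import torch.nn as nn


class ProjectionResidualBlock(nn.Module):
    """Basic residual block whose skip path is projected when shapes change."""

    def __init__(self, in_ch, out_ch, stride=1):
        super().__init__()
        self.conv1 = nn.Conv2d(in_ch, out_ch, 3, stride=stride, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(out_ch)
        self.conv2 = nn.Conv2d(out_ch, out_ch, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(out_ch)
        self.act = nn.ReLU(inplace=True)
        # Project the identity only when the main path changes its shape.
        if stride != 1 or in_ch != out_ch:
            self.shortcut = nn.Sequential(
                nn.Conv2d(in_ch, out_ch, 1, stride=stride, bias=False),
                nn.BatchNorm2d(out_ch),
            )
        else:
            self.shortcut = nn.Identity()

    def forward(self, x):
        out = self.act(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.act(out + self.shortcut(x))


block = ProjectionResidualBlock(64, 128, stride=2).eval()
with torch.no_grad():
    y = block(torch.randn(1, 64, 56, 56))
print(y.shape)  # torch.Size([1, 128, 28, 28])
```

With stride=1 and equal channel counts the shortcut falls back to nn.Identity, which reproduces the plain block shown earlier.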

torchvision: ResNet-18 and ResNet-50

from torchvision.models import resnet18, resnet50, ResNet18_Weights, ResNet50_Weights

r18 = resnet18(weights=ResNet18_Weights.IMAGENET1K_V1).eval()
r50 = resnet50(weights=ResNet50_Weights.IMAGENET1K_V2).eval()

tf18 = ResNet18_Weights.IMAGENET1K_V1.transforms()
tf50 = ResNet50_Weights.IMAGENET1K_V2.transforms()

Inference logits

from PIL import Image
import torch

img = tf50(Image.open("dog.jpg").convert("RGB")).unsqueeze(0)
with torch.no_grad():
    logits = r50(img)
probs = logits.softmax(1).squeeze()
i = int(probs.argmax())
print(ResNet50_Weights.IMAGENET1K_V2.meta["categories"][i])

Backbone embedding (before FC)

# Remove classifier: avgpool + flatten → vector
backbone = nn.Sequential(*list(r50.children())[:-1])  # drop fc
with torch.no_grad():
    feat = backbone(img).flatten(1)
print(feat.shape)  # [1, 2048] for ResNet-50

Hooks (optional)

Register a forward hook on r50.layer4 if you need intermediate maps without rewriting the full forward. For a single embedding vector, the Sequential backbone above is usually enough.

Fine-tune last layer

num_classes = 5
r50.fc = nn.Linear(r50.fc.in_features, num_classes)

for p in r50.parameters():
    p.requires_grad = False
for p in r50.fc.parameters():
    p.requires_grad = True

Train all layers with lower LR on backbone (concept)

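import torch.optim as optim

# If the backbone was frozen earlier, re-enable gradients first.
for p in r50.parameters():
    p.requires_grad = True

opt = optim.AdamW([
    {"params": r50.fc.parameters(), "lr": 1e-3},  # new head: higher LR
    {"params": [p for n, p in r50.named_parameters() if not n.startswith("fc.")], "lr": 1e-5},  # backbone
])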

Names to know

  • ResNeXt: grouped convolutions inside the blocks.
  • Wide ResNet: wider channels; with fewer layers it is sometimes competitive with deeper ResNets.
  • EfficientNet / ConvNeXt: later efficiency-accuracy tradeoffs (different families).

Takeaways

  • Residual: y = F(x) + x stabilizes deep training.
  • ResNet-50 uses bottleneck blocks; ResNet-18 uses basic blocks of two 3×3 convs each.
  • Standard transfer: replace fc, freeze or differential LR.

Quick FAQ

Call model.eval() before inference so BN uses running stats and dropout is off.

Global average pooling before fc allows variable H/W in many setups; still use consistent preprocessing and validate shape through backbone.

MobileNet

Depthwise separable (from scratch sketch)

import torch.nn as nn

class DepthwiseSeparable(nn.Module):
    def __init__(self, in_ch, out_ch, stride=1):
        super().__init__()
        self.dw = nn.Conv2d(in_ch, in_ch, 3, stride, 1, groups=in_ch, bias=False)
        self.pw = nn.Conv2d(in_ch, out_ch, 1, bias=False)
        self.bn = nn.BatchNorm2d(out_ch)
        self.act = nn.ReLU6(inplace=True)

    def forward(self, x):
        return self.act(self.bn(self.pw(self.dw(x))))

Real MobileNet blocks add expansion ratios, residuals (V2), and SE/h-swish (V3)—use torchvision.models for faithful implementations.
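To make the V2 additions concrete, here is a rough sketch of an inverted residual block: a 1×1 expansion, a 3×3 depthwise conv, and a 1×1 linear projection, with a skip only when shapes match. The class name InvertedResidualSketch and the default expand_ratio=6 are choices made for this illustration; torchvision's own block differs in details.

```python
import torch
import torch.nn as nn


class InvertedResidualSketch(nn.Module):
    """Expand (1x1) -> depthwise (3x3) -> linear project (1x1), V2 style."""

    def __init__(self, in_ch, out_ch, stride=1, expand_ratio=6):
        super().__init__()
        hidden = in_ch * expand_ratio
        # The skip is only valid when input and output shapes match.
        self.use_res = stride == 1 and in_ch == out_ch
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, hidden, 1, bias=False),
            nn.BatchNorm2d(hidden),
            nn.ReLU6(inplace=True),
            nn.Conv2d(hidden, hidden, 3, stride, 1, groups=hidden, bias=False),
            nn.BatchNorm2d(hidden),
            nn.ReLU6(inplace=True),
            nn.Conv2d(hidden, out_ch, 1, bias=False),
            nn.BatchNorm2d(out_ch),  # linear bottleneck: no activation here
        )

    def forward(self, x):
        out = self.block(x)
        return x + out if self.use_res else out


x = torch.randn(1, 32, 56, 56)
same = InvertedResidualSketch(32, 32).eval()
down = InvertedResidualSketch(32, 64, stride=2).eval()
with torch.no_grad():
    y_same, y_down = same(x), down(x)
print(y_same.shape, y_down.shape)
```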

torchvision MobileNetV2 / V3

from torchvision.models import (
    mobilenet_v2, mobilenet_v3_small, mobilenet_v3_large,
    MobileNet_V2_Weights, MobileNet_V3_Small_Weights, MobileNet_V3_Large_Weights,
)

m2 = mobilenet_v2(weights=MobileNet_V2_Weights.IMAGENET1K_V1).eval()
m3s = mobilenet_v3_small(weights=MobileNet_V3_Small_Weights.IMAGENET1K_V1).eval()
tf2 = MobileNet_V2_Weights.IMAGENET1K_V1.transforms()
tf3 = MobileNet_V3_Small_Weights.IMAGENET1K_V1.transforms()

Classification

from PIL import Image
import torch

img = tf2(Image.open("cat.jpg").convert("RGB")).unsqueeze(0)
with torch.no_grad():
    logits = m2(img)
idx = int(logits.argmax(1))
print(MobileNet_V2_Weights.IMAGENET1K_V1.meta["categories"][idx])

Feature vector (before classifier)

# MobileNetV2: features end before classifier
feat_net = nn.Sequential(m2.features, nn.AdaptiveAvgPool2d(1), nn.Flatten(1))
with torch.no_grad():
    emb = feat_net(img)
print(emb.shape)  # [1, 1280] for MobileNetV2

V3 small: same idea

img3 = tf3(Image.open("cat.jpg").convert("RGB")).unsqueeze(0)
feat3 = nn.Sequential(m3s.features, m3s.avgpool, nn.Flatten(1))
with torch.no_grad():
    e3 = feat3(img3)

Replace classifier head

num_classes = 10
in_f = m2.classifier[1].in_features
m2.classifier[1] = nn.Linear(in_f, num_classes)

Width multiplier & resolution

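Papers scale channel width and input resolution for accuracy–latency tradeoffs. In torchvision, the bundled pretrained weights target the default width, so for another width you instantiate the model without pretrained weights and pass width_mult where the constructor exposes it (the API varies by version). Input resolution is controlled by the preprocessing you apply.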

Takeaways

  • Depthwise + pointwise needs far fewer FLOPs than one full conv: roughly 1/C_out + 1/k² of the cost for a k×k kernel.
  • V2: inverted residual + linear bottleneck.
  • V3: tuned for mobile with h-swish / SE-style blocks.
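The FLOP saving from the first takeaway can be checked with plain arithmetic. The sketch below counts multiply-accumulates for one layer; the layer sizes (56×56 map, 64 to 128 channels, 3×3 kernel) are arbitrary example values, and biases and BatchNorm are ignored.

```python
def conv_macs(h, w, c_in, c_out, k):
    """Multiply-accumulates of a standard k x k conv on an h x w output map."""
    return h * w * c_in * c_out * k * k


def separable_macs(h, w, c_in, c_out, k):
    """Depthwise k x k conv followed by a 1 x 1 pointwise conv."""
    depthwise = h * w * c_in * k * k
    pointwise = h * w * c_in * c_out
    return depthwise + pointwise


h = w = 56
c_in, c_out, k = 64, 128, 3
ratio = separable_macs(h, w, c_in, c_out, k) / conv_macs(h, w, c_in, c_out, k)

print(round(ratio, 4))                 # 0.1189
print(round(1 / c_out + 1 / k**2, 4))  # 0.1189, the closed-form ratio
```

For a 3×3 kernel the ratio is dominated by the 1/9 term, so the separable layer costs roughly an eighth to a ninth of the full conv.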

Quick FAQ

ReLU6 clamps activations to [0, 6], which helped quantized deployment in the original MobileNet designs.
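The clamping behaviour of nn.ReLU6, the activation used in the depthwise separable sketch earlier, is easy to verify on a few hand-picked values:

```python
import torch
import torch.nn as nn

x = torch.tensor([-2.0, 0.5, 3.0, 6.0, 9.0])
y = nn.ReLU6()(x)

print(y.tolist())  # [0.0, 0.5, 3.0, 6.0, 6.0]
```

The bounded output range is what makes a fixed-point representation easier to choose.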

MobileNet and EfficientNet both pursue efficiency; EfficientNet scales depth, width, and resolution together (compound scaling). Choose by measured latency on your hardware and by framework support.
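Compound scaling can be illustrated with a few lines of arithmetic. The base coefficients below are the ones reported in the EfficientNet paper; treating every model variant as an exact power of them is a simplification made for this sketch, and compound_scale is a name invented here.

```python
# Base coefficients reported in the EfficientNet paper (depth, width, resolution).
ALPHA, BETA, GAMMA = 1.2, 1.1, 1.15


def compound_scale(phi):
    """Return (depth, width, resolution) multipliers for compound coefficient phi."""
    return ALPHA ** phi, BETA ** phi, GAMMA ** phi


for phi in range(4):
    d, w, r = compound_scale(phi)
    print(phi, round(d, 2), round(w, 2), round(r, 2))

# FLOPs grow roughly as depth * width^2 * resolution^2, so about 2x per unit of phi.
print(round(ALPHA * BETA ** 2 * GAMMA ** 2, 2))  # 1.92
```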
