Advanced CNN Architectures

ResNet

Residual block (concept)

import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, c):
        super().__init__()
        self.conv1 = nn.Conv2d(c, c, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(c)
        self.conv2 = nn.Conv2d(c, c, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(c)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        identity = x
        out = self.act(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.act(out + identity)

When the channel count or stride changes, the skip path uses a 1×1 conv projection so the two tensors can be added; see the downsample branch in torchvision's BasicBlock and Bottleneck.
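To make the projection case concrete, here is a minimal sketch of a basic block whose skip path switches to a 1×1 conv when the shape changes. The class name ProjectionResidualBlock and the shortcut attribute are illustrative choices, not torchvision's names, and the layout is an assumption modeled on the block above.

```python
import torch
import torch.nn as nn


class ProjectionResidualBlock(nn.Module):
    """Basic residual block whose skip path can change channels and stride."""

    def __init__(self, in_ch, out_ch, stride=1):
        super().__init__()
        self.conv1 = nn.Conv2d(in_ch, out_ch, 3, stride=stride, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(out_ch)
        self.conv2 = nn.Conv2d(out_ch, out_ch, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(out_ch)
        self.act = nn.ReLU(inplace=True)
        # Identity skip when shapes already match; otherwise 1x1 conv + BN projection.
        if stride != 1 or in_ch != out_ch:
            self.shortcut = nn.Sequential(
                nn.Conv2d(in_ch, out_ch, 1, stride=stride, bias=False),
                nn.BatchNorm2d(out_ch),
            )
        else:
            self.shortcut = nn.Identity()

    def forward(self, x):
        identity = self.shortcut(x)
        out = self.act(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.act(out + identity)


block = ProjectionResidualBlock(64, 128, stride=2).eval()
with torch.no_grad():
    y = block(torch.randn(1, 64, 56, 56))
print(tuple(y.shape))  # (1, 128, 28, 28)
```

With stride 2 both the main path and the projection halve the spatial size, so the addition is well defined.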

torchvision: ResNet-18 and ResNet-50

from torchvision.models import resnet18, resnet50, ResNet18_Weights, ResNet50_Weights

r18 = resnet18(weights=ResNet18_Weights.IMAGENET1K_V1).eval()
r50 = resnet50(weights=ResNet50_Weights.IMAGENET1K_V2).eval()

tf18 = ResNet18_Weights.IMAGENET1K_V1.transforms()
tf50 = ResNet50_Weights.IMAGENET1K_V2.transforms()

Inference logits

from PIL import Image
import torch

img = tf50(Image.open("dog.jpg").convert("RGB")).unsqueeze(0)
with torch.no_grad():
    logits = r50(img)
probs = logits.softmax(1).squeeze()
i = int(probs.argmax())
print(ResNet50_Weights.IMAGENET1K_V2.meta["categories"][i])
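If the single best class is not enough, a top-5 listing is a small extension. The sketch below uses random logits and placeholder class names (both hypothetical stand-ins) so that it runs without the image or the pretrained weights; with the real model you would pass the logits and meta["categories"] from above.

```python
import torch

# Hypothetical stand-ins: random logits and placeholder class names.
torch.manual_seed(0)
logits = torch.randn(1, 1000)
categories = [f"class_{k}" for k in range(1000)]

probs = logits.softmax(dim=1).squeeze(0)  # shape [1000], sums to 1
top_p, top_i = probs.topk(5)              # values sorted in descending order
for p, i in zip(top_p.tolist(), top_i.tolist()):
    print(f"{categories[i]}: {p:.4f}")
```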

Backbone embedding (before FC)

# Drop only the final fc layer; avgpool stays, and flatten turns its output into a vector
backbone = nn.Sequential(*list(r50.children())[:-1])
with torch.no_grad():
    feat = backbone(img).flatten(1)
print(feat.shape)  # [1, 2048] for ResNet-50

Hooks (optional)

Register a forward hook on r50.layer4 if you need intermediate maps without rewriting the full forward. For a single embedding vector, the Sequential backbone above is usually enough.

Fine-tune last layer

num_classes = 5
r50.fc = nn.Linear(r50.fc.in_features, num_classes)

for p in r50.parameters():
    p.requires_grad = False
for p in r50.fc.parameters():
    p.requires_grad = True

Train all layers with lower LR on backbone (concept)

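import torch.optim as optim

# Unfreeze the backbone first if it was frozen in the previous step
for p in r50.parameters():
    p.requires_grad = True
opt = optim.AdamW([
    {"params": r50.fc.parameters(), "lr": 1e-3},  # new head: higher LR
    {"params": [p for n, p in r50.named_parameters() if not n.startswith("fc.")], "lr": 1e-5},  # backbone
])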

Names to know

ResNeXt: grouped convolutions inside the residual blocks.
Wide ResNet: wider channels with fewer layers, sometimes competitive with much deeper ResNets.
EfficientNet / ConvNeXt: later efficiency-accuracy tradeoffs (different families).

Takeaways

Residual: y = F(x) + x stabilizes deep training.
ResNet-50 uses bottleneck blocks; ResNet-18 uses basic blocks of two 3×3 convolutions.
Standard transfer: replace fc, then either freeze the backbone or train it with a lower learning rate.
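One way to read the first takeaway is through the gradient. The short derivation below ignores the final ReLU for simplicity (an assumption, since the block shown earlier applies it after the sum) and writes the loss as \mathcal{L}:

```latex
y = F(x) + x
\quad\Longrightarrow\quad
\frac{\partial \mathcal{L}}{\partial x}
= \frac{\partial \mathcal{L}}{\partial y}\left(\frac{\partial F}{\partial x} + I\right)
= \frac{\partial \mathcal{L}}{\partial y}\,\frac{\partial F}{\partial x}
+ \frac{\partial \mathcal{L}}{\partial y}
```

The second term passes the upstream gradient through unchanged, so the signal reaching earlier layers does not depend only on the residual branch F.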

                

Quick FAQ

BatchNorm in eval?

Call model.eval() before inference so that BatchNorm layers use their running statistics and dropout is disabled.
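The difference between the two modes can be seen on a single BatchNorm layer. This is a small illustration with made-up data, not part of the ResNet code above.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
bn = nn.BatchNorm1d(3)
x = torch.randn(8, 3) * 5 + 10  # batch with mean near 10 and std near 5

bn.train()
out_train = bn(x)  # normalizes with this batch's statistics and updates running stats

bn.eval()
with torch.no_grad():
    out_eval = bn(x)  # normalizes with the stored running statistics instead

print(out_train.mean(0))  # close to 0 for every feature
print(out_eval.mean(0))   # not centered: running stats have moved only slightly toward the batch
```

After one training-mode pass the running mean has moved only a fraction (the default momentum is 0.1) toward the batch mean, which is why the eval-mode output differs.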

Input size not 224?

Global average pooling before fc lets the network accept variable H/W in many setups. Still, keep preprocessing consistent and check the tensor shapes coming out of the backbone.

MobileNet

Depthwise separable (from scratch sketch)

import torch.nn as nn

class DepthwiseSeparable(nn.Module):
    def __init__(self, in_ch, out_ch, stride=1):
        super().__init__()
        self.dw = nn.Conv2d(in_ch, in_ch, 3, stride, 1, groups=in_ch, bias=False)
        self.pw = nn.Conv2d(in_ch, out_ch, 1, bias=False)
        self.bn = nn.BatchNorm2d(out_ch)
        self.act = nn.ReLU6(inplace=True)

    def forward(self, x):
        return self.act(self.bn(self.pw(self.dw(x))))

Real MobileNet blocks add expansion ratios, residuals (V2), and SE/h-swish (V3)—use torchvision.models for faithful implementations.
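For orientation, a simplified V2-style inverted residual could be sketched as follows. The class name, the use_res attribute, and the default expand_ratio=6 are assumptions for illustration; torchvision's own block differs in details.

```python
import torch
import torch.nn as nn


class InvertedResidual(nn.Module):
    """Simplified MobileNetV2-style block: expand (1x1), depthwise (3x3), project (1x1)."""

    def __init__(self, in_ch, out_ch, stride=1, expand_ratio=6):
        super().__init__()
        hidden = in_ch * expand_ratio
        # The skip connection only applies when input and output shapes match.
        self.use_res = stride == 1 and in_ch == out_ch
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, hidden, 1, bias=False),
            nn.BatchNorm2d(hidden),
            nn.ReLU6(inplace=True),
            nn.Conv2d(hidden, hidden, 3, stride, 1, groups=hidden, bias=False),
            nn.BatchNorm2d(hidden),
            nn.ReLU6(inplace=True),
            # Linear bottleneck: no activation after the projection.
            nn.Conv2d(hidden, out_ch, 1, bias=False),
            nn.BatchNorm2d(out_ch),
        )

    def forward(self, x):
        out = self.block(x)
        return x + out if self.use_res else out


x = torch.randn(1, 32, 56, 56)
with torch.no_grad():
    same = InvertedResidual(32, 32).eval()(x)
    down = InvertedResidual(32, 64, stride=2).eval()(x)
print(tuple(same.shape), tuple(down.shape))  # (1, 32, 56, 56) (1, 64, 28, 28)
```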

torchvision MobileNetV2 / V3

from torchvision.models import (
    mobilenet_v2, mobilenet_v3_small, mobilenet_v3_large,
    MobileNet_V2_Weights, MobileNet_V3_Small_Weights, MobileNet_V3_Large_Weights,
)

m2 = mobilenet_v2(weights=MobileNet_V2_Weights.IMAGENET1K_V1).eval()
m3s = mobilenet_v3_small(weights=MobileNet_V3_Small_Weights.IMAGENET1K_V1).eval()
tf2 = MobileNet_V2_Weights.IMAGENET1K_V1.transforms()
tf3 = MobileNet_V3_Small_Weights.IMAGENET1K_V1.transforms()

Classification

from PIL import Image
import torch

img = tf2(Image.open("cat.jpg").convert("RGB")).unsqueeze(0)
with torch.no_grad():
    logits = m2(img)
idx = int(logits.argmax(1))
print(MobileNet_V2_Weights.IMAGENET1K_V1.meta["categories"][idx])

Feature vector (before classifier)

# MobileNetV2: features end before classifier
feat_net = nn.Sequential(m2.features, nn.AdaptiveAvgPool2d(1), nn.Flatten(1))
with torch.no_grad():
    emb = feat_net(img)
print(emb.shape)  # [1, 1280] for MobileNetV2

V3 small: same idea

img3 = tf3(Image.open("cat.jpg").convert("RGB")).unsqueeze(0)
feat3 = nn.Sequential(m3s.features, m3s.avgpool, nn.Flatten(1))
with torch.no_grad():
    e3 = feat3(img3)

Replace classifier head

num_classes = 10
in_f = m2.classifier[1].in_features
m2.classifier[1] = nn.Linear(in_f, num_classes)

Width multiplier & resolution

The MobileNet papers scale channel width (the width multiplier) and input resolution to trade accuracy against latency. In torchvision, either pick a different weights enum, or instantiate the model without pretrained weights and pass width_mult where the constructor exposes it (the API varies by version).

Takeaways

Depthwise + pointwise uses far fewer FLOPs than one full conv with the same input and output shapes.
V2: inverted residual + linear bottleneck.
V3: tuned for mobile with h-swish / SE-style blocks.
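The first takeaway can be checked with simple arithmetic. The helper names and the layer sizes below are illustrative; the count is multiply-accumulates for stride 1 with same padding, ignoring BatchNorm and activations.

```python
def conv_macs(h, w, c_in, c_out, k=3):
    """Multiply-accumulates for a standard k x k convolution (stride 1, same padding)."""
    return h * w * c_in * c_out * k * k


def dw_separable_macs(h, w, c_in, c_out, k=3):
    """Depthwise k x k convolution followed by a 1 x 1 pointwise convolution."""
    depthwise = h * w * c_in * k * k
    pointwise = h * w * c_in * c_out
    return depthwise + pointwise


h = w = 56
c_in, c_out = 128, 128
full = conv_macs(h, w, c_in, c_out)
sep = dw_separable_macs(h, w, c_in, c_out)
ratio = sep / full

print(full)             # 462422016
print(sep)              # 54992896
print(round(ratio, 4))  # 0.1189, which equals 1/c_out + 1/k**2
```

For 3×3 kernels the saving approaches a factor of nine as the number of output channels grows.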

                

Quick FAQ

ReLU6?

ReLU6 clamps activations to [0, 6], which helped quantized deployment in the original MobileNet designs.
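A tiny check of the clamping behavior, using arbitrary example values:

```python
import torch
import torch.nn as nn

x = torch.tensor([-2.0, 0.5, 3.0, 6.0, 9.0])
y = nn.ReLU6()(x)  # equivalent to clamping to the range [0, 6]

print(y.tolist())                     # [0.0, 0.5, 3.0, 6.0, 6.0]
print(torch.equal(y, x.clamp(0, 6)))  # True
```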

MobileNet vs EfficientNet?

Both families target efficiency. EfficientNet scales depth, width, and resolution together (compound scaling). Choose based on measured latency on your hardware and on framework support.
