ResNet
Residual block (concept)
import torch
import torch.nn as nn
class ResidualBlock(nn.Module):
    def __init__(self, c):
        super().__init__()
        # Two 3x3 convs that keep channel count and spatial size,
        # so the input can be added back unchanged.
        self.conv1 = nn.Conv2d(c, c, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(c)
        self.conv2 = nn.Conv2d(c, c, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(c)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        identity = x
        out = self.act(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.act(out + identity)  # add the skip, then activate
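When the channel count or stride changes, the skip path uses a 1×1 conv projection so that both tensors have the same shape before the addition; see the downsample branch in torchvision's BasicBlock and Bottleneck.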
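The projection case can be sketched as follows. This is a minimal illustration, not torchvision's implementation: the class name `ProjResidualBlock` and the attribute name `skip` are hypothetical, and the 1×1 conv is followed by BatchNorm on the assumption that the skip should be normalized like the main path.

```python
import torch
import torch.nn as nn


class ProjResidualBlock(nn.Module):
    """Basic residual block whose skip path is projected when shapes change.

    Hypothetical name for illustration; not torchvision's class.
    """

    def __init__(self, in_ch, out_ch, stride=1):
        super().__init__()
        self.conv1 = nn.Conv2d(in_ch, out_ch, 3, stride=stride, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(out_ch)
        self.conv2 = nn.Conv2d(out_ch, out_ch, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(out_ch)
        self.act = nn.ReLU(inplace=True)
        # Identity skip when shapes match; otherwise a 1x1 conv + BN projection.
        if stride != 1 or in_ch != out_ch:
            self.skip = nn.Sequential(
                nn.Conv2d(in_ch, out_ch, 1, stride=stride, bias=False),
                nn.BatchNorm2d(out_ch),
            )
        else:
            self.skip = nn.Identity()

    def forward(self, x):
        out = self.act(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.act(out + self.skip(x))


block = ProjResidualBlock(64, 128, stride=2).eval()
with torch.no_grad():
    y = block(torch.randn(1, 64, 56, 56))
print(y.shape)  # torch.Size([1, 128, 28, 28])
```

With stride 2 and a channel change from 64 to 128, the main path and the projected skip both produce a 128×28×28 map, so the addition is well defined.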
torchvision: ResNet-18 and ResNet-50
from torchvision.models import resnet18, resnet50, ResNet18_Weights, ResNet50_Weights
r18 = resnet18(weights=ResNet18_Weights.IMAGENET1K_V1).eval()
r50 = resnet50(weights=ResNet50_Weights.IMAGENET1K_V2).eval()
tf18 = ResNet18_Weights.IMAGENET1K_V1.transforms()
tf50 = ResNet50_Weights.IMAGENET1K_V2.transforms()
Inference logits
from PIL import Image
import torch
img = tf50(Image.open("dog.jpg").convert("RGB")).unsqueeze(0)
with torch.no_grad():
    logits = r50(img)
probs = logits.softmax(1).squeeze()
i = int(probs.argmax())
print(ResNet50_Weights.IMAGENET1K_V2.meta["categories"][i])
Backbone embedding (before FC)
# Remove classifier: avgpool + flatten → vector
backbone = nn.Sequential(*list(r50.children())[:-1]) # drop fc
with torch.no_grad():
    feat = backbone(img).flatten(1)
print(feat.shape) # [1, 2048] for ResNet-50
Hooks (optional)
Register a forward hook on r50.layer4 if you need intermediate maps without rewriting the full forward. For a single embedding vector, the Sequential backbone above is usually enough.
Fine-tune last layer
num_classes = 5
r50.fc = nn.Linear(r50.fc.in_features, num_classes)
for p in r50.parameters():
    p.requires_grad = False
for p in r50.fc.parameters():
    p.requires_grad = True
Train all layers with lower LR on backbone (concept)
import torch.optim as optim
opt = optim.AdamW([
    {"params": r50.fc.parameters(), "lr": 1e-3},
    {"params": [p for n, p in r50.named_parameters() if "fc" not in n], "lr": 1e-5},
])
Names to know
- ResNeXt: grouped convolutions inside each block.
- Wide ResNet: wider channels; sometimes competitive with fewer layers.
- EfficientNet / ConvNeXt: later families with different efficiency-accuracy tradeoffs.
Takeaways
- Residual: y = F(x) + x stabilizes deep training.
- ResNet-50 uses bottleneck blocks; ResNet-18 uses basic blocks of two 3×3 convs.
- Standard transfer: replace fc, then freeze the backbone or use a differential LR.
Quick FAQ
- Call model.eval() before inference so BN uses running stats and dropout is off.
- Adaptive average pooling before fc allows variable H/W in many setups; still use consistent preprocessing and validate shapes through the backbone.
MobileNet
Depthwise separable (from scratch sketch)
import torch.nn as nn
class DepthwiseSeparable(nn.Module):
    def __init__(self, in_ch, out_ch, stride=1):
        super().__init__()
        # Depthwise: one 3x3 filter per input channel (groups=in_ch).
        self.dw = nn.Conv2d(in_ch, in_ch, 3, stride, 1, groups=in_ch, bias=False)
        # Pointwise: 1x1 conv that mixes channels.
        self.pw = nn.Conv2d(in_ch, out_ch, 1, bias=False)
        self.bn = nn.BatchNorm2d(out_ch)
        self.act = nn.ReLU6(inplace=True)

    def forward(self, x):
        return self.act(self.bn(self.pw(self.dw(x))))
Real MobileNet blocks add expansion ratios, residuals (V2), and SE/h-swish (V3)—use torchvision.models for faithful implementations.
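To make the V2 additions concrete, here is a rough sketch of an inverted residual block: 1×1 expansion, 3×3 depthwise conv, then a 1×1 projection with no activation (the linear bottleneck). It is a simplification rather than torchvision's code; `expand_ratio=6` is an assumed default, and the block always expands, even when the ratio is 1.

```python
import torch
import torch.nn as nn


class InvertedResidual(nn.Module):
    """MobileNetV2-style sketch: expand (1x1) -> depthwise (3x3) -> project (1x1)."""

    def __init__(self, in_ch, out_ch, stride=1, expand_ratio=6):
        super().__init__()
        hidden = in_ch * expand_ratio
        # The residual is only valid when input and output shapes match.
        self.use_res = stride == 1 and in_ch == out_ch
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, hidden, 1, bias=False),  # expansion
            nn.BatchNorm2d(hidden),
            nn.ReLU6(inplace=True),
            nn.Conv2d(hidden, hidden, 3, stride, 1, groups=hidden, bias=False),  # depthwise
            nn.BatchNorm2d(hidden),
            nn.ReLU6(inplace=True),
            nn.Conv2d(hidden, out_ch, 1, bias=False),  # projection
            nn.BatchNorm2d(out_ch),  # linear bottleneck: no activation after this
        )

    def forward(self, x):
        out = self.block(x)
        return x + out if self.use_res else out


x = torch.randn(1, 32, 28, 28)
with torch.no_grad():
    same = InvertedResidual(32, 32).eval()(x)
    down = InvertedResidual(32, 64, stride=2).eval()(x)
print(same.shape, down.shape)  # [1, 32, 28, 28] and [1, 64, 14, 14]
```

The first block keeps the shape and therefore adds the input back; the second changes both stride and channels, so it skips the residual.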
torchvision MobileNetV2 / V3
from torchvision.models import (
mobilenet_v2, mobilenet_v3_small, mobilenet_v3_large,
MobileNet_V2_Weights, MobileNet_V3_Small_Weights, MobileNet_V3_Large_Weights,
)
m2 = mobilenet_v2(weights=MobileNet_V2_Weights.IMAGENET1K_V1).eval()
m3s = mobilenet_v3_small(weights=MobileNet_V3_Small_Weights.IMAGENET1K_V1).eval()
tf2 = MobileNet_V2_Weights.IMAGENET1K_V1.transforms()
tf3 = MobileNet_V3_Small_Weights.IMAGENET1K_V1.transforms()
Classification
from PIL import Image
import torch
img = tf2(Image.open("cat.jpg").convert("RGB")).unsqueeze(0)
with torch.no_grad():
    logits = m2(img)
idx = int(logits.argmax(1))
print(MobileNet_V2_Weights.IMAGENET1K_V1.meta["categories"][idx])
Feature vector (before classifier)
# MobileNetV2: features end before classifier
feat_net = nn.Sequential(m2.features, nn.AdaptiveAvgPool2d(1), nn.Flatten(1))
with torch.no_grad():
    emb = feat_net(img)
print(emb.shape)
V3 small: same idea
img3 = tf3(Image.open("cat.jpg").convert("RGB")).unsqueeze(0)
feat3 = nn.Sequential(m3s.features, m3s.avgpool, nn.Flatten(1))
with torch.no_grad():
    e3 = feat3(img3)
Replace classifier head
num_classes = 10
in_f = m2.classifier[1].in_features
m2.classifier[1] = nn.Linear(in_f, num_classes)
Width multiplier & resolution
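Papers scale channel width and input resolution to trade accuracy against latency. In torchvision, either pick a different weights enum, or instantiate the model without pretrained weights and pass width_mult where the constructor exposes it (the API varies by version). Pretrained checkpoints only load into the architecture they were trained for, so a non-default width generally means training from scratch.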
Takeaways
- Depthwise + pointwise costs far fewer FLOPs than one full conv: roughly 1/C_out + 1/k² of the multiply-adds for a k×k kernel.
- V2: inverted residual + linear bottleneck.
- V3: tuned for mobile with h-swish / SE-style blocks.
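The first takeaway can be checked with a small count. The sketch below compares weight counts for one illustrative layer size (32 input channels, 64 output channels, 3×3 kernel, all chosen arbitrarily); at a fixed output resolution, multiply-adds scale by the same ratio.

```python
import torch.nn as nn


def n_weights(*layers):
    # Sum of parameter counts across the given layers.
    return sum(p.numel() for layer in layers for p in layer.parameters())


c_in, c_out, k = 32, 64, 3
full = nn.Conv2d(c_in, c_out, k, padding=1, bias=False)
dw = nn.Conv2d(c_in, c_in, k, padding=1, groups=c_in, bias=False)
pw = nn.Conv2d(c_in, c_out, 1, bias=False)

full_w = n_weights(full)    # 32 * 64 * 9
sep_w = n_weights(dw, pw)   # 32 * 9 + 32 * 64
print(full_w, sep_w)        # 18432 2336
print(sep_w / full_w)       # about 0.127, i.e. 1/64 + 1/9
```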