R-CNN (2014)
Selective Search (or a similar method) proposes ~2k region boxes per image. Each crop is warped to a fixed size, passed through a CNN (e.g. AlexNet/VGG) to get a feature vector, then classified by class-specific SVMs; a separate box regressor refines coordinates. Problem: thousands of CNN forward passes per image make inference very slow, and training is multi-stage (fine-tune the CNN, then train the SVMs and box regressors separately).
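The cost of the original pipeline can be seen in a tiny sketch: one CNN forward pass per warped proposal crop. Sizes and the `backbone` network below are illustrative stand-ins, not the paper's AlexNet.

```python
import torch
import torch.nn as nn

# Illustrative stand-in backbone (not AlexNet): maps a crop to a feature vector.
backbone = nn.Sequential(
    nn.Conv2d(3, 8, 3, stride=2, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),  # -> 8-dim feature vector
)

# 16 stand-in proposals, each warped to a fixed 64x64 size.
crops = torch.randn(16, 3, 64, 64)

# One forward pass per crop: cost scales linearly with the number of proposals.
feats = torch.stack([backbone(c.unsqueeze(0)).squeeze(0) for c in crops])
print(feats.shape)  # one feature vector per proposal
```

With ~2k real Selective Search proposals, that loop becomes ~2k full CNN passes per image, which is exactly what Fast R-CNN removes.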
Fast R-CNN
Run the CNN once on the full image to get a feature map. Project each proposal onto the map and apply ROI Pooling to extract a fixed-size feature vector per box, then classify and regress in parallel branches. The network trains jointly, though proposals still come from an external algorithm. ROI Pool quantizes box coordinates to discrete grid cells, causing small misalignments.
# Concept: feature map stride e.g. 16 — map (x,y,w,h) from image to grid coords
# ROI Pool divides each ROI into k×k bins and max-pools inside each bin
Faster R-CNN + RPN
The Region Proposal Network (RPN) slides a small network over the convolutional feature map, predicting objectness and box deltas for anchors (reference boxes at multiple scales/aspect ratios). Positive anchors match ground-truth with sufficient IoU; negatives are background. RPN and detection heads share features—end-to-end trainable with alternating or joint optimization.
Anchors
Template boxes at each spatial location; the network predicts offsets and scores vs each anchor.
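A minimal sketch of anchor generation at a single feature-map location; the scales and ratios below are the classic 3×3 configuration from the paper, and the ratio is interpreted as w/h here (conventions vary across libraries).

```python
import torch

def anchors_at(cx, cy, scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Anchors (cx, cy, w, h) centered at one location: 3 scales x 3 ratios."""
    out = []
    for s in scales:
        for r in ratios:
            w = s * (r ** 0.5)  # keep area ~ s^2 while varying aspect ratio
            h = s / (r ** 0.5)  # r is w/h under this convention
            out.append([cx, cy, w, h])
    return torch.tensor(out)

a = anchors_at(8.0, 8.0)
print(a.shape)  # 9 anchors per spatial location
```

The RPN then predicts, for each of these 9 templates, an objectness score plus four offsets that deform the anchor toward a ground-truth box.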
FPN
Feature Pyramid Networks add a top-down pathway with lateral connections, producing multi-scale feature maps; they are standard in modern Faster R-CNN variants and especially help with small objects.
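The top-down merge at the heart of FPN is small enough to sketch: upsample the coarser level and add a 1×1-projected lateral connection from the finer backbone stage. Channel counts and stage names below are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

c4 = torch.randn(1, 512, 32, 32)   # finer backbone stage (illustrative)
c5 = torch.randn(1, 1024, 16, 16)  # coarser backbone stage

# 1x1 lateral convs project both stages to a common channel width.
lat4, lat5 = nn.Conv2d(512, 256, 1), nn.Conv2d(1024, 256, 1)

p5 = lat5(c5)
# Upsample the coarse pyramid level and add the lateral from the finer stage.
p4 = lat4(c4) + F.interpolate(p5, scale_factor=2, mode="nearest")
print(p4.shape, p5.shape)
```

Repeating this merge down the backbone yields a pyramid of semantically strong maps at every scale, and the RPN runs on all of them.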
ROIAlign and Mask R-CNN
ROIAlign uses bilinear interpolation to sample features at continuous locations—no harsh quantization—critical for pixel masks. Mask R-CNN adds a parallel mask head (see the instance segmentation chapter).
Using Faster R-CNN in PyTorch
import torch
from torchvision.models.detection import fasterrcnn_resnet50_fpn
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor

def get_model(num_classes):
    model = fasterrcnn_resnet50_fpn(weights="DEFAULT")
    in_features = model.roi_heads.box_predictor.cls_score.in_features
    model.roi_heads.box_predictor = FastRCNNPredictor(in_features, num_classes)
    return model

# num_classes = 2 for 1 foreground class + background
model = get_model(num_classes=2)
Pass a list of image tensors to model(images); during training, also pass a parallel list of target dicts with per-image boxes and labels.
Beyond the original paper
Cascade R-CNN stacks multiple heads with increasing IoU thresholds to progressively refine hard positives. HTC and DetectoRS add segmentation context. torchvision, Detectron2, and MMDetection all ship production-ready configs.
Takeaways
- R-CNN → shared features (Fast) → learned proposals (Faster).
- RPN + anchors remain the template for many two-stage systems.
- ROIAlign fixes alignment for detection and especially for masks.