Computer Vision Chapter 19

R-CNN family

The R-CNN line shaped modern two-stage detection: start from region proposals, then run a strong classifier on cropped features. Each generation fixed the previous bottleneck—shared computation (Fast R-CNN), learned proposals (Faster R-CNN), and aligned feature sampling (ROIAlign for masks). Understanding this arc clarifies why today’s detectors still use “proposal + refine” ideas inside transformer heads.

R-CNN (2014)

Selective Search (or similar) proposes ~2k region boxes per image. Each crop is warped, passed through a CNN (e.g. AlexNet/VGG) to get a feature vector, then classified by class-specific SVMs. Box regression refines coordinates. Problem: thousands of forward passes per image—very slow; training is multi-stage.
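A minimal sketch of the per-proposal preprocessing described above (the `warp_crop` helper is hypothetical, and nearest-neighbor resampling stands in for the paper's warp): each proposal box is cropped and warped to a fixed CNN input size, so every proposal costs one full forward pass.

```python
import numpy as np

def warp_crop(image, box_xyxy, size=224):
    """R-CNN-style preprocessing sketch: crop a proposal box and warp it
    to a fixed CNN input size via nearest-neighbor resampling."""
    x1, y1, x2, y2 = box_xyxy
    crop = image[y1:y2, x1:x2]
    ys = np.linspace(0, crop.shape[0] - 1, size).round().astype(int)
    xs = np.linspace(0, crop.shape[1] - 1, size).round().astype(int)
    return crop[np.ix_(ys, xs)]  # keeps the channel axis intact

image = np.random.rand(480, 640, 3)
proposals = [(10, 20, 200, 260), (300, 100, 460, 400)]
batch = [warp_crop(image, b) for b in proposals]  # one CNN pass each
print(batch[0].shape)  # (224, 224, 3)
```

With ~2k proposals per image, this list comprehension becomes ~2k independent CNN forward passes, which is exactly the bottleneck Fast R-CNN removes.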

Fast R-CNN

Run the CNN once on the full image to get a feature map. Project each proposal onto the map and apply ROI Pooling to extract a fixed-size feature vector per box, then classify and regress in parallel branches. Training is joint, though proposals still come from an external method. ROI Pool quantizes coordinates to discrete cells, causing small misalignments.

# Concept: with feature-map stride s (e.g. 16), map an image-space ROI
# (x, y, w, h) to grid coordinates by dividing each value by s.
# ROI Pool then divides each ROI into k×k bins and max-pools inside each bin.
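The two comment lines above can be fleshed out into a runnable sketch (a simplified `roi_pool` assuming a single ROI and a plain numpy feature map, not the torchvision op). Note the `round()` calls: that quantization is the misalignment ROIAlign later removes.

```python
import numpy as np

def roi_pool(feature_map, roi_xyxy, stride=16, k=2):
    """Quantized ROI Pooling sketch: map an image-space box onto the
    feature grid, split it into k x k bins, and max-pool each bin.
    feature_map: (C, H, W); roi_xyxy: (x1, y1, x2, y2) in image pixels."""
    # Quantize the ROI to feature-grid cells (source of misalignment).
    x1, y1, x2, y2 = [int(round(c / stride)) for c in roi_xyxy]
    x2, y2 = max(x2, x1 + 1), max(y2, y1 + 1)  # keep at least one cell
    out = np.zeros((feature_map.shape[0], k, k), dtype=feature_map.dtype)
    xs = np.linspace(x1, x2, k + 1).round().astype(int)
    ys = np.linspace(y1, y2, k + 1).round().astype(int)
    for i in range(k):
        for j in range(k):
            bin_ = feature_map[:, ys[i]:max(ys[i + 1], ys[i] + 1),
                                  xs[j]:max(xs[j + 1], xs[j] + 1)]
            out[:, i, j] = bin_.max(axis=(1, 2))  # max-pool inside the bin
    return out

fmap = np.arange(2 * 8 * 8, dtype=np.float32).reshape(2, 8, 8)
pooled = roi_pool(fmap, (16, 16, 112, 112), stride=16, k=2)
print(pooled.shape)  # (2, 2, 2)
```

Every ROI, regardless of size, yields the same (C, k, k) output, which is what lets the classification and regression branches use fixed-size fully connected layers.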

Faster R-CNN + RPN

The Region Proposal Network (RPN) slides a small network over the convolutional feature map, predicting objectness and box deltas for anchors (reference boxes at multiple scales/aspect ratios). Positive anchors match ground-truth with sufficient IoU; negatives are background. RPN and detection heads share features—end-to-end trainable with alternating or joint optimization.
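The "box deltas" mentioned above use the standard Faster R-CNN parameterization: center offsets normalized by anchor size, plus log-scale width/height ratios. A minimal numpy sketch of the encode/decode round trip (boxes as center-format (cx, cy, w, h)):

```python
import numpy as np

def encode(gt, anchor):
    """Ground-truth box -> regression targets relative to an anchor."""
    gx, gy, gw, gh = gt
    ax, ay, aw, ah = anchor
    return np.array([(gx - ax) / aw, (gy - ay) / ah,
                     np.log(gw / aw), np.log(gh / ah)])

def decode(deltas, anchor):
    """Predicted deltas + anchor -> absolute box (inverse of encode)."""
    tx, ty, tw, th = deltas
    ax, ay, aw, ah = anchor
    return np.array([ax + tx * aw, ay + ty * ah,
                     aw * np.exp(tw), ah * np.exp(th)])

anchor = (50.0, 50.0, 32.0, 64.0)
gt = (58.0, 44.0, 40.0, 56.0)
deltas = encode(gt, anchor)
recovered = decode(deltas, anchor)
print(recovered)  # ~ [58. 44. 40. 56.]
```

Normalizing by anchor size makes the targets roughly scale-invariant, so one small regression head can serve anchors of all sizes.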

Anchors

Template boxes at each spatial location; the network predicts offsets and scores vs each anchor.
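Generating the templates for one location is straightforward. A sketch using the scales and aspect ratios from the original paper (3 × 3 = 9 anchors per location, equal area per scale):

```python
import numpy as np

def make_anchors(scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Anchor (w, h) templates for one spatial location: every
    scale x ratio combination, with area preserved per scale."""
    anchors = []
    for s in scales:
        for r in ratios:
            w = s * np.sqrt(1.0 / r)  # wider when ratio h/w < 1
            h = s * np.sqrt(r)
            anchors.append((w, h))
    return np.array(anchors)

templates = make_anchors()
print(templates.shape)  # (9, 2)
```

These templates are then tiled over every feature-map cell (offset by the stride), so the RPN head predicts 9 objectness scores and 9 × 4 deltas at each location.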

FPN

Feature Pyramid Networks add top-down pathways—standard in modern Faster R-CNN variants for small objects.
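One top-down step of that pathway can be sketched in a few lines (assuming the lateral map has already been 1×1-projected to the shared channel width; nearest-neighbor stands in for the upsampling):

```python
import numpy as np

def topdown_merge(coarse, lateral):
    """One FPN top-down step: 2x nearest-neighbor upsample of the
    coarser map, then element-wise add of the lateral map.
    Both are (C, H, W); lateral has twice the spatial size of coarse."""
    up = coarse.repeat(2, axis=1).repeat(2, axis=2)  # nearest upsample
    return up + lateral

c5 = np.ones((256, 4, 4), dtype=np.float32)        # coarsest level
c4_lateral = np.full((256, 8, 8), 0.5, np.float32)  # projected lateral
p4 = topdown_merge(c5, c4_lateral)
print(p4.shape)  # (256, 8, 8)
```

Repeating this step down the backbone yields semantically strong features at every resolution, which is why small objects benefit most.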

ROIAlign and Mask R-CNN

ROIAlign uses bilinear interpolation to sample features at continuous locations—no harsh quantization—critical for pixel masks. Mask R-CNN adds a parallel mask head (see the instance segmentation chapter).
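The core operation is sampling a feature map at a continuous (non-integer) location. A minimal single-point sketch (real ROIAlign averages several such samples per bin):

```python
import numpy as np

def bilinear_sample(fmap, x, y):
    """Sample a (H, W) feature map at continuous (x, y) by bilinear
    interpolation: no rounding, so gradients flow to all four neighbors."""
    x0, y0 = int(np.floor(x)), int(np.floor(y))
    x1 = min(x0 + 1, fmap.shape[1] - 1)
    y1 = min(y0 + 1, fmap.shape[0] - 1)
    dx, dy = x - x0, y - y0
    top = fmap[y0, x0] * (1 - dx) + fmap[y0, x1] * dx
    bot = fmap[y1, x0] * (1 - dx) + fmap[y1, x1] * dx
    return top * (1 - dy) + bot * dy

fmap = np.array([[0.0, 1.0], [2.0, 3.0]])
print(bilinear_sample(fmap, 0.5, 0.5))  # 1.5
```

Because no coordinate is rounded, a box shifted by half a pixel produces a smoothly shifted feature, which is exactly the sub-pixel fidelity masks need.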

Using Faster R-CNN in PyTorch

import torch
from torchvision.models.detection import fasterrcnn_resnet50_fpn
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor

def get_model(num_classes):
    # Load a COCO-pretrained detector, then swap the box-predictor head
    # so it outputs scores/boxes for num_classes (background included).
    model = fasterrcnn_resnet50_fpn(weights="DEFAULT")
    in_features = model.roi_heads.box_predictor.cls_score.in_features
    model.roi_heads.box_predictor = FastRCNNPredictor(in_features, num_classes)
    return model

# num_classes = 2 for 1 foreground class + background
model = get_model(num_classes=2)

Pass a list of image tensors to model(images). During training, also pass a per-image list of target dicts containing boxes ([N, 4] tensors in (x1, y1, x2, y2) pixel coordinates) and labels ([N] int64 tensors); the model then returns a dict of losses instead of predictions.
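The input/target format can be sketched without running the model (assumes torch is installed; label 0 is reserved for background in torchvision's convention):

```python
import torch

# A batch is a plain list of 3xHxW float tensors; sizes may differ.
images = [torch.rand(3, 480, 640), torch.rand(3, 600, 800)]

# One target dict per image: boxes in (x1, y1, x2, y2) pixel coords,
# labels as int64 class ids (0 is background, so classes start at 1).
targets = [
    {"boxes": torch.tensor([[30.0, 40.0, 200.0, 220.0]]),
     "labels": torch.tensor([1], dtype=torch.int64)},
    {"boxes": torch.tensor([[10.0, 10.0, 100.0, 120.0],
                            [250.0, 300.0, 400.0, 450.0]]),
     "labels": torch.tensor([1, 1], dtype=torch.int64)},
]

# Training:  losses = model(images, targets)  -> dict of loss tensors
# Inference: model.eval(); preds = model(images)  -> list of dicts
print(targets[0]["boxes"].shape)  # torch.Size([1, 4])
```

Each prediction dict at inference time mirrors the target format, with an extra scores tensor per box.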

Beyond the original paper

Cascade R-CNN stacks multiple detection heads with increasing IoU thresholds to progressively refine hard positives. HTC and DetectoRS add segmentation context. Libraries such as torchvision, Detectron2, and MMDetection ship production-ready configs.

Takeaways

  • R-CNN → shared features (Fast) → learned proposals (Faster).
  • RPN + anchors remain the template for many two-stage systems.
  • ROIAlign fixes alignment for detection and especially for masks.

Quick FAQ

Should you still use Faster R-CNN? Yes, for accuracy-first batch jobs, custom small datasets, or when transformer compute is too heavy. YOLO-class models often win on speed; hybrid and DETR-style models compete on accuracy.

Are anchors still necessary? Research systems replace them with center points or queries (e.g. DETR, Sparse R-CNN), but the concepts of proposal quality and box regression still apply.