Computer Vision Chapter 8

Two-Stage Object Detection

Detection fundamentals and the R-CNN family—region proposals, Fast R-CNN, and Faster R-CNN.

Object detection (intro)

Bounding boxes and scores

A box is usually stored as (x_min, y_min, x_max, y_max) in pixel coordinates, or as a center (cx, cy) plus width and height. Each prediction carries class probabilities (or logits) and, in some architectures, an objectness score. Post-processing then suppresses overlapping duplicate predictions.
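Converting between the two box formats is a frequent source of bugs, so a small sketch may help. The function names here are hypothetical helpers, not part of any library.

```python
def xyxy_to_cxcywh(box):
    """Convert (x_min, y_min, x_max, y_max) to (cx, cy, w, h)."""
    x1, y1, x2, y2 = box
    return ((x1 + x2) / 2.0, (y1 + y2) / 2.0, x2 - x1, y2 - y1)


def cxcywh_to_xyxy(box):
    """Convert (cx, cy, w, h) back to (x_min, y_min, x_max, y_max)."""
    cx, cy, w, h = box
    return (cx - w / 2.0, cy - h / 2.0, cx + w / 2.0, cy + h / 2.0)
```

Round-tripping a box through both functions should return the original coordinates, which makes a convenient sanity check when mixing datasets that use different conventions.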

IoU and non-maximum suppression

Intersection over Union (IoU) measures overlap between two boxes on [0, 1]. It gates “is this detection a match to ground truth?” during evaluation and training (e.g. assign anchors to targets).

def box_iou(a, b):
    # a,b = (x1,y1,x2,y2)
    xi1, yi1 = max(a[0], b[0]), max(a[1], b[1])
    xi2, yi2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, xi2 - xi1) * max(0, yi2 - yi1)
    aa = (a[2]-a[0])*(a[3]-a[1])
    bb = (b[2]-b[0])*(b[3]-b[1])
    return inter / (aa + bb - inter + 1e-6)

NMS keeps the highest-scoring box and discards others of the same class with IoU above a threshold (e.g. 0.5), repeating until the list is exhausted—this removes duplicate boxes on one object.
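The greedy procedure described above can be sketched in plain Python. This is a minimal illustration that assumes all boxes belong to one class (run it once per class otherwise); the name nms and its parameters are chosen for this example, and box_iou is repeated so the block runs on its own.

```python
def box_iou(a, b):
    # a, b = (x1, y1, x2, y2)
    xi1, yi1 = max(a[0], b[0]), max(a[1], b[1])
    xi2, yi2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, xi2 - xi1) * max(0, yi2 - yi1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-6)


def nms(boxes, scores, iou_threshold=0.5):
    """Greedy NMS for a single class; returns indices of kept boxes."""
    # Candidate indices, highest score first.
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        # Drop every remaining box that overlaps the kept box too much.
        order = [i for i in order
                 if box_iou(boxes[best], boxes[i]) <= iou_threshold]
    return keep
```

For example, two heavily overlapping boxes and one distant box should reduce to the top-scoring box of the overlapping pair plus the distant one. Production code would normally use a vectorized implementation rather than this quadratic loop.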

mAP and precision–recall

For each class, sort predictions by score; at each threshold compute precision and recall vs ground truth (matched by IoU ≥ 0.5 for COCO “AP50”). Average Precision (AP) is the area under the precision–recall curve. mAP averages AP over classes (and sometimes over IoU thresholds, e.g. COCO AP@[.5:.95]). Higher mAP = better overall detection quality.
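As a rough sketch of the AP computation for one class, the function below assumes the IoU matching has already been done, so each prediction arrives with a true-positive flag. It uses all-point interpolation (area under the monotone precision envelope), which is a simplification: COCO samples the curve at 101 recall points instead. The name average_precision and its arguments are hypothetical, and num_gt is assumed to be greater than zero.

```python
def average_precision(scores, is_tp, num_gt):
    """AP for one class from per-prediction scores and TP flags."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    tp = fp = 0
    precisions, recalls = [], []
    for i in order:
        if is_tp[i]:
            tp += 1
        else:
            fp += 1
        precisions.append(tp / (tp + fp))
        recalls.append(tp / num_gt)

    # Precision envelope: make precision non-increasing as recall grows.
    for k in range(len(precisions) - 2, -1, -1):
        precisions[k] = max(precisions[k], precisions[k + 1])

    # Sum rectangle areas under the envelope.
    ap, prev_recall = 0.0, 0.0
    for p, r in zip(precisions, recalls):
        ap += (r - prev_recall) * p
        prev_recall = r
    return ap
```

mAP would then be the mean of this value over classes, and for COCO-style AP@[.5:.95] also over the IoU thresholds used for matching.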

Two-stage vs one-stage

Two-stage (e.g. R-CNN family)

First propose regions, then classify and refine boxes. Often more accurate, slower per image.

One-stage (e.g. YOLO, SSD, RetinaNet)

Dense predictions over a grid or anchors in one forward pass—favored for real-time and embedded.

Example: Faster R-CNN (torchvision)

import torch
import torchvision.transforms as T
from torchvision.models.detection import fasterrcnn_resnet50_fpn
from PIL import Image

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = fasterrcnn_resnet50_fpn(weights="DEFAULT").to(device).eval()

img = Image.open("street.jpg").convert("RGB")
x = T.functional.to_tensor(img).to(device)

with torch.no_grad():
    r = model([x])[0]

for i in range(len(r["scores"])):
    if r["scores"][i] < 0.5:
        continue
    box = r["boxes"][i].tolist()
    label = int(r["labels"][i])
    score = float(r["scores"][i])
    # draw box with PIL/OpenCV using COCO label names

Training requires a Dataset that returns an image tensor and a target dict with boxes, labels, and image_id; see the torchvision detection reference scripts.

Data and deployment

Strong augmentations (mosaic, mixup, random crop) are common for one-stage detectors. For deployment, export to ONNX or TensorRT, quantize to INT8 where accuracy allows, and batch inputs for throughput.

Takeaways

  • Detection = where + what for multiple objects.
  • IoU drives training assignment and evaluation matching; NMS handles inference cleanup.
  • Compare models with mAP on the same benchmark and IoU rules.

Quick FAQ

To reduce duplicate boxes on one object, lower the NMS IoU threshold, raise the score threshold, or use Soft-NMS-style variants. Duplicate predictions usually mean NMS is too loose rather than that the model itself is wrong.

Detectors can also run without PyTorch at runtime: load ONNX or YOLO weights with cv2.dnn from C++ or Python. Pre- and post-processing must match the exporter.

R-CNN family

R-CNN (2014)

Selective Search (or similar) proposes ~2k region boxes per image. Each crop is warped, passed through a CNN (e.g. AlexNet/VGG) to get a feature vector, then classified by class-specific SVMs. Box regression refines coordinates. Problem: thousands of forward passes per image—very slow; training is multi-stage.

Fast R-CNN

Run the CNN once on the full image to get a feature map. Project each proposal onto the map and apply ROI Pooling to extract a fixed-size feature per box, then classify and regress in parallel branches. The classification and regression heads train jointly with the backbone, although proposals still come from an external method. ROI Pooling quantizes coordinates to discrete cells, which causes small misalignments.

# Concept: feature map stride e.g. 16 — map (x,y,w,h) from image to grid coords
# ROI Pool divides each ROI into k×k bins and max-pools inside each bin
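The two comments above can be turned into a small runnable sketch. It works on a single-channel feature map stored as a list of lists, and it assumes floor/ceil quantization of the ROI edges; real implementations differ in their exact rounding. The name roi_max_pool and its parameters are hypothetical.

```python
import math


def roi_max_pool(feature, roi, stride=16, k=2):
    """Max-pool one ROI (image coords, x1,y1,x2,y2) into a k x k grid."""
    height, width = len(feature), len(feature[0])
    # Quantize image coordinates to feature-map cells (the lossy step).
    x1 = min(max(math.floor(roi[0] / stride), 0), width - 1)
    y1 = min(max(math.floor(roi[1] / stride), 0), height - 1)
    x2 = min(max(math.ceil(roi[2] / stride), x1 + 1), width)
    y2 = min(max(math.ceil(roi[3] / stride), y1 + 1), height)
    roi_w, roi_h = x2 - x1, y2 - y1

    pooled = []
    for by in range(k):
        ys = y1 + math.floor(by * roi_h / k)
        ye = y1 + math.ceil((by + 1) * roi_h / k)
        row = []
        for bx in range(k):
            xs = x1 + math.floor(bx * roi_w / k)
            xe = x1 + math.ceil((bx + 1) * roi_w / k)
            # Max over all cells that fall inside this bin.
            row.append(max(feature[y][x]
                           for y in range(ys, ye)
                           for x in range(xs, xe)))
        pooled.append(row)
    return pooled
```

Whatever the ROI size, the output is always k x k, which is what lets the fully connected heads accept arbitrary proposals.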

Faster R-CNN + RPN

The Region Proposal Network (RPN) slides a small network over the convolutional feature map, predicting objectness and box deltas for anchors (reference boxes at multiple scales and aspect ratios). Anchors whose IoU with a ground-truth box is high enough are positives (above 0.7 in the original paper); anchors with low IoU against every ground-truth box (below 0.3) are background negatives, and the rest are ignored. The RPN and detection heads share features, so the whole system is trainable end to end with alternating or joint optimization.
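A minimal sketch of how anchors might be labeled for RPN training follows. The 0.7 and 0.3 thresholds are the values from the original Faster R-CNN paper; the function name label_anchors is hypothetical, and the paper's extra rule (also marking the best-matching anchor for each ground-truth box as positive) is left out for brevity.

```python
def box_iou(a, b):
    # a, b = (x1, y1, x2, y2)
    xi1, yi1 = max(a[0], b[0]), max(a[1], b[1])
    xi2, yi2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, xi2 - xi1) * max(0, yi2 - yi1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-6)


def label_anchors(anchors, gt_boxes, pos_iou=0.7, neg_iou=0.3):
    """Return 1 (object), 0 (background), or -1 (ignored) per anchor."""
    labels = []
    for anchor in anchors:
        # Best overlap of this anchor with any ground-truth box.
        best = max((box_iou(anchor, gt) for gt in gt_boxes), default=0.0)
        if best >= pos_iou:
            labels.append(1)
        elif best < neg_iou:
            labels.append(0)
        else:
            labels.append(-1)  # ambiguous overlap: excluded from the loss
    return labels
```

Anchors labeled -1 contribute nothing to the objectness loss, which keeps ambiguous overlaps from pushing the classifier in either direction.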

Anchors

Template boxes at each spatial location; the network predicts offsets and scores vs each anchor.
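To make the anchor idea concrete, the sketch below generates anchors for a feature grid and decodes predicted offsets back into a box. The default scales and ratios follow the original Faster R-CNN paper, and the decoding uses the standard R-CNN parameterization (center shift scaled by anchor size, log-space width and height). Placing anchor centers at the middle of each cell is an assumption of this example; libraries differ on that convention, and both function names are hypothetical.

```python
import math


def make_anchors(feat_h, feat_w, stride=16,
                 scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """One (x1, y1, x2, y2) anchor per grid cell, scale, and ratio."""
    anchors = []
    for gy in range(feat_h):
        for gx in range(feat_w):
            # Assumed convention: anchor centered on the cell center.
            cx, cy = (gx + 0.5) * stride, (gy + 0.5) * stride
            for s in scales:
                for r in ratios:  # r = height / width, area stays s * s
                    w = s / math.sqrt(r)
                    h = s * math.sqrt(r)
                    anchors.append((cx - w / 2, cy - h / 2,
                                    cx + w / 2, cy + h / 2))
    return anchors


def decode_deltas(anchor, deltas):
    """Apply predicted (dx, dy, dw, dh) to an anchor, returning a box."""
    ax1, ay1, ax2, ay2 = anchor
    aw, ah = ax2 - ax1, ay2 - ay1
    acx, acy = ax1 + aw / 2, ay1 + ah / 2
    dx, dy, dw, dh = deltas
    cx, cy = acx + dx * aw, acy + dy * ah        # shift the center
    w, h = aw * math.exp(dw), ah * math.exp(dh)  # rescale the size
    return (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)
```

With three scales and three ratios, every grid location carries nine anchors, so a 2 x 3 feature map yields 54 candidates before any scoring or NMS.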

FPN

Feature Pyramid Networks (FPN) add a top-down pathway with lateral connections, producing feature maps at several scales. They are standard in modern Faster R-CNN variants and help most with small objects.

ROIAlign and Mask R-CNN

ROIAlign uses bilinear interpolation to sample features at continuous locations—no harsh quantization—critical for pixel masks. Mask R-CNN adds a parallel mask head (see the instance segmentation chapter).
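The core operation behind ROIAlign is bilinear sampling at a fractional location. The sketch below shows only that step on a single-channel feature map stored as a list of lists; the full operator also averages several sample points per bin. The name bilinear_sample is hypothetical, and coordinates are assumed to be in feature-map units.

```python
import math


def bilinear_sample(feature, x, y):
    """Interpolate feature[y][x] at a continuous (x, y) location."""
    height, width = len(feature), len(feature[0])
    # Clamp to the valid range so neighbors always exist.
    x = min(max(x, 0.0), width - 1.0)
    y = min(max(y, 0.0), height - 1.0)
    x0, y0 = math.floor(x), math.floor(y)
    x1, y1 = min(x0 + 1, width - 1), min(y0 + 1, height - 1)
    fx, fy = x - x0, y - y0  # fractional offsets inside the cell

    top = feature[y0][x0] * (1 - fx) + feature[y0][x1] * fx
    bottom = feature[y1][x0] * (1 - fx) + feature[y1][x1] * fx
    return top * (1 - fy) + bottom * fy
```

Because no coordinate is rounded, shifting an ROI by a fraction of a cell changes the sampled values smoothly, which is the property that matters for mask prediction.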

Using Faster R-CNN in PyTorch

import torch
from torchvision.models.detection import fasterrcnn_resnet50_fpn
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor

def get_model(num_classes):
    model = fasterrcnn_resnet50_fpn(weights="DEFAULT")
    in_features = model.roi_heads.box_predictor.cls_score.in_features
    model.roi_heads.box_predictor = FastRCNNPredictor(in_features, num_classes)
    return model

# num_classes = 2 for 1 foreground class + background
model = get_model(num_classes=2)

Pass a list of image tensors to model(images); targets during training are dicts with boxes and labels per image.

Beyond the original paper

Cascade R-CNN chains several detection heads trained with increasing IoU thresholds, each refining the boxes produced by the previous stage. HTC / DetectoRS add segmentation context. Libraries such as torchvision, Detectron2, and MMDetection ship production configs.

Takeaways

  • R-CNN → shared features (Fast) → learned proposals (Faster).
  • RPN + anchors remain the template for many two-stage systems.
  • ROIAlign fixes alignment for detection and especially for masks.

Quick FAQ

Two-stage detectors are still worth using for accuracy-first batch jobs, custom small datasets, or when transformer compute is too heavy. YOLO-class models often win on speed; hybrid and DETR-style models compete on accuracy.

Anchors are not the only option: research systems replace them with center points or queries (e.g. DETR, Sparse R-CNN). The concepts of proposal quality and box regression still apply.
