Two-Stage Object Detection

Object detection (intro)

Bounding boxes and scores

A box is often stored as (x_min, y_min, x_max, y_max) in pixel coordinates, or as a center (cx, cy) plus width and height. Each prediction carries class probabilities (or logits) and, in some architectures, an objectness score. Post-processing then removes redundant overlapping predictions.
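Converting between the two box formats mentioned above is a small but frequent source of bugs. The following is a minimal sketch; the function names are illustrative and not taken from any library.

```python
def xyxy_to_cxcywh(box):
    """(x_min, y_min, x_max, y_max) -> (cx, cy, w, h)."""
    x_min, y_min, x_max, y_max = box
    w = x_max - x_min
    h = y_max - y_min
    return (x_min + w / 2, y_min + h / 2, w, h)


def cxcywh_to_xyxy(box):
    """(cx, cy, w, h) -> (x_min, y_min, x_max, y_max)."""
    cx, cy, w, h = box
    return (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)
```

Round-tripping a box through both functions should return the original coordinates, which makes a convenient sanity check when mixing datasets and models that use different conventions.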

IoU and non-maximum suppression

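Intersection over Union (IoU) measures the overlap between two boxes as the area of their intersection divided by the area of their union, giving a value in [0, 1]. It decides whether a detection counts as a match to ground truth during evaluation, and it drives target assignment during training (e.g. assigning anchors to targets).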

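def box_iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    # Corners of the intersection rectangle
    xi1, yi1 = max(a[0], b[0]), max(a[1], b[1])
    xi2, yi2 = min(a[2], b[2]), min(a[3], b[3])
    # Clamp at 0 so disjoint boxes give zero intersection
    inter = max(0, xi2 - xi1) * max(0, yi2 - yi1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    # Small epsilon avoids division by zero for degenerate boxes
    return inter / (area_a + area_b - inter + 1e-6)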

NMS keeps the highest-scoring box and discards others of the same class with IoU above a threshold (e.g. 0.5), repeating until the list is exhausted—this removes duplicate boxes on one object.
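The greedy procedure described above can be sketched in plain Python. This is an illustrative, unoptimized version: it assumes all boxes passed in belong to the same class (the caller would run it once per class), and the helper names are hypothetical.

```python
def iou(a, b):
    """IoU of two (x1, y1, x2, y2) boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.5):
    """Return indices of the boxes kept by greedy NMS, highest score first."""
    # Candidate indices sorted by descending score
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)  # highest-scoring remaining box
        keep.append(best)
        # Drop every remaining box that overlaps the kept box too much
        order = [i for i in order if iou(boxes[best], boxes[i]) <= iou_threshold]
    return keep


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (50, 50, 60, 60)]
scores = [0.9, 0.8, 0.7]
kept = nms(boxes, scores)  # the second box overlaps the first and is suppressed
```

Production code would normally use a vectorized implementation instead of this quadratic loop; the sketch is only meant to make the keep-and-suppress logic explicit.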

mAP and precision–recall

For each class, sort predictions by score; at each score threshold, compute precision and recall against ground truth (a prediction matches at IoU ≥ 0.5 for COCO “AP50”). Average Precision (AP) is the area under the precision–recall curve. mAP averages AP over classes (and sometimes over IoU thresholds, e.g. COCO AP@[.5:.95]). A higher mAP indicates better overall detection quality.
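To make the AP definition concrete, here is a minimal sketch for a single class. It assumes each detection has already been marked as a true or false positive by IoU matching, and it uses all-point interpolation of the precision–recall curve; benchmark toolkits differ in the details (COCO, for example, samples precision at fixed recall levels), so treat this as an illustration rather than a reference implementation.

```python
def average_precision(detections, num_gt):
    """detections: list of (score, is_true_positive); num_gt: ground-truth count."""
    if num_gt == 0:
        return 0.0
    detections = sorted(detections, key=lambda d: d[0], reverse=True)
    tp = fp = 0
    precisions, recalls = [], []
    for _, is_tp in detections:
        if is_tp:
            tp += 1
        else:
            fp += 1
        precisions.append(tp / (tp + fp))
        recalls.append(tp / num_gt)
    # Precision envelope: make precision non-increasing as recall grows
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])
    # Sum rectangle areas under the interpolated curve
    ap, prev_recall = 0.0, 0.0
    for p, r in zip(precisions, recalls):
        ap += (r - prev_recall) * p
        prev_recall = r
    return ap


# Three detections (TP, FP, TP) against two ground-truth boxes
ap = average_precision([(0.9, True), (0.8, False), (0.7, True)], num_gt=2)
```

mAP would then be the mean of this value over all classes, optionally repeated and averaged over several IoU matching thresholds.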

Two-stage vs one-stage

Two-stage (e.g. R-CNN family)

First propose regions, then classify and refine boxes. Often more accurate, slower per image.

One-stage (e.g. YOLO, SSD, RetinaNet)

Dense predictions over a grid or anchors in one forward pass—favored for real-time and embedded.

Example: Faster R-CNN (torchvision)

import torch
import torchvision.transforms as T
from torchvision.models.detection import fasterrcnn_resnet50_fpn
from PIL import Image

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = fasterrcnn_resnet50_fpn(weights="DEFAULT").to(device).eval()

img = Image.open("street.jpg").convert("RGB")
x = T.functional.to_tensor(img).to(device)

with torch.no_grad():
    r = model([x])[0]

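for box, label, score in zip(r["boxes"], r["labels"], r["scores"]):
    if score < 0.5:  # drop low-confidence detections
        continue
    box = box.tolist()  # [x_min, y_min, x_max, y_max] in pixels
    label = int(label)  # index into the COCO category names
    score = float(score)
    print(label, f"{score:.2f}", box)  # or draw the box with PIL/OpenCV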

Training requires a Dataset that returns, for each sample, an image tensor and a target dict with boxes, labels, and image_id; see the torchvision detection reference.

Data and deployment

Strong augmentations (mosaic, mixup, random crop) are common for one-stage detectors. For deployment, export to ONNX or TensorRT, quantize to INT8 where accuracy allows, and batch inputs for throughput.

Takeaways

Detection = where + what for multiple objects.
IoU and NMS are core to both training assignment and inference cleanup.
Compare models with mAP on the same benchmark and IoU rules.

Quick FAQ

Many duplicate boxes?

Lower the NMS IoU threshold, raise the score threshold, or use softer-NMS variants. Duplicate predictions often point to an NMS setting that is too loose rather than to a problem with the model itself.

OpenCV DNN for detection?

Yes: load ONNX/YOLO weights with cv2.dnn for C++ or Python without PyTorch at runtime. Pre/post-processing must match the exporter.

R-CNN family

R-CNN (2014)

Selective Search (or similar) proposes ~2k region boxes per image. Each crop is warped, passed through a CNN (e.g. AlexNet/VGG) to get a feature vector, then classified by class-specific SVMs. Box regression refines coordinates. Problem: thousands of forward passes per image—very slow; training is multi-stage.

Fast R-CNN

Run the CNN once on the full image to get a feature map. Project each proposal onto the map and apply ROI Pooling to extract a fixed-size feature per box, then classify and regress in parallel branches. The network is trained jointly, although proposals still come from an external method. ROI Pool quantizes coordinates to discrete cells, causing small misalignments.

# Concept: feature map stride e.g. 16 — map (x,y,w,h) from image to grid coords
# ROI Pool divides each ROI into k×k bins and max-pools inside each bin
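The two comments above can be turned into a small pure-Python sketch on a single-channel feature map. The flooring of projected coordinates and the exact bin boundaries are assumptions chosen for clarity; real implementations differ in their rounding conventions, and the function name is hypothetical.

```python
def roi_max_pool(feature_map, roi, stride=16, k=2):
    """feature_map: H x W list of lists; roi: (x1, y1, x2, y2) in image pixels."""
    fh, fw = len(feature_map), len(feature_map[0])
    # Project to feature-map cells and quantize (this is the lossy step)
    x1 = min(max(int(roi[0] // stride), 0), fw - 1)
    y1 = min(max(int(roi[1] // stride), 0), fh - 1)
    x2 = min(max(int(roi[2] // stride), x1), fw - 1)
    y2 = min(max(int(roi[3] // stride), y1), fh - 1)
    w, h = x2 - x1 + 1, y2 - y1 + 1
    out = []
    for by in range(k):
        ys = y1 + (by * h) // k
        ye = y1 + ((by + 1) * h + k - 1) // k  # ceiling division
        row = []
        for bx in range(k):
            xs = x1 + (bx * w) // k
            xe = x1 + ((bx + 1) * w + k - 1) // k
            # Max over all cells that fall in this bin
            row.append(max(feature_map[y][x]
                           for y in range(ys, ye) for x in range(xs, xe)))
        out.append(row)
    return out


fmap = [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11], [12, 13, 14, 15]]
pooled = roi_max_pool(fmap, (0, 0, 63, 63))   # whole 64x64 image
shifted = roi_max_pool(fmap, (20, 20, 40, 40))  # corners snap to cells 1 and 2
```

In the second call the ROI corners at 20 and 40 pixels fall at 1.25 and 2.5 cells but are snapped to cells 1 and 2, which illustrates the misalignment that quantization introduces.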

Faster R-CNN + RPN

The Region Proposal Network (RPN) slides a small network over the convolutional feature map, predicting objectness and box deltas for anchors (reference boxes at multiple scales/aspect ratios). Positive anchors match ground-truth with sufficient IoU; negatives are background. RPN and detection heads share features—end-to-end trainable with alternating or joint optimization.
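Two pieces of the RPN description can be sketched without any deep learning library: labeling an anchor from its best IoU with ground truth, and turning predicted deltas into a box. The 0.7 / 0.3 thresholds follow the values commonly quoted for Faster R-CNN and are assumptions here, as are the function names. The delta parameterization (center offsets scaled by anchor size, log-scale width and height) is the standard one for the R-CNN family.

```python
import math


def label_anchor(max_iou, pos_thresh=0.7, neg_thresh=0.3):
    """Return 1 (object), 0 (background), or -1 (ignored during training)."""
    if max_iou >= pos_thresh:
        return 1
    if max_iou < neg_thresh:
        return 0
    return -1


def decode_deltas(anchor, deltas):
    """Apply (dx, dy, dw, dh) to an (x1, y1, x2, y2) anchor."""
    ax1, ay1, ax2, ay2 = anchor
    aw, ah = ax2 - ax1, ay2 - ay1
    acx, acy = ax1 + aw / 2, ay1 + ah / 2
    dx, dy, dw, dh = deltas
    cx = acx + dx * aw        # shift center by a fraction of anchor size
    cy = acy + dy * ah
    w = aw * math.exp(dw)     # scale width and height in log space
    h = ah * math.exp(dh)
    return (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)


same = decode_deltas((0, 0, 10, 10), (0, 0, 0, 0))      # zero deltas keep the anchor
moved = decode_deltas((0, 0, 10, 10), (0.5, 0, 0, 0))   # shift right by half a width
```

Anchors labeled -1 fall between the two thresholds and contribute nothing to the loss in this sketch.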

Anchors

Template boxes at each spatial location; the network predicts offsets and scores vs each anchor.
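As an illustration of template boxes at each spatial location, the sketch below enumerates anchors for a small feature map. The stride, scales, and aspect ratios are example values, and placing anchor centers at the middle of each cell is an assumption (some implementations use the cell's top-left corner instead).

```python
import math


def make_anchors(feat_h, feat_w, stride=16,
                 scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Return (x1, y1, x2, y2) anchors for every feature-map cell."""
    anchors = []
    for y in range(feat_h):
        for x in range(feat_w):
            # Anchor center in image pixels (assumed: middle of the cell)
            cx = (x + 0.5) * stride
            cy = (y + 0.5) * stride
            for s in scales:
                for r in ratios:
                    # r = height / width; the area stays close to s * s
                    w = s / math.sqrt(r)
                    h = s * math.sqrt(r)
                    anchors.append((cx - w / 2, cy - h / 2,
                                    cx + w / 2, cy + h / 2))
    return anchors


anchors = make_anchors(2, 3)  # 2 x 3 cells, 9 anchors per cell
```

The network then predicts one objectness score and one set of four box offsets for each of these anchors.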

FPN

Feature Pyramid Networks add a top-down pathway with lateral connections, producing feature maps at several scales. They are standard in modern Faster R-CNN variants and help most with small objects.

ROIAlign and Mask R-CNN

ROIAlign uses bilinear interpolation to sample features at continuous locations—no harsh quantization—critical for pixel masks. Mask R-CNN adds a parallel mask head (see the instance segmentation chapter).
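The core operation behind ROIAlign, sampling a feature map at a non-integer location, can be sketched as follows. This shows only the bilinear interpolation of a single point on a single-channel map; a full ROIAlign would average several such samples per output bin. The function name is hypothetical.

```python
import math


def bilinear_sample(feature_map, x, y):
    """Sample an H x W list-of-lists feature map at continuous (x, y)."""
    h, w = len(feature_map), len(feature_map[0])
    # Clamp to the valid range so the four neighbors always exist
    x = min(max(x, 0.0), w - 1)
    y = min(max(y, 0.0), h - 1)
    x0, y0 = int(math.floor(x)), int(math.floor(y))
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    fx, fy = x - x0, y - y0  # fractional offsets inside the cell
    top = feature_map[y0][x0] * (1 - fx) + feature_map[y0][x1] * fx
    bottom = feature_map[y1][x0] * (1 - fx) + feature_map[y1][x1] * fx
    return top * (1 - fy) + bottom * fy


fmap = [[0, 1], [2, 3]]
center = bilinear_sample(fmap, 0.5, 0.5)  # equal blend of all four cells
```

Because the sample location is never rounded, a small shift in the ROI produces a small change in the sampled value, which is the property mask prediction depends on.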

Using Faster R-CNN in PyTorch

import torch
from torchvision.models.detection import fasterrcnn_resnet50_fpn
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor

def get_model(num_classes):
    model = fasterrcnn_resnet50_fpn(weights="DEFAULT")
    in_features = model.roi_heads.box_predictor.cls_score.in_features
    model.roi_heads.box_predictor = FastRCNNPredictor(in_features, num_classes)
    return model

# num_classes = 2 for 1 foreground class + background
model = get_model(num_classes=2)

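Pass a list of image tensors to model(images). In training mode, call model(images, targets), where targets is a list with one dict per image holding boxes as (x1, y1, x2, y2) and labels; the model then returns a dict of losses instead of predictions.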

Beyond the original paper

Cascade R-CNN stacks multiple detection heads trained with increasing IoU thresholds, so each stage refines the boxes produced by the previous one. HTC / DetectoRS add segmentation context. Libraries such as torchvision, Detectron2, and MMDetection ship ready-to-use configs.

Takeaways

R-CNN → shared features (Fast) → learned proposals (Faster).
RPN + anchors remain the template for many two-stage systems.
ROIAlign fixes alignment for detection and especially for masks.

Quick FAQ

Still use Faster R-CNN in 2026?

Yes for accuracy-first batch jobs, custom small datasets, or when transformer compute is too heavy. YOLO-class models often win on speed; hybrid and DETR-style models compete on accuracy.

Anchor-free two-stage?

Research systems replace anchors with center points or queries (e.g. DETR, Sparse R-CNN). Concepts of proposal quality and box regression still apply.
