Object detection (intro)
Bounding boxes and scores
A box is often stored as (x_min, y_min, x_max, y_max) in pixel coordinates, or as a center (cx, cy) with width and height. Each prediction carries class probabilities (or logits) and, in some architectures, an objectness score. Post-processing then removes redundant overlapping predictions.
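The two box layouts described above can be converted into each other with a few lines of arithmetic. The following is a minimal sketch; the function names are illustrative and not taken from any particular library.

```python
def xyxy_to_cxcywh(box):
    """Convert (x_min, y_min, x_max, y_max) to (cx, cy, w, h)."""
    x_min, y_min, x_max, y_max = box
    w = x_max - x_min
    h = y_max - y_min
    return (x_min + w / 2.0, y_min + h / 2.0, w, h)


def cxcywh_to_xyxy(box):
    """Convert (cx, cy, w, h) back to (x_min, y_min, x_max, y_max)."""
    cx, cy, w, h = box
    return (cx - w / 2.0, cy - h / 2.0, cx + w / 2.0, cy + h / 2.0)


corner_box = (10.0, 20.0, 50.0, 80.0)
center_box = xyxy_to_cxcywh(corner_box)   # (30.0, 50.0, 40.0, 60.0)
round_trip = cxcywh_to_xyxy(center_box)   # same values as corner_box
```

Mixing up the two layouts is a common source of silent bugs, so it helps to name the format in variable names or docstrings.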
IoU and non-maximum suppression
Intersection over Union (IoU) is the area of overlap between two boxes divided by the area of their union, so it ranges from 0 (disjoint) to 1 (identical). During evaluation it decides whether a detection matches a ground-truth box, and during training it is used to assign anchors to targets.
def box_iou(a, b):
    # a, b = (x1, y1, x2, y2)
    xi1, yi1 = max(a[0], b[0]), max(a[1], b[1])
    xi2, yi2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, xi2 - xi1) * max(0, yi2 - yi1)
    aa = (a[2] - a[0]) * (a[3] - a[1])
    bb = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (aa + bb - inter + 1e-6)
NMS keeps the highest-scoring box and discards others of the same class with IoU above a threshold (e.g. 0.5), repeating until the list is exhausted—this removes duplicate boxes on one object.
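The greedy procedure just described can be sketched in plain Python. This is an illustrative, unoptimized version: the (box, score, label) tuple layout and the function names are assumptions made for the example, and production code would normally use a vectorized library routine instead.

```python
def iou(a, b):
    """IoU of two (x1, y1, x2, y2) boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def nms(detections, iou_threshold=0.5):
    """Greedy per-class NMS over a list of (box, score, label) tuples."""
    remaining = sorted(detections, key=lambda d: d[1], reverse=True)
    kept = []
    while remaining:
        best = remaining.pop(0)          # highest-scoring box still in play
        kept.append(best)
        # Drop same-class boxes that overlap the kept box too much.
        remaining = [
            d for d in remaining
            if d[2] != best[2] or iou(d[0], best[0]) <= iou_threshold
        ]
    return kept


dets = [
    ((10, 10, 50, 50), 0.9, "car"),
    ((12, 12, 52, 52), 0.8, "car"),       # near-duplicate of the first box
    ((100, 100, 140, 140), 0.7, "car"),   # a separate object
    ((11, 11, 51, 51), 0.6, "person"),    # overlaps, but a different class
]
kept = nms(dets)
kept_scores = [d[1] for d in kept]        # [0.9, 0.7, 0.6]
```

The 0.8 box is suppressed because its IoU with the 0.9 box is about 0.82, while the overlapping "person" box survives because suppression here is applied per class.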
mAP and precision–recall
For each class, sort predictions by score; at each threshold compute precision and recall vs ground truth (matched by IoU ≥ 0.5 for COCO “AP50”). Average Precision (AP) is the area under the precision–recall curve. mAP averages AP over classes (and sometimes over IoU thresholds, e.g. COCO AP@[.5:.95]). Higher mAP = better overall detection quality.
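As a rough illustration of the AP computation for one class, the sketch below assumes predictions are already sorted by descending score and already flagged as true or false positives by an IoU matching step that is not shown. It uses all-point interpolation of the precision-recall curve; benchmark toolkits differ in the exact interpolation (COCO, for instance, samples the curve at fixed recall levels), so the numbers will not necessarily match an official evaluator.

```python
def average_precision(is_true_positive, num_ground_truth):
    """Area under the precision-recall curve for a single class.

    is_true_positive: booleans for predictions sorted by descending score.
    num_ground_truth: number of ground-truth boxes of this class.
    """
    if num_ground_truth == 0:
        return 0.0
    precisions, recalls = [], []
    tp = fp = 0
    for flag in is_true_positive:
        if flag:
            tp += 1
        else:
            fp += 1
        precisions.append(tp / (tp + fp))
        recalls.append(tp / num_ground_truth)
    # Replace each precision by the best precision at equal or higher recall.
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])
    # Sum rectangle areas wherever recall increases.
    ap, prev_recall = 0.0, 0.0
    for p, r in zip(precisions, recalls):
        ap += (r - prev_recall) * p
        prev_recall = r
    return ap


flags = [True, False, True, True, False]   # 3 hits among 5 predictions
ap = average_precision(flags, num_ground_truth=4)   # 0.625
```

mAP would then be the mean of this value over all classes, and for COCO-style AP@[.5:.95] also over the IoU thresholds used for matching.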
Two-stage vs one-stage
Two-stage (e.g. R-CNN family)
First propose regions, then classify and refine boxes. Often more accurate, slower per image.
One-stage (e.g. YOLO, SSD, RetinaNet)
Dense predictions over a grid or anchors in one forward pass—favored for real-time and embedded.
Example: Faster R-CNN (torchvision)
import torch
import torchvision.transforms as T
from torchvision.models.detection import fasterrcnn_resnet50_fpn
from PIL import Image

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = fasterrcnn_resnet50_fpn(weights="DEFAULT").to(device).eval()
img = Image.open("street.jpg").convert("RGB")
x = T.functional.to_tensor(img).to(device)
with torch.no_grad():
    r = model([x])[0]
for i in range(len(r["scores"])):
    if r["scores"][i] < 0.5:
        continue
    box = r["boxes"][i].tolist()
    label = int(r["labels"][i])
    score = float(r["scores"][i])
    # draw box with PIL/OpenCV using COCO label names
Training requires a Dataset that returns an image tensor and a target dict with boxes, labels, and image_id; see the torchvision detection reference.
Data and deployment
Strong augmentations (mosaic, mixup, random crop) are common for one-stage detectors. For deployment, export to ONNX or TensorRT, quantize to INT8 where accuracy allows, and batch inputs for throughput.
Takeaways
- Detection = where + what for multiple objects.
- IoU drives training assignment and evaluation matching; NMS cleans up duplicate boxes at inference.
- Compare models with mAP on the same benchmark and IoU rules.
Quick FAQ
cv2.dnn for C++ or Python without PyTorch at runtime. Pre/post-processing must match the exporter.
R-CNN family
R-CNN (2014)
Selective Search (or similar) proposes ~2k region boxes per image. Each crop is warped, passed through a CNN (e.g. AlexNet/VGG) to get a feature vector, then classified by class-specific SVMs. Box regression refines coordinates. Problem: thousands of forward passes per image—very slow; training is multi-stage.
Fast R-CNN
Run the CNN once on the full image to get a feature map. Project each proposal onto the map and apply ROI Pooling to extract a fixed-size feature vector per box—then classify and regress in parallel branches. Training is joint (except proposals still external). ROI Pool quantizes coordinates to discrete cells, causing small misalignments.
# Concept: feature map stride e.g. 16 — map (x,y,w,h) from image to grid coords
# ROI Pool divides each ROI into k×k bins and max-pools inside each bin
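To make the ROI Pooling idea concrete, here is a small pure-Python sketch on a single-channel feature map stored as a nested list. The stride of 16, the 2×2 output size, and the function name are assumptions chosen for the example, and the ROI is assumed to lie fully inside the feature map. Real implementations operate on multi-channel tensors.

```python
import math


def roi_max_pool(feature_map, roi, stride=16, k=2):
    """Max-pool one ROI into a k x k grid.

    feature_map: 2D list indexed [row][col].
    roi: (x1, y1, x2, y2) in image pixels.
    """
    # Quantize image coordinates to feature-map cells. This rounding is the
    # source of the small misalignment mentioned above.
    x1, y1, x2, y2 = (int(round(c / stride)) for c in roi)
    roi_w = max(x2 - x1 + 1, 1)
    roi_h = max(y2 - y1 + 1, 1)
    pooled = []
    for by in range(k):
        y_start = y1 + math.floor(by * roi_h / k)
        y_end = y1 + math.ceil((by + 1) * roi_h / k)
        row = []
        for bx in range(k):
            x_start = x1 + math.floor(bx * roi_w / k)
            x_end = x1 + math.ceil((bx + 1) * roi_w / k)
            cells = [
                feature_map[y][x]
                for y in range(y_start, y_end)
                for x in range(x_start, x_end)
            ]
            row.append(max(cells))       # max-pool inside the bin
        pooled.append(row)
    return pooled


fmap = [
    [0, 1, 2, 3],
    [4, 5, 6, 7],
    [8, 9, 10, 11],
    [12, 13, 14, 15],
]
full = roi_max_pool(fmap, (0, 0, 48, 48))      # [[5, 7], [13, 15]]
partial = roi_max_pool(fmap, (16, 16, 48, 48)) # [[10, 11], [14, 15]]
```

In the second call the ROI covers 3×3 cells, which do not divide evenly into 2×2 bins, so neighbouring bins share a row and a column of cells.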
Faster R-CNN + RPN
The Region Proposal Network (RPN) slides a small network over the convolutional feature map, predicting objectness and box deltas for anchors (reference boxes at multiple scales/aspect ratios). Positive anchors match ground-truth with sufficient IoU; negatives are background. RPN and detection heads share features—end-to-end trainable with alternating or joint optimization.
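The "box deltas" predicted for each anchor are usually expressed relative to the anchor's center and size. The sketch below uses the center-offset and log-scale parameterization from the R-CNN line of work; the function names are illustrative, and real implementations often add per-coordinate scaling weights that are omitted here.

```python
import math


def _to_cxcywh(box):
    """(x1, y1, x2, y2) -> (cx, cy, w, h)."""
    x1, y1, x2, y2 = box
    w, h = x2 - x1, y2 - y1
    return x1 + w / 2.0, y1 + h / 2.0, w, h


def encode_deltas(anchor, gt):
    """Regression targets (tx, ty, tw, th) that move `anchor` onto `gt`."""
    ax, ay, aw, ah = _to_cxcywh(anchor)
    gx, gy, gw, gh = _to_cxcywh(gt)
    return ((gx - ax) / aw, (gy - ay) / ah, math.log(gw / aw), math.log(gh / ah))


def decode_deltas(anchor, deltas):
    """Apply predicted deltas to an anchor, returning (x1, y1, x2, y2)."""
    ax, ay, aw, ah = _to_cxcywh(anchor)
    tx, ty, tw, th = deltas
    cx, cy = ax + tx * aw, ay + ty * ah
    w, h = aw * math.exp(tw), ah * math.exp(th)
    return (cx - w / 2.0, cy - h / 2.0, cx + w / 2.0, cy + h / 2.0)


anchor = (0.0, 0.0, 100.0, 100.0)
gt = (10.0, 20.0, 90.0, 120.0)
deltas = encode_deltas(anchor, gt)        # (0.0, 0.2, log(0.8), 0.0)
recovered = decode_deltas(anchor, deltas) # approximately equal to gt
```

Normalizing by the anchor size keeps the targets in a similar numeric range for small and large anchors, which is one reason this form is widely used.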
Anchors
Template boxes at each spatial location; the network predicts offsets and scores vs each anchor.
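A minimal sketch of anchor generation follows. The stride, scales, and aspect ratios are example values, and centering anchors at (index + 0.5) × stride is an assumption; libraries differ on the exact center convention.

```python
import math


def generate_anchors(feat_h, feat_w, stride=16,
                     scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Return (x1, y1, x2, y2) anchors for every feature-map location."""
    anchors = []
    for gy in range(feat_h):
        for gx in range(feat_w):
            # Center of this feature-map cell in image pixels (assumed convention).
            cx = (gx + 0.5) * stride
            cy = (gy + 0.5) * stride
            for scale in scales:
                for ratio in ratios:
                    # ratio = height / width; the area stays scale ** 2.
                    w = scale / math.sqrt(ratio)
                    h = scale * math.sqrt(ratio)
                    anchors.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return anchors


anchors = generate_anchors(feat_h=2, feat_w=3)
num_anchors = len(anchors)   # 2 * 3 locations * 9 templates = 54
square = anchors[1]          # scale 128, ratio 1.0 at the first cell
```

Anchors near the border extend outside the image (note the negative coordinates of the first square anchor); how such anchors are handled during training varies between implementations.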
FPN
Feature Pyramid Networks add a top-down pathway with lateral connections, producing feature maps at several scales. They are standard in modern Faster R-CNN variants and help most with small objects.
ROIAlign and Mask R-CNN
ROIAlign uses bilinear interpolation to sample features at continuous locations—no harsh quantization—critical for pixel masks. Mask R-CNN adds a parallel mask head (see the instance segmentation chapter).
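The sampling step at the heart of ROIAlign can be sketched as a bilinear lookup on a single-channel feature map. This shows only the interpolation at one continuous location; a full ROIAlign would sample several such points per output bin and aggregate them. The function name and the nested-list feature map are assumptions made for the example.

```python
import math


def bilinear_sample(feature_map, y, x):
    """Interpolate feature_map (2D list, [row][col]) at continuous (y, x)."""
    h, w = len(feature_map), len(feature_map[0])
    # Clamp to the valid range so the four neighbours always exist.
    y = min(max(y, 0.0), h - 1)
    x = min(max(x, 0.0), w - 1)
    y0, x0 = int(math.floor(y)), int(math.floor(x))
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
    dy, dx = y - y0, x - x0
    # Interpolate along x on the two neighbouring rows, then along y.
    top = feature_map[y0][x0] * (1 - dx) + feature_map[y0][x1] * dx
    bottom = feature_map[y1][x0] * (1 - dx) + feature_map[y1][x1] * dx
    return top * (1 - dy) + bottom * dy


fmap = [
    [0.0, 1.0],
    [2.0, 3.0],
]
center = bilinear_sample(fmap, 0.5, 0.5)       # 1.5
off_grid = bilinear_sample(fmap, 0.25, 0.75)   # 1.25
```

Because no coordinate is rounded to a cell index, a small shift of the ROI produces a small change in the sampled value, which is what pixel-accurate masks need.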
Using Faster R-CNN in PyTorch
import torch
from torchvision.models.detection import fasterrcnn_resnet50_fpn
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor

def get_model(num_classes):
    model = fasterrcnn_resnet50_fpn(weights="DEFAULT")
    in_features = model.roi_heads.box_predictor.cls_score.in_features
    model.roi_heads.box_predictor = FastRCNNPredictor(in_features, num_classes)
    return model

# num_classes = 2 for 1 foreground class + background
model = get_model(num_classes=2)
Pass a list of image tensors to model(images); targets during training are dicts with boxes and labels per image.
Beyond the original paper
Cascade R-CNN stacks multiple detection heads trained with increasing IoU thresholds, so each stage refines the boxes produced by the previous one. HTC / DetectoRS add segmentation context. Libraries such as torchvision, Detectron2, and MMDetection ship production configs.
Takeaways
- R-CNN → shared features (Fast) → learned proposals (Faster).
- RPN + anchors remain the template for many two-stage systems.
- ROIAlign fixes alignment for detection and especially for masks.