Object Detection: Localize & Classify
Beyond image classification: draw bounding boxes around every object. From R-CNN to YOLOv8 and DETR — master the architectures that power autonomous vehicles, medical imaging, and visual search.
Bounding Box
x, y, w, h
IoU
Intersection over Union
NMS
Non-Max Suppression
mAP
Mean Average Precision
What is Object Detection?
Object detection = localization (where?) + classification (what?). Output: variable number of bounding boxes with class labels.
Challenge: varying number of objects, scale, occlusion, real-time speed.
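Concretely, a detector's output for one image might look like the structure below (field names are illustrative, loosely following the torchvision convention):

```python
# One detection per object: box corners [x1, y1, x2, y2], class label,
# confidence score. The number of entries varies per image.
detections = [
    {"box": [48.0, 30.0, 210.0, 180.0], "label": "dog", "score": 0.92},
    {"box": [220.0, 60.0, 330.0, 200.0], "label": "cat", "score": 0.81},
    {"box": [5.0, 5.0, 40.0, 40.0], "label": "ball", "score": 0.12},
]

# Typical post-processing step: keep only confident detections
confident = [d for d in detections if d["score"] >= 0.5]
print(len(confident))  # 2
```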
Detection Fundamentals
IoU (Intersection over Union)
IoU = Area of Overlap / Area of Union
Threshold: typically 0.5 (PASCAL) or 0.5:0.95 (COCO).
def iou(box1, box2):
    # boxes: [x1, y1, x2, y2]
    iw = max(0, min(box1[2], box2[2]) - max(box1[0], box2[0]))
    ih = max(0, min(box1[3], box2[3]) - max(box1[1], box2[1]))
    inter = iw * ih
    area1 = (box1[2] - box1[0]) * (box1[3] - box1[1])
    area2 = (box2[2] - box2[0]) * (box2[3] - box2[1])
    union = area1 + area2 - inter
    return inter / union if union > 0 else 0.0
Non-Max Suppression (NMS)
Remove duplicate detections: pick highest confidence, suppress overlapping boxes.
- Sort by confidence
- Select top box, remove IoU > threshold
- Repeat
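The three steps above can be sketched in a few lines of plain Python (greedy NMS; production code uses vectorized ops such as torchvision.ops.nms):

```python
def nms(boxes, scores, iou_thresh=0.5):
    """Greedy NMS: keep the highest-scoring box, drop overlaps above iou_thresh."""
    def iou(a, b):
        iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
        ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
        inter = iw * ih
        union = (a[2]-a[0])*(a[3]-a[1]) + (b[2]-b[0])*(b[3]-b[1]) - inter
        return inter / union if union > 0 else 0.0

    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)          # highest remaining confidence
        keep.append(best)
        order = [i for i in order    # suppress heavy overlaps
                 if iou(boxes[best], boxes[i]) <= iou_thresh]
    return keep

boxes = [[0, 0, 10, 10], [1, 1, 10, 10], [20, 20, 30, 30]]
scores = [0.9, 0.8, 0.7]
print(nms(boxes, scores))  # [0, 2]: box 1 overlaps box 0 heavily, so it is dropped
```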
Anchor Boxes
Predefined boxes of different scales/aspect ratios. Network predicts offsets from anchors.
Faster R-CNN: 9 anchors/position. YOLO: 5-9 clusters.
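A minimal sketch of Faster R-CNN-style anchor generation, assuming each anchor keeps area = scale² while the h/w ratio varies:

```python
def make_anchors(cx, cy, scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Generate len(scales) x len(ratios) anchors centered at (cx, cy),
    as [x1, y1, x2, y2]. Ratio r = h/w with area held at scale**2."""
    anchors = []
    for s in scales:
        for r in ratios:
            w = s / r ** 0.5  # w * h = s**2 and h / w = r
            h = s * r ** 0.5
            anchors.append([cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2])
    return anchors

anchors = make_anchors(0, 0)
print(len(anchors))  # 9 anchors per spatial position
```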
Two-Stage Detectors: R-CNN → Fast → Faster
R-CNN (2014)
Selective Search → 2000 region proposals → warp → CNN → SVM + bbox regressor.
Slow: 47s/image.
Fast R-CNN (2015)
Single CNN forward pass. RoI pooling layer extracts features for each proposal.
Multi-task loss: classification + bbox regression.
0.3s/image.
Faster R-CNN (2015)
Region Proposal Network (RPN) replaces selective search. End-to-end trainable.
Anchors at each location. Still two-stage but real-time capable.
import torch
import torchvision
from torchvision.models.detection import fasterrcnn_resnet50_fpn

# Load pretrained model
model = fasterrcnn_resnet50_fpn(pretrained=True)
model.eval()

# Inference: the model expects a list of 3xHxW tensors
with torch.no_grad():
    predictions = model([image_tensor])
# predictions: list[dict] with 'boxes', 'labels', 'scores'

# Fine-tune on custom dataset: swap the box predictor head
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor
model = fasterrcnn_resnet50_fpn(pretrained=True)
num_classes = 10  # your classes + background
in_features = model.roi_heads.box_predictor.cls_score.in_features
model.roi_heads.box_predictor = FastRCNNPredictor(in_features, num_classes)
One-Stage Detectors: YOLO & SSD
Single forward pass: simultaneous classification + localization. Much faster, ideal for real-time.
YOLO – You Only Look Once
Divide image into S×S grid. Each cell predicts B boxes + confidence + C class probabilities.
Loss = λ_coord * L_xywh + λ_obj * L_obj + λ_noobj * L_noobj + L_class
Evolution across versions: anchor boxes (v2), multi-scale training, CSPNet backbones (v4/v5), anchor-free heads (v8).
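The grid assignment can be sketched as follows, assuming box coordinates normalized to [0, 1] as in the original YOLO paper:

```python
def yolo_cell_target(box, S=7):
    """Map a normalized box (cx, cy, w, h in [0, 1]) to its SxS grid cell.
    Returns (row, col) of the responsible cell and the regression target:
    the center offsets within that cell plus the box size."""
    cx, cy, w, h = box
    col = min(int(cx * S), S - 1)  # which cell the center falls in
    row = min(int(cy * S), S - 1)
    tx = cx * S - col              # offset within the cell, in [0, 1)
    ty = cy * S - row
    return (row, col), (tx, ty, w, h)

cell, target = yolo_cell_target((0.5, 0.5, 0.2, 0.3))
print(cell)  # (3, 3): the image center lands in the middle cell of a 7x7 grid
```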
SSD – Single Shot MultiBox Detector
Multi-scale feature maps. Predict offsets from default boxes at each scale.
No RPN, no resampling. Faster than Faster R-CNN, competitive accuracy.
Backbone: VGG or ResNet, plus extra convolutional layers that form the scale pyramid.
from ultralytics import YOLO

# Load pretrained model
model = YOLO('yolov8n.pt')  # n/s/m/l/x: nano, small, medium, large, xlarge

# Run inference
results = model('image.jpg', save=True)

# Access detections
for r in results:
    for box in r.boxes:
        x1, y1, x2, y2 = box.xyxy[0].tolist()
        conf = box.conf[0].item()
        cls = int(box.cls[0].item())
        print(f"{cls} @ {conf:.2f}: [{x1:.0f},{y1:.0f},{x2:.0f},{y2:.0f}]")

# Train on custom data
model.train(data='coco128.yaml', epochs=50)
RetinaNet: Focal Loss for Class Imbalance
One-stage detectors used to lag in accuracy due to extreme foreground-background imbalance. Focal loss solves it.
Focal Loss Formula
FL(pₜ) = -αₜ(1-pₜ)ᵞ log(pₜ)
γ=2, α=0.25. Down-weights easy examples, focuses on hard misclassifications.
RetinaNet = ResNet/FPN backbone + two subnetworks (class + box).
# Focal Loss implementation
import torch
import torch.nn as nn
import torch.nn.functional as F

class FocalLoss(nn.Module):
    def __init__(self, alpha=0.25, gamma=2):
        super().__init__()
        self.alpha = alpha
        self.gamma = gamma

    def forward(self, logits, targets):
        ce_loss = F.binary_cross_entropy_with_logits(logits, targets, reduction='none')
        p = torch.sigmoid(logits)  # logits -> probabilities
        p_t = p * targets + (1 - p) * (1 - targets)
        alpha_t = self.alpha * targets + (1 - self.alpha) * (1 - targets)
        return (alpha_t * (1 - p_t) ** self.gamma * ce_loss).mean()
Anchor-Free & Transformer Detectors
Anchor-Free Detectors
Predict keypoints or center points instead of anchor offsets.
- CenterNet: Object as center point + width/height
- FCOS: Per-pixel prediction, multi-level FPN
- CornerNet: Detect corners, group embeddings
Fewer hyperparameters, simpler.
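A toy sketch of the CenterNet idea, assuming the network outputs a center heatmap plus a per-cell (w, h) map; real implementations extract peaks with a max-pooling NMS rather than a plain threshold:

```python
def decode_centers(heatmap, wh, score_thresh=0.5):
    """CenterNet-style decode (sketch): each heatmap cell above threshold
    becomes one detection, using the width/height predicted at that cell."""
    dets = []
    for y, row in enumerate(heatmap):
        for x, score in enumerate(row):
            if score < score_thresh:
                continue
            w, h = wh[y][x]
            dets.append([x - w / 2, y - h / 2, x + w / 2, y + h / 2, score])
    return dets

# Toy 3x3 heatmap with one confident center at (x=1, y=1)
heatmap = [[0.1, 0.2, 0.1],
           [0.1, 0.9, 0.1],
           [0.1, 0.1, 0.1]]
wh = [[(2, 2)] * 3 for _ in range(3)]  # every cell predicts a 2x2 box
dets = decode_centers(heatmap, wh)
print(dets)  # [[0.0, 0.0, 2.0, 2.0, 0.9]]
```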
DETR – Detection Transformer
End-to-end object detection with Transformers. No anchors, no NMS (bipartite matching).
Encoder-decoder: CNN backbone + transformer + fixed set of object queries.
Hungarian loss matches predictions to ground truth.
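The bipartite matching can be illustrated with a brute-force version; DETR itself uses scipy.optimize.linear_sum_assignment, which scales to its fixed set of object queries, but the assignment it finds is the same:

```python
from itertools import permutations

def hungarian_match(cost):
    """Minimal-cost one-to-one assignment of predictions to ground truths.
    cost[i][j] is the matching cost of prediction i to ground truth j.
    Brute force over permutations: fine for tiny examples only."""
    n = len(cost)
    best_perm, best_cost = None, float('inf')
    for perm in permutations(range(n)):
        c = sum(cost[i][perm[i]] for i in range(n))
        if c < best_cost:
            best_perm, best_cost = perm, c
    return list(best_perm), best_cost

# Prediction 0 matches GT 0 cheaply, prediction 1 matches GT 1 cheaply
cost = [[0.1, 0.9],
        [0.8, 0.2]]
match, total = hungarian_match(cost)
print(match)  # [0, 1]
```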
from transformers import DetrImageProcessor, DetrForObjectDetection
processor = DetrImageProcessor.from_pretrained("facebook/detr-resnet-50")
model = DetrForObjectDetection.from_pretrained("facebook/detr-resnet-50")
inputs = processor(images=image, return_tensors="pt")
outputs = model(**inputs)
# Convert outputs to COCO format
target_sizes = torch.tensor([image.size[::-1]])
results = processor.post_process_object_detection(outputs, target_sizes=target_sizes)[0]
for score, label, box in zip(results["scores"], results["labels"], results["boxes"]):
if score > 0.7:
print(f"{model.config.id2label[label.item()]}: {score:.2f} at {box.tolist()}")
Evaluation: mAP (mean Average Precision)
Standard metric for object detection (PASCAL VOC, COCO).
Precision-Recall Curve
For each class, rank detections by confidence. Compute precision/recall at each rank.
AP = area under P-R curve.
mAP = mean AP over all classes.
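A sketch of the non-interpolated AP computation from a ranked list of detections (VOC and COCO use interpolated variants, but the idea is the same):

```python
def average_precision(matches, num_gt):
    """AP for one class. matches: detections sorted by confidence,
    True = matched a ground truth (TP), False = false positive.
    Non-interpolated AP: average precision at each true-positive rank,
    divided by the total number of ground-truth objects."""
    tp = 0
    precisions = []
    for rank, is_tp in enumerate(matches, start=1):
        if is_tp:
            tp += 1
            precisions.append(tp / rank)  # precision at this recall point
    return sum(precisions) / num_gt if num_gt else 0.0

# 3 ground-truth objects; ranked detections: TP, FP, TP, TP
ap = average_precision([True, False, True, True], num_gt=3)
print(round(ap, 3))  # 0.806
```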
COCO mAP
- mAP@0.5: IoU threshold 0.5 (PASCAL)
- mAP@0.5:0.95: average over IoU thresholds 0.5 to 0.95 step 0.05 (COCO primary)
- mAP_small, mAP_medium, mAP_large
# Using torchmetrics
from torchmetrics.detection.mean_ap import MeanAveragePrecision
metric = MeanAveragePrecision(iou_thresholds=[0.5, 0.75])
metric.update(predictions, ground_truths)
results = metric.compute()
Training Tricks & Augmentation
Multi-scale training: random resize between 640-800px each iteration.
Mosaic: combine 4 images into one (YOLOv4). Boosts small-object detection.
MixUp: blend pairs of images and their labels.
Detector Comparison
| Detector | Type | Speed (FPS) | COCO mAP | Key Feature |
|---|---|---|---|---|
| Faster R-CNN | Two-stage | 7-15 | 42-45 | RPN, accuracy benchmark |
| SSD | One-stage | 40-60 | 28-33 | Multi-scale defaults |
| YOLOv5 | One-stage | 80-140 | 50-55 | Speed-accuracy tradeoff |
| YOLOv8 | One-stage | 80-160 | 53-57 | Anchor-free, SOTA |
| RetinaNet | One-stage | 20-40 | 40-44 | Focal loss |
| DETR | Transformer | 20-28 | 42-45 | No anchors/NMS |
| DINO | Transformer | 20-30 | 63+ | SOTA Transformer |
Real-World Object Detection
Autonomous Vehicles
Cars, pedestrians, traffic signs
Medical Imaging
Tumors, lesions, cells
Robotics
Object grasping, navigation
Retail
Shelf monitoring, checkout-free
Deploying Object Detectors
TensorRT: optimize YOLO/Faster R-CNN for GPU inference. 2-5x speedup.
OpenVINO: Intel CPU/VPU acceleration. Popular for edge deployment.
# Export YOLOv8 to ONNX
model.export(format='onnx', imgsz=640)
# TensorRT
model.export(format='engine', device=0)
# TorchScript (via the ultralytics export API; torch.jit.script does not
# work directly on the YOLO wrapper object)
model.export(format='torchscript')