Object Detection: Localize & Classify
Beyond image classification: draw bounding boxes around every object. From R-CNN to YOLOv8 and DETR — master the architectures that power autonomous vehicles, medical imaging, and visual search.
Bounding Box
x, y, w, h
IoU
Intersection over Union
NMS
Non-Max Suppression
mAP
Mean Average Precision
What is Object Detection?
Object detection = localization (where?) + classification (what?). Output: variable number of bounding boxes with class labels.
Challenge: varying number of objects, scale, occlusion, real-time speed.
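Concretely, a detector's output for one image might look like the structure below (field names are illustrative, loosely following the torchvision convention):

```python
# One detection per object: box corners [x1, y1, x2, y2], class label,
# confidence score. The number of entries varies per image.
detections = [
    {"box": [48.0, 30.0, 210.0, 180.0], "label": "dog", "score": 0.92},
    {"box": [220.0, 60.0, 330.0, 200.0], "label": "cat", "score": 0.81},
    {"box": [5.0, 5.0, 40.0, 40.0], "label": "ball", "score": 0.12},
]

# Typical post-processing step: keep only confident detections
confident = [d for d in detections if d["score"] >= 0.5]
print(len(confident))  # 2
```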
Detection Fundamentals
IoU (Intersection over Union)
IoU = Area of Overlap / Area of Union
Threshold: typically 0.5 (PASCAL) or 0.5:0.95 (COCO).
def iou(box1, box2):
    # boxes: [x1, y1, x2, y2]
    iw = max(0, min(box1[2], box2[2]) - max(box1[0], box2[0]))
    ih = max(0, min(box1[3], box2[3]) - max(box1[1], box2[1]))
    inter = iw * ih
    area1 = (box1[2] - box1[0]) * (box1[3] - box1[1])
    area2 = (box2[2] - box2[0]) * (box2[3] - box2[1])
    union = area1 + area2 - inter
    return inter / union if union > 0 else 0.0
Non-Max Suppression (NMS)
Remove duplicate detections: pick highest confidence, suppress overlapping boxes.
- Sort by confidence
- Select top box, remove IoU > threshold
- Repeat
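The three steps above can be sketched in a few lines of plain Python (greedy NMS; production code uses vectorized ops such as torchvision.ops.nms):

```python
def nms(boxes, scores, iou_thresh=0.5):
    """Greedy NMS: keep the highest-scoring box, drop overlaps above iou_thresh."""
    def iou(a, b):
        iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
        ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
        inter = iw * ih
        union = (a[2]-a[0])*(a[3]-a[1]) + (b[2]-b[0])*(b[3]-b[1]) - inter
        return inter / union if union > 0 else 0.0

    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)          # highest remaining confidence
        keep.append(best)
        order = [i for i in order    # suppress heavy overlaps
                 if iou(boxes[best], boxes[i]) <= iou_thresh]
    return keep

boxes = [[0, 0, 10, 10], [1, 1, 10, 10], [20, 20, 30, 30]]
scores = [0.9, 0.8, 0.7]
print(nms(boxes, scores))  # [0, 2]: box 1 overlaps box 0 heavily, so it is dropped
```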
Anchor Boxes
Predefined boxes of different scales/aspect ratios. Network predicts offsets from anchors.
Faster R-CNN: 9 anchors/position. YOLO: 5-9 clusters.
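A minimal sketch of Faster R-CNN-style anchor generation, assuming each anchor keeps area = scale² while the h/w ratio varies:

```python
def make_anchors(cx, cy, scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Generate len(scales) x len(ratios) anchors centered at (cx, cy),
    as [x1, y1, x2, y2]. Ratio r = h/w with area held at scale**2."""
    anchors = []
    for s in scales:
        for r in ratios:
            w = s / r ** 0.5  # w * h = s**2 and h / w = r
            h = s * r ** 0.5
            anchors.append([cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2])
    return anchors

anchors = make_anchors(0, 0)
print(len(anchors))  # 9 anchors per spatial position
```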
Two-Stage Detectors: R-CNN → Fast → Faster
R-CNN (2014)
Selective Search → 2000 region proposals → warp → CNN → SVM + bbox regressor.
Slow: 47s/image.
Fast R-CNN (2015)
Single CNN forward pass. RoI pooling layer extracts features for each proposal.
Multi-task loss: classification + bbox regression.
0.3s/image.
Faster R-CNN (2015)
Region Proposal Network (RPN) replaces selective search. End-to-end trainable.
Anchors at each location. Still two-stage but real-time capable.
import torch
import torchvision
from torchvision.models.detection import fasterrcnn_resnet50_fpn

# Load pretrained model
model = fasterrcnn_resnet50_fpn(pretrained=True)
model.eval()

# Inference: the model expects a list of 3xHxW tensors
with torch.no_grad():
    predictions = model([image_tensor])
# predictions: list[dict] with 'boxes', 'labels', 'scores'

# Fine-tune on custom dataset: swap the box predictor head
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor
model = fasterrcnn_resnet50_fpn(pretrained=True)
num_classes = 10  # your classes + background
in_features = model.roi_heads.box_predictor.cls_score.in_features
model.roi_heads.box_predictor = FastRCNNPredictor(in_features, num_classes)
One-Stage Detectors: YOLO & SSD
Single forward pass: simultaneous classification + localization. Much faster, ideal for real-time.
YOLO – You Only Look Once
Divide image into S×S grid. Each cell predicts B boxes + confidence + C class probabilities.
Loss = λ_coord * L_xywh + λ_obj * L_obj + λ_noobj * L_noobj + L_class
Evolution across versions: anchor boxes (v2), multi-scale training, CSPNet backbones (v4/v5), anchor-free heads (v8).
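The grid assignment can be sketched as follows, assuming box coordinates normalized to [0, 1] as in the original YOLO paper:

```python
def yolo_cell_target(box, S=7):
    """Map a normalized box (cx, cy, w, h in [0, 1]) to its SxS grid cell.
    Returns (row, col) of the responsible cell and the regression target:
    the center offsets within that cell plus the box size."""
    cx, cy, w, h = box
    col = min(int(cx * S), S - 1)  # which cell the center falls in
    row = min(int(cy * S), S - 1)
    tx = cx * S - col              # offset within the cell, in [0, 1)
    ty = cy * S - row
    return (row, col), (tx, ty, w, h)

cell, target = yolo_cell_target((0.5, 0.5, 0.2, 0.3))
print(cell)  # (3, 3): the image center lands in the middle cell of a 7x7 grid
```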
SSD – Single Shot MultiBox Detector
Multi-scale feature maps. Predict offsets from default boxes at each scale.
No RPN, no resampling. Faster than Faster R-CNN, competitive accuracy.
Backbone: VGG or ResNet, plus extra convolutional layers that form the scale pyramid.
from ultralytics import YOLO

# Load pretrained model
model = YOLO('yolov8n.pt')  # n/s/m/l/x: nano, small, medium, large, xlarge

# Run inference
results = model('image.jpg', save=True)

# Access detections
for r in results:
    for box in r.boxes:
        x1, y1, x2, y2 = box.xyxy[0].tolist()
        conf = box.conf[0].item()
        cls = int(box.cls[0].item())
        print(f"{cls} @ {conf:.2f}: [{x1:.0f},{y1:.0f},{x2:.0f},{y2:.0f}]")

# Train on custom data
model.train(data='coco128.yaml', epochs=50)
RetinaNet: Focal Loss for Class Imbalance
One-stage detectors used to lag in accuracy due to extreme foreground-background imbalance. Focal loss solves it.
Focal Loss Formula
FL(pₜ) = -αₜ(1-pₜ)ᵞ log(pₜ)
γ=2, α=0.25. Down-weights easy examples, focuses on hard misclassifications.
RetinaNet = ResNet/FPN backbone + two subnetworks (class + box).
# Focal Loss implementation
import torch
import torch.nn as nn
import torch.nn.functional as F

class FocalLoss(nn.Module):
    def __init__(self, alpha=0.25, gamma=2):
        super().__init__()
        self.alpha = alpha
        self.gamma = gamma

    def forward(self, logits, targets):
        ce_loss = F.binary_cross_entropy_with_logits(logits, targets, reduction='none')
        p = torch.sigmoid(logits)  # logits -> probabilities
        p_t = p * targets + (1 - p) * (1 - targets)
        alpha_t = self.alpha * targets + (1 - self.alpha) * (1 - targets)
        return (alpha_t * (1 - p_t) ** self.gamma * ce_loss).mean()
Anchor-Free & Transformer Detectors
Anchor-Free Detectors
Predict keypoints or center points instead of anchor offsets.
- CenterNet: Object as center point + width/height
- FCOS: Per-pixel prediction, multi-level FPN
- CornerNet: Detect corners, group embeddings
Fewer hyperparameters, simpler.
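A toy sketch of the CenterNet idea, assuming the network outputs a center heatmap plus a per-cell (w, h) map; real implementations extract peaks with a max-pooling NMS rather than a plain threshold:

```python
def decode_centers(heatmap, wh, score_thresh=0.5):
    """CenterNet-style decode (sketch): each heatmap cell above threshold
    becomes one detection, using the width/height predicted at that cell."""
    dets = []
    for y, row in enumerate(heatmap):
        for x, score in enumerate(row):
            if score < score_thresh:
                continue
            w, h = wh[y][x]
            dets.append([x - w / 2, y - h / 2, x + w / 2, y + h / 2, score])
    return dets

# Toy 3x3 heatmap with one confident center at (x=1, y=1)
heatmap = [[0.1, 0.2, 0.1],
           [0.1, 0.9, 0.1],
           [0.1, 0.1, 0.1]]
wh = [[(2, 2)] * 3 for _ in range(3)]  # every cell predicts a 2x2 box
dets = decode_centers(heatmap, wh)
print(dets)  # [[0.0, 0.0, 2.0, 2.0, 0.9]]
```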
DETR – Detection Transformer
End-to-end object detection with Transformers. No anchors, no NMS (bipartite matching).
Encoder-decoder: CNN backbone + transformer + fixed set of object queries.
Hungarian loss matches predictions to ground truth.
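The bipartite matching can be illustrated with a brute-force version; DETR itself uses scipy.optimize.linear_sum_assignment, which scales to its fixed set of object queries, but the assignment it finds is the same:

```python
from itertools import permutations

def hungarian_match(cost):
    """Minimal-cost one-to-one assignment of predictions to ground truths.
    cost[i][j] is the matching cost of prediction i to ground truth j.
    Brute force over permutations: fine for tiny examples only."""
    n = len(cost)
    best_perm, best_cost = None, float('inf')
    for perm in permutations(range(n)):
        c = sum(cost[i][perm[i]] for i in range(n))
        if c < best_cost:
            best_perm, best_cost = perm, c
    return list(best_perm), best_cost

# Prediction 0 matches GT 0 cheaply, prediction 1 matches GT 1 cheaply
cost = [[0.1, 0.9],
        [0.8, 0.2]]
match, total = hungarian_match(cost)
print(match)  # [0, 1]
```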
from transformers import DetrImageProcessor, DetrForObjectDetection
processor = DetrImageProcessor.from_pretrained("facebook/detr-resnet-50")
model = DetrForObjectDetection.from_pretrained("facebook/detr-resnet-50")
inputs = processor(images=image, return_tensors="pt")
outputs = model(**inputs)
# Convert outputs to COCO format
target_sizes = torch.tensor([image.size[::-1]])
results = processor.post_process_object_detection(outputs, target_sizes=target_sizes)[0]
for score, label, box in zip(results["scores"], results["labels"], results["boxes"]):
if score > 0.7:
print(f"{model.config.id2label[label.item()]}: {score:.2f} at {box.tolist()}")
Evaluation: mAP (mean Average Precision)
Standard metric for object detection (PASCAL VOC, COCO).
Precision-Recall Curve
For each class, rank detections by confidence. Compute precision/recall at each rank.
AP = area under P-R curve.
mAP = mean AP over all classes.
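A sketch of the non-interpolated AP computation from a ranked list of detections (VOC and COCO use interpolated variants, but the idea is the same):

```python
def average_precision(matches, num_gt):
    """AP for one class. matches: detections sorted by confidence,
    True = matched a ground truth (TP), False = false positive.
    Non-interpolated AP: average precision at each true-positive rank,
    divided by the total number of ground-truth objects."""
    tp = 0
    precisions = []
    for rank, is_tp in enumerate(matches, start=1):
        if is_tp:
            tp += 1
            precisions.append(tp / rank)  # precision at this recall point
    return sum(precisions) / num_gt if num_gt else 0.0

# 3 ground-truth objects; ranked detections: TP, FP, TP, TP
ap = average_precision([True, False, True, True], num_gt=3)
print(round(ap, 3))  # 0.806
```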
COCO mAP
- mAP@0.5: IoU threshold 0.5 (PASCAL)
- mAP@0.5:0.95: average over IoU thresholds 0.5 to 0.95 step 0.05 (COCO primary)
- mAP_small, mAP_medium, mAP_large
# Using torchmetrics
from torchmetrics.detection.mean_ap import MeanAveragePrecision
metric = MeanAveragePrecision(iou_thresholds=[0.5, 0.75])
metric.update(predictions, ground_truths)
results = metric.compute()
Training Tricks & Augmentation
Multi-scale training: random resize between 640-800px each iteration.
Mosaic: combine 4 images into one (YOLOv4). Boosts small-object detection.
MixUp: blend pairs of images and their labels.
Detector Comparison
| Detector | Type | Speed (FPS) | COCO mAP | Key Feature |
|---|---|---|---|---|
| Faster R-CNN | Two-stage | 7-15 | 42-45 | RPN, accuracy benchmark |
| SSD | One-stage | 40-60 | 28-33 | Multi-scale defaults |
| YOLOv5 | One-stage | 80-140 | 50-55 | Speed-accuracy tradeoff |
| YOLOv8 | One-stage | 80-160 | 53-57 | Anchor-free, SOTA |
| RetinaNet | One-stage | 20-40 | 40-44 | Focal loss |
| DETR | Transformer | 20-28 | 42-45 | No anchors/NMS |
| DINO | Transformer | 20-30 | 63+ | SOTA Transformer |
Real-World Object Detection
Autonomous Vehicles
Cars, pedestrians, traffic signs
Medical Imaging
Tumors, lesions, cells
Robotics
Object grasping, navigation
Retail
Shelf monitoring, checkout-free
Deploying Object Detectors
TensorRT: optimize YOLO/Faster R-CNN for GPU inference. 2-5x speedup.
OpenVINO: Intel CPU/VPU acceleration. Popular for edge deployment.
# Export YOLOv8 to ONNX
model.export(format='onnx', imgsz=640)
# TensorRT
model.export(format='engine', device=0)
# TorchScript (via the ultralytics export API; torch.jit.script does not
# work directly on the YOLO wrapper object)
model.export(format='torchscript')