Object Detection: Localize & Classify
Beyond image classification: draw bounding boxes around every object. From R-CNN to YOLOv8 and DETR — master the architectures that power autonomous vehicles, medical imaging, and visual search.
- Bounding Box: x, y, w, h
- IoU: Intersection over Union
- NMS: Non-Max Suppression
- mAP: Mean Average Precision
What is Object Detection?
Object detection = localization (where?) + classification (what?). Output: variable number of bounding boxes with class labels.
Challenge: varying number of objects, scale, occlusion, real-time speed.
Detection Fundamentals
IoU (Intersection over Union)
IoU = Area of Overlap / Area of Union
Threshold: typically 0.5 (PASCAL) or 0.5:0.95 (COCO).
def iou(b1, b2):
    # boxes: [x1, y1, x2, y2]
    inter_w = max(0, min(b1[2], b2[2]) - max(b1[0], b2[0]))
    inter_h = max(0, min(b1[3], b2[3]) - max(b1[1], b2[1]))
    inter = inter_w * inter_h
    area1 = (b1[2] - b1[0]) * (b1[3] - b1[1])
    area2 = (b2[2] - b2[0]) * (b2[3] - b2[1])
    union = area1 + area2 - inter
    return inter / union if union > 0 else 0.0
Non-Max Suppression (NMS)
Remove duplicate detections: pick highest confidence, suppress overlapping boxes.
- Sort by confidence
- Select top box, remove IoU > threshold
- Repeat
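The steps above can be sketched in NumPy (a toy implementation; boxes use the same [x1, y1, x2, y2] format as the IoU snippet):

```python
import numpy as np

def nms(boxes, scores, iou_threshold=0.5):
    """Greedy NMS. boxes: (N, 4) [x1,y1,x2,y2], scores: (N,). Returns kept indices."""
    x1, y1, x2, y2 = boxes[:, 0], boxes[:, 1], boxes[:, 2], boxes[:, 3]
    areas = (x2 - x1) * (y2 - y1)
    order = scores.argsort()[::-1]          # highest confidence first
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        # IoU of the top box with every remaining box
        xx1 = np.maximum(x1[i], x1[order[1:]])
        yy1 = np.maximum(y1[i], y1[order[1:]])
        xx2 = np.minimum(x2[i], x2[order[1:]])
        yy2 = np.minimum(y2[i], y2[order[1:]])
        inter = np.maximum(0, xx2 - xx1) * np.maximum(0, yy2 - yy1)
        iou = inter / (areas[i] + areas[order[1:]] - inter)
        # drop boxes that overlap the selected box too much
        order = order[1:][iou <= iou_threshold]
    return keep
```

In practice you would use a library version (e.g. `torchvision.ops.nms`), which runs the same greedy loop on the GPU.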
Anchor Boxes
Predefined boxes of different scales/aspect ratios. Network predicts offsets from anchors.
Faster R-CNN: 9 anchors/position. YOLO: 5-9 clusters.
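Faster R-CNN-style anchor generation can be sketched as follows (3 scales x 3 aspect ratios = 9 anchors per position; the scale and ratio values below are the paper's defaults, in pixels):

```python
import numpy as np

def make_anchors(scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """3 scales x 3 ratios = 9 anchors, centered at the origin, [x1,y1,x2,y2]."""
    anchors = []
    for s in scales:
        for r in ratios:
            # keep area ~ s*s while varying the aspect ratio w/h = r
            w = s * np.sqrt(r)
            h = s / np.sqrt(r)
            anchors.append([-w / 2, -h / 2, w / 2, h / 2])
    return np.array(anchors)

anchors = make_anchors()
print(anchors.shape)  # (9, 4)
```

At detection time these 9 templates are shifted to every feature-map position, and the network regresses offsets from each.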
Two-Stage Detectors: R-CNN → Fast → Faster
R-CNN (2014)
Selective Search → 2000 region proposals → warp → CNN → SVM + bbox regressor.
Slow: 47s/image.
Fast R-CNN (2015)
Single CNN forward pass. RoI pooling layer extracts features for each proposal.
Multi-task loss: classification + bbox regression.
0.3s/image.
Faster R-CNN (2015)
Region Proposal Network (RPN) replaces selective search. End-to-end trainable.
Anchors at each location. Still two-stage, but near real-time.
import torch
import torchvision
from torchvision.models.detection import fasterrcnn_resnet50_fpn
# Load pretrained model
model = fasterrcnn_resnet50_fpn(pretrained=True)
model.eval()
# Inference: the model takes a list of [C, H, W] tensors
with torch.no_grad():
    predictions = model([image_tensor])
# predictions: list[dict] with 'boxes', 'labels', 'scores'
# Fine-tune on custom dataset
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor
model = fasterrcnn_resnet50_fpn(pretrained=True)
num_classes = 10  # your classes + background
in_features = model.roi_heads.box_predictor.cls_score.in_features
model.roi_heads.box_predictor = FastRCNNPredictor(in_features, num_classes)
One-Stage Detectors: YOLO & SSD
Single forward pass: simultaneous classification + localization. Much faster, ideal for real-time.
YOLO – You Only Look Once
Divide image into S×S grid. Each cell predicts B boxes + confidence + C class probabilities.
Loss = λ_coord * L_xywh + λ_obj * L_obj + λ_noobj * L_noobj + L_class
Later versions add anchor boxes (v2), multi-scale predictions (v3), CSPNet backbones (v4/v5), and anchor-free heads (v8).
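A minimal sketch of the YOLOv1-style grid encoding (assuming the v1 defaults of S=7 and a 448 px input): the cell containing a box center is responsible for that object, and the regression targets are the center offset within the cell plus width/height normalized to the image.

```python
def encode_yolo_target(cx, cy, w, h, img_size=448, S=7):
    """Map a box (center cx, cy and size w, h in pixels) to its grid cell
    and YOLOv1-style regression targets."""
    cell = img_size / S                     # 64 px per cell
    col, row = int(cx // cell), int(cy // cell)
    tx = (cx % cell) / cell                 # center offset inside the cell, in [0, 1)
    ty = (cy % cell) / cell
    tw, th = w / img_size, h / img_size     # size normalized to the image
    return row, col, (tx, ty, tw, th)

print(encode_yolo_target(100, 200, 50, 80))
# with 64 px cells: col=1, row=3; offsets 36/64 = 0.5625 and 8/64 = 0.125
```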
SSD – Single Shot MultiBox Detector
Multi-scale feature maps. Predict offsets from default boxes at each scale.
No RPN, no resampling. Faster than Faster R-CNN, competitive accuracy.
# SSD: VGG (or ResNet) backbone + extra conv layers forming the scale pyramid
from torchvision.models.detection import ssd300_vgg16
model = ssd300_vgg16(pretrained=True)
model.eval()
from ultralytics import YOLO
# Load pretrained model
model = YOLO('yolov8n.pt')  # n/s/m/l/x = nano, small, medium, large, xlarge
# Run inference
results = model('image.jpg', save=True)
# Access detections
for r in results:
    for box in r.boxes:
        x1, y1, x2, y2 = box.xyxy[0].tolist()
        conf = box.conf[0].item()
        cls = int(box.cls[0].item())
        print(f"{cls} @ {conf:.2f}: [{x1:.0f},{y1:.0f},{x2:.0f},{y2:.0f}]")
# Train on custom data
model.train(data='coco128.yaml', epochs=50)
RetinaNet: Focal Loss for Class Imbalance
One-stage detectors used to lag in accuracy due to extreme foreground-background imbalance. Focal loss solves it.
Focal Loss Formula
FL(p_t) = -α_t (1 - p_t)^γ log(p_t)
Defaults: γ = 2, α = 0.25. Down-weights easy examples, focuses training on hard misclassifications.
RetinaNet = ResNet/FPN backbone + two subnetworks (class + box).
# Focal Loss implementation (binary, on logits)
import torch
import torch.nn as nn
import torch.nn.functional as F

class FocalLoss(nn.Module):
    def __init__(self, alpha=0.25, gamma=2):
        super().__init__()
        self.alpha = alpha
        self.gamma = gamma

    def forward(self, logits, targets):
        ce_loss = F.binary_cross_entropy_with_logits(logits, targets, reduction='none')
        p = torch.sigmoid(logits)                      # convert logits to probabilities
        p_t = p * targets + (1 - p) * (1 - targets)    # probability of the true class
        alpha_t = self.alpha * targets + (1 - self.alpha) * (1 - targets)
        focal_loss = alpha_t * (1 - p_t) ** self.gamma * ce_loss
        return focal_loss.mean()
Anchor-Free & Transformer Detectors
Anchor-Free Detectors
Predict keypoints or center points instead of anchor offsets.
- CenterNet: Object as center point + width/height
- FCOS: Per-pixel prediction, multi-level FPN
- CornerNet: Detect corners, group embeddings
Fewer hyperparameters, simpler.
DETR – Detection Transformer
End-to-end object detection with Transformers. No anchors, no NMS (bipartite matching).
Encoder-decoder: CNN backbone + transformer + fixed set of object queries.
Hungarian loss matches predictions to ground truth.
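The bipartite matching step can be sketched with SciPy's Hungarian solver on a hypothetical cost matrix (in DETR the costs combine a class term, an L1 box term, and a generalized-IoU term):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# cost[i, j]: matching cost between prediction i and ground-truth box j
cost = np.array([[0.2, 0.9, 0.8],
                 [0.7, 0.1, 0.9],
                 [0.6, 0.8, 0.3]])
pred_idx, gt_idx = linear_sum_assignment(cost)
print(list(zip(pred_idx, gt_idx)))  # one-to-one assignment minimizing total cost
```

Each ground-truth box is matched to exactly one query, so duplicates are penalized during training and NMS becomes unnecessary.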
import torch
from PIL import Image
from transformers import DetrImageProcessor, DetrForObjectDetection

processor = DetrImageProcessor.from_pretrained("facebook/detr-resnet-50")
model = DetrForObjectDetection.from_pretrained("facebook/detr-resnet-50")

image = Image.open('image.jpg')
inputs = processor(images=image, return_tensors="pt")
outputs = model(**inputs)
# Convert outputs to COCO format
target_sizes = torch.tensor([image.size[::-1]])
results = processor.post_process_object_detection(outputs, target_sizes=target_sizes)[0]
for score, label, box in zip(results["scores"], results["labels"], results["boxes"]):
    if score > 0.7:
        print(f"{model.config.id2label[label.item()]}: {score:.2f} at {box.tolist()}")
Evaluation: mAP (mean Average Precision)
Standard metric for object detection (PASCAL VOC, COCO).
Precision-Recall Curve
For each class, rank detections by confidence. Compute precision/recall at each rank.
AP = area under P-R curve.
mAP = mean AP over all classes.
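The AP computation can be sketched as follows (toy data; assumes each detection has already been marked true/false positive by IoU matching, and uses all-point interpolation):

```python
import numpy as np

def average_precision(tp_flags, n_gt):
    """All-point-interpolated AP from detections ranked by confidence.
    tp_flags[i] = 1 if detection i matched a ground-truth box, else 0."""
    tp = np.cumsum(tp_flags)
    fp = np.cumsum(1 - np.asarray(tp_flags))
    recall = tp / n_gt
    precision = tp / (tp + fp)
    # enforce a monotonically decreasing precision envelope, then
    # integrate precision over recall
    envelope = np.maximum.accumulate(precision[::-1])[::-1]
    ap, prev_r = 0.0, 0.0
    for r, p in zip(recall, envelope):
        ap += (r - prev_r) * p
        prev_r = r
    return ap

# 5 detections ranked by confidence, 3 ground-truth boxes
print(average_precision([1, 0, 1, 1, 0], n_gt=3))  # 5/6 ~= 0.833
```

COCO-style mAP repeats this per class and per IoU threshold, then averages.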
COCO mAP
- mAP@0.5: IoU threshold 0.5 (PASCAL)
- mAP@0.5:0.95: average over IoU thresholds 0.5 to 0.95 step 0.05 (COCO primary)
- mAP_small, mAP_medium, mAP_large
# Using torchmetrics
import torch
from torchmetrics.detection.mean_ap import MeanAveragePrecision

metric = MeanAveragePrecision(iou_thresholds=[0.5, 0.75])
# one dict per image: boxes [N, 4] in xyxy, scores [N], labels [N]
preds = [dict(boxes=torch.tensor([[0., 0., 10., 10.]]), scores=torch.tensor([0.9]), labels=torch.tensor([0]))]
targets = [dict(boxes=torch.tensor([[0., 0., 10., 10.]]), labels=torch.tensor([0]))]
metric.update(preds, targets)
results = metric.compute()  # dict with 'map', 'map_50', 'map_75', ...
Training Tricks & Augmentation
Multi-scale training: Random resize between 640-800px each iteration.
Mosaic: Combine 4 images into one (YOLOv4). Boosts small-object detection.
MixUp: Blend images and labels.
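The MixUp idea can be sketched as follows (a simplified version: assumes same-size float images, and the per-box loss weighting shown is one common variant for detection):

```python
import numpy as np

def mixup(img1, boxes1, img2, boxes2, alpha=1.5):
    """Blend two images; detection MixUp keeps both images' boxes and
    weights each box's loss by its source image's mixing coefficient."""
    lam = np.random.beta(alpha, alpha)
    mixed = lam * img1 + (1 - lam) * img2           # pixel-wise blend
    boxes = np.concatenate([boxes1, boxes2])        # union of the labels
    weights = np.concatenate([np.full(len(boxes1), lam),
                              np.full(len(boxes2), 1 - lam)])
    return mixed, boxes, weights
```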
Detector Comparison
| Detector | Type | Speed (FPS) | COCO mAP | Key Feature |
|---|---|---|---|---|
| Faster R-CNN | Two-stage | 7-15 | 42-45 | RPN, accuracy benchmark |
| SSD | One-stage | 40-60 | 28-33 | Multi-scale defaults |
| YOLOv5 | One-stage | 80-140 | 50-55 | Speed-accuracy tradeoff |
| YOLOv8 | One-stage | 80-160 | 53-57 | Anchor-free, SOTA |
| RetinaNet | One-stage | 20-40 | 40-44 | Focal loss |
| DETR | Transformer | 20-28 | 42-45 | No anchors/NMS |
| DINO | Transformer | 20-30 | 63+ | SOTA Transformer |
Real-World Object Detection
Autonomous Vehicles
Cars, pedestrians, traffic signs
Medical Imaging
Tumors, lesions, cells
Robotics
Object grasping, navigation
Retail
Shelf monitoring, checkout-free
Deploying Object Detectors
TensorRT: Optimize YOLO/Faster R-CNN for GPU inference. 2-5x speedup.
OpenVINO: Intel CPU/VPU acceleration. Popular for edge.
# Export YOLOv8 to ONNX
model.export(format='onnx', imgsz=640)
# TensorRT engine (requires a GPU)
model.export(format='engine', device=0)
# TorchScript (ultralytics wraps torch.jit tracing internally)
model.export(format='torchscript')