Computer Vision Chapter 9

One-Stage Object Detection

YOLO, RetinaNet, and SSD-style single-shot detectors for real-time localization.

YOLO

Core idea

The network outputs dense tensors encoding, for each spatial location (and scale): objectness or class scores, box coordinates (center, size, or distances to sides), and sometimes mask coefficients. Training matches predictions to ground-truth with IoU-based assignment and a multi-part loss (classification + localization + objectness). At inference, low-confidence predictions are filtered and non-maximum suppression removes overlaps.
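The inference step described above (confidence filtering followed by non-maximum suppression) can be sketched in plain Python. This is a minimal, class-agnostic illustration, not the implementation any particular YOLO release uses; the function names `iou` and `filter_and_nms` and the threshold defaults are assumptions chosen for the example.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def filter_and_nms(boxes, scores, conf_thres=0.25, iou_thres=0.5):
    """Return indices of kept boxes, highest score first (hypothetical helper)."""
    # 1) Drop low-confidence predictions, then visit the rest by descending score.
    order = sorted(
        (i for i, s in enumerate(scores) if s >= conf_thres),
        key=lambda i: scores[i],
        reverse=True,
    )
    # 2) Greedy NMS: keep a box only if it does not overlap an already kept box.
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= iou_thres for j in keep):
            keep.append(i)
    return keep


boxes = [(10, 10, 50, 50), (12, 12, 52, 52), (100, 100, 140, 140), (0, 0, 5, 5)]
scores = [0.9, 0.8, 0.7, 0.1]
print(filter_and_nms(boxes, scores))  # → [0, 2]
```

Box 1 is suppressed because it overlaps the higher-scoring box 0, and box 3 falls below the confidence threshold. Production code usually runs NMS per class and on batched tensors rather than Python lists.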

Why fast?

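A single forward pass through one backbone and detection head produces all predictions, with no separate proposal stage. Highly optimized runtimes (TensorRT, ONNX Runtime) reduce latency further.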

Trade-offs

Tiny models on hard scenes (crowds, small objects) may trail heavy two-stage detectors on mAP.

Version sketch

  • YOLOv3 / v4 / v5 — multi-scale predictions, strong community adoption; v5/v8 ecosystems centered on Ultralytics tooling.
  • YOLOv8 / YOLO11 (Ultralytics) — unified API for detect, segment, classify, pose; improved training pipeline and export.
  • Papers also track YOLOv9, YOLOv10, and later variants; check the exact paper or repo before repeating architectural claims.

Ultralytics: predict on image / video

Install: pip install ultralytics (pulls PyTorch). Weights download on first use.

from ultralytics import YOLO

model = YOLO("yolov8n.pt")  # nano — fastest; try s/m/l for accuracy
results = model.predict("https://ultralytics.com/images/bus.jpg", conf=0.25)

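r = results[0]  # one Results object per input image
if r.boxes is not None:
    xyxys = r.boxes.xyxy.cpu().numpy()   # (N, 4) corner coordinates in pixels
    clss = r.boxes.cls.cpu().numpy()     # (N,) class indices
    confs = r.boxes.conf.cpu().numpy()   # (N,) confidence scores
    for (x1, y1, x2, y2), cls, conf in zip(xyxys, clss, confs):
        print(f"{r.names[int(cls)]} {conf:.2f} ({x1:.0f}, {y1:.0f}, {x2:.0f}, {y2:.0f})")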

r.save("out.jpg")  # annotated image
# r.show()  # optional: requires a display

Batch and streaming

# device=0 selects the first CUDA GPU; use device="cpu" if none is available
results = model.predict(["a.jpg", "b.jpg"], device=0, imgsz=640)
for result in model.predict(source="video.mp4", stream=True):
    pass  # process each Results without holding all frames in RAM

Train and export (outline)

# Dataset: YOLO txt labels + yaml pointing to train/val images
model = YOLO("yolov8n.pt")
model.train(data="coco8.yaml", epochs=50, imgsz=640, batch=16)

model.export(format="onnx", opset=12)   # deploy with ONNX Runtime / TensorRT

Replace coco8.yaml with your dataset YAML; validate paths and class count.

Speed tips

  • Lower imgsz (e.g. 416) for throughput; raise for small objects.
  • Use nano/small weights; quantize (INT8) after calibration on representative data.
  • Enable half precision (half=True on CUDA) when numerically stable.
  • For CPU-only, prefer ONNX Runtime with optimized graph or OpenVINO where available.
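To compare the settings above on your own hardware, a small timing harness helps. The sketch below is an assumption-laden illustration: `benchmark` is a hypothetical helper, and the workload is a stand-in lambda so the block runs without a model. Swap in your own inference call to measure it.

```python
import statistics
import time


def benchmark(fn, warmup=3, runs=20):
    """Call fn() repeatedly and report latency in milliseconds (hypothetical helper)."""
    for _ in range(warmup):
        fn()  # warm-up calls are not timed (lazy initialization, caches)
    times_ms = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        times_ms.append((time.perf_counter() - start) * 1000.0)
    return {
        "median_ms": statistics.median(times_ms),
        "mean_ms": statistics.fmean(times_ms),
        "runs": runs,
    }


# Stand-in workload; replace with your own call, e.g. a predict at a given imgsz.
stats = benchmark(lambda: sum(i * i for i in range(10_000)))
print(stats)
```

The median is usually more robust than the mean against occasional slow runs. On a GPU, make sure the timed call has finished its work before the timer stops, since kernels may execute asynchronously.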

Takeaways

  • YOLO = one-shot detector family optimized for latency.
  • Ultralytics provides a practical training, validation, and export loop.
  • Match conf / NMS settings to your precision–recall needs.
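The effect of the confidence threshold on precision and recall can be seen with a small sweep. This sketch assumes each detection has already been matched against ground truth at a fixed IoU and flagged as true or false positive; the data and the helper name `precision_recall` are made up for illustration.

```python
def precision_recall(detections, num_gt, conf_thres):
    """detections: list of (confidence, is_true_positive) pairs.

    Returns (precision, recall) for detections at or above conf_thres.
    """
    kept = [is_tp for conf, is_tp in detections if conf >= conf_thres]
    tp = sum(kept)
    # Convention assumed here: precision is 1.0 when nothing is predicted.
    precision = tp / len(kept) if kept else 1.0
    recall = tp / num_gt if num_gt else 0.0
    return precision, recall


# Illustrative data: 5 detections, 4 ground-truth objects in total.
dets = [(0.95, True), (0.80, True), (0.60, False), (0.40, True), (0.30, False)]
num_gt = 4
for thres in (0.25, 0.50, 0.75):
    p, r = precision_recall(dets, num_gt, thres)
    print(f"conf>={thres:.2f}  precision={p:.2f}  recall={r:.2f}")
```

Raising the threshold trades recall for precision in this toy data; where to stop depends on whether missed objects or false alarms cost more in your application.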

Quick FAQ

YOLO is usually faster and simpler to deploy; two-stage models may win mAP on difficult COCO-style scenes. Benchmark on your own data and hardware.

Check the current Ultralytics license and model zoo terms; they have changed over time. For enterprise use, have legal review any AGPL or AGPL-style constraints that apply.

RetinaNet

Focal loss (intuition)

Standard cross-entropy (CE) is dominated by the many easy negatives (background). Focal loss multiplies CE by a modulating factor (1 − p_t)^γ, where p_t is the predicted probability of the true class and γ ≥ 0 is the focusing parameter. When the model is already confident in the correct class (p_t near 1), the loss shrinks; hard examples keep larger gradients. An optional weight α balances the contribution of positives and negatives.

# Conceptual focal modulator on top of CE (illustrative)
def focal_weight(pt, gamma=2.0):
    """Scale factor applied to CE; pt is the probability of the true class."""
    return (1.0 - pt) ** gamma
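
Putting the modulating factor, the α weight, and cross-entropy together gives the full loss for one binary prediction, FL = −α_t (1 − p_t)^γ log(p_t). The sketch below is a scalar illustration under the paper defaults mentioned in this chapter (γ=2, α=0.25); the function name `binary_focal_loss` and the `eps` clamp are assumptions for the example, not a library API.

```python
import math


def binary_focal_loss(p, y, gamma=2.0, alpha=0.25, eps=1e-12):
    """Focal loss for one binary prediction (illustrative scalar version).

    p: predicted probability of the positive class.
    y: 1 for an object, 0 for background.
    """
    p_t = p if y == 1 else 1.0 - p              # probability of the true class
    alpha_t = alpha if y == 1 else 1.0 - alpha  # class-balancing weight
    # eps guards against log(0) for completely wrong predictions.
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(max(p_t, eps))


easy_negative = binary_focal_loss(p=0.01, y=0)  # background, confidently rejected
hard_positive = binary_focal_loss(p=0.10, y=1)  # object the model nearly misses
print(f"easy negative: {easy_negative:.2e}  hard positive: {hard_positive:.3f}")
```

The easy negative contributes a loss several orders of magnitude smaller than the hard positive, which is the down-weighting effect described above. With gamma=0 and alpha=0.5 the expression reduces to half of ordinary cross-entropy.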

FPN and detection heads

RetinaNet attaches two subnetworks (classification and box regression) at each pyramid level. Anchors span scales and aspect ratios per level. Predictions are decoded to image-space boxes and filtered with thresholding and NMS—same post-processing family as other anchor-based detectors.
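
Decoding turns each anchor plus its four regression outputs into an image-space box. The sketch below assumes the center/size parameterization common to R-CNN-style anchor-based detectors (offsets scaled by anchor size, log-space width and height). It omits details real implementations add, such as per-coordinate scaling weights and clamping, and `decode_box` is a hypothetical helper name.

```python
import math


def decode_box(anchor, deltas):
    """Turn regression outputs (dx, dy, dw, dh) into an (x1, y1, x2, y2) box.

    anchor: (x1, y1, x2, y2) in pixels.
    """
    ax1, ay1, ax2, ay2 = anchor
    aw, ah = ax2 - ax1, ay2 - ay1
    acx, acy = ax1 + 0.5 * aw, ay1 + 0.5 * ah
    dx, dy, dw, dh = deltas
    # Shift the center by a fraction of the anchor size.
    cx, cy = acx + dx * aw, acy + dy * ah
    # Scale width and height in log space so they stay positive.
    w, h = aw * math.exp(dw), ah * math.exp(dh)
    return (cx - 0.5 * w, cy - 0.5 * h, cx + 0.5 * w, cy + 0.5 * h)


anchor = (0.0, 0.0, 10.0, 10.0)
print(decode_box(anchor, (0.0, 0.0, 0.0, 0.0)))          # zero deltas return the anchor
print(decode_box(anchor, (0.1, 0.0, math.log(2), 0.0)))  # shifted right, twice as wide
```

The decoded boxes then go through the score threshold and NMS described above.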

vs YOLO

Both are one-stage detectors. RetinaNet's focal loss specifically targets the foreground/background imbalance that dominates cross-entropy, while YOLO families use different assignment and loss formulations.

vs Faster R-CNN

RetinaNet has no RPN stage, so its heads score a much denser set of candidates directly. It is often slower than tiny YOLO models but reaches competitive accuracy on COCO-style data.

Inference: torchvision

import torch
import torchvision.transforms as T
from torchvision.models.detection import retinanet_resnet50_fpn
from PIL import Image

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = retinanet_resnet50_fpn(weights="DEFAULT").to(device).eval()

img = Image.open("street.jpg").convert("RGB")
x = T.functional.to_tensor(img).to(device)

with torch.no_grad():
    r = model([x])[0]

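for box, label, score in zip(r["boxes"], r["labels"], r["scores"]):
    if score < 0.5:  # skip detections below 0.5 confidence
        continue
    # box is (x1, y1, x2, y2) in pixels; label is a category index
    print(int(label), round(float(score), 2), [round(v, 1) for v in box.tolist()])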

Custom classes: replace heads

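from torchvision.models.detection import retinanet_resnet50_fpn
from torchvision.models.detection.retinanet import RetinaNetClassificationHead

num_classes = 3  # e.g. background + 2 object classes
model = retinanet_resnet50_fpn(weights="DEFAULT")

# Typical pattern: rebuild the classification head with the new class count,
# reusing the channel and anchor counts of the pretrained model.
# Attribute and constructor names below match recent torchvision releases;
# see the torchvision retinanet source for your installed version.
in_channels = model.backbone.out_channels
num_anchors = model.head.classification_head.num_anchors
model.head.classification_head = RetinaNetClassificationHead(
    in_channels, num_anchors, num_classes
)
# The box regression head is class-agnostic and keeps its pretrained weights.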

torchvision’s internal head API shifted across releases—copy the official “Training on a custom dataset” snippet for your installed version.

Takeaways

  • Focal loss counters the foreground/background imbalance that arises from scoring many anchors in dense detectors.
  • FPN gives multi-scale representation for small and large objects.
  • Strong baseline when you want one-stage accuracy without YOLO-specific tooling.

Quick FAQ

Paper defaults (γ=2, α≈0.25) are a start. On very imbalanced custom data, adjust α per class frequency or use focal loss implementations that support per-class α.

Follow-up work (FCOS, ATSS, etc.) removes hand-designed anchors; focal-style losses still appear in many modern one-stage designs.
