Computer Vision Chapter 17

Face & Pose Estimation

Face recognition pipelines and human pose keypoint estimation.

Face recognition

Face detection: Haar cascade

import cv2

face_cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml"
)
# img_bgr: a BGR image array, e.g. as returned by cv2.imread(...)
gray = cv2.cvtColor(img_bgr, cv2.COLOR_BGR2GRAY)
faces = face_cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5, minSize=(40, 40))
for (x, y, w, h) in faces:
    cv2.rectangle(img_bgr, (x, y), (x + w, y + h), (0, 255, 0), 2)

Fast and dependency-light, but less robust than modern CNN detectors under difficult lighting or non-frontal poses.

DNN face detector (OpenCV)

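Download a pretrained model in Caffe or TensorFlow format from the OpenCV model zoo (e.g. a single-shot detector variant). Load it with cv2.dnn.readNetFromCaffe or cv2.dnn.readNetFromTensorflow, build a blob from the image with cv2.dnn.blobFromImage, pass it to net.setInput, call net.forward, then decode the boxes and apply non-maximum suppression (NMS) if the model does not already do so internally.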

net = cv2.dnn.readNetFromTensorflow("opencv_face_detector_uint8.pb",
                                    "opencv_face_detector.pbtxt")
h, w = img_bgr.shape[:2]
blob = cv2.dnn.blobFromImage(img_bgr, 1.0, (300, 300), [104, 117, 123])
net.setInput(blob)
detections = net.forward()
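# Assumed layout for this SSD model: detections has shape [1, 1, N, 7]; each row is
# [image_id, class_id, confidence, x1, y1, x2, y2] with corners normalized to [0, 1].
for i in range(detections.shape[2]):
    conf = float(detections[0, 0, i, 2])
    if conf < 0.5:  # illustrative threshold; tune per application
        continue
    x1, y1, x2, y2 = [int(v) for v in detections[0, 0, i, 3:7] * [w, h, w, h]]
    cv2.rectangle(img_bgr, (x1, y1), (x2, y2), (0, 255, 0), 2)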

Exact decoding depends on the model’s output layout; see OpenCV samples for the matching version.
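For models that return raw, overlapping boxes, the NMS step mentioned above can be sketched in plain Python. This is a minimal greedy version for illustration only: the function names iou and nms, the (x1, y1, x2, y2) box format, and the 0.4 IoU threshold are assumptions, not part of any particular model's decoding code.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_thresh=0.4):
    """Greedy NMS: return indices of kept boxes, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep a box only if it does not overlap too much with any kept box.
        if all(iou(boxes[i], boxes[j]) <= iou_thresh for j in keep):
            keep.append(i)
    return keep


# Made-up boxes: the first two overlap heavily, the third is separate.
boxes = [(10, 10, 110, 110), (12, 14, 112, 108), (200, 50, 260, 120)]
scores = [0.92, 0.85, 0.70]
kept = nms(boxes, scores)
print(kept)
```

In practice OpenCV's cv2.dnn.NMSBoxes does the same job on (x, y, w, h) boxes and is usually the more convenient choice.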

Embeddings (concept)

Crop the face, resize to the network input (often 112×112), run the backbone + embedding head. L2-normalize vectors so cosine similarity equals dot product.
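The cropping step can be sketched as a small helper that turns a detector box into a square crop region with some context around the face. The helper name square_crop_box and the 20% margin are assumptions for illustration; real pipelines often align on eye landmarks instead of using a plain box.

```python
def square_crop_box(x, y, w, h, img_w, img_h, margin=0.2):
    """Return (x1, y1, x2, y2) of a square crop around a face box, clipped to the image."""
    side = max(w, h) * (1.0 + margin)
    cx, cy = x + w / 2.0, y + h / 2.0
    x1 = int(round(cx - side / 2.0))
    y1 = int(round(cy - side / 2.0))
    x2 = int(round(cx + side / 2.0))
    y2 = int(round(cy + side / 2.0))
    # Clip to the image bounds; near the borders the crop may no longer be square.
    return max(0, x1), max(0, y1), min(img_w, x2), min(img_h, y2)


# A 60x80 face box at (100, 80) inside a 640x480 image (made-up numbers).
box = square_crop_box(100, 80, 60, 80, img_w=640, img_h=480)
print(box)
```

The result can be used as img_bgr[y1:y2, x1:x2], followed by cv2.resize to the encoder's input size.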

Verification: cosine similarity

import torch
import torch.nn.functional as F

def l2n(x):
    return F.normalize(x, dim=1)

# e1, e2: [1, D] from your face encoder
sim = (l2n(e1) * l2n(e2)).sum(dim=1)
same_person = sim > 0.35  # threshold is model- and dataset-specific

Takeaways

  • Detection quality limits end-to-end accuracy—align before encoding when possible.
  • Use calibrated thresholds; report FAR/FRR for security-sensitive use.
  • Privacy: biometrics need consent, secure storage, and compliance (e.g. GDPR).

Quick FAQ

Liveness detection (texture analysis, depth sensing, challenge-response) is the usual defense against printed-photo and screen-replay attacks.

Verification: one-to-one (probe vs claimed identity). Identification: one-to-many search over a gallery (nearest neighbor in embedding space).
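Identification as a nearest-neighbor search can be sketched with a brute-force scan over a small gallery. The function identify, the dictionary gallery layout, the 3-dimensional toy embeddings, and the 0.35 threshold are all assumptions for illustration; embeddings are assumed to be non-zero vectors, and large galleries would normally use an approximate nearest-neighbor index.

```python
import math


def l2_normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def identify(probe, gallery, threshold=0.35):
    """Return (best_id, similarity); best_id is None if no entry clears the threshold."""
    p = l2_normalize(probe)
    best_id, best_sim = None, -1.0
    for identity, emb in gallery.items():
        g = l2_normalize(emb)
        sim = sum(a * b for a, b in zip(p, g))  # cosine similarity of unit vectors
        if sim > best_sim:
            best_id, best_sim = identity, sim
    if best_sim < threshold:
        return None, best_sim
    return best_id, best_sim


# Toy gallery with made-up 3-D embeddings.
gallery = {"alice": [0.9, 0.1, 0.0], "bob": [0.0, 1.0, 0.2]}
match, score = identify([0.8, 0.2, 0.1], gallery)
print(match, round(score, 3))
```

Verification is the special case where the gallery holds only the claimed identity.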

Pose estimation

COCO-17 keypoints (idea)

The 17 keypoints are usually ordered as: nose, then left/right pairs of eyes, ears, shoulders, elbows, wrists, hips, knees, and ankles. Each predicted point has (x, y) coordinates and often a confidence score; a low confidence usually indicates that the joint is occluded or out of frame. To render a skeleton, connect keypoint pairs using a fixed edge list.
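As a sketch of the idea, the keypoint order and an edge list can be written down as plain data, with a helper that keeps only edges whose endpoints are confident enough to draw. The edge list here is one common choice rather than a canonical one, and visible_edges and the 0.3 cutoff are assumptions for illustration.

```python
COCO_KEYPOINTS = [
    "nose", "left_eye", "right_eye", "left_ear", "right_ear",
    "left_shoulder", "right_shoulder", "left_elbow", "right_elbow",
    "left_wrist", "right_wrist", "left_hip", "right_hip",
    "left_knee", "right_knee", "left_ankle", "right_ankle",
]

# Index pairs into COCO_KEYPOINTS; renderers differ in the exact set they draw.
SKELETON = [
    (0, 1), (0, 2), (1, 3), (2, 4),            # head
    (5, 6), (5, 7), (7, 9), (6, 8), (8, 10),   # shoulders and arms
    (5, 11), (6, 12), (11, 12),                # torso
    (11, 13), (13, 15), (12, 14), (14, 16),    # legs
]


def visible_edges(keypoints, min_conf=0.3):
    """keypoints: list of 17 (x, y, conf) tuples. Return drawable line segments."""
    segments = []
    for a, b in SKELETON:
        xa, ya, ca = keypoints[a]
        xb, yb, cb = keypoints[b]
        if ca >= min_conf and cb >= min_conf:
            segments.append(((xa, ya), (xb, yb)))
    return segments


# Made-up pose: every joint confident except the left wrist (index 9).
kps = [(i * 10.0, i * 5.0, 0.9) for i in range(17)]
kps[9] = (0.0, 0.0, 0.1)
segments = visible_edges(kps)
print(len(segments), "of", len(SKELETON), "edges drawn")
```

Each returned segment can be passed to a line-drawing call such as cv2.line after rounding the coordinates to integers.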

MediaPipe Pose (Python)

# pip install mediapipe opencv-python
import cv2
import mediapipe as mp

mp_pose = mp.solutions.pose
mp_draw = mp.solutions.drawing_utils

img_rgb = cv2.cvtColor(img_bgr, cv2.COLOR_BGR2RGB)
with mp_pose.Pose(static_image_mode=True) as pose:
    res = pose.process(img_rgb)
    if res.pose_landmarks:
        mp_draw.draw_landmarks(
            img_bgr, res.pose_landmarks, mp_pose.POSE_CONNECTIONS)

For video, set static_image_mode=False and reuse the same Pose instance across frames for smoother tracking.

OpenCV DNN (OpenPose-style)

OpenCV samples load Caffe/ONNX multi-branch models that output heatmaps and part affinity fields. You download the model files from the OpenCV GitHub wiki, run net.forward, then decode heatmap peaks and associate limbs. This takes more code than MediaPipe but runs fully offline and is customizable.
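The peak-decoding step can be sketched for the single-person case as an argmax over each heatmap. This is a simplified illustration: decode_heatmaps is a hypothetical helper, heatmaps are plain nested lists rather than network output tensors, and multi-person scenes additionally need the part affinity fields to group joints, which is not shown.

```python
def decode_heatmaps(heatmaps, img_w, img_h, min_conf=0.1):
    """heatmaps: list of K 2-D grids (rows of floats).

    Returns K entries, each (x, y, conf) in image pixels, or None if the peak is too weak.
    """
    points = []
    for hm in heatmaps:
        rows, cols = len(hm), len(hm[0])
        # Strongest cell in this keypoint's heatmap.
        best, br, bc = max((hm[r][c], r, c) for r in range(rows) for c in range(cols))
        if best < min_conf:
            points.append(None)
            continue
        # Map the cell centre from heatmap resolution to image pixels.
        x = (bc + 0.5) * img_w / cols
        y = (br + 0.5) * img_h / rows
        points.append((x, y, best))
    return points


# Two tiny made-up heatmaps (2 rows x 3 columns each).
heatmaps = [
    [[0.0, 0.1, 0.0], [0.0, 0.9, 0.2]],     # clear peak at row 1, column 1
    [[0.01, 0.02, 0.0], [0.0, 0.03, 0.0]],  # nothing above the threshold
]
pts = decode_heatmaps(heatmaps, img_w=300, img_h=200)
print(pts)
```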

3D pose

Extends estimation to camera-centered 3D joint coordinates, obtained by monocular lifting, multi-view fusion, or depth sensors. Commonly used in biomechanics and AR applications.
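For the depth-sensor route, a 2D keypoint plus its measured depth can be lifted to camera coordinates with the pinhole model. The sketch below assumes an undistorted image and known intrinsics; the values of fx, fy, cx, cy and the wrist measurement are hypothetical.

```python
def backproject(u, v, depth, fx, fy, cx, cy):
    """Pinhole back-projection of pixel (u, v) at depth Z to camera coordinates (X, Y, Z)."""
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return x, y, depth


# Hypothetical intrinsics for a 640x480 camera, and a wrist keypoint 2 m away.
fx = fy = 600.0
cx, cy = 320.0, 240.0
wrist_3d = backproject(u=380.0, v=300.0, depth=2.0, fx=fx, fy=fy, cx=cx, cy=cy)
print(wrist_3d)
```

Monocular lifting and multi-view fusion replace the measured depth with a learned or triangulated estimate, but produce coordinates in the same camera-centered frame.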

Takeaways

  • Normalize crops and augment data for robustness to scale and clothing.
  • Multi-person scenes need association (top-down boxes or bottom-up grouping).
  • Ethics: pose in public spaces raises consent and surveillance concerns.

Quick FAQ

For jittery keypoints, temporal smoothing (Kalman filter, exponential moving average) or a higher input resolution often helps.

Models may hallucinate hidden joints; use confidence thresholds and temporal consistency checks.
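Smoothing and confidence gating can be combined in one small update step. The sketch below applies an exponential moving average per joint and holds the last trusted position when the current confidence is low. The name smooth_keypoints, alpha=0.5, and the 0.3 cutoff are assumptions to tune per model, and the first frame is taken as-is even if it is unreliable.

```python
def smooth_keypoints(prev, current, alpha=0.5, min_conf=0.3):
    """Blend the current (x, y, conf) keypoints into the previous smoothed state."""
    if prev is None:
        return list(current)
    out = []
    for (px, py, _), (x, y, c) in zip(prev, current):
        if c < min_conf:
            # Low confidence: keep the last trusted position, but report the low score.
            out.append((px, py, c))
        else:
            out.append((alpha * x + (1 - alpha) * px,
                        alpha * y + (1 - alpha) * py,
                        c))
    return out


# One joint tracked over three made-up frames; the last detection is unreliable.
frames = [
    [(100.0, 200.0, 0.9)],
    [(110.0, 200.0, 0.9)],
    [(400.0, 50.0, 0.1)],
]
state = None
for kps in frames:
    state = smooth_keypoints(state, kps)
print(state)
```

A smaller alpha gives smoother but laggier motion; a Kalman filter is the usual next step when a velocity model is needed.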
