Computer Vision Chapter 35

Diffusion models

Diffusion models learn to reverse a gradual forward corruption: Gaussian noise is added over T steps until data looks like noise. A neural net (typically a UNet) predicts noise (or score / v-prediction depending on formulation) conditioned on timestep t—and sometimes on text. Sampling starts from pure noise and iteratively denoises. DDPM popularized discrete-time training with a variance schedule β_t. Stable Diffusion runs diffusion in a VAE latent space with a text encoder (CLIP) for conditioning. Below: schedule math sketch, training loss pattern, sampling loop outline, and optional diffusers usage.

Forward process (concept)

Given clean x_0, define q(x_t | x_{t-1}) = N(√(1-β_t) x_{t-1}, β_t I). With reparameterization, sample x_t = √(α̅_t) x_0 + √(1-α̅_t) ε with ε ~ N(0,I), where α̅_t = ∏_{s=1..t} (1-β_s) is the cumulative product of 1-β_t. Training picks a random t and teaches a net ε_θ(x_t, t) to predict ε.

# Pseudocode: sample x_t in closed form (DDPM-style)
import torch

def q_sample(x0, t, alphas_cumprod, noise=None):
    if noise is None:
        noise = torch.randn_like(x0)
    sqrt_acp = alphas_cumprod[t].sqrt().view(-1, 1, 1, 1)
    sqrt_om = (1 - alphas_cumprod[t]).sqrt().view(-1, 1, 1, 1)
    return sqrt_acp * x0 + sqrt_om * noise

Here alphas_cumprod is a precomputed tensor on the same device as x0, and t has shape [B], so each sample in the batch gets its own timestep.
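Precomputing the schedule is straightforward; a sketch using DDPM's linear β schedule (β from 1e-4 to 0.02 over T=1000 are the paper's default values):

```python
import torch

# Linear beta schedule as in DDPM; returns the three tensors the
# snippets in this chapter expect.
def make_schedule(T=1000, beta_start=1e-4, beta_end=0.02, device="cpu"):
    betas = torch.linspace(beta_start, beta_end, T, device=device)
    alphas = 1.0 - betas
    alphas_cumprod = torch.cumprod(alphas, dim=0)
    return betas, alphas, alphas_cumprod

betas, alphas, alphas_cumprod = make_schedule(T=1000)
# alphas_cumprod decays monotonically from ~1 toward ~0, so early
# timesteps barely corrupt x_0 while late ones are nearly pure noise.
```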

Training step (predict noise)

# unet(x_t, t) -> predicted noise, same shape as x_t
def train_step(unet, x0, optimizer, alphas_cumprod):
    b = x0.size(0)
    t = torch.randint(0, len(alphas_cumprod), (b,), device=x0.device)
    noise = torch.randn_like(x0)
    xt = q_sample(x0, t, alphas_cumprod, noise)
    pred = unet(xt, t)
    loss = torch.nn.functional.mse_loss(pred, noise)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
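A minimal smoke test can verify shapes and gradient flow through the training objective. TinyDenoiser is a hypothetical stand-in for a real UNet (it accepts t but ignores it), and the schedule values are DDPM defaults:

```python
import torch
from torch import nn
import torch.nn.functional as F

# Closed-form forward sample, as defined earlier in the chapter.
def q_sample(x0, t, alphas_cumprod, noise=None):
    if noise is None:
        noise = torch.randn_like(x0)
    sqrt_acp = alphas_cumprod[t].sqrt().view(-1, 1, 1, 1)
    sqrt_om = (1 - alphas_cumprod[t]).sqrt().view(-1, 1, 1, 1)
    return sqrt_acp * x0 + sqrt_om * noise

# Toy denoiser: a single conv, NOT a real UNet. Real models embed t
# (sinusoidal or learned) and condition on it; this one ignores it.
class TinyDenoiser(nn.Module):
    def __init__(self, ch=3):
        super().__init__()
        self.net = nn.Conv2d(ch, ch, 3, padding=1)

    def forward(self, x, t):
        return self.net(x)

torch.manual_seed(0)
betas = torch.linspace(1e-4, 0.02, 100)
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)

model = TinyDenoiser(ch=3)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

x0 = torch.randn(4, 3, 16, 16)           # toy batch of "images"
t = torch.randint(0, 100, (4,))          # per-sample timesteps
noise = torch.randn_like(x0)
xt = q_sample(x0, t, alphas_cumprod, noise)
loss = F.mse_loss(model(xt, t), noise)   # predict-the-noise objective
opt.zero_grad(); loss.backward(); opt.step()
```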

Sampling (ancestral DDPM outline)

@torch.no_grad()
def p_sample_loop(unet, shape, betas, alphas, alphas_cumprod):
    x = torch.randn(shape, device=betas.device)
    for t in reversed(range(len(betas))):
        ts = torch.full((shape[0],), t, device=x.device, dtype=torch.long)
        pred_noise = unet(x, ts)
        # Posterior mean of p(x_{t-1}|x_t) under the noise-prediction
        # parameterization: mu = (x - beta_t/sqrt(1-acp_t) * eps) / sqrt(alpha_t)
        coef = betas[t] / (1 - alphas_cumprod[t]).sqrt()
        x = (x - coef * pred_noise) / alphas[t].sqrt()
        if t > 0:
            # simple variance choice sigma_t^2 = beta_t; no noise at t = 0
            x = x + betas[t].sqrt() * torch.randn_like(x)
    return x

Full coefficients are in DDPM papers/cheatsheets; libraries implement them exactly.

Stable Diffusion (stack)

  • VAE encoder/decoder: diffusion runs on lower-dimensional latents; the decoder maps denoised latents back to pixels.
  • UNet: denoises latents, attending to text tokens via cross-attention.
  • Text encoder: a CLIP-like transformer turns the prompt into conditioning embeddings.
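The efficiency gain from working in latent space is concrete; a quick bookkeeping sketch using SD v1's numbers (8× spatial downsample by the VAE, 4 latent channels):

```python
# SD v1: a 512x512 RGB image becomes a 64x64 latent with 4 channels.
pixel_elems = 512 * 512 * 3
latent_elems = (512 // 8) * (512 // 8) * 4
ratio = pixel_elems / latent_elems
# The UNet processes ~48x fewer elements per denoising step.
```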

Hugging Face diffusers (optional)

# pip install diffusers transformers accelerate torch
from diffusers import StableDiffusionPipeline
import torch

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    torch_dtype=torch.float16,
).to("cuda")
img = pipe("a photo of an astronaut riding a horse").images[0]

Requires a GPU with sufficient VRAM at the default 512×512 resolution; under tighter memory budgets, use a smaller model or diffusers' CPU offload (pipe.enable_model_cpu_offload()).

Takeaways

  • Forward: add noise; learn reverse denoising transitions.
  • UNet + timestep (and optional text) conditioning is standard for images.
  • Latent diffusion + text encoder = efficient high-res generation (Stable Diffusion class).

Quick FAQ

How does DDIM differ from DDPM sampling? DDIM uses deterministic, non-Markovian updates, allowing far fewer steps (often 20–50 instead of 1000) with some quality tradeoff.

What is classifier-free guidance? Blend conditional and unconditional noise predictions to sharpen text alignment: ε = ε_u + w (ε_c − ε_u), where w > 1 amplifies the conditional direction.
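The blend is a one-liner; a sketch where eps_cond, eps_uncond, and w are illustrative names:

```python
import torch

# Classifier-free guidance: extrapolate from the unconditional prediction
# toward the conditional one. w = 1 recovers the conditional prediction;
# w > 1 (e.g. 7.5 in Stable Diffusion) strengthens prompt adherence.
def cfg(eps_cond, eps_uncond, w):
    return eps_uncond + w * (eps_cond - eps_uncond)

# In practice both predictions come from one batched UNet call: once with
# the prompt embedding and once with a null/empty-prompt embedding.
```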