Forward process (concept)
Given clean data x_0, define q(x_t | x_{t-1}) = N(√(1−β_t) x_{t-1}, β_t I). With the reparameterization trick, x_t can be sampled in closed form: x_t = √(ᾱ_t) x_0 + √(1−ᾱ_t) ε with ε ~ N(0, I), where ᾱ_t = ∏_{s=1}^{t} (1−β_s) is the cumulative product of 1−β_s. Training picks a random t and teaches a network ε_θ(x_t, t) to predict ε.
# Pseudocode: sample x_t in closed form (DDPM-style)
import torch

def q_sample(x0, t, alphas_cumprod, noise=None):
    if noise is None:
        noise = torch.randn_like(x0)
    sqrt_acp = alphas_cumprod[t].sqrt().view(-1, 1, 1, 1)
    sqrt_om = (1 - alphas_cumprod[t]).sqrt().view(-1, 1, 1, 1)
    return sqrt_acp * x0 + sqrt_om * noise
Here alphas_cumprod is a 1-D tensor precomputed on the same device as x0, and t has shape [B], so each sample in the batch gets its own timestep.
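The schedule itself is computed once up front; a minimal sketch using the linear beta schedule from the DDPM paper (1e-4 to 0.02 over 1000 steps):

```python
import torch

# Linear beta schedule (DDPM defaults: 1e-4 to 0.02 over T = 1000 steps)
T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)  # alpha-bar_t, shape [T]
# alphas_cumprod decays from ~1 (mostly signal) toward ~0 (mostly noise) as t grows
```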
Training step (predict noise)
# unet(x_t, t) -> predicted noise, same shape as x_t
def train_step(unet, x0, optimizer, alphas_cumprod):
    b = x0.size(0)
    t = torch.randint(0, len(alphas_cumprod), (b,), device=x0.device)
    noise = torch.randn_like(x0)
    xt = q_sample(x0, t, alphas_cumprod, noise)
    pred = unet(xt, t)
    loss = torch.nn.functional.mse_loss(pred, noise)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
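A quick smoke test of this step on random data. TinyNet is a hypothetical stand-in for the UNet (it ignores t; real UNets embed the timestep and condition on it), and q_sample is inlined to keep the sketch self-contained:

```python
import torch
import torch.nn as nn

# Hypothetical stand-in for the UNet: one conv that ignores the timestep
class TinyNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(3, 3, 3, padding=1)

    def forward(self, x, t):
        return self.conv(x)

T = 100
betas = torch.linspace(1e-4, 0.02, T)
alphas_cumprod = torch.cumprod(1 - betas, dim=0)

net = TinyNet()
opt = torch.optim.Adam(net.parameters(), lr=1e-3)

x0 = torch.randn(4, 3, 8, 8)
t = torch.randint(0, T, (4,))
noise = torch.randn_like(x0)
acp = alphas_cumprod[t].view(-1, 1, 1, 1)
xt = acp.sqrt() * x0 + (1 - acp).sqrt() * noise  # closed-form q_sample
loss = torch.nn.functional.mse_loss(net(xt, t), noise)
opt.zero_grad()
loss.backward()
opt.step()
```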
Sampling (ancestral DDPM outline)
@torch.no_grad()
def p_sample_loop(unet, shape, betas, alphas, alphas_cumprod):
    x = torch.randn(shape, device=betas.device)
    for t in reversed(range(len(betas))):
        ts = torch.full((shape[0],), t, device=x.device, dtype=torch.long)
        pred_noise = unet(x, ts)
        # combine pred_noise with x, betas[t], alphas[t], alphas_cumprod[t]
        # to get the mean of p(x_{t-1} | x_t); add scaled noise if t > 0
        ...
    return x
Full coefficients are in DDPM papers/cheatsheets; libraries implement them exactly.
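As a concrete sketch, one common parameterization of the inner step, using the simple fixed variance σ_t² = β_t from the DDPM paper (libraries differ in details such as x_0 clipping and the variance choice):

```python
import torch

def p_sample_step(x, pred_noise, t, betas, alphas, alphas_cumprod):
    # Posterior mean: (1/sqrt(alpha_t)) * (x - beta_t / sqrt(1 - alpha-bar_t) * eps_theta)
    coef = betas[t] / (1 - alphas_cumprod[t]).sqrt()
    mean = (x - coef * pred_noise) / alphas[t].sqrt()
    if t > 0:
        # sigma_t^2 = beta_t: add fresh noise on every step except the last
        mean = mean + betas[t].sqrt() * torch.randn_like(x)
    return mean
```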
Stable Diffusion (stack)
- VAE encoder/decoder: operate in lower-dimensional latent images.
- UNet: denoises latents with cross-attention to text tokens.
- Text encoder: CLIP-like transformer turns prompt into conditioning.
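To make the data flow concrete, the tensor shapes in an SD v1-style stack (illustrative values; exact dimensions vary by model):

```python
import torch

# Shapes only, with batch B = 1 (illustrative, SD v1-style)
B = 1
latents = torch.randn(B, 4, 64, 64)   # VAE latent for a 512x512 RGB image (8x downsampling)
text_emb = torch.randn(B, 77, 768)    # CLIP text-encoder output: 77 tokens, 768-dim each
# The UNet takes (latents, t, text_emb) and predicts noise of shape [B, 4, 64, 64];
# the VAE decoder maps denoised latents back to a [B, 3, 512, 512] image.
```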
Hugging Face diffusers (optional)
# pip install diffusers transformers accelerate torch
from diffusers import StableDiffusionPipeline
import torch
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    torch_dtype=torch.float16,
).to("cuda")
img = pipe("a photo of an astronaut riding a horse").images[0]
This requires a GPU with enough VRAM for the default 512×512 resolution; on constrained hardware, use a smaller model or enable CPU offload.
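For example, diffusers exposes memory-saving switches on the pipeline object from the snippet above (CPU offload requires accelerate to be installed; skip the .to("cuda") call when using it):

```python
# Applied to the pipe object constructed above:
pipe.enable_model_cpu_offload()   # keep submodules on CPU, move each to GPU only while in use
pipe.enable_attention_slicing()   # compute attention in slices to lower peak VRAM
```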
Takeaways
- Forward: add noise; learn reverse denoising transitions.
- UNet + timestep (and optional text) conditioning is standard for images.
- Latent diffusion + text encoder = efficient high-res generation (Stable Diffusion class).
Quick FAQ
How does the text prompt steer sampling? Classifier-free guidance: run the UNet with and without the conditioning and combine the two noise predictions as ε_u + w (ε_c − ε_u), where ε_c is the conditional prediction, ε_u the unconditional one, and w the guidance scale.
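The guidance combination above can be sketched directly (cfg_combine is a hypothetical helper name):

```python
import torch

def cfg_combine(eps_uncond, eps_cond, w):
    # w = 1 recovers the conditional prediction; w > 1 amplifies the prompt's influence
    return eps_uncond + w * (eps_cond - eps_uncond)
```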