Computer Vision Interview 60 Q&A Chapter 16

Generative Vision Models — Interview Q&A

Autoencoders, GANs, and diffusion models for image generation and reconstruction.

Autoencoders: 20 Essential Q&A

1 What is an autoencoder? ⚡ easy
Answer: Neural net trained to copy input to output through a bottleneck: encoder maps x→z, decoder maps z→x̂—forces compact representation.
2 Role of the encoder? ⚡ easy
Answer: Maps high-dimensional input (e.g. image) to a lower-dimensional latent code z—extracts salient factors.
3 Role of the decoder? ⚡ easy
Answer: Maps the latent z back to the output space. Reconstruction is lossy only to the extent that the bottleneck actually limits capacity.
4 Why a bottleneck? 📊 medium
Answer: Constrains information flow so the model must learn a compressed code—similar inputs map to nearby latents if the AE is well regularized.
5 Common reconstruction loss? 📊 medium
Answer: MSE (L2) per pixel for continuous images; BCE if outputs are probabilities; perceptual losses use a pretrained net’s features.
# loss = F.mse_loss(recon, x)  # vanilla AE
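The two pixel-wise losses named above can be written out by hand. This is a minimal pure-Python sketch on a flattened toy image; the helper names `mse_loss` and `bce_loss` and the sample values are illustrative assumptions, not a library API.

```python
import math

def mse_loss(recon, x):
    """Mean squared error over all pixels (continuous-valued images)."""
    return sum((r - t) ** 2 for r, t in zip(recon, x)) / len(x)

def bce_loss(recon, x, eps=1e-7):
    """Mean binary cross-entropy; recon values are probabilities in (0, 1)."""
    total = 0.0
    for r, t in zip(recon, x):
        r = min(max(r, eps), 1.0 - eps)  # clamp to avoid log(0)
        total += -(t * math.log(r) + (1.0 - t) * math.log(1.0 - r))
    return total / len(x)

x = [0.0, 1.0, 1.0, 0.0]      # toy "image" flattened to 4 pixels
recon = [0.1, 0.8, 0.9, 0.2]  # hypothetical decoder output
print(mse_loss(recon, x), bce_loss(recon, x))
```

In a real pipeline the same quantities would come from a tensor library; the point here is only what each loss measures.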
6 Under-complete vs over-complete? 🔥 hard
Answer: Under-complete: dim(z) < dim(x), so the code is a true compression. Over-complete: dim(z) ≥ dim(x); without regularization (sparsity, denoising, VAE) the network can learn the trivial identity map.
7 What is a denoising autoencoder? 📊 medium
Answer: Train on corrupted inputs (noise, masking) to reconstruct clean x—learns robust features instead of copying noise.
8 Sparse autoencoder? 📊 medium
Answer: Penalize activations (e.g. KL on firing rates) so few units active per example—encourages meaningful distributed codes when over-complete.
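The KL sparsity penalty mentioned above compares a target firing rate ρ with each hidden unit's average activation ρ̂ over a batch. A minimal sketch, assuming sigmoid-style activations in (0, 1); the function name and the default ρ = 0.05 are illustrative choices.

```python
import math

def kl_sparsity_penalty(mean_activations, rho=0.05, eps=1e-8):
    """Sum over hidden units of KL(Bernoulli(rho) || Bernoulli(rho_hat))."""
    total = 0.0
    for rho_hat in mean_activations:
        rho_hat = min(max(rho_hat, eps), 1.0 - eps)  # keep the logs finite
        total += rho * math.log(rho / rho_hat)
        total += (1.0 - rho) * math.log((1.0 - rho) / (1.0 - rho_hat))
    return total

# Units that fire about 5% of the time cost nothing; busier units are penalized.
print(kl_sparsity_penalty([0.05, 0.05]))
print(kl_sparsity_penalty([0.5, 0.1]))
```

This term is added to the reconstruction loss with a weight that is tuned per task.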
9 VAE vs deterministic AE? 📊 medium
Answer: VAE encodes a distribution q(z|x); sample z for decoder—adds KL to prior p(z) for a generative model with smooth latent space.
10 What does the KL term do? 🔥 hard
Answer: Pulls approximate posterior toward prior (often N(0,I))—balances reconstruction vs regularization; enables sampling new z ~ p(z).
11 Reparameterization trick? 🔥 hard
Answer: Write z = μ(x) + σ(x)⊙ε with ε~N(0,I), so the randomness sits in ε and gradients flow through μ and σ. This is what makes backprop through the sampling step possible.
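The KL term and the reparameterized sample can both be written in a few lines for a diagonal Gaussian posterior. A minimal sketch, assuming the encoder outputs `mu` and `log_var` as plain lists; both function names are illustrative.

```python
import math
import random

def reparameterize(mu, log_var, rng):
    """z = mu + sigma * eps with eps ~ N(0, I); sigma = exp(0.5 * log_var)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, log_var)]

def kl_to_standard_normal(mu, log_var):
    """Closed-form KL( N(mu, diag(sigma^2)) || N(0, I) )."""
    return 0.5 * sum(m * m + math.exp(lv) - 1.0 - lv
                     for m, lv in zip(mu, log_var))

rng = random.Random(0)
mu, log_var = [0.5, -0.2], [0.0, -1.0]  # hypothetical encoder outputs
z = reparameterize(mu, log_var, rng)
print(z, kl_to_standard_normal(mu, log_var))
```

The KL is zero exactly when μ = 0 and σ = 1, which is why a large KL weight pushes every posterior toward the prior.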
12 Use for anomaly detection? 📊 medium
Answer: Train on normal data; high reconstruction error on test indicates out-of-distribution—used in defect and fraud pipelines.
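One common way to turn reconstruction error into a decision is a threshold fitted on held-out normal data. A minimal sketch, assuming the per-sample errors have already been computed; the mean-plus-k-standard-deviations rule and the numbers are illustrative, and a quantile of the normal errors is an equally reasonable choice.

```python
import statistics

def fit_threshold(normal_errors, k=3.0):
    """Threshold = mean + k * std of errors measured on normal validation data."""
    return statistics.mean(normal_errors) + k * statistics.pstdev(normal_errors)

def is_anomaly(error, threshold):
    """Flag a sample whose reconstruction error exceeds the fitted threshold."""
    return error > threshold

normal_errors = [0.010, 0.012, 0.011, 0.009, 0.013]  # made-up validation errors
threshold = fit_threshold(normal_errors)
print(threshold, is_anomaly(0.05, threshold), is_anomaly(0.012, threshold))
```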
13 Link to PCA? 🔥 hard
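Answer: A linear AE trained with MSE learns the same subspace as the top-k PCA components, although its latent axes need not be the orthogonal principal directions themselves. Deep nonlinear AEs generalize this with more representational power.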
14 Disentangled representations? 🔥 hard
Answer: Ideal latents align with generative factors; plain AE does not guarantee this—β-VAE and supervision help.
15 AE vs GAN for generation? 📊 medium
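Answer: AEs and VAEs optimize reconstruction or likelihood-based objectives (the ELBO for a VAE); GANs optimize an adversarial realism signal. GAN samples are often sharper, while VAEs train more stably and give a smoother, better-structured latent space.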
16 Convolutional autoencoder? ⚡ easy
Answer: Encoder stacks conv+pool/downsample; decoder uses upsample/transpose conv—standard for images.
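When designing the encoder and decoder it helps to check that the spatial sizes mirror each other. The sketch below uses the standard output-size formulas for a convolution and a transposed convolution (dilation 1); the function names and the layer settings are illustrative assumptions.

```python
def conv_out(n, kernel, stride=1, padding=0):
    """Output size of a strided convolution along one spatial axis."""
    return (n + 2 * padding - kernel) // stride + 1

def conv_transpose_out(n, kernel, stride=1, padding=0, output_padding=0):
    """Output size of a transposed convolution along one spatial axis."""
    return (n - 1) * stride - 2 * padding + kernel + output_padding

# Encoder: two stride-2 convs (kernel 3, padding 1) take 64 -> 32 -> 16.
h = conv_out(conv_out(64, 3, 2, 1), 3, 2, 1)
# Decoder: two stride-2 transposed convs (kernel 4, padding 1) take 16 -> 32 -> 64.
g = conv_transpose_out(conv_transpose_out(h, 4, 2, 1), 4, 2, 1)
print(h, g)
```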
17 AE for super-resolution? 📊 medium
Answer: Condition decoder on low-res input or use skip connections (U-Net style)—AE ideas plus perceptual loss improve texture.
18 Embeddings for search? 📊 medium
Answer: Use encoder output as vector; nearest neighbors in latent space for similar images—may need contrastive training for metric quality.
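Latent-space search reduces to ranking stored embeddings by similarity to a query embedding. A minimal sketch with cosine similarity over a tiny in-memory gallery; the vectors and image names are made up, and a real system would use an approximate nearest-neighbor index.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def nearest(query, gallery, k=1):
    """Return the names of the k gallery items most similar to the query."""
    ranked = sorted(gallery.items(),
                    key=lambda item: cosine_similarity(query, item[1]),
                    reverse=True)
    return [name for name, _ in ranked[:k]]

gallery = {  # hypothetical encoder outputs
    "cat_1": [0.9, 0.1, 0.0],
    "cat_2": [0.8, 0.2, 0.1],
    "car_1": [0.0, 0.1, 0.9],
}
print(nearest([1.0, 0.0, 0.0], gallery, k=2))
```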
19 Training tips? ⚡ easy
Answer: Normalize inputs; watch for posterior collapse in VAE; use skip connections if reconstruction is blurry from pure bottleneck.
20 Limitations? 📊 medium
Answer: Reconstructions can be blurry (MSE averages); latent may be entangled; vanilla AE is not a sharp generative model without VAE/GAN hybrids.

GANs: 20 Essential Q&A

21 What is a GAN? ⚡ easy
Answer: Generative model with generator G(z) making samples and discriminator D(x) judging real vs fake—trained adversarially.
22 State the min-max game. 🔥 hard
Answer: min_G max_D V(D,G) = E_x[log D(x)] + E_z[log(1−D(G(z)))]. D maximizes V and G minimizes it. With the optimal D, V equals 2·JSD(p_data‖p_G) − log 4, so G is minimizing the Jensen–Shannon divergence in the classic formulation.
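The value function can be checked numerically on a small discrete example, where the optimal discriminator has the closed form D*(x) = p_data(x) / (p_data(x) + p_G(x)). This is a sketch over toy probability tables, not a training loop; the function names are illustrative, and it assumes every outcome has positive mass under at least one distribution.

```python
import math

def optimal_discriminator(p_data, p_g):
    """D*(x) = p_data(x) / (p_data(x) + p_g(x)) for each discrete outcome x."""
    return [pd / (pd + pg) for pd, pg in zip(p_data, p_g)]

def value_function(p_data, p_g, d):
    """V(D, G) = E_data[log D(x)] + E_g[log(1 - D(x))] over a discrete support."""
    total = 0.0
    for pd, pg, dx in zip(p_data, p_g, d):
        if pd > 0:
            total += pd * math.log(dx)
        if pg > 0:
            total += pg * math.log(1.0 - dx)
    return total

matched = [0.5, 0.5]
print(value_function(matched, matched, optimal_discriminator(matched, matched)))
p_data, p_g = [1.0, 0.0], [0.0, 1.0]  # fully disjoint supports
print(value_function(p_data, p_g, optimal_discriminator(p_data, p_g)))
```

When the generator matches the data, V is −log 4; when the supports are disjoint, V reaches its maximum of 0 and the JS-based signal saturates.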
23 Role of generator? 📊 medium
Answer: Maps noise z (latent) to data space—should match real data distribution at optimum.
# min_G max_D V(D,G): alternate k discriminator steps with 1 generator step
24 Role of discriminator? 📊 medium
Answer: Binary classifier estimating probability “real”—provides training signal to G via gradient through D.
25 Nash equilibrium? 🔥 hard
Answer: At the ideal optimum p_G = p_data and D(x) = ½ everywhere, so D can no longer tell real from fake. This is hard to reach in practice with finite capacity and SGD.
26 Why unstable training? 📊 medium
Answer: Oscillating dynamics instead of convergence; vanishing generator gradients when D is too strong; uninformative gradients when D is too weak. Balanced updates and architecture tricks help.
27 What is mode collapse? 📊 medium
Answer: G outputs few varieties ignoring diversity—D cannot push G to cover all modes; minibatch discrimination and unrolled GANs mitigate.
28 DCGAN guidelines? 📊 medium
Answer: Strided convolutions, BatchNorm, no FC except input/output, ReLU in G, LeakyReLU in D—empirical recipe for stable conv GANs.
29 WGAN / WGAN-GP? 🔥 hard
Answer: Use Wasserstein distance with Lipschitz critic (weight clip or gradient penalty)—smoother training signal than JS when distributions disjoint.
30 LSGAN / hinge? 📊 medium
Answer: Replace sigmoid BCE with least-squares or hinge losses—often more stable gradients in practice.
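Both alternatives operate on raw discriminator scores rather than sigmoid probabilities. A minimal sketch on plain lists of scores; the LSGAN version assumes the common target coding of 1 for real and 0 for fake, and all function names are illustrative.

```python
def _mean(values):
    return sum(values) / len(values)

def hinge_d_loss(real_scores, fake_scores):
    """Discriminator hinge loss: push real scores above +1 and fake scores below -1."""
    return (_mean([max(0.0, 1.0 - s) for s in real_scores])
            + _mean([max(0.0, 1.0 + s) for s in fake_scores]))

def hinge_g_loss(fake_scores):
    """Generator hinge loss: raise the discriminator's score on fakes."""
    return -_mean(fake_scores)

def lsgan_d_loss(real_scores, fake_scores):
    """Least-squares discriminator loss with targets 1 (real) and 0 (fake)."""
    return 0.5 * (_mean([(s - 1.0) ** 2 for s in real_scores])
                  + _mean([s ** 2 for s in fake_scores]))

def lsgan_g_loss(fake_scores):
    """Least-squares generator loss: pull fake scores toward the real target 1."""
    return 0.5 * _mean([(s - 1.0) ** 2 for s in fake_scores])

print(hinge_d_loss([2.0, 0.5], [-2.0, 0.0]), lsgan_d_loss([0.9], [0.2]))
```

Unlike saturating sigmoid BCE, both losses keep giving the generator a gradient for fakes that the discriminator already rejects confidently.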
31 Conditional GAN? 📊 medium
Answer: Both G and D conditioned on label, text, or image—enables targeted generation (class-conditional faces, etc.).
32 pix2pix? 📊 medium
Answer: Paired image-to-image with U-Net G and PatchGAN D—L1 + adversarial loss for aligned translation (maps→aerial).
33 CycleGAN? 🔥 hard
Answer: Unpaired domains with cycle consistency L_cycle(G,F)—horse↔zebra without paired data.
34 StyleGAN idea? 📊 medium
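Answer: A mapping network turns z into an intermediate latent w; learned affine transforms of w then modulate each synthesis layer, so early layers control coarse structure and later layers control fine detail. Known for high-quality face generation.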
35 FID / Inception Score? 📊 medium
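Answer: IS rewards samples that a pretrained Inception classifier labels confidently while the predicted labels stay diverse across samples (higher is better); FID compares the mean and covariance of Inception features for generated vs real images (lower is better).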
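FID is the Fréchet distance between two Gaussians fitted to feature statistics. In one dimension that distance collapses to (μ_r − μ_g)² + (σ_r − σ_g)², which makes the idea easy to see. The sketch below is a deliberately simplified 1-D illustration on made-up scalar features; real FID uses high-dimensional Inception features and a matrix square root of the covariance product.

```python
import statistics

def frechet_distance_1d(real_feats, fake_feats):
    """Fréchet distance between 1-D Gaussians fitted to two sets of scalar features."""
    mu_r, mu_f = statistics.mean(real_feats), statistics.mean(fake_feats)
    sd_r, sd_f = statistics.pstdev(real_feats), statistics.pstdev(fake_feats)
    return (mu_r - mu_f) ** 2 + (sd_r - sd_f) ** 2

real = [0.0, 1.0, 2.0]
print(frechet_distance_1d(real, real))             # identical statistics
print(frechet_distance_1d(real, [1.0, 2.0, 3.0]))  # mean shifted by 1
```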
36 Signs of bad training? ⚡ easy
Answer: D loss → 0 instantly, G loss frozen, identical outputs, exploding gradients—tune LR, label smoothing, TTUR.
37 Spectral normalization? 🔥 hard
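Answer: Divide each weight matrix in D by its largest singular value (estimated with power iteration) so every layer is roughly 1-Lipschitz. A cheaper alternative to the WGAN-GP gradient penalty for stabilizing the discriminator or critic.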
38 Data needs? ⚡ easy
Answer: Large diverse sets for photorealism; data augmentation and balancing classes help conditional GANs.
39 GAN vs diffusion today? 📊 medium
Answer: Diffusion often wins on diversity and training stability for images; GANs still valued for fast sampling and some domains.
40 Deepfakes concern? ⚡ easy
Answer: Misinformation and consent—watermarking, detection models, and policy; same tech powers legitimate VFX and data synthesis.

Diffusion Models: 20 Essential Q&A

41 What is a diffusion model? ⚡ easy
Answer: Generative model that learns to reverse a gradual noising process—start from Gaussian noise and denoise into a sample.
42 Forward process? 📊 medium
Answer: Fixed Markov chain adding Gaussian noise over T steps until data ≈ pure noise—q(x_t|x_{t−1}) with known variances.
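Because the noise is Gaussian, x_t can be sampled directly from x_0 without stepping through the chain: x_t = √ᾱ_t·x_0 + √(1−ᾱ_t)·ε, where ᾱ_t is the running product of (1−β_s). A minimal sketch assuming the linear schedule commonly used with DDPM (β from 1e-4 to 0.02 over 1000 steps); the function names are illustrative.

```python
import math
import random

def linear_betas(T, beta_start=1e-4, beta_end=0.02):
    """Linearly spaced noise variances beta_1..beta_T."""
    return [beta_start + (beta_end - beta_start) * t / (T - 1) for t in range(T)]

def alpha_bars(betas):
    """Cumulative products of (1 - beta_s), i.e. the signal fraction kept at each t."""
    out, prod = [], 1.0
    for b in betas:
        prod *= 1.0 - b
        out.append(prod)
    return out

def q_sample(x0, t, abar, rng):
    """Draw x_t ~ q(x_t | x_0) in one shot; also return the noise that was added."""
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    a, s = math.sqrt(abar[t]), math.sqrt(1.0 - abar[t])
    return [a * x + s * e for x, e in zip(x0, eps)], eps

abar = alpha_bars(linear_betas(1000))
x_t, eps = q_sample([1.0, -1.0], 999, abar, random.Random(0))
print(abar[0], abar[-1], x_t)
```

At the last step ᾱ is close to zero, so x_t is almost pure noise; the returned `eps` is the quantity an ε-prediction network is trained to recover.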
43 Reverse process? 📊 medium
Answer: Learn p_θ(x_{t−1}|x_t) approximating true posterior—typically predict noise ε or x_0 with a neural net.
44 Training objective (DDPM)? 🔥 hard
Answer: Simplified ε-prediction MSE: sample t and ε, form x_t, and train the network to predict ε. This is a reweighted form of the variational lower bound.
45 Noise schedule β_t? 📊 medium
Answer: How fast variance grows with t—linear, cosine, etc.; affects training stability and sample quality.
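A cosine schedule defines ᾱ_t directly and derives β_t from consecutive ratios. The sketch below follows the commonly cited form with offset s = 0.008 and a clip on β near the end; treat the constants and function names as assumptions for illustration.

```python
import math

def cosine_alpha_bar(t, T, s=0.008):
    """Signal fraction alpha_bar_t under a cosine schedule, normalized so alpha_bar_0 = 1."""
    def f(u):
        return math.cos((u / T + s) / (1.0 + s) * math.pi / 2.0) ** 2
    return f(t) / f(0)

def cosine_betas(T, max_beta=0.999):
    """beta_t = 1 - alpha_bar_t / alpha_bar_{t-1}, clipped to avoid a singular last step."""
    betas = []
    for t in range(1, T + 1):
        beta = 1.0 - cosine_alpha_bar(t, T) / cosine_alpha_bar(t - 1, T)
        betas.append(min(beta, max_beta))
    return betas

T = 1000
betas = cosine_betas(T)
print(cosine_alpha_bar(T // 2, T), betas[0], betas[-1])
```

Halfway through, the cosine schedule still keeps roughly half of the signal, whereas a linear β schedule from 1e-4 to 0.02 has already destroyed most of it by then.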
# ε-prediction: target noise ε; pred = unet(x_t, t)
46 Why a U-Net? 📊 medium
Answer: Multi-scale spatial denoising with skip connections—preserves detail while aggregating context; time t injected via embeddings.
47 Sampling cost? 📊 medium
Answer: Sampling is sequential over timesteps, so hundreds or thousands of network evaluations make it slow; faster samplers (DDIM) and distillation cut the number of steps.
48 DDIM? 🔥 hard
Answer: Non-Markovian sampler that reuses a model trained with the DDPM objective; with η = 0 the updates are deterministic. Allows far fewer steps at some cost in quality.
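A single deterministic DDIM update (η = 0) first estimates x_0 from the predicted noise, then re-noises that estimate to the earlier noise level. A minimal sketch on plain lists; `eps_pred` stands in for the denoiser output and `ddim_step` is an illustrative name.

```python
import math

def ddim_step(x_t, eps_pred, abar_t, abar_prev):
    """One eta = 0 DDIM update from noise level abar_t to abar_prev."""
    # Estimate the clean sample implied by the predicted noise.
    x0_pred = [(x - math.sqrt(1.0 - abar_t) * e) / math.sqrt(abar_t)
               for x, e in zip(x_t, eps_pred)]
    # Move to the earlier timestep along the same noise direction.
    return [math.sqrt(abar_prev) * x0 + math.sqrt(1.0 - abar_prev) * e
            for x0, e in zip(x0_pred, eps_pred)]

x0, eps = [1.0, -2.0], [0.5, 0.3]
x_t = [math.sqrt(0.25) * x + math.sqrt(0.75) * e for x, e in zip(x0, eps)]
print(ddim_step(x_t, eps, abar_t=0.25, abar_prev=0.64))
```

Since abar_prev can be any earlier level, the sampler can skip timesteps, which is where the step reduction comes from.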
49 Classifier guidance? 🔥 hard
Answer: Use gradients from a classifier p(y|x_t) during sampling to steer generation—sharp but needs extra classifier.
50 Classifier-free guidance? 📊 medium
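Answer: Train one network both with and without the condition by randomly dropping it during training; at sample time extrapolate from the unconditional prediction toward the conditional one, ε = ε_uncond + w·(ε_cond − ε_uncond). No separate classifier is needed; widely used in Stable Diffusion.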
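At sampling time the guidance step is a single linear combination of two noise predictions. A minimal sketch, assuming both predictions come from the same network run with and without the condition; `cfg_combine` and the sample values are illustrative.

```python
def cfg_combine(eps_uncond, eps_cond, guidance_scale):
    """eps = eps_uncond + w * (eps_cond - eps_uncond), applied elementwise."""
    return [u + guidance_scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]

eps_uncond = [0.0, 1.0]  # hypothetical prediction with the condition dropped
eps_cond = [1.0, 1.0]    # hypothetical prediction with the condition present
print(cfg_combine(eps_uncond, eps_cond, 0.0))  # unconditional prediction
print(cfg_combine(eps_uncond, eps_cond, 1.0))  # conditional prediction
print(cfg_combine(eps_uncond, eps_cond, 7.5))  # pushed past the conditional prediction
```

Scales above 1 trade diversity for stronger adherence to the condition, at the cost of two network evaluations per step.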
51 Latent diffusion? 🔥 hard
Answer: Run diffusion in VAE latent space (lower res)—much cheaper; decode with VAE decoder (Stable Diffusion).
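The saving is easy to quantify. Assuming the configuration commonly described for Stable Diffusion v1 (a VAE that downsamples by 8 per side into 4 latent channels), the sketch below counts how many fewer values the denoiser processes; the function names are illustrative.

```python
def latent_shape(height, width, downsample=8, latent_channels=4):
    """Shape (C, H, W) of the VAE latent for an image of the given size."""
    return (latent_channels, height // downsample, width // downsample)

def compression_ratio(height, width, image_channels=3,
                      downsample=8, latent_channels=4):
    """How many times fewer values the latent holds than the RGB image."""
    c, h, w = latent_shape(height, width, downsample, latent_channels)
    return (image_channels * height * width) / (c * h * w)

print(latent_shape(512, 512), compression_ratio(512, 512))
```

Every denoising step runs on that smaller tensor, and the VAE decoder is applied only once at the end.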
52 Stable Diffusion pieces? 📊 medium
Answer: CLIP text encoder, U-Net denoiser in latent space, VAE—plus schedulers and safety tooling around the stack.
53 vs GANs? 📊 medium
Answer: Diffusion: stable training, great diversity, slower sampling. GAN: fast one-shot but trickier mode coverage.
54 Video diffusion? 📊 medium
Answer: Add temporal layers or 3D convs, plus attention across frames (sometimes causal) for temporal consistency. Data and compute heavy.
55 Inpainting? ⚡ easy
Answer: Condition on known regions by concatenating mask/channel inputs to U-Net—fill missing areas consistently.
56 Text conditioning? 📊 medium
Answer: Cross-attention layers in which spatial features (queries) attend to text token embeddings (keys and values) produced by a CLIP or T5 encoder.
57 SNR weighting? 🔥 hard
Answer: Different timesteps contribute unequally to loss—reweighting (v-prediction, Min-SNR) improves quality.
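The signal-to-noise ratio at step t is SNR(t) = ᾱ_t / (1 − ᾱ_t). A Min-SNR style weight for an ε-prediction loss clamps it as min(SNR, γ) / SNR, which leaves noisy steps untouched and down-weights nearly clean ones. A minimal sketch, assuming γ = 5 as a typical default; the function names are illustrative.

```python
def snr(abar):
    """Signal-to-noise ratio for a given cumulative signal fraction alpha_bar."""
    return abar / (1.0 - abar)

def min_snr_weight(abar, gamma=5.0):
    """Per-timestep loss weight min(SNR, gamma) / SNR for epsilon-prediction."""
    s = snr(abar)
    return min(s, gamma) / s

for abar in (0.1, 0.5, 0.99):
    print(abar, snr(abar), min_snr_weight(abar))
```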
58 Flow matching? 🔥 hard
Answer: Trains a velocity field whose ODE transports noise to data along a prescribed probability path; closely related to diffusion and competitive on speed and quality in recent work.
59 Compute / data? ⚡ easy
Answer: Large image-text datasets for T2I; training is GPU-heavy; inference is optimized with TensorRT, FlashAttention, and distilled samplers.
60 Evaluation? 📊 medium
Answer: FID, CLIP score for text alignment, human preference studies—no single metric captures all.