Generative Vision Models — Interview Q&A

Question 1

1 What is an autoencoder? ⚡ easy

Answer

Answer: Neural net trained to copy input to output through a bottleneck: encoder maps x→z, decoder maps z→x̂—forces compact representation.

Question 2

2 Role of the encoder? ⚡ easy

Answer

Answer: Maps high-dimensional input (e.g. image) to a lower-dimensional latent code z—extracts salient factors.

Question 3

3 Role of the decoder? ⚡ easy

Answer

Answer: Maps the latent z back to the output space to produce x̂; detail is lost only to the extent that the bottleneck actually limits capacity.

Question 4

4 Why a bottleneck? 📊 medium

Answer

Answer: Constrains information flow so the model must learn a compressed code—similar inputs map to nearby latents if the AE is well regularized.

Question 5

5 Common reconstruction loss? 📊 medium

Answer

Answer: MSE (L2) per pixel for continuous images; BCE if outputs are probabilities; perceptual losses use a pretrained net’s features.
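
The two pixel-wise losses named above can be sketched in a few lines of NumPy. This is a minimal illustration, assuming images scaled to [0, 1]; the function names and the toy arrays are made up for the example.

```python
import numpy as np

def mse_loss(x, x_hat):
    """Mean squared error averaged over all pixels."""
    return float(np.mean((x - x_hat) ** 2))

def bce_loss(x, x_hat, eps=1e-7):
    """Binary cross-entropy; x_hat must hold probabilities in (0, 1)."""
    p = np.clip(x_hat, eps, 1.0 - eps)  # avoid log(0)
    return float(np.mean(-(x * np.log(p) + (1.0 - x) * np.log(1.0 - p))))

x = np.array([[0.0, 1.0], [1.0, 0.0]])      # toy 2x2 "image" in [0, 1]
x_hat = np.array([[0.1, 0.9], [0.8, 0.2]])  # hypothetical decoder output
print(mse_loss(x, x_hat), bce_loss(x, x_hat))
```

A perceptual loss would replace the pixel difference with a difference between feature maps of a pretrained network, which is not shown here.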

Question 6

6 Under-complete vs over-complete? 🔥 hard

Answer

Answer: Under-complete: dim(z) < dim(x), which forces true compression. Over-complete: dim(z) ≥ dim(x), which needs regularization (sparsity, denoising, VAE); otherwise the network can learn a trivial identity map.

Question 7

7 What is a denoising autoencoder? 📊 medium

Answer

Answer: Train on corrupted inputs (noise, masking) to reconstruct clean x—learns robust features instead of copying noise.
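
A minimal sketch of the corruption step, assuming additive Gaussian noise plus random pixel masking. The helper name `corrupt` and its default rates are illustrative choices, not a standard API.

```python
import numpy as np

def corrupt(x, rng, noise_std=0.1, mask_prob=0.25):
    """Return a corrupted copy of x: Gaussian noise plus random pixel masking."""
    noisy = x + rng.normal(0.0, noise_std, size=x.shape)
    keep = rng.random(x.shape) >= mask_prob  # True where the pixel survives
    return noisy * keep

rng = np.random.default_rng(0)
x_clean = rng.random((4, 8, 8))  # toy batch of 4 images
x_in = corrupt(x_clean, rng)     # this is what the encoder sees
# The loss compares the reconstruction of x_in against x_clean, not x_in.
```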

Question 8

8 Sparse autoencoder? 📊 medium

Answer

Answer: Penalize activations (e.g. KL on firing rates) so few units active per example—encourages meaningful distributed codes when over-complete.
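
The KL penalty on firing rates can be sketched as follows, assuming sigmoid hidden units so that activations lie in (0, 1). The target rate `rho` and the function name are assumptions for illustration.

```python
import numpy as np

def sparsity_penalty(activations, rho=0.05, eps=1e-7):
    """Sum over hidden units of KL(rho || rho_hat_j).

    activations: (batch, hidden) array of sigmoid outputs.
    rho_hat_j is the mean activation of unit j over the batch.
    """
    rho_hat = np.clip(activations.mean(axis=0), eps, 1.0 - eps)
    kl = (rho * np.log(rho / rho_hat)
          + (1.0 - rho) * np.log((1.0 - rho) / (1.0 - rho_hat)))
    return float(kl.sum())

rng = np.random.default_rng(0)
hidden = rng.random((32, 10))      # toy activations, mean rate near 0.5
print(sparsity_penalty(hidden))    # positive: units fire far more often than rho
```

In training, this term would be added to the reconstruction loss with a weighting coefficient.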

Question 9

9 VAE vs deterministic AE? 📊 medium

Answer

Answer: VAE encodes a distribution q(z|x); sample z for decoder—adds KL to prior p(z) for a generative model with smooth latent space.

Question 10

10 What does the KL term do? 🔥 hard

Answer

Answer: Pulls approximate posterior toward prior (often N(0,I))—balances reconstruction vs regularization; enables sampling new z ~ p(z).
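
For a diagonal Gaussian posterior and a standard normal prior, the KL term has a closed form. A minimal sketch, assuming the encoder outputs `mu` and `log_var` (names chosen for the example):

```python
import numpy as np

def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ), one value per row."""
    return 0.5 * np.sum(np.exp(log_var) + mu ** 2 - 1.0 - log_var, axis=-1)

mu = np.array([[0.0, 0.0], [1.0, 0.0]])
log_var = np.zeros((2, 2))
print(kl_to_standard_normal(mu, log_var))  # zero for the first row, positive for the second
```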

Question 11

11 Reparameterization trick? 🔥 hard

Answer

Answer: Write z = μ(x) + σ(x)⊙ε with ε~N(0,I), so the randomness sits in ε and gradients flow through μ and σ; this is what makes backprop through the stochastic sampling step possible.
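
The sampling step itself can be sketched in NumPy. This shows only the forward computation; in an autodiff framework the same expression lets gradients reach `mu` and `log_var`. Parameterizing the variance as `log_var` is a common convention assumed here.

```python
import numpy as np

def reparameterize(mu, log_var, rng):
    """Sample z = mu + sigma * eps with eps ~ N(0, I)."""
    eps = rng.standard_normal(mu.shape)   # all randomness lives here
    sigma = np.exp(0.5 * log_var)         # log-variance to standard deviation
    return mu + sigma * eps

rng = np.random.default_rng(0)
mu = np.zeros((4, 2))
log_var = np.zeros((4, 2))
z = reparameterize(mu, log_var, rng)      # (4, 2) latent samples
```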

Question 12

12 Use for anomaly detection? 📊 medium

Answer

Answer: Train on normal data only; a high reconstruction error at test time flags an input as out-of-distribution. Used in defect and fraud pipelines.
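
One simple way to turn reconstruction error into a decision is a quantile threshold fitted on held-out normal data. The sketch below assumes that approach; the "reconstructions" are simulated arrays standing in for a trained autoencoder's output.

```python
import numpy as np

def reconstruction_error(x, x_hat):
    """Per-example mean squared error."""
    return np.mean((x - x_hat) ** 2, axis=tuple(range(1, x.ndim)))

def fit_threshold(normal_errors, quantile=0.99):
    """Threshold below which most normal validation errors fall."""
    return float(np.quantile(normal_errors, quantile))

rng = np.random.default_rng(0)
x_val = rng.random((200, 16))
x_val_hat = x_val + rng.normal(0.0, 0.05, x_val.shape)  # simulated AE output on normal data
threshold = fit_threshold(reconstruction_error(x_val, x_val_hat))

x_test = rng.random((2, 16))
x_test_hat = x_test.copy()
x_test_hat[1] += 0.5  # pretend the AE reconstructs the second input badly
flags = reconstruction_error(x_test, x_test_hat) > threshold
```

The quantile is a tunable trade-off between false alarms and missed anomalies.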

Question 13

13 Link to PCA? 🔥 hard

Answer

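Answer: A linear AE trained with MSE (with or without tied weights) recovers the PCA subspace, that is, the span of the top principal components, though not necessarily the orthonormal components themselves; a deep nonlinear AE generalizes this with stronger representational power.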
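
The PCA end of this link can be checked numerically. The sketch below builds the tied-weight linear encoder/decoder directly from the SVD (rather than training it) and compares its reconstruction error with that of a random orthonormal basis; the data is synthetic.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 5))  # correlated toy data
Xc = X - X.mean(axis=0)                                   # PCA assumes centered data

# Rows of Vt are principal directions, ordered by explained variance.
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
W = Vt[:2].T             # (5, 2): encoder weights; the decoder uses W.T (tied)

Z = Xc @ W               # encode to a 2-D latent
X_hat = Z @ W.T          # decode
pca_error = np.mean((Xc - X_hat) ** 2)

# Any other orthonormal 2-D basis should reconstruct no better.
Q, _ = np.linalg.qr(rng.normal(size=(5, 2)))
other_error = np.mean((Xc - Xc @ Q @ Q.T) ** 2)
print(pca_error, other_error)
```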

Question 14

14 Disentangled representations? 🔥 hard

Answer

Answer: Ideal latents align with generative factors; plain AE does not guarantee this—β-VAE and supervision help.

Question 15

15 AE vs GAN for generation? 📊 medium

Answer

Answer: AE/VAE optimize likelihood-like objectives; GAN uses adversarial realism—GANs often sharper; VAEs more stable latent geometry.

Question 16

16 Convolutional autoencoder? ⚡ easy

Answer

Answer: Encoder stacks conv+pool/downsample; decoder uses upsample/transpose conv—standard for images.

Question 17

17 AE for super-resolution? 📊 medium

Answer

Answer: Condition decoder on low-res input or use skip connections (U-Net style)—AE ideas plus perceptual loss improve texture.

Question 18

18 Embeddings for search? 📊 medium

Answer

Answer: Use encoder output as vector; nearest neighbors in latent space for similar images—may need contrastive training for metric quality.
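
A brute-force version of latent-space search, assuming cosine similarity as the metric. The gallery vectors are random stand-ins for encoder outputs, and `top_k_similar` is a name chosen for the example.

```python
import numpy as np

def top_k_similar(query, gallery, k=3):
    """Indices of the k gallery rows with the highest cosine similarity to query."""
    q = query / np.linalg.norm(query)
    g = gallery / np.linalg.norm(gallery, axis=1, keepdims=True)
    sims = g @ q
    return np.argsort(-sims)[:k]

rng = np.random.default_rng(0)
gallery = rng.normal(size=(100, 16))               # stand-in for encoder outputs
query = gallery[42] + 0.01 * rng.normal(size=16)   # slightly perturbed copy of item 42
print(top_k_similar(query, gallery))
```

For large galleries an approximate nearest-neighbor index would normally replace the full matrix product.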

Question 19

19 Training tips? ⚡ easy

Answer

Answer: Normalize inputs; watch for posterior collapse in VAE; use skip connections if reconstruction is blurry from pure bottleneck.

Question 20

20 Limitations? 📊 medium

Answer

Answer: Reconstructions can be blurry (MSE averages); latent may be entangled; vanilla AE is not a sharp generative model without VAE/GAN hybrids.

Question 21

21 What is a GAN? ⚡ easy

Answer

Answer: Generative model with generator G(z) making samples and discriminator D(x) judging real vs fake—trained adversarially.

Question 22

22 State the min-max game. 🔥 hard

Answer

Answer: min_G max_D V(D,G) = E_x[log D(x)] + E_z[log(1−D(G(z)))]. D maximizes V; G minimizes the second term. With the optimal D, G's objective equals 2·JSD(p_data‖p_G) − log 4, so the classic formulation minimizes the Jensen–Shannon divergence.
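
Written as batch losses on discriminator outputs, the game looks roughly like this. The sketch assumes `d_real` and `d_fake` are probabilities from a sigmoid; the non-saturating generator loss is the common practical substitute for the minimax one, included for comparison.

```python
import numpy as np

def d_loss(d_real, d_fake, eps=1e-7):
    """Discriminator loss: negative of E[log D(x)] + E[log(1 - D(G(z)))]."""
    d_real = np.clip(d_real, eps, 1.0 - eps)
    d_fake = np.clip(d_fake, eps, 1.0 - eps)
    return float(-(np.mean(np.log(d_real)) + np.mean(np.log(1.0 - d_fake))))

def g_loss_minimax(d_fake, eps=1e-7):
    """Generator loss from the min-max game: E[log(1 - D(G(z)))]."""
    return float(np.mean(np.log(1.0 - np.clip(d_fake, eps, 1.0 - eps))))

def g_loss_nonsaturating(d_fake, eps=1e-7):
    """Heuristic alternative: minimize -E[log D(G(z))] for stronger early gradients."""
    return float(-np.mean(np.log(np.clip(d_fake, eps, 1.0 - eps))))

half = np.full(4, 0.5)             # D outputs at the ideal equilibrium
print(d_loss(half, half))          # equals 2 * log(2)
```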

Question 23

23 Role of generator? 📊 medium

Answer

Answer: Maps noise z (latent) to data space—should match real data distribution at optimum.

Question 24

24 Role of discriminator? 📊 medium

Answer

Answer: Binary classifier estimating probability “real”—provides training signal to G via gradient through D.

Question 25

25 Nash equilibrium? 🔥 hard

Answer

Answer: At the ideal optimum, p_G = p_data and D(x) = ½ for every x, real or generated; hard to reach in practice with finite capacity and SGD.

Question 26

26 Why unstable training? 📊 medium

Answer

Answer: Oscillating dynamics instead of convergence; vanishing gradients for G when D is too good; uninformative gradients when D is too weak. Needs balanced updates and architecture tricks.

Question 27

27 What is mode collapse? 📊 medium

Answer

Answer: G outputs few varieties ignoring diversity—D cannot push G to cover all modes; minibatch discrimination and unrolled GANs mitigate.

Question 28

28 DCGAN guidelines? 📊 medium

Answer

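Answer: Strided convolutions instead of pooling, BatchNorm in both nets (except G's output and D's input layers), no fully connected hidden layers, ReLU in G with tanh on the output, LeakyReLU in D. An empirical recipe for stable conv GANs.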

Question 29

29 WGAN / WGAN-GP? 🔥 hard

Answer

Answer: Minimize the Wasserstein-1 distance using a Lipschitz-constrained critic (weight clipping in WGAN, gradient penalty in WGAN-GP); gives a smoother training signal than JS when the distributions have disjoint support.

Question 30

30 LSGAN / hinge? 📊 medium

Answer

Answer: Replace sigmoid BCE with least-squares or hinge losses—often more stable gradients in practice.
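
Both alternatives operate on raw discriminator scores (no sigmoid). A minimal sketch, assuming the usual targets of 1 for real and 0 for fake in LSGAN and a margin of 1 for hinge; function names are illustrative.

```python
import numpy as np

def lsgan_d_loss(s_real, s_fake):
    """Least-squares D loss: push real scores to 1 and fake scores to 0."""
    return float(0.5 * np.mean((s_real - 1.0) ** 2) + 0.5 * np.mean(s_fake ** 2))

def lsgan_g_loss(s_fake):
    """Least-squares G loss: push fake scores to 1."""
    return float(0.5 * np.mean((s_fake - 1.0) ** 2))

def hinge_d_loss(s_real, s_fake):
    """Hinge D loss: penalize real scores below +1 and fake scores above -1."""
    return float(np.mean(np.maximum(0.0, 1.0 - s_real))
                 + np.mean(np.maximum(0.0, 1.0 + s_fake)))

def hinge_g_loss(s_fake):
    """Hinge G loss: raise the critic's score on generated samples."""
    return float(-np.mean(s_fake))

s_real = np.array([2.0, 0.5])
s_fake = np.array([-2.0, 0.0])
print(lsgan_d_loss(s_real, s_fake), hinge_d_loss(s_real, s_fake))
```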

Question 31

31 Conditional GAN? 📊 medium

Answer

Answer: Both G and D conditioned on label, text, or image—enables targeted generation (class-conditional faces, etc.).

Question 32

32 pix2pix? 📊 medium

Answer

Answer: Paired image-to-image with U-Net G and PatchGAN D—L1 + adversarial loss for aligned translation (maps→aerial).

Question 33

33 CycleGAN? 🔥 hard

Answer

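Answer: Unpaired translation between two domains with generators G and F and a cycle-consistency loss L_cycle(G,F) that enforces F(G(x)) ≈ x and G(F(y)) ≈ y; e.g. horse↔zebra without paired data.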

Question 34

34 StyleGAN idea? 📊 medium

Answer

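Answer: A mapping network turns z into an intermediate latent w; learned affine transforms of w modulate each synthesis layer (AdaIN in the original), so early layers control coarse structure and later layers fine detail. Known for high-quality face generation.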

Question 35

35 FID / Inception Score? 📊 medium

Answer

Answer: IS rewards samples that are confidently classifiable and spread across classes (higher is better); FID compares Inception feature statistics (mean and covariance) of generated vs real images (lower is better).
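
The distance at the core of FID is the Fréchet distance between two Gaussians fitted to feature vectors. The sketch below computes it on random stand-in features; a real FID would use Inception-v3 activations of real and generated images, which is not reproduced here.

```python
import numpy as np

def frechet_distance(feats_a, feats_b):
    """||mu_a - mu_b||^2 + Tr(C_a + C_b - 2 (C_a C_b)^(1/2)) for (n, d) feature arrays."""
    mu_a, mu_b = feats_a.mean(axis=0), feats_b.mean(axis=0)
    cov_a = np.cov(feats_a, rowvar=False)
    cov_b = np.cov(feats_b, rowvar=False)
    # Tr((C_a C_b)^(1/2)) equals the sum of square roots of the eigenvalues of C_a C_b.
    eigvals = np.linalg.eigvals(cov_a @ cov_b)
    tr_sqrt = np.sum(np.sqrt(np.clip(eigvals.real, 0.0, None)))
    return float(np.sum((mu_a - mu_b) ** 2)
                 + np.trace(cov_a) + np.trace(cov_b) - 2.0 * tr_sqrt)

rng = np.random.default_rng(0)
real_feats = rng.normal(size=(500, 4))   # stand-in for features of real images
fake_feats = real_feats + 3.0            # same covariance, mean shifted by 3 per dim
print(frechet_distance(real_feats, fake_feats))
```

With identical covariances only the mean term survives, so the shifted set above differs from the original by the squared length of the shift.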

Question 36

36 Signs of bad training? ⚡ easy

Answer

Answer: D loss → 0 instantly, G loss frozen, identical outputs, exploding gradients. Remedies: tune LR, label smoothing, TTUR (separate learning rates for G and D).

Question 37

37 Spectral normalization? 🔥 hard

Answer

Answer: Divide each weight matrix in D by its largest singular value (estimated by power iteration) so every layer is 1-Lipschitz; an alternative to WGAN-GP for a stable critic.
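
The largest singular value is usually estimated by power iteration. A minimal sketch on a single weight matrix; practical implementations typically run one iteration per training step and keep the vector `u` between steps, whereas this version iterates to convergence for clarity.

```python
import numpy as np

def spectral_normalize(W, n_iters=100, seed=0):
    """Return (W / sigma, sigma), where sigma estimates W's largest singular value."""
    rng = np.random.default_rng(seed)
    u = rng.normal(size=W.shape[0])
    for _ in range(n_iters):
        v = W.T @ u
        v /= np.linalg.norm(v) + 1e-12
        u = W @ v
        u /= np.linalg.norm(u) + 1e-12
    sigma = float(u @ W @ v)
    return W / sigma, sigma

W = np.array([[3.0, 0.0], [0.0, 1.0], [0.0, 0.0]])  # toy layer weights
W_sn, sigma = spectral_normalize(W)
print(sigma, np.linalg.norm(W_sn, 2))  # estimated sigma, spectral norm after scaling
```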

Question 38

38 Data needs? ⚡ easy

Answer

Answer: Large diverse sets for photorealism; data augmentation and balancing classes help conditional GANs.

Question 39

39 GAN vs diffusion today? 📊 medium

Answer

Answer: Diffusion often wins on diversity and training stability for images; GANs still valued for fast sampling and some domains.

Question 40

40 Deepfakes concern? ⚡ easy

Answer

Answer: Misinformation and consent—watermarking, detection models, and policy; same tech powers legitimate VFX and data synthesis.

Question 41

41 What is a diffusion model? ⚡ easy

Answer

Answer: Generative model that learns to reverse a gradual noising process—start from Gaussian noise and denoise into a sample.

Question 42

42 Forward process? 📊 medium

Answer

Answer: Fixed Markov chain adding Gaussian noise over T steps until data ≈ pure noise—q(x_t|x_{t−1}) with known variances.
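
A sketch of the forward chain, assuming a linear β schedule with T = 1000 (a common choice, not stated in the answer). It shows both the single step q(x_t|x_{t−1}) and the closed form for x_t given x_0 that follows from composing Gaussian steps.

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)   # assumed linear schedule
alpha_bar = np.cumprod(1.0 - betas)  # cumulative signal retention

def q_step(x_prev, beta, rng):
    """One Markov step: x_t ~ N(sqrt(1 - beta) * x_prev, beta * I)."""
    return np.sqrt(1.0 - beta) * x_prev + np.sqrt(beta) * rng.standard_normal(x_prev.shape)

def q_sample(x0, t, eps):
    """Closed form: x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * eps."""
    return np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps

rng = np.random.default_rng(0)
x = np.ones(10000)                   # toy "data": all ones
for beta in betas:                   # run the full chain
    x = q_step(x, beta, rng)
print(x.mean(), x.std())             # close to 0 and 1: nearly pure noise
```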

Question 43

43 Reverse process? 📊 medium

Answer

Answer: Learn p_θ(x_{t−1}|x_t) approximating true posterior—typically predict noise ε or x_0 with a neural net.

Question 44

44 Training objective (DDPM)? 🔥 hard

Answer

Answer: Simplified ε-prediction MSE: sample t and noise ε, form x_t, and train the network to predict ε. This is a reweighted form of the variational lower bound.
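
One evaluation of the simplified loss can be sketched as below. `predict_noise` is a hypothetical callable standing in for the neural network, and the linear schedule is an assumption.

```python
import numpy as np

betas = np.linspace(1e-4, 0.02, 1000)  # assumed linear schedule
alpha_bar = np.cumprod(1.0 - betas)

def ddpm_loss(predict_noise, x0, alpha_bar, rng):
    """Simplified DDPM objective on one batch: MSE between true and predicted noise."""
    t = rng.integers(0, len(alpha_bar), size=x0.shape[0])  # random timestep per example
    eps = rng.standard_normal(x0.shape)
    a = alpha_bar[t].reshape((-1,) + (1,) * (x0.ndim - 1)) # broadcast over feature dims
    x_t = np.sqrt(a) * x0 + np.sqrt(1.0 - a) * eps         # noised input
    return float(np.mean((eps - predict_noise(x_t, t)) ** 2))

rng = np.random.default_rng(0)
x0 = rng.normal(size=(64, 3))
dummy_model = lambda x_t, t: np.zeros_like(x_t)            # placeholder "network"
print(ddpm_loss(dummy_model, x0, alpha_bar, rng))
```

A model that always predicts zero scores about 1, the variance of ε; a trained network should score lower.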

Question 45

45 Noise schedule β_t? 📊 medium

Answer

Answer: How fast variance grows with t—linear, cosine, etc.; affects training stability and sample quality.
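
The two schedules named above can be generated as follows. The linear endpoints and the cosine offset `s = 0.008` follow commonly used defaults and are assumptions here.

```python
import numpy as np

def linear_betas(T, beta_start=1e-4, beta_end=0.02):
    """Betas that grow linearly with t."""
    return np.linspace(beta_start, beta_end, T)

def cosine_betas(T, s=0.008, max_beta=0.999):
    """Betas derived from a squared-cosine alpha_bar curve."""
    steps = np.arange(T + 1) / T
    f = np.cos((steps + s) / (1.0 + s) * np.pi / 2.0) ** 2
    alpha_bar = f / f[0]
    betas = 1.0 - alpha_bar[1:] / alpha_bar[:-1]
    return np.clip(betas, 0.0, max_beta)  # the final steps would otherwise reach 1

T = 1000
abar_linear = np.cumprod(1.0 - linear_betas(T))
abar_cosine = np.cumprod(1.0 - cosine_betas(T))
print(abar_linear[T // 2], abar_cosine[T // 2])  # signal retained halfway through
```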

Question 46

46 Why a U-Net? 📊 medium

Answer

Answer: Multi-scale spatial denoising with skip connections—preserves detail while aggregating context; time t injected via embeddings.

Question 47

47 Sampling cost? 📊 medium

Answer

Answer: Sampling is sequential in t, so hundreds or thousands of network evaluations make it slow; faster samplers (DDIM) and distillation reduce the step count.

Question 48

48 DDIM? 🔥 hard

Answer

Answer: Non-Markovian sampler that reuses a model trained with the DDPM objective; deterministic when its noise parameter η = 0. Allows far fewer steps with some quality tradeoff.
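
A single deterministic (η = 0) DDIM update can be sketched as below. `eps_hat` stands in for the network's noise prediction; the demo feeds in the true noise, so the step lands exactly on the less-noisy point of the same trajectory.

```python
import numpy as np

def ddim_step(x_t, eps_hat, abar_t, abar_prev):
    """Deterministic DDIM update from noise level abar_t to abar_prev."""
    x0_pred = (x_t - np.sqrt(1.0 - abar_t) * eps_hat) / np.sqrt(abar_t)  # implied clean sample
    return np.sqrt(abar_prev) * x0_pred + np.sqrt(1.0 - abar_prev) * eps_hat

rng = np.random.default_rng(0)
x0 = rng.normal(size=(2, 4))
eps = rng.standard_normal(x0.shape)
abar_t, abar_prev = 0.3, 0.7                       # toy values of alpha_bar
x_t = np.sqrt(abar_t) * x0 + np.sqrt(1.0 - abar_t) * eps
x_prev = ddim_step(x_t, eps, abar_t, abar_prev)    # eps plays the role of a perfect prediction
```

Because `abar_prev` can be any earlier noise level, the sampler can skip many timesteps per update.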

Question 49

49 Classifier guidance? 🔥 hard

Answer

Answer: Use gradients from a classifier p(y|x_t) during sampling to steer generation—sharp but needs extra classifier.

Question 50

50 Classifier-free guidance? 📊 medium

Answer

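Answer: Train one model with the condition randomly dropped, so it learns both conditional and unconditional predictions; at sample time extrapolate from the unconditional prediction toward the conditional one with a guidance scale. No separate classifier; widely used in Stable Diffusion.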
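
The sample-time combination is a one-liner. In this sketch `eps_uncond` and `eps_cond` stand for the two noise predictions of the same network (without and with the condition); the scale convention, where 1 recovers the purely conditional prediction, is one common choice.

```python
import numpy as np

def cfg_combine(eps_uncond, eps_cond, scale):
    """Guided prediction: move from the unconditional output toward the conditional one."""
    return eps_uncond + scale * (eps_cond - eps_uncond)

eps_uncond = np.array([0.0, 1.0])
eps_cond = np.array([1.0, 1.0])
print(cfg_combine(eps_uncond, eps_cond, 7.5))  # scale > 1 overshoots the conditional prediction
```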

Question 51

51 Latent diffusion? 🔥 hard

Answer

Answer: Run diffusion in VAE latent space (lower res)—much cheaper; decode with VAE decoder (Stable Diffusion).

Question 52

52 Stable Diffusion pieces? 📊 medium

Answer

Answer: CLIP text encoder, U-Net denoiser in latent space, VAE—plus schedulers and safety tooling around the stack.

Question 53

53 Diffusion vs GANs? 📊 medium

Answer

Answer: Diffusion: stable training, great diversity, slower sampling. GAN: fast one-shot but trickier mode coverage.

Question 54

54 Video diffusion? 📊 medium

Answer

Answer: Add temporal layers or 3D convs, plus temporal (sometimes causal) attention across frames; data and compute heavy.

Question 55

55 Inpainting? ⚡ easy

Answer

Answer: Condition on known regions by concatenating mask/channel inputs to U-Net—fill missing areas consistently.

Question 56

56 Text conditioning? 📊 medium

Answer

Answer: Cross-attention inside the denoiser, as in transformers: spatial features act as queries, text token embeddings (T5/CLIP) act as keys and values.
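
A single-head version of this cross-attention in NumPy. The shapes and the randomly initialized projection matrices are illustrative; a real denoiser would use learned, multi-head projections.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(img_feats, txt_feats, Wq, Wk, Wv):
    """Image positions query the text tokens; returns (output, attention map)."""
    Q = img_feats @ Wq                               # (n_pixels, d)
    K = txt_feats @ Wk                               # (n_tokens, d)
    V = txt_feats @ Wv                               # (n_tokens, d_out)
    attn = softmax(Q @ K.T / np.sqrt(Q.shape[-1]))   # (n_pixels, n_tokens)
    return attn @ V, attn

rng = np.random.default_rng(0)
img_feats = rng.normal(size=(64, 32))  # 8x8 feature map flattened, 32 channels
txt_feats = rng.normal(size=(7, 48))   # 7 text tokens, 48-d embeddings
Wq = rng.normal(size=(32, 16))
Wk = rng.normal(size=(48, 16))
Wv = rng.normal(size=(48, 32))
out, attn = cross_attention(img_feats, txt_feats, Wq, Wk, Wv)
```

Each row of `attn` says how strongly one spatial position attends to each text token.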

Question 57

57 SNR weighting? 🔥 hard

Answer

Answer: Different timesteps contribute unequally to the loss; reweighting by signal-to-noise ratio (Min-SNR) or changing the prediction target (v-prediction) improves quality.
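
A sketch of Min-SNR-style weights for an ε-prediction loss, assuming SNR(t) = ᾱ_t / (1 − ᾱ_t), a linear β schedule, and a clamp value γ = 5. The weight form min(SNR, γ)/SNR is the one usually quoted for ε-prediction; treat the details as assumptions.

```python
import numpy as np

betas = np.linspace(1e-4, 0.02, 1000)  # assumed linear schedule
alpha_bar = np.cumprod(1.0 - betas)
snr = alpha_bar / (1.0 - alpha_bar)    # signal-to-noise ratio per timestep

def min_snr_weight(snr, gamma=5.0):
    """Per-timestep loss weight: down-weights low-noise (high-SNR) steps."""
    return np.minimum(snr, gamma) / snr

weights = min_snr_weight(snr)
print(weights[0], weights[-1])  # small at t = 0, exactly 1 at the noisiest step
```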

Question 58

58 Flow matching? 🔥 hard

Answer

Answer: Learns a velocity field whose ODE transports noise to data along a prescribed probability path; closely related to diffusion and competes with it on speed and quality in recent work.
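
With a straight-line path between a noise sample and a data sample (one common choice, assumed here), the regression target for the velocity network is simply their difference. The helper below builds one training batch; its name and return layout are made up for the example.

```python
import numpy as np

def fm_training_batch(x_data, rng):
    """Return (x_t, t, v_target) for a linear noise-to-data path."""
    noise = rng.standard_normal(x_data.shape)
    t = rng.random((x_data.shape[0],) + (1,) * (x_data.ndim - 1))  # one t per example
    x_t = (1.0 - t) * noise + t * x_data   # point on the straight line at time t
    v_target = x_data - noise              # constant velocity along that line
    return x_t, t, v_target

rng = np.random.default_rng(0)
x_data = rng.normal(size=(8, 2))
x_t, t, v_target = fm_training_batch(x_data, rng)
# A network v(x_t, t) would be trained with MSE against v_target.
```

Following the target velocity for the remaining time 1 − t leads from x_t back to the data sample, which is what integrating the learned ODE approximates at sampling time.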

Question 59

59 Compute / data? ⚡ easy

Answer

Answer: Large image-text pair datasets for T2I; training is GPU-heavy; inference is optimized with TensorRT, FlashAttention, and distilled samplers.

Question 60

60 Evaluation? 📊 medium

Answer

Answer: FID, CLIP score for text alignment, human preference studies—no single metric captures all.
