Computer Vision Interview
60 Q&A
Chapter 16
Generative Vision Models — Interview Q&A
Autoencoders, GANs, and diffusion models for image generation and reconstruction.
Autoencoders: 20 Essential Q&A
1
What is an autoencoder?
⚡ easy
Answer: Neural net trained to copy input to output through a bottleneck: encoder maps x→z, decoder maps z→x̂—forces compact representation.
2
Role of the encoder?
⚡ easy
Answer: Maps high-dimensional input (e.g. image) to a lower-dimensional latent code z—extracts salient factors.
3
Role of the decoder?
⚡ easy
Answer: Maps latent z back to output space to produce x̂; detail should be lost only to the extent that the bottleneck truly limits capacity.
4
Why a bottleneck?
📊 medium
Answer: Constrains information flow so the model must learn a compressed code—similar inputs map to nearby latents if the AE is well regularized.
5
Common reconstruction loss?
📊 medium
Answer: MSE (L2) per pixel for continuous images; BCE if outputs are probabilities; perceptual losses use a pretrained net’s features.
# loss = F.mse_loss(recon, x) # vanilla AE
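A minimal sketch of the two pixel-wise losses named above, written in plain Python on flattened pixel lists so it runs without a tensor library. The function names and the toy 4-pixel values are illustrative assumptions, not part of any framework.

```python
import math

def mse_loss(recon, target):
    """Mean squared error over flattened pixel values."""
    return sum((r - t) ** 2 for r, t in zip(recon, target)) / len(target)

def bce_loss(recon, target, eps=1e-7):
    """Binary cross-entropy; recon values are probabilities in (0, 1)."""
    total = 0.0
    for r, t in zip(recon, target):
        r = min(max(r, eps), 1.0 - eps)  # clamp to avoid log(0)
        total += -(t * math.log(r) + (1.0 - t) * math.log(1.0 - r))
    return total / len(target)

x = [0.0, 1.0, 1.0, 0.0]        # toy 4-pixel "image" (assumed values)
recon = [0.1, 0.9, 0.8, 0.2]    # hypothetical decoder output
print(mse_loss(recon, x), bce_loss(recon, x))
```

MSE penalizes large errors quadratically, which is one reason averaged (blurry) reconstructions score well under it.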
6
Under-complete vs over-complete?
🔥 hard
Answer: Under-complete: dim(z) < dim(x), so the code is a true compression. Over-complete: dim(z) ≥ dim(x), which needs regularization (sparse, denoising, VAE); otherwise the network can learn the trivial identity map.
7
What is a denoising autoencoder?
📊 medium
Answer: Train on corrupted inputs (noise, masking) to reconstruct clean x—learns robust features instead of copying noise.
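A rough sketch of how a training pair could be built for a denoising autoencoder: the model sees the corrupted input, the loss compares against the clean one. The `corrupt` helper, its defaults, and the toy data are assumptions for illustration.

```python
import random

def corrupt(x, noise_std=0.1, mask_prob=0.25, rng=None):
    """Zero out each value with probability mask_prob, else add Gaussian noise."""
    rng = rng or random.Random(0)
    noisy = []
    for v in x:
        if rng.random() < mask_prob:
            noisy.append(0.0)                          # masking corruption
        else:
            noisy.append(v + rng.gauss(0.0, noise_std))  # additive noise
    return noisy

clean = [0.5] * 8                                # toy flattened image
noisy = corrupt(clean, rng=random.Random(42))
# Training pair: model input = noisy, reconstruction target = clean.
print(noisy)
```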
8
Sparse autoencoder?
📊 medium
Answer: Penalize hidden activations (e.g. L1, or KL between each unit's mean firing rate and a small target) so few units are active per example; this keeps an over-complete code from collapsing to identity and encourages meaningful features.
9
VAE vs deterministic AE?
📊 medium
Answer: VAE encodes a distribution q(z|x); sample z for decoder—adds KL to prior p(z) for a generative model with smooth latent space.
10
What does the KL term do?
🔥 hard
Answer: Pulls approximate posterior toward prior (often N(0,I))—balances reconstruction vs regularization; enables sampling new z ~ p(z).
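For a diagonal Gaussian posterior and a standard normal prior, the KL term has a closed form. The sketch below assumes the encoder outputs `mu` and `logvar` (log of the variance) per latent dimension; the function names and the `beta` weight are illustrative.

```python
import math

def kl_to_standard_normal(mu, logvar):
    """KL( N(mu, diag(exp(logvar))) || N(0, I) ), summed over latent dims."""
    return 0.5 * sum(m * m + math.exp(lv) - 1.0 - lv
                     for m, lv in zip(mu, logvar))

def vae_loss(recon_loss, mu, logvar, beta=1.0):
    """Reconstruction term plus weighted KL; beta > 1 gives a beta-VAE style loss."""
    return recon_loss + beta * kl_to_standard_normal(mu, logvar)

print(kl_to_standard_normal([0.0, 0.0], [0.0, 0.0]))  # posterior equals prior
print(kl_to_standard_normal([1.0], [0.0]))            # shifted mean only
```

The KL is zero only when the posterior matches the prior exactly, which is why a very large weight on it can push the encoder to ignore x.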
11
Reparameterization trick?
🔥 hard
Answer: Write z = μ(x) + σ(x)⊙ε with ε~N(0,I), so the randomness sits in ε and gradients flow through μ and σ; needed to backprop through stochastic sampling.
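A small sketch of the sampling step itself, assuming the encoder outputs `mu` and `logvar`. Plain Python has no autograd, so this only shows the arithmetic; in a real framework the same expression is what lets gradients reach μ and σ.

```python
import math
import random

def reparameterize(mu, logvar, rng):
    """z = mu + sigma * eps, with eps drawn from N(0, 1) per dimension."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, logvar)]

rng = random.Random(0)
mu, logvar = [2.0], [math.log(0.25)]   # assumed encoder outputs: sigma = 0.5
samples = [reparameterize(mu, logvar, rng)[0] for _ in range(20000)]
mean = sum(samples) / len(samples)
std = math.sqrt(sum((s - mean) ** 2 for s in samples) / len(samples))
print(mean, std)  # should be close to 2.0 and 0.5
```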
12
Use for anomaly detection?
📊 medium
Answer: Train on normal data; high reconstruction error on test indicates out-of-distribution—used in defect and fraud pipelines.
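One way to turn reconstruction error into a decision is a threshold fitted on held-out normal data. The sketch below assumes the errors are already computed; the quantile choice and helper names are illustrative, not a standard.

```python
def fit_threshold(normal_errors, quantile=0.95):
    """Pick the error value at the given quantile of normal-data errors."""
    ordered = sorted(normal_errors)
    idx = min(int(quantile * len(ordered)), len(ordered) - 1)
    return ordered[idx]

def is_anomaly(error, threshold):
    """Flag samples whose reconstruction error exceeds the threshold."""
    return error > threshold

normal_errors = [0.01 * i for i in range(1, 101)]  # hypothetical errors 0.01..1.00
thr = fit_threshold(normal_errors)
print(thr, is_anomaly(2.0, thr), is_anomaly(0.5, thr))
```

The quantile sets the expected false-positive rate on normal data, so it is usually tuned per application.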
13
Link to PCA?
🔥 hard
Answer: A linear AE trained with MSE learns the same subspace as the top PCA components (the latent axes need not be the principal directions themselves); a deep nonlinear AE generalizes this with stronger representational power.
14
Disentangled representations?
🔥 hard
Answer: Ideal latents align with generative factors; plain AE does not guarantee this—β-VAE and supervision help.
15
AE vs GAN for generation?
📊 medium
Answer: AE/VAE optimize likelihood-like objectives; GAN uses adversarial realism—GANs often sharper; VAEs more stable latent geometry.
16
Convolutional autoencoder?
⚡ easy
Answer: Encoder stacks conv+pool/downsample; decoder uses upsample/transpose conv—standard for images.
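When pairing strided convolutions with transposed convolutions, it helps to check that spatial sizes mirror each other. The sketch below uses the usual output-size formulas (dilation 1 assumed); the kernel/stride/padding values are just an example configuration.

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Spatial output size of a convolution (dilation = 1)."""
    return (size + 2 * padding - kernel) // stride + 1

def conv_transpose_out(size, kernel, stride=1, padding=0, output_padding=0):
    """Spatial output size of a transposed convolution (dilation = 1)."""
    return (size - 1) * stride - 2 * padding + kernel + output_padding

size = 32
encoder_sizes = [size]
for _ in range(3):                      # three stride-2 downsampling layers
    size = conv_out(size, kernel=4, stride=2, padding=1)
    encoder_sizes.append(size)

decoder_sizes = [size]
for _ in range(3):                      # mirrored stride-2 upsampling layers
    size = conv_transpose_out(size, kernel=4, stride=2, padding=1)
    decoder_sizes.append(size)

print(encoder_sizes, decoder_sizes)
```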
17
AE for super-resolution?
📊 medium
Answer: Condition decoder on low-res input or use skip connections (U-Net style)—AE ideas plus perceptual loss improve texture.
18
Embeddings for search?
📊 medium
Answer: Use encoder output as vector; nearest neighbors in latent space for similar images—may need contrastive training for metric quality.
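A toy sketch of nearest-neighbor search over encoder outputs using cosine similarity. The embeddings and image names are made up; a real system would typically use an approximate nearest-neighbor index rather than a full sort.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query, index, k=2):
    """Return the k image ids whose embeddings are most similar to query."""
    scored = sorted(index.items(),
                    key=lambda kv: cosine_similarity(query, kv[1]),
                    reverse=True)
    return [name for name, _ in scored[:k]]

# Hypothetical 3-D latent codes produced by an encoder.
index = {"cat_1": [0.9, 0.1, 0.0],
         "cat_2": [0.8, 0.2, 0.1],
         "car_1": [0.0, 0.1, 0.9]}
print(top_k([1.0, 0.0, 0.0], index))
```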
19
Training tips?
⚡ easy
Answer: Normalize inputs; watch for posterior collapse in VAE; use skip connections if reconstruction is blurry from pure bottleneck.
20
Limitations?
📊 medium
Answer: Reconstructions can be blurry (MSE averages); latent may be entangled; vanilla AE is not a sharp generative model without VAE/GAN hybrids.
GANs: 20 Essential Q&A
21
What is a GAN?
⚡ easy
Answer: Generative model with generator G(z) making samples and discriminator D(x) judging real vs fake—trained adversarially.
22
State the min-max game.
🔥 hard
Answer: min_G max_D V(D,G) = E_x[log D(x)] + E_z[log(1−D(G(z)))]. D maximizes V; G minimizes the second term. With the optimal D, G's objective equals 2·JSD(p_data‖p_G) − log 4, so the classic formulation minimizes the Jensen–Shannon divergence.
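A sketch of the value function evaluated on batches of discriminator outputs (probabilities of "real"). It also shows the non-saturating generator loss, a common practical variant that maximizes log D(G(z)) instead; function names are illustrative.

```python
import math

def value_fn(d_real, d_fake):
    """Monte Carlo estimate of V(D, G) from D's outputs on real and fake batches."""
    real_term = sum(math.log(p) for p in d_real) / len(d_real)
    fake_term = sum(math.log(1.0 - p) for p in d_fake) / len(d_fake)
    return real_term + fake_term

def d_loss(d_real, d_fake):
    """D maximizes V, so its loss to minimize is -V."""
    return -value_fn(d_real, d_fake)

def g_loss_saturating(d_fake):
    """Original minimax generator loss: minimize log(1 - D(G(z)))."""
    return sum(math.log(1.0 - p) for p in d_fake) / len(d_fake)

def g_loss_non_saturating(d_fake):
    """Practical variant: minimize -log D(G(z)) for stronger early gradients."""
    return -sum(math.log(p) for p in d_fake) / len(d_fake)

# At the ideal equilibrium D outputs 0.5 everywhere, so V = -log 4.
print(value_fn([0.5, 0.5], [0.5, 0.5]), -math.log(4.0))
```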
23
Role of generator?
📊 medium
Answer: Maps noise z (latent) to data space—should match real data distribution at optimum.
# min_G max_D V(D,G) — alternate k steps on D, 1 on G
24
Role of discriminator?
📊 medium
Answer: Binary classifier estimating probability “real”—provides training signal to G via gradient through D.
25
Nash equilibrium?
🔥 hard
Answer: At the ideal optimum, p_G = p_data and D(x) = ½ everywhere (on real and generated samples alike); hard to reach in practice with finite capacity and SGD.
26
Why unstable training?
📊 medium
Answer: The two-player updates can oscillate instead of converging; G's gradients vanish when D is too good, and D gives uninformative feedback when it is too weak. Needs balanced updates and architecture tricks.
27
What is mode collapse?
📊 medium
Answer: G produces only a few kinds of samples and ignores the rest of the data distribution; D's feedback fails to push G to cover all modes. Minibatch discrimination and unrolled GANs mitigate.
28
DCGAN guidelines?
📊 medium
Answer: Strided (and fractionally strided) convolutions instead of pooling, BatchNorm, no fully connected hidden layers, ReLU in G with Tanh on its output, LeakyReLU in D; an empirical recipe for stable conv GANs.
29
WGAN / WGAN-GP?
🔥 hard
Answer: Use Wasserstein distance with Lipschitz critic (weight clip or gradient penalty)—smoother training signal than JS when distributions disjoint.
30
LSGAN / hinge?
📊 medium
Answer: Replace sigmoid BCE with least-squares or hinge losses—often more stable gradients in practice.
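A sketch of the two alternative losses on raw (unbounded) discriminator scores, using the common target convention of 1 for real and 0 for fake in the least-squares case. Function names are illustrative.

```python
def lsgan_d_loss(real_scores, fake_scores):
    """Least-squares D loss: push real scores to 1 and fake scores to 0."""
    real = sum((s - 1.0) ** 2 for s in real_scores) / len(real_scores)
    fake = sum(s ** 2 for s in fake_scores) / len(fake_scores)
    return 0.5 * (real + fake)

def lsgan_g_loss(fake_scores):
    """Least-squares G loss: push D's scores on fakes toward 1."""
    return 0.5 * sum((s - 1.0) ** 2 for s in fake_scores) / len(fake_scores)

def hinge_d_loss(real_scores, fake_scores):
    """Hinge D loss: want real scores >= 1 and fake scores <= -1."""
    real = sum(max(0.0, 1.0 - s) for s in real_scores) / len(real_scores)
    fake = sum(max(0.0, 1.0 + s) for s in fake_scores) / len(fake_scores)
    return real + fake

def hinge_g_loss(fake_scores):
    """Hinge G loss: raise D's score on generated samples."""
    return -sum(fake_scores) / len(fake_scores)

print(hinge_d_loss([2.0], [-2.0]), hinge_d_loss([0.0], [0.0]))
```

Unlike sigmoid BCE, both losses keep a useful gradient for samples that D already classifies confidently or stop penalizing them past a margin.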
31
Conditional GAN?
📊 medium
Answer: Both G and D conditioned on label, text, or image—enables targeted generation (class-conditional faces, etc.).
32
pix2pix?
📊 medium
Answer: Paired image-to-image with U-Net G and PatchGAN D—L1 + adversarial loss for aligned translation (maps→aerial).
33
CycleGAN?
🔥 hard
Answer: Unpaired domains with cycle consistency L_cycle(G,F)—horse↔zebra without paired data.
34
StyleGAN idea?
📊 medium
Answer: A mapping network turns z into an intermediate latent w; learned per-layer affine transforms of w modulate the synthesis network, controlling style from coarse to fine. Known for high-quality face generation.
35
FID / Inception Score?
📊 medium
Answer: IS scores how confidently classifiable and how diverse G's samples are (higher is better); FID compares Inception feature statistics (mean and covariance) of generated vs real images (lower is better).
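The Fréchet distance behind FID is easiest to see in one dimension, where the covariance term reduces to a difference of standard deviations. The sketch below is that 1-D special case only; real FID uses high-dimensional Inception features and a matrix square root, which is not shown here.

```python
import statistics

def frechet_distance_1d(real_feats, fake_feats):
    """1-D Frechet distance between Gaussians fitted to two feature samples."""
    mu_r, mu_f = statistics.mean(real_feats), statistics.mean(fake_feats)
    sd_r, sd_f = statistics.pstdev(real_feats), statistics.pstdev(fake_feats)
    # General form: |mu_r - mu_f|^2 + Tr(C_r + C_f - 2 (C_r C_f)^(1/2)).
    # In 1-D the trace term collapses to (sd_r - sd_f)^2.
    return (mu_r - mu_f) ** 2 + (sd_r - sd_f) ** 2

real = [0.0, 1.0, 2.0, 3.0]      # toy scalar "features"
shifted = [1.0, 2.0, 3.0, 4.0]   # same spread, mean shifted by 1
print(frechet_distance_1d(real, real), frechet_distance_1d(real, shifted))
```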
36
Signs of bad training?
⚡ easy
Answer: D loss → 0 instantly, G loss frozen, identical outputs, exploding gradients—tune LR, label smoothing, TTUR.
37
Spectral normalization?
🔥 hard
Answer: Constrain D’s Lipschitz constant per layer—alternative to WGAN-GP for stable critic.
38
Data needs?
⚡ easy
Answer: Large diverse sets for photorealism; data augmentation and balancing classes help conditional GANs.
39
GAN vs diffusion today?
📊 medium
Answer: Diffusion often wins on diversity and training stability for images; GANs still valued for fast sampling and some domains.
40
Deepfakes concern?
⚡ easy
Answer: Misinformation and consent—watermarking, detection models, and policy; same tech powers legitimate VFX and data synthesis.
Diffusion Models: 20 Essential Q&A
41
What is a diffusion model?
⚡ easy
Answer: Generative model that learns to reverse a gradual noising process—start from Gaussian noise and denoise into a sample.
42
Forward process?
📊 medium
Answer: Fixed Markov chain adding Gaussian noise over T steps until data ≈ pure noise—q(x_t|x_{t−1}) with known variances.
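Because each step adds Gaussian noise, x_t can be sampled directly from x_0 using the cumulative product ᾱ_t of (1 − β_t). The sketch below assumes the linear schedule commonly used with DDPM (β from 1e-4 to 0.02 over 1000 steps); helper names are illustrative.

```python
import math
import random

def linear_betas(T=1000, beta_start=1e-4, beta_end=0.02):
    """Linearly spaced per-step noise variances."""
    return [beta_start + (beta_end - beta_start) * t / (T - 1) for t in range(T)]

def alpha_bars(betas):
    """Cumulative products of (1 - beta_t): how much signal survives to step t."""
    out, prod = [], 1.0
    for b in betas:
        prod *= 1.0 - b
        out.append(prod)
    return out

def q_sample(x0, t, abar, rng):
    """Draw x_t ~ q(x_t | x_0) = N(sqrt(abar_t) x_0, (1 - abar_t) I)."""
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    a, s = math.sqrt(abar[t]), math.sqrt(1.0 - abar[t])
    return [a * x + s * e for x, e in zip(x0, eps)], eps

abar = alpha_bars(linear_betas())
x_t, eps = q_sample([1.0, -1.0, 0.5], t=500, abar=abar, rng=random.Random(0))
print(abar[0], abar[-1], x_t)
```

By the final step almost no signal remains, which is why sampling can start from pure Gaussian noise.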
43
Reverse process?
📊 medium
Answer: Learn p_θ(x_{t−1}|x_t) approximating true posterior—typically predict noise ε or x_0 with a neural net.
44
Training objective (DDPM)?
🔥 hard
Answer: Simplified ε-prediction MSE: sample a timestep t and noise ε, form x_t, and train the network to predict ε; this is a reweighted form of the variational lower bound.
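A sketch of one evaluation of the simplified loss, with the denoiser passed in as a plain function. The `oracle` model is a deliberately unrealistic stand-in that cheats by knowing x_0, used only to show that a perfect noise predictor drives the loss to zero; all names and values are assumptions.

```python
import math
import random

def ddpm_loss(model, x0, t, abar_t, rng):
    """MSE between the true noise and the model's prediction at step t."""
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    a, s = math.sqrt(abar_t), math.sqrt(1.0 - abar_t)
    x_t = [a * x + s * e for x, e in zip(x0, eps)]   # noised input
    pred = model(x_t, t)
    return sum((p - e) ** 2 for p, e in zip(pred, eps)) / len(eps)

x0 = [0.5, -0.25, 1.0]   # toy data point
abar_t = 0.6             # assumed cumulative signal level at step t

def oracle(x_t, t):
    # Cheats by knowing x0: inverts the forward formula to recover eps.
    a, s = math.sqrt(abar_t), math.sqrt(1.0 - abar_t)
    return [(xt - a * x) / s for xt, x in zip(x_t, x0)]

def zero_model(x_t, t):
    # Always predicts zero noise; its loss is the mean squared noise.
    return [0.0] * len(x_t)

oracle_loss = ddpm_loss(oracle, x0, 10, abar_t, random.Random(0))
zero_loss = ddpm_loss(zero_model, x0, 10, abar_t, random.Random(0))
print(oracle_loss, zero_loss)
```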
45
Noise schedule β_t?
📊 medium
Answer: How fast variance grows with t—linear, cosine, etc.; affects training stability and sample quality.
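A sketch of the cosine schedule, which defines ᾱ_t directly and derives β_t from consecutive ratios. The offset `s = 0.008` and the 0.999 clip follow the commonly cited formulation; treat them as assumptions to check against the specific implementation in use.

```python
import math

def cosine_alpha_bar(t, T, s=0.008):
    """Cumulative signal level abar_t under the cosine schedule."""
    def f(u):
        return math.cos((u / T + s) / (1 + s) * math.pi / 2) ** 2
    return f(t) / f(0)

def cosine_betas(T=1000, max_beta=0.999):
    """Per-step variances derived from ratios of consecutive abar values."""
    return [min(1.0 - cosine_alpha_bar(t + 1, T) / cosine_alpha_bar(t, T), max_beta)
            for t in range(T)]

betas = cosine_betas()
print(betas[0], betas[500], betas[-1])
```

Compared with a linear schedule, the cosine form destroys information more gradually at the start and end of the chain.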
# ε-prediction: target noise ε; pred = unet(x_t, t)
46
Why a U-Net?
📊 medium
Answer: Multi-scale spatial denoising with skip connections—preserves detail while aggregating context; time t injected via embeddings.
47
Sampling cost?
📊 medium
Answer: Sampling is sequential over timesteps, one network evaluation per step, so hundreds or thousands of steps are slow; faster samplers (DDIM) and distillation reduce the step count.
48
DDIM?
🔥 hard
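Answer: Non-Markovian formulation with the same training objective as DDPM, so a trained model is reused as-is; sampling is deterministic when η = 0 and can skip timesteps, giving fewer steps with some quality tradeoff.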
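A sketch of one deterministic DDIM update (η = 0): estimate x_0 from the predicted noise, then re-noise to the earlier level with the same noise direction. The demo feeds in the true noise, which a real model only approximates; names and values are illustrative.

```python
import math

def ddim_step(x_t, eps_pred, abar_t, abar_prev):
    """Deterministic DDIM update from noise level abar_t to abar_prev."""
    x0_pred = [(x - math.sqrt(1.0 - abar_t) * e) / math.sqrt(abar_t)
               for x, e in zip(x_t, eps_pred)]
    return [math.sqrt(abar_prev) * x0 + math.sqrt(1.0 - abar_prev) * e
            for x0, e in zip(x0_pred, eps_pred)]

x0 = [1.0, -2.0]
eps = [0.3, -0.7]
abar_t, abar_prev = 0.2, 0.7     # assumed levels; the step may skip many timesteps
x_t = [math.sqrt(abar_t) * x + math.sqrt(1.0 - abar_t) * e for x, e in zip(x0, eps)]

x_prev = ddim_step(x_t, eps, abar_t, abar_prev)
expected = [math.sqrt(abar_prev) * x + math.sqrt(1.0 - abar_prev) * e
            for x, e in zip(x0, eps)]
print(x_prev, expected)
```

Because the update depends only on the two ᾱ values, the sampler can jump across a subsequence of timesteps.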
49
Classifier guidance?
🔥 hard
Answer: Use gradients of a classifier p(y|x_t), trained on noisy inputs, during sampling to steer generation; sharper samples but needs an extra classifier.
50
Classifier-free guidance?
📊 medium
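Answer: Train one model both conditionally and unconditionally by randomly dropping the condition; at sample time combine the two noise predictions, ε = ε_uncond + w·(ε_cond − ε_uncond), where w > 1 extrapolates toward the condition. No separate classifier; widely used in Stable Diffusion.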
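At sample time, guidance is a linear combination of two noise predictions from the same network. The sketch below uses the convention where a scale of 1 gives the plain conditional prediction; some papers parameterize the scale differently, so the exact convention is an assumption.

```python
def cfg_combine(eps_uncond, eps_cond, w):
    """Guided noise prediction: eps_uncond + w * (eps_cond - eps_uncond)."""
    return [u + w * (c - u) for u, c in zip(eps_uncond, eps_cond)]

eps_uncond = [0.0, 1.0]   # hypothetical prediction with the condition dropped
eps_cond = [1.0, 3.0]     # hypothetical prediction with the text condition

print(cfg_combine(eps_uncond, eps_cond, 0.0))   # unconditional
print(cfg_combine(eps_uncond, eps_cond, 1.0))   # conditional
print(cfg_combine(eps_uncond, eps_cond, 2.0))   # extrapolated past conditional
```

Larger scales trade diversity for closer adherence to the condition.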
51
Latent diffusion?
🔥 hard
Answer: Run diffusion in VAE latent space (lower res)—much cheaper; decode with VAE decoder (Stable Diffusion).
52
Stable Diffusion pieces?
📊 medium
Answer: CLIP text encoder, U-Net denoiser in latent space, VAE—plus schedulers and safety tooling around the stack.
53
vs GANs?
📊 medium
Answer: Diffusion: stable training, great diversity, slower sampling. GAN: fast one-shot but trickier mode coverage.
54
Video diffusion?
📊 medium
Answer: Add temporal layers alongside the spatial ones (3D convs, or temporal attention across frames, sometimes causal) so frames stay consistent; data and compute heavy.
55
Inpainting?
⚡ easy
Answer: Condition on known regions by concatenating mask/channel inputs to U-Net—fill missing areas consistently.
56
Text conditioning?
📊 medium
Answer: Cross-attention inside the U-Net: spatial features supply the queries, and text-token embeddings (CLIP or T5) supply the keys and values, as in transformer encoder-decoder attention.
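A stripped-down sketch of the attention arithmetic, with image positions as queries and text tokens as keys and values. It omits the learned Q/K/V projections and multiple heads that a real layer has; the toy vectors are made up.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(queries, keys, values):
    """Scaled dot-product attention: each query mixes the values by key similarity."""
    d = len(keys[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, values))
                    for j in range(len(values[0]))])
    return out

image_feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # 3 spatial positions (queries)
text_keys = [[5.0, 0.0], [0.0, 5.0]]                # 2 text tokens (keys)
text_values = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]    # per-token content (values)
attended = cross_attention(image_feats, text_keys, text_values)
print(attended)
```

The output has one vector per spatial position, so the text can influence each image location differently.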
57
SNR weighting?
🔥 hard
Answer: Different timesteps (noise levels) contribute unequally to the loss; changing the parameterization (v-prediction) or reweighting by signal-to-noise ratio (Min-SNR) can improve quality and convergence.
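A sketch of a Min-SNR style weight for an ε-prediction loss, where SNR(t) = ᾱ_t / (1 − ᾱ_t) and low-noise steps are down-weighted once their SNR exceeds a cap γ. The value γ = 5 is a commonly cited default and is an assumption here.

```python
def snr(abar_t):
    """Signal-to-noise ratio at a step with cumulative signal level abar_t."""
    return abar_t / (1.0 - abar_t)

def min_snr_weight(abar_t, gamma=5.0):
    """Loss weight min(SNR, gamma) / SNR for epsilon-prediction."""
    s = snr(abar_t)
    return min(s, gamma) / s

for abar_t in (0.1, 0.5, 0.99):
    print(abar_t, snr(abar_t), min_snr_weight(abar_t))
```

High-noise steps keep weight 1, while nearly clean steps (very high SNR) are scaled down so they do not dominate training.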
58
Flow matching?
🔥 hard
Answer: Related generative path from noise to data via ODE/flows—competes with diffusion on speed and quality in recent work.
59
Compute / data?
⚡ easy
Answer: Large image-text pair datasets for T2I; training is GPU-heavy; inference is sped up with TensorRT, FlashAttention, and distilled samplers.
60
Evaluation?
📊 medium
Answer: FID, CLIP score for text alignment, human preference studies—no single metric captures all.