VAE vs GAN vs Diffusion
December 28, 2025
The Generative Modeling Problem
We want to learn a distribution $p_\theta(x)$ that approximates the true data distribution $p_{\text{data}}(x)$. Given a dataset $\{x_1, x_2, \ldots, x_n\}$, how do we generate new samples that look like they came from the same distribution?
Three major approaches emerged: Variational Autoencoders (VAE), Generative Adversarial Networks (GAN), and Diffusion Models. Each takes a fundamentally different path to the same destination.
Prerequisites: What You Need to Know
Maximum Likelihood Estimation
We want to find parameters $\theta$ that maximize the likelihood of observing our data:

$$\theta^* = \arg\max_\theta \prod_{i=1}^n p_\theta(x_i)$$

In practice, we maximize the log-likelihood, which turns the product into a sum and is numerically easier to work with:

$$\theta^* = \arg\max_\theta \sum_{i=1}^n \log p_\theta(x_i)$$
KL Divergence
The KL divergence measures how much one probability distribution differs from another:

$$D_{\text{KL}}(p \,\|\, q) = \mathbb{E}_{x \sim p}\!\left[\log \frac{p(x)}{q(x)}\right]$$
Key property: $D_{\text{KL}}(p \| q) \geq 0$, with equality if and only if $p = q$.
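A special case worth keeping in mind for the VAE section: the KL divergence between a diagonal Gaussian $\mathcal{N}(\mu, \sigma^2 I)$ and the standard normal $\mathcal{N}(0, I)$ has a simple closed form:

$$D_{\text{KL}}\big(\mathcal{N}(\mu, \sigma^2 I) \,\|\, \mathcal{N}(0, I)\big) = \frac{1}{2} \sum_{j} \left(\mu_j^2 + \sigma_j^2 - \log \sigma_j^2 - 1\right)$$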
The Reparameterization Trick
Instead of sampling $z \sim \mathcal{N}(\mu, \sigma^2)$ directly, we sample $\epsilon \sim \mathcal{N}(0, 1)$ and compute:

$$z = \mu + \sigma \epsilon$$
This makes the sampling operation differentiable with respect to $\mu$ and $\sigma$.
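A minimal sketch of the trick in PyTorch (assuming the encoder outputs `mu` and `log_var`; parameterizing the variance through its log is a common convention, not required by the trick itself):

```python
import torch

def reparameterize(mu: torch.Tensor, log_var: torch.Tensor) -> torch.Tensor:
    """Draw z ~ N(mu, sigma^2) as z = mu + sigma * eps with eps ~ N(0, I).

    The randomness lives entirely in eps, so gradients flow through mu and log_var.
    """
    std = torch.exp(0.5 * log_var)   # sigma = exp(log_var / 2)
    eps = torch.randn_like(std)      # eps ~ N(0, I)
    return mu + std * eps
```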
Variational Autoencoders (VAE)
The Core Idea
VAEs assume data is generated from a latent variable $z$. We want to learn both:
- Prior: $p(z)$ (usually $\mathcal{N}(0, I)$)
- Likelihood: $p_\theta(x|z)$ (decoder network)
The challenge is that $p_\theta(x) = \int p_\theta(x|z) p(z) dz$ is intractable. So we introduce an approximate posterior $q_\phi(z|x)$ (encoder network).
The Evidence Lower Bound (ELBO)
We can't maximize $\log p_\theta(x)$ directly, but we can maximize a lower bound, the ELBO:

$$\log p_\theta(x) \;\geq\; \mathbb{E}_{q_\phi(z|x)} [\log p_\theta(x|z)] - D_{\text{KL}}(q_\phi(z|x) \,\|\, p(z))$$
This decomposes beautifully:
- Reconstruction term: $\mathbb{E}_{q_\phi(z|x)} [\log p_\theta(x|z)]$ — how well can we reconstruct $x$ from $z$?
- Regularization term: $D_{\text{KL}}(q_\phi(z|x) \| p(z))$ — how close is our posterior to the prior?
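Where does the bound come from? Write the marginal likelihood as an expectation over $q_\phi$ and apply Jensen's inequality:

$$\log p_\theta(x) = \log \mathbb{E}_{q_\phi(z|x)}\!\left[\frac{p_\theta(x|z)\, p(z)}{q_\phi(z|x)}\right] \;\geq\; \mathbb{E}_{q_\phi(z|x)}\!\left[\log \frac{p_\theta(x|z)\, p(z)}{q_\phi(z|x)}\right] = \mathbb{E}_{q_\phi(z|x)}[\log p_\theta(x|z)] - D_{\text{KL}}(q_\phi(z|x) \,\|\, p(z))$$

The gap between the two sides is exactly $D_{\text{KL}}(q_\phi(z|x) \,\|\, p_\theta(z|x))$, so the bound is tight when the approximate posterior matches the true one.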
Training Procedure
For each data point $x$:
- Encode: $q_\phi(z|x) = \mathcal{N}(\mu_\phi(x), \sigma_\phi^2(x))$
- Sample: $z = \mu_\phi(x) + \sigma_\phi(x) \epsilon$ where $\epsilon \sim \mathcal{N}(0, I)$
- Decode: $p_\theta(x|z) = \mathcal{N}(\mu_\theta(z), \sigma_\theta^2(z))$
- Compute loss: $\mathcal{L} = -\mathbb{E}_{q_\phi(z|x)} [\log p_\theta(x|z)] + D_{\text{KL}}(q_\phi(z|x) \| p(z))$
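A minimal PyTorch sketch of this loss, assuming a Gaussian encoder and a Bernoulli (binary cross-entropy) decoder for image-like data; `encoder` and `decoder` are illustrative stand-ins for whatever networks you use:

```python
import torch
import torch.nn.functional as F

def vae_loss(x, encoder, decoder):
    """Negative ELBO for one batch.

    encoder(x) -> (mu, log_var) of q_phi(z|x); decoder(z) -> logits of p_theta(x|z).
    """
    mu, log_var = encoder(x)

    # Reparameterized sample: z = mu + sigma * eps
    std = torch.exp(0.5 * log_var)
    z = mu + std * torch.randn_like(std)

    # Reconstruction term: -E_q[log p_theta(x|z)] under a Bernoulli likelihood
    recon = F.binary_cross_entropy_with_logits(decoder(z), x, reduction="sum")

    # KL(q_phi(z|x) || N(0, I)) in closed form
    kl = 0.5 * torch.sum(mu.pow(2) + log_var.exp() - log_var - 1)

    return (recon + kl) / x.shape[0]
```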
Strengths and Weaknesses
Strengths: Principled probabilistic framework, stable training, meaningful latent space.
Weaknesses: Blurry samples (a consequence of the pixel-wise reconstruction term and the simple Gaussian likelihood), restrictive assumptions about the approximate posterior.
Generative Adversarial Networks (GAN)
The Core Idea
Forget explicit density modeling. Instead, train two networks that compete:
- Generator $G_\theta(z)$: transforms noise $z \sim p(z)$ into fake data
- Discriminator $D_\phi(x)$: distinguishes real from fake data
The Minimax Game
The discriminator tries to maximize:

$$V(D_\phi, G_\theta) = \mathbb{E}_{x \sim p_{\text{data}}} [\log D_\phi(x)] + \mathbb{E}_{z \sim p(z)} [\log(1 - D_\phi(G_\theta(z)))]$$

The generator tries to minimize the same objective, giving the minimax game

$$\min_{G_\theta} \max_{D_\phi} V(D_\phi, G_\theta)$$

In practice the generator usually maximizes $\mathbb{E}_{z \sim p(z)} [\log D_\phi(G_\theta(z))]$ instead (the non-saturating loss), which provides stronger gradients early in training when the discriminator easily rejects fakes.
What Does the Discriminator Learn?
For a fixed generator, the optimal discriminator outputs the probability that a sample is real:

$$D^*(x) = \frac{p_{\text{data}}(x)}{p_{\text{data}}(x) + p_g(x)}$$

where $p_g$ denotes the distribution of generated samples $G_\theta(z)$ with $z \sim p(z)$.
Connection to Divergence Minimization
With the optimal discriminator plugged in, the GAN objective is equivalent to minimizing the Jensen-Shannon divergence between the data and generator distributions:

$$\min_{G} \max_{D} V(D, G) = -\log 4 + 2\, D_{\text{JS}}(p_{\text{data}} \,\|\, p_g)$$
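To see the intermediate step, substitute $D^*$ into $V(D, G)$ and regroup the two expectations as log-ratios against the mixture $\tfrac{1}{2}(p_{\text{data}} + p_g)$:

$$V(D^*, G) = \mathbb{E}_{x \sim p_{\text{data}}}\!\left[\log \frac{p_{\text{data}}(x)}{p_{\text{data}}(x) + p_g(x)}\right] + \mathbb{E}_{x \sim p_g}\!\left[\log \frac{p_g(x)}{p_{\text{data}}(x) + p_g(x)}\right]$$

which, by the definition of the Jensen-Shannon divergence, equals $-\log 4 + 2\, D_{\text{JS}}(p_{\text{data}} \,\|\, p_g)$ and is minimized exactly when $p_g = p_{\text{data}}$.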
Training Procedure
Alternate between:
- Update discriminator: Sample real data $x$ and noise $z$, maximize discriminator accuracy
- Update generator: Sample noise $z$, maximize discriminator confusion
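A minimal PyTorch sketch of one alternating update, using the non-saturating generator loss mentioned earlier. The networks `G` and `D`, the optimizers, and `latent_dim` are assumptions of this sketch; `D` is assumed to output a single logit per sample:

```python
import torch
import torch.nn.functional as F

def gan_step(x_real, G, D, opt_g, opt_d, latent_dim=128):
    """One discriminator update followed by one generator update."""
    batch = x_real.shape[0]
    ones, zeros = torch.ones(batch, 1), torch.zeros(batch, 1)

    # Discriminator update: push D(x_real) -> 1 and D(G(z)) -> 0
    z = torch.randn(batch, latent_dim)
    x_fake = G(z).detach()                     # do not backprop into G here
    d_loss = (F.binary_cross_entropy_with_logits(D(x_real), ones)
              + F.binary_cross_entropy_with_logits(D(x_fake), zeros))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # Generator update (non-saturating): push D(G(z)) -> 1
    z = torch.randn(batch, latent_dim)
    g_loss = F.binary_cross_entropy_with_logits(D(G(z)), ones)
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()

    return d_loss.item(), g_loss.item()
```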
Strengths and Weaknesses
Strengths: Sharp, high-quality samples; no restrictive parametric assumptions about the data distribution.
Weaknesses: Training instability, mode collapse, no explicit density estimation.
Diffusion Models
The Core Idea
Learn to reverse a gradual noising process. If we know how to denoise, we can start from pure noise and gradually recover data.
The Forward Process
Gradually add Gaussian noise over $T$ steps:

$$q(x_t \mid x_{t-1}) = \mathcal{N}\big(x_t;\; \sqrt{1 - \beta_t}\, x_{t-1},\; \beta_t I\big)$$

With a clever reparameterization, we can jump to any timestep directly:

$$q(x_t \mid x_0) = \mathcal{N}\big(x_t;\; \sqrt{\bar{\alpha}_t}\, x_0,\; (1 - \bar{\alpha}_t) I\big), \qquad \text{i.e.}\quad x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon$$

where $\alpha_t = 1 - \beta_t$, $\bar{\alpha}_t = \prod_{s=1}^t \alpha_s$, and $\epsilon \sim \mathcal{N}(0, I)$.
The Reverse Process
Learn to reverse the noising process with a parameterized Gaussian at each step:

$$p_\theta(x_{t-1} \mid x_t) = \mathcal{N}\big(x_{t-1};\; \mu_\theta(x_t, t),\; \Sigma_\theta(x_t, t)\big)$$
What Do We Actually Learn?
Instead of predicting $\mu_\theta$ directly, we train a network $\epsilon_\theta(x_t, t)$ to predict the noise that was added:

$$\mathcal{L}_{\text{simple}} = \mathbb{E}_{t,\, x_0,\, \epsilon}\Big[\big\| \epsilon - \epsilon_\theta\big(\sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon,\; t\big) \big\|^2\Big]$$

This is remarkably simple: sample a random timestep, noise the data with the closed-form jump above, and train the network to predict that noise.
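A minimal PyTorch sketch of this objective, assuming a linear $\beta_t$ schedule and a noise-prediction network `model(x_t, t)`; the schedule values and names are illustrative choices, not the only option:

```python
import torch
import torch.nn.functional as F

T = 1000
betas = torch.linspace(1e-4, 0.02, T)         # beta_t schedule
alphas = 1.0 - betas                          # alpha_t = 1 - beta_t
alpha_bars = torch.cumprod(alphas, dim=0)     # bar(alpha)_t = prod_s alpha_s

def diffusion_loss(model, x0):
    """L_simple: predict the noise added at a uniformly random timestep."""
    batch = x0.shape[0]
    t = torch.randint(0, T, (batch,))
    eps = torch.randn_like(x0)

    # Closed-form jump: x_t = sqrt(abar_t) * x0 + sqrt(1 - abar_t) * eps
    abar = alpha_bars[t].view(batch, *([1] * (x0.dim() - 1)))
    x_t = abar.sqrt() * x0 + (1 - abar).sqrt() * eps

    return F.mse_loss(model(x_t, t), eps)
```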
Sampling Procedure
- Start with pure noise: $x_T \sim \mathcal{N}(0, I)$
- For $t = T, T-1, \ldots, 1$:
  - Predict the noise: $\epsilon_\theta(x_t, t)$
  - Compute $x_{t-1}$ from $x_t$ and the predicted noise using the reverse-process mean, adding fresh Gaussian noise when $t > 1$
- Return $x_0$
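The same loop as a PyTorch sketch, reusing the DDPM-style schedule from the training snippet above (`model`, the schedule, and the $\sigma_t^2 = \beta_t$ variance choice are all assumptions of this sketch):

```python
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)

@torch.no_grad()
def sample(model, shape):
    """Ancestral sampling: start from pure noise and denoise step by step."""
    x = torch.randn(shape)                                   # x_T ~ N(0, I)
    for t in reversed(range(T)):
        eps_hat = model(x, torch.full((shape[0],), t))       # predicted noise
        # Reverse-process mean: (x_t - beta_t / sqrt(1 - abar_t) * eps_hat) / sqrt(alpha_t)
        mean = (x - betas[t] / (1 - alpha_bars[t]).sqrt() * eps_hat) / alphas[t].sqrt()
        if t > 0:
            x = mean + betas[t].sqrt() * torch.randn_like(x) # add noise with sigma_t^2 = beta_t
        else:
            x = mean                                         # last step is deterministic
    return x
```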
Connection to Score Matching
Diffusion models are implicitly learning the score function (the gradient of the log density) at each noise level:

$$\nabla_{x_t} \log p(x_t) \approx -\frac{\epsilon_\theta(x_t, t)}{\sqrt{1 - \bar{\alpha}_t}}$$

so denoising and score matching are two views of the same training objective.
Strengths and Weaknesses
Strengths: Stable training, high-quality samples, principled framework, exact likelihood computation possible.
Weaknesses: Slow sampling (requires many steps), high computational cost.
Connecting the Three
VAE and Diffusion
Diffusion models can be viewed as hierarchical VAEs with:
- Latent variables at each timestep: $z_1, z_2, \ldots, z_T$
- Fixed encoder (the noising process)
- Learned decoder (the denoising process)
The ELBO in diffusion models has a similar structure to VAE, but summed over timesteps.
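Concretely, the diffusion ELBO (as written in the DDPM paper) splits into a prior term, a per-timestep KL term, and a reconstruction term, each with a direct VAE analogue:

$$\mathbb{E}_q\!\left[ D_{\text{KL}}\big(q(x_T \mid x_0) \,\|\, p(x_T)\big) + \sum_{t > 1} D_{\text{KL}}\big(q(x_{t-1} \mid x_t, x_0) \,\|\, p_\theta(x_{t-1} \mid x_t)\big) - \log p_\theta(x_0 \mid x_1) \right]$$

Here the "encoder" posteriors $q(x_{t-1} \mid x_t, x_0)$ are fixed Gaussians determined by the noise schedule, and only the "decoder" steps $p_\theta(x_{t-1} \mid x_t)$ are learned.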
GAN and Diffusion
Recent work shows we can train diffusion models adversarially:
- Generator: denoising network
- Discriminator: distinguishes real from denoised samples
This combines GAN's sharp samples with diffusion's stable training.
The Unified View
| Aspect | VAE | GAN | Diffusion |
|---|---|---|---|
| Objective | Maximize ELBO | Minimax game | Denoise at each step |
| Training | Stable | Can be unstable | Stable |
| Sample quality | Blurry | Sharp | Sharp |
| Likelihood | Lower bound | No explicit | Exact (with effort) |
| Latent space | Meaningful | Less structure | Hierarchical |
| Sampling speed | Fast | Fast | Slow |