VAE vs GAN vs Diffusion
December 28, 2025
The Generative Modeling Problem
We want to learn a distribution $p_\theta(x)$ that approximates the true data distribution $p_{\text{data}}(x)$. Given a dataset $\{x_1, x_2, \ldots, x_n\}$, how do we generate new samples that look like they came from the same distribution?
Three major approaches emerged: Variational Autoencoders (VAE), Generative Adversarial Networks (GAN), and Diffusion Models. Each takes a fundamentally different path to the same destination.
Prerequisites: What You Need to Know
Maximum Likelihood Estimation
We want to find parameters $\theta$ that maximize the likelihood of observing our data:

$$\theta^* = \arg\max_\theta \prod_{i=1}^n p_\theta(x_i)$$

In practice, we maximize the log-likelihood, which turns the product into a sum and is numerically easier to work with:

$$\theta^* = \arg\max_\theta \sum_{i=1}^n \log p_\theta(x_i)$$
KL Divergence
The KL divergence measures how much one probability distribution differs from another:

$$D_{\text{KL}}(p \,\|\, q) = \mathbb{E}_{x \sim p}\!\left[\log \frac{p(x)}{q(x)}\right]$$
Key property: $D_{\text{KL}}(p \| q) \geq 0$, with equality if and only if $p = q$.
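A special case worth keeping in mind for the VAE section: the KL divergence between a diagonal Gaussian $\mathcal{N}(\mu, \sigma^2 I)$ and the standard normal $\mathcal{N}(0, I)$ has a simple closed form:

$$D_{\text{KL}}\big(\mathcal{N}(\mu, \sigma^2 I) \,\|\, \mathcal{N}(0, I)\big) = \frac{1}{2} \sum_{j} \left(\mu_j^2 + \sigma_j^2 - \log \sigma_j^2 - 1\right)$$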
The Reparameterization Trick
Instead of sampling $z \sim \mathcal{N}(\mu, \sigma^2)$ directly, we sample $\epsilon \sim \mathcal{N}(0, 1)$ and compute:

$$z = \mu + \sigma \epsilon$$
This makes the sampling operation differentiable with respect to $\mu$ and $\sigma$.
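A minimal sketch of the trick in PyTorch (assuming the encoder outputs `mu` and `log_var`; parameterizing the variance through its log is a common convention, not required by the trick itself):

```python
import torch

def reparameterize(mu: torch.Tensor, log_var: torch.Tensor) -> torch.Tensor:
    """Draw z ~ N(mu, sigma^2) as z = mu + sigma * eps with eps ~ N(0, I).

    The randomness lives entirely in eps, so gradients flow through mu and log_var.
    """
    std = torch.exp(0.5 * log_var)   # sigma = exp(log_var / 2)
    eps = torch.randn_like(std)      # eps ~ N(0, I)
    return mu + std * eps
```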
Variational Autoencoders (VAE)
The Core Idea
VAEs assume data is generated from a latent variable $z$. We want to learn both:
- Prior: $p(z)$ (usually $\mathcal{N}(0, I)$)
- Likelihood: $p_\theta(x|z)$ (decoder network)
The challenge is that $p_\theta(x) = \int p_\theta(x|z) p(z) dz$ is intractable. So we introduce an approximate posterior $q_\phi(z|x)$ (encoder network).
The Evidence Lower Bound (ELBO)
We can't maximize $\log p_\theta(x)$ directly, but we can maximize a lower bound, the ELBO:

$$\log p_\theta(x) \;\geq\; \mathbb{E}_{q_\phi(z|x)} [\log p_\theta(x|z)] - D_{\text{KL}}(q_\phi(z|x) \,\|\, p(z))$$
This decomposes beautifully:
- Reconstruction term: $\mathbb{E}_{q_\phi(z|x)} [\log p_\theta(x|z)]$ — how well can we reconstruct $x$ from $z$?
- Regularization term: $D_{\text{KL}}(q_\phi(z|x) \| p(z))$ — how close is our posterior to the prior?
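Where does the bound come from? Write the marginal likelihood as an expectation over $q_\phi$ and apply Jensen's inequality:

$$\log p_\theta(x) = \log \mathbb{E}_{q_\phi(z|x)}\!\left[\frac{p_\theta(x|z)\, p(z)}{q_\phi(z|x)}\right] \;\geq\; \mathbb{E}_{q_\phi(z|x)}\!\left[\log \frac{p_\theta(x|z)\, p(z)}{q_\phi(z|x)}\right] = \mathbb{E}_{q_\phi(z|x)}[\log p_\theta(x|z)] - D_{\text{KL}}(q_\phi(z|x) \,\|\, p(z))$$

The gap between the two sides is exactly $D_{\text{KL}}(q_\phi(z|x) \,\|\, p_\theta(z|x))$, so the bound is tight when the approximate posterior matches the true one.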
Training Procedure
For each data point $x$:
- Encode: $q_\phi(z|x) = \mathcal{N}(\mu_\phi(x), \sigma_\phi^2(x))$
- Sample: $z = \mu_\phi(x) + \sigma_\phi(x) \epsilon$ where $\epsilon \sim \mathcal{N}(0, I)$
- Decode: $p_\theta(x|z) = \mathcal{N}(\mu_\theta(z), \sigma_\theta^2(z))$
- Compute loss: $\mathcal{L} = -\mathbb{E}_{q_\phi(z|x)} [\log p_\theta(x|z)] + D_{\text{KL}}(q_\phi(z|x) \| p(z))$
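A minimal PyTorch sketch of this loss, assuming a Gaussian encoder and a Bernoulli (binary cross-entropy) decoder for image-like data; `encoder` and `decoder` are illustrative stand-ins for whatever networks you use:

```python
import torch
import torch.nn.functional as F

def vae_loss(x, encoder, decoder):
    """Negative ELBO for one batch.

    encoder(x) -> (mu, log_var) of q_phi(z|x); decoder(z) -> logits of p_theta(x|z).
    """
    mu, log_var = encoder(x)

    # Reparameterized sample: z = mu + sigma * eps
    std = torch.exp(0.5 * log_var)
    z = mu + std * torch.randn_like(std)

    # Reconstruction term: -E_q[log p_theta(x|z)] under a Bernoulli likelihood
    recon = F.binary_cross_entropy_with_logits(decoder(z), x, reduction="sum")

    # KL(q_phi(z|x) || N(0, I)) in closed form
    kl = 0.5 * torch.sum(mu.pow(2) + log_var.exp() - log_var - 1)

    return (recon + kl) / x.shape[0]
```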
Strengths and Weaknesses
Strengths: Principled probabilistic framework, stable training, meaningful latent space.
Weaknesses: Blurry samples (a consequence of the pixel-wise reconstruction term and the simple Gaussian likelihood), restrictive assumptions about the approximate posterior.
Generative Adversarial Networks (GAN)
The Core Idea
Forget explicit density modeling. Instead, train two networks that compete:
- Generator $G_\theta(z)$: transforms noise $z \sim p(z)$ into fake data
- Discriminator $D_\phi(x)$: distinguishes real from fake data
The Minimax Game
The discriminator tries to maximize:

$$V(D_\phi, G_\theta) = \mathbb{E}_{x \sim p_{\text{data}}} [\log D_\phi(x)] + \mathbb{E}_{z \sim p(z)} [\log(1 - D_\phi(G_\theta(z)))]$$

The generator tries to minimize the same objective, giving the minimax game

$$\min_{G_\theta} \max_{D_\phi} V(D_\phi, G_\theta)$$

In practice the generator usually maximizes $\mathbb{E}_{z \sim p(z)} [\log D_\phi(G_\theta(z))]$ instead (the non-saturating loss), which provides stronger gradients early in training when the discriminator easily rejects fakes.
What Does the Discriminator Learn?
For a fixed generator, the optimal discriminator outputs the probability that a sample is real:

$$D^*(x) = \frac{p_{\text{data}}(x)}{p_{\text{data}}(x) + p_g(x)}$$

where $p_g$ denotes the distribution of generated samples $G_\theta(z)$ with $z \sim p(z)$.
Connection to Divergence Minimization
With the optimal discriminator plugged in, the GAN objective is equivalent to minimizing the Jensen-Shannon divergence between the data and generator distributions:

$$\min_{G} \max_{D} V(D, G) = -\log 4 + 2\, D_{\text{JS}}(p_{\text{data}} \,\|\, p_g)$$
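To see the intermediate step, substitute $D^*$ into $V(D, G)$ and regroup the two expectations as log-ratios against the mixture $\tfrac{1}{2}(p_{\text{data}} + p_g)$:

$$V(D^*, G) = \mathbb{E}_{x \sim p_{\text{data}}}\!\left[\log \frac{p_{\text{data}}(x)}{p_{\text{data}}(x) + p_g(x)}\right] + \mathbb{E}_{x \sim p_g}\!\left[\log \frac{p_g(x)}{p_{\text{data}}(x) + p_g(x)}\right]$$

which, by the definition of the Jensen-Shannon divergence, equals $-\log 4 + 2\, D_{\text{JS}}(p_{\text{data}} \,\|\, p_g)$ and is minimized exactly when $p_g = p_{\text{data}}$.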
Training Procedure
Alternate between:
- Update discriminator: Sample real data $x$ and noise $z$, maximize discriminator accuracy
- Update generator: Sample noise $z$, maximize discriminator confusion
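A minimal PyTorch sketch of one alternating update, using the non-saturating generator loss mentioned earlier. The networks `G` and `D`, the optimizers, and `latent_dim` are assumptions of this sketch; `D` is assumed to output a single logit per sample:

```python
import torch
import torch.nn.functional as F

def gan_step(x_real, G, D, opt_g, opt_d, latent_dim=128):
    """One discriminator update followed by one generator update."""
    batch = x_real.shape[0]
    ones, zeros = torch.ones(batch, 1), torch.zeros(batch, 1)

    # Discriminator update: push D(x_real) -> 1 and D(G(z)) -> 0
    z = torch.randn(batch, latent_dim)
    x_fake = G(z).detach()                     # do not backprop into G here
    d_loss = (F.binary_cross_entropy_with_logits(D(x_real), ones)
              + F.binary_cross_entropy_with_logits(D(x_fake), zeros))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # Generator update (non-saturating): push D(G(z)) -> 1
    z = torch.randn(batch, latent_dim)
    g_loss = F.binary_cross_entropy_with_logits(D(G(z)), ones)
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()

    return d_loss.item(), g_loss.item()
```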
Strengths and Weaknesses
Strengths: Sharp, high-quality samples; no restrictive parametric assumptions about the data distribution.
Weaknesses: Training instability, mode collapse, no explicit density estimation.
Diffusion Models
The Core Idea
Learn to reverse a gradual noising process. If we know how to denoise, we can start from pure noise and gradually recover data.
The Forward Process
Gradually add Gaussian noise over $T$ steps:

$$q(x_t \mid x_{t-1}) = \mathcal{N}\big(x_t;\; \sqrt{1 - \beta_t}\, x_{t-1},\; \beta_t I\big)$$

With a clever reparameterization, we can jump to any timestep directly:

$$q(x_t \mid x_0) = \mathcal{N}\big(x_t;\; \sqrt{\bar{\alpha}_t}\, x_0,\; (1 - \bar{\alpha}_t) I\big), \qquad \text{i.e.}\quad x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon$$

where $\alpha_t = 1 - \beta_t$, $\bar{\alpha}_t = \prod_{s=1}^t \alpha_s$, and $\epsilon \sim \mathcal{N}(0, I)$.
The Reverse Process
Learn to reverse the noising process with a parameterized Gaussian at each step:

$$p_\theta(x_{t-1} \mid x_t) = \mathcal{N}\big(x_{t-1};\; \mu_\theta(x_t, t),\; \Sigma_\theta(x_t, t)\big)$$
What Do We Actually Learn?
Instead of predicting $\mu_\theta$ directly, we train a network $\epsilon_\theta(x_t, t)$ to predict the noise that was added:

$$\mathcal{L}_{\text{simple}} = \mathbb{E}_{t,\, x_0,\, \epsilon}\Big[\big\| \epsilon - \epsilon_\theta\big(\sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon,\; t\big) \big\|^2\Big]$$

This is remarkably simple: sample a random timestep, noise the data with the closed-form jump above, and train the network to predict that noise.
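A minimal PyTorch sketch of this objective, assuming a linear $\beta_t$ schedule and a noise-prediction network `model(x_t, t)`; the schedule values and names are illustrative choices, not the only option:

```python
import torch
import torch.nn.functional as F

T = 1000
betas = torch.linspace(1e-4, 0.02, T)         # beta_t schedule
alphas = 1.0 - betas                          # alpha_t = 1 - beta_t
alpha_bars = torch.cumprod(alphas, dim=0)     # bar(alpha)_t = prod_s alpha_s

def diffusion_loss(model, x0):
    """L_simple: predict the noise added at a uniformly random timestep."""
    batch = x0.shape[0]
    t = torch.randint(0, T, (batch,))
    eps = torch.randn_like(x0)

    # Closed-form jump: x_t = sqrt(abar_t) * x0 + sqrt(1 - abar_t) * eps
    abar = alpha_bars[t].view(batch, *([1] * (x0.dim() - 1)))
    x_t = abar.sqrt() * x0 + (1 - abar).sqrt() * eps

    return F.mse_loss(model(x_t, t), eps)
```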
Sampling Procedure
- Start with pure noise: $x_T \sim \mathcal{N}(0, I)$
- For $t = T, T-1, \ldots, 1$:
  - Predict the noise: $\epsilon_\theta(x_t, t)$
  - Compute $x_{t-1}$ from $x_t$ and the predicted noise using the reverse-process mean, adding fresh Gaussian noise when $t > 1$
- Return $x_0$
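The same loop as a PyTorch sketch, reusing the DDPM-style schedule from the training snippet above (`model`, the schedule, and the $\sigma_t^2 = \beta_t$ variance choice are all assumptions of this sketch):

```python
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)

@torch.no_grad()
def sample(model, shape):
    """Ancestral sampling: start from pure noise and denoise step by step."""
    x = torch.randn(shape)                                   # x_T ~ N(0, I)
    for t in reversed(range(T)):
        eps_hat = model(x, torch.full((shape[0],), t))       # predicted noise
        # Reverse-process mean: (x_t - beta_t / sqrt(1 - abar_t) * eps_hat) / sqrt(alpha_t)
        mean = (x - betas[t] / (1 - alpha_bars[t]).sqrt() * eps_hat) / alphas[t].sqrt()
        if t > 0:
            x = mean + betas[t].sqrt() * torch.randn_like(x) # add noise with sigma_t^2 = beta_t
        else:
            x = mean                                         # last step is deterministic
    return x
```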
Connection to Score Matching
Diffusion models are implicitly learning the score function (the gradient of the log density) at each noise level:

$$\nabla_{x_t} \log p(x_t) \approx -\frac{\epsilon_\theta(x_t, t)}{\sqrt{1 - \bar{\alpha}_t}}$$

so denoising and score matching are two views of the same training objective.
Strengths and Weaknesses
Strengths: Stable training, high-quality samples, principled framework, exact likelihood computation possible.
Weaknesses: Slow sampling (requires many steps), high computational cost.
Connecting the Three
VAE and Diffusion
Diffusion models can be viewed as hierarchical VAEs with:
- Latent variables at each timestep: $z_1, z_2, \ldots, z_T$
- Fixed encoder (the noising process)
- Learned decoder (the denoising process)
The ELBO in diffusion models has a similar structure to VAE, but summed over timesteps.
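Concretely, the diffusion ELBO (as written in the DDPM paper) splits into a prior term, a per-timestep KL term, and a reconstruction term, each with a direct VAE analogue:

$$\mathbb{E}_q\!\left[ D_{\text{KL}}\big(q(x_T \mid x_0) \,\|\, p(x_T)\big) + \sum_{t > 1} D_{\text{KL}}\big(q(x_{t-1} \mid x_t, x_0) \,\|\, p_\theta(x_{t-1} \mid x_t)\big) - \log p_\theta(x_0 \mid x_1) \right]$$

Here the "encoder" posteriors $q(x_{t-1} \mid x_t, x_0)$ are fixed Gaussians determined by the noise schedule, and only the "decoder" steps $p_\theta(x_{t-1} \mid x_t)$ are learned.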
GAN and Diffusion
Recent work shows we can train diffusion models adversarially:
- Generator: denoising network
- Discriminator: distinguishes real from denoised samples
This combines GAN's sharp samples with diffusion's stable training.
The Unified View
| Aspect | VAE | GAN | Diffusion |
|---|---|---|---|
| Objective | Maximize ELBO | Minimax game | Denoise at each step |
| Training | Stable | Can be unstable | Stable |
| Sample quality | Blurry | Sharp | Sharp |
| Likelihood | Lower bound | No explicit | Exact (with effort) |
| Latent space | Meaningful | Less structure | Hierarchical |
| Sampling speed | Fast | Fast | Slow |