VAE vs GAN vs Diffusion

December 28, 2025

The Generative Modeling Problem

We want to learn a distribution $p_\theta(x)$ that approximates the true data distribution $p_{\text{data}}(x)$. Given a dataset $\{x_1, x_2, \ldots, x_n\}$, how do we generate new samples that look like they came from the same distribution?

Three major approaches emerged: Variational Autoencoders (VAE), Generative Adversarial Networks (GAN), and Diffusion Models. Each takes a fundamentally different path to the same destination.

Prerequisites: What You Need to Know

Maximum Likelihood Estimation

We want to find parameters $\theta$ that maximize the likelihood of observing our data:

$$\theta^* = \arg\max_\theta \prod_{i=1}^n p_\theta(x_i)$$

In practice, we maximize the log-likelihood, which has the same maximizer (the log is monotonic) and turns the product into a sum:

$$\theta^* = \arg\max_\theta \sum_{i=1}^n \log p_\theta(x_i)$$

KL Divergence

Measures how one probability distribution differs from another:

$$D_{\text{KL}}(p \| q) = \mathbb{E}_{x \sim p} \left[ \log \frac{p(x)}{q(x)} \right]$$

Key property: $D_{\text{KL}}(p \| q) \geq 0$, with equality if and only if $p = q$.
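
A special case worth keeping in mind (it reappears in the VAE loss below): for a one-dimensional Gaussian $q = \mathcal{N}(\mu, \sigma^2)$ and a standard normal $p = \mathcal{N}(0, 1)$, the KL divergence has a closed form:

$$D_{\text{KL}}(\mathcal{N}(\mu, \sigma^2) \| \mathcal{N}(0, 1)) = \frac{1}{2} \left( \mu^2 + \sigma^2 - \log \sigma^2 - 1 \right)$$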

The Reparameterization Trick

Instead of sampling $z \sim \mathcal{N}(\mu, \sigma^2)$, we sample $\epsilon \sim \mathcal{N}(0, 1)$ and compute:

$$z = \mu + \sigma \epsilon$$

This makes the sampling operation differentiable with respect to $\mu$ and $\sigma$.
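
As a concrete illustration, here is a minimal PyTorch sketch of the trick (parameterizing the variance as `log_var` is a common convention for numerical stability, not something the math requires):

```python
import torch

def reparameterize(mu, log_var):
    """Sample z = mu + sigma * eps with eps ~ N(0, I).

    Gradients flow through mu and log_var because the randomness
    is isolated in eps, which does not depend on the parameters.
    """
    std = torch.exp(0.5 * log_var)   # sigma = exp(log_var / 2)
    eps = torch.randn_like(std)      # eps ~ N(0, I)
    return mu + std * eps
```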

Variational Autoencoders (VAE)

The Core Idea

VAEs assume data is generated from a latent variable $z$ drawn from a simple prior $p(z)$, typically $\mathcal{N}(0, I)$. We want to learn both a decoder $p_\theta(x|z)$ that maps latents to data and a way to infer latents from observations.

The challenge is that $p_\theta(x) = \int p_\theta(x|z) p(z) dz$ is intractable. So we introduce an approximate posterior $q_\phi(z|x)$ (the encoder network).

The Evidence Lower Bound (ELBO)

We can't maximize $\log p_\theta(x)$ directly, but we can maximize a lower bound:

$$\log p_\theta(x) \geq \mathbb{E}_{q_\phi(z|x)} [\log p_\theta(x|z)] - D_{\text{KL}}(q_\phi(z|x) \| p(z))$$
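
The bound follows from Jensen's inequality: rewrite the marginal likelihood as an expectation under $q_\phi$ and pull the log inside,

$$\log p_\theta(x) = \log \mathbb{E}_{q_\phi(z|x)} \left[ \frac{p_\theta(x|z) p(z)}{q_\phi(z|x)} \right] \geq \mathbb{E}_{q_\phi(z|x)} \left[ \log \frac{p_\theta(x|z) p(z)}{q_\phi(z|x)} \right],$$

then split the log of the ratio into the reconstruction term and the (negative) KL term.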

This decomposes beautifully into two terms: a reconstruction term, $\mathbb{E}_{q_\phi(z|x)} [\log p_\theta(x|z)]$, which rewards the decoder for reconstructing $x$ from sampled latents, and a regularization term, $D_{\text{KL}}(q_\phi(z|x) \| p(z))$, which keeps the approximate posterior close to the prior.

Training Procedure

For each data point $x$:

  1. Encode: $q_\phi(z|x) = \mathcal{N}(\mu_\phi(x), \sigma_\phi^2(x))$
  2. Sample: $z = \mu_\phi(x) + \sigma_\phi(x) \epsilon$ where $\epsilon \sim \mathcal{N}(0, I)$
  3. Decode: $p_\theta(x|z) = \mathcal{N}(\mu_\theta(z), \sigma_\theta^2(z))$
  4. Compute loss: $\mathcal{L} = -\mathbb{E}_{q_\phi(z|x)} [\log p_\theta(x|z)] + D_{\text{KL}}(q_\phi(z|x) \| p(z))$
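
Putting the four steps together, here is a minimal PyTorch sketch of the loss for one batch, assuming hypothetical `encoder` and `decoder` modules and a Gaussian (MSE) reconstruction term; the right reconstruction loss depends on the data:

```python
import torch
import torch.nn.functional as F

def vae_loss(x, encoder, decoder):
    """Negative ELBO for one batch (lower is better)."""
    mu, log_var = encoder(x)                     # parameters of q_phi(z|x)
    std = torch.exp(0.5 * log_var)
    z = mu + std * torch.randn_like(std)         # reparameterized sample
    x_hat = decoder(z)                           # mean of p_theta(x|z)

    # Reconstruction term: MSE corresponds to a fixed-variance Gaussian likelihood.
    recon = F.mse_loss(x_hat, x, reduction="sum")

    # KL(q_phi(z|x) || N(0, I)), closed form for diagonal Gaussians.
    kl = -0.5 * torch.sum(1 + log_var - mu.pow(2) - log_var.exp())

    return recon + kl
```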

Strengths and Weaknesses

Strengths: Principled probabilistic framework, stable training, meaningful latent space.

Weaknesses: Blurry samples (a consequence of pixel-wise reconstruction losses), restrictive (typically Gaussian) assumptions about the approximate posterior.

Generative Adversarial Networks (GAN)

The Core Idea

Forget explicit density modeling. Instead, train two networks that compete: a generator $G_\theta$ that maps noise $z \sim p(z)$ to fake samples, and a discriminator $D_\phi$ that tries to tell real samples from generated ones.

The Minimax Game

The discriminator tries to maximize:

$$\max_\phi \mathbb{E}_{x \sim p_{\text{data}}} [\log D_\phi(x)] + \mathbb{E}_{z \sim p(z)} [\log(1 - D_\phi(G_\theta(z)))]$$

The generator tries to minimize the same objective; in practice it instead maximizes $\mathbb{E}_{z \sim p(z)} [\log D_\phi(G_\theta(z))]$ (the non-saturating loss), which gives stronger gradients early in training. Together this defines the minimax game:

$$\min_\theta \max_\phi V(D, G) = \mathbb{E}_{x \sim p_{\text{data}}} [\log D_\phi(x)] + \mathbb{E}_{z \sim p(z)} [\log(1 - D_\phi(G_\theta(z)))]$$

What Does the Discriminator Learn?

For a fixed generator, the optimal discriminator outputs the probability that a sample came from the data rather than the generator:

$$D^*(x) = \frac{p_{\text{data}}(x)}{p_{\text{data}}(x) + p_\theta(x)}$$

Connection to Divergence Minimization

The GAN objective is equivalent to minimizing the Jensen-Shannon divergence:

$$D_{\text{JS}}(p_{\text{data}} \| p_\theta) = \frac{1}{2} D_{\text{KL}}(p_{\text{data}} \| \frac{p_{\text{data}} + p_\theta}{2}) + \frac{1}{2} D_{\text{KL}}(p_\theta \| \frac{p_{\text{data}} + p_\theta}{2})$$
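
Substituting $D^*$ back into the value function makes the connection explicit:

$$V(D^*, G) = -\log 4 + 2\, D_{\text{JS}}(p_{\text{data}} \| p_\theta)$$

so the generator's best move is to drive the JS divergence to zero, i.e. to match the data distribution exactly.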

Training Procedure

Alternate between:

  1. Update discriminator: Sample real data $x$ and noise $z$, maximize discriminator accuracy
  2. Update generator: Sample noise $z$, maximize discriminator confusion
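
A minimal PyTorch sketch of one alternating update, assuming hypothetical generator `G` and discriminator `D` modules (with `D` ending in a sigmoid) and their optimizers; the generator uses the non-saturating loss mentioned above:

```python
import torch
import torch.nn.functional as F

def gan_step(x_real, G, D, opt_g, opt_d, latent_dim=128):
    """One discriminator update followed by one generator update."""
    batch = x_real.size(0)

    # --- Discriminator: push D(real) toward 1 and D(fake) toward 0 ---
    z = torch.randn(batch, latent_dim)
    x_fake = G(z).detach()                                  # no gradients into G here
    d_loss = F.binary_cross_entropy(D(x_real), torch.ones(batch, 1)) \
           + F.binary_cross_entropy(D(x_fake), torch.zeros(batch, 1))
    opt_d.zero_grad()
    d_loss.backward()
    opt_d.step()

    # --- Generator: non-saturating loss, maximize log D(G(z)) ---
    z = torch.randn(batch, latent_dim)
    g_loss = F.binary_cross_entropy(D(G(z)), torch.ones(batch, 1))
    opt_g.zero_grad()
    g_loss.backward()
    opt_g.step()

    return d_loss.item(), g_loss.item()
```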

Strengths and Weaknesses

Strengths: Sharp, high-quality samples; no explicit likelihood model is required.

Weaknesses: Training instability, mode collapse, no explicit density estimation.

Diffusion Models

The Core Idea

Learn to reverse a gradual noising process. If we know how to denoise, we can start from pure noise and gradually recover data.

The Forward Process

Gradually add Gaussian noise over $T$ steps:

$$q(x_t | x_{t-1}) = \mathcal{N}(x_t; \sqrt{1 - \beta_t} x_{t-1}, \beta_t I)$$

Because a Gaussian corrupted by further Gaussian noise is still Gaussian, we can jump to any timestep directly:

$$q(x_t | x_0) = \mathcal{N}(x_t; \sqrt{\bar{\alpha}_t} x_0, (1 - \bar{\alpha}_t) I)$$

where $\alpha_t = 1 - \beta_t$ and $\bar{\alpha}_t = \prod_{s=1}^t \alpha_s$.
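
A short sketch of how $q(x_t | x_0)$ is sampled in one shot, assuming a linear $\beta$ schedule chosen purely for illustration:

```python
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)        # illustrative linear schedule
alphas = 1.0 - betas
alpha_bar = torch.cumprod(alphas, dim=0)     # \bar{alpha}_t

def q_sample(x0, t, noise=None):
    """Sample x_t ~ q(x_t | x_0) directly, without iterating over steps."""
    if noise is None:
        noise = torch.randn_like(x0)
    shape = (-1,) + (1,) * (x0.dim() - 1)    # broadcast per-example coefficients
    a = alpha_bar[t].sqrt().view(shape)
    s = (1.0 - alpha_bar[t]).sqrt().view(shape)
    return a * x0 + s * noise
```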

The Reverse Process

Learn to reverse the noising process:

$$p_\theta(x_{t-1} | x_t) = \mathcal{N}(x_{t-1}; \mu_\theta(x_t, t), \Sigma_\theta(x_t, t))$$

What Do We Actually Learn?

Instead of predicting $\mu_\theta$ directly, we train a network $\epsilon_\theta(x_t, t)$ to predict the noise that was added:

$$\mathcal{L} = \mathbb{E}_{t, x_0, \epsilon} \left[ \| \epsilon - \epsilon_\theta(\sqrt{\bar{\alpha}_t} x_0 + \sqrt{1 - \bar{\alpha}_t} \epsilon, t) \|^2 \right]$$

This is remarkably simple: just predict the noise at each timestep.
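
The training step really is that short; a sketch, reusing the schedule and `q_sample` helper above and assuming a hypothetical noise-prediction network `eps_model(x_t, t)`:

```python
def diffusion_loss(x0, eps_model):
    """Simple DDPM objective: predict the noise added at a random timestep."""
    t = torch.randint(0, T, (x0.size(0),))   # one uniform timestep per example
    noise = torch.randn_like(x0)
    x_t = q_sample(x0, t, noise)             # jump straight to x_t
    return torch.mean((noise - eps_model(x_t, t)) ** 2)
```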

Sampling Procedure

  1. Start with pure noise: $x_T \sim \mathcal{N}(0, I)$
  2. For $t = T, T-1, \ldots, 1$:
    • Predict noise: $\epsilon_\theta(x_t, t)$
    • Compute $x_{t-1}$ from $x_t$ and the predicted noise via the reverse process
  3. Return $x_0$
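
A sketch of this ancestral sampling loop, again reusing the schedule and `eps_model` assumptions above and taking $\sigma_t^2 = \beta_t$ as the reverse-process variance (one common choice):

```python
@torch.no_grad()
def sample(eps_model, shape):
    """Start from pure noise and denoise step by step."""
    x = torch.randn(shape)                               # x_T ~ N(0, I)
    for t in reversed(range(T)):
        eps = eps_model(x, torch.full((shape[0],), t))
        # Mean of p_theta(x_{t-1} | x_t) under the noise-prediction parameterization.
        mean = (x - betas[t] / (1.0 - alpha_bar[t]).sqrt() * eps) / alphas[t].sqrt()
        if t > 0:
            x = mean + betas[t].sqrt() * torch.randn_like(x)   # inject noise, except at t = 0
        else:
            x = mean
    return x
```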

Connection to Score Matching

Diffusion models are implicitly learning the score function (the gradient of the log density of the noised data):

$$\nabla_{x_t} \log p(x_t) \approx -\frac{1}{\sqrt{1 - \bar{\alpha}_t}} \epsilon_\theta(x_t, t)$$
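
In code, the conversion is a one-liner (same schedule and `eps_model` assumptions as before; `t` is an integer timestep index here):

```python
def score_from_eps(x_t, t, eps_model):
    """Approximate score of the noised marginal p(x_t) from the predicted noise."""
    return -eps_model(x_t, t) / (1.0 - alpha_bar[t]).sqrt()
```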

Strengths and Weaknesses

Strengths: Stable training, high-quality samples, principled framework, exact likelihood computation possible.

Weaknesses: Slow sampling (requires many steps), high computational cost.

Connecting the Three

VAE and Diffusion

Diffusion models can be viewed as hierarchical VAEs with a fixed (rather than learned) encoder given by the forward noising process, latent variables $x_1, \ldots, x_T$ that have the same dimensionality as the data, and one latent layer per timestep.

The ELBO in diffusion models has a similar structure to the VAE's, but it is summed over timesteps.

GAN and Diffusion

Recent work shows we can train diffusion models adversarially: a discriminator judges the output of each (large) denoising step, so the reverse process can take far bigger steps and produce samples in only a handful of iterations.

This combines GAN's sharp samples with diffusion's stable training.

The Unified View

| Aspect | VAE | GAN | Diffusion |
| --- | --- | --- | --- |
| Objective | Maximize ELBO | Minimax game | Denoise at each step |
| Training | Stable | Can be unstable | Stable |
| Sample quality | Blurry | Sharp | Sharp |
| Likelihood | Lower bound | No explicit | Exact (with effort) |
| Latent space | Meaningful | Less structure | Hierarchical |
| Speed | Fast | Fast | Slow |