Diffusion Models: From DDPM to Flow Matching

January 4, 2026

Introduction

Diffusion models have emerged as the dominant paradigm in generative modeling, powering systems like DALL-E, Stable Diffusion, and Sora. But the field has evolved rapidly from the original DDPM formulation through score matching, SDEs, and most recently flow matching. This article traces that evolution, showing how each advancement builds on the last.

We'll see how what started as a discrete denoising process became understood as continuous dynamics, then as probability flows, and finally as optimal transport problems. By the end, you'll understand not just the algorithms, but the deep connections between these seemingly different approaches.

Foundational Concepts

Gaussian Distributions and Reparameterization

The Gaussian distribution is central to all diffusion models. A random variable $x \sim \mathcal{N}(\mu, \sigma^2)$ has density:

$$p(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$$

The reparameterization trick lets us write $x = \mu + \sigma \epsilon$ where $\epsilon \sim \mathcal{N}(0, 1)$. This makes sampling differentiable with respect to parameters.
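As a quick numerical illustration (a NumPy sketch with toy values of my choosing), reparameterized samples reproduce the target moments while keeping the noise separate from the parameters:

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 2.0, 0.5

# x = mu + sigma * eps with eps ~ N(0, 1): sampling is now a deterministic,
# differentiable function of (mu, sigma) given the noise eps.
eps = rng.standard_normal(100_000)
x = mu + sigma * eps

print(x.mean(), x.std())  # close to mu = 2.0 and sigma = 0.5
```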

Bayes' Rule and Conditional Distributions

For Gaussians, Bayes' rule has a particularly nice form. If $x_0 \sim \mathcal{N}(\mu_0, \sigma_0^2)$ and $x_t | x_0 \sim \mathcal{N}(\alpha_t x_0, \sigma_t^2)$, then:

$$x_0 | x_t \sim \mathcal{N}\left(\frac{\alpha_t \sigma_0^2 x_t + \sigma_t^2 \mu_0}{\alpha_t^2 \sigma_0^2 + \sigma_t^2}, \frac{\sigma_0^2 \sigma_t^2}{\alpha_t^2 \sigma_0^2 + \sigma_t^2}\right)$$

This posterior inference is exact and tractable for Gaussian processes - a key advantage we'll exploit throughout.
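This tractability is easy to verify by Monte Carlo. The sketch below (NumPy, toy parameters of my choosing, with a nonzero prior mean $\mu_0$) checks that the posterior residual $x_0 - \mathbb{E}[x_0 | x_t]$ has mean zero and variance equal to the posterior variance:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000
mu0, s0 = 1.0, 2.0          # prior: x0 ~ N(mu0, s0^2)
a, st = 0.7, 0.5            # likelihood: x_t | x0 ~ N(a * x0, st^2)

x0 = mu0 + s0 * rng.standard_normal(n)
xt = a * x0 + st * rng.standard_normal(n)

# Closed-form Gaussian posterior x0 | xt
denom = a**2 * s0**2 + st**2
post_mean = (a * s0**2 * xt + st**2 * mu0) / denom
post_var = s0**2 * st**2 / denom

resid = x0 - post_mean      # should be zero-mean with variance post_var
print(resid.mean(), resid.var(), post_var)
```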

The Score Function

The score function is the gradient of the log-density:

$$s(x) = \nabla_x \log p(x)$$

For a Gaussian $\mathcal{N}(\mu, \sigma^2)$, the score is particularly simple:

$$\nabla_x \log p(x) = -\frac{x - \mu}{\sigma^2}$$

This points toward the mode (mean) of the distribution, with strength proportional to how far we are from it. The score function will become central to understanding diffusion models.

DDPM: Denoising Diffusion Probabilistic Models

The Setup

DDPM (Ho et al., 2020) introduced a discrete-time noising process over $T$ steps. Starting from data $x_0 \sim q(x_0)$, we gradually add Gaussian noise:

$$q(x_t | x_{t-1}) = \mathcal{N}(x_t; \sqrt{1-\beta_t} x_{t-1}, \beta_t I)$$

where $\beta_1, \ldots, \beta_T$ is a variance schedule with $0 < \beta_t < 1$. Each step adds a little noise and shrinks the signal slightly.

The Magic of Closed-Form Diffusion

A key insight: we can jump to any timestep directly. Define $\alpha_t = 1 - \beta_t$ and $\bar{\alpha}_t = \prod_{s=1}^t \alpha_s$. Then:

$$q(x_t | x_0) = \mathcal{N}(x_t; \sqrt{\bar{\alpha}_t} x_0, (1-\bar{\alpha}_t) I)$$

Equivalently, using reparameterization:

$$x_t = \sqrt{\bar{\alpha}_t} x_0 + \sqrt{1-\bar{\alpha}_t} \epsilon, \quad \epsilon \sim \mathcal{N}(0, I)$$

This is beautiful: $x_t$ is a weighted interpolation between the data $x_0$ and pure noise $\epsilon$. Because $\bar{\alpha}_t$ decreases toward zero, for a suitable schedule and large $T$ the endpoint $x_T$ is essentially pure noise.
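Numerically (a NumPy sketch; the linear schedule below is an illustrative assumption matching common DDPM settings):

```python
import numpy as np

rng = np.random.default_rng(0)
T = 1000
betas = np.linspace(1e-4, 0.02, T)    # illustrative linear variance schedule
alpha_bars = np.cumprod(1.0 - betas)  # \bar{alpha}_t

def q_sample(x0, t, eps):
    """Jump straight to timestep t: x_t = sqrt(abar_t) x0 + sqrt(1 - abar_t) eps."""
    return np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * eps

x0 = rng.standard_normal(10_000)      # toy unit-variance "data"
x_mid = q_sample(x0, 500, rng.standard_normal(10_000))
x_end = q_sample(x0, T - 1, rng.standard_normal(10_000))
print(alpha_bars[-1])                 # near zero: x_T is essentially pure noise
```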

The Reverse Process

To generate samples, we need to reverse this process. We model:

$$p_\theta(x_{t-1} | x_t) = \mathcal{N}(x_{t-1}; \mu_\theta(x_t, t), \sigma_t^2 I)$$

The variance $\sigma_t^2$ can be fixed (to $\beta_t$ or $\tilde{\beta}_t = \frac{1-\bar{\alpha}_{t-1}}{1-\bar{\alpha}_t} \beta_t$). The mean $\mu_\theta$ is what we learn.

Training Objective

The variational bound gives an objective that decomposes across timesteps; up to per-timestep weighting constants it reads:

$$\mathcal{L} = \mathbb{E}_{t \sim \text{Uniform}(1,T)} \mathbb{E}_{x_0, \epsilon} \left[ \left\| \mu_\theta(x_t, t) - \tilde{\mu}(x_t, x_0) \right\|^2 \right]$$

where $\tilde{\mu}$ is the true posterior mean, computable in closed form using Bayes' rule.

Predicting the Noise

Instead of predicting $\mu_\theta$ directly, Ho et al. showed we can predict the noise $\epsilon$. The relationship is:

$$\tilde{\mu}(x_t, x_0) = \frac{1}{\sqrt{\alpha_t}} \left( x_t - \frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}} \epsilon \right)$$

So we train $\epsilon_\theta(x_t, t)$ to predict $\epsilon$ with the simplified loss:

$$\mathcal{L}_{\text{simple}} = \mathbb{E}_{t, x_0, \epsilon} \left[ \| \epsilon - \epsilon_\theta(x_t, t) \|^2 \right]$$

where $x_t = \sqrt{\bar{\alpha}_t} x_0 + \sqrt{1-\bar{\alpha}_t} \epsilon$. This is remarkably simple: sample a timestep, add noise, predict the noise.
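A minimal sketch of this loss (NumPy, 1-D toy data; the closed-form predictor below is the optimal denoiser for standard-normal data, standing in for a trained network — an illustrative assumption, not the paper's setup):

```python
import numpy as np

rng = np.random.default_rng(0)
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alpha_bars = np.cumprod(1.0 - betas)

def loss_simple(eps_model, n=100_000):
    """Monte-Carlo estimate of L_simple: sample t, add noise, predict the noise."""
    t = rng.integers(0, T, n)
    x0 = rng.standard_normal(n)          # toy data x0 ~ N(0, 1)
    eps = rng.standard_normal(n)
    xt = np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * eps
    return np.mean((eps - eps_model(xt, t)) ** 2)

# For x0 ~ N(0,1) the Bayes-optimal predictor is E[eps | x_t] = sqrt(1-abar_t) x_t.
optimal = lambda x, t: np.sqrt(1.0 - alpha_bars[t]) * x
zero = lambda x, t: np.zeros_like(x)
l_opt, l_zero = loss_simple(optimal), loss_simple(zero)
print(l_opt, l_zero)                     # the optimal predictor scores lower
```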

Sampling

To generate a sample:

  1. Start with $x_T \sim \mathcal{N}(0, I)$
  2. For $t = T, T-1, \ldots, 1$:
    • Predict $\hat{\epsilon} = \epsilon_\theta(x_t, t)$
    • Compute $x_{t-1} = \frac{1}{\sqrt{\alpha_t}} \left( x_t - \frac{1-\alpha_t}{\sqrt{1-\bar{\alpha}_t}} \hat{\epsilon} \right) + \sigma_t z$ where $z \sim \mathcal{N}(0, I)$
  3. Return $x_0$

This requires $T$ network evaluations (typically $T = 1000$), making sampling slow.
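The loop above can be sketched end to end. Since there is no trained network here, the code substitutes the closed-form optimal predictor for standard-normal toy data (an assumption for illustration; a real sampler plugs in $\epsilon_\theta$):

```python
import numpy as np

rng = np.random.default_rng(0)
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

def eps_model(x, t):
    # Stand-in for a trained network: for toy data x0 ~ N(0,1) the optimal
    # noise predictor is E[eps | x_t] = sqrt(1 - abar_t) * x_t.
    return np.sqrt(1.0 - alpha_bars[t]) * x

x = rng.standard_normal(20_000)              # step 1: x_T ~ N(0, I)
for t in range(T - 1, -1, -1):               # step 2: t = T, ..., 1
    eps_hat = eps_model(x, t)
    mean = (x - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps_hat) / np.sqrt(alphas[t])
    z = rng.standard_normal(x.shape) if t > 0 else 0.0
    x = mean + np.sqrt(betas[t]) * z         # sigma_t^2 = beta_t; no noise at the end

print(x.mean(), x.std())                     # approx the N(0, 1) toy data
```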

Score Matching: A Parallel Story

The Score-Based Perspective

Independently, Song & Ermon (2019) developed score-based generative modeling. The key insight: if we know the score function $\nabla_x \log p(x)$, we can generate samples using Langevin dynamics:

$$x_{k+1} = x_k + \frac{\epsilon}{2} \nabla_x \log p(x_k) + \sqrt{\epsilon} z_k$$

where $z_k \sim \mathcal{N}(0, I)$. As $\epsilon \to 0$ and iterations $\to \infty$, this converges to sampling from $p(x)$.
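A minimal NumPy sketch of Langevin sampling, using a 1-D Gaussian target whose score is known analytically (toy values are my own):

```python
import numpy as np

rng = np.random.default_rng(0)

mu, sigma = 3.0, 0.5
def score(x):
    # Score of N(mu, sigma^2): -(x - mu) / sigma^2
    return -(x - mu) / sigma**2

step = 1e-3
x = rng.standard_normal(5_000)               # arbitrary initialization
for _ in range(5_000):
    z = rng.standard_normal(x.shape)
    x = x + 0.5 * step * score(x) + np.sqrt(step) * z

print(x.mean(), x.std())                     # drifts toward mu = 3, sigma = 0.5
```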

The Challenge: Score Estimation

Directly estimating $\nabla_x \log p_{\text{data}}(x)$ is hard in low-density regions (where there's little data). Song & Ermon's solution: noise-conditioned score matching.

Perturb data with different noise levels $\sigma_1 > \sigma_2 > \cdots > \sigma_L$:

$$q_\sigma(x) = \int p_{\text{data}}(x_0) \mathcal{N}(x; x_0, \sigma^2 I) dx_0$$

Train $s_\theta(x, \sigma)$ to match these scores:

$$\mathcal{L} = \mathbb{E}_{\sigma \sim p(\sigma)} \mathbb{E}_{x_0 \sim p_{\text{data}}} \mathbb{E}_{x \sim \mathcal{N}(x_0, \sigma^2 I)} \left[ \left\| s_\theta(x, \sigma) - \nabla_x \log q_\sigma(x | x_0) \right\|^2 \right]$$

For Gaussian perturbations, $\nabla_x \log q_\sigma(x | x_0) = -\frac{x - x_0}{\sigma^2}$. So we're learning to predict the direction back to clean data.

Connection to DDPM

This is equivalent to DDPM! The DDPM noise predictor $\epsilon_\theta$ relates to the score as:

$$s_\theta(x_t, t) = -\frac{\epsilon_\theta(x_t, t)}{\sqrt{1-\bar{\alpha}_t}}$$

Both methods learn the same quantity, just parameterized differently. DDPM predicts noise; score matching predicts gradients. They're two sides of the same coin.

DDIM: Deterministic Sampling

The Limitation of DDPM

DDPM sampling is a Markov chain with stochastic transitions. This requires many steps ($T \sim 1000$) to produce good samples. Can we go faster?

Non-Markovian Forward Processes

Song et al. (2021) showed that DDPM's forward marginals $q(x_t | x_0)$ don't uniquely determine the forward process $q(x_t | x_{t-1})$. We can use non-Markovian processes that give the same marginals!

DDIM introduces a deterministic process:

$$x_{t-1} = \sqrt{\bar{\alpha}_{t-1}} \underbrace{\left( \frac{x_t - \sqrt{1-\bar{\alpha}_t} \epsilon_\theta(x_t, t)}{\sqrt{\bar{\alpha}_t}} \right)}_{\text{predicted } x_0} + \sqrt{1-\bar{\alpha}_{t-1}} \epsilon_\theta(x_t, t)$$

Notice there's no noise term! This is a deterministic mapping from $x_t$ to $x_{t-1}$.

Accelerated Sampling

Because DDIM is deterministic, we can skip timesteps. Instead of running all $t = 1000, 999, \ldots, 1$, we can use a subsequence like $t = 1000, 900, 800, \ldots, 100$. This gives $10\times$ speedup with minimal quality loss.

The key: DDIM defines implicit probabilistic models (hence the name) where the reverse process is a learned ODE, not an SDE.
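A sketch of accelerated DDIM sampling (NumPy, 1-D toy; as before, the closed-form optimal noise predictor for standard-normal data stands in for a trained $\epsilon_\theta$ — an illustrative assumption):

```python
import numpy as np

rng = np.random.default_rng(0)
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alpha_bars = np.cumprod(1.0 - betas)

def eps_model(x, t):
    # Optimal predictor for toy data x0 ~ N(0,1); a real model is a neural net.
    return np.sqrt(1.0 - alpha_bars[t]) * x

# Deterministic DDIM updates over a 100-step subsequence of the 1000 steps.
timesteps = np.linspace(T - 1, 0, 100).astype(int)
x = rng.standard_normal(20_000)
for t, t_prev in zip(timesteps[:-1], timesteps[1:]):
    eps_hat = eps_model(x, t)
    x0_pred = (x - np.sqrt(1.0 - alpha_bars[t]) * eps_hat) / np.sqrt(alpha_bars[t])
    x = np.sqrt(alpha_bars[t_prev]) * x0_pred + np.sqrt(1.0 - alpha_bars[t_prev]) * eps_hat

print(x.std())   # close to 1, the std of the toy data, despite 10x fewer steps
```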

SDEs: The Continuous-Time View

From Discrete to Continuous

Song et al. (2021) took the next conceptual leap: view diffusion as a continuous-time process. As $T \to \infty$ and step size $\to 0$, the discrete process becomes a stochastic differential equation (SDE):

$$dx = f(x, t) dt + g(t) dw$$

where $w$ is a Wiener process (Brownian motion), $f$ is the drift, and $g$ is the diffusion coefficient.

The Forward SDE

The DDPM forward process corresponds to the Variance Preserving (VP) SDE:

$$dx = -\frac{1}{2} \beta(t) x dt + \sqrt{\beta(t)} dw$$

This has a nice closed-form solution: $x_t = \alpha_t x_0 + \sigma_t \epsilon$ with $\alpha_t = \exp\left(-\frac{1}{2}\int_0^t \beta(s)\,ds\right)$ and $\sigma_t^2 = 1 - \alpha_t^2$, recovering the DDPM marginals in the continuous limit.

The Reverse SDE

Anderson (1982) showed that every forward SDE has a reverse SDE:

$$dx = \left[ f(x, t) - g(t)^2 \nabla_x \log p_t(x) \right] dt + g(t) d\bar{w}$$

where $\bar{w}$ is a reverse-time Wiener process. The only unknown is the score $\nabla_x \log p_t(x)$, which we approximate with our neural network!

This is profound: the reverse process has a closed form in terms of the score. Learn the score, get generation for free.

The Probability Flow ODE

Even more remarkably, there's a deterministic ODE with the same marginal distributions as the SDE:

$$dx = \left[ f(x, t) - \frac{1}{2} g(t)^2 \nabla_x \log p_t(x) \right] dt$$

This ODE has the same starting distribution $p_0(x)$ and ending distribution $p_T(x)$ as the SDE, but follows deterministic trajectories. This explains DDIM: it's approximately solving this probability flow ODE!
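To see this concretely, the sketch below integrates the probability flow ODE backward in time with Euler steps, for Gaussian toy data where $p_t$ (and hence the exact score) is available in closed form. Setup and parameters are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alpha_bars = np.cumprod(1.0 - betas)

# Toy data x0 ~ N(2, 0.5^2): p_t stays Gaussian, so the exact score is known.
mu0, var0 = 2.0, 0.25
def score(x, t):
    m = np.sqrt(alpha_bars[t]) * mu0                   # mean of p_t
    v = alpha_bars[t] * var0 + (1.0 - alpha_bars[t])   # variance of p_t
    return -(x - m) / v

x = rng.standard_normal(20_000)              # prior samples at t = T
for t in range(T - 1, -1, -1):
    # One reverse Euler step of dx = [-1/2 beta x - 1/2 beta * score] dt
    x = x + 0.5 * betas[t] * (x + score(x, t))

print(x.mean(), x.std())                     # approx mu0 = 2 and sqrt(var0) = 0.5
```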

Benefits of the Continuous View

The continuous formulation pays off in three ways, covered below: a unified family of noise schedules, a general training objective, and principled conditional generation.

General Noise Schedules

The SDE framework allows exploring different noise schedules beyond DDPM's discrete variance schedule. Song et al. introduced three main families:

Variance Preserving (VP): The signal decays as noise is added, so the total variance stays bounded (roughly constant for unit-variance data)

$$dx = -\frac{1}{2} \beta(t) x dt + \sqrt{\beta(t)} dw$$

Variance Exploding (VE): Variance grows without signal decay

$$dx = \sqrt{\frac{d[\sigma^2(t)]}{dt}} dw$$

sub-VP: Like VP, but with the variance always bounded above by the VP variance at the same time

$$dx = -\frac{1}{2} \beta(t) x dt + \sqrt{\beta(t)(1 - e^{-2\int_0^t \beta(s)ds})} dw$$

Training with SDEs

The training objective generalizes score matching:

$$\mathcal{L} = \mathbb{E}_{t \sim \mathcal{U}(0, T)} \mathbb{E}_{x_0 \sim p_{\text{data}}} \mathbb{E}_{x_t \sim p_t(x_t | x_0)} \left[ \lambda(t) \left\| s_\theta(x_t, t) - \nabla_{x_t} \log p_t(x_t | x_0) \right\|^2 \right]$$

where $\lambda(t)$ is a weighting function. Different choices of $\lambda$ connect to different objectives (DDPM uses $\lambda(t) = \sigma_t^2$).

Controllable Generation

The SDE framework enables elegant conditional generation. For class conditioning:

$$\nabla_x \log p(x_t | y) = \nabla_x \log p(x_t) + \nabla_x \log p(y | x_t)$$

We can use a classifier $p(y | x_t)$ to guide the score, giving classifier guidance. Or train a conditional score network $s_\theta(x_t, t, y)$ directly.

Flow Matching: The Modern View

Motivation: Optimal Transport

All previous methods use Gaussian noise schedules. But is this optimal? Flow matching (Lipman et al., 2023) takes a different approach: directly learn the vector field that transports noise to data.

The setup: we want a time-dependent vector field $v_t(x)$ such that the ODE:

$$\frac{dx_t}{dt} = v_t(x_t)$$

transports samples from a simple prior $p_0$ (e.g., $\mathcal{N}(0, I)$) to the data distribution $p_1$.

Conditional Flow Matching

Directly learning $v_t$ is hard. Flow matching uses a clever trick: match conditional vector fields.

For each data point $x_1 \sim p_{\text{data}}$, define a simple conditional path:

$$x_t = t x_1 + (1-t) x_0, \quad x_0 \sim \mathcal{N}(0, I)$$

This linearly interpolates from noise to data. The velocity along this path is:

$$u_t(x_t | x_1) = \frac{dx_t}{dt} = x_1 - x_0 = \frac{x_1 - x_t}{1-t}$$

The Flow Matching Objective

Train a neural network $v_\theta(x_t, t)$ to match these conditional velocities:

$$\mathcal{L}_{\text{FM}} = \mathbb{E}_{t \sim \mathcal{U}(0,1)} \mathbb{E}_{x_1 \sim p_{\text{data}}} \mathbb{E}_{x_0 \sim \mathcal{N}(0,I)} \left[ \left\| v_\theta(x_t, t) - (x_1 - x_0) \right\|^2 \right]$$

where $x_t = t x_1 + (1-t) x_0$. This is beautifully simple: the target is just the difference $x_1 - x_0$!
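A minimal end-to-end sketch of this objective on a 1-D toy problem. The "network" here is just a linear regressor on hand-picked features standing in for $v_\theta$; the data distribution, features, and learning rate are all illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def features(x, t):
    # Tiny stand-in for a neural net: v(x, t) = w . [x, t, x*t, 1]
    return np.stack([x, t, x * t, np.ones_like(x)], axis=1)

w = np.zeros(4)
lr = 0.05
for _ in range(2000):
    x1 = 2.0 + 0.5 * rng.standard_normal(256)       # data batch ~ N(2, 0.25)
    x0 = rng.standard_normal(256)                   # noise batch ~ N(0, 1)
    t = rng.uniform(0.0, 1.0, 256)
    xt = t * x1 + (1.0 - t) * x0                    # straight-line path
    target = x1 - x0                                # conditional velocity
    phi = features(xt, t)
    grad = 2.0 * phi.T @ (phi @ w - target) / 256   # gradient of the FM loss
    w -= lr * grad

# Held-out evaluation of the flow matching loss.
x1 = 2.0 + 0.5 * rng.standard_normal(50_000)
x0 = rng.standard_normal(50_000)
t = rng.uniform(0.0, 1.0, 50_000)
final_loss = np.mean((features(t * x1 + (1 - t) * x0, t) @ w - (x1 - x0)) ** 2)
print(final_loss)    # far below the untrained loss E[(x1 - x0)^2] = 5.25
```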

Why Does This Work?

The key theorem: the conditional flow matching loss has the same gradients as the intractable marginal loss, so its minimizer $v_\theta$ matches the marginal vector field. The learned $v_\theta$ therefore generates the correct marginals $p_t$ for every $t$, and in particular the data distribution at $t = 1$.

The proof uses the fact that the marginal vector field is the expectation of conditional vector fields:

$$v_t(x_t) = \mathbb{E}_{x_1 \sim p(x_1 | x_t)} [u_t(x_t | x_1)]$$

Rectified Flows

Liu et al. (2023) proposed rectified flows: iteratively train flow models to make trajectories straighter. Starting with a trained model $v_\theta^{(0)}$:

  1. Generate coupled pairs $(x_0^{(k)}, x_1^{(k)})$ by sampling noise $x_0^{(k)}$ and running the learned ODE forward to obtain $x_1^{(k)}$
  2. Train a new model $v_\theta^{(k+1)}$ to match straight-line paths between these pairs
  3. Repeat

After a few iterations, trajectories become nearly straight, allowing one-step or few-step generation!

Connection to Diffusion Models

Flow matching is intimately related to the probability flow ODE. In fact, diffusion models can be viewed as flow matching with a specific Gaussian conditional path:

$$p_t(x_t | x_0) = \mathcal{N}(x_t; \alpha_t x_0, \sigma_t^2 I)$$

The flow matching framework generalizes this, allowing non-Gaussian paths and more flexibility.

Optimal Transport Flows

Recent work connects flow matching to optimal transport (OT). The idea: find the flow that minimizes transport cost between $p_0$ and $p_1$.

The Monge problem seeks a map $T: x_0 \mapsto x_1$ minimizing:

$$\min_T \mathbb{E}_{x_0 \sim p_0} \left[ c(x_0, T(x_0)) \right]$$

For cost $c(x_0, x_1) = \|x_1 - x_0\|^2$, the optimal paths are straight lines! This justifies the straight-line interpolation in flow matching.

Practical Considerations

Network Architectures

All methods use time-conditioned networks. Common choices include U-Nets with sinusoidal timestep embeddings (as in DDPM and Stable Diffusion) and transformer backbones such as DiT.

Parameterization Choices

Different parameterizations of the learned function:

  • Noise prediction $\epsilon_\theta$ (DDPM)
  • Clean-data prediction $\hat{x}_\theta$
  • Score prediction $s_\theta$
  • Velocity prediction $v_\theta$ (flow matching)

These are related via:

$$\hat{x}_\theta = \frac{x_t - \sigma_t \epsilon_\theta}{\alpha_t} = \frac{x_t + \sigma_t^2 s_\theta}{\alpha_t}$$
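These conversions are easy to check numerically, using $x_t = \alpha_t x_0 + \sigma_t \epsilon$ and the conditional score $s = -\epsilon / \sigma_t$ (a NumPy sketch with arbitrary toy coefficients):

```python
import numpy as np

rng = np.random.default_rng(0)

alpha_t, sigma_t = 0.8, 0.6                   # arbitrary toy coefficients
x0 = rng.standard_normal(1000)
eps = rng.standard_normal(1000)
xt = alpha_t * x0 + sigma_t * eps             # x_t = alpha_t x0 + sigma_t eps

score = -eps / sigma_t                        # conditional score of q(x_t | x0)
x0_from_eps = (xt - sigma_t * eps) / alpha_t
x0_from_score = (xt + sigma_t**2 * score) / alpha_t

# Both parameterizations recover the same clean data.
print(np.allclose(x0_from_eps, x0), np.allclose(x0_from_score, x0))
```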

Sampling Algorithms

For ODEs (DDIM, probability flow, flow matching):

  • Euler or Heun steps over a coarse time grid (typically tens of evaluations)
  • Higher-order solvers such as DPM-Solver that exploit the semi-linear structure

For SDEs:

  • Euler-Maruyama discretization of the reverse SDE
  • Predictor-corrector schemes that interleave discretization steps with Langevin corrections

Guidance Techniques

Classifier-free guidance (Ho & Salimans, 2021) is now standard for conditional generation:

$$\tilde{\epsilon}_\theta(x_t, t, y) = (1 + w) \epsilon_\theta(x_t, t, y) - w \epsilon_\theta(x_t, t, \emptyset)$$

where $w$ is the guidance weight. This extrapolates away from unconditional predictions, improving sample quality at the cost of diversity.
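In code the guidance rule is one line; `eps_cond` and `eps_uncond` below stand in for $\epsilon_\theta(x_t, t, y)$ and $\epsilon_\theta(x_t, t, \emptyset)$ (toy values, no real model):

```python
import numpy as np

def cfg(eps_cond, eps_uncond, w):
    # Classifier-free guidance: extrapolate away from the unconditional prediction.
    return (1.0 + w) * eps_cond - w * eps_uncond

eps_cond = np.array([1.0, 2.0])
eps_uncond = np.array([0.0, 1.0])
print(cfg(eps_cond, eps_uncond, 0.0))   # w = 0 recovers the conditional prediction
print(cfg(eps_cond, eps_uncond, 2.0))   # w > 0 extrapolates past it: [3. 4.]
```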

The Unified Picture

What Are We Really Learning?

All diffusion models learn the same fundamental object: the vector field that transports noise to data. Different formulations are different parameterizations of this field: DDPM predicts the noise, score-based models predict the log-density gradient, and flow matching predicts the velocity directly.

They converge to the same solution, differing mainly in training efficiency and sampling speed.

Key Insights

Gaussian perturbations are not special: Flow matching shows we can use arbitrary conditional paths. Gaussians are convenient but not necessary.

Stochasticity is optional: Both stochastic (SDE) and deterministic (ODE) processes work. ODEs are often preferred for speed and controllability.

Straight is fast: Straighter trajectories (rectified flows, optimal transport) enable few-step generation without quality loss.

The score is central: Whether we call it noise prediction, score estimation, or velocity field learning, we're learning how to move from noise to data.

Comparison Table

| Method | Framework | Learns | Sampling | Speed |
|---|---|---|---|---|
| DDPM | Discrete VAE | Noise $\epsilon_\theta$ | Stochastic | Slow (1000 steps) |
| DDIM | Implicit model | Noise $\epsilon_\theta$ | Deterministic | Medium (50 steps) |
| Score SDE | Continuous SDE | Score $s_\theta$ | Flexible | Medium |
| Probability Flow | ODE | Score $s_\theta$ | Deterministic | Medium |
| Flow Matching | ODE / OT | Velocity $v_\theta$ | Deterministic | Fast (10 steps) |
| Rectified Flow | OT | Velocity $v_\theta$ | Deterministic | Very fast (1-5 steps) |

Current Frontiers

Distillation

Progressive distillation (Salimans & Ho, 2022) and consistency models (Song et al., 2023) aim to distill multi-step diffusion into few-step or even one-step generators while preserving quality.

Discrete Diffusion

Extending diffusion to discrete spaces (text, graphs) is an active area. Approaches include absorbing states, corruption processes, and score matching on discrete structures.

Unified Models

Recent work unifies diffusion with other generative models: masked generative models, autoregressive models, and energy-based models all connect to the diffusion framework.

Applications Beyond Images

Diffusion models now power video generation (Sora), 3D synthesis (DreamFusion), molecular design, protein folding, and more. The framework is remarkably general.

Conclusion

The journey from DDPM to flow matching reveals a beautiful story of conceptual unification. What began as discrete denoising became continuous dynamics, then optimal transport. Each perspective adds insight: DDPM supplies a simple training recipe, the SDE view supplies the theory, and flow matching distills both into straight-line transport.

Today's best models combine insights from all these perspectives. They use the training simplicity of flow matching, the theoretical understanding from SDEs, and sampling algorithms refined through years of diffusion model research.

The field continues to evolve rapidly. But these fundamentals - learning to transport noise to data along smooth trajectories - will remain central. Understanding this progression equips you to both use current models and contribute to future developments.