Diffusion Models: From DDPM to Flow Matching
January 4, 2026
Introduction
Diffusion models have emerged as the dominant paradigm in generative modeling, powering systems like DALL-E, Stable Diffusion, and Sora. But the field has evolved rapidly from the original DDPM formulation through score matching, SDEs, and most recently flow matching. This article traces that evolution, showing how each advancement builds on the last.
We'll see how a process that started as discrete denoising came to be understood as continuous dynamics, then as probability flows, and finally as an optimal transport problem. By the end, you'll understand not just the algorithms, but the deep connections between these seemingly different approaches.
Foundational Concepts
Gaussian Distributions and Reparameterization
The Gaussian distribution is central to all diffusion models. A random variable $x \sim \mathcal{N}(\mu, \sigma^2)$ has density:
$$p(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$$
The reparameterization trick lets us write $x = \mu + \sigma \epsilon$ where $\epsilon \sim \mathcal{N}(0, 1)$. This makes sampling differentiable with respect to parameters.
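To make this concrete, here is a minimal PyTorch sketch of the reparameterization trick (the function name is just for illustration):

```python
import torch

def sample_gaussian(mu: torch.Tensor, sigma: torch.Tensor) -> torch.Tensor:
    """Reparameterized sample: x = mu + sigma * eps with eps ~ N(0, I).

    Because the randomness lives entirely in eps, gradients can flow
    through mu and sigma when this is used inside a training objective.
    """
    eps = torch.randn_like(mu)
    return mu + sigma * eps
```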
Bayes' Rule and Conditional Distributions
For Gaussians, Bayes' rule has a particularly nice form. If $x_0 \sim \mathcal{N}(\mu_0, \sigma_0^2)$ and $x_t | x_0 \sim \mathcal{N}(\alpha_t x_0, \sigma_t^2)$, then:
$$q(x_0 \mid x_t) = \mathcal{N}\!\left(x_0;\ \frac{\alpha_t \sigma_0^2\, x_t + \sigma_t^2 \mu_0}{\alpha_t^2 \sigma_0^2 + \sigma_t^2},\ \frac{\sigma_0^2 \sigma_t^2}{\alpha_t^2 \sigma_0^2 + \sigma_t^2}\right)$$
This posterior inference is exact and tractable whenever everything involved is Gaussian - a key advantage we'll exploit throughout.
The Score Function
The score function is the gradient of the log-density:
$$s(x) = \nabla_x \log p(x)$$
For a Gaussian $\mathcal{N}(\mu, \sigma^2)$, the score is particularly simple:
$$\nabla_x \log p(x) = -\frac{x - \mu}{\sigma^2}$$
This points toward the mode (mean) of the distribution, with strength proportional to how far we are from it. The score function will become central to understanding diffusion models.
DDPM: Denoising Diffusion Probabilistic Models
The Setup
DDPM (Ho et al., 2020) introduced a discrete-time noising process over $T$ steps. Starting from data $x_0 \sim q(x_0)$, we gradually add Gaussian noise:
$$q(x_t \mid x_{t-1}) = \mathcal{N}\!\left(x_t;\ \sqrt{1-\beta_t}\, x_{t-1},\ \beta_t I\right)$$
where $\beta_1, \ldots, \beta_T$ is a variance schedule with $0 < \beta_t < 1$. Each step adds a little noise and shrinks the signal slightly.
The Magic of Closed-Form Diffusion
A key insight: we can jump to any timestep directly. Define $\alpha_t = 1 - \beta_t$ and $\bar{\alpha}_t = \prod_{s=1}^t \alpha_s$. Then:
$$q(x_t \mid x_0) = \mathcal{N}\!\left(x_t;\ \sqrt{\bar{\alpha}_t}\, x_0,\ (1-\bar{\alpha}_t) I\right)$$
Equivalently, using reparameterization:
$$x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon, \qquad \epsilon \sim \mathcal{N}(0, I)$$
This is beautiful: $x_t$ is a weighted interpolation between the data $x_0$ and pure noise $\epsilon$. For a well-chosen schedule, $\bar{\alpha}_T \approx 0$, so $x_T$ is essentially pure noise.
The Reverse Process
To generate samples, we need to reverse this process. We model:
$$p_\theta(x_{t-1} \mid x_t) = \mathcal{N}\!\left(x_{t-1};\ \mu_\theta(x_t, t),\ \sigma_t^2 I\right)$$
The variance $\sigma_t^2$ can be fixed (to $\beta_t$ or $\tilde{\beta}_t = \frac{1-\bar{\alpha}_{t-1}}{1-\bar{\alpha}_t} \beta_t$). The mean $\mu_\theta$ is what we learn.
Training Objective
The variational bound gives us an objective that decomposes across timesteps:
$$L_{t-1} = \mathbb{E}_q\!\left[\frac{1}{2\sigma_t^2}\left\|\tilde{\mu}_t(x_t, x_0) - \mu_\theta(x_t, t)\right\|^2\right] + C$$
where $\tilde{\mu}$ is the true posterior mean, computable in closed form using Bayes' rule.
Predicting the Noise
Instead of predicting $\mu_\theta$ directly, Ho et al. showed we can predict the noise $\epsilon$. The relationship is:
$$\mu_\theta(x_t, t) = \frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{1-\alpha_t}{\sqrt{1-\bar{\alpha}_t}}\,\epsilon_\theta(x_t, t)\right)$$
So we train $\epsilon_\theta(x_t, t)$ to predict $\epsilon$ with the simplified loss:
$$L_{\text{simple}} = \mathbb{E}_{t,\, x_0,\, \epsilon}\left[\left\|\epsilon - \epsilon_\theta(x_t, t)\right\|^2\right]$$
where $x_t = \sqrt{\bar{\alpha}_t} x_0 + \sqrt{1-\bar{\alpha}_t} \epsilon$. This is remarkably simple: sample a timestep, add noise, predict the noise.
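As a rough illustration, here is a minimal PyTorch sketch of this training step; the noise-prediction network `eps_model(x_t, t)` and the precomputed `alpha_bar` schedule are assumed, and the names are hypothetical:

```python
import torch
import torch.nn.functional as F

def ddpm_loss(eps_model, x0: torch.Tensor, alpha_bar: torch.Tensor) -> torch.Tensor:
    """Simplified DDPM loss: sample a timestep, noise x0 to x_t, predict the noise.

    eps_model: callable (x_t, t) -> predicted noise, same shape as x_t
    x0:        batch of clean data, shape (B, ...)
    alpha_bar: cumulative products of alpha_t, shape (T,)
    """
    B, T = x0.shape[0], alpha_bar.shape[0]
    t = torch.randint(0, T, (B,), device=x0.device)            # random timestep per sample
    a_bar = alpha_bar[t].view(B, *([1] * (x0.dim() - 1)))      # broadcast over data dims
    eps = torch.randn_like(x0)                                 # the noise we will predict
    x_t = a_bar.sqrt() * x0 + (1 - a_bar).sqrt() * eps         # closed-form forward jump
    return F.mse_loss(eps_model(x_t, t), eps)
```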
Sampling
To generate a sample:
- Start with $x_T \sim \mathcal{N}(0, I)$
- For $t = T, T-1, \ldots, 1$:
- Predict $\hat{\epsilon} = \epsilon_\theta(x_t, t)$
- Compute $x_{t-1} = \frac{1}{\sqrt{\alpha_t}} \left( x_t - \frac{1-\alpha_t}{\sqrt{1-\bar{\alpha}_t}} \hat{\epsilon} \right) + \sigma_t z$ where $z \sim \mathcal{N}(0, I)$
- Return $x_0$
This requires $T$ network evaluations (typically $T = 1000$), making sampling slow.
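A minimal sketch of this ancestral sampling loop, assuming the same hypothetical `eps_model` interface and fixing $\sigma_t^2 = \beta_t$:

```python
import torch

@torch.no_grad()
def ddpm_sample(eps_model, shape, betas: torch.Tensor) -> torch.Tensor:
    """Ancestral sampling: start from pure noise and denoise for T steps."""
    alphas = 1.0 - betas
    alpha_bar = torch.cumprod(alphas, dim=0)
    x = torch.randn(shape)                                     # x_T ~ N(0, I)
    for i in reversed(range(betas.shape[0])):                  # t = T, ..., 1
        t = torch.full((shape[0],), i, dtype=torch.long)
        eps_hat = eps_model(x, t)
        coef = (1 - alphas[i]) / (1 - alpha_bar[i]).sqrt()
        mean = (x - coef * eps_hat) / alphas[i].sqrt()
        noise = torch.randn_like(x) if i > 0 else torch.zeros_like(x)
        x = mean + betas[i].sqrt() * noise                     # no noise at the final step
    return x
```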
Score Matching: A Parallel Story
The Score-Based Perspective
Independently, Song & Ermon (2019) developed score-based generative modeling. The key insight: if we know the score function $\nabla_x \log p(x)$, we can generate samples using Langevin dynamics:
$$x_{k+1} = x_k + \frac{\epsilon}{2}\,\nabla_x \log p(x_k) + \sqrt{\epsilon}\, z_k$$
where $z_k \sim \mathcal{N}(0, I)$. As $\epsilon \to 0$ and iterations $\to \infty$, this converges to sampling from $p(x)$.
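A minimal sketch of this sampler, assuming a hypothetical `score_fn` that returns (an estimate of) $\nabla_x \log p(x)$:

```python
import torch

@torch.no_grad()
def langevin_sample(score_fn, x: torch.Tensor, step: float = 1e-4, n_steps: int = 1000) -> torch.Tensor:
    """Unadjusted Langevin dynamics: x <- x + (step/2) * score(x) + sqrt(step) * z."""
    for _ in range(n_steps):
        z = torch.randn_like(x)
        x = x + 0.5 * step * score_fn(x) + (step ** 0.5) * z
    return x
```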
The Challenge: Score Estimation
Directly estimating $\nabla_x \log p_{\text{data}}(x)$ is hard in low-density regions (where there's little data). Song & Ermon's solution: noise-conditioned score matching.
Perturb data with different noise levels $\sigma_1 > \sigma_2 > \cdots > \sigma_L$:
$$q_\sigma(x) = \int p_{\text{data}}(x_0)\, \mathcal{N}\!\left(x;\ x_0,\ \sigma^2 I\right) dx_0$$
Train $s_\theta(x, \sigma)$ to match these scores:
$$\mathcal{L}(\theta) = \frac{1}{L}\sum_{i=1}^{L} \lambda(\sigma_i)\, \mathbb{E}_{x_0,\ x \sim q_{\sigma_i}(x \mid x_0)}\left[\left\| s_\theta(x, \sigma_i) - \nabla_x \log q_{\sigma_i}(x \mid x_0) \right\|^2\right]$$
For Gaussian perturbations, $\nabla_x \log q_\sigma(x | x_0) = -\frac{x - x_0}{\sigma^2}$. So we're learning to predict the direction back to clean data.
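A minimal sketch of this denoising score matching objective, assuming a hypothetical `score_model(x, sigma)` network and the common $\lambda(\sigma) = \sigma^2$ weighting:

```python
import torch

def dsm_loss(score_model, x0: torch.Tensor, sigmas: torch.Tensor) -> torch.Tensor:
    """Noise-conditioned denoising score matching over a set of noise levels."""
    B = x0.shape[0]
    sigma_b = sigmas[torch.randint(0, sigmas.shape[0], (B,), device=x0.device)]  # (B,)
    sigma = sigma_b.view(B, *([1] * (x0.dim() - 1)))
    x = x0 + sigma * torch.randn_like(x0)                      # perturb the data
    target = -(x - x0) / sigma**2                              # exact score of q_sigma(x | x0)
    sq_err = ((score_model(x, sigma) - target) ** 2).flatten(1).sum(dim=1)
    return (sigma_b**2 * sq_err).mean()                        # lambda(sigma) = sigma^2 weighting
```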
Connection to DDPM
This is equivalent to DDPM! The DDPM noise predictor $\epsilon_\theta$ relates to the score as:
$$s_\theta(x_t, t) = -\frac{\epsilon_\theta(x_t, t)}{\sqrt{1-\bar{\alpha}_t}}$$
Both methods learn the same quantity, just parameterized differently. DDPM predicts noise; score matching predicts gradients. They're two sides of the same coin.
DDIM: Deterministic Sampling
The Limitation of DDPM
DDPM sampling is a Markov chain with stochastic transitions. This requires many steps ($T \sim 1000$) to produce good samples. Can we go faster?
Non-Markovian Forward Processes
Song et al. (2021) showed that DDPM's forward marginals $q(x_t | x_0)$ don't uniquely determine the forward process $q(x_t | x_{t-1})$. We can use non-Markovian processes that give the same marginals!
DDIM introduces a deterministic process:
$$x_{t-1} = \sqrt{\bar{\alpha}_{t-1}}\left(\frac{x_t - \sqrt{1-\bar{\alpha}_t}\,\epsilon_\theta(x_t, t)}{\sqrt{\bar{\alpha}_t}}\right) + \sqrt{1-\bar{\alpha}_{t-1}}\,\epsilon_\theta(x_t, t)$$
Notice there's no noise term! This is a deterministic mapping from $x_t$ to $x_{t-1}$.
Accelerated Sampling
Because DDIM is deterministic, we can skip timesteps. Instead of running all $t = 1000, 999, \ldots, 1$, we can use a short subsequence of timesteps (e.g., 50 evenly spaced steps), cutting the number of network evaluations by $20\times$ with minimal quality loss.
The key: DDIM defines implicit probabilistic models (hence the name) where the reverse process is a learned ODE, not an SDE.
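A minimal sketch of deterministic DDIM sampling over a shortened timestep subsequence, with the same hypothetical `eps_model` interface as before:

```python
import torch

@torch.no_grad()
def ddim_sample(eps_model, shape, alpha_bar: torch.Tensor, n_steps: int = 50) -> torch.Tensor:
    """DDIM sampling with eta = 0 (fully deterministic) on a timestep subsequence."""
    T = alpha_bar.shape[0]
    steps = torch.linspace(T - 1, 0, n_steps + 1).long()       # e.g. 50 steps instead of T
    x = torch.randn(shape)
    for t_cur, t_next in zip(steps[:-1], steps[1:]):
        t = torch.full((shape[0],), int(t_cur), dtype=torch.long)
        eps_hat = eps_model(x, t)
        a_cur, a_next = alpha_bar[t_cur], alpha_bar[t_next]
        x0_hat = (x - (1 - a_cur).sqrt() * eps_hat) / a_cur.sqrt()    # predicted clean data
        x = a_next.sqrt() * x0_hat + (1 - a_next).sqrt() * eps_hat    # deterministic update
    return x
```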
SDEs: The Continuous-Time View
From Discrete to Continuous
Song et al. (2021) took the next conceptual leap: view diffusion as a continuous-time process. As $T \to \infty$ and step size $\to 0$, the discrete process becomes a stochastic differential equation (SDE):
$$dx = f(x, t)\, dt + g(t)\, dw$$
where $w$ is a Wiener process (Brownian motion), $f$ is the drift, and $g$ is the diffusion coefficient.
The Forward SDE
The DDPM forward process corresponds to the Variance Preserving (VP) SDE:
$$dx = -\tfrac{1}{2}\beta(t)\, x\, dt + \sqrt{\beta(t)}\, dw$$
This has a nice solution: for any time $t$, $x_t = \alpha_t x_0 + \sigma_t \epsilon$ where $\alpha_t$ and $\sigma_t$ satisfy ODEs derived from the SDE coefficients.
The Reverse SDE
Anderson (1982) showed that every forward SDE has a reverse SDE:
$$dx = \left[f(x, t) - g(t)^2\, \nabla_x \log p_t(x)\right] dt + g(t)\, d\bar{w}$$
where $\bar{w}$ is a reverse-time Wiener process. The only unknown is the score $\nabla_x \log p_t(x)$, which we approximate with our neural network!
This is profound: the reverse process has a closed form in terms of the score. Learn the score, get generation for free.
The Probability Flow ODE
Even more remarkably, there's a deterministic ODE with the same marginal distributions as the SDE:
$$\frac{dx}{dt} = f(x, t) - \tfrac{1}{2} g(t)^2\, \nabla_x \log p_t(x)$$
This ODE has the same starting distribution $p_0(x)$ and ending distribution $p_T(x)$ as the SDE, but follows deterministic trajectories. This explains DDIM: it's approximately solving this probability flow ODE!
Benefits of the Continuous View
- Unified framework: DDPM, DDIM, and score matching are all special cases
- Flexible sampling: Use any ODE/SDE solver with adaptive step sizes
- Exact likelihood: The ODE formulation allows tractable likelihood via the change-of-variables formula
- Manipulations: Can interpolate, encode, edit by manipulating ODE trajectories
Score-Based SDE Framework
General Noise Schedules
The SDE framework allows exploring different noise schedules beyond DDPM's discrete variance schedule. Song et al. introduced three main families:
Variance Preserving (VP): The marginal variance of $x_t$ stays roughly constant - the signal is scaled down as noise is added (DDPM's continuous limit)
Variance Exploding (VE): The signal is not scaled down; the noise variance grows very large (the continuous limit of NCSN)
sub-VP: Like VP, but the noise variance at every time is bounded above by that of the VP SDE
Training with SDEs
The training objective generalizes score matching:
$$\mathcal{L}(\theta) = \mathbb{E}_{t}\!\left[\lambda(t)\, \mathbb{E}_{x_0}\, \mathbb{E}_{x_t \mid x_0}\left\| s_\theta(x_t, t) - \nabla_{x_t} \log p_t(x_t \mid x_0) \right\|^2\right]$$
where $\lambda(t)$ is a weighting function. Different choices of $\lambda$ connect to different objectives (DDPM uses $\lambda(t) = \sigma_t^2$).
Controllable Generation
The SDE framework enables elegant conditional generation. For class conditioning:
$$\nabla_x \log p_t(x \mid y) = \nabla_x \log p_t(x) + \nabla_x \log p_t(y \mid x)$$
We can use a classifier $p(y | x_t)$ to guide the score, giving classifier guidance. Or train a conditional score network $s_\theta(x_t, t, y)$ directly.
Flow Matching: The Modern View
Motivation: Optimal Transport
All previous methods use Gaussian noise schedules. But is this optimal? Flow matching (Lipman et al., 2023) takes a different approach: directly learn the vector field that transports noise to data.
The setup: we want a time-dependent vector field $v_t(x)$ such that the ODE:
$$\frac{dx_t}{dt} = v_t(x_t)$$
transports samples from a simple prior $p_0$ (e.g., $\mathcal{N}(0, I)$) to the data distribution $p_1$.
Conditional Flow Matching
Directly learning $v_t$ is hard. Flow matching uses a clever trick: match conditional vector fields.
For each data point $x_1 \sim p_{\text{data}}$, define a simple conditional path:
$$x_t = t\, x_1 + (1-t)\, x_0, \qquad x_0 \sim \mathcal{N}(0, I)$$
This linearly interpolates from noise to data. The velocity along this path is:
$$\frac{dx_t}{dt} = x_1 - x_0$$
The Flow Matching Objective
Train a neural network $v_\theta(x_t, t)$ to match these conditional velocities:
$$\mathcal{L}_{\text{FM}}(\theta) = \mathbb{E}_{t,\, x_0,\, x_1}\left[\left\| v_\theta(x_t, t) - (x_1 - x_0) \right\|^2\right]$$
where $x_t = t x_1 + (1-t) x_0$. This is beautifully simple: the target is just the difference $x_1 - x_0$!
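A minimal sketch of this objective, assuming a hypothetical velocity network `v_model(x_t, t)`:

```python
import torch
import torch.nn.functional as F

def flow_matching_loss(v_model, x1: torch.Tensor) -> torch.Tensor:
    """Conditional flow matching loss for the linear (straight-line) path."""
    B = x1.shape[0]
    x0 = torch.randn_like(x1)                                  # noise endpoint
    t = torch.rand(B, device=x1.device).view(B, *([1] * (x1.dim() - 1)))
    x_t = t * x1 + (1 - t) * x0                                # linear interpolation
    return F.mse_loss(v_model(x_t, t.flatten()), x1 - x0)      # target is just x1 - x0
```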
Why Does This Work?
The key theorem: minimizing the conditional flow matching loss is equivalent (up to a constant independent of $\theta$) to matching the marginal vector field. The learned $v_\theta$ therefore transports $p_0$ to the correct marginal distributions, ending at the data distribution at $t = 1$.
The proof uses the fact that the marginal vector field is the expectation of conditional vector fields:
$$v_t(x) = \int v_t(x \mid x_1)\, \frac{p_t(x \mid x_1)\, p_{\text{data}}(x_1)}{p_t(x)}\, dx_1$$
which, for the linear path, reduces to $v_t(x) = \mathbb{E}\left[x_1 - x_0 \mid x_t = x\right]$.
Rectified Flows
Liu et al. (2023) proposed rectified flows: iteratively train flow models to make trajectories straighter. Starting with a trained model $v_\theta^{(0)}$:
- Generate coupled pairs $(x_0^{(k)}, x_1^{(k)})$ by sampling noise $x_0^{(k)}$ and integrating the current ODE to obtain the corresponding sample $x_1^{(k)}$
- Train a new model $v_\theta^{(k+1)}$ to match straight-line paths between these pairs
- Repeat
After a few iterations, trajectories become nearly straight, allowing one-step or few-step generation!
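A rough sketch of how the coupled pairs for one reflow round could be generated, assuming the hypothetical `v_model` from the previous sketch and simple Euler integration; the next model is then trained with the flow matching loss on these $(x_0, x_1)$ pairs instead of independent noise-data pairs:

```python
import torch

@torch.no_grad()
def generate_reflow_pairs(v_model, n: int, data_shape, n_steps: int = 100):
    """Produce (noise, sample) pairs by integrating the current flow ODE from t=0 to t=1."""
    x0 = torch.randn(n, *data_shape)                           # starting noise
    x = x0.clone()
    dt = 1.0 / n_steps
    for i in range(n_steps):                                   # Euler steps along the ODE
        t = torch.full((n,), i * dt)
        x = x + dt * v_model(x, t)
    return x0, x                                               # coupled (x0, x1) training pair
```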
Connection to Diffusion Models
Flow matching is intimately related to the probability flow ODE. In fact, diffusion models can be viewed as flow matching with a specific Gaussian conditional path:
$$p_t(x \mid x_1) = \mathcal{N}\!\left(x;\ \alpha_t\, x_1,\ \sigma_t^2 I\right)$$
where $\alpha_t$ and $\sigma_t$ follow the diffusion noise schedule (with time oriented so that $t = 1$ corresponds to clean data).
The flow matching framework generalizes this, allowing non-Gaussian paths and more flexibility.
Optimal Transport Flows
Recent work connects flow matching to optimal transport (OT). The idea: find the flow that minimizes transport cost between $p_0$ and $p_1$.
The Monge problem seeks a map $T: x_0 \mapsto x_1$ minimizing:
$$\min_{T:\ T_{\#} p_0 = p_1}\ \mathbb{E}_{x_0 \sim p_0}\left[c\big(x_0, T(x_0)\big)\right]$$
For the quadratic cost $c(x_0, x_1) = \|x_1 - x_0\|^2$, the optimal transport paths are straight lines! This motivates the straight-line interpolation in flow matching (though the independent noise-data pairing used during training is not by itself an optimal coupling).
Practical Considerations
Network Architectures
All methods use time-conditioned networks. Common choices:
- U-Net: The standard for images, with skip connections and attention layers
- Transformers: For high-resolution images and videos (DiT, U-ViT)
- Time conditioning: Via sinusoidal embeddings, added to intermediate features
Parameterization Choices
Different parameterizations of the learned function:
- Noise prediction $\epsilon_\theta(x_t, t)$: DDPM's original choice, works well empirically
- Score prediction $s_\theta(x_t, t)$: Natural for score-based methods
- Data prediction $\hat{x}_\theta(x_t, t)$: Directly predict clean data $x_0$
- Velocity prediction $v_\theta(x_t, t)$: Flow matching's choice
These are related by simple, schedule-dependent identities. Writing $x_t = \alpha_t x_0 + \sigma_t \epsilon$:
$$s_\theta(x_t, t) = -\frac{\epsilon_\theta(x_t, t)}{\sigma_t}, \qquad \hat{x}_\theta(x_t, t) = \frac{x_t - \sigma_t\, \epsilon_\theta(x_t, t)}{\alpha_t}$$
with an analogous formula converting a data (or noise) prediction into a velocity for a given path.
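A small sketch of these conversions, assuming the noising convention $x_t = \alpha_t x_0 + \sigma_t \epsilon$ (data $x_0$, noise $\epsilon$); the linear-path velocity helper is specific to the flow-matching interpolation above and the function names are hypothetical:

```python
import torch

def eps_to_score(eps: torch.Tensor, sigma_t: torch.Tensor) -> torch.Tensor:
    """Score from a noise prediction: s = -eps / sigma_t."""
    return -eps / sigma_t

def eps_to_data(x_t: torch.Tensor, eps: torch.Tensor,
                alpha_t: torch.Tensor, sigma_t: torch.Tensor) -> torch.Tensor:
    """Clean-data prediction from a noise prediction: x0 = (x_t - sigma_t * eps) / alpha_t."""
    return (x_t - sigma_t * eps) / alpha_t

def data_to_velocity(x_t: torch.Tensor, x1_hat: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
    """Velocity for the linear path x_t = t*x1 + (1-t)*x0: v = (x1_hat - x_t) / (1 - t)."""
    return (x1_hat - x_t) / (1.0 - t)
```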
Sampling Algorithms
For ODEs (DDIM, probability flow, flow matching):
- Euler: $x_{t-\Delta t} = x_t - \Delta t \cdot v_\theta(x_t, t)$
- Heun: Second-order method, more accurate (see the sketch after this list)
- DPM-Solver: Specialized ODE solver for diffusion, very fast
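A minimal sketch of the Heun update for the flow ODE $dx/dt = v_\theta(x, t)$ integrated from noise at $t = 0$ to data at $t = 1$, again with the hypothetical `v_model` interface:

```python
import torch

@torch.no_grad()
def heun_sample(v_model, x0: torch.Tensor, n_steps: int = 20) -> torch.Tensor:
    """Heun's method (2nd order): Euler predictor followed by a trapezoidal corrector."""
    x = x0.clone()
    dt = 1.0 / n_steps
    for i in range(n_steps):
        t = torch.full((x.shape[0],), i * dt)
        v1 = v_model(x, t)                                     # slope at the current point
        x_pred = x + dt * v1                                   # Euler predictor
        v2 = v_model(x_pred, t + dt)                           # slope at the predicted point
        x = x + 0.5 * dt * (v1 + v2)                           # average the two slopes
    return x
```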
For SDEs:
- Euler-Maruyama: Basic SDE solver
- Predictor-Corrector: Alternate between ODE step and Langevin correction
Guidance Techniques
Classifier-free guidance (Ho & Salimans, 2021) is now standard for conditional generation:
$$\tilde{\epsilon}_\theta(x_t, t, y) = \epsilon_\theta(x_t, t, \varnothing) + w\left(\epsilon_\theta(x_t, t, y) - \epsilon_\theta(x_t, t, \varnothing)\right)$$
where $w$ is the guidance weight ($w = 1$ recovers the plain conditional model). Taking $w > 1$ extrapolates away from the unconditional prediction, improving sample quality at the cost of diversity.
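A minimal sketch of the guided prediction, assuming a hypothetical conditional network `eps_model(x_t, t, y)` where `y=None` denotes the unconditional (dropped-label) branch:

```python
import torch

def cfg_eps(eps_model, x_t: torch.Tensor, t: torch.Tensor, y, w: float = 5.0) -> torch.Tensor:
    """Classifier-free guidance: extrapolate away from the unconditional prediction.

    w = 0 gives the unconditional model, w = 1 the plain conditional model,
    and w > 1 the usual guidance regime.
    """
    eps_uncond = eps_model(x_t, t, None)                       # condition dropped
    eps_cond = eps_model(x_t, t, y)
    return eps_uncond + w * (eps_cond - eps_uncond)
```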
The Unified Picture
What Are We Really Learning?
All diffusion models learn the same fundamental object: the vector field that transports noise to data. Different formulations are different parameterizations of this field:
- DDPM: Discrete-time noise prediction
- Score matching: Gradient of log-density
- Probability flow ODE: Deterministic transport
- Flow matching: Direct velocity field learning
They converge to the same solution, differing mainly in training efficiency and sampling speed.
Key Insights
Gaussian perturbations are not special: Flow matching shows we can use arbitrary conditional paths. Gaussians are convenient but not necessary.
Stochasticity is optional: Both stochastic (SDE) and deterministic (ODE) processes work. ODEs are often preferred for speed and controllability.
Straight is fast: Straighter trajectories (rectified flows, optimal transport) enable few-step generation without quality loss.
The score is central: Whether we call it noise prediction, score estimation, or velocity field learning, we're learning how to move from noise to data.
Comparison Table
| Method | Framework | Learns | Sampling | Speed |
|---|---|---|---|---|
| DDPM | Discrete VAE | Noise $\epsilon_\theta$ | Stochastic | Slow (1000 steps) |
| DDIM | Implicit model | Noise $\epsilon_\theta$ | Deterministic | Medium (50 steps) |
| Score SDE | Continuous SDE | Score $s_\theta$ | Flexible | Medium |
| Probability Flow | ODE | Score $s_\theta$ | Deterministic | Medium |
| Flow Matching | ODE / OT | Velocity $v_\theta$ | Deterministic | Fast (10 steps) |
| Rectified Flow | OT | Velocity $v_\theta$ | Deterministic | Very fast (1-5 steps) |
Current Frontiers
Distillation
Progressive distillation (Salimans & Ho, 2022) and consistency models (Song et al., 2023) aim to distill multi-step diffusion into few-step or even one-step generators while preserving quality.
Discrete Diffusion
Extending diffusion to discrete spaces (text, graphs) is an active area. Approaches include absorbing states, corruption processes, and score matching on discrete structures.
Unified Models
Recent work unifies diffusion with other generative models: masked generative models, autoregressive models, and energy-based models all connect to the diffusion framework.
Applications Beyond Images
Diffusion models now power video generation (Sora), 3D synthesis (DreamFusion), molecular design, protein folding, and more. The framework is remarkably general.
Conclusion
The journey from DDPM to flow matching reveals a beautiful story of conceptual unification. What began as discrete denoising became continuous dynamics, then optimal transport. Each perspective adds insight:
- DDPM gave us the practical algorithm and showed it works
- Score matching revealed the geometric structure (moving along score directions)
- SDEs provided the continuous framework and mathematical rigor
- Flow matching connected to optimal transport and enabled fast sampling
Today's best models combine insights from all these perspectives. They use the training simplicity of flow matching, the theoretical understanding from SDEs, and sampling algorithms refined through years of diffusion model research.
The field continues to evolve rapidly. But these fundamentals - learning to transport noise to data along smooth trajectories - will remain central. Understanding this progression equips you to both use current models and contribute to future developments.