Before reading this post, you might want to think about these questions.

 

"If we intentionally destroy data by adding noise (diffusion), can we really train a model to reverse the process?"

"And in doing so, Is it possible to create high-quality new samples from pure noise? What does this process mean in terms of information theory perspective?"

 

📓 Paper : Jonathan Ho, Ajay Jain, Pieter Abbeel (2020). Denoising Diffusion Probabilistic Models

📚 Additional Reading material (Blog) : Lilian Weng, What are Diffusion Models? 

 


I. SHORT SUMMARY

Denoising Diffusion Probabilistic Models (DDPMs) are a generative modeling framework based on a parameterized Markov chain. The model learns to reverse a fixed forward (diffusion) process, which gradually adds Gaussian noise to data until it becomes pure noise. The reverse chain starts from a standard Gaussian prior and learns to progressively denoise it back into a sample.

This model's core innovation is parameterizing the reverse process to predict the noise at each step, which simplifies the complex variational training objective into a simple regression-style objective.

 


II. MATHEMATICAL BACKGROUND

These are general concepts needed to understand the paper.

II-1. 1st-Order Markov Process

A first-order Markov process (or Markov chain) is a stochastic model describing a sequence of possible events.

The key "Markov property" is that the probability of the next state depends only on the current state, not on the sequence of events that preceded it.

Figure 1. 1st-order Markov Process

This "memoryless" property dramatically simplifies the joint probability of a long chain, allowing it to be factored into a product of transition probabilities.

By joint probability (Chain Rule), This property allows the joint probability of an entire sequence to be factored into a product of conditional probabilities. This is fundamental to defining both the forward and reverse processes in DDPM.
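Concretely, in DDPM both chains factor this way (in the paper's notation):

$$q(x_{1:T} \mid x_0) = \prod_{t=1}^{T} q(x_t \mid x_{t-1}), \qquad p_\theta(x_{0:T}) = p(x_T) \prod_{t=1}^{T} p_\theta(x_{t-1} \mid x_t)$$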

Figure 2. Markov Process in DDPM

II-2. ELBO (Evidence Lower Bound)

In variational inference (VI), the data log-likelihood $\log p_\theta(x)$ is often intractable.

The ELBO (Evidence Lower Bound) provides a computable lower bound.

By introducing an approximate posterior $q(z \mid x)$ and applying Jensen's inequality, we obtain the following bound.

 

$$ \log p_{\theta}(x)
= \log \int p_{\theta}(x, z)\ dz
= \log \int q(z \mid x) \frac{p_{\theta}(x, z)}{q(z \mid x)}\ dz
\ge \int q(z \mid x) \log \frac{p_{\theta}(x, z)}{q(z \mid x)}\ dz  $$ 

 

 

II-3. KL Divergence of Gaussians

 

DDPMs assume Gaussian distributions for both the forward and reverse processes.

 

This means that the terms in the ELBO often reduce to computing the KL divergence between two Gaussians.

For two univariate normal distributions $q = \mathcal{N}(\mu_q, \sigma_q^2)$ and $p = \mathcal{N}(\mu_p, \sigma_p^2)$, the KL divergence has a convenient closed form.

 

$$D_{KL}(q \parallel p) = \left( \log \frac{\sigma_p}{\sigma_q} + \frac{\sigma_q^2 + (\mu_q - \mu_p)^2}{2\sigma_p^2} - \frac{1}{2} \right)$$
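This closed form is straightforward to check numerically. Here is a minimal Python sketch (the function name `kl_gaussians` is my own, not from the paper):

```python
import math

def kl_gaussians(mu_q: float, sigma_q: float, mu_p: float, sigma_p: float) -> float:
    """KL(q || p) for two univariate Gaussians, matching the closed form above."""
    return (math.log(sigma_p / sigma_q)
            + (sigma_q**2 + (mu_q - mu_p)**2) / (2 * sigma_p**2)
            - 0.5)

# KL is zero when the two distributions coincide, and grows as they diverge.
print(kl_gaussians(0.0, 1.0, 0.0, 1.0))  # 0.0
```

Note that the KL divergence is not symmetric: swapping $q$ and $p$ generally gives a different value.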

 


III. DENOISING DIFFUSION PROBABILISTIC MODELS

A DDPM is a latent variable model defined by a forward process and a reverse process.

 

III-1. Forward and Reverse Process

Figure 3. Forward Process and Reverse Process

 

The forward and reverse processes are both Markovian.

 

(1) Forward Process $q$ (Diffusion) : This is a fixed process with no learnable parameters that gradually adds Gaussian noise to an image $x_0$ over $T$ steps according to a variance schedule $\beta_t$.

$$q(x_t|x_{t-1}) := \mathcal{N}(x_t; \sqrt{1-\beta_t}x_{t-1}, \beta_t\mathbf{I})$$
A key property is that we can sample $x_t$ at any arbitrary timestep directly from $x_0$. 

Defining $\alpha_t := 1 - \beta_t$ and  $\bar{\alpha}_t := \prod_{s=1}^t \alpha_s$, we have the closed form of $q$.

$$q(x_t|x_0) = \mathcal{N}(x_t; \sqrt{\bar{\alpha}_t}x_0, (1-\bar{\alpha}_t)\mathbf{I})$$ 

 This makes training highly efficient, as we can sample a random $t$ for each training example.
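The closed form above is easy to try in code. Below is a minimal NumPy sketch using the paper's linear schedule ($\beta_1 = 10^{-4}$ to $\beta_T = 0.02$, $T = 1000$); the function name `q_sample` is my own:

```python
import numpy as np

# Linear variance schedule from the paper: beta_1 = 1e-4 to beta_T = 0.02, T = 1000.
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)  # \bar{alpha}_t = prod_{s<=t} alpha_s

def q_sample(x0, t, rng=np.random.default_rng(0)):
    """Sample x_t ~ q(x_t | x_0) in one shot using the closed form."""
    eps = rng.standard_normal(x0.shape)  # eps ~ N(0, I)
    return np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * eps
```

Since $\bar{\alpha}_T$ is nearly zero under this schedule, $x_T$ is essentially pure Gaussian noise, which is exactly what the reverse process starts from.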

(2) Reverse Process $p_\theta$ (Denoising) : This is a learned Markov chain that aims to reverse the diffusion, starting from pure noise $x_T \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$ and progressively generating a data sample.
 $$p_\theta(x_{t-1}|x_t) := \mathcal{N}(x_{t-1}; \mu_\theta(x_t, t), \mathbf{\Sigma}_\theta(x_t, t))$$

 

III-2. Loss Function (Objective Function)

The model is trained by optimizing the usual variational bound (the negative ELBO) on the negative log-likelihood.

Figure 4. Variational Bound

 

 

The paper shows this bound $L$ can be rewritten as a sum of KL divergences.

 

Figure 5. Objective Function

 

If you want to understand the mathematical derivation, you can derive the loss function by following these equations.

 

 

 

Let's look at the three terms of $L$ (the objective function)!

 

(1) First Term of Objective Function : $L_T$ is constant because, in this paper's implementation, $q$ has no learnable parameters and the tractable distribution $q(x_T \mid x_0)$ is nearly identical to the prior $p(x_T)$. So the first term of $L$ ($L_T$) can be ignored.

 

(2) Second Term of Objective Function : $L_{t-1}$

 - What is $q(x_{t-1} \mid x_t, x_0)$?

 

Figure 6. What is q(x_{t-1} | x_t, x_0)?

 

(Why? Because conditioning on $x_0$ makes the reverse-time transition tractable: with Gaussian forward steps, Bayes' rule gives it in closed form.)
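For reference, applying Bayes' rule to the Gaussian forward transitions yields the closed form stated in the paper (its Eqs. 6-7):

$$q(x_{t-1} \mid x_t, x_0) = \mathcal{N}\!\left(x_{t-1};\ \tilde{\mu}_t(x_t, x_0),\ \tilde{\beta}_t \mathbf{I}\right)$$

$$\tilde{\mu}_t(x_t, x_0) := \frac{\sqrt{\bar{\alpha}_{t-1}}\,\beta_t}{1-\bar{\alpha}_t}\, x_0 + \frac{\sqrt{\alpha_t}\,(1-\bar{\alpha}_{t-1})}{1-\bar{\alpha}_t}\, x_t, \qquad \tilde{\beta}_t := \frac{1-\bar{\alpha}_{t-1}}{1-\bar{\alpha}_t}\,\beta_t$$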

 

 

 

- What is $p_\theta(x_{t-1} \mid x_t)$?

 

By the definition in this paper, $p_\theta(x_{t-1} \mid x_t) = \mathcal{N}(x_{t-1}; \mu_\theta(x_t, t), \sigma_t^2 \mathbf{I})$, where the variance is fixed to untrained time-dependent constants.

- What is the KL divergence of the two distributions?

 

$p_\theta(x_{t-1} \mid x_t)$ and $q(x_{t-1} \mid x_t, x_0)$ are both Gaussian distributions.

From II-3, you can derive this equation.

 

Figure 7. $L_{t-1}$

 

 

Through this mathematical derivation, if we follow Algorithm 2, we can sample $x_0$ from $x_T$.
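Algorithm 2 can be sketched in a few lines of NumPy. Here `eps_model(x, t)` is a hypothetical stand-in for the trained noise-prediction network, and the function name is my own:

```python
import numpy as np

def p_sample_loop(eps_model, shape, betas, rng=np.random.default_rng(0)):
    """Algorithm 2 (sampling): start from x_T ~ N(0, I) and denoise step by step.
    `eps_model(x, t)` stands in for the trained noise-prediction network."""
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)
    x = rng.standard_normal(shape)                        # x_T ~ N(0, I)
    for t in range(len(betas) - 1, -1, -1):
        z = rng.standard_normal(shape) if t > 0 else 0.0  # no noise on the last step
        eps_hat = eps_model(x, t)
        # Posterior mean: mu_theta = (x_t - beta_t / sqrt(1 - abar_t) * eps_hat) / sqrt(alpha_t)
        x = (x - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps_hat) / np.sqrt(alphas[t])
        x = x + np.sqrt(betas[t]) * z                     # fixed variance sigma_t^2 = beta_t
    return x
```

Each step is the mean update derived above plus fixed Gaussian noise, which is why the sampler resembles Langevin dynamics.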

 

 

and we can finally get $L_{t-1} - C$ in a simpler form ($C$ is a constant that does not depend on $\theta$).

Loss Function (Simple Version)

 

(3) Third Term of Objective Function : $L_0$ 

 

$L_0$ is the final reconstruction log-likelihood term. In this paper, image data consists of integers in $\{0, 1, \dots, 255\}$ scaled linearly to $[-1, 1]$, so the last reverse step uses a discrete decoder derived from the Gaussian $\mathcal{N}(x_0; \mu_\theta(x_1, 1), \sigma_1^2 \mathbf{I})$.

 

 

$L_0$

III-3. Simple Version of Loss Function and Training

 

Can't you make it simpler? Of course you can!

 

Simpler Version of $L$
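For completeness, the simplified objective (Eq. 14 in the paper) is:

$$L_{\text{simple}}(\theta) := \mathbb{E}_{t,\, x_0,\, \epsilon}\left[ \left\| \epsilon - \epsilon_\theta\!\left(\sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon,\ t\right) \right\|^2 \right]$$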

 

 

It was a very difficult process to get here. But the resulting $L_{\text{simple}}$ is the objective used to train with Algorithm 1.

 

Algorithm 1. Training

 

So, this simplification effectively ignores the complex weighting coefficient from the true variational bound; the authors found this unweighted objective to yield better sample quality.
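Algorithm 1 itself reduces to a few lines. Below is a framework-free NumPy sketch of one training step; `eps_model` is a hypothetical stand-in for the trained network, and the function name `training_loss` is my own:

```python
import numpy as np

def training_loss(eps_model, x0, alpha_bars, rng=np.random.default_rng(0)):
    """One step of Algorithm 1: pick a random t, noise x0, regress the noise."""
    t = rng.integers(0, len(alpha_bars))                  # t ~ Uniform over timesteps
    eps = rng.standard_normal(x0.shape)                   # target noise eps ~ N(0, I)
    x_t = np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * eps
    return np.mean((eps - eps_model(x_t, t)) ** 2)        # L_simple for one sample
```

In a real implementation this loss would be minimized by gradient descent on the network's parameters; the key point is that each step needs only one random $t$ per example, thanks to the closed-form forward process.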

 

IV. UNDERSTANDING DDPM THROUGH INFORMATION THEORY

DDPM's information-theoretic story starts from the variational bound. (What is DDPM's variational bound? Equation 5: a sum of Gaussian KL terms across timesteps plus a terminal reconstruction term.) From there, you can connect it to progressive, autoregressive-like decoding.

Figure 5. Objective Function

IV-1. Rate-distortion curve

 

 

On the rate-distortion curve (especially the third curve in the paper's figure), the curve is steep at the start.

 

Large distortion reduction per bit!

 

Then it flattens, as marginal bits are spent refining imperceptible texture rather than improving perceptual quality.

 


 

 

🔥 Why does this matter?

- DDPMs can produce high-quality samples but still have lossless codelengths that are not competitive with other likelihood-based models. This is because most bits in the bound go to describing imperceptible details, revealing a lossy-compressor inductive bias.

IV-2. Interpolation

If you take two images $x_0$ and $x_0'$, you can encode them into noisy latents through the forward diffusion, $x_t \sim q(x_t \mid x_0)$ and $x_t' \sim q(x_t \mid x_0')$. You can then linearly blend the resulting latents and decode the blend with the learned reverse process, producing a high-quality interpolated image.
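The encoding-and-blending step can be sketched as follows (function names are my own; the reverse-process decoding of the blended latent is omitted):

```python
import numpy as np

def interpolate_latents(x0_a, x0_b, t, alpha_bars, lam=0.5,
                        rng=np.random.default_rng(0)):
    """Encode two images to the same noise level t, then linearly blend the latents.
    Decoding the blend with the learned reverse process (not shown) gives the
    interpolated image."""
    eps_a = rng.standard_normal(x0_a.shape)
    eps_b = rng.standard_normal(x0_b.shape)
    x_t_a = np.sqrt(alpha_bars[t]) * x0_a + np.sqrt(1.0 - alpha_bars[t]) * eps_a
    x_t_b = np.sqrt(alpha_bars[t]) * x0_b + np.sqrt(1.0 - alpha_bars[t]) * eps_b
    return (1.0 - lam) * x_t_a + lam * x_t_b   # linear blend in latent space
```

Choosing a larger $t$ destroys more of each source image before blending, so the reverse process has more freedom to invent coherent detail in the interpolation.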

 

 


V. MY THOUGHT

DDPM is an amazing method for creating high-quality images from pure noise. What surprised me most is that the model doesn't predict the image directly. 😲 It predicts the noise at each step, which both simplifies the training objective and ties the sampler to a Langevin-like update.

 

I also liked how the paper controls variance across timesteps. The forward diffusion is driven by a variance schedule (you choose this schedule), and in their implementation, they fix $\beta_t$ as constants so the forward process has no learnable parameters. Correspondingly, the reverse variance is fixed for stability.

 

 

 

 

 

 

 
