5. Numerical integration
Recall the results of the previous section: we have shown that if we sample from the distribution at the largest noise level and follow the drift backwards in time, the resulting trajectory transports the sample to the data distribution. Thus we now have a means of sampling from the data distribution at $t = 0$.
Suppose we have a time-indexed family of distributions given by the probability density functions $p_t$, with $p_0$ the data distribution and $p_T$ a tractable (approximately Gaussian) distribution at the largest time $T$.
The strategy for generating a sample from $p_0$ is:
- Sample from the approximate distribution for $x_T$, for example, a Gaussian.
- Starting from $x_T$, numerically integrate backwards in time to obtain an approximation to the trajectory which would yield a sample from $p_0$.


Figure: a trajectory (orange) in the sample space.

Figure: what integration looks like in a real-life example, in the (much) higher-dimensional space of images. A sample at a large noise level is transported to a clean image: integration involves sampling a point for $x_T$ and following the trajectory down to small noise levels.
Thus conceptually, deterministic sampling is straightforward. However, in practice we must numerically integrate the trajectory, and this introduces error.
Numerical integration algorithms
The simplest way of integrating is Euler's method: pick timesteps $T = t_0 > t_1 > \dots > t_N = 0$ and iterate
$$x_{i+1} = x_i + (t_{i+1} - t_i)\, v(x_i, t_i),$$
where $v$ denotes the drift. The final sample is $x_N$.
Note that time is decreasing as a function of the time index; this can be confusing, but is consistent with how we have indexed time so far, and with the convention in [10]Elucidating the design space of diffusion-based generative models
Karras, Tero and Aittala, Miika and Aila, Timo and Laine, Samuli
Advances in Neural Information Processing Systems, 2022.
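As a concrete illustration, here is a minimal sketch of this procedure (not taken from any particular codebase), assuming a hypothetical drift function `v(x, t)`, for example one derived from a learned denoiser:

```python
import numpy as np

def euler_sample(v, x_T, timesteps):
    """Euler integration of dx/dt = v(x, t) along a decreasing sequence of timesteps.

    v         -- hypothetical drift function v(x, t), e.g. derived from a trained denoiser
    x_T       -- initial sample from the approximate distribution at the largest time
    timesteps -- times t_0 > t_1 > ... > t_N, with t_0 = T and t_N close to 0
    """
    x = x_T
    for i in range(len(timesteps) - 1):
        dt = timesteps[i + 1] - timesteps[i]   # negative: time decreases with the index
        x = x + dt * v(x, timesteps[i])        # a single Euler step
    return x

# Toy usage with a made-up drift that pulls samples toward the origin.
if __name__ == "__main__":
    toy_drift = lambda x, t: x / max(t, 1e-3)
    x_T = 80.0 * np.random.randn(2)
    print(euler_sample(toy_drift, x_T, np.linspace(80.0, 1e-3, 50)))
```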
There are various sources of error in this:
- The distribution of $x_T$ might not be precisely the same as $p_T$ (e.g., if there is a mismatch between $p_T$ and a true Gaussian).
- Numerically integrating the trajectory yields an integration error.
- The function $v$ might not be precisely a drift consistent with the family $p_t$ (e.g., if it arises from an imperfect denoiser).
Analyzing the integration error
For this part, let's set aside the other two sources of error and focus on the error due to numerical integration itself.
Errors in Euler integration accumulate over many timesteps. Each integration step induces a local error: even if we were to start at the correct position $x(t_i)$ on the true trajectory, a single Euler step would land slightly off it.
The single-step error is
$$x(t_{i+1}) - \Big(x(t_i) + (t_{i+1} - t_i)\, v\big(x(t_i), t_i\big)\Big),$$
and is called the local truncation error.

Local truncation error, shown in red.
The word “truncation” signifies that this error is due solely to Euler’s method, and does not include any effects due to, e.g., floating-point approximations.
The local truncation error can be estimated using the Taylor expansion of the true trajectory around $t_i$. Writing $h = t_{i+1} - t_i$,
$$x(t_{i+1}) = x(t_i) + h\,\dot{x}(t_i) + O(h^2) = x(t_i) + h\, v\big(x(t_i), t_i\big) + O(h^2),$$
so the Euler update matches the true trajectory up to a term of order $h^2$. In particular, the local truncation error is quadratic in the timestep. How do these local errors accumulate?
The (signed) error in the numerical solution after many steps is called the global truncation error.

Global truncation error over several steps, shown in red.
Write this error as $e_i = x_i - x(t_i)$, the difference between the Euler iterate and the true trajectory at step $i$. A single step gives a recursion of the form
$$e_{i+1} = e_i + h_i\big(v(x_i, t_i) - v(x(t_i), t_i)\big) + \tau_i,$$
where $h_i = t_{i+1} - t_i$ and $\tau_i = O(h_i^2)$ is the (signed) local truncation error of that step. The first term carries forward (and possibly amplifies) the error already accumulated; the last term adds the new local error.
It is then possible to do a standard inductive summation argument to yield a global error of order the largest timestep, one power worse than the local error, since the number of steps grows inversely with the timestep.
Having good bounds on the global integration error is a hard problem, and finding the optimal sequence of timesteps is itself nontrivial; we return to how to space them below.
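As an informal check of this scaling (not part of the original analysis), the following snippet integrates a toy ODE with a known solution, mimicking the decreasing-time convention, and shows the final error shrinking linearly with the step size:

```python
import numpy as np

def euler_global_error(num_steps):
    """Global error of Euler integration for dx/dt = -x from t = 1 down to t = 0."""
    ts = np.linspace(1.0, 0.0, num_steps + 1)
    x = 1.0                       # x(t = 1) = 1, so the exact answer is x(0) = e
    for i in range(num_steps):
        h = ts[i + 1] - ts[i]     # negative step, as in the samplers above
        x = x + h * (-x)          # Euler update with drift v(x, t) = -x
    return abs(x - np.exp(1.0))

for n in [10, 20, 40, 80]:
    print(n, euler_global_error(n))
# The error roughly halves every time the number of steps doubles,
# consistent with a global error that is linear in the step size.
```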
Higher-order integrators
We can also use higher-order integrators, which achieve smaller error for a given number of function evaluations (NFEs). [10]Elucidating the design space of diffusion-based generative models
Karras, Tero and Aittala, Miika and Aila, Timo and Laine, Samuli
Advances in Neural Information Processing Systems, 2022 propose using an explicit 2nd-order Runge-Kutta method (Heun's method). This replaces the single Euler step with two evaluations of the drift: a provisional Euler step, followed by a correction that averages the drift at the current point and at the provisional point.
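A minimal sketch of Heun's method, in the same style as the Euler sampler above (again with a hypothetical drift `v(x, t)`):

```python
def heun_sample(v, x_T, timesteps):
    """Heun (explicit 2nd-order Runge-Kutta) integration along decreasing timesteps."""
    x = x_T
    for i in range(len(timesteps) - 1):
        t, t_next = timesteps[i], timesteps[i + 1]
        h = t_next - t
        d1 = v(x, t)                    # drift at the current point
        x_pred = x + h * d1             # provisional Euler step
        d2 = v(x_pred, t_next)          # drift at the provisional point
        x = x + h * 0.5 * (d1 + d2)     # trapezoidal correction
    return x
```

In [10], the second drift evaluation is skipped on the very last step, where the noise level reaches zero.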
Here we use Taylor approximations to analyze the local truncation error.

- For simplicity, suppose at first that the drift has no dependence on $x$, i.e., $v(x, t) = f(t)$. In this case, the exact update is $x(t_{i+1}) = x(t_i) + \int_{t_i}^{t_{i+1}} f(s)\,ds$. By using the Taylor expansions of $f$ and of this integral around $t_i$, and substituting these into the 2nd-order update, show that the local truncation error is $O(h^3)$. (Thus the local truncation error is cubic, rather than quadratic, in $h$, compared with the Euler method.)
- In the general case where the drift has a dependence on $x$, use the result of Exercise 3.2 and the Taylor expansion of $v$ to show that the local truncation error is again $O(h^3)$.
Spacing of timesteps? A worked example
Let's suppose we have a fixed compute budget (corresponding to the number of integration steps $N$); how should the timesteps be spaced?
Let's look at a worked toy example. This example is basic, but it has some features in common with real-world setups.
We will start with a highly simplified setup where we work entirely in the boolean hypercube $\{0,1\}^n$.
Let the data distribution be uniform over a random set of points of the hypercube.
More precisely, let $S \subseteq \{0,1\}^n$ be a set of points chosen uniformly at random, and let the data distribution be uniform over $S$.
In this, we make repeated use (sometimes implicitly) of the idea of concentration of measure.
We generated the figures below using one such randomly sampled set. In our case, the dimension $n$ and the number of points $|S|$ are large enough that these concentration effects apply.
For a much broader and informal introduction to concentration of measure, see Sander’s post on typicality [16]Musings on typicality
Dieleman, Sander
https://benanne.github.io/2020/09/01/typicality.html.
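As a quick illustration of concentration on the hypercube (with arbitrary illustrative parameters, not the ones used for the figures), pairwise Hamming distances between random points cluster tightly around $n/2$:

```python
import numpy as np

rng = np.random.default_rng(0)
n, num_points = 1000, 200                       # illustrative dimension and set size
S = rng.integers(0, 2, size=(num_points, n))    # random subset of {0,1}^n

# Pairwise Hamming distances between distinct random points.
i, j = np.triu_indices(num_points, k=1)
dists = (S[i] != S[j]).sum(axis=1)

# Each coordinate differs with probability 1/2, so the distance concentrates
# around n/2 with fluctuations of order sqrt(n).
print(dists.mean(), dists.std())   # roughly 500 and ~16
```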
Let's add noise to a sample $x$ drawn from this distribution by flipping each coordinate independently with probability $p$, producing a noisy observation $y$.
(Thus for example $p = 0$ leaves the sample unchanged, while $p = 1/2$ produces a uniformly random point of the hypercube.)
What's the mutual information between $x$ and $y$? The following exercise works this out for the setup above.
- Show that the number of coordinates flipped by the noise concentrates around $np$. Thus by concentration of measure, $y$ is at Hamming distance approximately $np$ from $x$ with high probability.
- Show that the number of points in $\{0,1\}^n$ that are at distance approximately $np$ from $y$ is (with high probability) $2^{n(h_2(p) + o(1))}$, where $h_2(p) = -p\log_2 p - (1-p)\log_2(1-p)$ is the binary entropy function, and where $o(1) \to 0$ as $n \to \infty$.
- Thus conditioned on $y$, with high probability $x$ is at distance approximately $np$ from $y$. By estimating the number of points in $S$ “close” to $y$, deduce that the entropy of $x$ conditioned on $y$ is approximately $\max\big(0,\ \log_2 |S| - n(1 - h_2(p))\big)$, and in particular the mutual information is approximately $\min\big(\log_2 |S|,\ n(1 - h_2(p))\big)$.
What does the entropy of $x$ conditioned on $y$ look like as a function of the noise level $p$? At small $p$ the entropy is zero: the noisy sample still pins down the original point exactly. As $p$ grows, the entropy rises towards $\log_2 |S|$ and the mutual information falls towards zero.

Figure: mutual information and entropy (scaled), as a function of the Bernoulli noise level $p$.
Note also that if we extend past $p = 1/2$, the mutual information rises again: flipping every coordinate ($p = 1$) is a deterministic, invertible map, so the curve is symmetric about $p = 1/2$.
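To make these curves concrete, here is a small sketch (with made-up values of $n$ and $\log_2|S|$, not the ones used for the figures) that evaluates the approximate mutual information and conditional entropy from the exercise above:

```python
import numpy as np

def binary_entropy(p):
    """Binary entropy h2(p) in bits, with the convention 0 * log(0) = 0."""
    p = np.clip(p, 1e-12, 1 - 1e-12)
    return -p * np.log2(p) - (1 - p) * np.log2(1 - p)

n = 1000            # illustrative dimension of the hypercube
log2_S = 200.0      # illustrative log2 of the number of data points

p = np.linspace(0.0, 0.5, 101)
mutual_info = np.minimum(log2_S, n * (1 - binary_entropy(p)))
cond_entropy = log2_S - mutual_info   # entropy of x given the noisy observation

# At small p all log2|S| bits are recoverable; information only starts to be
# destroyed once n * (1 - h2(p)) drops below log2|S|.
print(mutual_info[:5], cond_entropy[-5:])
```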
In the example so far, we added Bernoulli noise: each coordinate is flipped independently with probability $p$. In a diffusion model we instead add Gaussian noise of some standard deviation $\sigma$ to each coordinate.
It is in fact possible to “convert” the Gaussian noise to an equivalent amount of Bernoulli noise, i.e., find the flip probability $p$ that destroys roughly the same amount of information as Gaussian noise of a given standard deviation.
For this exercise, let $y = x + \sigma z$, where $z$ is a standard Gaussian vector.
From the argument that gave the mutual information for Bernoulli noise, and from Exercise 1.3, one can work out how much information Gaussian noise of standard deviation $\sigma$ destroys. By using an appropriate approximation when the noise is large, one can then read off the flip probability $p$ that destroys the same amount of information.

Figure: how much Bernoulli noise is equivalent to a given amount of Gaussian noise.
The following figure shows how information about the original sample is destroyed as the Gaussian noise level increases.


Figure: mutual information and entropy (scaled), as a function of the Gaussian noise level $\sigma$.
Two things stand out when adding Gaussian noise to a point sampled from a random set of discrete points:
- At low noise levels, no information is destroyed at all. The original signal is perfectly recoverable.
- The amount of noise required to destroy most of the information depends on how sparse the discrete set is. There is no one-size-fits-all solution.
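The first point can be seen directly in simulation. The sketch below (with arbitrary illustrative parameters) adds Gaussian noise to points of a random discrete set and checks how often simple nearest-neighbour decoding recovers the original point:

```python
import numpy as np

rng = np.random.default_rng(0)
n, num_points = 256, 1024                       # illustrative dimension and set size
S = rng.integers(0, 2, size=(num_points, n)).astype(np.float64)

def recovery_rate(sigma, trials=200):
    """Fraction of noisy samples whose nearest point in S is the original point."""
    idx = rng.integers(0, num_points, size=trials)
    noisy = S[idx] + sigma * rng.standard_normal((trials, n))
    # Nearest neighbour under Euclidean distance; the ||noisy||^2 term is
    # constant per row and can be dropped from the argmin.
    scores = -2.0 * noisy @ S.T + (S ** 2).sum(axis=1)
    return (scores.argmin(axis=1) == idx).mean()

for sigma in [0.25, 0.5, 1.0, 2.0, 4.0]:
    print(sigma, recovery_rate(sigma))
# Recovery stays essentially perfect at small noise levels and then collapses
# fairly abruptly as sigma grows; where the transition happens depends on how
# densely the set fills the space.
```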
How does this relate to spacing of timesteps?
There are various heuristics to determine the spacing of timesteps in numerical integration, including:
- The timesteps should correspond to “decoding a constant rate of information” (since decoding is rate-limited by a finite-capacity neural network), for example as done in [17]Continuous diffusion for categorical data
Dieleman, Sander and Sartran, Laurent and Roshannai, Arman and Savinov, Nikolay and Ganin, Yaroslav and Richemond, Pierre H and Doucet, Arnaud and Strudel, Robin and Dyer, Chris and Durkan, Conor and others
arXiv preprint arXiv:2211.15089, 2022.
- The timesteps should be more heavily concentrated in regions of greater trajectory curvature (since curvature is what drives the numerical integration error).
Let us apply the first heuristic to the toy example above. Following the mutual-information curves we derived, and given a fixed budget of $N$ steps, a natural choice is to place the noise levels so that each step accounts for a roughly equal share of the mutual information between the noisy and clean samples.
In practice, the actual best noise schedule to use is an empirical question, as it is affected by so many things.
For the purpose of denoising images (which can be quite a different setup to denoising a random subset of the boolean hypercube, although certain intuitions carry over), [10]Elucidating the design space of diffusion-based generative models
Karras, Tero and Aittala, Miika and Aila, Timo and Laine, Samuli
Advances in Neural Information Processing Systems, 2022 propose using
$$\sigma_i = \Big(\sigma_{\max}^{1/\rho} + \tfrac{i}{N-1}\big(\sigma_{\min}^{1/\rho} - \sigma_{\max}^{1/\rho}\big)\Big)^{\rho}, \qquad i = 0, \dots, N-1,$$
for constants $\sigma_{\min}$, $\sigma_{\max}$ and $\rho$ (with $\rho = 7$ recommended), together with a final $\sigma_N = 0$.
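A sketch of this schedule in code (the default constants below are the ones suggested in [10], but treat them as indicative):

```python
import numpy as np

def karras_sigmas(num_steps, sigma_min=0.002, sigma_max=80.0, rho=7.0):
    """Noise levels spaced according to the rho-schedule of Karras et al. [10].

    Interpolates linearly in sigma**(1/rho) between sigma_max and sigma_min,
    which concentrates steps at low noise levels; a final sigma of 0 is appended.
    """
    i = np.arange(num_steps)
    sigmas = (sigma_max ** (1 / rho)
              + i / (num_steps - 1) * (sigma_min ** (1 / rho) - sigma_max ** (1 / rho))) ** rho
    return np.append(sigmas, 0.0)

print(karras_sigmas(10))   # decreasing from 80.0 to 0.002, then 0
```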
For a much wider discussion of these issues, see [18]Noise schedules considered harmful
Dieleman, Sander
https://sander.ai/2024/06/14/noise-schedules.html.
Numerical issues
It is common practice in machine learning to use reduced floating-point precision (for example 16 rather than 32 bits) to reduce the cost of computation, for example when computing the activations in a forward pass through a neural network. However, it is recommended to use a higher-precision floating-point format (such as float32) for storing and manipulating the actual intermediate integration values: the iterates $x_i$ and the timesteps $t_i$.
The bfloat16 format is a 16-bit floating-point format with only 7 bits of mantissa. This means that for every range of the form $[2^k, 2^{k+1})$, only $2^7 = 128$ distinct values are representable.
What do successive timesteps look like once they are rounded to bfloat16, and how close can two distinct timesteps be before they round to the same value?
How might using reduced floating-point precision affect numerical integration?
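A quick way to see the issue (using PyTorch purely for its bfloat16 support; the specific timestep values are arbitrary):

```python
import torch

# Finely spaced float32 timesteps near t = 1.
ts = torch.linspace(1.0, 0.99, 11)

# Rounded to bfloat16 (7 mantissa bits), many of them collapse onto the same value.
ts_bf16 = ts.to(torch.bfloat16)
print(ts_bf16)

# Consecutive step sizes: in bfloat16 several of them are exactly zero,
# so the corresponding integration steps would make no progress at all.
print(ts[:-1] - ts[1:])            # float32: all steps are ~0.001
print(ts_bf16[:-1] - ts_bf16[1:])  # bfloat16: contains exact zeros
```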