3. Probability flow
The previous section introduced stochastic processes, defined by a collection of random variables \((X_t)_{t \in I}\) indexed by a set of times \(I \subset \R\). One particularly interesting class of stochastic processes consists of those defined by integrating a probability flow ODE, given by the differential equation
\[\begin{equation} \label{eq:stochastic_ode} \frac{dX_t}{dt} = v(X_t, t) \end{equation}\]for some function \(v(X_t, t)\), together with an initial distribution \(X_{t_0}\) for some time \(t_0\). We call \(v\) the drift or velocity. (We use these two terms somewhat interchangeably. Velocity comes from an analogy with fluid dynamics and other physical systems, and drift comes from the case where this is a term in a stochastic differential equation – we will explore these in a later section.)
There is a useful analogy with fluid dynamics.
 \(v(x, t)\) is the velocity of a compressible fluid at position \(x\) and time \(t\).
 The fluid has a time- and space-varying density \(\rho(x, t)\), which is determined by the density distribution at any time \(t_0\), together with the movement of the particles given by the velocity \(v(x, t)\).
 We could also track the trajectory of the particles: what happens to the particle which is at position \(x_0\) at time \(t_0\)? This is given by the “random variable” \(X_t\) conditioned on \(X_{t_0} = x_0\). The marginal distributions \(X_t\) then have density function \(\rho\) (or more specifically, \(x \mapsto \rho(x, t)\)).
For now, we only look at deterministic ordinary differential equations \eqref{eq:stochastic_ode}. Although the \((X_t)_{t \in I}\) resulting from integrating \(v\) is a stochastic process, it is fully “deterministic”, in the sense that knowing \(X_{t_0}\) for some \(t_0\) determines \(X_t\) provided the ODE does not “blow up” (see below). We will look at stochastic differential equations in a later section after we have introduced stochastic calculus, and for now we stick with ODEs.

The ODE \(\frac{d X_t}{d t} = X_t / t\) together with the initial condition \(X_1 = U[-1, 1]\) over the range \(t \in [0, 1]\) yields the stochastic process with marginal distributions \(X_t \equiv U[-t, t]\).
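The example above can be checked numerically. The following is a minimal sketch, not part of the original text: the helper names `velocity` and `integrate`, the sample size, and the step count are illustrative assumptions. It draws samples of \(X_1 \sim U[-1, 1]\), integrates the ODE from \(t = 1\) down to \(t = 0.25\) with explicit Euler steps, and compares the result with what \(U[-t, t]\) predicts (support \([-t, t]\), variance \(t^2/3\)).

```python
import random

def velocity(x, t):
    """Velocity field v(x, t) = x / t from the example."""
    return x / t

def integrate(x_start, t_start, t_end, n_steps=200):
    """Explicit Euler integration of dX/dt = v(X, t) from t_start to t_end."""
    dt = (t_end - t_start) / n_steps  # negative when integrating backwards in time
    x, t = x_start, t_start
    for _ in range(n_steps):
        x += dt * velocity(x, t)
        t += dt
    return x

random.seed(0)
t_end = 0.25
# Samples of the initial condition X_1 ~ U[-1, 1].
x1_samples = [random.uniform(-1.0, 1.0) for _ in range(5000)]
# Push every sample along its trajectory to time t_end.
xt_samples = [integrate(x1, 1.0, t_end) for x1 in x1_samples]

# If X_t ~ U[-t, t], the samples lie in [-t, t] and have variance t^2 / 3.
max_abs = max(abs(x) for x in xt_samples)
mean = sum(xt_samples) / len(xt_samples)
variance = sum((x - mean) ** 2 for x in xt_samples) / len(xt_samples)
print(max_abs, variance, t_end ** 2 / 3)
```

For this particular linear field, each Euler step multiplies \(x\) by \(t_{k+1} / t_k\), so the scheme reproduces the exact trajectory \(X_t = t X_1\) up to rounding. For a general \(v\) the step count would need to be chosen with more care.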

In the introduction, the backwards (sampling) process arose from the ODE
\[\begin{equation} \label{eq:dx_dt_in_motivating_example} \frac{d X_t}{dt} = - \frac{D(X_t, t) - X_t}{t} \end{equation}\]together with the “initial” condition \(X_{t_0} \equiv N(0, t_0^2)\) for \(t_0\) sufficiently large.
In this section we will look at the relationship between the velocity and the density, known as the continuity equation. (If you are familiar with the Fokker-Planck equation for SDEs, the continuity equation is the ODE analogue of this.) The continuity equation is a central result that allows us to prove diffusion results.
ODEs
Firstly, what does \(\eqref{eq:stochastic_ode}\) mean? Is it well defined? We can define it more formally as follows.
Let \(R\) be the ambient space in which we are sampling (typically \(R=\R^n\) for some \(n \ge 1\)). Let \(I \subset \R\) be a (time) interval, let \(t_0 \in I\), let \(v : R \times \R \rightarrow R\), and let \(X_{t_0}\) be an \(R\)-valued random variable which provides the “boundary condition”.
Suppose there exists a unique function \(\Phi : I \times R \times I \rightarrow R\) for some time interval \(I \subset \R\) containing \(t_0\) satisfying
\[\begin{align} \label{eq:ode} \frac{d}{dt} \Phi(t; x_0, t_0) &= v(\Phi(t; x_0, t_0), t) \qquad & \text{for all } t \in I, x_0 \in R, \\ \notag \Phi(t_0; x_0, t_0) &= x_0 \qquad & \text{for all } x_0 \in R. \end{align}\]We can then define \((X_t)_{t \in I}\) that satisfies \(\eqref{eq:stochastic_ode}\) via \(X_t = \Phi(t; X_{t_0}, t_0)\). Since \(X_{t_0}\) is a random variable, \(X_t\) is also a random variable, fully determined by \(X_{t_0}\).
Is \(\Phi\) guaranteed to exist given \(v\), and if it does exist, is it unique? If not, \(X_t\) is not well-defined.
The following fundamental theorem in calculus guarantees that such a solution exists and is unique in a neighborhood, provided \(v\) is sufficiently well behaved; for a proof see e.g. [14] (Terence Tao, Nonlinear dispersive equations: local and global analysis, American Mathematical Soc., 2006).
(Note on notation: we use capital \(\Phi\) to denote the function of three variables that given an arbitrary starting point \(x_0\) and time \(t_0\) maps the current time \(t\) to a position; and use lower case \(\phi\) to denote a specific trajectory, which is a function of just a single variable time \(t\). In general it should be clear from the context whether we are talking about a family of trajectories, or a single trajectory.)
Let \(R = \R^d\), let \(I \subset \R\) be an open interval, and let \(v : R \times I \rightarrow R\) be continuous and locally Lipschitz in its first argument \(x\) (continuity alone gives existence but not uniqueness). Given \(t_0 \in I\) and \(x_0 \in R\), there exists an open time interval \(I'\) containing \(t_0\), together with a unique solution \(\phi : I' \rightarrow R\), which obeys
\[\begin{equation} \label{eq:traj_ode_cond} \phi(t_0) = x_0, \qquad \frac{d}{dt} \phi (t) = v(\phi(t), t). \end{equation}\]Note that although Theorem 3.2 guarantees the existence of a trajectory in the open neighbourhood of a point, it does not give any guarantees over the trajectory being defined globally. Indeed, this is impossible in general. A basic example is
\[v(x, t) = x^2.\]This is defined for all \(x, t \in \R\). Starting at \(t_0=0\), \(x_0=1\), the trajectory \(\phi(t) = \frac{1}{1-t}\) is the unique solution to \(\eqref{eq:traj_ode_cond}\) provided \(t < 1\), but blows up at \(t=1\), and so cannot be continued past this point.
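The blow-up can be observed numerically. The sketch below is illustrative only: the integrator `rk4_trajectory`, the step count, and the chosen end times are assumptions, not part of the original text. It integrates \(\frac{d}{dt}\phi = \phi^2\) from \(\phi(0) = 1\) with the classical fourth-order Runge-Kutta scheme and compares the result against the closed form \(\phi(t) = \frac{1}{1-t}\) as \(t\) approaches 1.

```python
def rk4_trajectory(v, x0, t0, t_end, n_steps):
    """Classical fourth-order Runge-Kutta integration of dx/dt = v(x, t)."""
    h = (t_end - t0) / n_steps
    x, t = x0, t0
    for _ in range(n_steps):
        k1 = v(x, t)
        k2 = v(x + 0.5 * h * k1, t + 0.5 * h)
        k3 = v(x + 0.5 * h * k2, t + 0.5 * h)
        k4 = v(x + h * k3, t + h)
        x += h * (k1 + 2.0 * k2 + 2.0 * k3 + k4) / 6.0
        t += h
    return x

def v(x, t):
    """Velocity field v(x, t) = x^2 (independent of t)."""
    return x * x

results = []
for t_end in (0.5, 0.9, 0.99):
    numeric = rk4_trajectory(v, 1.0, 0.0, t_end, n_steps=20_000)
    exact = 1.0 / (1.0 - t_end)  # closed-form trajectory through (t0, x0) = (0, 1)
    results.append((t_end, numeric, exact))
    print(t_end, numeric, exact)
```

The numerical trajectory tracks the closed form while \(t < 1\), and both grow without bound as \(t \to 1\). No choice of step size lets the integration continue past the singularity.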
In general we won’t run into this issue with diffusion, but it is worth bearing in mind that defining a probability flow ODE doesn’t automatically yield a stochastic process for all time.
We will use the formalism of stochastic processes \((X_t)_{t \in I}\), or that of functions or trajectories \(\phi(t)\) and \(\Phi(t; x_0, t_0)\) as most appropriate. We can thus equivalently write \(\Phi(t; x_0, t_0) = [X_t \vert X_{t_0} = x_0 ]\).
Calculate the marginal distributions of \(X_t\) over the time interval \(t \in (0, 1]\), assuming an initial distribution of \(X_1 = N(0, 1)\), where \(X_t\) is described by \(\eqref{eq:stochastic_ode}\), for the following.
 \(v(x, t) = x / t\).
 \(v(x, t) = x / (2t)\).
In the following exercise and later, we use the notation \(\partial_t = \frac{\partial}{\partial t}\) to indicate a partial derivative. Note that this indicates taking the derivative with respect to the time parameter of the function it is applied to, rather than the total derivative with respect to time.
For example, if we have a function \(f(x, t)\), where \(x=x(t)\) depends on \(t\), then \(\partial_t f\) indicates the derivative of \(f\) with respect to the second parameter, and ignores the dependence of \(x\) on \(t\).
For functions of one variable, partial and total derivatives are the same; thus for example \(\partial_t \phi(t) = \frac{d}{dt} \phi(t)\), and we will tend to use \(\partial_t\) simply for compactness of notation.
We also write \(\partial_t^2 = \frac{\partial^2}{\partial t^2}\) etc.
Let \(\phi : I \rightarrow R\) be a twice differentiable function of time, and let \(v\) satisfy \(v(\phi(t), t) = \partial_t \phi(t)\) (recall that \(\partial_t \phi = \frac{d \phi}{dt}\) because \(\phi\) is a function of a single variable \(t\)). Show that
\[\begin{equation} \label{eq:traj_curvature} \partial_t^2 \phi(t) = \left( \partial_t + v(\phi(t), t) \cdot \nabla \right) v(\phi(t), t) \end{equation}\]where \(\nabla = (\frac{\partial}{\partial x_1}, \ldots, \frac{\partial}{\partial x_n})\) is the vector of partial derivatives, i.e., \((v \cdot \nabla) v := \sum_i v_i \frac{\partial}{\partial x_i} v\).
In fluid dynamics, the operator \(\partial_t + v(x, t) \cdot \nabla\) is known as the material derivative or substantial derivative. It tracks the rate of change of a physical parameter (like velocity) along a particle trajectory. This connection with fluid dynamics is more than coincidental.
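As a quick sanity check of \(\eqref{eq:traj_curvature}\) (a worked instance under a specific choice of field, not a proof), take the one-dimensional field \(v(x, t) = x^2\), whose trajectory through \(x_0 = 1\) at \(t_0 = 0\) is \(\phi(t) = \frac{1}{1-t}\). Both sides can be computed directly:

```latex
\begin{align*}
\partial_t^2 \phi(t) &= \frac{d^2}{dt^2} \frac{1}{1-t} = \frac{2}{(1-t)^3}, \\
\left( \partial_t + v \, \partial_x \right) v(x, t) &= 0 + x^2 \cdot 2x = 2x^3, \\
2x^3 \big\vert_{x = \phi(t)} &= \frac{2}{(1-t)^3}.
\end{align*}
```

The two sides agree: the acceleration of the particle equals the material derivative of the velocity evaluated along its trajectory.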
Probability density
It is often useful to work with the simplified structure given just by the marginal distributions of \((X_t)_{t \in I}\), ignoring the dependence structure between them for different times. Typically we write \(\rho(\cdot, t)\) for the probability density function of \(X_t\).
In what follows, we will give theorems and results on the density function \(\rho\), for example, the derivative of \(\rho\) with respect to \(t\). However, there is a slight technical problem that we should mention: \(\rho\), as a function \(\rho(\cdot, t) : \R^n \rightarrow \R\), is not uniquely defined given \(X_t\).
Recall that a probability density function \(\pi\) for a random variable \(X\) is an integrable function that satisfies \(P(X \in A) = \int_A \pi(x) dx\) for every measurable set \(A\). But this means that in general in continuous sample spaces there are multiple probability density functions for the same distribution, since two functions can differ on a set of measure zero without changing the resulting distribution. For example, the following two probability density functions both correspond to the uniform distribution on \([0, 1]\):
\[\pi_1(x) = \begin{cases} 1 \qquad &\text{if } 0 \le x \le 1, \\ 0 \qquad &\text{otherwise}. \end{cases}\] \[\pi_2(x) = \begin{cases} 1 \qquad &\text{if } 0 \le x \le 1 \text{ and } x \ne 0.3, \\ 0 \qquad &\text{otherwise}. \end{cases}\]How then do we prove results about density functions?
Recall from the introduction that a test function is a continuous function (possibly with continuous derivatives) that is zero outside of a finite region, i.e., is compactly supported. For example, we write \(C_c(\R)\) for the set of differentiable functions from \(\R\) to \(\R\) that are compactly supported.
For \(\pi_1\) and \(\pi_2\) above, although \(\pi_1(0.3) \ne \pi_2(0.3)\), it can be shown (exercise) that for all \(F \in C_c(\R)\),
\[\int F(x) \pi_1(x) dx = \int F(x) \pi_2(x) dx.\]Test functions are powerful enough to detect when two density functions correspond to different distributions, but they are not so powerful that they distinguish functions that correspond to the same distribution.
Thus rather than working directly with or proving facts about probability density functions (which are not uniquely defined), we typically instead “probe” them using test functions, and only prove results on density functions up to equivalence under test functions. Informally, when we talk about a probability density function \(\pi\), we will often mean the equivalence class of functions corresponding to the same distribution, and test functions will play a big part in what follows.
Thus although the density function \(\rho(\cdot, t)\) itself of a stochastic process \(X_t\) is not uniquely defined (and thus certainly not “continuous”), we can still define a notion of continuity for the time-indexed family of distributions uniquely determined by \(\rho(x, t)\). Say that the distributions are continuous with respect to time if for every test function \(F : R \rightarrow \R\) (recall that a test function is continuous and compactly supported), the function \(t \mapsto \E F(X_t)\) is continuous.
Recall Exercise 1.7. Is it necessarily the case that if
\[P(t \mapsto X_t \text{ is continuous} ) = 1,\]then for every continuous \(F : R \rightarrow \R\), the function \(t \mapsto F(X_t)\) is continuous?
The continuity equation
What is the relationship between the density function \(\rho\), the velocity function \(v\) giving the derivative of \(X_t\) (or \(\phi\)), and the trajectories \(\phi\) of \(X_t\)?
Assuming the analogy of a compressible fluid, we can attempt to derive the relationship between \(\rho\) and \(v\) from first principles by seeing what happens in an infinitesimal region of fluid.
In two dimensions, imagine a region of fluid delineated by the box \([x_1, x_1 + \delta x_1] \times [x_2, x_2 + \delta x_2]\):
\[\begin{array}{rc} \delta x_2 & \begin{array}{c} \hline \\ \qquad \\ \hline \end{array} \\ (x_1, x_2) & \delta x_1 \end{array}\]From time \(t\) to \(t + \delta t\), what is the change of mass of fluid in this region? This can be calculated in two ways.
Firstly, it is equal to the time derivative of the density of the fluid, multiplied by the volume of the region:
\[\begin{equation} \label{eq:mass_flow_into_region_from_density_derivative} \textcolor{red}{ \delta x_1 \cdot \delta x_2 \cdot \delta t \cdot \partial_t \rho(x_1, x_2, t) } \end{equation}\]But secondly, it is also equal to the amount of fluid that flows into the region, minus the amount that flows out. This is equal to the amount of fluid flowing orthogonal to the region boundary in the inwards direction, which is equal to the area of the boundary multiplied by the mass flow multiplied by the time interval. Along the left boundary, this is equal to
\[\textcolor{green}{\delta x_2 \cdot v(x_1, x_2, t) \cdot \rho(x_1, x_2, t) \cdot \delta t}\]Along the right boundary, we have a similar expression, but with a minus sign since this corresponds to fluid flowing out, and with \(x_1\) shifted to \(x_1 + \delta x_1\):
\[\textcolor{orange}{- \delta x_2 \cdot v(x_1 + \delta x_1, x_2, t) \cdot \rho(x_1 + \delta x_1, x_2, t) \cdot \delta t}.\]Summing these two terms yields
\[\begin{align*} \delta x_2 \cdot \left( \textcolor{green}{v(x_1, x_2, t) \cdot \rho(x_1, x_2, t)} \textcolor{orange}{- v(x_1 + \delta x_1, x_2, t) \cdot \rho(x_1 + \delta x_1, x_2, t)} \right) \cdot \delta t \\ \approx - \delta x_2 \cdot \frac{\partial}{\partial x_1} ( v(x_1, x_2, t) \rho(x_1, x_2, t)) \cdot \delta x_1 \cdot \delta t \end{align*}\]Including similar terms for the bottom and top edges yields the mass flow into the region as
\[\textcolor{blue}{- \delta x_1 \cdot \delta x_2 \cdot \delta t \cdot \nabla \cdot (\rho v)}\]Equating this expression with \(\eqref{eq:mass_flow_into_region_from_density_derivative}\) and cancelling the common \(\delta x_1 \cdot \delta x_2 \cdot \delta t\) terms then yields
\[\textcolor{red}{\partial_t \rho(x_1, x_2, t)} \approx \textcolor{blue}{- \nabla \cdot (\rho v)}.\]This is the continuity equation. We can prove this formally for probability flow.
Let \(I_0 \subset \R\) be a time interval, let \(v : \R^n \times I_0 \rightarrow \R^n\) be continuous, and let \((X_t)_{t \in I_0}\) satisfy
\[\frac{d X_t}{dt} = v(X_t, t).\]Let \(\rho : \R^n \times I_0 \rightarrow \R\) be the marginal density of \((X_t)_{t \in I_0}\). Then
\[\begin{equation} \label{eq:continuity_equation} \partial_t \rho(x, t) = - \nabla \cdot (\rho(x, t) v(x, t)). \end{equation}\]As remarked earlier, the density function \(\rho\) is only defined up to an equivalence class of functions that yield the same distribution, and so equation \(\eqref{eq:continuity_equation}\) is interpreted up to equivalence using test functions.
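The continuity equation \(\partial_t \rho = - \nabla \cdot (\rho v)\) can be checked pointwise on a simple flow. The sketch below uses an example that is not taken from the text and is assumed purely for illustration: the contracting flow \(v(x, t) = -x\) started from \(X_0 \sim N(0, 1)\), for which \(X_t = e^{-t} X_0 \sim N(0, e^{-2t})\). The function names and the finite-difference step are likewise illustrative. It estimates \(\partial_t \rho + \partial_x (\rho v)\) with central differences, which should be close to zero.

```python
import math

def rho(x, t):
    """Density of X_t = exp(-t) * X_0 with X_0 ~ N(0, 1), i.e. N(0, exp(-2t))."""
    sigma = math.exp(-t)
    return math.exp(-0.5 * (x / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))

def v(x, t):
    """Velocity field of the assumed flow dX/dt = -X."""
    return -x

def continuity_residual(x, t, h=1e-4):
    """Central-difference estimate of d/dt rho + d/dx (rho * v) at (x, t)."""
    d_rho_dt = (rho(x, t + h) - rho(x, t - h)) / (2.0 * h)
    flux_right = rho(x + h, t) * v(x + h, t)
    flux_left = rho(x - h, t) * v(x - h, t)
    d_flux_dx = (flux_right - flux_left) / (2.0 * h)
    return d_rho_dt + d_flux_dx

residuals = [
    continuity_residual(x, t)
    for t in (0.1, 0.5, 1.0)
    for x in (-1.5, -0.3, 0.0, 0.7, 2.0)
]
max_residual = max(abs(r) for r in residuals)
print(max_residual)
```

The residual is limited only by the finite-difference error, whereas \(\partial_t \rho\) on its own is of order one at, for example, \(x = 0\).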
We prove the continuity equation in one dimension. The \(n\) dimensional case is left as an exercise.
Let \(F : \R \rightarrow \R\) be a test function, i.e., differentiable with compact support. Consider the map
\[\begin{equation} \label{eq:t_map_in_continuity_proof} t \mapsto \E F(X_t) = \int \rho(x, t) F(x) dx. \end{equation}\]The derivative of this with respect to \(t\) is
\[\int \partial_t \rho (x, t) F(x) dx.\]But also, letting \(\Phi(t; x_0, t_0) = X_t \vert \{ X_{t_0} = x_0 \}\) be the trajectory function of \(X_t\), this satisfies \(\partial_t \Phi(t; x, t_0) \vert_{t_0=t} = v(x, t)\). We can thus write \(\eqref{eq:t_map_in_continuity_proof}\) as
\[\E F(X_t) = \E F(\Phi(t; X_{t_0}, t_0)) = \int \rho(x, t_0) F(\Phi(t; x, t_0)) dx.\]Writing \(F_x\) for the derivative of \(F\), the derivative of this with respect to \(t\) evaluated at \(t_0=t\) is
\[\begin{align*} \partial_t \int \rho(x, t_0) F(\Phi(t; x, t_0)) dx &= \int \rho(x, t_0) \cdot \partial_t \Phi(t; x, t_0) \vert_{t_0=t} \cdot F_x(\Phi(t; x, t_0)) dx \\ &= \int \rho(x, t) v(x, t) F_x(x) dx \\ &= - \int \partial_x \big( \rho(x, t) v(x, t) \big) F(x) dx \qquad\text{(integration by parts)}. \end{align*}\]These two expressions for the derivative are equal, i.e.,
\[\int \partial_t \rho (x, t) F(x) dx = - \int \partial_x \big(\rho(x, t) v(x, t) \big) F(x) dx.\]Since this holds for an arbitrary test function \(F\), the theorem follows.
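The key identity in the proof, \(\frac{d}{dt} \E F(X_t) = \int \rho(x, t) v(x, t) F_x(x) dx\), can also be checked numerically. The following is a rough sketch under assumed choices that do not come from the text: the flow \(v(x, t) = -x\) with \(X_0 \sim N(0, 1)\), a smooth bump function supported on \([-1, 1]\) as the test function \(F\), trapezoid-rule quadrature, and a central difference in \(t\).

```python
import math

def rho(x, t):
    """Density of X_t = exp(-t) * X_0 with X_0 ~ N(0, 1) (assumed example flow)."""
    sigma = math.exp(-t)
    return math.exp(-0.5 * (x / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))

def v(x, t):
    """Velocity field of the assumed flow dX/dt = -X."""
    return -x

def bump(x):
    """Smooth test function F supported on [-1, 1]."""
    return math.exp(-1.0 / (1.0 - x * x)) if abs(x) < 1.0 else 0.0

def bump_prime(x):
    """Derivative F_x of the bump function."""
    if abs(x) >= 1.0:
        return 0.0
    return bump(x) * (-2.0 * x) / (1.0 - x * x) ** 2

def trapezoid(f, a=-1.0, b=1.0, n=4000):
    """Trapezoid rule on [a, b] with n subintervals."""
    h = (b - a) / n
    total = 0.5 * (f(a) + f(b))
    for i in range(1, n):
        total += f(a + i * h)
    return total * h

def expect_F(t):
    """E F(X_t) as the integral of rho(x, t) F(x) over the support of F."""
    return trapezoid(lambda x: rho(x, t) * bump(x))

t, h = 0.5, 1e-4
# Left-hand side: time derivative of E F(X_t) by central differences.
lhs = (expect_F(t + h) - expect_F(t - h)) / (2.0 * h)
# Right-hand side: integral of rho * v * F_x, the form before integrating by parts.
rhs = trapezoid(lambda x: rho(x, t) * v(x, t) * bump_prime(x))
print(lhs, rhs)
```

Both quantities are positive here, which matches intuition: the flow contracts mass towards the origin, where the bump function is largest, so \(\E F(X_t)\) increases with \(t\).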
Generalise the proof of Theorem 3.3 to \(n\) dimensions.
Let’s look at the continuity equation for the uniform distribution \(U[-t, t]\), where \(t\) is time. This has density
\[\rho(x, t) = \begin{cases} 1/(2t) &\text{if } \vert x \vert \le t, \\ 0 &\text{otherwise.} \end{cases}\]Verify that the continuity equation holds inside the boundary \(\vert x \vert < t\) for \(v(x, t) = x / t\).
What happens at the boundary \(x=t\)? Verify that the continuity equation still holds, provided we interpret the derivative of the step function \(1_{x \ge 0}\) as \(\delta(x)\). This makes sense: we proved the continuity equation in a distributional setting, where “taking the derivative of a step function” is a valid operation in the space of generalized functions.
Final remarks
In this section, we have developed the continuity equation, which is a fundamental tool in proving results about the sampling process. This makes sense: we want to analyse how the probability density evolves as a function of time if we follow the probability flow ODE \(\frac{dx}{dt} = v(x,t)\). If we can find a probability flow ODE that matches the forward noising process, then we will be able to integrate this ODE backwards and arrive at samples from the data distribution. This is what we derive in the next section.