Speeding up MCMC with Langevin and Hamiltonian dynamics and stochastic gradient estimates
Published:
In this post I’m going to try and introduce Hamiltonian and Langevin Monte Carlo, and their stochastic gradient counterparts (among a couple of other things). This post will involve a little stochastic differential calculus and some results from A Complete Recipe for Stochastic Gradient MCMC - Ma et al. (2015).
UNDER CONSTRUCTION
Resources
Main Resources
- A Complete Recipe for Stochastic Gradient MCMC - Ma et al. (2015)
- Microcanonical Hamiltonian Monte Carlo
- Fluctuation without dissipation: Microcanonical Langevin Monte Carlo
LMC/Langevin Methods
- Exponential convergence of Langevin distributions and their discrete approximations - Gareth O. Roberts, Richard L. Tweedie
- Brownian dynamics as smart Monte Carlo simulation - P. J. Rossky; J. D. Doll; H. L. Friedman
- Bayesian Learning via Stochastic Gradient Langevin Dynamics - Max Welling, Yee Whye Teh
- Metropolis-adjusted Monte Carlo - Wiki
- Towards a Theory of Non-Log-Concave Sampling: First-Order Stationarity Guarantees for Langevin Monte Carlo - Krishnakumar Balasubramanian, Sinho Chewi, Murat A. Erdogdu, Adil Salim, Matthew Zhang
- Efficient Approximate Posterior Sampling with Annealed Langevin Monte Carlo - Advait Parulekar, Litu Rout, Karthikeyan Shanmugam, and Sanjay Shakkottai
- Langevin Monte Carlo: random coordinate descent and variance reduction - Zhiyan Ding, Qin Li
- Random Reshuffling for Stochastic Gradient Langevin Dynamics - Luke Shaw, Peter A. Whalley
- Analysis of Langevin Monte Carlo via convex optimization - Alain Durmus, Szymon Majewski, and Blazej Miasojedow
- A Divergence Bound For Hybrids of MCMC and Variational Inference and …
- Generative Modeling by Estimating Gradients of the Data Distribution - Yang Song
- Sampling as optimization in the space of measures … - COLT
- Random Coordinate Descent and Langevin Monte Carlo - Qin Li, Simons Institute for the Theory of Computing
- The Sampling Problem Through The Lens of Optimization : Recent Advances and Insights by Aniket Das
For Langevin Dynamics
- A Simplified Overview of Langevin Dynamics - Roy Friedman
- On the Probability Flow ODE of Langevin Dynamics - Mingxuan Yi
- Langevin Equation - Wiki
- Great derivation for the Fokker-Planck equation and a few other results (that I was heavily ‘inspired’ by in my post).
For HMC
- Microcanonical Hamiltonian Monte Carlo
- Stochastic Gradient Hamiltonian Monte Carlo
- A Conceptual Introduction to Hamiltonian Monte Carlo - Betancourt
- Pretty good imo but it’s 60 pages. I found the appendices to be pretty helpful with the NUTS implementation of HMC (specifically A.4)
Misc
- A prequel to the diffusion model - Nosuke Dev Blog
- Santa Jaws
- Improving Diffusion Models as an Alternative To GANs, Part 2 - NVIDIA Technical Blog
- Continuous Algorithms: Sampling and Optimization in High Dimension - Santesh Vempala, Simons Institute for the Theory of Computing
- Non-convex learning via Stochastic Gradient Langevin Dynamics: a nonasymptotic analysis
- Handbook of Stochastic Methods - Gardiner
- Stochastic Differential Equations - An Introduction with Applications - Bernt Øksendal
- An Introduction to Stochastic Differential Equations - Lawrence C. Evans
- This is a particularly great reference for those not comfortable with measure theory interpretations of probability
- It also ended up being what I structured my background sections on SDEs around (skipping a lot of details and rigor)
- Probability Theory in Finance: A Mathematical Guide to the Black-Scholes Formula (Graduate Studies in Mathematics) - Sean Dineen
- A much more informal, conversational take on the above topics, catered more toward finance, but great as a reference for some of the more abstractly presented topics in Evans
- Also, I think it makes more use of examples around new propositions and definitions to help solidify said abstract ideas, though these can read as fluff if you aren't interested or find the ideas simple
Table of Contents
- The Motivation: Why “Random” Isn’t Enough
- A sneak peek: A purely intuitive idea of our goal
- Stochastic Calculus (and a lil’ Physics?)
- Hamiltonian MCMC
- Recipes for Stochastic Gradient MCMC
- Exploring Methods discussed in Ma et al. (2015)
The Motivation: Why “Random” Isn’t Enough
MCMC is one of the most revolutionary and widely used tools in a statistician's arsenal when it comes to exploring probability distributions. But despite this, it is deceptively simple and often doesn't use all of the information available to speed up sampling. In this post we're going to see some ways that we can use some of that information to do exactly that.
First, I'll do a quick refresher on what Metropolis-Hastings MCMC is. If you are completely unfamiliar with Metropolis-Hastings MCMC I'd recommend (completely unbiased) my practical guide on MCMC and especially the resources/references within, but I'll cover the main points below.
Metropolis-Hastings MCMC (MHMCMC) comprises three things: a (typically unnormalised) target distribution, a proposal distribution, and however many iterations of the below algorithm you can bear to wait for (there is also a small code sketch after the listing).
Metropolis-Hastings Algorithm
- Initialise:
- Have a distribution you want to sample from (duh) \(f(x)\),
- manually create a starting point for the algorithm \(x_0\),
- pick a proposal distribution \(g(x\mid y)\) to sample new points from
- pick the number of samples you can be bothered waiting for \(N\)
- For each iteration \(n\)/Repeat \(N\) times
- Sample a new proposal point \(x^*\) from the proposal distribution \(g(\cdot\mid x_n)\) centred at the previous sample
- Calculate the acceptance probability \(\alpha\) given as \(\alpha = \frac{f(x^*)g(x_n\mid x^*)}{f(x_n)g(x^*\mid x_n)}\) (Here’s the change!)
- And if the acceptance probability is more than 1, cap it at 1.
- Generate a number, \(u\), from a uniform distribution between 0 and 1
- Accept or reject
- Accept if: \(u\leq\alpha\), s.t. \(x_{n+1} = x^*\)
- Reject if: \(u>\alpha\), s.t. \(x_{n+1} = x_n\)
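To make the loop above concrete, here is a minimal Python/NumPy sketch of the algorithm with a symmetric Gaussian proposal (so the \(g\) terms in the ratio cancel); the target `log_f`, the step size `sigma`, and the starting point are placeholders you would swap out for your own problem.

```python
import numpy as np

def metropolis_hastings(log_f, x0, n_samples, sigma=1.0, rng=None):
    """Random-walk Metropolis sketch with a symmetric Gaussian proposal.

    log_f: log of the (unnormalised) target density f(x).
    """
    rng = np.random.default_rng() if rng is None else rng
    x = np.atleast_1d(np.asarray(x0, dtype=float))
    log_fx = log_f(x)
    samples = np.empty((n_samples, x.size))
    for n in range(n_samples):
        # Propose from a Gaussian centred at the current state.
        x_star = x + sigma * rng.standard_normal(x.size)
        log_fx_star = log_f(x_star)
        # Symmetric proposal, so g(x_n | x*) / g(x* | x_n) = 1 and only f remains.
        log_alpha = log_fx_star - log_fx
        if np.log(rng.uniform()) <= log_alpha:  # accept
            x, log_fx = x_star, log_fx_star
        samples[n] = x  # on rejection we keep (and record) the old state
    return samples

# Example: 5000 samples from a 2D standard Gaussian (log density up to a constant).
samples = metropolis_hastings(lambda x: -0.5 * np.sum(x**2),
                              x0=np.zeros(2), n_samples=5000, sigma=0.5)
```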
This amounts to exploring a given distribution in the manner indicated in the GIF below.
The target distribution is the distribution that we are trying to generate samples from, and all we need is a way to evaluate it up to a normalisation constant. The proposal distribution is then some other distribution that uses the previously accepted value to propose a new point (which we either accept or reject based on the density ratio with respect to the previous accepted sample). This amounts to something we call a Markov process, which can be viewed as a discrete-time stochastic or random process (we will revisit this later) where every sample/event depends only on the previous one.
This is why Metropolis MCMC is sometimes referred to as a type of "Gaussian random walk" (in the case of a Gaussian proposal distribution) or Random Walk Metropolis (RWM, the term I will use more generally for the above types of MCMC algorithms).
These algorithms have been stupidly successful, being used for everything from investigating the properties of stars in other galaxies to how the share prices of toy companies will vary in a given month. The key benefit is the ability to partially side-step the curse of dimensionality that arises if one were to naively grid every dimension of a given distribution one wanted to investigate.
RWM works great for moderate-dimensional spaces (very roughly, from my experience, up to around 15 dimensions for non-pathological distributions) but of course is not able to get around the curse entirely.
There are two key issues that RWM struggles with in higher dimensional spaces: the way volume behaves in these spaces, and the general inefficiency of the process as a function of dimension.
Pathological Volumes and Shells
In high dimensional spaces, very basic things one would presume about volume and space break down. e.g. A particularly fun thing that can happen regarding spheres escaping the boxes that you would think contain them is explained in this numberphile video.
The particular issue in the case of RWM algorithms is that the typical set, the region where the probability mass lives (for our purposes here, probability mass is synonymous with infinitesimal integrated chunks of probability density), becomes exponentially concentrated in a thin shell about a sphere centred on the origin.
I showed the converse of this in my post on hyperspherical and hyperbolic VAEs, where if you sample uniformly on the unit sphere in high dimensions (~500) the distribution you get out in the marginals (the corner plot) is indistinguishable from those of a Gaussian of the same dimension.

So you get this kind of Faustian bargain where you can either:
- Make the step size/width of the proposal small, so that samples that are close to/on the sphere stay close to/on the sphere. But then exploring the whole space/distribution will be very inefficient, making the process take exponentially longer.
- Make the step size/width of the proposal large, so you can quickly explore the whole space, but then only a small fraction of proposals fall in the typical set, making the process take exponentially longer….
Walking home drunk is not an efficient way to get home (kinda)
Closely linked to the above issue is how the efficiency of the stochastic process (sometimes referred to as a drunken walk process) scales with dimension. For a \(d\)-dimensional distribution, assuming a target acceptance rate of ~0.234 (23.4\%) (refs Optimal scaling of discrete approximations to Langevin diffusions - Roberts, Rosenthal (1998), Investigating the Efficiency of the Metropolis Algorithm - Dou, Li, Wang (2024)), the optimal step size for RWM scales as \(d^{-1/2}\) and the number of steps needed as \(\mathcal{O}(d)\).
Viewing this process as a diffusion process, where the efficiencies relate to how long it takes to reach an equilibrium distribution or a good enough approximation of one, then leaves the door open for us to wonder whether other dynamics/processes could scale significantly better.
One possible family of algorithms that achieves this is the so-called "Langevin algorithms", which model the process via Langevin dynamics. Roberts & Tweedie (1996) showed that the proposal distributions for Langevin-based algorithms could scale as \(d^{-1/3}\) and the number of steps as \(\mathcal{O}(d^{1/3})\)! Which for high-dimensional sampling is a complete game changer when it comes to efficiency.
Sneak Peek: An intuitive picture of Langevin Monte Carlo
Imagine the sampling trajectories of MCMC as people trying to find their way home after a long night out. They are in their neighborhood, but they are delirious and have no sense of direction.
The RWM drunkards have lost their glasses in the mayhem of the night, and thus have completely failing eyesight. They pick a house at random nearby, stumble to the door, and are then close enough to check the house number on the door (the probability density). If that number is closer to their own address, they feel a sense of "positive feedback" and are likely to stay there for a moment. But then, forgetting everything, they stumble toward another random house. As this process repeats they have an unconscious positive feedback 'pull' towards the right house number. They are essentially "diffusing" through the neighborhood. Because they don't know which way the numbers are increasing or decreasing, they spend a massive amount of time walking in circles or heading toward the wrong end of the street.
The Langevin (LMC) Drunkard is just as tired, but they’ve managed to keep their glasses on. They can actually read the street signs. They notice the house numbers are increasing as they walk East, so they purposefully start walking East (the Drift). They are still drunk, so they still trip over the curb or weave side-to-side randomly (the Diffusion), but their average movement is a direct line toward their front door.
In a 1D street, the RWM drunkard might eventually find home. But imagine a 100-dimensional neighborhood (imagine Los Angeles in Blade Runner 2049 on steroids).
The RWM drunkard is almost certain to take a step that leads them away from their house because there are so many “wrong” directions to choose from.
The LMC drunkard, even with a slight stagger, has a compass (the Gradient). They ignore the 99 wrong directions and focus their energy on the one direction that actually matters.
More formally, LMC introduces a drift into the proposal distribution or gradient flow using the gradient of the target probability density but otherwise uses the same accept-reject as the typical MHMCMC algorithm. And even more formally … is the rest of this post.
Below is a GIF showing how RWM and LMC converge on a target (same step size, set so that RWM is purposefully bad, but even then the LMC is fine).
We will try to explain and motivate how the algorithm below is constructed, and show that it does in fact have an equilibrium distribution that matches our target (a small code sketch follows the listing).
Metropolis-Adjusted Langevin Algorithm
- Initialise:
- Have a target density you want to sample from \(f(x)\) (known up to a normalising constant),
- Define the potential energy as the negative log-density: \(U(x) = - \log f(x)\),
- Manually choose a starting point \(x_0\),
- Pick a step size \(\epsilon > 0\),
- Pick the number of samples to generate \(N\).
- For each iteration \(n\) / Repeat \(N\) times:
- Langevin proposal steps:
- Compute the gradient of the log-density at the current state: \(\nabla \log f(x_n)\)
- Generate a proposal point \(x^*\) using a drift term plus noise: \(x^* = x_n + \frac{\epsilon}{2} \nabla \log f(x_n) + \sqrt{\epsilon} \eta\) where \(\eta \sim \mathcal{N}(0, I)\)
- Metropolis Hastings correction steps:
- Compute the proposal densities: \(q(x^* \mid x_n) = \mathcal{N}\left(x^* ; x_n + \frac{\epsilon}{2} \nabla \log f(x_n),\epsilon I\right)\) \(q(x_n \mid x^*) = \mathcal{N}\left(x_n ; x^* + \frac{\epsilon}{2} \nabla \log f(x^*),\epsilon I\right)\)
- Compute the acceptance probability: \(\alpha = \min\left(1, \frac{f(x^*) q(x_n \mid x^*)}{f(x_n) q(x^* \mid x_n)}\right)\)
- Draw a random number \(u \sim \text{Uniform}(0, 1)\)
- Accept or reject:
- If \(u \le \alpha\), set \(x_{n+1} = x^*\)
- If \(u > \alpha\), set \(x_{n+1} = x_n\)
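As a rough translation of the steps above into Python/NumPy, here is a sketch of MALA; `grad_log_f` is assumed to be supplied by you (analytically or via autodiff), and the step size `eps` is left as a tuning knob.

```python
import numpy as np

def mala(log_f, grad_log_f, x0, n_samples, eps=0.1, rng=None):
    """Metropolis-Adjusted Langevin Algorithm (sketch)."""
    rng = np.random.default_rng() if rng is None else rng

    def log_q(x_to, x_from):
        # log N(x_to; x_from + (eps/2) grad log f(x_from), eps I), up to a constant.
        mean = x_from + 0.5 * eps * grad_log_f(x_from)
        return -np.sum((x_to - mean) ** 2) / (2.0 * eps)

    x = np.atleast_1d(np.asarray(x0, dtype=float))
    samples = np.empty((n_samples, x.size))
    for n in range(n_samples):
        # Langevin proposal: drift along the gradient of log f plus Gaussian noise.
        x_star = (x + 0.5 * eps * grad_log_f(x)
                  + np.sqrt(eps) * rng.standard_normal(x.size))
        # Metropolis-Hastings correction with the (asymmetric) proposal densities.
        log_alpha = (log_f(x_star) + log_q(x, x_star)
                     - log_f(x) - log_q(x_star, x))
        if np.log(rng.uniform()) <= log_alpha:
            x = x_star
        samples[n] = x
    return samples
```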
Another Sneak Peek: An intuitive picture of Hamiltonian Monte Carlo
A much more commonly used algorithm (compared to LMC) is Hamiltonian Monte Carlo or HMC. Where LMC has us thinking of the sampling procedure as a stochastic process, HMC looks at the problem as a deterministic dynamical process. The key idea is that you literally treat the problem as some fictitious physical system where the samples of the parameters in the posterior \(\vec{\theta}\) are akin to position vectors, and we add some auxiliary momentum vectors \(\vec{r}\).
We then define a Hamiltonian for the system and traverse its equal energy (probability) contours, such that newly proposed points will have similar or higher acceptance probabilities than the initial points, with a Metropolis-Hastings-like adjustment so that the algorithm doesn't just turn into an optimisation routine. The Hamiltonian is generally (like 99+\% of the time) given as,
\[\begin{align} \mathcal{H}(\vec{\theta}, \vec{r}) = U(\vec{\theta}) + K(\vec{r}), \end{align}\]where \(U\) is our equivalent of potential energy, defined by \(U(\vec{\theta}) = - \log p(\vec{\theta}, \vec{x})\), with \(\vec{x}\) being our data, which is effectively treated as a constant here. The kinetic energy \(K\) is then similarly defined as the negative of the log of the probability of a given momentum, typically chosen to just be a multivariate normal distribution with mean \(0\) and unit variance. So,
\[\begin{align} K(\vec{r}) &= - \log p(\vec{r}) \\ &= - \log \mathcal{N}(\vec{0}, I) \\ &\propto - \log \exp\left(-\frac{1}{2} (\vec{r} - \vec{0})^T I^{-1} (\vec{r} - \vec{0}) \right) \\ &= \vec{r}^T \vec{r}/2. \end{align}\]This nicely matches up with actual (classical) physical systems where the kinetic energy is \(p^2/2m\).
As in classical physics we can then evolve the system according to1,
\[\begin{align} \frac{d}{dt}\vec{\theta} = \frac{\partial H(\vec{\theta}, \vec{r})}{\partial \vec{r}} \\ \frac{d}{dt}\vec{r} = - \frac{\partial H(\vec{\theta}, \vec{r})}{\partial \vec{\theta}}. \end{align}\]If you're unfamiliar, part of the reason we model physical systems this way is that it enforces conservation of energy, i.e. the derivative of the total energy with respect to time is 0,
\[\begin{align} \frac{d}{dt} H(\vec{\theta}, \vec{r}) &= \frac{\partial H(\vec{\theta}, \vec{r})}{\partial \vec{\theta}} \frac{d}{dt}\vec{\theta} + \frac{\partial H(\vec{\theta}, \vec{r})}{\partial \vec{r}}\frac{d}{dt}\vec{r}\\ &= \frac{\partial H(\vec{\theta}, \vec{r})}{\partial \vec{\theta}} \frac{\partial H(\vec{\theta}, \vec{r})}{\partial \vec{r}} - \frac{\partial H(\vec{\theta}, \vec{r})}{\partial \vec{r}}\frac{\partial H(\vec{\theta}, \vec{r})}{\partial \vec{\theta}} \\ &= 0. \end{align}\]This leads us to essentially modelling a physical system, where previously we were modelling a Markov process on a probability density. The potential of the system is the negative log probability of the distribution we want, and the proposal samples are generated by traversing along equal energy contours/the paths particles would take in a physical system. There are just a few remaining details regarding how long the paths between proposals are, how big the steps in these traversals are, Monte Carlo adjustments related to making a discrete approximation to a continuous system, and reversing the momentum after accepting a sample.
A naive implementation is shown in the GIF below.
(Although now I realise the labels for the contours of the KDE estimate are the wrong way around…)
Some mathematical background
In this section we'll go through the required math, particularly to understand Langevin dynamics and the Fokker-Planck equation. If you either are not fussed about the nitty-gritty details or are already familiar with the topics above, you can skip this section. I endeavoured to make the other sections not rely on this one, but I will refer back to it every now and then for details.
Stochastic Differential Equations (SDEs)
You can essentially describe SDEs as an extension to ODEs that includes some noise or stochastic component (for the purpose of this blog post at least).
In essence, where you would have an ODE described by:
\[\begin{align} \begin{cases} \frac{dx}{dt}(t) = b(x(t)) & (t\geq 0) \\ x(0) = x_0 \end{cases} \end{align}\]that described some smooth trajectory \(x(t)\) (\(x\) some stand-in for some arbitrary quantity) or progression of states for times \(\geq 0\), that may look like the following for example,
If you add a noise component, that may look like,
such that the system can only now be adequately described by,
\[\begin{align} \begin{cases} \frac{dy}{dt}(t) = b(y(t)) + B(y(t)) \xi(t) & (t\geq 0) \\ y(0) = y_0 \end{cases} \end{align}\]where \(B:\mathbb{R}^n \rightarrow \mathbb{M}^{n \times m}\) (space of \(n\times m\) matrices) and \(\xi(t)\) an \(m\)-dimensional vector function that would be described as ‘white noise’ (with now \(y\) now some stand-in for some arbitrary quantity to highlight the difference with \(x\) the smooth ODE result).
The white noise term is often replaced with the time derivative of a “Wiener process”/brownian motion process2 \(\frac{dW}{dt} = \xi(t)\) which then allows us to express the above in differential form as below.
\[\begin{align} \begin{cases} dy = b(y(t)) dt + B(y(t)) dW(t) & (t\geq 0) \\ y(0) = y_0 \end{cases} \end{align}\]Thus the general solution of the SDE is then “just”,
\[\begin{align} y(t) = y_0 + \int_{s=0}^{s=t} b(y(s)) ds + \int_{s=0}^{s=t} B(y(s)) dW(s). \end{align}\]But for anyone vaguely familiar with Brownian motion it is everywhere continuous and nowhere differentiable so \(\frac{dW}{dt}\) for sure doesn’t really exist. So:
- What even is \(dW\)?
- What does \(\int_{s=0}^{s=t} dW(s)\) mean?
- Is the equation for \(y(t)\) actually solvable?
And so we begin.
What even is \(dW\)?
Skipping some motivation and background we can define the process \(W(t)\) as:
Wiener Process Definition.
- The process \(W\) has the initial condition \(W(0) = 0\),
- \(W(t) - W(s) \sim \mathcal{N}(0, t-s), \; \forall \; t\geq s \geq 0\),
- \(\forall t\) where \(0 \lt t_1 \lt t_2 \dots \lt t_n\), the increments/differences \(W(t_{j+1}) - W(t_j)\) are independent
So we have \(W(t)\), or at least some abstract definition for it. This is basically just describing any process where some particle/object/thing jiggles, such that its position at any point in time can be represented probabilistically as a Gaussian with variance given by the difference in time from either the starting time or some reference point.
In standard calculus, \(dx\) is just the infinitesimal change in \(x\). In our stochastic process world now, \(dW\) is closer to an infinitesimal ‘shock’3 rather than some strict notion of ‘change’ if that makes sense (which complicates how we’re meant to do the integrals above).
But similar to the “smooth” calculus \(dx\), we take the limit \(\Delta t \to dt\) with \(\Delta W = W(t + dt) - W(t)\), to get some of the properties of the “stochastic differential” \(dW\):
- Expectation is zero: \(E[dW] = 0\). On average, the noise doesn’t “push” the particle in any particular direction, on average it stays where it started.
- The "Square" Rule: In standard calculus, \((dt)^2\) is so small we often ignore it. But for a Wiener process: \((dW)^2 \approx dt\) (more strictly, \(E[(dW)^2] = dt\)).
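If those two properties feel abstract, a quick numerical sanity check (just a sketch, unrelated to anything later in the post) is to simulate a large number of Brownian increments over a small step \(\Delta t\) and look at their sample mean and mean square:

```python
import numpy as np

rng = np.random.default_rng(0)
dt = 1e-3
n = 1_000_000

# Brownian increments over a step of size dt: dW ~ N(0, dt).
dW = np.sqrt(dt) * rng.standard_normal(n)

print(dW.mean())         # ~ 0           (E[dW] = 0)
print((dW ** 2).mean())  # ~ 1e-3 = dt   (E[(dW)^2] = dt)
```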
Stochastic Integrals: Itô Calculus vs. Stratonovich
If we try to solve the integral term \(\int B(y(s)) dW(s)\) using standard Riemann sums (the “rectangles under the curve” method you learned in high school), we break math.
Why?
Because the path of \(W(t)\) is so jagged—it has infinite variation—the answer depends on where in the rectangle you measure the height.
We have two main choices, which essentially boil down to: “When does the noise hit?”
1. The Ito Integral (Common Financial/CS Choice)
In the Ito (or the correctly spelled Itô) interpretation, we evaluate the function at the beginning of the time step, kind of like if you were trying to model a situation in the wild: you only have the information up to the current timestep, you don't know the information in half a timestep's time, let alone a full one.
\[\int_0^t B(s) dW(s) \approx \sum_{i} B(t_i) (W(t_{i+1}) - W(t_i))\]This is the standard for finance and computer science. It has the "martingale" property: the noise at the current step is independent of the state, meaning you can't use the noise to "peek" into the future (Sean Dineen's textbook on probability and SDEs has a nice, more in-depth explanation of this). However, it breaks the chain rule of standard calculus.
To differentiate functions of stochastic processes, you need Ito’s Lemma (the Taylor Series for SDEs):
\[df = \left( f'(y)b(y) + \frac{1}{2}f''(y)B(y)^2 \right)dt + f'(y)B(y)dW\]Remembering that \((dW)^2 \approx dt\), we get extra terms that wouldn’t exist in normal calculus.
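As a quick worked example (the textbook case, nothing specific to this post): take \(y = W\) itself (so \(b = 0\), \(B = 1\)) and \(f(y) = y^2\). Itô's Lemma then gives,
\[\begin{align} d(W^2) = \left(0 + \frac{1}{2}\cdot 2 \cdot 1^2\right) dt + 2W \, dW = dt + 2W \, dW, \end{align}\]so \(E[W_t^2] = t\); the extra \(dt\) term is exactly the \((dW)^2 \approx dt\) correction that ordinary calculus would miss.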
2. The Stratonovich Integral (The "Physics" Choice)
Here, we evaluate the function at the midpoint of the time step:
\[\frac{t_i + t_{i+1}}{2}.\]Giving the definition of the integral to be:
\[\int_0^t B(s) dW(s) \approx \sum_{i} B\left( \frac{t_i+ t_{i+1}}{2}\right) (W(t_{i+1}) - W(t_i)).\]This formulation preserves standard calculus rules (the chain rule works as expected!), making it popular in physics.
However, it secretly looks into the future (the midpoint uses information from \(t_{i+1}\)), making it computationally awkward for simulations where causality matters.
Again, kind of by construction, for a discrete process like making investments in finance, you are stuck in the present; you shouldn't be using information from the future.
TL;DR: We will stick to Ito for simplicity in sampling algorithms, but we have to be careful with our calculus rules. Below is a gif showing an example of the difference between the two for the same path.
And then just for the last frame so you don’t have to watch the gif over and over again.
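For those who want to poke at this numerically, here is a small sketch comparing the two sums for the classic integral \(\int_0^T W \, dW\) on the same simulated path; the Itô sum should land near \((W_T^2 - T)/2\) and the Stratonovich sum near \(W_T^2/2\) (with the midpoint rule approximated by averaging the endpoints, the usual numerical stand-in).

```python
import numpy as np

rng = np.random.default_rng(1)
T, n = 1.0, 100_000
dt = T / n

# One Brownian path on a fine grid.
dW = np.sqrt(dt) * rng.standard_normal(n)
W = np.concatenate(([0.0], np.cumsum(dW)))

ito = np.sum(W[:-1] * dW)                    # evaluate at the left endpoint
strat = np.sum(0.5 * (W[:-1] + W[1:]) * dW)  # evaluate at the (approximate) midpoint

print(ito, (W[-1] ** 2 - T) / 2)   # these two should be close
print(strat, W[-1] ** 2 / 2)       # these two agree up to float error
```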
The Langevin SDE: Combining Deterministic Drift and Random Fluctuations
Now we can define the specific SDE that this part of this post is meant to actually be about.
Langevin Dynamics models the movement of a particle in a fluid. It is subject to two forces:
- Drag/Friction: It wants to stop moving or settle into a low-energy state.
- Thermal Fluctuation: The random collisions with fluid molecules (the “noise”) kicking it around.
In the context of sampling (e.g., finding the parameters of a Bayesian model), we treat the "energy landscape" as the negative log-probability of our target distribution (as I defined above for the sneak peek at HMC).
We want the particle to settle in the high-probability regions (low energy) but keep jiggling to explore. We use the Overdamped Langevin Equation, where we assume the friction is so high we can ignore mass and acceleration (inertia)4.
In this overdamped limit the velocity is simply proportional to the net force plus noise, giving,
\[dY_t = -\nabla U(Y_t) dt + \sqrt{2D} dW_t,\]which depends on:
- the state/position \(Y_t\)
- the Drift: \(-\nabla U(Y_t)\). This pushes the particle downhill toward the minimum of the potential energy \(U(x)\).
- and finally the diffusion \(\sqrt{2D} dW_t\): This is the thermal noise that prevents the particle from just getting stuck in the nearest local minimum. i.e. \(D\) controls the “temperature” and is assumed to be a constant diagonal matrix (with the square root an element-wise operation or D is just a constant \(\in \mathbb{R}\))
A gif showing the combination of the effects is below.
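To make the drift-plus-diffusion picture runnable, here is a sketch of an Euler-Maruyama simulation of the overdamped Langevin SDE above; the double-well potential \(U(x) = (x^2 - 1)^2\) and the constants are purely illustrative choices.

```python
import numpy as np

def grad_U(x):
    # Gradient of the (illustrative) double-well potential U(x) = (x^2 - 1)^2.
    return 4.0 * x * (x ** 2 - 1.0)

rng = np.random.default_rng(2)
D, dt, n_steps = 0.5, 1e-3, 200_000

x = np.zeros(n_steps)
for t in range(1, n_steps):
    drift = -grad_U(x[t - 1]) * dt                          # deterministic pull downhill
    noise = np.sqrt(2.0 * D * dt) * rng.standard_normal()   # thermal kick
    x[t] = x[t - 1] + drift + noise

# A histogram of x (after discarding a burn-in) should put mass in both wells.
```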
The Fokker-Planck Equation: “Flowing” Probability?
So far, we’ve looked at the path of a single particle. But since the movement is random, what if we ran this simulation a million times? Or what if we had a million particles? Or more simply, how do we know if a distribution is the steady-state or final solution of our dynamics?
But since the movement is random, looking at one squiggly line doesn’t tell us the whole story.
Instead of tracking the position of one particle, let’s track the probability density \(p(x, t)\). This tells us the probability of finding the particle at position \(x\) at time \(t\).
If we want to see how this probability cloud evolves over time, we need a Partial Differential Equation (PDE).
For this we can imagine that the probability is like a fluid: it cannot be created or destroyed, it can only flow from one place to another.
This is mathematically described by the Continuity Equation:
\[\frac{\partial p(x, t)}{\partial t} = -\nabla \cdot J(x, t)\]Where \(\frac{\partial p}{\partial t}\) is the change in probability at a specific spot, \(J\) is the probability 'flux' (or 'current') describing where the probability is flowing, and \(\nabla \cdot J\) is the 'divergence', measuring how much "stuff" is leaving a point (it can also be negative, so it also measures stuff entering a point).
More simply the equation says: "The amount of probability at a point changes only if probability flows in or out of that point." Nothing will magically appear or disappear, and the sum of spatial derivatives of the flux (the divergence) equals, up to a sign, the time derivative of the amount of stuff in the neighbourhood of a given point.
I tried to “GIF”-ify this, but I’m not sure how effective it is. There are two plots below showing a ball of ‘probability’ moving through a box/space (you can think of this as the neighbourhood of a given point) and then an abstract fluid.
ball of stuff
abstract fluid
Although while trying to find some inspiration for GIFs to make that would be better than the above or require me to learn manim, I found this meme where the bit this guy is writing is this exact concept 😂. Which I think is as good as I can do (for those wondering, no it is not simple math…).
For a better resource on the integral formulation and differential formulation of the continuity equation (we are using the differential form) I'd recommend this video by Steve Brunton.
In our Stochastic Differential Equation, the flux \(J\) is made of two competing parts:
- The ‘Advective’ Flux (Drift): The deterministic force \(b(x)\) pushing the particle.\(J_{\text{drift}} = b(x) p(x, t)\). This is the “River” part. If you just had this (no noise), you would recover the general form of Liouville’s Equation for deterministic flow.
- The ‘Diffusive’ Flux: The random noise spreading things out. By Fick’s Law, stuff naturally flows from high concentration to low concentration. \(J_{\text{diffusion}} = - \frac{D(x, t)}{2} \nabla p(x, t)\). This is the “Dye Spreading” part.
If we just plug our total flux \(J = J_{\text{drift}} + J_{\text{diffusion}}\) back into the continuity equation, we get the Fokker-Planck Equation:
\[\frac{\partial p}{\partial t} = -\nabla \cdot \left[ \underbrace{b(x) p}_{\text{Drift}} - \underbrace{\frac{D(x, t)}{2} \nabla p}_{\text{Diffusion}} \right]\]Expanding the divergence gives us the standard form:
\[\frac{\partial p}{\partial t} = \underbrace{-\nabla \cdot (b(x) p)}_{\text{Drift pushes } p} + \underbrace{\frac{1}{2} \nabla^2 [D(x, t) p]}_{\text{Diffusion spreads } p}\]For our Dynamics we’ll say \(b(x) = -\nabla U(x)\):
\[\frac{\partial p}{\partial t} = \nabla \cdot (\nabla U(x) p) + \frac{1}{2} \nabla^2 [D(x, t) p]\]Conceptually: The Drift tries to compress the probability mass into the minimum of the energy \(U(x)\), while the Diffusion tries to flatten the probability mass out (entropy).
When these two fluxes balance out (\(J_{total} = 0\)), we hit our stationary distribution.
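To make that balance explicit (using the flux convention above and treating \(D\) as a constant), setting \(J_{\text{total}} = 0\) with \(b(x) = -\nabla U(x)\) gives,
\[\begin{align} 0 = -\nabla U(x)\, p(x) - \frac{D}{2} \nabla p(x) \;\Rightarrow\; \nabla \log p(x) = -\frac{2}{D}\nabla U(x) \;\Rightarrow\; p(x) \propto \exp\left(-\frac{2}{D} U(x)\right), \end{align}\]a Boltzmann-like distribution in the potential \(U\); with the noise scale chosen so that the prefactor works out to 1, this is exactly the target \(\pi(x) \propto \exp(-U(x))\) we aim for in the next section.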
The MCMC Checklist
Now, if we discretize our SDE to run it on a computer (using what is called the Euler-Maruyama method), we are essentially creating a Markov chain.
For this chain to be useful for sampling, it needs to satisfy three core properties.
1. Stationarity (The Goal)
As we derived from the Fokker-Planck equation, we need our target distribution \(\pi(x) \propto \exp(-U(x))\) to be the stationary distribution of the chain.
This means that once the “probability fluid” reaches this shape, it stays this shape.
2. Ergodicity (The "No Island" Rule)
Ergodicity is the guarantee that the particle can eventually reach any point in the space, no matter where it starts.
In math terms: The chain must be irreducible and aperiodic.
In conceptual terms: If your energy landscape \(U(x)\) has two deep valleys separated by an infinitely high wall, your algorithm isn’t ergodic. It will get stuck in one valley and never “see” the other, giving you a biased view of the distribution. The noise term \(dW\) is what provides ergodicity by allowing the particle to climb over (or tunnel through) energy barriers.
3. Detailed Balance (Reversibility)
This is the most "physical" requirement. It is also actually a stronger condition than what is strictly needed (which is global balance), but it is typically what people try to prove their algorithm satisfies.
A Markov chain is said to satisfy detailed balance if, at equilibrium, the probability of being at state \(A\) and moving to \(B\) is exactly equal to the probability of being at \(B\) and moving to \(A\).
\[p(x) T(x \to x') = p(x') T(x' \to x)\]where \(T\) is the transition probability. In the context of the Fokker-Planck equation, this is equivalent to saying the total flux \(J\) is zero everywhere, not just that its divergence is zero.
If detailed balance holds, there are no “circular currents” in your probability fluid. It’s a “static” equilibrium rather than a “dynamic” one.
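A one-line reason detailed balance is enough for stationarity: integrating both sides of the condition over the current state \(x\),
\[\begin{align} \int p(x)\, T(x \to x')\, dx = \int p(x')\, T(x' \to x)\, dx = p(x') \int T(x' \to x)\, dx = p(x'), \end{align}\]i.e. pushing \(p\) through one step of the transition kernel returns \(p\), which is exactly the (global balance) statement that \(p\) is stationary.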
The Discretization Gap
This is all good, but while the continuous description of our SDE can satisfy these properties, the discrete version we run on a computer (taking finite steps \(\Delta t\)) actually introduces a slight error.
Because we aren’t moving in infinitesimally small steps, we slightly “overshoot” the curves of the energy landscape (even if we aren’t explicitly traversing it with gradients we’re still moving along it).
This means that the stationary distribution of our computer code is actually slightly different from our target \(\pi(x)\).
To fix this, we often add a Metropolis-Hastings “accept/reject” step at the end of each jump to restore detailed balance.
e.g. This turns “Unadjusted Langevin Algorithm” (ULA) into the “Metropolis-Adjusted Langevin Algorithm” (MALA) which is the one used in practice.
In summary
| Property | Physical Intuition | Why we need it |
|---|---|---|
| Stationarity | The fluid has settled into its final shape. | Ensures we are sampling the right distribution. |
| Ergodicity | The particle can visit every “room” in the house. | Ensures we don’t miss any peaks in the data. |
| Detailed Balance | No hidden whirlpools; every move is reversible. | Simplest way to guarantee stationarity. |
Another property one would like to know is the mixing time or convergence rate, which relates to how long it will take for a sampler to converge. We'll just leave this at intuitive explanations rather than try to prove convergence rates. I don't quite have the time to explain absolutely everything in this post (in fact I'm trying to condense multiple textbooks' worth of material into what I have above, so any more and I might just explode or simply regurgitate all of said textbooks).
Hamiltonian MCMC and NUTS in detail
For this section I will first introduce Hamiltonian Monte Carlo (HMC) and then NUTS, which is a kind of 'auto-tuning' method that makes HMC more user friendly. I will be very heavily using parts of "MCMC using Hamiltonian dynamics" - Radford M. Neal for the HMC section and "The No-U-Turn Sampler: Adaptively Setting Path Lengths in Hamiltonian Monte Carlo" - Hoffman and Gelman for the NUTS section (both papers are those that introduced the respective methods).
The original HMC algorithm
As a quick refresher, in HMC we define a Hamiltonian as,
\[\begin{align} H(\vec{\theta}, \vec{r}) = U(\vec{\theta}) + K(\vec{r}) = -\log\left(p(\vec{\theta}) \right) + \frac{r^T M^{-1} r}{2}, \end{align}\]where, compared to the section above, I'm using \(M\), a positive-definite matrix called the "mass matrix", which as the name suggests takes a similar role to mass in standard physical systems. Within the probabilistic interpretation of \(K(\vec{r})\), the mass matrix takes the role of the momentum distribution's (normal distribution's) covariance matrix. The system's Hamiltonian dynamics are then dictated by the following Hamiltonian equations,
\[\begin{align} \frac{d\vec{\theta}}{dt} = M^{-1} \vec{r} \\ \frac{d\vec{r}}{dt} = \frac{\partial \log p(\vec{\theta})}{\partial \vec{\theta}}. \\ \end{align}\]We then want to use this as a way to traverse the probability density to propose distant points that have similar acceptance probabilities. This is what increases the efficiency of the HMC algorithm compared to Metropolis-Hastings MCMC, as we can make more informed proposals.
We then want to make it so that the proposals are far from the previous sample, so that we can efficiently explore the space and not oversample some trajectories. Together this motivates (but doesn't fully explain quite yet) the below algorithm for a diagonal mass matrix; a small code sketch follows the listing.
Hamiltonian Monte Carlo Algorithm
- Initialise:
- Have a target density you want to sample from \(p(\vec{\theta})\) (known up to a normalising constant)
- Define the potential energy: \(U(\vec{\theta}) = -\log p(\vec{\theta})\)
- Define the kinetic energy (usually Gaussian): \(K(\vec{r}) = \frac{1}{2} \vec{r}^T M^{-1} \vec{r}\), where \(M\) is a mass matrix (often \(I\) or \(\textrm{diag}[m_1, m_2, ..., m_d]\))
- Manually choose a starting point \(\vec{\theta}_0\)
- Pick a step size \(\epsilon \in \mathbb{R}^+\) and the number of leapfrog steps \(L\in\mathbb{N}\).
- Pick the number of samples to generate \(N\).
- Sample: For each iteration \(n\) / Repeat \(N\) times:
- Resample momentum: Draw a fresh momentum vector \(\vec{r}_n \sim \mathcal{N}(0, M)\).
- Leapfrog Integration (Simulate Dynamics):
- Set initial state for the trajectory: \((\vec{\theta}^*, \vec{r}^*) = (\vec{\theta}_n, \vec{r}_n)\).
- Perform the first half-step for momentum: \(\vec{r}^* \leftarrow \vec{r}^* - \frac{\epsilon}{2} \nabla U(\vec{\theta}^*)\)
- For \(l\) from \(1\) to \(L\):
- Update position: \(\vec{\theta}^* \leftarrow \vec{\theta}^* + \epsilon M^{-1} \vec{r}^*\)
- If \(l < L\), update momentum: \(\vec{r}^* \leftarrow \vec{r}^* - \epsilon \nabla U(\vec{\theta}^*)\)
- Perform the final half-step for momentum: \(\vec{r}^* \leftarrow \vec{r}^* - \frac{\epsilon}{2} \nabla U(\vec{\theta}^*)\)
- Metropolis-Hastings Correction:
- Compute the Hamiltonian (total energy) at the start and end: \(H(\vec{\theta}_n, \vec{r}_n) = U(\vec{\theta}_n) + K(\vec{r}_n)\), \(H(\vec{\theta}^*, \vec{r}^*) = U(\vec{\theta}^*) + K(\vec{r}^*)\)
- Compute the acceptance probability: \(\alpha = \min\left(1, \exp(H(\vec{\theta}_n, \vec{r}_n) - H(\vec{\theta}^*, \vec{r}^*))\right)\)
- Draw a random number \(u \sim \text{Uniform}(0, 1)\).
- Accept or reject: If \(u \le \alpha\), set \(\vec{\theta}_{n+1} = \vec{\theta}^*\). If \(u > \alpha\), set \(\vec{\theta}_{n+1} = \vec{\theta}_n\)
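Here is a rough Python/NumPy sketch of the algorithm above with an identity mass matrix; `U` and `grad_U` are assumed to be supplied by you (analytically or via autodiff), and the step size and number of leapfrog steps are left as tuning knobs.

```python
import numpy as np

def hmc(U, grad_U, theta0, n_samples, eps=0.1, L=20, rng=None):
    """Hamiltonian Monte Carlo with M = I (sketch)."""
    rng = np.random.default_rng() if rng is None else rng
    theta = np.atleast_1d(np.asarray(theta0, dtype=float))
    samples = np.empty((n_samples, theta.size))
    for n in range(n_samples):
        r = rng.standard_normal(theta.size)          # resample momentum ~ N(0, I)
        theta_star, r_star = theta.copy(), r.copy()

        # Leapfrog integration of the Hamiltonian dynamics.
        r_star -= 0.5 * eps * grad_U(theta_star)     # first half-step for momentum
        for l in range(L):
            theta_star += eps * r_star               # full position step (M = I)
            if l < L - 1:
                r_star -= eps * grad_U(theta_star)   # full momentum step
        r_star -= 0.5 * eps * grad_U(theta_star)     # final half-step for momentum

        # Metropolis correction based on the change in total energy.
        H_old = U(theta) + 0.5 * r @ r
        H_new = U(theta_star) + 0.5 * r_star @ r_star
        if np.log(rng.uniform()) <= H_old - H_new:
            theta = theta_star
        samples[n] = theta
    return samples
```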
LMC in detail
Recipes for Stochastic Gradient MCMC
Exploring Methods discussed in Ma et al. (2015)
I vehemently detest dot notation; despite most of the literature around this using it, I refuse. ↩
Wiener named after Norbert Wiener is pronounced with a “v” but that doesn’t stop me 100% reading as a “w” to make the papers I read funnier, and yes, I am a child. ↩
I’m using a lot of ‘_’ and “_” in this blog post because I am skipping the rigor to just say things… ↩
As opposed to the actually more commonly used “under”-damped Langevin sampler which doesn’t ignore mass and models momenta ↩