Speeding up MCMC with Langevin Dynamics - Langevin Monte Carlo
In this post I’m going to try to introduce Langevin Monte Carlo, an MCMC method that models the sampling process with Langevin dynamics. This means the process is interpreted as one dictated by the Langevin stochastic differential equation, which models a specific combination of random and deterministic forces acting on a system. UNDER CONSTRUCTION
Resources
LMC/Langevin Methods
- Exponential convergence of Langevin distributions and their discrete approximations - Gareth O. Roberts, Richard L. Tweedie
- Brownian dynamics as smart Monte Carlo simulation - P. J. Rossky; J. D. Doll; H. L. Friedman
- Bayesian Learning via Stochastic Gradient Langevin Dynamics - Max Welling, Yee Whye Teh
- Metropolis-adjusted Langevin algorithm - Wiki
- Towards a Theory of Non-Log-Concave Sampling: First-Order Stationarity Guarantees for Langevin Monte Carlo - Krishnakumar Balasubramanian, Sinho Chewi, Murat A. Erdogdu, Adil Salim, Matthew Zhang
- Efficient Approximate Posterior Sampling with Annealed Langevin Monte Carlo - Advait Parulekar, Litu Rout, Karthikeyan Shanmugam, and Sanjay Shakkottai
- Langevin Monte Carlo: random coordinate descent and variance reduction - Zhiyan Ding, Qin Li
- Random Reshuffling for Stochastic Gradient Langevin Dynamics - Luke Shaw, Peter A. Whalley
- Analysis of Langevin Monte Carlo via convex optimization - Alain Durmus, Szymon Majewski, and Blazej Miasojedow
- A Divergence Bound For Hybrids of MCMC and Variational Inference and …
- Generative Modeling by Estimating Gradients of the Data Distribution - Yang Song
- Sampling as optimization in the space of measures … - COLT
- Random Coordinate Descent and Langevin Monte Carlo - Qin Li, Simons Institute for the Theory of Computing
- The Sampling Problem Through The Lens of Optimization: Recent Advances and Insights - Aniket Das
For Langevin Dynamics
- A Simplified Overview of Langevin Dynamics - Roy Friedman
- On the Probability Flow ODE of Langevin Dynamics - Mingxuan Yi
- Introduction to Stochastic Dynamics: Langevin and Fokker-Planck Descriptions of Motion - Steven Blaber
- Langevin Equation - Wiki
- Great derivation for the Fokker-Planck equation and a few other results (that I was heavily ‘inspired’ by in my post).
Misc
- A prequel to the diffusion model - Nosuke Dev Blog
- Santa Jaws
- Improving Diffusion Models as an Alternative To GANs, Part 2 - NVIDIA Technical Blog
- Continuous Algorithms: Sampling and Optimization in High Dimension - Santosh Vempala, Simons Institute for the Theory of Computing
- Non-convex learning via Stochastic Gradient Langevin Dynamics: a nonasymptotic analysis
- Handbook of Stochastic Methods - Gardiner
- Stochastic Differential Equations - An Introduction with Applications - Bernt Øksendal
- Stochastic Calculus - Sam Findler
- This is generally great, but I found that the textbook they use was easier for me to follow even with a background in SDEs. And also not sure whether this content should be public?? But it’s what I used as a nice refresher
- An Introduction to Stochastic Differential Equations - Lawrence C. Evans
Table of Contents
- The Motivation: Why “Random” Isn’t Enough
- A sneak peek: A purely intuitive idea of our goal
- Stochastic Calculus (and a lil’ Physics)
- (Very Quick) Review of Scalar Fields, Potentials, and Gradients
- Introduction to Stochastic Differential Equations (SDEs)
- The Wiener Process: Formalizing Stochastic Noise
- Stochastic Integrals: Itô Calculus vs. Stratonovich
- The Langevin SDE: Combining Deterministic Drift and Random Fluctuations
- The Fokker-Planck Equation: “Flowing” Probability?
- Stationary Solutions to the SDE
- The Fluctuation-Dissipation Theorem: Why the $\sqrt{2}$??
- The Algorithm (Finally)
- Discretization: The Euler-Maruyama Approximation
- The Unadjusted Langevin Algorithm (ULA)
- Stability and the Step-Size (\(\epsilon\)) Trade-off
- Analysis of Discretization Bias
- Refinement and Performance
- The Metropolis-Adjusted Langevin Algorithm (MALA)
- Calculating the Langevin Proposal Density
- Convergence Analysis: \(d^{1/3}\) vs. \(d\) Scaling
- Implementation and Extensions
- Preconditioning: Handling Poorly Scaled Distributions
- From Langevin to Hamiltonian Monte Carlo (HMC)
- Conclusion: The Modern Sampling Landscape
- Appendices
The Motivation: Why “Random” Isn’t Enough
MCMC is one of the most revolutionary and widely used tools in a statistician’s arsenal when it comes to exploring probability distributions. But despite this it is deceptively simple, and arguably doesn’t use all of the information available to it to speed up sampling. In this post we’re going to see one way we can use some of that information to do exactly that.
First, I’ll do a quick refresher on what Metropolis-Hastings MCMC is. If you are completely unfamiliar with Metropolis-Hastings MCMC I’d recommend my (completely unbiased) practical guide on MCMC, and especially the resources/references within, but I’ll cover the main points below.
Metropolis-Hastings MCMC (MHMCMC) comprises three things: a (typically unnormalised) target distribution, a proposal distribution, and however many iterations of the below algorithm you can bear to wait for.
Metropolis-Hastings Algorithm
- Initialise:
- Have a distribution you want to sample from (duh) \(f(x)\),
- manually create a starting point for the algorithm \(x_0\),
- pick a proposal distribution \(g(x\mid y)\) to sample from
- pick the number of samples you can be bothered waiting for \(N\)
- For each iteration \(n\)/Repeat \(N\) times
- Sample a new proposal point \(x^*\) from the proposal distribution centred at the previous sample, \(g(x^*\mid x_n)\)
- Calculate the acceptance probability \(\alpha\) given as \(\alpha = \frac{f(x^*)g(x_n\mid x^*)}{f(x_n)g(x^*\mid x_n)}\) (Here’s the change!)
- And if the acceptance probability is more than 1, cap it at 1.
- Generate a number, \(u\), from a uniform distribution between 0 and 1
- Accept or reject
- Accept if: \(u\leq\alpha\), s.t. \(x_{n+1} = x^*\)
- Reject if: \(u>\alpha\), s.t. \(x_{n+1} = x_n\)
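As a concrete reference point, here is a minimal NumPy sketch of the algorithm above, assuming a symmetric Gaussian proposal (so the \(g\) terms in the acceptance ratio cancel). `log_f` here is a placeholder for whatever unnormalised log-density you want to sample:

```python
import numpy as np

def metropolis_hastings(log_f, x0, n_samples, prop_scale=1.0, rng=None):
    """Random Walk Metropolis with a symmetric Gaussian proposal.

    log_f      : unnormalised log target density, log f(x) + const
    x0         : starting point (1D array)
    n_samples  : number of iterations N
    prop_scale : standard deviation of the Gaussian proposal
    """
    rng = np.random.default_rng() if rng is None else rng
    x = np.asarray(x0, dtype=float)
    log_fx = log_f(x)
    samples = np.empty((n_samples, x.size))
    for n in range(n_samples):
        # Propose from a Gaussian centred at the current sample.
        x_star = x + prop_scale * rng.standard_normal(x.size)
        log_fx_star = log_f(x_star)
        # Symmetric proposal => the g terms cancel; work in log space for stability.
        if np.log(rng.uniform()) <= log_fx_star - log_fx:
            x, log_fx = x_star, log_fx_star      # accept
        samples[n] = x                           # a rejection keeps the old point
    return samples

# Example: sample a 2D standard Gaussian, whose log density we know up to a constant.
chain = metropolis_hastings(lambda x: -0.5 * np.sum(x**2),
                            x0=np.zeros(2), n_samples=5_000, prop_scale=0.8)
```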
This amounts to exploring a given distribution in the manner indicated in the GIF below.

The target distribution is the distribution that we are trying to generate samples from, and all we need is a way to evaluate it up to a normalisation constant. The proposal distribution is then some distribution, built around the previous accepted value, that we sample to propose a new point (which we either accept or reject based on the density ratio with respect to the previous accepted sample). This amounts to something we call a Markov process: a discrete-time stochastic (random) process (we will revisit this later) where every sample/event depends only on the previous one.
This is why Metropolis MCMC is sometimes referred to as a type of “Gaussian random walk” (in the case of a Gaussian proposal distribution) or Random Walk Metropolis (RWM), which is how I will more generally refer to the above type of MCMC algorithm.
These algorithms have been stupidly successful, being used for everything from investigating the properties of stars in other galaxies to how the share prices of toy companies will vary in a given month. Their key benefit is the ability to partially side-step the curse of dimensionality that arises if one were to naively grid up every dimension of a distribution one wanted to investigate.
RWM works great for moderate-dimensional spaces (very roughly, from my experience, around 15 dimensions for non-pathological distributions) but of course cannot get around the curse entirely.
There are two key issues that RWM struggles with in higher-dimensional spaces: the way volume behaves in these spaces, and the general inefficiency of the process as a function of dimension.
Pathological Volumes and Shells
In high-dimensional spaces, very basic things one would presume about volume and space break down. A particularly fun example, regarding spheres escaping the boxes that contain them, is explained in this Numberphile video.
The particular issue in the case of RWM algorithms is that the typical set, the region where the probability mass (for our purposes here, synonymous with infinitesimal integrated chunks of probability density) is concentrated, ends up exponentially concentrated in a thin shell about a sphere centred on the origin.
I showed the converse of this in my post on hyperspherical and hyperbolic VAEs, where if you sample uniformly on the unit sphere in high dimensions (~500) the marginal distributions you get out (the corner plot) are indistinguishable from those of a Gaussian of similar dimension.

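If you want a quick sanity check of this concentration behaviour (a throwaway sketch of my own, not the corner plot from that post), you can draw points uniformly on the unit sphere in \(d = 500\) dimensions by normalising Gaussian draws, check that a single coordinate behaves like a zero-mean Gaussian with variance \(1/d\), and see that the radii of standard Gaussian samples pile up in a thin shell around \(\sqrt{d}\):

```python
import numpy as np

d, n = 500, 20_000
rng = np.random.default_rng(0)

# Uniform samples on the unit (d-1)-sphere: normalise standard Gaussian draws.
x = rng.standard_normal((n, d))
x /= np.linalg.norm(x, axis=1, keepdims=True)

# A single coordinate's marginal is close to N(0, 1/d) in high dimensions.
print("empirical variance of first coordinate:", x[:, 0].var())
print("1/d =", 1 / d)

# Meanwhile, the mass of a d-dimensional standard Gaussian concentrates in a
# thin shell of radius ~ sqrt(d) -- the "typical set" for that target.
g = rng.standard_normal((n, d))
norms = np.linalg.norm(g, axis=1)
print("Gaussian radii: mean =", norms.mean(), " std =", norms.std(), " sqrt(d) =", np.sqrt(d))
```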
So you get this kind of Faustian bargain where you can either:
- Make the step size/width of the proposal small, so that samples that are close to/on the sphere stay close to/on the sphere. But then exploring the whole space/distribution will be very inefficient, making the process take exponentially longer.
- Make the step size/width of the proposal large, so you can quickly explore the whole space, but then only a small fraction of proposals land in the typical set, making the process take exponentially longer….
Walking home drunk is not an efficient way to get home (kinda)
Closely linked to the above issue is how the efficiency of the stochastic process (sometimes referred to as a drunken-walk process) scales with dimension. For a \(d\)-dimensional distribution, assuming the optimal target acceptance rate of ~0.234 (23.4%) (refs: Optimal scaling of discrete approximations to Langevin diffusions - Roberts, Rosenthal (1998); Investigating the Efficiency of the Metropolis Algorithm - Dou, Li, Wang (2024)), the optimal step size for RWM scales as \(d^{-1/2}\) and the number of steps required scales as \(\mathcal{O}(d)\).
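To make this concrete, here is a rough, self-contained NumPy sketch (my own illustration, not taken from the references above) that estimates the RWM acceptance rate on a standard Gaussian target, once with a fixed proposal scale and once with the classic asymptotically optimal scale \(\sigma \approx 2.38/\sqrt{d}\) from the optimal-scaling literature:

```python
import numpy as np

rng = np.random.default_rng(0)

def rwm_acceptance(d, sigma, n_iter=10_000):
    """Empirical RWM acceptance rate on a d-dimensional standard Gaussian target."""
    log_f = lambda z: -0.5 * np.sum(z**2)   # log density up to a constant
    x = np.zeros(d)
    log_fx = log_f(x)
    accepts = 0
    for _ in range(n_iter):
        x_star = x + sigma * rng.standard_normal(d)     # symmetric Gaussian proposal
        log_fx_star = log_f(x_star)
        if np.log(rng.uniform()) <= log_fx_star - log_fx:
            x, log_fx, accepts = x_star, log_fx_star, accepts + 1
    return accepts / n_iter

for d in (2, 10, 50, 200):
    fixed  = rwm_acceptance(d, sigma=1.0)                 # fixed step: acceptance collapses
    scaled = rwm_acceptance(d, sigma=2.38 / np.sqrt(d))   # step ~ d^{-1/2}: stays near ~0.23
    print(f"d={d:4d}   fixed step: {fixed:.3f}   scaled step: {scaled:.3f}")
```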
Viewing this as a diffusion process, where the efficiency relates to how long it takes to reach an equilibrium distribution (or a good enough approximation of one), leaves the door open for us to wonder whether other dynamics/processes could scale significantly better.
One class of algorithms that achieves this is the so-called “Langevin algorithms”, which model the process via Langevin dynamics. Roberts & Rosenthal (1998) showed that the optimal step size for Langevin-based proposals scales as \(d^{-1/3}\) and the number of steps as \(\mathcal{O}(d^{1/3})\)! For high-dimensional sampling this is a complete game changer when it comes to efficiency.
A sneak peek at the goal of the post: An intuitive picture of Langevin Monte Carlo
Imagine the sampling trajectories of MCMC as people trying to find their way home after a long night out. They are in their neighborhood, but they are delirious and have no sense of direction.
The RWM drunkards have lost their glasses in the mayhem of the night, and thus have completely failing eyesight. They pick a house at random nearby, stumble to the door, and are then close enough to check the house number on the door (the probability density). If that number is closer to their own address, they feel a sense of “positive feedback” and are likely to stay there for a moment. But then, forgetting everything, they stumble toward another random house. As this process repeats they have an unconscious positive-feedback ‘pull’ towards the right house number. They are essentially “diffusing” through the neighborhood. Because they don’t know which way the numbers are increasing or decreasing, they spend a massive amount of time walking in circles or heading toward the wrong end of the street.
The Langevin (LMC) Drunkard is just as tired, but they’ve managed to keep their glasses on. They can actually read the street signs. They notice the house numbers are increasing as they walk East, so they purposefully start walking East (the Drift). They are still drunk, so they still trip over the curb or weave side-to-side randomly (the Diffusion), but their average movement is a direct line toward their front door.
In a 1D street, the RWM drunkard might eventually find home. But imagine a 100-dimensional neighborhood (imagine Los Angeles in Blade Runner 2049 on steroids).
The RWM drunkard is almost certain to take a step that leads them away from their house because there are so many “wrong” directions to choose from.
The LMC drunkard, even with a slight stagger, has a compass (the Gradient). They ignore the 99 wrong directions and focus their energy on the one direction that actually matters.
More formally, LMC introduces a drift into the proposal distribution (or gradient flow) using the gradient of the log of the target probability density, but otherwise uses the same accept-reject step as the typical MHMCMC algorithm. And even more formally … is the rest of this post.
Below is a GIF showing how RWM and LMC converge on a target (same step size, set so that RWM is purposefully bad, but even then the LMC is fine).

We will try to explain, motivate, and show how we construct the algorithm below, and that it does in fact have an equilibrium distribution matching our target.
Metropolis-Adjusted Langevin Algorithm
- Initialise:
- Have a target density you want to sample from \(f(x)\) (known up to a normalising constant),
- Define the potential energy as the negative log-density: \(U(x) = - \log f(x)\),
- Manually choose a starting point \(x_0\),
- Pick a step size \(\epsilon > 0\),
- Pick the number of samples to generate \(N\).
- For each iteration \(n\) / Repeat \(N\) times:
- Langevin proposal steps:
- Compute the gradient of the log-density at the current state: \(\nabla \log f(x_n)\)
- Generate a proposal point \(x^*\) using a drift term plus noise: \(x^* = x_n + \frac{\epsilon}{2} \nabla \log f(x_n) + \sqrt{\epsilon} \eta\) where \(\eta \sim \mathcal{N}(0, I)\)
- Metropolis Hastings correction steps:
- Compute the proposal densities: \(q(x^* \mid x_n) = \mathcal{N}\left(x^* ; x_n + \frac{\epsilon}{2} \nabla \log f(x_n),\epsilon I\right)\) \(q(x_n \mid x^*) = \mathcal{N}\left(x_n ; x^* + \frac{\epsilon}{2} \nabla \log f(x^*),\epsilon I\right)\)
- Compute the acceptance probability: \(\alpha = \min\left(1, \frac{f(x^*) q(x_n \mid x^*)}{f(x_n) q(x^* \mid x_n)}\right)\)
- Draw a random number \(u \sim \text{Uniform}(0, 1)\)
- Accept or reject:
- If \(u \le \alpha\), set \(x_{n+1} = x^*\)
- If \(u > \alpha\), set \(x_{n+1} = x_n\)
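To tie the listing together, here is a minimal NumPy sketch of MALA following the steps above; `grad_log_f` is a placeholder for however you compute \(\nabla \log f\) (analytically, or with autodiff):

```python
import numpy as np

def mala(log_f, grad_log_f, x0, n_samples, eps=0.1, rng=None):
    """Metropolis-Adjusted Langevin Algorithm, as listed above.

    log_f      : unnormalised log target density  log f(x)
    grad_log_f : gradient of the log target density
    eps        : step size (epsilon)
    """
    rng = np.random.default_rng() if rng is None else rng

    def log_q(x_to, x_from):
        # log N(x_to ; x_from + (eps/2) grad_log_f(x_from), eps I), up to a constant
        mean = x_from + 0.5 * eps * grad_log_f(x_from)
        return -np.sum((x_to - mean) ** 2) / (2.0 * eps)

    x = np.asarray(x0, dtype=float)
    samples = np.empty((n_samples, x.size))
    for n in range(n_samples):
        # Langevin proposal: drift along the gradient of log f, plus Gaussian noise.
        x_star = x + 0.5 * eps * grad_log_f(x) + np.sqrt(eps) * rng.standard_normal(x.size)
        # Metropolis-Hastings correction with the (asymmetric) Langevin proposal.
        log_alpha = (log_f(x_star) + log_q(x, x_star)) - (log_f(x) + log_q(x_star, x))
        if np.log(rng.uniform()) <= log_alpha:
            x = x_star                            # accept
        samples[n] = x                            # a rejection keeps the old point
    return samples

# Example: a 10-dimensional standard Gaussian target.
chain = mala(lambda x: -0.5 * np.sum(x**2),       # log f up to a constant
             lambda x: -x,                         # grad log f
             x0=np.zeros(10), n_samples=5_000, eps=0.3)
```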