Practical Intro to the Metropolis-Hastings Algorithm/Fitting a line II

17 minute read

Published:

In this post I’m going to try to give an intuitive intro into the Metropolis-Hastings algorithm without getting bogged down in much of the math to show the utility of this method.

Metropolis-Hastings (or more truthfully the subsequent MCMC field) is one of the most successful analytical methods that statisticians have ever used and is the bench mark for all future analysis methods that we will explore. I am basing this tutorial on various sources such as:

If you don’t like/mesh well with how I explain the concepts below, absolutely head over to any of these resources. They do widely vary in rigor and presentation method but are all great.

And the style of my GIFs is inspired by this video PDF Sampling: MCMC - Metropolis-Hastings algorithm - Ridlo W. Wibowo

Table of Contents

Motivation

Brute force? In this economy?

In the last post we brute forced our analysis by scanning the whole entire parameter space for areas of high posterior probability. Here’s a quick recape.

  1. We were given some data that I told you was from a straight line with some added gaussian noise

    Straight Line Data
  2. We then quantified a likelihood and prior based on this information and produced something like this colormap of the posterior probabilities of the input parameters

    2D Brute forced posterior on our gradient and intercept

In most scenarios this is infeasible (or at least expensive) and we need something more sophisticated to explore regions of high probabilty and somehow get some representation of our posterior. So what could we do instead?

What do we actually get from our analysis?

If our goal is to get something representative of the posterior we generally want to do 1 or more of the following:

  1. Guess where the most optimal set of parameter values are based on our data is (not unique to posterior inference, can just be done with optimisation)
  2. Generate further values that reflect parameter uncertainties
  3. Quantify uncertainty
  4. Compare models via evidence values (normalisation for the posterior)

So whatever other method we use to generate something representative of the posterior we need to keep these in mind1.

You are already familiar with another way that we often represent the results of analysis, that is through samples. For example, sure I could tell you that for model X parameter Y follows a normal distribution with mean of 1 and standard deviation of 0.5. (First of all this only works if our result has a functional representation and we presume our audience even knows what a normal distribution is) Or, we could show them a histogram of our results like the following.

from scipy.stats import norm
from matplotlib import pyplot as plt
import numpy as np


samples = norm(scale=0.5, loc=1).rvs(100000)


fig, ax = plt.subplots(1,1, figsize=(4,4), dpi=200)
ax.hist(samples, bins=48, label="Y samples", density=True)
ax.set_xlabel("Y")
ax.legend()
plt.show()
Samples of a theoretical Y parameter

Samples!?

Now if you tell anyone off the street that the parameter follows this distribution, they would be able glean most of the key information (although I wouldn’t recommend pulling people of the street and asking them random statistics questions btw…you often don’t get the best responses. Unless the question is how many times can someone threaten your life in ~10 seconds, in which it’s extremely informative…).

  1. You can see where the mode of the distribution is
  2. We could generate further samples of variables by using the samples in this distribution and see the result
  3. You can see the spread of the distribution on the parameter to understand our uncertainty on it. (We can also construct credible intervals through various methods to get quantitative values for our uncertainties )

And from these samples we’ve also shown essentially the same information a our first brute scan colormap. What we require are a set of samples representative of the same probability density! This both reduces the initial computation cost, as we don’t have to explore regions of the parameter space where the probabilities are extremely small, and in the storage of the result, as we will likely only require \(\lesssim 100,000\) samples/numbers to show this.

We don’t explicitly have a way to get evidence values for model comparison from this, but we can tackle this later if/when we look at Nested Sampling.

How do we sample our posterior?

Now, unlike the above normal distribution, our function isn’t normalised, we have the top part of the fraction that makes up Bayes’ theorem.

\[\begin{align} p(\vec{\theta}\mid\vec{d}) = \frac{\mathcal{L}(\vec{d}\mid\vec{\theta})\pi(\theta)}{\mathcal{Z}(\vec{d})} \end{align}\]

But from our perspective, \(\vec{d}\) is a constant, so you can say,

\[\begin{align} p(\vec{\theta}\mid\vec{d}) \propto \mathcal{L}(\vec{d}\mid\vec{\theta})\pi(\theta). \end{align}\]

Now you might want to calculate \(\mathcal{Z}(\vec{d})\) directly but you would run into the same issues (and worse) when we tried to directly scan the posterior to begin with (this is further explored in the Section 4 of Speagle’s introduction). So we need some method where we can sample something proportional to a probability density.

Metropolis-Hastings! - Having dessert before dinner

One answer to this is the Metropolis-Hastings algorithm. And like the heading I’m going to show you the end result so you can see why what we’re doing is so cool.

Using the unnormalised posterior function we used in the last post on line fitting, I’m going to create a pretty small function that will almost magically generate sample our distribution.

import numpy as np
import matplotlib.pyplot as plt

# Metropolis-Hastings algorithm for 2D
def metropolis_hastings_function(posterior, data, num_samples, proposal_cov, start_point):
    samples = []
    current = np.array(start_point)
    for _ in range(num_samples):
        # Propose a new point using a 2D Gaussian
        proposal = np.random.multivariate_normal(current, proposal_cov)

        # Calculate acceptance probability
        acceptance_ratio = np.exp(posterior(*data, *proposal) -  posterior(*data, *current))

        # Cap at 1, gotta be a probability and probabilities can't be more than 1
            # Probability _densities_ can but those aren't _probabilities_ until you integrate them
        acceptance_ratio = min(1, acceptance_ratio)  

        # Accept or reject
        if np.random.rand() < acceptance_ratio:
            current = proposal

        samples.append(current)

    return np.array(samples)

# Parameters
num_samples = 50000
proposal_cov = [[0.1, 0], [0, 0.1]]  # Proposal covariance matrix
start_point = [m_true, c_true]  # Starting point for the chain

# Run Metropolis-Hastings
samples = metropolis_hastings_function(unnormalised_log_posterior_better, (y, X_true), num_samples, proposal_cov, start_point)
Comparison of brute force scan vs sampling with the Metropolis-Hastings algorithm

Magic! Now it isn’t the prettiest thing, and to be fair to it this algorithm is (recently, as of 2020) over 50 years old! And since then there have been a lot of improvements, but isn’t it amazing that with so little code we are able to do something like this? Additionally, with the samples there’s a handy package called corner. That allows us to have a closer look at our samples.

from corner import corner
default_kwargs = {'smooth': 0.9,
    'label_kwargs': {'fontsize': 16},
    'title_kwargs': {'fontsize': 16},
    'color': 'tab:blue',
    'truth_color': 'k',
    'levels': (0.3934693402873666,
    0.8646647167633873,
    0.9888910034617577,
    0.9996645373720975,
    0.999996273346828),
    'plot_density': True,
    'plot_datapoints': False,
    'fill_contours': True,
    'max_n_ticks': 4,
    'hist_kwargs':{'density':True},
    "smooth":0.9,
}

fig = plt.figure(figsize=(6, 6), dpi=200)
corner(samples, 
    fig=fig,
    bins=36,
    truths=[m_true, c_true,],
    titles=[r"$m$", r"$c$"],
    show_titles=True,
    **default_kwargs)
plt.show()
Corner plot showing our samples of the unnormalised posterior function using the Metropolis-Hastings algorithm

The top left and bottom right plots in the above show the marginal distributions of the relevant parameters. These plots effectively show distributions once you take the explicit dependence of the other variable(s) out2. This is extremely useful when summarising our results by text or by word we can’t show the plot, so we typically summarise our findings to one variable at a time, and further simplify it by asking what the width of the highest density region that contains the same area as a normal distribution between \(1\sigma\), \(2\sigma\), \(3\sigma\) (68%, 95%, 99.7%) or etc.

The algorithm

So here are the steps in plain-ish english.

Metropolis Algorithm

  1. Initialise:
    1. Have a distribution you want to sample from (duh)
    2. \(f(x)\), manually create a starting point for the algorithm,
    3. pick a symmetric distribution \(g(x\mid y)\) to sample from
      • like a gaussian with a fixed covariance matrix such that \(g(x\mid y)=g(y\mid x)\)
    4. pick the number of samples you can be bothered waiting for \(N\)
  2. For each iteration \(n\)/Repeat \(N\) times
    1. Sample a new proposal point \(x^*\) from the symmetric distribution centred at the previous sample
    2. Calculate the acceptance probability \(\alpha\) given as \(\alpha = f(x^*)/f(x_n)\)
      • And if the acceptance probability is more than 1, cap it at 1.
    3. Generate a number, \(u\), from a uniform distribution between 0 and 1
    4. Accept or reject3
      1. Accept if: \(u\leq\alpha\), s.t. \(x_{n+1} = x^*\)
      2. Reject if: \(u>\alpha\), s.t. \(x_{n+1} = x_n\)

The first thing to notice is that we don’t require the normalisation of our function as it just depends on the ratios, and the normalisation cancels itself out. The second is that it doesn’t have many steps despite being able to do quite a lot.

Now this is the point that I want to show you an animation of the process, but I don’t want to do that with the 2D distribution as it’s slow enough as it is and it would be annoying figuring out how to nicely visualise accepting or rejecting a sample. So forgive me for doing this for a single dimensional gaussian that I made up on the spot.

GIF showing the process of a Metropolis algorithm

Slowing it down and having a look at the accepting conditions.

GIF showing the process of a Metropolis algorithm

You will notice that in the GIF, there are some samples off to the side that don’t seem to fit the distribution. We refer to this as the “burn-in phase”, it is a sequence of samples at the beginning of MCMC sampling (not specific to Metropolis or the Metropolis-Hastings) where the chain of samples hasn’t reached the key part of the distribution yet. When doing your own MCMC sampling you should be sure to throw away a few samples at the beginning4.

Very Normal had a great analogy for this process, because it may not be immediately intuitive why simply asking the ratio of two probabilities at a time allows us to construct the full probability distribution.

Let’s say you wanted to undertake the average distribution of activities that Melburnian’s undertake every week.

  • You go to an activity around Melbourne that you think Melburnian’s undertake (initialisation)
  • You do the activity
  • After spending an amount of time at the activity you randomly pick an activity among a list of nearby ones that you are sure covers the whole range of activities that Melburnians undertake
  • You then ask one of the natives whether they think they spend more time at the current activity or the random one you picked
  • With a probability of the ratio of how much time they spend at each activity you either stay at your current activity or go to the new one
  • Repeat

And eventually, even if you didn’t pick an activity that was very good, you will eventually be lead to the “good” activities (equilibrium distribution) and start to do the same activities as typical Melburnian’s do despite only ever comparing two choices at a time “stay” or “next”. However, if this decision process only allows transitions between certain activities (e.g., people who go to cafes only talk to others at cafes), then some activities might be overrepresented while others remain underexplored5.

This is an issue for the Metropolis algorithm6, which is what I detailed above, but not the generalisation of the algorithm by Wilfred Keith Hastings the Metropolis-Hastings algorithm.

For the Metropolis algorithm to work we presume that the proposal distribution, \(g\), is symmetric, i.e. \(g(x\mid y)=g(y\mid x)\), but this can be restrictive. Some distributions that may produce faster convergence or better fit a particular posterior setup may not be symmetric. So Hastings generalised the result to allow this, by modifying the acceptance probability from \(\alpha = f(x^*)/f(x_n)\) to \(\alpha = \frac{f(x^*)g(x_n\mid x^*)}{f(x_n)g(x^*\mid x_n)}\) which accounts for any assymetry in \(g\) (I’ll go into more detail in a later post).

Metropolis-Hastings Algorithm

  1. Initialise:
    1. Have a distribution you want to sample from (duh)
    2. \(f(x)\), manually create a starting point for the algorithm,
    3. pick a symmetric distribution \(g(x\mid y)\) to sample from
      • like a gaussian with a fixed covariance matrix such that \(g(x\mid y)=g(y\mid x)\)
    4. pick the number of samples you can be bothered waiting for \(N\)
  2. For each iteration \(n\)/Repeat \(N\) times
    1. Sample a new proposal point \(x^*\) from the syymetric distribution centred at the previous sample
    2. Calculate the acceptance probability \(\alpha\) given as \(\alpha = \frac{f(x^*)g(x_n\mid x^*)}{f(x_n)g(x^*\mid x_n)}\) (Here’s the change!)
      • And if the acceptance probability is more than 1, cap it at 1.
    3. Generate a number, \(u\), from a uniform distribution between 0 and 1
    4. Accept or reject3
      1. Accept if: \(u\leq\alpha\), s.t. \(x_{n+1} = x^*\)
      2. Reject if: \(u>\alpha\), s.t. \(x_{n+1} = x_n\)

Now we’ll see when an asymmetric proposal distribution can do better than a symmetric one, we’ll look at the two algorithms side-by-side7.

For the distribution that we’re trying to model we’ll use the Gamma distribution with \(\alpha=2\) and our two proposal distributions will be a normal distribution with a standard deviation of 0.5 for our symmetric distribution and a log-normal distribution with \(\sigma=0.5\) for our asymmetric distribution.

Additionally, arviz can estimate the number of effective samples.

GIF showing a comparison between sampling a gamma distribution using MCMC with a symmetric vs asymmetric proposal distribution

You can see that the asymmetric proposal distribution is able to get the core shape of the distribution quicker than the symmetric distribution. Especially if you look at the left edge of the distribution and the tail passed 7 or so.

Viva le estadistica8!… (looks up noun gender of statistics in Spanish)… Viva la estadistica9!

Coding it up … again

Just for completeness I’ll copy-paste my implementation of the Metropolis-Hastings algorithm here as well.

def metropolis_hastings(
        target_pdf, 
        proposal_sampler, 
        proposal_pdf, 
        num_samples=5000, 
        x_init=1.0):

    samples = [x_init]

    x = x_init

    for _ in range(num_samples):

        # Propose a new sample
        x_prime = proposal_sampler(x) 

        # Just for the gamma distribution to ensure positivity. 
            # I couldn't get it to work nicely without this
            # If _you_ wanna use this for something else, I would take this 
            # bad boi out or replace it with a relevant constraint for the 
            # distribution of interest
        if x_prime <= 0:  
            continue

        # Compute acceptance ratio \alpha = f(x^*)/f(x_n) * g(x_n|x^*)/g(x_^*|x_n)
        p_accept = (target_pdf(x_prime) / target_pdf(x)) \
        * proposal_pdf(x, scale=x_prime)/proposal_pdf(x_prime, scale=x)

        # Cap it, bop it, twist it
        p_accept = min(1, p_accept)

        # Simulate random process of accepting the proposed sample with probability p_accept
        if np.random.rand() < p_accept:
            x = x_prime
        # else: x = x

        samples.append(x)

    return np.array(samples)

A quick note

I’ve gone through the Metropolis-Hasting algorithm here and told you about burn in, but there are other criteria for whether you should trust your samples/diagnostics. I’m going to leave that for another time as you will likely never use Metropolis-Hastings in practice but other algorithms, so I’ll talk about diagnostics there and after I introduce MCMC as a concept in general (which I haven’t so far).

Well… now what?

Next I’m going to attempt to explain what MCMC is in general and why detailed balance as a property of our MCMC algorithms is important and how we can quantify how we can judge whether our samples have converged and later to maybe give a general intro into the now much more commonly used NUTS algorithm that most MCMC python packages like emcee, pyMC and many others use.


  1. And if you don’t need evidence values or rigorous understanding of uncertainties… just optimise my guy. In the approximate works of Andy Casey, don’t burn down forests just so you can look cool using an MCMC sampler. 

  2. There is kind of still a dependence but it’s through the model itself, and importantly, it’s not dependent on any particular value of the other variable(s) (as stated). You can also think of it as taking the average of distributions of the relevant variable over the other parameter(s). 

  3. The process of comparing the acceptance probability to the uniform sample is to simulate a random process where the probability of accepting the proposal is \(\alpha\)  2

  4. There is no hard and fast rule for this that will work every time but if you’re new to MCMC I would start with ~10% of your samples and then wiggle that percentage around for each problem. You want to maximise the number of samples you have in your distribution but you don’t want bad ones. 

  5. Trying to essentially have a common sense explanation of detailed balance. Please feel free to suggest another short and plain English way to explain this. 

  6. by Nicholas Metropolis(one of the best names ever btw), Arianna W. Rosenbluth, Marshall Rosenbluth, Augusta H. Teller and Edward Teller 

  7. Although special note, they aren’t really “two” algorithms it’s just that the Metropolis is a specific case of the Metropolis-Hastings 

  8. statistics 

  9. statistics with correct grammar