Note: imported from an earlier version of this blog. Sorry if anything is broken.

I’m writing this post because I feel like any computational scientist should have a favorite algorithm, and in my research I have run into this particular algorithm again and again, and every time it impresses me. I am also writing this because I googled “What is your favorite algorithm” and was surprised to find that no one mentioned it in the top results. Few algorithms in computational physics have the scope and power of the Metropolis-Hastings algorithm. Briefly: given a domain $\mathcal D$ and a function $f(\mathbf x)$ with bounded integral over that domain, the MH-algorithm samples $n$ points from the probability distribution proportional to $f(\mathbf x)$. This subtle technique can solve very difficult problems: the design of engines, rockets, generators, nuclear weapons, and reactors; plasma physics; the simulation of thermodynamic systems and the transport of light in Computer Graphics; the computation of partition functions of RNA macrostates and their transition states; and the computation of posterior distributions in Bayesian statistics (along with a family of statistical tools that I know relatively less about). In this post I plan to describe the algorithm, Monte Carlo integration, and some example applications.

The Algorithm

Let’s say we have some function $f(\mathbf x)$ over some domain $\mathcal D$ such that

$$\int_{\mathcal D} f(\mathbf x)\, d\mathbf x = C.$$

For example, $f(x)$ could be $e^{-x^2}$ with $\mathcal D$ the real line $(-\infty, \infty)$, or the uniform density $U(x) = 1$ on the interval $[0, 1]$:

$$\int_{-\infty}^\infty e^{-x^2}\, dx = \sqrt{\pi},$$

$$\int_0^1 dx = 1,$$

or some higher-dimensional function with a corresponding higher-dimensional domain. It turns out we can generate a list of samples from the probability distribution $p$ proportional to this function,

$$p(\mathbf x) = \frac{1}{C}\, f(\mathbf x),$$

(for our examples $C = \sqrt{\pi}$ and $1$, respectively) by following a simple procedure: start at some initial state $\mathbf x_0$ and generate the output, a list of samples $\mathbf x_0, \mathbf x_1, \dots, \mathbf x_n$, by repeatedly drawing $\mathbf x_{n+1}$ from some transition probability distribution (that you choose) determined by $\mathbf x_n$: $P_t(\mathbf x_n, \mathbf x_{n+1})$.

The distribution of these samples will converge to $p(\mathbf x)$ provided $P_t(\mathbf x_n, \mathbf x_{n+1})$ satisfies two conditions: ergodicity, which means that every state can eventually reach any other state, and detailed balance. If there are $N(\mathbf x)$ samples in state $\mathbf x$ at some point in the chain, the number of samples in state $\mathbf y$ created by transitions from $\mathbf x$ is expected to be $N(\mathbf x) P_t(\mathbf x, \mathbf y)$. The difference $N(\mathbf x) P_t(\mathbf x, \mathbf y) - N(\mathbf y) P_t(\mathbf y, \mathbf x)$ is called the transition rate from $\mathbf x$ to $\mathbf y$, because it is the expected net flow of samples from $\mathbf x$ to $\mathbf y$ (or the other way, if it is negative). Detailed balance holds if the transition rates between all pairs of states are zero once the samples have reached the correct probability distribution; with zero net flow of probability, the distribution is expected to stay the same. Therefore, if we have sampled $N_T$ states and want to reach the distribution $p(\mathbf x)$, so that $N(\mathbf x) = N_T p(\mathbf x)$ and $N(\mathbf x) P_t(\mathbf x, \mathbf y) = N(\mathbf y) P_t(\mathbf y, \mathbf x)$, our transition probabilities $P_t$ should be constrained by:

$$p(\mathbf x_n) P_t(\mathbf x_n, \mathbf x_{n+1}) = p(\mathbf x_{n+1}) P_t(\mathbf x_{n+1}, \mathbf x_n).$$

From here we can break $P_t(\mathbf x_1, \mathbf x_2)$ down into the product of two distributions: $T(\mathbf x_1, \mathbf x_2)$, the probability of proposing a transition from $\mathbf x_1$ to $\mathbf x_2$, and $A(\mathbf x_1, \mathbf x_2)$, the probability of accepting that transition. We do this to keep the definition of $T$ flexible by letting $A$ handle the constraint, solving for $A$ given a general $T$:

$$p(\mathbf x_1) P_t(\mathbf x_1, \mathbf x_2) = p(\mathbf x_2) P_t(\mathbf x_2, \mathbf x_1)$$

$$p(\mathbf x_1) T(\mathbf x_1, \mathbf x_2) A(\mathbf x_1, \mathbf x_2) = p(\mathbf x_2) T(\mathbf x_2, \mathbf x_1) A(\mathbf x_2, \mathbf x_1)$$

$$\frac{A(\mathbf x_1, \mathbf x_2)}{A(\mathbf x_2, \mathbf x_1)} = \frac{p(\mathbf x_2) T(\mathbf x_2, \mathbf x_1)}{p(\mathbf x_1) T(\mathbf x_1, \mathbf x_2)} = \frac{f(\mathbf x_2) T(\mathbf x_2, \mathbf x_1)}{f(\mathbf x_1) T(\mathbf x_1, \mathbf x_2)}$$

One choice that satisfies this equation is

$$A(\mathbf x_1, \mathbf x_2) = \min \left\{ \frac{f(\mathbf x_2) T(\mathbf x_2, \mathbf x_1)}{f(\mathbf x_1) T(\mathbf x_1, \mathbf x_2)}, 1 \right\}.$$

With $A$ chosen to satisfy this constraint, we can construct a mutation function that samples from $T$ to propose a new element, build the chain by accepting or rejecting each proposal, and the samples’ distribution will converge to $p(\mathbf x)$.
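As a concrete sketch of this recipe (the function names and the toy target $e^{-x^2}$ here are my own, not part of the derivation above), here is a minimal one-dimensional sampler with a Gaussian proposal. The Gaussian step is symmetric, so the $T$ terms cancel out of $A$:

```r
## Minimal Metropolis-Hastings sketch: sample from the distribution
## proportional to f. A Gaussian step is symmetric, T(x, y) = T(y, x),
## so the acceptance probability reduces to min(f(y)/f(x), 1).
f = function(x) exp(-x^2)   # unnormalized target; normalizes to N(0, 1/2)

metropolis = function(f, x0, n, step = 1) {
  samples = numeric(n)
  x = x0
  for (i in 1:n) {
    proposal = x + rnorm(1, 0, step)
    if (runif(1) < min(f(proposal) / f(x), 1)) {
      x = proposal
    }
    samples[i] = x
  }
  samples
}

set.seed(1)
s = metropolis(f, 0, 5e4)
mean(s)   # should land near 0
var(s)    # should land near 1/2
```

The same skeleton underlies the two-dimensional example in the next section; only the target and the proposal change.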

Example

Let’s say we want to sample from a difficult distribution on the $(x, y)$-plane, perhaps the one proportional to

$$f(x, y) = \frac{\sin^2(r)}{r^3} = \frac{\sin^2(\sqrt{x^2 + y^2})}{(x^2 + y^2)^{3/2}}.$$

Since this distribution is pretty difficult to sample directly, we use Metropolis-Hastings instead. Let’s construct a mutation function that chooses a direction uniformly and moves a distance $d$ in that direction with probability density $p_d(d) = e^{-d}$. This proposal depends only on the distance between the two points, so $T(\mathbf x_1, \mathbf x_2) = T(\mathbf x_2, \mathbf x_1)$ and the $T$ terms cancel. Our acceptance probability should therefore be:

$$A(\mathbf x_1, \mathbf x_2) = \min \left\{ \frac{f(\mathbf x_2)\, e^{-d}/2\pi}{f(\mathbf x_1)\, e^{-d}/2\pi}, 1 \right\} = \min \left\{ \frac{\sin^2(\sqrt{x_2^2 + y_2^2})\, (x_1^2 + y_1^2)^{3/2}}{\sin^2(\sqrt{x_1^2 + y_1^2})\, (x_2^2 + y_2^2)^{3/2}}, 1 \right\}$$

This is all we need to know to implement a sampling algorithm in R:

## Metropolis Hastings demonstration by Mike Flynn
## Warning: This code has a runtime of around 4 hours on my machine
samples = matrix(nrow = 10^6, ncol = 2)
currentSample = c(0.001,0.001)
set.seed(44)
for(i in 1:(10^8)) {
  ## choose direction and distance for transition
  direction = runif(1, 0, 2*pi)
  ## rate-2 exponential step; the exact rate doesn't matter, since this
  ## proposal is symmetric and cancels out of the acceptance probability
  distance = rexp(1, 2)

  ## current point
  x1 = currentSample[1]
  y1 = currentSample[2]

  ## compute next point
  x2 = x1 + distance*cos(direction) 
  y2 = y1 + distance*sin(direction)

  ## accept?
  accept = min(sin(sqrt(x2^2 + y2^2))^2*(x1^2 + y1^2)^(3/2)/
               (sin(sqrt(x1^2 + y1^2))^2*(x2^2 + y2^2)^(3/2)),1)
  if(accept > runif(1, 0, 1)) {
    currentSample = c(x2, y2)
  } else {
    currentSample = c(x1, y1)
  }
  ## only take 1 of every 100 to reduce autocorrelation
  if(i %% 100 == 0) {
    samples[i/100,] = currentSample
  }
}

library(ggplot2)
library(ggthemes)
plotdat = data.frame(x = samples[,1], y = samples[,2])
## build the plot; print(p) renders it
p = ggplot(data = plotdat, aes(x = x, y = y)) + geom_point(alpha = .05, size = .01) + theme_bw() +
  xlim(-6*pi, 6*pi) + ylim(-6*pi, 6*pi)
print(p)

which outputs this picture:

Metropolis Hastings sample plot

This plot looks a lot like interference fringes to me, which pleases me as a physicist. Of course, this example is not completely contrived: interference fringes sometimes roughly follow this distribution. More importantly, the samples produced by the Metropolis-Hastings algorithm do seem to match the probability distribution proportional to $\sin^2(r)/r^3$.

Monte Carlo Integration and Importance Sampling

Monte Carlo integration is the process of evaluating an integral by sampling randomly from the domain and averaging. Often Monte Carlo integration of $f(\mathbf x)$ will make use of the expectation-value identity for $n$ samples $\mathbf x_i$ drawn from a probability distribution $p(\mathbf x)$ on the same domain:

$$\frac{1}{n} \sum_{i=1}^n \frac{f(\mathbf x_i)}{p(\mathbf x_i)} \xrightarrow{\text{converges to}} E\left[ \frac{f(\mathbf x)}{p(\mathbf x)} \right] = \int_{\mathcal D} \frac{f(\mathbf x)}{p(\mathbf x)}\, p(\mathbf x)\, d\mathbf x = \int_{\mathcal D} f(\mathbf x)\, d\mathbf x.$$

For example, we might evaluate the integral $\int_1^3 e^{-x}\, dx$ by randomly sampling $x_i$ from the uniform density on $[1, 3]$, letting $f(x) = e^{-x}$ and $p(x) = 1/(3-1)$, and averaging:

$$\frac{1}{n} \sum_{i=1}^n \frac{e^{-x_i}}{1/(3-1)} \xrightarrow{\text{converges to}} \int_1^3 \frac{e^{-x}}{1/(3-1)} \cdot \frac{1}{3-1}\, dx = \int_1^3 e^{-x}\, dx.$$
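This is easy to check numerically (a throwaway snippet of mine, not part of the main example):

```r
## Monte Carlo estimate of the integral of e^{-x} over [1, 3],
## compared against the exact antiderivative value.
set.seed(7)
n = 1e5
x = runif(n, 1, 3)                # samples from p(x) = 1/(3 - 1)
estimate = mean(exp(-x) / (1/2))  # average of f(x_i) / p(x_i)
exact = exp(-1) - exp(-3)
estimate   # should land close to exact, about 0.318
```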

The closer $p(\mathbf x)$ gets to the shape of the integrated function, the faster the convergence. For example, if $p(\mathbf x) = f(\mathbf x)/N$, then the above derivation yields

$$\int_{\mathcal D} f(\mathbf x)\, d\mathbf x \approx \frac{1}{n} \sum_{i=1}^n \frac{f(\mathbf x_i)}{f(\mathbf x_i)/N} = \frac{nN}{n} = N.$$

No matter what $n$ is, this converges after the first sample. Of course this is cheating, because it assumes we already know the answer $N$; in general we won’t, or else we wouldn’t be doing the integral in the first place. However, the Metropolis-Hastings algorithm lets us sample from $f(\mathbf x)/N$ without explicitly knowing $N$. We can use this in many creative ways, stemming from the fact that the expectation value of any statistic $g$ on these samples is $\frac{1}{N} \int g(\mathbf x) f(\mathbf x)\, d\mathbf x$.
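To see the effect of matching $p$ to the shape of $f$, compare the spread of the per-sample estimates $f(\mathbf x_i)/p(\mathbf x_i)$ under a flat proposal and under a proposal exactly proportional to the integrand. This is my own illustrative snippet, reusing the $\int_1^3 e^{-x}\, dx$ example:

```r
## Estimating the integral of e^{-x} over [1, 3] two ways.
set.seed(11)
n = 1e5
exact = exp(-1) - exp(-3)

## Flat proposal p(x) = 1/2: the weights vary from sample to sample.
x1 = runif(n, 1, 3)
w1 = exp(-x1) / (1/2)

## Shaped proposal p(x) = e^{-x} / exact on [1, 3], sampled by inversion:
## every weight f(x)/p(x) collapses to the exact answer.
u  = runif(n)
x2 = -log(exp(-1) - u * exact)
w2 = exp(-x2) / (exp(-x2) / exact)

sd(w1)   # substantial spread
sd(w2)   # essentially zero
```

Both averages converge to the same integral, but the perfectly shaped proposal does so with zero variance, which is exactly the “cheating” case described above.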

Integrating the Example

Luckily enough, our example function can be integrated over the entire $(x, y)$-plane by hand:

$$\int_0^{2\pi} \int_0^\infty \frac{\sin^2(r)}{r^3}\, r\, dr\, d\theta = 2\pi \int_0^\infty \frac{\sin^2(r)}{r^2}\, dr = 2\pi \cdot \frac{\pi}{2} = \pi^2.$$

(The $r$ integral can be done via some complex analysis.) Does our Metropolis-Hastings result corroborate this? Consider a function $f(\mathbf x)$ and the results of the MH-algorithm on that function: $n$ samples from the distribution $p(\mathbf x) = f(\mathbf x)/N$. We are looking for $N$. Let’s say we come up with some function $U(\mathbf x)$ and take the following statistic:

$$\frac{1}{n} \sum_{i=1}^n \frac{U(\mathbf x_i)}{f(\mathbf x_i)} \approx E\left[ \frac{U(\mathbf x)}{f(\mathbf x)} \right] = \int_{\mathcal D} \frac{U(\mathbf x)}{f(\mathbf x)}\, p(\mathbf x)\, d\mathbf x = \frac{1}{N} \int_{\mathcal D} U(\mathbf x)\, d\mathbf x.$$

As long as $U(\mathbf x)$ integrates to $1$ over the domain, the answer comes out to $1/N$, so we can just take the reciprocal to get our answer. On the $(x, y)$-plane, a good function that integrates to 1 and matches the shape of our function pretty well is $U(r) = \frac{1}{\pi} e^{-r^2}$, so our statistic becomes:

$$\frac{1}{n\pi} \sum_{i=1}^n \frac{e^{-r_i^2}\, r_i^3}{\sin^2(r_i)}.$$

We hope the answer comes out close to $1/\pi^2 \approx 0.10132$. I compute the statistic with a short piece of R code:

estimate = sum(apply(samples, 1, function(row) {
  r2 = row[1]^2 + row[2]^2
  exp(-r2) * r2^(3/2) / sin(sqrt(r2))^2
})) / (nrow(samples) * pi)

Running this on the data generated by the previous code block, I get 0.1016102, less than 1% away from the true value. Taking the square root of the reciprocal gives 3.137121, not a bad estimate of $\pi$ for a seemingly arbitrary sum. Pretty remarkable.
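The closed form $\pi^2$ can also be checked deterministically, independent of the sampler, with a quick midpoint sum (my own snippet; the truncation at $r = 1000$ drops roughly $1/(2 \cdot 1000)$ from the radial integral):

```r
## Midpoint-rule check that 2*pi times the integral of sin(r)^2 / r^2
## over (0, Inf) equals pi^2, truncating the tail at r = 1000.
h = 0.001
r = seq(h/2, 1000, by = h)   # midpoints of width-h cells
radial = sum(sin(r)^2 / r^2) * h
2 * pi * radial              # should land close to pi^2, about 9.87
```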

Applications

The applications are truly where the Metropolis-Hastings algorithm shines. Many integrals are not solvable in “closed form” via the Fundamental Theorem of Calculus because the integrand has no antiderivative expressible in terms of elementary functions (this is provable; the relevant result is Liouville’s theorem on elementary antiderivatives), for example the integral

$$\int_a^b x^x\, dx.$$

For these problems, if you really need the answer, you can estimate the integral numerically, for example with the trapezoidal rule (bringing back fond memories of AP calculus). For some integrals we cannot even use those techniques. Either the domain has too many dimensions, which forces us to use exponentially many “trapezoids”, or the domain is really large (perhaps infinitely large) and contributions from different parts are uneven. For these problems Monte Carlo integration is the best option.
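To be fair to the trapezoidal rule, in one dimension it handles $\int x^x\, dx$ in a few lines. The limits $a = 1$, $b = 3$ are chosen arbitrarily for illustration, and `trapezoid` is my own throwaway helper:

```r
## Trapezoidal-rule estimate of the integral of x^x from 1 to 3,
## checked against R's built-in adaptive quadrature.
trapezoid = function(f, a, b, n = 1e4) {
  x = seq(a, b, length.out = n + 1)
  y = f(x)
  (b - a) / n * (sum(y) - (y[1] + y[n + 1]) / 2)
}
trapezoid(function(x) x^x, 1, 3)
integrate(function(x) x^x, 1, 3)$value   # the two agree closely
```

It is only when the dimension grows, or the domain sprawls, that grids like this break down and sampling takes over.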

There are many examples of such problems in statistical mechanics. The spread of heat through conduction, convection, and radiation is essential to the design of engines, rockets, and electric generators; the many pathways of heat transfer form a large space over which Monte Carlo integration is the only practical option. Neutron transport for the design of nuclear weapons and reactors is a similar problem. A third example is the transport of light through a scene for the purpose of rendering a physically accurate image in Computer Graphics, for which the application of the MH-algorithm has its own name: Metropolis Light Transport.

To illustrate how the MH-algorithm makes hard integrals solvable, here is the luminosity equation from Computer Graphics:

$$L(X, \hat\omega_0) = L_e(X, \hat\omega_0) + \int_{\mathcal S^2} L_i(X, \hat\omega_i)\, f_{X, \hat n}(\hat\omega_i, \hat\omega_0)\, |\hat\omega_i \cdot \hat n|\, d\hat\omega_i.$$

A quick description: $L$ is the light outgoing from a flat surface $X$ in direction $\hat\omega_0$, $L_e$ is the light emitted by that surface in that direction, $L_i(X, \hat\omega_i)$ is the light incoming to that surface from direction $\hat\omega_i$, $f_{X, \hat n}(\hat\omega_i, \hat\omega_0)\, |\hat\omega_i \cdot \hat n|$ is the chance that light incoming from $\hat\omega_i$ will scatter in direction $\hat\omega_0$, and $\hat\omega_i$ is integrated over the positive hemisphere $\mathcal S^2$. In plain English: to find the amount of light bouncing off a surface toward you, you have to add up the light being bounced toward that surface from all other surfaces, then find all the light being bounced toward them, and so on. The integral is defined recursively; it is actually an infinite recursion of infinite integrals, which qualifies as a very hard integral. Luckily each surface absorbs some of the light bouncing off of it, so the recursion converges and we don’t go blind. But how do we integrate this?

A slick way to do it is to expand the recursion into an integral over light paths through each pixel. The light received by pixel $j$, $m_j$, can then be expressed as:

$$\begin{aligned} m_j &= \int (\text{Light from paths of length } 1)\, dx \\ &\quad + \int (\text{Light from paths of length } 2)\, dx + \dots \\ &= \int_\Omega w_j(\mathbf x)\, f(\mathbf x)\, d\mu, \end{aligned}$$

where $\Omega$ is the space of all light paths through the camera, $w_j(\mathbf x)$ is the proportion of light of each color that path contributes to the pixel, $f(\mathbf x)$ is the total amount of light flowing along that path, and $\mu$ is a measure on the path space. We can then treat this as a Monte Carlo integration problem: sample $n$ paths $\mathbf x_i$ from the distribution $p(\mathbf x) = f(\mathbf x)/C$ using the MH-algorithm, and use the expectation-value identity

$$\frac{1}{n} \sum_{i=1}^n w_j(\mathbf x_i) \approx E[w_j(\mathbf x)] = \int_\Omega w_j(\mathbf x)\, p(\mathbf x)\, d\mu = \frac{1}{C} \int_\Omega w_j(\mathbf x)\, f(\mathbf x)\, d\mu,$$

to get the value of the integral up to a normalization constant (which can be found quickly, or simply set to whatever looks good at the end of the process). Because Metropolis Light Transport concentrates samples on the paths that carry the most light, it converges quickly relative to other unbiased Monte Carlo rendering methods, and it parallelizes well.

The Metropolis-Hastings algorithm makes normally impossible integrals solvable with relative efficiency, and because it applies so widely, to problems that are very cool, I consider it my favorite.

References

Much of what I have written here is knowledge that I built up over summer research on sampling methods and a Computer Graphics class I took this fall, including the luminosity equation, which I got from The Graphics Codex by Morgan McGuire and from the Metropolis Light Transport paper and Ph.D. thesis of Eric Veach. In addition, I learned much about Monte Carlo methods from Computer Graphics: Principles and Practice by John F. Hughes, Andries van Dam, Morgan McGuire, David F. Sklar, James D. Foley, Steven K. Feiner, and Kurt Akeley, which could be considered a holy text. Also important are the papers where the algorithm originates, by Metropolis et al. and W. K. Hastings, and an explanatory article by Chib and Greenberg. Full citations for these are below:

Chib, Siddhartha, and Edward Greenberg. “Understanding the Metropolis-Hastings Algorithm.” The American Statistician 49, no. 4 (1995): 327–335.

Hastings, W.K. (1970). “Monte Carlo Sampling Methods Using Markov Chains and Their Applications”. Biometrika 57 (1): 97–109.

Hughes, John F.; van Dam, Andries; McGuire, Morgan; Sklar, David F.; Foley, James D.; Feiner, Steven K.; and Akeley, Kurt. Computer graphics: principles and practice. Pearson Education, 2013.

McGuire, Morgan. The Graphics Codex. Online book at www.graphicscodex.com.

Metropolis, N.; Rosenbluth, A.W.; Rosenbluth, M.N.; Teller, A.H.; Teller, E. (1953). “Equation of State Calculations by Fast Computing Machines”. Journal of Chemical Physics 21 (6): 1087–1092.

Veach, Eric. “Robust Monte Carlo Methods for Light Transport Simulation.” PhD diss., Stanford University, 1997.

Veach, Eric, and Leonidas J. Guibas. “Metropolis light transport.” In Proceedings of the 24th annual conference on Computer graphics and interactive techniques, pp. 65-76. ACM Press/Addison-Wesley Publishing Co., 1997.

6 Comments

  1. David Rachel

    June 1, 2015 at 2:46 pm

    Nice article! I’d not heard of Metropolis-Hastings, but it’s awesome.

    Since the limit behaviour is independent of the jump width, does that mean that the transition function T (jump width) can be arbitrarily altered mid-algorithm without undermining the result?

    If so, is there any potential benefit in this? Something like annealing, or even fancy and adaptive things – e.g. target desired acceptance/rejection rate, or transition function determined by variance (scalar or matrix) of existing selection?

    June 1, 2015 at 4:21 pm

Changing the transition function mid-algorithm can affect the asymptotic result if you’re not careful. For example, making jumps smaller whenever the current position is in a specific region could lead to that region being over-represented.

    However, there are benefits to tuning the acceptance rate or reducing the autocorrelation of the sampling process, such as reducing the variance of the integral estimate or making sure that you adequately explore all local maxima. To do this without damaging the convergence of the algorithm, one can specify a “burn-in” segment at the beginning in which parameters of the algorithm are tuned, and then throw out those samples. This has the added advantage of making the final result less dependent on the starting point.

  2. June 1, 2015 at 5:23 pm

Interesting stuff Michael, have you ever encountered stochastic evolutionary game dynamics?

    It’s a way you can take these methods and merge them with some pretty interesting frameworks coming out of game theory. Basically a way to connect the world of analyzing dynamic, stochastic systems to the world of human behavior. Where does the ‘system’ end up, when the system is a complex world of players, beliefs, actions, and payoff functions.

    Link below is a pretty good paper on the topic from my thesis advisors at Oxford. Think you would enjoy it.

    http://www.econ2.jhu.edu/people/Young/Stochastic_Evolutionary_Game_Dynamics.pdf

    • mflynn

    June 2, 2015 at 7:02 am

    Sounds interesting, I’ll check that paper out!

  3. June 1, 2015 at 9:53 pm

Thanks Michael for the great article. It’s such an amazing thing when we can take previously unsolvable problems and transform them into nothing more than a tedious calculation. Probability FTW!

  4. Kshitij Lauria

    June 3, 2015 at 8:22 pm

    @David Rachel
    If you change transition probabilities mid-algorithm, you must be very careful that you have not made your Markov chain irreversible.

    An intuitive way to understand detailed balance is that it means your state transitions are reversible: the likelihood of a path is the same in either direction.

    Simulated annealing makes use of the idea of Metropolis-Hastings by constructing a FAMILY of reversible Markov chains (each temperature). http://www.mit.edu/~dbertsim/papers/Optimization/Simulated%20annealing.pdf is a survey paper.

