## Introduction

In the previous post I explained how one can use an unbiased estimate of marginal likelihood derived from a particle filter within a Metropolis-Hastings MCMC algorithm in order to construct an exact pseudo-marginal MCMC scheme for the posterior distribution of the model parameters given some time course data. This idea is closely related to that of the particle marginal Metropolis-Hastings (PMMH) algorithm of Andrieu et al. (2010), but not exactly the same. This is because for a Bayesian model with parameters $\theta$, latent variables $x$ and data $y$, of the form

$$p(\theta,x,y) = p(\theta)\,p(x|\theta)\,p(y|x,\theta),$$

the pseudo-marginal algorithm which exploits the fact that the particle filter’s estimate of likelihood is unbiased is an MCMC algorithm which directly targets the marginal posterior distribution $p(\theta|y)$. On the other hand, the PMMH algorithm is an MCMC algorithm which targets the full joint posterior distribution $p(\theta,x|y)$. Now, the PMMH scheme does reduce to the pseudo-marginal scheme if samples of $x$ are not generated and stored in the state of the Markov chain, and it certainly is the case that the pseudo-marginal algorithm gives some insight into why the PMMH algorithm works. However, the PMMH algorithm is much more powerful, as it solves the “smoothing” and parameter estimation problem simultaneously and exactly, including the “initial value” problem (computing the posterior distribution of the initial state, $p(x_0|y_{1:T})$). Below I will describe the algorithm and explain why it works, but first it is necessary to understand the relationship between marginal, joint and “likelihood-free” MCMC updating schemes for such latent variable models.

### MCMC for latent variable models

#### Marginal approach

If we want to target $p(\theta|y)$ directly, we can use a Metropolis-Hastings scheme with a fairly arbitrary proposal distribution for exploring $\theta$, where a new $\theta^\star$ is proposed from $f(\theta^\star|\theta)$ and accepted with probability $\min\{1,A\}$, where

$$A = \frac{p(\theta^\star)\,f(\theta|\theta^\star)\,p(y|\theta^\star)}{p(\theta)\,f(\theta^\star|\theta)\,p(y|\theta)}.$$

As previously discussed, the problem with this scheme is that the marginal likelihood $p(y|\theta)$ required in the acceptance ratio is often difficult to compute.
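
As a rough sketch of the marginal scheme (not the R code promised later in this post), here is a generic random walk Metropolis sampler in Python. The function name `marginal_mh` and the requirement that the user supplies `log_post` (log prior plus log marginal likelihood) are my illustrative choices; the whole difficulty discussed above is that `log_post` is usually unavailable in closed form for latent variable models.

```python
import numpy as np

def marginal_mh(log_post, theta0, n_iters, rng, sd=0.5):
    """Marginal Metropolis-Hastings with a Gaussian random walk proposal
    f(theta*|theta). log_post(theta) must return
    log p(theta) + log p(y|theta), i.e. it requires the (often intractable)
    marginal likelihood -- exactly the difficulty discussed above."""
    theta = theta0
    lp = log_post(theta)
    chain = [theta]
    for _ in range(n_iters):
        theta_p = theta + rng.normal(0.0, sd)   # symmetric proposal, so f terms cancel
        lp_p = log_post(theta_p)
        if np.log(rng.uniform()) < lp_p - lp:   # accept with probability min(1, A)
            theta, lp = theta_p, lp_p
        chain.append(theta)
    return np.array(chain)
```

For a conjugate toy problem such as $y\sim N(\theta,1)$, $\theta\sim N(0,1)$, the marginal likelihood is tractable and one could call `marginal_mh(lambda th: -0.5*th**2 - 0.5*(y - th)**2, 0.0, 1000, rng)`; for the state space models considered below it is not.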

#### Likelihood-free MCMC

A simple “likelihood-free” scheme targets the full joint posterior distribution $p(\theta,x|y)$. It works by exploiting the fact that we can often simulate from the model for the latent variables $p(x|\theta)$ even when we can’t evaluate it, or marginalise $x$ out of the problem. Here the Metropolis-Hastings proposal is constructed in two stages. First, a proposed new $\theta^\star$ is sampled from $f(\theta^\star|\theta)$ and then a corresponding $x^\star$ is simulated from the model $p(x^\star|\theta^\star)$. The pair $(\theta^\star,x^\star)$ is then jointly accepted with ratio

$$A = \frac{p(\theta^\star)\,f(\theta|\theta^\star)\,p(y|x^\star,\theta^\star)}{p(\theta)\,f(\theta^\star|\theta)\,p(y|x,\theta)}.$$

The proposal mechanism ensures that the proposed $x^\star$ is consistent with the proposed $\theta^\star$, and so the procedure can work provided that the dimension of the data $y$ is low. However, in order to work well more generally, we would want the proposed latent variables to be consistent with the data as well as the model parameters.
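
To make the two-stage proposal concrete, here is a minimal Python sketch for a hypothetical toy model of my own choosing (not from the post): $\theta\sim N(0,1)$, $x|\theta\sim N(\theta,1)$, $y|x\sim N(x,0.1)$. Note that $x^\star$ is *simulated* from the model, never evaluated, so only the observation density enters the acceptance ratio.

```python
import numpy as np

def lf_mcmc(y, n_iters, rng, sd=0.5):
    """"Likelihood-free" MH: propose theta*, blindly simulate x* ~ p(x|theta*)
    from the model, and accept the pair using only p(y|x)."""
    def log_prior(theta):
        return -0.5 * theta**2              # log N(0,1), up to a constant
    def log_obs(x):
        return -0.5 * (y - x)**2 / 0.1      # log p(y|x), obs variance 0.1
    theta, x = 0.0, 0.0
    chain = []
    for _ in range(n_iters):
        theta_p = theta + rng.normal(0.0, sd)  # theta* ~ f(.|theta), symmetric
        x_p = rng.normal(theta_p, 1.0)         # x* ~ p(x|theta*): simulate, don't evaluate
        log_A = (log_prior(theta_p) + log_obs(x_p)) - (log_prior(theta) + log_obs(x))
        if np.log(rng.uniform()) < log_A:
            theta, x = theta_p, x_p
        chain.append((theta, x))
    return chain
```

With a small observation variance the blindly simulated $x^\star$ rarely lands near $y$, so acceptance rates collapse — which is precisely the weakness described above.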

#### Ideal joint update

Motivated by the likelihood-free scheme, we would really like to target the joint posterior $p(\theta,x|y)$ by first proposing $\theta^\star$ from $f(\theta^\star|\theta)$ and then a corresponding $x^\star$ from the conditional distribution $p(x^\star|\theta^\star,y)$. The pair $(\theta^\star,x^\star)$ is then jointly accepted with ratio

$$A = \frac{p(\theta^\star)\,f(\theta|\theta^\star)\,p(x^\star|\theta^\star)\,p(y|x^\star,\theta^\star)\,p(x|\theta,y)}{p(\theta)\,f(\theta^\star|\theta)\,p(x|\theta)\,p(y|x,\theta)\,p(x^\star|\theta^\star,y)}.$$

Notice how the acceptance ratio simplifies, using the basic marginal likelihood identity (BMI) of Chib (1995), $p(x|\theta,y)=p(x|\theta)\,p(y|x,\theta)/p(y|\theta)$, and $x$ drops out of the ratio completely in order to give exactly the ratio used for the marginal updating scheme,

$$A = \frac{p(\theta^\star)\,f(\theta|\theta^\star)\,p(y|\theta^\star)}{p(\theta)\,f(\theta^\star|\theta)\,p(y|\theta)}.$$

Thus, the “ideal” joint updating scheme reduces to the marginal updating scheme if $x$ is not sampled and stored as a component of the Markov chain.

Understanding the relationship between these schemes is useful for understanding the PMMH algorithm. Indeed, we will see that the “ideal” joint updating scheme (and the marginal scheme) corresponds to PMMH using infinitely many particles in the particle filter, and that the likelihood-free scheme corresponds to PMMH using exactly one particle in the particle filter. For an intermediate number of particles, the PMMH scheme is a compromise between the “ideal” scheme and the “blind” likelihood-free scheme, but is always likelihood-free (when used with a bootstrap particle filter) and always has an acceptance ratio leaving the exact posterior invariant.

### The PMMH algorithm

#### The algorithm

The PMMH algorithm is an MCMC algorithm for state space models jointly updating $\theta$ and $x_{0:T}$, as the algorithms above. First, a proposed new $\theta^\star$ is generated from a proposal $f(\theta^\star|\theta)$, and then a corresponding $x_{0:T}^\star$ is generated by running a bootstrap particle filter (as described in the previous post, and below) using the proposed new model parameters, $\theta^\star$, and selecting a single trajectory by sampling once from the final set of particles using the final set of weights. This proposed pair $(\theta^\star,x_{0:T}^\star)$ is accepted using the Metropolis-Hastings ratio

$$A = \frac{\hat{p}_{\theta^\star}(y_{1:T})\,p(\theta^\star)\,f(\theta|\theta^\star)}{\hat{p}_{\theta}(y_{1:T})\,p(\theta)\,f(\theta^\star|\theta)},$$

where $\hat{p}_{\theta}(y_{1:T})$ is the particle filter’s (unbiased) estimate of marginal likelihood, described in the previous post, and below. Note that this approach tends to the perfect joint/marginal updating scheme as the number of particles used in the filter tends to infinity. Note also that for a single particle, the particle filter just blindly forward simulates from $p_\theta(x_{0:T}^\star)$ and that the filter’s estimate of marginal likelihood is just the observed data likelihood $p_\theta(y_{1:T}|x_{0:T}^\star)$, leading precisely to the simple likelihood-free scheme. To understand PMMH for an arbitrary finite number of particles, $M$, one needs to think carefully about the structure of the particle filter.
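
As a rough Python sketch of the parameter-update part of the algorithm, assuming a toy AR(1)/Gaussian state space model of my own choosing (illustrative only: $x_{t+1}|x_t\sim N(\theta x_t,1)$, $y_t|x_t\sim N(x_t,1)$, flat prior on $\theta$), the key implementation point is that the current state's likelihood estimate is stored and reused, never recomputed:

```python
import numpy as np

def pf_log_lik(theta, y, M, rng):
    """Log of the bootstrap particle filter's unbiased marginal likelihood
    estimate, for the hypothetical AR(1)/Gaussian model described above."""
    x = rng.normal(0.0, 1.0, M)          # x_0^k ~ p(x_0)
    w = np.full(M, 1.0 / M)              # pi_0^k = 1/M
    ll = 0.0
    for yt in y:
        x = rng.normal(theta * x[rng.choice(M, M, p=w)], 1.0)  # resample, then propagate
        wt = np.exp(-0.5 * (yt - x)**2) / np.sqrt(2 * np.pi)   # p(y_t | x_t^k)
        ll += np.log(wt.mean())          # log p_hat(y_t | y_{1:t-1})
        w = wt / wt.sum()
    return ll

def pmmh(y, n_iters, M, rng, sd=0.5):
    """PMMH chain for theta with a random walk proposal f(theta*|theta).
    The estimate ll for the current theta is carried forward unchanged;
    re-running the filter at the current state would break exactness."""
    theta = 0.0
    ll = pf_log_lik(theta, y, M, rng)
    chain = []
    for _ in range(n_iters):
        theta_prop = theta + rng.normal(0.0, sd)   # symmetric proposal, f terms cancel
        ll_prop = pf_log_lik(theta_prop, y, M, rng)
        if np.log(rng.uniform()) < ll_prop - ll:   # flat prior assumed for brevity
            theta, ll = theta_prop, ll_prop
        chain.append(theta)
    return np.array(chain)
```

A full PMMH implementation would also store the trajectory sampled from the filter's final weighted particle set (see the ancestry-tracing sketch below the filter description), but the acceptance ratio is exactly as shown here.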

#### Why it works

To understand why PMMH works, it is necessary to think about the joint distribution of all random variables used in the bootstrap particle filter. To this end, it is helpful to re-visit the particle filter, thinking carefully about the resampling and propagation steps.

First introduce notation for the “particle cloud”: $\mathbf{x}_t=\{x_t^k\,|\,k=1,\ldots,M\}$, $\boldsymbol{\pi}_t=\{\pi_t^k\,|\,k=1,\ldots,M\}$, $\tilde{\mathbf{x}}_t=\{(x_t^k,\pi_t^k)\,|\,k=1,\ldots,M\}$. Initialise the particle filter with $\tilde{\mathbf{x}}_0$, where $x_0^k\sim p(x_0)$ and $\pi_0^k=1/M$ (note that the unnormalised weight $\tilde{\pi}_0^k$ is undefined, as there is no observation at time $0$). Now suppose at time $t$ we have a sample from $p(x_t|y_{1:t})$: $\tilde{\mathbf{x}}_t$. First resample by sampling $a_t^k\sim\mathcal{F}(\cdot|\boldsymbol{\pi}_t)$, $k=1,\ldots,M$. Here we use $\mathcal{F}(\cdot|\boldsymbol{\pi})$ for the discrete distribution on $\{1,\ldots,M\}$ with probability mass function $\boldsymbol{\pi}$. Next sample $x_{t+1}^k\sim p(x_{t+1}|x_t^{a_t^k})$. Set $\tilde{\pi}_{t+1}^k=p(y_{t+1}|x_{t+1}^k)$ and $\pi_{t+1}^k=\tilde{\pi}_{t+1}^k/\sum_i\tilde{\pi}_{t+1}^i$. Finally, propagate $\tilde{\mathbf{x}}_{t+1}$ to the next step… We define the filter’s estimate of likelihood as $\hat{p}(y_t|y_{1:t-1})=\frac{1}{M}\sum_k\tilde{\pi}_t^k$ and $\hat{p}(y_{1:T})=\prod_{t=1}^T\hat{p}(y_t|y_{1:t-1})$. See Doucet et al (2001) for further theoretical background on particle filters and SMC more generally.
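
The resample/propagate/weight steps above, together with the ancestor indices $a_t^k$ needed to trace back a complete trajectory, can be sketched in Python as follows. The concrete model is again my illustrative AR(1)/Gaussian choice ($x_0\sim N(0,1)$, $x_{t+1}|x_t\sim N(\theta x_t,1)$, $y_t|x_t\sim N(x_t,1)$), not anything specific from the post.

```python
import numpy as np

def bootstrap_filter(theta, y, M, rng):
    """Bootstrap particle filter for the hypothetical AR(1)/Gaussian model.
    Returns (log p_hat(y_{1:T}), one trajectory sampled from the final
    weighted particle set, recovered via the ancestor indices a_t^k)."""
    T = len(y)                          # observations y_1, ..., y_T
    xs = np.empty((T + 1, M))           # particle values x_t^k
    anc = np.empty((T, M), dtype=int)   # ancestor indices a_t^k
    xs[0] = rng.normal(0.0, 1.0, M)     # x_0^k ~ p(x_0), pi_0^k = 1/M
    w = np.full(M, 1.0 / M)
    log_phat = 0.0
    for t in range(T):
        a = rng.choice(M, size=M, p=w)  # resample: a_t^k ~ F(.|pi_t)
        anc[t] = a
        xs[t + 1] = rng.normal(theta * xs[t, a], 1.0)  # x_{t+1}^k ~ p(.|x_t^{a_t^k})
        wt = np.exp(-0.5 * (y[t] - xs[t + 1])**2) / np.sqrt(2 * np.pi)  # pi~_{t+1}^k
        log_phat += np.log(wt.mean())   # p_hat(y_{t+1}|y_{1:t}) = mean unnormalised weight
        w = wt / wt.sum()               # normalised weights pi_{t+1}^k
    # sample the final index k' ~ F(.|pi_T) and trace back its ancestry:
    k = rng.choice(M, p=w)
    traj = np.empty(T + 1)
    traj[T] = xs[T, k]
    for t in range(T - 1, -1, -1):
        k = anc[t, k]                   # b_t^{k'} = a_t^{b_{t+1}^{k'}}
        traj[t] = xs[t, k]
    return log_phat, traj
```

The backward index-tracing loop at the end is exactly the $b_t^{k'}$ recursion defined below, and the running product of mean unnormalised weights is the unbiased likelihood estimate used in the PMMH acceptance ratio.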

Describing the filter carefully as above allows us to write down the joint density of all random variables in the filter as

$$\tilde{q}(\mathbf{x}_0,\ldots,\mathbf{x}_T,\mathbf{a}_0,\ldots,\mathbf{a}_{T-1}) = \left[\prod_{k=1}^M p(x_0^k)\right]\prod_{t=0}^{T-1}\left[\prod_{k=1}^M \pi_t^{a_t^k}\,p(x_{t+1}^k|x_t^{a_t^k})\right].$$

For PMMH we also sample a final index $k'$ from $\mathcal{F}(\cdot|\boldsymbol{\pi}_T)$, giving the joint density

$$\tilde{q}(\mathbf{x}_0,\ldots,\mathbf{x}_T,\mathbf{a}_0,\ldots,\mathbf{a}_{T-1})\,\pi_T^{k'}.$$

We write the final selected trajectory as

$$x_{0:T}^{k'} = \left(x_0^{b_0^{k'}},\ldots,x_T^{b_T^{k'}}\right),$$

where $b_T^{k'}=k'$, and $b_t^{k'}=a_t^{b_{t+1}^{k'}}$. If we now think about the structure of the PMMH algorithm, our proposal on the space of all random variables in the problem is in fact

$$f(\theta^\star|\theta)\,\tilde{q}_{\theta^\star}(\mathbf{x}_0^\star,\ldots,\mathbf{x}_T^\star,\mathbf{a}_0^\star,\ldots,\mathbf{a}_{T-1}^\star)\,\pi_T^{\star\,k'^\star},$$

and by considering the proposal and the acceptance ratio, it is clear that detailed balance for the chain is satisfied by the target with density proportional to

$$p(\theta)\,\hat{p}_\theta(y_{1:T})\,\tilde{q}_\theta(\mathbf{x}_0,\ldots,\mathbf{x}_T,\mathbf{a}_0,\ldots,\mathbf{a}_{T-1})\,\pi_T^{k'}.$$

We want to show that this target marginalises down to the correct posterior when we consider just the parameters and the selected trajectory. But if we consider the terms in the joint distribution of the proposal corresponding to the trajectory selected by $k'$, this is given by

$$p\!\left(x_0^{b_0^{k'}}\right)\left[\prod_{t=0}^{T-1}\pi_t^{b_t^{k'}}\,p\!\left(x_{t+1}^{b_{t+1}^{k'}}\middle|\,x_t^{b_t^{k'}}\right)\right]\pi_T^{k'},$$

which, by expanding the $\pi_t^{b_t^{k'}}$ in terms of the unnormalised weights, simplifies to

$$\frac{p_\theta\!\left(x_{0:T}^{k'}\right)p_\theta\!\left(y_{1:T}\middle|\,x_{0:T}^{k'}\right)}{M^{T+1}\,\hat{p}_\theta(y_{1:T})}.$$

It is worth dwelling on this result, as this is the key insight required to understand why the PMMH algorithm works. The whole point is that the terms in the joint density of the proposal corresponding to the selected trajectory exactly represent the required joint distribution modulo a couple of normalising constants, one of which is the particle filter’s estimate of marginal likelihood. Thus, by including $\hat{p}_\theta(y_{1:T})$ in the acceptance ratio, we knock out the normalising constant, allowing all of the other terms in the proposal to be marginalised away. In other words, the target of the chain can be written as proportional to

$$\frac{p(\theta)\,p_\theta\!\left(x_{0:T}^{k'}\right)p_\theta\!\left(y_{1:T}\middle|\,x_{0:T}^{k'}\right)}{M^{T+1}}\times\left(\text{terms of }\tilde{q}_\theta\text{ not associated with the selected trajectory}\right).$$

The other terms are all probabilities of random variables which do not occur elsewhere in the target, and hence can all be marginalised away to leave the correct posterior

$$p(\theta,x_{0:T}|y_{1:T}).$$

Thus the PMMH algorithm targets the correct posterior for any number of particles, $M\ge 1$. Also note the implied uniform distribution on the selected indices in the target.

I will give some code examples in a future post.

This is a really great and clear explanation of pMCMC. Even though I’ve read the Andrieu et al. paper a couple of times, this really helped clear up a few things for me. I’ve actually been trying to use pMCMC to fit nonlinear epidemiological models to time series data. I implemented a basic version of the algorithm in Matlab and it seemed to work well even though my epi models have what I consider to be a large state space. But it was slow… painfully slow actually even with ~100 particles. Since you seem to be a fan of using Java for numerical computation, I was wondering if you’ve implemented the pMCMC algorithm in Java, and if so, if you have code you could share? I’m just starting to make the switch to Java so it would be great if I could see how somebody else would code this.

Thanks!!

David Rasmussen

Biology Dept.

Duke University

Durham, NC, USA

Thanks for your feedback. Regarding a Java implementation – not yet! If you’d like to see how to do this in R (which will not be any faster than Matlab), I am currently finishing off an R package “smfsb” for the second edition of my book which includes a pMCMC example. The installation instructions are here: http://www.staff.ncl.ac.uk/d.j.wilkinson/smfsb/2e/ and running ‘demo(“PMCMC”)’ will give a demo of how to do it. It may be of interest to see how I structured the code to make it easy to bolt together different models, simulation algorithms, particle filters and MCMC schemes (using function closures). I’ll talk about that in future posts. The way it is structured, it should port easily to any reasonable functional or O-O language (including Scala and Java). Porting the framework to Java is on my (very long) list of things to do, but don’t hold your breath! Regards,

Thanks! I’m looking forward to your upcoming posts, and book.

A bit off-topic question: your book (of 2011 year edition) should arrive to me in a few weeks, and I wonder what additional reading may be advised to a software engineer gradually shifting towards systems biology? Which books (or other resources) do you consider as a must, which ones to read after?

This is actually a tricky question, as systems biology is very broad, so the answer really depends on the kind of systems biology that you are shifting towards. You could do worse than to start with Voit’s book: http://amzn.to/1dQ88X2 – which is quite nicely done, as a starting point. But going beyond that is likely to be very topic specific.

Thank you for advice! It seems to be something I was looking for, field is indeed too broad.

This may be a dumb question, but in deriving the last line of your calculations, what happened to the p(y_{1:T}) in the denominator of the final posterior?

err, numerator, not denominator.

If you are referring to the genuine p(y_{1:T}), then this is just the usual constant of proportionality which will cancel out of the MH ratio. Note that this is different to p(y_{1:T}|theta), which I’ve written as p_theta(y_{1:T}), which you do have to carefully keep track of.

Assuming that theta is known, is it correct to say that the samples of x1T are what you would get if you ran MH using the filtering distribution as a proposal density?

If so, if the filtering and smoothing distribution have a high KL-divergence (because the hidden state transition is slowly mixing for instance) would it be fair to say that samples of x1T will get trapped in local maxima for many iterations, making the acceptance rate very small?

Are there techniques that alleviate this problem by using a proposal distribution that depends on the current x1T, but which are capable of getting reasonable acceptance rates?

No, the proposals are from the smoothing distribution. Nevertheless, you might be interested in particle Gibbs, and techniques to address particle degeneracy in particle Gibbs, such as ancestral sampling. My introduction to particle Gibbs is a reasonable place to start: https://darrenjw.wordpress.com/2014/01/25/introduction-to-the-particle-gibbs-sampler/

Oh, but that must be true only for an infinite number of particles then… Is it then that the proposal is from some intermediate distribution between filtering and smoothing when there are a finite number of particles?

OK, to be more precise, the proposal is from the usual particle approximation to the smoothing distribution, which tends to the true smoothing distribution as the number of particles increases. But it is not the filtering distribution.

OK, it’s not even the filtering distribution if n=1 either, is it? Only for the marginal P(x0)…

Actually, for n=1 particle, you just have a blind “likelihood free” forward simulation from the model (ignoring the data). But in that case I guess it is the 1-particle approximation to both the filtering and smoothing distributions (but neither are any good!).

Oh right of course. I do have an intuition that the greater the divergence between filtering and smoothing, the higher the autocorrelation in samples from x1T, but I can’t quite put my finger on why.