## Tuning particle MCMC algorithms

Several papers have appeared recently discussing the issue of how to tune the number of particles used in the particle filter within a particle MCMC algorithm such as particle marginal Metropolis Hastings (PMMH). Three such papers are:

I have discussed psuedo marginal MCMC and particle MCMC algorithms in previous posts. It will be useful to refer back to these posts if these topics are unfamiliar. Within particle MCMC algorithms (and psuedo-marginal MCMC algorithms, more generally), an unbiased estimate of marginal likelihood is constructed using a number of particles. The more particles that are used, the better the estimate of marginal likelihood is, and the resulting MCMC algorithm will behave more like a “real” marginal MCMC algorithm. For a small number of particles, the algorithm will still have exactly the correct target, but the noise in the unbiased estimator of marginal likelihood will lead to poor mixing of the MCMC chain. The idea is to use just enough particles to ensure that there isn’t “too much” noise in the unbiased estimator, but not to waste lots of time producing a super-accurate estimate of marginal likelihood if that isn’t necessary to ensure good mixing of the MCMC chain.

The papers above try to give theoretical justifications for certain “rules of thumb” that are commonly used in practice. One widely adopted scheme is to tune the number of particles so that the variance of the log of the estimate of marginal liklihood is around one. The obvious questions are “where?” and “why?”, and these questions turn out to be connected. As we will see, there isn’t really a good answer to the “where?” question, but what people usually do is use a pilot run to get an estimate of the posterior mean, or mode, or MLE, and then pick one and tune the noise variance at that particular parameter value. As to “why?”, well, the papers above make various (slightly different) assumptions, all of which lead to trading off mixing against computation time to obtain an “optimal” number of particles. They don’t all agree that the variance of the noise should be exactly 1, but they all agree to an order of magnitude.

All of the above papers make the assumption that the noise distribution associated with the marginal likelihood estimate is independent of the parameter at which it is being evaluated, which explains why there isn’t a really good answer to the “where?” question – under the assumption it doesn’t matter what parameter value is used for tuning – they are all the same! Easy. Except that’s quite a big assumption, so it would be nice to know that it is reasonable, and unfortunately it isn’t. Let’s look at an example to see what goes wrong.

#### Example

In Chapter 10 of my book I look in detail at constructing a PMMH algorithm for inferring the parameters of a discretely observed stochastic Lotka-Volterra model. I’ve stepped through the computational details in a previous post which you should refer back to for the necessary background. Following that post, we can construct a particle filter to return an unbiased estimate of marginal likelihood using the following R code (which relies on the smfsb CRAN package):

require(smfsb)
# data
data(LVdata)
data=as.timedData(LVnoise10)
noiseSD=10
# measurement error model
dataLik <- function(x,t,y,log=TRUE,...)
{
ll=sum(dnorm(y,x,noiseSD,log=TRUE))
if (log)
return(ll)
else
return(exp(ll))
}
# now define a sampler for the prior on the initial state
simx0 <- function(N,t0,...)
{
mat=cbind(rpois(N,50),rpois(N,100))
colnames(mat)=c("x1","x2")
mat
}
# construct particle filter
mLLik=pfMLLik(150,simx0,0,stepLVc,dataLik,data)


Again, see the relevant previous post for details. So now mLLik() is a function that will return the log of an unbiased estimate of marginal likelihood (based on 150 particles) given a parameter value at which to evaluate.

What we are currently wondering is whether the noise in the estimate is independent of the parameter at which it is evaluated. We can investigate this for this filter easily by looking at how the estimate varies as the first parameter (prey birth rate) varies. The following code computes a log likelihood estimate across a range of values and plots the result.

mLLik1=function(x){mLLik(th=c(th1=x,th2=0.005,th3=0.6))}
x=seq(0.7,1.3,length.out=5001)
y=sapply(x,mLLik1)
plot(x[y>-1e10],y[y>-1e10])


The resulting plot is as follows (click for full size):

So, looking at the plot, it is very clear that the noise variance certainly isn’t constant as the parameter varies – it varies substantially. Furthermore, the way in which it varies is “dangerous”, in that the noise is smallest in the vicinity of the MLE. So, if a parameter close to the MLE is chosen for tuning the number of particles, this will ensure that the noise is small close to the MLE, but not elsewhere in parameter space. This could have bad consequences for the mixing of the MCMC algorithm as it explores the tails of the posterior distribution.

So with the above in mind, how should one tune the number of particles in a pMCMC algorithm? I can’t give a general answer, but I can explain what I do. We can’t rely on theory, so a pragmatic approach is required. The above rule of thumb usually gives a good starting point for exploration. Then I just directly optimise ESS per CPU second of the pMCMC algorithm from pilot runs for varying numbers of particles (and other tuning parameters in the algorithm). ESS is “expected sample size”, which can be estimated using the effectiveSize() function in the coda CRAN package. Ugly and brutish, but it works…

## Gibbs sampling a Gaussian Markov random field (GMRF) using Java

### Introduction

As I’ve explained previously, I’m gradually coming around to the idea of using Java for the development of MCMC codes, and I’m starting to build up a collection of simple examples for getting started. One of the advantages of Java is that it includes a standard cross-platform GUI library. This might not seem like the most important requirement for MCMC, but can actually be very handy in several contexts, particularly for monitoring convergence. One obvious context is that of image analysis, where it can be useful to monitor image reconstructions as the sampler is running. In this post I’ll show three very small simple Java classes which together provide an application for running a Gibbs sampler on a (non-stationary, unconditioned) Gaussian Markov random field.

The model is essentially that the distribution of each pixel is defined intrinsically, dependent only on its four nearest neighbours on a rectangular lattice, and here the distribution will be Gaussian with mean equal to the sample mean of the four neighbouring pixels and a fixed (unit) variance. On its own this isn’t especially useful, but it is a key component of many image analysis applications.

### A simple Java implementation

We will start with the class MrfApp containing the main method for the application:

MrfApp.java

import java.io.*;
class MrfApp {
public static void main(String[] arg)
throws IOException
{
Mrf mrf;
System.out.println("started program");
mrf=new Mrf(800,600);
System.out.println("created mrf object");
mrf.update(1000);
mrf.saveImage("mrf.png");
System.out.println("finished program");
mrf.frame.dispose();
System.exit(0);
}
}


Hopefully this code is largely self-explanatory, but relies on a class called Mrf which contains all of the logic associated with the GMRF.

Mrf.java

import java.io.*;
import java.util.*;
import java.awt.image.*;
import javax.swing.*;
import javax.imageio.ImageIO;

class Mrf
{
int n,m;
double[][] cells;
Random rng;
BufferedImage bi;
WritableRaster wr;
JFrame frame;
ImagePanel ip;

Mrf(int n_arg,int m_arg)
{
n=n_arg;
m=m_arg;
cells=new double[n][m];
rng=new Random();
bi=new BufferedImage(n,m,BufferedImage.TYPE_BYTE_GRAY);
wr=bi.getRaster();
frame=new JFrame("MRF");
frame.setSize(n,m);
frame.setVisible(true);
}

public void saveImage(String filename)
throws IOException
{
ImageIO.write(bi,"PNG",new File(filename));
}

public void updateImage()
{
double mx=-1e+100;
double mn=1e+100;
for (int i=0;i<n;i++) {
for (int j=0;j<m;j++) {
if (cells[i][j]>mx) { mx=cells[i][j]; }
if (cells[i][j]<mn) { mn=cells[i][j]; }
}
}
for (int i=0;i<n;i++) {
for (int j=0;j<m;j++) {
int level=(int) (255*(cells[i][j]-mn)/(mx-mn));
wr.setSample(i,j,0,level);
}
}
frame.repaint();
}

public void update(int num)
{
for (int i=0;i<num;i++) {
updateOnce();
}
}

private void updateOnce()
{
double mean;
for (int i=0;i<n;i++) {
for (int j=0;j<m;j++) {
if (i==0) {
if (j==0) {
mean=0.5*(cells[0][1]+cells[1][0]);
}
else if (j==m-1) {
mean=0.5*(cells[0][j-1]+cells[1][j]);
}
else {
mean=(cells[0][j-1]+cells[0][j+1]+cells[1][j])/3.0;
}
}
else if (i==n-1) {
if (j==0) {
mean=0.5*(cells[i][1]+cells[i-1][0]);
}
else if (j==m-1) {
mean=0.5*(cells[i][j-1]+cells[i-1][j]);
}
else {
mean=(cells[i][j-1]+cells[i][j+1]+cells[i-1][j])/3.0;
}
}
else if (j==0) {
mean=(cells[i-1][0]+cells[i+1][0]+cells[i][1])/3.0;
}
else if (j==m-1) {
mean=(cells[i-1][j]+cells[i+1][j]+cells[i][j-1])/3.0;
}
else {
mean=0.25*(cells[i][j-1]+cells[i][j+1]+cells[i+1][j]
+cells[i-1][j]);
}
cells[i][j]=mean+rng.nextGaussian();
}
}
updateImage();
}

}


This class contains a few simple methods for creating and updating the GMRF, and also for maintaining and updating a graphical view of the GMRF as the sampler is running. The Gibbs sampler update itself is encoded in the final method, updateOnce, and most of the code is to deal with edge and corner cases (in the literal rather than metaphorical sense!). This is called repeatedly by the method update for the required number of iterations. At the end of each iteration, the method updateOnce triggers updateImage which updates the image associated GMRF. The GMRF itself is stored in a 2-dimensional array of doubles, but an image pixel typically consists of a grayscale value represented by an unsigned byte – that is, an integer from 0 to 255. So updateImage scans through the GMRF to find the maximum and minimum values and then maps the GMRF values onto the 0 to 255 scale. The image itself is set up by the constructor method, Mrf. This class relies on an additional class called ImagePanel, which is a simple GUI panel for displaying images:

ImagePanel.java

import java.awt.*;
import java.awt.image.*;
import javax.swing.*;

class ImagePanel extends JPanel {

protected BufferedImage image;

public ImagePanel(BufferedImage image) {
this.image=image;
Dimension dim=new Dimension(image.getWidth(),image.getHeight());
setPreferredSize(dim);
setMinimumSize(dim);
revalidate();
repaint();
}

public void paintComponent(Graphics g) {
g.drawImage(image,0,0,this);
}

}


This completes the application, which can be compiled and run from the command line with

javac *.java
java MrfApp


This should compile the code and run the application, which will show a GMRF updating for 1000 iterations. When the 1000 iterations are complete, the application writes the final image to a file and then quits.

### Using Parallel COLT

The above classes are very convenient, as they should work with any standard Java installation. However, in more complex scenarios, it is likely that a math library such as Parallel COLT will be required. In this case it will make sense to make use of features in the COLT library, such as random number generators and 2d matrix objects. We can adapt the above application by replacing the MrfApp and Mrf classes with the following versions (the ImagePanel class remains unchanged):

MrfApp.java

import java.io.*;
import cern.jet.random.tdouble.engine.*;

class MrfApp {

public static void main(String[] arg)
throws IOException
{
Mrf mrf;
int seed=1234;
System.out.println("started program");
DoubleRandomEngine rngEngine=new DoubleMersenneTwister(seed);
mrf=new Mrf(800,600,rngEngine);
System.out.println("created mrf object");
mrf.update(1000);
mrf.saveImage("mrf.png");
System.out.println("finished program");
mrf.frame.dispose();
System.exit(0);
}

}


Mrf.java

import java.io.*;
import java.util.*;
import java.awt.image.*;
import javax.swing.*;
import javax.imageio.ImageIO;
import cern.jet.random.tdouble.*;
import cern.jet.random.tdouble.engine.*;
import cern.colt.matrix.tdouble.impl.*;

class Mrf
{
int n,m;
DenseDoubleMatrix2D cells;
DoubleRandomEngine rng;
Normal rngN;
BufferedImage bi;
WritableRaster wr;
JFrame frame;
ImagePanel ip;

Mrf(int n_arg,int m_arg,DoubleRandomEngine rng)
{
n=n_arg;
m=m_arg;
cells=new DenseDoubleMatrix2D(n,m);
this.rng=rng;
rngN=new Normal(0.0,1.0,rng);
bi=new BufferedImage(n,m,BufferedImage.TYPE_BYTE_GRAY);
wr=bi.getRaster();
frame=new JFrame("MRF");
frame.setSize(n,m);
frame.setVisible(true);
}

public void saveImage(String filename)
throws IOException
{
ImageIO.write(bi,"PNG",new File(filename));
}

public void updateImage()
{
double mx=-1e+100;
double mn=1e+100;
for (int i=0;i<n;i++) {
for (int j=0;j<m;j++) {
if (cells.getQuick(i,j)>mx) { mx=cells.getQuick(i,j); }
if (cells.getQuick(i,j)<mn) { mn=cells.getQuick(i,j); }
}
}
for (int i=0;i<n;i++) {
for (int j=0;j<m;j++) {
int level=(int) (255*(cells.getQuick(i,j)-mn)/(mx-mn));
wr.setSample(i,j,0,level);
}
}
frame.repaint();
}

public void update(int num)
{
for (int i=0;i<num;i++) {
updateOnce();
}
}

private void updateOnce()
{
double mean;
for (int i=0;i<n;i++) {
for (int j=0;j<m;j++) {
if (i==0) {
if (j==0) {
mean=0.5*(cells.getQuick(0,1)+cells.getQuick(1,0));
}
else if (j==m-1) {
mean=0.5*(cells.getQuick(0,j-1)+cells.getQuick(1,j));
}
else {
mean=(cells.getQuick(0,j-1)+cells.getQuick(0,j+1)+cells.getQuick(1,j))/3.0;
}
}
else if (i==n-1) {
if (j==0) {
mean=0.5*(cells.getQuick(i,1)+cells.getQuick(i-1,0));
}
else if (j==m-1) {
mean=0.5*(cells.getQuick(i,j-1)+cells.getQuick(i-1,j));
}
else {
mean=(cells.getQuick(i,j-1)+cells.getQuick(i,j+1)+cells.getQuick(i-1,j))/3.0;
}
}
else if (j==0) {
mean=(cells.getQuick(i-1,0)+cells.getQuick(i+1,0)+cells.getQuick(i,1))/3.0;
}
else if (j==m-1) {
mean=(cells.getQuick(i-1,j)+cells.getQuick(i+1,j)+cells.getQuick(i,j-1))/3.0;
}
else {
mean=0.25*(cells.getQuick(i,j-1)+cells.getQuick(i,j+1)+cells.getQuick(i+1,j)
+cells.getQuick(i-1,j));
}
cells.setQuick(i,j,mean+rngN.nextDouble());
}
}
updateImage();
}

}


Again, the code should be reasonably self explanatory, and will compile and run in the same way provided that Parallel COLT is installed and in your classpath. This version runs approximately twice as fast as the previous version on all of the machines I’ve tried it on.

### Reference

I have found the following book very useful for understanding how to work with images in Java:

Hunt, K.A. (2010) The Art of Image Processing with Java, A K Peters/CRC Press.

## Parallel particle filtering and pMCMC using R and multicore

### Introduction

In a previous post I showed how to construct a PMMH pMCMC algorithm for parameter estimation with partially observed Markov processes. The inner loop of a pMCMC algorithm consists of running a particle filter to construct an unbiased estimate of marginal likelihood. This inner loop is the place where the code spends almost all of its time, and so speeding up the particle filter will result in dramatic speedup of the pMCMC algorithm. This is fortunate, since as previously discussed, MCMC algorithms are difficult to parallelise other than on a per iteration basis. Here, each iteration can be speeded up if we can effectively parallelise a particle filter. Particle filters are much easier to parallelise than MCMC algorithms, and so it is tempting to try and exploit this within R. In fact, although it is the case that it is possible to effectively parallelise particle filters in efficient languages using low-level parallelisation tools (say, using C with MPI, or Java concurrency tools), it is not so easy to speed up R-based particle filters using R’s high-level parallelisation constructs, as we shall see.

### Particle filters

In the previous post we looked at the function pfMLLik within the CRAN package smfsb. As a reminder, the source code is

pfMLLik <- function (n, simx0, t0, stepFun, dataLik, data)
{
times = c(t0, as.numeric(rownames(data)))
deltas = diff(times)
return(function(...) {
xmat = simx0(n, t0, ...)
ll = 0
for (i in 1:length(deltas)) {
xmat = t(apply(xmat, 1, stepFun, t0 = times[i], deltat = deltas[i], ...))
w = apply(xmat, 1, dataLik, t = times[i + 1], y = data[i,], log = FALSE, ...)
if (max(w) < 1e-20) {
warning("Particle filter bombed")
return(-1e+99)
}
ll = ll + log(mean(w))
rows = sample(1:n, n, replace = TRUE, prob = w)
xmat = xmat[rows, ]
}
ll
})
}


The function itself doesn’t actually run a particle filter, but instead returns a function closure which does (see the previous post for a discussion of lexical scope and function closures in R). There are obviously several different steps within the particle filter, and several of these are amenable to parallelisation. However, for complex models, forward simulation from the model will be the rate-limiting step, where the vast majority of CPU cycles will be spent. Line 9 in the above code is where forward simulation takes place, and in particular, the key function call is the apply call:

apply(xmat, 1, stepFun, t0 = times[i], deltat = deltas[i], ...)


This call applies the forward simulation algorithm stepFun to each row of the matrix xmat independently. Since there are no dependencies between the function calls, this is in principle very straightforward to parallelise on multicore hardware.

### Multicore support in R

I’m writing this post on a laptop with an Intel i7 quad core chip, running the 64 bit version of Ubuntu 11.10. R has support for multicore processing on this platform – it is just a simple matter of installing the relevant packages. However, things are changing rapidly regarding multicore support in R right now, so YMMV. Ubuntu 11.10 has R 2.13 by default, but the multicore support is slightly different in the recently released R 2.14. I’m still using R 2.13. I may update this post (or comment) when I move to R 2.14. The main difference is that the package multicore has been replaced by the package parallel. There are a few other minor changes, but it should be easy to adapt what is presented here to 2.14.

There is a new O’Reilly book called Parallel R. I’ve got a copy of it. It does cover the new parallel package in R 2.14, as well as other parallel R topics, but the book is a bit light weight, to say the least, and I reviewed it on this blog. Please read my review for further details before you buy it.

If you haven’t used multicore in R previously, then

install.packages(c("multicore","doMC"))


should get you started (again, I’m assuming that your R version is strictly < 2.14). You can test it has worked with:

library(multicore)
multicore:::detectCores()


When I do this, I get the answer 8 (I have 4 cores, each of which is hyper-threaded). To begin with, I want to tell R to use just 4 process threads, and I can do this with

library(doMC)
registerDoMC(4)


Replacing the second line with registerDoMC() will set things up to use all detected cores (in my case, 8). There are a couple of different strategies we could use to parallelise this. One strategy for parallelising the apply call discussed above is to be to replace it with a foreach / %dopar% loop. This is best illustrated by example. Start with line 9 from the function pfMLLik:

xmat = t(apply(xmat, 1, stepFun, t0 = times[i], deltat = deltas[i], ...))


We can produce a parallelised version by replacing this line with the following block of code:

res=foreach(j=1:dim(xmat)[1]) %dopar% {
stepFun(xmat[j,], t0 = times[i], deltat = deltas[i], ...)
}
xmat=t(sapply(res,cbind))


Each iteration of the foreach loop is executed independently (possibly using multiple cores), and the result of each iteration is returned as a list, and captured in res. This list of return vectors is then coerced back into a matrix with the final line.

In fact, we can improve on this by using the .combine argument to foreach, which describes how to combine the results from each iteration. Here we can just use rbind to combine the results into a matrix, using:

xmat=foreach(j=1:dim(xmat)[1], .combine="rbind") %dopar% {
stepFun(xmat[j,], t0 = times[i], deltat = deltas[i], ...)
}


This code is much neater, and in principle ought to be a bit faster, though I haven’t noticed much difference in practice.

In fact, it is not necessary to use the foreach construct at all. The multicore package provides the mclapply function, which is a multicore version of lapply. To use mclapply (or, indeed, lapply) here, we first need to split our matrix into a list of rows, which we can do using the split command. So in fact, our apply call can be replaced with the single line:

xmat=t(sapply(mclapply(split(xmat,row(xmat)), stepFun, t0=times[i], deltat=deltas[i], ...),cbind))


This is actually a much cleaner solution than the method using foreach, but it does require grokking a bit more R. Note that mclapply uses a different method to specify the number of threads to use than foreach/doMC. Here you can either use the named argument to mclapply, mc.cores, or use options(), eg. options(cores=4).

As well as being much cleaner, I find that the mclapply approach is much faster than the foreach/dopar approach for this problem. I’m guessing that this is because foreach doesn’t pre-schedule tasks by default, whereas mclapply does, but I haven’t had a chance to dig into this in detail yet.

### A parallelised particle filter

We can now splice the parallelised forward simulation step (using mclapply) back into our particle filter function to get:

require(multicore)
pfMLLik <- function (n, simx0, t0, stepFun, dataLik, data)
{
times = c(t0, as.numeric(rownames(data)))
deltas = diff(times)
return(function(...) {
xmat = simx0(n, t0, ...)
ll = 0
for (i in 1:length(deltas)) {
xmat=t(sapply(mclapply(split(xmat,row(xmat)), stepFun, t0=times[i], deltat=deltas[i], ...),cbind))
w = apply(xmat, 1, dataLik, t = times[i + 1], y = data[i,], log = FALSE, ...)
if (max(w) < 1e-20) {
warning("Particle filter bombed")
return(-1e+99)
}
ll = ll + log(mean(w))
rows = sample(1:n, n, replace = TRUE, prob = w)
xmat = xmat[rows, ]
}
ll
})
}


This can be used in place of the version supplied with the smfsb package for slow simulation algorithms running on modern multicore machines.

There is an issue regarding Monte Carlo simulations such as this and the multicore package (whether you use mclapply or foreach/dopar) in that it adopts a “different seeds” approach to parallel random number generation, rather than a true parallel random number generator. This probably isn’t worth worrying too much about now, since it is fixed in the new parallel package in R 2.14, but is something to be aware of. I discuss parallel random number generation issues in Wilkinson (2005).

### Multiple particle filters and pMCMC

Let’s look again at the main loop of the pMCMC algorithm discussed in the previous post:

for (i in 1:iters) {
message(paste(i,""),appendLF=FALSE)
for (j in 1:thin) {
thprop=th*exp(rnorm(p,0,tune))
llprop=mLLik(thprop)
if (log(runif(1)) < llprop - ll) {
th=thprop
ll=llprop
}
}
thmat[i,]=th
}


It is clear that the main computational bottleneck of this code is the call to mLLik on line 5, as this is the call which runs the particle filter. The purpose of making the call is to obtain an unbiased estimate of marginal likelihood. However, there are plenty of other ways that we can obtain such estimates than by running a single particle filter. In particular, we could run multiple particle filters and average the results. So, let’s look at how to do this in the multicore setting. Let’s start by thinking about running 4 particle filters. We could just replace the line

llprop=mLLik(thprop)


with the code

llprop=0.25*foreach(i=1:4, .combine="+") %dopar% {
mLLik(thprop)
}


Now, there are at least 2 issues with this. The first is that we are now just running 4 particle filters rather than 1, and so even with perfect parallelisation, it will run no quicker than the code we started with. However, the idea is that by running 4 particle filters we ought to be able to get away with each particle filter using fewer particles, though it isn’t trivial to figure out exactly how many. For example, averaging the results from 4 particle filters, each of which uses 25 particles is not as good as running a single particle filter with 100 particles. In practice, some trial and error is likely to be required. The second problem is that we have computed the mean of the log of the likelihoods, and not the likelihoods themselves. This will almost certainly work fine in practice, as the resulting estimate will in most cases be very close to unbiased, but it will not be exactly unbiased, as so will not lead to an “exact” approximate algorithm. In principle, this can be fixed by instead using

res=foreach(i=1:4) %dopar% {
mLLik(thprop)
}
llprop=log(mean(sapply(res,exp)))


but in practice this is likely to be subject to numerical underflow problems, as it involves manipulating raw likelihood values, which is generally a bad idea. It is possible to compute the log of the mean of the likelihoods in a more numerically stable way, but that is left as an exercise for the reader, as this post is way too long already… However, one additional tip worth mentioning is that the foreach package includes a convenience function called times for situations like the above, where the argument is not varying over calls. So the above code can be replaced with

res=times(4) %dopar% mLLik(thprop)
llprop=log(mean(sapply(res,exp)))


which is a bit cleaner and more readable.

Using this approach to parallelisation, there is now a much better chance of getting some speedup on multicore architectures, as the granularity of the tasks being parallelised is now much larger. Consider the example from the previous post, where at each iteration we ran a particle filter with 100 particles. If we now re-run that example, but instead use 4 particle filters each using 25 particles, we do get a slight speedup. However, on my laptop, the speedup is only around a factor of 1.6 using 4 cores, and as already discussed, 4 filters each with 25 particles isn’t actually quite as good as a single filter with 100 particles anyway. So, the benefits are rather modest here, but will be much better with less trivial examples (slower simulators). For completeness, a complete runnable demo script is included after the references. Also, it is probably worth emphasising that if your pMCMC algorithm has a short burn-in period, you may well get much better overall speed-ups by just running parallel MCMC chains. Depressing, perhaps, but true.

### References

• McCallum, E., Weston, S. (2011) Parallel R, O’Reilly.
• Wilkinson, D. J. (2005) Parallel Bayesian Computation, Chapter 16 in E. J. Kontoghiorghes (ed.) Handbook of Parallel Computing and Statistics, Marcel Dekker/CRC Press, 481-512.
• Wilkinson, D. J. (2011) Stochastic Modelling for Systems Biology, second edition, Boca Raton, Florida: Chapman & Hall/CRC Press.
• ### Demo script

require(smfsb)
data(LVdata)

require(multicore)
require(doMC)
registerDoMC(4)

# set up data likelihood
noiseSD=10
dataLik <- function(x,t,y,log=TRUE,...)
{
ll=sum(dnorm(y,x,noiseSD,log=TRUE))
if (log)
return(ll)
else
return(exp(ll))
}
# now define a sampler for the prior on the initial state
simx0 <- function(N,t0,...)
{
mat=cbind(rpois(N,50),rpois(N,100))
colnames(mat)=c("x1","x2")
mat
}
# convert the time series to a timed data matrix
LVdata=as.timedData(LVnoise10)
# create marginal log-likelihood functions, based on a particle filter

# use 25 particles instead of 100
mLLik=pfMLLik(25,simx0,0,stepLVc,dataLik,LVdata)

iters=1000
tune=0.01
thin=10
th=c(th1 = 1, th2 = 0.005, th3 = 0.6)
p=length(th)
ll=-1e99
thmat=matrix(0,nrow=iters,ncol=p)
colnames(thmat)=names(th)
# Main pMCMC loop
for (i in 1:iters) {
message(paste(i,""),appendLF=FALSE)
for (j in 1:thin) {
thprop=th*exp(rnorm(p,0,tune))
res=times(4) %dopar% mLLik(thprop)
llprop=log(mean(sapply(res,exp)))
if (log(runif(1)) < llprop - ll) {
th=thprop
ll=llprop
}
}
thmat[i,]=th
}
message("Done!")
# Compute and plot some basic summaries
mcmcSummary(thmat)


## Particle filtering and pMCMC using R

In the previous post I gave a quick introduction to the CRAN R package smfsb, and how it can be used for simulation of Markov processes determined by stochastic kinetic networks. In this post I’ll show how to use data and particle MCMC techniques in order to carry out Bayesian inference for the parameters of partially observed Markov processes.

### The simulation model and the data

For this post we will assume that the smfsb package is installed and loaded (see previous post for details). The package includes the function stepLVc which simulates from the Markov transition kernel of a Lotka-Volterra process by calling out to some native C code for speed. So, for example,

stepLVc(c(x1=50,x2=100),0,1)


will simulate the state of the process at time 1 given an initial condition of 50 prey and 100 predators at time 0, using the default rate parameters of the function, th = c(1, 0.005, 0.6). The package also includes some data simulated from this model using these parameters, with and without added noise. The datasets can be loaded with

data(LVdata)


For simplicity, we will just make use of the dataset LVnoise10 in this post. This dataset is a multivariate time series consisting of 16 equally spaced observations on both prey and predators subject to Gaussian measurement error with a standard deviation of 10. We can plot the data with

plot(LVnoise10,plot.type="single",col=c(2,4))


giving:

The Bayesian inference problem is to see how much we are able to learn about the parameters which generated the data using only the data and our knowledge of the structure of the problem. There are many approaches one can take to this problem, but most are computationally intensive, due to the analytical intractability of the transition kernel of the LV process. Here we will follow Wilkinson (2011) and take a particle MCMC (pMCMC) approach, and specifically, use a pseudo-marginal “exact approximate” MCMC algorithm based on the particle marginal Metropolis-Hastings (PMMH) algorithm. I have discussed the pseudo-marginal approach, using particle filters for marginal likelihood estimation, and the PMMH algorithm in previous posts, so if you have been following my posts for a while, this should all make perfect sense…

### Particle filter

One of the key ingredients required to implement the pseudo-marginal MCMC scheme is a (bootstrap) particle filter which generates an unbiased estimate of the marginal likelihood of the data given the parameters (integrated over the unobserved state trajectory). The algorithm was discussed in this post, and R code to implement this is included in the smfsb R package as pfMLLik. For reasons of numerical stability, the function computes and returns the log of the marginal likelihood, but it is important to understand that it is the actually likelihood estimate that is unbiased for the true likelihood, and not the corresponding statement for the logs. The actual code of the function is relatively short, and for completeness is given below:

pfMLLik <- function (n, simx0, t0, stepFun, dataLik, data)
{
times = c(t0, as.numeric(rownames(data)))
deltas = diff(times)
return(function(...) {
xmat = simx0(n, t0, ...)
ll = 0
for (i in 1:length(deltas)) {
xmat = t(apply(xmat, 1, stepFun, t0 = times[i], deltat = deltas[i], ...))
w = apply(xmat, 1, dataLik, t = times[i + 1], y = data[i,], log = FALSE, ...)
if (max(w) < 1e-20) {
warning("Particle filter bombed")
return(-1e+99)
}
ll = ll + log(mean(w))
rows = sample(1:n, n, replace = TRUE, prob = w)
xmat = xmat[rows, ]
}
ll
})
}


We need to set up the prior and the data likelihood correctly before we can use this function, but first note that the function does not actually run a particle filter at all, but instead stores everything it needs to know to run the particle filter in the local environment, and then returns a function closure for evaluating the marginal likelihood at a given set of parameters. The resulting function (closure) can then be used to run a particle filter for a given set of parameters, simply by passing the required parameters into the function. This functional programming style is consistent with that used throughout the smfsb R package, and leads to quite simple, modular code. To use pfMLLik, we first need to define a function which evaluates the log-likelihood of an observation conditional on the true state, and another which samples from the prior distribution of the initial state of the system. Here, we can do that as follows.

# set up data likelihood
noiseSD=10
dataLik <- function(x,t,y,log=TRUE,...)
{
ll=sum(dnorm(y,x,noiseSD,log=TRUE))
if (log)
return(ll)
else
return(exp(ll))
}
# now define a sampler for the prior on the initial state
simx0 <- function(N,t0,...)
{
mat=cbind(rpois(N,50),rpois(N,100))
colnames(mat)=c("x1","x2")
mat
}
# convert the time series to a timed data matrix
LVdata=as.timedData(LVnoise10)
# create marginal log-likelihood functions, based on a particle filter
mLLik=pfMLLik(100,simx0,0,stepLVc,dataLik,LVdata)


Now the function (closure) mLLik will, for a given parameter vector, run a particle filter (using 100 particles) and return the log of the particle filter’s unbiased estimate of the marginal likelihood of the data. It is then very easy to use this function to create a simple PMMH algorithm for parameter inference.

### PMMH algorithm

Below is an algorithm based on flat priors and a simple Metropolis-Hastings update for the parameters using the function closure mLLik, defined above.

iters=1000
tune=0.01
thin=10
th=c(th1 = 1, th2 = 0.005, th3 = 0.6)
p=length(th)
ll=-1e99
thmat=matrix(0,nrow=iters,ncol=p)
colnames(thmat)=names(th)
# Main pMCMC loop
for (i in 1:iters) {
message(paste(i,""),appendLF=FALSE)
for (j in 1:thin) {
thprop=th*exp(rnorm(p,0,tune))
llprop=mLLik(thprop)
if (log(runif(1)) < llprop - ll) {
th=thprop
ll=llprop
}
}
thmat[i,]=th
}
message("Done!")
# Compute and plot some basic summaries
mcmcSummary(thmat)


This will take a little while to run, but in the end should give a plot something like the following (click for full size):

So, although we should really run the chain for a bit longer, we see that we can learn a great deal about the parameters of the process from very little data. For completeness, a full runnable demo script is included below the references. Of course there are many obvious extensions of this basic problem, such as partial observation (eg. only observing the prey) and unknown measurement error. These are discussed in Wilkinson (2011), and code for these cases is included within the demo(PMCMC), which should be inspected for further details.

### Discussion

At this point it is probably worth emphasising that there are other “likelihood free” approaches which can be taken to parameter inference for partially observed Markov process (POMP) models. Many of these are implemented in the pomp R package, also available from CRAN, by King et al (2008). The pomp package is well documented, and has a couple of good tutorial vignettes which should be sufficient to get people started. The API of the package is rather cumbersome, but the algorithms appear to be quite robust. Approximate Bayesian computation (ABC) approaches are also quite natural for POMP models (see, for example, Toni et al (2009)). This is because “exact” likelihood free procedures break down in the case of low/no measurement error or high-dimensional observations. There are some R packages for ABC, but I am not sufficiently familiar with them to be able to give recommendations.

If one is able to move away from the “likelihood free” paradigm, it is possible to develop “exact” pMCMC algorithms which do not break down in challenging observation scenarios. The problem here is the intractability of the Markov transition kernel. In the case of nonlinear Markov jump processes, finding very generic solutions seems quite difficult, but for diffusion (approximation) processes based on stochastic differential equations, it seems to be possible to develop irreducible pMCMC algorithms which have very broad applicability – see Golightly and Wilkinson (2011) for further details of how such techniques can be used in the context of stochastic kinetic models similar to those considered in this post.

### References

• Golightly, A., Wilkinson, D. J. (2011) Bayesian parameter inference for stochastic biochemical network models using particle MCMC, Interface Focus, 1(6):807-820.
• King, A.A., Ionides, E.L., & Breto, C.M. (2008) pomp: Statistical inference for partially observed Markov processes, CRAN.
• Toni, T., Welch, D., Strelkowa, N., Ipsen, A. & Stumpf, M. (2009) Approximate Bayesian computation scheme for parameter inference and model selection in dynamical systems, J. R. Soc. Interface 6(31): 187-202.
• Wilkinson, D. J. (2011) Stochastic Modelling for Systems Biology, second edition, Boca Raton, Florida: Chapman & Hall/CRC Press.
• ### Demo script

require(smfsb)
data(LVdata)

# set up data likelihood
noiseSD=10
dataLik <- function(x,t,y,log=TRUE,...)
{
ll=sum(dnorm(y,x,noiseSD,log=TRUE))
if (log)
return(ll)
else
return(exp(ll))
}
# now define a sampler for the prior on the initial state
simx0 <- function(N,t0,...)
{
mat=cbind(rpois(N,50),rpois(N,100))
colnames(mat)=c("x1","x2")
mat
}
# convert the time series to a timed data matrix
LVdata=as.timedData(LVnoise10)
# create marginal log-likelihood functions, based on a particle filter
mLLik=pfMLLik(100,simx0,0,stepLVc,dataLik,LVdata)

iters=1000
tune=0.01
thin=10
th=c(th1 = 1, th2 = 0.005, th3 = 0.6)
p=length(th)
ll=-1e99
thmat=matrix(0,nrow=iters,ncol=p)
colnames(thmat)=names(th)
# Main pMCMC loop
for (i in 1:iters) {
message(paste(i,""),appendLF=FALSE)
for (j in 1:thin) {
thprop=th*exp(rnorm(p,0,tune))
llprop=mLLik(thprop)
if (log(runif(1)) < llprop - ll) {
th=thprop
ll=llprop
}
}
thmat[i,]=th
}
message("Done!")
# Compute and plot some basic summaries
mcmcSummary(thmat)


## Introduction

In the previous post I explained how one can use an unbiased estimate of marginal likelihood derived from a particle filter within a Metropolis-Hastings MCMC algorithm in order to construct an exact pseudo-marginal MCMC scheme for the posterior distribution of the model parameters given some time course data. This idea is closely related to that of the particle marginal Metropolis-Hastings (PMMH) algorithm of Andreiu et al (2010), but not really exactly the same. This is because for a Bayesian model with parameters $\theta$, latent variables $x$ and data $y$, of the form

$\displaystyle p(\theta,x,y) = p(\theta)p(x|\theta)p(y|x,\theta),$

the pseudo-marginal algorithm which exploits the fact that the particle filter’s estimate of likelihood is unbiased is an MCMC algorithm which directly targets the marginal posterior distribution $p(\theta|y)$. On the other hand, the PMMH algorithm is an MCMC algorithm which targets the full joint posterior distribution $p(\theta,x|y)$. Now, the PMMH scheme does reduce to the pseudo-marginal scheme if samples of $x$ are not generated and stored in the state of the Markov chain, and it certainly is the case that the pseudo-marginal algorithm gives some insight into why the PMMH algorithm works. However, the PMMH algorithm is much more powerful, as it solves the “smoothing” and parameter estimation problem simultaneously and exactly, including the “initial value” problem (computing the posterior distribution of the initial state, $x_0$). Below I will describe the algorithm and explain why it works, but first it is necessary to understand the relationship between marginal, joint and “likelihood-free” MCMC updating schemes for such latent variable models.

### MCMC for latent variable models

#### Marginal approach

If we want to target $p(\theta|y)$ directly, we can use a Metropolis-Hastings scheme with a fairly arbitrary proposal distribution for exploring $\theta$, where a new $\theta^\star$ is proposed from $f(\theta^\star|\theta)$ and accepted with probability $\min\{1,A\}$, where

$\displaystyle A = \frac{p(\theta^\star)}{p(\theta)} \times \frac{f(\theta|\theta^\star)}{f(\theta^\star|\theta)} \times \frac{p({y}|\theta^\star)}{p({y}|\theta)}.$

As previously discussed, the problem with this scheme is that the marginal likelihood $p(y|\theta)$ required in the acceptance ratio is often difficult to compute.

#### Likelihood-free MCMC

A simple “likelihood-free” scheme targets the full joint posterior distribution $p(\theta,x|y)$. It works by exploiting the fact that we can often simulate from the model for the latent variables $p(x|\theta)$ even when we can’t evaluate it, or marginalise $x$ out of the problem. Here the Metropolis-Hastings proposal is constructed in two stages. First, a proposed new $\theta^\star$ is sampled from $f(\theta^\star|\theta)$ and then a corresponding $x^\star$ is simulated from the model $p(x^\star|\theta^\star)$. The pair $(\theta^\star,x^\star)$ is then jointly accepted with ratio

$\displaystyle A = \frac{p(\theta^\star)}{p(\theta)} \times \frac{f(\theta|\theta^\star)}{f(\theta^\star|\theta)} \times \frac{p(y|{x}^\star,\theta^\star)}{p(y|{x},\theta)}.$

The proposal mechanism ensures that the proposed $x^\star$ is consistent with the proposed $\theta^\star$, and so the procedure can work provided that the dimension of the data $y$ is low. However, in order to work well more generally, we would want the proposed latent variables to be consistent with the data as well as the model parameters.

#### Ideal joint update

Motivated by the likelihood-free scheme, we would really like to target the joint posterior $p(\theta,x|y)$ by first proposing $\theta^\star$ from $f(\theta^\star|\theta)$ and then a corresponding $x^\star$ from the conditional distribution $p(x^\star|\theta^\star,y)$. The pair $(\theta^\star,x^\star)$ is then jointly accepted with ratio

$\displaystyle A = \frac{p(\theta^\star)}{p(\theta)} \frac{p({x}^\star|\theta^\star)}{p({x}|\theta)} \frac{f(\theta|\theta^\star)}{f(\theta^\star|\theta)} \frac{p(y|{x}^\star,\theta^\star)}{p(y|{x},\theta)} \frac{p({x}|y,\theta)}{p({x}^\star|y,\theta^\star)}\\ \qquad = \frac{p(\theta^\star)}{p(\theta)} \frac{p(y|\theta^\star)}{p(y|\theta)} \frac{f(\theta|\theta^\star)}{f(\theta^\star|\theta)}.$

Notice how the acceptance ratio simplifies, using the basic marginal likelihood identity (BMI) of Chib (1995), and $x$ drops out of the ratio completely in order to give exactly the ratio used for the marginal updating scheme. Thus, the “ideal” joint updating scheme reduces to the marginal updating scheme if $x$ is not sampled and stored as a component of the Markov chain.

Understanding the relationship between these schemes is useful for understanding the PMMH algorithm. Indeed, we will see that the “ideal” joint updating scheme (and the marginal scheme) corresponds to PMMH using infinitely many particles in the particle filter, and that the likelihood-free scheme corresponds to PMMH using exactly one particle in the particle filter. For an intermediate number of particles, the PMMH scheme is a compromise between the “ideal” scheme and the “blind” likelihood-free scheme, but is always likelihood-free (when used with a bootstrap particle filter) and always has an acceptance ratio leaving the exact posterior invariant.

### The PMMH algorithm

#### The algorithm

The PMMH algorithm is an MCMC algorithm for state space models jointly updating $\theta$ and $x_{0:T}$, as the algorithms above. First, a proposed new $\theta^\star$ is generated from a proposal $f(\theta^\star|\theta)$, and then a corresponding $x_{0:T}^\star$ is generated by running a bootstrap particle filter (as described in the previous post, and below) using the proposed new model parameters, $\theta^\star$, and selecting a single trajectory by sampling once from the final set of particles using the final set of weights. This proposed pair $(\theta^\star,x_{0:T}^\star)$ is accepted using the Metropolis-Hastings ratio

$\displaystyle A = \frac{\hat{p}_{\theta^\star}(y_{1:T})p(\theta^\star)q(\theta|\theta^\star)}{\hat{p}_{\theta}(y_{1:T})p(\theta)q(\theta^\star|\theta)},$

where $\hat{p}_{\theta^\star}(y_{1:T})$ is the particle filter’s (unbiased) estimate of marginal likelihood, described in the previous post, and below. Note that this approach tends to the perfect joint/marginal updating scheme as the number of particles used in the filter tends to infinity. Note also that for a single particle, the particle filter just blindly forward simulates from $p_\theta(x^\star_{0:T})$ and that the filter’s estimate of marginal likelihood is just the observed data likelihood $p_\theta(y_{1:T}|x^\star_{0:T})$ leading precisely to the simple likelihood-free scheme. To understand for an arbitrary finite number of particles, $M$, one needs to think carefully about the structure of the particle filter.

#### Why it works

To understand why PMMH works, it is necessary to think about the joint distribution of all random variables used in the bootstrap particle filter. To this end, it is helpful to re-visit the particle filter, thinking carefully about the resampling and propagation steps.

First introduce notation for the “particle cloud”: $\mathbf{x}_t=\{x_t^k|k=1,\ldots,M\}$, $\boldsymbol{\pi}_t=\{\pi_t^k|k=1,\ldots,M\}$, $\tilde{\mathbf{x}}_t=\{(x_t^k,\pi_t^k)|k=1,\ldots,M\}$. Initialise the particle filter with $\tilde{\mathbf{x}}_0$, where $x_0^k\sim p(x_0)$ and $\pi_0^k=1/M$ (note that $w_0^k$ is undefined). Now suppose at time $t$ we have a sample from $p(x_t|y_{1:t})$: $\tilde{\mathbf{x}}_t$. First resample by sampling $a_t^k \sim \mathcal{F}(a_t^k|\boldsymbol{\pi}_t)$, $k=1,\ldots,M$. Here we use $\mathcal{F}(\cdot|\boldsymbol{\pi})$ for the discrete distribution on $1:M$ with probability mass function $\boldsymbol{\pi}$. Next sample $x_{t+1}^k\sim p(x_{t+1}^k|x_t^{a_t^k})$. Set $w_{t+1}^k=p(y_{t+1}|x_{t+1}^k)$ and $\pi_{t+1}^k=w_{t+1}^k/\sum_{i=1}^M w_{t+1}^i$. Finally, propagate $\tilde{\mathbf{x}}_{t+1}$ to the next step… We define the filter’s estimate of likelihood as $\hat{p}(y_t|y_{1:t-1})=\frac{1}{M}\sum_{i=1}^M w_t^i$ and $\hat{p}(y_{1:T})=\prod_{i=1}^T \hat{p}(y_t|y_{1:t-1})$. See Doucet et al (2001) for further theoretical background on particle filters and SMC more generally.

Describing the filter carefully as above allows us to write down the joint density of all random variables in the filter as

$\displaystyle \tilde{q}(\mathbf{x}_0,\ldots,\mathbf{x}_T,\mathbf{a}_0,\ldots,\mathbf{a}_{T-1}) = \left[\prod_{k=1}^M p(x_0^k)\right] \left[\prod_{t=0}^{T-1} \prod_{k=1}^M \pi_t^{a_t^k} p(x_{t+1}^k|x_t^{a_t^k}) \right]$

For PMMH we also sample a final index $k'$ from $\mathcal{F}(k'|\boldsymbol{\pi}_T)$ giving the joint density

$\displaystyle \tilde{q}(\mathbf{x}_0,\ldots,\mathbf{x}_T,\mathbf{a}_0,\ldots,\mathbf{a}_{T-1})\pi_T^{k'}$

We write the final selected trajectory as

$\displaystyle x_{0:T}^{k'}=(x_0^{b_0^{k'}},\ldots,x_T^{b_T^{k'}}),$

where $b_t^{k'}=a_t^{b_{t+1}^{k'}}$, and $b_T^{k'}=k'$. If we now think about the structure of the PMMH algorithm, our proposal on the space of all random variables in the problem is in fact

$\displaystyle f(\theta^\star|\theta)\tilde{q}_{\theta^\star}(\mathbf{x}_0^\star,\ldots,\mathbf{x}_T^\star,\mathbf{a}_0^\star,\ldots,\mathbf{a}_{T-1}^\star)\pi_T^{{k'}^\star}$

and by considering the proposal and the acceptance ratio, it is clear that detailed balance for the chain is satisfied by the target with density proportional to

$\displaystyle p(\theta)\hat{p}_\theta(y_{1:T}) \tilde{q}_\theta(\mathbf{x}_0,\ldots,\mathbf{x}_T,\mathbf{a}_0,\ldots,\mathbf{a}_{T-1}) \pi_T^{k'}$

We want to show that this target marginalises down to the correct posterior $p(\theta,x_{0:T}|y_{1:T})$ when we consider just the parameters and the selected trajectory. But if we consider the terms in the joint distribution of the proposal corresponding to the trajectory selected by $k'$, this is given by

$\displaystyle p_\theta(x_0^{b_0^{k'}})\left[\prod_{t=0}^{T-1} \pi_t^{b_t^{k'}} p_\theta(x_{t+1}^{b_{t+1}^{k'}}|x_t^{b_t^{k'}})\right]\pi_T^{k'} = p_\theta(x_{0:T}^{k'})\prod_{t=0}^T \pi_t^{b_t^{k'}}$

which, by expanding the $\pi_t^{b_t^{k'}}$ in terms of the unnormalised weights, simplifies to

$\displaystyle \frac{p_\theta(x_{0:T}^{k'})p_\theta(y_{1:T}|x_{0:T}^{k'})}{M^{T+1}\hat{p}_\theta(y_{1:T})}$

It is worth dwelling on this result, as this is the key insight required to understand why the PMMH algorithm works. The whole point is that the terms in the joint density of the proposal corresponding to the selected trajectory exactly represent the required joint distribution modulo a couple of normalising constants, one of which is the particle filter’s estimate of marginal likelihood. Thus, by including $\hat{p}_\theta(y_{1:T})$ in the acceptance ratio, we knock out the normalising constant, allowing all of the other terms in the proposal to be marginalised away. In other words, the target of the chain can be written as proportional to

$\displaystyle \frac{p(\theta)p_\theta(x_{0:T}^{k'},y_{1:T})}{M^{T+1}} \times \text{(Other terms...)}$

The other terms are all probabilities of random variables which do not occur elsewhere in the target, and hence can all be marginalised away to leave the correct posterior

$\displaystyle p(\theta,x_{0:T}|y_{1:T})$

Thus the PMMH algorithm targets the correct posterior for any number of particles, $M$. Also note the implied uniform distribution on the selected indices in the target.

I will give some code examples in a future post.

## Motivation and background

In this post I will try and explain an important idea behind some recent developments in MCMC theory. First, let me give some motivation. Suppose you are trying to implement a Metropolis-Hastings algorithm, as discussed in a previous post (required reading!), but a key likelihood term needed for the acceptance ratio is difficult to evaluate (most likely it is a marginal likelihood of some sort). If it is possible to obtain a Monte-Carlo estimate for that likelihood term (which it sometimes is, for example, using importance sampling), one could obviously just plug it in to the acceptance ratio and hope for the best. What is not at all obvious is that if your Monte-Carlo estimate satisfies some fairly weak property then the equilibrium distribution of the Markov chain will remain exactly as it would be if the exact likelihood had been available. Furthermore, it is exact even if the Monte-Carlo estimate is very noisy and imprecise (though the mixing of the chain will be poorer in this case). This is the “exact approximate” pseudo-marginal MCMC approach. To give credit where it is due, the idea was first introduced by Mark Beaumont in Beaumont (2003), where he was using an importance sampling based approximate likelihood in the context of a statistical genetics example. This was later picked up by Christophe Andrieu and Gareth Roberts, who studied the technical properties of the approach in Andrieu and Roberts (2009). The idea is turning out to be useful in several contexts, and in particular, underpins the interesting new Particle MCMC algorithms of Andrieu et al (2010), which I will discuss in a future post. I’ve heard Mark, Christophe, Gareth and probably others present this concept, but this post is most strongly inspired by a talk that Christophe gave at the IMS 2010 meeting in Gothenburg this summer.

## The pseudo-marginal Metropolis-Hastings algorithm

Let’s focus on the simplest version of the problem, where we want to sample from a target $p(x)$ using a proposal $q(x'|x)$. As explained previously, the required Metropolis-Hastings acceptance ratio will take the form

$\displaystyle A=\frac{p(x')q(x|x')}{p(x)q(x'|x)}.$

Here we are assuming that $p(x)$ is difficult to evaluate (usually because it is a marginalised version of some higher-dimensional distribution), but that a Monte-Carlo estimate of $p(x)$, which we shall denote $r(x)$, can be computed. We can obviously just substitute this estimate into the acceptance ratio to get

$\displaystyle A=\frac{r(x')q(x|x')}{r(x)q(x'|x)},$

but it is not immediately clear that in many cases this will lead to the Markov chain having an equilibrium distribution that is exactly $p(x)$. It turns out that it is sufficient that the likelihood estimate, $r(x)$ is non-negative, and unbiased, in the sense that $E(r(x))=p(x)$, where the expectation is with respect to the Monte-Carlo error for a given fixed value of $x$. In fact, as we shall see, this condition is actually a bit stronger than is really required.

Put $W=r(x)/p(x)$, representing the noise in the Monte-Carlo estimate of $p(x)$, and suppose that $W \sim p(w|x)$ (note that in an abuse of notation, the function $p(w|x)$ is unrelated to $p(x)$). The main condition we will assume is that $E(W|x)=c$, where $c>0$ is a constant independent of $x$. In the case of $c=1$, we have the (typical) special case of $E(r(x))=p(x)$. For now, we will also assume that $W\geq 0$, but we will consider relaxing this constraint later.

The key to understanding the pseudo-marginal approach is to realise that at each iteration of the MCMC algorithm a new value of $W$ is being proposed in addition to a new value for $x$. If we regard the proposal mechanism as a joint update of $x$ and $w$, it is clear that the proposal generates $(x',w')$ from the density $q(x'|x)p(w'|x')$, and we can re-write our “approximate” acceptance ratio as

$\displaystyle A=\frac{w'p(x')p(w'|x')q(x|x')p(w|x)}{wp(x)p(w|x)q(x'|x)p(w'|x')}.$

Inspection of this acceptance ratio reveals that the target of the chain must be (proportional to)

$p(x)wp(w|x).$

This is a joint density for $(x,w)$, but the marginal for $x$ can be obtained by integrating over the range of $W$ with respect to $w$. Using the fact that $E(W|x)=c$, this then clearly gives a density proportional to $p(x)$, and this is precisely the target that we require. Note that for this to work, we must keep the old value of $w$ from one iteration to the next – that is, we must keep and re-use our noisy $r(x)$ value to include in the acceptance ratio for our next MCMC iteration – we should not compute a new Monte-Carlo estimate for the likelihood of the old state of the chain.

## Examples

We will consider again the example from the previous post – simulation of a chain with a N(0,1) target using uniform innovations. Using R, the main MCMC loop takes the form

pmmcmc<-function(n=1000,alpha=0.5)
{
vec=vector("numeric", n)
x=0
oldlik=noisydnorm(x)
vec[1]=x
for (i in 2:n) {
innov=runif(1,-alpha,alpha)
can=x+innov
lik=noisydnorm(can)
aprob=lik/oldlik
u=runif(1)
if (u < aprob) {
x=can
oldlik=lik
}
vec[i]=x
}
vec
}


Here we are assuming that we are unable to compute dnorm exactly, but instead only a Monte-Carlo estimate called noisydnorm. We can start with the following implementation

noisydnorm<-function(z)
{
dnorm(z)*rexp(1,1)
}


Each time this function is called, it will return a non-negative random quantity whose expectation is dnorm(z). We can now run this code as follows.

plot.mcmc<-function(mcmc.out)
{
op=par(mfrow=c(2,2))
plot(ts(mcmc.out),col=2)
hist(mcmc.out,30,col=3)
qqnorm(mcmc.out,col=4)
abline(0,1,col=2)
acf(mcmc.out,col=2,lag.max=100)
par(op)
}

metrop.out<-pmmcmc(10000,1)
plot.mcmc(metrop.out)


So we see that we have exactly the right N(0,1) target, despite using the (very) noisy noisydnorm function in place of the dnorm function. This noisy likelihood function is clearly unbiased. However, as already discussed, a constant bias in the noise is also acceptable, as the following function shows.

noisydnorm<-function(z)
{
dnorm(z)*rexp(1,2)
}


Re-running the code with this function also leads to the correct equilibrium distribution for the chain. However, it really does matter that the bias is independent of the state of the chain, as the following function shows.

noisydnorm<-function(z)
{
dnorm(z)*rexp(1,0.1+10*z*z)
}


Running with this function leads to the wrong equilibrium. However, it is OK for the distribution of the noise to depend on the state, as long as its expectation does not. The following function illustrates this.

noisydnorm<-function(z)
{
dnorm(z)*rgamma(1,0.1+10*z*z,0.1+10*z*z)
}


This works just fine. So far we have been assuming that our noisy likelihood estimates are non-negative. This is clearly desirable, as otherwise we could wind up with negative Metropolis-Hasting ratios. However, as long as we are careful about exactly what we mean, even this non-negativity condition may be relaxed. The following function illustrates the point.

noisydnorm<-function(z)
{
dnorm(z)*rnorm(1,1)
}


Despite the fact that this function will often produce negative values, the equilibrium distribution of the chain still seems to be correct! An even more spectacular example follows.

noisydnorm<-function(z)
{
dnorm(z)*rnorm(1,0)
}


Astonishingly, this one works too, despite having an expected value of zero! However, the following doesn’t work, even though it too has a constant expectation of zero.

noisydnorm<-function(z)
{
dnorm(z)*rnorm(1,0,0.1+10*z*z)
}


I don’t have time to explain exactly what is going on here, so the precise details are left as an exercise for the reader. Suffice to say that the key requirement is that it is the distribution of W conditioned to be non-negative which must have the constant bias property – something clearly violated by the final noisydnorm example.

Before finishing this post, it is worth re-emphasising the issue of re-using old Monte-Carlo estimates. The following function will not work (exactly), though in the case of good Monte-Carlo estimates will often work tolerably well.

approxmcmc<-function(n=1000,alpha=0.5)
{
vec=vector("numeric", n)
x=0
vec[1]=x
for (i in 2:n) {
innov=runif(1,-alpha,alpha)
can=x+innov
lik=noisydnorm(can)
oldlik=noisydnorm(x)
aprob=lik/oldlik
u=runif(1)
if (u < aprob) {
x=can
}
vec[i]=x
}
vec
}


In a subsequent post I will show how these ideas can be put into practice in the context of a Bayesian inference example.