Tuning particle MCMC algorithms

Several papers have appeared recently discussing how to tune the number of particles used in the particle filter within a particle MCMC algorithm such as particle marginal Metropolis-Hastings (PMMH). Three such papers are:

I have discussed pseudo-marginal MCMC and particle MCMC algorithms in previous posts. It will be useful to refer back to these posts if these topics are unfamiliar. Within particle MCMC algorithms (and pseudo-marginal MCMC algorithms more generally), an unbiased estimate of marginal likelihood is constructed using a number of particles. The more particles that are used, the better the estimate of marginal likelihood, and the more closely the resulting MCMC algorithm will behave like a “real” marginal MCMC algorithm. For a small number of particles, the algorithm will still have exactly the correct target, but the noise in the unbiased estimator of marginal likelihood will lead to poor mixing of the MCMC chain. The idea is to use just enough particles to ensure that there isn’t “too much” noise in the unbiased estimator, but not to waste lots of time producing a super-accurate estimate of marginal likelihood if that isn’t necessary to ensure good mixing of the MCMC chain.
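
To make this concrete, here is a minimal sketch (not the code used later in this post) of a pseudo-marginal random walk Metropolis-Hastings loop, assuming hypothetical functions estLogLik(theta), returning the log of an unbiased estimate of marginal likelihood, and logPrior(theta), returning the log of the prior density:

pmMH=function(estLogLik,logPrior,theta0,iters=1000,tune=0.01)
{
    p=length(theta0)
    mat=matrix(0,nrow=iters,ncol=p)
    theta=theta0
    ll=estLogLik(theta)  # noisy log-likelihood estimate at the current value
    for (i in 1:iters) {
        thetaProp=theta+rnorm(p,0,tune)  # symmetric random walk proposal
        llProp=estLogLik(thetaProp)  # fresh estimate at the proposed value
        logA=llProp-ll+logPrior(thetaProp)-logPrior(theta)
        if (log(runif(1))<logA) {
            theta=thetaProp
            ll=llProp  # carry the accepted estimate forward
        }
        mat[i,]=theta
    }
    mat
}

The key point is that the current log-likelihood estimate is stored and re-used until a proposal is accepted; refreshing it at every iteration would break the exactness of the target.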

The papers above try to give theoretical justifications for certain “rules of thumb” that are commonly used in practice. One widely adopted scheme is to tune the number of particles so that the variance of the log of the estimate of marginal likelihood is around one. The obvious questions are “where?” and “why?”, and these questions turn out to be connected. As we will see, there isn’t really a good answer to the “where?” question, but what people usually do is use a pilot run to get an estimate of the posterior mean, or mode, or MLE, pick one, and tune the noise variance at that particular parameter value. As to “why?”, well, the papers above make various (slightly different) assumptions, all of which lead to trading off mixing against computation time to obtain an “optimal” number of particles. They don’t all agree that the variance of the noise should be exactly 1, but they all agree to within an order of magnitude.
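
As a rough illustration of how this rule of thumb can be applied in practice, suppose we have a hypothetical function estLogLik(theta,N) returning the log of an unbiased likelihood estimate based on a particle filter with N particles. The variance of the log-estimate typically scales roughly as 1/N for a fixed data set, which suggests something like:

# sketch only: estLogLik(theta,N) is a hypothetical noisy log-likelihood estimator
tuneN=function(estLogLik,thetaHat,N=100,reps=50,targetVar=1)
{
    v=var(replicate(reps,estLogLik(thetaHat,N)))  # noise variance at the pilot value
    ceiling(N*v/targetVar)  # variance scales roughly as 1/N, so rescale N accordingly
}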

All of the above papers make the assumption that the noise distribution associated with the marginal likelihood estimate is independent of the parameter at which it is being evaluated, which explains why there isn’t a really good answer to the “where?” question – under this assumption it doesn’t matter which parameter value is used for tuning, as they are all the same! Easy. Except that’s quite a big assumption, so it would be nice to know that it is reasonable, and unfortunately it isn’t. Let’s look at an example to see what goes wrong.

Example

In Chapter 10 of my book I look in detail at constructing a PMMH algorithm for inferring the parameters of a discretely observed stochastic Lotka-Volterra model. I’ve stepped through the computational details in a previous post which you should refer back to for the necessary background. Following that post, we can construct a particle filter to return an unbiased estimate of marginal likelihood using the following R code (which relies on the smfsb CRAN package):

require(smfsb)
# data
data(LVdata)
data=as.timedData(LVnoise10)
noiseSD=10
# measurement error model: independent Gaussian noise (SD noiseSD) on each observed species
dataLik <- function(x,t,y,log=TRUE,...)
{
    ll=sum(dnorm(y,x,noiseSD,log=TRUE))
    if (log)
        return(ll)
    else
        return(exp(ll))
}
# now define a sampler for the prior on the initial state
simx0 <- function(N,t0,...)
{
    mat=cbind(rpois(N,50),rpois(N,100))
    colnames(mat)=c("x1","x2")
    mat
}
# construct particle filter: 150 particles, initial time 0, LV transition kernel stepLVc
mLLik=pfMLLik(150,simx0,0,stepLVc,dataLik,data)

Again, see the relevant previous post for details. So now mLLik() is a function that will return the log of an unbiased estimate of marginal likelihood (based on 150 particles), given a parameter value at which to evaluate it.
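
It is then straightforward to check the usual rule of thumb at a single pilot parameter value by evaluating the estimator repeatedly and computing the sample variance of the log-estimates. The pilot value below is just a plausible choice for illustration:

# noise variance of the log-likelihood estimator at a pilot parameter value
pilot=c(th1=1,th2=0.005,th3=0.6)
ll=replicate(100,mLLik(th=pilot))
var(ll)  # rule of thumb: tune the number of particles so that this is around 1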

What we are currently wondering is whether the noise in the estimate is independent of the parameter at which it is evaluated. We can easily investigate this for this filter by looking at how the estimate varies as the first parameter (the prey birth rate) varies. The following code computes a log-likelihood estimate across a range of values and plots the result.

# log-likelihood estimate as a function of th1, with th2 and th3 held fixed
mLLik1=function(x){mLLik(th=c(th1=x,th2=0.005,th3=0.6))}
x=seq(0.7,1.3,length.out=5001)
y=sapply(x,mLLik1)
# plot, dropping degenerate estimates from runs where the particle filter collapsed
plot(x[y>-1e10],y[y>-1e10])

The resulting plot is as follows:

[Figure: log of the marginal likelihood estimate plotted against th1, the prey birth rate]

Looking at the plot, it is clear that the noise variance isn’t constant as the parameter varies – it varies substantially. Furthermore, the way in which it varies is “dangerous”, in that the noise is smallest in the vicinity of the MLE. So, if a parameter close to the MLE is chosen for tuning the number of particles, this will ensure that the noise is small close to the MLE, but not elsewhere in parameter space. This could have bad consequences for the mixing of the MCMC algorithm as it explores the tails of the posterior distribution.

So with the above in mind, how should one tune the number of particles in a pMCMC algorithm? I can’t give a general answer, but I can explain what I do. We can’t rely on theory, so a pragmatic approach is required. The above rule of thumb usually gives a good starting point for exploration. Then I just directly optimise the ESS per CPU second of the pMCMC algorithm over pilot runs with varying numbers of particles (and other tuning parameters in the algorithm). ESS is “effective sample size”, which can be estimated using the effectiveSize() function in the coda CRAN package. Ugly and brutish, but it works…
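
For concreteness, the kind of pilot-run comparison I have in mind can be sketched roughly as follows, where runPmmh(N,iters) is a hypothetical wrapper that runs the PMMH sampler with N particles and returns a matrix of parameter samples from the chain:

require(coda)
# sketch only: ESS per CPU second for a given number of particles N
essPerSec=function(runPmmh,N,iters=5000)
{
    time=system.time(out<-runPmmh(N,iters))["elapsed"]
    min(effectiveSize(mcmc(out)))/time  # worst-case ESS per CPU second
}
# e.g. sapply(c(50,100,200,400),function(N) essPerSec(runPmmh,N))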


11 thoughts on “Tuning particle MCMC algorithms”

  1. I do something even more rough: for a fixed location, I plot the variance of the marginal likelihood (Y axis) against the number of particles and I check where the curve levels off.

    To me the above slice of the marginal likelihood suggests that using an ML or MAP estimator in this context might be preferable to a full MCMC analysis. Even more so if the computational budget is limited… any thoughts?

    Matteo

  2. The problem is that the number of particles/samples is assumed constant across the parameter theta. If the variance of the log-likelihood estimate is easy to compute, then the optimal number of particles can be selected as a function of the parameter theta so that the variance of the log-likelihood estimate equals 1. See http://arxiv.org/abs/1309.3339 for a detailed discussion.

  3. Hi Darren,

    We do agree with you that the variance of the log-likelihood estimator is not constant across the parameter space. However, in practice the issue is whether the variance is approximately constant under the posterior for theta. As T, the number of observations, increases, the posterior will typically concentrate around a central value at the rate 1/sqrt(T). Therefore, under reasonable assumptions, the variance (which is a function of theta) will itself become more concentrated around a central value of choice.

    Results formalizing this intuition are given in the previous version of arXiv:1210.1871; see Lemma 4 in Section 6 and the associated figures. (This section has been suppressed in the third version for space reasons.) Your graph does not reflect this point, as you are only displaying the log-likelihood estimate over a grid of values of theta, rather than over posterior samples of theta. You are considering a log-likelihood range of about 300, whereas to reflect posterior support one would normally consider a range of 3 to 4 (under a reasonable prior). Just by looking at your graph, it is clear that this would substantially reduce the range of relevant theta, and consequently the variability of the associated variance across this interval.

    You can find at
    http://www.stats.ox.ac.uk/~doucet/ExaminingAssumption.pdf histograms of the log-likelihood estimator error evaluated at the posterior mean. We also display histograms of the log-likelihood estimator error over the posterior for the parameter. We do this for various values of T (number of data samples) and number of particles N. For T small, the posterior over the parameters is fairly diffuse and neither our constant assumption for the variance nor the normality distributional assumption for the error hold.
    As T increases, the distribution of the log-likelihood estimate evaluated at the posterior mean and the distribution of the log-likelihood estimator error under the posterior are essentially indistinguishable and correspond very closely to the postulated normal distribution as expected.
    We present these results for a simple AR(1) Gaussian process observed in Gaussian noise but also for a complex continuous-time two-factor stochastic volatility model applied to real data.

    Hence we believe that for moderate to large data sets one can rely on the theory, and that the guidelines we provide are useful.

    Cordially,

    Arnaud Doucet & Mike Pitt

    1. Hi guys,

      Thanks for your considered (and polite!) response. Your intuition about the large T limit may well be correct, but I’m often interested in small T. There I’m not totally convinced by results based on 100 samples from the posterior. I’m sure we can all agree that there is less of a problem if the posterior has compact support. My concern is that the variance of the estimator increases (and is unbounded?) as the MCMC chain takes an excursion into the tails of the posterior. For the problem in my post, the corresponding posterior is shown in Figure 10.5 of my book, and the summary stats for it are shown on page 298. Near the posterior mode, at around 0.95, the variance of the log-likelihood estimator is roughly 1. However, at 0.84 (around the smallest value my MCMC sampler visited), the variance is around 17. This may be a slightly less dramatic difference than a naive look at my plot would suggest, but it’s non-trivial. It certainly makes a difference whether you tune the number of particles at 0.84 or 0.95.

      Cheers,

      Darren

      > x=rep(0.84,1000)
      > y=sapply(x,mLLik1)
      > var(y)
      [1] 16.96225
      > x=rep(0.95,1000)
      > y=sapply(x,mLLik1)
      > var(y)
      [1] 0.9849833

      1. Hi Darren,
        You are right in that it does clearly make a large difference (16-fold) whether you choose N based upon the posterior mean or the tail value. However, if you choose based upon the posterior mean, then it is clear (from your graph) that a random walk (RW) proposal will almost always accept moves from the tails towards the posterior.

        This can be seen because pi(theta')/pi(theta), with theta' being the proposed value near 0.95 and theta being out in the tails (say, 0.84), will be very large relative to exp(z'-z), where z is the log-likelihood error. So the performance (stickiness) of the scheme will be governed by what is happening near the mean of theta.

        If you were to implement an RW particle scheme for your model and plot the IACT against sigma (the SD of the log-likelihood estimator evaluated at the posterior mean, 0.95) for varying N, then I would imagine this would be minimised around the values suggested in the papers.
        Best wishes,
        Mike Pitt

      2. Yes, I agree – the PMMH sampler for this filter is fine. I generally find that tuning the variance of the noise at the posterior mean is OK, as the overall performance (ESS/sec) is relatively insensitive to the precise value of N in a reasonable vicinity of the optimum. But knowing that the noise can vary substantially, I always like to check…
        Cheers,

  4. For some models it is possible, as Minh Ngoc indicates, to adaptively alter N with the proposed value of theta to try to keep sigma constant. If each section of the likelihood, for individuals t=1,…,T, can be estimated via importance sampling, then a closed-form estimator of the variance can easily be implemented simply from the weights used in the estimator itself. For particle filters, of course, the situation is more complicated and any solution may be more expensive than just keeping N fixed…
    Cheers, Mike

  5. Dear Darren,

    I have not yet really done anything with particle filters, but I am clearly interested in having a try. Right now, my (modest) objective is to get a qualitative idea of how the approach scales with the size of the problem (T). So it is not exactly the subject of your post, but it is still a related issue.

    Is it correct to say that, under relatively general conditions, the complexity of a particle MH algorithm will typically scale as O(T^2)? That is, the variance of the estimator of the likelihood ratio, or equivalently the variance of the estimator of the log-likelihood, will be of order T/N, where N is the number of particles, and this is more or less what you want to control in order to keep the acceptance rate afloat. Thus, the number of particles should increase linearly with T, so that the resulting MCMC is O(NT) = O(T^2).

    This seems to be suggested by some of the articles you mention, but I am not yet sufficiently comfortable with the whole thing to be sure that I understand correctly.

    1. Yes, the folklore (supported by theory) is that the number of particles in your filter should scale linearly with the number of time-points, at least when being used inside pMCMC algorithms.
