likelihood – Darren Wilkinson's blog

Tuning particle MCMC algorithms

Several papers have appeared recently discussing the issue of how to tune the number of particles used in the particle filter within a particle MCMC algorithm such as particle marginal Metropolis Hastings (PMMH). Three such papers are:

Doucet, Arnaud, Michael Pitt, and Robert Kohn. Efficient implementation of Markov chain Monte Carlo when using an unbiased likelihood estimator. arXiv preprint arXiv:1210.1871 (2012).
Pitt, Michael K., Ralph dos Santos Silva, Paolo Giordani, and Robert Kohn. On some properties of Markov chain Monte Carlo simulation methods based on the particle filter. Journal of Econometrics 171, no. 2 (2012): 134-151.
Sherlock, Chris, Alexandre H. Thiery, Gareth O. Roberts, and Jeffrey S. Rosenthal. On the efficiency of pseudo-marginal random walk Metropolis algorithms. arXiv preprint arXiv:1309.7209 (2013).

I have discussed psuedo marginal MCMC and particle MCMC algorithms in previous posts. It will be useful to refer back to these posts if these topics are unfamiliar. Within particle MCMC algorithms (and psuedo-marginal MCMC algorithms, more generally), an unbiased estimate of marginal likelihood is constructed using a number of particles. The more particles that are used, the better the estimate of marginal likelihood is, and the resulting MCMC algorithm will behave more like a “real” marginal MCMC algorithm. For a small number of particles, the algorithm will still have exactly the correct target, but the noise in the unbiased estimator of marginal likelihood will lead to poor mixing of the MCMC chain. The idea is to use just enough particles to ensure that there isn’t “too much” noise in the unbiased estimator, but not to waste lots of time producing a super-accurate estimate of marginal likelihood if that isn’t necessary to ensure good mixing of the MCMC chain.

The papers above try to give theoretical justifications for certain “rules of thumb” that are commonly used in practice. One widely adopted scheme is to tune the number of particles so that the variance of the log of the estimate of marginal liklihood is around one. The obvious questions are “where?” and “why?”, and these questions turn out to be connected. As we will see, there isn’t really a good answer to the “where?” question, but what people usually do is use a pilot run to get an estimate of the posterior mean, or mode, or MLE, and then pick one and tune the noise variance at that particular parameter value. As to “why?”, well, the papers above make various (slightly different) assumptions, all of which lead to trading off mixing against computation time to obtain an “optimal” number of particles. They don’t all agree that the variance of the noise should be exactly 1, but they all agree to an order of magnitude.

All of the above papers make the assumption that the noise distribution associated with the marginal likelihood estimate is independent of the parameter at which it is being evaluated, which explains why there isn’t a really good answer to the “where?” question – under the assumption it doesn’t matter what parameter value is used for tuning – they are all the same! Easy. Except that’s quite a big assumption, so it would be nice to know that it is reasonable, and unfortunately it isn’t. Let’s look at an example to see what goes wrong.

Example

In Chapter 10 of my book I look in detail at constructing a PMMH algorithm for inferring the parameters of a discretely observed stochastic Lotka-Volterra model. I’ve stepped through the computational details in a previous post which you should refer back to for the necessary background. Following that post, we can construct a particle filter to return an unbiased estimate of marginal likelihood using the following R code (which relies on the smfsb CRAN package):

require(smfsb)
# data
data(LVdata)
data=as.timedData(LVnoise10)
noiseSD=10
# measurement error model
dataLik <- function(x,t,y,log=TRUE,...)
{
    ll=sum(dnorm(y,x,noiseSD,log=TRUE))
    if (log)
        return(ll)
    else
        return(exp(ll))
}
# now define a sampler for the prior on the initial state
simx0 <- function(N,t0,...)
{
    mat=cbind(rpois(N,50),rpois(N,100))
    colnames(mat)=c("x1","x2")
    mat
}
# construct particle filter
mLLik=pfMLLik(150,simx0,0,stepLVc,dataLik,data)

Again, see the relevant previous post for details. So now mLLik() is a function that will return the log of an unbiased estimate of marginal likelihood (based on 150 particles) given a parameter value at which to evaluate.

What we are currently wondering is whether the noise in the estimate is independent of the parameter at which it is evaluated. We can investigate this for this filter easily by looking at how the estimate varies as the first parameter (prey birth rate) varies. The following code computes a log likelihood estimate across a range of values and plots the result.

mLLik1=function(x){mLLik(th=c(th1=x,th2=0.005,th3=0.6))}
x=seq(0.7,1.3,length.out=5001)
y=sapply(x,mLLik1)
plot(x[y>-1e10],y[y>-1e10])

The resulting plot is as follows (click for full size):

So, looking at the plot, it is very clear that the noise variance certainly isn’t constant as the parameter varies – it varies substantially. Furthermore, the way in which it varies is “dangerous”, in that the noise is smallest in the vicinity of the MLE. So, if a parameter close to the MLE is chosen for tuning the number of particles, this will ensure that the noise is small close to the MLE, but not elsewhere in parameter space. This could have bad consequences for the mixing of the MCMC algorithm as it explores the tails of the posterior distribution.

So with the above in mind, how should one tune the number of particles in a pMCMC algorithm? I can’t give a general answer, but I can explain what I do. We can’t rely on theory, so a pragmatic approach is required. The above rule of thumb usually gives a good starting point for exploration. Then I just directly optimise ESS per CPU second of the pMCMC algorithm from pilot runs for varying numbers of particles (and other tuning parameters in the algorithm). ESS is “expected sample size”, which can be estimated using the effectiveSize() function in the coda CRAN package. Ugly and brutish, but it works…

Marginal likelihood from tempered Bayesian posteriors

Introduction

In the previous post I showed that it is possible to couple parallel tempered MCMC chains in order to improve mixing. Such methods can be used when the target of interest is a Bayesian posterior distribution that is difficult to sample. There are (at least) a couple of obvious ways that one can temper a Bayesian posterior distribution. Perhaps the most obvious way is a simple flattening, so that if

$\pi(\theta|y) \propto \pi(\theta)\pi(y|\theta)$

is the posterior distribution, then for $t\in [0,1]$ we define

$\pi_t(\theta|y) \propto \pi(\theta|y)^t \propto [ \pi(\theta)\pi(y|\theta) ]^t.$

This corresponds with the tempering that is often used in statistical physics applications. We recover the posterior of interest for $t=1$ and tend to a flat distribution as $t\longrightarrow 0$ . However, for Bayesian posterior distributions, there is a different way of tempering that is often more natural and useful, and that is to temper using the power posterior, defined by

$\pi_t(\theta|y) \propto \pi(\theta)\pi(y|\theta)^t.$

Here we again recover the posterior for $t=1$ , but get the prior for $t=0$ . Thus, the family of distributions forms a natural bridge or path from the prior to the posterior distributions. The power posterior is a special case of the more general concept of a geometric path from distribution $f(\theta)$ (at $t=0$ ) to $g(\theta)$ (at $t=1$ ) defined by

$h_t(\theta) \propto f(\theta)^{1-t}g(\theta)^t,$

where, in our case, $f(\cdot)$ is the prior and $g(\cdot)$ is the posterior.

So, given a posterior distribution that is difficult to sample, choose a temperature schedule

$0=t_0<t_1<\cdots<t_{N-1}<t_N=1$

and run a parallel tempering scheme as outlined in the previous post. The idea is that for small values of $t$ mixing will be good, as prior-like distributions are usually well-behaved, and the mixing of these "high temperature" chains can help to improve the mixing of the "low temperature" chains that are more like the posterior (note that $t$ is really an inverse temperature parameter the way I’ve defined it here…).

Marginal likelihood and normalising constants

The marginal likelihood of a Bayesian model is

$\pi(y) = \int_\Theta \pi(\theta)\pi(y|\theta)d\theta.$

This quantity is of interest for many reasons, including calculation of the Bayes factor between two competing models. Note that this quantity has several different names in different fields. In particular, it is often known as the evidence, due to its role in Bayes factors. It is also worth noting that it is the normalising constant of the Bayesian posterior distribution. Although it is very easy to describe and define, it is notoriously difficult to compute reliably for complex models.

The normalising constant is conceptually very easy to estimate. From the above integral representation, it is clear that

$\pi(y) = E_\pi [ \pi(y|\theta) ]$

where the expectation is taken with respect to the prior. So, given samples from the prior, $\theta_1,\theta_2,\ldots,\theta_n$ , we can construct the Monte Carlo estimate

$\displaystyle \widehat{\pi}(y) = \frac{1}{n}\sum_{i=1}^n \pi(y|\theta_i)$

and this will be a consistent estimator of the true evidence under fairly mild regularity conditions. Unfortunately, in practice it is likely to be a very poor estimator if the posterior and prior are not very similar. Now, we could also use Bayes theorem to re-write the integral as an expectation with respect to the posterior, so we could then use samples from the posterior to estimate the evidence. This leads to the harmonic mean estimator of the evidence, which has been described as the worst Monte Carlo method ever! Now it turns out that there are many different ways one can construct estimators of the evidence using samples from the prior and the posterior, some of which are considerably better than the two I’ve outlined. This is the subject of the bridge sampling paper of Meng and Wong. However, the reality is that no method will work well if the prior and posterior are very different.

If we have tempered chains, then we have a sequence of chains targeting distributions which, by construction, are not too different, and so we can use the output from tempered chains in order to construct estimators of the evidence that are more numerically stable. If we call the evidence of the $i$ th chain $z_i$ , so that $z_0=1$ and $z_N=\pi(y)$ , then we can write the evidence in telescoping fashion as

$\displaystyle \pi(y)=z_N = \frac{z_N}{z_0} = \frac{z_1}{z_0}\times \frac{z_2}{z_1}\times \cdots \times \frac{z_N}{z_{N-1}}.$

Now the $i$ th term in this product is $z_{i+1}/z_{i}$ , which can be estimated using the output from the $i$ th and/or $(i+1)$ th chain(s). Again, this can be done in a variety of ways, using your favourite bridge sampling estimator, but the point is that the estimator should be reasonably good due to the fact that the $i$ th and $(i+1)$ th targets are very similar. For the power posterior, the simplest method is to write

$\displaystyle \frac{z_{i+1}}{z_i} = \frac{\displaystyle \int \pi(\theta)\pi(y|\theta)^{t_{i+1}}d\theta}{z_i} = \int \pi(y|\theta)^{t_{i+1}-t_i}\times \frac{\pi(y|\theta)^{t_i}\pi(\theta)}{z_i}d\theta$

$\displaystyle \mbox{}\qquad = E_i\left[\pi(y|\theta)^{t_{i+1}-t_i}\right],$

where the expectation is with respect to the $i$ th target, and hence can be estimated in the usual way using samples from the $i$ th chain.

For numerical stability, in practice we compute the log of the evidence as

$\displaystyle \log\pi(y) = \sum_{i=0}^{N-1} \log\frac{z_{i+1}}{z_i} = \sum_{i=0}^{N-1} \log E_i\left[\pi(y|\theta)^{t_{i+1}-t_i}\right]$

$\displaystyle = \sum_{i=0}^{N-1} \log E_i\left[\exp\{(t_{i+1}-t_i)\log\pi(y|\theta)\}\right].\qquad(\dagger)$

The above expression is exact, and is the obvious formula to use for computation. However, it is clear that if $t_i$ and $t_{i+1}$ are sufficiently close, it will be approximately OK to switch the expectation and exponential, giving

$\displaystyle \log\pi(y) \approx \sum_{i=0}^{N-1}(t_{i+1}-t_i)E_i\left[\log\pi(y|\theta)\right].$

In the continuous limit, this gives rise to the well-known path sampling identity,

$\displaystyle \log\pi(y) = \int_0^1 E_t\left[\log\pi(y|\theta)\right]dt.$

So, an alternative approach to computing the evidence is to use the samples to approximately numerically integrate the above integral, say, using the trapezium rule. However, it isn’t completely clear (to me) that this is better than using $(\dagger)$ directly, since there there is no numerical integration error to worry about.

Numerical illustration

We can illustrate these ideas using the simple double potential well example from the previous post. Now that example doesn’t really correspond to a Bayesian posterior, and is tempered directly, rather than as a power posterior, but essentially the same ideas follow for general parallel tempered distributions. In general, we can use the sample to estimate the ratio of the last and first normalising constants, $z_N/z_0$ . Here it isn’t obvious why we’d want to know that, but we’ll compute it anyway to illustrate the method. As before, we expand as a telescopic product, where the $i$ th term is now

$\displaystyle \frac{z_{i+1}}{z_i} = E_i\left[\exp\{-(\gamma_{i+1}-\gamma_i)(x^2-1)^2\}\right].$

A Monte Carlo estimate of each of these terms is formed using the samples from the $i$ th chain, and the logs of these are then summed to give $\log(z_N/z_0)$ . A complete R script to run the Metropolis coupled sampler and compute the evidence is given below.

U=function(gam,x)
{
  gam*(x*x-1)*(x*x-1)
}

temps=2^(0:3)
iters=1e5

chains=function(pot=U, tune=0.1, init=1)
{
  x=rep(init,length(temps))
  xmat=matrix(0,iters,length(temps))
  for (i in 1:iters) {
    can=x+rnorm(length(temps),0,tune)
    logA=unlist(Map(pot,temps,x))-unlist(Map(pot,temps,can))
    accept=(log(runif(length(temps)))<logA)
    x[accept]=can[accept]
    swap=sample(1:length(temps),2)
    logA=pot(temps[swap[1]],x[swap[1]])+pot(temps[swap[2]],x[swap[2]])-
            pot(temps[swap[1]],x[swap[2]])-pot(temps[swap[2]],x[swap[1]])
    if (log(runif(1))<logA)
      x[swap]=rev(x[swap])
    xmat[i,]=x
  }
  colnames(xmat)=paste("gamma=",temps,sep="")
  xmat
}

mat=chains()
mat=mat[,1:(length(temps)-1)]
diffs=diff(temps)
mat=(mat*mat-1)^2
mat=-t(diffs*t(mat))
mat=exp(mat)
logEvidence=sum(log(colMeans(mat)))
message(paste("The log of the ratio of the last and first normalising constants is",logEvidence))

It turns out that these double well potential densities are tractable, and so the normalising constants can be computed exactly. So, with a little help from Wolfram Alpha, I compute log of the ratio of the last and first normalising constants to be approximately -1.12. Hopefully the above script will output something a bit like that…

References

Meng, Xiao-Li, and Wing Hung Wong. “Simulating ratios of normalizing constants via a simple identity: a theoretical exploration.” Statistica Sinica 6.4 (1996): 831-860.
Gelman, Andrew, and Xiao-Li Meng. “Simulating normalizing constants: From importance sampling to bridge sampling to path sampling.” Statistical Science (1998): 163-185.
Friel, Nial, and Anthony N. Pettitt. “Marginal likelihood estimation via power posteriors.” Journal of the Royal Statistical Society: Series B (Statistical Methodology) 70.3 (2008): 589-607.
Friel, Nial, and Jason Wyse. “Estimating the evidence–a review.” Statistica Neerlandica 66.3 (2012): 288-308.

Particle filtering and pMCMC using R

In the previous post I gave a quick introduction to the CRAN R package smfsb, and how it can be used for simulation of Markov processes determined by stochastic kinetic networks. In this post I’ll show how to use data and particle MCMC techniques in order to carry out Bayesian inference for the parameters of partially observed Markov processes.

The simulation model and the data

For this post we will assume that the smfsb package is installed and loaded (see previous post for details). The package includes the function stepLVc which simulates from the Markov transition kernel of a Lotka-Volterra process by calling out to some native C code for speed. So, for example,

stepLVc(c(x1=50,x2=100),0,1)

will simulate the state of the process at time 1 given an initial condition of 50 prey and 100 predators at time 0, using the default rate parameters of the function, th = c(1, 0.005, 0.6). The package also includes some data simulated from this model using these parameters, with and without added noise. The datasets can be loaded with

data(LVdata)

For simplicity, we will just make use of the dataset LVnoise10 in this post. This dataset is a multivariate time series consisting of 16 equally spaced observations on both prey and predators subject to Gaussian measurement error with a standard deviation of 10. We can plot the data with

plot(LVnoise10,plot.type="single",col=c(2,4))

giving:

The Bayesian inference problem is to see how much we are able to learn about the parameters which generated the data using only the data and our knowledge of the structure of the problem. There are many approaches one can take to this problem, but most are computationally intensive, due to the analytical intractability of the transition kernel of the LV process. Here we will follow Wilkinson (2011) and take a particle MCMC (pMCMC) approach, and specifically, use a pseudo-marginal “exact approximate” MCMC algorithm based on the particle marginal Metropolis-Hastings (PMMH) algorithm. I have discussed the pseudo-marginal approach, using particle filters for marginal likelihood estimation, and the PMMH algorithm in previous posts, so if you have been following my posts for a while, this should all make perfect sense…

Particle filter

One of the key ingredients required to implement the pseudo-marginal MCMC scheme is a (bootstrap) particle filter which generates an unbiased estimate of the marginal likelihood of the data given the parameters (integrated over the unobserved state trajectory). The algorithm was discussed in this post, and R code to implement this is included in the smfsb R package as pfMLLik. For reasons of numerical stability, the function computes and returns the log of the marginal likelihood, but it is important to understand that it is the actually likelihood estimate that is unbiased for the true likelihood, and not the corresponding statement for the logs. The actual code of the function is relatively short, and for completeness is given below:

pfMLLik <- function (n, simx0, t0, stepFun, dataLik, data) 
{
    times = c(t0, as.numeric(rownames(data)))
    deltas = diff(times)
    return(function(...) {
        xmat = simx0(n, t0, ...)
        ll = 0
        for (i in 1:length(deltas)) {
            xmat = t(apply(xmat, 1, stepFun, t0 = times[i], deltat = deltas[i], ...))
            w = apply(xmat, 1, dataLik, t = times[i + 1], y = data[i,], log = FALSE, ...)
            if (max(w) < 1e-20) {
                warning("Particle filter bombed")
                return(-1e+99)
            }
            ll = ll + log(mean(w))
            rows = sample(1:n, n, replace = TRUE, prob = w)
            xmat = xmat[rows, ]
        }
        ll
    })
}

We need to set up the prior and the data likelihood correctly before we can use this function, but first note that the function does not actually run a particle filter at all, but instead stores everything it needs to know to run the particle filter in the local environment, and then returns a function closure for evaluating the marginal likelihood at a given set of parameters. The resulting function (closure) can then be used to run a particle filter for a given set of parameters, simply by passing the required parameters into the function. This functional programming style is consistent with that used throughout the smfsb R package, and leads to quite simple, modular code. To use pfMLLik, we first need to define a function which evaluates the log-likelihood of an observation conditional on the true state, and another which samples from the prior distribution of the initial state of the system. Here, we can do that as follows.

# set up data likelihood
noiseSD=10
dataLik <- function(x,t,y,log=TRUE,...)
{
	ll=sum(dnorm(y,x,noiseSD,log=TRUE))
	if (log)
		return(ll)
	else
		return(exp(ll))
}
# now define a sampler for the prior on the initial state
simx0 <- function(N,t0,...)
{
	mat=cbind(rpois(N,50),rpois(N,100))
	colnames(mat)=c("x1","x2")
	mat
}
# convert the time series to a timed data matrix
LVdata=as.timedData(LVnoise10)
# create marginal log-likelihood functions, based on a particle filter
mLLik=pfMLLik(100,simx0,0,stepLVc,dataLik,LVdata)

Now the function (closure) mLLik will, for a given parameter vector, run a particle filter (using 100 particles) and return the log of the particle filter’s unbiased estimate of the marginal likelihood of the data. It is then very easy to use this function to create a simple PMMH algorithm for parameter inference.

PMMH algorithm

Below is an algorithm based on flat priors and a simple Metropolis-Hastings update for the parameters using the function closure mLLik, defined above.

iters=1000
tune=0.01
thin=10
th=c(th1 = 1, th2 = 0.005, th3 = 0.6)
p=length(th)
ll=-1e99
thmat=matrix(0,nrow=iters,ncol=p)
colnames(thmat)=names(th)
# Main pMCMC loop
for (i in 1:iters) {
	message(paste(i,""),appendLF=FALSE)
	for (j in 1:thin) {
		thprop=th*exp(rnorm(p,0,tune))
		llprop=mLLik(thprop)
		if (log(runif(1)) < llprop - ll) {
			th=thprop
			ll=llprop
			}
		}
	thmat[i,]=th
	}
message("Done!")
# Compute and plot some basic summaries
mcmcSummary(thmat)

This will take a little while to run, but in the end should give a plot something like the following (click for full size):

So, although we should really run the chain for a bit longer, we see that we can learn a great deal about the parameters of the process from very little data. For completeness, a full runnable demo script is included below the references. Of course there are many obvious extensions of this basic problem, such as partial observation (eg. only observing the prey) and unknown measurement error. These are discussed in Wilkinson (2011), and code for these cases is included within the demo(PMCMC), which should be inspected for further details.

Discussion

At this point it is probably worth emphasising that there are other “likelihood free” approaches which can be taken to parameter inference for partially observed Markov process (POMP) models. Many of these are implemented in the pomp R package, also available from CRAN, by King et al (2008). The pomp package is well documented, and has a couple of good tutorial vignettes which should be sufficient to get people started. The API of the package is rather cumbersome, but the algorithms appear to be quite robust. Approximate Bayesian computation (ABC) approaches are also quite natural for POMP models (see, for example, Toni et al (2009)). This is because “exact” likelihood free procedures break down in the case of low/no measurement error or high-dimensional observations. There are some R packages for ABC, but I am not sufficiently familiar with them to be able to give recommendations.

If one is able to move away from the “likelihood free” paradigm, it is possible to develop “exact” pMCMC algorithms which do not break down in challenging observation scenarios. The problem here is the intractability of the Markov transition kernel. In the case of nonlinear Markov jump processes, finding very generic solutions seems quite difficult, but for diffusion (approximation) processes based on stochastic differential equations, it seems to be possible to develop irreducible pMCMC algorithms which have very broad applicability – see Golightly and Wilkinson (2011) for further details of how such techniques can be used in the context of stochastic kinetic models similar to those considered in this post.

References

Golightly, A., Wilkinson, D. J. (2011) Bayesian parameter inference for stochastic biochemical network models using particle MCMC, Interface Focus, 1(6):807-820.

King, A.A., Ionides, E.L., & Breto, C.M. (2008) pomp: Statistical inference for partially observed Markov processes, CRAN.

Toni, T., Welch, D., Strelkowa, N., Ipsen, A. & Stumpf, M. (2009) Approximate Bayesian computation scheme for parameter inference and model selection in dynamical systems, J. R. Soc. Interface 6(31): 187-202.

Wilkinson, D. J. (2011) Stochastic Modelling for Systems Biology, second edition, Boca Raton, Florida: Chapman & Hall/CRC Press.

Demo script

require(smfsb)
data(LVdata)

# set up data likelihood
noiseSD=10
dataLik <- function(x,t,y,log=TRUE,...)
{
	ll=sum(dnorm(y,x,noiseSD,log=TRUE))
	if (log)
		return(ll)
	else
		return(exp(ll))
}
# now define a sampler for the prior on the initial state
simx0 <- function(N,t0,...)
{
	mat=cbind(rpois(N,50),rpois(N,100))
	colnames(mat)=c("x1","x2")
	mat
}
# convert the time series to a timed data matrix
LVdata=as.timedData(LVnoise10)
# create marginal log-likelihood functions, based on a particle filter
mLLik=pfMLLik(100,simx0,0,stepLVc,dataLik,LVdata)

iters=1000
tune=0.01
thin=10
th=c(th1 = 1, th2 = 0.005, th3 = 0.6)
p=length(th)
ll=-1e99
thmat=matrix(0,nrow=iters,ncol=p)
colnames(thmat)=names(th)
# Main pMCMC loop
for (i in 1:iters) {
	message(paste(i,""),appendLF=FALSE)
	for (j in 1:thin) {
		thprop=th*exp(rnorm(p,0,tune))
		llprop=mLLik(thprop)
		if (log(runif(1)) < llprop - ll) {
			th=thprop
			ll=llprop
			}
		}
	thmat[i,]=th
	}
message("Done!")
# Compute and plot some basic summaries
mcmcSummary(thmat)

The particle marginal Metropolis-Hastings (PMMH) particle MCMC algorithm

Introduction

In the previous post I explained how one can use an unbiased estimate of marginal likelihood derived from a particle filter within a Metropolis-Hastings MCMC algorithm in order to construct an exact pseudo-marginal MCMC scheme for the posterior distribution of the model parameters given some time course data. This idea is closely related to that of the particle marginal Metropolis-Hastings (PMMH) algorithm of Andreiu et al (2010), but not really exactly the same. This is because for a Bayesian model with parameters $\theta$ , latent variables $x$ and data $y$ , of the form

$\displaystyle p(\theta,x,y) = p(\theta)p(x|\theta)p(y|x,\theta),$

the pseudo-marginal algorithm which exploits the fact that the particle filter’s estimate of likelihood is unbiased is an MCMC algorithm which directly targets the marginal posterior distribution $p(\theta|y)$ . On the other hand, the PMMH algorithm is an MCMC algorithm which targets the full joint posterior distribution $p(\theta,x|y)$ . Now, the PMMH scheme does reduce to the pseudo-marginal scheme if samples of $x$ are not generated and stored in the state of the Markov chain, and it certainly is the case that the pseudo-marginal algorithm gives some insight into why the PMMH algorithm works. However, the PMMH algorithm is much more powerful, as it solves the “smoothing” and parameter estimation problem simultaneously and exactly, including the “initial value” problem (computing the posterior distribution of the initial state, $x_0$ ). Below I will describe the algorithm and explain why it works, but first it is necessary to understand the relationship between marginal, joint and “likelihood-free” MCMC updating schemes for such latent variable models.

MCMC for latent variable models

Marginal approach

If we want to target $p(\theta|y)$ directly, we can use a Metropolis-Hastings scheme with a fairly arbitrary proposal distribution for exploring $\theta$ , where a new $\theta^\star$ is proposed from $f(\theta^\star|\theta)$ and accepted with probability $\min\{1,A\}$ , where

$\displaystyle A = \frac{p(\theta^\star)}{p(\theta)} \times \frac{f(\theta|\theta^\star)}{f(\theta^\star|\theta)} \times \frac{p({y}|\theta^\star)}{p({y}|\theta)}.$

As previously discussed, the problem with this scheme is that the marginal likelihood $p(y|\theta)$ required in the acceptance ratio is often difficult to compute.

Likelihood-free MCMC

A simple “likelihood-free” scheme targets the full joint posterior distribution $p(\theta,x|y)$ . It works by exploiting the fact that we can often simulate from the model for the latent variables $p(x|\theta)$ even when we can’t evaluate it, or marginalise $x$ out of the problem. Here the Metropolis-Hastings proposal is constructed in two stages. First, a proposed new $\theta^\star$ is sampled from $f(\theta^\star|\theta)$ and then a corresponding $x^\star$ is simulated from the model $p(x^\star|\theta^\star)$ . The pair $(\theta^\star,x^\star)$ is then jointly accepted with ratio

$\displaystyle A = \frac{p(\theta^\star)}{p(\theta)} \times \frac{f(\theta|\theta^\star)}{f(\theta^\star|\theta)} \times \frac{p(y|{x}^\star,\theta^\star)}{p(y|{x},\theta)}.$

The proposal mechanism ensures that the proposed $x^\star$ is consistent with the proposed $\theta^\star$ , and so the procedure can work provided that the dimension of the data $y$ is low. However, in order to work well more generally, we would want the proposed latent variables to be consistent with the data as well as the model parameters.

Ideal joint update

Motivated by the likelihood-free scheme, we would really like to target the joint posterior $p(\theta,x|y)$ by first proposing $\theta^\star$ from $f(\theta^\star|\theta)$ and then a corresponding $x^\star$ from the conditional distribution $p(x^\star|\theta^\star,y)$ . The pair $(\theta^\star,x^\star)$ is then jointly accepted with ratio

$\displaystyle A = \frac{p(\theta^\star)}{p(\theta)} \frac{p({x}^\star|\theta^\star)}{p({x}|\theta)} \frac{f(\theta|\theta^\star)}{f(\theta^\star|\theta)} \frac{p(y|{x}^\star,\theta^\star)}{p(y|{x},\theta)} \frac{p({x}|y,\theta)}{p({x}^\star|y,\theta^\star)}\\ \qquad = \frac{p(\theta^\star)}{p(\theta)} \frac{p(y|\theta^\star)}{p(y|\theta)} \frac{f(\theta|\theta^\star)}{f(\theta^\star|\theta)}.$

Notice how the acceptance ratio simplifies, using the basic marginal likelihood identity (BMI) of Chib (1995), and $x$ drops out of the ratio completely in order to give exactly the ratio used for the marginal updating scheme. Thus, the “ideal” joint updating scheme reduces to the marginal updating scheme if $x$ is not sampled and stored as a component of the Markov chain.

Understanding the relationship between these schemes is useful for understanding the PMMH algorithm. Indeed, we will see that the “ideal” joint updating scheme (and the marginal scheme) corresponds to PMMH using infinitely many particles in the particle filter, and that the likelihood-free scheme corresponds to PMMH using exactly one particle in the particle filter. For an intermediate number of particles, the PMMH scheme is a compromise between the “ideal” scheme and the “blind” likelihood-free scheme, but is always likelihood-free (when used with a bootstrap particle filter) and always has an acceptance ratio leaving the exact posterior invariant.

The PMMH algorithm

The algorithm

The PMMH algorithm is an MCMC algorithm for state space models jointly updating $\theta$ and $x_{0:T}$ , as the algorithms above. First, a proposed new $\theta^\star$ is generated from a proposal $f(\theta^\star|\theta)$ , and then a corresponding $x_{0:T}^\star$ is generated by running a bootstrap particle filter (as described in the previous post, and below) using the proposed new model parameters, $\theta^\star$ , and selecting a single trajectory by sampling once from the final set of particles using the final set of weights. This proposed pair $(\theta^\star,x_{0:T}^\star)$ is accepted using the Metropolis-Hastings ratio

$\displaystyle A = \frac{\hat{p}_{\theta^\star}(y_{1:T})p(\theta^\star)q(\theta|\theta^\star)}{\hat{p}_{\theta}(y_{1:T})p(\theta)q(\theta^\star|\theta)},$

where $\hat{p}_{\theta^\star}(y_{1:T})$ is the particle filter’s (unbiased) estimate of marginal likelihood, described in the previous post, and below. Note that this approach tends to the perfect joint/marginal updating scheme as the number of particles used in the filter tends to infinity. Note also that for a single particle, the particle filter just blindly forward simulates from $p_\theta(x^\star_{0:T})$ and that the filter’s estimate of marginal likelihood is just the observed data likelihood $p_\theta(y_{1:T}|x^\star_{0:T})$ leading precisely to the simple likelihood-free scheme. To understand for an arbitrary finite number of particles, $M$ , one needs to think carefully about the structure of the particle filter.

Why it works

To understand why PMMH works, it is necessary to think about the joint distribution of all random variables used in the bootstrap particle filter. To this end, it is helpful to re-visit the particle filter, thinking carefully about the resampling and propagation steps.

First introduce notation for the “particle cloud”: $\mathbf{x}_t=\{x_t^k|k=1,\ldots,M\}$ , $\boldsymbol{\pi}_t=\{\pi_t^k|k=1,\ldots,M\}$ , $\tilde{\mathbf{x}}_t=\{(x_t^k,\pi_t^k)|k=1,\ldots,M\}$ . Initialise the particle filter with $\tilde{\mathbf{x}}_0$ , where $x_0^k\sim p(x_0)$ and $\pi_0^k=1/M$ (note that $w_0^k$ is undefined). Now suppose at time $t$ we have a sample from $p(x_t|y_{1:t})$ : $\tilde{\mathbf{x}}_t$ . First resample by sampling $a_t^k \sim \mathcal{F}(a_t^k|\boldsymbol{\pi}_t)$ , $k=1,\ldots,M$ . Here we use $\mathcal{F}(\cdot|\boldsymbol{\pi})$ for the discrete distribution on $1:M$ with probability mass function $\boldsymbol{\pi}$ . Next sample $x_{t+1}^k\sim p(x_{t+1}^k|x_t^{a_t^k})$ . Set $w_{t+1}^k=p(y_{t+1}|x_{t+1}^k)$ and $\pi_{t+1}^k=w_{t+1}^k/\sum_{i=1}^M w_{t+1}^i$ . Finally, propagate $\tilde{\mathbf{x}}_{t+1}$ to the next step… We define the filter’s estimate of likelihood as $\hat{p}(y_t|y_{1:t-1})=\frac{1}{M}\sum_{i=1}^M w_t^i$ and $\hat{p}(y_{1:T})=\prod_{i=1}^T \hat{p}(y_t|y_{1:t-1})$ . See Doucet et al (2001) for further theoretical background on particle filters and SMC more generally.

Describing the filter carefully as above allows us to write down the joint density of all random variables in the filter as

$\displaystyle \tilde{q}(\mathbf{x}_0,\ldots,\mathbf{x}_T,\mathbf{a}_0,\ldots,\mathbf{a}_{T-1}) = \left[\prod_{k=1}^M p(x_0^k)\right] \left[\prod_{t=0}^{T-1} \prod_{k=1}^M \pi_t^{a_t^k} p(x_{t+1}^k|x_t^{a_t^k}) \right]$

For PMMH we also sample a final index $k'$ from $\mathcal{F}(k'|\boldsymbol{\pi}_T)$ giving the joint density

$\displaystyle \tilde{q}(\mathbf{x}_0,\ldots,\mathbf{x}_T,\mathbf{a}_0,\ldots,\mathbf{a}_{T-1})\pi_T^{k'}$

We write the final selected trajectory as

$\displaystyle x_{0:T}^{k'}=(x_0^{b_0^{k'}},\ldots,x_T^{b_T^{k'}}),$

where $b_t^{k'}=a_t^{b_{t+1}^{k'}}$ , and $b_T^{k'}=k'$ . If we now think about the structure of the PMMH algorithm, our proposal on the space of all random variables in the problem is in fact

$\displaystyle f(\theta^\star|\theta)\tilde{q}_{\theta^\star}(\mathbf{x}_0^\star,\ldots,\mathbf{x}_T^\star,\mathbf{a}_0^\star,\ldots,\mathbf{a}_{T-1}^\star)\pi_T^{{k'}^\star}$

and by considering the proposal and the acceptance ratio, it is clear that detailed balance for the chain is satisfied by the target with density proportional to

$\displaystyle p(\theta)\hat{p}_\theta(y_{1:T}) \tilde{q}_\theta(\mathbf{x}_0,\ldots,\mathbf{x}_T,\mathbf{a}_0,\ldots,\mathbf{a}_{T-1}) \pi_T^{k'}$

We want to show that this target marginalises down to the correct posterior $p(\theta,x_{0:T}|y_{1:T})$ when we consider just the parameters and the selected trajectory. But if we consider the terms in the joint distribution of the proposal corresponding to the trajectory selected by $k'$ , this is given by

$\displaystyle p_\theta(x_0^{b_0^{k'}})\left[\prod_{t=0}^{T-1} \pi_t^{b_t^{k'}} p_\theta(x_{t+1}^{b_{t+1}^{k'}}|x_t^{b_t^{k'}})\right]\pi_T^{k'} = p_\theta(x_{0:T}^{k'})\prod_{t=0}^T \pi_t^{b_t^{k'}}$

which, by expanding the $\pi_t^{b_t^{k'}}$ in terms of the unnormalised weights, simplifies to

$\displaystyle \frac{p_\theta(x_{0:T}^{k'})p_\theta(y_{1:T}|x_{0:T}^{k'})}{M^{T+1}\hat{p}_\theta(y_{1:T})}$

It is worth dwelling on this result, as this is the key insight required to understand why the PMMH algorithm works. The whole point is that the terms in the joint density of the proposal corresponding to the selected trajectory exactly represent the required joint distribution modulo a couple of normalising constants, one of which is the particle filter’s estimate of marginal likelihood. Thus, by including $\hat{p}_\theta(y_{1:T})$ in the acceptance ratio, we knock out the normalising constant, allowing all of the other terms in the proposal to be marginalised away. In other words, the target of the chain can be written as proportional to

$\displaystyle \frac{p(\theta)p_\theta(x_{0:T}^{k'},y_{1:T})}{M^{T+1}} \times \text{(Other terms...)}$

The other terms are all probabilities of random variables which do not occur elsewhere in the target, and hence can all be marginalised away to leave the correct posterior

$\displaystyle p(\theta,x_{0:T}|y_{1:T})$

Thus the PMMH algorithm targets the correct posterior for any number of particles, $M$ . Also note the implied uniform distribution on the selected indices in the target.

I will give some code examples in a future post.

MCMC, Monte Carlo likelihood estimation, and the bootstrap particle filter

The pseudo-marginal approach to MCMC for Bayesian inference

In a previous post I described a generalisation of the Metropolis Hastings MCMC algorithm which uses unbiased Monte Carlo estimates of likelihood in the acceptance ratio, but is nevertheless exact, when considered as a pseudo-marginal approach to “exact approximate” MCMC. To be useful in the context of Bayesian inference, we need to be able to compute unbiased estimates of the (marginal) likelihood of the data given some proposed model parameters with any “latent variables” integrated out.

To be more precise, consider a model for data $y$ with parameters $\theta$ of the form $\pi(y|\theta)$ together with a prior on $\theta$ , $\pi(\theta)$ , giving a joint model

$\displaystyle \pi(\theta,y)=\pi(\theta)\pi(y|\theta).$

Suppose now that interest is in the posterior distribution

$\displaystyle \pi(\theta|y) \propto \pi(\theta,y)=\pi(\theta)\pi(y|\theta).$

We can construct a fairly generic (marginal) MCMC scheme for this posterior by first proposing $\theta^\star \sim f(\theta^\star|\theta)$ from some fairly arbitrary proposal distribution and then accepting the value with probability $\min\{1,A\}$ where

$\displaystyle A = \frac{\pi(\theta^\star)}{\pi(\theta)} \frac{f(\theta|\theta^\star)}{f(\theta^\star|\theta)} \frac{\pi(y|\theta^\star)}{\pi(y|\theta)}$

This method is great provided that the (marginal) likelihood of the data $\pi(y|\theta)$ is available to us analytically, but in many (most) interesting models it is not. However, in the previous post I explained why substituting in a Monte Carlo estimate $\hat\pi(y|\theta)$ will still lead to the exact posterior if the estimate is unbiased in the sense that $E[\hat\pi(y|\theta)]=\pi(y|\theta)$ . Consequently, sources of (cheap) unbiased Monte Carlo estimates of (marginal) likelihood are of potential interest in the development of exact MCMC algorithms.

Latent variables and marginalisation

Often the reason that we cannot evaluate $\pi(y|\theta)$ is that there are latent variables in the problem, and the model for the data is conditional on those latent variables. Explicitly, if we denote the latent variables by $x$ , then the joint distribution for the model takes the form

$\displaystyle \pi(\theta,x,y) = \pi(\theta)\pi(x|\theta)\pi(y|x,\theta)$

Now since

$\displaystyle \pi(y|\theta) = \int_X \pi(y|x,\theta)\pi(x|\theta)\,dx$

there is a simple and obvious Monte Carlo strategy for estimating $\pi(y|\theta)$ provided that we can evaluate $\pi(y|x,\theta)$ and simulate realisations from $\pi(x|\theta)$ . That is, simulate values $x_1,x_2,\ldots,x_n$ from $\pi(x|\theta)$ for some suitably large $n$ , and then put

$\displaystyle \hat\pi(y|\theta) = \frac{1}{n}\sum_{i=1}^n \pi(y|x_i,\theta).$

It is clear by the law of large numbers that this estimate will converge to $\pi(y|\theta)$ as $n\rightarrow \infty$ . That is, $\hat\pi(y|\theta)$ is a consistent estimate of $\pi(y|\theta)$ . However, a moment’s thought reveals that this estimate is not only consistent, but also unbiased, since each term in the sum has expectation $\pi(y|\theta)$ . This simple Monte Carlo estimate of likelihood can therefore be substituted into a Metropolis-Hastings acceptance ratio without affecting the (marginal) target distribution of the Markov chain. Note that this estimate of marginal likelihood is sometimes referred to as the Rao-Blackwellised estimate, due to its connection with the Rao-Blackwell theorem.

Importance sampling

Suppose now that we cannot sample values directly from $\pi(x|\theta)$ , but can sample instead from a distribution $\pi'(x|\theta)$ having the same support as $\pi(x|\theta)$ . We can then instead produce an importance sampling estimate for $\pi(y|\theta)$ by noting that

$\displaystyle \pi(y|\theta) = \int_X \pi(y|x,\theta)\frac{\pi(x|\theta)}{\pi'(x|\theta)}\pi'(x|\theta)\,dx.$

Consequently, samples $x_1,x_2,\ldots,x_n$ from $\pi'(x|\theta)$ can be used to construct the estimate

$\displaystyle \hat{\pi}(y|\theta) = \frac{1}{n}\sum_{i=1}^n \pi(y|x_i,\theta) \frac{\pi(x_i|\theta)}{\pi'(x_i|\theta)}$

which again is clearly both consistent and unbiased. This estimate is often written

$\displaystyle \hat{\pi}(y|\theta) = \frac{1}{n}\sum_{i=1}^n \pi(y|x_i,\theta) w_i$
where $w_i=\pi(x_i|\theta)/\pi'(x_i|\theta)$ . The weights, $w_i$ , are known as importance weights.

Importance resampling

An idea closely related to that of importance sampling is that of importance resampling where importance weights are used to resample a sample in order to equalise the weights, often prior to a further round of weighting and resampling. The basic idea is to generate an approximate sample from a target density $\pi(x)$ using values sampled from an auxiliary distribution $\pi'(x)$ , where we now supress any dependence of the distributions on model parameters, $\theta$ .

First generate a sample $x_1,\ldots,x_n$ from $\pi'(x)$ and compute weights $w_i=\pi(x_i)/\pi'(x_i),\ i=1,\ldots,n$ . Then compute normalised weights $\tilde{w}_i=w_i/\sum_{k=1}^n w_k$ . Generate a new sample of size $n$ by sampling $n$ times with replacement from the original sample with the probability of choosing each value determined by its normalised weight.

As an example, consider using a sample from the Cauchy distribution as an auxiliary distribution for approximately sampling standard normal random quantities. We can do this using a few lines of R as follows.

n=1000
xa=rcauchy(n)
w=dnorm(xa)/dcauchy(xa)
x=sample(xa,n,prob=w,replace=TRUE)
hist(x,30)
mean(w)

Note that we don’t actually need to compute the normalised weights, as the sample function will do this for us. Note also that the average weight will be close to one. It should be clear that the expected value of the weights will be exactly 1 when both the target and auxiliary densities are correctly normalised. Also note that the procedure can be used when one or both of the densities are not correctly normalised, since the weights will be normalised prior to sampling anyway. Note that in this case the expected weight will be the (ratio of) normalising constant(s), and so looking at the average weight will give an estimate of the normalising constant.

Note that the importance sampling procedure is approximate. Unlike a technique such as rejection sampling, which leads to samples having exactly the correct distribution, this is not the case here. Indeed, it is clear that in the $n=1$ case, the final sample will be exactly drawn from the auxiliary and not the target. The procedure is asymptotic, in that it improves as the sample size increases, tending to the exact target as $n\rightarrow \infty$ .

We can understand why importance resampling works by first considering the univariate case, using correctly normalised densities. Consider a very large number of particles, $N$ . The proportion of the auxiliary samples falling in a small interval $[x,x+dx)$ will be $\pi'(x)dx$ , corresponding to roughly $N\pi'(x)dx$ particles. The weight for each of those particles will be $w(x)=\pi(x)/\pi'(x)$ , and since the expected weight of a random particle is 1, the sum of all weights will be (roughly) $N$ , leading to normalised weights for the particles near $x$ of $\tilde{w}(x)=\pi(x)/[N\pi'(x)]$ . The combined weight of all particles in $[x,x+dx)$ is therefore $\pi(x)dx$ . Clearly then, when we resample $N$ times we expect to select roughly $N\pi(x)dx$ particles from this interval. This corresponds to a proportion $\pi(x)dt$ , corresponding to a density of $\pi(x)$ in the final sample.

Obviously the above argument is very informal, but can be tightened up into a reasonably rigorous proof for the 1d case without too much effort, and the multivariate extension is also reasonably clear.

The bootstrap particle filter

The bootstrap particle filter is an iterative method for carrying out Bayesian inference for dynamic state space (partially observed Markov process) models, sometimes also known as hidden Markov models (HMMs). Here, an unobserved Markov process, $x_0,x_1,\ldots,x_T$ governed by a transition kernel $p(x_{t+1}|x_t)$ is partially observed via some measurement model $p(y_t|x_t)$ leading to data $y_1,\ldots,y_T$ . The idea is to make inference for the hidden states $x_{0:T}$ given the data $y_{1:T}$ . The method is a very simple application of the importance resampling technique. At each time, $t$ , we assume that we have a (approximate) sample from $p(x_t|y_{1:t})$ and use importance resampling to generate an approximate sample from $p(x_{t+1}|y_{1:t+1})$ .

More precisely, the procedure is initialised with a sample from $x_0^k \sim p(x_0),\ k=1,\ldots,M$ with uniform normalised weights ${w'}_0^k=1/M$ . Then suppose that we have a weighted sample $\{x_t^k,{w'}_t^k|k=1,\ldots,M\}$ from $p(x_t|y_{1:t})$ . First generate an equally weighted sample by resampling with replacement $M$ times to obtain $\{\tilde{x}_t^k|k=1,\ldots,M\}$ (giving an approximate random sample from $p(x_t|y_{1:t})$ ). Note that each sample is independently drawn from $\sum_{i=1}^M {w'}_t^i\delta(x-x_t^i)$ . Next propagate each particle forward according to the Markov process model by sampling $x_{t+1}^k\sim p(x_{t+1}|\tilde{x}_t^k),\ k=1,\ldots,M$ (giving an approximate random sample from $p(x_{t+1}|y_{1:t})$ ). Then for each of the new particles, compute a weight $w_{t+1}^k=p(y_{t+1}|x_{t+1}^k)$ , and then a normalised weight ${w'}_{t+1}^k=w_{t+1}^k/\sum_i w_{t+1}^i$ .

It is clear from our understanding of importance resampling that these weights are appropriate for representing a sample from $p(x_{t+1}|y_{1:t+1})$ , and so the particles and weights can be propagated forward to the next time point. It is also clear that the average weight at each time gives an estimate of the marginal likelihood of the current data point given the data so far. So we define

$\displaystyle \hat{p}(y_t|y_{1:t-1})=\frac{1}{M}\sum_{k=1}^M w_t^k$

and

$\displaystyle \hat{p}(y_{1:T}) = \hat{p}(y_1)\prod_{t=2}^T \hat{p}(y_t|y_{1:t-1}).$

Again, from our understanding of importance resampling, it should be reasonably clear that $\hat{p}(y_{1:T})$ is a consistent estimator of ${p}(y_{1:T})$ . It is much less clear, but nevertheless true that this estimator is also unbiased. The standard reference for this fact is Del Moral (2004), but this is a rather technical monograph. A much more accessible proof (for a very general particle filter) is given in Pitt et al (2011).

It should therefore be clear that if one is interested in developing MCMC algorithms for state space models, one can use a pseudo-marginal MCMC scheme, substituting in $\hat{p}_\theta(y_{1:T})$ from a bootstrap particle filter in place of $p(y_{1:T}|\theta)$ . This turns out to be a simple special case of the particle marginal Metropolis-Hastings (PMMH) algorithm described in Andreiu et al (2010). However, the PMMH algorithm in fact has the full joint posterior $p(\theta,x_{0:T}|y_{1:T})$ as its target. I will explain the PMMH algorithm in a subsequent post.