## Tuning particle MCMC algorithms

Several papers have appeared recently discussing the issue of how to tune the number of particles used in the particle filter within a particle MCMC algorithm such as particle marginal Metropolis Hastings (PMMH). Three such papers are:

I have discussed psuedo marginal MCMC and particle MCMC algorithms in previous posts. It will be useful to refer back to these posts if these topics are unfamiliar. Within particle MCMC algorithms (and psuedo-marginal MCMC algorithms, more generally), an unbiased estimate of marginal likelihood is constructed using a number of particles. The more particles that are used, the better the estimate of marginal likelihood is, and the resulting MCMC algorithm will behave more like a “real” marginal MCMC algorithm. For a small number of particles, the algorithm will still have exactly the correct target, but the noise in the unbiased estimator of marginal likelihood will lead to poor mixing of the MCMC chain. The idea is to use just enough particles to ensure that there isn’t “too much” noise in the unbiased estimator, but not to waste lots of time producing a super-accurate estimate of marginal likelihood if that isn’t necessary to ensure good mixing of the MCMC chain.

The papers above try to give theoretical justifications for certain “rules of thumb” that are commonly used in practice. One widely adopted scheme is to tune the number of particles so that the variance of the log of the estimate of marginal liklihood is around one. The obvious questions are “where?” and “why?”, and these questions turn out to be connected. As we will see, there isn’t really a good answer to the “where?” question, but what people usually do is use a pilot run to get an estimate of the posterior mean, or mode, or MLE, and then pick one and tune the noise variance at that particular parameter value. As to “why?”, well, the papers above make various (slightly different) assumptions, all of which lead to trading off mixing against computation time to obtain an “optimal” number of particles. They don’t all agree that the variance of the noise should be exactly 1, but they all agree to an order of magnitude.

All of the above papers make the assumption that the noise distribution associated with the marginal likelihood estimate is independent of the parameter at which it is being evaluated, which explains why there isn’t a really good answer to the “where?” question – under the assumption it doesn’t matter what parameter value is used for tuning – they are all the same! Easy. Except that’s quite a big assumption, so it would be nice to know that it is reasonable, and unfortunately it isn’t. Let’s look at an example to see what goes wrong.

#### Example

In Chapter 10 of my book I look in detail at constructing a PMMH algorithm for inferring the parameters of a discretely observed stochastic Lotka-Volterra model. I’ve stepped through the computational details in a previous post which you should refer back to for the necessary background. Following that post, we can construct a particle filter to return an unbiased estimate of marginal likelihood using the following R code (which relies on the smfsb CRAN package):

require(smfsb)
# data
data(LVdata)
data=as.timedData(LVnoise10)
noiseSD=10
# measurement error model
dataLik <- function(x,t,y,log=TRUE,...)
{
ll=sum(dnorm(y,x,noiseSD,log=TRUE))
if (log)
return(ll)
else
return(exp(ll))
}
# now define a sampler for the prior on the initial state
simx0 <- function(N,t0,...)
{
mat=cbind(rpois(N,50),rpois(N,100))
colnames(mat)=c("x1","x2")
mat
}
# construct particle filter
mLLik=pfMLLik(150,simx0,0,stepLVc,dataLik,data)


Again, see the relevant previous post for details. So now mLLik() is a function that will return the log of an unbiased estimate of marginal likelihood (based on 150 particles) given a parameter value at which to evaluate.

What we are currently wondering is whether the noise in the estimate is independent of the parameter at which it is evaluated. We can investigate this for this filter easily by looking at how the estimate varies as the first parameter (prey birth rate) varies. The following code computes a log likelihood estimate across a range of values and plots the result.

mLLik1=function(x){mLLik(th=c(th1=x,th2=0.005,th3=0.6))}
x=seq(0.7,1.3,length.out=5001)
y=sapply(x,mLLik1)
plot(x[y>-1e10],y[y>-1e10])


The resulting plot is as follows (click for full size):

So, looking at the plot, it is very clear that the noise variance certainly isn’t constant as the parameter varies – it varies substantially. Furthermore, the way in which it varies is “dangerous”, in that the noise is smallest in the vicinity of the MLE. So, if a parameter close to the MLE is chosen for tuning the number of particles, this will ensure that the noise is small close to the MLE, but not elsewhere in parameter space. This could have bad consequences for the mixing of the MCMC algorithm as it explores the tails of the posterior distribution.

So with the above in mind, how should one tune the number of particles in a pMCMC algorithm? I can’t give a general answer, but I can explain what I do. We can’t rely on theory, so a pragmatic approach is required. The above rule of thumb usually gives a good starting point for exploration. Then I just directly optimise ESS per CPU second of the pMCMC algorithm from pilot runs for varying numbers of particles (and other tuning parameters in the algorithm). ESS is “expected sample size”, which can be estimated using the effectiveSize() function in the coda CRAN package. Ugly and brutish, but it works…

## Introduction

In the previous post I showed that it is possible to couple parallel tempered MCMC chains in order to improve mixing. Such methods can be used when the target of interest is a Bayesian posterior distribution that is difficult to sample. There are (at least) a couple of obvious ways that one can temper a Bayesian posterior distribution. Perhaps the most obvious way is a simple flattening, so that if

$\pi(\theta|y) \propto \pi(\theta)\pi(y|\theta)$

is the posterior distribution, then for $t\in [0,1]$ we define

$\pi_t(\theta|y) \propto \pi(\theta|y)^t \propto [ \pi(\theta)\pi(y|\theta) ]^t.$

This corresponds with the tempering that is often used in statistical physics applications. We recover the posterior of interest for $t=1$ and tend to a flat distribution as $t\longrightarrow 0$. However, for Bayesian posterior distributions, there is a different way of tempering that is often more natural and useful, and that is to temper using the power posterior, defined by

$\pi_t(\theta|y) \propto \pi(\theta)\pi(y|\theta)^t.$

Here we again recover the posterior for $t=1$, but get the prior for $t=0$. Thus, the family of distributions forms a natural bridge or path from the prior to the posterior distributions. The power posterior is a special case of the more general concept of a geometric path from distribution $f(\theta)$ (at $t=0$) to $g(\theta)$ (at $t=1$) defined by

$h_t(\theta) \propto f(\theta)^{1-t}g(\theta)^t,$

where, in our case, $f(\cdot)$ is the prior and $g(\cdot)$ is the posterior.

So, given a posterior distribution that is difficult to sample, choose a temperature schedule

$0=t_0

and run a parallel tempering scheme as outlined in the previous post. The idea is that for small values of $t$ mixing will be good, as prior-like distributions are usually well-behaved, and the mixing of these "high temperature" chains can help to improve the mixing of the "low temperature" chains that are more like the posterior (note that $t$ is really an inverse temperature parameter the way I’ve defined it here…).

## Marginal likelihood and normalising constants

The marginal likelihood of a Bayesian model is

$\pi(y) = \int_\Theta \pi(\theta)\pi(y|\theta)d\theta.$

This quantity is of interest for many reasons, including calculation of the Bayes factor between two competing models. Note that this quantity has several different names in different fields. In particular, it is often known as the evidence, due to its role in Bayes factors. It is also worth noting that it is the normalising constant of the Bayesian posterior distribution. Although it is very easy to describe and define, it is notoriously difficult to compute reliably for complex models.

The normalising constant is conceptually very easy to estimate. From the above integral representation, it is clear that

$\pi(y) = E_\pi [ \pi(y|\theta) ]$

where the expectation is taken with respect to the prior. So, given samples from the prior, $\theta_1,\theta_2,\ldots,\theta_n$, we can construct the Monte Carlo estimate

$\displaystyle \widehat{\pi}(y) = \frac{1}{n}\sum_{i=1}^n \pi(y|\theta_i)$

and this will be a consistent estimator of the true evidence under fairly mild regularity conditions. Unfortunately, in practice it is likely to be a very poor estimator if the posterior and prior are not very similar. Now, we could also use Bayes theorem to re-write the integral as an expectation with respect to the posterior, so we could then use samples from the posterior to estimate the evidence. This leads to the harmonic mean estimator of the evidence, which has been described as the worst Monte Carlo method ever! Now it turns out that there are many different ways one can construct estimators of the evidence using samples from the prior and the posterior, some of which are considerably better than the two I’ve outlined. This is the subject of the bridge sampling paper of Meng and Wong. However, the reality is that no method will work well if the prior and posterior are very different.

If we have tempered chains, then we have a sequence of chains targeting distributions which, by construction, are not too different, and so we can use the output from tempered chains in order to construct estimators of the evidence that are more numerically stable. If we call the evidence of the $i$th chain $z_i$, so that $z_0=1$ and $z_N=\pi(y)$, then we can write the evidence in telescoping fashion as

$\displaystyle \pi(y)=z_N = \frac{z_N}{z_0} = \frac{z_1}{z_0}\times \frac{z_2}{z_1}\times \cdots \times \frac{z_N}{z_{N-1}}.$

Now the $i$th term in this product is $z_{i+1}/z_{i}$, which can be estimated using the output from the $i$th and/or $(i+1)$th chain(s). Again, this can be done in a variety of ways, using your favourite bridge sampling estimator, but the point is that the estimator should be reasonably good due to the fact that the $i$th and $(i+1)$th targets are very similar. For the power posterior, the simplest method is to write

$\displaystyle \frac{z_{i+1}}{z_i} = \frac{\displaystyle \int \pi(\theta)\pi(y|\theta)^{t_{i+1}}d\theta}{z_i} = \int \pi(y|\theta)^{t_{i+1}-t_i}\times \frac{\pi(y|\theta)^{t_i}\pi(\theta)}{z_i}d\theta$

$\displaystyle \mbox{}\qquad = E_i\left[\pi(y|\theta)^{t_{i+1}-t_i}\right],$

where the expectation is with respect to the $i$th target, and hence can be estimated in the usual way using samples from the $i$th chain.

For numerical stability, in practice we compute the log of the evidence as

$\displaystyle \log\pi(y) = \sum_{i=0}^{N-1} \log\frac{z_{i+1}}{z_i} = \sum_{i=0}^{N-1} \log E_i\left[\pi(y|\theta)^{t_{i+1}-t_i}\right]$

$\displaystyle = \sum_{i=0}^{N-1} \log E_i\left[\exp\{(t_{i+1}-t_i)\log\pi(y|\theta)\}\right].\qquad(\dagger)$

The above expression is exact, and is the obvious formula to use for computation. However, it is clear that if $t_i$ and $t_{i+1}$ are sufficiently close, it will be approximately OK to switch the expectation and exponential, giving

$\displaystyle \log\pi(y) \approx \sum_{i=0}^{N-1}(t_{i+1}-t_i)E_i\left[\log\pi(y|\theta)\right].$

In the continuous limit, this gives rise to the well-known path sampling identity,

$\displaystyle \log\pi(y) = \int_0^1 E_t\left[\log\pi(y|\theta)\right]dt.$

So, an alternative approach to computing the evidence is to use the samples to approximately numerically integrate the above integral, say, using the trapezium rule. However, it isn’t completely clear (to me) that this is better than using $(\dagger)$ directly, since there there is no numerical integration error to worry about.

## Numerical illustration

We can illustrate these ideas using the simple double potential well example from the previous post. Now that example doesn’t really correspond to a Bayesian posterior, and is tempered directly, rather than as a power posterior, but essentially the same ideas follow for general parallel tempered distributions. In general, we can use the sample to estimate the ratio of the last and first normalising constants, $z_N/z_0$. Here it isn’t obvious why we’d want to know that, but we’ll compute it anyway to illustrate the method. As before, we expand as a telescopic product, where the $i$th term is now

$\displaystyle \frac{z_{i+1}}{z_i} = E_i\left[\exp\{-(\gamma_{i+1}-\gamma_i)(x^2-1)^2\}\right].$

A Monte Carlo estimate of each of these terms is formed using the samples from the $i$th chain, and the logs of these are then summed to give $\log(z_N/z_0)$. A complete R script to run the Metropolis coupled sampler and compute the evidence is given below.

U=function(gam,x)
{
gam*(x*x-1)*(x*x-1)
}

temps=2^(0:3)
iters=1e5

chains=function(pot=U, tune=0.1, init=1)
{
x=rep(init,length(temps))
xmat=matrix(0,iters,length(temps))
for (i in 1:iters) {
can=x+rnorm(length(temps),0,tune)
logA=unlist(Map(pot,temps,x))-unlist(Map(pot,temps,can))
accept=(log(runif(length(temps)))<logA)
x[accept]=can[accept]
swap=sample(1:length(temps),2)
logA=pot(temps[swap[1]],x[swap[1]])+pot(temps[swap[2]],x[swap[2]])-
pot(temps[swap[1]],x[swap[2]])-pot(temps[swap[2]],x[swap[1]])
if (log(runif(1))<logA)
x[swap]=rev(x[swap])
xmat[i,]=x
}
colnames(xmat)=paste("gamma=",temps,sep="")
xmat
}

mat=chains()
mat=mat[,1:(length(temps)-1)]
diffs=diff(temps)
mat=(mat*mat-1)^2
mat=-t(diffs*t(mat))
mat=exp(mat)
logEvidence=sum(log(colMeans(mat)))
message(paste("The log of the ratio of the last and first normalising constants is",logEvidence))


It turns out that these double well potential densities are tractable, and so the normalising constants can be computed exactly. So, with a little help from Wolfram Alpha, I compute log of the ratio of the last and first normalising constants to be approximately -1.12. Hopefully the above script will output something a bit like that…

## Introduction

Parallel tempering is a method for getting Metropolis-Hastings based MCMC algorithms to work better on multi-modal distributions. Although the idea has been around for more than 20 years, and works well on many problems, it still isn’t routinely used in applications. I think this is partly because relatively few people understand how it works, and partly due to the perceived difficulty of implementation. I hope to show here that it is both very easy to understand and to implement. It is also rather easy to implement in parallel on multi-core systems, though I won’t get into that in this post.

## Sampling a double-well potential

To illustrate the ideas, we need a toy multi-modal distribution to sample. There are obviously many possibilities here, but I rather like to use a double potential well distribution. The simplest version of this assumes a potential function of the form

$U(x) = \gamma (x^2-1)^2$

for some given potential barrier height $\gamma$. The potential function $U(x)$ corresponds to the probability density function

$\pi(x) \propto \exp\{-U(x)\}.$

There is a physical explanation for the terminology, via Langevin diffusions, but that isn’t really important here. It is fine to just think of potentials as being a (negative) log-density scale. On this scale, high potential barrier heights correspond to regions of very low probability density. We can set up a double well potential and plot it for the case $\gamma=4$ in R with the following code

U=function(gam,x)
{
gam*(x*x-1)*(x*x-1)
}

curried=function(gam)
{
message(paste("Returning a function for gamma =",gam))
function(x) U(gam,x)
}
U4=curried(4)

op=par(mfrow=c(2,1))
curve(U4(x),-2,2,main="Potential function, U(x)")
curve(exp(-U4(x)),-2,2,main="Unnormalised density function, exp(-U(x))")
par(op)


leading to the following plot

Incidentally, the function curried(), which curries the potential function, did not include the message() statement when I first wrote it. It mostly worked fine, but some of the code below didn’t behave as I expected. I inserted the message() statement to figure out what was going on, and the code started behaving perfectly – a beautiful example of a Heisenbug! The reason is that the message statement is not redundant – it forces evaluation of the gam variable, which is necessary in some cases, due to the lazy evaluation model that R uses for function arguments. If you don’t like the message() statement, replacing it with a simple gam works just as well.

Anyway, the point is that we have defined a multi-modal density, and that a Metropolis-Hastings algorithm initialised in one of the modes will have a hard time jumping to the other mode, and the difficulty of this jump will increase as we increase the value of $\gamma$.

We can write a simple function for a Metropolis algorithm targeting a particular potential function as follows.

chain=function(target,tune=0.1,init=1)
{
x=init
xvec=numeric(iters)
for (i in 1:iters) {
can=x+rnorm(1,0,tune)
logA=target(x)-target(can)
if (log(runif(1))<logA)
x=can
xvec[i]=x
}
xvec
}


We can use this code to run some chains for a few different values of $\gamma$ as follows.

temps=2^(0:3)
iters=1e5

mat=sapply(lapply(temps,curried),chain)
colnames(mat)=paste("gamma=",temps,sep="")

require(smfsb)
mcmcSummary(mat,rows=length(temps))


leading to the plot below.

We see that as $\gamma$ increases, the chain jumps between modes less frequently. Indeed, for $\gamma=8$, the chain fails to jump to the second mode at all during this particular run of 100,000 iterations. That’s a problem if we are really interested in sampling from distributions like this. Of course, for this particular problem, there are all kinds of ways to fix this sampler, but the point is to try and develop methods that will work in high-dimensional situations where we cannot just “look” at what is going wrong.

Before we see how to couple the chains and improve the mixing, it is useful to think how to re-write this sampler. Above we sampled each chain in turn for different barrier heights. To couple the chains, we need to sample them together, sampling each iteration for all of the chains in turn. This is very easy to do. The code below isn’t especially efficient, but it is written to look very similar to the single chain code above.

chains=function(pot=U, tune=0.1, init=1)
{
x=rep(init,length(temps))
xmat=matrix(0,iters,length(temps))
for (i in 1:iters) {
can=x+rnorm(length(temps),0,tune)
logA=unlist(Map(pot,temps,x))-unlist(Map(pot,temps,can))
accept=(log(runif(length(temps)))<logA)
x[accept]=can[accept]
xmat[i,]=x
}
colnames(xmat)=paste("gamma=",temps,sep="")
xmat
}

mcmcSummary(chains(),rows=length(temps))


This code should behave identically to the previous code, simulating independent parallel MCMC chains. However, the code is now in the form that is very easy to modify to couple the chains together in an attempt to improve mixing.

## Coupling parallel chains

In the above example the chains we are simulating are all independent of one another. Some mix reasonably well, and some mix very badly. But the distributions are all related to one another, changing gradually as the barrier height changes. The idea behind coupling the chains is to try and swap states between the chains to use the chains which are mixing well to improve the mixing of the chains which aren’t. In particular, suppose interest is in the target of the worst mixing chain. The other chains can be constructed as “tempered” versions of the target of interest, here by raising it to a power between 0 and 1, with 0 corresponding to a complete flattening of the distribution, and 1 corresponding to the desired target. The use of parallel chains to improve mixing in this way is usually referred to as parallel tempering, but also sometimes as $(\text{MC})^3$. In the context of Bayesian inference, tempering using the “power posterior” can often be more natural and useful (to be discussed in a subsequent post).

So, how do we swap states between the chains without affecting the target distributions? As always, just use a Metropolis-Hastings update… To keep things simple, let’s just suppose that we have two (independent, parallel) chains, one with target $f(x)$ and the other with target $g(y)$. We can consider these chains to be evolving together, with joint target $\pi(x,y)=f(x)g(y)$. The updates chosen to update the within-chain states will obviously preserve this joint target. Now we consider how to swap states between the two chains without destroying the target. We simply propose a swap of $x$ and $y$. That is, we propose to move from $(x,y)$ to $(x^\star,y^\star)$, where $x^\star=y$ and $y^\star=x$. We are proposing this move as a standard Metropolis-Hastings update of the joint chain. We may therefore wonder about exactly what the proposal density for this move is. In fact, it is a point mass at the swapped values, and therefore has density

$q((x^\star,y^\star)|(x,y)) = \delta(x^\star-y)\delta(y^\star-x),$

but it really doesn’t matter, as it is clearly a symmetric proposal, and hence will drop out of the M-H ratio. Our acceptance probability is therefore $\min\{1,A\}$, where

$\displaystyle A = \frac{\pi(x^\star,y^\star)}{\pi(x,y)} = \frac{\pi(y,x)}{\pi(x,y)} = \frac{f(y)g(x)}{f(x)g(y)}.$

So, if we use this acceptance probability whenever we propose a swap of the states between two chains, then we will preserve the joint target, and hence the marginal targets and asymptotic independence of the target. However, the chains themselves will no longer be independent of one another. They will be coupled – Metropolis coupled. This is very easy to implement. We can just add a few lines to our previous chains() function as follows

chains=function(pot=U, tune=0.1, init=1)
{
x=rep(init,length(temps))
xmat=matrix(0,iters,length(temps))
for (i in 1:iters) {
can=x+rnorm(length(temps),0,tune)
logA=unlist(Map(pot,temps,x))-unlist(Map(pot,temps,can))
accept=(log(runif(length(temps)))<logA)
x[accept]=can[accept]
# now the coupling update
swap=sample(1:length(temps),2)
logA=pot(temps[swap[1]],x[swap[1]])+pot(temps[swap[2]],x[swap[2]])-
pot(temps[swap[1]],x[swap[2]])-pot(temps[swap[2]],x[swap[1]])
if (log(runif(1))<logA)
x[swap]=rev(x[swap])
# end of the coupling update
xmat[i,]=x
}
colnames(xmat)=paste("gamma=",temps,sep="")
xmat
}


This can be used as before, but now gives results as illustrated in the following plots.

We see immediately from the plots that whilst the individual target distributions remain unchanged, the mixing of the chains is greatly improved (though still far from perfect). Note that in the code above I just picked two chains at random to propose a state swap. In practice people typically only propose to swap states between chains which are adjacent – i.e. most similar, since proposed swaps between chains with very different targets are unlikely to be accepted. Since implementation of either strategy is very easy, I would recommend trying both to see which works best.

## Parallel implementation

It should be clear that there are opportunities for parallelising the above algorithm to make effective use of modern multi-core hardware. An approach using OpenMP with C++ is discussed in this blog post. Also note that if the state space of the chains is large, it can be much more efficient to swap temperatures between the chains rather than states – so long as you are careful about keeping track of stuff – this is explored in detail in Altekar et al (’04).

## Complete R script

For convenience, a complete R script to run all of the examples in this post is given below.

# temper.R
# functions for messing around with tempering MCMC

U=function(gam,x)
{
gam*(x*x-1)*(x*x-1)
}

curried=function(gam)
{
#gam
message(paste("Returning a function for gamma =",gam))
function(x) U(gam,x)
}
U4=curried(4)

op=par(mfrow=c(2,1))
curve(U4(x),-2,2,main="Potential function, U(x)")
curve(exp(-U4(x)),-2,2,main="Unnormalised density function, exp(-U(x))")
par(op)

# global settings
temps=2^(0:3)
iters=1e5

# First look at some independent chains
chain=function(target,tune=0.1,init=1)
{
x=init
xvec=numeric(iters)
for (i in 1:iters) {
can=x+rnorm(1,0,tune)
logA=target(x)-target(can)
if (log(runif(1))<logA)
x=can
xvec[i]=x
}
xvec
}

mat=sapply(lapply(temps,curried),chain)
colnames(mat)=paste("gamma=",temps,sep="")

require(smfsb)
mcmcSummary(mat,rows=length(temps))

# Next, let's generate 5 chains at once...
chains=function(pot=U, tune=0.1, init=1)
{
x=rep(init,length(temps))
xmat=matrix(0,iters,length(temps))
for (i in 1:iters) {
can=x+rnorm(length(temps),0,tune)
logA=unlist(Map(pot,temps,x))-unlist(Map(pot,temps,can))
accept=(log(runif(length(temps)))<logA)
x[accept]=can[accept]
xmat[i,]=x
}
colnames(xmat)=paste("gamma=",temps,sep="")
xmat
}

mcmcSummary(chains(),rows=length(temps))

# Next let's couple the chains...
chains=function(pot=U, tune=0.1, init=1)
{
x=rep(init,length(temps))
xmat=matrix(0,iters,length(temps))
for (i in 1:iters) {
can=x+rnorm(length(temps),0,tune)
logA=unlist(Map(pot,temps,x))-unlist(Map(pot,temps,can))
accept=(log(runif(length(temps)))<logA)
x[accept]=can[accept]
# now the coupling update
swap=sample(1:length(temps),2)
logA=pot(temps[swap[1]],x[swap[1]])+pot(temps[swap[2]],x[swap[2]])-
pot(temps[swap[1]],x[swap[2]])-pot(temps[swap[2]],x[swap[1]])
if (log(runif(1))<logA)
x[swap]=rev(x[swap])
# end of the coupling update
xmat[i,]=x
}
colnames(xmat)=paste("gamma=",temps,sep="")
xmat
}

mcmcSummary(chains(),rows=length(temps))

# eof


## Summary stats for ABC

#### Introduction

In the previous post I gave a very brief introduction to ABC, including a simple example for inferring the parameters of a Markov process given some time series observations. Towards the end of the post I observed that there were (at least!) two potential problems with scaling up the simple approach described, one relating to the dimension of the data and the other relating to the dimension of the parameter space. Before moving on to the (to me, more interesting) problem of the dimension of the parameter space, I will briefly discuss the data dimension problem in this post, and provide a couple of references for further reading.

#### Summary stats

Recall that the simple rejection sampling approach to ABC involves first sampling a candidate parameter $\theta^\star$ from the prior and then sampling a corresponding data set $x^\star$ from the model. This simulated data set is compared with the true data $x$ using some (pseudo-)norm, $\Vert\cdot\Vert$, and accepting $\theta^\star$ if the simulated data set is sufficiently close to the true data, $\Vert x^\star - x\Vert <\epsilon$. It should be clear that if we are using a proper norm then as $\epsilon$ tends to zero the distribution of the accepted values tends to the desired posterior distribution of the parameters given the data.

However, smaller choices of $\epsilon$ will lead to higher rejection rates. This will be a particular problem in the context of high-dimensional $x$, where it is often unrealistic to expect a close match between all components of $x$ and the simulated data $x^\star$, even for a good choice of $\theta^\star$. In this case, it makes more sense to look for good agreement between particular aspects of $x$, such as the mean, or variance, or auto-correlation, depending on the exact problem and context. If we can find a finite set of sufficient statistics, $s(x)$ for $\theta$, then it should be clear that replacing the acceptance criterion with $\Vert s(x^\star) - s(x)\Vert <\epsilon$ will also lead to a scheme tending to the true posterior as $\epsilon$ tends to zero (assuming a proper norm on the space of sufficient statistics), and will typically be better than the naive method, since the sufficient statistics will be of lower dimension and less “noisy” that the raw data, leading to higher acceptance rates with no loss of information.

Unfortunately for most problems of practical interest it is not possible to find low-dimensional sufficient statistics, and so people in practice use domain knowledge and heuristics to come up with a set of summary statistics, $s(x)$ which they hope will closely approximate sufficient statistics. There is still a question as to how these statistics should be weighted or transformed to give a particular norm. This can be done using theory or heuristics, and some relevant references for this problem are given at the end of the post.

#### Implementation in R

Let’s now look at the problem from the previous post. Here, instead of directly computing the Euclidean distance between the real and simulated data, we will look at the Euclidean distance between some (normalised) summary statistics. First we will load some packages and set some parameters.

require(smfsb)
require(parallel)
options(mc.cores=4)
data(LVdata)

N=1e7
bs=1e5
batches=N/bs
message(paste("N =",N," | bs =",bs," | batches =",batches))


Next we will define some summary stats for a univariate time series – the mean, the (log) variance, and the first two auto-correlations.

ssinit <- function(vec)
{
ac23=as.vector(acf(vec,lag.max=2,plot=FALSE)$acf)[2:3] c(mean(vec),log(var(vec)+1),ac23) }  Once we have this, we can define some stats for a bivariate time series by combining the stats for the two component series, along with the cross-correlation between them. ssi <- function(ts) { c(ssinit(ts[,1]),ssinit(ts[,2]),cor(ts[,1],ts[,2])) }  This gives a set of summary stats, but these individual statistics are potentially on very different scales. They can be transformed and re-weighted in a variety of ways, usually on the basis of a pilot run which gives some information about the distribution of the summary stats. Here we will do the simplest possible thing, which is to normalise the variance of the stats on the basis of a pilot run. This is not at all optimal – see the references at the end of the post for a description of better methods. message("Batch 0: Pilot run batch") prior=cbind(th1=exp(runif(bs,-6,2)),th2=exp(runif(bs,-6,2)),th3=exp(runif(bs,-6,2))) rows=lapply(1:bs,function(i){prior[i,]}) samples=mclapply(rows,function(th){simTs(c(50,100),0,30,2,stepLVc,th)}) sumstats=mclapply(samples,ssi) sds=apply(sapply(sumstats,c),1,sd) print(sds) # now define a standardised distance ss<-function(ts) { ssi(ts)/sds } ss0=ss(LVperfect) distance <- function(ts) { diff=ss(ts)-ss0 sum(diff*diff) }  Now we have a normalised distance function defined, we can proceed exactly as before to obtain an ABC posterior via rejection sampling. post=NULL for (i in 1:batches) { message(paste("batch",i,"of",batches)) prior=cbind(th1=exp(runif(bs,-6,2)),th2=exp(runif(bs,-6,2)),th3=exp(runif(bs,-6,2))) rows=lapply(1:bs,function(i){prior[i,]}) samples=mclapply(rows,function(th){simTs(c(50,100),0,30,2,stepLVc,th)}) dist=mclapply(samples,distance) dist=sapply(dist,c) cutoff=quantile(dist,1000/N,na.rm=TRUE) post=rbind(post,prior[dist<cutoff,]) } message(paste("Finished. Kept",dim(post)[1],"simulations"))  Having obtained the posterior, we can use the following code to plot the results. th=c(th1 = 1, th2 = 0.005, th3 = 0.6) op=par(mfrow=c(2,3)) for (i in 1:3) { hist(post[,i],30,col=5,main=paste("Posterior for theta[",i,"]",sep="")) abline(v=th[i],lwd=2,col=2) } for (i in 1:3) { hist(log(post[,i]),30,col=5,main=paste("Posterior for log(theta[",i,"])",sep="")) abline(v=log(th[i]),lwd=2,col=2) } par(op)  This gives the plot shown below. From this we can see that the ABC posterior obtained here is very similar to that obtained in the previous post using the full data. Here the dimension reduction is not that great – reducing from 32 data points to 9 summary statistics – and so the improvement in performance is not that noticable. But in higher dimensional problems reducing the dimension of the data is practically essential. #### Summary and References As before, I recommend the wikipedia article on approximate Bayesian computation for further information and a comprehensive set of references for further reading. Here I just want to highlight two references particularly relevant to the issue of summary statistics. It is quite difficult to give much practical advice on how to construct good summary statistics, but how to transform a set of summary stats in a “good” way is a problem that is reasonably well understood. In this post I did something rather naive (normalising the variance), but the following two papers describe much better approaches. I still haven’t addressed the issue of a high-dimensional parameter space – that will be the topic of a subsequent post. #### The complete R script require(smfsb) require(parallel) options(mc.cores=4) data(LVdata) N=1e6 bs=1e5 batches=N/bs message(paste("N =",N," | bs =",bs," | batches =",batches)) ssinit <- function(vec) { ac23=as.vector(acf(vec,lag.max=2,plot=FALSE)$acf)[2:3]
c(mean(vec),log(var(vec)+1),ac23)
}

ssi <- function(ts)
{
c(ssinit(ts[,1]),ssinit(ts[,2]),cor(ts[,1],ts[,2]))
}

message("Batch 0: Pilot run batch")
prior=cbind(th1=exp(runif(bs,-6,2)),th2=exp(runif(bs,-6,2)),th3=exp(runif(bs,-6,2)))
rows=lapply(1:bs,function(i){prior[i,]})
samples=mclapply(rows,function(th){simTs(c(50,100),0,30,2,stepLVc,th)})
sumstats=mclapply(samples,ssi)
sds=apply(sapply(sumstats,c),1,sd)
print(sds)

# now define a standardised distance
ss<-function(ts)
{
ssi(ts)/sds
}

ss0=ss(LVperfect)

distance <- function(ts)
{
diff=ss(ts)-ss0
sum(diff*diff)
}

post=NULL
for (i in 1:batches) {
message(paste("batch",i,"of",batches))
prior=cbind(th1=exp(runif(bs,-6,2)),th2=exp(runif(bs,-6,2)),th3=exp(runif(bs,-6,2)))
rows=lapply(1:bs,function(i){prior[i,]})
samples=mclapply(rows,function(th){simTs(c(50,100),0,30,2,stepLVc,th)})
dist=mclapply(samples,distance)
dist=sapply(dist,c)
cutoff=quantile(dist,1000/N,na.rm=TRUE)
post=rbind(post,prior[dist<cutoff,])
}
message(paste("Finished. Kept",dim(post)[1],"simulations"))

# plot the results
th=c(th1 = 1, th2 = 0.005, th3 = 0.6)
op=par(mfrow=c(2,3))
for (i in 1:3) {
hist(post[,i],30,col=5,main=paste("Posterior for theta[",i,"]",sep=""))
abline(v=th[i],lwd=2,col=2)
}
for (i in 1:3) {
hist(log(post[,i]),30,col=5,main=paste("Posterior for log(theta[",i,"])",sep=""))
abline(v=log(th[i]),lwd=2,col=2)
}
par(op)


## Introduction to Approximate Bayesian Computation (ABC)

Many of the posts in this blog have been concerned with using MCMC based methods for Bayesian inference. These methods are typically “exact” in the sense that they have the exact posterior distribution of interest as their target equilibrium distribution, but are obviously “approximate”, in that for any finite amount of computing time, we can only generate a finite sample of correlated realisations from a Markov chain that we hope is close to equilibrium.

Approximate Bayesian Computation (ABC) methods go a step further, and generate samples from a distribution which is not the true posterior distribution of interest, but a distribution which is hoped to be close to the real posterior distribution of interest. There are many variants on ABC, and I won’t get around to explaining all of them in this blog. The wikipedia page on ABC is a good starting point for further reading. In this post I’ll explain the most basic rejection sampling version of ABC, and in a subsequent post, I’ll explain a sequential refinement, often referred to as ABC-SMC. As usual, I’ll use R code to illustrate the ideas.

#### Basic idea

There is a close connection between “likelihood free” MCMC methods and those of approximate Bayesian computation (ABC). To keep things simple, consider the case of a perfectly observed system, so that there is no latent variable layer. Then there are model parameters $\theta$ described by a prior $\pi(\theta)$, and a forwards-simulation model for the data $x$, defined by $\pi(x|\theta)$. It is clear that a simple algorithm for simulating from the desired posterior $\pi(\theta|x)$ can be obtained as follows. First simulate from the joint distribution $\pi(\theta,x)$ by simulating $\theta^\star\sim\pi(\theta)$ and then $x^\star\sim \pi(x|\theta^\star)$. This gives a sample $(\theta^\star,x^\star)$ from the joint distribution. A simple rejection algorithm which rejects the proposed pair unless $x^\star$ matches the true data $x$ clearly gives a sample from the required posterior distribution.

#### Exact rejection sampling

• 1. Sample $\theta^\star \sim \pi(\theta^\star)$
• 2. Sample $x^\star\sim \pi(x^\star|\theta^\star)$
• 3. If $x^\star=x$, keep $\theta^\star$ as a sample from $\pi(\theta|x)$, otherwise reject.

This algorithm is exact, and for discrete $x$ will have a non-zero acceptance rate. However, in most interesting problems the rejection rate will be intolerably high. In particular, the acceptance rate will typically be zero for continuous valued $x$.

#### ABC rejection sampling

The ABC “approximation” is to accept values provided that $x^\star$ is “sufficiently close” to $x$. In the first instance, we can formulate this as follows.

• 1. Sample $\theta^\star \sim \pi(\theta^\star)$
• 2. Sample $x^\star\sim \pi(x^\star|\theta^\star)$
• 3. If $\Vert x^\star-x\Vert< \epsilon$, keep $\theta^\star$ as a sample from $\pi(\theta|x)$, otherwise reject.

Euclidean distance is usually chosen as the norm, though any norm can be used. This procedure is “honest”, in the sense that it produces exact realisations from

$\theta^\star\big|\Vert x^\star-x\Vert < \epsilon.$

For suitable small choice of $\epsilon$, this will closely approximate the true posterior. However, smaller choices of $\epsilon$ will lead to higher rejection rates. This will be a particular problem in the context of high-dimensional $x$, where it is often unrealistic to expect a close match between all components of $x$ and the simulated data $x^\star$, even for a good choice of $\theta^\star$. In this case, it makes more sense to look for good agreement between particular aspects of $x$, such as the mean, or variance, or auto-correlation, depending on the exact problem and context.

In the simplest case, this is done by forming a (vector of) summary statistic(s), $s(x^\star)$ (ideally a sufficient statistic), and accepting provided that $\Vert s(x^\star)-s(x)\Vert<\epsilon$ for some suitable choice of metric and $\epsilon$. We will return to this issue in a subsequent post.

#### Inference for an intractable Markov process

I’ll illustrate ABC in the context of parameter inference for a Markov process with an intractable transition kernel: the discrete stochastic Lotka-Volterra model. A function for simulating exact realisations from the intractable kernel is included in the smfsb CRAN package discussed in a previous post. Using pMCMC to solve the parameter inference problem is discussed in another post. It may be helpful to skim through those posts quickly to become familiar with this problem before proceeding.

So, for a given proposed set of parameters, realisations from the process can be sampled using the functions simTs and stepLV (from the smfsb package). We will use the sample data set LVperfect (from the LVdata dataset) as our “true”, or “target” data, and try to find parameters for the process which are consistent with this data. A fairly minimal R script for this problem is given below.

require(smfsb)
data(LVdata)

N=1e5
message(paste("N =",N))
prior=cbind(th1=exp(runif(N,-6,2)),th2=exp(runif(N,-6,2)),th3=exp(runif(N,-6,2)))
rows=lapply(1:N,function(i){prior[i,]})
message("starting simulation")
samples=lapply(rows,function(th){simTs(c(50,100),0,30,2,stepLVc,th)})
message("finished simulation")

distance<-function(ts)
{
diff=ts-LVperfect
sum(diff*diff)
}

message("computing distances")
dist=lapply(samples,distance)
message("distances computed")

dist=sapply(dist,c)
cutoff=quantile(dist,1000/N)
post=prior[dist<cutoff,]

op=par(mfrow=c(2,3))
apply(post,2,hist,30)
apply(log(post),2,hist,30)
par(op)


This script should take 5-10 minutes to run on a decent laptop, and will result in histograms of the posterior marginals for the components of $\theta$ and $\log(\theta)$. Note that I have deliberately adopted a functional programming style, making use of the lapply function for the most computationally intensive steps. The reason for this will soon become apparent. Note that rather than pre-specifying a cutoff $\epsilon$, I’ve instead picked a quantile of the distance distribution. This is common practice in scenarios where the distance is difficult to have good intuition about. In fact here I’ve gone a step further and chosen a quantile to give a final sample of size 1000. Obviously then in this case I could have just selected out the top 1000 directly, but I wanted to illustrate the quantile based approach.

One problem with the above script is that all proposed samples are stored in memory at once. This is problematic for problems involving large numbers of samples. However, it is convenient to do simulations in large batches, both for computation of quantiles, and also for efficient parallelisation. The script below illustrates how to implement a batch parallelisation strategy for this problem. Samples are generated in batches of size 10^4, and only the best fitting samples are stored before the next batch is processed. This strategy can be used to get a good sized sample based on a more stringent acceptance criterion at the cost of addition simulation time. Note that the parallelisation code will only work with recent versions of R, and works by replacing calls to lapply with the parallel version, mclapply. You should notice an appreciable speed-up on a multicore machine.

require(smfsb)
require(parallel)
options(mc.cores=4)
data(LVdata)

N=1e5
bs=1e4
batches=N/bs
message(paste("N =",N," | bs =",bs," | batches =",batches))

distance<-function(ts)
{
diff=ts[,1]-LVprey
sum(diff*diff)
}

post=NULL
for (i in 1:batches) {
message(paste("batch",i,"of",batches))
prior=cbind(th1=exp(runif(bs,-6,2)),th2=exp(runif(bs,-6,2)),th3=exp(runif(bs,-6,2)))
rows=lapply(1:bs,function(i){prior[i,]})
samples=mclapply(rows,function(th){simTs(c(50,100),0,30,2,stepLVc,th)})
dist=mclapply(samples,distance)
dist=sapply(dist,c)
cutoff=quantile(dist,1000/N)
post=rbind(post,prior[dist<cutoff,])
}
message(paste("Finished. Kept",dim(post)[1],"simulations"))

op=par(mfrow=c(2,3))
apply(post,2,hist,30)
apply(log(post),2,hist,30)
par(op)


Note that there is an additional approximation here, since the top 100 samples from each of 10 batches of simulations won’t correspond exactly to the top 1000 samples overall, but given all of the other approximations going on in ABC, this one is likely to be the least of your worries.

Now, if you compare the approximate posteriors obtained here with the “true” posteriors obtained in an earlier post using pMCMC, you will see that these posteriors are really quite poor. However, this isn’t a very fair comparison, since we’ve only done 10^5 simulations. Jacking N up to 10^7 gives the ABC posterior below.

This is a bit better, but really not great. There are two basic problems with the simplistic ABC strategy adopted here, one related to the dimensionality of the data and the other the dimensionality of the parameter space. The most basic problem that we have here is the dimensionality of the data. We have 16 (bivariate) observations, so we want our stochastic simulation to shoot at a point in a 16- or 32-dimensional space. That’s a tough call. The standard way to address this problem is to reduce the dimension of the data by introducing a few carefully chosen summary statistics and then just attempting to hit those. I’ll illustrate this in a subsequent post. The other problem is that often the prior and posterior over the parameters are quite different, and this problem too is exacerbated as the dimension of the parameter space increases. The standard way to deal with this is to sequentially adapt from the prior through a sequence of ABC posteriors. I’ll examine this in a future post as well, once I’ve also posted an introduction to the use of sequential Monte Carlo (SMC) samplers for static problems.

For further reading, I suggest browsing the reference list of the Wikipedia page for ABC. Also look through the list of software on that page. In particular, note that there is a CRAN package, abc, providing R support for ABC. There is a vignette for this package which should be sufficient to get started.

## Getting started with Bayesian variable selection using JAGS and rjags

#### Bayesian variable selection

In a previous post I gave a quick introduction to using the rjags R package to access the JAGS Bayesian inference from within R. In this post I want to give a quick guide to using rjags for Bayesian variable selection. I intend to use this post as a starting point for future posts on Bayesian model and variable selection using more sophisticated approaches.

I will use the simple example of multiple linear regression to illustrate the ideas, but it should be noted that I’m just using that as an example. It turns out that in the context of linear regression there are lots of algebraic and computational tricks which can be used to simplify the variable selection problem. The approach I give here is therefore rather inefficient for linear regression, but generalises to more complex (non-linear) problems where analytical and computational short-cuts can’t be used so easily.

Consider a linear regression problem with n observations and p covariates, which we can write in matrix form as

$y = \alpha \boldmath{1} + X\beta + \varepsilon,$

where $X$ is an $n\times p$ matrix. The idea of variable selection is that probably not all of the p covariates are useful for predicting y, and therefore it would be useful to identify the variables which are, and just use those. Clearly each combination of variables corresponds to a different model, and so the variable selection amounts to choosing among the $2^p$ possible models. For large values of p it won’t be practical to consider each possible model separately, and so the idea of Bayesian variable selection is to consider a model containing all of the possible model combinations as sub-models, and the variable selection problem as just another aspect of the model which must be estimated from data. I’m simplifying and glossing over lots of details here, but there is a very nice review paper by O’Hara and Sillanpaa (2009) which the reader is referred to for further details.

The simplest and most natural way to tackle the variable selection problem from a Bayesian perspective is to introduce an indicator random variable $I_i$ for each covariate, and introduce these into the model in order to “zero out” inactive covariates. That is we write the ith regression coefficient $\beta_i$ as $\beta_i=I_i\beta^\star_i$, so that $\beta^\star_i$ is the regression coefficient when $I_i=1$, and “doesn’t matter” when $I_i=0$. There are various ways to choose the prior over $I_i$ and $\beta^\star_i$, but the simplest and most natural choice is to make them independent. This approach was used in Kuo and Mallick (1998), and hence is referred to as the Kuo and Mallick approach in O’Hara and Sillanpaa.

#### Simulating some data

In order to see how things work, let’s first simulate some data from a regression model with geometrically decaying regression coefficients.

n=500
p=20
X=matrix(rnorm(n*p),ncol=p)
beta=2^(0:(1-p))
print(beta)
alpha=3
tau=2
eps=rnorm(n,0,1/sqrt(tau))
y=alpha+as.vector(X%*%beta + eps)


Let’s also fit the model by least squares.

mod=lm(y~X)
print(summary(mod))


This should give output something like the following.

Call:
lm(formula = y ~ X)

Residuals:
Min       1Q   Median       3Q      Max
-1.62390 -0.48917 -0.02355  0.45683  2.35448

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept)  3.0565406  0.0332104  92.036  < 2e-16 ***
X1           0.9676415  0.0322847  29.972  < 2e-16 ***
X2           0.4840052  0.0333444  14.515  < 2e-16 ***
X3           0.2680482  0.0320577   8.361  6.8e-16 ***
X4           0.1127954  0.0314472   3.587 0.000369 ***
X5           0.0781860  0.0334818   2.335 0.019946 *
X6           0.0136591  0.0335817   0.407 0.684379
X7           0.0035329  0.0321935   0.110 0.912662
X8           0.0445844  0.0329189   1.354 0.176257
X9           0.0269504  0.0318558   0.846 0.397968
X10          0.0114942  0.0326022   0.353 0.724575
X11         -0.0045308  0.0330039  -0.137 0.890868
X12          0.0111247  0.0342482   0.325 0.745455
X13         -0.0584796  0.0317723  -1.841 0.066301 .
X14         -0.0005005  0.0343499  -0.015 0.988381
X15         -0.0410424  0.0334723  -1.226 0.220742
X16          0.0084832  0.0329650   0.257 0.797026
X17          0.0346331  0.0327433   1.058 0.290718
X18          0.0013258  0.0328920   0.040 0.967865
X19         -0.0086980  0.0354804  -0.245 0.806446
X20          0.0093156  0.0342376   0.272 0.785671
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.7251 on 479 degrees of freedom
Multiple R-squared: 0.7187,     Adjusted R-squared: 0.707
F-statistic:  61.2 on 20 and 479 DF,  p-value: < 2.2e-16


The first 4 variables are “highly significant” and the 5th is borderline.

#### Saturated model

We can fit the saturated model using JAGS with the following code.

require(rjags)
data=list(y=y,X=X,n=n,p=p)
init=list(tau=1,alpha=0,beta=rep(0,p))
modelstring="
model {
for (i in 1:n) {
mean[i]<-alpha+inprod(X[i,],beta)
y[i]~dnorm(mean[i],tau)
}
for (j in 1:p) {
beta[j]~dnorm(0,0.001)
}
alpha~dnorm(0,0.0001)
tau~dgamma(1,0.001)
}
"
model=jags.model(textConnection(modelstring),
data=data,inits=init)
update(model,n.iter=100)
output=coda.samples(model=model,variable.names=c("alpha","beta","tau"),
n.iter=10000,thin=1)
print(summary(output))
plot(output)


I’ve hard-coded various hyper-parameters in the script which are vaguely reasonable for this kind of problem. I won’t include all of the output in this post, but this works fine and gives sensible results. However, it does not address the variable selection problem.

#### Basic variable selection

Let’s now modify the above script to do basic variable selection in the style of Kuo and Mallick.

data=list(y=y,X=X,n=n,p=p)
init=list(tau=1,alpha=0,betaT=rep(0,p),ind=rep(0,p))
modelstring="
model {
for (i in 1:n) {
mean[i]<-alpha+inprod(X[i,],beta)
y[i]~dnorm(mean[i],tau)
}
for (j in 1:p) {
ind[j]~dbern(0.2)
betaT[j]~dnorm(0,0.001)
beta[j]<-ind[j]*betaT[j]
}
alpha~dnorm(0,0.0001)
tau~dgamma(1,0.001)
}
"
model=jags.model(textConnection(modelstring),
data=data,inits=init)
update(model,n.iter=1000)
output=coda.samples(model=model,
variable.names=c("alpha","beta","ind","tau"),
n.iter=10000,thin=1)
print(summary(output))
plot(output)


Note that I’ve hard-coded an expectation that around 20% of variables should be included in the model. Again, I won’t include all of the output here, but the posterior mean of the indicator variables can be interpreted as posterior probabilities that the variables should be included in the model. Inspecting the output then reveals that the first three variables have a posterior probability of very close to one, the 4th variable has a small but non-negligible probability of inclusion, and the other variables all have very small probabilities of inclusion.

This is fine so far as it goes, but is not entirely satisfactory. One problem is that the choice of a “fixed effects” prior for the regression coefficients of the included variables is likely to lead to a Lindley’s paradox type situation, and a consequent under-selection of variables. It is arguably better to model the distribution of included variables using a “random effects” approach, leading to a more appropriate distribution for the included variables.

#### Variable selection with random effects

Adopting a random effects distribution for the included coefficients that is normal with mean zero and unknown variance helps to combat Lindley’s paradox, and can be implemented as follows.

data=list(y=y,X=X,n=n,p=p)
init=list(tau=1,taub=1,alpha=0,betaT=rep(0,p),ind=rep(0,p))
modelstring="
model {
for (i in 1:n) {
mean[i]<-alpha+inprod(X[i,],beta)
y[i]~dnorm(mean[i],tau)
}
for (j in 1:p) {
ind[j]~dbern(0.2)
betaT[j]~dnorm(0,taub)
beta[j]<-ind[j]*betaT[j]
}
alpha~dnorm(0,0.0001)
tau~dgamma(1,0.001)
taub~dgamma(1,0.001)
}
"
model=jags.model(textConnection(modelstring),
data=data,inits=init)
update(model,n.iter=1000)
output=coda.samples(model=model,
variable.names=c("alpha","beta","ind","tau","taub"),
n.iter=10000,thin=1)
print(summary(output))
plot(output)


This leads to a large inclusion probability for the 4th variable, and non-negligible inclusion probabilities for the next few (it is obviously somewhat dependent on the simulated data set). This random effects variable selection modelling approach generally performs better, but it still has the potentially undesirable feature of hard-coding the probability of variable inclusion. Under the prior model, the number of variables included is binomial, and the binomial distribution is rather concentrated about its mean. Where there is a general desire to control the degree of sparsity in the model, this is a good thing, but if there is considerable uncertainty about the degree of sparsity that is anticipated, then a more flexible model may be desirable.

#### Variable selection with random effects and a prior on the inclusion probability

The previous model can be modified by introducing a Beta prior for the model inclusion probability. This induces a distribution for the number of included variables which has longer tails than the binomial distribution, allowing the model to learn about the degree of sparsity.

data=list(y=y,X=X,n=n,p=p)
init=list(tau=1,taub=1,pind=0.5,alpha=0,betaT=rep(0,p),ind=rep(0,p))
modelstring="
model {
for (i in 1:n) {
mean[i]<-alpha+inprod(X[i,],beta)
y[i]~dnorm(mean[i],tau)
}
for (j in 1:p) {
ind[j]~dbern(pind)
betaT[j]~dnorm(0,taub)
beta[j]<-ind[j]*betaT[j]
}
alpha~dnorm(0,0.0001)
tau~dgamma(1,0.001)
taub~dgamma(1,0.001)
pind~dbeta(2,8)
}
"
model=jags.model(textConnection(modelstring),
data=data,inits=init)
update(model,n.iter=1000)
output=coda.samples(model=model,
variable.names=c("alpha","beta","ind","tau","taub","pind"),
n.iter=10000,thin=1)
print(summary(output))
plot(output)


It turns out that for this particular problem the posterior distribution is not very different to the previous case, as for this problem the hard-coded choice of 20% is quite consistent with the data. However, the variable inclusion probabilities can be rather sensitive to the choice of hard-coded proportion.

#### Conclusion

Bayesian variable selection (and model selection more generally) is a very delicate topic, and there is much more to say about it. In this post I’ve concentrated on the practicalities of introducing variable selection into JAGS models. For further reading, I highly recommend the review of O’Hara and Sillanpaa (2009), which discusses other computational algorithms for variable selection. I intend to discuss some of the other methods in future posts.

#### References

O’Hara, R. and Sillanpaa, M. (2009) A review of Bayesian variable selection methods: what, how and which. Bayesian Analysis, 4(1):85-118. [DOI, PDF, Supp, BUGS Code]
Kuo, L. and Mallick, B. (1998) Variable selection for regression models. Sankhya B, 60(1):65-81.

## Keeping R up to date on Ubuntu linux

R is included as part of the standard Ubuntu distribution, and can be installed with a command like

sudo apt-get install r-base


Obviously the software included as part of the standard distribution usually lags a little behind the latest version, and this is usually quite acceptable for most users most of the time. However, R is evolving quite quickly at the moment, and for various reasons I have decided to skip Ubuntu 12.10 (quantal) and stick with Ubuntu 12.4 (precise) for the time being. Since R 2.14 is included with Ubuntu 12.4, and I’d rather use R 2.15, I’d like to run with the latest R builds on my Ubuntu system.

Fortunately this is very easy, as there is a maintained repository for Ubuntu builds of R on CRAN. Full instructions are provided on CRAN, but here is the quick summary. First you need to know your nearest CRAN mirror – there is a list of mirrors on CRAN. I generally use the Bristol mirror, and so I will use it in the following.

sudo su
echo "deb http://www.stats.bris.ac.uk/R/bin/linux/ubuntu precise/" >> /etc/apt/sources.list
apt-key adv --keyserver keyserver.ubuntu.com --recv-keys E084DAB9
apt-get update


That’s it. You are updated to the latest version of R, and your system will check for updates in the usual way. There are just two things you may need to edit in line 2 above. The first is the address of the CRAN mirror (here “www.stats.bris.ac.uk”). The second is the name of the Ubuntu distro you are running (here “precise”).

## Inlining JAGS models in R scripts for rjags

JAGS (Just Another Gibbs Sampler) is a general purpose MCMC engine similar to WinBUGS and OpenBUGS. I have a slight preference for JAGS as it is free and portable, works well on Linux, and interfaces well with R. It is tempting to write a tutorial introduction to JAGS and the corresponding R package, rjags, but there is a lot of material freely available on-line already, so it isn’t really necessary. If you are new to JAGS, I suggest starting with Getting Started with JAGS, rjags, and Bayesian Modelling. In this post I want to focus specifically on the problem of inlining JAGS models in R scripts as it can be very useful, and is usually skipped in introductory material.

#### JAGS and rjags on Ubuntu Linux

On recent versions of Ubuntu, assuming that R is already installed, the simplest way to install JAGS and rjags is using the command

sudo apt-get install jags r-cran-rjags


Now rjags is a CRAN package, so it can be installed in the usual way with install.packages("rjags"). However, taking JAGS and rjags direct from the Ubuntu repos should help to ensure that the versions of JAGS and rjags are in sync, which is a good thing.

#### Toy model

For this post, I will use a trivial toy example of inference for the mean and precision of a normal random sample. That is, we will assume data

$X_i \sim N(\mu,1/\tau),\quad i=1,2,\ldots n,$

with priors on $\mu$ and $\tau$ of the form

$\tau\sim Ga(a,b),\quad \mu \sim N(c,1/d).$

#### Separate model file

The usual way to fit this model in R using rjags is to first create a separate file containing the model

  model {
for (i in 1:n) {
x[i]~dnorm(mu,tau)
}
mu~dnorm(cc,d)
tau~dgamma(a,b)
}


Then, supposing that this file is called jags1.jags, an R session to fit the model could be constructed as follows:

require(rjags)
x=rnorm(15,25,2)
data=list(x=x,n=length(x))
hyper=list(a=3,b=11,cc=10,d=1/100)
init=list(mu=0,tau=1)
model=jags.model("jags1.jags",data=append(data,hyper), inits=init)
update(model,n.iter=100)
output=coda.samples(model=model,variable.names=c("mu", "tau"), n.iter=10000, thin=1)
print(summary(output))
plot(output)


This is all fine, and it can be very useful to have the model declared in a separate file, especially if the model is large and complex, and you might want to use it from outside R. However, very often for simple models it can be quite inconvenient to have the model separate from the R script which runs it. In particular, people often have issues with naming files correctly, making sure R is looking in the correct directory, moving the model with the R script, etc. So it would be nice to be able to just inline the JAGS model within an R script, to keep the model, the data, and the analysis all together in one place.

#### Using a temporary file

What we want to do is declare the JAGS model within a text string inside an R script and then somehow pass this into the call to jags.model(). The obvious way to do this is to write the string to a text file, and then pass the name of that text file into jags.model(). This works fine, but some care needs to be taken to make sure this works in a generic platform independent way. For example, you need to write to a file that you know doesn’t exist in a directory that is writable using a filename that is valid on the OS on which the script is being run. For this purpose R has an excellent little function called tempfile() which solves exactly this naming problem. It should always return the name of a file which does not exist in a writable directly within the standard temporary file location on the OS on which R is being run. This function is exceedingly useful for all kinds of things, but doesn’t seem to be very well known by newcomers to R. Using this we can construct a stand-alone R script to fit the model as follows:

require(rjags)
x=rnorm(15,25,2)
data=list(x=x,n=length(x))
hyper=list(a=3,b=11,cc=10,d=1/100)
init=list(mu=0,tau=1)
modelstring="
model {
for (i in 1:n) {
x[i]~dnorm(mu,tau)
}
mu~dnorm(cc,d)
tau~dgamma(a,b)
}
"
tmpf=tempfile()
tmps=file(tmpf,"w")
cat(modelstring,file=tmps)
close(tmps)
model=jags.model(tmpf,data=append(data,hyper), inits=init)
update(model,n.iter=100)
output=coda.samples(model=model,variable.names=c("mu", "tau"), n.iter=10000, thin=1)
print(summary(output))
plot(output)


Now, although there is a file containing the model temporarily involved, the script is stand-alone and portable.

#### Using a text connection

The solution above works fine, but still involves writing a file to disk and reading it back in again, which is a bit pointless in this case. We can solve this by using another under-appreciated R function, textConnection(). Many R functions which take a file as an argument will work fine if instead passed a textConnection object, and the rjags function jags.model() is no exception. Here, instead of writing the model string to disk, we can turn it into a textConnection object and then pass that directly into jags.model() without ever actually writing the model file to disk. This is faster, neater and cleaner. An R session which takes this approach is given below.

require(rjags)
x=rnorm(15,25,2)
data=list(x=x,n=length(x))
hyper=list(a=3,b=11,cc=10,d=1/100)
init=list(mu=0,tau=1)
modelstring="
model {
for (i in 1:n) {
x[i]~dnorm(mu,tau)
}
mu~dnorm(cc,d)
tau~dgamma(a,b)
}
"
model=jags.model(textConnection(modelstring), data=append(data,hyper), inits=init)
update(model,n.iter=100)
output=coda.samples(model=model,variable.names=c("mu", "tau"), n.iter=10000, thin=1)
print(summary(output))
plot(output)


This is my preferred way to use rjags. Note again that textConnection objects have many and varied uses and applications that have nothing to do with rjags.

## Metropolis Hastings MCMC when the proposal and target have differing support

### Introduction

Very often it is desirable to use Metropolis Hastings MCMC for a target distribution which does not have full support (for example, it may correspond to a non-negative random variable), using a proposal distribution which does (for example, a Gaussian random walk proposal). This isn’t a problem at all, but on more than one occasion now I have come across students getting this wrong, so I thought it might be useful to have a brief post on how to do it right, see what people sometimes get wrong, and why, and then think about correcting the wrong method in order to make it right…

### A simple example

For this post we will consider a simple $Ga(2,1)$ target distribution, with density

$\pi(x) = xe^{-x},\quad x\geq 0.$

Of course this is a very simple distribution, and there are many straightforward ways to simulate it directly, but for this post we will use a random walk Metropolis-Hastings (MH) scheme with standard Gaussian innovations. So, if the current state of the chain is $x$, a proposed new value $x^\star$ will be generated from

$f(x^\star|x) = \phi(x^\star-x),$

where $\phi(\cdot)$ is the standard normal density. This proposed new value is accepted with probability $\min\{1,A\}$, where

$\displaystyle A = \frac{\pi(x^\star)}{\pi(x)} \frac{f(x|x^\star)}{f(x^\star|x)} = \frac{\pi(x^\star)}{\pi(x)} \frac{\phi(x-x^\star)}{\phi(x^\star-x)} = \frac{\pi(x^\star)}{\pi(x)} ,$

since the standard normal density is symmetric.

#### Correct implementation

We can easily implement this using R as follows:

met1=function(iters)
{
xvec=numeric(iters)
x=1
for (i in 1:iters) {
xs=x+rnorm(1)
A=dgamma(xs,2,1)/dgamma(x,2,1)
if (runif(1)<A)
x=xs
xvec[i]=x
}
return(xvec)
}


We can run it, plot the results and check it against the true target with the following commands.

iters=1000000
out=met1(iters)
hist(out,100,freq=FALSE,main="met1")


If you have a slow computer, you may prefer to use iters=100000. The above code uses R’s built-in gamma density. Alternatively, we can hard-code the density as follows.

met2=function(iters)
{
xvec=numeric(iters)
x=1
for (i in 1:iters) {
xs=x+rnorm(1)
A=xs*exp(-xs)/(x*exp(-x))
if (runif(1)<A)
x=xs
xvec[i]=x
}
return(xvec)
}


We can run this code using the following commands, to verify that it does work as expected.

out=met2(iters)
hist(out,100,freq=FALSE,main="met2")


However, there is a potential problem with the above code that we have got away with in this instance, which often catches people out. We have hard-coded the density for $x>0$ without checking the sign of $x$. Here we get away with it as a negative proposal will lead to a negative acceptance ratio that we will reject straight away. This is not always the case (consider, for example, a $Ga(3,1)$ distribution). So really we should check the sign of $x^\star$ and reject immediately if is not within the support of the target.

Although this problem often catches people out, it tends not to be a big issue in practice, as it typically leads to an obviously incorrect sampler, or a sampler which crashes, and is relatively simple to debug and fix.

#### An incorrect sampler

The problem I want to focus on here is more subtle, but closely related. It is clear that any $x^\star<0$ should be rejected. With the above code, such values are indeed rejected, and the sampler advances to the next iteration. However, in more complex samplers, where an update like this might be one tiny part of a massive sampler with a very high-dimensional state space, it seems like a bit of a "waste" of a MH move to just propose a negative value, throw it away, and move on. Evidently, it seems tempting, therefore, to keep on sampling $x^\star$ values until a non-negative value is obtained, and then evaluate the acceptance ratio and decide whether or not to accept. We could code up this sampler as follows.

met3=function(iters)
{
xvec=numeric(iters)
x=1
for (i in 1:iters) {
repeat {
xs=x+rnorm(1)
if (xs>0)
break
}
A=xs*exp(-xs)/(x*exp(-x))
if (runif(1)<A)
x=xs
xvec[i]=x
}
return(xvec)
}


As reasonable as this idea may at first seem, it does not lead to a sampler having the desired target, as can be verified using the following commands.

out=met3(iters)
hist(out,100,freq=FALSE,main="met3")


So, this sampler seems to be sampling something close to the desired target, but not the same. This raises a couple of questions. First and most important, can we fix this sampler so that it does sample the correct target (yes), and second, can we figure out what target density the incorrect sampler is actually sampling (again, yes)? Let’s start with the issue of how to fix the sampler, as this will also help us to understand what the incorrect sampler is doing.

#### Fixing the truncated sampler

By repeatedly sampling from the proposal until we obtain a non-negative value, we are actually implementing a rejection sampler for sampling from the proposal distribution truncated at zero. This is a perfectly reasonable proposal distribution, so we can use it provided that we use the correct MH acceptance ratio. Now, the truncated density has the same density as the untruncated density, apart from the differing support and a normalising constant. Indeed, this may be why people often assume this method will work, because normalising constants often don’t matter in MH schemes. However, the normalising constant only doesn’t matter if it is independent of the state, and here it is not… Explicitly, we have

$f(x^\star|x) \propto \phi(x^\star-x),\quad x^\star>0.$

Including the normalising constant we have

$\displaystyle f(x^\star|x) = \frac{\phi(x^\star-x)}{\Phi(x)},\quad x^\star>0,$

where $\Phi(x)$ is the standard normal CDF. Consequently, the correct acceptance ratio to use with this proposal is

$\displaystyle A = \frac{\pi(x^\star)}{\pi(x)} \frac{\phi(x-x^\star)}{\phi(x^\star-x)}\frac{\Phi(x)}{\Phi(x^\star)} = \frac{\pi(x^\star)}{\pi(x)}\frac{\Phi(x)}{\Phi(x^\star)},$

where we see that the normalising constants do not cancel out. We can modify the previous sampler to use the correct acceptance ratio as follows.

met4=function(iters)
{
xvec=numeric(iters)
x=1
for (i in 1:iters) {
repeat {
xs=x+rnorm(1)
if (xs>0)
break
}
A=xs*exp(-xs)/(x*exp(-x))
A=A*pnorm(x)/pnorm(xs)
if (runif(1)<A)
x=xs
xvec[i]=x
}
return(xvec)
}


We can verify that this sampler gives leads to the correct target with the following commands.

out=met4(iters)
hist(out,100,freq=FALSE,main="met4")


So, truncating the proposal at zero is fine, provided that you modify the acceptance ratio accordingly.

#### What does the incorrect sampler target?

Now that we understand why the naive truncated sampler was wrong and how to fix it, we can, out of curiosity, wonder what distribution that sampler actually targets. Now we understand what proposal we are actually using, we can re-write the acceptance ratio as

$\displaystyle A = \frac{\pi(x^\star)\Phi(x^\star)}{\pi(x)\Phi(x)}\frac{\frac{\phi(x-x^\star)}{\Phi(x^\star)}}{\frac{\phi(x^\star-x)}{\Phi(x)}},$

from which it is clear that the actual target of this chain is

$\tilde\pi(x) \propto \pi(x)\Phi(x),$

or

$\tilde\pi(x)\propto xe^{-x}\Phi(x),\quad x\geq 0.$

The constant of proportionality is not immediately obvious, but is tractable, and turns out to be a nice undergraduate exercise in integration by parts, leading to

$\displaystyle \tilde\pi(x) = \frac{2\sqrt{2\pi}}{2+\sqrt{2\pi}}xe^{-x}\Phi(x),\quad x\geq 0.$

We can verify this using the following commands.

out=met3(iters)
hist(out,100,freq=FALSE,main="met3")


Now we know the actual target of the incorrect sampler, we can compare it with the correct target as follows.

curve(dgamma(x,2,1),0,10,col=2,lwd=2,main="Densities")


So we see that the distributions are different, but not so different that one would immediate suspect an error on the basis of a sample of output. This makes it a difficult bug to track down.

### Summary

There is no problem in principle using a proposal with full support for a target with limited support in MH algorithms. However, it is important to check whether a proposed value is within the support of the target and reject the proposed move if it is not. If you are concerned that such a scheme might be inefficient, it is possible to use a truncated proposal provided that you modify the MH acceptance ratio to include the relevant normalisation constants. If you don’t modify the acceptance probability, you will get a sampler which targets the wrong distribution, but it will often be quite similar to the correct target, making it a difficult bug to spot and track down.

## Multivariate data analysis (using R)

I’ve been very quiet on-line in the last few months, due mainly to the fact that I’ve been writing a new undergraduate course on multivariate data analysis. Although there are many books and on-line notes on the general topic of multivariate statistics, I wanted to do something a little bit different from any text I have yet discovered. First, I wanted to have a strong emphasis on using techniques in practice on example data sets of reasonable size. For this, I found Hastie et al (2009) to be very useful, as it covered some interesting example data sets which have been bundled in the CRAN R package, ElemStatLearn. I used several of the data sets from this package as running examples throughout the course. In fact my initial plan was to use Hastie et al as the main course text, but it turned out that this text was in some places overly technical and in many places far too terse to be good as an undergraduate text. I would still recommend the book for researchers who want a good overview of the interface between statistics and machine learning, but with hindsight I’m not convinced it is ideal for typical statistics undergraduate students.

I also wanted to have a strong emphasis on numerical linear algebra as the basis for multivariate statistical computation. Again, this is a bit different from “old school” multivariate statistics (which reminds me, John Marden has produced a great text available freely on-line on old school multivariate analysis, which isn’t quite as “old school” as the title might suggest). I wanted to spend some time talking about linear systems and matrix factorisations, explaining, for example how the LU decomposition, the Cholesky factorisation and the QR factorisations are related, and why the latter two are both fundamental to multivariate data analysis, and how the singular value decomposition (SVD) is related to the spectral decomposition, and why it is generally better to construct principal components from the SVD of the centred data matrix than the eigen-decomposition of the sample variance matrix, etc. These sorts of topics are not often covered in undergraduate statistics courses, but they are crucial to understanding how to analyse large multivariate data sets in a numerically stable way.

I also wanted to downplay distribution theory as much as possible, as multivariate distribution theory is quite difficult, and not necessary for understanding most of the essential concepts in multivariate data analysis. Also, it is not obviously very useful. Essentially all introductory courses are based around the multivariate normal distribution, but I have yet to see a real non-trivial multivariate data set for which an assumption of multivariate normality is appropriate. Consequently I delayed the introduction of the multivariate normal until well into the course, and didn’t bother with the Wishart distribution, or testing for multivariate normality. Like much frequentist testing, it is really just a matter of seeing if you have yet collected a large enough sample to reject the null hypothesis – I just don’t see the point (null)!

Finally, I wanted to use R to illustrate all of the methods in practice as they were introduced. We use R throughout our undergraduate statistics programme, and I think it is a good language for learning about statistical methods, algorithms and concepts. In most cases I begin by showing how to carry out analyses using “elementary” operations (such as matrix manipulations), and then go on to show how to accomplish the same task more simply using higher-level R functions and packages. Again, I think it really helps understanding to first see the mathematical description directly translated into computer code before jumping to high-level data analysis functions.

There are several aspects of the course that I would like to distil out into self-contained blog posts, but I have a busy summer schedule, and a couple of other things I want to write about before I’ll have a chance to get around to it, so in the mean time, anyone interested is welcome to download a copy of the course notes (PDF, with hyperlinks). This is the student version, containing gaps, but the gaps mainly correspond to bits of standard theory and examples designed to be worked through by hand. All of the essential theory and context and all of the R examples are present in this version of the notes. There are seven chapters: Introduction to multivariate data; PCA and matrix factorisations; Inference, the MVN and multivariate regression; Cluster analysis and unsupervised learning; Discrimination and classification; Graphical modelling; Variable selection and multiple testing.