## Posts Tagged ‘MCMC’

### Tuning particle MCMC algorithms

08/06/2014

Several papers have appeared recently discussing the issue of how to tune the number of particles used in the particle filter within a particle MCMC algorithm such as particle marginal Metropolis Hastings (PMMH). Three such papers are:

I have discussed psuedo marginal MCMC and particle MCMC algorithms in previous posts. It will be useful to refer back to these posts if these topics are unfamiliar. Within particle MCMC algorithms (and psuedo-marginal MCMC algorithms, more generally), an unbiased estimate of marginal likelihood is constructed using a number of particles. The more particles that are used, the better the estimate of marginal likelihood is, and the resulting MCMC algorithm will behave more like a “real” marginal MCMC algorithm. For a small number of particles, the algorithm will still have exactly the correct target, but the noise in the unbiased estimator of marginal likelihood will lead to poor mixing of the MCMC chain. The idea is to use just enough particles to ensure that there isn’t “too much” noise in the unbiased estimator, but not to waste lots of time producing a super-accurate estimate of marginal likelihood if that isn’t necessary to ensure good mixing of the MCMC chain.

The papers above try to give theoretical justifications for certain “rules of thumb” that are commonly used in practice. One widely adopted scheme is to tune the number of particles so that the variance of the log of the estimate of marginal liklihood is around one. The obvious questions are “where?” and “why?”, and these questions turn out to be connected. As we will see, there isn’t really a good answer to the “where?” question, but what people usually do is use a pilot run to get an estimate of the posterior mean, or mode, or MLE, and then pick one and tune the noise variance at that particular parameter value. As to “why?”, well, the papers above make various (slightly different) assumptions, all of which lead to trading off mixing against computation time to obtain an “optimal” number of particles. They don’t all agree that the variance of the noise should be exactly 1, but they all agree to an order of magnitude.

All of the above papers make the assumption that the noise distribution associated with the marginal likelihood estimate is independent of the parameter at which it is being evaluated, which explains why there isn’t a really good answer to the “where?” question – under the assumption it doesn’t matter what parameter value is used for tuning – they are all the same! Easy. Except that’s quite a big assumption, so it would be nice to know that it is reasonable, and unfortunately it isn’t. Let’s look at an example to see what goes wrong.

#### Example

In Chapter 10 of my book I look in detail at constructing a PMMH algorithm for inferring the parameters of a discretely observed stochastic Lotka-Volterra model. I’ve stepped through the computational details in a previous post which you should refer back to for the necessary background. Following that post, we can construct a particle filter to return an unbiased estimate of marginal likelihood using the following R code (which relies on the smfsb CRAN package):

require(smfsb)
# data
data(LVdata)
data=as.timedData(LVnoise10)
noiseSD=10
# measurement error model
dataLik <- function(x,t,y,log=TRUE,...)
{
ll=sum(dnorm(y,x,noiseSD,log=TRUE))
if (log)
return(ll)
else
return(exp(ll))
}
# now define a sampler for the prior on the initial state
simx0 <- function(N,t0,...)
{
mat=cbind(rpois(N,50),rpois(N,100))
colnames(mat)=c("x1","x2")
mat
}
# construct particle filter
mLLik=pfMLLik(150,simx0,0,stepLVc,dataLik,data)


Again, see the relevant previous post for details. So now mLLik() is a function that will return the log of an unbiased estimate of marginal likelihood (based on 150 particles) given a parameter value at which to evaluate.

What we are currently wondering is whether the noise in the estimate is independent of the parameter at which it is evaluated. We can investigate this for this filter easily by looking at how the estimate varies as the first parameter (prey birth rate) varies. The following code computes a log likelihood estimate across a range of values and plots the result.

mLLik1=function(x){mLLik(th=c(th1=x,th2=0.005,th3=0.6))}
x=seq(0.7,1.3,length.out=5001)
y=sapply(x,mLLik1)
plot(x[y>-1e10],y[y>-1e10])


The resulting plot is as follows (click for full size):

So, looking at the plot, it is very clear that the noise variance certainly isn’t constant as the parameter varies – it varies substantially. Furthermore, the way in which it varies is “dangerous”, in that the noise is smallest in the vicinity of the MLE. So, if a parameter close to the MLE is chosen for tuning the number of particles, this will ensure that the noise is small close to the MLE, but not elsewhere in parameter space. This could have bad consequences for the mixing of the MCMC algorithm as it explores the tails of the posterior distribution.

So with the above in mind, how should one tune the number of particles in a pMCMC algorithm? I can’t give a general answer, but I can explain what I do. We can’t rely on theory, so a pragmatic approach is required. The above rule of thumb usually gives a good starting point for exploration. Then I just directly optimise ESS per CPU second of the pMCMC algorithm from pilot runs for varying numbers of particles (and other tuning parameters in the algorithm). ESS is “expected sample size”, which can be estimated using the effectiveSize() function in the coda CRAN package. Ugly and brutish, but it works…

23/02/2014

### Introduction

In previous posts I have discussed general issues regarding parallel MCMC and examined in detail parallel Monte Carlo on a multicore laptop. In those posts I used the C programming language in conjunction with the MPI parallel library in order to illustrate the concepts. In this post I want to take the example from the second post and re-examine it using the Scala programming language.

The toy problem considered in the parallel Monte Carlo post used $10^9$ $U(0,1)$ random quantities to construct a Monte Carlo estimate of the integral

$\displaystyle I=\int_0^1\exp\{-u^2\}du$.

A very simple serial program to implement this algorithm is given below:

import java.util.concurrent.ThreadLocalRandom
import scala.math.exp
import scala.annotation.tailrec

object MonteCarlo {

@tailrec
def sum(its: Long,acc: Double): Double = {
if (its==0)
(acc)
else {
sum(its-1,acc+exp(-u*u))
}
}

def main(args: Array[String]) = {
println("Hello")
val iters=1000000000
val result=sum(iters,0.0)
println(result/iters)
println("Goodbye")
}

}


Note that ThreadLocalRandom is a parallel random number generator introduced into recent versions of the Java programming language, which can be easily utilised from Scala code. Assuming that Scala is installed, this can be compiled and run with commands like

scalac monte-carlo.scala
time scala MonteCarlo


This program works, and the timings (in seconds) for three runs are 57.79, 57.77 and 57.55 on the same laptop considered in the previous post. The first thing to note is that this Scala code is actually slightly faster than the corresponding C+MPI code in the single processor special case! Now that we have a good working implementation we should think how to parallelise it…

### Parallel implementation

Before constructing a parallel implementation, we will first construct a slightly re-factored serial version that will be easier to parallelise. The simplest way to introduce parallelisation into Scala code is to parallelise a map over a collection. We therefore need a collection and a map to apply to it. Here we will just divide our $10^9$ iterations into $N=4$ separate computations, and use a map to compute the required Monte Carlo sums.

import java.util.concurrent.ThreadLocalRandom
import scala.math.exp
import scala.annotation.tailrec

object MonteCarlo {

@tailrec
def sum(its: Long,acc: Double): Double = {
if (its==0)
(acc)
else {
sum(its-1,acc+exp(-u*u))
}
}

def main(args: Array[String]) = {
println("Hello")
val N=4
val iters=1000000000
val its=iters/N
val sums=(1 to N).toList map {x => sum(its,0.0)}
val result=sums.reduce(_+_)
println(result/iters)
println("Goodbye")
}

}


Running this new code confirms that it works and gives similar estimates for the Monte Carlo integral as the previous version. The timings for 3 runs on my laptop were 57.57, 57.67 and 57.80, similar to the previous version of the code. So far so good. But how do we make it parallel? Like this:

import java.util.concurrent.ThreadLocalRandom
import scala.math.exp
import scala.annotation.tailrec

object MonteCarlo {

@tailrec
def sum(its: Long,acc: Double): Double = {
if (its==0)
(acc)
else {
sum(its-1,acc+exp(-u*u))
}
}

def main(args: Array[String]) = {
println("Hello")
val N=4
val iters=1000000000
val its=iters/N
val sums=(1 to N).toList.par map {x => sum(its,0.0)}
val result=sums.reduce(_+_)
println(result/iters)
println("Goodbye")
}

}


That’s it! It’s now parallel. Studying the above code reveals that the only difference from the previous version is the introduction of the 4 characters .par in line 22 of the code. R programmers will find this very much analagous to using lapply() versus mclapply() in R code. The function par converts the collection (here an immutable List) to a parallel collection (here an immutable parallel List), and then subsequent maps, filters, etc., can be computed in parallel on appropriate multicore architectures. Timings for 3 runs on my laptop were 20.74, 20.82 and 20.88. Note that these timings are faster than the timings for N=4 processors for the corresponding C+MPI code…

### Varying the size of the parallel collection

We can trivially modify the previous code to make the size of the parallel collection, N, a command line argument:

import java.util.concurrent.ThreadLocalRandom
import scala.math.exp
import scala.annotation.tailrec

object MonteCarlo {

@tailrec
def sum(its: Long,acc: Double): Double = {
if (its==0)
(acc)
else {
sum(its-1,acc+exp(-u*u))
}
}

def main(args: Array[String]) = {
println("Hello")
val N=args(0).toInt
val iters=1000000000
val its=iters/N
val sums=(1 to N).toList.par map {x => sum(its,0.0)}
val result=sums.reduce(_+_)
println(result/iters)
println("Goodbye")
}

}


We can now run this code with varying sizes of N in order to see how the runtime of the code changes as the size of the parallel collection increases. Timings on my laptop are summarised in the table below.

 N     T1     T2     T3
1   57.67  57.62  57.83
2   32.20  33.24  32.76
3   26.63  26.60  26.63
4   20.99  20.92  20.75
5   20.13  18.70  18.76
6   16.57  16.52  16.59
7   15.72  14.92  15.27
8   13.56  13.51  13.32
9   18.30  18.13  18.12
10   17.25  17.33  17.22
11   17.04  16.99  17.09
12   15.95  15.85  15.91

16   16.62  16.68  16.74
32   15.41  15.54  15.42
64   15.03  15.03  15.28


So we see that the timings decrease steadily until the size of the parallel collection hits 8 (the number of processors my hyper-threaded quad-core presents via Linux), and then increases very slightly, but not much as the size of the collection increases. This is better than the case of C+MPI where performance degrades noticeably if too many processes are requested. Here, the Scala compiler and JVM runtime manage an appropriate number of threads for the collection irrespective of the actual size of the collection. Also note that all of the timings are faster than the corresponding C+MPI code discussed in the previous post.

However, the notion that the size of the collection is irrelevant is only true up to a point. Probably the most natural way to code this algorithm would be as:

import java.util.concurrent.ThreadLocalRandom
import scala.math.exp

object MonteCarlo {

def main(args: Array[String]) = {
println("Hello")
val iters=1000000000
val sums=(1 to iters).toList map {x => ThreadLocalRandom.current().nextDouble()} map {x => exp(-x*x)}
val result=sums.reduce(_+_)
println(result/iters)
println("Goodbye")
}

}


or as the parallel equivalent

import java.util.concurrent.ThreadLocalRandom
import scala.math.exp

object MonteCarlo {

def main(args: Array[String]) = {
println("Hello")
val iters=1000000000
val sums=(1 to iters).toList.par map {x => ThreadLocalRandom.current().nextDouble()} map {x => exp(-x*x)}
val result=sums.reduce(_+_)
println(result/iters)
println("Goodbye")
}

}


Although these algorithms are in many ways cleaner and more natural, they will bomb out with a lack of heap space unless you have a huge amount of RAM, as they rely on having all $10^9$ realisations in RAM simultaneously. The lesson here is that even though functional languages make it very easy to write clean, efficient parallel code, we must still be careful not to fill up the heap with gigantic (immutable) data structures…

25/01/2014

### Introduction

Particle MCMC (the use of approximate SMC proposals within exact MCMC algorithms) is arguably one of the most important developments in computational Bayesian inference of the 21st Century. The key concepts underlying these methods are described in a famously impenetrable “read paper” by Andrieu et al (2010). Probably the most generally useful method outlined in that paper is the particle marginal Metropolis-Hastings (PMMH) algorithm that I have described previously – that post is required preparatory reading for this one.

In this post I want to discuss some of the other topics covered in the pMCMC paper, leading up to a description of the particle Gibbs sampler. The basic particle Gibbs algorithm is arguably less powerful than PMMH for a few reasons, some of which I will elaborate on. But there is still a lot of active research concerning particle Gibbs-type algorithms, which are attempting to address some of the deficiencies of the basic approach. Clearly, in order to understand and appreciate the recent developments it is first necessary to understand the basic principles, and so that is what I will concentrate on here. I’ll then finish with some pointers to more recent work in this area.

### PIMH

I will adopt the same approach and notation as for my post on the PMMH algorithm, using a simple bootstrap particle filter for a state space model as the SMC proposal. It is simplest to understand particle Gibbs first in the context of known static parameters, and so it is helpful to first reconsider the special case of the PMMH algorithm where there are no unknown parameters and only the state path, $x$ of the process is being updated. That is, we target $p(x|y)$ (for known, fixed, $\theta$) rather than $p(\theta,x|y)$. This special case is known as the particle independent Metropolis-Hastings (PIMH) sampler.

Here we envisage proposing a new path $x_{0:T}^\star$ using a bootstrap filter, and then accepting the proposal with probability $\min\{1,A\}$, where $A$ is the Metropolis-Hastings ratio

$\displaystyle A = \frac{\hat{p}(y_{1:T})^\star}{\hat{p}(y_{1:T})},$

where $\hat{p}(y_{1:T})^\star$ is the bootstrap filter’s estimate of marginal likelihood for the new path, and $\hat{p}(y_{1:T})$ is the estimate associated with the current path. Again using notation from the previous post it is clear that this ratio targets a distribution on the joint space of all simulated random variables proportional to

$\displaystyle \hat{p}(y_{1:T})\tilde{q}(\mathbf{x}_0,\ldots,\mathbf{x}_T,\mathbf{a}_0,\ldots,\mathbf{a}_{T-1})$

and that in this case the marginal distribution of the accepted path is exactly $p(x_{0:T}|y_{1:T})$. Again, be sure to see the previous post for the explanation.

### Conditional SMC update

So far we have just recapped the previous post in the case of known parameters, but it gives us insight in how to proceed. A general issue with Metropolis independence samplers in high dimensions is that they often exhibit “sticky” behaviour, whereby an unusually “good” accepted path is hard to displace. This motivates consideration of a block-Gibbs-style algorithm where updates are used that are always accepted. It is clear that simply running a bootstrap filter will target the particle filter distribution

$\tilde{q}(\mathbf{x}_0,\ldots,\mathbf{x}_T,\mathbf{a}_0,\ldots,\mathbf{a}_{T-1})$

and so the marginal distribution of the accepted path will be the approximate $\hat{p}(x_{0:T}|y_{1:T})$ rather than the exact conditional distribution $p(x_{0:T}|y_{1:T})$. However, we know from consideration of the PIMH algorithm that what we really want to do is target the slightly modified distribution proportional to

$\displaystyle \hat{p}(y_{1:T})\tilde{q}(\mathbf{x}_0,\ldots,\mathbf{x}_T,\mathbf{a}_0,\ldots,\mathbf{a}_{T-1})$,

as this will lead to accepted paths with the exact marginal distribution. For the PIMH this modification is achieved using a Metropolis-Hastings correction, but we now try to avoid this by instead conditioning on the previously accepted path. For this target the accepted paths have exactly the required marginal distribution, so we now write the target as the product of the marginal for the current path times a conditional for all of the remaining variables.

$\displaystyle \frac{p(x_{0:T}^k|y_{1:T})}{M^T} \times \frac{M^T}{p(x_{0:T}^k|y_{1:T})} \hat{p}(y_{1:T})\tilde{q}(\mathbf{x}_0,\ldots,\mathbf{x}_T,\mathbf{a}_0,\ldots,\mathbf{a}_{T-1})$

where in addition to the correct marginal for $x$ we assume iid uniform ancestor indices. The important thing to note here is that the conditional distribution of the remaining variables simplifies to

$\displaystyle \frac{\tilde{q}(\mathbf{x}_0,\ldots,\mathbf{x}_T,\mathbf{a}_0,\ldots,\mathbf{a}_{T-1})} {\displaystyle p(x_0^{b_0^k})\left[\prod_{t=0}^{T-1} \pi_t^{b_t^k}p\left(x_{t+1}^{b_{t+1}^k}|x_t^{b_t^k}\right)\right]}$.

The terms in the denominator are precisely the terms in the numerator corresponding to the current path, and hence “cancel out” the current path terms in the numerator. It is therefore clear that we can sample directly from this conditional distribution by running a bootstrap particle filter that includes the current path and which leaves the current path fixed. This is the conditional SMC (CSMC) update, which here is just a conditional bootstrap particle filter update. It is clear from the form of the conditional density how this filter must be constructed, but for completeness it is described below.

The bootstrap filter is run conditional on one trajectory. This is usually the trajectory sampled at the last run of the particle filter. The idea is that you do not sample new state or ancestor values for that one trajectory. Note that this guarantees that the conditioned on trajectory survives the filter right through to the final sweep of the filter at which point a new trajectory is picked from the current selection of $M$ paths, of which the conditioned-on trajectory is one.

Let $x_{1:T} = (x_1^{b_1},x_2^{b_2},\ldots,x_T^{b_T})$ be the path that is to be conditioned on, with ancestral lineage $b_{1:T}$. Then, for $k\not= b_1$, sample $x_0^k \sim p(x_0)$ and set $\pi_0^k=1/M$. Now suppose that at time $t$ we have a weighted sample from $p(x_t|y_{1:t})$. First resample by sampling $a_t^k\sim \mathcal{F}(a_t^k|\boldsymbol{\pi}_t),\ \forall k\not= b_t$. Next sample $x_{t+1}^k\sim p(x_{t+1}^k|x_t^{a_t^k}),\ \forall k\not=b_t$. Then for all $k$ set $w_{t+1}^k=p(y_{t+1}|x_{t+1}^k)$ and normalise with $\pi_{t+1}^k=w_{t+1}^k/\sum_{i=1}^M w_{t+1}^i$. Propagate this weighted set of particles to the next time point. At time $T$ select a single trajectory by sampling $k'\sim \mathcal{F}(k'|\boldsymbol{\pi}_T)$.

This defines a block Gibbs sampler which updates $2(M-1)T+1$ of the $2MT+1$ random variables in the augmented state space at each iteration. Since the block of variables to be updated is random, this defines an ergodic sampler for $M\geq2$ particles, and we have explained why the marginal distribution of the selected trajectory is the exact conditional distribution.

Before going on to consider the introduction of unknown parameters, it is worth considering the limitations of this method. One of the main motivations for considering a Gibbs-style update was concern about the “stickiness” of a Metropolis independence sampler. However, it is clear that conditional SMC updates also have the potential to stick. For a large number of time points, particle filter genealogies coalesce, or degenerate, to a single path. Since here we are conditioning on the current path, if there is coalescence, it is guaranteed to be to the previous path. So although the conditional SMC updates are always accepted, it is likely that much of the new path will be identical to the previous path, which is just another kind of “sticking” of the sampler. This problem with conditional SMC and particle Gibbs more generally is well recognised, and quite a bit of recent research activity in this area is directed at alleviating this sticking problem. The most obvious strategy to use is “backward sampling” (Godsill et al, 2004), which has been used in this context by Lindsten and Schon (2012), Whiteley et al (2010), and Chopin and Singh (2013), among others. Another related idea is “ancestor sampling” (Lindsten et al, 2014), which can be done in a single forward pass. Both of these techniques work well, but both rely on the tractability of the transition kernel of the state space model, which can be problematic in certain applications.

### Particle Gibbs sampling

As we are working in the context of Gibbs-style updates, the introduction of static parameters, $\theta$, into the problem is relatively straightforward. It turns out to be correct to do the obvious thing, which is to alternate between sampling $\theta$ given $y$ and the currently sampled path, $x$, and sampling a new path using a conditional SMC update, conditional on the previous path in addition to $\theta$ and $y$. Although this is the obvious thing to do, understanding exactly why it works is a little delicate, due to the augmented state space and conditional SMC update. However, it is reasonably clear that this strategy defines a “collapsed Gibbs sampler” (Lui, 1994), and so actually everything is fine. This particular collapsed Gibbs sampler is relatively easy to understand as a marginal sampler which integrates out the augmented variables, but then nevertheless samples the augmented variables at each iteration conditional on everything else.

Note that the Gibbs update of $\theta$ may be problematic in the context of a state space model with intractable transition kernel.

In a subsequent post I’ll show how to code up the particle Gibbs and other pMCMC algorithms in a reasonably efficient way.

### A functional Gibbs sampler in Scala

04/10/2013

For many years I’ve had a passing interest in functional programming and languages which support functional programming approaches. I’m also quite interested in MOOCs and their future role in higher education. So I recently signed up for my first on-line course, Functional Programming Principles in Scala, via Coursera. I’m around half way through the course at the time of writing, and I’m enjoying it very much. I knew that I didn’t know much about Scala before starting the course, but during the course I’ve also learned that I didn’t know as much about functional programming as I thought I did, either! ;-) The course itself is very interesting, the assignments are well designed and appropriately challenging, and the web infrastructure to support the course is working well. I suspect I’ll try other on-line courses in the future.

Functional programming emphasises immutability, and discourages imperative programming approaches that use variables that can be modified during run-time. There are many advantages to immutability, especially in the context of parallel and concurrent programming, which is becoming increasingly important as multi-core systems become the norm. I’ve always found functional programming to be intellectually appealing, but have often worried about the practicalities of using functional programming in the context of scientific computing where many algorithms are iterative in nature, and are typically encoded using imperative approaches. The Scala programming language is appealing to me as it supports both imperative and functional styles of programming, as well as object oriented approaches. However, as a result of taking this course I am now determined to pursue functional approaches further, and get more of a feel for how practical they are for scientific computing applications.

For my first experiment, I’m going back to my post describing a Gibbs sampler in various languages. See that post for further details of the algorithm. In that post I did have an example implementation in Scala, which looked like this:

object GibbsSc {

import cern.jet.random.tdouble.engine.DoubleMersenneTwister
import cern.jet.random.tdouble.Normal
import cern.jet.random.tdouble.Gamma
import Math.sqrt
import java.util.Date

def main(args: Array[String]) {
val N=50000
val thin=1000
val rngEngine=new DoubleMersenneTwister(new Date)
val rngN=new Normal(0.0,1.0,rngEngine)
val rngG=new Gamma(1.0,1.0,rngEngine)
var x=0.0
var y=0.0
println("Iter x y")
for (i <- 0 until N) {
for (j <- 0 until thin) {
x=rngG.nextDouble(3.0,y*y+4)
y=rngN.nextDouble(1.0/(x+1),1.0/sqrt(2*x+2))
}
println(i+" "+x+" "+y)
}
}

}


At the time I wrote that post I knew even less about Scala than I do now, so I created the code by starting from the Java version and removing all of the annoying clutter! ;-) Clearly this code has an imperative style, utilising variables (declared with var) x and y having mutable state that is updated by a nested for loop. This algorithm is typical of the kind I use every day, so if I can’t re-write this in a more functional style, removing all mutable variables from my code, then I’m not going to get very far with functional programming!

In fact it is very easy to re-write this in a more functional style without utilising mutable variables. One possible approach is presented below.

object FunGibbs {

import cern.jet.random.tdouble.engine.DoubleMersenneTwister
import cern.jet.random.tdouble.Normal
import cern.jet.random.tdouble.Gamma
import java.util.Date
import scala.math.sqrt

val rngEngine=new DoubleMersenneTwister(new Date)
val rngN=new Normal(0.0,1.0,rngEngine)
val rngG=new Gamma(1.0,1.0,rngEngine)

class State(val x: Double,val y: Double)

def nextIter(s: State): State = {
val newX=rngG.nextDouble(3.0,(s.y)*(s.y)+4.0)
new State(newX,
rngN.nextDouble(1.0/(newX+1),1.0/sqrt(2*newX+2)))
}

def nextThinnedIter(s: State,left: Int): State = {
if (left==0) s
else nextThinnedIter(nextIter(s),left-1)
}

def genIters(s: State,current: Int,stop: Int,thin: Int): State = {
if (!(current>stop)) {
println(current+" "+s.x+" "+s.y)
genIters(nextThinnedIter(s,thin),current+1,stop,thin)
}
else s
}

def main(args: Array[String]) {
println("Iter x y")
genIters(new State(0.0,0.0),1,50000,1000)
}

}


Although it is a few lines longer, it is a fairly clean implementation, and doesn’t look like a hack. Like many functional programs, this one makes extensive use of recursion. This is one of the things that has always concerned me about functional programming – many scientific computing applications involve lots of iteration, and that can potentially translate into very deep recursion. The above program has an apparent recursion depth of 50 million! However, it runs fine without crashing despite the fact that most programming languages will crash out with a stack overflow with recursion depths of more than a couple of thousand. So why doesn’t this crash? It runs fine because the recursion I used is a special form of recursion known as a tail call. Most functional (and some imperative) programming languages automatically perform tail call elimination which essentially turns the tail call into an iteration which runs very fast without creating new stack frames. In fact, this functional version of the code runs at roughly the same speed as the iterative version I presented first (perhaps just a few percent slower – I haven’t done careful timings), and runs well within a factor of 2 of imperative C code. So actually this seems perfectly practical so far, and I’m looking forward to experimenting more with functional programming approaches to statistical computation over the coming months…

29/09/2013

## Introduction

Parallel tempering is a method for getting Metropolis-Hastings based MCMC algorithms to work better on multi-modal distributions. Although the idea has been around for more than 20 years, and works well on many problems, it still isn’t routinely used in applications. I think this is partly because relatively few people understand how it works, and partly due to the perceived difficulty of implementation. I hope to show here that it is both very easy to understand and to implement. It is also rather easy to implement in parallel on multi-core systems, though I won’t get into that in this post.

## Sampling a double-well potential

To illustrate the ideas, we need a toy multi-modal distribution to sample. There are obviously many possibilities here, but I rather like to use a double potential well distribution. The simplest version of this assumes a potential function of the form

$U(x) = \gamma (x^2-1)^2$

for some given potential barrier height $\gamma$. The potential function $U(x)$ corresponds to the probability density function

$\pi(x) \propto \exp\{-U(x)\}.$

There is a physical explanation for the terminology, via Langevin diffusions, but that isn’t really important here. It is fine to just think of potentials as being a (negative) log-density scale. On this scale, high potential barrier heights correspond to regions of very low probability density. We can set up a double well potential and plot it for the case $\gamma=4$ in R with the following code

U=function(gam,x)
{
gam*(x*x-1)*(x*x-1)
}

curried=function(gam)
{
message(paste("Returning a function for gamma =",gam))
function(x) U(gam,x)
}
U4=curried(4)

op=par(mfrow=c(2,1))
curve(U4(x),-2,2,main="Potential function, U(x)")
curve(exp(-U4(x)),-2,2,main="Unnormalised density function, exp(-U(x))")
par(op)


Incidentally, the function curried(), which curries the potential function, did not include the message() statement when I first wrote it. It mostly worked fine, but some of the code below didn’t behave as I expected. I inserted the message() statement to figure out what was going on, and the code started behaving perfectly – a beautiful example of a Heisenbug! The reason is that the message statement is not redundant – it forces evaluation of the gam variable, which is necessary in some cases, due to the lazy evaluation model that R uses for function arguments. If you don’t like the message() statement, replacing it with a simple gam works just as well.

Anyway, the point is that we have defined a multi-modal density, and that a Metropolis-Hastings algorithm initialised in one of the modes will have a hard time jumping to the other mode, and the difficulty of this jump will increase as we increase the value of $\gamma$.

We can write a simple function for a Metropolis algorithm targeting a particular potential function as follows.

chain=function(target,tune=0.1,init=1)
{
x=init
xvec=numeric(iters)
for (i in 1:iters) {
can=x+rnorm(1,0,tune)
logA=target(x)-target(can)
if (log(runif(1))<logA)
x=can
xvec[i]=x
}
xvec
}


We can use this code to run some chains for a few different values of $\gamma$ as follows.

temps=2^(0:3)
iters=1e5

mat=sapply(lapply(temps,curried),chain)
colnames(mat)=paste("gamma=",temps,sep="")

require(smfsb)
mcmcSummary(mat,rows=length(temps))


We see that as $\gamma$ increases, the chain jumps between modes less frequently. Indeed, for $\gamma=8$, the chain fails to jump to the second mode at all during this particular run of 100,000 iterations. That’s a problem if we are really interested in sampling from distributions like this. Of course, for this particular problem, there are all kinds of ways to fix this sampler, but the point is to try and develop methods that will work in high-dimensional situations where we cannot just “look” at what is going wrong.

Before we see how to couple the chains and improve the mixing, it is useful to think how to re-write this sampler. Above we sampled each chain in turn for different barrier heights. To couple the chains, we need to sample them together, sampling each iteration for all of the chains in turn. This is very easy to do. The code below isn’t especially efficient, but it is written to look very similar to the single chain code above.

chains=function(pot=U, tune=0.1, init=1)
{
x=rep(init,length(temps))
xmat=matrix(0,iters,length(temps))
for (i in 1:iters) {
can=x+rnorm(length(temps),0,tune)
logA=unlist(Map(pot,temps,x))-unlist(Map(pot,temps,can))
accept=(log(runif(length(temps)))<logA)
x[accept]=can[accept]
xmat[i,]=x
}
colnames(xmat)=paste("gamma=",temps,sep="")
xmat
}

mcmcSummary(chains(),rows=length(temps))


This code should behave identically to the previous code, simulating independent parallel MCMC chains. However, the code is now in the form that is very easy to modify to couple the chains together in an attempt to improve mixing.

## Coupling parallel chains

In the above example the chains we are simulating are all independent of one another. Some mix reasonably well, and some mix very badly. But the distributions are all related to one another, changing gradually as the barrier height changes. The idea behind coupling the chains is to try and swap states between the chains to use the chains which are mixing well to improve the mixing of the chains which aren’t. In particular, suppose interest is in the target of the worst mixing chain. The other chains can be constructed as “tempered” versions of the target of interest, here by raising it to a power between 0 and 1, with 0 corresponding to a complete flattening of the distribution, and 1 corresponding to the desired target. The use of parallel chains to improve mixing in this way is usually referred to as parallel tempering, but also sometimes as $(\text{MC})^3$. In the context of Bayesian inference, tempering using the “power posterior” can often be more natural and useful (to be discussed in a subsequent post).

So, how do we swap states between the chains without affecting the target distributions? As always, just use a Metropolis-Hastings update… To keep things simple, let’s just suppose that we have two (independent, parallel) chains, one with target $f(x)$ and the other with target $g(y)$. We can consider these chains to be evolving together, with joint target $\pi(x,y)=f(x)g(y)$. The updates chosen to update the within-chain states will obviously preserve this joint target. Now we consider how to swap states between the two chains without destroying the target. We simply propose a swap of $x$ and $y$. That is, we propose to move from $(x,y)$ to $(x^\star,y^\star)$, where $x^\star=y$ and $y^\star=x$. We are proposing this move as a standard Metropolis-Hastings update of the joint chain. We may therefore wonder about exactly what the proposal density for this move is. In fact, it is a point mass at the swapped values, and therefore has density

$q((x^\star,y^\star)|(x,y)) = \delta(x^\star-y)\delta(y^\star-x),$

but it really doesn’t matter, as it is clearly a symmetric proposal, and hence will drop out of the M-H ratio. Our acceptance probability is therefore $\min\{1,A\}$, where

$\displaystyle A = \frac{\pi(x^\star,y^\star)}{\pi(x,y)} = \frac{\pi(y,x)}{\pi(x,y)} = \frac{f(y)g(x)}{f(x)g(y)}.$

So, if we use this acceptance probability whenever we propose a swap of the states between two chains, then we will preserve the joint target, and hence the marginal targets and asymptotic independence of the target. However, the chains themselves will no longer be independent of one another. They will be coupled – Metropolis coupled. This is very easy to implement. We can just add a few lines to our previous chains() function as follows

chains=function(pot=U, tune=0.1, init=1)
{
x=rep(init,length(temps))
xmat=matrix(0,iters,length(temps))
for (i in 1:iters) {
can=x+rnorm(length(temps),0,tune)
logA=unlist(Map(pot,temps,x))-unlist(Map(pot,temps,can))
accept=(log(runif(length(temps)))<logA)
x[accept]=can[accept]
# now the coupling update
swap=sample(1:length(temps),2)
logA=pot(temps[swap[1]],x[swap[1]])+pot(temps[swap[2]],x[swap[2]])-
pot(temps[swap[1]],x[swap[2]])-pot(temps[swap[2]],x[swap[1]])
if (log(runif(1))<logA)
x[swap]=rev(x[swap])
# end of the coupling update
xmat[i,]=x
}
colnames(xmat)=paste("gamma=",temps,sep="")
xmat
}


This can be used as before, but now gives results as illustrated in the following plots.

We see immediately from the plots that whilst the individual target distributions remain unchanged, the mixing of the chains is greatly improved (though still far from perfect). Note that in the code above I just picked two chains at random to propose a state swap. In practice people typically only propose to swap states between chains which are adjacent – i.e. most similar, since proposed swaps between chains with very different targets are unlikely to be accepted. Since implementation of either strategy is very easy, I would recommend trying both to see which works best.

## Parallel implementation

It should be clear that there are opportunities for parallelising the above algorithm to make effective use of modern multi-core hardware. An approach using OpenMP with C++ is discussed in this blog post. Also note that if the state space of the chains is large, it can be much more efficient to swap temperatures between the chains rather than states – so long as you are careful about keeping track of stuff – this is explored in detail in Altekar et al (’04).

## Complete R script

For convenience, a complete R script to run all of the examples in this post is given below.

# temper.R
# functions for messing around with tempering MCMC

U=function(gam,x)
{
gam*(x*x-1)*(x*x-1)
}

curried=function(gam)
{
#gam
message(paste("Returning a function for gamma =",gam))
function(x) U(gam,x)
}
U4=curried(4)

op=par(mfrow=c(2,1))
curve(U4(x),-2,2,main="Potential function, U(x)")
curve(exp(-U4(x)),-2,2,main="Unnormalised density function, exp(-U(x))")
par(op)

# global settings
temps=2^(0:3)
iters=1e5

# First look at some independent chains
chain=function(target,tune=0.1,init=1)
{
x=init
xvec=numeric(iters)
for (i in 1:iters) {
can=x+rnorm(1,0,tune)
logA=target(x)-target(can)
if (log(runif(1))<logA)
x=can
xvec[i]=x
}
xvec
}

mat=sapply(lapply(temps,curried),chain)
colnames(mat)=paste("gamma=",temps,sep="")

require(smfsb)
mcmcSummary(mat,rows=length(temps))

# Next, let's generate 5 chains at once...
chains=function(pot=U, tune=0.1, init=1)
{
x=rep(init,length(temps))
xmat=matrix(0,iters,length(temps))
for (i in 1:iters) {
can=x+rnorm(length(temps),0,tune)
logA=unlist(Map(pot,temps,x))-unlist(Map(pot,temps,can))
accept=(log(runif(length(temps)))<logA)
x[accept]=can[accept]
xmat[i,]=x
}
colnames(xmat)=paste("gamma=",temps,sep="")
xmat
}

mcmcSummary(chains(),rows=length(temps))

# Next let's couple the chains...
chains=function(pot=U, tune=0.1, init=1)
{
x=rep(init,length(temps))
xmat=matrix(0,iters,length(temps))
for (i in 1:iters) {
can=x+rnorm(length(temps),0,tune)
logA=unlist(Map(pot,temps,x))-unlist(Map(pot,temps,can))
accept=(log(runif(length(temps)))<logA)
x[accept]=can[accept]
# now the coupling update
swap=sample(1:length(temps),2)
logA=pot(temps[swap[1]],x[swap[1]])+pot(temps[swap[2]],x[swap[2]])-
pot(temps[swap[1]],x[swap[2]])-pot(temps[swap[2]],x[swap[1]])
if (log(runif(1))<logA)
x[swap]=rev(x[swap])
# end of the coupling update
xmat[i,]=x
}
colnames(xmat)=paste("gamma=",temps,sep="")
xmat
}

mcmcSummary(chains(),rows=length(temps))

# eof


### Inlining JAGS models in R scripts for rjags

02/10/2012

JAGS (Just Another Gibbs Sampler) is a general purpose MCMC engine similar to WinBUGS and OpenBUGS. I have a slight preference for JAGS as it is free and portable, works well on Linux, and interfaces well with R. It is tempting to write a tutorial introduction to JAGS and the corresponding R package, rjags, but there is a lot of material freely available on-line already, so it isn’t really necessary. If you are new to JAGS, I suggest starting with Getting Started with JAGS, rjags, and Bayesian Modelling. In this post I want to focus specifically on the problem of inlining JAGS models in R scripts as it can be very useful, and is usually skipped in introductory material.

#### JAGS and rjags on Ubuntu Linux

On recent versions of Ubuntu, assuming that R is already installed, the simplest way to install JAGS and rjags is using the command

sudo apt-get install jags r-cran-rjags


Now rjags is a CRAN package, so it can be installed in the usual way with install.packages("rjags"). However, taking JAGS and rjags direct from the Ubuntu repos should help to ensure that the versions of JAGS and rjags are in sync, which is a good thing.

#### Toy model

For this post, I will use a trivial toy example of inference for the mean and precision of a normal random sample. That is, we will assume data

$X_i \sim N(\mu,1/\tau),\quad i=1,2,\ldots n,$

with priors on $\mu$ and $\tau$ of the form

$\tau\sim Ga(a,b),\quad \mu \sim N(c,1/d).$

#### Separate model file

The usual way to fit this model in R using rjags is to first create a separate file containing the model

  model {
for (i in 1:n) {
x[i]~dnorm(mu,tau)
}
mu~dnorm(cc,d)
tau~dgamma(a,b)
}


Then, supposing that this file is called jags1.jags, an R session to fit the model could be constructed as follows:

require(rjags)
x=rnorm(15,25,2)
data=list(x=x,n=length(x))
hyper=list(a=3,b=11,cc=10,d=1/100)
init=list(mu=0,tau=1)
model=jags.model("jags1.jags",data=append(data,hyper), inits=init)
update(model,n.iter=100)
output=coda.samples(model=model,variable.names=c("mu", "tau"), n.iter=10000, thin=1)
print(summary(output))
plot(output)


This is all fine, and it can be very useful to have the model declared in a separate file, especially if the model is large and complex, and you might want to use it from outside R. However, very often for simple models it can be quite inconvenient to have the model separate from the R script which runs it. In particular, people often have issues with naming files correctly, making sure R is looking in the correct directory, moving the model with the R script, etc. So it would be nice to be able to just inline the JAGS model within an R script, to keep the model, the data, and the analysis all together in one place.

#### Using a temporary file

What we want to do is declare the JAGS model within a text string inside an R script and then somehow pass this into the call to jags.model(). The obvious way to do this is to write the string to a text file, and then pass the name of that text file into jags.model(). This works fine, but some care needs to be taken to make sure this works in a generic platform independent way. For example, you need to write to a file that you know doesn’t exist in a directory that is writable using a filename that is valid on the OS on which the script is being run. For this purpose R has an excellent little function called tempfile() which solves exactly this naming problem. It should always return the name of a file which does not exist in a writable directly within the standard temporary file location on the OS on which R is being run. This function is exceedingly useful for all kinds of things, but doesn’t seem to be very well known by newcomers to R. Using this we can construct a stand-alone R script to fit the model as follows:

require(rjags)
x=rnorm(15,25,2)
data=list(x=x,n=length(x))
hyper=list(a=3,b=11,cc=10,d=1/100)
init=list(mu=0,tau=1)
modelstring="
model {
for (i in 1:n) {
x[i]~dnorm(mu,tau)
}
mu~dnorm(cc,d)
tau~dgamma(a,b)
}
"
tmpf=tempfile()
tmps=file(tmpf,"w")
cat(modelstring,file=tmps)
close(tmps)
model=jags.model(tmpf,data=append(data,hyper), inits=init)
update(model,n.iter=100)
output=coda.samples(model=model,variable.names=c("mu", "tau"), n.iter=10000, thin=1)
print(summary(output))
plot(output)


Now, although there is a file containing the model temporarily involved, the script is stand-alone and portable.

#### Using a text connection

The solution above works fine, but still involves writing a file to disk and reading it back in again, which is a bit pointless in this case. We can solve this by using another under-appreciated R function, textConnection(). Many R functions which take a file as an argument will work fine if instead passed a textConnection object, and the rjags function jags.model() is no exception. Here, instead of writing the model string to disk, we can turn it into a textConnection object and then pass that directly into jags.model() without ever actually writing the model file to disk. This is faster, neater and cleaner. An R session which takes this approach is given below.

require(rjags)
x=rnorm(15,25,2)
data=list(x=x,n=length(x))
hyper=list(a=3,b=11,cc=10,d=1/100)
init=list(mu=0,tau=1)
modelstring="
model {
for (i in 1:n) {
x[i]~dnorm(mu,tau)
}
mu~dnorm(cc,d)
tau~dgamma(a,b)
}
"
model=jags.model(textConnection(modelstring), data=append(data,hyper), inits=init)
update(model,n.iter=100)
output=coda.samples(model=model,variable.names=c("mu", "tau"), n.iter=10000, thin=1)
print(summary(output))
plot(output)


This is my preferred way to use rjags. Note again that textConnection objects have many and varied uses and applications that have nothing to do with rjags.

### MCMC on the Raspberry Pi

07/07/2012

I’ve recently taken delivery of a Raspberry Pi mini computer. For anyone who doesn’t know, this is a low cost, low power machine, costing around 20 GBP (25 USD) and consuming around 2.5 Watts of power (it is powered by micro-USB). This amazing little device can run linux very adequately, and so naturally I’ve been interested to see if I can get MCMC codes to run on it, and to see how fast they run.

Now, I’m fairly sure that the majority of readers of this blog won’t want to be swamped with lots of Raspberry Pi related posts, so I’ve re-kindled my old personal blog for this purpose. Apart from this post, I’ll try not to write about my experiences with the Pi here on my main blog. Consequently, if you are interested in my ramblings about the Pi, you may wish to consider subscribing to my personal blog in addition to this one. Of course I’m not guaranteeing that the occasional Raspberry-flavoured post won’t find its way onto this blog, but I’ll try only to do so if it has strong relevance to statistical computing or one of the other core topics of this blog.

In order to get started with MCMC on the Pi, I’ve taken the C code gibbs.c for a simple Gibbs sampler described in a previous post (on this blog) and run it on a couple of laptops I have available, in addition to the Pi, and looked at timings. The full details of the experiment are recorded in this post over on my other blog, to which interested parties are referred. Here I will just give the “executive summary”.

The code runs fine on the Pi (running Raspbian), at around half the speed of my Intel Atom based netbook (running Ubuntu). My netbook in turn runs at around one fifth the speed of my Intel i7 based laptop. So the code runs at around one tenth of the speed of the fastest machine I have conveniently available.

As discussed over on my other blog, although the Pi is relatively slow, its low cost and low power consumption mean that is has a bang-for-buck comparable with high-end laptops and desktops. Further, a small cluster of Pis (known as a bramble) seems like a good, low cost way to learn about parallel and distributed statistical computing.

04/06/2012

### Introduction

Very often it is desirable to use Metropolis Hastings MCMC for a target distribution which does not have full support (for example, it may correspond to a non-negative random variable), using a proposal distribution which does (for example, a Gaussian random walk proposal). This isn’t a problem at all, but on more than one occasion now I have come across students getting this wrong, so I thought it might be useful to have a brief post on how to do it right, see what people sometimes get wrong, and why, and then think about correcting the wrong method in order to make it right…

### A simple example

For this post we will consider a simple $Ga(2,1)$ target distribution, with density

$\pi(x) = xe^{-x},\quad x\geq 0.$

Of course this is a very simple distribution, and there are many straightforward ways to simulate it directly, but for this post we will use a random walk Metropolis-Hastings (MH) scheme with standard Gaussian innovations. So, if the current state of the chain is $x$, a proposed new value $x^\star$ will be generated from

$f(x^\star|x) = \phi(x^\star-x),$

where $\phi(\cdot)$ is the standard normal density. This proposed new value is accepted with probability $\min\{1,A\}$, where

$\displaystyle A = \frac{\pi(x^\star)}{\pi(x)} \frac{f(x|x^\star)}{f(x^\star|x)} = \frac{\pi(x^\star)}{\pi(x)} \frac{\phi(x-x^\star)}{\phi(x^\star-x)} = \frac{\pi(x^\star)}{\pi(x)} ,$

since the standard normal density is symmetric.

#### Correct implementation

We can easily implement this using R as follows:

met1=function(iters)
{
xvec=numeric(iters)
x=1
for (i in 1:iters) {
xs=x+rnorm(1)
A=dgamma(xs,2,1)/dgamma(x,2,1)
if (runif(1)<A)
x=xs
xvec[i]=x
}
return(xvec)
}


We can run it, plot the results and check it against the true target with the following commands.

iters=1000000
out=met1(iters)
hist(out,100,freq=FALSE,main="met1")


If you have a slow computer, you may prefer to use iters=100000. The above code uses R’s built-in gamma density. Alternatively, we can hard-code the density as follows.

met2=function(iters)
{
xvec=numeric(iters)
x=1
for (i in 1:iters) {
xs=x+rnorm(1)
A=xs*exp(-xs)/(x*exp(-x))
if (runif(1)<A)
x=xs
xvec[i]=x
}
return(xvec)
}


We can run this code using the following commands, to verify that it does work as expected.

out=met2(iters)
hist(out,100,freq=FALSE,main="met2")


However, there is a potential problem with the above code that we have got away with in this instance, which often catches people out. We have hard-coded the density for $x>0$ without checking the sign of $x$. Here we get away with it as a negative proposal will lead to a negative acceptance ratio that we will reject straight away. This is not always the case (consider, for example, a $Ga(3,1)$ distribution). So really we should check the sign of $x^\star$ and reject immediately if is not within the support of the target.

Although this problem often catches people out, it tends not to be a big issue in practice, as it typically leads to an obviously incorrect sampler, or a sampler which crashes, and is relatively simple to debug and fix.

#### An incorrect sampler

The problem I want to focus on here is more subtle, but closely related. It is clear that any $x^\star<0$ should be rejected. With the above code, such values are indeed rejected, and the sampler advances to the next iteration. However, in more complex samplers, where an update like this might be one tiny part of a massive sampler with a very high-dimensional state space, it seems like a bit of a "waste" of a MH move to just propose a negative value, throw it away, and move on. Evidently, it seems tempting, therefore, to keep on sampling $x^\star$ values until a non-negative value is obtained, and then evaluate the acceptance ratio and decide whether or not to accept. We could code up this sampler as follows.

met3=function(iters)
{
xvec=numeric(iters)
x=1
for (i in 1:iters) {
repeat {
xs=x+rnorm(1)
if (xs>0)
break
}
A=xs*exp(-xs)/(x*exp(-x))
if (runif(1)<A)
x=xs
xvec[i]=x
}
return(xvec)
}


As reasonable as this idea may at first seem, it does not lead to a sampler having the desired target, as can be verified using the following commands.

out=met3(iters)
hist(out,100,freq=FALSE,main="met3")


So, this sampler seems to be sampling something close to the desired target, but not the same. This raises a couple of questions. First and most important, can we fix this sampler so that it does sample the correct target (yes), and second, can we figure out what target density the incorrect sampler is actually sampling (again, yes)? Let’s start with the issue of how to fix the sampler, as this will also help us to understand what the incorrect sampler is doing.

#### Fixing the truncated sampler

By repeatedly sampling from the proposal until we obtain a non-negative value, we are actually implementing a rejection sampler for sampling from the proposal distribution truncated at zero. This is a perfectly reasonable proposal distribution, so we can use it provided that we use the correct MH acceptance ratio. Now, the truncated density has the same density as the untruncated density, apart from the differing support and a normalising constant. Indeed, this may be why people often assume this method will work, because normalising constants often don’t matter in MH schemes. However, the normalising constant only doesn’t matter if it is independent of the state, and here it is not… Explicitly, we have

$f(x^\star|x) \propto \phi(x^\star-x),\quad x^\star>0.$

Including the normalising constant we have

$\displaystyle f(x^\star|x) = \frac{\phi(x^\star-x)}{\Phi(x)},\quad x^\star>0,$

where $\Phi(x)$ is the standard normal CDF. Consequently, the correct acceptance ratio to use with this proposal is

$\displaystyle A = \frac{\pi(x^\star)}{\pi(x)} \frac{\phi(x-x^\star)}{\phi(x^\star-x)}\frac{\Phi(x)}{\Phi(x^\star)} = \frac{\pi(x^\star)}{\pi(x)}\frac{\Phi(x)}{\Phi(x^\star)},$

where we see that the normalising constants do not cancel out. We can modify the previous sampler to use the correct acceptance ratio as follows.

met4=function(iters)
{
xvec=numeric(iters)
x=1
for (i in 1:iters) {
repeat {
xs=x+rnorm(1)
if (xs>0)
break
}
A=xs*exp(-xs)/(x*exp(-x))
A=A*pnorm(x)/pnorm(xs)
if (runif(1)<A)
x=xs
xvec[i]=x
}
return(xvec)
}


We can verify that this sampler gives leads to the correct target with the following commands.

out=met4(iters)
hist(out,100,freq=FALSE,main="met4")


So, truncating the proposal at zero is fine, provided that you modify the acceptance ratio accordingly.

#### What does the incorrect sampler target?

Now that we understand why the naive truncated sampler was wrong and how to fix it, we can, out of curiosity, wonder what distribution that sampler actually targets. Now we understand what proposal we are actually using, we can re-write the acceptance ratio as

$\displaystyle A = \frac{\pi(x^\star)\Phi(x^\star)}{\pi(x)\Phi(x)}\frac{\frac{\phi(x-x^\star)}{\Phi(x^\star)}}{\frac{\phi(x^\star-x)}{\Phi(x)}},$

from which it is clear that the actual target of this chain is

$\tilde\pi(x) \propto \pi(x)\Phi(x),$

or

$\tilde\pi(x)\propto xe^{-x}\Phi(x),\quad x\geq 0.$

The constant of proportionality is not immediately obvious, but is tractable, and turns out to be a nice undergraduate exercise in integration by parts, leading to

$\displaystyle \tilde\pi(x) = \frac{2\sqrt{2\pi}}{2+\sqrt{2\pi}}xe^{-x}\Phi(x),\quad x\geq 0.$

We can verify this using the following commands.

out=met3(iters)
hist(out,100,freq=FALSE,main="met3")


Now we know the actual target of the incorrect sampler, we can compare it with the correct target as follows.

curve(dgamma(x,2,1),0,10,col=2,lwd=2,main="Densities")


So we see that the distributions are different, but not so different that one would immediate suspect an error on the basis of a sample of output. This makes it a difficult bug to track down.

### Summary

There is no problem in principle using a proposal with full support for a target with limited support in MH algorithms. However, it is important to check whether a proposed value is within the support of the target and reject the proposed move if it is not. If you are concerned that such a scheme might be inefficient, it is possible to use a truncated proposal provided that you modify the MH acceptance ratio to include the relevant normalisation constants. If you don’t modify the acceptance probability, you will get a sampler which targets the wrong distribution, but it will often be quite similar to the correct target, making it a difficult bug to spot and track down.

01/06/2012

### Introduction

As I’ve explained previously, I’m gradually coming around to the idea of using Java for the development of MCMC codes, and I’m starting to build up a collection of simple examples for getting started. One of the advantages of Java is that it includes a standard cross-platform GUI library. This might not seem like the most important requirement for MCMC, but can actually be very handy in several contexts, particularly for monitoring convergence. One obvious context is that of image analysis, where it can be useful to monitor image reconstructions as the sampler is running. In this post I’ll show three very small simple Java classes which together provide an application for running a Gibbs sampler on a (non-stationary, unconditioned) Gaussian Markov random field.

The model is essentially that the distribution of each pixel is defined intrinsically, dependent only on its four nearest neighbours on a rectangular lattice, and here the distribution will be Gaussian with mean equal to the sample mean of the four neighbouring pixels and a fixed (unit) variance. On its own this isn’t especially useful, but it is a key component of many image analysis applications.

### A simple Java implementation

We will start with the class MrfApp containing the main method for the application:

MrfApp.java

import java.io.*;
class MrfApp {
public static void main(String[] arg)
throws IOException
{
Mrf mrf;
System.out.println("started program");
mrf=new Mrf(800,600);
System.out.println("created mrf object");
mrf.update(1000);
mrf.saveImage("mrf.png");
System.out.println("finished program");
mrf.frame.dispose();
System.exit(0);
}
}


Hopefully this code is largely self-explanatory, but relies on a class called Mrf which contains all of the logic associated with the GMRF.

Mrf.java

import java.io.*;
import java.util.*;
import java.awt.image.*;
import javax.swing.*;
import javax.imageio.ImageIO;

class Mrf
{
int n,m;
double[][] cells;
Random rng;
BufferedImage bi;
WritableRaster wr;
JFrame frame;
ImagePanel ip;

Mrf(int n_arg,int m_arg)
{
n=n_arg;
m=m_arg;
cells=new double[n][m];
rng=new Random();
bi=new BufferedImage(n,m,BufferedImage.TYPE_BYTE_GRAY);
wr=bi.getRaster();
frame=new JFrame("MRF");
frame.setSize(n,m);
frame.setVisible(true);
}

public void saveImage(String filename)
throws IOException
{
ImageIO.write(bi,"PNG",new File(filename));
}

public void updateImage()
{
double mx=-1e+100;
double mn=1e+100;
for (int i=0;i<n;i++) {
for (int j=0;j<m;j++) {
if (cells[i][j]>mx) { mx=cells[i][j]; }
if (cells[i][j]<mn) { mn=cells[i][j]; }
}
}
for (int i=0;i<n;i++) {
for (int j=0;j<m;j++) {
int level=(int) (255*(cells[i][j]-mn)/(mx-mn));
wr.setSample(i,j,0,level);
}
}
frame.repaint();
}

public void update(int num)
{
for (int i=0;i<num;i++) {
updateOnce();
}
}

private void updateOnce()
{
double mean;
for (int i=0;i<n;i++) {
for (int j=0;j<m;j++) {
if (i==0) {
if (j==0) {
mean=0.5*(cells[0][1]+cells[1][0]);
}
else if (j==m-1) {
mean=0.5*(cells[0][j-1]+cells[1][j]);
}
else {
mean=(cells[0][j-1]+cells[0][j+1]+cells[1][j])/3.0;
}
}
else if (i==n-1) {
if (j==0) {
mean=0.5*(cells[i][1]+cells[i-1][0]);
}
else if (j==m-1) {
mean=0.5*(cells[i][j-1]+cells[i-1][j]);
}
else {
mean=(cells[i][j-1]+cells[i][j+1]+cells[i-1][j])/3.0;
}
}
else if (j==0) {
mean=(cells[i-1][0]+cells[i+1][0]+cells[i][1])/3.0;
}
else if (j==m-1) {
mean=(cells[i-1][j]+cells[i+1][j]+cells[i][j-1])/3.0;
}
else {
mean=0.25*(cells[i][j-1]+cells[i][j+1]+cells[i+1][j]
+cells[i-1][j]);
}
cells[i][j]=mean+rng.nextGaussian();
}
}
updateImage();
}

}


This class contains a few simple methods for creating and updating the GMRF, and also for maintaining and updating a graphical view of the GMRF as the sampler is running. The Gibbs sampler update itself is encoded in the final method, updateOnce, and most of the code is to deal with edge and corner cases (in the literal rather than metaphorical sense!). This is called repeatedly by the method update for the required number of iterations. At the end of each iteration, the method updateOnce triggers updateImage which updates the image associated GMRF. The GMRF itself is stored in a 2-dimensional array of doubles, but an image pixel typically consists of a grayscale value represented by an unsigned byte – that is, an integer from 0 to 255. So updateImage scans through the GMRF to find the maximum and minimum values and then maps the GMRF values onto the 0 to 255 scale. The image itself is set up by the constructor method, Mrf. This class relies on an additional class called ImagePanel, which is a simple GUI panel for displaying images:

ImagePanel.java

import java.awt.*;
import java.awt.image.*;
import javax.swing.*;

class ImagePanel extends JPanel {

protected BufferedImage image;

public ImagePanel(BufferedImage image) {
this.image=image;
Dimension dim=new Dimension(image.getWidth(),image.getHeight());
setPreferredSize(dim);
setMinimumSize(dim);
revalidate();
repaint();
}

public void paintComponent(Graphics g) {
g.drawImage(image,0,0,this);
}

}


This completes the application, which can be compiled and run from the command line with

javac *.java
java MrfApp


This should compile the code and run the application, which will show a GMRF updating for 1000 iterations. When the 1000 iterations are complete, the application writes the final image to a file and then quits.

### Using Parallel COLT

The above classes are very convenient, as they should work with any standard Java installation. However, in more complex scenarios, it is likely that a math library such as Parallel COLT will be required. In this case it will make sense to make use of features in the COLT library, such as random number generators and 2d matrix objects. We can adapt the above application by replacing the MrfApp and Mrf classes with the following versions (the ImagePanel class remains unchanged):

MrfApp.java

import java.io.*;
import cern.jet.random.tdouble.engine.*;

class MrfApp {

public static void main(String[] arg)
throws IOException
{
Mrf mrf;
int seed=1234;
System.out.println("started program");
DoubleRandomEngine rngEngine=new DoubleMersenneTwister(seed);
mrf=new Mrf(800,600,rngEngine);
System.out.println("created mrf object");
mrf.update(1000);
mrf.saveImage("mrf.png");
System.out.println("finished program");
mrf.frame.dispose();
System.exit(0);
}

}


Mrf.java

import java.io.*;
import java.util.*;
import java.awt.image.*;
import javax.swing.*;
import javax.imageio.ImageIO;
import cern.jet.random.tdouble.*;
import cern.jet.random.tdouble.engine.*;
import cern.colt.matrix.tdouble.impl.*;

class Mrf
{
int n,m;
DenseDoubleMatrix2D cells;
DoubleRandomEngine rng;
Normal rngN;
BufferedImage bi;
WritableRaster wr;
JFrame frame;
ImagePanel ip;

Mrf(int n_arg,int m_arg,DoubleRandomEngine rng)
{
n=n_arg;
m=m_arg;
cells=new DenseDoubleMatrix2D(n,m);
this.rng=rng;
rngN=new Normal(0.0,1.0,rng);
bi=new BufferedImage(n,m,BufferedImage.TYPE_BYTE_GRAY);
wr=bi.getRaster();
frame=new JFrame("MRF");
frame.setSize(n,m);
frame.setVisible(true);
}

public void saveImage(String filename)
throws IOException
{
ImageIO.write(bi,"PNG",new File(filename));
}

public void updateImage()
{
double mx=-1e+100;
double mn=1e+100;
for (int i=0;i<n;i++) {
for (int j=0;j<m;j++) {
if (cells.getQuick(i,j)>mx) { mx=cells.getQuick(i,j); }
if (cells.getQuick(i,j)<mn) { mn=cells.getQuick(i,j); }
}
}
for (int i=0;i<n;i++) {
for (int j=0;j<m;j++) {
int level=(int) (255*(cells.getQuick(i,j)-mn)/(mx-mn));
wr.setSample(i,j,0,level);
}
}
frame.repaint();
}

public void update(int num)
{
for (int i=0;i<num;i++) {
updateOnce();
}
}

private void updateOnce()
{
double mean;
for (int i=0;i<n;i++) {
for (int j=0;j<m;j++) {
if (i==0) {
if (j==0) {
mean=0.5*(cells.getQuick(0,1)+cells.getQuick(1,0));
}
else if (j==m-1) {
mean=0.5*(cells.getQuick(0,j-1)+cells.getQuick(1,j));
}
else {
mean=(cells.getQuick(0,j-1)+cells.getQuick(0,j+1)+cells.getQuick(1,j))/3.0;
}
}
else if (i==n-1) {
if (j==0) {
mean=0.5*(cells.getQuick(i,1)+cells.getQuick(i-1,0));
}
else if (j==m-1) {
mean=0.5*(cells.getQuick(i,j-1)+cells.getQuick(i-1,j));
}
else {
mean=(cells.getQuick(i,j-1)+cells.getQuick(i,j+1)+cells.getQuick(i-1,j))/3.0;
}
}
else if (j==0) {
mean=(cells.getQuick(i-1,0)+cells.getQuick(i+1,0)+cells.getQuick(i,1))/3.0;
}
else if (j==m-1) {
mean=(cells.getQuick(i-1,j)+cells.getQuick(i+1,j)+cells.getQuick(i,j-1))/3.0;
}
else {
mean=0.25*(cells.getQuick(i,j-1)+cells.getQuick(i,j+1)+cells.getQuick(i+1,j)
+cells.getQuick(i-1,j));
}
cells.setQuick(i,j,mean+rngN.nextDouble());
}
}
updateImage();
}

}


Again, the code should be reasonably self explanatory, and will compile and run in the same way provided that Parallel COLT is installed and in your classpath. This version runs approximately twice as fast as the previous version on all of the machines I’ve tried it on.

### Reference

I have found the following book very useful for understanding how to work with images in Java:

Hunt, K.A. (2010) The Art of Image Processing with Java, A K Peters/CRC Press.

29/12/2011

### Introduction

In a previous post I showed how to construct a PMMH pMCMC algorithm for parameter estimation with partially observed Markov processes. The inner loop of a pMCMC algorithm consists of running a particle filter to construct an unbiased estimate of marginal likelihood. This inner loop is the place where the code spends almost all of its time, and so speeding up the particle filter will result in dramatic speedup of the pMCMC algorithm. This is fortunate, since as previously discussed, MCMC algorithms are difficult to parallelise other than on a per iteration basis. Here, each iteration can be speeded up if we can effectively parallelise a particle filter. Particle filters are much easier to parallelise than MCMC algorithms, and so it is tempting to try and exploit this within R. In fact, although it is the case that it is possible to effectively parallelise particle filters in efficient languages using low-level parallelisation tools (say, using C with MPI, or Java concurrency tools), it is not so easy to speed up R-based particle filters using R’s high-level parallelisation constructs, as we shall see.

### Particle filters

In the previous post we looked at the function pfMLLik within the CRAN package smfsb. As a reminder, the source code is

pfMLLik <- function (n, simx0, t0, stepFun, dataLik, data)
{
times = c(t0, as.numeric(rownames(data)))
deltas = diff(times)
return(function(...) {
xmat = simx0(n, t0, ...)
ll = 0
for (i in 1:length(deltas)) {
xmat = t(apply(xmat, 1, stepFun, t0 = times[i], deltat = deltas[i], ...))
w = apply(xmat, 1, dataLik, t = times[i + 1], y = data[i,], log = FALSE, ...)
if (max(w) < 1e-20) {
warning("Particle filter bombed")
return(-1e+99)
}
ll = ll + log(mean(w))
rows = sample(1:n, n, replace = TRUE, prob = w)
xmat = xmat[rows, ]
}
ll
})
}


The function itself doesn’t actually run a particle filter, but instead returns a function closure which does (see the previous post for a discussion of lexical scope and function closures in R). There are obviously several different steps within the particle filter, and several of these are amenable to parallelisation. However, for complex models, forward simulation from the model will be the rate-limiting step, where the vast majority of CPU cycles will be spent. Line 9 in the above code is where forward simulation takes place, and in particular, the key function call is the apply call:

apply(xmat, 1, stepFun, t0 = times[i], deltat = deltas[i], ...)


This call applies the forward simulation algorithm stepFun to each row of the matrix xmat independently. Since there are no dependencies between the function calls, this is in principle very straightforward to parallelise on multicore hardware.

### Multicore support in R

I’m writing this post on a laptop with an Intel i7 quad core chip, running the 64 bit version of Ubuntu 11.10. R has support for multicore processing on this platform – it is just a simple matter of installing the relevant packages. However, things are changing rapidly regarding multicore support in R right now, so YMMV. Ubuntu 11.10 has R 2.13 by default, but the multicore support is slightly different in the recently released R 2.14. I’m still using R 2.13. I may update this post (or comment) when I move to R 2.14. The main difference is that the package multicore has been replaced by the package parallel. There are a few other minor changes, but it should be easy to adapt what is presented here to 2.14.

There is a new O’Reilly book called Parallel R. I’ve got a copy of it. It does cover the new parallel package in R 2.14, as well as other parallel R topics, but the book is a bit light weight, to say the least, and I reviewed it on this blog. Please read my review for further details before you buy it.

If you haven’t used multicore in R previously, then

install.packages(c("multicore","doMC"))


should get you started (again, I’m assuming that your R version is strictly < 2.14). You can test it has worked with:

library(multicore)
multicore:::detectCores()


When I do this, I get the answer 8 (I have 4 cores, each of which is hyper-threaded). To begin with, I want to tell R to use just 4 process threads, and I can do this with

library(doMC)
registerDoMC(4)


Replacing the second line with registerDoMC() will set things up to use all detected cores (in my case, 8). There are a couple of different strategies we could use to parallelise this. One strategy for parallelising the apply call discussed above is to be to replace it with a foreach / %dopar% loop. This is best illustrated by example. Start with line 9 from the function pfMLLik:

xmat = t(apply(xmat, 1, stepFun, t0 = times[i], deltat = deltas[i], ...))


We can produce a parallelised version by replacing this line with the following block of code:

res=foreach(j=1:dim(xmat)[1]) %dopar% {
stepFun(xmat[j,], t0 = times[i], deltat = deltas[i], ...)
}
xmat=t(sapply(res,cbind))


Each iteration of the foreach loop is executed independently (possibly using multiple cores), and the result of each iteration is returned as a list, and captured in res. This list of return vectors is then coerced back into a matrix with the final line.

In fact, we can improve on this by using the .combine argument to foreach, which describes how to combine the results from each iteration. Here we can just use rbind to combine the results into a matrix, using:

xmat=foreach(j=1:dim(xmat)[1], .combine="rbind") %dopar% {
stepFun(xmat[j,], t0 = times[i], deltat = deltas[i], ...)
}


This code is much neater, and in principle ought to be a bit faster, though I haven’t noticed much difference in practice.

In fact, it is not necessary to use the foreach construct at all. The multicore package provides the mclapply function, which is a multicore version of lapply. To use mclapply (or, indeed, lapply) here, we first need to split our matrix into a list of rows, which we can do using the split command. So in fact, our apply call can be replaced with the single line:

xmat=t(sapply(mclapply(split(xmat,row(xmat)), stepFun, t0=times[i], deltat=deltas[i], ...),cbind))


This is actually a much cleaner solution than the method using foreach, but it does require grokking a bit more R. Note that mclapply uses a different method to specify the number of threads to use than foreach/doMC. Here you can either use the named argument to mclapply, mc.cores, or use options(), eg. options(cores=4).

As well as being much cleaner, I find that the mclapply approach is much faster than the foreach/dopar approach for this problem. I’m guessing that this is because foreach doesn’t pre-schedule tasks by default, whereas mclapply does, but I haven’t had a chance to dig into this in detail yet.

### A parallelised particle filter

We can now splice the parallelised forward simulation step (using mclapply) back into our particle filter function to get:

require(multicore)
pfMLLik <- function (n, simx0, t0, stepFun, dataLik, data)
{
times = c(t0, as.numeric(rownames(data)))
deltas = diff(times)
return(function(...) {
xmat = simx0(n, t0, ...)
ll = 0
for (i in 1:length(deltas)) {
xmat=t(sapply(mclapply(split(xmat,row(xmat)), stepFun, t0=times[i], deltat=deltas[i], ...),cbind))
w = apply(xmat, 1, dataLik, t = times[i + 1], y = data[i,], log = FALSE, ...)
if (max(w) < 1e-20) {
warning("Particle filter bombed")
return(-1e+99)
}
ll = ll + log(mean(w))
rows = sample(1:n, n, replace = TRUE, prob = w)
xmat = xmat[rows, ]
}
ll
})
}


This can be used in place of the version supplied with the smfsb package for slow simulation algorithms running on modern multicore machines.

There is an issue regarding Monte Carlo simulations such as this and the multicore package (whether you use mclapply or foreach/dopar) in that it adopts a “different seeds” approach to parallel random number generation, rather than a true parallel random number generator. This probably isn’t worth worrying too much about now, since it is fixed in the new parallel package in R 2.14, but is something to be aware of. I discuss parallel random number generation issues in Wilkinson (2005).

### Multiple particle filters and pMCMC

Let’s look again at the main loop of the pMCMC algorithm discussed in the previous post:

for (i in 1:iters) {
message(paste(i,""),appendLF=FALSE)
for (j in 1:thin) {
thprop=th*exp(rnorm(p,0,tune))
llprop=mLLik(thprop)
if (log(runif(1)) < llprop - ll) {
th=thprop
ll=llprop
}
}
thmat[i,]=th
}


It is clear that the main computational bottleneck of this code is the call to mLLik on line 5, as this is the call which runs the particle filter. The purpose of making the call is to obtain an unbiased estimate of marginal likelihood. However, there are plenty of other ways that we can obtain such estimates than by running a single particle filter. In particular, we could run multiple particle filters and average the results. So, let’s look at how to do this in the multicore setting. Let’s start by thinking about running 4 particle filters. We could just replace the line

llprop=mLLik(thprop)


with the code

llprop=0.25*foreach(i=1:4, .combine="+") %dopar% {
mLLik(thprop)
}


Now, there are at least 2 issues with this. The first is that we are now just running 4 particle filters rather than 1, and so even with perfect parallelisation, it will run no quicker than the code we started with. However, the idea is that by running 4 particle filters we ought to be able to get away with each particle filter using fewer particles, though it isn’t trivial to figure out exactly how many. For example, averaging the results from 4 particle filters, each of which uses 25 particles is not as good as running a single particle filter with 100 particles. In practice, some trial and error is likely to be required. The second problem is that we have computed the mean of the log of the likelihoods, and not the likelihoods themselves. This will almost certainly work fine in practice, as the resulting estimate will in most cases be very close to unbiased, but it will not be exactly unbiased, as so will not lead to an “exact” approximate algorithm. In principle, this can be fixed by instead using

res=foreach(i=1:4) %dopar% {
mLLik(thprop)
}
llprop=log(mean(sapply(res,exp)))


but in practice this is likely to be subject to numerical underflow problems, as it involves manipulating raw likelihood values, which is generally a bad idea. It is possible to compute the log of the mean of the likelihoods in a more numerically stable way, but that is left as an exercise for the reader, as this post is way too long already… However, one additional tip worth mentioning is that the foreach package includes a convenience function called times for situations like the above, where the argument is not varying over calls. So the above code can be replaced with

res=times(4) %dopar% mLLik(thprop)
llprop=log(mean(sapply(res,exp)))


which is a bit cleaner and more readable.

Using this approach to parallelisation, there is now a much better chance of getting some speedup on multicore architectures, as the granularity of the tasks being parallelised is now much larger. Consider the example from the previous post, where at each iteration we ran a particle filter with 100 particles. If we now re-run that example, but instead use 4 particle filters each using 25 particles, we do get a slight speedup. However, on my laptop, the speedup is only around a factor of 1.6 using 4 cores, and as already discussed, 4 filters each with 25 particles isn’t actually quite as good as a single filter with 100 particles anyway. So, the benefits are rather modest here, but will be much better with less trivial examples (slower simulators). For completeness, a complete runnable demo script is included after the references. Also, it is probably worth emphasising that if your pMCMC algorithm has a short burn-in period, you may well get much better overall speed-ups by just running parallel MCMC chains. Depressing, perhaps, but true.

### References

• McCallum, E., Weston, S. (2011) Parallel R, O’Reilly.
• Wilkinson, D. J. (2005) Parallel Bayesian Computation, Chapter 16 in E. J. Kontoghiorghes (ed.) Handbook of Parallel Computing and Statistics, Marcel Dekker/CRC Press, 481-512.
• Wilkinson, D. J. (2011) Stochastic Modelling for Systems Biology, second edition, Boca Raton, Florida: Chapman & Hall/CRC Press.
• ### Demo script

require(smfsb)
data(LVdata)

require(multicore)
require(doMC)
registerDoMC(4)

# set up data likelihood
noiseSD=10
dataLik <- function(x,t,y,log=TRUE,...)
{
ll=sum(dnorm(y,x,noiseSD,log=TRUE))
if (log)
return(ll)
else
return(exp(ll))
}
# now define a sampler for the prior on the initial state
simx0 <- function(N,t0,...)
{
mat=cbind(rpois(N,50),rpois(N,100))
colnames(mat)=c("x1","x2")
mat
}
# convert the time series to a timed data matrix
LVdata=as.timedData(LVnoise10)
# create marginal log-likelihood functions, based on a particle filter

# use 25 particles instead of 100
mLLik=pfMLLik(25,simx0,0,stepLVc,dataLik,LVdata)

iters=1000
tune=0.01
thin=10
th=c(th1 = 1, th2 = 0.005, th3 = 0.6)
p=length(th)
ll=-1e99
thmat=matrix(0,nrow=iters,ncol=p)
colnames(thmat)=names(th)
# Main pMCMC loop
for (i in 1:iters) {
message(paste(i,""),appendLF=FALSE)
for (j in 1:thin) {
thprop=th*exp(rnorm(p,0,tune))
res=times(4) %dopar% mLLik(thprop)
llprop=log(mean(sapply(res,exp)))
if (log(runif(1)) < llprop - ll) {
th=thprop
ll=llprop
}
}
thmat[i,]=th
}
message("Done!")
# Compute and plot some basic summaries
mcmcSummary(thmat)


The stupidest thing...