## Introduction

In a previous post I’ve given a brief introduction to monads in Scala, aimed at people interested in scientific and statistical computing. Monads are a concept from category theory which turn out to be exceptionally useful for solving many problems in functional programming. But most categorical concepts have a dual, usually prefixed with “co”, so the dual of a monad is the comonad. Comonads turn out to be especially useful for formulating algorithms from scientific and statistical computing in an elegant way. In this post I’ll illustrate their use in signal processing, image processing, numerical integration of PDEs, and Gibbs sampling (of an Ising model). Comonads enable the extension of a local computation to a global computation, and this pattern crops up all over the place in statistical computing.

Simplifying massively, from the viewpoint of a Scala programmer, a monad is a mappable (functor) type class augmented with the methods pure and flatMap:

trait Monad[M[_]] extends Functor[M] {
def pure[T](v: T): M[T]
def flatMap[T,S](v: M[T])(f: T => M[S]): M[S]
}


In category theory, the dual of a concept is typically obtained by “reversing the arrows”. Here that means reversing the direction of the methods pure and flatMap to get extract and coflatMap, respectively.

trait Comonad[W[_]] extends Functor[W] {
def extract[T](v: W[T]): T
def coflatMap[T,S](v: W[T])(f: W[T] => S): W[S]
}


So, while pure allows you to wrap plain values in a monad, extract allows you to get a value out of a comonad. So you can always get a value out of a comonad (unlike a monad). Similarly, while flatMap allows you to transform a monad using a function returning a monad, coflatMap allows you to transform a comonad using a function which collapses a comonad to a single value. It is coflatMap (sometimes called extend) which can extend a local computation (producing a single value) to the entire comonad. We’ll look at how that works in the context of some familiar examples.

## Applying a linear filter to a data stream

One of the simplest examples of a comonad is an infinite stream of data. I’ve discussed streams in a previous post. By focusing on infinite streams we know the stream will never be empty, so there will always be a value that we can extract. Which value does extract give? For a Stream encoded as some kind of lazy list, the only value we actually know is the value at the head of the stream, with subsequent values to be lazily computed as required. So the head of the list is the only reasonable value for extract to return.

Understanding coflatMap is a bit more tricky, but it is coflatMap that provides us with the power to apply a non-trivial statistical computation to the stream. The input is a function which transforms a stream into a value. In our example, that will be a function which computes a weighted average of the first few values and returns that weighted average as the result. But the return type of coflatMap must be a stream of such computations. Following the types, a few minutes thought reveals that the only reasonable thing to do is to return the stream formed by applying the weighted average function to all sub-streams, recursively. So, for a Stream s (of type Stream[T]) and an input function f: W[T] => S, we form a stream whose head is f(s) and whose tail is coflatMap(f) applied to s.tail. Again, since we are working with an infinite stream, we don’t have to worry about whether or not the tail is empty. This gives us our comonadic Stream, and it is exactly what we need for applying a linear filter to the data stream.

In Scala, Cats is a library providing type classes from Category theory, and instances of those type classes for parametrised types in the standard library. In particular, it provides us with comonadic functionality for the standard Scala Stream. Let’s start by defining a stream corresponding to the logistic map.

import cats._
import cats.implicits._

val lam = 3.7
def s = Stream.iterate(0.5)(x => lam*x*(1-x))
s.take(10).toList
// res0: List[Double] = List(0.5, 0.925, 0.25668749999999985,
//  0.7059564011718747, 0.7680532550204203, 0.6591455741499428, ...


Let us now suppose that we want to apply a linear filter to this stream, in order to smooth the values. The idea behind using comonads is that you figure out how to generate one desired value, and let coflatMap take care of applying the same logic to the rest of the structure. So here, we need a function to generate the first filtered value (since extract is focused on the head of the stream). A simple first attempt a function to do this might look like the following.

  def linearFilterS(weights: Stream[Double])(s: Stream[Double]): Double =
(weights, s).parMapN(_*_).sum


This aligns each weight in parallel with a corresponding value from the stream, and combines them using multiplication. The resulting (hopefully finite length) stream is then summed (with addition). We can test this with

linearFilterS(Stream(0.25,0.5,0.25))(s)
// res1: Double = 0.651671875


and let coflatMap extend this computation to the rest of the stream with something like:

s.coflatMap(linearFilterS(Stream(0.25,0.5,0.25))).take(5).toList
// res2: List[Double] = List(0.651671875, 0.5360828502929686, ...


This is all completely fine, but our linearFilterS function is specific to the Stream comonad, despite the fact that all we’ve used about it in the function is that it is a parallelly composable and foldable. We can make this much more generic as follows:

  def linearFilter[F[_]: Foldable, G[_]](
weights: F[Double], s: F[Double]
)(implicit ev: NonEmptyParallel[F, G]): Double =
(weights, s).parMapN(_*_).fold


This uses some fairly advanced Scala concepts which I don’t want to get into right now (I should also acknowledge that I had trouble getting the syntax right for this, and got help from Fabio Labella (@SystemFw) on the Cats gitter channel). But this version is more generic, and can be used to linearly filter other data structures than Stream. We can use this for regular Streams as follows:

s.coflatMap(s => linearFilter(Stream(0.25,0.5,0.25),s))
// res3: scala.collection.immutable.Stream[Double] = Stream(0.651671875, ?)


But we can apply this new filter to other collections. This could be other, more sophisticated, streams such as provided by FS2, Monix or Akka streams. But it could also be a non-stream collection, such as List:

val sl = s.take(10).toList
sl.coflatMap(sl => linearFilter(List(0.25,0.5,0.25),sl))
// res4: List[Double] = List(0.651671875, 0.5360828502929686, ...


Assuming that we have the Breeze scientific library available, we can plot the raw and smoothed trajectories.

def myFilter(s: Stream[Double]): Double =
linearFilter(Stream(0.25, 0.5, 0.25),s)
val n = 500
import breeze.plot._
import breeze.linalg._
val fig = Figure(s"The (smoothed) logistic map (lambda=$lam)") val p0 = fig.subplot(3,1,0) p0 += plot(linspace(1,n,n),s.take(n)) p0.ylim = (0.0,1.0) p0.title = s"The logistic map (lambda=$lam)"
val p1 = fig.subplot(3,1,1)
p1 += plot(linspace(1,n,n),s.coflatMap(myFilter).take(n))
p1.ylim = (0.0,1.0)
p1.title = "Smoothed by a simple linear filter"
val p2 = fig.subplot(3,1,2)
p2 += plot(linspace(1,n,n),s.coflatMap(myFilter).coflatMap(myFilter).coflatMap(myFilter).coflatMap(myFilter).coflatMap(myFilter).take(n))
p2.ylim = (0.0,1.0)
p2.title = "Smoothed with 5 applications of the linear filter"
fig.refresh


## Image processing and the heat equation

Streaming data is in no way the only context in which a comonadic approach facilitates an elegant approach to scientific and statistical computing. Comonads crop up anywhere where we want to extend a computation that is local to a small part of a data structure to the full data structure. Another commonly cited area of application of comonadic approaches is image processing (I should acknowledge that this section of the post is very much influenced by a blog post on comonadic image processing in Haskell). However, the kinds of operations used in image processing are in many cases very similar to the operations used in finite difference approaches to numerical integration of partial differential equations (PDEs) such as the heat equation, so in this section I will blur (sic) the distinction between the two, and numerically integrate the 2D heat equation in order to Gaussian blur a noisy image.

First we need a simple image type which can have pixels of arbitrary type T (this is very important – all functors must be fully type polymorphic).

  import scala.collection.parallel.immutable.ParVector
case class Image[T](w: Int, h: Int, data: ParVector[T]) {
def apply(x: Int, y: Int): T = data(x*h+y)
def map[S](f: T => S): Image[S] = Image(w, h, data map f)
def updated(x: Int, y: Int, value: T): Image[T] =
Image(w,h,data.updated(x*h+y,value))
}


Here I’ve chosen to back the image with a parallel immutable vector. This wasn’t necessary, but since this type has a map operation which automatically parallelises over multiple cores, any map operations applied to the image will be automatically parallelised. This will ultimately lead to all of our statistical computations being automatically parallelised without us having to think about it.

As it stands, this image isn’t comonadic, since it doesn’t implement extract or coflatMap. Unlike the case of Stream, there isn’t really a uniquely privileged pixel, so it’s not clear what extract should return. For many data structures of this type, we make them comonadic by adding a “cursor” pointing to a “current” element of interest, and use this as the focus for computations applied with coflatMap. This is simplest to explain by example. We can define our “pointed” image type as follows:

  case class PImage[T](x: Int, y: Int, image: Image[T]) {
def extract: T = image(x, y)
def map[S](f: T => S): PImage[S] = PImage(x, y, image map f)
def coflatMap[S](f: PImage[T] => S): PImage[S] = PImage(
x, y, Image(image.w, image.h,
(0 until (image.w * image.h)).toVector.par.map(i => {
val xx = i / image.h
val yy = i % image.h
f(PImage(xx, yy, image))
})))


There is missing a closing brace, as I’m not quite finished. Here x and y represent the location of our cursor, so extract returns the value of the pixel indexed by our cursor. Similarly, coflatMap forms an image where the value of the image at each location is the result of applying the function f to the image which had the cursor set to that location. Clearly f should use the cursor in some way, otherwise the image will have the same value at every pixel location. Note that map and coflatMap operations will be automatically parallelised. The intuitive idea behind coflatMap is that it extends local computations. For the stream example, the local computation was a linear combination of nearby values. Similarly, in image analysis problems, we often want to apply a linear filter to nearby pixels. We can get at the pixel at the cursor location using extract, but we probably also want to be able to move the cursor around to nearby locations. We can do that by adding some appropriate methods to complete the class definition.

    def up: PImage[T] = {
val py = y-1
val ny = if (py >= 0) py else (py + image.h)
PImage(x,ny,image)
}
def down: PImage[T] = {
val py = y+1
val ny = if (py < image.h) py else (py - image.h)
PImage(x,ny,image)
}
def left: PImage[T] = {
val px = x-1
val nx = if (px >= 0) px else (px + image.w)
PImage(nx,y,image)
}
def right: PImage[T] = {
val px = x+1
val nx = if (px < image.w) px else (px - image.w)
PImage(nx,y,image)
}
}


Here each method returns a new pointed image with the cursor shifted by one pixel in the appropriate direction. Note that I’ve used periodic boundary conditions here, which often makes sense for numerical integration of PDEs, but makes less sense for real image analysis problems. Note that we have embedded all “indexing” issues inside the definition of our classes. Now that we have it, none of the statistical algorithms that we develop will involve any explicit indexing. This makes it much less likely to develop algorithms containing bugs corresponding to “off-by-one” or flipped axis errors.

This class is now fine for our requirements. But if we wanted Cats to understand that this structure is really a comonad (perhaps because we wanted to use derived methods, such as coflatten), we would need to provide evidence for this. The details aren’t especially important for this post, but we can do it simply as follows:

  implicit val pimageComonad = new Comonad[PImage] {
def extract[A](wa: PImage[A]) = wa.extract
def coflatMap[A,B](wa: PImage[A])(f: PImage[A] => B): PImage[B] =
wa.coflatMap(f)
def map[A,B](wa: PImage[A])(f: A => B): PImage[B] = wa.map(f)
}


It’s handy to have some functions for converting Breeze dense matrices back and forth with our image class.

  import breeze.linalg.{Vector => BVec, _}
def BDM2I[T](m: DenseMatrix[T]): Image[T] =
Image(m.cols, m.rows, m.data.toVector.par)
def I2BDM(im: Image[Double]): DenseMatrix[Double] =
new DenseMatrix(im.h,im.w,im.data.toArray)


Now we are ready to see how to use this in practice. Let’s start by defining a very simple linear filter.

def fil(pi: PImage[Double]): Double = (2*pi.extract+
pi.up.extract+pi.down.extract+pi.left.extract+pi.right.extract)/6.0


This simple filter can be used to “smooth” or “blur” an image. However, from a more sophisticated viewpoint, exactly this type of filter can be used to represent one time step of a numerical method for time integration of the 2D heat equation. Now we can simulate a noisy image and apply our filter to it using coflatMap:

import breeze.stats.distributions.Gaussian
val bdm = DenseMatrix.tabulate(200,250){case (i,j) => math.cos(
0.1*math.sqrt((i*i+j*j))) + Gaussian(0.0,2.0).draw}
val pim0 = PImage(0,0,BDM2I(bdm))
def pims = Stream.iterate(pim0)(_.coflatMap(fil))


Note that here, rather than just applying the filter once, I’ve generated an infinite stream of pointed images, each one representing an additional application of the linear filter. Thus the sequence represents the time solution of the heat equation with initial condition corresponding to our simulated noisy image.

We can render the first few frames to check that it seems to be working.

import breeze.plot._
val fig = Figure("Diffusing a noisy image")
pims.take(25).zipWithIndex.foreach{case (pim,i) => {
val p = fig.subplot(5,5,i)
p += image(I2BDM(pim.image))
}}


Note that the numerical integration is carried out in parallel on all available cores automatically. Other image filters can be applied, and other (parabolic) PDEs can be numerically integrated in an essentially similar way.

## Gibbs sampling the Ising model

Another place where the concept of extending a local computation to a global computation crops up is in the context of Gibbs sampling a high-dimensional probability distribution by cycling through the sampling of each variable in turn from its full-conditional distribution. I’ll illustrate this here using the Ising model, so that I can reuse the pointed image class from above, but the principles apply to any Gibbs sampling problem. In particular, the Ising model that we consider has a conditional independence structure corresponding to a graph of a square lattice. As above, we will use the comonadic structure of the square lattice to construct a Gibbs sampler. However, we can construct a Gibbs sampler for arbitrary graphical models in an essentially identical way by using a graph comonad.

Let’s begin by simulating a random image containing +/-1s:

import breeze.stats.distributions.{Binomial,Bernoulli}
val beta = 0.4
val bdm = DenseMatrix.tabulate(500,600){
case (i,j) => (new Binomial(1,0.2)).draw
}.map(_*2 - 1) // random matrix of +/-1s
val pim0 = PImage(0,0,BDM2I(bdm))


We can use this to initialise our Gibbs sampler. We now need a Gibbs kernel representing the update of each pixel.

def gibbsKernel(pi: PImage[Int]): Int = {
val sum = pi.up.extract+pi.down.extract+pi.left.extract+pi.right.extract
val p1 = math.exp(beta*sum)
val p2 = math.exp(-beta*sum)
val probplus = p1/(p1+p2)
if (new Bernoulli(probplus).draw) 1 else -1
}


So far so good, but there a couple of issues that we need to consider before we plough ahead and start coflatMapping. The first is that pure functional programmers will object to the fact that this function is not pure. It is a stochastic function which has the side-effect of mutating the random number state. I’m just going to duck that issue here, as I’ve previously discussed how to fix it using probability monads, and I don’t want it to distract us here.

However, there is a more fundamental problem here relating to parallel versus sequential application of Gibbs kernels. coflatMap is conceptually parallel (irrespective of how it is implemented) in that all computations used to build the new comonad are based solely on the information available in the starting comonad. OTOH, detailed balance of the Markov chain will only be preserved if the kernels for each pixel are applied sequentially. So if we coflatMap this kernel over the image we will break detailed balance. I should emphasise that this has nothing to do with the fact that I’ve implemented the pointed image using a parallel vector. Exactly the same issue would arise if we switched to backing the image with a regular (sequential) immutable Vector.

The trick here is to recognise that if we coloured alternate pixels black and white using a chequerboard pattern, then all of the black pixels are conditionally independent given the white pixels and vice-versa. Conditionally independent pixels can be updated by parallel application of a Gibbs kernel. So we just need separate kernels for updating odd and even pixels.

def oddKernel(pi: PImage[Int]): Int =
if ((pi.x+pi.y) % 2 != 0) pi.extract else gibbsKernel(pi)
def evenKernel(pi: PImage[Int]): Int =
if ((pi.x+pi.y) % 2 == 0) pi.extract else gibbsKernel(pi)


Each of these kernels can be coflatMapped over the image preserving detailed balance of the chain. So we can now construct an infinite stream of MCMC iterations as follows.

def pims = Stream.iterate(pim0)(_.coflatMap(oddKernel).
coflatMap(evenKernel))


We can animate the first few iterations with:

import breeze.plot._
val fig = Figure("Ising model Gibbs sampler")
fig.width = 1000
fig.height = 800
pims.take(50).zipWithIndex.foreach{case (pim,i) => {
print(s"$i ") fig.clear val p = fig.subplot(1,1,0) p.title = s"Ising model: frame$i"
p += image(I2BDM(pim.image.map{_.toDouble}))
fig.refresh
}}
println


Here I have a movie showing the first 1000 iterations. Note that youtube seems to have over-compressed it, but you should get the basic idea.

Again, note that this MCMC sampler runs in parallel on all available cores, automatically. This issue of odd/even pixel updating emphasises another issue that crops up a lot in functional programming: very often, thinking about how to express an algorithm functionally leads to an algorithm which parallelises naturally. For general graphs, figuring out which groups of nodes can be updated in parallel is essentially the graph colouring problem. I’ve discussed this previously in relation to parallel MCMC in:

Wilkinson, D. J. (2005) Parallel Bayesian Computation, Chapter 16 in E. J. Kontoghiorghes (ed.) Handbook of Parallel Computing and Statistics, Marcel Dekker/CRC Press, 481-512.

There are quite a few blog posts discussing comonads in the context of Haskell. In particular, the post on comonads for image analysis I mentioned previously, and this one on cellular automata. Bartosz’s post on comonads gives some connection back to the mathematical origins. Runar’s Scala comonad tutorial is the best source I know for comonads in Scala.

Full runnable code corresponding to this blog post is available from my blog repo.

## A scalable particle filter in Scala

### Introduction

Many modern algorithms in computational Bayesian statistics have at their heart a particle filter or some other sequential Monte Carlo (SMC) procedure. In this blog I’ve discussed particle MCMC algorithms which use a particle filter in the inner-loop in order to compute a (noisy, unbiased) estimate of the marginal likelihood of the data. These algorithms are often very computationally intensive, either because the forward model used to propagate the particles is expensive, or because the likelihood associated with each particle/observation is expensive (or both). In this case it is desirable to parallelise the particle filter to run on all available cores of a machine, or in some cases, it would even be desirable to distribute the the particle filter computation across a cluster of machines.

Parallelisation is difficult when using the conventional imperative programming languages typically used in scientific and statistical computing, but is much easier using modern functional languages such as Scala. In fact, in languages such as Scala it is possible to describe algorithms at a higher level of abstraction, so that exactly the same algorithm can run in serial, run in parallel across all available cores on a single machine, or run in parallel across a cluster of machines, all without changing any code. Doing so renders parallelisation a non-issue. In this post I’ll talk through how to do this for a simple bootstrap particle filter, but the same principle applies for a large range of statistical computing algorithms.

In the previous post I gave a quick introduction to the monad concept, and to monadic collections in particular. Many computational tasks in statistics can be accomplished using a sequence of operations on monadic collections. We would like to write code that is independent of any particular implementation of a monadic collection, so that we can switch to a different implementation without changing the code of our algorithm (for example, switching from a serial to a parallel collection). But in strongly typed languages we need to know at compile time that the collection we use has the methods that we require. Typeclasses provide a nice solution to this problem. I don’t want to get bogged down in a big discussion about Scala typeclasses here, but suffice to say that they describe a family of types conforming to a particular interface in an ad hoc loosely coupled way (they are said to provide ad hoc polymorphism). They are not the same as classes in traditional O-O languages, but they do solve a similar problem to the adaptor design pattern, in a much cleaner way. We can describe a simple typeclass for our monadic collection as follows:

trait GenericColl[C[_]] {
def map[A, B](ca: C[A])(f: A => B): C[B]
def reduce[A](ca: C[A])(f: (A, A) => A): A
def flatMap[A, B, D[B] <: GenTraversable[B]](ca: C[A])(f: A => D[B]): C[B]
def zip[A, B](ca: C[A])(cb: C[B]): C[(A, B)]
def length[A](ca: C[A]): Int
}


In the typeclass we just list the methods that we expect our generic collection to provide, but do not say anything about how they are implemented. For example, we know that operations such as map and reduce can be executed in parallel, but this is a separate concern. We can now write code that can be used for any collection conforming to the requirements of this typeclass. The full code for this example is provided in the associated github repo for this blog, and includes the obvious syntax for this typeclass, and typeclass instances for the Scala collections Vector and ParVector, that we will exploit later in the example.

### SIR step for a bootstrap filter

We can now write some code for a single observation update of a bootstrap particle filter.

def update[S: State, O: Observation, C[_]: GenericColl](
dataLik: (S, O) => LogLik, stepFun: S => S
)(x: C[S], o: O): (LogLik, C[S]) = {
val xp = x map (stepFun(_))
val lw = xp map (dataLik(_, o))
val max = lw reduce (math.max(_, _))
val rw = lw map (lwi => math.exp(lwi - max))
val srw = rw reduce (_ + _)
val l = rw.length
val z = rw zip xp
val rx = z flatMap (p => Vector.fill(Poisson(p._1 * l / srw).draw)(p._2))
(max + math.log(srw / l), rx)
}


This is a very simple bootstrap filter, using Poisson resampling for simplicity and data locality, but does include use of the log-sum-exp trick to prevent over/underflow of raw weight calculations, and tracks the marginal (log-)likelihood of the observation. With this function we can now pass in a “prior” particle distribution in any collection conforming to our typeclass, together with a propagator function, an observation (log-)likelihood, and an observation, and it will return back a new collection of particles of exactly the same type that was provided for input. Note that all of the operations we require can be accomplished with the standard monadic collection operations declared in our typeclass.

### Filtering as a functional fold

Once we have a function for executing one step of a particle filter, we can produce a function for particle filtering as a functional fold over a sequence of observations:

def pFilter[S: State, O: Observation, C[_]: GenericColl, D[O] <: GenTraversable[O]](
x0: C[S], data: D[O], dataLik: (S, O) => LogLik, stepFun: S => S
): (LogLik, C[S]) = {
val updater = update[S, O, C](dataLik, stepFun) _
data.foldLeft((0.0, x0))((prev, o) => {
val next = updater(prev._2, o)
(prev._1 + next._1, next._2)
})
}


Folding data structures is a fundamental concept in functional programming, and is exactly what is required for any kind of filtering problem. Note that Brian Beckman has recently written a series of articles on Kalman filtering as a functional fold.

### Marginal likelihoods and parameter estimation

So far we haven’t said anything about parameters or parameter estimation, but this is appropriate, since parametrisation is a separate concern from filtering. However, once we have a function for particle filtering, we can produce a function concerned with evaluating marginal likelihoods trivially:

def pfMll[S: State, P: Parameter, O: Observation,
C[_]: GenericColl, D[O] <: GenTraversable[O]](
simX0: P => C[S], stepFun: P => S => S,
dataLik: P => (S, O) => LogLik, data: D[O]
): (P => LogLik) = (th: P) =>
pFilter(simX0(th), data, dataLik(th), stepFun(th))._1


Note that this higher-order function does not return a value, but instead a function which will accept a parameter as input and return a (log-)likelihood as output. This can then be used for parameter estimation purposes, perhaps being used in a PMMH pMCMC algorithm, or something else. Again, this is a separate concern.

### Example

Here I’ll just give a completely trivial toy example, purely to show how the functions work. For avoidance of doubt, I know that there are many better/simpler/easier ways to tackle this problem! Here we will just look at inferring the auto-regression parameter of a linear Gaussian AR(1)-plus-noise model using the functions we have developed.

First we can simulate some synthetic data from this model, using a value of 0.8 for the auto-regression parameter:

val inNoise = Gaussian(0.0, 1.0).sample(99)
val state = DenseVector(inNoise.scanLeft(0.0)((s, i) => 0.8 * s + i).toArray)
val noise = DenseVector(Gaussian(0.0, 2.0).sample(100).toArray)
val data = (state + noise).toArray.toList


Now assuming that we don’t know the auto-regression parameter, we can construct a function to evaluate the likelihood of different parameter values as follows:

val mll = pfMll(
(th: Double) => Gaussian(0.0, 10.0).sample(10000).toVector.par,
(th: Double) => (s: Double) => Gaussian(th * s, 1.0).draw,
(th: Double) => (s: Double, o: Double) => Gaussian(s, 2.0).logPdf(o),
data
)


Note that the 4 characters “.par” at the end of line 2 are the only difference between running this code serially or in parallel! Now we can run this code by calling the returned function with different values. So, hopefully mll(0.8) will return a larger log-likelihood than (say) mll(0.6) or mll(0.9). The example code in the github repo plots the results of calling mll() for a range of values (note that if that was the genuine use-case, then it would be much better to parallellise the parameter range than the particle filter, due to providing better parallelisation granularity, but many other examples require parallelisation of the particle filter itself). In this particular example, both the forward model and the likelihood are very cheap operations, so there is little to be gained from parallelisation. Nevertheless, I still get a speedup of more than a factor of two using the parallel version on my laptop.

### Conclusion

In this post we have shown how typeclasses can be used in Scala to write code that is parallelisation-agnostic. Code written in this way can be run on one or many cores as desired. We’ve illustrated the concept with a scalable particle filter, but nothing about the approach is specific to that application. It would be easy to build up a library of statistical routines this way, all of which can effectively exploit available parallel hardware. Further, although we haven’t demonstrated it here, it is trivial to extend this idea to allow code to be distribution over a cluster of parallel machines if necessary. For example, if an Apache Spark cluster is available, it is easy to make a Spark RDD instance for our generic collection typeclass, that will then allow us to run our (unmodified) particle filter code over a Spark cluster. This emphasises the fact that Spark can be useful for distributing computation as well as just processing “big data”. I’ll say more about Spark in subsequent posts.

## Tuning particle MCMC algorithms

Several papers have appeared recently discussing the issue of how to tune the number of particles used in the particle filter within a particle MCMC algorithm such as particle marginal Metropolis Hastings (PMMH). Three such papers are:

I have discussed psuedo marginal MCMC and particle MCMC algorithms in previous posts. It will be useful to refer back to these posts if these topics are unfamiliar. Within particle MCMC algorithms (and psuedo-marginal MCMC algorithms, more generally), an unbiased estimate of marginal likelihood is constructed using a number of particles. The more particles that are used, the better the estimate of marginal likelihood is, and the resulting MCMC algorithm will behave more like a “real” marginal MCMC algorithm. For a small number of particles, the algorithm will still have exactly the correct target, but the noise in the unbiased estimator of marginal likelihood will lead to poor mixing of the MCMC chain. The idea is to use just enough particles to ensure that there isn’t “too much” noise in the unbiased estimator, but not to waste lots of time producing a super-accurate estimate of marginal likelihood if that isn’t necessary to ensure good mixing of the MCMC chain.

The papers above try to give theoretical justifications for certain “rules of thumb” that are commonly used in practice. One widely adopted scheme is to tune the number of particles so that the variance of the log of the estimate of marginal liklihood is around one. The obvious questions are “where?” and “why?”, and these questions turn out to be connected. As we will see, there isn’t really a good answer to the “where?” question, but what people usually do is use a pilot run to get an estimate of the posterior mean, or mode, or MLE, and then pick one and tune the noise variance at that particular parameter value. As to “why?”, well, the papers above make various (slightly different) assumptions, all of which lead to trading off mixing against computation time to obtain an “optimal” number of particles. They don’t all agree that the variance of the noise should be exactly 1, but they all agree to an order of magnitude.

All of the above papers make the assumption that the noise distribution associated with the marginal likelihood estimate is independent of the parameter at which it is being evaluated, which explains why there isn’t a really good answer to the “where?” question – under the assumption it doesn’t matter what parameter value is used for tuning – they are all the same! Easy. Except that’s quite a big assumption, so it would be nice to know that it is reasonable, and unfortunately it isn’t. Let’s look at an example to see what goes wrong.

#### Example

In Chapter 10 of my book I look in detail at constructing a PMMH algorithm for inferring the parameters of a discretely observed stochastic Lotka-Volterra model. I’ve stepped through the computational details in a previous post which you should refer back to for the necessary background. Following that post, we can construct a particle filter to return an unbiased estimate of marginal likelihood using the following R code (which relies on the smfsb CRAN package):

require(smfsb)
# data
data(LVdata)
data=as.timedData(LVnoise10)
noiseSD=10
# measurement error model
dataLik <- function(x,t,y,log=TRUE,...)
{
ll=sum(dnorm(y,x,noiseSD,log=TRUE))
if (log)
return(ll)
else
return(exp(ll))
}
# now define a sampler for the prior on the initial state
simx0 <- function(N,t0,...)
{
mat=cbind(rpois(N,50),rpois(N,100))
colnames(mat)=c("x1","x2")
mat
}
# construct particle filter
mLLik=pfMLLik(150,simx0,0,stepLVc,dataLik,data)


Again, see the relevant previous post for details. So now mLLik() is a function that will return the log of an unbiased estimate of marginal likelihood (based on 150 particles) given a parameter value at which to evaluate.

What we are currently wondering is whether the noise in the estimate is independent of the parameter at which it is evaluated. We can investigate this for this filter easily by looking at how the estimate varies as the first parameter (prey birth rate) varies. The following code computes a log likelihood estimate across a range of values and plots the result.

mLLik1=function(x){mLLik(th=c(th1=x,th2=0.005,th3=0.6))}
x=seq(0.7,1.3,length.out=5001)
y=sapply(x,mLLik1)
plot(x[y>-1e10],y[y>-1e10])


The resulting plot is as follows (click for full size):

So, looking at the plot, it is very clear that the noise variance certainly isn’t constant as the parameter varies – it varies substantially. Furthermore, the way in which it varies is “dangerous”, in that the noise is smallest in the vicinity of the MLE. So, if a parameter close to the MLE is chosen for tuning the number of particles, this will ensure that the noise is small close to the MLE, but not elsewhere in parameter space. This could have bad consequences for the mixing of the MCMC algorithm as it explores the tails of the posterior distribution.

So with the above in mind, how should one tune the number of particles in a pMCMC algorithm? I can’t give a general answer, but I can explain what I do. We can’t rely on theory, so a pragmatic approach is required. The above rule of thumb usually gives a good starting point for exploration. Then I just directly optimise ESS per CPU second of the pMCMC algorithm from pilot runs for varying numbers of particles (and other tuning parameters in the algorithm). ESS is “expected sample size”, which can be estimated using the effectiveSize() function in the coda CRAN package. Ugly and brutish, but it works…

## Introduction to the particle Gibbs sampler

### Introduction

Particle MCMC (the use of approximate SMC proposals within exact MCMC algorithms) is arguably one of the most important developments in computational Bayesian inference of the 21st Century. The key concepts underlying these methods are described in a famously impenetrable “read paper” by Andrieu et al (2010). Probably the most generally useful method outlined in that paper is the particle marginal Metropolis-Hastings (PMMH) algorithm that I have described previously – that post is required preparatory reading for this one.

In this post I want to discuss some of the other topics covered in the pMCMC paper, leading up to a description of the particle Gibbs sampler. The basic particle Gibbs algorithm is arguably less powerful than PMMH for a few reasons, some of which I will elaborate on. But there is still a lot of active research concerning particle Gibbs-type algorithms, which are attempting to address some of the deficiencies of the basic approach. Clearly, in order to understand and appreciate the recent developments it is first necessary to understand the basic principles, and so that is what I will concentrate on here. I’ll then finish with some pointers to more recent work in this area.

### PIMH

I will adopt the same approach and notation as for my post on the PMMH algorithm, using a simple bootstrap particle filter for a state space model as the SMC proposal. It is simplest to understand particle Gibbs first in the context of known static parameters, and so it is helpful to first reconsider the special case of the PMMH algorithm where there are no unknown parameters and only the state path, $x$ of the process is being updated. That is, we target $p(x|y)$ (for known, fixed, $\theta$) rather than $p(\theta,x|y)$. This special case is known as the particle independent Metropolis-Hastings (PIMH) sampler.

Here we envisage proposing a new path $x_{0:T}^\star$ using a bootstrap filter, and then accepting the proposal with probability $\min\{1,A\}$, where $A$ is the Metropolis-Hastings ratio

$\displaystyle A = \frac{\hat{p}(y_{1:T})^\star}{\hat{p}(y_{1:T})},$

where $\hat{p}(y_{1:T})^\star$ is the bootstrap filter’s estimate of marginal likelihood for the new path, and $\hat{p}(y_{1:T})$ is the estimate associated with the current path. Again using notation from the previous post it is clear that this ratio targets a distribution on the joint space of all simulated random variables proportional to

$\displaystyle \hat{p}(y_{1:T})\tilde{q}(\mathbf{x}_0,\ldots,\mathbf{x}_T,\mathbf{a}_0,\ldots,\mathbf{a}_{T-1})$

and that in this case the marginal distribution of the accepted path is exactly $p(x_{0:T}|y_{1:T})$. Again, be sure to see the previous post for the explanation.

### Conditional SMC update

So far we have just recapped the previous post in the case of known parameters, but it gives us insight in how to proceed. A general issue with Metropolis independence samplers in high dimensions is that they often exhibit “sticky” behaviour, whereby an unusually “good” accepted path is hard to displace. This motivates consideration of a block-Gibbs-style algorithm where updates are used that are always accepted. It is clear that simply running a bootstrap filter will target the particle filter distribution

$\tilde{q}(\mathbf{x}_0,\ldots,\mathbf{x}_T,\mathbf{a}_0,\ldots,\mathbf{a}_{T-1})$

and so the marginal distribution of the accepted path will be the approximate $\hat{p}(x_{0:T}|y_{1:T})$ rather than the exact conditional distribution $p(x_{0:T}|y_{1:T})$. However, we know from consideration of the PIMH algorithm that what we really want to do is target the slightly modified distribution proportional to

$\displaystyle \hat{p}(y_{1:T})\tilde{q}(\mathbf{x}_0,\ldots,\mathbf{x}_T,\mathbf{a}_0,\ldots,\mathbf{a}_{T-1})$,

as this will lead to accepted paths with the exact marginal distribution. For the PIMH this modification is achieved using a Metropolis-Hastings correction, but we now try to avoid this by instead conditioning on the previously accepted path. For this target the accepted paths have exactly the required marginal distribution, so we now write the target as the product of the marginal for the current path times a conditional for all of the remaining variables.

$\displaystyle \frac{p(x_{0:T}^k|y_{1:T})}{M^T} \times \frac{M^T}{p(x_{0:T}^k|y_{1:T})} \hat{p}(y_{1:T})\tilde{q}(\mathbf{x}_0,\ldots,\mathbf{x}_T,\mathbf{a}_0,\ldots,\mathbf{a}_{T-1})$

where in addition to the correct marginal for $x$ we assume iid uniform ancestor indices. The important thing to note here is that the conditional distribution of the remaining variables simplifies to

$\displaystyle \frac{\tilde{q}(\mathbf{x}_0,\ldots,\mathbf{x}_T,\mathbf{a}_0,\ldots,\mathbf{a}_{T-1})} {\displaystyle p(x_0^{b_0^k})\left[\prod_{t=0}^{T-1} \pi_t^{b_t^k}p\left(x_{t+1}^{b_{t+1}^k}|x_t^{b_t^k}\right)\right]}$.

The terms in the denominator are precisely the terms in the numerator corresponding to the current path, and hence “cancel out” the current path terms in the numerator. It is therefore clear that we can sample directly from this conditional distribution by running a bootstrap particle filter that includes the current path and which leaves the current path fixed. This is the conditional SMC (CSMC) update, which here is just a conditional bootstrap particle filter update. It is clear from the form of the conditional density how this filter must be constructed, but for completeness it is described below.

The bootstrap filter is run conditional on one trajectory. This is usually the trajectory sampled at the last run of the particle filter. The idea is that you do not sample new state or ancestor values for that one trajectory. Note that this guarantees that the conditioned on trajectory survives the filter right through to the final sweep of the filter at which point a new trajectory is picked from the current selection of $M$ paths, of which the conditioned-on trajectory is one.

Let $x_{1:T} = (x_1^{b_1},x_2^{b_2},\ldots,x_T^{b_T})$ be the path that is to be conditioned on, with ancestral lineage $b_{1:T}$. Then, for $k\not= b_1$, sample $x_0^k \sim p(x_0)$ and set $\pi_0^k=1/M$. Now suppose that at time $t$ we have a weighted sample from $p(x_t|y_{1:t})$. First resample by sampling $a_t^k\sim \mathcal{F}(a_t^k|\boldsymbol{\pi}_t),\ \forall k\not= b_t$. Next sample $x_{t+1}^k\sim p(x_{t+1}^k|x_t^{a_t^k}),\ \forall k\not=b_t$. Then for all $k$ set $w_{t+1}^k=p(y_{t+1}|x_{t+1}^k)$ and normalise with $\pi_{t+1}^k=w_{t+1}^k/\sum_{i=1}^M w_{t+1}^i$. Propagate this weighted set of particles to the next time point. At time $T$ select a single trajectory by sampling $k'\sim \mathcal{F}(k'|\boldsymbol{\pi}_T)$.

This defines a block Gibbs sampler which updates $2(M-1)T+1$ of the $2MT+1$ random variables in the augmented state space at each iteration. Since the block of variables to be updated is random, this defines an ergodic sampler for $M\geq2$ particles, and we have explained why the marginal distribution of the selected trajectory is the exact conditional distribution.

Before going on to consider the introduction of unknown parameters, it is worth considering the limitations of this method. One of the main motivations for considering a Gibbs-style update was concern about the “stickiness” of a Metropolis independence sampler. However, it is clear that conditional SMC updates also have the potential to stick. For a large number of time points, particle filter genealogies coalesce, or degenerate, to a single path. Since here we are conditioning on the current path, if there is coalescence, it is guaranteed to be to the previous path. So although the conditional SMC updates are always accepted, it is likely that much of the new path will be identical to the previous path, which is just another kind of “sticking” of the sampler. This problem with conditional SMC and particle Gibbs more generally is well recognised, and quite a bit of recent research activity in this area is directed at alleviating this sticking problem. The most obvious strategy to use is “backward sampling” (Godsill et al, 2004), which has been used in this context by Lindsten and Schon (2012), Whiteley et al (2010), and Chopin and Singh (2013), among others. Another related idea is “ancestor sampling” (Lindsten et al, 2014), which can be done in a single forward pass. Both of these techniques work well, but both rely on the tractability of the transition kernel of the state space model, which can be problematic in certain applications.

### Particle Gibbs sampling

As we are working in the context of Gibbs-style updates, the introduction of static parameters, $\theta$, into the problem is relatively straightforward. It turns out to be correct to do the obvious thing, which is to alternate between sampling $\theta$ given $y$ and the currently sampled path, $x$, and sampling a new path using a conditional SMC update, conditional on the previous path in addition to $\theta$ and $y$. Although this is the obvious thing to do, understanding exactly why it works is a little delicate, due to the augmented state space and conditional SMC update. However, it is reasonably clear that this strategy defines a “collapsed Gibbs sampler” (Lui, 1994), and so actually everything is fine. This particular collapsed Gibbs sampler is relatively easy to understand as a marginal sampler which integrates out the augmented variables, but then nevertheless samples the augmented variables at each iteration conditional on everything else.

Note that the Gibbs update of $\theta$ may be problematic in the context of a state space model with intractable transition kernel.

In a subsequent post I’ll show how to code up the particle Gibbs and other pMCMC algorithms in a reasonably efficient way.

## Parallel particle filtering and pMCMC using R and multicore

### Introduction

In a previous post I showed how to construct a PMMH pMCMC algorithm for parameter estimation with partially observed Markov processes. The inner loop of a pMCMC algorithm consists of running a particle filter to construct an unbiased estimate of marginal likelihood. This inner loop is the place where the code spends almost all of its time, and so speeding up the particle filter will result in dramatic speedup of the pMCMC algorithm. This is fortunate, since as previously discussed, MCMC algorithms are difficult to parallelise other than on a per iteration basis. Here, each iteration can be speeded up if we can effectively parallelise a particle filter. Particle filters are much easier to parallelise than MCMC algorithms, and so it is tempting to try and exploit this within R. In fact, although it is the case that it is possible to effectively parallelise particle filters in efficient languages using low-level parallelisation tools (say, using C with MPI, or Java concurrency tools), it is not so easy to speed up R-based particle filters using R’s high-level parallelisation constructs, as we shall see.

### Particle filters

In the previous post we looked at the function pfMLLik within the CRAN package smfsb. As a reminder, the source code is

pfMLLik <- function (n, simx0, t0, stepFun, dataLik, data)
{
times = c(t0, as.numeric(rownames(data)))
deltas = diff(times)
return(function(...) {
xmat = simx0(n, t0, ...)
ll = 0
for (i in 1:length(deltas)) {
xmat = t(apply(xmat, 1, stepFun, t0 = times[i], deltat = deltas[i], ...))
w = apply(xmat, 1, dataLik, t = times[i + 1], y = data[i,], log = FALSE, ...)
if (max(w) < 1e-20) {
warning("Particle filter bombed")
return(-1e+99)
}
ll = ll + log(mean(w))
rows = sample(1:n, n, replace = TRUE, prob = w)
xmat = xmat[rows, ]
}
ll
})
}


The function itself doesn’t actually run a particle filter, but instead returns a function closure which does (see the previous post for a discussion of lexical scope and function closures in R). There are obviously several different steps within the particle filter, and several of these are amenable to parallelisation. However, for complex models, forward simulation from the model will be the rate-limiting step, where the vast majority of CPU cycles will be spent. Line 9 in the above code is where forward simulation takes place, and in particular, the key function call is the apply call:

apply(xmat, 1, stepFun, t0 = times[i], deltat = deltas[i], ...)


This call applies the forward simulation algorithm stepFun to each row of the matrix xmat independently. Since there are no dependencies between the function calls, this is in principle very straightforward to parallelise on multicore hardware.

### Multicore support in R

I’m writing this post on a laptop with an Intel i7 quad core chip, running the 64 bit version of Ubuntu 11.10. R has support for multicore processing on this platform – it is just a simple matter of installing the relevant packages. However, things are changing rapidly regarding multicore support in R right now, so YMMV. Ubuntu 11.10 has R 2.13 by default, but the multicore support is slightly different in the recently released R 2.14. I’m still using R 2.13. I may update this post (or comment) when I move to R 2.14. The main difference is that the package multicore has been replaced by the package parallel. There are a few other minor changes, but it should be easy to adapt what is presented here to 2.14.

There is a new O’Reilly book called Parallel R. I’ve got a copy of it. It does cover the new parallel package in R 2.14, as well as other parallel R topics, but the book is a bit light weight, to say the least, and I reviewed it on this blog. Please read my review for further details before you buy it.

If you haven’t used multicore in R previously, then

install.packages(c("multicore","doMC"))


should get you started (again, I’m assuming that your R version is strictly < 2.14). You can test it has worked with:

library(multicore)
multicore:::detectCores()


When I do this, I get the answer 8 (I have 4 cores, each of which is hyper-threaded). To begin with, I want to tell R to use just 4 process threads, and I can do this with

library(doMC)
registerDoMC(4)


Replacing the second line with registerDoMC() will set things up to use all detected cores (in my case, 8). There are a couple of different strategies we could use to parallelise this. One strategy for parallelising the apply call discussed above is to be to replace it with a foreach / %dopar% loop. This is best illustrated by example. Start with line 9 from the function pfMLLik:

xmat = t(apply(xmat, 1, stepFun, t0 = times[i], deltat = deltas[i], ...))


We can produce a parallelised version by replacing this line with the following block of code:

res=foreach(j=1:dim(xmat)[1]) %dopar% {
stepFun(xmat[j,], t0 = times[i], deltat = deltas[i], ...)
}
xmat=t(sapply(res,cbind))


Each iteration of the foreach loop is executed independently (possibly using multiple cores), and the result of each iteration is returned as a list, and captured in res. This list of return vectors is then coerced back into a matrix with the final line.

In fact, we can improve on this by using the .combine argument to foreach, which describes how to combine the results from each iteration. Here we can just use rbind to combine the results into a matrix, using:

xmat=foreach(j=1:dim(xmat)[1], .combine="rbind") %dopar% {
stepFun(xmat[j,], t0 = times[i], deltat = deltas[i], ...)
}


This code is much neater, and in principle ought to be a bit faster, though I haven’t noticed much difference in practice.

In fact, it is not necessary to use the foreach construct at all. The multicore package provides the mclapply function, which is a multicore version of lapply. To use mclapply (or, indeed, lapply) here, we first need to split our matrix into a list of rows, which we can do using the split command. So in fact, our apply call can be replaced with the single line:

xmat=t(sapply(mclapply(split(xmat,row(xmat)), stepFun, t0=times[i], deltat=deltas[i], ...),cbind))


This is actually a much cleaner solution than the method using foreach, but it does require grokking a bit more R. Note that mclapply uses a different method to specify the number of threads to use than foreach/doMC. Here you can either use the named argument to mclapply, mc.cores, or use options(), eg. options(cores=4).

As well as being much cleaner, I find that the mclapply approach is much faster than the foreach/dopar approach for this problem. I’m guessing that this is because foreach doesn’t pre-schedule tasks by default, whereas mclapply does, but I haven’t had a chance to dig into this in detail yet.

### A parallelised particle filter

We can now splice the parallelised forward simulation step (using mclapply) back into our particle filter function to get:

require(multicore)
pfMLLik <- function (n, simx0, t0, stepFun, dataLik, data)
{
times = c(t0, as.numeric(rownames(data)))
deltas = diff(times)
return(function(...) {
xmat = simx0(n, t0, ...)
ll = 0
for (i in 1:length(deltas)) {
xmat=t(sapply(mclapply(split(xmat,row(xmat)), stepFun, t0=times[i], deltat=deltas[i], ...),cbind))
w = apply(xmat, 1, dataLik, t = times[i + 1], y = data[i,], log = FALSE, ...)
if (max(w) < 1e-20) {
warning("Particle filter bombed")
return(-1e+99)
}
ll = ll + log(mean(w))
rows = sample(1:n, n, replace = TRUE, prob = w)
xmat = xmat[rows, ]
}
ll
})
}


This can be used in place of the version supplied with the smfsb package for slow simulation algorithms running on modern multicore machines.

There is an issue regarding Monte Carlo simulations such as this and the multicore package (whether you use mclapply or foreach/dopar) in that it adopts a “different seeds” approach to parallel random number generation, rather than a true parallel random number generator. This probably isn’t worth worrying too much about now, since it is fixed in the new parallel package in R 2.14, but is something to be aware of. I discuss parallel random number generation issues in Wilkinson (2005).

### Granularity

The above code is now a parallel particle filter, and can now be used in place of the serial version that is part of the smfsb package. However, if you try it out on a simple example, you will most likely be disappointed. In particular, if you use it for the pMCMC example discussed in the previous post, you will see that the parallel version of the example actually runs much slower than the serial version (at least, it does for me). However, that is because the forward simulator stepFun, used in that example, was actually a very fast simulation algorithm, stepLVc, written in C. In this case, the overhead of setting up and closing down the threads, and distributing the tasks, and collating the results from the worker threads back in the master thread, etc., outweighs the advantage of running the individual tasks in parallel. This is why parallel programming is difficult. What is needed here is for the individual tasks to be sufficiently computationally intensive that the overheads associated with parallelisation are easily outweighed by the ability to run the tasks in parallel. In the context of particle filtering, this is particularly problematic, as if the forward simulator is very slow, running a reasonable particle filter is going to be very, very slow, and then you probably don’t want to be working in R anyway… Parallelising a particle filter written in C using MPI is much more likely to be successful, as it offers much more fine grained control of exactly how the tasks and processes are managed, but at the cost of increased development time. In a previous post I gave an introduction to parallel Monte Carlo with C and MPI, and I’ve written more extensively about parallel MCMC in Wilkinson (2005). It also looks as though the new parallel package in R 2.14 offers more control of parallelisation, so that also might help. However, if you are using a particle filter as part of a pMCMC algorithm, there is another strategy you can use at a higher level of granularity which might be useful even within R in some situations.

### Multiple particle filters and pMCMC

Let’s look again at the main loop of the pMCMC algorithm discussed in the previous post:

for (i in 1:iters) {
message(paste(i,""),appendLF=FALSE)
for (j in 1:thin) {
thprop=th*exp(rnorm(p,0,tune))
llprop=mLLik(thprop)
if (log(runif(1)) < llprop - ll) {
th=thprop
ll=llprop
}
}
thmat[i,]=th
}


It is clear that the main computational bottleneck of this code is the call to mLLik on line 5, as this is the call which runs the particle filter. The purpose of making the call is to obtain an unbiased estimate of marginal likelihood. However, there are plenty of other ways that we can obtain such estimates than by running a single particle filter. In particular, we could run multiple particle filters and average the results. So, let’s look at how to do this in the multicore setting. Let’s start by thinking about running 4 particle filters. We could just replace the line

llprop=mLLik(thprop)


with the code

llprop=0.25*foreach(i=1:4, .combine="+") %dopar% {
mLLik(thprop)
}


Now, there are at least 2 issues with this. The first is that we are now just running 4 particle filters rather than 1, and so even with perfect parallelisation, it will run no quicker than the code we started with. However, the idea is that by running 4 particle filters we ought to be able to get away with each particle filter using fewer particles, though it isn’t trivial to figure out exactly how many. For example, averaging the results from 4 particle filters, each of which uses 25 particles is not as good as running a single particle filter with 100 particles. In practice, some trial and error is likely to be required. The second problem is that we have computed the mean of the log of the likelihoods, and not the likelihoods themselves. This will almost certainly work fine in practice, as the resulting estimate will in most cases be very close to unbiased, but it will not be exactly unbiased, as so will not lead to an “exact” approximate algorithm. In principle, this can be fixed by instead using

res=foreach(i=1:4) %dopar% {
mLLik(thprop)
}
llprop=log(mean(sapply(res,exp)))


but in practice this is likely to be subject to numerical underflow problems, as it involves manipulating raw likelihood values, which is generally a bad idea. It is possible to compute the log of the mean of the likelihoods in a more numerically stable way, but that is left as an exercise for the reader, as this post is way too long already… However, one additional tip worth mentioning is that the foreach package includes a convenience function called times for situations like the above, where the argument is not varying over calls. So the above code can be replaced with

res=times(4) %dopar% mLLik(thprop)
llprop=log(mean(sapply(res,exp)))


which is a bit cleaner and more readable.

Using this approach to parallelisation, there is now a much better chance of getting some speedup on multicore architectures, as the granularity of the tasks being parallelised is now much larger. Consider the example from the previous post, where at each iteration we ran a particle filter with 100 particles. If we now re-run that example, but instead use 4 particle filters each using 25 particles, we do get a slight speedup. However, on my laptop, the speedup is only around a factor of 1.6 using 4 cores, and as already discussed, 4 filters each with 25 particles isn’t actually quite as good as a single filter with 100 particles anyway. So, the benefits are rather modest here, but will be much better with less trivial examples (slower simulators). For completeness, a complete runnable demo script is included after the references. Also, it is probably worth emphasising that if your pMCMC algorithm has a short burn-in period, you may well get much better overall speed-ups by just running parallel MCMC chains. Depressing, perhaps, but true.

### References

• McCallum, E., Weston, S. (2011) Parallel R, O’Reilly.
• Wilkinson, D. J. (2005) Parallel Bayesian Computation, Chapter 16 in E. J. Kontoghiorghes (ed.) Handbook of Parallel Computing and Statistics, Marcel Dekker/CRC Press, 481-512.
• Wilkinson, D. J. (2011) Stochastic Modelling for Systems Biology, second edition, Boca Raton, Florida: Chapman & Hall/CRC Press.
• ### Demo script

require(smfsb)
data(LVdata)

require(multicore)
require(doMC)
registerDoMC(4)

# set up data likelihood
noiseSD=10
dataLik <- function(x,t,y,log=TRUE,...)
{
ll=sum(dnorm(y,x,noiseSD,log=TRUE))
if (log)
return(ll)
else
return(exp(ll))
}
# now define a sampler for the prior on the initial state
simx0 <- function(N,t0,...)
{
mat=cbind(rpois(N,50),rpois(N,100))
colnames(mat)=c("x1","x2")
mat
}
# convert the time series to a timed data matrix
LVdata=as.timedData(LVnoise10)
# create marginal log-likelihood functions, based on a particle filter

# use 25 particles instead of 100
mLLik=pfMLLik(25,simx0,0,stepLVc,dataLik,LVdata)

iters=1000
tune=0.01
thin=10
th=c(th1 = 1, th2 = 0.005, th3 = 0.6)
p=length(th)
ll=-1e99
thmat=matrix(0,nrow=iters,ncol=p)
colnames(thmat)=names(th)
# Main pMCMC loop
for (i in 1:iters) {
message(paste(i,""),appendLF=FALSE)
for (j in 1:thin) {
thprop=th*exp(rnorm(p,0,tune))
res=times(4) %dopar% mLLik(thprop)
llprop=log(mean(sapply(res,exp)))
if (log(runif(1)) < llprop - ll) {
th=thprop
ll=llprop
}
}
thmat[i,]=th
}
message("Done!")
# Compute and plot some basic summaries
mcmcSummary(thmat)


## Particle filtering and pMCMC using R

In the previous post I gave a quick introduction to the CRAN R package smfsb, and how it can be used for simulation of Markov processes determined by stochastic kinetic networks. In this post I’ll show how to use data and particle MCMC techniques in order to carry out Bayesian inference for the parameters of partially observed Markov processes.

### The simulation model and the data

For this post we will assume that the smfsb package is installed and loaded (see previous post for details). The package includes the function stepLVc which simulates from the Markov transition kernel of a Lotka-Volterra process by calling out to some native C code for speed. So, for example,

stepLVc(c(x1=50,x2=100),0,1)


will simulate the state of the process at time 1 given an initial condition of 50 prey and 100 predators at time 0, using the default rate parameters of the function, th = c(1, 0.005, 0.6). The package also includes some data simulated from this model using these parameters, with and without added noise. The datasets can be loaded with

data(LVdata)


For simplicity, we will just make use of the dataset LVnoise10 in this post. This dataset is a multivariate time series consisting of 16 equally spaced observations on both prey and predators subject to Gaussian measurement error with a standard deviation of 10. We can plot the data with

plot(LVnoise10,plot.type="single",col=c(2,4))


giving:

The Bayesian inference problem is to see how much we are able to learn about the parameters which generated the data using only the data and our knowledge of the structure of the problem. There are many approaches one can take to this problem, but most are computationally intensive, due to the analytical intractability of the transition kernel of the LV process. Here we will follow Wilkinson (2011) and take a particle MCMC (pMCMC) approach, and specifically, use a pseudo-marginal “exact approximate” MCMC algorithm based on the particle marginal Metropolis-Hastings (PMMH) algorithm. I have discussed the pseudo-marginal approach, using particle filters for marginal likelihood estimation, and the PMMH algorithm in previous posts, so if you have been following my posts for a while, this should all make perfect sense…

### Particle filter

One of the key ingredients required to implement the pseudo-marginal MCMC scheme is a (bootstrap) particle filter which generates an unbiased estimate of the marginal likelihood of the data given the parameters (integrated over the unobserved state trajectory). The algorithm was discussed in this post, and R code to implement this is included in the smfsb R package as pfMLLik. For reasons of numerical stability, the function computes and returns the log of the marginal likelihood, but it is important to understand that it is the actually likelihood estimate that is unbiased for the true likelihood, and not the corresponding statement for the logs. The actual code of the function is relatively short, and for completeness is given below:

pfMLLik <- function (n, simx0, t0, stepFun, dataLik, data)
{
times = c(t0, as.numeric(rownames(data)))
deltas = diff(times)
return(function(...) {
xmat = simx0(n, t0, ...)
ll = 0
for (i in 1:length(deltas)) {
xmat = t(apply(xmat, 1, stepFun, t0 = times[i], deltat = deltas[i], ...))
w = apply(xmat, 1, dataLik, t = times[i + 1], y = data[i,], log = FALSE, ...)
if (max(w) < 1e-20) {
warning("Particle filter bombed")
return(-1e+99)
}
ll = ll + log(mean(w))
rows = sample(1:n, n, replace = TRUE, prob = w)
xmat = xmat[rows, ]
}
ll
})
}


We need to set up the prior and the data likelihood correctly before we can use this function, but first note that the function does not actually run a particle filter at all, but instead stores everything it needs to know to run the particle filter in the local environment, and then returns a function closure for evaluating the marginal likelihood at a given set of parameters. The resulting function (closure) can then be used to run a particle filter for a given set of parameters, simply by passing the required parameters into the function. This functional programming style is consistent with that used throughout the smfsb R package, and leads to quite simple, modular code. To use pfMLLik, we first need to define a function which evaluates the log-likelihood of an observation conditional on the true state, and another which samples from the prior distribution of the initial state of the system. Here, we can do that as follows.

# set up data likelihood
noiseSD=10
dataLik <- function(x,t,y,log=TRUE,...)
{
ll=sum(dnorm(y,x,noiseSD,log=TRUE))
if (log)
return(ll)
else
return(exp(ll))
}
# now define a sampler for the prior on the initial state
simx0 <- function(N,t0,...)
{
mat=cbind(rpois(N,50),rpois(N,100))
colnames(mat)=c("x1","x2")
mat
}
# convert the time series to a timed data matrix
LVdata=as.timedData(LVnoise10)
# create marginal log-likelihood functions, based on a particle filter
mLLik=pfMLLik(100,simx0,0,stepLVc,dataLik,LVdata)


Now the function (closure) mLLik will, for a given parameter vector, run a particle filter (using 100 particles) and return the log of the particle filter’s unbiased estimate of the marginal likelihood of the data. It is then very easy to use this function to create a simple PMMH algorithm for parameter inference.

### PMMH algorithm

Below is an algorithm based on flat priors and a simple Metropolis-Hastings update for the parameters using the function closure mLLik, defined above.

iters=1000
tune=0.01
thin=10
th=c(th1 = 1, th2 = 0.005, th3 = 0.6)
p=length(th)
ll=-1e99
thmat=matrix(0,nrow=iters,ncol=p)
colnames(thmat)=names(th)
# Main pMCMC loop
for (i in 1:iters) {
message(paste(i,""),appendLF=FALSE)
for (j in 1:thin) {
thprop=th*exp(rnorm(p,0,tune))
llprop=mLLik(thprop)
if (log(runif(1)) < llprop - ll) {
th=thprop
ll=llprop
}
}
thmat[i,]=th
}
message("Done!")
# Compute and plot some basic summaries
mcmcSummary(thmat)


This will take a little while to run, but in the end should give a plot something like the following (click for full size):

So, although we should really run the chain for a bit longer, we see that we can learn a great deal about the parameters of the process from very little data. For completeness, a full runnable demo script is included below the references. Of course there are many obvious extensions of this basic problem, such as partial observation (eg. only observing the prey) and unknown measurement error. These are discussed in Wilkinson (2011), and code for these cases is included within the demo(PMCMC), which should be inspected for further details.

### Discussion

At this point it is probably worth emphasising that there are other “likelihood free” approaches which can be taken to parameter inference for partially observed Markov process (POMP) models. Many of these are implemented in the pomp R package, also available from CRAN, by King et al (2008). The pomp package is well documented, and has a couple of good tutorial vignettes which should be sufficient to get people started. The API of the package is rather cumbersome, but the algorithms appear to be quite robust. Approximate Bayesian computation (ABC) approaches are also quite natural for POMP models (see, for example, Toni et al (2009)). This is because “exact” likelihood free procedures break down in the case of low/no measurement error or high-dimensional observations. There are some R packages for ABC, but I am not sufficiently familiar with them to be able to give recommendations.

If one is able to move away from the “likelihood free” paradigm, it is possible to develop “exact” pMCMC algorithms which do not break down in challenging observation scenarios. The problem here is the intractability of the Markov transition kernel. In the case of nonlinear Markov jump processes, finding very generic solutions seems quite difficult, but for diffusion (approximation) processes based on stochastic differential equations, it seems to be possible to develop irreducible pMCMC algorithms which have very broad applicability – see Golightly and Wilkinson (2011) for further details of how such techniques can be used in the context of stochastic kinetic models similar to those considered in this post.

### References

• Golightly, A., Wilkinson, D. J. (2011) Bayesian parameter inference for stochastic biochemical network models using particle MCMC, Interface Focus, 1(6):807-820.
• King, A.A., Ionides, E.L., & Breto, C.M. (2008) pomp: Statistical inference for partially observed Markov processes, CRAN.
• Toni, T., Welch, D., Strelkowa, N., Ipsen, A. & Stumpf, M. (2009) Approximate Bayesian computation scheme for parameter inference and model selection in dynamical systems, J. R. Soc. Interface 6(31): 187-202.
• Wilkinson, D. J. (2011) Stochastic Modelling for Systems Biology, second edition, Boca Raton, Florida: Chapman & Hall/CRC Press.
• ### Demo script

require(smfsb)
data(LVdata)

# set up data likelihood
noiseSD=10
dataLik <- function(x,t,y,log=TRUE,...)
{
ll=sum(dnorm(y,x,noiseSD,log=TRUE))
if (log)
return(ll)
else
return(exp(ll))
}
# now define a sampler for the prior on the initial state
simx0 <- function(N,t0,...)
{
mat=cbind(rpois(N,50),rpois(N,100))
colnames(mat)=c("x1","x2")
mat
}
# convert the time series to a timed data matrix
LVdata=as.timedData(LVnoise10)
# create marginal log-likelihood functions, based on a particle filter
mLLik=pfMLLik(100,simx0,0,stepLVc,dataLik,LVdata)

iters=1000
tune=0.01
thin=10
th=c(th1 = 1, th2 = 0.005, th3 = 0.6)
p=length(th)
ll=-1e99
thmat=matrix(0,nrow=iters,ncol=p)
colnames(thmat)=names(th)
# Main pMCMC loop
for (i in 1:iters) {
message(paste(i,""),appendLF=FALSE)
for (j in 1:thin) {
thprop=th*exp(rnorm(p,0,tune))
llprop=mLLik(thprop)
if (log(runif(1)) < llprop - ll) {
th=thprop
ll=llprop
}
}
thmat[i,]=th
}
message("Done!")
# Compute and plot some basic summaries
mcmcSummary(thmat)


## Introduction

In the previous post I explained how one can use an unbiased estimate of marginal likelihood derived from a particle filter within a Metropolis-Hastings MCMC algorithm in order to construct an exact pseudo-marginal MCMC scheme for the posterior distribution of the model parameters given some time course data. This idea is closely related to that of the particle marginal Metropolis-Hastings (PMMH) algorithm of Andreiu et al (2010), but not really exactly the same. This is because for a Bayesian model with parameters $\theta$, latent variables $x$ and data $y$, of the form

$\displaystyle p(\theta,x,y) = p(\theta)p(x|\theta)p(y|x,\theta),$

the pseudo-marginal algorithm which exploits the fact that the particle filter’s estimate of likelihood is unbiased is an MCMC algorithm which directly targets the marginal posterior distribution $p(\theta|y)$. On the other hand, the PMMH algorithm is an MCMC algorithm which targets the full joint posterior distribution $p(\theta,x|y)$. Now, the PMMH scheme does reduce to the pseudo-marginal scheme if samples of $x$ are not generated and stored in the state of the Markov chain, and it certainly is the case that the pseudo-marginal algorithm gives some insight into why the PMMH algorithm works. However, the PMMH algorithm is much more powerful, as it solves the “smoothing” and parameter estimation problem simultaneously and exactly, including the “initial value” problem (computing the posterior distribution of the initial state, $x_0$). Below I will describe the algorithm and explain why it works, but first it is necessary to understand the relationship between marginal, joint and “likelihood-free” MCMC updating schemes for such latent variable models.

### MCMC for latent variable models

#### Marginal approach

If we want to target $p(\theta|y)$ directly, we can use a Metropolis-Hastings scheme with a fairly arbitrary proposal distribution for exploring $\theta$, where a new $\theta^\star$ is proposed from $f(\theta^\star|\theta)$ and accepted with probability $\min\{1,A\}$, where

$\displaystyle A = \frac{p(\theta^\star)}{p(\theta)} \times \frac{f(\theta|\theta^\star)}{f(\theta^\star|\theta)} \times \frac{p({y}|\theta^\star)}{p({y}|\theta)}.$

As previously discussed, the problem with this scheme is that the marginal likelihood $p(y|\theta)$ required in the acceptance ratio is often difficult to compute.

#### Likelihood-free MCMC

A simple “likelihood-free” scheme targets the full joint posterior distribution $p(\theta,x|y)$. It works by exploiting the fact that we can often simulate from the model for the latent variables $p(x|\theta)$ even when we can’t evaluate it, or marginalise $x$ out of the problem. Here the Metropolis-Hastings proposal is constructed in two stages. First, a proposed new $\theta^\star$ is sampled from $f(\theta^\star|\theta)$ and then a corresponding $x^\star$ is simulated from the model $p(x^\star|\theta^\star)$. The pair $(\theta^\star,x^\star)$ is then jointly accepted with ratio

$\displaystyle A = \frac{p(\theta^\star)}{p(\theta)} \times \frac{f(\theta|\theta^\star)}{f(\theta^\star|\theta)} \times \frac{p(y|{x}^\star,\theta^\star)}{p(y|{x},\theta)}.$

The proposal mechanism ensures that the proposed $x^\star$ is consistent with the proposed $\theta^\star$, and so the procedure can work provided that the dimension of the data $y$ is low. However, in order to work well more generally, we would want the proposed latent variables to be consistent with the data as well as the model parameters.

#### Ideal joint update

Motivated by the likelihood-free scheme, we would really like to target the joint posterior $p(\theta,x|y)$ by first proposing $\theta^\star$ from $f(\theta^\star|\theta)$ and then a corresponding $x^\star$ from the conditional distribution $p(x^\star|\theta^\star,y)$. The pair $(\theta^\star,x^\star)$ is then jointly accepted with ratio

$\displaystyle A = \frac{p(\theta^\star)}{p(\theta)} \frac{p({x}^\star|\theta^\star)}{p({x}|\theta)} \frac{f(\theta|\theta^\star)}{f(\theta^\star|\theta)} \frac{p(y|{x}^\star,\theta^\star)}{p(y|{x},\theta)} \frac{p({x}|y,\theta)}{p({x}^\star|y,\theta^\star)}\\ \qquad = \frac{p(\theta^\star)}{p(\theta)} \frac{p(y|\theta^\star)}{p(y|\theta)} \frac{f(\theta|\theta^\star)}{f(\theta^\star|\theta)}.$

Notice how the acceptance ratio simplifies, using the basic marginal likelihood identity (BMI) of Chib (1995), and $x$ drops out of the ratio completely in order to give exactly the ratio used for the marginal updating scheme. Thus, the “ideal” joint updating scheme reduces to the marginal updating scheme if $x$ is not sampled and stored as a component of the Markov chain.

Understanding the relationship between these schemes is useful for understanding the PMMH algorithm. Indeed, we will see that the “ideal” joint updating scheme (and the marginal scheme) corresponds to PMMH using infinitely many particles in the particle filter, and that the likelihood-free scheme corresponds to PMMH using exactly one particle in the particle filter. For an intermediate number of particles, the PMMH scheme is a compromise between the “ideal” scheme and the “blind” likelihood-free scheme, but is always likelihood-free (when used with a bootstrap particle filter) and always has an acceptance ratio leaving the exact posterior invariant.

### The PMMH algorithm

#### The algorithm

The PMMH algorithm is an MCMC algorithm for state space models jointly updating $\theta$ and $x_{0:T}$, as the algorithms above. First, a proposed new $\theta^\star$ is generated from a proposal $f(\theta^\star|\theta)$, and then a corresponding $x_{0:T}^\star$ is generated by running a bootstrap particle filter (as described in the previous post, and below) using the proposed new model parameters, $\theta^\star$, and selecting a single trajectory by sampling once from the final set of particles using the final set of weights. This proposed pair $(\theta^\star,x_{0:T}^\star)$ is accepted using the Metropolis-Hastings ratio

$\displaystyle A = \frac{\hat{p}_{\theta^\star}(y_{1:T})p(\theta^\star)q(\theta|\theta^\star)}{\hat{p}_{\theta}(y_{1:T})p(\theta)q(\theta^\star|\theta)},$

where $\hat{p}_{\theta^\star}(y_{1:T})$ is the particle filter’s (unbiased) estimate of marginal likelihood, described in the previous post, and below. Note that this approach tends to the perfect joint/marginal updating scheme as the number of particles used in the filter tends to infinity. Note also that for a single particle, the particle filter just blindly forward simulates from $p_\theta(x^\star_{0:T})$ and that the filter’s estimate of marginal likelihood is just the observed data likelihood $p_\theta(y_{1:T}|x^\star_{0:T})$ leading precisely to the simple likelihood-free scheme. To understand for an arbitrary finite number of particles, $M$, one needs to think carefully about the structure of the particle filter.

#### Why it works

To understand why PMMH works, it is necessary to think about the joint distribution of all random variables used in the bootstrap particle filter. To this end, it is helpful to re-visit the particle filter, thinking carefully about the resampling and propagation steps.

First introduce notation for the “particle cloud”: $\mathbf{x}_t=\{x_t^k|k=1,\ldots,M\}$, $\boldsymbol{\pi}_t=\{\pi_t^k|k=1,\ldots,M\}$, $\tilde{\mathbf{x}}_t=\{(x_t^k,\pi_t^k)|k=1,\ldots,M\}$. Initialise the particle filter with $\tilde{\mathbf{x}}_0$, where $x_0^k\sim p(x_0)$ and $\pi_0^k=1/M$ (note that $w_0^k$ is undefined). Now suppose at time $t$ we have a sample from $p(x_t|y_{1:t})$: $\tilde{\mathbf{x}}_t$. First resample by sampling $a_t^k \sim \mathcal{F}(a_t^k|\boldsymbol{\pi}_t)$, $k=1,\ldots,M$. Here we use $\mathcal{F}(\cdot|\boldsymbol{\pi})$ for the discrete distribution on $1:M$ with probability mass function $\boldsymbol{\pi}$. Next sample $x_{t+1}^k\sim p(x_{t+1}^k|x_t^{a_t^k})$. Set $w_{t+1}^k=p(y_{t+1}|x_{t+1}^k)$ and $\pi_{t+1}^k=w_{t+1}^k/\sum_{i=1}^M w_{t+1}^i$. Finally, propagate $\tilde{\mathbf{x}}_{t+1}$ to the next step… We define the filter’s estimate of likelihood as $\hat{p}(y_t|y_{1:t-1})=\frac{1}{M}\sum_{i=1}^M w_t^i$ and $\hat{p}(y_{1:T})=\prod_{i=1}^T \hat{p}(y_t|y_{1:t-1})$. See Doucet et al (2001) for further theoretical background on particle filters and SMC more generally.

Describing the filter carefully as above allows us to write down the joint density of all random variables in the filter as

$\displaystyle \tilde{q}(\mathbf{x}_0,\ldots,\mathbf{x}_T,\mathbf{a}_0,\ldots,\mathbf{a}_{T-1}) = \left[\prod_{k=1}^M p(x_0^k)\right] \left[\prod_{t=0}^{T-1} \prod_{k=1}^M \pi_t^{a_t^k} p(x_{t+1}^k|x_t^{a_t^k}) \right]$

For PMMH we also sample a final index $k'$ from $\mathcal{F}(k'|\boldsymbol{\pi}_T)$ giving the joint density

$\displaystyle \tilde{q}(\mathbf{x}_0,\ldots,\mathbf{x}_T,\mathbf{a}_0,\ldots,\mathbf{a}_{T-1})\pi_T^{k'}$

We write the final selected trajectory as

$\displaystyle x_{0:T}^{k'}=(x_0^{b_0^{k'}},\ldots,x_T^{b_T^{k'}}),$

where $b_t^{k'}=a_t^{b_{t+1}^{k'}}$, and $b_T^{k'}=k'$. If we now think about the structure of the PMMH algorithm, our proposal on the space of all random variables in the problem is in fact

$\displaystyle f(\theta^\star|\theta)\tilde{q}_{\theta^\star}(\mathbf{x}_0^\star,\ldots,\mathbf{x}_T^\star,\mathbf{a}_0^\star,\ldots,\mathbf{a}_{T-1}^\star)\pi_T^{{k'}^\star}$

and by considering the proposal and the acceptance ratio, it is clear that detailed balance for the chain is satisfied by the target with density proportional to

$\displaystyle p(\theta)\hat{p}_\theta(y_{1:T}) \tilde{q}_\theta(\mathbf{x}_0,\ldots,\mathbf{x}_T,\mathbf{a}_0,\ldots,\mathbf{a}_{T-1}) \pi_T^{k'}$

We want to show that this target marginalises down to the correct posterior $p(\theta,x_{0:T}|y_{1:T})$ when we consider just the parameters and the selected trajectory. But if we consider the terms in the joint distribution of the proposal corresponding to the trajectory selected by $k'$, this is given by

$\displaystyle p_\theta(x_0^{b_0^{k'}})\left[\prod_{t=0}^{T-1} \pi_t^{b_t^{k'}} p_\theta(x_{t+1}^{b_{t+1}^{k'}}|x_t^{b_t^{k'}})\right]\pi_T^{k'} = p_\theta(x_{0:T}^{k'})\prod_{t=0}^T \pi_t^{b_t^{k'}}$

which, by expanding the $\pi_t^{b_t^{k'}}$ in terms of the unnormalised weights, simplifies to

$\displaystyle \frac{p_\theta(x_{0:T}^{k'})p_\theta(y_{1:T}|x_{0:T}^{k'})}{M^{T+1}\hat{p}_\theta(y_{1:T})}$

It is worth dwelling on this result, as this is the key insight required to understand why the PMMH algorithm works. The whole point is that the terms in the joint density of the proposal corresponding to the selected trajectory exactly represent the required joint distribution modulo a couple of normalising constants, one of which is the particle filter’s estimate of marginal likelihood. Thus, by including $\hat{p}_\theta(y_{1:T})$ in the acceptance ratio, we knock out the normalising constant, allowing all of the other terms in the proposal to be marginalised away. In other words, the target of the chain can be written as proportional to

$\displaystyle \frac{p(\theta)p_\theta(x_{0:T}^{k'},y_{1:T})}{M^{T+1}} \times \text{(Other terms...)}$

The other terms are all probabilities of random variables which do not occur elsewhere in the target, and hence can all be marginalised away to leave the correct posterior

$\displaystyle p(\theta,x_{0:T}|y_{1:T})$

Thus the PMMH algorithm targets the correct posterior for any number of particles, $M$. Also note the implied uniform distribution on the selected indices in the target.

I will give some code examples in a future post.