Catalogue of my first 25 blog posts

30/12/2011

This is my 25th blog post, so this seems like a good time to provide an index to those first 25 posts for ease of reference. I’ve covered a range of topics over my first two years of blogging, and managed to average almost one post per month, as suggested in my first post. Due to the rather occasional nature of my posting, most regular readers subscribe to my RSS feed using some kind of RSS feed reader. I use Google Reader for following blogs and other RSS feeds, which I find very convenient as it is web based and therefore synced across all the machines and devices I use, but there are plenty of other options. Alternatively, you can follow me on twitter, where I am @darrenjw, or my Google+ feed, as I always announce new posts on those platforms.

Blog posts

  • 1. About this blog…: quick introduction to the new blog and what to expect.
  • 2. Hypercubes in R (getting started with programming in R): Constructing, rotating and plotting (2d projections of) hypercubes in order to illustrate some elementary R programming concepts.
  • 3. Systems biology, mathematical modelling and mountains: My review of an excellent workshop I participated in at BIRS on Multi-scale Stochastic Modeling of Cell Dynamics.
  • 4. (Yet another) introduction to R and Bioconductor: A very quick tutorial on basic R concepts and getting started with Bioconductor – very first steps.
  • 5. MCMC programming in R, Python, Java and C: this post showed how to implement a very simple bivariate Gibbs sampler in four different programming languages, and compared the speeds. The post has now been superseded by post number 18.
  • 6. The last Valencia meeting on Bayesian Statistics and the future of Bayesian computation: My impressions of Bayes 9, together with some thoughts on Bayesian computing in the context of multicore CPUs, GPUs, clusters and “Big Data”.
  • 7. Metropolis-Hastings MCMC algorithms: A quick introduction to the Metropolis algorithm, with example code in R, discussing implementation issues.
  • 8. The pseudo-marginal approach to “exact approximate” MCMC algorithms: a simple explanation of the “pseudo-marginal” idea, which has many potential applications in Bayesian inference.
  • 9. Introduction to the processing of short read next generation sequencing data: a quick introduction to high-throughput sequencing data, the FASTQ file format, and the use of Unix and command-line tools for initial processing and analysis of FASTQ files.
  • 10. A quick introduction to the Bioconductor ShortRead package for the analysis of NGS data: a follow-on from post 9. where I show how to get started with the analysis of FASTQ sequencing data using R and Bioconductor.
  • 11. Getting started with parallel MCMC: an introduction to parallel Monte Carlo algorithms and their implementation using C, the GSL, and MPI.
  • 12. Calling C code from R: how to call a Gibbs sampler written in C from R.
  • 13. Calling Java code from R: how to call a Gibbs sampler written in Java from R.
  • 14. Parallel Monte Carlo with an Intel i7 Quad Core: a quick look at the potential speed-ups possible exploiting parallelisation on a laptop with a nice Quad-core CPU.
  • 15. MCMC, Monte Carlo likelihood estimation, and the bootstrap particle filter: how the pseudo-marginal idea discussed in post number 8. can be exploited for state-space models by using a simple particle filter to construct an unbiased estimate of marginal likelihood.
  • 16. The particle marginal Metropolis-Hastings (PMMH) particle MCMC algorithm: following on from the previous post, an explanation of the full PMMH pMCMC algorithm for simultaneous estimation of parameters and state for state-space models.
  • 17. Java math libraries and Monte Carlo simulation codes: a post bemoaning the lack of anything quite like the GSL C library in Java, but highlighting some reasonable alternatives (COLT, Parallel COLT and Apache Commons Math).
  • 18. Gibbs sampler in various languages (revisited): an updated version of post number 5, including detailed timings. I also take the opportunity to include new languages PyPy, Groovy and Scala.
  • 19. Faster Gibbs sampling MCMC from within R: How to call MCMC code written in C, C++ and Java from R, with timing details.
  • 20. Stochastic Modelling for Systems Biology, second edition: A quick introduction to the 2nd edition of my book, together with a tutorial introduction to the associated CRAN R package, smfsb, for simulation and inference of stochastic kinetic network models and other Markov processes.
  • 21. Particle filtering and pMCMC using R: code for particle filtering and particle MCMC for Markov processes, in R.
  • 22. Review of “Parallel R” by McCallum and Weston: my somewhat critical review of this book.
  • 23. Lexical scope and function closures in R: an introduction to notions of variable scope and closure in the context of R.
  • 24. Parallel particle filtering and pMCMC using R and multicore: a discussion of the parallelisation of the particle filtering code from post 21. using R’s high-level parallelisation constructs.
  • 25. Catalogue of my first 25 blog posts: this post!
  • Should I ever manage another 25 posts, I’ll do a similar review of the next 25 at post number 50…

    Parallel particle filtering and pMCMC using R and multicore

    29/12/2011

    Introduction

    In a previous post I showed how to construct a PMMH pMCMC algorithm for parameter estimation with partially observed Markov processes. The inner loop of a pMCMC algorithm consists of running a particle filter to construct an unbiased estimate of marginal likelihood. This inner loop is the place where the code spends almost all of its time, and so speeding up the particle filter will result in dramatic speedup of the pMCMC algorithm. This is fortunate, since as previously discussed, MCMC algorithms are difficult to parallelise other than on a per iteration basis. Here, each iteration can be speeded up if we can effectively parallelise a particle filter. Particle filters are much easier to parallelise than MCMC algorithms, and so it is tempting to try and exploit this within R. In fact, although it is the case that it is possible to effectively parallelise particle filters in efficient languages using low-level parallelisation tools (say, using C with MPI, or Java concurrency tools), it is not so easy to speed up R-based particle filters using R’s high-level parallelisation constructs, as we shall see.

    Particle filters

    In the previous post we looked at the function pfMLLik within the CRAN package smfsb. As a reminder, the source code is

    pfMLLik <- function (n, simx0, t0, stepFun, dataLik, data) 
    {
        times = c(t0, as.numeric(rownames(data)))
        deltas = diff(times)
        return(function(...) {
            xmat = simx0(n, t0, ...)
            ll = 0
            for (i in 1:length(deltas)) {
                xmat = t(apply(xmat, 1, stepFun, t0 = times[i], deltat = deltas[i], ...))
                w = apply(xmat, 1, dataLik, t = times[i + 1], y = data[i,], log = FALSE, ...)
                if (max(w) < 1e-20) {
                    warning("Particle filter bombed")
                    return(-1e+99)
                }
                ll = ll + log(mean(w))
                rows = sample(1:n, n, replace = TRUE, prob = w)
                xmat = xmat[rows, ]
            }
            ll
        })
    }
    

    The function itself doesn’t actually run a particle filter, but instead returns a function closure which does (see the previous post for a discussion of lexical scope and function closures in R). There are obviously several different steps within the particle filter, and several of these are amenable to parallelisation. However, for complex models, forward simulation from the model will be the rate-limiting step, where the vast majority of CPU cycles will be spent. Line 9 in the above code is where forward simulation takes place, and in particular, the key function call is the apply call:
    apply(xmat, 1, stepFun, t0 = times[i], deltat = deltas[i], ...)
    

    This call applies the forward simulation algorithm stepFun to each row of the matrix xmat independently. Since there are no dependencies between the function calls, this is in principle very straightforward to parallelise on multicore hardware.

    Multicore support in R

    I’m writing this post on a laptop with an Intel i7 quad core chip, running the 64 bit version of Ubuntu 11.10. R has support for multicore processing on this platform – it is just a simple matter of installing the relevant packages. However, things are changing rapidly regarding multicore support in R right now, so YMMV. Ubuntu 11.10 has R 2.13 by default, but the multicore support is slightly different in the recently released R 2.14. I’m still using R 2.13. I may update this post (or comment) when I move to R 2.14. The main difference is that the package multicore has been replaced by the package parallel. There are a few other minor changes, but it should be easy to adapt what is presented here to 2.14.

    There is a new O’Reilly book called Parallel R. I’ve got a copy of it. It does cover the new parallel package in R 2.14, as well as other parallel R topics, but the book is a bit light weight, to say the least, and I reviewed it on this blog. Please read my review for further details before you buy it.

    If you haven’t used multicore in R previously, then

    install.packages(c("multicore","doMC"))
    

    should get you started (again, I’m assuming that your R version is strictly < 2.14). You can test it has worked with:
    library(multicore)
    multicore:::detectCores()
    

    When I do this, I get the answer 8 (I have 4 cores, each of which is hyper-threaded). To begin with, I want to tell R to use just 4 process threads, and I can do this with
    library(doMC)
    registerDoMC(4)
    

    Replacing the second line with registerDoMC() will set things up to use all detected cores (in my case, 8). There are a couple of different strategies we could use to parallelise this. One strategy for parallelising the apply call discussed above is to be to replace it with a foreach / %dopar% loop. This is best illustrated by example. Start with line 9 from the function pfMLLik:
    xmat = t(apply(xmat, 1, stepFun, t0 = times[i], deltat = deltas[i], ...))
    

    We can produce a parallelised version by replacing this line with the following block of code:
    res=foreach(j=1:dim(xmat)[1]) %dopar% {
      stepFun(xmat[j,], t0 = times[i], deltat = deltas[i], ...)
    }
    xmat=t(sapply(res,cbind))
    

    Each iteration of the foreach loop is executed independently (possibly using multiple cores), and the result of each iteration is returned as a list, and captured in res. This list of return vectors is then coerced back into a matrix with the final line.

    In fact, we can improve on this by using the .combine argument to foreach, which describes how to combine the results from each iteration. Here we can just use rbind to combine the results into a matrix, using:

    xmat=foreach(j=1:dim(xmat)[1], .combine="rbind") %dopar% {
      stepFun(xmat[j,], t0 = times[i], deltat = deltas[i], ...)
    }
    

    This code is much neater, and in principle ought to be a bit faster, though I haven’t noticed much difference in practice.

    In fact, it is not necessary to use the foreach construct at all. The multicore package provides the mclapply function, which is a multicore version of lapply. To use mclapply (or, indeed, lapply) here, we first need to split our matrix into a list of rows, which we can do using the split command. So in fact, our apply call can be replaced with the single line:

    xmat=t(sapply(mclapply(split(xmat,row(xmat)), stepFun, t0=times[i], deltat=deltas[i], ...),cbind))
    

    This is actually a much cleaner solution than the method using foreach, but it does require grokking a bit more R. Note that mclapply uses a different method to specify the number of threads to use than foreach/doMC. Here you can either use the named argument to mclapply, mc.cores, or use options(), eg. options(cores=4).

    As well as being much cleaner, I find that the mclapply approach is much faster than the foreach/dopar approach for this problem. I’m guessing that this is because foreach doesn’t pre-schedule tasks by default, whereas mclapply does, but I haven’t had a chance to dig into this in detail yet.

    A parallelised particle filter

    We can now splice the parallelised forward simulation step (using mclapply) back into our particle filter function to get:

    require(multicore)
    pfMLLik <- function (n, simx0, t0, stepFun, dataLik, data) 
    {
        times = c(t0, as.numeric(rownames(data)))
        deltas = diff(times)
        return(function(...) {
            xmat = simx0(n, t0, ...)
            ll = 0
            for (i in 1:length(deltas)) {
            xmat=t(sapply(mclapply(split(xmat,row(xmat)), stepFun, t0=times[i], deltat=deltas[i], ...),cbind))
                w = apply(xmat, 1, dataLik, t = times[i + 1], y = data[i,], log = FALSE, ...)
                if (max(w) < 1e-20) {
                    warning("Particle filter bombed")
                    return(-1e+99)
                }
                ll = ll + log(mean(w))
                rows = sample(1:n, n, replace = TRUE, prob = w)
                xmat = xmat[rows, ]
            }
            ll
        })
    }
    

    This can be used in place of the version supplied with the smfsb package for slow simulation algorithms running on modern multicore machines.

    There is an issue regarding Monte Carlo simulations such as this and the multicore package (whether you use mclapply or foreach/dopar) in that it adopts a “different seeds” approach to parallel random number generation, rather than a true parallel random number generator. This probably isn’t worth worrying too much about now, since it is fixed in the new parallel package in R 2.14, but is something to be aware of. I discuss parallel random number generation issues in Wilkinson (2005).

    Granularity

    The above code is now a parallel particle filter, and can now be used in place of the serial version that is part of the smfsb package. However, if you try it out on a simple example, you will most likely be disappointed. In particular, if you use it for the pMCMC example discussed in the previous post, you will see that the parallel version of the example actually runs much slower than the serial version (at least, it does for me). However, that is because the forward simulator stepFun, used in that example, was actually a very fast simulation algorithm, stepLVc, written in C. In this case, the overhead of setting up and closing down the threads, and distributing the tasks, and collating the results from the worker threads back in the master thread, etc., outweighs the advantage of running the individual tasks in parallel. This is why parallel programming is difficult. What is needed here is for the individual tasks to be sufficiently computationally intensive that the overheads associated with parallelisation are easily outweighed by the ability to run the tasks in parallel. In the context of particle filtering, this is particularly problematic, as if the forward simulator is very slow, running a reasonable particle filter is going to be very, very slow, and then you probably don’t want to be working in R anyway… Parallelising a particle filter written in C using MPI is much more likely to be successful, as it offers much more fine grained control of exactly how the tasks and processes are managed, but at the cost of increased development time. In a previous post I gave an introduction to parallel Monte Carlo with C and MPI, and I’ve written more extensively about parallel MCMC in Wilkinson (2005). It also looks as though the new parallel package in R 2.14 offers more control of parallelisation, so that also might help. However, if you are using a particle filter as part of a pMCMC algorithm, there is another strategy you can use at a higher level of granularity which might be useful even within R in some situations.

    Multiple particle filters and pMCMC

    Let’s look again at the main loop of the pMCMC algorithm discussed in the previous post:

    for (i in 1:iters) {
    	message(paste(i,""),appendLF=FALSE)
    	for (j in 1:thin) {
    		thprop=th*exp(rnorm(p,0,tune))
    		llprop=mLLik(thprop)
    		if (log(runif(1)) < llprop - ll) {
    			th=thprop
    			ll=llprop
    			}
    		}
    	thmat[i,]=th
    	}
    

    It is clear that the main computational bottleneck of this code is the call to mLLik on line 5, as this is the call which runs the particle filter. The purpose of making the call is to obtain an unbiased estimate of marginal likelihood. However, there are plenty of other ways that we can obtain such estimates than by running a single particle filter. In particular, we could run multiple particle filters and average the results. So, let’s look at how to do this in the multicore setting. Let’s start by thinking about running 4 particle filters. We could just replace the line
    llprop=mLLik(thprop)
    

    with the code
    llprop=0.25*foreach(i=1:4, .combine="+") %dopar% {
        mLLik(thprop)
        }
    

    Now, there are at least 2 issues with this. The first is that we are now just running 4 particle filters rather than 1, and so even with perfect parallelisation, it will run no quicker than the code we started with. However, the idea is that by running 4 particle filters we ought to be able to get away with each particle filter using fewer particles, though it isn’t trivial to figure out exactly how many. For example, averaging the results from 4 particle filters, each of which uses 25 particles is not as good as running a single particle filter with 100 particles. In practice, some trial and error is likely to be required. The second problem is that we have computed the mean of the log of the likelihoods, and not the likelihoods themselves. This will almost certainly work fine in practice, as the resulting estimate will in most cases be very close to unbiased, but it will not be exactly unbiased, as so will not lead to an “exact” approximate algorithm. In principle, this can be fixed by instead using
    res=foreach(i=1:4) %dopar% {
        mLLik(thprop)
        }
    llprop=log(mean(sapply(res,exp)))
    

    but in practice this is likely to be subject to numerical underflow problems, as it involves manipulating raw likelihood values, which is generally a bad idea. It is possible to compute the log of the mean of the likelihoods in a more numerically stable way, but that is left as an exercise for the reader, as this post is way too long already… However, one additional tip worth mentioning is that the foreach package includes a convenience function called times for situations like the above, where the argument is not varying over calls. So the above code can be replaced with
    res=times(4) %dopar% mLLik(thprop)
    llprop=log(mean(sapply(res,exp)))
    

    which is a bit cleaner and more readable.

    Using this approach to parallelisation, there is now a much better chance of getting some speedup on multicore architectures, as the granularity of the tasks being parallelised is now much larger. Consider the example from the previous post, where at each iteration we ran a particle filter with 100 particles. If we now re-run that example, but instead use 4 particle filters each using 25 particles, we do get a slight speedup. However, on my laptop, the speedup is only around a factor of 1.6 using 4 cores, and as already discussed, 4 filters each with 25 particles isn’t actually quite as good as a single filter with 100 particles anyway. So, the benefits are rather modest here, but will be much better with less trivial examples (slower simulators). For completeness, a complete runnable demo script is included after the references. Also, it is probably worth emphasising that if your pMCMC algorithm has a short burn-in period, you may well get much better overall speed-ups by just running parallel MCMC chains. Depressing, perhaps, but true.

    References

  • McCallum, E., Weston, S. (2011) Parallel R, O’Reilly.
  • Wilkinson, D. J. (2005) Parallel Bayesian Computation, Chapter 16 in E. J. Kontoghiorghes (ed.) Handbook of Parallel Computing and Statistics, Marcel Dekker/CRC Press, 481-512.
  • Wilkinson, D. J. (2011) Stochastic Modelling for Systems Biology, second edition, Boca Raton, Florida: Chapman & Hall/CRC Press.
  • Demo script

    require(smfsb)
    data(LVdata)
    
    require(multicore)
    require(doMC)
    registerDoMC(4)
    
    # set up data likelihood
    noiseSD=10
    dataLik <- function(x,t,y,log=TRUE,...)
    {
    	ll=sum(dnorm(y,x,noiseSD,log=TRUE))
    	if (log)
    		return(ll)
    	else
    		return(exp(ll))
    }
    # now define a sampler for the prior on the initial state
    simx0 <- function(N,t0,...)
    {
    	mat=cbind(rpois(N,50),rpois(N,100))
    	colnames(mat)=c("x1","x2")
    	mat
    }
    # convert the time series to a timed data matrix
    LVdata=as.timedData(LVnoise10)
    # create marginal log-likelihood functions, based on a particle filter
    
    # use 25 particles instead of 100
    mLLik=pfMLLik(25,simx0,0,stepLVc,dataLik,LVdata)
    
    iters=1000
    tune=0.01
    thin=10
    th=c(th1 = 1, th2 = 0.005, th3 = 0.6)
    p=length(th)
    ll=-1e99
    thmat=matrix(0,nrow=iters,ncol=p)
    colnames(thmat)=names(th)
    # Main pMCMC loop
    for (i in 1:iters) {
    	message(paste(i,""),appendLF=FALSE)
    	for (j in 1:thin) {
    		thprop=th*exp(rnorm(p,0,tune))
    		res=times(4) %dopar% mLLik(thprop)
    		llprop=log(mean(sapply(res,exp)))
    		if (log(runif(1)) < llprop - ll) {
    			th=thprop
    			ll=llprop
    			}
    		}
    	thmat[i,]=th
    	}
    message("Done!")
    # Compute and plot some basic summaries
    mcmcSummary(thmat)
    

    Lexical scope and function closures in R

    23/11/2011

    Introduction

    R is different to many “easy to use” statistical software packages – it expects to be given commands at the R command prompt. This can be intimidating for new users, but is at the heart of its power. Most powerful software tools have an underlying scripting language. This is because scriptable tools are typically more flexible, and easier to automate, script, program, etc. In fact, even software packages like Excel or Minitab have a macro programming language behind the scenes available for “power users” to exploit.

    Programming from the ground up

    It is natural to want to automate (repetitive) tasks on a computer, to automate a “work flow”. This is especially natural for computational tasks, as all software tools are built from programming language components, anyway. In R, you do stuff by executing a sequence of commands. By putting a bunch of commands one after another into a text file, we can source the file, and script R. Scripting is the simplest form of programming – automating a sequence of tasks. Indeed, in Unix (including Linux and MacOS), we can put a bunch of Unix shell commands together in a shell script. In Windows, you can put a bunch of terminal commands together in a batch file.

    Next, one can add in simple control structures, to support looping, branching and conditional execution. Looping allows repetition of very similar tasks. Branching and conditional execution allow decisions to be made depending on what has already happened. Most scripting languages support simple control structures – this allows carrying out of tasks which we could do in principle, but perhaps not in practice, due to the laborious and repetitive nature of some work-flows. We can go a long way with this, but…

    Although scripting is a simple form of programming, it isn’t “real” programming, or software engineering. Software engineering is about developing flexible, modular, robust, re-usable, generic program components, and using them to build large, complex software systems – modularity is absolutely key here. Functions and procedures are a first step towards introducing modularity, allowing the development of “real” software. Proper support for these tends to distinguish “real” programming languages from scripting languages (though many modern “scripting” languages have at least some limited support, and the distinction between scripting languages and “real” languages is now very blurred).

    Functions and procedures

    Procedures (or subroutines) are re-usable pieces of code which can be called from other pieces of code when needed. They may be provided with inputs, but do not have to be. They are usually called for their “side-effects”, such as doing plots, changing global variables, or reading/writing data to/from disk.

    Functions are also re-usable pieces of code, but are mainly used to obtain a return-value that is computed on the basis of the given inputs. “Pure” functions do not have any side-effects. Functions and procedures may be combined in a hierarchical way to build large, complex algorithms from much simpler modular components. Note that many languages (including R), do not make a distinction between functions and procedures in the syntax of the language, but conceptually the distinction is really quite important.

    Variable scope

    Almost all programming languages allow the definition of variables which are labels or tags representing or pointing at some value that may be defined and re-defined at run-time. In most modern programming languages, functions can define local variables which can be used in addition to any inputs (formal parameters) of the function – these are very important for the development of modular, re-usable code components. In particular, they help to avoid unanticipated name clashes in the global name-space. If a function refers to a variable which is neither a formal parameter nor a local variable, then a rule is needed to find which (if any) variable with that label is in scope for the function, so that the program can know what value to use.

    Dynamic scope

    Under dynamic scope, if an “unknown” variable is referred to in a function, the idea is to use the version of the variable that is in scope at the time that the function was called (and apply this rule recursively) – this is the scoping rule used by the S-PLUS implementation of the S language. Dynamic scope was common among early dynamic programming languages – including early implementations of LISP (and is still used in Emacs LISP), as it was quite intuitive and natural to implement using a stack-based approach similar to the stack-based approach to passing variables in and out of subroutines commonly used by machine code and assembly programmers.

    Despite being intuitively appealing, at least initially, there are a number of problems with dynamic scope in practice. In particular, we can’t really know by code inspection whether or not a given section of code will run in all situations without actually running the code, as we can’t know whether all variable bindings will resolve correctly. This is an issue even for dynamic languages, but is particularly problematic for strongly typed compiled languages, as it becomes difficult for the compiler to figure out the types of all variables correctly and therefore generate the appropriate byte-code. It is also very difficult for a function to have associated state – to do this, you must somehow get state variables into global name-space where they then become vulnerable to masking and name clashes. See the Wikipedia page on scope for further details.

    Lexical scope

    Under lexical scoping rules, if an “unknown” variable is referred to in a function, the idea is to use the version that is “in scope” in the enclosing piece of code (and apply this rule recursively) — this is the scoping rule used by R (as R is built on top of a Scheme interpreter, a LISP derivative which emphasises lexical scope). Variable bindings can be all resolved, checked and verified at compile-time – this is safer, and in many other ways better. Most modern languages adopt lexical scoping, including most functional languages, such as LISPs (including LISP-STAT) and derivatives. In fact, I first read about lexical scope, function closures and their use in statistical computing in Luke Tierney’s LISP-STAT book (Tierney, 1990) in the early 1990s. That book was published over 20 years ago, so it just goes to show that there is nothing new about these functional programming approaches. In fact, although Tierney’s book describes a now obsolete system, I would nevertheless recommend reading it if you can find a copy, as I think it is still one of the best books on statistical computing ever written. It really puts the recent glut of horrible R-themed books to shame!

    Given that R has been lexically scoped and has supported function closures since day one, it is reasonable to wonder why this programming style is not used more widely in R code. I think it is the difference in scoping rules between S-PLUS and R that has led to a fear of developing any R code which relies on non-local scoping rules. Certainly, in the early days of R, I would use S-PLUS at work and R at home, and I would want my code to work in exactly the same in both places! This is a shame, as lexical scoping is very powerful, and exploited widely in functional programming styles. The use of lexical scope and function closures in R is described quite nicely in Gentleman (2008), along with many other things.

    To make sure that the concepts are clear, inspect the following piece of code and figure out what the result of the final function call will be. The answer is given below the code, so try not to peek before reading on…

    a=1
    b=2
    f<-function(x)
    {
      a*x + b
    }
    g<-function(x)
    {
      a=2
      b=1
      f(x)
    }
    g(2)
    

    No, really, try and figure it out before reading on for the answer! Understanding this example is key to understanding the difference between lexical and dynamic scope. Clearly the obvious answers are 4 and 5. If you didn’t get one of those, go back and try again! ;-) So, one of those is the result you get in a dynamically scoped language like S-PLUS, and the other is the result that you get in a lexically scoped language like R. But which is which? Many people when asked what this code does give the answer 5. This is the result for a dynamically scoped language. It is not the answer you get in R. In R, you get the answer 4. This is because f() was defined in the global environment, so it is the global bindings of a and b which count. Although the function g() defines its own local bindings for a and b, these have no impact on the global bindings, and are simply not relevant to the evaluation of f().

    Function closures

    Some languages (including LISPs and derivatives such as Scheme, Python, and R) have functions as “first class objects”, which means that a function is able to return as its value another function. If the function (fChild) returned by the function (fParent) refers to variables not local to fChild, then scoping rules must apply to the resolution of the variable binding. If the language is lexically scoped, then the binding is determined by the variables in scope within the function fParent. The function fChild therefore has an associated environment, which provides bindings for non-local variable references – this allows maintaining of state. A function together with its environment is referred to as a function closure, and is a very powerful programming tool. Below is some more code to help illustrate what is going on. Again, try to figure out the result of the final function call before reading on for the answer and explanation…

    a=1
    b=2
    f<-function(a,b)
    {
      return( function(x) {
        a*x + b
      })
    }
    g=f(2,1)
    g(2)
    

    Here, the function g(), together with its associated environment, is referred to as a function closure. See the Wikipedia page for closure for further details. So, what is the result of calling g(2) in this case? Again, some people get this wrong, and give the answer 4. This isn’t what you get in R – in R you get 5, again due to lexical scope. The point is that the function g() is created inside f(), and so it is the variable bindings in scope within f() at the time g() was created which matter. Since f() has a and b as formal arguments, these mask the global variables of the same name, so it is the 2 and 1 that are passed into f() to create g() which matter in the evaluation of g(). This is why function closures are so powerful. They are not simply functions, they are functions together with an associated environment, and the associated environment allows function closures to have associated state. Here the state corresponds to the values of a and b that were used in the creation of g(), but in principle the state can be essentially any data structure.

    Function closures for scientific computing

    Function closures have numerous important applications in a variety of problems in scientific computing that involve dealing in some way with the “function environment problem”. There is quite a nice discussion of this issue in Oliveira and Stewart (2006), in the context of several strongly typed compiled languages. Consider, for example, a function that will numerically integrate a univariate function using (say) the trapezium rule. This integration function might expect that you pass in the function to be integrated, together with the limits of integration, and possibly a step size. Most likely this integration function will expect that the function passed in is univariate. However, in practice many functions have additional parameters (eg. the straight line example, above, which was a function of x, but depending on additional parameters a and b). This problem is solved by passing in a univariate function closure that contains the necessary environment to evaluate this univariate function correctly. Similar considerations apply for functions that carry out optimisation, solve ODEs by passing in the RHS, etc.

    The smfsb R package

    The second edition of my textbook, Stochastic Modelling for Systems Biology, has recently been published (Wilkinson, 2011). The second edition has an associated R package, smfsb, available from CRAN – I gave a tutorial introduction in a previous post. The code makes extensive use of lexical scope and function closures, precisely to solve the function environment problem…

    References

  • Oliveira, S, Stewart, D.E. (2006) Writing scientific software, CUP.
  • Gentleman, R. (2008) R Programming for Bioinformatics, Chapman & Hall/CRC Press.
  • Tierney, L. (1990) LISP-STAT, Wiley.
  • Wilkinson (2011), Stochastic Modelling for Systems Biology, second edition, Chapman & Hall/CRC Press.
  • Review of “Parallel R” by McCallum and Weston

    16/11/2011

    Introduction

    This is the first book review I’ve done on this blog, and I don’t intend to make it a regular feature, but I ordered a copy of “Parallel R” a few days ago. It arrived today, and I’m quite disappointed with it, so I wanted to write a quick review to provide some additional information for people thinking of buying a copy. Just to be clear, the book is:

  • McCallum, E., Weston, S. (2011) Parallel R, O’Reilly.
  • I generally like O’Reilly books, and own a number of them. I use R a lot, I am very interested in parallel computing (traditionally using C and MPI), and I’ve dabbled a little with some parallel stuff in R, but don’t consider myself to be an expert. In other words, people like me are probably the target audience for this book, and sure enough I have handed over some of my hard-earned cash to buy a copy.

    The main problem with the book is that it just doesn’t feel finished. It seems as though the authors have rushed the text as quickly as possible and published it without any kind of critical review or reflection. It is very short – just 108 pages of the main text, and most annoyingly, doesn’t have an index. This wouldn’t be so much of a problem for an electronic version, but selling a technical computing book in dead tree format without an index is really unforgivable. All of my other O’Reilly books contain a decent index, so I’m just baffled as to why this one doesn’t. It really feels like this is the first draft of a manuscript that you would circulate to a few friends and colleagues for comments and suggested improvements. There is the kernel of a decent book here, but most of the current material will be obsolete before a second edition could be put together and published, so the second edition will have to be a complete re-write.

    Chapter by chapter

    Chapter 1 – Getting started – 5 pages

    A brief introduction to the rest of the book, and has a pointer to the companion website, parallelrbook.com (at the time of writing, it is empty…).

    Chapter 2 – snow – 30 pages

    This chapter is the most substantial, and arguably the best, chapter of the book. It provides a very reasonable introduction to the snow package for a simple network of workstations, for running embarrassingly parallel jobs on a cluster.

    Chapter 3 – multicore – 13 pages

    This chapter provides a very brief and superficial introduction to the multicore package, for exploiting modern multicore hardware. It provides a very brief introduction to the high level API (mclapply, pvec, parallel, collect). Discussion of the low-level API is almost non-existent (an example function is given which uses some low-level calls). Also, there is no discussion here, or anywhere else, of the foreach package/function, or the doMC back-end. Unfortunately I couldn’t verify this by checking the index (see above), but as there are only 100 or so pages, it didn’t take that long to flick through them all to double-check… Now I can understand that the book will not cover all obscure parallel packages for R, but foreach/doMC?! Missing from a book called “Parallel R”? Seriously? It is all the more weird as one of the authors (Weston) is an author of foreach/doMC. Go figure…

    Chapter 4 – parallel – 8 pages

    This chapter provides an even more brief introduction to the new parallel package for R 2.14. It should be noted that the book went to press before 2.14 was frozen for release, but the content that was there looked OK on the basis of a very quick skim. But at 8 pages, don’t expect too much.

    Chapter 5 – A primer on MapReduce and Hadoop – 8 pages

    A very brief introduction to the ideas behind MapReduce and Hadoop. Not actually anything to do with R, but necessary for the next chapter.

    Chapter 6 – R+Hadoop – 18 pages

    I’ve not worked through this chapter in detail, but it looks like a reasonable “getting started guide” for using Hadoop with R.

    Chapter 7 – RHIPE – 16 pages

    Again, I’ve not studied this chapter carefully, but it seems like a reasonable introduction to the RHIPE package (it is a package to make it simpler to use R with Hadoop, by hiding Hadoop stuff from the R user).

    Chapter 8 – Segue – 6 pages

    A very brief introduction to the segue package, which enables running jobs on Amazon’s Elastic MapReduce service.

    Chapter 9 – New and upcoming – 2 pages

    A very brief mention of some of the things that weren’t covered in the book… foreach is mentioned here, but no example is given.

    Conclusion

    For people who have absolutely no idea about parallel computing in R, or about what different options are available, then this book does provide a useful overview, together with some simple examples to illustrate the ideas, and try out for themselves. It is generally very brief and superficial, there are some gaping holes, and much of the material will become obsolete very quickly. It is a shame that there is not more discussion of low level functions, or of parallel computing in anything other than a simple embarrassingly parallel context. Admittedly, if your job isn’t embarrassingly parallel, you probably don’t want to use R anyway, but some discussion would still have been nice. And did I mention that there is no index?! I did toy briefly with the idea of sending it back, but I’m not going to. To be fair, there is quite a bit of useful information in the book, and I’d like to work through the Hadoop chapters at some point. So in summary, it’s OK, but don’t expect to love it.

    References

  • McCallum, E., Weston, S. (2011) Parallel R, O’Reilly.
  • Particle filtering and pMCMC using R

    12/11/2011

    In the previous post I gave a quick introduction to the CRAN R package smfsb, and how it can be used for simulation of Markov processes determined by stochastic kinetic networks. In this post I’ll show how to use data and particle MCMC techniques in order to carry out Bayesian inference for the parameters of partially observed Markov processes.

    The simulation model and the data

    For this post we will assume that the smfsb package is installed and loaded (see previous post for details). The package includes the function stepLVc which simulates from the Markov transition kernel of a Lotka-Volterra process by calling out to some native C code for speed. So, for example,

    stepLVc(c(x1=50,x2=100),0,1)
    

    will simulate the state of the process at time 1 given an initial condition of 50 prey and 100 predators at time 0, using the default rate parameters of the function, th = c(1, 0.005, 0.6). The package also includes some data simulated from this model using these parameters, with and without added noise. The datasets can be loaded with
    data(LVdata)
    

    For simplicity, we will just make use of the dataset LVnoise10 in this post. This dataset is a multivariate time series consisting of 16 equally spaced observations on both prey and predators subject to Gaussian measurement error with a standard deviation of 10. We can plot the data with
    plot(LVnoise10,plot.type="single",col=c(2,4))
    

    giving:

    The Bayesian inference problem is to see how much we are able to learn about the parameters which generated the data using only the data and our knowledge of the structure of the problem. There are many approaches one can take to this problem, but most are computationally intensive, due to the analytical intractability of the transition kernel of the LV process. Here we will follow Wilkinson (2011) and take a particle MCMC (pMCMC) approach, and specifically, use a pseudo-marginal “exact approximate” MCMC algorithm based on the particle marginal Metropolis-Hastings (PMMH) algorithm. I have discussed the pseudo-marginal approach, using particle filters for marginal likelihood estimation, and the PMMH algorithm in previous posts, so if you have been following my posts for a while, this should all make perfect sense…

    Particle filter

    One of the key ingredients required to implement the pseudo-marginal MCMC scheme is a (bootstrap) particle filter which generates an unbiased estimate of the marginal likelihood of the data given the parameters (integrated over the unobserved state trajectory). The algorithm was discussed in this post, and R code to implement this is included in the smfsb R package as pfMLLik. For reasons of numerical stability, the function computes and returns the log of the marginal likelihood, but it is important to understand that it is the actually likelihood estimate that is unbiased for the true likelihood, and not the corresponding statement for the logs. The actual code of the function is relatively short, and for completeness is given below:

    pfMLLik <- function (n, simx0, t0, stepFun, dataLik, data) 
    {
        times = c(t0, as.numeric(rownames(data)))
        deltas = diff(times)
        return(function(...) {
            xmat = simx0(n, t0, ...)
            ll = 0
            for (i in 1:length(deltas)) {
                xmat = t(apply(xmat, 1, stepFun, t0 = times[i], deltat = deltas[i], ...))
                w = apply(xmat, 1, dataLik, t = times[i + 1], y = data[i,], log = FALSE, ...)
                if (max(w) < 1e-20) {
                    warning("Particle filter bombed")
                    return(-1e+99)
                }
                ll = ll + log(mean(w))
                rows = sample(1:n, n, replace = TRUE, prob = w)
                xmat = xmat[rows, ]
            }
            ll
        })
    }
    

    We need to set up the prior and the data likelihood correctly before we can use this function, but first note that the function does not actually run a particle filter at all, but instead stores everything it needs to know to run the particle filter in the local environment, and then returns a function closure for evaluating the marginal likelihood at a given set of parameters. The resulting function (closure) can then be used to run a particle filter for a given set of parameters, simply by passing the required parameters into the function. This functional programming style is consistent with that used throughout the smfsb R package, and leads to quite simple, modular code. To use pfMLLik, we first need to define a function which evaluates the log-likelihood of an observation conditional on the true state, and another which samples from the prior distribution of the initial state of the system. Here, we can do that as follows.
    # set up data likelihood
    noiseSD=10
    dataLik <- function(x,t,y,log=TRUE,...)
    {
    	ll=sum(dnorm(y,x,noiseSD,log=TRUE))
    	if (log)
    		return(ll)
    	else
    		return(exp(ll))
    }
    # now define a sampler for the prior on the initial state
    simx0 <- function(N,t0,...)
    {
    	mat=cbind(rpois(N,50),rpois(N,100))
    	colnames(mat)=c("x1","x2")
    	mat
    }
    # convert the time series to a timed data matrix
    LVdata=as.timedData(LVnoise10)
    # create marginal log-likelihood functions, based on a particle filter
    mLLik=pfMLLik(100,simx0,0,stepLVc,dataLik,LVdata)
    

    Now the function (closure) mLLik will, for a given parameter vector, run a particle filter (using 100 particles) and return the log of the particle filter’s unbiased estimate of the marginal likelihood of the data. It is then very easy to use this function to create a simple PMMH algorithm for parameter inference.

    PMMH algorithm

    Below is an algorithm based on flat priors and a simple Metropolis-Hastings update for the parameters using the function closure mLLik, defined above.

    iters=1000
    tune=0.01
    thin=10
    th=c(th1 = 1, th2 = 0.005, th3 = 0.6)
    p=length(th)
    ll=-1e99
    thmat=matrix(0,nrow=iters,ncol=p)
    colnames(thmat)=names(th)
    # Main pMCMC loop
    for (i in 1:iters) {
    	message(paste(i,""),appendLF=FALSE)
    	for (j in 1:thin) {
    		thprop=th*exp(rnorm(p,0,tune))
    		llprop=mLLik(thprop)
    		if (log(runif(1)) < llprop - ll) {
    			th=thprop
    			ll=llprop
    			}
    		}
    	thmat[i,]=th
    	}
    message("Done!")
    # Compute and plot some basic summaries
    mcmcSummary(thmat)
    

    This will take a little while to run, but in the end should give a plot something like the following (click for full size):

    So, although we should really run the chain for a bit longer, we see that we can learn a great deal about the parameters of the process from very little data. For completeness, a full runnable demo script is included below the references. Of course there are many obvious extensions of this basic problem, such as partial observation (eg. only observing the prey) and unknown measurement error. These are discussed in Wilkinson (2011), and code for these cases is included within the demo(PMCMC), which should be inspected for further details.

    Discussion

    At this point it is probably worth emphasising that there are other “likelihood free” approaches which can be taken to parameter inference for partially observed Markov process (POMP) models. Many of these are implemented in the pomp R package, also available from CRAN, by King et al (2008). The pomp package is well documented, and has a couple of good tutorial vignettes which should be sufficient to get people started. The API of the package is rather cumbersome, but the algorithms appear to be quite robust. Approximate Bayesian computation (ABC) approaches are also quite natural for POMP models (see, for example, Toni et al (2009)). This is because “exact” likelihood free procedures break down in the case of low/no measurement error or high-dimensional observations. There are some R packages for ABC, but I am not sufficiently familiar with them to be able to give recommendations.

    If one is able to move away from the “likelihood free” paradigm, it is possible to develop “exact” pMCMC algorithms which do not break down in challenging observation scenarios. The problem here is the intractability of the Markov transition kernel. In the case of nonlinear Markov jump processes, finding very generic solutions seems quite difficult, but for diffusion (approximation) processes based on stochastic differential equations, it seems to be possible to develop irreducible pMCMC algorithms which have very broad applicability – see Golightly and Wilkinson (2011) for further details of how such techniques can be used in the context of stochastic kinetic models similar to those considered in this post.

    References

  • Golightly, A., Wilkinson, D. J. (2011) Bayesian parameter inference for stochastic biochemical network models using particle MCMC, Interface Focus, 1(6):807-820.
  • King, A.A., Ionides, E.L., & Breto, C.M. (2008) pomp: Statistical inference for partially observed Markov processes, CRAN.
  • Toni, T., Welch, D., Strelkowa, N., Ipsen, A. & Stumpf, M. (2009) Approximate Bayesian computation scheme for parameter inference and model selection in dynamical systems, J. R. Soc. Interface 6(31): 187-202.
  • Wilkinson, D. J. (2011) Stochastic Modelling for Systems Biology, second edition, Boca Raton, Florida: Chapman & Hall/CRC Press.
  • Demo script

    require(smfsb)
    data(LVdata)
    
    # set up data likelihood
    noiseSD=10
    dataLik <- function(x,t,y,log=TRUE,...)
    {
    	ll=sum(dnorm(y,x,noiseSD,log=TRUE))
    	if (log)
    		return(ll)
    	else
    		return(exp(ll))
    }
    # now define a sampler for the prior on the initial state
    simx0 <- function(N,t0,...)
    {
    	mat=cbind(rpois(N,50),rpois(N,100))
    	colnames(mat)=c("x1","x2")
    	mat
    }
    # convert the time series to a timed data matrix
    LVdata=as.timedData(LVnoise10)
    # create marginal log-likelihood functions, based on a particle filter
    mLLik=pfMLLik(100,simx0,0,stepLVc,dataLik,LVdata)
    
    iters=1000
    tune=0.01
    thin=10
    th=c(th1 = 1, th2 = 0.005, th3 = 0.6)
    p=length(th)
    ll=-1e99
    thmat=matrix(0,nrow=iters,ncol=p)
    colnames(thmat)=names(th)
    # Main pMCMC loop
    for (i in 1:iters) {
    	message(paste(i,""),appendLF=FALSE)
    	for (j in 1:thin) {
    		thprop=th*exp(rnorm(p,0,tune))
    		llprop=mLLik(thprop)
    		if (log(runif(1)) < llprop - ll) {
    			th=thprop
    			ll=llprop
    			}
    		}
    	thmat[i,]=th
    	}
    message("Done!")
    # Compute and plot some basic summaries
    mcmcSummary(thmat)
    

    Stochastic Modelling for Systems Biology, second edition

    09/11/2011

    The second edition of my textbook, Stochastic Modelling for Systems Biology was published on 7th November, 2011. One of the new features introduced into the new edition is an R package called smfsb which contains all of the code examples discussed in the text, which allow modelling, simulation and inference for stochastic kinetic models. The smfsb R package is the main topic of this post, but it seems appropriate to start off the post with a quick introduction to the book, and the main new features of the second edition.

    The first edition was published in April 2006. It provided an introduction to mathematical modelling for systems biology from a stochastic viewpoint. It began with an introduction to biochemical network modelling, then moved on to probability theory, stochastic simulation and Markov processes. After providing all of the necessary background material, the book then introduced the theory of stochastic kinetic modelling and the Gillespie algorithm for exact discrete stochastic event simulation of stochastic kinetic biochemical network models. This was followed by examples and case studies, advanced simulation algorithms, and then a brief introduction to Bayesian inference and its application to inference for stochastic kinetic models.

    The first edition proved to be very popular, as it was the first self-contained introduction to the field, and was aimed at an audience without a strong quantitative background. The decision to target an applied audience meant that it contained only the bare essentials necessary to get started with stochastic modelling in systems biology. The second edition was therefore an opportunity not only to revise and update the existing material, but also to add in additional material, especially new material which could provide a more solid foundation for advanced study by students with a more mathematical focus. New material introduced into the second edition includes a greatly expanded chapter on Markov processes, with particular emphasis on diffusion processes and stochastic differential equations, as well as Kolmogorov equations, the Fokker-Planck equation (FPE), Kurtz’s random time change representation of a stochastic kinetic model, an additional derivation of the chemical Langevin equation (CLE), and a derivation of the linear noise approximation (LNA). There is now also discussion of the modelling of “extrinsic” in addition to “intrinsic” noise. The final chapters on inference have also been greatly expanded, including discussion of importance resampling, particle filters, pseudo-marginal “exact approximate” MCMC, likelihood-free techniques and particle MCMC for rate parameter inference. I have tried as far as possible to maintain the informal and accessible style of the first edition, and a couple of the more technical new sections have been flagged as “skippable” by less mathematically trained students. In terms of computing, all of the SBML models have been updated to the new Level 3 specification, and all of the R code has been re-written, extended, documented and packaged as an open source R package. The rest of this post is an introduction to the R package. Although the R package is aimed mainly at owners of the second edition, it is well documented, and should therefore be usable by anyone with a reasonable background knowledge of the area. In particular, the R package should be very easy to use for anyone familiar with the first edition of the book. The introduction given here is closely based on the introductory vignette included with the package.

    smfsb: an R package for simulation and inference in stochastic kinetic models

    Overview

    The smfsb package provides all of the R code associated with the book, Wilkinson (2011). Almost all of the code is pure R code, intended to be inspected from the R command line. In order to keep the code short, clean and easily understood, there is almost no argument checking or other boilerplate code.

    Installation

    The package is available from CRAN, and it should therefore be possible to install from the R command prompt using

    install.packages("smfsb")
    

    from any machine with an internet connection.

    The package is being maintained on R-Forge, and so it should always be possible to install the very latest nightly build from the R command prompt with

    install.packages("smfsb",repos="http://r-forge.r-project.org")
    

    but you should only do this if you have a good reason to, in order not to overload the R-Forge servers (not that I imagine downloads of this package are likely to overload the servers…).

    Once installed, the package can be loaded ready for use with

    library(smfsb)
    

    Accessing documentation

    I have tried to ensure that the package and all associated functions and datasets are properly documented with runnable examples. So,

    help(package="smfsb")
    

    will give a brief overview of the package and a complete list of all functions. The list of vignettes associated with the package can be obtained with
    vignette(package="smfsb")
    

    At the time of writing, the introductory vignette is the only one available, and can be accessed from the R command line with
    vignette("smfsb",package="smfsb")
    

    Help on functions can be obtained using the usual R mechanisms. For example, help on the function StepGillespie can be obtained with
    ?StepGillespie
    

    and the associated example can be run with
    example(StepGillespie)
    

    The sourcecode for the function can be obtained by typing StepGillespie on a line by itself. In this case, it returns the following R code:
    function (N) 
    {
        S = t(N$Post - N$Pre)
        v = ncol(S)
        return(function(x0, t0, deltat, ...) {
            t = t0
            x = x0
            termt = t0 + deltat
            repeat {
                h = N$h(x, t, ...)
                h0 = sum(h)
                if (h0 < 1e-10)
                    t = 1e+99 
                else if (h0 > 1e+06) {
                    t = 1e+99
                    warning("Hazard too big - terminating simulation!")
                } 
                else 
                    t = t + rexp(1, h0)
                if (t >= termt) 
                    return(x)
                j = sample(v, 1, prob = h)
                x = x + S[, j]
            }
        })
    }
    

    A list of demos associated with the package can be obtained with
    demo(package="smfsb")
    

    A list of data sets associated with the package can be obtained with
    data(package="smfsb")
    

    For example, the small table, mytable from the introduction to R in Chapter 4 can by loaded with
    data(mytable)
    

    After running this command, the data frame mytable will be accessible, and can be examined by typing
    mytable
    

    at the R command prompt.

    Simulation of stochastic kinetic models

    The main purpose of this package is to provide a collection of tools for building and simulating stochastic kinetic models. This can be illustrated using a simple Lotka-Volterra predator-prey system. First, consider the prey, X_1 and the predator X_2 as a stochastic network, viz

    R_1:\quad X_1 \longrightarrow 2 X_1
    R_2:\quad X_1 + X_2\longrightarrow 2X_2
    R_3:\quad X_2 \longrightarrow \emptyset.

    The first “reaction” represents predator reproduction, the second predator-prey interaction and the third predator death. We can write the stoichiometries of the reactions, together with the rate (or hazard) of each reaction, in tabular form as

    Reaction Pre Post Hazard
    X_1 X_2 X_1 X_2 h()
    R_1 1 0 2 0 \theta_1 x_1
    R_2 1 1 0 2 \theta_2 x_1 x_2
    R_3 0 1 0 0 \theta_3 x_2

    This can be encoded in R as a stochastic Petri net (SPN) using

    # SPN for the Lotka-Volterra system
    LV=list()
    LV$Pre=matrix(c(1,0,1,1,0,1),ncol=2,byrow=TRUE)
    LV$Post=matrix(c(2,0,0,2,0,0),ncol=2,byrow=TRUE)
    LV$h=function(x,t,th=c(th1=1,th2=0.005,th3=0.6))
    {
     with(as.list(c(x,th)),{
             return(c(th1*x1, th2*x1*x2, th3*x2 ))
            })
    }
    

    This object could be created directly by executing
    data(spnModels)
    

    since the LV model is one of the standard demo models included with the package. Functions for simulating from the transition kernel of the Markov process defined by the SPN can be created easily by passing the SPN object into the appropriate constructor. For example, if simulation using the Gillespie algorithm is required, a simulation function can be created with
    stepLV=StepGillespie(LV)
    

    This resulting function (closure) can then be used to advance the state of the process. For example, to simulate the state of the process at time 1, given an initial condition of X_1=50, X_2=100 at time 0, use
    stepLV(c(x1=50,x2=100),0,1)
    

    Alternatively, to simulate a realisation of the process on a regular time grid over the interval [0,100] in steps of 0.1 time units, use
    out = simTs(c(x1=50,x2=100),0,100,0.1,stepLV)
    plot(out,plot.type="single",col=c(2,4))
    

    which gives the resulting plot

    See the help and runnable example for the function StepGillespie for further details, including some available alternative simulation algorithms, such as StepCLE.

    Inference for stochastic kinetic models from time course data

    Estimating the parameters of stochastic kinetic models using noisy time course measurements on some aspect of the system state is a very important problem. Wilkinson (2011) takes a Bayesian approach to the problem, using particle MCMC methodology. For this, a key aspect is the use of a particle filter to compute an unbiased estimate of marginal likelihood. This is accomplished using the function pfMLLik. Once a method is available for generating unbiased estimates for the marginal likelihood, this may be embedded into a fairly standard marginal Metropolis-Hastings algorithm for parameter estimation. See the help and runnable example for pfMLLik for further details, along with the particle MCMC demo, which can by run using demo(PMCMC). I’ll discuss more about particle MCMC and rate parameter inference in the next post.

    References

  • Wilkinson, D. J. (2006) Stochastic Modelling for Systems Biology, Boca Raton, Florida: Chapman & Hall/CRC Press.
  • Wilkinson, D. J. (2011) Stochastic Modelling for Systems Biology, second edition, Boca Raton, Florida: Chapman & Hall/CRC Press.
  • Faster Gibbs sampling MCMC from within R

    31/07/2011

    Introduction

    This post follows on from the previous post on Gibbs sampling in various languages. In that post a simple Gibbs sampler was implemented in various languages, and speeds were compared. It was seen that R is very slow for iterative simulation algorithms characteristic of MCMC methods such as the Gibbs sampler. Statically typed languages such as C/C++ and Java were seen to be fastest for this type of algorithm. Since many statisticians like to use R for most of their work, there is natural interest in the possibility of extending R by calling simulation algorithms written in other languages. It turns out to be straightforward to call C, C++ and Java from within R, so this post will look at how this can be done, and exactly how fast the different options turn out to be. The post draws heavily on my previous posts on calling C from R and calling Java from R, as well as Dirk Eddelbuettel’s post on calling C++ from R, and it may be helpful to consult these posts for further details.

    Languages

    R

    We will start with the simple pure R version of the Gibbs sampler, and use this as our point of reference for understanding the benefits of re-coding in other languages. The background to the problem was given in the previous post and so won’t be repeated here. The code can be given as follows:

    gibbs<-function(N=50000,thin=1000)
    {
    	mat=matrix(0,ncol=2,nrow=N)
    	x=0
    	y=0
    	for (i in 1:N) {
    		for (j in 1:thin) {
    			x=rgamma(1,3,y*y+4)
    			y=rnorm(1,1/(x+1),1/sqrt(2*x+2))
    		}
    		mat[i,]=c(x,y)
    	}
    	names(mat)=c("x","y")
    	mat
    }
    

    This code works perfectly, but is very slow. It takes 458.9 seconds on my very fast laptop (details given in previous post).

    C

    Let us now see how we can introduce a new function, gibbsC into R, which works in exactly the same way as gibbs, but actually calls on compiled C code to do all of the work. First we need the C code in a file called gibbs.c:

    #include <stdio.h>
    #include <math.h>
    #include <stdlib.h>
    #include <R.h>
    #include <Rmath.h>
    
    void gibbs(int *Np,int *thinp,double *xvec,double *yvec)
    {
      int i,j;
      int N=*Np,thin=*thinp;
      GetRNGstate();
      double x=0;
      double y=0;
      for (i=0;i<N;i++) {
        for (j=0;j<thin;j++) {
          x=rgamma(3.0,1.0/(y*y+4));
          y=rnorm(1.0/(x+1),1.0/sqrt(2*x+2));
        }
        xvec[i]=x; yvec[i]=y;
      }
      PutRNGstate();
    }
    

    This can be compiled with R CMD SHLIB gibbs.c. We can load it into R and wrap it up so that it is easy to use with the following code:

    dyn.load(file.path(".",paste("gibbs",.Platform$dynlib.ext,sep="")))
    gibbsC<-function(n=50000,thin=1000)
    {
      tmp=.C("gibbs",as.integer(n),as.integer(thin),
                    x=as.double(1:n),y=as.double(1:n))
      mat=cbind(tmp$x,tmp$y)
      colnames(mat)=c("x","y")
      mat
    }
    

    The new function gibbsC works just like gibbs, but takes just 12.1 seconds to run. This is roughly 40 times faster than the pure R version, which is a big deal.

    Note that using the R inline package, it is possible to directly inline the C code into the R source code. We can do this with the following R code:

    require(inline)
    code='
      int i,j;
      int N=*Np,thin=*thinp;
      GetRNGstate();
      double x=0;
      double y=0;
      for (i=0;i<N;i++) {
        for (j=0;j<thin;j++) {
          x=rgamma(3.0,1.0/(y*y+4));
          y=rnorm(1.0/(x+1),1.0/sqrt(2*x+2));
        }
        xvec[i]=x; yvec[i]=y;
      }
      PutRNGstate();'
    gibbsCin<-cfunction(sig=signature(Np="integer",thinp="integer",xvec="numeric",yvec="numeric"),body=code,includes="#include <Rmath.h>",language="C",convention=".C")
    gibbsCinline<-function(n=50000,thin=1000)
    {
      tmp=gibbsCin(n,thin,rep(0,n),rep(0,n))
      mat=cbind(tmp$x,tmp$y)
      colnames(mat)=c("x","y")
      mat
    }
    

    This runs at the same speed as the code compiled separately, and is arguably a bit cleaner in this case. Personally I’m not a big fan of inlining code unless it is something really very simple. If there is one thing that we have learned from the murky world of web development, it is that little good comes from mixing up different languages in the same source code file!

    C++

    We can also inline C++ code into R using the inline and Rcpp packages. The code below originates from Sanjog Misra, and was discussed in the post by Dirk Eddelbuettel mentioned at the start of this post.

    require(Rcpp)
    require(inline)
    
    gibbscode = '
    int N = as<int>(n);
    int thn = as<int>(thin);
    int i,j;
    RNGScope scope;
    NumericVector xs(N),ys(N);
    double x=0;
    double y=0;
    for (i=0;i<N;i++) {
      for (j=0;j<thn;j++) {
        x = ::Rf_rgamma(3.0,1.0/(y*y+4));
        y= ::Rf_rnorm(1.0/(x+1),1.0/sqrt(2*x+2));
      }
      xs(i) = x;
      ys(i) = y;
    }
    return Rcpp::DataFrame::create( Named("x")= xs, Named("y") = ys);
    '
    
    RcppGibbsFn <- cxxfunction( signature(n="int", thin = "int"),
                                  gibbscode, plugin="Rcpp")
    
    RcppGibbs <- function(N=50000,thin=1000)
    {
    	RcppGibbsFn(N,thin)
    }
    

    This version of the sampler runs in 12.4 seconds, just a little bit slower than the C version.

    Java

    It is also quite straightforward to call Java code from within R using the rJava package. The following code

    import java.util.*;
    import cern.jet.random.tdouble.*;
    import cern.jet.random.tdouble.engine.*;
    
    class GibbsR
    {
    
        public static double[][] gibbs(int N,int thin,int seed)
        {
    	DoubleRandomEngine rngEngine=new DoubleMersenneTwister(seed);
    	Normal rngN=new Normal(0.0,1.0,rngEngine);
    	Gamma rngG=new Gamma(1.0,1.0,rngEngine);
    	double x=0,y=0;
    	double[][] mat=new double[2][N];
    	for (int i=0;i<N;i++) {
    	    for (int j=0;j<thin;j++) {
    		x=rngG.nextDouble(3.0,y*y+4);
    		y=rngN.nextDouble(1.0/(x+1),1.0/Math.sqrt(2*x+2));
    	    }
    	    mat[0][i]=x; mat[1][i]=y;
    	}
    	return mat;
        }
    
    }
    

    can be compiled with javac GibbsR.java (assuming that Parallel COLT is in the classpath), and wrapped up from within an R session with

    library(rJava)
    .jinit()
    obj=.jnew("GibbsR")
    
    gibbsJ<-function(N=50000,thin=1000,seed=trunc(runif(1)*1e6))
    {
        result=.jcall(obj,"[[D","gibbs",as.integer(N),as.integer(thin),as.integer(seed))
        mat=sapply(result,.jevalArray)
        colnames(mat)=c("x","y")
        mat
    }
    

    This code runs in 10.7 seconds. Yes, that’s correct. Yes, the Java code is faster than both the C and C++ code! This really goes to show that Java is now an excellent option for numerically intensive work such as this. However, before any C/C++ enthusiasts go apoplectic, I should explain why Java turns out to be faster here, as the comparison is not quite fair… In the C and C++ code, use was made of the internal R random number generation routines, which are relatively slow compared to many modern numerical library implementations. In the Java code, I used Parallel COLT for random number generation, as it isn’t straightforward to call the R generators from Java code. It turns out that the COLT generators are faster than the R generators, and that is why Java turns out to be faster here…

    C+GSL

    Of course we do not have to use the R random number generators within our C code. For example, we could instead call on the GSL generators, using the following code:

    #include <stdio.h>
    #include <math.h>
    #include <stdlib.h>
    #include <gsl/gsl_rng.h>
    #include <gsl/gsl_randist.h>
    #include <R.h>
    
    void gibbsGSL(int *Np,int *thinp,int *seedp,double *xvec,double *yvec)
    {
      int i,j;
      int N=*Np,thin=*thinp,seed=*seedp;
      gsl_rng *r = gsl_rng_alloc(gsl_rng_mt19937);
      gsl_rng_set(r,seed);
      double x=0;
      double y=0;
      for (i=0;i<N;i++) {
        for (j=0;j<thin;j++) {
          x=gsl_ran_gamma(r,3.0,1.0/(y*y+4));
          y=1.0/(x+1)+gsl_ran_gaussian(r,1.0/sqrt(2*x+2));
        }
        xvec[i]=x; yvec[i]=y;
      }
    }
    

    It can be compiled with R CMD SHLIB -lgsl -lgslcblas gibbsGSL.c, and then called as for the regular C version. This runs in 8.0 seconds, which is noticeably faster than the Java code, but probably not “enough” faster to make it an important factor to consider in language choice.

    Summary

    In this post I’ve shown that it is relatively straightforward to call code written in C, C++ or Java from within R, and that this can give very significant performance gains relative to pure R code. All of the options give fairly similar performance gains. I showed that in the case of this particular example, the “obvious” Java code is actually slightly faster than the “obvious” C or C++ code, and explained why, and how to make the C version slightly faster by using the GSL. The post by Dirk shows how to call the GSL generators from the C++ version, which I haven’t replicated here.

    Gibbs sampler in various languages (revisited)

    16/07/2011

    Introduction

    Regular readers of this blog will know that in April 2010 I published a short post showing how a trivial bivariate Gibbs sampler could be implemented in the four languages that I use most often these days (R, python, C, Java), and I discussed relative timings, and how one might start to think about trading off development time against execution time for more complex MCMC algorithms. I actually wrote the post very quickly one night while I was stuck in a hotel room in Seattle – I didn’t give much thought to it, and the main purpose was to provide simple illustrative examples of simple Monte Carlo codes using non-uniform random number generators in the different languages, as a starting point for someone thinking of switching languages (say, from R to Java or C, for efficiency reasons). It wasn’t meant to be very deep or provocative, or to start any language wars. Suffice to say that this post has had many more hits than all of my other posts combined, is still my most popular post, and still attracts comments and spawns other posts to this day. Several people have requested that I re-do the post more carefully, to include actual timings, and to include a few additional optimisations. Hence this post. For reference, the original post is here. A post about it from the python community is here, and a recent post about using Rcpp and inlined C++ code to speed up the R version is here.

    The sampler

    So, the basic idea was to construct a Gibbs sampler for the bivariate distribution

    f(x,y) = kx^2\exp\{-xy^2-y^2+2y-4x\},\qquad x>0,y\in\Bbb{R}

    with unknown normalising constant k>0 ensuring that the density integrates to one. Unfortunately, in the original post I dropped a factor of 2 constructing one of the full conditionals, which meant that none of the samplers actually had exactly the right target distribution (thanks to Sanjog Misra for bringing this to my attention). So actually, the correct full conditionals are

    \displaystyle x|y \sim Ga(3,y^2+4)

    \displaystyle y|x \sim N\left(\frac{1}{1+x},\frac{1}{2(1+x)}\right)

    Note the factor of two in the variance of the full conditional for y. Given the full conditionals, it is simple to alternately sample from them to construct a Gibbs sampler for the target distribution. We will run a Gibbs sampler with a thin of 1000 and obtain a final sample of 50000.

    Implementations

    R

    Let’s start with R again. The slightly modified version of the code from the old post is given below

    gibbs=function(N,thin)
    {
    	mat=matrix(0,ncol=3,nrow=N)
    	mat[,1]=1:N
    	x=0
    	y=0
    	for (i in 1:N) {
    		for (j in 1:thin) {
    			x=rgamma(1,3,y*y+4)
    			y=rnorm(1,1/(x+1),1/sqrt(2*x+2))
    		}
    		mat[i,2:3]=c(x,y)
    	}
    	mat=data.frame(mat)
    	names(mat)=c("Iter","x","y")
    	mat
    }
    
    writegibbs=function(N=50000,thin=1000)
    {
    	mat=gibbs(N,thin)
    	write.table(mat,"data.tab",row.names=FALSE)
    }
    
    writegibbs()
    

    I’ve just corrected the full conditional, and I’ve increased the sample size and thinning to 50k and 1k, respectively, to allow for more accurate timings (of the faster languages). This code can be run from the (Linux) command line with something like:

    time Rscript gibbs.R
    

    I discuss timings in detail towards the end of the post, but this code is slow, taking over 7 minutes on my (very fast) laptop. Now, the above code is typical of the way code is often structured in R – doing as much as possible in memory, and writing to disk only if necessary. However, this can be a bad idea with large MCMC codes, and is less natural in other languages, anyway, so below is an alternative version of the code, written in more of a scripting language style.

    gibbs=function(N,thin)
    {
    	x=0
    	y=0
            cat(paste("Iter","x","y","\n"))
    	for (i in 1:N) {
    		for (j in 1:thin) {
    			x=rgamma(1,3,y*y+4)
    			y=rnorm(1,1/(x+1),1/sqrt(2*x+2))
    		}
    		cat(paste(i,x,y,"\n"))
    	}
    }
    
    gibbs(50000,1000)
    

    This can be run with a command like

    time Rscript gibbs-script.R > data.tab
    

    This code actually turns out to be a slightly slower than the in-memory version for this simple example, but for larger problems I would not expect that to be the case. I always analyse MCMC output using R, whatever language I use for running the algorithm, so for completeness, here is a bit of code to load up the data file, do some plots and compute summary statistics.

    fun=function(x,y)
    {
    	x*x*exp(-x*y*y-y*y+2*y-4*x)
    }
    
    compare<-function(file="data.tab")
    {
    	mat=read.table(file,header=TRUE)
    	op=par(mfrow=c(2,1))
    	x=seq(0,3,0.1)
    	y=seq(-1,3,0.1)
    	z=outer(x,y,fun)
    	contour(x,y,z,main="Contours of actual (unnormalised) distribution")
    	require(KernSmooth)
    	fit=bkde2D(as.matrix(mat[,2:3]),c(0.1,0.1))
    	contour(fit$x1,fit$x2,fit$fhat,main="Contours of empirical distribution")
    	par(op)
    	print(summary(mat[,2:3]))
    }
    compare()
    

    Python

    Another language I use a lot is Python. I don’t want to start any language wars, but I personally find python to be a better designed language than R, and generally much nicer for the development of large programs. A python script for this problem is given below

    import random,math
    
    def gibbs(N=50000,thin=1000):
        x=0
        y=0
        print "Iter  x  y"
        for i in range(N):
            for j in range(thin):
                x=random.gammavariate(3,1.0/(y*y+4))
                y=random.gauss(1.0/(x+1),1.0/math.sqrt(2*x+2))
            print i,x,y
    
    gibbs()
    

    It can be run with a command like

    time python gibbs.py > data.tab
    

    This code turns out to be noticeably faster than the R versions, taking around 4 minutes on my laptop (again, detailed timing information below). However, there is a project for python known as the PyPy project, which is concerned with compiling regular python code to very fast byte-code, giving significant speed-ups on certain problems. For this post, I downloaded and install version 1.5 of the 64-bit linux version of PyPy. Once installed, I can run the above code with the command

    time pypy gibbs.py > data.tab
    

    To my astonishment, this “just worked”, and gave very impressive speed-up over regular python, running in around 30 seconds. This actually makes python a much more realistic prospect for the development of MCMC codes than I imagined. However, I need to understand the limitations of PyPy better – for example, why doesn’t everyone always use PyPy for everything?! It certainly seems to make python look like a very good option for prototyping MCMC codes.

    C

    Traditionally, I have mainly written MCMC codes in C, using the GSL. C is a fast, efficient, statically typed language, which compiles to native code. In many ways it represents the “gold standard” for speed. So, here is the C code for this problem.

    #include <stdio.h>
    #include <math.h>
    #include <stdlib.h>
    #include <gsl/gsl_rng.h>
    #include <gsl/gsl_randist.h>
    
    void main()
    {
      int N=50000;
      int thin=1000;
      int i,j;
      gsl_rng *r = gsl_rng_alloc(gsl_rng_mt19937);
      double x=0;
      double y=0;
      printf("Iter x y\n");
      for (i=0;i<N;i++) {
        for (j=0;j<thin;j++) {
          x=gsl_ran_gamma(r,3.0,1.0/(y*y+4));
          y=1.0/(x+1)+gsl_ran_gaussian(r,1.0/sqrt(2*x+2));
        }
        printf("%d %f %f\n",i,x,y);
      }
    }
    

    It can be compiled and run with command like

    gcc -O4 -lgsl -lgslcblas gibbs.c -o gibbs
    time ./gibbs > datac.tab
    

    This runs faster than anything else I consider in this post, taking around 8 seconds.

    Java

    I’ve recently been experimenting with Java for MCMC codes, in conjunction with Parallel COLT. Java is a statically typed object-oriented (O-O) language, but is usually compiled to byte-code to run on a virtual machine (known as the JVM). Java compilers and virtual machines are very fast these days, giving “close to C” performance, but with a nicer programming language, and advantages associated with virtual machines. Portability is a huge advantage of Java. For example, I can easily get my Java code to run on almost any University Condor pool, on both Windows and Linux clusters – they all have a recent JVM installed, and I can easily bundle any required libraries with my code. Suffice to say that getting GSL/C code to run on generic Condor pools is typically much less straightforward. Here is the Java code:

    import java.util.*;
    import cern.jet.random.tdouble.*;
    import cern.jet.random.tdouble.engine.*;
    
    class Gibbs
    {
    
        public static void main(String[] arg)
        {
    	int N=50000;
    	int thin=1000;
    	DoubleRandomEngine rngEngine=new DoubleMersenneTwister(new Date());
    	Normal rngN=new Normal(0.0,1.0,rngEngine);
    	Gamma rngG=new Gamma(1.0,1.0,rngEngine);
    	double x=0;
    	double y=0;
    	System.out.println("Iter x y");
    	for (int i=0;i<N;i++) {
    	    for (int j=0;j<thin;j++) {
    		x=rngG.nextDouble(3.0,y*y+4);
    		y=rngN.nextDouble(1.0/(x+1),1.0/Math.sqrt(2*x+2));
    	    }
    	    System.out.println(i+" "+x+" "+y);
    	}
        }
    
    }
    

    It can be compiled and run with

    javac Gibbs.java
    time java Gibbs > data.tab
    

    This takes around 11.6s seconds on my laptop. This is well within a factor of 2 of the C version, and around 3 times faster than even the PyPy python version. It is around 40 times faster than R. Java looks like a good choice for implementing MCMC codes that would be messy to implement in C, or that need to run places where it would be fiddly to get native codes to run.

    Scala

    Another language I’ve been taking some interest in recently is Scala. Scala is a statically typed O-O/functional language which compiles to byte-code that runs on the JVM. Since it uses Java technology, it can seamlessly integrate with Java libraries, and can run anywhere that Java code can run. It is a much nicer language to program in than Java, and feels more like a dynamic language such as python. In fact, it is almost as nice to program in as python (and in some ways nicer), and will run in a lot more places than PyPy python code. Here is the scala code (which calls Parallel COLT for random number generation):

    object GibbsSc {
    
    	import cern.jet.random.tdouble.engine.DoubleMersenneTwister
    	import cern.jet.random.tdouble.Normal
    	import cern.jet.random.tdouble.Gamma
    	import Math.sqrt
     	import java.util.Date
    
    	def main(args: Array[String]) {
    		val N=50000
    		val thin=1000
    		val rngEngine=new DoubleMersenneTwister(new Date)
    		val rngN=new Normal(0.0,1.0,rngEngine)
    		val rngG=new Gamma(1.0,1.0,rngEngine)
    		var x=0.0
    		var y=0.0
    		println("Iter x y")
    		for (i <- 0 until N) {
    			for (j <- 0 until thin) {
    				x=rngG.nextDouble(3.0,y*y+4)
    				y=rngN.nextDouble(1.0/(x+1),1.0/sqrt(2*x+2))
    			}
    			println(i+" "+x+" "+y)
    		}
    	}
    
    }
    

    It can be compiled and run with

    scalac GibbsSc.scala
    time scala GibbsSc > data.tab
    

    This code takes around 11.8s on my laptop – almost as fast as the Java code! So, on the basis of this very simple and superficial example, it looks like scala may offer the best of all worlds – a nice, elegant, terse programming language, functional and O-O programming styles, the safety of static typing, the ability to call on Java libraries, great speed and efficiency, and the portability of Java! Very interesting.

    Groovy

    James Durbin has kindly sent me a Groovy version of the code, which he has also discussed in his own blog post. Groovy is a dynamic O-O language for the JVM, which, like Scala, can integrate nicely with Java applications. It isn’t a language I have examined closely, but it seems quite nice. The code is given below:

    import cern.jet.random.tdouble.engine.*
    import cern.jet.random.tdouble.*
    N=50000
    thin=1000
    rngEngine= new DoubleMersenneTwister(new Date())
    rngN=new Normal(0.0,1.0,rngEngine)
    rngG=new Gamma(1.0,1.0,rngEngine)
    x=0.0
    y=0.0
    println("Iter x y")
    for(i in 1..N){
    	for(j in 1..thin){
    		x=rngG.nextDouble(3.0,y*y+4)
    		y=rngN.nextDouble(1.0/(x+1),1.0/Math.sqrt(2*x+2))
    	}
    	println("$i $x $y")
    }
    

    It can be run with a command like:

    time groovy Gibbs.gv > data.tab
    

    Again, rather amazingly, this code runs in around 35 seconds – very similar to the speed of PyPy. This makes Groovy also seem like a potential very attractive environment for prototyping MCMC codes, especially if I’m thinking about ultimately porting to Java.

    Timings

    The laptop I’m running everything on is a Dell Precision M4500 with an Intel i7 Quad core (x940@2.13Ghz) CPU, running the 64-bit version of Ubuntu 11.04. I’m running stuff from the Ubuntu (Unity) desktop, and running several terminals and applications, but the machine is not loaded at the time each job runs. I’m running each job 3 times and taking the arithmetic mean real elapsed time. All timings are in seconds.

    R 2.12.1 (in memory) 435.0
    R 2.12.1 (script) 450.2
    Python 2.7.1+ 233.5
    PyPy 1.5 32.2
    Groovy 1.7.4 35.4
    Java 1.6.0 11.6
    Scala 2.7.7 11.8
    C (gcc 4.5.2) 8.1

    If we look at speed-up relative to the R code (in-memory version), we get:

    R (in memory) 1.00
    R (script) 0.97
    Python 1.86
    PyPy 13.51
    Groovy 12.3
    Java 37.50
    Scala 36.86
    C 53.70

    Alternatively, we can look at slow-down relative to the C version, to get:

    R (in memory) 53.7
    R (script) 55.6
    Python 28.8
    PyPy 4.0
    Groovy 4.4
    Java 1.4
    Scala 1.5
    C 1.0

    Discussion

    The findings here are generally consistent with those of the old post, but consideration of PyPy, Groovy and Scala does throw up some new issues. I was pretty stunned by PyPy. First, I didn’t expect that it would “just work” – I thought I would either have to spend time messing around with my configuration settings, or possibly even have to modify my code slightly. Nope. Running python code with pypy appears to be more than 10 times faster than R, and only 4 times slower than C. I find it quite amazing that it is possible to get python code to run just 4 times slower than C, and if that is indicative of more substantial examples, it really does open up the possibility of using python for “real” problems, although library coverage is currently a problem. It certainly solves my “prototyping problem”. I often like to prototype algorithms in very high level dynamic languages like R and python before porting to a more efficient language. However, I have found that this doesn’t always work well with complex MCMC codes, as they just run too slowly in the dynamic languages to develop, test and debug conveniently. But it looks now as though PyPy should be fast enough at least for prototyping purposes, and may even be fast enough for production code in some circumstances. But then again, exactly the same goes for Groovy, which runs on the JVM, and can access any existing Java library… I haven’t yet looked into Groovy in detail, but it appears that it could be a very nice language for prototyping algorithms that I intend to port to Java.

    The results also confirm my previous findings that Java is now “fast enough” that one shouldn’t worry too much about the difference in speed between it and native code written in C (or C++). The Java language is much nicer than C or C++, and the JVM platform is very attractive in many situations. However, the Scala results were also very surprising for me. Scala is a really elegant language (certainly on a par with python), comes with all of the advantages of Java, and appears to be almost as fast as Java. I’m really struggling to come up with reasons not to use Scala for everything!

    Speeding up R

    MCMC codes are used by a range of different scientists for a range of different problems. However, they are very (most?) often used by Bayesian statisticians who use the algorithms to target a Bayesian posterior distribution. For various (good) reasons, many statisticians are heavily invested in R, like to use R as much as possible, and do as much as possible from within the R environment. These results show why R is not a good language in which to implement MCMC algorithms, so what is an R-dependent statistician supposed to do? One possibility would be to byte-code compile R code in an analogous way to python and pypy. The very latest versions of R support such functionality, but the post by Dirk Eddelbuettel suggests that the current version of cmpfun will only give a 40% speedup on this problem, which is still slower than regular python code. Short of a dramatic improvement in this technology, the only way forward seems to be to extend R using code from another language. It is easy to extend R using C, C++ and Java. I have shown in previous posts how to do this using Java and using C, and the recent post by Dirk shows how to extend using C++. Although interesting, this doesn’t really have much bearing on the current discussion. If you extend using Java you get Java-like speedups, and if you extend using C you get C-like speedups. However, in case people are interested, I intend to gather up these examples into one post and include detailed timing information in a subsequent post.

    Java math libraries and Monte Carlo simulation codes

    04/06/2011

    Java libraries for (non-uniform) random number simulation

    Anyone writing serious Monte Carlo (and MCMC) codes relies on having a very good and fast (uniform) random number generator and associated functions for generation of non-uniform random quantities, such as Gaussian, Poisson, Gamma, etc. In a previous post I showed how to write a simple Gibbs sampler in four different languages. In C (and C++) random number generation is easy for most scientists, as the (excellent) GNU Scientific Library (GSL) provides exactly what most people need. But it wasn’t always that way… I remember the days before the GSL, when it was necessary to hunt around on the net for bits of C code to implement different algorithms. Worse, it was often necessary to hunt around for a bit of free FORTRAN code, and compile that with an F77 compiler and figure out how to call it from C. Even in the early Alpha days of the GSL, coverage was patchy, and the API changed often. Bad old days… But those days are long gone, and C programmers no longer have to worry about the problem of random variate generation – they can safely concentrate on developing their interesting new algorithm, and leave the rest to the GSL. Unfortunately for Java programmers, there isn’t yet anything quite comparable to the GSL in Java world.

    I pretty much ignored Java until Java 5. Before then, the language was too limited, and the compilers and JVMs were too primitive to really take seriously for numerical work. But since the launch of Java 5 I’ve been starting to pay more interest. The language is now a perfectly reasonable O-O language, and the compilers and JVMs are pretty good. On a lot of benchmarks, Java is really quite comparable to C/C++, and Java is nicer to code, and has a lot of impressive associated technology. So if there was a math library comparable to the GSL, I’d be quite tempted to jump ship to the Java world and start writing all of my Monte Carlo codes in Java. But there isn’t. At least not yet.

    When I first started to take Java seriously, the only good math library with good support for non-uniform random number generation was COLT. COLT was, and still is, pretty good. The code is generally well-written, and fast, and the documentation for it is reasonable. However, the structure of the library is very idiosyncratic, the coverage is a bit patchy, and there doesn’t ever seem to have been a proper development community behind it. It seems very much to have been a one-man project, which has long since stagnated. Unsurprisingly then, COLT has been forked. There is now a Parallel COLT project. This project is continuing the development of COLT, adding new features that were missing from COLT, and, as the name suggests, adding concurrency support. Parallel COLT is also good, and is the main library I currently use for random number generation in Java. However, it has obviously inherited all of the idiosyncrasies that COLT had, and still doesn’t seem to have a large and active development community associated with it. There is no doubt that it is an incredibly useful software library, but it still doesn’t really compare to the GSL.

    I have watched the emergence of the Apache Commons Math project with great interest (not to be confused with Uncommons Math – another one-man project). I think this project probably has the greatest potential for providing the Java community with their own GSL equivalent. The Commons project has a lot of momentum, the Commons Math project seems to have an active development community, and the structure of the library is more intuitive than that of (Parallel) COLT. However, it is early days, and the library still has patchy coverage and is a bit rough around the edges. It reminds me a lot of the GSL back in its Alpha days. I’d not bothered to even download it until recently, as the random number generation component didn’t include the generation of gamma random quantities – an absolutely essential requirement for me. However, I noticed recently that the latest release (2.2) did include gamma generation, so I decided to download it and try it out. It works, but the generation of gamma random quantities is very slow (around 50 times slower than Parallel COLT). This isn’t a fundamental design flaw of the whole library – generating Gaussian random quantities is quite comparable with other libraries. It’s just that an inversion method has been used for gamma generation. All efficient gamma generators use a neat rejection scheme. In case anyone would like to investigate for themselves, here is a complete program for gamma generation designed to be linked against Parallel COLT:

    import java.util.*;
    import cern.jet.random.tdouble.*;
    import cern.jet.random.tdouble.engine.*;
    
    class GammaPC
    {
    
        public static void main(String[] arg)
        {
    	DoubleRandomEngine rngEngine=new DoubleMersenneTwister();
    	Gamma rngG=new Gamma(1.0,1.0,rngEngine);
    	long N=10000;
    	double x=0.0;
    	for (int i=0;i<N;i++) {
    	    for (int j=0;j<1000;j++) {
    		x=rngG.nextDouble(3.0,25.0);
    	    }
    	    System.out.println(x);
    	}
        }
        
    }
    

    and here is a complete program designed to be linked against Commons Math:

    import java.util.*;
    import org.apache.commons.math.*;
    import org.apache.commons.math.random.*;
    
    class GammaACM
    {
    
        public static void main(String[] arg) throws MathException
        {
    	RandomDataImpl rng=new RandomDataImpl();
    	long N=10000;
    	double x=0.0;
    	for (int i=0;i<N;i++) {
    	    for (int j=0;j<1000;j++) {
    		x=rng.nextGamma(3.0,1.0/25.0);
    	    }
    	    System.out.println(x);
    	}
        }
        
    }
    

    The two codes do the same thing (note that they parameterise the gamma distribution differently). Both programs work (they generate variates from the same, correct, distribution), and the Commons Math interface is slightly nicer, but the code is much slower to execute. I’m still optimistic that Commons Math will one day be Java’s GSL, but I’m not giving up on Parallel COLT (or C, for that matter!) just yet…

    The particle marginal Metropolis-Hastings (PMMH) particle MCMC algorithm

    17/05/2011

    Introduction

    In the previous post I explained how one can use an unbiased estimate of marginal likelihood derived from a particle filter within a Metropolis-Hastings MCMC algorithm in order to construct an exact pseudo-marginal MCMC scheme for the posterior distribution of the model parameters given some time course data. This idea is closely related to that of the particle marginal Metropolis-Hastings (PMMH) algorithm of Andreiu et al (2010), but not really exactly the same. This is because for a Bayesian model with parameters \theta, latent variables x and data y, of the form

    \displaystyle  p(\theta,x,y) = p(\theta)p(x|\theta)p(y|x,\theta),

    the pseudo-marginal algorithm which exploits the fact that the particle filter’s estimate of likelihood is unbiased is an MCMC algorithm which directly targets the marginal posterior distribution p(\theta|y). On the other hand, the PMMH algorithm is an MCMC algorithm which targets the full joint posterior distribution p(\theta,x|y). Now, the PMMH scheme does reduce to the pseudo-marginal scheme if samples of x are not generated and stored in the state of the Markov chain, and it certainly is the case that the pseudo-marginal algorithm gives some insight into why the PMMH algorithm works. However, the PMMH algorithm is much more powerful, as it solves the “smoothing” and parameter estimation problem simultaneously and exactly, including the “initial value” problem (computing the posterior distribution of the initial state, x_0). Below I will describe the algorithm and explain why it works, but first it is necessary to understand the relationship between marginal, joint and “likelihood-free” MCMC updating schemes for such latent variable models.

    MCMC for latent variable models

    Marginal approach

    If we want to target p(\theta|y) directly, we can use a Metropolis-Hastings scheme with a fairly arbitrary proposal distribution for exploring \theta, where a new \theta^\star is proposed from f(\theta^\star|\theta) and accepted with probability \min\{1,A\}, where

    \displaystyle  A = \frac{p(\theta^\star)}{p(\theta)} \times  \frac{f(\theta|\theta^\star)}{f(\theta^\star|\theta)} \times \frac{p({y}|\theta^\star)}{p({y}|\theta)}.

    As previously discussed, the problem with this scheme is that the marginal likelihood p(y|\theta) required in the acceptance ratio is often difficult to compute.

    Likelihood-free MCMC

    A simple “likelihood-free” scheme targets the full joint posterior distribution p(\theta,x|y). It works by exploiting the fact that we can often simulate from the model for the latent variables p(x|\theta) even when we can’t evaluate it, or marginalise x out of the problem. Here the Metropolis-Hastings proposal is constructed in two stages. First, a proposed new \theta^\star is sampled from f(\theta^\star|\theta) and then a corresponding x^\star is simulated from the model p(x^\star|\theta^\star). The pair (\theta^\star,x^\star) is then jointly accepted with ratio

    \displaystyle  A = \frac{p(\theta^\star)}{p(\theta)} \times  \frac{f(\theta|\theta^\star)}{f(\theta^\star|\theta)} \times \frac{p(y|{x}^\star,\theta^\star)}{p(y|{x},\theta)}.

    The proposal mechanism ensures that the proposed x^\star is consistent with the proposed \theta^\star, and so the procedure can work provided that the dimension of the data y is low. However, in order to work well more generally, we would want the proposed latent variables to be consistent with the data as well as the model parameters.

    Ideal joint update

    Motivated by the likelihood-free scheme, we would really like to target the joint posterior p(\theta,x|y) by first proposing \theta^\star from f(\theta^\star|\theta) and then a corresponding x^\star from the conditional distribution p(x^\star|\theta^\star,y). The pair (\theta^\star,x^\star) is then jointly accepted with ratio

    \displaystyle  A = \frac{p(\theta^\star)}{p(\theta)}   \frac{p({x}^\star|\theta^\star)}{p({x}|\theta)}   \frac{f(\theta|\theta^\star)}{f(\theta^\star|\theta)}   \frac{p(y|{x}^\star,\theta^\star)}{p(y|{x},\theta)}  \frac{p({x}|y,\theta)}{p({x}^\star|y,\theta^\star)}\\  \qquad = \frac{p(\theta^\star)}{p(\theta)}  \frac{p(y|\theta^\star)}{p(y|\theta)} \frac{f(\theta|\theta^\star)}{f(\theta^\star|\theta)}.

    Notice how the acceptance ratio simplifies, using the basic marginal likelihood identity (BMI) of Chib (1995), and x drops out of the ratio completely in order to give exactly the ratio used for the marginal updating scheme. Thus, the “ideal” joint updating scheme reduces to the marginal updating scheme if x is not sampled and stored as a component of the Markov chain.

    Understanding the relationship between these schemes is useful for understanding the PMMH algorithm. Indeed, we will see that the “ideal” joint updating scheme (and the marginal scheme) corresponds to PMMH using infinitely many particles in the particle filter, and that the likelihood-free scheme corresponds to PMMH using exactly one particle in the particle filter. For an intermediate number of particles, the PMMH scheme is a compromise between the “ideal” scheme and the “blind” likelihood-free scheme, but is always likelihood-free (when used with a bootstrap particle filter) and always has an acceptance ratio leaving the exact posterior invariant.

    The PMMH algorithm

    The algorithm

    The PMMH algorithm is an MCMC algorithm for state space models jointly updating \theta and x_{0:T}, as the algorithms above. First, a proposed new \theta^\star is generated from a proposal f(\theta^\star|\theta), and then a corresponding x_{0:T}^\star is generated by running a bootstrap particle filter (as described in the previous post, and below) using the proposed new model parameters, \theta^\star, and selecting a single trajectory by sampling once from the final set of particles using the final set of weights. This proposed pair (\theta^\star,x_{0:T}^\star) is accepted using the Metropolis-Hastings ratio

    \displaystyle  A = \frac{\hat{p}_{\theta^\star}(y_{1:T})p(\theta^\star)q(\theta|\theta^\star)}{\hat{p}_{\theta}(y_{1:T})p(\theta)q(\theta^\star|\theta)},

    where \hat{p}_{\theta^\star}(y_{1:T}) is the particle filter’s (unbiased) estimate of marginal likelihood, described in the previous post, and below. Note that this approach tends to the perfect joint/marginal updating scheme as the number of particles used in the filter tends to infinity. Note also that for a single particle, the particle filter just blindly forward simulates from p_\theta(x^\star_{0:T}) and that the filter’s estimate of marginal likelihood is just the observed data likelihood p_\theta(y_{1:T}|x^\star_{0:T}) leading precisely to the simple likelihood-free scheme. To understand for an arbitrary finite number of particles, M, one needs to think carefully about the structure of the particle filter.

    Why it works

    To understand why PMMH works, it is necessary to think about the joint distribution of all random variables used in the bootstrap particle filter. To this end, it is helpful to re-visit the particle filter, thinking carefully about the resampling and propagation steps.

    First introduce notation for the “particle cloud”: \mathbf{x}_t=\{x_t^k|k=1,\ldots,M\}, \boldsymbol{\pi}_t=\{\pi_t^k|k=1,\ldots,M\}, \tilde{\mathbf{x}}_t=\{(x_t^k,\pi_t^k)|k=1,\ldots,M\}. Initialise the particle filter with \tilde{\mathbf{x}}_0, where x_0^k\sim p(x_0) and \pi_0^k=1/M (note that w_0^k is undefined). Now suppose at time t we have a sample from p(x_t|y_{1:t}): \tilde{\mathbf{x}}_t. First resample by sampling a_t^k \sim \mathcal{F}(a_t^k|\boldsymbol{\pi}_t), k=1,\ldots,M. Here we use \mathcal{F}(\cdot|\boldsymbol{\pi}) for the discrete distribution on 1:M with probability mass function \boldsymbol{\pi}. Next sample x_{t+1}^k\sim p(x_{t+1}^k|x_t^{a_t^k}). Set w_{t+1}^k=p(y_{t+1}|x_{t+1}^k) and \pi_{t+1}^k=w_{t+1}^k/\sum_{i=1}^M w_{t+1}^i. Finally, propagate \tilde{\mathbf{x}}_{t+1} to the next step… We define the filter’s estimate of likelihood as \hat{p}(y_t|y_{1:t-1})=\frac{1}{M}\sum_{i=1}^M w_t^i and \hat{p}(y_{1:T})=\prod_{i=1}^T \hat{p}(y_t|y_{1:t-1}). See Doucet et al (2001) for further theoretical background on particle filters and SMC more generally.

    Describing the filter carefully as above allows us to write down the joint density of all random variables in the filter as

    \displaystyle  \tilde{q}(\mathbf{x}_0,\ldots,\mathbf{x}_T,\mathbf{a}_0,\ldots,\mathbf{a}_{T-1})  = \left[\prod_{k=1}^M p(x_0^k)\right] \left[\prod_{t=0}^{T-1}    \prod_{k=1}^M \pi_t^{a_t^k} p(x_{t+1}^k|x_t^{a_t^k}) \right]

    For PMMH we also sample a final index k' from \mathcal{F}(k'|\boldsymbol{\pi}_T) giving the joint density

    \displaystyle  \tilde{q}(\mathbf{x}_0,\ldots,\mathbf{x}_T,\mathbf{a}_0,\ldots,\mathbf{a}_{T-1})\pi_T^{k'}

    We write the final selected trajectory as

    \displaystyle  x_{0:T}^{k'}=(x_0^{b_0^k},\ldots,x_T^{b_T^{k'}}),

    where b_t^{k'}=a_t^{b_{t+1}^{k'}}, and b_T^{k'}=k'. If we now think about the structure of the PMMH algorithm, our proposal on the space of all random variables in the problem is in fact

    \displaystyle  f(\theta^\star|\theta)\tilde{q}_{\theta^\star}(\mathbf{x}_0^\star,\ldots,\mathbf{x}_T^\star,\mathbf{a}_0^\star,\ldots,\mathbf{a}_{T-1}^\star)\pi_T^{{k'}^\star}

    and by considering the proposal and the acceptance ratio, it is clear that detailed balance for the chain is satisfied by the target with density proportional to

    \displaystyle  p(\theta)\hat{p}_\theta(y_{1:T})  \tilde{q}_\theta(\mathbf{x}_0,\ldots,\mathbf{x}_T,\mathbf{a}_0,\ldots,\mathbf{a}_{T-1})  \pi_T^{k'}

    We want to show that this target marginalises down to the correct posterior p(\theta,x_{0:T}|y_{1:T}) when we consider just the parameters and the selected trajectory. But if we consider the terms in the joint distribution of the proposal corresponding to the trajectory selected by k', this is given by

    \displaystyle  p_\theta(x_0^{b_0^{k'}})\left[\prod_{t=0}^{T-1} \pi_t^{b_t^{k'}}    p_\theta(x_{t+1}^{b_{t+1}^{k'}}|x_t^{b_t^{k'}})\right]\pi_T^{k'}  =  p_\theta(x_{0:T}^{k'})\prod_{t=0}^T \pi_t^{b_t^{k'}}

    which, by expanding the \pi_t^{b_t^{k'}} in terms of the unnormalised weights, simplifies to

    \displaystyle  \frac{p_\theta(x_{0:T}^{k'})p_\theta(y_{1:T}|x_{0:T}^{k'})}{M^{T+1}\hat{p}_\theta(y_{1:T})}

    It is worth dwelling on this result, as this is the key insight required to understand why the PMMH algorithm works. The whole point is that the terms in the joint density of the proposal corresponding to the selected trajectory exactly represent the required joint distribution modulo a couple of normalising constants, one of which is the particle filter’s estimate of marginal likelihood. Thus, by including \hat{p}_\theta(y_{1:T}) in the acceptance ratio, we knock out the normalising constant, allowing all of the other terms in the proposal to be marginalised away. In other words, the target of the chain can be written as proportional to

    \displaystyle  \frac{p(\theta)p_\theta(x_{0:T}^{k'},y_{1:T})}{M^{T+1}} \times  \text{(Other terms...)}

    The other terms are all probabilities of random variables which do not occur elsewhere in the target, and hence can all be marginalised away to leave the correct posterior

    \displaystyle  p(\theta,x_{0:T}|y_{1:T})

    Thus the PMMH algorithm targets the correct posterior for any number of particles, M. Also note the implied uniform distribution on the selected indices in the target.

    I will give some code examples in a future post.


    Follow

    Get every new post delivered to your Inbox.