## Introduction

In this post I’ll look at how to call compiled C code from an R session. The focus here is on calling C code from R, rather than on extending R using C. Although the two are technically very similar problems, the emphasis is somewhat different. A lot of existing documentation focuses on the latter problem, and this is one of the motivations for writing this post. Fortunately, the problem of calling existing C code from R is a bit simpler than the more general problem of extending R in C.

In a previous post I looked at how to implement a trivial bivariate Gibbs sampler in various languages. It was seen there that the C version ran approximately 60 times faster than the R version. It is therefore often desirable to code up MCMC algorithms in C. However, it is usually very convenient to be able to call such algorithms from inside an R session. There are various ways to do this, ranging from the trivial to very complex. In this post I will look at some of the simpler methods and discuss the pros and cons.

## Standalone C code

We will restrict attention to the Gibbs sampler discussed in a previous post. We will focus on the C version of the code. Below is a slightly modified version of the code which includes some command-line arguments that enable some flexibility in how the code is run post-compilation.

#include <stdio.h>
#include <math.h>
#include <stdlib.h>
#include <gsl/gsl_rng.h>
#include <gsl/gsl_randist.h>

int main(int argc, char *argv[])
{
if (argc!=4) {
fprintf(stderr,"Usage: %s <Iters> <Thin> <Seed>\n",argv[0]);
exit(EXIT_FAILURE);
}
long N=(long) atoi(argv[1]);
long thin=(long) atoi(argv[2]);
long seed=(long) atoi(argv[3]);
long i,j;
gsl_rng *r = gsl_rng_alloc(gsl_rng_mt19937);
gsl_rng_set(r,seed);
double x=0;
double y=0;
printf("Iter x y\n");
for (i=0;i<N;i++) {
for (j=0;j<thin;j++) {
x=gsl_ran_gamma(r,3.0,1.0/(y*y+4));
y=1.0/(x+1)+gsl_ran_gaussian(r,1.0/sqrt(x+1));
}
printf("%ld %f %f\n",i,x,y);
}
exit(EXIT_SUCCESS);
}


Assuming a Unix/Linux environment (including a GSL implementation), the above code can be compiled from the Unix shell with a command like:

gcc -O2 -lgsl -lgslcblas standalone.c -o standalone


and run with a command like:

./standalone 10000 500 1 > data.tab


The first command-line argument is the number of iterations required, and the second is the “thin” to be applied to the output. The third argument is the “seed” to be applied to the GSL random number generator (RNG). This allows different (not quite independent – see my post on parallel MCMC for details) runs to be obtained by selecting different seed values. The simplest way to call this code from within an R session is to call this unmodified executable using the R system() command. A small “wrapper” function to do this is given below.

standalone<-function(N=10000,thin=500,
seed=trunc(runif(1)*1e6),
exec=file.path(".","standalone"),
tmpfile=tempfile())
{
command=paste(exec,N,thin,seed,">",tmpfile)
system(command)
}


Note the use of the file.path() and tempfile() R functions in a (probably vain!) attempt to make the code somewhat portable. Just running standalone() from an R session should then return a data frame containing the MCMC output. I gave some commands for analysing this output in a previous post. This approach to calling external code is very simple and crude, and quite generic (it is not specific to C code at all). However, it is very quick and easy to implement, and in many cases quite efficient. There is a considerable computational overhead in executing the system command and parsing output files from disk. However, if the code being called is very computationally intensive and relatively slow (as is typically the case), then this overhead can often be negligible, rendering this approach quite practical.

## Building and linking to a shared library

If one is really keen to avoid the overhead of executing an R system command, then it is necessary to compile the required C code into a shared library (or DLL), and link this code into R where it can be called directly via R’s foreign language interface. Below is a version of the previous C code modified to make it appropriate for calling from R.

#include <stdio.h>
#include <math.h>
#include <stdlib.h>
#include <gsl/gsl_rng.h>
#include <gsl/gsl_randist.h>
#include <R.h>

void gibbs(int *Np,int *thinp,int *seedp,double *xvec,double *yvec)
{
int i,j;
int N=*Np,thin=*thinp,seed=*seedp;
gsl_rng *r = gsl_rng_alloc(gsl_rng_mt19937);
gsl_rng_set(r,seed);
double x=0;
double y=0;
for (i=0;i<N;i++) {
for (j=0;j<thin;j++) {
x=gsl_ran_gamma(r,3.0,1.0/(y*y+4));
y=1.0/(x+1)+gsl_ran_gaussian(r,1.0/sqrt(x+1));
}
xvec[i]=x; yvec[i]=y;
}
}


Note that it is only possible to pass pointers from simple R/C data types, and so all function arguments must be pointers. Also note that there is no return value to the function, and that values are retrieved in R by modifying some of the values pointed to by the pointer arguments. This is the mode of operation imposed by the basic method that R provides for calling C code from R (the .C() function). Note that there are other methods for extending R in C, using the .Call() and .External() functions, but these are beyond the scope of this post. Again assuming a Unix/Linux environment, this code can be compiled into a shared library with a command like:

R CMD SHLIB -lgsl -lgslcblas dynamic.c


It can then be loaded into a running R session with a command like dyn.load("dynamic.so"). Again, if we are attempting to write portable code, we might use a command like:

dyn.load(file.path(".",paste("dynamic",.Platform$dynlib.ext,sep="")))  You can check what dynamic libraries are loaded into the current R session with getLoadedDLLs(). Once the DLL (Dynamic Link Library) is loaded, it can be called using the .C() function. A small wrapper function appropriate in this instance is given below: dynamic<-function(n=10000,thin=500,seed=trunc(runif(1)*1e6)) { tmp=.C("gibbs",as.integer(n),as.integer(thin), as.integer(seed),x=as.double(1:n), y=as.double(1:n)) mat=cbind(1:n,tmp$x,tmp$y) colnames(mat)=c("Iter","x","y") mat }  Note how a random seed is generated in R to be passed to the C code to be used to seed the GSL random generator used within the C code. The code can then be run with a simple call to dynamic() and everything should work OK provided that all of the required libraries are found. This is the simplest way to link C code into R in a way that avoids the overhead associated with a system() call. However, this approach is also not without issues. In particular, the C code relies on the GSL, and more specifically on the random number streams provided by the GSL. These are completely separate from the random number streams used within the R system. In some situations it would make sense to use the same random number streams used within the R session, and to remove the dependence of the C code on the GSL. ## Using the R API The C code discussed in the previous section relies on the GSL only for the generation of (non-uniform) random numbers. Obviously R has its own very sophisticated system for handling random numbers and it is possible to use this system from within externally called C code using the R API. In particular, C versions of functions such as rnorm() and rgamma() can be called in C by including Rmath.h. Below is a version of the C code previously given modified to use the R random number generation routines and to remove all dependence on the GSL. #include <stdio.h> #include <math.h> #include <stdlib.h> #include <R.h> #include <Rmath.h> void gibbsR(int *Np,int *thinp,double *xvec,double *yvec) { int i,j; int N=*Np,thin=*thinp; GetRNGstate(); double x=0; double y=0; for (i=0;i<N;i++) { for (j=0;j<thin;j++) { x=rgamma(3.0,1.0/(y*y+4)); y=rnorm(1.0/(x+1),1.0/sqrt(x+1)); } xvec[i]=x; yvec[i]=y; } PutRNGstate(); }  Note that a call to GetRNGstate() must be made before calling any random number functions and that a call to PutRNGstate() must be called before the function returns control back to R. This code can be compiled with a command like R CMD SHLIB dynamicR.c  and linked into R with a command like dyn.load(file.path(".",paste("dynamicR",.Platform$dynlib.ext,sep="")))


An appropriate wrapper for this code is given below:

dynamicR<-function(n=10000,thin=500)
{
tmp=.C("gibbsR",as.integer(n),as.integer(thin),
x=as.double(1:n),y=as.double(1:n))
mat=cbind(1:n,tmp$x,tmp$y)
colnames(mat)=c("Iter","x","y")
mat
}


This code is now slightly simpler, and the lack of dependence on external libraries such as the GSL makes it much easier to integrate into R packages, should this be desired.

Foreign language interfaces are a notoriously complex subject and this post has obviously just scratched the surface of the problem. For a few more examples, first see my old computer practicals on Stochastic simulation in R and C. The examples are a bit out of date, but easy to fix. Also see a howto by the Flemish Supercomputing Centre on a similar topic to this one. For more detailed information, see the manual on Writing R extensions, especially the sections on Foreign language interfaces and the R API. I also find Chapter 6 of R Programming for Bioinformatics to be a useful introduction to more complex aspects.

I have also somewhat belatedly re-discovered Charlie Geyer‘s notes on Calling C and Fortran from R, which covers very similar ground to this post. They were probably the unconscious inspiration for this post…

## Getting started with parallel MCMC

### Introduction to parallel MCMC for Bayesian inference, using C, MPI, the GSL and SPRNG

#### Introduction

This post is aimed at people who already know how to code up Markov Chain Monte Carlo (MCMC) algorithms in C, but are interested in how to parallelise their code to run on multi-core machines and HPC clusters. I discussed different languages for coding MCMC algorithms in a previous post. The advantage of C is that it is fast, standard and has excellent scientific library support. Ultimately, people pursuing this route will be interested in running their code on large clusters of fast servers, but for the purposes of development and testing, this really isn’t necessary. A single dual-core laptop, or similar, is absolutely fine. I develop and test on a dual-core laptop running Ubuntu linux, so that is what I will assume for the rest of this post.

There are several possible environments for parallel computing, but I will focus on the Message-Passing Interface (MPI). This is a well-established standard for parallel computing, there are many implementations, and it is by far the most commonly used high performance computing (HPC) framework today. Even if you are ultimately interested in writing code for novel architectures such as GPUs, learning the basics of parallel computation using MPI will be time well spent.

#### MPI

The whole point of MPI is that it is a standard, so code written for one implementation should run fine with any other. There are many implementations. On Linux platforms, the most popular are OpenMPI, LAM, and MPICH. There are various pros and cons associated with each implementation, and if installing on a powerful HPC cluster, serious consideration should be given to which will be the most beneficial. For basic development and testing, however, it really doesn’t matter which is used. I use OpenMPI on my Ubuntu laptop, which can be installed with a simple:

sudo apt-get install openmpi-bin libopenmpi-dev


That’s it! You’re ready to go… You can test your installation with a simple “Hello world” program such as:

#include <stdio.h>
#include <mpi.h>

int main (int argc,char **argv)
{
int rank, size;
MPI_Init (&argc, &argv);
MPI_Comm_rank (MPI_COMM_WORLD, &rank);
MPI_Comm_size (MPI_COMM_WORLD, &size);
printf( "Hello world from process %d of %d\n", rank, size );
MPI_Finalize();
return 0;
}


You should be able to compile this with

mpicc -o helloworld helloworld.c


and run (on 2 processors) with

mpirun -np 2 helloworld


#### GSL

If you are writing non-trivial MCMC codes, you are almost certainly going to need to use a sophisticated math library and associated random number generation (RNG) routines. I typically use the GSL. On Ubuntu, the GSL can be installed with:

sudo apt-get install gsl-bin libgsl0-dev


A simple script to generate some non-uniform random numbers is given below.

#include <gsl/gsl_rng.h>
#include <gsl/gsl_randist.h>

int main(void)
{
int i; double z; gsl_rng *r;
r = gsl_rng_alloc(gsl_rng_mt19937);
gsl_rng_set(r,0);
for (i=0;i<10;i++) {
z = gsl_ran_gaussian(r,1.0);
printf("z(%d) = %f\n",i,z);
}
exit(EXIT_SUCCESS);
}


This can be compiled with a command like:

gcc gsl_ran_demo.c -o gsl_ran_demo -lgsl -lgslcblas


and run with

./gsl_ran_demo


#### SPRNG

When writing parallel Monte Carlo codes, it is important to be able to use independent streams of random numbers on each processor. Although it is tempting to “fudge” things by using a random number generator with a different seed on each processor, this does not guarantee independence of the streams, and an unfortunate choice of seeds could potentially lead to bad behaviour of your algorithm. The solution to this problem is to use a parallel random number generator (PRNG), designed specifically to give independent streams on different processors. Unfortunately the GSL does not have native support for such parallel random number generators, so an external library should be used. SPRNG 2.0 is a popular choice. SPRNG is designed so that it can be used with MPI, but also that it does not have to be. This is an issue, as the standard binary packages distributed with Ubuntu (libsprng2, libsprng2-dev) are compiled without MPI support. If you are going to be using SPRNG with MPI, things are simpler with MPI support, so it makes sense to download sprng2.0b.tar.gz from the SPRNG web site, and build it from source, carefully following the instructions for including MPI support. If you are not familiar with building libraries from source, you may need help from someone who is.

Once you have compiled SPRNG with MPI support, you can test it with the following code:

#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>
#define SIMPLE_SPRNG
#define USE_MPI
#include "sprng.h"

int main(int argc,char *argv[])
{
double rn; int i,k;
MPI_Init(&argc,&argv);
MPI_Comm_rank(MPI_COMM_WORLD,&k);
init_sprng(DEFAULT_RNG_TYPE,0,SPRNG_DEFAULT);
for (i=0;i<10;i++)
{
rn = sprng();
printf("Process %d, random number %d: %f\n", k, i+1, rn);
}
MPI_Finalize();
exit(EXIT_SUCCESS);
}


which can be compiled with a command like:

mpicc -I/usr/local/src/sprng2.0/include -L/usr/local/src/sprng2.0/lib -o sprng_demo sprng_demo.c -lsprng -lgmp


You will need to edit paths here to match your installation. If if builds, it can be run on 2 processors with a command like:

mpirun -np 2 sprng_demo


If it doesn’t build, there are many possible reasons. Check the error messages carefully. However, if the compilation fails at the linking stage with obscure messages about not being able to find certain SPRNG MPI functions, one possibility is that the SPRNG library has not been compiled with MPI support.

The problem with SPRNG is that it only provides a uniform random number generator. Of course we would really like to be able to use the SPRNG generator in conjunction with all of the sophisticated GSL routines for non-uniform random number generation. Many years ago I wrote a small piece of code to accomplish this, gsl-sprng.h. Download this and put it in your include path for the following example:

#include <mpi.h>
#include <gsl/gsl_rng.h>
#include "gsl-sprng.h"
#include <gsl/gsl_randist.h>

int main(int argc,char *argv[])
{
int i,k,po; gsl_rng *r;
MPI_Init(&argc,&argv);
MPI_Comm_rank(MPI_COMM_WORLD,&k);
r=gsl_rng_alloc(gsl_rng_sprng20);
for (i=0;i<10;i++)
{
po = gsl_ran_poisson(r,2.0);
printf("Process %d, random number %d: %d\n", k, i+1, po);
}
MPI_Finalize();
exit(EXIT_SUCCESS);
}


A new GSL RNG, gsl_rng_sprng20 is created, by including gsl-sprng.h immediately after gsl_rng.h. If you want to set a seed, do so in the usual GSL way, but make sure to set it to be the same on each processor. I have had several emails recently from people who claim that gsl-sprng.h “doesn’t work”. All I can say is that it still works for me! I suspect the problem is that people are attempting to use it with a version of SPRNG without MPI support. That won’t work… Check that the previous SPRNG example works, first.

I can compile and run the above code with

mpicc -I/usr/local/src/sprng2.0/include -L/usr/local/src/sprng2.0/lib -o gsl-sprng_demo gsl-sprng_demo.c -lsprng -lgmp -lgsl -lgslcblas
mpirun -np 2 gsl-sprng_demo


#### Parallel Monte Carlo

Now we have parallel random number streams, we can think about how to do parallel Monte Carlo simulations. Here is a simple example:

#include <math.h>
#include <mpi.h>
#include <gsl/gsl_rng.h>
#include "gsl-sprng.h"

int main(int argc,char *argv[])
{
int i,k,N; double u,ksum,Nsum; gsl_rng *r;
MPI_Init(&argc,&argv);
MPI_Comm_size(MPI_COMM_WORLD,&N);
MPI_Comm_rank(MPI_COMM_WORLD,&k);
r=gsl_rng_alloc(gsl_rng_sprng20);
for (i=0;i<10000;i++) {
u = gsl_rng_uniform(r);
ksum += exp(-u*u);
}
MPI_Reduce(&ksum,&Nsum,1,MPI_DOUBLE,MPI_SUM,0,MPI_COMM_WORLD);
if (k == 0) {
printf("Monte carlo estimate is %f\n", (Nsum/10000)/N );
}
MPI_Finalize();
exit(EXIT_SUCCESS);
}


which calculates a Monte Carlo estimate of the integral

$\displaystyle I=\int_0^1 \exp(-u^2)du$

using 10k variates on each available processor. The MPI command MPI_Reduce is used to summarise the values obtained independently in each process. I compile and run with

mpicc -I/usr/local/src/sprng2.0/include -L/usr/local/src/sprng2.0/lib -o monte-carlo monte-carlo.c -lsprng -lgmp -lgsl -lgslcblas
mpirun -np 2 monte-carlo


#### Parallel chains MCMC

Once parallel Monte Carlo has been mastered, it is time to move on to parallel MCMC. First it makes sense to understand how to run parallel MCMC chains in an MPI environment. I will illustrate this with a simple Metropolis-Hastings algorithm to sample a standard normal using uniform proposals, as discussed in a previous post. Here an independent chain is run on each processor, and the results are written into separate files.

#include <gsl/gsl_rng.h>
#include "gsl-sprng.h"
#include <gsl/gsl_randist.h>
#include <mpi.h>

int main(int argc,char *argv[])
{
int k,i,iters; double x,can,a,alpha; gsl_rng *r;
FILE *s; char filename[15];
MPI_Init(&argc,&argv);
MPI_Comm_rank(MPI_COMM_WORLD,&k);
if ((argc != 3)) {
if (k == 0)
fprintf(stderr,"Usage: %s <iters> <alpha>\n",argv[0]);
MPI_Finalize(); return(EXIT_FAILURE);
}
iters=atoi(argv[1]); alpha=atof(argv[2]);
r=gsl_rng_alloc(gsl_rng_sprng20);
sprintf(filename,"chain-%03d.tab",k);
s=fopen(filename,"w");
if (s==NULL) {
perror("Failed open");
MPI_Finalize(); return(EXIT_FAILURE);
}
x = gsl_ran_flat(r,-20,20);
fprintf(s,"Iter X\n");
for (i=0;i<iters;i++) {
can = x + gsl_ran_flat(r,-alpha,alpha);
a = gsl_ran_ugaussian_pdf(can) / gsl_ran_ugaussian_pdf(x);
if (gsl_rng_uniform(r) < a)
x = can;
fprintf(s,"%d %f\n",i,x);
}
fclose(s);
MPI_Finalize(); return(EXIT_SUCCESS);
}


I can compile and run this with the following commands

mpicc -I/usr/local/src/sprng2.0/include -L/usr/local/src/sprng2.0/lib -o mcmc mcmc.c -lsprng -lgmp -lgsl -lgslcblas
mpirun -np 2 mcmc 100000 1


#### Parallelising a single MCMC chain

The parallel chains approach turns out to be surprisingly effective in practice. Obviously the disadvantage of that approach is that “burn in” has to be repeated on every processor, which limits how much efficiency gain can be acheived by running across many processors. Consequently it is often desirable to try and parallelise a single MCMC chain. As MCMC algorithms are inherently sequential, parallelisation is not completely trivial, and most (but not all) approaches to parallelising a single MCMC chain focus on the parallelisation of each iteration. In order for this to be worthwhile, it is necessary that the problem being considered is non-trivial, having a large state space. The strategy is then to divide the state space into “chunks” which can be updated in parallel. I don’t have time to go through a real example in detail in this blog post, but fortunately I wrote a book chapter on this topic almost 10 years ago which is still valid and relevant today. The citation details are:

Wilkinson, D. J. (2005) Parallel Bayesian Computation, Chapter 16 in E. J. Kontoghiorghes (ed.) Handbook of Parallel Computing and Statistics, Marcel Dekker/CRC Press, 481-512.

The book was eventually published in 2005 after a long delay. The publisher which originally commisioned the handbook (Marcel Dekker) was taken over by CRC Press before publication, and the project lay dormant for a couple of years until the new publisher picked it up again and decided to proceed with publication. I have a draft of my original submission in PDF which I recommend reading for further information. The code examples used are also available for download, including several of the examples used in this post, as well as an extended case study on parallelisation of a single chain for Bayesian inference in a stochastic volatility model. Although the chapter is nearly 10 years old, the issues discussed are all still remarkably up-to-date, and the code examples all still work. I think that is a testament to the stability of the technology adopted (C, MPI, GSL). Some of the other handbook chapters have not stood the test of time so well.

For basic information on getting started with MPI and key MPI commands for implementing parallel MCMC algorithms, the above mentioned book chapter is a reasonable place to start. Read it all through carefully, run the examples, and carefully study the code for the parallel stochastic volatility example. Once that is understood, you should find it possible to start writing your own parallel MCMC algorithms. For further information about more sophisticated MPI usage and additional commands, I find the annotated specification: MPI – The complete reference to be as good a source as any.

## Introduction and background

In the previous post I gave an introduction to next generation sequencing (NGS) data, and in particular, working with FASTQ format files in a Unix environment. In another post I gave an introduction to R and Bioconductor for the statistical analysis of biological data. In this post I will give a quick introduction to the use of the Bioconductor package ShortRead for reading and processing FASTQ data using R/Bioconductor. This post assumes a working R/Bioconductor installation, and a basic familiarity with FASTQ files. See the previous posts for further details.

## Getting started

Assuming that Bioconductor is installed, entering the following at the R command prompt will install the ShortRead package and any missing dependencies:

source("http://bioconductor.org/biocLite.R")


The library can be loaded using

library(ShortRead)


Basic information about the package can be obtained with help(package="ShortRead"), and it is also worth getting help for the Biostrings package on which the ShortRead package depends. There is also a vignette for the ShortRead package, and several vignettes associated with the Biostrings package, all of which can be accessed using openVignette().

Assuming that the file Test100k.fastq.gz discussed in the previous post is available and in the current directory, it can be loaded with:

reads=readFastq("Test100k.fastq.gz")


Note that the FASTQ file does not need to be unzipped. Typing

reads


gives some basic information, such as the number of reads, and tells you that the reads object is of type ShortReadQ. Information about this class can be obtained via ?ShortReadQ. If the main interest is in the actual sequences, these can be extracted using

sequences=sread(reads)


Entering sequences will show the first few and last few sequences, and inform you that the object is of class DNAStringSet. Again, ?DNAStringSet may be helpful. As with most large R objects (and large Unix text files), it can be helpful to just look at the first few rows. In R this can be done with head(sequences), analogously to the Unix head command.

## Working with short read data

Individual reads can be accessed directly using usual R indexing, so that sequences[1] gives the sequence of the first read, etc. R string manipulation commands can also be used on a DNAStringSet, so that

substr(sequences,1,5)


will return a vector consisting of the first 5 bases from each read. Consequently,

table(substr(sequences,1,5))


will tabulate the occurrences of the inital pentamer associated with each read. This may or may not be interesting, depending on the data…

## Filtering data

Filtering reads to obtain a subset of direct interest is a key task. Fortunately there is good support for doing this in the ShortRead package. Suppose now that we are only interested in sequences beginning with the pentamer TAGCT (perhaps because those reads correspond to a sample amplified with a particular PCR primer). We can trivially filter our set of reads using standard R subsetting techniques:

myreads=reads[substr(sequences,1,5)=="TAGCT"]


However, filtering is such an important concept that the ShortRead library includes its own special set of functions relating to filtering, which allow filtering on sequence quality as well as sequence content, and allow filters to be combined together to allow the building of new filters from old, etc. Here I will just explain how to do the above filtering operation using a ShortRead srFilter. The idea is to create filters of class SRFilter and then apply these to sets of reads (of class ShortReadQ) in order to filter. So, if you want to build a custom filter, this is done using the function srFilter. An example follows:

myfilter=srFilter(function(x){
},name="My Filter")


The first argument to the srFilter function is a function taking a single argument, x, which is assumed to be a ShortReadQ object. The function should return a boolean vector, giving TRUE for all reads which are to pass through the filter. Once defined, typing myfilter will inform you that myfilter is an object of class SRFilter, and typing srFilter(myfilter) will give the definition of the filter.

Typing myfilter(reads) will apply the filter to the reads, giving a boolean vector. Therefore to obtain the filtered reads, use

myreads=reads[myfilter(reads)]


See ?srFilter and the ShortRead vignette for further details of how to build and combine filters.

Of course there is a bit of a problem with all of this post-hoc filtering when working with very large FASTQ files. The initial readFastq command reads the entire FASTQ file into RAM. If you don’t have enough RAM, the read will fail, and you won’t ever get a chance to filter things down to a smaller subset of reads to work with. So you really want to apply a filter as the data is read. Recent versions of the ShortRead package support this functionality, by allowing an optional filter argument to be passed in to the readFastq function.

myreads=readFastq("Test100k.fastq.gz",filter=myfilter)


This applies the filter at read time, so that as long as the filtered data will fit into the available RAM, the command will succeed. Note that it is desirable to run ShortRead on a 64bit machine with lots of RAM in order to be able to work with a reasonable number of reads at once. Even so, filtering at read time is still likely to be necessary. Attempting to read in a full lane of around 40M reads is likely to fail even on a machine with 16GB of RAM.

## Matching and counting

Suppose that we are interested in looking for occurrences of the sequence “AGACGATCACCATTCGATGC” in the reads, somewhere between positions 22 and 44 (this is a real example). We could start by creating a DNAStringSet containing just the read sequences between positions 22 and 44 as

dict=DNAStringSet(substr(sequences,22,44))


We could then filter the set with a regular expression requiring an exact match. However, NGS methods are not perfect. They occasionally call a base wrongly, and occasionally mess up a read by missing a base, or by inserting an extra base into the read. If we exactly match, we are going to miss all of the reads which aren’t quite right, and as we are matching to quite a long sequence (20 bases), that may well correspond to a lot of reads. What we really want to do is allow an imperfect match, allowing, say 1 base to be wrong, or one indel. We can do that with the following:

hits=vcountPattern("AGACGATCACCATTCGATGC",dict,
max.mismatch=1,with.indels=TRUE)
sum(hits)


hits is a vector corresponding to the number of hits in each read (mainly zeros, with a few ones). By summing them we get the total number of reads which match, and the actual reads can be pulled out for further analysis with a command like

sread(reads[hits])


See ?vcountPattern for futher details, including details of related functions.

## Further analysis

The basic commands and techniques covered in this brief post are enough to get started writing R functions for the analysis of short read data. It is easy to use the R programming language to write functions which count particular matching occurrences and events, and record them in vectors or matrices which can then be further analysed using other Bioconductor or R packages for statistical analysis, modelling or evaluation. As ever, further information can be found in the Bioconductor documentation and other on-line sources. I’ve also found Gentleman’s book R Programming for Bioinformatics to be a very useful source of information for working with sequence data in R.

## Overview

Recent developments in DNA (and RNA) sequencing technology is transforming how modern biological research is done. High-throughput sequencing of DNA and RNA using next generation sequencing (NGS) machines is now a standard approach in most good biological research labs. However, these technologies generate huge amounts of data which is non-trivial to manipulate and analyse. In this post I’ll give a quick introduction to working with short read DNA sequence data stored in the FASTQ format, using basic Unix (Linux) command-line tools. In the next post, I will give an introduction to analysing FASTQ data with Bioconductor.

## Working with data files

The first thing to be aware of when working with NGS data is that the data files involved can be really huge. For example, the data file(s) associated with the single lane of an NGS machine can often total over 5GB, even when stored in a compressed format. This can mean that they are not trivial to move around, and so thought needs to be put into things like how and where to download the files, where to store them, if/how to compress them, etc. Typically, most people will be having the actual sequencing done by a third-party, and the third-party will make the sequence data available for downloading via a secure web site, or similar. As the files are large, it makes sense to download from a machine with a good internet connection. It is also best to download direct to a Unix machine if possible, as Unix copes better with large files, and the packing and compression formats typically used (eg. tar, gzip, bzip) are usually better supported by default on Unix platforms. Also note that the FAT, FAT32 and VFAT file systems often used on (older) Windows platforms can not store very large files (over 4GB). This can sometimes be an issue for Unix users too, as most USB flash drives and some external (USB) hard disks will be VFAT/FAT32 formatted by default. Note that on Unix systems, the commands scp and rsync are useful ways to copy files from one machine to another over a (hopefully fast!) network.

The downloaded files will typically be in gzip (.gz) or gzipped-tar (.tar.gz or .tgz) format. If they are bundled together in a gzipped-tar, it probably makes sense to unpack them into individual files, which can be compressed, and then moved around individually. Use tar xvfz bundle.tar.gz to unpack a bundle of files, but make sure you have plenty of free disk space first (with df -h .). Remember that on Unix you can get basic help on most command-line tools by typing man commandname (where commandname is the name of the command you want information on). Type man man for help on the man command… Remember that compressing and uncompressing large files can take a long time, so be patient, and think carefully before you hit “Return”. Individual files should all be stored compressed. For example, gzip myexp_lane1.fastq will result in the file myexp_lane1.fastq.gz, which will be much smaller than the original, and can be analysed without ever storing the uncompressed version on disk. Use ls -lh to see the files and their sizes in a particular directory. Note that there are a variety of data formats associated with NGS data. Most are text formats which can be inspected using Unix commands such as head, tail and less. Files ending .fas contain the actual reads, files ending .qual contain the “quality” scores associated with a set of reads, but files ending .fastq are in the FASTQ format, which contains both the reads and the quality scores together in a single file. Most tools for working with short read data will work with the FASTQ format, and so that is what we will concentrate on for the rest of this quick introduction.

## Working with FASTQ files

FASTQ files are a text-based file format, with 4 lines of text corresponding to each read. Assuming a gzipped file called myreads.fastq.gz, the first few lines can be inspected with zcat myreads.fastq.gz | head. Typical results will look roughly like the following:

@NG-5232_4_1_1022_17823#0/1
NACTCCGGTGTCGGTCTCGTAGGCCATTTTAGAAGCGAATAAATCGATGNATTCGANCNCNNNNNNNNATCGNNAGAGCTCGTANGCCGTCTTCTGCTTGANNNNNNN
+NG-5232_4_1_1022_17823#0/1
#'''')(++)AAAAAAAAAA########################################################################################
@NG-5232_4_1_1025_18503#0/1
NTCTACGGTGTCGGTCTCGTAGCCTATCGGGTAGCAGAGCTTATCGATGAATTCGAGCTCGGTTTCAGATTGGCAGAGCTCGTANTGCGGCCTTCGGCTGANNNNNNN
+NG-5232_4_1_1025_18503#0/1
############################################################################################################
@NG-5232_4_1_1026_21154#0/1
NGTTACGGTGTCGGTCTCGTAGTGAGTTGACCTCCGCCCAGTATCGATGAATTCGAGCTCGTTTTCAGATCGGAAGAGCTCGTCNGCCGTCTTCTGCTTGANNNNNNN


The first line is an identifier associated with the first read. The second line is the read itself (usually the thing of greatest interest!). The third line is an identifier associated with the quality score. The fourth line gives the quality score associated with each base in the read, using an ASCII code. The next four lines correspond to the second read, and so on. Further details can be obtained from the Wikipedia FASTQ Format page.

Fortunately, Unix was designed from the outset with processing of large text files in mind. This makes it possible to do quite a lot of processing and analysis with standard Unix command-line tools. The number of lines in the (uncompressed) file can be obtained with

zcat myreads.fastq.gz | wc -l


Obviously then, dividing the result by 4 will give the number of reads in the file. You can browse through the file using




Use space to page-down, “b” to go back a page, and “q” to quit. Use man zless for more options. For example “/CAGGTT” will find the next occurrence of “CAGGTT” in the file. Any regular expression can be given after the slash, so, “/^CAGTT” will find the next read which starts with CAGTT. Regular expressions can be generally very useful in the analysis of sequence data, and we will return to them later.

Due to the very large file sizes involved in NGS data analysis, it can often be desirable to create a file containing a (relatively) small number of reads for initial testing and debugging of analysis pipelines. Again, this is easy to do using a command like the following

zcat myreads.fastq.gz | head -400000 | gzip > Test100k.fastq.gz


which takes the first 100k reads from “myreads” and stores them in “Test100k”, without ever storing any uncompressed reads on disk. An example Test100k.fastq.gz is available for downloading, so that readers can experiment for themselves with this kind of data. Note that in Unix, these commands which stream data from one command to another do so without requiring large amounts of RAM. All of these simple Unix filtering tools can be run without any problems on, say, a laptop with a modest amount of RAM (2GB will be fine) running Ubuntu Linux (or similar), so long as you have plenty of free disk space.

### Splitting and joining FASTQ files

If you do not have access to a large RAM machine, it may be desirable to split up a very large set of reads into a collection of files, each of which contains a more manageable set of reads. Again, this can be accomplished with standard Unix commands. For example:

zcat myreads.fastq.gz | split -d -l 2000000 - Block


will create files Block01, Block02, etc., each containing 0.5M reads. However, these files will not be compressed (so make sure you have plenty of disk space before running this!), and do not have the correct file extension. Both of those issues can be fixed with some more standard Unix commands. The following assumes that you are using a Bash shell, but similar things work in other shells:

for name in Block??
do
mv ${name}${name}.fastq
gzip \${name}.fastq
done


This will result in the compressed FASTQ files Block01.fastq.gz, Block02.fastq.gz, etc. The idea now is that each of these files is processed one at a time (or in parallel, perhaps on a cluster) using some tool, and then the results combined in some sensible way later. The precise details will depend a great deal on the nature of the experiment which led to the data.

If necessary, the files can be recombined at a later date with a command like:

zcat Block*.fastq.gz | gzip > allreads.fastq.gz


without storing uncompressed files on disk.

### Filtering FASTQ files

It will very often be desirable to filter the data in a less arbitrary way – selecting reads matching particular criteria for further analysis. Again, this sort of text processing is the kind of thing that is very easy in Unix. Traditionally, tools such as grep, sed and awk were used for this purpose. However, in the early 1990s, many people (myself included) migrated from awk to perl, as it was a full-blown programming language, with many more features than the simple tools. Perl is still a very effective language for this kind of activity, and now the BioPerl project provides a large range of modules specifically targeting biological applications, with an emphasis on sequence analysis. That said, many people find the perl language to be rather ugly, leading to code that is difficult to read and maintain. So in the late 1990s I, like many others, switched from perl to python for text processing and related activities. Python is a great general purpose programming language. It is simple, elegant and easy to learn, and code is easy to read and maintain. Analogous to perl, there is an associated Biopython project, containing modules for sequence analysis and other biological applications. At some point in the future I may write a post on using Biopython for the analysis of FASTQ data, but in the meantime the on-line resources and books such as Python for Bioinformatics provide further information for the interested reader.

Fortunately the FASTQ format is sufficiently simple that many if not most basic filtering and analysis tasks can be accomplished with the default python installation found on the vast majority of Unix systems. Below is a complete python program to filter a FASTQ file, selecting just the reads matching a given regular expression.

#!/usr/bin/env python
# filter.py
# Filter a fastq file according to a given regex to match to read sequence
# Copyright (C) Darren J Wilkinson, November 2010, http://tinyurl.com/darrenjw

import sys,re

for i in range(4):

if __name__=='__main__':
if (len(sys.argv)!=2):
print "Usage: python filter.py <regex> < infile.fastq > outfile.fastq"
exit(1)
regex=re.compile(sys.argv[1])

# eof


So, for example,

zcat Test100k.fastq.gz | python filter.py TCGATTT | gzip > filtered.fastq.gz


will select out all reads containing the string TCGATTT, and

zcat Test100k.fastq.gz | python filter.py ^TCGAT | gzip > filtered.fastq.gz


zcat Test100k.fastq.gz | python filter.py ^TCGAT | wc -l


and dividing the result by 4 will give the number of reads starting with TCGAT. People familiar with Unix can think of this python script as a version of grep for FASTQ files.

Of course the above python code could be extended to do more sophisticated analysis of the read data, but in that case it will make sense to make use of the Biopython libraries. An alternative is to use R and Bioconductor for the analysis, making use of their huge array of statistical analysis and visualisation tools. I will give a quick introduction to the use of the Bioconductor “ShortRead” package for processing FASTQ data in the next post.

## Motivation and background

In this post I will try and explain an important idea behind some recent developments in MCMC theory. First, let me give some motivation. Suppose you are trying to implement a Metropolis-Hastings algorithm, as discussed in a previous post (required reading!), but a key likelihood term needed for the acceptance ratio is difficult to evaluate (most likely it is a marginal likelihood of some sort). If it is possible to obtain a Monte-Carlo estimate for that likelihood term (which it sometimes is, for example, using importance sampling), one could obviously just plug it in to the acceptance ratio and hope for the best. What is not at all obvious is that if your Monte-Carlo estimate satisfies some fairly weak property then the equilibrium distribution of the Markov chain will remain exactly as it would be if the exact likelihood had been available. Furthermore, it is exact even if the Monte-Carlo estimate is very noisy and imprecise (though the mixing of the chain will be poorer in this case). This is the “exact approximate” pseudo-marginal MCMC approach. To give credit where it is due, the idea was first introduced by Mark Beaumont in Beaumont (2003), where he was using an importance sampling based approximate likelihood in the context of a statistical genetics example. This was later picked up by Christophe Andrieu and Gareth Roberts, who studied the technical properties of the approach in Andrieu and Roberts (2009). The idea is turning out to be useful in several contexts, and in particular, underpins the interesting new Particle MCMC algorithms of Andrieu et al (2010), which I will discuss in a future post. I’ve heard Mark, Christophe, Gareth and probably others present this concept, but this post is most strongly inspired by a talk that Christophe gave at the IMS 2010 meeting in Gothenburg this summer.

## The pseudo-marginal Metropolis-Hastings algorithm

Let’s focus on the simplest version of the problem, where we want to sample from a target $p(x)$ using a proposal $q(x'|x)$. As explained previously, the required Metropolis-Hastings acceptance ratio will take the form

$\displaystyle A=\frac{p(x')q(x|x')}{p(x)q(x'|x)}.$

Here we are assuming that $p(x)$ is difficult to evaluate (usually because it is a marginalised version of some higher-dimensional distribution), but that a Monte-Carlo estimate of $p(x)$, which we shall denote $r(x)$, can be computed. We can obviously just substitute this estimate into the acceptance ratio to get

$\displaystyle A=\frac{r(x')q(x|x')}{r(x)q(x'|x)},$

but it is not immediately clear that in many cases this will lead to the Markov chain having an equilibrium distribution that is exactly $p(x)$. It turns out that it is sufficient that the likelihood estimate, $r(x)$ is non-negative, and unbiased, in the sense that $E(r(x))=p(x)$, where the expectation is with respect to the Monte-Carlo error for a given fixed value of $x$. In fact, as we shall see, this condition is actually a bit stronger than is really required.

Put $W=r(x)/p(x)$, representing the noise in the Monte-Carlo estimate of $p(x)$, and suppose that $W \sim p(w|x)$ (note that in an abuse of notation, the function $p(w|x)$ is unrelated to $p(x)$). The main condition we will assume is that $E(W|x)=c$, where $c>0$ is a constant independent of $x$. In the case of $c=1$, we have the (typical) special case of $E(r(x))=p(x)$. For now, we will also assume that $W\geq 0$, but we will consider relaxing this constraint later.

The key to understanding the pseudo-marginal approach is to realise that at each iteration of the MCMC algorithm a new value of $W$ is being proposed in addition to a new value for $x$. If we regard the proposal mechanism as a joint update of $x$ and $w$, it is clear that the proposal generates $(x',w')$ from the density $q(x'|x)p(w'|x')$, and we can re-write our “approximate” acceptance ratio as

$\displaystyle A=\frac{w'p(x')p(w'|x')q(x|x')p(w|x)}{wp(x)p(w|x)q(x'|x)p(w'|x')}.$

Inspection of this acceptance ratio reveals that the target of the chain must be (proportional to)

$p(x)wp(w|x).$

This is a joint density for $(x,w)$, but the marginal for $x$ can be obtained by integrating over the range of $W$ with respect to $w$. Using the fact that $E(W|x)=c$, this then clearly gives a density proportional to $p(x)$, and this is precisely the target that we require. Note that for this to work, we must keep the old value of $w$ from one iteration to the next – that is, we must keep and re-use our noisy $r(x)$ value to include in the acceptance ratio for our next MCMC iteration – we should not compute a new Monte-Carlo estimate for the likelihood of the old state of the chain.

## Examples

We will consider again the example from the previous post – simulation of a chain with a N(0,1) target using uniform innovations. Using R, the main MCMC loop takes the form

pmmcmc<-function(n=1000,alpha=0.5)
{
vec=vector("numeric", n)
x=0
oldlik=noisydnorm(x)
vec[1]=x
for (i in 2:n) {
innov=runif(1,-alpha,alpha)
can=x+innov
lik=noisydnorm(can)
aprob=lik/oldlik
u=runif(1)
if (u < aprob) {
x=can
oldlik=lik
}
vec[i]=x
}
vec
}


Here we are assuming that we are unable to compute dnorm exactly, but instead only a Monte-Carlo estimate called noisydnorm. We can start with the following implementation

noisydnorm<-function(z)
{
dnorm(z)*rexp(1,1)
}


Each time this function is called, it will return a non-negative random quantity whose expectation is dnorm(z). We can now run this code as follows.

plot.mcmc<-function(mcmc.out)
{
op=par(mfrow=c(2,2))
plot(ts(mcmc.out),col=2)
hist(mcmc.out,30,col=3)
qqnorm(mcmc.out,col=4)
abline(0,1,col=2)
acf(mcmc.out,col=2,lag.max=100)
par(op)
}

metrop.out<-pmmcmc(10000,1)
plot.mcmc(metrop.out)


So we see that we have exactly the right N(0,1) target, despite using the (very) noisy noisydnorm function in place of the dnorm function. This noisy likelihood function is clearly unbiased. However, as already discussed, a constant bias in the noise is also acceptable, as the following function shows.

noisydnorm<-function(z)
{
dnorm(z)*rexp(1,2)
}


Re-running the code with this function also leads to the correct equilibrium distribution for the chain. However, it really does matter that the bias is independent of the state of the chain, as the following function shows.

noisydnorm<-function(z)
{
dnorm(z)*rexp(1,0.1+10*z*z)
}


Running with this function leads to the wrong equilibrium. However, it is OK for the distribution of the noise to depend on the state, as long as its expectation does not. The following function illustrates this.

noisydnorm<-function(z)
{
dnorm(z)*rgamma(1,0.1+10*z*z,0.1+10*z*z)
}


This works just fine. So far we have been assuming that our noisy likelihood estimates are non-negative. This is clearly desirable, as otherwise we could wind up with negative Metropolis-Hasting ratios. However, as long as we are careful about exactly what we mean, even this non-negativity condition may be relaxed. The following function illustrates the point.

noisydnorm<-function(z)
{
dnorm(z)*rnorm(1,1)
}


Despite the fact that this function will often produce negative values, the equilibrium distribution of the chain still seems to be correct! An even more spectacular example follows.

noisydnorm<-function(z)
{
dnorm(z)*rnorm(1,0)
}


Astonishingly, this one works too, despite having an expected value of zero! However, the following doesn’t work, even though it too has a constant expectation of zero.

noisydnorm<-function(z)
{
dnorm(z)*rnorm(1,0,0.1+10*z*z)
}


I don’t have time to explain exactly what is going on here, so the precise details are left as an exercise for the reader. Suffice to say that the key requirement is that it is the distribution of W conditioned to be non-negative which must have the constant bias property – something clearly violated by the final noisydnorm example.

Before finishing this post, it is worth re-emphasising the issue of re-using old Monte-Carlo estimates. The following function will not work (exactly), though in the case of good Monte-Carlo estimates will often work tolerably well.

approxmcmc<-function(n=1000,alpha=0.5)
{
vec=vector("numeric", n)
x=0
vec[1]=x
for (i in 2:n) {
innov=runif(1,-alpha,alpha)
can=x+innov
lik=noisydnorm(can)
oldlik=noisydnorm(x)
aprob=lik/oldlik
u=runif(1)
if (u < aprob) {
x=can
}
vec[i]=x
}
vec
}


In a subsequent post I will show how these ideas can be put into practice in the context of a Bayesian inference example.

## Background

Over the next few months I’m intending to have a series of posts on recent developments in MCMC and algorithms of Metropolis-Hastings type. These posts will assume a basic familiarity with stochastic simulation and R. For reference, I have some old notes on stochastic simulation and MCMC from a course I used to teach. There were a set of computer practicals associated with the course which focused on understanding the algorithms, mainly using R. Indeed, this post is going to make use of the example from this old practical. If anything about the rest of this post doesn’t make sense, you might need to review some of this background material. My favourite introductory text book on this subject is Gamerman’s Markov Chain Monte Carlo. In this post I’m going to briefly recap the key idea behind the Metropolis-Hastings algorithm, and illustrate how these kinds of algorithms are typically implemented. In order to keep things as simple as possible, I’m going to use R for implementation, but as discussed in a previous post, that probably won’t be a good idea for non-trivial applications.

## Metropolis-Hastings

The motivation is a need to understand a complex probability distribution, p(x). In applications to Bayesian statistics, this distribution is usually a posterior from a Bayesian analysis. A good way to understand an intractable distribution is to simulate realisations from it and study those. However, this often isn’t easy, either. The idea behind MCMC is to simulate a Markov chain whose equilibrium distribution is p(x). Metropolis-Hastings (M-H) provides a way to correct a fairly arbitrary transition kernel q(x’|x) (which typically won’t have p(x) as its equilibrium) to give a chain which does have the desired target. In M-H, the transition kernel is used to generate a proposed new value, x’ for the chain, which is then accepted as the new state at random with a particular probability a(x’|x)=min(1,A), where A = p(x’)q(x|x’)/[p(x)q(x’|x)].

If the value is accepted, then the chain moves to the proposed new state, x’. If the value is not accepted, the chain still advances to the next step, but the new state is given by the previous state of the chain, x.

It is clear that this algorithm is also a Markov chain, and that for x’ != x, the transition kernel of this chain is q(x’|x)a(x’|x). A couple of lines of algebra will then confirm that this “corrected” Markov chain satisfies detailed balance for the target p(x). Therefore, under some regularity conditions, this chain will have p(x) as its equilibrium, and this gives a convenient way of simulating values from this target.

### Special case: the Metropolis method

It is clear from the form of the acceptance probability that a considerable simplification arises in the case where the proposal kernel is symmetric, having q(x’|x)=q(x|x’). This is useful, as a convenient updating strategy is often to update the current value of x by adding an innovation sampled from a distribution that is symmetric about zero. This clearly gives a symmetric kernel, which then drops out of the acceptance ratio, to give A=p(x’)/p(x).

## Example

The example we will use for illustration is a Metropolis method for the standard normal distribution using innovations from a U(-eps,eps) distribution. Below is a simple R implementation of the algorithm

metrop1=function(n=1000,eps=0.5)
{
vec=vector("numeric", n)
x=0
vec[1]=x
for (i in 2:n) {
innov=runif(1,-eps,eps)
can=x+innov
aprob=min(1,dnorm(can)/dnorm(x))
u=runif(1)
if (u < aprob)
x=can
vec[i]=x
}
vec
}


First a candidate value (can) is constructed by perturbing the current state of the chain with the uniform innovation. Then the acceptance probability is computed, and the chain is updated appropriately depending on whether the proposed new value is accepted. Note the standard trick of picking an event with probability p by checking to see if u<p, where u is a draw from a U(0,1).

We can execute this code and plot the results with the following peice of code:

plot.mcmc<-function(mcmc.out)
{
op=par(mfrow=c(2,2))
plot(ts(mcmc.out),col=2)
hist(mcmc.out,30,col=3)
qqnorm(mcmc.out,col=4)
abline(0,1,col=2)
acf(mcmc.out,col=2,lag.max=100)
par(op)
}

metrop.out<-metrop1(10000,1)
plot.mcmc(metrop.out)


From these plots we see that the algorithm seems to be working as expected. Before finishing this post, it is worth explaining how to improve this code to get something which looks a bit more like the kind of code that people would actually write in practice. The next version of the code (below) makes use of the fact that min(1,A) is redundant if you are just going to compare to a uniform (since values of the ratio larger than 1 will be accepted with probability 1, as they should), and also that it is unnecessary to recalculate the likelihood of the current value of the chain at every iteration – better to store and re-use the value. That obviously doesn’t make much difference here, but for real problems likelihood computations are often the main computational bottleneck.

metrop2=function(n=1000,eps=0.5)
{
vec=vector("numeric", n)
x=0
oldlik=dnorm(x)
vec[1]=x
for (i in 2:n) {
innov=runif(1,-eps,eps)
can=x+innov
lik=dnorm(can)
a=lik/oldlik
u=runif(1)
if (u < a) {
x=can
oldlik=lik
}
vec[i]=x
}
vec
}


However, this code still has a very serious flaw. It computes likelihoods! For this problem it isn’t a major issue, but in general likelihoods are the product of a very large number of small numbers, and numerical underflow is a constant hazard. For this reason (and others), no-one ever computes likelihoods in code if they can possibly avoid it, but instead log-likelihoods (which are the sum of a large number of reasonably-sized numbers, and therefore numerically very stable). We can use these log-likelihoods to calculate a log-acceptance ratio, which can then be compared to the log of a uniform for the accept-reject decision. We end up with the code below, which now doesn’t look too different to the kind of code one might actually write…

metrop3=function(n=1000,eps=0.5)
{
vec=vector("numeric", n)
x=0
oldll=dnorm(x,log=TRUE)
vec[1]=x
for (i in 2:n) {
can=x+runif(1,-eps,eps)
loglik=dnorm(can,log=TRUE)
loga=loglik-oldll
if (log(runif(1)) < loga) {
x=can
oldll=loglik
}
vec[i]=x
}
vec
}


It is easy to verify that all 3 of these implementations behave identically. Incidentally, most statisticians would consider all 3 of the above codes to represent exactly the same algorithm. They are just slightly different implementations of the same algorithm. I mention this because I’ve noticed that computing scientists often have a slightly lower-level view of what constitutes an algorithm, and would sometimes consider the above 3 algorithms to be different.

## The last Valencia meeting on Bayesian Statistics and the future of Bayesian computation

I’ve spent the last week in Benidorm, Spain, for the 9th and final Valencia meeting on Bayesian Statistics. Nine of us travelled from Newcastle University, making us one of the best represented groups at the meeting. This was my fifth Valencia meeting – the first I attended was Valencia 5 which took place in Alicante back in 1994 when I was a PhD student at Durham University working with Michael Goldstein. Our contributed paper to the proceedings of that meeting was my first publication, and I’ve been to every meeting since. Michael is one of the few people to have attended all 9 Valencia meetings (in addition to the somewhat mythical “Valencia 0”). It was therefore very fitting that Michael opened the Valencia 9 invited programme (with a great talk on Bayesian analysis of complex computer models). Like many others, I’ll be a little sad to think that the Valencia meetings have come to an end.

The meeting itself was scientifically very interesting. I wish that I had the energy to give a summary of the scientific programme, but unfortunately I don’t! However, anyone who does want to get something of the flavour of the programme should take a look at the “Valencia Snapshots” on Christian Robert’s blog. My own talk gets a mention in Snapshot 4. I presented a paper entitled Parameter inference for stochastic kinetic models of bacterial gene regulation: a Bayesian approach to systems biology. Unfortunately my discussant, Sam Kou, was unable to travel to the meeting due to passport problems, but very kindly produced a pre-recorded video discussion to be played to me and the audience at the end of my talk. After a brief problem with the audio (a recurring theme of the meeting!), this actually worked quite well, though it felt slightly strange replying to his discussion knowing that he could not hear what I was saying!

There were several talks discussing Bayesian approaches to challenging problems in bioinformatics and molecular biology, and these were especially interesting to me. I was also particularly interested in the talks on Bayesian computation. Several talks mentioned the possibility of speeding up Bayesian computation using GPUs, and Chris Holmes gave a nice overview of the current technology and its potential, together with a link to a website providing further information. Although there is no doubt that GPU technology can provide fairly impressive speedups for certain Bayesian computations, I’m actually a little bit of a GPU-sceptic, so let me explain why. There are many reasons. First, I’m always a bit suspicious of a technology that is fairly closed and proprietary being pushed by a large powerful company – I prefer my hardware to be open, and my software to be free and open. Next, there isn’t really anything that you can do on a GPU that you can’t do on a decent multicore server or cluster using standard well established technologies such as MPI and OpenMP. Also, GPUs are relatively difficult to program, and time taken for software development is a very expensive cost which in many cases will dwarf differences in hardware costs. Also, in the days when 64 bit chips and many GB of RAM are sitting on everyone’s desktops, do I really want to go back to 1 GB of RAM, single precision arithmetic and no math libraries?! That hardly seems like the future I’m envisaging! Next, there are other related products like the Intel Knights Corner on the horizon that are likely to offer similar performance gains while being much simpler to develop for. Next, it seems likely to me that machines in the future are going to feature massively multicore CPUs, rendering GPU computing obsolete. Finally, although GPUs offer one possible approach to tackling the problem of speedup, they do little for the far more general and important problem of scalability of Bayesian computing and software. From that perspective, I really enjoyed the talk by Andrew McCallum on Probabilistic programming with imperatively-defined factor graphs. Andrew was talking about a flexible machine learning library he is developing called factorie for the interesting new language Scala. Whilst that particular library is not exactly what I need, fundamentally, his talk was all about building frameworks for Bayesian computation which really scale. I think this is the real big issue facing Bayesian computation, and modern languages and software platforms, including so-called cloud computing approaches, and technologies like Hadoop and MapReduce probably represent some of the directions we should be looking in. There is an interesting project called CRdata which is a first step in that direction. Clearly these technologies are somewhat orthogonal to the GPU/speedup issue, but I don’t think they are completely unrelated.