Metropolis-Hastings MCMC algorithms

Background

Over the next few months I’m intending to write a series of posts on recent developments in MCMC and algorithms of Metropolis-Hastings type. These posts will assume a basic familiarity with stochastic simulation and R. For reference, I have some old notes on stochastic simulation and MCMC from a course I used to teach. There was a set of computer practicals associated with the course which focused on understanding the algorithms, mainly using R. Indeed, this post is going to make use of the example from this old practical. If anything in the rest of this post doesn’t make sense, you might need to review some of this background material. My favourite introductory text book on this subject is Gamerman’s Markov Chain Monte Carlo. In this post I’m going to briefly recap the key idea behind the Metropolis-Hastings algorithm, and illustrate how these kinds of algorithms are typically implemented. In order to keep things as simple as possible, I’m going to use R for the implementation, but as discussed in a previous post, that probably won’t be a good idea for non-trivial applications.

Metropolis-Hastings

The motivation is a need to understand a complex probability distribution, p(x). In applications to Bayesian statistics, this distribution is usually a posterior from a Bayesian analysis. A good way to understand an intractable distribution is to simulate realisations from it and study those. However, this often isn’t easy, either. The idea behind MCMC is to simulate a Markov chain whose equilibrium distribution is p(x). Metropolis-Hastings (M-H) provides a way to correct a fairly arbitrary transition kernel q(x’|x) (which typically won’t have p(x) as its equilibrium) to give a chain which does have the desired target. In M-H, the transition kernel is used to generate a proposed new value, x’ for the chain, which is then accepted as the new state at random with a particular probability a(x’|x)=min(1,A), where A = p(x’)q(x|x’)/[p(x)q(x’|x)].

If the value is accepted, then the chain moves to the proposed new state, x’. If the value is not accepted, the chain still advances to the next step, but the new state is given by the previous state of the chain, x.

It is clear that this algorithm also defines a Markov chain, and that for x’ != x, the transition kernel of this chain is q(x’|x)a(x’|x). A couple of lines of algebra will then confirm that this “corrected” Markov chain satisfies detailed balance for the target p(x): for x’ != x, p(x)q(x’|x)a(x’|x) = min{p(x)q(x’|x), p(x’)q(x|x’)}, which is symmetric in x and x’. Therefore, under some regularity conditions, this chain will have p(x) as its equilibrium, and this gives a convenient way of simulating values from this target.
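
As a minimal sketch (purely for illustration; the function and argument names here are arbitrary), a single generic M-H update might be coded in R as follows, where p evaluates the (possibly unnormalised) target density, rq simulates a proposal from q(.|x), and dq(xnew,xold) evaluates the proposal density q(xnew|xold).

mhUpdate=function(x,p,rq,dq)
{
        can=rq(x)                               # propose x' ~ q(.|x)
        A=p(can)*dq(x,can)/(p(x)*dq(can,x))     # Hastings ratio
        if (runif(1) < A) can else x            # accept with probability min(1,A)
}

The random walk Metropolis sampler of the next section (with eps=0.5) then corresponds to repeatedly calling mhUpdate(x,dnorm,function(x) x+runif(1,-0.5,0.5),function(xn,xo) dunif(xn-xo,-0.5,0.5)).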

Special case: the Metropolis method

It is clear from the form of the acceptance probability that a considerable simplification arises in the case where the proposal kernel is symmetric, having q(x’|x)=q(x|x’). This is useful, as a convenient updating strategy is often to update the current value of x by adding an innovation sampled from a distribution that is symmetric about zero. This clearly gives a symmetric kernel, which then drops out of the acceptance ratio, to give A=p(x’)/p(x).

Example

The example we will use for illustration is a Metropolis method for the standard normal distribution using innovations from a U(-eps,eps) distribution. Below is a simple R implementation of the algorithm:

metrop1=function(n=1000,eps=0.5) 
{
        vec=vector("numeric", n)
        x=0
        vec[1]=x
        for (i in 2:n) {
                innov=runif(1,-eps,eps)
                can=x+innov
                aprob=min(1,dnorm(can)/dnorm(x))
                u=runif(1)
                if (u < aprob) 
                        x=can
                vec[i]=x
        }
        vec
}

First a candidate value (can) is constructed by perturbing the current state of the chain with the uniform innovation. Then the acceptance probability is computed, and the chain is updated appropriately depending on whether the proposed new value is accepted. Note the standard trick of picking an event with probability p by checking to see if u<p, where u is a draw from a U(0,1).

We can run the sampler and plot the results with the following piece of code:

plot.mcmc<-function(mcmc.out)
{
	op=par(mfrow=c(2,2))
	plot(ts(mcmc.out),col=2)
	hist(mcmc.out,30,col=3)
	qqnorm(mcmc.out,col=4)
	abline(0,1,col=2)
	acf(mcmc.out,col=2,lag.max=100)
	par(op)
}

metrop.out<-metrop1(10000,1)
plot.mcmc(metrop.out)
Plots showing convergence of the Markov chain to a N(0,1) equilibrium distribution

From these plots we see that the algorithm seems to be working as expected. Before finishing this post, it is worth explaining how to improve this code to get something which looks a bit more like the kind of code that people would actually write in practice. The next version of the code (below) makes use of the fact that min(1,A) is redundant if you are just going to compare to a uniform (since values of the ratio larger than 1 will be accepted with probability 1, as they should), and also that it is unnecessary to recalculate the likelihood of the current value of the chain at every iteration – better to store and re-use the value. That obviously doesn’t make much difference here, but for real problems likelihood computations are often the main computational bottleneck.

metrop2=function(n=1000,eps=0.5) 
{
        vec=vector("numeric", n)
        x=0
        oldlik=dnorm(x)
        vec[1]=x
        for (i in 2:n) {
                innov=runif(1,-eps,eps)
                can=x+innov
                lik=dnorm(can)
                a=lik/oldlik
                u=runif(1)
                if (u < a) { 
                        x=can
                        oldlik=lik
                        }
                vec[i]=x
        }
        vec
}

However, this code still has a very serious flaw. It computes likelihoods! For this problem it isn’t a major issue, but in general likelihoods are the product of a very large number of small numbers, and numerical underflow is a constant hazard. For this reason (and others), no-one ever computes likelihoods in code if they can possibly avoid it, but instead works with log-likelihoods (which are the sum of a large number of reasonably sized numbers, and are therefore numerically very stable). We can use these log-likelihoods to calculate a log-acceptance ratio, which can then be compared to the log of a uniform for the accept-reject decision. We end up with the code below, which now doesn’t look too different to the kind of code one might actually write…

metrop3=function(n=1000,eps=0.5) 
{
        vec=vector("numeric", n)
        x=0
        oldll=dnorm(x,log=TRUE)
        vec[1]=x
        for (i in 2:n) {
                can=x+runif(1,-eps,eps)
                loglik=dnorm(can,log=TRUE)
                loga=loglik-oldll
                if (log(runif(1)) < loga) { 
                        x=can
                        oldll=loglik
                        }
                vec[i]=x
        }
        vec
}

It is easy to verify that all 3 of these implementations behave identically (a quick check is sketched below). Incidentally, most statisticians would consider all 3 of the above codes to represent exactly the same algorithm. They are just slightly different implementations of the same algorithm. I mention this because I’ve noticed that computing scientists often have a slightly lower-level view of what constitutes an algorithm, and would sometimes consider the 3 codes above to be different algorithms.
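
Here is one way to check this (a minimal sketch, not part of the original code): all three functions make exactly two calls to the random number generator per iteration, in the same order, so running them from a common seed should produce identical output.

set.seed(123)
out1=metrop1(10000,1)
set.seed(123)
out2=metrop2(10000,1)
set.seed(123)
out3=metrop3(10000,1)
identical(out1,out2)   # expect TRUE
identical(out2,out3)   # expect TRUE (the accept test differs only by a monotone log transform)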

The last Valencia meeting on Bayesian Statistics and the future of Bayesian computation

I’ve spent the last week in Benidorm, Spain, for the 9th and final Valencia meeting on Bayesian Statistics. Nine of us travelled from Newcastle University, making us one of the best represented groups at the meeting. This was my fifth Valencia meeting – the first I attended was Valencia 5 which took place in Alicante back in 1994 when I was a PhD student at Durham University working with Michael Goldstein. Our contributed paper to the proceedings of that meeting was my first publication, and I’ve been to every meeting since. Michael is one of the few people to have attended all 9 Valencia meetings (in addition to the somewhat mythical “Valencia 0”). It was therefore very fitting that Michael opened the Valencia 9 invited programme (with a great talk on Bayesian analysis of complex computer models). Like many others, I’ll be a little sad to think that the Valencia meetings have come to an end.

The meeting itself was scientifically very interesting. I wish that I had the energy to give a summary of the scientific programme, but unfortunately I don’t! However, anyone who does want to get something of the flavour of the programme should take a look at the “Valencia Snapshots” on Christian Robert’s blog. My own talk gets a mention in Snapshot 4. I presented a paper entitled Parameter inference for stochastic kinetic models of bacterial gene regulation: a Bayesian approach to systems biology. Unfortunately my discussant, Sam Kou, was unable to travel to the meeting due to passport problems, but very kindly produced a pre-recorded video discussion to be played to me and the audience at the end of my talk. After a brief problem with the audio (a recurring theme of the meeting!), this actually worked quite well, though it felt slightly strange replying to his discussion knowing that he could not hear what I was saying!

There were several talks discussing Bayesian approaches to challenging problems in bioinformatics and molecular biology, and these were especially interesting to me. I was also particularly interested in the talks on Bayesian computation. Several talks mentioned the possibility of speeding up Bayesian computation using GPUs, and Chris Holmes gave a nice overview of the current technology and its potential, together with a link to a website providing further information.

Although there is no doubt that GPU technology can provide fairly impressive speedups for certain Bayesian computations, I’m actually a little bit of a GPU-sceptic, so let me explain why. There are many reasons. First, I’m always a bit suspicious of a technology that is fairly closed and proprietary being pushed by a large powerful company – I prefer my hardware to be open, and my software to be free and open. Second, there isn’t really anything that you can do on a GPU that you can’t do on a decent multicore server or cluster using standard, well-established technologies such as MPI and OpenMP. Third, GPUs are relatively difficult to program, and time taken for software development is a very expensive cost which in many cases will dwarf differences in hardware costs. Fourth, in the days when 64-bit chips and many GB of RAM are sitting on everyone’s desktops, do I really want to go back to 1 GB of RAM, single precision arithmetic and no math libraries?! That hardly seems like the future I’m envisaging! Fifth, there are other related products like the Intel Knights Corner on the horizon that are likely to offer similar performance gains while being much simpler to develop for. Sixth, it seems likely to me that machines in the future are going to feature massively multicore CPUs, rendering GPU computing obsolete. Finally, although GPUs offer one possible approach to tackling the problem of speedup, they do little for the far more general and important problem of scalability of Bayesian computing and software.

From that perspective, I really enjoyed the talk by Andrew McCallum on Probabilistic programming with imperatively-defined factor graphs. Andrew was talking about a flexible machine learning library he is developing called factorie for the interesting new language Scala. Whilst that particular library is not exactly what I need, fundamentally his talk was all about building frameworks for Bayesian computation which really scale. I think this is the really big issue facing Bayesian computation, and modern languages and software platforms, including so-called cloud computing approaches, and technologies like Hadoop and MapReduce, probably represent some of the directions we should be looking in. There is an interesting project called CRdata which is a first step in that direction. Clearly these technologies are somewhat orthogonal to the GPU/speedup issue, but I don’t think they are completely unrelated.

MCMC programming in R, Python, Java and C

EDIT 16/7/11 – this post has now been largely superseded by a new post!

MCMC

Markov chain Monte Carlo (MCMC) is a powerful simulation technique for exploring high-dimensional probability distributions. It is particularly useful for exploring posterior probability distributions that arise in Bayesian statistics. Although there are some generic tools (such as WinBUGS and JAGS) for doing MCMC, for non-standard problems it is often desirable to code up MCMC algorithms from scratch. It is then natural to wonder what programming languages might be good for this purpose. There are hundreds of programming languages one could in principle use for MCMC programming, but it is necessary to use a language with a good scientific library, including good random number generation routines. I have used a large number of programming languages over the years, but these days I mostly program in R, Python, Java or C. I find each of these languages interesting and useful, with different strengths and weaknesses, and I find that no one of these languages dominates any of the others in all situations.

Example

For the purposes of this post we will focus on Gibbs sampling for a simple bivariate distribution defined on the half-plane x>0.

f(x,y) = k x^2 exp{-x y^2 - y^2 + 2y - 4x}

The statistical motivation is not important for this post, but this is the kind of distribution which arises naturally in Bayesian inference for the mean and variance of a normal random sample. Gibbs sampling is a simple MCMC scheme which samples in turn from the full-conditional distributions. In this case, the full-conditional for x is Gamma(3, y^2+4), and completing the square in y shows that the full-conditional for y is normal with mean 1/(x+1) and variance 1/(2(x+1)). The resulting Gibbs sampling algorithm is very simple and works very efficiently, but for illustrative purposes, we will consider generating 20,000 iterations with a “thin” of 500 iterations.
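
As a quick sanity check on the full-conditional for y (purely illustrative, with arbitrary values), the ratio of the joint density to the claimed conditional should not depend on y for fixed x:

f=function(x,y) x*x*exp(-x*y*y-y*y+2*y-4*x)
x=1.7
y=seq(-2,4,0.5)
ratio=f(x,y)/dnorm(y,1/(x+1),1/sqrt(2*(x+1)))
ratio/ratio[1]   # approximately constant at 1 if the conditional is correct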

Implementations

R

Although R has many flaws, it is well suited to programming with data, and has a huge array of statistical libraries associated with it. Like many statisticians, I probably use R more than any other language in my day-to-day work. It is the language I always use for the analysis of MCMC output. It is therefore natural to think about using R for coding up MCMC algorithms. Below is a simple function to implement a Gibbs sampler for this problem.

gibbs=function(N,thin)
{
	mat=matrix(0,ncol=3,nrow=N)
	mat[,1]=1:N
	x=0
	y=0
	for (i in 1:N) {
		for (j in 1:thin) {
			x=rgamma(1,3,y*y+4)
			y=rnorm(1,1/(x+1),1/sqrt(2*(x+1)))
		}
		mat[i,2:3]=c(x,y)
	}
	mat=data.frame(mat)
	names(mat)=c("Iter","x","y")
	mat
}

This function stores the output in a data frame. It can be called and the data written out to a file as follows:

writegibbs=function(N=20000,thin=500)
{
	mat=gibbs(N,thin)
	write.table(mat,"data.tab",row.names=FALSE)
}
tim=system.time(writegibbs())
print(tim)

This also times how long the function takes to run. We will return to timings later. As already mentioned, I always analyse MCMC output in R. Below is some R code to compare the contours of the true distribution with the contours of the empirical distribution obtained by a 2d kernel smoother.

fun=function(x,y)
{
	x*x*exp(-x*y*y-y*y+2*y-4*x)
}

mat=read.table("data.tab",header=TRUE)
op=par(mfrow=c(2,1))
x=seq(0,4,0.1)
y=seq(-2,4,0.1)
z=outer(x,y,fun)
contour(x,y,z,main="Contours of actual distribution")
require(KernSmooth)
fit=bkde2D(as.matrix(mat[,2:3]),c(0.1,0.1))
contour(fit$x1,fit$x2,fit$fhat,main="Contours of empirical distribution")
par(op)

Python

Python is a great general purpose programming language. In an ideal world, I would probably use Python for all of my programming needs. These days it is also a reasonable platform for scientific computing, due to the scipy project. It is certainly possible to develop MCMC algorithms in Python. A simple Python script for this problem can be written as follows.

import random,math

def gibbs(N=20000,thin=500):
    x=0
    y=0
    print "Iter  x  y"
    for i in range(N):
        for j in range(thin):
            x=random.gammavariate(3,1.0/(y*y+4))
            y=random.gauss(1.0/(x+1),1.0/math.sqrt(2*(x+1)))
        print i,x,y
    
gibbs()

This can be run with a command like python gibbs.py > data.tab. Note that Python uses a different parametrisation of the gamma distribution to R: random.gammavariate takes a scale parameter, whereas R’s rgamma is given a rate here.
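
As a quick illustration of the correspondence (purely illustrative, and not part of the scripts above), the two parametrisations agree when the scale is the reciprocal of the rate, so both of the following describe the same gamma distribution with mean 3/5:

mean(rgamma(100000,3,5))          # R: shape 3, rate 5, so mean 3/5
# random.gammavariate(3,1.0/5)    # Python: shape 3, scale 1/5, same distribution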

C

The language I have used most for the development of MCMC algorithms is C (usually strict ANSI C). C is fast and efficient, and gives the programmer lots of control over how computations are carried out. The GSL is an excellent scientific library for C, which includes good random number generation facilities. A program for this problem is given below.

#include <stdio.h>
#include <math.h>
#include <stdlib.h>
#include <gsl/gsl_rng.h>
#include <gsl/gsl_randist.h>

int main(void)
{
  int N=20000;
  int thin=500;
  int i,j;
  gsl_rng *r = gsl_rng_alloc(gsl_rng_mt19937);
  double x=0;
  double y=0;
  printf("Iter x y\n");
  for (i=0;i<N;i++) {
    for (j=0;j<thin;j++) {
      x=gsl_ran_gamma(r,3.0,1.0/(y*y+4));
      y=1.0/(x+1)+gsl_ran_gaussian(r,1.0/sqrt(2*(x+1)));
    }
    printf("%d %f %f\n",i,x,y);
  }
  gsl_rng_free(r);
  return(0);
}

This can be compiled and run (on Linux) with commands like:

gcc -O2 gibbs.c -lgsl -lgslcblas -o gibbs
time ./gibbs > data.tab 

Java

Java is slightly nicer to code in than C, and has certain advantages over C in web application and cloud computing environments. It also has a few reasonable scientific libraries. I mainly use Parallel COLT, which (despite the name) is a general purpose scientific library, and not just for parallel applications. Using this library, code for this problem is given below.

import java.util.*;

class Gibbs
{

    public static void main(String[] arg)
    {
	int N=20000;
	int thin=500;
	cern.jet.random.tdouble.engine.DoubleRandomEngine rngEngine=
	    new cern.jet.random.tdouble.engine.DoubleMersenneTwister(new Date());
	cern.jet.random.tdouble.Normal rngN=
	    new cern.jet.random.tdouble.Normal(0.0,1.0,rngEngine);
	cern.jet.random.tdouble.Gamma rngG=
	    new cern.jet.random.tdouble.Gamma(1.0,1.0,rngEngine);
	double x=0;
	double y=0;
	System.out.println("Iter x y");
	for (int i=0;i<N;i++) {
	    for (int j=0;j<thin;j++) {
		x=rngG.nextDouble(3.0,y*y+4);
		y=rngN.nextDouble(1.0/(x+1),1.0/Math.sqrt(2*(x+1)));
	    }
	    System.out.println(i+" "+x+" "+y);
	}
    }

}

Assuming that Parallel COLT is in your classpath, this can be compiled and run with commands like:

javac Gibbs.java
time java Gibbs > data.tab

Timings

Complex MCMC algorithms can be really slow to run (weeks or months), so execution speed really does matter. So, how do these languages compare for this particular simple problem? R is the slowest, so let’s think about the speed relative to R (on my laptop). The Python code ran 2.4 times faster than R. To me, that is not really worth worrying too much about. Differences in coding style and speed-up tricks can make much bigger differences than this. So I wouldn’t choose between these two on the basis of speed differential alone. Python is likely to be nicer to develop in, but the convenience of doing the computation in R, the standard platform for statistical analysis, could nevertheless tip things in favour of R. C was the fastest, and this is the reason that most of my MCMC code development has been traditionally done in C. It was around 60 times faster than R. This difference is worth worrying about, and can make the difference between an MCMC code being practical to run or not. So how about Java? Many people assume that Java must be much slower than C, as it is byte-code compiled, and runs in a virtual machine. In this case it is around 40 times faster than R. This is within a factor of 2 of C. This is quite typical in my experience, and I have seen codes which run faster in Java than in C. Again, I wouldn’t choose between C and Java purely on the basis of speed, but rather on speed of development and ease of deployment. Java can have advantages over C on both counts, which is why I’ve been experimenting more with Java for MCMC codes recently.

Summary

R and Python are slow, on account of their dynamic type systems, but are quick and easy to develop in. Java and C are much faster. The speed advantages of Java and C can be important in the context of complex MCMC algorithms. One possibility that is often put forward is to prototype in a dynamic language like R or Python and then port (the slow parts) to a statically typed language later. This could be time efficient, as the debugging and re-factoring can take place in the dynamic language where it is easier, then just re-coded fairly directly into the statically typed language once the code is working well. Actually, I haven’t found this to be time efficient for the development of MCMC codes. That could be just me, but it could be a feature of MCMC algorithms in that the dynamic code runs too slowly to be amenable to quick and simple debugging.

(Yet another) introduction to R and Bioconductor

Introduction

Data management, analysis and visualisation are becoming increasingly important in bioscience research, but also increasingly difficult, as bioscience data increases in volume and complexity. The R programming language is a general and flexible computing environment widely used by statisticians for working with data. The Bioconductor project provides a powerful and flexible set of tools for working with bioscience data that is built on top of the R language. Bioconductor provides packages for working with sequence data, microarray data, flow cytometry data, next generation sequencing data, etc. Together, R and Bioconductor provide an excellent environment for bioscientists to analyse, summarise and visualise quantitative data. It is also ideal for carrying out sophisticated statistical analyses of data, and for producing publication quality figures, plots and graphs.

Installing R and Bioconductor

R can be downloaded from the Comprehensive R Archive Network, generally known as CRAN. It is available for Windows, Macs, and most flavours of Unix/Linux. On most Linux systems, R is available as a standard part of the distribution, and so it is usually simplest to install directly using the appropriate package manager for the particular distribution. eg. on Ubuntu, typing sudo apt-get install r-base r-recommended should do it. Once installed, additional packages can be installed by running R and typing the appropriate command at the R prompt. eg. typing:

  
install.packages("flexclust")  

will install a nice clustering package, and

  
install.packages(c("XML","twitteR"))  

will install packages for working with XML data, and for interfacing with twitter. This method can be used for installing any package available from CRAN. However, Bioconductor has its own repository that is separate from CRAN, and so Bioconductor and Bioconductor packages must be installed using a different method. Even once installed, packages need to be loaded in order to be used. So a command such as library(flexclust) is needed to load the flexclust library ready for use.

To install Bioconductor, enter the following commands at the R prompt:

  
source("http://bioconductor.org/biocLite.R")  
biocLite()  

This will install the Bioconductor core and some recommended packages. As Bioconductor is quite large, this will most likely take some time. To install additional packages, use the following:

  
source("http://bioconductor.org/biocLite.R")  
biocLite("flowViz")  

The above will install packages for working with and visualising flow cytometry data. Again, multiple packages can be installed with a command like biocLite(c("Heatplus","biomaRt")). Most Bioconductor packages have an associated vignette which gives a brief overview of the package and a tutorial on how it should be used. The openVignette function can be used to see these. eg.:

  
library(flowViz)  
openVignette()  

Then select the vignette you would like to view. Both CRAN and Bioconductor packages manage dependencies sensibly, so if you install a package which depends on others, those dependencies will automatically install, too.

Basic R concepts

Many R functions work on lists of numbers, stored in an object called a vector. eg. the command rep(1,10) will return a vector consisting of the number 1 repeated 10 times. To get help on a function, use the ? operator. So, to get help on the rep function, type ?rep. Type the following commands one at a time at the R prompt to get a feeling for how to work with vectors:

  
rep(1,10)  
?rep  
a=rep(1,10)  
a  
b=1:10  
b  
c(a,b)  
?c  
a+b  
a+2*b  
a/b  
c(1,2,3)  
d=c(3,5,3,6,8,5,4,6,7)  
d  
d[4]  
d[5]  
d[2:4]  
d<7  
d[d<7]  
sort(d)  
sort(sample(1:49,6))  

Note that as c is the combine function in R, it is best not to use it as a variable name.

Basic plotting

Group measurements

First pretend we have some measurements on something (perhaps cell volume), so let’s just simulate some (R is also very good at stochastic simulation). We’ll simulate 30 control observations and 30 treatment observations. The treatment observations will be a bit larger, but also more variable.

  
control=rnorm(30,5,2)^2  
treatment=rnorm(30,6.5,3)^2  
mydataframe=data.frame(control,treatment)  
mydataframe  
mydataframe$control
mydataframe[,2]

We can get some basic stats about our data with

  
summary(mydataframe)  

and a comparison of the distributions with

  
boxplot(mydataframe)  

It can also be useful to stack the data:

  
mydata=stack(mydataframe)  
mydata  
boxplot(values ~ ind,data=mydata)  

as it is often more convenient to use the stacked data for statistical analysis:

  
fit=lm(values ~ ind,data=mydata)  
anova(fit)  

Plots can be customised to produce very nice graphs. For example, the above boxplots can be easily tidied up a little by passing in some extra parameters.

  
boxplot(mydataframe,notch=TRUE,names=c("Control","Treatment"),col=c(2,3),
                                       main="Boxplots for the control and treatment")  

A histogram of a vector of values can be obtained with a command like hist(treatment).

Time course measurements

Suppose now that we have some time course measurements, again on a control and a treatment group. Again, let’s just simulate:

  
times=cumsum(rexp(20,0.1))  
control=cumsum(rnorm(20,0.1,1))  
treatment=cumsum(rnorm(20,0.2,1.5))  

A reasonable plot of these can be produced with the following commands:

  
minval=min(c(control,treatment))
maxval=max(c(control,treatment))
plot(times,treatment,ylim=c(minval,maxval),pch=19,col=2,xlab="Time (mins)",
                       ylab="Measurement",main="Time course measurements")  
lines(times,treatment,col=2,lty=2,lwd=2)  
points(times,control,col=3,pch=19)  
lines(times,control,col=3,lty=2,lwd=2)  
legend(0,maxval,c("Control","Treatment"),pch=c(19,19),lty=c(2,2),lwd=c(2,2),col=c(3,2))  

Start by entering each command one at a time, and use the help function (?) to figure out exactly what each command is doing. Regression lines can be added with the following commands.

fitc=lm(control~times)
fitt=lm(treatment~times)
abline(fitc,col=3,lty=3)
abline(fitt,col=2,lty=3)

Try demo(graphics) to get a better feel for some of the things that are possible.

Data import

Obviously to use R for real, you need to be able to read in data from other sources. The simplest way to get data into R is to read it in from a text file containing data in a tabular row/column format. See help on read.table for further information. Also see read.csv, scan and source. The easiest way to get data in from a spreadsheet application like MS Excel is to save it as a tab-delimited file and read it into R with read.table. Note, however, that there are other possibilities for reading data, including some direct support for Excel files, and direct connections to databases (such as MySQL and PostgreSQL) and various internet database resources.
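
For example (a minimal sketch – the file name and columns here are entirely hypothetical), a tab-delimited file exported from a spreadsheet could be read and inspected with:

mydata=read.table("measurements.txt",header=TRUE,sep="\t")
head(mydata)      # first few rows
str(mydata)       # column names and types
summary(mydata)   # basic summaries, as earlier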

Bioconductor

Introduction

Bioconductor can be used for all manner of bioscience data management and analysis tasks. In particular, it can be used to access genome databases and work with biological sequence data, as the following session illustrates:

  
library(annotate)  
getSEQ("P39779")  
substr(getSEQ("U13634"),3225,4004)  
library(Biostrings)  
dna=substr(getSEQ("U13634"),3225,4004)  
nchar(dna)  
GENETIC_CODE  
paste(GENETIC_CODE[substring(dna,seq(1,778,3),seq(3,780,3))],collapse="")  
toString(reverseComplement(DNAString(dna)))  

Again, use the available help facilities to figure out what is going on.

Flow cytometry data

I’ll finish this post with an example of using R to analyse a large and complex data set from a flow cytometer. Flow cytometers such as the Partec CyFlow save data by default in a binary format with a file extension of .FCS. This data format is understood by the BioConductor flowCore package (on which the flowViz package depends). An example of reading and processing some data (on Bacillus subtilis cells containing a GFP reporter) follows. For this, you need to download the file 5-ac-t1-s1.FCS and place it in the current R working directory, or change the working directory to where you have put it, using setwd.

  
library(flowViz)  
yjbM=read.FCS("5-ac-t1-s1.FCS")  
summary(yjbM)  
plot(yjbM)  
plot(yjbM,c("FSC","SSC"))  
plot(transform(yjbM,`FSC`=log(`FSC`),`SSC`=log(`SSC`)),c("FSC","SSC"))  
plot(yjbM,c("FL1"),breaks=256,col=3)  
plot(transform(yjbM,`FL1`=log(`FL1`)),c("FL1"),breaks=256,col=3)  

As I work a lot with GFP labelled samples, I find it convenient to define a function to plot the information I’m most interested in:

  
gfpplot<-function(x)  
{  
  lx=transform(x,`FSC`=log(`FSC`),`SSC`=log(`SSC`),`FL1`=log(`FL1`))  
  op=par(mfrow=c(2,2))  
  plot(lx,"FSC",breaks=64,col=2,main=x@description$FILENAME)  
  plot(lx,c("FSC","SSC"))  
  plot(lx,"FL1",breaks=64,col=3,main="GFP")  
  plot(lx,c("FSC","FL1"))  
  par(op)  
}  
gfpplot(yjbM)  

And finally, here is a function that will produce such a set of plots for all .FCS files in a directory and write the output to a PostScript file. Just call it with gfpplots().

  
gfpplots<-function(glob="*.FCS")  
{  
  xset=read.flowSet(pattern=glob2rx(glob))  
  postscript()  
  for (i in 1:length(xset)) {  
    gfpplot(xset[[i]])  
  }  
  dev.off()  
}  

Summary

This has been a very brief introduction to some of the possibilities of using R and Bioconductor, but it has covered only a tiny fraction of what is possible. Fortunately there is a huge amount of documentation for R and Bioconductor available on-line, especially from the CRAN and Bioconductor web sites. However, there is lots of other stuff, some of which is a bit more user friendly, which can be easily uncovered with a bit of googling. For example, here is a nice manual from UC Riverside. Finally, as was illustrated in the section on flow cytometry data, much of the power of R comes from the ability to define your own functions, utilising the fact that R is a full-blown programming language in its own right. In a previous post I gave a quick introduction to some basic R programming concepts. For those who prefer real books, I quite like Gentleman’s R Programming for Bioinformatics, and the Bioinformatics Solutions book also contains a few useful chapters, though it is starting to become a little dated.

Systems biology, mathematical modelling and mountains

BIRS Workshop on Multi-scale Stochastic Modeling of Cell Dynamics

View from Tunnel Mountain

This week I’ve been at the Banff International Research Station (BIRS), a mathematical research institute located at the Banff Centre, on the outskirts of Banff, Alberta, Canada. I was participating in a 5-day workshop on Multi-scale Stochastic Modeling of Cell Dynamics.

There can be few more picturesque workshop venues than the Banff Centre, located as it is at the foot of Tunnel Mountain in the middle of Banff National Park. This was my second trip to BIRS at Banff. My previous visit was in the middle of summer, so it was interesting to contrast the summer and winter landscapes. That said, the weather was relatively mild, and the region has had relatively little snow this year. The workshop schedule was arranged to provide a long break each afternoon for outdoor activities. Many went skiing, but I went for a couple of hikes instead. On Monday afternoon a few of us hiked up Tunnel Mountain, and on Thursday a few more hiked from Bow Falls around Tunnel Mountain to the local hoodoos along the Bow River. Both walks offered great views of the Canadian Rockies, including nearby Mount Rundle. Wednesday afternoon’s session was rescheduled to allow people to see the Olympic torch arrive in Banff that evening, on its way to Vancouver for the 2010 Winter Olympics. A few of my photos from the trip are on flickr.

Essentially, the workshop was concerned with interesting outstanding mathematical, theoretical and computational issues in systems biology modelling, focusing in particular on stochastic and multi-scale issues. It featured many excellent participants from a diverse range of backgrounds. The quality of presentations was very high, so in combination with the outstanding location, this was one of the best workshops that I’ve attended for several years. I don’t have the time or the energy to give extensive notes for each talk, but I’ll go quickly through each talk in turn trying to give something of the flavour of the presentation, together with a link to the web page of the presenter, which will hopefully provide links to further details.

Frozen Bow River

18/1/10

Des Higham

Discrete versus continuous

Des was looking at some simple first order reaction networks, and in particular at hitting times for the Markov jump process and the associated diffusion approximation (chemical Langevin equation). He showed a simple isomerisation network where the diffusion approximation is not guaranteed to stay non-negative. Be careful with diffusion approximations was the message.

Jin Wang

Potential and flux landscape for stability

Jin Wang was looking at how methods from statistical physics can be used to better understand the global dynamics of stochastic models. He looked at a couple of examples, including a circadian clock and a cell-cycle model.

Ted Perkins

Trajectory inference

Ted was thinking about how methods from computing science can be used to construct the most likely trajectory of a stochastic kinetic model between a given start and end point. The talk was based on a recent NIPS paper.

Ruth Williams

Enzymatic degradation

Observed correlations between protein expression levels could potentially be due to indirect coupling that arises from a common enzyme being used for degradation. Ruth showed how the model could be re-written as a multiclass queueing system so that queueing theory results can be used to get at interesting properties of the stationary distribution, such as the stationary correlation.

Banff Centre restaurant

19/1/10

Sam Kou

Parameter inference for SDEs

Sam was looking at parameter estimation for univariate stochastic differential equation (SDE) models from a Bayesian perspective. He explained the problems with naive MCMC samplers for the problem, and proposed a new sampler constructed by coupling together a collection of such samplers with different time discretisations. He also showed how more accurate estimates of the parameters can be obtained by extrapolating from the distributions corresponding to different degrees of discretisation.

Matt Scott

Intrinsic noise in continuous (spatial) systems

Matt was thinking about how to model intrinsic network noise in space. He set up the model on a lattice, and then took a multi-scale limit (essentially a high concentration, linear noise approximation).

Di Liu

Stochastic kinetics with multiple time scales

Di was looking at multi-scale (hybrid) simulation algorithms for stochastic networks with multiple time scales. In particular, he was looking at nested SSA algorithms, contrasting them with slow-scale SSA, multiscale SSA and other implementations of nested SSA. He also talked about adaptation (re-classification of fast/slow), and looked at Tyson’s yeast cell cycle model as an example.

Darren Wilkinson

Bayesian inference for stochastic network models

I gave an overview of statistical methods for parameter inference for stochastic kinetic models, with emphasis on Bayesian approaches and sequential likelihood free MCMC. I showed an example application to stochastic kinetic modelling of p53/Mdm2 oscillations.

Moises Santillan

Distribution evolution for gene regulation models

Moises was looking at analytic multiscale approximations in the context of some relatively simple stochastic kinetic models, including the lac operon model. He was separating time-scales and deriving analytic approximations to the steady-state distribution.

Paul Tupper

State-dependent diffusion

Paul was looking at spatial diffusion, and in particular, thinking about proteins diffusing at different rates inside and outside the nucleus. He gave a nice simple example of diffusion in a box with different diffusion coefficients in each half, and showed that the steady state distribution depends rather subtly on how exactly the diffusion is interpreted (mathematically, on the precise formalism of stochastic calculus adopted). He showed that the apparent paradox can be understood by modelling the diffusion as a Lorentz gas.

Sayan Mukherjee

Multi-scale factor models

Sayan was using sparse multiscale factor models to understand large and complex genomic data sets. He exploited linearity and sparsity for dimension reduction. The talk included discussion of network inference and links with graphical models.

Hikers

20/1/10

Rachel Kuske

Coherence resonance

Rachel was discussing mixed mode oscillations, and in particular, noise-stabilised transients, such as those found in the FitzHugh-Nagumo model with added noise – an occasional kick from near equilibrium onto an unstable limit cycle. The classic biological example of this (mentioned only in passing in the talk) is the transition of B. subtilis between competent and non-competent states (the competent state is thought to be an excitable unstable state).

Dave Anderson

Simulation methods for population processes

Dave was concerned with understanding the asymptotic behaviour of approximate simulation algorithms (principally, tau-leaping and a mid-point version). He was using Kurtz’s random time change representation of stochastic kinetic models and the associated approximations for analysis. Classical asymptotics show the mid-point method to be no better than regular tau-leaping, but practical applications show it to be superior. By rescaling the problem appropriately, an appropriate asymptotic limit can be constructed which shows the mid-point version to be better.

Eldon Emberly

Polar localisation of proteins in bacteria

Eldon was thinking about protein localisation as a mechanism for cellular differentiation in bacteria such as Caulobacter crescentus (which has stalked and swarmer forms). PopZ localises at a pole, and this protein can be expressed in E. coli (which does not have a PopZ homologue) to better understand the mechanisms of localisation. A passive model was shown to be adequate to explain the observed behaviour.

John Fricks

Neck linker extension in kinesin molecular motors

John has been working for some time on detailed molecular models of kinesin molecular motors. Here he was focusing on the importance of the neck linker (the bit between the two heads and the tail), and developing a model which is predictive in the context of artificial extension of the neck linker. The model was a combination of diffusion and discrete chemical kinetics, and he simplified the model as a renewal reward process to obtain analytic approximations.

Hye-Won Kang

Space discretisation in reaction-diffusion models

Hye-Won is interested in stochastic models of pattern formation – especially patterning along an axis. Here reaction-limited kinetics is more relevant than diffusion-limited kinetics – lots of well-mixed compartments. By exploiting upper and lower bounds on compartment sizes, an appropriate scaling limit can be constructed which leads to a system of ODEs for the first two moments, which can be solved for the steady-state distribution. The talk was mainly 1-d, but apparently there are 3-d extensions.

Peter Pfaffelhuber

Spatial multiscale chemical reaction networks

Peter described some joint work with Lea Popovic on multiscale reaction-diffusion across multiple compartments when the diffusion is fast, but compartment-dependent. The basic method was extended to multiscale reaction networks within each compartment, but then we clearly have to make assumptions about the relative time scale of the diffusion and the fast reactions.

Mads Kaern

Stochastic simulation in evolving heterogeneous cell populations

Mads is interested in comparing stochastic models to experimental data on (heterogeneous) cell populations – especially flow cytometry data. For this one needs a framework for simulating exponentially growing cell populations, and Mads was using “constant number Monte Carlo”, which essentially randomly throws stuff out to maintain a fixed population size that is representative of the bigger population. He had some nice examples of how noise can give robustness to uncertain stress.

Downtown Banff

21/1/10

Hans Othmer

Multi-scale analysis of reacting systems

Hans is interested in deterministic and stochastic models of patterning in drosophila. Think Gillespie across multiple boxes with one box per nucleus. He was then doing multi-scale analysis of the system to obtain analytic approximations.

Greg Rempala

Statistical and algebraic methods for mass-action kinetics

Greg is interested in using algebraic methods for evidence synthesis in systems biology as an alternative to more conventional hierarchical statistical models. He was using the approach to take rate constant estimates from different models and data sets and to carry out network inference by deciding which rate constants are significantly different from zero.

David McMillen

Bacterial gene expression

Dave gave a nice talk on synthetic and systems biology, and on problems and issues associated with using, measuring and modelling fluorescent proteins. He focused particularly on issues of folding, maturation and inclusion body formation, and on the population dynamics of cells growing in batch culture. He also gave a nice example of developing a synthetic circuit to give E. coli resistance to infection by bacteriophage lambda.

Peter Swain

Quantifying and modelling stochastic biochemical networks

Peter discussed the study of intrinsic and extrinsic noise in bacteria. He showed how this could be studied experimentally via inclusion of two copies of a simple circuit with different coloured fluorescent reporters. Intrinsic noise can be modelled in the usual way, and extrinsic noise can be injected by allowing rate constants to vary according to (say) a Gaussian OU process. He argued that it will often make sense to correlate the extrinsic noise, and showed some nice examples of how feed-forward networks can attenuate noise fluctuations.

Konstantin Mischaikow

Databases for global dynamics of multi-parameter systems

Konstantin is interested in categorising the global dynamic properties of deterministic iterated maps (and ultimately extending to ODE models and potentially also stochastic models). He argued that conventional descriptions are too complex and of limited practical value, and provided more robust descriptions that can be stored in a reasonable amount of space in a database.

Hong Qian

Non-equilibrium phase transition

Hong is interested in understanding stochastic kinetic reaction network models from a statistical physics viewpoint. He showed how quasi-stationary analysis can give insight into the origins of irreversibility, and illustrated his ideas with some example phosphorylation/dephosphorylation networks. He also explained how the methods give insight into the fitness landscapes and the forces driving cellular evolution.

Lev Tsimring

Synthetic gene oscillators

Lev described work on the development of synthetic gene oscillators in vivo and the associated detailed stochastic modelling. The first oscillator developed was single-cell based, so oscillations at the population level decayed away due to gradual loss of synchrony. A new oscillator was developed using a protein which diffuses in and out of the cell, allowing synchronisation of the cells via quorum sensing. It worked, but the diffusion led to spatial effects and wave propagation. This too can be modelled nicely. The work is described in a (very!) recent Nature paper.

Mountain scenery

22/1/10

Tom Kurtz

Diffusion approximation for multiscale reaction networks

Tom is interested in the mathematical analysis of multiscale reaction networks. He gave a nice overview of diffusion approximations, and then argued that Martingale representations are a powerful method for gaining deeper insight into multiscale approximations, due to the Martingale central limit theorem. He showed how the technique could be used on the classic Michaelis-Menten system.

Katharina Best/Surovcik

Spatial scaling in quorum sensing

Katharina is interested in modelling quorum sensing in the symbiotic bacterium Sinorhizobium meliloti found in plant roots. She modelled a reaction network for each cell, and then the diffusion of certain species in and out of cells. She did a multi-scale approximation assuming that the environment in which the cells live is a much larger volume than that occupied by the bacteria.

Tomas Gedeon

Somitogenesis clock-wave initiation

Tomas is interested in complex spatio-temporal pattern formation in embryos – especially spine formation in vertebrates such as zebrafish. He has been thinking about modelling, stochasticity, model selection and parameter estimation. He built deterministic models, and used a simple, intuitive “Bayes theorem via the rejection method” type algorithm to infer parameters and select models. He concluded that a model of the Delta-Notch signalling pathway that included two binding sites and monomer-only decay was most compatible with the data.

David Cottrell

Putting diffusion into stochastic networks (analytic)

David described some joint work with Peter Swain and Paul Tupper on using the theory of branching processes (analysis of the dual process) to obtain some powerful approximations to models of RNA and protein synthesis that take spatial diffusion into account (assuming RNA degradation is much faster than protein degradation). The technique is very flexible, but relies heavily on linearity of the model (no 2nd order reactions).

Olympic torch reaches Banff

All in all, a useful and interesting week.

Hypercubes in R (getting started with programming in R)

On the train home from Manchester the other day (no wifi!), I was thinking about exploring high-dimensional constrained parameter spaces (possibly sad, but something that statisticians do!), and it occurred to me that I didn’t have very good intuition about the structure of hypercubes in dimensions d>4. Since I was sat with my netbook in front of me listening to music and too tired from 2 days of meetings to do any real work, I decided to write a bit of code to build, rotate and plot 2-D projections of d-D hypercubes for arbitrary integer d>1. In principle there are many languages I could have done this in, but I knew it would be very quick and easy using R.

R is the de-facto standard language for statistical computing, data analysis and plotting, and is rapidly becoming the standard language for statistical bioinformatics and systems biology. It is free, interactive, and has a mind-boggling array of packages implementing a huge variety of statistical methods, including most state-of-the-art techniques. Further, via the BioConductoR project, it now has an impressive range of packages specifically targeting bioinformatics applications. The language itself is relatively simple and easy-to-learn, and is well-suited to programming with data. That said, it is not the most elegant of programming languages, and (rather confusingly) it has two somewhat incompatible object models, but that is perhaps a topic for another day. There are many available introductions to R, so I’m not going to repeat that here, though I do have an introductory tutorial on an old web page that is still worth a look.

So I wrote the code (it’s only around 50 lines, and took around 30 minutes to write and debug), and running it produces plots like these:

Hypercubes in dimensions 2, 3, 4, 5, 6 and 7
Hypercubes

I’ve appended the code below, as it illustrates a lot of basic R programming concepts. Essentially, the idea is to use vectors and matrices as much as possible, but not to be really obsessed about avoiding for loops if performance isn’t critical. The code isn’t commented, but it’s highly structured and easy to understand. The function makevertices creates a matrix whose columns are the vertex locations of a hypercube. The function makeedges creates a matrix whose columns give pairs of vertex indices representing the edges of the hypercube (pairs of vertices a distance of 1 apart). The function rotate picks a pair of coordinate axes at random and rotates the hypercube in that plane. The rest are obvious, and the file finishes with some example ways to call the code. As well as illustrating the basic programming concepts such as functions, vectors, matrices and matrix operations and looping, the code also illustrates several other useful concepts such as integer division and modular arithmetic, conditional execution, optional arguments, dynamically growing a matrix, randomly sampling without replacement, plotting lines, plot options, anonymous function arguments, etc.

# hypercube.R
# code to rotate arbitrary d-dimensional hypercubes

# function definitions
makevertices=function(d)
{
	numv=2^d
	vertices=matrix(0,nrow=d,ncol=numv)
	for (j in 1:numv) {
		count=j-1
		for (i in 1:d) {
			vertices[i,j]=count%%2
			count=count%/%2
		}	
	}
	vertices
}

makeedges=function(vertices)
{
	d=dim(vertices)[1]
	numv=dim(vertices)[2]
	edges=NULL
	for (j in 1:numv) {
		for (i in j:numv) {
			if (sum(abs((vertices[,i]-vertices[,j])))==1) {
				edges=cbind(edges,c(i,j))
			}
		}
	}
	edges
}

rotate=function(vertices,angle=0.2)
{
	d=dim(vertices)[1]
	axes=sort(sample(1:d,2))
	rotmat=diag(rep(1,d))
	rotmat[axes[1],axes[1]]=cos(angle)
	rotmat[axes[2],axes[1]]=sin(angle)
	rotmat[axes[1],axes[2]]=-sin(angle)
	rotmat[axes[2],axes[2]]=cos(angle)
	vertices=rotmat %*% vertices
	vertices
}

plotcube=function(vertices,edges,col=2,lwd=2)
{
	d=dim(vertices)[1]
	plot(NULL,xlim=c(-2,2),ylim=c(-2,2),main=paste(d,"-d hypercube",sep=""),xlab="x(1)",ylab="x(2)")
	for (i in 1:dim(edges)[2]) {
		lines(c(vertices[1,edges[1,i]],vertices[1,edges[2,i]]),c(vertices[2,edges[1,i]],vertices[2,edges[2,i]]),col=col,lwd=lwd)
	}
}

rotationplot=function(vertices,edges,rotations=20,...)
{
	for (count in 1:rotations) {
		vertices=rotate(vertices,...)
		plotcube(vertices,edges)
	}
}

hypercube=function(d=4,...)
{
	vertices=makevertices(d)
	edges=makeedges(vertices)
	vertices=2*vertices
	vertices=vertices-1
	rotationplot(vertices,edges,...)
}

# examples

# hypercube()
# hypercube(3)
# hypercube(4,angle=0.1)
# hypercube(5,rotations=30)

# eof

About this blog…

I am Darren Wilkinson, Professor of Stochastic Modelling at Newcastle University, UK. This is my blog. It will be semi-serious, with the vast majority of posts relating in some way to my research interests. It will be low volume – I’ll aim for around one post per month. I have a separate posterous blog for material of a more personal or frivolous nature. This blog will cover my interests in stochastic modelling (surprise!), Bayesian statistics (including MCMC), statistical computing (including R and BioConductoR), e-Science and systems biology (including SBML). There will be a bias towards more computational and computing issues (probably slightly geeky, and possibly straying occasionally into issues relating to Web 2.0, open-source Linux and Ubuntu).

There will also be some coverage of my experiences working in molecular biology labs. I am fortunate enough to currently hold a BBSRC Research Development Fellowship, and this is allowing me (a statistician with no lab experience), to spend some time working in a couple of molecular biology and genetics labs. This has been a fascinating and rewarding experience for all kinds of reasons that I don’t have time to go into right now, but if I have time I will try and use this blog to make some observations about this experience that may be of interest to others working in the systems biology field.

Obviously, this blog is not intended to in any way replace or be an alternative to my regular scientific publications, but simply to augment them with some additional material not well-suited to peer-review journals.