Statistical computing languages at the RSS

On Friday the Royal Statistical Society hosted a meeting on Statistical computing languages, organised by my colleague Colin Gillespie. Four languages were presented at the meeting: Python, Scala, Matlab and Julia. I presented the talk on Scala. The slides I presented are available, in addition to the code examples and instructions on how to run them, in a public github repo I have created. The talk mainly covered material I have discussed in various previous posts on this blog. What was new was the emphasis on how easy it is to use and run Scala code, and the inclusion of complete examples and instructions on how to run them on any platform with a JVM installed. I also discussed some of the current limitations of Scala as an environment for interactive statistical data analysis and visualisation, and how these limitations could be overcome with a little directed effort. Colin suggested that all of the speakers covered a couple of examples (linear regression and a Monte Carlo integral) in “their” languages, and he provided an R solution. It is interesting to see the examples in the five different languages side by side for comparison purposes. Colin is collecting together all of the material relating to the meeting in a combined github repo, which should fill out over the next few days.

For other Scala posts on this blog, see all of my posts with a “scala” tag.

Gibbs sampler in various languages (revisited)


Regular readers of this blog will know that in April 2010 I published a short post showing how a trivial bivariate Gibbs sampler could be implemented in the four languages that I use most often these days (R, python, C, Java), and I discussed relative timings, and how one might start to think about trading off development time against execution time for more complex MCMC algorithms. I actually wrote the post very quickly one night while I was stuck in a hotel room in Seattle – I didn’t give much thought to it, and the main purpose was to provide simple illustrative examples of simple Monte Carlo codes using non-uniform random number generators in the different languages, as a starting point for someone thinking of switching languages (say, from R to Java or C, for efficiency reasons). It wasn’t meant to be very deep or provocative, or to start any language wars. Suffice to say that this post has had many more hits than all of my other posts combined, is still my most popular post, and still attracts comments and spawns other posts to this day. Several people have requested that I re-do the post more carefully, to include actual timings, and to include a few additional optimisations. Hence this post. For reference, the original post is here. A post about it from the python community is here, and a recent post about using Rcpp and inlined C++ code to speed up the R version is here.

The sampler

So, the basic idea was to construct a Gibbs sampler for the bivariate distribution

f(x,y) = kx^2\exp\{-xy^2-y^2+2y-4x\},\qquad x>0,y\in\Bbb{R}

with unknown normalising constant k>0 ensuring that the density integrates to one. Unfortunately, in the original post I dropped a factor of 2 constructing one of the full conditionals, which meant that none of the samplers actually had exactly the right target distribution (thanks to Sanjog Misra for bringing this to my attention). So actually, the correct full conditionals are

\displaystyle x|y \sim Ga(3,y^2+4)

\displaystyle y|x \sim N\left(\frac{1}{1+x},\frac{1}{2(1+x)}\right)

Note the factor of two in the variance of the full conditional for y. Given the full conditionals, it is simple to alternately sample from them to construct a Gibbs sampler for the target distribution. We will run a Gibbs sampler with a thin of 1000 and obtain a final sample of 50000.



Let’s start with R again. The slightly modified version of the code from the old post is given below

	for (i in 1:N) {
		for (j in 1:thin) {



I’ve just corrected the full conditional, and I’ve increased the sample size and thinning to 50k and 1k, respectively, to allow for more accurate timings (of the faster languages). This code can be run from the (Linux) command line with something like:

time Rscript gibbs.R

I discuss timings in detail towards the end of the post, but this code is slow, taking over 7 minutes on my (very fast) laptop. Now, the above code is typical of the way code is often structured in R – doing as much as possible in memory, and writing to disk only if necessary. However, this can be a bad idea with large MCMC codes, and is less natural in other languages, anyway, so below is an alternative version of the code, written in more of a scripting language style.

	for (i in 1:N) {
		for (j in 1:thin) {


This can be run with a command like

time Rscript gibbs-script.R >

This code actually turns out to be a slightly slower than the in-memory version for this simple example, but for larger problems I would not expect that to be the case. I always analyse MCMC output using R, whatever language I use for running the algorithm, so for completeness, here is a bit of code to load up the data file, do some plots and compute summary statistics.


	contour(x,y,z,main="Contours of actual (unnormalised) distribution")
	contour(fit$x1,fit$x2,fit$fhat,main="Contours of empirical distribution")


Another language I use a lot is Python. I don’t want to start any language wars, but I personally find python to be a better designed language than R, and generally much nicer for the development of large programs. A python script for this problem is given below

import random,math

def gibbs(N=50000,thin=1000):
    print "Iter  x  y"
    for i in range(N):
        for j in range(thin):
        print i,x,y


It can be run with a command like

time python >

This code turns out to be noticeably faster than the R versions, taking around 4 minutes on my laptop (again, detailed timing information below). However, there is a project for python known as the PyPy project, which is concerned with compiling regular python code to very fast byte-code, giving significant speed-ups on certain problems. For this post, I downloaded and install version 1.5 of the 64-bit linux version of PyPy. Once installed, I can run the above code with the command

time pypy >

To my astonishment, this “just worked”, and gave very impressive speed-up over regular python, running in around 30 seconds. This actually makes python a much more realistic prospect for the development of MCMC codes than I imagined. However, I need to understand the limitations of PyPy better – for example, why doesn’t everyone always use PyPy for everything?! It certainly seems to make python look like a very good option for prototyping MCMC codes.


Traditionally, I have mainly written MCMC codes in C, using the GSL. C is a fast, efficient, statically typed language, which compiles to native code. In many ways it represents the “gold standard” for speed. So, here is the C code for this problem.

#include <stdio.h>
#include <math.h>
#include <stdlib.h>
#include <gsl/gsl_rng.h>
#include <gsl/gsl_randist.h>

void main()
  int N=50000;
  int thin=1000;
  int i,j;
  gsl_rng *r = gsl_rng_alloc(gsl_rng_mt19937);
  double x=0;
  double y=0;
  printf("Iter x y\n");
  for (i=0;i<N;i++) {
    for (j=0;j<thin;j++) {
    printf("%d %f %f\n",i,x,y);

It can be compiled and run with command like

gcc -O4 gibbs.c -lgsl -lgslcblas -lm -o gibbs
time ./gibbs >

This runs faster than anything else I consider in this post, taking around 8 seconds.


I’ve recently been experimenting with Java for MCMC codes, in conjunction with Parallel COLT. Java is a statically typed object-oriented (O-O) language, but is usually compiled to byte-code to run on a virtual machine (known as the JVM). Java compilers and virtual machines are very fast these days, giving “close to C” performance, but with a nicer programming language, and advantages associated with virtual machines. Portability is a huge advantage of Java. For example, I can easily get my Java code to run on almost any University Condor pool, on both Windows and Linux clusters – they all have a recent JVM installed, and I can easily bundle any required libraries with my code. Suffice to say that getting GSL/C code to run on generic Condor pools is typically much less straightforward. Here is the Java code:

import java.util.*;
import cern.jet.random.tdouble.*;
import cern.jet.random.tdouble.engine.*;

class Gibbs

    public static void main(String[] arg)
	int N=50000;
	int thin=1000;
	DoubleRandomEngine rngEngine=new DoubleMersenneTwister(new Date());
	Normal rngN=new Normal(0.0,1.0,rngEngine);
	Gamma rngG=new Gamma(1.0,1.0,rngEngine);
	double x=0;
	double y=0;
	System.out.println("Iter x y");
	for (int i=0;i<N;i++) {
	    for (int j=0;j<thin;j++) {
	    System.out.println(i+" "+x+" "+y);


It can be compiled and run with

time java Gibbs >

This takes around 11.6s seconds on my laptop. This is well within a factor of 2 of the C version, and around 3 times faster than even the PyPy python version. It is around 40 times faster than R. Java looks like a good choice for implementing MCMC codes that would be messy to implement in C, or that need to run places where it would be fiddly to get native codes to run.


Another language I’ve been taking some interest in recently is Scala. Scala is a statically typed O-O/functional language which compiles to byte-code that runs on the JVM. Since it uses Java technology, it can seamlessly integrate with Java libraries, and can run anywhere that Java code can run. It is a much nicer language to program in than Java, and feels more like a dynamic language such as python. In fact, it is almost as nice to program in as python (and in some ways nicer), and will run in a lot more places than PyPy python code. Here is the scala code (which calls Parallel COLT for random number generation):

object GibbsSc {

	import cern.jet.random.tdouble.engine.DoubleMersenneTwister
	import cern.jet.random.tdouble.Normal
	import cern.jet.random.tdouble.Gamma
	import Math.sqrt
 	import java.util.Date

	def main(args: Array[String]) {
		val N=50000
		val thin=1000
		val rngEngine=new DoubleMersenneTwister(new Date)
		val rngN=new Normal(0.0,1.0,rngEngine)
		val rngG=new Gamma(1.0,1.0,rngEngine)
		var x=0.0
		var y=0.0
		println("Iter x y")
		for (i <- 0 until N) {
			for (j <- 0 until thin) {
			println(i+" "+x+" "+y)


It can be compiled and run with

scalac GibbsSc.scala
time scala GibbsSc >

This code takes around 11.8s on my laptop – almost as fast as the Java code! So, on the basis of this very simple and superficial example, it looks like scala may offer the best of all worlds – a nice, elegant, terse programming language, functional and O-O programming styles, the safety of static typing, the ability to call on Java libraries, great speed and efficiency, and the portability of Java! Very interesting.


James Durbin has kindly sent me a Groovy version of the code, which he has also discussed in his own blog post. Groovy is a dynamic O-O language for the JVM, which, like Scala, can integrate nicely with Java applications. It isn’t a language I have examined closely, but it seems quite nice. The code is given below:

import cern.jet.random.tdouble.engine.*;
import cern.jet.random.tdouble.*;

rngEngine= new DoubleMersenneTwister(new Date());
rngN=new Normal(0.0,1.0,rngEngine);
rngG=new Gamma(1.0,1.0,rngEngine);
println("Iter x y");
for(i in 1..N){
	for(j in 1..thin){
	println("$i $x $y");

It can be run with a command like:

time groovy Gibbs.gv >

Again, rather amazingly, this code runs in around 35 seconds – very similar to the speed of PyPy. This makes Groovy also seem like a potential very attractive environment for prototyping MCMC codes, especially if I’m thinking about ultimately porting to Java.


The laptop I’m running everything on is a Dell Precision M4500 with an Intel i7 Quad core (x940@2.13Ghz) CPU, running the 64-bit version of Ubuntu 11.04. I’m running stuff from the Ubuntu (Unity) desktop, and running several terminals and applications, but the machine is not loaded at the time each job runs. I’m running each job 3 times and taking the arithmetic mean real elapsed time. All timings are in seconds.

R 2.12.1 (in memory) 435.0
R 2.12.1 (script) 450.2
Python 2.7.1+ 233.5
PyPy 1.5 32.2
Groovy 1.7.4 35.4
Java 1.6.0 11.6
Scala 2.7.7 11.8
C (gcc 4.5.2) 8.1

If we look at speed-up relative to the R code (in-memory version), we get:

R (in memory) 1.00
R (script) 0.97
Python 1.86
PyPy 13.51
Groovy 12.3
Java 37.50
Scala 36.86
C 53.70

Alternatively, we can look at slow-down relative to the C version, to get:

R (in memory) 53.7
R (script) 55.6
Python 28.8
PyPy 4.0
Groovy 4.4
Java 1.4
Scala 1.5
C 1.0


The findings here are generally consistent with those of the old post, but consideration of PyPy, Groovy and Scala does throw up some new issues. I was pretty stunned by PyPy. First, I didn’t expect that it would “just work” – I thought I would either have to spend time messing around with my configuration settings, or possibly even have to modify my code slightly. Nope. Running python code with pypy appears to be more than 10 times faster than R, and only 4 times slower than C. I find it quite amazing that it is possible to get python code to run just 4 times slower than C, and if that is indicative of more substantial examples, it really does open up the possibility of using python for “real” problems, although library coverage is currently a problem. It certainly solves my “prototyping problem”. I often like to prototype algorithms in very high level dynamic languages like R and python before porting to a more efficient language. However, I have found that this doesn’t always work well with complex MCMC codes, as they just run too slowly in the dynamic languages to develop, test and debug conveniently. But it looks now as though PyPy should be fast enough at least for prototyping purposes, and may even be fast enough for production code in some circumstances. But then again, exactly the same goes for Groovy, which runs on the JVM, and can access any existing Java library… I haven’t yet looked into Groovy in detail, but it appears that it could be a very nice language for prototyping algorithms that I intend to port to Java.

The results also confirm my previous findings that Java is now “fast enough” that one shouldn’t worry too much about the difference in speed between it and native code written in C (or C++). The Java language is much nicer than C or C++, and the JVM platform is very attractive in many situations. However, the Scala results were also very surprising for me. Scala is a really elegant language (certainly on a par with python), comes with all of the advantages of Java, and appears to be almost as fast as Java. I’m really struggling to come up with reasons not to use Scala for everything!

Speeding up R

MCMC codes are used by a range of different scientists for a range of different problems. However, they are very (most?) often used by Bayesian statisticians who use the algorithms to target a Bayesian posterior distribution. For various (good) reasons, many statisticians are heavily invested in R, like to use R as much as possible, and do as much as possible from within the R environment. These results show why R is not a good language in which to implement MCMC algorithms, so what is an R-dependent statistician supposed to do? One possibility would be to byte-code compile R code in an analogous way to python and pypy. The very latest versions of R support such functionality, but the post by Dirk Eddelbuettel suggests that the current version of cmpfun will only give a 40% speedup on this problem, which is still slower than regular python code. Short of a dramatic improvement in this technology, the only way forward seems to be to extend R using code from another language. It is easy to extend R using C, C++ and Java. I have shown in previous posts how to do this using Java and using C, and the recent post by Dirk shows how to extend using C++. Although interesting, this doesn’t really have much bearing on the current discussion. If you extend using Java you get Java-like speedups, and if you extend using C you get C-like speedups. However, in case people are interested, I intend to gather up these examples into one post and include detailed timing information in a subsequent post.

Introduction to the processing of short read next generation sequencing data


Recent developments in DNA (and RNA) sequencing technology is transforming how modern biological research is done. High-throughput sequencing of DNA and RNA using next generation sequencing (NGS) machines is now a standard approach in most good biological research labs. However, these technologies generate huge amounts of data which is non-trivial to manipulate and analyse. In this post I’ll give a quick introduction to working with short read DNA sequence data stored in the FASTQ format, using basic Unix (Linux) command-line tools. In the next post, I will give an introduction to analysing FASTQ data with Bioconductor.

Working with data files

The first thing to be aware of when working with NGS data is that the data files involved can be really huge. For example, the data file(s) associated with the single lane of an NGS machine can often total over 5GB, even when stored in a compressed format. This can mean that they are not trivial to move around, and so thought needs to be put into things like how and where to download the files, where to store them, if/how to compress them, etc. Typically, most people will be having the actual sequencing done by a third-party, and the third-party will make the sequence data available for downloading via a secure web site, or similar. As the files are large, it makes sense to download from a machine with a good internet connection. It is also best to download direct to a Unix machine if possible, as Unix copes better with large files, and the packing and compression formats typically used (eg. tar, gzip, bzip) are usually better supported by default on Unix platforms. Also note that the FAT, FAT32 and VFAT file systems often used on (older) Windows platforms can not store very large files (over 4GB). This can sometimes be an issue for Unix users too, as most USB flash drives and some external (USB) hard disks will be VFAT/FAT32 formatted by default. Note that on Unix systems, the commands scp and rsync are useful ways to copy files from one machine to another over a (hopefully fast!) network.

The downloaded files will typically be in gzip (.gz) or gzipped-tar (.tar.gz or .tgz) format. If they are bundled together in a gzipped-tar, it probably makes sense to unpack them into individual files, which can be compressed, and then moved around individually. Use tar xvfz bundle.tar.gz to unpack a bundle of files, but make sure you have plenty of free disk space first (with df -h .). Remember that on Unix you can get basic help on most command-line tools by typing man commandname (where commandname is the name of the command you want information on). Type man man for help on the man command… Remember that compressing and uncompressing large files can take a long time, so be patient, and think carefully before you hit “Return”. Individual files should all be stored compressed. For example, gzip myexp_lane1.fastq will result in the file myexp_lane1.fastq.gz, which will be much smaller than the original, and can be analysed without ever storing the uncompressed version on disk. Use ls -lh to see the files and their sizes in a particular directory. Note that there are a variety of data formats associated with NGS data. Most are text formats which can be inspected using Unix commands such as head, tail and less. Files ending .fas contain the actual reads, files ending .qual contain the “quality” scores associated with a set of reads, but files ending .fastq are in the FASTQ format, which contains both the reads and the quality scores together in a single file. Most tools for working with short read data will work with the FASTQ format, and so that is what we will concentrate on for the rest of this quick introduction.

Working with FASTQ files

FASTQ files are a text-based file format, with 4 lines of text corresponding to each read. Assuming a gzipped file called myreads.fastq.gz, the first few lines can be inspected with zcat myreads.fastq.gz | head. Typical results will look roughly like the following:


The first line is an identifier associated with the first read. The second line is the read itself (usually the thing of greatest interest!). The third line is an identifier associated with the quality score. The fourth line gives the quality score associated with each base in the read, using an ASCII code. The next four lines correspond to the second read, and so on. Further details can be obtained from the Wikipedia FASTQ Format page.

Fortunately, Unix was designed from the outset with processing of large text files in mind. This makes it possible to do quite a lot of processing and analysis with standard Unix command-line tools. The number of lines in the (uncompressed) file can be obtained with

zcat myreads.fastq.gz | wc -l 

Obviously then, dividing the result by 4 will give the number of reads in the file. You can browse through the file using

zless myreads.fastq.gz 

Use space to page-down, “b” to go back a page, and “q” to quit. Use man zless for more options. For example “/CAGGTT” will find the next occurrence of “CAGGTT” in the file. Any regular expression can be given after the slash, so, “/^CAGTT” will find the next read which starts with CAGTT. Regular expressions can be generally very useful in the analysis of sequence data, and we will return to them later.

Due to the very large file sizes involved in NGS data analysis, it can often be desirable to create a file containing a (relatively) small number of reads for initial testing and debugging of analysis pipelines. Again, this is easy to do using a command like the following

zcat myreads.fastq.gz | head -400000 | gzip > Test100k.fastq.gz 

which takes the first 100k reads from “myreads” and stores them in “Test100k”, without ever storing any uncompressed reads on disk. An example Test100k.fastq.gz is available for downloading, so that readers can experiment for themselves with this kind of data. Note that in Unix, these commands which stream data from one command to another do so without requiring large amounts of RAM. All of these simple Unix filtering tools can be run without any problems on, say, a laptop with a modest amount of RAM (2GB will be fine) running Ubuntu Linux (or similar), so long as you have plenty of free disk space.

Splitting and joining FASTQ files

If you do not have access to a large RAM machine, it may be desirable to split up a very large set of reads into a collection of files, each of which contains a more manageable set of reads. Again, this can be accomplished with standard Unix commands. For example:

zcat myreads.fastq.gz | split -d -l 2000000 - Block 

will create files Block01, Block02, etc., each containing 0.5M reads. However, these files will not be compressed (so make sure you have plenty of disk space before running this!), and do not have the correct file extension. Both of those issues can be fixed with some more standard Unix commands. The following assumes that you are using a Bash shell, but similar things work in other shells:

for name in Block??
 mv ${name} ${name}.fastq
 gzip ${name}.fastq

This will result in the compressed FASTQ files Block01.fastq.gz, Block02.fastq.gz, etc. The idea now is that each of these files is processed one at a time (or in parallel, perhaps on a cluster) using some tool, and then the results combined in some sensible way later. The precise details will depend a great deal on the nature of the experiment which led to the data.

If necessary, the files can be recombined at a later date with a command like:

zcat Block*.fastq.gz | gzip > allreads.fastq.gz 

without storing uncompressed files on disk.

Filtering FASTQ files

It will very often be desirable to filter the data in a less arbitrary way – selecting reads matching particular criteria for further analysis. Again, this sort of text processing is the kind of thing that is very easy in Unix. Traditionally, tools such as grep, sed and awk were used for this purpose. However, in the early 1990s, many people (myself included) migrated from awk to perl, as it was a full-blown programming language, with many more features than the simple tools. Perl is still a very effective language for this kind of activity, and now the BioPerl project provides a large range of modules specifically targeting biological applications, with an emphasis on sequence analysis. That said, many people find the perl language to be rather ugly, leading to code that is difficult to read and maintain. So in the late 1990s I, like many others, switched from perl to python for text processing and related activities. Python is a great general purpose programming language. It is simple, elegant and easy to learn, and code is easy to read and maintain. Analogous to perl, there is an associated Biopython project, containing modules for sequence analysis and other biological applications. At some point in the future I may write a post on using Biopython for the analysis of FASTQ data, but in the meantime the on-line resources and books such as Python for Bioinformatics provide further information for the interested reader.

Fortunately the FASTQ format is sufficiently simple that many if not most basic filtering and analysis tasks can be accomplished with the default python installation found on the vast majority of Unix systems. Below is a complete python program to filter a FASTQ file, selecting just the reads matching a given regular expression.

#!/usr/bin/env python
# Filter a fastq file according to a given regex to match to read sequence
# Copyright (C) Darren J Wilkinson, November 2010,

import sys,re

def readread(s):
	return [s.readline(),s.readline(),s.readline(),s.readline()]

def writeread(sread,s):
	for i in range(4):

def readfilter(query,s,o):
	while (sread[0]):
	    if ([1])):

if __name__=='__main__':
	if (len(sys.argv)!=2):
	    print "Usage: python <regex> < infile.fastq > outfile.fastq"
# eof

So, for example,

zcat Test100k.fastq.gz | python TCGATTT | gzip > filtered.fastq.gz 

will select out all reads containing the string TCGATTT, and

zcat Test100k.fastq.gz | python ^TCGAT | gzip > filtered.fastq.gz

will select out all reads which start with the string TCGAT. Similarly, doing

zcat Test100k.fastq.gz | python ^TCGAT | wc -l 

and dividing the result by 4 will give the number of reads starting with TCGAT. People familiar with Unix can think of this python script as a version of grep for FASTQ files.

Of course the above python code could be extended to do more sophisticated analysis of the read data, but in that case it will make sense to make use of the Biopython libraries. An alternative is to use R and Bioconductor for the analysis, making use of their huge array of statistical analysis and visualisation tools. I will give a quick introduction to the use of the Bioconductor “ShortRead” package for processing FASTQ data in the next post.