## Keeping R up to date on Ubuntu linux

R is included as part of the standard Ubuntu distribution, and can be installed with a command like

sudo apt-get install r-base


Obviously the software included as part of the standard distribution usually lags a little behind the latest version, and this is usually quite acceptable for most users most of the time. However, R is evolving quite quickly at the moment, and for various reasons I have decided to skip Ubuntu 12.10 (quantal) and stick with Ubuntu 12.4 (precise) for the time being. Since R 2.14 is included with Ubuntu 12.4, and I’d rather use R 2.15, I’d like to run with the latest R builds on my Ubuntu system.

Fortunately this is very easy, as there is a maintained repository for Ubuntu builds of R on CRAN. Full instructions are provided on CRAN, but here is the quick summary. First you need to know your nearest CRAN mirror – there is a list of mirrors on CRAN. I generally use the Bristol mirror, and so I will use it in the following.

sudo su
echo "deb http://www.stats.bris.ac.uk/R/bin/linux/ubuntu precise/" >> /etc/apt/sources.list
apt-key adv --keyserver keyserver.ubuntu.com --recv-keys E084DAB9
apt-get update


That’s it. You are updated to the latest version of R, and your system will check for updates in the usual way. There are just two things you may need to edit in line 2 above. The first is the address of the CRAN mirror (here “www.stats.bris.ac.uk”). The second is the name of the Ubuntu distro you are running (here “precise”).

## Getting started with parallel MCMC

### Introduction to parallel MCMC for Bayesian inference, using C, MPI, the GSL and SPRNG

#### Introduction

This post is aimed at people who already know how to code up Markov Chain Monte Carlo (MCMC) algorithms in C, but are interested in how to parallelise their code to run on multi-core machines and HPC clusters. I discussed different languages for coding MCMC algorithms in a previous post. The advantage of C is that it is fast, standard and has excellent scientific library support. Ultimately, people pursuing this route will be interested in running their code on large clusters of fast servers, but for the purposes of development and testing, this really isn’t necessary. A single dual-core laptop, or similar, is absolutely fine. I develop and test on a dual-core laptop running Ubuntu linux, so that is what I will assume for the rest of this post.

There are several possible environments for parallel computing, but I will focus on the Message-Passing Interface (MPI). This is a well-established standard for parallel computing, there are many implementations, and it is by far the most commonly used high performance computing (HPC) framework today. Even if you are ultimately interested in writing code for novel architectures such as GPUs, learning the basics of parallel computation using MPI will be time well spent.

#### MPI

The whole point of MPI is that it is a standard, so code written for one implementation should run fine with any other. There are many implementations. On Linux platforms, the most popular are OpenMPI, LAM, and MPICH. There are various pros and cons associated with each implementation, and if installing on a powerful HPC cluster, serious consideration should be given to which will be the most beneficial. For basic development and testing, however, it really doesn’t matter which is used. I use OpenMPI on my Ubuntu laptop, which can be installed with a simple:

sudo apt-get install openmpi-bin libopenmpi-dev


That’s it! You’re ready to go… You can test your installation with a simple “Hello world” program such as:

#include <stdio.h>
#include <mpi.h>

int main (int argc,char **argv)
{
int rank, size;
MPI_Init (&argc, &argv);
MPI_Comm_rank (MPI_COMM_WORLD, &rank);
MPI_Comm_size (MPI_COMM_WORLD, &size);
printf( "Hello world from process %d of %d\n", rank, size );
MPI_Finalize();
return 0;
}


You should be able to compile this with

mpicc -o helloworld helloworld.c


and run (on 2 processors) with

mpirun -np 2 helloworld


#### GSL

If you are writing non-trivial MCMC codes, you are almost certainly going to need to use a sophisticated math library and associated random number generation (RNG) routines. I typically use the GSL. On Ubuntu, the GSL can be installed with:

sudo apt-get install gsl-bin libgsl0-dev


A simple script to generate some non-uniform random numbers is given below.

#include <gsl/gsl_rng.h>
#include <gsl/gsl_randist.h>

int main(void)
{
int i; double z; gsl_rng *r;
r = gsl_rng_alloc(gsl_rng_mt19937);
gsl_rng_set(r,0);
for (i=0;i<10;i++) {
z = gsl_ran_gaussian(r,1.0);
printf("z(%d) = %f\n",i,z);
}
exit(EXIT_SUCCESS);
}


This can be compiled with a command like:

gcc gsl_ran_demo.c -o gsl_ran_demo -lgsl -lgslcblas


and run with

./gsl_ran_demo


#### SPRNG

When writing parallel Monte Carlo codes, it is important to be able to use independent streams of random numbers on each processor. Although it is tempting to “fudge” things by using a random number generator with a different seed on each processor, this does not guarantee independence of the streams, and an unfortunate choice of seeds could potentially lead to bad behaviour of your algorithm. The solution to this problem is to use a parallel random number generator (PRNG), designed specifically to give independent streams on different processors. Unfortunately the GSL does not have native support for such parallel random number generators, so an external library should be used. SPRNG 2.0 is a popular choice. SPRNG is designed so that it can be used with MPI, but also that it does not have to be. This is an issue, as the standard binary packages distributed with Ubuntu (libsprng2, libsprng2-dev) are compiled without MPI support. If you are going to be using SPRNG with MPI, things are simpler with MPI support, so it makes sense to download sprng2.0b.tar.gz from the SPRNG web site, and build it from source, carefully following the instructions for including MPI support. If you are not familiar with building libraries from source, you may need help from someone who is.

Once you have compiled SPRNG with MPI support, you can test it with the following code:

#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>
#define SIMPLE_SPRNG
#define USE_MPI
#include "sprng.h"

int main(int argc,char *argv[])
{
double rn; int i,k;
MPI_Init(&argc,&argv);
MPI_Comm_rank(MPI_COMM_WORLD,&k);
init_sprng(DEFAULT_RNG_TYPE,0,SPRNG_DEFAULT);
for (i=0;i<10;i++)
{
rn = sprng();
printf("Process %d, random number %d: %f\n", k, i+1, rn);
}
MPI_Finalize();
exit(EXIT_SUCCESS);
}


which can be compiled with a command like:

mpicc -I/usr/local/src/sprng2.0/include -L/usr/local/src/sprng2.0/lib -o sprng_demo sprng_demo.c -lsprng -lgmp


You will need to edit paths here to match your installation. If if builds, it can be run on 2 processors with a command like:

mpirun -np 2 sprng_demo


If it doesn’t build, there are many possible reasons. Check the error messages carefully. However, if the compilation fails at the linking stage with obscure messages about not being able to find certain SPRNG MPI functions, one possibility is that the SPRNG library has not been compiled with MPI support.

The problem with SPRNG is that it only provides a uniform random number generator. Of course we would really like to be able to use the SPRNG generator in conjunction with all of the sophisticated GSL routines for non-uniform random number generation. Many years ago I wrote a small piece of code to accomplish this, gsl-sprng.h. Download this and put it in your include path for the following example:

#include <mpi.h>
#include <gsl/gsl_rng.h>
#include "gsl-sprng.h"
#include <gsl/gsl_randist.h>

int main(int argc,char *argv[])
{
int i,k,po; gsl_rng *r;
MPI_Init(&argc,&argv);
MPI_Comm_rank(MPI_COMM_WORLD,&k);
r=gsl_rng_alloc(gsl_rng_sprng20);
for (i=0;i<10;i++)
{
po = gsl_ran_poisson(r,2.0);
printf("Process %d, random number %d: %d\n", k, i+1, po);
}
MPI_Finalize();
exit(EXIT_SUCCESS);
}


A new GSL RNG, gsl_rng_sprng20 is created, by including gsl-sprng.h immediately after gsl_rng.h. If you want to set a seed, do so in the usual GSL way, but make sure to set it to be the same on each processor. I have had several emails recently from people who claim that gsl-sprng.h “doesn’t work”. All I can say is that it still works for me! I suspect the problem is that people are attempting to use it with a version of SPRNG without MPI support. That won’t work… Check that the previous SPRNG example works, first.

I can compile and run the above code with

mpicc -I/usr/local/src/sprng2.0/include -L/usr/local/src/sprng2.0/lib -o gsl-sprng_demo gsl-sprng_demo.c -lsprng -lgmp -lgsl -lgslcblas
mpirun -np 2 gsl-sprng_demo


#### Parallel Monte Carlo

Now we have parallel random number streams, we can think about how to do parallel Monte Carlo simulations. Here is a simple example:

#include <math.h>
#include <mpi.h>
#include <gsl/gsl_rng.h>
#include "gsl-sprng.h"

int main(int argc,char *argv[])
{
int i,k,N; double u,ksum,Nsum; gsl_rng *r;
MPI_Init(&argc,&argv);
MPI_Comm_size(MPI_COMM_WORLD,&N);
MPI_Comm_rank(MPI_COMM_WORLD,&k);
r=gsl_rng_alloc(gsl_rng_sprng20);
for (i=0;i<10000;i++) {
u = gsl_rng_uniform(r);
ksum += exp(-u*u);
}
MPI_Reduce(&ksum,&Nsum,1,MPI_DOUBLE,MPI_SUM,0,MPI_COMM_WORLD);
if (k == 0) {
printf("Monte carlo estimate is %f\n", (Nsum/10000)/N );
}
MPI_Finalize();
exit(EXIT_SUCCESS);
}


which calculates a Monte Carlo estimate of the integral

$\displaystyle I=\int_0^1 \exp(-u^2)du$

using 10k variates on each available processor. The MPI command MPI_Reduce is used to summarise the values obtained independently in each process. I compile and run with

mpicc -I/usr/local/src/sprng2.0/include -L/usr/local/src/sprng2.0/lib -o monte-carlo monte-carlo.c -lsprng -lgmp -lgsl -lgslcblas
mpirun -np 2 monte-carlo


#### Parallel chains MCMC

Once parallel Monte Carlo has been mastered, it is time to move on to parallel MCMC. First it makes sense to understand how to run parallel MCMC chains in an MPI environment. I will illustrate this with a simple Metropolis-Hastings algorithm to sample a standard normal using uniform proposals, as discussed in a previous post. Here an independent chain is run on each processor, and the results are written into separate files.

#include <gsl/gsl_rng.h>
#include "gsl-sprng.h"
#include <gsl/gsl_randist.h>
#include <mpi.h>

int main(int argc,char *argv[])
{
int k,i,iters; double x,can,a,alpha; gsl_rng *r;
FILE *s; char filename[15];
MPI_Init(&argc,&argv);
MPI_Comm_rank(MPI_COMM_WORLD,&k);
if ((argc != 3)) {
if (k == 0)
fprintf(stderr,"Usage: %s <iters> <alpha>\n",argv[0]);
MPI_Finalize(); return(EXIT_FAILURE);
}
iters=atoi(argv[1]); alpha=atof(argv[2]);
r=gsl_rng_alloc(gsl_rng_sprng20);
sprintf(filename,"chain-%03d.tab",k);
s=fopen(filename,"w");
if (s==NULL) {
perror("Failed open");
MPI_Finalize(); return(EXIT_FAILURE);
}
x = gsl_ran_flat(r,-20,20);
fprintf(s,"Iter X\n");
for (i=0;i<iters;i++) {
can = x + gsl_ran_flat(r,-alpha,alpha);
a = gsl_ran_ugaussian_pdf(can) / gsl_ran_ugaussian_pdf(x);
if (gsl_rng_uniform(r) < a)
x = can;
fprintf(s,"%d %f\n",i,x);
}
fclose(s);
MPI_Finalize(); return(EXIT_SUCCESS);
}


I can compile and run this with the following commands

mpicc -I/usr/local/src/sprng2.0/include -L/usr/local/src/sprng2.0/lib -o mcmc mcmc.c -lsprng -lgmp -lgsl -lgslcblas
mpirun -np 2 mcmc 100000 1


#### Parallelising a single MCMC chain

The parallel chains approach turns out to be surprisingly effective in practice. Obviously the disadvantage of that approach is that “burn in” has to be repeated on every processor, which limits how much efficiency gain can be acheived by running across many processors. Consequently it is often desirable to try and parallelise a single MCMC chain. As MCMC algorithms are inherently sequential, parallelisation is not completely trivial, and most (but not all) approaches to parallelising a single MCMC chain focus on the parallelisation of each iteration. In order for this to be worthwhile, it is necessary that the problem being considered is non-trivial, having a large state space. The strategy is then to divide the state space into “chunks” which can be updated in parallel. I don’t have time to go through a real example in detail in this blog post, but fortunately I wrote a book chapter on this topic almost 10 years ago which is still valid and relevant today. The citation details are:

Wilkinson, D. J. (2005) Parallel Bayesian Computation, Chapter 16 in E. J. Kontoghiorghes (ed.) Handbook of Parallel Computing and Statistics, Marcel Dekker/CRC Press, 481-512.

The book was eventually published in 2005 after a long delay. The publisher which originally commisioned the handbook (Marcel Dekker) was taken over by CRC Press before publication, and the project lay dormant for a couple of years until the new publisher picked it up again and decided to proceed with publication. I have a draft of my original submission in PDF which I recommend reading for further information. The code examples used are also available for download, including several of the examples used in this post, as well as an extended case study on parallelisation of a single chain for Bayesian inference in a stochastic volatility model. Although the chapter is nearly 10 years old, the issues discussed are all still remarkably up-to-date, and the code examples all still work. I think that is a testament to the stability of the technology adopted (C, MPI, GSL). Some of the other handbook chapters have not stood the test of time so well.

For basic information on getting started with MPI and key MPI commands for implementing parallel MCMC algorithms, the above mentioned book chapter is a reasonable place to start. Read it all through carefully, run the examples, and carefully study the code for the parallel stochastic volatility example. Once that is understood, you should find it possible to start writing your own parallel MCMC algorithms. For further information about more sophisticated MPI usage and additional commands, I find the annotated specification: MPI – The complete reference to be as good a source as any.

## Overview

Recent developments in DNA (and RNA) sequencing technology is transforming how modern biological research is done. High-throughput sequencing of DNA and RNA using next generation sequencing (NGS) machines is now a standard approach in most good biological research labs. However, these technologies generate huge amounts of data which is non-trivial to manipulate and analyse. In this post I’ll give a quick introduction to working with short read DNA sequence data stored in the FASTQ format, using basic Unix (Linux) command-line tools. In the next post, I will give an introduction to analysing FASTQ data with Bioconductor.

## Working with data files

The first thing to be aware of when working with NGS data is that the data files involved can be really huge. For example, the data file(s) associated with the single lane of an NGS machine can often total over 5GB, even when stored in a compressed format. This can mean that they are not trivial to move around, and so thought needs to be put into things like how and where to download the files, where to store them, if/how to compress them, etc. Typically, most people will be having the actual sequencing done by a third-party, and the third-party will make the sequence data available for downloading via a secure web site, or similar. As the files are large, it makes sense to download from a machine with a good internet connection. It is also best to download direct to a Unix machine if possible, as Unix copes better with large files, and the packing and compression formats typically used (eg. tar, gzip, bzip) are usually better supported by default on Unix platforms. Also note that the FAT, FAT32 and VFAT file systems often used on (older) Windows platforms can not store very large files (over 4GB). This can sometimes be an issue for Unix users too, as most USB flash drives and some external (USB) hard disks will be VFAT/FAT32 formatted by default. Note that on Unix systems, the commands scp and rsync are useful ways to copy files from one machine to another over a (hopefully fast!) network.

The downloaded files will typically be in gzip (.gz) or gzipped-tar (.tar.gz or .tgz) format. If they are bundled together in a gzipped-tar, it probably makes sense to unpack them into individual files, which can be compressed, and then moved around individually. Use tar xvfz bundle.tar.gz to unpack a bundle of files, but make sure you have plenty of free disk space first (with df -h .). Remember that on Unix you can get basic help on most command-line tools by typing man commandname (where commandname is the name of the command you want information on). Type man man for help on the man command… Remember that compressing and uncompressing large files can take a long time, so be patient, and think carefully before you hit “Return”. Individual files should all be stored compressed. For example, gzip myexp_lane1.fastq will result in the file myexp_lane1.fastq.gz, which will be much smaller than the original, and can be analysed without ever storing the uncompressed version on disk. Use ls -lh to see the files and their sizes in a particular directory. Note that there are a variety of data formats associated with NGS data. Most are text formats which can be inspected using Unix commands such as head, tail and less. Files ending .fas contain the actual reads, files ending .qual contain the “quality” scores associated with a set of reads, but files ending .fastq are in the FASTQ format, which contains both the reads and the quality scores together in a single file. Most tools for working with short read data will work with the FASTQ format, and so that is what we will concentrate on for the rest of this quick introduction.

## Working with FASTQ files

FASTQ files are a text-based file format, with 4 lines of text corresponding to each read. Assuming a gzipped file called myreads.fastq.gz, the first few lines can be inspected with zcat myreads.fastq.gz | head. Typical results will look roughly like the following:

@NG-5232_4_1_1022_17823#0/1
NACTCCGGTGTCGGTCTCGTAGGCCATTTTAGAAGCGAATAAATCGATGNATTCGANCNCNNNNNNNNATCGNNAGAGCTCGTANGCCGTCTTCTGCTTGANNNNNNN
+NG-5232_4_1_1022_17823#0/1
#'''')(++)AAAAAAAAAA########################################################################################
@NG-5232_4_1_1025_18503#0/1
NTCTACGGTGTCGGTCTCGTAGCCTATCGGGTAGCAGAGCTTATCGATGAATTCGAGCTCGGTTTCAGATTGGCAGAGCTCGTANTGCGGCCTTCGGCTGANNNNNNN
+NG-5232_4_1_1025_18503#0/1
############################################################################################################
@NG-5232_4_1_1026_21154#0/1
NGTTACGGTGTCGGTCTCGTAGTGAGTTGACCTCCGCCCAGTATCGATGAATTCGAGCTCGTTTTCAGATCGGAAGAGCTCGTCNGCCGTCTTCTGCTTGANNNNNNN


The first line is an identifier associated with the first read. The second line is the read itself (usually the thing of greatest interest!). The third line is an identifier associated with the quality score. The fourth line gives the quality score associated with each base in the read, using an ASCII code. The next four lines correspond to the second read, and so on. Further details can be obtained from the Wikipedia FASTQ Format page.

Fortunately, Unix was designed from the outset with processing of large text files in mind. This makes it possible to do quite a lot of processing and analysis with standard Unix command-line tools. The number of lines in the (uncompressed) file can be obtained with

zcat myreads.fastq.gz | wc -l


Obviously then, dividing the result by 4 will give the number of reads in the file. You can browse through the file using




Use space to page-down, “b” to go back a page, and “q” to quit. Use man zless for more options. For example “/CAGGTT” will find the next occurrence of “CAGGTT” in the file. Any regular expression can be given after the slash, so, “/^CAGTT” will find the next read which starts with CAGTT. Regular expressions can be generally very useful in the analysis of sequence data, and we will return to them later.

Due to the very large file sizes involved in NGS data analysis, it can often be desirable to create a file containing a (relatively) small number of reads for initial testing and debugging of analysis pipelines. Again, this is easy to do using a command like the following

zcat myreads.fastq.gz | head -400000 | gzip > Test100k.fastq.gz


which takes the first 100k reads from “myreads” and stores them in “Test100k”, without ever storing any uncompressed reads on disk. An example Test100k.fastq.gz is available for downloading, so that readers can experiment for themselves with this kind of data. Note that in Unix, these commands which stream data from one command to another do so without requiring large amounts of RAM. All of these simple Unix filtering tools can be run without any problems on, say, a laptop with a modest amount of RAM (2GB will be fine) running Ubuntu Linux (or similar), so long as you have plenty of free disk space.

### Splitting and joining FASTQ files

If you do not have access to a large RAM machine, it may be desirable to split up a very large set of reads into a collection of files, each of which contains a more manageable set of reads. Again, this can be accomplished with standard Unix commands. For example:

zcat myreads.fastq.gz | split -d -l 2000000 - Block


will create files Block01, Block02, etc., each containing 0.5M reads. However, these files will not be compressed (so make sure you have plenty of disk space before running this!), and do not have the correct file extension. Both of those issues can be fixed with some more standard Unix commands. The following assumes that you are using a Bash shell, but similar things work in other shells:

for name in Block??
do
mv ${name}${name}.fastq
gzip \${name}.fastq
done


This will result in the compressed FASTQ files Block01.fastq.gz, Block02.fastq.gz, etc. The idea now is that each of these files is processed one at a time (or in parallel, perhaps on a cluster) using some tool, and then the results combined in some sensible way later. The precise details will depend a great deal on the nature of the experiment which led to the data.

If necessary, the files can be recombined at a later date with a command like:

zcat Block*.fastq.gz | gzip > allreads.fastq.gz


without storing uncompressed files on disk.

### Filtering FASTQ files

It will very often be desirable to filter the data in a less arbitrary way – selecting reads matching particular criteria for further analysis. Again, this sort of text processing is the kind of thing that is very easy in Unix. Traditionally, tools such as grep, sed and awk were used for this purpose. However, in the early 1990s, many people (myself included) migrated from awk to perl, as it was a full-blown programming language, with many more features than the simple tools. Perl is still a very effective language for this kind of activity, and now the BioPerl project provides a large range of modules specifically targeting biological applications, with an emphasis on sequence analysis. That said, many people find the perl language to be rather ugly, leading to code that is difficult to read and maintain. So in the late 1990s I, like many others, switched from perl to python for text processing and related activities. Python is a great general purpose programming language. It is simple, elegant and easy to learn, and code is easy to read and maintain. Analogous to perl, there is an associated Biopython project, containing modules for sequence analysis and other biological applications. At some point in the future I may write a post on using Biopython for the analysis of FASTQ data, but in the meantime the on-line resources and books such as Python for Bioinformatics provide further information for the interested reader.

Fortunately the FASTQ format is sufficiently simple that many if not most basic filtering and analysis tasks can be accomplished with the default python installation found on the vast majority of Unix systems. Below is a complete python program to filter a FASTQ file, selecting just the reads matching a given regular expression.

#!/usr/bin/env python
# filter.py
# Filter a fastq file according to a given regex to match to read sequence
# Copyright (C) Darren J Wilkinson, November 2010, http://tinyurl.com/darrenjw

import sys,re

for i in range(4):

if __name__=='__main__':
if (len(sys.argv)!=2):
print "Usage: python filter.py <regex> < infile.fastq > outfile.fastq"
exit(1)
regex=re.compile(sys.argv[1])

# eof


So, for example,

zcat Test100k.fastq.gz | python filter.py TCGATTT | gzip > filtered.fastq.gz


will select out all reads containing the string TCGATTT, and

zcat Test100k.fastq.gz | python filter.py ^TCGAT | gzip > filtered.fastq.gz


zcat Test100k.fastq.gz | python filter.py ^TCGAT | wc -l