A quick introduction to the Bioconductor ShortRead package for the analysis of NGS data

Introduction and background

In the previous post I gave an introduction to next generation sequencing (NGS) data, and in particular, working with FASTQ format files in a Unix environment. In another post I gave an introduction to R and Bioconductor for the statistical analysis of biological data. In this post I will give a quick introduction to the use of the Bioconductor package ShortRead for reading and processing FASTQ data using R/Bioconductor. This post assumes a working R/Bioconductor installation, and a basic familiarity with FASTQ files. See the previous posts for further details.

Getting started

Assuming that Bioconductor is installed, entering the following at the R command prompt will install the ShortRead package and any missing dependencies:


The library can be loaded using


Basic information about the package can be obtained with help(package="ShortRead"), and it is also worth getting help for the Biostrings package on which the ShortRead package depends. There is also a vignette for the ShortRead package, and several vignettes associated with the Biostrings package, all of which can be accessed using openVignette().

Loading data

Assuming that the file Test100k.fastq.gz discussed in the previous post is available and in the current directory, it can be loaded with:


Note that the FASTQ file does not need to be unzipped. Typing


gives some basic information, such as the number of reads, and tells you that the reads object is of type ShortReadQ. Information about this class can be obtained via ?ShortReadQ. If the main interest is in the actual sequences, these can be extracted using


Entering sequences will show the first few and last few sequences, and inform you that the object is of class DNAStringSet. Again, ?DNAStringSet may be helpful. As with most large R objects (and large Unix text files), it can be helpful to just look at the first few rows. In R this can be done with head(sequences), analogously to the Unix head command.

Working with short read data

Individual reads can be accessed directly using usual R indexing, so that sequences[1] gives the sequence of the first read, etc. R string manipulation commands can also be used on a DNAStringSet, so that


will return a vector consisting of the first 5 bases from each read. Consequently,


will tabulate the occurrences of the inital pentamer associated with each read. This may or may not be interesting, depending on the data…

Filtering data

Filtering reads to obtain a subset of direct interest is a key task. Fortunately there is good support for doing this in the ShortRead package. Suppose now that we are only interested in sequences beginning with the pentamer TAGCT (perhaps because those reads correspond to a sample amplified with a particular PCR primer). We can trivially filter our set of reads using standard R subsetting techniques:


However, filtering is such an important concept that the ShortRead library includes its own special set of functions relating to filtering, which allow filtering on sequence quality as well as sequence content, and allow filters to be combined together to allow the building of new filters from old, etc. Here I will just explain how to do the above filtering operation using a ShortRead srFilter. The idea is to create filters of class SRFilter and then apply these to sets of reads (of class ShortReadQ) in order to filter. So, if you want to build a custom filter, this is done using the function srFilter. An example follows:

        },name="My Filter")

The first argument to the srFilter function is a function taking a single argument, x, which is assumed to be a ShortReadQ object. The function should return a boolean vector, giving TRUE for all reads which are to pass through the filter. Once defined, typing myfilter will inform you that myfilter is an object of class SRFilter, and typing srFilter(myfilter) will give the definition of the filter.

Typing myfilter(reads) will apply the filter to the reads, giving a boolean vector. Therefore to obtain the filtered reads, use


See ?srFilter and the ShortRead vignette for further details of how to build and combine filters.

Of course there is a bit of a problem with all of this post-hoc filtering when working with very large FASTQ files. The initial readFastq command reads the entire FASTQ file into RAM. If you don’t have enough RAM, the read will fail, and you won’t ever get a chance to filter things down to a smaller subset of reads to work with. So you really want to apply a filter as the data is read. Recent versions of the ShortRead package support this functionality, by allowing an optional filter argument to be passed in to the readFastq function.


This applies the filter at read time, so that as long as the filtered data will fit into the available RAM, the command will succeed. Note that it is desirable to run ShortRead on a 64bit machine with lots of RAM in order to be able to work with a reasonable number of reads at once. Even so, filtering at read time is still likely to be necessary. Attempting to read in a full lane of around 40M reads is likely to fail even on a machine with 16GB of RAM.

Matching and counting

Suppose that we are interested in looking for occurrences of the sequence “AGACGATCACCATTCGATGC” in the reads, somewhere between positions 22 and 44 (this is a real example). We could start by creating a DNAStringSet containing just the read sequences between positions 22 and 44 as


We could then filter the set with a regular expression requiring an exact match. However, NGS methods are not perfect. They occasionally call a base wrongly, and occasionally mess up a read by missing a base, or by inserting an extra base into the read. If we exactly match, we are going to miss all of the reads which aren’t quite right, and as we are matching to quite a long sequence (20 bases), that may well correspond to a lot of reads. What we really want to do is allow an imperfect match, allowing, say 1 base to be wrong, or one indel. We can do that with the following:


hits is a vector corresponding to the number of hits in each read (mainly zeros, with a few ones). By summing them we get the total number of reads which match, and the actual reads can be pulled out for further analysis with a command like


See ?vcountPattern for futher details, including details of related functions.

Further analysis

The basic commands and techniques covered in this brief post are enough to get started writing R functions for the analysis of short read data. It is easy to use the R programming language to write functions which count particular matching occurrences and events, and record them in vectors or matrices which can then be further analysed using other Bioconductor or R packages for statistical analysis, modelling or evaluation. As ever, further information can be found in the Bioconductor documentation and other on-line sources. I’ve also found Gentleman’s book R Programming for Bioinformatics to be a very useful source of information for working with sequence data in R.


Introduction to the processing of short read next generation sequencing data


Recent developments in DNA (and RNA) sequencing technology is transforming how modern biological research is done. High-throughput sequencing of DNA and RNA using next generation sequencing (NGS) machines is now a standard approach in most good biological research labs. However, these technologies generate huge amounts of data which is non-trivial to manipulate and analyse. In this post I’ll give a quick introduction to working with short read DNA sequence data stored in the FASTQ format, using basic Unix (Linux) command-line tools. In the next post, I will give an introduction to analysing FASTQ data with Bioconductor.

Working with data files

The first thing to be aware of when working with NGS data is that the data files involved can be really huge. For example, the data file(s) associated with the single lane of an NGS machine can often total over 5GB, even when stored in a compressed format. This can mean that they are not trivial to move around, and so thought needs to be put into things like how and where to download the files, where to store them, if/how to compress them, etc. Typically, most people will be having the actual sequencing done by a third-party, and the third-party will make the sequence data available for downloading via a secure web site, or similar. As the files are large, it makes sense to download from a machine with a good internet connection. It is also best to download direct to a Unix machine if possible, as Unix copes better with large files, and the packing and compression formats typically used (eg. tar, gzip, bzip) are usually better supported by default on Unix platforms. Also note that the FAT, FAT32 and VFAT file systems often used on (older) Windows platforms can not store very large files (over 4GB). This can sometimes be an issue for Unix users too, as most USB flash drives and some external (USB) hard disks will be VFAT/FAT32 formatted by default. Note that on Unix systems, the commands scp and rsync are useful ways to copy files from one machine to another over a (hopefully fast!) network.

The downloaded files will typically be in gzip (.gz) or gzipped-tar (.tar.gz or .tgz) format. If they are bundled together in a gzipped-tar, it probably makes sense to unpack them into individual files, which can be compressed, and then moved around individually. Use tar xvfz bundle.tar.gz to unpack a bundle of files, but make sure you have plenty of free disk space first (with df -h .). Remember that on Unix you can get basic help on most command-line tools by typing man commandname (where commandname is the name of the command you want information on). Type man man for help on the man command… Remember that compressing and uncompressing large files can take a long time, so be patient, and think carefully before you hit “Return”. Individual files should all be stored compressed. For example, gzip myexp_lane1.fastq will result in the file myexp_lane1.fastq.gz, which will be much smaller than the original, and can be analysed without ever storing the uncompressed version on disk. Use ls -lh to see the files and their sizes in a particular directory. Note that there are a variety of data formats associated with NGS data. Most are text formats which can be inspected using Unix commands such as head, tail and less. Files ending .fas contain the actual reads, files ending .qual contain the “quality” scores associated with a set of reads, but files ending .fastq are in the FASTQ format, which contains both the reads and the quality scores together in a single file. Most tools for working with short read data will work with the FASTQ format, and so that is what we will concentrate on for the rest of this quick introduction.

Working with FASTQ files

FASTQ files are a text-based file format, with 4 lines of text corresponding to each read. Assuming a gzipped file called myreads.fastq.gz, the first few lines can be inspected with zcat myreads.fastq.gz | head. Typical results will look roughly like the following:


The first line is an identifier associated with the first read. The second line is the read itself (usually the thing of greatest interest!). The third line is an identifier associated with the quality score. The fourth line gives the quality score associated with each base in the read, using an ASCII code. The next four lines correspond to the second read, and so on. Further details can be obtained from the Wikipedia FASTQ Format page.

Fortunately, Unix was designed from the outset with processing of large text files in mind. This makes it possible to do quite a lot of processing and analysis with standard Unix command-line tools. The number of lines in the (uncompressed) file can be obtained with

zcat myreads.fastq.gz | wc -l 

Obviously then, dividing the result by 4 will give the number of reads in the file. You can browse through the file using

zless myreads.fastq.gz 

Use space to page-down, “b” to go back a page, and “q” to quit. Use man zless for more options. For example “/CAGGTT” will find the next occurrence of “CAGGTT” in the file. Any regular expression can be given after the slash, so, “/^CAGTT” will find the next read which starts with CAGTT. Regular expressions can be generally very useful in the analysis of sequence data, and we will return to them later.

Due to the very large file sizes involved in NGS data analysis, it can often be desirable to create a file containing a (relatively) small number of reads for initial testing and debugging of analysis pipelines. Again, this is easy to do using a command like the following

zcat myreads.fastq.gz | head -400000 | gzip > Test100k.fastq.gz 

which takes the first 100k reads from “myreads” and stores them in “Test100k”, without ever storing any uncompressed reads on disk. An example Test100k.fastq.gz is available for downloading, so that readers can experiment for themselves with this kind of data. Note that in Unix, these commands which stream data from one command to another do so without requiring large amounts of RAM. All of these simple Unix filtering tools can be run without any problems on, say, a laptop with a modest amount of RAM (2GB will be fine) running Ubuntu Linux (or similar), so long as you have plenty of free disk space.

Splitting and joining FASTQ files

If you do not have access to a large RAM machine, it may be desirable to split up a very large set of reads into a collection of files, each of which contains a more manageable set of reads. Again, this can be accomplished with standard Unix commands. For example:

zcat myreads.fastq.gz | split -d -l 2000000 - Block 

will create files Block01, Block02, etc., each containing 0.5M reads. However, these files will not be compressed (so make sure you have plenty of disk space before running this!), and do not have the correct file extension. Both of those issues can be fixed with some more standard Unix commands. The following assumes that you are using a Bash shell, but similar things work in other shells:

for name in Block??
 mv ${name} ${name}.fastq
 gzip ${name}.fastq

This will result in the compressed FASTQ files Block01.fastq.gz, Block02.fastq.gz, etc. The idea now is that each of these files is processed one at a time (or in parallel, perhaps on a cluster) using some tool, and then the results combined in some sensible way later. The precise details will depend a great deal on the nature of the experiment which led to the data.

If necessary, the files can be recombined at a later date with a command like:

zcat Block*.fastq.gz | gzip > allreads.fastq.gz 

without storing uncompressed files on disk.

Filtering FASTQ files

It will very often be desirable to filter the data in a less arbitrary way – selecting reads matching particular criteria for further analysis. Again, this sort of text processing is the kind of thing that is very easy in Unix. Traditionally, tools such as grep, sed and awk were used for this purpose. However, in the early 1990s, many people (myself included) migrated from awk to perl, as it was a full-blown programming language, with many more features than the simple tools. Perl is still a very effective language for this kind of activity, and now the BioPerl project provides a large range of modules specifically targeting biological applications, with an emphasis on sequence analysis. That said, many people find the perl language to be rather ugly, leading to code that is difficult to read and maintain. So in the late 1990s I, like many others, switched from perl to python for text processing and related activities. Python is a great general purpose programming language. It is simple, elegant and easy to learn, and code is easy to read and maintain. Analogous to perl, there is an associated Biopython project, containing modules for sequence analysis and other biological applications. At some point in the future I may write a post on using Biopython for the analysis of FASTQ data, but in the meantime the on-line resources and books such as Python for Bioinformatics provide further information for the interested reader.

Fortunately the FASTQ format is sufficiently simple that many if not most basic filtering and analysis tasks can be accomplished with the default python installation found on the vast majority of Unix systems. Below is a complete python program to filter a FASTQ file, selecting just the reads matching a given regular expression.

#!/usr/bin/env python
# filter.py
# Filter a fastq file according to a given regex to match to read sequence
# Copyright (C) Darren J Wilkinson, November 2010, http://tinyurl.com/darrenjw

import sys,re

def readread(s):
	return [s.readline(),s.readline(),s.readline(),s.readline()]

def writeread(sread,s):
	for i in range(4):

def readfilter(query,s,o):
	while (sread[0]):
	    if (query.search(sread[1])):

if __name__=='__main__':
	if (len(sys.argv)!=2):
	    print "Usage: python filter.py <regex> < infile.fastq > outfile.fastq"
# eof

So, for example,

zcat Test100k.fastq.gz | python filter.py TCGATTT | gzip > filtered.fastq.gz 

will select out all reads containing the string TCGATTT, and

zcat Test100k.fastq.gz | python filter.py ^TCGAT | gzip > filtered.fastq.gz

will select out all reads which start with the string TCGAT. Similarly, doing

zcat Test100k.fastq.gz | python filter.py ^TCGAT | wc -l 

and dividing the result by 4 will give the number of reads starting with TCGAT. People familiar with Unix can think of this python script as a version of grep for FASTQ files.

Of course the above python code could be extended to do more sophisticated analysis of the read data, but in that case it will make sense to make use of the Biopython libraries. An alternative is to use R and Bioconductor for the analysis, making use of their huge array of statistical analysis and visualisation tools. I will give a quick introduction to the use of the Bioconductor “ShortRead” package for processing FASTQ data in the next post.

(Yet another) introduction to R and Bioconductor


Data management, analysis and visualisation are becoming increasingly important in bioscience research, but also increasingly difficult, as bioscience data increases in volume and complexity. The R programming language is a general and flexible computing environment widely used by statisticians for working with data. The Bioconductor project provides a powerful and flexible set of tools for working with bioscience data that is built on top of the R language. Bioconductor provides packages for working with sequence data, microarray data, flow cytometry data, next generation sequencing data, etc. Together, R and Bioconductor provide an excellent environment for bioscientists to analyse, summarise and visualise quantitative data. It is also ideal for carrying out sophisticated statistical analyses of data, and for producing publication quality figures, plots and graphs.

Installing R and Bioconductor

R can be downloaded from the Comprehensive R Archive Network, generally known as CRAN. It is available for Windows, Macs, and most flavours of Unix/Linux. On most Linux systems, R is available as a standard part of the distribution, and so it is usually simplest to install directly using the appropriate package manager for the particular distribution. eg. on Ubuntu, typing sudo apt-get install r-base r-recommended should do it. Once installed, additional packages can be installed by running R and typing the appropriate command at the R prompt. eg. typing:


will install a nice clustering package, and


will install packages for working with XML data, and for interfacing with twitter. This method can be used for installing any package available from CRAN. However, Bioconductor has its own repository that is separate from CRAN, and so Bioconductor and Bioconductor packages must be installed using a different method. Even once installed, packages need to be loaded in order to be used. So a command such as library(flexclust) is needed to load the flexclust library ready for use.

To install Bioconductor, enter the following commands at the R prompt:


This will install the Bioconductor core and some recommended packages. As Bioconductor is quite large, this will most likely take some time. To install additional packages, use the following:


The above will install packages for working with and visualising flow cytometry data. Again, multiple packages can be installed with a command like biocLite(c("Heatplus","biomaRt")). Most Bioconductor packages have an associated vignette which gives a brief overview of the package and a tutorial on how it should be used. The openVignette function can be used to see these. eg.:


Then select the vignette you would like to view. Both CRAN and Bioconductor packages manage dependencies sensibly, so if you install a package which depends on others, those dependencies will automatically install, too.

Basic R concepts

Many R functions work on lists of numbers, stored in an object called a vector. eg. the command rep(1,10) will return a vector consisting of the number 1 repeated 10 times. To get help on a function, use the ? operator. So, to get help on the rep function, type ?rep. Type the following commands one at a time at the R prompt to get a feeling for how to work with vectors:


Note that as c is the combine function in R, it is best not to use it as a variable name.

Basic plotting

Group measurements

First pretend we have some measurements on something (perhaps cell volume), so let’s just simulate some (R is also very good at stochastic simulation). We’ll simulate 30 control observations and 30 treatment observations. The treatment observations will be a bit larger, but also more variable.


We can get some basic stats about our data with


and a comparison of the distributions with


It can also be useful to stack the data:

boxplot(values ~ ind,data=mydata)  

as it is often more convenient to use the stacked data for statistical analysis:

fit=lm(values ~ ind,data=mydata)  

Plots can be customised to produce very nice graphs. For example, the above boxplots can be easily tidied up a little by passing in some extra parameters.

                                       main="Boxplots for the control and treatment")  

A histogram of a vector of values can be obtained with a command like hist(treatment).

Time course measurements

Suppose now that we have some time course measurements, again on a control and a treatment group. Again, lets just simulate:


A reasonable plot of these can be produced with the following commands:

plot(times,treatment,ylim=c(minval,maxval),pch=19,col=2,xlab="Time (mins)",
                       ylab="Measurement",main="Time course measurements")  

Start by entering each command one at a time, and use the help function (?) to figure out exactly what each command is doing. Regression lines can be added with the following commands.


Try demo(graphics) to get a better feel for some of the things that are possible.

Data import

Obviously to use R for real, you need to be able to read in data from other sources. The simplest way to get data into R is to read it in from a text file containing data in a tabular row/column format. See help on read.table for further information. Also see read.csv, scan and source. The easiest way to get data in from a spreadsheet application like MS Excel is to save as a tab-delimited file and read into R with read.table. Note however, that there are other possibilities for reading data, including some direct support for Excel files, and direct connections to databases (such as mySQL and Postgres) and various internet database resources.



Bioconductor can be used for all manner of bioscience data management and analysis tasks. In particular, it can be used to access genome databases and work with biological sequence data, as the following session illustrates:


Again, use the available help facilities to figure out what is going on.

Flow cytometry data

I’ll finish this post with an example of using R to analyse a large and complex data set from a flow cytometer. Flow cytometers such as the Partec CyFlow save data by default in a binary format with a file extension of .FCS. This data format is understood by the BioConductor flowCore package (on which the flowViz package depends). An example of reading and processing some data (on Bacillus subtilis cells containing a GFP reporter) follows. For this, you need to download the file 5-ac-t1-s1.FCS and place it in the current R working directory, or change the working directory to where you have put it, using setwd.


As I work a lot with GFP labelled samples, I find it convenient to define a function to plot the information I’m most interested in:


And finally, here is a function that will produce such a set of plots for all .FCS files in a directory and write the output to a PostScript file. Just call it with gfpplots().

  for (i in 1:length(xset)) {  


This has been a very brief introduction to some of the possibilities of using R and Bioconductor, but it has covered only a tiny fraction. Fortunately there is a huge amount of documentation for R and Bioconductor available on-line, especially from the CRAN and Bioconductor web sites. However, there is lots of other stuff, some of which is a bit more user friendly, which can be easily uncovered with a bit of googling. For example, here is a nice manual from UC Riverside. Finally, as was illustrated in the section on flow cytometry data, much of the power of R comes from the ability to define your own functions, utilising the fact that R is a full-blown programming language in its own right. In a previous post I gave a quick introduction to some basic R programming concepts. For those who prefer real books, I quite like Gentleman’s R Programming for Bioinformatics , and the Bioinformatics Solutions book also contains a few useful chapters, though it is starting become a little dated.