A quick introduction to the Bioconductor ShortRead package for the analysis of NGS data

Introduction and background

In the previous post I gave an introduction to next generation sequencing (NGS) data, and in particular, working with FASTQ format files in a Unix environment. In another post I gave an introduction to R and Bioconductor for the statistical analysis of biological data. In this post I will give a quick introduction to the use of the Bioconductor package ShortRead for reading and processing FASTQ data using R/Bioconductor. This post assumes a working R/Bioconductor installation, and a basic familiarity with FASTQ files. See the previous posts for further details.

Getting started

Assuming that Bioconductor is installed, entering the following at the R command prompt will install the ShortRead package and any missing dependencies:

source("http://bioconductor.org/biocLite.R")
biocLite("ShortRead")

The library can be loaded using

library(ShortRead)

Basic information about the package can be obtained with help(package="ShortRead"), and it is also worth getting help for the Biostrings package on which the ShortRead package depends. There is also a vignette for the ShortRead package, and several vignettes associated with the Biostrings package, all of which can be accessed using openVignette().

Loading data

Assuming that the file Test100k.fastq.gz discussed in the previous post is available and in the current directory, it can be loaded with:

reads=readFastq("Test100k.fastq.gz")

Note that the FASTQ file does not need to be unzipped. Typing

reads

gives some basic information, such as the number of reads, and tells you that the reads object is of type ShortReadQ. Information about this class can be obtained via ?ShortReadQ. If the main interest is in the actual sequences, these can be extracted using

sequences=sread(reads)

Entering sequences will show the first few and last few sequences, and inform you that the object is of class DNAStringSet. Again, ?DNAStringSet may be helpful. As with most large R objects (and large Unix text files), it can be helpful to just look at the first few rows. In R this can be done with head(sequences), analogously to the Unix head command.

Working with short read data

Individual reads can be accessed directly using usual R indexing, so that sequences[1] gives the sequence of the first read, etc. R string manipulation commands can also be used on a DNAStringSet, so that

substr(sequences,1,5)

will return a vector consisting of the first 5 bases from each read. Consequently,

table(substr(sequences,1,5))

will tabulate the occurrences of the inital pentamer associated with each read. This may or may not be interesting, depending on the data…

Filtering data

Filtering reads to obtain a subset of direct interest is a key task. Fortunately there is good support for doing this in the ShortRead package. Suppose now that we are only interested in sequences beginning with the pentamer TAGCT (perhaps because those reads correspond to a sample amplified with a particular PCR primer). We can trivially filter our set of reads using standard R subsetting techniques:

myreads=reads[substr(sequences,1,5)=="TAGCT"]
myreads

However, filtering is such an important concept that the ShortRead library includes its own special set of functions relating to filtering, which allow filtering on sequence quality as well as sequence content, and allow filters to be combined together to allow the building of new filters from old, etc. Here I will just explain how to do the above filtering operation using a ShortRead srFilter. The idea is to create filters of class SRFilter and then apply these to sets of reads (of class ShortReadQ) in order to filter. So, if you want to build a custom filter, this is done using the function srFilter. An example follows:

myfilter=srFilter(function(x){
        substr(sread(x),1,5)=="TAGCT"
        },name="My Filter")

The first argument to the srFilter function is a function taking a single argument, x, which is assumed to be a ShortReadQ object. The function should return a boolean vector, giving TRUE for all reads which are to pass through the filter. Once defined, typing myfilter will inform you that myfilter is an object of class SRFilter, and typing srFilter(myfilter) will give the definition of the filter.

Typing myfilter(reads) will apply the filter to the reads, giving a boolean vector. Therefore to obtain the filtered reads, use

myreads=reads[myfilter(reads)]

See ?srFilter and the ShortRead vignette for further details of how to build and combine filters.

Of course there is a bit of a problem with all of this post-hoc filtering when working with very large FASTQ files. The initial readFastq command reads the entire FASTQ file into RAM. If you don’t have enough RAM, the read will fail, and you won’t ever get a chance to filter things down to a smaller subset of reads to work with. So you really want to apply a filter as the data is read. Recent versions of the ShortRead package support this functionality, by allowing an optional filter argument to be passed in to the readFastq function.

myreads=readFastq("Test100k.fastq.gz",filter=myfilter)

This applies the filter at read time, so that as long as the filtered data will fit into the available RAM, the command will succeed. Note that it is desirable to run ShortRead on a 64bit machine with lots of RAM in order to be able to work with a reasonable number of reads at once. Even so, filtering at read time is still likely to be necessary. Attempting to read in a full lane of around 40M reads is likely to fail even on a machine with 16GB of RAM.

Matching and counting

Suppose that we are interested in looking for occurrences of the sequence “AGACGATCACCATTCGATGC” in the reads, somewhere between positions 22 and 44 (this is a real example). We could start by creating a DNAStringSet containing just the read sequences between positions 22 and 44 as

dict=DNAStringSet(substr(sequences,22,44))

We could then filter the set with a regular expression requiring an exact match. However, NGS methods are not perfect. They occasionally call a base wrongly, and occasionally mess up a read by missing a base, or by inserting an extra base into the read. If we exactly match, we are going to miss all of the reads which aren’t quite right, and as we are matching to quite a long sequence (20 bases), that may well correspond to a lot of reads. What we really want to do is allow an imperfect match, allowing, say 1 base to be wrong, or one indel. We can do that with the following:

hits=vcountPattern("AGACGATCACCATTCGATGC",dict,
                        max.mismatch=1,with.indels=TRUE)
sum(hits)

hits is a vector corresponding to the number of hits in each read (mainly zeros, with a few ones). By summing them we get the total number of reads which match, and the actual reads can be pulled out for further analysis with a command like

sread(reads[hits])

See ?vcountPattern for futher details, including details of related functions.

Further analysis

The basic commands and techniques covered in this brief post are enough to get started writing R functions for the analysis of short read data. It is easy to use the R programming language to write functions which count particular matching occurrences and events, and record them in vectors or matrices which can then be further analysed using other Bioconductor or R packages for statistical analysis, modelling or evaluation. As ever, further information can be found in the Bioconductor documentation and other on-line sources. I’ve also found Gentleman’s book R Programming for Bioinformatics to be a very useful source of information for working with sequence data in R.

Advertisement

Introduction to the processing of short read next generation sequencing data

Overview

Recent developments in DNA (and RNA) sequencing technology is transforming how modern biological research is done. High-throughput sequencing of DNA and RNA using next generation sequencing (NGS) machines is now a standard approach in most good biological research labs. However, these technologies generate huge amounts of data which is non-trivial to manipulate and analyse. In this post I’ll give a quick introduction to working with short read DNA sequence data stored in the FASTQ format, using basic Unix (Linux) command-line tools. In the next post, I will give an introduction to analysing FASTQ data with Bioconductor.

Working with data files

The first thing to be aware of when working with NGS data is that the data files involved can be really huge. For example, the data file(s) associated with the single lane of an NGS machine can often total over 5GB, even when stored in a compressed format. This can mean that they are not trivial to move around, and so thought needs to be put into things like how and where to download the files, where to store them, if/how to compress them, etc. Typically, most people will be having the actual sequencing done by a third-party, and the third-party will make the sequence data available for downloading via a secure web site, or similar. As the files are large, it makes sense to download from a machine with a good internet connection. It is also best to download direct to a Unix machine if possible, as Unix copes better with large files, and the packing and compression formats typically used (eg. tar, gzip, bzip) are usually better supported by default on Unix platforms. Also note that the FAT, FAT32 and VFAT file systems often used on (older) Windows platforms can not store very large files (over 4GB). This can sometimes be an issue for Unix users too, as most USB flash drives and some external (USB) hard disks will be VFAT/FAT32 formatted by default. Note that on Unix systems, the commands scp and rsync are useful ways to copy files from one machine to another over a (hopefully fast!) network.

The downloaded files will typically be in gzip (.gz) or gzipped-tar (.tar.gz or .tgz) format. If they are bundled together in a gzipped-tar, it probably makes sense to unpack them into individual files, which can be compressed, and then moved around individually. Use tar xvfz bundle.tar.gz to unpack a bundle of files, but make sure you have plenty of free disk space first (with df -h .). Remember that on Unix you can get basic help on most command-line tools by typing man commandname (where commandname is the name of the command you want information on). Type man man for help on the man command… Remember that compressing and uncompressing large files can take a long time, so be patient, and think carefully before you hit “Return”. Individual files should all be stored compressed. For example, gzip myexp_lane1.fastq will result in the file myexp_lane1.fastq.gz, which will be much smaller than the original, and can be analysed without ever storing the uncompressed version on disk. Use ls -lh to see the files and their sizes in a particular directory. Note that there are a variety of data formats associated with NGS data. Most are text formats which can be inspected using Unix commands such as head, tail and less. Files ending .fas contain the actual reads, files ending .qual contain the “quality” scores associated with a set of reads, but files ending .fastq are in the FASTQ format, which contains both the reads and the quality scores together in a single file. Most tools for working with short read data will work with the FASTQ format, and so that is what we will concentrate on for the rest of this quick introduction.

Working with FASTQ files

FASTQ files are a text-based file format, with 4 lines of text corresponding to each read. Assuming a gzipped file called myreads.fastq.gz, the first few lines can be inspected with zcat myreads.fastq.gz | head. Typical results will look roughly like the following:

@NG-5232_4_1_1022_17823#0/1
NACTCCGGTGTCGGTCTCGTAGGCCATTTTAGAAGCGAATAAATCGATGNATTCGANCNCNNNNNNNNATCGNNAGAGCTCGTANGCCGTCTTCTGCTTGANNNNNNN
+NG-5232_4_1_1022_17823#0/1
#'''')(++)AAAAAAAAAA########################################################################################
@NG-5232_4_1_1025_18503#0/1
NTCTACGGTGTCGGTCTCGTAGCCTATCGGGTAGCAGAGCTTATCGATGAATTCGAGCTCGGTTTCAGATTGGCAGAGCTCGTANTGCGGCCTTCGGCTGANNNNNNN
+NG-5232_4_1_1025_18503#0/1
############################################################################################################
@NG-5232_4_1_1026_21154#0/1
NGTTACGGTGTCGGTCTCGTAGTGAGTTGACCTCCGCCCAGTATCGATGAATTCGAGCTCGTTTTCAGATCGGAAGAGCTCGTCNGCCGTCTTCTGCTTGANNNNNNN

The first line is an identifier associated with the first read. The second line is the read itself (usually the thing of greatest interest!). The third line is an identifier associated with the quality score. The fourth line gives the quality score associated with each base in the read, using an ASCII code. The next four lines correspond to the second read, and so on. Further details can be obtained from the Wikipedia FASTQ Format page.

Fortunately, Unix was designed from the outset with processing of large text files in mind. This makes it possible to do quite a lot of processing and analysis with standard Unix command-line tools. The number of lines in the (uncompressed) file can be obtained with

zcat myreads.fastq.gz | wc -l 

Obviously then, dividing the result by 4 will give the number of reads in the file. You can browse through the file using

 
zless myreads.fastq.gz 

Use space to page-down, “b” to go back a page, and “q” to quit. Use man zless for more options. For example “/CAGGTT” will find the next occurrence of “CAGGTT” in the file. Any regular expression can be given after the slash, so, “/^CAGTT” will find the next read which starts with CAGTT. Regular expressions can be generally very useful in the analysis of sequence data, and we will return to them later.

Due to the very large file sizes involved in NGS data analysis, it can often be desirable to create a file containing a (relatively) small number of reads for initial testing and debugging of analysis pipelines. Again, this is easy to do using a command like the following

zcat myreads.fastq.gz | head -400000 | gzip > Test100k.fastq.gz 

which takes the first 100k reads from “myreads” and stores them in “Test100k”, without ever storing any uncompressed reads on disk. An example Test100k.fastq.gz is available for downloading, so that readers can experiment for themselves with this kind of data. Note that in Unix, these commands which stream data from one command to another do so without requiring large amounts of RAM. All of these simple Unix filtering tools can be run without any problems on, say, a laptop with a modest amount of RAM (2GB will be fine) running Ubuntu Linux (or similar), so long as you have plenty of free disk space.

Splitting and joining FASTQ files

If you do not have access to a large RAM machine, it may be desirable to split up a very large set of reads into a collection of files, each of which contains a more manageable set of reads. Again, this can be accomplished with standard Unix commands. For example:

zcat myreads.fastq.gz | split -d -l 2000000 - Block 

will create files Block01, Block02, etc., each containing 0.5M reads. However, these files will not be compressed (so make sure you have plenty of disk space before running this!), and do not have the correct file extension. Both of those issues can be fixed with some more standard Unix commands. The following assumes that you are using a Bash shell, but similar things work in other shells:

for name in Block??
do
 mv ${name} ${name}.fastq
 gzip ${name}.fastq
done

This will result in the compressed FASTQ files Block01.fastq.gz, Block02.fastq.gz, etc. The idea now is that each of these files is processed one at a time (or in parallel, perhaps on a cluster) using some tool, and then the results combined in some sensible way later. The precise details will depend a great deal on the nature of the experiment which led to the data.

If necessary, the files can be recombined at a later date with a command like:

zcat Block*.fastq.gz | gzip > allreads.fastq.gz 

without storing uncompressed files on disk.

Filtering FASTQ files

It will very often be desirable to filter the data in a less arbitrary way – selecting reads matching particular criteria for further analysis. Again, this sort of text processing is the kind of thing that is very easy in Unix. Traditionally, tools such as grep, sed and awk were used for this purpose. However, in the early 1990s, many people (myself included) migrated from awk to perl, as it was a full-blown programming language, with many more features than the simple tools. Perl is still a very effective language for this kind of activity, and now the BioPerl project provides a large range of modules specifically targeting biological applications, with an emphasis on sequence analysis. That said, many people find the perl language to be rather ugly, leading to code that is difficult to read and maintain. So in the late 1990s I, like many others, switched from perl to python for text processing and related activities. Python is a great general purpose programming language. It is simple, elegant and easy to learn, and code is easy to read and maintain. Analogous to perl, there is an associated Biopython project, containing modules for sequence analysis and other biological applications. At some point in the future I may write a post on using Biopython for the analysis of FASTQ data, but in the meantime the on-line resources and books such as Python for Bioinformatics provide further information for the interested reader.

Fortunately the FASTQ format is sufficiently simple that many if not most basic filtering and analysis tasks can be accomplished with the default python installation found on the vast majority of Unix systems. Below is a complete python program to filter a FASTQ file, selecting just the reads matching a given regular expression.

#!/usr/bin/env python
# filter.py
# Filter a fastq file according to a given regex to match to read sequence
# Copyright (C) Darren J Wilkinson, November 2010, http://tinyurl.com/darrenjw

import sys,re

def readread(s):
	return [s.readline(),s.readline(),s.readline(),s.readline()]

def writeread(sread,s):
	for i in range(4):
		s.write(sread[i])

def readfilter(query,s,o):
	sread=readread(s)
	while (sread[0]):
	    if (query.search(sread[1])):
	        writeread(sread,o)
	    sread=readread(s)

if __name__=='__main__':
	if (len(sys.argv)!=2):
	    print "Usage: python filter.py <regex> < infile.fastq > outfile.fastq"
	    exit(1)
	regex=re.compile(sys.argv[1])
	readfilter(regex,sys.stdin,sys.stdout)
	
# eof

So, for example,

zcat Test100k.fastq.gz | python filter.py TCGATTT | gzip > filtered.fastq.gz 

will select out all reads containing the string TCGATTT, and

zcat Test100k.fastq.gz | python filter.py ^TCGAT | gzip > filtered.fastq.gz

will select out all reads which start with the string TCGAT. Similarly, doing

zcat Test100k.fastq.gz | python filter.py ^TCGAT | wc -l 

and dividing the result by 4 will give the number of reads starting with TCGAT. People familiar with Unix can think of this python script as a version of grep for FASTQ files.

Of course the above python code could be extended to do more sophisticated analysis of the read data, but in that case it will make sense to make use of the Biopython libraries. An alternative is to use R and Bioconductor for the analysis, making use of their huge array of statistical analysis and visualisation tools. I will give a quick introduction to the use of the Bioconductor “ShortRead” package for processing FASTQ data in the next post.