A quick introduction to the Bioconductor ShortRead package for the analysis of NGS data

Introduction and background

In the previous post I gave an introduction to next generation sequencing (NGS) data, and in particular, working with FASTQ format files in a Unix environment. In another post I gave an introduction to R and Bioconductor for the statistical analysis of biological data. In this post I will give a quick introduction to the use of the Bioconductor package ShortRead for reading and processing FASTQ data using R/Bioconductor. This post assumes a working R/Bioconductor installation, and a basic familiarity with FASTQ files. See the previous posts for further details.

Getting started

Assuming that Bioconductor is installed, entering the following at the R command prompt will install the ShortRead package and any missing dependencies:

source("http://bioconductor.org/biocLite.R")
biocLite("ShortRead")

The library can be loaded using

library(ShortRead)

Basic information about the package can be obtained with help(package="ShortRead"), and it is also worth getting help for the Biostrings package on which the ShortRead package depends. There is also a vignette for the ShortRead package, and several vignettes associated with the Biostrings package, all of which can be accessed using openVignette().

Loading data

Assuming that the file Test100k.fastq.gz discussed in the previous post is available and in the current directory, it can be loaded with:

reads=readFastq("Test100k.fastq.gz")

Note that the FASTQ file does not need to be unzipped. Typing

reads

gives some basic information, such as the number of reads, and tells you that the reads object is of type ShortReadQ. Information about this class can be obtained via ?ShortReadQ. If the main interest is in the actual sequences, these can be extracted using

sequences=sread(reads)

Entering sequences will show the first few and last few sequences, and inform you that the object is of class DNAStringSet. Again, ?DNAStringSet may be helpful. As with most large R objects (and large Unix text files), it can be helpful to just look at the first few rows. In R this can be done with head(sequences), analogously to the Unix head command.

Working with short read data

Individual reads can be accessed directly using usual R indexing, so that sequences[1] gives the sequence of the first read, etc. R string manipulation commands can also be used on a DNAStringSet, so that

substr(sequences,1,5)

will return a vector consisting of the first 5 bases from each read. Consequently,

table(substr(sequences,1,5))

will tabulate the occurrences of the inital pentamer associated with each read. This may or may not be interesting, depending on the data…

Filtering data

Filtering reads to obtain a subset of direct interest is a key task. Fortunately there is good support for doing this in the ShortRead package. Suppose now that we are only interested in sequences beginning with the pentamer TAGCT (perhaps because those reads correspond to a sample amplified with a particular PCR primer). We can trivially filter our set of reads using standard R subsetting techniques:

myreads=reads[substr(sequences,1,5)=="TAGCT"]
myreads

However, filtering is such an important concept that the ShortRead library includes its own special set of functions relating to filtering, which allow filtering on sequence quality as well as sequence content, and allow filters to be combined together to allow the building of new filters from old, etc. Here I will just explain how to do the above filtering operation using a ShortRead srFilter. The idea is to create filters of class SRFilter and then apply these to sets of reads (of class ShortReadQ) in order to filter. So, if you want to build a custom filter, this is done using the function srFilter. An example follows:

myfilter=srFilter(function(x){
        substr(sread(x),1,5)=="TAGCT"
        },name="My Filter")

The first argument to the srFilter function is a function taking a single argument, x, which is assumed to be a ShortReadQ object. The function should return a boolean vector, giving TRUE for all reads which are to pass through the filter. Once defined, typing myfilter will inform you that myfilter is an object of class SRFilter, and typing srFilter(myfilter) will give the definition of the filter.

Typing myfilter(reads) will apply the filter to the reads, giving a boolean vector. Therefore to obtain the filtered reads, use

myreads=reads[myfilter(reads)]

See ?srFilter and the ShortRead vignette for further details of how to build and combine filters.

Of course there is a bit of a problem with all of this post-hoc filtering when working with very large FASTQ files. The initial readFastq command reads the entire FASTQ file into RAM. If you don’t have enough RAM, the read will fail, and you won’t ever get a chance to filter things down to a smaller subset of reads to work with. So you really want to apply a filter as the data is read. Recent versions of the ShortRead package support this functionality, by allowing an optional filter argument to be passed in to the readFastq function.

myreads=readFastq("Test100k.fastq.gz",filter=myfilter)

This applies the filter at read time, so that as long as the filtered data will fit into the available RAM, the command will succeed. Note that it is desirable to run ShortRead on a 64bit machine with lots of RAM in order to be able to work with a reasonable number of reads at once. Even so, filtering at read time is still likely to be necessary. Attempting to read in a full lane of around 40M reads is likely to fail even on a machine with 16GB of RAM.

Matching and counting

Suppose that we are interested in looking for occurrences of the sequence “AGACGATCACCATTCGATGC” in the reads, somewhere between positions 22 and 44 (this is a real example). We could start by creating a DNAStringSet containing just the read sequences between positions 22 and 44 as

dict=DNAStringSet(substr(sequences,22,44))

We could then filter the set with a regular expression requiring an exact match. However, NGS methods are not perfect. They occasionally call a base wrongly, and occasionally mess up a read by missing a base, or by inserting an extra base into the read. If we exactly match, we are going to miss all of the reads which aren’t quite right, and as we are matching to quite a long sequence (20 bases), that may well correspond to a lot of reads. What we really want to do is allow an imperfect match, allowing, say 1 base to be wrong, or one indel. We can do that with the following:

hits=vcountPattern("AGACGATCACCATTCGATGC",dict,
                        max.mismatch=1,with.indels=TRUE)
sum(hits)

hits is a vector corresponding to the number of hits in each read (mainly zeros, with a few ones). By summing them we get the total number of reads which match, and the actual reads can be pulled out for further analysis with a command like

sread(reads[hits])

See ?vcountPattern for futher details, including details of related functions.

Further analysis

The basic commands and techniques covered in this brief post are enough to get started writing R functions for the analysis of short read data. It is easy to use the R programming language to write functions which count particular matching occurrences and events, and record them in vectors or matrices which can then be further analysed using other Bioconductor or R packages for statistical analysis, modelling or evaluation. As ever, further information can be found in the Bioconductor documentation and other on-line sources. I’ve also found Gentleman’s book R Programming for Bioinformatics to be a very useful source of information for working with sequence data in R.

Published by

darrenjw

I am Professor of Statistics within the Department of Mathematical Sciences at Durham University, UK. I am an Bayesian statistician interested in computation and applications, especially to engineering and the life sciences.

3 thoughts on “A quick introduction to the Bioconductor ShortRead package for the analysis of NGS data”

  1. Thanks so much for these blogs, they’re so useful – this page is so much more useful than the Bioconductor documentation itself, it’s really got me going in my analysis.

    There is just one thing I can’t get to work, which is the matching scripts:

    hits=vcountPattern(“AGACGATCACCATTCGATGC”,dict,max.mismatch=1,with.indels=TRUE)

    I’m searching a number of fastq files of Illumina sequences from a multiplex PCR amplifying a variable area. Ideally I’d like to sort these either by primer from one end, or by matching to one of the possible variable sequences.

    If I try this code for either strategy, sum(hits) tells me that a likely number of sequences have been detected (say just under a quarter of reads if I’m searching for reads starting with one primer out of four).

    However, if I output this with sread(reads[hits]) or write it to a file, then all of the resultant files are exactly the same. I know that this can’t be true, as if I separate out say one of the primer sets through a different method (more exact but less inclusive, no accounting for mismatches) I find a wide variety of diverse samples.

    Is there a reason why this should be the case? Am I executing this wrong? Or is there a similar way to achieve a search for sequences matching a given sequence, but allowing discrepancies?

  2. Well, the example above is almost complete. The issue is the Integer vector “hits”. To make this work in R, I had to convert it to a Logical vector fist e.g., using
    reads[as.logical(hits)] extracts the the correct sequences.

    Hope this helps

Leave a comment

This site uses Akismet to reduce spam. Learn how your comment data is processed.