Introduction to the processing of short read next generation sequencing data

Overview

Recent developments in DNA (and RNA) sequencing technology is transforming how modern biological research is done. High-throughput sequencing of DNA and RNA using next generation sequencing (NGS) machines is now a standard approach in most good biological research labs. However, these technologies generate huge amounts of data which is non-trivial to manipulate and analyse. In this post I’ll give a quick introduction to working with short read DNA sequence data stored in the FASTQ format, using basic Unix (Linux) command-line tools. In the next post, I will give an introduction to analysing FASTQ data with Bioconductor.

Working with data files

The first thing to be aware of when working with NGS data is that the data files involved can be really huge. For example, the data file(s) associated with the single lane of an NGS machine can often total over 5GB, even when stored in a compressed format. This can mean that they are not trivial to move around, and so thought needs to be put into things like how and where to download the files, where to store them, if/how to compress them, etc. Typically, most people will be having the actual sequencing done by a third-party, and the third-party will make the sequence data available for downloading via a secure web site, or similar. As the files are large, it makes sense to download from a machine with a good internet connection. It is also best to download direct to a Unix machine if possible, as Unix copes better with large files, and the packing and compression formats typically used (eg. tar, gzip, bzip) are usually better supported by default on Unix platforms. Also note that the FAT, FAT32 and VFAT file systems often used on (older) Windows platforms can not store very large files (over 4GB). This can sometimes be an issue for Unix users too, as most USB flash drives and some external (USB) hard disks will be VFAT/FAT32 formatted by default. Note that on Unix systems, the commands scp and rsync are useful ways to copy files from one machine to another over a (hopefully fast!) network.

The downloaded files will typically be in gzip (.gz) or gzipped-tar (.tar.gz or .tgz) format. If they are bundled together in a gzipped-tar, it probably makes sense to unpack them into individual files, which can be compressed, and then moved around individually. Use tar xvfz bundle.tar.gz to unpack a bundle of files, but make sure you have plenty of free disk space first (with df -h .). Remember that on Unix you can get basic help on most command-line tools by typing man commandname (where commandname is the name of the command you want information on). Type man man for help on the man command… Remember that compressing and uncompressing large files can take a long time, so be patient, and think carefully before you hit “Return”. Individual files should all be stored compressed. For example, gzip myexp_lane1.fastq will result in the file myexp_lane1.fastq.gz, which will be much smaller than the original, and can be analysed without ever storing the uncompressed version on disk. Use ls -lh to see the files and their sizes in a particular directory. Note that there are a variety of data formats associated with NGS data. Most are text formats which can be inspected using Unix commands such as head, tail and less. Files ending .fas contain the actual reads, files ending .qual contain the “quality” scores associated with a set of reads, but files ending .fastq are in the FASTQ format, which contains both the reads and the quality scores together in a single file. Most tools for working with short read data will work with the FASTQ format, and so that is what we will concentrate on for the rest of this quick introduction.

Working with FASTQ files

FASTQ files are a text-based file format, with 4 lines of text corresponding to each read. Assuming a gzipped file called myreads.fastq.gz, the first few lines can be inspected with zcat myreads.fastq.gz | head. Typical results will look roughly like the following:

@NG-5232_4_1_1022_17823#0/1
NACTCCGGTGTCGGTCTCGTAGGCCATTTTAGAAGCGAATAAATCGATGNATTCGANCNCNNNNNNNNATCGNNAGAGCTCGTANGCCGTCTTCTGCTTGANNNNNNN
+NG-5232_4_1_1022_17823#0/1
#'''')(++)AAAAAAAAAA########################################################################################
@NG-5232_4_1_1025_18503#0/1
NTCTACGGTGTCGGTCTCGTAGCCTATCGGGTAGCAGAGCTTATCGATGAATTCGAGCTCGGTTTCAGATTGGCAGAGCTCGTANTGCGGCCTTCGGCTGANNNNNNN
+NG-5232_4_1_1025_18503#0/1
############################################################################################################
@NG-5232_4_1_1026_21154#0/1
NGTTACGGTGTCGGTCTCGTAGTGAGTTGACCTCCGCCCAGTATCGATGAATTCGAGCTCGTTTTCAGATCGGAAGAGCTCGTCNGCCGTCTTCTGCTTGANNNNNNN

The first line is an identifier associated with the first read. The second line is the read itself (usually the thing of greatest interest!). The third line is an identifier associated with the quality score. The fourth line gives the quality score associated with each base in the read, using an ASCII code. The next four lines correspond to the second read, and so on. Further details can be obtained from the Wikipedia FASTQ Format page.

Fortunately, Unix was designed from the outset with processing of large text files in mind. This makes it possible to do quite a lot of processing and analysis with standard Unix command-line tools. The number of lines in the (uncompressed) file can be obtained with

zcat myreads.fastq.gz | wc -l 

Obviously then, dividing the result by 4 will give the number of reads in the file. You can browse through the file using

 
zless myreads.fastq.gz 

Use space to page-down, “b” to go back a page, and “q” to quit. Use man zless for more options. For example “/CAGGTT” will find the next occurrence of “CAGGTT” in the file. Any regular expression can be given after the slash, so, “/^CAGTT” will find the next read which starts with CAGTT. Regular expressions can be generally very useful in the analysis of sequence data, and we will return to them later.

Due to the very large file sizes involved in NGS data analysis, it can often be desirable to create a file containing a (relatively) small number of reads for initial testing and debugging of analysis pipelines. Again, this is easy to do using a command like the following

zcat myreads.fastq.gz | head -400000 | gzip > Test100k.fastq.gz 

which takes the first 100k reads from “myreads” and stores them in “Test100k”, without ever storing any uncompressed reads on disk. An example Test100k.fastq.gz is available for downloading, so that readers can experiment for themselves with this kind of data. Note that in Unix, these commands which stream data from one command to another do so without requiring large amounts of RAM. All of these simple Unix filtering tools can be run without any problems on, say, a laptop with a modest amount of RAM (2GB will be fine) running Ubuntu Linux (or similar), so long as you have plenty of free disk space.

Splitting and joining FASTQ files

If you do not have access to a large RAM machine, it may be desirable to split up a very large set of reads into a collection of files, each of which contains a more manageable set of reads. Again, this can be accomplished with standard Unix commands. For example:

zcat myreads.fastq.gz | split -d -l 2000000 - Block 

will create files Block01, Block02, etc., each containing 0.5M reads. However, these files will not be compressed (so make sure you have plenty of disk space before running this!), and do not have the correct file extension. Both of those issues can be fixed with some more standard Unix commands. The following assumes that you are using a Bash shell, but similar things work in other shells:

for name in Block??
do
 mv ${name} ${name}.fastq
 gzip ${name}.fastq
done

This will result in the compressed FASTQ files Block01.fastq.gz, Block02.fastq.gz, etc. The idea now is that each of these files is processed one at a time (or in parallel, perhaps on a cluster) using some tool, and then the results combined in some sensible way later. The precise details will depend a great deal on the nature of the experiment which led to the data.

If necessary, the files can be recombined at a later date with a command like:

zcat Block*.fastq.gz | gzip > allreads.fastq.gz 

without storing uncompressed files on disk.

Filtering FASTQ files

It will very often be desirable to filter the data in a less arbitrary way – selecting reads matching particular criteria for further analysis. Again, this sort of text processing is the kind of thing that is very easy in Unix. Traditionally, tools such as grep, sed and awk were used for this purpose. However, in the early 1990s, many people (myself included) migrated from awk to perl, as it was a full-blown programming language, with many more features than the simple tools. Perl is still a very effective language for this kind of activity, and now the BioPerl project provides a large range of modules specifically targeting biological applications, with an emphasis on sequence analysis. That said, many people find the perl language to be rather ugly, leading to code that is difficult to read and maintain. So in the late 1990s I, like many others, switched from perl to python for text processing and related activities. Python is a great general purpose programming language. It is simple, elegant and easy to learn, and code is easy to read and maintain. Analogous to perl, there is an associated Biopython project, containing modules for sequence analysis and other biological applications. At some point in the future I may write a post on using Biopython for the analysis of FASTQ data, but in the meantime the on-line resources and books such as Python for Bioinformatics provide further information for the interested reader.

Fortunately the FASTQ format is sufficiently simple that many if not most basic filtering and analysis tasks can be accomplished with the default python installation found on the vast majority of Unix systems. Below is a complete python program to filter a FASTQ file, selecting just the reads matching a given regular expression.

#!/usr/bin/env python
# filter.py
# Filter a fastq file according to a given regex to match to read sequence
# Copyright (C) Darren J Wilkinson, November 2010, http://tinyurl.com/darrenjw

import sys,re

def readread(s):
	return [s.readline(),s.readline(),s.readline(),s.readline()]

def writeread(sread,s):
	for i in range(4):
		s.write(sread[i])

def readfilter(query,s,o):
	sread=readread(s)
	while (sread[0]):
	    if (query.search(sread[1])):
	        writeread(sread,o)
	    sread=readread(s)

if __name__=='__main__':
	if (len(sys.argv)!=2):
	    print "Usage: python filter.py <regex> < infile.fastq > outfile.fastq"
	    exit(1)
	regex=re.compile(sys.argv[1])
	readfilter(regex,sys.stdin,sys.stdout)
	
# eof

So, for example,

zcat Test100k.fastq.gz | python filter.py TCGATTT | gzip > filtered.fastq.gz 

will select out all reads containing the string TCGATTT, and

zcat Test100k.fastq.gz | python filter.py ^TCGAT | gzip > filtered.fastq.gz

will select out all reads which start with the string TCGAT. Similarly, doing

zcat Test100k.fastq.gz | python filter.py ^TCGAT | wc -l 

and dividing the result by 4 will give the number of reads starting with TCGAT. People familiar with Unix can think of this python script as a version of grep for FASTQ files.

Of course the above python code could be extended to do more sophisticated analysis of the read data, but in that case it will make sense to make use of the Biopython libraries. An alternative is to use R and Bioconductor for the analysis, making use of their huge array of statistical analysis and visualisation tools. I will give a quick introduction to the use of the Bioconductor “ShortRead” package for processing FASTQ data in the next post.

Published by

darrenjw

I am Professor of Statistics within the Department of Mathematical Sciences at Durham University, UK. I am an Bayesian statistician interested in computation and applications, especially to engineering and the life sciences.

8 thoughts on “Introduction to the processing of short read next generation sequencing data”

  1. Thanks for the great explanation. Your tutorial is easy to read and the explanations are very helpful to understanding what you are doing.

  2. Thank you so much for such a great tutorial. I would really appreciate a little more help. I am newbie to bioinformatics and i am a computer engineer. I am currently working on a bioinformatics project which is to implement – “ECHO – a reference free short- read error correction algorithm” on java and cloud.

    I am at initial stage and trying to understand base-calls and in-del errors in short-read. I tried to read a few research papers and understand it however found it too biology specific.

    I would really appreciate if you could point me to right direction about the resources I should refer to.

    Thanks and regards

    Ravi

Leave a comment

This site uses Akismet to reduce spam. Learn how your comment data is processed.