(Yet another) introduction to R and Bioconductor

Introduction

Data management, analysis and visualisation are becoming increasingly important in bioscience research, but also increasingly difficult, as bioscience data increases in volume and complexity. The R programming language is a general and flexible computing environment widely used by statisticians for working with data. The Bioconductor project provides a powerful and flexible set of tools for working with bioscience data that is built on top of the R language. Bioconductor provides packages for working with sequence data, microarray data, flow cytometry data, next generation sequencing data, etc. Together, R and Bioconductor provide an excellent environment for bioscientists to analyse, summarise and visualise quantitative data. It is also ideal for carrying out sophisticated statistical analyses of data, and for producing publication quality figures, plots and graphs.

Installing R and Bioconductor

R can be downloaded from the Comprehensive R Archive Network, generally known as CRAN. It is available for Windows, Macs, and most flavours of Unix/Linux. On most Linux systems, R is available as a standard part of the distribution, and so it is usually simplest to install directly using the appropriate package manager for the particular distribution. eg. on Ubuntu, typing sudo apt-get install r-base r-recommended should do it. Once installed, additional packages can be installed by running R and typing the appropriate command at the R prompt. eg. typing:

  
install.packages("flexclust")  

will install a nice clustering package, and

  
install.packages(c("XML","twitteR"))  

will install packages for working with XML data, and for interfacing with twitter. This method can be used for installing any package available from CRAN. However, Bioconductor has its own repository that is separate from CRAN, and so Bioconductor and Bioconductor packages must be installed using a different method. Even once installed, packages need to be loaded in order to be used. So a command such as library(flexclust) is needed to load the flexclust library ready for use.

To install Bioconductor, enter the following commands at the R prompt:

  
source("http://bioconductor.org/biocLite.R")  
biocLite()  

This will install the Bioconductor core and some recommended packages. As Bioconductor is quite large, this will most likely take some time. To install additional packages, use the following:

  
source("http://bioconductor.org/biocLite.R")  
biocLite("flowViz")  

The above will install packages for working with and visualising flow cytometry data. Again, multiple packages can be installed with a command like biocLite(c("Heatplus","biomaRt")). Most Bioconductor packages have an associated vignette which gives a brief overview of the package and a tutorial on how it should be used. The openVignette function can be used to see these. eg.:

  
library(flowViz)  
openVignette()  

Then select the vignette you would like to view. Both CRAN and Bioconductor packages manage dependencies sensibly, so if you install a package which depends on others, those dependencies will automatically install, too.

Basic R concepts

Many R functions work on lists of numbers, stored in an object called a vector. eg. the command rep(1,10) will return a vector consisting of the number 1 repeated 10 times. To get help on a function, use the ? operator. So, to get help on the rep function, type ?rep. Type the following commands one at a time at the R prompt to get a feeling for how to work with vectors:

  
rep(1,10)  
?rep  
a=rep(1,10)  
a  
b=1:10  
b  
c(a,b)  
?c  
a+b  
a+2*b  
a/b  
c(1,2,3)  
d=c(3,5,3,6,8,5,4,6,7)  
d  
d[4]  
d[5]  
d[2:4]  
d<7  
d[d<7]  
sort(d)  
sort(sample(1:49,6))  

Note that as c is the combine function in R, it is best not to use it as a variable name.

Basic plotting

Group measurements

First pretend we have some measurements on something (perhaps cell volume), so let’s just simulate some (R is also very good at stochastic simulation). We’ll simulate 30 control observations and 30 treatment observations. The treatment observations will be a bit larger, but also more variable.

  
control=rnorm(30,5,2)^2  
treatment=rnorm(30,6.5,3)^2  
mydataframe=data.frame(control,treatment)  
mydataframe  
mydataframe$control
mydataframe[,2]

We can get some basic stats about our data with

  
summary(mydataframe)  

and a comparison of the distributions with

  
boxplot(mydataframe)  

It can also be useful to stack the data:

  
mydata=stack(mydataframe)  
mydata  
boxplot(values ~ ind,data=mydata)  

as it is often more convenient to use the stacked data for statistical analysis:

  
fit=lm(values ~ ind,data=mydata)  
anova(fit)  

Plots can be customised to produce very nice graphs. For example, the above boxplots can be easily tidied up a little by passing in some extra parameters.

  
boxplot(mydataframe,notch=TRUE,names=c("Control","Treatment"),col=c(2,3),
                                       main="Boxplots for the control and treatment")  

A histogram of a vector of values can be obtained with a command like hist(treatment).

Time course measurements

Suppose now that we have some time course measurements, again on a control and a treatment group. Again, lets just simulate:

  
times=cumsum(rexp(20,0.1))  
control=cumsum(rnorm(20,0.1,1))  
treatment=cumsum(rnorm(20,0.2,1.5))  

A reasonable plot of these can be produced with the following commands:

  
minval=min(c(control,treatment))
maxval=max(c(control,treatment))
plot(times,treatment,ylim=c(minval,maxval),pch=19,col=2,xlab="Time (mins)",
                       ylab="Measurement",main="Time course measurements")  
lines(times,treatment,col=2,lty=2,lwd=2)  
points(times,control,col=3,pch=19)  
lines(times,control,col=3,lty=2,lwd=2)  
legend(0,maxval,c("Control","Treatment"),pch=c(19,19),lty=c(2,2),lwd=c(2,2),col=c(3,2))  

Start by entering each command one at a time, and use the help function (?) to figure out exactly what each command is doing. Regression lines can be added with the following commands.

fitc=lm(control~times)
fitt=lm(treatment~times)
abline(fitc,col=3,lty=3)
abline(fitt,col=2,lty=3)

Try demo(graphics) to get a better feel for some of the things that are possible.

Data import

Obviously to use R for real, you need to be able to read in data from other sources. The simplest way to get data into R is to read it in from a text file containing data in a tabular row/column format. See help on read.table for further information. Also see read.csv, scan and source. The easiest way to get data in from a spreadsheet application like MS Excel is to save as a tab-delimited file and read into R with read.table. Note however, that there are other possibilities for reading data, including some direct support for Excel files, and direct connections to databases (such as mySQL and Postgres) and various internet database resources.

Bioconductor

Introduction

Bioconductor can be used for all manner of bioscience data management and analysis tasks. In particular, it can be used to access genome databases and work with biological sequence data, as the following session illustrates:

  
library(annotate)  
getSEQ("P39779")  
substr(getSEQ("U13634"),3225,4004)  
library(Biostrings)  
dna=substr(getSEQ("U13634"),3225,4004)  
nchar(dna)  
GENETIC_CODE  
paste(GENETIC_CODE[substring(dna,seq(1,778,3),seq(3,780,3))],collapse="")  
toString(reverseComplement(DNAString(dna)))  

Again, use the available help facilities to figure out what is going on.

Flow cytometry data

I’ll finish this post with an example of using R to analyse a large and complex data set from a flow cytometer. Flow cytometers such as the Partec CyFlow save data by default in a binary format with a file extension of .FCS. This data format is understood by the BioConductor flowCore package (on which the flowViz package depends). An example of reading and processing some data (on Bacillus subtilis cells containing a GFP reporter) follows. For this, you need to download the file 5-ac-t1-s1.FCS and place it in the current R working directory, or change the working directory to where you have put it, using setwd.

  
library(flowViz)  
yjbM=read.FCS("5-ac-t1-s1.FCS")  
summary(yjbM)  
plot(yjbM)  
plot(yjbM,c("FSC","SSC"))  
plot(transform(yjbM,`FSC`=log(`FSC`),`SSC`=log(`SSC`)),c("FSC","SSC"))  
plot(yjbM,c("FL1"),breaks=256,col=3)  
plot(transform(yjbM,`FL1`=log(`FL1`)),c("FL1"),breaks=256,col=3)  

As I work a lot with GFP labelled samples, I find it convenient to define a function to plot the information I’m most interested in:

  
gfpplot<-function(x)  
{  
  lx=transform(x,`FSC`=log(`FSC`),`SSC`=log(`SSC`),`FL1`=log(`FL1`))  
  op=par(mfrow=c(2,2))  
  plot(lx,"FSC",breaks=64,col=2,main=x@description$FILENAME)  
  plot(lx,c("FSC","SSC"))  
  plot(lx,"FL1",breaks=64,col=3,main="GFP")  
  plot(lx,c("FSC","FL1"))  
  par(op)  
}  
gfpplot(yjbM)  

And finally, here is a function that will produce such a set of plots for all .FCS files in a directory and write the output to a PostScript file. Just call it with gfpplots().

  
gfpplots<-function(glob="*.FCS")  
{  
  xset=read.flowSet(pattern=glob2rx(glob))  
  postscript()  
  for (i in 1:length(xset)) {  
    gfpplot(xset[[i]])  
  }  
  dev.off()  
}  

Summary

This has been a very brief introduction to some of the possibilities of using R and Bioconductor, but it has covered only a tiny fraction. Fortunately there is a huge amount of documentation for R and Bioconductor available on-line, especially from the CRAN and Bioconductor web sites. However, there is lots of other stuff, some of which is a bit more user friendly, which can be easily uncovered with a bit of googling. For example, here is a nice manual from UC Riverside. Finally, as was illustrated in the section on flow cytometry data, much of the power of R comes from the ability to define your own functions, utilising the fact that R is a full-blown programming language in its own right. In a previous post I gave a quick introduction to some basic R programming concepts. For those who prefer real books, I quite like Gentleman’s R Programming for Bioinformatics , and the Bioinformatics Solutions book also contains a few useful chapters, though it is starting become a little dated.

Published by

darrenjw

I am Professor of Statistics within the Department of Mathematical Sciences at Durham University, UK. I am an Bayesian statistician interested in computation and applications, especially to engineering and the life sciences.