The scala-smfsb library

In the previous post I gave a very quick introduction to the smfsb R package. As mentioned in that post, although good for teaching and learning, R isn’t a great language for serious scientific computing or computational statistics. So for the publication of the third edition of my textbook, Stochastic modelling for systems biology, I have created a library in the Scala programming language replicating the functionality provided by the R package. Here I will give a very quick introduction to the scala-smfsb library. Some familiarity with both Scala and the smfsb R package will be helpful, but is not strictly necessary. Note that the library relies on the Scala Breeze library for linear algebra and probability distributions, so some familiarity with that library can also be helpful.

Setup

To follow along you need to have Sbt installed, and this in turn requires a recent JDK. If you are new to Scala, you may find the setup page for my Scala course to be useful, but note that on many Linux systems it can be as simple as installing the packages openjdk-8-jdk and sbt.

Once you have Sbt installed, you should be able to run it by entering sbt at your OS command line. You now need to use Sbt to create a Scala REPL with a dependency on the scala-smfsb library. There are many ways to do this, but if you are new to Scala, the simplest way is probably to start up Sbt from an empty or temporary directory (which doesn’t contain any Scala code), and then paste the following into the Sbt prompt:

set libraryDependencies += "com.github.darrenjw" %% "scala-smfsb" % "0.6"
set libraryDependencies += "org.scalanlp" %% "breeze-viz" % "0.13.2"
set scalaVersion := "2.12.6"
set scalacOptions += "-Yrepl-class-based"
console

The first time you run this it will take a little while to download and cache various library dependencies. But everything is cached, so it should be much quicker in future. When it is finished, you should have a Scala REPL ready to enter Scala code.

An introduction to scala-smfsb

It should be possible to type or copy-and-paste the commands below one-at-a-time into the Scala REPL. We need to start with a few imports.

import smfsb._
import breeze.linalg.{Vector => BVec, _}
import breeze.numerics._
import breeze.plot._

Note that I’ve renamed Breeze’s Vector type to BVec to avoid clashing with that in the Scala standard library. We are now ready to go.

Simulating models

Let’s begin by instantiating a Lotka-Volterra model, simulating a single realisation of the process, and then plotting it.

// Simulate LV with Gillespie
val model = SpnModels.lv[IntState]()
val step = Step.gillespie(model)
val ts = Sim.ts(DenseVector(50, 100), 0.0, 20.0, 0.05, step)
Sim.plotTs(ts, "Gillespie simulation of LV model with default parameters")

The library comes with a few other models. There’s a Michaelis-Menten enzyme kinetics model:

// Simulate other models with Gillespie
val stepMM = Step.gillespie(SpnModels.mm[IntState]())
val tsMM = Sim.ts(DenseVector(301,120,0,0), 0.0, 100.0, 0.5, stepMM)
Sim.plotTs(tsMM, "Gillespie simulation of the MM model")

and an auto-regulatory genetic network model, for example.

val stepAR = Step.gillespie(SpnModels.ar[IntState]())
val tsAR = Sim.ts(DenseVector(10, 0, 0, 0, 0), 0.0, 500.0, 0.5, stepAR)
Sim.plotTs(tsAR, "Gillespie simulation of the AR model")

If you know the book and/or the R package, these models should all be familiar.
We are not restricted to exact stochastic simulation using the Gillespie algorithm. We can use an approximate Poisson time-stepping algorithm.

// Simulate LV with other algorithms
val stepPts = Step.pts(model)
val tsPts = Sim.ts(DenseVector(50, 100), 0.0, 20.0, 0.05, stepPts)
Sim.plotTs(tsPts, "Poisson time-step simulation of the LV model")

Alternatively, we can instantiate the example models using a continuous state rather than a discrete state, and then simulate using algorithms based on continuous approximations, such as Euler-Maruyama simulation of a chemical Langevin equation (CLE) approximation.

val stepCle = Step.cle(SpnModels.lv[DoubleState]())
val tsCle = Sim.ts(DenseVector(50.0, 100.0), 0.0, 20.0, 0.05, stepCle)
Sim.plotTs(tsCle, "Euler-Maruyama/CLE simulation of the LV model")

If we want to ignore noise temporarily, there’s also a simple continuous deterministic Euler integrator built-in.

val stepE = Step.euler(SpnModels.lv[DoubleState]())
val tsE = Sim.ts(DenseVector(50.0, 100.0), 0.0, 20.0, 0.05, stepE)
Sim.plotTs(tsE, "Continuous-deterministic Euler simulation of the LV model")

Spatial stochastic reaction-diffusion simulation

We can do 1d reaction-diffusion simulation with something like:

val N = 50; val T = 40.0
val model = SpnModels.lv[IntState]()
val step = Spatial.gillespie1d(model,DenseVector(0.8, 0.8))
val x00 = DenseVector(0, 0)
val x0 = DenseVector(50, 100)
val xx00 = Vector.fill(N)(x00)
val xx0 = xx00.updated(N/2,x0)
val output = Sim.ts(xx0, 0.0, T, 0.2, step)
Spatial.plotTs1d(output)

For 2d simulation, we use PMatrix, a comonadic matrix/image type defined within the library, with parallelised map and coflatMap (cobind) operations. See my post on comonads for scientific computing for further details on the concepts underpinning this, though note that it isn’t necessary to understand comonads to use the library.

val r = 20; val c = 30
val model = SpnModels.lv[DoubleState]()
val step = Spatial.cle2d(model, DenseVector(0.6, 0.6), 0.05)
val x00 = DenseVector(0.0, 0.0)
val x0 = DenseVector(50.0, 100.0)
val xx00 = PMatrix(r, c, Vector.fill(r*c)(x00))
val xx0 = xx00.updated(c/2, r/2, x0)
val output = step(xx0, 0.0, 8.0)
val f = Figure("2d LV reaction-diffusion simulation")
val p0 = f.subplot(2, 1, 0)
p0 += image(PMatrix.toBDM(output map (_.data(0))))
val p1 = f.subplot(2, 1, 1)
p1 += image(PMatrix.toBDM(output map (_.data(1))))

Bayesian parameter inference

The library also includes functions for carrying out parameter inference for stochastic dynamical systems models, using particle MCMC, ABC and ABC-SMC. See the examples directory for further details.

Next steps

Having worked through this post, the next step is to work through the tutorial. There is some overlap of content with this blog post, but the tutorial goes into more detail regarding the basics. It also finishes with suggestions for how to proceed further.

Source

This post started out as a tut document (the Scala equivalent of an RMarkdown document). The source can be found here.

The smfsb R package

Introduction

In the previous post I gave a brief introduction to the third edition of my textbook, Stochastic modelling for systems biology. The algorithms described in the book are illustrated by implementations in R. These implementations are collected together in an R package on CRAN called smfsb. This post will provide a brief introduction to the package and its capabilities.

Installation

The package is on CRAN – see the CRAN package page for details. So the simplest way to install it is to enter

install.packages("smfsb")

at the R command prompt. This will install the latest version that is on CRAN. Once installed, the package can be loaded with

library(smfsb)

The package is well-documented, so further information can be obtained with the usual R mechanisms, such as

vignette(package="smfsb")
vignette("smfsb")
help(package="smfsb")
?StepGillespie
example(StepCLE1D)

The version of the package on CRAN is almost certainly what you want. However, the package is developed on R-Forge – see the R-Forge project page for details. So the very latest version of the package can always be installed with

install.packages("smfsb", repos="http://R-Forge.R-project.org")

if you have a reason for wanting it.

A brief tutorial

The vignette gives a quick introduction to the library, which I don’t need to repeat verbatim here. If you are new to the package, I recommend working through that before continuing. Here I’ll concentrate on some of the new features associated with the third edition.

Simulating stochastic kinetic models

Much of the book is concerned with the simulation of stochastic kinetic models using exact and approximate algorithms. Although the primary focus of the text is the application to modelling of intra-cellular processes, the methods are also appropriate for population modelling of ecological and epidemic processes. For example, we can start by simulating a simple susceptible-infectious-recovered (SIR) disease epidemic model.

set.seed(2)
data(spnModels)

stepSIR = StepGillespie(SIR)
plot(simTs(SIR$M, 0, 8, 0.05, stepSIR),
  main="Exact simulation of the SIR model")

Exact simulation of the SIR epidemic model
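
To make the idea of exact simulation concrete, here is a minimal base-R sketch of the Gillespie algorithm for an SIR model. It is not the package's StepGillespie implementation: the function name gillespieSIR, the rate constants and the initial state are all assumptions chosen purely for illustration.

```r
# Hypothetical helper (not part of smfsb): Gillespie simulation of an SIR model.
# x0 = c(S, I, R) initial counts; rates = c(beta, gamma) are assumed values.
gillespieSIR <- function(x0, rates, tmax) {
  t <- 0
  x <- x0
  times <- t
  states <- matrix(x, nrow = 1)
  repeat {
    # Reaction hazards: infection (S + I -> 2I) and recovery (I -> R)
    h <- c(rates[1] * x[1] * x[2], rates[2] * x[2])
    h0 <- sum(h)
    if (h0 <= 0) break            # no infectives left, so nothing can happen
    t <- t + rexp(1, h0)          # exponential waiting time to the next event
    if (t > tmax) break
    # Pick the event type with probability proportional to its hazard
    if (runif(1) < h[1] / h0) {
      x <- x + c(-1, 1, 0)
    } else {
      x <- x + c(0, -1, 1)
    }
    times <- c(times, t)
    states <- rbind(states, x)
  }
  list(t = times, x = unname(states))
}

set.seed(1)
out <- gillespieSIR(c(197, 3, 0), c(0.002, 0.4), 20)
```

The result holds the event times in out$t and the corresponding states, one row per event, in out$x. Unlike simTs, this sketch records the state at every event rather than on a regular time grid.
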
The focus of the text is stochastic simulation of discrete models, so that is the obvious place to start. But there is also support for continuous deterministic simulation.

plot(simTs(SIR$M, 0, 8, 0.05, StepEulerSPN(SIR)),
  main="Euler simulation of the SIR model")

Euler simulation of the SIR model
My favourite toy population dynamics model is the Lotka-Volterra (LV) model, so I tend to use this frequently as a running example throughout the book. We can simulate this (exactly) as follows.

stepLV = StepGillespie(LV)
plot(simTs(LV$M, 0, 30, 0.2, stepLV),
  main="Exact simulation of the LV model")

Exact simulation of the Lotka-Volterra model

Stochastic reaction-diffusion modelling

The first two editions of the book were almost exclusively concerned with well-mixed systems, where spatial effects are ignorable. One of the main new features of the third edition is the inclusion of a new chapter on spatially extended systems. The focus is on models related to the reaction diffusion master equation (RDME) formulation, rather than individual particle-based simulations. For these models, space is typically divided into a regular grid of voxels, with reactions taking place as normal within each voxel, and additional reaction events included, corresponding to the diffusion of particles to adjacent voxels. So to specify such models, we just need an initial condition, a reaction model, and diffusion coefficients (one for each reacting species). So, we can carry out exact simulation of an RDME model for a 1D spatial domain as follows.

N=20; T=30
x0=matrix(0, nrow=2, ncol=N)
rownames(x0) = c("x1", "x2")
x0[,round(N/2)] = LV$M
stepLV1D = StepGillespie1D(LV, c(0.6, 0.6))
xx = simTs1D(x0, 0, T, 0.2, stepLV1D, verb=TRUE)
image(xx[1,,], main="Prey", xlab="Space", ylab="Time")

Discrete 1D simulation of the LV model

image(xx[2,,], main="Predator", xlab="Space", ylab="Time")

Discrete 1D simulation of the LV model
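
The idea of treating diffusion as additional reaction events can be sketched in a few lines of base R. The following is an illustrative toy for a single species with no reactions, not the package's StepGillespie1D: the function name diffuse1D, the periodic boundary conditions, and the convention that each particle hops to each neighbouring voxel at rate D are all assumptions.

```r
# Hypothetical helper (not part of smfsb): pure diffusion of one species on a
# 1D grid of voxels, treated as extra "reaction" events. Periodic boundaries
# are an assumption made here for simplicity.
diffuse1D <- function(x, D, tmax) {
  N <- length(x)
  t <- 0
  repeat {
    h <- 2 * D * x                # each particle hops left or right at rate D
    h0 <- sum(h)
    if (h0 <= 0) break
    t <- t + rexp(1, h0)
    if (t > tmax) break
    i <- sample.int(N, 1, prob = h)           # voxel that loses a particle
    j <- if (runif(1) < 0.5) i - 1 else i + 1 # neighbour that gains it
    j <- ((j - 1) %% N) + 1                   # wrap around at the ends
    x[i] <- x[i] - 1
    x[j] <- x[j] + 1
  }
  x
}

set.seed(1)
x0 <- rep(0, 20)
x0[10] <- 100
x1 <- diffuse1D(x0, D = 0.6, tmax = 5)
```

Starting with all particles in the middle voxel, x1 shows them spread over neighbouring voxels, with the total count conserved.
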
Exact simulation of discrete stochastic reaction diffusion systems is very expensive (and the reference implementation provided in the package is very inefficient), so we will often use diffusion approximations based on the CLE.

stepLV1DC = StepCLE1D(LV, c(0.6, 0.6))
xx = simTs1D(x0, 0, T, 0.2, stepLV1DC)
image(xx[1,,], main="Prey", xlab="Space", ylab="Time")

Spatial CLE simulation of the 1D LV model

image(xx[2,,], main="Predator", xlab="Space", ylab="Time")

Spatial CLE simulation of the 1D LV model
We can think of this algorithm as an explicit numerical integration of the obvious SPDE approximation to the exact model.

The package also includes support for simulation of 2D systems. Again, we can use the Spatial CLE to speed things up.

m=70; n=50; T=10
data(spnModels)
x0=array(0, c(2,m,n))
dimnames(x0)[[1]]=c("x1", "x2")
x0[,round(m/2),round(n/2)] = LV$M
stepLV2D = StepCLE2D(LV, c(0.6,0.6), dt=0.05)
xx = simTs2D(x0, 0, T, 0.5, stepLV2D)
N = dim(xx)[4]
image(xx[1,,,N],main="Prey",xlab="x",ylab="y")

Spatial CLE simulation of the 2D LV model

image(xx[2,,,N],main="Predator",xlab="x",ylab="y")

Spatial CLE simulation of the 2D LV model

Bayesian parameter inference

Although much of the book is concerned with the problem of forward simulation, the final chapters are concerned with the inverse problem of estimating model parameters, such as reaction rate constants, from data. A computational Bayesian approach is adopted, with the main emphasis being placed on “likelihood free” methods, which rely on forward simulation to avoid explicit computation of sample path likelihoods. The second edition included some rudimentary code for a likelihood free particle marginal Metropolis-Hastings (PMMH) particle Markov chain Monte Carlo (pMCMC) algorithm. The third edition includes a more complete and improved implementation, in addition to approximate inference algorithms based on approximate Bayesian computation (ABC).

The key function underpinning the PMMH approach is pfMLLik, which computes an estimate of marginal model log-likelihood using a (bootstrap) particle filter. There is a new implementation of this function with the third edition. There is also a generic implementation of the Metropolis-Hastings algorithm, metropolisHastings, which can be combined with pfMLLik to create a PMMH algorithm. PMMH algorithms are very slow, but a full demo of how to use these functions for parameter inference is included in the package and can be run with

demo(PMCMC)
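
For readers who want to see the shape of a bootstrap particle filter before running the demo, here is a minimal sketch on a toy model. It does not use the pfMLLik interface: the function name pfLogLik, the Gaussian random-walk model, the N(0,1) initial distribution and the unit observation variance are all assumptions made for illustration.

```r
# Hypothetical toy example (not the pfMLLik interface): bootstrap particle
# filter estimate of the marginal log-likelihood for a Gaussian random walk
# observed with unit-variance Gaussian noise.
pfLogLik <- function(y, sigma, n = 500) {
  x <- rnorm(n, 0, 1)                   # particles from an assumed N(0,1) initial law
  ll <- 0
  for (yt in y) {
    x <- x + rnorm(n, 0, sigma)         # propagate by forward simulation only
    w <- dnorm(yt, x, 1)                # weight by the observation density
    ll <- ll + log(mean(w))             # running log-likelihood estimate
    x <- sample(x, n, replace = TRUE, prob = w)  # resample in proportion to weight
  }
  ll
}

set.seed(1)
y <- cumsum(rnorm(20, 0, 0.5)) + rnorm(20)   # synthetic data
ll <- pfLogLik(y, sigma = 0.5)
```

The hidden process enters only through forward simulation, which is what makes the approach "likelihood free". In a PMMH scheme, an estimate like ll would take the place of the exact log-likelihood in the Metropolis-Hastings acceptance ratio.
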

Simple rejection-based ABC methods are facilitated by the (very simple) function abcRun, which just samples from a prior and then carries out independent simulations in parallel before computing summary statistics. A simple illustration of the use of the function is given below.

data(LVdata)
rprior <- function() { exp(c(runif(1, -3, 3),runif(1,-8,-2),runif(1,-4,2))) }
rmodel <- function(th) { simTs(c(50,100), 0, 30, 2, stepLVc, th) }
sumStats <- identity
ssd = sumStats(LVperfect)
distance <- function(s) {
    diff = s - ssd
    sqrt(sum(diff*diff))
}
rdist <- function(th) { distance(sumStats(rmodel(th))) }
out = abcRun(10000, rprior, rdist)
q=quantile(out$dist, c(0.01, 0.05, 0.1))
print(q)
##       1%       5%      10% 
## 772.5546 845.8879 881.0573
accepted = out$param[out$dist < q[1],]
print(summary(accepted))
##        V1                V2                  V3         
##  Min.   :0.06498   Min.   :0.0004467   Min.   :0.01887  
##  1st Qu.:0.16159   1st Qu.:0.0012598   1st Qu.:0.04122  
##  Median :0.35750   Median :0.0023488   Median :0.14664  
##  Mean   :0.68565   Mean   :0.0046887   Mean   :0.36726  
##  3rd Qu.:0.86708   3rd Qu.:0.0057264   3rd Qu.:0.36870  
##  Max.   :4.76773   Max.   :0.0309364   Max.   :3.79220
print(summary(log(accepted)))
##        V1                V2               V3         
##  Min.   :-2.7337   Min.   :-7.714   Min.   :-3.9702  
##  1st Qu.:-1.8228   1st Qu.:-6.677   1st Qu.:-3.1888  
##  Median :-1.0286   Median :-6.054   Median :-1.9198  
##  Mean   :-0.8906   Mean   :-5.877   Mean   :-1.9649  
##  3rd Qu.:-0.1430   3rd Qu.:-5.163   3rd Qu.:-0.9978  
##  Max.   : 1.5619   Max.   :-3.476   Max.   : 1.3329

Naive rejection-based ABC algorithms are notoriously inefficient, so the library also includes an implementation of a more efficient, sequential version of ABC, often known as ABC-SMC, in the function abcSmc. This function requires specification of a perturbation kernel to “noise up” the particles at each algorithm sweep. Again, the implementation is parallel, using the parallel package to run the required simulations in parallel on multiple cores. A simple illustration of use is given below.

rprior <- function() { c(runif(1, -3, 3), runif(1, -8, -2), runif(1, -4, 2)) }
dprior <- function(x, ...) { dunif(x[1], -3, 3, ...) + 
    dunif(x[2], -8, -2, ...) + dunif(x[3], -4, 2, ...) }
rmodel <- function(th) { simTs(c(50,100), 0, 30, 2, stepLVc, exp(th)) }
rperturb <- function(th){th + rnorm(3, 0, 0.5)}
dperturb <- function(thNew, thOld, ...){sum(dnorm(thNew, thOld, 0.5, ...))}
sumStats <- identity
ssd = sumStats(LVperfect)
distance <- function(s) {
    diff = s - ssd
    sqrt(sum(diff*diff))
}
rdist <- function(th) { distance(sumStats(rmodel(th))) }
out = abcSmc(5000, rprior, dprior, rdist, rperturb,
    dperturb, verb=TRUE, steps=6, factor=5)
## 6 5 4 3 2 1
print(summary(out))
##        V1                V2               V3        
##  Min.   :-2.9961   Min.   :-7.988   Min.   :-3.999  
##  1st Qu.:-1.9001   1st Qu.:-6.786   1st Qu.:-3.428  
##  Median :-1.2571   Median :-6.167   Median :-2.433  
##  Mean   :-1.0789   Mean   :-6.014   Mean   :-2.196  
##  3rd Qu.:-0.2682   3rd Qu.:-5.261   3rd Qu.:-1.161  
##  Max.   : 2.1128   Max.   :-2.925   Max.   : 1.706

We can then plot some results with

hist(out[,1],30,main="log(c1)")

ABC-SMC posterior for the LV model

hist(out[,2],30,main="log(c2)")

ABC-SMC posterior for the LV model

hist(out[,3],30,main="log(c3)")

ABC-SMC posterior for the LV model

Although the inference methods are illustrated in the book in the context of parameter inference for stochastic kinetic models, their implementation is generic, and can be used with any appropriate parameter inference problem.

The smfsbSBML package

smfsbSBML is another R package associated with the third edition of the book. This package is not on CRAN due to its dependency on a package not on CRAN, and hence is slightly less straightforward to install. Follow the available installation instructions to install the package. Once installed, you should be able to load the package with

library(smfsbSBML)

This package provides a function for reading in SBML files and parsing them into the simulatable stochastic Petri net (SPN) objects used by the main smfsb R package. Examples of suitable SBML models are included in the main smfsb GitHub repo. An appropriate SBML model can be read and parsed with a command like:

model = sbml2spn("mySbmlModel.xml")

The resulting value, model, is an SPN object which can be passed in to simulation functions such as StepGillespie for constructing stochastic simulation algorithms.

Other software

In addition to the above R packages, I also have some Python scripts for converting between SBML and the SBML-shorthand notation I use in the book. See the SBML-shorthand page for further details.

Although R is a convenient language for teaching and learning about stochastic simulation, it isn’t ideal for serious research-level scientific computing or computational statistics. So for the third edition of the book I have also developed scala-smfsb, a library written in the Scala programming language, which re-implements all of the models and algorithms from the third edition of the book in Scala, a fast, efficient, strongly-typed, compiled, functional programming language. I’ll give an introduction to this library in a subsequent post, but in the meantime, it is already well documented, so see the scala-smfsb repo for further details, including information on installation, getting started, a tutorial, examples, API docs, etc.

Source

This blog post started out as an RMarkdown document, the source of which can be found here.

Stochastic Modelling for Systems Biology, third edition

The third edition of my textbook, Stochastic Modelling for Systems Biology, has recently been published by Chapman & Hall/CRC Press. The book has ISBN-10 113854928-2 and ISBN-13 978-113854928-9. It can be ordered from CRC Press, Amazon.com, Amazon.co.uk and similar book sellers.

I was fairly happy with the way that the second edition, published in 2011, turned out, and so I haven’t substantially re-written any of the text for the third edition. Instead, I’ve concentrated on adding in new material and improving the associated on-line resources. Those on-line resources are all free and open source, and hence available to everyone, irrespective of whether you have a copy of the new edition. I’ll give an introduction to those resources below (and in subsequent posts). The new material can be briefly summarised as follows:

  • New chapter on spatially extended systems, covering the spatial Gillespie algorithm for reaction diffusion master equation (RDME) models in 1- and 2-d, the next subvolume method, spatial CLE, scaling issues, etc.
  • Significantly expanded chapter on inference for stochastic kinetic models from data, covering approximate methods of inference (ABC), including ABC-SMC. The material relating to particle MCMC has also been improved and extended.
  • Updated R package, including code relating to all of the new material
  • New R package for parsing SBML models into simulatable stochastic Petri net models
  • New software library, written in Scala, replicating most of the functionality of the R packages in a fast, compiled, strongly typed, functional language

New content

Although some minor edits and improvements have been made throughout the text, there are two substantial new additions to the text in this new edition. The first is an entirely new chapter on spatially extended systems. The first two editions of the text focused on the implications of discreteness and stochasticity in chemical reaction systems, but maintained the well-mixed assumption throughout. This is a reasonable first approach, since discreteness and stochasticity are most pronounced in very small volumes where diffusion should be rapid. In any case, even these non-spatial models have very interesting behaviour, and become computationally challenging very quickly for non-trivial reaction networks. However, we know that, in fact, the cell is a very crowded environment, and so even at small spatial scales, many interesting processes are diffusion limited. It therefore seems appropriate to dedicate one chapter (the new Chapter 9) to studying some of the implications of relaxing the well-mixed assumption. Entire books can be written on stochastic reaction-diffusion systems, so here only a brief introduction is provided, based mainly around models in the reaction-diffusion master equation (RDME) style. Exact stochastic simulation algorithms are discussed, and implementations provided in the 1- and 2-d cases, and an appropriate Langevin approximation is examined, the spatial CLE.

The second major addition is to the chapter on inference for stochastic kinetic models from data (now Chapter 11). The second edition of the book included a discussion of “likelihood free” Bayesian MCMC methods for inference, and provided a working implementation of likelihood free particle marginal Metropolis-Hastings (PMMH) for stochastic kinetic models. The third edition improves on that implementation, and discusses approximate Bayesian computation (ABC) as an alternative to MCMC for likelihood free inference. Implementation issues are discussed, and sequential ABC approaches are examined, concentrating in particular on the method known as ABC-SMC.

New software and on-line resources

Accompanying the text are new and improved on-line resources, all well-documented, free, and open source.

New website/GitHub repo

Information and materials relating to the previous editions were kept on my University website. All materials relating to this new edition are kept in a public GitHub repo: darrenjw/smfsb. This will be simpler to maintain, and will make it much easier for people to make copies of the material for use and studying off-line.

Updated R package(s)

Along with the second edition of the book I released an accompanying R package, “smfsb”, published on CRAN. This was a very popular feature, allowing anyone with R to trivially experiment with all of the models and algorithms discussed in the text. This R package has been updated, and a new version has been published to CRAN. The updates are all backwards-compatible with the version associated with the second edition of the text, so owners of that edition can still upgrade safely. I’ll give a proper introduction to the package, including the new features, in a subsequent post, but in the meantime, you can install/upgrade the package from a running R session with

install.packages("smfsb")

and then pop up a tutorial vignette with:

vignette("smfsb")

This should be enough to get you started.

In addition to the main R package, there is an additional R package for parsing SBML models into models that can be simulated within R. This package is not on CRAN, due to its dependency on a non-CRAN package. See the repo for further details.

There are also Python scripts available for converting SBML models to and from the shorthand SBML notation used in the text.

New Scala library

Another major new resource associated with the third edition of the text is a software library written in the Scala programming language. This library provides Scala implementations of all of the algorithms discussed in the book and implemented in the associated R packages. This then provides example implementations in a fast, efficient, compiled language, and is likely to be most useful for people wanting to use the methods in the book for research. Again, I’ll provide a tutorial introduction to this library in a subsequent post, but it is well-documented, with all necessary information needed to get started available at the scala-smfsb repo/website, including a step-by-step tutorial and some additional examples.

Lexical scope and function closures in R

Introduction

R is different to many “easy to use” statistical software packages – it expects to be given commands at the R command prompt. This can be intimidating for new users, but is at the heart of its power. Most powerful software tools have an underlying scripting language. This is because scriptable tools are typically more flexible, and easier to automate, script, program, etc. In fact, even software packages like Excel or Minitab have a macro programming language behind the scenes available for “power users” to exploit.

Programming from the ground up

It is natural to want to automate (repetitive) tasks on a computer, to automate a “work flow”. This is especially natural for computational tasks, as all software tools are built from programming language components, anyway. In R, you do stuff by executing a sequence of commands. By putting a bunch of commands one after another into a text file, we can source the file, and script R. Scripting is the simplest form of programming – automating a sequence of tasks. Indeed, in Unix (including Linux and MacOS), we can put a bunch of Unix shell commands together in a shell script. In Windows, you can put a bunch of terminal commands together in a batch file.

Next, one can add in simple control structures, to support looping, branching and conditional execution. Looping allows repetition of very similar tasks. Branching and conditional execution allow decisions to be made depending on what has already happened. Most scripting languages support simple control structures – this allows carrying out of tasks which we could do in principle, but perhaps not in practice, due to the laborious and repetitive nature of some work-flows. We can go a long way with this, but…

Although scripting is a simple form of programming, it isn’t “real” programming, or software engineering. Software engineering is about developing flexible, modular, robust, re-usable, generic program components, and using them to build large, complex software systems – modularity is absolutely key here. Functions and procedures are a first step towards introducing modularity, allowing the development of “real” software. Proper support for these tends to distinguish “real” programming languages from scripting languages (though many modern “scripting” languages have at least some limited support, and the distinction between scripting languages and “real” languages is now very blurred).

Functions and procedures

Procedures (or subroutines) are re-usable pieces of code which can be called from other pieces of code when needed. They may be provided with inputs, but do not have to be. They are usually called for their “side-effects”, such as doing plots, changing global variables, or reading/writing data to/from disk.

Functions are also re-usable pieces of code, but are mainly used to obtain a return-value that is computed on the basis of the given inputs. “Pure” functions do not have any side-effects. Functions and procedures may be combined in a hierarchical way to build large, complex algorithms from much simpler modular components. Note that many languages (including R), do not make a distinction between functions and procedures in the syntax of the language, but conceptually the distinction is really quite important.
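
The conceptual distinction can be seen in a small R sketch. The names sumSq, counter and bump are made up for illustration.

```r
# A "pure" function: the result depends only on the inputs
sumSq <- function(v) sum(v^2)

# A procedure-like function: called for its side-effect on a variable
# defined outside the function
counter <- 0
bump <- function() {
  counter <<- counter + 1   # modifies a variable outside the function
  invisible(NULL)
}

bump(); bump()
s <- sumSq(1:3)
```

Both are defined with the same function syntax, but sumSq can be understood from its inputs alone, whereas the effect of bump depends on, and changes, state held elsewhere.
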

Variable scope

Almost all programming languages allow the definition of variables which are labels or tags representing or pointing at some value that may be defined and re-defined at run-time. In most modern programming languages, functions can define local variables which can be used in addition to any inputs (formal parameters) of the function – these are very important for the development of modular, re-usable code components. In particular, they help to avoid unanticipated name clashes in the global name-space. If a function refers to a variable which is neither a formal parameter nor a local variable, then a rule is needed to find which (if any) variable with that label is in scope for the function, so that the program can know what value to use.
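
As a small illustration of how local variables avoid name clashes, consider the following R sketch (the names total and meanCentre are made up for the example).

```r
total <- 100                  # a global variable

meanCentre <- function(v) {
  total <- sum(v)             # local variable: masks, but does not alter, the global
  v - total / length(v)
}

centred <- meanCentre(c(1, 2, 3))
```

After the call, the global total is unchanged: the assignment inside the function creates a separate local variable that disappears when the function returns.
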

Dynamic scope

Under dynamic scope, if an “unknown” variable is referred to in a function, the idea is to use the version of the variable that is in scope at the time that the function was called (and apply this rule recursively) – this is the scoping rule used by the S-PLUS implementation of the S language. Dynamic scope was common among early dynamic programming languages – including early implementations of LISP (and is still used in Emacs LISP), as it was quite intuitive and natural to implement using a stack-based approach similar to the stack-based approach to passing variables in and out of subroutines commonly used by machine code and assembly programmers.

Despite being intuitively appealing, at least initially, there are a number of problems with dynamic scope in practice. In particular, we can’t really know by code inspection whether or not a given section of code will run in all situations without actually running the code, as we can’t know whether all variable bindings will resolve correctly. This is an issue even for dynamic languages, but is particularly problematic for strongly typed compiled languages, as it becomes difficult for the compiler to figure out the types of all variables correctly and therefore generate the appropriate byte-code. It is also very difficult for a function to have associated state – to do this, you must somehow get state variables into global name-space where they then become vulnerable to masking and name clashes. See the Wikipedia page on scope for further details.

Lexical scope

Under lexical scoping rules, if an “unknown” variable is referred to in a function, the idea is to use the version that is “in scope” in the enclosing piece of code (and apply this rule recursively) — this is the scoping rule used by R (as R is built on top of a Scheme interpreter, a LISP derivative which emphasises lexical scope). Variable bindings can be all resolved, checked and verified at compile-time – this is safer, and in many other ways better. Most modern languages adopt lexical scoping, including most functional languages, such as LISPs (including LISP-STAT) and derivatives. In fact, I first read about lexical scope, function closures and their use in statistical computing in Luke Tierney’s LISP-STAT book (Tierney, 1990) in the early 1990s. That book was published over 20 years ago, so it just goes to show that there is nothing new about these functional programming approaches. In fact, although Tierney’s book describes a now obsolete system, I would nevertheless recommend reading it if you can find a copy, as I think it is still one of the best books on statistical computing ever written. It really puts the recent glut of horrible R-themed books to shame!

Given that R has been lexically scoped and has supported function closures since day one, it is reasonable to wonder why this programming style is not used more widely in R code. I think it is the difference in scoping rules between S-PLUS and R that has led to a fear of developing any R code which relies on non-local scoping rules. Certainly, in the early days of R, I would use S-PLUS at work and R at home, and I would want my code to work in exactly the same way in both places! This is a shame, as lexical scoping is very powerful, and exploited widely in functional programming styles. The use of lexical scope and function closures in R is described quite nicely in Gentleman (2008), along with many other things.

To make sure that the concepts are clear, inspect the following piece of code and figure out what the result of the final function call will be. The answer is given below the code, so try not to peek before reading on…

a=1
b=2
f<-function(x)
{
  a*x + b
}
g<-function(x)
{
  a=2
  b=1
  f(x)
}
g(2)

No, really, try and figure it out before reading on for the answer! Understanding this example is key to understanding the difference between lexical and dynamic scope. Clearly the obvious answers are 4 and 5. If you didn’t get one of those, go back and try again! 😉 So, one of those is the result you get in a dynamically scoped language like S-PLUS, and the other is the result that you get in a lexically scoped language like R. But which is which? Many people when asked what this code does give the answer 5. This is the result for a dynamically scoped language. It is not the answer you get in R. In R, you get the answer 4. This is because f() was defined in the global environment, so it is the global bindings of a and b which count. Although the function g() defines its own local bindings for a and b, these have no impact on the global bindings, and are simply not relevant to the evaluation of f().
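One way to convince yourself of this is to emulate dynamic scope by hand. The sketch below repeats the definitions from the example and adds a hypothetical variant, gDyn (my name, not part of the original example), which re-points a local copy of f at the calling frame before calling it. It is only an illustration of the two lookup rules, not a recommended programming style.

```r
a <- 1
b <- 2
f <- function(x) {
  a * x + b
}
g <- function(x) {
  a <- 2
  b <- 1
  f(x)   # f resolves a and b where f was defined, not here
}
# Hypothetical variant: emulate dynamic scope by making a local copy of f
# whose enclosing environment is the current call frame of gDyn
gDyn <- function(x) {
  a <- 2
  b <- 1
  environment(f) <- environment()  # changes the local copy of f only
  f(x)
}
resLexical <- g(2)     # uses the bindings a=1, b=2 from where f was defined
resDynamic <- gDyn(2)  # uses the local bindings a=2, b=1
print(c(lexical = resLexical, dynamic = resDynamic))
```

The first result is the 4 that R gives by default, and the second is the 5 that a dynamically scoped language would give. The original f is untouched by the call to gDyn.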

Function closures

Some languages (including Python, R, and LISPs and derivatives such as Scheme) have functions as “first class objects”, which means that a function is able to return as its value another function. If the function (fChild) returned by the function (fParent) refers to variables not local to fChild, then scoping rules must apply to the resolution of the variable binding. If the language is lexically scoped, then the binding is determined by the variables in scope within the function fParent. The function fChild therefore has an associated environment, which provides bindings for non-local variable references – this allows state to be maintained. A function together with its environment is referred to as a function closure, and is a very powerful programming tool. Below is some more code to help illustrate what is going on. Again, try to figure out the result of the final function call before reading on for the answer and explanation…

a=1
b=2
f<-function(a,b)
{
  return( function(x) {
    a*x + b
  })
}
g=f(2,1)
g(2)

Here, the function g(), together with its associated environment, is referred to as a function closure. See the Wikipedia page for closure for further details. So, what is the result of calling g(2) in this case? Again, some people get this wrong, and give the answer 4. This isn’t what you get in R – in R you get 5, again due to lexical scope. The point is that the function g() is created inside f(), and so it is the variable bindings in scope within f() at the time g() was created which matter. Since f() has a and b as formal arguments, these mask the global variables of the same name, so it is the 2 and 1 that are passed into f() to create g() which matter in the evaluation of g(). This is why function closures are so powerful. They are not simply functions, they are functions together with an associated environment, and the associated environment allows function closures to have associated state. Here the state corresponds to the values of a and b that were used in the creation of g(), but in principle the state can be essentially any data structure.
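To illustrate the point about state, here is a minimal sketch of a closure whose environment holds a mutable value. The names makeCounter, c1 and c2 are hypothetical and chosen only for this illustration. Each call to makeCounter creates a fresh environment, so the two counters do not interfere with each other.

```r
# Constructor returning a closure with its own private state
makeCounter <- function() {
  count <- 0
  function() {
    count <<- count + 1  # update the binding in the enclosing environment
    count
  }
}
c1 <- makeCounter()
c1()
second <- c1()   # second call on the same closure
c2 <- makeCounter()
first <- c2()    # first call on an independent closure
print(c(second, first))
```

Here count lives in the environment attached to each closure rather than in the global name-space, so it cannot be masked or clobbered by other code.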

Function closures for scientific computing

Function closures have numerous important applications in a variety of problems in scientific computing that involve dealing in some way with the “function environment problem”. There is quite a nice discussion of this issue in Oliveira and Stewart (2006), in the context of several strongly typed compiled languages. Consider, for example, a function that will numerically integrate a univariate function using (say) the trapezium rule. This integration function might expect that you pass in the function to be integrated, together with the limits of integration, and possibly a step size. Most likely this integration function will expect that the function passed in is univariate. However, in practice many functions have additional parameters (e.g. the straight line example above, which was a function of x but depended on additional parameters a and b). This problem is solved by passing in a univariate function closure that contains the necessary environment to evaluate this univariate function correctly. Similar considerations apply for functions that carry out optimisation, solve ODEs by passing in the RHS, etc.
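As a minimal sketch of this idea, the code below defines a simple trapezium rule integrator that knows nothing about extra parameters, and then integrates the straight line a*x + b by wrapping a and b up in a closure. The function names trapezium and makeLine, and the choice of n intervals rather than a step size, are my own assumptions for the purposes of illustration.

```r
# Composite trapezium rule on n equal intervals; f must be vectorised
trapezium <- function(f, lower, upper, n = 1000) {
  x <- seq(lower, upper, length.out = n + 1)
  y <- f(x)
  h <- (upper - lower) / n
  h * (sum(y) - (y[1] + y[n + 1]) / 2)
}
# Constructor returning a univariate closure that carries a and b with it
makeLine <- function(a, b) {
  function(x) a * x + b
}
line21 <- makeLine(2, 1)
area <- trapezium(line21, 0, 1)  # integral of 2x + 1 over [0, 1]
print(area)
```

The integrator only ever sees a function of x, and the trapezium rule is exact for a straight line (up to rounding error), so the result should agree with the analytic value of 2.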

The smfsb R package

The second edition of my textbook, Stochastic Modelling for Systems Biology, has recently been published (Wilkinson, 2011). The second edition has an associated R package, smfsb, available from CRAN – I gave a tutorial introduction in a previous post. The code makes extensive use of lexical scope and function closures, precisely to solve the function environment problem…

References

  • Oliveira, S., Stewart, D. E. (2006) Writing Scientific Software, CUP.
  • Gentleman, R. (2008) R Programming for Bioinformatics, Chapman & Hall/CRC Press.
  • Tierney, L. (1990) LISP-STAT, Wiley.
  • Wilkinson, D. J. (2011) Stochastic Modelling for Systems Biology, second edition, Chapman & Hall/CRC Press.

Stochastic Modelling for Systems Biology, second edition

    The second edition of my textbook, Stochastic Modelling for Systems Biology, was published on 7th November, 2011. One of the new features introduced into the new edition is an R package called smfsb, which contains all of the code examples discussed in the text and supports modelling, simulation and inference for stochastic kinetic models. The smfsb R package is the main topic of this post, but it seems appropriate to start off the post with a quick introduction to the book, and the main new features of the second edition.

    The first edition was published in April 2006. It provided an introduction to mathematical modelling for systems biology from a stochastic viewpoint. It began with an introduction to biochemical network modelling, then moved on to probability theory, stochastic simulation and Markov processes. After providing all of the necessary background material, the book then introduced the theory of stochastic kinetic modelling and the Gillespie algorithm for exact discrete stochastic event simulation of stochastic kinetic biochemical network models. This was followed by examples and case studies, advanced simulation algorithms, and then a brief introduction to Bayesian inference and its application to inference for stochastic kinetic models.

    The first edition proved to be very popular, as it was the first self-contained introduction to the field, and was aimed at an audience without a strong quantitative background. The decision to target an applied audience meant that it contained only the bare essentials necessary to get started with stochastic modelling in systems biology. The second edition was therefore an opportunity not only to revise and update the existing material, but also to add in additional material, especially new material which could provide a more solid foundation for advanced study by students with a more mathematical focus. New material introduced into the second edition includes a greatly expanded chapter on Markov processes, with particular emphasis on diffusion processes and stochastic differential equations, as well as Kolmogorov equations, the Fokker-Planck equation (FPE), Kurtz’s random time change representation of a stochastic kinetic model, an additional derivation of the chemical Langevin equation (CLE), and a derivation of the linear noise approximation (LNA). There is now also discussion of the modelling of “extrinsic” in addition to “intrinsic” noise. The final chapters on inference have also been greatly expanded, including discussion of importance resampling, particle filters, pseudo-marginal “exact approximate” MCMC, likelihood-free techniques and particle MCMC for rate parameter inference. I have tried as far as possible to maintain the informal and accessible style of the first edition, and a couple of the more technical new sections have been flagged as “skippable” by less mathematically trained students. In terms of computing, all of the SBML models have been updated to the new Level 3 specification, and all of the R code has been re-written, extended, documented and packaged as an open source R package. The rest of this post is an introduction to the R package. 

    Although the R package is aimed mainly at owners of the second edition, it is well documented, and should therefore be usable by anyone with a reasonable background knowledge of the area. In particular, the R package should be very easy to use for anyone familiar with the first edition of the book. The introduction given here is closely based on the introductory vignette included with the package.

    smfsb: an R package for simulation and inference in stochastic kinetic models

    Overview

    The smfsb package provides all of the R code associated with the book, Wilkinson (2011). Almost all of the code is pure R code, intended to be inspected from the R command line. In order to keep the code short, clean and easily understood, there is almost no argument checking or other boilerplate code.

    Installation

    The package is available from CRAN, and it should therefore be possible to install from the R command prompt using

    install.packages("smfsb")
    

    from any machine with an internet connection.

    The package is being maintained on R-Forge, and so it should always be possible to install the very latest nightly build from the R command prompt with

    install.packages("smfsb",repos="http://r-forge.r-project.org")
    

    but you should only do this if you have a good reason to, in order not to overload the R-Forge servers (not that I imagine downloads of this package are likely to overload the servers…).

    Once installed, the package can be loaded ready for use with

    library(smfsb)
    

    Accessing documentation

    I have tried to ensure that the package and all associated functions and datasets are properly documented with runnable examples. So,

    help(package="smfsb")
    

    will give a brief overview of the package and a complete list of all functions. The list of vignettes associated with the package can be obtained with

    vignette(package="smfsb")
    

    At the time of writing, the introductory vignette is the only one available, and can be accessed from the R command line with

    vignette("smfsb",package="smfsb")
    

    Help on functions can be obtained using the usual R mechanisms. For example, help on the function StepGillespie can be obtained with

    ?StepGillespie
    

    and the associated example can be run with

    example(StepGillespie)
    

    The source code for the function can be obtained by typing StepGillespie on a line by itself. In this case, it returns the following R code:

    function (N) 
    {
        S = t(N$Post - N$Pre)
        v = ncol(S)
        return(function(x0, t0, deltat, ...) {
            t = t0
            x = x0
            termt = t0 + deltat
            repeat {
                h = N$h(x, t, ...)
                h0 = sum(h)
                if (h0 < 1e-10)
                    t = 1e+99 
                else if (h0 > 1e+06) {
                    t = 1e+99
                    warning("Hazard too big - terminating simulation!")
                } 
                else 
                    t = t + rexp(1, h0)
                if (t >= termt) 
                    return(x)
                j = sample(v, 1, prob = h)
                x = x + S[, j]
            }
        })
    }
    

    A list of demos associated with the package can be obtained with

    demo(package="smfsb")
    

    A list of data sets associated with the package can be obtained with

    data(package="smfsb")
    

    For example, the small table, mytable, from the introduction to R in Chapter 4 can be loaded with

    data(mytable)
    

    After running this command, the data frame mytable will be accessible, and can be examined by typing

    mytable
    

    at the R command prompt.

    Simulation of stochastic kinetic models

    The main purpose of this package is to provide a collection of tools for building and simulating stochastic kinetic models. This can be illustrated using a simple Lotka-Volterra predator-prey system. First, consider the prey, X_1 and the predator X_2 as a stochastic network, viz

    R_1:\quad X_1 \longrightarrow 2 X_1
    R_2:\quad X_1 + X_2\longrightarrow 2X_2
    R_3:\quad X_2 \longrightarrow \emptyset.

    The first “reaction” represents prey reproduction, the second predator-prey interaction and the third predator death. We can write the stoichiometries of the reactions, together with the rate (or hazard) of each reaction, in tabular form as

    Reaction   Pre         Post        Hazard
               X_1  X_2    X_1  X_2    h()
    R_1        1    0      2    0      \theta_1 x_1
    R_2        1    1      0    2      \theta_2 x_1 x_2
    R_3        0    1      0    0      \theta_3 x_2

    This can be encoded in R as a stochastic Petri net (SPN) using

    # SPN for the Lotka-Volterra system
    LV=list()
    LV$Pre=matrix(c(1,0,1,1,0,1),ncol=2,byrow=TRUE)
    LV$Post=matrix(c(2,0,0,2,0,0),ncol=2,byrow=TRUE)
    LV$h=function(x,t,th=c(th1=1,th2=0.005,th3=0.6))
    {
     with(as.list(c(x,th)),{
             return(c(th1*x1, th2*x1*x2, th3*x2 ))
            })
    }
    

    This object could be created directly by executing

    data(spnModels)
    

    since the LV model is one of the standard demo models included with the package. Functions for simulating from the transition kernel of the Markov process defined by the SPN can be created easily by passing the SPN object into the appropriate constructor. For example, if simulation using the Gillespie algorithm is required, a simulation function can be created with

    stepLV=StepGillespie(LV)
    

    The resulting function (closure) can then be used to advance the state of the process. For example, to simulate the state of the process at time 1, given an initial condition of X_1=50, X_2=100 at time 0, use

    stepLV(c(x1=50,x2=100),0,1)
    

    Alternatively, to simulate a realisation of the process on a regular time grid over the interval [0,100] in steps of 0.1 time units, use

    out = simTs(c(x1=50,x2=100),0,100,0.1,stepLV)
    plot(out,plot.type="single",col=c(2,4))
    

    which plots the simulated prey (x1) and predator (x2) trajectories on a single set of axes.

    See the help and runnable example for the function StepGillespie for further details, including some available alternative simulation algorithms, such as StepCLE.
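To see what the simulation constructors have to work with, it can help to look at the two ingredients of the SPN directly. The sketch below rebuilds the LV object exactly as defined above (so that it runs without the package installed), forms the stoichiometry matrix in the same way as the first line of the StepGillespie source, and evaluates the hazards at the initial state used above. The variable names S and h0 are just my choices for this illustration.

```r
# Lotka-Volterra SPN, as defined above
LV <- list()
LV$Pre <- matrix(c(1, 0, 1, 1, 0, 1), ncol = 2, byrow = TRUE)
LV$Post <- matrix(c(2, 0, 0, 2, 0, 0), ncol = 2, byrow = TRUE)
LV$h <- function(x, t, th = c(th1 = 1, th2 = 0.005, th3 = 0.6)) {
  with(as.list(c(x, th)), {
    return(c(th1 * x1, th2 * x1 * x2, th3 * x2))
  })
}
# Stoichiometry matrix: one row per species, one column per reaction
S <- t(LV$Post - LV$Pre)
print(S)
# Reaction hazards at the initial state x1=50, x2=100, time 0
h0 <- LV$h(c(x1 = 50, x2 = 100), 0)
print(h0)
```

Each column of S is the change in state caused by one reaction event, so the first column, (1, 0), says that R_1 adds one prey, and the hazards 50, 25 and 60 are the rates at which the three reactions fire in the initial state.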

    Inference for stochastic kinetic models from time course data

    Estimating the parameters of stochastic kinetic models using noisy time course measurements on some aspect of the system state is a very important problem. Wilkinson (2011) takes a Bayesian approach to the problem, using particle MCMC methodology. For this, a key aspect is the use of a particle filter to compute an unbiased estimate of the marginal likelihood. This is accomplished using the function pfMLLik. Once a method is available for generating unbiased estimates of the marginal likelihood, this may be embedded into a fairly standard marginal Metropolis-Hastings algorithm for parameter estimation. See the help and runnable example for pfMLLik for further details, along with the particle MCMC demo, which can be run using demo(PMCMC). I’ll discuss more about particle MCMC and rate parameter inference in the next post.

    References

  • Wilkinson, D. J. (2006) Stochastic Modelling for Systems Biology, Boca Raton, Florida: Chapman & Hall/CRC Press.
  • Wilkinson, D. J. (2011) Stochastic Modelling for Systems Biology, second edition, Boca Raton, Florida: Chapman & Hall/CRC Press.