## The scala-smfsb library

In the previous post I gave a very quick introduction to the smfsb R package. As mentioned in that post, although good for teaching and learning, R isn’t a great language for serious scientific computing or computational statistics. So for the publication of the third edition of my textbook, Stochastic modelling for systems biology, I have created a library in the Scala programming language replicating the functionality provided by the R package. Here I will give a very quick introduction to the scala-smfsb library. Some familiarity with both Scala and the smfsb R package will be helpful, but is not strictly necessary. Note that the library relies on the Scala Breeze library for linear algebra and probability distributions, so some familiarity with that library can also be helpful.

## Setup

To follow the along you need to have Sbt installed, and this in turn requires a recent JDK. If you are new to Scala, you may find the setup page for my Scala course to be useful, but note that on many Linux systems it can be as simple as installing the packages openjdk-8-jdk and sbt.

Once you have Sbt installed, you should be able to run it by entering sbt at your OS command line. You now need to use Sbt to create a Scala REPL with a dependency on the scala-smfsb library. There are many ways to do this, but if you are new to Scala, the simplest way is probably to start up Sbt from an empty or temporary directory (which doesn’t contain any Scala code), and then paste the following into the Sbt prompt:

set libraryDependencies += "com.github.darrenjw" %% "scala-smfsb" % "0.6"
set libraryDependencies += "org.scalanlp" %% "breeze-viz" % "0.13.2"
set scalaVersion := "2.12.6"
set scalacOptions += "-Yrepl-class-based"
console


The first time you run this it will take a little while to download and cache various library dependencies. But everything is cached, so it should be much quicker in future. When it is finished, you should have a Scala REPL ready to enter Scala code.

## An introduction to scala-smfsb

It should be possible to type or copy-and-paste the commands below one-at-a-time into the Scala REPL. We need to start with a few imports.

import smfsb._
import breeze.linalg.{Vector => BVec, _}
import breeze.numerics._
import breeze.plot._


Note that I’ve renamed Breeze’s Vector type to BVec to avoid clashing with that in the Scala standard library. We are now ready to go.

### Simulating models

Let’s begin by instantiating a Lotka-Volterra model, simulating a single realisation of the process, and then plotting it.

// Simulate LV with Gillespie
val model = SpnModels.lv[IntState]()
val step = Step.gillespie(model)
val ts = Sim.ts(DenseVector(50, 100), 0.0, 20.0, 0.05, step)
Sim.plotTs(ts, "Gillespie simulation of LV model with default parameters")


The library comes with a few other models. There’s a Michaelis-Menten enzyme kinetics model:

// Simulate other models with Gillespie
val stepMM = Step.gillespie(SpnModels.mm[IntState]())
val tsMM = Sim.ts(DenseVector(301,120,0,0), 0.0, 100.0, 0.5, stepMM)
Sim.plotTs(tsMM, "Gillespie simulation of the MM model")


and an auto-regulatory genetic network model, for example.

val stepAR = Step.gillespie(SpnModels.ar[IntState]())
val tsAR = Sim.ts(DenseVector(10, 0, 0, 0, 0), 0.0, 500.0, 0.5, stepAR)
Sim.plotTs(tsAR, "Gillespie simulation of the AR model")


If you know the book and/or the R package, these models should all be familiar.
We are not restricted to exact stochastic simulation using the Gillespie algorithm. We can use an approximate Poisson time-stepping algorithm.

// Simulate LV with other algorithms
val stepPts = Step.pts(model)
val tsPts = Sim.ts(DenseVector(50, 100), 0.0, 20.0, 0.05, stepPts)
Sim.plotTs(tsPts, "Poisson time-step simulation of the LV model")


Alternatively, we can instantiate the example models using a continuous state rather than a discrete state, and then simulate using algorithms based on continous approximations, such as Euler-Maruyama simulation of a chemical Langevin equation (CLE) approximation.

val stepCle = Step.cle(SpnModels.lv[DoubleState]())
val tsCle = Sim.ts(DenseVector(50.0, 100.0), 0.0, 20.0, 0.05, stepCle)
Sim.plotTs(tsCle, "Euler-Maruyama/CLE simulation of the LV model")


If we want to ignore noise temporarily, there’s also a simple continuous deterministic Euler integrator built-in.

val stepE = Step.euler(SpnModels.lv[DoubleState]())
val tsE = Sim.ts(DenseVector(50.0, 100.0), 0.0, 20.0, 0.05, stepE)
Sim.plotTs(tsE, "Continuous-deterministic Euler simulation of the LV model")


### Spatial stochastic reaction-diffusion simulation

We can do 1d reaction-diffusion simulation with something like:

val N = 50; val T = 40.0
val model = SpnModels.lv[IntState]()
val step = Spatial.gillespie1d(model,DenseVector(0.8, 0.8))
val x00 = DenseVector(0, 0)
val x0 = DenseVector(50, 100)
val xx00 = Vector.fill(N)(x00)
val xx0 = xx00.updated(N/2,x0)
val output = Sim.ts(xx0, 0.0, T, 0.2, step)
Spatial.plotTs1d(output)


For 2d simulation, we use PMatrix, a comonadic matrix/image type defined within the library, with parallelised map and coflatMap (cobind) operations. See my post on comonads for scientific computing for further details on the concepts underpinning this, though note that it isn’t necessary to understand comonads to use the library.

val r = 20; val c = 30
val model = SpnModels.lv[DoubleState]()
val step = Spatial.cle2d(model, DenseVector(0.6, 0.6), 0.05)
val x00 = DenseVector(0.0, 0.0)
val x0 = DenseVector(50.0, 100.0)
val xx00 = PMatrix(r, c, Vector.fill(r*c)(x00))
val xx0 = xx00.updated(c/2, r/2, x0)
val output = step(xx0, 0.0, 8.0)
val f = Figure("2d LV reaction-diffusion simulation")
val p0 = f.subplot(2, 1, 0)
p0 += image(PMatrix.toBDM(output map (_.data(0))))
val p1 = f.subplot(2, 1, 1)
p1 += image(PMatrix.toBDM(output map (_.data(1))))


### Bayesian parameter inference

The library also includes functions for carrying out parameter inference for stochastic dynamical systems models, using particle MCMC, ABC and ABC-SMC. See the examples directory for further details.

## Next steps

Having worked through this post, the next step is to work through the tutorial. There is some overlap of content with this blog post, but the tutorial goes into more detail regarding the basics. It also finishes with suggestions for how to proceed further.

## Source

This post started out as a tut document (the Scala equivalent of an RMarkdown document). The source can be found here.

## Stochastic Modelling for Systems Biology, third edition

The third edition of my textbook, Stochastic Modelling for Systems Biology has recently been published by Chapman & Hall/CRC Press. The book has ISBN-10 113854928-2 and ISBN-13 978-113854928-9. It can be ordered from CRC Press, Amazon.com, Amazon.co.uk and similar book sellers.

I was fairly happy with the way that the second edition, published in 2011, turned out, and so I haven’t substantially re-written any of the text for the third edition. Instead, I’ve concentrated on adding in new material and improving the associated on-line resources. Those on-line resources are all free and open source, and hence available to everyone, irrespective of whether you have a copy of the new edition. I’ll give an introduction to those resources below (and in subsequent posts). The new material can be briefly summarised as follows:

• New chapter on spatially extended systems, covering the spatial Gillespie algorithm for reaction diffusion master equation (RDME) models in 1- and 2-d, the next subvolume method, spatial CLE, scaling issues, etc.
• Significantly expanded chapter on inference for stochastic kinetic models from data, covering approximate methods of inference (ABC), including ABC-SMC. The material relating to particle MCMC has also been improved and extended.
• Updated R package, including code relating to all of the new material
• New R package for parsing SBML models into simulatable stochastic Petri net models
• New software library, written in Scala, replicating most of the functionality of the R packages in a fast, compiled, strongly typed, functional language

## New content

Although some minor edits and improvements have been made throughout the text, there are two substantial new additions to the text in this new edition. The first is an entirely new chapter on spatially extended systems. The first two editions of the text focused on the implications of discreteness and stochasticity in chemical reaction systems, but maintained the well-mixed assumption throughout. This is a reasonable first approach, since discreteness and stochasticity are most pronounced in very small volumes where diffusion should be rapid. In any case, even these non-spatial models have very interesting behaviour, and become computationally challenging very quickly for non-trivial reaction networks. However, we know that, in fact, the cell is a very crowded environment, and so even at small spatial scales, many interesting processes are diffusion limited. It therefore seems appropriate to dedicate one chapter (the new Chapter 9) to studying some of the implications of relaxing the well-mixed assumption. Entire books can be written on stochastic reaction-diffusion systems, so here only a brief introduction is provided, based mainly around models in the reaction-diffusion master equation (RDME) style. Exact stochastic simulation algorithms are discussed, and implementations provided in the 1- and 2-d cases, and an appropriate Langevin approximation is examined, the spatial CLE.

The second major addition is to the chapter on inference for stochastic kinetic models from data (now Chapter 11). The second edition of the book included a discussion of “likelihood free” Bayesian MCMC methods for inference, and provided a working implementation of likelihood free particle marginal Metropolis-Hastings (PMMH) for stochastic kinetic models. The third edition improves on that implementation, and discusses approximate Bayesian computation (ABC) as an alternative to MCMC for likelihood free inference. Implementation issues are discussed, and sequential ABC approaches are examined, concentrating in particular on the method known as ABC-SMC.

## New software and on-line resources

Accompanying the text are new and improved on-line resources, all well-documented, free, and open source.

### New website/GitHub repo

Information and materials relating to the previous editions were kept on my University website. All materials relating to this new edition are kept in a public GitHub repo: darrenjw/smfsb. This will be simpler to maintain, and will make it much easier for people to make copies of the material for use and studying off-line.

### Updated R package(s)

Along with the second edition of the book I released an accompanying R package, “smfsb”, published on CRAN. This was a very popular feature, allowing anyone with R to trivially experiment with all of the models and algorithms discussed in the text. This R package has been updated, and a new version has been published to CRAN. The updates are all backwards-compatible with the version associated with the second edition of the text, so owners of that edition can still upgrade safely. I’ll give a proper introduction to the package, including the new features, in a subsequent post, but in the meantime, you can install/upgrade the package from a running R session with

install.packages("smfsb")


and then pop up a tutorial vignette with:

vignette("smfsb")


This should be enough to get you started.

In addition to the main R package, there is an additional R package for parsing SBML models into models that can be simulated within R. This package is not on CRAN, due to its dependency on a non-CRAN package. See the repo for further details.

There are also Python scripts available for converting SBML models to and from the shorthand SBML notation used in the text.

### New Scala library

Another major new resource associated with the third edition of the text is a software library written in the Scala programming language. This library provides Scala implementations of all of the algorithms discussed in the book and implemented in the associated R packages. This then provides example implementations in a fast, efficient, compiled language, and is likely to be most useful for people wanting to use the methods in the book for research. Again, I’ll provide a tutorial introduction to this library in a subsequent post, but it is well-documented, with all necessary information needed to get started available at the scala-smfsb repo/website, including a step-by-step tutorial and some additional examples.

## Introduction

Rainier is an interesting new probabilistic programming library for Scala recently open-sourced by Stripe. Probabilistic programming languages provide a computational framework for building and fitting Bayesian models to data. There are many interesting probabilistic programming languages, and there is currently a lot of interesting innovation happening with probabilistic programming languages embedded in strongly typed functional programming languages such as Scala and Haskell. However, most such languages tend to be developed by people lacking expertise in statistics and numerics, leading to elegant, composable languages which work well for toy problems, but don’t scale well to the kinds of practical problems that applied statisticians are interested in. Conversely, there are a few well-known probabilistic programming languages developed by and for statisticians which have efficient inference engines, but are hampered by inflexible, inelegant languages and APIs. Rainier is interesting because it is an attempt to bridge the gap between these two worlds: it has a functional, composable, extensible, monadic API, yet is backed by a very efficient, high-performance scalable inference engine, using HMC and a static compute graph for reverse-mode AD. Clearly there will be some loss of generality associated with choosing an efficient inference algorithm (eg. for HMC, there needs to be a fixed number of parameters and they must all be continuous), but it still covers a large proportion of the class of hierarchical models commonly used in applied statistical modelling.

In this post I’ll give a quick introduction to Rainier using an interactive session requiring only that SBT is installed and the Rainier repo is downloaded or cloned.

## Interactive session

To follow along with this post just clone, or download and unpack, the Rainier repo, and run SBT from the top-level Rainier directory and paste commands. First start a Scala REPL.

project rainierPlot
console


Before we start building models, we need some data. For this post we will focus on a simple logistic regression model, and so we will begin by simulating some synthetic data consistent with such a model.

val r = new scala.util.Random(0)
val N = 1000
val beta0 = 0.1
val beta1 = 0.3
val x = (1 to N) map { i =>
3.0 * r.nextGaussian
}
val theta = x map { xi =>
beta0 + beta1 * xi
}
def expit(x: Double): Double = 1.0 / (1.0 + math.exp(-x))
val p = theta map expit
val y = p map (pi => (r.nextDouble < pi))


Now we have some synthetic data, we can fit the model and see if we are able to recover the “true” parameters used to generate the synthetic data. In Rainier, we build models by declaring probabilistic programs for the model and the data, and then run an inference engine to generate samples from the posterior distribution.

import com.stripe.rainier.compute._
import com.stripe.rainier.core._
import com.stripe.rainier.sampler._
import com.stripe.rainier.repl._


Now we want to build a model. We do so by describing the joint distribution of parameters and data. Rainier has a few built-in distributions, and these can be combined using standard functional monadic combinators such as map, zip, flatMap, etc., to create a probabilistic program representing a probability monad for the model. Due to the monadic nature of such probabilistic programs, it is often most natural to declare them using a for-expression.

val model = for {
beta0 <- Normal(0, 5).param
beta1 <- Normal(0, 5).param
_ <- Predictor.from{x: Double =>
{
val theta = beta0 + beta1 * x
val p = Real(1.0) / (Real(1.0) + (Real(0.0) - theta).exp)
Categorical.boolean(p)
}
}.fit(x zip y)
} yield Map("b0"->beta0, "b1"->beta1)


This kind of construction is very natural for anyone familiar with monadic programming in Scala, but will no doubt be a little mysterious otherwise. RandomVariable is the probability monad used for HMC sampling, and these can be constructed from Distributions using .param (for unobserved parameters) and .fit (for variables with associated observations). Predictor is just a convenience for observations corresponding to covariate information. model is therefore a RandomVariable over beta0 and beta1, the two unobserved parameters of interest. Note that I briefly discussed this kind of pure functional approach to describing probabilistic programs (using Rand from Breeze) in my post on MCMC as a stream.

Now we have our probabilistic program, we can sample from it using HMC as follows.

implicit val rng = ScalaRNG(3)
val its = 10000
val thin = 5
val out = model.sample(HMC(5), 10000, its*thin, thin)
println(out.take(10))


The argument to HMC() is the number of leapfrog steps to take per iteration.

Finally, we can use EvilPlot to look at the HMC output and check that we have managed to reasonably recover the true parameters associated with our synthetic data.

import com.cibo.evilplot.geometry.Extent
import com.stripe.rainier.plot.EvilTracePlot._

render(traces(out, truth = Map("b0" -> beta0, "b1" -> beta1)),
"traceplots.png", Extent(1200, 1000))
render(pairs(out, truth = Map("b0" -> beta0, "b1" -> beta1)), "pairs.png")


Everything looks good, and the sampling is very fast!

For further information, see the Rainier repo. In particular, start with the tour of Rainier’s core, which gives a more detailed introduction to how Rainier works than this post. Those interested in how the efficient AD works may want to read about the compute graph, and the implementation notes explain how it all fits together. There is some basic ScalaDoc for the core package, and also some examples (including this one), and there’s a gitter channel for asking questions. This is a very new project, so there are a few minor bugs and wrinkles in the initial release, but development is progressing rapidly, so I fully expect the library to get properly battle-hardened over the next few months.

For those unfamiliar with the monadic approach to probabilistic programming, then Ścibior et al (2015) is probably a good starting point.

## One-way ANOVA with fixed and random effects from a Bayesian perspective

This blog post is derived from a computer practical session that I ran as part of my new course on Statistics for Big Data, previously discussed. This course covered a lot of material very quickly. In particular, I deferred introducing notions of hierarchical modelling until the Bayesian part of the course, where I feel it is more natural and powerful. However, some of the terminology associated with hierarchical statistical modelling probably seems a bit mysterious to those without a strong background in classical statistical modelling, and so this practical session was intended to clear up some potential confusion. I will analyse a simple one-way Analysis of Variance (ANOVA) model from a Bayesian perspective, making sure to highlight the difference between fixed and random effects in a Bayesian context where everything is random, as well as emphasising the associated identifiability issues. R code is used to illustrate the ideas.

### Example scenario

We will consider the body mass index (BMI) of new male undergraduate students at a selection of UK Universities. Let us suppose that our data consist of measurements of (log) BMI for a random sample of 1,000 males at each of 8 Universities. We are interested to know if there are any differences between the Universities. Again, we want to model the process as we would simulate it, so thinking about how we would simulate such data is instructive. We start by assuming that the log BMI is a normal random quantity, and that the variance is common across the Universities in question (this is quite a big assumption, and it is easy to relax). We assume that the mean of this normal distribution is University-specific, but that we do not have strong prior opinions regarding the way in which the Universities differ. That said, we expect that the Universities would not be very different from one another.

### Simulating data

A simple simulation of the data with some plausible parameters can be carried out as follows.

set.seed(1)
Z=matrix(rnorm(1000*8,3.1,0.1),nrow=8)
RE=rnorm(8,0,0.01)
X=t(Z+RE)
colnames(X)=paste("Uni",1:8,sep="")
Data=stack(data.frame(X))
boxplot(exp(values)~ind,data=Data,notch=TRUE)


Make sure that you understand exactly what this code is doing before proceeding. The boxplot showing the simulated data is given below.

### Frequentist analysis

We will start with a frequentist analysis of the data. The model we would like to fit is

$y_{ij} = \mu + \theta_i + \varepsilon_{ij}$

where i is an indicator for the University and j for the individual within a particular University. The “effect”, $\theta_i$ represents how the ith University differs from the overall mean. We know that this model is not actually identifiable when the model parameters are all treated as “fixed effects”, but R will handle this for us.

> mod=lm(values~ind,data=Data)
> summary(mod)

Call:
lm(formula = values ~ ind, data = Data)

Residuals:
Min       1Q   Median       3Q      Max
-0.36846 -0.06778 -0.00069  0.06910  0.38219

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept)  3.101068   0.003223 962.244  < 2e-16 ***
indUni2     -0.006516   0.004558  -1.430 0.152826
indUni3     -0.017168   0.004558  -3.767 0.000166 ***
indUni4      0.017916   0.004558   3.931 8.53e-05 ***
indUni5     -0.022838   0.004558  -5.011 5.53e-07 ***
indUni6     -0.001651   0.004558  -0.362 0.717143
indUni7      0.007935   0.004558   1.741 0.081707 .
indUni8      0.003373   0.004558   0.740 0.459300
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.1019 on 7992 degrees of freedom
Multiple R-squared:  0.01439,	Adjusted R-squared:  0.01353
F-statistic: 16.67 on 7 and 7992 DF,  p-value: < 2.2e-16


We see that R has handled the identifiability problem using “treatment contrasts”, dropping the fixed effect for the first university, so that the intercept actually represents the mean value for the first University, and the effects for the other Univeristies represent the differences from the first University. If we would prefer to impose a sum constraint, then we can switch to sum contrasts with

options(contrasts=rep("contr.sum",2))


and then re-fit the model.

> mods=lm(values~ind,data=Data)
> summary(mods)

Call:
lm(formula = values ~ ind, data = Data)

Residuals:
Min       1Q   Median       3Q      Max
-0.36846 -0.06778 -0.00069  0.06910  0.38219

Coefficients:
Estimate Std. Error  t value Pr(>|t|)
(Intercept)  3.0986991  0.0011394 2719.558  < 2e-16 ***
ind1         0.0023687  0.0030146    0.786 0.432048
ind2        -0.0041477  0.0030146   -1.376 0.168905
ind3        -0.0147997  0.0030146   -4.909 9.32e-07 ***
ind4         0.0202851  0.0030146    6.729 1.83e-11 ***
ind5        -0.0204693  0.0030146   -6.790 1.20e-11 ***
ind6         0.0007175  0.0030146    0.238 0.811889
ind7         0.0103039  0.0030146    3.418 0.000634 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.1019 on 7992 degrees of freedom
Multiple R-squared:  0.01439,	Adjusted R-squared:  0.01353
F-statistic: 16.67 on 7 and 7992 DF,  p-value: < 2.2e-16


This has 7 degrees of freedom for the effects, as before, but ensures that the 8 effects sum to precisely zero. This is arguably more interpretable in this case.

### Bayesian analysis

We will now analyse the simulated data from a Bayesian perspective, using JAGS.

#### Fixed effects

All parameters in Bayesian models are uncertain, and therefore random, so there is much confusion regarding the difference between “fixed” and “random” effects in a Bayesian context. For “fixed” effects, our prior captures the idea that we sample the effects independently from a “fixed” (typically vague) prior distribution. We could simply code this up and fit it in JAGS as follows.

require(rjags)
n=dim(X)[1]
p=dim(X)[2]
data=list(X=X,n=n,p=p)
init=list(mu=2,tau=1)
modelstring="
model {
for (j in 1:p) {
theta[j]~dnorm(0,0.0001)
for (i in 1:n) {
X[i,j]~dnorm(mu+theta[j],tau)
}
}
mu~dnorm(0,0.0001)
tau~dgamma(1,0.0001)
}
"
model=jags.model(textConnection(modelstring),data=data,inits=init)
update(model,n.iter=1000)
output=coda.samples(model=model,variable.names=c("mu","tau","theta"),n.iter=100000,thin=10)
print(summary(output))
plot(output)
autocorr.plot(output)
pairs(as.matrix(output))
crosscorr.plot(output)


On running the code we can clearly see that this naive approach leads to high posterior correlation between the mean and the effects, due to the fundamental lack of identifiability of the model. This also leads to MCMC mixing problems, but it is important to understand that this computational issue is conceptually entirely separate from the fundamental statisticial identifiability issue. Even if we could avoid MCMC entirely, the identifiability issue would remain.

A quick fix for the identifiability issue is to use “treatment contrasts”, just as for the frequentist model. We can implement that as follows.

data=list(X=X,n=n,p=p)
init=list(mu=2,tau=1)
modelstring="
model {
for (j in 1:p) {
for (i in 1:n) {
X[i,j]~dnorm(mu+theta[j],tau)
}
}
theta[1]<-0
for (j in 2:p) {
theta[j]~dnorm(0,0.0001)
}
mu~dnorm(0,0.0001)
tau~dgamma(1,0.0001)
}
"
model=jags.model(textConnection(modelstring),data=data,inits=init)
update(model,n.iter=1000)
output=coda.samples(model=model,variable.names=c("mu","tau","theta"),n.iter=100000,thin=10)
print(summary(output))
plot(output)
autocorr.plot(output)
pairs(as.matrix(output))
crosscorr.plot(output)


Running this we see that the model now works perfectly well, mixes nicely, and gives sensible inferences for the treatment effects.

Another source of confusion for models of this type is data formating and indexing in JAGS models. For our balanced data there was not problem passing in data to JAGS as a matrix and specifying the model using nested loops. However, for unbalanced designs this is not necessarily so convenient, and so then it can be helpful to specify the model based on two-column data, as we would use for fitting using lm(). This is illustrated with the following model specification, which is exactly equivalent to the previous model, and should give identical (up to Monte Carlo error) results.

N=n*p
data=list(y=Data$values,g=Data$ind,N=N,p=p)
init=list(mu=2,tau=1)
modelstring="
model {
for (i in 1:N) {
y[i]~dnorm(mu+theta[g[i]],tau)
}
theta[1]<-0
for (j in 2:p) {
theta[j]~dnorm(0,0.0001)
}
mu~dnorm(0,0.0001)
tau~dgamma(1,0.0001)
}
"
model=jags.model(textConnection(modelstring),data=data,inits=init)
update(model,n.iter=1000)
output=coda.samples(model=model,variable.names=c("mu","tau","theta"),n.iter=100000,thin=10)
print(summary(output))
plot(output)


As suggested above, this indexing scheme is much more convenient for unbalanced data, and hence widely used. However, since our data is balanced here, we will revert to the matrix approach for the remainder of the post.

One final thing to consider before moving on to random effects is the sum-contrast model. We can implement this in various ways, but I’ve tried to encode it for maximum clarity below, imposing the sum-to-zero constraint via the final effect.

data=list(X=X,n=n,p=p)
init=list(mu=2,tau=1)
modelstring="
model {
for (j in 1:p) {
for (i in 1:n) {
X[i,j]~dnorm(mu+theta[j],tau)
}
}
for (j in 1:(p-1)) {
theta[j]~dnorm(0,0.0001)
}
theta[p] <- -sum(theta[1:(p-1)])
mu~dnorm(0,0.0001)
tau~dgamma(1,0.0001)
}
"
model=jags.model(textConnection(modelstring),data=data,inits=init)
update(model,n.iter=1000)
output=coda.samples(model=model,variable.names=c("mu","tau","theta"),n.iter=100000,thin=10)
print(summary(output))
plot(output)


Again, this works perfectly well and gives similar results to the frequentist analysis.

#### Random effects

The key difference between fixed and random effects in a Bayesian framework is that random effects are not independent, being drawn from a distribution with parameters which are not fixed. Essentially, there is another level of hierarchy involved in the specification of the random effects. This is best illustrated by example. A random effects model for this problem is given below.

data=list(X=X,n=n,p=p)
init=list(mu=2,tau=1)
modelstring="
model {
for (j in 1:p) {
theta[j]~dnorm(0,taut)
for (i in 1:n) {
X[i,j]~dnorm(mu+theta[j],tau)
}
}
mu~dnorm(0,0.0001)
tau~dgamma(1,0.0001)
taut~dgamma(1,0.0001)
}
"
model=jags.model(textConnection(modelstring),data=data,inits=init)
update(model,n.iter=1000)
output=coda.samples(model=model,variable.names=c("mu","tau","taut","theta"),n.iter=100000,thin=10)
print(summary(output))
plot(output)


The only difference between this and our first naive attempt at a Bayesian fixed effects model is that we have put a gamma prior on the precision of the effect. Note that this model now runs and fits perfectly well, with reasonable mixing, and gives sensible parameter inferences. Although the effects here are not constrained to sum-to-zero, like in the case of sum contrasts for a fixed effects model, the prior encourages shrinkage towards zero, and so the random effect distribution can be thought of as a kind of soft version of a hard sum-to-zero constraint. From a predictive perspective, this model is much more powerful. In particular, using a random effects model, we can make strong predictions for unobserved groups (eg. a ninth University), with sensible prediction intervals based on our inferred understanding of how similar different universities are. Using a fixed effects model this isn’t really possible. Even for a Bayesian version of a fixed effects model using proper (but vague) priors, prediction intervals for unobserved groups are not really sensible.

Since we have used simulated data here, we can compare the estimated random effects with the true effects generated during the simulation.

> apply(as.matrix(output),2,mean)
mu           tau          taut      theta[1]      theta[2]
3.098813e+00  9.627110e+01  7.015976e+03  2.086581e-03 -3.935511e-03
theta[3]      theta[4]      theta[5]      theta[6]      theta[7]
-1.389099e-02  1.881528e-02 -1.921854e-02  5.640306e-04  9.529532e-03
theta[8]
5.227518e-03
> RE
[1]  0.002637034 -0.008294518 -0.014616348  0.016839902 -0.015443243
[6] -0.001908871  0.010162117  0.005471262


We see that the Bayesian random effects model has done an excellent job of estimation. If we wished, we could relax the assumption of common variance across the groups by making tau a vector indexed by j, though there is not much point in persuing this here, since we know that the groups do all have the same variance.

#### Strong subjective priors

The above is the usual story regarding fixed and random effects in Bayesian inference. I hope this is reasonably clear, so really I should quit while I’m ahead… However, the issues are really a bit more subtle than I’ve suggested. The inferred precision of the random effects was around 7,000, so now lets re-run the original, naive, “fixed effects” model with a strong subjective Bayesian prior on the distribution of the effects.

data=list(X=X,n=n,p=p)
init=list(mu=2,tau=1)
modelstring="
model {
for (j in 1:p) {
theta[j]~dnorm(0,7000)
for (i in 1:n) {
X[i,j]~dnorm(mu+theta[j],tau)
}
}
mu~dnorm(0,0.0001)
tau~dgamma(1,0.0001)
}
"
model=jags.model(textConnection(modelstring),data=data,inits=init)
update(model,n.iter=1000)
output=coda.samples(model=model,variable.names=c("mu","tau","theta"),n.iter=100000,thin=10)
print(summary(output))
plot(output)


This model also runs perfectly well and gives sensible inferences, despite the fact that the effects are iid from a fixed distribution and there is no hard constraint on the effects. Similarly, we can make sensible predictions, together with appropriate prediction intervals, for an unobserved group. So it isn’t so much the fact that the effects are coupled via an extra level of hierarchy that makes things work. It’s really the fact that the effects are sensibly distributed and not just sampled directly from a vague prior. So for “real” subjective Bayesians the line between fixed and random effects is actually very blurred indeed…

## Introduction

In the previous post I explained how one can use an unbiased estimate of marginal likelihood derived from a particle filter within a Metropolis-Hastings MCMC algorithm in order to construct an exact pseudo-marginal MCMC scheme for the posterior distribution of the model parameters given some time course data. This idea is closely related to that of the particle marginal Metropolis-Hastings (PMMH) algorithm of Andreiu et al (2010), but not really exactly the same. This is because for a Bayesian model with parameters $\theta$, latent variables $x$ and data $y$, of the form

$\displaystyle p(\theta,x,y) = p(\theta)p(x|\theta)p(y|x,\theta),$

the pseudo-marginal algorithm which exploits the fact that the particle filter’s estimate of likelihood is unbiased is an MCMC algorithm which directly targets the marginal posterior distribution $p(\theta|y)$. On the other hand, the PMMH algorithm is an MCMC algorithm which targets the full joint posterior distribution $p(\theta,x|y)$. Now, the PMMH scheme does reduce to the pseudo-marginal scheme if samples of $x$ are not generated and stored in the state of the Markov chain, and it certainly is the case that the pseudo-marginal algorithm gives some insight into why the PMMH algorithm works. However, the PMMH algorithm is much more powerful, as it solves the “smoothing” and parameter estimation problem simultaneously and exactly, including the “initial value” problem (computing the posterior distribution of the initial state, $x_0$). Below I will describe the algorithm and explain why it works, but first it is necessary to understand the relationship between marginal, joint and “likelihood-free” MCMC updating schemes for such latent variable models.

### MCMC for latent variable models

#### Marginal approach

If we want to target $p(\theta|y)$ directly, we can use a Metropolis-Hastings scheme with a fairly arbitrary proposal distribution for exploring $\theta$, where a new $\theta^\star$ is proposed from $f(\theta^\star|\theta)$ and accepted with probability $\min\{1,A\}$, where

$\displaystyle A = \frac{p(\theta^\star)}{p(\theta)} \times \frac{f(\theta|\theta^\star)}{f(\theta^\star|\theta)} \times \frac{p({y}|\theta^\star)}{p({y}|\theta)}.$

As previously discussed, the problem with this scheme is that the marginal likelihood $p(y|\theta)$ required in the acceptance ratio is often difficult to compute.

#### Likelihood-free MCMC

A simple “likelihood-free” scheme targets the full joint posterior distribution $p(\theta,x|y)$. It works by exploiting the fact that we can often simulate from the model for the latent variables $p(x|\theta)$ even when we can’t evaluate it, or marginalise $x$ out of the problem. Here the Metropolis-Hastings proposal is constructed in two stages. First, a proposed new $\theta^\star$ is sampled from $f(\theta^\star|\theta)$ and then a corresponding $x^\star$ is simulated from the model $p(x^\star|\theta^\star)$. The pair $(\theta^\star,x^\star)$ is then jointly accepted with ratio

$\displaystyle A = \frac{p(\theta^\star)}{p(\theta)} \times \frac{f(\theta|\theta^\star)}{f(\theta^\star|\theta)} \times \frac{p(y|{x}^\star,\theta^\star)}{p(y|{x},\theta)}.$

The proposal mechanism ensures that the proposed $x^\star$ is consistent with the proposed $\theta^\star$, and so the procedure can work provided that the dimension of the data $y$ is low. However, in order to work well more generally, we would want the proposed latent variables to be consistent with the data as well as the model parameters.

#### Ideal joint update

Motivated by the likelihood-free scheme, we would really like to target the joint posterior $p(\theta,x|y)$ by first proposing $\theta^\star$ from $f(\theta^\star|\theta)$ and then a corresponding $x^\star$ from the conditional distribution $p(x^\star|\theta^\star,y)$. The pair $(\theta^\star,x^\star)$ is then jointly accepted with ratio

$\displaystyle A = \frac{p(\theta^\star)}{p(\theta)} \frac{p({x}^\star|\theta^\star)}{p({x}|\theta)} \frac{f(\theta|\theta^\star)}{f(\theta^\star|\theta)} \frac{p(y|{x}^\star,\theta^\star)}{p(y|{x},\theta)} \frac{p({x}|y,\theta)}{p({x}^\star|y,\theta^\star)}\\ \qquad = \frac{p(\theta^\star)}{p(\theta)} \frac{p(y|\theta^\star)}{p(y|\theta)} \frac{f(\theta|\theta^\star)}{f(\theta^\star|\theta)}.$

Notice how the acceptance ratio simplifies, using the basic marginal likelihood identity (BMI) of Chib (1995), and $x$ drops out of the ratio completely in order to give exactly the ratio used for the marginal updating scheme. Thus, the “ideal” joint updating scheme reduces to the marginal updating scheme if $x$ is not sampled and stored as a component of the Markov chain.

Understanding the relationship between these schemes is useful for understanding the PMMH algorithm. Indeed, we will see that the “ideal” joint updating scheme (and the marginal scheme) corresponds to PMMH using infinitely many particles in the particle filter, and that the likelihood-free scheme corresponds to PMMH using exactly one particle in the particle filter. For an intermediate number of particles, the PMMH scheme is a compromise between the “ideal” scheme and the “blind” likelihood-free scheme, but is always likelihood-free (when used with a bootstrap particle filter) and always has an acceptance ratio leaving the exact posterior invariant.

### The PMMH algorithm

#### The algorithm

The PMMH algorithm is an MCMC algorithm for state space models jointly updating $\theta$ and $x_{0:T}$, as the algorithms above. First, a proposed new $\theta^\star$ is generated from a proposal $f(\theta^\star|\theta)$, and then a corresponding $x_{0:T}^\star$ is generated by running a bootstrap particle filter (as described in the previous post, and below) using the proposed new model parameters, $\theta^\star$, and selecting a single trajectory by sampling once from the final set of particles using the final set of weights. This proposed pair $(\theta^\star,x_{0:T}^\star)$ is accepted using the Metropolis-Hastings ratio

$\displaystyle A = \frac{\hat{p}_{\theta^\star}(y_{1:T})p(\theta^\star)q(\theta|\theta^\star)}{\hat{p}_{\theta}(y_{1:T})p(\theta)q(\theta^\star|\theta)},$

where $\hat{p}_{\theta^\star}(y_{1:T})$ is the particle filter’s (unbiased) estimate of marginal likelihood, described in the previous post, and below. Note that this approach tends to the perfect joint/marginal updating scheme as the number of particles used in the filter tends to infinity. Note also that for a single particle, the particle filter just blindly forward simulates from $p_\theta(x^\star_{0:T})$ and that the filter’s estimate of marginal likelihood is just the observed data likelihood $p_\theta(y_{1:T}|x^\star_{0:T})$ leading precisely to the simple likelihood-free scheme. To understand for an arbitrary finite number of particles, $M$, one needs to think carefully about the structure of the particle filter.

#### Why it works

To understand why PMMH works, it is necessary to think about the joint distribution of all random variables used in the bootstrap particle filter. To this end, it is helpful to re-visit the particle filter, thinking carefully about the resampling and propagation steps.

First introduce notation for the “particle cloud”: $\mathbf{x}_t=\{x_t^k|k=1,\ldots,M\}$, $\boldsymbol{\pi}_t=\{\pi_t^k|k=1,\ldots,M\}$, $\tilde{\mathbf{x}}_t=\{(x_t^k,\pi_t^k)|k=1,\ldots,M\}$. Initialise the particle filter with $\tilde{\mathbf{x}}_0$, where $x_0^k\sim p(x_0)$ and $\pi_0^k=1/M$ (note that $w_0^k$ is undefined). Now suppose at time $t$ we have a sample from $p(x_t|y_{1:t})$: $\tilde{\mathbf{x}}_t$. First resample by sampling $a_t^k \sim \mathcal{F}(a_t^k|\boldsymbol{\pi}_t)$, $k=1,\ldots,M$. Here we use $\mathcal{F}(\cdot|\boldsymbol{\pi})$ for the discrete distribution on $1:M$ with probability mass function $\boldsymbol{\pi}$. Next sample $x_{t+1}^k\sim p(x_{t+1}^k|x_t^{a_t^k})$. Set $w_{t+1}^k=p(y_{t+1}|x_{t+1}^k)$ and $\pi_{t+1}^k=w_{t+1}^k/\sum_{i=1}^M w_{t+1}^i$. Finally, propagate $\tilde{\mathbf{x}}_{t+1}$ to the next step… We define the filter’s estimate of likelihood as $\hat{p}(y_t|y_{1:t-1})=\frac{1}{M}\sum_{i=1}^M w_t^i$ and $\hat{p}(y_{1:T})=\prod_{i=1}^T \hat{p}(y_t|y_{1:t-1})$. See Doucet et al (2001) for further theoretical background on particle filters and SMC more generally.

Describing the filter carefully as above allows us to write down the joint density of all random variables in the filter as

$\displaystyle \tilde{q}(\mathbf{x}_0,\ldots,\mathbf{x}_T,\mathbf{a}_0,\ldots,\mathbf{a}_{T-1}) = \left[\prod_{k=1}^M p(x_0^k)\right] \left[\prod_{t=0}^{T-1} \prod_{k=1}^M \pi_t^{a_t^k} p(x_{t+1}^k|x_t^{a_t^k}) \right]$

For PMMH we also sample a final index $k'$ from $\mathcal{F}(k'|\boldsymbol{\pi}_T)$ giving the joint density

$\displaystyle \tilde{q}(\mathbf{x}_0,\ldots,\mathbf{x}_T,\mathbf{a}_0,\ldots,\mathbf{a}_{T-1})\pi_T^{k'}$

We write the final selected trajectory as

$\displaystyle x_{0:T}^{k'}=(x_0^{b_0^{k'}},\ldots,x_T^{b_T^{k'}}),$

where $b_t^{k'}=a_t^{b_{t+1}^{k'}}$, and $b_T^{k'}=k'$. If we now think about the structure of the PMMH algorithm, our proposal on the space of all random variables in the problem is in fact

$\displaystyle f(\theta^\star|\theta)\tilde{q}_{\theta^\star}(\mathbf{x}_0^\star,\ldots,\mathbf{x}_T^\star,\mathbf{a}_0^\star,\ldots,\mathbf{a}_{T-1}^\star)\pi_T^{{k'}^\star}$

and by considering the proposal and the acceptance ratio, it is clear that detailed balance for the chain is satisfied by the target with density proportional to

$\displaystyle p(\theta)\hat{p}_\theta(y_{1:T}) \tilde{q}_\theta(\mathbf{x}_0,\ldots,\mathbf{x}_T,\mathbf{a}_0,\ldots,\mathbf{a}_{T-1}) \pi_T^{k'}$

We want to show that this target marginalises down to the correct posterior $p(\theta,x_{0:T}|y_{1:T})$ when we consider just the parameters and the selected trajectory. But if we consider the terms in the joint distribution of the proposal corresponding to the trajectory selected by $k'$, this is given by

$\displaystyle p_\theta(x_0^{b_0^{k'}})\left[\prod_{t=0}^{T-1} \pi_t^{b_t^{k'}} p_\theta(x_{t+1}^{b_{t+1}^{k'}}|x_t^{b_t^{k'}})\right]\pi_T^{k'} = p_\theta(x_{0:T}^{k'})\prod_{t=0}^T \pi_t^{b_t^{k'}}$

which, by expanding the $\pi_t^{b_t^{k'}}$ in terms of the unnormalised weights, simplifies to

$\displaystyle \frac{p_\theta(x_{0:T}^{k'})p_\theta(y_{1:T}|x_{0:T}^{k'})}{M^{T+1}\hat{p}_\theta(y_{1:T})}$

It is worth dwelling on this result, as this is the key insight required to understand why the PMMH algorithm works. The whole point is that the terms in the joint density of the proposal corresponding to the selected trajectory exactly represent the required joint distribution modulo a couple of normalising constants, one of which is the particle filter’s estimate of marginal likelihood. Thus, by including $\hat{p}_\theta(y_{1:T})$ in the acceptance ratio, we knock out the normalising constant, allowing all of the other terms in the proposal to be marginalised away. In other words, the target of the chain can be written as proportional to

$\displaystyle \frac{p(\theta)p_\theta(x_{0:T}^{k'},y_{1:T})}{M^{T+1}} \times \text{(Other terms...)}$

The other terms are all probabilities of random variables which do not occur elsewhere in the target, and hence can all be marginalised away to leave the correct posterior

$\displaystyle p(\theta,x_{0:T}|y_{1:T})$

Thus the PMMH algorithm targets the correct posterior for any number of particles, $M$. Also note the implied uniform distribution on the selected indices in the target.

I will give some code examples in a future post.