## The scala-smfsb library

In the previous post I gave a very quick introduction to the smfsb R package. As mentioned in that post, although good for teaching and learning, R isn’t a great language for serious scientific computing or computational statistics. So for the publication of the third edition of my textbook, Stochastic modelling for systems biology, I have created a library in the Scala programming language replicating the functionality provided by the R package. Here I will give a very quick introduction to the scala-smfsb library. Some familiarity with both Scala and the smfsb R package will be helpful, but is not strictly necessary. Note that the library relies on the Scala Breeze library for linear algebra and probability distributions, so some familiarity with that library can also be helpful.

## Setup

To follow the along you need to have Sbt installed, and this in turn requires a recent JDK. If you are new to Scala, you may find the setup page for my Scala course to be useful, but note that on many Linux systems it can be as simple as installing the packages openjdk-8-jdk and sbt.

Once you have Sbt installed, you should be able to run it by entering sbt at your OS command line. You now need to use Sbt to create a Scala REPL with a dependency on the scala-smfsb library. There are many ways to do this, but if you are new to Scala, the simplest way is probably to start up Sbt from an empty or temporary directory (which doesn’t contain any Scala code), and then paste the following into the Sbt prompt:

set libraryDependencies += "com.github.darrenjw" %% "scala-smfsb" % "0.6"
set libraryDependencies += "org.scalanlp" %% "breeze-viz" % "0.13.2"
set scalaVersion := "2.12.6"
set scalacOptions += "-Yrepl-class-based"
console

The first time you run this it will take a little while to download and cache various library dependencies. But everything is cached, so it should be much quicker in future. When it is finished, you should have a Scala REPL ready to enter Scala code.

## An introduction to scala-smfsb

It should be possible to type or copy-and-paste the commands below one-at-a-time into the Scala REPL. We need to start with a few imports.

import smfsb._
import breeze.linalg.{Vector => BVec, _}
import breeze.numerics._
import breeze.plot._

Note that I’ve renamed Breeze’s Vector type to BVec to avoid clashing with that in the Scala standard library. We are now ready to go.

### Simulating models

Let’s begin by instantiating a Lotka-Volterra model, simulating a single realisation of the process, and then plotting it.

// Simulate LV with Gillespie
val model = SpnModels.lv[IntState]()
val step = Step.gillespie(model)
val ts = Sim.ts(DenseVector(50, 100), 0.0, 20.0, 0.05, step)
Sim.plotTs(ts, "Gillespie simulation of LV model with default parameters")

The library comes with a few other models. There’s a Michaelis-Menten enzyme kinetics model:

// Simulate other models with Gillespie
val stepMM = Step.gillespie(SpnModels.mm[IntState]())
val tsMM = Sim.ts(DenseVector(301,120,0,0), 0.0, 100.0, 0.5, stepMM)
Sim.plotTs(tsMM, "Gillespie simulation of the MM model")

and an auto-regulatory genetic network model, for example.

val stepAR = Step.gillespie(SpnModels.ar[IntState]())
val tsAR = Sim.ts(DenseVector(10, 0, 0, 0, 0), 0.0, 500.0, 0.5, stepAR)
Sim.plotTs(tsAR, "Gillespie simulation of the AR model")

If you know the book and/or the R package, these models should all be familiar.
We are not restricted to exact stochastic simulation using the Gillespie algorithm. We can use an approximate Poisson time-stepping algorithm.

// Simulate LV with other algorithms
val stepPts = Step.pts(model)
val tsPts = Sim.ts(DenseVector(50, 100), 0.0, 20.0, 0.05, stepPts)
Sim.plotTs(tsPts, "Poisson time-step simulation of the LV model")

Alternatively, we can instantiate the example models using a continuous state rather than a discrete state, and then simulate using algorithms based on continous approximations, such as Euler-Maruyama simulation of a chemical Langevin equation (CLE) approximation.

val stepCle = Step.cle(SpnModels.lv[DoubleState]())
val tsCle = Sim.ts(DenseVector(50.0, 100.0), 0.0, 20.0, 0.05, stepCle)
Sim.plotTs(tsCle, "Euler-Maruyama/CLE simulation of the LV model")

If we want to ignore noise temporarily, there’s also a simple continuous deterministic Euler integrator built-in.

val stepE = Step.euler(SpnModels.lv[DoubleState]())
val tsE = Sim.ts(DenseVector(50.0, 100.0), 0.0, 20.0, 0.05, stepE)
Sim.plotTs(tsE, "Continuous-deterministic Euler simulation of the LV model")

### Spatial stochastic reaction-diffusion simulation

We can do 1d reaction-diffusion simulation with something like:

val N = 50; val T = 40.0
val model = SpnModels.lv[IntState]()
val step = Spatial.gillespie1d(model,DenseVector(0.8, 0.8))
val x00 = DenseVector(0, 0)
val x0 = DenseVector(50, 100)
val xx00 = Vector.fill(N)(x00)
val xx0 = xx00.updated(N/2,x0)
val output = Sim.ts(xx0, 0.0, T, 0.2, step)
Spatial.plotTs1d(output)

For 2d simulation, we use PMatrix, a comonadic matrix/image type defined within the library, with parallelised map and coflatMap (cobind) operations. See my post on comonads for scientific computing for further details on the concepts underpinning this, though note that it isn’t necessary to understand comonads to use the library.

val r = 20; val c = 30
val model = SpnModels.lv[DoubleState]()
val step = Spatial.cle2d(model, DenseVector(0.6, 0.6), 0.05)
val x00 = DenseVector(0.0, 0.0)
val x0 = DenseVector(50.0, 100.0)
val xx00 = PMatrix(r, c, Vector.fill(r*c)(x00))
val xx0 = xx00.updated(c/2, r/2, x0)
val output = step(xx0, 0.0, 8.0)
val f = Figure("2d LV reaction-diffusion simulation")
val p0 = f.subplot(2, 1, 0)
p0 += image(PMatrix.toBDM(output map (_.data(0))))
val p1 = f.subplot(2, 1, 1)
p1 += image(PMatrix.toBDM(output map (_.data(1))))

### Bayesian parameter inference

The library also includes functions for carrying out parameter inference for stochastic dynamical systems models, using particle MCMC, ABC and ABC-SMC. See the examples directory for further details.

## Next steps

Having worked through this post, the next step is to work through the tutorial. There is some overlap of content with this blog post, but the tutorial goes into more detail regarding the basics. It also finishes with suggestions for how to proceed further.

## Source

This post started out as a tut document (the Scala equivalent of an RMarkdown document). The source can be found here.

## Introduction

Rainier is an interesting new probabilistic programming library for Scala recently open-sourced by Stripe. Probabilistic programming languages provide a computational framework for building and fitting Bayesian models to data. There are many interesting probabilistic programming languages, and there is currently a lot of interesting innovation happening with probabilistic programming languages embedded in strongly typed functional programming languages such as Scala and Haskell. However, most such languages tend to be developed by people lacking expertise in statistics and numerics, leading to elegant, composable languages which work well for toy problems, but don’t scale well to the kinds of practical problems that applied statisticians are interested in. Conversely, there are a few well-known probabilistic programming languages developed by and for statisticians which have efficient inference engines, but are hampered by inflexible, inelegant languages and APIs. Rainier is interesting because it is an attempt to bridge the gap between these two worlds: it has a functional, composable, extensible, monadic API, yet is backed by a very efficient, high-performance scalable inference engine, using HMC and a static compute graph for reverse-mode AD. Clearly there will be some loss of generality associated with choosing an efficient inference algorithm (eg. for HMC, there needs to be a fixed number of parameters and they must all be continuous), but it still covers a large proportion of the class of hierarchical models commonly used in applied statistical modelling.

In this post I’ll give a quick introduction to Rainier using an interactive session requiring only that SBT is installed and the Rainier repo is downloaded or cloned.

## Interactive session

To follow along with this post just clone, or download and unpack, the Rainier repo, and run SBT from the top-level Rainier directory and paste commands. First start a Scala REPL.

project rainierPlot
console

Before we start building models, we need some data. For this post we will focus on a simple logistic regression model, and so we will begin by simulating some synthetic data consistent with such a model.

val r = new scala.util.Random(0)
val N = 1000
val beta0 = 0.1
val beta1 = 0.3
val x = (1 to N) map { i =>
3.0 * r.nextGaussian
}
val theta = x map { xi =>
beta0 + beta1 * xi
}
def expit(x: Double): Double = 1.0 / (1.0 + math.exp(-x))
val p = theta map expit
val y = p map (pi => (r.nextDouble < pi))

Now we have some synthetic data, we can fit the model and see if we are able to recover the “true” parameters used to generate the synthetic data. In Rainier, we build models by declaring probabilistic programs for the model and the data, and then run an inference engine to generate samples from the posterior distribution.

import com.stripe.rainier.compute._
import com.stripe.rainier.core._
import com.stripe.rainier.sampler._
import com.stripe.rainier.repl._

Now we want to build a model. We do so by describing the joint distribution of parameters and data. Rainier has a few built-in distributions, and these can be combined using standard functional monadic combinators such as map, zip, flatMap, etc., to create a probabilistic program representing a probability monad for the model. Due to the monadic nature of such probabilistic programs, it is often most natural to declare them using a for-expression.

val model = for {
beta0 <- Normal(0, 5).param
beta1 <- Normal(0, 5).param
_ <- Predictor.from{x: Double =>
{
val theta = beta0 + beta1 * x
val p = Real(1.0) / (Real(1.0) + (Real(0.0) - theta).exp)
Categorical.boolean(p)
}
}.fit(x zip y)
} yield Map("b0"->beta0, "b1"->beta1)

This kind of construction is very natural for anyone familiar with monadic programming in Scala, but will no doubt be a little mysterious otherwise. RandomVariable is the probability monad used for HMC sampling, and these can be constructed from Distributions using .param (for unobserved parameters) and .fit (for variables with associated observations). Predictor is just a convenience for observations corresponding to covariate information. model is therefore a RandomVariable over beta0 and beta1, the two unobserved parameters of interest. Note that I briefly discussed this kind of pure functional approach to describing probabilistic programs (using Rand from Breeze) in my post on MCMC as a stream.

Now we have our probabilistic program, we can sample from it using HMC as follows.

implicit val rng = ScalaRNG(3)
val its = 10000
val thin = 5
val out = model.sample(HMC(5), 10000, its*thin, thin)
println(out.take(10))

The argument to HMC() is the number of leapfrog steps to take per iteration.

Finally, we can use EvilPlot to look at the HMC output and check that we have managed to reasonably recover the true parameters associated with our synthetic data.

import com.cibo.evilplot.geometry.Extent
import com.stripe.rainier.plot.EvilTracePlot._

render(traces(out, truth = Map("b0" -> beta0, "b1" -> beta1)),
"traceplots.png", Extent(1200, 1000))
render(pairs(out, truth = Map("b0" -> beta0, "b1" -> beta1)), "pairs.png")

Everything looks good, and the sampling is very fast!

For further information, see the Rainier repo. In particular, start with the tour of Rainier’s core, which gives a more detailed introduction to how Rainier works than this post. Those interested in how the efficient AD works may want to read about the compute graph, and the implementation notes explain how it all fits together. There is some basic ScalaDoc for the core package, and also some examples (including this one), and there’s a gitter channel for asking questions. This is a very new project, so there are a few minor bugs and wrinkles in the initial release, but development is progressing rapidly, so I fully expect the library to get properly battle-hardened over the next few months.

For those unfamiliar with the monadic approach to probabilistic programming, then Ścibior et al (2015) is probably a good starting point.

## One-way ANOVA with fixed and random effects from a Bayesian perspective

This blog post is derived from a computer practical session that I ran as part of my new course on Statistics for Big Data, previously discussed. This course covered a lot of material very quickly. In particular, I deferred introducing notions of hierarchical modelling until the Bayesian part of the course, where I feel it is more natural and powerful. However, some of the terminology associated with hierarchical statistical modelling probably seems a bit mysterious to those without a strong background in classical statistical modelling, and so this practical session was intended to clear up some potential confusion. I will analyse a simple one-way Analysis of Variance (ANOVA) model from a Bayesian perspective, making sure to highlight the difference between fixed and random effects in a Bayesian context where everything is random, as well as emphasising the associated identifiability issues. R code is used to illustrate the ideas.

### Example scenario

We will consider the body mass index (BMI) of new male undergraduate students at a selection of UK Universities. Let us suppose that our data consist of measurements of (log) BMI for a random sample of 1,000 males at each of 8 Universities. We are interested to know if there are any differences between the Universities. Again, we want to model the process as we would simulate it, so thinking about how we would simulate such data is instructive. We start by assuming that the log BMI is a normal random quantity, and that the variance is common across the Universities in question (this is quite a big assumption, and it is easy to relax). We assume that the mean of this normal distribution is University-specific, but that we do not have strong prior opinions regarding the way in which the Universities differ. That said, we expect that the Universities would not be very different from one another.

### Simulating data

A simple simulation of the data with some plausible parameters can be carried out as follows.

set.seed(1)
Z=matrix(rnorm(1000*8,3.1,0.1),nrow=8)
RE=rnorm(8,0,0.01)
X=t(Z+RE)
colnames(X)=paste("Uni",1:8,sep="")
Data=stack(data.frame(X))
boxplot(exp(values)~ind,data=Data,notch=TRUE)

Make sure that you understand exactly what this code is doing before proceeding. The boxplot showing the simulated data is given below.

### Frequentist analysis

We will start with a frequentist analysis of the data. The model we would like to fit is

$y_{ij} = \mu + \theta_i + \varepsilon_{ij}$

where i is an indicator for the University and j for the individual within a particular University. The “effect”, $\theta_i$ represents how the ith University differs from the overall mean. We know that this model is not actually identifiable when the model parameters are all treated as “fixed effects”, but R will handle this for us.

> mod=lm(values~ind,data=Data)
> summary(mod)

Call:
lm(formula = values ~ ind, data = Data)

Residuals:
Min       1Q   Median       3Q      Max
-0.36846 -0.06778 -0.00069  0.06910  0.38219

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept)  3.101068   0.003223 962.244  < 2e-16 ***
indUni2     -0.006516   0.004558  -1.430 0.152826
indUni3     -0.017168   0.004558  -3.767 0.000166 ***
indUni4      0.017916   0.004558   3.931 8.53e-05 ***
indUni5     -0.022838   0.004558  -5.011 5.53e-07 ***
indUni6     -0.001651   0.004558  -0.362 0.717143
indUni7      0.007935   0.004558   1.741 0.081707 .
indUni8      0.003373   0.004558   0.740 0.459300
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.1019 on 7992 degrees of freedom
Multiple R-squared:  0.01439,	Adjusted R-squared:  0.01353
F-statistic: 16.67 on 7 and 7992 DF,  p-value: < 2.2e-16

We see that R has handled the identifiability problem using “treatment contrasts”, dropping the fixed effect for the first university, so that the intercept actually represents the mean value for the first University, and the effects for the other Univeristies represent the differences from the first University. If we would prefer to impose a sum constraint, then we can switch to sum contrasts with

options(contrasts=rep("contr.sum",2))

and then re-fit the model.

> mods=lm(values~ind,data=Data)
> summary(mods)

Call:
lm(formula = values ~ ind, data = Data)

Residuals:
Min       1Q   Median       3Q      Max
-0.36846 -0.06778 -0.00069  0.06910  0.38219

Coefficients:
Estimate Std. Error  t value Pr(>|t|)
(Intercept)  3.0986991  0.0011394 2719.558  < 2e-16 ***
ind1         0.0023687  0.0030146    0.786 0.432048
ind2        -0.0041477  0.0030146   -1.376 0.168905
ind3        -0.0147997  0.0030146   -4.909 9.32e-07 ***
ind4         0.0202851  0.0030146    6.729 1.83e-11 ***
ind5        -0.0204693  0.0030146   -6.790 1.20e-11 ***
ind6         0.0007175  0.0030146    0.238 0.811889
ind7         0.0103039  0.0030146    3.418 0.000634 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.1019 on 7992 degrees of freedom
Multiple R-squared:  0.01439,	Adjusted R-squared:  0.01353
F-statistic: 16.67 on 7 and 7992 DF,  p-value: < 2.2e-16

This has 7 degrees of freedom for the effects, as before, but ensures that the 8 effects sum to precisely zero. This is arguably more interpretable in this case.

### Bayesian analysis

We will now analyse the simulated data from a Bayesian perspective, using JAGS.

#### Fixed effects

All parameters in Bayesian models are uncertain, and therefore random, so there is much confusion regarding the difference between “fixed” and “random” effects in a Bayesian context. For “fixed” effects, our prior captures the idea that we sample the effects independently from a “fixed” (typically vague) prior distribution. We could simply code this up and fit it in JAGS as follows.

require(rjags)
n=dim(X)[1]
p=dim(X)[2]
data=list(X=X,n=n,p=p)
init=list(mu=2,tau=1)
modelstring="
model {
for (j in 1:p) {
theta[j]~dnorm(0,0.0001)
for (i in 1:n) {
X[i,j]~dnorm(mu+theta[j],tau)
}
}
mu~dnorm(0,0.0001)
tau~dgamma(1,0.0001)
}
"
model=jags.model(textConnection(modelstring),data=data,inits=init)
update(model,n.iter=1000)
output=coda.samples(model=model,variable.names=c("mu","tau","theta"),n.iter=100000,thin=10)
print(summary(output))
plot(output)
autocorr.plot(output)
pairs(as.matrix(output))
crosscorr.plot(output)

On running the code we can clearly see that this naive approach leads to high posterior correlation between the mean and the effects, due to the fundamental lack of identifiability of the model. This also leads to MCMC mixing problems, but it is important to understand that this computational issue is conceptually entirely separate from the fundamental statisticial identifiability issue. Even if we could avoid MCMC entirely, the identifiability issue would remain.

A quick fix for the identifiability issue is to use “treatment contrasts”, just as for the frequentist model. We can implement that as follows.

data=list(X=X,n=n,p=p)
init=list(mu=2,tau=1)
modelstring="
model {
for (j in 1:p) {
for (i in 1:n) {
X[i,j]~dnorm(mu+theta[j],tau)
}
}
theta[1]<-0
for (j in 2:p) {
theta[j]~dnorm(0,0.0001)
}
mu~dnorm(0,0.0001)
tau~dgamma(1,0.0001)
}
"
model=jags.model(textConnection(modelstring),data=data,inits=init)
update(model,n.iter=1000)
output=coda.samples(model=model,variable.names=c("mu","tau","theta"),n.iter=100000,thin=10)
print(summary(output))
plot(output)
autocorr.plot(output)
pairs(as.matrix(output))
crosscorr.plot(output)

Running this we see that the model now works perfectly well, mixes nicely, and gives sensible inferences for the treatment effects.

Another source of confusion for models of this type is data formating and indexing in JAGS models. For our balanced data there was not problem passing in data to JAGS as a matrix and specifying the model using nested loops. However, for unbalanced designs this is not necessarily so convenient, and so then it can be helpful to specify the model based on two-column data, as we would use for fitting using lm(). This is illustrated with the following model specification, which is exactly equivalent to the previous model, and should give identical (up to Monte Carlo error) results.

N=n*p
data=list(y=Data$values,g=Data$ind,N=N,p=p)
init=list(mu=2,tau=1)
modelstring="
model {
for (i in 1:N) {
y[i]~dnorm(mu+theta[g[i]],tau)
}
theta[1]<-0
for (j in 2:p) {
theta[j]~dnorm(0,0.0001)
}
mu~dnorm(0,0.0001)
tau~dgamma(1,0.0001)
}
"
model=jags.model(textConnection(modelstring),data=data,inits=init)
update(model,n.iter=1000)
output=coda.samples(model=model,variable.names=c("mu","tau","theta"),n.iter=100000,thin=10)
print(summary(output))
plot(output)

As suggested above, this indexing scheme is much more convenient for unbalanced data, and hence widely used. However, since our data is balanced here, we will revert to the matrix approach for the remainder of the post.

One final thing to consider before moving on to random effects is the sum-contrast model. We can implement this in various ways, but I’ve tried to encode it for maximum clarity below, imposing the sum-to-zero constraint via the final effect.

data=list(X=X,n=n,p=p)
init=list(mu=2,tau=1)
modelstring="
model {
for (j in 1:p) {
for (i in 1:n) {
X[i,j]~dnorm(mu+theta[j],tau)
}
}
for (j in 1:(p-1)) {
theta[j]~dnorm(0,0.0001)
}
theta[p] <- -sum(theta[1:(p-1)])
mu~dnorm(0,0.0001)
tau~dgamma(1,0.0001)
}
"
model=jags.model(textConnection(modelstring),data=data,inits=init)
update(model,n.iter=1000)
output=coda.samples(model=model,variable.names=c("mu","tau","theta"),n.iter=100000,thin=10)
print(summary(output))
plot(output)

Again, this works perfectly well and gives similar results to the frequentist analysis.

#### Random effects

The key difference between fixed and random effects in a Bayesian framework is that random effects are not independent, being drawn from a distribution with parameters which are not fixed. Essentially, there is another level of hierarchy involved in the specification of the random effects. This is best illustrated by example. A random effects model for this problem is given below.

data=list(X=X,n=n,p=p)
init=list(mu=2,tau=1)
modelstring="
model {
for (j in 1:p) {
theta[j]~dnorm(0,taut)
for (i in 1:n) {
X[i,j]~dnorm(mu+theta[j],tau)
}
}
mu~dnorm(0,0.0001)
tau~dgamma(1,0.0001)
taut~dgamma(1,0.0001)
}
"
model=jags.model(textConnection(modelstring),data=data,inits=init)
update(model,n.iter=1000)
output=coda.samples(model=model,variable.names=c("mu","tau","taut","theta"),n.iter=100000,thin=10)
print(summary(output))
plot(output)

The only difference between this and our first naive attempt at a Bayesian fixed effects model is that we have put a gamma prior on the precision of the effect. Note that this model now runs and fits perfectly well, with reasonable mixing, and gives sensible parameter inferences. Although the effects here are not constrained to sum-to-zero, like in the case of sum contrasts for a fixed effects model, the prior encourages shrinkage towards zero, and so the random effect distribution can be thought of as a kind of soft version of a hard sum-to-zero constraint. From a predictive perspective, this model is much more powerful. In particular, using a random effects model, we can make strong predictions for unobserved groups (eg. a ninth University), with sensible prediction intervals based on our inferred understanding of how similar different universities are. Using a fixed effects model this isn’t really possible. Even for a Bayesian version of a fixed effects model using proper (but vague) priors, prediction intervals for unobserved groups are not really sensible.

Since we have used simulated data here, we can compare the estimated random effects with the true effects generated during the simulation.

> apply(as.matrix(output),2,mean)
mu           tau          taut      theta[1]      theta[2]
3.098813e+00  9.627110e+01  7.015976e+03  2.086581e-03 -3.935511e-03
theta[3]      theta[4]      theta[5]      theta[6]      theta[7]
-1.389099e-02  1.881528e-02 -1.921854e-02  5.640306e-04  9.529532e-03
theta[8]
5.227518e-03
> RE
[1]  0.002637034 -0.008294518 -0.014616348  0.016839902 -0.015443243
[6] -0.001908871  0.010162117  0.005471262

We see that the Bayesian random effects model has done an excellent job of estimation. If we wished, we could relax the assumption of common variance across the groups by making tau a vector indexed by j, though there is not much point in persuing this here, since we know that the groups do all have the same variance.

#### Strong subjective priors

The above is the usual story regarding fixed and random effects in Bayesian inference. I hope this is reasonably clear, so really I should quit while I’m ahead… However, the issues are really a bit more subtle than I’ve suggested. The inferred precision of the random effects was around 7,000, so now lets re-run the original, naive, “fixed effects” model with a strong subjective Bayesian prior on the distribution of the effects.

data=list(X=X,n=n,p=p)
init=list(mu=2,tau=1)
modelstring="
model {
for (j in 1:p) {
theta[j]~dnorm(0,7000)
for (i in 1:n) {
X[i,j]~dnorm(mu+theta[j],tau)
}
}
mu~dnorm(0,0.0001)
tau~dgamma(1,0.0001)
}
"
model=jags.model(textConnection(modelstring),data=data,inits=init)
update(model,n.iter=1000)
output=coda.samples(model=model,variable.names=c("mu","tau","theta"),n.iter=100000,thin=10)
print(summary(output))
plot(output)

This model also runs perfectly well and gives sensible inferences, despite the fact that the effects are iid from a fixed distribution and there is no hard constraint on the effects. Similarly, we can make sensible predictions, together with appropriate prediction intervals, for an unobserved group. So it isn’t so much the fact that the effects are coupled via an extra level of hierarchy that makes things work. It’s really the fact that the effects are sensibly distributed and not just sampled directly from a vague prior. So for “real” subjective Bayesians the line between fixed and random effects is actually very blurred indeed…

## Introduction

In the previous post I showed that it is possible to couple parallel tempered MCMC chains in order to improve mixing. Such methods can be used when the target of interest is a Bayesian posterior distribution that is difficult to sample. There are (at least) a couple of obvious ways that one can temper a Bayesian posterior distribution. Perhaps the most obvious way is a simple flattening, so that if

$\pi(\theta|y) \propto \pi(\theta)\pi(y|\theta)$

is the posterior distribution, then for $t\in [0,1]$ we define

$\pi_t(\theta|y) \propto \pi(\theta|y)^t \propto [ \pi(\theta)\pi(y|\theta) ]^t.$

This corresponds with the tempering that is often used in statistical physics applications. We recover the posterior of interest for $t=1$ and tend to a flat distribution as $t\longrightarrow 0$. However, for Bayesian posterior distributions, there is a different way of tempering that is often more natural and useful, and that is to temper using the power posterior, defined by

$\pi_t(\theta|y) \propto \pi(\theta)\pi(y|\theta)^t.$

Here we again recover the posterior for $t=1$, but get the prior for $t=0$. Thus, the family of distributions forms a natural bridge or path from the prior to the posterior distributions. The power posterior is a special case of the more general concept of a geometric path from distribution $f(\theta)$ (at $t=0$) to $g(\theta)$ (at $t=1$) defined by

$h_t(\theta) \propto f(\theta)^{1-t}g(\theta)^t,$

where, in our case, $f(\cdot)$ is the prior and $g(\cdot)$ is the posterior.

So, given a posterior distribution that is difficult to sample, choose a temperature schedule

$0=t_0

and run a parallel tempering scheme as outlined in the previous post. The idea is that for small values of $t$ mixing will be good, as prior-like distributions are usually well-behaved, and the mixing of these "high temperature" chains can help to improve the mixing of the "low temperature" chains that are more like the posterior (note that $t$ is really an inverse temperature parameter the way I’ve defined it here…).

## Marginal likelihood and normalising constants

The marginal likelihood of a Bayesian model is

$\pi(y) = \int_\Theta \pi(\theta)\pi(y|\theta)d\theta.$

This quantity is of interest for many reasons, including calculation of the Bayes factor between two competing models. Note that this quantity has several different names in different fields. In particular, it is often known as the evidence, due to its role in Bayes factors. It is also worth noting that it is the normalising constant of the Bayesian posterior distribution. Although it is very easy to describe and define, it is notoriously difficult to compute reliably for complex models.

The normalising constant is conceptually very easy to estimate. From the above integral representation, it is clear that

$\pi(y) = E_\pi [ \pi(y|\theta) ]$

where the expectation is taken with respect to the prior. So, given samples from the prior, $\theta_1,\theta_2,\ldots,\theta_n$, we can construct the Monte Carlo estimate

$\displaystyle \widehat{\pi}(y) = \frac{1}{n}\sum_{i=1}^n \pi(y|\theta_i)$

and this will be a consistent estimator of the true evidence under fairly mild regularity conditions. Unfortunately, in practice it is likely to be a very poor estimator if the posterior and prior are not very similar. Now, we could also use Bayes theorem to re-write the integral as an expectation with respect to the posterior, so we could then use samples from the posterior to estimate the evidence. This leads to the harmonic mean estimator of the evidence, which has been described as the worst Monte Carlo method ever! Now it turns out that there are many different ways one can construct estimators of the evidence using samples from the prior and the posterior, some of which are considerably better than the two I’ve outlined. This is the subject of the bridge sampling paper of Meng and Wong. However, the reality is that no method will work well if the prior and posterior are very different.

If we have tempered chains, then we have a sequence of chains targeting distributions which, by construction, are not too different, and so we can use the output from tempered chains in order to construct estimators of the evidence that are more numerically stable. If we call the evidence of the $i$th chain $z_i$, so that $z_0=1$ and $z_N=\pi(y)$, then we can write the evidence in telescoping fashion as

$\displaystyle \pi(y)=z_N = \frac{z_N}{z_0} = \frac{z_1}{z_0}\times \frac{z_2}{z_1}\times \cdots \times \frac{z_N}{z_{N-1}}.$

Now the $i$th term in this product is $z_{i+1}/z_{i}$, which can be estimated using the output from the $i$th and/or $(i+1)$th chain(s). Again, this can be done in a variety of ways, using your favourite bridge sampling estimator, but the point is that the estimator should be reasonably good due to the fact that the $i$th and $(i+1)$th targets are very similar. For the power posterior, the simplest method is to write

$\displaystyle \frac{z_{i+1}}{z_i} = \frac{\displaystyle \int \pi(\theta)\pi(y|\theta)^{t_{i+1}}d\theta}{z_i} = \int \pi(y|\theta)^{t_{i+1}-t_i}\times \frac{\pi(y|\theta)^{t_i}\pi(\theta)}{z_i}d\theta$

$\displaystyle \mbox{}\qquad = E_i\left[\pi(y|\theta)^{t_{i+1}-t_i}\right],$

where the expectation is with respect to the $i$th target, and hence can be estimated in the usual way using samples from the $i$th chain.

For numerical stability, in practice we compute the log of the evidence as

$\displaystyle \log\pi(y) = \sum_{i=0}^{N-1} \log\frac{z_{i+1}}{z_i} = \sum_{i=0}^{N-1} \log E_i\left[\pi(y|\theta)^{t_{i+1}-t_i}\right]$

$\displaystyle = \sum_{i=0}^{N-1} \log E_i\left[\exp\{(t_{i+1}-t_i)\log\pi(y|\theta)\}\right].\qquad(\dagger)$

The above expression is exact, and is the obvious formula to use for computation. However, it is clear that if $t_i$ and $t_{i+1}$ are sufficiently close, it will be approximately OK to switch the expectation and exponential, giving

$\displaystyle \log\pi(y) \approx \sum_{i=0}^{N-1}(t_{i+1}-t_i)E_i\left[\log\pi(y|\theta)\right].$

In the continuous limit, this gives rise to the well-known path sampling identity,

$\displaystyle \log\pi(y) = \int_0^1 E_t\left[\log\pi(y|\theta)\right]dt.$

So, an alternative approach to computing the evidence is to use the samples to approximately numerically integrate the above integral, say, using the trapezium rule. However, it isn’t completely clear (to me) that this is better than using $(\dagger)$ directly, since there there is no numerical integration error to worry about.

## Numerical illustration

We can illustrate these ideas using the simple double potential well example from the previous post. Now that example doesn’t really correspond to a Bayesian posterior, and is tempered directly, rather than as a power posterior, but essentially the same ideas follow for general parallel tempered distributions. In general, we can use the sample to estimate the ratio of the last and first normalising constants, $z_N/z_0$. Here it isn’t obvious why we’d want to know that, but we’ll compute it anyway to illustrate the method. As before, we expand as a telescopic product, where the $i$th term is now

$\displaystyle \frac{z_{i+1}}{z_i} = E_i\left[\exp\{-(\gamma_{i+1}-\gamma_i)(x^2-1)^2\}\right].$

A Monte Carlo estimate of each of these terms is formed using the samples from the $i$th chain, and the logs of these are then summed to give $\log(z_N/z_0)$. A complete R script to run the Metropolis coupled sampler and compute the evidence is given below.

U=function(gam,x)
{
gam*(x*x-1)*(x*x-1)
}

temps=2^(0:3)
iters=1e5

chains=function(pot=U, tune=0.1, init=1)
{
x=rep(init,length(temps))
xmat=matrix(0,iters,length(temps))
for (i in 1:iters) {
can=x+rnorm(length(temps),0,tune)
logA=unlist(Map(pot,temps,x))-unlist(Map(pot,temps,can))
accept=(log(runif(length(temps)))<logA)
x[accept]=can[accept]
swap=sample(1:length(temps),2)
logA=pot(temps[swap[1]],x[swap[1]])+pot(temps[swap[2]],x[swap[2]])-
pot(temps[swap[1]],x[swap[2]])-pot(temps[swap[2]],x[swap[1]])
if (log(runif(1))<logA)
x[swap]=rev(x[swap])
xmat[i,]=x
}
colnames(xmat)=paste("gamma=",temps,sep="")
xmat
}

mat=chains()
mat=mat[,1:(length(temps)-1)]
diffs=diff(temps)
mat=(mat*mat-1)^2
mat=-t(diffs*t(mat))
mat=exp(mat)
logEvidence=sum(log(colMeans(mat)))
message(paste("The log of the ratio of the last and first normalising constants is",logEvidence))

It turns out that these double well potential densities are tractable, and so the normalising constants can be computed exactly. So, with a little help from Wolfram Alpha, I compute log of the ratio of the last and first normalising constants to be approximately -1.12. Hopefully the above script will output something a bit like that…

## Summary stats for ABC

#### Introduction

In the previous post I gave a very brief introduction to ABC, including a simple example for inferring the parameters of a Markov process given some time series observations. Towards the end of the post I observed that there were (at least!) two potential problems with scaling up the simple approach described, one relating to the dimension of the data and the other relating to the dimension of the parameter space. Before moving on to the (to me, more interesting) problem of the dimension of the parameter space, I will briefly discuss the data dimension problem in this post, and provide a couple of references for further reading.

#### Summary stats

Recall that the simple rejection sampling approach to ABC involves first sampling a candidate parameter $\theta^\star$ from the prior and then sampling a corresponding data set $x^\star$ from the model. This simulated data set is compared with the true data $x$ using some (pseudo-)norm, $\Vert\cdot\Vert$, and accepting $\theta^\star$ if the simulated data set is sufficiently close to the true data, $\Vert x^\star - x\Vert <\epsilon$. It should be clear that if we are using a proper norm then as $\epsilon$ tends to zero the distribution of the accepted values tends to the desired posterior distribution of the parameters given the data.

However, smaller choices of $\epsilon$ will lead to higher rejection rates. This will be a particular problem in the context of high-dimensional $x$, where it is often unrealistic to expect a close match between all components of $x$ and the simulated data $x^\star$, even for a good choice of $\theta^\star$. In this case, it makes more sense to look for good agreement between particular aspects of $x$, such as the mean, or variance, or auto-correlation, depending on the exact problem and context. If we can find a finite set of sufficient statistics, $s(x)$ for $\theta$, then it should be clear that replacing the acceptance criterion with $\Vert s(x^\star) - s(x)\Vert <\epsilon$ will also lead to a scheme tending to the true posterior as $\epsilon$ tends to zero (assuming a proper norm on the space of sufficient statistics), and will typically be better than the naive method, since the sufficient statistics will be of lower dimension and less “noisy” that the raw data, leading to higher acceptance rates with no loss of information.

Unfortunately for most problems of practical interest it is not possible to find low-dimensional sufficient statistics, and so people in practice use domain knowledge and heuristics to come up with a set of summary statistics, $s(x)$ which they hope will closely approximate sufficient statistics. There is still a question as to how these statistics should be weighted or transformed to give a particular norm. This can be done using theory or heuristics, and some relevant references for this problem are given at the end of the post.

#### Implementation in R

Let’s now look at the problem from the previous post. Here, instead of directly computing the Euclidean distance between the real and simulated data, we will look at the Euclidean distance between some (normalised) summary statistics. First we will load some packages and set some parameters.

require(smfsb)
require(parallel)
options(mc.cores=4)
data(LVdata)

N=1e7
bs=1e5
batches=N/bs
message(paste("N =",N," | bs =",bs," | batches =",batches))

Next we will define some summary stats for a univariate time series – the mean, the (log) variance, and the first two auto-correlations.

ssinit <- function(vec)
{
ac23=as.vector(acf(vec,lag.max=2,plot=FALSE)$acf)[2:3] c(mean(vec),log(var(vec)+1),ac23) } Once we have this, we can define some stats for a bivariate time series by combining the stats for the two component series, along with the cross-correlation between them. ssi <- function(ts) { c(ssinit(ts[,1]),ssinit(ts[,2]),cor(ts[,1],ts[,2])) } This gives a set of summary stats, but these individual statistics are potentially on very different scales. They can be transformed and re-weighted in a variety of ways, usually on the basis of a pilot run which gives some information about the distribution of the summary stats. Here we will do the simplest possible thing, which is to normalise the variance of the stats on the basis of a pilot run. This is not at all optimal – see the references at the end of the post for a description of better methods. message("Batch 0: Pilot run batch") prior=cbind(th1=exp(runif(bs,-6,2)),th2=exp(runif(bs,-6,2)),th3=exp(runif(bs,-6,2))) rows=lapply(1:bs,function(i){prior[i,]}) samples=mclapply(rows,function(th){simTs(c(50,100),0,30,2,stepLVc,th)}) sumstats=mclapply(samples,ssi) sds=apply(sapply(sumstats,c),1,sd) print(sds) # now define a standardised distance ss<-function(ts) { ssi(ts)/sds } ss0=ss(LVperfect) distance <- function(ts) { diff=ss(ts)-ss0 sum(diff*diff) } Now we have a normalised distance function defined, we can proceed exactly as before to obtain an ABC posterior via rejection sampling. post=NULL for (i in 1:batches) { message(paste("batch",i,"of",batches)) prior=cbind(th1=exp(runif(bs,-6,2)),th2=exp(runif(bs,-6,2)),th3=exp(runif(bs,-6,2))) rows=lapply(1:bs,function(i){prior[i,]}) samples=mclapply(rows,function(th){simTs(c(50,100),0,30,2,stepLVc,th)}) dist=mclapply(samples,distance) dist=sapply(dist,c) cutoff=quantile(dist,1000/N,na.rm=TRUE) post=rbind(post,prior[dist<cutoff,]) } message(paste("Finished. Kept",dim(post)[1],"simulations")) Having obtained the posterior, we can use the following code to plot the results. th=c(th1 = 1, th2 = 0.005, th3 = 0.6) op=par(mfrow=c(2,3)) for (i in 1:3) { hist(post[,i],30,col=5,main=paste("Posterior for theta[",i,"]",sep="")) abline(v=th[i],lwd=2,col=2) } for (i in 1:3) { hist(log(post[,i]),30,col=5,main=paste("Posterior for log(theta[",i,"])",sep="")) abline(v=log(th[i]),lwd=2,col=2) } par(op) This gives the plot shown below. From this we can see that the ABC posterior obtained here is very similar to that obtained in the previous post using the full data. Here the dimension reduction is not that great – reducing from 32 data points to 9 summary statistics – and so the improvement in performance is not that noticable. But in higher dimensional problems reducing the dimension of the data is practically essential. #### Summary and References As before, I recommend the wikipedia article on approximate Bayesian computation for further information and a comprehensive set of references for further reading. Here I just want to highlight two references particularly relevant to the issue of summary statistics. It is quite difficult to give much practical advice on how to construct good summary statistics, but how to transform a set of summary stats in a “good” way is a problem that is reasonably well understood. In this post I did something rather naive (normalising the variance), but the following two papers describe much better approaches. I still haven’t addressed the issue of a high-dimensional parameter space – that will be the topic of a subsequent post. #### The complete R script require(smfsb) require(parallel) options(mc.cores=4) data(LVdata) N=1e6 bs=1e5 batches=N/bs message(paste("N =",N," | bs =",bs," | batches =",batches)) ssinit <- function(vec) { ac23=as.vector(acf(vec,lag.max=2,plot=FALSE)$acf)[2:3]
c(mean(vec),log(var(vec)+1),ac23)
}

ssi <- function(ts)
{
c(ssinit(ts[,1]),ssinit(ts[,2]),cor(ts[,1],ts[,2]))
}

message("Batch 0: Pilot run batch")
prior=cbind(th1=exp(runif(bs,-6,2)),th2=exp(runif(bs,-6,2)),th3=exp(runif(bs,-6,2)))
rows=lapply(1:bs,function(i){prior[i,]})
samples=mclapply(rows,function(th){simTs(c(50,100),0,30,2,stepLVc,th)})
sumstats=mclapply(samples,ssi)
sds=apply(sapply(sumstats,c),1,sd)
print(sds)

# now define a standardised distance
ss<-function(ts)
{
ssi(ts)/sds
}

ss0=ss(LVperfect)

distance <- function(ts)
{
diff=ss(ts)-ss0
sum(diff*diff)
}

post=NULL
for (i in 1:batches) {
message(paste("batch",i,"of",batches))
prior=cbind(th1=exp(runif(bs,-6,2)),th2=exp(runif(bs,-6,2)),th3=exp(runif(bs,-6,2)))
rows=lapply(1:bs,function(i){prior[i,]})
samples=mclapply(rows,function(th){simTs(c(50,100),0,30,2,stepLVc,th)})
dist=mclapply(samples,distance)
dist=sapply(dist,c)
cutoff=quantile(dist,1000/N,na.rm=TRUE)
post=rbind(post,prior[dist<cutoff,])
}
message(paste("Finished. Kept",dim(post)[1],"simulations"))

# plot the results
th=c(th1 = 1, th2 = 0.005, th3 = 0.6)
op=par(mfrow=c(2,3))
for (i in 1:3) {
hist(post[,i],30,col=5,main=paste("Posterior for theta[",i,"]",sep=""))
abline(v=th[i],lwd=2,col=2)
}
for (i in 1:3) {
hist(log(post[,i]),30,col=5,main=paste("Posterior for log(theta[",i,"])",sep=""))
abline(v=log(th[i]),lwd=2,col=2)
}
par(op)

## Introduction to Approximate Bayesian Computation (ABC)

Many of the posts in this blog have been concerned with using MCMC based methods for Bayesian inference. These methods are typically “exact” in the sense that they have the exact posterior distribution of interest as their target equilibrium distribution, but are obviously “approximate”, in that for any finite amount of computing time, we can only generate a finite sample of correlated realisations from a Markov chain that we hope is close to equilibrium.

Approximate Bayesian Computation (ABC) methods go a step further, and generate samples from a distribution which is not the true posterior distribution of interest, but a distribution which is hoped to be close to the real posterior distribution of interest. There are many variants on ABC, and I won’t get around to explaining all of them in this blog. The wikipedia page on ABC is a good starting point for further reading. In this post I’ll explain the most basic rejection sampling version of ABC, and in a subsequent post, I’ll explain a sequential refinement, often referred to as ABC-SMC. As usual, I’ll use R code to illustrate the ideas.

#### Basic idea

There is a close connection between “likelihood free” MCMC methods and those of approximate Bayesian computation (ABC). To keep things simple, consider the case of a perfectly observed system, so that there is no latent variable layer. Then there are model parameters $\theta$ described by a prior $\pi(\theta)$, and a forwards-simulation model for the data $x$, defined by $\pi(x|\theta)$. It is clear that a simple algorithm for simulating from the desired posterior $\pi(\theta|x)$ can be obtained as follows. First simulate from the joint distribution $\pi(\theta,x)$ by simulating $\theta^\star\sim\pi(\theta)$ and then $x^\star\sim \pi(x|\theta^\star)$. This gives a sample $(\theta^\star,x^\star)$ from the joint distribution. A simple rejection algorithm which rejects the proposed pair unless $x^\star$ matches the true data $x$ clearly gives a sample from the required posterior distribution.

#### Exact rejection sampling

• 1. Sample $\theta^\star \sim \pi(\theta^\star)$
• 2. Sample $x^\star\sim \pi(x^\star|\theta^\star)$
• 3. If $x^\star=x$, keep $\theta^\star$ as a sample from $\pi(\theta|x)$, otherwise reject.

This algorithm is exact, and for discrete $x$ will have a non-zero acceptance rate. However, in most interesting problems the rejection rate will be intolerably high. In particular, the acceptance rate will typically be zero for continuous valued $x$.

#### ABC rejection sampling

The ABC “approximation” is to accept values provided that $x^\star$ is “sufficiently close” to $x$. In the first instance, we can formulate this as follows.

• 1. Sample $\theta^\star \sim \pi(\theta^\star)$
• 2. Sample $x^\star\sim \pi(x^\star|\theta^\star)$
• 3. If $\Vert x^\star-x\Vert< \epsilon$, keep $\theta^\star$ as a sample from $\pi(\theta|x)$, otherwise reject.

Euclidean distance is usually chosen as the norm, though any norm can be used. This procedure is “honest”, in the sense that it produces exact realisations from

$\theta^\star\big|\Vert x^\star-x\Vert < \epsilon.$

For suitable small choice of $\epsilon$, this will closely approximate the true posterior. However, smaller choices of $\epsilon$ will lead to higher rejection rates. This will be a particular problem in the context of high-dimensional $x$, where it is often unrealistic to expect a close match between all components of $x$ and the simulated data $x^\star$, even for a good choice of $\theta^\star$. In this case, it makes more sense to look for good agreement between particular aspects of $x$, such as the mean, or variance, or auto-correlation, depending on the exact problem and context.

In the simplest case, this is done by forming a (vector of) summary statistic(s), $s(x^\star)$ (ideally a sufficient statistic), and accepting provided that $\Vert s(x^\star)-s(x)\Vert<\epsilon$ for some suitable choice of metric and $\epsilon$. We will return to this issue in a subsequent post.

#### Inference for an intractable Markov process

I’ll illustrate ABC in the context of parameter inference for a Markov process with an intractable transition kernel: the discrete stochastic Lotka-Volterra model. A function for simulating exact realisations from the intractable kernel is included in the smfsb CRAN package discussed in a previous post. Using pMCMC to solve the parameter inference problem is discussed in another post. It may be helpful to skim through those posts quickly to become familiar with this problem before proceeding.

So, for a given proposed set of parameters, realisations from the process can be sampled using the functions simTs and stepLV (from the smfsb package). We will use the sample data set LVperfect (from the LVdata dataset) as our “true”, or “target” data, and try to find parameters for the process which are consistent with this data. A fairly minimal R script for this problem is given below.

require(smfsb)
data(LVdata)

N=1e5
message(paste("N =",N))
prior=cbind(th1=exp(runif(N,-6,2)),th2=exp(runif(N,-6,2)),th3=exp(runif(N,-6,2)))
rows=lapply(1:N,function(i){prior[i,]})
message("starting simulation")
samples=lapply(rows,function(th){simTs(c(50,100),0,30,2,stepLVc,th)})
message("finished simulation")

distance<-function(ts)
{
diff=ts-LVperfect
sum(diff*diff)
}

message("computing distances")
dist=lapply(samples,distance)
message("distances computed")

dist=sapply(dist,c)
cutoff=quantile(dist,1000/N)
post=prior[dist<cutoff,]

op=par(mfrow=c(2,3))
apply(post,2,hist,30)
apply(log(post),2,hist,30)
par(op)

This script should take 5-10 minutes to run on a decent laptop, and will result in histograms of the posterior marginals for the components of $\theta$ and $\log(\theta)$. Note that I have deliberately adopted a functional programming style, making use of the lapply function for the most computationally intensive steps. The reason for this will soon become apparent. Note that rather than pre-specifying a cutoff $\epsilon$, I’ve instead picked a quantile of the distance distribution. This is common practice in scenarios where the distance is difficult to have good intuition about. In fact here I’ve gone a step further and chosen a quantile to give a final sample of size 1000. Obviously then in this case I could have just selected out the top 1000 directly, but I wanted to illustrate the quantile based approach.

One problem with the above script is that all proposed samples are stored in memory at once. This is problematic for problems involving large numbers of samples. However, it is convenient to do simulations in large batches, both for computation of quantiles, and also for efficient parallelisation. The script below illustrates how to implement a batch parallelisation strategy for this problem. Samples are generated in batches of size 10^4, and only the best fitting samples are stored before the next batch is processed. This strategy can be used to get a good sized sample based on a more stringent acceptance criterion at the cost of addition simulation time. Note that the parallelisation code will only work with recent versions of R, and works by replacing calls to lapply with the parallel version, mclapply. You should notice an appreciable speed-up on a multicore machine.

require(smfsb)
require(parallel)
options(mc.cores=4)
data(LVdata)

N=1e5
bs=1e4
batches=N/bs
message(paste("N =",N," | bs =",bs," | batches =",batches))

distance<-function(ts)
{
diff=ts[,1]-LVprey
sum(diff*diff)
}

post=NULL
for (i in 1:batches) {
message(paste("batch",i,"of",batches))
prior=cbind(th1=exp(runif(bs,-6,2)),th2=exp(runif(bs,-6,2)),th3=exp(runif(bs,-6,2)))
rows=lapply(1:bs,function(i){prior[i,]})
samples=mclapply(rows,function(th){simTs(c(50,100),0,30,2,stepLVc,th)})
dist=mclapply(samples,distance)
dist=sapply(dist,c)
cutoff=quantile(dist,1000/N)
post=rbind(post,prior[dist<cutoff,])
}
message(paste("Finished. Kept",dim(post)[1],"simulations"))

op=par(mfrow=c(2,3))
apply(post,2,hist,30)
apply(log(post),2,hist,30)
par(op)

Note that there is an additional approximation here, since the top 100 samples from each of 10 batches of simulations won’t correspond exactly to the top 1000 samples overall, but given all of the other approximations going on in ABC, this one is likely to be the least of your worries.

Now, if you compare the approximate posteriors obtained here with the “true” posteriors obtained in an earlier post using pMCMC, you will see that these posteriors are really quite poor. However, this isn’t a very fair comparison, since we’ve only done 10^5 simulations. Jacking N up to 10^7 gives the ABC posterior below.

This is a bit better, but really not great. There are two basic problems with the simplistic ABC strategy adopted here, one related to the dimensionality of the data and the other the dimensionality of the parameter space. The most basic problem that we have here is the dimensionality of the data. We have 16 (bivariate) observations, so we want our stochastic simulation to shoot at a point in a 16- or 32-dimensional space. That’s a tough call. The standard way to address this problem is to reduce the dimension of the data by introducing a few carefully chosen summary statistics and then just attempting to hit those. I’ll illustrate this in a subsequent post. The other problem is that often the prior and posterior over the parameters are quite different, and this problem too is exacerbated as the dimension of the parameter space increases. The standard way to deal with this is to sequentially adapt from the prior through a sequence of ABC posteriors. I’ll examine this in a future post as well, once I’ve also posted an introduction to the use of sequential Monte Carlo (SMC) samplers for static problems.

For further reading, I suggest browsing the reference list of the Wikipedia page for ABC. Also look through the list of software on that page. In particular, note that there is a CRAN package, abc, providing R support for ABC. There is a vignette for this package which should be sufficient to get started.

## Getting started with Bayesian variable selection using JAGS and rjags

#### Bayesian variable selection

In a previous post I gave a quick introduction to using the rjags R package to access the JAGS Bayesian inference from within R. In this post I want to give a quick guide to using rjags for Bayesian variable selection. I intend to use this post as a starting point for future posts on Bayesian model and variable selection using more sophisticated approaches.

I will use the simple example of multiple linear regression to illustrate the ideas, but it should be noted that I’m just using that as an example. It turns out that in the context of linear regression there are lots of algebraic and computational tricks which can be used to simplify the variable selection problem. The approach I give here is therefore rather inefficient for linear regression, but generalises to more complex (non-linear) problems where analytical and computational short-cuts can’t be used so easily.

Consider a linear regression problem with n observations and p covariates, which we can write in matrix form as

$y = \alpha \boldmath{1} + X\beta + \varepsilon,$

where $X$ is an $n\times p$ matrix. The idea of variable selection is that probably not all of the p covariates are useful for predicting y, and therefore it would be useful to identify the variables which are, and just use those. Clearly each combination of variables corresponds to a different model, and so the variable selection amounts to choosing among the $2^p$ possible models. For large values of p it won’t be practical to consider each possible model separately, and so the idea of Bayesian variable selection is to consider a model containing all of the possible model combinations as sub-models, and the variable selection problem as just another aspect of the model which must be estimated from data. I’m simplifying and glossing over lots of details here, but there is a very nice review paper by O’Hara and Sillanpaa (2009) which the reader is referred to for further details.

The simplest and most natural way to tackle the variable selection problem from a Bayesian perspective is to introduce an indicator random variable $I_i$ for each covariate, and introduce these into the model in order to “zero out” inactive covariates. That is we write the ith regression coefficient $\beta_i$ as $\beta_i=I_i\beta^\star_i$, so that $\beta^\star_i$ is the regression coefficient when $I_i=1$, and “doesn’t matter” when $I_i=0$. There are various ways to choose the prior over $I_i$ and $\beta^\star_i$, but the simplest and most natural choice is to make them independent. This approach was used in Kuo and Mallick (1998), and hence is referred to as the Kuo and Mallick approach in O’Hara and Sillanpaa.

#### Simulating some data

In order to see how things work, let’s first simulate some data from a regression model with geometrically decaying regression coefficients.

n=500
p=20
X=matrix(rnorm(n*p),ncol=p)
beta=2^(0:(1-p))
print(beta)
alpha=3
tau=2
eps=rnorm(n,0,1/sqrt(tau))
y=alpha+as.vector(X%*%beta + eps)

Let’s also fit the model by least squares.

mod=lm(y~X)
print(summary(mod))

This should give output something like the following.

Call:
lm(formula = y ~ X)

Residuals:
Min       1Q   Median       3Q      Max
-1.62390 -0.48917 -0.02355  0.45683  2.35448

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept)  3.0565406  0.0332104  92.036  < 2e-16 ***
X1           0.9676415  0.0322847  29.972  < 2e-16 ***
X2           0.4840052  0.0333444  14.515  < 2e-16 ***
X3           0.2680482  0.0320577   8.361  6.8e-16 ***
X4           0.1127954  0.0314472   3.587 0.000369 ***
X5           0.0781860  0.0334818   2.335 0.019946 *
X6           0.0136591  0.0335817   0.407 0.684379
X7           0.0035329  0.0321935   0.110 0.912662
X8           0.0445844  0.0329189   1.354 0.176257
X9           0.0269504  0.0318558   0.846 0.397968
X10          0.0114942  0.0326022   0.353 0.724575
X11         -0.0045308  0.0330039  -0.137 0.890868
X12          0.0111247  0.0342482   0.325 0.745455
X13         -0.0584796  0.0317723  -1.841 0.066301 .
X14         -0.0005005  0.0343499  -0.015 0.988381
X15         -0.0410424  0.0334723  -1.226 0.220742
X16          0.0084832  0.0329650   0.257 0.797026
X17          0.0346331  0.0327433   1.058 0.290718
X18          0.0013258  0.0328920   0.040 0.967865
X19         -0.0086980  0.0354804  -0.245 0.806446
X20          0.0093156  0.0342376   0.272 0.785671
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.7251 on 479 degrees of freedom
Multiple R-squared: 0.7187,     Adjusted R-squared: 0.707
F-statistic:  61.2 on 20 and 479 DF,  p-value: < 2.2e-16

The first 4 variables are “highly significant” and the 5th is borderline.

#### Saturated model

We can fit the saturated model using JAGS with the following code.

require(rjags)
data=list(y=y,X=X,n=n,p=p)
init=list(tau=1,alpha=0,beta=rep(0,p))
modelstring="
model {
for (i in 1:n) {
mean[i]<-alpha+inprod(X[i,],beta)
y[i]~dnorm(mean[i],tau)
}
for (j in 1:p) {
beta[j]~dnorm(0,0.001)
}
alpha~dnorm(0,0.0001)
tau~dgamma(1,0.001)
}
"
model=jags.model(textConnection(modelstring),
data=data,inits=init)
update(model,n.iter=100)
output=coda.samples(model=model,variable.names=c("alpha","beta","tau"),
n.iter=10000,thin=1)
print(summary(output))
plot(output)

I’ve hard-coded various hyper-parameters in the script which are vaguely reasonable for this kind of problem. I won’t include all of the output in this post, but this works fine and gives sensible results. However, it does not address the variable selection problem.

#### Basic variable selection

Let’s now modify the above script to do basic variable selection in the style of Kuo and Mallick.

data=list(y=y,X=X,n=n,p=p)
init=list(tau=1,alpha=0,betaT=rep(0,p),ind=rep(0,p))
modelstring="
model {
for (i in 1:n) {
mean[i]<-alpha+inprod(X[i,],beta)
y[i]~dnorm(mean[i],tau)
}
for (j in 1:p) {
ind[j]~dbern(0.2)
betaT[j]~dnorm(0,0.001)
beta[j]<-ind[j]*betaT[j]
}
alpha~dnorm(0,0.0001)
tau~dgamma(1,0.001)
}
"
model=jags.model(textConnection(modelstring),
data=data,inits=init)
update(model,n.iter=1000)
output=coda.samples(model=model,
variable.names=c("alpha","beta","ind","tau"),
n.iter=10000,thin=1)
print(summary(output))
plot(output)

Note that I’ve hard-coded an expectation that around 20% of variables should be included in the model. Again, I won’t include all of the output here, but the posterior mean of the indicator variables can be interpreted as posterior probabilities that the variables should be included in the model. Inspecting the output then reveals that the first three variables have a posterior probability of very close to one, the 4th variable has a small but non-negligible probability of inclusion, and the other variables all have very small probabilities of inclusion.

This is fine so far as it goes, but is not entirely satisfactory. One problem is that the choice of a “fixed effects” prior for the regression coefficients of the included variables is likely to lead to a Lindley’s paradox type situation, and a consequent under-selection of variables. It is arguably better to model the distribution of included variables using a “random effects” approach, leading to a more appropriate distribution for the included variables.

#### Variable selection with random effects

Adopting a random effects distribution for the included coefficients that is normal with mean zero and unknown variance helps to combat Lindley’s paradox, and can be implemented as follows.

data=list(y=y,X=X,n=n,p=p)
init=list(tau=1,taub=1,alpha=0,betaT=rep(0,p),ind=rep(0,p))
modelstring="
model {
for (i in 1:n) {
mean[i]<-alpha+inprod(X[i,],beta)
y[i]~dnorm(mean[i],tau)
}
for (j in 1:p) {
ind[j]~dbern(0.2)
betaT[j]~dnorm(0,taub)
beta[j]<-ind[j]*betaT[j]
}
alpha~dnorm(0,0.0001)
tau~dgamma(1,0.001)
taub~dgamma(1,0.001)
}
"
model=jags.model(textConnection(modelstring),
data=data,inits=init)
update(model,n.iter=1000)
output=coda.samples(model=model,
variable.names=c("alpha","beta","ind","tau","taub"),
n.iter=10000,thin=1)
print(summary(output))
plot(output)

This leads to a large inclusion probability for the 4th variable, and non-negligible inclusion probabilities for the next few (it is obviously somewhat dependent on the simulated data set). This random effects variable selection modelling approach generally performs better, but it still has the potentially undesirable feature of hard-coding the probability of variable inclusion. Under the prior model, the number of variables included is binomial, and the binomial distribution is rather concentrated about its mean. Where there is a general desire to control the degree of sparsity in the model, this is a good thing, but if there is considerable uncertainty about the degree of sparsity that is anticipated, then a more flexible model may be desirable.

#### Variable selection with random effects and a prior on the inclusion probability

The previous model can be modified by introducing a Beta prior for the model inclusion probability. This induces a distribution for the number of included variables which has longer tails than the binomial distribution, allowing the model to learn about the degree of sparsity.

data=list(y=y,X=X,n=n,p=p)
init=list(tau=1,taub=1,pind=0.5,alpha=0,betaT=rep(0,p),ind=rep(0,p))
modelstring="
model {
for (i in 1:n) {
mean[i]<-alpha+inprod(X[i,],beta)
y[i]~dnorm(mean[i],tau)
}
for (j in 1:p) {
ind[j]~dbern(pind)
betaT[j]~dnorm(0,taub)
beta[j]<-ind[j]*betaT[j]
}
alpha~dnorm(0,0.0001)
tau~dgamma(1,0.001)
taub~dgamma(1,0.001)
pind~dbeta(2,8)
}
"
model=jags.model(textConnection(modelstring),
data=data,inits=init)
update(model,n.iter=1000)
output=coda.samples(model=model,
variable.names=c("alpha","beta","ind","tau","taub","pind"),
n.iter=10000,thin=1)
print(summary(output))
plot(output)

It turns out that for this particular problem the posterior distribution is not very different to the previous case, as for this problem the hard-coded choice of 20% is quite consistent with the data. However, the variable inclusion probabilities can be rather sensitive to the choice of hard-coded proportion.

#### Conclusion

Bayesian variable selection (and model selection more generally) is a very delicate topic, and there is much more to say about it. In this post I’ve concentrated on the practicalities of introducing variable selection into JAGS models. For further reading, I highly recommend the review of O’Hara and Sillanpaa (2009), which discusses other computational algorithms for variable selection. I intend to discuss some of the other methods in future posts.

#### References

O’Hara, R. and Sillanpaa, M. (2009) A review of Bayesian variable selection methods: what, how and which. Bayesian Analysis, 4(1):85-118. [DOI, PDF, Supp, BUGS Code]
Kuo, L. and Mallick, B. (1998) Variable selection for regression models. Sankhya B, 60(1):65-81.