## MCMC as a Stream

### Introduction

This weekend I’ve been preparing some material for my upcoming Scala for statistical computing short course. As part of the course, I thought it would be useful to walk through how to think about and structure MCMC codes, and in particular, how to think about MCMC algorithms as infinite streams of state. This material is reasonably stand-alone, so it seems suitable for a blog post. Complete runnable code for the examples in this post are available from my blog repo.

### A simple MH sampler

For this post I will just consider a trivial toy Metropolis algorithm using a Uniform random walk proposal to target a standard normal distribution. I’ve considered this problem before on my blog, so if you aren’t very familiar with Metropolis-Hastings algorithms, you might want to quickly review my post on Metropolis-Hastings MCMC algorithms in R before continuing. At the end of that post, I gave the following R code for the Metropolis sampler:

metrop3<-function(n=1000,eps=0.5)
{
vec=vector("numeric", n)
x=0
oldll=dnorm(x,log=TRUE)
vec[1]=x
for (i in 2:n) {
can=x+runif(1,-eps,eps)
loglik=dnorm(can,log=TRUE)
loga=loglik-oldll
if (log(runif(1)) < loga) {
x=can
oldll=loglik
}
vec[i]=x
}
vec
}


I will begin this post with a fairly direct translation of this algorithm into Scala:

def metrop1(n: Int = 1000, eps: Double = 0.5): DenseVector[Double] = {
val vec = DenseVector.fill(n)(0.0)
var x = 0.0
var oldll = Gaussian(0.0, 1.0).logPdf(x)
vec(0) = x
(1 until n).foreach { i =>
val can = x + Uniform(-eps, eps).draw
val loglik = Gaussian(0.0, 1.0).logPdf(can)
val loga = loglik - oldll
if (math.log(Uniform(0.0, 1.0).draw) < loga) {
x = can
oldll = loglik
}
vec(i) = x
}
vec
}


This code works, and is reasonably fast and efficient, but there are several issues with it from a functional programmers perspective. One issue is that we have committed to storing all MCMC output in RAM in a DenseVector. This probably isn’t an issue here, but for some big problems we might prefer to not store the full set of states, but to just print the states to (say) the console, for possible re-direction to a file. It is easy enough to modify the code to do this:

def metrop2(n: Int = 1000, eps: Double = 0.5): Unit = {
var x = 0.0
var oldll = Gaussian(0.0, 1.0).logPdf(x)
(1 to n).foreach { i =>
val can = x + Uniform(-eps, eps).draw
val loglik = Gaussian(0.0, 1.0).logPdf(can)
val loga = loglik - oldll
if (math.log(Uniform(0.0, 1.0).draw) < loga) {
x = can
oldll = loglik
}
println(x)
}
}


But now we have two version of the algorithm. One for storing results locally, and one for streaming results to the console. This is clearly unsatisfactory, but we shall return to this issue shortly. Another issue that will jump out at functional programmers is the reliance on mutable variables for storing the state and old likelihood. Let’s fix that now by re-writing the algorithm as a tail-recursion.

@tailrec
def metrop3(n: Int = 1000, eps: Double = 0.5, x: Double = 0.0, oldll: Double = Double.MinValue): Unit = {
if (n > 0) {
println(x)
val can = x + Uniform(-eps, eps).draw
val loglik = Gaussian(0.0, 1.0).logPdf(can)
val loga = loglik - oldll
if (math.log(Uniform(0.0, 1.0).draw) < loga)
metrop3(n - 1, eps, can, loglik)
else
metrop3(n - 1, eps, x, oldll)
}
}


This has eliminated the vars, and is just as fast and efficient as the previous version of the code. Note that the @tailrec annotation is optional – it just signals to the compiler that we want it to throw an error if for some reason it cannot eliminate the tail call. However, this is for the print-to-console version of the code. What if we actually want to keep the iterations in RAM for subsequent analysis? We can keep the values in an accumulator, as follows.

@tailrec
def metrop4(n: Int = 1000, eps: Double = 0.5, x: Double = 0.0, oldll: Double = Double.MinValue, acc: List[Double] = Nil): DenseVector[Double] = {
if (n == 0)
DenseVector(acc.reverse.toArray)
else {
val can = x + Uniform(-eps, eps).draw
val loglik = Gaussian(0.0, 1.0).logPdf(can)
val loga = loglik - oldll
if (math.log(Uniform(0.0, 1.0).draw) < loga)
metrop4(n - 1, eps, can, loglik, can :: acc)
else
metrop4(n - 1, eps, x, oldll, x :: acc)
}
}


### Factoring out the updating logic

This is all fine, but we haven’t yet addressed the issue of having different versions of the code depending on what we want to do with the output. The problem is that we have tied up the logic of advancing the Markov chain with what to do with the output. What we need to do is separate out the code for advancing the state. We can do this by defining a new function.

def newState(x: Double, oldll: Double, eps: Double): (Double, Double) = {
val can = x + Uniform(-eps, eps).draw
val loglik = Gaussian(0.0, 1.0).logPdf(can)
val loga = loglik - oldll
if (math.log(Uniform(0.0, 1.0).draw) < loga) (can, loglik) else (x, oldll)
}


This function takes as input a current state and associated log likelihood and returns a new state and log likelihood following the execution of one step of a MH algorithm. This separates the concern of state updating from the rest of the code. So now if we want to write code that prints the state, we can write it as

  @tailrec
def metrop5(n: Int = 1000, eps: Double = 0.5, x: Double = 0.0, oldll: Double = Double.MinValue): Unit = {
if (n > 0) {
println(x)
val ns = newState(x, oldll, eps)
metrop5(n - 1, eps, ns._1, ns._2)
}
}


and if we want to accumulate the set of states visited, we can write that as

  @tailrec
def metrop6(n: Int = 1000, eps: Double = 0.5, x: Double = 0.0, oldll: Double = Double.MinValue, acc: List[Double] = Nil): DenseVector[Double] = {
if (n == 0) DenseVector(acc.reverse.toArray) else {
val ns = newState(x, oldll, eps)
metrop6(n - 1, eps, ns._1, ns._2, ns._1 :: acc)
}
}


Both of these functions call newState to do the real work, and concentrate on what to do with the sequence of states. However, both of these functions repeat the logic of how to iterate over the sequence of states.

### MCMC as a stream

Ideally we would like to abstract out the details of how to do state iteration from the code as well. Most functional languages have some concept of a Stream, which represents a (potentially infinite) sequence of states. The Stream can embody the logic of how to perform state iteration, allowing us to abstract that away from our code, as well.

To do this, we will restructure our code slightly so that it more clearly maps old state to new state.

def nextState(eps: Double)(state: (Double, Double)): (Double, Double) = {
val x = state._1
val oldll = state._2
val can = x + Uniform(-eps, eps).draw
val loglik = Gaussian(0.0, 1.0).logPdf(can)
val loga = loglik - oldll
if (math.log(Uniform(0.0, 1.0).draw) < loga) (can, loglik) else (x, oldll)
}


The "real" state of the chain is just x, but if we want to avoid recalculation of the old likelihood, then we need to make this part of the chain’s state. We can use this nextState function in order to construct a Stream.

  def metrop7(eps: Double = 0.5, x: Double = 0.0, oldll: Double = Double.MinValue): Stream[Double] =
Stream.iterate((x, oldll))(nextState(eps)) map (_._1)


The result of calling this is an infinite stream of states. Obviously it isn’t computed – that would require infinite computation, but it captures the logic of iteration and computation in a Stream, that can be thought of as a lazy List. We can get values out by converting the Stream to a regular collection, being careful to truncate the Stream to one of finite length beforehand! eg. metrop7().drop(1000).take(10000).toArray will do a burn-in of 1,000 iterations followed by a main monitoring run of length 10,000, capturing the results in an Array. Note that metrop7().drop(1000).take(10000) is a Stream, and so nothing is actually computed until the toArray is encountered. Conversely, if printing to console is required, just replace the .toArray with .foreach(println).

The above stream-based approach to MCMC iteration is clean and elegant, and deals nicely with issues like burn-in and thinning (which can be handled similarly). This is how I typically write MCMC codes these days. However, functional programming purists would still have issues with this approach, as it isn’t quite pure functional. The problem is that the code isn’t pure – it has a side-effect, which is to mutate the state of the under-pinning pseudo-random number generator. If the code was pure, calling nextState with the same inputs would always give the same result. Clearly this isn’t the case here, as we have specifically designed the function to be stochastic, returning a randomly sampled value from the desired probability distribution. So nextState represents a function for randomly sampling from a conditional probability distribution.

### A pure functional approach

Now, ultimately all code has side-effects, or there would be no point in running it! But in functional programming the desire is to make as much of the code as possible pure, and to push side-effects to the very edges of the code. So it’s fine to have side-effects in your main method, but not buried deep in your code. Here the side-effect is at the very heart of the code, which is why it is potentially an issue.

To keep things as simple as possible, at this point we will stop worrying about carrying forward the old likelihood, and hard-code a value of eps. Generalisation is straightforward. We can make our code pure by instead defining a function which represents the conditional probability distribution itself. For this we use a probability monad, which in Breeze is called Rand. We can couple together such functions using monadic binds (flatMap in Scala), expressed most neatly using for-comprehensions. So we can write our transition kernel as

def kernel(x: Double): Rand[Double] = for {
innov <- Uniform(-0.5, 0.5)
can = x + innov
oldll = Gaussian(0.0, 1.0).logPdf(x)
loglik = Gaussian(0.0, 1.0).logPdf(can)
loga = loglik - oldll
u <- Uniform(0.0, 1.0)
} yield if (math.log(u) < loga) can else x


This is now pure – the same input x will always return the same probability distribution – the conditional distribution of the next state given the current state. We can draw random samples from this distribution if we must, but it’s probably better to work as long as possible with pure functions. So next we need to encapsulate the iteration logic. Breeze has a MarkovChain object which can take kernels of this form and return a stochastic Process object representing the iteration logic, as follows.

MarkovChain(0.0)(kernel).
steps.
drop(1000).
take(10000).
foreach(println)


The steps method contains the logic of how to advance the state of the chain. But again note that no computation actually takes place until the foreach method is encountered – this is when the sampling occurs and the side-effects happen.

Metropolis-Hastings is a common use-case for Markov chains, so Breeze actually has a helper method built-in that will construct a MH sampler directly from an initial state, a proposal kernel, and a (log) target.

MarkovChain.
metropolisHastings(0.0, (x: Double) =>
Uniform(x - 0.5, x + 0.5))(x =>
Gaussian(0.0, 1.0).logPdf(x)).
steps.
drop(1000).
take(10000).
toArray


Note that if you are using the MH functionality in Breeze, it is important to make sure that you are using version 0.13 (or later), as I fixed a few issues with the MH code shortly prior to the 0.13 release.

### Summary

Viewing MCMC algorithms as infinite streams of state is useful for writing elegant, generic, flexible code. Streams occur everywhere in programming, and so there are lots of libraries for working with them. In this post I used the simple Stream from the Scala standard library, but there are much more powerful and flexible stream libraries for Scala, including fs2 and Akka-streams. But whatever libraries you are using, the fundamental concepts are the same. The most straightforward approach to implementation is to define impure stochastic streams to consume. However, a pure functional approach is also possible, and the Breeze library defines some useful functions to facilitate this approach. I’m still a little bit ambivalent about whether the pure approach is worth the additional cognitive overhead, but it’s certainly very interesting and worth playing with and thinking about the pros and cons.

Complete runnable code for the examples in this post are available from my blog repo.

## One-way ANOVA with fixed and random effects from a Bayesian perspective

This blog post is derived from a computer practical session that I ran as part of my new course on Statistics for Big Data, previously discussed. This course covered a lot of material very quickly. In particular, I deferred introducing notions of hierarchical modelling until the Bayesian part of the course, where I feel it is more natural and powerful. However, some of the terminology associated with hierarchical statistical modelling probably seems a bit mysterious to those without a strong background in classical statistical modelling, and so this practical session was intended to clear up some potential confusion. I will analyse a simple one-way Analysis of Variance (ANOVA) model from a Bayesian perspective, making sure to highlight the difference between fixed and random effects in a Bayesian context where everything is random, as well as emphasising the associated identifiability issues. R code is used to illustrate the ideas.

### Example scenario

We will consider the body mass index (BMI) of new male undergraduate students at a selection of UK Universities. Let us suppose that our data consist of measurements of (log) BMI for a random sample of 1,000 males at each of 8 Universities. We are interested to know if there are any differences between the Universities. Again, we want to model the process as we would simulate it, so thinking about how we would simulate such data is instructive. We start by assuming that the log BMI is a normal random quantity, and that the variance is common across the Universities in question (this is quite a big assumption, and it is easy to relax). We assume that the mean of this normal distribution is University-specific, but that we do not have strong prior opinions regarding the way in which the Universities differ. That said, we expect that the Universities would not be very different from one another.

### Simulating data

A simple simulation of the data with some plausible parameters can be carried out as follows.

set.seed(1)
Z=matrix(rnorm(1000*8,3.1,0.1),nrow=8)
RE=rnorm(8,0,0.01)
X=t(Z+RE)
colnames(X)=paste("Uni",1:8,sep="")
Data=stack(data.frame(X))
boxplot(exp(values)~ind,data=Data,notch=TRUE)


Make sure that you understand exactly what this code is doing before proceeding. The boxplot showing the simulated data is given below.

### Frequentist analysis

We will start with a frequentist analysis of the data. The model we would like to fit is

$y_{ij} = \mu + \theta_i + \varepsilon_{ij}$

where i is an indicator for the University and j for the individual within a particular University. The “effect”, $\theta_i$ represents how the ith University differs from the overall mean. We know that this model is not actually identifiable when the model parameters are all treated as “fixed effects”, but R will handle this for us.

> mod=lm(values~ind,data=Data)
> summary(mod)

Call:
lm(formula = values ~ ind, data = Data)

Residuals:
Min       1Q   Median       3Q      Max
-0.36846 -0.06778 -0.00069  0.06910  0.38219

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept)  3.101068   0.003223 962.244  < 2e-16 ***
indUni2     -0.006516   0.004558  -1.430 0.152826
indUni3     -0.017168   0.004558  -3.767 0.000166 ***
indUni4      0.017916   0.004558   3.931 8.53e-05 ***
indUni5     -0.022838   0.004558  -5.011 5.53e-07 ***
indUni6     -0.001651   0.004558  -0.362 0.717143
indUni7      0.007935   0.004558   1.741 0.081707 .
indUni8      0.003373   0.004558   0.740 0.459300
---
Signif. codes:  0 â€˜***â€™ 0.001 â€˜**â€™ 0.01 â€˜*â€™ 0.05 â€˜.â€™ 0.1 â€˜ â€™ 1

Residual standard error: 0.1019 on 7992 degrees of freedom
Multiple R-squared:  0.01439,	Adjusted R-squared:  0.01353
F-statistic: 16.67 on 7 and 7992 DF,  p-value: < 2.2e-16


We see that R has handled the identifiability problem using “treatment contrasts”, dropping the fixed effect for the first university, so that the intercept actually represents the mean value for the first University, and the effects for the other Univeristies represent the differences from the first University. If we would prefer to impose a sum constraint, then we can switch to sum contrasts with

options(contrasts=rep("contr.sum",2))


and then re-fit the model.

> mods=lm(values~ind,data=Data)
> summary(mods)

Call:
lm(formula = values ~ ind, data = Data)

Residuals:
Min       1Q   Median       3Q      Max
-0.36846 -0.06778 -0.00069  0.06910  0.38219

Coefficients:
Estimate Std. Error  t value Pr(>|t|)
(Intercept)  3.0986991  0.0011394 2719.558  < 2e-16 ***
ind1         0.0023687  0.0030146    0.786 0.432048
ind2        -0.0041477  0.0030146   -1.376 0.168905
ind3        -0.0147997  0.0030146   -4.909 9.32e-07 ***
ind4         0.0202851  0.0030146    6.729 1.83e-11 ***
ind5        -0.0204693  0.0030146   -6.790 1.20e-11 ***
ind6         0.0007175  0.0030146    0.238 0.811889
ind7         0.0103039  0.0030146    3.418 0.000634 ***
---
Signif. codes:  0 â€˜***â€™ 0.001 â€˜**â€™ 0.01 â€˜*â€™ 0.05 â€˜.â€™ 0.1 â€˜ â€™ 1

Residual standard error: 0.1019 on 7992 degrees of freedom
Multiple R-squared:  0.01439,	Adjusted R-squared:  0.01353
F-statistic: 16.67 on 7 and 7992 DF,  p-value: < 2.2e-16


This has 7 degrees of freedom for the effects, as before, but ensures that the 8 effects sum to precisely zero. This is arguably more interpretable in this case.

### Bayesian analysis

We will now analyse the simulated data from a Bayesian perspective, using JAGS.

#### Fixed effects

All parameters in Bayesian models are uncertain, and therefore random, so there is much confusion regarding the difference between “fixed” and “random” effects in a Bayesian context. For “fixed” effects, our prior captures the idea that we sample the effects independently from a “fixed” (typically vague) prior distribution. We could simply code this up and fit it in JAGS as follows.

require(rjags)
n=dim(X)[1]
p=dim(X)[2]
data=list(X=X,n=n,p=p)
init=list(mu=2,tau=1)
modelstring="
model {
for (j in 1:p) {
theta[j]~dnorm(0,0.0001)
for (i in 1:n) {
X[i,j]~dnorm(mu+theta[j],tau)
}
}
mu~dnorm(0,0.0001)
tau~dgamma(1,0.0001)
}
"
model=jags.model(textConnection(modelstring),data=data,inits=init)
update(model,n.iter=1000)
output=coda.samples(model=model,variable.names=c("mu","tau","theta"),n.iter=100000,thin=10)
print(summary(output))
plot(output)
autocorr.plot(output)
pairs(as.matrix(output))
crosscorr.plot(output)


On running the code we can clearly see that this naive approach leads to high posterior correlation between the mean and the effects, due to the fundamental lack of identifiability of the model. This also leads to MCMC mixing problems, but it is important to understand that this computational issue is conceptually entirely separate from the fundamental statisticial identifiability issue. Even if we could avoid MCMC entirely, the identifiability issue would remain.

A quick fix for the identifiability issue is to use “treatment contrasts”, just as for the frequentist model. We can implement that as follows.

data=list(X=X,n=n,p=p)
init=list(mu=2,tau=1)
modelstring="
model {
for (j in 1:p) {
for (i in 1:n) {
X[i,j]~dnorm(mu+theta[j],tau)
}
}
theta[1]<-0
for (j in 2:p) {
theta[j]~dnorm(0,0.0001)
}
mu~dnorm(0,0.0001)
tau~dgamma(1,0.0001)
}
"
model=jags.model(textConnection(modelstring),data=data,inits=init)
update(model,n.iter=1000)
output=coda.samples(model=model,variable.names=c("mu","tau","theta"),n.iter=100000,thin=10)
print(summary(output))
plot(output)
autocorr.plot(output)
pairs(as.matrix(output))
crosscorr.plot(output)


Running this we see that the model now works perfectly well, mixes nicely, and gives sensible inferences for the treatment effects.

Another source of confusion for models of this type is data formating and indexing in JAGS models. For our balanced data there was not problem passing in data to JAGS as a matrix and specifying the model using nested loops. However, for unbalanced designs this is not necessarily so convenient, and so then it can be helpful to specify the model based on two-column data, as we would use for fitting using lm(). This is illustrated with the following model specification, which is exactly equivalent to the previous model, and should give identical (up to Monte Carlo error) results.

N=n*p
data=list(y=Data$values,g=Data$ind,N=N,p=p)
init=list(mu=2,tau=1)
modelstring="
model {
for (i in 1:N) {
y[i]~dnorm(mu+theta[g[i]],tau)
}
theta[1]<-0
for (j in 2:p) {
theta[j]~dnorm(0,0.0001)
}
mu~dnorm(0,0.0001)
tau~dgamma(1,0.0001)
}
"
model=jags.model(textConnection(modelstring),data=data,inits=init)
update(model,n.iter=1000)
output=coda.samples(model=model,variable.names=c("mu","tau","theta"),n.iter=100000,thin=10)
print(summary(output))
plot(output)


As suggested above, this indexing scheme is much more convenient for unbalanced data, and hence widely used. However, since our data is balanced here, we will revert to the matrix approach for the remainder of the post.

One final thing to consider before moving on to random effects is the sum-contrast model. We can implement this in various ways, but I’ve tried to encode it for maximum clarity below, imposing the sum-to-zero constraint via the final effect.

data=list(X=X,n=n,p=p)
init=list(mu=2,tau=1)
modelstring="
model {
for (j in 1:p) {
for (i in 1:n) {
X[i,j]~dnorm(mu+theta[j],tau)
}
}
for (j in 1:(p-1)) {
theta[j]~dnorm(0,0.0001)
}
theta[p] <- -sum(theta[1:(p-1)])
mu~dnorm(0,0.0001)
tau~dgamma(1,0.0001)
}
"
model=jags.model(textConnection(modelstring),data=data,inits=init)
update(model,n.iter=1000)
output=coda.samples(model=model,variable.names=c("mu","tau","theta"),n.iter=100000,thin=10)
print(summary(output))
plot(output)


Again, this works perfectly well and gives similar results to the frequentist analysis.

#### Random effects

The key difference between fixed and random effects in a Bayesian framework is that random effects are not independent, being drawn from a distribution with parameters which are not fixed. Essentially, there is another level of hierarchy involved in the specification of the random effects. This is best illustrated by example. A random effects model for this problem is given below.

data=list(X=X,n=n,p=p)
init=list(mu=2,tau=1)
modelstring="
model {
for (j in 1:p) {
theta[j]~dnorm(0,taut)
for (i in 1:n) {
X[i,j]~dnorm(mu+theta[j],tau)
}
}
mu~dnorm(0,0.0001)
tau~dgamma(1,0.0001)
taut~dgamma(1,0.0001)
}
"
model=jags.model(textConnection(modelstring),data=data,inits=init)
update(model,n.iter=1000)
output=coda.samples(model=model,variable.names=c("mu","tau","taut","theta"),n.iter=100000,thin=10)
print(summary(output))
plot(output)


The only difference between this and our first naive attempt at a Bayesian fixed effects model is that we have put a gamma prior on the precision of the effect. Note that this model now runs and fits perfectly well, with reasonable mixing, and gives sensible parameter inferences. Although the effects here are not constrained to sum-to-zero, like in the case of sum contrasts for a fixed effects model, the prior encourages shrinkage towards zero, and so the random effect distribution can be thought of as a kind of soft version of a hard sum-to-zero constraint. From a predictive perspective, this model is much more powerful. In particular, using a random effects model, we can make strong predictions for unobserved groups (eg. a ninth University), with sensible prediction intervals based on our inferred understanding of how similar different universities are. Using a fixed effects model this isn’t really possible. Even for a Bayesian version of a fixed effects model using proper (but vague) priors, prediction intervals for unobserved groups are not really sensible.

Since we have used simulated data here, we can compare the estimated random effects with the true effects generated during the simulation.

> apply(as.matrix(output),2,mean)
mu           tau          taut      theta[1]      theta[2]
3.098813e+00  9.627110e+01  7.015976e+03  2.086581e-03 -3.935511e-03
theta[3]      theta[4]      theta[5]      theta[6]      theta[7]
-1.389099e-02  1.881528e-02 -1.921854e-02  5.640306e-04  9.529532e-03
theta[8]
5.227518e-03
> RE
[1]  0.002637034 -0.008294518 -0.014616348  0.016839902 -0.015443243
[6] -0.001908871  0.010162117  0.005471262


We see that the Bayesian random effects model has done an excellent job of estimation. If we wished, we could relax the assumption of common variance across the groups by making tau a vector indexed by j, though there is not much point in persuing this here, since we know that the groups do all have the same variance.

#### Strong subjective priors

The above is the usual story regarding fixed and random effects in Bayesian inference. I hope this is reasonably clear, so really I should quit while I’m ahead… However, the issues are really a bit more subtle than I’ve suggested. The inferred precision of the random effects was around 7,000, so now lets re-run the original, naive, “fixed effects” model with a strong subjective Bayesian prior on the distribution of the effects.

data=list(X=X,n=n,p=p)
init=list(mu=2,tau=1)
modelstring="
model {
for (j in 1:p) {
theta[j]~dnorm(0,7000)
for (i in 1:n) {
X[i,j]~dnorm(mu+theta[j],tau)
}
}
mu~dnorm(0,0.0001)
tau~dgamma(1,0.0001)
}
"
model=jags.model(textConnection(modelstring),data=data,inits=init)
update(model,n.iter=1000)
output=coda.samples(model=model,variable.names=c("mu","tau","theta"),n.iter=100000,thin=10)
print(summary(output))
plot(output)


This model also runs perfectly well and gives sensible inferences, despite the fact that the effects are iid from a fixed distribution and there is no hard constraint on the effects. Similarly, we can make sensible predictions, together with appropriate prediction intervals, for an unobserved group. So it isn’t so much the fact that the effects are coupled via an extra level of hierarchy that makes things work. It’s really the fact that the effects are sensibly distributed and not just sampled directly from a vague prior. So for “real” subjective Bayesians the line between fixed and random effects is actually very blurred indeed…

## Parallel Monte Carlo using Scala

### Introduction

In previous posts I have discussed general issues regarding parallel MCMC and examined in detail parallel Monte Carlo on a multicore laptop. In those posts I used the C programming language in conjunction with the MPI parallel library in order to illustrate the concepts. In this post I want to take the example from the second post and re-examine it using the Scala programming language.

The toy problem considered in the parallel Monte Carlo post used $10^9$ $U(0,1)$ random quantities to construct a Monte Carlo estimate of the integral

$\displaystyle I=\int_0^1\exp\{-u^2\}du$.

A very simple serial program to implement this algorithm is given below:

import java.util.concurrent.ThreadLocalRandom
import scala.math.exp
import scala.annotation.tailrec

object MonteCarlo {

@tailrec
def sum(its: Long,acc: Double): Double = {
if (its==0)
(acc)
else {
sum(its-1,acc+exp(-u*u))
}
}

def main(args: Array[String]) = {
println("Hello")
val iters=1000000000
val result=sum(iters,0.0)
println(result/iters)
println("Goodbye")
}

}


Note that ThreadLocalRandom is a parallel random number generator introduced into recent versions of the Java programming language, which can be easily utilised from Scala code. Assuming that Scala is installed, this can be compiled and run with commands like

scalac monte-carlo.scala
time scala MonteCarlo


This program works, and the timings (in seconds) for three runs are 57.79, 57.77 and 57.55 on the same laptop considered in the previous post. The first thing to note is that this Scala code is actually slightly faster than the corresponding C+MPI code in the single processor special case! Now that we have a good working implementation we should think how to parallelise it…

### Parallel implementation

Before constructing a parallel implementation, we will first construct a slightly re-factored serial version that will be easier to parallelise. The simplest way to introduce parallelisation into Scala code is to parallelise a map over a collection. We therefore need a collection and a map to apply to it. Here we will just divide our $10^9$ iterations into $N=4$ separate computations, and use a map to compute the required Monte Carlo sums.

import java.util.concurrent.ThreadLocalRandom
import scala.math.exp
import scala.annotation.tailrec

object MonteCarlo {

@tailrec
def sum(its: Long,acc: Double): Double = {
if (its==0)
(acc)
else {
sum(its-1,acc+exp(-u*u))
}
}

def main(args: Array[String]) = {
println("Hello")
val N=4
val iters=1000000000
val its=iters/N
val sums=(1 to N).toList map {x => sum(its,0.0)}
val result=sums.reduce(_+_)
println(result/iters)
println("Goodbye")
}

}


Running this new code confirms that it works and gives similar estimates for the Monte Carlo integral as the previous version. The timings for 3 runs on my laptop were 57.57, 57.67 and 57.80, similar to the previous version of the code. So far so good. But how do we make it parallel? Like this:

import java.util.concurrent.ThreadLocalRandom
import scala.math.exp
import scala.annotation.tailrec

object MonteCarlo {

@tailrec
def sum(its: Long,acc: Double): Double = {
if (its==0)
(acc)
else {
sum(its-1,acc+exp(-u*u))
}
}

def main(args: Array[String]) = {
println("Hello")
val N=4
val iters=1000000000
val its=iters/N
val sums=(1 to N).toList.par map {x => sum(its,0.0)}
val result=sums.reduce(_+_)
println(result/iters)
println("Goodbye")
}

}


That’s it! It’s now parallel. Studying the above code reveals that the only difference from the previous version is the introduction of the 4 characters .par in line 22 of the code. R programmers will find this very much analagous to using lapply() versus mclapply() in R code. The function par converts the collection (here an immutable List) to a parallel collection (here an immutable parallel List), and then subsequent maps, filters, etc., can be computed in parallel on appropriate multicore architectures. Timings for 3 runs on my laptop were 20.74, 20.82 and 20.88. Note that these timings are faster than the timings for N=4 processors for the corresponding C+MPI code…

### Varying the size of the parallel collection

We can trivially modify the previous code to make the size of the parallel collection, N, a command line argument:

import java.util.concurrent.ThreadLocalRandom
import scala.math.exp
import scala.annotation.tailrec

object MonteCarlo {

@tailrec
def sum(its: Long,acc: Double): Double = {
if (its==0)
(acc)
else {
sum(its-1,acc+exp(-u*u))
}
}

def main(args: Array[String]) = {
println("Hello")
val N=args(0).toInt
val iters=1000000000
val its=iters/N
val sums=(1 to N).toList.par map {x => sum(its,0.0)}
val result=sums.reduce(_+_)
println(result/iters)
println("Goodbye")
}

}


We can now run this code with varying sizes of N in order to see how the runtime of the code changes as the size of the parallel collection increases. Timings on my laptop are summarised in the table below.

 N     T1     T2     T3
1   57.67  57.62  57.83
2   32.20  33.24  32.76
3   26.63  26.60  26.63
4   20.99  20.92  20.75
5   20.13  18.70  18.76
6   16.57  16.52  16.59
7   15.72  14.92  15.27
8   13.56  13.51  13.32
9   18.30  18.13  18.12
10   17.25  17.33  17.22
11   17.04  16.99  17.09
12   15.95  15.85  15.91

16   16.62  16.68  16.74
32   15.41  15.54  15.42
64   15.03  15.03  15.28


So we see that the timings decrease steadily until the size of the parallel collection hits 8 (the number of processors my hyper-threaded quad-core presents via Linux), and then increases very slightly, but not much as the size of the collection increases. This is better than the case of C+MPI where performance degrades noticeably if too many processes are requested. Here, the Scala compiler and JVM runtime manage an appropriate number of threads for the collection irrespective of the actual size of the collection. Also note that all of the timings are faster than the corresponding C+MPI code discussed in the previous post.

However, the notion that the size of the collection is irrelevant is only true up to a point. Probably the most natural way to code this algorithm would be as:

import java.util.concurrent.ThreadLocalRandom
import scala.math.exp

object MonteCarlo {

def main(args: Array[String]) = {
println("Hello")
val iters=1000000000
val sums=(1 to iters).toList map {x => ThreadLocalRandom.current().nextDouble()} map {x => exp(-x*x)}
val result=sums.reduce(_+_)
println(result/iters)
println("Goodbye")
}

}


or as the parallel equivalent

import java.util.concurrent.ThreadLocalRandom
import scala.math.exp

object MonteCarlo {

def main(args: Array[String]) = {
println("Hello")
val iters=1000000000
val sums=(1 to iters).toList.par map {x => ThreadLocalRandom.current().nextDouble()} map {x => exp(-x*x)}
val result=sums.reduce(_+_)
println(result/iters)
println("Goodbye")
}

}


Although these algorithms are in many ways cleaner and more natural, they will bomb out with a lack of heap space unless you have a huge amount of RAM, as they rely on having all $10^9$ realisations in RAM simultaneously. The lesson here is that even though functional languages make it very easy to write clean, efficient parallel code, we must still be careful not to fill up the heap with gigantic (immutable) data structures…

## Brief introduction to Scala and Breeze for statistical computing

### Introduction

In the previous post I outlined why I think Scala is a good language for statistical computing and data science. In this post I want to give a quick taste of Scala and the Breeze numerical library to whet the appetite of the uninitiated. This post certainly won’t provide enough material to get started using Scala in anger – but I’ll try and provide a few pointers along the way. It also won’t be very interesting to anyone who knows Scala – I’m not introducing any of the very cool Scala stuff here – I think that some of the most powerful and interesting Scala language features can be a bit frightening for new users.

To reproduce the examples, you need to install Scala and Breeze. This isn’t very tricky, but I don’t want to get bogged down with a detailed walk-through here – I want to concentrate on the Scala language and Breeze library. You just need to install a recent version of Java, then Scala, and then Breeze. You might also want SBT and/or the ScalaIDE, though neither of these are necessary. Then you need to run the Scala REPL with the Breeze library in the classpath. There are several ways one can do this. The most obvious is to just run scala with the path to Breeze manually specified (or specified in an environment variable). Alternatively, you could run a console from an sbt session with a Breeze dependency (which is what I actually did for this post), or you could use a Scala Worksheet from inside a ScalaIDE project with a Breeze dependency.

### A Scala REPL session

#### A first glimpse of Scala

We’ll start with a few simple Scala concepts that are not dependent on Breeze. For further information, see the Scala documentation.

Welcome to Scala version 2.10.3 (OpenJDK 64-Bit Server VM, Java 1.7.0_25).
Type in expressions to have them evaluated.

scala> val a = 5
a: Int = 5

scala> a
res0: Int = 5


So far, so good. Using the Scala REPL is much like using the Python or R command line, so will be very familiar to anyone used to these or similar languages. The first thing to note is that labels need to be declared on first use. We have declared a to be a val. These are immutable values, which can not be just re-assigned, as the following code illustrates.

scala> a = 6
<console>:8: error: reassignment to val
a = 6
^
scala> a
res1: Int = 5


Immutability seems to baffle people unfamiliar with functional programming. But fear not, as Scala allows declaration of mutable variables as well:

scala> var b = 7
b: Int = 7

scala> b
res2: Int = 7

scala> b = 8
b: Int = 8

scala> b
res3: Int = 8


The Zen of functional programming is to realise that immutability is generally a good thing, but that really isn’t the point of this post. Scala has excellent support for both mutable and immutable collections as part of the standard library. See the API docs for more details. For example, it has immutable lists.

scala> val c = List(3,4,5,6)
c: List[Int] = List(3, 4, 5, 6)

scala> c(1)
res4: Int = 4

scala> c.sum
res5: Int = 18

scala> c.length
res6: Int = 4

scala> c.product
res7: Int = 360


Again, this should be pretty familiar stuff for anyone familiar with Python. Note that the sum and product methods are special cases of reduce operations, which are well supported in Scala. For example, we could compute the sum reduction using

scala> c.foldLeft(0)((x,y) => x+y)
res8: Int = 18


or the slightly more condensed form given below, and similarly for the product reduction.

scala> c.foldLeft(0)(_+_)
res9: Int = 18

scala> c.foldLeft(1)(_*_)
res10: Int = 360


Scala also has a nice immutable Vector class, which offers a range of constant time operations (but note that this has nothing to do with the mutable Vector class that is part of the Breeze library).

scala> val d = Vector(2,3,4,5,6,7,8,9)
d: scala.collection.immutable.Vector[Int] = Vector(2, 3, 4, 5, 6, 7, 8, 9)

scala> d
res11: scala.collection.immutable.Vector[Int] = Vector(2, 3, 4, 5, 6, 7, 8, 9)

scala> d.slice(3,6)
res12: scala.collection.immutable.Vector[Int] = Vector(5, 6, 7)

scala> val e = d.updated(3,0)
e: scala.collection.immutable.Vector[Int] = Vector(2, 3, 4, 0, 6, 7, 8, 9)

scala> d
res13: scala.collection.immutable.Vector[Int] = Vector(2, 3, 4, 5, 6, 7, 8, 9)

scala> e
res14: scala.collection.immutable.Vector[Int] = Vector(2, 3, 4, 0, 6, 7, 8, 9)


Note that when e is created as an updated version of d the whole of d is not copied – only the parts that have been updated. And we don’t have to worry that aspects of d and e point to the same information in memory, as they are both immutable… As should be clear by now, Scala has excellent support for functional programming techniques. In addition to the reduce operations mentioned already, maps and filters are also well covered.

scala> val f=(1 to 10).toList
f: List[Int] = List(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)

scala> f
res15: List[Int] = List(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)

scala> f.map(x => x*x)
res16: List[Int] = List(1, 4, 9, 16, 25, 36, 49, 64, 81, 100)

scala> f map {x => x*x}
res17: List[Int] = List(1, 4, 9, 16, 25, 36, 49, 64, 81, 100)

scala> f filter {_ > 4}
res18: List[Int] = List(5, 6, 7, 8, 9, 10)


Note how Scala allows methods with a single argument to be written as an infix operator, making for more readable code.

#### A first look at Breeze

The next part of the session requires the Breeze library – see the Breeze quickstart guide for further details. We begin by taking a quick look at everyone’s favourite topic of non-uniform random number generation. Let’s start by generating a couple of draws from a Poisson distribution with mean 3.

scala> import breeze.stats.distributions._
import breeze.stats.distributions._

scala> val poi = Poisson(3.0)
poi: breeze.stats.distributions.Poisson = Poisson(3.0)

scala> poi.draw
res19: Int = 2

scala> poi.draw
res20: Int = 3


If more than a single draw is required, an iid sample can be obtained.

scala> val x = poi.sample(10)
x: IndexedSeq[Int] = Vector(2, 3, 3, 4, 2, 2, 1, 2, 4, 2)

scala> x
res21: IndexedSeq[Int] = Vector(2, 3, 3, 4, 2, 2, 1, 2, 4, 2)

scala> x.sum
res22: Int = 25

scala> x.length
res23: Int = 10

scala> x.sum.toDouble/x.length
res24: Double = 2.5


Note that this Vector is mutable. The probability mass function (PMF) of the Poisson distribution is also available.

scala> poi.probabilityOf(2)
res25: Double = 0.22404180765538775

scala> x map {x => poi.probabilityOf(x)}
res26: IndexedSeq[Double] = Vector(0.22404180765538775, 0.22404180765538775, 0.22404180765538775, 0.16803135574154085, 0.22404180765538775, 0.22404180765538775, 0.14936120510359185, 0.22404180765538775, 0.16803135574154085, 0.22404180765538775)

scala> x map {poi.probabilityOf(_)}
res27: IndexedSeq[Double] = Vector(0.22404180765538775, 0.22404180765538775, 0.22404180765538775, 0.16803135574154085, 0.22404180765538775, 0.22404180765538775, 0.14936120510359185, 0.22404180765538775, 0.16803135574154085, 0.22404180765538775)


Obviously, Gaussian variables (and Gamma, and several others) are supported in a similar way.

scala> val gau=Gaussian(0.0,1.0)
gau: breeze.stats.distributions.Gaussian = Gaussian(0.0, 1.0)

scala> gau.draw
res28: Double = 1.606121255846881

scala> gau.draw
res29: Double = -0.1747896055492152

scala> val y=gau.sample(20)
y: IndexedSeq[Double] = Vector(-1.3758577012869702, -1.2148314970824652, -0.022501190144116855, 0.3244006323566883, 0.35978577573558407, 0.9651857500320781, -0.40834034207848985, 0.11583348205331555, -0.8797699986810634, -0.33609738668214695, 0.7043252811790879, -1.2045594639823656, 0.19442688045065826, -0.31442160076087067, 0.06313451540562891, -1.5304745838587115, -1.2372764884467027, 0.5875490994217284, -0.9385520597707431, -0.6647903243363228)

scala> y
res30: IndexedSeq[Double] = Vector(-1.3758577012869702, -1.2148314970824652, -0.022501190144116855, 0.3244006323566883, 0.35978577573558407, 0.9651857500320781, -0.40834034207848985, 0.11583348205331555, -0.8797699986810634, -0.33609738668214695, 0.7043252811790879, -1.2045594639823656, 0.19442688045065826, -0.31442160076087067, 0.06313451540562891, -1.5304745838587115, -1.2372764884467027, 0.5875490994217284, -0.9385520597707431, -0.6647903243363228)

scala> y.sum/y.length
res31: Double = -0.34064156102380994

scala> y map {gau.logPdf(_)}
res32: IndexedSeq[Double] = Vector(-1.8654307403000054, -1.6568463163564844, -0.9191916849836235, -0.9715564183413823, -0.9836614354155007, -1.3847302992371653, -1.0023094506890617, -0.9256472309869705, -1.3059361584943119, -0.975419259871957, -1.1669755840586733, -1.6444202843394145, -0.93783943912556, -0.9683690047171869, -0.9209315167224245, -2.090114759123421, -1.6843650876361744, -1.0915455053203147, -1.359378517654625, -1.1399116208702693)

scala> Gamma(2.0,3.0).sample(5)
res33: IndexedSeq[Double] = Vector(2.38436441278546, 2.125017198373521, 2.333118708811143, 5.880076392566909, 2.0901427084667503)


This is all good stuff for those of us who like to do Markov chain Monte Carlo. There are not masses of statistical data analysis routines built into Breeze, but a few basic tools are provided, including some basic summary statistics.

scala> import breeze.stats.DescriptiveStats._
import breeze.stats.DescriptiveStats._

scala> mean(y)
res34: Double = -0.34064156102380994

scala> variance(y)
res35: Double = 0.574257149387757

scala> meanAndVariance(y)
res36: (Double, Double) = (-0.34064156102380994,0.574257149387757)


Support for linear algebra is an important part of any scientific library. Here the Breeze developers have made the wise decision to provide a nice Scala interface to netlib-java. This in turn calls out to any native optimised BLAS or LAPACK libraries installed on the system, but will fall back to Java code if no optimised libraries are available. This means that linear algebra code using Scala and Breeze should run as fast as code written in any other language, including C, C++ and Fortran, provided that optimised libraries are installed on the system. For further details see the Breeze linear algebra guide. Let’s start by creating and messing with a dense vector.

scala> import breeze.linalg._
import breeze.linalg._

scala> val v=DenseVector(y.toArray)
v: breeze.linalg.DenseVector[Double] = DenseVector(-1.3758577012869702, -1.2148314970824652, -0.022501190144116855, 0.3244006323566883, 0.35978577573558407, 0.9651857500320781, -0.40834034207848985, 0.11583348205331555, -0.8797699986810634, -0.33609738668214695, 0.7043252811790879, -1.2045594639823656, 0.19442688045065826, -0.31442160076087067, 0.06313451540562891, -1.5304745838587115, -1.2372764884467027, 0.5875490994217284, -0.9385520597707431, -0.6647903243363228)

scala> v(1) = 0

scala> v
res38: breeze.linalg.DenseVector[Double] = DenseVector(-1.3758577012869702, 0.0, -0.022501190144116855, 0.3244006323566883, 0.35978577573558407, 0.9651857500320781, -0.40834034207848985, 0.11583348205331555, -0.8797699986810634, -0.33609738668214695, 0.7043252811790879, -1.2045594639823656, 0.19442688045065826, -0.31442160076087067, 0.06313451540562891, -1.5304745838587115, -1.2372764884467027, 0.5875490994217284, -0.9385520597707431, -0.6647903243363228)

scala> v(1 to 3) := 1.0
res39: breeze.linalg.DenseVector[Double] = DenseVector(1.0, 1.0, 1.0)

scala> v
res40: breeze.linalg.DenseVector[Double] = DenseVector(-1.3758577012869702, 1.0, 1.0, 1.0, 0.35978577573558407, 0.9651857500320781, -0.40834034207848985, 0.11583348205331555, -0.8797699986810634, -0.33609738668214695, 0.7043252811790879, -1.2045594639823656, 0.19442688045065826, -0.31442160076087067, 0.06313451540562891, -1.5304745838587115, -1.2372764884467027, 0.5875490994217284, -0.9385520597707431, -0.6647903243363228)

scala> v(1 to 3) := DenseVector(1.0,1.5,2.0)
res41: breeze.linalg.DenseVector[Double] = DenseVector(1.0, 1.5, 2.0)

scala> v
res42: breeze.linalg.DenseVector[Double] = DenseVector(-1.3758577012869702, 1.0, 1.5, 2.0, 0.35978577573558407, 0.9651857500320781, -0.40834034207848985, 0.11583348205331555, -0.8797699986810634, -0.33609738668214695, 0.7043252811790879, -1.2045594639823656, 0.19442688045065826, -0.31442160076087067, 0.06313451540562891, -1.5304745838587115, -1.2372764884467027, 0.5875490994217284, -0.9385520597707431, -0.6647903243363228)

scala> v :> 0.0
res43: breeze.linalg.BitVector = BitVector(1, 2, 3, 4, 5, 7, 10, 12, 14, 17)

scala> (v :> 0.0).toArray
res44: Array[Boolean] = Array(false, true, true, true, true, true, false, true, false, false, true, false, true, false, true, false, false, true, false, false)


Next let’s create and mess around with some dense matrices.

scala> val m = new DenseMatrix(5,4,linspace(1.0,20.0,20).toArray)
m: breeze.linalg.DenseMatrix[Double] =
1.0  6.0   11.0  16.0
2.0  7.0   12.0  17.0
3.0  8.0   13.0  18.0
4.0  9.0   14.0  19.0
5.0  10.0  15.0  20.0

scala> m
res45: breeze.linalg.DenseMatrix[Double] =
1.0  6.0   11.0  16.0
2.0  7.0   12.0  17.0
3.0  8.0   13.0  18.0
4.0  9.0   14.0  19.0
5.0  10.0  15.0  20.0

scala> m.rows
res46: Int = 5

scala> m.cols
res47: Int = 4

scala> m(::,1)
res48: breeze.linalg.DenseVector[Double] = DenseVector(6.0, 7.0, 8.0, 9.0, 10.0)

scala> m(1,::)
res49: breeze.linalg.DenseMatrix[Double] = 2.0  7.0  12.0  17.0

scala> m(1,::) := linspace(1.0,2.0,4)
res50: breeze.linalg.DenseMatrix[Double] = 1.0  1.3333333333333333  1.6666666666666665  2.0

scala> m
res51: breeze.linalg.DenseMatrix[Double] =
1.0  6.0                 11.0                16.0
1.0  1.3333333333333333  1.6666666666666665  2.0
3.0  8.0                 13.0                18.0
4.0  9.0                 14.0                19.0
5.0  10.0                15.0                20.0

scala>

scala> val n = m.t
n: breeze.linalg.DenseMatrix[Double] =
1.0   1.0                 3.0   4.0   5.0
6.0   1.3333333333333333  8.0   9.0   10.0
11.0  1.6666666666666665  13.0  14.0  15.0
16.0  2.0                 18.0  19.0  20.0

scala> n
res52: breeze.linalg.DenseMatrix[Double] =
1.0   1.0                 3.0   4.0   5.0
6.0   1.3333333333333333  8.0   9.0   10.0
11.0  1.6666666666666665  13.0  14.0  15.0
16.0  2.0                 18.0  19.0  20.0

scala> val o = m*n
o: breeze.linalg.DenseMatrix[Double] =
414.0              59.33333333333333  482.0              516.0              550.0
59.33333333333333  9.555555555555555  71.33333333333333  77.33333333333333  83.33333333333333
482.0              71.33333333333333  566.0              608.0              650.0
516.0              77.33333333333333  608.0              654.0              700.0
550.0              83.33333333333333  650.0              700.0              750.0

scala> o
res53: breeze.linalg.DenseMatrix[Double] =
414.0              59.33333333333333  482.0              516.0              550.0
59.33333333333333  9.555555555555555  71.33333333333333  77.33333333333333  83.33333333333333
482.0              71.33333333333333  566.0              608.0              650.0
516.0              77.33333333333333  608.0              654.0              700.0
550.0              83.33333333333333  650.0              700.0              750.0

scala> val p = n*m
p: breeze.linalg.DenseMatrix[Double] =
52.0                117.33333333333333  182.66666666666666  248.0
117.33333333333333  282.77777777777777  448.22222222222223  613.6666666666667
182.66666666666666  448.22222222222223  713.7777777777778   979.3333333333334
248.0               613.6666666666667   979.3333333333334   1345.0

scala> p
res54: breeze.linalg.DenseMatrix[Double] =
52.0                117.33333333333333  182.66666666666666  248.0
117.33333333333333  282.77777777777777  448.22222222222223  613.6666666666667
182.66666666666666  448.22222222222223  713.7777777777778   979.3333333333334
248.0               613.6666666666667   979.3333333333334   1345.0


So, messing around with vectors and matrices is more-or-less as convenient as in well-known dynamic and math languages. To conclude this section, let us see how to simulate some data from a regression model and then solve the least squares problem to obtain the estimated regression coefficients. We will simulate 1,000 observations from a model with 5 covariates.

scala> val X = new DenseMatrix(1000,5,gau.sample(5000).toArray)
X: breeze.linalg.DenseMatrix[Double] =
-0.40186606934180685  0.9847148198711287    ... (5 total)
-0.4760404521336951   -0.833737041320742    ...
-0.3315199616926892   -0.19460446824586297  ...
-0.14764615494496836  -0.17947658245206904  ...
-0.8357372755800905   -2.456222113596015    ...
-0.44458309216683184  1.848007773944826     ...
0.060314034896221065  0.5254462055311016    ...
0.8637867740789016    -0.9712570453363925   ...
0.11620167261655819   -1.2231380938032232   ...
-0.3335514290842617   -0.7487303696662753   ...
-0.5598937433421866   0.11083382409013512   ...
-1.7213395389510568   1.1717491221846357    ...
-1.078873342208984    0.9386859686451607    ...
-0.7793854546738327   -0.9829373863442161   ...
-1.054275201631216    0.10100826507456745   ...
-0.6947188686537832   1.215...
scala> val b0 = linspace(1.0,2.0,5)
b0: breeze.linalg.DenseVector[Double] = DenseVector(1.0, 1.25, 1.5, 1.75, 2.0)

scala> val y0 = X * b0
y0: breeze.linalg.DenseVector[Double] = DenseVector(0.08200546839589107, -0.5992571365601228, -5.646398002309553, -7.346136663325798, -8.486423788193362, 1.451119214541837, -0.25792385841948406, 2.324936340609002, -1.2285599639827862, -4.030261316643863, -4.1732627416377674, -0.5077151099958077, -0.2087263741903591, 0.46678616461409383, 2.0244342278575975, 1.775756468177401, -4.799821190728213, -1.8518388060564481, 1.5892306875621767, -1.6528539564387008, 1.4064864330994125, -0.8734630221484178, -7.75470002781836, -0.2893619536998493, -5.972958583649336, -4.952666733286302, 0.5431255990489059, -2.477076684976403, -0.6473617571867107, -0.509338416957489, -1.5415350935719594, -0.47068802465681125, 2.546118380362026, -7.940401988804477, -1.037049442788122, -1.564016663370888, -3.3147087994...
scala> val y = y0 + DenseVector(gau.sample(1000).toArray)
y: breeze.linalg.DenseVector[Double] = DenseVector(-0.572127338358624, -0.16481167194161406, -4.213873268823003, -10.142015065601388, -7.893898543052863, 1.7881055848475076, -0.26987820512025357, 3.3289433195054148, -2.514141419925489, -4.643625974157769, -3.8061000214061886, 0.6462624993109218, 0.23603338389134149, 1.0211137806779267, 2.0061727641393317, 0.022624943149799348, -5.429601401989341, -1.836181225242386, 1.0265599173053048, -0.1673732536615371, 0.8418249443853956, -1.1547110533101967, -8.392100167478764, -1.1586377992526877, -6.400362975646245, -5.487018086963841, 0.3038055584347069, -1.2247410435868684, -0.06476921390724344, -1.5039074374120407, -1.0189111630970076, 1.307339668865724, 2.048320821568789, -8.769328824477714, -0.9104251029228555, -1.3533910178496698, -2.178788...
scala> val b = X \ y  // defaults to a QR-solve of the least squares problem
b: breeze.linalg.DenseVector[Double] = DenseVector(0.9952708232116663, 1.2344546192238952, 1.5543512339052412, 1.744091673457169, 1.9874158953720507)


So all of the most important building blocks for statistical computing are included in the Breeze library.

At this point it is really worth reminding yourself that Scala is actually a statically typed language, despite the fact that in this session we have not explicitly declared the type of anything at all! This is because Scala has type inference, which makes type declarations optional when it is straightforward for the compiler to figure out what the types must be. For example, for our very first expression, val a = 5, because the RHS is an Int, it is clear that the LHS must also be an Int, and so the compiler infers that the type of a must be an Int, and treats the code as if the type had been declared as val a: Int = 5. This type inference makes Scala feel very much like a dynamic language in general use. Typically, we carefully specify the types of function arguments (and often the return type of the function, too), but then for the main body of each function, just let the compiler figure out all of the types and write code as if the language were dynamic. To me, this seems like the best of all worlds. The convenience of dynamic languages with the safety of static typing.

Declaring the types of function arguments is not usually a big deal, as the following simple example demonstrates.

scala> def mean(arr: Array[Int]): Double = {
|   arr.sum.toDouble/arr.length
| }
mean: (arr: Array[Int])Double

scala> mean(Array(3,1,4,5))
res55: Double = 3.25


### A complete Scala program

For completeness, I will finish this post with a very simple but complete Scala/Breeze program. In a previous post I discussed a simple Gibbs sampler in Scala, but in that post I used the Java COLT library for random number generation. Below is a version using Breeze instead.

object BreezeGibbs {

import breeze.stats.distributions._
import scala.math.sqrt

class State(val x: Double, val y: Double)

def nextIter(s: State): State = {
val newX = Gamma(3.0, 1.0 / ((s.y) * (s.y) + 4.0)).draw()
new State(newX, Gaussian(1.0 / (newX + 1), 1.0 / sqrt(2 * newX + 2)).draw())
}

def nextThinnedIter(s: State, left: Int): State = {
if (left == 0) s
else nextThinnedIter(nextIter(s), left - 1)
}

def genIters(s: State, current: Int, stop: Int, thin: Int): State = {
if (!(current > stop)) {
println(current + " " + s.x + " " + s.y)
genIters(nextThinnedIter(s, thin), current + 1, stop, thin)
} else s
}

def main(args: Array[String]) {
println("Iter x y")
genIters(new State(0.0, 0.0), 1, 50000, 1000)
}

}


### Summary

In this post I’ve tried to give a quick taste of the Scala language and the Breeze library for those used to dynamic languages for statistical computing. Hopefully I’ve illustrated that the basics don’t look too different, so there is no reason to fear Scala. It is perfectly possible to start using Scala as a better and faster Python or R. Once you’ve mastered the basics, you can then start exploring the full power of the language. There’s loads of introductory Scala material to be found on-line. It probably makes sense to start with the links I’ve highlighted above. After that, just start searching – there’s an interesting set of tutorials I noticed just the other day. A very time-efficient way to learn Scala quickly is to do the FP with Scala course on Coursera, but whether this makes sense will depend on when it is next running. For those who prefer real books, the book Programming in Scala is the standard reference, and I’ve also found Functional programming in Scala to be useful (free text of the first edition of the former and a draft of the latter can be found on-line).

### REPL Script

Below is a copy of the complete REPL script, for reference.

// start with non-Breeze stuff

val a = 5
a
a = 6
a

var b = 7
b
b = 8
b

val c = List(3,4,5,6)
c(1)
c.sum
c.length
c.product
c.foldLeft(0)((x,y) => x+y)
c.foldLeft(0)(_+_)
c.foldLeft(1)(_*_)

val d = Vector(2,3,4,5,6,7,8,9)
d
d.slice(3,6)
val e = d.updated(3,0)
d
e

val f=(1 to 10).toList
f
f.map(x => x*x)
f map {x => x*x}
f filter {_ > 4}

// introduce breeze through random distributions
// https://github.com/scalanlp/breeze/wiki/Quickstart

import breeze.stats.distributions._
val poi = Poisson(3.0)
poi.draw
poi.draw
val x = poi.sample(10)
x
x.sum
x.length
x.sum.toDouble/x.length
poi.probabilityOf(2)
x map {x => poi.probabilityOf(x)}
x map {poi.probabilityOf(_)}

val gau=Gaussian(0.0,1.0)
gau.draw
gau.draw
val y=gau.sample(20)
y
y.sum/y.length
y map {gau.logPdf(_)}

Gamma(2.0,3.0).sample(5)

import breeze.stats.DescriptiveStats._
mean(y)
variance(y)
meanAndVariance(y)

// move on to linear algebra
// https://github.com/scalanlp/breeze/wiki/Breeze-Linear-Algebra

import breeze.linalg._
val v=DenseVector(y.toArray)
v(1) = 0
v
v(1 to 3) := 1.0
v
v(1 to 3) := DenseVector(1.0,1.5,2.0)
v
v :> 0.0
(v :> 0.0).toArray

val m = new DenseMatrix(5,4,linspace(1.0,20.0,20).toArray)
m
m.rows
m.cols
m(::,1)
m(1,::)
m(1,::) := linspace(1.0,2.0,4)
m

val n = m.t
n
val o = m*n
o
val p = n*m
p

// regression and QR solution

val X = new DenseMatrix(1000,5,gau.sample(5000).toArray)
val b0 = linspace(1.0,2.0,5)
val y0 = X * b0
val y = y0 + DenseVector(gau.sample(1000).toArray)
val b = X \ y  // defaults to a QR-solve of the least squares problem

// a simple function example

def mean(arr: Array[Int]): Double = {
arr.sum.toDouble/arr.length
}

mean(Array(3,1,4,5))


## Metropolis Hastings MCMC when the proposal and target have differing support

### Introduction

Very often it is desirable to use Metropolis Hastings MCMC for a target distribution which does not have full support (for example, it may correspond to a non-negative random variable), using a proposal distribution which does (for example, a Gaussian random walk proposal). This isn’t a problem at all, but on more than one occasion now I have come across students getting this wrong, so I thought it might be useful to have a brief post on how to do it right, see what people sometimes get wrong, and why, and then think about correcting the wrong method in order to make it right…

### A simple example

For this post we will consider a simple $Ga(2,1)$ target distribution, with density

$\pi(x) = xe^{-x},\quad x\geq 0.$

Of course this is a very simple distribution, and there are many straightforward ways to simulate it directly, but for this post we will use a random walk Metropolis-Hastings (MH) scheme with standard Gaussian innovations. So, if the current state of the chain is $x$, a proposed new value $x^\star$ will be generated from

$f(x^\star|x) = \phi(x^\star-x),$

where $\phi(\cdot)$ is the standard normal density. This proposed new value is accepted with probability $\min\{1,A\}$, where

$\displaystyle A = \frac{\pi(x^\star)}{\pi(x)} \frac{f(x|x^\star)}{f(x^\star|x)} = \frac{\pi(x^\star)}{\pi(x)} \frac{\phi(x-x^\star)}{\phi(x^\star-x)} = \frac{\pi(x^\star)}{\pi(x)} ,$

since the standard normal density is symmetric.

#### Correct implementation

We can easily implement this using R as follows:

met1=function(iters)
{
xvec=numeric(iters)
x=1
for (i in 1:iters) {
xs=x+rnorm(1)
A=dgamma(xs,2,1)/dgamma(x,2,1)
if (runif(1)<A)
x=xs
xvec[i]=x
}
return(xvec)
}


We can run it, plot the results and check it against the true target with the following commands.

iters=1000000
out=met1(iters)
hist(out,100,freq=FALSE,main="met1")


If you have a slow computer, you may prefer to use iters=100000. The above code uses R’s built-in gamma density. Alternatively, we can hard-code the density as follows.

met2=function(iters)
{
xvec=numeric(iters)
x=1
for (i in 1:iters) {
xs=x+rnorm(1)
A=xs*exp(-xs)/(x*exp(-x))
if (runif(1)<A)
x=xs
xvec[i]=x
}
return(xvec)
}


We can run this code using the following commands, to verify that it does work as expected.

out=met2(iters)
hist(out,100,freq=FALSE,main="met2")


However, there is a potential problem with the above code that we have got away with in this instance, which often catches people out. We have hard-coded the density for $x>0$ without checking the sign of $x$. Here we get away with it as a negative proposal will lead to a negative acceptance ratio that we will reject straight away. This is not always the case (consider, for example, a $Ga(3,1)$ distribution). So really we should check the sign of $x^\star$ and reject immediately if is not within the support of the target.

Although this problem often catches people out, it tends not to be a big issue in practice, as it typically leads to an obviously incorrect sampler, or a sampler which crashes, and is relatively simple to debug and fix.

#### An incorrect sampler

The problem I want to focus on here is more subtle, but closely related. It is clear that any $x^\star<0$ should be rejected. With the above code, such values are indeed rejected, and the sampler advances to the next iteration. However, in more complex samplers, where an update like this might be one tiny part of a massive sampler with a very high-dimensional state space, it seems like a bit of a "waste" of a MH move to just propose a negative value, throw it away, and move on. Evidently, it seems tempting, therefore, to keep on sampling $x^\star$ values until a non-negative value is obtained, and then evaluate the acceptance ratio and decide whether or not to accept. We could code up this sampler as follows.

met3=function(iters)
{
xvec=numeric(iters)
x=1
for (i in 1:iters) {
repeat {
xs=x+rnorm(1)
if (xs>0)
break
}
A=xs*exp(-xs)/(x*exp(-x))
if (runif(1)<A)
x=xs
xvec[i]=x
}
return(xvec)
}


As reasonable as this idea may at first seem, it does not lead to a sampler having the desired target, as can be verified using the following commands.

out=met3(iters)
hist(out,100,freq=FALSE,main="met3")


So, this sampler seems to be sampling something close to the desired target, but not the same. This raises a couple of questions. First and most important, can we fix this sampler so that it does sample the correct target (yes), and second, can we figure out what target density the incorrect sampler is actually sampling (again, yes)? Let’s start with the issue of how to fix the sampler, as this will also help us to understand what the incorrect sampler is doing.

#### Fixing the truncated sampler

By repeatedly sampling from the proposal until we obtain a non-negative value, we are actually implementing a rejection sampler for sampling from the proposal distribution truncated at zero. This is a perfectly reasonable proposal distribution, so we can use it provided that we use the correct MH acceptance ratio. Now, the truncated density has the same density as the untruncated density, apart from the differing support and a normalising constant. Indeed, this may be why people often assume this method will work, because normalising constants often don’t matter in MH schemes. However, the normalising constant only doesn’t matter if it is independent of the state, and here it is not… Explicitly, we have

$f(x^\star|x) \propto \phi(x^\star-x),\quad x^\star>0.$

Including the normalising constant we have

$\displaystyle f(x^\star|x) = \frac{\phi(x^\star-x)}{\Phi(x)},\quad x^\star>0,$

where $\Phi(x)$ is the standard normal CDF. Consequently, the correct acceptance ratio to use with this proposal is

$\displaystyle A = \frac{\pi(x^\star)}{\pi(x)} \frac{\phi(x-x^\star)}{\phi(x^\star-x)}\frac{\Phi(x)}{\Phi(x^\star)} = \frac{\pi(x^\star)}{\pi(x)}\frac{\Phi(x)}{\Phi(x^\star)},$

where we see that the normalising constants do not cancel out. We can modify the previous sampler to use the correct acceptance ratio as follows.

met4=function(iters)
{
xvec=numeric(iters)
x=1
for (i in 1:iters) {
repeat {
xs=x+rnorm(1)
if (xs>0)
break
}
A=xs*exp(-xs)/(x*exp(-x))
A=A*pnorm(x)/pnorm(xs)
if (runif(1)<A)
x=xs
xvec[i]=x
}
return(xvec)
}


We can verify that this sampler gives leads to the correct target with the following commands.

out=met4(iters)
hist(out,100,freq=FALSE,main="met4")


So, truncating the proposal at zero is fine, provided that you modify the acceptance ratio accordingly.

#### What does the incorrect sampler target?

Now that we understand why the naive truncated sampler was wrong and how to fix it, we can, out of curiosity, wonder what distribution that sampler actually targets. Now we understand what proposal we are actually using, we can re-write the acceptance ratio as

$\displaystyle A = \frac{\pi(x^\star)\Phi(x^\star)}{\pi(x)\Phi(x)}\frac{\frac{\phi(x-x^\star)}{\Phi(x^\star)}}{\frac{\phi(x^\star-x)}{\Phi(x)}},$

from which it is clear that the actual target of this chain is

$\tilde\pi(x) \propto \pi(x)\Phi(x),$

or

$\tilde\pi(x)\propto xe^{-x}\Phi(x),\quad x\geq 0.$

The constant of proportionality is not immediately obvious, but is tractable, and turns out to be a nice undergraduate exercise in integration by parts, leading to

$\displaystyle \tilde\pi(x) = \frac{2\sqrt{2\pi}}{2+\sqrt{2\pi}}xe^{-x}\Phi(x),\quad x\geq 0.$

We can verify this using the following commands.

out=met3(iters)
hist(out,100,freq=FALSE,main="met3")


Now we know the actual target of the incorrect sampler, we can compare it with the correct target as follows.

curve(dgamma(x,2,1),0,10,col=2,lwd=2,main="Densities")


So we see that the distributions are different, but not so different that one would immediate suspect an error on the basis of a sample of output. This makes it a difficult bug to track down.

### Summary

There is no problem in principle using a proposal with full support for a target with limited support in MH algorithms. However, it is important to check whether a proposed value is within the support of the target and reject the proposed move if it is not. If you are concerned that such a scheme might be inefficient, it is possible to use a truncated proposal provided that you modify the MH acceptance ratio to include the relevant normalisation constants. If you don’t modify the acceptance probability, you will get a sampler which targets the wrong distribution, but it will often be quite similar to the correct target, making it a difficult bug to spot and track down.

## Gibbs sampling a Gaussian Markov random field (GMRF) using Java

### Introduction

As I’ve explained previously, I’m gradually coming around to the idea of using Java for the development of MCMC codes, and I’m starting to build up a collection of simple examples for getting started. One of the advantages of Java is that it includes a standard cross-platform GUI library. This might not seem like the most important requirement for MCMC, but can actually be very handy in several contexts, particularly for monitoring convergence. One obvious context is that of image analysis, where it can be useful to monitor image reconstructions as the sampler is running. In this post I’ll show three very small simple Java classes which together provide an application for running a Gibbs sampler on a (non-stationary, unconditioned) Gaussian Markov random field.

The model is essentially that the distribution of each pixel is defined intrinsically, dependent only on its four nearest neighbours on a rectangular lattice, and here the distribution will be Gaussian with mean equal to the sample mean of the four neighbouring pixels and a fixed (unit) variance. On its own this isn’t especially useful, but it is a key component of many image analysis applications.

### A simple Java implementation

We will start with the class MrfApp containing the main method for the application:

MrfApp.java

import java.io.*;
class MrfApp {
public static void main(String[] arg)
throws IOException
{
Mrf mrf;
System.out.println("started program");
mrf=new Mrf(800,600);
System.out.println("created mrf object");
mrf.update(1000);
mrf.saveImage("mrf.png");
System.out.println("finished program");
mrf.frame.dispose();
System.exit(0);
}
}


Hopefully this code is largely self-explanatory, but relies on a class called Mrf which contains all of the logic associated with the GMRF.

Mrf.java

import java.io.*;
import java.util.*;
import java.awt.image.*;
import javax.swing.*;
import javax.imageio.ImageIO;

class Mrf
{
int n,m;
double[][] cells;
Random rng;
BufferedImage bi;
WritableRaster wr;
JFrame frame;
ImagePanel ip;

Mrf(int n_arg,int m_arg)
{
n=n_arg;
m=m_arg;
cells=new double[n][m];
rng=new Random();
bi=new BufferedImage(n,m,BufferedImage.TYPE_BYTE_GRAY);
wr=bi.getRaster();
frame=new JFrame("MRF");
frame.setSize(n,m);
frame.setVisible(true);
}

public void saveImage(String filename)
throws IOException
{
ImageIO.write(bi,"PNG",new File(filename));
}

public void updateImage()
{
double mx=-1e+100;
double mn=1e+100;
for (int i=0;i<n;i++) {
for (int j=0;j<m;j++) {
if (cells[i][j]>mx) { mx=cells[i][j]; }
if (cells[i][j]<mn) { mn=cells[i][j]; }
}
}
for (int i=0;i<n;i++) {
for (int j=0;j<m;j++) {
int level=(int) (255*(cells[i][j]-mn)/(mx-mn));
wr.setSample(i,j,0,level);
}
}
frame.repaint();
}

public void update(int num)
{
for (int i=0;i<num;i++) {
updateOnce();
}
}

private void updateOnce()
{
double mean;
for (int i=0;i<n;i++) {
for (int j=0;j<m;j++) {
if (i==0) {
if (j==0) {
mean=0.5*(cells[0][1]+cells[1][0]);
}
else if (j==m-1) {
mean=0.5*(cells[0][j-1]+cells[1][j]);
}
else {
mean=(cells[0][j-1]+cells[0][j+1]+cells[1][j])/3.0;
}
}
else if (i==n-1) {
if (j==0) {
mean=0.5*(cells[i][1]+cells[i-1][0]);
}
else if (j==m-1) {
mean=0.5*(cells[i][j-1]+cells[i-1][j]);
}
else {
mean=(cells[i][j-1]+cells[i][j+1]+cells[i-1][j])/3.0;
}
}
else if (j==0) {
mean=(cells[i-1][0]+cells[i+1][0]+cells[i][1])/3.0;
}
else if (j==m-1) {
mean=(cells[i-1][j]+cells[i+1][j]+cells[i][j-1])/3.0;
}
else {
mean=0.25*(cells[i][j-1]+cells[i][j+1]+cells[i+1][j]
+cells[i-1][j]);
}
cells[i][j]=mean+rng.nextGaussian();
}
}
updateImage();
}

}


This class contains a few simple methods for creating and updating the GMRF, and also for maintaining and updating a graphical view of the GMRF as the sampler is running. The Gibbs sampler update itself is encoded in the final method, updateOnce, and most of the code is to deal with edge and corner cases (in the literal rather than metaphorical sense!). This is called repeatedly by the method update for the required number of iterations. At the end of each iteration, the method updateOnce triggers updateImage which updates the image associated GMRF. The GMRF itself is stored in a 2-dimensional array of doubles, but an image pixel typically consists of a grayscale value represented by an unsigned byte – that is, an integer from 0 to 255. So updateImage scans through the GMRF to find the maximum and minimum values and then maps the GMRF values onto the 0 to 255 scale. The image itself is set up by the constructor method, Mrf. This class relies on an additional class called ImagePanel, which is a simple GUI panel for displaying images:

ImagePanel.java

import java.awt.*;
import java.awt.image.*;
import javax.swing.*;

class ImagePanel extends JPanel {

protected BufferedImage image;

public ImagePanel(BufferedImage image) {
this.image=image;
Dimension dim=new Dimension(image.getWidth(),image.getHeight());
setPreferredSize(dim);
setMinimumSize(dim);
revalidate();
repaint();
}

public void paintComponent(Graphics g) {
g.drawImage(image,0,0,this);
}

}


This completes the application, which can be compiled and run from the command line with

javac *.java
java MrfApp


This should compile the code and run the application, which will show a GMRF updating for 1000 iterations. When the 1000 iterations are complete, the application writes the final image to a file and then quits.

### Using Parallel COLT

The above classes are very convenient, as they should work with any standard Java installation. However, in more complex scenarios, it is likely that a math library such as Parallel COLT will be required. In this case it will make sense to make use of features in the COLT library, such as random number generators and 2d matrix objects. We can adapt the above application by replacing the MrfApp and Mrf classes with the following versions (the ImagePanel class remains unchanged):

MrfApp.java

import java.io.*;
import cern.jet.random.tdouble.engine.*;

class MrfApp {

public static void main(String[] arg)
throws IOException
{
Mrf mrf;
int seed=1234;
System.out.println("started program");
DoubleRandomEngine rngEngine=new DoubleMersenneTwister(seed);
mrf=new Mrf(800,600,rngEngine);
System.out.println("created mrf object");
mrf.update(1000);
mrf.saveImage("mrf.png");
System.out.println("finished program");
mrf.frame.dispose();
System.exit(0);
}

}


Mrf.java

import java.io.*;
import java.util.*;
import java.awt.image.*;
import javax.swing.*;
import javax.imageio.ImageIO;
import cern.jet.random.tdouble.*;
import cern.jet.random.tdouble.engine.*;
import cern.colt.matrix.tdouble.impl.*;

class Mrf
{
int n,m;
DenseDoubleMatrix2D cells;
DoubleRandomEngine rng;
Normal rngN;
BufferedImage bi;
WritableRaster wr;
JFrame frame;
ImagePanel ip;

Mrf(int n_arg,int m_arg,DoubleRandomEngine rng)
{
n=n_arg;
m=m_arg;
cells=new DenseDoubleMatrix2D(n,m);
this.rng=rng;
rngN=new Normal(0.0,1.0,rng);
bi=new BufferedImage(n,m,BufferedImage.TYPE_BYTE_GRAY);
wr=bi.getRaster();
frame=new JFrame("MRF");
frame.setSize(n,m);
frame.setVisible(true);
}

public void saveImage(String filename)
throws IOException
{
ImageIO.write(bi,"PNG",new File(filename));
}

public void updateImage()
{
double mx=-1e+100;
double mn=1e+100;
for (int i=0;i<n;i++) {
for (int j=0;j<m;j++) {
if (cells.getQuick(i,j)>mx) { mx=cells.getQuick(i,j); }
if (cells.getQuick(i,j)<mn) { mn=cells.getQuick(i,j); }
}
}
for (int i=0;i<n;i++) {
for (int j=0;j<m;j++) {
int level=(int) (255*(cells.getQuick(i,j)-mn)/(mx-mn));
wr.setSample(i,j,0,level);
}
}
frame.repaint();
}

public void update(int num)
{
for (int i=0;i<num;i++) {
updateOnce();
}
}

private void updateOnce()
{
double mean;
for (int i=0;i<n;i++) {
for (int j=0;j<m;j++) {
if (i==0) {
if (j==0) {
mean=0.5*(cells.getQuick(0,1)+cells.getQuick(1,0));
}
else if (j==m-1) {
mean=0.5*(cells.getQuick(0,j-1)+cells.getQuick(1,j));
}
else {
mean=(cells.getQuick(0,j-1)+cells.getQuick(0,j+1)+cells.getQuick(1,j))/3.0;
}
}
else if (i==n-1) {
if (j==0) {
mean=0.5*(cells.getQuick(i,1)+cells.getQuick(i-1,0));
}
else if (j==m-1) {
mean=0.5*(cells.getQuick(i,j-1)+cells.getQuick(i-1,j));
}
else {
mean=(cells.getQuick(i,j-1)+cells.getQuick(i,j+1)+cells.getQuick(i-1,j))/3.0;
}
}
else if (j==0) {
mean=(cells.getQuick(i-1,0)+cells.getQuick(i+1,0)+cells.getQuick(i,1))/3.0;
}
else if (j==m-1) {
mean=(cells.getQuick(i-1,j)+cells.getQuick(i+1,j)+cells.getQuick(i,j-1))/3.0;
}
else {
mean=0.25*(cells.getQuick(i,j-1)+cells.getQuick(i,j+1)+cells.getQuick(i+1,j)
+cells.getQuick(i-1,j));
}
cells.setQuick(i,j,mean+rngN.nextDouble());
}
}
updateImage();
}

}


Again, the code should be reasonably self explanatory, and will compile and run in the same way provided that Parallel COLT is installed and in your classpath. This version runs approximately twice as fast as the previous version on all of the machines I’ve tried it on.

### Reference

I have found the following book very useful for understanding how to work with images in Java:

Hunt, K.A. (2010) The Art of Image Processing with Java, A K Peters/CRC Press.

## Gibbs sampler in various languages (revisited)

### Introduction

Regular readers of this blog will know that in April 2010 I published a short post showing how a trivial bivariate Gibbs sampler could be implemented in the four languages that I use most often these days (R, python, C, Java), and I discussed relative timings, and how one might start to think about trading off development time against execution time for more complex MCMC algorithms. I actually wrote the post very quickly one night while I was stuck in a hotel room in Seattle – I didn’t give much thought to it, and the main purpose was to provide simple illustrative examples of simple Monte Carlo codes using non-uniform random number generators in the different languages, as a starting point for someone thinking of switching languages (say, from R to Java or C, for efficiency reasons). It wasn’t meant to be very deep or provocative, or to start any language wars. Suffice to say that this post has had many more hits than all of my other posts combined, is still my most popular post, and still attracts comments and spawns other posts to this day. Several people have requested that I re-do the post more carefully, to include actual timings, and to include a few additional optimisations. Hence this post. For reference, the original post is here. A post about it from the python community is here, and a recent post about using Rcpp and inlined C++ code to speed up the R version is here.

### The sampler

So, the basic idea was to construct a Gibbs sampler for the bivariate distribution

$f(x,y) = kx^2\exp\{-xy^2-y^2+2y-4x\},\qquad x>0,y\in\Bbb{R}$

with unknown normalising constant $k>0$ ensuring that the density integrates to one. Unfortunately, in the original post I dropped a factor of 2 constructing one of the full conditionals, which meant that none of the samplers actually had exactly the right target distribution (thanks to Sanjog Misra for bringing this to my attention). So actually, the correct full conditionals are

$\displaystyle x|y \sim Ga(3,y^2+4)$

$\displaystyle y|x \sim N\left(\frac{1}{1+x},\frac{1}{2(1+x)}\right)$

Note the factor of two in the variance of the full conditional for $y$. Given the full conditionals, it is simple to alternately sample from them to construct a Gibbs sampler for the target distribution. We will run a Gibbs sampler with a thin of 1000 and obtain a final sample of 50000.

### Implementations

#### R

Let’s start with R again. The slightly modified version of the code from the old post is given below

gibbs=function(N,thin)
{
mat=matrix(0,ncol=3,nrow=N)
mat[,1]=1:N
x=0
y=0
for (i in 1:N) {
for (j in 1:thin) {
x=rgamma(1,3,y*y+4)
y=rnorm(1,1/(x+1),1/sqrt(2*x+2))
}
mat[i,2:3]=c(x,y)
}
mat=data.frame(mat)
names(mat)=c("Iter","x","y")
mat
}

writegibbs=function(N=50000,thin=1000)
{
mat=gibbs(N,thin)
write.table(mat,"data.tab",row.names=FALSE)
}

writegibbs()


I’ve just corrected the full conditional, and I’ve increased the sample size and thinning to 50k and 1k, respectively, to allow for more accurate timings (of the faster languages). This code can be run from the (Linux) command line with something like:

time Rscript gibbs.R


I discuss timings in detail towards the end of the post, but this code is slow, taking over 7 minutes on my (very fast) laptop. Now, the above code is typical of the way code is often structured in R – doing as much as possible in memory, and writing to disk only if necessary. However, this can be a bad idea with large MCMC codes, and is less natural in other languages, anyway, so below is an alternative version of the code, written in more of a scripting language style.

gibbs=function(N,thin)
{
x=0
y=0
cat(paste("Iter","x","y","\n"))
for (i in 1:N) {
for (j in 1:thin) {
x=rgamma(1,3,y*y+4)
y=rnorm(1,1/(x+1),1/sqrt(2*x+2))
}
cat(paste(i,x,y,"\n"))
}
}

gibbs(50000,1000)


This can be run with a command like

time Rscript gibbs-script.R > data.tab


This code actually turns out to be a slightly slower than the in-memory version for this simple example, but for larger problems I would not expect that to be the case. I always analyse MCMC output using R, whatever language I use for running the algorithm, so for completeness, here is a bit of code to load up the data file, do some plots and compute summary statistics.

fun=function(x,y)
{
x*x*exp(-x*y*y-y*y+2*y-4*x)
}

compare<-function(file="data.tab")
{
op=par(mfrow=c(2,1))
x=seq(0,3,0.1)
y=seq(-1,3,0.1)
z=outer(x,y,fun)
contour(x,y,z,main="Contours of actual (unnormalised) distribution")
require(KernSmooth)
fit=bkde2D(as.matrix(mat[,2:3]),c(0.1,0.1))
contour(fit$x1,fit$x2,fit$fhat,main="Contours of empirical distribution") par(op) print(summary(mat[,2:3])) } compare()  #### Python Another language I use a lot is Python. I don’t want to start any language wars, but I personally find python to be a better designed language than R, and generally much nicer for the development of large programs. A python script for this problem is given below import random,math def gibbs(N=50000,thin=1000): x=0 y=0 print "Iter x y" for i in range(N): for j in range(thin): x=random.gammavariate(3,1.0/(y*y+4)) y=random.gauss(1.0/(x+1),1.0/math.sqrt(2*x+2)) print i,x,y gibbs()  It can be run with a command like time python gibbs.py > data.tab  This code turns out to be noticeably faster than the R versions, taking around 4 minutes on my laptop (again, detailed timing information below). However, there is a project for python known as the PyPy project, which is concerned with compiling regular python code to very fast byte-code, giving significant speed-ups on certain problems. For this post, I downloaded and install version 1.5 of the 64-bit linux version of PyPy. Once installed, I can run the above code with the command time pypy gibbs.py > data.tab  To my astonishment, this “just worked”, and gave very impressive speed-up over regular python, running in around 30 seconds. This actually makes python a much more realistic prospect for the development of MCMC codes than I imagined. However, I need to understand the limitations of PyPy better – for example, why doesn’t everyone always use PyPy for everything?! It certainly seems to make python look like a very good option for prototyping MCMC codes. #### C Traditionally, I have mainly written MCMC codes in C, using the GSL. C is a fast, efficient, statically typed language, which compiles to native code. In many ways it represents the “gold standard” for speed. So, here is the C code for this problem. #include <stdio.h> #include <math.h> #include <stdlib.h> #include <gsl/gsl_rng.h> #include <gsl/gsl_randist.h> void main() { int N=50000; int thin=1000; int i,j; gsl_rng *r = gsl_rng_alloc(gsl_rng_mt19937); double x=0; double y=0; printf("Iter x y\n"); for (i=0;i<N;i++) { for (j=0;j<thin;j++) { x=gsl_ran_gamma(r,3.0,1.0/(y*y+4)); y=1.0/(x+1)+gsl_ran_gaussian(r,1.0/sqrt(2*x+2)); } printf("%d %f %f\n",i,x,y); } }  It can be compiled and run with command like gcc -O4 gibbs.c -lgsl -lgslcblas -lm -o gibbs time ./gibbs > datac.tab  This runs faster than anything else I consider in this post, taking around 8 seconds. #### Java I’ve recently been experimenting with Java for MCMC codes, in conjunction with Parallel COLT. Java is a statically typed object-oriented (O-O) language, but is usually compiled to byte-code to run on a virtual machine (known as the JVM). Java compilers and virtual machines are very fast these days, giving “close to C” performance, but with a nicer programming language, and advantages associated with virtual machines. Portability is a huge advantage of Java. For example, I can easily get my Java code to run on almost any University Condor pool, on both Windows and Linux clusters – they all have a recent JVM installed, and I can easily bundle any required libraries with my code. Suffice to say that getting GSL/C code to run on generic Condor pools is typically much less straightforward. Here is the Java code: import java.util.*; import cern.jet.random.tdouble.*; import cern.jet.random.tdouble.engine.*; class Gibbs { public static void main(String[] arg) { int N=50000; int thin=1000; DoubleRandomEngine rngEngine=new DoubleMersenneTwister(new Date()); Normal rngN=new Normal(0.0,1.0,rngEngine); Gamma rngG=new Gamma(1.0,1.0,rngEngine); double x=0; double y=0; System.out.println("Iter x y"); for (int i=0;i<N;i++) { for (int j=0;j<thin;j++) { x=rngG.nextDouble(3.0,y*y+4); y=rngN.nextDouble(1.0/(x+1),1.0/Math.sqrt(2*x+2)); } System.out.println(i+" "+x+" "+y); } } }  It can be compiled and run with javac Gibbs.java time java Gibbs > data.tab  This takes around 11.6s seconds on my laptop. This is well within a factor of 2 of the C version, and around 3 times faster than even the PyPy python version. It is around 40 times faster than R. Java looks like a good choice for implementing MCMC codes that would be messy to implement in C, or that need to run places where it would be fiddly to get native codes to run. #### Scala Another language I’ve been taking some interest in recently is Scala. Scala is a statically typed O-O/functional language which compiles to byte-code that runs on the JVM. Since it uses Java technology, it can seamlessly integrate with Java libraries, and can run anywhere that Java code can run. It is a much nicer language to program in than Java, and feels more like a dynamic language such as python. In fact, it is almost as nice to program in as python (and in some ways nicer), and will run in a lot more places than PyPy python code. Here is the scala code (which calls Parallel COLT for random number generation): object GibbsSc { import cern.jet.random.tdouble.engine.DoubleMersenneTwister import cern.jet.random.tdouble.Normal import cern.jet.random.tdouble.Gamma import Math.sqrt import java.util.Date def main(args: Array[String]) { val N=50000 val thin=1000 val rngEngine=new DoubleMersenneTwister(new Date) val rngN=new Normal(0.0,1.0,rngEngine) val rngG=new Gamma(1.0,1.0,rngEngine) var x=0.0 var y=0.0 println("Iter x y") for (i <- 0 until N) { for (j <- 0 until thin) { x=rngG.nextDouble(3.0,y*y+4) y=rngN.nextDouble(1.0/(x+1),1.0/sqrt(2*x+2)) } println(i+" "+x+" "+y) } } }  It can be compiled and run with scalac GibbsSc.scala time scala GibbsSc > data.tab  This code takes around 11.8s on my laptop – almost as fast as the Java code! So, on the basis of this very simple and superficial example, it looks like scala may offer the best of all worlds – a nice, elegant, terse programming language, functional and O-O programming styles, the safety of static typing, the ability to call on Java libraries, great speed and efficiency, and the portability of Java! Very interesting. #### Groovy James Durbin has kindly sent me a Groovy version of the code, which he has also discussed in his own blog post. Groovy is a dynamic O-O language for the JVM, which, like Scala, can integrate nicely with Java applications. It isn’t a language I have examined closely, but it seems quite nice. The code is given below: import cern.jet.random.tdouble.engine.*; import cern.jet.random.tdouble.*; N=50000; thin=1000; rngEngine= new DoubleMersenneTwister(new Date()); rngN=new Normal(0.0,1.0,rngEngine); rngG=new Gamma(1.0,1.0,rngEngine); x=0.0; y=0.0; println("Iter x y"); for(i in 1..N){ for(j in 1..thin){ x=rngG.nextDouble(3.0,y*y+4); y=rngN.nextDouble(1.0/(x+1),1.0/Math.sqrt(2*x+2)); } println("$i $x$y");
}


It can be run with a command like:

time groovy Gibbs.gv > data.tab


Again, rather amazingly, this code runs in around 35 seconds – very similar to the speed of PyPy. This makes Groovy also seem like a potential very attractive environment for prototyping MCMC codes, especially if I’m thinking about ultimately porting to Java.

### Timings

The laptop I’m running everything on is a Dell Precision M4500 with an Intel i7 Quad core (x940@2.13Ghz) CPU, running the 64-bit version of Ubuntu 11.04. I’m running stuff from the Ubuntu (Unity) desktop, and running several terminals and applications, but the machine is not loaded at the time each job runs. I’m running each job 3 times and taking the arithmetic mean real elapsed time. All timings are in seconds.

 R 2.12.1 (in memory) 435 R 2.12.1 (script) 450.2 Python 2.7.1+ 233.5 PyPy 1.5 32.2 Groovy 1.7.4 35.4 Java 1.6.0 11.6 Scala 2.7.7 11.8 C (gcc 4.5.2) 8.1

If we look at speed-up relative to the R code (in-memory version), we get:

 R (in memory) 1 R (script) 0.97 Python 1.86 PyPy 13.51 Groovy 12.3 Java 37.5 Scala 36.86 C 53.7

Alternatively, we can look at slow-down relative to the C version, to get:

 R (in memory) 53.7 R (script) 55.6 Python 28.8 PyPy 4 Groovy 4.4 Java 1.4 Scala 1.5 C 1

### Discussion

The findings here are generally consistent with those of the old post, but consideration of PyPy, Groovy and Scala does throw up some new issues. I was pretty stunned by PyPy. First, I didn’t expect that it would “just work” – I thought I would either have to spend time messing around with my configuration settings, or possibly even have to modify my code slightly. Nope. Running python code with pypy appears to be more than 10 times faster than R, and only 4 times slower than C. I find it quite amazing that it is possible to get python code to run just 4 times slower than C, and if that is indicative of more substantial examples, it really does open up the possibility of using python for “real” problems, although library coverage is currently a problem. It certainly solves my “prototyping problem”. I often like to prototype algorithms in very high level dynamic languages like R and python before porting to a more efficient language. However, I have found that this doesn’t always work well with complex MCMC codes, as they just run too slowly in the dynamic languages to develop, test and debug conveniently. But it looks now as though PyPy should be fast enough at least for prototyping purposes, and may even be fast enough for production code in some circumstances. But then again, exactly the same goes for Groovy, which runs on the JVM, and can access any existing Java library… I haven’t yet looked into Groovy in detail, but it appears that it could be a very nice language for prototyping algorithms that I intend to port to Java.

The results also confirm my previous findings that Java is now “fast enough” that one shouldn’t worry too much about the difference in speed between it and native code written in C (or C++). The Java language is much nicer than C or C++, and the JVM platform is very attractive in many situations. However, the Scala results were also very surprising for me. Scala is a really elegant language (certainly on a par with python), comes with all of the advantages of Java, and appears to be almost as fast as Java. I’m really struggling to come up with reasons not to use Scala for everything!

#### Speeding up R

MCMC codes are used by a range of different scientists for a range of different problems. However, they are very (most?) often used by Bayesian statisticians who use the algorithms to target a Bayesian posterior distribution. For various (good) reasons, many statisticians are heavily invested in R, like to use R as much as possible, and do as much as possible from within the R environment. These results show why R is not a good language in which to implement MCMC algorithms, so what is an R-dependent statistician supposed to do? One possibility would be to byte-code compile R code in an analogous way to python and pypy. The very latest versions of R support such functionality, but the post by Dirk Eddelbuettel suggests that the current version of cmpfun will only give a 40% speedup on this problem, which is still slower than regular python code. Short of a dramatic improvement in this technology, the only way forward seems to be to extend R using code from another language. It is easy to extend R using C, C++ and Java. I have shown in previous posts how to do this using Java and using C, and the recent post by Dirk shows how to extend using C++. Although interesting, this doesn’t really have much bearing on the current discussion. If you extend using Java you get Java-like speedups, and if you extend using C you get C-like speedups. However, in case people are interested, I intend to gather up these examples into one post and include detailed timing information in a subsequent post.

## Java math libraries and Monte Carlo simulation codes

### Java libraries for (non-uniform) random number simulation

Anyone writing serious Monte Carlo (and MCMC) codes relies on having a very good and fast (uniform) random number generator and associated functions for generation of non-uniform random quantities, such as Gaussian, Poisson, Gamma, etc. In a previous post I showed how to write a simple Gibbs sampler in four different languages. In C (and C++) random number generation is easy for most scientists, as the (excellent) GNU Scientific Library (GSL) provides exactly what most people need. But it wasn’t always that way… I remember the days before the GSL, when it was necessary to hunt around on the net for bits of C code to implement different algorithms. Worse, it was often necessary to hunt around for a bit of free FORTRAN code, and compile that with an F77 compiler and figure out how to call it from C. Even in the early Alpha days of the GSL, coverage was patchy, and the API changed often. Bad old days… But those days are long gone, and C programmers no longer have to worry about the problem of random variate generation – they can safely concentrate on developing their interesting new algorithm, and leave the rest to the GSL. Unfortunately for Java programmers, there isn’t yet anything quite comparable to the GSL in Java world.

I pretty much ignored Java until Java 5. Before then, the language was too limited, and the compilers and JVMs were too primitive to really take seriously for numerical work. But since the launch of Java 5 I’ve been starting to pay more interest. The language is now a perfectly reasonable O-O language, and the compilers and JVMs are pretty good. On a lot of benchmarks, Java is really quite comparable to C/C++, and Java is nicer to code, and has a lot of impressive associated technology. So if there was a math library comparable to the GSL, I’d be quite tempted to jump ship to the Java world and start writing all of my Monte Carlo codes in Java. But there isn’t. At least not yet.

When I first started to take Java seriously, the only good math library with good support for non-uniform random number generation was COLT. COLT was, and still is, pretty good. The code is generally well-written, and fast, and the documentation for it is reasonable. However, the structure of the library is very idiosyncratic, the coverage is a bit patchy, and there doesn’t ever seem to have been a proper development community behind it. It seems very much to have been a one-man project, which has long since stagnated. Unsurprisingly then, COLT has been forked. There is now a Parallel COLT project. This project is continuing the development of COLT, adding new features that were missing from COLT, and, as the name suggests, adding concurrency support. Parallel COLT is also good, and is the main library I currently use for random number generation in Java. However, it has obviously inherited all of the idiosyncrasies that COLT had, and still doesn’t seem to have a large and active development community associated with it. There is no doubt that it is an incredibly useful software library, but it still doesn’t really compare to the GSL.

I have watched the emergence of the Apache Commons Math project with great interest (not to be confused with Uncommons Math – another one-man project). I think this project probably has the greatest potential for providing the Java community with their own GSL equivalent. The Commons project has a lot of momentum, the Commons Math project seems to have an active development community, and the structure of the library is more intuitive than that of (Parallel) COLT. However, it is early days, and the library still has patchy coverage and is a bit rough around the edges. It reminds me a lot of the GSL back in its Alpha days. I’d not bothered to even download it until recently, as the random number generation component didn’t include the generation of gamma random quantities – an absolutely essential requirement for me. However, I noticed recently that the latest release (2.2) did include gamma generation, so I decided to download it and try it out. It works, but the generation of gamma random quantities is very slow (around 50 times slower than Parallel COLT). This isn’t a fundamental design flaw of the whole library – generating Gaussian random quantities is quite comparable with other libraries. It’s just that an inversion method has been used for gamma generation. All efficient gamma generators use a neat rejection scheme. In case anyone would like to investigate for themselves, here is a complete program for gamma generation designed to be linked against Parallel COLT:

import java.util.*;
import cern.jet.random.tdouble.*;
import cern.jet.random.tdouble.engine.*;

class GammaPC
{

public static void main(String[] arg)
{
DoubleRandomEngine rngEngine=new DoubleMersenneTwister();
Gamma rngG=new Gamma(1.0,1.0,rngEngine);
long N=10000;
double x=0.0;
for (int i=0;i<N;i++) {
for (int j=0;j<1000;j++) {
x=rngG.nextDouble(3.0,25.0);
}
System.out.println(x);
}
}

}


and here is a complete program designed to be linked against Commons Math:

import java.util.*;
import org.apache.commons.math.*;
import org.apache.commons.math.random.*;

class GammaACM
{

public static void main(String[] arg) throws MathException
{
RandomDataImpl rng=new RandomDataImpl();
long N=10000;
double x=0.0;
for (int i=0;i<N;i++) {
for (int j=0;j<1000;j++) {
x=rng.nextGamma(3.0,1.0/25.0);
}
System.out.println(x);
}
}

}


The two codes do the same thing (note that they parameterise the gamma distribution differently). Both programs work (they generate variates from the same, correct, distribution), and the Commons Math interface is slightly nicer, but the code is much slower to execute. I’m still optimistic that Commons Math will one day be Java’s GSL, but I’m not giving up on Parallel COLT (or C, for that matter!) just yet…

## Calling Java code from R

### Introduction

In the previous post I looked at some simple methods for calling C code from R using a simple Gibbs sampler as the motivating example. In this post we will look again at the same Gibbs sampler, but now implemented in Java, and look at a couple of options for calling that code from an R session.

### Stand-alone Java code

Below is some Java code for implementing the bivariate Gibbs sampler discussed previously. It relies on Parallel COLT, which must be installed and in the Java CLASSPATH in order to follow the examples.

import java.util.*;
import cern.jet.random.tdouble.*;
import cern.jet.random.tdouble.engine.*;

class Gibbs
{

public static void main(String[] arg)
{
if (arg.length != 3) {
System.err.println("Usage: java Gibbs <Iters> <Thin> <Seed>");
System.exit(1);
}
int N=Integer.parseInt(arg[0]);
int thin=Integer.parseInt(arg[1]);
int seed=Integer.parseInt(arg[2]);
DoubleRandomEngine rngEngine=new DoubleMersenneTwister(seed);
Normal rngN=new Normal(0.0,1.0,rngEngine);
Gamma rngG=new Gamma(1.0,1.0,rngEngine);
double x=0,y=0;
System.out.println("Iter x y");
for (int i=0;i<N;i++) {
for (int j=0;j<thin;j++) {
x=rngG.nextDouble(3.0,y*y+4);
y=rngN.nextDouble(1.0/(x+1),1.0/Math.sqrt(x+1));
}
System.out.println(i+" "+x+" "+y);
}
}

}


It can be compiled and run stand-alone from an OS shell with the following commands:

javac Gibbs.java
java Gibbs 10 1000 1


As discussed in the previous post, it is possible to call any command-line program from inside an R session using the system() command. A small wrapper function for conveniently running this code from within R can be written as follows.

gibbs<-function(N=10000,thin=500,
seed=trunc(runif(1)*1e6),
exec="Gibbs",
tmpfile=tempfile())
{
command=paste("java",exec,N,thin,seed,">",tmpfile)
system(command)
}


This can then be run from within an R session with a simple call to gibbs(). Note that a random seed is being generated within R to be passed to the Java code to be used to seed the COLT random number generator used within the Java code. As previously discussed, for many long running codes, this approach can be quite effective, and is clearly very simple. However, there is an overhead associated with the system() call, and also with writing output to disk and then reading it back again.

### Using rJava

It is possible to avoid the overheads associated with the above approach by directly calling the Java code from R and having the return values returned directly into the R session from memory. There isn’t really direct support for this within the core R language, but there are couple of different solutions provided by R packages. The simplest and most popular approach seems to be the rJava package. This package can be installed with a simple

install.packages("rJava")


This should “just work” on some OSs (eg. Windows), but may fail on other OSs if R is not aware of the local Java environment. If the installation fails, check the error message carefully for advice/instructions. On most Linux systems, the problem can be fixed by quitting R, then running the following command from the shell

sudo R CMD javareconf


before re-starting R and re-attempting the installation. rJava provides a mechanism for starting a JVM within the running R session, creating objects, calling methods and having method return values returned to R. It is actually much more flexible than the .C() function for C code discussed in the previous post.

In order to use this package for our example, we must first re-factor the code slightly in the following way.

import java.util.*;
import cern.jet.random.tdouble.*;
import cern.jet.random.tdouble.engine.*;

class GibbsR
{

public static void main(String[] arg)
{
if (arg.length != 3) {
System.err.println("Usage: java GibbsR <Iters> <Thin> <Seed>");
System.exit(1);
}
int N=Integer.parseInt(arg[0]);
int thin=Integer.parseInt(arg[1]);
int seed=Integer.parseInt(arg[2]);
double[][] mat=gibbs(N,thin,seed);
System.out.println("Iter x y");
for (int i=0;i<N;i++) {
System.out.println(""+i+" "+mat[0][i]+" "+mat[1][i]);
}
}

public static double[][] gibbs(int N,int thin,int seed)
{
DoubleRandomEngine rngEngine=new DoubleMersenneTwister(seed);
Normal rngN=new Normal(0.0,1.0,rngEngine);
Gamma rngG=new Gamma(1.0,1.0,rngEngine);
double x=0,y=0;
double[][] mat=new double[2][N];
for (int i=0;i<N;i++) {
for (int j=0;j<thin;j++) {
x=rngG.nextDouble(3.0,y*y+4);
y=rngN.nextDouble(1.0/(x+1),1.0/Math.sqrt(x+1));
}
mat[0][i]=x; mat[1][i]=y;
}
return mat;
}

}


This code can be compiled and run from the command-line just as the previous code could.

javac GibbsR.java
java GibbsR 10 1000 1


However, we have now separated out the code we want to be able to call from R into a static method called gibbs, which runs the Gibbs sampler and stores the result in a 2-dimensional array which is its return value. We can now see how to call this code from within a running R session. We first need to set up the R environment ready to call the code.

library(rJava)
.jinit()
obj=.jnew("GibbsR")


Line 1 loads the package, line 2 starts up the JVM, and line 3 creates a link to the the GibbsR class (in general this is used to create a new Java object of the given type, but here we are using static methods). Java methods are called on Java objects using .jcall(). We can write a simple R function to conveniently call the method as follows.

jgibbs<-function(N=10000,thin=500,seed=trunc(runif(1)*1e6))
{
result=.jcall(obj,"[[D","gibbs",as.integer(N),as.integer(thin),as.integer(seed))
mat=sapply(result,.jevalArray)
mat=cbind(1:N,mat)
colnames(mat)=c("Iter","x","y")
mat
}


This can now be called with a simple jgibbs(). The first line of the function body carries out the actual method call. The return type of the method must be explicitly declared – “[[D” means a 2-dimensional array of doubles, using JNI notation. Care must also be taken to coerce the method parameters into the correct type that the Java method expects to receive. .jcall() is generally quite good at unpacking basic Java types into corresponding R types. However, the two dimensional array is here returned as an R list consisting of one-dimensional Java array objects. The unpacking is completed using the subsequent call to jevalArray() using sapply(), before the resulting matrix is tidied up and returned to the R session.

We have looked at a couple of very simple methods for calling Java code from an R session. The rJava package is a very flexible mechanism for integrating Java code into R.

I haven’t found a lot of tutorial-level material on the web for the rJava package. However, the package itself has very good documentation associated with it. Start with the information on the rJava home page. From an R session with the rJava package loaded, help(package="rJava") lists the available functions, all of which have associated documentation. ?.jinit, ?.jnew, ?.jcall and ?.jevalArray provide further background and information on the example covered here.

After that, the source code of R packages which use rJava are a useful source of further inspiration – look at the reverse-depends list for rJava in CRAN. In particular, the helloJavaWorld package is a tutorial for how to include Java code in an R package (read the associated vignette).

## Getting started with parallel MCMC

### Introduction to parallel MCMC for Bayesian inference, using C, MPI, the GSL and SPRNG

#### Introduction

This post is aimed at people who already know how to code up Markov Chain Monte Carlo (MCMC) algorithms in C, but are interested in how to parallelise their code to run on multi-core machines and HPC clusters. I discussed different languages for coding MCMC algorithms in a previous post. The advantage of C is that it is fast, standard and has excellent scientific library support. Ultimately, people pursuing this route will be interested in running their code on large clusters of fast servers, but for the purposes of development and testing, this really isn’t necessary. A single dual-core laptop, or similar, is absolutely fine. I develop and test on a dual-core laptop running Ubuntu linux, so that is what I will assume for the rest of this post.

There are several possible environments for parallel computing, but I will focus on the Message-Passing Interface (MPI). This is a well-established standard for parallel computing, there are many implementations, and it is by far the most commonly used high performance computing (HPC) framework today. Even if you are ultimately interested in writing code for novel architectures such as GPUs, learning the basics of parallel computation using MPI will be time well spent.

#### MPI

The whole point of MPI is that it is a standard, so code written for one implementation should run fine with any other. There are many implementations. On Linux platforms, the most popular are OpenMPI, LAM, and MPICH. There are various pros and cons associated with each implementation, and if installing on a powerful HPC cluster, serious consideration should be given to which will be the most beneficial. For basic development and testing, however, it really doesn’t matter which is used. I use OpenMPI on my Ubuntu laptop, which can be installed with a simple:

sudo apt-get install openmpi-bin libopenmpi-dev


That’s it! You’re ready to go… You can test your installation with a simple “Hello world” program such as:

#include <stdio.h>
#include <mpi.h>

int main (int argc,char **argv)
{
int rank, size;
MPI_Init (&argc, &argv);
MPI_Comm_rank (MPI_COMM_WORLD, &rank);
MPI_Comm_size (MPI_COMM_WORLD, &size);
printf( "Hello world from process %d of %d\n", rank, size );
MPI_Finalize();
return 0;
}


You should be able to compile this with

mpicc -o helloworld helloworld.c


and run (on 2 processors) with

mpirun -np 2 helloworld


#### GSL

If you are writing non-trivial MCMC codes, you are almost certainly going to need to use a sophisticated math library and associated random number generation (RNG) routines. I typically use the GSL. On Ubuntu, the GSL can be installed with:

sudo apt-get install gsl-bin libgsl0-dev


A simple script to generate some non-uniform random numbers is given below.

#include <gsl/gsl_rng.h>
#include <gsl/gsl_randist.h>

int main(void)
{
int i; double z; gsl_rng *r;
r = gsl_rng_alloc(gsl_rng_mt19937);
gsl_rng_set(r,0);
for (i=0;i<10;i++) {
z = gsl_ran_gaussian(r,1.0);
printf("z(%d) = %f\n",i,z);
}
exit(EXIT_SUCCESS);
}


This can be compiled with a command like:

gcc gsl_ran_demo.c -o gsl_ran_demo -lgsl -lgslcblas


and run with

./gsl_ran_demo


#### SPRNG

When writing parallel Monte Carlo codes, it is important to be able to use independent streams of random numbers on each processor. Although it is tempting to “fudge” things by using a random number generator with a different seed on each processor, this does not guarantee independence of the streams, and an unfortunate choice of seeds could potentially lead to bad behaviour of your algorithm. The solution to this problem is to use a parallel random number generator (PRNG), designed specifically to give independent streams on different processors. Unfortunately the GSL does not have native support for such parallel random number generators, so an external library should be used. SPRNG 2.0 is a popular choice. SPRNG is designed so that it can be used with MPI, but also that it does not have to be. This is an issue, as the standard binary packages distributed with Ubuntu (libsprng2, libsprng2-dev) are compiled without MPI support. If you are going to be using SPRNG with MPI, things are simpler with MPI support, so it makes sense to download sprng2.0b.tar.gz from the SPRNG web site, and build it from source, carefully following the instructions for including MPI support. If you are not familiar with building libraries from source, you may need help from someone who is.

Once you have compiled SPRNG with MPI support, you can test it with the following code:

#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>
#define SIMPLE_SPRNG
#define USE_MPI
#include "sprng.h"

int main(int argc,char *argv[])
{
double rn; int i,k;
MPI_Init(&argc,&argv);
MPI_Comm_rank(MPI_COMM_WORLD,&k);
init_sprng(DEFAULT_RNG_TYPE,0,SPRNG_DEFAULT);
for (i=0;i<10;i++)
{
rn = sprng();
printf("Process %d, random number %d: %f\n", k, i+1, rn);
}
MPI_Finalize();
exit(EXIT_SUCCESS);
}


which can be compiled with a command like:

mpicc -I/usr/local/src/sprng2.0/include -L/usr/local/src/sprng2.0/lib -o sprng_demo sprng_demo.c -lsprng -lgmp


You will need to edit paths here to match your installation. If if builds, it can be run on 2 processors with a command like:

mpirun -np 2 sprng_demo


If it doesn’t build, there are many possible reasons. Check the error messages carefully. However, if the compilation fails at the linking stage with obscure messages about not being able to find certain SPRNG MPI functions, one possibility is that the SPRNG library has not been compiled with MPI support.

The problem with SPRNG is that it only provides a uniform random number generator. Of course we would really like to be able to use the SPRNG generator in conjunction with all of the sophisticated GSL routines for non-uniform random number generation. Many years ago I wrote a small piece of code to accomplish this, gsl-sprng.h. Download this and put it in your include path for the following example:

#include <mpi.h>
#include <gsl/gsl_rng.h>
#include "gsl-sprng.h"
#include <gsl/gsl_randist.h>

int main(int argc,char *argv[])
{
int i,k,po; gsl_rng *r;
MPI_Init(&argc,&argv);
MPI_Comm_rank(MPI_COMM_WORLD,&k);
r=gsl_rng_alloc(gsl_rng_sprng20);
for (i=0;i<10;i++)
{
po = gsl_ran_poisson(r,2.0);
printf("Process %d, random number %d: %d\n", k, i+1, po);
}
MPI_Finalize();
exit(EXIT_SUCCESS);
}


A new GSL RNG, gsl_rng_sprng20 is created, by including gsl-sprng.h immediately after gsl_rng.h. If you want to set a seed, do so in the usual GSL way, but make sure to set it to be the same on each processor. I have had several emails recently from people who claim that gsl-sprng.h “doesn’t work”. All I can say is that it still works for me! I suspect the problem is that people are attempting to use it with a version of SPRNG without MPI support. That won’t work… Check that the previous SPRNG example works, first.

I can compile and run the above code with

mpicc -I/usr/local/src/sprng2.0/include -L/usr/local/src/sprng2.0/lib -o gsl-sprng_demo gsl-sprng_demo.c -lsprng -lgmp -lgsl -lgslcblas
mpirun -np 2 gsl-sprng_demo


#### Parallel Monte Carlo

Now we have parallel random number streams, we can think about how to do parallel Monte Carlo simulations. Here is a simple example:

#include <math.h>
#include <mpi.h>
#include <gsl/gsl_rng.h>
#include "gsl-sprng.h"

int main(int argc,char *argv[])
{
int i,k,N; double u,ksum,Nsum; gsl_rng *r;
MPI_Init(&argc,&argv);
MPI_Comm_size(MPI_COMM_WORLD,&N);
MPI_Comm_rank(MPI_COMM_WORLD,&k);
r=gsl_rng_alloc(gsl_rng_sprng20);
for (i=0;i<10000;i++) {
u = gsl_rng_uniform(r);
ksum += exp(-u*u);
}
MPI_Reduce(&ksum,&Nsum,1,MPI_DOUBLE,MPI_SUM,0,MPI_COMM_WORLD);
if (k == 0) {
printf("Monte carlo estimate is %f\n", (Nsum/10000)/N );
}
MPI_Finalize();
exit(EXIT_SUCCESS);
}


which calculates a Monte Carlo estimate of the integral

$\displaystyle I=\int_0^1 \exp(-u^2)du$

using 10k variates on each available processor. The MPI command MPI_Reduce is used to summarise the values obtained independently in each process. I compile and run with

mpicc -I/usr/local/src/sprng2.0/include -L/usr/local/src/sprng2.0/lib -o monte-carlo monte-carlo.c -lsprng -lgmp -lgsl -lgslcblas
mpirun -np 2 monte-carlo


#### Parallel chains MCMC

Once parallel Monte Carlo has been mastered, it is time to move on to parallel MCMC. First it makes sense to understand how to run parallel MCMC chains in an MPI environment. I will illustrate this with a simple Metropolis-Hastings algorithm to sample a standard normal using uniform proposals, as discussed in a previous post. Here an independent chain is run on each processor, and the results are written into separate files.

#include <gsl/gsl_rng.h>
#include "gsl-sprng.h"
#include <gsl/gsl_randist.h>
#include <mpi.h>

int main(int argc,char *argv[])
{
int k,i,iters; double x,can,a,alpha; gsl_rng *r;
FILE *s; char filename[15];
MPI_Init(&argc,&argv);
MPI_Comm_rank(MPI_COMM_WORLD,&k);
if ((argc != 3)) {
if (k == 0)
fprintf(stderr,"Usage: %s <iters> <alpha>\n",argv[0]);
MPI_Finalize(); return(EXIT_FAILURE);
}
iters=atoi(argv[1]); alpha=atof(argv[2]);
r=gsl_rng_alloc(gsl_rng_sprng20);
sprintf(filename,"chain-%03d.tab",k);
s=fopen(filename,"w");
if (s==NULL) {
perror("Failed open");
MPI_Finalize(); return(EXIT_FAILURE);
}
x = gsl_ran_flat(r,-20,20);
fprintf(s,"Iter X\n");
for (i=0;i<iters;i++) {
can = x + gsl_ran_flat(r,-alpha,alpha);
a = gsl_ran_ugaussian_pdf(can) / gsl_ran_ugaussian_pdf(x);
if (gsl_rng_uniform(r) < a)
x = can;
fprintf(s,"%d %f\n",i,x);
}
fclose(s);
MPI_Finalize(); return(EXIT_SUCCESS);
}


I can compile and run this with the following commands

mpicc -I/usr/local/src/sprng2.0/include -L/usr/local/src/sprng2.0/lib -o mcmc mcmc.c -lsprng -lgmp -lgsl -lgslcblas
mpirun -np 2 mcmc 100000 1


#### Parallelising a single MCMC chain

The parallel chains approach turns out to be surprisingly effective in practice. Obviously the disadvantage of that approach is that “burn in” has to be repeated on every processor, which limits how much efficiency gain can be acheived by running across many processors. Consequently it is often desirable to try and parallelise a single MCMC chain. As MCMC algorithms are inherently sequential, parallelisation is not completely trivial, and most (but not all) approaches to parallelising a single MCMC chain focus on the parallelisation of each iteration. In order for this to be worthwhile, it is necessary that the problem being considered is non-trivial, having a large state space. The strategy is then to divide the state space into “chunks” which can be updated in parallel. I don’t have time to go through a real example in detail in this blog post, but fortunately I wrote a book chapter on this topic almost 10 years ago which is still valid and relevant today. The citation details are:

Wilkinson, D. J. (2005) Parallel Bayesian Computation, Chapter 16 in E. J. Kontoghiorghes (ed.) Handbook of Parallel Computing and Statistics, Marcel Dekker/CRC Press, 481-512.

The book was eventually published in 2005 after a long delay. The publisher which originally commisioned the handbook (Marcel Dekker) was taken over by CRC Press before publication, and the project lay dormant for a couple of years until the new publisher picked it up again and decided to proceed with publication. I have a draft of my original submission in PDF which I recommend reading for further information. The code examples used are also available for download, including several of the examples used in this post, as well as an extended case study on parallelisation of a single chain for Bayesian inference in a stochastic volatility model. Although the chapter is nearly 10 years old, the issues discussed are all still remarkably up-to-date, and the code examples all still work. I think that is a testament to the stability of the technology adopted (C, MPI, GSL). Some of the other handbook chapters have not stood the test of time so well.

For basic information on getting started with MPI and key MPI commands for implementing parallel MCMC algorithms, the above mentioned book chapter is a reasonable place to start. Read it all through carefully, run the examples, and carefully study the code for the parallel stochastic volatility example. Once that is understood, you should find it possible to start writing your own parallel MCMC algorithms. For further information about more sophisticated MPI usage and additional commands, I find the annotated specification: MPI – The complete reference to be as good a source as any.