## Introduction

In a previous post I’ve given a brief introduction to monads in Scala, aimed at people interested in scientific and statistical computing. Monads are a concept from category theory which turn out to be exceptionally useful for solving many problems in functional programming. But most categorical concepts have a dual, usually prefixed with “co”, so the dual of a monad is the comonad. Comonads turn out to be especially useful for formulating algorithms from scientific and statistical computing in an elegant way. In this post I’ll illustrate their use in signal processing, image processing, numerical integration of PDEs, and Gibbs sampling (of an Ising model). Comonads enable the extension of a local computation to a global computation, and this pattern crops up all over the place in statistical computing.

Simplifying massively, from the viewpoint of a Scala programmer, a monad is a mappable (functor) type class augmented with the methods pure and flatMap:

trait Monad[M[_]] extends Functor[M] {
def pure[T](v: T): M[T]
def flatMap[T,S](v: M[T])(f: T => M[S]): M[S]
}


In category theory, the dual of a concept is typically obtained by “reversing the arrows”. Here that means reversing the direction of the methods pure and flatMap to get extract and coflatMap, respectively.

trait Comonad[W[_]] extends Functor[W] {
def extract[T](v: W[T]): T
def coflatMap[T,S](v: W[T])(f: W[T] => S): W[S]
}


So, while pure allows you to wrap plain values in a monad, extract allows you to get a value out of a comonad. So you can always get a value out of a comonad (unlike a monad). Similarly, while flatMap allows you to transform a monad using a function returning a monad, coflatMap allows you to transform a comonad using a function which collapses a comonad to a single value. It is coflatMap (sometimes called extend) which can extend a local computation (producing a single value) to the entire comonad. We’ll look at how that works in the context of some familiar examples.

## Applying a linear filter to a data stream

One of the simplest examples of a comonad is an infinite stream of data. I’ve discussed streams in a previous post. By focusing on infinite streams we know the stream will never be empty, so there will always be a value that we can extract. Which value does extract give? For a Stream encoded as some kind of lazy list, the only value we actually know is the value at the head of the stream, with subsequent values to be lazily computed as required. So the head of the list is the only reasonable value for extract to return.

Understanding coflatMap is a bit more tricky, but it is coflatMap that provides us with the power to apply a non-trivial statistical computation to the stream. The input is a function which transforms a stream into a value. In our example, that will be a function which computes a weighted average of the first few values and returns that weighted average as the result. But the return type of coflatMap must be a stream of such computations. Following the types, a few minutes thought reveals that the only reasonable thing to do is to return the stream formed by applying the weighted average function to all sub-streams, recursively. So, for a Stream s (of type Stream[T]) and an input function f: W[T] => S, we form a stream whose head is f(s) and whose tail is coflatMap(f) applied to s.tail. Again, since we are working with an infinite stream, we don’t have to worry about whether or not the tail is empty. This gives us our comonadic Stream, and it is exactly what we need for applying a linear filter to the data stream.

In Scala, Cats is a library providing type classes from Category theory, and instances of those type classes for parametrised types in the standard library. In particular, it provides us with comonadic functionality for the standard Scala Stream. Let’s start by defining a stream corresponding to the logistic map.

import cats._
import cats.implicits._

val lam = 3.7
def s = Stream.iterate(0.5)(x => lam*x*(1-x))
s.take(10).toList
// res0: List[Double] = List(0.5, 0.925, 0.25668749999999985,
//  0.7059564011718747, 0.7680532550204203, 0.6591455741499428, ...


Let us now suppose that we want to apply a linear filter to this stream, in order to smooth the values. The idea behind using comonads is that you figure out how to generate one desired value, and let coflatMap take care of applying the same logic to the rest of the structure. So here, we need a function to generate the first filtered value (since extract is focused on the head of the stream). A simple first attempt a function to do this might look like the following.

  def linearFilterS(weights: Stream[Double])(s: Stream[Double]): Double =
(weights, s).parMapN(_*_).sum


This aligns each weight in parallel with a corresponding value from the stream, and combines them using multiplication. The resulting (hopefully finite length) stream is then summed (with addition). We can test this with

linearFilterS(Stream(0.25,0.5,0.25))(s)
// res1: Double = 0.651671875


and let coflatMap extend this computation to the rest of the stream with something like:

s.coflatMap(linearFilterS(Stream(0.25,0.5,0.25))).take(5).toList
// res2: List[Double] = List(0.651671875, 0.5360828502929686, ...


This is all completely fine, but our linearFilterS function is specific to the Stream comonad, despite the fact that all we’ve used about it in the function is that it is a parallelly composable and foldable. We can make this much more generic as follows:

  def linearFilter[F[_]: Foldable, G[_]](
weights: F[Double], s: F[Double]
)(implicit ev: NonEmptyParallel[F, G]): Double =
(weights, s).parMapN(_*_).fold


This uses some fairly advanced Scala concepts which I don’t want to get into right now (I should also acknowledge that I had trouble getting the syntax right for this, and got help from Fabio Labella (@SystemFw) on the Cats gitter channel). But this version is more generic, and can be used to linearly filter other data structures than Stream. We can use this for regular Streams as follows:

s.coflatMap(s => linearFilter(Stream(0.25,0.5,0.25),s))
// res3: scala.collection.immutable.Stream[Double] = Stream(0.651671875, ?)


But we can apply this new filter to other collections. This could be other, more sophisticated, streams such as provided by FS2, Monix or Akka streams. But it could also be a non-stream collection, such as List:

val sl = s.take(10).toList
sl.coflatMap(sl => linearFilter(List(0.25,0.5,0.25),sl))
// res4: List[Double] = List(0.651671875, 0.5360828502929686, ...


Assuming that we have the Breeze scientific library available, we can plot the raw and smoothed trajectories.

def myFilter(s: Stream[Double]): Double =
linearFilter(Stream(0.25, 0.5, 0.25),s)
val n = 500
import breeze.plot._
import breeze.linalg._
val fig = Figure(s"The (smoothed) logistic map (lambda=$lam)") val p0 = fig.subplot(3,1,0) p0 += plot(linspace(1,n,n),s.take(n)) p0.ylim = (0.0,1.0) p0.title = s"The logistic map (lambda=$lam)"
val p1 = fig.subplot(3,1,1)
p1 += plot(linspace(1,n,n),s.coflatMap(myFilter).take(n))
p1.ylim = (0.0,1.0)
p1.title = "Smoothed by a simple linear filter"
val p2 = fig.subplot(3,1,2)
p2 += plot(linspace(1,n,n),s.coflatMap(myFilter).coflatMap(myFilter).coflatMap(myFilter).coflatMap(myFilter).coflatMap(myFilter).take(n))
p2.ylim = (0.0,1.0)
p2.title = "Smoothed with 5 applications of the linear filter"
fig.refresh


## Image processing and the heat equation

Streaming data is in no way the only context in which a comonadic approach facilitates an elegant approach to scientific and statistical computing. Comonads crop up anywhere where we want to extend a computation that is local to a small part of a data structure to the full data structure. Another commonly cited area of application of comonadic approaches is image processing (I should acknowledge that this section of the post is very much influenced by a blog post on comonadic image processing in Haskell). However, the kinds of operations used in image processing are in many cases very similar to the operations used in finite difference approaches to numerical integration of partial differential equations (PDEs) such as the heat equation, so in this section I will blur (sic) the distinction between the two, and numerically integrate the 2D heat equation in order to Gaussian blur a noisy image.

First we need a simple image type which can have pixels of arbitrary type T (this is very important – all functors must be fully type polymorphic).

  import scala.collection.parallel.immutable.ParVector
case class Image[T](w: Int, h: Int, data: ParVector[T]) {
def apply(x: Int, y: Int): T = data(x*h+y)
def map[S](f: T => S): Image[S] = Image(w, h, data map f)
def updated(x: Int, y: Int, value: T): Image[T] =
Image(w,h,data.updated(x*h+y,value))
}


Here I’ve chosen to back the image with a parallel immutable vector. This wasn’t necessary, but since this type has a map operation which automatically parallelises over multiple cores, any map operations applied to the image will be automatically parallelised. This will ultimately lead to all of our statistical computations being automatically parallelised without us having to think about it.

As it stands, this image isn’t comonadic, since it doesn’t implement extract or coflatMap. Unlike the case of Stream, there isn’t really a uniquely privileged pixel, so it’s not clear what extract should return. For many data structures of this type, we make them comonadic by adding a “cursor” pointing to a “current” element of interest, and use this as the focus for computations applied with coflatMap. This is simplest to explain by example. We can define our “pointed” image type as follows:

  case class PImage[T](x: Int, y: Int, image: Image[T]) {
def extract: T = image(x, y)
def map[S](f: T => S): PImage[S] = PImage(x, y, image map f)
def coflatMap[S](f: PImage[T] => S): PImage[S] = PImage(
x, y, Image(image.w, image.h,
(0 until (image.w * image.h)).toVector.par.map(i => {
val xx = i / image.h
val yy = i % image.h
f(PImage(xx, yy, image))
})))


There is missing a closing brace, as I’m not quite finished. Here x and y represent the location of our cursor, so extract returns the value of the pixel indexed by our cursor. Similarly, coflatMap forms an image where the value of the image at each location is the result of applying the function f to the image which had the cursor set to that location. Clearly f should use the cursor in some way, otherwise the image will have the same value at every pixel location. Note that map and coflatMap operations will be automatically parallelised. The intuitive idea behind coflatMap is that it extends local computations. For the stream example, the local computation was a linear combination of nearby values. Similarly, in image analysis problems, we often want to apply a linear filter to nearby pixels. We can get at the pixel at the cursor location using extract, but we probably also want to be able to move the cursor around to nearby locations. We can do that by adding some appropriate methods to complete the class definition.

    def up: PImage[T] = {
val py = y-1
val ny = if (py >= 0) py else (py + image.h)
PImage(x,ny,image)
}
def down: PImage[T] = {
val py = y+1
val ny = if (py < image.h) py else (py - image.h)
PImage(x,ny,image)
}
def left: PImage[T] = {
val px = x-1
val nx = if (px >= 0) px else (px + image.w)
PImage(nx,y,image)
}
def right: PImage[T] = {
val px = x+1
val nx = if (px < image.w) px else (px - image.w)
PImage(nx,y,image)
}
}


Here each method returns a new pointed image with the cursor shifted by one pixel in the appropriate direction. Note that I’ve used periodic boundary conditions here, which often makes sense for numerical integration of PDEs, but makes less sense for real image analysis problems. Note that we have embedded all “indexing” issues inside the definition of our classes. Now that we have it, none of the statistical algorithms that we develop will involve any explicit indexing. This makes it much less likely to develop algorithms containing bugs corresponding to “off-by-one” or flipped axis errors.

This class is now fine for our requirements. But if we wanted Cats to understand that this structure is really a comonad (perhaps because we wanted to use derived methods, such as coflatten), we would need to provide evidence for this. The details aren’t especially important for this post, but we can do it simply as follows:

  implicit val pimageComonad = new Comonad[PImage] {
def extract[A](wa: PImage[A]) = wa.extract
def coflatMap[A,B](wa: PImage[A])(f: PImage[A] => B): PImage[B] =
wa.coflatMap(f)
def map[A,B](wa: PImage[A])(f: A => B): PImage[B] = wa.map(f)
}


It’s handy to have some functions for converting Breeze dense matrices back and forth with our image class.

  import breeze.linalg.{Vector => BVec, _}
def BDM2I[T](m: DenseMatrix[T]): Image[T] =
Image(m.cols, m.rows, m.data.toVector.par)
def I2BDM(im: Image[Double]): DenseMatrix[Double] =
new DenseMatrix(im.h,im.w,im.data.toArray)


Now we are ready to see how to use this in practice. Let’s start by defining a very simple linear filter.

def fil(pi: PImage[Double]): Double = (2*pi.extract+
pi.up.extract+pi.down.extract+pi.left.extract+pi.right.extract)/6.0


This simple filter can be used to “smooth” or “blur” an image. However, from a more sophisticated viewpoint, exactly this type of filter can be used to represent one time step of a numerical method for time integration of the 2D heat equation. Now we can simulate a noisy image and apply our filter to it using coflatMap:

import breeze.stats.distributions.Gaussian
val bdm = DenseMatrix.tabulate(200,250){case (i,j) => math.cos(
0.1*math.sqrt((i*i+j*j))) + Gaussian(0.0,2.0).draw}
val pim0 = PImage(0,0,BDM2I(bdm))
def pims = Stream.iterate(pim0)(_.coflatMap(fil))


Note that here, rather than just applying the filter once, I’ve generated an infinite stream of pointed images, each one representing an additional application of the linear filter. Thus the sequence represents the time solution of the heat equation with initial condition corresponding to our simulated noisy image.

We can render the first few frames to check that it seems to be working.

import breeze.plot._
val fig = Figure("Diffusing a noisy image")
pims.take(25).zipWithIndex.foreach{case (pim,i) => {
val p = fig.subplot(5,5,i)
p += image(I2BDM(pim.image))
}}


Note that the numerical integration is carried out in parallel on all available cores automatically. Other image filters can be applied, and other (parabolic) PDEs can be numerically integrated in an essentially similar way.

## Gibbs sampling the Ising model

Another place where the concept of extending a local computation to a global computation crops up is in the context of Gibbs sampling a high-dimensional probability distribution by cycling through the sampling of each variable in turn from its full-conditional distribution. I’ll illustrate this here using the Ising model, so that I can reuse the pointed image class from above, but the principles apply to any Gibbs sampling problem. In particular, the Ising model that we consider has a conditional independence structure corresponding to a graph of a square lattice. As above, we will use the comonadic structure of the square lattice to construct a Gibbs sampler. However, we can construct a Gibbs sampler for arbitrary graphical models in an essentially identical way by using a graph comonad.

Let’s begin by simulating a random image containing +/-1s:

import breeze.stats.distributions.{Binomial,Bernoulli}
val beta = 0.4
val bdm = DenseMatrix.tabulate(500,600){
case (i,j) => (new Binomial(1,0.2)).draw
}.map(_*2 - 1) // random matrix of +/-1s
val pim0 = PImage(0,0,BDM2I(bdm))


We can use this to initialise our Gibbs sampler. We now need a Gibbs kernel representing the update of each pixel.

def gibbsKernel(pi: PImage[Int]): Int = {
val sum = pi.up.extract+pi.down.extract+pi.left.extract+pi.right.extract
val p1 = math.exp(beta*sum)
val p2 = math.exp(-beta*sum)
val probplus = p1/(p1+p2)
if (new Bernoulli(probplus).draw) 1 else -1
}


So far so good, but there a couple of issues that we need to consider before we plough ahead and start coflatMapping. The first is that pure functional programmers will object to the fact that this function is not pure. It is a stochastic function which has the side-effect of mutating the random number state. I’m just going to duck that issue here, as I’ve previously discussed how to fix it using probability monads, and I don’t want it to distract us here.

However, there is a more fundamental problem here relating to parallel versus sequential application of Gibbs kernels. coflatMap is conceptually parallel (irrespective of how it is implemented) in that all computations used to build the new comonad are based solely on the information available in the starting comonad. OTOH, detailed balance of the Markov chain will only be preserved if the kernels for each pixel are applied sequentially. So if we coflatMap this kernel over the image we will break detailed balance. I should emphasise that this has nothing to do with the fact that I’ve implemented the pointed image using a parallel vector. Exactly the same issue would arise if we switched to backing the image with a regular (sequential) immutable Vector.

The trick here is to recognise that if we coloured alternate pixels black and white using a chequerboard pattern, then all of the black pixels are conditionally independent given the white pixels and vice-versa. Conditionally independent pixels can be updated by parallel application of a Gibbs kernel. So we just need separate kernels for updating odd and even pixels.

def oddKernel(pi: PImage[Int]): Int =
if ((pi.x+pi.y) % 2 != 0) pi.extract else gibbsKernel(pi)
def evenKernel(pi: PImage[Int]): Int =
if ((pi.x+pi.y) % 2 == 0) pi.extract else gibbsKernel(pi)


Each of these kernels can be coflatMapped over the image preserving detailed balance of the chain. So we can now construct an infinite stream of MCMC iterations as follows.

def pims = Stream.iterate(pim0)(_.coflatMap(oddKernel).
coflatMap(evenKernel))


We can animate the first few iterations with:

import breeze.plot._
val fig = Figure("Ising model Gibbs sampler")
fig.width = 1000
fig.height = 800
pims.take(50).zipWithIndex.foreach{case (pim,i) => {
print(s"$i ") fig.clear val p = fig.subplot(1,1,0) p.title = s"Ising model: frame$i"
p += image(I2BDM(pim.image.map{_.toDouble}))
fig.refresh
}}
println


Here I have a movie showing the first 1000 iterations. Note that youtube seems to have over-compressed it, but you should get the basic idea.

Again, note that this MCMC sampler runs in parallel on all available cores, automatically. This issue of odd/even pixel updating emphasises another issue that crops up a lot in functional programming: very often, thinking about how to express an algorithm functionally leads to an algorithm which parallelises naturally. For general graphs, figuring out which groups of nodes can be updated in parallel is essentially the graph colouring problem. I’ve discussed this previously in relation to parallel MCMC in:

Wilkinson, D. J. (2005) Parallel Bayesian Computation, Chapter 16 in E. J. Kontoghiorghes (ed.) Handbook of Parallel Computing and Statistics, Marcel Dekker/CRC Press, 481-512.

There are quite a few blog posts discussing comonads in the context of Haskell. In particular, the post on comonads for image analysis I mentioned previously, and this one on cellular automata. Bartosz’s post on comonads gives some connection back to the mathematical origins. Runar’s Scala comonad tutorial is the best source I know for comonads in Scala.

Full runnable code corresponding to this blog post is available from my blog repo.

## Statistical computing with Scala free on-line course

I’ve recently delivered a three-day intensive short-course on Scala for statistical computing and data science. The course seemed to go well, and the experience has convinced me that Scala should be used a lot more by statisticians and data scientists for a range of problems in statistical computing. In particular, the simplicity of writing fast efficient parallel algorithms is reason alone to take a careful look at Scala. With a view to helping more statisticians get to grips with Scala, I’ve decided to freely release all of the essential materials associated with the course: the course notes (as PDF), code fragments, complete examples, end-of-chapter exercises, etc. Although I developed the materials with the training course in mind, the course notes are reasonably self-contained, making the course quite suitable for self-study. At some point I will probably flesh out the notes into a proper book, but that will probably take me a little while.

I’ve written a brief self-study guide to point people in the right direction. For people studying the material in their spare time, the course is probably best done over nine weeks (one chapter per week), and this will then cover material at a similar rate to a typical MOOC.

The nine chapters are:

1. Introduction
2. Scala and FP Basics
3. Collections
4. Scala Breeze
5. Monte Carlo
6. Statistical modelling
7. Tools
8. Apache Spark

For anyone frustrated by the limitations of dynamic languages such as R, Python or Octave, this course should provide a good pathway to an altogether more sophisticated, modern programming paradigm.

## MCMC as a Stream

### Introduction

This weekend I’ve been preparing some material for my upcoming Scala for statistical computing short course. As part of the course, I thought it would be useful to walk through how to think about and structure MCMC codes, and in particular, how to think about MCMC algorithms as infinite streams of state. This material is reasonably stand-alone, so it seems suitable for a blog post. Complete runnable code for the examples in this post are available from my blog repo.

### A simple MH sampler

For this post I will just consider a trivial toy Metropolis algorithm using a Uniform random walk proposal to target a standard normal distribution. I’ve considered this problem before on my blog, so if you aren’t very familiar with Metropolis-Hastings algorithms, you might want to quickly review my post on Metropolis-Hastings MCMC algorithms in R before continuing. At the end of that post, I gave the following R code for the Metropolis sampler:

metrop3<-function(n=1000,eps=0.5)
{
vec=vector("numeric", n)
x=0
oldll=dnorm(x,log=TRUE)
vec[1]=x
for (i in 2:n) {
can=x+runif(1,-eps,eps)
loglik=dnorm(can,log=TRUE)
loga=loglik-oldll
if (log(runif(1)) < loga) {
x=can
oldll=loglik
}
vec[i]=x
}
vec
}


I will begin this post with a fairly direct translation of this algorithm into Scala:

def metrop1(n: Int = 1000, eps: Double = 0.5): DenseVector[Double] = {
val vec = DenseVector.fill(n)(0.0)
var x = 0.0
var oldll = Gaussian(0.0, 1.0).logPdf(x)
vec(0) = x
(1 until n).foreach { i =>
val can = x + Uniform(-eps, eps).draw
val loglik = Gaussian(0.0, 1.0).logPdf(can)
val loga = loglik - oldll
if (math.log(Uniform(0.0, 1.0).draw) < loga) {
x = can
oldll = loglik
}
vec(i) = x
}
vec
}


This code works, and is reasonably fast and efficient, but there are several issues with it from a functional programmers perspective. One issue is that we have committed to storing all MCMC output in RAM in a DenseVector. This probably isn’t an issue here, but for some big problems we might prefer to not store the full set of states, but to just print the states to (say) the console, for possible re-direction to a file. It is easy enough to modify the code to do this:

def metrop2(n: Int = 1000, eps: Double = 0.5): Unit = {
var x = 0.0
var oldll = Gaussian(0.0, 1.0).logPdf(x)
(1 to n).foreach { i =>
val can = x + Uniform(-eps, eps).draw
val loglik = Gaussian(0.0, 1.0).logPdf(can)
val loga = loglik - oldll
if (math.log(Uniform(0.0, 1.0).draw) < loga) {
x = can
oldll = loglik
}
println(x)
}
}


But now we have two version of the algorithm. One for storing results locally, and one for streaming results to the console. This is clearly unsatisfactory, but we shall return to this issue shortly. Another issue that will jump out at functional programmers is the reliance on mutable variables for storing the state and old likelihood. Let’s fix that now by re-writing the algorithm as a tail-recursion.

@tailrec
def metrop3(n: Int = 1000, eps: Double = 0.5, x: Double = 0.0, oldll: Double = Double.MinValue): Unit = {
if (n > 0) {
println(x)
val can = x + Uniform(-eps, eps).draw
val loglik = Gaussian(0.0, 1.0).logPdf(can)
val loga = loglik - oldll
if (math.log(Uniform(0.0, 1.0).draw) < loga)
metrop3(n - 1, eps, can, loglik)
else
metrop3(n - 1, eps, x, oldll)
}
}


This has eliminated the vars, and is just as fast and efficient as the previous version of the code. Note that the @tailrec annotation is optional – it just signals to the compiler that we want it to throw an error if for some reason it cannot eliminate the tail call. However, this is for the print-to-console version of the code. What if we actually want to keep the iterations in RAM for subsequent analysis? We can keep the values in an accumulator, as follows.

@tailrec
def metrop4(n: Int = 1000, eps: Double = 0.5, x: Double = 0.0, oldll: Double = Double.MinValue, acc: List[Double] = Nil): DenseVector[Double] = {
if (n == 0)
DenseVector(acc.reverse.toArray)
else {
val can = x + Uniform(-eps, eps).draw
val loglik = Gaussian(0.0, 1.0).logPdf(can)
val loga = loglik - oldll
if (math.log(Uniform(0.0, 1.0).draw) < loga)
metrop4(n - 1, eps, can, loglik, can :: acc)
else
metrop4(n - 1, eps, x, oldll, x :: acc)
}
}


### Factoring out the updating logic

This is all fine, but we haven’t yet addressed the issue of having different versions of the code depending on what we want to do with the output. The problem is that we have tied up the logic of advancing the Markov chain with what to do with the output. What we need to do is separate out the code for advancing the state. We can do this by defining a new function.

def newState(x: Double, oldll: Double, eps: Double): (Double, Double) = {
val can = x + Uniform(-eps, eps).draw
val loglik = Gaussian(0.0, 1.0).logPdf(can)
val loga = loglik - oldll
if (math.log(Uniform(0.0, 1.0).draw) < loga) (can, loglik) else (x, oldll)
}


This function takes as input a current state and associated log likelihood and returns a new state and log likelihood following the execution of one step of a MH algorithm. This separates the concern of state updating from the rest of the code. So now if we want to write code that prints the state, we can write it as

  @tailrec
def metrop5(n: Int = 1000, eps: Double = 0.5, x: Double = 0.0, oldll: Double = Double.MinValue): Unit = {
if (n > 0) {
println(x)
val ns = newState(x, oldll, eps)
metrop5(n - 1, eps, ns._1, ns._2)
}
}


and if we want to accumulate the set of states visited, we can write that as

  @tailrec
def metrop6(n: Int = 1000, eps: Double = 0.5, x: Double = 0.0, oldll: Double = Double.MinValue, acc: List[Double] = Nil): DenseVector[Double] = {
if (n == 0) DenseVector(acc.reverse.toArray) else {
val ns = newState(x, oldll, eps)
metrop6(n - 1, eps, ns._1, ns._2, ns._1 :: acc)
}
}


Both of these functions call newState to do the real work, and concentrate on what to do with the sequence of states. However, both of these functions repeat the logic of how to iterate over the sequence of states.

### MCMC as a stream

Ideally we would like to abstract out the details of how to do state iteration from the code as well. Most functional languages have some concept of a Stream, which represents a (potentially infinite) sequence of states. The Stream can embody the logic of how to perform state iteration, allowing us to abstract that away from our code, as well.

To do this, we will restructure our code slightly so that it more clearly maps old state to new state.

def nextState(eps: Double)(state: (Double, Double)): (Double, Double) = {
val x = state._1
val oldll = state._2
val can = x + Uniform(-eps, eps).draw
val loglik = Gaussian(0.0, 1.0).logPdf(can)
val loga = loglik - oldll
if (math.log(Uniform(0.0, 1.0).draw) < loga) (can, loglik) else (x, oldll)
}


The "real" state of the chain is just x, but if we want to avoid recalculation of the old likelihood, then we need to make this part of the chain’s state. We can use this nextState function in order to construct a Stream.

  def metrop7(eps: Double = 0.5, x: Double = 0.0, oldll: Double = Double.MinValue): Stream[Double] =
Stream.iterate((x, oldll))(nextState(eps)) map (_._1)


The result of calling this is an infinite stream of states. Obviously it isn’t computed – that would require infinite computation, but it captures the logic of iteration and computation in a Stream, that can be thought of as a lazy List. We can get values out by converting the Stream to a regular collection, being careful to truncate the Stream to one of finite length beforehand! eg. metrop7().drop(1000).take(10000).toArray will do a burn-in of 1,000 iterations followed by a main monitoring run of length 10,000, capturing the results in an Array. Note that metrop7().drop(1000).take(10000) is a Stream, and so nothing is actually computed until the toArray is encountered. Conversely, if printing to console is required, just replace the .toArray with .foreach(println).

The above stream-based approach to MCMC iteration is clean and elegant, and deals nicely with issues like burn-in and thinning (which can be handled similarly). This is how I typically write MCMC codes these days. However, functional programming purists would still have issues with this approach, as it isn’t quite pure functional. The problem is that the code isn’t pure – it has a side-effect, which is to mutate the state of the under-pinning pseudo-random number generator. If the code was pure, calling nextState with the same inputs would always give the same result. Clearly this isn’t the case here, as we have specifically designed the function to be stochastic, returning a randomly sampled value from the desired probability distribution. So nextState represents a function for randomly sampling from a conditional probability distribution.

### A pure functional approach

Now, ultimately all code has side-effects, or there would be no point in running it! But in functional programming the desire is to make as much of the code as possible pure, and to push side-effects to the very edges of the code. So it’s fine to have side-effects in your main method, but not buried deep in your code. Here the side-effect is at the very heart of the code, which is why it is potentially an issue.

To keep things as simple as possible, at this point we will stop worrying about carrying forward the old likelihood, and hard-code a value of eps. Generalisation is straightforward. We can make our code pure by instead defining a function which represents the conditional probability distribution itself. For this we use a probability monad, which in Breeze is called Rand. We can couple together such functions using monadic binds (flatMap in Scala), expressed most neatly using for-comprehensions. So we can write our transition kernel as

def kernel(x: Double): Rand[Double] = for {
innov <- Uniform(-0.5, 0.5)
can = x + innov
oldll = Gaussian(0.0, 1.0).logPdf(x)
loglik = Gaussian(0.0, 1.0).logPdf(can)
loga = loglik - oldll
u <- Uniform(0.0, 1.0)
} yield if (math.log(u) < loga) can else x


This is now pure – the same input x will always return the same probability distribution – the conditional distribution of the next state given the current state. We can draw random samples from this distribution if we must, but it’s probably better to work as long as possible with pure functions. So next we need to encapsulate the iteration logic. Breeze has a MarkovChain object which can take kernels of this form and return a stochastic Process object representing the iteration logic, as follows.

MarkovChain(0.0)(kernel).
steps.
drop(1000).
take(10000).
foreach(println)


The steps method contains the logic of how to advance the state of the chain. But again note that no computation actually takes place until the foreach method is encountered – this is when the sampling occurs and the side-effects happen.

Metropolis-Hastings is a common use-case for Markov chains, so Breeze actually has a helper method built-in that will construct a MH sampler directly from an initial state, a proposal kernel, and a (log) target.

MarkovChain.
metropolisHastings(0.0, (x: Double) =>
Uniform(x - 0.5, x + 0.5))(x =>
Gaussian(0.0, 1.0).logPdf(x)).
steps.
drop(1000).
take(10000).
toArray


Note that if you are using the MH functionality in Breeze, it is important to make sure that you are using version 0.13 (or later), as I fixed a few issues with the MH code shortly prior to the 0.13 release.

### Summary

Viewing MCMC algorithms as infinite streams of state is useful for writing elegant, generic, flexible code. Streams occur everywhere in programming, and so there are lots of libraries for working with them. In this post I used the simple Stream from the Scala standard library, but there are much more powerful and flexible stream libraries for Scala, including fs2 and Akka-streams. But whatever libraries you are using, the fundamental concepts are the same. The most straightforward approach to implementation is to define impure stochastic streams to consume. However, a pure functional approach is also possible, and the Breeze library defines some useful functions to facilitate this approach. I’m still a little bit ambivalent about whether the pure approach is worth the additional cognitive overhead, but it’s certainly very interesting and worth playing with and thinking about the pros and cons.

Complete runnable code for the examples in this post are available from my blog repo.

## A scalable particle filter in Scala

### Introduction

Many modern algorithms in computational Bayesian statistics have at their heart a particle filter or some other sequential Monte Carlo (SMC) procedure. In this blog I’ve discussed particle MCMC algorithms which use a particle filter in the inner-loop in order to compute a (noisy, unbiased) estimate of the marginal likelihood of the data. These algorithms are often very computationally intensive, either because the forward model used to propagate the particles is expensive, or because the likelihood associated with each particle/observation is expensive (or both). In this case it is desirable to parallelise the particle filter to run on all available cores of a machine, or in some cases, it would even be desirable to distribute the the particle filter computation across a cluster of machines.

Parallelisation is difficult when using the conventional imperative programming languages typically used in scientific and statistical computing, but is much easier using modern functional languages such as Scala. In fact, in languages such as Scala it is possible to describe algorithms at a higher level of abstraction, so that exactly the same algorithm can run in serial, run in parallel across all available cores on a single machine, or run in parallel across a cluster of machines, all without changing any code. Doing so renders parallelisation a non-issue. In this post I’ll talk through how to do this for a simple bootstrap particle filter, but the same principle applies for a large range of statistical computing algorithms.

In the previous post I gave a quick introduction to the monad concept, and to monadic collections in particular. Many computational tasks in statistics can be accomplished using a sequence of operations on monadic collections. We would like to write code that is independent of any particular implementation of a monadic collection, so that we can switch to a different implementation without changing the code of our algorithm (for example, switching from a serial to a parallel collection). But in strongly typed languages we need to know at compile time that the collection we use has the methods that we require. Typeclasses provide a nice solution to this problem. I don’t want to get bogged down in a big discussion about Scala typeclasses here, but suffice to say that they describe a family of types conforming to a particular interface in an ad hoc loosely coupled way (they are said to provide ad hoc polymorphism). They are not the same as classes in traditional O-O languages, but they do solve a similar problem to the adaptor design pattern, in a much cleaner way. We can describe a simple typeclass for our monadic collection as follows:

trait GenericColl[C[_]] {
def map[A, B](ca: C[A])(f: A => B): C[B]
def reduce[A](ca: C[A])(f: (A, A) => A): A
def flatMap[A, B, D[B] <: GenTraversable[B]](ca: C[A])(f: A => D[B]): C[B]
def zip[A, B](ca: C[A])(cb: C[B]): C[(A, B)]
def length[A](ca: C[A]): Int
}


In the typeclass we just list the methods that we expect our generic collection to provide, but do not say anything about how they are implemented. For example, we know that operations such as map and reduce can be executed in parallel, but this is a separate concern. We can now write code that can be used for any collection conforming to the requirements of this typeclass. The full code for this example is provided in the associated github repo for this blog, and includes the obvious syntax for this typeclass, and typeclass instances for the Scala collections Vector and ParVector, that we will exploit later in the example.

### SIR step for a bootstrap filter

We can now write some code for a single observation update of a bootstrap particle filter.

def update[S: State, O: Observation, C[_]: GenericColl](
dataLik: (S, O) => LogLik, stepFun: S => S
)(x: C[S], o: O): (LogLik, C[S]) = {
val xp = x map (stepFun(_))
val lw = xp map (dataLik(_, o))
val max = lw reduce (math.max(_, _))
val rw = lw map (lwi => math.exp(lwi - max))
val srw = rw reduce (_ + _)
val l = rw.length
val z = rw zip xp
val rx = z flatMap (p => Vector.fill(Poisson(p._1 * l / srw).draw)(p._2))
(max + math.log(srw / l), rx)
}


This is a very simple bootstrap filter, using Poisson resampling for simplicity and data locality, but does include use of the log-sum-exp trick to prevent over/underflow of raw weight calculations, and tracks the marginal (log-)likelihood of the observation. With this function we can now pass in a “prior” particle distribution in any collection conforming to our typeclass, together with a propagator function, an observation (log-)likelihood, and an observation, and it will return back a new collection of particles of exactly the same type that was provided for input. Note that all of the operations we require can be accomplished with the standard monadic collection operations declared in our typeclass.

### Filtering as a functional fold

Once we have a function for executing one step of a particle filter, we can produce a function for particle filtering as a functional fold over a sequence of observations:

def pFilter[S: State, O: Observation, C[_]: GenericColl, D[O] <: GenTraversable[O]](
x0: C[S], data: D[O], dataLik: (S, O) => LogLik, stepFun: S => S
): (LogLik, C[S]) = {
val updater = update[S, O, C](dataLik, stepFun) _
data.foldLeft((0.0, x0))((prev, o) => {
val next = updater(prev._2, o)
(prev._1 + next._1, next._2)
})
}


Folding data structures is a fundamental concept in functional programming, and is exactly what is required for any kind of filtering problem. Note that Brian Beckman has recently written a series of articles on Kalman filtering as a functional fold.

### Marginal likelihoods and parameter estimation

So far we haven’t said anything about parameters or parameter estimation, but this is appropriate, since parametrisation is a separate concern from filtering. However, once we have a function for particle filtering, we can produce a function concerned with evaluating marginal likelihoods trivially:

def pfMll[S: State, P: Parameter, O: Observation,
C[_]: GenericColl, D[O] <: GenTraversable[O]](
simX0: P => C[S], stepFun: P => S => S,
dataLik: P => (S, O) => LogLik, data: D[O]
): (P => LogLik) = (th: P) =>
pFilter(simX0(th), data, dataLik(th), stepFun(th))._1


Note that this higher-order function does not return a value, but instead a function which will accept a parameter as input and return a (log-)likelihood as output. This can then be used for parameter estimation purposes, perhaps being used in a PMMH pMCMC algorithm, or something else. Again, this is a separate concern.

### Example

Here I’ll just give a completely trivial toy example, purely to show how the functions work. For avoidance of doubt, I know that there are many better/simpler/easier ways to tackle this problem! Here we will just look at inferring the auto-regression parameter of a linear Gaussian AR(1)-plus-noise model using the functions we have developed.

First we can simulate some synthetic data from this model, using a value of 0.8 for the auto-regression parameter:

val inNoise = Gaussian(0.0, 1.0).sample(99)
val state = DenseVector(inNoise.scanLeft(0.0)((s, i) => 0.8 * s + i).toArray)
val noise = DenseVector(Gaussian(0.0, 2.0).sample(100).toArray)
val data = (state + noise).toArray.toList


Now assuming that we don’t know the auto-regression parameter, we can construct a function to evaluate the likelihood of different parameter values as follows:

val mll = pfMll(
(th: Double) => Gaussian(0.0, 10.0).sample(10000).toVector.par,
(th: Double) => (s: Double) => Gaussian(th * s, 1.0).draw,
(th: Double) => (s: Double, o: Double) => Gaussian(s, 2.0).logPdf(o),
data
)


Note that the 4 characters “.par” at the end of line 2 are the only difference between running this code serially or in parallel! Now we can run this code by calling the returned function with different values. So, hopefully mll(0.8) will return a larger log-likelihood than (say) mll(0.6) or mll(0.9). The example code in the github repo plots the results of calling mll() for a range of values (note that if that was the genuine use-case, then it would be much better to parallellise the parameter range than the particle filter, due to providing better parallelisation granularity, but many other examples require parallelisation of the particle filter itself). In this particular example, both the forward model and the likelihood are very cheap operations, so there is little to be gained from parallelisation. Nevertheless, I still get a speedup of more than a factor of two using the parallel version on my laptop.

### Conclusion

In this post we have shown how typeclasses can be used in Scala to write code that is parallelisation-agnostic. Code written in this way can be run on one or many cores as desired. We’ve illustrated the concept with a scalable particle filter, but nothing about the approach is specific to that application. It would be easy to build up a library of statistical routines this way, all of which can effectively exploit available parallel hardware. Further, although we haven’t demonstrated it here, it is trivial to extend this idea to allow code to be distribution over a cluster of parallel machines if necessary. For example, if an Apache Spark cluster is available, it is easy to make a Spark RDD instance for our generic collection typeclass, that will then allow us to run our (unmodified) particle filter code over a Spark cluster. This emphasises the fact that Spark can be useful for distributing computation as well as just processing “big data”. I’ll say more about Spark in subsequent posts.

## Calling Scala code from R using rscala

### Introduction

In a previous post I looked at how to call Scala code from R using a CRAN package called jvmr. This package now seems to have been replaced by a new package called rscala. Like the old package, it requires a pre-existing Java installation. Unlike the old package, however, it no longer depends on rJava, which may simplify some installations. The rscala package is well documented, with a reference manual and a draft paper. In this post I will concentrate on the issue of calling sbt-based projects with dependencies on external libraries (such as breeze).

On a system with Java installed, it should be possible to install the rscala package with a simple

install.packages("rscala")


from the R command prompt. Calling

library(rscala)


will check that it has worked. The package will do a sensible search for a Scala installation and use it if it can find one. If it can’t find one (or can only find an installation older than 2.10.x), it will fail. In this case you can download and install a Scala installation specifically for rscala using the command

rscala::scalaInstall()


This option is likely to be attractive to sbt (or IDE) users who don’t like to rely on a system-wide scala installation.

### A Gibbs sampler in Scala using Breeze

For illustration I’m going to use a Scala implementation of a Gibbs sampler. The Scala code, gibbs.scala is given below:

package gibbs

object Gibbs {

import scala.annotation.tailrec
import scala.math.sqrt
import breeze.stats.distributions.{Gamma,Gaussian}

case class State(x: Double, y: Double) {
override def toString: String = x.toString + " , " + y + "\n"
}

def nextIter(s: State): State = {
val newX = Gamma(3.0, 1.0/((s.y)*(s.y)+4.0)).draw
State(newX, Gaussian(1.0/(newX+1), 1.0/sqrt(2*newX+2)).draw)
}

@tailrec def nextThinnedIter(s: State,left: Int): State =
if (left==0) s else nextThinnedIter(nextIter(s),left-1)

def genIters(s: State, stop: Int, thin: Int): List[State] = {
@tailrec def go(s: State, left: Int, acc: List[State]): List[State] =
if (left>0)
go(nextThinnedIter(s,thin), left-1, s::acc)
else acc
go(s,stop,Nil).reverse
}

def main(args: Array[String]) = {
if (args.length != 3) {
println("Usage: sbt \"run <outFile> <iters> <thin>\"")
sys.exit(1)
} else {
val outF=args(0)
val iters=args(1).toInt
val thin=args(2).toInt
val out = genIters(State(0.0,0.0),iters,thin)
val s = new java.io.FileWriter(outF)
s.write("x , y\n")
out map { it => s.write(it.toString) }
s.close
}
}

}


This code requires Scala and the Breeze scientific library in order to build. We can specify this in a sbt build file, which should be called build.sbt and placed in the same directory as the Scala code.

name := "gibbs"

version := "0.1"

scalacOptions ++= Seq("-unchecked", "-deprecation", "-feature")

libraryDependencies  ++= Seq(
"org.scalanlp" %% "breeze" % "0.10",
"org.scalanlp" %% "breeze-natives" % "0.10"
)

resolvers ++= Seq(
"Sonatype Snapshots" at "https://oss.sonatype.org/content/repositories/snapshots/",
"Sonatype Releases" at "https://oss.sonatype.org/content/repositories/releases/"
)

scalaVersion := "2.11.6"


Now, from a system command prompt in the directory where the files are situated, it should be possible to download all dependencies and compile and run the code with a simple

sbt "run output.csv 50000 1000"


sbt magically manages all of the dependencies for us so that we don’t have to worry about them. However, for calling from R, it may be desirable to run the code without running sbt. There are several ways to achieve this, but the simplest is to build an “assembly jar” or “fat jar”, which is a Java byte-code file containing all code and libraries required in order to run the code on any system with a Java installation.

To build an assembly jar first create a subdirectory called project (the name matters), an in it place two files. The first should be called assembly.sbt, and should contain the line

addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.13.0")


Since the version of the assembly tool can depend on the version of sbt, it is also best to fix the version of sbt being used by creating another file in the project directory called build.properties, which should contain the line

sbt.version=0.13.7


sbt assembly


If this works, it should create a fat jar target/scala-2.11/gibbs-assembly-0.1.jar. You can check it works by running

java -jar target/scala-2.11/gibbs-assembly-0.1.jar output.csv 10000 10


Assuming that it does, you are now ready to try running the code from within R.

#### Calling via R system calls

Since this code takes a relatively long time to run, calling it from R via simple system calls isn’t a particularly terrible idea. For example, we can do this from the R command prompt with the following commands

system("java -jar target/scala-2.11/gibbs-assembly-0.1.jar output.csv 50000 1000")
library(smfsb)
mcmcSummary(out,rows=2)


This works fine, but is a bit clunky. Tighter integration between R and Scala would be useful, which is where rscala comes in.

#### Calling assembly Scala projects via rscala

rscala provides a very simple way to embed a Scala interpreter within an R session, to be able to execute Scala expressions from R and to have the results returned back to the R session for further processing. The main issue with using this in practice is managing dependencies on external libraries and setting the Scala classpath correctly. By using an assembly jar we can bypass most of these issues, and it becomes trivial to call our Scala code direct from the R interpreter, as the following code illustrates.

library(rscala)
sc=scalaInterpreter("target/scala-2.11/gibbs-assembly-0.1.jar")
sc%~%'import gibbs.Gibbs._'
out=sc%~%'genIters(State(0.0,0.0),50000,1000).toArray.map{s=>Array(s.x,s.y)}'
library(smfsb)
mcmcSummary(out,rows=2)


Here we call the getIters function directly, rather than via the main method. This function returns an immutable List of States. Since R doesn’t understand this, we map it to an Array of Arrays, which R then unpacks into an R matrix for us to store in the matrix out.

### Summary

The CRAN package rscala makes it very easy to embed a Scala interpreter within an R session. However, for most non-trivial statistical computing problems, the Scala code will have dependence on external scientific libraries such as Breeze. The standard way to easily manage external dependencies in the Scala ecosystem is sbt. Given an sbt-based Scala project, it is easy to generate an assembly jar in order to initialise the rscala Scala interpreter with the classpath needed to call arbitrary Scala functions. This provides very convenient inter-operability between R and Scala for many statistical computing applications.

## Calling Scala code from R using jvmr

[Update: the jvmr package has been replaced by a new package called rscala. I have a new post which explains it.]

### Introduction

In previous posts I have explained why I think that Scala is a good language to use for statistical computing and data science. Despite this, R is very convenient for simple exploratory data analysis and visualisation – currently more convenient than Scala. I explained in my recent talk at the RSS what (relatively straightforward) things would need to be developed for Scala in order to make R completely redundant, but for the short term at least, it seems likely that I will need to use both R and Scala for my day-to-day work.

Since I use both Scala and R for statistical computing, it is very convenient to have a degree of interoperability between the two languages. I could call R from Scala code or Scala from R code, or both. Fortunately, some software tools have been developed recently which make this much simpler than it used to be. The software is jvmr, and as explained at the website, it enables calling Java and Scala from R and calling R from Java and Scala. I have previously discussed calling Java from R using the R CRAN package rJava. In this post I will focus on calling Scala from R using the CRAN package jvmr, which depends on rJava. I may examine calling R from Scala in a future post.

On a system with Java installed, it should be possible to install the jvmr R package with a simple

install.packages("jvmr")


from the R command prompt. The package has the usual documentation associated with it, but the draft paper describing the package is the best way to get an overview of its capabilities and a walk-through of simple usage.

### A Gibbs sampler in Scala using Breeze

For illustration I’m going to use a Scala implementation of a Gibbs sampler which relies on the Breeze scientific library, and will be built using the simple build tool, sbt. Most non-trivial Scala projects depend on various versions of external libraries, and sbt is an easy way to build even very complex projects trivially on any system with Java installed. You don’t even need to have Scala installed in order to build and run projects using sbt. I give some simple complete worked examples of building and running Scala sbt projects in the github repo associated with my recent RSS talk. Installing sbt is trivial as explained in the repo READMEs.

For this post, the Scala code, gibbs.scala is given below:

package gibbs

object Gibbs {

import scala.annotation.tailrec
import scala.math.sqrt
import breeze.stats.distributions.{Gamma,Gaussian}

case class State(x: Double, y: Double) {
override def toString: String = x.toString + " , " + y + "\n"
}

def nextIter(s: State): State = {
val newX = Gamma(3.0, 1.0/((s.y)*(s.y)+4.0)).draw
State(newX, Gaussian(1.0/(newX+1), 1.0/sqrt(2*newX+2)).draw)
}

@tailrec def nextThinnedIter(s: State,left: Int): State =
if (left==0) s else nextThinnedIter(nextIter(s),left-1)

def genIters(s: State, stop: Int, thin: Int): List[State] = {
@tailrec def go(s: State, left: Int, acc: List[State]): List[State] =
if (left&gt;0)
go(nextThinnedIter(s,thin), left-1, s::acc)
else acc
go(s,stop,Nil).reverse
}

def main(args: Array[String]) = {
if (args.length != 3) {
println("Usage: sbt \"run &lt;outFile&gt; &lt;iters&gt; &lt;thin&gt;\"")
sys.exit(1)
} else {
val outF=args(0)
val iters=args(1).toInt
val thin=args(2).toInt
val out = genIters(State(0.0,0.0),iters,thin)
val s = new java.io.FileWriter(outF)
s.write("x , y\n")
out map { it =&gt; s.write(it.toString) }
s.close
}
}

}


This code requires Scala and the Breeze scientific library in order to build. We can specify this in a sbt build file, which should be called build.sbt and placed in the same directory as the Scala code.

name := "gibbs"

version := "0.1"

scalacOptions ++= Seq("-unchecked", "-deprecation", "-feature")

libraryDependencies  ++= Seq(
"org.scalanlp" %% "breeze" % "0.10",
"org.scalanlp" %% "breeze-natives" % "0.10"
)

resolvers ++= Seq(
"Sonatype Snapshots" at "https://oss.sonatype.org/content/repositories/snapshots/",
"Sonatype Releases" at "https://oss.sonatype.org/content/repositories/releases/"
)

scalaVersion := "2.11.2"


Now, from a system command prompt in the directory where the files are situated, it should be possible to download all dependencies and compile and run the code with a simple

sbt "run output.csv 50000 1000"


#### Calling via R system calls

Since this code takes a relatively long time to run, calling it from R via simple system calls isn’t a particularly terrible idea. For example, we can do this from the R command prompt with the following commands

system("sbt \"run output.csv 50000 1000\"")
library(smfsb)
mcmcSummary(out,rows=2)


This works fine, but won’t work so well for code which needs to be called repeatedly. For this, tighter integration between R and Scala would be useful, which is where jvmr comes in.

#### Calling sbt-based Scala projects via jvmr

jvmr provides a very simple way to embed a Scala interpreter within an R session, to be able to execute Scala expressions from R and to have the results returned back to the R session for further processing. The main issue with using this in practice is managing dependencies on external libraries and setting the Scala classpath correctly. For an sbt project such as we are considering here, it is relatively easy to get sbt to provide us with all of the information we need in a fully automated way.

First, we need to add a new task to our sbt build instructions, which will output the full classpath in a way that is easy to parse from R. Just add the following to the end of the file build.sbt:

lazy val printClasspath = taskKey[Unit]("Dump classpath")

printClasspath := {
(fullClasspath in Runtime value) foreach {
e =&gt; print(e.data+"!")
}
}


Be aware that blank lines are significant in sbt build files. Once we have this in our build file, we can write a small R function to get the classpath from sbt and then initialise a jvmr scalaInterpreter with the correct full classpath needed for the project. An R function which does this, sbtInit(), is given below

sbtInit&lt;-function()
{
library(jvmr)
system2("sbt","compile")
cpstr=system2("sbt","printClasspath",stdout=TRUE)
cpst=cpstr[length(cpstr)]
cpsp=strsplit(cpst,"!")[[1]]
cp=cpsp[1:(length(cpsp)-1)]
scalaInterpreter(cp,use.jvmr.class.path=FALSE)
}


With this function at our disposal, it becomes trivial to call our Scala code direct from the R interpreter, as the following code illustrates.

sc=sbtInit()
sc['import gibbs.Gibbs._']
out=sc['genIters(State(0.0,0.0),50000,1000).toArray.map{s=&gt;Array(s.x,s.y)}']
library(smfsb)
mcmcSummary(out,rows=2)


Here we call the getIters function directly, rather than via the main method. This function returns an immutable List of States. Since R doesn’t understand this, we map it to an Array of Arrays, which R then unpacks into an R matrix for us to store in the matrix out.

### Summary

The CRAN package jvmr makes it very easy to embed a Scala interpreter within an R session. However, for most non-trivial statistical computing problems, the Scala code will have dependence on external scientific libraries such as Breeze. The standard way to easily manage external dependencies in the Scala ecosystem is sbt. Given an sbt-based Scala project, it is easy to add a task to the sbt build file and a function to R in order to initialise the jvmr Scala interpreter with the full classpath needed to call arbitrary Scala functions. This provides very convenient inter-operability between R and Scala for many statistical computing applications.

## One-way ANOVA with fixed and random effects from a Bayesian perspective

This blog post is derived from a computer practical session that I ran as part of my new course on Statistics for Big Data, previously discussed. This course covered a lot of material very quickly. In particular, I deferred introducing notions of hierarchical modelling until the Bayesian part of the course, where I feel it is more natural and powerful. However, some of the terminology associated with hierarchical statistical modelling probably seems a bit mysterious to those without a strong background in classical statistical modelling, and so this practical session was intended to clear up some potential confusion. I will analyse a simple one-way Analysis of Variance (ANOVA) model from a Bayesian perspective, making sure to highlight the difference between fixed and random effects in a Bayesian context where everything is random, as well as emphasising the associated identifiability issues. R code is used to illustrate the ideas.

### Example scenario

We will consider the body mass index (BMI) of new male undergraduate students at a selection of UK Universities. Let us suppose that our data consist of measurements of (log) BMI for a random sample of 1,000 males at each of 8 Universities. We are interested to know if there are any differences between the Universities. Again, we want to model the process as we would simulate it, so thinking about how we would simulate such data is instructive. We start by assuming that the log BMI is a normal random quantity, and that the variance is common across the Universities in question (this is quite a big assumption, and it is easy to relax). We assume that the mean of this normal distribution is University-specific, but that we do not have strong prior opinions regarding the way in which the Universities differ. That said, we expect that the Universities would not be very different from one another.

### Simulating data

A simple simulation of the data with some plausible parameters can be carried out as follows.

set.seed(1)
Z=matrix(rnorm(1000*8,3.1,0.1),nrow=8)
RE=rnorm(8,0,0.01)
X=t(Z+RE)
colnames(X)=paste("Uni",1:8,sep="")
Data=stack(data.frame(X))
boxplot(exp(values)~ind,data=Data,notch=TRUE)


Make sure that you understand exactly what this code is doing before proceeding. The boxplot showing the simulated data is given below.

### Frequentist analysis

We will start with a frequentist analysis of the data. The model we would like to fit is

$y_{ij} = \mu + \theta_i + \varepsilon_{ij}$

where i is an indicator for the University and j for the individual within a particular University. The “effect”, $\theta_i$ represents how the ith University differs from the overall mean. We know that this model is not actually identifiable when the model parameters are all treated as “fixed effects”, but R will handle this for us.

> mod=lm(values~ind,data=Data)
> summary(mod)

Call:
lm(formula = values ~ ind, data = Data)

Residuals:
Min       1Q   Median       3Q      Max
-0.36846 -0.06778 -0.00069  0.06910  0.38219

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept)  3.101068   0.003223 962.244  < 2e-16 ***
indUni2     -0.006516   0.004558  -1.430 0.152826
indUni3     -0.017168   0.004558  -3.767 0.000166 ***
indUni4      0.017916   0.004558   3.931 8.53e-05 ***
indUni5     -0.022838   0.004558  -5.011 5.53e-07 ***
indUni6     -0.001651   0.004558  -0.362 0.717143
indUni7      0.007935   0.004558   1.741 0.081707 .
indUni8      0.003373   0.004558   0.740 0.459300
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.1019 on 7992 degrees of freedom
Multiple R-squared:  0.01439,	Adjusted R-squared:  0.01353
F-statistic: 16.67 on 7 and 7992 DF,  p-value: < 2.2e-16


We see that R has handled the identifiability problem using “treatment contrasts”, dropping the fixed effect for the first university, so that the intercept actually represents the mean value for the first University, and the effects for the other Univeristies represent the differences from the first University. If we would prefer to impose a sum constraint, then we can switch to sum contrasts with

options(contrasts=rep("contr.sum",2))


and then re-fit the model.

> mods=lm(values~ind,data=Data)
> summary(mods)

Call:
lm(formula = values ~ ind, data = Data)

Residuals:
Min       1Q   Median       3Q      Max
-0.36846 -0.06778 -0.00069  0.06910  0.38219

Coefficients:
Estimate Std. Error  t value Pr(>|t|)
(Intercept)  3.0986991  0.0011394 2719.558  < 2e-16 ***
ind1         0.0023687  0.0030146    0.786 0.432048
ind2        -0.0041477  0.0030146   -1.376 0.168905
ind3        -0.0147997  0.0030146   -4.909 9.32e-07 ***
ind4         0.0202851  0.0030146    6.729 1.83e-11 ***
ind5        -0.0204693  0.0030146   -6.790 1.20e-11 ***
ind6         0.0007175  0.0030146    0.238 0.811889
ind7         0.0103039  0.0030146    3.418 0.000634 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.1019 on 7992 degrees of freedom
Multiple R-squared:  0.01439,	Adjusted R-squared:  0.01353
F-statistic: 16.67 on 7 and 7992 DF,  p-value: < 2.2e-16


This has 7 degrees of freedom for the effects, as before, but ensures that the 8 effects sum to precisely zero. This is arguably more interpretable in this case.

### Bayesian analysis

We will now analyse the simulated data from a Bayesian perspective, using JAGS.

#### Fixed effects

All parameters in Bayesian models are uncertain, and therefore random, so there is much confusion regarding the difference between “fixed” and “random” effects in a Bayesian context. For “fixed” effects, our prior captures the idea that we sample the effects independently from a “fixed” (typically vague) prior distribution. We could simply code this up and fit it in JAGS as follows.

require(rjags)
n=dim(X)[1]
p=dim(X)[2]
data=list(X=X,n=n,p=p)
init=list(mu=2,tau=1)
modelstring="
model {
for (j in 1:p) {
theta[j]~dnorm(0,0.0001)
for (i in 1:n) {
X[i,j]~dnorm(mu+theta[j],tau)
}
}
mu~dnorm(0,0.0001)
tau~dgamma(1,0.0001)
}
"
model=jags.model(textConnection(modelstring),data=data,inits=init)
update(model,n.iter=1000)
output=coda.samples(model=model,variable.names=c("mu","tau","theta"),n.iter=100000,thin=10)
print(summary(output))
plot(output)
autocorr.plot(output)
pairs(as.matrix(output))
crosscorr.plot(output)


On running the code we can clearly see that this naive approach leads to high posterior correlation between the mean and the effects, due to the fundamental lack of identifiability of the model. This also leads to MCMC mixing problems, but it is important to understand that this computational issue is conceptually entirely separate from the fundamental statisticial identifiability issue. Even if we could avoid MCMC entirely, the identifiability issue would remain.

A quick fix for the identifiability issue is to use “treatment contrasts”, just as for the frequentist model. We can implement that as follows.

data=list(X=X,n=n,p=p)
init=list(mu=2,tau=1)
modelstring="
model {
for (j in 1:p) {
for (i in 1:n) {
X[i,j]~dnorm(mu+theta[j],tau)
}
}
theta[1]<-0
for (j in 2:p) {
theta[j]~dnorm(0,0.0001)
}
mu~dnorm(0,0.0001)
tau~dgamma(1,0.0001)
}
"
model=jags.model(textConnection(modelstring),data=data,inits=init)
update(model,n.iter=1000)
output=coda.samples(model=model,variable.names=c("mu","tau","theta"),n.iter=100000,thin=10)
print(summary(output))
plot(output)
autocorr.plot(output)
pairs(as.matrix(output))
crosscorr.plot(output)


Running this we see that the model now works perfectly well, mixes nicely, and gives sensible inferences for the treatment effects.

Another source of confusion for models of this type is data formating and indexing in JAGS models. For our balanced data there was not problem passing in data to JAGS as a matrix and specifying the model using nested loops. However, for unbalanced designs this is not necessarily so convenient, and so then it can be helpful to specify the model based on two-column data, as we would use for fitting using lm(). This is illustrated with the following model specification, which is exactly equivalent to the previous model, and should give identical (up to Monte Carlo error) results.

N=n*p
data=list(y=Data$values,g=Data$ind,N=N,p=p)
init=list(mu=2,tau=1)
modelstring="
model {
for (i in 1:N) {
y[i]~dnorm(mu+theta[g[i]],tau)
}
theta[1]<-0
for (j in 2:p) {
theta[j]~dnorm(0,0.0001)
}
mu~dnorm(0,0.0001)
tau~dgamma(1,0.0001)
}
"
model=jags.model(textConnection(modelstring),data=data,inits=init)
update(model,n.iter=1000)
output=coda.samples(model=model,variable.names=c("mu","tau","theta"),n.iter=100000,thin=10)
print(summary(output))
plot(output)


As suggested above, this indexing scheme is much more convenient for unbalanced data, and hence widely used. However, since our data is balanced here, we will revert to the matrix approach for the remainder of the post.

One final thing to consider before moving on to random effects is the sum-contrast model. We can implement this in various ways, but I’ve tried to encode it for maximum clarity below, imposing the sum-to-zero constraint via the final effect.

data=list(X=X,n=n,p=p)
init=list(mu=2,tau=1)
modelstring="
model {
for (j in 1:p) {
for (i in 1:n) {
X[i,j]~dnorm(mu+theta[j],tau)
}
}
for (j in 1:(p-1)) {
theta[j]~dnorm(0,0.0001)
}
theta[p] <- -sum(theta[1:(p-1)])
mu~dnorm(0,0.0001)
tau~dgamma(1,0.0001)
}
"
model=jags.model(textConnection(modelstring),data=data,inits=init)
update(model,n.iter=1000)
output=coda.samples(model=model,variable.names=c("mu","tau","theta"),n.iter=100000,thin=10)
print(summary(output))
plot(output)


Again, this works perfectly well and gives similar results to the frequentist analysis.

#### Random effects

The key difference between fixed and random effects in a Bayesian framework is that random effects are not independent, being drawn from a distribution with parameters which are not fixed. Essentially, there is another level of hierarchy involved in the specification of the random effects. This is best illustrated by example. A random effects model for this problem is given below.

data=list(X=X,n=n,p=p)
init=list(mu=2,tau=1)
modelstring="
model {
for (j in 1:p) {
theta[j]~dnorm(0,taut)
for (i in 1:n) {
X[i,j]~dnorm(mu+theta[j],tau)
}
}
mu~dnorm(0,0.0001)
tau~dgamma(1,0.0001)
taut~dgamma(1,0.0001)
}
"
model=jags.model(textConnection(modelstring),data=data,inits=init)
update(model,n.iter=1000)
output=coda.samples(model=model,variable.names=c("mu","tau","taut","theta"),n.iter=100000,thin=10)
print(summary(output))
plot(output)


The only difference between this and our first naive attempt at a Bayesian fixed effects model is that we have put a gamma prior on the precision of the effect. Note that this model now runs and fits perfectly well, with reasonable mixing, and gives sensible parameter inferences. Although the effects here are not constrained to sum-to-zero, like in the case of sum contrasts for a fixed effects model, the prior encourages shrinkage towards zero, and so the random effect distribution can be thought of as a kind of soft version of a hard sum-to-zero constraint. From a predictive perspective, this model is much more powerful. In particular, using a random effects model, we can make strong predictions for unobserved groups (eg. a ninth University), with sensible prediction intervals based on our inferred understanding of how similar different universities are. Using a fixed effects model this isn’t really possible. Even for a Bayesian version of a fixed effects model using proper (but vague) priors, prediction intervals for unobserved groups are not really sensible.

Since we have used simulated data here, we can compare the estimated random effects with the true effects generated during the simulation.

> apply(as.matrix(output),2,mean)
mu           tau          taut      theta[1]      theta[2]
3.098813e+00  9.627110e+01  7.015976e+03  2.086581e-03 -3.935511e-03
theta[3]      theta[4]      theta[5]      theta[6]      theta[7]
-1.389099e-02  1.881528e-02 -1.921854e-02  5.640306e-04  9.529532e-03
theta[8]
5.227518e-03
> RE
[1]  0.002637034 -0.008294518 -0.014616348  0.016839902 -0.015443243
[6] -0.001908871  0.010162117  0.005471262


We see that the Bayesian random effects model has done an excellent job of estimation. If we wished, we could relax the assumption of common variance across the groups by making tau a vector indexed by j, though there is not much point in persuing this here, since we know that the groups do all have the same variance.

#### Strong subjective priors

The above is the usual story regarding fixed and random effects in Bayesian inference. I hope this is reasonably clear, so really I should quit while I’m ahead… However, the issues are really a bit more subtle than I’ve suggested. The inferred precision of the random effects was around 7,000, so now lets re-run the original, naive, “fixed effects” model with a strong subjective Bayesian prior on the distribution of the effects.

data=list(X=X,n=n,p=p)
init=list(mu=2,tau=1)
modelstring="
model {
for (j in 1:p) {
theta[j]~dnorm(0,7000)
for (i in 1:n) {
X[i,j]~dnorm(mu+theta[j],tau)
}
}
mu~dnorm(0,0.0001)
tau~dgamma(1,0.0001)
}
"
model=jags.model(textConnection(modelstring),data=data,inits=init)
update(model,n.iter=1000)
output=coda.samples(model=model,variable.names=c("mu","tau","theta"),n.iter=100000,thin=10)
print(summary(output))
plot(output)


This model also runs perfectly well and gives sensible inferences, despite the fact that the effects are iid from a fixed distribution and there is no hard constraint on the effects. Similarly, we can make sensible predictions, together with appropriate prediction intervals, for an unobserved group. So it isn’t so much the fact that the effects are coupled via an extra level of hierarchy that makes things work. It’s really the fact that the effects are sensibly distributed and not just sampled directly from a vague prior. So for “real” subjective Bayesians the line between fixed and random effects is actually very blurred indeed…