MCMC as a Stream

Introduction

This weekend I’ve been preparing some material for my upcoming Scala for statistical computing short course. As part of the course, I thought it would be useful to walk through how to think about and structure MCMC codes, and in particular, how to think about MCMC algorithms as infinite streams of state. This material is reasonably stand-alone, so it seems suitable for a blog post. Complete runnable code for the examples in this post are available from my blog repo.

A simple MH sampler

For this post I will just consider a trivial toy Metropolis algorithm using a Uniform random walk proposal to target a standard normal distribution. I’ve considered this problem before on my blog, so if you aren’t very familiar with Metropolis-Hastings algorithms, you might want to quickly review my post on Metropolis-Hastings MCMC algorithms in R before continuing. At the end of that post, I gave the following R code for the Metropolis sampler:

metrop3<-function(n=1000,eps=0.5) 
{
        vec=vector("numeric", n)
        x=0
        oldll=dnorm(x,log=TRUE)
        vec[1]=x
        for (i in 2:n) {
                can=x+runif(1,-eps,eps)
                loglik=dnorm(can,log=TRUE)
                loga=loglik-oldll
                if (log(runif(1)) < loga) { 
                        x=can
                        oldll=loglik
                        }
                vec[i]=x
        }
        vec
}

I will begin this post with a fairly direct translation of this algorithm into Scala:

def metrop1(n: Int = 1000, eps: Double = 0.5): DenseVector[Double] = {
    val vec = DenseVector.fill(n)(0.0)
    var x = 0.0
    var oldll = Gaussian(0.0, 1.0).logPdf(x)
    vec(0) = x
    (1 until n).foreach { i =>
      val can = x + Uniform(-eps, eps).draw
      val loglik = Gaussian(0.0, 1.0).logPdf(can)
      val loga = loglik - oldll
      if (math.log(Uniform(0.0, 1.0).draw) < loga) {
        x = can
        oldll = loglik
      }
      vec(i) = x
    }
    vec
}

This code works, and is reasonably fast and efficient, but there are several issues with it from a functional programmers perspective. One issue is that we have committed to storing all MCMC output in RAM in a DenseVector. This probably isn’t an issue here, but for some big problems we might prefer to not store the full set of states, but to just print the states to (say) the console, for possible re-direction to a file. It is easy enough to modify the code to do this:

def metrop2(n: Int = 1000, eps: Double = 0.5): Unit = {
    var x = 0.0
    var oldll = Gaussian(0.0, 1.0).logPdf(x)
    (1 to n).foreach { i =>
      val can = x + Uniform(-eps, eps).draw
      val loglik = Gaussian(0.0, 1.0).logPdf(can)
      val loga = loglik - oldll
      if (math.log(Uniform(0.0, 1.0).draw) < loga) {
        x = can
        oldll = loglik
      }
      println(x)
    }
}

But now we have two version of the algorithm. One for storing results locally, and one for streaming results to the console. This is clearly unsatisfactory, but we shall return to this issue shortly. Another issue that will jump out at functional programmers is the reliance on mutable variables for storing the state and old likelihood. Let’s fix that now by re-writing the algorithm as a tail-recursion.

@tailrec
def metrop3(n: Int = 1000, eps: Double = 0.5, x: Double = 0.0, oldll: Double = Double.MinValue): Unit = {
    if (n > 0) {
      println(x)
      val can = x + Uniform(-eps, eps).draw
      val loglik = Gaussian(0.0, 1.0).logPdf(can)
      val loga = loglik - oldll
      if (math.log(Uniform(0.0, 1.0).draw) < loga)
        metrop3(n - 1, eps, can, loglik)
      else
        metrop3(n - 1, eps, x, oldll)
    }
  }

This has eliminated the vars, and is just as fast and efficient as the previous version of the code. Note that the @tailrec annotation is optional – it just signals to the compiler that we want it to throw an error if for some reason it cannot eliminate the tail call. However, this is for the print-to-console version of the code. What if we actually want to keep the iterations in RAM for subsequent analysis? We can keep the values in an accumulator, as follows.

@tailrec
def metrop4(n: Int = 1000, eps: Double = 0.5, x: Double = 0.0, oldll: Double = Double.MinValue, acc: List[Double] = Nil): DenseVector[Double] = {
    if (n == 0)
      DenseVector(acc.reverse.toArray)
    else {
      val can = x + Uniform(-eps, eps).draw
      val loglik = Gaussian(0.0, 1.0).logPdf(can)
      val loga = loglik - oldll
      if (math.log(Uniform(0.0, 1.0).draw) < loga)
        metrop4(n - 1, eps, can, loglik, can :: acc)
      else
        metrop4(n - 1, eps, x, oldll, x :: acc)
    }
}

Factoring out the updating logic

This is all fine, but we haven’t yet addressed the issue of having different versions of the code depending on what we want to do with the output. The problem is that we have tied up the logic of advancing the Markov chain with what to do with the output. What we need to do is separate out the code for advancing the state. We can do this by defining a new function.

def newState(x: Double, oldll: Double, eps: Double): (Double, Double) = {
    val can = x + Uniform(-eps, eps).draw
    val loglik = Gaussian(0.0, 1.0).logPdf(can)
    val loga = loglik - oldll
    if (math.log(Uniform(0.0, 1.0).draw) < loga) (can, loglik) else (x, oldll)
}

This function takes as input a current state and associated log likelihood and returns a new state and log likelihood following the execution of one step of a MH algorithm. This separates the concern of state updating from the rest of the code. So now if we want to write code that prints the state, we can write it as

  @tailrec
  def metrop5(n: Int = 1000, eps: Double = 0.5, x: Double = 0.0, oldll: Double = Double.MinValue): Unit = {
    if (n > 0) {
      println(x)
      val ns = newState(x, oldll, eps)
      metrop5(n - 1, eps, ns._1, ns._2)
    }
  }

and if we want to accumulate the set of states visited, we can write that as

  @tailrec
  def metrop6(n: Int = 1000, eps: Double = 0.5, x: Double = 0.0, oldll: Double = Double.MinValue, acc: List[Double] = Nil): DenseVector[Double] = {
    if (n == 0) DenseVector(acc.reverse.toArray) else {
      val ns = newState(x, oldll, eps)
      metrop6(n - 1, eps, ns._1, ns._2, ns._1 :: acc)
    }
  }

Both of these functions call newState to do the real work, and concentrate on what to do with the sequence of states. However, both of these functions repeat the logic of how to iterate over the sequence of states.

MCMC as a stream

Ideally we would like to abstract out the details of how to do state iteration from the code as well. Most functional languages have some concept of a Stream, which represents a (potentially infinite) sequence of states. The Stream can embody the logic of how to perform state iteration, allowing us to abstract that away from our code, as well.

To do this, we will restructure our code slightly so that it more clearly maps old state to new state.

def nextState(eps: Double)(state: (Double, Double)): (Double, Double) = {
    val x = state._1
    val oldll = state._2
    val can = x + Uniform(-eps, eps).draw
    val loglik = Gaussian(0.0, 1.0).logPdf(can)
    val loga = loglik - oldll
    if (math.log(Uniform(0.0, 1.0).draw) < loga) (can, loglik) else (x, oldll)
}

The "real" state of the chain is just x, but if we want to avoid recalculation of the old likelihood, then we need to make this part of the chain’s state. We can use this nextState function in order to construct a Stream.

  def metrop7(eps: Double = 0.5, x: Double = 0.0, oldll: Double = Double.MinValue): Stream[Double] =
    Stream.iterate((x, oldll))(nextState(eps)) map (_._1)

The result of calling this is an infinite stream of states. Obviously it isn’t computed – that would require infinite computation, but it captures the logic of iteration and computation in a Stream, that can be thought of as a lazy List. We can get values out by converting the Stream to a regular collection, being careful to truncate the Stream to one of finite length beforehand! eg. metrop7().drop(1000).take(10000).toArray will do a burn-in of 1,000 iterations followed by a main monitoring run of length 10,000, capturing the results in an Array. Note that metrop7().drop(1000).take(10000) is a Stream, and so nothing is actually computed until the toArray is encountered. Conversely, if printing to console is required, just replace the .toArray with .foreach(println).

The above stream-based approach to MCMC iteration is clean and elegant, and deals nicely with issues like burn-in and thinning (which can be handled similarly). This is how I typically write MCMC codes these days. However, functional programming purists would still have issues with this approach, as it isn’t quite pure functional. The problem is that the code isn’t pure – it has a side-effect, which is to mutate the state of the under-pinning pseudo-random number generator. If the code was pure, calling nextState with the same inputs would always give the same result. Clearly this isn’t the case here, as we have specifically designed the function to be stochastic, returning a randomly sampled value from the desired probability distribution. So nextState represents a function for randomly sampling from a conditional probability distribution.

A pure functional approach

Now, ultimately all code has side-effects, or there would be no point in running it! But in functional programming the desire is to make as much of the code as possible pure, and to push side-effects to the very edges of the code. So it’s fine to have side-effects in your main method, but not buried deep in your code. Here the side-effect is at the very heart of the code, which is why it is potentially an issue.

To keep things as simple as possible, at this point we will stop worrying about carrying forward the old likelihood, and hard-code a value of eps. Generalisation is straightforward. We can make our code pure by instead defining a function which represents the conditional probability distribution itself. For this we use a probability monad, which in Breeze is called Rand. We can couple together such functions using monadic binds (flatMap in Scala), expressed most neatly using for-comprehensions. So we can write our transition kernel as

def kernel(x: Double): Rand[Double] = for {
    innov <- Uniform(-0.5, 0.5)
    can = x + innov
    oldll = Gaussian(0.0, 1.0).logPdf(x)
    loglik = Gaussian(0.0, 1.0).logPdf(can)
    loga = loglik - oldll
    u <- Uniform(0.0, 1.0)
} yield if (math.log(u) < loga) can else x

This is now pure – the same input x will always return the same probability distribution – the conditional distribution of the next state given the current state. We can draw random samples from this distribution if we must, but it’s probably better to work as long as possible with pure functions. So next we need to encapsulate the iteration logic. Breeze has a MarkovChain object which can take kernels of this form and return a stochastic Process object representing the iteration logic, as follows.

MarkovChain(0.0)(kernel).
  steps.
  drop(1000).
  take(10000).
  foreach(println)

The steps method contains the logic of how to advance the state of the chain. But again note that no computation actually takes place until the foreach method is encountered – this is when the sampling occurs and the side-effects happen.

Metropolis-Hastings is a common use-case for Markov chains, so Breeze actually has a helper method built-in that will construct a MH sampler directly from an initial state, a proposal kernel, and a (log) target.

MarkovChain.
  metropolisHastings(0.0, (x: Double) =>
  Uniform(x - 0.5, x + 0.5))(x =>
  Gaussian(0.0, 1.0).logPdf(x)).
  steps.
  drop(1000).
  take(10000).
  toArray

Note that if you are using the MH functionality in Breeze, it is important to make sure that you are using version 0.13 (or later), as I fixed a few issues with the MH code shortly prior to the 0.13 release.

Summary

Viewing MCMC algorithms as infinite streams of state is useful for writing elegant, generic, flexible code. Streams occur everywhere in programming, and so there are lots of libraries for working with them. In this post I used the simple Stream from the Scala standard library, but there are much more powerful and flexible stream libraries for Scala, including fs2 and Akka-streams. But whatever libraries you are using, the fundamental concepts are the same. The most straightforward approach to implementation is to define impure stochastic streams to consume. However, a pure functional approach is also possible, and the Breeze library defines some useful functions to facilitate this approach. I’m still a little bit ambivalent about whether the pure approach is worth the additional cognitive overhead, but it’s certainly very interesting and worth playing with and thinking about the pros and cons.

Complete runnable code for the examples in this post are available from my blog repo.

Advertisements

Calling R from Scala sbt projects

[Update: The jvmr package has been replaced by the rscala package. There is a new version of this post which replaces this one.]

Overview

In previous posts I’ve shown how the jvmr CRAN R package can be used to call Scala sbt projects from R and inline Scala Breeze code in R. In this post I will show how to call to R from a Scala sbt project. This requires that R and the jvmr CRAN R package are installed on your system, as described in the previous posts. Since I’m focusing here on Scala sbt projects, I’m also assuming that sbt is installed.

The only “trick” required for calling back to R from Scala is telling sbt where the jvmr jar file is located. You can find the location from the R console as illustrated by the following session:

&gt; library(jvmr)
&gt; .jvmr.jar
[1] "/home/ndjw1/R/x86_64-pc-linux-gnu-library/3.1/jvmr/java/jvmr_2.11-2.11.2.1.jar"

This location (which will obviously be different for you) can then be added in to your sbt classpath by adding the following line to your build.sbt file:

unmanagedJars in Compile += file("/home/ndjw1/R/x86_64-pc-linux-gnu-library/3.1/jvmr/java/jvmr_2.11-2.11.2.1.jar")

Once this is done, calling out to R from your Scala sbt project can be carried out as described in the jvmr documentation. For completeness, a working example is given below.

Example

In this example I will use Scala to simulate some data consistent with a Poisson regression model, and then push the data to R to fit it using the R function glm(), and then pull back the fitted regression coefficients into Scala. This is obviously a very artificial example, but the point is to show how it is possible to call back to R for some statistical procedure that may be “missing” from Scala.

The dependencies for this project are described in the file build.sbt

name := "jvmr test"

version := "0.1"

scalacOptions ++= Seq("-unchecked", "-deprecation", "-feature")

libraryDependencies  ++= Seq(
            "org.scalanlp" %% "breeze" % "0.10",
            "org.scalanlp" %% "breeze-natives" % "0.10"
)

resolvers ++= Seq(
            "Sonatype Snapshots" at "https://oss.sonatype.org/content/repositories/snapshots/",
            "Sonatype Releases" at "https://oss.sonatype.org/content/repositories/releases/"
)

unmanagedJars in Compile += file("/home/ndjw1/R/x86_64-pc-linux-gnu-library/3.1/jvmr/java/jvmr_2.11-2.11.2.1.jar")

scalaVersion := "2.11.2"

The complete Scala program is contained in the file PoisReg.scala

import org.ddahl.jvmr.RInScala
import breeze.stats.distributions._
import breeze.linalg._

object ScalaToRTest {

  def main(args: Array[String]) = {

    // first simulate some data consistent with a Poisson regression model
    val x = Uniform(50,60).sample(1000)
    val eta = x map { xi =&gt; (xi * 0.1) - 3 }
    val mu = eta map { math.exp(_) }
    val y = mu map { Poisson(_).draw }
    
    // call to R to fit the Poission regression model
    val R = RInScala() // initialise an R interpreter
    R.x=x.toArray // send x to R
    R.y=y.toArray // send y to R
    R.eval("mod &lt;- glm(y~x,family=poisson())") // fit the model in R
    // pull the fitted coefficents back into scala
    val beta = DenseVector[Double](R.toVector[Double]("mod$coefficients"))

    // print the fitted coefficents
    println(beta)

  }

}

If these two files are put in an empty directory, the code can be compiled and run by typing sbt run from the command prompt in the relevant directory. The commented code should be self-explanatory, but see the jvmr documentation for further details.

Inlining Scala Breeze code in R using jvmr and sbt

[Update: The CRAN package “jvmr” has been replaced by a new package “rscala”. Rather than completely re-write this post, I’ve just created a github gist containing a new function, breezeInterpreter(), which works similarly to the function breezeInit() in this post. Usage information is given at the top of the gist.]

Introduction

In the previous post I showed how to call Scala code from R using sbt and jvmr. The approach described in that post is the one I would recommend for any non-trivial piece of Scala code – mixing up code from different languages in the same source code file is not a good strategy in general. That said, for very small snippets of code, it can sometimes be convenient to inline Scala code directly into an R source code file. The canonical example of this is a computationally intensive algorithm being prototyped in R which has a slow inner loop. If the inner loop can be recoded in a few lines of Scala, it would be nice to just inline this directly into the R code without having to create a separate Scala project. The CRAN package jvmr provides a very simple and straightforward way to do this. However, as discussed in the last post, most Scala code for statistical computing (even short and simple code) is likely to rely on Breeze for special functions, probability distributions, non-uniform random number generation, numerical linear algebra, etc. In this post we will see how to use sbt in order to make sure that the Breeze library and all of its dependencies are downloaded and cached, and to provide a correct classpath with which to initialise a jvmr scalaInterpreter session.

Setting up

Configuring your system to be able to inline Scala Breeze code is very easy. You just need to install Java, R and sbt. Then install the CRAN R package jvmr. At this point you have everything you need except for the R function breezeInit, given at the end of this post. I’ve deferred the function to the end of the post as it is a bit ugly, and the details of it are not important. All it does is get sbt to ensure that Breeze is correctly downloaded and cached and then starts a scalaInterpreter with Breeze on the classpath. With this function available, we can use it within an R session as the following R session illustrates:

&gt; b=breezeInit()
&gt; b['import breeze.stats.distributions._']
NULL
&gt; b['Poisson(10).sample(20).toArray']
 [1] 13 14 13 10  7  6 15 14  5 10 14 11 15  8 11 12  6  7  5  7
&gt; summary(b['Gamma(3,2).sample(10000).toArray'])
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
 0.2124  3.4630  5.3310  5.9910  7.8390 28.5200 
&gt; 

So we see that Scala Breeze code can be inlined directly into an R session, and if we are careful about return types, have the results of Scala expressions automatically unpack back into convenient R data structures.

Summary

In this post I have shown how easy it is to inline Scala Breeze code into R using sbt in conjunction with the CRAN package jvmr. This has many potential applications, with the most obvious being the desire to recode slow inner loops from R to Scala. This should give performance quite comparable with alternatives such as Rcpp, with the advantage being that you get to write beautiful, elegant, functional Scala code instead of horrible, ugly, imperative C++ code! 😉

The breezeInit function

The actual breezeInit() function is given below. It is a little ugly, but very simple. It is obviously easy to customise for different libraries and library versions as required. All of the hard work is done by sbt which must be installed and on the default system path in order for this function to work.

breezeInit&lt;-function()
{
  library(jvmr)
  sbtStr="name := \"tmp\"

version := \"0.1\"

libraryDependencies  ++= Seq(
            \"org.scalanlp\" %% \"breeze\" % \"0.10\",
            \"org.scalanlp\" %% \"breeze-natives\" % \"0.10\"
)

resolvers ++= Seq(
            \"Sonatype Snapshots\" at \"https://oss.sonatype.org/content/repositories/snapshots/\",
            \"Sonatype Releases\" at \"https://oss.sonatype.org/content/repositories/releases/\"
)

scalaVersion := \"2.11.2\"

lazy val printClasspath = taskKey[Unit](\"Dump classpath\")

printClasspath := {
  (fullClasspath in Runtime value) foreach {
    e =&gt; print(e.data+\"!\")
  }
}
"
  tmps=file(file.path(tempdir(),"build.sbt"),"w")
  cat(sbtStr,file=tmps)
  close(tmps)
  owd=getwd()
  setwd(tempdir())
  cpstr=system2("sbt","printClasspath",stdout=TRUE)
  cpst=cpstr[length(cpstr)]
  cpsp=strsplit(cpst,"!")[[1]]
  cp=cpsp[2:(length(cpsp)-1)]
  si=scalaInterpreter(cp,use.jvmr.class.path=FALSE)
  setwd(owd)
  si
}

A functional Gibbs sampler in Scala

For many years I’ve had a passing interest in functional programming and languages which support functional programming approaches. I’m also quite interested in MOOCs and their future role in higher education. So I recently signed up for my first on-line course, Functional Programming Principles in Scala, via Coursera. I’m around half way through the course at the time of writing, and I’m enjoying it very much. I knew that I didn’t know much about Scala before starting the course, but during the course I’ve also learned that I didn’t know as much about functional programming as I thought I did, either! 😉 The course itself is very interesting, the assignments are well designed and appropriately challenging, and the web infrastructure to support the course is working well. I suspect I’ll try other on-line courses in the future.

Functional programming emphasises immutability, and discourages imperative programming approaches that use variables that can be modified during run-time. There are many advantages to immutability, especially in the context of parallel and concurrent programming, which is becoming increasingly important as multi-core systems become the norm. I’ve always found functional programming to be intellectually appealing, but have often worried about the practicalities of using functional programming in the context of scientific computing where many algorithms are iterative in nature, and are typically encoded using imperative approaches. The Scala programming language is appealing to me as it supports both imperative and functional styles of programming, as well as object oriented approaches. However, as a result of taking this course I am now determined to pursue functional approaches further, and get more of a feel for how practical they are for scientific computing applications.

For my first experiment, I’m going back to my post describing a Gibbs sampler in various languages. See that post for further details of the algorithm. In that post I did have an example implementation in Scala, which looked like this:

object GibbsSc {
 
    import cern.jet.random.tdouble.engine.DoubleMersenneTwister
    import cern.jet.random.tdouble.Normal
    import cern.jet.random.tdouble.Gamma
    import Math.sqrt
    import java.util.Date
 
    def main(args: Array[String]) {
        val N=50000
        val thin=1000
        val rngEngine=new DoubleMersenneTwister(new Date)
        val rngN=new Normal(0.0,1.0,rngEngine)
        val rngG=new Gamma(1.0,1.0,rngEngine)
        var x=0.0
        var y=0.0
        println("Iter x y")
        for (i <- 0 until N) {
            for (j <- 0 until thin) {
                x=rngG.nextDouble(3.0,y*y+4)
                y=rngN.nextDouble(1.0/(x+1),1.0/sqrt(2*x+2))
            }
            println(i+" "+x+" "+y)
        }
    }
 
}

At the time I wrote that post I knew even less about Scala than I do now, so I created the code by starting from the Java version and removing all of the annoying clutter! 😉 Clearly this code has an imperative style, utilising variables (declared with var) x and y having mutable state that is updated by a nested for loop. This algorithm is typical of the kind I use every day, so if I can’t re-write this in a more functional style, removing all mutable variables from my code, then I’m not going to get very far with functional programming!

In fact it is very easy to re-write this in a more functional style without utilising mutable variables. One possible approach is presented below.

object FunGibbs {
 
    import cern.jet.random.tdouble.engine.DoubleMersenneTwister
    import cern.jet.random.tdouble.Normal
    import cern.jet.random.tdouble.Gamma
    import java.util.Date
    import scala.math.sqrt

    val rngEngine=new DoubleMersenneTwister(new Date)
    val rngN=new Normal(0.0,1.0,rngEngine)
    val rngG=new Gamma(1.0,1.0,rngEngine)

    class State(val x: Double,val y: Double)

    def nextIter(s: State): State = {
         val newX=rngG.nextDouble(3.0,(s.y)*(s.y)+4.0)
         new State(newX, 
              rngN.nextDouble(1.0/(newX+1),1.0/sqrt(2*newX+2)))
    }

    def nextThinnedIter(s: State,left: Int): State = {
       if (left==0) s 
       else nextThinnedIter(nextIter(s),left-1)
    }

    def genIters(s: State,current: Int,stop: Int,thin: Int): State = {
         if (!(current>stop)) {
             println(current+" "+s.x+" "+s.y)
             genIters(nextThinnedIter(s,thin),current+1,stop,thin)
         }
         else s
    }

    def main(args: Array[String]) {
        println("Iter x y")
        genIters(new State(0.0,0.0),1,50000,1000)
     }

}

Although it is a few lines longer, it is a fairly clean implementation, and doesn’t look like a hack. Like many functional programs, this one makes extensive use of recursion. This is one of the things that has always concerned me about functional programming – many scientific computing applications involve lots of iteration, and that can potentially translate into very deep recursion. The above program has an apparent recursion depth of 50 million! However, it runs fine without crashing despite the fact that most programming languages will crash out with a stack overflow with recursion depths of more than a couple of thousand. So why doesn’t this crash? It runs fine because the recursion I used is a special form of recursion known as a tail call. Most functional (and some imperative) programming languages automatically perform tail call elimination which essentially turns the tail call into an iteration which runs very fast without creating new stack frames. In fact, this functional version of the code runs at roughly the same speed as the iterative version I presented first (perhaps just a few percent slower – I haven’t done careful timings), and runs well within a factor of 2 of imperative C code. So actually this seems perfectly practical so far, and I’m looking forward to experimenting more with functional programming approaches to statistical computation over the coming months…

Calling Java code from R

Introduction

In the previous post I looked at some simple methods for calling C code from R using a simple Gibbs sampler as the motivating example. In this post we will look again at the same Gibbs sampler, but now implemented in Java, and look at a couple of options for calling that code from an R session.

Stand-alone Java code

Below is some Java code for implementing the bivariate Gibbs sampler discussed previously. It relies on Parallel COLT, which must be installed and in the Java CLASSPATH in order to follow the examples.

import java.util.*;
import cern.jet.random.tdouble.*;
import cern.jet.random.tdouble.engine.*;

class Gibbs
{

    public static void main(String[] arg)
    {
	if (arg.length != 3) {
            System.err.println("Usage: java Gibbs <Iters> <Thin> <Seed>");
            System.exit(1);  
        }
	int N=Integer.parseInt(arg[0]);
	int thin=Integer.parseInt(arg[1]);
	int seed=Integer.parseInt(arg[2]);
	DoubleRandomEngine rngEngine=new DoubleMersenneTwister(seed);
	Normal rngN=new Normal(0.0,1.0,rngEngine);
	Gamma rngG=new Gamma(1.0,1.0,rngEngine);
	double x=0,y=0;
	System.out.println("Iter x y");
	for (int i=0;i<N;i++) {
	    for (int j=0;j<thin;j++) {
		x=rngG.nextDouble(3.0,y*y+4);
		y=rngN.nextDouble(1.0/(x+1),1.0/Math.sqrt(x+1));
	    }
	    System.out.println(i+" "+x+" "+y);
	}
    }

}

It can be compiled and run stand-alone from an OS shell with the following commands:

javac Gibbs.java
java Gibbs 10 1000 1

As discussed in the previous post, it is possible to call any command-line program from inside an R session using the system() command. A small wrapper function for conveniently running this code from within R can be written as follows.

gibbs<-function(N=10000,thin=500,
             seed=trunc(runif(1)*1e6),
             exec="Gibbs",
             tmpfile=tempfile())
{
  command=paste("java",exec,N,thin,seed,">",tmpfile)
  system(command)
  read.table(tmpfile,header=TRUE)
}

This can then be run from within an R session with a simple call to gibbs(). Note that a random seed is being generated within R to be passed to the Java code to be used to seed the COLT random number generator used within the Java code. As previously discussed, for many long running codes, this approach can be quite effective, and is clearly very simple. However, there is an overhead associated with the system() call, and also with writing output to disk and then reading it back again.

Using rJava

It is possible to avoid the overheads associated with the above approach by directly calling the Java code from R and having the return values returned directly into the R session from memory. There isn’t really direct support for this within the core R language, but there are couple of different solutions provided by R packages. The simplest and most popular approach seems to be the rJava package. This package can be installed with a simple

install.packages("rJava")

This should “just work” on some OSs (eg. Windows), but may fail on other OSs if R is not aware of the local Java environment. If the installation fails, check the error message carefully for advice/instructions. On most Linux systems, the problem can be fixed by quitting R, then running the following command from the shell

sudo R CMD javareconf

before re-starting R and re-attempting the installation. rJava provides a mechanism for starting a JVM within the running R session, creating objects, calling methods and having method return values returned to R. It is actually much more flexible than the .C() function for C code discussed in the previous post.

In order to use this package for our example, we must first re-factor the code slightly in the following way.

import java.util.*;
import cern.jet.random.tdouble.*;
import cern.jet.random.tdouble.engine.*;

class GibbsR
{

    public static void main(String[] arg)
    {
	if (arg.length != 3) {
            System.err.println("Usage: java GibbsR <Iters> <Thin> <Seed>");
            System.exit(1);  
        }
	int N=Integer.parseInt(arg[0]);
	int thin=Integer.parseInt(arg[1]);
	int seed=Integer.parseInt(arg[2]);
	double[][] mat=gibbs(N,thin,seed);
	System.out.println("Iter x y");
	for (int i=0;i<N;i++) {
	    System.out.println(""+i+" "+mat[0][i]+" "+mat[1][i]);
	}	
    }

    public static double[][] gibbs(int N,int thin,int seed)
    {
	DoubleRandomEngine rngEngine=new DoubleMersenneTwister(seed);
	Normal rngN=new Normal(0.0,1.0,rngEngine);
	Gamma rngG=new Gamma(1.0,1.0,rngEngine);
	double x=0,y=0;
	double[][] mat=new double[2][N];
	for (int i=0;i<N;i++) {
	    for (int j=0;j<thin;j++) {
		x=rngG.nextDouble(3.0,y*y+4);
		y=rngN.nextDouble(1.0/(x+1),1.0/Math.sqrt(x+1));
	    }
	    mat[0][i]=x; mat[1][i]=y;
	}
	return mat;
    }

}

This code can be compiled and run from the command-line just as the previous code could.

javac GibbsR.java
java GibbsR 10 1000 1

However, we have now separated out the code we want to be able to call from R into a static method called gibbs, which runs the Gibbs sampler and stores the result in a 2-dimensional array which is its return value. We can now see how to call this code from within a running R session. We first need to set up the R environment ready to call the code.

library(rJava)
.jinit()
obj=.jnew("GibbsR")

Line 1 loads the package, line 2 starts up the JVM, and line 3 creates a link to the the GibbsR class (in general this is used to create a new Java object of the given type, but here we are using static methods). Java methods are called on Java objects using .jcall(). We can write a simple R function to conveniently call the method as follows.

jgibbs<-function(N=10000,thin=500,seed=trunc(runif(1)*1e6))
{
    result=.jcall(obj,"[[D","gibbs",as.integer(N),as.integer(thin),as.integer(seed))
    mat=sapply(result,.jevalArray)
    mat=cbind(1:N,mat)
    colnames(mat)=c("Iter","x","y")
    mat
}

This can now be called with a simple jgibbs(). The first line of the function body carries out the actual method call. The return type of the method must be explicitly declared – “[[D” means a 2-dimensional array of doubles, using JNI notation. Care must also be taken to coerce the method parameters into the correct type that the Java method expects to receive. .jcall() is generally quite good at unpacking basic Java types into corresponding R types. However, the two dimensional array is here returned as an R list consisting of one-dimensional Java array objects. The unpacking is completed using the subsequent call to jevalArray() using sapply(), before the resulting matrix is tidied up and returned to the R session.

Summary and further reading

We have looked at a couple of very simple methods for calling Java code from an R session. The rJava package is a very flexible mechanism for integrating Java code into R.

I haven’t found a lot of tutorial-level material on the web for the rJava package. However, the package itself has very good documentation associated with it. Start with the information on the rJava home page. From an R session with the rJava package loaded, help(package="rJava") lists the available functions, all of which have associated documentation. ?.jinit, ?.jnew, ?.jcall and ?.jevalArray provide further background and information on the example covered here.

After that, the source code of R packages which use rJava are a useful source of further inspiration – look at the reverse-depends list for rJava in CRAN. In particular, the helloJavaWorld package is a tutorial for how to include Java code in an R package (read the associated vignette).