Calling R from Scala sbt projects using rscala

Overview

In the previous post I showed how the rscala package (which has replaced the jvmr package) can be used to call Scala code from within R. In this post I will show how to call R from Scala code. I have previously described how to do this using jvmr. This post is really just an update to show how things work with rscala.

Since I’m focusing here on Scala sbt projects, I’m assuming that sbt is installed, in addition to rscala (described in the previous post). The only “trick” required for calling back to R from Scala is telling sbt where the rscala jar file is located. You can find the location from the R console as illustrated by the following session:

> library(rscala)
> rscala::rscalaJar("2.11")
[1] "/home/ndjw1/R/x86_64-pc-linux-gnu-library/3.2/rscala/java/rscala_2.11-1.0.6.jar"

This location (which will obviously be different for you) can then be added in to your sbt classpath by adding the following line to your build.sbt file:

unmanagedJars in Compile += file("/home/ndjw1/R/x86_64-pc-linux-gnu-library/3.2/rscala/java/rscala_2.11-1.0.6.jar")

Once this is done, calling out to R from your Scala sbt project can be carried out as described in the rscala documentation. For completeness, a working example is given below.

Example

In this example I will use Scala to simulate some data consistent with a Poisson regression model, and then push the data to R to fit it using the R function glm(), and then pull back the fitted regression coefficients into Scala. This is obviously a very artificial example, but the point is to show how it is possible to call back to R for some statistical procedure that may be “missing” from Scala.

The dependencies for this project are described in the file build.sbt

name := "rscala test"

version := "0.1"

scalacOptions ++= Seq("-unchecked", "-deprecation", "-feature")

libraryDependencies  ++= Seq(
            "org.scalanlp" %% "breeze" % "0.10",
            "org.scalanlp" %% "breeze-natives" % "0.10"
)

resolvers ++= Seq(
            "Sonatype Snapshots" at "https://oss.sonatype.org/content/repositories/snapshots/",
            "Sonatype Releases" at "https://oss.sonatype.org/content/repositories/releases/"
)

unmanagedJars in Compile += file("/home/ndjw1/R/x86_64-pc-linux-gnu-library/3.2/rscala/java/rscala_2.11-1.0.6.jar")

scalaVersion := "2.11.6"

The complete Scala program is contained in the file PoisReg.scala

import org.ddahl.rscala.callback._
import breeze.stats.distributions._
import breeze.linalg._

object ScalaToRTest {

  def main(args: Array[String]) = {

    // first simulate some data consistent with a Poisson regression model
    val x = Uniform(50,60).sample(1000)
    val eta = x map { xi => (xi * 0.1) - 3 }
    val mu = eta map { math.exp(_) }
    val y = mu map { Poisson(_).draw }
    
    // call to R to fit the Poission regression model
    val R = RClient() // initialise an R interpreter
    R.x=x.toArray // send x to R
    R.y=y.toArray // send y to R
    R.eval("mod <- glm(y~x,family=poisson())") // fit the model in R
    // pull the fitted coefficents back into scala
    val beta = DenseVector[Double](R.evalD1("mod$coefficients"))

    // print the fitted coefficents
    println(beta)

  }

}

If these two files are put in an empty directory, the code can be compiled and run by typing sbt run from the command prompt in the relevant directory. The commented code should be self-explanatory, but see the rscala documentation for further details. In particular, the rscala scaladoc is useful.

Calling Scala code from R using rscala

Introduction

In a previous post I looked at how to call Scala code from R using a CRAN package called jvmr. This package now seems to have been replaced by a new package called rscala. Like the old package, it requires a pre-existing Java installation. Unlike the old package, however, it no longer depends on rJava, which may simplify some installations. The rscala package is well documented, with a reference manual and a draft paper. In this post I will concentrate on the issue of calling sbt-based projects with dependencies on external libraries (such as breeze).

On a system with Java installed, it should be possible to install the rscala package with a simple

install.packages("rscala")

from the R command prompt. Calling

library(rscala)

will check that it has worked. The package will do a sensible search for a Scala installation and use it if it can find one. If it can’t find one (or can only find an installation older than 2.10.x), it will fail. In this case you can download and install a Scala installation specifically for rscala using the command

rscala::scalaInstall()

This option is likely to be attractive to sbt (or IDE) users who don’t like to rely on a system-wide scala installation.

A Gibbs sampler in Scala using Breeze

For illustration I’m going to use a Scala implementation of a Gibbs sampler. The Scala code, gibbs.scala is given below:

package gibbs

object Gibbs {

    import scala.annotation.tailrec
    import scala.math.sqrt
    import breeze.stats.distributions.{Gamma,Gaussian}

    case class State(x: Double, y: Double) {
      override def toString: String = x.toString + " , " + y + "\n"
    }

    def nextIter(s: State): State = {
      val newX = Gamma(3.0, 1.0/((s.y)*(s.y)+4.0)).draw
      State(newX, Gaussian(1.0/(newX+1), 1.0/sqrt(2*newX+2)).draw)
    }

    @tailrec def nextThinnedIter(s: State,left: Int): State =
      if (left==0) s else nextThinnedIter(nextIter(s),left-1)

    def genIters(s: State, stop: Int, thin: Int): List[State] = {
      @tailrec def go(s: State, left: Int, acc: List[State]): List[State] =
        if (left>0)
          go(nextThinnedIter(s,thin), left-1, s::acc)
          else acc
      go(s,stop,Nil).reverse
    }

    def main(args: Array[String]) = {
      if (args.length != 3) {
        println("Usage: sbt \"run <outFile> <iters> <thin>\"")
        sys.exit(1)
      } else {
        val outF=args(0)
        val iters=args(1).toInt
        val thin=args(2).toInt
        val out = genIters(State(0.0,0.0),iters,thin)
        val s = new java.io.FileWriter(outF)
        s.write("x , y\n")
        out map { it => s.write(it.toString) }
        s.close
      }
    }

}

This code requires Scala and the Breeze scientific library in order to build. We can specify this in a sbt build file, which should be called build.sbt and placed in the same directory as the Scala code.

name := "gibbs"

version := "0.1"

scalacOptions ++= Seq("-unchecked", "-deprecation", "-feature")

libraryDependencies  ++= Seq(
            "org.scalanlp" %% "breeze" % "0.10",
            "org.scalanlp" %% "breeze-natives" % "0.10"
)

resolvers ++= Seq(
            "Sonatype Snapshots" at "https://oss.sonatype.org/content/repositories/snapshots/",
            "Sonatype Releases" at "https://oss.sonatype.org/content/repositories/releases/"
)

scalaVersion := "2.11.6"

Now, from a system command prompt in the directory where the files are situated, it should be possible to download all dependencies and compile and run the code with a simple

sbt "run output.csv 50000 1000"

sbt magically manages all of the dependencies for us so that we don’t have to worry about them. However, for calling from R, it may be desirable to run the code without running sbt. There are several ways to achieve this, but the simplest is to build an “assembly jar” or “fat jar”, which is a Java byte-code file containing all code and libraries required in order to run the code on any system with a Java installation.

To build an assembly jar first create a subdirectory called project (the name matters), an in it place two files. The first should be called assembly.sbt, and should contain the line

addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.13.0")

Since the version of the assembly tool can depend on the version of sbt, it is also best to fix the version of sbt being used by creating another file in the project directory called build.properties, which should contain the line

sbt.version=0.13.7

Now return to the parent directory and run

sbt assembly

If this works, it should create a fat jar target/scala-2.11/gibbs-assembly-0.1.jar. You can check it works by running

java -jar target/scala-2.11/gibbs-assembly-0.1.jar output.csv 10000 10

Assuming that it does, you are now ready to try running the code from within R.

Calling via R system calls

Since this code takes a relatively long time to run, calling it from R via simple system calls isn’t a particularly terrible idea. For example, we can do this from the R command prompt with the following commands

system("java -jar target/scala-2.11/gibbs-assembly-0.1.jar output.csv 50000 1000")
out=read.csv("output.csv")
library(smfsb)
mcmcSummary(out,rows=2)

This works fine, but is a bit clunky. Tighter integration between R and Scala would be useful, which is where rscala comes in.

Calling assembly Scala projects via rscala

rscala provides a very simple way to embed a Scala interpreter within an R session, to be able to execute Scala expressions from R and to have the results returned back to the R session for further processing. The main issue with using this in practice is managing dependencies on external libraries and setting the Scala classpath correctly. By using an assembly jar we can bypass most of these issues, and it becomes trivial to call our Scala code direct from the R interpreter, as the following code illustrates.

library(rscala)
sc=scalaInterpreter("target/scala-2.11/gibbs-assembly-0.1.jar")
sc%~%'import gibbs.Gibbs._'
out=sc%~%'genIters(State(0.0,0.0),50000,1000).toArray.map{s=>Array(s.x,s.y)}'
library(smfsb)
mcmcSummary(out,rows=2)

Here we call the getIters function directly, rather than via the main method. This function returns an immutable List of States. Since R doesn’t understand this, we map it to an Array of Arrays, which R then unpacks into an R matrix for us to store in the matrix out.

Summary

The CRAN package rscala makes it very easy to embed a Scala interpreter within an R session. However, for most non-trivial statistical computing problems, the Scala code will have dependence on external scientific libraries such as Breeze. The standard way to easily manage external dependencies in the Scala ecosystem is sbt. Given an sbt-based Scala project, it is easy to generate an assembly jar in order to initialise the rscala Scala interpreter with the classpath needed to call arbitrary Scala functions. This provides very convenient inter-operability between R and Scala for many statistical computing applications.

Calling R from Scala sbt projects

[Update: The jvmr package has been replaced by the rscala package. There is a new version of this post which replaces this one.]

Overview

In previous posts I’ve shown how the jvmr CRAN R package can be used to call Scala sbt projects from R and inline Scala Breeze code in R. In this post I will show how to call to R from a Scala sbt project. This requires that R and the jvmr CRAN R package are installed on your system, as described in the previous posts. Since I’m focusing here on Scala sbt projects, I’m also assuming that sbt is installed.

The only “trick” required for calling back to R from Scala is telling sbt where the jvmr jar file is located. You can find the location from the R console as illustrated by the following session:

&gt; library(jvmr)
&gt; .jvmr.jar
[1] "/home/ndjw1/R/x86_64-pc-linux-gnu-library/3.1/jvmr/java/jvmr_2.11-2.11.2.1.jar"

This location (which will obviously be different for you) can then be added in to your sbt classpath by adding the following line to your build.sbt file:

unmanagedJars in Compile += file("/home/ndjw1/R/x86_64-pc-linux-gnu-library/3.1/jvmr/java/jvmr_2.11-2.11.2.1.jar")

Once this is done, calling out to R from your Scala sbt project can be carried out as described in the jvmr documentation. For completeness, a working example is given below.

Example

In this example I will use Scala to simulate some data consistent with a Poisson regression model, and then push the data to R to fit it using the R function glm(), and then pull back the fitted regression coefficients into Scala. This is obviously a very artificial example, but the point is to show how it is possible to call back to R for some statistical procedure that may be “missing” from Scala.

The dependencies for this project are described in the file build.sbt

name := "jvmr test"

version := "0.1"

scalacOptions ++= Seq("-unchecked", "-deprecation", "-feature")

libraryDependencies  ++= Seq(
            "org.scalanlp" %% "breeze" % "0.10",
            "org.scalanlp" %% "breeze-natives" % "0.10"
)

resolvers ++= Seq(
            "Sonatype Snapshots" at "https://oss.sonatype.org/content/repositories/snapshots/",
            "Sonatype Releases" at "https://oss.sonatype.org/content/repositories/releases/"
)

unmanagedJars in Compile += file("/home/ndjw1/R/x86_64-pc-linux-gnu-library/3.1/jvmr/java/jvmr_2.11-2.11.2.1.jar")

scalaVersion := "2.11.2"

The complete Scala program is contained in the file PoisReg.scala

import org.ddahl.jvmr.RInScala
import breeze.stats.distributions._
import breeze.linalg._

object ScalaToRTest {

  def main(args: Array[String]) = {

    // first simulate some data consistent with a Poisson regression model
    val x = Uniform(50,60).sample(1000)
    val eta = x map { xi =&gt; (xi * 0.1) - 3 }
    val mu = eta map { math.exp(_) }
    val y = mu map { Poisson(_).draw }
    
    // call to R to fit the Poission regression model
    val R = RInScala() // initialise an R interpreter
    R.x=x.toArray // send x to R
    R.y=y.toArray // send y to R
    R.eval("mod &lt;- glm(y~x,family=poisson())") // fit the model in R
    // pull the fitted coefficents back into scala
    val beta = DenseVector[Double](R.toVector[Double]("mod$coefficients"))

    // print the fitted coefficents
    println(beta)

  }

}

If these two files are put in an empty directory, the code can be compiled and run by typing sbt run from the command prompt in the relevant directory. The commented code should be self-explanatory, but see the jvmr documentation for further details.

Inlining Scala Breeze code in R using jvmr and sbt

[Update: The CRAN package “jvmr” has been replaced by a new package “rscala”. Rather than completely re-write this post, I’ve just created a github gist containing a new function, breezeInterpreter(), which works similarly to the function breezeInit() in this post. Usage information is given at the top of the gist.]

Introduction

In the previous post I showed how to call Scala code from R using sbt and jvmr. The approach described in that post is the one I would recommend for any non-trivial piece of Scala code – mixing up code from different languages in the same source code file is not a good strategy in general. That said, for very small snippets of code, it can sometimes be convenient to inline Scala code directly into an R source code file. The canonical example of this is a computationally intensive algorithm being prototyped in R which has a slow inner loop. If the inner loop can be recoded in a few lines of Scala, it would be nice to just inline this directly into the R code without having to create a separate Scala project. The CRAN package jvmr provides a very simple and straightforward way to do this. However, as discussed in the last post, most Scala code for statistical computing (even short and simple code) is likely to rely on Breeze for special functions, probability distributions, non-uniform random number generation, numerical linear algebra, etc. In this post we will see how to use sbt in order to make sure that the Breeze library and all of its dependencies are downloaded and cached, and to provide a correct classpath with which to initialise a jvmr scalaInterpreter session.

Setting up

Configuring your system to be able to inline Scala Breeze code is very easy. You just need to install Java, R and sbt. Then install the CRAN R package jvmr. At this point you have everything you need except for the R function breezeInit, given at the end of this post. I’ve deferred the function to the end of the post as it is a bit ugly, and the details of it are not important. All it does is get sbt to ensure that Breeze is correctly downloaded and cached and then starts a scalaInterpreter with Breeze on the classpath. With this function available, we can use it within an R session as the following R session illustrates:

&gt; b=breezeInit()
&gt; b['import breeze.stats.distributions._']
NULL
&gt; b['Poisson(10).sample(20).toArray']
 [1] 13 14 13 10  7  6 15 14  5 10 14 11 15  8 11 12  6  7  5  7
&gt; summary(b['Gamma(3,2).sample(10000).toArray'])
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
 0.2124  3.4630  5.3310  5.9910  7.8390 28.5200 
&gt; 

So we see that Scala Breeze code can be inlined directly into an R session, and if we are careful about return types, have the results of Scala expressions automatically unpack back into convenient R data structures.

Summary

In this post I have shown how easy it is to inline Scala Breeze code into R using sbt in conjunction with the CRAN package jvmr. This has many potential applications, with the most obvious being the desire to recode slow inner loops from R to Scala. This should give performance quite comparable with alternatives such as Rcpp, with the advantage being that you get to write beautiful, elegant, functional Scala code instead of horrible, ugly, imperative C++ code! 😉

The breezeInit function

The actual breezeInit() function is given below. It is a little ugly, but very simple. It is obviously easy to customise for different libraries and library versions as required. All of the hard work is done by sbt which must be installed and on the default system path in order for this function to work.

breezeInit&lt;-function()
{
  library(jvmr)
  sbtStr="name := \"tmp\"

version := \"0.1\"

libraryDependencies  ++= Seq(
            \"org.scalanlp\" %% \"breeze\" % \"0.10\",
            \"org.scalanlp\" %% \"breeze-natives\" % \"0.10\"
)

resolvers ++= Seq(
            \"Sonatype Snapshots\" at \"https://oss.sonatype.org/content/repositories/snapshots/\",
            \"Sonatype Releases\" at \"https://oss.sonatype.org/content/repositories/releases/\"
)

scalaVersion := \"2.11.2\"

lazy val printClasspath = taskKey[Unit](\"Dump classpath\")

printClasspath := {
  (fullClasspath in Runtime value) foreach {
    e =&gt; print(e.data+\"!\")
  }
}
"
  tmps=file(file.path(tempdir(),"build.sbt"),"w")
  cat(sbtStr,file=tmps)
  close(tmps)
  owd=getwd()
  setwd(tempdir())
  cpstr=system2("sbt","printClasspath",stdout=TRUE)
  cpst=cpstr[length(cpstr)]
  cpsp=strsplit(cpst,"!")[[1]]
  cp=cpsp[2:(length(cpsp)-1)]
  si=scalaInterpreter(cp,use.jvmr.class.path=FALSE)
  setwd(owd)
  si
}

Faster Gibbs sampling MCMC from within R

Introduction

This post follows on from the previous post on Gibbs sampling in various languages. In that post a simple Gibbs sampler was implemented in various languages, and speeds were compared. It was seen that R is very slow for iterative simulation algorithms characteristic of MCMC methods such as the Gibbs sampler. Statically typed languages such as C/C++ and Java were seen to be fastest for this type of algorithm. Since many statisticians like to use R for most of their work, there is natural interest in the possibility of extending R by calling simulation algorithms written in other languages. It turns out to be straightforward to call C, C++ and Java from within R, so this post will look at how this can be done, and exactly how fast the different options turn out to be. The post draws heavily on my previous posts on calling C from R and calling Java from R, as well as Dirk Eddelbuettel’s post on calling C++ from R, and it may be helpful to consult these posts for further details.

Languages

R

We will start with the simple pure R version of the Gibbs sampler, and use this as our point of reference for understanding the benefits of re-coding in other languages. The background to the problem was given in the previous post and so won’t be repeated here. The code can be given as follows:

gibbs<-function(N=50000,thin=1000)
{
	mat=matrix(0,ncol=2,nrow=N)
	x=0
	y=0
	for (i in 1:N) {
		for (j in 1:thin) {
			x=rgamma(1,3,y*y+4)
			y=rnorm(1,1/(x+1),1/sqrt(2*x+2))
		}
		mat[i,]=c(x,y)
	}
	names(mat)=c("x","y")
	mat
}

This code works perfectly, but is very slow. It takes 458.9 seconds on my very fast laptop (details given in previous post).

C

Let us now see how we can introduce a new function, gibbsC into R, which works in exactly the same way as gibbs, but actually calls on compiled C code to do all of the work. First we need the C code in a file called gibbs.c:

#include <stdio.h>
#include <math.h>
#include <stdlib.h>
#include <R.h>
#include <Rmath.h>

void gibbs(int *Np,int *thinp,double *xvec,double *yvec)
{
  int i,j;
  int N=*Np,thin=*thinp;
  GetRNGstate();
  double x=0;
  double y=0;
  for (i=0;i<N;i++) {
    for (j=0;j<thin;j++) {
      x=rgamma(3.0,1.0/(y*y+4));
      y=rnorm(1.0/(x+1),1.0/sqrt(2*x+2));
    }
    xvec[i]=x; yvec[i]=y;
  }
  PutRNGstate();
}

This can be compiled with R CMD SHLIB gibbs.c. We can load it into R and wrap it up so that it is easy to use with the following code:

dyn.load(file.path(".",paste("gibbs",.Platform$dynlib.ext,sep="")))
gibbsC<-function(n=50000,thin=1000)
{
  tmp=.C("gibbs",as.integer(n),as.integer(thin),
                x=as.double(1:n),y=as.double(1:n))
  mat=cbind(tmp$x,tmp$y)
  colnames(mat)=c("x","y")
  mat
}

The new function gibbsC works just like gibbs, but takes just 12.1 seconds to run. This is roughly 40 times faster than the pure R version, which is a big deal.

Note that using the R inline package, it is possible to directly inline the C code into the R source code. We can do this with the following R code:

require(inline)
code='
  int i,j;
  int N=*Np,thin=*thinp;
  GetRNGstate();
  double x=0;
  double y=0;
  for (i=0;i<N;i++) {
    for (j=0;j<thin;j++) {
      x=rgamma(3.0,1.0/(y*y+4));
      y=rnorm(1.0/(x+1),1.0/sqrt(2*x+2));
    }
    xvec[i]=x; yvec[i]=y;
  }
  PutRNGstate();'
gibbsCin<-cfunction(sig=signature(Np="integer",thinp="integer",xvec="numeric",yvec="numeric"),body=code,includes="#include <Rmath.h>",language="C",convention=".C")
gibbsCinline<-function(n=50000,thin=1000)
{
  tmp=gibbsCin(n,thin,rep(0,n),rep(0,n))
  mat=cbind(tmp$x,tmp$y)
  colnames(mat)=c("x","y")
  mat
}

This runs at the same speed as the code compiled separately, and is arguably a bit cleaner in this case. Personally I’m not a big fan of inlining code unless it is something really very simple. If there is one thing that we have learned from the murky world of web development, it is that little good comes from mixing up different languages in the same source code file!

C++

We can also inline C++ code into R using the inline and Rcpp packages. The code below originates from Sanjog Misra, and was discussed in the post by Dirk Eddelbuettel mentioned at the start of this post.

require(Rcpp)
require(inline)

gibbscode = '
int N = as<int>(n);
int thn = as<int>(thin);
int i,j;
RNGScope scope;
NumericVector xs(N),ys(N);
double x=0;
double y=0;
for (i=0;i<N;i++) {
  for (j=0;j<thn;j++) {
    x = ::Rf_rgamma(3.0,1.0/(y*y+4));
    y= ::Rf_rnorm(1.0/(x+1),1.0/sqrt(2*x+2));
  }
  xs(i) = x;
  ys(i) = y;
}
return Rcpp::DataFrame::create( Named("x")= xs, Named("y") = ys);
'

RcppGibbsFn <- cxxfunction( signature(n="int", thin = "int"),
                              gibbscode, plugin="Rcpp")

RcppGibbs <- function(N=50000,thin=1000)
{
	RcppGibbsFn(N,thin)
}

This version of the sampler runs in 12.4 seconds, just a little bit slower than the C version.

Java

It is also quite straightforward to call Java code from within R using the rJava package. The following code

import java.util.*;
import cern.jet.random.tdouble.*;
import cern.jet.random.tdouble.engine.*;

class GibbsR
{

    public static double[][] gibbs(int N,int thin,int seed)
    {
	DoubleRandomEngine rngEngine=new DoubleMersenneTwister(seed);
	Normal rngN=new Normal(0.0,1.0,rngEngine);
	Gamma rngG=new Gamma(1.0,1.0,rngEngine);
	double x=0,y=0;
	double[][] mat=new double[2][N];
	for (int i=0;i<N;i++) {
	    for (int j=0;j<thin;j++) {
		x=rngG.nextDouble(3.0,y*y+4);
		y=rngN.nextDouble(1.0/(x+1),1.0/Math.sqrt(2*x+2));
	    }
	    mat[0][i]=x; mat[1][i]=y;
	}
	return mat;
    }

}

can be compiled with javac GibbsR.java (assuming that Parallel COLT is in the classpath), and wrapped up from within an R session with

library(rJava)
.jinit()
obj=.jnew("GibbsR")

gibbsJ<-function(N=50000,thin=1000,seed=trunc(runif(1)*1e6))
{
    result=.jcall(obj,"[[D","gibbs",as.integer(N),as.integer(thin),as.integer(seed))
    mat=sapply(result,.jevalArray)
    colnames(mat)=c("x","y")
    mat
}

This code runs in 10.7 seconds. Yes, that’s correct. Yes, the Java code is faster than both the C and C++ code! This really goes to show that Java is now an excellent option for numerically intensive work such as this. However, before any C/C++ enthusiasts go apoplectic, I should explain why Java turns out to be faster here, as the comparison is not quite fair… In the C and C++ code, use was made of the internal R random number generation routines, which are relatively slow compared to many modern numerical library implementations. In the Java code, I used Parallel COLT for random number generation, as it isn’t straightforward to call the R generators from Java code. It turns out that the COLT generators are faster than the R generators, and that is why Java turns out to be faster here…

C+GSL

Of course we do not have to use the R random number generators within our C code. For example, we could instead call on the GSL generators, using the following code:

#include <stdio.h>
#include <math.h>
#include <stdlib.h>
#include <gsl/gsl_rng.h>
#include <gsl/gsl_randist.h>
#include <R.h>

void gibbsGSL(int *Np,int *thinp,int *seedp,double *xvec,double *yvec)
{
  int i,j;
  int N=*Np,thin=*thinp,seed=*seedp;
  gsl_rng *r = gsl_rng_alloc(gsl_rng_mt19937);
  gsl_rng_set(r,seed);
  double x=0;
  double y=0;
  for (i=0;i<N;i++) {
    for (j=0;j<thin;j++) {
      x=gsl_ran_gamma(r,3.0,1.0/(y*y+4));
      y=1.0/(x+1)+gsl_ran_gaussian(r,1.0/sqrt(2*x+2));
    }
    xvec[i]=x; yvec[i]=y;
  }
}

It can be compiled with R CMD SHLIB -lgsl -lgslcblas gibbsGSL.c, and then called as for the regular C version. This runs in 8.0 seconds, which is noticeably faster than the Java code, but probably not “enough” faster to make it an important factor to consider in language choice.

Summary

In this post I’ve shown that it is relatively straightforward to call code written in C, C++ or Java from within R, and that this can give very significant performance gains relative to pure R code. All of the options give fairly similar performance gains. I showed that in the case of this particular example, the “obvious” Java code is actually slightly faster than the “obvious” C or C++ code, and explained why, and how to make the C version slightly faster by using the GSL. The post by Dirk shows how to call the GSL generators from the C++ version, which I haven’t replicated here.

Calling C code from R

Introduction

In this post I’ll look at how to call compiled C code from an R session. The focus here is on calling C code from R, rather than on extending R using C. Although the two are technically very similar problems, the emphasis is somewhat different. A lot of existing documentation focuses on the latter problem, and this is one of the motivations for writing this post. Fortunately, the problem of calling existing C code from R is a bit simpler than the more general problem of extending R in C.

In a previous post I looked at how to implement a trivial bivariate Gibbs sampler in various languages. It was seen there that the C version ran approximately 60 times faster than the R version. It is therefore often desirable to code up MCMC algorithms in C. However, it is usually very convenient to be able to call such algorithms from inside an R session. There are various ways to do this, ranging from the trivial to very complex. In this post I will look at some of the simpler methods and discuss the pros and cons.

Standalone C code

We will restrict attention to the Gibbs sampler discussed in a previous post. We will focus on the C version of the code. Below is a slightly modified version of the code which includes some command-line arguments that enable some flexibility in how the code is run post-compilation.

#include <stdio.h>
#include <math.h>
#include <stdlib.h>
#include <gsl/gsl_rng.h>
#include <gsl/gsl_randist.h>

int main(int argc, char *argv[])
{
  if (argc!=4) {
    fprintf(stderr,"Usage: %s <Iters> <Thin> <Seed>\n",argv[0]);
    exit(EXIT_FAILURE);
  }
  long N=(long) atoi(argv[1]);
  long thin=(long) atoi(argv[2]);
  long seed=(long) atoi(argv[3]);
  long i,j;
  gsl_rng *r = gsl_rng_alloc(gsl_rng_mt19937);
  gsl_rng_set(r,seed);
  double x=0;
  double y=0;
  printf("Iter x y\n");
  for (i=0;i<N;i++) {
    for (j=0;j<thin;j++) {
      x=gsl_ran_gamma(r,3.0,1.0/(y*y+4));
      y=1.0/(x+1)+gsl_ran_gaussian(r,1.0/sqrt(x+1));
    }
    printf("%ld %f %f\n",i,x,y);
  }
  exit(EXIT_SUCCESS);
}

Assuming a Unix/Linux environment (including a GSL implementation), the above code can be compiled from the Unix shell with a command like:

gcc -O2 -lgsl -lgslcblas standalone.c -o standalone

and run with a command like:

./standalone 10000 500 1 > data.tab

The first command-line argument is the number of iterations required, and the second is the “thin” to be applied to the output. The third argument is the “seed” to be applied to the GSL random number generator (RNG). This allows different (not quite independent – see my post on parallel MCMC for details) runs to be obtained by selecting different seed values. The simplest way to call this code from within an R session is to call this unmodified executable using the R system() command. A small “wrapper” function to do this is given below.

standalone<-function(N=10000,thin=500,
             seed=trunc(runif(1)*1e6),
             exec=file.path(".","standalone"),
             tmpfile=tempfile())
{
  command=paste(exec,N,thin,seed,">",tmpfile)
  system(command)
  read.table(tmpfile,header=TRUE)
}

Note the use of the file.path() and tempfile() R functions in a (probably vain!) attempt to make the code somewhat portable. Just running standalone() from an R session should then return a data frame containing the MCMC output. I gave some commands for analysing this output in a previous post. This approach to calling external code is very simple and crude, and quite generic (it is not specific to C code at all). However, it is very quick and easy to implement, and in many cases quite efficient. There is a considerable computational overhead in executing the system command and parsing output files from disk. However, if the code being called is very computationally intensive and relatively slow (as is typically the case), then this overhead can often be negligible, rendering this approach quite practical.

Building and linking to a shared library

If one is really keen to avoid the overhead of executing an R system command, then it is necessary to compile the required C code into a shared library (or DLL), and link this code into R where it can be called directly via R’s foreign language interface. Below is a version of the previous C code modified to make it appropriate for calling from R.

#include <stdio.h>
#include <math.h>
#include <stdlib.h>
#include <gsl/gsl_rng.h>
#include <gsl/gsl_randist.h>
#include <R.h>

void gibbs(int *Np,int *thinp,int *seedp,double *xvec,double *yvec)
{
  int i,j;
  int N=*Np,thin=*thinp,seed=*seedp;
  gsl_rng *r = gsl_rng_alloc(gsl_rng_mt19937);
  gsl_rng_set(r,seed);
  double x=0;
  double y=0;
  for (i=0;i<N;i++) {
    for (j=0;j<thin;j++) {
      x=gsl_ran_gamma(r,3.0,1.0/(y*y+4));
      y=1.0/(x+1)+gsl_ran_gaussian(r,1.0/sqrt(x+1));
    }
    xvec[i]=x; yvec[i]=y;
  }
}

Note that it is only possible to pass pointers from simple R/C data types, and so all function arguments must be pointers. Also note that there is no return value to the function, and that values are retrieved in R by modifying some of the values pointed to by the pointer arguments. This is the mode of operation imposed by the basic method that R provides for calling C code from R (the .C() function). Note that there are other methods for extending R in C, using the .Call() and .External() functions, but these are beyond the scope of this post. Again assuming a Unix/Linux environment, this code can be compiled into a shared library with a command like:

R CMD SHLIB -lgsl -lgslcblas dynamic.c

It can then be loaded into a running R session with a command like dyn.load("dynamic.so"). Again, if we are attempting to write portable code, we might use a command like:

dyn.load(file.path(".",paste("dynamic",.Platform$dynlib.ext,sep="")))

You can check what dynamic libraries are loaded into the current R session with getLoadedDLLs(). Once the DLL (Dynamic Link Library) is loaded, it can be called using the .C() function. A small wrapper function appropriate in this instance is given below:

dynamic<-function(n=10000,thin=500,seed=trunc(runif(1)*1e6))
{
  tmp=.C("gibbs",as.integer(n),as.integer(thin),
               as.integer(seed),x=as.double(1:n),
                  y=as.double(1:n))
  mat=cbind(1:n,tmp$x,tmp$y) 
  colnames(mat)=c("Iter","x","y")
  mat
}

Note how a random seed is generated in R to be passed to the C code to be used to seed the GSL random generator used within the C code. The code can then be run with a simple call to dynamic() and everything should work OK provided that all of the required libraries are found. This is the simplest way to link C code into R in a way that avoids the overhead associated with a system() call. However, this approach is also not without issues. In particular, the C code relies on the GSL, and more specifically on the random number streams provided by the GSL. These are completely separate from the random number streams used within the R system. In some situations it would make sense to use the same random number streams used within the R session, and to remove the dependence of the C code on the GSL.

Using the R API

The C code discussed in the previous section relies on the GSL only for the generation of (non-uniform) random numbers. Obviously R has its own very sophisticated system for handling random numbers and it is possible to use this system from within externally called C code using the R API. In particular, C versions of functions such as rnorm() and rgamma() can be called in C by including Rmath.h. Below is a version of the C code previously given modified to use the R random number generation routines and to remove all dependence on the GSL.

#include <stdio.h>
#include <math.h>
#include <stdlib.h>
#include <R.h>
#include <Rmath.h>

void gibbsR(int *Np,int *thinp,double *xvec,double *yvec)
{
  int i,j;
  int N=*Np,thin=*thinp;
  GetRNGstate();
  double x=0;
  double y=0;
  for (i=0;i<N;i++) {
    for (j=0;j<thin;j++) {
      x=rgamma(3.0,1.0/(y*y+4));
      y=rnorm(1.0/(x+1),1.0/sqrt(x+1));
    }
    xvec[i]=x; yvec[i]=y;
  }
  PutRNGstate();
}

Note that a call to GetRNGstate() must be made before calling any random number functions and that a call to PutRNGstate() must be called before the function returns control back to R. This code can be compiled with a command like

R CMD SHLIB dynamicR.c

and linked into R with a command like

dyn.load(file.path(".",paste("dynamicR",.Platform$dynlib.ext,sep="")))

An appropriate wrapper for this code is given below:

dynamicR<-function(n=10000,thin=500)
{
  tmp=.C("gibbsR",as.integer(n),as.integer(thin),
                x=as.double(1:n),y=as.double(1:n))
  mat=cbind(1:n,tmp$x,tmp$y) 
  colnames(mat)=c("Iter","x","y")
  mat
}

This code is now slightly simpler, and the lack of dependence on external libraries such as the GSL makes it much easier to integrate into R packages, should this be desired.

Summary and further reading

Foreign language interfaces are a notoriously complex subject and this post has obviously just scratched the surface of the problem. For a few more examples, first see my old computer practicals on Stochastic simulation in R and C. The examples are a bit out of date, but easy to fix. Also see a howto by the Flemish Supercomputing Centre on a similar topic to this one. For more detailed information, see the manual on Writing R extensions, especially the sections on Foreign language interfaces and the R API. I also find Chapter 6 of R Programming for Bioinformatics to be a useful introduction to more complex aspects.

I have also somewhat belatedly re-discovered Charlie Geyer‘s notes on Calling C and Fortran from R, which covers very similar ground to this post. They were probably the unconscious inspiration for this post…