Calling R from Scala sbt projects using rscala

Overview

In the previous post I showed how the rscala package (which has replaced the jvmr package) can be used to call Scala code from within R. In this post I will show how to call R from Scala code. I have previously described how to do this using jvmr. This post is really just an update to show how things work with rscala.

Since I’m focusing here on Scala sbt projects, I’m assuming that sbt is installed, in addition to rscala (described in the previous post). The only “trick” required for calling back to R from Scala is telling sbt where the rscala jar file is located. You can find the location from the R console as illustrated by the following session:

> library(rscala)
> rscala::rscalaJar("2.11")
[1] "/home/ndjw1/R/x86_64-pc-linux-gnu-library/3.2/rscala/java/rscala_2.11-1.0.6.jar"

This location (which will obviously be different for you) can then be added in to your sbt classpath by adding the following line to your build.sbt file:

unmanagedJars in Compile += file("/home/ndjw1/R/x86_64-pc-linux-gnu-library/3.2/rscala/java/rscala_2.11-1.0.6.jar")

Once this is done, calling out to R from your Scala sbt project can be carried out as described in the rscala documentation. For completeness, a working example is given below.

Example

In this example I will use Scala to simulate some data consistent with a Poisson regression model, and then push the data to R to fit it using the R function glm(), and then pull back the fitted regression coefficients into Scala. This is obviously a very artificial example, but the point is to show how it is possible to call back to R for some statistical procedure that may be “missing” from Scala.

The dependencies for this project are described in the file build.sbt

name := "rscala test"

version := "0.1"

scalacOptions ++= Seq("-unchecked", "-deprecation", "-feature")

libraryDependencies  ++= Seq(
            "org.scalanlp" %% "breeze" % "0.10",
            "org.scalanlp" %% "breeze-natives" % "0.10"
)

resolvers ++= Seq(
            "Sonatype Snapshots" at "https://oss.sonatype.org/content/repositories/snapshots/",
            "Sonatype Releases" at "https://oss.sonatype.org/content/repositories/releases/"
)

unmanagedJars in Compile += file("/home/ndjw1/R/x86_64-pc-linux-gnu-library/3.2/rscala/java/rscala_2.11-1.0.6.jar")

scalaVersion := "2.11.6"

The complete Scala program is contained in the file PoisReg.scala

import org.ddahl.rscala.callback._
import breeze.stats.distributions._
import breeze.linalg._

object ScalaToRTest {

  def main(args: Array[String]) = {

    // first simulate some data consistent with a Poisson regression model
    val x = Uniform(50,60).sample(1000)
    val eta = x map { xi => (xi * 0.1) - 3 }
    val mu = eta map { math.exp(_) }
    val y = mu map { Poisson(_).draw }
    
    // call to R to fit the Poission regression model
    val R = RClient() // initialise an R interpreter
    R.x=x.toArray // send x to R
    R.y=y.toArray // send y to R
    R.eval("mod <- glm(y~x,family=poisson())") // fit the model in R
    // pull the fitted coefficents back into scala
    val beta = DenseVector[Double](R.evalD1("mod$coefficients"))

    // print the fitted coefficents
    println(beta)

  }

}

If these two files are put in an empty directory, the code can be compiled and run by typing sbt run from the command prompt in the relevant directory. The commented code should be self-explanatory, but see the rscala documentation for further details. In particular, the rscala scaladoc is useful.

Calling Scala code from R using rscala

Introduction

In a previous post I looked at how to call Scala code from R using a CRAN package called jvmr. This package now seems to have been replaced by a new package called rscala. Like the old package, it requires a pre-existing Java installation. Unlike the old package, however, it no longer depends on rJava, which may simplify some installations. The rscala package is well documented, with a reference manual and a draft paper. In this post I will concentrate on the issue of calling sbt-based projects with dependencies on external libraries (such as breeze).

On a system with Java installed, it should be possible to install the rscala package with a simple

install.packages("rscala")

from the R command prompt. Calling

library(rscala)

will check that it has worked. The package will do a sensible search for a Scala installation and use it if it can find one. If it can’t find one (or can only find an installation older than 2.10.x), it will fail. In this case you can download and install a Scala installation specifically for rscala using the command

rscala::scalaInstall()

This option is likely to be attractive to sbt (or IDE) users who don’t like to rely on a system-wide scala installation.

A Gibbs sampler in Scala using Breeze

For illustration I’m going to use a Scala implementation of a Gibbs sampler. The Scala code, gibbs.scala is given below:

package gibbs

object Gibbs {

    import scala.annotation.tailrec
    import scala.math.sqrt
    import breeze.stats.distributions.{Gamma,Gaussian}

    case class State(x: Double, y: Double) {
      override def toString: String = x.toString + " , " + y + "\n"
    }

    def nextIter(s: State): State = {
      val newX = Gamma(3.0, 1.0/((s.y)*(s.y)+4.0)).draw
      State(newX, Gaussian(1.0/(newX+1), 1.0/sqrt(2*newX+2)).draw)
    }

    @tailrec def nextThinnedIter(s: State,left: Int): State =
      if (left==0) s else nextThinnedIter(nextIter(s),left-1)

    def genIters(s: State, stop: Int, thin: Int): List[State] = {
      @tailrec def go(s: State, left: Int, acc: List[State]): List[State] =
        if (left>0)
          go(nextThinnedIter(s,thin), left-1, s::acc)
          else acc
      go(s,stop,Nil).reverse
    }

    def main(args: Array[String]) = {
      if (args.length != 3) {
        println("Usage: sbt \"run <outFile> <iters> <thin>\"")
        sys.exit(1)
      } else {
        val outF=args(0)
        val iters=args(1).toInt
        val thin=args(2).toInt
        val out = genIters(State(0.0,0.0),iters,thin)
        val s = new java.io.FileWriter(outF)
        s.write("x , y\n")
        out map { it => s.write(it.toString) }
        s.close
      }
    }

}

This code requires Scala and the Breeze scientific library in order to build. We can specify this in a sbt build file, which should be called build.sbt and placed in the same directory as the Scala code.

name := "gibbs"

version := "0.1"

scalacOptions ++= Seq("-unchecked", "-deprecation", "-feature")

libraryDependencies  ++= Seq(
            "org.scalanlp" %% "breeze" % "0.10",
            "org.scalanlp" %% "breeze-natives" % "0.10"
)

resolvers ++= Seq(
            "Sonatype Snapshots" at "https://oss.sonatype.org/content/repositories/snapshots/",
            "Sonatype Releases" at "https://oss.sonatype.org/content/repositories/releases/"
)

scalaVersion := "2.11.6"

Now, from a system command prompt in the directory where the files are situated, it should be possible to download all dependencies and compile and run the code with a simple

sbt "run output.csv 50000 1000"

sbt magically manages all of the dependencies for us so that we don’t have to worry about them. However, for calling from R, it may be desirable to run the code without running sbt. There are several ways to achieve this, but the simplest is to build an “assembly jar” or “fat jar”, which is a Java byte-code file containing all code and libraries required in order to run the code on any system with a Java installation.

To build an assembly jar first create a subdirectory called project (the name matters), an in it place two files. The first should be called assembly.sbt, and should contain the line

addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.13.0")

Since the version of the assembly tool can depend on the version of sbt, it is also best to fix the version of sbt being used by creating another file in the project directory called build.properties, which should contain the line

sbt.version=0.13.7

Now return to the parent directory and run

sbt assembly

If this works, it should create a fat jar target/scala-2.11/gibbs-assembly-0.1.jar. You can check it works by running

java -jar target/scala-2.11/gibbs-assembly-0.1.jar output.csv 10000 10

Assuming that it does, you are now ready to try running the code from within R.

Calling via R system calls

Since this code takes a relatively long time to run, calling it from R via simple system calls isn’t a particularly terrible idea. For example, we can do this from the R command prompt with the following commands

system("java -jar target/scala-2.11/gibbs-assembly-0.1.jar output.csv 50000 1000")
out=read.csv("output.csv")
library(smfsb)
mcmcSummary(out,rows=2)

This works fine, but is a bit clunky. Tighter integration between R and Scala would be useful, which is where rscala comes in.

Calling assembly Scala projects via rscala

rscala provides a very simple way to embed a Scala interpreter within an R session, to be able to execute Scala expressions from R and to have the results returned back to the R session for further processing. The main issue with using this in practice is managing dependencies on external libraries and setting the Scala classpath correctly. By using an assembly jar we can bypass most of these issues, and it becomes trivial to call our Scala code direct from the R interpreter, as the following code illustrates.

library(rscala)
sc=scalaInterpreter("target/scala-2.11/gibbs-assembly-0.1.jar")
sc%~%'import gibbs.Gibbs._'
out=sc%~%'genIters(State(0.0,0.0),50000,1000).toArray.map{s=>Array(s.x,s.y)}'
library(smfsb)
mcmcSummary(out,rows=2)

Here we call the getIters function directly, rather than via the main method. This function returns an immutable List of States. Since R doesn’t understand this, we map it to an Array of Arrays, which R then unpacks into an R matrix for us to store in the matrix out.

Summary

The CRAN package rscala makes it very easy to embed a Scala interpreter within an R session. However, for most non-trivial statistical computing problems, the Scala code will have dependence on external scientific libraries such as Breeze. The standard way to easily manage external dependencies in the Scala ecosystem is sbt. Given an sbt-based Scala project, it is easy to generate an assembly jar in order to initialise the rscala Scala interpreter with the classpath needed to call arbitrary Scala functions. This provides very convenient inter-operability between R and Scala for many statistical computing applications.