library – Darren Wilkinson's blog

Using EvilPlot with scala-view

EvilPlot

EvilPlot is a new functional data visualisation library for Scala. Although there are several data viz libraries for Scala, this new library has a nice functional API for producing attractive, flexible, compositional plots which can be rendered in JVM applications and in web applications (via Scala.js). For a quick introduction, see this blog post from one of the library’s creators. For further information, see the official documentation and the github repo. For a quick overview of the kinds of plots that the library is capable of generating, see the plot catalog.

The library is designed to produce plots which can be rendered into applications. However, when doing data analysis in the REPL on the JVM, it is often convenient to be able to just pop up a plot in a window on the desktop. EvilPlot doesn’t seem to contain code for on-screen rendering, but the plots can be rendered to a bitmap image. In the previous post I described a small library, scala-view, which renders such images, and image sequences on the desktop. In this post I’ll walk through using scala-view to render EvilPlot plots on-screen.

An interactive session

To follow this session, you just need to run SBT from an empty directory. Just run sbt and paste the following at the SBT prompt:

set libraryDependencies += "com.cibo" %% "evilplot" % "0.2.0"
set libraryDependencies += "com.github.darrenjw" %% "scala-view" % "0.6-SNAPSHOT"
set resolvers += Resolver.bintrayRepo("cibotech", "public")
set resolvers += "Sonatype Snapshots" at "https://oss.sonatype.org/content/repositories/snapshots/"
set scalaVersion := "2.12.4"
set fork := true
console

Displaying a single plot

This will give a Scala REPL prompt. First we need some imports:

import com.cibo.evilplot.plot._
import com.cibo.evilplot.colors._
import com.cibo.evilplot.plot.aesthetics.DefaultTheme._
import com.cibo.evilplot.numeric.Point
import java.awt.Image.SCALE_SMOOTH
import scalaview.Utils._

We can simulate some data an produce a simple line chart:

val data = Seq.tabulate(100) { i =>
  Point(i.toDouble, scala.util.Random.nextDouble())
}
val plot = LinePlot.series(data, "Line graph", HSL(210, 100, 56)).
  xAxis().yAxis().frame().
  xLabel("x").yLabel("y").render()

This plot object contains the rendering instructions, but doesn’t actually produce a plot. We can use scala-view to display it as follows:

scalaview.SfxImageViewer(biResize(plot.asBufferedImage,1000,800,SCALE_SMOOTH))

This will produce a window on screen something like the following:

Don’t close this plot yet, as this will confuse the REPL. Just switch back to the REPL and continue.

Animating a sequence of plots

Sometimes we want to produce a sequence of plots. Let’s now suppose that the data above arises sequentially as a stream, and that we want to produce a sequence of plots with each observation as it arrives. First create a stream of partial datasets and map a function which turns a dataset into a plot to get a stream of images representing the plots. Then pass the stream of images into the viewer to get an animated sequence of plots on-screen:

val dataStream = data.toStream
val cumulStream = dataStream.scanLeft(Nil: List[Point])((l,p) => p :: l).drop(1)
def dataToImage(data: List[Point]) = LinePlot.
  series(data, "Line graph", HSL(210, 100, 56)).
    xAxis().yAxis().frame().
    xLabel("x").yLabel("y").render().asBufferedImage
val plotStream = cumulStream map (d => biResize(dataToImage(d),1000,800,SCALE_SMOOTH))
scalaview.SfxImageViewer.bi(plotStream, 100000, autoStart=true)

Scala-view: Animate streams of images

Introduction

In the previous post I discussed how comonads can be useful for structuring certain kinds of scientific and statistical computations. Two of the examples I gave were concerned with the time-evolution of 2-d images. In that post I used Breeze to animate the sequence of computed images. In this post I want to describe an alternative that is better suited to animating an image sequence.

Scala-view is a small Scala library for animating a Stream of Images on-screen in a separate window managed by your window manager. It works with both ScalaFX Images (recommended) and Scala Swing/AWT BufferedImages (legacy). The stream of images is animated in a window with some simple controls to start and stop the animation, and to turn on and off the saving of image frames to disk (typically for the purpose of turning the image sequence into a movie). An example of what a window might look like is given below.

Ising window

More comprehensive documentation is available from the scala-view github repo, but here I give a quick introduction to the library to outline its capabilities.

A Scala-view tutorial

This brief tutorial gives a quick introduction to using the Scala-view library for viewing a ScalaFX Image Stream. It assumes only that you have SBT installed, and that you run SBT from an empty directory.

An SBT REPL

Start by running SBT from an empty or temporary directory to get an SBT prompt:

$ sbt
>

Now we need to configure SBT to use the Scala-view library, and start a console. From the SBT prompt:

set libraryDependencies += "com.github.darrenjw" %% "scala-view" % "0.5"
set scalaVersion := "2.12.4"
console

The should result in a scala> REPL prompt. We can now use Scala and the Scala-view library interactively.

An example REPL session

You should be able to paste the code snippets below directly into the REPL. You may find :paste mode helpful.

We will replicate the heat equation example from the examples-sfx directory, which is loosely based on the example from my blog post on comonads. We will start by defining a simple parallel Image and corresponding comonadic pointed image PImage type. If you aren’t familiar with comonads, you may find it helpful to read through that post.

import scala.collection.parallel.immutable.ParVector
case class Image[T](w: Int, h: Int, data: ParVector[T]) {
  def apply(x: Int, y: Int): T = data(x * h + y)
  def map[S](f: T => S): Image[S] = Image(w, h, data map f)
  def updated(x: Int, y: Int, value: T): Image[T] =
    Image(w, h, data.updated(x * h + y, value))
}

case class PImage[T](x: Int, y: Int, image: Image[T]) {
  def extract: T = image(x, y)
  def map[S](f: T => S): PImage[S] = PImage(x, y, image map f)
  def coflatMap[S](f: PImage[T] => S): PImage[S] = PImage(
    x, y, Image(image.w, image.h,
      (0 until (image.w * image.h)).toVector.par.map(i => {
        val xx = i / image.h
        val yy = i % image.h
        f(PImage(xx, yy, image))
      })))
  def up: PImage[T] = {
    val py = y - 1
    val ny = if (py >= 0) py else (py + image.h)
    PImage(x, ny, image)
  }
  def down: PImage[T] = {
    val py = y + 1
    val ny = if (py < image.h) py else (py - image.h)
    PImage(x, ny, image)
  }
  def left: PImage[T] = {
    val px = x - 1
    val nx = if (px >= 0) px else (px + image.w)
    PImage(nx, y, image)
  }
  def right: PImage[T] = {
    val px = x + 1
    val nx = if (px < image.w) px else (px - image.w)
    PImage(nx, y, image)
  }
}

We will need a function to convert this image into a ScalaFX WritableImage.

import scalafx.scene.image.WritableImage
import scalafx.scene.paint._
def toSfxI(im: Image[Double]): WritableImage = {
    val wi = new WritableImage(im.w, im.h)
    val pw = wi.pixelWriter
    (0 until im.w) foreach (i =>
      (0 until im.h) foreach (j =>
        pw.setColor(i, j, Color.gray(im(i,j)))
      ))
    wi
  }

We will need a starting image representing the initial condition for the heat equation.

val w = 600
val h = 500
val pim0 = PImage(0, 0, Image(w, h,
  ((0 until w*h).toVector map {i: Int => {
  val x = i / h
  val y = i % h
  0.1*math.cos(0.1*math.sqrt((x*x+y*y))) + 0.1 + 0.8*math.random
  }}).par
))

We can define a kernel associated with the update of a single image pixel based on a single time step of a finite difference solution of the heat equation.

def kernel(pi: PImage[Double]): Double = (2*pi.extract+
  pi.up.extract+pi.down.extract+pi.left.extract+pi.right.extract)/6.0

We can now create a Stream of PImage with

def pims = Stream.iterate(pim0)(_.coflatMap(kernel))

We can turn this into a Stream[WritableImage] with

def sfxis = pims map (im => toSfxI(im.image))

Note that we are essentially finished at this point, but so far everything we have done has been purely functional with no side effects. We haven’t even computed our solution to the heat equation. All we have constructed are lazy infinite streams representing the solution of the heat equation.

Finally, we can render our Stream of Images on screen with

scalaview.SfxImageViewer(sfxis,1e7.toInt)

which has a delay of 1e7 nanoseconds (10 milliseconds) between frames.

This should pop up a window on your display containing the initial image. Click on the Start button to animate the solution of the heat equation. See the API docs for SfxImageViewer for additional options. The ScalaFX API docs may also be useful, especially the docs for Image and WritableImage.

Scala as a platform for statistical computing and data science

There has been a lot of discussion on-line recently about languages for data analysis, statistical computing, and data science more generally. I don’t really want to go into the detail of why I believe that all of the common choices are fundamentally and unfixably flawed – language wars are so unseemly. Instead I want to explain why I’ve been using the Scala programming language recently and why, despite being far from perfect, I personally consider it to be a good language to form a platform for efficient and scalable statistical computing. Obviously, language choice is to some extent a personal preference, implicitly taking into account subjective trade-offs between features different individuals consider to be important. So I’ll start by listing some language/library/ecosystem features that I think are important, and then explain why.

A feature wish list

It should:

be a general purpose language with a sizable user community and an array of general purpose libraries, including good GUI libraries, networking and web frameworks
be free, open-source and platform independent
be fast and efficient
have a good, well-designed library for scientific computing, including non-uniform random number generation and linear algebra
have a strong type system, and be statically typed with good compile-time type checking and type safety
have reasonable type inference
have a REPL for interactive use
have good tool support (including build tools, doc tools, testing tools, and an intelligent IDE)
have excellent support for functional programming, including support for immutability and immutable data structures and “monadic” design
allow imperative programming for those (rare) occasions where it makes sense
be designed with concurrency and parallelism in mind, having excellent language and library support for building really scalable concurrent and parallel applications

The not-very-surprising punch-line is that Scala ticks all of those boxes and that I don’t know of any other languages that do. But before expanding on the above, it is worth noting a couple of (perhaps surprising) omissions. For example:

have excellent data viz capability built-in
have vast numbers of statistical routines in the standard library

The above are points (and there are other similar points) where other languages (for example, R), currently score better than Scala. It is not that these things are not important – indeed, they are highly desirable. But I consider them to be of lesser importance as they are much easier to fix, given a suitable platform, than fixing an unsuitable language and platform. Visualisation is not trivial, but it is not fantastically difficult in a language with excellent GUI libraries. Similarly, most statistical routines are quite straightforward to implement for anyone with reasonable expertise in scientific and statistical computing and numerical linear algebra. These are things that are relatively easy for a community to contribute to. Building a great programming language, on the other hand, is really, really, difficult.

I will now expand briefly on each point in turn.

be a general purpose language with a sizable user community and an array of general purpose libraries, including good GUI libraries, networking and web frameworks

History has demonstrated, time and time again, that domain specific languages (DSLs) are synonymous with idiosyncratic, inconsistent languages that are terrible for anything other than what they were specifically designed for. They can often be great for precisely the thing that they were designed for, but people always want to do other things, and that is when the problems start. For the avoidance of controversy I won’t go into details, but the whole Python versus R thing is a perfect illustration of this general versus specific trade-off. Similarly, although there has been some buzz around another new language recently, which is faster than R and Python, my feeling is that the last thing the world needs right now is Just Unother Language for Indexed Arrays…

In this day-and-age it is vital that statistical code can use a variety of libraries, and communicate with well-designed network libraries and web frameworks, as statistical analysis does not exist in a vacuum. Scala certainly fits the bill here, being used in a large number of important high-profile systems, ensuring a lively, well-motivated ecosystem. There are numerous well-maintained libraries for almost any task. Picking on web frameworks, for example, there are a number of excellent libraries, including Lift and Play. Scala also has the advantage of offering seamless Java integration, for those (increasingly rare) occasions when a native Scala library for the task at hand doesn’t exist.

be free, open-source and platform independent

This hardly needs expanding upon, other than to observe that there are a few well-known commercial software solutions for scientific, statistical and mathematical computing. There are all kinds of problems with using closed proprietary systems, including transparency and reproducibility, but also platform and scalability problems. eg. running code requiring a license server in the cloud. The academic statistical community has largely moved away from commercial software, and I don’t think there is any going back. Scala is open source and runs on the JVM, which is about as platform independent as it is possible to get.

be fast and efficient

Speed and efficiency continue to be important, despite increasing processor speeds. Computationally intensive algorithms are being pushed to ever larger and more complex models and data sets. Compute cycles and memory efficiency really matter, and can’t be ignored. This doesn’t mean that we all have to code in C/C++/Fortran, but we can’t afford to code in languages which are orders of magnitude slower. This will always be a problem. Scala code generally runs well within a factor of 2 of comparable native code – see my Gibbs sampler post for a simple example including timings.

have a good, well-designed library for scientific computing, including non-uniform random number generation and linear algebra

I hesitated about including this in my list of essentials, because it is certainly something that can, in principle, be added to a language at a later date. However, building such libraries is far from trivial, and they need to be well-designed, comprehensive and efficient. For Scala, Breeze is rapidly becoming the standard scientific library, including special functions, non-uniform random number generation and numerical linear algebra. For a data library, there is Saddle, and for a scalable analytics library there is Spark. These libraries certainly don’t cover everything that can be found in R/CRAN, but they provide a fairly solid foundation on which to build.

have a strong type system, and be statically typed with good compile-time type checking and type safety

I love dynamic languages – they are fun and exciting. It is fun to quickly throw together a few functions in a scripting language without worrying about declaring the types of anything. And it is exciting to see the myriad of strange and unanticipated ways your code can crash-and-burn at runtime! 😉 But this excitement soon wears off, and you end up adding lots of boilerplate argument checking code that would not only be much cleaner and simpler in a statically typed language, but would be checked at compile-time, making the static code faster and more efficient. For messing about prototyping, dynamic languages are attractive, but as a solid platform for statistical computing, they really don’t make sense. Scala has a strong type system offering a high degree of compile-time checking, making it a safe and efficient language.

have reasonable type inference

A common issue with statically typed languages is that they lead to verbose code containing many redundant type declarations that the compiler ought to be able to check. This doesn’t just mean more typing – it leads to verbose code that can hide the program logic. Languages with type inference offer the best of both worlds – the safety of static typing without the verbosity. Scala does a satisfactory job here.

have a REPL for interactive use

One thing that dynamic languages have taught us is that it is actually incredibly useful to have a REPL for interactive analysis. This is true generally, but especially so for statistical computing, where human intervention is often desirable. Again, Scala has a nice REPL.

have good tool support (including build tools, doc tools, testing tools, and an intelligent IDE)

Tools matter. Scala has an excellent build tool in the SBT. It has code documentation in the form of scaladoc (similar to javadoc). It has a unit testing framework, and a reasonably intelligent IDE in the form of the Scala IDE (based on Eclipse).

have excellent support for functional programming, including support for immutability and immutable data structures and “monadic” design

I, like many others, am gradually coming to realise that functional programming offers many advantages over other programming styles. In particular, it provides best route to building scalable software, in terms of both program complexity and data size/complexity. Scala has good support for functional programming, including immutable named values, immutable data structures and for-comprehensions. And if off-the-shelf Scala isn’t sufficiently functional already, libraries such as scalaz make it even more so.

allow imperative programming for those (rare) occasions where it makes sense

Although most algorithms in scientific computing are typically conceived of and implemented in an imperative style, I’m increasingly convinced that most can be recast in a pure functional way without significant loss of efficiency, and with significant benefits. That said, there really are some problems that are more efficient to implement in an imperative framework. It is therefore important that the language is not so “pure” functional that this is forbidden. Again, Scala fits the bill.

be designed with concurrency and parallelism in mind, having excellent language and library support for building really scalable concurrent and parallel applications

These days scalability typically means exploiting concurrency and parallelism. In an imperative world this is hard, and libraries such as MPI prove that it is difficult to bolt parallelism on top of a language post-hoc. Check-points, communication overhead, deadlocks and race conditions make it very difficult to build codes that scale well to more than a few processors. Concurrency is more straightforward in functional languages, and this is one of the reasons for the recent resurgence of functional languages and programming. Scala has good concurrency support built-in, and libraries such as Akka make it relatively easy to build truly scalable software.

Summary

The Scala programming language ticks many boxes when it comes to forming a nice solid foundation for building a platform for efficient scalable statistical computing. Although I still use R and Python almost every day, I’m increasingly using Scala for serious algorithm development. In the short term I can interface to my Scala code from R using jvmr, but in the longer term I hope that Scala will become a complete framework for statistics and data science. In a subsequent post I will attempt to give a very brief introduction to Scala and the Breeze numerical library.

Java math libraries and Monte Carlo simulation codes

Java libraries for (non-uniform) random number simulation

Anyone writing serious Monte Carlo (and MCMC) codes relies on having a very good and fast (uniform) random number generator and associated functions for generation of non-uniform random quantities, such as Gaussian, Poisson, Gamma, etc. In a previous post I showed how to write a simple Gibbs sampler in four different languages. In C (and C++) random number generation is easy for most scientists, as the (excellent) GNU Scientific Library (GSL) provides exactly what most people need. But it wasn’t always that way… I remember the days before the GSL, when it was necessary to hunt around on the net for bits of C code to implement different algorithms. Worse, it was often necessary to hunt around for a bit of free FORTRAN code, and compile that with an F77 compiler and figure out how to call it from C. Even in the early Alpha days of the GSL, coverage was patchy, and the API changed often. Bad old days… But those days are long gone, and C programmers no longer have to worry about the problem of random variate generation – they can safely concentrate on developing their interesting new algorithm, and leave the rest to the GSL. Unfortunately for Java programmers, there isn’t yet anything quite comparable to the GSL in Java world.

I pretty much ignored Java until Java 5. Before then, the language was too limited, and the compilers and JVMs were too primitive to really take seriously for numerical work. But since the launch of Java 5 I’ve been starting to pay more interest. The language is now a perfectly reasonable O-O language, and the compilers and JVMs are pretty good. On a lot of benchmarks, Java is really quite comparable to C/C++, and Java is nicer to code, and has a lot of impressive associated technology. So if there was a math library comparable to the GSL, I’d be quite tempted to jump ship to the Java world and start writing all of my Monte Carlo codes in Java. But there isn’t. At least not yet.

When I first started to take Java seriously, the only good math library with good support for non-uniform random number generation was COLT. COLT was, and still is, pretty good. The code is generally well-written, and fast, and the documentation for it is reasonable. However, the structure of the library is very idiosyncratic, the coverage is a bit patchy, and there doesn’t ever seem to have been a proper development community behind it. It seems very much to have been a one-man project, which has long since stagnated. Unsurprisingly then, COLT has been forked. There is now a Parallel COLT project. This project is continuing the development of COLT, adding new features that were missing from COLT, and, as the name suggests, adding concurrency support. Parallel COLT is also good, and is the main library I currently use for random number generation in Java. However, it has obviously inherited all of the idiosyncrasies that COLT had, and still doesn’t seem to have a large and active development community associated with it. There is no doubt that it is an incredibly useful software library, but it still doesn’t really compare to the GSL.

I have watched the emergence of the Apache Commons Math project with great interest (not to be confused with Uncommons Math – another one-man project). I think this project probably has the greatest potential for providing the Java community with their own GSL equivalent. The Commons project has a lot of momentum, the Commons Math project seems to have an active development community, and the structure of the library is more intuitive than that of (Parallel) COLT. However, it is early days, and the library still has patchy coverage and is a bit rough around the edges. It reminds me a lot of the GSL back in its Alpha days. I’d not bothered to even download it until recently, as the random number generation component didn’t include the generation of gamma random quantities – an absolutely essential requirement for me. However, I noticed recently that the latest release (2.2) did include gamma generation, so I decided to download it and try it out. It works, but the generation of gamma random quantities is very slow (around 50 times slower than Parallel COLT). This isn’t a fundamental design flaw of the whole library – generating Gaussian random quantities is quite comparable with other libraries. It’s just that an inversion method has been used for gamma generation. All efficient gamma generators use a neat rejection scheme. In case anyone would like to investigate for themselves, here is a complete program for gamma generation designed to be linked against Parallel COLT:

import java.util.*;
import cern.jet.random.tdouble.*;
import cern.jet.random.tdouble.engine.*;

class GammaPC
{

    public static void main(String[] arg)
    {
	DoubleRandomEngine rngEngine=new DoubleMersenneTwister();
	Gamma rngG=new Gamma(1.0,1.0,rngEngine);
	long N=10000;
	double x=0.0;
	for (int i=0;i<N;i++) {
	    for (int j=0;j<1000;j++) {
		x=rngG.nextDouble(3.0,25.0);
	    }
	    System.out.println(x);
	}
    }
    
}

and here is a complete program designed to be linked against Commons Math:

import java.util.*;
import org.apache.commons.math.*;
import org.apache.commons.math.random.*;

class GammaACM
{

    public static void main(String[] arg) throws MathException
    {
	RandomDataImpl rng=new RandomDataImpl();
	long N=10000;
	double x=0.0;
	for (int i=0;i<N;i++) {
	    for (int j=0;j<1000;j++) {
		x=rng.nextGamma(3.0,1.0/25.0);
	    }
	    System.out.println(x);
	}
    }
    
}

The two codes do the same thing (note that they parameterise the gamma distribution differently). Both programs work (they generate variates from the same, correct, distribution), and the Commons Math interface is slightly nicer, but the code is much slower to execute. I’m still optimistic that Commons Math will one day be Java’s GSL, but I’m not giving up on Parallel COLT (or C, for that matter!) just yet…

Calling C code from R

Introduction

In this post I’ll look at how to call compiled C code from an R session. The focus here is on calling C code from R, rather than on extending R using C. Although the two are technically very similar problems, the emphasis is somewhat different. A lot of existing documentation focuses on the latter problem, and this is one of the motivations for writing this post. Fortunately, the problem of calling existing C code from R is a bit simpler than the more general problem of extending R in C.

In a previous post I looked at how to implement a trivial bivariate Gibbs sampler in various languages. It was seen there that the C version ran approximately 60 times faster than the R version. It is therefore often desirable to code up MCMC algorithms in C. However, it is usually very convenient to be able to call such algorithms from inside an R session. There are various ways to do this, ranging from the trivial to very complex. In this post I will look at some of the simpler methods and discuss the pros and cons.

Standalone C code

We will restrict attention to the Gibbs sampler discussed in a previous post. We will focus on the C version of the code. Below is a slightly modified version of the code which includes some command-line arguments that enable some flexibility in how the code is run post-compilation.

#include <stdio.h>
#include <math.h>
#include <stdlib.h>
#include <gsl/gsl_rng.h>
#include <gsl/gsl_randist.h>

int main(int argc, char *argv[])
{
  if (argc!=4) {
    fprintf(stderr,"Usage: %s <Iters> <Thin> <Seed>\n",argv[0]);
    exit(EXIT_FAILURE);
  }
  long N=(long) atoi(argv[1]);
  long thin=(long) atoi(argv[2]);
  long seed=(long) atoi(argv[3]);
  long i,j;
  gsl_rng *r = gsl_rng_alloc(gsl_rng_mt19937);
  gsl_rng_set(r,seed);
  double x=0;
  double y=0;
  printf("Iter x y\n");
  for (i=0;i<N;i++) {
    for (j=0;j<thin;j++) {
      x=gsl_ran_gamma(r,3.0,1.0/(y*y+4));
      y=1.0/(x+1)+gsl_ran_gaussian(r,1.0/sqrt(x+1));
    }
    printf("%ld %f %f\n",i,x,y);
  }
  exit(EXIT_SUCCESS);
}

Assuming a Unix/Linux environment (including a GSL implementation), the above code can be compiled from the Unix shell with a command like:

gcc -O2 -lgsl -lgslcblas standalone.c -o standalone

and run with a command like:

./standalone 10000 500 1 > data.tab

The first command-line argument is the number of iterations required, and the second is the “thin” to be applied to the output. The third argument is the “seed” to be applied to the GSL random number generator (RNG). This allows different (not quite independent – see my post on parallel MCMC for details) runs to be obtained by selecting different seed values. The simplest way to call this code from within an R session is to call this unmodified executable using the R system() command. A small “wrapper” function to do this is given below.

standalone<-function(N=10000,thin=500,
             seed=trunc(runif(1)*1e6),
             exec=file.path(".","standalone"),
             tmpfile=tempfile())
{
  command=paste(exec,N,thin,seed,">",tmpfile)
  system(command)
  read.table(tmpfile,header=TRUE)
}

Note the use of the file.path() and tempfile() R functions in a (probably vain!) attempt to make the code somewhat portable. Just running standalone() from an R session should then return a data frame containing the MCMC output. I gave some commands for analysing this output in a previous post. This approach to calling external code is very simple and crude, and quite generic (it is not specific to C code at all). However, it is very quick and easy to implement, and in many cases quite efficient. There is a considerable computational overhead in executing the system command and parsing output files from disk. However, if the code being called is very computationally intensive and relatively slow (as is typically the case), then this overhead can often be negligible, rendering this approach quite practical.

Building and linking to a shared library

If one is really keen to avoid the overhead of executing an R system command, then it is necessary to compile the required C code into a shared library (or DLL), and link this code into R where it can be called directly via R’s foreign language interface. Below is a version of the previous C code modified to make it appropriate for calling from R.

#include <stdio.h>
#include <math.h>
#include <stdlib.h>
#include <gsl/gsl_rng.h>
#include <gsl/gsl_randist.h>
#include <R.h>

void gibbs(int *Np,int *thinp,int *seedp,double *xvec,double *yvec)
{
  int i,j;
  int N=*Np,thin=*thinp,seed=*seedp;
  gsl_rng *r = gsl_rng_alloc(gsl_rng_mt19937);
  gsl_rng_set(r,seed);
  double x=0;
  double y=0;
  for (i=0;i<N;i++) {
    for (j=0;j<thin;j++) {
      x=gsl_ran_gamma(r,3.0,1.0/(y*y+4));
      y=1.0/(x+1)+gsl_ran_gaussian(r,1.0/sqrt(x+1));
    }
    xvec[i]=x; yvec[i]=y;
  }
}

Note that it is only possible to pass pointers from simple R/C data types, and so all function arguments must be pointers. Also note that there is no return value to the function, and that values are retrieved in R by modifying some of the values pointed to by the pointer arguments. This is the mode of operation imposed by the basic method that R provides for calling C code from R (the .C() function). Note that there are other methods for extending R in C, using the .Call() and .External() functions, but these are beyond the scope of this post. Again assuming a Unix/Linux environment, this code can be compiled into a shared library with a command like:

R CMD SHLIB -lgsl -lgslcblas dynamic.c

It can then be loaded into a running R session with a command like dyn.load("dynamic.so"). Again, if we are attempting to write portable code, we might use a command like:

dyn.load(file.path(".",paste("dynamic",.Platform$dynlib.ext,sep="")))

You can check what dynamic libraries are loaded into the current R session with getLoadedDLLs(). Once the DLL (Dynamic Link Library) is loaded, it can be called using the .C() function. A small wrapper function appropriate in this instance is given below:

dynamic<-function(n=10000,thin=500,seed=trunc(runif(1)*1e6))
{
  tmp=.C("gibbs",as.integer(n),as.integer(thin),
               as.integer(seed),x=as.double(1:n),
                  y=as.double(1:n))
  mat=cbind(1:n,tmp$x,tmp$y) 
  colnames(mat)=c("Iter","x","y")
  mat
}

Note how a random seed is generated in R to be passed to the C code to be used to seed the GSL random generator used within the C code. The code can then be run with a simple call to dynamic() and everything should work OK provided that all of the required libraries are found. This is the simplest way to link C code into R in a way that avoids the overhead associated with a system() call. However, this approach is also not without issues. In particular, the C code relies on the GSL, and more specifically on the random number streams provided by the GSL. These are completely separate from the random number streams used within the R system. In some situations it would make sense to use the same random number streams used within the R session, and to remove the dependence of the C code on the GSL.

Using the R API

The C code discussed in the previous section relies on the GSL only for the generation of (non-uniform) random numbers. Obviously R has its own very sophisticated system for handling random numbers and it is possible to use this system from within externally called C code using the R API. In particular, C versions of functions such as rnorm() and rgamma() can be called in C by including Rmath.h. Below is a version of the C code previously given modified to use the R random number generation routines and to remove all dependence on the GSL.

#include <stdio.h>
#include <math.h>
#include <stdlib.h>
#include <R.h>
#include <Rmath.h>

void gibbsR(int *Np,int *thinp,double *xvec,double *yvec)
{
  int i,j;
  int N=*Np,thin=*thinp;
  GetRNGstate();
  double x=0;
  double y=0;
  for (i=0;i<N;i++) {
    for (j=0;j<thin;j++) {
      x=rgamma(3.0,1.0/(y*y+4));
      y=rnorm(1.0/(x+1),1.0/sqrt(x+1));
    }
    xvec[i]=x; yvec[i]=y;
  }
  PutRNGstate();
}

Note that a call to GetRNGstate() must be made before calling any random number functions and that a call to PutRNGstate() must be called before the function returns control back to R. This code can be compiled with a command like

R CMD SHLIB dynamicR.c

and linked into R with a command like

dyn.load(file.path(".",paste("dynamicR",.Platform$dynlib.ext,sep="")))

An appropriate wrapper for this code is given below:

dynamicR<-function(n=10000,thin=500)
{
  tmp=.C("gibbsR",as.integer(n),as.integer(thin),
                x=as.double(1:n),y=as.double(1:n))
  mat=cbind(1:n,tmp$x,tmp$y) 
  colnames(mat)=c("Iter","x","y")
  mat
}

This code is now slightly simpler, and the lack of dependence on external libraries such as the GSL makes it much easier to integrate into R packages, should this be desired.

Summary and further reading

Foreign language interfaces are a notoriously complex subject and this post has obviously just scratched the surface of the problem. For a few more examples, first see my old computer practicals on Stochastic simulation in R and C. The examples are a bit out of date, but easy to fix. Also see a howto by the Flemish Supercomputing Centre on a similar topic to this one. For more detailed information, see the manual on Writing R extensions, especially the sections on Foreign language interfaces and the R API. I also find Chapter 6 of R Programming for Bioinformatics to be a useful introduction to more complex aspects.

I have also somewhat belatedly re-discovered Charlie Geyer‘s notes on Calling C and Fortran from R, which covers very similar ground to this post. They were probably the unconscious inspiration for this post…