Unbiased MCMC with couplings

Yesterday there was an RSS Read Paper meeting for the paper Unbiased Markov chain Monte Carlo with couplings by Pierre Jacob, John O’Leary and Yves F. Atchadé. The paper addresses the bias in MCMC estimates due to lack of convergence to equilibrium (the “burn-in” problem), and shows how it is possible to modify MCMC algorithms in order to construct estimates which exactly remove this bias. The requirement is to couple a pair of MCMC chains so that they will at some point meet exactly and thereafter remain coupled. This turns out to be easier to do than one might naively expect. There are many reasons why we might want to remove bias from MCMC estimates, but the primary motivation in the paper was the application to parallel MCMC computation. The idea here is that many pairs of chains can be run independently on any available processors, and the unbiased estimates from the different pairs can be safely averaged to get an (improved) unbiased estimate based on all of the chains. As a discussant of the paper, I’ve spent a bit of time thinking about this idea, and have created a small repository of materials relating to the paper which may be useful for others interested in understanding the method and how to use it in practice.
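
For orientation, the basic estimator in the paper works roughly as follows (this is a sketch from my reading of the paper, so see the paper itself for the precise statement and conditions). Run a pair of coupled chains X and Y targeting the same distribution \pi, with Y lagged one step behind X, until they meet at the random time \tau = \inf\{t \ge 1 : X_t = Y_{t-1}\}. Then, for any fixed k,

H_k(X, Y) = h(X_k) + \sum_{l=k+1}^{\tau-1} \left( h(X_l) - h(Y_{l-1}) \right)

is unbiased for E_\pi[h(X)], and such estimators can be averaged over a range of k, and over many independent pairs of chains, in order to reduce variance.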

The repo includes a page of links to related papers, blog posts, software and other resources relating to unbiased MCMC that I’ve noticed on-line.

Earlier in the year I gave an internal seminar at Newcastle providing a tutorial introduction to the main ideas from the paper, including runnable R code implementations of the examples. The talk was prepared as an executable R Markdown document. The R Markdown source code is available in the repo, but for the convenience of casual browsers I’ve also included a pre-built set of PDF slides. Code examples include maximal coupling of two (univariate) distributions, coupling of Metropolis-Hastings chains, and coupling of a Gibbs sampler for an AR(1) process.

I haven’t yet finalised my written discussion contribution, but the slides I presented at the Read Paper meeting are also available. Again, there is source code and pre-built PDF slides. My discussion focused on seeing how well the technique works for Gibbs samplers applied to high-dimensional latent process models (an AR(1) process and a Gaussian Markov random field), and reflecting on the extent to which the technique really solves the burn-in/parallel MCMC problem.

The repo also contains a few stand-alone code examples. There are some simple tutorial examples in R (largely derived from my tutorial introduction), including implementations of (univariate) independent and reflection maximal couplings, and a coupled AR(1) process example.
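
To give a flavour of the key construction, below is a minimal R sketch of an independent maximal coupling of two univariate densities, in the spirit of the examples in the repo (the function and argument names here are illustrative, and not taken from the repo).

# Minimal sketch of an independent maximal coupling of two univariate
# densities dp and dq, with corresponding samplers rp and rq
maxCouple = function(rp, dp, rq, dq) {
  x = rp()
  if (runif(1, 0, dp(x)) <= dq(x)) {
    c(x, x)                 # the two components are equal: the chains "meet"
  } else {
    repeat {                # otherwise rejection-sample the second component
      y = rq()
      if (runif(1, 0, dq(y)) > dp(y)) return(c(x, y))
    }
  }
}

# Example: maximally couple N(0,1) and N(1,1)
set.seed(1)
maxCouple(function() rnorm(1), function(x) dnorm(x),
          function() rnorm(1, 1), function(x) dnorm(x, 1))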

The more substantial example concerns a coupled Gibbs sampler for a GMRF. This example is written in the Scala programming language. There are a couple of notable features of this implementation. First, the code illustrates monadic coupling of probability distributions, based on the Rand type in the Breeze scientific library. This provides an elegant way to maximally couple arbitrary (continuous) random variables, and to create coupled Metropolis(-Hastings) kernels. For example, a coupling of two distributions can be constructed as

  // Maximal coupling of two continuous distributions, as a Rand over pairs
  def couple[T](p: ContinuousDistr[T], q: ContinuousDistr[T]): Rand[(T, T)] = {
    // rejection sampler targeting the part of q not overlapping p
    def ys: Rand[T] =
      for {
        y  <- q
        w  <- Uniform(0, 1)
        ay <- if (math.log(w) > p.logPdf(y) - q.logPdf(y)) Rand.always(y) else ys
      } yield ay
    // attempt to couple using a draw from p
    val pair = for {
      x <- p
      w <- Uniform(0, 1)
    } yield (math.log(w) <= q.logPdf(x) - p.logPdf(x), x)
    pair flatMap {
      case (b, x) => if (b) Rand.always((x, x)) else (ys map (y => (x, y)))
    }
  }

and then draws can be sampled from the resulting Rand[(T, T)] polymorphic type as required. Incidentally, this also illustrates how to construct an independent maximal coupling without ever evaluating raw (unlogged) densities, since only log-densities are required.
The other notable feature of the code is the use of a parallel comonadic image type for parallel Gibbs sampling of the GMRF, producing a (lazy) Stream of coupled MCMC samples.

The smfsb R package

Introduction

In the previous post I gave a brief introduction to the third edition of my textbook, Stochastic modelling for systems biology. The algorithms described in the book are illustrated by implementations in R. These implementations are collected together in an R package on CRAN called smfsb. This post will provide a brief introduction to the package and its capabilities.

Installation

The package is on CRAN – see the CRAN package page for details. So the simplest way to install it is to enter

install.packages("smfsb")

at the R command prompt. This will install the latest version that is on CRAN. Once installed, the package can be loaded with

library(smfsb)

The package is well-documented, so further information can be obtained with the usual R mechanisms, such as

vignette(package="smfsb")
vignette("smfsb")
help(package="smfsb")
?StepGillespie
example(StepCLE1D)

The version of the package on CRAN is almost certainly what you want. However, the package is developed on R-Forge – see the R-Forge project page for details. So the very latest version of the package can always be installed with

install.packages("smfsb", repos="http://R-Forge.R-project.org")

if you have a reason for wanting it.

A brief tutorial

The vignette gives a quick introduction to the library, which I don’t need to repeat verbatim here. If you are new to the package, I recommend working through that before continuing. Here I’ll concentrate on some of the new features associated with the third edition.

Simulating stochastic kinetic models

Much of the book is concerned with the simulation of stochastic kinetic models using exact and approximate algorithms. Although the primary focus of the text is the application to modelling of intra-cellular processes, the methods are also appropriate for population modelling of ecological and epidemic processes. For example, we can start by simulating a simple susceptible-infectious-recovered (SIR) disease epidemic model.

set.seed(2)
data(spnModels)

stepSIR = StepGillespie(SIR)
plot(simTs(SIR$M, 0, 8, 0.05, stepSIR),
  main="Exact simulation of the SIR model")

Exact simulation of the SIR epidemic model

The focus of the text is stochastic simulation of discrete models, so that is the obvious place to start. But there is also support for continuous deterministic simulation.

plot(simTs(SIR$M, 0, 8, 0.05, StepEulerSPN(SIR)),
  main="Euler simulation of the SIR model")

Euler simulation of the SIR model

My favourite toy population dynamics model is the Lotka-Volterra (LV) model, so I tend to use this frequently as a running example throughout the book. We can simulate this (exactly) as follows.

stepLV = StepGillespie(LV)
plot(simTs(LV$M, 0, 30, 0.2, stepLV),
  main="Exact simulation of the LV model")

Exact simulation of the Lotka-Volterra model

Stochastic reaction-diffusion modelling

The first two editions of the book were almost exclusively concerned with well-mixed systems, where spatial effects are ignorable. One of the main new features of the third edition is the inclusion of a new chapter on spatially extended systems. The focus is on models related to the reaction-diffusion master equation (RDME) formulation, rather than individual particle-based simulations. For these models, space is typically divided into a regular grid of voxels, with reactions taking place as normal within each voxel, and additional reaction events included, corresponding to the diffusion of particles to adjacent voxels. To specify such a model, we just need an initial condition, a reaction model, and diffusion coefficients (one for each reacting species). We can then carry out exact simulation of an RDME model for a 1D spatial domain as follows.

N=20; T=30
x0=matrix(0, nrow=2, ncol=N)
rownames(x0) = c("x1", "x2")
x0[,round(N/2)] = LV$M
stepLV1D = StepGillespie1D(LV, c(0.6, 0.6))
xx = simTs1D(x0, 0, T, 0.2, stepLV1D, verb=TRUE)
image(xx[1,,], main="Prey", xlab="Space", ylab="Time")

Discrete 1D simulation of the LV model

image(xx[2,,], main="Predator", xlab="Space", ylab="Time")

Discrete 1D simulation of the LV model

Exact simulation of discrete stochastic reaction-diffusion systems is very expensive (and the reference implementation provided in the package is very inefficient), so we will often use diffusion approximations based on the chemical Langevin equation (CLE).

stepLV1DC = StepCLE1D(LV, c(0.6, 0.6))
xx = simTs1D(x0, 0, T, 0.2, stepLV1DC)
image(xx[1,,], main="Prey", xlab="Space", ylab="Time")

Spatial CLE simulation of the 1D LV model

image(xx[2,,], main="Predator", xlab="Space", ylab="Time")

Spatial CLE simulation of the 1D LV model

We can think of this algorithm as an explicit numerical integration of the obvious SPDE approximation to the exact model.

The package also includes support for simulation of 2D systems. Again, we can use the Spatial CLE to speed things up.

m=70; n=50; T=10
data(spnModels)
x0=array(0, c(2,m,n))
dimnames(x0)[[1]]=c("x1", "x2")
x0[,round(m/2),round(n/2)] = LV$M
stepLV2D = StepCLE2D(LV, c(0.6,0.6), dt=0.05)
xx = simTs2D(x0, 0, T, 0.5, stepLV2D)
N = dim(xx)[4]
image(xx[1,,,N],main="Prey",xlab="x",ylab="y")

Spatial CLE simulation of the 2D LV model

image(xx[2,,,N],main="Predator",xlab="x",ylab="y")

Spatial CLE simulation of the 2D LV model

Bayesian parameter inference

Although much of the book is concerned with the problem of forward simulation, the final chapters are concerned with the inverse problem of estimating model parameters, such as reaction rate constants, from data. A computational Bayesian approach is adopted, with the main emphasis being placed on “likelihood free” methods, which rely on forward simulation to avoid explicit computation of sample path likelihoods. The second edition included some rudimentary code for a likelihood free particle marginal Metropolis-Hastings (PMMH) particle Markov chain Monte Carlo (pMCMC) algorithm. The third edition includes a more complete and improved implementation, in addition to approximate inference algorithms based on approximate Bayesian computation (ABC).

The key function underpinning the PMMH approach is pfMLLik, which computes an estimate of marginal model log-likelihood using a (bootstrap) particle filter. There is a new implementation of this function with the third edition. There is also a generic implementation of the Metropolis-Hastings algorithm, metropolisHastings, which can be combined with pfMLLik to create a PMMH algorithm. PMMH algorithms are very slow, but a full demo of how to use these functions for parameter inference is included in the package and can be run with

demo(PMCMC)
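
To give a rough idea of how the two functions fit together, here is a hedged sketch of a PMMH analysis for the LV model. The measurement model, initial state sampler, tuning constants, and even some of the argument names below are indicative only (this is a paraphrase of the approach rather than the definitive usage), so consult demo(PMCMC) and the package help pages for the real thing.

# Hedged sketch of PMMH for the LV model (illustrative only; see demo(PMCMC))
data(LVdata)
timedData = as.timedData(LVnoise10)            # assumed: noisy LV observations
simx0 = function(N, t0, ...)                   # sampler for the initial state
    cbind(rpois(N, 50), rpois(N, 100))
dataLik = function(x, t, y, log=TRUE, ...) {   # assumed Gaussian measurement error
    ll = sum(dnorm(y, x, 10, log=TRUE))
    if (log) ll else exp(ll)
}
mLLik = pfMLLik(100, simx0, 0, stepLVc, dataLik, timedData)
out = metropolisHastings(log(c(1, 0.005, 0.6)),        # work on the log scale
    function(logth) mLLik(exp(logth)),
    rprop = function(logth) logth + rnorm(3, 0, 0.1),  # symmetric random walk
    iters = 1000, thin = 10)
mcmcSummary(out, rows=3)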

Simple rejection-based ABC methods are facilitated by the (very simple) function abcRun, which just samples from a prior and then carries out independent simulations in parallel before computing summary statistics. A simple illustration of the use of the function is given below.

data(LVdata)
rprior <- function() { exp(c(runif(1, -3, 3),runif(1,-8,-2),runif(1,-4,2))) }
rmodel <- function(th) { simTs(c(50,100), 0, 30, 2, stepLVc, th) }
sumStats <- identity
ssd = sumStats(LVperfect)
distance <- function(s) {
    diff = s - ssd
    sqrt(sum(diff*diff))
}
rdist <- function(th) { distance(sumStats(rmodel(th))) }
out = abcRun(10000, rprior, rdist)
q=quantile(out$dist, c(0.01, 0.05, 0.1))
print(q)
##       1%       5%      10% 
## 772.5546 845.8879 881.0573
accepted = out$param[out$dist < q[1],]
print(summary(accepted))
##        V1                V2                  V3         
##  Min.   :0.06498   Min.   :0.0004467   Min.   :0.01887  
##  1st Qu.:0.16159   1st Qu.:0.0012598   1st Qu.:0.04122  
##  Median :0.35750   Median :0.0023488   Median :0.14664  
##  Mean   :0.68565   Mean   :0.0046887   Mean   :0.36726  
##  3rd Qu.:0.86708   3rd Qu.:0.0057264   3rd Qu.:0.36870  
##  Max.   :4.76773   Max.   :0.0309364   Max.   :3.79220
print(summary(log(accepted)))
##        V1                V2               V3         
##  Min.   :-2.7337   Min.   :-7.714   Min.   :-3.9702  
##  1st Qu.:-1.8228   1st Qu.:-6.677   1st Qu.:-3.1888  
##  Median :-1.0286   Median :-6.054   Median :-1.9198  
##  Mean   :-0.8906   Mean   :-5.877   Mean   :-1.9649  
##  3rd Qu.:-0.1430   3rd Qu.:-5.163   3rd Qu.:-0.9978  
##  Max.   : 1.5619   Max.   :-3.476   Max.   : 1.3329

Naive rejection-based ABC algorithms are notoriously inefficient, so the library also includes an implementation of a more efficient, sequential version of ABC, often known as ABC-SMC, in the function abcSmc. This function requires specification of a perturbation kernel to “noise up” the particles at each algorithm sweep. Again, the implementation is parallel, using the parallel package to run the required simulations in parallel on multiple cores. A simple illustration of use is given below.

rprior <- function() { c(runif(1, -3, 3), runif(1, -8, -2), runif(1, -4, 2)) }
dprior <- function(x, ...) { dunif(x[1], -3, 3, ...) + 
    dunif(x[2], -8, -2, ...) + dunif(x[3], -4, 2, ...) }
rmodel <- function(th) { simTs(c(50,100), 0, 30, 2, stepLVc, exp(th)) }
rperturb <- function(th){th + rnorm(3, 0, 0.5)}
dperturb <- function(thNew, thOld, ...){sum(dnorm(thNew, thOld, 0.5, ...))}
sumStats <- identity
ssd = sumStats(LVperfect)
distance <- function(s) {
    diff = s - ssd
    sqrt(sum(diff*diff))
}
rdist <- function(th) { distance(sumStats(rmodel(th))) }
out = abcSmc(5000, rprior, dprior, rdist, rperturb,
    dperturb, verb=TRUE, steps=6, factor=5)
## 6 5 4 3 2 1
print(summary(out))
##        V1                V2               V3        
##  Min.   :-2.9961   Min.   :-7.988   Min.   :-3.999  
##  1st Qu.:-1.9001   1st Qu.:-6.786   1st Qu.:-3.428  
##  Median :-1.2571   Median :-6.167   Median :-2.433  
##  Mean   :-1.0789   Mean   :-6.014   Mean   :-2.196  
##  3rd Qu.:-0.2682   3rd Qu.:-5.261   3rd Qu.:-1.161  
##  Max.   : 2.1128   Max.   :-2.925   Max.   : 1.706

We can then plot some results with

hist(out[,1],30,main="log(c1)")

ABC-SMC posterior for the LV model

hist(out[,2],30,main="log(c2)")

ABC-SMC posterior for the LV model

hist(out[,3],30,main="log(c3)")

ABC-SMC posterior for the LV model

Although the inference methods are illustrated in the book in the context of parameter inference for stochastic kinetic models, their implementation is generic, and can be used with any appropriate parameter inference problem.
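
As a toy illustration of this genericity (not an example from the book), abcRun can just as easily be used for a simple non-kinetic problem, such as inferring the mean of a normal distribution from an observed sample.

# Toy illustration of abcRun on a generic (non-kinetic) inference problem:
# infer the mean of a N(mu, 1) distribution from a sample of size 100
set.seed(1)
obs = rnorm(100, 2, 1)
rprior = function() rnorm(1, 0, 10)
rdist = function(mu) abs(mean(rnorm(100, mu, 1)) - mean(obs))
out = abcRun(10000, rprior, rdist)
accepted = out$param[out$dist < quantile(out$dist, 0.01)]
hist(accepted, 30, main="ABC posterior for mu")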

The smfsbSBML package

smfsbSBML is another R package associated with the third edition of the book. This package is not on CRAN due to its dependency on a package not on CRAN, and hence is slightly less straightforward to install. Follow the available installation instructions to install the package. Once installed, you should be able to load the package with

library(smfsbSBML)

This package provides a function for reading in SBML files and parsing them into the simulatable stochastic Petri net (SPN) objects used by the main smfsb R package. Examples of suitable SBML models are included in the main smfsb GitHub repo. An appropriate SBML model can be read and parsed with a command like:

model = sbml2spn("mySbmlModel.xml")

The resulting value, model, is an SPN object which can be passed to simulation functions such as StepGillespie in order to construct stochastic simulation algorithms.
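
For example, assuming the parsed SPN object follows the same conventions as the built-in smfsb models (in particular, that it carries its initial state in model$M), a simulation might then look something like the following, where the file name and time grid are just placeholders.

# Hypothetical example: exact simulation of a parsed SBML model
# (assumes the parsed SPN carries its initial marking in model$M,
#  like the built-in smfsb models)
model = sbml2spn("mySbmlModel.xml")
stepModel = StepGillespie(model)
plot(simTs(model$M, 0, 100, 0.1, stepModel),
  main="Exact simulation of the SBML model")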

Other software

In addition to the above R packages, I also have some Python scripts for converting between SBML and the SBML-shorthand notation I use in the book. See the SBML-shorthand page for further details.

Although R is a convenient language for teaching and learning about stochastic simulation, it isn’t ideal for serious research-level scientific computing or computational statistics. So for the third edition of the book I have also developed scala-smfsb, a library which re-implements all of the models and algorithms from the book in Scala, a fast, efficient, strongly-typed, compiled, functional programming language. I’ll give an introduction to this library in a subsequent post, but in the meantime, it is already well documented, so see the scala-smfsb repo for further details, including information on installation, getting started, a tutorial, examples, API docs, etc.

Source

This blog post started out as an RMarkdown document, the source of which can be found here.

Stochastic Modelling for Systems Biology, third edition

The third edition of my textbook, Stochastic Modelling for Systems Biology has recently been published by Chapman & Hall/CRC Press. The book has ISBN-10 113854928-2 and ISBN-13 978-113854928-9. It can be ordered from CRC Press, Amazon.com, Amazon.co.uk and similar book sellers.

I was fairly happy with the way that the second edition, published in 2011, turned out, and so I haven’t substantially re-written any of the text for the third edition. Instead, I’ve concentrated on adding in new material and improving the associated on-line resources. Those on-line resources are all free and open source, and hence available to everyone, irrespective of whether you have a copy of the new edition. I’ll give an introduction to those resources below (and in subsequent posts). The new material can be briefly summarised as follows:

  • New chapter on spatially extended systems, covering the spatial Gillespie algorithm for reaction diffusion master equation (RDME) models in 1- and 2-d, the next subvolume method, spatial CLE, scaling issues, etc.
  • Significantly expanded chapter on inference for stochastic kinetic models from data, covering approximate methods of inference (ABC), including ABC-SMC. The material relating to particle MCMC has also been improved and extended.
  • Updated R package, including code relating to all of the new material
  • New R package for parsing SBML models into simulatable stochastic Petri net models
  • New software library, written in Scala, replicating most of the functionality of the R packages in a fast, compiled, strongly typed, functional language

New content

Although some minor edits and improvements have been made throughout the text, there are two substantial new additions to the text in this new edition. The first is an entirely new chapter on spatially extended systems. The first two editions of the text focused on the implications of discreteness and stochasticity in chemical reaction systems, but maintained the well-mixed assumption throughout. This is a reasonable first approach, since discreteness and stochasticity are most pronounced in very small volumes where diffusion should be rapid. In any case, even these non-spatial models have very interesting behaviour, and become computationally challenging very quickly for non-trivial reaction networks. However, we know that, in fact, the cell is a very crowded environment, and so even at small spatial scales, many interesting processes are diffusion limited. It therefore seems appropriate to dedicate one chapter (the new Chapter 9) to studying some of the implications of relaxing the well-mixed assumption. Entire books can be written on stochastic reaction-diffusion systems, so here only a brief introduction is provided, based mainly around models in the reaction-diffusion master equation (RDME) style. Exact stochastic simulation algorithms are discussed, with implementations provided for the 1- and 2-d cases, and an appropriate Langevin approximation, the spatial CLE, is also examined.

The second major addition is to the chapter on inference for stochastic kinetic models from data (now Chapter 11). The second edition of the book included a discussion of “likelihood free” Bayesian MCMC methods for inference, and provided a working implementation of likelihood free particle marginal Metropolis-Hastings (PMMH) for stochastic kinetic models. The third edition improves on that implementation, and discusses approximate Bayesian computation (ABC) as an alternative to MCMC for likelihood free inference. Implementation issues are discussed, and sequential ABC approaches are examined, concentrating in particular on the method known as ABC-SMC.

New software and on-line resources

Accompanying the text are new and improved on-line resources, all well-documented, free, and open source.

New website/GitHub repo

Information and materials relating to the previous editions were kept on my University website. All materials relating to this new edition are kept in a public GitHub repo: darrenjw/smfsb. This will be simpler to maintain, and will make it much easier for people to make copies of the material for off-line use and study.

Updated R package(s)

Along with the second edition of the book I released an accompanying R package, “smfsb”, published on CRAN. This was a very popular feature, allowing anyone with R to trivially experiment with all of the models and algorithms discussed in the text. This R package has been updated, and a new version has been published to CRAN. The updates are all backwards-compatible with the version associated with the second edition of the text, so owners of that edition can still upgrade safely. I’ll give a proper introduction to the package, including the new features, in a subsequent post, but in the meantime, you can install/upgrade the package from a running R session with

install.packages("smfsb")

and then pop up a tutorial vignette with:

vignette("smfsb")

This should be enough to get you started.

In addition to the main R package, there is an additional R package for parsing SBML models into models that can be simulated within R. This package is not on CRAN, due to its dependency on a non-CRAN package. See the repo for further details.

There are also Python scripts available for converting SBML models to and from the shorthand SBML notation used in the text.

New Scala library

Another major new resource associated with the third edition of the text is a software library written in the Scala programming language. This library provides Scala implementations of all of the algorithms discussed in the book and implemented in the associated R packages. This then provides example implementations in a fast, efficient, compiled language, and is likely to be most useful for people wanting to use the methods in the book for research. Again, I’ll provide a tutorial introduction to this library in a subsequent post, but it is well-documented, with all necessary information needed to get started available at the scala-smfsb repo/website, including a step-by-step tutorial and some additional examples.

Data frames and tables in Scala

Introduction

To statisticians and data scientists used to working in R, the concept of a data frame is one of the most natural and basic starting points for statistical computing and data analysis. It always surprises me that data frames aren’t a core concept in most programming languages’ standard libraries, since they are essentially a representation of a relational database table, and relational databases are pretty ubiquitous in data processing and related computing. For statistical modelling and data science, having functions designed for data frames is much more elegant than using functions designed to work directly on vectors and matrices, for example. Trivial things like being able to refer to columns by a readable name rather than a numeric index makes a huge difference, before we even get into issues like columns of heterogeneous types, coherent handling of missing data, etc. This is why modelling in R is typically nicer than in certain other languages I could mention, where libraries for scientific and numerical computing existed for a long time before libraries for data frames were added to the language ecosystem.

To build good libraries for statistical computing in Scala, it will be helpful to build those libraries using a good data frame implementation. With that in mind I’ve started to look for existing Scala data frame libraries and to compare them.

A simple data manipulation task

For this post I’m going to consider a very simple data manipulation task: first reading in a CSV file from disk into a data frame object, then filtering out some rows, then adding a derived column, then finally writing the data frame back to disk as a CSV file. We will start by looking at how this would be done in R. First we need an example CSV file. Since many R packages contain example datasets, we will use one of those. We will export Cars93 from the MASS package:

library(MASS)
write.csv(Cars93,"cars93.csv",row.names=FALSE)

If MASS isn’t installed, it can be installed with a simple install.packages("MASS"). The above code snippet generates a CSV file to be used for the example. Typing ?Cars93 will give some information about the dataset, including the original source.

Our analysis task is going to be to load the file from disk, filter out cars with EngineSize larger than 4 (litres), add a new column to the data frame, WeightKG, containing the weight of the car in KG, derived from the column Weight (in pounds), and then write back to disk in CSV format. This is the kind of thing that R excels at (pun intended):

df=read.csv("cars93.csv")
print(dim(df))
df = df[df$EngineSize<=4.0,]
print(dim(df))
df$WeightKG = df$Weight*0.453592
print(dim(df))
write.csv(df,"cars93m.csv",row.names=FALSE)

Now let’s see how a similar task could be accomplished using Scala data frames.

Data frames and tables in Scala

Saddle

Saddle is probably the best known data frame library for Scala. It is strongly influenced by the pandas library for Python. A simple Saddle session for accomplishing this task might proceed as follows:

val file = CsvFile("cars93.csv")
val df = CsvParser.parse(file).withColIndex(0)
println(df)
val df2 = df.rfilter(_("EngineSize").
             mapValues(CsvParser.parseDouble).at(0)<=4.0)
println(df2)
val wkg=df2.col("Weight").mapValues(CsvParser.parseDouble).
             mapValues(_*0.453592).setColIndex(Index("WeightKG"))
val df3=df2.joinPreserveColIx(wkg.mapValues(_.toString))
println(df3)
df3.writeCsvFile("saddle-out.csv")

Although this looks OK, it’s not completely satisfactory, as the data frame is actually representing a matrix of Strings. Although you can have a data frame containing columns of any type, since Saddle data frames are backed by a matrix object (with type corresponding to the common super-type), the handling of columns of heterogeneous types always seems rather cumbersome. I suspect that it is this clumsy handling of heterogeneously typed columns that has motivated the development of alternative data frame libraries for Scala.

Scala-datatable

Scala-datatable is a lightweight minimal immutable data table for Scala, with good support for columns of differing types. However, it is currently really very minimal, and doesn’t have CSV import or export, for example. That said, there are several CSV libraries for Scala, so it’s quite easy to write functions to import from CSV into a datatable and write CSV back out from one. I’ve written a couple of example functions, readCsv() and writeCsv(), in the full code examples associated with this post. Since datatable supports heterogeneous column types and I don’t want to write a type guesser, my readCsv() function expects information regarding the column types. This could be relaxed with a bit of effort. An example session follows:

    val colTypes=Map("DriveTrain" -> StringCol, 
                     "Min.Price" -> Double, 
                     "Cylinders" -> Int, 
                     "Horsepower" -> Int, 
                     "Length" -> Int, 
                     "Make" -> StringCol, 
                     "Passengers" -> Int, 
                     "Width" -> Int, 
                     "Fuel.tank.capacity" -> Double, 
                     "Origin" -> StringCol, 
                     "Wheelbase" -> Int, 
                     "Price" -> Double, 
                     "Luggage.room" -> Double, 
                     "Weight" -> Int, 
                     "Model" -> StringCol, 
                     "Max.Price" -> Double, 
                     "Manufacturer" -> StringCol, 
                     "EngineSize" -> Double, 
                     "AirBags" -> StringCol, 
                     "Man.trans.avail" -> StringCol, 
                     "Rear.seat.room" -> Double, 
                     "RPM" -> Int, 
                     "Turn.circle" -> Double, 
                     "MPG.highway" -> Int, 
                     "MPG.city" -> Int, 
                     "Rev.per.mile" -> Int, 
                     "Type" -> StringCol)
    val df=readCsv("Cars93",new FileReader("cars93.csv"),colTypes)
    println(df.length,df.columns.length)
    val df2=df.filter(row=>row.as[Double]("EngineSize")<=4.0).toDataTable
    println(df2.length,df2.columns.length)

    val oldCol=df2.columns("Weight").as[Int]
    val newCol=new DataColumn[Double]("WeightKG",oldCol.data.map{_.toDouble*0.453592})
    val df3=df2.columns.add(newCol).get
    println(df3.length,df3.columns.length)

    writeCsv(df3,new File("out.csv"))

Apart from the declaration of column types, the code is actually a little bit cleaner than the corresponding Saddle code, and the column types are all properly preserved and appropriately handled. However, a significant limitation of this data frame is that it doesn’t seem to have special handling of missing values, requiring some kind of manually coded “special value” approach from users of this data frame. This is likely to limit the appeal of this library for general statistical and data science applications.

Framian

Framian is a full-featured data frame library for Scala, open-sourced by Pellucid Analytics. It is strongly influenced by R data frame libraries, and aims to provide most of the features that R users would expect. It has good support for clean handling of heterogeneously typed columns (using shapeless), handles missing data, and includes good CSV import:

val df=Csv.parseFile(new File("cars93.csv")).labeled.toFrame
println(""+df.rows+" "+df.cols)
val df2=df.filter(Cols("EngineSize").as[Double])( _ <= 4.0 )
println(""+df2.rows+" "+df2.cols)
val df3=df2.map(Cols("Weight").as[Int],"WeightKG")(r=>r.toDouble*0.453592)
println(""+df3.rows+" "+df3.cols)
println(df3.colIndex)
val csv = Csv.fromFrame(new CsvFormat(",", header = true))(df3)
new PrintWriter("out.csv") { write(csv.toString); close }

This is arguably the cleanest solution so far. Unfortunately the output isn’t quite right(!), as there currently seems to be a bug in Csv.fromFrame which causes the ordering of columns to get out of sync with the ordering of the column headers. Presumably this bug will soon be fixed, and if not it is easy to write a CSV writer for these frames, as I did above for scala-datatable.

Spark DataFrames

The three data frames considered so far are all standard single-machine, non-distributed, in-memory objects. The Scala data frame implementation currently subject to the most social media buzz is a different beast entirely. A DataFrame object has recently been added to Apache Spark. I’ve already discussed the problems of first developing a data analysis library without data frames and then attempting to bolt a data frame object on top post-hoc. Spark has repeated this mistake, but it’s still much better to have a data frame in Spark than not. Spark is a Scala framework for the distributed processing and analysis of huge datasets on a cluster. I will discuss it further in future posts. If you have a legitimate need for this kind of set-up, then Spark is a pretty impressive piece of technology (though note that there are competitors, such as flink). However, for datasets that can be analysed on a single machine, then Spark seems like a rather slow and clunky sledgehammer to crack a nut. So, for datasets in the terabyte range and above, Spark DataFrames are great, but for datasets smaller than a few gigs, it’s probably not the best solution. With those caveats in mind, here’s how to solve our problem using Spark DataFrames (and the spark-csv library) in the Spark Shell:

val df = sqlContext.read.format("com.databricks.spark.csv").
                         option("header", "true").
                         option("inferSchema","true").
                         load("cars93.csv")
val df2=df.filter("EngineSize <= 4.0")
val col=df2.col("Weight")*0.453592
val df3=df2.withColumn("WeightKG",col)
df3.write.format("com.databricks.spark.csv").
                         option("header","true").
                         save("out-csv")

Summary

If you really need a distributed data frame library, then you will probably want to use Spark. However, for the vast majority of statistical modelling and data science tasks, Spark is likely to be unnecessarily complex and heavyweight. The other three libraries considered all have pros and cons. They are all largely one-person hobby projects, quite immature, and not currently under very active development. Saddle is fine for when you just want to add column headings to a matrix. Scala-datatable is lightweight and immutable, if you don’t care about missing values. On balance, I think Framian is probably the most full-featured “batteries included” R-like data frame, and so is likely to be most attractive to statisticians and data scientists. However, it’s pretty immature, and the dependence on shapeless may be of concern to those who prefer libraries to be lean and devoid of sorcery!

I’d be really interested to know of other people’s experiences of these libraries, so please do comment if you have any views, and especially if you have opinions on the relative merits of the different libraries.

The full source code for all of these examples, including sbt build files, can be found in a new github repo I’ve created for the code examples associated with this blog.

Calling R from Scala sbt projects using rscala

Overview

In the previous post I showed how the rscala package (which has replaced the jvmr package) can be used to call Scala code from within R. In this post I will show how to call R from Scala code. I have previously described how to do this using jvmr. This post is really just an update to show how things work with rscala.

Since I’m focusing here on Scala sbt projects, I’m assuming that sbt is installed, in addition to rscala (described in the previous post). The only “trick” required for calling back to R from Scala is telling sbt where the rscala jar file is located. You can find the location from the R console as illustrated by the following session:

> library(rscala)
> rscala::rscalaJar("2.11")
[1] "/home/ndjw1/R/x86_64-pc-linux-gnu-library/3.2/rscala/java/rscala_2.11-1.0.6.jar"

This location (which will obviously be different for you) can then be added in to your sbt classpath by adding the following line to your build.sbt file:

unmanagedJars in Compile += file("/home/ndjw1/R/x86_64-pc-linux-gnu-library/3.2/rscala/java/rscala_2.11-1.0.6.jar")

Once this is done, calling out to R from your Scala sbt project can be carried out as described in the rscala documentation. For completeness, a working example is given below.

Example

In this example I will use Scala to simulate some data consistent with a Poisson regression model, and then push the data to R to fit it using the R function glm(), and then pull back the fitted regression coefficients into Scala. This is obviously a very artificial example, but the point is to show how it is possible to call back to R for some statistical procedure that may be “missing” from Scala.

The dependencies for this project are described in the file build.sbt

name := "rscala test"

version := "0.1"

scalacOptions ++= Seq("-unchecked", "-deprecation", "-feature")

libraryDependencies  ++= Seq(
            "org.scalanlp" %% "breeze" % "0.10",
            "org.scalanlp" %% "breeze-natives" % "0.10"
)

resolvers ++= Seq(
            "Sonatype Snapshots" at "https://oss.sonatype.org/content/repositories/snapshots/",
            "Sonatype Releases" at "https://oss.sonatype.org/content/repositories/releases/"
)

unmanagedJars in Compile += file("/home/ndjw1/R/x86_64-pc-linux-gnu-library/3.2/rscala/java/rscala_2.11-1.0.6.jar")

scalaVersion := "2.11.6"

The complete Scala program is contained in the file PoisReg.scala

import org.ddahl.rscala.callback._
import breeze.stats.distributions._
import breeze.linalg._

object ScalaToRTest {

  def main(args: Array[String]) = {

    // first simulate some data consistent with a Poisson regression model
    val x = Uniform(50,60).sample(1000)
    val eta = x map { xi => (xi * 0.1) - 3 }
    val mu = eta map { math.exp(_) }
    val y = mu map { Poisson(_).draw }
    
    // call to R to fit the Poisson regression model
    val R = RClient() // initialise an R interpreter
    R.x=x.toArray // send x to R
    R.y=y.toArray // send y to R
    R.eval("mod <- glm(y~x,family=poisson())") // fit the model in R
    // pull the fitted coefficients back into Scala
    val beta = DenseVector[Double](R.evalD1("mod$coefficients"))

    // print the fitted coefficients
    println(beta)

  }

}

If these two files are put in an empty directory, the code can be compiled and run by typing sbt run from the command prompt in the relevant directory. The commented code should be self-explanatory, but see the rscala documentation for further details. In particular, the rscala scaladoc is useful.

Calling Scala code from R using rscala

Introduction

In a previous post I looked at how to call Scala code from R using a CRAN package called jvmr. This package now seems to have been replaced by a new package called rscala. Like the old package, it requires a pre-existing Java installation. Unlike the old package, however, it no longer depends on rJava, which may simplify some installations. The rscala package is well documented, with a reference manual and a draft paper. In this post I will concentrate on the issue of calling sbt-based projects with dependencies on external libraries (such as breeze).

On a system with Java installed, it should be possible to install the rscala package with a simple

install.packages("rscala")

from the R command prompt. Calling

library(rscala)

will check that it has worked. The package will do a sensible search for a Scala installation and use it if it can find one. If it can’t find one (or can only find an installation older than 2.10.x), it will fail. In this case you can download and install a Scala installation specifically for rscala using the command

rscala::scalaInstall()

This option is likely to be attractive to sbt (or IDE) users who don’t like to rely on a system-wide scala installation.

A Gibbs sampler in Scala using Breeze

For illustration I’m going to use a Scala implementation of a Gibbs sampler. The Scala code, gibbs.scala is given below:

package gibbs

object Gibbs {

    import scala.annotation.tailrec
    import scala.math.sqrt
    import breeze.stats.distributions.{Gamma,Gaussian}

    case class State(x: Double, y: Double) {
      override def toString: String = x.toString + " , " + y + "\n"
    }

    def nextIter(s: State): State = {
      val newX = Gamma(3.0, 1.0/((s.y)*(s.y)+4.0)).draw
      State(newX, Gaussian(1.0/(newX+1), 1.0/sqrt(2*newX+2)).draw)
    }

    @tailrec def nextThinnedIter(s: State,left: Int): State =
      if (left==0) s else nextThinnedIter(nextIter(s),left-1)

    def genIters(s: State, stop: Int, thin: Int): List[State] = {
      @tailrec def go(s: State, left: Int, acc: List[State]): List[State] =
        if (left>0)
          go(nextThinnedIter(s,thin), left-1, s::acc)
          else acc
      go(s,stop,Nil).reverse
    }

    def main(args: Array[String]) = {
      if (args.length != 3) {
        println("Usage: sbt \"run <outFile> <iters> <thin>\"")
        sys.exit(1)
      } else {
        val outF=args(0)
        val iters=args(1).toInt
        val thin=args(2).toInt
        val out = genIters(State(0.0,0.0),iters,thin)
        val s = new java.io.FileWriter(outF)
        s.write("x , y\n")
        out map { it => s.write(it.toString) }
        s.close
      }
    }

}

This code requires Scala and the Breeze scientific library in order to build. We can specify this in a sbt build file, which should be called build.sbt and placed in the same directory as the Scala code.

name := "gibbs"

version := "0.1"

scalacOptions ++= Seq("-unchecked", "-deprecation", "-feature")

libraryDependencies  ++= Seq(
            "org.scalanlp" %% "breeze" % "0.10",
            "org.scalanlp" %% "breeze-natives" % "0.10"
)

resolvers ++= Seq(
            "Sonatype Snapshots" at "https://oss.sonatype.org/content/repositories/snapshots/",
            "Sonatype Releases" at "https://oss.sonatype.org/content/repositories/releases/"
)

scalaVersion := "2.11.6"

Now, from a system command prompt in the directory where the files are situated, it should be possible to download all dependencies and compile and run the code with a simple

sbt "run output.csv 50000 1000"

sbt magically manages all of the dependencies for us so that we don’t have to worry about them. However, for calling from R, it may be desirable to run the code without running sbt. There are several ways to achieve this, but the simplest is to build an “assembly jar” or “fat jar”, which is a Java byte-code file containing all code and libraries required in order to run the code on any system with a Java installation.

To build an assembly jar first create a subdirectory called project (the name matters), and in it place two files. The first should be called assembly.sbt, and should contain the line

addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.13.0")

Since the version of the assembly tool can depend on the version of sbt, it is also best to fix the version of sbt being used by creating another file in the project directory called build.properties, which should contain the line

sbt.version=0.13.7

Now return to the parent directory and run

sbt assembly

If this works, it should create a fat jar target/scala-2.11/gibbs-assembly-0.1.jar. You can check it works by running

java -jar target/scala-2.11/gibbs-assembly-0.1.jar output.csv 10000 10

Assuming that it does, you are now ready to try running the code from within R.

Calling via R system calls

Since this code takes a relatively long time to run, calling it from R via simple system calls isn’t a particularly terrible idea. For example, we can do this from the R command prompt with the following commands

system("java -jar target/scala-2.11/gibbs-assembly-0.1.jar output.csv 50000 1000")
out=read.csv("output.csv")
library(smfsb)
mcmcSummary(out,rows=2)

This works fine, but is a bit clunky. Tighter integration between R and Scala would be useful, which is where rscala comes in.

Calling assembly Scala projects via rscala

rscala provides a very simple way to embed a Scala interpreter within an R session, to be able to execute Scala expressions from R and to have the results returned back to the R session for further processing. The main issue with using this in practice is managing dependencies on external libraries and setting the Scala classpath correctly. By using an assembly jar we can bypass most of these issues, and it becomes trivial to call our Scala code direct from the R interpreter, as the following code illustrates.

library(rscala)
sc=scalaInterpreter("target/scala-2.11/gibbs-assembly-0.1.jar")
sc%~%'import gibbs.Gibbs._'
out=sc%~%'genIters(State(0.0,0.0),50000,1000).toArray.map{s=>Array(s.x,s.y)}'
library(smfsb)
mcmcSummary(out,rows=2)

Here we call the genIters function directly, rather than via the main method. This function returns an immutable List of States. Since R doesn’t understand this, we map it to an Array of Arrays, which R then unpacks into an R matrix for us to store in the matrix out.

Summary

The CRAN package rscala makes it very easy to embed a Scala interpreter within an R session. However, for most non-trivial statistical computing problems, the Scala code will have dependence on external scientific libraries such as Breeze. The standard way to easily manage external dependencies in the Scala ecosystem is sbt. Given an sbt-based Scala project, it is easy to generate an assembly jar in order to initialise the rscala Scala interpreter with the classpath needed to call arbitrary Scala functions. This provides very convenient inter-operability between R and Scala for many statistical computing applications.

Calling R from Scala sbt projects

[Update: The jvmr package has been replaced by the rscala package. There is a new version of this post which replaces this one.]

Overview

In previous posts I’ve shown how the jvmr CRAN R package can be used to call Scala sbt projects from R and inline Scala Breeze code in R. In this post I will show how to call to R from a Scala sbt project. This requires that R and the jvmr CRAN R package are installed on your system, as described in the previous posts. Since I’m focusing here on Scala sbt projects, I’m also assuming that sbt is installed.

The only “trick” required for calling back to R from Scala is telling sbt where the jvmr jar file is located. You can find the location from the R console as illustrated by the following session:

> library(jvmr)
> .jvmr.jar
[1] "/home/ndjw1/R/x86_64-pc-linux-gnu-library/3.1/jvmr/java/jvmr_2.11-2.11.2.1.jar"

This location (which will obviously be different for you) can then be added in to your sbt classpath by adding the following line to your build.sbt file:

unmanagedJars in Compile += file("/home/ndjw1/R/x86_64-pc-linux-gnu-library/3.1/jvmr/java/jvmr_2.11-2.11.2.1.jar")

Once this is done, calling out to R from your Scala sbt project can be carried out as described in the jvmr documentation. For completeness, a working example is given below.

Example

In this example I will use Scala to simulate some data consistent with a Poisson regression model, and then push the data to R to fit it using the R function glm(), and then pull back the fitted regression coefficients into Scala. This is obviously a very artificial example, but the point is to show how it is possible to call back to R for some statistical procedure that may be “missing” from Scala.

The dependencies for this project are described in the file build.sbt

name := "jvmr test"

version := "0.1"

scalacOptions ++= Seq("-unchecked", "-deprecation", "-feature")

libraryDependencies  ++= Seq(
            "org.scalanlp" %% "breeze" % "0.10",
            "org.scalanlp" %% "breeze-natives" % "0.10"
)

resolvers ++= Seq(
            "Sonatype Snapshots" at "https://oss.sonatype.org/content/repositories/snapshots/",
            "Sonatype Releases" at "https://oss.sonatype.org/content/repositories/releases/"
)

unmanagedJars in Compile += file("/home/ndjw1/R/x86_64-pc-linux-gnu-library/3.1/jvmr/java/jvmr_2.11-2.11.2.1.jar")

scalaVersion := "2.11.2"

The complete Scala program is contained in the file PoisReg.scala

import org.ddahl.jvmr.RInScala
import breeze.stats.distributions._
import breeze.linalg._

object ScalaToRTest {

  def main(args: Array[String]) = {

    // first simulate some data consistent with a Poisson regression model
    val x = Uniform(50,60).sample(1000)
    val eta = x map { xi => (xi * 0.1) - 3 }
    val mu = eta map { math.exp(_) }
    val y = mu map { Poisson(_).draw }
    
    // call to R to fit the Poisson regression model
    val R = RInScala() // initialise an R interpreter
    R.x=x.toArray // send x to R
    R.y=y.toArray // send y to R
    R.eval("mod <- glm(y~x,family=poisson())") // fit the model in R
    // pull the fitted coefficients back into Scala
    val beta = DenseVector[Double](R.toVector[Double]("mod$coefficients"))

    // print the fitted coefficients
    println(beta)

  }

}

If these two files are put in an empty directory, the code can be compiled and run by typing sbt run from the command prompt in the relevant directory. The commented code should be self-explanatory, but see the jvmr documentation for further details.

Inlining Scala Breeze code in R using jvmr and sbt

[Update: The CRAN package “jvmr” has been replaced by a new package “rscala”. Rather than completely re-write this post, I’ve just created a github gist containing a new function, breezeInterpreter(), which works similarly to the function breezeInit() in this post. Usage information is given at the top of the gist.]

Introduction

In the previous post I showed how to call Scala code from R using sbt and jvmr. The approach described in that post is the one I would recommend for any non-trivial piece of Scala code – mixing up code from different languages in the same source code file is not a good strategy in general. That said, for very small snippets of code, it can sometimes be convenient to inline Scala code directly into an R source code file. The canonical example of this is a computationally intensive algorithm being prototyped in R which has a slow inner loop. If the inner loop can be recoded in a few lines of Scala, it would be nice to just inline this directly into the R code without having to create a separate Scala project. The CRAN package jvmr provides a very simple and straightforward way to do this. However, as discussed in the last post, most Scala code for statistical computing (even short and simple code) is likely to rely on Breeze for special functions, probability distributions, non-uniform random number generation, numerical linear algebra, etc. In this post we will see how to use sbt in order to make sure that the Breeze library and all of its dependencies are downloaded and cached, and to provide a correct classpath with which to initialise a jvmr scalaInterpreter session.

Setting up

Configuring your system to be able to inline Scala Breeze code is very easy. You just need to install Java, R and sbt. Then install the CRAN R package jvmr. At this point you have everything you need except for the R function breezeInit, given at the end of this post. I’ve deferred the function to the end of the post as it is a bit ugly, and the details of it are not important. All it does is get sbt to ensure that Breeze is correctly downloaded and cached and then starts a scalaInterpreter with Breeze on the classpath. With this function available, we can use it within an R session as the following R session illustrates:

> b=breezeInit()
> b['import breeze.stats.distributions._']
NULL
> b['Poisson(10).sample(20).toArray']
 [1] 13 14 13 10  7  6 15 14  5 10 14 11 15  8 11 12  6  7  5  7
> summary(b['Gamma(3,2).sample(10000).toArray'])
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
 0.2124  3.4630  5.3310  5.9910  7.8390 28.5200 
>

So we see that Scala Breeze code can be inlined directly into an R session, and if we are careful about return types, have the results of Scala expressions automatically unpack back into convenient R data structures.

Summary

In this post I have shown how easy it is to inline Scala Breeze code into R using sbt in conjunction with the CRAN package jvmr. This has many potential applications, with the most obvious being the desire to recode slow inner loops from R to Scala. This should give performance quite comparable with alternatives such as Rcpp, with the advantage being that you get to write beautiful, elegant, functional Scala code instead of horrible, ugly, imperative C++ code! 😉

The breezeInit function

The actual breezeInit() function is given below. It is a little ugly, but very simple. It is obviously easy to customise for different libraries and library versions as required. All of the hard work is done by sbt which must be installed and on the default system path in order for this function to work.

breezeInit<-function()
{
  library(jvmr)
  sbtStr="name := \"tmp\"

version := \"0.1\"

libraryDependencies  ++= Seq(
            \"org.scalanlp\" %% \"breeze\" % \"0.10\",
            \"org.scalanlp\" %% \"breeze-natives\" % \"0.10\"
)

resolvers ++= Seq(
            \"Sonatype Snapshots\" at \"https://oss.sonatype.org/content/repositories/snapshots/\",
            \"Sonatype Releases\" at \"https://oss.sonatype.org/content/repositories/releases/\"
)

scalaVersion := \"2.11.2\"

lazy val printClasspath = taskKey[Unit](\"Dump classpath\")

printClasspath := {
  (fullClasspath in Runtime value) foreach {
    e => print(e.data+\"!\")
  }
}
"
  tmps=file(file.path(tempdir(),"build.sbt"),"w")
  cat(sbtStr,file=tmps)
  close(tmps)
  owd=getwd()
  setwd(tempdir())
  cpstr=system2("sbt","printClasspath",stdout=TRUE)
  cpst=cpstr[length(cpstr)]
  cpsp=strsplit(cpst,"!")[[1]]
  cp=cpsp[2:(length(cpsp)-1)]
  si=scalaInterpreter(cp,use.jvmr.class.path=FALSE)
  setwd(owd)
  si
}

Calling Scala code from R using jvmr

[Update: the jvmr package has been replaced by a new package called rscala. I have a new post which explains it.]

Introduction

In previous posts I have explained why I think that Scala is a good language to use for statistical computing and data science. Despite this, R is very convenient for simple exploratory data analysis and visualisation – currently more convenient than Scala. I explained in my recent talk at the RSS what (relatively straightforward) things would need to be developed for Scala in order to make R completely redundant, but for the short term at least, it seems likely that I will need to use both R and Scala for my day-to-day work.

Since I use both Scala and R for statistical computing, it is very convenient to have a degree of interoperability between the two languages. I could call R from Scala code or Scala from R code, or both. Fortunately, some software tools have been developed recently which make this much simpler than it used to be. The software is jvmr, and as explained at the website, it enables calling Java and Scala from R and calling R from Java and Scala. I have previously discussed calling Java from R using the R CRAN package rJava. In this post I will focus on calling Scala from R using the CRAN package jvmr, which depends on rJava. I may examine calling R from Scala in a future post.

On a system with Java installed, it should be possible to install the jvmr R package with a simple

install.packages("jvmr")

from the R command prompt. The package has the usual documentation associated with it, but the draft paper describing the package is the best way to get an overview of its capabilities and a walk-through of simple usage.

A Gibbs sampler in Scala using Breeze

For illustration I’m going to use a Scala implementation of a Gibbs sampler which relies on the Breeze scientific library, and will be built using the simple build tool, sbt. Most non-trivial Scala projects depend on various versions of external libraries, and sbt is an easy way to build even very complex projects trivially on any system with Java installed. You don’t even need to have Scala installed in order to build and run projects using sbt. I give some simple complete worked examples of building and running Scala sbt projects in the github repo associated with my recent RSS talk. Installing sbt is trivial as explained in the repo READMEs.

For this post, the Scala code, gibbs.scala is given below:

package gibbs

object Gibbs {

    import scala.annotation.tailrec
    import scala.math.sqrt
    import breeze.stats.distributions.{Gamma,Gaussian}

    case class State(x: Double, y: Double) {
      override def toString: String = x.toString + " , " + y + "\n"
    }

    def nextIter(s: State): State = {
      val newX = Gamma(3.0, 1.0/((s.y)*(s.y)+4.0)).draw
      State(newX, Gaussian(1.0/(newX+1), 1.0/sqrt(2*newX+2)).draw)
    }

    @tailrec def nextThinnedIter(s: State,left: Int): State =
      if (left==0) s else nextThinnedIter(nextIter(s),left-1)

    def genIters(s: State, stop: Int, thin: Int): List[State] = {
      @tailrec def go(s: State, left: Int, acc: List[State]): List[State] =
        if (left>0)
          go(nextThinnedIter(s,thin), left-1, s::acc)
          else acc
      go(s,stop,Nil).reverse
    }

    def main(args: Array[String]) = {
      if (args.length != 3) {
        println("Usage: sbt \"run <outFile> <iters> <thin>\"")
        sys.exit(1)
      } else {
        val outF=args(0)
        val iters=args(1).toInt
        val thin=args(2).toInt
        val out = genIters(State(0.0,0.0),iters,thin)
        val s = new java.io.FileWriter(outF)
        s.write("x , y\n")
        out map { it => s.write(it.toString) }
        s.close
      }
    }

}

This code requires Scala and the Breeze scientific library in order to build. We can specify this in a sbt build file, which should be called build.sbt and placed in the same directory as the Scala code.

name := "gibbs"

version := "0.1"

scalacOptions ++= Seq("-unchecked", "-deprecation", "-feature")

libraryDependencies  ++= Seq(
            "org.scalanlp" %% "breeze" % "0.10",
            "org.scalanlp" %% "breeze-natives" % "0.10"
)

resolvers ++= Seq(
            "Sonatype Snapshots" at "https://oss.sonatype.org/content/repositories/snapshots/",
            "Sonatype Releases" at "https://oss.sonatype.org/content/repositories/releases/"
)

scalaVersion := "2.11.2"

Now, from a system command prompt in the directory where the files are situated, it should be possible to download all dependencies and compile and run the code with a simple

sbt "run output.csv 50000 1000"

Calling via R system calls

Since this code takes a relatively long time to run, calling it from R via simple system calls isn’t a particularly terrible idea. For example, we can do this from the R command prompt with the following commands

system("sbt \"run output.csv 50000 1000\"")
out=read.csv("output.csv")
library(smfsb)
mcmcSummary(out,rows=2)

This works fine, but won’t work so well for code which needs to be called repeatedly. For this, tighter integration between R and Scala would be useful, which is where jvmr comes in.

Calling sbt-based Scala projects via jvmr

jvmr provides a very simple way to embed a Scala interpreter within an R session, allowing Scala expressions to be executed from R and the results returned to the R session for further processing. The main issue with using this in practice is managing dependencies on external libraries and setting the Scala classpath correctly. For an sbt project such as the one we are considering here, it is relatively easy to get sbt to provide us with all of the information we need in a fully automated way.

First, we need to add a new task to our sbt build instructions, which will output the full classpath in a way that is easy to parse from R. Just add the following to the end of the file build.sbt:

lazy val printClasspath = taskKey[Unit]("Dump classpath")

printClasspath := {
  (fullClasspath in Runtime value) foreach {
    e => print(e.data+"!")
  }
}

Be aware that blank lines are significant in sbt build files. Once we have this in our build file, we can write a small R function to get the classpath from sbt and then initialise a jvmr scalaInterpreter with the correct full classpath needed for the project. An R function which does this, sbtInit(), is given below

sbtInit<-function()
{
  library(jvmr)
  system2("sbt","compile")
  cpstr=system2("sbt","printClasspath",stdout=TRUE)
  cpst=cpstr[length(cpstr)]
  cpsp=strsplit(cpst,"!")[[1]]
  cp=cpsp[1:(length(cpsp)-1)]
  scalaInterpreter(cp,use.jvmr.class.path=FALSE)
}

With this function at our disposal, it becomes trivial to call our Scala code directly from the R interpreter, as the following code illustrates.

sc=sbtInit()
sc['import gibbs.Gibbs._']
out=sc['genIters(State(0.0,0.0),50000,1000).toArray.map{s=>Array(s.x,s.y)}']
library(smfsb)
mcmcSummary(out,rows=2)

Here we call the genIters function directly, rather than via the main method. This function returns an immutable List of States. Since R doesn’t understand this, we map it to an Array of Arrays, which R then unpacks into an R matrix for us to store in the matrix out.
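
As a small aside, since the sc[...] call just takes a character string, the Scala expression can be constructed programmatically from R. A minimal (hypothetical) sketch, assuming the sc interpreter created by sbtInit() above:

iters=10000; thin=100
expr=paste0("genIters(State(0.0,0.0),",iters,",",thin,").toArray.map{s=>Array(s.x,s.y)}")
out=sc[expr]   # same call as before, but with the parameters spliced in from R
dim(out)       # should give iters rows and 2 columns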

Summary

The CRAN package jvmr makes it very easy to embed a Scala interpreter within an R session. However, for most non-trivial statistical computing problems, the Scala code will have dependence on external scientific libraries such as Breeze. The standard way to easily manage external dependencies in the Scala ecosystem is sbt. Given an sbt-based Scala project, it is easy to add a task to the sbt build file and a function to R in order to initialise the jvmr Scala interpreter with the full classpath needed to call arbitrary Scala functions. This provides very convenient inter-operability between R and Scala for many statistical computing applications.

One-way ANOVA with fixed and random effects from a Bayesian perspective

This blog post is derived from a computer practical session that I ran as part of my new course on Statistics for Big Data, previously discussed. This course covered a lot of material very quickly. In particular, I deferred introducing notions of hierarchical modelling until the Bayesian part of the course, where I feel it is more natural and powerful. However, some of the terminology associated with hierarchical statistical modelling probably seems a bit mysterious to those without a strong background in classical statistical modelling, and so this practical session was intended to clear up some potential confusion. I will analyse a simple one-way Analysis of Variance (ANOVA) model from a Bayesian perspective, making sure to highlight the difference between fixed and random effects in a Bayesian context where everything is random, as well as emphasising the associated identifiability issues. R code is used to illustrate the ideas.

Example scenario

We will consider the body mass index (BMI) of new male undergraduate students at a selection of UK Universities. Let us suppose that our data consist of measurements of (log) BMI for a random sample of 1,000 males at each of 8 Universities. We are interested to know if there are any differences between the Universities. Again, we want to model the process as we would simulate it, so thinking about how we would simulate such data is instructive. We start by assuming that the log BMI is a normal random quantity, and that the variance is common across the Universities in question (this is quite a big assumption, and it is easy to relax). We assume that the mean of this normal distribution is University-specific, but that we do not have strong prior opinions regarding the way in which the Universities differ. That said, we expect that the Universities would not be very different from one another.

Simulating data

A simple simulation of the data with some plausible parameters can be carried out as follows.

set.seed(1)
Z=matrix(rnorm(1000*8,3.1,0.1),nrow=8)   # 8 x 1000 matrix of log-BMI values centred on 3.1
RE=rnorm(8,0,0.01)                       # one small "effect" per University
X=t(Z+RE)                                # 1000 x 8 data matrix, one column per University
colnames(X)=paste("Uni",1:8,sep="")
Data=stack(data.frame(X))                # two-column (value, group) format
boxplot(exp(values)~ind,data=Data,notch=TRUE)  # boxplots on the original BMI scale

Make sure that you understand exactly what this code is doing before proceeding. The boxplot showing the simulated data is given below.

Boxplot of simulated data

Frequentist analysis

We will start with a frequentist analysis of the data. The model we would like to fit is

y_{ij} = \mu + \theta_i + \varepsilon_{ij}

where i is an indicator for the University and j for the individual within a particular University. The “effect” \theta_i represents how the ith University differs from the overall mean. We know that this model is not actually identifiable when the model parameters are all treated as “fixed effects”, but R will handle this for us.

> mod=lm(values~ind,data=Data)
> summary(mod)

Call:
lm(formula = values ~ ind, data = Data)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.36846 -0.06778 -0.00069  0.06910  0.38219 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  3.101068   0.003223 962.244  < 2e-16 ***
indUni2     -0.006516   0.004558  -1.430 0.152826    
indUni3     -0.017168   0.004558  -3.767 0.000166 ***
indUni4      0.017916   0.004558   3.931 8.53e-05 ***
indUni5     -0.022838   0.004558  -5.011 5.53e-07 ***
indUni6     -0.001651   0.004558  -0.362 0.717143    
indUni7      0.007935   0.004558   1.741 0.081707 .  
indUni8      0.003373   0.004558   0.740 0.459300    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.1019 on 7992 degrees of freedom
Multiple R-squared:  0.01439,	Adjusted R-squared:  0.01353 
F-statistic: 16.67 on 7 and 7992 DF,  p-value: < 2.2e-16

We see that R has handled the identifiability problem using “treatment contrasts”, dropping the fixed effect for the first University, so that the intercept actually represents the mean value for the first University, and the effects for the other Universities represent the differences from the first University. If we would prefer to impose a sum constraint, then we can switch to sum contrasts with

options(contrasts=rep("contr.sum",2))

and then re-fit the model.

> mods=lm(values~ind,data=Data)
> summary(mods)

Call:
lm(formula = values ~ ind, data = Data)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.36846 -0.06778 -0.00069  0.06910  0.38219 

Coefficients:
              Estimate Std. Error  t value Pr(>|t|)    
(Intercept)  3.0986991  0.0011394 2719.558  < 2e-16 ***
ind1         0.0023687  0.0030146    0.786 0.432048    
ind2        -0.0041477  0.0030146   -1.376 0.168905    
ind3        -0.0147997  0.0030146   -4.909 9.32e-07 ***
ind4         0.0202851  0.0030146    6.729 1.83e-11 ***
ind5        -0.0204693  0.0030146   -6.790 1.20e-11 ***
ind6         0.0007175  0.0030146    0.238 0.811889    
ind7         0.0103039  0.0030146    3.418 0.000634 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.1019 on 7992 degrees of freedom
Multiple R-squared:  0.01439,	Adjusted R-squared:  0.01353 
F-statistic: 16.67 on 7 and 7992 DF,  p-value: < 2.2e-16

This has 7 degrees of freedom for the effects, as before, but ensures that the 8 effects sum to precisely zero. This is arguably more interpretable in this case.
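
As a quick sanity check (hypothetical helper code, not part of the original analysis), we can verify from the two fits above that the treatment-contrast intercept matches the Uni1 sample mean, and that the sum-contrast effects (the 7 reported plus the implied 8th) sum to zero, with an intercept equal to the overall mean.

coef(mod)[1]; mean(X[,1])                  # treatment contrasts: intercept = Uni1 mean
eff=c(coef(mods)[-1],-sum(coef(mods)[-1])) # all 8 sum-contrast effects (8th implied)
eff
sum(eff)                                   # numerically zero
coef(mods)[1]; mean(X)                     # sum contrasts: intercept = overall mean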

Bayesian analysis

We will now analyse the simulated data from a Bayesian perspective, using JAGS.

Fixed effects

All parameters in Bayesian models are uncertain, and therefore random, so there is much confusion regarding the difference between “fixed” and “random” effects in a Bayesian context. For “fixed” effects, our prior captures the idea that we sample the effects independently from a “fixed” (typically vague) prior distribution. We could simply code this up and fit it in JAGS as follows.

require(rjags)
n=dim(X)[1]
p=dim(X)[2]
data=list(X=X,n=n,p=p)
init=list(mu=2,tau=1)
modelstring="
  model {
    for (j in 1:p) {
      theta[j]~dnorm(0,0.0001)
      for (i in 1:n) {
        X[i,j]~dnorm(mu+theta[j],tau)
      }
    }
    mu~dnorm(0,0.0001)
    tau~dgamma(1,0.0001)
  }
"
model=jags.model(textConnection(modelstring),data=data,inits=init)
update(model,n.iter=1000)
output=coda.samples(model=model,variable.names=c("mu","tau","theta"),n.iter=100000,thin=10)
print(summary(output))
plot(output)
autocorr.plot(output)
pairs(as.matrix(output))
crosscorr.plot(output)

On running the code we can clearly see that this naive approach leads to high posterior correlation between the mean and the effects, due to the fundamental lack of identifiability of the model. This also leads to MCMC mixing problems, but it is important to understand that this computational issue is conceptually entirely separate from the fundamental statistical identifiability issue. Even if we could avoid MCMC entirely, the identifiability issue would remain.
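
To see the lack of identifiability directly, note that adding a constant to mu and subtracting the same constant from every theta[j] leaves the likelihood unchanged. A minimal numerical illustration (hypothetical helper code, using the simulated data matrix X from above):

loglik=function(mu,theta,prec)
  sum(sapply(1:ncol(X),function(j) sum(dnorm(X[,j],mu+theta[j],1/sqrt(prec),log=TRUE))))
theta0=rnorm(8,0,0.01)
loglik(3.1,theta0,100)
loglik(3.6,theta0-0.5,100)   # identical value: the data cannot separate mu from the thetas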

A quick fix for the identifiability issue is to use “treatment contrasts”, just as for the frequentist model. We can implement that as follows.

data=list(X=X,n=n,p=p)
init=list(mu=2,tau=1)
modelstring="
  model {
    for (j in 1:p) {
      for (i in 1:n) {
        X[i,j]~dnorm(mu+theta[j],tau)
      }
    }
    theta[1]<-0
    for (j in 2:p) {
      theta[j]~dnorm(0,0.0001)
    }
    mu~dnorm(0,0.0001)
    tau~dgamma(1,0.0001)
  }
"
model=jags.model(textConnection(modelstring),data=data,inits=init)
update(model,n.iter=1000)
output=coda.samples(model=model,variable.names=c("mu","tau","theta"),n.iter=100000,thin=10)
print(summary(output))
plot(output)
autocorr.plot(output)
pairs(as.matrix(output))
crosscorr.plot(output)

Running this we see that the model now works perfectly well, mixes nicely, and gives sensible inferences for the treatment effects.
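
If we want to quantify the improvement in mixing, the coda package (already loaded by rjags) provides some standard diagnostics. A minimal check, assuming the coda output from the run above:

library(coda)
effectiveSize(output)   # effective sample sizes for each monitored parameter
crosscorr(output)       # cross-correlations should now be modest (theta[1] is fixed at zero)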

Another source of confusion for models of this type is data formatting and indexing in JAGS models. For our balanced data there was no problem passing the data to JAGS as a matrix and specifying the model using nested loops. However, for unbalanced designs this is not necessarily so convenient, and it can then be helpful to specify the model based on two-column data, as we would use for fitting with lm(). This is illustrated with the following model specification, which is exactly equivalent to the previous model and should give identical results (up to Monte Carlo error).

N=n*p
data=list(y=Data$values,g=Data$ind,N=N,p=p)
init=list(mu=2,tau=1)
modelstring="
  model {
    for (i in 1:N) {
      y[i]~dnorm(mu+theta[g[i]],tau)
    }
    theta[1]<-0
    for (j in 2:p) {
      theta[j]~dnorm(0,0.0001)
    }
    mu~dnorm(0,0.0001)
    tau~dgamma(1,0.0001)
  }
"
model=jags.model(textConnection(modelstring),data=data,inits=init)
update(model,n.iter=1000)
output=coda.samples(model=model,variable.names=c("mu","tau","theta"),n.iter=100000,thin=10)
print(summary(output))
plot(output)

As suggested above, this indexing scheme is much more convenient for unbalanced data, and hence widely used. However, since our data is balanced here, we will revert to the matrix approach for the remainder of the post.

One final thing to consider before moving on to random effects is the sum-contrast model. We can implement this in various ways, but I’ve tried to encode it for maximum clarity below, imposing the sum-to-zero constraint via the final effect.

data=list(X=X,n=n,p=p)
init=list(mu=2,tau=1)
modelstring="
  model {
    for (j in 1:p) {
      for (i in 1:n) {
        X[i,j]~dnorm(mu+theta[j],tau)
      }
    }
    for (j in 1:(p-1)) {
      theta[j]~dnorm(0,0.0001)
    }
    theta[p] <- -sum(theta[1:(p-1)])
    mu~dnorm(0,0.0001)
    tau~dgamma(1,0.0001)
  }
"
model=jags.model(textConnection(modelstring),data=data,inits=init)
update(model,n.iter=1000)
output=coda.samples(model=model,variable.names=c("mu","tau","theta"),n.iter=100000,thin=10)
print(summary(output))
plot(output)

Again, this works perfectly well and gives similar results to the frequentist analysis.
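
For a rough comparison (not in the original analysis) of the Bayesian sum-constrained fit with the frequentist sum-contrast fit, we can set the posterior means of the effects alongside the lm() estimates, assuming the coda output from the run above and the earlier fit mods:

post=colMeans(as.matrix(output))
post[grep("theta",names(post))]            # posterior means of the 8 effects
c(coef(mods)[-1],-sum(coef(mods)[-1]))     # lm() sum-contrast effects, 8th implied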

Random effects

The key difference between fixed and random effects in a Bayesian framework is that random effects are not independent, being drawn from a distribution with parameters which are not fixed. Essentially, there is another level of hierarchy involved in the specification of the random effects. This is best illustrated by example. A random effects model for this problem is given below.

data=list(X=X,n=n,p=p)
init=list(mu=2,tau=1)
modelstring="
  model {
    for (j in 1:p) {
      theta[j]~dnorm(0,taut)
      for (i in 1:n) {
        X[i,j]~dnorm(mu+theta[j],tau)
      }
    }
    mu~dnorm(0,0.0001)
    tau~dgamma(1,0.0001)
    taut~dgamma(1,0.0001)
  }
"
model=jags.model(textConnection(modelstring),data=data,inits=init)
update(model,n.iter=1000)
output=coda.samples(model=model,variable.names=c("mu","tau","taut","theta"),n.iter=100000,thin=10)
print(summary(output))
plot(output)

The only difference between this and our first naive attempt at a Bayesian fixed effects model is that we have put a gamma prior on the precision of the effects. Note that this model now runs and fits perfectly well, with reasonable mixing, and gives sensible parameter inferences. Although the effects here are not constrained to sum to zero, as they are under sum contrasts for a fixed effects model, the prior encourages shrinkage towards zero, and so the random effect distribution can be thought of as a soft version of a hard sum-to-zero constraint. From a predictive perspective, this model is much more powerful. In particular, using a random effects model, we can make predictions for unobserved groups (e.g. a ninth University), with sensible prediction intervals based on our inferred understanding of how similar different Universities are. Using a fixed effects model this isn’t really possible. Even for a Bayesian version of a fixed effects model using proper (but vague) priors, prediction intervals for unobserved groups are not really sensible.
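
To make the predictive point concrete, here is a sketch (not from the original analysis) of how a prediction for a ninth, unobserved University might be obtained under the random effects model, by adding a predictive effect and a predictive observation to the model and monitoring them.

data=list(X=X,n=n,p=p)
init=list(mu=2,tau=1)
modelstring="
  model {
    for (j in 1:p) {
      theta[j]~dnorm(0,taut)
      for (i in 1:n) {
        X[i,j]~dnorm(mu+theta[j],tau)
      }
    }
    thetapred~dnorm(0,taut)         # effect for a new, unobserved University
    xpred~dnorm(mu+thetapred,tau)   # predictive log-BMI for an individual there
    mu~dnorm(0,0.0001)
    tau~dgamma(1,0.0001)
    taut~dgamma(1,0.0001)
  }
"
model2=jags.model(textConnection(modelstring),data=data,inits=init)
update(model2,n.iter=1000)
output2=coda.samples(model=model2,variable.names=c("thetapred","xpred"),n.iter=10000,thin=10)
summary(output2)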

Since we have used simulated data here, we can compare the estimated random effects with the true effects generated during the simulation.

> apply(as.matrix(output),2,mean)
           mu           tau          taut      theta[1]      theta[2] 
 3.098813e+00  9.627110e+01  7.015976e+03  2.086581e-03 -3.935511e-03 
     theta[3]      theta[4]      theta[5]      theta[6]      theta[7] 
-1.389099e-02  1.881528e-02 -1.921854e-02  5.640306e-04  9.529532e-03 
     theta[8] 
 5.227518e-03 
> RE
[1]  0.002637034 -0.008294518 -0.014616348  0.016839902 -0.015443243
[6] -0.001908871  0.010162117  0.005471262

We see that the Bayesian random effects model has done an excellent job of estimation. If we wished, we could relax the assumption of common variance across the groups by making tau a vector indexed by j (sketched below), though there is not much point in pursuing this here, since we know that the groups do all have the same variance.
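
For completeness, a sketch of the relaxed model with University-specific precisions (not fitted here) might look like this:

data=list(X=X,n=n,p=p)
init=list(mu=2,tau=rep(1,p))
modelstring="
  model {
    for (j in 1:p) {
      theta[j]~dnorm(0,taut)
      tau[j]~dgamma(1,0.0001)     # separate observation precision for each University
      for (i in 1:n) {
        X[i,j]~dnorm(mu+theta[j],tau[j])
      }
    }
    mu~dnorm(0,0.0001)
    taut~dgamma(1,0.0001)
  }
"
model3=jags.model(textConnection(modelstring),data=data,inits=init)
update(model3,n.iter=1000)
output3=coda.samples(model=model3,variable.names=c("mu","tau","taut","theta"),n.iter=100000,thin=10)
print(summary(output3))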

Strong subjective priors

The above is the usual story regarding fixed and random effects in Bayesian inference. I hope this is reasonably clear, so really I should quit while I’m ahead… However, the issues are really a bit more subtle than I’ve suggested. The inferred precision of the random effects was around 7,000, so now let’s re-run the original, naive, “fixed effects” model with a strong subjective Bayesian prior on the distribution of the effects.

data=list(X=X,n=n,p=p)
init=list(mu=2,tau=1)
modelstring="
  model {
    for (j in 1:p) {
      theta[j]~dnorm(0,7000)
      for (i in 1:n) {
        X[i,j]~dnorm(mu+theta[j],tau)
      }
    }
    mu~dnorm(0,0.0001)
    tau~dgamma(1,0.0001)
  }
"
model=jags.model(textConnection(modelstring),data=data,inits=init)
update(model,n.iter=1000)
output=coda.samples(model=model,variable.names=c("mu","tau","theta"),n.iter=100000,thin=10)
print(summary(output))
plot(output)

This model also runs perfectly well and gives sensible inferences, despite the fact that the effects are iid from a fixed distribution and there is no hard constraint on the effects. Similarly, we can make sensible predictions, together with appropriate prediction intervals, for an unobserved group. So it isn’t so much the fact that the effects are coupled via an extra level of hierarchy that makes things work. It’s really the fact that the effects are sensibly distributed and not just sampled directly from a vague prior. So for “real” subjective Bayesians the line between fixed and random effects is actually very blurred indeed…