## Calling R from Scala sbt projects

### Overview

In previous posts I’ve shown how the jvmr CRAN R package can be used to call Scala sbt projects from R and inline Scala Breeze code in R. In this post I will show how to call to R from a Scala sbt project. This requires that R and the jvmr CRAN R package are installed on your system, as described in the previous posts. Since I’m focusing here on Scala sbt projects, I’m also assuming that sbt is installed.

The only “trick” required for calling back to R from Scala is telling sbt where the jvmr jar file is located. You can find the location from the R console as illustrated by the following session:

> library(jvmr)
> .jvmr.jar
[1] "/home/ndjw1/R/x86_64-pc-linux-gnu-library/3.1/jvmr/java/jvmr_2.11-2.11.2.1.jar"


This location (which will obviously be different for you) can then be added in to your sbt classpath by adding the following line to your build.sbt file:

unmanagedJars in Compile += file("/home/ndjw1/R/x86_64-pc-linux-gnu-library/3.1/jvmr/java/jvmr_2.11-2.11.2.1.jar")


Once this is done, calling out to R from your Scala sbt project can be carried out as described in the jvmr documentation. For completeness, a working example is given below.

### Example

In this example I will use Scala to simulate some data consistent with a Poisson regression model, and then push the data to R to fit it using the R function glm(), and then pull back the fitted regression coefficients into Scala. This is obviously a very artificial example, but the point is to show how it is possible to call back to R for some statistical procedure that may be “missing” from Scala.

The dependencies for this project are described in the file build.sbt

name := "jvmr test"

version := "0.1"

scalacOptions ++= Seq("-unchecked", "-deprecation", "-feature")

libraryDependencies  ++= Seq(
"org.scalanlp" %% "breeze" % "0.10",
"org.scalanlp" %% "breeze-natives" % "0.10"
)

resolvers ++= Seq(
"Sonatype Snapshots" at "https://oss.sonatype.org/content/repositories/snapshots/",
"Sonatype Releases" at "https://oss.sonatype.org/content/repositories/releases/"
)

unmanagedJars in Compile += file("/home/ndjw1/R/x86_64-pc-linux-gnu-library/3.1/jvmr/java/jvmr_2.11-2.11.2.1.jar")

scalaVersion := "2.11.2"


The complete Scala program is contained in the file PoisReg.scala

import org.ddahl.jvmr.RInScala
import breeze.stats.distributions._
import breeze.linalg._

object ScalaToRTest {

def main(args: Array[String]) = {

// first simulate some data consistent with a Poisson regression model
val x = Uniform(50,60).sample(1000)
val eta = x map { xi => (xi * 0.1) - 3 }
val mu = eta map { math.exp(_) }
val y = mu map { Poisson(_).draw }

// call to R to fit the Poission regression model
val R = RInScala() // initialise an R interpreter
R.x=x.toArray // send x to R
R.y=y.toArray // send y to R
R.eval("mod <- glm(y~x,family=poisson())") // fit the model in R
// pull the fitted coefficents back into scala
val beta = DenseVector[Double](R.toVector[Double]("mod$coefficients")) // print the fitted coefficents println(beta) } }  If these two files are put in an empty directory, the code can be compiled and run by typing sbt run from the command prompt in the relevant directory. The commented code should be self-explanatory, but see the jvmr documentation for further details. ## Inlining Scala Breeze code in R using jvmr and sbt ### Introduction In the previous post I showed how to call Scala code from R using sbt and jvmr. The approach described in that post is the one I would recommend for any non-trivial piece of Scala code – mixing up code from different languages in the same source code file is not a good strategy in general. That said, for very small snippets of code, it can sometimes be convenient to inline Scala code directly into an R source code file. The canonical example of this is a computationally intensive algorithm being prototyped in R which has a slow inner loop. If the inner loop can be recoded in a few lines of Scala, it would be nice to just inline this directly into the R code without having to create a separate Scala project. The CRAN package jvmr provides a very simple and straightforward way to do this. However, as discussed in the last post, most Scala code for statistical computing (even short and simple code) is likely to rely on Breeze for special functions, probability distributions, non-uniform random number generation, numerical linear algebra, etc. In this post we will see how to use sbt in order to make sure that the Breeze library and all of its dependencies are downloaded and cached, and to provide a correct classpath with which to initialise a jvmr scalaInterpreter session. ### Setting up Configuring your system to be able to inline Scala Breeze code is very easy. You just need to install Java, R and sbt. Then install the CRAN R package jvmr. At this point you have everything you need except for the R function breezeInit, given at the end of this post. I’ve deferred the function to the end of the post as it is a bit ugly, and the details of it are not important. All it does is get sbt to ensure that Breeze is correctly downloaded and cached and then starts a scalaInterpreter with Breeze on the classpath. With this function available, we can use it within an R session as the following R session illustrates: > b=breezeInit() > b['import breeze.stats.distributions._'] NULL > b['Poisson(10).sample(20).toArray'] [1] 13 14 13 10 7 6 15 14 5 10 14 11 15 8 11 12 6 7 5 7 > summary(b['Gamma(3,2).sample(10000).toArray']) Min. 1st Qu. Median Mean 3rd Qu. Max. 0.2124 3.4630 5.3310 5.9910 7.8390 28.5200 >  So we see that Scala Breeze code can be inlined directly into an R session, and if we are careful about return types, have the results of Scala expressions automatically unpack back into convenient R data structures. ### Summary In this post I have shown how easy it is to inline Scala Breeze code into R using sbt in conjunction with the CRAN package jvmr. This has many potential applications, with the most obvious being the desire to recode slow inner loops from R to Scala. This should give performance quite comparable with alternatives such as Rcpp, with the advantage being that you get to write beautiful, elegant, functional Scala code instead of horrible, ugly, imperative C++ code! ;-) ### The breezeInit function The actual breezeInit() function is given below. It is a little ugly, but very simple. It is obviously easy to customise for different libraries and library versions as required. All of the hard work is done by sbt which must be installed and on the default system path in order for this function to work. breezeInit<-function() { library(jvmr) sbtStr="name := \"tmp\" version := \"0.1\" libraryDependencies ++= Seq( \"org.scalanlp\" %% \"breeze\" % \"0.10\", \"org.scalanlp\" %% \"breeze-natives\" % \"0.10\" ) resolvers ++= Seq( \"Sonatype Snapshots\" at \"https://oss.sonatype.org/content/repositories/snapshots/\", \"Sonatype Releases\" at \"https://oss.sonatype.org/content/repositories/releases/\" ) scalaVersion := \"2.11.2\" lazy val printClasspath = taskKey[Unit](\"Dump classpath\") printClasspath := { (fullClasspath in Runtime value) foreach { e => print(e.data+\"!\") } } " tmps=file(file.path(tempdir(),"build.sbt"),"w") cat(sbtStr,file=tmps) close(tmps) owd=getwd() setwd(tempdir()) cpstr=system2("sbt","printClasspath",stdout=TRUE) cpst=cpstr[length(cpstr)] cpsp=strsplit(cpst,"!")[[1]] cp=cpsp[2:(length(cpsp)-1)] si=scalaInterpreter(cp,use.jvmr.class.path=FALSE) setwd(owd) si }  ## Calling Scala code from R using jvmr ### Introduction In previous posts I have explained why I think that Scala is a good language to use for statistical computing and data science. Despite this, R is very convenient for simple exploratory data analysis and visualisation – currently more convenient than Scala. I explained in my recent talk at the RSS what (relatively straightforward) things would need to be developed for Scala in order to make R completely redundant, but for the short term at least, it seems likely that I will need to use both R and Scala for my day-to-day work. Since I use both Scala and R for statistical computing, it is very convenient to have a degree of interoperability between the two languages. I could call R from Scala code or Scala from R code, or both. Fortunately, some software tools have been developed recently which make this much simpler than it used to be. The software is jvmr, and as explained at the website, it enables calling Java and Scala from R and calling R from Java and Scala. I have previously discussed calling Java from R using the R CRAN package rJava. In this post I will focus on calling Scala from R using the CRAN package jvmr, which depends on rJava. I may examine calling R from Scala in a future post. On a system with Java installed, it should be possible to install the jvmr R package with a simple install.packages("jvmr")  from the R command prompt. The package has the usual documentation associated with it, but the draft paper describing the package is the best way to get an overview of its capabilities and a walk-through of simple usage. ### A Gibbs sampler in Scala using Breeze For illustration I’m going to use a Scala implementation of a Gibbs sampler which relies on the Breeze scientific library, and will be built using the simple build tool, sbt. Most non-trivial Scala projects depend on various versions of external libraries, and sbt is an easy way to build even very complex projects trivially on any system with Java installed. You don’t even need to have Scala installed in order to build and run projects using sbt. I give some simple complete worked examples of building and running Scala sbt projects in the github repo associated with my recent RSS talk. Installing sbt is trivial as explained in the repo READMEs. For this post, the Scala code, gibbs.scala is given below: package gibbs object Gibbs { import scala.annotation.tailrec import scala.math.sqrt import breeze.stats.distributions.{Gamma,Gaussian} case class State(x: Double, y: Double) { override def toString: String = x.toString + " , " + y + "\n" } def nextIter(s: State): State = { val newX = Gamma(3.0, 1.0/((s.y)*(s.y)+4.0)).draw State(newX, Gaussian(1.0/(newX+1), 1.0/sqrt(2*newX+2)).draw) } @tailrec def nextThinnedIter(s: State,left: Int): State = if (left==0) s else nextThinnedIter(nextIter(s),left-1) def genIters(s: State, stop: Int, thin: Int): List[State] = { @tailrec def go(s: State, left: Int, acc: List[State]): List[State] = if (left>0) go(nextThinnedIter(s,thin), left-1, s::acc) else acc go(s,stop,Nil).reverse } def main(args: Array[String]) = { if (args.length != 3) { println("Usage: sbt \"run <outFile> <iters> <thin>\"") sys.exit(1) } else { val outF=args(0) val iters=args(1).toInt val thin=args(2).toInt val out = genIters(State(0.0,0.0),iters,thin) val s = new java.io.FileWriter(outF) s.write("x , y\n") out map { it => s.write(it.toString) } s.close } } }  This code requires Scala and the Breeze scientific library in order to build. We can specify this in a sbt build file, which should be called build.sbt and placed in the same directory as the Scala code. name := "gibbs" version := "0.1" scalacOptions ++= Seq("-unchecked", "-deprecation", "-feature") libraryDependencies ++= Seq( "org.scalanlp" %% "breeze" % "0.10", "org.scalanlp" %% "breeze-natives" % "0.10" ) resolvers ++= Seq( "Sonatype Snapshots" at "https://oss.sonatype.org/content/repositories/snapshots/", "Sonatype Releases" at "https://oss.sonatype.org/content/repositories/releases/" ) scalaVersion := "2.11.2"  Now, from a system command prompt in the directory where the files are situated, it should be possible to download all dependencies and compile and run the code with a simple sbt "run output.csv 50000 1000"  #### Calling via R system calls Since this code takes a relatively long time to run, calling it from R via simple system calls isn’t a particularly terrible idea. For example, we can do this from the R command prompt with the following commands system("sbt \"run output.csv 50000 1000\"") out=read.csv("output.csv") library(smfsb) mcmcSummary(out,rows=2)  This works fine, but won’t work so well for code which needs to be called repeatedly. For this, tighter integration between R and Scala would be useful, which is where jvmr comes in. #### Calling sbt-based Scala projects via jvmr jvmr provides a very simple way to embed a Scala interpreter within an R session, to be able to execute Scala expressions from R and to have the results returned back to the R session for further processing. The main issue with using this in practice is managing dependencies on external libraries and setting the Scala classpath correctly. For an sbt project such as we are considering here, it is relatively easy to get sbt to provide us with all of the information we need in a fully automated way. First, we need to add a new task to our sbt build instructions, which will output the full classpath in a way that is easy to parse from R. Just add the following to the end of the file build.sbt: lazy val printClasspath = taskKey[Unit]("Dump classpath") printClasspath := { (fullClasspath in Runtime value) foreach { e => print(e.data+"!") } }  Be aware that blank lines are significant in sbt build files. Once we have this in our build file, we can write a small R function to get the classpath from sbt and then initialise a jvmr scalaInterpreter with the correct full classpath needed for the project. An R function which does this, sbtInit(), is given below sbtInit<-function() { library(jvmr) system2("sbt","compile") cpstr=system2("sbt","printClasspath",stdout=TRUE) cpst=cpstr[length(cpstr)] cpsp=strsplit(cpst,"!")[[1]] cp=cpsp[1:(length(cpsp)-1)] scalaInterpreter(cp,use.jvmr.class.path=FALSE) }  With this function at our disposal, it becomes trivial to call our Scala code direct from the R interpreter, as the following code illustrates. sc=sbtInit() sc['import gibbs.Gibbs._'] out=sc['genIters(State(0.0,0.0),50000,1000).toArray.map{s=>Array(s.x,s.y)}'] library(smfsb) mcmcSummary(out,rows=2)  Here we call the getIters function directly, rather than via the main method. This function returns an immutable List of States. Since R doesn’t understand this, we map it to an Array of Arrays, which R then unpacks into an R matrix for us to store in the matrix out. ### Summary The CRAN package jvmr makes it very easy to embed a Scala interpreter within an R session. However, for most non-trivial statistical computing problems, the Scala code will have dependence on external scientific libraries such as Breeze. The standard way to easily manage external dependencies in the Scala ecosystem is sbt. Given an sbt-based Scala project, it is easy to add a task to the sbt build file and a function to R in order to initialise the jvmr Scala interpreter with the full classpath needed to call arbitrary Scala functions. This provides very convenient inter-operability between R and Scala for many statistical computing applications. ## One-way ANOVA with fixed and random effects from a Bayesian perspective This blog post is derived from a computer practical session that I ran as part of my new course on Statistics for Big Data, previously discussed. This course covered a lot of material very quickly. In particular, I deferred introducing notions of hierarchical modelling until the Bayesian part of the course, where I feel it is more natural and powerful. However, some of the terminology associated with hierarchical statistical modelling probably seems a bit mysterious to those without a strong background in classical statistical modelling, and so this practical session was intended to clear up some potential confusion. I will analyse a simple one-way Analysis of Variance (ANOVA) model from a Bayesian perspective, making sure to highlight the difference between fixed and random effects in a Bayesian context where everything is random, as well as emphasising the associated identifiability issues. R code is used to illustrate the ideas. ### Example scenario We will consider the body mass index (BMI) of new male undergraduate students at a selection of UK Universities. Let us suppose that our data consist of measurements of (log) BMI for a random sample of 1,000 males at each of 8 Universities. We are interested to know if there are any differences between the Universities. Again, we want to model the process as we would simulate it, so thinking about how we would simulate such data is instructive. We start by assuming that the log BMI is a normal random quantity, and that the variance is common across the Universities in question (this is quite a big assumption, and it is easy to relax). We assume that the mean of this normal distribution is University-specific, but that we do not have strong prior opinions regarding the way in which the Universities differ. That said, we expect that the Universities would not be very different from one another. ### Simulating data A simple simulation of the data with some plausible parameters can be carried out as follows. set.seed(1) Z=matrix(rnorm(1000*8,3.1,0.1),nrow=8) RE=rnorm(8,0,0.01) X=t(Z+RE) colnames(X)=paste("Uni",1:8,sep="") Data=stack(data.frame(X)) boxplot(exp(values)~ind,data=Data,notch=TRUE)  Make sure that you understand exactly what this code is doing before proceeding. The boxplot showing the simulated data is given below. ### Frequentist analysis We will start with a frequentist analysis of the data. The model we would like to fit is $y_{ij} = \mu + \theta_i + \varepsilon_{ij}$ where i is an indicator for the University and j for the individual within a particular University. The “effect”, $\theta_i$ represents how the ith University differs from the overall mean. We know that this model is not actually identifiable when the model parameters are all treated as “fixed effects”, but R will handle this for us. > mod=lm(values~ind,data=Data) > summary(mod) Call: lm(formula = values ~ ind, data = Data) Residuals: Min 1Q Median 3Q Max -0.36846 -0.06778 -0.00069 0.06910 0.38219 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 3.101068 0.003223 962.244 < 2e-16 *** indUni2 -0.006516 0.004558 -1.430 0.152826 indUni3 -0.017168 0.004558 -3.767 0.000166 *** indUni4 0.017916 0.004558 3.931 8.53e-05 *** indUni5 -0.022838 0.004558 -5.011 5.53e-07 *** indUni6 -0.001651 0.004558 -0.362 0.717143 indUni7 0.007935 0.004558 1.741 0.081707 . indUni8 0.003373 0.004558 0.740 0.459300 --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 0.1019 on 7992 degrees of freedom Multiple R-squared: 0.01439, Adjusted R-squared: 0.01353 F-statistic: 16.67 on 7 and 7992 DF, p-value: < 2.2e-16  We see that R has handled the identifiability problem using “treatment contrasts”, dropping the fixed effect for the first university, so that the intercept actually represents the mean value for the first University, and the effects for the other Univeristies represent the differences from the first University. If we would prefer to impose a sum constraint, then we can switch to sum contrasts with options(contrasts=rep("contr.sum",2))  and then re-fit the model. > mods=lm(values~ind,data=Data) > summary(mods) Call: lm(formula = values ~ ind, data = Data) Residuals: Min 1Q Median 3Q Max -0.36846 -0.06778 -0.00069 0.06910 0.38219 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 3.0986991 0.0011394 2719.558 < 2e-16 *** ind1 0.0023687 0.0030146 0.786 0.432048 ind2 -0.0041477 0.0030146 -1.376 0.168905 ind3 -0.0147997 0.0030146 -4.909 9.32e-07 *** ind4 0.0202851 0.0030146 6.729 1.83e-11 *** ind5 -0.0204693 0.0030146 -6.790 1.20e-11 *** ind6 0.0007175 0.0030146 0.238 0.811889 ind7 0.0103039 0.0030146 3.418 0.000634 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 0.1019 on 7992 degrees of freedom Multiple R-squared: 0.01439, Adjusted R-squared: 0.01353 F-statistic: 16.67 on 7 and 7992 DF, p-value: < 2.2e-16  This has 7 degrees of freedom for the effects, as before, but ensures that the 8 effects sum to precisely zero. This is arguably more interpretable in this case. ### Bayesian analysis We will now analyse the simulated data from a Bayesian perspective, using JAGS. #### Fixed effects All parameters in Bayesian models are uncertain, and therefore random, so there is much confusion regarding the difference between “fixed” and “random” effects in a Bayesian context. For “fixed” effects, our prior captures the idea that we sample the effects independently from a “fixed” (typically vague) prior distribution. We could simply code this up and fit it in JAGS as follows. require(rjags) n=dim(X)[1] p=dim(X)[2] data=list(X=X,n=n,p=p) init=list(mu=2,tau=1) modelstring=" model { for (j in 1:p) { theta[j]~dnorm(0,0.0001) for (i in 1:n) { X[i,j]~dnorm(mu+theta[j],tau) } } mu~dnorm(0,0.0001) tau~dgamma(1,0.0001) } " model=jags.model(textConnection(modelstring),data=data,inits=init) update(model,n.iter=1000) output=coda.samples(model=model,variable.names=c("mu","tau","theta"),n.iter=100000,thin=10) print(summary(output)) plot(output) autocorr.plot(output) pairs(as.matrix(output)) crosscorr.plot(output)  On running the code we can clearly see that this naive approach leads to high posterior correlation between the mean and the effects, due to the fundamental lack of identifiability of the model. This also leads to MCMC mixing problems, but it is important to understand that this computational issue is conceptually entirely separate from the fundamental statisticial identifiability issue. Even if we could avoid MCMC entirely, the identifiability issue would remain. A quick fix for the identifiability issue is to use “treatment contrasts”, just as for the frequentist model. We can implement that as follows. data=list(X=X,n=n,p=p) init=list(mu=2,tau=1) modelstring=" model { for (j in 1:p) { for (i in 1:n) { X[i,j]~dnorm(mu+theta[j],tau) } } theta[1]<-0 for (j in 2:p) { theta[j]~dnorm(0,0.0001) } mu~dnorm(0,0.0001) tau~dgamma(1,0.0001) } " model=jags.model(textConnection(modelstring),data=data,inits=init) update(model,n.iter=1000) output=coda.samples(model=model,variable.names=c("mu","tau","theta"),n.iter=100000,thin=10) print(summary(output)) plot(output) autocorr.plot(output) pairs(as.matrix(output)) crosscorr.plot(output)  Running this we see that the model now works perfectly well, mixes nicely, and gives sensible inferences for the treatment effects. Another source of confusion for models of this type is data formating and indexing in JAGS models. For our balanced data there was not problem passing in data to JAGS as a matrix and specifying the model using nested loops. However, for unbalanced designs this is not necessarily so convenient, and so then it can be helpful to specify the model based on two-column data, as we would use for fitting using lm(). This is illustrated with the following model specification, which is exactly equivalent to the previous model, and should give identical (up to Monte Carlo error) results. N=n*p data=list(y=Data$values,g=Data\$ind,N=N,p=p)
init=list(mu=2,tau=1)
modelstring="
model {
for (i in 1:N) {
y[i]~dnorm(mu+theta[g[i]],tau)
}
theta[1]<-0
for (j in 2:p) {
theta[j]~dnorm(0,0.0001)
}
mu~dnorm(0,0.0001)
tau~dgamma(1,0.0001)
}
"
model=jags.model(textConnection(modelstring),data=data,inits=init)
update(model,n.iter=1000)
output=coda.samples(model=model,variable.names=c("mu","tau","theta"),n.iter=100000,thin=10)
print(summary(output))
plot(output)


As suggested above, this indexing scheme is much more convenient for unbalanced data, and hence widely used. However, since our data is balanced here, we will revert to the matrix approach for the remainder of the post.

One final thing to consider before moving on to random effects is the sum-contrast model. We can implement this in various ways, but I’ve tried to encode it for maximum clarity below, imposing the sum-to-zero constraint via the final effect.

data=list(X=X,n=n,p=p)
init=list(mu=2,tau=1)
modelstring="
model {
for (j in 1:p) {
for (i in 1:n) {
X[i,j]~dnorm(mu+theta[j],tau)
}
}
for (j in 1:(p-1)) {
theta[j]~dnorm(0,0.0001)
}
theta[p] <- -sum(theta[1:(p-1)])
mu~dnorm(0,0.0001)
tau~dgamma(1,0.0001)
}
"
model=jags.model(textConnection(modelstring),data=data,inits=init)
update(model,n.iter=1000)
output=coda.samples(model=model,variable.names=c("mu","tau","theta"),n.iter=100000,thin=10)
print(summary(output))
plot(output)


Again, this works perfectly well and gives similar results to the frequentist analysis.

#### Random effects

The key difference between fixed and random effects in a Bayesian framework is that random effects are not independent, being drawn from a distribution with parameters which are not fixed. Essentially, there is another level of hierarchy involved in the specification of the random effects. This is best illustrated by example. A random effects model for this problem is given below.

data=list(X=X,n=n,p=p)
init=list(mu=2,tau=1)
modelstring="
model {
for (j in 1:p) {
theta[j]~dnorm(0,taut)
for (i in 1:n) {
X[i,j]~dnorm(mu+theta[j],tau)
}
}
mu~dnorm(0,0.0001)
tau~dgamma(1,0.0001)
taut~dgamma(1,0.0001)
}
"
model=jags.model(textConnection(modelstring),data=data,inits=init)
update(model,n.iter=1000)
output=coda.samples(model=model,variable.names=c("mu","tau","taut","theta"),n.iter=100000,thin=10)
print(summary(output))
plot(output)


The only difference between this and our first naive attempt at a Bayesian fixed effects model is that we have put a gamma prior on the precision of the effect. Note that this model now runs and fits perfectly well, with reasonable mixing, and gives sensible parameter inferences. Although the effects here are not constrained to sum-to-zero, like in the case of sum contrasts for a fixed effects model, the prior encourages shrinkage towards zero, and so the random effect distribution can be thought of as a kind of soft version of a hard sum-to-zero constraint. From a predictive perspective, this model is much more powerful. In particular, using a random effects model, we can make strong predictions for unobserved groups (eg. a ninth University), with sensible prediction intervals based on our inferred understanding of how similar different universities are. Using a fixed effects model this isn’t really possible. Even for a Bayesian version of a fixed effects model using proper (but vague) priors, prediction intervals for unobserved groups are not really sensible.

Since we have used simulated data here, we can compare the estimated random effects with the true effects generated during the simulation.

> apply(as.matrix(output),2,mean)
mu           tau          taut      theta[1]      theta[2]
3.098813e+00  9.627110e+01  7.015976e+03  2.086581e-03 -3.935511e-03
theta[3]      theta[4]      theta[5]      theta[6]      theta[7]
-1.389099e-02  1.881528e-02 -1.921854e-02  5.640306e-04  9.529532e-03
theta[8]
5.227518e-03
> RE
[1]  0.002637034 -0.008294518 -0.014616348  0.016839902 -0.015443243
[6] -0.001908871  0.010162117  0.005471262


We see that the Bayesian random effects model has done an excellent job of estimation. If we wished, we could relax the assumption of common variance across the groups by making tau a vector indexed by j, though there is not much point in persuing this here, since we know that the groups do all have the same variance.

#### Strong subjective priors

The above is the usual story regarding fixed and random effects in Bayesian inference. I hope this is reasonably clear, so really I should quit while I’m ahead… However, the issues are really a bit more subtle than I’ve suggested. The inferred precision of the random effects was around 7,000, so now lets re-run the original, naive, “fixed effects” model with a strong subjective Bayesian prior on the distribution of the effects.

data=list(X=X,n=n,p=p)
init=list(mu=2,tau=1)
modelstring="
model {
for (j in 1:p) {
theta[j]~dnorm(0,7000)
for (i in 1:n) {
X[i,j]~dnorm(mu+theta[j],tau)
}
}
mu~dnorm(0,0.0001)
tau~dgamma(1,0.0001)
}
"
model=jags.model(textConnection(modelstring),data=data,inits=init)
update(model,n.iter=1000)
output=coda.samples(model=model,variable.names=c("mu","tau","theta"),n.iter=100000,thin=10)
print(summary(output))
plot(output)


This model also runs perfectly well and gives sensible inferences, despite the fact that the effects are iid from a fixed distribution and there is no hard constraint on the effects. Similarly, we can make sensible predictions, together with appropriate prediction intervals, for an unobserved group. So it isn’t so much the fact that the effects are coupled via an extra level of hierarchy that makes things work. It’s really the fact that the effects are sensibly distributed and not just sampled directly from a vague prior. So for “real” subjective Bayesians the line between fixed and random effects is actually very blurred indeed…

## Statistical computing languages at the RSS

On Friday the Royal Statistical Society hosted a meeting on Statistical computing languages, organised by my colleague Colin Gillespie. Four languages were presented at the meeting: Python, Scala, Matlab and Julia. I presented the talk on Scala. The slides I presented are available, in addition to the code examples and instructions on how to run them, in a public github repo I have created. The talk mainly covered material I have discussed in various previous posts on this blog. What was new was the emphasis on how easy it is to use and run Scala code, and the inclusion of complete examples and instructions on how to run them on any platform with a JVM installed. I also discussed some of the current limitations of Scala as an environment for interactive statistical data analysis and visualisation, and how these limitations could be overcome with a little directed effort. Colin suggested that all of the speakers covered a couple of examples (linear regression and a Monte Carlo integral) in “their” languages, and he provided an R solution. It is interesting to see the examples in the five different languages side by side for comparison purposes. Colin is collecting together all of the material relating to the meeting in a combined github repo, which should fill out over the next few days.

For other Scala posts on this blog, see all of my posts with a “scala” tag.

## Statistics for Big Data

### Doctoral programme in cloud computing for big data

I’ve spent much of this year working to establish our new EPSRC Centre for Doctoral Training in Cloud Computing for Big Data, which partly explains the lack of posts on this blog in recent months. The CDT is now established, with 11 students in the first cohort, and we have begun recruiting for the second cohort, to start in September 2015. We admit roughly equal numbers of students from a Computing Science and Mathematics/Statistics background, and provide an intensive programme of inter-disciplinary training in the first (of four) years. After initial induction and cohort team-building events, the programme begins with 8 weeks of intensive bespoke training developed especially for the CDT students. For the first two weeks the cohort is split into two streams for an initial crash course. Students from a M/S background receive basic training in CS and programming, with an emphasis on object-oriented programming in Java. Students from a CS background get basic training in M/S, emphasising elementary probability and statistics and basic linear algebra. From the start of the third week the entire cohort is trained together. This whole cohort training begins with two 6 week courses running in parallel. One is Programming for big data, concentrating on R programming, Java, databases, and software development tools and techniques. The other course is Statistics for big data, a course that I developed and taught, and the main topic of this post. After the initial 8 weeks of specialist training, the students next take courses on Cloud Computing and Machine learning followed by Big data analytics and Time series data, and finally courses on Research skills and Professional skills, which run along side a Group project.

### Statistics for big data

I have found it an interesting challenge to try and put together a 15 credit (150 hours of student effort) course to start from a basic level of statistical knowledge and cover most of the important concepts and methods necessary for modelling and analysis of large and complex data sets. Although the course was not about Big Data per se, it emphasised scalability in general, and exploitation of linearity in particular. Given the mixed levels of mathematical sophistication of the cohort I felt it important to start off with a practical non-technical introduction. For this I covered the first 6 chapters of Introduction to Statistical Learning with R. This provided the students with a basic grounding in statistical modelling ideas, and practical skills in working with data in R. This book is excellent as a non-technical introduction, but is missing the underpinning mathematical and computational details necessary for understanding how to implement such methods in practice. I’ve not found Elements of Statistical Learning to be very suitable as a student text, so I revised, expanded and updated my old multivariate notes to produce a new set of notes on Multivariate Data Analysis using R (I discussed an early version of these notes in a previous post). These notes emphasise the role of numerical linear algebra for solving problems of linear statistical inference in an efficient and numerically stable manner. In particular, they illustrate the use of different matrix factorisations, such as Cholesky, QR, SVD, as well as spectral decompositions, for constructing efficient solutions. Although some of the students with a weaker mathematical background struggled with some of the more technical topics, the associated practical sessions illustrated how the techniques are used in practice.

After filling in a few additional topics of frequentist linear statistical inference (including ANOVA, contrasts, missing data, and experimental design), we moved on to computational Bayesian inference. For the introductory material, I used a variety of sources, including Bayesian reasoning and machine learning, and some introductory material from my own textbook, Stochastic modelling for systems biology. The emphasis for this part of the course was the use of flexible Bayesian hierarchical modelling using MCMC. The primary software tool used was JAGS, used via the rjags package from R. For more advanced material on Bayesian computation, I made substantial use of various posts on this blog, as well as some other material we use for teaching Bayesian inference as part of our undergraduate programmes. As for the first part of the course, the emphasis was on practical modelling and data analysis rather than theory. A few of the computer practicals from the Bayesian part of the course are on-line. I may re-write one or two of these practicals as blog posts in due course. Although the emphasis was on flexible modelling, we did touch on efficiency and exploitation of linearity, and I included an extra Chapter on linear Bayesian inference in my multivariate notes in this context. I rounded off this part of the course with a lecture on parallelisation of Monte Carlo algorithms, along the lines of my BIRS lecture on that topic.

The course finished with a group data analysis project, concerned with linear modelling and variable selection for a fairly large heterogeneous data set containing missing data, using both frequentist and Bayesian approaches. The course has just finished, and I’m reasonably happy with how it has gone, but I’ll reflect on it for a couple of weeks and get some feedback before deciding on some revisions to make before delivering it again for the new cohort next year.

## Tuning particle MCMC algorithms

Several papers have appeared recently discussing the issue of how to tune the number of particles used in the particle filter within a particle MCMC algorithm such as particle marginal Metropolis Hastings (PMMH). Three such papers are:

I have discussed psuedo marginal MCMC and particle MCMC algorithms in previous posts. It will be useful to refer back to these posts if these topics are unfamiliar. Within particle MCMC algorithms (and psuedo-marginal MCMC algorithms, more generally), an unbiased estimate of marginal likelihood is constructed using a number of particles. The more particles that are used, the better the estimate of marginal likelihood is, and the resulting MCMC algorithm will behave more like a “real” marginal MCMC algorithm. For a small number of particles, the algorithm will still have exactly the correct target, but the noise in the unbiased estimator of marginal likelihood will lead to poor mixing of the MCMC chain. The idea is to use just enough particles to ensure that there isn’t “too much” noise in the unbiased estimator, but not to waste lots of time producing a super-accurate estimate of marginal likelihood if that isn’t necessary to ensure good mixing of the MCMC chain.

The papers above try to give theoretical justifications for certain “rules of thumb” that are commonly used in practice. One widely adopted scheme is to tune the number of particles so that the variance of the log of the estimate of marginal liklihood is around one. The obvious questions are “where?” and “why?”, and these questions turn out to be connected. As we will see, there isn’t really a good answer to the “where?” question, but what people usually do is use a pilot run to get an estimate of the posterior mean, or mode, or MLE, and then pick one and tune the noise variance at that particular parameter value. As to “why?”, well, the papers above make various (slightly different) assumptions, all of which lead to trading off mixing against computation time to obtain an “optimal” number of particles. They don’t all agree that the variance of the noise should be exactly 1, but they all agree to an order of magnitude.

All of the above papers make the assumption that the noise distribution associated with the marginal likelihood estimate is independent of the parameter at which it is being evaluated, which explains why there isn’t a really good answer to the “where?” question – under the assumption it doesn’t matter what parameter value is used for tuning – they are all the same! Easy. Except that’s quite a big assumption, so it would be nice to know that it is reasonable, and unfortunately it isn’t. Let’s look at an example to see what goes wrong.

#### Example

In Chapter 10 of my book I look in detail at constructing a PMMH algorithm for inferring the parameters of a discretely observed stochastic Lotka-Volterra model. I’ve stepped through the computational details in a previous post which you should refer back to for the necessary background. Following that post, we can construct a particle filter to return an unbiased estimate of marginal likelihood using the following R code (which relies on the smfsb CRAN package):

require(smfsb)
# data
data(LVdata)
data=as.timedData(LVnoise10)
noiseSD=10
# measurement error model
dataLik <- function(x,t,y,log=TRUE,...)
{
ll=sum(dnorm(y,x,noiseSD,log=TRUE))
if (log)
return(ll)
else
return(exp(ll))
}
# now define a sampler for the prior on the initial state
simx0 <- function(N,t0,...)
{
mat=cbind(rpois(N,50),rpois(N,100))
colnames(mat)=c("x1","x2")
mat
}
# construct particle filter
mLLik=pfMLLik(150,simx0,0,stepLVc,dataLik,data)


Again, see the relevant previous post for details. So now mLLik() is a function that will return the log of an unbiased estimate of marginal likelihood (based on 150 particles) given a parameter value at which to evaluate.

What we are currently wondering is whether the noise in the estimate is independent of the parameter at which it is evaluated. We can investigate this for this filter easily by looking at how the estimate varies as the first parameter (prey birth rate) varies. The following code computes a log likelihood estimate across a range of values and plots the result.

mLLik1=function(x){mLLik(th=c(th1=x,th2=0.005,th3=0.6))}
x=seq(0.7,1.3,length.out=5001)
y=sapply(x,mLLik1)
plot(x[y>-1e10],y[y>-1e10])


The resulting plot is as follows (click for full size):

So, looking at the plot, it is very clear that the noise variance certainly isn’t constant as the parameter varies – it varies substantially. Furthermore, the way in which it varies is “dangerous”, in that the noise is smallest in the vicinity of the MLE. So, if a parameter close to the MLE is chosen for tuning the number of particles, this will ensure that the noise is small close to the MLE, but not elsewhere in parameter space. This could have bad consequences for the mixing of the MCMC algorithm as it explores the tails of the posterior distribution.

So with the above in mind, how should one tune the number of particles in a pMCMC algorithm? I can’t give a general answer, but I can explain what I do. We can’t rely on theory, so a pragmatic approach is required. The above rule of thumb usually gives a good starting point for exploration. Then I just directly optimise ESS per CPU second of the pMCMC algorithm from pilot runs for varying numbers of particles (and other tuning parameters in the algorithm). ESS is “expected sample size”, which can be estimated using the effectiveSize() function in the coda CRAN package. Ugly and brutish, but it works…