This weekend I’ve been preparing some material for my upcoming Scala for statistical computing short course. As part of the course, I thought it would be useful to walk through how to think about and structure MCMC codes, and in particular, how to think about MCMC algorithms as infinite streams of state. This material is reasonably stand-alone, so it seems suitable for a blog post. Complete runnable code for the examples in this post is available from my blog repo.

For this post I will just consider a trivial toy Metropolis algorithm using a Uniform random walk proposal to target a standard normal distribution. I’ve considered this problem before on my blog, so if you aren’t very familiar with Metropolis-Hastings algorithms, you might want to quickly review my post on Metropolis-Hastings MCMC algorithms in R before continuing. At the end of that post, I gave the following R code for the Metropolis sampler:

    metrop3<-function(n=1000,eps=0.5)
    {
        vec=vector("numeric", n)
        x=0
        oldll=dnorm(x,log=TRUE)
        vec[1]=x
        for (i in 2:n) {
            can=x+runif(1,-eps,eps)
            loglik=dnorm(can,log=TRUE)
            loga=loglik-oldll
            if (log(runif(1)) < loga) {
                x=can
                oldll=loglik
            }
            vec[i]=x
        }
        vec
    }

I will begin this post with a fairly direct translation of this algorithm into Scala:

    import breeze.linalg._
    import breeze.stats.distributions._

    def metrop1(n: Int = 1000, eps: Double = 0.5): DenseVector[Double] = {
      val vec = DenseVector.fill(n)(0.0)
      var x = 0.0
      var oldll = Gaussian(0.0, 1.0).logPdf(x)
      vec(0) = x
      (1 until n).foreach { i =>
        val can = x + Uniform(-eps, eps).draw
        val loglik = Gaussian(0.0, 1.0).logPdf(can)
        val loga = loglik - oldll
        if (math.log(Uniform(0.0, 1.0).draw) < loga) {
          x = can
          oldll = loglik
        }
        vec(i) = x
      }
      vec
    }

This code works, and is reasonably fast and efficient, but there are several issues with it from a functional programmer’s perspective. One issue is that we have committed to storing all MCMC output in RAM in a `DenseVector`. This probably isn’t an issue here, but for some big problems we might prefer to not store the full set of states, but to just print the states to (say) the console, for possible re-direction to a file. It is easy enough to modify the code to do this:

    def metrop2(n: Int = 1000, eps: Double = 0.5): Unit = {
      var x = 0.0
      var oldll = Gaussian(0.0, 1.0).logPdf(x)
      (1 to n).foreach { i =>
        val can = x + Uniform(-eps, eps).draw
        val loglik = Gaussian(0.0, 1.0).logPdf(can)
        val loga = loglik - oldll
        if (math.log(Uniform(0.0, 1.0).draw) < loga) {
          x = can
          oldll = loglik
        }
        println(x)
      }
    }

But now we have two versions of the algorithm: one for storing results locally, and one for streaming results to the console. This is clearly unsatisfactory, but we shall return to this issue shortly. Another issue that will jump out at functional programmers is the reliance on mutable variables for storing the state and old likelihood. Let’s fix that now by re-writing the algorithm as a tail-recursion.

    import scala.annotation.tailrec

    @tailrec
    def metrop3(n: Int = 1000, eps: Double = 0.5,
                x: Double = 0.0, oldll: Double = Double.MinValue): Unit = {
      if (n > 0) {
        println(x)
        val can = x + Uniform(-eps, eps).draw
        val loglik = Gaussian(0.0, 1.0).logPdf(can)
        val loga = loglik - oldll
        if (math.log(Uniform(0.0, 1.0).draw) < loga)
          metrop3(n - 1, eps, can, loglik)
        else
          metrop3(n - 1, eps, x, oldll)
      }
    }

This has eliminated the `var`s, and is just as fast and efficient as the previous version of the code. Note that the `@tailrec` annotation is optional – it just signals to the compiler that we want it to throw an error if for some reason it cannot eliminate the tail call. However, this is for the print-to-console version of the code. What if we actually want to keep the iterations in RAM for subsequent analysis? We can keep the values in an accumulator, as follows.

    @tailrec
    def metrop4(n: Int = 1000, eps: Double = 0.5,
                x: Double = 0.0, oldll: Double = Double.MinValue,
                acc: List[Double] = Nil): DenseVector[Double] = {
      if (n == 0)
        DenseVector(acc.reverse.toArray)
      else {
        val can = x + Uniform(-eps, eps).draw
        val loglik = Gaussian(0.0, 1.0).logPdf(can)
        val loga = loglik - oldll
        if (math.log(Uniform(0.0, 1.0).draw) < loga)
          metrop4(n - 1, eps, can, loglik, can :: acc)
        else
          metrop4(n - 1, eps, x, oldll, x :: acc)
      }
    }

This is all fine, but we haven’t yet addressed the issue of having different versions of the code depending on what we want to do with the output. The problem is that we have tied up the logic of advancing the Markov chain with what to do with the output. What we need to do is separate out the code for advancing the state. We can do this by defining a new function.

    def newState(x: Double, oldll: Double, eps: Double): (Double, Double) = {
      val can = x + Uniform(-eps, eps).draw
      val loglik = Gaussian(0.0, 1.0).logPdf(can)
      val loga = loglik - oldll
      if (math.log(Uniform(0.0, 1.0).draw) < loga) (can, loglik) else (x, oldll)
    }

This function takes as input a current state and associated log likelihood and returns a new state and log likelihood following the execution of one step of a MH algorithm. This separates the concern of state updating from the rest of the code. So now if we want to write code that prints the state, we can write it as

    @tailrec
    def metrop5(n: Int = 1000, eps: Double = 0.5,
                x: Double = 0.0, oldll: Double = Double.MinValue): Unit = {
      if (n > 0) {
        println(x)
        val ns = newState(x, oldll, eps)
        metrop5(n - 1, eps, ns._1, ns._2)
      }
    }

and if we want to accumulate the set of states visited, we can write that as

    @tailrec
    def metrop6(n: Int = 1000, eps: Double = 0.5,
                x: Double = 0.0, oldll: Double = Double.MinValue,
                acc: List[Double] = Nil): DenseVector[Double] = {
      if (n == 0)
        DenseVector(acc.reverse.toArray)
      else {
        val ns = newState(x, oldll, eps)
        metrop6(n - 1, eps, ns._1, ns._2, ns._1 :: acc)
      }
    }

Both of these functions call `newState` to do the real work, and concentrate on what to do with the sequence of states. However, both of these functions repeat the logic of how to iterate over the sequence of states.

Ideally we would like to abstract out the details of how to do state iteration from the code as well. Most functional languages have some concept of a `Stream`, which represents a (potentially infinite) sequence of states. The `Stream` can embody the logic of how to perform state iteration, allowing us to abstract that away from our code, as well.
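As a small illustration of that idea, here is a sketch using only the Scala standard library and a deterministic toy transition rather than an MCMC kernel. The names `StreamSketch`, `step`, `states` and `firstFive` are made up for this example, and the `evaluations` counter exists only to show when work actually happens. (On Scala 2.13 and later, `Stream` is deprecated in favour of `LazyList`, which behaves similarly.)

```scala
object StreamSketch {
  // Counts how many times the step function has actually been run
  var evaluations = 0

  // A deterministic toy "transition": the next state depends only on the current one
  def step(x: Int): Int = { evaluations += 1; x + 2 }

  // Conceptually infinite; building it does not run step at all
  val states: Stream[Int] = Stream.iterate(0)(step)

  // Forcing a finite prefix triggers only the work that the prefix needs
  def firstFive: List[Int] = states.take(5).toList
}
```

Calling `StreamSketch.firstFive` returns the first five states, and because a `Stream` memoises its elements, calling it a second time does not run `step` again.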

To do this, we will restructure our code slightly so that it more clearly maps old state to new state.

    def nextState(eps: Double)(state: (Double, Double)): (Double, Double) = {
      val x = state._1
      val oldll = state._2
      val can = x + Uniform(-eps, eps).draw
      val loglik = Gaussian(0.0, 1.0).logPdf(can)
      val loga = loglik - oldll
      if (math.log(Uniform(0.0, 1.0).draw) < loga) (can, loglik) else (x, oldll)
    }

The "real" state of the chain is just `x`, but if we want to avoid recalculation of the old likelihood, then we need to make this part of the chain’s state. We can use this `nextState` function in order to construct a `Stream`.

    def metrop7(eps: Double = 0.5, x: Double = 0.0,
                oldll: Double = Double.MinValue): Stream[Double] =
      Stream.iterate((x, oldll))(nextState(eps)) map (_._1)

The result of calling this is an infinite stream of states. Obviously it isn’t computed – that would require infinite computation – but it captures the logic of iteration and computation in a `Stream`, which can be thought of as a lazy `List`. We can get values out by converting the `Stream` to a regular collection, being careful to truncate the `Stream` to one of finite length beforehand! For example, `metrop7().drop(1000).take(10000).toArray` will do a burn-in of 1,000 iterations followed by a main monitoring run of length 10,000, capturing the results in an `Array`. Note that `metrop7().drop(1000).take(10000)` is a `Stream`, and so the main monitoring run is not actually computed until the `toArray` is encountered (although `drop` on a `Stream` does step through the burn-in iterations straight away). Conversely, if printing to console is required, just replace the `.toArray` with `.foreach(println)`.
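Thinning can be expressed in the same style. The following is one possible sketch using only the standard library; `ThinSketch`, `thin` and `monitor` are names invented for this example rather than anything provided by Breeze or the standard library.

```scala
object ThinSketch {
  // Keep every th-th element of a (possibly infinite) stream.
  // The tail is built lazily, so this is safe on infinite streams.
  def thin[T](s: Stream[T], th: Int): Stream[T] = {
    require(th >= 1, "thinning interval must be at least 1")
    if (s.isEmpty) Stream.empty
    else s.head #:: thin(s.drop(th), th)
  }

  // Discard a burn-in, thin what remains, and keep n values in a Vector
  def monitor[T](s: Stream[T], burn: Int, th: Int, n: Int): Vector[T] =
    thin(s.drop(burn), th).take(n).toVector
}
```

With the sampler from this post in scope, something like `ThinSketch.monitor(metrop7(), 1000, 10, 1000)` would discard 1,000 iterations and then keep every tenth state until 1,000 values have been collected.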

The above stream-based approach to MCMC iteration is clean and elegant, and deals nicely with issues like burn-in and thinning (which can be handled similarly). This is how I typically write MCMC codes these days. However, functional programming purists would still have issues with this approach, as it isn’t purely functional. The problem is that the code has a side-effect, which is to mutate the state of the under-pinning pseudo-random number generator. If the code were pure, calling `nextState` with the same inputs would always give the same result. Clearly this isn’t the case here, as we have specifically designed the function to be stochastic, returning a randomly sampled value from the desired probability distribution. So `nextState` represents a function for randomly sampling from a conditional probability distribution.

Now, ultimately all code has side-effects, or there would be no point in running it! But in functional programming the desire is to make as much of the code as possible pure, and to push side-effects to the very edges of the code. So it’s fine to have side-effects in your `main` method, but not buried deep in your code. Here the side-effect is at the very heart of the code, which is why it is potentially an issue.

To keep things as simple as possible, at this point we will stop worrying about carrying forward the old likelihood, and hard-code a value of `eps`. Generalisation is straightforward. We can make our code pure by instead defining a function which represents the conditional probability distribution itself. For this we use a *probability monad*, which in Breeze is called `Rand`. We can couple together such functions using monadic binds (`flatMap` in Scala), expressed most neatly using for-comprehensions. So we can write our transition kernel as

    def kernel(x: Double): Rand[Double] = for {
      innov <- Uniform(-0.5, 0.5)
      can = x + innov
      oldll = Gaussian(0.0, 1.0).logPdf(x)
      loglik = Gaussian(0.0, 1.0).logPdf(can)
      loga = loglik - oldll
      u <- Uniform(0.0, 1.0)
    } yield if (math.log(u) < loga) can else x

This is now pure – the same input `x` will always return the same probability distribution – the conditional distribution of the next state given the current state. We can draw random samples from this distribution if we must, but it’s probably better to work as long as possible with pure functions. So next we need to encapsulate the iteration logic. Breeze has a `MarkovChain` object which can take kernels of this form and return a stochastic `Process` object representing the iteration logic, as follows.

    MarkovChain(0.0)(kernel).
      steps.
      drop(1000).
      take(10000).
      foreach(println)

The `steps` method contains the logic of how to advance the state of the chain. But again note that no computation actually takes place until the `foreach` method is encountered – this is when the sampling occurs and the side-effects happen.
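To make the idea of a kernel as a description of a distribution a little more concrete, here is a toy stand-in for a probability monad written with only the standard library. This is a sketch under stated assumptions and not how Breeze implements `Rand`: the names `ToyRand`, `Dist`, `uniform`, `logPdf` and `chain` are all hypothetical, and the random number generator is passed in explicitly at the point where a value is finally drawn.

```scala
import scala.util.Random

object ToyRand {
  // A description of how to draw a value; nothing is sampled until draw is applied
  final case class Dist[A](draw: Random => A) {
    def map[B](f: A => B): Dist[B] = Dist(rng => f(draw(rng)))
    def flatMap[B](f: A => Dist[B]): Dist[B] = Dist(rng => f(draw(rng)).draw(rng))
  }

  def uniform(lo: Double, hi: Double): Dist[Double] =
    Dist(rng => lo + (hi - lo) * rng.nextDouble())

  // Log-density of a standard normal
  def logPdf(x: Double): Double = -0.5 * x * x - 0.5 * math.log(2 * math.Pi)

  // Same Metropolis kernel as in the post, built from map and flatMap
  def kernel(x: Double): Dist[Double] = for {
    innov <- uniform(-0.5, 0.5)
    u <- uniform(0.0, 1.0)
  } yield {
    val can = x + innov
    if (math.log(u) < logPdf(can) - logPdf(x)) can else x
  }

  // The only place where sampling (and hence the side-effect) happens
  def chain(init: Double, n: Int, rng: Random): List[Double] =
    Iterator.iterate(init)(x => kernel(x).draw(rng)).take(n).toList
}
```

Here `ToyRand.kernel(0.0)` is just a value, and drawing from it twice with identically seeded generators gives the same result, so the randomness is confined to the call to `draw`.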

Metropolis-Hastings is a common use-case for Markov chains, so Breeze actually has a helper method built-in that will construct a MH sampler directly from an initial state, a proposal kernel, and a (log) target.

    MarkovChain.
      metropolisHastings(0.0, (x: Double) =>
        Uniform(x - 0.5, x + 0.5))(x =>
          Gaussian(0.0, 1.0).logPdf(x)).
      steps.
      drop(1000).
      take(10000).
      toArray

Note that if you are using the MH functionality in Breeze, it is important to make sure that you are using version 0.13 (or later), as I fixed a few issues with the MH code shortly prior to the 0.13 release.

Viewing MCMC algorithms as infinite streams of state is useful for writing elegant, generic, flexible code. Streams occur everywhere in programming, and so there are lots of libraries for working with them. In this post I used the simple `Stream` from the Scala standard library, but there are much more powerful and flexible stream libraries for Scala, including fs2 and Akka-streams. But whatever libraries you are using, the fundamental concepts are the same. The most straightforward approach to implementation is to define impure stochastic streams to consume. However, a pure functional approach is also possible, and the Breeze library defines some useful functions to facilitate this approach. I’m still a little bit ambivalent about whether the pure approach is worth the additional cognitive overhead, but it’s certainly very interesting and worth playing with and thinking about the pros and cons.

Complete runnable code for the examples in this post is available from my blog repo.


Apache Spark is a Scala library for analysing "big data". It can be used for analysing huge (internet-scale) datasets distributed across large clusters of machines. The analysis can be anything from the computation of simple descriptive statistics associated with the datasets, through to rather sophisticated machine learning pipelines involving data pre-processing, transformation, nonlinear model fitting and regularisation parameter tuning (via methods such as cross-validation). A relatively impartial overview can be found in the Apache Spark Wikipedia page.

Although Spark is really aimed at data that can’t easily be analysed on a laptop, it is actually very easy to install and use (in standalone mode) on a laptop, and a good laptop with a fast multicore processor and plenty of RAM is fine for datasets up to a few gigabytes in size. This post will walk through getting started with Spark: installing it locally (not requiring admin/root access), doing some simple descriptive analysis, and moving on to fit a simple linear regression model to some simulated data. After this walk-through it should be relatively easy to take things further by reading the Spark documentation, which is generally pretty good.

Anyone who is interested in learning more about setting up and using Spark clusters may want to have a quick look over on my personal blog (mainly concerned with the Raspberry Pi), where I have previously considered installing Spark on a Raspberry Pi 2, setting up a small Spark cluster, and setting up a larger Spark cluster. Although these posts are based around the Raspberry Pi, most of the material there is quite generic, since the Raspberry Pi is just a small (Debian-based) Linux server.

The only pre-requisite for installing Spark is a recent Java installation. On Debian-based Linux systems (such as Ubuntu), Java can be installed with:

    sudo apt-get update
    sudo apt-get install openjdk-8-jdk

For other systems you should Google for the best way to install Java. If you aren’t sure whether you have Java or not, type `java -version` into a terminal window. If you get a version number of the form 1.7.x or 1.8.x you should be fine.

Once you have Java installed, you can download and install Spark in any appropriate place in your file-system. If you are running Linux, or a Unix-alike, just `cd` to an appropriate place and enter the following commands:

    wget http://www.eu.apache.org/dist/spark/spark-2.1.0/spark-2.1.0-bin-hadoop2.7.tgz
    tar xvfz spark-2.1.0-bin-hadoop2.7.tgz
    cd spark-2.1.0-bin-hadoop2.7
    bin/run-example SparkPi 10

If all goes well, the last command should run an example. Don’t worry if there are lots of INFO and WARN messages – we will sort that out shortly. On other systems it should simply be a matter of downloading and unpacking Spark somewhere appropriate, then running the example from the top-level Spark directory. Get Spark from the downloads page. You should get version 2.1.0 built for Hadoop 2.7. It doesn’t matter if you don’t have Hadoop installed – it is not required for single-machine use.

The INFO messages are useful for debugging cluster installations, but are too verbose for general use. On a Linux system you can turn down the verbosity with:

sed 's/rootCategory=INFO/rootCategory=WARN/g' < conf/log4j.properties.template > conf/log4j.properties

On other systems, copy the file `log4j.properties.template` in the `conf` sub-directory to `log4j.properties` and edit the file, replacing `INFO` with `WARN` on the relevant line. Check it has worked by re-running the `SparkPi` example – it should be much less verbose this time. You can also try some other examples:

    bin/run-example SparkLR
    ls examples/src/main/scala/org/apache/spark/examples/

There are several different ways to use Spark. For this walk-through we are just going to use it interactively from the "Spark shell". We can pop up a shell with:

bin/spark-shell --master local[4]

The "4" refers to the number of worker threads to use. Four is probably fine for most decent laptops. `Ctrl-D` or `:quit` will exit the Spark shell and take you back to your OS shell. It is more convenient to have the Spark `bin` directory in your path. If you are using `bash` or a similar OS shell, you can temporarily add the Spark `bin` to your path with the OS shell command:

export PATH=$PATH:`pwd`/bin

You can make this permanent by adding a line like this (but with the full path hard-coded) to your `.profile` or similar start-up dot-file. I prefer not to do this, as I typically have several different Spark versions on my laptop and want to be able to select exactly the version I need. If you are not running `bash`, Google how to add a directory to your path. Check the path update has worked by starting up a shell with:

spark-shell --master local[4]

Note that if you want to run a script containing Spark commands in "batch mode", you could do it with a command like:

spark-shell --driver-memory 25g --master local[4] < spark-script.scala | tee script-out.txt

There are much better ways to develop and submit batch jobs to Spark clusters, but I won’t discuss those in this post. Note that while Spark is running, diagnostic information about the "cluster" can be obtained by pointing a web browser at port 4040 on the master, which here is just http://localhost:4040/ – this is extremely useful for debugging purposes.

We are now ready to start using Spark. From a Spark shell in the top-level directory, enter:

sc.textFile("README.md").count

If all goes well, you should get a count of the number of lines in the file `README.md`. The value `sc` is the "Spark context", containing information about the Spark cluster (here it is just a laptop, but in general it could be a large cluster of machines, each with many processors and each processor with many cores). The `textFile` method loads up the file into an RDD (Resilient Distributed Dataset). The RDD is the fundamental abstraction provided by Spark. It is a lazy distributed parallel monadic collection. After loading a text file like this, each element of the collection represents one line of the file. I’ve talked about monadic collections in previous posts, so if this isn’t a familiar concept, it might be worth having a quick skim through at least the post on first steps with monads in Scala. The point is that although RDDs are potentially huge and distributed over a large cluster, using them is very similar to using any other monadic collection in Scala. We can unpack the previous command slightly as follows:

    val rdd1 = sc.textFile("README.md")
    rdd1
    rdd1.count

Note that RDDs are "lazy", and this is important for optimising complex pipelines. So here, after assigning the value `rdd1`, no data is actually loaded into memory. All of the actual computation is deferred until an "action" is called – `count` is an example of such an action, and therefore triggers the loading of data into memory and the counting of elements.
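The split between lazy transformations and actions can be mimicked locally with a standard-library `Iterator`. This is only an analogy and not Spark code: `LazySketch`, `pipeline`, `total` and the `linesRead` counter are names made up for the illustration.

```scala
object LazySketch {
  // Counts how many elements have actually been processed
  var linesRead = 0

  val data = Vector("a b", "c d e", "f")

  // "Transformation": describes the work but does not perform it
  def pipeline: Iterator[Int] =
    data.iterator.map { line => linesRead += 1; line.length }

  // "Action": consuming the iterator forces the computation
  def total: Int = pipeline.sum
}
```

Building `LazySketch.pipeline` leaves `linesRead` untouched; only calling `LazySketch.total` walks over the data.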

We can now look at a very slightly more complex pipeline – counting the number of words in a text file rather than the number of lines. This can be done as follows:

    sc.textFile("README.md").
      map(_.trim).
      flatMap(_.split(' ')).
      count

Note that `map` and `flatMap` are both lazy ("transformations" in Spark terminology), and so no computation is triggered until the final action, `count`, is called. The call to `map` will just trim any redundant white-space from the line ends. So after the call to `map` the RDD will still have one element for each line of the file. However, the call to `flatMap` splits each line on white-space, so after this call each element of the RDD will correspond to a word, and not a line. So, the final `count` will again count the number of elements in the RDD, but here this corresponds to the number of words in the file.
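The same pipeline can be tried on an ordinary Scala collection, which also exposes a small caveat: splitting on a single space produces empty strings for blank lines and for runs of consecutive spaces. The sketch below is plain Scala rather than Spark, and `WordCountSketch`, `naiveCount` and `nonEmptyCount` are hypothetical names.

```scala
object WordCountSketch {
  // Direct local analogue of map(_.trim).flatMap(_.split(' ')).count
  def naiveCount(lines: Seq[String]): Int =
    lines.map(_.trim).flatMap(_.split(' ').toList).size

  // Variant that drops the empty strings produced by blank lines
  // and by consecutive spaces
  def nonEmptyCount(lines: Seq[String]): Int =
    lines.map(_.trim).flatMap(_.split(' ').toList).count(_.nonEmpty)
}
```

On an RDD the equivalent adjustment would be a `filter(_.nonEmpty)` before the `count`.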

A final example before moving on to look at quantitative data analysis: counting the frequency with which each character occurs in a file. This can be done as follows:

    sc.textFile("README.md").
      map(_.toLowerCase).
      flatMap(_.toCharArray).
      map{(_,1)}.
      reduceByKey(_+_).
      collect

The first call to `map` converts upper case characters to lower case, as we don’t want separate counts for upper and lower case characters. The call to `flatMap` then makes each element of the RDD correspond to a single character in the file. The second call to `map` transforms each element of the RDD to a key-value pair, where the key is the character and the value is the integer 1. RDDs have special methods for key-value pairs in this form – the method `reduceByKey` is one such – it applies the reduction operation (here just "+") to all values corresponding to a particular value of the key. Since each character has the value 1, the sum of the values will be a character count. Note that the reduction will be done in parallel, and for this to work it is vital that the reduction operation is associative (and also commutative, since the order in which partial results are combined is not fixed). Simple addition of integers clearly has both properties, so here we are fine. Note that `reduceByKey` is a (lazy) transformation, and so the computation needs to be triggered by a call to the action `collect`.
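For intuition about what `reduceByKey` computes, here is a local analogue on an ordinary Scala collection. It is serial, plain Scala and not Spark; `CharCountSketch` and `charCounts` are names invented for this sketch.

```scala
object CharCountSketch {
  // Local analogue of map{(_,1)} followed by reduceByKey(_+_):
  // group the pairs by key, then reduce the values within each group
  def charCounts(lines: Seq[String]): Map[Char, Int] =
    lines.
      map(_.toLowerCase).
      flatMap(_.toCharArray.toList).
      map((_, 1)).
      groupBy(_._1).
      map { case (ch, pairs) => (ch, pairs.map(_._2).reduce(_ + _)) }
}
```

Unlike this local version, Spark can combine values for a key within each partition before moving data between machines, which is why the properties of the reduction operation matter there.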

On most Unix-like systems there is a file called `words` that is used for spell-checking. The example below applies the character count to this file. Note the calls to `filter`, which filter out any elements of the RDD not matching the predicate. Here it is used to filter out special characters.

    sc.textFile("/usr/share/dict/words").
      map(_.trim).
      map(_.toLowerCase).
      flatMap(_.toCharArray).
      filter(_ > '/').
      filter(_ < '}').
      map{(_,1)}.
      reduceByKey(_+_).
      collect

We first need some quantitative data, so let’s simulate some. Breeze is the standard Scala library for scientific and statistical computing. I’ve given a quick introduction to Breeze in a previous post. Spark has a dependence on Breeze, and therefore Breeze can be used from inside the Spark shell – this is very useful. So, we start by using Breeze to simulate a vector of normal random quantities:

    import breeze.stats.distributions._
    val x = Gaussian(1.0,2.0).sample(10000)

Note, though, that `x` is just a regular Scala collection (the `sample` method returns an `IndexedSeq`), a simple serial collection all stored in RAM on the master thread. To use it as a Spark RDD, we must convert it to one, using the `parallelize` function:

val xRdd = sc.parallelize(x)

Now `xRdd` is an RDD, and so we can do Spark transformations and actions on it. There are some special methods for RDDs containing numeric values:

    xRdd.mean
    xRdd.sampleVariance

Each summary statistic is computed with a single pass through the data, but if several summary statistics are required, it is inefficient to make a separate pass through the data for each summary, so the `stats` method makes a single pass through the data returning a `StatCounter` object that can be used to compute various summary statistics.

    val xStats = xRdd.stats
    xStats.mean
    xStats.sampleVariance
    xStats.sum

The `StatCounter` methods are: `count`, `mean`, `sum`, `max`, `min`, `variance`, `sampleVariance`, `stdev`, `sampleStdev`.
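To see how several of these summaries can come out of one pass, here is a sketch of a running summary based on Welford's update, in plain Scala. It is an illustration of the single-pass idea and is not claimed to be how Spark implements the object returned by `stats`; `OnePassStats`, `Summary` and `summarise` are hypothetical names.

```scala
object OnePassStats {
  // Running summary: count, mean, and sum of squared deviations from the mean
  final case class Summary(n: Long = 0L, mean: Double = 0.0, m2: Double = 0.0) {
    // Welford's update: fold one more observation into the summary
    def add(x: Double): Summary = {
      val n1 = n + 1
      val delta = x - mean
      val mean1 = mean + delta / n1
      Summary(n1, mean1, m2 + delta * (x - mean1))
    }
    def sum: Double = mean * n
    def variance: Double = m2 / n              // divides by n
    def sampleVariance: Double = m2 / (n - 1)  // divides by n - 1
  }

  // A single left fold over the data produces every summary at once
  def summarise(xs: Iterable[Double]): Summary =
    xs.foldLeft(Summary())((s, x) => s.add(x))
}
```

For example, `OnePassStats.summarise(List(1.0, 2.0, 3.0, 4.0))` gives a mean of 2.5 and a sample variance of 5/3 from a single traversal.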

Moving beyond very simple descriptive statistics, we will look at a simple linear regression model, which will also allow us to introduce Spark `DataFrame`s – a high level abstraction layered on top of RDDs which makes working with tabular data much more convenient, especially in the context of statistical modelling.

We start with some standard (non-Spark) Scala Breeze code to simulate some data from a simple linear regression model. We use the `x` already simulated as our first covariate. Then we simulate a second covariate, `x2`. Then, using some residual noise, `eps`, we simulate a regression model scenario, where we know that the "true" intercept is 1.5 and the "true" covariate regression coefficients are 2.0 and 1.0.

    val x2 = Gaussian(0.0,1.0).sample(10000)
    val xx = x zip x2
    val lp = xx map {p => 2.0*p._1 + 1.0*p._2 + 1.5}
    val eps = Gaussian(0.0,1.0).sample(10000)
    val y = (lp zip eps) map (p => p._1 + p._2)
    val yx = (y zip xx) map (p => (p._1,p._2._1,p._2._2))
    val rddLR = sc.parallelize(yx)

Note that the last line converts the regular Scala collection into a Spark RDD using `parallelize`. We could, in principle, do regression modelling using raw RDDs, and early versions of Spark required this. However, statisticians used to statistical languages such as R know that data frames are useful for working with tabular data. I gave a brief overview of Scala data frame libraries in a previous post. We can convert an RDD of tuples to a Spark `DataFrame` as follows:

    val dfLR = rddLR.toDF("y","x1","x2")
    dfLR.show
    dfLR.show(5)

Note that `show` shows the first few rows of a `DataFrame`, and giving it a numeric argument specifies the number to show. This is very useful for quick sanity-checking of `DataFrame` contents.

Note that there are other ways of getting data into a Spark `DataFrame`. One of the simplest ways to get data into Spark from other systems is via a CSV file. A properly formatted CSV file with a header row can be read into Spark with a command like:

    // Don't run unless you have an appropriate CSV file...
    val df = spark.read.
      option("header","true").
      option("inferSchema","true").
      csv("myCsvFile.csv")

This requires two passes over the data – one to infer the schema and one to actually read the data. For very large datasets it is better to declare the schema and not use automatic schema inference. However, for very large datasets, CSV probably isn’t a great choice of format anyway. Spark supports many more efficient data storage formats. Note that Spark also has functions for querying SQL (and other) databases, and reading query results directly into `DataFrame` objects. For people familiar with databases, this is often the most convenient way of ingesting data into Spark. See the Spark DataFrames guide and the API docs for DataFrameReader for further information.

Spark has an extensive library of tools for the development of sophisticated machine learning pipelines. Included in this are functions for fitting linear regression models, regularised regression models (Lasso, ridge, elastic net), generalised linear models, including logistic regression models, etc., and tools for optimising regularisation parameters, for example, using cross-validation. For this post I’m just going to show how to fit a simple OLS linear regression model: see the ML pipeline documentation for further information, especially the docs on classification and regression.

We start by creating an object for fitting linear regression models:

    import org.apache.spark.ml.regression.LinearRegression
    import org.apache.spark.ml.linalg._

    val lm = new LinearRegression
    lm.explainParams
    lm.getStandardization
    lm.setStandardization(false)
    lm.getStandardization
    lm.explainParams

Note that there are many parameters associated with the fitting algorithm, including regularisation parameters. These are set to defaults corresponding to no regularisation (simple OLS). Note, however, that the algorithm defaults to standardising covariates to be mean zero variance one. We can turn that off before fitting the model if desired.

Also note that the model fitting algorithm assumes that the `DataFrame` to be fit has (at least) two columns, one called `label` containing the response variable, and one called `features`, where each element is actually a `Vector` of covariates. So we first need to transform our `DataFrame` into the required format.

    // Transform data frame to required format
    val dflr = (dfLR map {row => (row.getDouble(0),
               Vectors.dense(row.getDouble(1),row.getDouble(2)))}).
               toDF("label","features")
    dflr.show(5)

Now we have the data in the correct format, it is simple to fit the model and look at the estimated parameters.

    // Fit model
    val fit = lm.fit(dflr)
    fit.intercept
    fit.coefficients

You should see that the estimated parameters are close to the "true" parameters that were used to simulate from the model. More detailed diagnostics can be obtained from the fitted summary object.

    val summ = fit.summary
    summ.r2
    summ.rootMeanSquaredError
    summ.coefficientStandardErrors
    summ.pValues
    summ.tValues
    summ.predictions
    summ.residuals

So, that’s how to fit a simple OLS linear regression model. Fitting GLMs (including logistic regression) is very similar, and setting up routines to tune regularisation parameters via cross-validation is not much more difficult.
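As a Spark-free cross-check of what an OLS fit estimates, the normal equations can be solved directly for a small problem. The sketch below is plain Scala with hypothetical names (`OlsSketch`, `solve`, `ols`); it makes no claim about how Spark fits the model internally, and it is only sensible for a handful of covariates.

```scala
object OlsSketch {
  // Solve the linear system a * b = c by Gaussian elimination with partial pivoting
  def solve(a: Array[Array[Double]], c: Array[Double]): Array[Double] = {
    val n = c.length
    // Augmented matrix [a | c], copied so that the inputs are not modified
    val m = Array.tabulate(n, n + 1)((i, j) => if (j < n) a(i)(j) else c(i))
    for (col <- 0 until n) {
      // Swap in the row with the largest pivot for numerical stability
      val piv = (col until n).maxBy(r => math.abs(m(r)(col)))
      val tmp = m(col); m(col) = m(piv); m(piv) = tmp
      for (r <- col + 1 until n) {
        val f = m(r)(col) / m(col)(col)
        for (j <- col to n) m(r)(j) -= f * m(col)(j)
      }
    }
    // Back-substitution
    val b = new Array[Double](n)
    for (i <- n - 1 to 0 by -1) {
      val s = (i + 1 until n).map(j => m(i)(j) * b(j)).sum
      b(i) = (m(i)(n) - s) / m(i)(i)
    }
    b
  }

  // OLS estimates (intercept first) via the normal equations (X'X) b = X'y
  def ols(xs: Seq[Array[Double]], y: Seq[Double]): Array[Double] = {
    val rows = xs.map(x => 1.0 +: x) // prepend the intercept column
    val p = rows.head.length
    val xtx = Array.tabulate(p, p)((i, j) => rows.map(r => r(i) * r(j)).sum)
    val xty = Array.tabulate(p)(i => rows.zip(y).map { case (r, yi) => r(i) * yi }.sum)
    solve(xtx, xty)
  }
}
```

Applied to data simulated as in this post, `OlsSketch.ols` should return values near the intercept 1.5 and coefficients 2.0 and 1.0, and with noise-free responses it recovers them up to rounding error.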

As previously mentioned, once you are up and running with a Spark shell, the official Spark documentation is reasonably good. First go through the quick start guide, then the programming guide, then the ML guide, and finally, consult the API docs. I discussed books on Scala for data science in the previous post – many of these cover Spark to a greater or lesser extent.

I recently gave a talk on some of the general principles behind the use of functional programming for scalable statistical computing, and how concepts from category theory, such as monads, can help. The PDF slides are available. I’m not sure how comprehensible they will be without my explanations and white-board diagrams, but come to think of it, I’m not sure how comprehensible they were *with* my explanations and white-board diagrams… Also note that I occasionally run a three-day short-course on Scala for statistical computing, and much of the final day is concerned with using Apache Spark.


People regularly ask me about books and other resources for getting started with Scala for statistical computing and data science. This post will focus on books, but it’s worth briefly noting that there are a number of other resources available, on-line and otherwise, that are also worth considering. I particularly like the Coursera course Functional Programming Principles in Scala – I still think this is probably the best way to get started with Scala and functional programming for most people. In fact, there is an entire Functional Programming in Scala Specialization that is worth considering – I’ll probably discuss that more in another post. I’ve got a draft page of Scala links which has a bias towards scientific and statistical computing, and I’m currently putting together a short course in that area, which I’ll also discuss further in future posts. But this post will concentrate on books.

Before one can dive into statistical computing and data science using Scala, it’s a good idea to understand a bit about the language and about functional programming. There are by now many books on Scala, and I haven’t carefully reviewed all of them, but I’ve looked at enough to have an idea about good ways of getting started.

- Programming in Scala: Third edition, Odersky et al, Artima.
- This is
*the*Scala book, often referred to on-line as**PinS**. It is a weighty tome, and works through the Scala language in detail, starting from the basics. Every serious Scala programmer should own this book. However, it isn’t the easiest introduction to the language. - Scala for the Impatient, Horstmann, Addison-Wesley.
- As the name suggests, this is a much quicker and easier introduction to Scala than PinS, but assumes reasonable familiarity with programming in general, and sort-of assumes that the reader has a basic knowledge of Java and the JVM ecosystem. That said, it does not assume that the reader is a Java expert. My feeling is that for someone who has a reasonable programming background and a passing familiarity with Java, then this book is probably the best introduction to the language. Note that there is a second edition in the works.
- Functional Programming in Scala, Chiusano and Bjarnason, Manning.
- It is possible to write Scala code in the style of "Java-without-the-semi-colons", but really the whole point of Scala is to move beyond that kind of Object-Oriented programming style. How much you venture down the path towards pure Functional Programming is very much a matter of taste, but many of the best Scala programmers are pretty hard-core FP, and there’s probably a reason for that. But many people coming to Scala don’t have a strong FP background, and getting up to speed with strongly-typed FP isn’t easy for people who only know an imperative (Object-Oriented) style of programming. *This* is the book that will help you to make the jump to FP. Sometimes referred to online as **FPiS**, or more often even just as the **red book**, this is also a book that every serious Scala programmer should own (and read!). Note that it isn’t really a book *about* Scala – it is a book about strongly typed FP that just "happens" to *use* Scala for illustrating the ideas. Consequently, you will probably want to augment this book with a book that really is about Scala, such as one of the books above. Since this is the first book on the list published by Manning, I should also mention how much I like computing books from this publisher. They are typically well-produced, and their paper books (pBooks) come with complimentary access to well-produced DRM-free eBook versions, however you purchase them.
- Functional and Reactive Domain Modeling, Ghosh, Manning.
- This is another book that isn’t really *about* Scala, but about software engineering using a strongly typed FP language. But again, it uses Scala to illustrate the ideas, and is an excellent read. You can think of it as a more practical "hands-on" follow-up to the red book, which shows how the ideas from the red book translate into effective solutions to real-world problems.
- Structure and Interpretation of Computer Programs, second edition, Abelson et al, MIT Press.
- This is not a Scala book! This is the only book in this list which doesn’t use Scala at all. I’ve included it on the list because it is one of the best books on programming that I’ve read, and is the book that I wish someone had told me about 20 years ago! In fact the book uses Scheme (a Lisp derivative) as the language to illustrate the ideas. There are obviously important differences between Scala and Scheme – eg. Scala is strongly statically typed and compiled, whereas Scheme is dynamically typed and interpreted. However, there are also similarities – eg. both languages support and encourage a functional style of programming but are not pure FP languages. Referred to on-line as **SICP**, this book is a classic. Note that there is no need to buy a paper copy if you like eBooks, since electronic versions are available free on-line.

- Scala for Data Science, Bugnion, Packt.
- Not to be confused with the (terrible) book, Scala for machine learning by the same publisher. Scala for Data Science is my top recommendation for getting started with statistical computing and data science applications using Scala. I have reviewed this book in another post, so I won’t say more about it here (but I like it).
- Scala Data Analysis Cookbook, Manivannan, Packt.
- I’m not a huge fan of the cookbook format, but this book is really mis-named, as it isn’t really a cookbook and isn’t really about data analysis in Scala! It is really a book about Apache Spark, and proceeds fairly sequentially in the form of a tutorial introduction to Spark. Spark is an impressive piece of technology, and it is obviously one of the factors driving interest in Scala, but it’s important to understand that Spark isn’t Scala, and that many typical data science applications will be better tackled using Scala without Spark. I’ve not read this book cover-to-cover as it offers little over Scala for Data Science, but its coverage of Spark is a bit more up-to-date than the Spark books I mention below, so it could be of interest to those who are mainly interested in Scala for Spark.
- Scala High Performance Programming, Theron and Diamant, Packt.
- This is an interesting book, fundamentally about developing high performance streaming data processing algorithm pipelines in Scala. It makes no reference to Spark. The running application is an on-line financial trading system. It takes a deep dive into understanding performance in Scala and on the JVM, and looks at how to benchmark and profile performance, diagnose bottlenecks and optimise code. This is likely to be of more interest to those interested in developing efficient algorithms for scientific and statistical computing rather than applied data scientists, but it covers some interesting material not covered by any of the other books in this list.
- Learning Spark, Karau et al, O’Reilly.
- This book provides an introduction to Apache Spark, written by some of the people who developed it. Spark is a big data analytics framework built on top of Scala. It is arguably the best available framework for big data analytics on computing clusters in the cloud, and hence there is a lot of interest in it. The book is a perfectly good introduction to Spark, and shows most examples implemented using the Java and Python APIs in addition to the canonical Scala (Spark Shell) implementation. This is useful for people working with multiple languages, but can be mildly irritating to anyone who is only interested in Scala. However, the big problem with this (and every other) book on Spark is that Spark is evolving very quickly, and so by the time any book on Spark is written and published it is inevitably very out of date. It’s not clear that it is worth buying a book specifically about Spark at this stage, or whether it would be better to go for a book like *Scala for Data Science*, which has a couple of chapters of introduction to Spark, which can then provide a starting point for engaging with Spark’s on-line documentation (which is reasonably good).
- Advanced Analytics with Spark, Ryza et al, O’Reilly.
- This book has a bit of a "cookbook" feel to it, which some people like and some don’t. It’s really more like an "edited volume" with different chapters authored by different people. Unlike Learning Spark it focuses exclusively on the Scala API. The book basically covers the development of a bunch of different machine learning pipelines for a variety of applications. My main problem with this book is that it has aged particularly badly, as all of the pipelines are developed with raw RDDs, which isn’t how ML pipelines in Spark are constructed any more. So again, it’s difficult for me to recommend. The message here is that if you are thinking of buying a book about Spark, check very carefully when it was published and what version of Spark it covers and whether that is sufficiently recent to be of relevance to you.

There are lots of books to get started with Scala for statistical computing and data science applications. My "bare minimum" recommendation would be some generic Scala book (doesn’t really matter which one), the *red book*, and *Scala for data science*. After reading those, you will be very well placed to top-up your knowledge as required with on-line resources.


- Scala for Data Science, Bugnion, Packt, 2016.

*Disclaimer: This book review has not been solicited by the publisher (or anyone else) in any way. I purchased the review copy of this book myself. I have not received any benefit from the writing of this review.*

On this blog I previously reviewed the (terrible) book, Scala for machine learning by the same publisher. I was therefore rather wary of buying this book. But the topic coverage looked good, so I decided to buy it, and wasn’t disappointed. Scala for Data Science is my top recommendation for getting started with statistical computing and data science applications using Scala.

The book assumes a basic familiarity with programming in Scala, at around the level of someone who has completed the Functional Programming Principles in Scala Coursera course. That is, it (quite sensibly) doesn’t attempt to teach the reader how to program in Scala, but rather how to approach the development of data science applications using Scala. It introduces more advanced Scala idioms gradually (eg. typeclasses don’t appear until Chapter 5), so it is relatively approachable for those who aren’t yet Scala experts. The book does cover Apache Spark, but Spark isn’t introduced until Chapter 10, so it isn’t “just another Spark book”. Most of the book is about developing data science applications in Scala, completely independently of Spark. That said, it also provides one of the better introductions to Spark, so doubles up as a pretty good introductory Spark book, in addition to being a good introduction to the development of data science applications with Scala. It should probably be emphasised that the book is very much focused on data science, rather than statistical computing, but there is plenty of material of relevance to those who are more interested in statistical computing than applied data science.

1. *Scala and Data Science* – motivation for using Scala in preference to certain other languages I could mention…
2. *Manipulating data with Breeze* – Breeze is the standard Scala library for scientific and statistical computing. It’s pretty good, but documentation is rather lacking. This Chapter provides a good tutorial introduction to Breeze, which should be enough to get people going sufficiently to be able to make some sense of the available on-line documentation.
3. *Plotting with breeze-viz* – Breeze has some support for plotting and visualisation of data. It’s somewhat limited when compared to what is available in R, but is fine for interactive exploratory analysis. However, the available on-line documentation for breeze-viz is almost non-existent. This Chapter is the best introduction to breeze-viz that I have seen.
4. *Parallel collections and futures* – the Scala standard library has built-in support for parallel and concurrent programming based on functional programming concepts such as parallel (monadic) collections and Futures. Again, this Chapter provides an excellent introduction to these powerful concepts, allowing the reader to start developing parallel algorithms for multi-core hardware with minimal fuss.
5. *Scala and SQL through JDBC* – this Chapter looks at connecting to databases using standard JVM mechanisms such as JDBC. However, it gradually introduces more functional ways of interfacing with databases using typeclasses, motivating:
6. *Slick – a functional interface for SQL* – an introduction to the Slick library for a more Scala-esque way of database interfacing.
7. *Web APIs* – the practicalities of talking to web APIs. eg. authenticated HTTP requests and parsing of JSON responses.
8. *Scala and MongoDB* – working with a NoSQL database from Scala
9. *Concurrency with Akka* – Akka is the canonical implementation of the actor model in Scala, for building large concurrent applications. It is the foundation on which Spark is built.
10. *Distributed batch processing with Spark* – a tutorial introduction to Apache Spark. Spark is a big data analytics framework built on top of Scala and Akka. It is arguably the best available framework for big data analytics on computing clusters in the cloud, and hence there is a lot of interest in it. Indeed, Spark is driving some of the interest in Scala.
11. *Spark SQL and DataFrames* – interfacing with databases using Spark, and more importantly, an introduction to Spark’s DataFrame abstraction, which is now fundamental to developing machine learning pipelines in Spark.
12. *Distributed machine learning with MLLib* – MLLib is the machine learning library for Spark. It is worth emphasising that unlike many early books on Spark, this chapter covers the newer DataFrame-based pipeline API, in addition to the original RDD-based API. Together, Chapters 10, 11 and 12 provide a pretty good tutorial introduction to Spark. After working through these, it should be easy to engage with the official on-line Spark documentation.
13. *Web APIs with Play* – is concerned with developing a web API at the end of a data science pipeline.
14. *Visualisation with D3 and the Play framework* – is concerned with integrating visualisation into a data science web application.

This book provides a good tutorial introduction to a large number of topics relevant to statisticians and data scientists interested in developing data science applications using Scala. After working through this book, readers should be well-placed to augment their knowledge with readily searchable on-line documentation.

In a follow-up post I will give a quick overview of some other books relevant to getting started with Scala for statistical computing and data science.


The Systems Biology Markup Language (SBML) is an XML-based format for representation and exchange of biochemical network models. SBML is supported by most systems biology modelling tools, allowing the export of a model in SBML from one tool and then reading in another tool. Because it offers a standard way of representing biochemical networks in an unambiguous way, it can also be used as the standard way of representing models in databases of biochemical network models, such as BioModels. I haven’t talked about SBML much in this blog, so far, but I discuss it in detail in my book, Stochastic modelling for systems biology. SBML is a “good thing”, and everyone who works with (deterministic or stochastic) biochemical network models should know a bit about it.

The SBML format is fairly complex to parse and generate correctly, so it’s preferable to use a software library to take care of the details. libSBML is the community standard library developed for this purpose. It is a C++ library, but has interfaces for other languages, such as Python and Java. However, whilst it’s perfectly possible to use native libraries on the JVM, they aren’t so convenient to work with, especially in conjunction with modern automatic build and deployment tools. So when working on the JVM, a pure JVM library for working with SBML would be a lot more convenient. JSBML is exactly that – a pure Java library for working with SBML on the JVM. As of version 1.2, it is also available from Maven Central, making it super-convenient to use with modern build tools such as Maven and sbt. In this post I’ll walk through getting started with using Scala and sbt to build and run a trivial JSBML example, and highlight a couple of gotchas and provide pointers for further reading.

Since JSBML is now on Maven Central, adding a dependency on it should just be a matter of adding the line

    libraryDependencies += "org.sbml.jsbml" % "jsbml" % "1.2"

to your sbt `build.sbt` file. However, for slightly mysterious reasons this doesn’t quite work. It works fine for compilation, but at runtime some dependencies are missing. I suspect this is a slight problem with the current JSBML build, but it could also be a bug/feature in sbt. Either way, the problem can be solved by explicitly including log4j dependencies in the build. So just adding:

    libraryDependencies ++= Seq(
      "org.sbml.jsbml" % "jsbml" % "1.2",
      "org.apache.logging.log4j" % "log4j-1.2-api" % "2.3",
      "org.apache.logging.log4j" % "log4j-api" % "2.3",
      "org.apache.logging.log4j" % "log4j-core" % "2.3"
    )

to the build file is sufficient to make everything work properly.

Below is a complete Scala program to read an SBML file from disk and print to console some very basic information about the model.

    object JsbmlApp {

      import org.sbml.jsbml.SBMLReader
      import scala.collection.JavaConversions._

      def main(args: Array[String]): Unit = {
        val filename = if (args.length == 0) "ch07-mm-stoch.xml" else args(0)
        val reader = new SBMLReader
        val document = reader.readSBML(filename)
        val model = document.getModel
        println(model.getId + "\n" + model.getName)
        val listOfSpecies = model.getListOfSpecies
        val ns = model.getNumSpecies
        println(s"$ns Species:")
        listOfSpecies.iterator.foreach(species => {
          println(" " + species.getId + "\t" + species.getName + "\t" +
            species.getCompartment + "\t" + species.getInitialAmount)
        })
        val nr = model.getNumReactions
        println(s"$nr Reactions.")
      }

    }

There are just a few things worth noting about this simple example. The first gotcha is to try and resist the temptation to import all SBML classes into the namespace (with `import org.sbml.jsbml._`). This is poor programming practice at the best of times, but here it is especially problematic. Scala programmers will be aware that `Unit` is a very important type in the Scala language, which has nothing to do with the JSBML class `Unit`, which represents a physical unit of measurement. The clash can be avoided by using the fully qualified name, `org.sbml.jsbml.Unit` wherever the JSBML `Unit` class is intended, but that is rather cumbersome, so the typical Scala mechanism for dealing with this is to rename the class on import, using, for example:

    import org.sbml.jsbml.{ Unit => JsbmlUnit }

Then in code it is clear that `Unit` refers to the Scala type and `JsbmlUnit` refers to the JSBML class.
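The rename-on-import mechanism is not specific to JSBML. As a minimal sketch, using only JDK classes rather than JSBML (the aliases `JList` and `JArrayList` are just illustrative names I've picked), the following keeps the Scala `List` and the Java `java.util.List` visible side by side:

```scala
// Rename the Java collection types on import, so that they cannot be
// confused with (or shadow) Scala's own List
import java.util.{ List => JList, ArrayList => JArrayList }

// JList refers to java.util.List
val javaList: JList[String] = new JArrayList[String]()
javaList.add("S")
javaList.add("E")

// List still refers to the Scala immutable List
val scalaList: List[String] = List("S", "E")

println(s"${javaList.size} Java elements, ${scalaList.length} Scala elements")
```

Without the renaming, importing `java.util.List` directly would shadow the Scala `List` in that scope, which is the same kind of clash as with `Unit`.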

Also note that `JavaConversions` has been imported. This provides an implicit conversion from a Java to a Scala iterator, and this simplifies iterating over SBML `listOf`s. Here it is used to implicitly convert the `listOfSpecies` Java iterator into a Scala iterator so that I can call `foreach` on it.
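It is worth knowing that `JavaConversions` was deprecated in Scala 2.12 and removed in 2.13. A sketch of the explicit alternative, assuming Scala 2.11 to 2.13 and using a plain `java.util.ArrayList` as a stand-in for an SBML `listOf` (the species identifiers are made up), might look like this:

```scala
// JavaConverters adds an explicit .asScala method rather than converting silently.
// (From Scala 2.13 the same methods are also provided by scala.jdk.CollectionConverters.)
import scala.collection.JavaConverters._

// Stand-in for a Java collection returned by a Java API
val javaSpecies = new java.util.ArrayList[String]()
javaSpecies.add("S")
javaSpecies.add("E")
javaSpecies.add("SE")

// Convert the Java iterator explicitly, then use the usual Scala methods on it
val ids = javaSpecies.iterator.asScala.map(id => s"species $id").toList
ids.foreach(println)
```

The explicit `.asScala` call makes it obvious in the code where the boundary between the Java and Scala collections lies.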

This complete runnable example is available in my blog repo on github. This example will run on any system with a recent JVM installed. It does not require Scala, or libSBML, or JSBML, or any other dependency (sbt will take care of dependency resolution).

Once you are up and running with a simple example like this, the JSBML Documentation is fine. Start by reading the User guide and then use the API Documentation.

Working with SBML in Scala is quite convenient using JSBML. It is easy to include a dependence on JSBML in Scala sbt projects. JSBML has a typical Java Object-Oriented API that is somewhat unnatural in Scala, but isn’t too bad using a few tricks, such as implicit iterator conversion. It wouldn’t be very difficult to layer a more functional API on top of JSBML, but I don’t have the energy to do that. See my blog repo for the full runnable example.


Many modern algorithms in computational Bayesian statistics have at their heart a particle filter or some other sequential Monte Carlo (SMC) procedure. In this blog I’ve discussed particle MCMC algorithms which use a particle filter in the inner-loop in order to compute a (noisy, unbiased) estimate of the marginal likelihood of the data. These algorithms are often very computationally intensive, either because the forward model used to propagate the particles is expensive, or because the likelihood associated with each particle/observation is expensive (or both). In this case it is desirable to parallelise the particle filter to run on all available cores of a machine, or in some cases, it would even be desirable to distribute the particle filter computation across a cluster of machines.

Parallelisation is difficult when using the conventional imperative programming languages typically used in scientific and statistical computing, but is much easier using modern functional languages such as Scala. In fact, in languages such as Scala it is possible to describe algorithms at a higher level of abstraction, so that exactly the same algorithm can run in serial, run in parallel across all available cores on a single machine, or run in parallel across a cluster of machines, all without changing any code. Doing so renders parallelisation a non-issue. In this post I’ll talk through how to do this for a simple bootstrap particle filter, but the same principle applies for a large range of statistical computing algorithms.

In the previous post I gave a quick introduction to the monad concept, and to monadic collections in particular. Many computational tasks in statistics can be accomplished using a sequence of operations on monadic collections. We would like to write code that is independent of any particular implementation of a monadic collection, so that we can switch to a different implementation without changing the code of our algorithm (for example, switching from a serial to a parallel collection). But in strongly typed languages we need to know at compile time that the collection we use has the methods that we require. Typeclasses provide a nice solution to this problem. I don’t want to get bogged down in a big discussion about Scala typeclasses here, but suffice to say that they describe a family of types conforming to a particular interface in an *ad hoc* loosely coupled way (they are said to provide ad hoc polymorphism). They are *not* the same as classes in traditional O-O languages, but they do solve a similar problem to the adaptor design pattern, in a much cleaner way. We can describe a simple typeclass for our monadic collection as follows:

    import scala.collection.GenTraversable

    trait GenericColl[C[_]] {
      def map[A, B](ca: C[A])(f: A => B): C[B]
      def reduce[A](ca: C[A])(f: (A, A) => A): A
      def flatMap[A, B, D[B] <: GenTraversable[B]](ca: C[A])(f: A => D[B]): C[B]
      def zip[A, B](ca: C[A])(cb: C[B]): C[(A, B)]
      def length[A](ca: C[A]): Int
    }

In the typeclass we just list the methods that we expect our generic collection to provide, but do not say anything about how they are implemented. For example, we know that operations such as `map` and `reduce` can be executed in parallel, but this is a separate concern. We can now write code that can be used for any collection conforming to the requirements of this typeclass. The full code for this example is provided in the associated github repo for this blog, and includes the obvious syntax for this typeclass, and typeclass instances for the Scala collections `Vector` and `ParVector`, that we will exploit later in the example.
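To give a flavour of what a typeclass instance looks like, here is a minimal sketch using a cut-down, hypothetical typeclass `SimpleColl` (with only three of the methods, and none of the syntax support) rather than the `GenericColl` code from the repo. The names `SimpleColl`, `vectorColl` and `mean` are my own illustrative choices.

```scala
// A cut-down collection typeclass (hypothetical; not the repo's GenericColl).
// On Scala 2.12 and earlier, `import scala.language.higherKinds` silences a feature warning.
trait SimpleColl[C[_]] {
  def map[A, B](ca: C[A])(f: A => B): C[B]
  def reduce[A](ca: C[A])(f: (A, A) => A): A
  def length[A](ca: C[A]): Int
}

// Typeclass instance for Vector: just delegate to the methods Vector already has
implicit val vectorColl: SimpleColl[Vector] = new SimpleColl[Vector] {
  def map[A, B](ca: Vector[A])(f: A => B): Vector[B] = ca map f
  def reduce[A](ca: Vector[A])(f: (A, A) => A): A = ca reduce f
  def length[A](ca: Vector[A]): Int = ca.length
}

// Generic code: works for any C with a SimpleColl instance in scope
def mean[C[_]](xs: C[Double])(implicit ev: SimpleColl[C]): Double =
  ev.reduce(xs)(_ + _) / ev.length(xs)

val m = mean(Vector(1.0, 2.0, 3.0, 6.0))
println(m)
```

The function `mean` knows nothing about `Vector`. Supplying an instance for a parallel collection would allow exactly the same function to run in parallel.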

We can now write some code for a single observation update of a bootstrap particle filter.

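    def update[S: State, O: Observation, C[_]: GenericColl](
      dataLik: (S, O) => LogLik, stepFun: S => S
    )(x: C[S], o: O): (LogLik, C[S]) = {
      // propagate each particle forward using the model
      val xp = x map (stepFun(_))
      // log-weight of each propagated particle given the observation
      val lw = xp map (dataLik(_, o))
      // subtract the max log-weight before exponentiating (log-sum-exp trick)
      val max = lw reduce (math.max(_, _))
      val rw = lw map (lwi => math.exp(lwi - max))
      val srw = rw reduce (_ + _)
      val l = rw.length
      val z = rw zip xp
      // Poisson resampling: each particle has a Poisson number of offspring
      val rx = z flatMap (p => Vector.fill(Poisson(p._1 * l / srw).draw)(p._2))
      // (log of the mean raw weight, resampled particles)
      (max + math.log(srw / l), rx)
    }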

This is a very simple bootstrap filter, using Poisson resampling for simplicity and data locality, but does include use of the log-sum-exp trick to prevent over/underflow of raw weight calculations, and tracks the marginal (log-)likelihood of the observation. With this function we can now pass in a “prior” particle distribution in any collection conforming to our typeclass, together with a propagator function, an observation (log-)likelihood, and an observation, and it will return back a new collection of particles of exactly the same type that was provided for input. Note that all of the operations we require can be accomplished with the standard monadic collection operations declared in our typeclass.
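The log-sum-exp trick is easy to check in isolation. The following is a small sketch using plain Scala `Vector`s and made-up log-weights (the helper name `logMeanExp` is hypothetical), mirroring the `max`, `rw` and `srw` steps in the function above:

```scala
// Log of the mean of exp(lw), computed stably by factoring out the largest log-weight
def logMeanExp(lw: Vector[Double]): Double = {
  val max = lw reduce (math.max(_, _))
  val srw = lw.map(lwi => math.exp(lwi - max)).sum
  max + math.log(srw / lw.length)
}

// Made-up log-weights, far too small to exponentiate directly
val lw = Vector(-1000.0, -1001.0, -1002.0)

// Naive version: every exp() underflows to zero, so the log is -Infinity
val naive = math.log(lw.map(math.exp).sum / lw.length)

// Stable version: only differences from the max are exponentiated
val stable = logMeanExp(lw)

println(s"naive = $naive, stable = $stable")
```

Because the largest rescaled weight is always exactly 1, the sum of rescaled weights can never underflow to zero.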

Once we have a function for executing one step of a particle filter, we can produce a function for particle filtering as a functional fold over a sequence of observations:

    def pFilter[S: State, O: Observation, C[_]: GenericColl, D[O] <: GenTraversable[O]](
      x0: C[S], data: D[O], dataLik: (S, O) => LogLik, stepFun: S => S
    ): (LogLik, C[S]) = {
      val updater = update[S, O, C](dataLik, stepFun) _
      data.foldLeft((0.0, x0))((prev, o) => {
        val next = updater(prev._2, o)
        (prev._1 + next._1, next._2)
      })
    }

Folding data structures is a fundamental concept in functional programming, and is exactly what is required for any kind of filtering problem. Note that Brian Beckman has recently written a series of articles on Kalman filtering as a functional fold.
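For readers new to folds, the same shape can be seen in a much simpler setting. This is a toy sketch, not a particle filter: the "state" is a running summary of the observations seen so far (the names `Summary` and `step` are hypothetical), and `foldLeft` threads that state through the data one observation at a time.

```scala
// Running state: number of observations seen so far, and their mean
final case class Summary(n: Int, mean: Double)

// One-observation update: combine the previous state with a new observation
def step(s: Summary, o: Double): Summary = {
  val n = s.n + 1
  Summary(n, s.mean + (o - s.mean) / n)
}

val obs = List(1.0, 2.0, 3.0, 4.0)

// Filtering as a fold: start from an initial state and apply step to each observation in turn
val result = obs.foldLeft(Summary(0, 0.0))(step)
println(result)
```

In `pFilter` the state being threaded through is the pair of accumulated log-likelihood and current particle cloud, and the one-observation update is the `update` function.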

So far we haven’t said anything about parameters or parameter estimation, but this is appropriate, since parametrisation is a separate concern from filtering. However, once we have a function for particle filtering, we can produce a function concerned with evaluating marginal likelihoods trivially:

    def pfMll[S: State, P: Parameter, O: Observation, C[_]: GenericColl, D[O] <: GenTraversable[O]](
      simX0: P => C[S], stepFun: P => S => S, dataLik: P => (S, O) => LogLik, data: D[O]
    ): (P => LogLik) = (th: P) =>
      pFilter(simX0(th), data, dataLik(th), stepFun(th))._1

Note that this higher-order function does not return a value, but instead a function which will accept a parameter as input and return a (log-)likelihood as output. This can then be used for parameter estimation purposes, perhaps being used in a PMMH pMCMC algorithm, or something else. Again, this is a separate concern.
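The pattern of fixing the data and getting back a function of the parameter can be sketched without any particle filtering at all. Here is a hypothetical toy version for i.i.d. N(mu, 1) observations, where the log-likelihood is available in closed form (the name `gaussianLogLik` and the data values are made up for illustration):

```scala
// Given data, return a function from the unknown mean to the log-likelihood
def gaussianLogLik(data: Vector[Double]): Double => Double =
  (mu: Double) =>
    data.map(x => -0.5 * math.log(2 * math.Pi) - 0.5 * (x - mu) * (x - mu)).sum

// Fix the data once; what comes back is a function of the parameter alone
val ll: Double => Double = gaussianLogLik(Vector(0.9, 1.1, 1.0))

// The returned function can now be evaluated at any parameter value
println(ll(1.0))
println(ll(0.0))
```

Any code that consumes a function of type `Double => Double`, such as an optimiser or an MCMC kernel, can use `ll` without knowing anything about the data or the model behind it.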

Here I’ll just give a completely trivial toy example, purely to show how the functions work. For avoidance of doubt, *I know* that there are many better/simpler/easier ways to tackle this problem! Here we will just look at inferring the auto-regression parameter of a linear Gaussian AR(1)-plus-noise model using the functions we have developed.

First we can simulate some synthetic data from this model, using a value of 0.8 for the auto-regression parameter:

    val inNoise = Gaussian(0.0, 1.0).sample(99)
    val state = DenseVector(inNoise.scanLeft(0.0)((s, i) => 0.8 * s + i).toArray)
    val noise = DenseVector(Gaussian(0.0, 2.0).sample(100).toArray)
    val data = (state + noise).toArray.toList

Now assuming that we don’t know the auto-regression parameter, we can construct a function to evaluate the likelihood of different parameter values as follows:

    val mll = pfMll(
      (th: Double) => Gaussian(0.0, 10.0).sample(10000).toVector.par,
      (th: Double) => (s: Double) => Gaussian(th * s, 1.0).draw,
      (th: Double) => (s: Double, o: Double) => Gaussian(s, 2.0).logPdf(o),
      data
    )

Note that the 4 characters “`.par`” at the end of line 2 are the only difference between running this code serially or in parallel! Now we can run this code by calling the returned function with different values. So, hopefully `mll(0.8)` will return a larger log-likelihood than (say) `mll(0.6)` or `mll(0.9)`. The example code in the github repo plots the results of calling `mll()` for a range of values (note that if that was the genuine use-case, then it would be much better to parallelise the parameter range than the particle filter, due to providing better parallelisation granularity, but many other examples require parallelisation of the particle filter itself). In this particular example, both the forward model and the likelihood are very cheap operations, so there is little to be gained from parallelisation. Nevertheless, I still get a speedup of more than a factor of two using the parallel version on my laptop.
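As a rough sketch of what parallelising over the parameter range could look like, here is one way of doing it with standard library `Future`s. This is not the code from the repo: `toyMll` is a hypothetical deterministic stand-in for an expensive function such as `mll`, and the grid of values is made up.

```scala
import scala.concurrent.{ Await, Future }
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

// Hypothetical stand-in for an expensive log-likelihood evaluation
def toyMll(th: Double): Double = -(th - 0.8) * (th - 0.8)

// Grid of parameter values: 0.0, 0.1, ..., 1.0
val grid: Vector[Double] = (0 to 10).toVector.map(_ / 10.0)

// One Future per parameter value; traverse gathers the results in grid order
val evaluated: Future[Vector[Double]] =
  Future.traverse(grid)(th => Future(toyMll(th)))
val lls: Vector[Double] = Await.result(evaluated, 30.seconds)

// Pick out the grid value with the largest log-likelihood
val best = grid.zip(lls).reduce((a, b) => if (b._2 > a._2) b else a)._1
println(s"best grid value: $best")
```

Each `Future` here is one complete likelihood evaluation, which is a much coarser unit of work than a single particle update, and that is the sense in which the granularity is better.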

In this post we have shown how typeclasses can be used in Scala to write code that is parallelisation-agnostic. Code written in this way can be run on one or many cores as desired. We’ve illustrated the concept with a scalable particle filter, but nothing about the approach is specific to that application. It would be easy to build up a library of statistical routines this way, all of which can effectively exploit available parallel hardware. Further, although we haven’t demonstrated it here, it is trivial to extend this idea to allow code to be distributed over a cluster of parallel machines if necessary. For example, if an Apache Spark cluster is available, it is easy to make a Spark RDD instance for our generic collection typeclass, that will then allow us to run our (unmodified) particle filter code over a Spark cluster. This emphasises the fact that Spark can be useful for distributing computation as well as just processing “big data”. I’ll say more about Spark in subsequent posts.


In the previous post I gave a quick introduction to some important concepts in functional programming, such as HOFs, closures, currying and partial application, and hopefully gave some insight into why these concepts might be useful in the context of scientific computing. Another concept that is very important in modern functional programming is that of the monad. Monads are one of those concepts that turns out to be very simple and intuitive once you “get it”, but completely impenetrable until you do! Now, there are zillions of monad tutorials out there, and I don’t think that I have anything particularly insightful to add to the discussion. That said, most of the tutorials focus on problems and examples that are some way removed from the interests of statisticians and scientific programmers. So in this post I want to try and give a very informal and intuitive introduction to the monad concept in a way that I hope will resonate with people from a more scientific computing background.

The term “monad” is borrowed from that of the corresponding concept in category theory. The connection between functional programming and category theory is strong and deep. I intend to explore this more in future posts, but for this post the connection is not important and no knowledge of category theory is assumed (or imparted!).

All of the code used in this post is contained in the first-monads directory of my blog repo. The best way to follow this post is to copy-and-paste commands one-at-a-time from this post to a Scala REPL or `sbt console`. Note that only the numerical linear algebra examples later in this post require any non-standard dependencies.

The `map` method is one of the first concepts one meets when beginning functional programming. It is a higher order method on many (immutable) *collection* and other *container* types. Let’s start by looking at how `map` operates on `Lists`.

    val x = (0 to 4).toList
    // x: List[Int] = List(0, 1, 2, 3, 4)
    val x2 = x map { x => x * 3 }
    // x2: List[Int] = List(0, 3, 6, 9, 12)
    val x3 = x map { _ * 3 }
    // x3: List[Int] = List(0, 3, 6, 9, 12)
    val x4 = x map { _ * 0.1 }
    // x4: List[Double] = List(0.0, 0.1, 0.2, 0.30000000000000004, 0.4)

The last example shows that a `List[T]` can be converted to a `List[S]` if `map` is passed a function of type `T => S`. Of course there’s nothing particularly special about `List` here. It works with other collection types in the same way, as the following example with (immutable) `Vector` illustrates:

    val xv = x.toVector
    // xv: Vector[Int] = Vector(0, 1, 2, 3, 4)
    val xv2 = xv map { _ * 0.2 }
    // xv2: scala.collection.immutable.Vector[Double] = Vector(0.0, 0.2, 0.4, 0.6000000000000001, 0.8)
    val xv3 = for (xi <- xv) yield (xi * 0.2)
    // xv3: scala.collection.immutable.Vector[Double] = Vector(0.0, 0.2, 0.4, 0.6000000000000001, 0.8)

Note here that the `for` comprehension generating `xv3` is exactly equivalent to the `map` call generating `xv2` – the for-comprehension is just *syntactic sugar* for the `map` call. The benefit of this syntax will become apparent in the more complex examples we consider later.

Many collection and other container types have a `map` method that behaves this way. Any parametrised type that does have a `map` method like this is known as a Functor. Again, the name is due to category theory, but that doesn’t matter for this post. From a Scala-programmer perspective, a functor can be thought of as a trait, in pseudo-code as

    trait F[T] {
      def map[S](f: T => S): F[S]
    }

with `F` representing the functor. In fact it turns out to be better to think of a functor as a type class, but that is yet another topic for a future post… Also note that to be a functor in the strict sense (from a category theory perspective), the map method must behave sensibly – that is, it must satisfy the functor laws. But again, I’m keeping things informal and intuitive for this post – there are plenty of other monad tutorials which emphasise the category theory connections.
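For the curious, the functor laws say that mapping the identity function must change nothing, and that mapping two functions in succession must give the same result as mapping their composition. The following sketch checks both for `List` on one small example (the functions `f` and `g` are arbitrary choices of mine), which illustrates the laws but obviously doesn't prove them.

```scala
val xs = List(0, 1, 2, 3, 4)
val f = (i: Int) => i * 3
val g = (i: Int) => i + 0.5

// Identity law: mapping the identity function gives back an equal list
val identityHolds = xs.map(x => x) == xs

// Composition law: mapping f then g is the same as mapping (f andThen g) once
val compositionHolds = xs.map(f).map(g) == xs.map(f andThen g)

println(s"identity: $identityHolds, composition: $compositionHolds")
```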

Once we can map functions over elements of containers, we soon start mapping functions which themselves return values of the container type. eg. we can map a function returning a `List` over the elements of a `List`, as illustrated below.

val x5 = x map { x => List(x - 0.1, x + 0.1) } // x5: List[List[Double]] = List(List(-0.1, 0.1), List(0.9, 1.1), List(1.9, 2.1), List(2.9, 3.1), List(3.9, 4.1))

Clearly this returns a list-of-lists. Sometimes this is what we want, but very often we actually want to *flatten* down to a single list so that, for example, we can subsequently map over all of the elements of the base type with a single map. We could take the list-of-lists and then flatten it, but this pattern is so common that the act of mapping and then flattening is often considered to be a basic operation, often known in Scala as `flatMap`. So for our toy example, we could carry out the `flatMap` as follows.

val x6 = x flatMap { x => List(x - 0.1, x + 0.1) } // x6: List[Double] = List(-0.1, 0.1, 0.9, 1.1, 1.9, 2.1, 2.9, 3.1, 3.9, 4.1)
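To make the relationship explicit, here is a small sketch showing that `flatMap` gives the same result as a `map` followed by `flatten`. The names `base` and `step` are illustrative only, and integers are used to keep the output free of floating point noise.

```scala
val base = List(0, 1, 2)
val step: Int => List[Int] = v => List(v - 1, v + 1)

val nested = base.map(step)       // List(List(-1, 1), List(0, 2), List(1, 3))
val flattened = nested.flatten    // List(-1, 1, 0, 2, 1, 3)
val direct = base.flatMap(step)   // List(-1, 1, 0, 2, 1, 3)
```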

The ubiquity of this pattern becomes more apparent when we start thinking about iterating over multiple collections. For example, suppose now that we have two lists, `x` and `y`, and that we want to iterate over all pairs of elements consisting of one element from each list.

    val y = (0 to 12 by 2).toList
    // y: List[Int] = List(0, 2, 4, 6, 8, 10, 12)
    val xy = x flatMap { xi => y map { yi => xi * yi } }
    // xy: List[Int] = List(0, 0, 0, 0, 0, 0, 0, 0, 2, 4, 6, 8, 10, 12, 0, 4, 8, 12, 16, 20, 24, 0, 6, 12, 18, 24, 30, 36, 0, 8, 16, 24, 32, 40, 48)

This pattern of having one or more nested `flatMaps` followed by a final `map` in order to iterate over multiple collections is very common. It is exactly this pattern that the for-comprehension is syntactic sugar for. So we can re-write the above using a for-comprehension

    val xy2 = for {
      xi <- x
      yi <- y
    } yield (xi * yi)
    // xy2: List[Int] = List(0, 0, 0, 0, 0, 0, 0, 0, 2, 4, 6, 8, 10, 12, 0, 4, 8, 12, 16, 20, 24, 0, 6, 12, 18, 24, 30, 36, 0, 8, 16, 24, 32, 40, 48)

This for-comprehension (usually called a for-expression in Scala) has an intuitive syntax reminiscent of the kind of thing one might write in an imperative language. But it is important to remember that `<-` is not actually an imperative assignment. The for-comprehension really does expand to the pure-functional nested `flatMap` and `map` call given above.
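One detail not covered above is that a for-comprehension may also contain a guard (an `if` clause), in which case the expansion includes a call to `withFilter` as well. The sketch below, using illustrative lists `us` and `vs`, writes the same computation both ways so that the correspondence can be checked directly.

```scala
val us = List(0, 1, 2)
val vs = List(0, 2, 4)

// For-comprehension with a guard
val viaFor = for {
  u <- us
  v <- vs
  if (u + v) % 2 == 0
} yield (u, v)

// The same computation written with explicit method calls
val viaMethods = us.flatMap { u =>
  vs.withFilter(v => (u + v) % 2 == 0).map(v => (u, v))
}
```

Since every element of `vs` is even, only pairs with an even `u` survive, and the two results are equal.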

Recalling that a functor is a parameterised type with a `map` method, we can now say that a *monad* is just a functor which also has a `flatMap` method. We can write this in pseudo-code as

    trait M[T] {
      def map[S](f: T => S): M[S]
      def flatMap[S](f: T => M[S]): M[S]
    }

Not all functors can have a flattening operation, so not all functors are monads, but all monads are functors. Monads are therefore more powerful than functors. Of course, more power is not always good. The principle of least power is one of the main principles of functional programming, but monads are useful for sequencing dependent computations, as illustrated by for-comprehensions. In fact, since for-comprehensions de-sugar to calls to map and flatMap, monads are precisely what are required in order to be usable in for-comprehensions. Collections supporting map and flatMap are referred to as *monadic*. Most Scala collections are monadic, and operating on them using map and flatMap operations, or using for-comprehensions is referred to as *monadic-style*. People will often refer to the monadic nature of a collection (or other container) using the word monad, eg. the “List monad”.
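To see that `map` and `flatMap` really are all that a for-comprehension needs, here is a minimal sketch of a user-defined wrapper. `Box` is a hypothetical type invented for this illustration (it is not part of the Scala standard library), and it does nothing except hold a single value.

```scala
// Hypothetical single-value wrapper, defined only for illustration
final case class Box[T](value: T) {
  def map[S](f: T => S): Box[S] = Box(f(value))
  def flatMap[S](f: T => Box[S]): Box[S] = f(value)
}

// Usable in a for-comprehension because it has map and flatMap
val boxed = for { a <- Box(3); b <- Box(4) } yield a * b

// The explicit expansion of the for-comprehension above
val boxedExplicit = Box(3).flatMap(a => Box(4).map(b => a * b))
```

Both expressions give `Box(12)`.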

So far the functors and monads we have been working with have been collections, but not all monads are collections, and in fact collections are in some ways atypical examples of monads. Many monads are *containers* or *wrappers*, so it will be useful to see examples of monads which are not collections.

One of the first monads that many people encounter is the `Option` monad (referred to as the Maybe monad in Haskell, and Optional in Java 8). You can think of it as being a strange kind of “collection” that can contain at most one element. So it will either contain an element or it won’t, and so can be used to wrap the result of a computation which might fail. If the computation succeeds, the value computed can be wrapped in the Option (using the type `Some`), and if it fails, it will not contain a value of the required type, but simply be the value `None`. It provides a referentially transparent and type-safe alternative to raising exceptions or returning NULL references. We can transform Options using `map`.

    val three = Option(3)
    // three: Option[Int] = Some(3)
    val twelve = three map (_ * 4)
    // twelve: Option[Int] = Some(12)

But when we start combining the results of multiple computations that could fail, we run into exactly the same issues as before.

    val four = Option(4)
    // four: Option[Int] = Some(4)
    val twelveB = three map (i => four map (i * _))
    // twelveB: Option[Option[Int]] = Some(Some(12))

Here we have ended up with an Option wrapped in another Option, which is not what we want. But we now know the solution, which is to replace the first `map` with `flatMap`, or better still, use a for-comprehension.

    val twelveC = three flatMap (i => four map (i * _))
    // twelveC: Option[Int] = Some(12)
    val twelveD = for {
      i <- three
      j <- four
    } yield (i * j)
    // twelveD: Option[Int] = Some(12)

Again, the for-comprehension is a little bit easier to understand than the chaining of calls to `flatMap` and `map`. Note that in the for-comprehension we don’t worry about whether or not the Options actually contain values – we just concentrate on the “happy path”, where they both do, safe in the knowledge that the Option monad will take care of the failure cases for us. Two of the possible failure cases are illustrated below.

    val oops: Option[Int] = None
    // oops: Option[Int] = None
    val oopsB = for {
      i <- three
      j <- oops
    } yield (i * j)
    // oopsB: Option[Int] = None
    val oopsC = for {
      i <- oops
      j <- four
    } yield (i * j)
    // oopsC: Option[Int] = None

This is a typical benefit of code written in a monadic style. We chain together multiple computations thinking only about the canonical case and trusting the monad to take care of any additional computational context for us.

Those with a background in scientific computing are probably already familiar with the NaN value in IEEE floating point. In many regards, this value and the rules around its behaviour mean that `Float` and `Double` types in IEEE compliant languages behave as an Option monad with `NaN` as the None value. This is simply illustrated below.

    val nan = Double.NaN
    3.0 * 4.0
    // res0: Double = 12.0
    3.0 * nan
    // res1: Double = NaN
    nan * 4.0
    // res2: Double = NaN

The `NaN` value arises naturally when computations fail. eg.

val nanB = 0.0 / 0.0 // nanB: Double = NaN

This Option-like behaviour of `Float` and `Double` means that it is quite rare to see examples of `Option[Float]` or `Option[Double]` in Scala code. But there are some disadvantages of the IEEE approach, as discussed elsewhere. Also note that this only works for Floats and Doubles, and not for other types, including, say, `Int`.

val nanC=0/0 // This raises a runtime exception!
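For types like `Int` that have no built-in "missing" value, the failure can be captured in an `Option` instead. The following is a minimal sketch using `scala.util.Try` from the standard library; the helper name `safeDiv` is an assumption made for this example.

```scala
import scala.util.Try

// Hypothetical helper: integer division that returns None instead of throwing
def safeDiv(a: Int, b: Int): Option[Int] = Try(a / b).toOption

val okDiv = safeDiv(12, 4)    // Some(3)
val badDiv = safeDiv(1, 0)    // None

// A failure anywhere in the chain gives None for the whole computation
val chained = for {
  q <- safeDiv(12, 4)
  r <- safeDiv(q, 0)
} yield q + r                 // None
```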

Good practical examples of scientific computations which can fail crop up frequently in numerical linear algebra, so it’s useful to see how `Option` can simplify code in that context. Note that the code in this section requires the Breeze library, so should be run from an sbt console using the sbt build file associated with this post.

In statistical applications, one often needs to compute the Cholesky factorisation of a square symmetric matrix. This operation is built into Breeze as the function `cholesky`. However the factorisation will fail if the matrix provided is not positive definite, and in this case the `cholesky` function will throw a runtime exception. We will use the built-in `cholesky` function to create our own function, `safeChol` (using a monad called `Try` which is closely related to the Option monad) returning an Option of a matrix rather than a matrix. This function will not throw an exception, but instead return `None` in the case of failure, as illustrated below.

    import breeze.linalg._
    def safeChol(m: DenseMatrix[Double]): Option[DenseMatrix[Double]] = scala.util.Try(cholesky(m)).toOption
    val m = DenseMatrix((2.0, 1.0), (1.0, 3.0))
    val c = safeChol(m)
    // c: Option[breeze.linalg.DenseMatrix[Double]] =
    // Some(1.4142135623730951  0.0
    // 0.7071067811865475  1.5811388300841898  )
    val m2 = DenseMatrix((1.0, 2.0), (2.0, 3.0))
    val c2 = safeChol(m2)
    // c2: Option[breeze.linalg.DenseMatrix[Double]] = None

A Cholesky factorisation is often followed by a forward or backward solve. This operation may also fail, independently of whether the Cholesky factorisation fails. There doesn’t seem to be a forward solve function directly exposed in the Breeze API, but we can easily define one, which I call `dangerousForwardSolve`, as it will throw an exception if it fails, just like the `cholesky` function. But just as for the `cholesky` function, we can wrap up the dangerous function into a safe one that returns an Option.

    import com.github.fommil.netlib.BLAS.{getInstance => blas}
    def dangerousForwardSolve(A: DenseMatrix[Double], y: DenseVector[Double]): DenseVector[Double] = {
      val yc = y.copy
      blas.dtrsv("L", "N", "N", A.cols, A.toArray, A.rows, yc.data, 1)
      yc
    }
    def safeForwardSolve(A: DenseMatrix[Double], y: DenseVector[Double]): Option[DenseVector[Double]] = scala.util.Try(dangerousForwardSolve(A, y)).toOption

Now we can write a very simple function which chains these two operations together, as follows.

    def safeStd(A: DenseMatrix[Double], y: DenseVector[Double]): Option[DenseVector[Double]] = for {
      L <- safeChol(A)
      z <- safeForwardSolve(L, y)
    } yield z

    safeStd(m,DenseVector(1.0,2.0))
    // res15: Option[breeze.linalg.DenseVector[Double]] = Some(DenseVector(0.7071067811865475, 0.9486832980505138))

Note how clean and simple this function is, concentrating purely on the “happy path” where both operations succeed and letting the Option monad worry about the three different cases where at least one of the operations fails.
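Converting to `Option` discards the reason for a failure. If that reason matters, the same happy-path style works directly on `Try`, which keeps the exception. The sketch below uses only the standard library so that it runs without Breeze; the helpers `parseDouble`, `safeLog` and `logOfString` are hypothetical stand-ins for the two chained operations above, not part of any library.

```scala
import scala.util.{Try, Success, Failure}

// Hypothetical first step: may fail with a NumberFormatException
def parseDouble(s: String): Try[Double] = Try(s.trim.toDouble)

// Hypothetical second step: fails explicitly for non-positive input
def safeLog(v: Double): Try[Double] =
  if (v > 0.0) Success(math.log(v))
  else Failure(new ArithmeticException(s"log of non-positive value: $v"))

// Chain the two steps, thinking only about the case where both succeed
def logOfString(s: String): Try[Double] = for {
  v <- parseDouble(s)
  l <- safeLog(v)
} yield l

val good = logOfString("1.0")       // Success(0.0)
val badParse = logOfString("abc")   // Failure wrapping a NumberFormatException
val badLog = logOfString("-2.0")    // Failure wrapping an ArithmeticException
```

Calling `.toOption` on any of these results recovers the `Option` behaviour used above.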

Let’s finish with a monad for parallel and asynchronous computation, the `Future` monad. The Future monad is used for wrapping up slow computations and dispatching them to another thread for completion. The call to `Future` returns immediately, allowing the calling thread to continue while the additional thread processes the slow work. So at that stage, the Future will not have completed, and will not contain a value, but at some (unpredictable) time in the future it (hopefully) will (hence the name). In the following code snippet I construct two Futures that will each take at least 10 seconds to complete. On the main thread I then use a for-comprehension to chain the two computations together. Again, this returns immediately, giving another Future that at some point in the future will contain the result of the derived computation. Then, purely for illustration, I force the main thread to stop and wait for that third future (`f3`) to complete, printing the result to the console.

    import scala.concurrent.duration._
    import scala.concurrent.{Future,ExecutionContext,Await}
    import ExecutionContext.Implicits.global
    val f1=Future{
      Thread.sleep(10000)
      1 }
    val f2=Future{
      Thread.sleep(10000)
      2 }
    val f3=for {
      v1 <- f1
      v2 <- f2
    } yield (v1+v2)
    println(Await.result(f3,30.second))

When you paste this into your console you should observe that you get the result in about 10 seconds rather than 20, as `f1` and `f2` execute in parallel on separate threads. So the Future monad is one (of many) ways to get started with parallel and async programming in Scala.
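The parallelism here depends on both Futures being created before the for-comprehension runs. If a Future is instead created inside the for-comprehension, the second computation is not started until the first has completed. The following sketch contrasts the two cases using shorter sleeps. The object `FutureOrdering` and its methods are hypothetical names for this illustration, and the code is wrapped in an object so that it behaves the same whether compiled or pasted into a console.

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

object FutureOrdering {
  // Hypothetical stand-in for a slow computation (half a second)
  def slow(v: Int): Future[Int] = Future { Thread.sleep(500); v }

  // Both Futures are created, and so started, before the for-comprehension
  def eager(): Int = {
    val fa = slow(1)
    val fb = slow(2)
    val sum = for { a <- fa; b <- fb } yield a + b
    Await.result(sum, 5.seconds)
  }

  // The second Future is only created once the first has completed
  def sequential(): Int = {
    val sum = for { a <- slow(1); b <- slow(2) } yield a + b
    Await.result(sum, 5.seconds)
  }

  // Wall-clock time of a block, in milliseconds
  def timeMs[A](block: => A): (A, Long) = {
    val t0 = System.nanoTime()
    val result = block
    (result, (System.nanoTime() - t0) / 1000000L)
  }
}
```

Both `FutureOrdering.eager()` and `FutureOrdering.sequential()` return `3`. Timing them with `FutureOrdering.timeMs` should show roughly one second for the sequential version, and roughly half a second for the eager version provided the execution context has at least two threads available.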

In this post I’ve tried to give a quick informal introduction to the monad concept, and tried to use examples that will make sense to those interested in scientific and statistical computing. There’s loads more to say about monads, and there are many more commonly encountered useful monads that haven’t been covered in this post. I’ve skipped over lots of details, especially those relating to the formal definitions of functors and monads, including the laws that map and flatMap must satisfy and why. But those kinds of details can be easily picked up from other monad tutorials. Anyone interested in pursuing the formal connections may be interested in a page of links I’m collating on category theory for FP. In particular, I quite like the series of blog posts on category theory for programmers. As I’ve mentioned in previous posts, I also really like the book Functional Programming in Scala, which I strongly recommend to anyone who wants to improve their Scala code. In a subsequent post I’ll explain how monadic style is relevant to issues relating to the statistical analysis of big data, as exemplified in Apache Spark. It’s probably also worth mentioning that there is another kind of functor that turns out to be exceptionally useful in functional programming: the applicative functor. This is more powerful than a basic functor, but less powerful than a monad. It turns out to be useful for computations which need to be sequenced but are not sequentially dependent (context-free rather than context-sensitive), and is a little bit more general and flexible than a monad in cases where it is appropriate.


Functional programming (FP) is a programming style that emphasises the use of referentially transparent pure functions and immutable data structures. Higher order functions (HOFs) tend to be used extensively to enable a clean functional programming style. A HOF is just a function that either takes a function as an argument or returns a function. For example, the default `List` type in Scala is immutable. So, if one defines a list via

val l1 = List(1,2,3)

we add a new value to the front of the list by creating a new list from the old list and leaving the old list unchanged:

val l2 = 4 :: l1 // List(4, 1, 2, 3)

We can create a new list the same length as an existing list by applying the same function to each element of the list in turn using `map`:

val l3 = l2 map { x => x*x } // List(16, 1, 4, 9)

We could write this slightly differently as

val l4 = l2.map(x => x*x)

which makes it clearer that `map` is a higher order function on lists. In fact, the presence of a `map` method on `List[_]` makes it a functor, but that is a topic for another post.
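`map` is a HOF that takes a function as an argument. To illustrate the other half of the definition, here is a minimal sketch of a HOF that both takes and returns a function. The name `twice` is made up for this example.

```scala
// Takes a function and returns a new function that applies it twice
def twice[A](f: A => A): A => A = f andThen f

val addOne: Int => Int = _ + 1
val addTwo: Int => Int = twice(addOne)

// The returned function can itself be passed to another HOF such as map
val fourthPowers = List(4, 1, 2, 3).map(twice((x: Int) => x * x))
// List(256, 1, 16, 81)
```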

HOFs are ubiquitous in FP, and very powerful. But there are a few techniques for working with functions in Scala (and other FP languages) which make creating and using HOFs more convenient.

There are many, many reasons for using functions and HOFs in scientific and statistical computing (optimising, integrating, differentiating, or sampling, to name just a few). But the basic idea can be illustrated simply by considering the problem of plotting a function of one scalar variable.

All of the code associated with this post is available in the curry directory of my blog repo. Full instructions for running the code are included in the README. The code includes a simple short method, `plotFun` which uses breeze to produce a simple plot of a user supplied function. For example:

    import Currying._
    plotFun(x => x*x)

produces the plot:

We can use this method to plot other functions, for example:

    def myQuad1(x: Double): Double = x*x - 2*x + 1
    plotFun(myQuad1)
    def myQuad2(x: Double): Double = x*x - 3*x - 1
    plotFun(myQuad2)

Now technically, `myQuad1` and `myQuad2` are *methods* rather than *functions*. The distinction is a bit subtle, and they can often be used interchangeably, but the difference does sometimes matter, so it is good to understand it. To actually define a function and plot it, we could instead use code like:

    val myQuad3: (Double => Double) = x => -x*x + 2
    plotFun(myQuad3)

Note that here `myQuad3` is a *value* whose *type* is a *function* mapping a `Double` to a `Double`. This is a proper function. This style of function declaration should make sense to people coming from other functional languages such as Haskell, but is potentially confusing to those coming from O-O languages such as Java. Note that it is easy to convert a method to a function using an underscore, so that, for example, `myQuad2 _` will give the function corresponding to `myQuad2`. Note that there is no corresponding way to get a method from a function, so that is one reason for preferring method declaration syntax (and there are others, such as the ability to parametrise method declarations with generic types).
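The practical difference can be seen in a short sketch. A function value is an object with methods of its own, such as `andThen`, whereas a method must first be converted to a function before it can be treated as a value. The names `cube`, `cubeF` and `cubePlusOne` are illustrative only.

```scala
// A method
def cube(x: Double): Double = x * x * x

// A function value obtained from the method using an underscore
val cubeF: Double => Double = cube _

// Function values have methods of their own, such as andThen
val cubePlusOne: Double => Double = cubeF andThen (_ + 1.0)

// Where a function is expected, the compiler converts the method automatically
val cubes = List(1.0, 2.0).map(cube)   // List(1.0, 8.0)
```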

Now, rather than defining lots of specific instances of quadratic functions from scratch, it would make more sense to define a generic quadratic function and then just plot particular instances of this generic quadratic. It is simple enough to define a generic quadratic with:

def quadratic(a: Double, b: Double, c: Double, x: Double): Double = a*x*x + b*x + c

But we clearly can’t pass that in to the plotting function directly, as it has the wrong type signature (not `Double => Double`), and the specific values of `a`, `b` and `c` need to be given. This issue crops up a lot in scientific and statistical computing – there is a function which has some additional parameters which need to be fixed before the function can actually be used. This is referred to as the “function environment problem” by Oliveira and Stewart (section 8.5). Fortunately, in functional languages it’s easy enough to use this function to create a new “partially specified” function and pass that in. For example, we could just do

plotFun(x => quadratic(3,2,1,x))

We can define another function, `quadFun`, which allows us to construct these partially applied function closures, and use it as follows:

    def quadFun(a: Double, b: Double, c: Double): Double => Double = x => quadratic(a,b,c,x)
    val myQuad4 = quadFun(2,1,3)
    plotFun(myQuad4)
    plotFun(quadFun(1,2,3))

Here, `quadFun` is a HOF in the sense that it returns a function closure corresponding to the partially applied `quadratic` function. The returned function has the type `Double => Double`, so we can use it wherever a function with this signature is expected. Note that the function carries around with it its lexical “environment”, specifically, the values of `a`, `b` and `c` specified at the time `quadFun` was called. This style of constructing closures works in most lexically scoped languages which have functions as first class objects. I use this style of programming a lot in several different languages. In particular, I’ve written previously about lexical scope and function closures in R.
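A typical statistical instance of the function environment problem is a log-likelihood, where the data are fixed and only the parameter varies. The following is a minimal sketch under that assumption. The function `gaussLogLik` and the observations in `obs` are made up for illustration and do not come from the code associated with this post.

```scala
// Returns a closure over `data` and `sigma`:
// the Gaussian log-likelihood as a function of the mean only
def gaussLogLik(data: Seq[Double], sigma: Double): Double => Double = {
  val logNorm = -0.5 * math.log(2.0 * math.Pi * sigma * sigma)
  mu => data.map(x => logNorm - (x - mu) * (x - mu) / (2.0 * sigma * sigma)).sum
}

// Made-up observations, for illustration only
val obs = Seq(1.2, 0.8, 1.1, 0.9)

// A Double => Double that carries the data and sigma around with it
val ll: Double => Double = gaussLogLik(obs, 1.0)
```

The resulting `ll` has exactly the signature that a plotting or optimisation routine for functions of one variable would expect, and it takes larger values near the sample mean than away from it.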

Again, the intention is perhaps slightly more explicit if we re-write the above using function syntax as:

    val quadFunF: (Double,Double,Double) => Double => Double = (a,b,c) => x => quadratic(a,b,c,x)
    val myQuad5 = quadFunF(-1,1,2)
    plotFun(myQuad5)
    plotFun(quadFunF(1,-2,3))

Now, this concept of partial application is so prevalent in FP that some languages have special syntactic support for it. In Scala, we can partially apply using an underscore to represent unapplied parameters as:

    val myQuad6 = quadratic(1,2,3,_: Double)
    plotFun(myQuad6)

In Scala we can also directly write our functions in curried form, with parameters (or parameter lists) ordered as they are to be applied. So, for this example, we could define (partially) curried `quad` and use it with:

    def quad(a: Double, b: Double, c: Double)(x: Double): Double = a*x*x + b*x + c
    plotFun(quad(1,2,-3))
    val myQuad7 = quad(1,0,1) _
    plotFun(myQuad7)

Note the use of an underscore to convert a partially applied method to a function. Also note that every function has a method `curried` which turns an uncurried function into a (fully) curried function. So in the case of our quadratic function, the fully curried version will be a chain of four functions.

    def quadCurried = (quadratic _).curried
    plotFun(quadCurried(1)(2)(3))

Again, note the strategic use of an underscore. The underscore isn’t necessary if we have a true function to start with, as the following illustrates:

    val quadraticF: (Double,Double,Double,Double) => Double = (a,b,c,x) => a*x*x + b*x + c
    def quadCurried2 = quadraticF.curried
    plotFun(quadCurried2(-1)(2)(3))
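Writing out the types makes the "chain of four functions" explicit. The sketch below is self-contained and does not need `plotFun`. It redefines the quadratic under the hypothetical name `quadratic4` so as not to clash with the definitions above.

```scala
def quadratic4(a: Double, b: Double, c: Double, x: Double): Double = a*x*x + b*x + c

// Fully curried: each application fixes one parameter and returns another function
val curried4: Double => Double => Double => Double => Double = (quadratic4 _).curried

// Fixing the parameters one at a time
val fixedA: Double => Double => Double => Double = curried4(1.0)
val fixedAB: Double => Double => Double = fixedA(2.0)
val fixedABC: Double => Double = fixedAB(3.0)

val atTwo = fixedABC(2.0)   // 1*2*2 + 2*2 + 3 = 11.0
```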

Working with functions, closures, HOFs and partial application is fundamental to effective functional programming style. Currying functions is one approach to handling the function environment problem, and is the standard approach in languages such as Haskell. However, in Scala there are other possible approaches, such as using underscores for partial application. The preferred approach will depend on the context. Currying is often used for HOFs accepting a function as argument (as it can help with type inference), and also in conjunction with *implicits* (beyond the scope of this post – pun intended). In other contexts partial application using underscores seems to be more commonly used.
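As a sketch of the curried HOF style mentioned above, consider a simple numerical integrator with the function to be integrated placed in its own final parameter list. The function `integrate` is a hypothetical example written for this illustration (a basic midpoint rule), not something taken from a library.

```scala
// Midpoint-rule integration of f over [a, b] using n sub-intervals.
// The function argument goes in its own final parameter list.
def integrate(a: Double, b: Double, n: Int)(f: Double => Double): Double = {
  val h = (b - a) / n
  (0 until n).map(i => f(a + (i + 0.5) * h)).sum * h
}

// The separate parameter list allows block syntax for the function argument
val area = integrate(0.0, 1.0, 1000) { x => x * x }   // close to 1/3

// Fixing the first parameter list gives a reusable integrator on [0, 1]
val onUnit: (Double => Double) => Double = integrate(0.0, 1.0, 1000) _
```

With this arrangement the range and resolution can be fixed once, and the resulting `onUnit` then applied to many different integrands.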

- Oliveira, S., Stewart, D. E. (2006) Writing scientific software, CUP.
- Horstmann, C. S. (2012) Scala for the impatient, Addison Wesley.
- Chiusano, P., Bjarnason, R. (2014) Functional programming in Scala, Manning.


To statisticians and data scientists used to working in R, the concept of a data frame is one of the most natural and basic starting points for statistical computing and data analysis. It always surprises me that data frames aren’t a core concept in most programming languages’ standard libraries, since they are essentially a representation of a relational database table, and relational databases are pretty ubiquitous in data processing and related computing. For statistical modelling and data science, having functions designed for data frames is much more elegant than using functions designed to work directly on vectors and matrices, for example. Trivial things like being able to refer to columns by a readable name rather than a numeric index makes a huge difference, before we even get into issues like columns of heterogeneous types, coherent handling of missing data, etc. This is why modelling in R is typically nicer than in certain other languages I could mention, where libraries for scientific and numerical computing existed for a long time before libraries for data frames were added to the language ecosystem.

To build good libraries for statistical computing in Scala, it will be helpful to build those libraries using a good data frame implementation. With that in mind I’ve started to look for existing Scala data frame libraries and to compare them.

For this post I’m going to consider a very simple data manipulation task: first reading in a CSV file from disk into a data frame object, then filtering out some rows, then adding a derived column, then finally writing the data frame back to disk as a CSV file. We will start by looking at how this would be done in R. First we need an example CSV file. Since many R packages contain example datasets, we will use one of those. We will export `Cars93` from the MASS package:

    library(MASS)
    write.csv(Cars93,"cars93.csv",row.names=FALSE)

If MASS isn’t installed, it can be installed with a simple `install.packages("MASS")`. The above code snippet generates a CSV file to be used for the example. Typing `?Cars93` will give some information about the dataset, including the original source.

Our analysis task is going to be to load the file from disk, filter out cars with `EngineSize` larger than 4 (litres), add a new column to the data frame, `WeightKG`, containing the weight of the car in KG, derived from the column `Weight` (in pounds), and then write back to disk in CSV format. This is the kind of thing that R excels at (pun intended):

    df=read.csv("cars93.csv")
    print(dim(df))
    df = df[df$EngineSize<=4.0,]
    print(dim(df))
    df$WeightKG = df$Weight*0.453592
    print(dim(df))
    write.csv(df,"cars93m.csv",row.names=FALSE)
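For reference, the filter-and-derive logic of this task can be sketched with nothing but the Scala standard collections, representing each row as a `Map` from column name to string value. This is only a rough illustration of the steps involved, not a substitute for a data frame. The three rows in `csvText` are made-up values rather than real entries from `Cars93`, and the naive comma splitting would not cope with quoted fields.

```scala
// Made-up three-row extract with the same column names as the real file
val csvText =
  """Model,EngineSize,Weight
    |A,1.8,2705
    |B,4.6,3935
    |C,3.2,3560""".stripMargin

val csvLines = csvText.split("\n").map(_.trim).toList
val header = csvLines.head.split(",").toList

// Each row becomes a Map from column name to (string) value
val rows = csvLines.tail.map(l => header.zip(l.split(",").toList).toMap)

// Keep rows with EngineSize of at most 4 litres
val kept = rows.filter(r => r("EngineSize").toDouble <= 4.0)

// Add the derived WeightKG column (pounds to kilograms)
val derived = kept.map(r => r + ("WeightKG" -> (r("Weight").toDouble * 0.453592).toString))
```

Everything here is stringly typed, and column types have to be converted by hand at each use, which is exactly the kind of bookkeeping that a good data frame library is supposed to take care of.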

Now let’s see how a similar task could be accomplished using Scala data frames.

Saddle is probably the best known data frame library for Scala. It is strongly influenced by the pandas library for Python. A simple Saddle session for accomplishing this task might proceed as follows:

    val file = CsvFile("cars93.csv")
    val df = CsvParser.parse(file).withColIndex(0)
    println(df)
    val df2 = df.rfilter(_("EngineSize").
      mapValues(CsvParser.parseDouble).at(0)<=4.0)
    println(df2)
    val wkg=df2.col("Weight").mapValues(CsvParser.parseDouble).
      mapValues(_*0.453592).setColIndex(Index("WeightKG"))
    val df3=df2.joinPreserveColIx(wkg.mapValues(_.toString))
    println(df3)
    df3.writeCsvFile("saddle-out.csv")

Although this looks OK, it’s not completely satisfactory, as the data frame is actually representing a matrix of `Strings`. Although you can have a data frame containing columns of any type, since Saddle data frames are backed by a matrix object (with type corresponding to the common super-type), the handling of columns of heterogeneous types always seems rather cumbersome. I suspect that it is this clumsy handling of heterogeneously typed columns that has motivated the development of alternative data frame libraries for Scala.

Scala-datatable is a lightweight minimal immutable data table for Scala, with good support for columns of differing types. However, it is currently really very minimal, and doesn’t have CSV import or export, for example. That said, there are several CSV libraries for Scala, so it’s quite easy to write functions to import from CSV into a datatable and write CSV back out from one. I’ve a couple of example functions, `readCsv()` and `writeCsv()` in the full code examples associated with this post. Now since datatable supports heterogeneous column types and I don’t want to write a type guesser, my `readCsv()` function expects information regarding the column types. This could be relaxed with a bit of effort. An example session follows:

    val colTypes=Map("DriveTrain" -> StringCol,
      "Min.Price" -> Double,
      "Cylinders" -> Int,
      "Horsepower" -> Int,
      "Length" -> Int,
      "Make" -> StringCol,
      "Passengers" -> Int,
      "Width" -> Int,
      "Fuel.tank.capacity" -> Double,
      "Origin" -> StringCol,
      "Wheelbase" -> Int,
      "Price" -> Double,
      "Luggage.room" -> Double,
      "Weight" -> Int,
      "Model" -> StringCol,
      "Max.Price" -> Double,
      "Manufacturer" -> StringCol,
      "EngineSize" -> Double,
      "AirBags" -> StringCol,
      "Man.trans.avail" -> StringCol,
      "Rear.seat.room" -> Double,
      "RPM" -> Int,
      "Turn.circle" -> Double,
      "MPG.highway" -> Int,
      "MPG.city" -> Int,
      "Rev.per.mile" -> Int,
      "Type" -> StringCol)
    val df=readCsv("Cars93",new FileReader("cars93.csv"),colTypes)
    println(df.length,df.columns.length)
    val df2=df.filter(row=>row.as[Double]("EngineSize")<=4.0).toDataTable
    println(df2.length,df2.columns.length)
    val oldCol=df2.columns("Weight").as[Int]
    val newCol=new DataColumn[Double]("WeightKG",oldCol.data.map{_.toDouble*0.453592})
    val df3=df2.columns.add(newCol).get
    println(df3.length,df3.columns.length)
    writeCsv(df3,new File("out.csv"))

Apart from the declaration of column types, the code is actually a little bit cleaner than the corresponding Saddle code, and the column types are all properly preserved and appropriately handled. However, a significant limitation of this data frame is that it doesn’t seem to have special handling of missing values, requiring some kind of manually coded “special value” approach from users of this data frame. This is likely to limit the appeal of this library for general statistical and data science applications.
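One type-safe alternative to a hand-coded "special value" is to store each entry of a column as an `Option`. The following is a minimal sketch of that idea using only standard collections. It is an illustration of the general approach, not a feature of scala-datatable, and the column name and values are made up.

```scala
// A hypothetical numeric column with one missing entry
val luggage: Vector[Option[Double]] = Vector(Some(11.0), None, Some(14.0))

// Summaries are computed over the values that are present
val present = luggage.flatten                     // Vector(11.0, 14.0)
val meanPresent = present.sum / present.length    // 12.5

// Transformations leave missing entries missing
val doubled = luggage.map(_.map(_ * 2.0))         // Vector(Some(22.0), None, Some(28.0))
```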

Framian is a full-featured data frame library for Scala, open-sourced by Pellucid analytics. It is strongly influenced by R data frame libraries, and aims to provide most of the features that R users would expect. It has good support for clean handling of heterogeneously typed columns (using shapeless), handles missing data, and includes good CSV import:

    val df=Csv.parseFile(new File("cars93.csv")).labeled.toFrame
    println(""+df.rows+" "+df.cols)
    val df2=df.filter(Cols("EngineSize").as[Double])( _ <= 4.0 )
    println(""+df2.rows+" "+df2.cols)
    val df3=df2.map(Cols("Weight").as[Int],"WeightKG")(r=>r.toDouble*0.453592)
    println(""+df3.rows+" "+df3.cols)
    println(df3.colIndex)
    val csv = Csv.fromFrame(new CsvFormat(",", header = true))(df3)
    new PrintWriter("out.csv") { write(csv.toString); close }

This is arguably the cleanest solution so far. Unfortunately the output isn’t quite right(!), as there currently seems to be a bug in `Csv.fromFrame` which causes the ordering of columns to get out of sync with the ordering of the column headers. Presumably this bug will soon be fixed, and if not it is easy to write a CSV writer for these frames, as I did above for scala-datatable.

The three data frames considered so far are all standard single-machine, non-distributed, in-memory objects. The Scala data frame implementation currently subject to the most social media buzz is a different beast entirely. A `DataFrame` object has recently been added to Apache Spark. I’ve already discussed the problems of first developing a data analysis library without data frames and then attempting to bolt a data frame object on top post-hoc. Spark has repeated this mistake, but it’s still much better to have a data frame in Spark than not. Spark is a Scala framework for the distributed processing and analysis of huge datasets on a cluster. I will discuss it further in future posts. If you have a legitimate need for this kind of set-up, then Spark is a pretty impressive piece of technology (though note that there are competitors, such as flink). However, for datasets that can be analysed on a single machine, then Spark seems like a rather slow and clunky sledgehammer to crack a nut. So, for datasets in the terabyte range and above, Spark DataFrames are great, but for datasets smaller than a few gigs, it’s probably not the best solution. With those caveats in mind, here’s how to solve our problem using Spark DataFrames (and the spark-csv library) in the Spark Shell:

    val df = sqlContext.read.format("com.databricks.spark.csv").
      option("header", "true").
      option("inferSchema","true").
      load("cars93.csv")
    val df2=df.filter("EngineSize <= 4.0")
    val col=df2.col("Weight")*0.453592
    val df3=df2.withColumn("WeightKG",col)
    df3.write.format("com.databricks.spark.csv").
      option("header","true").
      save("out-csv")

If you really need a distributed data frame library, then you will probably want to use Spark. However, for the vast majority of statistical modelling and data science tasks, Spark is likely to be unnecessarily complex and heavyweight. The other three libraries considered all have pros and cons. They are all largely one-person hobby projects, quite immature, and not currently under very active development. Saddle is fine for when you just want to add column headings to a matrix. Scala-datatable is lightweight and immutable, if you don’t care about missing values. On balance, I think Framian is probably the most full-featured “batteries included” R-like data frame, and so is likely to be most attractive to statisticians and data scientists. However, it’s pretty immature, and the dependence on shapeless may be of concern to those who prefer libraries to be lean and devoid of sorcery!

I’d be really interested to know of other people’s experiences of these libraries, so please do comment if you have any views, and especially if you have opinions on the relative merits of the different libraries.

The full source code for all of these examples, including sbt build files, can be found in a new github repo I’ve created for the code examples associated with this blog.


In the previous post I showed how the `rscala` package (which has replaced the `jvmr` package) can be used to call Scala code from within R. In this post I will show how to call R from Scala code. I have previously described how to do this using `jvmr`. This post is really just an update to show how things work with `rscala`.

Since I’m focusing here on Scala sbt projects, I’m assuming that sbt is installed, in addition to `rscala` (described in the previous post). The only “trick” required for calling back to R from Scala is telling sbt where the `rscala` jar file is located. You can find the location from the R console as illustrated by the following session:

    > library(rscala)
    > rscala::rscalaJar("2.11")
    [1] "/home/ndjw1/R/x86_64-pc-linux-gnu-library/3.2/rscala/java/rscala_2.11-1.0.6.jar"

This location (which will obviously be different for you) can then be added in to your sbt classpath by adding the following line to your `build.sbt` file:

    unmanagedJars in Compile += file("/home/ndjw1/R/x86_64-pc-linux-gnu-library/3.2/rscala/java/rscala_2.11-1.0.6.jar")
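If you would rather not hard-code the path, one possibility is to have sbt ask R for it when the build is loaded. The fragment below is only a sketch: it assumes that `Rscript` is on your `PATH`, that `rscala` is installed in the default R library, and that `rscalaJar()` behaves as in the session above. I have not tried it on every platform.

```scala
// build.sbt fragment (a sketch, not a tested recipe)
// Assumes Rscript is on the PATH and rscala is installed in R
import scala.sys.process._

// Ask R for the jar location rather than hard-coding it;
// the "2.11" here should match the scalaVersion of the project
unmanagedJars in Compile += file(
  Seq("Rscript", "-e", """cat(rscala::rscalaJar("2.11"))""").!!.trim
)
```

The trade-off is that every sbt start-up now launches an R process, so the hard-coded path is still the simpler choice for a one-off project.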

Once this is done, calling out to R from your Scala sbt project can be carried out as described in the `rscala` documentation. For completeness, a working example is given below.

In this example I will use Scala to simulate some data consistent with a Poisson regression model, push the data to R, fit the model using the R function `glm()`, and then pull the fitted regression coefficients back into Scala. This is obviously a very artificial example, but the point is to show how it is possible to call back to R for some statistical procedure that may be “missing” from Scala.
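For reference, the model being simulated can be written out as follows. The covariate range, sample size and coefficient values are the ones hard-wired into the Scala program for this example; the symbols $\beta_0$ and $\beta_1$ are just my notation for the intercept and slope.

```latex
\begin{aligned}
x_i &\sim \mathcal{U}(50, 60), \\
\eta_i &= \beta_0 + \beta_1 x_i, \qquad \beta_0 = -3,\ \beta_1 = 0.1, \\
\mu_i &= \exp(\eta_i), \\
y_i &\sim \text{Poisson}(\mu_i), \qquad i = 1, \ldots, 1000.
\end{aligned}
```

Since $x_i \in [50, 60]$, the linear predictor $\eta_i$ lies in $[2, 3]$, so the Poisson means lie between $e^2 \approx 7.4$ and $e^3 \approx 20.1$. The coefficients fitted by `glm()` should therefore come out close to $-3$ and $0.1$.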

The dependencies for this project are described in the file `build.sbt`:

    name := "rscala test"

    version := "0.1"

    scalacOptions ++= Seq("-unchecked", "-deprecation", "-feature")

    libraryDependencies ++= Seq(
      "org.scalanlp" %% "breeze" % "0.10",
      "org.scalanlp" %% "breeze-natives" % "0.10"
    )

    resolvers ++= Seq(
      "Sonatype Snapshots" at "https://oss.sonatype.org/content/repositories/snapshots/",
      "Sonatype Releases" at "https://oss.sonatype.org/content/repositories/releases/"
    )

    unmanagedJars in Compile += file("/home/ndjw1/R/x86_64-pc-linux-gnu-library/3.2/rscala/java/rscala_2.11-1.0.6.jar")

    scalaVersion := "2.11.6"

The complete Scala program is contained in the file `PoisReg.scala`:

    import org.ddahl.rscala.callback._
    import breeze.stats.distributions._
    import breeze.linalg._

    object ScalaToRTest {

      def main(args: Array[String]) = {

        // first simulate some data consistent with a Poisson regression model
        val x = Uniform(50,60).sample(1000)
        val eta = x map { xi => (xi * 0.1) - 3 }
        val mu = eta map { math.exp(_) }
        val y = mu map { Poisson(_).draw }

        // call to R to fit the Poisson regression model
        val R = RClient() // initialise an R interpreter
        R.x=x.toArray // send x to R
        R.y=y.toArray // send y to R
        R.eval("mod <- glm(y~x,family=poisson())") // fit the model in R

        // pull the fitted coefficients back into scala
        val beta = DenseVector[Double](R.evalD1("mod$coefficients"))

        // print the fitted coefficients
        println(beta)

      }

    }

If these two files are put in an empty directory, the code can be compiled and run by typing `sbt run` from the command prompt in the relevant directory. The commented code should be self-explanatory, but see the `rscala` documentation for further details. In particular, the rscala scaladoc is useful.
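If you want to sanity-check the numbers coming back from R, a rough cross-check can be done in plain Scala. The sketch below is not part of the original example and makes no use of `rscala` or Breeze: the object name `PoisRegCheck` and the helpers `rpois` and `fit` are my own hypothetical names. It simulates data from the same model and fits the two coefficients by Newton-Raphson, which for the log link coincides with Fisher scoring.

```scala
import scala.util.Random

object PoisRegCheck {

  // Knuth's multiplication method for Poisson draws;
  // adequate for the small means (roughly 7 to 20) used here
  def rpois(mu: Double, rng: Random): Int = {
    val limit = math.exp(-mu)
    var k = 0
    var p = rng.nextDouble()
    while (p > limit) {
      k += 1
      p *= rng.nextDouble()
    }
    k
  }

  // Newton-Raphson for a log-link Poisson regression with an intercept
  // and one covariate; returns (intercept, slope) on the original x scale
  def fit(x: Array[Double], y: Array[Int], iters: Int = 25): (Double, Double) = {
    val xbar = x.sum / x.length
    val xc = x.map(_ - xbar) // centre the covariate for numerical stability
    var a = math.log(y.sum.toDouble / y.length) // start at the log of the mean count
    var b = 0.0
    for (_ <- 1 to iters) {
      var g0, g1, h00, h01, h11 = 0.0 // score vector and information matrix entries
      for (i <- x.indices) {
        val mu = math.exp(a + b * xc(i))
        val r = y(i) - mu
        g0 += r
        g1 += r * xc(i)
        h00 += mu
        h01 += mu * xc(i)
        h11 += mu * xc(i) * xc(i)
      }
      // solve the 2x2 system (information matrix) * step = score
      val det = h00 * h11 - h01 * h01
      a += (h11 * g0 - h01 * g1) / det
      b += (h00 * g1 - h01 * g0) / det
    }
    (a - b * xbar, b) // undo the centring
  }

  def main(args: Array[String]): Unit = {
    val rng = new Random(42)
    val x = Array.fill(1000)(50.0 + 10.0 * rng.nextDouble())
    val y = x.map(xi => rpois(math.exp(0.1 * xi - 3.0), rng))
    val (b0, b1) = fit(x, y)
    println(s"intercept = $b0, slope = $b1")
  }

}
```

The estimates will vary with the seed, but with 1000 observations the slope should land close to 0.1, with the intercept rather more variable around -3. In practice `glm()` does a great deal more than this (standard errors, deviance, diagnostics), which is exactly why calling back to R is attractive.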
