Statistical computing with Scala free on-line course

I’ve recently delivered a three-day intensive short-course on Scala for statistical computing and data science. The course seemed to go well, and the experience has convinced me that Scala should be used a lot more by statisticians and data scientists for a range of problems in statistical computing. In particular, the simplicity of writing fast efficient parallel algorithms is reason alone to take a careful look at Scala. With a view to helping more statisticians get to grips with Scala, I’ve decided to freely release all of the essential materials associated with the course: the course notes (as PDF), code fragments, complete examples, end-of-chapter exercises, etc. Although I developed the materials with the training course in mind, the course notes are reasonably self-contained, making the course quite suitable for self-study. At some point I will probably flesh out the notes into a proper book, but that will probably take me a little while.

I’ve written a brief self-study guide to point people in the right direction. For people studying the material in their spare time, the course is probably best done over nine weeks (one chapter per week), and this will then cover material at a similar rate to a typical MOOC.

The nine chapters are:

1. Introduction
2. Scala and FP Basics
3. Collections
4. Scala Breeze
5. Monte Carlo
6. Statistical modelling
7. Tools
8. Apache Spark
9. Advanced topics

For anyone frustrated by the limitations of dynamic languages such as R, Python or Octave, this course should provide a good pathway to an altogether more sophisticated, modern programming paradigm.


MCMC as a Stream


This weekend I’ve been preparing some material for my upcoming Scala for statistical computing short course. As part of the course, I thought it would be useful to walk through how to think about and structure MCMC codes, and in particular, how to think about MCMC algorithms as infinite streams of state. This material is reasonably stand-alone, so it seems suitable for a blog post. Complete runnable code for the examples in this post are available from my blog repo.

A simple MH sampler

For this post I will just consider a trivial toy Metropolis algorithm using a Uniform random walk proposal to target a standard normal distribution. I’ve considered this problem before on my blog, so if you aren’t very familiar with Metropolis-Hastings algorithms, you might want to quickly review my post on Metropolis-Hastings MCMC algorithms in R before continuing. At the end of that post, I gave the following R code for the Metropolis sampler:

        vec=vector("numeric", n)
        for (i in 2:n) {
                if (log(runif(1)) < loga) { 

I will begin this post with a fairly direct translation of this algorithm into Scala:

def metrop1(n: Int = 1000, eps: Double = 0.5): DenseVector[Double] = {
    val vec = DenseVector.fill(n)(0.0)
    var x = 0.0
    var oldll = Gaussian(0.0, 1.0).logPdf(x)
    vec(0) = x
    (1 until n).foreach { i =>
      val can = x + Uniform(-eps, eps).draw
      val loglik = Gaussian(0.0, 1.0).logPdf(can)
      val loga = loglik - oldll
      if (math.log(Uniform(0.0, 1.0).draw) < loga) {
        x = can
        oldll = loglik
      vec(i) = x

This code works, and is reasonably fast and efficient, but there are several issues with it from a functional programmers perspective. One issue is that we have committed to storing all MCMC output in RAM in a DenseVector. This probably isn’t an issue here, but for some big problems we might prefer to not store the full set of states, but to just print the states to (say) the console, for possible re-direction to a file. It is easy enough to modify the code to do this:

def metrop2(n: Int = 1000, eps: Double = 0.5): Unit = {
    var x = 0.0
    var oldll = Gaussian(0.0, 1.0).logPdf(x)
    (1 to n).foreach { i =>
      val can = x + Uniform(-eps, eps).draw
      val loglik = Gaussian(0.0, 1.0).logPdf(can)
      val loga = loglik - oldll
      if (math.log(Uniform(0.0, 1.0).draw) < loga) {
        x = can
        oldll = loglik

But now we have two version of the algorithm. One for storing results locally, and one for streaming results to the console. This is clearly unsatisfactory, but we shall return to this issue shortly. Another issue that will jump out at functional programmers is the reliance on mutable variables for storing the state and old likelihood. Let’s fix that now by re-writing the algorithm as a tail-recursion.

def metrop3(n: Int = 1000, eps: Double = 0.5, x: Double = 0.0, oldll: Double = Double.MinValue): Unit = {
    if (n > 0) {
      val can = x + Uniform(-eps, eps).draw
      val loglik = Gaussian(0.0, 1.0).logPdf(can)
      val loga = loglik - oldll
      if (math.log(Uniform(0.0, 1.0).draw) < loga)
        metrop3(n - 1, eps, can, loglik)
        metrop3(n - 1, eps, x, oldll)

This has eliminated the vars, and is just as fast and efficient as the previous version of the code. Note that the @tailrec annotation is optional – it just signals to the compiler that we want it to throw an error if for some reason it cannot eliminate the tail call. However, this is for the print-to-console version of the code. What if we actually want to keep the iterations in RAM for subsequent analysis? We can keep the values in an accumulator, as follows.

def metrop4(n: Int = 1000, eps: Double = 0.5, x: Double = 0.0, oldll: Double = Double.MinValue, acc: List[Double] = Nil): DenseVector[Double] = {
    if (n == 0)
    else {
      val can = x + Uniform(-eps, eps).draw
      val loglik = Gaussian(0.0, 1.0).logPdf(can)
      val loga = loglik - oldll
      if (math.log(Uniform(0.0, 1.0).draw) < loga)
        metrop4(n - 1, eps, can, loglik, can :: acc)
        metrop4(n - 1, eps, x, oldll, x :: acc)

Factoring out the updating logic

This is all fine, but we haven’t yet addressed the issue of having different versions of the code depending on what we want to do with the output. The problem is that we have tied up the logic of advancing the Markov chain with what to do with the output. What we need to do is separate out the code for advancing the state. We can do this by defining a new function.

def newState(x: Double, oldll: Double, eps: Double): (Double, Double) = {
    val can = x + Uniform(-eps, eps).draw
    val loglik = Gaussian(0.0, 1.0).logPdf(can)
    val loga = loglik - oldll
    if (math.log(Uniform(0.0, 1.0).draw) < loga) (can, loglik) else (x, oldll)

This function takes as input a current state and associated log likelihood and returns a new state and log likelihood following the execution of one step of a MH algorithm. This separates the concern of state updating from the rest of the code. So now if we want to write code that prints the state, we can write it as

  def metrop5(n: Int = 1000, eps: Double = 0.5, x: Double = 0.0, oldll: Double = Double.MinValue): Unit = {
    if (n > 0) {
      val ns = newState(x, oldll, eps)
      metrop5(n - 1, eps, ns._1, ns._2)

and if we want to accumulate the set of states visited, we can write that as

  def metrop6(n: Int = 1000, eps: Double = 0.5, x: Double = 0.0, oldll: Double = Double.MinValue, acc: List[Double] = Nil): DenseVector[Double] = {
    if (n == 0) DenseVector(acc.reverse.toArray) else {
      val ns = newState(x, oldll, eps)
      metrop6(n - 1, eps, ns._1, ns._2, ns._1 :: acc)

Both of these functions call newState to do the real work, and concentrate on what to do with the sequence of states. However, both of these functions repeat the logic of how to iterate over the sequence of states.

MCMC as a stream

Ideally we would like to abstract out the details of how to do state iteration from the code as well. Most functional languages have some concept of a Stream, which represents a (potentially infinite) sequence of states. The Stream can embody the logic of how to perform state iteration, allowing us to abstract that away from our code, as well.

To do this, we will restructure our code slightly so that it more clearly maps old state to new state.

def nextState(eps: Double)(state: (Double, Double)): (Double, Double) = {
    val x = state._1
    val oldll = state._2
    val can = x + Uniform(-eps, eps).draw
    val loglik = Gaussian(0.0, 1.0).logPdf(can)
    val loga = loglik - oldll
    if (math.log(Uniform(0.0, 1.0).draw) < loga) (can, loglik) else (x, oldll)

The "real" state of the chain is just x, but if we want to avoid recalculation of the old likelihood, then we need to make this part of the chain’s state. We can use this nextState function in order to construct a Stream.

  def metrop7(eps: Double = 0.5, x: Double = 0.0, oldll: Double = Double.MinValue): Stream[Double] =
    Stream.iterate((x, oldll))(nextState(eps)) map (_._1)

The result of calling this is an infinite stream of states. Obviously it isn’t computed – that would require infinite computation, but it captures the logic of iteration and computation in a Stream, that can be thought of as a lazy List. We can get values out by converting the Stream to a regular collection, being careful to truncate the Stream to one of finite length beforehand! eg. metrop7().drop(1000).take(10000).toArray will do a burn-in of 1,000 iterations followed by a main monitoring run of length 10,000, capturing the results in an Array. Note that metrop7().drop(1000).take(10000) is a Stream, and so nothing is actually computed until the toArray is encountered. Conversely, if printing to console is required, just replace the .toArray with .foreach(println).

The above stream-based approach to MCMC iteration is clean and elegant, and deals nicely with issues like burn-in and thinning (which can be handled similarly). This is how I typically write MCMC codes these days. However, functional programming purists would still have issues with this approach, as it isn’t quite pure functional. The problem is that the code isn’t pure – it has a side-effect, which is to mutate the state of the under-pinning pseudo-random number generator. If the code was pure, calling nextState with the same inputs would always give the same result. Clearly this isn’t the case here, as we have specifically designed the function to be stochastic, returning a randomly sampled value from the desired probability distribution. So nextState represents a function for randomly sampling from a conditional probability distribution.

A pure functional approach

Now, ultimately all code has side-effects, or there would be no point in running it! But in functional programming the desire is to make as much of the code as possible pure, and to push side-effects to the very edges of the code. So it’s fine to have side-effects in your main method, but not buried deep in your code. Here the side-effect is at the very heart of the code, which is why it is potentially an issue.

To keep things as simple as possible, at this point we will stop worrying about carrying forward the old likelihood, and hard-code a value of eps. Generalisation is straightforward. We can make our code pure by instead defining a function which represents the conditional probability distribution itself. For this we use a probability monad, which in Breeze is called Rand. We can couple together such functions using monadic binds (flatMap in Scala), expressed most neatly using for-comprehensions. So we can write our transition kernel as

def kernel(x: Double): Rand[Double] = for {
    innov <- Uniform(-0.5, 0.5)
    can = x + innov
    oldll = Gaussian(0.0, 1.0).logPdf(x)
    loglik = Gaussian(0.0, 1.0).logPdf(can)
    loga = loglik - oldll
    u <- Uniform(0.0, 1.0)
} yield if (math.log(u) < loga) can else x

This is now pure – the same input x will always return the same probability distribution – the conditional distribution of the next state given the current state. We can draw random samples from this distribution if we must, but it’s probably better to work as long as possible with pure functions. So next we need to encapsulate the iteration logic. Breeze has a MarkovChain object which can take kernels of this form and return a stochastic Process object representing the iteration logic, as follows.


The steps method contains the logic of how to advance the state of the chain. But again note that no computation actually takes place until the foreach method is encountered – this is when the sampling occurs and the side-effects happen.

Metropolis-Hastings is a common use-case for Markov chains, so Breeze actually has a helper method built-in that will construct a MH sampler directly from an initial state, a proposal kernel, and a (log) target.

  metropolisHastings(0.0, (x: Double) =>
  Uniform(x - 0.5, x + 0.5))(x =>
  Gaussian(0.0, 1.0).logPdf(x)).

Note that if you are using the MH functionality in Breeze, it is important to make sure that you are using version 0.13 (or later), as I fixed a few issues with the MH code shortly prior to the 0.13 release.


Viewing MCMC algorithms as infinite streams of state is useful for writing elegant, generic, flexible code. Streams occur everywhere in programming, and so there are lots of libraries for working with them. In this post I used the simple Stream from the Scala standard library, but there are much more powerful and flexible stream libraries for Scala, including fs2 and Akka-streams. But whatever libraries you are using, the fundamental concepts are the same. The most straightforward approach to implementation is to define impure stochastic streams to consume. However, a pure functional approach is also possible, and the Breeze library defines some useful functions to facilitate this approach. I’m still a little bit ambivalent about whether the pure approach is worth the additional cognitive overhead, but it’s certainly very interesting and worth playing with and thinking about the pros and cons.

Complete runnable code for the examples in this post are available from my blog repo.

Books on Scala for statistical computing and data science


People regularly ask me about books and other resources for getting started with Scala for statistical computing and data science. This post will focus on books, but it’s worth briefly noting that there are a number of other resources available, on-line and otherwise, that are also worth considering. I particularly like the Coursera course Functional Programming Principles in Scala – I still think this is probably the best way to get started with Scala and functional programming for most people. In fact, there is an entire Functional Programming in Scala Specialization that is worth considering – I’ll probably discuss that more in another post. I’ve got a draft page of Scala links which has a bias towards scientific and statistical computing, and I’m currently putting together a short course in that area, which I’ll also discuss further in future posts. But this post will concentrate on books.

Reading list

Getting started with Scala

Before one can dive into statistical computing and data science using Scala, it’s a good idea to understand a bit about the language and about functional programming. There are by now many books on Scala, and I haven’t carefully reviewed all of them, but I’ve looked at enough to have an idea about good ways of getting started.

  • Programming in Scala: Third edition, Odersky et al, Artima.
    • This is the Scala book, often referred to on-line as PinS. It is a weighty tome, and works through the Scala language in detail, starting from the basics. Every serious Scala programmer should own this book. However, it isn’t the easiest introduction to the language.
  • Scala for the Impatient, Horstmann, Addison-Wesley.
    • As the name suggests, this is a much quicker and easier introduction to Scala than PinS, but assumes reasonable familiarity with programming in general, and sort-of assumes that the reader has a basic knowledge of Java and the JVM ecosystem. That said, it does not assume that the reader is a Java expert. My feeling is that for someone who has a reasonable programming background and a passing familiarity with Java, then this book is probably the best introduction to the language. Note that there is a second edition in the works.
  • Functional Programming in Scala Chiusano and Bjarnason, Manning.
    • It is possible to write Scala code in the style of "Java-without-the-semi-colons", but really the whole point of Scala is to move beyond that kind of Object-Oriented programming style. How much you venture down the path towards pure Functional Programming is very much a matter of taste, but many of the best Scala programmers are pretty hard-core FP, and there’s probably a reason for that. But many people coming to Scala don’t have a strong FP background, and getting up to speed with strongly-typed FP isn’t easy for people who only know an imperative (Object-Oriented) style of programming. This is the book that will help you to make the jump to FP. Sometimes referred to online as FPiS, or more often even just as the red book, this is also a book that every serious Scala programmer should own (and read!). Note that is isn’t really a book about Scala – it is a book about strongly typed FP that just "happens" to use Scala for illustrating the ideas. Consequently, you will probably want to augment this book with a book that really is about Scala, such as one of the books above. Since this is the first book on the list published by Manning, I should also mention how much I like computing books from this publisher. They are typically well-produced, and their paper books (pBooks) come with complimentary access to well-produced DRM-free eBook versions, however you purchase them.
  • Functional and Reactive Domain Modeling, Ghosh, Manning.
    • This is another book that isn’t really about Scala, but about software engineering using a strongly typed FP language. But again, it uses Scala to illustrate the ideas, and is an excellent read. You can think of it as a more practical "hands-on" follow-up to the red book, which shows how the ideas from the red book translate into effective solutions to real-world problems.
  • Structure and Interpretation of Computer Programs, second edition Abelson et al, MIT Press.
    • This is not a Scala book! This is the only book in this list which doesn’t use Scala at all. I’ve included it on the list because it is one of the best books on programming that I’ve read, and is the book that I wish someone had told me about 20 years ago! In fact the book uses Scheme (a Lisp derivative) as the language to illustrate the ideas. There are obviously important differences between Scala and Scheme – eg. Scala is strongly statically typed and compiled, whereas Scheme is dynamically typed and interpreted. However, there are also similarities – eg. both languages support and encourage a functional style of programming but are not pure FP languages. Referred to on-line as SICP this book is a classic. Note that there is no need to buy a paper copy if you like eBooks, since electronic versions are available free on-line.

Scala for statistical computing and data science

  • Scala for Data Science, Bugnion, Packt.
    • Not to be confused with the (terrible) book, Scala for machine learning by the same publisher. Scala for Data Science is my top recommendation for getting started with statistical computing and data science applications using Scala. I have reviewed this book in another post, so I won’t say more about it here (but I like it).
  • Scala Data Analysis Cookbook, Manivannan, Packt.
    • I’m not a huge fan of the cookbook format, but this book is really mis-named, as it isn’t really a cookbook and isn’t really about data analysis in Scala! It is really a book about Apache Spark, and proceeds fairly sequentially in the form of a tutorial introduction to Spark. Spark is an impressive piece of technology, and it is obviously one of the factors driving interest in Scala, but it’s important to understand that Spark isn’t Scala, and that many typical data science applications will be better tackled using Scala without Spark. I’ve not read this book cover-to-cover as it offers little over Scala for Data Science, but its coverage of Spark is a bit more up-to-date than the Spark books I mention below, so it could be of interest to those who are mainly interested in Scala for Spark.
  • Scala High Performance Programming, Theron and Diamant, Packt.
    • This is an interesting book, fundamentally about developing high performance streaming data processing algorithm pipelines in Scala. It makes no reference to Spark. The running application is an on-line financial trading system. It takes a deep dive into understanding performance in Scala and on the JVM, and looks at how to benchmark and profile performance, diagnose bottlenecks and optimise code. This is likely to be of more interest to those interested in developing efficient algorithms for scientific and statistical computing rather than applied data scientists, but it covers some interesting material not covered by any of the other books in this list.
  • Learning Spark, Karau et al, O’Reilly.
    • This book provides an introduction to Apache Spark, written by some of the people who developed it. Spark is a big data analytics framework built on top of Scala. It is arguably the best available framework for big data analytics on computing clusters in the cloud, and hence there is a lot of interest in it. The book is a perfectly good introduction to Spark, and shows most examples implemented using the Java and Python APIs in addition to the canonical Scala (Spark Shell) implementation. This is useful for people working with multiple languages, but can be mildly irritating to anyone who is only interested in Scala. However, the big problem with this (and every other) book on Spark is that Spark is evolving very quickly, and so by the time any book on Spark is written and published it is inevitably very out of date. It’s not clear that it is worth buying a book specifically about Spark at this stage, or whether it would be better to go for a book like Scala for Data Science, which has a couple of chapters of introduction to Spark, which can then provide a starting point for engaging with Spark’s on-line documentation (which is reasonably good).
  • Advanced Analytics with Spark, Ryza et al, O’Reilly.
    • This book has a bit of a "cookbook" feel to it, which some people like and some don’t. It’s really more like an "edited volume" with different chapters authored by different people. Unlike Learning Spark it focuses exclusively on the Scala API. The book basically covers the development of a bunch of different machine learning pipelines for a variety of applications. My main problem with this book is that it has aged particularly badly, as all of the pipelines are developed with raw RDDs, which isn’t how ML pipelines in Spark are constructed any more. So again, it’s difficult for me to recommend. The message here is that if you are thinking of buying a book about Spark, check very carefully when it was published and what version of Spark it covers and whether that is sufficiently recent to be of relevance to you.


There are lots of books to get started with Scala for statistical computing and data science applications. My "bare minimum" recommendation would be some generic Scala book (doesn’t really matter which one), the red book, and Scala for data science. After reading those, you will be very well placed to top-up your knowledge as required with on-line resources.

Data frames and tables in Scala


To statisticians and data scientists used to working in R, the concept of a data frame is one of the most natural and basic starting points for statistical computing and data analysis. It always surprises me that data frames aren’t a core concept in most programming languages’ standard libraries, since they are essentially a representation of a relational database table, and relational databases are pretty ubiquitous in data processing and related computing. For statistical modelling and data science, having functions designed for data frames is much more elegant than using functions designed to work directly on vectors and matrices, for example. Trivial things like being able to refer to columns by a readable name rather than a numeric index makes a huge difference, before we even get into issues like columns of heterogeneous types, coherent handling of missing data, etc. This is why modelling in R is typically nicer than in certain other languages I could mention, where libraries for scientific and numerical computing existed for a long time before libraries for data frames were added to the language ecosystem.

To build good libraries for statistical computing in Scala, it will be helpful to build those libraries using a good data frame implementation. With that in mind I’ve started to look for existing Scala data frame libraries and to compare them.

A simple data manipulation task

For this post I’m going to consider a very simple data manipulation task: first reading in a CSV file from disk into a data frame object, then filtering out some rows, then adding a derived column, then finally writing the data frame back to disk as a CSV file. We will start by looking at how this would be done in R. First we need an example CSV file. Since many R packages contain example datasets, we will use one of those. We will export Cars93 from the MASS package:


If MASS isn’t installed, it can be installed with a simple install.packages("MASS"). The above code snippet generates a CSV file to be used for the example. Typing ?Cars93 will give some information about the dataset, including the original source.

Our analysis task is going to be to load the file from disk, filter out cars with EngineSize larger than 4 (litres), add a new column to the data frame, WeightKG, containing the weight of the car in KG, derived from the column Weight (in pounds), and then write back to disk in CSV format. This is the kind of thing that R excels at (pun intended):

df = df[df$EngineSize<=4.0,]
df$WeightKG = df$Weight*0.453592

Now let’s see how a similar task could be accomplished using Scala data frames.

Data frames and tables in Scala


Saddle is probably the best known data frame library for Scala. It is strongly influenced by the pandas library for Python. A simple Saddle session for accomplishing this task might proceed as follows:

val file = CsvFile("cars93.csv")
val df = CsvParser.parse(file).withColIndex(0)
val df2 = df.rfilter(_("EngineSize").
val wkg=df2.col("Weight").mapValues(CsvParser.parseDouble).
val df3=df2.joinPreserveColIx(wkg.mapValues(_.toString))

Although this looks OK, it’s not completely satisfactory, as the data frame is actually representing a matrix of Strings. Although you can have a data frame containing columns of any type, since Saddle data frames are backed by a matrix object (with type corresponding to the common super-type), the handling of columns of heterogeneous types always seems rather cumbersome. I suspect that it is this clumsy handling of heterogeneously typed columns that has motivated the development of alternative data frame libraries for Scala.


Scala-datatable is a lightweight minimal immutable data table for Scala, with good support for columns of differing types. However, it is currently really very minimal, and doesn’t have CSV import or export, for example. That said, there are several CSV libraries for Scala, so it’s quite easy to write functions to import from CSV into a datatable and write CSV back out from one. I’ve a couple of example functions, readCsv() and writeCsv() in the full code examples associated with this post. Now since datatable supports heterogeneous column types and I don’t want to write a type guesser, my readCsv() function expects information regarding the column types. This could be relaxed with a bit of effort. An example session follows:

    val colTypes=Map("DriveTrain" -> StringCol, 
                     "Min.Price" -> Double, 
                     "Cylinders" -> Int, 
                     "Horsepower" -> Int, 
                     "Length" -> Int, 
                     "Make" -> StringCol, 
                     "Passengers" -> Int, 
                     "Width" -> Int, 
                     "Fuel.tank.capacity" -> Double, 
                     "Origin" -> StringCol, 
                     "Wheelbase" -> Int, 
                     "Price" -> Double, 
                     "" -> Double, 
                     "Weight" -> Int, 
                     "Model" -> StringCol, 
                     "Max.Price" -> Double, 
                     "Manufacturer" -> StringCol, 
                     "EngineSize" -> Double, 
                     "AirBags" -> StringCol, 
                     "Man.trans.avail" -> StringCol, 
                     "" -> Double, 
                     "RPM" -> Int, 
                     "" -> Double, 
                     "MPG.highway" -> Int, 
                     "" -> Int, 
                     "Rev.per.mile" -> Int, 
                     "Type" -> StringCol)
    val df=readCsv("Cars93",new FileReader("cars93.csv"),colTypes)
    val df2=df.filter(row=>[Double]("EngineSize")<=4.0).toDataTable

    val oldCol=df2.columns("Weight").as[Int]
    val newCol=new DataColumn[Double]("WeightKG",{_.toDouble*0.453592})
    val df3=df2.columns.add(newCol).get

    writeCsv(df3,new File("out.csv"))

Apart from the declaration of column types, the code is actually a little bit cleaner than the corresponding Saddle code, and the column types are all properly preserved and appropriately handled. However, a significant limitation of this data frame is that it doesn’t seem to have special handling of missing values, requiring some kind of manually coded “special value” approach from users of this data frame. This is likely to limit the appeal of this library for general statistical and data science applications.


Framian is a full-featured data frame library for Scala, open-sourced by Pellucid analytics. It is strongly influenced by R data frame libraries, and aims to provide most of the features that R users would expect. It has good support for clean handling of heterogeneously typed columns (using shapeless), handles missing data, and includes good CSV import:

val df=Csv.parseFile(new File("cars93.csv")).labeled.toFrame
println(""+df.rows+" "+df.cols)
val df2=df.filter(Cols("EngineSize").as[Double])( _ <= 4.0 )
println(""+df2.rows+" "+df2.cols)
println(""+df3.rows+" "+df3.cols)
val csv = Csv.fromFrame(new CsvFormat(",", header = true))(df3)
new PrintWriter("out.csv") { write(csv.toString); close }

This is arguably the cleanest solution so far. Unfortunately the output isn’t quite right(!), as there currently seems to be a bug in Csv.fromFrame which causes the ordering of columns to get out of sync with the ordering of the column headers. Presumably this bug will soon be fixed, and if not it is easy to write a CSV writer for these frames, as I did above for scala-datatable.

Spark DataFrames

The three data frames considered so far are all standard single-machine, non-distributed, in-memory objects. The Scala data frame implementation currently subject to the most social media buzz is a different beast entirely. A DataFrame object has recently been added to Apache Spark. I’ve already discussed the problems of first developing a data analysis library without data frames and then attempting to bolt a data frame object on top post-hoc. Spark has repeated this mistake, but it’s still much better to have a data frame in Spark than not. Spark is a Scala framework for the distributed processing and analysis of huge datasets on a cluster. I will discuss it further in future posts. If you have a legitimate need for this kind of set-up, then Spark is a pretty impressive piece of technology (though note that there are competitors, such as flink). However, for datasets that can be analysed on a single machine, then Spark seems like a rather slow and clunky sledgehammer to crack a nut. So, for datasets in the terabyte range and above, Spark DataFrames are great, but for datasets smaller than a few gigs, it’s probably not the best solution. With those caveats in mind, here’s how to solve our problem using Spark DataFrames (and the spark-csv library) in the Spark Shell:

val df ="com.databricks.spark.csv").
                         option("header", "true").
val df2=df.filter("EngineSize <= 4.0")
val col=df2.col("Weight")*0.453592
val df3=df2.withColumn("WeightKG",col)


If you really need a distributed data frame library, then you will probably want to use Spark. However, for the vast majority of statistical modelling and data science tasks, Spark is likely to be unnecessarily complex and heavyweight. The other three libraries considered all have pros and cons. They are all largely one-person hobby projects, quite immature, and not currently under very active development. Saddle is fine for when you just want to add column headings to a matrix. Scala-datatable is lightweight and immutable, if you don’t care about missing values. On balance, I think Framian is probably the most full-featured “batteries included” R-like data frame, and so is likely to be most attractive to statisticians and data scientists. However, it’s pretty immature, and the dependence on shapeless may be of concern to those who prefer libraries to be lean and devoid of sorcery!

I’d be really interested to know of other people’s experiences of these libraries, so please do comment if you have any views, and especially if you have opinions on the relative merits of the different libraries.

The full source code for all of these examples, including sbt build files, can be found in a new github repo I’ve created for the code examples associated with this blog.

Calling Scala code from R using rscala


In a previous post I looked at how to call Scala code from R using a CRAN package called jvmr. This package now seems to have been replaced by a new package called rscala. Like the old package, it requires a pre-existing Java installation. Unlike the old package, however, it no longer depends on rJava, which may simplify some installations. The rscala package is well documented, with a reference manual and a draft paper. In this post I will concentrate on the issue of calling sbt-based projects with dependencies on external libraries (such as breeze).

On a system with Java installed, it should be possible to install the rscala package with a simple


from the R command prompt. Calling


will check that it has worked. The package will do a sensible search for a Scala installation and use it if it can find one. If it can’t find one (or can only find an installation older than 2.10.x), it will fail. In this case you can download and install a Scala installation specifically for rscala using the command


This option is likely to be attractive to sbt (or IDE) users who don’t like to rely on a system-wide scala installation.

A Gibbs sampler in Scala using Breeze

For illustration I’m going to use a Scala implementation of a Gibbs sampler. The Scala code, gibbs.scala is given below:

package gibbs

object Gibbs {

    import scala.annotation.tailrec
    import scala.math.sqrt
    import breeze.stats.distributions.{Gamma,Gaussian}

    case class State(x: Double, y: Double) {
      override def toString: String = x.toString + " , " + y + "\n"

    def nextIter(s: State): State = {
      val newX = Gamma(3.0, 1.0/((s.y)*(s.y)+4.0)).draw
      State(newX, Gaussian(1.0/(newX+1), 1.0/sqrt(2*newX+2)).draw)

    @tailrec def nextThinnedIter(s: State,left: Int): State =
      if (left==0) s else nextThinnedIter(nextIter(s),left-1)

    def genIters(s: State, stop: Int, thin: Int): List[State] = {
      @tailrec def go(s: State, left: Int, acc: List[State]): List[State] =
        if (left>0)
          go(nextThinnedIter(s,thin), left-1, s::acc)
          else acc

    def main(args: Array[String]) = {
      if (args.length != 3) {
        println("Usage: sbt \"run <outFile> <iters> <thin>\"")
      } else {
        val outF=args(0)
        val iters=args(1).toInt
        val thin=args(2).toInt
        val out = genIters(State(0.0,0.0),iters,thin)
        val s = new
        s.write("x , y\n")
        out map { it => s.write(it.toString) }


This code requires Scala and the Breeze scientific library in order to build. We can specify this in a sbt build file, which should be called build.sbt and placed in the same directory as the Scala code.

name := "gibbs"

version := "0.1"

scalacOptions ++= Seq("-unchecked", "-deprecation", "-feature")

libraryDependencies  ++= Seq(
            "org.scalanlp" %% "breeze" % "0.10",
            "org.scalanlp" %% "breeze-natives" % "0.10"

resolvers ++= Seq(
            "Sonatype Snapshots" at "",
            "Sonatype Releases" at ""

scalaVersion := "2.11.6"

Now, from a system command prompt in the directory where the files are situated, it should be possible to download all dependencies and compile and run the code with a simple

sbt "run output.csv 50000 1000"

sbt magically manages all of the dependencies for us so that we don’t have to worry about them. However, for calling from R, it may be desirable to run the code without running sbt. There are several ways to achieve this, but the simplest is to build an “assembly jar” or “fat jar”, which is a Java byte-code file containing all code and libraries required in order to run the code on any system with a Java installation.

To build an assembly jar first create a subdirectory called project (the name matters), an in it place two files. The first should be called assembly.sbt, and should contain the line

addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.13.0")

Since the version of the assembly tool can depend on the version of sbt, it is also best to fix the version of sbt being used by creating another file in the project directory called, which should contain the line


Now return to the parent directory and run

sbt assembly

If this works, it should create a fat jar target/scala-2.11/gibbs-assembly-0.1.jar. You can check it works by running

java -jar target/scala-2.11/gibbs-assembly-0.1.jar output.csv 10000 10

Assuming that it does, you are now ready to try running the code from within R.

Calling via R system calls

Since this code takes a relatively long time to run, calling it from R via simple system calls isn’t a particularly terrible idea. For example, we can do this from the R command prompt with the following commands

system("java -jar target/scala-2.11/gibbs-assembly-0.1.jar output.csv 50000 1000")

This works fine, but is a bit clunky. Tighter integration between R and Scala would be useful, which is where rscala comes in.

Calling assembly Scala projects via rscala

rscala provides a very simple way to embed a Scala interpreter within an R session, to be able to execute Scala expressions from R and to have the results returned back to the R session for further processing. The main issue with using this in practice is managing dependencies on external libraries and setting the Scala classpath correctly. By using an assembly jar we can bypass most of these issues, and it becomes trivial to call our Scala code direct from the R interpreter, as the following code illustrates.

sc%~%'import gibbs.Gibbs._'

Here we call the getIters function directly, rather than via the main method. This function returns an immutable List of States. Since R doesn’t understand this, we map it to an Array of Arrays, which R then unpacks into an R matrix for us to store in the matrix out.


The CRAN package rscala makes it very easy to embed a Scala interpreter within an R session. However, for most non-trivial statistical computing problems, the Scala code will have dependence on external scientific libraries such as Breeze. The standard way to easily manage external dependencies in the Scala ecosystem is sbt. Given an sbt-based Scala project, it is easy to generate an assembly jar in order to initialise the rscala Scala interpreter with the classpath needed to call arbitrary Scala functions. This provides very convenient inter-operability between R and Scala for many statistical computing applications.

Scala for Machine Learning [book review]

Full disclosure: I received a free electronic version of this book from the publisher for the purposes of review.

There is clearly a market for a good book about using Scala for statistical computing, machine learning and data science. So when the publisher of “Scala for Machine Learning” offered me a copy for review purposes, I eagerly accepted. Three months later, I have eventually forced myself to read through the whole book, but I was very disappointed. It is important to be clear that I’m not just disappointed because I personally didn’t get much from the book – I am not really the target audience. I am disappointed because I struggle to envisage any audience that will benefit greatly from reading this book. There are several potential audiences for a book with this title: eg. people with little knowledge of Scala or machine learning (ML), people with knowledge of Scala but not ML, people with knowledge of ML but not Scala, and people with knowledge of both. I think there is scope for a book targeting any of those audiences. Personally, I fall in the latter category. The book author claimed to be aiming primarily at those who know Scala but not ML. This is sensible in that the book assumes a good working knowledge of Scala, and uses advanced features of the Scala language without any explanation: this book is certainly not appropriate for people hoping to learn about Scala in the context of ML. However, it is also a problem, as this would probably be the worst book I have ever encountered for learning about ML from scratch, and there are a lot of poor books about ML! The book just picks ML algorithms out of thin air without any proper explanation or justification, and blindly applies them to tedious financial data sets irrespective of whether or not it would be in any way appropriate to do so. It presents ML as an incoherent “bag of tricks” to be used indiscriminately on any data of the correct “shape”. It is by no means the only ML book to take such an approach, but there are many much better books which don’t. The author also claims that the book will be useful to people who know ML but not Scala, but as previously explained, I do not think that this is the case (eg. monadic traits appear on the fifth page, without proper explanation, and containing typos). I think that the only audience that could potentially benefit from this book would be people who know some Scala and some ML and want to see some practical examples of real world implementations of ML algorithms in Scala. I think those people will also be disappointed, for reasons outlined below.

The first problem with the book is that it is just full of errors and typos. It doesn’t really matter to me that essentially all of the equations in the first chapter are wrong – I already know the difference between an expectation and a sample mean, and know Bayes theorem – so I can just see that they are wrong, correct them, and move on. But for the intended audience it would be a complete nightmare. I wonder about the quality of copy-editing and technical review that this book received – it is really not of “publishable” quality. All of the descriptions of statistical/ML methods and algorithms are incredibly superficial, and usually contain factual errors or typos. One should not attempt to learn ML by reading this book. So the only hope for this book is that the Scala implementations of ML algorithms are useful and insightful. Again, I was disappointed.

For reasons that are not adequately explained or justified, the author decides to use a combination of plain Scala interfaced to legacy Java libraries (especially Apache Commons Math) for all of the example implementations. In addition, the author is curiously obsessed with an F# style pipe operator, which doesn’t seem to bring much practical benefit. Consequently, all of the code looks like a strange and rather inelegant combination of Java, Scala, C++, and F#, with a hint of Haskell, and really doesn’t look like clean idiomatic Scala code at all. For me this was the biggest disappointment of all – I really wouldn’t want any of this code in my own Scala code base (though the licensing restrictions on the code probably forbid this, anyway). It is a real shame that Scala libraries such as Breeze were not used for all of the examples – this would have led to much cleaner and more idiomatic Scala code, which could have really taken proper advantage of the functional power of the Scala language. As it is, advanced Scala features were used without much visible pay-off. Reading this book one could easily get the (incorrect) impression that Scala is an unnecessarily complex language which doesn’t offer much advantage over Java for implementing ML algorithms.

On the positive side, the book consists of nearly 500 pages of text, covering a wide range of ML algorithms and examples, and has a zip file of associated code containing the implementation and examples, which builds using sbt. If anyone is interested in seeing examples of ML algorithms implemented in Scala using Java rather than Scala libraries together with a F# pipe operator, then there is definitely something of substance here of interest.


It should be clear from the above review that I think there is still a gap in the market for a good book about using Scala for statistical computing, machine learning and data science. Hopefully someone will fill this gap soon. In the meantime it is necessary to learn about Scala and ML separately, and to put the ideas together yourself. This isn’t so difficult, as there are many good resources and code repositories to help. For learning about ML, I would recommend starting off with ISLR, which uses R for the examples (but if you work in data science, you need to know R anyway). Once the basic concepts are understood, one can move on to a serious text, such as Machine Learning (which has associated Matlab code). Converting algorithms from R or Matlab to Scala (plus Breeze) is generally very straightforward, if you know Scala. For learning Scala, there are many on-line resources. If you want books, I recommend Functional Programming in Scala and Programming in Scala, 2e. Once you know about Scala, learn about scientific computing using Scala by figuring out Breeze. At some point you will probably also want to know about Spark, and there are now books on this becoming available – I’ve just got a copy of Learning Spark, which looks OK.

Statistical computing languages at the RSS

On Friday the Royal Statistical Society hosted a meeting on Statistical computing languages, organised by my colleague Colin Gillespie. Four languages were presented at the meeting: Python, Scala, Matlab and Julia. I presented the talk on Scala. The slides I presented are available, in addition to the code examples and instructions on how to run them, in a public github repo I have created. The talk mainly covered material I have discussed in various previous posts on this blog. What was new was the emphasis on how easy it is to use and run Scala code, and the inclusion of complete examples and instructions on how to run them on any platform with a JVM installed. I also discussed some of the current limitations of Scala as an environment for interactive statistical data analysis and visualisation, and how these limitations could be overcome with a little directed effort. Colin suggested that all of the speakers covered a couple of examples (linear regression and a Monte Carlo integral) in “their” languages, and he provided an R solution. It is interesting to see the examples in the five different languages side by side for comparison purposes. Colin is collecting together all of the material relating to the meeting in a combined github repo, which should fill out over the next few days.

For other Scala posts on this blog, see all of my posts with a “scala” tag.

Brief introduction to Scala and Breeze for statistical computing


In the previous post I outlined why I think Scala is a good language for statistical computing and data science. In this post I want to give a quick taste of Scala and the Breeze numerical library to whet the appetite of the uninitiated. This post certainly won’t provide enough material to get started using Scala in anger – but I’ll try and provide a few pointers along the way. It also won’t be very interesting to anyone who knows Scala – I’m not introducing any of the very cool Scala stuff here – I think that some of the most powerful and interesting Scala language features can be a bit frightening for new users.

To reproduce the examples, you need to install Scala and Breeze. This isn’t very tricky, but I don’t want to get bogged down with a detailed walk-through here – I want to concentrate on the Scala language and Breeze library. You just need to install a recent version of Java, then Scala, and then Breeze. You might also want SBT and/or the ScalaIDE, though neither of these are necessary. Then you need to run the Scala REPL with the Breeze library in the classpath. There are several ways one can do this. The most obvious is to just run scala with the path to Breeze manually specified (or specified in an environment variable). Alternatively, you could run a console from an sbt session with a Breeze dependency (which is what I actually did for this post), or you could use a Scala Worksheet from inside a ScalaIDE project with a Breeze dependency.

A Scala REPL session

A first glimpse of Scala

We’ll start with a few simple Scala concepts that are not dependent on Breeze. For further information, see the Scala documentation.

Welcome to Scala version 2.10.3 (OpenJDK 64-Bit Server VM, Java 1.7.0_25).
Type in expressions to have them evaluated.
Type :help for more information.

scala> val a = 5
a: Int = 5

scala> a
res0: Int = 5

So far, so good. Using the Scala REPL is much like using the Python or R command line, so will be very familiar to anyone used to these or similar languages. The first thing to note is that labels need to be declared on first use. We have declared a to be a val. These are immutable values, which can not be just re-assigned, as the following code illustrates.

scala> a = 6
<console>:8: error: reassignment to val
       a = 6
scala> a
res1: Int = 5

Immutability seems to baffle people unfamiliar with functional programming. But fear not, as Scala allows declaration of mutable variables as well:

scala> var b = 7
b: Int = 7

scala> b
res2: Int = 7

scala> b = 8
b: Int = 8

scala> b
res3: Int = 8

The Zen of functional programming is to realise that immutability is generally a good thing, but that really isn’t the point of this post. Scala has excellent support for both mutable and immutable collections as part of the standard library. See the API docs for more details. For example, it has immutable lists.

scala> val c = List(3,4,5,6)
c: List[Int] = List(3, 4, 5, 6)

scala> c(1)
res4: Int = 4

scala> c.sum
res5: Int = 18

scala> c.length
res6: Int = 4

scala> c.product
res7: Int = 360

Again, this should be pretty familiar stuff for anyone familiar with Python. Note that the sum and product methods are special cases of reduce operations, which are well supported in Scala. For example, we could compute the sum reduction using

scala> c.foldLeft(0)((x,y) => x+y)
res8: Int = 18

or the slightly more condensed form given below, and similarly for the product reduction.

scala> c.foldLeft(0)(_+_)
res9: Int = 18

scala> c.foldLeft(1)(_*_)
res10: Int = 360

Scala also has a nice immutable Vector class, which offers a range of constant time operations (but note that this has nothing to do with the mutable Vector class that is part of the Breeze library).

scala> val d = Vector(2,3,4,5,6,7,8,9)
d: scala.collection.immutable.Vector[Int] = Vector(2, 3, 4, 5, 6, 7, 8, 9)

scala> d
res11: scala.collection.immutable.Vector[Int] = Vector(2, 3, 4, 5, 6, 7, 8, 9)

scala> d.slice(3,6)
res12: scala.collection.immutable.Vector[Int] = Vector(5, 6, 7)

scala> val e = d.updated(3,0)
e: scala.collection.immutable.Vector[Int] = Vector(2, 3, 4, 0, 6, 7, 8, 9)

scala> d
res13: scala.collection.immutable.Vector[Int] = Vector(2, 3, 4, 5, 6, 7, 8, 9)

scala> e
res14: scala.collection.immutable.Vector[Int] = Vector(2, 3, 4, 0, 6, 7, 8, 9)

Note that when e is created as an updated version of d the whole of d is not copied – only the parts that have been updated. And we don’t have to worry that aspects of d and e point to the same information in memory, as they are both immutable… As should be clear by now, Scala has excellent support for functional programming techniques. In addition to the reduce operations mentioned already, maps and filters are also well covered.

scala> val f=(1 to 10).toList
f: List[Int] = List(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)

scala> f
res15: List[Int] = List(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)

scala> => x*x)
res16: List[Int] = List(1, 4, 9, 16, 25, 36, 49, 64, 81, 100)

scala> f map {x => x*x}
res17: List[Int] = List(1, 4, 9, 16, 25, 36, 49, 64, 81, 100)

scala> f filter {_ > 4}
res18: List[Int] = List(5, 6, 7, 8, 9, 10)

Note how Scala allows methods with a single argument to be written as an infix operator, making for more readable code.

A first look at Breeze

The next part of the session requires the Breeze library – see the Breeze quickstart guide for further details. We begin by taking a quick look at everyone’s favourite topic of non-uniform random number generation. Let’s start by generating a couple of draws from a Poisson distribution with mean 3.

scala> import breeze.stats.distributions._
import breeze.stats.distributions._

scala> val poi = Poisson(3.0)
poi: breeze.stats.distributions.Poisson = Poisson(3.0)

scala> poi.draw
res19: Int = 2

scala> poi.draw
res20: Int = 3

If more than a single draw is required, an iid sample can be obtained.

scala> val x = poi.sample(10)
x: IndexedSeq[Int] = Vector(2, 3, 3, 4, 2, 2, 1, 2, 4, 2)

scala> x
res21: IndexedSeq[Int] = Vector(2, 3, 3, 4, 2, 2, 1, 2, 4, 2)

scala> x.sum
res22: Int = 25

scala> x.length
res23: Int = 10

scala> x.sum.toDouble/x.length
res24: Double = 2.5

Note that this Vector is mutable. The probability mass function (PMF) of the Poisson distribution is also available.

scala> poi.probabilityOf(2)
res25: Double = 0.22404180765538775

scala> x map {x => poi.probabilityOf(x)}
res26: IndexedSeq[Double] = Vector(0.22404180765538775, 0.22404180765538775, 0.22404180765538775, 0.16803135574154085, 0.22404180765538775, 0.22404180765538775, 0.14936120510359185, 0.22404180765538775, 0.16803135574154085, 0.22404180765538775)

scala> x map {poi.probabilityOf(_)}
res27: IndexedSeq[Double] = Vector(0.22404180765538775, 0.22404180765538775, 0.22404180765538775, 0.16803135574154085, 0.22404180765538775, 0.22404180765538775, 0.14936120510359185, 0.22404180765538775, 0.16803135574154085, 0.22404180765538775)

Obviously, Gaussian variables (and Gamma, and several others) are supported in a similar way.

scala> val gau=Gaussian(0.0,1.0)
gau: breeze.stats.distributions.Gaussian = Gaussian(0.0, 1.0)

scala> gau.draw
res28: Double = 1.606121255846881

scala> gau.draw
res29: Double = -0.1747896055492152

scala> val y=gau.sample(20)
y: IndexedSeq[Double] = Vector(-1.3758577012869702, -1.2148314970824652, -0.022501190144116855, 0.3244006323566883, 0.35978577573558407, 0.9651857500320781, -0.40834034207848985, 0.11583348205331555, -0.8797699986810634, -0.33609738668214695, 0.7043252811790879, -1.2045594639823656, 0.19442688045065826, -0.31442160076087067, 0.06313451540562891, -1.5304745838587115, -1.2372764884467027, 0.5875490994217284, -0.9385520597707431, -0.6647903243363228)

scala> y
res30: IndexedSeq[Double] = Vector(-1.3758577012869702, -1.2148314970824652, -0.022501190144116855, 0.3244006323566883, 0.35978577573558407, 0.9651857500320781, -0.40834034207848985, 0.11583348205331555, -0.8797699986810634, -0.33609738668214695, 0.7043252811790879, -1.2045594639823656, 0.19442688045065826, -0.31442160076087067, 0.06313451540562891, -1.5304745838587115, -1.2372764884467027, 0.5875490994217284, -0.9385520597707431, -0.6647903243363228)

scala> y.sum/y.length
res31: Double = -0.34064156102380994

scala> y map {gau.logPdf(_)}
res32: IndexedSeq[Double] = Vector(-1.8654307403000054, -1.6568463163564844, -0.9191916849836235, -0.9715564183413823, -0.9836614354155007, -1.3847302992371653, -1.0023094506890617, -0.9256472309869705, -1.3059361584943119, -0.975419259871957, -1.1669755840586733, -1.6444202843394145, -0.93783943912556, -0.9683690047171869, -0.9209315167224245, -2.090114759123421, -1.6843650876361744, -1.0915455053203147, -1.359378517654625, -1.1399116208702693)

scala> Gamma(2.0,3.0).sample(5)
res33: IndexedSeq[Double] = Vector(2.38436441278546, 2.125017198373521, 2.333118708811143, 5.880076392566909, 2.0901427084667503)

This is all good stuff for those of us who like to do Markov chain Monte Carlo. There are not masses of statistical data analysis routines built into Breeze, but a few basic tools are provided, including some basic summary statistics.

scala> import breeze.stats.DescriptiveStats._
import breeze.stats.DescriptiveStats._

scala> mean(y)
res34: Double = -0.34064156102380994

scala> variance(y)
res35: Double = 0.574257149387757

scala> meanAndVariance(y)
res36: (Double, Double) = (-0.34064156102380994,0.574257149387757)

Support for linear algebra is an important part of any scientific library. Here the Breeze developers have made the wise decision to provide a nice Scala interface to netlib-java. This in turn calls out to any native optimised BLAS or LAPACK libraries installed on the system, but will fall back to Java code if no optimised libraries are available. This means that linear algebra code using Scala and Breeze should run as fast as code written in any other language, including C, C++ and Fortran, provided that optimised libraries are installed on the system. For further details see the Breeze linear algebra guide. Let’s start by creating and messing with a dense vector.

scala> import breeze.linalg._
import breeze.linalg._

scala> val v=DenseVector(y.toArray)
v: breeze.linalg.DenseVector[Double] = DenseVector(-1.3758577012869702, -1.2148314970824652, -0.022501190144116855, 0.3244006323566883, 0.35978577573558407, 0.9651857500320781, -0.40834034207848985, 0.11583348205331555, -0.8797699986810634, -0.33609738668214695, 0.7043252811790879, -1.2045594639823656, 0.19442688045065826, -0.31442160076087067, 0.06313451540562891, -1.5304745838587115, -1.2372764884467027, 0.5875490994217284, -0.9385520597707431, -0.6647903243363228)

scala> v(1) = 0

scala> v
res38: breeze.linalg.DenseVector[Double] = DenseVector(-1.3758577012869702, 0.0, -0.022501190144116855, 0.3244006323566883, 0.35978577573558407, 0.9651857500320781, -0.40834034207848985, 0.11583348205331555, -0.8797699986810634, -0.33609738668214695, 0.7043252811790879, -1.2045594639823656, 0.19442688045065826, -0.31442160076087067, 0.06313451540562891, -1.5304745838587115, -1.2372764884467027, 0.5875490994217284, -0.9385520597707431, -0.6647903243363228)

scala> v(1 to 3) := 1.0
res39: breeze.linalg.DenseVector[Double] = DenseVector(1.0, 1.0, 1.0)

scala> v
res40: breeze.linalg.DenseVector[Double] = DenseVector(-1.3758577012869702, 1.0, 1.0, 1.0, 0.35978577573558407, 0.9651857500320781, -0.40834034207848985, 0.11583348205331555, -0.8797699986810634, -0.33609738668214695, 0.7043252811790879, -1.2045594639823656, 0.19442688045065826, -0.31442160076087067, 0.06313451540562891, -1.5304745838587115, -1.2372764884467027, 0.5875490994217284, -0.9385520597707431, -0.6647903243363228)

scala> v(1 to 3) := DenseVector(1.0,1.5,2.0)
res41: breeze.linalg.DenseVector[Double] = DenseVector(1.0, 1.5, 2.0)

scala> v
res42: breeze.linalg.DenseVector[Double] = DenseVector(-1.3758577012869702, 1.0, 1.5, 2.0, 0.35978577573558407, 0.9651857500320781, -0.40834034207848985, 0.11583348205331555, -0.8797699986810634, -0.33609738668214695, 0.7043252811790879, -1.2045594639823656, 0.19442688045065826, -0.31442160076087067, 0.06313451540562891, -1.5304745838587115, -1.2372764884467027, 0.5875490994217284, -0.9385520597707431, -0.6647903243363228)

scala> v :> 0.0
res43: breeze.linalg.BitVector = BitVector(1, 2, 3, 4, 5, 7, 10, 12, 14, 17)

scala> (v :> 0.0).toArray
res44: Array[Boolean] = Array(false, true, true, true, true, true, false, true, false, false, true, false, true, false, true, false, false, true, false, false)

Next let’s create and mess around with some dense matrices.

scala> val m = new DenseMatrix(5,4,linspace(1.0,20.0,20).toArray)
m: breeze.linalg.DenseMatrix[Double] = 
1.0  6.0   11.0  16.0  
2.0  7.0   12.0  17.0  
3.0  8.0   13.0  18.0  
4.0  9.0   14.0  19.0  
5.0  10.0  15.0  20.0  

scala> m
res45: breeze.linalg.DenseMatrix[Double] = 
1.0  6.0   11.0  16.0  
2.0  7.0   12.0  17.0  
3.0  8.0   13.0  18.0  
4.0  9.0   14.0  19.0  
5.0  10.0  15.0  20.0  

scala> m.rows
res46: Int = 5

scala> m.cols
res47: Int = 4

scala> m(::,1)
res48: breeze.linalg.DenseVector[Double] = DenseVector(6.0, 7.0, 8.0, 9.0, 10.0)

scala> m(1,::)
res49: breeze.linalg.DenseMatrix[Double] = 2.0  7.0  12.0  17.0  

scala> m(1,::) := linspace(1.0,2.0,4)
res50: breeze.linalg.DenseMatrix[Double] = 1.0  1.3333333333333333  1.6666666666666665  2.0  

scala> m
res51: breeze.linalg.DenseMatrix[Double] = 
1.0  6.0                 11.0                16.0  
1.0  1.3333333333333333  1.6666666666666665  2.0   
3.0  8.0                 13.0                18.0  
4.0  9.0                 14.0                19.0  
5.0  10.0                15.0                20.0  


scala> val n = m.t
n: breeze.linalg.DenseMatrix[Double] = 
1.0   1.0                 3.0   4.0   5.0   
6.0   1.3333333333333333  8.0   9.0   10.0  
11.0  1.6666666666666665  13.0  14.0  15.0  
16.0  2.0                 18.0  19.0  20.0  

scala> n
res52: breeze.linalg.DenseMatrix[Double] = 
1.0   1.0                 3.0   4.0   5.0   
6.0   1.3333333333333333  8.0   9.0   10.0  
11.0  1.6666666666666665  13.0  14.0  15.0  
16.0  2.0                 18.0  19.0  20.0  

scala> val o = m*n
o: breeze.linalg.DenseMatrix[Double] = 
414.0              59.33333333333333  482.0              516.0              550.0              
59.33333333333333  9.555555555555555  71.33333333333333  77.33333333333333  83.33333333333333  
482.0              71.33333333333333  566.0              608.0              650.0              
516.0              77.33333333333333  608.0              654.0              700.0              
550.0              83.33333333333333  650.0              700.0              750.0              

scala> o
res53: breeze.linalg.DenseMatrix[Double] = 
414.0              59.33333333333333  482.0              516.0              550.0              
59.33333333333333  9.555555555555555  71.33333333333333  77.33333333333333  83.33333333333333  
482.0              71.33333333333333  566.0              608.0              650.0              
516.0              77.33333333333333  608.0              654.0              700.0              
550.0              83.33333333333333  650.0              700.0              750.0              

scala> val p = n*m
p: breeze.linalg.DenseMatrix[Double] = 
52.0                117.33333333333333  182.66666666666666  248.0              
117.33333333333333  282.77777777777777  448.22222222222223  613.6666666666667  
182.66666666666666  448.22222222222223  713.7777777777778   979.3333333333334  
248.0               613.6666666666667   979.3333333333334   1345.0             

scala> p
res54: breeze.linalg.DenseMatrix[Double] = 
52.0                117.33333333333333  182.66666666666666  248.0              
117.33333333333333  282.77777777777777  448.22222222222223  613.6666666666667  
182.66666666666666  448.22222222222223  713.7777777777778   979.3333333333334  
248.0               613.6666666666667   979.3333333333334   1345.0             

So, messing around with vectors and matrices is more-or-less as convenient as in well-known dynamic and math languages. To conclude this section, let us see how to simulate some data from a regression model and then solve the least squares problem to obtain the estimated regression coefficients. We will simulate 1,000 observations from a model with 5 covariates.

scala> val X = new DenseMatrix(1000,5,gau.sample(5000).toArray)
X: breeze.linalg.DenseMatrix[Double] = 
-0.40186606934180685  0.9847148198711287    ... (5 total)
-0.4760404521336951   -0.833737041320742    ...
-0.3315199616926892   -0.19460446824586297  ...
-0.14764615494496836  -0.17947658245206904  ...
-0.8357372755800905   -2.456222113596015    ...
-0.44458309216683184  1.848007773944826     ...
0.060314034896221065  0.5254462055311016    ...
0.8637867740789016    -0.9712570453363925   ...
0.11620167261655819   -1.2231380938032232   ...
-0.3335514290842617   -0.7487303696662753   ...
-0.5598937433421866   0.11083382409013512   ...
-1.7213395389510568   1.1717491221846357    ...
-1.078873342208984    0.9386859686451607    ...
-0.7793854546738327   -0.9829373863442161   ...
-1.054275201631216    0.10100826507456745   ...
-0.6947188686537832   1.215...
scala> val b0 = linspace(1.0,2.0,5)
b0: breeze.linalg.DenseVector[Double] = DenseVector(1.0, 1.25, 1.5, 1.75, 2.0)

scala> val y0 = X * b0
y0: breeze.linalg.DenseVector[Double] = DenseVector(0.08200546839589107, -0.5992571365601228, -5.646398002309553, -7.346136663325798, -8.486423788193362, 1.451119214541837, -0.25792385841948406, 2.324936340609002, -1.2285599639827862, -4.030261316643863, -4.1732627416377674, -0.5077151099958077, -0.2087263741903591, 0.46678616461409383, 2.0244342278575975, 1.775756468177401, -4.799821190728213, -1.8518388060564481, 1.5892306875621767, -1.6528539564387008, 1.4064864330994125, -0.8734630221484178, -7.75470002781836, -0.2893619536998493, -5.972958583649336, -4.952666733286302, 0.5431255990489059, -2.477076684976403, -0.6473617571867107, -0.509338416957489, -1.5415350935719594, -0.47068802465681125, 2.546118380362026, -7.940401988804477, -1.037049442788122, -1.564016663370888, -3.3147087994...
scala> val y = y0 + DenseVector(gau.sample(1000).toArray)
y: breeze.linalg.DenseVector[Double] = DenseVector(-0.572127338358624, -0.16481167194161406, -4.213873268823003, -10.142015065601388, -7.893898543052863, 1.7881055848475076, -0.26987820512025357, 3.3289433195054148, -2.514141419925489, -4.643625974157769, -3.8061000214061886, 0.6462624993109218, 0.23603338389134149, 1.0211137806779267, 2.0061727641393317, 0.022624943149799348, -5.429601401989341, -1.836181225242386, 1.0265599173053048, -0.1673732536615371, 0.8418249443853956, -1.1547110533101967, -8.392100167478764, -1.1586377992526877, -6.400362975646245, -5.487018086963841, 0.3038055584347069, -1.2247410435868684, -0.06476921390724344, -1.5039074374120407, -1.0189111630970076, 1.307339668865724, 2.048320821568789, -8.769328824477714, -0.9104251029228555, -1.3533910178496698, -2.178788...
scala> val b = X \ y  // defaults to a QR-solve of the least squares problem
b: breeze.linalg.DenseVector[Double] = DenseVector(0.9952708232116663, 1.2344546192238952, 1.5543512339052412, 1.744091673457169, 1.9874158953720507)

So all of the most important building blocks for statistical computing are included in the Breeze library.

At this point it is really worth reminding yourself that Scala is actually a statically typed language, despite the fact that in this session we have not explicitly declared the type of anything at all! This is because Scala has type inference, which makes type declarations optional when it is straightforward for the compiler to figure out what the types must be. For example, for our very first expression, val a = 5, because the RHS is an Int, it is clear that the LHS must also be an Int, and so the compiler infers that the type of a must be an Int, and treats the code as if the type had been declared as val a: Int = 5. This type inference makes Scala feel very much like a dynamic language in general use. Typically, we carefully specify the types of function arguments (and often the return type of the function, too), but then for the main body of each function, just let the compiler figure out all of the types and write code as if the language were dynamic. To me, this seems like the best of all worlds. The convenience of dynamic languages with the safety of static typing.

Declaring the types of function arguments is not usually a big deal, as the following simple example demonstrates.

scala> def mean(arr: Array[Int]): Double = {
     |   arr.sum.toDouble/arr.length
     | }
mean: (arr: Array[Int])Double

scala> mean(Array(3,1,4,5))
res55: Double = 3.25

A complete Scala program

For completeness, I will finish this post with a very simple but complete Scala/Breeze program. In a previous post I discussed a simple Gibbs sampler in Scala, but in that post I used the Java COLT library for random number generation. Below is a version using Breeze instead.

object BreezeGibbs {

  import breeze.stats.distributions._
  import scala.math.sqrt

  class State(val x: Double, val y: Double)

  def nextIter(s: State): State = {
    val newX = Gamma(3.0, 1.0 / ((s.y) * (s.y) + 4.0)).draw()
    new State(newX, Gaussian(1.0 / (newX + 1), 1.0 / sqrt(2 * newX + 2)).draw())

  def nextThinnedIter(s: State, left: Int): State = {
    if (left == 0) s
    else nextThinnedIter(nextIter(s), left - 1)

  def genIters(s: State, current: Int, stop: Int, thin: Int): State = {
    if (!(current > stop)) {
      println(current + " " + s.x + " " + s.y)
      genIters(nextThinnedIter(s, thin), current + 1, stop, thin)
    } else s

  def main(args: Array[String]) {
    println("Iter x y")
    genIters(new State(0.0, 0.0), 1, 50000, 1000)



In this post I’ve tried to give a quick taste of the Scala language and the Breeze library for those used to dynamic languages for statistical computing. Hopefully I’ve illustrated that the basics don’t look too different, so there is no reason to fear Scala. It is perfectly possible to start using Scala as a better and faster Python or R. Once you’ve mastered the basics, you can then start exploring the full power of the language. There’s loads of introductory Scala material to be found on-line. It probably makes sense to start with the links I’ve highlighted above. After that, just start searching – there’s an interesting set of tutorials I noticed just the other day. A very time-efficient way to learn Scala quickly is to do the FP with Scala course on Coursera, but whether this makes sense will depend on when it is next running. For those who prefer real books, the book Programming in Scala is the standard reference, and I’ve also found Functional programming in Scala to be useful (free text of the first edition of the former and a draft of the latter can be found on-line).

REPL Script

Below is a copy of the complete REPL script, for reference.

// start with non-Breeze stuff

val a = 5
a = 6

var b = 7
b = 8

val c = List(3,4,5,6)
c.foldLeft(0)((x,y) => x+y)

val d = Vector(2,3,4,5,6,7,8,9)
val e = d.updated(3,0)

val f=(1 to 10).toList
f => x*x)
f map {x => x*x}
f filter {_ > 4}

// introduce breeze through random distributions

import breeze.stats.distributions._
val poi = Poisson(3.0)
val x = poi.sample(10)
x map {x => poi.probabilityOf(x)}
x map {poi.probabilityOf(_)}

val gau=Gaussian(0.0,1.0)
val y=gau.sample(20)
y map {gau.logPdf(_)}


import breeze.stats.DescriptiveStats._

// move on to linear algebra

import breeze.linalg._
val v=DenseVector(y.toArray)
v(1) = 0
v(1 to 3) := 1.0
v(1 to 3) := DenseVector(1.0,1.5,2.0)
v :> 0.0
(v :> 0.0).toArray

val m = new DenseMatrix(5,4,linspace(1.0,20.0,20).toArray)
m(1,::) := linspace(1.0,2.0,4)

val n = m.t
val o = m*n
val p = n*m

// regression and QR solution

val X = new DenseMatrix(1000,5,gau.sample(5000).toArray)
val b0 = linspace(1.0,2.0,5)
val y0 = X * b0
val y = y0 + DenseVector(gau.sample(1000).toArray)
val b = X \ y  // defaults to a QR-solve of the least squares problem

// a simple function example

def mean(arr: Array[Int]): Double = {


Scala as a platform for statistical computing and data science

There has been a lot of discussion on-line recently about languages for data analysis, statistical computing, and data science more generally. I don’t really want to go into the detail of why I believe that all of the common choices are fundamentally and unfixably flawed – language wars are so unseemly. Instead I want to explain why I’ve been using the Scala programming language recently and why, despite being far from perfect, I personally consider it to be a good language to form a platform for efficient and scalable statistical computing. Obviously, language choice is to some extent a personal preference, implicitly taking into account subjective trade-offs between features different individuals consider to be important. So I’ll start by listing some language/library/ecosystem features that I think are important, and then explain why.

A feature wish list

It should:

  • be a general purpose language with a sizable user community and an array of general purpose libraries, including good GUI libraries, networking and web frameworks
  • be free, open-source and platform independent
  • be fast and efficient
  • have a good, well-designed library for scientific computing, including non-uniform random number generation and linear algebra
  • have a strong type system, and be statically typed with good compile-time type checking and type safety
  • have reasonable type inference
  • have a REPL for interactive use
  • have good tool support (including build tools, doc tools, testing tools, and an intelligent IDE)
  • have excellent support for functional programming, including support for immutability and immutable data structures and “monadic” design
  • allow imperative programming for those (rare) occasions where it makes sense
  • be designed with concurrency and parallelism in mind, having excellent language and library support for building really scalable concurrent and parallel applications

The not-very-surprising punch-line is that Scala ticks all of those boxes and that I don’t know of any other languages that do. But before expanding on the above, it is worth noting a couple of (perhaps surprising) omissions. For example:

  • have excellent data viz capability built-in
  • have vast numbers of statistical routines in the standard library

The above are points (and there are other similar points) where other languages (for example, R), currently score better than Scala. It is not that these things are not important – indeed, they are highly desirable. But I consider them to be of lesser importance as they are much easier to fix, given a suitable platform, than fixing an unsuitable language and platform. Visualisation is not trivial, but it is not fantastically difficult in a language with excellent GUI libraries. Similarly, most statistical routines are quite straightforward to implement for anyone with reasonable expertise in scientific and statistical computing and numerical linear algebra. These are things that are relatively easy for a community to contribute to. Building a great programming language, on the other hand, is really, really, difficult.

I will now expand briefly on each point in turn.

be a general purpose language with a sizable user community and an array of general purpose libraries, including good GUI libraries, networking and web frameworks

History has demonstrated, time and time again, that domain specific languages (DSLs) are synonymous with idiosyncratic, inconsistent languages that are terrible for anything other than what they were specifically designed for. They can often be great for precisely the thing that they were designed for, but people always want to do other things, and that is when the problems start. For the avoidance of controversy I won’t go into details, but the whole Python versus R thing is a perfect illustration of this general versus specific trade-off. Similarly, although there has been some buzz around another new language recently, which is faster than R and Python, my feeling is that the last thing the world needs right now is Just Unother Language for Indexed Arrays…

In this day-and-age it is vital that statistical code can use a variety of libraries, and communicate with well-designed network libraries and web frameworks, as statistical analysis does not exist in a vacuum. Scala certainly fits the bill here, being used in a large number of important high-profile systems, ensuring a lively, well-motivated ecosystem. There are numerous well-maintained libraries for almost any task. Picking on web frameworks, for example, there are a number of excellent libraries, including Lift and Play. Scala also has the advantage of offering seamless Java integration, for those (increasingly rare) occasions when a native Scala library for the task at hand doesn’t exist.

be free, open-source and platform independent

This hardly needs expanding upon, other than to observe that there are a few well-known commercial software solutions for scientific, statistical and mathematical computing. There are all kinds of problems with using closed proprietary systems, including transparency and reproducibility, but also platform and scalability problems. eg. running code requiring a license server in the cloud. The academic statistical community has largely moved away from commercial software, and I don’t think there is any going back. Scala is open source and runs on the JVM, which is about as platform independent as it is possible to get.

be fast and efficient

Speed and efficiency continue to be important, despite increasing processor speeds. Computationally intensive algorithms are being pushed to ever larger and more complex models and data sets. Compute cycles and memory efficiency really matter, and can’t be ignored. This doesn’t mean that we all have to code in C/C++/Fortran, but we can’t afford to code in languages which are orders of magnitude slower. This will always be a problem. Scala code generally runs well within a factor of 2 of comparable native code – see my Gibbs sampler post for a simple example including timings.

have a good, well-designed library for scientific computing, including non-uniform random number generation and linear algebra

I hesitated about including this in my list of essentials, because it is certainly something that can, in principle, be added to a language at a later date. However, building such libraries is far from trivial, and they need to be well-designed, comprehensive and efficient. For Scala, Breeze is rapidly becoming the standard scientific library, including special functions, non-uniform random number generation and numerical linear algebra. For a data library, there is Saddle, and for a scalable analytics library there is Spark. These libraries certainly don’t cover everything that can be found in R/CRAN, but they provide a fairly solid foundation on which to build.

have a strong type system, and be statically typed with good compile-time type checking and type safety

I love dynamic languages – they are fun and exciting. It is fun to quickly throw together a few functions in a scripting language without worrying about declaring the types of anything. And it is exciting to see the myriad of strange and unanticipated ways your code can crash-and-burn at runtime! 😉 But this excitement soon wears off, and you end up adding lots of boilerplate argument checking code that would not only be much cleaner and simpler in a statically typed language, but would be checked at compile-time, making the static code faster and more efficient. For messing about prototyping, dynamic languages are attractive, but as a solid platform for statistical computing, they really don’t make sense. Scala has a strong type system offering a high degree of compile-time checking, making it a safe and efficient language.

have reasonable type inference

A common issue with statically typed languages is that they lead to verbose code containing many redundant type declarations that the compiler ought to be able to check. This doesn’t just mean more typing – it leads to verbose code that can hide the program logic. Languages with type inference offer the best of both worlds – the safety of static typing without the verbosity. Scala does a satisfactory job here.

have a REPL for interactive use

One thing that dynamic languages have taught us is that it is actually incredibly useful to have a REPL for interactive analysis. This is true generally, but especially so for statistical computing, where human intervention is often desirable. Again, Scala has a nice REPL.

have good tool support (including build tools, doc tools, testing tools, and an intelligent IDE)

Tools matter. Scala has an excellent build tool in the SBT. It has code documentation in the form of scaladoc (similar to javadoc). It has a unit testing framework, and a reasonably intelligent IDE in the form of the Scala IDE (based on Eclipse).

have excellent support for functional programming, including support for immutability and immutable data structures and “monadic” design

I, like many others, am gradually coming to realise that functional programming offers many advantages over other programming styles. In particular, it provides best route to building scalable software, in terms of both program complexity and data size/complexity. Scala has good support for functional programming, including immutable named values, immutable data structures and for-comprehensions. And if off-the-shelf Scala isn’t sufficiently functional already, libraries such as scalaz make it even more so.

allow imperative programming for those (rare) occasions where it makes sense

Although most algorithms in scientific computing are typically conceived of and implemented in an imperative style, I’m increasingly convinced that most can be recast in a pure functional way without significant loss of efficiency, and with significant benefits. That said, there really are some problems that are more efficient to implement in an imperative framework. It is therefore important that the language is not so “pure” functional that this is forbidden. Again, Scala fits the bill.

be designed with concurrency and parallelism in mind, having excellent language and library support for building really scalable concurrent and parallel applications

These days scalability typically means exploiting concurrency and parallelism. In an imperative world this is hard, and libraries such as MPI prove that it is difficult to bolt parallelism on top of a language post-hoc. Check-points, communication overhead, deadlocks and race conditions make it very difficult to build codes that scale well to more than a few processors. Concurrency is more straightforward in functional languages, and this is one of the reasons for the recent resurgence of functional languages and programming. Scala has good concurrency support built-in, and libraries such as Akka make it relatively easy to build truly scalable software.


The Scala programming language ticks many boxes when it comes to forming a nice solid foundation for building a platform for efficient scalable statistical computing. Although I still use R and Python almost every day, I’m increasingly using Scala for serious algorithm development. In the short term I can interface to my Scala code from R using jvmr, but in the longer term I hope that Scala will become a complete framework for statistics and data science. In a subsequent post I will attempt to give a very brief introduction to Scala and the Breeze numerical library.